From root@osprey.grizzly.com  Sat Jan  4 08:44:52 1997
Received: from scruz.net (nic.scruz.net [165.227.1.2])
          by freefall.freebsd.org (8.8.4/8.8.4) with ESMTP id IAA01315
          for <FreeBSD-gnats-submit@freebsd.org>; Sat, 4 Jan 1997 08:44:50 -0800 (PST)
Received: from osprey.grizzly.com by scruz.net (8.7.3/1.34)
	id IAA05485; Sat, 4 Jan 1997 08:44:45 -0800 (PST)
Received: (from root@localhost) by osprey.grizzly.com (8.8.4/8.7.3) id IAA00351; Sat, 4 Jan 1997 08:45:17 -0800 (PST)
Message-Id: <199701041645.IAA00351@osprey.grizzly.com>
Date: Sat, 4 Jan 1997 08:45:17 -0800 (PST)
From: markd@grizzly.com
Reply-To: markd@grizzly.com
To: FreeBSD-gnats-submit@freebsd.org
Subject: SCSI disk corruption
X-Send-Pr-Version: 3.2

>Number:         2371
>Category:       kern
>Synopsis:       SCSI disk corruption
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    gibbs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Jan  4 08:50:01 PST 1997
>Closed-Date:    Sun Mar 22 09:03:01 PST 1998
>Last-Modified:  Sun Mar 22 09:04:47 PST 1998
>Originator:     Mark Diekhans
>Release:        FreeBSD 2.2-BETA_A i386
>Organization:
>Environment:

    FreeBSD 2.2-BETA_A #0: Sun Dec 29 20:13:01 PST 1996
        markd@osprey.grizzly.com:/usr/src/sys/compile/OSPREY
    Calibrating clock(s) relative to mc146818A clock ... i586 clock: 166188714 Hz, i8254 clock: 1193152 Hz
    CPU: Pentium (166.19-MHz 586-class CPU)
      Origin = "GenuineIntel"  Id = 0x52c  Stepping=12
      Features=0x1bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8>
    real memory  = 67108864 (65536K bytes)
    avail memory = 64073728 (62572K bytes)
    Probing for devices on PCI bus 0:
    chip0 <Intel 82439> rev 1 on pci0:0
    chip1 <Intel 82371SB PCI-ISA bridge> rev 0 on pci0:7:0
    chip2 <Intel 82371SB IDE interface> rev 0 on pci0:7:1
    vga0 <VGA-compatible display device> rev 1 int a irq 10 on pci0:9
    ahc0 <Adaptec 2940 SCSI host adapter> rev 0 int a irq 11 on pci0:12
    ahc0: aic7870 Single Channel, SCSI Id=7, 16/255 SCBs
    ahc0 waiting for scsi devices to settle
    ahc0: target 0 Tagged Queuing Device
    (ahc0:0:0): "HP C3725S 6039" type 0 fixed SCSI 2
    sd0(ahc0:0:0): Direct-Access 2047MB (4194058 512 byte sectors)
    ahc0: target 1 Tagged Queuing Device
    (ahc0:1:0): "SEAGATE ST32550N 0015" type 0 fixed SCSI 2
    sd1(ahc0:1:0): Direct-Access 2047MB (4194058 512 byte sectors)
    ahc0:A:5: refuses syncronous negotiation. Using asyncronous transfers
    (ahc0:5:0): "IOMEGA ZIP 100 D.09" type 0 removable SCSI 2
    sd2(ahc0:5:0): Direct-Access 
    sd2(ahc0:5:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB
    sd2 could not mode sense (4). Using ficticious geometry

    sd2(ahc0:5:0): NOT READY asc:3a,0 Medium not present
    sd2: could not get size
    0MB (0 512 byte sectors)
    Probing for devices on the ISA bus:
    sc0 at 0x60-0x6f irq 1 on motherboard
    sc0: VGA color <4 virtual consoles, flags=0x0>
    sio0 at 0x3f8-0x3ff irq 4 on isa
    sio0: type 16550A
    sio1 at 0x2f8-0x2ff irq 3 on isa
    sio1: type 16550A
    sio2 not found at 0x3e8
    lpt0 at 0x378-0x37f irq 7 on isa
    lpt0: Interrupt-driven port
    lp0: TCP/IP capable interface
    psm0 at 0x60-0x64 irq 12 on motherboard
    psm0: device ID 0, 2 buttons?
    fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
    fdc0: NEC 72065B
    fd0: 1.44MB 3.5in
    npx0 on motherboard
    npx0: INT 16 interface

AHC options:
    controller	ahc0
    options	AHC_TAGENABLE
    options	AHC_SCBPAGING_ENABLE
    options 	AHC_ALLOW_MEMIO

>Description:
    As with many reports of disk corruption, this might not be very useful
due to lack of solid data...

    System was discovered crashed, on reboot, the second scsi disk was corrupt.
Fsck reported a large number of errors and it was eventually run with -y.
When remounted, all files where in lost+found.

    The last major activity before the crash was formatting of sd2, an IOMega
Zip drive.  SCSI format, partition and newfs had generated the following
message:

    Jan  3 22:47:13 osprey /kernel: sd2(ahc0:5:0): UNIT ATTENTION asc:28,0
    Jan  3 22:47:13 osprey /kernel: sd2(ahc0:5:0):  Not ready to ready transition, medium may have changed
    Jan  3 22:47:13 osprey /kernel: Debugger("sdopen: no slices") called.
    Jan  3 22:47:13 osprey /kernel: sd2(ahc0:5:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB
    Jan  3 22:47:13 osprey /kernel: sd2 could not mode sense (4). Using ficticious geometry
    Jan  3 22:47:13 osprey /kernel: sd2: invalid primary partition table: no magic

The system was alive for two more hours, but only doing routine tasks.

The day befofe the following errors where reported in the log:

    Jan  2 02:34:51 osprey /kernel: st0(ahc0:2:0): timed out in command phase, SCSISIGI == 0x84
    Jan  2 02:34:51 osprey /kernel: SEQADDR == 0x5a
    Jan  2 02:34:51 osprey /kernel: st0(ahc0:2:0): abort message in message buffer
    Jan  2 02:34:53 osprey /kernel: st0(ahc0:2:0): timed out in dataout phase, SCSISIGI == 0x0
    Jan  2 02:34:53 osprey /kernel: SEQADDR == 0xe
    Jan  2 02:34:53 osprey /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
    Jan  2 02:34:53 osprey /kernel: st0(ahc0:2:0): Target Busy
    Jan  2 02:34:53 osprey /kernel: st0(ahc0:2:0): Target Busy
    Jan  2 02:34:53 osprey /kernel: sd0(ahc0:0:0): UNIT ATTENTION info?:4020040 asc:29,0
    Jan  2 02:34:53 osprey /kernel: sd0(ahc0:0:0):  Power on, reset, or bus device reset occurred
    Jan  2 02:34:53 osprey /kernel: , retries:4
    Jan  2 02:34:53 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:34:53 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:34:53 osprey /kernel: , retries:3
    Jan  2 02:34:53 osprey /kernel: sd0(ahc0:0:0): ABORTED COMMAND info?:4020040 asc:4e,0 Overlapped commands attempted
    Jan  2 02:34:53 osprey /kernel: , retries:2
    Jan  2 02:34:53 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:34:53 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:34:53 osprey /kernel: , retries:1
    Jan  2 02:35:03 osprey /kernel: sd0(ahc0:0:0): timed out in dataout phase, SCSISIGI == 0x0
    Jan  2 02:35:03 osprey /kernel: SEQADDR == 0xb
    Jan  2 02:35:17 osprey /kernel: SEQADDR == 0xf
    Jan  2 02:35:17 osprey /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
    Jan  2 02:35:17 osprey /kernel: spec_getpages: I/O read error
    Jan  2 02:35:17 osprey /kernel: vm_fault: pager input (probably hardware) error, PID 4026 failure
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:35:17 osprey /kernel: , retries:4
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:35:17 osprey /kernel: , retries:3
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:35:17 osprey /kernel: , retries:2
    Jan  2 02:35:17 osprey /kernel: pid 4026 (bash), uid 0: exited on signal 11 (core dumped)
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0): timed out in dataout phase, SCSISIGI == 0x0
    Jan  2 02:35:17 osprey /kernel: SEQADDR == 0xf
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0): timed out in dataout phase, SCSISIGI == 0x0
    Jan  2 02:35:17 osprey /kernel: SEQADDR == 0xe
    Jan  2 02:35:17 osprey /kernel: ahc0: Issued Channel A Bus Reset. 1 SCBs aborted
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0): UNIT ATTENTION info?:4020040 asc:29,0
    Jan  2 02:35:17 osprey /kernel: sd0(ahc0:0:0):  Power on, reset, or bus device reset occurred
    Jan  2 02:35:17 osprey /kernel: , retries:3
    Jan  2 02:35:18 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:35:18 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:35:18 osprey /kernel: , retries:2
    Jan  2 02:35:18 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:35:18 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:35:18 osprey /kernel: , retries:1
    Jan  2 02:35:18 osprey /kernel: sd0(ahc0:0:0): NOT READY info?:4020040 asc:4,1
    Jan  2 02:35:18 osprey /kernel: sd0(ahc0:0:0):  Logical unit is in process of becoming ready
    Jan  2 02:35:18 osprey /kernel: , FAILURE
    Jan  2 02:36:42 osprey /kernel: sd1(ahc0:1:0): UNIT ATTENTION asc:29,0
    Jan  2 02:36:42 osprey /kernel: sd1(ahc0:1:0):  Power on, reset, or bus device reset occurred field replaceable unit: 1
    Jan  2 02:36:42 osprey /kernel: , retries:4

Note: AHC_TAGENABLE and AHC_SCBPAGING_ENABLE and AHC_ALLOW_MEMIO where enabled.

Since I gave up driver writing before SCSI was invented, I can't very useful
in the analysis.

>How-To-Repeat:

>Fix:
>Release-Note:
>Audit-Trail:

From: j@uriah.heep.sax.de (J Wunsch)
To: markd@grizzly.com
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: kern/2371: SCSI disk corruption
Date: Sat, 4 Jan 1997 23:43:21 +0100

 As markd@grizzly.com wrote:
 
 > AHC options:
 >     controller	ahc0
 >     options	AHC_TAGENABLE
 >     options	AHC_SCBPAGING_ENABLE
                 ^^^^^^^^^^^^^^^^^^^^ Turn this off, and see if your problem
                                      persists.  I think SCB paging is not
                                      yet ready for prime-time.
 >     options 	AHC_ALLOW_MEMIO
 
 I think problems with AHC_ALLOW_MEMIO are rather `binary' in nature:
 it either works well, or starts to go afoul very early in the game.
 Problems with tagged commands usually end up in ``Overlapped commands
 attempted'' or something like this.
 
 Eeek, no:
 
 >     ahc0: target 0 Tagged Queuing Device
 >     (ahc0:0:0): "HP C3725S 6039" type 0 fixed SCSI 2
 >     sd0(ahc0:0:0): Direct-Access 2047MB (4194058 512 byte sectors)
 
 This one is known to choke on tagged commands!
 
 >     The last major activity before the crash was formatting of sd2, an IOMega
 > Zip drive.  SCSI format, partition and newfs had generated the following
 > message:
 > 
 >     Jan  3 22:47:13 osprey /kernel: sd2(ahc0:5:0): UNIT ATTENTION asc:28,0
 >     Jan  3 22:47:13 osprey /kernel: sd2(ahc0:5:0):  Not ready to ready transition, medium may have changed
 >     Jan  3 22:47:13 osprey /kernel: Debugger("sdopen: no slices") called.
 >     Jan  3 22:47:13 osprey /kernel: sd2(ahc0:5:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB
 >     Jan  3 22:47:13 osprey /kernel: sd2 could not mode sense (4). Using ficticious geometry
 >     Jan  3 22:47:13 osprey /kernel: sd2: invalid primary partition table: no magic
 
 That's something different, and basically a known problem.  You might
 get around this by creating a `quirk' record for this device in
 /sys/scsi/scsiconf.c, and assign it to the `od' driver.  This one has
 a better understanding of removable media devices.
 
 -- 
 cheers, J"org
 
 joerg_wunsch@uriah.heep.sax.de -- http://www.sax.de/~joerg/ -- NIC: JW11-RIPE
 Never trust an operating system you don't have sources for. ;-)

From: Mark Diekhans <markd@Grizzly.COM>
To: joerg_wunsch@uriah.heep.sax.de
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: kern/2371: SCSI disk corruption
Date: Sat, 4 Jan 1997 20:19:39 -0800 (PST)

 >From: j@uriah.heep.sax.de (J Wunsch)
 
 Hey mate, thanks for all the info.
 
 >Eeek, no:
 >
 >>     ahc0: target 0 Tagged Queuing Device
 >>     (ahc0:0:0): "HP C3725S 6039" type 0 fixed SCSI 2
 >>     sd0(ahc0:0:0): Direct-Access 2047MB (4194058 512 byte sectors)
 >
 >This one is known to choke on tagged commands!
 
 Ack....
 
 Perhaps there should be a warning in LINT about this.  Not booting after
 adding a option is no big deal, but destorying a disk after many days of use
 is rather painful.
 
 Thanks, I owe you a beer,
 Mark
State-Changed-From-To: open->closed 
State-Changed-By: joerg 
State-Changed-When: Sun Jan 5 17:13:54 MET 1997 
State-Changed-Why:  
I've added a few warnings about the dangerousness of some 
SCSI-related options to the LINT file in the 2.2 release 
candidate. 

State-Changed-From-To: closed->open 
State-Changed-By: gibbs 
State-Changed-When: Sun Jan 5 15:55:45 PST 1997 
State-Changed-Why:  
There are two problems here: 
1) The Adaptec driver in current and in 2.2 has a few bugs in it that 
cause the controller to wedge.  Although data is lost, data that successfully 
reaches the disk is just fine. 
2) The aic7xxx driver uses simple queue tags when tagged queueing is turned on. 
This means that I/O transactions queued to the target can be re-ordered at the 
targets discretion (except that a read following a write must return the data 
from the last queued write transaction).  Unfortunately, this can cause problems 
when syncronous write requests are issued.  There is nothing preventing a 
syncronous write from returning prior to some of the transactions queued before 
it.  I believe that this situation, when combined with a hang (meaning there 
are multiple outstanding transactions) is the cause of your "corruption" since 
meta-data state can be unstable. 

The first problem is being actively worked on by me, and should be addressed 
in the next two weeks. 

There is already a preliminary framework for fixing problem two in the 'SCSI' 
branch.  The basic approach is to take syncronous requests and use ordered 
tagged queueing on them so that synchronous requests work as they are intended. 


Responsible-Changed-From-To: freebsd-bugs->gibbs 
Responsible-Changed-By: gibbs 
Responsible-Changed-When: Sun Jan 5 15:55:45 PST 1997 
Responsible-Changed-Why:  
I will be addressing these issues. 
State-Changed-From-To: open->feedback 
State-Changed-By: gibbs 
State-Changed-When: Sat Apr 19 08:17:29 PDT 1997 
State-Changed-Why:  
Please retest using the latest (as of 4/18/97) FreeBSD-2.2 or current 
driver and see if the problem persists.  I believe it has been corrected. 

From: Mark Diekhans <markd@Grizzly.COM>
To: freebsd-gnats-submit@freebsd.org
Cc:  Subject: Re: kern/2371 SCSI disk corruption
Date: Sun, 22 Mar 1998 01:29:02 -0800 (PST)

 This problem was caused by enabled the AHC_TAGENABLE.  It turns out
 that I have a HP C3725S, which bogusly claims to support tagged cmds.
 This problem was addressed by adding a warning to LINT.  The PR 
 should probably be closed.
State-Changed-From-To: feedback->closed 
State-Changed-By: steve 
State-Changed-When: Sun Mar 22 09:03:01 PST 1998 
State-Changed-Why:  
Originator states this was caused by a drive that claimed 
to features which it did not and was corrected by a wardning 
added to LINT. 
>Unformatted:
