From zach@gaffaneys.com  Fri Oct  3 15:05:11 1997
Received: from murkwood.gaffaneys.com (dialup7.gaffaneys.com [208.155.161.57])
          by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id PAA15767
          for <FreeBSD-gnats-submit@freebsd.org>; Fri, 3 Oct 1997 15:05:08 -0700 (PDT)
Received: (from zach@localhost)
	by murkwood.gaffaneys.com (8.8.7/8.8.6) id RAA00469;
	Fri, 3 Oct 1997 17:04:39 -0500 (CDT)
Message-Id: <199710032204.RAA00469@murkwood.gaffaneys.com>
Date: Fri, 3 Oct 1997 17:04:39 -0500 (CDT)
From: Zach Heilig <zach@gaffaneys.com>
Reply-To: zach@gaffaneys.com
To: FreeBSD-gnats-submit@freebsd.org
Subject: crash on very heavy disk activity.
X-Send-Pr-Version: 3.2

>Number:         4684
>Category:       kern
>Synopsis:       crash on very heavy disk activity.
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Oct  3 15:10:01 PDT 1997
>Closed-Date:    Mon Mar 23 04:49:00 PST 1998
>Last-Modified:  Mon Mar 23 04:49:29 PST 1998
>Originator:     Zach Heilig
>Release:        FreeBSD 2.2-STABLE i386
>Organization:
none
>Environment:

2.2 stable as of Sept 18.

Copyright (c) 1992-1997 FreeBSD Inc.
Copyright (c) 1982, 1986, 1989, 1991, 1993
	The Regents of the University of California.  All rights reserved.

FreeBSD 2.2-STABLE #0: Thu Sep 18 09:44:26 CDT 1997
    zach@murkwood.gaffaneys.com:/usr/src/sys/compile/ZACH
CPU: Pentium (166.59-MHz 586-class CPU)
  Origin = "GenuineIntel"  Id = 0x543  Stepping=3
  Features=0x8001bf<FPU,VME,DE,PSE,TSC,MSR,MCE,CX8>
real memory  = 67108864 (65536K bytes)
avail memory = 62824448 (61352K bytes)
Probing for devices on PCI bus 0:
chip0 <generic PCI bridge (vendor=1106 device=0585 subclass=0)> rev 35 on pci0:0
chip1 <generic PCI bridge (vendor=1106 device=0586 subclass=1)> rev 37 on pci0:7:0
pci0:7:1: VIA Technologies, device=0x0571, class=storage (ide) [no driver assigned]
vga0 <VGA-compatible display device> rev 0 int a irq 10 on pci0:9
ncr0 <ncr 53c875 fast20 wide scsi> rev 1 int a irq 11 on pci0:10
ncr0 waiting for scsi devices to settle
(ncr0:0:0): "MICROP 4743NS S162" type 0 fixed SCSI 2
sd0(ncr0:0:0): Direct-Access 
sd0(ncr0:0:0): 20.0 MB/s (50 ns, offset 15)

sd0(ncr0:0:0): M_DISCONNECT received, but datapointer not saved:
	data=701b4 save=e40016b0 goal=e40016d4.
4100MB (8398656 512 byte sectors)
sd0(ncr0:0:0): with 6506 cyls, 7 heads, and an average 184 sectors/track
(ncr0:1:0): "QUANTUM FIREBALL_TM2110S 300X" type 0 fixed SCSI 2
sd1(ncr0:1:0): Direct-Access 
sd1(ncr0:1:0): 20.0 MB/s (50 ns, offset 15)
2014MB (4124736 512 byte sectors)
sd1(ncr0:1:0): with 6810 cyls, 4 heads, and an average 151 sectors/track
(ncr0:2:0): "iomega jaz 1GB J.83" type 0 removable SCSI 2
sd2(ncr0:2:0): Direct-Access 
sd2(ncr0:2:0): 10.0 MB/s (100 ns, offset 15)

sd2(ncr0:2:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB
sd2 could not mode sense (4). Using ficticious geometry
1021MB (2091050 512 byte sectors)
sd2(ncr0:2:0): with 1021 cyls, 64 heads, and an average 32 sectors/track
(ncr0:4:0): "SANYO CRD-254S 1.02" type 5 removable SCSI 2
cd0(ncr0:4:0): CD-ROM 
cd0(ncr0:4:0): asynchronous.
can't get the size
Probing for devices on the ISA bus:
sc0 at 0x60-0x6f irq 1 flags 0x6 on motherboard
sc0: VGA color <16 virtual consoles, flags=0x6>
sio0 at 0x3f8-0x3ff irq 4 on isa
sio0: type 16550A
sio1 at 0x2f8-0x2ff irq 3 on isa
sio1: type 16550A
lpt0 at 0x378-0x37f irq 7 on isa
lpt0: Interrupt-driven port
lp0: TCP/IP capable interface
pca0 on motherboard
pca0: PC speaker audio driver
fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
fdc0: NEC 72065B
fd0: 1.44MB 3.5in
npx0 on motherboard
npx0: INT 16 interface
joy0 at 0x201 on isa
joy0: joystick
sb0 at 0x220 irq 5 drq 1 on isa
sb0: <SoundBlaster 16 4.13>
sbxvi0 at 0x0 drq 5 on isa
sbxvi0: <SoundBlaster 16 4.13>
sbmidi0 at 0x330 on isa
 <SoundBlaster MPU-401>
opl0 at 0x388 on isa
opl0: <Yamaha OPL-3 FM>
WARNING: / was not properly dismounted.
sd2(ncr0:2:0): UNIT ATTENTION asc:28,0
sd2(ncr0:2:0):  Not ready to ready transition, medium may have changed
sd2(ncr0:2:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB
sd2 could not mode sense (4). Using ficticious geometry

>Description:

Here are the last few console messages before the reboot:

sd2(ncr0:2:0): extraneous data discarded.
sd2(ncr0:2:0): COMMAND FAILED (9 0) @f078bc00.
sd2(ncr0:2:0): extraneous data discarded.
sd2(ncr0:2:0); COMMAND FAILED (9 0) @f06b4000.
sd2(ncr0:2:0): extraneous data discarded.
sd2(ncr0:2:0); COMMAND FAILED (9 0) @f06b4000.
sd2(ncr0:2:0): extraneous data discarded.
sd2(ncr0:2:0); COMMAND FAILED (9 0) @f06b4000.
/src: bad dir ino 145920 at offset 0: mangled entry
panic: bad dir
syncing disks... 22 22 21 18 14 9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 giving up
dumping to dev 30401, offset 131072
dump 64 63 ... 1 succeeded

This was during both an rm -rf of a large tree on sd2s1e and a cvs checkout
from the cvs repository I keep on that slice.

I have more information, if you need to know more.  I did keep the core dump
around (and logged the fsck for /dev/sd2s1e).

>How-To-Repeat:

	

>Fix:
	
	

>Release-Note:
>Audit-Trail:

From: Stefan Esser <se@FreeBSD.ORG>
To: zach@gaffaneys.com
Cc: FreeBSD-gnats-submit@freebsd.org, Stefan Esser <se@freebsd.org>
Subject: Re: kern/4684: crash on very heavy disk activity.
Date: Sun, 5 Oct 1997 11:04:22 +0200

 On 1997-10-03 17:04 -0500, Zach Heilig <zach@gaffaneys.com> wrote:
 > >Synopsis:       crash on very heavy disk activity.
 
 > FreeBSD 2.2-STABLE #0: Thu Sep 18 09:44:26 CDT 1997
 > ncr0 <ncr 53c875 fast20 wide scsi> rev 1 int a irq 11 on pci0:10
 > (ncr0:0:0): "MICROP 4743NS S162" type 0 fixed SCSI 2
 > sd0(ncr0:0:0): Direct-Access 
 > sd0(ncr0:0:0): 20.0 MB/s (50 ns, offset 15)
 > 
 > sd0(ncr0:0:0): M_DISCONNECT received, but datapointer not saved:
 > 	data=701b4 save=e40016b0 goal=e40016d4.
 
 Hmm, the drive disconnected during the probe ...
 Does this happen on each boot ?
 
 > 4100MB (8398656 512 byte sectors)
 > sd0(ncr0:0:0): with 6506 cyls, 7 heads, and an average 184 sectors/track
 > (ncr0:1:0): "QUANTUM FIREBALL_TM2110S 300X" type 0 fixed SCSI 2
 > sd1(ncr0:1:0): Direct-Access 
 > sd1(ncr0:1:0): 20.0 MB/s (50 ns, offset 15)
 > 2014MB (4124736 512 byte sectors)
 > sd1(ncr0:1:0): with 6810 cyls, 4 heads, and an average 151 sectors/track
 > (ncr0:2:0): "iomega jaz 1GB J.83" type 0 removable SCSI 2
 > sd2(ncr0:2:0): Direct-Access 
 > sd2(ncr0:2:0): 10.0 MB/s (100 ns, offset 15)
 > 
 > sd2(ncr0:2:0): ILLEGAL REQUEST asc:24,0 Invalid field in CDB
 > sd2 could not mode sense (4). Using ficticious geometry
 > 1021MB (2091050 512 byte sectors)
 > sd2(ncr0:2:0): with 1021 cyls, 64 heads, and an average 32 sectors/track
 > (ncr0:4:0): "SANYO CRD-254S 1.02" type 5 removable SCSI 2
 > cd0(ncr0:4:0): CD-ROM 
 > cd0(ncr0:4:0): asynchronous.
 
 > Here are the last few console messages before the reboot:
 > 
 > sd2(ncr0:2:0): extraneous data discarded.
 > sd2(ncr0:2:0): COMMAND FAILED (9 0) @f078bc00.
 > sd2(ncr0:2:0): extraneous data discarded.
 > sd2(ncr0:2:0); COMMAND FAILED (9 0) @f06b4000.
 > sd2(ncr0:2:0): extraneous data discarded.
 > sd2(ncr0:2:0); COMMAND FAILED (9 0) @f06b4000.
 > sd2(ncr0:2:0): extraneous data discarded.
 > sd2(ncr0:2:0); COMMAND FAILED (9 0) @f06b4000.
 > /src: bad dir ino 145920 at offset 0: mangled entry
 > panic: bad dir
 > syncing disks... 22 22 21 18 14 9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 giving up
 > dumping to dev 30401, offset 131072
 > dump 64 63 ... 1 succeeded
 > 
 > This was during both an rm -rf of a large tree on sd2s1e and a cvs checkout
 > from the cvs repository I keep on that slice.
 
 The command failed because of lack of agreement on the amount of 
 data requested. The drive stayed in a data phase, when there was 
 either no more data to deliver to it, or no more buffer space to
 store the data read (depending on whether this happened during a
 read or a write).
 
 This (together with the disconnect of your UW drive) indicates 
 there is a SCSI bus problem. SCSI strobe pulses got lost or 
 duplicated.
 
 What's the (total!) length of your SCSI bus ?
 (Internal plus external, number of connectors, if any, terminators ?)
 
 Could you try with a much reduced data rate (say 5MHz), just to
 make sure it is not caused by the bus cable ?
 
 > I have more information, if you need to know more.  I did keep the core dump
 > around (and logged the fsck for /dev/sd2s1e).
 
 I don't expect this to be a software problem. The core won't help
 much in this case, since the crash happened some time after the 
 SCSI problem was detected by the NCR chip and driver.
 
 Please check your cables and terminators.
 
 There appear to be NCR cards with erroneous documentation. They
 got the terminator enable/disable labels exchanged. I do not have
 such a card, just heard about it ...
 
 You should also be aware that Fast-20 limits the cable length to
 half of what was allowed with Fast-10. In fact, I'd be reluctant
 to use more than 1m of cable between the controller and the last
 device, assuming you have no external devices.
 
 Don't trust the specified maximum bus length unless you know you have
 a first-grade SCSI bus cable (as was discussed recently), since only
 such a cable will guarantee that the electrical parameters of the
 bus meet the SCSI standard requirements. You may use a cheap cable
 if you limit the cable length to about half that allowed by the SCSI
 standard for the intended data rate, and if you don't connect too many
 devices.
 
 Regards, STefan

From: Zach Heilig <zach@gaffaneys.com>
To: Stefan Esser <se@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org
Subject: Re: kern/4684: crash on very heavy disk activity.
Date: Tue, 7 Oct 1997 16:41:35 -0500

 I did not see my other reply come over the bugs list [from a while back], so
 I will paraphrase my other reply...
 
 On Sun, Oct 05, 1997 at 11:04:22AM +0200, Stefan Esser wrote:
 > > ncr0 <ncr 53c875 fast20 wide scsi> rev 1 int a irq 11 on pci0:10
 > > sd0(ncr0:0:0): M_DISCONNECT received, but datapointer not saved:
 > > 	data=701b4 save=e40016b0 goal=e40016d4.
 
 > Hmm, the drive disconnected during the probe ...
 > Does this happen on each boot ?
 
 Yes, this happens on every boot.  I haven't actually noticed any other problems
 with sd0 (or sd1) though.
 
 > > Here are the last few console messages before the reboot:
 ...
 > > This was during both an rm -rf of a large tree on sd2s1e and a cvs checkout
 > > from the cvs repository I keep on that slice.
 
 > The command failed because of lack of agreement on the amount of 
 > data requested. The drive stayed in a data phase, when there was 
 > either no more data to deliver to it, or no more buffer space to
 > store the data read (depending on whether this happened during a
 > read or a write).
 
 I wonder if there was a bus reset, and the jaz drive responded by just dropping
 everything it was up to (and returned/wrote incomplete data).
 
 > This (together with the disconnect of your UW drive) indicates 
 > there is a SCSI bus problem. SCSI strobe pulses got lost or 
 > duplicated.
 
 Just for clarification, I only have 50 pin scsi devices.  These devices hang
 off of a 50 pin port on the scsi card.  The card will take ultra-wide devices,
 but I do not have any.
 
 > What's the (total!) length of your SCSI bus ?
 > (Internal plus external, number of connectors, if any, terminators ?)
 
 The devices were on a 38" cable, 5 connectors, 8" between four on one end,
 and 14" between the single connector and the group.  They were in the order
 card -14"- sd0 -8"- sd1 -8"- sd2 -8"- cd0.
 
 cd0 has terminators installed.
 sd2 has only automatic termination.
 [ the above are no longer connected ].
 sd1 had termination disabled.
 sd0 has termination and term power disabled.
 the card only has one setting [and that's for auto termination].
 
 I tried 3 other cables [all with connections for 2 devices].  I re-enabled
 termination on sd1, and only connected sd0 and sd1.  sd0 still disconnected
 on boot for all three cables.  These cables were all around 24" long.  I left
 one of these other cables installed [the one that came with the card].
 
 > Could you try with a much reduced data rate (say 5MHz), just to
 > make sure it is not caused by the bus cable ?
 
 This didn't seem to make any difference.  It took longer to induce a crash,
 but it still did crash.
 
 Ok, I'm pretty much convinced this is actually a cable [or device] problem.
 Unless the ncr driver is sending a reset that is messing up the jaz drive,
 there doesn't seem to be much on the software side that can fix these things.
 
 Even though I don't seem to be having problems with sd0, that is the device
 that was added not too long ago, and there were no problems at all before
 that.  When I have time, I'll switch the drive to the end and enable all its
 termination options [and disable the termination on the other devices], and
 see if that helps any.
 
 -- 
 Zach Heilig
 We know you are a good friend, but we have to charge you for our services just
 like our other customers.  Actually, we don't like charging our friends, but
 we did a study of our clientele and discovered none of our enemies do business
 with us.  [seen in a lawyer's office].

From: Zach Heilig <zach@gaffaneys.com>
To: freebsd-gnats-submit@freebsd.org
Cc:
Subject: Re: kern/4684: crash on very heavy disk activity
Date: Tue, 28 Oct 1997 04:12:27 -0600

 I am happy to report that this has been completely resolved.  All apparent
 problems went away after I switched out the TekRam DC390F with an Adaptec 2940.
 
 As a bonus, disk speeds increased by several percent as well...
 
 -- 
 Zach Heilig

From: Zach Heilig <zach@gaffaneys.com>
To: freebsd-gnats-submit@freebsd.org
Cc:
Subject: Re: kern/4684 crash on very heavy disk activity.
Date: Mon, 23 Mar 1998 02:42:38 -0600

 This can be closed.  This was a side effect of a nasty Micropolis drive unable
 to deal with a 20MHz bus (even though it was access to a different drive that
 caused the panic).
 
 -- 
 Zach Heilig -- zach@gaffaneys.com
 Real Programs don't use shared text.  Otherwise, how can they use
 functions for scratch space after they are finished calling them?
State-Changed-From-To: open->closed 
State-Changed-By: steve 
State-Changed-When: Mon Mar 23 04:49:00 PST 1998 
State-Changed-Why:  
Closed at originator's request. 
>Unformatted:
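
The cable-length rules of thumb given in the audit trail (Fast-20 halves the
length allowed for Fast-10, and a cable of unknown quality should be kept to
about half of the standard's limit) can be checked against the reported 38"
cable with a short calculation. This is only a rough sketch: the 3.0 m
single-ended Fast-10 limit is an assumption supplied for illustration and is
not stated anywhere in this report, and the function names are hypothetical.

```python
INCH_TO_M = 0.0254  # exact conversion factor, inches to metres


def cable_length_m(segments_in):
    """Total cable length in metres from per-segment lengths in inches."""
    return sum(segments_in) * INCH_TO_M


def conservative_limit_m(fast10_limit_m, fast20=True, first_grade_cable=False):
    """Apply the rules of thumb from the audit trail to a Fast-10 limit.

    Fast-20 halves the Fast-10 limit; a cable that is not known to be
    first grade halves the result again.
    """
    limit = fast10_limit_m / 2 if fast20 else fast10_limit_m
    if not first_grade_cable:
        limit /= 2
    return limit


# Segment lengths reported in the thread:
# card -14"- sd0 -8"- sd1 -8"- sd2 -8"- cd0
segments = [14, 8, 8, 8]
total = cable_length_m(segments)

# ASSUMPTION: 3.0 m as the single-ended Fast-10 bus limit (not from the report).
ASSUMED_FAST10_LIMIT_M = 3.0
limit = conservative_limit_m(ASSUMED_FAST10_LIMIT_M)

print(f"cable: {total:.3f} m, conservative Fast-20 limit: {limit:.2f} m")
print("within limit" if total <= limit else "exceeds limit")
```

Under that assumption the original 38" cable is longer than the conservative
limit, while the 24" replacement cables mentioned in the thread fall below it.
Since the problem persisted with the shorter cables, this is consistent with
the final diagnosis that the cable was not the root cause.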
