From nobody@FreeBSD.ORG  Mon Nov  6 06:34:10 2000
Return-Path: <nobody@FreeBSD.ORG>
Received: by hub.freebsd.org (Postfix, from userid 32767)
	id 47D4137B4D7; Mon,  6 Nov 2000 06:34:10 -0800 (PST)
Message-Id: <20001106143410.47D4137B4D7@hub.freebsd.org>
Date: Mon,  6 Nov 2000 06:34:10 -0800 (PST)
From: jan.redepenning@goelz.com
Sender: nobody@FreeBSD.ORG
To: freebsd-gnats-submit@FreeBSD.org
Subject: SCSI problem halts system after long period of perfect behaviour
X-Send-Pr-Version: www-1.0

>Number:         22640
>Category:       i386
>Synopsis:       SCSI problem halts system after long period of perfect behaviour
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Nov 06 06:40:01 PST 2000
>Closed-Date:    Sat Nov 24 03:41:41 PST 2001
>Last-Modified:  Sat Nov 24 03:42:43 PST 2001
>Originator:     Jan Redepenning
>Release:        FreeBSD 3.5-STABLE
>Organization:
Goelz & Schwarz GmbH
>Environment:
FreeBSD goelz.com 3.5-STABLE FreeBSD 3.5-STABLE #0: Fri Aug  4 12:59:26 CEST 2000     kherrmann@goelz.com:/usr/src/sys/compile/GOELZ
  i386
>Description:
The machine boots and works fine (even "perfectly") for an unpredictable
period of time (between 4 and 8 days). Then long runs of the following
messages appear on the console (and in the log files):

goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e
goelz /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e
goelz /kernel: (da0:ahc0:0:0:0): no longer in timeout, status = 34b
goelz /kernel: ahc0: Issued Channel A Bus Reset. 78 SCBs aborted

These repeat continuously until the machine is reset manually (often
even telnet no longer works). Usually the problems start with da0 and
then spread to the other drives. The machine's configuration, taken
from the boot messages:

goelz /kernel: CPU: AMD-K7(tm) Processor (604.23-MHz 686-class CPU)
goelz /kernel: Origin = "AuthenticAMD"  Id = 0x612  Stepping = 2
goelz /kernel: Features=0x81f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,MMX>
goelz /kernel: AMD Features=0xc0400000<<b22>,<b30>,3DNow!>
goelz /kernel: real memory  = 268435456 (262144K bytes)
goelz /kernel: avail memory = 258494464 (252436K bytes)
goelz /kernel: Preloaded elf kernel "kernel" at 0xc02b0000.
goelz /kernel: Pentium Pro MTRR support enabled
goelz /kernel: Probing for devices on PCI bus 0:
goelz /kernel: chip0: <Host to PCI bridge (vendor=1022 device=7006)> rev 0x23 on pci0.0.0
goelz /kernel: chip1: <PCI to PCI bridge (vendor=1022 device=7007)> rev 0x01 on pci0.1.0
goelz /kernel: chip2: <PCI to ISA bridge (vendor=1106 device=0686)> rev 0x1b on pci0.4.0
goelz /kernel: ahc0: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 10 on pci0.14.0
goelz /kernel: ahc0: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs
goelz /kernel: xl0: <3Com 3c905C-TX Fast Etherlink XL> rev 0x74 int a irq 12 on pci0.16.0
goelz /kernel: xl0: Ethernet address: 00:50:da:40:b3:8e
goelz /kernel: xl0: autoneg complete, link status good (half-duplex, 100Mbps)
goelz /kernel: Probing for devices on PCI bus 1:
goelz /kernel: vga0: <ATI model 4c42 graphics accelerator> rev 0xdc int a irq 11 on pci1.5.0
goelz /kernel: Probing for devices on the ISA bus:
goelz /kernel: sc0 on isa
goelz /kernel: sc0: VGA color <16 virtual consoles, flags=0x0>
goelz /kernel: atkbdc0 at 0x60-0x6f on motherboard
goelz /kernel: atkbd0 irq 1 on isa
goelz /kernel: psm0 not found
goelz /kernel: sio0 at 0x3f8-0x3ff irq 4 flags 0x10 on isa
goelz /kernel: sio0: type 16550A
goelz /kernel: sio1 at 0x2f8-0x2ff irq 3 on isa
goelz /kernel: sio1: type 16550A
goelz /kernel: fdc0 at 0x3f0-0x3f7 irq 6 drq 2 on isa
goelz /kernel: fdc0: FIFO enabled, 8 bytes threshold
goelz /kernel: fd0: 1.44MB 3.5in
goelz /kernel: ppc0 at 0x378 irq 7 flags 0x40 on isa
goelz /kernel: ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
goelz /kernel: ppc0: FIFO with 16/16/8 bytes threshold
goelz /kernel: lpt0: <generic printer> on ppbus 0
goelz /kernel: lpt0: Interrupt-driven port
goelz /kernel: ppi0: <generic parallel i/o> on ppbus 0
goelz /kernel: plip0: <PLIP network interface> on ppbus 0
goelz /kernel: vga0 at 0x3b0-0x3df maddr 0xa0000 msize 131072 on isa
goelz /kernel: npx0 on motherboard
goelz /kernel: npx0: INT 16 interface
goelz /kernel: Waiting 2 seconds for SCSI devices to settle
goelz /kernel: changing root device to da0s1a
goelz /kernel: da2 at ahc0 bus 0 target 2 lun 0
goelz /kernel: da2: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da2: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da2: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: da3 at ahc0 bus 0 target 5 lun 0
goelz /kernel: da3: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da3: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da3: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: da1 at ahc0 bus 0 target 1 lun 0
goelz /kernel: da1: <IBM DRHS36D 0270> Fixed Direct Access SCSI-3 device
goelz /kernel: da1: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
goelz /kernel: da1: 35239MB (72170879 512 byte sectors: 255H 63S/T 4492C)
goelz /kernel: da0 at ahc0 bus 0 target 0 lun 0
goelz /kernel: da0: <IBM DRHS36V 0270> Fixed Direct Access SCSI-3 device
goelz /kernel: da0: 40.000MB/s transfers (20.000MHz, offset 15, 16bit), Tagged Queueing Enabled
goelz /kernel: da0: 35239MB (72170879 512 byte sectors: 255H 63S/T 4492C)
goelz /kernel: da5 at ahc0 bus 0 target 9 lun 0
goelz /kernel: da5: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da5: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da5: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: da4 at ahc0 bus 0 target 8 lun 0
goelz /kernel: da4: <IBM DDYS-T36950N S80D> Fixed Direct Access SCSI-3 device
goelz /kernel: da4: 40.000MB/s transfers (20.000MHz, offset 63, 16bit), Tagged Queueing Enabled
goelz /kernel: da4: 35003MB (71687340 512 byte sectors: 255H 63S/T 4462C)
goelz /kernel: cd0 at ahc0 bus 0 target 4 lun 0
goelz /kernel: cd0: <TEAC CD-ROM CD-532S 1.0A> Removable CD-ROM SCSI-2 device
goelz /kernel: cd0: 20.000MB/s transfers (20.000MHz, offset 16)
goelz /kernel: cd0: Attempt to query device size failed: NOT READY, Medium not present
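The error lines above carry CAM's usual path prefix, (periph unit:SIM unit:bus:target:lun), so "(da0:ahc0:0:0:0)" is da0 on ahc0, bus 0, target 0, LUN 0, which matches the "da0 at ahc0 bus 0 target 0 lun 0" boot message. To check the observation that the timeouts begin on da0 and then spread to the other drives, one could scan a saved log and list the devices in the order in which they first report a timeout. The following is a minimal Python sketch that assumes the log text has exactly the format of the excerpts in this report; the function name first_timeouts is hypothetical and not part of any FreeBSD tool.

```python
import re

# CAM path prefix as printed in the kernel messages:
# (periph+unit : SIM+unit : bus : target : lun), e.g. "(da0:ahc0:0:0:0)".
# Assumes the message format quoted in this report.
CAM_PATH = re.compile(
    r"\((?P<periph>[a-z]+\d+):(?P<sim>[a-z]+\d+):"
    r"(?P<bus>\d+):(?P<target>\d+):(?P<lun>\d+)\)"
)


def first_timeouts(log_text):
    """Return (periph, sim, bus, target, lun) tuples in the order in
    which each device first reports a timeout; later repeats are ignored."""
    seen = []
    for line in log_text.splitlines():
        if "timed out" not in line:
            continue  # only timeout lines mark the start of trouble
        match = CAM_PATH.search(line)
        if match is None:
            continue
        device = (
            match.group("periph"),
            match.group("sim"),
            int(match.group("bus")),
            int(match.group("target")),
            int(match.group("lun")),
        )
        if device not in seen:
            seen.append(device)
    return seen


# Lines copied from the excerpt in this report.
sample = """\
goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e
goelz /kernel: (da0:ahc0:0:0:0): BDR message in message buffer
goelz /kernel: (da0:ahc0:0:0:0): SCB 0x85 - timed out in datain phase, SEQADDR == 0x5e
"""

for periph, sim, bus, target, lun in first_timeouts(sample):
    print(f"{periph} on {sim}: bus {bus}, target {target}, lun {lun}")
```

On this excerpt it prints a single line for da0 (target 0). Run against the full /var/log/messages from one of the hangs, the order of the printed devices would show whether da0 really is the first drive to time out every time.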

>How-To-Repeat:
Reboot and wait a few days. I'm sorry that I'm unable to be more
specific. We've tried periods of heavy load; once it crashed during
such a "copy-orgy", but at other times it worked fine.

:-(
>Fix:


>Release-Note:
>Audit-Trail:

From: wataru-s@mfeed.ad.jp (Wataru Satoh)
To: jan.redepenning@goelz.com, freebsd-gnats-submit@freebsd.org
Cc: wataru-s@mfeed.ad.jp
Subject: Re: i386/22640: SCSI problem halts system after long period of perfect behaviour
Date: Wed, 6 Dec 2000 02:08:25 +0900 (JST)

 Dear Mr. Redepenning,
 
 I'm Wataru Satoh, and I work for an ISP in Japan.
 
 We have run into a mysterious SCSI problem just like yours.
 
 The OS is FreeBSD 3.5.1-RELEASE, the host adapter is an
 Adaptec 2940UW, and the disks are IBM 18 GB Ultrastar DNES-318350.
 
 The machine suddenly died twice in one week, with one disk's access
 LED stubbornly lit.
 I tried a warm restart without powering down, but it did not work:
 the machine tried to boot as if it had no disk.
 
 The second time, after it booted successfully following a power cycle,
 I examined it in single-user mode.
 
 The first time, the machine was equipped with two identical disks, but
 this time only one drive was attached, as SCSI ID #1 (da0), divided
 into two partitions and a swap area.
 The secondary filesystem (da0s1e, for /home), which was heavily
 accessed, mostly for MRTG logging and graphing, was badly
 corrupted. The primary one (da0s1a, for /) was not corrupted at all.
 
 In /var/log/dmesg.yesterday I found the following messages. I don't
 know when they were printed; sorry, there was no serial console.
 
 (da0:ahc0:0:0:0): SCB 0x3e - timed out while idle, LASTPHASE == 0x1, SEQADDR == 0x153
 (da0:ahc0:0:0:0): SCB 62: Immediate reset.  Flags = 0x4040
 (da0:ahc0:0:0:0): no longer in timeout, status = 34b
 ahc0: Issued Channel A Bus Reset. 64 SCBs aborted
 
 Does anyone know how to track down or examine this "bug", whether it
 is a hardware/firmware failure or some other SCSI voodoo?
 
 ----
 Wataru Satoh <wataru-s@mfeed.ad.jp> / INTERNET MULTIFEED CO.
 TEL: 03-3282-1040 / FAX: 03-3282-1020
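Both reports end with the same kind of message, "ahc0: Issued Channel A Bus Reset. N SCBs aborted". One crude way to track how often this happens, and how much outstanding I/O each reset discards, is to total those lines from a saved log. The following is a hypothetical Python sketch that assumes only the exact message format quoted in this report; summarize_bus_resets is an illustrative name, not an existing tool.

```python
import re

# Matches e.g. "ahc0: Issued Channel A Bus Reset. 64 SCBs aborted"
# (assumes the exact wording quoted in this report).
BUS_RESET = re.compile(
    r"(?P<adapter>[a-z]+\d+): Issued Channel (?P<channel>[A-Z]) Bus Reset\. "
    r"(?P<aborted>\d+) SCBs aborted"
)


def summarize_bus_resets(log_text):
    """Return {(adapter, channel): (reset_count, total_scbs_aborted)}."""
    summary = {}
    for match in BUS_RESET.finditer(log_text):
        key = (match.group("adapter"), match.group("channel"))
        resets, aborted = summary.get(key, (0, 0))
        summary[key] = (resets + 1, aborted + int(match.group("aborted")))
    return summary


# The bus-reset line quoted in the message above.
sample = " ahc0: Issued Channel A Bus Reset. 64 SCBs aborted\n"

for (adapter, channel), (resets, aborted) in summarize_bus_resets(sample).items():
    print(f"{adapter} channel {channel}: {resets} reset(s), {aborted} SCBs aborted")
```

Applied to a whole day's /var/log/messages, a steadily rising reset count before a hang would at least show when the trouble began, which the report above could not determine.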
 
State-Changed-From-To: open->feedback 
State-Changed-By: mjacob 
State-Changed-When: Mon Oct 1 19:03:43 PDT 2001 
State-Changed-Why:  
Is this still a problem? The problem seems to me to actually 
be a disk h/w problem. 

http://www.FreeBSD.org/cgi/query-pr.cgi?pr=22640 
State-Changed-From-To: feedback->closed 
State-Changed-By: wilko 
State-Changed-When: Sat Nov 24 03:41:41 PST 2001 
State-Changed-Why:  
Timeout polling for feedback. mjacob asked for feedback 
on Oct 1, no reply 


http://www.FreeBSD.org/cgi/query-pr.cgi?pr=22640 
>Unformatted:
