From nobody@FreeBSD.org  Thu Aug 12 19:32:12 2004
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 40D3916A4CE
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 12 Aug 2004 19:32:12 +0000 (GMT)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1D84943D41
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 12 Aug 2004 19:32:12 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.12.11/8.12.11) with ESMTP id i7CJWBao033201
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 12 Aug 2004 19:32:11 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.12.11/8.12.11/Submit) id i7CJWBaV033200;
	Thu, 12 Aug 2004 19:32:11 GMT
	(envelope-from nobody)
Message-Id: <200408121932.i7CJWBaV033200@www.freebsd.org>
Date: Thu, 12 Aug 2004 19:32:11 GMT
From: Wayne Cox <wc_fbsd@xxiii.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: System hangs under heavy disk IO with SiI 3112 SATA150 controller and Western Digital drive
X-Send-Pr-Version: www-2.3

>Number:         70379
>Category:       kern
>Synopsis:       System hangs under heavy disk IO with SiI 3112 SATA150 controller and Western Digital drive
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    sos
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Aug 12 19:40:25 GMT 2004
>Closed-Date:    Tue Aug 24 20:07:50 GMT 2004
>Last-Modified:  Fri Nov  5 20:10:29 GMT 2004
>Originator:     Wayne Cox
>Release:        FreeBSD 5.2.1-RELEASE-p9 i386
>Organization:
Twenty-Three, Inc.  xxiii.com
>Environment:
FreeBSD stimpy.xxiii.com 5.2.1-RELEASE-p9 FreeBSD 5.2.1-RELEASE-p9 #4: Wed Aug 11 11:34:00 EDT 2004     root@stimpy.xxiii.com:/usr/src/sys/i386/compile/WMC  i386

Generic PC with Celeron 433MHz CPU, Adaptec 1210SA Serial-ATA controller using SiI 3112 chipset, Western Digital "Raptor" WD360GD SATA disk.

GENERIC kernel.

make.conf has  CFLAGS= -O -pipe;  NOPROFILE=true

Although I have experimented with patches and with kernel configuration and compilation, the problem is identical on a bone-stock installation.

>Description:
  Under heavy disk IO, the system reports a series of errors on the console, similar to "ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=xxxxxxx"
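
  To gauge how often the timeouts occur, the messages can be counted from the kernel message buffer.  This is only a minimal sketch: the helper name count_write_timeouts is hypothetical, and it assumes the console messages contain the literal text "TIMEOUT - WRITE_DMA" as quoted above.

```shell
# Hypothetical helper (not part of FreeBSD): count ATA write-timeout
# lines in kernel log text supplied on standard input.
count_write_timeouts() {
    # grep -c prints the number of matching lines.  It exits non-zero
    # when that number is 0, so "|| true" keeps the helper's status clean.
    grep -c 'TIMEOUT - WRITE_DMA' || true
}

# Example: dmesg | count_write_timeouts
```

  Feeding it the output of dmesg before and after a test run gives a rough measure of how many retries occurred during that run.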

  Sometimes it recovers, but may go on to repeat similar errors.  At some point, it WILL error and hang.  There is no "panic" message or anything on the console.  Keyboard is unresponsive to ctrl-alt-del.  A hardware reset or power cycle is required to reboot.  Data corruption can be severe.

  This may be similar or identical to kern/69446 or i386/59895, but those happened under somewhat different circumstances, and make no mention of the timing issue (see FIX section.)

  SATA drives are rapidly supplanting IDE and SCSI drives, and it sure would be nice to be able to use them reliably.

>How-To-Repeat:
  Do random-seek intensive IO on the drive.  I had a large (4GB) backup .tar on one filesystem, and was attempting to extract it to another file system on the same drive/spindle.  eg:
  cd /fs2 ; tar -xvf /fs1/BigBackup.tar 

  Sequential reads don't seem to cause trouble.  For example:
    cat /fs1/BigBackup.tar > /dev/null
  causes no hiccups.

  Also, pulling files over the network hasn't caused problems.  Using rsh to pull a .tar from a remote system and un-tarring locally, or simply ftping big files works ok, even though they approach the 10MB/sec wire speed.
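
  For anyone without a large tar file at hand, a seek-heavy write load can be approximated with dd.  This is a sketch under stated assumptions, not the procedure used above: the function name random_write_load, the 64k block size, and the use of /dev/urandom with od are all illustrative choices, and it has not been confirmed that this load triggers the hang.

```shell
# Hypothetical stress helper: write single 64k blocks at random offsets
# inside a scratch file, so the drive has to seek between writes.
# usage: random_write_load FILE BLOCKS WRITES
#   FILE    scratch file on the filesystem under test
#   BLOCKS  size of the region to scatter writes over, in 64k blocks
#   WRITES  number of single-block writes to issue
random_write_load() {
    file=$1
    blocks=$2
    writes=$3
    i=0
    while [ "$i" -lt "$writes" ]; do
        # Two random bytes give a value in 0..65535, so this sketch
        # covers regions of up to 65536 blocks (4GB at 64k per block).
        r=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
        off=$((r % blocks))
        # conv=notrunc keeps earlier blocks in place between writes.
        dd if=/dev/zero of="$file" bs=64k seek="$off" count=1 \
            conv=notrunc 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Example: random_write_load /fs2/scratch.bin 65536 100000
```

  Unlike the tar extraction, this generates writes only, so it exercises the WRITE_DMA path without the competing reads from a second filesystem on the same spindle.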

>Fix:
  This is a fairly slow system (433MHz) to start with.  In one similar bug report (kern/69446), the author couldn't even get the basic install to run.  So I'm speculating that it might be some sort of timing issue in the ata driver???

  One work-around I found is to artificially slow down the IO.  In the above example of reproducing the problem, I was able to successfully restore the file by wasting many CPU cycles piping the data through some compression, e.g.:
  cd /fs2 ; gzip -c --best /fs1/BigBackup.tar | zcat | tar -xvf -
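
  The same work-around can be wrapped in a small function so it is reusable with any stream.  A minimal sketch, assuming gzip is available; the name slow_pipe is hypothetical, and -9 is the short form of --best:

```shell
# Hypothetical wrapper for the work-around: slow a data stream down by
# compressing it at the highest level and immediately decompressing it.
# The data that comes out is identical to the data that went in.
slow_pipe() {
    gzip -c -9 | gzip -dc
}

# Example: slow_pipe < /fs1/BigBackup.tar | (cd /fs2 && tar -xvf -)
```

  Using "cd /fs2 &&" inside a subshell, rather than "cd /fs2 ;", keeps tar from extracting into the current directory if the cd fails.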

  I'm no kernel programmer.  But just as a shot in extreme darkness, I found some code in src/sys/dev/ata/ata-lowlevel.c setting a timeout value ("int timeout = 5000") with a comment "might be less for fast devices".  I tried changing it to 3000 and to 8000 and recompiling, with no apparent change in either case.

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->sos 
Responsible-Changed-By: arved 
Responsible-Changed-When: Tue Aug 24 19:09:52 GMT 2004 
Responsible-Changed-Why:  
Over to the ATA Maintainer.  
Looks like there are so many ATA problems that a
dedicated PR category/mailing list might be useful

http://www.freebsd.org/cgi/query-pr.cgi?pr=70379 
State-Changed-From-To: open->closed 
State-Changed-By: sos 
State-Changed-When: Tue Aug 24 20:01:11 GMT 2004 
State-Changed-Why:  
Problem solved in -current. 


http://www.freebsd.org/cgi/query-pr.cgi?pr=70379 

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@FreeBSD.org
Cc: sos@FreeBSD.org
Subject: Re: kern/70379: System hangs under heavy disk IO with SiI 3112 SATA150 controller and Western Digital drive
Date: Mon, 20 Sep 2004 14:44:44 -0400

 --Boundary-00=_cUyTBQPUt18gydP
 Content-Type: text/plain;
   charset="us-ascii"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: inline
 
 We are still seeing this on our amd64 box. dmesg is attached.
 
 After we upgraded the system BIOS (with new code for the SiI 3114
 controller) and the kernel (to Sep3 -current), the errors are not nearly
 as frequent, and the disk will survive an entire run of "iozone -a",
 but eventually WRITE_DMA errors start appearing, followed sometimes by
 ufs panics.
 
 Upon reboot, the WRITE_DMA messages are triggered by fsck's IO.
 
 What can we do to ensure this problem is gone before 5.3-RELEASE?
 
 Thanks!
 
 	-mi
 
 
 --Boundary-00=_cUyTBQPUt18gydP
 Content-Type: text/plain;
   charset="us-ascii";
   name="pandora.dmesg.txt"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename="pandora.dmesg.txt"
 
 Copyright (c) 1992-2004 The FreeBSD Project.
 Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
 	The Regents of the University of California. All rights reserved.
 FreeBSD 6.0-CURRENT #2: Fri Sep  3 14:36:02 EDT 2004
     mteterin@pandora.us.murex.com:/backup/obj/usr/src/sys/DIOSCURI
 Timecounter "i8254" frequency 1193182 Hz quality 0
 CPU: AMD Opteron(tm) Processor 244 (1793.02-MHz K8-class CPU)
   Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8
   Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
   AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow+,3DNow>
 real memory  = 2147418112 (2047 MB)
 avail memory = 2064113664 (1968 MB)
 ACPI APIC Table: <A M I  OEMAPIC >
 MADT: Forcing active-low polarity and level trigger for SCI
 ioapic0 <Version 1.1> irqs 0-23 on motherboard
 ioapic1 <Version 1.1> irqs 24-27 on motherboard
 ioapic2 <Version 1.1> irqs 28-31 on motherboard
 acpi0: <A M I OEMXSDT> on motherboard
 acpi0: Power Button (fixed)
 Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
 acpi_timer0: <24-bit timer at 3.579545MHz> port 0x5008-0x500b on acpi0
 cpu0: <ACPI CPU> on acpi0
 pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
 pci0: <ACPI PCI bus> on pcib0
 pcib1: <ACPI PCI-PCI bridge> at device 6.0 on pci0
 pci3: <ACPI PCI bus> on pcib1
 ohci0: <OHCI (generic) USB controller> mem 0xff3fc000-0xff3fcfff irq 19 at device 0.0 on pci3
 ohci0: [GIANT-LOCKED]
 usb0: OHCI version 1.0, legacy support
 usb0: <OHCI (generic) USB controller> on ohci0
 usb0: USB revision 1.0
 uhub0: AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
 uhub0: 3 ports with 3 removable, self powered
 ohci1: <OHCI (generic) USB controller> mem 0xff3fd000-0xff3fdfff irq 19 at device 0.1 on pci3
 ohci1: [GIANT-LOCKED]
 usb1: OHCI version 1.0, legacy support
 usb1: <OHCI (generic) USB controller> on ohci1
 usb1: USB revision 1.0
 uhub1: AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
 uhub1: 3 ports with 3 removable, self powered
 ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0x9800-0x98ff mem 0xff3fe000-0xff3fefff irq 16 at device 10.0 on pci3
 ahc0: [GIANT-LOCKED]
 aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
 atapci0: <SiI 3114 SATA150 controller> port 0x9c00-0x9c0f,0xa000-0xa003,0xa400-0xa407,0xa800-0xa803,0xac00-0xac07 mem 0xff3ff400-0xff3ff7ff irq 17 at device 11.0 on pci3
 ata2: channel #0 on atapci0
 ata3: channel #1 on atapci0
 ata4: channel #2 on atapci0
 ata5: channel #3 on atapci0
 fwohci0: <Texas Instruments TSB43AB22/A> mem 0xff3f8000-0xff3fbfff,0xff3ff800-0xff3fffff irq 19 at device 12.0 on pci3
 fwohci0: OHCI version 1.10 (ROM=1)
 fwohci0: No. of Isochronous channels is 4.
 fwohci0: EUI64 00:e0:81:00:00:30:13:5c
 fwohci0: Phy 1394a available S400, 2 ports.
 fwohci0: Link S400, max_rec 2048 bytes.
 firewire0: <IEEE1394(FireWire) bus> on fwohci0
 fwe0: <Ethernet over FireWire> on firewire0
 if_fwe0: Fake Ethernet address: 02:e0:81:30:13:5c
 fwe0: Ethernet address: 02:e0:81:30:13:5c
 fwe0: if_start running deferred for Giant
 sbp0: <SBP-2/SCSI over FireWire> on firewire0
 fwohci0: Initiate bus reset
 fwohci0: node_id=0xc800ffc0, gen=1, CYCLEMASTER mode
 firewire0: 1 nodes, maxhop <= 0, cable IRM = 0 (me)
 firewire0: bus manager 0 (me)
 isab0: <PCI-ISA bridge> at device 7.0 on pci0
 isa0: <ISA bus> on isab0
 atapci1: <AMD 8111 UDMA133 controller> port 0xffa0-0xffaf,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 7.1 on pci0
 ata0: channel #0 on atapci1
 ata1: channel #1 on atapci1
 pci0: <serial bus, SMBus> at device 7.2 (no driver attached)
 pci0: <bridge, PCI-unknown> at device 7.3 (no driver attached)
 pci0: <multimedia, audio> at device 7.5 (no driver attached)
 pcib2: <ACPI PCI-PCI bridge> at device 10.0 on pci0
 pci2: <ACPI PCI bus> on pcib2
 amr0: <LSILogic MegaRAID> mem 0xe69f0000-0xe69fffff irq 26 at device 7.0 on pci2
 amr0: [GIANT-LOCKED]
 amr0: <LSILogic MegaRAID SATA 150-6D> Firmware 712T, BIOS G116, 64MB RAM
 bge0: <Broadcom BCM5703 Gigabit Ethernet, ASIC rev. 0x1002> mem 0xff1e0000-0xff1effff irq 24 at device 9.0 on pci2
 miibus0: <MII bus> on bge0
 brgphy0: <BCM5703 10/100/1000baseTX PHY> on miibus0
 brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
 bge0: Ethernet address: 00:e0:81:28:21:04
 pci0: <base peripheral, interrupt controller> at device 10.1 (no driver attached)
 pcib3: <ACPI PCI-PCI bridge> at device 11.0 on pci0
 pci1: <ACPI PCI bus> on pcib3
 pci0: <base peripheral, interrupt controller> at device 11.1 (no driver attached)
 pcib4: <ACPI Host-PCI bridge> on acpi0
 pcib4: could not get PCI interrupt routing table for \\_SB_.PCIB - AE_NOT_FOUND
 pci4: <ACPI PCI bus> on pcib4
 pcib5: <ACPI PCI-PCI bridge> at device 1.0 on pci4
 pci5: <ACPI PCI bus> on pcib5
 pci5: <display, VGA> at device 0.0 (no driver attached)
 acpi_button0: <Power Button> on acpi0
 atkbdc0: <Keyboard controller (i8042)> port 0x64,0x60 irq 1 on acpi0
 atkbd0: <AT Keyboard> flags 0x1 irq 1 on atkbdc0
 kbd0 at atkbd0
 atkbd0: [GIANT-LOCKED]
 sio0: configured irq 4 not in bitmap of probed irqs 0
 sio0: port may not be enabled
 sio0 port 0x3f8-0x3ff irq 4 on acpi0
 sio0: type 16550A
 sio1: configured irq 3 not in bitmap of probed irqs 0
 sio1: port may not be enabled
 sio1 port 0x2f8-0x2ff irq 3 on acpi0
 sio1: type 16550A
 fdc0: <floppy drive controller (FDE)> port 0x3f7,0x3f0-0x3f5 irq 6 drq 2 on acpi0
 fd0: <1440-KB 3.5" drive> on fdc0 drive 0
 ppc0 port 0x778-0x77f,0x378-0x37f irq 7 drq 3 on acpi0
 ppc0: SMC-like chipset (ECP/EPP/PS2/NIBBLE) in COMPATIBLE mode
 ppc0: FIFO with 16/16/9 bytes threshold
 ppbus0: <Parallel port bus> on ppc0
 lpt0: <Printer> on ppbus0
 lpt0: Interrupt-driven port
 orm0: <ISA Option ROMs> at iomem 0xcd000-0xcd7ff,0xc8800-0xccfff,0xc0000-0xc7fff on isa0
 sc0: <System console> at flags 0x100 on isa0
 sc0: VGA <16 virtual consoles, flags=0x300>
 vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
 Timecounter "TSC" frequency 1793021908 Hz quality 800
 Timecounters tick every 0.976 msec
 acpi_cpu: throttling enabled, 8 steps (100% to 12.5%), currently 100.0%
 ATAPI_RESET time = 60us
 acd0: CDROM <FX320S/q01> at ata0-master PIO4
 ATAPI_RESET time = 230us
 ata1-slave: FAILURE - SETFEATURES SET TRANSFER MODE status=1<ERROR> error=4<ABORTED>
 afd0: REMOVABLE <IOMEGA ZIP 100 ATAPI/12.A> at ata1-slave BIOSPIO
 ad6: 190782MB <ST3200822AS/3.01> [387621/16/63] at ata3-master SATA150
 Waiting 15 seconds for SCSI devices to settle
 amrd0: <LSILogic MegaRAID logical drive> on amr0
 amrd0: 953885MB (1953556480 sectors) RAID 5 (optimal)
 sa0 at ahc0 bus 0 target 6 lun 0
 sa0: <ARCHIVE Python 06408-XXX 8130> Removable Sequential Access SCSI-3 device 
 sa0: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
 Mounting root from ufs:/dev/ad6s1a
 bge0: gigabit link up
 ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=148597695
 
 --Boundary-00=_cUyTBQPUt18gydP--

From: Wayne Cox <wmc20@xxiii.com>
To: freebsd-gnats-submit@FreeBSD.org, wc_fbsd@xxiii.com
Cc:  
Subject: Re: kern/70379: System hangs under heavy disk IO with SiI 3112
  SATA150 controller and Western Digital drive
Date: Fri, 05 Nov 2004 15:03:01 -0500

 Re: kern/70379
 
 I initially submitted this bug report back in Aug '04, and it looked like 
 the developers had made some repairs.
 
 I've been patiently waiting on 5.3-Stable, hoping the fixes would be 
 included, and this drive / controller combo would be usable.  Finally went 
 ahead with update to 5.3-RC2, and found the problem still exists.
 
 Just ordered a Promise SATA controller, and will be switching to 
 that.  Given the problems reported with the SiI SATA controllers, they 
 don't seem to be worth screwing with.  Looks like "sos" on the 
 contributions page is looking for SATA hardware;  maybe I'll check if 
 anyone is interested in the P.O.S. Adaptec 1210.
 
    -Wayne
 
>Unformatted:
