From mteterin@pandora.us.murex.com  Fri Oct  8 16:56:07 2004
Return-Path: <mteterin@pandora.us.murex.com>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id C38AE16A4CF
	for <FreeBSD-gnats-submit@freebsd.org>; Fri,  8 Oct 2004 16:56:07 +0000 (GMT)
Received: from harik.murex.com (mail.murex.com [194.98.239.11])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1F21B43D54
	for <FreeBSD-gnats-submit@freebsd.org>; Fri,  8 Oct 2004 16:56:07 +0000 (GMT)
	(envelope-from mteterin@pandora.us.murex.com)
Message-Id: <200410081650.i98Go6O5015987@harik.murex.com>
Date: Fri, 8 Oct 2004 12:55:36 -0400 (EDT)
From: Mikhail Teterin <mi@aldan.algebra.com>
To: FreeBSD-gnats-submit@freebsd.org
Subject: Continuing problems with Silicon Image SATA controllers
X-Send-Pr-Version: 3.113
X-GNATS-Notify: sos@FreeBSD.org

>Number:         72451
>Category:       kern
>Synopsis:       Continuing problems with Silicon Image SATA controllers
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    sos
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Oct 08 17:00:51 GMT 2004
>Closed-Date:    Mon Apr 11 11:22:14 GMT 2005
>Last-Modified:  Mon Apr 11 18:40:24 GMT 2005
>Originator:     Mikhail Teterin
>Release:        FreeBSD 5.3-BETA5 amd64
>Organization:
Virtual Estates, Inc.
>Environment:
System: FreeBSD pandora 5.3-BETA5 FreeBSD 5.3-BETA5 #4: Mon Sep 20 16:45:55 EDT 2004 mteterin@pandora:/backup/obj/usr/src/sys/DIOSCURI amd64

Relevant dmesg.boot entries:

	atapci0: <SiI 3114 SATA150 controller> port 0x9c00-0x9c0f,0xa000-0xa003,0xa400-0xa407,0xa800-0xa803,0xac00-0xac07 mem 0xff3ff400-0xff3ff7ff irq 17 at device 11.0 on pci3
	ad6: 190782MB <ST3200822AS/3.01> [387621/16/63] at ata3-master SATA150

Ident information from the running kernel:

     $FreeBSD: src/sys/dev/ata/ata-all.c,v 1.227 2004/09/16 09:35:01 sos Exp $
     $FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.34 2004/08/27 14:48:32 sos Exp $
     $FreeBSD: src/sys/dev/ata/ata-lowlevel.c,v 1.47 2004/09/03 12:10:44 sos Exp $
     $FreeBSD: src/sys/dev/ata/ata-isa.c,v 1.22 2004/04/30 16:21:34 sos Exp $
     $FreeBSD: src/sys/dev/ata/ata-pci.c,v 1.88 2004/08/20 06:19:25 sos Exp $
     $FreeBSD: src/sys/dev/ata/ata-chipset.c,v 1.88 2004/09/10 10:31:37 sos Exp $
     $FreeBSD: src/sys/dev/ata/ata-dma.c,v 1.131 2004/09/10 10:31:37 sos Exp $
     $FreeBSD: src/sys/dev/ata/ata-disk.c,v 1.177 2004/09/01 12:15:44 sos Exp $
     $FreeBSD: src/sys/dev/ata/atapi-cd.c,v 1.171 2004/08/24 10:39:00 sos Exp $
     $FreeBSD: src/sys/dev/ata/atapi-fd.c,v 1.97 2004/08/05 21:11:33 sos Exp $

>Description:
	Under _combined_ disk and CPU load, the following errors start
	popping up:

ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=53404031
ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=54910687
ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=56806527
ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=61715903
ad6: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=62103999
ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=176444927
ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=311594591
ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=196040671
ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=306623743
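
	For reference, the status= and error= values in these messages
	are hexadecimal register dumps. A minimal sketch of decoding
	them, assuming the usual ATA status/error register bit layout;
	only a few bits are listed, and the table and function names
	below are made up for illustration:

```python
import re

# Subset of the ATA status and error register bits (assumed layout).
STATUS_BITS = {0x80: "BUSY", 0x40: "READY", 0x10: "DSC", 0x08: "DRQ", 0x01: "ERROR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNCORRECTABLE", 0x10: "ID_NOT_FOUND", 0x04: "ABORTED"}

# Matches the tail of a FAILURE line as printed in this report.
FAILURE_RE = re.compile(r"status=([0-9a-f]+)<[^>]*> error=([0-9a-f]+)<[^>]*> LBA=(\d+)")


def decode(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


def parse_failure(line):
    """Extract decoded status, decoded error and LBA from one FAILURE line."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    return {
        "status": decode(int(m.group(1), 16), STATUS_BITS),
        "error": decode(int(m.group(2), 16), ERROR_BITS),
        "lba": int(m.group(3)),
    }


line = "ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=53404031"
print(parse_failure(line))
```

	0x51 is 0x40 + 0x10 + 0x01, which is why the driver prints
	READY, DSC and ERROR together.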

	After a while, all disk I/O starts hanging and even a graceful
	reboot becomes impossible -- the machine hangs after saying:
	"some processes would not die..."

	We replaced the disk and the cables twice already.

	Under just the disk load, the problem does not appear -- the
	box survives a full run of `iozone -a' without a hitch, for
	example.

	But when we, for example, dump databases on it (over NFS) and,
	at the same time, gzip the dump for archiving, we see this.

	Or, when a big file is being uploaded with scp over a fast link
	with ssh compression. So it looks like something inside the
	ata driver is not attended to fast enough...

>How-To-Repeat:

	Run `iozone -a' on a disk, while gzip-ing a big file off of
	the same drive.

>Fix:
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->sos 
Responsible-Changed-By: simon 
Responsible-Changed-When: Fri Oct 8 17:24:57 GMT 2004 
Responsible-Changed-Why:  
Over to sos for evaluation. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72451 
Responsible-Changed-From-To: sos->feedback 
Responsible-Changed-By: sos 
Responsible-Changed-When: Mon Oct 11 11:50:31 GMT 2004 
Responsible-Changed-Why:  
There have been quite a few changes since beta5, please update to the 
latest releng5 or at least beta7 (-current would be even better) and 
get back to me with what that brings. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72451 
State-Changed-From-To: open->feedback 
State-Changed-By: sos 
State-Changed-When: Mon Oct 11 11:53:58 GMT 2004 
State-Changed-Why:  
Hmm seems vi was off by 1 line :) 


Responsible-Changed-From-To: feedback->sos 
Responsible-Changed-By: sos 
Responsible-Changed-When: Mon Oct 11 11:53:58 GMT 2004 
Responsible-Changed-Why:  


http://www.freebsd.org/cgi/query-pr.cgi?pr=72451 

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org
Cc:  
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Mon, 11 Oct 2004 15:27:33 -0400

 Well, so far, I can't even rebuild the world -- on a freshly rebooted
 system :-( According to top, two C-compilers and the cap_mk are stuck
 in the `nbufkv' state and the CPU is 100% idle.
 
 I'll reboot again and try with NFS-mounted /usr/obj ...
 
  -mi
 

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org
Cc: sos@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Mon, 11 Oct 2004 18:44:13 -0400

 I used the NFS-mounted /usr/obj and noticed the same WRITE_DMA errors
 on the NFS-server itself.
 
 [...]
 Oct 11 17:48:16 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=64822936
 Oct 11 17:49:58 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=65516532
 Oct 11 17:54:53 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=65524832
 Oct 11 17:59:19 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=65012992
 Oct 11 18:03:18 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=67395724
 Oct 11 18:03:25 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66939680
 Oct 11 18:03:58 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=67406280
 Oct 11 18:05:45 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=67715240
 Oct 11 18:33:06 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66682804
 Oct 11 18:33:52 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=67056960
 Oct 11 18:37:13 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66563776
 Oct 11 18:38:02 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66713420
 Oct 11 18:38:11 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66717516
 Oct 11 18:38:18 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66563904
 Oct 11 18:38:30 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66717824
 Oct 11 18:38:41 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66678432
 Oct 11 18:38:56 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66563936
 Oct 11 18:39:06 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66563936
 Oct 11 18:39:19 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66716824
 Oct 11 18:39:25 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66716656
 Oct 11 18:39:33 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66563968
 Oct 11 18:39:40 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66718804
 Oct 11 18:39:46 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66722976
 Oct 11 18:39:55 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66563968
 Oct 11 18:40:11 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66725152
 Oct 11 18:40:18 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66564000
 Oct 11 18:40:30 mi kernel: ad4: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=66721364
 [...]
 
 The server is running:
 
 FreeBSD mi 6.0-CURRENT FreeBSD 6.0-CURRENT #1: Tue Oct  5 17:56:47 EDT 2004 root@mi:/var/obj/usr/src/sys/Gigabyte i386
 
 with the following versions of ATA-files in the kernel:
 
      $FreeBSD: src/sys/dev/ata/ata-all.c,v 1.228 2004/09/26 11:48:43 sos Exp $
      $FreeBSD: src/sys/dev/ata/ata-queue.c,v 1.35 2004/09/26 11:48:43 sos Exp $
      $FreeBSD: src/sys/dev/ata/ata-lowlevel.c,v 1.48 2004/09/26 11:48:43 sos Exp $
      $FreeBSD: src/sys/dev/ata/ata-isa.c,v 1.22 2004/04/30 16:21:34 sos Exp $
      $FreeBSD: src/sys/dev/ata/ata-pci.c,v 1.89 2004/09/26 11:42:42 sos Exp $
      $FreeBSD: src/sys/dev/ata/ata-chipset.c,v 1.90 2004/10/01 09:06:22 sos Exp $
      $FreeBSD: src/sys/dev/ata/ata-dma.c,v 1.131 2004/09/10 10:31:37 sos Exp $
      $FreeBSD: src/sys/dev/ata/ata-disk.c,v 1.179 2004/09/30 20:54:59 sos Exp $
      $FreeBSD: src/sys/dev/ata/atapi-cd.c,v 1.171 2004/08/24 10:39:00 sos Exp $
      $FreeBSD: src/sys/dev/ata/atapi-fd.c,v 1.97 2004/08/05 21:11:33 sos Exp $
 
 and the Silicon Image-3112A on-board controller with the new "Raptor"
 10K RPM drive:
 
 atapci1: <SiI 3112 SATA150 controller> port 0xc400-0xc40f,0xc000-0xc003,0xbc00-0xbc07,0xb800-0xb803,0xb400-0xb407 mem 0xf7025000-0xf70251ff irq 17 at device 16.0 on pci0
 ad4: 35304MB <WDC WD360GD-00FNA0/35.06K35> [71730/16/63] at ata2-master SATA150
 
 So, the problem was still here in Oct 5 -current and is not limited
 to amd64 :-(
 
 The NFS-server is not hanging, fortunately, just stumbles for a
 while and recovers. The hangs originally reported are, probably,
 due to a different issue (see nbufkv-thread on -current), but the
 SATA errors are still a problem.
 
 Yours,
 
  -mi
 

From: "Mikhail T." <mi@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org
Cc: sos@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Mon, 11 Oct 2004 23:36:30 -0400

 Ok, the machine rebooted with today's world and kernel and -- under the
 same kind of load -- shows the same behaviour: lots of messages like:
 
 'ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=...'
 
 the reported LBA is always different, and the errors are from 2 seconds
 to a few minutes apart from each other.
 
  -mi
 

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org
Cc: sos@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Wed, 13 Oct 2004 18:00:33 -0400

 Just noticed fresh changes in dev/ata and rebuilt the kernel.
 
 Under heavy load the machine started reporting A LOT of WRITE_DMA errors
 and hung after a few minutes:
 
 [...]
 Oct 13 17:22:55 pandora kernel: ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=123751359
 Oct 13 17:22:56 pandora kernel: ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=34785759
 Oct 13 17:22:59 pandora kernel: ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=130147711
 Oct 13 17:23:03 pandora kernel: ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=109841727
 Oct 13 17:23:03 pandora kernel: ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=110594751
 Oct 13 17:23:07 pandora kernel: ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=137684767
 Oct 13 17:23:10 pandora kernel: ad6: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=80481919
 [...]
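 
 The spacing between these errors can be pulled out of the syslog
 timestamps. A minimal sketch, assuming one message per line and
 supplying the year (2004) by hand, since syslog does not record it;
 the function name is made up for illustration:

```python
import re
from datetime import datetime

# Matches the timestamp at the start of an ata FAILURE/TIMEOUT syslog line.
STAMP_RE = re.compile(
    r"^\s*(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: ad\d+: (?:FAILURE|TIMEOUT)"
)


def error_gaps(lines, year=2004):
    """Return the gaps, in seconds, between consecutive error messages."""
    times = []
    for line in lines:
        m = STAMP_RE.match(line)
        if m:  # wrapped continuation lines and other text are skipped
            times.append(datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


TAIL = " status=51<READY,DSC,ERROR> error=4<ABORTED> LBA="
SAMPLE = [
    "Oct 13 17:22:55 pandora kernel: ad6: FAILURE - WRITE_DMA" + TAIL + "123751359",
    "Oct 13 17:22:56 pandora kernel: ad6: FAILURE - WRITE_DMA" + TAIL + "34785759",
    "Oct 13 17:22:59 pandora kernel: ad6: FAILURE - WRITE_DMA" + TAIL + "130147711",
    "Oct 13 17:23:03 pandora kernel: ad6: FAILURE - WRITE_DMA" + TAIL + "109841727",
    "Oct 13 17:23:03 pandora kernel: ad6: FAILURE - WRITE_DMA" + TAIL + "110594751",
    "Oct 13 17:23:07 pandora kernel: ad6: FAILURE - WRITE_DMA" + TAIL + "137684767",
    "Oct 13 17:23:10 pandora kernel: ad6: FAILURE - WRITE_DMA" + TAIL + "80481919",
]
print(error_gaps(SAMPLE))
```

 For the seven messages above the gaps are all a few seconds or less.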
 
 At the time of the hang I had `systat 1 -vm' running -- the screen
 froze with 10MB/s going to ad6...
 
 Maybe it is possible to force the disk into a slower mode -- like SATA66 or
 something? The most we ever saw sustained on it was 44MB/s read and 21MB/s
 write anyway... Thanks,
 
 	-mi
 
From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org, sos@freebsd.org
Cc: re@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Tue, 19 Oct 2004 17:15:19 -0400

 I rebuilt the kernel from today's -current sources and this time included
 all the WITNESS and INVARIANT options. The machine duly hung as before,
 but there were no messages from WITNESS nor INVARIANT -- only the WRITE_DMA
 warnings.
 
 The `systat 1 -vm' froze at:
 
     3 users    Load  1.08  0.83  0.66                  19  16:25
 
 Mem:KB    REAL            VIRTUAL                     VN PAGER  SWAP PAGER
         Tot   Share      Tot    Share    Free         in  out     in  out
 Act   27256    5308    65880    13608  405040 count
 All 1700040    7616  2959492    17720         pages
                                                           zfod   Interrupts
 Proc:r  p  d  s  w    Csw  Trp  Sys  Int  Sof  Flt        cow    5310 total
            5 38     11789    6  168 8051    4      248200 wire        irq1: atkb
                                                     28512 act    1028 irq0: clk
 12.5%Sys   5.5%Intr  0.0%User  0.0%Nice 82.0%Idl  1336996 inact       irq6: fdc0
 |    |    |    |    |    |    |    |    |    |      89408 cache   128 irq8: rtc
 ======+++                                          315632 free    722 irq9: acpi
                                                           daefr       irq14: ata
 Namei         Name-cache    Dir-cache                     prcfr       irq15: ata
     Calls     hits    %     hits    %                     react  1352 irq16: ahc
                                                           pdwak   722 irq17: pcm
                                                           pdpgs       irq19: fwo
 Disks  afd0   ad6 amrd0   sa0 pass0                       intrn  1358 irq24: bge
 KB/t   0.00 16.91  0.00  0.00  0.00                218880 buf         irq26: amr
 tps       0   720     0     0     0                  1365 dirtybuf
 MB/s   0.00 11.88  0.00  0.00  0.00                100000 desiredvnodes
 % busy    0   100     0     0     0                   849 numvnodes
 Showing vmstat, refresh every 1 seconds.              514
 
 The `systat 1 -if' froze as:
 
                     /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
      Load Average   |||||   
 
       Interface           Traffic               Peak                Total
 
 
             lo0  in      0.000 KB/s          3.197 KB/s          213.939 KB
                  out     0.000 KB/s          3.197 KB/s          213.939 KB
 
            bge0  in      4.938 MB/s         11.728 MB/s           10.906 GB
                  out   185.929 KB/s        450.869 KB/s          415.249 MB
 
 
 It would seem to me that this is a show-stopper for the 5.3 release -- amd64 is a
 Tier1 platform, yet it can't write to the disk for long on popular hardware.
 
 Yours,
 
  -mi

From: Scott Long <scottl@freebsd.org>
To: Mikhail Teterin <mi+mx@aldan.algebra.com>
Cc: freebsd-gnats-submit@freebsd.org, sos@freebsd.org, re@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Tue, 19 Oct 2004 15:40:04 -0600

[...]
 
 How much RAM is in this system?
 
 Scott

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: Scott Long <scottl@freebsd.org>,
	freebsd-gnats-submit@freebsd.org, sos@freebsd.org
Cc: re@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Tue, 19 Oct 2004 18:06:48 -0400

 =How much RAM is in this system?
 
 2Gb. Single Opteron in a dual-capable Tyan K8W motherboard.
 
  -mi

From: Scott Long <scottl@freebsd.org>
To: Mikhail Teterin <mi+mx@aldan.algebra.com>
Cc: freebsd-gnats-submit@freebsd.org, sos@freebsd.org, re@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Tue, 19 Oct 2004 16:20:09 -0600

 Mikhail Teterin wrote:
 > =How much RAM is in this system?
 > 
 > 2Gb. Single Opteron in a dual-capable Tyan K8W motherboard.
 > 
 >  -mi
 
 Ok, I can't think of any obvious causes.  The 5.3 BETAs have been built
 on a dual Opteron with 2GB and SATA disks without any problems.
 
 Scott

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: Scott Long <scottl@freebsd.org>
Cc: freebsd-gnats-submit@freebsd.org, sos@freebsd.org, re@freebsd.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Tue, 19 Oct 2004 18:37:54 -0400

 =Mikhail Teterin wrote:
 => =How much RAM is in this system?
 
 => 2Gb. Single Opteron in a dual-capable Tyan K8W motherboard.
 
 =Ok, I can't think of any obvious causes.  The 5.3 BETAs have been built
 =on a dual Opteron with 2GB and SATA disks without any problems.
 
 Were these disks attached to a Silicon Image controller? Also, my problem 
 seems to be somehow related to network traffic. The way we cause the 
 system to hang is by telling a remote Sybase server to dump its databases 
 onto it (over NFS) one at a time. It does not have to be an NFS server -- when I 
 tried to restore(8) a filesystem on this SATA drive from the dump stored 
 on another machine, the machine hung the same way.
 
 I just rebuilt the kernel again with NET_WITH_GIANT (to set mpsafenet 
 to 0) and it looks much better -- about an hour into a process that used 
 to hang it within 25 minutes or so. If it survives the whole dump by 
 tomorrow, then -- I guess -- there is some interaction between the network 
 and ata. I'm convinced ata is somehow involved, because when we write onto 
 our raid array (amrd0) on the same machine, there are no hangs.
 
 Could it be that ata is generally safe, but the recovery from 
 WRITE_DMA/READ_DMA failures is not?
 
  -mi

From: Mikhail Teterin <mi@corbulon.video-collage.com>
To: FreeBSD-gnats-submit@FreeBSD.org
Cc: sos@FreeBSD.org
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Fri, 22 Oct 2004 14:05:27 -0400 (EDT)

 After putting my new workstation into heavier use, I'm seeing these
 same problems on i386. This machine (current from Oct 20) has two
 Raptor disks connected to the two on-board connectors of Silicon Image
 3112A controller.
 
 There are occasional WRITE_DMA errors from these drives. Once in a
 while, the machine locks solid after one of such errors.
 
 A good way of reproducing seems to be to run cvsup mirroring the
 entire CVS-repository to one of the local drives, while cvs is
 extracting the src tree from the local repository onto another drive.
 This may succeed, but may also lead to either a hang or a ufs-panic.
 
 I plan to change the status of this PR back to 'open' -- if you need
 more 'feedback', just ask. Thanks!
 
 	-mi

From: Søren Schmidt <sos@DeepCore.dk>
To: Mikhail Teterin <mi@corbulon.video-collage.com>
Cc: FreeBSD-gnats-submit@FreeBSD.ORG, sos@FreeBSD.ORG
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Fri, 22 Oct 2004 20:07:53 +0200

 Mikhail Teterin wrote:
 > After putting my new workstation into heavier use, I'm seeing these
 > same problems on i386. This machine (current from Oct 20) has two
 > Raptor disks connected to the two on-board connectors of Silicon Image
 > 3112A controller.
 >
 > There are occasional WRITE_DMA errors from these drives. Once in a
 > while, the machine locks solid after one of such errors.
 >
 > A good way of reproducing seems to be to run cvsup mirroring the
 > entire CVS-repository to one of the local drives, while cvs is
 > extracting the src tree from the local repository onto another drive.
 > This may succeed, but may also lead to either a hang or a ufs-panic.
 >
 > I plan to change the status of this PR back to 'open' -- if you need
 > more 'feedback', just ask. Thanks!
 
 Well, I can beat on mine (sii3112 and 70G raptors) for days without a
 hiccup. What's the motherboard it's sitting in?
 
 
 -- 
 
 -Søren
 
 

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org
Cc:  
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Mon, 25 Oct 2004 12:05:49 -0400

 =Well, I can beat on mine (sii3112 and 70G raptors) for days without a hiccup. 
 =What's the motherboard it's sitting in?
 
 I see this in two computers now:
 
  i386: Pentium4 @3.06GHz in Gigabyte's GA-SINXP1394, SiI-3112A controller(s),
  amd64: single Opteron @1.8GHz in a dual-capable Tyan K8W, SiI-3114
 
 Directing a Sybase server (on Solaris) to dump its databases onto FreeBSD 
 disks will wedge the systems reliably, although not immediately. Full dump of 
 all databases here takes about 15 hours (we estimate). The amd64 machine 
 running 5.3-stable hangs about 6 hours into it. (Under recent -current it 
 tends to panic much earlier.)
 
 The FreeBSD machines both have gigabit cards (bge on opteron, em on i386), the 
 NFS client -- Sybase -- has only a 100Mb card, but manages to sustain above 
 11MB/s writes anyway. After some time the FreeBSD's disks begin to glitch. At 
 a certain point, the machines hang :-(
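 
 As a sanity check on those numbers, reading the card speed as
 100 megabits per second and the write rate as 11 megabytes per second
 (an assumption about the units), the client is running close to its
 line rate. A small sketch that ignores protocol overhead; the names
 are made up for illustration:

```python
def line_rate_mbytes_per_s(megabits_per_s):
    """Convert a link speed in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_s / 8.0


ceiling = line_rate_mbytes_per_s(100)  # the 100Mb card on the NFS client
observed = 11.0                        # sustained writes, assumed to be MB/s
utilisation = observed / ceiling

print(f"ceiling {ceiling} MB/s, utilisation {utilisation:.0%}")
```

 So the FreeBSD side is being fed a steady stream at nearly the most
 a 100Mb link can deliver.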
 
 Some hangs seem to have occurred during rwhod's updates (in parallel to the 
 Sybase writes), for example -- /var is a different partition from the one 
 being pounded, but resides on the same disk.
 
 Maybe the difference is in the source of "pounding" -- network (rwhod, cvsup 
 or NFS), rather than local writing/reading? 
 
  -mi
State-Changed-From-To: feedback->open 
State-Changed-By: mi 
State-Changed-When: Tue Oct 26 17:13:13 GMT 2004 
State-Changed-Why:  
Fulfill the earlier threat :-) 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72451 

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org
Cc:  
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Tue, 26 Oct 2004 13:38:25 -0400

 Another day, another hang. Right now, the machine is up, but the disk seems 
 dead -- the running processes run (systat keeps me updated once per second), 
 but the paged out ones can't be paged back in (vm_pager complaining about 
 disk), and no new processes can start. The machine has been in this state for 
 about 11 hours now -- after 6 or 7 hours of hard work and a great many 
 WRITE_DMA failures. Can the driver, perhaps, slow the channel speed gradually 
 on errors -- from SATA150 down to 120, say -- either automatically, or 
 through a sysctl knob? Thanks!
 
 Someone in the Linux world has similar troubles too:
 
  http://www.ussg.iu.edu/hypermail/linux/kernel/0407.2/0127.html
 
 SiI's own IDE-driver for Linux can be obtained here:
 
  http://12.24.47.40/display/2n/kb/article.asp?aid=10485&s=1
 
 Perhaps someone with knowledge of ATA (hint-hint) can look at the two files 
 (siimage.c and siimage.h) to see immediately what kind of a work-around is 
 needed for SiI to work reliably? Thanks!
 
  -mi

From: Neil Hoggarth <neil.hoggarth@physiol.ox.ac.uk>
To: freebsd-gnats-submit@FreeBSD.org, mi@aldan.algebra.com
Cc:  
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Wed, 10 Nov 2004 15:21:18 +0000 (GMT)

 The Linux folk seem to have come to the conclusion that SiL controllers
 issue unusual (but SATA spec compliant) writes to the drive, if the host
 sends more than 15 LBAs of data to the controller in a single DMA.
 
 These unusual SATA transfers expose problems with the firmware in some
 drives, including a number of Seagate models.
 
 The Linux sata_sil driver now has a quirks mode, which prevents writes
 larger than 15 blocks if the attached drive is on a blacklist. This
 workaround apparently gives stability at the cost of performance. The
 only other suggested fix seems to be "don't use a blacklisted drive and
 an SiL controller". :-(
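 
 The quirk described above amounts to capping each write at 15 blocks
 when the drive model is blacklisted. A minimal sketch of that idea,
 not the actual sata_sil code; the model string in the blacklist is a
 placeholder and the function name is made up for illustration:

```python
MAX_QUIRK_BLOCKS = 15

# Placeholder entry; the real Linux blacklist is not reproduced here.
HYPOTHETICAL_BLACKLIST = {"EXAMPLE-MODEL-1"}


def split_write(lba, count, model, blacklist=HYPOTHETICAL_BLACKLIST):
    """Split a write of `count` blocks at `lba` into (lba, count) requests.

    Blacklisted drives get requests of at most 15 blocks; other drives
    get the write as a single request.
    """
    limit = MAX_QUIRK_BLOCKS if model in blacklist else count
    chunks = []
    while count > 0:
        n = min(count, limit)
        chunks.append((lba, n))
        lba += n
        count -= n
    return chunks


print(split_write(1000, 40, "EXAMPLE-MODEL-1"))
print(split_write(1000, 40, "SOME-OTHER-MODEL"))
```

 A 40-block write becomes three requests on a blacklisted drive, which
 is where the performance cost comes from.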
 
 Here is one thread about this in the linux-kernel mailing list archives:
 
   http://lkml.org/lkml/2004/8/12/8
 
 Regards,
 -- 
 Neil Hoggarth                                Departmental Computing Manager
 <neil.hoggarth@physiol.ox.ac.uk>                   Laboratory of Physiology
 http://www.physiol.ox.ac.uk/~njh/                  University of Oxford, UK

From: Mikhail Teterin <Mikhail.Teterin@murex.com>
To: freebsd-gnats-submit@freebsd.org
Cc: sos@freebsd.org, neil.hoggarth@physiol.ox.ac.uk
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Thu, 11 Nov 2004 19:00:21 -0500

 I'm seeing these problems with different drives on different machines (i386, 
 amd64). I suspected the problem might have something to do with the disk 
 overheating, so we tried shutting the machine down for a day.
 
 Upon fresh start, the log is already full of errors -- several per minute.
 
 I tried using ataidle (from /usr/ports/sysutils/ataidle) to alter the drive's 
 power consumption and/or acoustic level, but all attempts to do so got
 
 ad6: FAILURE - SETFEATURES 0x05 status=51<READY,DSC,ERROR> error=4<ABORTED>
 
 ad6: FAILURE - SETFEATURES 0x42 status=51<READY,DSC,ERROR> error=4<ABORTED>
 
 for all possible power and acoustic levels. Are these features supposed to 
 work for SATA drives?
 
 The worst part, however, is what happens when the driver gives up (FAILURE): 
 programs may die (segfaulting vm_pager) or the OS can panic -- in filesystem 
 code.
 
 Maybe the driver should try to _really_ restart the drive -- by forcing it 
 to spin down, wait, and spin up again? If NFS client code is willing to wait 
 forever for the remote server to come back up, why can't ata?
 
 Yours,
 
  -mi
State-Changed-From-To: open->closed 
State-Changed-By: sos 
State-Changed-When: Mon Apr 11 11:18:15 GMT 2005 
State-Changed-Why:  
There is no point in spinning the drive down and then up; it doesn't accomplish 
anything but drive bearing wear. The drive electronics need to be reset, and 
that has been done in ATA for ages. 
You should try the latest -current and check what that brings. I'm using 
a sii3112 and problematic drives in a server here just to have real-life 
testing, and that works, modulo the eventual timeouts that are expected. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72451 

From: Mikhail Teterin <mi+mx@aldan.algebra.com>
To: freebsd-gnats-submit@freebsd.org, mi@aldan.algebra.com
Cc:  
Subject: Re: kern/72451: Continuing problems with Silicon Image SATA controllers
Date: Mon, 11 Apr 2005 14:30:44 -0400

 > modulo the eventual timeouts thats expected.
 
 I am afraid you misunderstood my suggestion. I think the driver should not 
 give up in case of a timeout. Some places in the kernel do not expect such a 
 failure -- the pager may panic if it cannot write a page to swap, and file systems 
 may become corrupted if they cannot flush cached data.
 
 If the NFS client can (and by default will) wait forever for the server to come 
 back, the ATA code should wait forever for the write to succeed.
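 
 The difference between the two policies can be shown in a few lines.
 This is only an illustration of the suggestion, not how the ata
 driver is written: the default of 2 retries mirrors the "2 retries
 left" messages, and all names here are made up:

```python
import itertools
import time


def write_with_retries(do_write, max_retries=2, delay=0.0):
    """Call do_write() until it returns True.

    max_retries counts retries after the first attempt; None means
    never give up. Returns the number of attempts that were needed.
    """
    for attempt in itertools.count(1):
        if do_write():
            return attempt
        if max_retries is not None and attempt > max_retries:
            raise OSError(f"write failed after {attempt} attempts")
        time.sleep(delay)


def make_flaky(failures):
    """Hypothetical stand-in for a disk write that fails `failures` times."""
    left = [failures]

    def do_write():
        if left[0] > 0:
            left[0] -= 1
            return False
        return True

    return do_write


# Waiting forever rides out five consecutive failures.
print(write_with_retries(make_flaky(5), max_retries=None))
```

 With the bounded policy the same five failures surface as an error to
 the caller, which is the case the pager and the file systems do not
 handle well.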
 
 It is difficult for me to try the new code, because the original motherboard 
 where we saw the problem has long been replaced, and the other system no 
 longer exhibits the problem since I tinkered with PCI timings a little bit.
 
 	-mi
>Unformatted:
