From nobody@FreeBSD.org  Sat Jun  4 17:44:59 2011
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 850B51065675
	for <freebsd-gnats-submit@FreeBSD.org>; Sat,  4 Jun 2011 17:44:59 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from red.freebsd.org (red.freebsd.org [IPv6:2001:4f8:fff6::22])
	by mx1.freebsd.org (Postfix) with ESMTP id 74A098FC0C
	for <freebsd-gnats-submit@FreeBSD.org>; Sat,  4 Jun 2011 17:44:59 +0000 (UTC)
Received: from red.freebsd.org (localhost [127.0.0.1])
	by red.freebsd.org (8.14.4/8.14.4) with ESMTP id p54HixfM062182
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 4 Jun 2011 17:44:59 GMT
	(envelope-from nobody@red.freebsd.org)
Received: (from nobody@localhost)
	by red.freebsd.org (8.14.4/8.14.4/Submit) id p54Hixjn062181;
	Sat, 4 Jun 2011 17:44:59 GMT
	(envelope-from nobody)
Message-Id: <201106041744.p54Hixjn062181@red.freebsd.org>
Date: Sat, 4 Jun 2011 17:44:59 GMT
From: Petteri Valkonen <petteri.valkonen@iki.fi>
To: freebsd-gnats-submit@FreeBSD.org
Subject: AHCI device timeouts with ATI IXP700 SATA controller on high IO load
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         157615
>Category:       kern
>Synopsis:       AHCI device timeouts with ATI IXP700 SATA controller on high IO load
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-amd64
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Jun 04 17:50:09 UTC 2011
>Closed-Date:    Fri Jul 08 11:43:18 UTC 2011
>Last-Modified:  Fri Jul 08 11:43:18 UTC 2011
>Originator:     Petteri Valkonen
>Release:        8.2-RELEASE
>Organization:
>Environment:
FreeBSD microserver 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>Description:
I'm running 8.2-RELEASE with the ahci(4) driver loaded at boot time on an HP ProLiant N36L. Four Samsung HD204UI drives, attached via an ATI IXP700 SATA controller, form a simple (striped) ZFS pool:

ahci0: <ATI IXP700 AHCI SATA controller> port 0xd000-0xd007,0xc000-0xc003,0xb000-0xb007,0xa000-0xa003,0x9000-0x900f mem 0xfe6ffc00-0xfe6fffff irq 19 at device 17.0 on pci0
ahci0: [ITHREAD]
ahci0: AHCI v1.20 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich0: [ITHREAD]
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich1: [ITHREAD]
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich2: [ITHREAD]
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich3: [ITHREAD]

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
ada3: <SAMSUNG HD204UI 1AQ10001> ATA-8 SATA 2.x device
ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
ada3: Command Queueing enabled
ada3: 1907729MB (3907029168 512 byte sectors: 16H 63S/T 16383C)

If I attempt to scrub the pool, at some point one of the disks will time out:

Jun  1 22:48:59 microserver kernel: ahcich1: Timeout on slot 1
Jun  1 22:48:59 microserver kernel: ahcich1: is 00000000 cs 000007f8 ss 000007fe rs 000007fe tfd 40 serr 00000000
Jun  1 22:48:59 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:49:45 microserver kernel: ahcich1: Timeout on slot 10
Jun  1 22:49:45 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun  1 22:49:45 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:50:31 microserver kernel: ahcich1: Timeout on slot 10
Jun  1 22:50:31 microserver kernel: ahcich1: is 00000000 cs 00000400 ss 00000000 rs 00000400 tfd 80 serr 00000000
Jun  1 22:50:31 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:50:31 microserver kernel: (ada1:ahcich1:0:0:0): lost device
Jun  1 22:51:34 microserver kernel: ahcich1: Timeout on slot 10
Jun  1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 000ffc00 ss 000ffc00 rs 000ffc00 tfd 80 serr 00000000
Jun  1 22:51:34 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:51:34 microserver kernel: ahcich1: Poll timeout on slot 19
Jun  1 22:51:34 microserver kernel: ahcich1: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000
Jun  1 22:52:36 microserver kernel: ahcich1: Timeout on slot 19
Jun  1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 1ff80000 ss 1ff80000 rs 1ff80000 tfd 80 serr 00000000
Jun  1 22:52:36 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:52:36 microserver kernel: ahcich1: Poll timeout on slot 28
Jun  1 22:52:36 microserver kernel: ahcich1: is 00000000 cs 10000000 ss 00000000 rs 10000000 tfd 80 serr 00000000
Jun  1 22:53:38 microserver kernel: ahcich1: Timeout on slot 28
Jun  1 22:53:38 microserver kernel: ahcich1: is 00000000 cs f000003f ss f000003f rs f000003f tfd 80 serr 00000000
Jun  1 22:53:38 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:53:38 microserver kernel: ahcich1: Poll timeout on slot 5
Jun  1 22:53:38 microserver kernel: ahcich1: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd 80 serr 00000000
Jun  1 22:54:41 microserver kernel: ahcich1: Timeout on slot 5
Jun  1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00007fe0 ss 00007fe0 rs 00007fe0 tfd 80 serr 00000000
Jun  1 22:54:41 microserver kernel: ahcich1: device is not ready (timeout 15000ms) tfd = 00000080
Jun  1 22:54:41 microserver kernel: ahcich1: Poll timeout on slot 14
Jun  1 22:54:41 microserver kernel: ahcich1: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 80 serr 00000000
Jun  1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=270336 size=8192 error=6
Jun  1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398327808 size=8192 error=6
Jun  1 22:54:41 microserver root: ZFS: vdev I/O failure, zpool=backup path=/dev/label/disk2 offset=2000398589952 size=8192 error=6
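
As a reading aid for the log above, here is a minimal Python sketch (not part of the driver or of any FreeBSD tool) that decodes one of the ahcich status lines. It assumes that cs, ss and rs are 32-bit masks with one bit per command slot (issued commands, active queued commands, and the driver's own record of running slots, respectively) and that tfd carries the ATA status byte in its low 8 bits, where 0x80 is BSY and 0x40 is DRDY. The names LINE_RE, slots and decode are made up for this illustration.

```python
import re

# Matches the register dump lines printed by the channel driver in this report.
# Field interpretation is an assumption (see the paragraph above).
LINE_RE = re.compile(
    r"(?P<chan>ahcich\d+): is (?P<is>[0-9a-f]{8}) cs (?P<cs>[0-9a-f]{8}) "
    r"ss (?P<ss>[0-9a-f]{8}) rs (?P<rs>[0-9a-f]{8}) "
    r"tfd (?P<tfd>[0-9a-f]+) serr (?P<serr>[0-9a-f]{8})"
)


def slots(mask):
    """Return the slot numbers (0-31) whose bits are set in mask."""
    return [bit for bit in range(32) if mask & (1 << bit)]


def decode(line):
    """Decode one status line; return None if the line has another format."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    tfd = int(m.group("tfd"), 16)
    return {
        "channel": m.group("chan"),
        "cs_slots": slots(int(m.group("cs"), 16)),
        "ss_slots": slots(int(m.group("ss"), 16)),
        "rs_slots": slots(int(m.group("rs"), 16)),
        "busy": bool(tfd & 0x80),   # ATA status BSY bit
        "ready": bool(tfd & 0x40),  # ATA status DRDY bit
    }


sample = (
    "Jun  1 22:48:59 microserver kernel: ahcich1: is 00000000 "
    "cs 000007f8 ss 000007fe rs 000007fe tfd 40 serr 00000000"
)
info = decode(sample)
print(info)
```

Under these assumptions, the first status line has slots 3 through 10 set in cs and slots 1 through 10 set in ss, with the device reporting ready and not busy. In the 22:49:45 entry, cs 00000400 decodes to slot 10, the same slot named in the preceding "Timeout on slot 10" message, and tfd 80 decodes to busy.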

The offending disk is then taken offline:

# camcontrol devlist
<SAMSUNG HD204UI 1AQ10001>         at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (ada3,pass3)
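
To spot the dropped disk without reading the list by eye, a small Python sketch along the following lines could compare the device list against the disks that are expected to be present. It only parses text in the format shown above. The list EXPECTED and the function names attached_devices and missing_disks are made up for this illustration, and the sample string is the output pasted above.

```python
import re

# The four pool disks in this report; adjust for other systems.
EXPECTED = ["ada0", "ada1", "ada2", "ada3"]


def attached_devices(devlist_output):
    """Collect the names in the trailing parenthesised field of each line."""
    names = set()
    for line in devlist_output.splitlines():
        m = re.search(r"\(([^)]*)\)\s*$", line)
        if m:
            names.update(n.strip() for n in m.group(1).split(","))
    return names


def missing_disks(devlist_output, expected=EXPECTED):
    """Return the expected disks that do not appear in the device list."""
    present = attached_devices(devlist_output)
    return [disk for disk in expected if disk not in present]


sample = """\
<SAMSUNG HD204UI 1AQ10001>         at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD204UI 1AQ10001>         at scbus2 target 0 lun 0 (ada2,pass2)
<SAMSUNG HD204UI 1AQ10001>         at scbus3 target 0 lun 0 (ada3,pass3)
"""
print(missing_disks(sample))
```

For the output above, the only expected disk not listed is ada1, which is the disk on ahcich1 that logged the timeouts.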

I have upgraded the server's BIOS to the latest available version (2011.04.02 (A)), but the problem persists. Furthermore, extended offline SMART self-tests (smartctl -t long) run on all four disks report no errors.

If I switch to the old ata(4) driver, the scrub job completes without any errors.

Others have reported the same symptoms on similar hardware (an N36L with Samsung disks), and switching drivers remedied the problem for them as well:

http://freebsd.1045724.n5.nabble.com/ahci-ko-and-IXP700-800-gt-no-disk-found-tt3948669.html#a3948673
>How-To-Repeat:
Load the ahci(4) module and start a disk-I/O-intensive process (e.g. a ZFS scrub).
>Fix:


>Release-Note:
>Audit-Trail:

From: Petteri Valkonen <petteri.valkonen@iki.fi>
To: bug-followup@FreeBSD.org, petteri.valkonen@iki.fi
Cc:  
Subject: Re: amd64/157615: AHCI device timeouts with ATI IXP700 SATA
 controller on high IO load
Date: Fri, 8 Jul 2011 13:10:52 +0300

 Samsung HD204UI disks manufactured before December 2010 have a firmware bug that can cause bad sectors to be reported under certain conditions:
 
 http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
 
 My drives were all manufactured in either December 2010 (marked as "2010.12" on the label) or January 2011 ("2011.01") and, in theory, should not require patching. However, I decided to update the firmware for good measure.
 
 After patching the firmware on the failed drives with Samsung's updater (http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386) and re-enabling the ahci(4) driver, I can now scrub the pool without any timeouts.
 
 Thus it would seem that the root cause of the problem was the buggy Samsung firmware instead of the AHCI driver. Furthermore, contrary to previous reports, *some* disks manufactured in December 2010 may still require patching. 
State-Changed-From-To: open->closed 
State-Changed-By: avg 
State-Changed-When: Fri Jul 8 11:41:19 UTC 2011 
State-Changed-Why:  
Closing per submitter's investigation and followup. 
The issue appears to have been hardware-related rather than a FreeBSD problem. 
Furthermore, the issue was not amd64-specific, as originally 
classified by the submitter. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=157615 
>Unformatted:
