From nobody@FreeBSD.org  Tue Oct 30 20:25:38 2007
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A384D16A41A
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 30 Oct 2007 20:25:38 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id 8ED5E13C4A3
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 30 Oct 2007 20:25:38 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.1/8.14.1) with ESMTP id l9UKPbwo030432
	for <freebsd-gnats-submit@FreeBSD.org>; Tue, 30 Oct 2007 20:25:37 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.1/8.14.1/Submit) id l9UKPbgB030431;
	Tue, 30 Oct 2007 20:25:37 GMT
	(envelope-from nobody)
Message-Id: <200710302025.l9UKPbgB030431@www.freebsd.org>
Date: Tue, 30 Oct 2007 20:25:37 GMT
From: Matt Lehner <matt@aim2game.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: mpt disk timeout and hang
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         117688
>Category:       kern
>Synopsis:       [mpt] mpt disk timeout and hang
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Oct 30 20:30:00 UTC 2007
>Closed-Date:    
>Last-Modified:  Mon Sep 21 22:40:01 UTC 2009
>Originator:     Matt Lehner
>Release:        7.0-BETA1
>Organization:
>Environment:
FreeBSD vault.buffalo.rr.com 7.0-BETA1 FreeBSD 7.0-BETA1 #0: Mon Oct 22 07:41:02 UTC 2007     root@vault.buffalo.rr.com:/usr/obj/usr/src/sys/VAULT  amd64
>Description:
I installed FreeBSD 7 to take advantage of its ZFS support. While testing
ZFS, I came across an issue with the mpt(4) driver. After an extended
period of moderate to heavy load on the disks, I would get the following
errors in dmesg. By moderate to heavy load I mean roughly 50-70 MB/s with
bursts to 86 MB/s and 600 ops/s per disk, according to gstat.

Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e6f150:13878 timed out for ccb 0xffffff0001a15000 (req->ccb 0xffffff0001a15000)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e75450:13879 timed out for ccb 0xffffff0001a10000 (req->ccb 0xffffff0001a10000)
Oct 29 13:53:40 vault kernel: mpt0: attempting to abort req 0xffffffff80e6f150:13878 function 0
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e6ea00:13880 timed out for ccb 0xffffff0001998400 (req->ccb 0xffffff0001998400)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e70740:13881 timed out for ccb 0xffffff0001395400 (req->ccb 0xffffff0001395400)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e69ab0:13886 timed out for ccb 0xffffff000157dc00 (req->ccb 0xffffff000157dc00)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e762f0:13887 timed out for ccb 0xffffff0001982400 (req->ccb 0xffffff0001982400)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e6b520:13888 timed out for ccb 0xffffff000198ec00 (req->ccb 0xffffff000198ec00)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e7a820:13889 timed out for ccb 0xffffff00019bf000 (req->ccb 0xffffff00019bf000)
Oct 29 13:53:40 vault kernel: mpt0: abort of req 0xffffffff80e6f150:13878 completed
Oct 29 13:53:40 vault kernel: mpt0: attempting to abort req 0xffffffff80e6f150:13878 function 0
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e6dda0:13890 timed out for ccb 0xffffff0001983400 (req->ccb 0xffffff0001983400)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e6df50:13891 timed out for ccb 0xffffff00019be000 (req->ccb 0xffffff00019be000)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e6b9a0:13892 timed out for ccb 0xffffff00018c4400 (req->ccb 0xffffff00018c4400)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e72a20:13893 timed out for ccb 0xffffff0001a10800 (req->ccb 0xffffff0001a10800)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e696c0:13894 timed out for ccb 0xffffff000197ec00 (req->ccb 0xffffff000197ec00)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e74d90:13895 timed out for ccb 0xffffff00018c4000 (req->ccb 0xffffff00018c4000)
Oct 29 13:53:40 vault kernel: mpt0: abort of req 0xffffffff80e6f150:13878 completed
Oct 29 13:53:40 vault kernel: mpt0: attempting to abort req 0xffffffff80e6f150:13878 function 0
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e78e40:13904 timed out for ccb 0xffffff0001a0f000 (req->ccb 0xffffff0001a0f000)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e6f8a0:13905 timed out for ccb 0xffffff0001a0ac00 (req->ccb 0xffffff0001a0ac00)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e75f00:13906 timed out for ccb 0xffffff000194e000 (req->ccb 0xffffff000194e000)
Oct 29 13:53:40 vault kernel: mpt0: request 0xffffffff80e772b0:13907 timed out for ccb 0xffffff0001984000 (req->ccb 0xffffff0001984000)
Oct 29 13:53:40 vault kernel: mpt0: abort of req 0xffffffff80e6f150:13878 completed
Oct 29 13:53:40 vault kernel: mpt0: attempting to abort req 0xffffffff80e6f150:13878 function 0
Oct 29 13:53:40 vault kernel: mpt0: abort of req 0xffffffff80e6f150:13878 completed
Oct 29 13:53:40 vault kernel: mpt0: attempting to abort req 0xffffffff80e6f150:13878 function 0
Oct 29 13:53:40 vault kernel: mpt0: abort of req 0xffffffff80e6f150:13878 completed
Oct 29 13:53:40 vault kernel: mpt0: attempting to abort req 0xffffffff80e6f150:13878 function 0
Oct 29 13:53:40 vault kernel: mpt0: abort of req 0xffffffff80e6f150:13878 completed

The last two lines would continue to repeat (indefinitely, I assume)
until I power-cycled the machine. When the server came back online, ZFS
functioned fine and reported no checksum errors. A scrub also found no
problems. But if I put enough load onto the disks for an extended period,
it would fail again with the same errors. There doesn't appear to be a
specific length of time or exact combination of factors that triggers the
errors; sometimes they occur much more quickly than at other times. While
the errors were scrolling up the screen, one disk or the other, or both,
would have its activity light on steady.
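
For anyone triaging a similar log, the repeating abort pattern can be
spotted mechanically. What follows is a minimal Python sketch and not part
of the original report. The function names and the regular expressions are
assumptions, inferred only from the message formats quoted above; other
versions of the driver may word these messages differently.

```python
import re
from collections import Counter

# Patterns inferred from the kernel messages quoted in this report
# (an assumption; they are not taken from the mpt(4) source).
TIMEOUT_RE = re.compile(r"mpt\d+: request (0x[0-9a-f]+:\d+) timed out")
ATTEMPT_RE = re.compile(r"mpt\d+: attempting to abort req (0x[0-9a-f]+:\d+)")
DONE_RE = re.compile(r"mpt\d+: abort of req (0x[0-9a-f]+:\d+) completed")


def summarize_mpt_log(lines):
    """Return (timed_out, attempts, completions) keyed by 'address:serial'."""
    timed_out = set()
    attempts = Counter()
    completions = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out.add(match.group(1))
            continue
        match = ATTEMPT_RE.search(line)
        if match:
            attempts[match.group(1)] += 1
            continue
        match = DONE_RE.search(line)
        if match:
            completions[match.group(1)] += 1
    return timed_out, attempts, completions


def looping_requests(attempts, completions, threshold=3):
    """List requests aborted and 'completed' at least `threshold` times."""
    return sorted(
        req for req, count in attempts.items()
        if count >= threshold and completions[req] >= threshold
    )


# One timeout followed by three abort/complete pairs for the same request,
# using the message text from the log above.
sample = [
    "mpt0: request 0xffffffff80e6f150:13878 timed out for ccb "
    "0xffffff0001a15000 (req->ccb 0xffffff0001a15000)",
] + [
    "mpt0: attempting to abort req 0xffffffff80e6f150:13878 function 0",
    "mpt0: abort of req 0xffffffff80e6f150:13878 completed",
] * 3

timed_out, attempts, completions = summarize_mpt_log(sample)
print(looping_requests(attempts, completions))
# prints ['0xffffffff80e6f150:13878']
```

A request that shows up in this list has been aborted and reported as
completed several times over, which is the signature of the loop described
above. The threshold of 3 is arbitrary.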

Currently the machine boots over the network (using pxeboot) from another
machine. The disks in the ZFS pool are the only physical disks it has, so
while this is happening the system itself does not lock up.

Motherboard: Tyan Tiger i7501 S2723
CPU: Dual Opteron 244
Controller: LSI SAS3041X-R
Hard drives: 2x 1TB Hitachi Deskstar

vault# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
tank           824G  89.6G    18K  /tank
tank/storage   824G  89.6G   824G  /storage
vault#

vault# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0

errors: No known data errors
vault#

mpt0: <LSILogic SAS/SATA Adapter> port 0x8800-0x88ff mem 0xfc2fc000-0xfc2fffff,0xfc2e0000-0xfc2effff irq 28 at device 3.0 on pci1
mpt0: [ITHREAD]
mpt0: MPI Version=1.5.10.0
>How-To-Repeat:
Sustain heavy I/O on disks attached to an mpt(4) controller for an
extended period.
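
One possible way to generate that kind of load is sketched below in
Python. This is an illustration only and not the workload the submitter
actually ran: the function name, the chunk size, and the write-then-read
pattern are all assumptions. Reads may be served from cache rather than
from the disks, so the write phase is the part most likely to reach the
controller.

```python
import os


def sustained_io(path, chunk_mib=1, chunks=64, passes=1):
    """Write `chunks` blocks to `path`, sync, read them back, and repeat.

    Returns the total number of bytes written plus bytes read.
    The scratch file is removed afterwards.
    """
    block = os.urandom(chunk_mib * 1024 * 1024)
    moved = 0
    try:
        for _ in range(passes):
            with open(path, "wb") as out:
                for _ in range(chunks):
                    moved += out.write(block)
                out.flush()
                os.fsync(out.fileno())  # ask the OS to push data to the device
            with open(path, "rb") as src:
                while True:
                    data = src.read(len(block))
                    if not data:
                        break
                    moved += len(data)
    finally:
        if os.path.exists(path):
            os.remove(path)
    return moved
```

A hypothetical invocation on the pool from this report would be
`sustained_io("/storage/mpt_load.bin", chunks=4096, passes=100)`, where the
file name is made up and the sizes would need tuning to reach the
throughput described above.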
>Fix:


>Release-Note:
>Audit-Trail:

From: Nathaniel W Filardo <nwf@cs.jhu.edu>
To: bug-followup@freebsd.org
Cc:  
Subject: Re: kern/117688: [mpt] mpt disk timeout and hang
Date: Thu, 13 Aug 2009 02:12:05 -0400

 I just tripped over this one on a more recent system:
 
 # uname -a
 FreeBSD hydra.priv.oc.ietfng.org 8.0-CURRENT-200906 FreeBSD
 8.0-CURRENT-200906 #0: Thu Jun 11 15:23:43 UTC 2009
 root@araz.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  sparc64
 
 What additional information would be useful?
 --nwf;

From: Nathaniel W Filardo <nwf@cs.jhu.edu>
To: bug-followup@freebsd.org
Cc:  
Subject: Re: kern/117688: [mpt] mpt disk timeout and hang
Date: Mon, 21 Sep 2009 18:31:44 -0400

 Acting on unrelated advice and a hunch, I have run "camcontrol tags $DISK -N
 16" on each of the disks attached to my mpt and have not yet suffered a
 recurrence.  When I had set tags to 24, the system was stable for a week or
 so but did ultimately succumb to this infinite loop of request
 cancellations.  camcontrol thinks that all the disks report that they can
 use up to 255 tags.
 
 Does that perhaps shed some light on the matter for somebody more qualified
 than yours truly?
 
 --nwf;
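
The workaround described in the follow-up above can be applied to several
disks at once. The Python sketch below only wraps the command quoted in
that message ("camcontrol tags $DISK -N 16"). The helper names, the
dry-run default, and the device names da0 and da1 (borrowed from the
original report's zpool output) are assumptions for illustration.

```python
import subprocess


def tags_command(disk, tags):
    """Build the camcontrol invocation quoted in the follow-up for one disk."""
    if tags < 1:
        raise ValueError("tags must be a positive integer")
    return ["camcontrol", "tags", disk, "-N", str(tags)]


def limit_tags(disks, tags=16, dry_run=True):
    """Return the commands for each disk; run them only if dry_run is False."""
    commands = [tags_command(disk, tags) for disk in disks]
    if not dry_run:
        for command in commands:
            # Requires FreeBSD and sufficient privileges.
            subprocess.run(command, check=True)
    return commands


# Dry run: show what would be executed without touching any device.
for command in limit_tags(["da0", "da1"]):
    print(" ".join(command))
```

Whether 16 is the right limit is not established by this report; the
follow-up only notes that 24 eventually failed and that 16 had not failed
at the time of writing.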
>Unformatted:
