From hsu@lelu.olari.bbnetworks.net  Tue Jul 13 07:36:52 2004
Return-Path: <hsu@lelu.olari.bbnetworks.net>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 9551916A4CE
	for <FreeBSD-gnats-submit@freebsd.org>; Tue, 13 Jul 2004 07:36:52 +0000 (GMT)
Received: from lelu.olari.bbnetworks.net (lelu.olari.bbnetworks.net [212.16.96.66])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 386D743D39
	for <FreeBSD-gnats-submit@freebsd.org>; Tue, 13 Jul 2004 07:36:51 +0000 (GMT)
	(envelope-from hsu@lelu.olari.bbnetworks.net)
Received: from lelu.olari.bbnetworks.net (localhost [127.0.0.1])
	by lelu.olari.bbnetworks.net (8.12.11/8.12.11) with ESMTP id i6D7ajWY038675
	for <FreeBSD-gnats-submit@freebsd.org>; Tue, 13 Jul 2004 10:36:45 +0300 (EEST)
	(envelope-from hsu@lelu.olari.bbnetworks.net)
Received: (from root@localhost)
	by lelu.olari.bbnetworks.net (8.12.11/8.12.11/Submit) id i6D7aiOW038674;
	Tue, 13 Jul 2004 10:36:44 +0300 (EEST)
	(envelope-from hsu)
Message-Id: <200407130736.i6D7aiOW038674@lelu.olari.bbnetworks.net>
Date: Tue, 13 Jul 2004 10:36:44 +0300 (EEST)
From: Heikki Suonsivu <hsu@bbnetworks.net>
Reply-To: Heikki Suonsivu <hsu@bbnetworks.net>
To: FreeBSD-gnats-submit@freebsd.org
Subject: Firewire crashes, loose pointers in kernel ?
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         68978
>Category:       kern
>Synopsis:       [panic] [ufs] crashes with failing hard disk, loose pointers in kernel?
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-fs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jul 13 07:40:23 GMT 2004
>Closed-Date:    
>Last-Modified:  Mon Feb 02 20:46:42 UTC 2009
>Originator:     Heikki Suonsivu
>Release:        FreeBSD 4.10-STABLE i386
>Organization:
bbnetworks.net
>Environment:

System: FreeBSD lelu.olari.bbnetworks.net 4.10-STABLE FreeBSD 4.10-STABLE #31: Thu Jul 8 02:19:24 EEST 2004 hsu@lelu.olari.bbnetworks.net:/usr/obj/usr/src/sys/CAT i386

Maxtor 250G external disk connected through firewire.  The disk is
somewhat suspect, but kernel should not get confused on disk request
timeouts.  I do not know certain yet whether this is caused by disk
breaking down, but noises and clicks from the disk are raising
suspicion.

The external disk contains just backups, and accesses to it are rare.
It contains one large BSD filesystem.

There are no other devices connected to the firewire.

The disk was connected with USB1, and worked ok.  It also did seem to
work with firewire for about a month, though the use level has been
quite low.

>Description:

When using the firewire disk I start getting timeouts.  The computer
hangs at some point after this, with rather random effects after
random time, console may continue to work or die, etc.

The log entries look suspiciously like something is getting seriously
wrong in the kernel:

Jul 11 14:28:26 lelu /kernel: sbp0:0:0 request timeout(cmd orb:0x14bf44fc) ... agent reset
Jul 12 00:03:58 lelu /kernel: sbp0:00 request timgout(cmd orb:0x14bf<34)"... taret reset
Jul 12 00:03:58 lelu /kernel: sbp0:0:0 ~equest timeout(gmd orb:0x14bf48a4) ... reset start
Jul 12 00:03:58 lelu /kernel: firgwire0: split uvansaction timeout dst=0xffc0 tl=0x21 state=10
Jul 12 00:03:58 lelu /kernel: sbp0:0:0!sbp_reset_start failed: resp=60NCopyright (c) 1992-2004 The FreeBSD Project.
Jul 12 00:03:58 lelu /kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jul 12 00:03:58 lelu /kernel: The Regents of the University of California. All rights reserved.
Jul 12 00:03:58 lelu /kernel: FreeBSD 4.10-STABLE #31: Thu Jul  8 02:19:24 EEST 2004

Notice corrupted messages!

The disk is usually hung after first "request timeout" message.

>How-To-Repeat:

Get a broken maxtor disk ?  I can probably ship the disk to someone but
would prefer someone in Finland due to shipping cost and timing (I
want to get it replaced before warranty expires).

When the disk is not used and/or mounted, but is connected to the
firewire bus, the problems do not seem to appear.

>Fix:

>Release-Note:
>Audit-Trail:

From: Harrison Grundy <astrodog@gmail.com>
To: bug-followup@FreeBSD.org,  hsu@bbnetworks.net
Cc:  
Subject: Re: kern/68978: [crash] firewire crashes with failing hard disk,
 loose pointers in kernel?
Date: Tue, 27 Mar 2007 11:00:02 -0500

 As absurd as this might sound, can you confirm if this still occurs in 6.x?

From: Heikki Suonsivu <hsu@bbnetworks.net>
To: Harrison Grundy <astrodog@gmail.com>
Cc: bug-followup@FreeBSD.org
Subject: Re: kern/68978: [crash] firewire crashes with failing hard disk,
 loose pointers in kernel?
Date: Thu, 29 Mar 2007 00:38:52 +0300

 Harrison Grundy wrote:
 > As absurd as this might sound, can you confirm if this still occurs in 6.x?
 
 The external drives are still misbehaving, though the symptoms have been 
 changing by time.  I get panics after the umass gets confused.  I was 
 wrong in my original report, it was not just firewire, and not failing 
 disk (unless all 9+ external drives I have used are bad).  The same 
 problem occurs with Maxtor, Lacie external disks, various models, and 
 the problems occur both USB and firewire connected disks. This happens 
 at least on two PentiumIII computers and one AMD64.  I have not tested 
 others.  The more I load the disks, the more often this occurs.  The 
 rate of crashes on slow, but active computer with 5 drives is something 
 like couple of 2-3 times per month, on passive fast but passive computer 
 with 3 drives maybe every 2 months.  I cannot see any difference between 
 drives from different manufacturers, but I haven't done any scientific 
 analysis :-/
 
 I think I might have filed a new PR so it might a good idea to check for 
 duplicates.
 
 Below are some excerpts from various symptoms.
 
 I keep getting these when USB or firewire disks are loaded a lot.
 When doing a large copy these pop into console at rate of every few 
 seconds, at random-looking intervals.
 
 Mar 29 00:22:32 lelu kernel: Unknown service addr 0x0000:0x09aab800 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:32 lelu kernel: Unknown service addr 0x0000:0x0ad03000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:37 lelu kernel: Unknown service addr 0x0000:0x0e37c000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:46 lelu kernel: Unknown service addr 0x0000:0x15fd6800 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:46 lelu kernel: Unknown service addr 0x0000:0x0e716000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:47 lelu kernel: Unknown service addr 0x0000:0x0ab54000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:47 lelu kernel: Unknown service addr 0x0000:0x156e4000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:50 lelu kernel: Unknown service addr 0x0000:0x0df7a000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:57 lelu kernel: Unknown service addr 0x0000:0x10209000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:22:58 lelu kernel: Unknown service addr 0x0000:0x150f7000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:23:01 lelu kernel: Unknown service addr 0x0000:0x148a5800 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:23:02 lelu kernel: Unknown service addr 0x0000:0x10f27000 
 WREQB(1) src=0xffc1 data=8
 Mar 29 00:23:02 lelu kernel: Unknown service addr 0x0000:0x0702c000 
 WREQB(1) src=0xffc1 data=8
 
 Then, at some point I might see
 
 Mar  5 12:18:47 lelu kernel: umass0: Invalid CSW: tag 145294 should be 
 145295
 Mar  5 12:18:47 lelu kernel: umass0: Invalid CSW: tag 145295 should be 
 145296
 Mar  5 12:18:48 lelu kernel: umass0: Invalid CSW: tag 145296 should be 
 145297
 Mar  5 12:18:48 lelu kernel: umass0: Invalid CSW: tag 145297 should be 
 145298
 Mar  5 12:18:49 lelu kernel: umass0: Invalid CSW: tag 145298 should be 
 145299
 Mar  5 12:18:49 lelu kernel: 
 g_vfs_done():da0s1e[READ(offset=150001205248, lengt
 h=16384)]error = 5
 Mar  5 12:18:49 lelu kernel: 
 g_vfs_done():da0s1e[READ(offset=150001221632, lengt
 h=16384)]error = 5
 Mar  5 12:18:49 lelu kernel: g_vfs_done():da0s1e[WRITE(offset=114688, 
 length=163
 84)]error = 5
 Mar  5 12:18:49 lelu kernel: umass0: Invalid CSW: tag 145299 should be 
 145300
 Mar  5 12:18:50 lelu kernel: umass0: Invalid CSW: tag 145300 should be 
 145301
 Mar  5 12:18:51 lelu kernel: umass0: Invalid CSW: tag 145301 should be 
 145302
 Mar  5 12:18:51 lelu kernel: umass0: Invalid CSW: tag 145302 should be 
 145303
 Mar  5 12:18:52 lelu kernel: umass0: Invalid CSW: tag 145303 should be 
 145304
 Mar  5 12:18:52 lelu kernel: g_vfs_done():da0s1e[READ(offset=6209536, 
 length=204
 8)]error = 5
 Mar  5 12:18:52 lelu kernel: umass0: Invalid CSW: tag 145304 should be 
 145305
 Mar  5 12:18:53 lelu kernel: umass0: Invalid CSW: tag 145305 should be 
 145306
 Mar  5 12:18:54 lelu kernel: umass0: Invalid CSW: tag 145308 should be 
 145309
 Mar  5 12:18:54 lelu kernel: 
 g_vfs_done():da0s1e[READ(offset=411775385600, lengt
 h=16384)]error = 5
 Mar  5 12:18:54 lelu kernel: 
 g_vfs_done():da0s1e[WRITE(offset=406895493120, leng
 th=16384)]error = 5
 Mar  5 12:18:55 lelu kernel: umass0: Invalid CSW: tag 145309 should be 
 145310
 Mar  5 12:22:10 lelu kernel: umass0: BBB reset failed, TIMEOUT
 Mar  5 12:23:15 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT
 Mar  5 12:24:20 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 TIMEOUT
 Mar  5 12:26:30 lelu kernel: umass0: BBB reset failed, TIMEOUT
 Mar  5 12:27:35 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT
 Mar  5 12:28:40 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 TIMEOUT
 Mar  5 12:30:50 lelu kernel: umass0: BBB reset failed, TIMEOUT
 Mar  5 12:31:55 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT
 Mar  5 12:33:00 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 TIMEOUT
 Mar  5 12:35:10 lelu kernel: umass0: BBB reset failed, TIMEOUT
 Mar  5 12:36:15 lelu kernel: umass0: BBB bulk-in clear stall failed, TIMEOUT
 Mar  5 12:37:20 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 TIMEOUT
 ...
 Mar  5 21:43:15 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 TIMEOUT
 Mar  5 21:44:01 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 IOERROR
 Mar  5 21:44:01 lelu kernel: umass0: BBB reset failed, IOERROR
 Mar  5 21:44:01 lelu kernel: g_vfs_done():da0s1e[WRITE(offset=114688, 
 length=163
 84)]error = 5
 Mar  5 21:44:01 lelu kernel: 
 g_vfs_done():da0s1e[READ(offset=411775860736, lengt
 h=16384)]error = 5
 Mar  5 21:44:01 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 IOERROR
 Mar  5 21:44:01 lelu kernel: umass0: BBB reset failed, IOERROR
 Mar  5 21:44:01 lelu kernel: umass0: BBB bulk-out clear stall failed, 
 IOERROR
 Mar  5 21:44:01 lelu kernel: 
 g_vfs_done():da0s1e[READ(offset=411775877120, lengt
 h=16384)]error = 5
 Mar  5 21:44:01 lelu kernel: umass0: BBB reset failed, IOERROR
 
 (Sorry about wrapping)
 
 This usually ends up with panic, or "disk removed from system" thing 
 (which also usually leads to panic when shutting down or rebooting). 
 The panics look like coming from the filesystem code when a misbehaving 
 disk disappears or locks up or gets confused, and filesystem code freaks 
 out on something it does not expect.
 
 It might be triggered by buggy disks or USB/FW controllers, but if that 
 would be true there are lots of bad hardware, as I have not yet found a 
 single working combination.  I have had 3 maxtor models and 3 Lacie 
 models.  Windows must get around this some way, as I have not seen 
 people reporting this flaky operation under windows (though, it might be 
 more difficult to tell the difference and people might use the drives 
 less actively).
 
 Sometimes I also start getting various I/O errors, which disappear when 
 resetting the system.  fsck might start spewing out bad block messages, 
 which disappear when resetting the computer and/or disks.
 
 Sometimes resetting the computer might not be enough, but also disks 
 needs to be power cycled.  This is rare (maybe couple of times a year).
 
 I not seen problems like these on Windows or Macs but the usage pattern 
 is very different, mostly occasional backups.  The drives causing grief 
 above are native bsd formatted drives.
 
 I have not seen this on 2.5 inch external drives or other USB media, but 
 I have never had one in active use like these drives are.
 
 
Responsible-Changed-From-To: freebsd-bugs->freebsd-fs 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Mon Feb 2 20:41:44 UTC 2009 
Responsible-Changed-Why:  
Over to maintainer(s).  To be honest, I'm not sure this will be a 
priority - the UFS code is known to expect that the underlying 
media is 100% working.  I'll keep this PR open though, as some 
progress has been made recently on fixing up these assumptions. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=68978 
>Unformatted:
