From nobody@FreeBSD.org  Sat Apr  4 15:43:42 2009
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2BDF7106566B
	for <freebsd-gnats-submit@FreeBSD.org>; Sat,  4 Apr 2009 15:43:42 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id 18BFA8FC12
	for <freebsd-gnats-submit@FreeBSD.org>; Sat,  4 Apr 2009 15:43:42 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id n34FhfT4048302
	for <freebsd-gnats-submit@FreeBSD.org>; Sat, 4 Apr 2009 15:43:41 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id n34FhfU7048301;
	Sat, 4 Apr 2009 15:43:41 GMT
	(envelope-from nobody)
Message-Id: <200904041543.n34FhfU7048301@www.freebsd.org>
Date: Sat, 4 Apr 2009 15:43:41 GMT
From: Damian Gerow <dgerow@afflictions.org>
To: freebsd-gnats-submit@FreeBSD.org
Subject: umass attachment causes ZFS checksum errors, data loss
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         133373
>Category:       kern
>Synopsis:       [zfs] umass attachment causes ZFS checksum errors, data loss
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    trasz
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Apr 04 15:50:00 UTC 2009
>Closed-Date:    Fri Oct 02 07:33:26 UTC 2009
>Last-Modified:  Fri Oct 02 07:33:26 UTC 2009
>Originator:     Damian Gerow
>Release:        amd64 8.0-CURRENT (80074)
>Organization:
>Environment:
FreeBSD plebeian.afflictions.org 8.0-CURRENT FreeBSD 8.0-CURRENT #0: Tue Mar 31 08:41:28 EDT 2009     dgerow@plebeian.afflictions.org:/usr/obj/usr/src/sys/GENERIC  amd64

>Description:
Attaching a umass device to a system that has a moderate load on a ZFS
filesystem causes various programs to start throwing input/output errors,
and ZFS checksum errors are logged to /var/log/messages.

In this case, while running an average workstation load (fetchmail,
firefox, a few terminals, and transmission), connecting a Cowon D2
triggers the error immediately.  Note that the longer the system has
been running, the more frequently the errors show up.

Appropriate snippets from /var/log/messages (nothing shows up in dmesg):

-----
Mar 31 23:29:56 plebeian kernel: ugen7.2: <COWON Systems, Inc.> at usbus7
Mar 31 23:29:56 plebeian kernel: umass0: <COWON Systems, Inc. COWON D2  @-e" 3.57, class 0/0, rev 2.00/1.00, addr 2> on usbus7
Mar 31 23:29:56 plebeian kernel: umass0:  SCSI over Bulk-Only; quirks = 0x0000
Mar 31 23:29:56 plebeian root: Unknown USB device: vendor 0x0e21 product 0x0800 bus uhub7
Mar 31 23:29:57 plebeian kernel: umass0:0:0:-1: Attached to scbus0
Mar 31 23:29:57 plebeian kernel: da0 at umass-sim0 bus 0 target 0 lun 0
Mar 31 23:29:57 plebeian kernel: da0: <COWON D2 0100> Removable Direct Access SCSI-0 device
Mar 31 23:29:57 plebeian kernel: da0: 40.000MB/s transfers
Mar 31 23:29:57 plebeian kernel: da0: 7808MB (15990784 512 byte sectors: 255H 63S/T 995C)
Mar 31 23:29:57 plebeian kernel: da1 at umass-sim0 bus 0 target 0 lun 1
Mar 31 23:29:57 plebeian kernel: da1: <COWON D2 0100> Removable Direct Access SCSI-0 device
Mar 31 23:29:57 plebeian kernel: da1: 40.000MB/s transfers
Mar 31 23:29:57 plebeian kernel: da1: 15359MB (31456320 512 byte sectors: 255H 63S/T 1958C)
Mar 31 23:29:57 plebeian kernel: GEOM_LABEL: Label for provider da0 is msdosfs/D2.
Mar 31 23:29:57 plebeian kernel: GEOM: da1: partition 1 does not start on a track boundary.
Mar 31 23:29:57 plebeian kernel: GEOM: da1: partition 1 does not end on a track boundary.
Mar 31 23:30:50 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=297610772480 size=131072
Mar 31 23:30:50 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=297610772480 size=131072
Mar 31 23:30:50 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:20 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=23159373824 size=131072
Mar 31 23:31:20 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=23159373824 size=131072
Mar 31 23:31:20 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18063163392 size=131072
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18063163392 size=131072
Mar 31 23:31:34 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18062901248 size=131072
Mar 31 23:31:34 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18062901248 size=131072
Mar 31 23:31:34 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=17453809664 size=131072
Mar 31 23:31:35 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
Mar 31 23:31:40 plebeian sudo:   dgerow : TTY=pts/4 ; PWD=/home/dgerow ; USER=root ; COMMAND=/etc/rc.d/sshd start
Mar 31 23:31:50 plebeian sudo:   dgerow : TTY=pts/4 ; PWD=/home/dgerow ; USER=root ; COMMAND=/etc/rc.d/sshd onestart
Mar 31 23:31:51 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18045468672 size=131072
Mar 31 23:31:51 plebeian dgerow: /etc/rc.d/sshd: WARNING: failed to start sshd
Mar 31 23:31:51 plebeian root: ZFS: checksum mismatch, zpool=storage path=/dev/ad4s1d.eli offset=18045468672 size=131072
Mar 31 23:31:51 plebeian root: ZFS: zpool I/O failure, zpool=storage error=86
-----
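
Each mismatch in the excerpt above is logged twice for the same offset
before the I/O failure line.  To make that easier to see in a longer log,
here is a minimal Python sketch that counts "checksum mismatch" lines per
offset.  It was not run on the affected system; the function name
summarize_mismatches and the regular expression are illustrative
assumptions based only on the line format shown above.

```python
import re
from collections import Counter

# Assumed line format, taken from the /var/log/messages excerpt above.
MISMATCH = re.compile(
    r"ZFS: checksum mismatch, zpool=(?P<pool>\S+) path=(?P<path>\S+) "
    r"offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def summarize_mismatches(lines):
    """Count checksum-mismatch log lines per (pool, path, offset, size)."""
    counts = Counter()
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            key = (
                match["pool"],
                match["path"],
                int(match["offset"]),
                int(match["size"]),
            )
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the excerpt above; the last one is not a
    # mismatch line and is ignored by the pattern.
    sample = [
        "Mar 31 23:30:50 plebeian root: ZFS: checksum mismatch, "
        "zpool=storage path=/dev/ad4s1d.eli offset=297610772480 size=131072",
        "Mar 31 23:30:50 plebeian root: ZFS: checksum mismatch, "
        "zpool=storage path=/dev/ad4s1d.eli offset=297610772480 size=131072",
        "Mar 31 23:30:50 plebeian root: ZFS: zpool I/O failure, "
        "zpool=storage error=86",
    ]
    for (pool, path, offset, size), n in sorted(summarize_mismatches(sample).items()):
        print(f"{pool} {path} offset={offset} size={size} seen {n} time(s)")
```

To use it on a real log, pass an open file object for /var/log/messages
to summarize_mismatches() instead of the sample list.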

I explicitly received errors reading from /etc/termcap after this.  So I
shut everything down -- there is a strong tendency for applications to core
dump at this point -- and rebooted into single-user mode.  Then I checked
the status of the zfs pool (zpool status -v):

-----
  pool: storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          ad4s1d.eli  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        storage/usr:/local/share/zsh/4.3.9/functions/Completion/Debian.zwc
        storage/usr:/local/share/zsh/4.3.9/functions/Completion/Linux.zwc
        storage/usr:<0xf02d>
        storage/usr:/local/bin/mutt
        storage/usr:/sbin/sshd
        storage/usr:/local/sbin/cupsd
        storage/usr:/share/games/fortune/fortunes
        storage/usr:/local/lib/libgtk-x11-2.0.so.0
        storage/usr:/local/lib/firefox3/chrome/browser.jar
        storage/usr:/share/misc/termcap
        storage/usr:/share/misc/termcap.db
        storage/usr:/local/bin/transmission
        storage/usr:/local/bin/openbox
        storage/home:<0x331d>
        storage/home:/dgerow/X-Files/402 - Home.mp4
        storage/home:/dgerow/.mozilla/firefox/bk0ibcxu.default/urlclassifier3.sqlite
        storage/home:/dgerow/X-Files/705 - Rush.mp4
        storage/home:/dgerow/.procmail.log
-----

There doesn't seem to be a pattern as to which files are affected: mutt,
openbox, cupsd, and transmission were all running before the checksum
errors, whereas fortune, sshd, and procmail were all run post-checksum
errors.  zsh and firefox were running, of course, both before and after.
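
To compare affected files across datasets, the "Permanent errors" list
can be grouped by dataset.  The following is a minimal Python sketch under
the assumption that every entry has the dataset:entry form shown in the
zpool status -v output above; the function name group_errors_by_dataset
is hypothetical.  Entries of the form <0x...> are kept as-is; as far as I
understand, these are object numbers that ZFS could not map back to a
path.

```python
from collections import defaultdict


def group_errors_by_dataset(status_text):
    """Group 'Permanent errors' entries of `zpool status -v` by dataset.

    Assumes each entry after the 'errors:' line looks like
    'dataset:/path' or 'dataset:<0xNNNN>', as in the output above.
    """
    grouped = defaultdict(list)
    in_errors = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("errors:"):
            in_errors = True
            continue
        if not in_errors or not line:
            continue
        # Split on the first colon only; paths may contain further colons.
        dataset, sep, entry = line.partition(":")
        if sep:
            grouped[dataset].append(entry)
    return dict(grouped)


if __name__ == "__main__":
    # A shortened copy of the error list from the output above.
    sample = """errors: Permanent errors have been detected in the following files:

        storage/usr:<0xf02d>
        storage/usr:/sbin/sshd
        storage/home:<0x331d>
        storage/home:/dgerow/.procmail.log
"""
    for dataset, entries in group_errors_by_dataset(sample).items():
        print(dataset, len(entries), entries)
```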

Oddly, as the output above shows, I'm not sure what 0x331d would be.
I didn't explicitly delete anything after plugging in the D2, though it
is possible that a program removed a temporary file.  I then ran a scrub,
which found an additional five errors; post-scrub, my pool now looks like
this:

-----
  pool: storage
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h34m with 6 errors on Wed Apr  1 00:22:25 2009
config:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     6
          ad4s1d.eli  ONLINE       0     0    12

errors: Permanent errors have been detected in the following files:

        storage/home:/dgerow/.mozilla/firefox/bk0ibcxu.default/urlclassifier3.sqlite
        storage/home:/dgerow/.config/transmission/resume/X-Files.2a82218f000bc93d.resume
        storage/home:/dgerow/.mozilla/firefox/bk0ibcxu.default/Cache/_CACHE_001_
        storage/home:/dgerow/mutt.core
        storage/home:/dgerow/X-Files/707 - Orison.mp4
-----

Further investigation showed that the bug is not triggered when
transmission is not running, which suggests that the problem is tied to
the number of open files (and possibly the size of those files, so
perhaps the ZFS cache itself).
>How-To-Repeat:
1) Bring up a moderate filesystem load on a ZFS system.  I used a firefox
   session with ~10 open tabs, and transmission with one torrent (~200
   files, and 30GB).
2) Plug in a umass device.
3) Watch /var/log/messages as the errors occur.
>Fix:


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-fs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Tue Apr 7 01:34:29 UTC 2009 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=133373 

From: Damian Gerow <dgerow@afflictions.org>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/133373: [zfs] umass attachment causes ZFS checksum errors,
 data loss
Date: Sat, 9 May 2009 14:31:45 -0400

 This PR can be closed: it was a problem with bounce buffers in the new
 USB2 code, and has since been fixed.
Responsible-Changed-From-To: freebsd-fs->trasz 
Responsible-Changed-By: trasz 
Responsible-Changed-When: Fri Oct 2 07:32:43 UTC 2009 
Responsible-Changed-Why:  
I'll take it. 

State-Changed-From-To: open->closed 
State-Changed-By: trasz 
State-Changed-When: Fri Oct 2 07:33:25 UTC 2009 
State-Changed-Why:  
According to the submitter, the problem is already fixed. 

>Unformatted:
