From nobody@FreeBSD.org  Sun Jul  4 23:05:30 2010
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C132A106564A
	for <freebsd-gnats-submit@FreeBSD.org>; Sun,  4 Jul 2010 23:05:30 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id B05008FC0C
	for <freebsd-gnats-submit@FreeBSD.org>; Sun,  4 Jul 2010 23:05:30 +0000 (UTC)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id o64N5TDo013834
	for <freebsd-gnats-submit@FreeBSD.org>; Sun, 4 Jul 2010 23:05:29 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id o64N5TCu013833;
	Sun, 4 Jul 2010 23:05:29 GMT
	(envelope-from nobody)
Message-Id: <201007042305.o64N5TCu013833@www.freebsd.org>
Date: Sun, 4 Jul 2010 23:05:29 GMT
From: Rich Ercolani <admins@acm.jhu.edu>
To: freebsd-gnats-submit@FreeBSD.org
Subject: ZFS hanging forever on 8.1-PRERELEASE
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         148368
>Category:       kern
>Synopsis:       [zfs] ZFS hanging forever on 8.1-PRERELEASE
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-fs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Jul 04 23:10:04 UTC 2010
>Closed-Date:    
>Last-Modified:  Mon Jul 05 02:09:18 UTC 2010
>Originator:     Rich Ercolani
>Release:        RELENG_8 from June 15th
>Organization:
JHU ACM
>Environment:
FreeBSD manticore.acm.jhu.edu 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #0: Wed Jun 16 17:10:42 UTC 2010     root@[removed]:/usr/obj/usr/local/ncvs/src/sys/DTRACE  amd64

>Description:
Occasionally, much to our chagrin, drives malfunction.

When this happens, ZFS and company appear to "handle" the errors correctly, but in practice a reboot is often required before the tools respond at all [e.g. "zpool scrub [affected pool]" never returns to a shell, and eventually "zpool status" hangs forever as well].
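Until the underlying hang is fixed, a watchdog wrapper at least keeps monitoring scripts from wedging alongside the pool. A minimal sh sketch (the 5-second limit is illustrative, and "sleep 1" stands in for the real zpool invocation, since the hang only reproduces on the affected host):

```shell
# Sketch: run a command, but give up if it has not returned within 5s,
# so a wedged "zpool status" does not also wedge the calling script.
run_with_timeout() {
    "$@" & cmd=$!
    ( sleep 5; kill "$cmd" 2>/dev/null ) & dog=$!
    wait "$cmd"; rc=$?
    kill "$dog" 2>/dev/null       # command finished; stop the watchdog
    return "$rc"
}
# "sleep 1" is a stand-in for e.g.: run_with_timeout zpool status cannoli
run_with_timeout sleep 1 && result=returned || result=hung
echo "$result"
```
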

I've seen this problem before, but at the time we were running an old RELENG_8 kernel [circa November 2009], and I presumed it would go away on upgrade.

The kernel config is the GENERIC config with the following modifications:
# diff GENERIC DTRACE
19c19
< # $FreeBSD: src/sys/amd64/conf/GENERIC,v 1.531.2.13 2010/05/02 06:24:17 imp Exp $
---
> # $FreeBSD: src/sys/amd64/conf/GENERIC,v 1.531.2.8 2010/01/18 00:53:21 imp Exp $
22c22
< ident         GENERIC
---
> ident         DTRACE
57c57
< options       COMPAT_FREEBSD32        # Compatible with i386 binaries
---
> options       COMPAT_IA32             # Compatible with i386 binaries
76,77c76,78
< #options      KDTRACE_FRAME           # Ensure frames are compiled in
< #options      KDTRACE_HOOKS           # Kernel DTrace hooks
---
> options       KDTRACE_FRAME           # Ensure frames are compiled in
> options       KDTRACE_HOOKS           # Kernel DTrace hooks
> options       DDB_CTF                 # Still more Dtrace-related hooks
227d227
< device                sge             # Silicon Integrated Systems SiS190/191
284d283
< options       USB_DEBUG       # enable debug msgs

I'm sorry I can't include a precise revision number for the kernel; I pulled the source with cvsup, and I don't know how to extract a revision number from a cvsup checkout.
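cvsup does not record a single tree-wide revision, but the per-file RCS ID strings it checks out (like the $FreeBSD$ lines visible in the diff above) can be recovered with grep. A self-contained sketch, using a mocked-up temp file since the real path (e.g. /usr/local/ncvs/src/sys/amd64/conf/GENERIC) only exists on the affected host:

```shell
# Sketch: pull the $FreeBSD$ RCS ID string out of a checked-out file.
# A temp file with a sample ID stands in for the real source file.
sample=$(mktemp)
printf '# $FreeBSD: src/sys/amd64/conf/GENERIC,v 1.531.2.13 2010/05/02 06:24:17 imp Exp $\n' > "$sample"
rev=$(grep -o '\$FreeBSD:[^$]*\$' "$sample")
echo "$rev"
rm -f "$sample"
```
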

I'm going to try pulling and installing the latest RELENG_8 to see if that helps.

For reference, these are the errors printed in the kernel log when the pool reported read/write errors on a disk:
Jul  4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= a, TargetId=1, Lun=4
Jul  4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= a, TargetId=1, Lun=4
Jul  4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= 8, TargetId=1, Lun=4

Status of the pool now:
  pool: cannoli
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h13m, 0.87% done, 25h56m to go
config:

        NAME        STATE     READ WRITE CKSUM
        cannoli     ONLINE       0     0     0
          da5       ONLINE       0     0     0
          da6       ONLINE       0     0     0
          da2       ONLINE       0     0     0
          da4       ONLINE       0     0     4

errors: 1 data errors, use '-v' for a list


At this point, the system will fail to reboot cleanly, as it waits forever for the ZFS filesystems to unmount cleanly [presumably].

My next kernel will have DDB built in.
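For reference, a sketch of the extra lines that kernel config would carry, in the same `options` style as the diff above (option names as they appear in stock FreeBSD configs; exact set to be confirmed against NOTES on the host):

```
options         KDB             # Enable kernel debugger support
options         DDB             # Support DDB
options         KDB_TRACE       # Print a stack trace on panic
```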
>How-To-Repeat:
1) Have a disk which occasionally reports uncorrected read/write errors with a ZFS filesystem on it.
2) ZFS will eventually cease to respond entirely to queries via the "zpool" or "zfs" commands. [I/O to the mounted filesystems continues to work for much longer, until the point where the entire system becomes unresponsive.]
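An illustrative way to stage step 1 without sacrificing real hardware is to back a throwaway pool with an error-injecting gnop(8) provider. This is a hypothetical sketch, FreeBSD-only: the md unit, sizes, and 5% failure rates are made up, and gnop's -r/-w failure-probability flags should be checked against gnop(8) on the host.

```shell
# Hypothetical sketch: back a disposable pool with a gnop(8) provider
# that randomly fails a fraction of I/O, then exercise it and watch
# whether the zpool commands wedge. Guarded so this is a no-op on
# hosts without the FreeBSD tools.
if command -v mdconfig >/dev/null 2>&1 && command -v gnop >/dev/null 2>&1; then
    mdconfig -a -t swap -s 512m -u 9     # md9: disposable memory-backed disk
    gnop create -r 5 -w 5 /dev/md9       # fail ~5% of reads and writes
    zpool create testpool /dev/md9.nop
    dd if=/dev/random of=/testpool/junk bs=1m count=64
    zpool scrub testpool
    zpool status testpool                # watch whether this ever returns
    note="ran"
else
    note="skipped: requires a FreeBSD host"
fi
echo "$note"
```
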
>Fix:


>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->freebsd-fs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Mon Jul 5 02:09:06 UTC 2010 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=148368 
>Unformatted:
