From nobody@FreeBSD.org  Fri Jan  7 18:40:49 2011
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 46C151065672
	for <freebsd-gnats-submit@FreeBSD.org>; Fri,  7 Jan 2011 18:40:49 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from red.freebsd.org (unknown [IPv6:2001:4f8:fff6::22])
	by mx1.freebsd.org (Postfix) with ESMTP id 0EA898FC17
	for <freebsd-gnats-submit@FreeBSD.org>; Fri,  7 Jan 2011 18:40:49 +0000 (UTC)
Received: from red.freebsd.org (localhost [127.0.0.1])
	by red.freebsd.org (8.14.4/8.14.4) with ESMTP id p07Iembf020756
	for <freebsd-gnats-submit@FreeBSD.org>; Fri, 7 Jan 2011 18:40:48 GMT
	(envelope-from nobody@red.freebsd.org)
Received: (from nobody@localhost)
	by red.freebsd.org (8.14.4/8.14.4/Submit) id p07Iem20020755;
	Fri, 7 Jan 2011 18:40:48 GMT
	(envelope-from nobody)
Message-Id: <201101071840.p07Iem20020755@red.freebsd.org>
Date: Fri, 7 Jan 2011 18:40:48 GMT
From: Martin Simmons <martin@lispworks.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Unkillable process after a SCSI tape write error
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         153771
>Category:       kern
>Synopsis:       Unkillable process after a SCSI tape write error
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jan 07 18:50:05 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Martin Simmons
>Release:        8.0-RELEASE-p6
>Organization:
>Environment:
FreeBSD lwfs1-cam.cam.lispworks.com 8.0-RELEASE-p6 FreeBSD 8.0-RELEASE-p6 #0: Wed Dec 22 16:21:50 GMT 2010     root@lwfs1-cam.cam.lispworks.com:/usr/obj/usr/src/sys/GENERICLOCAL  amd64
>Description:
I'm not 100% sure of the sequence of system calls, because they are generated by Bacula, but here is what I think happened.

The kernel is built from the 8.0-RELEASE-p6 GENERIC with a small patch for nfs.

The hardware consists of an Intel chassis with an HP LTO-1 tape drive connected via an LSI U160 SCSI controller:

Jan  3 23:06:24 lwfs1-cam kernel: sym0: <1010-33> port 0x1100-0x11ff mem 0xb1904000-0xb19043ff,0xb1900000-0xb1901fff irq 16 at device 0.0 on pci7
Jan  3 23:06:24 lwfs1-cam kernel: sym0: Symbios NVRAM, ID 7, Fast-80, LVD, parity checking
Jan  3 23:06:24 lwfs1-cam kernel: sym0: open drain IRQ line driver, using on-chip SRAM
Jan  3 23:06:24 lwfs1-cam kernel: sym0: using LOAD/STORE-based firmware.
Jan  3 23:06:24 lwfs1-cam kernel: sym0: handling phase mismatch from SCRIPTS.
Jan  3 23:06:24 lwfs1-cam kernel: sym0: [ITHREAD]
Jan  3 23:06:24 lwfs1-cam kernel: sym1: <1010-33> port 0x1000-0x10ff mem 0xb1905000-0xb19053ff,0xb1902000-0xb1903fff irq 17 at device 0.1 on pci7
Jan  3 23:06:24 lwfs1-cam kernel: sym1: Symbios NVRAM, ID 7, Fast-80, SE, parity checking
Jan  3 23:06:24 lwfs1-cam kernel: sym1: open drain IRQ line driver, using on-chip SRAM
Jan  3 23:06:24 lwfs1-cam kernel: sym1: using LOAD/STORE-based firmware.
Jan  3 23:06:24 lwfs1-cam kernel: sym1: handling phase mismatch from SCRIPTS.
Jan  3 23:06:24 lwfs1-cam kernel: sym1: [ITHREAD]
Jan  3 23:06:24 lwfs1-cam kernel: sa0 at sym0 bus 0 target 3 lun 0
Jan  3 23:06:24 lwfs1-cam kernel: sa0: <HP Ultrium 1-SCSI P61D> Removable Sequential Access SCSI-3 device 
Jan  3 23:06:24 lwfs1-cam kernel: sa0: 160.000MB/s transfers (80.000MHz DT, offset 62, 16bit)


There was a MEDIUM ERROR during a backup to the LTO-1 drive, but the write system call returned as if the error hadn't happened (AFAICS).  The next write was reported as failing with EPERM and a subsequent ioctl to rewind the tape hung in an unkillable state.

These syslog messages were generated

Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): WRITE FILEMARKS(6). CDB: 10 0 0 0 1 0 
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): CAM Status: SCSI Status Error
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): SCSI Status: Check Condition
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): MEDIUM ERROR info:1 asc:c,0
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): Write error
Jan  3 00:18:35 lwfs1-cam kernel: (sa0:sym0:0:3:0): Retries Exhausted
Jan  3 00:27:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 00:31:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 00:35:18 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 01:23:43 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 02:12:08 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 03:00:33 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 03:49:02 lwfs1-cam kernel: sym0:3:control msgout: 80 6.
Jan  3 04:37:27 lwfs1-cam kernel: sym0:3:control msgout: 80 6.

The message at 00:31:18 corresponds to when it tried to rewind the tape and the last message then repeats roughly every 50 minutes.

The output of procstat -k -k for the unkillable backup process is:

  PID    TID COMM             TDNAME           KSTACK                       
 1166 100179 bacula-sd        -                mi_switch+0x16f sleepq_catch_signals+0x31f sleepq_wait_sig+0xc _cv_wait_sig+0x120 seltdwait+0xf6 kern_select+0x62f select+0x5d syscall+0x246 Xfast_syscall+0xe1 
 1166 100259 bacula-sd        -                mi_switch+0x16f sleepq_catch_signals+0x31f sleepq_timedwait_sig+0x19 _sleep+0x1a3 do_cv_wait+0x640 __umtx_op_cv_wait+0x5c syscall+0x246 Xfast_syscall+0xe1 
 1166 100316 bacula-sd        -                mi_switch+0x16f sleepq_wait+0x42 _sleep+0x31c cam_periph_runccb+0x4e sarewind+0x6c saioctl+0x977 giant_ioctl+0x7d devfs_ioctl_f+0x77 kern_ioctl+0xf6 ioctl+0xfd syscall+0x246 Xfast_syscall+0xe1 

Attempts to attach gdb to the process caused gdb to hang (but it was killable).

Attempts to use "camcontrol periphlist sa0" and "camcontrol reset 0:3:0" also hung in an unkillable way.

"camcontrol reset 0" gave "Reset of bus 0 was successful" but didn't fix the hanging processes.

The machine has been rebooted since then so I can't test anything else directly.

>How-To-Repeat:
I don't know if it is repeatable.

>Fix:


>Release-Note:
>Audit-Trail:
>Unformatted:
