From tegge@not.fast.no Thu May 13 13:06:44 1999
Return-Path: <tegge@not.fast.no>
Received: from midten.fast.no (midten.fast.no [195.139.251.11])
	by hub.freebsd.org (Postfix) with ESMTP id 03C3914FC8
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 May 1999 13:06:42 -0700 (PDT)
	(envelope-from tegge@not.fast.no)
Received: from not.fast.no (IDENT:tegge@not.fast.no [195.139.251.12])
	by midten.fast.no (8.9.1/8.9.1) with ESMTP id WAA76064
	for <FreeBSD-gnats-submit@freebsd.org>; Thu, 13 May 1999 22:06:41 +0200 (CEST)
Received: (from tegge@localhost)
	by not.fast.no (8.9.3/8.8.8) id WAA59935;
	Thu, 13 May 1999 22:06:41 +0200 (CEST)
	(envelope-from tegge@not.fast.no)
Message-Id: <199905132006.WAA59935@not.fast.no>
Date: Thu, 13 May 1999 22:06:41 +0200 (CEST)
From: Tor Egge <tegge@not.fast.no>
Reply-To: tegge@not.fast.no
To: FreeBSD-gnats-submit@freebsd.org
Subject: Disk failure hangs system
X-Send-Pr-Version: 3.2

>Number:         11697
>Category:       kern
>Synopsis:       Disk failure hangs system
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    tegge
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu May 13 13:10:02 PDT 1999
>Closed-Date:    Fri Aug 26 11:42:33 GMT 2005
>Last-Modified:  Fri Aug 26 11:42:33 GMT 2005
>Originator:     Tor Egge
>Release:        FreeBSD 3.1-STABLE i386
>Organization:
Fast Search & Transfer ASA
>Environment:

FreeBSD 3.1-STABLE #0: Sat May  1 19:00:19 CEST 1999     root@response.fast.no:/usr/src/sys/compile/INDEX_SMP_SERIAL_DDB  i386

ahc1: <Adaptec 2940 Ultra2 SCSI adapter> rev 0x00 int a irq 17 on pci0.14.0
ahc1: aic7890/91 Wide Channel A, SCSI Id=7, 16/255 SCBs

da13 at ahc1 bus 0 target 9 lun 0
da13: <QUANTUM QM318000TD-SCA N1K0> Fixed Direct Access SCSI-2 device 
da13: 80.000MB/s transfers (40.000MHz, offset 31, 16bit), Tagged Queueing Enabled
da13: 17366MB (35566499 512 byte sectors: 255H 63S/T 2213C)

>Description:

----------------------
Unexpected busfree.  LASTPHASE == 0x80
SEQADDR == 0x15b
(da13:ahc1:0:9:0): Invalidating pack
(da13:ahc1:0:9:0): Invalidating pack
(da13:ahc1:0:9:0): Invalidating pack
vm_fault: pager read error, pid 63486 (mkserv)
(da13:ahc1:0:9:0): Invalidating pack
Stopped at      siointr1+0x6d:  jmp     siointr1+0x159
db> trace
siointr1(e3c8d800,e02890b0,0,f2e0da2c,e0206144) at siointr1+0x6d
siointr(0,f2e00010,0,1,e0289014) at siointr+0x1d
Xfastintr4(ebd13528,e3e12800,ebd13528,c8000040,e0e7e8c8) at Xfastintr4+0x24
biodone(ebd13528,ebd13528,ebd13528,c8000040,e3e08000) at biodone+0x2d0
dastrategy(ebd13528,200202b4,f2e0daa8,e018167d,f2e0dacc) at dastrategy+0xab
spec_strategy(f2e0dacc,f2e0dab4,e01e73a9,f2e0dacc,f2e0dad8) at spec_strategy+0x3e
spec_vnoperate(f2e0dacc,f2e0dad8,e016d46f,f2e0dacc,2000) at spec_vnoperate+0x15
ufs_vnoperatespec(f2e0dacc) at ufs_vnoperatespec+0x15
bwrite(ebd13528,f2e0daf0,e0171879,f2e0db34,f2e0dafc) at bwrite+0xaf
vop_stdbwrite(f2e0db34,f2e0dafc,e018167d,f2e0db34,f2e0db08) at vop_stdbwrite+0xe
vop_defaultop(f2e0db34,f2e0db08,e01e73a9,f2e0db34,f2e0db3c) at vop_defaultop+0x15
spec_vnoperate(f2e0db34,f2e0db3c,e016de03,f2e0db34,200) at spec_vnoperate+0x15
ufs_vnoperatespec(f2e0db34,200,ebd13528,1,0) at ufs_vnoperatespec+0x15
vfs_bio_awrite(ebd13528,200,a200a000,1,f2e00010) at vfs_bio_awrite+0x103
getnewbuf(f1cea900,d10050,0,0,2000) at getnewbuf+0x2ec
getblk(f1cea900,d10050,2000,0,0) at getblk+0x244
bread(f1cea900,d10050,2000,0,f2e0dc48) at bread+0x21
ffs_vget(e3e8c200,54ee7,f2e0dccc,f283ee40,f2e0df14) at ffs_vget+0x1bc
ufs_lookup(f2e0dd24,f2e0dd38,e017055c,f2e0dd24,f3009c47) at ufs_lookup+0x936
ufs_vnoperate(f2e0dd24,f3009c47,f283ee40,f2e0df14,0) at ufs_vnoperate+0x15
vfs_cache_lookup(f2e0dd80,f2e0dd90,e01729fd,f2e0dd80,f1c6ce00) at vfs_cache_lookup+0x248
ufs_vnoperate(f2e0dd80,f1c6ce00,f2e0df14,f2e0def0,0) at ufs_vnoperate+0x15
lookup(f2e0def0,0,f2e0df84,f2e0def0,7273752f) at lookup+0x2c1
namei(f2e0def0,0,f2e0df84,f2d5c840,286) at namei+0x133
vn_open(f2e0def0,3,584,f2d5c840,e0254064) at vn_open+0x1f6
open(f2d5c840,f2e0df84,dfbfd594,dfbfc7e0,dfbfbfe4) at open+0xad
syscall(27,27,dfbfbfe4,dfbfc7e0,dfbfc7b4) at syscall+0x187
Xint0x80_syscall() at Xint0x80_syscall+0x4c
db> panic
panic: from debugger
mp_lock = 01000002; cpuid = 1; lapic.id = 00000000
boot() called on cpu#1

syncing disks... 
-------------

The SCSI bus is freed at the wrong moment, probably because the device
is resetting.  The command is then retried, but is aborted again by
a selection timeout (indicating that the device had not finished
resetting).  This might be caused by bad firmware on the disk or by
a power supply that is too weak.  I assume it is bad firmware.

The VFS code is conservative: it does not want to throw away buffer
contents on fatal write errors, since that might corrupt the file
system if the error is transient.  Combined with the failure above,
this sometimes leads to the buffer queues filling up with dirty buffers
associated with the invalidated disk pack.

Combined with what appears to be a bug in the routine waitfreebuffers,
this could lead to an infinite busy loop in the kernel inside a
region of code protected by splbio().

>How-To-Repeat:

Use Quantum disks.


>Fix:
	
Index: vfs_bio.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.193.2.5
diff -u -r1.193.2.5 vfs_bio.c
--- vfs_bio.c	1999/04/20 19:54:20	1.193.2.5
+++ vfs_bio.c	1999/05/12 19:57:13
@@ -577,7 +577,8 @@
 	if (bp->b_flags & B_LOCKED)
 		bp->b_flags &= ~B_ERROR;
 
-	if ((bp->b_flags & (B_READ | B_ERROR)) == B_ERROR) {
+	if ((bp->b_flags & (B_READ | B_ERROR)) == B_ERROR &&
+		bp->b_error != ENXIO) {
 		bp->b_flags &= ~B_ERROR;
 		bdirty(bp);
 	} else if ((bp->b_flags & (B_NOCACHE | B_INVAL | B_ERROR | B_FREEBUF)) ||
@@ -1219,7 +1220,7 @@
 waitfreebuffers(int slpflag, int slptimeo) {
 	while (numfreebuffers < hifreebuffers) {
 		flushdirtybuffers(slpflag, slptimeo);
-		if (numfreebuffers < hifreebuffers)
+		if (numfreebuffers >= hifreebuffers)
 			break;
 		needsbuffer |= VFS_BIO_NEED_FREE;
 		if (tsleep(&needsbuffer, (PRIBIO + 4)|slpflag, "biofre", slptimeo))



>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->closed 
State-Changed-By: jkh 
State-Changed-When: Sun Aug 8 17:56:11 PDT 1999 
State-Changed-Why:  
This was just fixed in both active branches by the author. 
State-Changed-From-To: closed->open 
State-Changed-By: jkh 
State-Changed-When: Sun Aug 8 21:09:40 PDT 1999 
State-Changed-Why:  
According to Tor, this is only partially fixed and it's now down to 
the author of rev 1.196 to take it the rest of the way.  Assigned 
accordingly. 


Responsible-Changed-From-To: freebsd-bugs->dg 
Responsible-Changed-By: jkh 
Responsible-Changed-When: Sun Aug 8 21:09:40 PDT 1999 
Responsible-Changed-Why:  
Responsible-Changed-From-To: dg->gibbs 
Responsible-Changed-By: dg 
Responsible-Changed-When: Mon Oct 16 12:02:36 PDT 2000 
Responsible-Changed-Why:  
Justin is the SCSI expert. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=11697 
Responsible-Changed-From-To: gibbs->tegge 
Responsible-Changed-By: gibbs 
Responsible-Changed-When: Sat Feb 24 12:00:27 PST 2001 
Responsible-Changed-Why:  
Is this still an issue?  You probably know more about the VFS layer 
than I do. 

State-Changed-From-To: open->closed 
State-Changed-By: matteo 
State-Changed-When: Fri Aug 26 11:42:09 GMT 2005 
State-Changed-Why:  
feedback timeout 

>Unformatted:
