From gene@starkhome.cs.sunysb.edu  Sat Sep 21 21:35:32 1996
Received: from bsd7.cs.sunysb.edu (bsd7.cs.sunysb.edu [130.245.1.197])
          by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id VAA22411
          for <FreeBSD-gnats-submit@freebsd.org>; Sat, 21 Sep 1996 21:35:13 -0700 (PDT)
Received: (from uucp@localhost) by bsd7.cs.sunysb.edu (8.6.12/8.6.9) with UUCP id AAA00125 for FreeBSD-gnats-submit@freebsd.org; Sun, 22 Sep 1996 00:35:02 -0400
Received: (from gene@localhost) by starkhome.cs.sunysb.edu (8.7.5/8.6.9) id AAA00362; Sun, 22 Sep 1996 00:34:07 -0400 (EDT)
Message-Id: <199609220434.AAA00362@starkhome.cs.sunysb.edu>
Date: Sun, 22 Sep 1996 00:34:07 -0400 (EDT)
From: Gene Stark <gene@starkhome.cs.sunysb.edu>
Reply-To: gene@starkhome.cs.sunysb.edu
To: FreeBSD-gnats-submit@freebsd.org
Subject: ft driver hangs uninterruptably at "bavail"
X-Send-Pr-Version: 3.2

>Number:         1661
>Category:       kern
>Synopsis:       ft driver hangs uninterruptably at "bavail"
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 21 21:40:01 PDT 1996
>Closed-Date:    Mon Jan 18 11:52:43 PST 1999
>Last-Modified:  Mon Jan 18 12:00:54 PST 1999
>Originator:     Gene Stark
>Release:        FreeBSD 2.1-STABLE i386
>Organization:
>Environment:

	FreeBSD 2.1.5-STABLE.  Pentium 133MHz, ASUS M/B,
	Colorado 250 floppy tape drive running off controller
	on M/B.

	The problem was also experienced on a no-name 486DX/33MHz
	system.

>Description:

	The command:

		dump 0uf - /A | gzip -c | ft

	causes the tape to spin for a few seconds, after which the ft
	process hangs in uninterruptible sleep, with ps showing the
	wait channel "bavail".

	Without the gzip in the pipeline, the hang does not occur.

>How-To-Repeat:

		dump 0uf - /A | gzip -c | ft

>Fix:
	
	I am not very familiar with the ft driver, but here is my analysis
	after turning on the debugging printout and thinking about it for
	a while.  The messages from the test are at the end of this report.

	When the driver is opened, it does an initial read, followed
	by readahead that continues until the completion queue is
	filled up and the free queue is empty.  Then the tape is stopped.
	Shortly after this, the first blocks of data are ready to be written,
	but there are no buffers available as they have all been used
	up by readahead.  The driver then hangs in the sleep at line
	2279:

		/* Sleep until a buffer becomes available. */
		while (sp == NULL) {
			ftsleep(wc_buff_avail, 0);
			sp = segio_alloc(ft);
		}
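
	The starvation described above can be illustrated with a small
	user-space model.  This is only a sketch, not driver code: the
	struct and function names are hypothetical, and the pool size of
	8 is an assumption inferred from the nfree/ndone counters in the
	test messages at the end of this report.

```c
#include <assert.h>

#define NBUF 8	/* assumption: pool size inferred from the trace counters */

/* Hypothetical stand-in for the driver's buffer accounting. */
struct pool {
	int nfree;	/* buffers on the free queue */
	int ndone;	/* completed readahead buffers not yet consumed */
};

/* Take one buffer off the free queue; returns 1 on success, 0 if empty. */
static int
pool_alloc(struct pool *p)
{
	if (p->nfree == 0)
		return 0;
	p->nfree--;
	return 1;
}

/*
 * Readahead with no reader: every completed segment is parked on the
 * done queue.  Returns the number of segments read before the free
 * queue ran dry.
 */
static int
readahead_until_exhausted(struct pool *p)
{
	int n = 0;

	while (pool_alloc(p)) {
		p->ndone++;
		n++;
		assert(p->nfree + p->ndone == NBUF);
	}
	return n;
}

/* A write needs a free buffer; with none left it would sleep. */
static int
write_would_block(const struct pool *p)
{
	return p->nfree == 0;
}
```

	Starting from a pool with nfree=NBUF and ndone=0, calling
	readahead_until_exhausted() leaves nfree=0, and
	write_would_block() then reports that a writer would sleep.
	In this model, as in the analysis above, nothing ever moves a
	buffer back from the done queue, so the sleep never ends.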

	In my test, there is an "unexpected interrupt" message
	that occurs just before the hang.  It appears to be a completion
	interrupt from the seek operation that was started because
	the current position was unknown.  As near as I can tell,
	this is a separate problem that does not contribute to the hang:
	even if the interrupt were processed, I could not find any way
	that it would free the buffers in the completion queue and
	unwedge the write.

	I believe the hang does not occur when the gzip is omitted
	because the data arrives faster: the write is initiated before
	the readahead has had time to use up all the buffers, and
	starting the write cancels the readahead and frees the buffers.

	When I modified the driver by adding code to free any buffers
	in the completion queue before a write was started, the hang
	did not occur.  Somebody who understands the driver better than
	I do should check whether my analysis and fix are
	reasonable.
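
	The effect of draining the completion queue before a write can
	be sketched in user space as follows.  This is a simplified
	model under stated assumptions, not the driver itself: the
	struct layout, the model_* function names and the pool size of
	8 are all hypothetical, chosen only to mirror the shape of the
	change.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 8	/* assumption: pool size inferred from the trace counters */

/* Hypothetical, simplified segment buffer; the real one has more fields. */
struct seg {
	struct seg *next;
};

struct model {
	struct seg bufs[NBUF];
	struct seg *freeh;	/* head of the free list */
	struct seg *doneh;	/* head of the completion (done) list */
};

/* Set up the state at the hang: every buffer sits on the done list. */
static void
model_init_all_done(struct model *m)
{
	int i;

	m->freeh = NULL;
	m->doneh = NULL;
	for (i = 0; i < NBUF; i++) {
		m->bufs[i].next = m->doneh;
		m->doneh = &m->bufs[i];
	}
}

/* Move the buffer at the head of the done list to the free list. */
static void
model_free(struct model *m, struct seg *sp)
{
	assert(sp != NULL && sp == m->doneh);	/* model frees the head only */
	m->doneh = sp->next;
	sp->next = m->freeh;
	m->freeh = sp;
}

/* Take a buffer from the free list, or return NULL if it is empty. */
static struct seg *
model_alloc(struct model *m)
{
	struct seg *sp = m->freeh;

	if (sp != NULL) {
		m->freeh = sp->next;
		sp->next = NULL;
	}
	return sp;
}

/* Same shape as the fix: discard readahead, then allocate for the write. */
static struct seg *
model_write_alloc(struct model *m)
{
	while (m->doneh != NULL)
		model_free(m, m->doneh);
	return model_alloc(m);
}
```

	After model_init_all_done(), a plain model_alloc() returns
	NULL, while model_write_alloc() empties the done list and
	returns a buffer.  Like the change described above, the model
	simply discards the completed readahead buffers; whether
	anything in the real driver still needs them is part of what
	should be reviewed.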

	Here is a context diff:

*** ft.c.orig	Tue May 30 04:01:41 1995
--- ft.c	Sun Sep 22 00:04:43 1996
***************
*** 2260,2265 ****
--- 2260,2269 ----
    } else {
  	if (ft->segh != NULL && ft->segh->reqtype != FTIO_WRITING)
  		tape_inactive(ftu);
+ 
+ 	/* Clear readahead blocks from completion queue */
+ 	while (ft->doneh != NULL)
+ 	    segio_free(ft, ft->doneh);
  
  	/* Allocate a buffer and start tape if we're running low. */
  	sp = segio_alloc(ft);


(test messages follow)
-----------------
Sep 21 20:45:22 starkhome /kernel.test: tape_start start
Sep 21 20:45:22 starkhome /kernel.test: tape_recal start
Sep 21 20:45:23 starkhome /kernel.test: tape_recal end
Sep 21 20:45:23 starkhome /kernel.test: tape_start end
Sep 21 20:45:23 starkhome /kernel.test: ===> tape_cmd: 46
Sep 21 20:45:23 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:23 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:23 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:24 starkhome last message repeated 8 times
Sep 21 20:45:24 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:24 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:24 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:24 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:24 starkhome last message repeated 8 times
Sep 21 20:45:24 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:24 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:24 starkhome /kernel.test: ===> tape_cmd: 27
Sep 21 20:45:24 starkhome /kernel.test: ===> tape_cmd: 4
Sep 21 20:45:24 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:24 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:24 starkhome last message repeated 8 times
Sep 21 20:45:24 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:25 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:25 starkhome /kernel.test: ===> tape_cmd: 30
Sep 21 20:45:25 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:25 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:25 starkhome last message repeated 8 times
Sep 21 20:45:25 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:25 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:25 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:25 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:25 starkhome last message repeated 8 times
Sep 21 20:45:25 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:25 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:26 starkhome /kernel.test: ===> tape_cmd: 8
Sep 21 20:45:26 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:26 starkhome last message repeated 8 times
Sep 21 20:45:26 starkhome /kernel.test: qic_status returned $d0
Sep 21 20:45:26 starkhome /kernel.test: ftgetgeom report config got $00d0
Sep 21 20:45:26 starkhome /kernel.test: Tape format is QIC-80, length is 307.5/550
Sep 21 20:45:26 starkhome /kernel.test: ===> tape_cmd: 18
Sep 21 20:45:26 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:26 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:26 starkhome last message repeated 8 times
Sep 21 20:45:26 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:26 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:27 starkhome /kernel.test: ===> tape_cmd: 11
Sep 21 20:45:27 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:27 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:27 starkhome last message repeated 8 times
Sep 21 20:45:27 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:27 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:27 starkhome /kernel.test: ===> tape_cmd: 13
Sep 21 20:45:27 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:27 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:27 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:27 starkhome last message repeated 8 times
Sep 21 20:45:27 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:28 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:28 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:28 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:28 starkhome last message repeated 8 times
Sep 21 20:45:28 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:28 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:28 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:28 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:28 starkhome last message repeated 8 times
Sep 21 20:45:28 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:28 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:28 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:29 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:29 starkhome last message repeated 8 times
Sep 21 20:45:29 starkhome /kernel.test: qic_status returned $65
Sep 21 20:45:29 starkhome /kernel.test: tape_status got $0065
Sep 21 20:45:29 starkhome /kernel.test: segio_alloc: nfree=7 ndone=0 nreq=0
Sep 21 20:45:29 starkhome /kernel.test: segio_queue: nfree=7 ndone=0 nreq=1
Sep 21 20:45:29 starkhome /kernel.test: Starting read I/O chain
Sep 21 20:45:29 starkhome /kernel.test: async_read ******STARTING TAPE
Sep 21 20:45:29 starkhome /kernel.test: async_status got cmd = 6 nbits = 8
Sep 21 20:45:29 starkhome /kernel.test: async status got $0065 ($02cb)
Sep 21 20:45:29 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:29 starkhome /kernel.test: segio_done: (r) nfree=7 ndone=1 nreq=0
Sep 21 20:45:30 starkhome /kernel.test: segio_alloc: nfree=6 ndone=1 nreq=0
Sep 21 20:45:30 starkhome /kernel.test: segio_queue: nfree=6 ndone=1 nreq=1
Sep 21 20:45:30 starkhome /kernel.test: Processing readahead reqblk = 32
Sep 21 20:45:30 starkhome /kernel.test: segio_free: nfree=7 ndone=0 nreq=1
Sep 21 20:45:30 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:30 starkhome /kernel.test: segio_done: (r) nfree=7 ndone=1 nreq=0
Sep 21 20:45:30 starkhome /kernel.test: segio_alloc: nfree=6 ndone=1 nreq=0
Sep 21 20:45:30 starkhome /kernel.test: segio_queue: nfree=6 ndone=1 nreq=1
Sep 21 20:45:30 starkhome /kernel.test: Processing readahead reqblk = 64
Sep 21 20:45:30 starkhome /kernel.test: segio_free: nfree=7 ndone=0 nreq=1
Sep 21 20:45:31 starkhome /kernel.test: xd: st0:40 st1:20 st2:20 c:0 h:0 s:77 pos:76 want:76
Sep 21 20:45:31 starkhome /kernel.test: ft0: CRC error on block 76
Sep 21 20:45:31 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:31 starkhome /kernel.test: segio_done: (r) nfree=7 ndone=1 nreq=0
Sep 21 20:45:31 starkhome /kernel.test: segio_alloc: nfree=6 ndone=1 nreq=0
Sep 21 20:45:31 starkhome /kernel.test: segio_queue: nfree=6 ndone=1 nreq=1
Sep 21 20:45:31 starkhome /kernel.test: Processing readahead reqblk = 96
Sep 21 20:45:31 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:31 starkhome /kernel.test: segio_done: (r) nfree=6 ndone=2 nreq=0
Sep 21 20:45:31 starkhome /kernel.test: segio_alloc: nfree=5 ndone=2 nreq=0
Sep 21 20:45:31 starkhome /kernel.test: segio_queue: nfree=5 ndone=2 nreq=1
Sep 21 20:45:32 starkhome /kernel.test: Processing readahead reqblk = 128
Sep 21 20:45:32 starkhome /kernel.test: xd: st0:40 st1:20 st2:20 c:1 h:0 s:26 pos:153 want:153
Sep 21 20:45:32 starkhome /kernel.test: ft0: CRC error on block 153
Sep 21 20:45:32 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:32 starkhome /kernel.test: segio_done: (r) nfree=5 ndone=3 nreq=0
Sep 21 20:45:32 starkhome /kernel.test: segio_alloc: nfree=4 ndone=3 nreq=0
Sep 21 20:45:32 starkhome /kernel.test: segio_queue: nfree=4 ndone=3 nreq=1
Sep 21 20:45:32 starkhome /kernel.test: Processing readahead reqblk = 160
Sep 21 20:45:32 starkhome /kernel.test: xd: st0:40 st1:20 st2:20 c:1 h:0 s:43 pos:170 want:170
Sep 21 20:45:32 starkhome /kernel.test: ft0: CRC error on block 170
Sep 21 20:45:33 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:33 starkhome /kernel.test: segio_done: (r) nfree=4 ndone=4 nreq=0
Sep 21 20:45:33 starkhome /kernel.test: segio_alloc: nfree=3 ndone=4 nreq=0
Sep 21 20:45:33 starkhome /kernel.test: segio_queue: nfree=3 ndone=4 nreq=1
Sep 21 20:45:33 starkhome /kernel.test: Processing readahead reqblk = 192
Sep 21 20:45:33 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:33 starkhome /kernel.test: segio_done: (r) nfree=3 ndone=5 nreq=0
Sep 21 20:45:33 starkhome /kernel.test: segio_alloc: nfree=2 ndone=5 nreq=0
Sep 21 20:45:33 starkhome /kernel.test: segio_queue: nfree=2 ndone=5 nreq=1
Sep 21 20:45:33 starkhome /kernel.test: Processing readahead reqblk = 224
Sep 21 20:45:33 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:34 starkhome /kernel.test: segio_done: (r) nfree=2 ndone=6 nreq=0
Sep 21 20:45:34 starkhome /kernel.test: segio_alloc: nfree=1 ndone=6 nreq=0
Sep 21 20:45:34 starkhome /kernel.test: segio_queue: nfree=1 ndone=6 nreq=1
Sep 21 20:45:34 starkhome /kernel.test: Processing readahead reqblk = 256
Sep 21 20:45:34 starkhome /kernel.test: xd: st0:40 st1:20 st2:20 c:2 h:0 s:21 pos:276 want:276
Sep 21 20:45:34 starkhome /kernel.test: ft0: CRC error on block 276
Sep 21 20:45:34 starkhome /kernel.test: xd: st0:40 st1:20 st2:20 c:2 h:0 s:24 pos:279 want:279
Sep 21 20:45:34 starkhome /kernel.test: ft0: CRC error on block 279
Sep 21 20:45:34 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:34 starkhome /kernel.test: segio_done: (r) nfree=1 ndone=7 nreq=0
Sep 21 20:45:35 starkhome /kernel.test: segio_alloc: nfree=0 ndone=7 nreq=0
Sep 21 20:45:35 starkhome /kernel.test: segio_queue: nfree=0 ndone=7 nreq=1
Sep 21 20:45:35 starkhome /kernel.test: Processing readahead reqblk = 288
Sep 21 20:45:35 starkhome /kernel.test: Read done..  Cancel = 0
Sep 21 20:45:35 starkhome /kernel.test: segio_done: (r) nfree=0 ndone=8 nreq=0
Sep 21 20:45:35 starkhome /kernel.test: segio_alloc: nfree=0 ndone=8 nreq=0
Sep 21 20:45:35 starkhome /kernel.test: No more I/O.. Stopping.
Sep 21 20:45:35 starkhome /kernel.test: async_status got cmd = 6 nbits = 8
Sep 21 20:45:35 starkhome /kernel.test: async status got $0024 ($0249)
Sep 21 20:45:35 starkhome /kernel.test: async_status got cmd = 6 nbits = 8
Sep 21 20:45:35 starkhome /kernel.test: async status got $0024 ($0249)
Sep 21 20:45:36 starkhome /kernel.test: async_status got cmd = 6 nbits = 8
Sep 21 20:45:36 starkhome /kernel.test: async status got $0024 ($0249)
Sep 21 20:45:36 starkhome /kernel.test: async_status got cmd = 6 nbits = 8
Sep 21 20:45:36 starkhome /kernel.test: async status got $0024 ($0249)
Sep 21 20:45:36 starkhome /kernel.test: async_status got cmd = 6 nbits = 8
Sep 21 20:45:36 starkhome /kernel.test: async status got $0025 ($024b)
Sep 21 20:45:51 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:51 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:51 starkhome last message repeated 8 times
Sep 21 20:45:51 starkhome /kernel.test: qic_status returned $25
Sep 21 20:45:51 starkhome /kernel.test: tape_status got $0025
Sep 21 20:45:51 starkhome /kernel.test: ===> tape_cmd: 6
Sep 21 20:45:52 starkhome /kernel.test: ===> tape_cmd: 2
Sep 21 20:45:52 starkhome last message repeated 8 times
Sep 21 20:45:52 starkhome /kernel.test: qic_status returned $25
Sep 21 20:45:52 starkhome /kernel.test: tape_status got $0025
Sep 21 20:45:52 starkhome /kernel.test: segio_alloc: nfree=0 ndone=8 nreq=0
Sep 21 20:45:52 starkhome /kernel.test: Starting write I/O chain
Sep 21 20:45:52 starkhome /kernel.test: ft0: position unknown: lastpos:-2 ft->xblk:1952805733
Sep 21 20:45:52 starkhome /kernel.test: ft0: unexpected interrupt; st0 = $20 pcn = 20



>Release-Note:
>Audit-Trail:

From: Mike Pritchard <mpp>
To: freebsd-gnats-submit
Cc:
Subject: Re: kern/1661
Date: Fri, 21 Feb 1997 18:27:42 -0800 (PST)

 Feedback from the originator:
 
 Gene Stark wrote:
 > From gene@starkhome.cs.sunysb.edu Fri Feb 21 18:22:45 1997
 > Date: Fri, 21 Feb 1997 21:08:53 -0500 (EST)
 > From: Gene Stark <gene@starkhome.cs.sunysb.edu>
 > Message-Id: <199702220208.VAA18539@starkhome.cs.sunysb.edu>
 > To: freefall.freebsd.org!mpp@bsd7.cs.sunysb.edu
 > In-reply-to: Mike Pritchard's message of Fri, 21 Feb 1997 16:19:44 -0800 (PST) <199702220019.QAA06967@freefall.freebsd.org>
 > Subject: FreeBSD PR# 1661
 > 
 > >Are you still seeing the problem reported in your FreeBSD problem report?
 > >Have you tried later versions, such as FreeBSD 2.2 BETA or GAMMA
 > >or FreeBSD 3.0-current?  Any other information you could provide would
 > >be helpful.  If you are still seeing the problem, what version
 > >of FreeBSD are you running right now?
 > 
 > Yes, I am still seeing a similar problem, but not the particular one I
 > originally reported, as I have patched that.  The patch I supplied with my
 > bug report fixed the highly repeatable initial hang at the beginning
 > of a dump, but I still experience somewhat more erratic timing-dependent
 > hangs of a similar nature in many cases.  When I was looking through
 > the driver some months ago in generating the patch I sent in, I believe
 > I noted other race conditions in the driver as well, which are probably
 > what are getting exercised now.
 > 
 > The structure of the driver was such that it was very difficult to
 > trace through and determine at which points interrupts were enabled
 > and disabled.  My conclusion was that the easiest thing to do might be
 > to restructure the driver so this was more clear.  Unfortunately, I didn't
 > have the time to do this.  In its current state, the driver is pretty
 > useless to me because more often than not a long dump will hang.
 > 
 > The latest system I have tried it on is 2.1.5.  However it looks like
 > the problems were in the driver itself, so unless the driver has been fixed,
 > I would expect similar problems to exist in later FreeBSD versions.
 > 
 > 							- Gene Stark
 > 
 
 
 -- 
 Mike Pritchard
 mpp@FreeBSD.org
 "Go that way.  Really fast.  If something gets in your way, turn"

From: Gene Stark <gene@starkhome.cs.sunysb.edu>
To: freebsd-gnats-submit@freebsd.org
Cc:
Subject: kern/1661
Date: Sun, 22 Mar 1998 23:42:46 -0500 (EST)

 I received E-mail asking me to comment on whether I am still
 experiencing the problem described in kern/1661.
 
 I cannot comment on the 2.2.6-beta, as I can't afford to install that
 on my system to check it out.  I am currently running a stock version of:
 
 	FreeBSD starkhome.cs.sunysb.edu 2.2.5-RELEASE FreeBSD 2.2.5-RELEASE #0: Sat Nov  1 22:14:33 EST 1997     gene@starkhome.cs.sunysb.edu:/A/src/sys/compile/STARK  i386
 
 I can confirm that the problem still exists in that version.
 After running the command indicated in my initial problem report,
 in the syslog I get:
 
 	Mar 22 23:35:15 starkhome /kernel: ft0: unexpected interrupt; st0 = $20 pcn = 19
 
 and
 
 	% ps wwalx | fgrep ft
 	    0  8966  8955   6  -6  0   264  144 bavail D+    p3    0:00.50 ft
 
 I don't see any change since my original problem report against
 2.1.5-STABLE.
 
 I am doing my backups mostly to SCSI DAT tape now, and have not been
 using the Colorado 250MB tape drive, though it is still installed.
 
 							- Gene Stark
State-Changed-From-To: open->suspended 
State-Changed-By: phk 
State-Changed-When: Tue May 19 02:36:20 PDT 1998 
State-Changed-Why:  

see also 1331 6652 

awaiting committer 
State-Changed-From-To: suspended->closed 
State-Changed-By: rnordier 
State-Changed-When: Mon Jan 18 11:52:43 PST 1999 
State-Changed-Why:  
The ft driver is no longer supported.
>Unformatted:
