From nobody@FreeBSD.ORG  Sun Dec 12 18:52:28 1999
Return-Path: <nobody@FreeBSD.ORG>
Received: by hub.freebsd.org (Postfix, from userid 32767)
	id 4ED6114D2D; Sun, 12 Dec 1999 18:52:28 -0800 (PST)
Message-Id: <19991213025228.4ED6114D2D@hub.freebsd.org>
Date: Sun, 12 Dec 1999 18:52:28 -0800 (PST)
From: klh@netcom.com
Sender: nobody@FreeBSD.ORG
To: freebsd-gnats-submit@freebsd.org
Subject: Would be nice if the kernel could detect/report problems with SCSI tagged queueing
X-Send-Pr-Version: www-1.0

>Number:         15448
>Category:       kern
>Synopsis:       Would be nice if the kernel could detect/report problems with SCSI tagged queueing
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    ken
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Dec 12 19:00:02 PST 1999
>Closed-Date:    Sun Jan 9 15:21:25 PST 2000
>Last-Modified:  Sun Jan  9 15:23:17 PST 2000
>Originator:     Ken Harrenstien
>Release:        3.1-RELEASE
>Organization:
>Environment:
FreeBSD <hostname> 3.1-RELEASE FreeBSD 3.1-RELEASE #<n>: <buildstring>  i386
>Description:
I just spent a nerve-wracking day backing up some drives that I thought
were about to crash their little heads, only to finally discover that
the problem was a failure of SCSI command tagged queueing to work
properly.

I was very surprised that even though I was getting user program I/O
errors (from tar), the kernel gave me no feedback at all on the
console.  This was really mystifying.  I don't know enough about how
tagging works to know whether it's even feasible to detect when it's
not working -- but the kernel is clearly getting SOME kind of error
that it's relaying back to the user.

Would it be possible to make sure that I/O errors of this nature send
*something* to the console log?  (Actually, that's a good idea for
any sort of I/O error; I know most of them are reported OK).  This
would be a huge help tracking down potentially buggy drives; the effort
to zero in on this possibility is otherwise very time-consuming.

Thanks...


>How-To-Repeat:

>Fix:


>Release-Note:
>Audit-Trail:

From: "Kenneth D. Merry" <ken@kdm.org>
To: klh@netcom.com
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/15448: Would be nice if the kernel could detect/report problems with SCSI tagged queueing
Date: Sun, 12 Dec 1999 21:31:52 -0700

 On Sun, Dec 12, 1999 at 06:52:28PM -0800, klh@netcom.com wrote:
 > >Synopsis:       Would be nice if the kernel could detect/report problems with SCSI tagged queueing
 
 [ ... ]
 
 > FreeBSD <hostname> 3.1-RELEASE FreeBSD 3.1-RELEASE #<n>: <buildstring>  i386
 > >Description:
 > I just spent a nerve-wracking day backing up some drives that I thought
 > were about to crash their little heads, only to finally discover that
 > the problem was a failure of SCSI command tagged queueing to work
 > properly.
 > 
 > I was very surprised that even though I was getting user program I/O
 > errors (from tar), the kernel gave me no feedback at all on the
 > console.  This was really mystifying.  I don't know enough about how
 > tagging works to know whether it's even feasible to detect when it's
 > not working -- but the kernel is clearly getting SOME kind of error
 > that it's relaying back to the user.
 > 
 > Would it be possible to make sure that I/O errors of this nature send
 > *something* to the console log?  (Actually, that's a good idea for
 > any sort of I/O error; I know most of them are reported OK).  This
 > would be a huge help tracking down potentially buggy drives; the effort
 > to zero in on this possibility is otherwise very time-consuming.
 
 I don't know why the kernel didn't print out any errors, but you'll get a
 lot more information if you boot with the '-v' switch.  At the boot loader
 prompt, you can type:
 
 boot kernel -v
 
 to get the verbose boot messages.  You may need to increase your message
 buffer size, using the MSGBUF_SIZE kernel option (see LINT for details) to
 avoid overflowing the kernel's message buffer.
 
 Ken
 -- 
 Kenneth Merry
 ken@kdm.org
 

From: Ken Harrenstien <klh@netcom.com>
To: "Kenneth D. Merry" <ken@kdm.org>
Cc: klh@netcom.com, freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/15448: Would be nice if the kernel could detect/report
        problems with SCSI tagged queueing
Date: Tue, 14 Dec 99 0:37:04 PST

 > On Sun, Dec 12, 1999 at 06:52:28PM -0800, klh@netcom.com wrote:
 > > >Synopsis:       Would be nice if the kernel could detect/report problems with SCSI tagged queueing
 > 
 > [ ... ]
 > 
 > > FreeBSD <hostname> 3.1-RELEASE FreeBSD 3.1-RELEASE #<n>: <buildstring>  i386
 > > >Description:
 > > I just spent a nerve-wracking day backing up some drives that I thought
 > > were about to crash their little heads, only to finally discover that
 > > the problem was a failure of SCSI command tagged queueing to work
 > > properly.
 > > 
 > > I was very surprised that even though I was getting user program I/O
 > > errors (from tar), the kernel gave me no feedback at all on the
 > > console.  This was really mystifying.  I don't know enough about how
 > > tagging works to know whether it's even feasible to detect when it's
 > > not working -- but the kernel is clearly getting SOME kind of error
 > > that it's relaying back to the user.
 > > 
 > > Would it be possible to make sure that I/O errors of this nature send
 > > *something* to the console log?  (Actually, that's a good idea for
 > > any sort of I/O error; I know most of them are reported OK).  This
 > > would be a huge help tracking down potentially buggy drives; the effort
 > > to zero in on this possibility is otherwise very time-consuming.
 > 
 > I don't know why the kernel didn't print out any errors, but you'll get a
 > lot more information if you boot with the '-v' switch.  At the boot loader
 > prompt, you can type:
 > 
 > boot kernel -v
 
 I know about (and like) -v, but it doesn't make any difference.  Which
 is to say, there is no error output either way.  Just user-level I/O
 errors except when a page transfer fails, in which case the pager code
 then complains.  Never the CAM subsystem.
 
 One way of verifying this might be to test with a known broken drive
 after re-enabling tagged queueing, and see what happens in the way of
 error reporting.
 
 Unfortunately I don't have any of the ones in the current table or I
 could do that test.  Regardless of the actual cause of the I/O errors,
 it is still worrisome to me that there is no kernel log output at all.
 
 --Ken
 

From: "Kenneth D. Merry" <ken@kdm.org>
To: Ken Harrenstien <klh@netcom.com>
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/15448: Would be nice if the kernel could detect/report problems with SCSI tagged queueing
Date: Wed, 15 Dec 1999 00:19:13 -0700

 On Tue, Dec 14, 1999 at 00:37:04 -0800, Ken Harrenstien wrote:
 > > I don't know why the kernel didn't print out any errors, but you'll get a
 > > lot more information if you boot with the '-v' switch.  At the boot loader
 > > prompt, you can type:
 > > 
 > > boot kernel -v
 > 
 > I know about (and like) -v, but it doesn't make any difference.  Which
 > is to say, there is no error output either way.  Just user-level I/O
 > errors except when a page transfer fails, in which case the pager code
 > then complains.  Never the CAM subsystem.
 > 
 > One way of verifying this might be to test with a known broken drive
 > after re-enabling tagged queueing, and see what happens in the way of
 > error reporting.
 > 
 > Unfortunately I don't have any of the ones in the current table or I
 > could do that test.  Regardless of the actual cause of the I/O errors,
 > it is still worrisome to me that there is no kernel log output at all.
 
 I think the reason you're not seeing any kernel diagnostics is that the
 driver isn't reporting errors to the upper level code.  It may be that it
 is just silently failing to return some buffers or something.  Since we
 already know the Tekram AMD driver is broken (PR readers: see PR 15446), I
 suppose this isn't very surprising.
 
 In most cases, booting with the verbose switch turned on causes more SCSI
 diagnostics to be printed than would normally be printed.
 
 Anyway, why don't we leave this PR open, and you can verify that you get
 more diagnostics when you upgrade to the newer amd driver, and then we can
 close it.
 
 Ken
 -- 
 Kenneth Merry
 ken@kdm.org
 

From: Ken Harrenstien <klh@netcom.com>
To: "Kenneth D. Merry" <ken@kdm.org>
Cc: Ken Harrenstien <klh@netcom.com>,
	freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/15448: Would be nice if the kernel could detect/report
        problems with SCSI tagged queueing
Date: Sun, 9 Jan 2000 13:23:34 PST

 > Anyway, why don't we leave this PR open, and you can verify that you get
 > more diagnostics when you upgrade to the newer amd driver, and then we can
 > close it.
 
 Just an update.  I can't actually verify whether there are more
 diagnostics with the current Tekram driver, because I no longer have
 user-level I/O errors and I'm reluctant to deliberately generate them.
 The bug was more of a concern that this was evidence of a general
 loophole allowing I/O errors to be propagated up to the user without
 ever causing a kernel error message.
 
 I *could* try to arrange for a sacrificial system and do a number of
 horrible things to the bus.  But if the driver folk are confident that
 they've covered all the bases, that's good enough for me.
 
 --Ken
 

From: "Kenneth D. Merry" <ken@kdm.org>
To: Ken Harrenstien <klh@netcom.com>
Cc: freebsd-gnats-submit@FreeBSD.ORG
Subject: Re: kern/15448: Would be nice if the kernel could detect/report problems with SCSI tagged queueing
Date: Sun, 9 Jan 2000 16:07:11 -0700

 On Sun, Jan 09, 2000 at 13:23:34 -0800, Ken Harrenstien wrote:
 > > Anyway, why don't we leave this PR open, and you can verify that you get
 > > more diagnostics when you upgrade to the newer amd driver, and then we can
 > > close it.
 > 
 > Just an update.  I can't actually verify whether there are more
 > diagnostics with the current Tekram driver, because I no longer have
 > user-level I/O errors and I'm reluctant to deliberately generate them.
 > The bug was more of a concern that this was evidence of a general
 > loophole allowing I/O errors to be propagated up to the user without
 > ever causing a kernel error message.
 > 
 > I *could* try to arrange for a sacrificial system and do a number of
 > horrible things to the bus.  But if the driver folk are confident that
 > they've covered all the bases, that's good enough for me.
 
 I think things work well enough in general.  In any case, we've got a
 rewrite of the CAM error recovery code in the pipeline.  That should change
 things a little bit.
 
 So I'll go ahead and close this PR.
 
 Ken
 -- 
 Kenneth Merry
 ken@kdm.org
 
State-Changed-From-To: open->closed 
State-Changed-By: ken 
State-Changed-When: Sun Jan 9 15:21:25 PST 2000 
State-Changed-Why:  
PR submitter is satisfied that this isn't a big problem.  I don't think 
it's a big problem either.  


Responsible-Changed-From-To: freebsd-bugs->ken 
Responsible-Changed-By: ken 
Responsible-Changed-When: Sun Jan 9 15:21:25 PST 2000 
Responsible-Changed-Why:  
I'll handle this. 
>Unformatted:
