From nobody@FreeBSD.org  Thu Feb  7 13:36:37 2002
Return-Path: <nobody@FreeBSD.org>
Received: from freefall.freebsd.org (freefall.FreeBSD.org [216.136.204.21])
	by hub.freebsd.org (Postfix) with ESMTP id AE3E437B425
	for <freebsd-gnats-submit@FreeBSD.org>; Thu,  7 Feb 2002 13:36:36 -0800 (PST)
Received: (from nobody@localhost)
	by freefall.freebsd.org (8.11.6/8.11.6) id g17Laas84335;
	Thu, 7 Feb 2002 13:36:36 -0800 (PST)
	(envelope-from nobody)
Message-Id: <200202072136.g17Laas84335@freefall.freebsd.org>
Date: Thu, 7 Feb 2002 13:36:36 -0800 (PST)
From: Craig <craig@avnet.co.uk>
To: freebsd-gnats-submit@FreeBSD.org
Subject: frequent system stall under moderate scsi load
X-Send-Pr-Version: www-1.0

>Number:         34711
>Category:       kern
>Synopsis:       frequent system stall under moderate scsi load
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    ceri
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Feb 07 13:40:00 PST 2002
>Closed-Date:    Sun Jun 08 11:02:44 PDT 2003
>Last-Modified:  Sun Jun 08 11:02:44 PDT 2003
>Originator:     Craig
>Release:        4.1
>Organization:
none
>Environment:
>Description:
The OS appears to hang fairly frequently under moderate load.
All connections get stalled, and sometimes dropped.
Occasionally the server reboots itself.

I initially thought it might be a file-locking problem with the bulletin board software, but I think it is more fundamental than that.
>How-To-Repeat:
I cannot provide a way to repeat this, but having looked through previous reports, it could be related to large file movement.

The server does nothing except serve two UBB forums.
If more than a couple of people try to post at once, the process queue fills up with perl PIDs and the problem starts.
>Fix:
      
>Release-Note:
>Audit-Trail:

From: David Malone <dwmalone@maths.tcd.ie>
To: Craig <craig@avnet.co.uk>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/34711: frequent system stall under moderate scsi load
Date: Thu, 7 Feb 2002 22:45:29 +0000

 On Thu, Feb 07, 2002 at 01:36:36PM -0800, Craig wrote:
 > >Description:
 > The OS appears to hang fairly frequently under moderate load.
 > All connections get stalled, and sometimes dropped.
 > Occasionally the server reboots itself.
 
 I'm afraid we'll need significantly more details to make
 any progress on this. Are any log messages produced? (Check
 in /var/log)
 
 If you can get the output of "top" when processes are stuck
 it may provide some insight into what they are waiting for.
 
 	David.

From: "Craig Stratton" <craig@avnet.co.uk>
To: "David Malone" <dwmalone@maths.tcd.ie>
Cc: <freebsd-gnats-submit@freebsd.org>
Subject: Re: kern/34711: frequent system stall under moderate scsi load 
Date: Fri, 8 Feb 2002 13:40:41 -0000

 David
 
 had top running and it froze up with this:-
 
 last pid: 28865;  load averages: 29.69, 45.35, 74.62  up 0+23:26:24  13:27:02
 150 processes: 59 running, 86 sleeping, 5 zombie
 CPU states: 49.3% user,  0.0% nice, 31.2% system,  1.3% interrupt, 18.3% idle
 Mem: 35M Active, 5752K Inact, 18M Wired, 116K Cache, 14M Buf, 404K Free
 Swap: 256M Total, 86M Used, 170M Free, 33% Inuse, 12M In, 25M Out
 
   PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 28782 nobody   -14   0  5172K  1432K inode    0:03  1.91%  1.90% perl
 28757 nobody    -2   0  5248K  1224K getblk   0:07  1.56%  1.56% perl
 28780 nobody   -14   0  5216K  1300K inode    0:04  1.56%  1.56% perl
 28733 nobody    -6   0  5296K  1240K biord    0:09  1.51%  1.51% perl
 28761 nobody    -2   0  5244K  1260K getblk   0:06  1.46%  1.46% perl
 28793 nobody   -14   0  5208K  1292K inode    0:04  1.47%  1.46% perl
 28731 nobody   -14   0  5296K  1232K inode    0:09  1.37%  1.37% perl
 28743 nobody    -6   0  5248K  1240K biord    0:08  1.32%  1.32% perl
 28758 nobody    -6   0  7248K  1316K biord    0:07  1.32%  1.32% perl
 28773 nobody    -6   0  5244K  1324K biord    0:05  1.27%  1.27% perl
 28788 nobody    -6   0  5172K  1328K biord    0:04  1.22%  1.22% perl
 
 
 and then started and stopped again with this
 
 
 
 last pid: 28911;  load averages: 42.13, 44.43, 70.22  up 0+23:28:56  13:29:34
 185 processes: 94 running, 84 sleeping, 7 zombie
 CPU states: 42.2% user,  0.0% nice, 33.8% system,  1.5% interrupt, 22.5% idle
 Mem: 35M Active, 5556K Inact, 19M Wired, 72K Cache, 14M Buf, 404K Free
 Swap: 256M Total, 107M Used, 148M Free, 41% Inuse, 36M In, 51M Out
 
   PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 28780 nobody   -18   0  5296K  1368K RUN      0:09  4.83%  4.83% perl
 28782 nobody   -18   0  5244K  1280K RUN      0:08  4.35%  4.35% perl
 28793 nobody   -18   0  5292K  1360K RUN      0:09  4.30%  4.30% perl
 28794 nobody   -18   0  5292K  1356K RUN      0:09  4.30%  4.30% perl
 28764 nobody   -18   0  5292K  1272K RUN      0:11  4.05%  4.05% perl
 28788 nobody   -18   0  5244K  1280K RUN      0:08  3.81%  3.81% perl
 28773 nobody   -18   0  5292K  1368K RUN      0:10  3.47%  3.47% perl
 17390 root     -18   0  2092K   404K RUN      8:41  1.29%  0.88% top
 28903 nobody   -18   0  2808K  1856K RUN      0:01  0.79%  0.78% perl
 28892 nobody   -18   0  2884K  2100K RUN      0:01  0.59%  0.59% perl
 28881 nobody   -18   0  2756K  1728K RUN      0:01  0.44%  0.44% perl
 28907 nobody   -18   0  1924K  1260K RUN      0:00  0.51%  0.44% perl
 28908 nobody   -18   0  1776K  1108K RUN      0:00  0.17%  0.15% perl
 28898 nobody   -18   0  2856K  1756K RUN      0:01  0.10%  0.10% perl
 28899 nobody   -18   0  2644K  1656K RUN      0:01  0.10%  0.10% perl
 28904 nobody   -18   0  2368K  1680K RUN      0:01  0.10%  0.10% perl
 28800 nobody   -18   0  3972K  1924K RUN      0:02  0.88%  0.88% perl
 
 
 The data centre guys don't have a console to put on it!? :(
 
 
 Regards
 Craig
 
 ----- Original Message -----
 From: "David Malone" <dwmalone@maths.tcd.ie>
 To: "Craig Stratton" <craig@avnet.co.uk>
 Sent: Friday, February 08, 2002 10:47 AM
 Subject: Re: kern/34711: frequent system stall under moderate scsi load
 
 
 > > Is there anything in particular i should look out for or check ?
 >
 > If you watch the "state" field in top it will tell you where in the
 > kernel processes are waiting - this can be a good clue. You can
 > sometimes get the same information about the current process by
 > pressing control-T.
 >
 > If you need to get the data center guys to press the reset button
 > you could ask them if there are any messages on the console.
 >
 > If you follow up with any info you get and cc
 > freebsd-gnats-submit@freebsd.org then you'll have a reasonable chance
 > of someone spotting something useful.
 >
 > David.
 

From: David Malone <dwmalone@maths.tcd.ie>
To: Craig Stratton <craig@avnet.co.uk>
Cc: freebsd-gnats-submit@freebsd.org
Subject: Re: kern/34711: frequent system stall under moderate scsi load 
Date: Fri, 08 Feb 2002 13:48:26 +0000

 All the processes are in a state where they are waiting for the
 disk (inode, biord and getblk). The percentage system time is kinda
 high too - is that typical of this system?
 
 Can you have a look at the output of dmesg and see if there is
 anything mentioned around the stall time. Also, what type of SCSI
 card do you have? The relevant lines from /var/log/dmesg.boot might
 be useful.
 
 It is possible that the system is just running low on memory and
 getting stuck trying to free it up. Matt Dillon made some improvements
 in this area, which probably came in since 4.1...
 
 	David.
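
The observation above (every listed process waiting on the disk) can be checked mechanically. The following is a minimal sketch, not part of the original exchange: it tallies the STATE column of the process rows copied from the first top snapshot in this PR. The function name `tally_states` and the assumed column order (PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND) are illustrative assumptions based on the snapshot shown here; other top versions may lay out their columns differently.

```python
from collections import Counter

# Process rows copied verbatim from the first top snapshot in this PR.
TOP_ROWS = """\
28782 nobody   -14   0  5172K  1432K inode    0:03  1.91%  1.90% perl
28757 nobody    -2   0  5248K  1224K getblk   0:07  1.56%  1.56% perl
28780 nobody   -14   0  5216K  1300K inode    0:04  1.56%  1.56% perl
28733 nobody    -6   0  5296K  1240K biord    0:09  1.51%  1.51% perl
28761 nobody    -2   0  5244K  1260K getblk   0:06  1.46%  1.46% perl
28793 nobody   -14   0  5208K  1292K inode    0:04  1.47%  1.46% perl
28731 nobody   -14   0  5296K  1232K inode    0:09  1.37%  1.37% perl
28743 nobody    -6   0  5248K  1240K biord    0:08  1.32%  1.32% perl
28758 nobody    -6   0  7248K  1316K biord    0:07  1.32%  1.32% perl
28773 nobody    -6   0  5244K  1324K biord    0:05  1.27%  1.27% perl
28788 nobody    -6   0  5172K  1328K biord    0:04  1.22%  1.22% perl
"""


def tally_states(rows: str) -> Counter:
    """Count how many processes are in each STATE (hypothetical helper)."""
    counts = Counter()
    for line in rows.splitlines():
        fields = line.split()
        # Assumed layout: STATE is the 7th whitespace-separated field.
        if len(fields) >= 11 and fields[0].isdigit():
            counts[fields[6]] += 1
    return counts


if __name__ == "__main__":
    for state, n in tally_states(TOP_ROWS).most_common():
        print(f"{state:8s} {n}")
```

For this snapshot the tally is 5 processes in biord, 4 in inode and 2 in getblk, so none of the listed processes is runnable.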

From: "Craig Stratton" <craig@avnet.co.uk>
To: "David Malone" <dwmalone@maths.tcd.ie>
Cc: <freebsd-gnats-submit@freebsd.org>
Subject: Re: kern/34711: frequent system stall under moderate scsi load 
Date: Fri, 8 Feb 2002 14:34:03 -0000

  David,
 
 I'm not sure if the system time is usual or not. But it sounds like it
 shouldn't be...
 
 It could be a memory problem, as you say, because it only has 64MB, but the
 previous system I ran exactly the same things on only had 16MB on a P75 IDE
 and had no problems. Uptime over 500 days before replacement with "better"
 hardware :-|
 Can't remember what release it was on though. Slightly earlier, I believe.
 
 There is never anything logged anywhere whenever the system stalls or hangs,
 which is frustrating...
 
 I think the problem is certainly in the swapping/disk access area, as all
 disk operations seem to take a while. When it hangs, I can get an ssh login,
 but no shell until it comes back.
 
 I was looking over how to upgrade "in situ" last night, but was too tired to
 take it in.
 Can you advise how I would go about bringing it up to date online with no
 physical access to the machine? (other than carefully .. ;-) )
 
 The most I do normally is install from scratch, and add/configure
 software/applications....
 
 ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0x6000-0x60ff mem
 0xe4000000-0xe4000fff irq 10 at device 9.0 on pci0
 ahc0: aic7880 Wide Channel A, SCSI Id=7, 16/255 SCBs
 pci0: <S3 Trio graphics accelerator> at 10.0 irq 9
 
 da1 at ahc0 bus 0 target 1 lun 0
 da1: <QUANTUM FIREBALL ST4.3S 0F0C> Fixed Direct Access SCSI-2 device
 da1: 10.000MB/s transfers (10.000MHz, offset 15), Tagged Queueing Enabled
 da1: 4136MB (8471232 512 byte sectors: 255H 63S/T 527C)
 da0 at ahc0 bus 0 target 0 lun 0
 da0: <QUANTUM FIREBALL ST4.3S 0F0C> Fixed Direct Access SCSI-2 device
 da0: 10.000MB/s transfers (10.000MHz, offset 15), Tagged Queueing Enabled
 da0: 4136MB (8471232 512 byte sectors: 255H 63S/T 527C)
 
 Regards,
 Craig
 
State-Changed-From-To: open->feedback 
State-Changed-By: njl 
State-Changed-When: Fri Aug 16 00:20:39 PDT 2002 
State-Changed-Why:  


http://www.freebsd.org/cgi/query-pr.cgi?pr=34711 

From: Nate Lawson <nate@root.org>
To: freebsd-gnats-submit@FreeBSD.org, Craig <craig@avnet.co.uk>
Cc:  
Subject: Re: kern/34711: frequent system stall under moderate scsi load
Date: Fri, 16 Aug 2002 00:22:30 -0700 (PDT)

 It's been a while since there has been progress on this bug.  Have things
 improved for you?
 
 -Nate
 

From: "Craig Stratton" <craig@avnet.co.uk>
To: "Nate Lawson" <nate@root.org>
Cc:  
Subject: Re: kern/34711: frequent system stall under moderate scsi load
Date: Fri, 16 Aug 2002 09:20:45 +0100

 Hi Nate,
 
 The server ran two busy bulletin boards using UBB 5.45 which is all file
 based.
 One of the boards has been moved to another server, which has made the
 situation better operationally, but the problem still occurs under heavy
 load.
 
 It seems to be when too many users post to the same thread and the UBB is
 trying to regenerate the page, the file handling gets confused and locks
 itself up.
 
 Sometimes it sorts itself out after a while, where I think one of the
 threads probably gets killed off or times out and dies, and it seems to
 unblock itself. Other times it either reboots itself, or i have to do it.
 
 I had not progressed the problem any further as the situation had improved
 after removing one of the bulletin boards, and the hardware/software should
 be replaced this month.
 
 The server is only a P166 with 64MB, but i have seen heavier load on similar
 spec servers that has been handled fine. The same board used to run on a P75
 with 16MB, which never had any problems at all.
 
 I am fairly certain it is a SCSI or file system problem, but it could be a
 physical hardware problem.
 No error messages are generated when it runs into problems, or none are
 logged.
 
 Regards,
 Craig
 
 

From: Nate Lawson <nate@root.org>
To: freebsd-gnats-submit@freebsd.org
Cc:  
Subject: Re: kern/34711: frequent system stall under moderate scsi load
Date: Fri, 11 Oct 2002 18:22:39 -0700 (PDT)

 Are there any messages on console when the system hangs?  What is the
 output of camcontrol tags da0 (and da1)?
 
 -Nate
 
State-Changed-From-To: feedback->closed 
State-Changed-By: ceri 
State-Changed-When: Sun Jun 8 11:02:41 PDT 2003 
State-Changed-Why:  
Feedback timeout (6 months or more). 
I will handle any feedback that this closure generates. 


Responsible-Changed-From-To: freebsd-bugs->ceri 
Responsible-Changed-By: ceri 
Responsible-Changed-When: Sun Jun 8 11:02:41 PDT 2003 
Responsible-Changed-Why:  
Feedback timeout (6 months or more). 
I will handle any feedback that this closure generates. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=34711 
>Unformatted:
