From webadmin@firstcallgroup.co.uk  Sat Apr 13 10:20:48 2002
Return-Path: <webadmin@firstcallgroup.co.uk>
Received: from mailhost.firstcallgroup.co.uk (firewall.firstcallgroup.co.uk [193.133.202.241])
	by hub.freebsd.org (Postfix) with ESMTP id B7A8737B400
	for <FreeBSD-gnats-submit@freebsd.org>; Sat, 13 Apr 2002 10:20:47 -0700 (PDT)
Received: from webadmin by mailhost.firstcallgroup.co.uk with local (Exim 3.35 #1)
	id 16wRCb-0001d4-00
	for FreeBSD-gnats-submit@freebsd.org; Sat, 13 Apr 2002 18:20:41 +0100
Message-Id: <E16wRCb-0001d4-00@mailhost.firstcallgroup.co.uk>
Date: Sat, 13 Apr 2002 18:20:41 +0100
From: Web and Middleware Administrator <webadmin@firstcallgroup.co.uk>
Reply-To: Web and Middleware Administrator <webadmin@firstcallgroup.co.uk>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: Latest stable causes SCSI bus freeze on sym0 when running SMP
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         37043
>Category:       kern
>Synopsis:       Latest stable causes SCSI bus freeze on sym0 when running SMP
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Apr 13 10:30:01 PDT 2002
>Closed-Date:    Fri Aug 16 13:16:52 PDT 2002
>Last-Modified:  Thu Aug 22 15:10:03 PDT 2002
>Originator:     Pete French <pfrench@firstcallgroup.co.uk>
>Release:        FreeBSD 4.5-STABLE i386
>Organization:
Seatem UK Limited
>Environment:
System: FreeBSD tixlink1.firstcallgroup.co.uk 4.5-STABLE FreeBSD 4.5-STABLE #0: Fri Apr 12 14:13:48 BST 2002 webadmin@tixlink1.firstcallgroup.co.uk:/usr/obj/usr/src/sys/TIXLINK1 i386

	Machine is a Compaq Proliant server. SMP machine with two 550MHz
	Pentium III processors. Onboard Symbios SCSI controller driving
	a pair of 9.1GB Compaq UW drives. Second SCSI controller attached to
	a tape drive. 256MB of memory, ThunderLAN 1000MBit ether interface.

>Description:

	After updating to the latest -STABLE (12 April 2002) the machine
	now freezes with the error message:

	(noperiph:sym0:0:-1:-1): SCSI BUS reset detected

	It will sometimes then recover from this and continue, but will
	occasionally freeze completely. The machine ran fine on 4.5-STABLE
	from just after 4.5-RELEASE, and this has only shown up on the
	latest upgrade.

	The problem appears to be related to heavy disc activity involving
	both drives at once. It also *only* occurs under SMP. Running
	a non-SMP kernel solves the problem.

>How-To-Repeat:

	Under my setup I can easily cause this to happen by using two
	copy commands one after the other:

	cp src/cgi-bin/*.exe /usr/local/www/beta/cgi-bin/
	cp src/cgi-bin/*.exe /usr/local/www/live/cgi-bin/

	The second copy command freezes almost instantly. In this case
	the source of the files and the destination are on separate discs.

>Fix:

	The easiest workaround is to run the machine non-SMP. This is
	not acceptable long term, however, and the machine has run happily
	as an SMP box until now.
>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->feedback 
State-Changed-By: njl 
State-Changed-When: Fri Aug 16 00:33:43 PDT 2002 
State-Changed-Why:  
Are you still having trouble?  This sounds like an SMP problem. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=37043 

From: Nate Lawson <nate@root.org>
To: freebsd-gnats-submit@FreeBSD.org
Cc:  
Subject: Re: kern/37043: Latest stable causes SCSI bus freeze on sym0 when
 running SMP 
Date: Fri, 16 Aug 2002 11:41:41 -0700 (PDT)

 Followup directly from user, resent.
 
 ---------- Forwarded message ----------
 Date: Fri, 16 Aug 2002 10:40:44 +0100
 From: Pete French <pfrench@firstcallgroup.co.uk>
 To: freebsd-bugs@FreeBSD.org, holger.kipp@alogis.com, njl@FreeBSD.org,
      webadmin@firstcallgroup.co.uk
 Subject: Re: kern/37043: Latest stable causes SCSI bus freeze on sym0 when
     running SMP
 
 [note to Holger - I've copied you in because this is the same problem
  you were having with the fxp0 hangs, this is in relation to my original PR]
 
 > Synopsis: Latest stable causes SCSI bus freeze on sym0 when running SMP
 > Are you still having trouble?  This sounds like an SMP problem.
 > http://www.freebsd.org/cgi/query-pr.cgi?pr=37043
 
 There was a discussion on -STABLE about this. The cause of the
 hanging was identified as sym0 ceasing to service interrupts. The author of
 that driver gave us a patch to detect this and poll for interrupts, which
 showed up the problem. I don't know if that's been committed or not, but it
 was always more of a workaround than a fix - polling a SCSI interface because
 the interrupt servicing has died under suspicious circumstances is hardly
 ideal after all :-) (though the patch was extremely welcome as it did make
 the machine usable again)
 
 More investigation showed that the problem only occurs when interrupts are
 shared on the machine - several other people had the same problem sharing
 interrupts between ether cards and scsi cards. In my case the interrupt
 was shared with the ata interface. The Compaq BIOS does not let me fix
 this, but as I do not have any ATA devices I simply removed the ata driver
 from the kernel.
 
 This appears to have fixed the problem for me.
 
 Unsure what conclusions to draw - consensus seems to be that there is some
 problem with shared interrupts (as opposed to the sym0 driver) which only
 manifests itself on an SMP system. That's about as much information as I
 can give you I'm afraid. The best other person to talk to about this would
 be holger.kipp@alogis.com who put in most of the effort at finding a fix
 for this. I have copied him in on this reply.
 
 cheers,
 
 -pete french.
 
State-Changed-From-To: feedback->closed 
State-Changed-By: njl 
State-Changed-When: Fri Aug 16 13:14:57 PDT 2002 
State-Changed-Why:  
Workaround is to not share interrupts between ATA and SCSI controllers. 
This is not a complete fix so we should revisit this if others have the 
same trouble in the future. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=37043 

From: Nate Lawson <nate@root.org>
To: freebsd-gnats-submit@freebsd.org
Cc:  
Subject: Re: kern/37043: Latest stable causes SCSI bus freeze on sym0 when
 running SMP
Date: Thu, 22 Aug 2002 15:06:14 -0700 (PDT)

 Info provided by Gerard.
 
 ---------- Forwarded message ----------
 Date: Mon, 19 Aug 2002 01:18:22 +0200 (CEST)
 From: Gérard Roudier <groudier@free.fr>
 To: Pete French <pfrench@firstcallgroup.co.uk>
 Cc: freebsd-bugs@FreeBSD.ORG, njl@FreeBSD.ORG, webadmin@firstcallgroup.co.uk
 Subject: Re: kern/37043: Latest stable causes SCSI bus freeze on sym0 when
     running SMP
 
 
 On Sat, 17 Aug 2002, Pete French wrote:
 > > Synopsis: Latest stable causes SCSI bus freeze on sym0 when running SMP
 > > State-Changed-From-To: feedback->closed
 > > State-Changed-By: njl
 > > State-Changed-When: Fri Aug 16 13:14:57 PDT 2002
 > > State-Changed-Why:
 >
 > > Workaround is to not share interrupts between ATA and SCSI controllers.
 > > This is not a complete fix so we should revisit this if others have the
 > > same trouble in the future.
 >
 > It's not specifically ATA controllers - everyone else who had the problem
 > was sharing interrupts with Ethernet adapters as I recall. But the fix
 > does work.
 
 It is indeed not a fix, but a last-chance workaround.
 (The patch against sym is at the end of this email)
 
 Basically, the code tries to detect an interrupt stall and if such seems
 to happen, it installs the work-around that just polls the interrupt
 status of the chip 100 times per second.
 
 Btw, the ncr driver had this hardcoded from day one, but I disliked it
 because it can hide severe hardware or software flaws.
 
 As PCI interrupt triggering relies on level sensitive logic, an interrupt
 stall should never happen. The risk is rather an interrupt storm if any
 interrupt condition is not properly handled by software.
 
 IMO, if an interrupt stall happens in PCI, then the cause can be either a
 flawed/misconfigured piece of hardware that doesn't implement the correct
 triggering or a software bug that leaves the interrupt masked somewhere.
 
 (IIRC, the problem didn't show up with IO/APIC but only happened using
 the legacy interrupt controller.)
 
 Maybe users that get their system fixed by this work-around in sym
 should also report a description of their system hardware and software.
 This may help find out where the flaw actually is.
 
 
 Regards,
   Gérard.
 
 PS: The first line of the patch, i.e.:
 
 +#define SYM_CONF_HANDLE_INTR_STALL
 
 should be removed if this code turns out to be worth committing, so that
 the workaround can be compiled conditionally.
 
 --------------------- PATCH --------------------------
 
 --- sym_hipd.c.orig	Sun Jun  9 18:37:50 2002
 +++ sym_hipd.c	Sun Jun  9 16:36:07 2002
 @@ -1,3 +1,7 @@
 +#define SYM_CONF_HANDLE_INTR_STALL
 +#if 0
 +#define DEBUG_INTR_STALL
 +#endif
  /*
   *  Device driver optimized for the Symbios/LSI 53C896/53C895A/53C1010
   *  PCI-SCSI controllers.
 @@ -1922,6 +1926,17 @@
  	struct sym_tblmove abrt_tbl;	/* Table for the MOV of it 	*/
  	struct sym_tblsel  abrt_sel;	/* Sync params for selection	*/
  	u_char		istat_sem;	/* Tells the chip to stop (SEM)	*/
 +
 +#ifdef SYM_CONF_HANDLE_INTR_STALL
 +	int stall_state;	/* State of the algorithm */
 +	int stall_count;	/* Number of intr stall observed */
 +	u_long intr_count;	/* Real interrupt counter */
 +	u_long intr_prevc;	/* Previous counter seen from clock handler */
 +	u_long clock_curr;	/* Our clock in ticks */
 +	u_long clock_stall;	/* Clock value at a possible stall */
 +	struct callout_handle clock_ch;/* Kernel timer alchemy :) */
 +#define SYM_CLOCK_TICK	((hz+99)/100)
 +#endif
  };
 
  #define HCB_BA(np, lbl)	    (np->hcb_ba      + offsetof(struct sym_hcb, lbl))
 @@ -2513,6 +2528,10 @@
  static void sym_nvram_setup_target (hcb_p np, int targ, struct sym_nvram *nvp);
  static int sym_read_nvram (hcb_p np, struct sym_nvram *nvp);
 
 +#ifdef SYM_CONF_HANDLE_INTR_STALL
 +static void sym_clock_handler(void *arg);
 +#endif
 +
  /*
   *  Print something which allows to retrieve the controler type,
   *  unit, target, lun concerned by a kernel message.
 @@ -4216,6 +4235,9 @@
  static void sym_intr(void *arg)
  {
  	if (DEBUG_FLAGS & DEBUG_TINY) printf ("[");
 +#ifdef SYM_CONF_HANDLE_INTR_STALL
 +	++((hcb_p)arg)->intr_count;
 +#endif
  	sym_intr1((hcb_p) arg);
  	if (DEBUG_FLAGS & DEBUG_TINY) printf ("]");
  	return;
 @@ -9509,6 +9531,13 @@
  		goto attach_failed;
 
  	/*
 +	 * No comments for this one. :)
 +	 */
 +#ifdef SYM_CONF_HANDLE_INTR_STALL
 +	np->clock_ch = timeout(sym_clock_handler, (caddr_t)np, SYM_CLOCK_TICK);
 +#endif
 +
 +	/*
  	 *  Sigh! we are done.
  	 */
  	return 0;
 @@ -10410,3 +10439,91 @@
  }
 
  #endif	/* SYM_CONF_NVRAM_SUPPORT */
 +
 +#ifdef SYM_CONF_HANDLE_INTR_STALL
 +/*
 + * The below code tries to detect interrupt stalls.
 + *
 + * It assumes that an interrupt condition raised
 + * in the chip interrupt status that is not serviced
 + * for 0.2 second is a possible stall.
 + *
 + * If such happens 5 times, it installs a work-around
 + * that forces interrupt service each time an interrupt
 + * condition is present in the chip interrupt status.
 + */
 +
 +static void sym_clock_handler(void *arg)
 +{
 +	int s;
 +	hcb_p np;
 +	u_char istat;
 +	u_long intr_prevc;
 +
 +	np = arg;
 +	if (!np)
 +		return;
 +
 +	s = splcam();
 +
 +	/*
 +	 * Update our clock and interrupt counter copy.
 +	 */
 +	intr_prevc = np->intr_prevc;
 +	np->intr_prevc = np->intr_count;
 +	np->clock_curr += SYM_CLOCK_TICK;
 +
 +	/*
 +	 * Read the chip interrupt status.
 +	 */
 +	istat = INB (nc_istat) & (INTF|SIP|DIP);
 +
 +	/*
 +	 * Try to detect interrupt stalls.
 +	 */
 +	switch(np->stall_state) {
 +	default:
 +	case 0:	/* Wait for the first unserviced interrupt condition */
 +		np->stall_count = 0;
 +
 +	case 2:	/* Wait for subsequent ones */
 +		if (istat) {
 +			np->clock_stall = np->clock_curr;
 +			np->stall_state = 1;
 +		}
 +		break;
 +
 +	case 1:	/* Detect a possible interrupt stall */
 +#ifndef DEBUG_INTR_STALL
 +		if (intr_prevc != np->intr_count || !istat) {
 +			np->stall_state = 2;
 +			break;
 +		}
 +#endif
 +		if (((int)(np->clock_curr - np->clock_stall)) < (hz+4)/5)
 +			break;
 +
 +		++np->stall_count;
 +		if (np->stall_count < 5) {
 +			np->stall_state = 2;
 +			printf("%s: interrupt stall, forcing service.\n",
 +			       sym_name(np));
 +		}
 +		else {
 +			np->stall_state = 3;
 +			printf("%s: interrupt stall, installing workaround.\n",
 +			       sym_name(np));
 +		}
 +		sym_intr1(np);
 +		break;
 +
 +	case 3:	/* Force service if interrupt condition is pending */
 +		if (istat)
 +			sym_intr1(np);
 +		break;
 +	}
 +
 +	np->clock_ch = timeout(sym_clock_handler, (caddr_t)np, SYM_CLOCK_TICK);
 +	splx(s);
 +}
 +#endif /* SYM_CONF_HANDLE_INTR_STALL */
 
>Unformatted:
