From nobody@FreeBSD.org  Thu Apr 27 03:40:07 2006
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id CF87916A401
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 27 Apr 2006 03:40:07 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8CF6643D45
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 27 Apr 2006 03:40:07 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.13.1/8.13.1) with ESMTP id k3R3e7wa043238
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 27 Apr 2006 03:40:07 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.13.1/8.13.1/Submit) id k3R3e7ac043237;
	Thu, 27 Apr 2006 03:40:07 GMT
	(envelope-from nobody)
Message-Id: <200604270340.k3R3e7ac043237@www.freebsd.org>
Date: Thu, 27 Apr 2006 03:40:07 GMT
From: Nathan Whitehorn <nathanw@uchicago.edu>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Device timeouts on nve(4) [PATCH]
X-Send-Pr-Version: www-2.3

>Number:         96391
>Category:       kern
>Synopsis:       [nve] [patch] Device timeouts on nve(4)
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Apr 27 03:50:14 GMT 2006
>Closed-Date:    Mon Jun 12 20:53:51 GMT 2006
>Last-Modified:  Mon Jan 22 21:10:12 GMT 2007
>Originator:     Nathan Whitehorn
>Release:        6.1-RC
>Organization:
University of Chicago
>Environment:
FreeBSD munuc.uchicago.edu 6.1-RC FreeBSD 6.1-RC #9: Wed Apr 26 22:02:06 CDT 2006     root@munuc.uchicago.edu:/usr/obj/usr/src/sys/MUNUC  amd64
>Description:
On some systems with nVidia NICs, especially nForce4, nve(4) reports frequent
device timeouts (every 5-10 minutes) under low load. This seems to result, as
per a note in the forcedeth source, from the nve MAC randomly failing to send
tx acknowledgement interrupts. Under load, tx interrupts from other packets
or rx interrupts will cause the interrupt routine to run and register the
packet transmit notification. Under low load, the watchdog timer will expire
before this happens, causing a device timeout and a MAC reset, which also
briefly hangs the machine.
>How-To-Repeat:
Place an affected nve controller on a low-traffic network and watch the
errors come rolling in.
>Fix:
We can fix the problem by calling the nVidia HAL's interrupt service routine
from nve_watchdog(), in effect running the interrupt handler ourselves if we
are expecting an interrupt and it hasn't shown up yet. If the pending transmits
counter is still non-zero afterwards, we conclude, as before, that the NIC has
crashed and reset it; if it has dropped to zero, the problem is resolved and we
simply return.

--- if_nve_original.c   Wed Apr 26 22:23:14 2006
+++ if_nve.c    Wed Apr 26 21:52:34 2006
@@ -1270,6 +1270,18 @@
 nve_watchdog(struct ifnet *ifp)
 {
        struct nve_softc *sc = ifp->if_softc;
+
+       NVE_LOCK(sc);
+       /* Check for lost interrupts -- happens on nForce4 */
+       sc->hwapi->pfnDisableInterrupts(sc->hwapi->pADCX);
+       sc->hwapi->pfnHandleInterrupt(sc->hwapi->pADCX);
+       sc->hwapi->pfnEnableInterrupts(sc->hwapi->pADCX);
+
+       if (sc->pending_txs == 0) {
+               NVE_UNLOCK(sc);
+               return; /* Problem went away */
+       }
+       NVE_UNLOCK(sc);

        device_printf(sc->dev, "device timeout (%d)\n", sc->pending_txs);
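
The decision the patch adds to the watchdog can be illustrated outside the
kernel. What follows is a minimal user-space sketch in C, not driver code:
struct model_softc, model_handle_interrupt() and model_watchdog() are
hypothetical names invented for illustration, and the completed_in_hw field
is an assumed stand-in for transmits the MAC has finished without raising an
interrupt. Only the pending_txs counter corresponds to a field in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the watchdog looks at. */
struct model_softc {
	int pending_txs;     /* transmits queued but not yet acknowledged */
	int completed_in_hw; /* assumed: finished by the MAC, never signalled */
	int resets;          /* how often the watchdog had to reset the MAC */
};

/* Stand-in for the HAL interrupt handler: reap finished transmits. */
static void
model_handle_interrupt(struct model_softc *sc)
{
	/* The hardware cannot complete more than was queued. */
	assert(sc->completed_in_hw <= sc->pending_txs);
	sc->pending_txs -= sc->completed_in_hw;
	sc->completed_in_hw = 0;
}

/*
 * Watchdog in the shape of the patch: poll for lost completions first,
 * reset only if transmits are still pending. Returns true on reset.
 */
static bool
model_watchdog(struct model_softc *sc)
{
	model_handle_interrupt(sc);	/* check for lost interrupts */
	if (sc->pending_txs == 0)
		return (false);		/* problem went away */

	/* The real driver prints "device timeout" and reinitializes here. */
	sc->resets++;
	sc->pending_txs = 0;
	sc->completed_in_hw = 0;
	return (true);
}
```

In this model, a softc with pending_txs and completed_in_hw both set to 2
(a lost interrupt) makes model_watchdog() return false and leaves resets at
0. With completed_in_hw set to 0 (a MAC that really is stuck) it returns
true and counts one reset, which matches the unpatched behaviour.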
>Release-Note:
>Audit-Trail:

From: Yuri Pankov <yuri.pankov@gmail.com>
To: bug-followup@FreeBSD.org,  nathanw@uchicago.edu
Cc:  
Subject: Re: kern/96391: [nve] [patch] Device timeouts on nve(4)
Date: Thu, 27 Apr 2006 19:02:24 +0400

 Indeed, the patch seems to get rid of the timeout(N) messages, though it
 doesn't apply cleanly to the -CURRENT source. Here's a patch against rev 1.20
 of if_nve.c:
 
 --- if_nve.c.orig       Mon Dec 12 09:23:43 2005
 +++ if_nve.c    Thu Apr 27 18:23:48 2006
 @@ -1277,6 +1277,18 @@
   {
          struct nve_softc *sc = ifp->if_softc;
 
 +       NVE_LOCK(sc);
 +       /* Check for lost interrupts -- happens on nForce4 */
 +       sc->hwapi->pfnDisableInterrupts(sc->hwapi->pADCX);
 +       sc->hwapi->pfnHandleInterrupt(sc->hwapi->pADCX);
 +       sc->hwapi->pfnEnableInterrupts(sc->hwapi->pADCX);
 +
 +       if (sc->pending_txs == 0) {
 +               NVE_UNLOCK(sc);
 +               return; /* Problem went away */
 +       }
 +       NVE_UNLOCK(sc);
 +
          device_printf(sc->dev, "device timeout (%d)\n", sc->pending_txs);
 
          NVE_LOCK(sc);

From: "Bjoern A. Zeeb" <bz@FreeBSD.org>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/96391: [nve] [patch] Device timeouts on nve(4)
Date: Fri, 28 Apr 2006 08:29:28 +0000 (UTC)

 There have also been at least the following PRs (which seem to reference
 even more) about this issue. I have redirected the unassigned ones
 here so there are not too many PRs to track.
 
 PR 85583, PR 88045, PR 92371, PR 94070, PR 94524
State-Changed-From-To: open->closed 
State-Changed-By: jhb 
State-Changed-When: Mon Jun 12 20:53:32 UTC 2006 
State-Changed-Why:  
Patch applied to HEAD and RELENG_6.  Thanks for tracking this down!! 

http://www.freebsd.org/cgi/query-pr.cgi?pr=96391 

From: Daniel Rich <drich@pdi.com>
To: bug-followup@FreeBSD.org, nathanw@uchicago.edu
Cc:  
Subject: Re: kern/96391: [nve] [patch] Device timeouts on nve(4)
Date: Mon, 22 Jan 2007 11:40:03 -0800

 I realize this PR has been closed for over 6 months, but I'm seeing
 similar symptoms in 6.2REL with my nve0 interface.
 
 In looking at the current code in if_nve.c, it looks like things have
 changed a little bit since the patch in this PR, but most of the patch
 still exists.  The major difference I see is that it now uses
 sc->pending_txs to see if there are any pending packets instead of just
 checking for it == 0.  Without knowing more about the hardware, would it
 make more sense for the "pending_txs_start = sc->pending_txs;" line to
 be *after* the driver is kicked by tweaking the interrupts?
 
 For the specifics of my system:
     6.2-RELEASE amd64
     Motherboard: ASUS M2NPV-VM (integrated NVIDIA nForce® 430 built-in
 Gigabit MAC)
 
 It looks like it is only having problems when passing lots of traffic.
 Also, I do go through two switches at the moment, that will change when
 I get home tonight.
 
 -- 
 Dan Rich <drich@pdi.com>
           PDI Dreamworks |  "Step up to red alert!"  "Are you sure, sir?
            (650)562-9018 |   It means changing the bulb in the sign..."
>Unformatted:
