From nobody@FreeBSD.org  Thu Oct 21 17:41:51 2004
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1478B16A4CF
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 21 Oct 2004 17:41:51 +0000 (GMT)
Received: from www.freebsd.org (www.freebsd.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id AD16043D1D
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 21 Oct 2004 17:41:50 +0000 (GMT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.12.11/8.12.11) with ESMTP id i9LHfoxq022429
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 21 Oct 2004 17:41:50 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.12.11/8.12.11/Submit) id i9LHfoYa022424;
	Thu, 21 Oct 2004 17:41:50 GMT
	(envelope-from nobody)
Message-Id: <200410211741.i9LHfoYa022424@www.freebsd.org>
Date: Thu, 21 Oct 2004 17:41:50 GMT
From: James Van Bokkelen <jbvb@sandstorm.net>
To: freebsd-gnats-submit@FreeBSD.org
Subject: em driver can hang when mbuf starvation occurs
X-Send-Pr-Version: www-2.3

>Number:         72970
>Category:       kern
>Synopsis:       [em] em(4) driver can hang when mbuf starvation occurs
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    yongari
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Oct 21 17:50:20 GMT 2004
>Closed-Date:    Mon Mar 26 20:38:02 GMT 2007
>Last-Modified:  Mon Mar 26 20:38:02 GMT 2007
>Originator:     James Van Bokkelen
>Release:        4.8
>Organization:
Sandstorm Enterprises Inc.
>Environment:
FreeBSD ni8 4.8-RELEASE FreeBSD 4.8-RELEASE #0: Fri Oct 15 15:02:46 EDT 2004     prod@sandstorm.net:/usr/src/sys/compile/NI_3X_FREEBSD48_DUAL  i386
 
>Description:
  In file sys/dev/em/if_em.c, in function process_receive_interrupts(),
(at line 2469 in v1.2.2.16, still present in 1.50 viewed via CVS
on 21-Oct-2004), there is a call to em_get_buf().  If this fails due
to mbuf starvation, the driver presently counts the error, frees any
chain being built, puts the old buffer back in the receive ring, and breaks
out of the while(current_desc->status...) loop.  This results in the
interrupt being dismissed without updating the receive queue tail
pointer, and the card never interrupts again.

>How-To-Repeat:
 Stress a FreeBSD system with an em interface by receiving a large
amount of traffic in promiscuous mode from several different senders
while simultaneously writing large amounts of data to the disk.
Observe that the em interface stops receiving new packets, and starts
rapidly counting missed packets.   Condition can be cleared by

   ifconfig em0 down && ifconfig em0 up
or
   ifconfig em0 media auto

>Fix:
    
I replaced the offending block of code with a goto:

		if (accept_frame) {

			if (em_get_buf(i, adapter, NULL) == ENOBUFS) {
				goto next_rx_pkt; /* treat starvation like a runt or overrun */

This is aimed at a new label in the 'else' clause of the enclosing
'if (accept_frame)':

		} else {
	next_rx_pkt: /* Come here when starvation forces us to re-use an mbuf cluster */
			adapter->dropped_pkts++;
			em_get_buf(i, adapter, mp);
			if (adapter->fmp != NULL)
				m_freem(adapter->fmp);
			adapter->fmp = NULL;
			adapter->lmp = NULL;
		}

		/* Zero out the receive descriptors status  */

This keeps the driver from prematurely dismissing the interrupt and
going deaf to incoming packets.  
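
The difference between the two behaviours can be illustrated with a small
user-space model of the receive loop.  This is only a sketch under stated
assumptions, not driver code: the ring layout, the names rx_ring,
rx_ring_init(), get_buf() and rx_process(), and the integer "buffers" are
all hypothetical stand-ins for the real descriptors and mbuf clusters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8      /* hypothetical ring size, far smaller than a real one */
#define DESC_DONE 0x01u  /* stand-in for the descriptor "done" status bit */

struct rx_desc {
	unsigned status; /* DESC_DONE once the card has filled this slot */
	int buf;         /* stand-in for the mbuf cluster attached to the slot */
};

struct rx_ring {
	struct rx_desc desc[RING_SIZE];
	size_t next_to_check;    /* next slot the loop will examine */
	unsigned long delivered; /* frames passed up the stack */
	unsigned long dropped;   /* frames discarded for lack of a buffer */
	int bufs_available;      /* simulated mbuf cluster pool */
};

/* Start with every slot filled by the card and `bufs` clusters in the pool. */
static void
rx_ring_init(struct rx_ring *r, int bufs)
{
	assert(r != NULL);
	for (size_t i = 0; i < RING_SIZE; i++) {
		r->desc[i].status = DESC_DONE;
		r->desc[i].buf = (int)i;
	}
	r->next_to_check = 0;
	r->delivered = 0;
	r->dropped = 0;
	r->bufs_available = bufs;
}

/* Simulated allocator: fails once the pool is empty (the ENOBUFS case). */
static bool
get_buf(struct rx_ring *r, int *out)
{
	if (r->bufs_available <= 0)
		return false;
	r->bufs_available--;
	*out = 1000 + r->bufs_available; /* arbitrary id for the fresh buffer */
	return true;
}

/*
 * Walk the completed descriptors and return how many slots were handed
 * back to the card.  With recycle_on_starvation false the loop stops at
 * the first allocation failure; with it true the frame is dropped, the
 * old buffer stays in the slot, and the loop keeps going.
 */
static size_t
rx_process(struct rx_ring *r, bool recycle_on_starvation)
{
	size_t returned = 0;

	while (r->desc[r->next_to_check].status & DESC_DONE) {
		struct rx_desc *d = &r->desc[r->next_to_check];
		int fresh;

		if (get_buf(r, &fresh)) {
			d->buf = fresh;  /* old buffer goes up the stack */
			r->delivered++;
		} else if (recycle_on_starvation) {
			r->dropped++;    /* discard the frame, re-use d->buf */
		} else {
			break;           /* slot stays marked done */
		}
		d->status = 0;           /* slot is free for the card again */
		r->next_to_check = (r->next_to_check + 1) % RING_SIZE;
		returned++;
	}
	return returned;
}
```

With all eight slots filled and three buffers in the pool, the break
variant hands back three slots and stops with the other five still marked
done, while the recycle variant hands back all eight, delivering three
frames and dropping five.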
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->tackerman 
Responsible-Changed-By: delphij 
Responsible-Changed-When: Fri Oct 22 02:24:46 GMT 2004 
Responsible-Changed-Why:  
Over to Tony, our em(4) maintainer. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72970 
State-Changed-From-To: open->analyzed 
State-Changed-By: tackerman 
State-Changed-When: Tue Oct 26 23:28:08 GMT 2004 
State-Changed-Why:  
Problem is being reproduced and the fix is being tested. 
Final fix will be submitted once tested. 


http://www.freebsd.org/cgi/query-pr.cgi?pr=72970 

From: "Youlin Feng" <yfeng@verniernetworks.com>
To: <freebsd-gnats-submit@FreeBSD.org>, <jbvb@sandstorm.net>
Cc:  
Subject: Re: kern/72970: em driver can hang when mbuf starvation occurs
Date: Fri, 7 Jan 2005 16:57:37 -0800

 I am using 82546GB controllers and I have been seeing the same problem
 where under heavy traffic the receiver would hang while the transmitter
 seems to be ok. When this happens, the driver if_ipackets stops
 incrementing and the chip mpc (missed_packet_count) keeps increasing.
 
 Yet my instrumented driver, which uses printf as well as a new counter
 inside the "if (em_get_buf(i, adapter, NULL) == ENOBUFS) {" block, did
 not report mbuf starvation.
 
 My solution is to add a check to the em driver's 2-second timer handler,
 em_local_timer, that detects a hung receiver and recovers from it. If
 the driver hasn't received a single packet in the past two seconds, yet
 the chip keeps reporting dropped packets, we assume the receiver is
 wedged and call em_init_locked() to reset the interface. This
 assumption is valid for our switch application.
 
 Youlin Feng
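
The heuristic described in this follow-up might be sketched as below.
This is a guess at the shape of the check, not the code Youlin used: the
names rx_watch and rx_looks_wedged() are hypothetical, and both counters
are assumed to be cumulative values sampled once per timer tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counter values remembered from the previous timer tick. */
struct rx_watch {
	unsigned long last_ipackets; /* packets the driver had received */
	unsigned long last_missed;   /* packets the chip had reported missed */
};

/*
 * Call once per timer tick with the current counters.  Returns true
 * when no packet was received since the last tick although the chip
 * kept counting missed packets, i.e. the receiver looks wedged.
 */
static bool
rx_looks_wedged(struct rx_watch *w, unsigned long ipackets,
    unsigned long missed)
{
	assert(w != NULL);

	bool no_progress = (ipackets == w->last_ipackets);
	bool still_missing = (missed > w->last_missed);

	/* Remember this tick's values for the next comparison. */
	w->last_ipackets = ipackets;
	w->last_missed = missed;

	return no_progress && still_missing;
}
```

A caller would reset the interface when the function returns true.  An
idle link (no packets received, none missed) does not trigger it, which
is why the missed-packet counter is part of the test.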
 
 

From: "Juan Ignacio Germano" <jigermano@gmail.com>
To: bug-followup@FreeBSD.org, jbvb@sandstorm.net, tackerman@FreeBSD.org
Cc:  
Subject: Re: kern/72970: [em] em(4) driver can hang when mbuf starvation occurs
Date: Thu, 16 Mar 2006 17:07:47 -0300

 Hi,
    We have squid-3 running on FreeBSD 5.4-RELEASE-p11 with a couple
 of Broadcom BCM5721 Gigabit Ethernet NICs, both using the bge driver.
 Both MBufClust and Mbuf were set to 22400. When MBufClust hit the max,
 the NICs stopped responding. We were still able to log into the system
 and everything was OK except that the interfaces were counting errors.
 It would seem this is related to this bug:
 
 http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/72970
 
 since it is exactly the same behaviour. I've been able to reproduce it
 easily using a very low number of MBufCluster (650) configured in
 /boot/loader.conf and generating a couple of hundred requests per
 second to squid.
 
 Should I open a separate bug report for this?
 Thank you.
 
 --
 Juan Germano

From: "James B. Van Bokkelen" <jbvb@sandstorm.net>
To: Juan Ignacio Germano <jigermano@gmail.com>
Cc: bug-followup@FreeBSD.org, tackerman@FreeBSD.org
Subject: Re: kern/72970: [em] em(4) driver can hang when mbuf starvation occurs
Date: Thu, 16 Mar 2006 15:57:44 -0500

 Juan Ignacio Germano wrote:
 > Hi,
 >    We have squid-3 running on FreeBSD 5.4-RELEASE-p11 with a couple
 > of Broadcom BCM5721 Gigabit Ethernet NICs, both using the bge driver.
 > Both MBufClust and Mbuf were set to 22400. When MBufClust hit the max,
 > the NICs stopped responding. We were still able to log into the system
 > and everything was OK except that the interfaces were counting errors.
 > It would seem this is related to this bug:
 > 
 > http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/72970
 > 
 > since it is exactly the same behaviour. I've been able to reproduce it
 > easily using a very low number of MBufCluster (650) configured in
 > /boot/loader.conf and generating a couple of hundred requests per
 > second to squid.
 
 We are applying my fix (which has flaws, but at least doesn't
 leave the interface deaf until you down/up with ifconfig) to the
 em driver in all our production 5.3 kernels.  We are also applying
 it to 6.0 kernels for beta testing our next release.
 
 > Should I open a separate bug report for this?
 
 I can't comment.  If someone had the Intel documentation for the
 interface and wanted to discuss alternatives to my fix, I would
 be happy to.  I'd rather not discard all the packets in the input
 ring when MBufs are exhausted, but that was the only remedy I
 could identify using only the FreeBSD em driver source.
 
 For what it's worth, this is the only work I've done on a Unix
 driver for a current production LAN interface, but long ago I
 did about a dozen hardware drivers for FTP Software's PC/TCP,
 and really, not that much has changed.
 
 jbvb
State-Changed-From-To: analyzed->feedback 
State-Changed-By: linimon 
State-Changed-When: Wed Apr 5 00:36:42 UTC 2006 
State-Changed-Why:  
Is this still a problem with recent versions of FreeBSD? 


Responsible-Changed-From-To: tackerman->linimon 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Wed Apr 5 00:36:42 UTC 2006 
Responsible-Changed-Why:  
Reset PR assigned to inactive committer. 

Hat:	gnats-admin 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72970 

From: linimon@lonesome.com (Mark Linimon)
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/72970: [em] em(4) driver can hang when mbuf starvation occurs
Date: Wed, 5 Apr 2006 14:45:44 -0500

 ----- Forwarded message from "James B. Van Bokkelen" <jbvb@sandstorm.net> -----
 
 We are using FreeBSD 6.0 with our fix applied.  We have not re-tested
 without the fix since our original bug report.
 
 I have just reviewed if_em.c v1.114, and em_rxeof() still contains
 the logic I consider defective:  If no mbuf can be allocated, it
 breaks out of the while() loop, trying to save the packet for a
 future interrupt.  This is ok unless the ring is full, whereupon
 our experience suggests no further receive interrupts will happen.
 
 If code has been added elsewhere to catch this condition (watchdog
 etc.), I can't say how effective it would be without testing. We
 operate the interface in 'monitor' mode, never sending any packets,
 so resetting the receiver in the transmit side would not work for
 our application.
 
 jbvb
 
 ----- End forwarded message -----
State-Changed-From-To: feedback->suspended 
State-Changed-By: linimon 
State-Changed-When: Wed Apr 5 19:55:35 UTC 2006 
State-Changed-Why:  
This sounds like it is still a problem.  Mark as 'suspended' awaiting 
someone to take an interest in it. 


Responsible-Changed-From-To: linimon->freebsd-bugs 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Wed Apr 5 19:55:35 UTC 2006 
Responsible-Changed-Why:  

http://www.freebsd.org/cgi/query-pr.cgi?pr=72970 
State-Changed-From-To: suspended->analyzed 
State-Changed-By: glebius 
State-Changed-When: Fri Apr 7 09:06:39 UTC 2006 
State-Changed-Why:  
I will work on this. 


Responsible-Changed-From-To: freebsd-bugs->glebius 
Responsible-Changed-By: glebius 
Responsible-Changed-When: Fri Apr 7 09:06:39 UTC 2006 
Responsible-Changed-Why:  
It looks like James has found the real root of the problem, 
which I masked in rev. 1.80. Shame on me: I failed to 
find this PR when I was searching through the PR database. 

I will analyze James's report ASAP. Thanks, James! 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72970 

From: Gleb Smirnoff <glebius@FreeBSD.org>
To: James Van Bokkelen <jbvb@sandstorm.net>
Cc: Youlin Feng <yfeng@verniernetworks.com>, freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/72970: em driver can hang when mbuf starvation occurs
Date: Fri, 7 Apr 2006 15:22:14 +0400

   Hi, James! Hi, Youlin!
 
   I have encountered the same problem as James describes and
 I believe that revision 1.80 has fixed it. Yes, it fixed the
 problem on all my routers.
 
   However, many people don't like the code from revision
 1.80, since it 1) makes the box less responsive under high
 interrupt load, 2) is more dangerous in theory.
 
   When I found James's PR I hoped that he had found the
 root cause of the problem and that I could fix the driver by
 adding code to handle em_get_buf() failure and remove my
 dangerous for(;;) loop.
 
   Unfortunately I'm seeing the same thing Youlin sees: the
 allocation doesn't fail, but the NIC's receive side wedges. Here
 is the state of the softc of the wedged NIC:
 
 (kgdb) p ((struct em_softc *)ifnet->tqh_first->if_softc)->dropped_pkts
 $10 = 0
 (kgdb) p ((struct em_softc *)ifnet->tqh_first->if_softc)->mbuf_cluster_failed
 $11 = 0
 
 This means that the ENOBUFS case never occurred.
 
 The receive ring isn't full:
 
 (kgdb) p ((struct em_softc *)ifnet->tqh_first->if_softc)->next_rx_desc_to_check
 $12 = 161
 
   So it looks like either we (Youlin and I) are seeing a problem
 different from James's, or James's diagnosis isn't correct.
 
 -- 
 Totus tuus, Glebius.
 GLEBIUS-RIPN GLEB-RIPE

From: "James B. Van Bokkelen" <jbvb@sandstorm.net>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: Youlin Feng <yfeng@verniernetworks.com>, freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/72970: em driver can hang when mbuf starvation occurs
Date: Fri, 07 Apr 2006 09:14:19 -0400

 Gleb Smirnoff wrote:
 >   Hi, James! Hi, Youlin!
 > 
 >   I have encountered the same problem as James describes and
 > I believe that revision 1.80 has fixed it. Yes, it fixed the
 > problem on all my routers.
 > 
 >   However, many people don't like the code from revision
 > 1.80, since it 1) makes the box less responsive under high
 > interrupt load, 2) is more dangerous in theory.
 > 
 >   When I found James's PR I hoped that he had found the
 > root cause of the problem and that I could fix the driver by
 > adding code to handle em_get_buf() failure and remove my
 > dangerous for(;;) loop.
 > 
 >   Unfortunately I'm seeing the same thing Youlin sees: the
 > allocation doesn't fail, but the NIC's receive side wedges. Here
 > is the state of the softc of the wedged NIC:
 > 
 > (kgdb) p ((struct em_softc *)ifnet->tqh_first->if_softc)->dropped_pkts
 > $10 = 0
 > (kgdb) p ((struct em_softc *)ifnet->tqh_first->if_softc)->mbuf_cluster_failed
 > $11 = 0
 > 
 > This means that the ENOBUFS case never occurred.
 > 
 > The receive ring isn't full:
 > 
 > (kgdb) p ((struct em_softc *)ifnet->tqh_first->if_softc)->next_rx_desc_to_check
 > $12 = 161
 > 
 >   So it looks like either we (Youlin and I) are seeing a problem
 > different from James's, or James's diagnosis isn't correct.
 
 We definitely had an ENOBUFS problem.  We use the em interface as a
 monitoring device, and at the time we discovered the problem, we
 had a kernel modification that was making heavy use of mbufs.  This
 was demonstrated by the instrumentation we added to if_em.c, which
 we left out of the original PR to keep things simple.  I could send
 it to you if you wished, as it might help clarify the case you're
 looking at.  I think you're right in saying you've got a different
 issue.
 
 jbvb
 
 
 
State-Changed-From-To: analyzed->patched 
State-Changed-By: glebius 
State-Changed-When: Mon Aug 14 09:20:56 UTC 2006 
State-Changed-Why:  
Looks like Pyun has committed a fix to HEAD. 


Responsible-Changed-From-To: glebius->yongari 
Responsible-Changed-By: glebius 
Responsible-Changed-When: Mon Aug 14 09:20:56 UTC 2006 
Responsible-Changed-Why:  
Looks like Pyun has committed a fix to HEAD. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72970 
State-Changed-From-To: patched->closed 
State-Changed-By: remko 
State-Changed-When: Mon Mar 26 20:38:00 UTC 2007 
State-Changed-Why:  
Assume this got fixed already. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=72970 
>Unformatted:
