From vivek@rt.m1e.net  Tue Nov 15 15:19:21 2005
Return-Path: <vivek@rt.m1e.net>
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 1ACC616A41F
	for <FreeBSD-gnats-submit@freebsd.org>; Tue, 15 Nov 2005 15:19:21 +0000 (GMT)
	(envelope-from vivek@rt.m1e.net)
Received: from rt.m1e.net (rt.m1e.net [206.112.95.37])
	by mx1.FreeBSD.org (Postfix) with ESMTP id D575643D45
	for <FreeBSD-gnats-submit@freebsd.org>; Tue, 15 Nov 2005 15:19:20 +0000 (GMT)
	(envelope-from vivek@rt.m1e.net)
Received: by rt.m1e.net (Postfix, from userid 120)
	id 17A14B823; Tue, 15 Nov 2005 10:19:20 -0500 (EST)
Message-Id: <20051115151920.17A14B823@rt.m1e.net>
Date: Tue, 15 Nov 2005 10:19:20 -0500 (EST)
From: Vivek Khera <vivek@khera.org>
Reply-To: Vivek Khera <vivek@khera.org>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: em(4) continual "em0: RX overrun" warnings
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         89073
>Category:       kern
>Synopsis:       [em] em(4) continual "em0: RX overrun" warnings (regression in 6.0)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    glebius
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Nov 15 15:20:21 GMT 2005
>Closed-Date:    Mon Nov 21 08:19:55 GMT 2005
>Last-Modified:  Wed Feb 08 20:58:52 GMT 2006
>Originator:     Vivek Khera
>Release:        FreeBSD 6.0-RELEASE amd64
>Organization:
>Environment:
System: FreeBSD rt.m1e.net 6.0-RELEASE FreeBSD 6.0-RELEASE #2: Thu Nov 3 15:47:50 EST 2005 vivek@rt.m1e.net:/u/data/usr.obj/n/lorax1/usr6/src/sys/RT amd64

Hardware is Dual Opteron Tyan K8SR motherboard with Intel dual port gigabit
NIC plugged in.  Onboard bge ethernet is disabled in BIOS.  Other card is an
LSI 320-2X RAID card.

	
>Description:
	

I have an identical system (with two fewer disks, though) running 5.4-RELEASE
which runs non-stop.  On this system I installed 6.0-REL, and it runs really
well and the disk I/O is much faster.  However, if I do any kind of network
access I get "em0: RX overrun" warnings in the log at a pretty high rate.  This
can be triggered by something as simple as running a buildworld with /usr/src
mounted via NFS and /usr/obj being a local disk.  The kernel is GENERIC as
installed by the 6.0-REL CD.

Ultimately, the machine becomes non-responsive to the network.  For some
reason the serial console is not working past the BIOS boot, so I can't see if
the machine is hung totally or just the network (the machine is at a co-lo
facility).  I'll be attempting to fix the serial console soon.

The em0 is connected to a Dell gigabit switch which is working fine (same port
was used for another system with no issues) and the NFS server is on a
different switch running 100baseT.

>How-To-Repeat:
	

run "buildworld" with NFS mounted /usr/src.

>Fix:

	


Not sure.  Trying a new NIC later today as soon as it arrives.

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->glebius 
Responsible-Changed-By: glebius 
Responsible-Changed-When: Wed Nov 16 15:21:37 GMT 2005 
Responsible-Changed-Why:  
This is definitely related to my change. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=89073 

From: Gleb Smirnoff <glebius@FreeBSD.org>
To: Vivek Khera <vivek@khera.org>
Cc: FreeBSD-gnats-submit@FreeBSD.org
Subject: Re: kern/89073: em(4) continual "em0: RX overrun" warnings
Date: Wed, 16 Nov 2005 18:25:30 +0300

 On Tue, Nov 15, 2005 at 10:19:20AM -0500, Vivek Khera wrote:
 V> >Description:
 V> 
 V> I have an identical system (with two fewer disks, though) running 5.4-RELEASE
 V> which runs non-stop.  This system I installed 6.0-REL and it runs really well
 V> and the disk I/O is much faster.  However, if I do any kind of network access
 V> I get "em0: RX overrun" warnings in the log at a pretty high rate.  This can
 V> be caused by doing something so simple as running a buildworld with /usr/src
 V> mounted via NFS and /usr/obj being a local disk.  The kernel is GENERIC as
 V> installed by the 6.0-REL CD.
 V> 
 V> Ultimately, the machine becomes non-responsive to the network.  For some
 V> reason the serial console is not working past the BIOS boot, so I can't see if
 V> the machine is hung totally or just the network (the machine is at a co-lo
 V> facility)  I'll be attempting to fix the serial console soon.
 V> 
 V> The em0 is connected to a Dell gigabit switch which is working fine (same port
 V> was used for another system with no issues) and the NFS server is on a
 V> different switch running 100baseT.
 
 Can you please show the following details:
 
 grep ^em /var/run/dmesg.boot
 pciconf -lv
 sysctl dev.em.0.stats=1 && dmesg | tail -n 17
 sysctl dev.em.0.debug_info=1 && dmesg | tail -n 13
 
 -- 
 Totus tuus, Glebius.
 GLEBIUS-RIPN GLEB-RIPE

From: Vivek Khera <vivek@khera.org>
To: FreeBSD-gnats-submit@FreeBSD.org
Cc:  
Subject: Re: kern/89073: em(4) continual "em0: RX overrun" warnings
Date: Wed, 16 Nov 2005 11:04:55 -0500

 On Nov 16, 2005, at 10:25 AM, Gleb Smirnoff wrote:
 
 > Can you please show the following details:
 >
 > grep ^em /var/run/dmesg.boot
 
 em0: <Intel(R) PRO/1000 Network Connection, Version - 2.1.7> port 0x8c00-0x8c3f mem 0xfc9e0000-0xfc9fffff,0xfc980000-0xfc9bffff irq 27 at device 8.0 on pci3
 em0: Ethernet address: 00:04:23:c0:41:29
 em0:  Speed:N/A  Duplex:N/A
 em0: link state changed to UP
 
 
 > pciconf -lv
 
 pcib1@pci0:6:0: class=0x060400 card=0x000000c0 chip=0x74601022 rev=0x07 hdr=0x01
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8111 PCI Bridge'
      class    = bridge
      subclass = PCI-PCI
 isab0@pci0:7:0: class=0x060100 card=0x74681022 chip=0x74681022 rev=0x05 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8111 LPC Bridge'
      class    = bridge
      subclass = PCI-ISA
 atapci0@pci0:7:1:       class=0x01018a card=0x74691022 chip=0x74691022 rev=0x03 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8111 UltraATA/133 Controller'
      class    = mass storage
      subclass = ATA
 none0@pci0:7:2: class=0x0c0500 card=0x746a1022 chip=0x746a1022 rev=0x02 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8111 SMBus 2.0 Controller'
      class    = serial bus
      subclass = SMBus
 none1@pci0:7:3: class=0x068000 card=0x746b1022 chip=0x746b1022 rev=0x05 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8111 ACPI System Management Controller'
      class    = bridge
 pcib2@pci0:10:0:        class=0x060400 card=0x000000a0 chip=0x74501022 rev=0x12 hdr=0x01
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8131 PCI-X Bridge'
      class    = bridge
      subclass = PCI-PCI
 none2@pci0:10:1:        class=0x080010 card=0x36c01022 chip=0x74511022 rev=0x01 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8131 PCI-X IOAPIC'
      class    = base peripheral
      subclass = interrupt controller
 pcib3@pci0:11:0:        class=0x060400 card=0x000000a0 chip=0x74501022 rev=0x12 hdr=0x01
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8131 PCI-X Bridge'
      class    = bridge
      subclass = PCI-PCI
 none3@pci0:11:1:        class=0x080010 card=0x36c01022 chip=0x74511022 rev=0x01 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'AMD-8131 PCI-X IOAPIC'
      class    = base peripheral
      subclass = interrupt controller
 hostb0@pci0:24:0:       class=0x060000 card=0x00000000 chip=0x11001022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron HyperTransport Technology Configuration'
      class    = bridge
      subclass = HOST-PCI
 hostb1@pci0:24:1:       class=0x060000 card=0x00000000 chip=0x11011022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron Address Map'
      class    = bridge
      subclass = HOST-PCI
 hostb2@pci0:24:2:       class=0x060000 card=0x00000000 chip=0x11021022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron DRAM Controller'
      class    = bridge
      subclass = HOST-PCI
 hostb3@pci0:24:3:       class=0x060000 card=0x00000000 chip=0x11031022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron Miscellaneous Control'
      class    = bridge
      subclass = HOST-PCI
 hostb4@pci0:25:0:       class=0x060000 card=0x00000000 chip=0x11001022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron HyperTransport Technology Configuration'
      class    = bridge
      subclass = HOST-PCI
 hostb5@pci0:25:1:       class=0x060000 card=0x00000000 chip=0x11011022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron Address Map'
      class    = bridge
      subclass = HOST-PCI
 hostb6@pci0:25:2:       class=0x060000 card=0x00000000 chip=0x11021022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron DRAM Controller'
      class    = bridge
      subclass = HOST-PCI
 hostb7@pci0:25:3:       class=0x060000 card=0x00000000 chip=0x11031022 rev=0x00 hdr=0x00
      vendor   = 'Advanced Micro Devices (AMD)'
      device   = 'Athlon 64 / Opteron Miscellaneous Control'
      class    = bridge
      subclass = HOST-PCI
 none4@pci4:6:0: class=0x030000 card=0x80081002 chip=0x47521002 rev=0x27 hdr=0x00
      vendor   = 'ATI Technologies Inc'
      device   = 'Rage XL PCI'
      class    = display
      subclass = VGA
 em0@pci3:8:0:   class=0x020000 card=0x10018086 chip=0x10268086 rev=0x04 hdr=0x00
      vendor   = 'Intel Corporation'
      device   = '82545GM Gigabit Ethernet Controller'
      class    = network
      subclass = ethernet
 pcib4@pci1:3:0: class=0x060400 card=0x00000080 chip=0x01a71014 rev=0x03 hdr=0x01
      vendor   = 'International Business Machines Corp.'
      device   = 'IBM 133 PCI-X Bridge R1.1'
      class    = bridge
      subclass = PCI-PCI
 amr0@pci2:0:0:  class=0x010400 card=0x05321000 chip=0x04071000 rev=0x02 hdr=0x00
      vendor   = 'LSI Logic (Was: Symbios Logic, NCR)'
      device   = 'MegaRAID'
      class    = mass storage
      subclass = RAID
 
 > sysctl dev.em.0.stats=1 && dmesg | tail -n 17
 
 dev.em.0.stats: -1 -> -1
 em0: Excessive collisions = 0
 em0: Symbol errors = 0
 em0: Sequence errors = 0
 em0: Defer count = 0
 em0: Missed Packets = 0
 em0: Receive No Buffers = 0
 em0: Receive length errors = 0
 em0: Receive errors = 0
 em0: Crc errors = 0
 em0: Alignment errors = 0
 em0: Carrier extension errors = 0
 em0: XON Rcvd = 0
 em0: XON Xmtd = 0
 em0: XOFF Rcvd = 0
 em0: XOFF Xmtd = 0
 em0: Good Packets Rcvd = 120218
 em0: Good Packets Xmtd = 119017
 
 > sysctl dev.em.0.debug_info=1 && dmesg | tail -n 13
 
 dev.em.0.debug_info: -1 -> -1
 em0: Adapter hardware address = 0xffffff000001e940
 em0:CTRL  = 0x8f00249
 em0:RCTL  = 0x8002 PS=(0x8402)
 em0:tx_int_delay = 66, tx_abs_int_delay = 66
 em0:rx_int_delay = 0, rx_abs_int_delay = 66
 em0: fifo workaround = 0, fifo_reset = 0
 em0: hw tdh = 247, hw tdt = 247
 em0: Num Tx descriptors avail = 256
 em0: Tx Descriptors not avail1 = 0
 em0: Tx Descriptors not avail2 = 0
 em0: Std mbuf failed = 0
 em0: Std mbuf cluster failed = 0
 em0: Driver dropped packets = 0
 
 
 shortly thereafter I see this in dmesg output:
 
 em0: RX overrun
 em0: RX overrun
 em0: RX overrun
 nfs server yertle.int.kcilink.com:/usr/src: not responding
 nfs server yertle.int.kcilink.com:/usr/src: is alive again
 em0: RX overrun
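
The statistics dump above shows every error counter at zero even though the overrun warnings keep appearing. A minimal Python sketch for checking that across a longer dmesg capture might look like the following. The line pattern is inferred only from the output quoted in this report, and the function names and the choice of which counters count as plain traffic totals are assumptions made for illustration.

```python
import re

# Matches lines such as "em0: Missed Packets = 0" as printed after
# `sysctl dev.em.0.stats=1`; the pattern is inferred from the quoted output.
# Lines with hex values or several "=" signs (the debug_info dump) do not match.
STAT_LINE = re.compile(r"^\s*(em\d+):\s*([^=]+?)\s*=\s*(\d+)\s*$")

# Counters treated as traffic totals rather than error indicators (an assumption).
TRAFFIC_COUNTERS = {"Good Packets Rcvd", "Good Packets Xmtd"}


def parse_em_stats(text):
    """Return {interface: {counter_name: value}} for every stats line in text."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            iface, name, value = match.groups()
            stats.setdefault(iface, {})[name] = int(value)
    return stats


def nonzero_error_counters(stats):
    """Return, per interface, only the non-traffic counters that are above zero."""
    return {
        iface: {
            name: value
            for name, value in counters.items()
            if value and name not in TRAFFIC_COUNTERS
        }
        for iface, counters in stats.items()
    }


# A few lines copied from the dump in this report.
SAMPLE = """\
 em0: Missed Packets = 0
 em0: Receive No Buffers = 0
 em0: Good Packets Rcvd = 120218
 em0: Good Packets Xmtd = 119017
"""

if __name__ == "__main__":
    print(nonzero_error_counters(parse_em_stats(SAMPLE)))
```

An empty result for an interface means the hardware counters recorded no errors, which matches the dump quoted in this message.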
 

From: Vivek Khera <vivek@khera.org>
To: FreeBSD-gnats-submit@FreeBSD.org
Cc:  
Subject: Re: kern/89073: em(4) continual "em0: RX overrun" warnings
Date: Wed, 16 Nov 2005 11:07:57 -0500

 On Nov 16, 2005, at 10:25 AM, Gleb Smirnoff wrote:
 
 > V> Ultimately, the machine becomes non-responsive to the network.  For some
 > V> reason the serial console is not working past the BIOS boot, so I can't see if
 > V> the machine is hung totally or just the network (the machine is at a co-lo
 > V> facility)  I'll be attempting to fix the serial console soon.
 
 I plugged into the console at the colo and found the machine totally
 wedged.  So I rebooted with a new NIC (single-port Intel this time),
 and after 1 hour of running buildworld with NFS-mounted /usr/src, it
 locked up hard again (the keyboard did not even respond to caps-lock).
 I had nearly 150 RX overruns during that time.
 
 Running more tests now (at the office this time so I don't have to
 keep driving out there over and over.... :-)
 
 
State-Changed-From-To: open->closed 
State-Changed-By: glebius 
State-Changed-When: Mon Nov 21 08:18:45 GMT 2005 
State-Changed-Why:  
Submitter reports that on the other mainboard of the same model the 
problem can't be reproduced. Looks like bad hardware. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=89073 

Adding to audit trail from misfiled PR kern/93037:

Date: Wed, 8 Feb 2006 17:22:27 +0200
From: Dimo Dimitrov <dimo.dimitrov@m-real.net>
 
 Same issue here:
 
 FreeBSD carbone.m-real.net 6.0-RELEASE FreeBSD 6.0-RELEASE #1 amd64
 with four em(4) 82540EM gigabit ethernet controllers.
 
 cat /var/log/messages
 Feb  8 16:30:07 carbone kernel: em2: RX overrun
 Feb  8 16:30:07 carbone kernel: em2: RX overrun
 Feb  8 16:40:07 carbone kernel: em2: RX overrun
 Feb  8 16:50:07 carbone kernel: em2: RX overrun
 Feb  8 17:00:07 carbone kernel: em2: RX overrun
 Feb  8 17:00:07 carbone kernel: em2: RX overrun
 Feb  8 17:07:49 carbone kernel: em2: RX overrun
 Feb  8 17:07:49 carbone kernel: em2: RX overrun
 Feb  8 17:07:49 carbone kernel: em0: RX overrun
 Feb  8 17:10:07 carbone kernel: em2: RX overrun
 Feb  8 17:20:07 carbone kernel: em2: RX overrun
 ....
 
 This occurs every 10 minutes, under fairly light load (em0: 120Mb/s,
 em2: 80Mb/s).
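
To check the ten-minute pattern against a full /var/log/messages rather than by eye, a small Python sketch along these lines could tally the warnings per interface and minute of the hour. The regular expression is based only on the log lines quoted above, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches syslog lines such as
# "Feb  8 16:30:07 carbone kernel: em2: RX overrun" (format taken from the log above).
OVERRUN = re.compile(
    r"^\s*\w{3}\s+\d+\s+(\d{2}):(\d{2}):\d{2}\s+\S+\s+kernel:\s+(em\d+): RX overrun"
)


def tally_overruns(lines):
    """Count RX overrun messages per (interface, minute-of-hour)."""
    counts = Counter()
    for line in lines:
        match = OVERRUN.match(line)
        if match:
            _hour, minute, iface = match.groups()
            counts[(iface, int(minute))] += 1
    return counts


def on_ten_minute_marks(counts):
    """Return (messages logged at minute 0, 10, 20, ..., total messages)."""
    total = sum(counts.values())
    on_mark = sum(n for (_iface, minute), n in counts.items() if minute % 10 == 0)
    return on_mark, total


# Three lines copied from the log in this report.
SAMPLE = [
    "Feb  8 16:30:07 carbone kernel: em2: RX overrun",
    "Feb  8 16:40:07 carbone kernel: em2: RX overrun",
    "Feb  8 17:07:49 carbone kernel: em0: RX overrun",
]

if __name__ == "__main__":
    counts = tally_overruns(SAMPLE)
    print(counts)
    print(on_ten_minute_marks(counts))
```

Feeding it the whole log file (for example `tally_overruns(open("/var/log/messages"))`) would show how many of the warnings actually fall on the ten-minute marks and on which interface.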
 
Adding to audit trail from misfiled PR kern/93038:

Date: Wed, 8 Feb 2006 10:39:23 -0500
From: Vivek Khera <vivek@khera.org>
 
 I've decided it was crappy hardware.  On a brand new Sun X4100 with
 the same configuration of software and load (I replaced the generic Tyan
 S2881 system that had the Intel NIC card), I never see those warnings.
 
 In short, I'm never buying a generic no-name system again :-(
>Unformatted:
