From nobody@FreeBSD.org  Thu Apr 15 16:50:22 2010
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0A0061065672
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 15 Apr 2010 16:50:22 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id EC8F38FC17
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 15 Apr 2010 16:50:21 +0000 (UTC)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.3/8.14.3) with ESMTP id o3FGoL0Y035636
	for <freebsd-gnats-submit@FreeBSD.org>; Thu, 15 Apr 2010 16:50:21 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.3/8.14.3/Submit) id o3FGoLTA035635;
	Thu, 15 Apr 2010 16:50:21 GMT
	(envelope-from nobody)
Message-Id: <201004151650.o3FGoLTA035635@www.freebsd.org>
Date: Thu, 15 Apr 2010 16:50:21 GMT
From: AD <tempo@kgs.ru>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Stops working lagg between two servers.
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         145728
>Category:       kern
>Synopsis:       [lagg] Stops working lagg between two servers.
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-net
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Apr 15 17:00:12 UTC 2010
>Closed-Date:    
>Last-Modified:  Thu Apr 29 05:30:01 UTC 2010
>Originator:     AD
>Release:        7.2-RELEASE-p6 and 7.2-STABLE
>Organization:
ad
>Environment:
FreeBSD 1 7.2-RELEASE-p6 FreeBSD 7.2-RELEASE-p6 #1: Wed Mar 17 22:31:00 KRAT 2010     root@1:/usr/obj/usr/src/sys/1  i386


FreeBSD 2 7.2-STABLE FreeBSD 7.2-STABLE #8: Thu Apr  1 02:06:36 KRAST 2010     root@2:/usr/obj/usr/src/sys/2  i386

>Description:
There are two servers, each with 4 network cards. On each server, 2 of the cards are aggregated with lagg.

After a few days the lagg breaks down:
Server 1:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=19b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4>
        ether 00:1b:21:3b:4d:4d
        inet 1.1.1.1 netmask 0xffffffc0 broadcast 1.1.1.255
        media: Ethernet autoselect
        status: active
        laggproto lacp
        laggport: em3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: em2 flags=4<ACTIVE>

ifconfig em2
em2: flags=9c43<UP,BROADCAST,RUNNING,OACTIVE,SIMPLEX,LINK0,MULTICAST> metric 0 mtu 1500
        options=19b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4>
        ether 00:1b:21:3b:4d:4d
        media: Ethernet autoselect (1000baseTX <full-duplex>)
        status: active
        lagg: laggdev lagg0


#less /var/run/dmesg.boot | grep em2
em2: <Intel(R) PRO/1000 Network Connection 6.9.6.Yandex[$Revision: 1.36.2.17 $]> port 0x3000-0x301f mem 0xd3180000-0xd319ffff,0xd3100000-0xd317ffff,0xd31a0000-0xd31a3fff irq 16 at device 0.0 on pci2
em2: Using MSIX interrupts
em2: Using TXD_LOW instead of TXDW
em2: [FILTER]
em2: [FILTER]
em2: [FILTER]
em2: Ethernet address: 00:1b:21:3b:4d:4d


em2@pci0:2:0:0: class=0x020000 card=0xa01f8086 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = network
    subclass   = ethernet

em3@pci0:4:0:0: class=0x020000 card=0xa01f8086 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = network
    subclass   = ethernet


Server 2:
lagg1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=19b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4>
        ether 00:1b:21:1b:19:5d
        media: Ethernet autoselect
        status: active
        laggproto lacp
        laggport: em4 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: em1 flags=18<COLLECTING,DISTRIBUTING>

em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=19b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4>
        ether 00:1b:21:1b:19:5d
        media: Ethernet autoselect (1000baseTX <full-duplex>)
        status: active
        lagg: laggdev lagg1

# less /var/run/dmesg.boot |grep em1
em1: <Intel(R) PRO/1000 Network Connection 6.9.6.Yandex[$Revision: 1.36.2.17 $]> port 0x4000-0x401f mem 0xd0320000-0xd033ffff,0xd0300000-0xd031ffff irq 16 at device 0.0 on pci3
em1: Using MSI interrupt
em1: Using TXD_LOW instead of TXDW
em1: [FILTER]
em1: Ethernet address: 00:1b:21:1b:19:5d


em1@pci0:3:0:0: class=0x020000 card=0x10838086 chip=0x10b98086 rev=0x06 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82572EI PRO/1000 PT Desktop Adapter (Copper)'
    class      = network
    subclass   = ethernet
em4@pci0:5:0:0: class=0x020000 card=0xa01f8086 chip=0x10d38086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = network
    subclass   = ethernet


Error log:
Apr 16 00:27:31 2 kernel: em4: link state changed to UP
Apr 16 00:27:34 2 kernel: em4: watchdog timeout -- resetting
Apr 16 00:27:34 2 kernel: em4: Excessive collisions = 0
Apr 16 00:27:34 2 kernel: em4: Sequence errors = 0
Apr 16 00:27:34 2 kernel: em4: Defer count = 0
Apr 16 00:27:34 2 kernel: em4: Missed Packets = 1217754
Apr 16 00:27:34 2 kernel: em4: Receive No Buffers = 0
Apr 16 00:27:34 2 kernel: em4: Receive Length Errors = 0
Apr 16 00:27:34 2 kernel: em4: Receive errors = 0
Apr 16 00:27:34 2 kernel: em4: Crc errors = 0
Apr 16 00:27:34 2 kernel: em4: Alignment errors = 0
Apr 16 00:27:34 2 kernel: em4: Collision/Carrier extension errors = 0
Apr 16 00:27:34 2 kernel: em4: RX overruns = 0
Apr 16 00:27:34 2 kernel: em4: watchdog timeouts = 143
Apr 16 00:27:34 2 kernel: em4: RX MSIX IRQ = 1654280804 TX MSIX IRQ = 1491971579 LINK MSIX IRQ = 1214367
Apr 16 00:27:34 2 kernel: em4: XON Rcvd = 203508246
Apr 16 00:27:34 2 kernel: em4: XON Xmtd = 3183073363
Apr 16 00:27:34 2 kernel: em4: XOFF Rcvd = 202792650
Apr 16 00:27:34 2 kernel: em4: XOFF Xmtd = 3170508497
Apr 16 00:27:34 2 kernel: em4: Good Packets Rcvd = 108209172443
Apr 16 00:27:34 2 kernel: em4: Good Packets Xmtd = 113645818564
Apr 16 00:27:34 2 kernel: em4: TSO Contexts Xmtd = 0
Apr 16 00:27:34 2 kernel: em4: TSO Contexts Failed = 0
Apr 16 00:27:34 2 kernel: em4: Adapter hardware address = 0xc52a0218
Apr 16 00:27:34 2 kernel: em4: CTRL = 0x58100248 RCTL = 0x801a
Apr 16 00:27:34 2 kernel: em4: Packet buffer = Tx=20k Rx=20k
Apr 16 00:27:34 2 kernel: em4: Flow control watermarks high = 18432 low = 16932
Apr 16 00:27:34 2 kernel: em4: tx_int_delay = 0, tx_abs_int_delay = 64
Apr 16 00:27:34 2 kernel: em4: rx_int_delay = 0, rx_abs_int_delay = 66
Apr 16 00:27:34 2 kernel: em4: fifo workaround = 0, fifo_reset_count = 0
Apr 16 00:27:34 2 kernel: em4: hw tdh = 0, hw tdt = 1
Apr 16 00:27:34 2 kernel: em4: hw rdh = 0, hw rdt = 4095, next_rx_desc_to_check = 0
Apr 16 00:27:34 2 kernel: em4: Num Tx descriptors avail = 4095
Apr 16 00:27:34 2 kernel: em4: Tx Descriptors not avail1 = 12063
Apr 16 00:27:34 2 kernel: em4: Tx Descriptors not avail2 = 0
Apr 16 00:27:34 2 kernel: em4: Std mbuf failed = 0
Apr 16 00:27:34 2 kernel: em4: Std mbuf cluster failed = 6
Apr 16 00:27:34 2 kernel: em4: Driver dropped packets = 0
Apr 16 00:27:34 2 kernel: em4: Driver tx dma failure in encap = 0
Apr 16 00:27:34 2 kernel: em4: Packets pended due to reorder = 0
Apr 16 00:27:34 2 kernel: em4: RX interrupts has been masked = 77251713
Apr 16 00:27:34 2 kernel: em4: TX interrupts has been generated = 0
Apr 16 00:27:34 2 kernel: em4: link state changed to DOWN


tcpdump -i em4
00:47:06.511867 LACPv1, length: 110
00:47:36.997247 LACPv1, length: 110



After a reboot everything works normally again for a while.
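
The degraded state shown above (a port missing any of the ACTIVE, COLLECTING or DISTRIBUTING flags) could be detected with a small filter over the ifconfig output. This is only a sketch, assuming the laggport line format shown in this report; the function name check_laggports is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `ifconfig laggN` output on stdin and print
# every laggport line whose flags are not the full
# ACTIVE,COLLECTING,DISTRIBUTING set.
check_laggports() {
    awk '$1 == "laggport:" && $3 !~ /<ACTIVE,COLLECTING,DISTRIBUTING>$/ {
        print $2, $3
    }'
}
```

It would be used as `ifconfig lagg0 | check_laggports`; on server 1 in the state above it should print only the em2 line.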


>How-To-Repeat:
Connect two servers directly to each other through lagg (LACP).
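
A setup of this kind would typically be expressed in /etc/rc.conf roughly as follows. This is a sketch, not the actual configuration of the affected machines: the interface names and the address are taken from the server 1 output above, and the netmask is 0xffffffc0 written in dotted form.

```shell
# /etc/rc.conf fragment (assumed layout, server 1)
# Bring up the physical ports that will join the aggregate.
ifconfig_em2="up"
ifconfig_em3="up"
# Create lagg0 at boot and run LACP over both ports.
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport em2 laggport em3 1.1.1.1 netmask 255.255.255.192"
```

Server 2 would use the same layout with lagg1, em1 and em4.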
>Fix:
So far only a reboot helps :(

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-i386->freebsd-net 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Mon Apr 19 04:57:52 UTC 2010 
Responsible-Changed-Why:  
Over to maintainer(s). 

http://www.freebsd.org/cgi/query-pr.cgi?pr=145728 

From: "Slava@kraslan.ru" <slava@kraslan.ru>
To: bug-followup@FreeBSD.org, tempo@kgs.ru
Cc:  
Subject: Re: kern/145728: [lagg] Stops working lagg between two servers.
Date: Thu, 29 Apr 2010 12:41:36 +0800

 Three days ago I updated one of the servers to 8.0-STABLE (*default
 date=2010.04.05.00.00.00), and the situation is now slightly different.
 There are no more watchdog timeouts, but one of the lagg ports is in
 the following state:
 
 
 lagg1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
         options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
         ether 00:1b:21:1b:19:5d
         media: Ethernet autoselect
         status: active
         laggproto lacp
         laggport: em4 flags=18<COLLECTING,DISTRIBUTING>
         laggport: em1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
 
 em4: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
         options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
         ether 00:1b:21:1b:19:5d
         media: Ethernet 1000baseT (1000baseT <full-duplex>)
         status: active
 
 I tried removing the port from the lagg and adding it back:
 ifconfig lagg1 -laggport em4
 ifconfig lagg1 laggport em4
 but that did not help.
>Unformatted:
