From nobody@FreeBSD.org  Mon Apr 28 10:51:06 2008
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 6F3161065674
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 28 Apr 2008 10:51:06 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (www.freebsd.org [IPv6:2001:4f8:fff6::21])
	by mx1.freebsd.org (Postfix) with ESMTP id 258C88FC14
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 28 Apr 2008 10:51:06 +0000 (UTC)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.14.2/8.14.2) with ESMTP id m3SAoUcT075738
	for <freebsd-gnats-submit@FreeBSD.org>; Mon, 28 Apr 2008 10:50:30 GMT
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.14.2/8.14.1/Submit) id m3SAoUtV075737;
	Mon, 28 Apr 2008 10:50:30 GMT
	(envelope-from nobody)
Message-Id: <200804281050.m3SAoUtV075737@www.freebsd.org>
Date: Mon, 28 Apr 2008 10:50:30 GMT
From: Auke Zaaiman <a.zaaiman@nouzelle.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: CARP messages filtered by Realtek driver on > 6.2
X-Send-Pr-Version: www-3.1
X-GNATS-Notify:

>Number:         123166
>Category:       kern
>Synopsis:       [re] CARP messages filtered by Realtek driver on > 6.2 [regression]
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    yongari
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Apr 28 11:00:11 UTC 2008
>Closed-Date:    Tue Oct 12 01:51:52 UTC 2010
>Last-Modified:  Tue Oct 12 01:51:52 UTC 2010
>Originator:     Auke Zaaiman
>Release:        6.2-RELEASE
>Organization:
Nouzelle Internet Services
>Environment:
FreeBSD loadbalance01.nouzelle.local 6.2-RELEASE FreeBSD 6.2-RELEASE #1: Sun Apr 27 18:37:02 CEST 2008 root@loadbalance01.nouzelle.local:/log/obj/log/src/sys/SMP  amd64

>Description:
In our testing environment we have the following configuration for
failover/loadbalancing:

Two machines, each with:
AMD Sempron(tm) Processor 3000+
1GB RAM
2x RealTek 8169S Single-chip Gigabit Ethernet (re0 and re1)
1x VIA VT6102 Rhine II 10/100BaseTX (vr0)

Our initial setup was built on 6.2-RELEASE.
The setup is:
- vr0 on both machines is configured with an internal IP; additionally
  there is a CARP device to create a global gateway for the internal
  network. pfsync also runs over vr0, but that isn't relevant here.
- re0 on both machines has no IP and is only used for VLANs. The VLANs
  are all configured with their own IPs in separate IP ranges (e.g.
  172.29.27.0/24, 172.29.24.0/24). For every VLAN device there is a
  related CARP device to provide a global gateway for the network behind
  that VLAN.
- re1 on both machines is the external interface, with two dedicated IPs
  on each in two separate IP ranges, and again a CARP device to provide
  failover and eventually loadbalancing for the external IPs.

Next to the above machines we have several machines in the backend and
frontend running on different hardware (non-Realtek NICs), also in failover.

The whole setup ran fine for more than 3 months. CARP worked fine,
everything was communicating fine, etc.

When 7.0 was released we decided to upgrade all machines in the environment
to this release. The upgrades went fine, but a problem appeared.

The loadbalancers failed to see each other's CARP messages, but only on the
Realtek NICs; CARP running on top of vr0 was working fine.

- We checked whether firewalls were all of a sudden in the way (which would
  be surprising, as nothing in the configs had changed).
- We checked whether the ULE scheduler was misbehaving; that wasn't the
  problem either.
- Any other machine could still send traffic over the loadbalancers.
- The switch didn't report any errors, nor did the local interfaces on the
  loadbalancers.

As this had us stumped, we decided to downgrade them to 6.3-RELEASE, hoping
this would fix it. The problem remained, though. After letting it rest for
a few weeks, I looked at the problem again.

I checked everything above again and then decided to look with tcpdump
on the loadbalancers and on a few frontend and backend machines to see
what was happening. And there was the surprise:

- loadbalancers are sending out CARP messages;
- frontend and backend machines are receiving those CARP messages;
- frontend and backend machines are also sending out CARP messages for
  themselves;
- loadbalancers are not receiving any CARP messages on their Realtek NICs.

Although I haven't been able to find any changes to the Realtek driver
between 6.2 and 6.3, I suspect there are some.

So the next thing I did was downgrade the loadbalancers to 6.2-RELEASE, and
as soon as they booted everything was fine.

There is a problem with incoming CARP messages somehow being filtered on
Realtek NICs, starting with releases after 6.2-RELEASE.

>How-To-Repeat:

>Fix:


>Release-Note:
>Audit-Trail:
State-Changed-From-To: open->feedback 
State-Changed-By: gavin 
State-Changed-When: Mon Apr 28 19:46:36 UTC 2008 
State-Changed-Why:  
To submitter:  Can you give the output of "pciconf -l" so that we know 
exactly what type of card you have?  Also, are you able/willing to 
establish exactly which change has broken the card for you (by updating 
to 6-STABLE half way between 6.2 and 6.3 and seeing if you still see the 
problem)?  From looking at the code there are a few commits that have the 
potential to be related. 


Responsible-Changed-From-To: freebsd-bugs->freebsd-net 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Mon Apr 28 19:46:36 UTC 2008 
Responsible-Changed-Why:  
Over to maintainers 

http://www.freebsd.org/cgi/query-pr.cgi?pr=123166 

From: Gavin Atkinson <gavin@FreeBSD.org>
To: bug-followup@FreeBSD.org
Cc:  
Subject: [Fwd: RE: kern/123166: [re] CARP messages filtered by Realtek
	driver on > 6.2]
Date: Tue, 29 Apr 2008 09:13:29 +0100

 -------- Forwarded Message --------
 From: Auke Zaaiman <a.zaaiman@nouzelle.com>
 Date: Mon, 28 Apr 2008 22:27:32 +0200
 
 Hi,
 
 Below is the output of pciconf -l:
 re0@pci0:8:0:   class=0x020000 card=0x816910ec chip=0x816910ec rev=0x10 hdr=0x00
 re1@pci0:9:0:   class=0x020000 card=0x816910ec chip=0x816910ec rev=0x10 hdr=0x00
 
 And yes, we are able and willing to establish exactly which change has
 broken the card. Can you give me details on where I can fetch the source
 I need to update to?
 
Responsible-Changed-From-To: freebsd-net->gavin 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Tue Apr 29 18:13:53 UTC 2008 
Responsible-Changed-Why:  
I'll try and get further info from the submitter. 

To submitter:  Firstly, does "ifconfig re0 promisc" make any 
difference? 

Secondly, you could try getting the most recent versions of 
the driver files from: 
http://www.freebsd.org/cgi/cvsweb.cgi/~checkout~/src/sys/dev/re/if_re.c?rev=1.46.2.39;content-type=text%2Fplain 
http://www.freebsd.org/cgi/cvsweb.cgi/~checkout~/src/sys/pci/if_rlreg.h?rev=1.51.2.13;content-type=text%2Fplain 

put the first into /usr/src/sys/dev/re and the second into 
/usr/src/sys/pci, recompile the kernel and test. 

Thirdly, if that hasn't fixed it, we need to establish when the 
breakage happened.  The easiest way is to try different kernels 
(you shouldn't need to recompile the userland for this), and 
basically try to establish when the breakage was introduced to 
6.x.  Assuming you are using csup or cvsup to update your 
system, add the following line: 

*default date=2007.07.14.00.00.00 

Then csup will bring the sources down as they were on that date (the 14th of July 2007 in this example). 
Recompile, and see if CARP still works with that kernel.  If so, 
move the date forward, and if not, go backwards in time.  Some 
useful dates to try will probably be: 

2007.02.01.00.00.00 
2007.04.20.00.00.00 
2007.05.01.00.00.00 
2007.08.01.00.00.00 
2007.09.20.00.00.00 
2007.12.01.00.00.00 

i.e. in the worst case, you may have to recompile your kernel three 
times to figure out between which of the above dates the breakage occurred. 

(For reference, 6.2 was released on 2007.01.15, with 6.3 on 2008.01.18) 
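
The date search described above is a bisection over the candidate dates. 
A minimal sketch in Python, assuming a hypothetical carp_works(date) 
callback that stands in for "csup to that date, rebuild the kernel, 
reboot, and check CARP" (the function names are illustrative and are not 
part of csup or any FreeBSD tool): 

```python
from typing import Callable, Optional, Sequence, Tuple

# Candidate dates from the list above, in csup "date=" format (oldest first).
# The fixed-width format means plain string comparison orders them correctly.
DATES = [
    "2007.02.01.00.00.00",
    "2007.04.20.00.00.00",
    "2007.05.01.00.00.00",
    "2007.08.01.00.00.00",
    "2007.09.20.00.00.00",
    "2007.12.01.00.00.00",
]


def bisect_breakage(
    dates: Sequence[str], carp_works: Callable[[str], bool]
) -> Tuple[Optional[str], Optional[str], int]:
    """Return (last_good, first_bad, builds).

    Assumes the sources are good before the breaking commit and bad from
    then on. last_good is None if every date is bad; first_bad is None
    if every date is good.
    """
    lo, hi = 0, len(dates)  # invariant: dates[:lo] good, dates[hi:] bad
    builds = 0
    while lo < hi:
        mid = (lo + hi) // 2
        builds += 1  # one csup + kernel rebuild + reboot per probe
        if carp_works(dates[mid]):
            lo = mid + 1
        else:
            hi = mid
    last_good = dates[lo - 1] if lo > 0 else None
    first_bad = dates[lo] if lo < len(dates) else None
    return last_good, first_bad, builds


if __name__ == "__main__":
    # Hypothetical outcome: pretend the breakage landed on 2007.06.15.
    pretend_break = "2007.06.15.00.00.00"
    print(bisect_breakage(DATES, lambda d: d < pretend_break))
```

With six candidate dates the loop never needs more than three probes, 
which is where the "three times" figure comes from. If every date turns 
out to be good, first_bad is None and the breakage lies after the last 
date tested. 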



From: "Auke Zaaiman" <a.zaaiman@nouzelle.com>
To: <bug-followup@FreeBSD.org>,
	"Auke Zaaiman" <a.zaaiman@nouzelle.com>
Cc:  
Subject: Re: kern/123166: [re] CARP messages filtered by Realtek driver on > 6.2 [regression]
Date: Sun, 18 May 2008 11:18:20 +0200

 > Firstly, does "ifconfig re0 promisc" make any difference?
 
 No, doesn't make any difference.
 
 > Secondly, you could try getting the most recent versions of
 > the driver files from:
 > http://www.freebsd.org/cgi/cvsweb.cgi/~checkout~/src/sys/dev/re/if_re.c?rev=1.46.2.39;content-type=text%2Fplain
 > http://www.freebsd.org/cgi/cvsweb.cgi/~checkout~/src/sys/pci/if_rlreg.h?rev=1.51.2.13;content-type=text%2Fplain
 > put the first into /usr/src/sys/dev/re and the second into
 > /usr/src/sys/pci, recompile the kernel and test.
 
 This gave me some errors; it seems a fair bit of supporting code has
 also changed.
 Below are the messages I got when building against 6.2-RELEASE:
 /log/usr/src/sys/dev/re/if_re.c: In function `re_attach':
 /log/usr/src/sys/dev/re/if_re.c:1261: warning: implicit declaration of function `bus_get_dma_tag'
 /log/usr/src/sys/dev/re/if_re.c:1261: warning: nested extern declaration of `bus_get_dma_tag'
 /log/usr/src/sys/dev/re/if_re.c:1264: warning: passing arg 1 of `bus_dma_tag_create' makes pointer from integer without a cast
 /log/usr/src/sys/dev/re/if_re.c: In function `re_start':
 /log/usr/src/sys/dev/re/if_re.c:2267: warning: implicit declaration of function `ETHER_BPF_MTAP'
 /log/usr/src/sys/dev/re/if_re.c:2267: warning: nested extern declaration of `ETHER_BPF_MTAP'
 
 > Thirdly, if that hasn't fixed it, we need to establish when the
 > breakage happened.
 
 Right, so I am using the supfile below, of course with a different
 date each time.
 ----
 *default host=cvsup.nl.freebsd.org
 *default base=/var/db
 *default prefix=/usr
 *default date=2007.12.01.00.00.00
 *default release=cvs delete use-rel-suffix compress
 
 src-all
 ----
 
 Every time, I recompiled the kernel and rebooted; CARP worked without
 problems for all of the dates listed below:
 2007.02.01.00.00.00
 2007.04.20.00.00.00
 2007.05.01.00.00.00
 2007.08.01.00.00.00
 2007.09.20.00.00.00
 2007.12.01.00.00.00
 Note, I only recompiled on one of the machines.
 
 Normal upgrades of those machines have been done by fetching all freebsd
 src packages from, for example:
 ftp://ftp.nl.freebsd.org/pub/FreeBSD/releases/amd64/7.0-RELEASE/src/
 
State-Changed-From-To: feedback->open 
State-Changed-By: gavin 
State-Changed-When: Mon Jun 30 12:16:37 UTC 2008 
State-Changed-Why:  


State-Changed-From-To: open->feedback 
State-Changed-By: gavin 
State-Changed-When: Mon Jun 30 12:16:55 UTC 2008 
State-Changed-Why:  
To submitter: As things still work for you with sources from 
2007.12.01.00.00.00, my best guess is that the following change 
breaks things for you: 
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/re/if_re.c.diff?r1=1.46.2.35;r2=1.46.2.35.2.1;only_with_tag=RELENG_6_3 

Is there any chance you could confirm that this is the case, by either 
applying that patch and testing things, or by using the following dates 
in your supfile in the same way as last time? 

2007.12.06.06.00.00 (interface should work) 
2007.12.06.06.05.00 (I suspect this will fail) 

Using the "-L2" argument to cvsup, you should be able to confirm that only 
a single file (if_re.c) is changing. 

If things still work for you after that, it would be useful if you could 
carry on advancing the date, maybe by a week at a time, until you can 
establish exactly what breaks if_re for you. 


From: "Auke Zaaiman" <a.zaaiman@nouzelle.com>
To: <bug-followup@FreeBSD.org>,
	"Auke Zaaiman" <a.zaaiman@nouzelle.com>
Cc:  
Subject: Re: kern/123166: [re] CARP messages filtered by Realtek driver on > 6.2 [regression]
Date: Mon, 22 Sep 2008 10:53:32 +0200

 
 Hi,
 
 So I had some time to do more testing.
 
 2007.12.06.06.00.00 (interface should work)
 2007.12.06.06.05.00 (I suspect this will fail)
 
 Actually both dates failed, so I started working back from
 2007.12.06.06.00.00 towards 2007.12.01.00.00.00
 
 It seems it broke between 2007.12.03.00.00.00 and 2007.12.04.00.00
 
 Change noted by using -L2:
 
 Edit src/sys/dev/re/if_re.c
 
   Add delta 1.98 2007.12.03.01.28.08 yongari
 
 Regards,
 
 Auke Zaaiman
 
 
State-Changed-From-To: feedback->open 
State-Changed-By: gavin 
State-Changed-When: Mon Sep 22 09:35:47 UTC 2008 
State-Changed-Why:  
Feedback was received 


Responsible-Changed-From-To: gavin->yongari 
Responsible-Changed-By: gavin 
Responsible-Changed-When: Mon Sep 22 09:35:47 UTC 2008 
Responsible-Changed-Why:  
yongari, is there any chance you could take a look at this?  The submitter 
has performed a binary search, and determined that it was the MFC of 
src/sys/dev/re/if_re.c 1.98 to RELENG_6 that has broken CARP.  Thanks! 

State-Changed-From-To: open->feedback 
State-Changed-By: yongari 
State-Changed-When: Wed Mar 3 19:03:02 UTC 2010 
State-Changed-Why:  
Can you still reproduce it on a recent FreeBSD? I think it was fixed 
a long time ago. 

State-Changed-From-To: feedback->closed 
State-Changed-By: yongari 
State-Changed-When: Tue Oct 12 01:51:17 UTC 2010 
State-Changed-Why:  
Feedback timed out; closing. 
There was a small window in which multicast handling was broken in 
re(4), but it was fixed a long time ago (SVN r174809). I believe it was 
also merged to stable/6. If you still see the issue on more recent 
FreeBSD releases, please open a new PR. 
Thanks for reporting! 
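
For context on why a multicast bug in the driver would hide CARP: on 
FreeBSD releases of this era, CARP advertisements are sent as IP 
protocol 112 to the multicast group 224.0.0.18, so they arrive on an 
Ethernet multicast address rather than on the NIC's own unicast address. 
A small Python sketch of the standard IPv4-to-Ethernet multicast mapping 
(RFC 1112) follows; the function name is illustrative and this is not 
code from re(4). 

```python
import ipaddress

CARP_GROUP = "224.0.0.18"  # multicast group used for CARP advertisements
CARP_IP_PROTO = 112        # IP protocol number, shared with VRRP


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet address (RFC 1112).

    The low 23 bits of the group address are copied into the fixed
    01:00:5e:00:00:00 prefix.
    """
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF
    mac = 0x01005E000000 | low23
    # Emit the six bytes from most to least significant.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


if __name__ == "__main__":
    print(multicast_mac(CARP_GROUP))  # prints 01:00:5e:00:00:12
```

A receiver whose multicast filter does not accept that address will 
silently drop the advertisements while ordinary unicast traffic keeps 
flowing, which is consistent with the symptoms in the original report. 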

>Unformatted:
