From tom@claimlynx.com  Thu Apr 14 20:58:00 2011
Return-Path: <tom@claimlynx.com>
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 79CBA106564A;
	Thu, 14 Apr 2011 20:58:00 +0000 (UTC)
	(envelope-from tom@claimlynx.com)
Received: from alcatraz.claimlynx.com (alcatraz.claimlynx.com [216.17.83.245])
	by mx1.freebsd.org (Postfix) with ESMTP id 409118FC19;
	Thu, 14 Apr 2011 20:57:59 +0000 (UTC)
Received: from jaguar-2.claimlynx.com (unknown [216.17.68.153])
	by alcatraz.claimlynx.com (Postfix) with ESMTP id 585891CC1F;
	Thu, 14 Apr 2011 15:38:51 -0500 (CDT)
Received: by jaguar-2.claimlynx.com (Postfix, from userid 127)
	id 4AC2611F863; Thu, 14 Apr 2011 15:38:51 -0500 (CDT)
Message-Id: <20110414203851.4AC2611F863@jaguar-2.claimlynx.com>
Date: Thu, 14 Apr 2011 15:38:51 -0500 (CDT)
From: Thomas Johnson <tom@claimlynx.com>
Reply-To: Thomas Johnson <tom@claimlynx.com>
To: FreeBSD-gnats-submit@freebsd.org
Cc: jpaetzel@freebsd.org; root@claimlynx.com
Subject: Routing failure when using VLANs vs. Physical ethernet interfaces.
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         156408
>Category:       kern
>Synopsis:       [vlan] Routing failure when using VLANs vs. Physical ethernet interfaces.
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-net
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Apr 14 21:00:19 UTC 2011
>Closed-Date:    
>Last-Modified:  Wed Apr 20 16:00:19 UTC 2011
>Originator:     Thomas Johnson
>Release:        FreeBSD 8.2-RELEASE amd64
>Organization:
ClaimLynx, Inc.
>Environment:
System: FreeBSD jaguar-2.claimlynx.com 8.2-RELEASE FreeBSD 8.2-RELEASE #8: Sat Feb 26 21:23:00 CST 2011 root@jaguar-2.claimlynx.com:/usr/obj/usr/src/sys/GENERIC-CARP amd64

>Description:

I have found some odd routing behavior that seems to occur when a VLAN interface is a member of a bridge. Specifically, a static route whose next hop sits on the bridged subnet does not function correctly.

Here is some background on my situation. I am building a new host to replace our aging firewall, which runs 8.0. The new machine has a single Ethernet interface (re driver, though over the course of troubleshooting I have also used sk and igb adapters), so I am using VLANs to segment traffic. The 'LAN' VLAN uses interface vlan500, and the 'WAN' uses vlan200. The firewall also has an OpenVPN tunnel to our data center, operating in bridged mode on interface tap0. vlan500 and tap0 are both members of bridge0, which lets the LANs at our office and data center share a single subnet, 172.31.0.0/16.

In this configuration, I can connect from the office LAN to hosts on the data center LAN. The OpenVPN server at the data center (a separate host from the firewall) pushes a route for the DC production subnet when the client connects. The logical configuration looks something like this:

(office lan)<->[vlan500|bridge0|tap0]<-vpn->(dc lan)<->[dc firewall]<->(dc production subnet)
               [      firewall      ]
[      common 172.31.0.0/16 subnet throughout      ]                   [ 100.100.100.128/26 ]

For the sake of reference, here are the relevant IP addresses:

172.31.0.252	- local firewall vlan500
172.31.0.254	- local firewall lan carp
172.31.5.1	- data center firewall
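
As a rough illustration of the LAN side of this layout, the fragment below is a reconstruction in /etc/rc.conf style, not the actual configuration from this host. It assumes the parent NIC is re0, that the VLAN tag matches the interface name (500), and that tap0 is pre-created so it can join the bridge before OpenVPN starts. The WAN VLAN, CARP, and the OpenVPN side are left out.

```shell
# Hypothetical /etc/rc.conf fragment (FreeBSD 8.x); a sketch, not the reporter's file.
# Assumptions: parent NIC is re0, VLAN tag 500, tap0 pre-created for the bridge.
cloned_interfaces="vlan500 tap0 bridge0"
ifconfig_re0="up"
# LAN VLAN carrying the firewall's own address on 172.31.0.0/16.
ifconfig_vlan500="inet 172.31.0.252 netmask 255.255.0.0 vlan 500 vlandev re0"
# Bridge the LAN VLAN with the OpenVPN tap device.
ifconfig_bridge0="addm vlan500 addm tap0 up"
```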

The problem is with the route to the production subnet at the data center. When the OpenVPN connection comes up, the route is installed in the routing table as expected. However, attempts to connect to hosts on this network fail instantly, without even a "host unreachable" response. For example:

~-> ping hostfoo
PING hostfoo.claimlynx.com (100.100.100.149): 56 data bytes
ping: sendto: Invalid argument

Here is the output of 'netstat -rn' on this host:

root@shawshank-1:~-> netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            10.8.20.1          UGS         4   124778 vlan20
172.31.0.0/16      link#12            U           3    56103 vlan50
172.31.0.252       link#12            UHS         0        0    lo0
172.31.0.254       link#13            UH          0        0 carp10
172.31.3.5         link#8             UHS         0        0    lo0
10.8.20.0/24       link#9             U           0       33 vlan20
10.8.20.252        link#9             UHS         0        0    lo0
10.8.20.254        link#14            UH          0        0 carp20
10.8.30.0/24       link#10            U           0        0 vlan30
10.8.30.252        link#10            UHS         0        0    lo0
10.8.30.254        link#15            UH          0        0 carp30
10.8.40.0/24       link#11            U           0        0 vlan40
10.8.40.252        link#11            UHS         0        0    lo0
127.0.0.1          link#7             UH          0        0    lo0
100.100.100.128/26 172.31.5.1         UGS         0    21466   tap0

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               ::1                           UH          lo0
fe80::%lo0/64                     link#7                        U           lo0
fe80::1%lo0                       link#7                        UHS         lo0
ff01:7::/32                       fe80::1%lo0                   U           lo0
ff02::%lo0/32                     fe80::1%lo0                   U           lo0

As you can see, the routing table puts the 172.31.0.0/16 subnet route on the vlan500 interface and the 100.100.100.128/26 production subnet route on the tap0 interface. While troubleshooting, my hunch was that the system was choking because, according to the routing table, the next hop for the production route (172.31.5.1) is on a network (172.31.0.0/16) that is not reachable via tap0, even though in reality it is. To test this, I added a host route for the next hop:

route add 172.31.5.1 -interface tap0
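
For anyone who wants to apply the same workaround only when it is needed, here is a minimal sketch. It checks which interface the kernel currently resolves the next hop through and adds the host route only if that is not the expected one. The helper names route_iface and pin_next_hop are made up for illustration, and the sketch assumes that `route -n get` prints an "interface:" line, as it does on FreeBSD.

```shell
#!/bin/sh
# Sketch only: pin a next hop to a given interface if the kernel resolves it elsewhere.
# route_iface and pin_next_hop are hypothetical helper names, not part of FreeBSD.

# Read `route -n get <addr>` output on stdin and print its "interface:" field.
route_iface() {
    awk '$1 == "interface:" { print $2 }'
}

# pin_next_hop <next-hop-address> <interface>
pin_next_hop() {
    cur_if=$(route -n get "$1" 2>/dev/null | route_iface)
    if [ "$cur_if" != "$2" ]; then
        # Same command as the manual workaround: host route bound to the interface.
        route add "$1" -interface "$2"
    fi
}
```

Running `pin_next_hop 172.31.5.1 tap0` as root once the tunnel is up would reproduce the manual workaround. Calling it from an OpenVPN `up` script is one possible hook, not something this setup is known to do.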

Adding this route resolves the problem, but it seems like a hacky fix. By comparison, the firewall that I am replacing uses the same LAN/bridge/tap setup, but has physical Ethernet interfaces for all segments rather than the VLANs that my new setup uses. The existing setup works fine without the extra host route. Here is the routing table for the existing firewall:

tom@shawshank:~-> netstat -rn
Routing tables

Internet:
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            74.95.66.26        UGS         7  5043426   fxp2
172.31.0.0/16      link#2             U           4 70728235   fxp1
172.31.0.1         link#2             UHS         0  3870772    lo0
172.31.3.4         link#8             UHS         0        0    lo0
74.95.66.24/30     link#3             U           0     1243   fxp2
74.95.66.25        link#3             UHS         0        9    lo0
127.0.0.1          link#6             UH          0  1140570    lo0
192.168.50.0/24    link#1             U           0        0   fxp0
192.168.50.4       link#1             UHS         0        0    lo0
100.100.100.128/26 172.31.5.1         UGS         0    19877   fxp1

Internet6:
Destination                       Gateway                       Flags      Netif Expire
::1                               ::1                           UH          lo0
fe80::%lo0/64                     link#6                        U           lo0
fe80::1%lo0                       link#6                        UHS         lo0
ff01:6::/32                       fe80::1%lo0                   U           lo0
ff02::%lo0/32                     fe80::1%lo0                   U           lo0

The noteworthy difference between the two routing tables is that the old firewall puts the production route on the LAN interface (fxp1), whereas the new firewall puts it on tap0.
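
To make this comparison repeatable on both hosts, a small filter over `netstat -rn` output can print just the interface a destination is bound to. This is only a sketch: netif_for is a made-up helper name, and it assumes the IPv4 column layout shown in the tables above, where Netif is the sixth field.

```shell
#!/bin/sh
# Sketch: print the Netif column for one IPv4 destination from `netstat -rn` output on stdin.
# netif_for is a hypothetical helper name.
netif_for() {
    # IPv4 rows are: Destination Gateway Flags Refs Use Netif [Expire]
    awk -v dest="$1" '$1 == dest { print $6 }'
}
```

For example, `netstat -rn | netif_for 100.100.100.128/26` run on each firewall shows which interface holds the production route.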

>How-To-Repeat:

The failure occurs every time this host is booted and the OpenVPN connection comes up.

>Fix:

The workaround I have found is to add a host route for the next hop on the tap0 interface. This works well enough, but I want to make sure that the behavior is not a symptom of a bug in the vlan driver or elsewhere.

>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-amd64->freebsd-net 
Responsible-Changed-By: linimon 
Responsible-Changed-When: Sun Apr 17 00:59:50 UTC 2011 
Responsible-Changed-Why:  
reclassify. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=156408 

From: Thomas Johnson <tom@claimlynx.com>
To: bug-followup@FreeBSD.org, tom@claimlynx.com
Cc:  
Subject: re: kern/156408: [vlan] Routing failure when using VLANs vs. Physical
 ethernet interfaces.
Date: Wed, 20 Apr 2011 10:21:27 -0500

 After further investigation, I have learned some new information that may or
 may not be useful.
 
 Although I am able to connect from a host on the office lan over the bridge
 to hosts on the data center lan, the firewall itself is unable to connect to
 these same hosts. This can be corrected by adding host static routes to the
 firewall in the same manner as I described in my initial PR. This behavior
 appears to be a result of the 172.31.0.0/16 route pointing at the vlan500
 interface, as I see ARP requests for dc hosts leave the firewall on the
 local lan (vlan500).
 
 By comparison, my existing/old firewall has a matching route for
 172.31.0.0/16 pointing at the local lan (in that case, the lan is a physical
 adapter, not a vlan). Connections from the firewall to hosts at the dc lan
 work correctly, and I see ARP requests on both the lan interface and the vpn
 tap interface.
 
 -- 
 Thomas Johnson
 ClaimLynx, Inc.
>Unformatted:
