From nobody@FreeBSD.org  Wed Aug 28 12:18:56 2002
Return-Path: <nobody@FreeBSD.org>
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id B31DC37B400
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 28 Aug 2002 12:18:56 -0700 (PDT)
Received: from www.freebsd.org (www.FreeBSD.org [216.136.204.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 4EBD143E75
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 28 Aug 2002 12:18:55 -0700 (PDT)
	(envelope-from nobody@FreeBSD.org)
Received: from www.freebsd.org (localhost [127.0.0.1])
	by www.freebsd.org (8.12.4/8.12.4) with ESMTP id g7SJIsOT081710
	for <freebsd-gnats-submit@FreeBSD.org>; Wed, 28 Aug 2002 12:18:54 -0700 (PDT)
	(envelope-from nobody@www.freebsd.org)
Received: (from nobody@localhost)
	by www.freebsd.org (8.12.4/8.12.4/Submit) id g7SJIsHp081709;
	Wed, 28 Aug 2002 12:18:54 -0700 (PDT)
Message-Id: <200208281918.g7SJIsHp081709@www.freebsd.org>
Date: Wed, 28 Aug 2002 12:18:54 -0700 (PDT)
From: Jeff Behl <jeff@expertcity.com>
To: freebsd-gnats-submit@FreeBSD.org
Subject: Path MTU broken - initial too-large packet continuously resent
X-Send-Pr-Version: www-1.0

>Number:         42137
>Category:       kern
>Synopsis:       Path MTU broken - initial too-large packet continuously resent
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Aug 28 12:21:23 PDT 2002
>Closed-Date:    Sun Sep 15 09:08:12 PDT 2002
>Last-Modified:  Sun Sep 15 09:08:12 PDT 2002
>Originator:     Jeff Behl
>Release:        4.6.1-RELEASE-p10
>Organization:
Expertcity
>Environment:
FreeBSD dell350-13.snv 4.6.1-RELEASE-p10 FreeBSD 4.6.1-RELEASE-p10 #1: Fri Aug 16 06:50:36 PDT 2002     root@dell350-13.snv:/usr/src/sys/compile/COMMS44-2  i386
>Description:
FreeBSD keeps resending the packet that triggered an ICMP type 3, code 4 ('fragmentation needed but DF bit set') message, even though the route table holds the correct MTU for the host.

notice the mtu of 1420:

dell350-12.snv# netstat -rnal | grep 10.4.1.134
10.4.1.134         63.xxx.224.129     UGHW        1     2268   1420  fxp0


but tcpdump shows:

17:21:43.497275 63.xxx.224.154.80 > 10.4.1.134.2314: . 1:1461(1460) ack 248 win 17520 (DF)
17:21:51.497212 63.xxx.224.154.80 > 10.4.1.134.2314: . 1:1461(1460) ack 248 win 17520 (DF)
17:22:07.497065 63.xxx.224.154.80 > 10.4.1.134.2314: . 1:1461(1460) ack 248 win 17520 (DF)


Once some sort of timeout fires and PMTUD turns off (DF bit not set), further packets seem to obey the correct MTU.  However, the initial packet that triggered the ICMP from the router keeps being resent until this timeout is reached.  We observed this as web requests to one of our web servers running Apache sometimes hanging on various images on a web page.

We have also seen this same behavior on a 4.4-RELEASE system.
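
As a rough check of the numbers above (a sketch, not part of the original report): assuming 20-byte IP and TCP headers with no options, the 1460-byte segment in the trace makes a 1500-byte datagram, which cannot fit the 1420-byte route MTU. The function and constant names are illustrative only.

```python
IP_HEADER = 20   # assumption: IPv4 header without options
TCP_HEADER = 20  # assumption: TCP header without options


def datagram_size(payload):
    """Total IP datagram size for a TCP segment carrying `payload` bytes."""
    return payload + IP_HEADER + TCP_HEADER


def max_payload(mtu):
    """Largest TCP payload that fits in one unfragmented datagram of size `mtu`."""
    return mtu - IP_HEADER - TCP_HEADER


route_mtu = 1420          # from the netstat output
payload = 1461 - 1        # tcpdump shows "1:1461(1460)"

print(datagram_size(payload))              # 1500
print(datagram_size(payload) > route_mtu)  # True: too large for the path
print(max_payload(route_mtu))              # 1380
```

Under these assumptions the retransmitted segment would need to be re-cut to at most 1380 bytes of payload to pass the 1420-byte hop with DF set.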
>How-To-Repeat:
Run a server (e.g. Apache) and have a client make requests that result in large packets (e.g. images).
>Fix:
Turn off path MTU discovery via sysctl.
>Release-Note:
>Audit-Trail:

From: Martin.Kaeske@Stud.TU-Ilmenau.DE
To: freebsd-gnats-submit@FreeBSD.org, jeff@expertcity.com
Cc:  
Subject: Re: kern/42137: Path MTU broken - initial too-large pa
Date: 31 Aug 2002 09:46:44 -0000

 Hello,
 I have an OpenBSD-2.9 machine (with ipfilter-3.4.16) acting as a router
 for my LAN and I have similar problems (the MTU is lowered but FreeBSD
 keeps sending too-large packets). I found out that it's not FreeBSD's
 fault; the router is sending malformed ICMP messages. The ICMP message
 should contain the IP header and the first 8 bytes of data from the IP
 packet that caused the error, but OpenBSD doesn't include the 8 bytes.
 Since FreeBSD seems to need the src/dst IP address _and_ port number to
 react properly to the ICMP message, and since the src/dst port numbers
 are located in these 8 bytes, FreeBSD keeps sending too-large packets.
 Once I disabled ipfilter in OpenBSD (built a kernel without 'options
 IPFILTER') the problem disappeared (that is, the ICMP messages are now
 correct).
 
 Can you check if you're also getting such wrong ICMP-messages?
 
 Martin
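
To illustrate why those 8 bytes matter, here is a minimal sketch of parsing the packet quoted inside an ICMP error. It is not FreeBSD or OpenBSD code; the function name, the 192.0.2.1 source address (the real one is masked in the report) and the sequence number are made up for the example. The first 8 bytes of a TCP header hold the source port, destination port and sequence number.

```python
import socket
import struct


def parse_quoted_packet(quoted):
    """Parse the original IP header plus the first 8 TCP bytes quoted in an ICMP error.

    Returns (src, dst, sport, dport, seq), or None if the quote is too short
    to contain the ports and sequence number.
    """
    if len(quoted) < 20:
        return None
    ihl = (quoted[0] & 0x0F) * 4  # IP header length in bytes
    if ihl < 20 or len(quoted) < ihl + 8:
        return None
    src = socket.inet_ntoa(quoted[12:16])
    dst = socket.inet_ntoa(quoted[16:20])
    sport, dport, seq = struct.unpack("!HHI", quoted[ihl:ihl + 8])
    return src, dst, sport, dport, seq


# Hypothetical quoted packet: 20-byte IP header (DF set, protocol 6 = TCP).
ip_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 1500, 0, 0x4000, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("10.4.1.134"),
)
tcp_first8 = struct.pack("!HHI", 80, 2314, 1000001)

print(parse_quoted_packet(ip_header + tcp_first8))  # complete quote
print(parse_quoted_packet(ip_header))               # quote without the 8 bytes
```

With the 8 bytes missing, the parser has addresses but no ports, which matches the situation described above: the receiver cannot tell which connection the error belongs to.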
State-Changed-From-To: open->feedback 
State-Changed-By: dwmalone 
State-Changed-When: Sat Aug 31 14:35:16 PDT 2002 
State-Changed-Why:  
Waiting to hear if the ICMP responses are complete. If they are 
not, and the problem somehow involves ipfilter, then it may be worth 
letting Darren know. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=42137 

From: Harold Gutch <logix@foobar.franken.de>
To: freebsd-gnats-submit@FreeBSD.org, jeff@expertcity.com
Cc: martin.kaeske@stud.tu-ilmenau.de
Subject: Re: kern/42137: Path MTU broken - initial too-large packet continuously resent
Date: Mon, 2 Sep 2002 10:27:56 +0200

 Hi,
 
 I doubt that the problem is ipfilter-related for two reasons:
 
 1) As the original poster stated, FreeBSD actually is aware that
    it got an ICMP NEEDFRAG message, as it does lower the MTU for
    that route - it just does not use the new, lowered MTU.
 
 2) I'm seeing the same without ipfilter, even with both the
    router and the client machine running FreeBSD 4-STABLE.  Neither
    of the two has IPFILTER in the kernel; both have IPFIREWALL
    (ipfw) in it.
 
 
 bye,
   Harold

From: Martin Kaeske <Martin.Kaeske@Stud.TU-Ilmenau.DE>
To: logix@foobar.franken.de, freebsd-gnats-submit@FreeBSD.org
Cc: jeff@expertcity.com
Subject: Re: kern/42137: Path MTU broken - initial too-large packet continuously resent
Date: Mon, 2 Sep 2002 11:57:31 +0200

 On Mon, Sep 02, 2002 at 10:27:56AM +0200, Harold Gutch wrote:
 > 1) As the original poster stated, FreeBSD actually is aware that
 >    it got an ICMP NEEDFRAG message, as it does lower the MTU for
 >    that route - it just does not use the new, lowered MTU.
 
 As far as I understand the code in sys/netinet/ip_icmp.c, an incoming
 ICMP message is handled by icmp_input(). This function updates the
 MTU of a specific route based on the src/dst IP address and the new
 MTU (suggested by ICMP). After that, icmp_input() calls tcp_ctlinput()
 to inform the TCP layer of the change. tcp_ctlinput() recognizes
 that PMTUD has to be performed (that means calling tcp_mtudisc()), but
 before that it calls in_pcblookup_hash(). Due to the wrong
 src/dst port values (from the ICMP response) the lookup fails
 and tcp_mtudisc() is never called. Once my router generated correct
 responses, tcp_mtudisc() was called and everything worked fine.
 I didn't look further inside the TCP code, but I think there must
 be some sort of cache that the TCP functions use to generate
 the packets.
 That could also explain the timeouts the OP wrote
 about; any new TCP connection uses the new values.
 
 > 2) I'm seeing the same without ipfilter, even with both the
 >    router and the client machine running FreeBSD 4-STABLE.  Neither
 >    of the two has IPFILTER in the kernel; both have IPFIREWALL
 >    (ipfw) in it.
 
 Hmm, that is strange.
 Do you get correct ICMP responses (correct src-port and dst-port)?
 
 Martin
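
The flow described in this message can be sketched as a toy model. This is not the kernel code: the dictionaries stand in for the cloned host route and the PCB hash, every name is hypothetical, and the 40-byte header allowance assumes no IP or TCP options.

```python
class Conn:
    """Stand-in for per-connection TCP state."""

    def __init__(self, mss):
        self.mss = mss


route_mtu = {}  # foreign address -> MTU (stands in for the cloned host route)
conns = {}      # (laddr, lport, faddr, fport) -> Conn (stands in for the PCB hash)


def handle_needfrag(laddr, lport, faddr, fport, next_hop_mtu):
    """Model of the two steps: update the route, then notify the connection."""
    # Step 1: the route MTU is updated from the addresses alone.
    route_mtu[faddr] = next_hop_mtu
    # Step 2: finding the connection needs the ports as well.
    conn = conns.get((laddr, lport, faddr, fport))
    if conn is None:
        return False  # lookup failed: the connection keeps its old segment size
    conn.mss = min(conn.mss, next_hop_mtu - 40)
    return True


conn = Conn(mss=1460)
conns[("192.0.2.1", 80, "10.4.1.134", 2314)] = conn

# ICMP quote with missing or wrong ports: route updated, connection untouched.
print(handle_needfrag("192.0.2.1", 0, "10.4.1.134", 0, 1420), route_mtu, conn.mss)

# ICMP quote with correct ports: the connection is found and its MSS lowered.
print(handle_needfrag("192.0.2.1", 80, "10.4.1.134", 2314, 1420), conn.mss)
```

In this model the first call reproduces the reported symptom: the route shows 1420 while the existing connection still sends 1460-byte segments.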

From: Jeff Behl <jeff@expertcity.com>
To: freebsd-gnats-submit@FreeBSD.org, jeff@expertcity.com
Cc:  
Subject: Re: kern/42137: Path MTU broken - initial too-large packet continuously
 resent
Date: Wed, 04 Sep 2002 14:52:04 -0700

 below is a note i sent out privately to some who responded about the 
 problem.  Please contact me if you need any more information; i'd be 
 more than happy to work with someone on this problem.  I have tcpdumps 
 of pmtu working and not working, including the icmps sent by the router.
 
 thanks.
 
 
 so we left out some information which has proven to be quite
 salient...though it still seems there is a bug lurking, though very
 subtle...
 
 our web servers are behind a layer two load balancer (aha everyone
 exclaims!  but wait), i.e. it transparently re-writes the ip headers of
 packets and sends them to the web server.  upon reply, they are again
 re-written and sent to the client.
 
 It turns out that on requests made directly to the web server, pmtud
 works fine.  As you have guessed, when requests go through the load
 balancer, pmtud doesn't work.  Why?  I don't know.  On requests that do
 go through the load balancer we see the same traffic pattern as requests
 that don't.  In both cases ICMPs are received by the actual server
 (verified by sniffing from the web server).  In both cases the cloned
 route table (is this the correct way to say this?) shows the correct,
 lower MTU for the host (1420 in our case).  The difference is that when
 directly accessing the web server the original too-large packet is
 re-segmented and sent with the correct MTU whereas when going through
 the load balancer, the too-large packet is repeatedly sent.  Please see
 the attached trace, taken from the server.
 
 So TCP is not noticing/being notified about the new MTU?  Further
 connections through the load balancer to the server use the current MTU
 as long as the cloned host route with the correct mtu exists (netstat
 -nalr).  It's just the initial too-large packet that is being
 continuously resent.  I've looked through the ICMPs and they appear to
 be correctly rewritten by the load balancer (and FBSD does get the
 correct new MTU in both cases).
 
 ideas??!  I tried looking through the ip stack some but I'm not really C
 savvy and couldn't figure it out.   I'd be more than happy to provide
 dumps of the actual exchange to anyone who is interested.
 
 thanks for any help...this is hurting my head.
 
 jeff
 
 

From: Martin Kaeske <Martin.Kaeske@Stud.TU-Ilmenau.DE>
To: Jeff Behl <jeff@expertcity.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/42137: Path MTU broken - initial too-large packet continuously resent
Date: Fri, 6 Sep 2002 19:46:00 +0200

 Hi Jeff,
 Thanks for the tcpdump, I think I was able to identify the problem.
 As I wrote in the PR, tcp_ctlinput() is responsible for calling tcp_mtudisc(),
 but tcp_ctlinput() does not only check the src/dst port; it also examines whether
 the TCP sequence number (found in the ICMP response) is between snd_una (send
 unacknowledged) and snd_max (highest number sent). I found out that the
 ICMP responses don't contain the correct sequence number: the first two bytes
 are correct but the last two aren't.
 
 So i think it is a router problem (as it was in my case ;).
 
 HTH
 Martin
 
 -- 
 "At the beginning of the week, we sealed ten BSD programmers into a
  computer room with a single distribution of BSD Unix. Upon opening
  the room after seven days, we found all ten programmers dead,
  clutching each others throats, and thirteen new flavors of BSD."
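
The sequence-number check described in this message can be sketched as follows. This is an illustration of the idea, not the kernel source: the helper names are hypothetical, the comparison uses 32-bit wrap-around arithmetic, and the sample values are invented to show how a quote whose last two bytes were rewritten can fall outside the window.

```python
MOD = 1 << 32


def seq_lt(a, b):
    """True if a comes before b in 32-bit wrap-around sequence space."""
    return ((a - b) % MOD) >= (1 << 31)


def seq_geq(a, b):
    """True if a is at or after b in 32-bit wrap-around sequence space."""
    return not seq_lt(a, b)


def icmp_seq_acceptable(icmp_seq, snd_una, snd_max):
    """Accept the ICMP error only if it quotes data that is still in flight."""
    return seq_geq(icmp_seq, snd_una) and seq_lt(icmp_seq, snd_max)


snd_una = 0x12345678                 # hypothetical oldest unacknowledged byte
snd_max = (snd_una + 1460) % MOD     # one 1460-byte segment in flight

genuine = snd_una                    # sequence number of the oversized segment
mangled = 0x12340000                 # same first two bytes, last two rewritten

print(icmp_seq_acceptable(genuine, snd_una, snd_max))
print(icmp_seq_acceptable(mangled, snd_una, snd_max))
```

Here the genuine quote passes and the mangled one is rejected, so in this model the MTU change would never reach the connection even though the ports match.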

From: Jeff Behl <jeff@expertcity.com>
To: Martin Kaeske <Martin.Kaeske@Stud.TU-Ilmenau.DE>
Cc: freebsd-gnats-submit@FreeBSD.org,
	Steve Francis <sfrancis@expertcity.com>
Subject: Re: kern/42137: Path MTU broken - initial too-large packet continuously
 resent
Date: Fri, 06 Sep 2002 11:07:02 -0700

 Actually, it would be our layer two load balancer that is not re-writing 
 the icmps from the router correctly.  I was using ethereal to do an 
 inspection of the packets but it doesn't seem to show the sequence 
 number in the tcp header in the icmp.  Thanks!  I knew it had something 
 to do with this blasted load balancer.  This is a Foundry ServerIron, 
 for all who might be interested.
 
 But in any case, it still doesn't seem to be a 'good idea' for the stack 
 to transmit packets (and keep re-transmitting them) that are above the 
 MTU that it has for a certain host, does it?  Or is this the only way it 
 makes sense, as it would be inefficient to have TCP re-check the MTU on 
 every re-transmitted packet?   I'm certainly no expert in this area...
 
 
 thanks again.
 
 Jeff
 
 Martin Kaeske wrote:
 > Hi Jeff,
 > Thanks for the tcpdump, I think I was able to identify the problem.
 > As I wrote in the PR, tcp_ctlinput() is responsible for calling tcp_mtudisc(),
 > but tcp_ctlinput() does not only check the src/dst port; it also examines whether
 > the TCP sequence number (found in the ICMP response) is between snd_una (send
 > unacknowledged) and snd_max (highest number sent). I found out that the
 > ICMP responses don't contain the correct sequence number: the first two bytes
 > are correct but the last two aren't.
 > 
 > So i think it is a router problem (as it was in my case ;).
 > 
 > HTH
 > Martin
 > 
 
 
State-Changed-From-To: feedback->closed 
State-Changed-By: dwmalone 
State-Changed-When: Sun Sep 15 09:07:36 PDT 2002 
State-Changed-Why:  
Traced down to a load balancer not rewriting ICMP responses correctly. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=42137 
>Unformatted:
