From dominic@indigo-ic.co.uk  Fri Sep 13 03:31:47 2002
Return-Path: <dominic@indigo-ic.co.uk>
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3D4D837B400
	for <FreeBSD-gnats-submit@freebsd.org>; Fri, 13 Sep 2002 03:31:47 -0700 (PDT)
Received: from blueyonder.co.uk (pcow057o.blueyonder.co.uk [195.188.53.94])
	by mx1.FreeBSD.org (Postfix) with ESMTP id BEBE143E4A
	for <FreeBSD-gnats-submit@freebsd.org>; Fri, 13 Sep 2002 03:31:40 -0700 (PDT)
	(envelope-from dominic@indigo-ic.co.uk)
Received: from pcow057o.blueyonder.co.uk ([127.0.0.1]) by blueyonder.co.uk  with Microsoft SMTPSVC(5.5.1877.757.75);
	 Fri, 13 Sep 2002 11:31:39 +0100
Received: from the-mayor.dom (unverified [62.31.234.90]) by pcow057o.blueyonder.co.uk
 (Content Technologies SMTPRS 4.2.9) with ESMTP id <T5d4f700270ac1785b3327@pcow057o.blueyonder.co.uk> for <FreeBSD-gnats-submit@freebsd.org>;
 Fri, 13 Sep 2002 11:31:39 +0100
Received: from the-mayor.dom (localhost [127.0.0.1])
	by the-mayor.dom (8.12.3/8.12.3) with ESMTP id g8DAVcAK006141
	for <FreeBSD-gnats-submit@freebsd.org>; Fri, 13 Sep 2002 11:31:38 +0100 (BST)
	(envelope-from dominic@indigo-ic.co.uk)
Received: (from root@localhost)
	by the-mayor.dom (8.12.3/8.12.3/Submit) id g8DAV0QI006119;
	Fri, 13 Sep 2002 11:31:00 +0100 (BST)
Message-Id: <200209131031.g8DAV0QI006119@the-mayor.dom>
Date: Fri, 13 Sep 2002 11:31:00 +0100 (BST)
From: Dominic Froud <dominic@indigo-ic.co.uk>
Reply-To: Dominic Froud <dominic@indigo-ic.co.uk>
To: FreeBSD-gnats-submit@freebsd.org
Cc:
Subject: [PATCH] Wrong MTU in need-frag ICMP using IPSEC tunnels w/out GIF
X-Send-Pr-Version: 3.113
X-GNATS-Notify:

>Number:         42727
>Category:       kern
>Synopsis:       [PATCH] Wrong MTU in need-frag ICMP using IPSEC tunnels w/out GIF
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bms
>State:          closed
>Quarter:        
>Keywords:       
>Date-Required:  
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Sep 13 03:40:02 PDT 2002
>Closed-Date:    Wed Oct 05 03:04:18 GMT 2005
>Last-Modified:  Wed Oct 05 03:04:18 GMT 2005
>Originator:     Dominic Froud
>Release:        FreeBSD 4.6-RELEASE i386
>Organization:
>Environment:
System: FreeBSD the-mayor.dom 4.6-RELEASE FreeBSD 4.6-RELEASE #17: Wed Sep 11 17:13:53 BST 2002 root@the-mayor.dom:/usr/src/sys/compile/SERVER i386

Kernel options:
INET
INET6
IPSEC
IPSEC_ESP
IPSEC_DEBUG
MROUTING
IPFIREWALL
IPFIREWALL_FORWARD
IPDIVERT
RANDOM_IP_ID
ICMP_BANDLIM

Server has two Macronix 98715AEC-C 10/100BaseTX cards at dc0 and dc1.

net.inet.ipsec.dfbit=1

>Description:
I bridged my LAN (subnet 10.0.1.0/24) with a friend's LAN (10.0.0.0/24)
using IPSEC tunnels without GIF devices. I use FreeBSD 4.6 and he uses
Red Hat Linux 7.x. My friend couldn't pull any packets from machines on
my LAN that required MTU reduction to prevent fragmentation, e.g. SMB
TCP packets. Upon further inspection, my FreeBSD server was telling the
machine on my LAN that fragmentation was needed but was suggesting an
incorrect MTU of 1500 instead of one that took the IPSEC tunnel headers
into account. This caused the machine on my LAN to retry the same
oversized packet again and again, until the requesting machine on his
LAN eventually timed out with a short read. [A short-read timeout is a
common symptom of other MTU issues too, but this specific issue can be
diagnosed accurately from the kernel message and tcpdump output shown
under How-To-Repeat.]


There is code in netinet/ip_input.c:ip_forward() that should deal with
this, but it never gets the chance to do the calculation because an
earlier IPSEC function call returns an error. In ip_forward(), before
ip_output() is called, a rough copy of the top mbuf at 'm' is made and
pointed to by 'mcopy'. Only the IP header and up to 8 bytes of payload
are copied, but the length stored in the packet header (m_pkthdr)
remains unchanged and still reflects the original packet length.


If ip_forward()'s call to ip_output() fails with EMSGSIZE and the packet
would have traversed an IPSEC tunnel, then ipsec_setspidx() in
netinet6/ipsec.c is (eventually) called. It sanity-checks the mbuf it is
passed and fails with an error like: "ipsec_setspidx: total of
m_len(28) != pkthdr.len(1500), ignored."


The 28 is the truncated length of mcopy (20-byte IP header + max 8
bytes) and the 1500 is the size of the original packet. The rest of the
reduced-MTU calculation is therefore skipped at this point and an
unchanged MTU is used to construct the ICMP frag-needed packet.
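
As a rough illustration of the mismatch described above, here is a
userland sketch. It is not kernel code: struct fake_mbuf and the three
functions are hypothetical names that model only the two length fields
involved, and the check is a simplified stand-in for the sanity check
that ipsec_setspidx() reports on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct mbuf: only the two
 * length fields discussed above are modelled. */
struct fake_mbuf {
    int m_len;       /* bytes actually held in this mbuf */
    int pkthdr_len;  /* total packet length recorded in m_pkthdr.len */
};

/* Model of the length check: returns 0 when the data held matches the
 * recorded packet length, -1 otherwise (the "ignored" case). */
int
lengths_consistent(const struct fake_mbuf *m)
{
    assert(m != NULL);
    return (m->m_len == m->pkthdr_len) ? 0 : -1;
}

/* Build the truncated copy the way the description says ip_forward()
 * does: IP header plus at most 8 bytes, packet length left untouched. */
struct fake_mbuf
make_mcopy(int ip_hlen, int orig_pkt_len)
{
    struct fake_mbuf mcopy;
    int payload = orig_pkt_len - ip_hlen;

    mcopy.m_len = ip_hlen + (payload < 8 ? payload : 8);
    mcopy.pkthdr_len = orig_pkt_len;
    return mcopy;
}

/* The proposed fix: make the recorded length match the truncated copy. */
void
fix_mcopy(struct fake_mbuf *mcopy)
{
    assert(mcopy != NULL);
    mcopy->pkthdr_len = mcopy->m_len;
}
```

For a 1500-byte packet with a 20-byte IP header, make_mcopy(20, 1500)
yields m_len 28 and pkthdr_len 1500, so lengths_consistent() returns -1
until fix_mcopy() has been applied.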

>How-To-Repeat:
Bridge two subnets using IPSEC tunnels without the GIF device. If you
bridge the encapsulating machines themselves as well, you should end up
with 8 policies like the following:


81.5.133.243[any] 10.0.1.0/24[any] any
        in ipsec
        esp/tunnel/81.5.133.243-62.31.234.90/require
        spid=1 seq=7 pid=235
        refcnt=1
81.5.133.243[any] 62.31.234.90[any] any
        in ipsec
        esp/tunnel/81.5.133.243-62.31.234.90/require
        spid=3 seq=6 pid=235
        refcnt=1
10.0.0.0/24[any] 10.0.1.0/24[any] any
        in ipsec
        esp/tunnel/81.5.133.243-62.31.234.90/require
        spid=5 seq=5 pid=235
        refcnt=1
10.0.0.0/24[any] 62.31.234.90[any] any
        in ipsec
        esp/tunnel/81.5.133.243-62.31.234.90/require
        spid=7 seq=4 pid=235
        refcnt=1
10.0.1.0/24[any] 81.5.133.243[any] any
        out ipsec
        esp/tunnel/62.31.234.90-81.5.133.243/require
        spid=2 seq=3 pid=235
        refcnt=1
62.31.234.90[any] 81.5.133.243[any] any
        out ipsec
        esp/tunnel/62.31.234.90-81.5.133.243/require
        spid=4 seq=2 pid=235
        refcnt=1
10.0.1.0/24[any] 10.0.0.0/24[any] any
        out ipsec
        esp/tunnel/62.31.234.90-81.5.133.243/require
        spid=6 seq=1 pid=235
        refcnt=1
62.31.234.90[any] 10.0.0.0/24[any] any
        out ipsec
        esp/tunnel/62.31.234.90-81.5.133.243/require
        spid=8 seq=0 pid=235
        refcnt=1


I am 62.31.234.90 with protected subnet 10.0.1.0/24.
Peer is 81.5.133.243 with protected subnet 10.0.0.0/24.


I also have net.inet.ipsec.dfbit set to 1 via /etc/sysctl.conf.


I logged into the peer's server and used smbclient to request a file from
10.0.1.20 (a win98se machine). Just before each test, make sure all your
SAD entries are 'mature' and relatively fresh (i.e. not about to die on
you during your test) using "setkey -D | egrep '(diff|state)'".


Use tcpdump to log data packets from, and icmp packets to, your
protected host (in my case this was 10.0.1.20). Increase IPSEC logging
using "sysctl net.key.debug=0x45". To turn these messages off, use
"sysctl net.key.debug=0".


Now try to transfer a file from your target host that is bigger than
your MTU (>1500 bytes; say, 16 Kbytes).


tcpdump will produce output like:


11:44:03.378193 10.0.1.20.139 > 81.5.133.243.43396: tcp 1460 (DF) (ttl 128, id 26226, len 1500)
11:44:03.387030 10.0.1.2 > 10.0.1.20: icmp: 81.5.133.243 unreachable - need to frag (mtu 1500) (DF) (ttl 64, id 48070, len 56)
11:44:04.778191 10.0.1.20.139 > 81.5.133.243.43396: tcp 1460 (DF) (ttl 128, id 26226, len 1500)
11:44:04.787022 10.0.1.2 > 10.0.1.20: icmp: 81.5.133.243 unreachable - need to frag (mtu 1500) (DF) (ttl 64, id 48070, len 56)
(pattern repeats)


Your console should show lines like:
Sep 10 11:44:03 the-mayor /kernel: ipsec_setspidx: total of m_len(28) != pkthdr.len(1500), ignored.


The requesting host on the remote LAN will time out.
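
The retry pattern in the tcpdump output above can be modelled with a
very simplified sender. This is only a sketch under stated assumptions:
struct pmtu_sender and both functions are hypothetical, the sender is
assumed to lower its path MTU estimate only when the ICMP message
advertises a smaller value, and real TCP/IP stacks have additional
fallbacks that are not modelled here. The 1443 figure used in the
example calls is the MTU reported by the patched kernel later in this
report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a sending host doing path MTU discovery. */
struct pmtu_sender {
    int pmtu;  /* current path MTU estimate, in bytes */
};

/* Handle an ICMP "need to frag" message carrying next_hop_mtu.
 * Returns 1 if the estimate shrank (the retry will be smaller),
 * 0 if nothing changed (the same oversized packet is resent). */
int
on_need_frag(struct pmtu_sender *s, int next_hop_mtu)
{
    assert(s != NULL);
    if (next_hop_mtu > 0 && next_hop_mtu < s->pmtu) {
        s->pmtu = next_hop_mtu;
        return 1;
    }
    return 0;
}

/* Count how many sends are needed before a packet fits through a
 * tunnel whose usable MTU is tunnel_mtu, given that the router puts
 * advertised_mtu in every ICMP reply. Gives up after max_tries and
 * returns -1, which models the stalled transfer. */
int
sends_until_delivered(int start_mtu, int tunnel_mtu, int advertised_mtu,
    int max_tries)
{
    struct pmtu_sender s = { start_mtu };
    int tries;

    for (tries = 1; tries <= max_tries; tries++) {
        if (s.pmtu <= tunnel_mtu)
            return tries;   /* packet fits and is delivered */
        (void)on_need_frag(&s, advertised_mtu);
    }
    return -1;
}
```

With an advertised MTU of 1500, sends_until_delivered(1500, 1443, 1500,
10) returns -1 because the estimate never shrinks; with an advertised
MTU of 1443 the second send fits and the call returns 2.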

>Fix:
Simply update the packet length in mcopy->m_pkthdr.len to reflect the
truncated nature of mcopy. This can be done at line 1799 in
netinet/ip_input.c rev 1.130.2.35 for just the EMSGSIZE IPSEC case or
at line 1703 if this is of more general use within ip_forward() and
functions called by it. I've tried the following diff at both line 1703
and line 1799 and both cure the problem as expected. On my machine,
I've left the code in at line 1799 because I don't know if other code
using mcopy makes use of the original packet length.



--- patch begins here ---
--- ip_input.c  Wed Sep 11 17:55:09 2002
+++ ip_input.c-patched  Wed Sep 11 18:23:47 2002
@@ -1796,6 +1796,13 @@
                        int ipsechdr;
                        struct route *ro;

+                       /* Pretend original packet was only this long
+                        * as IPSEC functions like ipsec_setspidx(),
+                        * called by ipsec4_getpolicybyaddr() below,
+                        * expect a sane mbuf chain.
+                        */
+                       mcopy->m_pkthdr.len = mcopy->m_len;
+
                        sp = ipsec4_getpolicybyaddr(mcopy,
                                                    IPSEC_DIR_OUTBOUND,
                                                    IP_FORWARDING,
--- patch ends here ---



tcpdump with patched kernel looks like:


17:17:43.108193 10.0.1.20.139 > 81.5.133.243.43396: tcp 1460 (DF) (ttl 128, id 26226, len 1500)
17:17:43.108779 10.0.1.20.139 > 81.5.133.243.43396: tcp 652 (DF) (ttl 128, id 26482, len 692)
17:17:43.114394 10.0.1.2 > 10.0.1.20: icmp: 81.5.133.243 unreachable - need to frag (mtu 1443) (DF) (ttl 64, id 39851, len 56)
17:17:43.115869 10.0.1.20.139 > 81.5.133.243.43396: tcp 1403 (DF) (ttl 128, id 26738, len 1443)
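
The numbers in the patched tcpdump output can be cross-checked with a
little arithmetic. The sketch below is userland C with hypothetical
helper names, not the kernel code. The 57-byte IPSEC overhead is not
stated anywhere in this report; it is inferred from the difference
between the 1500 and 1443 MTU values above and will vary with the
algorithms configured on the tunnel. The 40 bytes assume 20-byte IP and
TCP headers without options.

```c
#include <assert.h>

/* MTU to advertise in the ICMP need-frag message: the outgoing
 * interface MTU minus the bytes IPSEC adds. Returns 0 if the
 * overhead leaves no room at all. */
int
tunnel_mtu(int if_mtu, int ipsec_overhead)
{
    assert(if_mtu >= 0 && ipsec_overhead >= 0);
    return (ipsec_overhead < if_mtu) ? if_mtu - ipsec_overhead : 0;
}

/* Largest TCP payload for a given MTU, assuming 20-byte IP and
 * 20-byte TCP headers with no options. */
int
tcp_payload(int mtu)
{
    return (mtu > 40) ? mtu - 40 : 0;
}
```

tunnel_mtu(1500, 57) gives 1443, matching the ICMP message, and
tcp_payload(1443) gives 1403, matching the retransmitted segment;
tcp_payload(1500) gives the original 1460.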
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: freebsd-bugs->bms 
Responsible-Changed-By: bms 
Responsible-Changed-When: Tue 25 Nov 2003 08:33:21 PST 
Responsible-Changed-Why:  
I'm in hoover up network PRs mode. I'll look into this. 

Your patch looks fairly simple and could possibly even be implemented right 
now, I will talk to sam@ about this. 

http://www.freebsd.org/cgi/query-pr.cgi?pr=42727 

From: Bruce M Simpson <bms@spc.org>
To: Dominic Froud <dominic@indigo-ic.co.uk>
Cc: sam@FreeBSD.org, freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/42727: [PATCH] Wrong MTU in need-frag ICMP using IPSEC tunnels w/out GIF
Date: Wed, 16 Jun 2004 09:16:12 +0100

 The fix does look correct but I'd really like to be able to test it first.
 
 This seems related:
 http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/netinet/ip_input.c.diff?r1=1.150&r2=1.151
 http://mail-index.netbsd.org/tech-kern/2002/06/07/0024.html
 http://mail-index.netbsd.org/tech-kern/2002/06/07/0000.html
 
 Perhaps we should incorporate this too?
 
 Regards,
 BMS
State-Changed-From-To: open->patched 
State-Changed-By: bms 
State-Changed-When: Wed Jun 16 08:33:16 GMT 2004 
State-Changed-Why:  
A diff from KAME has been committed to HEAD; it seems they had already picked 
up on this, but the sync hadn't caught up with us yet. 
I have also committed the NetBSD fix which takes per-route MTU into account. 

State-Changed-From-To: patched->closed 
State-Changed-By: rodrigc 
State-Changed-When: Wed Oct 5 03:03:31 GMT 2005 
State-Changed-Why:  
This patch is in RELENG_6. 

>Unformatted:
