[HN Gopher] Operation Jumbo Drop: How sending large packets brok...
       ___________________________________________________________________
        
       Operation Jumbo Drop: How sending large packets broke our AWS
       network
        
       Author : kiyanwang
       Score  : 73 points
        Date   : 2022-03-27 11:45 UTC (1 day ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | [deleted]
        
       | jsnell wrote:
       | I think the case is a little bit more subtle than described.
       | 
       | Linux doesn't actually need ICMP-based path MTU discovery to
       | figure out the safe MSS to use. It implements RFC 4821-style MTU
       | probing, where after repeated retransmissions the sender will try
       | to retransmit smaller prefixes of the segment and see if that
       | makes it through. If it does, it can be deduced that the safe MSS
       | is lower than advertised, and the MSS for the connection is
       | (permanently) dropped to the newly discovered limit. So even if
       | the ICMP return path isn't working, the connection should just
        | have a hiccup for a few seconds but then resume working.
       | 
       | Why isn't MTU probing triggering here? I haven't checked the
       | code, but would guess that it's refusing to split a segment with
       | the PSH flag set.
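        | 
        | A minimal sketch of the knobs involved, assuming the documented
        | Linux sysctl paths (net.ipv4.tcp_mtu_probing: 0 = off, 1 = enable
        | after a black hole is detected, 2 = always on; tcp_base_mss is
        | the floor the prober starts from):
        | 
        | # Sketch: inspect/enable RFC 4821-style TCP MTU probing.
        | # Writing requires root.
        | def read_sysctl(name):
        |     with open("/proc/sys/" + name.replace(".", "/")) as f:
        |         return f.read().strip()
        | 
        | def write_sysctl(name, value):
        |     path = "/proc/sys/" + name.replace(".", "/")
        |     with open(path, "w") as f:
        |         f.write(str(value))
        | 
        | print(read_sysctl("net.ipv4.tcp_mtu_probing"))
        | print(read_sysctl("net.ipv4.tcp_base_mss"))
        | # write_sysctl("net.ipv4.tcp_mtu_probing", 1)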
        
         | touisteur wrote:
         | Fun fact, was observing periodic latency peaks on some customer
         | network. TCP analysis would show a resend after an icmp 'too
         | big' packet from a middlebox. Fine. Once at connection
         | establishment or rather at the first PSH over the default MTU,
         | right? But why periodically? Well Linux re-raises the path mtu
         | periodically, iirc in case of icmp spoofing...
         | 
         | Is it different here for mss-based discovery? You say
         | permanently?
        
       | IYasha wrote:
        | Pumping up MTUs is mostly viable on file transfer networks.
        | Tried it on 10G NAS - worked pretty well. But that's LAN, where
        | retransmits are pretty non-existent. On the internet, larger
        | damaged packets will cause larger retransfers. And one incapable
        | router/switch can wipe out the whole gain.
        
         | sneak wrote:
         | Fun fact: the default mainline kernel driver (or at least the
         | one used by Raspbian and Ubuntu) for the Raspberry Pi gigabit
         | interface does not support jumbo frames without patching and
         | recompiling the kernel. :/
         | 
         | This is a colossal pain if you run your home lan at 9000 for
         | the aforementioned storage/NAS reason.
        
       | tlb wrote:
       | The sad thing is, there's no reason for any modern system to
       | require a small MTU. MTUs of only a thousand bytes made sense on
       | sub-megabit networks (to limit jitter), and when RAM was
       | purchased by the kilobyte. But there's no reason why every
       | network shouldn't support 64 kB MTUs today.
       | 
       | A required max of 64 kB with a default segment size of 16 kB
       | would improve performance and allow any amount of headers to be
       | added.
        
         | drewg123 wrote:
         | I have exactly the opposite view. With offloads like TSO and
         | LRO, NICs can pretend the network supports very large
         | "packets", while still segmenting traffic into standard 1500b
         | chunks.
         | 
         | From a hardware perspective, supporting jumbo frames makes
         | everything harder. You need to have larger buffers. Especially
         | when dealing with ethernet flow control. And latencies
         | increase, because most NICs do store-and-forward, and need to
         | hold onto the packet until the end arrives so they can
         | calculate the CRC.
        
           | jeffbee wrote:
           | What about the receive side? Suppose you have TSO but the
           | receiver still gets a train of smaller frames, which consumes
           | extra CPU.
           | 
           | The solution to ethernet flow control is to disable it, of
           | course.
        
             | drewg123 wrote:
             | The solution on the receive side is LRO.
             | 
             | I was doing 10GbE line rate with 1500b frames in 2006 or so
             | on consumer-grade CPUs using software LRO (implemented in
             | the driver).
             | 
             | I worked for a NIC vendor (Myricom) that shipped our NIC
             | drivers with a default 9000b MTU. That was fine when we
             | just did HPC & everything reachable via layer 2 was another
             | one of our NICs with a 9k MTU. But it was a mess when we
             | started selling 10GbE NICs. I got pretty good at diagnosing
              | MTU mismatches...
        
               | jeffbee wrote:
               | I don't doubt that you can hit 10g on a wimpy core, the
               | question is how much is left over for your application to
               | use. I recently analyzed jumbo vs. puny frames in EC2
               | with the ENA and the difference in softirq time amounted
               | to one core out of the 32 cores on the instance type, at
                | 12.5 Gbps line rate. That's a pretty substantial cost when
               | you have a lot of nodes. The ENA doesn't have a ton of
               | offload features and EC2 tenants don't get to specify a
               | better NIC.
        
       | clone1018 wrote:
       | MTUs are fun. We (live video streaming platform) were recently
       | investigating issues with users having trouble watching our live
       | video streams when using a VPN. Since we're using WebRTC on the
       | viewer end we thought immediately it was just some WebRTC
       | "protection" trying to not leak the users IP.
       | 
       | Eventually we figured out that Open Broadcaster Software has a
       | compiled configuration of a max MTU of 1392 for the UDP packets
       | we're sending. Generally this is fine because most routers have a
       | default MTU of 1500, however when coupled with some of the VPN
       | technologies, it ends up pushing the MTU over the limit and the
       | video packets get dropped.
       | 
        | Overall, MTUs seem to be not particularly well understood,
        | because answers on the internet vary wildly on the appropriate
        | way of handling them. The consensus from some Google/WebRTC folks
        | seems to be that 1200 is a safe and fast default.
       | 
       | > _Anyway, 1200 bytes is 1280 bytes minus the RTP headers minus
       | some bytes for RTP header extensions minus a few "let's play it
       | safe" bytes. It'll usually work._
       | 
       | - https://groups.google.com/g/discuss-webrtc/c/gH5ysR3SoZI?pli...
       | 
       | - https://stackoverflow.com/questions/47635545/why-webrtc-chos...
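        | 
        | As a rough sketch of where the byte budget goes (the WireGuard
        | overhead figure below is the commonly cited 32 bytes per data
        | message and is an assumption of this sketch; other VPNs differ):
        | 
        | # Sketch: how one layer of UDP tunnelling eats a 1500-byte MTU.
        | LINK_MTU = 1500
        | IPV4_HDR = 20   # outer IPv4 header, no options
        | IPV6_HDR = 40   # outer IPv6 header
        | UDP_HDR  = 8
        | VPN_OVH  = 32   # assumed WireGuard data-message overhead
        | 
        | # Largest inner packet that still fits in one outer packet:
        | print(LINK_MTU - IPV4_HDR - UDP_HDR - VPN_OVH)   # 1440
        | print(LINK_MTU - IPV6_HDR - UDP_HDR - VPN_OVH)   # 1420
        | 
        | # Each additional layer of encapsulation subtracts its own
        | # headers again, which is why payloads sized right up to the
        | # 1500-byte link MTU break as soon as a VPN is in the path.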
        
         | ignoramous wrote:
         | Reminds me of:
         | https://github.com/tailscale/tailscale/blob/d9a7205be/net/ts...
          | // tunMTU is the MTU we set on tailscale's TUN interface.
          | // wireguard-go defaults to 1420 bytes, which only works if the
          | // "outer" MTU is 1500 bytes. This breaks on DSL connections
          | // (typically 1492 MTU) and on GCE (1460 MTU?!).
          | //
          | // 1280 is the smallest MTU allowed for IPv6, which is a
          | // sensible "probably works everywhere" setting until we
          | // develop proper PMTU discovery.
         | 
         | And: https://datatracker.ietf.org/doc/rfc9000/ (sec 14)
          | 
          | The maximum datagram size is defined as the largest size of
          | UDP payload that can be sent across a network path using a
          | single UDP datagram. QUIC MUST NOT be used if the network path
          | cannot support a maximum datagram size of at least 1200 bytes.
          | 
          | QUIC assumes a minimum IP packet size of at least 1280 bytes.
          | This is the IPv6 minimum size [IPv6] and is also supported by
          | most modern IPv4 networks. Assuming the minimum IP header size
          | of 40 bytes for IPv6 and 20 bytes for IPv4 and a UDP header
          | size of 8 bytes, this results in a maximum datagram size of
          | 1232 bytes for IPv6 and 1252 bytes for IPv4. Thus, modern IPv4
          | and all IPv6 network paths are expected to be able to support
          | QUIC.
          | 
          |   Note: This requirement to support a UDP payload of 1200
          |   bytes limits the space available for IPv6 extension headers
          |   to 32 bytes or IPv4 options to 52 bytes if the path only
          |   supports the IPv6 minimum MTU of 1280 bytes. This affects
          |   Initial packets and path validation.
          | 
          | Any maximum datagram size larger than 1200 bytes can be
          | discovered using Path Maximum Transmission Unit Discovery
          | (PMTUD) (see Section 14.2.1) or Datagram Packetization Layer
          | PMTU Discovery (DPLPMTUD) (see Section 14.3).
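          | 
          | A minimal sketch of the socket-level primitive that PMTUD /
          | DPLPMTUD-style probing builds on under Linux: ask the kernel
          | to set DF and never fragment, so a send larger than the
          | interface MTU, or than a path MTU the kernel has already
          | learned, fails locally with EMSGSIZE, and IP_MTU exposes the
          | cached estimate. The numeric fallbacks are the Linux values
          | and an assumption of this sketch; the address and probe size
          | are placeholders.
          | 
          | import errno, socket
          | 
          | IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
          | IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
          | IP_MTU = getattr(socket, "IP_MTU", 14)
          | 
          | s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          | s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER,
          |              IP_PMTUDISC_DO)
          | s.connect(("192.0.2.10", 4433))  # placeholder peer
          | 
          | try:
          |     s.send(b"\x00" * 1400)  # candidate probe size
          | except OSError as e:
          |     if e.errno != errno.EMSGSIZE:
          |         raise
          |     # The kernel already knows the path can't take it.
          |     print(s.getsockopt(socket.IPPROTO_IP, IP_MTU))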
        
           | yardstick wrote:
            | A 1200 MTU often does the trick, in my experience.
           | 
           | But, I don't like the presumption in IPv6 that everything
           | supports 1280. What if there is a VPN running over a 1280-MTU
            | link - what is the MTU of the encrypted IPv6 payload? Now add
           | in a couple more layers of VPNs for good measure.
        
             | phicoh wrote:
             | 1280 is a guarantee by the IPv6 standard. If you run a VPN
             | over a link with 1280 MTU then the VPN protocol has to have
             | a way of fragmenting the payload (which could be using
              | fragmentation in IPv6, but an application-specific scheme
              | is likely to work better).
        
               | yardstick wrote:
               | Yes I mostly agree, the tricky part is what to do about
               | packets marked as Don't Fragment. Hopefully they are
               | already well under the actual MTU, but if need be, should
               | they be fragmented or dropped? (The answer is to fragment
               | all except icmp imo, even if it's against the wishes of
               | the packet IP headers)
        
               | phicoh wrote:
               | Technically the packets don't get fragmented. It is the
               | tunnel that fragments.
               | 
               | For example, in the bad old days, when there was still a
               | lot of ATM, packets would routinely get transmitted in 48
               | byte pieces (with a 5 byte header). Nobody wondered
               | whether this kind of fragmentation should honor the DF
               | bit.
               | 
               | If I recall correctly, multi-link PPP could also split a
               | packet in smaller pieces to transmit it over multiple
               | links at the same time.
               | 
               | So for IPv6, anything that is 1280 octets or smaller
                | needs to be forwarded without triggering a packet too big
               | ICMP. Otherwise the link violates the IPv6 standard.
        
               | yardstick wrote:
               | Good point, thanks for the insight!
        
             | kazen44 wrote:
              | With IPv6, one could also do path MTU discovery, which
              | would automatically set the correct MTU based on an ICMP
              | test between nodes and all endpoints.
        
             | mprovost wrote:
              | But you don't need VPNs with IPv6: it has encryption built
              | in and you don't need NAT because every node is directly
             | accessible!
        
               | yardstick wrote:
               | Apologies, I can't tell if you are being serious or
               | satire.
               | 
               | I'm leaning towards satire but in case you are serious,
               | IPv6 built-in encryption is IPsec using opportunistic
                | encryption (which then relies on DNSSEC...), and it's not
               | built into all IPv6 endpoints. Nor easy to configure,
               | debug or support.
               | 
               | WireGuard, OpenVPN, etc are easier for users to configure
                | and set up than IPsec, with less chance of mismatched
               | configs, unsupported cipher suites, etc.
               | 
               | As for NAT, that's irrelevant to the reason for using a
               | VPN. IPv6 has Unique Local Addresses (ULA), which can be
                | routable across an organisation but not from the
               | internet, and so you may want a VPN for access to those.
               | You may also simply want to extend your IPv6 network to a
                | remote location that doesn't have native IPv6, or whose
               | ISP doesn't provide sufficient delegated subnet ranges
               | for your requirements. The VPN could also be to provide
               | access to an IPv4 network behind the IPv6 router. The
               | list goes on...
        
       | ethbr0 wrote:
       | For those who don't get the '95 reference:
       | 
       | https://m.youtube.com/watch?v=19JXZcmMJx0
       | 
       | (From a time when Disney made Vietnam War comedies with ensemble
       | casts, instead of superhero movies)
        
       | wmf wrote:
       | _the AMIs AWS provides launch EC2 instances with the MTU set to
       | 9001 ... A transit gateway supports an MTU of 8500 bytes ... The
       | transit gateway does not generate the FRAG_NEEDED for ICMPv4
       | packet, or the Packet Too Big (PTB) for ICMPv6 packet. Therefore,
       | the Path MTU Discovery (PMTUD) is not supported._
       | 
       | Quite the footgun you've got there, AWS.
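        | 
        | For anyone poking at this from inside an instance: a connected
        | TCP socket on Linux will tell you the effective MSS and the
        | kernel's cached path MTU. A small sketch (the IP_MTU fallback
        | value is the Linux one and an assumption here; host and port
        | are placeholders):
        | 
        | import socket
        | 
        | IP_MTU = getattr(socket, "IP_MTU", 14)
        | 
        | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        | s.connect(("example.com", 443))
        | 
        | mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
        | mtu = s.getsockopt(socket.IPPROTO_IP, IP_MTU)
        | print("effective MSS:", mss, "cached path MTU:", mtu)
        | 
        | With a 9001-byte interface MTU and a hop that silently needs
        | 8500, the cached path MTU has no ICMP to learn from, which is
        | the footgun the quote describes.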
        
         | tempnow987 wrote:
          | It does set MSS, so I think part of the reason they had this
          | issue is related to asymmetric paths?
        
           | lima wrote:
           | MSS clamping is a nasty hack that works only for TCP.
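            | 
            | (Endpoint-side sketch of the same idea: capping TCP_MAXSEG
            | before connect() shrinks the MSS the socket advertises, per
            | tcp(7); the 1360 below is an arbitrary example value. There
            | is no UDP equivalent, hence "works only for TCP".)
            | 
            | import socket
            | 
            | s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            | s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1360)
            | s.connect(("example.com", 443))  # placeholder peer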
        
             | tempnow987 wrote:
              | Fair point. I'm not experienced - does path discovery work
             | with PSH set?
        
               | sleepydog wrote:
               | Path discovery is agnostic of the transport protocol.
               | Once an IP datagram with the "Don't Fragment" bit set
               | reaches a device that cannot forward it because it is too
               | large, the device must send an ICMP error back to the
               | originating IP address with the correct MTU. The sender
                | must then adjust the MTU for the same destination
               | for future datagrams. The sender may have to do this
               | discovery multiple times, if each smaller packet gets
               | progressively further, but doesn't reach the destination.
               | 
               | This whole process is thwarted by overeager network
               | administrators blocking ICMP traffic, or, apparently, by
               | devices like this "transit gateway" that don't generate
               | the ICMP packet at all.
               | 
               | Edit: Sorry, I think I may have fallen into the trap of
               | explaining something that you already knew and not
               | answering your question. I realized this after jsnell's
               | comment elsewhere that the kernel may not want to split
               | packets with the PSH bit set. I wasn't able to find an
               | answer to your question in linux/net/ipv4/tcp_output.c,
               | hopefully someone more familiar with Linux's TCP can
               | chime in.
        
               | [deleted]
        
               | tempnow987 wrote:
               | Right, path discovery I get, but this was the first I'd
               | heard of PSH bit possibly interacting negatively. That
               | said, I'm not sure why it would. Buffering is normally
               | only if you are > MSS? And why would pushing up to
               | application change path discovery?
        
       | paleotrope wrote:
       | I think someone posted about this same exact issue a few weeks
       | ago.
       | 
        | I wouldn't mix VPC peering and transit gateway if you can avoid
        | it.
        
         | nijave wrote:
         | I guess if you're migrating from VPC peering to Transit
         | Gateways you have to coordinate updating the routing tables in
         | both VPCs at the same time (I'm guessing they probably updated
         | routing tables in one VPC but hadn't updated them in the other
         | yet) unless you adjust MTUs ahead of time.
         | 
         | Seems to be a crappy migration path (execute these updates
         | quickly and hopefully you don't drop too many packets before
         | they apply)
        
           | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-03-28 23:01 UTC)