[HN Gopher] Operation Jumbo Drop: How sending large packets brok...
___________________________________________________________________
Operation Jumbo Drop: How sending large packets broke our AWS
network
Author : kiyanwang
Score : 73 points
Date : 2022-03-27 11:45 UTC (1 day ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| [deleted]
| jsnell wrote:
| I think the case is a little bit more subtle than described.
|
| Linux doesn't actually need ICMP-based path MTU discovery to
| figure out the safe MSS to use. It implements RFC 4821-style MTU
| probing, where after repeated retransmissions the sender will try
| to retransmit smaller prefixes of the segment and see if that
| makes it through. If it does, it can be deduced that the safe MSS
| is lower than advertised, and the MSS for the connection is
| (permanently) dropped to the newly discovered limit. So even if
| the ICMP return path isn't working, the connection should just
| have a hiccup for a few seconds but then resume working.
|
| Why isn't MTU probing triggering here? I haven't checked the
| code, but would guess that it's refusing to split a segment with
| the PSH flag set.
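| (A minimal sketch, assuming a Linux host, of checking whether this
| probing is enabled -- the knobs are the net.ipv4.tcp_mtu_probing
| and net.ipv4.tcp_base_mss sysctls:)
|     from pathlib import Path
|
|     def sysctl(name):
|         # read an integer sysctl via /proc/sys
|         path = Path("/proc/sys") / name.replace(".", "/")
|         return int(path.read_text())
|
|     # 0 = off, 1 = probe only after a black hole is detected,
|     # 2 = always probe; tcp_base_mss is the MSS probing starts from.
|     print("tcp_mtu_probing =", sysctl("net.ipv4.tcp_mtu_probing"))
|     print("tcp_base_mss    =", sysctl("net.ipv4.tcp_base_mss"))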
| touisteur wrote:
| Fun fact: I was observing periodic latency peaks on a customer
| network. TCP analysis would show a resend after an ICMP 'too
| big' packet from a middlebox. Fine - but that should happen
| once, at connection establishment, or rather at the first PSH
| over the default MTU, right? So why periodically? Well, Linux
| re-raises the path MTU periodically, IIRC as a defence against
| ICMP spoofing...
|
| Is it different here for MSS-based discovery? You say
| permanently?
| IYasha wrote:
| Pumping up MTUs is mostly viable on file-transfer networks.
| Tried it on a 10G NAS - worked pretty well. But that's a LAN,
| where retransmits are practically non-existent. On the internet,
| larger damaged packets mean larger retransmissions. And one
| incapable router/switch along the path can wipe out the whole
| gain.
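| (A rough sketch of that trade-off: with a random bit-error rate,
| the chance a frame is damaged grows with its size, so the amount
| of data you re-send grows too. The error rate here is just an
| assumed illustrative value:)
|     ber = 1e-9               # assumed bit-error rate on the path
|     for mtu in (1500, 9000, 65536):
|         # chance this frame is damaged, given independent bit errors
|         p_damaged = 1 - (1 - ber) ** (mtu * 8)
|         wasted = p_damaged * mtu   # expected resent bytes per frame
|         print(mtu, "byte frames:", f"{p_damaged:.6%}", "damaged,",
|               round(wasted, 3), "bytes resent per frame on average")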
| sneak wrote:
| Fun fact: the default mainline kernel driver (or at least the
| one used by Raspbian and Ubuntu) for the Raspberry Pi gigabit
| interface does not support jumbo frames without patching and
| recompiling the kernel. :/
|
| This is a colossal pain if you run your home LAN at 9000 for
| the aforementioned storage/NAS reason.
| tlb wrote:
| The sad thing is, there's no reason for any modern system to
| require a small MTU. MTUs of only a thousand bytes made sense on
| sub-megabit networks (to limit jitter), and when RAM was
| purchased by the kilobyte. But there's no reason why every
| network shouldn't support 64 kB MTUs today.
|
| A required max of 64 kB with a default segment size of 16 kB
| would improve performance and allow any number of headers to be
| added.
| drewg123 wrote:
| I have exactly the opposite view. With offloads like TSO and
| LRO, NICs can pretend the network supports very large
| "packets", while still segmenting traffic into standard 1500b
| chunks.
|
| From a hardware perspective, supporting jumbo frames makes
| everything harder. You need to have larger buffers. Especially
| when dealing with ethernet flow control. And latencies
| increase, because most NICs do store-and-forward, and need to
| hold onto the packet until the end arrives so they can
| calculate the CRC.
| jeffbee wrote:
| What about the receive side? Suppose you have TSO but the
| receiver still gets a train of smaller frames, which consumes
| extra CPU.
|
| The solution to ethernet flow control is to disable it, of
| course.
| drewg123 wrote:
| The solution on the receive side is LRO.
|
| I was doing 10GbE line rate with 1500b frames in 2006 or so
| on consumer-grade CPUs using software LRO (implemented in
| the driver).
|
| I worked for a NIC vendor (Myricom) that shipped our NIC
| drivers with a default 9000b MTU. That was fine when we
| just did HPC & everything reachable via layer 2 was another
| one of our NICs with a 9k MTU. But it was a mess when we
| started selling 10GbE NICs. I got pretty good at diagnosing
| MTU mismatches.
| jeffbee wrote:
| I don't doubt that you can hit 10g on a wimpy core, the
| question is how much is left over for your application to
| use. I recently analyzed jumbo vs. puny frames in EC2
| with the ENA and the difference in softirq time amounted
| to one core out of the 32 cores on the instance type, at
| 12.5gbps line rate. That's pretty substantial cost when
| you have a lot of nodes. The ENA doesn't have a ton of
| offload features and EC2 tenants don't get to specify a
| better NIC.
| clone1018 wrote:
| MTUs are fun. We (a live video streaming platform) were recently
| investigating issues with users having trouble watching our live
| video streams when using a VPN. Since we're using WebRTC on the
| viewer end, we immediately thought it was just some WebRTC
| "protection" trying not to leak the user's IP.
|
| Eventually we figured out that Open Broadcaster Software has a
| compiled-in maximum MTU of 1392 for the UDP packets we're
| sending. Generally this is fine because most paths have an MTU
| of 1500; however, when coupled with some VPN technologies, the
| extra encapsulation pushes the packet size over the path MTU and
| the video packets get dropped (rough arithmetic in the sketch at
| the end of this comment).
|
| Overall, MTUs don't seem to be well understood: answers on the
| internet vary wildly on the appropriate way to handle them. The
| consensus from some Google/WebRTC folks seems to be that 1200 is
| a safe and fast default.
|
| > _Anyway, 1200 bytes is 1280 bytes minus the RTP headers minus
| some bytes for RTP header extensions minus a few "let's play it
| safe" bytes. It'll usually work._
|
| - https://groups.google.com/g/discuss-webrtc/c/gH5ysR3SoZI?pli...
|
| - https://stackoverflow.com/questions/47635545/why-webrtc-chos...
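| (A back-of-the-envelope sketch of the failure mode above. The
| 1392 and 1200 figures are from this comment; the tunnel
| overheads are assumptions -- 80 bytes is what WireGuard adds
| when carried over IPv6, and the 73-byte second layer is purely
| illustrative:)
|     path_mtu = 1500                  # typical Ethernet path
|     tunnel_overheads = [80, 73]      # assumed stacked VPN layers
|     inner_mtu = path_mtu - sum(tunnel_overheads)    # 1347 left
|     obs_packet = 1392                # OBS's compiled-in cap
|     webrtc_packet = 1200 + 40 + 8    # payload + IPv6 + UDP headers
|     for size in (obs_packet, webrtc_packet):
|         verdict = "fits" if size <= inner_mtu else "silently dropped"
|         print(size, "bytes vs", inner_mtu, "byte tunnel MTU:", verdict)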
| ignoramous wrote:
| Reminds me of:
| https://github.com/tailscale/tailscale/blob/d9a7205be/net/ts...
| // tunMTU is the MTU we set on tailscale's TUN interface.
| // wireguard-go defaults to 1420 bytes, which only works if the
| // "outer" MTU is 1500 bytes. This breaks on DSL connections
| // (typically 1492 MTU) and on GCE (1460 MTU?!).
| //
| // 1280 is the smallest MTU allowed for IPv6, which is a sensible
| // "probably works everywhere" setting until we develop proper
| // PMTU discovery.
| And: https://datatracker.ietf.org/doc/rfc9000/ (sec 14)
| The maximum datagram size is defined as the largest size of UDP
| payload that can be sent across a network path using a single
| UDP datagram. QUIC MUST NOT be used if the network path cannot
| support a maximum datagram size of at least 1200 bytes.
|
| QUIC assumes a minimum IP packet size of at least 1280 bytes.
| This is the IPv6 minimum size [IPv6] and is also supported by
| most modern IPv4 networks. Assuming the minimum IP header size
| of 40 bytes for IPv6 and 20 bytes for IPv4 and a UDP header size
| of 8 bytes, this results in a maximum datagram size of 1232
| bytes for IPv6 and 1252 bytes for IPv4. Thus, modern IPv4 and
| all IPv6 network paths are expected to be able to support QUIC.
|
| Note: This requirement to support a UDP payload of 1200 bytes
| limits the space available for IPv6 extension headers to 32
| bytes or IPv4 options to 52 bytes if the path only supports the
| IPv6 minimum MTU of 1280 bytes. This affects Initial packets and
| path validation.
|
| Any maximum datagram size larger than 1200 bytes can be
| discovered using Path Maximum Transmission Unit Discovery
| (PMTUD) (see Section 14.2.1) or Datagram Packetization Layer
| PMTU Discovery (DPLPMTUD) (see Section 14.3).
| yardstick wrote:
| An MTU of 1200 often does the trick, in my experience.
|
| But I don't like the presumption in IPv6 that everything
| supports 1280. What if there's a VPN running over a 1280-MTU
| link - what MTU is left for the encrypted IPv6 payload? Now add
| a couple more layers of VPNs for good measure.
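| (Roughly the arithmetic behind that worry -- the 80-byte
| overhead assumes a WireGuard-style tunnel carried over IPv6;
| other VPNs will differ:)
|     link_mtu = 1280          # the IPv6 guaranteed minimum
|     tunnel_overhead = 80     # assumed: outer IPv6 + UDP + encap
|     inner_room = link_mtu - tunnel_overhead   # 1200 bytes
|     # 1200 < 1280, so the tunnel can't carry a minimum-sized IPv6
|     # packet whole -- it has to fragment and reassemble internally,
|     # and each extra nested layer shrinks the budget further.
|     print("room for the inner packet:", inner_room, "bytes")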
| phicoh wrote:
| 1280 is guaranteed by the IPv6 standard. If you run a VPN
| over a link with 1280 MTU then the VPN protocol has to have
| a way of fragmenting the payload (which could be using
| fragmentation in IPv6, but application specific is likely
| to work better).
| yardstick wrote:
| Yes I mostly agree, the tricky part is what to do about
| packets marked as Don't Fragment. Hopefully they are
| already well under the actual MTU, but if need be, should
| they be fragmented or dropped? (The answer is to fragment
| everything except ICMP, IMO, even if it's against the wishes of
| the packet's IP headers.)
| phicoh wrote:
| Technically the packets don't get fragmented. It is the
| tunnel that fragments.
|
| For example, in the bad old days, when there was still a
| lot of ATM, packets would routinely get transmitted in 48
| byte pieces (with a 5 byte header). Nobody wondered
| whether this kind of fragmentation should honor the DF
| bit.
|
| If I recall correctly, multi-link PPP could also split a
| packet in smaller pieces to transmit it over multiple
| links at the same time.
|
| So for IPv6, anything that is 1280 octets or smaller needs to be
| forwarded without triggering a Packet Too Big ICMP. Otherwise
| the link violates the IPv6 standard.
| yardstick wrote:
| Good point, thanks for the insight!
| kazen44 wrote:
| With IPv6, one could also do path MTU discovery, which would
| automatically set the correct MTU based on an ICMP exchange
| between nodes and all endpoints.
| mprovost wrote:
| But you don't need VPNs with IPv6, it has encryption built in
| and you don't need NAT because every node is directly
| accessible!
| yardstick wrote:
| Apologies, I can't tell if you are being serious or
| satire.
|
| I'm leaning towards satire but in case you are serious,
| IPv6's built-in encryption is IPsec with opportunistic
| encryption (which then relies on DNSSEC...), and it's not built
| into all IPv6 endpoints. Nor is it easy to configure, debug or
| support.
|
| WireGuard, OpenVPN, etc. are easier for users to configure and
| set up than IPsec, with less chance of mismatched configs,
| unsupported cipher suites, etc.
|
| As for NAT, that's irrelevant to the reason for using a
| VPN. IPv6 has Unique Local Addresses (ULA), which can be
| routable across an organisation but not from the
| internet, and so you may want a VPN for access to those.
| You may also simply want to extend your IPv6 network to a
| remote location that doesn't have native IPv6, or whose
| ISP doesn't provide sufficient delegated subnet ranges
| for your requirements. The VPN could also be to provide
| access to an IPv4 network behind the IPv6 router. The
| list goes on...
| ethbr0 wrote:
| For those who don't get the '95 reference:
|
| https://m.youtube.com/watch?v=19JXZcmMJx0
|
| (From a time when Disney made Vietnam War comedies with ensemble
| casts, instead of superhero movies)
| wmf wrote:
| _the AMIs AWS provides launch EC2 instances with the MTU set to
| 9001 ... A transit gateway supports an MTU of 8500 bytes ... The
| transit gateway does not generate the FRAG_NEEDED for ICMPv4
| packet, or the Packet Too Big (PTB) for ICMPv6 packet. Therefore,
| the Path MTU Discovery (PMTUD) is not supported._
|
| Quite the footgun you've got there, AWS.
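| (The arithmetic of the footgun, roughly -- assuming plain
| 20-byte IPv4 and TCP headers with no options:)
|     instance_mtu = 9001    # what the stock AMIs configure
|     tgw_mtu = 8500         # what the transit gateway will forward
|     headers = 20 + 20      # IPv4 + TCP
|     advertised_mss = instance_mtu - headers   # 8961
|     safe_mss = tgw_mtu - headers              # 8460
|     # Any segment between safe_mss and advertised_mss becomes a
|     # packet larger than 8500 bytes, which the gateway drops
|     # without sending ICMP "fragmentation needed", so PMTUD
|     # never fires.
|     print("advertised MSS:", advertised_mss, "safe MSS:", safe_mss)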
| tempnow987 wrote:
| It does set MSS, so part of the reason they had this issue is,
| I think, related to asymmetric paths?
| lima wrote:
| MSS clamping is a nasty hack that works only for TCP.
| tempnow987 wrote:
| Fair point. I'm not experienced - does path discovery work
| with PSH set?
| sleepydog wrote:
| Path discovery is agnostic of the transport protocol.
| Once an IP datagram with the "Don't Fragment" bit set
| reaches a device that cannot forward it because it is too
| large, the device must send an ICMP error back to the
| originating IP address with the correct MTU. The sender must
| then adjust the MTU for that destination for future datagrams.
| The sender may have to repeat this discovery several times if
| each smaller packet gets progressively further but still
| doesn't reach the destination.
|
| This whole process is thwarted by overeager network
| administrators blocking ICMP traffic, or, apparently, by
| devices like this "transit gateway" that don't generate
| the ICMP packet at all.
|
| Edit: Sorry, I think I may have fallen into the trap of
| explaining something that you already knew and not
| answering your question. I realized this after jsnell's
| comment elsewhere that the kernel may not want to split
| packets with the PSH bit set. I wasn't able to find an
| answer to your question in linux/net/ipv4/tcp_output.c,
| hopefully someone more familiar with Linux's TCP can
| chime in.
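| (A minimal sketch of exercising that DF-based discovery from
| userspace on Linux -- the peer address is just a placeholder,
| and the numeric option values are the Linux ones from
| <linux/in.h>:)
|     import socket
|
|     IP_MTU_DISCOVER, IP_PMTUDISC_DO, IP_MTU = 10, 2, 14
|
|     s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
|     # Force the DF bit on everything this socket sends.
|     s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
|     s.connect(("192.0.2.1", 9))     # placeholder peer (TEST-NET-1)
|     try:
|         s.send(b"x" * 2000)         # oversized: EMSGSIZE or ICMP
|     except OSError:
|         pass
|     # The kernel's current idea of the path MTU for this peer:
|     print(s.getsockopt(socket.IPPROTO_IP, IP_MTU))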
| [deleted]
| tempnow987 wrote:
| Right, path discovery I get, but this was the first I'd
| heard of PSH bit possibly interacting negatively. That
| said, I'm not sure why it would. Buffering normally only
| happens if you're > MSS? And why would pushing data up to the
| application change path discovery?
| paleotrope wrote:
| I think someone posted about this same exact issue a few weeks
| ago.
|
| I wouldn't mix VPC peering and transit gateway if you can avoid
| it.
| nijave wrote:
| I guess if you're migrating from VPC peering to Transit
| Gateways you have to coordinate updating the routing tables in
| both VPCs at the same time (I'm guessing they probably updated
| routing tables in one VPC but hadn't updated them in the other
| yet) unless you adjust MTUs ahead of time.
|
| Seems to be a crappy migration path (execute these updates
| quickly and hopefully you don't drop too many packets before
| they apply)
| [deleted]
___________________________________________________________________
(page generated 2022-03-28 23:01 UTC)