[HN Gopher] IPv6 Fragmentation Loss
___________________________________________________________________
IPv6 Fragmentation Loss
Author : oedmarap
Score : 52 points
Date : 2021-04-23 10:22 UTC (12 hours ago)
(HTM) web link (www.potaroo.net)
(TXT) w3m dump (www.potaroo.net)
| toast0 wrote:
| Based on how fragmentation is handled (generally poorly, often
| because there's little choice), I would have preferred truncation
| with an in-band signal. For TCP, truncation is a clear win; you
| get some of the packet, and can signal back to the other end that
| things are missing, and hopefully the other end adapts to stop
| sending packets that get truncated. (Of course, when a middle box
| uses large receive offload to combine the packets and then
| complains that they're too big to forward, it's hard to fix as an
| endpoint).
|
| For UDP, it's not so simple; IP fragmentation does allow for
| large data, all or nothing processing, without needing
| application level handling, but the cost of fragmentation is
| high.
|
| The out of band signalling when sending packets that are too
| large is too easy to break, and too many systems are still not
| set up to probe for path MTU blackholes (the biggest one for me
| is Android), and the workarounds are meh, too.
|
| Another option would be for IP fragments to have the protocol
| level header, so fragments could be grouped by the full 5-tuple
| (protocol, source ip, dest ip, source port, dest port) and kept
| if useful or dropped if not, without having to wait for all the
| fragments to appear.
| zamadatix wrote:
| Truncation creates even more work, actually. IP routers don't
| need to understand deeper protocols or do anything with them to
| fragment: they simply split the IP packet into fragments and
| recalculate the IP checksum (and in v6 they don't even need to
| do that). To do this with truncation you have to know how to
| parse the inner protocol headers and modify things like length
| or checksums in those as well.
|
| You get the same issue putting the information on the
| fragments. Now there is no "IP layer", there is just "well,
| we're using IP+UDP today, and how that works right now should
| forever be baked into this hardware that will be here for 20
| years", which is exactly the problem that led Google/IETF to
| push headers deeper with HTTP/3 to get out of that mess.
|
| You also can get an in-band signal that you're being fragmented
| in the middle without changing IP. E.g. TCP already negotiates
| an MSS, if IP fragments at the start of a group come in smaller
| than that you know there is something fragmenting in the
| middle.
| toast0 wrote:
| Minimum work for routers would be mark the IP packet as
| truncated, and adjust the IP checksum; upper level protocol
| can suck it. This is, of course, more work than dropping it
| on the floor and pretending you'll send an ICMP, but much
| less work than sending two (or occasionally more) new
| packets.
|
| In-the-middle fragmentation is not really something that
| happens very often. IPv6 prohibited it, but in IPv4, nearly
| all packets are marked do not fragment, because IP routers
| weren't fragmenting much anyway; I think it's more likely to
| get an ICMP needs fragmentation packet on a too big packet
| with Don't Fragment, than to actually get fragmented
| delivery.
|
| Also, MSS is mostly not a negotiation; most stacks send what
| they think they can receive in the syn and the syn+ack. The
| only popular stack that sent min(received MSS, MSS from
| routing) was FreeBSD, but they changed to the common way in
| 12 IIRC; which in my opinion is a mistake, but I don't have
| enough data to show it... actually what seems best is to send
| back min(received MSS - X, MSS from routing), where X is 8
| or 20, depending on whether more of your users are on
| misconfigured PPPoE or behind misconfigured IPIP tunnels.
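toast0's reply-MSS heuristic above can be sketched in a few lines (function and argument names are mine; the fudge values 8 and 20 are the X values suggested in the comment):

```python
def mss_to_advertise(received_mss: int, route_mss: int, fudge: int = 8) -> int:
    """Reply-MSS heuristic: min(peer's advertised MSS minus a fudge
    factor, the MSS derived from our own route MTU).

    fudge=8 covers a peer behind misconfigured PPPoE (8 bytes of
    PPPoE/PPP overhead); fudge=20 covers a peer behind a
    misconfigured IPIP tunnel (20-byte outer IPv4 header)."""
    return min(received_mss - fudge, route_mss)

# A peer advertising 1460 over our clean 1500-byte route: answer
# 1452, leaving room for a hidden 8-byte encapsulation.
print(mss_to_advertise(1460, 1460, fudge=8))  # 1452
```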
| zamadatix wrote:
| At that point it pretty much amounts to "send a message in
| a raw IP header to the destination rather than a message in
| an ICMP header to the source". The extra step of truncation
| of the original payload doesn't gain you anything except a
| lot of pain for the upper layers/endpoint stacks all so the
| session doesn't have to resend the first part of the first
| packets of a conversation.
|
| The vast majority of IPv4 traffic does not have the DF bit
| set. Your logic for why it would doesn't even make sense:
| setting DF only means the packet will drop on routers that
| would otherwise have fragmented it; it doesn't improve the
| situation with the ones that wouldn't have.
|
| MSS is definitely a negotiation, but a negotiation just
| between the TCP-aware nodes, not along the whole IP path,
| which is why I say endpoint stacks can use it to detect
| whether the IP path is fragmenting by comparing incoming IP
| fragment sizes to the MSS.
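That detection idea can be sketched as follows (a simplification assuming IPv4 with option-less 20-byte IP and TCP headers; the function name is mine):

```python
def path_is_fragmenting(negotiated_mss: int, first_fragment_total_len: int,
                        ip_header_len: int = 20, tcp_header_len: int = 20) -> bool:
    """If the first fragment of a group arrives smaller than what a
    full-MSS segment would produce, something on the path is
    fragmenting below the negotiated size."""
    expected = negotiated_mss + ip_header_len + tcp_header_len
    return first_fragment_total_len < expected

# MSS 1460 negotiated, but the leading fragment is only 1396 bytes:
print(path_is_fragmenting(1460, 1396))  # True
```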
| jandrese wrote:
| The article's conclusion is basically that IPv6 extension
| headers (fragmentation is one of them) are useless on the
| Internet, which seems pretty reasonable to me. They're
| basically a research tool.
|
| The lack of on-path fragmentation in IPv6 is definitely on
| purpose. It was a mistake in IPv4 and would be silly to replicate
| in IPv6. The fragmentation header in IPv6 is effectively useless.
| It can only be done at the endpoints, and if that's the case the
| application should be doing it, not the stack. Instead IPv6
| mandates path MTU discovery, which is the correct solution.
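Path MTU discovery is conceptually a probe loop over DF-marked packets: shrink when a "Packet Too Big" comes back, grow when the probe survives. A toy sketch against a simulated path (the `probe` callback stands in for sending a real DF probe; all names are mine, purely illustrative):

```python
def discover_path_mtu(probe, lo: int = 1280, hi: int = 9000) -> int:
    """Binary-search the largest packet size the path accepts.
    `probe(size)` returns True if a DF-marked packet of `size`
    bytes gets through, False if it would be dropped (i.e. a
    Packet Too Big / Fragmentation Needed came back).
    lo=1280 is the IPv6 minimum MTU, so it is assumed to fit."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid          # mid fit; search higher
        else:
            hi = mid - 1      # mid was too big; search lower
    return lo

# Simulate a path whose narrowest link is 1492 bytes (PPPoE):
print(discover_path_mtu(lambda size: size <= 1492))  # 1492
```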
| anticristi wrote:
| The biggest single cause of network engineer hair loss must be
| MTU and fragmentation. We had some customers over a tunnel (IIRC,
| it was MPLS) that reduced the MTU from 1500 (default Ethernet) to
| 1496. This happened while our vendor changed some equipment and
| forgot to update either MTU to accommodate the extra MPLS header.
| Of course, there was a misconfigured firewall that wouldn't
| fragment and wouldn't send "ICMP won't fragment".
|
| The result? Most DNS queries went through. All chat applications
| worked. SSH generally worked, except when you started a full-
| screen terminal application. Smaller web pages loaded. Larger
| web pages loaded only partially.
|
| Imagine non-technical users explaining their issue. "The Internet
| is half-broken. Please help."
|
| God only knows how much hair I would have today if the world
| had figured out MTU and fragmentation properly.
| korethr wrote:
| > Of course, there was a misconfigured firewall that wouldn't
| fragment and wouldn't send "ICMP won't fragment".
|
| Blocked or dropped ICMP has caused me heartburn as well. I am
| pretty sure my co-workers are used to my "blocking ICMP is Evil
| Bad and Wrong" rants by now.
| temp667 wrote:
| Some people don't want to deal with ping floods, sweeps, ICMP
| tunneling issues, and the whole class of ICMP redirect
| attacks.
| kureikain wrote:
| AWS blocks ICMP by default though, and that causes a lot of
| traceroute problems for me...
| cryptonector wrote:
| MTU mismatches are hard to solve too. This happens when you
| have nodes and/or switches that are either misconfigured or
| don't all support autodiscovery and then have divergent
| defaults.
|
| A typical symptom is that things (e.g., ssh) hang. This
| typically happens when you use a protocol that uses Kerberos in
| an Active Directory domain with very large Kerberos tickets. PQ
| crypto would do this too.
| kazen44 wrote:
| what is even more annoying is that there is a perfectly fine
| solution for this problem called path MTU discovery (PMTUD).
| sadly many people block the entirety of ICMP to prevent this
| from working properly. Also, some NAT devices just straight
| up lie about the MTU during PMTUD.
| raspyberr wrote:
| Why would you reduce the MTU? I had an issue many years ago
| playing league of legends where log in would fail until I
| changed the MTU to be slightly lower. Something like 1486.
| jandrese wrote:
| Tunnels always have the danger of reducing the MTU because
| they have to account for the header.
|
| Worse is that if a tunnel doesn't reduce the MTU and its
| packets are dropped by a router further down the line, the
| ICMP TOO_FAT response goes back to the tunnel endpoint, not
| the original host. The tunnel endpoint has zero clue what to
| do with it and drops it on the ground, leaving the original
| host in the dark.
|
| Even worse is when overzealous firewall admins start locking
| down everything they don't understand and that includes ICMP.
| "Some hackers could ping our networks!" Then you're truly up
| a creek.
|
| Luckily MTU breakage is fairly easy to spot once you know
| what to look for. It's easy enough to fix too if your local
| firewall admins aren't the block everything type. Just fix
| the MTU on the tunnel endpoint and suddenly everything will
| start working.
|
| One final note: If as a router or application developer you
| ever find yourself having to fragment packets, please, for
| the love of all that is holy, fragment them into two roughly
| equal sized packets. Don't create one huge packet and one
| runt from the leftovers. It hurts me to see packets coming in
| from some heavily tunneled source with 1400 byte initial
| packet followed by 3 or 4 tiny fragments. Or worse, 3 or 4
| tiny out of order fragments followed by the 1400 byte bulk
| because the MAC prioritized small packets in its queue and
| sent them first.
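The even-split plea above can be sketched like this (illustrative only; assumes max_frag_payload is a multiple of 8, as real fragment limits are, since IPv4 fragment offsets count 8-byte units):

```python
import math

def split_evenly(payload_len: int, max_frag_payload: int) -> list[int]:
    """Split payload_len bytes into roughly equal fragments instead
    of max-size fragments plus a runt. All fragments except the
    last must carry a multiple of 8 bytes, because IPv4 fragment
    offsets are expressed in 8-byte units."""
    nfrags = math.ceil(payload_len / max_frag_payload)
    even = math.ceil(payload_len / nfrags / 8) * 8  # round up to 8-byte unit
    sizes = []
    remaining = payload_len
    while remaining > 0:
        take = min(even, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

# 2800 bytes over a 1480-byte limit: two 1400-byte fragments
# instead of a 1480-byte packet plus a 1320-byte leftover.
print(split_evenly(2800, 1480))  # [1400, 1400]
```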
| kazen44 wrote:
| > One final note: If as a router or application developer
| you ever find yourself having to fragment packets,
|
| Ideally, you should not want to fragment IP packets. It is
| far better to do MSS clamping in TCP to prevent the TCP
| payload from growing beyond the smallest MTU on the path.
| TCP can stitch the data back together at the other node just
| fine, without routers in between having to fragment the IP
| packets, which will kill your bandwidth in comparison.
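MSS clamping boils down to header arithmetic (a sketch; real clamping also has to account for IP/TCP options when present):

```python
def clamped_mss(path_mtu: int, ipv6: bool = False) -> int:
    """Largest TCP MSS that keeps a full segment within path_mtu:
    subtract the fixed IP header (20 bytes for v4, 40 for v6) and
    the 20-byte option-less TCP header."""
    ip_header = 40 if ipv6 else 20
    return path_mtu - ip_header - 20

print(clamped_mss(1500))              # 1460 (classic Ethernet, IPv4)
print(clamped_mss(1492))              # 1452 (PPPoE, IPv4)
print(clamped_mss(1280, ipv6=True))   # 1220 (IPv6 minimum MTU)
```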
| jandrese wrote:
| The "application" in this case is for example a
| packet->radio bridge application, or a VPN tunnel
| endpoint. Something that is chewing on packets mid stream
| where you don't get much choice in the matter.
|
| One thing I like about IPv6 is the minimum MTU of 1280.
| That's big enough that if I'm really uncertain about the
| environment I can just set my MTU to that and avoid
| future headaches without impacting performance too badly.
| IPv4's 576 minimum nearly triples the number of packets
| you generate, which is really noticeable when you're
| running routers close to their limit. Forwarding speed is
| often dominated by packets per second, not bytes per
| second.
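The packet-count arithmetic behind that observation, counting only the fixed IP header (real traffic adds transport headers on top, so actual overhead is worse):

```python
import math

def packets_needed(data_len: int, mtu: int, ip_header: int) -> int:
    """Packets required to move data_len payload bytes when each
    packet carries at most (mtu - ip_header) bytes of payload."""
    return math.ceil(data_len / (mtu - ip_header))

one_mb = 1_000_000
print(packets_needed(one_mb, 1280, 40))  # 807  (IPv6 minimum MTU)
print(packets_needed(one_mb, 576, 20))   # 1799 (IPv4 minimum)
```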
| SarbinBracklur wrote:
| 1500 bytes is the de facto standard maximum payload size of
| an Ethernet frame (not including headers). Therefore, if you
| want to send data over Ethernet - for example, an IP packet
| - you can typically only send up to 1500 bytes at a time,
| and if you try to send more it gets dropped. This limit is
| called the Maximum Transmission Unit (MTU).
|
| Apparently, their customers were tunneling IP packets through
| another protocol, meaning that instead of sending an IP
| packet in an Ethernet frame, they were sending an IP packet in
| an X packet in an Ethernet frame. Since, like IP and Ethernet
| packets, the X packets need to contain some information
| related to the protocol, there was less room for the IP
| packet. I.e. the MTU was lower.
|
| When you set the MTU on your Operating System (OS), it refers
| to something slightly different. Instead of "this is the
| maximum size packet that will fit", it means "_assume_ this
| is the maximum size packet that will fit". You can use that
| setting to force your OS to send smaller packets if you know
| the MTU is lower than your OS thinks.
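That "less room for the IP packet" point is just subtraction. A sketch (the 4-byte MPLS label matches the anecdote above; the other overhead values are common examples, not taken from the thread):

```python
def inner_mtu(link_mtu: int, *overheads: int) -> int:
    """MTU left for the inner packet after each encapsulation layer
    takes its share of the link MTU."""
    return link_mtu - sum(overheads)

print(inner_mtu(1500, 4))      # 1496: one 4-byte MPLS label
print(inner_mtu(1500, 8))      # 1492: PPPoE (6) + PPP (2)
print(inner_mtu(1500, 20, 4))  # 1476: GRE over IPv4
```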
| zamadatix wrote:
| 1500 payload bytes is not de facto, it's part of the actual
| 802.3 standard. Any other payload MTU is actually non-
| standard (and the IEEE has gone out of their way to avoid
| standardizing other payload sizes. Frame size had a bump
| with 802.3as but still expressly left payload at 1500).
|
| MTU doesn't refer to anything different in that case; MTU
| always refers to the maximum size frame "this" node can
| handle. It also doesn't mean the OS assumes it's the maximum
| size packet that will fit on a path; it's the maximum size
| Ethernet frame the OS knows will fit on the NIC. The OS has
| other methods for assuming things about a path. Forcing MTU
| lower does force the OS to assume any path is never more
| than the MTU, though, which is why it works as a fix.
| iso1210 wrote:
| The world did, then people came along and started blocking ICMP
| corty wrote:
| Yes. There is a special circle in hell for braindead firewall
| admins filtering ICMP. Unfortunately happens often enough
| that "MTU problem" is the first thing that comes to mind at
| the grandparent's description. Next in line would be "broken
| DNS"...
| [deleted]
| temp667 wrote:
| They are not braindead. ICMP used to be a bit broken and
| folks got sick of ping floods, sweeps, redirect attacks and
| tunneling hacks etc. Some pretty major places block ICMP as
| a result to keep their network reasoning and security
| analysis simpler.
| korethr wrote:
| And when I am in my less charitable moods, I would claim
| that said people are unqualified not just to configure
| networks, but to have opinions worth considering when it
| comes to setting network policy.
| tinus_hn wrote:
| Unfortunately the only qualification required is for you to
| want to communicate with their network.
| anticristi wrote:
| In my utopic world, we wouldn't need "ICMP won't fragment".
| IPv4/IPv6 would have an MTU of exactly 4096 bytes.
|
| Why not a higher or lower level protocol? Because this is the
| lowest level protocol which is end-to-end and has a _path_
| MTU. Lower level protocols would need to somehow handle the
| MTU and higher level protocols would need to fit their
| segments within the MTU.
|
| Why 4096? Because this is a common page size on computers.
| Multiples of this would likely bring little benefit, due to
| GRO/GSO.
|
| Also, it's my utopic world, so if you don't like it create
| your own. :))
| alerighi wrote:
| 4096 can be too much for unstable connections, for example
| Wi-Fi or mobile.
|
| Also, 4096 bytes at what level? Layer 2? And when you are
| using a tunnel, you add bytes and thus have to go higher
| than 4096 or go lower and reduce MTU, or fragment the
| packet to keep 4096 bytes.
|
| You are just raising the limit, not solving the problem.
| noselasd wrote:
| The issue is how the networks are glued together: your X-byte
| MTU will be embedded in a VPN tunnel or PPPoE - and that
| device needs to tell your device about its MTU.
| tinus_hn wrote:
| This is the network version of '640k should be enough for
| anyone'.
|
| Don't fixate this kind of policy in such a fundamental
| protocol.
| zamadatix wrote:
| Not much of a utopia if all you do is say the same problem
| is now unsolved at a different layer ;). Nothing special
| about the layer below IP that allows it to fix the problem
| in a way IP couldn't be made to - it's just an annoying
| problem to fix efficiently.
| rjsw wrote:
| I consider the time I spent implementing RFC 4638 to allow the
| PPPoE implementation I use to have a 1500 byte MTU to be a good
| investment.
| stock_toaster wrote:
| I wish centurylink fiber supported baby jump frames. :/
| the8472 wrote:
| baby jumbo frames.
| toast0 wrote:
| Would make more sense for them to just do IP over fiber
| directly than fiddling with their PPPoE implementation.
|
| I have centurylink DSL with PPPoE and the thing that really
| bugs me is that if your modem has lost the PPPoE password, it
| can log in with default credentials and ask for the password.
| CenturyLink clearly knows who I am without needing PPPoE
| authentication, so what do they get out of it?
| praseodym wrote:
| My local provider requires PPPoE over fiber but accepts
| any password. It's probably some legacy artefact that's
| too complicated to remove from the stack.
| rubatuga wrote:
| Websites half working should set alarm bells ringing that
| it's an MTU problem. The reason google.com works is that
| they only support an MTU of 1280.
| drewg123 wrote:
| Part of the problem with IPv6 extension headers is that there can
| be an unbounded number of them. No hardware designer wants to
| deal with that, so just dropping all of them is far easier.
| alerighi wrote:
| Or we should stop thinking about processing packets with
| hardware. Does it still make a difference these days?
|
| Designing hardware for a single purpose is, most of the
| time, not a good idea. It means that you are stuck with the
| implementation that was baked into the hardware for a very
| long period. Also, a hardware implementation can't take into
| account (and handle) all the corner cases, for example
| malformed packets.
|
| IP was designed to be flexible: in theory you could have used
| whatever L2 protocol you wanted, and whatever L4 protocol you
| wanted. Thanks to hardware implementations that considered
| only Ethernet as the L2 protocol and TCP or UDP as the only
| possible L4 protocols, nothing was innovated.
|
| The reason we are so slow to adopt IPv6 is the same. What
| would it have taken to adopt it if we didn't have hardware
| implementations of IPv4? A software update, and you are done.
|
| The problem is that in the world there are many not so good
| programmers that write inefficient code and thus people think
| that they need to implement things in specialized hardware to
| make them faster. You don't, at least in most cases.
| zamadatix wrote:
| For $10,000 you can get a box which can route 12.8 Tbps (25.6
| Tbps bidirectionally) at 256 byte packet size. Even putting
| all of that money into CPUs alone (i.e. not the rest of a
| server or interfaces), you can't route at that performance,
| let alone match the feature set of the hardware, before
| talking about extra things one could add in software that
| would slow it even more.
| knorker wrote:
| I could be wrong, but I would think high packet rates with
| large routing tables (default-free zone) would still need a
| TCAM.
| kazen44 wrote:
| Also, hardware based offloading is a godsend for
| reliability as well.
|
| You can upgrade the control plane of a router with traffic
| still flowing thanks to the data plane being separate. You
| can even make another router do control plane for the data
| plane in another router. Not to mention the massive
| performance benefits.
| kevingadd wrote:
| Software packet processing simply does not scale to the
| levels that hardware packet processing does, and that's not
| even mentioning the cost issues involved in provisioning and
| operating enough CPUs to do it. Maybe in a few years now that
| we're seeing higher core counts become common in non-server
| hardware...
| zamadatix wrote:
| There are "only" 256 extension header IDs, which gives an
| upper bound. Realistically you're never going to see more
| than 20ish to process on a packet, because after that you're
| likely requiring the packet be >1280 bytes, which would
| create problems anyway - unless all of a sudden they start
| pumping out 8-byte extension headers like candy.
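A parser with that kind of fixed budget might look like this sketch (the header-ID set and all names are mine; it assumes the generic (next-header, length) layout, ignoring that AH and ESP encode length differently):

```python
# Next-header values that introduce another extension header
# (a representative subset using the generic layout).
EXT_HEADERS = {0, 43, 44, 60, 135, 139, 140}

def count_extension_headers(first_next_header: int, payload: bytes,
                            cap: int = 20) -> int:
    """Walk an IPv6 extension-header chain, giving up after `cap`
    headers the way a parser with a fixed hardware budget would.
    Each header is assumed to follow the generic layout: byte 0 is
    the next-header value, byte 1 the length in 8-octet units not
    counting the first 8 octets."""
    nh, offset, count = first_next_header, 0, 0
    while nh in EXT_HEADERS:
        count += 1
        if count > cap or offset + 2 > len(payload):
            raise ValueError("chain too long or truncated; drop packet")
        hdr_len = (payload[offset + 1] + 1) * 8
        nh = payload[offset]
        offset += hdr_len
    return count

# Hop-by-Hop (0) -> Destination Options (60) -> TCP (6):
chain = bytes([60, 0, 0, 0, 0, 0, 0, 0,   # HbH: next=60, len=0 (8 bytes)
               6, 0, 0, 0, 0, 0, 0, 0])   # DstOpts: next=6, len=0
print(count_extension_headers(0, chain))  # 2
```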
| majke wrote:
| A couple of years ago I wrote a tool to check whether
| end-hosts are complying:
|
| http://icmpcheckv6.popcount.org/
|
| (v4 version http://icmpcheck.popcount.org/ )
|
| it answers:
|
| - can fragments reach you
|
| - can PTB ICMP reach you
|
| hope it's useful. Prose: https://blog.cloudflare.com/ip-
| fragmentation-is-broken/
|
| Notice: it's easy to run the tests headless with curl if you need
| to see if your server is configured fine.
|
| Fun fact: it's very much not easy to accept/send fragmented
| packets from Linux. I learned the hard way what
| `IP_NODEFRAG` is about.
| zamadatix wrote:
| Wow this is great, thanks for hosting this!
| bombcar wrote:
| It seems with all "standards" there is the published standard
| as written, and the standard "as implemented" - and many of
| the unused corners quickly become "here be dragons".
| kazinator wrote:
| > _Committees should never attempt to define technology_
|
| I completely agree; that's why I look down on newer POSIX, newer
| C++, newer ISO C, Unicode, WWW, ...
___________________________________________________________________
(page generated 2021-04-23 23:03 UTC)