[HN Gopher] IPv6 Fragmentation Loss
       ___________________________________________________________________
        
       IPv6 Fragmentation Loss
        
       Author : oedmarap
       Score  : 52 points
       Date   : 2021-04-23 10:22 UTC (12 hours ago)
        
 (HTM) web link (www.potaroo.net)
 (TXT) w3m dump (www.potaroo.net)
        
       | toast0 wrote:
       | Based on how fragmentation is handled (generally poorly, often
       | because there's little choice), I would have preferred truncation
       | with an in-band signal. For TCP, truncation is a clear win; you
       | get some of the packet, and can signal back to the other end that
       | things are missing, and hopefully the other end adapts to stop
       | sending packets that get truncated. (Of course, when a middle box
       | uses large receive offload to combine the packets and then
       | complains that they're too big to forward, it's hard to fix as an
       | endpoint).
       | 
       | For UDP, it's not so simple; IP fragmentation does allow for
       | large data, all or nothing processing, without needing
       | application level handling, but the cost of fragmentation is
       | high.
       | 
        | The out-of-band signalling when sending packets that are too
        | large is too easy to break, and too many systems are still not
        | set up to probe for path MTU black holes (the biggest one for me
        | is Android), and the workarounds are meh, too.
       | 
       | Another option would be for IP fragments to have the protocol
       | level header, so fragments could be grouped by the full 5-tuple
       | (protocol, source ip, dest ip, source port, dest port) and kept
       | if useful or dropped if not, without having to wait for all the
       | fragments to appear.
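        | 
        | Roughly what I mean, as a toy sketch (made-up names, nothing
        | from a real stack): index fragments by the 5-tuple and make
        | the keep/drop decision per flow, instead of buffering blindly
        | until reassembly:
        | 
        |     from collections import defaultdict
        | 
        |     # key: (protocol, src_ip, dst_ip, src_port, dst_port)
        |     flows = defaultdict(list)
        | 
        |     def handle_fragment(five_tuple, fragment, wanted):
        |         # per-flow policy decision, e.g. a firewall rule
        |         if wanted(five_tuple):
        |             flows[five_tuple].append(fragment)
        |         # else: drop now, no need to hold fragment state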
        
         | zamadatix wrote:
          | Truncation creates even more work, actually. IP routers don't
          | need to understand deeper protocols or do anything with them
          | to fragment; they simply split the IP packet into fragments
          | and recalculate the IP checksum (and in v6 they don't even
          | need to do that). To do this with truncation you have to know
          | how to parse the inner protocol headers and modify things like
          | length or checksums in those as well.
         | 
          | You get the same issue putting the information on the
          | fragments. Now there is no "IP layer"; there is just "well,
          | we're using IP+UDP today and how that is right now should
          | forever be baked into this hardware that will be here for 20
          | years", which is exactly the problem that led Google/IETF to
          | push headers deeper with HTTP/3 to get out of that mess.
         | 
          | You can also get an in-band signal that you're being
          | fragmented in the middle without changing IP. E.g. TCP already
          | negotiates an MSS; if IP fragments at the start of a group
          | come in smaller than that, you know there is something
          | fragmenting in the middle.
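          | 
          | A rough sketch of that check (illustrative only, not how any
          | particular stack does it): compare the leading fragment's
          | size against what a full MSS-sized segment should occupy:
          | 
          |     def path_is_fragmenting(frag_len, mss, hdrs=40):
          |         # hdrs: assumed IPv4 (20) + TCP (20) bytes;
          |         # a leading fragment smaller than MSS + hdrs
          |         # suggests a mid-path device split the packet
          |         return frag_len < mss + hdrs
          | 
          |     path_is_fragmenting(1280, mss=1460)  # True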
        
           | toast0 wrote:
            | Minimum work for routers would be to mark the IP packet as
            | truncated and adjust the IP checksum; the upper-level
            | protocol can suck it. This is, of course, more work than
            | dropping it on the floor and pretending you'll send an ICMP,
            | but much less work than sending two (or occasionally more)
            | new packets.
           | 
            | In-the-middle fragmentation is not really something that
            | happens very often. IPv6 prohibited it, but in IPv4 nearly
            | all packets are marked Don't Fragment, because IP routers
            | weren't fragmenting much anyway; I think it's more likely to
            | get an ICMP "fragmentation needed" packet for a too-big
            | packet with Don't Fragment than to actually get fragmented
            | delivery.
           | 
            | Also, MSS is mostly not a negotiation; most stacks send what
            | they think they can receive in the SYN and the SYN+ACK. The
            | only popular stack that sent min(received MSS, MSS from
            | routing) was FreeBSD, but they changed to the common way in
            | 12 IIRC, which in my opinion is a mistake, but I don't have
            | enough data to show it... actually what seems best is to
            | send back min(received MSS - X, MSS from routing), where X
            | is 8 or 20, depending on whether more of your users are on
            | misconfigured PPPoE or behind misconfigured IPIP tunnels.
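            | 
            | As a sketch of that last formula (names made up, not from
            | any real stack):
            | 
            |     def mss_to_send_back(received_mss, route_mss, x=8):
            |         # x = 8 for misconfigured PPPoE, 20 for IPIP
            |         return min(received_mss - x, route_mss)
            | 
            |     mss_to_send_back(1460, 1460, x=8)  # 1452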
        
             | zamadatix wrote:
             | At that point it pretty much amounts to "send a message in
             | a raw IP header to the destination rather than a message in
             | an ICMP header to the source". The extra step of truncation
             | of the original payload doesn't gain you anything except a
             | lot of pain for the upper layers/endpoint stacks all so the
             | session doesn't have to resend the first part of the first
             | packets of a conversation.
             | 
              | The vast majority of IPv4 traffic does not have the DF bit
              | set. Your logic for why they would doesn't even make
              | sense: setting DF only means the packet will be dropped by
              | routers that would have fragmented it; it doesn't improve
              | the situation with the ones that wouldn't have.
             | 
              | MSS is definitely a negotiation, but a negotiation just
              | between the TCP-aware nodes, not along the whole IP path,
              | which is why I say endpoint stacks can use it to detect
              | whether the IP path is fragmenting by comparing incoming
              | IP fragment sizes to the MSS.
        
       | jandrese wrote:
        | The article's conclusion is basically that IPv6 extension
        | headers (fragmentation is one of them) are useless on the
        | Internet, which seems pretty reasonable to me. They're basically
        | a research tool.
       | 
       | The lack of on-path fragmentation in IPv6 is definitely on
       | purpose. It was a mistake in IPv4 and would be silly to replicate
        | in IPv6. The fragmentation header in IPv6 is effectively
        | useless: fragmentation can only be done at the endpoints, and if
        | that's the case the application should be doing it, not the
        | stack. Instead IPv6 mandates path MTU discovery, which is the
        | correct solution.
        
       | anticristi wrote:
       | The biggest single cause of network engineer hair loss must be
       | MTU and fragmentation. We had some customers over a tunnel (IIRC,
       | it was MPLS) that reduced the MTU from 1500 (default Ethernet) to
        | 1496. This happened when our vendor changed some equipment and
        | forgot to update the MTU on either end to accommodate the extra
        | MPLS header. Of course, there was a misconfigured firewall that
        | wouldn't fragment and wouldn't send "ICMP won't fragment".
       | 
        | The result? Most DNS queries went through. All chat applications
        | worked. SSH generally worked, except when you started a full-
        | screen terminal application. Smaller web pages loaded. Larger
        | web pages loaded only partially.
       | 
       | Imagine non-technical users explaining their issue. "The Internet
       | is half-broken. Please help."
       | 
        | God only knows how much hair I would have today if the world had
        | figured out MTU and fragmentation properly.
        
         | korethr wrote:
         | > Of course, there was a misconfigured firewall that wouldn't
         | fragment and wouldn't send "ICMP won't fragment".
         | 
         | Blocked or dropped ICMP has caused me heartburn as well. I am
         | pretty sure my co-workers are used to my "blocking ICMP is Evil
         | Bad and Wrong" rants by now.
        
           | temp667 wrote:
            | Some people don't want to deal with ping floods, sweeps,
            | ICMP tunneling issues, and the whole ICMP redirect attack
            | thing?
        
           | kureikain wrote:
            | AWS blocks ICMP by default though, and it causes a lot of
            | traceroute problems for me...
        
         | cryptonector wrote:
         | MTU mismatches are hard to solve too. This happens when you
         | have nodes and/or switches that are either misconfigured or
         | don't all support autodiscovery and then have divergent
         | defaults.
         | 
         | A typical symptom is that things (e.g., ssh) hang. This
         | typically happens when you use a protocol that uses Kerberos in
         | an Active Directory domain with very large Kerberos tickets. PQ
         | crypto would do this too.
        
         | kazen44 wrote:
          | What is even more annoying is that there is a perfectly fine
          | solution for this problem called path MTU discovery. Sadly,
          | many people block the entirety of ICMP, which prevents it from
          | working properly. Also, some NAT devices just straight up lie
          | about the path MTU.
        
         | raspyberr wrote:
          | Why would you reduce the MTU? I had an issue many years ago
          | playing League of Legends where login would fail until I
          | changed the MTU to be slightly lower. Something like 1486.
        
           | jandrese wrote:
           | Tunnels always have the danger of reducing the MTU because
           | they have to account for the header.
           | 
            | Worse is that if a tunnel doesn't reduce the MTU and its
            | packets are dropped by a router further down the line, the
            | ICMP TOO_FAT response goes back to the tunnel endpoint, not
            | the original host. The tunnel endpoint has zero clue what to
            | do with it and drops it on the ground, leaving the original
            | host in the dark.
           | 
           | Even worse is when overzealous firewall admins start locking
           | down everything they don't understand and that includes ICMP.
           | "Some hackers could ping our networks!" Then you're truly up
           | a creek.
           | 
           | Luckily MTU breakage is fairly easy to spot once you know
           | what to look for. It's easy enough to fix too if your local
           | firewall admins aren't the block everything type. Just fix
           | the MTU on the tunnel endpoint and suddenly everything will
           | start working.
           | 
            | One final note: If as a router or application developer you
            | ever find yourself having to fragment packets, please, for
            | the love of all that is holy, fragment them into two roughly
            | equal-sized packets. Don't create one huge packet and one
            | runt from the leftovers. It hurts me to see packets coming
            | in from some heavily tunneled source with a 1400-byte
            | initial packet followed by 3 or 4 tiny fragments. Or worse,
            | 3 or 4 tiny out-of-order fragments followed by the 1400-byte
            | bulk because the MAC prioritized small packets in its queue
            | and sent them first.
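            | 
            | Something like this is all it takes (a sketch for the two-
            | fragment case; IPv4 rules mean every fragment except the
            | last has to carry a multiple of 8 payload bytes):
            | 
            |     def split_in_half(payload, mtu, ip_hdr=20):
            |         cap = (mtu - ip_hdr) // 8 * 8
            |         half = (len(payload) + 1) // 2
            |         first = min((half + 7) // 8 * 8, cap)
            |         return payload[:first], payload[first:]
            | 
            |     # 1480 bytes over a 1280 MTU -> 744 + 736,
            |     # instead of 1256 + 224
            |     split_in_half(b"x" * 1480, mtu=1280)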
        
             | kazen44 wrote:
             | > One final note: If as a router or application developer
             | you ever find yourself having to fragment packets,
             | 
              | Ideally, you should not want to fragment IP packets at
              | all. It is far better to do MSS clamping in TCP to keep
              | the TCP payload from growing above the max MTU on the
              | path. TCP can stitch the data back together at the other
              | node just fine, without routers in between having to
              | fragment the IP packets, which would kill your bandwidth
              | in comparison.
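              | 
              | The clamping itself is just arithmetic (a sketch,
              | assuming plain IPv4 + TCP with 20-byte headers
              | each):
              | 
              |     def clamp_mss(mss, path_mtu, ip=20, tcp=20):
              |         return min(mss, path_mtu - ip - tcp)
              | 
              |     # a host advertising 1460 behind a 1496-byte
              |     # MPLS path (like the story upthread):
              |     clamp_mss(1460, 1496)  # 1456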
        
               | jandrese wrote:
               | The "application" in this case is for example a
               | packet->radio bridge application, or a VPN tunnel
               | endpoint. Something that is chewing on packets mid stream
               | where you don't get much choice in the matter.
               | 
               | One thing I like about IPv6 is the minimum MTU of 1280.
               | That's big enough that if I'm really uncertain about the
               | environment I can just set my MTU to that and avoid
               | future headaches without impacting performance too badly.
               | IPv4's 576 minimum nearly triples the number of packets
               | you generate, which is really noticeable when you're
               | running routers close to their limit. Forwarding speed is
               | often dominated by packets per second, not bytes per
               | second.
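                | 
                | Back-of-the-envelope version (assuming 40
                | bytes of IPv4+TCP headers, 60 for the IPv6
                | case):
                | 
                |     import math
                | 
                |     def pkts(data, mtu, hdrs):
                |         return math.ceil(data / (mtu - hdrs))
                | 
                |     pkts(10_000_000, 1500, 40)  # 6850
                |     pkts(10_000_000, 1280, 60)  # 8197, +20%
                |     pkts(10_000_000, 576, 40)   # 18657, ~2.7x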
        
           | SarbinBracklur wrote:
            | 1500 bytes is the de facto standard maximum size of an
            | Ethernet frame's payload (not including headers). Therefore,
            | if you want to send data over Ethernet, for example an IP
            | packet, you can typically only send up to 1500 bytes at a
            | time, and if you try to send more it gets dropped. This
            | limit is called the Maximum Transmission Unit (MTU).
           | 
            | Apparently, their customers were tunneling IP packets
            | through another protocol, meaning that instead of sending an
            | IP packet in an Ethernet frame, they were sending an IP
            | packet in an X packet in an Ethernet frame. Since, like IP
            | and Ethernet packets, the X packets need to contain some
            | information related to their protocol, there was less room
            | for the IP packet. I.e. the effective MTU was lower.
           | 
            | When you set the MTU on your Operating System (OS), it
            | refers to something slightly different. Instead of "this is
            | the maximum size packet that will fit", it means "_assume_
            | this is the maximum size packet that will fit". You can use
            | that setting to force your OS to send smaller packets if you
            | know the MTU is lower than your OS thinks.
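            | 
            | The arithmetic behind the tunneling part is simple (the
            | overhead numbers below are common examples, not
            | necessarily the poster's exact setup):
            | 
            |     def inner_mtu(outer_mtu, *overheads):
            |         return outer_mtu - sum(overheads)
            | 
            |     inner_mtu(1500, 8)      # PPPoE: 1492
            |     inner_mtu(1500, 4)      # one MPLS label: 1496
            |     inner_mtu(1500, 20, 8)  # IPIP in PPPoE: 1472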
        
             | zamadatix wrote:
              | 1500 payload bytes is not de facto, it's part of the
              | actual 802.3 standard. Any other payload MTU is actually
              | non-standard (and the IEEE has gone out of its way to
              | avoid standardizing other payload sizes. Frame size had a
              | bump with 802.3as but still expressly left the payload at
              | 1500).
              | 
              | MTU doesn't refer to anything different in that case; MTU
              | always refers to the maximum size frame "this" node can
              | handle. It also doesn't mean the OS assumes it's the
              | maximum size packet that will fit on a path; it's the
              | maximum size Ethernet frame the OS knows will fit on the
              | NIC. The OS has other methods for assuming things about a
              | path. Forcing the MTU lower does force the OS to assume
              | any path is never more than the MTU, though, which is why
              | it works as a fix.
        
         | iso1210 wrote:
         | The world did, then people came along and started blocking ICMP
        
           | corty wrote:
            | Yes. There is a special circle in hell for braindead
            | firewall admins filtering ICMP. Unfortunately it happens
            | often enough that "MTU problem" is the first thing that
            | comes to mind at the grandparent's description. Next in line
            | would be "broken DNS"...
        
             | [deleted]
        
             | temp667 wrote:
             | They are not braindead. ICMP used to be a bit broken and
             | folks got sick of ping floods, sweeps, redirect attacks and
             | tunneling hacks etc. Some pretty major places block ICMP as
             | a result to keep their network reasoning and security
             | analysis simpler.
        
           | korethr wrote:
            | And when I am in my less charitable moods, I would claim
            | that said people are unqualified not just to configure
            | networks, but also to have opinions worth considering when
            | it comes to setting network policy.
        
             | tinus_hn wrote:
             | Unfortunately the only qualification required is for you to
             | want to communicate with their network.
        
           | anticristi wrote:
            | In my utopian world, we wouldn't need "ICMP won't fragment".
            | IPv4/IPv6 would have an MTU of exactly 4096 bytes.
           | 
           | Why not a higher or lower level protocol? Because this is the
           | lowest level protocol which is end-to-end and has a _path_
           | MTU. Lower level protocols would need to somehow handle the
           | MTU and higher level protocols would need to fit their
           | segments within the MTU.
           | 
            | Why 4096? Because this is a common page size on computers.
            | Multiples of this would likely bring little benefit, due to
            | GRO/GSO.
           | 
            | Also, it's my utopian world, so if you don't like it, create
            | your own. :))
        
             | alerighi wrote:
             | 4096 can be too much for unstable connections, for example
             | Wi-Fi or mobile.
             | 
             | Also, 4096 bytes at what level? Layer 2? And when you are
             | using a tunnel, you add bytes and thus have to go higher
             | than 4096 or go lower and reduce MTU, or fragment the
             | packet to keep 4096 bytes.
             | 
             | You are just raising the limit, not solving the problem.
        
             | noselasd wrote:
              | The issue is how networks are glued together: your X-byte
              | MTU will be embedded in a VPN tunnel or PPPoE, and that
              | device needs to tell your device about its MTU.
        
             | tinus_hn wrote:
             | This is the network version of '640k should be enough for
             | anyone'.
             | 
              | Don't bake this kind of policy into such a fundamental
              | protocol.
        
             | zamadatix wrote:
             | Not much of a utopia if all you do is say the same problem
             | is now unsolved at a different layer ;). Nothing special
             | about the layer below IP that allows it to fix the problem
             | in a way IP couldn't be made to - it's just an annoying
             | problem to fix efficiently.
        
         | rjsw wrote:
         | I consider the time I spent implementing RFC 4638 to allow the
         | PPPoE implementation I use to have a 1500 byte MTU to be a good
         | investment.
        
           | stock_toaster wrote:
           | I wish centurylink fiber supported baby jump frames. :/
        
             | the8472 wrote:
             | baby jumbo frames.
        
             | toast0 wrote:
             | Would make more sense for them to just do IP over fiber
             | directly than fiddling with their PPPoE implementation.
             | 
              | I have CenturyLink DSL with PPPoE, and the thing that
              | really bugs me is that if your modem loses the PPPoE
              | password, it can log in with default credentials and ask
              | for the password. CenturyLink clearly knows who I am
              | without needing PPPoE authentication, so what do they get
              | out of it?
        
               | praseodym wrote:
               | My local provider requires PPPoE over fiber but accepts
               | any password. It's probably some legacy artefact that's
               | too complicated to remove from the stack.
        
         | rubatuga wrote:
          | Websites half working should set alarm bells ringing that it's
          | an MTU problem. The reason google.com works is that they only
          | support an MTU of 1280.
        
       | drewg123 wrote:
       | Part of the problem with IPv6 extension headers is that there can
       | be an unbounded number of them. No hardware designer wants to
       | deal with that, so just dropping all of them is far easier.
        
         | alerighi wrote:
          | Or we should stop thinking about processing packets in
          | hardware. Does it still make a difference these days?
         | 
          | Designing hardware for a single purpose is, most of the time,
          | not a good idea. It means that you are stuck with the
          | implementation that was baked into the hardware for a very
          | long period. Also, a hardware implementation can't take into
          | account (and handle) all the particular cases, for example
          | malformed packets.
         | 
          | IP was designed to be flexible: in theory you could have used
          | whatever L2 protocol you wanted, and whatever L4 protocol you
          | wanted. Thanks to hardware implementations that considered
          | only Ethernet as the L2 protocol and TCP or UDP as the only
          | possible L4 protocols, hardly anything was innovated.
         | 
          | The reason we are so slow to adopt IPv6 is the same. What
          | would it have taken to just adopt it if we didn't have
          | hardware implementations of IPv4? A software update and you
          | are done.
         | 
          | The problem is that there are many not-so-good programmers in
          | the world who write inefficient code, and thus people think
          | they need to implement things in specialized hardware to make
          | them faster. You don't, at least in most cases.
        
           | zamadatix wrote:
            | For $10,000 you can get a box which can route 12.8 Tbps
            | (25.6 Tbps bidirectionally) at a 256-byte packet size. Even
            | putting all of that money toward CPUs alone (i.e. not the
            | rest of a server or interfaces) you can't route at that
            | performance, let alone match the feature set of the
            | hardware, before talking about extra things one could add
            | in software that would slow it even more.
        
           | knorker wrote:
           | I could be wrong, but I would think high packet rates with
           | large routing tables (default-free zone) would still need a
           | TCAM.
        
             | kazen44 wrote:
              | Also, hardware-based offloading is a godsend for
              | reliability as well.
              | 
              | You can upgrade the control plane of a router with traffic
              | still flowing, thanks to the data plane being separate.
              | You can even have one router act as the control plane for
              | the data plane in another router. Not to mention the
              | massive performance benefits.
        
           | kevingadd wrote:
           | Software packet processing simply does not scale to the
           | levels that hardware packet processing does, and that's not
           | even mentioning the cost issues involved in provisioning and
           | operating enough CPUs to do it. Maybe in a few years now that
           | we're seeing higher core counts become common in non-server
           | hardware...
        
         | zamadatix wrote:
         | There are "only" 256 extension header IDs which gives an upper
         | bound. Realistically you're never going to be able see more
         | than 20ish to process on a packet because after that you're
         | likely requiring the packet be >1280 bytes which would create
         | problems anyways - unless all of the sudden they start pumping
         | out 8 byte extension headers like candy.
        
       | majke wrote:
        | A couple of years ago I wrote a tool to check whether end-hosts
        | are complying:
       | 
       | http://icmpcheckv6.popcount.org/
       | 
       | (v4 version http://icmpcheck.popcount.org/ )
       | 
       | it answers:
       | 
       | - can fragments reach you
       | 
       | - can PTB ICMP reach you
       | 
        | Hope it's useful. Prose:
        | https://blog.cloudflare.com/ip-fragmentation-is-broken/
       | 
        | Note: it's easy to run the tests headless with curl if you need
        | to check whether your server is configured correctly.
       | 
        | Fun fact: it's very much not easy to accept/send fragmented
        | packets from Linux. I learned the hard way what `IP_NODEFRAG`
        | is about.
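        | 
        | For the curious, the gist of the `IP_NODEFRAG` gotcha, as a
        | sketch: socket.IP_NODEFRAG may not exist in your Python, so 22
        | (its value in linux/in.h) is assumed below; worth double-
        | checking on your kernel.
        | 
        |     import socket
        | 
        |     IP_NODEFRAG = getattr(socket, "IP_NODEFRAG", 22)
        | 
        |     s = socket.socket(socket.AF_INET, socket.SOCK_RAW,
        |                       socket.IPPROTO_RAW)
        |     # without this, netfilter reassembles the outgoing
        |     # fragments you hand-crafted before they leave
        |     s.setsockopt(socket.IPPROTO_IP, IP_NODEFRAG, 1)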
        
         | zamadatix wrote:
         | Wow this is great, thanks for hosting this!
        
       | bombcar wrote:
        | It seems with all "standards" there is the published standard as
        | written, and the standard "as implemented" - and many of the
        | unused corners quickly become "here be dragons".
        
       | kazinator wrote:
       | > _Committees should never attempt to define technology_
       | 
       | I completely agree; that's why I look down on newer POSIX, newer
       | C++, newer ISO C, Unicode, WWW, ...
        
       ___________________________________________________________________
       (page generated 2021-04-23 23:03 UTC)