[HN Gopher] So you think you understand IP fragmentation?
___________________________________________________________________
So you think you understand IP fragmentation?
Author : kevincox
Score : 114 points
Date : 2024-02-15 11:35 UTC (2 days ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| metaxy2 wrote:
| > In my experience, it is rare for a network to correctly
| generate Time Exceeded messages for both IPv4 and IPv6.
|
| Doesn't that make it more one of those situations where the non-
| documented behavior has become the de facto standard, rather than
| "wrong" exactly? (I guess it depends on whether that decision is
| being made consciously by the implementors or just for lack of
| knowledge of the standards.)
| toast0 wrote:
| People who filter out all ICMP are probably unaware of the
| standard, but router implementors that limit ICMP rates are
| balancing transparent observability with the need to keep the
| equipment running.
|
| I guess you could provision the router cpus so they could send
| ICMPs for line rate incoming packets that must be dropped, but
| that doesn't seem like a good cost tradeoff.
| toast0 wrote:
| > Worse, fragments are more likely to be lost. Many routers and
| firewalls treat fragments as a security risk because they don't
| include the information from higher-level protocols like TCP or
| UDP and can't be filtered based on port, so they drop all IP
| fragments.
|
| I've seen worse than that. A firewall dropping the first fragment
| based on the UDP port number (which is available in the first
| fragment), but allowing further fragments.
|
| I'd love to see their new discovery algorithm get widely
| distributed, it's 2024, and a lot of stuff still breaks or
| suffers terrible delays if I don't apply the proper settings with
| my 1492 MTU.
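
For illustration, the "proper settings" for a 1492-byte MTU usually come
down to clamping the TCP MSS. A minimal sketch of that arithmetic,
assuming plain IPv4/IPv6 headers with no options or extension headers
(the function name tcp_mss is made up for this example):

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
IPV6_HEADER = 40  # bytes, fixed IPv6 header, no extension headers
TCP_HEADER = 20   # bytes, TCP header without options


def tcp_mss(mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one unfragmented packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


print(tcp_mss(1492))             # 1452
print(tcp_mss(1492, ipv6=True))  # 1432
print(tcp_mss(1500))             # 1460
```
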
| mzs wrote:
| That could be an old BSD bug. When there is no ARP entry the
| first fragment is dropped before the ARP entry is cached. That
| made it into many BSD-based network software stacks before it
| was fixed.
|
| I just use 1024 since nobody seems to use SLIP anymore. It
| should fit under 1200 with headers and the logic is similar to
| DEC 512+64 limit arrived at initially. All the PMTU detection
| algos suffer from something lowering the MTU along a route for
| a long lasting connection.
| ardel95 wrote:
| One of the biggest misses with IP fragmentation was not requiring
| each fragment to carry the higher protocol header. Or at least do
| that for UDP.
|
| That decision alone would've made fragments so much simpler on
| network devices and appliances, and much less likely for them to
| get dropped.
| wmf wrote:
| That would be a layering violation. IP routers don't
| necessarily know about higher protocols.
| tptacek wrote:
| You could design a network protocol that fragments by
| capturing a variable number of bytes from the next header,
| and ICMP already does something like that.
|
| (None of this would fix the real problem with fragmentation,
| which is that you can't efficiently segment out a large frame
| without having some kind of reliability layer).
| raggi wrote:
| If I was revisiting, I'd probably eradicate the layer and
| pick a fixed number of flow types with distinct headers and
| state machines. The layers were a reasonable choice given
| the understanding of the time, but in hindsight I think you
| can make a strong case they're cut at the wrong places.
| Hikikomori wrote:
| There are no strict rules about layers; most routers can and do
| read info in tcp/udp headers.
| n2d4 wrote:
| They _can_ read higher layers, but they (currently) don't
| _have to_ in order to implement IP correctly
| dtech wrote:
| And that's how we got forever stuck with those 2 and now
| have to build every new protocol on top of UDP.
| n2d4 wrote:
| Actually, that's not a bad thing. UDP is small enough to
| have nearly no overhead, but complex enough to let
| firewalls do their job. Six of the eight bytes in its
| header would probably be in the header of any transport
| layer protocol anyways (only the checksum might be
| unnecessary).
|
| Wikipedia lists over 100 assigned IP protocol numbers
| [1], and while it would break existing firewalls, adding
| a new protocol would certainly require less work than the
| transition from IPv4 to IPv6. But UDP is already simple
| enough that there's very little benefit in not just
| building on that.
|
| [1]
| https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers
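
To make the "eight bytes" concrete, here is a small sketch of the UDP
header layout: source port, destination port, length, and checksum, two
bytes each. The helper name build_udp_header is hypothetical, and the
checksum is simply left at zero rather than computed.

```python
import struct


def build_udp_header(src_port, dst_port, payload_len, checksum=0):
    """Pack the four 16-bit UDP header fields in network byte order."""
    length = 8 + payload_len  # the UDP length field covers header + payload
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)


header = build_udp_header(5353, 53, payload_len=12)
print(len(header))  # 8
```
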
| cesarb wrote:
| > most routers can and do read info in tcp/udp headers.
|
| Do _most_ routers really do that, or just the ones which
| are also trying to act as a firewall?
| wmf wrote:
| For example, IP routers often peek at UDP/TCP port
| numbers to calculate ECMP flow hashing. This is
| technically naughty but it's read-only and it's only an
| optimization that isn't required for correct forwarding.
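
A rough sketch of what ECMP flow hashing looks like, assuming a simple
CRC32 over the 5-tuple. Real routers use their own hardware hash
functions; the function name and the key format here are invented for
illustration only.

```python
import zlib


def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Pick one of several equal-cost next hops from a flow's 5-tuple."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    # Same 5-tuple -> same hash -> same path, so a flow is not reordered.
    return next_hops[zlib.crc32(key) % len(next_hops)]


paths = ["hop-a", "hop-b", "hop-c"]
print(ecmp_next_hop("192.0.2.1", "198.51.100.7", 17, 40000, 53, paths))
```

Because the hash is read-only and only selects among valid paths,
forwarding stays correct even if the port fields are ignored.
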
| Hikikomori wrote:
| Yes. I doubt you can find one that is not capable.
| gizmo686 wrote:
| You could implement it as a generic 'application metadata'
| field in the IP header. From the perspective of IP, it is one
| more length-prefixed field in the IP header. Routers may
| interpret it in conjunction with the value of the protocol
| field; otherwise they are just required to leave it
| unchanged in the header (including in all fragments).
|
| For packets that don't want to use it, this is just 1 byte of
| overhead to set the size to 0.
| jandrese wrote:
| The IP layer doesn't have to know what is in those upper
| layers to include 50 or 100 bytes of it in a little trunk.
| zamadatix wrote:
| If you always chop 100 and add 100 then it's even more
| massively inefficient than the problem it solves. The
| router would at least need to have every protocol start
| with a header length value. Otherwise if you just take the
| first 100 bytes and stick it in the front of each packet
| and the header was only 57 bytes then you've suddenly got
| 43 bytes of garbage in the next layer's payload when you
| reassemble.
|
| Keep in mind, most routers don't even bother supporting
| existing fragmentation because it's costly to implement in
| high speed hardware. So while you could theoretically have
| that dynamic next protocol header length value field it'd
| only be complicating something hardware makers already
| think is too complicated to be worth it. Making things
| unappealingly complex is one of the common results of
| layering violations.
| colmmacc wrote:
| It's just a dumb mistake. All it takes is a "next layer
| header length" field. It would have been very simple.
|
| You don't even really need that, and as proof, take ICMP ...
| which was designed as part of IP ... actually does do this.
| Routers are already required to copy and include the header
| of the packet that triggered an ICMP error.
| p1mrx wrote:
| If I could go back in time, I would replace fragmentation with
| in-flight truncation. The receiver sees the exact MTU and can
| communicate it back to the sender as needed.
| zokier wrote:
| Why would that be preferable over having routers just drop
| oversized packets and send icmp message to sender?
| kevincox wrote:
| 1. ICMP messages often don't reliably make it back to the
| sender. This is especially true with low level load
| balancers.
|
| 2. You at least get some information. Maybe half a packet is
| better than no packet.
|
| 3. You get an exact size limit of the entire path the packet
| took. Rather than just a "too big" of the first hop that
| wouldn't pass it.
|
| Overall I think it is a pretty decent solution. Assuming that
| the IP header says the original size or similar so that the
| truncation can be easily detected.
|
| I do see some flaws though:
|
| 1. Layered protocols would have to be more complex to deal
| with packet truncation. (Although I guess they could simply
| treat truncation the same as dropped to avoid extra
| complexity.)
|
| 2. Checksums at the end of a packet being dropped likely make
| the whole packet useless.
|
| 3. It may be better to get a quick rejection from a local
| router rather than the possibly far away peer.
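
A toy model of the trade-off in the two lists above, assuming a path is
just a list of per-link MTUs and ignoring loss, headers, and return
paths. All function names are hypothetical. Truncation reports the
whole-path limit in one pass; drop-and-ICMP reports one hop's limit per
attempt.

```python
def send_with_truncation(size, link_mtus):
    """Each hop truncates to its MTU; the receiver sees the final size."""
    for mtu in link_mtus:
        size = min(size, mtu)
    return size


def send_with_icmp(size, link_mtus):
    """Return None if delivered, else the MTU of the first hop that drops."""
    for mtu in link_mtus:
        if size > mtu:
            return mtu
    return None


def icmp_discovery(size, link_mtus):
    """Retry with each reported MTU; return (final size, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        reported = send_with_icmp(size, link_mtus)
        if reported is None:
            return size, attempts
        size = reported


path = [1500, 1400, 1280]
print(send_with_truncation(1500, path))  # 1280
print(icmp_discovery(1500, path))        # (1280, 3)
```
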
| amluto wrote:
| I'll add an extra bonus: every now and then, an endpoint
| could append one extra bonus byte to a transmitted packet.
| That would probe for a potential increase in PMTU at very
| little cost.
| zamadatix wrote:
| Yeah, maybe a "contains MTU dummy data" bit at the start
| of the header which lets the receiving client know that
| this packet 1) May contain additional data after the real
| payload 2) If it does, send back an updated max packet
| size based on the amount that made it through.
|
| IETF folks wouldn't like that much state in the IP
| protocol and it doesn't solve intermediate L2 loss (the
| real reason you have IP MTU differences) when switches
| are involved in the path though. That said, it might make an
| interesting ICMP probing method.
| cesarb wrote:
| I'd add a fourth potential positive aspect of truncation:
| it could be implemented at full speed in the data path,
| instead of a separate process running on another CPU to
| generate, route, and send the ICMP packet.
|
| All these advantages would come from truncation being "in-
| band", following the same path as the data packets, instead
| of being a special "out-of-band" message like ICMP packets
| are.
|
| As for the flaws you listed:
|
| 1. The main complexity would come not from dealing with
| truncation itself (as you said, just treat it the same as a
| dropped packet), but from each protocol having to reflect
| the "truncated size" back to the original sender (like how
| ECN deals with marked packets).
|
| 2. The whole packet is not useless, the reason for
| truncating instead of turning it into an empty packet is to
| keep as much as possible of the higher layer headers. That
| is, _most_ of the packet might be useless, but the higher
| layer headers at the start can be useful.
|
| 3. On the other hand, you might need several of these
| "quick" rejections before arriving at the ideal size.
| sgtnoodle wrote:
| Both approaches seem like they wouldn't work with one-way
| routes? It's relatively rare, but I've worked on IP based
| telemetry systems with only one way, or highly asymmetric
| routes. Probably the most fun was tunneling IP from a space
| capsule, through the International Space Station's MIL-STD-1553
| databus, through Houston, to California.
| Scubabear68 wrote:
| Wow, so no way to run TCP or otherwise get ACKs or NACK
| like packets back? Yikes.
| jandrese wrote:
| Yeah. You wouldn't want to run TCP on those links anyway
| as the high latency and non-negligible loss rates would
| make TCP run like crap. This is where protocols like DTN
| are helpful, so long as you can interface your
| application with DTN. Fine for email, annoying for
| browsing, and painful for chat.
| zamadatix wrote:
| One way communication naturally makes it impossible to get
| any information back whatsoever so no dynamic solution is
| possible. Asymmetric should be able to be handled though,
| send and receive don't actually have to have the same MTU
| in this scenario.
|
| The fallback for anything (one way, lazy, filtered
| connection, truly anything) is you always have your minimum
| MTU where the protocol is guaranteed to work at a given
| size and if it's not the protocol knows the solution will
| need to lie outside itself. IPv4/IPv6 already have this
| today.
| Szpadel wrote:
| Serious question: how would that be different from sending a packet
| with the don't-fragment bit set?
| zamadatix wrote:
| The behavior of DF (don't fragment) is that you either forward the
| whole thing or drop the whole thing.
| zamadatix wrote:
| Say the active L2 path between two routers was:
| router A <-> switch B <-> switch C <-> router D
|
| Router A still needs to have a reliable, actively updating, way
| to detect the path MTU to Router D because something could
| change and the path turns into:
| router A <-> switch E <-> switch C <-> router D
|
| where switch E has a different MTU than the other equipment.
| Now you could go even further back and say the same happens for
| anything IP transits on top of but then the folks at Xerox PARC
| would probably wonder why the hell you're trying to make their
| hubs so damn smart and expensive and the people after them
| would probably wonder why the hell you're telling them to throw
| backwards compatibility with all the previous gear out the
| window in the name of making MTU slightly simpler.
|
| It's a decent solution, I think; you just have the time backwards
| - it's something we should look to do as we move towards the
| future and the overhead of such logic is tiny to add to
| networking rather than something to wish we had in the past
| when even dumb hubs were expensively complicated.
|
| This does leave one case unhandled though: you end up with a
| low MTU path at some point, things update so your traffic is
| going over a high MTU path, you still need to have some
| algorithm to discover this or the connection never recovers the
| performance.
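
A minimal sketch of the kind of upward search the last paragraph above
asks for, assuming a hypothetical probe(size) callback that returns True
when a packet of that size is acknowledged by the peer. This is a plain
binary search for illustration, not the algorithm from the article or
from any RFC, and it assumes that if a size works then every smaller
size works too.

```python
def probe_path_mtu(probe, low, high):
    """Largest size in [low, high] for which probe(size) is True.

    Assumes `low` is already known to work.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always progresses
        if probe(mid):
            low = mid       # mid got through; try larger sizes
        else:
            high = mid - 1  # mid was lost; try smaller sizes
    return low


# Simulated path that silently drops anything over 1492 bytes.
print(probe_path_mtu(lambda size: size <= 1492, 1280, 9000))  # 1492
```

Running the search again periodically is one way a connection could
notice that the path has moved back to a larger MTU.
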
| cesarb wrote:
| This discussion is about truncating at L3, not L2; it assumes
| that the L2 MTU is fixed and known by every directly
| connected router, but a L3 packet transits over several L2
| networks, each with its own MTU. A L2 network with variable
| MTU (or a MTU too small for Internet packets, like ATM) has
| its own independent segmentation and reassembly logic.
| zamadatix wrote:
| Well that's the thing, for the real world the L3's MTU is
| limited by the L2 MTU. If you can guarantee L2 MTU will be
| fixed and known for every device along the path then there
| is not really a need for a variable L3 MTU anymore and
| you've solved the problem by convincing everyone in the
| world they'll never need or want a different MTU again.
| Then someone gets the bright idea they can do something
| with tunnels and it starts back up.
|
| If you want a world where you can rely on an L2 to have a
| way to handle the MTU mismatch then nothing is being solved
| by this other than the ability to say "fragmentation is not
| in the current abstraction layer anymore, it's now an
| annoying problem on the layer below it instead" :).
| cesarb wrote:
| > If you can guarantee L2 MTU will be fixed and known for
| every device along the path then there is not really a
| need for a variable L3 MTU anymore
|
| The L2 MTU is fixed and known for every device _on that
| same L2 network_. The Internet is a network of networks;
| the reason we need to find the path MTU on the L3 network
| is that a single packet transits over several L2
| networks, and while each L2 network has a single MTU, the
| L3 path can vary.
| zamadatix wrote:
| The different L3 networks are connected by L2 networks
| in-between them, not just behind them. L2 doesn't
| disappear once you hit the router, you've still got to
| reach your peer's IP over L1/L2 somehow at some point. A
| great deal of internet peering is not even a L3 box <-
| direct L1+L2 connection -> L3 box connection, it's dumb
| L2 transport which provides the IP connectivity path for
| the BGP session to a more centralized router. Sometimes
| the path is a pseudowire doing the same with a functional
| MTU lower than the switched/routed path it rides on too.
|
| These underlying paths are not always static either, just
| because you have 9000 today doesn't mean when the path
| fails to a backup alternate tomorrow the MTU will be
| 9000. You have to get everyone involved on the internet
| to agree all links should now be 9000, now you can
| reliably set your router's outbound link to 9000 in all
| cases and rely on L3 fragmentation. Until someone wants
| to set their MTU to 12000 :). Even when I've had paid WAN
| transport with contracted MTU the MTU has lowered during
| carrier maintenance like firmware upgrades or unit
| replacements and I've had to call them up saying the MTU
| is broken because something like NFS servers will think
| they can statically set a known MTU on the path and it'll
| stay that way, a routed neighbor on the path would make
| the same error in this example.
| akira2501 wrote:
| Truncation and encryption are an unfavorable mix.
| egberts1 wrote:
| I grok IP-fragmentation because IDS/IPS/NDS/XDS all have to
| splice packets together to get to the payload data ... at 5Gbps.
|
| At that nosebleed speed, an FPGA is designed to do such
| defragmentation of IP as well as UDP reassembly and
| TCP desegmentation.
|
| Very fast.
|
| Including a bulk of the FPGA dedicated to the overlapping-payload
| tricks that hackers often play on such IDS, et al.
|
| That is probably the ONLY beef I have with RFCs covering
| IP/TCP/UDP: did not detail what to do in event of overlapping
| packets (first overlapped takes precedence or last overlapped
| overrides old data?)
| cesarb wrote:
| > That is probably the ONLY beef I have with RFCs covering
| IP/TCP/UDP: did not detail what to do in event of overlapping
| packets (first overlapped takes precedence or last overlapped
| overrides old data?)
|
| For a correctly operating sender, it shouldn't matter, since
| the data should be identical. It would be the same as asking
| what should happen if the sender changes the data when
| retransmitting a packet (the only difference from overlapping
| fragments is that a retransmitted packet has 100% overlap).
| It's Undefined Behavior, the receiver is allowed to do anything
| it wants, and the sender can't complain since it's its own
| fault.
|
| (Also consider that packets are always allowed to be reordered
| in transit, so what you thought was the first packet might
| become the last packet on the next hop; even if the standard
| constrained the reassembly ordering, you might still not get
| the result you expected.)
| Arnavion wrote:
| I believe egberts1's point is that leaving it unspecified
| invites different implementations to implement it
| differently. Some might choose to use the bytes from the
| first fragment, others might choose to use the bytes from the
| newer fragment. You could have a situation where the IDS /
| firewall / misc security appliance think the packets contain
| a benign request but the application server interprets them
| in a malicious way. Things like HTTP request smuggling rely
| on the same kind of mismatch at the application protocol
| layer, for example.
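
The mismatch described above can be shown with a toy reassembler that
takes a policy argument. This is only a sketch: fragments are modeled as
(offset, bytes) pairs in arrival order, and the function name and
"first"/"last" policy labels are invented for this example.

```python
def reassemble(fragments, policy="first"):
    """Reassemble (offset, data) fragments given in arrival order.

    policy="first": bytes already written are kept on overlap.
    policy="last":  later fragments overwrite earlier bytes.
    """
    total = max(offset + len(data) for offset, data in fragments)
    buf = bytearray(total)
    filled = [False] * total
    for offset, data in fragments:
        for i, byte in enumerate(data):
            pos = offset + i
            if policy == "first" and filled[pos]:
                continue  # keep the byte that arrived first
            buf[pos] = byte
            filled[pos] = True
    return bytes(buf)


frags = [(0, b"GET /index"), (5, b"admin")]
print(reassemble(frags, "first"))  # b'GET /index'
print(reassemble(frags, "last"))   # b'GET /admin'
```

If the inspection box uses one policy and the end host the other, they
see different requests from the same packets.
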
| raggi wrote:
| The problem is intractable unless you enforce lossless
| perfect ordering as a strict requirement. There's no way to
| define "first" as written in the premise without it.
| egberts1 wrote:
| Case in point, MS Windows does it one way, UNIX does it
| another way.
| zamadatix wrote:
| 5 Gbps is pretty slow. Internet carrier routers typically
| connect to each other at 100 Gbps (or 400 Gbps these days) and
| a single router will have many such connections, so Tbps of
| bandwidth, often costing about the same as that IPS FPGA
| hardware (well, the routers go up to rack size in which case
| you'd probably want to compare to a rack size IPS box cost).
| wmf wrote:
| Yes, analyzing hostile traffic is much more expensive than
| routing.
| zamadatix wrote:
| Well yes, you could gather that, but also highlighting that
| what inspection boxes do in regards to F&R does not
| necessarily relate to the problems covered in the article
| because F&R has to work extremely quickly across the
| internet, often ruling out anything related to how the box in
| the middle could see it and instead being about what
| endpoints can do independently via things like DPLPMTUD
| instead.
|
| The boxes do run into very interesting problems and
| knowledge space though, just more on the inverse of the
| problem at hand (how to properly act like a client
| receiving already fragmented data).
| egberts1 wrote:
| Or properly behave as all-client allowable rather than A
| client-permitted
| egberts1 wrote:
| There is a 25Gbps XNS in the works.
|
| With even more live updating of newly added Aho-Corasick and
| regex algorithms into the FPGA.
| zvmaz wrote:
| > So you think you understand IP fragmentation?
|
| Titles like these sound unnecessarily arrogant to me and are off-
| putting.
| zokier wrote:
| Is there any place for IP fragmentation anymore? On surface level
| look, the main motivation seems to have been fitting over-sized
| dnssec responses in single datagram, but with ecdsa/eddsa is that
| still a relevant concern? I realize some TXT etc data might be
| big, but presumably those could also be queried over
| TCP/HTTP/QUIC/etc; I see UDP more as the fast path for common
| A/AAAA queries, and those should fit in unfragmented packets just
| fine?
|
| What are other cases where fragmentation truly makes sense?
| convolvatron wrote:
| Well, if you read the article, the issue isn't really
| fragmentation as a mechanism, it's finding the appropriate
| (dynamic) path MTU. that's kinda gonna be a problem forever
| given the stateless nature of IP. it would be fine if people
| hadn't decided that ICMP was .. undesirable.
|
| I guess you could just say '1480 for all time' and be done with
| it
| tptacek wrote:
| There has never really been a place for it; "Fragmentation
| Considered Harmful" is one of the original and most famous
| "Considered Harmfuls" and it's from the late 1980s. A lot of
| protocol engineering, from MSS options to EDNS0, goes into
| ensuring that you never hit the fragmentation case in the first
| place. It's been kept around as a mechanism of last resort for
| weirdo hops with bizarro MTUs.
| rayiner wrote:
| I'm surprised by the willful disregard of the feature, e.g.,
| with some routers just dropping fragmented packets. Sending
| packets bigger than the MTU seems like a reasonable thing to
| want to do--especially given that IP by design won't
| guarantee a stable path between endpoints. Was the reasoning
| that it's always better handled at a higher protocol layer?
| tptacek wrote:
| Yes, that's the reasoning. Fragmentation is kind of a
| performative half-measure. You're not _really_ enabling
| delivery across varying network links, in that the
| performance is so bad (necessarily!) that it alters the
| service model for many protocols.
|
| IPv6 moves to an end-to-end fragmentation model, but even
| then the right answer is to negotiate in a higher level
| protocol a maximum segment size for the path you're talking
| on, and then just avoid fragmentation entirely.
| Fragmentation is an absolutely wretched stream transport
| protocol!
| zamadatix wrote:
| Fragmentation/MTU is right up there with verifying delivery as
| one of those things that can sound like a really easy problem to
| solve until you start trying to solve it.
| peter_d_sherman wrote:
| I wonder if IP fragmentation would/could have any effect on the
| ability for a Nation State's (i.e., China, Russia, North Korea,
| ?, ???) firewall to prevent packets from entering/leaving a
| blocked country...
|
| Why or why not?
|
| In fact, now that I think about it... what happens if a TCP/IP
| connection is initiated from another country to a TCP/IP address
| inside of an outwardly blocked country?
|
| Would that act similar to a NAT punchthrough, but on a Nation-
| State's firewall?
|
| Think about it this way... let's take an arbitrary example,
| China...
|
| And let's suppose for the purposes of discussion that what the
| western Mainstream Media says about China's Internet is
| completely true -- that outgoing connections to the West from
| China -- are blocked.
|
| OK, but there still is Alibaba -- the equivalent of China's
| Amazon.com.
|
| Connections FROM the West (e-commerce connections, "we buy from
| you", "you are a merchant to us", etc., etc.) TO Alibaba.com (in
| China) -- are not blocked, and would NOT be blocked by the
| Chinese government!
|
| Why?
|
| Because anywhere there's _trade_ -- if the trade is beneficial to
| the government in question (anyone heard of _taxes_, anyone?
| They benefit the government you know!) -- there will be internal
| political pressure _NOT_ to _prevent_ that trade from occurring!
|
| So let's say that a TCP/IP connection that looked like a
| legitimate e-commerce connection -- was initiated FROM the West,
| to an IP address in China for a legitimate Chinese e-commerce
| site...
|
| Now, couldn't that connection, once established (since it is,
| after all bi-directional) -- be used to _send data_ outside of
| China?
|
| Yes!
|
| But to only one outside/external IP address!
|
| But _what if_ that IP address on the Western side (being in a
| non-firewalled country) -- now could somehow route that data to
| any other IP address around the world?
|
| Sort of like a VPN -- but the connection is opened up from an
| INCOMING connection, not an outgoing one...
|
| Would that be the equivalent of _NAT punchthrough_ -- for a
| Nation State's firewall?
|
| ?
|
| Before you answer, I'm guessing that there are Deep State bots
| (from all countries!) that will try to derail this
| conversation...
|
| That's a testament to _"always think for yourself"_ (let logic
| be your guide!) -- stay away from so called "expert" opinions,
| especially one or two sentence "no it can't be done" or "it would
| never work" replies by fake posters with fake names from fake
| accounts!
|
| But, I guess we'll open up the floor to the AI powered agenda-
| driven bots... foreign and domestic! :-)
|
| Also, note that I don't suggest/condone that anyone actually do
| any of the above -- this discussion is theoretical in scope only!
| vitus wrote:
| > Connections FROM the West (e-commerce connections, "we buy
| from you", "you are a merchant to us", etc., etc.) TO
| Alibaba.com (in China) -- are not blocked, and would NOT be
| blocked by the Chinese government!
|
| Um, conspiracy theories aside, Alibaba has servers in the US.
| peter_d_sherman wrote:
| Which would have two-way data paths (inbound/outbound) to the
| Alibaba company (or other e-commerce company in the case of a
| different e-commerce company) in mainland China, or, more
| broadly, to the interior of other countries whose Nation-
| State firewall blocks outgoing connections...
|
| Point is, _if there's a way out, then there's a way in_...
|
| How does the data on the Alibaba servers in the U.S. get
| updated? Probably not by carrier pigeon... Someone (or a
| group of people) in the Alibaba offices in China updates that
| server when prices change, when new products are added, etc.,
| etc.
|
| And how does that happen, if there isn't an outgoing Internet
| connection?
|
| And if so, then that's proof of a non-blocked outgoing
| connection from China.
|
| And if that's true, then we need to raise the question of
| _which_ types of outgoing connections are _selectively
| permitted_ from China (because obviously the Alibaba update
| on U.S. servers is permitted!)
|
| Which would in turn start to crumble the western mainstream
| media narrative that China blocks all outgoing connections...
|
| In other words, if China doesn't block all outgoing
| connections, then we need to know (if we care at all about
| global Internet censorship) which types of connections are
| blocked, which ones aren't, and what the exact criteria for a
| blocked or permitted connection is...
|
| Because if you are correct, then it would seem that
| connections from mainland China to Alibaba's U.S. servers are
| NOT blocked...
|
| Conspiracy theories aside!
| fch42 wrote:
| I'm amazed by this; call it "PMTUD Broadside" or some such. The
| idea is really a bit egg-of-columbus. Anyway, it tickles my geek
| senses.
___________________________________________________________________
(page generated 2024-02-17 23:00 UTC)