[HN Gopher] QUIC for the kernel
___________________________________________________________________
QUIC for the kernel
Author : Bogdanp
Score : 174 points
Date : 2025-07-31 15:57 UTC (7 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| WASDx wrote:
| I recall this article on QUIC disadvantages:
| https://www.reddit.com/r/programming/comments/1g7vv66/quic_i...
|
| Seems like this is a step in the right direction to resolve some
| of those issues. I suppose nothing is preventing it from getting
| hardware support in future network cards as well.
| miohtama wrote:
| QUIC does not work very well for use cases like machine-to-
| machine traffic. However, most of the traffic on the Internet
| today is from mobile phones to servers, and that is where QUIC
| and HTTP/3 shine.
|
| For other use cases we can keep using TCP.
| thickice wrote:
| Why doesn't QUIC work well for machine-to-machine traffic?
| Is it due to the lack of offloads/optimizations that TCP
| enjoys, and that machine-to-machine traffic tends to be high
| volume/high rate?
| yello_downunder wrote:
| QUIC would work okay, but not really have many advantages
| for machine-to-machine traffic. Machine-to-machine you tend
| to have long-lived connections over a pretty good network.
| In this situation TCP already works well and is currently
| handled better in the kernel. Eventually QUIC will probably
| be just as good as TCP in this use case, but we're not
| there yet.
| jabart wrote:
| You still have latency, legacy window sizes, and packet
| schedulers to deal with.
| spwa4 wrote:
| But that is the huge advantage of QUIC: it does NOT
| totally outcompete TCP traffic on links (we already have
| BitTorrent over UDP for that purpose). They redesigned
| the protocol five times or so to achieve that.
| m00x wrote:
| It's explained in the reddit thread. Most of it is because
| you have to handle a ton of what TCP does in userland.
| extropy wrote:
| The NAT firewalls do not like P2P UDP traffic. The majority of
| the routers lack the smarts to pass QUIC through correctly;
| they need to treat it the same as TCP, essentially.
| beeflet wrote:
| NAT is the devil. bring on the IPoc4lypse
| hdgvhicv wrote:
| NAT is massively useful for all sorts of reasons that have
| nothing to do with IP limitations.
| unethical_ban wrote:
| Rather, NAT is a bandage for all sorts of reasons besides
| IP exhaustion.
|
| Example: Janky way to get return routing for traffic when
| you don't control enterprise routes.
|
| Source: FW engineer
| beeflet wrote:
| sounds great but it fucks up P2P in residential
| connections, where it is mostly used due to ipv4 address
| conservation. You can still have nat in IPv6 but
| hopefully I won't have to deal with it
| paulddraper wrote:
| The NAT RFC talks purely about IP exhaustion.
|
| What do you have in mind?
| dan-robertson wrote:
| I think basically there is currently a lot of overhead and,
| when you control the network more and everything is more
| reliable, you can make tcp work better.
| exabrial wrote:
| For starters, why encrypt something literally in the same
| datacenter, 6 feet away? It adds significant latency and
| processing overhead.
| 20k wrote:
| Because the NSA actively intercepts that traffic. There's
| a reason why encryption is non-optional.
| Karrot_Kream wrote:
| To me this seems outlandish (e.g. if you're part of PRISM
| you know what's happening and you're forced to comply.)
| But to think through this threat model, you're worried
| that the NSA will tap intra-DC traffic but not that it
| will try to install software or hardware on your hosts to
| spy on traffic at the NIC level? I guess it would be harder
| to intercept and untangle traffic at the NIC level than
| intra-DC, but I'm not sure?
| viraptor wrote:
| > you're worried that the NSA will tap intra-DC traffic
| but not that it will try to install software or hardware
| on your hosts
|
| It doesn't have to be one or the other. We've known for
| over a decade that the traffic between DCs was tapped:
| https://www.theguardian.com/technology/2013/oct/30/google-re...
| Extending that to intra-DC wouldn't be surprising at all.
|
| Meanwhile backdoored chips and firmware attacks are a
| constant worry and shouldn't be discounted regardless of
| the first point.
| adgjlsfhk1 wrote:
| The difference between tapping intra-DC traffic and on-host
| spying is that on-host spying is much more likely to get
| caught and much less easily able to get data out. There's a
| pretty big difference between software/hardware weaknesses
| that require specific targeting to exploit and passively
| scooping everything up and scanning it.
| exabrial wrote:
| Imaginary problems are the funnest to solve.
| cherryteastain wrote:
| If you are concerned about this, how do you think you
| could protect against AWS etc allowing NSA to snoop on
| you from the hypervisor level?
| switchbak wrote:
| Service meshes often encrypt traffic that may be running
| on the same physical host. Your security policy may
| simply require this.
| lll-o-lll wrote:
| To stop or slow down the attacker who is inside your
| network and trying to move horizontally? Isn't this the
| principle of defense in depth?
| sleepydog wrote:
| Encryption gets you data integrity "for free". If a bit
| is flipped by faulty hardware, the packet won't decrypt.
| TCP checksums are not good enough for catching corruption
| in many cases.
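|
| (The TCP checksum is the 16-bit ones'-complement Internet
| checksum from RFC 1071. A minimal C sketch of it shows why
| it's weak: two errors that cancel in the sum, or two swapped
| 16-bit words, produce the same checksum and go undetected.)
|
|     #include <stdint.h>
|     #include <stddef.h>
|
|     uint16_t inet_checksum(const uint8_t *p, size_t len)
|     {
|         uint32_t sum = 0;
|         for (; len > 1; len -= 2, p += 2)
|             sum += (uint32_t)(p[0] << 8 | p[1]); /* 16-bit words */
|         if (len)
|             sum += (uint32_t)p[0] << 8;  /* odd trailing byte */
|         while (sum >> 16)                /* fold carries back in */
|             sum = (sum & 0xffff) + (sum >> 16);
|         return (uint16_t)~sum;
|     }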
| mschuster91 wrote:
| Because any random machine in the same datacenter and
| network segment might be compromised and do stuff like
| running ARP spoofing attacks. Cisco alone has had so many
| vendor-provided backdoors cropping up that I wouldn't
| trust _anything_ in a data center with Cisco gear.
| dahfizz wrote:
| > QUIC is meant to be fast, but the benchmark results included
| with the patch series do not show the proposed in-kernel
| implementation living up to that. A comparison of in-kernel QUIC
| with in-kernel TLS shows the latter achieving nearly three times
| the throughput in some tests. A comparison between QUIC with
| encryption disabled and plain TCP is even worse, with TCP winning
| by more than a factor of four in some cases.
|
| Jesus, that's bad. Does anyone know if userspace QUIC
| implementations are also this slow?
| klabb3 wrote:
| Yes, they are. Worse, I've seen them shrink down to nothing in
| the face of congestion with TCP traffic. If Quic is indeed the
| future protocol, it's a good thing to move it into the kernel
| IMO. It's just madness to provide these massive userspace impls
| everywhere, on a _packet switched_ protocol no less, and
| expect it to beat good old TCP. Wouldn't surprise me if we need
| optimizations all the way down to the NIC layer, and maybe even
| middleboxes. Oh and I haven't even mentioned the CPU cost of
| UDP.
|
| OTOH, TCP is like a quiet guy at the gym who always wears baggy
| clothes but does 4 plates on the bench when nobody is looking.
| Don't underestimate. I wasted months to learn that lesson.
| vladvasiliu wrote:
| Why is QUIC being pushed, then?
| dan-robertson wrote:
| The problem it is trying to solve is not overhead of the
| Linux kernel on a big server in a datacenter
| favflam wrote:
| I know in the p2p space, peers have to send lots of small
| pieces of data. QUIC stops stream blocking on a single
| packet delay.
| toast0 wrote:
| It has good properties compared to tcp-in-tcp (http/2),
| especially when connected to clients without access to
| modern congestion control on iffy networks. http/2 was
| perhaps adopted too broadly; binary protocol is useful,
| header compression is useful (but sometimes dangerous), but
| tcp multiplexing is bad, unless you have very low loss ...
| it's not ideal for phones with inconsistent networking.
| fkarg wrote:
| because it _does_ provide a number of benefits (potentially
| fewer initial round-trips, more dynamic routing control by
| using UDP instead of TCP, etc), and it's currently a userspace
| software implementation compared with a hardware-accelerated
| option.
|
| QUIC getting hardware acceleration should close this gap,
| and keep all the benefits. But a kernel (software)
| implementation is basically necessary before it can be
| properly hardware-accelerated in future hardware (is my
| current understanding)
| 01HNNWZ0MV43FF wrote:
| To clarify, the userspace implementation is not a
| benefit, it's just that you can't have a brand new
| protocol dropped into a trillion dollars of existing
| hardware overnight, you have to do userspace first as PoC
|
| It does save 2 round-trips during connection compared to
| TLS-over-TCP, if Wikipedia's diagram is accurate:
| https://en.wikipedia.org/wiki/QUIC#Characteristics That
| is a decent latency win on every single connection, and
| with 0-RTT you can go further, but 0-RTT is stateful and
| hard to deploy and I expect it will see very little use.
| klabb3 wrote:
| From what I understand the "killer app" initially was
| because of mobile spotty networks. TCP is interface (and
| IP) specific, so if you switch from WiFi to LTE the conn
| breaks (or worse, degrades/times out slowly). QUIC has a
| logical conn id that continues to work even when a peer
| changes the path. Thus, your YouTube ads will not buffer.
|
| Secondarily, you have the reduced RTT, multiple streams
| (prevents HOL blocking), datagrams (realtime video on same
| conn) and you can scale buffers (in userspace) to avoid BDP
| limits imposed by kernel. However.. I think in practice
| those haven't gotten as much visibility and traction, so
| the original reason is still the main one from what I can
| tell.
| wahern wrote:
| MPTCP provides interface mobility. It's seen widespread
| deployment with the iPhone, so network support today is
| much better than one would assume. Unlike QUIC, the
| changes required by applications are minimal to none. And
| it's backward compatible; an application can request
| MPTCP, but if the other end doesn't support it,
| everything still works.
| dan-robertson wrote:
| I think the 'fast' claims are just different. QUIC is meant to
| make things fast by:
|
| - having a lower latency handshake
|
| - avoiding some badly behaved 'middleware' boxes between users
| and servers
|
| - avoiding resetting connections when user IP addresses change
|
| - avoiding head of line blocking / the increased cost of many
| connections ramping up
|
| - avoiding poor congestion control algorithms
|
| - probably other things too
|
| And those are all things about working better with the kind of
| network situations you tend to see between users (often on
| mobile devices) and servers. I don't think QUIC was meant to be
| fast by reducing OS overhead on sending data, and one should
| generally expect it to be slower for a long time until
| operating systems become better optimised for this flow and
| hardware supports offloading more of the work. If you are
| Google then presumably you are willing to invest in specialised
| network cards/drivers/software for that.
| dahfizz wrote:
| Yeah I totally get that it optimizes for different things.
| But the trade offs seem way too severe. Does saving one round
| trip on the handshake mean anything at all if you're only
| getting _one fourth_ of the throughput?
| dan-robertson wrote:
| Are you getting one fourth of the throughput? Aren't you
| going to be limited by:
|
| - bandwidth of the network
|
| - how fast the nic on the server is
|
| - how fast the nic on your device is
|
| - whether the server response fits in the amount of data
| that can be sent given the client's initial receive window
| or whether several round trips are required to scale the
| window up such that the server can use the available
| bandwidth
| brokencode wrote:
| Maybe it's a fourth as fast in ideal situations with a fast
| LAN connection. Who knows what they meant by this.
|
| It could still be faster in real world situations where the
| client is a mobile device with a high latency, lossy
| connection.
| eptcyka wrote:
| There are claims of 2x-3x operating costs on the server
| side to deliver better UX for phone users.
| yello_downunder wrote:
| It depends on the use case. If your server is able to
| handle 45k connections but 42k of them are stalled because
| of mobile users with too much packet loss, QUIC could look
| pretty attractive. QUIC is a solution to some of the
| problematic aspects of TCP that couldn't be fixed without
| breaking things.
| drewg123 wrote:
| The primary advantage of QUIC for things like congestion
| control is that companies like Google are free to
| innovate both sides of the protocol stack (server in
| prod, client in chrome) simultaneously. I believe that
| QUIC uses BBR for congestion control, and the major
| advantage that QUIC has is being able to get a bit more
| useful info from the client with respect to packet loss.
|
| This could be achieved by encapsulating TCP in UDP and
| running a custom TCP stack in userspace on the client.
| That would allow protocol innovation without throwing
| away 3 decades of optimizations in TCP that make it 4x as
| efficient on the server side.
| jeroenhd wrote:
| > - avoiding some badly behaved 'middleware' boxes between
| users and servers
|
| Surely badly behaving middleboxes won't just ignore UDP
| traffic? If anything, they'd get confused about udp/443 and
| act up, forcing clients to fall back to normal TCP.
| zamadatix wrote:
| Your average middlebox will just NAT UDP (unless it's
| outright blocked by security policy) and move on. It's TCP
| where many middleboxes think they can "help" the congestion
| signaling, latch more deeply into the session information,
| or worse. Unencrypted protocols can have further
| interference under either TCP or UDP beyond this note.
|
| QUIC is basically about taking all of the information
| middleboxes like to fuck with in TCP, putting it under the
| encryption layer, and packaging it back up in a UDP packet
| precisely so it's either just dropped or forwarded. In
| practice this (i.e. QUIC either being just dropped or left
| alone) has actually worked quite well.
| rayiner wrote:
| It's an interesting testament to how well designed TCP is.
| adgjlsfhk1 wrote:
| IMO, it's more a testament to how fast hardware designers can
| make things with 30 years to tune.
| eptcyka wrote:
| QUIC performance requires careful use of batching. Using UDP
| sockets naively, i.e. sending one QUIC packet per syscall, will
| incur a lot of overhead - every time the kernel has to figure
| out which interface to use, queue it up on a buffer, and all
| the rest. If one uses it like TCP, batching up lots of data and
| enqueuing packets in one "call" helps a ton. Similarly, the
| kernel wireguard implementation can be slower than wireguard-go
| since it doesn't batch traffic. At the speeds offered by modern
| hardware, we really need to use vectored I/O to be efficient.
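|
| A minimal sketch of that batching with sendmmsg(2), assuming a
| connected UDP socket fd and an array of already-built packets
| (error handling omitted):
|
|     #define _GNU_SOURCE
|     #include <sys/socket.h>
|     #include <sys/uio.h>
|     #include <string.h>
|
|     #define BATCH 32
|
|     /* Flush up to BATCH packets with ONE syscall instead of
|      * one send() per packet; route/interface lookup and
|      * syscall overhead are paid once per batch. */
|     int send_batch(int fd, struct iovec *pkts, unsigned n)
|     {
|         struct mmsghdr msgs[BATCH];
|         if (n > BATCH)
|             n = BATCH;
|         memset(msgs, 0, sizeof(msgs));
|         for (unsigned i = 0; i < n; i++) {
|             msgs[i].msg_hdr.msg_iov = &pkts[i];
|             msgs[i].msg_hdr.msg_iovlen = 1;
|         }
|         return sendmmsg(fd, msgs, n, 0);
|     }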
| Veserv wrote:
| Yes. msquic is one of the best performing implementations and
| only achieves ~7 Gbps [1]. The benchmarks for the Linux kernel
| implementation only get ~3 Gbps to ~5 Gbps with encryption
| disabled.
|
| To be fair, the Linux kernel TCP implementation only gets ~4.5
| Gbps at normal packet sizes and still only achieves ~24 Gbps
| with large segmentation offload [2]. Both of which are
| ridiculously slow. It is straightforward to achieve ~100
| Gbps/core at normal packet sizes without segmentation offload
| with the same features as QUIC with a properly designed
| protocol and implementation.
|
| [1] https://microsoft.github.io/msquic/
|
| [2]
| https://lwn.net/ml/all/cover.1751743914.git.lucien.xin@gmail...
| 0x457 wrote:
| I would expect that a protocol such as TCP performs much better
| than QUIC in benchmarks. Now do a realistic benchmark over a
| roaming LTE connection and come back with the results.
|
| Without seeing actual benchmark code, it's hard to tell if you
| should even care about that specific result.
|
| If your goal is to pipe lots of bytes from A to B over internal
| or public internet, there probably aren't many things, if any,
| that can outperform TCP. Decades were spent optimizing TCP for
| that. If HOL blocking isn't an issue for you, then you can keep
| using HTTP over TCP.
| jeffbee wrote:
| This seems to be a categorical error, for reasons that are
| contained in the article itself. The whole appeal of QUIC is
| being immune to ossification, being free to change parameters of
| the protocol without having to beg Linux maintainers to agree.
| corbet wrote:
| Ossification does not come about from the decisions of "Linux
| maintainers". You need to look at the people who design, sell,
| and deploy middleboxes for that.
| jeffbee wrote:
| I disagree. There is _plenty_ of ossification coming from
| inside the house. Just some examples off the top of my head
| are the stuck-in-1974 minimum RTO and ack delay time
| parameters, and the unwillingness to land microsecond
| timestamps.
| otterley wrote:
| Not a networking expert, but does TCP in IPv6 suffer the
| same maladies?
| pumplekin wrote:
| Yes.
|
| Layer 4 TCP is pretty much just slapped on top of Layer 3
| IPv4 or IPv6 in exactly the same way for both of them.
|
| Outside of some little nitpicky things like details on
| how TCP MSS clamping works, it is basically the same.
| ComputerGuru wrote:
| ...which is basically how it's supposed to work (or how
| we teach that it's supposed to work). (Not that you said
| anything to the contrary!)
| 0xbadcafebee wrote:
| The "middleboxes" excuse for not improving (or replacing)
| protocols in the past was horseshit. If a big incumbent
| player in the networking world releases a new feature that
| everyone wants (but nobody else has), everyone else
| (including 'middlebox' vendors) will bend over backwards to
| support it, because if you don't your competitors will and
| then you lose business. It was never a technical or
| logistical issue, it was an economic and supply-demand issue.
|
| To prove it:
|
| 1. Add a new OSI Layer 4 protocol called "QUIC" and give it a
| new protocol number, and just for fun, change the UDP frame
| header semantics so it can't be confused for UDP.
|
| 2. Then release kernel updates to support the new protocol.
|
| Nobody's going to use it, right? Because internet routers,
| home wireless routers, servers, shared libraries, etc would
| all need their TCP/IP stacks updated to support the new
| protocol. If we can't ship it over a weekend, it takes too
| long!
|
| But wait. What if ChatGPT/Claude/Gemini/etc _only_ supported
| communication over that protocol? You know what would happen:
| every vendor in the world would backport firmware patches
| overnight, bending over backwards to support it. Because they
| can _smell_ the money.
| toast0 wrote:
| IMHO, you likely want the server side to be in the kernel, so
| you can get to performance similar to in-kernel TCP, and
| ossification is less of a big deal, because it's "easy" to
| modify the kernel on the server side.
|
| OTOH, you want to be in user land on the client, because
| modifying the kernel on clients is hard. If you were Google,
| maybe you could work towards a model where Android clients
| could get their in-kernel protocol handling to be something
| that could be updated regularly, but that doesn't seem to be
| something Google is willing or able to do; Apple and Microsoft
| can get priority kernel updates out to most of their users
| quickly; Apple also can influence networks to support things
| they want their clients to use (IPv6, MP-TCP). </rant>
|
| If you were happy with congestion control on both sides of TCP,
| and were willing to open multiple TCP connections like http/1,
| instead of multiplexing requests on a single connection like
| http/2, (and maybe transfer a non-pessimistic bandwidth
| estimate between TCP connections to the same peer), QUIC still
| gives you control over retransmission that TCP doesn't, but I
| don't think that would be compelling enough by itself.
|
| Yes, there's still ossification in middle boxes doing TCP
| optimization. My information may be old, but I was under the
| impression that nobody does that in IPv6, so the push for v6 is
| both a way to avoid NAT and especially CGNAT, but also a way to
| avoid optimizer boxes as a benefit for both network providers
| (less expense) and services (less frustration).
| jeffbee wrote:
| This is a perspective, but just one of many. The overwhelming
| majority of IP flows are within data centers, not over
| planet-scale networks between unrelated parties.
| toast0 wrote:
| I've never been convinced by an explanation of how QUIC
| applies for flows in the data center.
|
| Ossification doesn't apply (or it shouldn't, IMHO, the
| point of Open Source software is that you can change it to
| fit your needs... if you don't like what upstream is doing,
| you _should_ be running a local fork that does what you
| want... yeah, it's nicer if it's upstreamed, but try
| running a local fork of Windows or MacOS); you can make
| congestion control work for you when you control both
| sides; enterprise switches and routers aren't messing with
| tcp flows. If you're pushing enough traffic that this is an
| issue, the cost of QUIC seems _way_ too high to justify,
| even if it helps with some issues.
| jeffbee wrote:
| I don't see why this exception to the end-to-end
| principle should exist. At the scale of single hosts
| today, with hundreds of CPUs and hundreds of tenants in a
| single system sharing a kernel, the kernel itself becomes
| an unwanted middlebox.
| jeroenhd wrote:
| Unless you're using QUIC as some kind of datacenter-to-
| datacenter protocol (basically as SCTP on steroids with
| TLS), I don't think QUIC in the datacenter makes much sense
| at all.
|
| As very few server administrators bother turning on
| features like MPTCP, QUIC has an advantage on mobile phones
| with moderate to bad reception. That's not a huge issue for
| me most of the time, but billions of people are using
| mobile phones as their only access to the internet,
| especially in developing countries that are practically
| skipping widespread copper and fiber infrastructure and
| moving directly to 5G instead. Any service those people are
| using should probably consider implementing QUIC, and if
| they use it, they'd benefit from an in-kernel server.
|
| All the data center operators can stick to (MP)TCP, the
| telco people can stick to SCTP, but the consumer facing
| side of the internet would do well to keep QUIC as an
| option.
| mschuster91 wrote:
| > That's not a huge issue for me most of the time, but
| billions of people are using mobile phones as their only
| access to the internet, especially in developing
| countries that are practically skipping widespread copper
| and fiber infrastructure and moving directly to 5G
| instead.
|
| For what it's worth: Romania, one of the piss poorest
| countries of Europe, has a perfectly fine mobile phone
| network, and even outback small villages have XGPON fiber
| rollouts everywhere. Germany? As soon as you cross into
| the country from Austria, your phone signal instantly
| drops, barely any decent coverage outside of the cities.
| And forget about PON, much less GPON or even XGPON.
|
| Germany should be considered a developing country when it
| comes to expectations around telecommunication.
| ComputerGuru wrote:
| One thing is that congestion control choice is sort of cursed:
| it assumes your box/side is the one being switched while the
| majority of the rest of the internet continues with legacy
| limitations (aside from DCTCP, which is designed for intra-
| datacenter usage). That matters because the resultant/emergent
| network behavior changes drastically depending on whether or
| not all sides are using the same algorithm. (Cubic is
| technically another sort-of-exception, at least since it became
| the default Linux CC algorithm, but even then you're still
| dealing with all sorts of middleware with legacy and/or
| pathological stateful behavior you can't control.)
| Karrot_Kream wrote:
| Do you think putting QUIC in the kernel will significantly
| ossify QUIC? If so, how do you want to deal with the
| performance penalty for the actual syscalls needed? Your
| concern makes sense to me as the Linux kernel moves slower than
| userspace software and middleboxes sometimes never update their
| kernels.
| GuB-42 wrote:
| The _protocol_ itself is resistant to ossification, no matter
| how it is implemented.
|
| It is mostly achieved by using encryption, and it is a reason
| why it is such an important and mandatory part of the protocol.
| The idea is to expose as little as possible of the protocol
| between the endpoints, the rest is encrypted, so that
| "middleboxes" can't look at the packet and do funny things
| based on their own interpretation of the protocol stack.
|
| Endpoints can still do whatever they want, and ossification can
| still happen, but it helps against ossification at the
| infrastructure level, which is the worst. Updating the linux
| kernel on your server is easier than changing the proprietary
| hardware that makes up the network backbone.
|
| The use of UDP instead of doing straight QUIC/IP is also an
| anti-ossification technique, as your app can just use UDP and a
| userland library regardless of the QUIC kernel implementation.
| In theory you could do that with raw sockets too, but that's
| much more problematic: since you don't have ports, you need
| the entire interface for yourself, and often root access.
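|
| To illustrate that last point as a sketch (protocol number 253
| is the one reserved for experiments by RFC 3692):
|
|     #include <sys/socket.h>
|     #include <netinet/in.h>
|
|     /* A userland QUIC library needs only an ordinary,
|      * unprivileged UDP socket, demultiplexed by port: */
|     int udp = socket(AF_INET, SOCK_DGRAM, 0);
|
|     /* A hypothetical QUIC-directly-over-IP would need a raw
|      * socket: CAP_NET_RAW (often root), and with no ports you
|      * claim the whole protocol number on that interface: */
|     int raw = socket(AF_INET, SOCK_RAW, 253);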
| darksaints wrote:
| For the love of god, can we please move to microkernel-based
| operating systems already? We're adding a million lines of code
| to the Linux kernel _every year_. That's so much attack surface
| area. We're setting ourselves up for a Kessler syndrome of sorts
| with every system that we add to the kernel.
| mdavid626 wrote:
| Most of that code is not loaded into the kernel; it's only
| loaded when needed.
| darksaints wrote:
| True, but the last time I checked (several years ago), the
| size of the portion of code that is not drivers or kernel
| modules was still 7 million lines of code, and the average
| system still has to load a few million more via kernel
| modules and drivers. That is still a phenomenally large
| attack surface.
|
| The seL4 kernel is 10k lines of code. OKL4 is 13k. QNX is
| ~30k.
| arp242 wrote:
| Can I run Firefox or PostgreSQL with reasonable performance
| on SeL4, OKL4, or QNX?
| doubled112 wrote:
| Reasonable performance includes GPU acceleration for both
| rendering and decoding media, right?
| 0x457 wrote:
| yes
| regularfry wrote:
| You've still got a combinatorial complexity problem though,
| because you never know what a specific user is going to load.
| beeflet wrote:
| Often you do know what a specific user is going to load
| wosined wrote:
| I might be wrong, but microkernels also need drivers, so
| wouldn't the attack surface be the same?
| kaoD wrote:
| You're not wrong, but monolithic kernel drivers run at a
| privilege level even higher than root (ring 0), while
| microkernels run them in userspace, so they're only as
| dangerous as a normal program.
| pessimizer wrote:
| _" Just think of the power of ring-0, muhahaha! Think of
| the speed and simplicity of ring-0-only and identity-
| mapping. It can change tasks in half a microsecond because
| it doesn't mess with page tables or privilege levels.
| Inter-process communication is effortless because every
| task can access every other task's memory.
|
| "It's fun having access to everything."_
|
| -- Terry A. Davis
| beeflet wrote:
| > Inter-process communication is effortless because every
| task can access every other task's memory.
|
| I think this would get messy quick in an OS designed by
| more than one person
| 01HNNWZ0MV43FF wrote:
| Redox is a microkernel written in Rust
| valorzard wrote:
| Would this (eventually) include the unreliable datagram
| extension?
| wosined wrote:
| Don't know if it could get faster than UDP if it is on top of
| it.
| valorzard wrote:
| The use case for this would be running a multiplayer game
| server over QUIC
| 01HNNWZ0MV43FF wrote:
| Other use cases include video / audio streaming, VPNs over
| QUIC, and QUIC-over-QUIC (you never know)
| wosined wrote:
| The general web is slowed down by bloated websites. But I guess
| this can make game latency lower.
| fmbb wrote:
| https://en.m.wikipedia.org/wiki/Jevons_paradox
|
| The Jevons Paradox is applicable in a lot of contexts.
|
| More efficient use of compute and communications resources will
| lead to higher demand.
|
| In games this is fine. We want more, prettier, smoother,
| pixels.
|
| In scientific computing this is fine. We need to know those
| simulation results.
|
| On the web this is not great. We don't want more ads, tracking,
| JavaScript.
| 01HNNWZ0MV43FF wrote:
| No, the last 20 years of browser improvements have made my
| static site incredibly fast!
|
| I'm benefiting from WebP, JS JITs, Flexbox, zstd, Wasm, QUIC,
| etc, etc
| Ericson2314 wrote:
| What will the socket API look like for multiple streams? I guess
| it is implied it is the same as multiple connections, with
| caching behind the scenes.
|
| I would hope for something more explicit, where you get a
| connection object and then open streams from it, but I guess that
| is fine for now.
|
| https://github.com/microsoft/msquic/discussions/4257 ah but look
| at this --- unless this is an extension, the _server_ side can
| also create new streams, once a connection is established. The
| client creating new "connections" (actually streams) cannot
| abstract over this. Something fundamentally new _is_ needed.
|
| My guess is recvmsg to get a new file descriptor for new stream.
| gte525u wrote:
| I would look at the SCTP socket API; it supports multistreaming.
| Ericson2314 wrote:
| I checked that out and....yuck!
|
| - Send specifies which stream by _ordinal number_? (Can't
| have different parts of a concurrent app independently open
| new streams)
|
| - Receive doesn't specify which stream at all?!
| wahern wrote:
| API RFC is
| https://datatracker.ietf.org/doc/html/draft-lxin-quic-socket...
| Ericson2314 wrote:
| Ah fuck, it still has a stream_id notion
|
| How are socket APIs always such garbage....
| Bender wrote:
| I don't know about using it in the kernel but I would love to see
| OpenSSH support QUIC so that I get some of the benefits of Mosh
| [1] while still having all the features of OpenSSH including
| SFTP, SOCKS, port forwarding, fewer state-table and keep-alive
| issues, roaming support, etc... Could OpenSSH leverage the kernel
| support?
|
| [1] - https://mosh.org/
| wmf wrote:
| SSH would need a lot of work to replace its crypto and mux
| layers with QUIC. It's probably worth starting from scratch to
| create a QUIC login protocol. There are a bunch of different
| approaches to this in various states of prototyping out there.
| Bender wrote:
| Fair points. I suppose Mosh would be the proper starting
| point then. I'm just selfish and want the benefits of QUIC
| without losing all the really useful features of OpenSSH.
| bauruine wrote:
| OpenSSH is an OpenBSD project therefore I guess a Linux api
| isn't that interesting but I could be wrong ofc.
| Bender wrote:
| That's a good point. At least it would not be an entirely new
| idea. [1] Curious what reactions he received.
|
| [1] - https://papers.freebsd.org/2022/eurobsdcon/jones-making_free...
| kibwen wrote:
| I'm confused, I thought the revolution of the past decade or so
| was in moving network stacks to userspace for better performance.
| michaelsshaw wrote:
| The constant mode switching for hardware access is slow. TCP/IP
| remains in the kernel for Windows and Linux.
| wmf wrote:
| Performance comes from dedicating core(s) to polling, not from
| userspace.
| shanemhansen wrote:
| You are right but it's confusing because there are two
| different approaches. I guess you could say both approaches
| improve performance by eliminating context switches and system
| calls.
|
| 1. Kernel bypass combined with DMA and techniques like
| dedicating a CPU to packet processing improve performance.
|
| 2. What I think of as "removing userspace from the data plane"
| improves performance for things like sendfile and ktls.
|
| To your point, Quic in the kernel seems to not have either
| advantage.
| FuriouslyAdrift wrote:
| So... RDMA?
| michaelsshaw wrote:
| No, the first technique describes the basic way they
| already operate, DMA, but giving access to userspace
| directly because it's a zerocopy buffer. This is handled by
| the OS.
|
| RDMA is directly from bus-to-bus, bypassing all the
| software.
| zamalek wrote:
| What is done for that is userspace gets the network data
| directly without (I believe) involving syscalls. It's not
| something you'd do for end-user software, only the likes of
| MOFAANG need it.
|
| _In theory_ the likes of io_uring would bring these benefits
| across the board, but we haven't seen that delivered (yet, I
| remain optimistic).
| phlip9 wrote:
| I'm hoping we get there too with io_uring. It looks like the
| last few kernel releases have made a lot of progress with
| zero-copy TCP rx/tx, though NIC support is limited and you
| need some finicky network iface setup to get the flow
| steering working
|
| https://docs.kernel.org/networking/iou-zcrx.html
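|
| Even before zero-copy rx lands everywhere, io_uring already
| collapses the syscall-per-packet cost. A rough sketch with
| liburing, assuming a connected socket fd and n <= 64 prepared
| buffers (error handling mostly omitted):
|
|     #include <liburing.h>
|
|     int submit_batch(int fd, struct iovec *bufs, unsigned n)
|     {
|         struct io_uring ring;
|         if (io_uring_queue_init(64, &ring, 0) < 0)
|             return -1;
|         for (unsigned i = 0; i < n; i++) {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|             if (!sqe)
|                 break;
|             io_uring_prep_send(sqe, fd, bufs[i].iov_base,
|                                bufs[i].iov_len, 0);
|         }
|         int ret = io_uring_submit(&ring); /* one syscall for all */
|         io_uring_queue_exit(&ring);
|         return ret;
|     }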
| 0xbadcafebee wrote:
| Networking is much faster in the kernel. Even faster on an
| ASIC.
|
| Network stacks were moved to userspace because Google wanted to
| replace TCP itself (and upgrade TLS), but it only cared about
| the browser, so they just put the stack _in_ the browser, and
| problem solved.
| Karrot_Kream wrote:
| You still need to offload your bytes to a NIC buffer. Either
| you can do something like DMA where you get privileged space to
| write your bytes to that the NIC reads from or you have to
| cross the syscall barrier and have your kernel write the bytes
| into the NIC's buffer. Crossing the syscall barrier adds a huge
| performance penalty due to the switch in memory space and
| privilege rings so userspace networking only makes sense if
| you're not having to deal with the privilege changes or you
| have DMA.
| Veserv wrote:
| That is only a problem if you do one or more syscalls per
| packet, which is an utterly bone-headed design.
|
| The copy itself is going at 200-400 Gbps so writing out a
| standard 1,500 byte (12,000 bit) packet takes 30-60 ns (in
| steady state with caches being prefetched). Of course you get
| slaughtered if you stupidly do a syscall (~100 ns hardware
| overhead) per packet since that is like 300% overhead. You
| just batch like 32 packets so the write time is ~1,000-2,000
| ns then your overhead goes from 300% to 10%.
|
| At a 1 Gbps throughput, that is ~80,000 packets per second or
| one packet per ~12.5 us. So, waiting for a 32 packet batch
| only adds an additional 500 us to your end-to-end latency in
| return for 4x efficiency (assuming that was your bottleneck;
| which it is not for these implementations as they are nowhere
| near the actual limits). If you go up to 10 Gbps, that is
| only 50 us of added latency, and at 100 Gbps you are only
| looking at 5 us of added latency for a literal 4x efficiency
| improvement.
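|
| Or, as loose C, with the assumed constants from above:
|
|     double copy_ns    = 12000.0 / 400e9 * 1e9; /* ~30 ns/packet */
|     double syscall_ns = 100.0;                 /* per syscall   */
|     double per_packet = syscall_ns / copy_ns;        /* ~3.3x, the "300%" */
|     double per_batch  = syscall_ns / (32 * copy_ns); /* ~0.1, the "10%"   */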
| toast0 wrote:
| Most QUIC stacks are built upon in-kernel UDP. You get
| significant performance benefits if you can avoid your traffic
| going through kernel and userspace and the context switches
| involved.
|
| You can work that angle by moving networking into user space...
| setting up the NIC queues so that user space can access them
| directly, without needing to context switch into the kernel.
|
| Or you can work the angle by moving networking into kernel
| space ... things like sendfile which let a tcp application
| instruct the kernel to send a file to the peer without needing
| to copy the content into userspace and then back into kernel
| space and finally into the device memory, if you have in-kernel
| TLS with sendfile then you can continue to skip copying to
| userspace; if you have NIC based TLS, the kernel doesn't need
| to read the data from the disk; if you have NIC based TLS and
| the disk can DMA to the NIC buffers, the data doesn't need to
| even hit main memory. Etc
|
| But most QUIC stacks don't get benefit from either side of
| that. They're reading and writing packets via syscalls, and
| they're doing all the packetization in user space. No chance to
| sendfile and skip a context switch and skip a copy. Batching io
| via io_uring or similar helps with context switches, but
| probably doesn't prevent copies.
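|
| For contrast, the in-kernel TCP path described above is
| roughly this sketch, assuming the TLS handshake already
| finished in userspace and crypto_info holds the negotiated
| keys (e.g. a struct tls12_crypto_info_aes_gcm_128):
|
|     #include <sys/sendfile.h>
|     #include <sys/socket.h>
|     #include <netinet/tcp.h>
|     #include <linux/tls.h>
|
|     ssize_t serve_file_ktls(int sock, int file_fd, off_t size,
|                             struct tls12_crypto_info_aes_gcm_128 *ci)
|     {
|         /* attach the kernel TLS ULP and hand it the tx keys */
|         if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")))
|             return -1;
|         if (setsockopt(sock, SOL_TLS, TLS_TX, ci, sizeof(*ci)))
|             return -1;
|         /* file -> socket, encrypted in-kernel; the payload
|          * never crosses into userspace */
|         off_t off = 0;
|         return sendfile(sock, file_fd, &off, size);
|     }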
| qwertox wrote:
| I recently had to add `ssl_preread_server_name` to my NGINX
| configuration in order to `proxy_pass` requests for certain
| domains to another NGINX instance. In this setup, the first
| instance simply forwards the raw TLS stream (with
| `proxy_protocol` prepended), while the second instance handles
| the actual TLS termination.
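|
| For reference, that first hop looks roughly like this in the
| stream module (names and addresses are placeholders):
|
|     stream {
|         map $ssl_preread_server_name $backend {
|             example.com  10.0.0.2:443;
|             default      127.0.0.1:8443;
|         }
|         server {
|             listen 443;
|             ssl_preread on;
|             proxy_protocol on;
|             proxy_pass $backend;
|         }
|     }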
|
| This approach works well when implementing a failover mechanism:
| if the default path to a server goes down, you can update DNS A
| records to point to a fallback machine running NGINX. That
| fallback instance can then route requests for specific domains to
| the original backend over an alternate path without needing to
| replicate the full TLS configuration locally.
|
| However, this method won't work with HTTP/3. Since HTTP/3 uses
| QUIC over UDP and encrypts the SNI during the handshake,
| `ssl_preread_server_name` can no longer be used to route based on
| domain name.
|
| What alternatives exist to support this kind of SNI-based routing
| with HTTP/3? Is the recommended solution to continue using
| HTTP/1.1 or HTTP/2 over TLS for setups requiring this behavior?
| jcgl wrote:
| Hm, that's a good question. I suppose the same would apply to
| TCP+TLS with Encrypted Client Hello as well, right? Presumably
| the answer would be the same/similar between the two.
| jbritton wrote:
| The article didn't discuss ACK. I have often wondered if it makes
| sense for the protocol to not have ACKs, and to leave that up to
| the application layer. I feel like the application layer has to
| ensure this anyway, so I don't know how much benefit it is to
| additionally support this at a lower layer.
| xgpyc2qp wrote:
| Looks good. QUIC is a real game changer for many. The Internet
| should be a little faster with it. Probably we will not care
| because of 5G, but it's still valuable. I'm surprised that there
| is a separate TLS handshake; I was thinking that QUIC embeds
| TLS, but it seems like I am wrong.
| another_twist wrote:
| I have a question - the bottleneck for TCP is said to be the
| handshake. But that can be solved by reusing connections and/or
| multiplexing. The current QUIC implementation is 3-4x slower than
| the in-kernel TCP implementation, and the performance gap is
| expected to close.
|
| If speed is touted as the advantage for QUIC and it is in fact
| slower, why bother with this protocol? The author of the PR
| himself attributes some of the speed issues to the protocol
| design. Are there other problems in TCP that need fixing?
| morning-coffee wrote:
| That's just one bottleneck. The other issue is head-of-line
| blocking. When there is packet loss on a TCP connection,
| nothing sent after that is delivered until the loss is
| repaired.
| anonymousiam wrote:
| TCP windowing fixes the issue you are describing. Make the
| window big and TCP will keep sending when there is a packet
| loss. It will also retry and usually recover before the end
| of the window is reached.
|
| https://en.wikipedia.org/wiki/TCP_window_scale_option
| quietbritishjim wrote:
| The statement in the comment you're replying to is still
| true. While waiting for those missed packets, the later
| packets will not be dropped if you have a large window
| size. But they won't be delivered either. They'll be cached
| in the kernel, even though it may be that the application
| could make use of them before the earlier blocked packet.
| morning-coffee wrote:
| They are unrelated. Larger windows help achieve higher
| throughput over paths with high delay. You allude to
| selective acknowledgements as a way to repair loss before
| the window completely drains which is true, but my point is
| that no data can be _delivered_ to the application until
| the loss is repaired (and that repair takes at least a
| round-trip time). (Then the follow-on effects from noticed
| loss on the congestion controller can limit subsequent in-
| flight data for a time, etc, etc.)
| Twirrim wrote:
| The queuing discipline used by default (pfifo_fast) is
| barely more than 3 FIFO queues bundled together. The 3
| queues allow for a barest minimum semblance of
| prioritisation of traffic, where Queue 0 > 1 > 2, and you
| can tweak some tcp parameters to have your traffic land in
| certain queues. If there's something in queue 0 it must be
| processed first _before_ anything in queue 1 gets touched
| etc.
|
| Those queues operate purely head-of-queue basis. If what is
| at the top of the queue 0 is blocked in any way, the whole
| queue behind it gets stuck, regardless of if it is talking
| to the same destination, or a completely different one.
|
| I've seen situations where a glitching network card caused
| some serious knock on impacts across a whole cluster,
| because the card would hang or packets would drop, and that
| would end up blocking the qdisc on a completely healthy
| host that was in the middle of talking to it, which would
| have impacts on any other host that happened to be talking
| to that _healthy_ host. A tiny glitch caused much wider
| impacts than you'd expect.
|
| The same kind of effect would happen from a VM that went
| through live migration. The tiny, brief pause would cause a
| spike of latency all over the place.
|
| There are classful alternatives like fq_codel that can be
| used, that can mitigate some of this, but you do have to
| pay a small amount of processing overhead on every packet,
| because now you have a queuing discipline that actually
| needs to track some semblance of state.
| another_twist wrote:
| What's the packet loss rate on modern networks? Curious.
| reliablereason wrote:
| That depends on how much data you are pushing. If you are
| pushing 200 Mbps on a 100 Mbps line you will get 50% packet
| loss.
| spwa4 wrote:
| Well, yes, that's the idea behind TCP itself, but a
| "normal" rate of packet loss is something along the lines
| of 5/100k packets dropped on any given long-haul link.
| Let's say a random packet passes about 8 such links, so a
| "normal" rate of packet loss is 0.025% or so.
| wmf wrote:
| It can be high on cellular.
| adgjlsfhk1 wrote:
| ~80% when you step out of wifi range on your cell phone.
| jauntywundrkind wrote:
| The article discusses many of the reasons QUIC is currently
| slower. Most of them seem to come down to "we haven't done any
| optimization for this yet".
|
| > _Long offers some potential reasons for this difference,
| including the lack of segmentation offload support on the QUIC
| side, an extra data copy in transmission path, and the
| encryption required for the QUIC headers._
|
| All of these three reasons seem potentially very addressable.
|
| It's worth noting that the benchmark here is on pristine
| network conditions, a drag race if you will. If you are on
| mobile, your network will have a lot more variability, and
| there TCP's design limits are going to become much more
| apparent.
|
| TCP itself often has protocols run on top of it, to do
| QUIC-like things. HTTP/2 is an example of this. So when you compare
| QUIC and TCP, it's kind of like comparing how fast a car goes
| with how fast an engine bolted to a frame with wheels on it
| goes. QUIC goes significantly up the OSI network stack, is
| layer 5+, where-as TCP+TLS is layer 3. Thats less system
| design.
|
| QUIC also has wins for connecting faster, and especially for
| reconnecting faster. It also has IP mobility: if you're on
| mobile and your IP address changes (happens!) QUIC can keep the
| session going without rebuilding it once the client sends the
| next packet.
|
| It's a fantastically well thought out & awesome advancement,
| radically better in so many ways. The advantages of having
| multiple non-blocking streams (like SCTP) massively reduces
| the scope that higher level protocol design has to take on. And
| all that multi-streaming stuff being in the kernel means it's
| deeply optimizable in a way TCP can never enjoy.
|
| Time to stop driving the old rust bucket jalopy of TCP around
| everywhere, crafting weird elaborate handmade shit atop it. We
| need a somewhat better starting place for higher level
| protocols and man oh man is QUIC alluring.
| redleader55 wrote:
| > QUIC goes significantly up the OSI network stack, is layer
| 5+, where-as TCP+TLS is layer 3
|
| IP is layer 3 - network (ensures packets are routed to the
| correct host). TCP is layer 4 - transport (some people argue
| that TCP has functions from layer 5, e.g. establishing
| sessions between apps), while TLS adds a few functions from
| layer 6 (e.g. encryption), which QUIC also has.
| w3ll_w3ll_w3ll wrote:
| TCP is layer 4 in the OSI model.
| frmdstryr wrote:
| The "advantage" is tracking via the server provided connection
| ID
| https://www.cse.wustl.edu/~jain/cse570-21/ftp/quic/index.htm...
| gethly wrote:
| I've been hearing about QUIC for ages, yet it is still an obscure
| tech and will likely end up like IPv6.
| gfody wrote:
| isn't it just http3 now?
| tralarpa wrote:
| Your browser is using it when you watch a video on youtube
| (HTTP/3).
| Jtsummers wrote:
| QUIC is already out and in active use. Every major web browser
| supports it, and it's not like IPv6. There's no fundamental
| infrastructure change needed to support it since it's built on
| top of UDP. The end points obviously have to support it, but
| that's the same as any other protocol built on UDP or TCP (like
| HTTP, SNMP, etc.).
| rstuart4133 wrote:
| > yet it is still an obscure tech and will likely end up like
| IPv6.
|
| Probably. According to Google, IPv6 has a measly 46% of
| internet traffic now [0], and growing at about 5% per year.
| QUIC is 40% of Chrome traffic, and is growing at 5% every two
| years [1]. So yeah, their fates do look similar, which is to
| say both are headed for world domination in a couple of
| decades.
|
| [0] https://dnsmadeeasy.com/resources/the-state-of-ipv6-adoption...
|
| [1] https://www.cellstream.com/2025/02/14/an-update-on-quic-adop...
| gethly wrote:
| When you remove IoT, those numbers will look very
| different.
___________________________________________________________________
(page generated 2025-07-31 23:00 UTC)