[HN Gopher] QUIC is not quick enough over fast internet
___________________________________________________________________
QUIC is not quick enough over fast internet
Author : Shank
Score : 619 points
Date : 2024-09-09 02:34 UTC (20 hours ago)
(HTM) web link (dl.acm.org)
(TXT) w3m dump (dl.acm.org)
| jacob019 wrote:
| Maybe moving the connection protocol into userspace isn't such a
| great plan.
| foota wrote:
| I don't have access to the article, but they're saying the
| issue is due to client side ack processing. I suspect they're
| testing at bandwidths far beyond what's normal for consumer
| applications.
| dartharva wrote:
| It's available on arxiv and nope, they are testing mostly for
| regular 4G/5G speeds.
|
| https://arxiv.org/pdf/2310.09423
| DannyBee wrote:
| 4g tops out at 1gbps only when one person is on the
| network. 5g tops out at ~10gbps (some 20gbps i guess) only
| when one person is on the network.
|
| They are testing at 1gbps.
|
| This is not regular 4g speed for sure, and it's a rare 5g
| speed. regular 5g speed is (in the US) 40-50mbps, so, 20x
| slower than they are testing.
| dartharva wrote:
| Still won't be beyond normal consumer applications'
| capacity, right?
| DannyBee wrote:
| correct
| izend wrote:
| What about 1Gbps fiber at home? It is becoming common in
| Canada. I have 1Gbps up/down.
| DannyBee wrote:
| This would affect that.
|
| As said, I was only replying to the claim that this
| affects things at 4G/5G cell phone speeds, which it
| clearly does not, by their own data.
| vrighter wrote:
| Gigabit fiber internet is quite cheap and increasingly
| available (I'm not from the US). I don't just use the
| internet over a 4/5g connection. This definitely affects
| more people than you think.
| DannyBee wrote:
| I think it affects lots of people.
|
| I have 5gbps internet at home myself.
|
| But that is not what i was replying to. I was replying to
| the claim that this affects regular 4g/5g cell phone
| speeds. The data is clear that it does not.
| KaiserPro wrote:
| HTTP/1.1 has been around for 28 years. At the time,
| gigabit Ethernet was _expensive_ and 9600 baud on mobile
| was rare.
|
| And yet HTTP/1.1 runs on gigabit networks pretty well.
| yencabulator wrote:
| Your 5G has 0.23ms ping to the average webserver?
| spacebacon wrote:
| See arXiv link in comments.
| 01HNNWZ0MV43FF wrote:
| Does QUIC mandate that, or is that just the stepping stone
| until the chicken-and-egg problem is solved and we get kernel
| support?
| wmf wrote:
| On mobile the plan is to never use kernel support so that
| apps can have the latest QUIC on old kernels.
| vlovich123 wrote:
| As others in the thread have summarized, the paper says the
| issue is ack processing. That has nothing to do with whether
| the stack is in kernel space or user space. If anything, the
| concern runs the other way: the kernel is so slow-moving that
| updates take much longer to propagate to the applications
| needing them, whereas userspace stacks can be updated as the
| endpoint applications need them to be.
| kmeisthax wrote:
| No, but it depends on how QUIC works, how Ethernet hardware
| works, and how much you actually want to offload to the NIC.
| For example, QUIC has TLS encryption built-in, so anything
| that's encrypted can't be offloaded. And I don't think most
| people want to hand all their TLS keys to their NIC[0].
|
| At the very least you probably would have to assign QUIC its
| own transport, rather than using UDP as "we have raw sockets
| at home". Problem is, only TCP and UDP reliably traverse the
| Internet[1]. _Everything_ in the middle is sniffing traffic,
| messing with options, etc. In fact, Google rejected an
| alternate transport protocol called SCTP (which does all the
| stream multiplexing over a single connection that QUIC does)
| specifically because, among other things, SCTP's a transport
| protocol and middleboxes choke on it.
|
| [0] I am aware that "SSL accelerators" used to do exactly
| this, but in modern times we have perfectly good crypto
| accelerators right in our CPU cores.
|
| [1] ICMP sometimes traverses the internet, it's how ping
| works, but a lot of firewalls blackhole ICMP. Or at least
| they did before IPv6 made it practically mandatory to forward
| ICMP packets.
| _flux wrote:
| I don't think passing just the session keys to the NIC
| sounds so perilous, though.
| justinphelps wrote:
| SCTP had already solved the problem that QUIC proposes to
| solve. Google of all companies has the influence to
| properly implement and accommodate other L4 protocols. QUIC
| seems like doubling down on a hack and breaks the elegance
| of the OSI model.
| tepmoc wrote:
| SCTP still has some downsides it has to resolve:
| https://http3-explained.haxx.se/en/why-quic/why-tcpudp#why-n...
|
| Plus we need happy eyeballs for the transport if SCTP runs
| over IP and is not encapsulated:
| https://datatracker.ietf.org/doc/html/draft-grinnemo-taps-he
|
| But over IPv4 it's pretty much non-workable, since most
| end users are behind NAT and there's no known implementation
| to work around that.
| kccqzy wrote:
| The flexibility and ease of changing a userspace protocol IMO
| far outweighs anything else. If the performance problem
| described in this article (which I don't have access to) is in
| userspace QUIC code, it can be fixed and deployed very quickly.
| If similar performance issue were to be found in TCP, expect to
| wait multiple years.
| vrighter wrote:
| Well, the problem is probably that it is in userspace in the
| first place.
| mrweasel wrote:
| Maybe moving the entire application to the browser/cloud wasn't
| the best idea for a large number of use cases?
|
| Video streaming, sure, but we're already able to stream 4K
| video over a 25Mbit line. With modern internet connections
| being 200Mbit to 1Gbit, I don't see that we need the bandwidth
| in private homes. Maybe for video conferencing in large
| companies, but that also doesn't need to be 4K.
|
| The underlying internet protocols are old, so there's no harm
| in assessing whether they've outlived their usefulness.
| However, we should also consider whether web applications and
| "always connected" are truly the best solution for our day to
| day application needs.
| kuschku wrote:
| > With modern internet connections being 200Mbit to 1Gbit, I
| don't see that we need the bandwidth in private homes
|
| Private connections tend to be asymmetrical. In some cases,
| e.g. old DOCSIS versions, that used to be due to technical
| necessity.
|
| Private connections tend to be unstable, the bandwidth
| fluctuates quite a bit. Depending on country, the actually
| guaranteed bandwidth is somewhere between half of what's on
| the sticker, to nothing at all.
|
| Private connections are usually used by families, with
| multiple people using it at the same time. In recent years,
| you might have 3+ family members in a video call at the same
| time.
|
| So if you're paying for a 1000/50 line (as is common with
| DOCSIS deployments), what you're actually getting is usually
| a 400/20 line that sometimes achieves more. And those 20Mbps
| upload are now split between multiple people.
|
| At the same time, you're absolutely right - Gigabit is enough
| for most people. Download speeds are enough for quite a
| while. We should instead be increasing upload speeds and
| deploying FTTH and IPv6 everywhere to reduce the latency.
| throwaway2037 wrote:
| This is a great post. I often forget that home Internet
| connections are frequently shared between many people.
|
| This bit:
|
| > IPv6 everywhere to reduce the latency
|
| I am not an expert on IPv4 vs IPv6. Teach me: How will
| migrating to IPv6 reduce latency? As I understand, a lot of
| home Internet connections are always effectively IPv6 via
| CarrierNAT. (Am I wrong? Or not relevant to your point?)
| kuschku wrote:
| IPv4 routing is more complicated, especially with
| multiple levels of NAT applied.
|
| Google has measured to most customers about 20ms less
| latency on IPv6 than on IPv4, according to their IPv6
| report.
| simoncion wrote:
| > Google has measured to most customers about 20ms less
| latency on IPv6 than on IPv4, according to their IPv6
| report.
|
| I've run that comparison across four ISPs and never seen
| any significant difference in latency... not once in the
| decades I've had "dual stack" service.
|
| I imagine that Google is getting confounded by folks with
| godawful middle/"security"ware that is too stupid to know
| how to handle IPv6 traffic and just passes it through.
| throwaway2037 wrote:
| Overall, I like your post very much.
|
| > but that also doesn't need to be 4K.
|
| Here, I would say "need" is a strong term. Surely, you are
| correct at the most basic level, but if the bandwidth exists,
| then _some_ streaming platforms will use it. Deeper question:
| Is there any practical use case for Internet connections
| above 1Gbit? I struggle to think of any. Yes, I can
| understand that people may wish to reduce latency, but I
| don't think home users need any more bandwidth at this point.
| I am astonished when I read about 10Gbit home Internet access
| in Switzerland, Japan, and Korea.
|
| Zero trolling: Can you help me to better understand your last
| sentence?
|
| > However, we should also consider whether web applications
| and "always connected" are truly the best solution for our
| day to day application needs.
|
| I cannot tell if this is written with sarcasm. Let me ask
| more directly: Do you think it is a good design for our
| modern apps to always be connected or not? Honestly, I don't
| have a strong opinion on the matter, but I am interested to
| hear your opinion.
| mrweasel wrote:
| Generally speaking I think we should aim for offline first,
| always. Obvious things like Teams or Slack require an
| internet connection to work, but assuming a working
| internet connection shouldn't even be a requirement for a
| web browser.
|
| I think it is bad design to expect a working internet
| connection, because in many places you can't expect
| bandwidth to be cheap, or the connection to be stable. That's
| not to say that something like Google Docs (others seem to
| like it, but everyone in my company thinks it's awful)
| shouldn't be a thing; there's certainly value in the real time
| collaboration features, but it should be able to function
| without an internet connection.
|
| Last week someone was complaining about the S3 (sleep)
| feature on laptops, and one thing that came to my mind is
| that despite these being portable, we somehow expect them
| to be always connected to the internet. That just seems
| like a somewhat broken mindset to me.
| surajrmal wrote:
| Note that in deeper sleep states you typically see more
| aggressive limiting of what interrupts can take you out
| of the sleep state. Turning off network card interrupts
| is common.
| simiones wrote:
| The problem is that the biggest win by far with QUIC is merging
| encryption and session negotiation into a single packet, and
| the kernel teams have been adamant about not wanting to
| maintain encryption libraries in kernel. So, QUIC or any other
| protocol like it being in kernel is basically a non-starter.
| suprjami wrote:
| Looking forward to QUIC in the kernel. Linux already has kTLS.
| JoshTriplett wrote:
| Seems to be available on arXiv: https://arxiv.org/pdf/2310.09423
| Tempest1981 wrote:
| The page headings say "Conference'17, July 2017" -- why is
| that?
|
| Although the sidebar on page 1 shows "13 Oct 2023".
| mrngm wrote:
| It's likely the authors used an existing conference template
| to fit in their paper's contents. Upon sending it to the
| conference, the editors can easily fit the contents in their
| prescribed format, and the authors know how many characters
| they can fit in the page limit.
|
| arXiv typically contains pre-prints of papers. These may not
| have been peer-reviewed, and the contents may not reflect the
| actual "published" paper that was accepted (and/or corrected
| after peer review) to a conference or journal.
|
| arXiv applies a watermark to the submitted PDF such that
| different versions are distinguishable on download.
| JoshTriplett wrote:
| In the early days of QUIC, many people pointed out that the UDP
| stack has had far far less optimization put into it than the TCP
| stack. Sure enough, some of the issues identified here arise
| because the UDP stack isn't doing things that it _could_ do but
| that nobody has been motivated to make it do, such as UDP generic
| receive offload. Papers like this are very likely to lead to
| optimizations both obvious and subtle.
| Animats wrote:
| What is UDP offload going to _do_? UDP barely does anything but
| queue and copy.
|
| Linux scheduling from packet received to the thread getting
| control is not real-time, and if the CPUs are busy, it may be
| rather slow. That's probably part of the bottleneck.
|
| The embarrassing thing is that QUIC, even in Google's own
| benchmarks, only improved performance by about 10%. The added
| complexity probably isn't worth the trouble. However, it gave
| Google control of more of the stack, which may have been the
| real motivation.
| Vecr wrote:
| I think one of the original drivers was the ability to
| quickly tweak parameters, after Linux rejected what I think
| was userspace adjustment of window sizing to be more
| aggressive than the default.
|
| The Linux maintainers didn't want to be responsible for
| congestion collapse, but UDP lets you spray packets from
| userspace, so Google went with that.
| JoshTriplett wrote:
| Among other things, GRO (receive offloading) means you can
| get more data off of the network card in fewer operations.
|
| Linux has receive packet steering, which can help with
| getting packets from the network card to the right CPU and
| the right userspace thread without moving from one CPU's
| cache to another.
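|
| As a rough illustration (not something the paper prescribes),
| opting a socket in to UDP GRO on Linux looks something like
| the sketch below; the fallback #define is only needed on older
| headers, and the coalesced-segment size arrives as a UDP_GRO
| control message on each recvmsg():
|
|     #include <netinet/in.h>     /* IPPROTO_UDP */
|     #include <sys/socket.h>
|
|     #ifndef UDP_GRO
|     #define UDP_GRO 104         /* from <linux/udp.h>, Linux >= 5.0 */
|     #endif
|
|     /* Ask the kernel to coalesce received UDP datagrams so one
|      * recvmsg() can return a whole batch instead of a single
|      * ~1500-byte datagram. */
|     static int enable_udp_gro(int fd)
|     {
|         int on = 1;
|         return setsockopt(fd, IPPROTO_UDP, UDP_GRO, &on, sizeof(on));
|     }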
| suprjami wrote:
| RPS is just software RSS.
|
| You mean Receive Flow Steering, and RFS can only control
| RPS, so to do it in hardware you actually mean Accelerated
| RFS (which requires a pretty fancy NIC these days).
|
| Even ignoring the hardware requirement, unfortunately it's
| not that simple. I find results vary wildly whether you
| should put process and softirq on the same CPU core
| (sharing L1 and L2) or just on the same CPU socket (sharing
| L3 but don't constantly blow out L1/L2).
|
| Eric Dumazet said years ago at a Netdev.conf that L1 cache
| sizes have really not kept up with reality. That matches my
| experience.
|
| QUIC doing so much in userspace adds another class of
| application which has a so-far uncommon design pattern.
|
| I don't think it's possible to say whether any QUIC
| application benefits from RFS or not.
| infogulch wrote:
| In my head the main benefit of QUIC was always multipath, aka
| the ability to switch interfaces on demand without losing the
| connection. There's MPTCP but who knows how viable it is.
| rocqua wrote:
| Mptcp sees use in the Telco space, so they probably know.
| modeless wrote:
| Is that actually implemented and working in practice? My
| connection still hangs whenever my wifi goes out of
| range...
| Sesse__ wrote:
| Apple's Siri is using MPTCP, so it is presumably viable.
| jshier wrote:
| It requires explicit backend support, and Apple supports
| it for many of their services, but I've never seen
| another public API that does. Anyone have any examples?
| mh- wrote:
| Last I looked into this (many years), ELB/GLBs didn't
| support it on AWS/GCP respectively. That prevented us
| from further considering implementing it at the time
| (mobile app -> AWS-hosted EC2 instances behind an ELB).
|
| Not sure if that's changed, but at the time it wasn't
| worth having to consider rolling our own LBs.
|
| To answer your original question, no, I haven't
| (knowingly) seen it on any public APIs.
| suprjami wrote:
| I always thought the main benefit of QUIC was to encrypt
| the important part of the transport header, so endpoints
| control their own destiny, not some middle device.
|
| If I had a dollar for every firewall vendor who thought
| dropping TCP retransmissions or TCP Reset was a good
| idea...
| amluto wrote:
| Last I looked (several months ago), Linux's UDP stack did not
| seem well tuned in its memory management accounting.
|
| For background, the mental model of what receiving network
| data looks like in userspace is almost completely backwards
| compared to how general-purpose kernel network receive
| actually works. User code thinks it allocates a buffer (per-
| socket or perhaps a fancier io_uring scheme), then receives
| packets into that buffer, then processes them.
|
| The kernel is the other way around. The kernel allocates
| buffers and feeds pointers to those buffers to the NIC. The
| NIC receives packets and DMAs them into the buffers, then
| tells the kernel. But the NIC and the kernel have _absolutely
| no concept_ of which socket those buffers belong to until
| _after_ they are DMAed into the buffers. So the kernel cannot
| possibly map received packets to the actual recipient 's
| memory. So instead, after identifying who owns a received
| packet, the kernel retroactively charges the recipient for
| the memory. This happens on a per-packet basis, it involves
| per-socket _and_ cgroup accounting, and there is no support
| for having a socket "pre-allocate" this memory in advance of
| receiving a packet. So the accounting is gnarly, involves
| atomic operations, and seems quite unlikely to win any speed
| awards. On a very cursory inspection, the TCP code seemed
| better tuned, and it possibly also won by generally handling
| more bytes per operation.
|
| Keep in mind that the kernel _can't_ copy data to
| application memory synchronously -- the application memory
| might be paged out when a packet shows up. So instead the
| whole charging dance above happens immediately when a packet
| is received, and the data is copied later on.
|
| For quite a long time, I've thought it would be nifty if
| there was a NIC that kept received data in its own RAM and
| then allowed it to be efficiently DMAed to application memory
| when the application was ready for it. In essence, a lot of
| the accounting and memory management logic could move out of
| the kernel into the NIC. I'm not aware of anyone doing this.
| fragmede wrote:
| RDMA is common for high performance applications but it
| doesn't work over the Internet.
| Danieru wrote:
| It's a good thing the NIC is connected over PCIe then.
| shaklee3 wrote:
| You can do GPUdirect over the Internet without RDMA
| though.
| jpgvm wrote:
| GPUDirect relies on the PeerDirect extensions for RDMA
| and is thus an extension to the RDMA verbs, not a
| separate and independent thing that works without RDMA.
| shaklee3 wrote:
| Again, you can do what I said. You may be using different
| terminology, but you can do GPUDirect in DPDK without
| RDMA.
| throw0101c wrote:
| > _RDMA is common for high performance applications but
| it doesn't work over the Internet._
|
| RoCEv2 is routable.
|
| * https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
|
| * https://docs.nvidia.com/networking/display/winofv55053000/ro...
|
| Of course you're going to get horrible latency because of
| speed-of-light limitations, so the definition of "work"
| may be weak, but data should be able to be transmitted.
| JoshTriplett wrote:
| > For quite a long time, I've thought it would be nifty if
| there was a NIC that kept received data in its own RAM and
| then allowed it to be efficiently DMAed to application
| memory when the application was ready for it.
|
| I wonder if we could do a more advanced version of receive-
| packet steering that sufficiently identifies packets as
| _definitely_ for a given process and DMAs them directly to
| that process's pre-provided buffers for later
| notification? In particular, can we offload enough
| information to a smart NIC that it can identify where
| something should be DMAed to?
| amluto wrote:
| I don't think the result would be compatible with the
| socket or io_uring API, but maybe io_uring could be
| extended a bit. Basically the kernel would
| opportunistically program a "flow director" or similar
| rule to send packets to special rx queue, and that queue
| would point to (pinned) application memory. Getting this
| to be compatible with iptables/nftables would be a mess
| or maybe entirely impossible.
|
| I've never seen the accelerated steering stuff work well
| in practice, sadly. The code is messy, the diagnostics
| are basically nonexistent, and it's not clear to me that
| many drivers support it well.
| mgaunard wrote:
| Most advanced NICs support flow steering, which makes the
| NIC write to different buffers depending on the target
| port.
|
| In practice though, you only have a limited amount of
| these buffers, and it causes complications if multiple
| processes need to consume the same multicast.
| eptcyka wrote:
| Multicast may well be shitcanned to an expensive slow
| path, given that multicast is rarely used for high
| bandwidth scenarios, especially when multiple processes
| need to receive the same packet.
| lokar wrote:
| The main real use of multicast I've seen is pretty high
| packet rate. High frequency traders get multicast feeds
| of tick data from the exchange.
| mgaunard wrote:
| multicast is precisely used for low-latency high-
| throughput message buses.
| eptcyka wrote:
| With multiple processes listening for the data? I think
| that's a market niche. In terms of billions of devices,
| multicast is mostly used for zero-config service
| discovery. I am not saying there isn't a market for high-
| bandwidth multicast, I am stating that for the vast
| majority of software deployments, multi-cast performance
| is not an issue. For whatever deployments it is an issue,
| they can specialize. And, as in the sibling comment
| mentions, people who need breakneck speeds have already
| proven that they can create a market for themselves.
| mgaunard wrote:
| That's not a market niche, that's the normal mode of
| operation of a message bus.
|
| The cloud doesn't implement multicast, but that doesn't
| mean it doesn't get used by people that build non-
| Internet networks and applications.
| derefr wrote:
| Presuming that this is a server that has One (public) Job,
| couldn't you:
|
| 1. dedicate a NIC to the application;
|
| 2. and have the userland app open a packet socket against
| the NIC, to drink from its firehose through MMIO against
| the kernel's own NIC DMA buffer;
|
| ...all without involving the kernel TCP/IP (or in this
| case, UDP/IP) stack, and any of the accounting logic
| squirreled away in there?
|
| (You _can_ also throw in a BPF filter here, to drop
| everything except UDP packets with the expected specified
| ip:port -- but if you 're already doing more packet
| validation at the app level, you may as well just take the
| whole firehose of packets and validate them for being
| targeted at the app at the same time that they're validated
| for their L7 structure.)
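|
| A bare sketch of steps 1-2 on Linux (hypothetical interface
| name, needs CAP_NET_RAW; a real setup would add a
| PACKET_RX_RING mmap and the BPF filter mentioned above instead
| of copying every frame):
|
|     #include <arpa/inet.h>        /* htons */
|     #include <linux/if_ether.h>   /* ETH_P_IP */
|     #include <linux/if_packet.h>  /* sockaddr_ll */
|     #include <net/if.h>           /* if_nametoindex */
|     #include <string.h>
|     #include <sys/socket.h>
|
|     /* Open an AF_PACKET socket bound to one dedicated NIC, so the
|      * application sees every IP frame on that interface without
|      * going through the kernel UDP socket layer. */
|     static int open_firehose(const char *ifname /* e.g. "eth1" */)
|     {
|         int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
|         if (fd < 0)
|             return -1;
|
|         struct sockaddr_ll sll;
|         memset(&sll, 0, sizeof(sll));
|         sll.sll_family   = AF_PACKET;
|         sll.sll_protocol = htons(ETH_P_IP);
|         sll.sll_ifindex  = if_nametoindex(ifname);
|         if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
|             return -1;
|         return fd;
|     }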
| amluto wrote:
| I think DPDK does something like this. The NIC is
| programmed to aim the packets in question at a specific
| hardware receive queue, and that queue is entirely owned
| by a userspace program.
|
| A lot of high end NICs support moderately complex receive
| queue selection rules.
| SSLy wrote:
| > _1. dedicate a NIC to the application;_
|
| you need to respond to ICMPs, which have a different
| protocol number than UDP or TCP.
| veber-alex wrote:
| Have you looked into NVIDIA VMA?
|
| https://docs.nvidia.com/networking/display/vmav9860/introduc...
| rkagerer wrote:
| Why don't we eliminate the initial step of an app reserving
| a buffer, keep each packet in its own buffer, and once the
| socket it belongs to is identified hand a pointer and
| ownership of that buffer back to the app? If buffers can be
| of fixed (max) size, you could still allow the NIC to fill
| a bunch of them in one go.
| apitman wrote:
| Ditching head of line blocking is potentially a big win, but
| I really wish it wouldn't have come with so much complexity.
| raggi wrote:
| UDP offload implicitly gets you the following today:
|
| - 64 packets per syscall, which is enough data to amortize
| the syscall overhead - a single packet is not.
|
| - UDP offload optionally lets you defer checksum computation,
| often offloading it to hardware.
|
| - UDP offload lets you skip/reuse route lookups for
| subsequent packets in a bundle.
|
| What UDP offload is no good for though, is large scale
| servers - the current APIs only work when the incoming packet
| chains neatly organize into batches per peer socket. If you
| have many thousands of active sockets you'll stop having full
| bundles and the overhead starts sneaking back in. As I said
| in another thread, we really need a replacement for the BSD
| APIs here, they just don't scale for modern hardware
| constraints and software needs - much too expensive per
| packet.
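|
| For reference, the 64-packets-per-syscall batching above looks
| roughly like this sketch (BATCH and MTU are arbitrary values
| chosen here, not anything mandated):
|
|     #define _GNU_SOURCE           /* for recvmmsg() */
|     #include <string.h>
|     #include <sys/socket.h>
|
|     #define BATCH 64
|     #define MTU   1500
|
|     /* Drain up to BATCH datagrams from a UDP socket in a single
|      * syscall, amortizing the per-syscall overhead. Returns the
|      * number of datagrams received, or -1. */
|     static int drain_socket(int fd, char bufs[BATCH][MTU])
|     {
|         struct mmsghdr msgs[BATCH];
|         struct iovec iovs[BATCH];
|         memset(msgs, 0, sizeof(msgs));
|         for (int i = 0; i < BATCH; i++) {
|             iovs[i].iov_base           = bufs[i];
|             iovs[i].iov_len            = MTU;
|             msgs[i].msg_hdr.msg_iov    = &iovs[i];
|             msgs[i].msg_hdr.msg_iovlen = 1;
|         }
|         return recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
|     }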
| 10000truths wrote:
| Bulk throughput isn't on par with TLS mainly because NICs
| with dedicated hardware for QUIC offload aren't commercially
| available (yet). Latency is undoubtedly better - the 1-RTT
| QUIC handshake substantially reduces time-to-first-byte
| compared to TLS.
| majke wrote:
| > What is UDP offload going to do?
|
| Handling ACK packets in kernelspace would be one thing -
| helping, for example, RTT estimation. With a userspace stack,
| ACKs are handled in the application and are subject to the
| scheduler, suffering a lot on a loaded system.
| morning-coffee wrote:
| There are no ACKs inherent in the _UDP_ protocol, so "UDP
| offload" is not where the savings are. There are ACKs in
| the _QUIC_ protocol and they are carried by UDP datagrams
| which need to make their way up to user land to be
| processed, and this is the crux of the issue.
|
| What is needed is for _QUIC offload_ to be invented/supported
| by HW so that most of the high-frequency/tiny-packet
| processing happens there, just as it does today for _TCP
| offload_. TCP large-send and large-receive offload is what is
| responsible for all the CPU savings, as the application deals
| in 64KB or larger sends/receives and the segmentation and
| receive coalescing all happen in hardware before an interrupt
| is even generated to involve the kernel, let alone userland.
| RachelF wrote:
| Also bear in mind that many of today's network cards have
| processors in them that handle much of the TCP/IP overhead.
| kccqzy wrote:
| That's mostly still for the data center. Which end-user
| network cards that I can buy can do TCP offloading?
| phil21 wrote:
| Unless I'm missing something here, pretty much any Intel
| nic released in the past decade should support tcp offload.
| I imagine the same is true for Broadcom and other vendors
| as well, but I don't have something handy to check.
| JoshTriplett wrote:
| Some wifi cards offload a surprising amount in order to do
| wake-on-wireless, but that's not for performance.
| throw0101c wrote:
| > _Which end-user network cards that I can buy can do TCP
| offloading?_
|
| Intel's I210 controllers support offloading:
|
| > _Other performance-enhancing features include IPv4 and
| IPv6 checksum offload, TCP /UDP checksum offload, extended
| Tx descriptors for more offload capabilities, up to 256 KB
| TCP segmentation (TSO v2), header splitting, 40 KB packet
| buffer size, and 9.5 KB Jumbo Frame support._
|
| * https://cdrdv2-public.intel.com/327935/327935-Intel%20Ethern...
|
| And cost US$ 22:
|
| * https://www.amazon.com/dp/B0728289M7/
| vel0city wrote:
| Practically every on-board network adapter I've had for
| over a decade has had TCP offload support. Even the network
| adapter on my cheap $300 Walmart laptop has hardware TCP
| offload support.
| suprjami wrote:
| All of them. You'd be hard pressed to buy a new NIC which
| _doesn 't_ have a raft of protocol offloads.
|
| Even those garbage tg3 things from 1999 that OEMs are still
| putting on board enterprise servers have some TCP offload
| capability.
| dilyevsky wrote:
| Not just that but TLS too. Starting with ConnectX-5, I think
| you can push kTLS down to the NIC. Don't think there's a QUIC
| equivalent for this.
| nextaccountic wrote:
| Do you mean that under the same workload, TCP will perform
| better?
| skywhopper wrote:
| The whole reason QUIC even exists in user space is because its
| developers were trying to hack a quick speed-up to HTTP rather
| than actually do the work to improve the underlying networking
| fundamentals. In this case the practicalities seem to have
| caught them out.
|
| If you want to build a better TCP, do it. But hacking one in on
| top of UDP was a cheat that didn't pay off. Well, assuming
| performance was even the actual goal.
| osmarks wrote:
| They couldn't have built it on anything but UDP because the
| world is now filled with poorly designed firewall/NAT
| middleboxes which will not route things other than TCP, UDP
| and optimistically ICMP.
| kbolino wrote:
| It already exists, it's called SCTP. It doesn't work over the
| Internet because there's too much crufty hardware in the
| middle that will drop it instead of routing it. Also,
| Microsoft refused to implement it in Windows and also banned
| raw sockets so it's impossible to get support for it on that
| platform without custom drivers that practically nobody will
| install.
|
| I don't know how familiar the developers of QUIC were with
| SCTP in particular but they were definitely aware of the
| problems that prevented a better TCP from existing. The only
| practical solution is to build something on top of UDP, but
| if even that option proves unworkable, then the only other
| possibility left is to fragment the Internet.
| suprjami wrote:
| I like (some aspects of) SCTP too but it's not a solution
| to this problem.
|
| If you've followed Dave Taht's bufferbloat stuff, the
| reason he lost faith in TCP is because middle devices have
| access to the TCP header and can interfere with it.
|
| If SCTP got popular, then middle devices would ruin SCTP in
| the same way.
|
| QUIC is the bufferbloat preferred solution because the
| header is encrypted. It's not possible for a middle device
| to interfere with QUIC. Endpoints, and only endpoints,
| control their own traffic.
| adgjlsfhk1 wrote:
| Counterpoint: it is paying off, just taking a while. This
| paper wasn't "QUIC is bad", it was "OSes need more
| optimization for QUIC to be as fast as HTTPS over TCP".
| guappa wrote:
| The whole point of the project was for it to be faster
| without touching the OS...
| adgjlsfhk1 wrote:
| I think this is slightly wrong. The goal was faster without
| requiring OS/middleware support. Optimizing the OSes that
| need high performance is much easier since that's a much
| smaller set of OSes (basically just Linux/Mac/Windows).
| IshKebab wrote:
| Yeah they probably wanted a protocol that would actually work
| on the wild internet with real firewalls and routers and
| whatnot. The only option if you want that is building on top
| of UDP or TCP and you obviously can't use TCP.
| morning-coffee wrote:
| The UDP optimizations are already there and have been pretty
| much wrung out.
|
| https://www.fastly.com/blog/measuring-quic-vs-tcp-computatio...
| has good details and was done almost five years ago.
|
| The solution isn't in more _UDP_ offload optimizations as there
| aren't any semantics in UDP that are expensive other than the
| quantity and frequency of datagrams to be processed in the
| context of the QUIC protocol that uses UDP as a transport.
| QUIC's state machine needs to see every UDP datagram carrying
| QUIC protocol messages in order to move forward. Just like was
| done for TCP offload more than twenty years ago, portions of
| QUIC state need to move and be maintained in hardware to
| prevent the host from having to see so many high-frequency tiny
| state update messages.
| suprjami wrote:
| Your first point is correct - papers ideally lead to innovation
| and tangible software improvements.
|
| I think a kernel implementation of QUIC is the next logical
| step. A context switch to decrypt a packet header and send
| control traffic is just dumb. That's the kernel's job.
|
| Userspace network stacks have never been a good idea. QUIC is
| no different.
|
| (edit: Xin Long already has started a kernel implementation,
| see elsewhere on this page)
| mholt wrote:
| I don't have access to the paper but based on the abstract and a
| quick scan of the presentation, I can confirm that I have seen
| results like this in Caddy, which enables HTTP/3 out of the box.
|
| HTTP/3 implementations vary widely at the moment, and will likely
| take another decade to optimize to homogeneity. But even then,
| QUIC requires a _lot_ of state management that TCP doesn 't have
| to worry about (even in the kernel). There's a ton of processing
| involved with every UDP packet, and small MTUs, still engrained
| into many middle boxes and even end-user machines these days,
| don't make it any better.
|
| So, yeah, as I felt about QUIC ... oh, about 6 years ago or so...
| HTTP/2 is actually really quite good enough for most use cases.
| The far reaches of the world and those without fast connections
| will benefit, but the majority of global transmissions will
| likely be best served with HTTP/2.
|
| Intuitively, I consider each HTTP major version an increased
| order of magnitude in complexity. From 1 to 2 the main
| complexities are binary framing (that's debatable, since it's
| technically simpler from an encoding standpoint), compression,
| and streams;
| then with HTTP/3 there's _so, so much_ it does to make it work.
| It _can_ be faster -- that's proven -- but only when networks are
| slow.
|
| TCP congestion control is its own worst enemy, but when networks
| aren't congested (and with the right algorithm)... guess what.
| It's fast! And the in-order packet transmissions (head-of-line
| blocking) makes endpoint code so much simpler and faster. It's no
| wonder TCP is faster these days when networks are fast.
|
| I think servers should offer HTTP/3 but clients should be choosy
| when to use it, for the sake of their own experience/performance.
| altairprime wrote:
| The performance gap is shown to be due to hardware offloading,
| not due to congestion control, in the arxiv paper above.
| vlovich123 wrote:
| And because QUIC is encrypted at a fundamental level, offload
| likely means needing to share keys with the network card,
| which is a trust concern.
| 10000truths wrote:
| This is already how TLS offload is implemented for NICs
| that support it. The handshake isn't offloaded, only the
| data path. So essentially, the application performs the
| handshake, then it calls setsockopt to convert the TCP
| socket to a kTLS socket, then it passes the shared key, IV,
| etc. to the kTLS socket, and the OS's network stack passes
| those parameters to the NIC. From there, the NIC only
| handles the bulk encryption/decryption and record
| encapsulation/decapsulation. This approach keeps the
| drivers' offload implementations simple, while still
| allowing the application/OS to manage the session state.
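|
| For the curious, the flow described above looks roughly like
| this on Linux for TLS 1.2 AES-128-GCM (a sketch: the
| key/iv/salt/seq values come from whatever userspace library did
| the handshake, and the #define fallbacks are only needed on
| older headers):
|
|     #include <linux/tls.h>
|     #include <netinet/in.h>      /* IPPROTO_TCP */
|     #include <netinet/tcp.h>
|     #include <string.h>
|     #include <sys/socket.h>
|
|     #ifndef TCP_ULP
|     #define TCP_ULP 31
|     #endif
|     #ifndef SOL_TLS
|     #define SOL_TLS 282
|     #endif
|
|     /* Attach the kernel TLS ULP to an already-handshaken TCP
|      * socket and hand it the transmit session keys; with a capable
|      * NIC the record encryption can then be offloaded to hardware. */
|     static int to_ktls_tx(int fd,
|                           const unsigned char key[16],
|                           const unsigned char iv[8],
|                           const unsigned char salt[4],
|                           const unsigned char rec_seq[8])
|     {
|         struct tls12_crypto_info_aes_gcm_128 ci;
|         memset(&ci, 0, sizeof(ci));
|         ci.info.version     = TLS_1_2_VERSION;
|         ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
|         memcpy(ci.key,     key,     TLS_CIPHER_AES_GCM_128_KEY_SIZE);
|         memcpy(ci.iv,      iv,      TLS_CIPHER_AES_GCM_128_IV_SIZE);
|         memcpy(ci.salt,    salt,    TLS_CIPHER_AES_GCM_128_SALT_SIZE);
|         memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
|
|         if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
|             return -1;
|         return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
|     }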
| vlovich123 wrote:
| Sure, similar mechanisms are available, but for TCP, ack
| offloading and TLS encryption/decryption offloading are
| distinct features. With QUIC there's no separation, which
| changes the threat model. Of course, the root architectural
| problem is that this kind of stuff is part of the NIC instead
| of an "encryption accelerator" that can be asked to operate
| with a key ID on a RAM region; then the kernel only needs to
| give the keys to the SE (and potentially that's where they
| even originate, instead of ever living anywhere else).
| jstarks wrote:
| Your NIC can already access arbitrary RAM via DMA. It can
| read your keys already.
| altairprime wrote:
| That is often incorrect for Apple computers, whether x64+T2
| or aarch64:
| https://support.apple.com/fr-tn/guide/security/seca4960c2b5/...
|
| And it's often incorrect on x64 PCs when IOMMU access is
| appropriately segmented. See also e.g. Thunderclap:
| https://www.ndss-symposium.org/wp-content/uploads/ndss2019_0...
|
| It may still be true in some cases, but it shouldn't be
| taken for granted that it's _always_ true.
| yencabulator wrote:
| Nope. https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_ma...
| truetraveller wrote:
| I'd say HTTP/1.1 is good enough for most people, especially
| with persistent connections. HTTP/2 is an exponential leap in
| complexity, and burdensome/error-prone for clients to
| implement.
| 01HNNWZ0MV43FF wrote:
| Yeah I imagine 1 + 3 being popular. 1.1 is so simple to
| implement and WebTransport / QUIC is basically a teeny VPN
| connection.
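|
| To show just how little "simple to implement" means here, a
| complete HTTP/1.1 GET is one plain-text request over one TCP
| connection (a sketch without TLS or error handling; example.com
| is just a placeholder host):
|
|     #include <netdb.h>
|     #include <stdio.h>
|     #include <string.h>
|     #include <sys/socket.h>
|     #include <sys/types.h>
|     #include <unistd.h>
|
|     int main(void)
|     {
|         struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
|         if (getaddrinfo("example.com", "80", &hints, &res) != 0)
|             return 1;
|         int fd = socket(res->ai_family, res->ai_socktype,
|                         res->ai_protocol);
|         if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
|             return 1;
|
|         const char *req = "GET / HTTP/1.1\r\n"
|                           "Host: example.com\r\n"
|                           "Connection: close\r\n\r\n";
|         send(fd, req, strlen(req), 0);        /* one request... */
|
|         char buf[4096];                       /* ...read until close */
|         ssize_t n;
|         while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
|             fwrite(buf, 1, (size_t)n, stdout);
|         close(fd);
|         freeaddrinfo(res);
|         return 0;
|     }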
| apitman wrote:
| The day they come for HTTP/1.1 is the day I die on a hill.
| Sparkyte wrote:
| Agreed on this.
| geocar wrote:
| I turned off HTTP2 and HTTP3 a few months ago.
|
| I see a few million daily page views: Memory usage has been
| down, latency has been down, network accounting (bandwidth) is
| about the same. Revenue (ads) is up.
|
| > It _can_ be faster -- that's proven -- but only when networks
| are slow.
|
| It can be faster in a situation that doesn't exist.
|
| It sounds charitable to say something like "when networks are
| slow" -- but because everyone has had a slow network
| experience, they are going to think that QUIC would help them
| out, but _real world slow network problems_ don 't look like
| the ones that QUIC solves.
|
| In the real world, QUIC wastes memory and money and increases
| latency on the _average case_. Maybe some Google engineers can
| come up with a clever heuristic involving TCP options or the
| RTT information to "switch on QUIC selectively" but honestly I
| wish they wouldn't bother, simply because I don't want to waste
| my time benchmarking another half-baked google fart.
| withinboredom wrote:
| The thing is, very few people who use "your website" are on
| slow, congested networks. The number of people who visit
| google on a slow, congested network (airport wifi, phones at
| conferences, etc) is way greater than that. This is a
| protocol to solve a google problem, not a general problem or
| even a general solution.
| geocar wrote:
| Since I buy ads on Google for my site, I would argue it's
| representative of Google's traffic.
|
| But nice theory.
| withinboredom wrote:
| It's not. Think about what you search for on your mobile,
| while out or traveling, and what you search for on
| desktop/wifi. They are vastly different. Your traffic is
| not representative of the majority of searches.
| geocar wrote:
| I'm sure the majority of searches are for "google" and
| "facebook" and you're right in a way: I'm not interested
| in those users.
|
| I'm only interested in searches that advertisers are
| interested in, but this is also where Google gets _their_
| revenue from, so we are aligned with which users we want
| to prioritise, so I do not understand who you possibly
| think QUIC is for if not Google's ad business?
| withinboredom wrote:
| That's literally what I said; the entire protocol is
| engineered for Google. Not for everyone else. 99.99% of
| websites out there do not need it.
| suprjami wrote:
| No. Google traffic is the Google Search page, Gmail,
| GSuite like Drive and Meet, and YouTube. You probably
| aren't hosting those.
| replete wrote:
| It's strange to read this when you see articles like this[0]
| and see Lighthouse ranking better with it switched on.
| Nothing beats real world stats though. Could this be down to
| server/client implementation of HTTP2 or would you say it's a
| fundamental implication of the design of the protocol?
|
| Trying to make my sites load faster led me to experiment with
| QUIC and ultimately I didn't trust it enough to leave it on
| with the increase of complexity.
|
| [0]: https://kiwee.eu/blog/http-3-how-it-performs-compared-to-htt...
| withinboredom wrote:
| > It's strange to read this when you see articles like
| this[0] and see Lighthouse ranking better with it switched
| on.
|
| I mean, Lighthouse is maintained by Google (IIRC), and I
| can believe they are going to give their own protocol bonus
| points.
|
| > Could this be down to server/client implementation of
| HTTP2 or would you say its a fundamental implication of the
| design of the protocol?
|
| For stable internet connections, you'll see http2 beat
| http3 around 95% of the time. It's the 95th+ percentile
| that really benefits from http3 on a stable connection.
|
| If you have unstable connections, then http3 will win,
| hands down.
| sbstp wrote:
| Even HTTP/2 seems to have been rushed[1]. Chrome has removed
| support for server push. Maybe more thought should be put into
| these protocols instead of just rebranding whatever Google is
| trying to impose on us.
|
| [1] https://varnish-cache.org/docs/trunk/phk/h2againagainagain.h...
| surajrmal wrote:
| It's okay to make mistakes, that's how you learn and improve.
| Being conservative has drawbacks of its own. I'd argue we need
| more parties involved earlier in the process rather than just
| time.
| zdragnar wrote:
| It's a weird balancing act. On the other hand, waiting for
| everyone to agree on everything means that the spec will take
| a decade or two for everyone to come together, and then all
| the additional time for everyone to actively support it.
|
| AJAX is a decent example. Microsoft's Outlook Web Access team
| implemented XMLHTTP as an activex thing for IE 5 and soon the
| rest of the vendors adopted it as a standard thing as
| XmlHttpRequest objects.
|
| In fact, I suspect the list of things that exist in browsers
| because one vendor thought it was a good idea and everyone
| hopped on board is far, far longer than those designed by
| committee. Often times, the initially released version is not
| exactly the same that everyone standardized on, but they all
| get to build on the real-world consequences of it.
|
| I happen to like the TC39 process
| (https://tc39.es/process-document/), which requires two
| live implementations with use in
| the wild for something to get into the final stage and become
| an official part of the specification. It is obviously harder
| for something like a network stack than a JavaScript engine
| to get real world use and feedback, but it has helped to keep
| a lot of the crazier vendor specific features at bay.
| arcbyte wrote:
| It's okay to make mistakes, but it's not okay to ignore the
| broad consensus that HTTP2 was TERRIBLY designed and then
| admit it 10 years later as if it was unknowable. We knew it
| was bad.
| est wrote:
| I don't blame Google; all major version changes are very brave,
| and I praise them for it. The problem is the lack of non-Google
| protocols for competition.
| KaiserPro wrote:
| HTTP2 was a prototype that was designed by people who either
| assumed that mobile internet would get better much quicker than
| it did, or who didn't understand what packet loss did to
| throughput.
|
| I suspect part of the problem is that some of the rush is that
| people at major companies will get a promotion if they do "high
| impact" work out in the open.
|
| HTTP/2 "solves head of line blocking", which it doesn't. It
| exchanged HTTP-over-SSL blocking issues for TCP head-of-line
| blocking on the real internet. This was predicted at the time.
|
| The other issue is that instead of keeping it a simple
| protocol, the temptation to add complexity to aid a specific
| use case gets too much. (It's human nature; I don't blame them.)
| pornel wrote:
| H/2 doesn't solve blocking at the TCP level, but it solved
| another kind of blocking at the protocol level by having
| multiplexing.
|
| H/1 pipelining was unusable, so H/1 had to wait for a
| response before sending the next request, which added a ton
| of latency, and made server-side processing serial and
| latency-sensitive. The solution to this was to open a dozen
| separate H/1 connections, but that multiplied setup cost, and
| made congestion control worse across many connections.
| KaiserPro wrote:
| > it solved another kind of blocking on the protocol level
|
| Indeed! And it works well on low latency, low packet loss
| networks. On high packet loss networks, it performs worse
| than HTTP/1.1. Moreover, it gets increasingly worse the
| larger the page the request is serving.
|
| We pointed this out at the time, but were told that we
| didn't understand the web.
|
| > H/1 pipelining was unusable,
|
| Yup, but think how easy it would be to create HTTP/1.2 with a
| better spec for pipelining. (But then why not make changes
| to other bits as well, and soon we get HTTP/2!) Of course,
| pipelining only really works on a low packet loss network,
| because you get head of line blocking.
|
| > open a dozen separate H/1 connections, but that
| multiplied setup cost
|
| Indeed, that SSL upgrade is a pain in the arse. But
| connections are cheap to keep open. So with persistent
| connections and pooling its possible to really nail down
| the latency.
|
| Personally, I think the biggest problem with HTTP is that
| it's a file access protocol, a state interchange protocol
| and an authentication system. I would tentatively suggest
| that we adopt websockets to do state (with some extra
| features like optional schema sharing {yes, I know that's a
| bit of an anathema}), make HTTP/4 a proper file sharing
| protocol, and have a third system for authentication token
| generation, sharing and validation.
|
| However the real world says that'll never work. So
| connection pooling over TCP with quick-start TLS would be
| my way forward.
| kiitos wrote:
| > Personally, I think the biggest problem with HTTP is
| that its a file access protocol, a state interchange
| protocol and an authentication system.
|
| HTTP is a state interchange protocol. It's not any of the
| other things you mention.
| liveoneggs wrote:
| Part of (and evidence of) Google's monopoly position in the
| web stack is these big beta tests of protocols they cook up
| for whatever reason.
| surajrmal wrote:
| This is a weak argument that simply caters to the ongoing HN
| hivemind opinion. While Google made the initial proposal,
| many other parties did participate in getting quic
| standardized. The industry at large was in favor.
| oefrha wrote:
| IETF QUIC ended up substantially different from gQUIC.
| People who say Google somehow single-handedly pushed things
| through probably haven't read anything along the
| standardization process, but of course everyone has to have
| an opinion about all things Google.
| botanical wrote:
| > we identify the root cause to be high receiver-side processing
| overhead
|
| I find this to be the issue when it comes to Google, and I bet
| it was known beforehand: pushing processing to the user. For
| example, the AV1 video codec was deployed when no consumer had
| HW decoding capabilities. It saved them on space at the expense
| of increased CPU usage for the end-user.
|
| I don't know what the motive was there; it would still show
| that they are carbon-neutral while billions of devices are
| busy processing the data.
| anfilt wrote:
| Well, I will say that if you're running servers hit billions
| of times per day, offloading processing to the client when
| it's safe to do so starts to make sense financially. Google
| does not have to pay for your CPU or storage usage etc...
|
| Also, I will say that if said overhead is not too much, it's
| not that bad of a thing.
| kccqzy wrote:
| This is indeed an issue but it's widespread and everyone does
| it, including Google. Things like servers no longer generating
| actual dynamic HTML, replaced with servers simply serving pure
| data like JSON and expecting the client to render it into the
| DOM. It's not just Google that doesn't care, but the majority
| of web developers also don't care.
| SquareWheel wrote:
| There's clearly advantages to writing a web app as an SPA,
| otherwise web devs wouldn't do it. The idea that web devs
| "don't care" (about what exactly?) really doesn't make any
| sense.
|
| Moving interactions to JSON in many cases is just a better
| experience. If you click a Like button on Facebook, which is
| the better outcome: To see a little animation where the
| button updates, or for the page to reload with a flash of
| white, throw away the comment you were part-way through
| writing, and then scroll you back to the top of the page?
|
| There's a reason XMLHttpRequest took the world by storm. More
| than that, jQuery is still used on more than 80% of websites
| due in large part to its legacy of making this process easier
| and cross-browser.
| tock wrote:
| I don't think Facebook is the best example given the sheer
| number of loading skeletons I see on their page.
| consteval wrote:
| > To see a little animation where the button updates, or
| for the page to reload with a flash of white, throw away
| the comment you were part-way through writing, and then
| scroll you back to the top of the page
|
| I don't understand how web devs understand the concept of
| loading and manipulating JSON to dynamically modify the
| page's HTML, but they don't understand the concept of
| loading and manipulating HTML to dynamically modify the
| page's HTML.
|
| It's the same thing, except now you don't have to do a
| conversion from JSON->HTML.
|
| There's no rule anywhere saying receiving HTML on the
| client should do a full page reload and throw out the
| currently running JavaScript.
|
| > XMLHttpRequest
|
| This could've easily been HTMLHttpRequest and it would've
| been the same API, but probably better. Unfortunately,
| during that time period Microsoft was obsessed with XML.
| Like... obsessed obsessed.
| kccqzy wrote:
| Rendering JSON into HTML has nothing to do with
| XMLHttpRequest.
|
| Funny that you mention jQuery. When jQuery was hugely
| popular, people used it to make XMLHttpRequests that
| returned HTML which you then set as the innerHTML of some
| element. Of course being jQuery, people used the shorthand
| of `$("selector").html(...)` instead.
|
| In the heyday of jQuery the JSON.parse API didn't exist.
| danpalmer wrote:
| > the AV1 video codec was deployed when no consumer had HW
| decoding capabilities
|
| This was a bug. An improved software decoder was deployed for
| Android and for buggy reasons the YouTube app used it instead
| of a hardware accelerated implementation. It was fixed.
|
| Having worked on a similar space (compression formats for app
| downloads) I can assure you that all factors are accounted for
| with decisions like this, we were profiling device thermals for
| different compression formats. Setting aside bugs, the teams
| behind things like this are taking wide-reaching views of the
| ecosystem when making these decisions, and at scale, client
| concerns almost always outweigh server concerns.
| watermelon0 wrote:
| YouTube had the same issue with VP9 on laptops, where you had
| to use an extension to force H264, to avoid quickly draining
| the battery.
| toastal wrote:
| If only they would give us JXL on Android
| Sparkyte wrote:
| Maybe I'm the only person who thinks that trying to make existing
| internet protocols faster is wasted energy. But who am I to say
| anything.
| cheema33 wrote:
| > Maybe I'm the only person who thinks that trying to make
| existing internet protocols faster is wasted energy. But who am
| I to say anything.
|
| If you have a valid argument to support your claim, why not
| present it?
| Sparkyte wrote:
| They are already accepted standards, so when you create
| optimizations you're building on functions that need to be
| supported additionally on top of them. This leads to
| incompatibility and sometimes even worse performance, as is
| being experienced here with QUIC.
|
| You can read more about such things in "The Evolution of the
| Internet Congestion Control":
| https://groups.csail.mit.edu/ana/Publications/The_Evolution_...
|
| A good solution is to create a newer protocol when the limits
| of an existing protocol are exceeded. No one thought of
| needing HTTPS long ago, and now we have 443 for HTTP security.
| If we need something to be faster and it has already reached
| an arbitrary limit for the sake of backward compatibility, it
| would be better to introduce a new protocol.
|
| I dislike the idea that we're turning into another Reddit
| where we are pointing fingers at people for updoots. If you
| dislike my opinion, please present one of your own so it can
| be challenged.
| likis wrote:
| You posted your opinion without any kind of accompanying
| argument, and it was also quite unclear what you meant.
| Whining about being a target and being downvoted is not
| really going to help your case.
|
| I initially understood your first post as: "Let's not try
| to make the internet faster"
|
| With this reply, you are clarifying your initial post that
| was very unclear. Now I understand it as:
|
| "Let's not try to make existing protocols faster, let's
| make new protocols instead"
| Sparkyte wrote:
| More that if a protocol has met its limit and you are at
| a dead end, it is better to build a new one from the
| ground up. Making the internet faster is great, but you
| eventually hit a wall. You need to be creative and come
| up with better solutions.
|
| In fact our modern network infrastructure still rests on
| designs intended for limited network performance. Our
| networks are fiber and 5G, which are roughly 170,000 times
| faster and wider than at the initial inception of the
| internet.
|
| Time for a QUICv2:
|
| https://datatracker.ietf.org/doc/rfc9369/
|
| But I don't think it addresses the disparity between it
| and lightweight protocols as networks get faster.
| paulgb wrote:
| > A good solution is to create a newer protocol when the
| limits of an existing protcol are exceeded.
|
| It's not clear to me how this is different to what's
| happening. Is your objection that they did it on top of UDP
| instead of inventing a new transport layer?
| Sparkyte wrote:
| No, actually what I meant was that QUIC, being a protocol
| on UDP, was intended to take advantage of the speed of UDP
| to do things faster than some TCP protocols did. While
| the merit is there, the optimizations done on TCP itself
| have drastically improved the performance of TCP based
| protocols. UDP is still exceptional, but it is like using
| a crowbar to open a bottle. Not exactly the tool intended
| for the purpose.
|
| Creating a new protocol starting from scratch would be
| better effort spent. A QUICv2 is on the way:
| https://datatracker.ietf.org/doc/rfc9369/
|
| I don't think it addresses the problems with QUICv1 in
| terms of lightweight performance and bandwidth which the
| post claims QUIC lacks.
| Veserv wrote:
| QUICv2 is not really a new standard. It explicitly exists
| merely to intentionally rearrange some fields to prevent
| standard hardcoding/ossification and exercise the version
| negotiation logic of implementations. It says so right in
| the abstract:
|
| "Its purpose is to combat various ossification vectors
| and exercise the version negotiation framework."
| simiones wrote:
| Creating a new transport protocol for use on the whole
| Internet is a massive undertaking, not only in purely
| technical terms, but much more difficult, in political
| terms. Getting all of the world's sysadmins to allow your
| new protocol is a massive massive undertaking.
|
| And if you have the new protocol available today, with
| excellent implementations for Linux, Windows, BSD, MacOS,
| Apple iOS, and for F5, Cisco, etc routers done, it will
| still take an absolute minimum of 5-10 years until it
| _starts_ becoming available on the wider Internet, and
| that is if people are _desperate_ to adopt it. And the
| vast majority of the world will not use it for the next
| 20 years.
|
| The time for updating hardware to allow and use new
| protocols is going to be a massive hurdle to anything
| like this. And the advantage to doing so over just using
| UDP would have to be monumental to justify such an
| effort.
|
| The reality is that there will simply not be a new
| transport protocol used on the wide internet in our
| lifetimes. Trying to get one to happen is a pipe dream.
| Any attempts at replacing TCP will just use UDP.
| hnfong wrote:
| While you're absolutely correct, I think it is
| interesting to note that your argument could also have
| applied to the HTTP protocol itself, given how widely
| HTTP is used.
|
| _However_, in reality, the people/forces pushing for
| HTTP2 and QUIC are the same one(s?) who have a de facto
| monopoly on browsers.
|
| So, yes, it's a political issue, and they just
| implemented their changes on a layer (or even... "app")
| that they had the most control over.
|
| On a purely "moral" perspective, political expediency
| probably shouldn't be the reason why something is done,
| but of course that's what actually happens in the real
| world...
| simiones wrote:
| There are numerous non-HTTP protocols used successfully
| on the Internet, as long as they run over TCP or UDP.
| Policing content running on TCP port 443 to enforce that
| it is HTTP/1.1 over TLS is actually extremely rare,
| outside some very demanding corporate networks. If you
| wanted to send your own new "HTTP/7" traffic today, with
| some new encapsulation over TLS on port 443, and you
| controlled the servers and the clients for this, I think
| you would actually meet minimal issues.
|
| The problems with SCTP, or any new transport-layer
| protocol (or any even lower layer protocol), run much
| deeper than deploying a new protocol on any higher layer.
| foul wrote:
| It's wasted energy when they aren't used at their full
| capacity.
|
| I think that GoogleHTTP has real-world uses for bad
| connectivity or in datacenters where they can fine-tune their
| data throughput (and buy crazy good NICs), but it seems that to
| use it for replacing TCP (which seems to be confirmed as very
| good when receiver and sender aren't controlled by the same
| party) the world needs a hardware overhaul or something.
| Sparkyte wrote:
| Maybe. The problem is that we designed around a limited-
| bandwidth network at the initial inception of the internet
| and have been building around that design for 50 years. We
| need to change the paradigm to think about our wide-
| bandwidth networks.
| suprjami wrote:
| You aren't the only one. The bufferbloat community has given up
| on TCP altogether.
| apitman wrote:
| Currently chewing my way laboriously through RFC9000. Definitely
| concerned by how complex it is. The high level ideas of QUIC seem
| fairly straightforward, but the spec feels full of edge cases
| you must account for. Maybe there's no other way, but it makes me
| uncomfortable.
|
| I don't mind too much as long as they never try to take HTTP/1.1
| from me.
| ironmagma wrote:
| Considering they can't really even make IPv6 happen, that seems
| like a likely scenario.
| BartjeD wrote:
| https://www.google.com/intl/en/ipv6/statistics.html
|
| I think it's just your little corner of the woods that isn't
| adopting it. Over here the trend is very clearly to move away
| from IPv4, except for legacy reasons.
| apitman wrote:
| The important milestone is when it's safe to turn IPv4 off.
| And that's not going to happen as long as any country
| hasn't fully adopted it, and I don't think that's ever
| going to happen. For better or worse NAT handles outgoing
| connections and SNI routing handles incoming connections
| for most use cases. Self-hosting is the most broken but IMO
| that's better handled with tunneling anyway so you don't
| expose your home IP.
| jeroenhd wrote:
| IPv4 doesn't need to be off. Hacks and workarounds like
| DS-Lite can stay with us forever, just like hacks and
| workarounds like NAT and ALGs will.
| consp wrote:
| DS-Lite (aka CGNAT): now we don't need to give the
| customers a proper IP address anymore. It should be
| banned, as it limits IPv6 adoption, it is getting more and
| more use "for the customers' own good", and it is annoying
| as hell to work around.
| AlienRobot wrote:
| >I think it's just your little corner of the woods that
| isn't adopting it.
|
| The graph says adoption is under 50%.
|
| Even U.S. is at only 50%. Some countries are under 1%.
| BartjeD wrote:
| Parts of the EU: 74%
| ktosobcy wrote:
| And others are 10-15%...
| alt227 wrote:
| The majority of this traffic is mobile devices. Most use
| ipv6 by default.
|
| Uptake on desktop/laptops/servers is still extremely low
| and will be for a long time to come.
| sandos wrote:
| Sweden is awful here, neither my home connection nor my
| phone uses ipv6.
|
| We were once very early with internet stuff, but now we
| are lagging, it seems.
| ktosobcy wrote:
| Save for France/Germany (~75%) and then
| USA/Mexico/Brazil (~50%), the rest of the world is not
| really adopting it... Even in Europe, Spain has only ~10%
| and Poland ~17% penetration, but yeah... let's be
| dismissive with "your little corner"...
| 71bw wrote:
| >and Poland ~17% penetration
|
| Almost exclusively due to Orange Polska -> belongs to
| France Telecom -> go figure...
| arp242 wrote:
| Adoption is not even 50%, and the line goes up fairly
| linearly, so ~95% will be around 2040 or so?
|
| And if you click on the map view you will see "little
| corner of the woods" is ... the entire continent of Africa,
| huge countries like China and Indonesia.
| mardifoufs wrote:
| Why did adoption slow down after a sudden rise? I guess
| some countries switched to ipv6 and since then, progress
| has been slow? It's hard to infer from the graph but my
| guess would be india? They have a very nice adoption rate.
|
| Sadly here in Canada I don't think any ISP even supports
| IPv6 in any shape or form except for mobile. Videotron has
| been talking about it for a decade (and they have a
| completely outdated infrastructure now, only DOCSIS and a
| very bad implementation of it too), and Bell has fiber but
| does not provide any info on that either.
| jtakkala wrote:
| Rogers and Teksavvy support IPv6
| mardifoufs wrote:
| Ah that's cool! It sucks that they are basically non
| existent in Quebec, at least for residential internet.
| But I think they are pushing for a bigger foothold here
| apitman wrote:
| There's simply not enough demand. ISPs can solve their IP
| problems with NAT. Web services can solve theirs with SNI
| routing. The only people who really need IPv6 are self
| hosters.
| jakeogh wrote:
| I think keeping HTTP/1.1 is almost as important as not dropping
| IPv4 (there are good reasons for not being able to tag
| everything; it's harder to block a country than a user). For
| similar reasons we should keep old protocols.
|
| On a positive note, AFAICT 90%(??) of QUIC implementations
| ignored the proposed spin bit:
| https://news.ycombinator.com/item?id=20990754
| jauntywundrkind wrote:
| I wonder if these results reproduce on Windows. Is there any TCP
| offload or GSO there? If not maybe the results wouldn't vary?
| v1ne wrote:
| Oh, sure there is! https://learn.microsoft.com/en-us/windows-
| hardware/drivers/n...
| raggi wrote:
| There are a number of concrete problems:
|
| - syscall interfaces are a mess, the primitive APIs are too slow
| for regular sized packets (~1500 bytes), the overhead is too
| high. GSO helps but it's a horrible API, and it's been buggy even
| lately due to complexity and poor code standards.
|
| - the syscall costs got even higher with spectre mitigations - and
| this story likely isn't over. We need a replacement for the BSD
| sockets / POSIX APIs; they're terrible this decade. Yes, uring is
| fancy, but there's a tutorial-level API middle ground possible
| that should be safe and have 10x less overhead without resorting
| to uring-level complexity.
|
| - system udp buffers are far too small by default - they're much
| much smaller than their tcp siblings, essentially no one but
| experts have been using them, and experts just retune stuff.
|
| - udp stack optimizations are possible (such as possible route
| lookup reuse without connect(2)), gso demonstrates this, though
| as noted above gso is highly fallible, quite expensive itself,
| and the design is wholly unnecessarily intricate for what we
| need, particularly as we want to do this safely from unprivileged
| userspace.
|
| - several optimizations currently available only work at low/mid-
| scale, such as connect binding to (potentially) avoid route
| lookups / GSO only being applicable on a socket without high
| peer-competition (competing peers result in short offload chains
| due to single-peer constraints, eroding the overhead wins).
|
| Despite all this, you can implement GSO and get substantial
| performance improvements, as we (Tailscale) have on Linux. There
| will be a need at some point for platforms to increase platform-
| side buffer sizes for lower-end systems, high load/concurrency,
| BDP and so on, but buffers and congestion control are a highly
| complex and sometimes quite sensitive topic - nonetheless, when
| you have many applications doing this (presumed future state),
| there will be a need.
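|
| As a rough illustration of the "retune stuff" point above, a
| minimal sketch (not Tailscale's code; the helper name is made
| up) of how an application can ask for bigger UDP socket
| buffers on Linux. The kernel clamps the value to
| net.core.rmem_max, so deployments typically raise that sysctl
| as well:
|
|     /* Hypothetical sketch: enlarge a UDP socket's receive
|      * buffer. The kernel caps SO_RCVBUF at net.core.rmem_max,
|      * so a real deployment also raises that sysctl, e.g.
|      * sysctl -w net.core.rmem_max=8388608. */
|     #include <stdio.h>
|     #include <sys/socket.h>
|
|     int tune_udp_rcvbuf(int fd) {
|         int want = 8 * 1024 * 1024;  /* 8 MiB, well above default */
|         if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
|                        &want, sizeof want) < 0) {
|             perror("setsockopt(SO_RCVBUF)");
|             return -1;
|         }
|         int got = 0;
|         socklen_t len = sizeof got;
|         getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
|         printf("effective rcvbuf: %d bytes\n", got);
|         return 0;
|     }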
| SomaticPirate wrote:
| What is GSO?
| thorncorona wrote:
| presumably generic segmentation offloading
| throwaway8481 wrote:
| Generic Segmentation Offload
|
| https://www.kernel.org/doc/html/latest/networking/segmentati.
| ..
| chaboud wrote:
| Likely Generic Segmentation Offload (if memory serves), which
| is a generalization of TCP segmentation offload.
|
| Basically (hyper simple), the kernel can lump stuff together
| when working with the network interface, which cuts down on
| ultra slow hardware interactions.
| raggi wrote:
| it was originally for the hardware, but it's also valuable
| on the software side as the cost of syscalls is far too
| high for packet sized transactions
| jesperwe wrote:
| Generic Segmentation Offload
|
| "GSO gains performance by enabling upper layer applications
| to process a smaller number of large packets (e.g. MTU size
| of 64KB), instead of processing higher numbers of small
| packets (e.g. MTU size of 1500B), thus reducing per-packet
| overhead."
| underdeserver wrote:
| This is more the result.
|
| Generally today an Ethernet frame, which is the basic
| atomic unit of information over the wire, is limited to
| 1500 bytes (the MTU, or Maximum Transmission Unit).
|
| If you want to send more - the IP layer allows for 64k
| bytes per IP packet - you need to split the IP packet into
| multiple (64k / 1500 plus some header overhead) frames.
| This is called segmentation.
|
| Before segmentation offload the kernel would do that early,
| which takes buffering and CPU time to assemble the frame
| headers. GSO defers this work as far down the stack as
| possible - ideally to the Ethernet hardware, which does
| essentially the same thing, only hardware accelerated and
| without taking up a CPU core.
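|
| For the UDP side, a minimal sketch of what this looks like on
| Linux (kernel 4.18+ with the UDP_SEGMENT socket option; a
| simplified illustration with error handling omitted): the
| application hands the kernel one large buffer plus a segment
| size, and the stack or NIC splits it into MTU-sized datagrams,
| so one sendto() replaces dozens:
|
|     #include <netinet/in.h>
|     #include <netinet/udp.h>
|     #include <sys/socket.h>
|
|     #ifndef UDP_SEGMENT
|     #define UDP_SEGMENT 103   /* value from linux/udp.h */
|     #endif
|
|     ssize_t send_gso(int fd, const struct sockaddr_in *dst,
|                      const char *buf, size_t len) {
|         int gso_size = 1200;  /* payload bytes per datagram */
|         setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT,
|                    &gso_size, sizeof gso_size);
|         /* len may be up to ~64 KiB; the stack emits roughly
|          * len / gso_size packets on the wire */
|         return sendto(fd, buf, len, 0,
|                       (const struct sockaddr *)dst, sizeof *dst);
|     }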
| USiBqidmOOkAqRb wrote:
| Shipping? Government services online? Piedmont airport?
| Alcoholics anonymous? Obviously not.
|
| _Please_ introduce your initialisms, if it's not guaranteed
| that the first result in a search will be correct.
| mh- wrote:
| _> first result in a search will be correct_
|
| Searching for _GSO network_ gives you the correct answer in
| the first result. I'd consider that condition met.
| cookiengineer wrote:
| Say what you want but I bet we'll see lots of eBPF modules
| being loaded in the future for the very reason you're
| describing. An ebpf quic module? Why not!
|
| And that scares me, because there's not a single tool that has
| this on its radar for malware detection/prevention.
| raggi wrote:
| we can consider ebpf "a solution" when there's even a remote
| chance you'll be able to do it from an unentitled ios app.
| that's somewhat hyperbole, but the point is, this problem is a
| problem for userspace client applications, and bpf isn't a
| particularly "good" solution for servers either: it's a high
| cost of authorship for a problem that is easily solvable with
| a better API to the network stack.
| mgaunard wrote:
| ebpf is linux technology, you will never be able to do it
| from iOS.
| dan-robertson wrote:
| https://github.com/microsoft/ebpf-for-windows
| JoshTriplett wrote:
| > Yes, uring is fancy, but there's a tutorial level API middle
| ground possible that should be safe and 10x less overhead
| without resorting to uring level complexity.
|
| I don't think io_uring is as complex as its reputation
| suggests. I don't think we need a substantially simpler _low-
| level_ API; I think we need more high-level APIs built on top
| of io_uring. (That will also help with portability: we need
| APIs that can be most efficiently implemented atop io_uring but
| that work on non-Linux systems.)
| raggi wrote:
| > I don't think io_uring is as complex as its reputation
| suggests.
|
| uring is extremely problematic to integrate into many common
| application / language runtimes and it has been demonstrably
| difficult to integrate into linux safely and correctly as
| well, with a continual stream of bugs, security and policy
| control issues.
|
| in principle a shared memory queue is a reasonable basis for
| improving the IO cost between applications and IO stacks such
| as the network or filesystem stacks, but this isn't easy to
| do well, cf. uring bugs and binder bugs.
| JoshTriplett wrote:
| > with a continual stream of bugs, security and policy
| control issues
|
| This has not been true for a long time. There was an early
| design mistake that made it quite prone to these, but that
| mistake has been fixed. Unfortunately, the reputational
| damage will stick around for a while.
| raggi wrote:
| 13 CVEs so far this year afaik
| bonzini wrote:
| CVE numbers from the Linux CNA are bollocks.
| JoshTriplett wrote:
| This conversation would be a good one to point them to to
| show that their policy is not just harmless point-
| proving, but in fact does cause harm.
|
| For context, to the best of my knowledge the current
| approach of the Linux CNA is, in keeping with long-
| standing Linux security policy of "every single fix
| _might_ be a security fix ", to assign CVEs regardless of
| whether something has any security impact or not.
| di4na wrote:
| I would not call it harm. The use of uring in higher
| level languages is definitely prone to errors, bugs and
| security problems
| JoshTriplett wrote:
| See the context I added to that comment; this is not
| about security issues, it's about the Linux CNA's absurd
| approach to CVE assignment for things that aren't CVEs.
| tialaramex wrote:
| I don't agree that it's absurd. I would say it reflects a
| proper understanding of their situation.
|
| You've doubtless heard Tony Hoare's "There are two ways
| to write code: write code so simple there are obviously
| no bugs in it, or write code so complex that there are no
| obvious bugs in it.". Linux is definitely in the latter
| category, it's now such a sprawling system that
| determining whether a bug "really" has security
| implications is no longer a reasonable task compared to
| just fixing the bug.
|
| The other reason is that Linux is so widely used that
| almost no assumption made to simplify that above task is
| definitely correct.
| JoshTriplett wrote:
| That's fine, except that it is thus no longer meaningful
| to compare CVE count.
| hifromwork wrote:
| I like CVEs, I think Linux approach to CVEs is stupid,
| but also it was never meaningful to compare CVE count.
| But I guess it's hard to make people stop doing that, and
| that's the reason Linux does the thing it does out of
| spite.
| kuschku wrote:
| CVE assignment != security issue
|
| CVE numbers are just a way to ensure everyone is talking
| about the same bug. Not every security issue has a CVE,
| not every CVE is a security issue.
|
| Often, a regular bug turns out years later to have been a
| security issue, or a security issue turns out to have no
| security impact at all.
|
| If you want a central authority to tell you what to
| think, just use CVSS instead of the binary "does it have
| a CVE" metric.
| skywhopper wrote:
| That's definitely not the understanding that literally
| anyone outside the Linux team has for what a CVE is,
| including the people who came up with them and run the
| database. Overloading a well-established mechanism of
| communicating security issues to just be a registry of
| Linux bugs is an abuse of an important shared resource.
| Sure "anything could be a security issue" but in
| practice, most bugs aren't, and putting meaningless bugs
| into the international security issue database is just a
| waste of everyone's time and energy to make a very stupid
| point.
| kuschku wrote:
| > including the people who came up with them
|
| How do you figure that? The original definition of CVE is
| exactly the same as how Linux approaches it.
|
| Sure, in recent years security consultants have been
| overloading CVE to mean something else, but that's
| something to fix, not to keep.
| jiripospisil wrote:
| Can you post the original definition?
| vel0city wrote:
| Common Vulnerabilities and Exposures
| jiripospisil wrote:
| Right but I was hoping for a definition which supports
| OP's claim that "CVE assignment != security issue".
| kuschku wrote:
| Then check out these definitions, from 2000, defined by
| the CVE editorial board:
|
| > The CVE list aspires to describe and name all publicly
| known facts about computer systems that could allow
| somebody to violate a reasonable security policy for that
| system
|
| As well as:
|
| > Discussions on the Editorial Board mailing list and
| during the CVE Review meetings indicate that there is no
| definition for a "vulnerability" that is acceptable to
| the entire community. At least two different definitions
| of vulnerability have arisen and been discussed. There
| appears to be a universally accepted, historically
| grounded, "core" definition which deals primarily with
| specific flaws that directly allow some compromise of the
| system (a "universal" definition). A broader definition
| includes problems that don't directly allow compromise,
| but could be an important component of a successful
| attack, and are a violation of some security policies (a
| "contingent" definition).
|
| > In accordance with the original stated requirements for
| the CVE, the CVE should remain independent of multiple
| perspectives. Since the definition of "vulnerability"
| varies so widely depending on context and policy, the CVE
| should avoid imposing an overly restrictive perspective
| on the vulnerability definition itself.
|
| Under this definition, any kernel bug that could lead to
| user-space software acting differently is a CVE.
| Similarly, all memory management bugs in the kernel
| justify a CVE, as they could be used as part of an
| exploit.
| jiripospisil wrote:
| > to violate a reasonable security policy for that system
|
| > with specific flaws that directly allow some compromise
| of the system
|
| > important component of a successful attack, and are a
| violation of some security policies
|
| All of these are talking about security issues, not
| "acting differently".
| josefx wrote:
| > All of these are talking about security issues, not
| "acting differently".
|
| Because no system has ever been taken down by code that
| behaved differently from what it was expected to do? Right?
| Like HTTP desync attacks, SQL escape bypasses, ... .
| Absolutely no security issue is going to be caused by a very
| minor and by itself very secure difference in behavior.
| kuschku wrote:
| > important component of a successful attack, and are a
| violation of some security policies
|
| If the kernel returned random values from gettime, that'd
| lead to tls certificate validation not being reliable
| anymore. As result, any bug in gettime is certainly
| worthy of a CVE.
|
| If the kernel shuffled filenames so they'd be returned
| backwards, apparmor and selinux profiles would break. As
| result, that'd be worthy of a CVE.
|
| If the kernel has a memory corruption, use after free,
| use of uninitialized memory or refcounting issue, that's
| obviously a violation of security best practices and can
| be used as component in an exploit chain.
|
| Can you now see how almost every kernel bug can and most
| certainly will be turned into a security issue at some
| point?
| cryptonector wrote:
| > that could allow somebody to violate a reasonable
| security policy for that system
|
| That's "security bug". Please stop saying it's not.
| kuschku wrote:
| As detailed in my sibling reply, by definition that
| includes any bug in gettime (as that'd affect tls
| certificate validation), any bug in a filesystem (as
| that'd affect loading of selinux/apparmor profiles), any
| bug in eBPF (as that'd affect network filtering), etc.
|
| Additionally, any security bug in the kernel itself, so
| any use after free, any refcounting bug, any use of
| uninitialized memory.
|
| Can you now see why pretty much every kernel bug fulfills
| that definition?
| simiones wrote:
| This is completely false. The CVE website defines these
| very clearly:
|
| > The mission of the CVE® Program is to identify,
| define, and catalog _publicly disclosed cybersecurity
| vulnerabilities_ [emphasis mine].
|
| In fact, CVE stands for "Common Vulnerabilities and
| Exposures", again showing that CVE == security issue.
|
| It's of course true that just because your _code_ has an
| unpatched CVE doesn 't automatically mean that your
| _system_ is vulnerable - other mitigations can be in
| place to protect it.
| kuschku wrote:
| That's the modern definition, which is rewriting history.
| Let's look at the actual, original definition:
|
| > The CVE list aspires to describe and name all publicly
| known facts about computer systems that could allow
| somebody to violate a reasonable security policy for that
| system
|
| There's also a decision from the editorial board on this,
| which said:
|
| > Discussions on the Editorial Board mailing list and
| during the CVE Review meetings indicate that there is no
| definition for a "vulnerability" that is acceptable to
| the entire community. At least two different definitions
| of vulnerability have arisen and been discussed. There
| appears to be a universally accepted, historically
| grounded, "core" definition which deals primarily with
| specific flaws that directly allow some compromise of the
| system (a "universal" definition). A broader definition
| includes problems that don't directly allow compromise,
| but could be an important component of a successful
| attack, and are a violation of some security policies (a
| "contingent" definition).
|
| > In accordance with the original stated requirements for
| the CVE, the CVE should remain independent of multiple
| perspectives. Since the definition of "vulnerability"
| varies so widely depending on context and policy, the CVE
| should avoid imposing an overly restrictive perspective
| on the vulnerability definition itself.
|
| For more details, see https://web.archive.org/web/2000052
| 6190637fw_/http://www.cve... and https://web.archive.org/
| web/20020617142755/http://cve.mitre....
|
| Under this definition, any kernel bug that could lead to
| user-space software acting differently is a CVE.
| Similarly, all memory management bugs in the kernel
| justify a CVE, as they could be used as part of an
| exploit.
| simiones wrote:
| Those two links say that CVEs can be one of two
| categories: universal vulnerabilities or exposures. But
| the examples of exposures are _not_ , in any way, "any
| bug in the kernel". They give specific examples of things
| which _are known_ to make a system more vulnerable to
| attack, even if not everyone would agree that they are a
| problem.
|
| So yes, any CVE is supposed to be a security problem, and
| it has always been so. Maybe not for your specific system
| or for your specific security posture, but for someone's.
|
| Extending this to any bugfix is a serious
| misunderstanding of what an "exposure" means, and it is a
| serious difference from other CNAs. Linux CNA-assigned
| CVEs just can't be taken as seriously as normal CNAs.
| immibis wrote:
| As I understand it, they adopted this policy because the
| other policy was also causing harm.
|
| They are right, by the way. When CVEs were used for
| things like Heartbleed they made sense - you could point
| to Heartbleed's CVE number and query various information
| systems about vulnerable systems. When every single
| possible security fix gets one, AND automated systems are
| checking the you've patched every single one or else you
| fail the audit (even ones completely irrelevant to the
| system, like RCE on an embedded device with no internet
| access) the system is not doing anything useful - it's
| deleting value from the world and must be repaired or
| destroyed.
| hifromwork wrote:
| The problem here are the automated systems and braindead
| auditors, not the CVE system itself.
| immibis wrote:
| Well, the CVE system itself is only about assigning
| identifiers, and assigning identifiers unnecessarily
| couldn't possibly hurt anyone, who isn't misusing the
| system, unless they're running out of identifiers.
| raggi wrote:
| this is a bit of a distraction, sure the leaks and some
| of the deadlocks are fairly uninteresting, but the
| toctou, overflows, uid race/confusion and so on are real
| issues that shouldn't be dismissed as if they don't
| exist.
| jeffparsons wrote:
| I find this surprising, given that my initial response to
| reading the iouring design was:
|
| 1. This is pretty clean and straightforward.
|
| 2. This is obviously what we need to decouple a bunch of
| things without the previous downsides.
|
| What has made it so hard to integrate it into common
| language runtimes? Do you have examples of where there's
| been an irreconcilable "impedance mismatch"?
| raggi wrote:
| https://github.com/tailscale/tailscale/pull/2370 was a
| practical drive toward this, will not proceed on this
| path.
|
| much more approachable, boats has written about
| challenges integrating in rust:
| https://without.boats/tags/io-uring/
|
| in the most general form: you need a fairly "loose"
| memory model to integrate the "best" (performance wise)
| parts, and the "best" (ease of use/forward looking
| safety) way to integrate requires C library linkage. This
| is troublesome in most GC languages, and many managed
| runtimes. There's also the issue that uring being non-
| portable means that the things it suggests you must do
| (such as say pinning a buffer pool and making APIs like
| read not immediate caller allocates) requires a
| substantially separate API for this platform than for
| others, or at least substantial reworks over all the
| existing POSIX modeled APIs - thus back to what I said
| originally, we need a replacement for POSIX & BSD here,
| broadly applied.
| gpderetta wrote:
| I can see how a zero-copy API would be hard to implement
| on some languages, but you could still implement
| something on top of io_uring with posix buffer copy
| semantics, while using batching to decrease syscall
| overhead.
|
| Zero-copy APIs will necessarily be tricky to implement
| and use, especially on memory safe languages.
| gmokki wrote:
| I think most GC languages support native/pinned memory (at
| least Java and C# do) to support talking to the kernel
| or native libraries. The APIs are even quite nice.
| neonsunset wrote:
| Java's off-heap memory and memory segment API is quite
| dreadful and on the slower side. C# otoh gives you easy
| and cheap object pinning, malloc/free and stack-allocated
| buffers.
| asveikau wrote:
| I read the oldest of those blog posts the closest.
|
| Seems like the author points out two things:
|
| 1. The lack of rust futures supporting manual
| cancellation. That doesn't seem like an inevitable choice
| by rust.
|
| 2. Sharing buffers with kernel mode. This is probably a
| bigger topic.
| withoutboats3 wrote:
| Rust's async model can support io-uring fine, it just has
| to be a different API based on ownership instead of
| references. (That's the conclusion of my posts you link
| to.)
| arghwhat wrote:
| Two things:
|
| One, uring is not extremely problematic to integrate, as it
| can be chained into a conventional event loop if you want
| to, or can even be fit into a conventionally blocking
| design to get localized syscall benefits. That is, you do
| not need to convert to a fully uring event loop design,
| even if that would be superior - and it can usually be kept
| entirely within a (slightly modified) event loop
| abstraction. The reason it has not yet been implemented is
| just priority - most stuff _isn't_ bottlenecked on IOPS.
|
| Two, yes you could have a middle ground. I assume the
| syscall overhead you call out is the need to send UDP
| packets one at a time through sendmsg/sendto, rather than
| doing one big write for several packets worth of data on
| TCP. An API that allowed you to provide a chain of
| messages, like sendmsg takes an iovec for data, is
| possible. But it's also possible to do this already as a
| tiny blocking wrapper around io_uring, saving you new
| syscalls.
| Veserv wrote:
| The system call to send multiple UDP packets in a single
| call has existed since Linux 3.0 over a decade ago[1]:
| sendmmsg().
|
| [1] https://man7.org/linux/man-pages/man2/sendmmsg.2.html
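|
| A minimal sketch of the batching it enables (a hypothetical
| helper, error handling omitted): several datagrams to the same
| peer go out in one syscall instead of one each:
|
|     #define _GNU_SOURCE
|     #include <netinet/in.h>
|     #include <string.h>
|     #include <sys/socket.h>
|     #include <sys/uio.h>
|
|     /* returns the number of messages actually sent, or -1 */
|     int send_batch(int fd, const struct sockaddr_in *dst,
|                    struct iovec *iovs, unsigned n) {
|         struct mmsghdr msgs[64];
|         if (n > 64) n = 64;
|         memset(msgs, 0, sizeof msgs);
|         for (unsigned i = 0; i < n; i++) {
|             msgs[i].msg_hdr.msg_name = (void *)dst;
|             msgs[i].msg_hdr.msg_namelen = sizeof *dst;
|             msgs[i].msg_hdr.msg_iov = &iovs[i];
|             msgs[i].msg_hdr.msg_iovlen = 1;
|         }
|         return sendmmsg(fd, msgs, n, 0);
|     }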
| arghwhat wrote:
| Ah nice, in that case OP's point about syscall overhead
| is entirely moot. :)
|
| That should really be in the `SEE ALSO` of `man 3
| sendmsg`...
| evntdrvn wrote:
| patches welcome :p
| justincormack wrote:
| At one point, if I remember, it didn't actually work; it
| still just sent one message at a time and returned the
| length of the first piece of the iovec. Hopefully it got
| fixed.
| johnp_ wrote:
| Looks like Mozilla is currently working on implementing
| `sendmmsg` and `recvmmsg` use in neqo (Mozilla's QUIC
| implementation) [1].
|
| [1] https://github.com/mozilla/neqo/issues/1693
| londons_explore wrote:
| I think you need to look at a common use case and
| consider how many syscalls you'd like it to take and how
| many CPU cycles would be reasonable.
|
| Let's take downloading a 1MB jpeg image over QUIC and
| rendering it on the screen.
|
| I would hope that can be done in about 100k CPU cycles
| and 20 syscalls, considering that all the jpeg decoding
| and rendering is going to be hardware accelerated. The
| decryption is also hardware accelerated.
|
| Unfortunately, no network API allows that right now. The
| CPU needs to do a substantial amount of processing for
| every individual packet, in both userspace and kernel
| space, for receiving the packet and sending the ACK, and
| there is no 'bulk decrypt' non-blocking API.
|
| Even the data path is troublesome - there should be a way
| for the data to go straight from the network card to the
| GPU, with the CPU not even touching it, but we're far
| from that.
| arghwhat wrote:
| There's a few issues here.
|
| 1. A 1 MB file is at the very least 64 individually
| encrypted TLS records (16k max size) sent in sequence,
| possibly more. So decryption 64 times is the maximum
| amount of bulk work you can do - this is done to allow
| streaming verification and decryption in parallel with
| the download, whereas one big block would have you wait
| for the very last byte before any processing could start.
|
| 2. TLS is still userspace and decryption does not involve
| the kernel, and thus no syscalls. The benefits of kernel
| TLS largely focus on servers sending files straight from
| disk, bypassing userspace for the entire data processing
| path. This is not really relevant receive-side for
| something you are actively decoding.
|
| 3. JPEG is, to my knowledge, rarely hardware offloaded on
| desktop, so no syscalls there.
|
| Now, the number of actual syscalls end up being dictated
| by the speed of the sender, and the tunable receive
| buffer size. The slower the sender, the _more_ kernel
| roundtrips you end up with, which allows you to amortize
| the processing over a longer period so everything is
| ready when the last packet is. For a fast enough sender
| with big enough receive buffers, this could be a single
| kernel roundtrip.
| miohtama wrote:
| JPEG is not a particularly great example. However, most
| video streams are partially hardware decoded. Usually you
| still need to decode part of the stream, namely entropy
| coding and metadata, first on the CPU.
| immibis wrote:
| This system call you're asking for already exists - it's
| called sendmmsg. There is also recvmmsg.
| lukeh wrote:
| async/await io_uring wrappers for languages such as Swift
| [1] and Rust [2] [3] can improve usability considerably. I'm
| not super familiar with the Rust wrappers, but I've been
| using IORingSwift for socket, file and serial I/O for some
| time now.
|
| [1] https://github.com/PADL/IORingSwift [2]
| https://github.com/bytedance/monoio [3]
| https://github.com/tokio-rs/tokio-uring
| anarazel wrote:
| FWIW, the biggest problem I've seen with efficiently using
| io_uring for networking is that none of the popular TLS
| libraries have a buffer ownership model that really is
| suitable for asynchronous network IO.
|
| What you'd want is the ability to control the buffer for the
| "raw network side", so that asynchronous network IO can be
| performed without having to copy between a raw network buffer
| and buffers owned by the TLS library.
|
| It also would really help if TLS libraries supported
| processing multiple TLS records in a batched fashion. Doing
| roundtrips between app <-> tls library <-> userspace network
| buffer <-> kernel <-> HW for every 16kB isn't exactly
| efficient.
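|
| A minimal sketch of the copy being described, using OpenSSL
| memory BIOs (a hypothetical helper; it assumes the SSL object
| was set up with SSL_set_bio(ssl, rbio, wbio)): bytes land in
| the app's buffer from the socket, get copied into the TLS
| library's BIO, and come back out decrypted - one extra copy
| per record that a buffer-ownership-aware API could avoid:
|
|     #include <openssl/ssl.h>
|     #include <unistd.h>
|
|     int pump_tls(SSL *ssl, BIO *rbio, int sock,
|                  char *plain, int plain_len) {
|         char raw[16 * 1024];              /* ~one TLS record */
|         ssize_t n = read(sock, raw, sizeof raw);  /* kernel -> app */
|         if (n <= 0) return (int)n;
|         BIO_write(rbio, raw, (int)n);     /* app -> TLS library */
|         return SSL_read(ssl, plain, plain_len);  /* decrypt out */
|     }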
| quotemstr wrote:
| > Yes, uring is fancy, but there's a tutorial level API middle
| ground possible that should be safe and 10x less overhead
| without resorting to uring level complexity.
|
| And the kernel has no business providing this middle-layer API.
| Why should it? Let people grab whatever they need from the
| ecosystem. Networking should be like Vulkan: it should have a
| high-performance, flexible API at the systems level with being
| "easy to use" a non-goal --- and higher-level facilities on
| top.
| astrange wrote:
| The kernel provides networking because it doesn't trust
| userspace to do it. If you provided a low level networking
| API you'd have to verify everything a client sends is not
| malicious or pretending to be from another process. And for
| the same reason, it'd only work for transmission, not
| receiving.
|
| That and nobody was able to get performant microkernels
| working at the time, so we ended up with everything in the
| monokernel.
|
| If you do trust the client processes then it could be better
| to just have them read/write IP packets though.
| namibj wrote:
| Also, it is really easy to do the normal IO "syscall
| wrappers" on top of io_uring instead, even easily exposing a
| very simple async/await variant of them that splits out the
| "block on completion (after which just like normal IO the
| data buffer has been copied into kernel space)" from the rest
| of the normal IO syscall, which allows pipelining & coalescing
| of requests.
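|
| A minimal sketch of such a wrapper using liburing (assumes
| liburing is available and the ring is already initialized; a
| blocking send that still goes through the ring, so it could
| later be batched or made async):
|
|     #include <liburing.h>
|
|     int ring_send(struct io_uring *ring, int fd,
|                   const void *buf, size_t len) {
|         struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|         io_uring_prep_send(sqe, fd, buf, len, 0);
|         io_uring_submit(ring);
|
|         struct io_uring_cqe *cqe;
|         io_uring_wait_cqe(ring, &cqe);  /* block on completion */
|         int res = cqe->res;             /* bytes sent or -errno */
|         io_uring_cqe_seen(ring, cqe);
|         return res;
|     }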
| modeless wrote:
| Seems to me that the real problem is the 1500 byte MTU that
| hasn't increased in practice in over _40 years_.
| asmor wrote:
| That's on the list that right after we all migrate to IPv6.
| p_l wrote:
| For all practical purposes, the internet MTU is lower than
| ethernet default MTU.
|
| Sometimes for ease of mind I end up clamping it to v6 minimum
| (1280), just in case.
| j16sdiz wrote:
| The real problem is some so-called "sysadmins" dropping all
| ICMP, breaking path MTU discovery.
| icedchai wrote:
| The most secure network is one that doesn't pass any
| traffic at all. ;)
| throw0101c wrote:
| > _Seems to me that the real problem is the 1500 byte MTU
| that hasn 't increased in practice in over 40 years._
|
| As per a sibling comment, 1500 is just for Ethernet (the
| default, jumbo frames being able to go to (at least) 9000).
| But the Internet is more than just Ethernet.
|
| If you're on DSL, then RFC 2516 states that PPPoE's MTU is
| 1492 (and you probably want an MSS of 1452). The PPP, L2TP,
| and ATM AAL5 standards all have 16-bit length fields allowing
| for packets up 64k in length. GPON ONT MTU is 2000. The
| default MTU for LTE is 1428. If you're on an HPC cluster,
| there's a good chance you're using Infiniband, which goes to
| 4096.
|
| What size do you suggest everyone on the planet go to?
| Who exactly is going to get everyone to switch to the new
| value?
| Hikikomori wrote:
| The internet is mostly ethernet these days (ISP core/edge),
| last mile connections like DSL and cable already handle a
| smaller MTU so should be fine with a bigger one.
| cesarb wrote:
| > The internet is mostly ethernet these days (ISP
| core/edge),
|
| A lot of that ISP edge is CPEs with WiFi, which AFAIK
| limits the MTU to 2304 bytes.
| throw0101c wrote:
| > _The internet is mostly ethernet these days_ [...]
|
| Except for the bajillion mobile devices in people's
| pockets/purses.
| fallingsquirrel wrote:
| > What size do you suggest everyone on the planet go
| to?
|
| 65536
|
| > Who exactly is going to get everyone to switch to the new
| value?
|
| The same people who got everyone to switch to IPv6. It's a
| missed opportunity that these migrations weren't done at
| the same time imho.
|
| It'll take a few decades, sure, but that's how big
| migrations go. What's the alternative? Making no
| improvements at all, forever?
| 0xbadcafebee wrote:
| > got everyone to switch to IPv6
|
| I have some bad news...
|
| > What's the alternative? Making no improvements at all,
| forever?
|
| No, sadly. The alternative is what the entire tech world
| has been doing for the past 15 years: shove
| "improvements" inside whatever crap we already have
| because nobody wants to replace the crap.
|
| If IPv6 were made today, it would be tunneled inside an
| HTTP connection. All the new apps would adopt it, the
| legacy apps would be abandoned or have shims made, and
| the whole thing would be inefficient and buggy, but
| adopted. Since poking my head outside of the tech world
| and into the wider world, it turns out this is how most
| of the world works.
| MerManMaid wrote:
| >If IPv6 were made today, it would be tunneled inside an
| HTTP connection. All the new apps would adopt it, the
| legacy apps would be abandoned or have shims made, and
| the whole thing would be inefficient and buggy, but
| adopted. Since poking my head outside of the tech world
| and into the wider world, it turns out this is how most
| of the world works.
|
| What you're suggesting here wouldn't work: you cannot wrap
| all the addressing information inside HTTP, which itself
| relies on IP for delivery. It would be the equivalent of
| sealing all the addressing information for a letter you'd
| like to send _inside_ the envelope.
| throw0101c wrote:
| > _If IPv6 were made today, it would be tunneled inside
| an HTTP connection._
|
| Given that one of the primary goals of IPv6 was increased
| address space, how would putting IPv6 in an HTTP
| connection riding over IPv4 solve that?
| Diggsey wrote:
| Historically there have been too many constraints on the Linux
| syscall interface:
|
| - Performance
|
| - Stability
|
| - Convenience
|
| - Security
|
| This differs from eg. Windows because on Windows the stable
| interface to the OS is in user-space, not tied to the syscall
| boundary. This has resulted in unfortunate compromises in the
| design of various pieces of OS functionality.
|
| Thankfully things like futex and io-uring have dropped the
| "convenience" constraint from the syscall itself and moved it
| into user-space. Convenience is still important, but it doesn't
| need to be a constraint at the lowest level, and shouldn't
| compromise the other ideals.
| amluto wrote:
| Hi, Tailscale person! If you want a fairly straightforward
| improvement you could make: Tailscale, by default uses
| SOCK_RAW. And having any raw socket listening at all hurts
| receive performance systemwide:
|
| https://lore.kernel.org/all/CALCETrVJqj1JJmHJhMoZ3Fuj685Unf=...
|
| It shouldn't be particularly hard to port over the optimization
| that prevents this problem for SOCK_PACKET. I'll get to it
| eventually (might be quite a while), but I only care about this
| because of Tailscale, and I don't have a ton of bandwidth.
| raggi wrote:
| Very interesting, thank you. We'll take a look at this!
| bradfitz wrote:
| BTW, that code changed just recently:
|
| https://github.com/tailscale/tailscale/commit/1c972bc7cbebfc.
| ..
|
| It's now a AF_PACKET/SOCK_DGRAM fd as it was originally meant
| to be.
| cryptonector wrote:
| Of these the hardest one to deal with is route lookup caching
| and reuse w/o connect(2). Obviously the UDP connected TCB can
| cache that, but if you don't want a "connected" socket fd...
| then there's nowhere else to cache it except ancillary data, so
| ancillary data it would have to be. But getting return-to-
| sender ancillary data on every read (so as to be able to copy
| it to any sends back to the same peer) adds overhead, so that's
| not good.
|
| A system call to get that ancillary data adds overhead that can
| be amortized by having the application cache it, so that's
| probably the right design, and if it could be combined with
| sending (so a new flavor of sendto(2)) that would be even
| better, and it all has to be uring-friendly.
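|
| For reference, the connected-socket variant that does let the
| kernel cache the route today (a minimal sketch, at the cost of
| pinning the socket to a single peer):
|
|     #include <netinet/in.h>
|     #include <sys/socket.h>
|
|     /* After this, plain send()/recv() work on the socket and
|      * the route lookup is done once at connect() time rather
|      * than on every sendto(). */
|     int pin_udp_peer(int fd, const struct sockaddr_in *peer) {
|         return connect(fd, (const struct sockaddr *)peer,
|                        sizeof *peer);
|     }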
| latentpot wrote:
| QUIC is the standard problem across n number of clients who
| choose Zscaler and similar content inspection tools. You can
| block it at the policy level but you also need to have it
| disabled at the browser level. Which sometimes magically turns on
| again and leads to a flurry of tickets for 'slow internet',
| 'Google search not working' etcetera.
| watermelon0 wrote:
| Wouldn't the issue in this case be with Zscaler, and not with
| QUIC?
| chgs wrote:
| The problem here is choosing software like zscaler
| mcosta wrote:
| Zscaler is not chosen, it is imposed by the corporation
| v1ne wrote:
| Hmm, interesting. We also have policies imposed by the
| Regulator(tm) that lead to us inspecting all web traffic.
| web traffic goes through a proxy that's configured in the web
| browser. No proxy, no internet.
|
| Out of curiosity: What's your use case to use ZScaler for this
| inspection instead?
| AlphaCharlie wrote:
| Free PDF file of the research: https://arxiv.org/pdf/2310.09423
| jiggawatts wrote:
| I wonder if the trick might be to repurpose technology from
| server hardware: partition the physical NIC into virtual PCI-e
| devices with distinct addresses, and map to user-space processes
| instead of virtual machines.
|
| So in essence, each browser tab or even each listening UDP socket
| could have a distinct IPv6 address dedicated to it, with packets
| delivered into a ring buffer in user-mode. This is so similar to
| what goes on with hypervisors now that existing hardware designs
| might even be able to handle it already.
|
| Just an idle thought...
| KaiserPro wrote:
| Or just have multiple TCP streams. Super simple, low cost, uses
| all the optimisations we have already.
|
| when the latency/packet drop is low, prune the connections and
| you get monster speed.
|
| When the latency/loss is high, grow the number of concurrent
| connections to overcome slow start.
|
| Doesn't give you QUIC like multipath though.
| m_eiman wrote:
| There's Multipath TCP.
| KaiserPro wrote:
| I mean there is, but from what I recall it's more a link
| aggregation thing, rather than a network portable system
| jeroenhd wrote:
| I've often pondered if it was possible to assign every
| application/tab/domain/origin a different IPv6 address to
| exchange data with, to make tracking people just a tad harder,
| but also to simplify per-process firewall rules. With the bare
| minimum, a /64, you could easily host billions of addresses per
| device without running out.
|
| I think there may be a limit to how many IP addresses NICs (and
| maybe drivers) can track at once, though.
|
| What I don't really get is why QUIC had to be invented when
| multi-stream protocols like SCTP already exist. SCTP brings the
| reliability of TCP with the multi-stream system that makes QUIC
| good for websites. Piping TLS over it is a bit of a pain (you
| don't want a separate handshake per stream), but surely there
| could be techniques to make it less painful (leveraging 0-RTT?
| Using session resumptions with tickets from the first connected
| stream?).
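|
| For what it's worth, the kernel API side is already simple - a
| sketch, assuming Linux with the sctp module loaded (the
| deployment problems raised in the replies are a separate
| matter):
|
|     #include <netinet/in.h>
|     #include <sys/socket.h>
|
|     /* One-to-many SCTP socket with message boundaries; the
|      * stream number is then chosen per message (e.g. via
|      * sctp_sendmsg() from lksctp-tools). */
|     int open_sctp(void) {
|         return socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
|     }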
| simiones wrote:
| First and foremost, you can't use SCTP on the Internet, so
| the whole idea is dead on arrival. The Internet only really
| works for TCP and UDP over IP - anything else, you have a
| loooooong tail of networks which will drop the traffic.
|
| Secondly, the whole point of QUIC is to merge the TLS and
| transport handshakes into a single packet, to reduce RTT.
| This would mean you need to modify SCTP anyway to allow for
| this use case, so even what small support exists for SCTP in
| the large would need to be upgraded.
|
| Thirdly, there is no reason to think that SCTP is better
| handled than UDP at the kernel's IP stack level. All of the
| problems of memory optimizations are likely to be much worse
| for SCTP than for UDP, as it's used far, far less.
| astrange wrote:
| Is there a service like test-ipv6 to see if SCTP works?
| Obviously harder to run since you can't do it in a browser.
| simiones wrote:
| I doubt there is, because it's just not a very popular
| thing to even try. Even WebRTC, which uses SCTP for non-
| streaming data channels, uses it over DTLS over UDP.
| jeroenhd wrote:
| I don't see why you can't use SCTP over the internet. HTTP2
| has fallbacks for broken or generally shitty middleboxes, I
| don't see why the weird corporate networks should hold back
| the rest of the world.
|
| TLS already does 0-RTT so you don't need QUIC for that.
|
| The problem with UDP is that many optimisations are simply
| not possible. The "TCP but with blackjack and hookers"
| approach QUIC took makes it very difficult to accelerate.
|
| SCTP is Fine(tm) on Linux but it's basically unimplemented
| on Windows. Acceleration beyond what these protocols can do
| right now requires either specific kernel/hardware QUIC
| parsing or kernel mode SCTP on Windows.
|
| Getting Microsoft to actually implement SCTP would be a lot
| cleaner than to hack yet another protocol on top of UDP out
| of fear of the mighty shitty middleboxes.
| simiones wrote:
| WebRTC decided they liked SCTP, so... they run it over
| UDP (well, over DTLS over UDP). And while HTTP/2 might
| fail over to HTTP/1.1, what would an SCTP session fall
| back to?
|
| The problem is not that Windows doesn't have in-kernel
| support for SCTP (there are several user-space libraries
| already available, you wouldn't even need to convince MS
| to do anything). The blocking issue is that many, many
| routers on the Internet, especially but not exclusively
| around all corporate networks, will drop any packet that
| is neither TCP or UDP over IP.
|
| And if you think UDP is not optimized, I'd bet you'll
| find that the SCTP situation is far, far worse.
|
| And regarding 0-RTT, that only works for resumed
| connections, and it is still actually 1 RTT (TCP
| connection establishment). New connections still need 2-3
| round trips (1 for TCP, 1 for TLS 1.3, or 2 for TLS 1.2)
| with TLS; they only need 1 round trip (even when using
| TLS 1.2 for encryption). With QUIC, you can have true
| 0-RTT traffic, sending the (encrypted) HTTP request data
| in the very first packet you send to a host [that you
| communicated with previously].
| kbolino wrote:
| How is userspace SCTP possible on Windows? Microsoft
| doesn't implement it in WinSock and, back in the XP SP2
| days, Microsoft disabled/hobbled raw sockets and has
| never allowed them since. Absent a kernel-mode driver, or
| Microsoft changing their stance (either on SCTP or raw
| sockets), you cannot send pure SCTP from a modern Windows
| box using only non-privileged application code.
| simiones wrote:
| Per these Microsoft docs [0], it seems that it should
| still be possible to open a raw socket on Windows 11, as
| long as you don't try to send TCP or UDP traffic through
| it (and have the right permissions, presumably).
|
| Of course, to open a raw socket you need privileged
| access, just like you do on Linux, because a raw socket
| allows you to see and respond to traffic from any other
| application (or even system traffic). But in principle
| you should be able to make a Service that handles SCTP
| traffic for you, and a non-privileged application could
| send its traffic to this service and receive data back.
|
| I did find some user-space library that is purported to
| support SCTP on Windows [1], but it may be quite old and
| not supported. Not sure if there is any real interest in
| something like this.
|
| [0] https://learn.microsoft.com/en-
| us/windows/win32/winsock/tcp-...
|
| [1] https://www.sctp.de/sctp-download.html
| kbolino wrote:
| Interesting. I think the service approach would now be
| viable since it can be paired with UNIX socket support,
| which was added a couple of years ago (otherwise COM or
| RPC would be necessary, making clients more complicated
| and Windows-specific). But yeah, the lack of interest is
| the bigger problem now.
| tepmoc wrote:
| SCTP works fine on the internet, as long as your egress is
| coming from a public IP and you don't perform NAT. So in the
| case of IPv6 it's a non-issue at all, unless you sit behind
| middleboxes.
|
| Probably the best approach would be something like Happy
| Eyeballs, but for transport.
| https://datatracker.ietf.org/doc/html/draft-
| grinnemo-taps-he
| simiones wrote:
| How many corporate or residential firewalls are
| configured to allow SCTP traffic through?
| tepmoc wrote:
| Residential - not many. Corporate, on the other hand, is a
| different story, which is why happy eyeballs for transport
| would still be needed for a gradual rollout anyway.
| wseqyrku wrote:
| There's work in progress on kernel support:
| https://github.com/lxin/quic
| crashingintoyou wrote:
| Don't have access to the published version but draft at
| https://arxiv.org/pdf/2310.09423 mentions ping RTT at 0.23ms.
|
| As someone frequently at 150ms+ latency for a lot of websites
| (and semi-frequently 300ms+ for non-geo-distributed websites), in
| practice with the latency QUIC is easily the best for throughput,
| HTTP/1.1 with a decent number of parallel connections is a not-
| that-distant second, and in a remote third is HTTP/2 due to head-
| of-line-blocking issues if/when a packet goes missing.
| sylware wrote:
| To go faster, you need to simplify a lot.
| bell-cot wrote:
| To force a lucrative cycle of hardware upgrades, you need
| software to do the opposite.
|
| True story: Back in the early aughties, Intel was hosting
| regular seminars for dealers and integrators selling either
| Intel-made PC's, or white box ones. I attended one of those,
| and the Intel rep openly claimed that Intel had challenged
| Microsoft to produce software which could bring a GHz CPU to
| its knees.
| larsonnn wrote:
| Site is blocking Apple's Private Relay :(
| Banou wrote:
| I think one of the reasons Google chose UDP is that it's already
| a popular protocol, on which you can build reliable delivery,
| while also having the base UDP unreliability on the side.
|
| From my perspective, which is a web developer's, having QUIC
| allowed the web standards to easily piggyback on top of it for
| the WebTransport API, which is way better than the current HTTP
| stack and WebRTC, which is a complete mess. Basically giving a TCP
| and UDP implementation for the web.
|
| Knowing this, I feel like it makes more sense to me why Google
| chose this way of doing things, which some people seem to be
| criticizing.
| simoncion wrote:
| > I think one of the reasons Google chose UDP is that it's
| already a popular protocol...
|
| If you want your packets to reliably travel fairly unmolested
| between you and an effectively-randomly-chosen-peer on The
| Greater Internet, you have two transport protocol choices:
| TCP/IP or UDP/IP.
|
| If you don't want the connection-management & etc that TCP/IP
| does for you, then you have exactly one choice.
|
| > ...which some people seem to be criticizing.
|
| People are criticizing the fact that on LAN link speeds (and
| fast (for the US) home internet speeds) QUIC is no better than
| (and sometimes worse than) previous HTTP transport protocols,
| despite the large amount of effort put into it.
|
| It also seems that some folks are suggesting that Google could
| have put that time and effort into improving Linux's packet-
| handling code and (presumably) getting that into both Android
| and mainline Linux.
| dathinab wrote:
| it says it isn't fast _enough_
|
| but as far as I can tell it's fast _enough_, just not as fast as
| it could be
|
| mainly they seem to test situations related to bandwidth/latency
| which aren't very realistic for the majority of users
| (because most users don't have super fast high bandwidth
| internet)
|
| this doesn't mean QUIC can't be faster or that we shouldn't look
| into reducing overhead, just that it's likely not as big a deal
| as it might initially look
| M2Ys4U wrote:
| >The results show that QUIC and HTTP/2 exhibit similar
| performance when the network bandwidth is relatively low (below
| ~600 Mbps)
|
| >Next, we investigate more realistic scenarios by conducting the
| same file download experiments on major browsers: Chrome, Edge,
| Firefox, and Opera. We observe that the performance gap is even
| larger than that in the cURL and quic_client experiments: on
| Chrome, QUIC begins to fall behind when the bandwidth exceeds
| ~500 Mbps.
|
| Okay, well, this isn't going to be a problem over the general
| Internet, it's more of a problem in local networks.
|
| For people that have high-speed connections, how often are you
| getting >500Mbps from a single source?
| sinuhe69 wrote:
| Well, I have other issues with QUIC: when I access Facebook
| with QUIC, the site often loads the first pages but then it
| kind of hangs, forcing me to refresh the site, which is
| annoying. I didn't know it was a problem with QUIC until I
| turned it off. Since then, FB & Co. load at the same speed,
| but don't show this annoying behavior anymore!
| inetknght wrote:
| > _For people that have high-speed connections, how often are
| you getting >500Mbps from a single source?_
|
| Often enough over HTTP/1.1 that discussions like this are
| relevant to my concerns.
| thelastparadise wrote:
| Gotta be QUICer than that, buddy!
| throw0101c wrote:
| Netflix has gotten TCP/TLS up to 800 Gbps (over many streams):
|
| * https://news.ycombinator.com/item?id=32519881
|
| * https://news.ycombinator.com/item?id=33449297
|
| hitting 100 Gbps (20k-30k customers) using less than 100W:
|
| * https://twitter.com/ocochardlabbe/status/1781848334145130661
|
| * https://news.ycombinator.com/item?id=40630699#unv_40630785
| ahmetozer wrote:
| For mobile connectivity -> quic
| For home internet wifi & cable access -> http2
| For heavily loaded enterprise slow wifi networks -> quic
| necessary wrote:
| Does QUIC do better with packet loss compared to TCP? TCP
| perceives packet loss as network congestion and so throughput
| over high bandwidth+high packet loss links suffers.
| AtNightWeCode wrote:
| For us, what QUIC solves is that mobile users that move around in
| the subway and so on are not getting these huge latency spikes.
| Which was one of our biggest complaints.
| 404mm wrote:
| When looking at the tested browsers, I want to ask why this was
| not tested on Safari (which is currently the second most used
| browser by share).
| exabrial wrote:
| QUIC needs an unencrypted mode!
| suprjami wrote:
| Pretty sure Dave Taht would explode if anyone did this.
| edwintorok wrote:
| TCP has a lot of offloads that may not all be available for UDP.
___________________________________________________________________
(page generated 2024-09-09 23:01 UTC)