[HN Gopher] Userspace isn't slow, some kernel interfaces are
___________________________________________________________________
Userspace isn't slow, some kernel interfaces are
Author : jeffhenson
Score : 151 points
Date : 2022-12-13 17:25 UTC (5 hours ago)
(HTM) web link (tailscale.com)
(TXT) w3m dump (tailscale.com)
| rwmj wrote:
| For something completely different, you might want to look at
| Unikernel Linux: https://arxiv.org/abs/2206.00789 You can run
| all the code without switching between userspace and the kernel,
| and call into kernel functions directly (with the usual caveats
| about the kernel ABI not being stable).
|
| There is a v1 patch posted on LKML, and I think they're hoping to
| get a v2 patch posted by January. If you are interested in a chat
| with the team, email me rjones redhat.com.
| bradfitz wrote:
| Fun! We have support for running on gokrazy
| (https://gokrazy.org/) already, and that's probably where
| Unikernel Linux is more applicable for us, for when people just
| want a "Tailscale appliance" image.
|
| I'll email you.
| ignoramous wrote:
| Hi: Is there any possibility of TSO and GRO working on
| Android?
| raggi wrote:
| One of the authors here: it could; all the kernel code is
| present. Right now the Android SELinux policy blocks some
| of the necessary ioctls (at least on the Pixel builds I
| tested).
| pjdesno wrote:
| There are a few gotchas with GRO, although I'm not sure they're
| applicable to WireGuard - in particular, there used to be a line
| in the kernel vswitch code that dropped a packet if it had been
| processed by GRO. A while back I spent a long time debugging a
| network problem caused by that particular "feature"...
| mattpallissard wrote:
| Title is clickbait-y. This has next to nothing to do with kernel
| interfaces and is all about network tuning and encapsulation. Not
| sure why the authors went with the title, as networking is
| interesting enough on its own.
|
| Also, the "slow" things about kernel interfaces (if you aren't
| doing IO which is nearly always the slowest thing) usually isn't
| a given syscall, it's the transition from user to kernel space
| and back. Lots of stuff going on such as flushing cache and
| buffers due to security concerns these days.
| bradfitz wrote:
| (Tailscale employee)
|
| We certainly didn't try to make it click-baity. The point of
| the title is that people assumed Tailscale was slower than
| kernel WireGuard because the kernel must be intrinsically
| faster somehow. The point of the blog post is to say, "no, code
| can run fast on either side... you just have to cross the
| boundary less." The blog post is all about how we then cross
| that boundary less, using a less obvious kernel interface.
| cycomanic wrote:
| Just some feedback: that's not what I expected from the title,
| and I would agree with the previous poster that the title is
| a little clickbait-y (though only mildly so).
| wpietri wrote:
| For what it's worth, the combination of the source and the
| title made sense to me, so I think it's fine as is.
| karmakaze wrote:
| Thanks for the clarifying reply. I thought most folks who
| cared knew it was about context switches and not speed on one
| side vs the other. Now I'm really interested to read the full
| article.
| jesboat wrote:
| I disagree. The two main points of the article are "nothing is
| inherently slow about doing stuff in userland (as shown by the
| fact that we made a fast implementation)" and "kernel
| interfaces, i.e. particular methods of boundary crossing, can
| be (as shown by the fact that the way they made it faster was
| in large part by doing the boundary crossings better)".
|
| The title gave me a reasonably decent idea of what to expect,
| and the article delivered.
| raggi wrote:
| One of the authors here: What I was going for with the title is
| that per-packet read/write switching (the before case) is very
| slow for packet-sized work, and batching (~>=64KB per crossing)
| is much faster - it's about amortizing the cost of the
| transition, as you rightly point out. That's the point the
| title is making - some interfaces do not provide the ability to
| amortize that cost, others do!
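|
| A toy sketch of that amortization (just an illustration, not
| the TUN offload path from the post; /dev/null stands in for
| the device purely to isolate syscall cost): the same bytes
| cross the user/kernel boundary either one packet-sized write
| at a time or ~64KB per write.
|
|   package main
|
|   import (
|       "fmt"
|       "os"
|       "time"
|   )
|
|   func main() {
|       f, err := os.OpenFile(os.DevNull, os.O_WRONLY, 0)
|       if err != nil {
|           panic(err)
|       }
|       defer f.Close()
|
|       const pktSize = 1280 // one packet per crossing
|       const batch = 50     // ~64 KB per crossing
|       const rounds = 10000
|       pkt := make([]byte, pktSize)
|       big := make([]byte, pktSize*batch)
|
|       start := time.Now()
|       for i := 0; i < rounds*batch; i++ {
|           f.Write(pkt) // one kernel transition per packet
|       }
|       perPacket := time.Since(start)
|
|       start = time.Now()
|       for i := 0; i < rounds; i++ {
|           f.Write(big) // one kernel transition per ~64 KB
|       }
|       batched := time.Since(start)
|
|       fmt.Printf("per-packet: %v\nbatched:    %v (same bytes)\n",
|           perPacket, batched)
|   }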
| sublinear wrote:
| Similar to how JavaScript isn't slow, networking in general is.
| :)
| adtac wrote:
| It'd be interesting to see the benchmark environment's raw
| ChaCha20Poly1305 throughput (the x/crypto implementation) in the
| analysis. My hunch is it's several times greater than the
| network's line rate, which would further support the argument.
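|
| For reference, a rough single-core probe of x/crypto's
| ChaCha20Poly1305 on packet-sized payloads could look something
| like this (a sketch only; the 1280-byte payload and iteration
| count are arbitrary choices, not the post's benchmark setup):
|
|   package main
|
|   import (
|       "crypto/rand"
|       "fmt"
|       "time"
|
|       "golang.org/x/crypto/chacha20poly1305"
|   )
|
|   func main() {
|       key := make([]byte, chacha20poly1305.KeySize)
|       if _, err := rand.Read(key); err != nil {
|           panic(err)
|       }
|       aead, err := chacha20poly1305.New(key)
|       if err != nil {
|           panic(err)
|       }
|
|       nonce := make([]byte, chacha20poly1305.NonceSize)
|       plaintext := make([]byte, 1280) // WireGuard-ish packet
|       dst := make([]byte, 0, len(plaintext)+aead.Overhead())
|
|       const n = 1_000_000
|       start := time.Now()
|       for i := 0; i < n; i++ {
|           // Seal one packet-sized payload per iteration.
|           dst = aead.Seal(dst[:0], nonce, plaintext, nil)
|       }
|       elapsed := time.Since(start)
|
|       gbps := float64(n) * float64(len(plaintext)) * 8 /
|           elapsed.Seconds() / 1e9
|       fmt.Printf("%.2f Gbit/s single-core, %v per packet\n",
|           gbps, elapsed/n)
|   }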
| wmf wrote:
| I noticed that very little of the flame graph is crypto, which
| implies that the system under test could do 20-30 Gbps of
| ChaCha20Poly1305.
| adtac wrote:
| Yeah, the flame graphs show ~9% of the time being spent in
| golang.org/x/crypto/chacha20poly1305, so you're probably
| right, but flame graphs and throughput aren't always a
| one-to-one mapping. Flame graphs just tell you where time was
| spent per packet, but depending on the workload, there are
| some things in the life of a packet that you can parallelise
| and some things you can't.
|
| Just thought it'd be interesting to see the actual throughput
| along with the rest for the benchmarked environment.
| raggi wrote:
| One of the authors here: yeah, it's very interesting. The
| flame graphs here don't do a great job of highlighting one
| aspect of the challenge, which is that crypto fans out
| across many CPUs. I think the hunch that 20-30gbps is
| attainable (on a fast system) is accurate - it'll take more
| work to get there.
|
| What's interesting is that the cost of x/crypto on these
| system classes is prohibitive for serial decoding at
| 10gbps. Ballparking with a 1280-byte MTU, you have about
| 1000ns to process each packet, but it takes about 2000ns to
| encrypt one. The fan-out is critical at these levels, and
| will always introduce its own additional costs from
| synchronization, memory management and so on.
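|
| A quick back-of-envelope check of those figures (the 2000ns
| encrypt cost is the rough number quoted above, not a new
| measurement):
|
|   package main
|
|   import "fmt"
|
|   func main() {
|       const (
|           linkBits   = 10e9   // 10 Gbit/s
|           packetSize = 1280.0 // bytes, conservative MTU
|           encryptNS  = 2000.0 // approx. ns to seal one packet
|       )
|
|       pktsPerSec := linkBits / (packetSize * 8)
|       budgetNS := 1e9 / pktsPerSec
|
|       fmt.Printf("packets/s at line rate:  %.0f\n", pktsPerSec)
|       fmt.Printf("budget per packet:       %.0f ns\n", budgetNS)
|       // ~1024ns of budget vs ~2000ns of crypto: roughly two
|       // cores' worth of fan-out before counting anything else.
|       fmt.Printf("cores needed for crypto: %.1f\n",
|           encryptNS/budgetNS)
|   }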
| [deleted]
| lost_tourist wrote:
| If userspace were really exceedingly slow, then we wouldn't
| bother using it.
| yjftsjthsd-h wrote:
| Leaving aside the general question (which sibling comment
| covers), there's an unwritten qualification of "userspace is
| generally seen as slow _for drivers_ (network, disk,
| filesystem)", and... we generally _don't_ bother using it for
| those things, or at least we try to move the data path into the
| kernel when we care about performance.
| throwaway09223 wrote:
| I can explain why this isn't correct.
|
| We have a concept of userspace for safety. Systems without
| protected memory are very unstable. The tradeoff is speed.
|
| Trading speed for safety is extremely commonplace.
|
| * Every assert or runtime validation
|
| * Every time we guard against speculative execution attacks
| (enormous perf hits)
|
| * Every time we implement a "safe" runtime like Java
|
| * Process safety models using protected memory
|
| Efficiency is one of many competing concerns in a complex
| system.
| 0xQSL wrote:
| Would it be an option to use io_uring to further reduce syscall
| overhead? Perhaps there's also a way to do zero-copy?
| bradfitz wrote:
| That was previously explored in
| https://github.com/tailscale/tailscale/issues/2303 and will
| probably still happen.
|
| When Josh et al. tried it, they hit some fun kernel bugs on
| certain kernel versions, and that soured them on it for a bit,
| knowing it wouldn't be as widely usable as we'd hoped given
| what kernels were in common use at the time. It's almost
| certainly better nowadays.
|
| Hopefully the Go runtime starts doing it instead:
| https://github.com/golang/go/issues/31908
| majke wrote:
| I can chime in with some optimizations (Linux).
|
| For normal UDP sockets, UDP_GRO and UDP_SEGMENT can be even
| faster than sendmmsg/recvmmsg.
|
| In gVisor they decided that read/write on the tun device is
| slow, so they did PACKET_MMAP on a raw socket instead. AFAIU
| they just ignore the tap device and run a raw socket on it;
| dumping packets from a raw socket is a faster interface than
| the device itself.
|
| https://github.com/google/gvisor/blob/master/pkg/tcpip/link/...
| https://github.com/google/gvisor/issues/210
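|
| A minimal Linux-only sketch of turning those two options on for
| a Go UDP socket (an illustration, not code from any of the
| projects above; the 1280-byte segment size is an assumption):
|
|   package main
|
|   import (
|       "log"
|       "net"
|
|       "golang.org/x/sys/unix"
|   )
|
|   func main() {
|       conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 0})
|       if err != nil {
|           log.Fatal(err)
|       }
|       defer conn.Close()
|
|       rc, err := conn.SyscallConn()
|       if err != nil {
|           log.Fatal(err)
|       }
|       var serr error
|       err = rc.Control(func(fd uintptr) {
|           // UDP_SEGMENT: each large send is split by the kernel
|           // into 1280-byte datagrams on transmit (GSO).
|           serr = unix.SetsockoptInt(int(fd), unix.IPPROTO_UDP,
|               unix.UDP_SEGMENT, 1280)
|           if serr != nil {
|               return
|           }
|           // UDP_GRO: the kernel coalesces trains of datagrams on
|           // receive, so one read returns many packets' payload.
|           serr = unix.SetsockoptInt(int(fd), unix.IPPROTO_UDP,
|               unix.UDP_GRO, 1)
|       })
|       if err != nil {
|           log.Fatal(err)
|       }
|       if serr != nil {
|           log.Fatal(serr)
|       }
|       log.Println("UDP_SEGMENT and UDP_GRO enabled")
|   }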
| Matthias247 wrote:
| Not only can it be a lot faster, it definitely is.
|
| I did a lot of work on QUIC protocol efficiency improvements
| over the last 3 years. The use of sendmmsg/recvmmsg yields
| maybe a 10% efficiency improvement, because it only helps with
| reducing the system call overhead. Once the data is inside the
| kernel, these calls just behave like a loop of sendmsg/recvmsg
| calls.
|
| The syscall overhead, however, isn't the bottleneck - all the
| other work in the network stack is: looking up routes for each
| packet, applying iptables rules, running BPF programs, etc.
|
| Using segmentation offloads means the packets also traverse
| the remaining path as a single unit. This can allow for
| efficiency improvements somewhere between 200% and 500%,
| depending on the overall application. It's very much worth
| looking at GSO/GRO if you are doing anything that requires
| bulk UDP datagram transmission.
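|
| To make the "single unit" point concrete, here is a send-side
| sketch (an assumed illustration; the 127.0.0.1:9999 target and
| the 1280-byte segment size are placeholders): with UDP_SEGMENT
| set, one write hands the kernel 50 packets' worth of data that
| goes through routing, netfilter and so on once, and is only
| split into datagrams at the end. A sendmmsg-style batch of the
| same data would still push 50 separate packets through that
| path.
|
|   package main
|
|   import (
|       "log"
|       "net"
|
|       "golang.org/x/sys/unix"
|   )
|
|   func main() {
|       conn, err := net.DialUDP("udp", nil,
|           &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 9999})
|       if err != nil {
|           log.Fatal(err)
|       }
|       defer conn.Close()
|
|       rc, err := conn.SyscallConn()
|       if err != nil {
|           log.Fatal(err)
|       }
|       rc.Control(func(fd uintptr) {
|           // Segment each large send into 1280-byte datagrams.
|           if err := unix.SetsockoptInt(int(fd), unix.IPPROTO_UDP,
|               unix.UDP_SEGMENT, 1280); err != nil {
|               log.Fatal(err)
|           }
|       })
|
|       payload := make([]byte, 50*1280) // 50 packets' worth
|       // One syscall, one pass through the stack, 50 datagrams
|       // on the wire.
|       if _, err := conn.Write(payload); err != nil {
|           log.Fatal(err)
|       }
|   }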
___________________________________________________________________
(page generated 2022-12-13 23:00 UTC)