[HN Gopher] Io_uring is not an event system
___________________________________________________________________
Io_uring is not an event system
Author : ot
Score : 217 points
Date : 2021-06-17 14:56 UTC (8 hours ago)
(HTM) web link (despairlabs.com)
(TXT) w3m dump (despairlabs.com)
| ayanamist wrote:
| So Linux people find that the model Windows IOCP uses is
| better?
| asveikau wrote:
| I think it's long been understood that it's better for disk
| I/O. I'm not sure the consensus is as clear for sockets.
| hermanradtke wrote:
| As a Linux person: Yes, and I have thought so for a long time.
|
| This is why many of us are excited about io_uring.
| spullara wrote:
| Now if only they could allocate the memory for you when it's
| needed, rather than requiring a bunch of buffers to be
| allocated before they're actually needed.
| ot wrote:
| Yes, and now things are going full circle and windows is
| adopting the ring model too.
|
| https://windows-internals.com/i-o-rings-when-one-i-o-operati...
| muststopmyths wrote:
| huh, interesting. As far as I can tell the main advantage
| this has over IOCP is that you can get one completion for
| multiple read requests.
|
| Looks like they took a lot of the concepts from Winsock RIO
| and applied them to file I/O. Which is fascinating because
| with network traffic you can't predict packet boundaries and
| thus your I/O rate can be unpredictable. RIO helps you get
| the notification rate under control, which can help if your
| packet rate is very high.
|
| With files, I would think you can control the rate at which
| you request data, as well as the memory you allocate for it.
|
| The other thing it saves, just like RIO, is the overhead of
| locking/unlocking buffers, by preregistering them. Is that the
| main reason for this API, then?
|
| I would be very interested to hear from people who have
| actually run into limits with overlapped file reads and are
| therefore excited about IoRings.
| volta83 wrote:
| Yes.
|
| Unfortunately Rust went the exact other way.
| ginsmar wrote:
| Very very interesting.
| grok22 wrote:
| Isn't locking a problem with io_uring? Won't you block the kernel
| when it's trying to do the event completion stuff and the
| completion work tries to take a lock? Or is the completion stuff
| done entirely in user-space and blocking is not a problem? Maybe
| I need to read up on this a bit more...
| zxzax wrote:
| The submissions and completions are stored in a lock-free ring
| buffer, hence the name "uring."
| legulere wrote:
| lock-free ring buffers still have locks for the case when
| they are full, so it would be interesting to see how the
| kernel behaves when you never read from the completion ring.
| zxzax wrote:
| You will get an error for that upon submit. The application
| can then buffer the submissions somewhere else and wait for
| a few completions to finish.
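|
| As a rough sketch of that pattern with liburing (illustrative
| only, error handling trimmed; the function name is just for
| this example): io_uring_get_sqe() hands back NULL when the
| submission ring is full, at which point you can reap a few
| completions before retrying.
|
|         #include <liburing.h>
|
|         /* Sketch: back off when the submission ring is full by
|          * draining completions first. Assumes `ring` came from
|          * io_uring_queue_init(). */
|         struct io_uring_sqe *get_sqe_or_drain(struct io_uring *ring)
|         {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|             while (!sqe) {                        /* SQ ring full */
|                 struct io_uring_cqe *cqe;
|                 io_uring_submit(ring);            /* push what we have */
|                 if (io_uring_wait_cqe(ring, &cqe) == 0) {
|                     /* ... handle cqe->res / cqe->user_data ... */
|                     io_uring_cqe_seen(ring, cqe); /* frees a CQ slot */
|                 }
|                 sqe = io_uring_get_sqe(ring);
|             }
|             return sqe;
|         }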
| mzs wrote:
| bandwidth exceeded alternative:
| https://web.archive.org/web/20210617150204/https://despairla...
| zootboy wrote:
| It seems that their server is rewriting the 451 error to a 403,
| which caused Archive.org to drop the page from its archives.
| Unfortunate...
| bilalhusain wrote:
| Currently serving Bandwidth Restricted page - 451.
|
| Cached version of the write up https://archive.is/VgHkW
| joshmarinacci wrote:
| I'm curious how this handles the case where the calling program
| dies or wants to cancel the request before it's actually
| happened.
| ww520 wrote:
| The do_exit() function in kernel/exit.c is responsible for
| general cleanup on a process [1]. Whether a process dies
| gracefully or abruptly, the kernel calls do_exit() to clean up
| all the resources owned by the process, like opened files or
| acquired locks. I would imagine the io_uring related stuff is
| cleaned up there as well.
|
| [1]
| https://elixir.bootlin.com/linux/v5.13-rc6/source/kernel/exi...
|
| Edit: I just looked at the latest version of the source [1].
| Yes, it does clean up io_uring related files.
| asdfasgasdgasdg wrote:
| I don't know the answer but I would assume if you have
| submitted an operation to the kernel, you should assume it's in
| an indeterminate state until you get the result. If the program
| dies then the call may complete or not.
|
| For cancellation there is an API. Example call:
| https://github.com/axboe/liburing/blob/c4c280f31b0e05a1ea792...
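|
| For what it's worth, a hedged sketch of how a cancel looks with
| liburing (signature as of the 2021-era library; newer releases
| also offer io_uring_prep_cancel64; error handling omitted):
|
|         #include <liburing.h>
|
|         /* Sketch: ask the kernel to cancel a previously submitted
|          * request, identified by its user_data pointer. */
|         static void cancel_request(struct io_uring *ring, void *target)
|         {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|             io_uring_prep_cancel(sqe, target, 0);
|             io_uring_submit(ring);
|             /* The cancel's own completion reports whether the target
|              * was found and cancelled (0), already running
|              * (-EALREADY), or already finished (-ENOENT). */
|         }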
| PaulDavisThe1st wrote:
| Death: the same way it handles a program that dies with an
| open socket and unread data still arriving. It's just part of
| the overall resource set of the process and has to be cleaned
| up when the process goes away.
| raphlinus wrote:
| Something I've been thinking about that maybe the HN hivemind can
| help with; I know enough about io_uring and GPU each to be
| dangerous.
|
| The roundtrip for command buffer submission in GPU is huge by my
| estimation, around 100us. On a 10TFLOPS card, which is nowhere
| near the top of the line, that's 1 billion operations. I don't
| know exactly where all the time is going, but suspect it's a
| bunch of process and kernel transitions between the application,
| the userland driver, and the kernel driver.
|
| My understanding is that games mostly work around this by
| batching up a lot of work (many dozens of draw calls, for
| example) in one submission. But it's still a problem if CPU
| readback is part of the workload.
|
| So my question is: can a technique like io_uring be used here, to
| keep the GPU pipeline full and only take expensive transitions
| when absolutely needed? I suspect the programming model will be
| different and in some cases harder, but that's already part of
| the territory with GPU.
| boardwaalk wrote:
| The communication between the host and the GPU already works on
| a ring buffer on any modern GPU I believe.
|
| It's why graphics APIs are asynchronous until you synchronize,
| e.g. by flipping a frame buffer or reading something back.
|
| APIs like Vulkan are very explicit about this and have fences
| and semaphores. Older APIs will just block if you do something
| that requires blocking.
| api wrote:
| What's funny about io_uring is that the blocking syscall
| interface was always something critics of Unix pointed out as
| being a major shortcoming. We had OSes that did this kind of
| thing way back in the 1980s and 1990s, but Unix with its
| simplicity and generalism and free implementations took over.
|
| Now Unix is finally, in 2021, getting a syscall queue construct
| where I can interact with the kernel asynchronously.
| CodesInChaos wrote:
| io_uring is Linux specific. BSD offers kqueue instead,
| introduced in 2000.
|
| I believe both are limited to specific operations, mostly IO,
| and aren't fully general asynchronous syscall interfaces.
| rapsey wrote:
| The entire point of the article is that it is not just for
| IO.
| binarycrusader wrote:
| The OP said _mostly_ IO not only.
| CodesInChaos wrote:
| > The entire point of the article is that it is not just
| for IO
|
| I went through the list of operations supported by
| `io_uring_enter`. Almost all of them are for IO, the
| remainder (NOP, timeout, madvise) are useful for supporting
| IO, though madvise might have some non-IO uses as well.
| While io_uring could form the basis of a generic async
| syscall interface in the future, in its current state it
| most certainly is not.
|
| The article mostly talks about io_uring enabling completion
| based IO instead of readiness based IO.
|
| AFAIK kqueue also supports completion based IO using
| aio_read/aio_write together with sigevent.
| [deleted]
| coder543 wrote:
| > AFAIK kqueue also supports completion based IO using
| aio_read/aio_write together with sigevent.
|
| If you can point to a practical example of a program
| doing it this way and seeing a performance benefit, I
| would be curious to see it. I did some googling and
| didn't really even find any articles mentioning this as
| possibility.
|
| kqueue is widely considered to be readiness based, just
| like epoll, not completion based.
|
| What you wrote sounds like an interesting hack, but I'm
| not sure it counts for much if it is impractical to use.
| CodesInChaos wrote:
| I don't know if it offers any practical benefits for
| sequential/socket IO. But AFAIK it's the way to go if you
| want to do async random-access/file IO.
| binarycrusader wrote:
| Solaris had this over a decade ago now with event ports which
| are basically a variant of Windows IOCP:
|
| https://web.archive.org/web/20110719052845/http://developers...
|
| So at least one UNIX system had them a while ago.
| wahern wrote:
| People keep saying this but IME that's not how the Solaris
| Event Ports API works _at_ _all_. The semantics of Solaris
| Event Ports is nearly identical to both epoll+family and
| kqueue. And like with both those others (ignoring io_uring),
| I/O _completion_ is done using the POSIX AIO interface, which
| signals completion through the Event Port descriptor.
|
| I've written at least two wrapper libraries for I/O
| readiness, POSIX signal, file event, and user-triggered event
| polling that encompass epoll, kqueue, and Solaris Event
| Ports. Supporting all three is _relatively_ trivial from an
| API perspective because they work so similarly. In fact,
| notably all _three_ let you poll on the epoll, kqueue, or
| Event Port descriptor itself. So you can have event queue
| _trees_, which is very handy when writing composable
| libraries.
| ww520 wrote:
| I remember IPX had a similar communication model. To read from
| the network, you post an array of buffer pointers to the IPX
| driver and then continue doing whatever. When the buffers are
| filled, the driver calls your completion function.
| PaulDavisThe1st wrote:
| Maybe Linux will get scheduler activations in the near future,
| another OS feature from the 90s that ended up in Solaris and
| more or less nowhere else. "Let my user space thread scheduler
| do its work!"
| jlokier wrote:
| We talked about adding that to Linux in the 90s too. A
| simple, small scheduler-hook system call that would allow
| userspace to cover different asynchronous I/O scheduling
| cases efficiently.
|
| The sort of thing Go and Rust runtimes try to approximate in
| a hackish way nowadays. They would both be improved by an
| appropriate scheduler-activation hook.
|
| Back then the idea didn't gain support. It needed a champion,
| and nobody cared enough. It seemed unnecessary, complicated.
| What was done instead seemed to be driven by interests that
| focused on one kind of task or another, e.g. networking or
| databases.
|
| It doesn't help that many people's understanding of
| performance around asynchronous I/O, stackless and stackful
| coroutines, userspace-kernel interactions, CPU-hardware
| interactions and so on is not particularly deep. For example,
| I've met a few people who argued that "async-await" is the
| modern and faster alternative to threads in every scenario,
| except for needing N threads to use N CPU cores. But that is
| far from correct. Stackful coroutines doing blocking I/O with
| complex logic (such as filesystems) are lighter than async-
| await coroutines doing the same thing, and "heavy" fair
| scheduling can improve throughput and latency statistics over
| naive queueing.
|
| It's exciting to see efficient userspace-kernel I/O
| scheduling getting attention, and getting better over the
| years. Kudos to the implementors.
|
| But it's also kind of depressing that things that were on the
| table 20-25 years ago take this long to be evaluated. It's
| almost as if economics and personal situations govern progress
| much more than knowledge and ideas...
| PaulDavisThe1st wrote:
| Actually, I think the biggest obstacle is that as cool as
| scheduler activations are, it turns out that not many
| applications are really in a position to benefit from them.
| The ones that can found other ways ("workarounds") to
| address the fact that the kernel scheduler can't know which
| user space thread to run. They did so because it was
| important to them.
| zaphar wrote:
| It's almost as if economics and personal situations
| govern progress much more than knowledge and ideas...
|
| That has always been the case and will probably always be
| the case.
| aseipp wrote:
| There's already plans for a new futex-based swap_to
| primitive, for improving userland thread scheduling
| capabilities. There was some work done on it last year, but
| it was rejected on LKML. At this rate, it looks like it will
| not move forward until the new futex2 syscall is in place,
| since the original API is showing its age.
|
| So, it will probably happen Soon(tm), but I'd say you're still
| ~2 years out from being able to reliably depend on it.
| PaulDavisThe1st wrote:
| Scheduler activations don't require swap_to.
|
| The kernel wakes up the user space scheduler when it
| decides to put the process onto a cpu. The user space
| scheduler decides which _user space_ thread executes in the
| kernel thread context that it runs in, and does a user
| space thread switch (not a full context switch) to it. It's a
| combination of kernel threads and user space (aka "green")
| threads.
| gpderetta wrote:
| I think some of the *BSDs have (or had) it. Linux almost got
| it at the turn of the millennium, with the Next Generation
| Posix Threading project, but then the much simpler and faster
| NPTL won.
| rektide wrote:
| It might! I'm not sure if it's an exact fit or not, but the
| User Managed Concurrency Groups work[1] Google is trying to
| upstream with their Fibers userland-scheduling library sounds
| like it could be a match, and perhaps it could get the
| upstreaming it's seeking.
|
| [1] https://www.phoronix.com/scan.php?page=news_item&px=Googl
| e-F...
| tele_ski wrote:
| I think this might be the best explanation I've read of why
| io_uring should be better than epoll since it effectively
| collapses the 'tell me when this is ready' with the 'do action'
| part. That was the really enlightening part for me.
|
| I have to say though, the name io_uring seems unfortunate and
| I think the author touches on this in the article... the name
| is really an implementation detail, but io_uring's true purpose
| is a generic asynchronous syscall facility that is currently
| tailored towards I/O. syscall_queue or async_queue or something
| else...? A descriptive API name, not an implementation detail,
| would probably go a long way in helping the feature be easier
| to understand. Even Windows' IOCP seems infinitely better named
| than 'uring'.
| pydry wrote:
| I'm still confused, because this is exactly what I always
| thought the difference between epoll and select was.
|
| "what if, instead of the kernel telling us when something is
| ready for an action to be taken so that we can take it, we tell
| the kernel what action we want to take, and it will do it when
| the conditions become right."
|
| The difference between select and epoll was that select would
| keep checking in until the conditions were right while epoll
| would send _you_ a message. That was gamechanging.
|
| - I'm not really sure why this is seen as such a fundamental
| change. It's changed from the kernel triggering a callback
| to... a callback.
| asveikau wrote:
| select, poll, epoll, are all the same model of blocking and
| signalling for readiness.
|
| The problem with the former occurs with large lists of file
| descriptors. Calling from user to kernel, the kernel needs to
| copy and examine N file descriptors. When user mode comes
| back, it needs to scan its list of file descriptors to see
| what changed. That's two O(n) scans at every syscall, one
| kernel side, one user side, even if only one file descriptor
| (or none at all) has an event.
|
| epoll and kqueue make it so that the kernel persists the list
| of interesting file descriptors between calls, and only
| returns back what has actually changed, without either side
| needing to scan an entire list.
|
| By contrast, the high level programming model of io_uring
| seems pretty similar to POSIX AIO or Windows async I/O [away
| from readiness and more towards "actually do the thing"], but
| with the innovation being a new data structure that allows
| reduction in syscall overhead.
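|
| For contrast, the readiness loop being described looks roughly
| like this (plain epoll sketch; handle_fd is a placeholder for
| the read/write you still have to do yourself):
|
|         #include <sys/epoll.h>
|
|         static void handle_fd(int fd)
|         {
|             (void)fd;  /* the actual read()/write() goes here */
|         }
|
|         /* The interest list is registered once; each epoll_wait()
|          * returns only the descriptors that actually have events. */
|         static void readiness_loop(int listen_fd)
|         {
|             int ep = epoll_create1(0);
|             struct epoll_event ev = { .events = EPOLLIN };
|             ev.data.fd = listen_fd;
|             epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);
|
|             struct epoll_event ready[64];
|             for (;;) {
|                 int n = epoll_wait(ep, ready, 64, -1);
|                 for (int i = 0; i < n; i++)
|                     handle_fd(ready[i].data.fd);  /* readiness only */
|             }
|         }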
| coder543 wrote:
| epoll: tell me when any of these descriptors are ready, then
| I'll issue another syscall to actually read from that
| descriptor into a buffer.
|
| io_uring: when any of these descriptors are ready, read into
| any one of these buffers I've preallocated for you, then let
| me know when it is done.
|
| Instead of waking up a process just so it can do the work of
| calling back into the kernel to have the kernel fill a
| buffer, io_uring skips that extra syscall altogether.
|
| Taking things to the next level, io_uring allows you to chain
| operations together. You can tell it to read from one socket
| and write the results into a different socket or directly to
| a file, and it can do that without waking your process
| pointlessly at any intermediate stage.
|
| A nearby comment also mentioned opening files, and that's
| cool too. You could issue an entire command sequence to
| io_uring, then your program can work on other stuff and check
| on it later, or just go to sleep until everything is done.
| You could tell the kernel that you want it to open a
| connection, write a particular buffer that you prepared for
| it into that connection, then open a specific file on disk,
| read the response into that file, close the file, then send a
| prepared buffer as a response to the connection, close the
| connection, then let you know that it is all done. You just
| have to prepare two buffers on the frontend, issue the
| commands (which could require either 1 or 0 syscalls,
| depending on how you're using io_uring), then do whatever you
| want.
|
| You can even have numerous command sequences under kernel
| control in parallel, you don't have to issue them one at a
| time and wait on them to finish before you can issue the next
| one.
|
| With epoll, you have to do every individual step along the
| way yourself, which involves syscalls, context switches, and
| potentially more code complexity. Then you realize that epoll
| doesn't even support file I/O, so you have to mix multiple
| approaches together to even approximate what io_uring is
| doing.
|
| (Note: I've been looking for an excuse to use io_uring, so
| I've read a ton about it, but I don't have any practical
| experience with it yet. But everything I wrote above should
| be accurate.)
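|
| For anyone who wants to see the shape of it, here's a minimal
| liburing sketch of the completion model (no registered buffers,
| no chaining, error handling omitted, so treat it as
| illustrative rather than production code):
|
|         #include <liburing.h>
|         #include <fcntl.h>
|         #include <stdio.h>
|
|         int main(void)
|         {
|             struct io_uring ring;
|             io_uring_queue_init(8, &ring, 0);    /* 8-entry ring */
|
|             int fd = open("/etc/hostname", O_RDONLY);
|             char buf[4096];
|
|             /* Describe the operation: "read into this buffer"... */
|             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|             io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
|             io_uring_submit(&ring);              /* one syscall */
|
|             /* ...and collect the completion later: the data is
|              * already in buf, no second read() call needed. */
|             struct io_uring_cqe *cqe;
|             io_uring_wait_cqe(&ring, &cqe);
|             printf("read %d bytes\n", cqe->res);
|             io_uring_cqe_seen(&ring, cqe);
|
|             io_uring_queue_exit(&ring);
|             return 0;
|         }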
| throwaway81523 wrote:
| Being able to open files with io_uring is important because
| there is no other way to do it without an unpredictable
| delay. Some systems like Erlang end up using separate OS
| threads just to be able to open files without blocking the
| main interpreter thread.
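|
| For reference, a sketch of what that looks like with liburing
| (IORING_OP_OPENAT needs a 5.6+ kernel; error handling omitted
| and the helper name is just for illustration):
|
|         #include <liburing.h>
|         #include <fcntl.h>
|
|         /* Sketch: queue an open without blocking the calling
|          * thread. The completion's res field is the new file
|          * descriptor (or a negative errno). */
|         static void queue_open(struct io_uring *ring, const char *path)
|         {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|             io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
|             io_uring_sqe_set_data(sqe, (void *)path);
|             io_uring_submit(ring);
|         }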
| zxzax wrote:
| If you're looking for an excuse to work on io_uring, please
| consider helping get it implemented and tested in your
| favorite event loop or I/O abstraction library. Here's some
| open issues and PRs:
|
| https://github.com/golang/go/issues/31908
|
| https://github.com/libuv/libuv/pull/2322
|
| https://github.com/tokio-rs/mio/issues/923
|
| https://gitlab.gnome.org/GNOME/glib/-/issues/2084
|
| https://github.com/libevent/libevent/issues/1019
| coder543 wrote:
| Oh, trust me... that Go issue is top of mind for me. I
| have the fifth comment on that issue, along with several
| other comments in there, and I'd love to implement it...
| I'm just not familiar enough with working on Go runtime
| internals, and motivation for volunteer work is sometimes
| hard to come by for the past couple of years.
|
| Maybe someday I'll get it done :)
| zxzax wrote:
| Haha nice, I just noticed that :) I think supporting someone
| else who's working on it, or even just offering to help test
| and review a PR, is a great and useful thing to do.
| jra_samba wrote:
| io_uring has been a game-changer for Samba IO speed.
|
| Check out Stefan Metzmacher's talk at SambaXP 2021
| (online event) for details:
|
| https://www.youtube.com/watch?v=eYxp8yJHpik
| surrealize wrote:
| The performance comparisons start here
| https://youtu.be/eYxp8yJHpik?t=1421
|
| Looks like the bandwidth went from 3.8 GB/s to 22 GB/s,
| with the client being the bottleneck.
| pydry wrote:
| This makes it much clearer. Thanks!
| tele_ski wrote:
| What you're describing sounds awesome, I hadn't thought about
| being able to string syscall commands together like that. I
| wonder how well that will work in practice. Is there a way to
| be notified if one of the commands in the sequence fails, for
| instance if the buffer wasn't large enough to write all the
| incoming data into?
| touisteur wrote:
| I'm looking at the evolution in the chaining capabilities
| of io_uring. Right now it's a bit basic but I'm guessing
| in 5 or 6 kernel versions people will have built a micro
| kernel or a web server just by chaining things in
| io_uring and maybe some custom chaining/decision blocks
| in ebpf :-)
| coder543 wrote:
| BPF, you say? https://lwn.net/Articles/847951/
|
| > The obvious place where BPF can add value is making
| decisions based on the outcome of previous operations in
| the ring. Currently, these decisions must be made in user
| space, which involves potential delays as the relevant
| process is scheduled and run. Instead, when an operation
| completes, a BPF program might be able to decide what to
| do next without ever leaving the kernel. "What to do
| next" could include submitting more I/O operations,
| moving on to the next in a series of files to process, or
| aborting a series of commands if something unexpected
| happens.
| touisteur wrote:
| BPF is going to change so many things... At the moment
| I'm having lots of trouble with the tooling but hey,
| let's just write BPF bytecode by hand or with a macro-
| asm. Reduce the ambitions...
| touisteur wrote:
| Also wondering whether we should rethink language
| runtimes for this. Like write everything in SPARK (so all
| specs are checked), target bpf bytecode through gnatllvm.
| OK you've written the equivalent of a cuda kernel or
| tbb::flow block. Now for the chaining y'all have this
| toolbox of task-chainers (barriers, priority queues,
| routers...) and you'll never even enter userland? I'm
| thinking /many/ programs could be described as such.
| touisteur wrote:
| Yes exactly what I had in mind. I'm also thinking of a
| particular chain of syscalls [0][1][2][3] (send netlink
| message, setsockopt, ioctls, getsockopts, reads, then
| setsockopt, then send netlink message) grouped so as to
| be done in one sequence without ever surfacing up to
| userland (just fill those here buffers, who's a good
| boy!). So now I'm missing ioctls and getsockopts but all
| in good time!
|
| [0] https://github.com/checkpoint-
| restore/criu/blob/7686b939d155...
|
| [1] https://github.com/checkpoint-
| restore/criu/blob/7686b939d155...
|
| [2] https://github.com/checkpoint-
| restore/criu/blob/7686b939d155...
|
| [3] https://www.infradead.org/~tgr/libnl/doc/api/group__q
| disc__p...
| coder543 wrote:
| According to a relevant manpage[0]:
|
| > Only members inside the chain are serialized. A chain
| of SQEs will be broken, if any request in that chain ends
| in error. io_uring considers any unexpected result an
| error. This means that, eg, a short read will also
| terminate the remainder of the chain. If a chain of SQE
| links is broken, the remaining unstarted part of the
| chain will be terminated and completed with -ECANCELED as
| the error code.
|
| So it sounds like you would need to decide what your
| strategy is. It sounds like you can inspect the step in
| the sequence that had the error, learn what the error
| was, and decide whether you want to re-issue the command
| that failed along with the remainder of the sequence. For
| a short read, you should still have access to the bytes
| that were read, so you're not losing information due to
| the error.
|
| There is an alternative "hardlink" concept that will
| continue the command sequence even in the presence of an
| error in the previous step, like a short read, as long as
| the previous step was correctly submitted.
|
| Error handling gets in the way of some of the fun, as
| usual, but it is important to think about.
|
| [0]: https://manpages.debian.org/unstable/liburing-
| dev/io_uring_e...
| zxzax wrote:
| Yes, check the documentation for the IOSQE_IO_LINK flag
| to see exactly how this works.
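|
| A small sketch of a two-step link with liburing, just to show
| the flag in context (a short read breaks the link unless you
| use IOSQE_IO_HARDLINK; error handling omitted):
|
|         #include <liburing.h>
|
|         /* Sketch: "read from in_fd, then write that buffer to
|          * out_fd" as one linked submission. The second SQE only
|          * runs if the first completes fully; otherwise it ends
|          * with -ECANCELED. */
|         static void copy_chunk(struct io_uring *ring, int in_fd,
|                                int out_fd, char *buf, unsigned len)
|         {
|             struct io_uring_sqe *sqe;
|
|             sqe = io_uring_get_sqe(ring);
|             io_uring_prep_read(sqe, in_fd, buf, len, 0);
|             sqe->flags |= IOSQE_IO_LINK;     /* link to next SQE */
|
|             sqe = io_uring_get_sqe(ring);
|             io_uring_prep_write(sqe, out_fd, buf, len, 0);
|
|             io_uring_submit(ring);           /* one syscall for both */
|         }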
| dataflow wrote:
| epoll is based on a "readiness" model (i.e. it tells when
| when you can _start_ I /O). io_uring is based on a
| "completion" model (i.e. it tells you when I/O is _finished_
| ). The latter is like Windows IOCP, where the C stands for
| Completion. Readiness models are rather useless for a local
| disk because, unlike with a socket, the disk is more or less
| always ready to receive a command.
| simcop2387 wrote:
| io_uring can in theory be extended to cover any syscall
| (though it hasn't been yet). I don't believe epoll can do
| things like stat, opening files, closing files, or syncing,
| though.
| [deleted]
| eloff wrote:
| I've seen mixed results so far. In theory it should perform
| better than epoll, but I'm not sure it's quite there yet. The
| maintainer of uWebSockets tried it with an earlier version and
| it was slower.
|
| Where it really shines is disk IO because we don't have an
| epoll equivalent there. I imagine it would also be great at
| network requests that go to or from disk in a simple way
| because you can chain the syscalls in theory.
| zxzax wrote:
| The main benefit to me has been in cases that previously
| required a thread pool on top of epoll: it should be safe to
| get rid of the thread pool now and only use io_uring. Socket
| I/O of course doesn't really need a thread pool in a lot of
| cases, but disk I/O does.
| infogulch wrote:
| syscall_uring would be my preference.
| [deleted]
| hawski wrote:
| Now I wonder if my idea from a few years ago of a muxcall, or
| batchcall as I thought of it, is something similar to
| io_uring, but on a smaller scale and without the eBPF goodies.
|
| My idea was to have a syscall like this:
|
|         struct batchvec {
|             unsigned long batchv_callnr;
|             unsigned long long batchv_argmask;
|         };
|
|         asmlinkage long sys_batchcall(struct batchvec *batchv,
|                                       int batchvcnt,
|                                       long args[16],
|                                       unsigned flags);
|
| You were supposed to give, in a batchvec, a sequence of system
| call numbers and a little mapping to the arguments you
| provided in args. batchv_argmask is a long long (a 64-bit
| type); the mask is divided into 4-bit fields, and every field
| can address a long from the args table. AFAIR Linux syscalls
| take up to 6 arguments. 6 fields for arguments and one for the
| return value gives 7 fields, i.e. 28 bits, and now I don't
| remember why I thought I needed a long long.
|
| It would go like this pseudo code:
|
|         int i = 0;
|         for (; i < batchvcnt; i++) {
|             args[batchv[i].argmask[6]] =
|                 sys_call_table[batchv[i].callnr](
|                     args[batchv[i].argmask[0]],
|                     args[batchv[i].argmask[1]],
|                     args[batchv[i].argmask[2]],
|                     args[batchv[i].argmask[3]],
|                     args[batchv[i].argmask[4]],
|                     args[batchv[i].argmask[5]]);
|             if (args[batchv[i].argmask[6]] < 0) {
|                 break;
|             }
|         }
|         return i;
|
| It would return the number of successfully run syscalls and
| stop on the first failed one. The user would have to pick the
| error code out of the args table.
|
| I would be interested to know why it wouldn't work.
|
| I started implementing it against Linux 4.15.12, but never got
| around to testing it. I have some code, but I don't believe it
| is my last version of the attempt.
| pkghost wrote:
| W/r/t a more descriptive name, I disagree--though until fairly
| recently I would have agreed.
|
| I would guess that the desire for something more "descriptive"
| reflects the fact that you are not in the weeds with io_uring
| (et al), and as such a name that's tied to specifics of the
| terrain (io, urings) feels esoteric and unfamiliar.
|
| However, to anyone who is an immediate consumer of io_uring or
| its compatriots, "io" obviously implies "syscall", but is
| better than "syscall", because it's more specific; since
| io_uring doesn't do anything other than io-related syscalls
| (there are other kinds of syscalls), naming it "syscall_" would
| make it harder for its immediate audience to remember what it
| does.
|
| Similarly, "uring" will be familiar to most of the immediate
| audience, and is better than "queue", because it also
| communicates some specific features (or performance
| characteristics? idk, I'm also not in the weeds) of the API
| that the more generic "_queue" would not.
|
| So, while I agree that the name is mildly inscrutable to us
| distant onlookers, I think it's the right name, and indeed
| reflects a wise pattern in naming concepts in complex systems.
| The less ambiguity you introduce at each layer of indirection
| or reference, the better.
|
| I recently did a toy project that has some files named in a
| similar fashion: `docker/cmd`, which is what the CMD directive
| in my Dockerfile points at, and `systemd/start`, which is what
| the ExecStart line of my systemd service file points at.
| They're mildly inscrutable if you're unfamiliar with either
| docker or systemd, as they don't really say much about what
| they _do_ , but this is a naming pattern that I can port to
| just about any project, and at the same time stop spending
| energy remembering a unique name for the app's entry point, or
| the systemd script.
|
| Some abstract observations:
|
| - naming for grokkability-at-first-glance is at odds with
| naming for utility-over-time; the former is necessarily more
| ambiguous
|
| - naming for utility over time seems like obviously the better
| default naming strategy; find a nice spot in your readme for
| onboarding metaphors and make sure the primary consumers of
| your name don't have to work harder than necessary to make
| sense of it
|
| - if you find a name inscrutable, perhaps you're just missing
| some context
| tele_ski wrote:
| Thanks for your detailed thoughts on this, I am definitely
| not involved at all and just onlooking from the sidelines.
| The name does initially seem quite esoteric but I can
| understand why it was picked. Thinking about it more 'io'
| rather than 'syscall' does make sense, and Windows also does
| use IO in IOCP.
| wtallis wrote:
| The io specificity is expected to be a temporary situation,
| and that part of the name may end up being an anachronism in
| a few years once a usefully large subset of another category
| of syscalls has been added to io_uring. The shared ring
| buffers aspect definitely is an implementation detail, but
| one that does explain why performance is better than other
| async IO methods (and also avoids misleading people into
| thinking that it has something to do with the async/await
| paradigm).
|
| If the BSDs hadn't already claimed the name, it would
| probably have been fine to call this kqueue or something like
| that.
| justsomeuser wrote:
| If my process has async/await and uses epoll, wouldn't this have
| about the same performance as io_uring?
|
| E.g. with io_uring, the event triggers the kernel to read into a
| buffer.
|
| With async/await epoll wakes up my process which does the syscall
| to read the file.
|
| In both cases you still need to read from the device and get the
| data to the user process?
| Matthias247 wrote:
| The answer is "go ahead and benchmark". There are some
| theoretical advantages to uring, like having to do less
| syscalls and have less userspace <-> kernelspace transitions.
|
| In practice implementation differences can offset that
| advantage, or it just might not make any difference for the
| application at all since its not the hotspot.
|
| I think for socket IO a variety of people did some synthetic
| benchmarks for epoll vs uring, and got all kinds of results
| from either one being a bit faster to both being roughly the
| same.
| Misdicorl wrote:
| Imagine your process instead of getting woken up to make a
| syscall gets woken up with a pointer to a filled data buffer.
| justsomeuser wrote:
| But some process (kernel or user space) still needs to spend
| the computer's finite resources to read it?
|
| If the work happens in the kernel process or the user process,
| doesn't it still cost the same?
| Misdicorl wrote:
| Yes, the syscall work still happens and the raw work has not
| changed. The overhead has dropped dramatically though, since
| you've eliminated at least 2 context switches per data fill
| cycle (and probably more).
| wtallis wrote:
| The transition from userspace to the kernel and back takes
| a similar amount of time to actually doing a small read
| from the fastest SSDs, or issuing a write to most
| relatively fast SSDs. So avoiding one extra syscall is a
| meaningful performance improvement even if you can't always
| eliminate a memcpy operation.
| Veserv wrote:
| That sounds unlikely. syscall hardware overhead (entry +
| exit) on a modern x86 is only on the order of 100 ns
| which is approximately main memory access latency. I am
| not familiar with Linux internals or what the fastest
| SSDs are capable of these days, but I am fairly sure that
| for your statement to be true Linux would need to be
| adding 1 to 2 orders of magnitude in software overhead.
| This occurs in the context switch pathway due to
| scheduling decisions, but it is fairly unlikely it occurs
| in the syscall pathway unless they are doing something
| horribly wrong.
| vlovich123 wrote:
| My understanding is that's the naive measurement of the cost
| of just the syscall operation (i.e. measuring from issue until
| the kernel is executing). Does this actually account for the
| performance loss from cache inefficiency? If I'm not mistaken,
| at a minimum the CPU needs to flush various caches to enter
| the kernel, fill them up as the kernel is executing, and then
| repopulate them when executing back in userspace. In that case
| (even if it's not a full flush), you have a hard-to-measure
| slowdown on the code processing the request in the kernel and
| in userspace after the syscall, because the locality
| assumptions that caches rely on are invalidated. With an
| io_uring model, since there's no context switch, temporal and
| spatial locality should provide an outsized benefit beyond
| just removing the syscall itself.
|
| Additionally, as noted elsewhere, you can chain syscalls
| pretty deeply so that the entire operation occurs in the
| kernel & never schedules your process. This also benefits
| spatial & temporal locality AND removes the cost of
| needing to schedule the process in the first place.
| dundarious wrote:
| Not necessarily, as you can use io_uring in a zero-copy way,
| avoiding copies of network/disk data from kernel space to user
| space.
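|
| The pre-registration piece of that looks roughly like this with
| liburing (a sketch; fixed buffers mainly save the per-operation
| pinning and mapping of user memory, and with O_DIRECT the
| device can DMA straight into the registered buffer):
|
|         #include <liburing.h>
|         #include <sys/uio.h>
|
|         static char buf[64 * 1024];
|
|         /* Sketch: register the buffer once, then refer to it by
|          * index so the kernel doesn't remap it on every request. */
|         static void read_fixed_example(struct io_uring *ring, int fd)
|         {
|             struct iovec iov = { buf, sizeof(buf) };
|             io_uring_register_buffers(ring, &iov, 1);  /* one-time */
|
|             struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|             io_uring_prep_read_fixed(sqe, fd, buf, sizeof(buf),
|                                      0 /* offset */, 0 /* buf idx */);
|             io_uring_submit(ring);
|         }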
| diegocg wrote:
| IIRC the author of io_uring wants to support using it with any
| system call, not just the current supported subset. Not sure how
| these efforts are going.
| swiley wrote:
| Supporting exec sounds like it could be a security issue. I
| also have a hard time imagining why on earth you would call
| exec that way.
| tptacek wrote:
| Say more about this? You're probably right but I haven't
| thought about it before and I'd be happy to have you do that
| thinking for me. :)
| touisteur wrote:
| Start a user-provided compression background task at file
| close? Think tcpdump with io_uring 'take this here buffer
| you've just read and send it to disk without bothering
| userland.
| the_duke wrote:
| > Supporting exec sounds like it could be a security issue.
|
| Why would it be any more of an issue than calling a blocking
| exec?
|
| > why on earth you would call exec that way
|
| The same reason as why you would want to do anything else
| with io_uring? In an async runtime you have to delegate
| blocking calls to a thread pool. Much nicer if the runtime
| can use the same execution system as for other IO.
| swiley wrote:
| >Why would it be any more of an issue than calling a
| blocking exec?
|
| Now you turn everything doing fast I/O into something
| that's essentially an easier rwx situation WRT memory
| corruption.
|
| >Much nicer
|
| In other words: the runtime should handle it because the
| kernel doesn't need to.
| the_duke wrote:
| I really don't follow your thought process here. Can you
| expand on your objections?
|
| io_uring can really be thought of as a generic
| asynchronous syscall interface.
|
| It uses a kernel thread pool to run operations. If
| something is blocking on a kernel-level it can just be
| run as a blocking operation on that thread pool.
| hinkley wrote:
| If you can emulate blocking calls in user space for
| backward compatibility, that is probably best. But I wonder
| about process scheduling. Blocked threads don't get
| rescheduled until the kernel unblocks them. Can you tell
| the kernel to block a thread until an io_uring state change
| occurs?
| mikedilger wrote:
| > io_uring is not an event system at all. io_uring is actually a
| generic asynchronous syscall facility.
|
| I would call it a message passing system, not an asynchronous
| syscall facility. A syscall, even one that doesn't block
| indefinitely, transfers control to the kernel. io_uring, once
| set up, doesn't. Now that we have multiple cores, there is no
| reason to context switch if the kernel can handle your request
| on some other core on which it is perhaps already running.
| hinkley wrote:
| It becomes a question of how expensive the message passing is
| between cores versus the context switch overhead - especially
| in a world where switching privilege levels has been made
| expensive by multiple categories of attack against processor
| speculation.
|
| Is there a middle ground where io_uring doesn't require the
| mitigations but other syscalls do?
| volta83 wrote:
| This is how I think about it as well.
|
| Which makes me wonder if Mach wasn't right all along.
| rjzzleep wrote:
| I found this presentation to be quite informative
| https://www.youtube.com/watch?v=-5T4Cjw46ys
| vkka wrote:
| ... _So it has the potential to make a lot of programs much
| simpler_.
|
| More efficient? Yes. Simpler? Not really. A synchronous
| program would be simpler; everyone who has done enough of
| these knows it.
| masklinn wrote:
| I'm guessing they mean that programs which did not want to
| block on syscalls and had to deploy workarounds can now just...
| do async syscalls.
| Filligree wrote:
| Which is every program that wants to be fast, on modern
| computers. So...
| masklinn wrote:
| Nonsense.
|
| Let's say your program wants to list a directory; if it has
| nothing else to do during that time, then there is no point in
| using an asynchronous model, which only adds cost and
| complexity.
| wtallis wrote:
| True. But as soon as your program wants to list _two_
| directories and knows both names ahead of time, you have
| an opportunity to fire off both operations for the kernel
| to work on simultaneously.
|
| And even if your program doesn't have any opportunity for
| doing IO in parallel, being able to chain a sequence of
| IO operations together and issue them with at most one
| syscall may still get you improved latency.
| touisteur wrote:
| Yes and even clustering often-used-together syscalls...
| An interesting 2010 thesis
| https://os.itec.kit.edu/deutsch/2211.php and
| https://www2.cs.arizona.edu/~debray/Publications/multi-
| call.... for something called 'multi-calls'
|
| Interesting times.
| [deleted]
| ot wrote:
| Let's say your program wants to list a directory and sort
| by mtime, or size. You need to stat all those files,
| which means reading a bunch of inodes. And your directory
| is on flash, so you'll definitely want to pipeline all
| those operations.
|
| How do you do that without an async API? Thread pool and
| synchronous syscalls? That's not simpler.
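|
| A rough liburing sketch of that pipelining, assuming the names
| have already been read from the directory, the ring is large
| enough, and error handling is left out (the helper and its
| arguments are just for illustration):
|
|         #define _GNU_SOURCE
|         #include <liburing.h>
|         #include <fcntl.h>
|         #include <sys/stat.h>
|
|         /* Sketch: issue one statx per entry, then reap them all.
|          * `names` and `results` are arrays the caller prepared. */
|         static void stat_all(struct io_uring *ring, const char **names,
|                              struct statx *results, int n)
|         {
|             for (int i = 0; i < n; i++) {
|                 struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|                 io_uring_prep_statx(sqe, AT_FDCWD, names[i], 0,
|                                     STATX_SIZE | STATX_MTIME,
|                                     &results[i]);
|             }
|             io_uring_submit(ring);     /* all inode reads in flight */
|
|             for (int i = 0; i < n; i++) {
|                 struct io_uring_cqe *cqe;
|                 io_uring_wait_cqe(ring, &cqe);
|                 io_uring_cqe_seen(ring, cqe);
|             }
|         }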
| the8472 wrote:
| You can use io_uring as a synchronous API too. Put a bunch of
| commands on the submission queue (think of it as a submission
| "buffer"), call io_uring_enter() with min_complete == number of
| commands, and once the syscall returns, extract results from
| the completion buffer^H^H^Hqueue. Voila, a perfectly
| synchronous batch syscall interface.
|
| You can even choose between executing them sequentially and
| aborting on the first error or trying to complete as many as
| possible.
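|
| In liburing terms that pattern is roughly (sketch, error
| handling omitted):
|
|         #include <liburing.h>
|
|         /* Sketch of the "synchronous batch" usage: queue several
|          * operations, then make a single call that submits them
|          * and waits for that many completions. */
|         static void submit_batch_sync(struct io_uring *ring,
|                                       unsigned queued)
|         {
|             io_uring_submit_and_wait(ring, queued);  /* one syscall */
|
|             struct io_uring_cqe *cqe;
|             for (unsigned i = 0; i < queued; i++) {
|                 io_uring_wait_cqe(ring, &cqe);
|                 /* ... inspect cqe->res per cqe->user_data ... */
|                 io_uring_cqe_seen(ring, cqe);
|             }
|         }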
| aseipp wrote:
| My read of that paragraph was that they meant existing
| asynchronous programs can be simplified, due to the need for
| fewer workarounds for the Linux I/O layer (e.g. thread pools to
| make disk operations appear asynchronous are no longer
| necessary). And I agree with that; asynchronous I/O had a lot
| of pitfalls on Linux until io_uring came around, making things
| much worse than strictly necessary.
|
| In general I totally agree that a synchronous program will be
| way simpler than an equivalent asynchronous one, though.
___________________________________________________________________
(page generated 2021-06-17 23:00 UTC)