[HN Gopher] A programmer-friendly I/O abstraction over io_uring ...
___________________________________________________________________
A programmer-friendly I/O abstraction over io_uring and kqueue
Author : simonz05
Score : 69 points
Date : 2022-11-23 16:16 UTC (6 hours ago)
(HTM) web link (tigerbeetle.com)
(TXT) w3m dump (tigerbeetle.com)
| dboreham wrote:
| "The good news is that Windows also has a completion based system
| similar to io_uring but without batching, called IOCP"
|
| fwiw IOCP in NT predates the similar mechanisms in Linux by at
| least a decade (and the VMS QIO scheme upon which it was in turn
| based is even older). As I understand it, the reason Unix(1)
| (and then Linux) did not have efficient network I/O kernel
| interfaces until relatively recently was fear of patent
| litigation from MS.
|
| (1) except for AIX, possibly due to IBM being less concerned
| about MS patents in this area.
| p_l wrote:
| TL;DR - Async I/O wasn't included in Unices because it's hard
| and complex, and Unices are about keeping things simple until
| you can no longer pretend it's painless.
|
| Considering that Windows NT's IOCP is very close to a direct
| copy of the VMS QIO mechanism (and even more so underneath, in
| the officially undocumented boundary layer between user space
| and kernel space), I don't think it's a case of patents.
|
| UNIX was just always _against_ asynchronous I/O, back from the
| start - asynchronous I/O schemes were known and used as early
| as the original UNIX, and going with a fully synchronous model
| was an explicit design choice at Bell Labs.
|
| When asynchronous I/O turned out to be important enough to
| include after all, there was no common enough interface to
| handle it, and everyone was using select() and poll() for lack
| of anything better for the most obvious use cases of AIO
| (networking). Meanwhile, properly implementing asynchronous I/O
| can be non-trivial - QIO never ran truly multithreaded from the
| PoV of the client program, for example (NT focused on making
| sure async worked from the start).
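|
| (For context, a minimal, hypothetical C sketch of the readiness
| model this describes: select()/poll() only report that a
| descriptor is ready, and the application still performs the
| read()/write() itself, unlike completion-based schemes such as
| QIO, IOCP, or io_uring. Not code from any project discussed
| here.)
|
|     #include <poll.h>
|     #include <unistd.h>
|
|     /* Readiness model: wait until descriptors are readable,
|        then issue the (synchronous) reads ourselves. */
|     void poll_loop(int *fds, int count) {
|         struct pollfd pfds[64];
|         char buf[4096];
|         if (count > 64) count = 64;
|         for (int i = 0; i < count; i++) {
|             pfds[i].fd = fds[i];
|             pfds[i].events = POLLIN;
|         }
|         for (;;) {
|             /* Block until at least one descriptor has data. */
|             if (poll(pfds, count, -1) < 0) break;
|             for (int i = 0; i < count; i++) {
|                 if (pfds[i].revents & POLLIN) {
|                     /* poll() only signaled readiness; the read
|                        itself is still a synchronous call. */
|                     ssize_t n = read(pfds[i].fd, buf, sizeof(buf));
|                     (void)n; /* handle data or EOF here */
|                 }
|             }
|         }
|     }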
| amluto wrote:
| IOCP was missing a rather obvious feature until recently:
| disassociating a handle from an IOCP:
|
| https://stackoverflow.com/questions/30688028/un-associate-so...
| eatonphil wrote:
| Hey folks! Phil from TigerBeetle here. Happy to answer questions
| or pull in King when I cannot. :)
| tiffanyh wrote:
| Can this benefit BSDs?
| eatonphil wrote:
| I think FreeBSD invented kqueue and from a quick search it
| looks like OpenBSD and NetBSD also adopted it. We've seen at
| least one person slightly tweak TigerBeetle to run on FreeBSD
| already through the darwin code paths.
| cb22 wrote:
| > We've seen at least one person slightly tweak TigerBeetle
| to run on FreeBSD already through the darwin code paths.
|
| I'm part of that sample set! Was quite surprised how easy
| it was to get it up and running on FreeBSD. Benchmarking on
| tmpfs on both, it even had a ~10% lead on Linux.
|
| (Of course, that's not exactly the intended use case, so
| don't pay too much attention to that number!)
| williamcotton wrote:
| Is there a reason why you're not using libdispatch on darwin
| and instead using kqueue directly?
| jorangreef wrote:
| At the time, we wanted to get macOS or Darwin running as soon
| as possible to improve the local developer experience, but
| we've always had a soft spot for FreeBSD, so kqueue was two
| birds with one stone!
| kprotty wrote:
| King from TigerBeetle here.
|
| One of the reasons is that libdispatch's I/O functions
| introduce extra dynamic allocations, both for internal queueing
| via `dispatch_async` ([0],[1],[2]) and, at the API level, from
| realloc-ing [3] an internally owned [4] buffer.
|
| TigerBeetle, on the other hand, statically allocates all I/O
| buffers upfront [5], treats these buffers as intrusively-
| provided typed data [6] (no growing/owned buffers), and does
| internal queueing without synchronization or dynamic
| allocation [7]. (A rough sketch of that pattern follows the
| links below.)
|
| [0]: https://github.com/apple/swift-corelibs-
| libdispatch/blob/469...
|
| [1]: https://github.com/apple/swift-corelibs-
| libdispatch/blob/469...
|
| [2]: https://github.com/apple/swift-corelibs-
| libdispatch/blob/469...
|
| [3]: https://github.com/apple/swift-corelibs-
| libdispatch/blob/469...
|
| [4]: https://developer.apple.com/documentation/dispatch/13889
| 33-d...
|
| [5]: https://tigerbeetle.com/blog/a-database-without-dynamic-
| memo...
|
| [6]: https://github.com/tigerbeetledb/tigerbeetle/blob/d15acc
| 663f...
|
| [7]: https://github.com/tigerbeetledb/tigerbeetle/d15acc663f8
| 882c...
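|
| (Roughly, the intrusive pattern looks something like the
| following. This is a hypothetical C sketch for illustration,
| not TigerBeetle's actual code: the caller embeds the queue link
| and buffer pointer in its own statically allocated completion,
| so submitting and queueing never allocate.)
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Intrusive, allocation-free submission queue (sketch).
|        The completion lives inside the caller's own state, so
|        enqueueing is pointer manipulation: no malloc, no copies. */
|     struct completion {
|         /* intrusive link, lives inside the caller's struct */
|         struct completion *next;
|         /* invoked when the I/O finishes */
|         void (*callback)(struct completion *);
|         /* points into a buffer allocated once at startup */
|         uint8_t *buffer;
|         size_t len;
|     };
|
|     struct io_queue {
|         struct completion *head, *tail; /* single-threaded: no locks */
|     };
|
|     /* All I/O buffers sized and reserved up front. */
|     static uint8_t buffers[64][4096];
|
|     static void io_enqueue(struct io_queue *q, struct completion *c) {
|         c->next = NULL;
|         if (q->tail) q->tail->next = c; else q->head = c;
|         q->tail = c;
|     }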
| williamcotton wrote:
| Thank you! I was indeed looking for the technical details!
| packetlost wrote:
| Is there any attempt to reorder IO events such that writes (and
| maybe reads) operate in as contiguous a manner as possible? It
| might only be relevant in the case of multiple writes to the
| same file, so the complexity trade-off might not be worth it
| for TigerBeetle, but might be worth it for other systems.
|
| Edit: also, are you guys hiring? ;)
| eatonphil wrote:
| We are definitely hiring a designer in Europe. :)
|
| I asked DJ (not on HN, but hangs out in our community Slack
| [3] where you can ask further if curious), who knows the disk
| side of things best, and he responds:
|
| The OS is free to reorder writes (this is true for both
| io_uring and conventional IO).
|
| In practice it does this for spinning disks, but not SSDs.
|
| The OS is aware of the "geometry" of a spinning disk, i.e.
| what sectors are physically close to each other.
|
| But for NVMe SSDs it is typically handled in the firmware.
| SSDs internally remap "logical" addresses (i.e. the address
| from the OS point of view) to "physical" addresses (actual
| locations on the SSD).
|
| e.g. if the application (or OS) writes to block address "1"
| then "2", the SSD does not necessarily store these in
| adjacent physical locations. (OSTEP explains this well [0].)
|
| "Performance Analysis of NVMe SSDs and their Implication on
| Real World Databases" explains in more detail:
|
| > In the conventional SATA I/O path, an I/O request arriving
| at the block layer will first be inserted into a request
| queue (Elevator). The Elevator would then reorder and combine
| multiple requests into sequential requests. While reordering
| was needed in HDDs because of their slow random access
| characteristics, it became redundant in SSDs where random
| access latencies are almost the same as sequential. Indeed,
| the most commonly used Elevator scheduler for SSDs is the
| noop scheduler (Rice 2013), which implements a simple First-
| In-First-Out (FIFO) policy without any reordering.
|
| Applications can help performance by grouping writes
| according to time-of-death (per "The Unwritten Contract of
| Solid State Drives" [2]), but the SSD is free to do whatever.
| We are shortly going to be reworking the LSM's compaction
| scheduling to take advantage of this:
| https://github.com/tigerbeetledb/tigerbeetle/issues/269.
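|
| (As a hypothetical illustration of what "grouping writes" can
| look like from the application side, here is a minimal liburing
| sketch that lays a group of same-lifetime blocks out at
| adjacent logical offsets and submits them as one batch. The
| function name, block size, and layout are assumptions for
| illustration, not TigerBeetle code.)
|
|     #include <liburing.h>
|     #include <sys/types.h>
|
|     /* Write `count` 4 KiB blocks that are expected to "die"
|        together at adjacent offsets, submitting the whole batch
|        with a single io_uring_submit() call. */
|     int write_group(int fd, const char (*blocks)[4096],
|                     unsigned count, off_t base) {
|         struct io_uring ring;
|         if (io_uring_queue_init(count, &ring, 0) < 0) return -1;
|
|         for (unsigned i = 0; i < count; i++) {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|             if (!sqe) break; /* ring sized to `count`, not expected */
|             /* Adjacent logical offsets keep the group together,
|                at least from the OS's view; the SSD may remap. */
|             io_uring_prep_write(sqe, fd, blocks[i], 4096,
|                                 base + (off_t)i * 4096);
|         }
|         io_uring_submit(&ring); /* one syscall for the whole batch */
|
|         for (unsigned i = 0; i < count; i++) { /* reap completions */
|             struct io_uring_cqe *cqe;
|             if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
|             io_uring_cqe_seen(&ring, cqe);
|         }
|         io_uring_queue_exit(&ring);
|         return 0;
|     }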
|
| [0] https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
|
| [1]
| https://www.cs.binghamton.edu/~tameesh/pubs/systor2015.pdf
|
| [2] https://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
|
| [3] https://join.slack.com/t/tigerbeetle/shared_invite/zt-1gf
| 3qn...
| loeg wrote:
| > Is there any attempt to reorder IO events such that writes
| (and maybe reads) operate in as contiguous a manner as
| possible?
|
| Something like a disk elevator at a higher level of the
| stack?
| packetlost wrote:
| Yeah, even tweaking kernel IO scheduling would probably be
| sufficient. It'll depend on spinning rust vs SSDs though.
| jorangreef wrote:
| Yes, our thinking here was that for SSD or NVMe sometimes
| the cost of scheduling is not worth it, and "noop" can be
| a good choice, since the device is already so fast
| relative to the cost of the CPU and memory access
| required to schedule.
|
| As far as we understand w.r.t. mechanical sympathy for
| flash, what counts most is either a large bulk I/O that
| the device can internally parallelize, or else
| sufficiently parallel smaller I/Os.
|
| Then, being sector-aligned, to keep the kernel from having to
| fix up alignment with a memcpy (also, we're using Direct I/O,
| so we have to), and especially trying to issue writes all with
| a similar "Time of Death". For example, if you're writing out
| the blocks for a new LSM table on disk, then it is in fact good
| to keep these close together, since it's likely they'll be
| compacted again and reclaimed at the same time in the future.
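|
| (A minimal, hypothetical C sketch of the alignment point,
| assuming Linux and a 4096-byte logical sector size: with
| O_DIRECT the buffer address, the I/O length, and the file
| offset all need to be sector-aligned, which is why an aligned
| allocation, or a statically aligned buffer, is used instead of
| plain malloc. Names here are made up for illustration.)
|
|     #define _GNU_SOURCE /* for O_DIRECT on Linux */
|     #include <fcntl.h>
|     #include <stdlib.h>
|     #include <string.h>
|     #include <unistd.h>
|
|     enum { SECTOR = 4096 }; /* assumed logical sector size */
|
|     /* Write `sectors` sectors of `data` at offset 0 with Direct
|        I/O. Buffer, length, and offset are all sector-aligned. */
|     int write_aligned(const char *path, const void *data,
|                       size_t sectors) {
|         int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
|         if (fd < 0) return -1;
|
|         void *buf = NULL;
|         if (posix_memalign(&buf, SECTOR, sectors * SECTOR) != 0) {
|             close(fd);
|             return -1;
|         }
|
|         /* Copy into the aligned buffer; a design like the one
|            described above would instead write directly from
|            statically allocated, already-aligned buffers. */
|         memcpy(buf, data, sectors * SECTOR);
|         ssize_t n = pwrite(fd, buf, sectors * SECTOR, 0);
|
|         free(buf);
|         close(fd);
|         return n < 0 ? -1 : 0;
|     }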
|
| This brings us back to scheduling... :)
___________________________________________________________________
(page generated 2022-11-23 23:01 UTC)