[HN Gopher] A programmer-friendly I/O abstraction over io_uring ...
       ___________________________________________________________________
        
       A programmer-friendly I/O abstraction over io_uring and kqueue
        
       Author : simonz05
       Score  : 69 points
       Date   : 2022-11-23 16:16 UTC (6 hours ago)
        
 (HTM) web link (tigerbeetle.com)
 (TXT) w3m dump (tigerbeetle.com)
        
       | dboreham wrote:
       | "The good news is that Windows also has a completion based system
       | similar to io_uring but without batching, called IOCP"
       | 
       | fwiw IOCP in NT predates the similar mechanisms in Linux by at
       | least a decade (and the VMS QIO scheme upon which it was in turn
        | based is even older). As I understand it, the reason Unix(1)
        | (and then Linux) did not have efficient network I/O kernel
        | interfaces until relatively recently was fear of patent
        | litigation from MS.
       | 
       | (1) except for AIX, possibly due to IBM being less concerned
       | about MS patents in this area.
        
         | p_l wrote:
          | TL;DR - Async I/O wasn't included in the Unices because it's
          | hard and complex, and Unix is about keeping things simple
          | until you can no longer pretend that simplicity is painless.
         | 
          | Considering that Windows NT's IOCP is very close to a direct
          | copy of the VMS QIO mechanism (and even more so underneath,
          | in the officially undocumented boundary layer between user
          | space and kernel space), I don't think it's a case of
          | patents.
         | 
          | UNIX was just always _against_ asynchronous I/O, right from
          | the start - asynchronous I/O schemes were known and used as
          | early as the original UNIX, and going with a fully
          | synchronous model was an explicit design choice at Bell
          | Labs.
         | 
          | When asynchronous I/O turned out to be important enough to
          | include after all, there was no common interface to handle
          | it, and everyone was using select() and poll(), for lack of
          | anything better, for the most obvious use case of AIO
          | (networking). Meanwhile, properly implementing asynchronous
          | I/O can be non-trivial - QIO never ran truly multithreaded
          | from the PoV of the client program, for example (NT focused
          | on making async work from the start).
        
         | amluto wrote:
         | IOCP was missing a rather obvious feature until recently:
         | disassociating a handle from an IOCP:
         | 
         | https://stackoverflow.com/questions/30688028/un-associate-so...
        
       | eatonphil wrote:
       | Hey folks! Phil from TigerBeetle here. Happy to answer questions
       | or pull in King when I cannot. :)
        
         | tiffanyh wrote:
         | Can this benefit BSDs?
        
           | eatonphil wrote:
           | I think FreeBSD invented kqueue and from a quick search it
           | looks like OpenBSD and NetBSD also adopted it. We've seen at
           | least one person slightly tweak TigerBeetle to run on FreeBSD
           | already through the darwin code paths.
        
             | cb22 wrote:
             | > We've seen at least one person slightly tweak TigerBeetle
             | to run on FreeBSD already through the darwin code paths.
             | 
             | I'm part of that sample set! Was quite surprised how easy
             | it was to get it up and running on FreeBSD. Benchmarking on
             | tmpfs on both, it even had a ~10% lead on Linux.
             | 
             | (Of course, that's not exactly the intended use case, so
             | don't pay too much attention to that number!)
        
         | williamcotton wrote:
         | Is there a reason why you're not using libdispatch on darwin
         | and instead using kqueue directly?
        
           | jorangreef wrote:
           | At the time, we wanted to get macOS or Darwin running as soon
           | as possible to improve the local developer experience, but--
           | we've always had a soft spot for FreeBSD so kqueue was two
           | birds in one!
        
           | kprotty wrote:
           | King from TigerBeetle here.
           | 
            | One of the reasons is that libdispatch's I/O functions
            | introduce extra dynamic allocations, both for internal
            | queueing via `dispatch_async` ([0],[1],[2]) and, at the API
            | level, by realloc-ing [3] an internally owned [4] buffer.
           | 
           | TigerBeetle, on the other hand, statically allocates all I/O
           | buffers upfront [5], treats these buffers as intrusively-
           | provided typed data [6] (no growing/owned buffers), and does
           | internal queueing without synchronization or dynamic
           | allocation [7].
           | 
           | [0]: https://github.com/apple/swift-corelibs-
           | libdispatch/blob/469...
           | 
           | [1]: https://github.com/apple/swift-corelibs-
           | libdispatch/blob/469...
           | 
           | [2]: https://github.com/apple/swift-corelibs-
           | libdispatch/blob/469...
           | 
           | [3]: https://github.com/apple/swift-corelibs-
           | libdispatch/blob/469...
           | 
           | [4]: https://developer.apple.com/documentation/dispatch/13889
           | 33-d...
           | 
           | [5]: https://tigerbeetle.com/blog/a-database-without-dynamic-
           | memo...
           | 
           | [6]: https://github.com/tigerbeetledb/tigerbeetle/blob/d15acc
           | 663f...
           | 
           | [7]: https://github.com/tigerbeetledb/tigerbeetle/d15acc663f8
           | 882c...
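            | 
            | Roughly the shape of that pattern, as a minimal C sketch
            | (hypothetical names only; TigerBeetle's actual
            | implementation is in Zig and differs in detail):
            | 
            |     /* Fixed pool of intrusive I/O completions: the buffer
            |      * and the queue link live inside each statically
            |      * allocated struct, so queueing never allocates. */
            |     #define IO_DEPTH 32      /* max in-flight operations */
            |     #define BUF_SIZE 4096    /* fixed buffer size */
            | 
            |     struct completion {
            |         struct completion *next;   /* intrusive link */
            |         void (*callback)(struct completion *);
            |         char buffer[BUF_SIZE];     /* no growing/realloc */
            |     };
            | 
            |     /* Everything is allocated up front; nothing grows. */
            |     static struct completion pool[IO_DEPTH];
            |     static struct completion *queue_head, *queue_tail;
            | 
            |     /* Single-threaded event loop: pushing a completion
            |      * needs no locks, atomics, or heap allocation. */
            |     static void queue_push(struct completion *c) {
            |         c->next = NULL;
            |         if (queue_tail) queue_tail->next = c;
            |         else queue_head = c;
            |         queue_tail = c;
            |     }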
        
             | williamcotton wrote:
             | Thank you! I was indeed looking for the technical details!
        
         | packetlost wrote:
          | Is there any attempt to reorder IO events such that writes
          | (and maybe reads) operate in as contiguous a manner as
          | possible?
         | It might only be relevant in the case of multiple writes to the
         | same file, so the complexity trade-off might not be worth it
         | for TigerBeetle, but might be worth it for other systems.
         | 
         | Edit: also, are you guys hiring? ;)
        
           | eatonphil wrote:
           | We are definitely hiring a designer in Europe. :)
           | 
           | I asked DJ (not on HN, but hangs out in our community Slack
           | [3] where you can ask further if curious), who knows the disk
           | side of things best, and he responds:
           | 
           | The OS is free to reorder writes (this is true for both
           | io_uring and conventional IO).
           | 
           | In practice it does this for spinning disks, but not SSDs.
           | 
           | The OS is aware of the "geometry" of a spinning disk, i.e.
           | what sectors are physically close to each other.
           | 
            | But for NVMe SSDs, it is typically handled in the firmware.
           | SSDs internally remap "logical" addresses (i.e. the address
           | from the OS point of view) to "physical" addresses (actual
           | locations on the SSD).
           | 
           | e.g. if the application (or OS) writes to block address "1"
           | then "2", the SSD does not necessarily store these in
           | adjacent physical locations. (OSTEP explains this well [0].)
           | 
           | "Performance Analysis of NVMe SSDs and their Implication on
           | Real World Databases" explains in more detail:
           | 
           | > In the conventional SATA I/O path, an I/O request arriving
           | at the block layer will first be inserted into a request
           | queue (Elevator). The Elevator would then reorder and combine
           | multiple requests into sequential requests. While reordering
           | was needed in HDDs because of their slow random access
           | characteristics, it became redundant in SSDs where random
           | access latencies are almost the same as sequential. Indeed,
           | the most commonly used Elevator scheduler for SSDs is the
           | noop scheduler (Rice 2013), which implements a simple First-
           | In-First-Out (FIFO) policy without any reordering.
           | 
           | Applications can help performance by grouping writes
           | according to time-of-death (per "The Unwritten Contract of
           | Solid State Drives" [2]), but the SSD is free to do whatever.
           | We are shortly going to be reworking the LSM's compaction
           | scheduling to take advantage of this:
           | https://github.com/tigerbeetledb/tigerbeetle/issues/269.
           | 
           | [0] https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
           | 
           | [1]
           | https://www.cs.binghamton.edu/~tameesh/pubs/systor2015.pdf
           | 
           | [2] https://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf
           | 
           | [3] https://join.slack.com/t/tigerbeetle/shared_invite/zt-1gf
           | 3qn...
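            | 
            | As a side note, you can see which scheduler the kernel is
            | actually using for a given device in sysfs; a tiny C sketch
            | (the device name "nvme0n1" is just an example, and on
            | modern multi-queue kernels the no-op choice shows up as
            | "none" rather than "noop"):
            | 
            |     #include <stdio.h>
            | 
            |     int main(void) {
            |         /* The active scheduler is printed in brackets,
            |          * e.g. "[none] mq-deadline kyber bfq". */
            |         FILE *f =
            |             fopen("/sys/block/nvme0n1/queue/scheduler", "r");
            |         if (!f) { perror("fopen"); return 1; }
            |         char line[256];
            |         if (fgets(line, sizeof line, f)) fputs(line, stdout);
            |         fclose(f);
            |         return 0;
            |     }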
        
           | loeg wrote:
            | > Is there any attempt to reorder IO events such that
            | writes (and maybe reads) operate in as contiguous a manner
            | as possible?
           | 
           | Something like a disk elevator at a higher level of the
           | stack?
        
             | packetlost wrote:
             | Yeah, even tweaking kernel IO scheduling would probably be
             | sufficient. It'll depend on spinning rust vs SSDs though.
        
               | jorangreef wrote:
               | Yes, our thinking here was that for SSD or NVMe sometimes
               | the cost of scheduling is not worth it, and "noop" can be
               | a good choice, since the device is already so fast
               | relative to the cost of the CPU and memory access
               | required to schedule.
               | 
               | As far as we understand w.r.t. mechanical sympathy for
               | flash, what counts most is either a large bulk I/O that
               | the device can internally parallelize, or else
               | sufficiently parallel smaller I/Os.
               | 
                | Then there's being sector-aligned, to keep the kernel
                | from having to fix up alignment with a memcpy (also,
                | we're using Direct I/O, so we have to be), and
                | especially trying to issue writes all with a similar
                | "time of death". For example, if you're writing out
                | the blocks for a new LSM table on disk, then it is in
                | fact good to keep these close together, since it's
                | likely they'll be compacted again and reclaimed at the
                | same time in the future.
               | 
               | This brings us back to scheduling... :)
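                | 
                | A minimal sketch of what such a sector-aligned Direct
                | I/O write looks like on Linux (the file name and the
                | 4096-byte alignment are assumptions; the real alignment
                | should come from the device's logical block size):
                | 
                |     #define _GNU_SOURCE   /* for O_DIRECT */
                |     #include <fcntl.h>
                |     #include <stdlib.h>
                |     #include <string.h>
                |     #include <unistd.h>
                | 
                |     int main(void) {
                |         int fd = open("datafile",
                |                       O_WRONLY | O_CREAT | O_DIRECT, 0644);
                |         if (fd < 0) return 1;
                | 
                |         /* With O_DIRECT the buffer address, length, and
                |          * file offset must all be aligned, otherwise the
                |          * kernel typically fails the write with EINVAL. */
                |         void *buf;
                |         if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
                |         memset(buf, 0xAB, 4096);
                | 
                |         ssize_t n = pwrite(fd, buf, 4096, 0);
                | 
                |         free(buf);
                |         close(fd);
                |         return n == 4096 ? 0 : 1;
                |     }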
        
       ___________________________________________________________________
       (page generated 2022-11-23 23:01 UTC)