[HN Gopher] Async I/O on Linux in databases
___________________________________________________________________
Async I/O on Linux in databases
Author : jtregunna
Score : 174 points
Date : 2025-07-20 06:20 UTC (16 hours ago)
(HTM) web link (blog.canoozie.net)
(TXT) w3m dump (blog.canoozie.net)
| jtregunna wrote:
| Post talks about how to use io_uring, in the context of building
| a "database" (a demonstration key-value cache with a write-ahead
| log), to maintain durability.
| tlb wrote:
| The recovery process is to "only apply operations that have both
| intent and completion records." But then I don't see the point of
| logging the intent record separately. If no completion is logged,
| the intent is ignored. So you could log the two together.
|
| Presumably the intent record is large (containing the key-value
| data) while the completion record is tiny (containing just the
| index of the intent record). Is the point that the completion
| record write is guaranteed to be atomic because it fits in a disk
| sector, while the intent record doesn't?
| ta8645 wrote:
| It's really not clear in the article. But I _think_ the gains
| are to be had because you can do the in-memory updating during
| the time that the WAL is being written to disk (rather than
| waiting for it to flush before proceeding). So I'm guessing the
| protocol, as presented, is actually missing a key step:
|     Write intent record (async)
|     Perform operation in memory
|     Write completion record (async)
|     *Wait for intent and completion to be flushed to disk*
|     Return success to client
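|
| Very roughly, something like this (pure pseudocode in C syntax;
| the helper names are made up, not real io_uring calls):
|
|     /* Hypothetical sketch of the protocol with the missing wait. */
|     int handle_put(struct kv *kv, struct request *req) {
|         uint64_t seq = next_seq(kv);                   /* made-up helper */
|         submit_wal_append(kv->intent_wal, req, seq);       /* async */
|         apply_in_memory(kv, req);
|         submit_wal_append(kv->completion_wal, NULL, seq);  /* async */
|         /* The step I think is missing: block until BOTH records
|            are actually durable before acknowledging the client. */
|         wait_durable(kv->intent_wal, seq);
|         wait_durable(kv->completion_wal, seq);
|         return reply_success(req);
|     }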
| gsliepen wrote:
| But this makes me wonder how it works when there are
| concurrent requests. What if a second thread requests data
| that is being written to memory by the first thread?
| Shouldn't it also wait for both the write intent record and the
| completion record to have been flushed to disk? Otherwise you
| could end up with a query that returns data which, after a
| crash, won't exist anymore.
| Manuel_D wrote:
| It's not the write ahead log that prevents that scenario,
| it's transaction isolation. And note that the more
| permissive isolation levels offered by Postgres, for
| example, do allow that failure mode to occur.
| Demiurge wrote:
| If that's the hypothesis, it would be good to see some
| numbers or a proof of concept. The real-world performance
| impact doesn't seem that obvious to predict here.
| avinassh wrote:
| *Wait for intent and completion to be flushed to disk*
|
| If you wait for both to complete, then how can it be faster
| than doing a single IO?
| cbzbc wrote:
| _Presumably the intent record is large (containing the key-
| value data) while the completion record is tiny_
|
| I don't think this is necessarily the case, because the
| operations may have completed in a different order to how they
| are recorded in the intent log.
| jmpman wrote:
| "Write intent record (async)
| Perform operation in memory
| Write completion record (async)
| Return success to client
|
| During recovery, I only apply operations that have both intent
| and completion records. This ensures consistency while allowing
| much higher throughput."
|
| Does this mean that a client could receive a success for a
| request, which if the system crashed immediately afterwards, when
| replayed, wouldn't necessarily have that request recorded?
|
| How does that not violate ACID?
| JasonSage wrote:
| As best I can tell, the author understands that the async
| write-ahead fails to be a guarantee where the sync one does...
| then turns their async write into two async writes... but
| there's still no guarantee comparable to the synchronous
| version.
|
| So I fail to see how the two async writes are any guarantee at
| all. It sounds like they just happen to provide better
| consistency than the one async write because it forces an
| arbitrary amount of time to pass.
| m11a wrote:
| Yeah, I feel like I'm missing the point of this. The original
| purpose of the WAL was for recovery, so WAL entries are
| supposed to be flushed to disk.
|
| Seems like OP's async approach removes that, so there's no
| durability guarantee, so why even maintain a WAL to begin
| with?
| nephalegm wrote:
| Reading through the article it's explained in the recovery
| process. He reads the intent log entries and the completion
| entries and only applies them if they both exist.
|
| So while there is no guarantee that operations are committed
| (by virtue of not being acknowledged to the application, since
| it's asynchronous), the recovery replay will be consistent.
|
| I could see it would be problematic for any data where the
| order of operations is important, but that's the trade off
| for performance. This does seem to be an improvement to
| ensure asynchronous IO will always result in a consistent
| recovery.
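|
| In other words, the replay is basically an intersection of the
| two logs. A minimal sketch of that matching step (illustrative
| only, not the author's code; the record layout is invented):
|
|     #include <stdbool.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Invented record layout; the post doesn't show the WAL
|        format. */
|     struct intent_rec     { uint64_t seq; char key[64]; char val[256]; };
|     struct completion_rec { uint64_t seq; };
|
|     /* Apply only intents that have a matching completion record;
|        intents without a completion are silently dropped. */
|     static void replay(const struct intent_rec *in, size_t n_in,
|                        const struct completion_rec *done, size_t n_done,
|                        void (*apply)(const struct intent_rec *)) {
|         for (size_t i = 0; i < n_in; i++) {
|             bool completed = false;
|             for (size_t c = 0; c < n_done; c++) {
|                 if (done[c].seq == in[i].seq) { completed = true; break; }
|             }
|             if (completed)
|                 apply(&in[i]);
|         }
|     }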
| ori_b wrote:
| There's not even a guarantee that the intent log flushes
| to disk before the completion log. You can get
| entries in the completion log whose corresponding intent entries
| were lost from the intent log. So, no, there's no guarantee of
| consistent recovery.
|
| You'd be better off with a single log.
| toast0 wrote:
| There's no guarantee of ordering of writes within the two
| logs either.
|
| This seems nightmarish to recover from.
| lmeyerov wrote:
| I think he says he checks for both
|
| It's interesting as a weaker safety guarantee. He is
| guaranteeing write integrity, i.e. a valid WAL view on
| restart, by throwing out mismatched writes. But from an
| outside observer's perspective there is premature signaling
| of completion, which would mean data loss, as a client may
| have moved on without retrying because it thought the data
| was safely saved. (I was a bit confused about what
| 'completion' means here, so not confident.)
|
| We hit some similar scenarios in Graphistry, where we
| treat the receiving server's disk/RAM during browser uploads
| as write-through caches in front of our cloud storage
| persistence tiers. The choice of when to signal success
| to the uploader is funny -- disk/RAM vs cloud storage --
| and the timing difference is fairly observable to the web
| user.
| zozbot234 wrote:
| > Does this mean that a client could receive a success for a
| request, which if the system crashed immediately afterwards,
| when replayed, wouldn't necessarily have that request recorded?
|
| Yup. OP says "the intent record could just be sitting in a
| kernel buffer", but then the exact same issue applies to the
| completion record. So confirmation to the client cannot be
| issued until the completion record has been written to durable
| storage. Not really seeing the point of this blogpost.
| nromiun wrote:
| Slightly off topic, but does anyone know when/if Google is going
| to enable io_uring for Android?
| jeffbee wrote:
| Hopefully never. It almost seems to have been purpose-built for
| local privilege escalation exploits.
| ozgrakkurt wrote:
| Great to see someone going into this. I've wanted to do a simple
| LSM tree using io_uring in Zig for some time but haven't gotten
| into it yet.
|
| I always use this approach for crash-resistance:
|
| - Append to the data (WAL) file normally.
|
| - Have a separate small file that holds a hash + length for the
| WAL state.
|
| - First append to WAL file.
|
| - Start an fsync call on the WAL file; create a new hash/length
| file with a different name and fsync it in parallel.
|
| - Rename the length file onto the real one to make sure it is
| fully atomic.
|
| - Update in-memory state to reflect the files and return from the
| write function call.
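|
| Roughly, in C (a sketch of the sequence above; error handling
| and the actual hash are omitted, and the file names are made up):
|
|     #include <fcntl.h>
|     #include <stdint.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     /* Append a record, then atomically publish the new WAL
|        length + hash. */
|     void commit_record(int wal_fd, const void *rec, size_t len,
|                        uint64_t new_len, uint64_t new_hash) {
|         /* 1. Append to the WAL file. */
|         write(wal_fd, rec, len);
|
|         /* 2. Write a fresh state file under a temporary name. */
|         int st = open("wal.state.tmp",
|                       O_WRONLY | O_CREAT | O_TRUNC, 0644);
|         dprintf(st, "%llu %llu\n",
|                 (unsigned long long)new_len,
|                 (unsigned long long)new_hash);
|
|         /* 3. fsync both (these two fsyncs could be issued in
|               parallel, as described above). */
|         fsync(wal_fd);
|         fsync(st);
|         close(st);
|
|         /* 4. Atomically replace the old state file. Many
|               implementations also fsync the containing directory
|               here so the rename itself is durable. */
|         rename("wal.state.tmp", "wal.state");
|     }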
|
| Curious if anyone knows the tradeoffs between this and doing a
| double WAL. Maybe doing fsync on everything is too slow to
| maintain fast writes?
|
| I learned about the append/rename approach from these articles,
| in case anyone is interested:
|
| - https://discuss.hypermode.com/t/making-badger-crash-resilien...
|
| - https://research.cs.wisc.edu/adsl/Publications/alice-osdi14....
| toolslive wrote:
| It's possible to unify the WAL and the tree. There are some
| append-only B-tree implementations, e.g.
| https://github.com/Incubaid/baardskeerder
| avinassh wrote:
| There are also CoW B-trees; not entirely similar, but kind of
| the same idea.
| toolslive wrote:
| there's a whole class of persistent persistent (the
| repetition is intentional here) data structures. Some of
| them even combine performance with elegance.
| tobias3 wrote:
| I don't get this. How can two(+) WAL operations be faster than
| one (double the sync IOPS)?
|
| I think this database doesn't have durability at all.
| benjiro wrote:
| fsync waits for the drive to report back a successful write.
| When you do a ton of small writes, fsync becomes a bottleneck.
| It's an issue of context switching and pipelining with fsync.
|
| When you write data asynchronously, you do not need to wait for
| this confirmation. So by issuing two async writes, you make
| better use of all your system's CPU cores, as they are not
| stalled waiting for that I/O response. Seeing a 10x performance
| gain is not uncommon with a method like this.
|
| Yes, you do need to check that both records are written and then
| report that back to the client. But that is a non-fsync request
| and does not tax your system the same way fsync writes do.
|
| It has literally the same durability as an fsync write. You need
| to take into account that most databases were written 30, 40+
| years ago, in a time when HDDs ruled and stuff like NVMe
| drives was a pipe dream. But most DBs still work the same, and
| treat NVMe drives like they are HDDs.
|
| Doing the above operation on an HDD will cost you 2x the
| performance, because you barely have 80 to 120 IOPS. But a
| cheap NVMe drive easily does 100,000 like it's nothing.
|
| If you've ever monitored an NVMe drive under database write load,
| you will have noticed that those NVMe drives are just
| underutilized. This is why you see a lot more work on new data
| storage layers being developed for databases that better utilize
| NVMe capabilities (and try to bypass old HDD-era bottlenecks).
| zozbot234 wrote:
| > It has literally the same durability as a fsync write
|
| I don't think we can ensure this without knowing what fsync()
| maps to in the NVMe standard, and somehow replicating that.
| Just reading back is not enough, e.g. the hardware might be
| reading from a volatile cache that will be lost in a crash.
| benjiro wrote:
| Unless you're running cheap consumer NVMe drives, that is not
| an issue: enterprise SSDs/NVMe drives have their own
| capacitors to ensure data is always written.
|
| On cheaper NVMe drives, your point is valid. But we also
| need to ask how much at risk you are. What is the chance
| of a system doing something funky at just the moment you
| happened to send X confirmations to clients for data
| that never got written?
|
| Specific companies will not cheap out and will spend tons on
| enterprise-level hardware. But for the rest of us? I mean,
| have you seen the German Hetzner, where 97% of their
| hardware is consumer-level hardware? Yes,
| there is a risk, but nobody complains about that risk.
|
| And frankly, everything can be a risk if you think about
| it. I have had EXT3 partitions corrupt on a production DB
| server. That is why you have replication and backups ;)
|
| TiDB, or maybe it was another distributed DB, also doesn't
| guarantee consistency, if I remember correctly. They offer
| eventual consistency in exchange for performance.
| gpderetta wrote:
| Forget about consumer drives: unless you are explicitly
| doing O_DIRECT, why would you expect a notification that
| your IO has completed to mean that it has reached the
| disk at all? The data might still be sitting in the kernel
| page cache and not have gotten close to the disk at all.
|
| You mention you need to wait for the completion record
| to be written. But how do you do that without fsync or
| O_DIRECT? A notification that the write has completed is
| not that.
|
| Edit: maybe you are using RWF_SYNC in your write call.
| That could work.
| codys wrote:
| > Yes, you do need to check if both records are written and
| then report it back to the client. But that is a non-fsync
| request and does not tax your system the same as fsync
| writes.
|
| What mechanism can be used to check that the writes are
| complete if not fsync (or adjacent fdatasync)? What specific
| io_uring operation or system call?
| avinassh wrote:
| I don't get this scheme at all. The protocol violates durability:
| once the client receives success from the server, the data should
| be durable. However, the completion record is async; it is
| possible that it never completes and the server crashes.
|
| During recovery, since the server applies only the operations
| which have both records, you will not recover a record which was
| reported as successful to the client.
| benjiro wrote:
| I think you missed the part in the middle:
|
| -----------------
|
| So the protocol ends up becoming:
|
|     Write intent record (async)
|     Perform operation in memory
|     Write completion record (async)
|     Return success to client
|
| -----------------
|
| In other words, the client only knows it's a success when both
| WAL files have been written.
|
| The goal is not to provide a faster response to the client on
| the first intent record, but to ensure that the system is not
| stuck waiting on fsync I/O.
|
| When you write a ton of data to a database, you often see that
| it's not the core writes but the fsync I/O that eats a ton of
| your resources. Cutting back on that mess means you can
| push more performance out of a write-heavy server.
| jcgrillo wrote:
| There's no fsync in the async version, though, unless I
| missed it? The problem with the two WAL approach is that now
| none of the WAL writes are durable--you could encounter a
| situation where a client reads an entry on the completion WAL
| which upon recovery does not exist on disk. Before with the
| single fsynced WAL, writes were durably persisted.
| loeg wrote:
| No, we saw this scheme, it just doesn't work. Either of the
| async writes can fail _after_ ack'ing the logical write to
| the client as successful (e.g., kernel crash or power
| failure) and then you have lost data.
| cyanydeez wrote:
| You can always have data loss. The intent is that the client
| isn't told the data is saved before the guarantee is in
| place.
|
| I don't know if OP achieved this, but the client isn't told
| "we have your data" until both of the WALs agree. If
| the system goes down, those WALs are used to rebuild the data
| in flight.
|
| The speedup comes from decoupling the synchronous disk writes,
| which are now parallel.
|
| You are not conceptualizing what data loss means in the
| ACID contract between DB and Client.
|
| But you
| loeg wrote:
| > I don't know if OP achieved this,
|
| They did not.
|
| > but the client isn't told "we have your data" until both
| of the WALs agree.
|
| Wrong. In the proposed scheme, the client writes are
| ack'd _before_ the WAL writes are flushed. Their contents
| may or may not agree after subsequent power loss or
| kernel crash.
|
| (It is generally considered unacceptable for network
| databases/filers to be lossier than the underlying media.
| Sometimes stronger guarantees are required/provided, but
| that is usually the minimum.)
| LAC-Tech wrote:
| Great article, but I have a question:
|
| _The problem with naive async I/O in a database context at
| least, is that you lose the durability guarantee that makes
| databases useful. When a client receives a success response,
| their expectation is the data will survive a system crash. But
| with async I/O, by the time you send that response, the data
| might still be sitting in kernel buffers, not yet written to
| stable storage._
|
| Shouldn't you just tie the successful response to a successful
| fsync?
|
| Async or sync, I'm not sure what's different here.
| leentee wrote:
| First, I think the article makes a false claim: the solution
| doesn't guarantee durability. Second, I believe good synchronous
| code is better than bad asynchronous code, and it's way easier to
| write good synchronous code than asynchronous code, especially
| with io_uring. Modern NVMe drives are fast enough for most
| applications, even with synchronous IO. Before thinking about
| asynchronous IO, make sure your application uses synchronous IO
| well.
| benjiro wrote:
| Speaking from experience, it's easy to make Postgres (for
| example) thrash your system with a lot of individual
| or batch inserts. The NVMe drives are often extremely
| underutilized, and your bottleneck is the whole fsync layer.
|
| Second, the durability is the same as fsync. The client only
| gets a success reported if both WAL writes have been done.
|
| It's the same guarantee as fsync, but you bypass the fsync
| bottleneck, which in turn lets you actually use the benefits
| of your NVMe drives (and shifts resources away from
| the I/O-blocking fsync).
|
| Yes, it involves more management, because now you need to
| maintain two states instead of one as with the synchronous fsync
| operation. But that is the thing about parallel programming:
| it's more complex, but you get a ton of benefit from
| bypassing synchronous bottlenecks.
| jorangreef wrote:
| To be clear, this is different to what we do (and why we do it)
| in TigerBeetle.
|
| For example, we never externalize commits without full fsync, to
| preserve durability [0].
|
| Further, the motivation for why TigerBeetle has both a prepare
| WAL plus a header WAL is different, not performance (we get
| performance elsewhere, through batching) but correctness, cf.
| "Protocol-Aware Recovery for Consensus-Based Storage" [1].
|
| Finally, TigerBeetle's recovery is more intricate, we do all this
| to survive TigerBeetle's storage fault model. You can read the
| actual code here [2] and Kyle Kingsbury's Jepsen report on
| TigerBeetle also provides an excellent overview [3].
|
| [0] https://www.youtube.com/watch?v=tRgvaqpQPwE
|
| [1]
| https://www.usenix.org/system/files/conference/fast18/fast18...
|
| [2]
| https://github.com/tigerbeetle/tigerbeetle/blob/main/src/vsr...
|
| [3] https://jepsen.io/analyses/tigerbeetle-0.16.11.pdf
| quietbritishjim wrote:
| The article claims that, when they switched to io_uring,
|
| > throughput increased by an order of magnitude almost
| immediately
|
| But right near the start is the real story: the sync version had
|
| > the classic fsync() call after every write to the log for
| durability
|
| They are not comparing performance of sync APIs vs io_uring.
| _They 're comparing using fsync vs not using fsync!_ They even go
| on to say that a problem with async API is that
|
| > you lose the durability guarantee that makes databases useful.
| ... the data might still be sitting in kernel buffers, not yet
| written to stable storage.
|
| No! That's because you stopped using fsync. It's nothing to do
| with your code being async.
|
| If you just removed the fsync from the sync code you'd quite
| possibly get a speedup of an order of magnitude too. Or if you
| put the fsync back in the async version (I don't know io_uring
| well enough to understand that but it appears to be possible with
| "io_uring_prep_fsync") then that would surely slide back. Would
| the io_uring version still be faster either way? Quite possibly,
| but because they made an apples-to-oranges comparison, we can't
| know from this article.
|
| (As other commenters have pointed out, their two-phase commit
| strategy also fails to provide any guarantee. There's no getting
| around fsync if you want to be sure that your data is really on
| the storage medium.)
| zozbot234 wrote:
| So OP's _real_ point is that fsync() sucks in the context of
| modern hardware where thousands of I/O reqs may be in flight
| at any given time. We need more fine-grained mechanisms to
| ensure that writes are committed to permanent storage, without
| introducing undue serialization.
| quietbritishjim wrote:
| Well, there already is slightly more fine-grained control: in
| the sync version, you can perhaps call write() a few
| times before calling fsync() once, i.e. basically batch up a
| few writes. That does have the disadvantage that you can't
| easily queue new writes while waiting for the previous ones.
| Perhaps you could use calls to write() in another thread
| while the first one is waiting for fsync() for the previous
| batch? You could even have lots of threads doing that in
| parallel, but probably not the thousands that you mentioned.
| I don't know the nitty gritty of Linux file IO well enough to
| know how well that would work.
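|
| Something like this, as a rough sketch (plain POSIX; error
| handling omitted):
|
|     #include <sys/uio.h>
|     #include <unistd.h>
|
|     /* Batch several WAL appends, then pay for one fsync()
|        covering the whole batch before acknowledging any of
|        the writes. */
|     void flush_batch(int wal_fd, struct iovec *records, int n) {
|         writev(wal_fd, records, n); /* one syscall, many records */
|         fsync(wal_fd);              /* single durability barrier */
|         /* ...now acknowledge all n writes to their clients... */
|     }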
|
| As I said, I don't know anything about fsync in io_uring.
| Maybe that has more control?
|
| An article that did a fair comparison, by someone who
| actually knows what they're talking about, would be pretty
| interesting.
| loeg wrote:
| > As I said, I don't know anything about fsync in io_uring.
| Maybe that has now control?
|
| io_uring fsync has byte range support:
| https://man7.org/linux/man-
| pages/man2/io_uring_enter.2.html#...
| quietbritishjim wrote:
| Sorry, that was a typo in my comment (now edited). "Now"
| was meant to be "more" i.e. "perhaps [io_uring] has
| _more_ control [than sync APIs]? "
|
| Byte range support is interesting but also present in
| the Linux sync API:
|
| https://man7.org/linux/man-
| pages/man2/sync_file_range.2.html
|
| I meant more like, perhaps it's possible to concurrently
| queue fsync for different writes in a way that isn't
| possible with the sync API. From your link, it appears
| not (unless they're isolated at non-overlapping byte
| ranges, but that's no different from what you can do with
| sync API + threads):
|
| > Note that, while I/O is initiated in the order in which
| it appears in the submission queue, completions are
| unordered. For example, an application which places a
| write I/O followed by an fsync in the submission queue
| cannot expect the fsync to apply to the write. The two
| operations execute in parallel, so the fsync may complete
| before the write is issued to the storage.
|
| So if two writes are for an overlapping byte range, and
| you wanted to write + fsync the first one then write +
| fsync the second then you'd need to queue those four
| operations in application space, ensuring only one is
| submitted to io_uring at a time.
| gpderetta wrote:
| You can insert synchronization OPs (i.e. barriers) in the
| queue to guarantee in-order execution.
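|
| For example, with liburing the link flag can be used to order a
| write before an fsync, something like this (a sketch, error
| handling omitted; IOSQE_IO_DRAIN is the heavier whole-queue
| barrier):
|
|     #include <liburing.h>
|
|     /* Chain a write and an fsync so the fsync only starts after
|        the write completes; waiting for the fsync's CQE then
|        implies both are done. */
|     void write_then_fsync(struct io_uring *ring, int fd,
|                           const void *buf, unsigned len, off_t off) {
|         struct io_uring_sqe *sqe;
|
|         sqe = io_uring_get_sqe(ring);
|         io_uring_prep_write(sqe, fd, buf, len, off);
|         io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); /* link to next SQE */
|
|         sqe = io_uring_get_sqe(ring);
|         io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
|
|         io_uring_submit(ring);
|         /* caller waits for the fsync's CQE before acking the client */
|     }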
| immibis wrote:
| Postgres claims to have some kind of commit batching, but I
| couldn't figure out how to turn it on.
|
| I wanted to scrub a table by processing each row, but
| without holding locks, so I wanted to commit every few
| hundred rows, but with only ACI and not D, since I could
| just run the process again. I don't think Postgres supports
| this feature. It also seemed to be calling fsync much more
| than once per transaction.
| sgarland wrote:
| Maybe I don't understand what you're trying to do, but
| you can directly control how frequently commits occur.
|     BEGIN
|     INSERT ...  -- batch of N size
|     COMMIT AND CHAIN
|     INSERT ...
| PaulDavisThe1st wrote:
| Chance of Postgres commit mapping 1:1 onto posix fsync or
| equivalent: slim.
| azlev wrote:
| commit_delay
|
| https://www.postgresql.org/docs/current/runtime-config-
| wal.h...
| morningsam wrote:
| Looking through the options listed under "Non-Durable
| Settings", [1] I guess synchronous_commit = off fits the
| bill?
|
| [1]: https://www.postgresql.org/docs/current/non-
| durability.html
| stefanha wrote:
| The Linux RWF_DSYNC flag sets the Force Unit Access (FUA) bit
| in write requests. This can be used instead of fdatasync(2)
| in some cases. It only syncs a specific write request instead
| of the entire disk write cache.
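|
| For example, with pwritev2() (a sketch; error handling omitted):
|
|     #define _GNU_SOURCE
|     #include <sys/uio.h>
|
|     /* Write one WAL record and make that specific write durable
|        (data plus whatever metadata is needed to read it back),
|        without flushing the entire write cache the way a separate
|        fdatasync(2) would. */
|     ssize_t write_record_dsync(int fd, const void *buf,
|                                size_t len, off_t off) {
|         struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
|         return pwritev2(fd, &iov, 1, off, RWF_DSYNC);
|     }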
| zozbot234 wrote:
| You should prefer RWF_SYNC in case the write involves
| changes to the file metadata (For example, most append
| operations will alter the file size).
| LtdJorge wrote:
| Not really, RWF_DSYNC is equivalent to open(2) with O_DSYNC when
| writing which is equivalent to write(2) followed by fdatasync(2)
| and:
|
|     fdatasync() is similar to fsync(), but does not flush modified
|     metadata unless that metadata is needed in order to allow a
|     subsequent data retrieval to be correctly handled. For
|     example, changes to st_atime or st_mtime (respectively, time
|     of last access and time of last modification; see inode(7))
|     do not require flushing because they are not necessary for a
|     subsequent data read to be handled correctly. On the other
|     hand, a change to the file size (st_size, as made by say
|     ftruncate(2)), would require a metadata flush.
| stefanha wrote:
| Agreed, when metadata changes are involved then RWF_SYNC
| _must_ be used.
|
| RWF_DSYNC is sufficient and faster when data is
| overwritten without metadata changes to the file.
| vlovich123 wrote:
| No that's incorrect. File size changes caused by append
| are covered by fdatasync in terms of durability
| guarantees.
| ImPostingOnHN wrote:
| Some applications, like Apache Kafka, don't immediately fsync
| every write. This lets the kernel batch writes and also
| linearize them, both adding speed. Until synced, the data
| exists only in the linux page cache.
|
| To deal with the risk of data loss, multiple such servers are
| used, with the hope that if one server dies before syncing,
| another server to which the data was replicated, performs an
| fsync _without_ failure.
| to11mtm wrote:
| I feel like you can try to FAFO with that on a distributed
| log like Kafka (although also... eww, but also I wonder
| whether NATS does the same thing or not...)
|
| I would think for something like a database, at _most_ you'd
| want to have something like the io_uring_prep_fsync
| others mentioned with flags set to just not update the
| metadata.
|
| To be clear, in my head I'm envisioning this case to be a
| WAL type scenario; in my head you can get away with just
| having a separate thread or threads pulling from WAL and
| writing to main DB files... but also I've never written a
| real database so maybe those thoughts are off base.
| osigurdson wrote:
| Suggest watching the TigerBeetle video link in the article.
| There they discuss bitrot, "fsync gate", how Postgres used
| fsync wrong for 30 years, etc. It is very interesting even as
| pure entertainment.
| jorangreef wrote:
| Thanks! Great to hear you enjoyed our talk. Most of it is
| simply putting the spotlight on UW-Madison's work on storage
| faults.
|
| Just to emphasize again that this blog post here is really
| quite different, since it does not fsync and breaks
| durability.
|
| Not what we do in TigerBeetle or would recommend or
| encourage.
|
| See also: https://news.ycombinator.com/item?id=44624065
| mhuffman wrote:
| Hi! I don't have a need for your products directly, but I
| was very intrigued when I saw TB's demo and talk on
| ThePrimeagen YT channel. I have been developing software for
| a looooong time and it was a breath of fresh air in a sea
| of startups to see a company champion optimization, speed,
| and security without going too deep in the weeds and
| slowing development. These days, that typically comes more
| as an afterthought or as a response to an incident. Or not
| at all. I would recommend any developer with an open mind
| to read this short document[0]. I have been integrating it
| into my own company's development practices with good
| results.
|
| [0]https://github.com/tigerbeetle/tigerbeetle/blob/main/doc
| s/TI...
| jorangreef wrote:
| Appreciate your taking the time to write these kind
| words. Great to hear that TigerStyle has been making an
| impact on your company's developer practices!
| ajross wrote:
| > There's no getting around fsync if you want to be sure that
| your data is really on the storage medium.
|
| That's not correct; io_uring supports O_DIRECT write requests
| just fine. Obviously bypassing the cache isn't the same as just
| flushing it (which is what fsync does), so there are design
| impacts.
|
| But database engines are absolutely the target of io_uring's
| feature set and they're expected to be managing this
| complexity.
| zozbot234 wrote:
| That's not what O_DIRECT is for. Did you mean O_SYNC ?
| codys wrote:
| > But database engines are absolutely the target of
| io_uring's feature set and they're expected to be managing
| this complexity.
|
| io_uring includes an fsync opcode (with range support). When
| folks talk about fsync generally here, they're not saying that
| io_uring is unusable; they're saying that they'd expect
| fsync to be used, whether via the io_uring opcode, the
| system call, or some other mechanism yet to be created.
| jandrewrogers wrote:
| O_DIRECT is not a substitute for fsync(). It only guarantees
| that data gets to the storage device cache, which is not
| durable in most cases.
| somat wrote:
| My understanding is that the storage device cache is
| opaque, that is, drives tend to lie, saying the write is
| done when it is in cache, and depend on having enough
| internal power capacity to flush on power loss.
| loeg wrote:
| Consumer devices sometimes lie (enterprise products less
| so), but there is a distinction between O_DIRECT and
| actual fsync at the protocol layer (e.g., in NVMe, fsync
| maps into a Flush command).
| quietbritishjim wrote:
| If that's true (notwithstanding objections from sibling
| comments), then that's just another spelling of fsync.
|
| My point was really: you can't magically get the performance
| benefits of omitting fsync (or functional equivalent) while
| still getting the durability guarantees it gives.
| codys wrote:
| > > you lose the durability guarantee that makes databases
| useful. ... the data might still be sitting in kernel buffers,
| not yet written to stable storage.
|
| > No! That's because you stopped using fsync. It's nothing to
| do with your code being async.
|
| From that section, it sounds like OP was tossing data into the
| io_uring submission queue and calling it "done" at that point
| (ie: not waiting for the io_uring completion queue to have the
| completion indicated). So yes, fsync is needed, but they
| weren't even waiting for the kernel to start the write before
| indicating success.
|
| I think to some extent things have been confused because
| io_uring has a completion concept, but OP also has a separate
| completion concept in their dual wal design (where the second
| WAL they call the "completion" WAL).
|
| But I'm not sure if OP really took away the right understanding
| from their issues with ignoring io_uring completions, as they
| then create a 5 step procedure that adds one check for an
| io_uring completion, but still omits another.
|
| > 1. Write intent record (async)
|
| > 2. Perform operation in memory
|
| > 3. Write completion record (async)
|
| > 4. Wait for the completion record to be written to the WAL
|
| > 5. Return success to client
|
| Note the lack of waiting for the io_uring completion of the
| intent record (and yes, there's still not any reference to
| fsync or alternates, which is also wrong). There is no ordering
| guarantee between independent io_urings (OP states they're
| using separate io_uring instances for each WAL), and even in
| the same io_uring there is limited ordering around completions
| (IOSQE_IO_LINK exists, but doesn't allow traversing submission
| boundaries, so it won't work here because OP submits the work at
| separate times. They'd need to use IOSQE_IO_DRAIN, which seems
| like it would effectively serialize their writes, which is why
| it seems like OP would need to actually wait for completion of
| the intent write).
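|
| For the ack path, I'd expect something more along these lines (a
| sketch with liburing, assuming one ring per WAL as OP describes;
| error handling, batching, and short-write handling omitted):
|
|     #include <liburing.h>
|
|     /* Reap one CQE and return its result (>= 0 ok, -errno on error). */
|     static int wait_one(struct io_uring *ring) {
|         struct io_uring_cqe *cqe;
|         int ret = io_uring_wait_cqe(ring, &cqe);
|         if (ret == 0) {
|             ret = cqe->res;
|             io_uring_cqe_seen(ring, cqe);
|         }
|         return ret;
|     }
|
|     /* Append a record and make it durable: a write linked to an
|        fsync, then wait for both CQEs. */
|     static int durable_append(struct io_uring *ring, int fd,
|                               const void *buf, unsigned len, off_t off) {
|         struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|         io_uring_prep_write(sqe, fd, buf, len, off);
|         io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); /* fsync after write */
|
|         sqe = io_uring_get_sqe(ring);
|         io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
|
|         io_uring_submit(ring);
|         int w = wait_one(ring); /* write CQE */
|         int f = wait_one(ring); /* fsync CQE */
|         return (w < 0) ? w : f;
|     }
|
|     /* Ack only after BOTH the intent and completion records (and
|        their fsyncs) have completed, on their respective rings. */
|     int ack_when_durable(struct io_uring *intent_ring, int intent_fd,
|                          const void *irec, unsigned ilen, off_t ioff,
|                          struct io_uring *done_ring, int done_fd,
|                          const void *drec, unsigned dlen, off_t doff) {
|         if (durable_append(intent_ring, intent_fd, irec, ilen, ioff) < 0)
|             return -1;
|         if (durable_append(done_ring, done_fd, drec, dlen, doff) < 0)
|             return -1;
|         return 0; /* safe to send success to the client now */
|     }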
| cryptonector wrote:
| Correct, TFA needs to wait for the completion of _all_ writes
| to the WAL, which is what `fsync()` was doing. Waiting only
| for the completion of the "completion record" does not ensure
| that the "intent record" made it to the WAL. In the event of
| a power failure it is entirely possible that the intent
| record did not make it but the completion record did, and
| then on recovery you'll have to panic.
| codys wrote:
| Yes, but I suspect there might be some confusion by the
| author and others between "io_uring completion of a write"
| (ie: io_uring sends its completion queue event that
| corresponds to a previous submission queue event) and
| "fsync completion" (as you've put as "completion of all
| writes", though note that fsync the api is fd scoped and
| the io_uring operation for fsync has file range support).
|
| The CQEs on a write indicate something different compared
| to the CQE of an fsync operation on that same range.
| osigurdson wrote:
| I've watched the TigerBeetle talk (YouTube link in the article).
| This is very interesting even for those not in the space.
| demaga wrote:
| I feel like writing asynchronously to a WAL defeats its purpose.
| jasonthorsness wrote:
| Is the underlying NVME storage interface the kernel/drivers get
| to use cleaner/simpler than the Linux abstractions? Or does it
| get more complicated? Sometimes I wonder if certain high-
| performance applications would be better off running as special-
| purpose unikernels unburdened by interfaces designed for older
| generations of technology.
| loeg wrote:
| Also an option with io_uring:
| https://www.usenix.org/conference/fast24/presentation/joshi
|
| (We use it at work in a network object storage service in
| order to use the underlying NVMe T10-DIF[1], which isn't
| exposed nicely by conventional POSIX/Linux interfaces.)
|
| Ultimately, having a full, ~normal Linux stack around makes
| system management / orchestration easier. And programs other
| than our specialized storage software can still access other
| partitions, etc.
|
| [1]: https://en.wikipedia.org/wiki/Data_Integrity_Field
| eatonphil wrote:
| From the title I was hoping this would be a survey of databases
| using io_uring, since there've been quips on the internet (here,
| twitter, etc) that no one uses io_uring in production. In my
| brief search TigerBeetle (and maybe Turso's Limbo) was the only
| database in production that I remember doing io_uring (by
| default). Some other databases had it as an option but didn't
| seem to default to it.
|
| If anyone else feels like doing this survey and publishing the
| results I'd love to see it.
| jtregunna wrote:
| Update:
|
| I updated the post based on the conversation below. I wholly
| missed an important callout about performance, and wasn't super
| clear that you do need to wait for the completion record to be
| written before responding to the client. That was implied by the
| completion-record write coming before the response, but I made
| it clearer to avoid confusion.
|
| Also, the dual-WAL approach is worse for latency unless you can
| amortize the double write over multiple async writes, so the cost
| is spread across the batch; when the batch size is closer to 1,
| the cost is higher.
| gpderetta wrote:
| How can you know that the completion record is written to disk?
| codys wrote:
| From the update added to the post:
|
| > This is tracked through io_uring's completion queue - we only
| send a success response after receiving confirmation that the
| completion record has been persisted to stable storage.
|
| Which completion queue event(s) are you examining here? I ask
| because the way this is worded makes it sound like you're
| waiting solely for the completion queue event for the _write_
| to the "completion wal".
|
| Doing that (waiting only on the "completion wal" write CQE)
|
| 1. doesn't ensure that the "intent wal" has been written
| (because it's a different io_uring and a different submission
| queue event used to do the "intent wal" write from the
| "completion wal" write), and
|
| 2. doesn't indicate the "intent wal" data or the "completion
| wal" data has made it to durable storage (one needs fsync for
| that, the completion queue events for writes don't make that
| promise. The CQE for an fsync opcode would indicate that data
| has made it to durable storage if the fsync has the right
| ordering wrt the writes and refers to the appropriate fd and
| data ranges. Alternatively, there are some flags that have the
| effect of implying an fsync following a write that could be
| used, but those aren't mentioned)
| ptrwis wrote:
| For some background: right now it is a single guy, paid by
| Microsoft, working on implementing async direct I/O for
| PostgreSQL (github.com/anarazel).
| lstroud wrote:
| About 10ish years ago, I ended up finding a deadlock in the Linux
| raid driver when turning on Oracle's async writes with raid10 on
| lvm on AWS. I traced it to the ring buffers the author mentioned,
| but ended up having to remove lvm (since it wasn't that necessary
| on this infrastructure) to get the throughput I needed.
| BeeOnRope wrote:
| What is the point of the intent entry at all? It seems like
| operations are only durable after the completion record is
| written so the intent record seems to serve no purpose (unless it
| is say much larger).
| sethev wrote:
| There's some faulty reasoning in this post. Without the code,
| it's hard to pin down exactly where things went wrong.
|
| These are the steps described in the post:
|
|     1. Write intent record (async)
|     2. Perform operation in memory
|     3. Write completion record (async)
|     4. Wait for the completion record to be written to the WAL
|     5. Return success to client
|
| If 4 is done correctly then 3 is not needed - it can just wait
| for the intent to be durable before replying to the client.
| Perhaps there's a small benefit to speculatively executing the
| operation before the WAL is committed - but I'm skeptical and my
| guess is that 4 is not being done correctly. The author added an
| update to the article:
|
| > This is tracked through io_uring's completion queue - we only
| send a success response after receiving confirmation that the
| completion record has been persisted to stable storage
|
| This makes it sound like he's submitting write operations for the
| completion record and then misinterpreting the completion queue
| for those writes as "the record is now in durable storage".
| jeffbee wrote:
| What's baffling to me about this post is that anyone would
| believe that io_uring was even capable of speeding up this
| workload by 10x. Unless your profile suggests that syscall entry
| is taking > 90% of your CPU time, that is impossible. The only
| thing io_uring can do for you is reduce your syscall count, so
| the upper bound of its utility is whatever you are currently
| spending on sysenter/exit.
| loeg wrote:
| You could also imagine it hiding write latency by allowing a
| very naive single-threaded application to do IOs concurrently,
| overlapped in time, instead of serialized. (But a threadpool
| would do much the same thing.)
| gpderetta wrote:
| io_uring could allow for better throughput by simply having
| multiple operations in flight, allowing for better I/O
| scheduling.
|
| But yes, this specific case seems to be a misunderstanding of
| what io_uring write completion means.
|
| You would expect that they would have tested recovery by at
| least simulating system stops immediately after IO
| completion notification.
|
| Unless they are truly using asynchronous O_SYNC writes and are
| just bad at explaining it.
| misiek08 wrote:
| 1. Write intent. 2. Don't treat the intent write as success. 3.
| Report success on completion of a different operation.
|
| While restoring: 1. Ignore all intents. 2. Use only the other
| operations that have corresponding intents.
|
| I think this article introduces so much chaos, with a lot of
| "almost"-helpful info on io_uring, that it ultimately hurts the
| tech. io_uring IMHO lacks clean and simple examples, and here we
| again have some badly explained theories instead of meat.
___________________________________________________________________
(page generated 2025-07-20 23:01 UTC)