[HN Gopher] Async I/O on Linux in databases
___________________________________________________________________
Async I/O on Linux in databases
Author : jtregunna
Score : 174 points
Date : 2025-07-20 06:20 UTC (16 hours ago)
(HTM) web link (blog.canoozie.net)
(TXT) w3m dump (blog.canoozie.net)
| jtregunna wrote:
| Post talks about how to use io_uring, in the context of building
| a "database" (a demonstration key-value cache with a write-ahead
| log), to maintain durability.
| tlb wrote:
| The recovery process is to "only apply operations that have both
| intent and completion records." But then I don't see the point of
| logging the intent record separately. If no completion is logged,
| the intent is ignored. So you could log the two together.
|
| Presumably the intent record is large (containing the key-value
| data) while the completion record is tiny (containing just the
| index of the intent record). Is the point that the completion
| record write is guaranteed to be atomic because it fits in a disk
| sector, while the intent record doesn't?
| ta8645 wrote:
| It's really not clear in the article. But I _think_ the gains
| are to be had because you can do the in-memory updating during
| the time that the WAL is being written to disk (rather than
| waiting for it to flush before proceeding). So I'm guessing the
| protocol, as presented, is actually missing a key step:
|     Write intent record (async)
|     Perform operation in memory
|     Write completion record (async)
|     *Wait for intent and completion to be flushed to disk*
|     Return success to client
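|
| Very roughly, something like this (pure pseudocode in C syntax;
| the helper names are made up, not real io_uring calls):
|
|     /* Hypothetical sketch of the protocol with the missing wait. */
|     int handle_put(struct kv *kv, struct request *req) {
|         uint64_t seq = next_seq(kv);                   /* made-up helper */
|         submit_wal_append(kv->intent_wal, req, seq);       /* async */
|         apply_in_memory(kv, req);
|         submit_wal_append(kv->completion_wal, NULL, seq);  /* async */
|         /* The step I think is missing: block until BOTH records
|            are actually durable before acknowledging the client. */
|         wait_durable(kv->intent_wal, seq);
|         wait_durable(kv->completion_wal, seq);
|         return reply_success(req);
|     }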
| gsliepen wrote:
| But this makes me wonder how it works when there are
| concurrent requests. What if a second thread requests data
| that is being written to memory by the first thread?
| Shouldn't it also wait for both the write intent record and the
| completion record to have been flushed to disk? Otherwise you
| could end up with a query that returns data which, after a
| crash, won't exist anymore.
| Manuel_D wrote:
| It's not the write ahead log that prevents that scenario,
| it's transaction isolation. And note that the more
| permissive isolation levels offered by Postgres, for
| example, do allow that failure mode to occur.
| Demiurge wrote:
| If that's the hypothesis, it would be good to see some
| numbers or a proof of concept. The real-world performance
| impact doesn't seem that obvious to predict here.
| avinassh wrote:
| *Wait for intent and completion to be flushed to disk*
|
| If you wait for both to complete, then how can it be faster
| than doing a single IO?
| cbzbc wrote:
| _Presumably the intent record is large (containing the key-
| value data) while the completion record is tiny_
|
| I don't think this is necessarily the case, because the
| operations may have completed in a different order to how they
| are recorded in the intent log.
| jmpman wrote:
| "Write intent record (async)
| Perform operation in memory
| Write completion record (async)
| Return success to client
|
| During recovery, I only apply operations that have both intent
| and completion records. This ensures consistency while allowing
| much higher throughput."
|
| Does this mean that a client could receive a success for a
| request, which if the system crashed immediately afterwards, when
| replayed, wouldn't necessarily have that request recorded?
|
| How does that not violate ACID?
| JasonSage wrote:
| As best I can tell, the author understands that the async
| write-ahead fails to be a guarantee where the sync one does...
| then turns their async write into two async writes... but
| there's still no guarantee comparable to the synchronous
| version.
|
| So I fail to see how the two async writes are any guarantee at
| all. It sounds like they just happen to provide better
| consistency than the one async write because it forces an
| arbitrary amount of time to pass.
| m11a wrote:
| Yeah, I feel like I'm missing the point of this. The original
| purpose of the WAL was for recovery, so WAL entries are
| supposed to be flushed to disk.
|
| Seems like OP's async approach removes that, so there's no
| durability guarantee, so why even maintain a WAL to begin
| with?
| nephalegm wrote:
| Reading through the article it's explained in the recovery
| process. He reads the intent log entries and the completion
| entries and only applies them if they both exist.
|
| So while there is no guarantee that operations are committed
| (by virtue of not being acknowledged to the application, since
| it's asynchronous), the recovery replay will be consistent.
|
| I could see it would be problematic for any data where the
| order of operations is important, but that's the trade off
| for performance. This does seem to be an improvement to
| ensure asynchronous IO will always result in a consistent
| recovery.
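|
| In other words, the replay is basically an intersection of the
| two logs. A minimal sketch of that matching step (illustrative
| only, not the author's code; the record layout is invented):
|
|     #include <stdbool.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Invented record layout; the post doesn't show the WAL
|        format. */
|     struct intent_rec     { uint64_t seq; char key[64]; char val[256]; };
|     struct completion_rec { uint64_t seq; };
|
|     /* Apply only intents that have a matching completion record;
|        intents without a completion are silently dropped. */
|     static void replay(const struct intent_rec *in, size_t n_in,
|                        const struct completion_rec *done, size_t n_done,
|                        void (*apply)(const struct intent_rec *)) {
|         for (size_t i = 0; i < n_in; i++) {
|             bool completed = false;
|             for (size_t c = 0; c < n_done; c++) {
|                 if (done[c].seq == in[i].seq) { completed = true; break; }
|             }
|             if (completed)
|                 apply(&in[i]);
|         }
|     }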
| ori_b wrote:
| There's not even a guarantee that the intent log flushes
| to disk before the completion log. You can get
| entries in the completion log whose corresponding intent entries
| were lost from the intent log. So, no, there's no guarantee of
| consistent recovery.
|
| You'd be better off with a single log.
| toast0 wrote:
| There's no guarantee of ordering of writes within the two
| logs either.
|
| This seems nightmarish to recover from.
| lmeyerov wrote:
| I think he says he checks for both
|
| It's interesting as a weaker safety guarantee. He is
| guaranteeing write integrity, i.e. a valid WAL view on
| restart, by throwing out mismatched writes. But from an
| outside observer's perspective there is premature signaling
| of completion, which would mean data loss, as a client may
| have moved on without retrying because it thought the data
| was safely saved. (I was a bit confused about what
| 'completion' means here, so not confident.)
|
| We hit some similar scenarios in Graphistry, where we
| treat the receiving server's disk/RAM during browser uploads
| as write-through caches in front of our cloud storage
| persistence tiers. The choice of when to signal success
| to the uploader is funny -- disk/RAM vs cloud storage --
| and the timing difference is fairly observable to the web
| user.
| zozbot234 wrote:
| > Does this mean that a client could receive a success for a
| request, which if the system crashed immediately afterwards,
| when replayed, wouldn't necessarily have that request recorded?
|
| Yup. OP says "the intent record could just be sitting in a
| kernel buffer", but then the exact same issue applies to the
| completion record. So confirmation to the client cannot be
| issued until the completion record has been written to durable
| storage. Not really seeing the point of this blogpost.
| nromiun wrote:
| Slightly off topic, but does anyone know when/if Google is going
| to enable io_uring for Android?
| jeffbee wrote:
| Hopefully never. It almost seems to have been purpose-built for
| local privilege escalation exploits.
| ozgrakkurt wrote:
| Great to see someone going into this. I've wanted to do a simple
| LSM tree using io_uring in Zig for some time but haven't gotten
| into it yet.
|
| I always use this approach for crash-resistance:
|
| - Append to the data (WAL) file normally.
|
| - Have a separate small file that holds a hash + length for the
| WAL state.
|
| - First append to WAL file.
|
| - Start an fsync call on the WAL file; create a new hash/length
| file with a different name and fsync it in parallel.
|
| - Rename the length file onto the real one to make sure it is
| fully atomic.
|
| - Update in-memory state to reflect the files and return from the
| write function call.
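|
| Roughly, in C (a sketch of the sequence above; error handling
| and the actual hash are omitted, and the file names are made up):
|
|     #include <fcntl.h>
|     #include <stdint.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     /* Append a record, then atomically publish the new WAL
|        length + hash. */
|     void commit_record(int wal_fd, const void *rec, size_t len,
|                        uint64_t new_len, uint64_t new_hash) {
|         /* 1. Append to the WAL file. */
|         write(wal_fd, rec, len);
|
|         /* 2. Write a fresh state file under a temporary name. */
|         int st = open("wal.state.tmp",
|                       O_WRONLY | O_CREAT | O_TRUNC, 0644);
|         dprintf(st, "%llu %llu\n",
|                 (unsigned long long)new_len,
|                 (unsigned long long)new_hash);
|
|         /* 3. fsync both (these two fsyncs could be issued in
|               parallel, as described above). */
|         fsync(wal_fd);
|         fsync(st);
|         close(st);
|
|         /* 4. Atomically replace the old state file. Many
|               implementations also fsync the containing directory
|               here so the rename itself is durable. */
|         rename("wal.state.tmp", "wal.state");
|     }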
|
| Curious if anyone knows the tradeoffs between this and doing a
| double WAL. Maybe doing fsync on everything is too slow to
| maintain fast writes?
|
| I learned about the append/rename approach from these articles,
| in case anyone is interested:
|
| - https://discuss.hypermode.com/t/making-badger-crash-resilien...
|
| - https://research.cs.wisc.edu/adsl/Publications/alice-osdi14....
| toolslive wrote:
| It's possible to unify the WAL and the tree. There are some
| append-only B-tree implementations, e.g.
| https://github.com/Incubaid/baardskeerder
| avinassh wrote:
| There are also CoW B-trees; not entirely similar, but kind of
| the same idea.
| toolslive wrote:
| there's a whole class of persistent persistent (the
| repetition is intentional here) data structures. Some of
| them even combine performance with elegance.
| tobias3 wrote:
| I don't get this. How can two(+) WAL operations be faster than
| one (double the sync IOPS)?
|
| I think this database doesn't have durability at all.
| benjiro wrote:
| fsync waits for the drive to report back a successful write.
| When you do a ton of small writes, fsync becomes a bottleneck.
| It's an issue of context switching and pipelining with fsync.
|
| When you write data asynchronously, you do not need to wait for
| this confirmation. So by issuing two async writes, you make
| better use of all your system's CPU cores, as they are not
| stalled waiting for that I/O response. Seeing a 10x performance
| gain is not uncommon with a method like this.
|
| Yes, you do need to check that both records are written and then
| report that back to the client. But that is a non-fsync request
| and does not tax your system the same way fsync writes do.
|
| It has literally the same durability as an fsync write. You need
| to take into account that most databases were written 30, 40+
| years ago, in a time when HDDs ruled and stuff like NVMe
| drives was a pipe dream. But most DBs still work the same, and
| treat NVMe drives like they are HDDs.
|
| Doing the above operation on an HDD will cost you 2x the
| performance, because you barely have 80 to 120 IOPS. But a
| cheap NVMe drive easily does 100,000 like it's nothing.
|
| If you've ever monitored an NVMe drive under database write load,
| you will have noticed that those NVMe drives are just
| underutilized. This is why you see a lot more work on new data
| storage layers being developed for databases that better utilize
| NVMe capabilities (and try to bypass old HDD-era bottlenecks).
| zozbot234 wrote:
| > It has literally the same durability as a fsync write
|
| I don't think we can ensure this without knowing what fsync()
| maps to in the NVMe standard, and somehow replicating that.
| Just reading back is not enough, e.g. the hardware might be
| reading from a volatile cache that will be lost in a crash.
| benjiro wrote:
| Unless you're running cheap consumer NVMe drives, that is not
| an issue: enterprise SSDs/NVMe drives have their own
| capacitors to ensure data is always written.
|
| On cheaper NVMe drives, your point is valid. But we also
| need to ask how much at risk you are. What is the chance
| of a system doing something funky at just the moment you
| happened to send X confirmations to clients for data
| that never got written?
|
| Specific companies will not cheap out and will spend tons on
| enterprise-level hardware. But for the rest of us? I mean,
| have you seen the German Hetzner, where 97% of their
| hardware is consumer-level hardware? Yes,
| there is a risk, but nobody complains about that risk.
|
| And frankly, everything can be a risk if you think about
| it. I have had EXT3 partitions corrupt on a production DB
| server. That is why you have replication and backups ;)
|
| TiDB, or maybe it was another distributed DB, also doesn't
| guarantee consistency, if I remember correctly. They offer
| eventual consistency in exchange for performance.
| gpderetta wrote:
| Forget about consumer drives: unless you are explicitly
| doing O_DIRECT, why would you expect a notification that
| your IO has completed to mean that it has reached the
| disk at all? The data might still be sitting in the kernel
| page cache and not have gotten close to the disk at all.
|
| You mention you need to wait for the completion record
| to be written. But how do you do that without fsync or
| O_DIRECT? A notification that the write has completed is
| not that.
|
| Edit: maybe you are using RWF_SYNC in your write call.
| That could work.
| codys wrote:
| > Yes, you do need to check if both records are written and
| then report it back to the client. But that is a non-fsync
| request and does not tax your system the same as fsync
| writes.
|
| What mechanism can be used to check that the writes are
| complete if not fsync (or adjacent fdatasync)? What specific
| io_uring operation or system call?
| avinassh wrote:
| I don't get this scheme at all. The protocol violates durability:
| once the client receives success from the server, the data should
| be durable. However, the completion record is async; it is
| possible that it never completes and the server crashes.
|
| During recovery, since the server applies only the operations
| which have both records, you will not recover a record which was
| reported as successful to the client.
| benjiro wrote:
| I think you missed the part in the middle:
|
| -----------------
|
| So the protocol ends up becoming:
|
|     Write intent record (async)
|     Perform operation in memory
|     Write completion record (async)
|     Return success to client
|
| -----------------
|
| In other words, the client only knows it's a success when both
| WAL files have been written.
|
| The goal is not to provide a faster response to the client on
| the first intent record, but to ensure that the system is not
| stuck waiting on fsync I/O.
|
| When you write a ton of data to a database, you often see that
| it's not the core writes but the fsync I/O that eats a ton of
| your resources. Cutting back on that mess means you can
| push more performance out of a write-heavy server.
| jcgrillo wrote:
| There's no fsync in the async version, though, unless I
| missed it? The problem with the two WAL approach is that now
| none of the WAL writes are durable--you could encounter a
| situation where a client reads an entry on the completion WAL
| which upon recovery does not exist on disk. Before with the
| single fsynced WAL, writes were durably persisted.
| loeg wrote:
| No, we saw this scheme, it just doesn't work. Either of the
| async writes can fail _after_ ack'ing the logical write to
| the client as successful (e.g., kernel crash or power
| failure) and then you have lost data.
| cyanydeez wrote:
| You can always have data loss. The intent is that the client
| isn't told the data is saved before the guarantee is in
| place.
|
| I don't know if OP achieved this, but the client isn't told
| "we have your data" until both of the WALs agree. If
| the system goes down, those WALs are used to rebuild the data
| in flight.
|
| The speedup comes from decoupling the synchronous disk writes,
| which are now parallel.
|
| You are not conceptualizing what data loss means in the
| ACID contract between DB and Client.
|
| But you
| loeg wrote:
| > I don't know if OP achieved this,
|
| They did not.
|
| > but the client isn't told "we have your data" until both
| of the WALs agree.
|
| Wrong. In the proposed scheme, the client writes are
| ack'd _before_ the WAL writes are flushed. Their contents
| may or may not agree after subsequent power loss or
| kernel crash.
|
| (It is generally considered unacceptable for network
| databases/filers to be lossier than the underlying media.
| Sometimes stronger guarantees are required/provided, but
| that is usually the minimum.)
| LAC-Tech wrote:
| Great article, but I have a question:
|
| _The problem with naive async I/O in a database context at
| least, is that you lose the durability guarantee that makes
| databases useful. When a client receives a success response,
| their expectation is the data will survive a system crash. But
| with async I/O, by the time you send that response, the data
| might still be sitting in kernel buffers, not yet written to
| stable storage._
|
| Shouldn't you just tie the successful response to a successful
| fsync?
|
| Async or sync, I'm not sure what's different here.
| leentee wrote:
| First, I think the article makes a false claim: the solution
| doesn't guarantee durability. Second, I believe good synchronous
| code is better than bad asynchronous code, and it's way easier to
| write good synchronous code than asynchronous code, especially
| with io_uring. Modern NVMe drives are fast enough for most
| applications, even with synchronous IO. Before thinking about
| asynchronous IO, make sure your application uses synchronous IO
| well.
| benjiro wrote:
| Speaking from experience, it's easy to make Postgres (for
| example) thrash your system with a lot of individual
| or batch inserts. The NVMe drives are often extremely
| underutilized, and your bottleneck is the whole fsync layer.
|
| Second, the durability is the same as fsync. The client only
| gets a success reported if both WAL writes have been done.
|
| It's the same guarantee as fsync, but you bypass the fsync
| bottleneck, which in turn lets you actually use the benefits
| of your NVMe drives (and shifts resources away from
| the I/O-blocking fsync).
|
| Yes, it involves more management, because now you need to
| maintain two states instead of one as with the synchronous fsync
| operation. But that is the thing about parallel programming:
| it's more complex, but you get a ton of benefit from
| bypassing synchronous bottlenecks.
| jorangreef wrote:
| To be clear, this is different to what we do (and why we do it)
| in TigerBeetle.
|
| For example, we never externalize commits without full fsync, to
| preserve durability [0].
|
| Further, the motivation for why TigerBeetle has both a prepare
| WAL plus a header WAL is different, not performance (we get
| performance elsewhere, through batching) but correctness, cf.
| "Protocol-Aware Recovery for Consensus-Based Storage" [1].
|
| Finally, TigerBeetle's recovery is more intricate, we do all this
| to survive TigerBeetle's storage fault model. You can read the
| actual code here [2] and Kyle Kingsbury's Jepsen report on
| TigerBeetle also provides an excellent overview [3].
|
| [0] https://www.youtube.com/watch?v=tRgvaqpQPwE
|
| [1]
| https://www.usenix.org/system/files/conference/fast18/fast18...
|
| [2]
| https://github.com/tigerbeetle/tigerbeetle/blob/main/src/vsr...
|
| [3] https://jepsen.io/analyses/tigerbeetle-0.16.11.pdf
| quietbritishjim wrote:
| The article claims that, when they switched to io_uring,
|
| > throughput increased by an order of magnitude almost
| immediately
|
| But right near the start is the real story: the sync version had
|
| > the classic fsync() call after every write to the log for
| durability
|
| They are not comparing performance of sync APIs vs io_uring.
| _They 're comparing using fsync vs not using fsync!_ They even go
| on to say that a problem with async API is that
|
| > you lose the durability guarantee that makes databases useful.
| ... the data might still be sitting in kernel buffers, not yet
| written to stable storage.
|
| No! That's because you stopped using fsync. It's nothing to do
| with your code being async.
|
| If you just removed the fsync from the sync code you'd quite
| possibly get a speedup of an order of magnitude too. Or if you
| put the fsync back in the async version (I don't know io_uring
| well enough to understand that but it appears to be possible with
| "io_uring_prep_fsync") then that would surely slide back. Would
| the io_uring version still be faster either way? Quite possibly,
| but because they made an apples-to-oranges comparison, we can't
| know from this article.
|
| (As other commenters have pointed out, their two-phase commit
| strategy also fails to provide any guarantee. There's no getting
| around fsync if you want to be sure that your data is really on
| the storage medium.)
| zozbot234 wrote:
| So OP's _real_ point is that fsync() sucks in the context of
| modern hardware where thousands of I/O reqs may be in flight
| at any given time. We need more fine-grained mechanisms to
| ensure that writes are committed to permanent storage, without
| introducing undue serialization.
| quietbritishjim wrote:
| Well, there already is slightly more fine-grained control: in
| the sync version, you can perhaps call write() a few
| times before calling fsync() once, i.e. basically batch up a
| few writes. That does have the disadvantage that you can't
| easily queue new writes while waiting for the previous ones.
| Perhaps you could use calls to write() in another thread
| while the first one is waiting for fsync() for the previous
| batch? You could even have lots of threads doing that in
| parallel, but probably not the thousands that you mentioned.
| I don't know the nitty gritty of Linux file IO well enough to
| know how well that would work.
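|
| Something like this, as a rough sketch (plain POSIX; error
| handling omitted):
|
|     #include <sys/uio.h>
|     #include <unistd.h>
|
|     /* Batch several WAL appends, then pay for one fsync()
|        covering the whole batch before acknowledging any of
|        the writes. */
|     void flush_batch(int wal_fd, struct iovec *records, int n) {
|         writev(wal_fd, records, n); /* one syscall, many records */
|         fsync(wal_fd);              /* single durability barrier */
|         /* ...now acknowledge all n writes to their clients... */
|     }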
|
| As I said, I don't know anything about fsync in io_uring.
| Maybe that has more control?
|
| An article that did a fair comparison, by someone who
| actually knows what they're talking about, would be pretty
| interesting.
| loeg wrote:
| > As I said, I don't know anything about fsync in io_uring.
| Maybe that has now control?
|
| io_uring fsync has byte range support:
| https://man7.org/linux/man-
| pages/man2/io_uring_enter.2.html#...
| quietbritishjim wrote:
| Sorry, that was a typo in my comment (now edited). "Now"
| was meant to be "more" i.e. "perhaps [io_uring] has
| _more_ control [than sync APIs]? "
|
| Byte range support is interesting but also present in
| the Linux sync API:
|
| https://man7.org/linux/man-
| pages/man2/sync_file_range.2.html
|
| I meant more like, perhaps it's possible to concurrently
| queue fsync for different writes in a way that isn't
| possible with the sync API. From your link, it appears
| not (unless they're isolated at non-overlapping byte
| ranges, but that's no different from what you can do with
| sync API + threads):
|
| > Note that, while I/O is initiated in the order in which
| it appears in the submission queue, completions are
| unordered. For example, an application which places a
| write I/O followed by an fsync in the submission queue
| cannot expect the fsync to apply to the write. The two
| operations execute in parallel, so the fsync may complete
| before the write is issued to the storage.
|
| So if two writes are for an overlapping byte range, and
| you wanted to write + fsync the first one then write +
| fsync the second then you'd need to queue those four
| operations in application space, ensuring only one is
| submitted to io_uring at a time.
| gpderetta wrote:
| You can insert synchronization OPs (i.e. barriers) in the
| queue to guarantee in-order execution.
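|
| For example, with liburing the link flag can be used to order a
| write before an fsync, something like this (a sketch, error
| handling omitted; IOSQE_IO_DRAIN is the heavier whole-queue
| barrier):
|
|     #include <liburing.h>
|
|     /* Chain a write and an fsync so the fsync only starts after
|        the write completes; waiting for the fsync's CQE then
|        implies both are done. */
|     void write_then_fsync(struct io_uring *ring, int fd,
|                           const void *buf, unsigned len, off_t off) {
|         struct io_uring_sqe *sqe;
|
|         sqe = io_uring_get_sqe(ring);
|         io_uring_prep_write(sqe, fd, buf, len, off);
|         io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); /* link to next SQE */
|
|         sqe = io_uring_get_sqe(ring);
|         io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
|
|         io_uring_submit(ring);
|         /* caller waits for the fsync's CQE before acking the client */
|     }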
| immibis wrote:
| Postgres claims to have some kind of commit batching, but I
| couldn't figure out how to turn it on.
|
| I wanted to scrub a table by processing each row, but
| without holding locks, so I wanted to commit every few
| hundred rows, but with only ACI and not D, since I could
| just run the process again. I don't think Postgres supports
| this feature. It also seemed to be calling fsync much more
| than once per transaction.
| sgarland wrote:
| Maybe I don't understand what you're trying to do, but
| you can directly control how frequently commits occur.
|     BEGIN
|     INSERT ...  -- batch of N size
|     COMMIT AND CHAIN
|     INSERT ...
| PaulDavisThe1st wrote:
| Chance of Postgres commit mapping 1:1 onto posix fsync or
| equivalent: slim.
| azlev wrote:
| commit_delay
|
| https://www.postgresql.org/docs/current/runtime-config-
| wal.h...
| morningsam wrote:
| Looking through the options listed under "Non-Durable
| Settings", [1] I guess synchronous_commit = off fits the
| bill?
|
| [1]: https://www.postgresql.org/docs/current/non-
| durability.html
| stefanha wrote:
| The Linux RWF_DSYNC flag sets the Force Unit Access (FUA) bit
| in write requests. This can be used instead of fdatasync(2)
| in some cases. It only syncs a specific write request instead
| of the entire disk write cache.
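|
| For example, with pwritev2() (a sketch; error handling omitted):
|
|     #define _GNU_SOURCE
|     #include <sys/uio.h>
|
|     /* Write one WAL record and make that specific write durable
|        (data plus whatever metadata is needed to read it back),
|        without flushing the entire write cache the way a separate
|        fdatasync(2) would. */
|     ssize_t write_record_dsync(int fd, const void *buf,
|                                size_t len, off_t off) {
|         struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
|         return pwritev2(fd, &iov, 1, off, RWF_DSYNC);
|     }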
| zozbot234 wrote:
| You should prefer RWF_SYNC in case the write involves
| changes to the file metadata (For example, most append
| operations will alter the file size).
| LtdJorge wrote:
| Not really, RWF_DSYNC is equivalent to open(2) with O_DSYNC when
| writing which is equivalent to write(2) followed by fdatasync(2)
| and:
|
|     fdatasync() is similar to fsync(), but does not flush modified
|     metadata unless that metadata is needed in order to allow a
|     subsequent data retrieval to be correctly handled. For
|     example, changes to st_atime or st_mtime (respectively, time
|     of last access and time of last modification; see inode(7))
|     do not require flushing because they are not necessary for a
|     subsequent data read to be handled correctly. On the other
|     hand, a change to the file size (st_size, as made by say
|     ftruncate(2)), would require a metadata flush.
| stefanha wrote:
| Agreed, when metadata changes are involved then RWF_SYNC
| _must_ be used.
|
| RWF_DSYNC is sufficient and faster when data is
| overwritten without metadata changes to the file.
| vlovich123 wrote:
| No that's incorrect. File size changes caused by append
| are covered by fdatasync in terms of durability
| guarantees.
| ImPostingOnHN wrote:
| Some applications, like Apache Kafka, don't immediately fsync
| every write. This lets the kernel batch writes and also
| linearize them, both adding speed. Until synced, the data
| exists only in the linux page cache.
|
| To deal with the risk of data loss, multiple such servers are
| used, with the hope that if one server dies before syncing,
| another server to which the data was replicated, performs an
| fsync _without_ failure.
| to11mtm wrote:
| I feel like you can try to FAFO with that on a distributed
| log like Kafka (although also... eww, but also I wonder
| whether NATS does the same thing or not...)
|
| I would think for something like a database, at _most_ you'd
| want to have something like the io_uring_prep_fsync
| others mentioned with flags set to just not update the
| metadata.
|
| To be clear, in my head I'm envisioning this case to be a
| WAL type scenario; in my head you can get away with just
| having a separate thread or threads pulling from WAL and
| writing to main DB files... but also I've never written a
| real database so maybe those thoughts are off base.
| osigurdson wrote:
| Suggest watching the TigerBeetle video link in the article.
| There they discuss bitrot, "fsync gate", how Postgres used
| fsync wrong for 30 years, etc. It is very interesting even as
| pure entertainment.
| jorangreef wrote:
| Thanks! Great to hear you enjoyed our talk. Most of it is
| simply putting the spotlight on UW-Madison's work on storage
| faults.
|
| Just to emphasize again that this blog post here is really
| quite different, since it does not fsync and breaks
| durability.
|
| Not what we do in TigerBeetle or would recommend or
| encourage.
|
| See also: https://news.ycombinator.com/item?id=44624065
| mhuffman wrote:
| Hi! I don't have a need for your products directly, but I
| was very intrigued when I saw TB's demo and talk on
| ThePrimeagen YT channel. I have been developing software for
| a looooong time and it was a breath of fresh air in a sea
| of startups to see a company champion optimization, speed,
| and security without going too deep in the weeds and
| slowing development. These days, that typically comes more
| as an afterthought or as a response to an incident. Or not
| at all. I would recommend any developer with an open mind
| to read this short document[0]. I have been integrating it
| into my own company's development practices with good
| results.
|
| [0]https://github.com/tigerbeetle/tigerbeetle/blob/main/doc
| s/TI...
| jorangreef wrote:
| Appreciate your taking the time to write these kind
| words. Great to hear that TigerStyle has been making an
| impact on your company's developer practices!
| ajross wrote:
| > There's no getting around fsync if you want to be sure that
| your data is really on the storage medium.
|
| That's not correct; io_uring supports O_DIRECT write requests
| just fine. Obviously bypassing the cache isn't the same as just
| flushing it (which is what fsync does), so there are design
| impacts.
|
| But database engines are absolutely the target of io_uring's
| feature set and they're expected to be managing this
| complexity.
| zozbot234 wrote:
| That's not what O_DIRECT is for. Did you mean O_SYNC ?
| codys wrote:
| > But database engines are absolutely the target of
| io_uring's feature set and they're expected to be managing
| this complexity.
|
| io_uring includes an fsync opcode (with range support). When
| folks talk about fsync generally here, they're not saying that
| io_uring is unusable; they're saying that they'd expect
| fsync to be used, whether via the io_uring opcode, the
| system call, or some other mechanism yet to be created.
| jandrewrogers wrote:
| O_DIRECT is not a substitute for fsync(). It only guarantees
| that data gets to the storage device cache, which is not
| durable in most cases.
| somat wrote:
| My understanding is that the storage device cache is
| opaque, that is, drives tend to lie, saying the write is
| done when it is in cache, and depend on having enough
| internal power capacity to flush on power loss.
| loeg wrote:
| Consumer devices sometimes lie (enterprise products less
| so), but there is a distinction between O_DIRECT and
| actual fsync at the protocol layer (e.g., in NVMe, fsync
| maps into a Flush command).
| quietbritishjim wrote:
| If that's true (notwithstanding objections from sibling
| comments), then that's just another spelling of fsync.
|
| My point was really: you can't magically get the performance
| benefits of omitting fsync (or functional equivalent) while
| still getting the durability guarantees it gives.
| codys wrote:
| > > you lose the durability guarantee that makes databases
| useful. ... the data might still be sitting in kernel buffers,
| not yet written to stable storage.
|
| > No! That's because you stopped using fsync. It's nothing to
| do with your code being async.
|
| From that section, it sounds like OP was tossing data into the
| io_uring submission queue and calling it "done" at that point
| (ie: not waiting for the io_uring completion queue to have the
| completion indicated). So yes, fsync is needed, but they
| weren't even waiting for the kernel to start the write before
| indicating success.
|
| I think to some extent things have been confused because
| io_uring has a completion concept, but OP also has a separate
| completion concept in their dual wal design (where the second
| WAL they call the "completion" WAL).
|
| But I'm not sure if OP really took away the right understanding
| from their issues with ignoring io_uring completions, as they
| then create a 5 step procedure that adds one check for an
| io_uring completion, but still omits another.
|
| > 1. Write intent record (async)
|
| > 2. Perform operation in memory
|
| > 3. Write completion record (async)
|
| > 4. Wait for the completion record to be written to the WAL
|
| > 5. Return success to client
|
| Note the lack of waiting for the io_uring completion of the
| intent record (and yes, there's still not any reference to
| fsync or alternates, which is also wrong). There is no ordering
| guarantee between independent io_urings (OP states they're
| using separate io_uring instances for each WAL), and even in
| the same io_uring there is limited ordering around completions
| (IOSQE_IO_LINK exists, but doesn't allow traversing submission
| boundaries, so it won't work here because OP submits the work at
| separate times. They'd need to use IOSQE_IO_DRAIN, which seems
| like it would effectively serialize their writes, which is why
| it seems like OP would need to actually wait for completion of
| the intent write).
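|
| For the ack path, I'd expect something more along these lines (a
| sketch with liburing, assuming one ring per WAL as OP describes;
| error handling, batching, and short-write handling omitted):
|
|     #include <liburing.h>
|
|     /* Reap one CQE and return its result (>= 0 ok, -errno on error). */
|     static int wait_one(struct io_uring *ring) {
|         struct io_uring_cqe *cqe;
|         int ret = io_uring_wait_cqe(ring, &cqe);
|         if (ret == 0) {
|             ret = cqe->res;
|             io_uring_cqe_seen(ring, cqe);
|         }
|         return ret;
|     }
|
|     /* Append a record and make it durable: a write linked to an
|        fsync, then wait for both CQEs. */
|     static int durable_append(struct io_uring *ring, int fd,
|                               const void *buf, unsigned len, off_t off) {
|         struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
|         io_uring_prep_write(sqe, fd, buf, len, off);
|         io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); /* fsync after write */
|
|         sqe = io_uring_get_sqe(ring);
|         io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
|
|         io_uring_submit(ring);
|         int w = wait_one(ring); /* write CQE */
|         int f = wait_one(ring); /* fsync CQE */
|         return (w < 0) ? w : f;
|     }
|
|     /* Ack only after BOTH the intent and completion records (and
|        their fsyncs) have completed, on their respective rings. */
|     int ack_when_durable(struct io_uring *intent_ring, int intent_fd,
|                          const void *irec, unsigned ilen, off_t ioff,
|                          struct io_uring *done_ring, int done_fd,
|                          const void *drec, unsigned dlen, off_t doff) {
|         if (durable_append(intent_ring, intent_fd, irec, ilen, ioff) < 0)
|             return -1;
|         if (durable_append(done_ring, done_fd, drec, dlen, doff) < 0)
|             return -1;
|         return 0; /* safe to send success to the client now */
|     }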
| cryptonector wrote:
| Correct, TFA needs to wait for the completion of _all_ writes
| to the WAL, which is what `fsync()` was doing. Waiting only
| for the completion of the "completion record" does not ensure
| that the "intent record" made it to the WAL. In the event of
| a power failure it is entirely possible that the intent
| record did not make it but the completion record did, and
| then on recovery you'll have to panic.
| codys wrote:
| Yes, but I suspect there might be some confusion by the
| author and others between "io_uring completion of a write"
| (ie: io_uring sends its completion queue event that
| corresponds to a previous submission queue event) and
| "fsync completion" (as you've put as "completion of all
| writes", though note that fsync the api is fd scoped and
| the io_uring operation for fsync has file range support).
|
| The CQEs on a write indicate something different compared
| to the CQE of an fsync operation on that same range.
| osigurdson wrote:
| I've watched the TigerBeetle talk (YouTube link in the article).
| This is very interesting even for those not in the space.
| demaga wrote:
| I feel like writing asynchronously to a WAL defeats its purpose.
| jasonthorsness wrote:
| Is the underlying NVME storage interface the kernel/drivers get
| to use cleaner/simpler than the Linux abstractions? Or does it
| get more complicated? Sometimes I wonder if certain high-
| performance applications would be better off running as special-
| purpose unikernels unburdened by interfaces designed for older
| generations of technology.
| loeg wrote:
| Also an option with io_uring:
| https://www.usenix.org/conference/fast24/presentation/joshi
|
| (We use it at work in a network object storage service in
| order to use the underlying NVMe T10-DIF[1], which isn't
| exposed nicely by conventional POSIX/Linux interfaces.)
|
| Ultimately, having a full, ~normal Linux stack around makes
| system management / orchestration easier. And programs other
| than our specialized storage software can still access other
| partitions, etc.
|
| [1]: https://en.wikipedia.org/wiki/Data_Integrity_Field
| eatonphil wrote:
| From the title I was hoping this would be a survey of databases
| using io_uring, since there've been quips on the internet (here,
| twitter, etc) that no one uses io_uring in production. In my
| brief search TigerBeetle (and maybe Turso's Limbo) was the only
| database in production that I remember doing io_uring (by
| default). Some other databases had it as an option but didn't
| seem to default to it.
|
| If anyone else feels like doing this survey and publishing the
| results I'd love to see it.
| jtregunna wrote:
| Update:
|
| I updated the post based on the conversation below. I wholly
| missed an important callout about performance, and wasn't super
| clear that you do need to wait for the completion record to be
| written before responding to the client. That was implied by the
| completion-record write coming before the response, but I made
| it clearer to avoid confusion.
|
| Also, the dual-WAL approach is worse for latency unless you can
| amortize the double write over multiple async writes, so the cost
| is spread across the batch; when the batch size is closer to 1,
| the cost is higher.
| gpderetta wrote:
| How can you know that the completion record is written to disk?
| codys wrote:
| From the update added to the post:
|
| > This is tracked through io_uring's completion queue - we only
| send a success response after receiving confirmation that the
| completion record has been persisted to stable storage.
|
| Which completion queue event(s) are you examining here? I ask
| because the way this is worded makes it sound like you're
| waiting solely for the completion queue event for the _write_
| to the "completion wal".
|
| Doing that (waiting only on the "completion wal" write CQE)
|
| 1. doesn't ensure that the "intent wal" has been written
| (because it's a different io_uring and a different submission
| queue event used to do the "intent wal" write from the
| "completion wal" write), and
|
| 2. doesn't indicate the "intent wal" data or the "completion
| wal" data has made it to durable storage (one needs fsync for
| that, the completion queue events for writes don't make that
| promise. The CQE for an fsync opcode would indicate that data
| has made it to durable storage if the fsync has the right
| ordering wrt the writes and refers to the appropriate fd and
| data ranges. Alternatively, there are some flags that have the
| effect of implying an fsync following a write that could be
| used, but those aren't mentioned)
| ptrwis wrote:
| For some background: right now it is a single guy, paid by
| Microsoft, working on implementing async direct I/O for
| PostgreSQL (github.com/anarazel).
| lstroud wrote:
| About 10ish years ago, I ended up finding a deadlock in the Linux
| raid driver when turning on Oracle's async writes with raid10 on
| lvm on AWS. I traced it to the ring buffers the author mentioned,
| but ended up having to remove lvm (since it wasn't that necessary
| on this infrastructure) to get the throughput I needed.
| BeeOnRope wrote:
| What is the point of the intent entry at all? It seems like
| operations are only durable after the completion record is
| written so the intent record seems to serve no purpose (unless it
| is say much larger).
| sethev wrote:
| There's some faulty reasoning in this post. Without the code,
| it's hard to pin down exactly where things went wrong.
|
| These are the steps described in the post:
|
|     1. Write intent record (async)
|     2. Perform operation in memory
|     3. Write completion record (async)
|     4. Wait for the completion record to be written to the WAL
|     5. Return success to client
|
| If 4 is done correctly then 3 is not needed - it can just wait
| for the intent to be durable before replying to the client.
| Perhaps there's a small benefit to speculatively executing the
| operation before the WAL is committed - but I'm skeptical and my
| guess is that 4 is not being done correctly. The author added an
| update to the article:
|
| > This is tracked through io_uring's completion queue - we only
| send a success response after receiving confirmation that the
| completion record has been persisted to stable storage
|
| This makes it sound like he's submitting write operations for the
| completion record and then misinterpreting the completion queue
| for those writes as "the record is now in durable storage".
| jeffbee wrote:
| What's baffling to me about this post is that anyone would
| believe that io_uring was even capable of speeding up this
| workload by 10x. Unless your profile suggests that syscall entry
| is taking > 90% of your CPU time, that is impossible. The only
| thing io_uring can do for you is reduce your syscall count, so
| the upper bound of its utility is whatever you are currently
| spending on sysenter/exit.
| loeg wrote:
| You could also imagine it hiding write latency by allowing a
| very naive single-threaded application to do IOs concurrently,
| overlapped in time, instead of serialized. (But a threadpool
| would do much the same thing.)
| gpderetta wrote:
| io_uring could allow for better throughput by simply having
| multiple operations in flight, allowing for better I/O
| scheduling.
|
| But yes, this specific case seems to be a misunderstanding of
| what io_uring write completion means.
|
| You would expect that they would have tested recovery by at
| least simulating system stops immediately after IO
| completion notification.
|
| Unless they are truly using asynchronous O_SYNC writes and are
| just bad at explaining it.
| misiek08 wrote:
| 1. Write intent. 2. Don't treat the intent write as success. 3.
| Report success on completion of a different operation.
|
| While restoring: 1. Ignore all intents. 2. Use only the other
| operations that have corresponding intents.
|
| I think this article introduces so much chaos, with a lot of
| "almost"-helpful info on io_uring, that it ultimately hurts the
| tech. io_uring IMHO lacks clean and simple examples, and here we
| again have some badly explained theories instead of meat.
___________________________________________________________________
(page generated 2025-07-20 23:01 UTC)