[HN Gopher] A write-ahead log is not a universal part of durability
___________________________________________________________________
A write-ahead log is not a universal part of durability
Author : todsacerdoti
Score : 114 points
Date : 2024-07-01 11:44 UTC (11 hours ago)
(HTM) web link (notes.eatonphil.com)
(TXT) w3m dump (notes.eatonphil.com)
| eatonphil wrote:
| One of the interesting things I came up against while writing
| this post was the pretty common misconception that SQLite and
| Postgres will validate your data with checksums [0]. SQLite
| leaves this to an optional extension (this is a little more
| commonly known). Here was an HN comment I stumbled on about
| Postgres [1]:
|
| > Postgres (and other databases that actually care about data
| integrity) have per page checksums:
|
| Postgres does have support for data checksumming (but not
| metadata checksumming) but defaults to _disabling_ data
| checksumming. MongoDB (WiredTiger), on the other hand, defaults
| to checksumming. I was told after publishing this post that
| MySQL (InnoDB) also checksums by default, but I did not check
| that in my survey.
|
| Links for proof are in the post.
|
| [0] https://x.com/eatonphil/status/1807572135340134687
|
| [1] https://news.ycombinator.com/item?id=25231308
| wongarsu wrote:
| That's interesting. [1] seems to be the relevant page in the
| postgres documentation.
|
| tl;dr: run `SHOW data_checksums` to see if checksums are
| enabled (they were enabled by default on our GCP Cloud SQL
| instance); to enable them, shut down Postgres and run
| `pg_checksums --enable --progress`
|
| 1: https://www.postgresql.org/docs/current/checksums.html
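|
| A minimal sketch of checking the same thing from application
| code (assumes the psycopg2 driver and a reachable Postgres;
| the connection string is illustrative):
|
|     import psycopg2
|
|     # "SHOW data_checksums" reports whether page checksums were
|     # enabled at initdb time (or later via pg_checksums).
|     conn = psycopg2.connect("dbname=postgres")
|     with conn.cursor() as cur:
|         cur.execute("SHOW data_checksums")
|         print("data_checksums =", cur.fetchone()[0])  # 'on'/'off'
|     conn.close()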
| dboreham wrote:
| Database engines that live as libraries inside application
| processes (like WT) have a more pressing need for checksums
| because the most common attack vector is buggy pointer
| arithmetic outside the library that stomps on data in memory
| randomly.
| whizzter wrote:
| Checksumming is also useful when storage goes bad, or against
| other random hardware corruption such as cosmic-ray bit flips
| or overheating/aging computers.
| pixl97 wrote:
| This reminds me of the time I had a disk go 'random' in a
| RAID. The disk didn't fail; instead it just returned junk for
| every Nth read. It happened to be in a big database server,
| too. The crazy part is that it took far longer than I would
| have expected to notice corrupted data showing up in user
| applications, so an entire 28 hours of data had to be rolled
| back.
| slaymaker1907 wrote:
| I've come around to the idea that checksumming isn't some
| universal panacea. Once you have checksums, you need to start
| thinking about error *recovery* and not just detection. What
| happens to linked data? Do you just declare the whole SQLite
| file corrupted and just throw up your hands?
|
| If you have backups, that's one obvious solution, but people
| might not have uncorrupted backups.
|
| I still think you should do it and emit a warning about it at
| the very least, but it's not trivial to handle.
| throw0101b wrote:
| Just use ZFS. :)
| fdr wrote:
| yeah, that one is a path dependency issue, re: postgres. I'm
| not sure if it's well reasoned past inertia after the initial
| "it's new, let's not throw the entire user base into it at
| once", but at the very least, it will complicate pg_upgrade
| somewhat. WiredTiger didn't really have that path dependency,
| being itself new to Mongo to shore up the storage situation.
|
| It's probably about time to swap the default, but some people's
| once-working pg_upgrade programs that they haven't looked at in
| a while might break. Probably okay; those things need to
| happen...once in a while. I suppose some people resent the
| overhead of Postgres checksumming atop their ZFS/btrfs/dm-
| integrity/whatever stacks, but they are somewhat rarer.
| sgeisenh wrote:
| In the pseudocode snippet that introduces a WAL, the semaphores
| are signaled before the fsync occurs. This seems like a mistake
| but maybe there's a good reason to do this that I am missing?
|
| Edit: it looks like this is the case in the first batched fsync
| example, as well.
| eatonphil wrote:
| Good catch! That was a think-o. I've swapped it around so that
| semaphores are signalled after the fsync.
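|
| For anyone skimming the thread, a minimal sketch of the
| corrected ordering (illustrative Python in the spirit of the
| post's pseudocode, not the post's actual code):
|
|     import os
|     import threading
|
|     class GroupCommitLog:
|         def __init__(self, path):
|             flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT
|             self.fd = os.open(path, flags)
|             self.lock = threading.Lock()
|             self.pending = []  # (record, semaphore) awaiting fsync
|
|         def append(self, record: bytes) -> None:
|             sem = threading.Semaphore(0)
|             with self.lock:
|                 self.pending.append((record, sem))
|             sem.acquire()  # block until our batch is durable
|
|         def flush_batch(self) -> None:
|             # Run by a background thread: write the batch, fsync,
|             # and only *then* wake the waiting writers.
|             with self.lock:
|                 batch, self.pending = self.pending, []
|             for record, _ in batch:
|                 os.write(self.fd, record)
|             os.fsync(self.fd)   # make the batch durable first...
|             for _, sem in batch:
|                 sem.release()   # ...then signal the semaphores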
| sgeisenh wrote:
| Thanks for the quick response, and for the nice write up. I
| thoroughly enjoy the python-like pseudocode, it was the main
| reason I was able to pick that out reasonably quickly!
| AdamProut wrote:
| Maybe good to mention torn pages somewhere too? Both MySQL and
| Postgres jump through some hoops to both detect and repair
| them [1][2]. So even in the scenario in the post where fsync
| is used to harden writes, the database still needs to handle
| torn pages (or it requires a file system / storage device that
| guarantees atomic page writes at the page size the database
| uses, as several managed/cloud databases do).
|
| [1] https://wiki.postgresql.org/wiki/Full_page_writes [2]
| https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-b...
| eatonphil wrote:
| Thanks Adam! I think torn writes would still be caught via
| checksums, no? Although that may be later than you'd wish.
|
| I'm not confident but from reading that page it seems that for
| Postgres at least, if it did do checksums it might not need to
| count on page-level atomic writes?
| AdamProut wrote:
| Checksums can detect a torn page, but not always repair it.
| It's likely a good part of the database page is gone (i.e.,
| an amount of data that matches the disk / file system atomic
| write unit size is probably missing). Torn page writes are a
| pretty common scenario too, so databases need to be able to
| fully recover from them - not just detect them and report a
| corruption (i.e., just pull the power plug from the machine
| during a heavy write workload and you're likely to get one -
| it doesn't require a cosmic ray to flip a bit :) ).
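|
| A minimal sketch of the detect-vs-repair distinction (loosely
| modeled on a doublewrite-style scheme, not InnoDB's or
| Postgres's actual on-disk format; all names are illustrative):
|
|     import os
|     import zlib
|
|     PAGE_SIZE = 8192
|
|     def seal(payload: bytes) -> bytes:
|         # Prefix the page with a CRC so a torn write is detectable.
|         assert len(payload) == PAGE_SIZE - 4
|         return zlib.crc32(payload).to_bytes(4, "little") + payload
|
|     def is_torn(page: bytes) -> bool:
|         stored = int.from_bytes(page[:4], "little")
|         return stored != zlib.crc32(page[4:])
|
|     def write_page(data_fd, dw_fd, page_no, payload):
|         page = seal(payload)
|         # 1. Durably write a full copy to the scratch area first.
|         os.pwrite(dw_fd, page, 0)
|         os.fsync(dw_fd)
|         # 2. Only then overwrite the real page in place.
|         os.pwrite(data_fd, page, page_no * PAGE_SIZE)
|         os.fsync(data_fd)
|
|     def recover_page(data_fd, dw_fd, page_no):
|         page = os.pread(data_fd, PAGE_SIZE, page_no * PAGE_SIZE)
|         if is_torn(page):
|             # The in-place write may have been torn, but the scratch
|             # copy was durable before it started, so restore from it
|             # (a real implementation also checksums the scratch copy).
|             good = os.pread(dw_fd, PAGE_SIZE, 0)
|             os.pwrite(data_fd, good, page_no * PAGE_SIZE)
|             os.fsync(data_fd)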
| eatonphil wrote:
| That's fair. In the post I did mention disk redundancy (and
| I guess I only implied recovery) as one additional level
| for safety. Which I think is what you're getting at too.
| mattashii wrote:
| Disk redundancy won't help guarantee torn page protection
| if the writes across the redundant disks are not
| coordinated to have one start after the other finishes,
| such that there is always one copy of the page that is not
| currently being written. So writing to a RAID1 array won't
| help here without knowledge of how that RAID1's writes
| work.
| londons_explore wrote:
| Worth noting that there are various tricks to combine the
| write-ahead log and the rest of the on-disk data, meaning the
| vast majority of data will not be written more than once, and
| you'll still get the durability and data locality benefits of
| a WAL.
| 7e wrote:
| Could you elaborate? Do you mean an LSM tree or log structured
| filesystem?
| simonz05 wrote:
| LSM trees also need a log, right?
| valyala wrote:
| LSM trees do not need a write-ahead log in the general case:
|
| - When new data arrives, it is converted to an SSTable, which
| is then stored to disk in an atomic manner before 'success' is
| returned to the client that wrote the data. If the computer
| crashes in the middle of the write, no problem - the partially
| written SSTable is dropped on database start, since it isn't
| registered in the database yet.
|
| - When the computer crashes in the middle of a background
| merge of smaller SSTables into bigger ones, there's also no
| problem - the source SSTables are still available after the
| database restarts, while the partially written output SSTable
| can be safely dropped on database restart.
|
| VictoriaMetrics and VictoriaLogs use LSM trees without a WAL,
| while providing good durability levels. They can lose
| recently ingested metrics or logs on a server crash, if they
| weren't yet converted to SSTables and written to disk. But
| this is a very good tradeoff compared to data corruption or a
| failed WAL replay in other systems, which use the WAL in
| improper ways - https://valyala.medium.com/wal-
| usage-looks-broken-in-modern-... .
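|
| A minimal sketch of what "stored to disk in an atomic manner"
| can look like (illustrative Python, not any particular
| database's actual code): write the SSTable under a temporary
| name, fsync it, then rename it into place and fsync the
| directory, so a crash leaves either no table or a complete one.
|
|     import os
|
|     def write_sstable_atomically(dirpath, name, data: bytes):
|         tmp = os.path.join(dirpath, name + ".tmp")
|         final = os.path.join(dirpath, name)
|         fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
|         try:
|             os.write(fd, data)
|             os.fsync(fd)        # table contents are durable
|         finally:
|             os.close(fd)
|         os.rename(tmp, final)   # atomically "register" the table
|         dfd = os.open(dirpath, os.O_RDONLY)
|         try:
|             os.fsync(dfd)       # make the rename itself durable
|         finally:
|             os.close(dfd)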
| awitt wrote:
| LSM-trees do need a WAL. The entire idea of LSM-trees is
| that writes are buffered in memory and written out all at
| once. But a particular write doesn't wait for the
| memtable to be flushed. For that reason you still need a
| WAL (there is committed state in memory).
|
| See
| https://research.facebook.com/publications/optimizing-
| space-... and https://www.cs.umb.edu/~poneil/lsmtree.pdf.
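|
| A minimal sketch of why the WAL is needed in that design
| (illustrative Python, not any real engine's API): each write
| is appended to the WAL and fsynced before it is acknowledged,
| while the memtable is only flushed to an SSTable much later.
|
|     import os
|
|     class TinyLSM:
|         def __init__(self, wal_path):
|             flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT
|             self.wal = os.open(wal_path, flags)
|             self.memtable = {}  # committed state living in RAM
|
|         def put(self, key: bytes, value: bytes) -> None:
|             # Durability comes from the WAL, not the memtable.
|             os.write(self.wal, key + b"=" + value + b"\n")
|             os.fsync(self.wal)
|             self.memtable[key] = value  # ack only after the fsync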
| scottlamb wrote:
| I don't agree with your https://valyala.medium.com/wal-
| usage-looks-broken-in-modern-... article in at least a
| couple ways.
|
| One specific reason:
|
| > TimescaleDB relies on PostgreSQL's WAL mechanism, which
| puts data into WAL buffers in RAM and periodically
| flushes them to WAL file. This means that the data
| from unflushed WAL buffers is lost on power loss or on
| process crash.
|
| That links to the manpage which says "The contents of the
| WAL buffers are written out to disk at every transaction
| commit". Maybe there's a missing "TransactionDB only
| commits periodically" that makes the quote above true,
| but any suggestion that PostgreSQL does not guarantee
| durability of committed transactions out of the box is
| incorrect.
|
| A broader reason is: it talks about how WALs may be "lost
| / corrupted" before fsync. Then how the "write directly
| to SSTable" approach just loses recently added data, and
| "IMHO, recently written data loss on process crash has
| lower severity comparing to data corruption". But in
| general, I'd expect these databases to have a mechanism
| by which they don't apply a corrupted WAL (typically
| involving a strong checksum on WAL entries). So
| ultimately these are two ways of describing the same
| thing. If those databases really do apply corrupt/half-
| written/unflushed WAL entries and thus corrupt the
| previously committed data, yes, that's very interesting,
| but the smoking gun is missing. The article is either
| wrong or incomplete.
| pjdesno wrote:
| Unfortunately this post skips over the "atomicity" part of a
| write-ahead log.
|
| Assume you start with data on disk AAAAAAAA, read it into memory,
| and update it to BBBBBBBB, then write it back. If you crash in
| the middle, you might end up with BBBAAAAA, BBBBBBAA, or even
| some crazy interleaving. (at least for reasonable file sizes -
| note that the largest atomic write to many NVMe drives is 128K)
|
| If you ditch the in-memory BTree and write a direct-to-disk
| one, then with a lot of care (and maybe a bit of copy-on-
| write) you can make sure that each disk write leaves the
| database in a crash-consistent state, but that will cost
| multiple writes and fsyncs for any modification that splits
| or merges BTree nodes - every intermediate write, not just
| the final one, has to leave the database consistent.
|
| (for those of you old enough to remember ext2, it had the same
| problem. If you mounted it async and had a bad crash, the data on
| disk would be inconsistent - you'd lose data, so you'd vow to
| always mount your filesystem with synchronous writes so you'd
| never lose data again, then a few weeks later you'd get tired of
| the crappy performance and go back to async writes, until the
| next crash happened, etc. etc.)
|
| The advantage of a log is that it allows you to combine multiple
| writes to different parts of the database file into a single
| record, guaranteeing (after crash recovery if necessary) that
| either all changes happen or none of them do. It serves the same
| purpose as a mutex in multi-threaded code - if your invariants
| hold when you get the mutex, and you reestablish them before you
| drop it, everything will be fine. We'd all love to have a mutex
| that keeps the system from crashing, but failing that we can use
| a WAL record to ensure that we move atomically from one valid
| state to another, without worrying about the order of
| intermediate changes to the data structure.
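|
| A minimal sketch of such a combined record (illustrative;
| real WALs add LSNs, torn-write handling, and so on): several
| page changes are serialized into one record with a length and
| checksum, so recovery either replays the whole record or
| ignores it.
|
|     import os
|     import struct
|     import zlib
|
|     def append_record(log_fd, changes):
|         # changes = [(page_no, new_bytes), ...] for ONE update
|         parts = []
|         for page_no, data in changes:
|             parts.append(struct.pack("<II", page_no, len(data)))
|             parts.append(data)
|         body = b"".join(parts)
|         header = struct.pack("<II", len(body), zlib.crc32(body))
|         os.write(log_fd, header + body)
|         os.fsync(log_fd)  # the whole multi-page update is durable
|
|     def read_records(log_path):
|         # At recovery, yield only records whose checksum verifies;
|         # a torn tail record simply never counts as committed.
|         with open(log_path, "rb") as f:
|             while True:
|                 header = f.read(8)
|                 if len(header) < 8:
|                     return
|                 length, crc = struct.unpack("<II", header)
|                 body = f.read(length)
|                 if len(body) < length or zlib.crc32(body) != crc:
|                     return
|                 yield body  # caller parses (page_no, data) pairs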
| bjornsing wrote:
| For some data structures you get the atomicity for "free"
| though. For example, a log structured merge tree where writes
| are done in large transactions could well do without a WAL and
| still achieve atomicity.
| nyrikki wrote:
| Ever have an LSMT quit writing a run half way through? Ever
| need to try and restore Cassandra where mutation storage
| order isn't guaranteed, with the last mutation always
| winning?
|
| It is absolutely not "atomic" if you have ever had to do a
| recovery from SSTable snapshots.
| bjornsing wrote:
| > Ever have an LSMT quit writing a run half way through?
|
| That should be very easy to "solve" by just writing the
| length of the SSTable in a header.
|
| > Ever need to try and restore Cassandra where mutation
| storage order isn't guaranteed, with the last mutation
| always winning?
|
| This is more of a distributed systems challenge as I
| understand it: Cassandra is "multi-master" so can end up in
| a situation where transaction order is hard to sort out.
| valyala wrote:
| A good example of a database that relies on LSM tree
| properties to protect against data corruption on unclean
| shutdown is VictoriaMetrics [1].
|
| [1] https://valyala.medium.com/wal-usage-looks-broken-in-
| modern-...
| neal wrote:
| Good point!
|
| I'm not sure if there are any databases that do your 'with a
| lot of care' option, but for anyone curious about what that
| might look like in practice there are file systems that forgo
| write-ahead logging and maintain metadata consistency using
| techniques like soft updates[0] or copy-on-write up to a
| root[1].
|
| [0]: https://www.usenix.org/conference/1999-usenix-annual-
| technic... [1]:
| https://www.cs.hmc.edu/~rhodes/courses/cs134/fa20/readings/T...
| (yes, ZFS can be configured to use a WAL too for durability)
| refset wrote:
| > I'm not sure if there are any databases that do your 'with
| a lot of care' option
|
| LMDB springs to mind.
|
| > LMDB was designed to resist data loss in the face of system
| and application crashes. Its copy-on-write approach never
| overwrites currently-in-use data. Avoiding overwrites means
| the structure on disk/storage is always valid, so application
| or system crashes can never leave the database in a corrupted
| state.
|
| https://en.wikipedia.org/wiki/Lightning_Memory-
| Mapped_Databa...
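|
| A minimal sketch of that copy-on-write idea (far simpler than
| LMDB's real B+tree and double meta pages; names are
| illustrative): updated data is written to fresh space and a
| small root pointer is flipped last, so a crash at any point
| leaves the old, valid tree reachable.
|
|     import os
|     import struct
|
|     # Assumes the file starts with an 8-byte slot holding the
|     # offset of the current root.
|     def commit(fd, new_tree_bytes: bytes) -> None:
|         # 1. Append the new copy; live data is never overwritten.
|         end = os.lseek(fd, 0, os.SEEK_END)
|         os.write(fd, new_tree_bytes)
|         os.fsync(fd)                  # new copy is durable
|         # 2. Flip the root pointer in one small write at offset 0.
|         os.pwrite(fd, struct.pack("<Q", end), 0)
|         os.fsync(fd)                  # readers now see the new tree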
| jmix wrote:
| This is obvious if you took a DB course, and if you didn't, you
| have no business building a DB. Sadly, all the NoSQL junkware was
| built by people who didn't.
| zellyn wrote:
| I found the Developer Voices discussion[1][2] with Joran Dirk
| Greef, creator of TigerBeetle, to be fascinating. They mentioned
| that the rigorous correctness proofs that exist for Raft and
| Paxos assume an absence of disk faults, but that more modern
| theory includes fixes for that, as long as you marry the log
| writing and consensus algorithms together properly rather than
| keeping each as a separate black box.
|
| [1] https://podcasts.apple.com/us/podcast/databases-ambitions-
| an...
|
| [2] https://www.youtube.com/watch?v=ayG7ltGRRHs
| refset wrote:
| Specifically, that modern theory is "protocol-aware recovery" -
| e.g.
| https://www.usenix.org/conference/fast18/presentation/alagap...
| samatman wrote:
| I second this recommendation. I'm impressed by what TigerBeetle
| has accomplished, and would love to see more of the industry
| approach software engineering with their philosophy and rigour.
|
| The TigerStyle video is well worth watching for anyone who
| wants to build defect-free software systems, or recognizes that
| this is the correct thing to aim for.
| https://www.youtube.com/watch?v=w3WYdYyjek4
| beltsazar wrote:
| But the WAL is not for durability; it's for atomicity and
| consistency. And yes, you need to use fsync to ensure
| durability.
| anarazel wrote:
| Nitpick: if you're careful, fdatasync() can suffice.
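|
| For illustration (POSIX-only sketch; Python's os.fdatasync
| wraps fdatasync(2)): fdatasync flushes the file's data plus
| only the metadata needed to read it back, so it can skip e.g.
| an mtime-only update that fsync would also force out.
|
|     import os
|
|     fd = os.open("wal.log",
|                  os.O_WRONLY | os.O_APPEND | os.O_CREAT)
|     os.write(fd, b"commit record\n")
|     os.fdatasync(fd)  # data durable; may save one journal write
|     os.close(fd)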
| refset wrote:
| > durability is a spectrum
|
| This is the truth. My favourite example of this is MemoryDB's
| "multi-AZ (multi-datacenter) durability" for Valkey - there's a
| good write-up here
| https://brooker.co.za/blog/2024/04/25/memorydb.html
___________________________________________________________________
(page generated 2024-07-01 23:01 UTC)