[HN Gopher] A write-ahead log is not a universal part of durability
       ___________________________________________________________________
        
       A write-ahead log is not a universal part of durability
        
       Author : todsacerdoti
       Score  : 114 points
       Date   : 2024-07-01 11:44 UTC (11 hours ago)
        
 (HTM) web link (notes.eatonphil.com)
 (TXT) w3m dump (notes.eatonphil.com)
        
       | eatonphil wrote:
       | One of the interesting things I came up against while writing
       | this post was the pretty common misconception that SQLite and
       | Postgres will validate your data with checksums [0]. SQLite
       | leaves this to an optional extension (this is a little more
       | commonly known). Here was an HN comment I stumbled on about
       | Postgres [1]:
       | 
       | > Postgres (and other databases that actually care about data
       | integrity) have per page checksums:
       | 
       | Postgres does have support for data checksumming (but not
       | metadata checksumming) but defaults to _disabling_ data
       | checksumming. MongoDB (WiredTiger) on the other hand defaults to
        | checksumming. I was told after publishing this post that MySQL
        | (InnoDB) also checksums by default, but I did not verify that
        | in my survey.
       | 
       | Links for proof are in the post.
       | 
       | [0] https://x.com/eatonphil/status/1807572135340134687
       | 
       | [1] https://news.ycombinator.com/item?id=25231308
        
         | wongarsu wrote:
         | That's interesting. [1] seems to be the relevant page in the
         | postgres documentation.
         | 
          | tl;dr: run `SHOW data_checksums` to see if checksums are
          | enabled (they were enabled by default on our GCP Cloud SQL
          | instance). To enable them, shut down postgres and run
          | `pg_checksums --enable --progress`
         | 
         | 1: https://www.postgresql.org/docs/current/checksums.html
        
         | dboreham wrote:
         | Database engines that live as libraries inside application
         | processes (like WT) have a more pressing need for checksums
         | because the most common attack vector is buggy pointer
         | arithmetic outside the library that stomps on data in memory
         | randomly.
        
           | whizzter wrote:
           | Checksumming is also useful when storage goes bad or other
           | random hardware corruption like cosmic rays or
           | overheating/old computers.
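            | As a rough sketch of what per-page checksumming looks like
            | (hypothetical 4-byte CRC32 header; real engines use their
            | own page layouts and algorithms):

```python
import zlib

PAGE_SIZE = 8192   # hypothetical page size; Postgres uses 8 KiB by default
CRC_LEN = 4        # checksum stored in the first 4 bytes of each page

def seal_page(payload: bytes) -> bytes:
    # Prepend a CRC32 of the payload so corruption is detectable on read.
    assert len(payload) == PAGE_SIZE - CRC_LEN
    return zlib.crc32(payload).to_bytes(CRC_LEN, "big") + payload

def verify_page(page: bytes) -> bytes:
    # Recompute the checksum before trusting the payload.
    stored = int.from_bytes(page[:CRC_LEN], "big")
    payload = page[CRC_LEN:]
    if zlib.crc32(payload) != stored:
        raise IOError("page checksum mismatch: corruption detected")
    return payload
```

            | Any single flipped byte - buggy pointer arithmetic, bad
            | storage, cosmic rays - fails verification on the next read.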
        
             | pixl97 wrote:
             | This reminds me of the time I had a disk go 'random' in a
             | RAID. Disk didn't fail, instead it just returned junk for
             | every Nth read. Happened to be in a big database server
             | too. Crazy part is it took far longer than I would have
              | expected to catch data showing up corrupted in user
              | applications, so an entire 28 hours of data had to be
              | rolled back.
        
         | slaymaker1907 wrote:
         | I've come around to the idea that checksumming isn't some
         | universal panacea. Once you have checksums, you need to start
         | thinking about error *recovery* and not just detection. What
          | happens to linked data? Do you just declare the whole SQLite
          | file corrupted and throw up your hands?
         | 
         | If you have backups, that's one obvious solution, but people
         | might not have uncorrupted backups.
         | 
         | I still think you should do it and emit a warning about it at
         | the very least, but it's not trivial to handle.
        
           | throw0101b wrote:
           | Just use ZFS. :)
        
         | fdr wrote:
         | yeah, that one is a path dependency issue, re: postgres. I'm
         | not sure if it's well reasoned past inertia after the initial
         | "it's new, let's not throw the entire user base into it at
         | once", but at the very least, it will complicate pg_upgrade
         | somewhat. WiredTiger didn't really have that path dependency,
         | being itself new to Mongo to shore up the storage situation.
         | 
         | It's probably about time to swap the default, but some people's
         | once-working pg_upgrade programs that they haven't looked at in
         | a while might break. Probably okay; those things need to
          | happen...once in a while. I suppose some people resent the
          | overhead of Postgres checksumming atop their ZFS/btrfs/dm-
          | integrity/whatever stacks, but they are somewhat rarer.
        
       | sgeisenh wrote:
       | In the pseudocode snippet that introduces a WAL, the semaphores
       | are signaled before the fsync occurs. This seems like a mistake
       | but maybe there's a good reason to do this that I am missing?
       | 
       | Edit: it looks like this is the case in the first batched fsync
       | example, as well.
        
         | eatonphil wrote:
         | Good catch! That was a think-o. I've swapped it around so that
         | semaphores are signalled after the fsync.
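          | For reference, a sketch of the corrected ordering
          | (hypothetical names, not the post's actual pseudocode): each
          | committer waits on an event that the flusher sets only after
          | fsync() returns, so "commit returned" implies "record is on
          | disk".

```python
import os
import tempfile
import threading

class GroupCommitLog:
    """Hypothetical group-commit sketch: writers enqueue records and block;
    a flush writes the batch, fsyncs, and only then wakes the writers."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.lock = threading.Lock()
        self.pending = []  # (record, event) pairs awaiting durability

    def commit(self, record: bytes) -> None:
        durable = threading.Event()
        with self.lock:
            self.pending.append((record, durable))
        self.flush()        # a real system batches this on a background thread
        durable.wait()      # returns only once the record is actually on disk

    def flush(self) -> None:
        with self.lock:
            batch, self.pending = self.pending, []
        for record, _ in batch:
            os.write(self.fd, record + b"\n")
        os.fsync(self.fd)   # harden the whole batch with a single fsync
        for _, durable in batch:
            durable.set()   # signal durability *after* fsync

fd0, path = tempfile.mkstemp()
os.close(fd0)
log = GroupCommitLog(path)
log.commit(b"tx1")
log.commit(b"tx2")
```

          | With the old ordering, a crash in the window between signal
          | and fsync could lose a record the caller believed durable.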
        
           | sgeisenh wrote:
           | Thanks for the quick response, and for the nice write up. I
           | thoroughly enjoy the python-like pseudocode, it was the main
           | reason I was able to pick that out reasonably quickly!
        
       | AdamProut wrote:
       | Maybe good to mention torn pages somewhere too? Both MySQL and
       | Postgres jump through some hoops to both detect them and repair
        | them [1][2]. So even in the scenario in the post where fsync is
        | used to harden writes, the database still needs to handle torn
        | pages (or requires a file system \ storage that guarantees
        | atomic page writes at the page size the database is using, as
        | several managed\cloud databases do).
       | 
       | [1] https://wiki.postgresql.org/wiki/Full_page_writes [2]
       | https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-b...
        
         | eatonphil wrote:
         | Thanks Adam! I think torn writes would still be caught via
         | checksums, no? Although that may be later than you'd wish.
         | 
         | I'm not confident but from reading that page it seems that for
         | Postgres at least, if it did do checksums it might not need to
         | count on page-level atomic writes?
        
           | AdamProut wrote:
           | Checksums can detect a torn page, but not always repair them.
           | It's likely a good part of the database page is gone (i.e.,
           | an amount of data that matches the disk / file system atomic
           | write unit size is probably missing). Torn page writes are a
           | pretty common scenario too, so databases need to be able to
           | fully recover from them - not just detect them and report a
            | corruption (i.e., just pull the power plug from the machine
            | during a heavy write workload and you're likely to get one -
            | it doesn't require a cosmic ray to flip a bit :) ).
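            | A sketch of the doublewrite idea (hypothetical helper, not
            | InnoDB's actual implementation): the page is hardened in a
            | scratch area first, so a crash during the in-place
            | overwrite always leaves one intact copy to repair from.

```python
import os

def safe_overwrite_page(data_path: str, scratch_path: str,
                        offset: int, page: bytes) -> None:
    # 1. Persist the page out-of-place first.
    with open(scratch_path, "wb") as scratch:
        scratch.write(page)
        scratch.flush()
        os.fsync(scratch.fileno())   # scratch copy durable before overwrite
    # 2. Now overwrite in place. A crash here tears only the main copy;
    #    recovery re-copies the page from scratch_path if its checksum fails.
    with open(data_path, "r+b") as data:
        data.seek(offset)
        data.write(page)
        data.flush()
        os.fsync(data.fileno())
```

            | The cost is writing every page twice, which is why some
            | deployments disable it on storage with atomic page writes.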
        
             | eatonphil wrote:
             | That's fair. In the post I did mention disk redundancy (and
             | I guess I only implied recovery) as one additional level
             | for safety. Which I think is what you're getting at too.
        
               | mattashii wrote:
                | Disk redundancy won't guarantee torn page protection
                | if writes across the redundant disks are not
                | coordinated so that one starts only after the other
                | finishes, such that there is always one copy of the
                | page that is not currently being written. So writing
                | to a RAID1 array won't help here without knowledge of
                | how that RAID1's writes work.
        
       | londons_explore wrote:
       | Worth noting that there are various tricks to combine the write
       | ahead log and the rest of the on-disk data, meaning the vast
       | majority of data will not be written more than once, and you'll
       | still get the durability and data locality benefits of a WAL.
        
         | 7e wrote:
         | Could you elaborate? Do you mean an LSM tree or log structured
         | filesystem?
        
           | simonz05 wrote:
           | Lsm trees also need a log, right?
        
             | valyala wrote:
              | LSM trees do not need a write-ahead log in the general
              | case:
              | 
              | - When new data arrives, it is converted to an SSTable,
              | which is stored to disk in an atomic manner before
              | 'success' is returned to the client writing the data. If
              | the computer crashes in the middle of the write, no
              | problem - the partially written SSTable is dropped on
              | database start, since it isn't registered in the database
              | yet.
              | 
              | - When the computer crashes in the middle of a background
              | merge of smaller SSTables into bigger ones, there is also
              | no problem - the source SSTables are still available
              | after database restart, while the partially written
              | output SSTable can be safely dropped.
              | 
              | VictoriaMetrics and VictoriaLogs use LSM trees without a
              | WAL while providing good durability levels. They can lose
              | recently ingested metrics or logs on server crash if they
              | weren't yet converted to SSTables and written to disk.
              | But this is a very good tradeoff compared to data
              | corruption or a failed WAL replay in other systems that
              | use the WAL in improper ways - https://valyala.medium.com/wal-
              | usage-looks-broken-in-modern-... .
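              | The "stored to disk in an atomic manner" step is
              | typically the classic temp-file-plus-rename dance; a
              | sketch with a hypothetical helper (rename within a
              | filesystem is atomic on POSIX):

```python
import os
import tempfile

def write_sstable_atomically(dir_path: str, name: str, payload: bytes) -> str:
    # Write under a temp name, fsync, then rename into place. A crash
    # before the rename leaves only an unregistered *.tmp file that
    # startup can discard; readers never see a partial SSTable.
    fd, tmp = tempfile.mkstemp(dir=dir_path, suffix=".tmp")
    try:
        os.write(fd, payload)
        os.fsync(fd)                   # payload durable under the temp name
    finally:
        os.close(fd)
    final = os.path.join(dir_path, name)
    os.rename(tmp, final)              # atomic swap into the registered name
    dirfd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dirfd)                # make the rename itself durable
    finally:
        os.close(dirfd)
    return final
```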
        
               | awitt wrote:
               | LSM-trees do need a WAL. The entire idea of LSM-trees is
               | that writes are buffered in memory and written out all at
               | once. But a particular write doesn't wait for the
               | memtable to be flushed. For that reason you still need a
               | WAL (there is committed state in memory).
               | 
               | See
               | https://research.facebook.com/publications/optimizing-
               | space-... and https://www.cs.umb.edu/~poneil/lsmtree.pdf.
        
               | scottlamb wrote:
               | I don't agree with your https://valyala.medium.com/wal-
               | usage-looks-broken-in-modern-... article in at least a
               | couple ways.
               | 
               | One specific reason:
               | 
               | > TimescaleDB relies on PostgreSQL's WAL mechanism, which
               | puts data into WAL buffers in RAM and periodically
                | flushes them to WAL file. This means that the data
               | from unflushed WAL buffers is lost on power loss or on
               | process crash.
               | 
               | That links to the manpage which says "The contents of the
               | WAL buffers are written out to disk at every transaction
               | commit". Maybe there's a missing "TransactionDB only
               | commits periodically" that makes the quote above true,
               | but any suggestion that PostgreSQL does not guarantee
               | durability of committed transactions out of the box is
               | incorrect.
               | 
               | A broader reason is: it talks about how WALs may be "lost
               | / corrupted" before fsync. Then how the "write directly
               | to SSTable" approach just loses recently added data, and
               | "IMHO, recently written data loss on process crash has
               | lower severity comparing to data corruption". But in
               | general, I'd expect these databases to have a mechanism
               | by which they don't apply a corrupted WAL (typically
               | involving a strong checksum on WAL entries). So
               | ultimately these are two ways of describing the same
               | thing. If those databases really do apply corrupt/half-
               | written/unflushed WAL entries and thus corrupt the
               | previously committed data, yes, that's very interesting,
               | but the smoking gun is missing. The article is either
               | wrong or incomplete.
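                | A sketch of that mechanism (hypothetical framing, but
                | most WALs use something like it): each record carries a
                | length and a checksum, and replay stops at the first
                | record that doesn't verify instead of applying a torn
                | tail to previously committed data.

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    # [length][crc32][payload] framing, little-endian 4-byte fields.
    return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

def replay(log: bytes):
    # Yield intact records; stop at the first torn/corrupt one rather
    # than applying garbage.
    offset = 0
    while offset + 8 <= len(log):
        length, crc = struct.unpack_from("<II", log, offset)
        payload = log[offset + 8 : offset + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                      # torn or corrupt tail: ignore it
        yield payload
        offset += 8 + length

log = encode_record(b"tx1") + encode_record(b"tx2")
torn = log + encode_record(b"tx3")[:-1]   # crash mid-write of the last record
assert list(replay(torn)) == [b"tx1", b"tx2"]
```

                | So an unflushed or half-written tail degrades into "the
                | last commit is lost", the same failure mode as the
                | write-directly-to-SSTable approach.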
        
       | pjdesno wrote:
       | Unfortunately this post skips over the "atomicity" part of a
       | write-ahead log.
       | 
       | Assume you start with data on disk AAAAAAAA, read it into memory,
       | and update it to BBBBBBBB, then write it back. If you crash in
       | the middle, you might end up with BBBAAAAA, BBBBBBAA, or even
       | some crazy interleaving. (at least for reasonable file sizes -
       | note that the largest atomic write to many NVMe drives is 128K)
       | 
       | If you ditch the in-memory BTree and write a direct-to-disk one,
       | with a lot of care (and maybe a bit of copy-on-write) you can
       | make sure that each disk write leaves the database in a crash-
       | consistent state, but that will cost multiple writes and fsyncs
       | for any database modifications that split or merge BTree nodes -
       | you have to ensure that each write leaves the database in a
       | consistent state.
       | 
       | (for those of you old enough to remember ext2, it had the same
       | problem. If you mounted it async and had a bad crash, the data on
       | disk would be inconsistent - you'd lose data, so you'd vow to
       | always mount your filesystem with synchronous writes so you'd
       | never lose data again, then a few weeks later you'd get tired of
       | the crappy performance and go back to async writes, until the
       | next crash happened, etc. etc.)
       | 
       | The advantage of a log is that it allows you to combine multiple
       | writes to different parts of the database file into a single
       | record, guaranteeing (after crash recovery if necessary) that
       | either all changes happen or none of them do. It serves the same
       | purpose as a mutex in multi-threaded code - if your invariants
       | hold when you get the mutex, and you reestablish them before you
       | drop it, everything will be fine. We'd all love to have a mutex
       | that keeps the system from crashing, but failing that we can use
       | a WAL record to ensure that we move atomically from one valid
       | state to another, without worrying about the order of
       | intermediate changes to the data structure.
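        | A toy illustration of that all-or-nothing property
        | (hypothetical names; real logs deal in pages and bytes, not
        | dicts): several updates travel in one checksummed commit
        | record, and recovery applies them together or not at all.

```python
import json
import zlib

def make_commit_record(updates: dict) -> bytes:
    # Bundle several page updates into one record; they become durable
    # together or not at all.
    body = json.dumps(updates).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body

def recover(record: bytes, pages: dict) -> dict:
    # Apply the whole record, or nothing if it didn't fully reach disk.
    crc, body = record[:4], record[4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        return pages                  # incomplete record: keep the old state
    pages = dict(pages)
    pages.update(json.loads(body))    # all updates land atomically
    return pages

pages = {"p1": "AAAA", "p2": "AAAA"}
rec = make_commit_record({"p1": "BBBB", "p2": "BBBB"})
```

        | A torn record leaves the database at the previous valid state -
        | never BBBAAAAA.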
        
         | bjornsing wrote:
         | For some data structures you get the atomicity for "free"
         | though. For example, a log structured merge tree where writes
         | are done in large transactions could well do without a WAL and
         | still achieve atomicity.
        
           | nyrikki wrote:
           | Ever have an LSMT quit writing a run half way through? Ever
           | need to try and restore Cassandra where mutation storage
           | order isn't guaranteed, with the last mutation always
           | winning?
           | 
           | It is absolutely not "atomic" if you have ever had to do a
           | recovery from SSTable snapshots.
        
             | bjornsing wrote:
             | > Ever have an LSMT quit writing a run half way through?
             | 
             | That should be very easy to "solve" by just writing the
             | length of the SSTable in a header.
             | 
             | > Ever need to try and restore Cassandra where mutation
             | storage order isn't guaranteed, with the last mutation
             | always winning?
             | 
             | This is more of a distributed systems challenge as I
             | understand it: Cassandra is "multi-master" so can end up in
             | a situation where transaction order is hard to sort out.
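              | A sketch of that length-in-header approach (hypothetical
              | framing): a half-written run is detected the moment the
              | file is opened.

```python
import struct

HEADER = struct.Struct("<Q")  # 8-byte little-endian length prefix

def frame_sstable(payload: bytes) -> bytes:
    # Record the expected length up front, so a run that stopped halfway
    # through is detectable on open.
    return HEADER.pack(len(payload)) + payload

def open_sstable(data: bytes) -> bytes:
    if len(data) < HEADER.size:
        raise IOError("truncated SSTable header")
    (expected,) = HEADER.unpack_from(data)
    payload = data[HEADER.size:]
    if len(payload) != expected:
        raise IOError("partial SSTable: expected %d bytes, got %d"
                      % (expected, len(payload)))
    return payload
```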
        
           | valyala wrote:
            | A good example of a database that relies on LSM tree
            | properties to protect against data corruption on unclean
            | shutdown is VictoriaMetrics [1].
           | 
           | [1] https://valyala.medium.com/wal-usage-looks-broken-in-
           | modern-...
        
         | neal wrote:
         | Good point!
         | 
         | I'm not sure if there are any databases that do your 'with a
         | lot of care' option, but for anyone curious about what that
         | might look like in practice there are file systems that forgo
         | write-ahead logging and maintain metadata consistency using
         | techniques like soft updates[0] or copy-on-write up to a
         | root[1].
         | 
         | [0]: https://www.usenix.org/conference/1999-usenix-annual-
         | technic... [1]:
         | https://www.cs.hmc.edu/~rhodes/courses/cs134/fa20/readings/T...
         | (yes, ZFS can be configured to use a WAL too for durability)
        
           | refset wrote:
           | > I'm not sure if there are any databases that do your 'with
           | a lot of care' option
           | 
           | LMDB springs to mind.
           | 
           | > LMDB was designed to resist data loss in the face of system
           | and application crashes. Its copy-on-write approach never
           | overwrites currently-in-use data. Avoiding overwrites means
           | the structure on disk/storage is always valid, so application
           | or system crashes can never leave the database in a corrupted
           | state.
           | 
           | https://en.wikipedia.org/wiki/Lightning_Memory-
           | Mapped_Databa...
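            | An in-memory analogy of that copy-on-write discipline (a
            | sketch only - on disk, LMDB writes new pages and then flips
            | the root pointer in a meta page): readers holding the old
            | root never observe a half-finished update.

```python
class CoWTree:
    # Copy-on-write sketch: updates never overwrite live data. A commit
    # builds new nodes, then performs a single atomic root swap; until
    # the swap, readers (and crash recovery) see the old, valid tree.
    def __init__(self):
        self.root = {}                  # immutable-by-convention snapshot

    def commit(self, key, value):
        new_root = dict(self.root)      # copy the state being modified
        new_root[key] = value           # mutate only the copy
        self.root = new_root            # single atomic pointer swap

tree = CoWTree()
snapshot = tree.root                    # a reader pins the old root
tree.commit("k", "v")
```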
        
       | jmix wrote:
       | This is obvious if you took a DB course, and if you didn't, you
       | have no business building a DB. Sadly, all the NoSQL junkware was
       | built by people who didn't.
        
       | zellyn wrote:
       | I found the Developer Voices discussion[1][2] with Joran Dirk
       | Greef, creator of TigerBeetle, to be fascinating. They mentioned
       | that the rigorous correctness proofs that exist for Raft and
       | Paxos assume an absence of disk faults, but that more modern
       | theory includes fixes for that, as long as you marry the log
       | writing and consensus algorithms together properly rather than
       | keeping each as a separate black box.
       | 
       | [1] https://podcasts.apple.com/us/podcast/databases-ambitions-
       | an...
       | 
       | [2] https://www.youtube.com/watch?v=ayG7ltGRRHs
        
         | refset wrote:
         | Specifically, that modern theory is "protocol-aware recovery" -
         | e.g.
         | https://www.usenix.org/conference/fast18/presentation/alagap...
        
         | samatman wrote:
         | I second this recommendation. I'm impressed by what TigerBeetle
         | has accomplished, and would love to see more of the industry
         | approach software engineering with their philosophy and rigour.
         | 
         | The TigerStyle video is well worth watching for anyone who
         | wants to build defect-free software systems, or recognizes that
         | this is the correct thing to aim for.
         | https://www.youtube.com/watch?v=w3WYdYyjek4
        
       | beltsazar wrote:
        | But a WAL is not for durability - it's for atomicity and
        | consistency. And yes, you need fsync to ensure durability.
        
         | anarazel wrote:
         | Nitpick: If you're careful fdatasync() can suffice.
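          | A sketch of the distinction (POSIX-only; os.fdatasync is not
          | available on every platform). The "careful" part is e.g.
          | preallocating the log file at a fixed size, so appends don't
          | dirty metadata that fdatasync is allowed to skip:

```python
import os
import tempfile

# fsync() flushes file data *and* metadata (size, mtime, ...), while
# fdatasync() may skip metadata that isn't needed to read the data back,
# often saving a device write per commit.
fd, path = tempfile.mkstemp()
os.write(fd, b"committed record\n")
os.fdatasync(fd)   # data (and any size change needed to retrieve it) durable
os.close(fd)
```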
        
       | refset wrote:
       | > durability is a spectrum
       | 
       | This is the truth. My favourite example of this is MemoryDB's
       | "multi-AZ (multi-datacenter) durability" for Valkey - there's a
       | good write-up here
       | https://brooker.co.za/blog/2024/04/25/memorydb.html
        
       ___________________________________________________________________
       (page generated 2024-07-01 23:01 UTC)