[HN Gopher] PSA: SQLite WAL checksums fail silently and may lose...
       ___________________________________________________________________
        
       PSA: SQLite WAL checksums fail silently and may lose data
        
       Author : avinassh
       Score  : 220 points
       Date   : 2025-07-24 14:48 UTC (8 hours ago)
        
 (HTM) web link (avi.im)
 (TXT) w3m dump (avi.im)
        
       | cwillu wrote:
       | The benefit is that you're left with a database state that
       | actually existed; there's no guarantee from the database's
       | perspective that dropping some committed transactions and not
       | others that came after will result in a valid state.
        
         | HelloNurse wrote:
         | This is the main point that the OP misses: even if the newer
         | portion of the WAL file isn't corrupted, its content cannot be
         | used in any way because doing so would require the lost
         | transactions from the corrupted block. The chained checksums
         | are a feature, not gratuitous fragility.
        
           | AlotOfReading wrote:
           | Sqlite could attempt to recover the detected errors though
           | and not lose the transactions.
        
             | hobs wrote:
             | And not get in an infinite loop, and not harm the startup
             | time of the process inordinately, and...
             | 
              | This is just basically how a WAL works: if you have an
              | inconsistent state, the transaction is rolled back, and at
              | that point you need to redo your work.
        
             | daneel_w wrote:
             | The WAL was corrupted, the actual data is lost. There's no
             | parity. You're suggesting that sqlite should somehow
             | _recreate_ the data from nothing.
        
               | avinassh wrote:
               | > The WAL was corrupted, the actual data is lost. There's
               | no parity. You're suggesting that sqlite should somehow
               | recreate the data from nothing.
               | 
               | Not all frames in the WAL are important. Sure, recovery
               | may be impossible in some cases, but not all checksum
               | failures are impossible to recover from.
        
               | pests wrote:
               | > but not all checksum failures are impossible to recover
               | from
               | 
               | Which failures are possible to recover from?
        
               | dec0dedab0de wrote:
               | what if the corruption only affected the stored checksum,
               | but not the data itself?
        
               | kstrauser wrote:
               | And you would detect this how?
        
               | daneel_w wrote:
               | ...then the checksum can't clear the associated
               | operation. Same result.
        
               | jandrewrogers wrote:
               | There are two examples I know of that require no
               | additional data:
               | 
               | First, force a re-read of the corrupted page from disk. A
               | significant fraction of data corruption occurs while it
               | is being moved between storage and memory due to weak
               | error detection in that part of the system. A clean read
               | the second time would indicate this is what happened.
               | 
                | Second, do a brute-force search for single or double bit
                | flips. This involves systematically flipping every bit in
                | the corrupted page, recomputing the checksum, and seeing
                | whether it now matches.
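                | 
                | A minimal sketch of the second idea in Python, assuming
                | a hypothetical page protected by a plain CRC32 (not
                | SQLite's actual WAL format):
                | 
                |     import zlib
                | 
                |     def try_repair(page: bytearray, expected_crc: int):
                |         """Brute-force single-bit-flip repair vs CRC32."""
                |         for i in range(len(page) * 8):
                |             page[i // 8] ^= 1 << (i % 8)   # flip one bit
                |             if zlib.crc32(bytes(page)) == expected_crc:
                |                 return page                # matches again
                |             page[i // 8] ^= 1 << (i % 8)   # undo the flip
                |         return None  # not a single-bit error; double
                |                      # flips are the same idea, but O(n^2)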
        
               | daneel_w wrote:
               | _> A significant fraction of data corruption occurs while
               | it is being moved between storage and memory_
               | 
               | Surely you mean on the memory bus specifically? SATA and
               | PCIe both have some error correction methods for securing
               | transfers between storage and host controller. I'm not
               | sure about old parallel ATA. While I understand it can
               | happen under conditions similar to non-ECC RAM being
               | corrupted, I don't think I've ever heard or read about a
               | case where a _storage device_ randomly returned erroneous
               | data, short of a legitimate hardware error.
        
               | jandrewrogers wrote:
               | The error correction in PCIe, SATA, etc is too weak to be
               | reliable for modern high-performance storage. This is a
               | known deficiency and why re-reading a corrupted page
               | sometimes fixes things. PCIe v6 is introducing a much
               | stronger error detection scheme to address this, which
               | will mostly leave bit-rot on storage media as the major
               | vector.
               | 
               | The bare minimum you want these days is a 64-bit CRC. A
               | strong 128-bit hash would be ideal. Even if you just
               | apply these at the I/O boundaries then you'll catch most
               | corruption. The places it can realistically occur are
               | shrinking but most software makes minimal effort to
               | detect this corruption even though it is a fairly well-
               | bounded problem.
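                | 
                | For illustration, checksumming at the I/O boundary can
                | be as simple as the following sketch (CRC32 for brevity;
                | as noted above you'd want 64 bits or more):
                | 
                |     import struct, zlib
                | 
                |     PAGE = 4096
                | 
                |     def write_page(f, page_no, data):
                |         crc = zlib.crc32(data)  # checksum on the way out
                |         f.seek(page_no * (PAGE + 4))
                |         f.write(data + struct.pack("<I", crc))
                | 
                |     def read_page(f, page_no):
                |         f.seek(page_no * (PAGE + 4))
                |         data = f.read(PAGE)
                |         (crc,) = struct.unpack("<I", f.read(4))
                |         if zlib.crc32(data) != crc:  # verify coming in
                |             raise IOError("corrupt page %d" % page_no)
                |         return data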
        
               | daneel_w wrote:
               | Thanks for the tech details.
        
               | daneel_w wrote:
               | My feeling is that any portion that no longer checksums
               | correctly is practically forfeited, which makes any form
               | of recovery an arbitrary result - a half-baked
               | transaction. I don't see how that's a more desirable
               | outcome than rolling back to the last known-good commit,
               | which is never arbitrary.
        
               | avinassh wrote:
               | I posted in another comment: There is a smoll demo in the
               | post. In it, I corrupt an old frame that is not needed by
               | the database at all. Now, one approach would be to
               | continue the recovery and then present both states: one
               | where the WAL is dropped, and another showing whatever we
               | have recovered. If I had such an option, I would almost
               | always pick the latter.
        
               | nemothekid wrote:
               | > _one where the WAL is dropped, and another showing
               | whatever we have recovered. If I had such an option, I
               | would almost always pick the latter_
               | 
               | I can't imagine picking the latter unless you were
               | treating sqlite like a filesystem of completely unrelated
               | blobs.
               | 
               | If I run three transactions where:
               | 
               | 1. John gives $100 to Sue.
               | 
               | 2. Sue gives $100 to Mark.
               | 
                | 3. Mark gives $100 to Paul.
               | 
               | If sqlite, just erases transaction (2), then Mark
               | materializes $100 from nowhere. The rest of your database
               | is potentially completely corrupted. At that point your
               | database is no longer consistent - I can't see how you
               | would "almost always" prefer this.
               | 
               | If (2) is corrupt, then the restore stops at (1), and you
               | are guaranteed consistency.
        
               | avinassh wrote:
                | Yes, this is a good example showing that we cannot partly
                | apply the WAL _always_. Let me repeat it: we cannot
                | partly apply the WAL all the time and expect it to work,
                | but there are some valid cases where we can do that to
                | recover. Your example is not one of them.
        
               | AlotOfReading wrote:
               | I was assuming sqlite did the sane thing and used a CRC.
               | CRCs have (limited) error correction capabilities that
               | you can use to fix 1-2 bit errors in most circumstances,
               | but apparently sqlite uses a Fletcher variant and gives
               | up that ability (+ long message length error detection)
               | for negligible performance gains on modern (even
               | embedded) CPUs.
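                | 
                | For reference, the WAL checksum documented in the SQLite
                | file format is roughly the following sketch (two running
                | 32-bit sums over 32-bit words, carried from frame to
                | frame; word order depends on the header magic):
                | 
                |     def wal_checksum(data, s0=0, s1=0, big=False):
                |         # data length must be a multiple of 8 bytes
                |         order = "big" if big else "little"
                |         for i in range(0, len(data), 8):
                |             x0 = int.from_bytes(data[i:i+4], order)
                |             x1 = int.from_bytes(data[i+4:i+8], order)
                |             s0 = (s0 + x0 + s1) & 0xFFFFFFFF
                |             s1 = (s1 + x1 + s0) & 0xFFFFFFFF
                |         return s0, s1  # seeds the next frame's checksum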
        
               | lxgr wrote:
               | The ability to... correct 1-2 bit errors? Is that even a
               | realistic failure mode on common hardware?
               | 
               | CRCs as used in SQLite are not intended to detect data
               | corruption due to bit rot, and are certainly not ECCs.
        
               | AlotOfReading wrote:
               | Yes, it's a pretty important error case with storage
               | devices. It's so common that modern storage devices and
               | filesystems include their own protections against it.
               | Your system may or may not have these and bit flips may
               | happen after that point, so WAL redundancy wouldn't be
               | out of place.
               | 
               | Sure, the benefits to the incomplete write use case are
               | limited, but there's basically no reason to ever use a
               | fletcher these days.
               | 
               | It's also worth mentioning that the VFS checksums _are_
               | explicitly documented as guarding against storage device
               | bitrot and use the same fletcher algorithm.
        
               | lxgr wrote:
               | It would be absolutely out of place if your lower layers
               | already provided it, and not enough if they didn't (since
               | you'd then also need checksums on the actual database
               | file, which SQLite does not provide - all writes happen
               | again there!)
        
               | AlotOfReading wrote:
               | sqlite does actually provide database checksums, via the
               | vfs extension I mentioned previously.
               | 
               | There's no harm to having redundant checksums and it's
               | not truly redundant for small messages. It's pretty
               | common for systems not to have lower level checksumming
               | either. Lots of people are still running NTFS/EXT4 on
               | hardware that doesn't do granular checksums or protect
               | data in transit.
               | 
               | Of course this is all a moot point because sqlite _does_
               | WAL checksums, it just does them with an obsolete
               | algorithm.
        
               | lxgr wrote:
               | > There's no harm to having redundant checksums
               | 
               | There sure is: Redundant checksums need extra storage and
               | extra processing. SQLite often runs on embedded systems,
               | where both can come at a premium.
               | 
               | > Of course this is all a moot point because sqlite does
               | WAL checksums, it just does them with an obsolete
               | algorithm.
               | 
               | That's not nearly the only thing missing for SQLite to
               | provide full resistance to lower-level data corruption.
               | At a very minimum, you'd also need checksums of the
               | actual database file.
        
               | AlotOfReading wrote:
                | There's no extra storage because it's already storing
                | checksums. There's technically extra processing, but a
                | good CRC is faster than reading memory on virtually all
                | modern systems (even embedded ones), so there's no reason
                | to prefer a fletcher unless the few extra cycles of
                | latency to get the CRC running are deeply concerning to
                | you.
                | 
                | > That's not nearly the only thing missing for SQLite to
                | provide full resistance to lower-level data corruption.
                | 
                | A CRC wouldn't provide full resistance anyway, so this
                | clearly isn't what I'm arguing. What I'm saying is that a
                | fletcher is strictly worse than a CRC here. Not an
                | advanced algorithm, a bog-standard CRC. Fletcher actually
                | takes more space in this case, has worse error detection,
                | and doesn't allow recovery. A CRC would allow you to
                | recover _some of the time_, from _some_ kinds of errors,
                | in _some_ scenarios. It's simply better than nothing.
        
               | lxgr wrote:
               | > There's no extra storage because it's already storing
               | checksums.
               | 
               | At the database level (i.e. not just the WAL)? Are you
               | sure?
               | 
               | > What I'm saying is that a fletcher is strictly worse
               | than a CRC here.
               | 
               | I can't speak to the performance differences, but the
               | only thing SQLite really needs the checksum to do is to
               | expose partial writes, both due to reordered sector
               | writes and partial intra-sector writes. (The former could
               | also be solved by just using an epoch counter, but the
               | latter would require some tricky write formats, and a
               | checksum nicely addresses both).
               | 
               | In both cases, there's really nothing to recover: CRC
               | won't catch an entire missing sector, and almost no
               | partially written sectors (i.e. unless the failure
               | somehow happens in the very last bytes of it, so that the
               | ratio of "flipped" bits is low enough).
        
               | AlotOfReading wrote:
               | Just talking about the WAL. The database does have a page
               | level checksum with the VFS extension, but it's optional.
        
         | teraflop wrote:
         | Yes, and it's not just about application-level integrity. The
         | WAL operates at a page level, so dropping one WAL entry and
         | then applying later ones would be likely to _cause_ corruption
         | at the B-tree level.
         | 
         | For instance, say you have a node A which has a child B:
         | 
         | * Transaction 1 wants to add a value to B, but it's already
         | full, so B is split into new nodes C and D. Correspondingly,
         | the pointer in A that points to B is removed, and replaced with
         | pointers to C and D.
         | 
         | * Transaction 2 makes an unrelated change to A.
         | 
         | If you skip the updates from transaction 1, and apply the
         | updates from transaction 2, then suddenly A's data is
         | overwritten with a new version that points to nodes C and D,
         | but those nodes haven't been written. The pointers just point
         | to uninitialized garbage.
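          | 
          | A toy illustration of that hazard, with a plain dict standing
          | in for the page store (not real SQLite structures):
          | 
          |     pages = {"A": "points to B", "B": "full leaf"}
          | 
          |     txn1 = {"A": "points to C and D",
          |             "C": "left half of B", "D": "right half of B"}
          |     txn2 = {"A": "points to C and D, plus unrelated change"}
          | 
          |     # Skip txn1 but replay txn2: A now references pages that
          |     # were never written.
          |     pages.update(txn2)
          |     assert "C" not in pages and "D" not in pages  # dangling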
        
       | teraflop wrote:
       | > The checksums in WAL are likely not meant to check for random
       | page corruption in the middle; maybe they're just to check if the
       | last write of a frame was fsynced properly or not?
       | 
       | This is the correct explanation. The purpose is to detect partial
       | writes, not to detect arbitrary data corruption. If detecting
       | corruption was the goal, then checksumming the WAL without also
       | checksumming the database itself would be fairly pointless.
       | 
       | In fact, it's not accurate to say "SQLite does not do checksums
       | by default, but it has checksums in WAL mode." SQLite always uses
       | checksums for its journal, regardless of whether that's a
       | rollback journal or a write-ahead log. [1]
       | 
       | For the purpose of tolerating and recovering from crashes/power
       | failures, writes to the database file itself are effectively
       | idempotent. It doesn't matter if only a subset of the DB writes
       | are persisted before a crash, and you don't need to know which
       | ones succeeded, because you can just roll all of them forward or
       | backward (depending on the mode). But for the journal itself,
       | distinguishing partial journal entries from complete ones
       | matters.
       | 
       | No matter what order the disk physically writes out pages, the
       | instant when the checksum matches the data is the instant at
       | which the transaction can be unambiguously said to commit.
       | 
       | [1]: https://www.sqlite.org/fileformat.html
        
         | kentonv wrote:
         | Exactly. To put it another way:
         | 
         | Imagine the power goes out while sqlite is in the middle of
         | writing a transaction to the WAL (before the write has been
         | confirmed to the application). What do you want to happen when
         | power comes back, and you reload the database?
         | 
         | If the transaction was fully written, then you'd probably like
         | to keep it. But if it was not complete, you want to roll it
         | back.
         | 
         | How does sqlite know if the transaction was complete? It needs
         | to see two things:
         | 
         | 1. The transaction ends with a commit frame, indicating the
         | application did in fact perform a `COMMIT TRANSACTION`.
         | 
         | 2. All the checksums are correct, indicating the data was fully
         | synced to disk when it was committed.
         | 
         | If the checksums are wrong, the assumption is that the
         | transaction wasn't fully written out. Therefore, it should be
         | rolled back. That's exactly what sqlite does.
         | 
         | This is not "data loss", because the transaction was not ever
         | fully committed. The power failure happened before the commit
         | was confirmed to the application, so there's no way anyone
         | should have expected that the transaction is durable.
         | 
         | The checksum is NOT intended to detect when the data was
         | corrupted by some other means, like damage to the disk or a
         | buggy app overwriting bytes. Myriad other mechanisms should be
         | protecting against those already, and sqlite is assuming those
         | other mechanisms are working, because if not, there's very
         | little sqlite can do about it.
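          | 
          | As a sketch, recovery then amounts to keeping the longest
          | checksum-valid prefix that ends in a commit frame (simplified;
          | real frames also carry salts and page numbers):
          | 
          |     def recover(frames, checksum_ok):
          |         last_commit = 0
          |         for i, frame in enumerate(frames, 1):
          |             if not checksum_ok(frame):  # chain broken: stop
          |                 break
          |             if frame.is_commit:
          |                 last_commit = i         # durable up to here
          |         return frames[:last_commit]     # the rest rolls back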
        
           | malone wrote:
           | Why is the commit frame not sufficient to determine whether
           | the transaction was fully written or not? Is there a scenario
            | where the commit frame is fsynced to disk but the preceding
            | data isn't?
        
             | adambb wrote:
             | The disk controller may decide to write out blocks in a
             | different order than the logical layout in the log file
             | itself, and be interrupted before completing this work.
        
               | hinkley wrote:
               | We still have the elevator algorithm on NVMe?
        
               | lxgr wrote:
               | SQLite runs on anything from servers to Internet-
               | connected lightbulbs.
        
               | jrockway wrote:
               | Which lightbulbs include SQLite? I kind of want one.
        
               | natebc wrote:
               | these guys have a Cree logo on their homepage so maybe
               | Cree?
               | 
               | https://imaginovation.net/case-study/cree/
               | 
               | At least what I could turn up with a quick web search.
        
               | bob1029 wrote:
               | Anything that uses NAND storage technology is going to be
               | optimized in some way like this. NVMe is just the
               | messenger.
        
               | jrockway wrote:
               | NVMe drives do their own manipulation of the datastream.
               | Wear leveling, GC, trying to avoid rewriting an entire
               | block for your 1 bit change, etc. NVMe drives have CPUs
               | and RAM for this purpose; they are full computers with a
               | little bit of flash memory attached. And no, of course
               | they're not open source even though they have full access
               | to your system.
        
               | johncolanduoni wrote:
               | It's worth noting this is also dependent on filesystem
               | behavior; most that do copy-on-write will not suffer from
               | this issue regardless of drive behavior, even if they
               | don't do their own checksumming.
        
           | hinkley wrote:
           | For instance, running on ZFS or one of its peers.
        
             | jandrewrogers wrote:
             | Apropos this use case, ZFS is usually not recommended for
             | databases. Competent database storage engines have their
             | own strong corruption detection mechanisms regardless. What
             | filesystems in the wild typically provide for this is
             | weaker than what is advisable for a database, so databases
             | should bring their own implementation.
        
               | TheDong wrote:
               | > ZFS is usually not recommended for databases
               | 
               | Say more? I've heard people say that ZFS is somewhat
               | slower than, say, ext4, but I've personally had zero
               | issues running postgres on zfs, nor have I heard any
               | well-reasoned reasons not to.
               | 
               | > What filesystems in the wild typically provide for this
               | is weaker than what is advisable for a database, so
               | databases should bring their own implementation.
               | 
               | Sorry, what? Just yesterday matrix.org had a post about
               | how they (using ext4 + postgres) had disk corruption
               | which led to postgres returning garbage data:
               | https://matrix.org/blog/2025/07/postgres-corruption-
               | postmort...
               | 
               | The corruption was likely present for months or years,
               | and postgres didn't notice.
               | 
               | ZFS, on the other hand, would have noticed during a
               | weekly scrub and complained loudly, letting you know a
               | disk had an error, letting you attempt to repair it if
               | you used RAID, etc.
               | 
                | Stuff like that post is exactly why I run postgres on
                | ZFS.
               | 
               | If you've got specifics about what you mean by "databases
               | should bring their own implementation", I'd be happy to
               | hear it, but I'm having trouble thinking of any sorta
               | technically sound reason for "databases actually prefer
               | it if filesystems can silently corrupt data lol" being
               | true.
        
               | zaarn wrote:
               | SQLite on ZFS needs the Fsync behaviour to be off,
               | otherwise SQLite will randomly hang the application as
               | the fsync will wait for the txg to commit. This can take
               | a minute or two, in my experience.
               | 
               | Btrfs is a better choice for SQLite.
        
               | throw0101b wrote:
                | > _SQLite on ZFS needs the Fsync behaviour to be off_
                | [...]
                | 
                |     zfs set sync=disabled mydata/mydb001
                | 
                | * https://openzfs.github.io/openzfs-
                | docs/man/master/7/zfsprops...
        
               | supriyo-biswas wrote:
               | Btw this concern also applies to other databases,
               | although probably it manifests in the worst way in
               | SQLite. Essentially, you're doing a WAL over the file
               | systems' own WAL-like recovery mechanism.
        
               | jandrewrogers wrote:
               | The point is that a database cannot rely on being
               | deployed on a filesystem with proper checksums.
               | 
               | Ext4 uses 16-/32-bit CRCs, which is very weak for storage
               | integrity in 2025. Many popular filesystems for databases
               | are similarly weak. Even if they have a strong option,
                | the strong option is not enabled by default. In real-
                | world Linux environments, the assumption that the
                | filesystem has weak checksums is usually true.
               | 
               | Postgres has (IIRC) 32-bit CRCs but they are not enabled
               | by default. That is also much weaker than you would
               | expect from a modern database. Open source databases do
               | not have a good track record of providing robust
               | corruption detection generally _nor_ the filesystems they
               | often run on. It is a systemic problem.
               | 
               | ZFS doesn't support features that high-performance
               | database kernels use and is slow, particularly on high-
               | performance storage. Postgres does not use any of those
               | features, so it matters less if that is your database.
               | XFS has traditionally been the preferred filesystem for
               | databases on Linux and Ext4 will work. Increasingly,
               | databases don't use external filesystems at all.
        
               | mardifoufs wrote:
               | I know MySQL has checksums by default, how does it
               | compare? Is it useful or is it similarly weak?
        
               | wahern wrote:
               | But what ZFS provides isn't weaker, and in SQLite page
               | checksums are opt-in:
               | https://www.sqlite.org/cksumvfs.html
               | 
               | EDIT: It seems they're opt-in for PostgreSQL, too:
               | https://www.postgresql.org/docs/current/checksums.html
        
               | avinassh wrote:
               | you might like my other post -
               | https://avi.im/blag/2024/databases-checksum/
               | 
               | bad news is, most databases don't do checksums by
               | default.
        
               | lxgr wrote:
               | This is in fact good news.
               | 
               | Redundantly performing the same performance-intensive
               | tasks on multiple layers makes latency less predictable
               | and just generally wastes resources.
        
               | jandrewrogers wrote:
               | Actually bad news. Most popular filesystems and
               | filesystem configurations have limited and/or weak
               | checksums, certainly much worse than you'd want for a
               | database. 16-bit and 32-bit CRCs are common in
               | filesystems.
               | 
               | This is a major reason databases implement their own
               | checksums. Unfortunately, many open source databases have
               | weak or non-existent checksums too. It is sort of an
               | indefensible oversight.
        
               | hawk_ wrote:
               | So when checksums are enabled and the DB process restarts
               | or the host reboots does the DB run the checksum over all
               | the stored data? Sounds like it would take forever for
               | the database to come online. But if it doesn't it may not
               | detect bitrot in time...?
        
               | lxgr wrote:
               | No, competent systems just need to have something that,
               | taken together, prevents data corruption.
               | 
               | One possible instance of that is a database providing its
               | own data checksumming, but another perfectly valid one is
               | running one that does not on a lower layer with a
               | sufficiently low data corruption rate.
        
               | tetha wrote:
               | Hm.
               | 
               | On the other hand, I've heard people recommend running
               | Postgres on ZFS so you can enable on the fly compression.
               | This increases CPU utilization on the postgres server by
               | quite a bit, read latency of uncached data a bit, but it
               | decreases necessary write IOPS a lot. And as long as the
               | compression is happening a lot in parallel (which it
               | should, if your database has many parallel queries), it's
               | much easier to throw more compute threads at it than to
               | speed up the write-speed of a drive.
               | 
                | And after a certain size, you start to need atomic
                | filesystem snapshots to be able to get a backup of a very
                | large and busy database without everything exploding. We
                | already see the more efficient backup strategies based on
                | replicas struggle on some systems, and we are at our
                | wits' end over how to create proper backups and archives
                | without reducing the backup frequency to weeks. ZFS has
                | mature mechanisms and zfs-send to move this data around
                | with limited impact on the production dataflow.
        
               | supriyo-biswas wrote:
               | Is an incremental backup of the database not possible?
               | Pgbackrest etc. can do this by creating a full backup
               | followed by incremental backups from the WAL.
               | 
               | For Postgres specifically you may also want to look at
               | using hot_standby_feedback, as described in this recent
               | HN article: https://news.ycombinator.com/item?id=44633933
        
               | tetha wrote:
               | On the big product clusters, we have incremental
               | pgbackrest backups running for 20 minutes. Full backups
               | take something between 12 - 16 hours. All of this from a
               | sync standby managed by patroni. Archiving all of that
                | takes 8 - 12 hours. It's a couple of terabytes of
                | noncompressible data that needs to move. It's fine
                | though, because this is an append-log-style dataset and
                | we can take our time backing this up.
                | 
                | We also have decently sized clusters with very active
                | data on them, and rather spicy recovery targets. On some
                | of them, a full backup from the sync standby takes 4
                | hours, we need to pull an incremental backup at most 2
                | hours afterwards, but the long-term archiving process
                | needs 2-3 hours to move the full backup to the archive.
                | This is the first point at which filesystem snapshots
                | (admittedly, of the pgbackrest repo) become necessary to
                | adhere to SLOs and keep the system functioning.
               | 
               | We do all of the high-complexity, high-throughput things
               | recommended by postgres, and it's barely enough on the
               | big systems. These things are getting to the point of
               | needing a lot more storage and network bandwidth.
        
               | hinkley wrote:
               | This was my understanding as well, color me also
               | confused.
        
               | johncolanduoni wrote:
                | It's not great for databases that do updates in place.
                | Log-structured merge databases (which most newer DB
                | engines are) work fine with its copy-on-write semantics.
        
             | zaarn wrote:
              | ZFS isn't viable for SQLite unless you turn off fsyncs in
              | ZFS, because otherwise you will have the same experience I
              | had for years: SQLite may randomly hang for up to a few
              | minutes with no visible cause if there aren't enough write
              | txgs filling up in the background. If your app depends on
              | SQLite, it'll randomly die.
             | 
             | Btrfs is a better choice for sqlite, haven't seen that
             | issue there.
        
               | throw0101b wrote:
               | > _ZFS isn't viable for SQLite unless you turn off
               | fsync's in ZFS_
               | 
                | Which you can do on a per-dataset ('directory') basis
                | very easily:
                | 
                |     zfs set sync=disabled mydata/mydb001
               | 
               | * https://openzfs.github.io/openzfs-
               | docs/man/master/7/zfsprops...
               | 
               | Meanwhile all the rest of your pools / datasets can keep
               | the default POSIX behaviour.
        
               | ezekiel68 wrote:
               | You know what's even easier than doing that? Neglecting
               | to do it or meaning to do it then getting pulled in to
               | some meeting (or other important distraction) and then
               | imagining you did it.
        
               | throw0101b wrote:
               | > _Neglecting to do it or meaning to do it then getting
               | pulled in to some meeting (or other important
               | distraction) and then imagining you did it._
               | 
                | If your job is to make sure your file system and your
                | database--SQLite, Pg, My/MariaDB, _etc._--are tuned
                | together, and you don't tune it, then you _should be_
                | called into a meeting. Or at least the no-fault RCA
                | should bring up remediation methods to make sure it's
                | part of the SOP so that it won't happen again.
               | 
               | The alternative the GP suggests is using Btrfs, which I
               | find even more irresponsible than your non-tuning
               | situation. (Heck, if someone on my sysadmin team
               | suggested we start using Btrfs for anything I would think
               | they were going senile.)
        
               | johncolanduoni wrote:
               | Facebook is apparently using it at scale, which surprised
               | me. Though that's not necessarily an endorsement, and who
                | knows what their kernel patchset looks like.
        
               | kentonv wrote:
               | Doesn't turning off sync mean you can lose confirmed
               | writes in a power failure?
        
               | Modified3019 wrote:
               | Interesting. Found a GitHub issue that covers this bug:
               | https://github.com/openzfs/zfs/issues/14290
               | 
               | The latest comment seems to be a nice summary of the root
               | cause, with earlier in the thread pointing to ftruncate
               | instead of fsync being a trigger:
               | 
               | >amotin
               | 
               | >I see. So ZFS tries to drop some data from pagecache,
               | but there seems to be some dirty pages, which are held by
               | ZFS till them either written into ZIL, or to disk at the
               | end of TXG. And if those dirty page writes were
               | asynchronous, it seems there is nothing that would nudge
               | ZFS to actually do something about it earlier than
               | zfs_txg_timeout. Somewhat similar problem was recently
               | spotted on FreeBSD after #17445, which is why newer
               | version of the code in #17533 does not keep references on
               | asynchronously written pages.
               | 
               | Might be worth testing zfs_txg_timeout=1 or 0
        
         | lxgr wrote:
         | I believe it's also because of this (from
         | https://www.sqlite.org/wal.html):
         | 
         | > [...] The checkpoint does not normally truncate the WAL file
         | (unless the journal_size_limit pragma is set). Instead, it
         | merely causes SQLite to start overwriting the WAL file from the
         | beginning. This is done because it is normally faster to
         | overwrite an existing file than to append.
         | 
         | Without the checksum, a new WAL entry might cleanly overwrite
         | an existing longer one in a way that still looks valid (e.g.
         | "A|B" -> "C|B" instead of "AB" -> "C|data corruption"), at
         | least without doing an (expensive) scheme of overwriting B with
         | invalid data, fsyncing, and then overwriting A with C and
         | fsyncing again.
         | 
         | In other words, the checksum allows an optimized write path
         | with fewer expensive fsync/truncate operations; it's not a
         | sudden expression of mistrust of lower layers that doesn't
         | exist in the non-WAL path.
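          | 
          | A toy illustration of why the chained checksum makes that
          | overwrite safe (SQLite actually uses per-WAL salts plus the
          | Fletcher-style sums; SHA-256 here just keeps the sketch
          | short):
          | 
          |     import hashlib
          | 
          |     def chain(prev, payload):
          |         return hashlib.sha256(prev + payload).digest()
          | 
          |     # Old WAL: frames A then B, each storing a running sum.
          |     c_a = chain(b"salt1", b"A")
          |     c_b = chain(c_a, b"B")
          | 
          |     # Checkpoint; the WAL is overwritten from the start with
          |     # one new frame C (a new salt would be used in practice).
          |     c_c = chain(b"salt1", b"C")
          | 
          |     # The stale frame B still sits after C on disk, but its
          |     # stored checksum was chained off A, not C, so replay
          |     # stops at it instead of treating "C|B" as valid.
          |     assert chain(c_c, b"B") != c_b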
        
       | slashdev wrote:
       | How would this work differently? As soon as you encounter a
       | checksum failure, you can't trust anything from that point on. If
        | the checksum were just per-page and didn't build on the previous
        | page's checksum, you still couldn't just apply the valid pages
        | from the WAL while skipping the invalid ones: the database at
        | the end of that process would be corrupt.
       | 
       | If you stop at the first failure, the database is restored to the
       | last good state. That's the best outcome that can be achieved
       | under the circumstances. Some data could be lost, but there
       | wasn't anything sensible you could do with it anyway.
        
         | avinassh wrote:
         | > How would this work differently?
         | 
         | I would like it to raise an error and then provide an option to
         | continue or stop. Since continuing is the default, we need a
         | way to opt in to stopping on checksum failure.
         | 
          | Not all checksum errors are impossible to recover from. Also,
          | as the post mentions, it may be that only some unimportant
          | pages are corrupt.
         | 
         | My main complaint is that it doesn't give developers an option.
        
           | thadt wrote:
           | Aight, I'll bite: continue or stop... and do what? As others
           | have pointed out, the only safe option to get back to a
           | consistent state is to roll back to a safe point.
           | 
           | If what we're really interested in is the _log_ part of a
           | write ahead log - where we could safely recover data after a
           | corruption, then a better tool might be just a log file,
           | instead of SQLite.
        
             | avinassh wrote:
             | > Aight, I'll bite: continue or stop... and do what? As
             | others have pointed out, the only safe option to get back
             | to a consistent state is to roll back to a safe point.
             | 
              | Attempt to recover! Again, not all checksum errors are
              | impossible to recover from. I hold the view that even if
              | there is a 1% chance of recovery, we should attempt it.
              | This may
             | be done by SQLite, an external tool, or even manually.
             | Since WAL corruption issues are silent, we cannot do that
             | now.
             | 
             | There is a smoll demo in the post. In it, I corrupt an old
             | frame that is not needed by the database at all. Now, one
             | approach would be to continue the recovery and then present
             | both states: one where the WAL is dropped, and another
             | showing whatever we have recovered. If I had such an
             | option, I would almost always pick the latter.
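              | 
              | For anyone who wants to poke at this, a rough Python
              | re-creation of the demo (assumes the default 4096-byte
              | page size, so offsets are approximate):
              | 
              |     import os, sqlite3, subprocess, sys, textwrap
              | 
              |     for p in ("demo.db", "demo.db-wal", "demo.db-shm"):
              |         if os.path.exists(p):
              |             os.remove(p)
              | 
              |     setup = textwrap.dedent("""
              |         import os, sqlite3
              |         con = sqlite3.connect("demo.db")
              |         con.execute("PRAGMA journal_mode=WAL")
              |         con.execute("CREATE TABLE t(x)")
              |         con.execute("PRAGMA wal_checkpoint(TRUNCATE)")
              |         for i in range(100):
              |             con.execute("INSERT INTO t VALUES (?)", (i,))
              |             con.commit()   # roughly one frame per commit
              |         os._exit(0)        # "crash": leave the WAL behind
              |     """)
              |     subprocess.run([sys.executable, "-c", setup],
              |                    check=True)
              | 
              |     # Flip one byte inside frame ~51 (32-byte WAL header,
              |     # then 24-byte frame header + 4096-byte page each).
              |     off = 32 + 50 * (24 + 4096) + 24 + 10
              |     with open("demo.db-wal", "r+b") as f:
              |         f.seek(off); b = f.read(1)[0]
              |         f.seek(off); f.write(bytes([b ^ 0xFF]))
              |     os.remove("demo.db-shm")  # ensure recovery next open
              | 
              |     con = sqlite3.connect("demo.db")
              |     print(con.execute("SELECT count(*) FROM t").fetchone())
              |     # prints roughly (50,): the later commits silently
              |     # vanish and no error is raised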
        
               | ncruces wrote:
               | > Since WAL corruption issues are silent, we cannot do
               | that now.
               | 
               | You do have backups, right?
        
               | avinassh wrote:
                | Yes. Say I am using something like litestream; then all
                | subsequent generations will be affected. At some point,
                | I'd have to go back and find the broken generation.
                | 
                | Instead of that, I'd prefer for it to fail fast.
        
               | lxgr wrote:
               | What about all classes of durability errors that do not
               | manifest in corrupted checksums (e.g. "cleanly" lost
               | appends of entire transactions to the WAL)?
               | 
               | It seems like you're focusing on a very specific failure
               | mode here.
               | 
               | Also, what if the data corruption error happens during
               | the write to the actual database file (i.e. at WAL
               | checkpointing time)? That's still 50% of all your writes,
               | and there's no checksum there!
        
               | NortySpock wrote:
                | The demo is good, though I wish you had presented it as
               | text rather than as a gif.
               | 
               | I do see your point of wanting an option to refuse to
               | delete the wal so a developer can investigate the wal and
                | manually recover... But the typical user probably wants
                | the database to come back up with a consistent, valid
                | state if power is lost. They do not want to have the
                | database refuse to operate because it found uncommitted
                | transactions in a scratchpad file...
               | 
               | As a SQL-first developer, I don't pick apart write-ahead
               | logs trying to save a few bytes from the great hard drive
               | in the sky, I just want the database to give me the
               | current state of my data and never be in an invalid
               | state.
        
               | avinassh wrote:
               | > As a SQL-first developer, I don't pick apart write-
               | ahead logs trying to save a few bytes from the great hard
               | drive in the sky, I just want the database to give me the
               | current state of my data and never be in an invalid
               | state.
               | 
               | Yes, that is a very valid choice. Hence, I want databases
               | to give me an option, so that you can choose to ignore
               | the checksum errors and I can choose to stop the app and
               | try to recover.
        
               | lxgr wrote:
               | WAL checksum errors are not used to catch data
               | corruption, they are only used to detect incomplete
               | writes or leftover data from previous checkpoints.
               | 
               | If you attempt to do the former in a system that by
               | design uses checksums only for the latter, you'll
               | actually introduce corrupted data from some non-corrupted
               | WAL files.
        
               | nemothekid wrote:
               | > _I corrupt an old frame that is not needed by the
               | database at all_
               | 
                | 1. How does the _database_ know that?
                | 
                | 2. In your example Alice gets the money from nowhere.
                | What if another user had sent the money to Alice and
                | _that_ frame got corrupted? Then you just created
                | 10,000,000 from nowhere.
               | 
               | At the very least, rolling back to a good point gives you
               | an exact moment of time where transactions can be applied
               | from. Your example is very contrived and in a database
               | where several transactions can be happening, doing a
               | partial recovery will destroy the consistency of the
               | database.
        
           | lxgr wrote:
           | Giving developers that option would require SQLite to change
           | the way it writes WALs, which would increase overhead.
           | Checksum corruptions can happen without any lower-level
           | errors; this is a performance optimization by SQLite.
           | 
           | I've written more about this here:
           | https://news.ycombinator.com/item?id=44673991
        
             | avinassh wrote:
             | > Giving developers that option would require SQLite to
             | change the way it writes WALs, which would increase
             | overhead.
             | 
              | Yes! But I am happy to accept that overhead in exchange
              | for the corruption detection.
        
               | lxgr wrote:
               | But why? You'd only get partial corruption detection.
               | 
               | As I see it, either you have a lower layer you can trust,
               | and then this would just be extra overhead, or you don't,
               | in which case you'll also need error correction (not just
               | detection!) for the database file itself.
        
           | slashdev wrote:
           | It is good that it doesn't give you an option. I don't want
           | some app on my phone telling me its database is corrupt, I
           | want it to always load back to the last good state and I'll
           | handle any missing data myself.
           | 
           | The checksums are not going to fail unless there was disk
           | corruption or a partial write.
           | 
           | In the former, thank your lucky stars it was in the WAL file
           | and you just lose some data but have a functioning database
           | still.
           | 
           | In the latter, you didn't fsync, so it couldn't have been
           | that important. If you care about not losing data, you need
           | to fsync on every transaction commit. If you don't care
            | enough to do that, why do you care about checksums? It's
            | missing the point.
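            | 
            | In Python's sqlite3, for instance, that's one pragma (in
            | WAL mode, synchronous=NORMAL only syncs at checkpoints;
            | FULL syncs the WAL on every commit):
            | 
            |     import sqlite3
            | 
            |     con = sqlite3.connect("app.db")
            |     con.execute("PRAGMA journal_mode=WAL")
            |     con.execute("PRAGMA synchronous=FULL")  # fsync/commit
            | 
            |     con.execute("CREATE TABLE IF NOT EXISTS kv(k, v)")
            |     con.execute("INSERT INTO kv VALUES ('a', 1)")
            |     con.commit()  # durable once this returns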
        
       | ryanjshaw wrote:
       | > What I want: throw an error when corruption is detected and let
       | the code handle it.
       | 
       | I wonder what that code would look like. My sense is that it'll
       | look exactly like the code that would run as if the transactions
       | never occurred to begin with, which is why the SQLite design
       | makes sense.
       | 
       | For example, I have a database of todos that sync locally from
       | the cloud. The WAL gets corrupted. The WAL gets truncated the
       | next time the DB is opened. The app logic then checks the last
       | update timestamp in the DB and syncs with the cloud.
       | 
       | I don't see what the app would do differently if it were notified
       | about the WAL corruption.
        
         | fer wrote:
         | Exactly. I'd read it as
         | 
         | > I want to correct errors that the DB wizard who implemented
         | SQLite chose not to
         | 
         | When there's a design decision in such a high profile project
         | that you disagree with, it's either
         | 
         | 1. You don't understand why it was done like this.
         | 
         | 2. You can (and probably will) submit a change that would solve
         | it.
         | 
         | If you find yourself in the situation of understanding, yet not
         | doing anything about it, you're the Schrodinger's developer:
         | you're right and wrong until you collapse the mouth function by
         | putting money on it.
         | 
          | It's very rarely an easy-to-fix mistake.
        
           | avinassh wrote:
           | > 2. You can (and probably will) submit a change that would
           | solve it.
           | 
           | SQLite is not open to contribution -
           | https://www.sqlite.org/copyright.html
           | 
           | > 1. You don't understand why it was done like this.
           | 
           | sure, I would like to understand it. That's why the post!
        
             | rovr138 wrote:
             | > Contributed Code
             | 
             | > In order to keep SQLite completely free and unencumbered
             | by copyright, the project does not accept patches. If you
             | would like to suggest a change and you include a patch as a
             | proof-of-concept, that would be great. However, please do
             | not be offended if we rewrite your patch from scratch.
             | 
             | Propose it.
        
       | adzm wrote:
       | sqlite has several callbacks / hooks / handlers that can be set.
       | I think it is reasonable to expect there to be a way for this
       | situation to be communicated to the application.
        
       | westurner wrote:
       | Do the sqlite replication systems depend upon WAL checksums?
       | 
       | Merkle hashes would probably be better.
       | 
       | google/trillian adds Merkle hashes to table rows.
       | 
       | sqlite-parquet-vtable would workaround broken WAL checksums.
       | 
       | sqlite-wasm-http is almost a replication system
       | 
       | Re: "Migration of the [sqlite] build system to autosetup"
       | https://news.ycombinator.com/item?id=41921992 :
       | 
       | > _There are many extensions of SQLite; rqlite (Raft in Go,), cr-
       | sqlite (CRDT in C), postlite (Postgres wire protocol for SQLite),
       | electricsql (Postgres), sqledge (Postgres), and also WASM:
       | sqlite-wasm, sqlite-wasm-http, dqlite (Raft in Rust),_
       | 
       | > _awesome-sqlite_
       | 
       | From "Adding concurrent read/write to DuckDB with Arrow Flight"
       | https://news.ycombinator.com/item?id=42871219 :
       | 
       | > _cosmos /iavl is a Merkleized AVL tree.
       | https://github.com/cosmos/iavl _
       | 
       | /? Merkle hashes for sqlite:
       | https://www.google.com/search?q=Merkle+hashes+for+SQlite
       | 
       | A git commit hash is basically a Merkle tree root, as it depends
       | upon the previous hashes before it.
       | 
       | Merkle tree: https://en.wikipedia.org/wiki/Merkle_tree
       | 
       | (How) Should merkle hashes be added to sqlite for consistency?
       | How would merkle hashes in sqlite differ from WAL checksums?
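        | 
        | A minimal sketch of the difference: the WAL's running checksum
        | is a chain (verifying frame N requires every frame before it),
        | while a Merkle tree lets you verify any one leaf against the
        | root with a log-sized proof:
        | 
        |     import hashlib
        | 
        |     def h(b):
        |         return hashlib.sha256(b).digest()
        | 
        |     def merkle_root(leaves):
        |         level = [h(l) for l in leaves]
        |         while len(level) > 1:
        |             if len(level) % 2:
        |                 level.append(level[-1])   # duplicate odd tail
        |             level = [h(a + b)
        |                      for a, b in zip(level[::2], level[1::2])]
        |         return level[0]
        | 
        |     frames = [b"frame1", b"frame2", b"frame3", b"frame4"]
        |     root = merkle_root(frames)
        |     # One corrupt frame changes the root, but the other frames
        |     # can still be verified independently via sibling hashes.
        |     assert merkle_root([b"frameX"] + frames[1:]) != root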
        
       | dathinab wrote:
       | Some things:
       | 
        | - there is an official checksum VFS shim, but I never used it
        | and don't know how good it is. The difference between it and the
        | WAL checksum is that it works on a per-page level, and you seem
        | to need to run the checksum checks manually and then decide
        | yourself what to do
        | 
        | - checksums (as used by the SQLite WAL) aren't meant for backup,
        | redundancy or data recovery (there are error-correcting codes
        | that allow recovering a limited number of bits, but they have
        | way more overhead than the kind of checksum used here)
        | 
        | - I also believe SQLite should indicate such checksum errors
        | (e.g. so that you might engage out-of-band data recovery, i.e.
        | fetch a backup from somewhere), but I'm not fully sure how you
        | would integrate it in a backward-compatible way? Like return it
        | as an error which otherwise acts like a SQLITE_BUSY??
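        | 
        | For the first point, a rough sketch of wiring up the shim
        | (assuming cksumvfs.c has been compiled as a loadable extension;
        | VFS name and pragma as per its docs):
        | 
        |     import sqlite3
        | 
        |     boot = sqlite3.connect(":memory:")
        |     boot.enable_load_extension(True)
        |     boot.load_extension("./cksumvfs")  # registers the VFS
        | 
        |     # New databases created through the VFS reserve 8 bytes
        |     # per page for the checksum; reads are verified against it.
        |     con = sqlite3.connect("file:app.db?vfs=cksmvfs", uri=True)
        |     print(con.execute("PRAGMA checksum_verification").fetchone())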
        
         | ncruces wrote:
         | The checksum VFS explicitly disables its checksums during
          | checkpointing (search for inCkpt):
         | https://sqlite.org/src/doc/tip/ext/misc/cksumvfs.c
         | 
         | Data in the WAL should be considered to be of "reduced
         | durability".
        
       | kburman wrote:
       | An employee of Turso, a commercial fork of SQLite, is presenting
       | a standard, safety-first feature of SQLite's WAL as a dangerous
       | flaw. As many have noted, this behavior prevents database
       | corruption, it doesn't cause it.
        
         | supriyo-biswas wrote:
         | I wouldn't have jumped to a conspiracy angle immediately, but
         | there are some signs which are difficult to overlook:
         | 
         | - Said person was apparently employed due to his good
         | understanding of databases and distributed systems concepts
          | (there's an HN thread about how he found an issue in the paper
         | describing an algorithm); yet makes fundamental mistakes in
         | understanding what the WAL does and how it's possible not to
         | "partly" apply a WAL.
         | 
         | - Said person expects a SQL database to expose WAL level errors
         | to the user breaking transactional semantics (if you want that
         | level of control, consider simpler file-based key-value stores
         | that expose such semantics?)
         | 
          | - Said person maligns SQLite as being impossible to contribute
          | to, whereas the actual project only mentions that they may
         | the proposed patch to avoid copyright implications.
         | 
         | - Said person again maligns SQLite as "limping along" in the
         | face of disk errors (while making the opposite claim a few
         | paragraphs ago); while ignoring that the checksum VFS exists
         | when on-disk data corruption is a concern.
        
           | avinassh wrote:
           | of all places, I did not expect to get personal attacks on HN
           | :)
           | 
           | > yet makes fundamental mistakes in understanding what the
           | WAL does and how it's possible not to "partly" apply a WAL.
           | 
            | Please provide a citation on where I said that. You can't
            | partly apply the WAL _always_, but there are very valid
            | cases where you can do that to recover. Recovery doesn't
            | have to be automatic. It can be done by SQLite, some
            | recovery tool, or with manual intervention.
           | 
           | > - Said person maligns SQLite as being impossible to
           | contribute; whereas the actual project only mentions that
           | they may rewrite the proposed patch to avoid copyright
           | implications.
           | 
            | Please provide a citation on where I said that. Someone
            | asked me to send a patch to SQLite, and I linked them to
            | SQLite's page.
        
             | supriyo-biswas wrote:
             | > You can't partly apply WAL always, but there are very
             | valid cases where you can do that to recover.
             | 
              | Without mentioning the exact set of cases where recovery
              | is and isn't possible, going "PSA: SQLite is
              | unreliable!!1one" is highly irresponsible. I think there's
              | quite a bit of criticism going around though; you could
              | add it to your blog article :)
             | 
             | Please also consider the fact that SQLite being a
             | transactional database, it is usually not possible to
             | expose a WAL level error to the user. The correct way to
             | address it is to probably come up with a list of cases
             | where it is possible, and then send in a patch, or at least
             | a proposal, of how to address it.
             | 
             | > Please provide citation on where I said that [SQLite is
             | impossible to contribute].
             | 
             | https://news.ycombinator.com/item?id=44672563
        
               | avinassh wrote:
               | I don't know if you have some personal vendetta against
               | me, because you are citing things I did not say. I did
                | not say SQLite is unreliable. I said SQLite stops
                | recovery at checksum errors found in the WAL, which may
                | lead to data loss. Which part of this is incorrect?
               | 
               | On SQLite contribution, I did not say it's "impossible."
               | I said it's not open to contribution. This is the exact
               | phrase from the linked page.
        
               | CJefferson wrote:
                | To me it isn't incorrect, but it is misleading. A
                | checksum error means corruption, and corruption in the
                | main database can just as easily cause arbitrary
                | damage. Why single out checksum errors in the WAL?
        
             | tucnak wrote:
             | > of all places, I did not expect to get personal attacks
             | on HN
             | 
             | You must be new to the site.
        
           | jrockway wrote:
           | I think it's kind of possible to partially apply the WAL
           | manually. Imagine your frames are:
           | 
            | 1) Insert new subscription for "foobar @ 123 Fake St."
            | 2) Insert new subscription for "##xD{.t3Axu:!"
            | 3) Insert new subscription for "barbaz @ 742 Evergreen
            |    Terrace"
           | 
            | A human could probably salvage two subscriptions out of
            | that data loss incident. I think that's what they're
            | saying: if you're very lucky and willing to do a lot of
            | manual work, you could maybe restore some of the data. Of
            | course, both of the "obviously correct" records could just
            | be random bitflips that happen to look right to humans;
            | there's no way of knowing.
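            | 
            | As a rough sketch of that kind of manual salvage, based on
            | the documented WAL format (https://www.sqlite.org/fileform
            | at2.html#walformat) and not on SQLite's own recovery code,
            | you can walk the frames yourself and see where the
            | checksum chain first breaks:
            | 
            |   import struct
            | 
            |   def cksum(data, s0, s1, big):
            |       # SQLite's cumulative WAL checksum over 32-bit
            |       # words; word order depends on the header magic.
            |       fmt = (">%dI" if big else "<%dI") % (len(data) // 4)
            |       words = struct.unpack(fmt, data)
            |       for i in range(0, len(words), 2):
            |           s0 = (s0 + words[i] + s1) & 0xFFFFFFFF
            |           s1 = (s1 + words[i + 1] + s0) & 0xFFFFFFFF
            |       return s0, s1
            | 
            |   wal = open("data.db-wal", "rb").read()
            |   magic, _, psize, _, salt1, salt2, c1, c2 = \
            |       struct.unpack(">8I", wal[:32])
            |   big = magic == 0x377F0683
            |   s0, s1 = cksum(wal[:24], 0, 0, big)
            |   assert (s0, s1) == (c1, c2), "WAL header itself is bad"
            |   off, n = 32, 0
            |   while off + 24 + psize <= len(wal):
            |       n += 1
            |       pgno, commit, fs1, fs2, f1, f2 = \
            |           struct.unpack(">6I", wal[off:off + 24])
            |       s0, s1 = cksum(wal[off:off + 8], s0, s1, big)
            |       s0, s1 = cksum(wal[off + 24:off + 24 + psize],
            |                      s0, s1, big)
            |       ok = ((fs1, fs2) == (salt1, salt2)
            |             and (f1, f2) == (s0, s1))
            |       print(f"frame {n}: page {pgno}, commit={commit}, "
            |             f"{'ok' if ok else 'BAD'}")
            |       if not ok:
            |           break  # this is where SQLite stops replaying
            |       off += 24 + psize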
        
             | cwillu wrote:
             | And if the "obviously correct" last entry is actually an
             | old entry that just hadn't been overwritten yet? Or if it
             | was only permitted because of something in the corrupted
             | section?
             | 
             | The database should absolutely not be performing guesswork
             | about the meaning of its contents during recovery. If you
             | want mongodb, go use mongodb.
        
             | ulrikrasmussen wrote:
             | Yes, in this particular example you could. But in general
             | the database cannot make assumptions that changes are
             | independent of each other.
             | 
              | I think SQLite assumes that a failing checksum comes
              | from a crash during a write that never finished. A
              | corrupt WAL frame before a valid frame can only occur if
              | the underlying storage is corrupt, but it makes no sense
              | for SQLite to start handling that during replay, as it
              | has no way to recover. You could maybe argue that it
              | should emit a warning.
        
             | supriyo-biswas wrote:
              | This could work for a simple key-value store, but SQLite
              | also does referential integrity, which means we might
              | end up with extra rows whose foreign keys point at
              | entries that no longer exist on the other side. IMO,
              | best avoided in a transactional database.
        
       | lxgr wrote:
       | > This is a follow-up post to my PSA: SQLite does not do
       | checksums and PSA: Most databases do not do checksums by default.
       | 
       | That's really all there is to it.
       | 
       | SQLite has very deliberate and well-documented assumptions (see
       | for example [1], [2]) about the lower layers it supports. One of
       | them is that data corruption is handled by these lower layers,
       | except if stated otherwise.
       | 
       | Not relying on this assumption would require introducing
       | checksums (or redundancy/using an ECC, really) on both the
       | WAL/rollback journal _and on the main database file_. This would
       | make SQLite significantly more complex.
       | 
       | I believe TFA is mistaken about how SQLite uses checksums. They
       | primarily serve to avoid some extra write barriers/fsync
       | operations, and perhaps to catch incomplete or out-of-order
       | writes, but never to detect actual data corruption:
       | https://news.ycombinator.com/item?id=44671373
       | 
       | [1] https://www.sqlite.org/psow.html
       | 
       | [2] https://www.sqlite.org/howtocorrupt.html
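       | 
       | To make the write-barrier point concrete: in WAL mode with
       | synchronous=NORMAL, SQLite skips the per-commit fsync and uses
       | the checksum chain to find the last fully written frame after a
       | power loss. A minimal sketch (standard pragmas; the "app.db"
       | file name is made up):
       | 
       |   import sqlite3
       | 
       |   conn = sqlite3.connect("app.db")
       |   conn.execute("PRAGMA journal_mode=WAL")
       |   # NORMAL: no fsync on each commit; a crash may drop the
       |   # last few transactions, and the checksums are what let
       |   # SQLite find the valid prefix of the WAL on reopen.
       |   conn.execute("PRAGMA synchronous=NORMAL")
       |   # FULL instead syncs the WAL on every commit:
       |   # conn.execute("PRAGMA synchronous=FULL")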
        
       | nemothekid wrote:
       | I might be missing something (We use sqlite for our embedded
       | stores) - but I feel like "failing silently" is alarmist here.
       | 
       | 1. If the WAL is incomplete, then "failing" silently is the
       | correct thing to do here, and is the natural function of the
       | WAL. The WAL had an incomplete write; nothing should have been
       | communicated back to the application, and the application
       | should assume the write never completed.
       | 
       | 2. If the WAL is corrupt (for the reasons he mentioned), then
       | sqlite's position is that it's _your_ problem, not sqlite's. I
       | think this is the default behavior for other databases as well:
       | if a bit flips on disk, the database is not guaranteed to catch
       | it.
       | 
       | This article is framed almost like a CVE, but to me it's kind
       | of like saying "PSA: if your hard drive dies, you may lose
       | data". If you care about data integrity (because your friend is
       | sending you sqlite files), you should be handling that
       | yourself.
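       | 
       | If you do care, SQLite has built-in tooling you can run
       | yourself; a small sketch (the "app.db" file name is made up,
       | and note this catches structural damage, not every flipped bit
       | in a cell payload):
       | 
       |   import sqlite3
       | 
       |   conn = sqlite3.connect("app.db")
       |   # Walks all btrees and indexes and reports malformed pages;
       |   # returns a single row ('ok',) when everything parses.
       |   rows = conn.execute("PRAGMA integrity_check").fetchall()
       |   if rows != [("ok",)]:
       |       raise RuntimeError(f"database is damaged: {rows}")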
        
         | supriyo-biswas wrote:
         | Also, partially applying a WAL has obvious issues even though
         | the author of this post would somehow prefer that. If we update
         | 3 rows in a database and the WAL entry for one of the rows is
         | corrupted, do they expect to ignore the corrupted entry and
         | apply the rest? What happens to data consistency in this
         | particular case?
        
           | lxgr wrote:
            | Even worse: SQLite, by default, does not immediately
            | truncate WAL files, but rather overwrites the existing WAL
            | from the beginning after a successful checkpoint.
            | 
            | Doing what the author suggests would actually _introduce_
            | data corruption errors when "restoring a WAL with a broken
            | checksum".
        
             | avinassh wrote:
              | However, by default, SQLite always deletes the WAL file
              | when the last connection closes:
             | 
             | > When the last connection to a database closes, that
             | connection does one last checkpoint and then deletes the
             | WAL and its associated shared-memory file, to clean up the
             | disk.
             | 
             | https://www.sqlite.org/wal.html
        
               | lxgr wrote:
               | Yes, but you can't rely on an orderly database close when
               | it comes to correctness. Non-clean shutdowns are much
               | more common than actual low-level data corruption in many
               | SQLite applications.
        
       | jmull wrote:
       | > What's interesting is that when a frame is found to have a
       | missing or invalid checksum, SQLite drops that frame and all the
       | subsequent frames.
       | 
       | Skipping a frame but processing later ones would corrupt the
       | database.
       | 
       | > SQLite doesn't throw any error on detection of corruption
       | 
       | I don't think it's actually a corruption detection feature,
       | though. I think it's there to prevent a physical failure while
       | writing (like a power loss) from corrupting the database. A
       | corruption detection feature would work differently: e.g., it
       | would cover the whole database, not just the WAL. Throwing an
       | error here doesn't make sense.
        
       | asveikau wrote:
       | > You have SQLite .db and .db-wal files, but no accompanying .db-
       | shm file. Maybe your friend shared it with you, or you downloaded
       | some data off the internet.
       | 
       | Honestly, this sounds out of scope for normal usage of sqlite
       | and not realistic. I had a hard time reading past this. If I
       | read that correctly, they're saying sqlite doesn't work if one
       | of the database files disappears out from under it.
       | 
       | I guess if you had filesystem corruption it's possible that
       | .db-shm disappears without notice, and that's a problem. But
       | that isn't sqlite's fault.
        
       ___________________________________________________________________