[HN Gopher] PSA: SQLite WAL checksums fail silently and may lose...
___________________________________________________________________
PSA: SQLite WAL checksums fail silently and may lose data
Author : avinassh
Score : 220 points
Date : 2025-07-24 14:48 UTC (8 hours ago)
(HTM) web link (avi.im)
(TXT) w3m dump (avi.im)
| cwillu wrote:
| The benefit is that you're left with a database state that
| actually existed; there's no guarantee, from the database's
| perspective, that dropping some committed transactions while
| keeping others that came after them will result in a valid state.
| HelloNurse wrote:
| This is the main point that the OP misses: even if the newer
| portion of the WAL file isn't corrupted, its content cannot be
| used in any way because doing so would require the lost
| transactions from the corrupted block. The chained checksums
| are a feature, not gratuitous fragility.
| AlotOfReading wrote:
| Sqlite could attempt to recover from the detected errors,
| though, and not lose the transactions.
| hobs wrote:
| And not get in an infinite loop, and not harm the startup
| time of the process inordinately, and...
|
| This is just basically how a WAL works: if you end up in an
| inconsistent state, the transaction is rolled back - at that
| point you need to redo your work.
| daneel_w wrote:
| The WAL was corrupted, the actual data is lost. There's no
| parity. You're suggesting that sqlite should somehow
| _recreate_ the data from nothing.
| avinassh wrote:
| > The WAL was corrupted, the actual data is lost. There's
| no parity. You're suggesting that sqlite should somehow
| recreate the data from nothing.
|
| Not all frames in the WAL are important. Sure, recovery
| may be impossible in some cases, but not all checksum
| failures are impossible to recover from.
| pests wrote:
| > but not all checksum failures are impossible to recover
| from
|
| Which failures are possible to recover from?
| dec0dedab0de wrote:
| what if the corruption only affected the stored checksum,
| but not the data itself?
| kstrauser wrote:
| And you would detect this how?
| daneel_w wrote:
| ...then the checksum can't clear the associated
| operation. Same result.
| jandrewrogers wrote:
| There are two examples I know of that require no
| additional data:
|
| First, force a re-read of the corrupted page from disk. A
| significant fraction of data corruption occurs while it
| is being moved between storage and memory due to weak
| error detection in that part of the system. A clean read
| the second time would indicate this is what happened.
|
| Second, do a brute-force search for single or double bit
| flips. This involves systematically flipping every bit in
| the corrupted page, recomputing the checksum, and seeing
| whether it now matches.
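|
| A minimal sketch of that second approach in Python, assuming
| a CRC32 for illustration (SQLite actually uses a Fletcher-
| style sum, but the search works the same way):
|
|     import zlib
|
|     def try_bitflip_repair(page: bytearray, stored: int):
|         # Flip each bit in turn; if the checksum then matches,
|         # we have found a plausible single-bit repair.
|         for i in range(len(page) * 8):
|             page[i // 8] ^= 1 << (i % 8)   # flip one bit
|             if zlib.crc32(bytes(page)) == stored:
|                 return page                # plausible repair found
|             page[i // 8] ^= 1 << (i % 8)   # restore, keep searching
|         return None                        # no single-bit repair exists
|
| Double-bit flips are the same search nested once more, at
| quadratic cost in the page size.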
| daneel_w wrote:
| _> A significant fraction of data corruption occurs while
| it is being moved between storage and memory_
|
| Surely you mean on the memory bus specifically? SATA and
| PCIe both have some error correction methods for securing
| transfers between storage and host controller. I'm not
| sure about old parallel ATA. While I understand it can
| happen under conditions similar to non-ECC RAM being
| corrupted, I don't think I've ever heard or read about a
| case where a _storage device_ randomly returned erroneous
| data, short of a legitimate hardware error.
| jandrewrogers wrote:
| The error correction in PCIe, SATA, etc is too weak to be
| reliable for modern high-performance storage. This is a
| known deficiency and why re-reading a corrupted page
| sometimes fixes things. PCIe v6 is introducing a much
| stronger error detection scheme to address this, which
| will mostly leave bit-rot on storage media as the major
| vector.
|
| The bare minimum you want these days is a 64-bit CRC. A
| strong 128-bit hash would be ideal. Even if you just
| apply these at the I/O boundaries then you'll catch most
| corruption. The places it can realistically occur are
| shrinking but most software makes minimal effort to
| detect this corruption even though it is a fairly well-
| bounded problem.
| daneel_w wrote:
| Thanks for the tech details.
| daneel_w wrote:
| My feeling is that any portion that no longer checksums
| correctly is practically forfeited, which makes any form
| of recovery an arbitrary result - a half-baked
| transaction. I don't see how that's a more desirable
| outcome than rolling back to the last known-good commit,
| which is never arbitrary.
| avinassh wrote:
| I posted in another comment: There is a smoll demo in the
| post. In it, I corrupt an old frame that is not needed by
| the database at all. Now, one approach would be to
| continue the recovery and then present both states: one
| where the WAL is dropped, and another showing whatever we
| have recovered. If I had such an option, I would almost
| always pick the latter.
| nemothekid wrote:
| > _one where the WAL is dropped, and another showing
| whatever we have recovered. If I had such an option, I
| would almost always pick the latter_
|
| I can't imagine picking the latter unless you were
| treating sqlite like a filesystem of completely unrelated
| blobs.
|
| If I run three transactions where:
|
| 1. John gives $100 to Sue.
|
| 2. Sue gives $100 to Mark.
|
| 3. Mark gives $100 to Paul.
|
| If sqlite just erases transaction (2), then Mark
| materializes $100 from nowhere. The rest of your database
| is potentially completely corrupted. At that point your
| database is no longer consistent - I can't see how you
| would "almost always" prefer this.
|
| If (2) is corrupt, then the restore stops at (1), and you
| are guaranteed consistency.
| avinassh wrote:
| Yes, this is a good example showing that we cannot
| _always_ partly apply the WAL. Again, let me repeat it:
| we cannot partly apply the WAL all the time and expect it
| to work, but there are some valid cases where we can do
| that to recover. Your example is not one of them.
| AlotOfReading wrote:
| I was assuming sqlite did the sane thing and used a CRC.
| CRCs have (limited) error correction capabilities that
| you can use to fix 1-2 bit errors in most circumstances,
| but apparently sqlite uses a Fletcher variant and gives
| up that ability (plus stronger error detection at long
| message lengths) for negligible performance gains on
| modern (even embedded) CPUs.
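|
| For reference, the WAL checksum documented in the SQLite
| file format is a Fletcher-style sum over 32-bit words,
| seeded with the running checksum of the previous frame -
| which is exactly what chains every frame to all the frames
| before it. A sketch in Python:
|
|     import struct
|
|     def wal_checksum(data: bytes, s0=0, s1=0, big_endian=True):
|         # data length must be a multiple of 8 bytes.
|         n = len(data) // 4
|         words = struct.unpack((">" if big_endian else "<") + "%dI" % n, data)
|         for i in range(0, n, 2):
|             s0 = (s0 + words[i] + s1) & 0xFFFFFFFF
|             s1 = (s1 + words[i + 1] + s0) & 0xFFFFFFFF
|         return s0, s1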
| lxgr wrote:
| The ability to... correct 1-2 bit errors? Is that even a
| realistic failure mode on common hardware?
|
| CRCs as used in SQLite are not intended to detect data
| corruption due to bit rot, and are certainly not ECCs.
| AlotOfReading wrote:
| Yes, it's a pretty important error case with storage
| devices. It's so common that modern storage devices and
| filesystems include their own protections against it.
| Your system may or may not have these and bit flips may
| happen after that point, so WAL redundancy wouldn't be
| out of place.
|
| Sure, the benefits to the incomplete write use case are
| limited, but there's basically no reason to ever use a
| fletcher these days.
|
| It's also worth mentioning that the VFS checksums _are_
| explicitly documented as guarding against storage device
| bitrot and use the same fletcher algorithm.
| lxgr wrote:
| It would be absolutely out of place if your lower layers
| already provided it, and not enough if they didn't (since
| you'd then also need checksums on the actual database
| file, which SQLite does not provide - all writes happen
| again there!)
| AlotOfReading wrote:
| sqlite does actually provide database checksums, via the
| vfs extension I mentioned previously.
|
| There's no harm to having redundant checksums and it's
| not truly redundant for small messages. It's pretty
| common for systems not to have lower level checksumming
| either. Lots of people are still running NTFS/EXT4 on
| hardware that doesn't do granular checksums or protect
| data in transit.
|
| Of course this is all a moot point because sqlite _does_
| WAL checksums, it just does them with an obsolete
| algorithm.
| lxgr wrote:
| > There's no harm to having redundant checksums
|
| There sure is: Redundant checksums need extra storage and
| extra processing. SQLite often runs on embedded systems,
| where both can come at a premium.
|
| > Of course this is all a moot point because sqlite does
| WAL checksums, it just does them with an obsolete
| algorithm.
|
| That's not nearly the only thing missing for SQLite to
| provide full resistance to lower-level data corruption.
| At a very minimum, you'd also need checksums of the
| actual database file.
| AlotOfReading wrote:
| There's no extra storage because it's already storing
| checksums. There's technically extra processing, but a
| good CRC is faster than reading memory on virtually all
| modern systems (even in embedded) so there's no reason to
| prefer a fletcher unless the few extra cycles of latency
| to get the CRC running are deeply concerning to you.
|
| > That's not nearly the only thing missing for SQLite to
| provide full resistance to lower-level data corruption.
|
| A CRC wouldn't provide full resistance anyway, so this
| clearly isn't what I'm arguing. What I'm saying is that a
| fletcher is strictly worse than a CRC here. Not an
| advanced algorithm, a bog-standard CRC. Fletcher actually
| takes more space in this case, has worse error detection,
| and doesn't allow recovery. A CRC would allow you to
| recover _some of the time_, from _some_ kinds of errors,
| in _some_ scenarios. It's simply better than nothing.
| lxgr wrote:
| > There's no extra storage because it's already storing
| checksums.
|
| At the database level (i.e. not just the WAL)? Are you
| sure?
|
| > What I'm saying is that a fletcher is strictly worse
| than a CRC here.
|
| I can't speak to the performance differences, but the
| only thing SQLite really needs the checksum to do is to
| expose partial writes, both due to reordered sector
| writes and partial intra-sector writes. (The former could
| also be solved by just using an epoch counter, but the
| latter would require some tricky write formats, and a
| checksum nicely addresses both).
|
| In both cases, there's really nothing to recover: CRC
| won't catch an entire missing sector, and almost no
| partially written sectors (i.e. unless the failure
| somehow happens in the very last bytes of it, so that the
| ratio of "flipped" bits is low enough).
| AlotOfReading wrote:
| Just talking about the WAL. The database does have a page
| level checksum with the VFS extension, but it's optional.
| teraflop wrote:
| Yes, and it's not just about application-level integrity. The
| WAL operates at a page level, so dropping one WAL entry and
| then applying later ones would be likely to _cause_ corruption
| at the B-tree level.
|
| For instance, say you have a node A which has a child B:
|
| * Transaction 1 wants to add a value to B, but it's already
| full, so B is split into new nodes C and D. Correspondingly,
| the pointer in A that points to B is removed, and replaced with
| pointers to C and D.
|
| * Transaction 2 makes an unrelated change to A.
|
| If you skip the updates from transaction 1, and apply the
| updates from transaction 2, then suddenly A's data is
| overwritten with a new version that points to nodes C and D,
| but those nodes haven't been written. The pointers just point
| to uninitialized garbage.
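|
| A toy model of that failure, with pages as Python dicts
| (all names made up):
|
|     # On disk before recovery: A points at B.
|     pages = {"A": {"children": ["B"]}, "B": {"rows": [1, 2, 3]}}
|
|     # WAL page images: txn1 splits B into C and D; txn2 touches A.
|     txn1 = {"A": {"children": ["C", "D"]},
|             "C": {"rows": [1, 2]}, "D": {"rows": [3]}}
|     txn2 = {"A": {"children": ["C", "D"], "note": "unrelated change"}}
|
|     # Skip txn1 (corrupt) but apply txn2 anyway:
|     pages.update(txn2)
|     print(pages.get(pages["A"]["children"][0]))
|     # -> None: A now points at a page that was never written.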
| teraflop wrote:
| > The checksums in WAL are likely not meant to check for random
| page corruption in the middle; maybe they're just to check if the
| last write of a frame was fsynced properly or not?
|
| This is the correct explanation. The purpose is to detect partial
| writes, not to detect arbitrary data corruption. If detecting
| corruption was the goal, then checksumming the WAL without also
| checksumming the database itself would be fairly pointless.
|
| In fact, it's not accurate to say "SQLite does not do checksums
| by default, but it has checksums in WAL mode." SQLite always uses
| checksums for its journal, regardless of whether that's a
| rollback journal or a write-ahead log. [1]
|
| For the purpose of tolerating and recovering from crashes/power
| failures, writes to the database file itself are effectively
| idempotent. It doesn't matter if only a subset of the DB writes
| are persisted before a crash, and you don't need to know which
| ones succeeded, because you can just roll all of them forward or
| backward (depending on the mode). But for the journal itself,
| distinguishing partial journal entries from complete ones
| matters.
|
| No matter what order the disk physically writes out pages, the
| instant when the checksum matches the data is the instant at
| which the transaction can be unambiguously said to commit.
|
| [1]: https://www.sqlite.org/fileformat.html
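|
| The recovery scan this implies is simple. A sketch, reusing
| the wal_checksum function from the Fletcher discussion above
| and the frame layout documented in [1] (error handling
| omitted; this is illustrative, not SQLite's actual code):
|
|     import struct
|
|     def valid_frames(wal: bytes):
|         hdr = struct.unpack(">8I", wal[:32])
|         magic, page_size, salt1, salt2 = hdr[0], hdr[2], hdr[4], hdr[5]
|         big = bool(magic & 1)              # 0x377f0683 => big-endian sums
|         s0, s1 = wal_checksum(wal[:24], 0, 0, big)
|         off = 32
|         while off + 24 + page_size <= len(wal):
|             fh = wal[off:off + 24]
|             pgno, dbsize, fs1, fs2, fc1, fc2 = struct.unpack(">6I", fh)
|             s0, s1 = wal_checksum(fh[:8] + wal[off + 24:off + 24 + page_size],
|                                   s0, s1, big)
|             if (fs1, fs2) != (salt1, salt2) or (fc1, fc2) != (s0, s1):
|                 break                      # drop this and all later frames
|             yield pgno, dbsize             # dbsize != 0 marks a commit frame
|             off += 24 + page_size
|
| Only frames up to the last one with a nonzero db-size field
| (the commit frame) get replayed into the database.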
| kentonv wrote:
| Exactly. To put it another way:
|
| Imagine the power goes out while sqlite is in the middle of
| writing a transaction to the WAL (before the write has been
| confirmed to the application). What do you want to happen when
| power comes back, and you reload the database?
|
| If the transaction was fully written, then you'd probably like
| to keep it. But if it was not complete, you want to roll it
| back.
|
| How does sqlite know if the transaction was complete? It needs
| to see two things:
|
| 1. The transaction ends with a commit frame, indicating the
| application did in fact perform a `COMMIT TRANSACTION`.
|
| 2. All the checksums are correct, indicating the data was fully
| synced to disk when it was committed.
|
| If the checksums are wrong, the assumption is that the
| transaction wasn't fully written out. Therefore, it should be
| rolled back. That's exactly what sqlite does.
|
| This is not "data loss", because the transaction was not ever
| fully committed. The power failure happened before the commit
| was confirmed to the application, so there's no way anyone
| should have expected that the transaction is durable.
|
| The checksum is NOT intended to detect when the data was
| corrupted by some other means, like damage to the disk or a
| buggy app overwriting bytes. Myriad other mechanisms should be
| protecting against those already, and sqlite is assuming those
| other mechanisms are working, because if not, there's very
| little sqlite can do about it.
| malone wrote:
| Why is the commit frame not sufficient to determine whether
| the transaction was fully written or not? Is there a scenario
| where the commit frame is fsynced to disk but the preceding
| data isn't?
| adambb wrote:
| The disk controller may decide to write out blocks in a
| different order than the logical layout in the log file
| itself, and be interrupted before completing this work.
| hinkley wrote:
| We still have the elevator algorithm on NVMe?
| lxgr wrote:
| SQLite runs on anything from servers to Internet-
| connected lightbulbs.
| jrockway wrote:
| Which lightbulbs include SQLite? I kind of want one.
| natebc wrote:
| these guys have a Cree logo on their homepage so maybe
| Cree?
|
| https://imaginovation.net/case-study/cree/
|
| At least what I could turn up with a quick web search.
| bob1029 wrote:
| Anything that uses NAND storage technology is going to be
| optimized in some way like this. NVMe is just the
| messenger.
| jrockway wrote:
| NVMe drives do their own manipulation of the datastream.
| Wear leveling, GC, trying to avoid rewriting an entire
| block for your 1 bit change, etc. NVMe drives have CPUs
| and RAM for this purpose; they are full computers with a
| little bit of flash memory attached. And no, of course
| they're not open source even though they have full access
| to your system.
| johncolanduoni wrote:
| It's worth noting this is also dependent on filesystem
| behavior; most that do copy-on-write will not suffer from
| this issue regardless of drive behavior, even if they
| don't do their own checksumming.
| hinkley wrote:
| For instance, running on ZFS or one of its peers.
| jandrewrogers wrote:
| Apropos this use case, ZFS is usually not recommended for
| databases. Competent database storage engines have their
| own strong corruption detection mechanisms regardless. What
| filesystems in the wild typically provide for this is
| weaker than what is advisable for a database, so databases
| should bring their own implementation.
| TheDong wrote:
| > ZFS is usually not recommended for databases
|
| Say more? I've heard people say that ZFS is somewhat
| slower than, say, ext4, but I've personally had zero
| issues running postgres on zfs, nor have I heard any
| well-reasoned reasons not to.
|
| > What filesystems in the wild typically provide for this
| is weaker than what is advisable for a database, so
| databases should bring their own implementation.
|
| Sorry, what? Just yesterday matrix.org had a post about
| how they (using ext4 + postgres) had disk corruption
| which led to postgres returning garbage data:
| https://matrix.org/blog/2025/07/postgres-corruption-postmort...
|
| The corruption was likely present for months or years,
| and postgres didn't notice.
|
| ZFS, on the other hand, would have noticed during a
| weekly scrub and complained loudly, letting you know a
| disk had an error, letting you attempt to repair it if
| you used RAID, etc.
|
| It's stuff like in that post that are exactly why I run
| postgres on ZFS.
|
| If you've got specifics about what you mean by "databases
| should bring their own implementation", I'd be happy to
| hear it, but I'm having trouble thinking of any sorta
| technically sound reason for "databases actually prefer
| it if filesystems can silently corrupt data lol" being
| true.
| zaarn wrote:
| SQLite on ZFS needs the Fsync behaviour to be off,
| otherwise SQLite will randomly hang the application as
| the fsync will wait for the txg to commit. This can take
| a minute or two, in my experience.
|
| Btrfs is a better choice for SQLite.
| throw0101b wrote:
| > _SQLite on ZFS needs the Fsync behaviour to be off_
|
|     zfs set sync=disabled mydata/mydb001
|
| * https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...
| supriyo-biswas wrote:
| Btw this concern also applies to other databases,
| although it probably manifests in the worst way in
| SQLite. Essentially, you're doing a WAL over the file
| systems' own WAL-like recovery mechanism.
| jandrewrogers wrote:
| The point is that a database cannot rely on being
| deployed on a filesystem with proper checksums.
|
| Ext4 uses 16-/32-bit CRCs, which is very weak for storage
| integrity in 2025. Many popular filesystems for databases
| are similarly weak. Even if they have a strong option,
| the strong option is not enabled by default. In real-
| world Linux environments, the assumption that the
| filesystem has weak checksums is usually true.
|
| Postgres has (IIRC) 32-bit CRCs but they are not enabled
| by default. That is also much weaker than you would
| expect from a modern database. Open source databases do
| not have a good track record of providing robust
| corruption detection generally _nor_ the filesystems they
| often run on. It is a systemic problem.
|
| ZFS doesn't support features that high-performance
| database kernels use and is slow, particularly on high-
| performance storage. Postgres does not use any of those
| features, so it matters less if that is your database.
| XFS has traditionally been the preferred filesystem for
| databases on Linux and Ext4 will work. Increasingly,
| databases don't use external filesystems at all.
| mardifoufs wrote:
| I know MySQL has checksums by default, how does it
| compare? Is it useful or is it similarly weak?
| wahern wrote:
| But what ZFS provides isn't weaker, and in SQLite page
| checksums are opt-in:
| https://www.sqlite.org/cksumvfs.html
|
| EDIT: It seems they're opt-in for PostgreSQL, too:
| https://www.postgresql.org/docs/current/checksums.html
| avinassh wrote:
| you might like my other post -
| https://avi.im/blag/2024/databases-checksum/
|
| bad news is, most databases don't do checksums by
| default.
| lxgr wrote:
| This is in fact good news.
|
| Redundantly performing the same performance-intensive
| tasks on multiple layers makes latency less predictable
| and just generally wastes resources.
| jandrewrogers wrote:
| Actually bad news. Most popular filesystems and
| filesystem configurations have limited and/or weak
| checksums, certainly much worse than you'd want for a
| database. 16-bit and 32-bit CRCs are common in
| filesystems.
|
| This is a major reason databases implement their own
| checksums. Unfortunately, many open source databases have
| weak or non-existent checksums too. It is sort of an
| indefensible oversight.
| hawk_ wrote:
| So when checksums are enabled and the DB process restarts
| or the host reboots does the DB run the checksum over all
| the stored data? Sounds like it would take forever for
| the database to come online. But if it doesn't it may not
| detect bitrot in time...?
| lxgr wrote:
| No; competent systems just need layers that, taken
| together, prevent data corruption.
|
| One possible instance of that is a database providing its
| own data checksumming, but another perfectly valid one is
| running one that does not on a lower layer with a
| sufficiently low data corruption rate.
| tetha wrote:
| Hm.
|
| On the other hand, I've heard people recommend running
| Postgres on ZFS so you can enable on the fly compression.
| This increases CPU utilization on the postgres server by
| quite a bit, read latency of uncached data a bit, but it
| decreases necessary write IOPS a lot. And as long as the
| compression is happening a lot in parallel (which it
| should, if your database has many parallel queries), it's
| much easier to throw more compute threads at it than to
| speed up the write-speed of a drive.
|
| And after a certain size, you start to need atomic
| filesystem snapshots to be able to get a backup of a very
| large and busy database without everything exploding. We
| already have the more efficient backup strategies from
| replicas struggling on some systems and are at our wits'
| end over how to create proper backups and archives without
| reducing the backup frequency to weeks. ZFS has mature
| mechanisms and zfs-send to move this data around with
| limited impact to the production dataflow.
| supriyo-biswas wrote:
| Is an incremental backup of the database not possible?
| Pgbackrest etc. can do this by creating a full backup
| followed by incremental backups from the WAL.
|
| For Postgres specifically you may also want to look at
| using hot_standby_feedback, as described in this recent
| HN article: https://news.ycombinator.com/item?id=44633933
| tetha wrote:
| On the big product clusters, we have incremental
| pgbackrest backups running for 20 minutes. Full backups
| take something between 12 - 16 hours. All of this from a
| sync standby managed by patroni. Archiving all of that
| takes 8 - 12 hours. It's a couple of terabytes of
| noncompressible data that needs to move. It's fine
| though, because this is an append-log-style dataset and
| we can take our time backing this up.
|
| We also have decently sized clusters with very active
| data on them, and rather spicy recovery targets. On some
| of them, a full backup from the sync standby takes 4
| hours, we need to pull an incremental backup at most 2
| hours afterwards, but the long-term archiving process
| needs 2-3 hours to move the full backup to the archive.
| This is the first point at which filesystem snapshots
| (admittedly, of the pgbackrest repo) become necessary to
| adhere to SLOs as well as system function.
|
| We do all of the high-complexity, high-throughput things
| recommended by postgres, and it's barely enough on the
| big systems. These things are getting to the point of
| needing a lot more storage and network bandwidth.
| hinkley wrote:
| This was my understanding as well, color me also
| confused.
| johncolanduoni wrote:
| It's not great for databases that do updates in place. Log-
| structured merge databases (which most newer DB engines
| are) work fine with its copy-on-write semantics.
| zaarn wrote:
| ZFS isn't viable for SQLite unless you turn off fsyncs in
| ZFS, because otherwise you will have the same experience I
| had for years: SQLite may randomly hang for up to a few
| minutes with no visible cause, if there aren't enough
| write txgs filling up in the background. If your app
| depends on SQLite, it'll randomly die.
|
| Btrfs is a better choice for sqlite, haven't seen that
| issue there.
| throw0101b wrote:
| > _ZFS isn't viable for SQLite unless you turn off
| fsync's in ZFS_
|
| Which you can do on a per-dataset ('directory') basis
| very easily:
|
|     zfs set sync=disabled mydata/mydb001
|
| * https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...
|
| Meanwhile all the rest of your pools / datasets can keep
| the default POSIX behaviour.
| ezekiel68 wrote:
| You know what's even easier than doing that? Neglecting
| to do it or meaning to do it then getting pulled into
| some meeting (or other important distraction) and then
| imagining you did it.
| throw0101b wrote:
| > _Neglecting to do it or meaning to do it then getting
| pulled in to some meeting (or other important
| distraction) and then imagining you did it._
|
| If your job is to make sure your file system and your
| database--SQLite, Pg, MySQL/MariaDB, _etc._--are tuned
| together, and you don't tune it, then you _should be_
| called into a meeting. Or at least the no-fault RCA
| should bring up remediation methods to make sure it's
| part of the SOP so that it won't happen again.
|
| The alternative the GP suggests is using Btrfs, which I
| find even more irresponsible than your non-tuning
| situation. (Heck, if someone on my sysadmin team
| suggested we start using Btrfs for anything I would think
| they were going senile.)
| johncolanduoni wrote:
| Facebook is apparently using it at scale, which surprised
| me. Though that's not necessarily an endorsement, and who
| knows what their kernel patchset looks like.
| kentonv wrote:
| Doesn't turning off sync mean you can lose confirmed
| writes in a power failure?
| Modified3019 wrote:
| Interesting. Found a GitHub issue that covers this bug:
| https://github.com/openzfs/zfs/issues/14290
|
| The latest comment seems to be a nice summary of the root
| cause, with earlier in the thread pointing to ftruncate
| instead of fsync being a trigger:
|
| >amotin
|
| >I see. So ZFS tries to drop some data from pagecache,
| but there seems to be some dirty pages, which are held by
| ZFS till them either written into ZIL, or to disk at the
| end of TXG. And if those dirty page writes were
| asynchronous, it seems there is nothing that would nudge
| ZFS to actually do something about it earlier than
| zfs_txg_timeout. Somewhat similar problem was recently
| spotted on FreeBSD after #17445, which is why newer
| version of the code in #17533 does not keep references on
| asynchronously written pages.
|
| Might be worth testing zfs_txg_timeout=1 or 0
| lxgr wrote:
| I believe it's also because of this (from
| https://www.sqlite.org/wal.html):
|
| > [...] The checkpoint does not normally truncate the WAL file
| (unless the journal_size_limit pragma is set). Instead, it
| merely causes SQLite to start overwriting the WAL file from the
| beginning. This is done because it is normally faster to
| overwrite an existing file than to append.
|
| Without the checksum, a new WAL entry might cleanly overwrite
| an existing longer one in a way that still looks valid (e.g.
| "A|B" -> "C|B" instead of "AB" -> "C|data corruption"), at
| least without doing an (expensive) scheme of overwriting B with
| invalid data, fsyncing, and then overwriting A with C and
| fsyncing again.
|
| In other words, the checksum allows an optimized write path
| with fewer expensive fsync/truncate operations; it's not a
| sudden expression of mistrust of lower layers that doesn't
| exist in the non-WAL path.
| slashdev wrote:
| How would this work differently? As soon as you encounter a
| checksum failure, you can't trust anything from that point on. If
| the checksum were just per-page and didn't build on the previous
| page's checksum, you still couldn't just apply the pages from the
| WAL that were valid, skipping the ones which were not. The
| database at the end of that process would be corrupt.
|
| If you stop at the first failure, the database is restored to the
| last good state. That's the best outcome that can be achieved
| under the circumstances. Some data could be lost, but there
| wasn't anything sensible you could do with it anyway.
| avinassh wrote:
| > How would this work differently?
|
| I would like it to raise an error and then provide an option to
| continue or stop. Since continuing is the default, we need a
| way to opt in to stopping on checksum failure.
|
| Not all checksum errors are impossible to recover from. Also,
| as the post mentions, the corruption could be confined to
| unimportant pages.
|
| My main complaint is that it doesn't give developers an option.
| thadt wrote:
| Aight, I'll bite: continue or stop... and do what? As others
| have pointed out, the only safe option to get back to a
| consistent state is to roll back to a safe point.
|
| If what we're really interested in is the _log_ part of a
| write ahead log - where we could safely recover data after a
| corruption, then a better tool might be just a log file,
| instead of SQLite.
| avinassh wrote:
| > Aight, I'll bite: continue or stop... and do what? As
| others have pointed out, the only safe option to get back
| to a consistent state is to roll back to a safe point.
|
| Attempt to recover! Again, not all checksum errors are
| impossible to recover from. I hold the view that even if there
| is a 1% chance of recovery, we should attempt it. This may
| be done by SQLite, an external tool, or even manually.
| Since WAL corruption issues are silent, we cannot do that
| now.
|
| There is a smoll demo in the post. In it, I corrupt an old
| frame that is not needed by the database at all. Now, one
| approach would be to continue the recovery and then present
| both states: one where the WAL is dropped, and another
| showing whatever we have recovered. If I had such an
| option, I would almost always pick the latter.
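|
| For anyone who wants to reproduce the silent truncation, a
| rough sketch of what the post's demo does (reconstructed;
| paths, table name, and the corrupted offset are assumptions,
| along with default 4 KiB pages, no autocheckpoint, and a
| fresh directory):
|
|     import shutil, sqlite3
|
|     src = sqlite3.connect("app.db")
|     src.execute("PRAGMA journal_mode=WAL")
|     src.execute("CREATE TABLE t(x)")
|     for i in range(100):
|         src.execute("INSERT INTO t VALUES (?)", (i,))
|         src.commit()
|
|     # Copy db + WAL while the connection is still open, so no
|     # checkpoint has folded the WAL back into the main file.
|     shutil.copy("app.db", "copy.db")
|     shutil.copy("app.db-wal", "copy.db-wal")
|
|     # Corrupt one byte somewhere inside roughly frame 50.
|     with open("copy.db-wal", "r+b") as f:
|         f.seek(32 + 49 * (24 + 4096) + 100)
|         f.write(b"\xff")
|
|     dst = sqlite3.connect("copy.db")
|     print(dst.execute("SELECT count(*) FROM t").fetchone())
|     # Well under 100 rows, and no error is raised.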
| ncruces wrote:
| > Since WAL corruption issues are silent, we cannot do
| that now.
|
| You do have backups, right?
| avinassh wrote:
| Yes. Say I am using something like litestream: then
| all subsequent generations will be affected. At some
| point, I'd have to go back and find the broken generation.
| Instead of that, I'd prefer for it to fail fast.
| lxgr wrote:
| What about all classes of durability errors that do not
| manifest in corrupted checksums (e.g. "cleanly" lost
| appends of entire transactions to the WAL)?
|
| It seems like you're focusing on a very specific failure
| mode here.
|
| Also, what if the data corruption error happens during
| the write to the actual database file (i.e. at WAL
| checkpointing time)? That's still 50% of all your writes,
| and there's no checksum there!
| NortySpock wrote:
| The demo is good, though I wish you had presented it as
| text rather than as a gif.
|
| I do see your point of wanting an option to refuse to
| delete the WAL so a developer can investigate it and
| manually recover... But the typical user probably
| wants the database to come back up with a consistent,
| valid state if power is lost. They do not want to have the
| database refuse to operate because it found uncommitted
| transactions in a scratchpad file...
|
| As a SQL-first developer, I don't pick apart write-ahead
| logs trying to save a few bytes from the great hard drive
| in the sky, I just want the database to give me the
| current state of my data and never be in an invalid
| state.
| avinassh wrote:
| > As a SQL-first developer, I don't pick apart write-
| ahead logs trying to save a few bytes from the great hard
| drive in the sky, I just want the database to give me the
| current state of my data and never be in an invalid
| state.
|
| Yes, that is a very valid choice. Hence, I want databases
| to give me an option, so that you can choose to ignore
| the checksum errors and I can choose to stop the app and
| try to recover.
| lxgr wrote:
| WAL checksum errors are not used to catch data
| corruption, they are only used to detect incomplete
| writes or leftover data from previous checkpoints.
|
| If you attempt to do the former in a system that by
| design uses checksums only for the latter, you'll
| actually introduce corrupted data from some non-corrupted
| WAL files.
| nemothekid wrote:
| > _I corrupt an old frame that is not needed by the
| database at all_
|
| 1. How does the _database_ know that?
|
| 2. In your example Alice gets the money from nowhere.
| What if another user had sent the money to Alice and
| _that_ frame gets corrupted? Then you've just created
| 10,000,000 from nowhere.
|
| At the very least, rolling back to a good point gives you
| an exact moment of time where transactions can be applied
| from. Your example is very contrived and in a database
| where several transactions can be happening, doing a
| partial recovery will destroy the consistency of the
| database.
| lxgr wrote:
| Giving developers that option would require SQLite to change
| the way it writes WALs, which would increase overhead.
| Checksum corruptions can happen without any lower-level
| errors; this is a performance optimization by SQLite.
|
| I've written more about this here:
| https://news.ycombinator.com/item?id=44673991
| avinassh wrote:
| > Giving developers that option would require SQLite to
| change the way it writes WALs, which would increase
| overhead.
|
| Yes! But I am happy to accept that overhead with the
| corruption detection.
| lxgr wrote:
| But why? You'd only get partial corruption detection.
|
| As I see it, either you have a lower layer you can trust,
| and then this would just be extra overhead, or you don't,
| in which case you'll also need error correction (not just
| detection!) for the database file itself.
| slashdev wrote:
| It is good that it doesn't give you an option. I don't want
| some app on my phone telling me its database is corrupt, I
| want it to always load back to the last good state and I'll
| handle any missing data myself.
|
| The checksums are not going to fail unless there was disk
| corruption or a partial write.
|
| In the former, thank your lucky stars it was in the WAL file
| and you just lose some data but have a functioning database
| still.
|
| In the latter, you didn't fsync, so it couldn't have been
| that important. If you care about not losing data, you need
| to fsync on every transaction commit. If you don't care
| enough to do that, why do you care about checksums, it's
| missing the point.
| ryanjshaw wrote:
| > What I want: throw an error when corruption is detected and let
| the code handle it.
|
| I wonder what that code would look like. My sense is that it'll
| look exactly like the code that would run as if the transactions
| never occurred to begin with, which is why the SQLite design
| makes sense.
|
| For example, I have a database of todos that sync locally from
| the cloud. The WAL gets corrupted. The WAL gets truncated the
| next time the DB is opened. The app logic then checks the last
| update timestamp in the DB and syncs with the cloud.
|
| I don't see what the app would do differently if it were notified
| about the WAL corruption.
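|
| A sketch of that recovery path, with a hypothetical
| fetch_since() standing in for the cloud call and a made-up
| todos schema:
|
|     import sqlite3
|
|     def resync(db: sqlite3.Connection, fetch_since):
|         # Works identically whether or not the WAL was truncated:
|         # anything missing locally is simply fetched again.
|         (last,) = db.execute(
|             "SELECT COALESCE(MAX(updated_at), 0) FROM todos").fetchone()
|         for todo in fetch_since(last):
|             db.execute(
|                 "INSERT INTO todos(id, body, updated_at) VALUES(?,?,?) "
|                 "ON CONFLICT(id) DO UPDATE SET body=excluded.body, "
|                 "updated_at=excluded.updated_at",
|                 (todo["id"], todo["body"], todo["updated_at"]))
|         db.commit()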
| fer wrote:
| Exactly. I'd read it as
|
| > I want to correct errors that the DB wizard who implemented
| SQLite chose not to
|
| When there's a design decision in such a high profile project
| that you disagree with, it's either
|
| 1. You don't understand why it was done like this.
|
| 2. You can (and probably will) submit a change that would solve
| it.
|
| If you find yourself in the situation of understanding, yet not
| doing anything about it, you're Schrödinger's developer:
| you're right and wrong until you collapse the mouth function by
| putting money on it.
|
| It's very rarely an easy-to-fix mistake.
| avinassh wrote:
| > 2. You can (and probably will) submit a change that would
| solve it.
|
| SQLite is not open to contribution -
| https://www.sqlite.org/copyright.html
|
| > 1. You don't understand why it was done like this.
|
| Sure, I would like to understand it. That's why the post!
| rovr138 wrote:
| > Contributed Code
|
| > In order to keep SQLite completely free and unencumbered
| by copyright, the project does not accept patches. If you
| would like to suggest a change and you include a patch as a
| proof-of-concept, that would be great. However, please do
| not be offended if we rewrite your patch from scratch.
|
| Propose it.
| adzm wrote:
| sqlite has several callbacks / hooks / handlers that can be set.
| I think it is reasonable to expect there to be a way for this
| situation to be communicated to the application.
| westurner wrote:
| Do the sqlite replication systems depend upon WAL checksums?
|
| Merkle hashes would probably be better.
|
| google/trillian adds Merkle hashes to table rows.
|
| sqlite-parquet-vtable would work around broken WAL checksums.
|
| sqlite-wasm-http is almost a replication system.
|
| Re: "Migration of the [sqlite] build system to autosetup"
| https://news.ycombinator.com/item?id=41921992 :
|
| > _There are many extensions of SQLite; rqlite (Raft in Go), cr-
| sqlite (CRDT in C), postlite (Postgres wire protocol for SQLite),
| electricsql (Postgres), sqledge (Postgres), and also WASM:
| sqlite-wasm, sqlite-wasm-http, dqlite (Raft in Rust),_
|
| > _awesome-sqlite_
|
| From "Adding concurrent read/write to DuckDB with Arrow Flight"
| https://news.ycombinator.com/item?id=42871219 :
|
| > _cosmos/iavl is a Merkleized AVL tree.
| https://github.com/cosmos/iavl_
|
| /? Merkle hashes for sqlite:
| https://www.google.com/search?q=Merkle+hashes+for+SQlite
|
| A git commit hash is basically a Merkle tree root, as it depends
| upon the previous hashes before it.
|
| Merkle tree: https://en.wikipedia.org/wiki/Merkle_tree
|
| (How) Should merkle hashes be added to sqlite for consistency?
| How would merkle hashes in sqlite differ from WAL checksums?
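|
| A minimal sketch of the difference (SHA-256 for
| illustration): chained digests invalidate everything after a
| bad frame, while a Merkle tree localizes the damage to one
| root-to-leaf path, so intact subtrees still verify.
|
|     import hashlib
|
|     def chained(frames):
|         # SQLite-style chaining: each digest folds in the previous one,
|         # so a bad frame invalidates every digest after it.
|         h, out = b"\x00" * 32, []
|         for f in frames:
|             h = hashlib.sha256(h + f).digest()
|             out.append(h)
|         return out
|
|     def merkle_root(frames):
|         # Siblings hash independently; only the path from a bad
|         # leaf to the root fails to verify.
|         level = [hashlib.sha256(f).digest() for f in frames]
|         while len(level) > 1:
|             if len(level) % 2:
|                 level.append(level[-1])
|             level = [hashlib.sha256(level[i] + level[i + 1]).digest()
|                      for i in range(0, len(level), 2)]
|         return level[0]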
| dathinab wrote:
| Some things:
|
| - there is an official checksum VFS shim, but I never used it
| and don't know how good it is. The difference between it and the
| WAL checksum is that it works on a per-page level, and you seem
| to need to manually run the checksum checks and then decide
| yourself what to do
|
| - checksums (as used by the SQLite WAL) aren't meant for backup,
| redundancy or data recovery (there are error-correcting codes
| that allow recovering a limited number of flipped bits, but they
| have much more overhead than the kind of checksum used here)
|
| - I also believe SQLite should indicate such checksum errors
| (e.g. so that you might engage out-of-band data recovery, i.e.
| fetch a backup from somewhere), but I'm not fully sure how you
| would integrate it in a backward-compatible way. Like returning
| it as an error which otherwise acts like a SQLITE_BUSY?
| ncruces wrote:
| The checksum VFS explicitly disables its checksums during
| checkpointing (search for inCkpt):
| https://sqlite.org/src/doc/tip/ext/misc/cksumvfs.c
|
| Data in the WAL should be considered to be of "reduced
| durability".
| kburman wrote:
| An employee of Turso, a commercial fork of SQLite, is presenting
| a standard, safety-first feature of SQLite's WAL as a dangerous
| flaw. As many have noted, this behavior prevents database
| corruption, it doesn't cause it.
| supriyo-biswas wrote:
| I wouldn't have jumped to a conspiracy angle immediately, but
| there are some signs which are difficult to overlook:
|
| - Said person was apparently employed due to his good
| understanding of databases and distributed systems concepts
| (there's a HN thread about how he found an issue in the paper
| describing an algorithm); yet makes fundamental mistakes in
| understanding what the WAL does and why it's not possible to
| "partly" apply a WAL.
|
| - Said person expects a SQL database to expose WAL level errors
| to the user breaking transactional semantics (if you want that
| level of control, consider simpler file-based key-value stores
| that expose such semantics?)
|
| - Said person maligns SQLite as being impossible to contribute
| to; whereas the actual project only mentions that they may
| rewrite the proposed patch to avoid copyright implications.
|
| - Said person again maligns SQLite as "limping along" in the
| face of disk errors (while making the opposite claim a few
| paragraphs ago), while ignoring that the checksum VFS exists
| when on-disk data corruption is a concern.
| avinassh wrote:
| of all places, I did not expect to get personal attacks on HN
| :)
|
| > yet makes fundamental mistakes in understanding what the
| WAL does and how it's possible not to "partly" apply a WAL.
|
| Please provide a citation for where I said that. You can't
| partly apply the WAL _always_, but there are very valid cases
| where you can do that to recover. Recovery doesn't have to be
| automatic. It can be done by SQLite, or some recovery tool, or
| with manual intervention.
|
| > - Said person maligns SQLite as being impossible to
| contribute; whereas the actual project only mentions that
| they may rewrite the proposed patch to avoid copyright
| implications.
|
| Please provide a citation for where I said that. Someone asked
| me to send a patch to SQLite; I linked them to SQLite's
| page.
| supriyo-biswas wrote:
| > You can't partly apply WAL always, but there are very
| valid cases where you can do that to recover.
|
| Without mentioning the exact set of cases where recovery
| is and isn't possible, going "PSA: SQLite is
| unreliable!!1one" is highly irresponsible. I think there's
| quite a bit of criticism going around though; you could add
| it to your blog article :)
|
| Please also consider the fact that SQLite being a
| transactional database, it is usually not possible to
| expose a WAL level error to the user. The correct way to
| address it is to probably come up with a list of cases
| where it is possible, and then send in a patch, or at least
| a proposal, of how to address it.
|
| > Please provide citation on where I said that [SQLite is
| impossible to contribute].
|
| https://news.ycombinator.com/item?id=44672563
| avinassh wrote:
| I don't know if you have some personal vendetta against
| me, because you are citing things I did not say. I did
| not say SQLite is unreliable. I said SQLite stops recovery
| at checksum errors found in the WAL, which may lead to data
| loss. Which part of this is incorrect?
|
| On SQLite contribution, I did not say it's "impossible."
| I said it's not open to contribution. This is the exact
| phrase from the linked page.
| CJefferson wrote:
| To me it isn't incorrect, but misleading. A checksum
| error means corruption, and corruption in the main
| database can randomly cause any amount of damage. Why
| treat a checksum error in the WAL specially?
| tucnak wrote:
| > of all places, I did not expect to get personal attacks
| on HN
|
| You must be new to the site.
| jrockway wrote:
| I think it's kind of possible to partially apply the WAL
| manually. Imagine your frames are:
|
| 1) Insert new subscription for "foobar @ 123 Fake St."
|
| 2) Insert new subscription for "##xD{.t3Axu:!"
|
| 3) Insert new subscription for "barbaz @ 742 Evergreen Terrace"
|
| A human could probably grab two subscriptions out of that
| data loss incident. I think that's what they're saying. If
| you're very lucky and want to do a lot of manual work, you
| could maybe restore some of the data. Obviously both of the
| "obviously correct" records could just be random bitflips
| that happen to look right to humans. There's no way of
| knowing.
| cwillu wrote:
| And if the "obviously correct" last entry is actually an
| old entry that just hadn't been overwritten yet? Or if it
| was only permitted because of something in the corrupted
| section?
|
| The database should absolutely not be performing guesswork
| about the meaning of its contents during recovery. If you
| want mongodb, go use mongodb.
| ulrikrasmussen wrote:
| Yes, in this particular example you could. But in general
| the database cannot make assumptions that changes are
| independent of each other.
|
| I think SQLite assumes that a failing checksum occurs due
| to a crash during a write which never finished. A corrupt
| WAL frame before a valid frame can only occur if the
| underlying storage is corrupt, but it makes no sense for
| SQLite to start handling that during replay as it has no
| way to recover. You could maybe argue that it should emit a
| warning.
| supriyo-biswas wrote:
| This could work for a simple key-value store, but SQLite
| also does referential integrity, which means we might just
| end up with extra rows with no matching rows on the other
| side of the relation. IMO, best avoided in a transactional
| database.
| lxgr wrote:
| > This is a follow-up post to my PSA: SQLite does not do
| checksums and PSA: Most databases do not do checksums by default.
|
| That's really all there is to it.
|
| SQLite has very deliberate and well-documented assumptions (see
| for example [1], [2]) about the lower layers it supports. One of
| them is that data corruption is handled by these lower layers,
| except if stated otherwise.
|
| Not relying on this assumption would require introducing
| checksums (or redundancy/using an ECC, really) on both the
| WAL/rollback journal _and on the main database file_. This would
| make SQLite significantly more complex.
|
| I believe TFA is mistaken about how SQLite uses checksums. They
| primarily serve as a way to avoid some extra write barriers/fsync
| operations, and maybe to catch incomplete out-of-order writes,
| but never to detect actual data corruption:
| https://news.ycombinator.com/item?id=44671373
|
| [1] https://www.sqlite.org/psow.html
|
| [2] https://www.sqlite.org/howtocorrupt.html
| nemothekid wrote:
| I might be missing something (We use sqlite for our embedded
| stores) - but I feel like "failing silently" is alarmist here.
|
| 1. If the WAL is incomplete, then "failing" silently is the
| correct thing to do here, and is the natural function of the WAL.
| The WAL had an incomplete write, nothing should have been
| communicated back to the application, and the application should
| assume the write never completed.
|
| 2. If the WAL is corrupt (due to the reasons he mentioned), then
| sqlite says that's _your_ problem, not sqlite's. I think
| this is the default behavior for other databases as well. If a
| bit flips on disk, it's not guaranteed the database will catch
| it.
|
| This article is framed almost like a CVE, but to me this is kind
| of like saying "PSA: If your hard drive dies you may lose data".
| If you care about data integrity (because your friend is sending
| you sqlite files) you should be handling that.
| supriyo-biswas wrote:
| Also, partially applying a WAL has obvious issues even though
| the author of this post would somehow prefer that. If we update
| 3 rows in a database and the WAL entry for one of the rows is
| corrupted, do they expect to ignore the corrupted entry and
| apply the rest? What happens to data consistency in this
| particular case?
| lxgr wrote:
| Even worse: SQLite, by default, does not immediately truncate
| WAL files, but rather overwrites the existing WAL from the
| beginning after a successful checkpoint.
|
| Doing what the author suggests would actually _introduce_
| data corruption errors when "restoring a WAL with a broken
| checksum".
| avinassh wrote:
| However, SQLite, by default, always truncates the WAL files
| on last connection close
|
| > When the last connection to a database closes, that
| connection does one last checkpoint and then deletes the
| WAL and its associated shared-memory file, to clean up the
| disk.
|
| https://www.sqlite.org/wal.html
| lxgr wrote:
| Yes, but you can't rely on an orderly database close when
| it comes to correctness. Non-clean shutdowns are much
| more common than actual low-level data corruption in many
| SQLite applications.
| jmull wrote:
| > What's interesting is that when a frame is found to have a
| missing or invalid checksum, SQLite drops that frame and all the
| subsequent frames.
|
| Skipping a frame but processing later ones would corrupt the
| database.
|
| > SQLite doesn't throw any error on detection of corruption
|
| I don't think it's actually a corruption detection feature
| though. I think it's to prevent a physical failure while writing
| (like power loss) from corrupting the database. A corruption
| detection feature would work differently. E.g., it would cover
| the whole database, not just the WAL. Throwing an error here
| doesn't make sense.
| asveikau wrote:
| > You have SQLite .db and .db-wal files, but no accompanying
| .db-shm file. Maybe your friend shared it with you, or you
| downloaded some data off the internet.
|
| Honestly this sounds out of scope for normal usage of sqlite and
| not realistic. I had a hard time reading past this. If I read
| that correctly, they're saying sqlite doesn't work if one of the
| database files disappears from under it.
|
| I guess if you had filesystem corruption it's possible that
| .db-shm disappears without notice and that's a problem. But
| that isn't sqlite's fault.
___________________________________________________________________
(page generated 2025-07-24 23:00 UTC)