[HN Gopher] A faster way to copy SQLite databases between computers
___________________________________________________________________
A faster way to copy SQLite databases between computers
Author : ingve
Score : 394 points
Date : 2025-05-01 11:15 UTC (11 hours ago)
(HTM) web link (alexwlchan.net)
(TXT) w3m dump (alexwlchan.net)
| dundundundun wrote:
| This is basically the way every other database is moved around.
| d34th0rl1f3 wrote:
| You can save time by using `zcat` instead of `cat` and skip the
| `gunzip my_local_database.db.txt.gz` step.
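|
| i.e. something like this, using the filenames from the article:
|
|     zcat my_local_database.db.txt.gz | sqlite3 my_local_database.db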
| _joel wrote:
| You could speed it up by using pigz (parallel gzip) too.
| masklinn wrote:
| If you're going to use a less universal tool for compression
| you might as well go with zstd.
| Slasher1337 wrote:
| Why would anyone use gzip instead of zstd in 2025? zstd is
| superior in every dimension.
|
| gzip is a legacy algorithm that imo only gets used for
| compatibility with legacy software that understands nothing but
| gzip.
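|
| A rough sketch of the dump/restore pipeline with zstd swapped in
| (filenames follow the article; the flags are plain zstd):
|
|     ssh username@server "sqlite3 my_remote_database.db .dump | zstd -c" \
|       > my_local_database.db.txt.zst
|     zstd -dc my_local_database.db.txt.zst | sqlite3 my_local_database.db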
| nodesocket wrote:
| You don't need cat at all for the restore. Can simply do:
|
| sqlite3 data/database.db < "{backup_file}"
| pvorb wrote:
| How long does this procedure take in comparison to the network
| transfer?
|
| My first try would've been to copy the db file first, gzip it and
| then transfer it, but I can't tell whether compression will be
| that useful on the binary format.
| MPSimmons wrote:
| The sqlite file format (https://www.sqlite.org/fileformat.html)
| does not talk about compression, so I would wager unless you
| are storing already compressed content (media maybe?) or random
| numbers (encrypted data), it should compress reasonably well.
| chasil wrote:
| Native compression in sqlite is offered as a closed and
| licensed extension.
|
| https://sqlite.org/com/cerod.html
| simlevesque wrote:
| In DuckDB you can do the same but export to Parquet; this way the
| data is an order of magnitude smaller than using text-based SQL
| statements. It's faster to transfer and faster to load.
|
| https://duckdb.org/docs/stable/sql/statements/export.html
| uwemaurer wrote:
| you can do it with a command line like this:
| duckdb -c "attach 'sqlite-database.db' as db; copy
| db.table_name to 'table_name.parquet' (format parquet,
| compression zstd)"
|
| in my test database this is about 20% smaller than the gzipped
| text SQL statements.
| simlevesque wrote:
| That's not it. This only exports the table's data, not the
| database. You lose the indexes, comments, schemas,
| partitioning, etc. The whole point of OP's article is how
| to export the indices in an efficient way.
|
| You'd want to do this:
|
|     duckdb -c "ATTACH 'sqlite-database.db' (READ_ONLY); EXPORT DATABASE 'target_directory' (FORMAT parquet, COMPRESSION zstd)"
|
| Also I wonder how big your test database is and what its schema
| looks like. For large tables Parquet is way more efficient than
| a 20% reduction.
|
| If there are UUIDs, they're 36 bytes each in text mode and 16
| bytes as binary in Parquet. And if they repeat you can use a
| dictionary in your Parquet to store each value only once.
|
| It's also worth trying brotli instead of zstd if small
| files are your goal.
| mromanuk wrote:
| I was surprised that he didn't try to use the on-the-fly
| compression provided by rsync:
|
|     -z, --compress              compress file data during the transfer
|     --compress-level=NUM        explicitly set compression level
|
| Probably it's faster to compress to gzip and transfer later. But
| it's nice to have the possibility to improve the transfer with a
| flag.
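|
| Something like (filenames as used elsewhere in the thread):
|
|     rsync -avz --compress-level=9 username@server:my_remote_database.db .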
| jddj wrote:
| Or better yet, since they cite corruption issues, sqlite3_rsync
| (https://sqlite.org/rsync.html) with -z
|
| sqlite transaction- and WAL-aware rsync with inflight
| compression.
| worldsavior wrote:
| I believe compression is only good on slow speed networks.
| berbec wrote:
| Valve tends to take a different view...
| stackskipton wrote:
| Valve has different needs than most. Their files rarely
| change, so they only need to do expensive compression once
| and they save a ton in bandwidth/storage, along with the fact
| that their users are more tolerant of download
| responsiveness.
| rollcat wrote:
| Depends. Run a benchmark on your own hardware/network. ZFS
| uses in-flight compression because CPUs are generally faster
| than disks. That may or may not be the case for your setup.
| PhilipRoman wrote:
| It would have to be one really fast network... zstd
| compresses and decompresses at 5+ GB (bytes, not bits) per
| second.
| worldsavior wrote:
| Where are you getting this performance? On the average
| computer this is by far not the speed.
| o11c wrote:
| I just tested on a ramdisk:
|
|     tool  cspeed     size  dspeed
|     zstd  361 MB/s   16%   1321 MB/s
|     lzop  512 MB/s   29%    539 MB/s
|     lz4   555 MB/s   29%   1190 MB/s
|
| If working from files on disk that happen not to be cached,
| the speed differences are likely to disappear, even on many
| NVMe disks.
|
| (It just so happens that the concatenation of all text-
| looking .tar files I happen to have on this machine is
| roughly a gigabyte (though I did the math for the actual
| size)).
| PhilipRoman wrote:
| Looks like it depends heavily on choice of file, but I
| see good performance on both compressible and
| uncompressible files. Small files tend to perform
| (relatively) badly though. Here is a sample of 3 large
| files with different compression ratios:
    zstd -b1 --fast -i10 some-rpi-os-image-idk.img
    -1#-os-image-idk.img : 2097152000 ->  226798302 (x9.247), 6765.0 MB/s, 5897.3 MB/s

    zstd -b1 --fast -i10 jdk-11.0.8+10.tar
    -1#jdk-11.0.8+10.tar :  688844800 ->  142114709 (x4.847), 2660.7 MB/s, 2506.8 MB/s

    zstd -b1 --fast -i10 archlinux-2025.04.01-x86_64.iso
    -1#.04.01-x86_64.iso : 1236303872 -> 1221271223 (x1.012), 3643.5 MB/s, 7836.6 MB/s
| masklinn wrote:
| Ain't no way zstd compresses at 5+, even at -1. That's the
| sort of throughput you see from lz4 running on a bunch of
| cores (either half a dozen very fast, or 12~16 merely fast).
| cogman10 wrote:
| Is the network only doing an rsync? Then you are probably
| right.
|
| For every other network, you should compress as you are
| likely dealing with multiple tenants that would all like a
| piece of your 40Gbps bandwidth.
| worldsavior wrote:
| By your logic, you should not compress as multiple tenants
| would all like a piece of your CPU.
| cogman10 wrote:
| This will always be something you have to determine for
| your own situation. At least at my work, CPU cores are
| plentiful, IO isn't. We rarely have apps that need more
| than a fraction of the CPU cores (barring garbage
| collection). Yet we are often serving fairly large chunks
| of data from those same apps.
| creatonez wrote:
| What? Compression is absolutely essential throughout
| computing as a whole, especially as CPUs have gotten faster.
| If you have compressible data sent over the network (or even
| on disk / in RAM) there's a good chance you should be
| compressing it. Faster links have not undercut this reality
| in any significant way.
| bityard wrote:
| Whether or not to compress data before transfer is VERY
| situationally dependent. I have seen it go both ways and
| the real-world results do not always match intuition.
| At the end of the day, if you care about performance, you
| still have to do proper testing.
|
| (This is the same spiel I give whenever someone says swap
| on Linux is or is not always beneficial.)
| berbec wrote:
| or used --remove-source-files so they didn't have to ssh back
| to rm
| crazygringo wrote:
| The main point is to skip the indices, which you have to do
| pre-compression.
|
| When I do stuff like this, I stream the dump straight into
| gzip. (You can usually figure out a way to stream directly to
| the destination without an intermediate file at all.)
|
| Plus this way it stays stored compressed at its destination, if
| your purpose is backup rather than a poor man's replication.
| schnable wrote:
| The main point was decreasing the transfer time - if rsync -z
| makes it short enough, it doesn't matter if the indices are
| there or not, and you also skip the step of re-creating the
| DB from the text file.
| crazygringo wrote:
| The point of the article is that it _does_ matter if the
| indices are there. And indices generally don't compress
| very well anyways. What compresses well are usually things
| like human-readable text fields or booleans/enums.
| Jyaif wrote:
| He absolutely should be doing this, because by using rsync on a
| compressed file he's giving up the whole point of using rsync,
| which is the rolling-checksum based algorithm that allows it to
| transfer diffs.
| rabysh wrote:
| I think this could be a single pipeline?
|
|     ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c" \
|       | gunzip -c | sqlite3 my_local_database.db
| rabysh wrote:
| gzip/gunzip might also be redundant if using ssh compression
| with -oCompression=on or -C on the ssh call
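|
| e.g. something like this, skipping gzip entirely (filenames
| assumed):
|
|     ssh -C username@server "sqlite3 my_remote_database.db .dump" \
|       | sqlite3 my_local_database.db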
| sneak wrote:
| My first thought, too. It also seems somewhat glaringly obvious
| that it needs a `pv` in there, as well.
| feverzsj wrote:
| That makes zero sense. Incremental backup via rsync/sqlite3_rsync
| should always be faster.
| Retr0id wrote:
| For incremental backups sure, but I think OP's solution would
| win for one-off snapshots.
| iambear wrote:
| I usually use scp for this case; sometimes the rsync versions are
| not compatible between the two machines.
| Cthulhu_ wrote:
| I used to work at a company that had a management interface that
| used sqlite as database, its multi-node / failover approach was
| also just... copying the file and rsyncing it. I did wonder about
| data integrity though, what if the file is edited while it's
| being copied over? But there's probably safeguards in place.
|
| Anyway I don't think the database file size was really an issue,
| it was a relatively big schema but not many indices and
| performance wasn't a big consideration - hence why the backend
| would concatenate query results into an XML file, then pass it
| through an xml->json converter, causing 1-2 second response times
| on most requests. I worked on a rewrite using Go where requests
| were more like 10-15 milliseconds.
|
| But, I still used sqlite because that was actually a pretty good
| solution for the problem at hand; relatively low concurrency (up
| to 10 active simultaneous users), no server-side dependencies or
| installation needed, etc.
| MPSimmons wrote:
| >I did wonder about data integrity though, what if the file is
| edited while it's being copied over? But there's probably
| safeguards in place.
|
| You could do a filesystem snapshot and copy from that, but
| neither cp nor rsync is atomic.
| formerly_proven wrote:
| sqlite3 has a backup API for this, which you can invoke using
| the .backup command in the sqlite3 CLI.
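|
| A minimal sketch of the CLI form (filenames assumed):
|
|     sqlite3 my_remote_database.db ".backup 'backup_copy.db'"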
|
| And then there is also https://www.sqlite.org/rsync.html
| rollcat wrote:
| SQLite has a write-ahead log (WAL). You can use Litestream on
| top of that. You get single RW, multiple readers (you lose the
| C in CAP), and can promote a reader when the writer fails.
| forinti wrote:
| One of the coolest things you can do with Postgresql is pipe
| pg_dump straight into psql connected to another cluster on
| another host.
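|
| Something like this, with hosts and database name as placeholders:
|
|     pg_dump -h old-host mydb | psql -h new-host mydb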
| actinium226 wrote:
| I recently set up some scripts to do this and it wasn't quite
| as simple as I had hoped. I had to pass some extra flags to
| pg_restore for --no-owner --no-acl, and then it still had
| issues when the target db has data in it, even with --clean and
| --create. And sometimes it would leave me in a state where it
| dropped the database and had trouble restoring, and so I'd be
| totally empty.
|
| What I ended up doing is creating a new database,
| pg_restore'ing into that one with --no-owner and --no-acl,
| forcibly dropping the old database, and then renaming the new
| to the old one's name. This has the benefit of not leaving me
| high and dry should there be an issue with restoring.
| rarrrrrr wrote:
| If you're regularly syncing from an older version to a new
| version, you can likely optimize further using gzip with the
| "--rsyncable" option. It will reduce the compression by ~1% but make
| it so differences from one version to the next are localized
| instead of cascading through the full length of the compression
| output.
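|
| A sketch of that variant (filenames assumed; --rsyncable needs a
| gzip build that supports it, which most distro builds and pigz do):
|
|     sqlite3 my_remote_database.db .dump | gzip --rsyncable > dump.sql.gz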
|
| Another alternative is to skip compression of the dump output,
| let rsync calculate the differences from a previous uncompressed
| dump to the current dump, then have rsync compress the change
| sets it sends over the network. (rsync -z)
| ukuina wrote:
| Doesn't this just push the runtime into index recomputation on
| the destination database?
| masklinn wrote:
| Yes, however they seem to have a pretty slow internet
| connection
|
| > Downloading a 250MB database from my web server takes about a
| minute over my home Internet connection
|
| So for the original 3.4GB database that's nearly 15 minutes
| waiting for the download.
| zeroq wrote:
| How to copy databases between computers? Just send a circle and
| forget about the rest of the owl.
|
| As others have mentioned an incremental rsync would be much
| faster, but what bothers me the most is that he claims that
| sending SQL statements is faster than sending the database while
| COMPLETELY omitting the fact that you have to execute these
| statements. And then run /optimize/. And then run /vacuum/.
|
| Currently I have a scenario in which I have to "incrementally
| rebuild *" a database from CSV files. In my particular case
| recreating the database from scratch is more optimal, but despite
| heavy optimization it still takes half an hour just to run batch
| inserts on an empty database in memory, create indexes, etc.
| JamesonNetworks wrote:
| 30 minutes seems long. Is there a lot of data? I've been
| working on bootstrapping sqlite dbs off of lots of json data
| and by holding a list of values and then inserting 10k at a
| time with inserts, I've found a good perf sweet spot where I can
| insert plenty of rows (millions) in minutes. I had to use some
| tricks with bloom filters and LRU caching, but can build a 6
| gig db in like 20ish minutes now
| pessimizer wrote:
| Saying that 30 minutes seems long is like saying that 5 miles
| seems far.
| thechao wrote:
| Millions of rows in minutes sounds not ok, unless your tables
| have a large number of columns. A good rule is that SQLite's
| insertion performance should be at least 1% of sustained max
| write bandwidth of your disk; preferably 5%, or more. For the
| last bulk table insert I was seeing 20%+ sustained; that came
| to ~900k inserts/second for an 8 column INT table (small
| integers).
| zeroq wrote:
| It's roughly 10Gb across several CSV files.
|
| I create a new in-mem db, run schema and then import every
| table in one single transaction (in my testing it showed that
| it doesn't matter if it's a single batch or multiple single
| inserts as long as they are part of a single transaction).
|
| I do a single string replacement per every CSV line to handle
| an edge case. This results in roughly 15 million inserts per
| minute (give or take, depending on table length and
| complexity). 450k inserts per second is a magic barrier I
| can't break.
|
| I then run several queries to remove unwanted data, trim
| orphans, add indexes, and finally run optimize and vacuum.
|
| Here's a quite recent log (on stock Ryzen 5900X):
    08:43  import
    13:30  delete non-essentials
    18:52  delete orphans
    19:23  create indexes
    19:24  optimize
    20:26  vacuum
| iveqy wrote:
| I hope you've found
| https://stackoverflow.com/questions/1711631/improve-insert-p...
|
| It's a very good writeup on how to do fast inserts in sqlite3
| jgalt212 wrote:
| yes, but they punt on this issue:
|
| CREATE INDEX then INSERT vs. INSERT then CREATE INDEX
|
| i.e. they only time INSERTs, not the CREATE INDEX after all
| the INSERTs.
| zeroq wrote:
| Yes! That was actually quite helpful.
|
| For my use case (recreating in-memory from scratch) it
| basically boils down to three points: (1) journal_mode = off,
| (2) wrapping all inserts in a single transaction, and (3)
| creating indexes after the inserts.
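|
| A minimal sketch of that recipe (table, column, and file names
| here are made up):
|
|     sqlite3 imported.db <<'SQL'
|     PRAGMA journal_mode = OFF;
|     CREATE TABLE t(a INTEGER, b TEXT);
|     BEGIN;
|     -- all the bulk INSERTs (or .import calls) go here
|     INSERT INTO t VALUES (1, 'x');
|     COMMIT;
|     CREATE INDEX idx_t_b ON t(b);
|     SQL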
|
| For whatever it's worth I'm getting 15M inserts per minute on
| average, and topping out around 450k/s for a trivial relationship
| table on a stock Ryzen 5900X using the built-in sqlite from
| NodeJS.
| vlovich123 wrote:
| Would it be useful for you to have a SQL database that's
| like SQLite (single file but not actually compatible with
| the SQLite file format) but can do 100M/s instead?
| zeroq wrote:
| Not really.
|
| I tested a couple of different approaches, including pglite,
| but node finally shipped native sqlite with version 23
| and it's fine for me.
|
| I'm a huge fan of serverless solutions and one of the
| absolute hidden gems about sqlite is that you can publish
| the database on an http server and query it extremely
| efficiently from a client.
|
| I even have a separate miniature benchmark project I
| thought I might publish, but then I decided it's not
| worth anyone's time. x]
| o11c wrote:
| It's worth noting that the data in that benchmark is tiny
| (28MB). While this varies between database engines, "one
| transaction for everything" means keeping some kind of
| allocations alive.
|
| The optimal transaction size is difficult to calculate so
| should be measured, but it's almost certainly never
| _beneficial_ to spend multiple seconds on a single
| transaction.
|
| There will also be weird performance changes when the size of
| data (or indexed data) exceeds the size of main memory.
| gibibit wrote:
| Hilarious, 3000+ votes for a Stack Overflow question that's
| not a question. But it is an interesting article. Interesting
| enough that it gets to break all the rules, I guess?
| detaro wrote:
| It's a (quite old) community wiki post. These do (and
| especially did back then) work and are treated differently.
| stackskipton wrote:
| As with any optimization, it matters where your bottleneck is
| here. Sounds like theirs is bandwidth, but CPU/disk IO is
| plentiful, since they mentioned that downloading a 250MB database
| takes a minute, whereas I just grabbed a 2GB SQLite test database
| from a work server in 15 seconds thanks to 1Gbps fiber.
| RenThraysk wrote:
| SQLite has a session extension, which will track changes to a
| set of tables and produce a changeset/patchset which can patch a
| previous version of an SQLite database.
|
| https://www.sqlite.org/sessionintro.html
| simonw wrote:
| Have you used that? I've read the documentation but I don't
| think I've ever heard from anyone who uses the extension.
| RenThraysk wrote:
| I have, at least to confirm it does what it says on the tin.
|
| The idea is an offline-first app, where each app install can
| pull a changeset and apply it to its local db.
| oefrha wrote:
| I have yet to see a single SQLite binding supporting this, so
| it's quite useless unless you're writing your application in C,
| or are open to patching the language binding.
|
| In one of my projects I have implemented my own poor man's
| session by writing all the statements and parameters into a
| separate database, then syncing that and replaying. It works well
| enough for a ~30GB database that changes by ~0.1% every day.
| RenThraysk wrote:
| There are at least two SQLite bindings for Go.
|
| https://github.com/crawshaw/sqlite
|
| https://github.com/eatonphil/gosqlite/
|
| Ended up with the latter, but did have to add one function
| binding in C, to inspect changesets.
| ncruces wrote:
| I'm open to adding it to my driver, if people consider it
| essential.
|
| Every extra bit makes AOT compiling the Wasm slower
| (impacting startup time).
|
| I also wanna keep the number of variants reasonable, or my
| repo blows up.
|
| Add your votes for additional features to this issue:
| https://github.com/ncruces/go-sqlite3/issues/126
| paulclinger wrote:
| I have updated the Lua binding to support the session
| extension (http://lua.sqlite.org/home/timeline?r=session) and
| it's been integrated into the current version of
| cosmopolitan/redbean. This was partially done to support
| application-level sync of SQLite DBs, however this is still a
| work in progress.
| nickpeterson wrote:
| I really wish SQLite had some default way of doing change data
| capture via session or something similar.
| jbverschoor wrote:
| I'm curious how your indices are twice the data. Sounds like you
| just put indices on anything you see.
| hobs wrote:
| Hah they need to try harder then, I have seen more than 20x the
| data volume in systems where people are both paranoid and
| ignorant, a dangerous combo!
| crazygringo wrote:
| I definitely have databases like this.
|
| It's not carelessness, it's performance.
|
| Quite simply, I have a table with 4 columns -- A, B, C, D. Each
| column is just an 8-byte integer. It has hundreds of millions
| of rows. It has an index on B+C+D, an index on C+D, and one on
| D.
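|
| In SQL terms, roughly this shape (names are placeholders):
|
|     CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER);
|     CREATE INDEX t_bcd ON t (b, c, d);
|     CREATE INDEX t_cd  ON t (c, d);
|     CREATE INDEX t_d   ON t (d);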
|
| All of these are required because the user needs to be able to
| retrieve aggregate data based on range conditions around lots
| of combinations of the columns. Without all the indices,
| certain queries take a couple minutes. With them, each query
| takes milliseconds to a couple seconds.
|
| I thought of every possible way to avoid having all three
| indices, but it just wasn't possible. It's just how performant
| data lookup works.
|
| You shouldn't assume people are being careless with indices.
| Far too often I see the opposite.
| Levitating wrote:
| I am sure you can just pipe all this so you don't have to use an
| intermediate gzipped file.
|
| Just ssh the machine, dump the SQL and load it back into SQLite
| locally.
| rollcat wrote:
| rsync will transmit only the delta between the source and
| destination.
| wang_li wrote:
| I've seen a suggestion several times to compress the data
| before sending. If remote means in the same data center,
| there's a good chance compressing the data is just slowing you
| down. Not many machines can gzip/bzip2/7zip at better than the
| 1 gigabyte per second you can get from 10 Gbps networks.
| whalesalad wrote:
| it's a file - what am I missing? scp host:path .
| ninth_ant wrote:
| The entire point of the article is to answer this specific
| question
| cwmma wrote:
| How well does just the sqlite database gzip? The indexes are a
| lot of redundant data so you're going to get some efficiencies
| there, but probably less locality of data than the text file, so
| maybe less?
| nottorp wrote:
| Wait... why would you even think about rsyncing a database that
| can get changed while being copied?
|
| Isn't this a case for proper database servers with replication?
|
| Or if it's an infrequent process done for dev purposes just shut
| down the application doing writes on the other side?
| bambax wrote:
| > _If it takes a long time to copy a database and it gets updated
| midway through, rsync may give me an invalid database file. The
| first half of the file is pre-update, the second half file is
| post-update, and they don't match. When I try to open the
| database locally, I get an error_
|
| Of course! You can't copy the file of a running, active db
| receiving updates, that can only result in corruption.
|
| For replicating sqlite databases safely there is
|
| https://github.com/benbjohnson/litestream
| quotemstr wrote:
| > Of course! You can't copy the file of a running, active db
| receiving updates, that can only result in corruption
|
| Do people really not understand how file storage works? I
| cannot rightly apprehend the confusion of ideas that would
| produce an attempt to copy a volatile database without
| synchronization and expect it to work.
| kccqzy wrote:
| The confusion of ideas here is understandable IMO: people
| assume everything is atomic. Databases of course famously
| have ACID guarantees. But it's easy for people to assume
| copying is also an atomic operation. Honestly if someone
| works too much with databases and not enough with filesystems
| it's a mistake easily made.
| ahazred8ta wrote:
| > I cannot rightly apprehend the confusion of ideas
|
| I see you are a man of culture.
| kccqzy wrote:
| Charles Babbage is smart, but either he lacks empathy to
| understand other people or he's just saying that
| deliberately for comedic effect.
| jerf wrote:
| It was early days... _very_ early days. He didn't have
| the benefit of trying to help his (metaphorical)
| grandparents get their emails or worked under a manager
| who thinks 2023-era ChatGPT is only slightly less
| reliable than the Standard Model of Physics, if not
| slightly more.
| bambax wrote:
| Oh he definitely lacked empathy.
|
| But things haven't improved much. Today we have "prompt
| engineers" whose only job is to input the right question
| in order to get the right answer.
| yellow_lead wrote:
| Litestream looks interesting but they are still in beta, and
| seem to have not had a release in over a year, although SQLite
| doesn't move that quickly.
|
| Is Litestream still an active project?
| clintonb wrote:
| Despite the beta label and lack of a 1.x release, I would
| consider the project pretty stable. We've used it in
| production for over 18 months to support an offline-first
| point of sale system. We haven't had any issues with
| Litestream.
| pixl97 wrote:
| >You can't copy the file of a running, active db receiving
| updates, that can only result in corruption
|
| There is a slight 'well akshully' on this. A DB flush and FS
| snapshot where you copy the snapshotted file will allow this.
| MSSQL VSS snapshots would be an example of this.
| tpmoney wrote:
| Similarly you can rsync a Postgres data directory safely
| while the db is running, with the caveat that you likely lose
| any data written while the rsync is running. And if you want
| that data, you can get it with the WAL files.
|
| It's been years since I needed to do this, but if I remember
| right, you can clone an entire pg db live with a
| `pg_backup_start()`, rsync the data directory,
| pg_backup_stop() and rsync the WAL files written since backup
| start.
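|
| A very rough sketch of the shape of it (paths and names are
| placeholders; since Postgres 15 both calls must happen in the same
| session, and older versions spell them pg_start_backup /
| pg_stop_backup):
|
|     psql -d mydb <<'EOF'
|     SELECT pg_backup_start('clone', true);
|     \! rsync -a /var/lib/postgresql/data/ replica:/var/lib/postgresql/data/
|     SELECT pg_backup_stop();  -- prints label data to keep with the copy
|     EOF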
| edoceo wrote:
| For moving DBs where I'm allowed minutes of downtime I do
| rsync (slow) first from the live, while hot, then just stop
| that one, then rsync again (fast) then make the new one
| hot.
|
| Works a treat when other (better) methods are not available.
| creatonez wrote:
| > You can't copy the file of a running, active db receiving
| updates, that can only result in corruption
|
| To push back against "only" -- there is actually one scenario
| where this works. Copying a file or a subvolume on Btrfs or ZFS
| can be done atomically, so if it's an ACID database or an LSM
| tree, in the worst case it will just rollback. Of course, if
| it's multiple files you have to take care to wrap them in a
| subvolume so that all of them are copied in the same
| transaction, simply using `cp --reflink=always` won't do.
|
| Possibly freezing the process with SIGSTOP would yield the same
| result, but I wouldn't count on that
| lmz wrote:
| It can't be done without fs specific snapshots - otherwise
| how would it distinguish between a cp/rsync needing
| consistent reads vs another sqlite client wanting the newest
| data?
| o11c wrote:
| Obligatory "LVM still exists and snapshots are easy enough
| to overprovision for"
| HumanOstrich wrote:
| Taking an LVM snapshot and then copying the sqlite
| database from that is sufficient to keep it from being
| corrupted, but you can have incomplete transactions that
| will be rolled back during crash recovery.
|
| The problem is that LVM snapshots operate at the block
| device level and only ensure there are no torn or half-
| written blocks. It doesn't know about the filesystem's
| journal or metadata.
|
| To get a consistent point-in-time snapshot without
| triggering crash-recovery and losing transactions, you
| also need to lock the sqlite database or filesystem from
| writes during the snapshot.
|
|     PRAGMA wal_checkpoint(FULL);
|     BEGIN IMMEDIATE;   -- locks out writers
|     /* trigger your LVM snapshot here */
|     COMMIT;
|
| You can also use fsfreeze to get the same level of safety:
|
|     sudo fsfreeze -f /mnt/data   # (A) flush dirty pages & block writes
|     lvcreate -L1G -s -n snap0 /dev/vg0/data
|     sudo fsfreeze -u /mnt/data   # (B) thaw, resume writes
|
| Bonus - validate the snapshotted db file with:
|
|     sqlite3 mydb-snapshot.sqlite "PRAGMA integrity_check;"
| ummonk wrote:
| I would assume cp uses ioctl (with atomic copies of
| individual files on filesystems that support CoW like APFS
| and BTRFS), whereas sqlite probably uses mmap?
| vlovich123 wrote:
| I was trying to find evidence that reflink copies are
| atomic and could not and LLMs seem to think they are not.
| So at best may be a btrfs only feature?
| creatonez wrote:
| From Linux kernel documentation:
|
| https://man7.org/linux/man-pages/man2/ioctl_ficlone.2.html
|
| > Clones are atomic with regards to concurrent writes, so
| no locks need to be taken to obtain a consistent cloned
| copy.
|
| I'm not aware of any of the filesystems that use it
| (Btrfs, XFS, Bcachefs, ZFS) that deviate from expected
| atomic behavior, at least with single files being the
| atom in question for `FICLONE` operation.
| vlovich123 wrote:
| Thanks! I similarly thought that a file clone would work
| but I couldn't confirm it.
| wswope wrote:
| The built-in .backup command is also intended as an official
| tool for making "snapshotted" versions of a live db that can be
| copied around.
| jmull wrote:
| If the corruption is detectable and infrequent enough for your
| purposes, then it _does_ work, with a simple "retry until
| success" loop. (That's how TCP works, for example.)
| benbjohnson wrote:
| Not all corruption is detectable. You could make a copy
| during a transaction where only a subset of the transaction's
| saved pages are persisted but all branch & leaf pages are
| pointed to correctly. That would give you a state of the
| database that never actually existed and break atomicity.
| jmull wrote:
| > Not all corruption is detectable.
|
| Well, I don't know rsync that well. If you're saying it
| doesn't detect changes to files while it's being copied,
| then I'll believe you.
|
| And, as far as I know, it's impossible to detect absolutely
| _all_ corruption.
|
| But you can pretty easily detect, e.g., that a file has or
| has not changed since before you copied it to after, on a
| system with a basically functioning filesystem and clock,
| with a reasonable/useful level of confidence.
| lknuth wrote:
| While I run and love litestream on my own system, I also like
| that they have a pretty comprehensive guide on how to do
| something like this manually, via built-in tools:
| https://litestream.io/alternatives/cron/
| yard2010 wrote:
| Litestream is really cool! I'm planning to use it to backup and
| restore my SQLite at the container level, just like that
| ex-Google guy who started a small KVM startup and had a
| flood in his warehouse while on vacation. If I'm not
| mistaken. I would link here the perfect guide he wrote but
| there's 0 chance I'll find it. If you understand the reference
| please post the link.
| mtlynch wrote:
| Haha, that sounds like me. Here's the writeup you're talking
| about:
|
| https://mtlynch.io/litestream/
|
| And here's the flooding story:
|
| https://mtlynch.io/solo-developer-year-6/#the-most-
| terrifyin...
|
| Sidenote: I still use Litestream in every project where I use
| SQLite.
| rsync wrote:
| "For replicating sqlite databases safely there is (litestream)
| ..."
|
| A reminder that litestream can run over plain old SFTP[1] which
| means you can stream database replication to just about any
| UNIX endpoint over SSH.
|
| I have a favorite[2] but any SFTP server will do ...
|
| [1] https://github.com/benbjohnson/litestream/issues/140
|
| [2]
| https://www.rsync.net/resources/notes/2021-q3-rsync.net_tech...
| ollybee wrote:
| sqlite provides the sqlite3_rsync command to safely copy databases:
| https://sqlite.org/rsync.html
| npn wrote:
| ???
|
| Why not just compress the whole database using `gzip` or `lz4`
| before rsyncing it instead? `zstd` works too but seems like it
| had a bug regarding compressing files with modified content.
|
| Better yet, split your sqlite file into smaller pieces. It's not
| like it needs to contain all the app data in a single sqlite
| file.
| hundredwatt wrote:
| The recently released sqlite3_rsync utility uses a version of the
| rsync algorithm optimized to work on the internal structure of a
| SQLite database. It compares the internal data pages efficiently,
| then only syncs changed or missing pages.
|
| Nice tricks in the article, but you can more easily use the
| builtin utility now :)
|
| I blogged about how it works in detail here:
| https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...
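|
| Usage is a single command, e.g. pulling a remote origin down to a
| local replica (filenames assumed):
|
|     sqlite3_rsync username@server:my_remote_database.db my_local_database.db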
| jgalt212 wrote:
| sqlite3_rsync can only be used in WAL mode. A further constraint
| of WAL mode is the database file must be stored on local disk.
| Clearly, you'd want to do this almost all the time, but for the
| times this is not possible this utility won't work.
| SQLite wrote:
| I just checked in an experimental change to sqlite3_rsync
| that allows it to work on non-WAL-mode database files, as
| long as you do _not_ use the --wal-only command-line option.
| The downside of this is that the origin database will block
| all writers while the sync is going on, and the replicate
| database will block both reads and writers during the sync,
| because to do otherwise requires WAL-mode. Nevertheless,
| being able to sync DELETE-mode databases might well be
| useful, as you observe.
|
| If you are able, please try out this enhancement and let me
| know if it solves your problem. See
| <https://sqlite.org/src/info/2025-05-01T16:07Z> for the
| patch.
| SQLite wrote:
| Update: This enhancement is now on trunk and will be
| included in the 3.50.0 release of SQLite due out in about
| four weeks.
| construct0 wrote:
| Demands increasing page size if you sync frequently
| (bandwidth).
| rsync wrote:
| Also note:
|
| sqlite3_rsync is now built into the rsync.net platform.
| ssh user@rsync.net sqlite3_rsync ... blah blah ...
|
| ... just added last week and not rolled out in all regions but
| ... all initial users reported it worked exactly as they
| expected it to.
| actinium226 wrote:
| I have recently discovered a tool called mscp which opens
| multiple scp threads to copy down large files. It works great for
| speeding up these sorts of downloads.
|
| https://github.com/upa/mscp
| 404mm wrote:
| zstd would be a better choice. It's bonkers fast (especially when
| used with multithreading) and still compresses better than gzip.
| Alternatively, I'd recommend looking into bzip3, but I'm not sure
| if it would save time.
| ozim wrote:
| I guess for me it is obvious that you don't try to copy a running
| DB, only a backup.
|
| So I see basic stuff needs to be repeated as people still miss
| those kinds of things.
|
| But I learned that you can easily dump SQLite to a text file -
| neat!
| yapyap wrote:
| Very neat walkthrough, clear commands and I appreciate the
| explanations as to why this may help in OP's case
| markhahn wrote:
| isn't this rather obvious? doesn't everyone do this when it makes
| sense? obviously, it applies to other DBs, and you don't even
| need to store the file (just a single ssh from dumper to remote
| undumper).
|
| if retaining the snapshot file is of value, great.
|
| I'd be a tiny bit surprised if rsync could recognize diffs in the
| dump, but it's certainly possible, assuming the dumper is
| "stable" (probably is because its walking the tables as trees).
| the amount of change detected by rsync might actually be a useful
| thing to monitor.
| RKFADU_UOFCCLEL wrote:
| Pretty good point. I just wonder if databases in general can be
| perfectly reconstructed from a text dump. For instance, do the
| insertion orders change in any of the operations between dumping
| and importing?
| M95D wrote:
| Saving to a text file is inefficient. I save sqlite databases using
| VACUUM INTO, like this:
|
|     sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"
|
| From https://sqlite.org/lang_vacuum.html :
|
|     The VACUUM command with an INTO clause is an alternative to the
|     backup API for generating backup copies of a live database. The
|     advantage of using VACUUM INTO is that the resulting backup
|     database is minimal in size and hence the amount of filesystem
|     I/O may be reduced.
| gwbas1c wrote:
| Does that preserve the indexes? As the TFA mentioned, the
| indexes are why the sqlite files are huge.
| M95D wrote:
| You're right. It does. I never thought about it until you
| asked.
| 4silvertooth wrote:
| I think it won't preserve the index but it will recreate
| the index while running the text sql.
| nine_k wrote:
| It's cool but it does not address the issue of indexes,
| mentioned in the original post. Not carrying index data over
| the slow link was the key idea. The VACUUM INTO approach keeps
| indexes.
|
| A text file may be inefficient as is, but it's perfectly
| compressible, even with primitive tools like gzip. I'm not sure
| the SQLite binary format compresses equally well, though it
| might.
| vlovich123 wrote:
| > A text file may be inefficient as is, but it's perfectly
| compressible, even with primitive tools like gzip. I'm not
| sure the SQLite binary format compresses equality well,
| though it might.
|
| I hope you're saying because of indexes? I think you may want
| to revisit how compression works to fix your intuition.
| Text+compression will always be larger and slower than
| equivalent binary+compression, assuming text and binary
| represent the same contents. Why? Binary is less compressible
| as a percentage but starts off smaller in absolute terms,
| which results in a smaller absolute size. A way to
| think about it is information theory - binary should
| generally represent the data more compactly already because
| the structure lives in the code. Compression is about
| replacing common structure with noise and it works better if
| there's a lot of redundant structure. However, while text has
| a lot of redundant structure, that's actually bad for the
| compressor because it has to find that structure and process
| more data to do so. Additionally, it's using generic
| mathematical techniques to remove that structure, which are
| generically optimal but not as optimal as removing that
| structure by hand via a binary format.
|
| There's some nuance here because the text represents slightly
| different things than the raw binary SQLite (how to restore
| data in the db vs the precise relationships + data structures
| for allowing insertion/retrieval). But still I'd expect it to
| end up smaller compressed for non-trivial databases.
| dunham wrote:
| Below I'm discussing compressed size rather than how
| "fast" it is to copy databases.
|
| Yeah there are indexes. And even without indexes there is
| an entire b-tree sitting above the data. So we're weighing
| the benefits of having a domain dependent compression
| (binary format) vs dropping all of the derived data. I'm
| not sure how that will go, but let's try one.
|
| Here is a sqlite file containing metadata for Apple's Photos
| application:
|
|     767979520  May  1 07:28  Photos.sqlite
|
| Doing a VACUUM INTO:
|
|     719785984  May  1 08:56  photos.sqlite
|
| gzip -k photos.sqlite (this took 20 seconds):
|
|     303360460  May  1 08:56  photos.sqlite.gz
|
| sqlite3 -readonly photos.sqlite .dump > photos.dump (10 seconds):
|
|    1277903237  May  1 09:01  photos.dump
|
| gzip -k photos.dump (21 seconds):
|
|     285086642  May  1 09:01  photos.dump.gz
|
| About 6% smaller for the dump vs the original binary (but there
| are a bunch of indexes in this one). For me, I don't think
| it'd be worth the small space savings to spend the extra
| time doing the dump.
|
| With indexes dropped and vacuumed, the compressed binary is
| 8% smaller than compressed text (despite btree overhead):
|
|     566177792  May  1 09:09  photos_noindex.sqlite
|     262067325  May  1 09:09  photos_noindex.sqlite.gz
|
| About 13.5% smaller than the compressed binary with indices.
| And one could re-add the indices on the other side.
| ForOldHack wrote:
| Brilliant. >60% savings. 700mb? wow.
| cnewey wrote:
| Is that really necessary?
| nine_k wrote:
| Depending on the bandwidth at the target site, which may
| be pretty remote, and not exposing a public internet
| service.
| cnewey wrote:
| Ah no, I meant "is the snark necessary?" to the parent
| comment. I enjoyed the read!
| vlovich123 wrote:
| Yup, these results are pretty consistent with what I'd
| expect (& why I noted the impact of indices) cause even
| string data has a lot of superfluous information when
| expressed in the DDL ("INSERT INTO foo ...") - I would
| expect all of that to exceed any bookkeeping within the
| btree. And non-string values like blobs or numbers are
| going to be stored more efficiently than in the dump
| which is a text encoding (or even hex for blobs) which is
| going to blow things up further.
| conradev wrote:
| SQLite tosses out the SQL once it is parsed into bytecode.
| Using text is just going to take longer, even though I'm sure
| it works great.
|
| You can modify the database before vacuuming by making a new
| in-memory database, copying selected tables into it, and then
| vacuuming that to disk.
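|
| A rough sketch of that trick (names are placeholders; note that
| CREATE TABLE ... AS SELECT copies only the data, not indexes or
| constraints):
|
|     sqlite3 :memory: "
|       ATTACH 'my_remote_database.db' AS src;
|       CREATE TABLE t AS SELECT * FROM src.t;
|       VACUUM INTO 'trimmed.db';
|     "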
| nine_k wrote:
| This should be the accepted answer.
| xnx wrote:
| Great example of Cunningham's Law:
| https://en.wikipedia.org/wiki/Ward_Cunningham#:~:text=%22Cun...
| bluefirebrand wrote:
| All this obsession with making processes like this faster
|
| When is a guy supposed to get a coffee and stretch his legs
| anymore?
| gwbas1c wrote:
| I wonder if there's a way to export to parquet files? They are
| designed to be extremely compact.
| kazinator wrote:
| Does the author not know that rsync can use compression (rsync -z
| | --compress | --compress-level=<n> ), or does he not think it
| worthwhile to compare that data point?
|
| I just tried some comparisons (albeit with a fairly small sqlite
| file). The text compressed to only about 84% of the size of the
| compressed binary database, which isn't negligible, but not
| necessarily worth fussing over in every situation. (The binary
| compressed to 7.1%, so it's 84% relative to that).
|
| bzip2 performed better on both formats; its compression of the
| binary database was better than gzip's compression of the text
| (91.5%) and bzip2's text was better than its binary (92.5%).
|
| Though that is not available inside rsync, it indicates that if
| you're going with an external compression solution, maybe gzip
| isn't the best choice if you care about every percentage
| reduction.
|
| If you don't care about every percentage reduction, maybe just
| rsync compression.
|
| One thing worth mentioning is that if you are updating the file,
| rsync will only compress what is sent. To replicate that with the
| text solution, you will have to be retaining the text on both
| sides to do the update between them.
| isaacvando wrote:
| Nice!
| alienbaby wrote:
| I'm surprised sqlite is duplicating data to make indexes? Surely
| it would just be manipulating groups of pointers?
| nodesocket wrote:
| I've been looking into a way to replicate a SQLite database and
| came across the LiteFS project by Fly.io. Seems like a solid
| drop-in solution backed by FUSE and Consul. Anybody used it in
| production? My use case is high availability between multiple
| VMs.
| baxter001 wrote:
| What technologies we have in 2025!
| quantadev wrote:
| Since sqlite is just a simple file-level locking DB, I'm pretty
| shocked they don't have an option to let the indexes be stored in
| separate files for all kinds of obvious and beneficial reasons,
| like the fact that you can easily exclude them from backups if
| they were, and you can make them "rebuild" just by deleting them.
| Probably their reason for keeping it all internal has to do with
| being sure indexes are never out of sync, but that could just as
| easily be accomplished with hashing algos.
| conradev wrote:
| SQLite has an official tool for this, fwiw:
| https://www.sqlite.org/rsync.html
|
| It works at the page level:
|
| > The protocol is for the replica to send a cryptographic hash of
| each of its pages over to the origin side, then the origin sends
| back the complete content of any page for which the hash does not
| match.
| limaoscarjuliet wrote:
| In case someone is looking for a proper way of making a db copy in
| PostgreSQL and not SQLite, pg_basebackup is the proper way:
| https://www.postgresql.org/docs/current/app-pgbasebackup.htm...
| deepsun wrote:
| "Skip the index and turn it on later" is an old trick from
| mainframes and DB2.
___________________________________________________________________
(page generated 2025-05-01 23:00 UTC)