[HN Gopher] A faster way to copy SQLite databases between computers
       ___________________________________________________________________
        
       A faster way to copy SQLite databases between computers
        
       Author : ingve
       Score  : 394 points
       Date   : 2025-05-01 11:15 UTC (11 hours ago)
        
 (HTM) web link (alexwlchan.net)
 (TXT) w3m dump (alexwlchan.net)
        
       | dundundundun wrote:
       | This is basically the way every other database is moved around.
        
       | d34th0rl1f3 wrote:
       | You can save time by using `zcat` instead of `cat` and skip the
       | `gunzip my_local_database.db.txt.gz` step.
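        | 
        | A minimal sketch of that shortened restore step (filenames as
        | in the article):
        | 
        |     zcat my_local_database.db.txt.gz | sqlite3 my_local_database.db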
        
         | _joel wrote:
          | You could speed it up by using pigz (parallel gzip) too.
        
           | masklinn wrote:
           | If you're going to use a less universal tool for compression
           | you might as well go with zstd.
        
         | Slasher1337 wrote:
         | Why would anyone use gzip instead of zstd in 2025? zstd is
         | superior in every dimension.
         | 
         | gzip is a legacy algorithm that imo only gets used for
         | compatibility with legacy software that understands nothing but
         | gzip.
        
         | nodesocket wrote:
         | You don't need cat at all for the restore. Can simply do:
         | 
         | sqlite3 data/database.db < "{backup_file}"
        
       | pvorb wrote:
       | How long does this procedure take in comparison to the network
       | transfer?
       | 
        | My first try would've been to copy the db file, gzip it and
        | then transfer it, but I can't tell whether compression is that
        | useful on the binary format.
        
         | MPSimmons wrote:
         | The sqlite file format (https://www.sqlite.org/fileformat.html)
         | does not talk about compression, so I would wager unless you
         | are storing already compressed content (media maybe?) or random
         | numbers (encrypted data), it should compress reasonably well.
        
           | chasil wrote:
           | Native compression in sqlite is offered as a closed and
           | licensed extension.
           | 
           | https://sqlite.org/com/cerod.html
        
       | simlevesque wrote:
       | In DuckDB you can do the same but export to Parquet, this way the
       | data is an order of magnitude smaller than using text-based SQL
       | statements. It's faster to transfer and faster to load.
       | 
       | https://duckdb.org/docs/stable/sql/statements/export.html
        
         | uwemaurer wrote:
          | you can do it with a command line like this:
          | 
          |     duckdb -c "attach 'sqlite-database.db' as db; copy
          |     db.table_name to 'table_name.parquet' (format parquet,
          |     compression zstd)"
          | 
          | in my test database this is about 20% smaller than the
          | gzipped text SQL statements.
        
           | simlevesque wrote:
           | That's not it. This only exports the table's data, not the
           | database. You lose the index, comments, schemas,
           | partitioning, etc... The whole point of OP's article is how
           | to export the indices in an efficient way.
           | 
            | You'd want to do this:
            | 
            |     duckdb -c "ATTACH 'sqlite-database.db' (READ_ONLY);
            |     EXPORT DATABASE 'target_directory' (FORMAT parquet,
            |     COMPRESSION zstd)"
           | 
            | Also I wonder how big your test database is and what its
            | schema looks like. For large tables Parquet gives way more
            | than a 20% reduction.
           | 
            | If there are UUIDs, they're 36 bytes each in text mode and
            | 16 bytes as binary in Parquet. And then if they repeat you
            | can use a dictionary in your Parquet to store the 16 bytes
            | only once.
           | 
            | It's also worth trying brotli instead of zstd if small
            | files are your goal.
        
       | mromanuk wrote:
        | I was surprised that he didn't try to use on-the-fly
        | compression, provided by rsync:
        | 
        |     -z, --compress              compress file data during the transfer
        |     --compress-level=NUM        explicitly set compression level
        | 
        | It's probably faster to compress with gzip and transfer
        | afterwards, but it's nice to have the possibility to improve
        | the transfer with a flag.
        
         | jddj wrote:
         | Or better yet, since they cite corruption issues, sqlite3_rsync
         | (https://sqlite.org/rsync.html) with -z
         | 
         | sqlite transaction- and WAL-aware rsync with inflight
         | compression.
        
         | worldsavior wrote:
         | I believe compression is only good on slow speed networks.
        
           | berbec wrote:
           | Valve tends to take a different view...
        
             | stackskipton wrote:
              | Valve has different needs than most. Their files rarely
              | change, so they only need to do expensive compression
              | once, and they save a ton in bandwidth/storage, along
              | with the fact that their users are more tolerant of slow
              | downloads.
        
           | rollcat wrote:
           | Depends. Run a benchmark on your own hardware/network. ZFS
           | uses in-flight compression because CPUs are generally faster
           | than disks. That may or may not be the case for your setup.
        
           | PhilipRoman wrote:
           | It would have to be one really fast network... zstd
           | compresses and decompresses at 5+ GB (bytes, not bits) per
           | second.
        
             | worldsavior wrote:
             | Where are you getting this performance? On the average
             | computer this is by far not the speed.
        
             | o11c wrote:
              | I just tested on a ramdisk:
              | 
              |     tool  cspeed    size  dspeed
              |     zstd  361 MB/s  16%   1321 MB/s
              |     lzop  512 MB/s  29%    539 MB/s
              |     lz4   555 MB/s  29%   1190 MB/s
             | 
             | If working from files on disk that happen not to be cached,
             | the speed differences are likely to disappear, even on many
             | NVMe disks.
             | 
             | (It just so happens that the concatenation of all text-
             | looking .tar files I happen to have on this machine is
             | roughly a gigabyte (though I did the math for the actual
             | size)).
        
               | PhilipRoman wrote:
                | Looks like it depends heavily on the choice of file,
                | but I see good performance on both compressible and
                | incompressible files. Small files tend to perform
                | (relatively) badly though. Here is a sample of 3 large
                | files with different compression ratios:
                | 
                |     zstd -b1 --fast -i10 some-rpi-os-image-idk.img
                |     -1#-os-image-idk.img :2097152000 -> 226798302 (x9.247), 6765.0 MB/s, 5897.3 MB/s
                | 
                |     zstd -b1 --fast -i10 jdk-11.0.8+10.tar
                |     -1#jdk-11.0.8+10.tar : 688844800 -> 142114709 (x4.847), 2660.7 MB/s, 2506.8 MB/s
                | 
                |     zstd -b1 --fast -i10 archlinux-2025.04.01-x86_64.iso
                |     -1#.04.01-x86_64.iso :1236303872 ->1221271223 (x1.012), 3643.5 MB/s, 7836.6 MB/s
        
             | masklinn wrote:
              | Ain't no way zstd compresses at 5+, even at -1. That's
              | the sort of throughput you see from lz4 running on a
              | bunch of cores (either half a dozen very fast, or 12~16
              | merely fast).
        
           | cogman10 wrote:
           | Is the network only doing an rsync? Then you are probably
           | right.
           | 
           | For every other network, you should compress as you are
           | likely dealing with multiple tenants that would all like a
           | piece of your 40Gbps bandwidth.
        
             | worldsavior wrote:
                | By that logic, you should not compress, as multiple
                | tenants would all like a piece of your CPU.
        
               | cogman10 wrote:
               | This will always be something you have to determine for
               | your own situation. At least at my work, CPU cores are
               | plentiful, IO isn't. We rarely have apps that need more
               | than a fraction of the CPU cores (barring garbage
               | collection). Yet we are often serving fairly large chunks
               | of data from those same apps.
        
           | creatonez wrote:
           | What? Compression is absolutely essential throughout
           | computing as a whole, especially as CPUs have gotten faster.
           | If you have compressible data sent over the network (or even
           | on disk / in RAM) there's a good chance you should be
           | compressing it. Faster links have not undercut this reality
           | in any significant way.
        
             | bityard wrote:
              | Whether or not to compress data before transfer is VERY
              | situationally dependent. I have seen it go both ways and
              | the real-world results do not always match intuition.
             | At the end of the day, if you care about performance, you
             | still have to do proper testing.
             | 
             | (This is the same spiel I give whenever someone says swap
             | on Linux is or is not always beneficial.)
        
         | berbec wrote:
         | or used --remove-source-files so they didn't have to ssh back
         | to rm
        
         | crazygringo wrote:
         | The main point is to skip the indices, which you have to do
         | pre-compression.
         | 
         | When I do stuff like this, I stream the dump straight into
         | gzip. (You can usually figure out a way to stream directly to
         | the destination without an intermediate file at all.)
         | 
          | Plus this way it stays stored compressed at its destination,
          | if your purpose is backup rather than a poor man's
          | replication.
        
           | schnable wrote:
           | The main point was decreasing the transfer time - if rsync -z
           | makes it short enough, it doesn't matter if the indices are
           | there or not, and you also skip the step of re-creating the
           | DB from the text file.
        
             | crazygringo wrote:
              | The point of the article is that it _does_ matter if the
              | indices are there. And indices generally don't compress
              | very well anyways. What compresses well are usually things
              | like human-readable text fields or booleans/enums.
        
         | Jyaif wrote:
          | He absolutely should be doing this, because by using rsync on
          | a compressed file he's throwing away the whole point of using
          | rsync, which is the rolling-checksum-based algorithm that
          | allows it to transfer only diffs.
        
       | rabysh wrote:
       | I think this could be a single pipeline?
       | 
       | ssh username@server "sqlite3 my_remote_database.db .dump | gzip
       | -c" | gunzip -c | sqlite3 my_local_database.db
        
         | rabysh wrote:
          | gzip/gunzip might also be redundant if using ssh compression
          | with -o Compression=yes or -C on the ssh call
        
         | sneak wrote:
         | My first thought, too. It also seems somewhat glaringly obvious
         | that it needs a `pv` in there, as well.
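          | 
          | Something like this, assuming pv is installed (names borrowed
          | from the pipeline above):
          | 
          |     ssh -C username@server "sqlite3 my_remote_database.db .dump" \
          |       | pv | sqlite3 my_local_database.db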
        
       | feverzsj wrote:
       | That makes zero sense. Incremental backup via rsync/sqlite3_rsync
       | should always be faster.
        
         | Retr0id wrote:
         | For incremental backups sure, but I think OP's solution would
         | win for one-off snapshots.
        
       | iambear wrote:
        | I usually use scp for this case; sometimes the rsync versions
        | aren't compatible between the two machines.
        
       | Cthulhu_ wrote:
        | I used to work at a company that had a management interface
        | that used sqlite as its database; its multi-node / failover
        | approach was also just... copying the file and rsyncing it. I
        | did wonder about
       | data integrity though, what if the file is edited while it's
       | being copied over? But there's probably safeguards in place.
       | 
       | Anyway I don't think the database file size was really an issue,
       | it was a relatively big schema but not many indices and
       | performance wasn't a big consideration - hence why the backend
       | would concatenate query results into an XML file, then pass it
       | through an xml->json converter, causing 1-2 second response times
       | on most requests. I worked on a rewrite using Go where requests
       | were more like 10-15 milliseconds.
       | 
       | But, I still used sqlite because that was actually a pretty good
       | solution for the problem at hand; relatively low concurrency (up
       | to 10 active simultaneous users), no server-side dependencies or
       | installation needed, etc.
        
         | MPSimmons wrote:
         | >I did wonder about data integrity though, what if the file is
         | edited while it's being copied over? But there's probably
         | safeguards in place.
         | 
          | You could do a filesystem snapshot and copy from that, but
          | neither cp nor rsync is atomic.
        
           | formerly_proven wrote:
           | sqlite3 has a backup API for this, which you can invoke using
           | the .backup command in the sqlite3 CLI.
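            | 
            | For example (the target filename here is just an
            | illustration):
            | 
            |     sqlite3 my_remote_database.db ".backup backup.db"
            | 
            | which drives the backup API, so it's safe to run against a
            | live database.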
           | 
           | And then there is also https://www.sqlite.org/rsync.html
        
         | rollcat wrote:
         | SQLite has a write-ahead log (WAL). You can use Litestream on
         | top of that. You get single RW, multiple readers (you lose the
         | C in CAP), and can promote a reader when the writer fails.
        
       | forinti wrote:
       | One of the coolest things you can do with Postgresql is pipe
       | pg_dump straight into psql connected to another cluster on
       | another host.
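        | 
        | A rough sketch (host and database names are made up, and the
        | target database has to exist already):
        | 
        |     pg_dump -h old-host mydb | psql -h new-host mydb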
        
         | actinium226 wrote:
         | I recently set up some scripts to do this and it wasn't quite
         | as simple as I had hoped. I had to pass some extra flags to
         | pg_restore for --no-owner --no-acl, and then it still had
         | issues when the target db has data in it, even with --clean and
         | --create. And sometimes it would leave me in a state where it
         | dropped the database and had trouble restoring, and so I'd be
         | totally empty.
         | 
         | What I ended up doing is creating a new database,
         | pg_restore'ing into that one with --no-owner and --no-acl,
         | forcibly dropping the old database, and then renaming the new
         | to the old one's name. This has the benefit of not leaving me
         | high and dry should there be an issue with restoring.
        
       | rarrrrrr wrote:
        | If you're regularly syncing from an older version to a newer
        | version, you can likely optimize further using gzip with the
        | "--rsyncable" option. It will reduce the compression by ~1% but
        | make it so differences from one version to the next are
        | localized instead of cascading through the full length of the
        | compressed output.
        | 
        | Another alternative is to skip compression of the dump output,
        | let rsync calculate the differences from a previous
        | uncompressed dump to the current dump, then have rsync
        | compress the change sets it sends over the network. (rsync -z)
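        | 
        | Roughly, the first approach looks like this (paths and host
        | are placeholders, and your gzip build needs --rsyncable
        | support):
        | 
        |     sqlite3 my_remote_database.db .dump | gzip --rsyncable \
        |       > my_database.sql.gz
        |     rsync my_database.sql.gz user@backup-host:backups/
        | 
        | and the second keeps an uncompressed dump on both ends and
        | lets rsync compress only what it sends:
        | 
        |     sqlite3 my_remote_database.db .dump > my_database.sql
        |     rsync -z my_database.sql user@backup-host:backups/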
        
       | ukuina wrote:
       | Doesn't this just push the runtime into index recomputation on
       | the destination database?
        
         | masklinn wrote:
         | Yes, however they seem to have a pretty slow internet
         | connection
         | 
         | > Downloading a 250MB database from my web server takes about a
         | minute over my home Internet connection
         | 
          | So for the original 3.4GB database that's nearly 15 minutes
          | of waiting for the download.
        
       | zeroq wrote:
        | How to copy databases between computers? Just send a circle
        | and forget about the rest of the owl.
        | 
        | As others have mentioned an incremental rsync would be much
        | faster, but what bothers me the most is that he claims that
        | sending SQL statements is faster than sending the database
        | while COMPLETELY omitting the fact that you have to execute
        | these statements. And then run /optimize/. And then run
        | /vacuum/.
        | 
        | Currently I have a scenario in which I have to "incrementally
        | rebuild *" a database from CSV files. Even though in my
        | particular case recreating the database from scratch is the
        | more optimal route, despite heavy optimization it still takes
        | half an hour just to run batch inserts on an empty database in
        | memory, create indexes, etc.
        
         | JamesonNetworks wrote:
          | 30 minutes seems long. Is there a lot of data? I've been
          | working on bootstrapping sqlite dbs off of lots of json data,
          | and by holding a list of values and then inserting 10k rows
          | at a time, I've found a good perf sweet spot where I can
          | insert plenty of rows (millions) in minutes. I had to use
          | some tricks with bloom filters and LRU caching, but can build
          | a 6 gig db in like 20ish minutes now.
        
           | pessimizer wrote:
           | Saying that 30 minutes seems long is like saying that 5 miles
           | seems far.
        
           | thechao wrote:
            | Millions of rows in minutes sounds not OK, unless your
            | tables have a large number of columns. A good rule is that
            | SQLite's insertion performance should be at least 1% of the
            | sustained max write bandwidth of your disk; preferably 5%
            | or more. On the last bulk table insert I did, I was seeing
            | 20%+ sustained; that came to ~900k inserts/second for an
            | 8-column INT table (small integers).
        
           | zeroq wrote:
           | It's roughly 10Gb across several CSV files.
           | 
            | I create a new in-mem db, run the schema and then import
            | every table in one single transaction (my testing showed
            | that it doesn't matter if it's a single batch or multiple
            | single inserts, as long as they are part of a single
            | transaction).
           | 
           | I do a single string replacement per every CSV line to handle
           | an edge case. This results in roughly 15 million inserts per
           | minute (give or take, depending on table length and
           | complexity). 450k inserts per second is a magic barrier I
           | can't break.
           | 
           | I then run several queries to remove unwanted data, trim
           | orphans, add indexes, and finally run optimize and vacuum.
           | 
            | Here's a quite recent log (on a stock Ryzen 5900X):
            | 
            |     08:43 import
            |     13:30 delete non-essentials
            |     18:52 delete orphans
            |     19:23 create indexes
            |     19:24 optimize
            |     20:26 vacuum
        
         | iveqy wrote:
         | I hope you've found
         | https://stackoverflow.com/questions/1711631/improve-insert-p...
         | 
         | It's a very good writeup on how to do fast inserts in sqlite3
        
           | jgalt212 wrote:
           | yes, but they punt on this issue:
           | 
           | CREATE INDEX then INSERT vs. INSERT then CREATE INDEX
           | 
           | i.e. they only time INSERTs, not the CREATE INDEX after all
           | the INSERTs.
        
           | zeroq wrote:
           | Yes! That was actually quite helpful.
           | 
           | For my use case (recreating in-memory from scratch) it
           | basically boils down to three points: (1) journal_mode = off
           | (2) wrapping all inserts in a single transaction (3) indexes
           | after inserts.
           | 
            | For whatever it's worth I'm getting 15M inserts per minute
            | on average, and topping out around 450k/s for a trivial
            | relationship table, on a stock Ryzen 5900X using the
            | built-in sqlite from NodeJS.
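            | 
            | Roughly, those three points look like this (table and
            | column names are just placeholders):
            | 
            |     PRAGMA journal_mode = OFF;
            |     BEGIN;
            |     INSERT INTO my_table VALUES (...);  -- millions of these
            |     COMMIT;
            |     CREATE INDEX my_index ON my_table (my_column);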
        
             | vlovich123 wrote:
             | Would it be useful for you to have a SQL database that's
             | like SQLite (single file but not actually compatible with
             | the SQLite file format) but can do 100M/s instead?
        
               | zeroq wrote:
               | Not really.
               | 
                | I tested a couple of different approaches, including
                | pglite, but node finally shipped native sqlite with
                | version 23 and it's fine for me.
                | 
                | I'm a huge fan of serverless solutions and one of the
                | absolute hidden gems about sqlite is that you can
                | publish the database on an http server and query it
                | extremely efficiently from a client.
                | 
                | I even have a separate miniature benchmark project I
                | thought I might publish, but then I decided it's not
                | worth anyone's time. x]
        
           | o11c wrote:
           | It's worth noting that the data in that benchmark is tiny
           | (28MB). While this varies between database engines, "one
           | transaction for everything" means keeping some kind of
           | allocations alive.
           | 
           | The optimal transaction size is difficult to calculate so
           | should be measured, but it's almost certainly never
           | _beneficial_ to spend multiple seconds on a single
           | transaction.
           | 
           | There will also be weird performance changes when the size of
           | data (or indexed data) exceeds the size of main memory.
        
           | gibibit wrote:
           | Hilarious, 3000+ votes for a Stack Overflow question that's
           | not a question. But it is an interesting article. Interesting
           | enough that it gets to break all the rules, I guess?
        
             | detaro wrote:
             | It's a (quite old) community wiki post. These do (and
             | especially did back then) work and are treated differently.
        
         | stackskipton wrote:
          | As with any optimization, it matters where your bottleneck
          | is. It sounds like theirs is bandwidth while CPU/disk IO is
          | plentiful, since they mention that downloading a 250MB
          | database takes a minute, whereas I just grabbed a 2GB SQLite
          | test database from a work server in 15 seconds thanks to
          | 1Gbps fiber.
        
       | RenThraysk wrote:
        | SQLite has a session extension, which will track changes to a
        | set of tables and produce a changeset/patchset which can patch
        | a previous version of an SQLite database.
       | 
       | https://www.sqlite.org/sessionintro.html
        
         | simonw wrote:
         | Have you used that? I've read the documentation but I don't
         | think I've ever heard from anyone who uses the extension.
        
           | RenThraysk wrote:
            | I have, at least to confirm it does what it says on the
            | tin.
            | 
            | The idea is an offline-first app, where each app install
            | can pull a changeset and apply it to its local db.
        
         | oefrha wrote:
         | I have yet to see a single SQLite binding supporting this, so
         | it's quite useless unless you're writing your application in C,
         | or are open to patching the language binding.
         | 
          | In one of my projects I have implemented my own poor man's
          | session by writing all the statements and parameters into a
          | separate database, then syncing that and replaying. It works
          | well enough for a ~30GB database that changes by ~0.1% every
          | day.
        
           | RenThraysk wrote:
            | There are at least two SQLite bindings for Go.
           | 
           | https://github.com/crawshaw/sqlite
           | 
           | https://github.com/eatonphil/gosqlite/
           | 
           | Ended up with the latter, but did have to add one function
           | binding in C, to inspect changesets.
        
             | ncruces wrote:
             | I'm open to adding it to my driver, if people consider it
             | essential.
             | 
             | Every extra bit makes AOT compiling the Wasm slower
             | (impacting startup time).
             | 
             | I also wanna keep the number of variants reasonable, or my
             | repo blows up.
             | 
             | Add your votes for additional features to this issue:
             | https://github.com/ncruces/go-sqlite3/issues/126
        
           | paulclinger wrote:
           | I have updated the Lua binding to support the session
           | extension (http://lua.sqlite.org/home/timeline?r=session) and
           | it's been integrated into the current version of
           | cosmopolitan/redbean. This was partially done to support
           | application-level sync of SQLite DBs, however this is still a
           | work in progress.
        
         | nickpeterson wrote:
         | I really wish SQLite had some default way of doing change data
         | capture via session or something similar.
        
       | jbverschoor wrote:
        | I'm curious how your indices are twice the data. Sounds like
        | you just put indices on anything you see.
        
         | hobs wrote:
         | Hah they need to try harder then, I have seen more than 20x the
         | data volume in systems where people are both paranoid and
         | ignorant, a dangerous combo!
        
         | crazygringo wrote:
         | I definitely have databases like this.
         | 
         | It's not carelessness, it's performance.
         | 
         | Quite simply, I have a table with 4 columns -- A, B, C, D. Each
         | column is just an 8-byte integer. It has hundreds of millions
         | of rows. It has an index on B+C+D, an index on C+D, and one on
         | D.
         | 
         | All of these are required because the user needs to be able to
         | retrieve aggregate data based on range conditions around lots
         | of combinations of the columns. Without all the indices,
         | certain queries take a couple minutes. With them, each query
         | takes milliseconds to a couple seconds.
         | 
         | I thought of every possible way to avoid having all three
         | indices, but it just wasn't possible. It's just how performant
         | data lookup works.
         | 
         | You shouldn't assume people are being careless with indices.
         | Far too often I see the opposite.
        
       | Levitating wrote:
        | I am sure you can just pipe all of this so you don't have to
        | use an intermediate gzipped file.
       | 
       | Just ssh the machine, dump the SQL and load it back into SQLite
       | locally.
        
         | rollcat wrote:
         | rsync will transmit only the delta between the source and
         | destination.
        
         | wang_li wrote:
         | I've seen a suggestion several times to compress the data
         | before sending. If remote means in the same data center,
         | there's a good chance compressing the data is just slowing you
         | down. Not many machines can gzip/bzip2/7zip at better than the
         | 1 gigabyte per second you can get from 10 Gbps networks.
        
       | whalesalad wrote:
       | it's a file - what am I missing? scp host:path .
        
         | ninth_ant wrote:
          | The entire point of the article is to answer this specific
          | question.
        
       | cwmma wrote:
        | How well does just the sqlite database gzip? The indexes are a
        | lot of redundant data, so you're going to get some
        | efficiencies there, but there's probably less locality of data
        | than in the text file, so maybe less?
        
       | nottorp wrote:
       | Wait... why would you even think about rsyncing a database that
       | can get changed while being copied?
       | 
       | Isn't this a case for proper database servers with replication?
       | 
       | Or if it's an infrequent process done for dev purposes just shut
       | down the application doing writes on the other side?
        
       | bambax wrote:
       | > _If it takes a long time to copy a database and it gets updated
       | midway through, rsync may give me an invalid database file. The
       | first half of the file is pre-update, the second half file is
       | post-update, and they don't match. When I try to open the
       | database locally, I get an error_
       | 
       | Of course! You can't copy the file of a running, active db
       | receiving updates, that can only result in corruption.
       | 
       | For replicating sqlite databases safely there is
       | 
       | https://github.com/benbjohnson/litestream
        
         | quotemstr wrote:
         | > Of course! You can't copy the file of a running, active db
         | receiving updates, that can only result in corruption
         | 
         | Do people really not understand how file storage works? I
         | cannot rightly apprehend the confusion of ideas that would
         | produce an attempt to copy a volatile database without
         | synchronization and expect it to work.
        
           | kccqzy wrote:
           | The confusion of ideas here is understandable IMO: people
           | assume everything is atomic. Databases of course famously
           | have ACID guarantees. But it's easy for people to assume
           | copying is also an atomic operation. Honestly if someone
           | works too much with databases and not enough with filesystems
           | it's a mistake easily made.
        
           | ahazred8ta wrote:
           | > I cannot rightly apprehend the confusion of ideas
           | 
           | I see you are a man of culture.
        
             | kccqzy wrote:
              | Charles Babbage was smart, but either he lacked the
              | empathy to understand other people or he was just saying
              | that deliberately for comedic effect.
        
               | jerf wrote:
                | It was early days... _very_ early days. He didn't have
                | the benefit of trying to help his (metaphorical)
                | grandparents get their emails, or of working under a
                | manager who thinks 2023-era ChatGPT is only slightly
                | less reliable than the Standard Model of Physics, if
                | not slightly more.
        
               | bambax wrote:
               | Oh he definitely lacked empathy.
               | 
               | But things haven't improved much. Today we have "prompt
               | engineers" whose only job is to input the right question
               | in order to get the right answer.
        
         | yellow_lead wrote:
         | Litestream looks interesting but they are still in beta, and
         | seem to have not had a release in over a year, although SQLite
         | doesn't move that quickly.
         | 
         | Is Litestream still an active project?
        
           | clintonb wrote:
           | Despite the beta label and lack of a 1.x release, I would
           | consider the project pretty stable. We've used it in
           | production for over 18 months to support an offline-first
           | point of sale system. We haven't had any issues with
           | Litestream.
        
         | pixl97 wrote:
         | >You can't copy the file of a running, active db receiving
         | updates, that can only result in corruption
         | 
         | There is a slight 'well akshully' on this. A DB flush and FS
         | snapshot where you copy the snapshotted file will allow this.
         | MSSQL VSS snapshots would be an example of this.
        
           | tpmoney wrote:
           | Similarly you can rsync a Postgres data directory safely
           | while the db is running, with the caveat that you likely lose
           | any data written while the rsync is running. And if you want
           | that data, you can get it with the WAL files.
           | 
           | It's been years since I needed to do this, but if I remember
           | right, you can clone an entire pg db live with a
           | `pg_backup_start()`, rsync the data directory,
           | pg_backup_stop() and rsync the WAL files written since backup
           | start.
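            | 
            | From memory, and with the paths and label made up, the
            | shape of it is something like this (on older Postgres the
            | functions are pg_start_backup()/pg_stop_backup()):
            | 
            |     psql -c "SELECT pg_backup_start('clone');"
            |     rsync -a /var/lib/postgresql/data/ \
            |       new-host:/var/lib/postgresql/data/
            |     psql -c "SELECT pg_backup_stop();"
            |     # then rsync the WAL files written since the backup started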
        
             | edoceo wrote:
              | For moving DBs where I'm allowed minutes of downtime I
              | do a (slow) rsync first from the live one while it's
              | hot, then just stop that one, then rsync again (fast),
              | then make the new one hot.
              | 
              | Works a treat when other (better) methods are not
              | available.
        
         | creatonez wrote:
         | > You can't copy the file of a running, active db receiving
         | updates, that can only result in corruption
         | 
         | To push back against "only" -- there is actually one scenario
         | where this works. Copying a file or a subvolume on Btrfs or ZFS
         | can be done atomically, so if it's an ACID database or an LSM
         | tree, in the worst case it will just rollback. Of course, if
         | it's multiple files you have to take care to wrap them in a
         | subvolume so that all of them are copied in the same
         | transaction, simply using `cp --reflink=always` won't do.
         | 
         | Possibly freezing the process with SIGSTOP would yield the same
         | result, but I wouldn't count on that
        
           | lmz wrote:
           | It can't be done without fs specific snapshots - otherwise
           | how would it distinguish between a cp/rsync needing
           | consistent reads vs another sqlite client wanting the newest
           | data?
        
             | o11c wrote:
             | Obligatory "LVM still exists and snapshots are easy enough
             | to overprovision for"
        
               | HumanOstrich wrote:
               | Taking an LVM snapshot and then copying the sqlite
               | database from that is sufficient to keep it from being
               | corrupted, but you can have incomplete transactions that
               | will be rolled back during crash recovery.
               | 
               | The problem is that LVM snapshots operate at the block
               | device level and only ensure there are no torn or half-
               | written blocks. It doesn't know about the filesystem's
               | journal or metadata.
               | 
               | To get a consistent point-in-time snapshot without
               | triggering crash-recovery and losing transactions, you
               | also need to lock the sqlite database or filesystem from
                | writes during the snapshot.
                | 
                |     PRAGMA wal_checkpoint(FULL);
                |     BEGIN IMMEDIATE;  -- locks out writers
                |     /* trigger your LVM snapshot here */
                |     COMMIT;
                | 
                | You can also use fsfreeze to get the same level of
                | safety:
                | 
                |     sudo fsfreeze -f /mnt/data   # (A) flush dirty pages & block writes
                |     lvcreate -L1G -s -n snap0 /dev/vg0/data
                |     sudo fsfreeze -u /mnt/data   # (B) thaw, resume writes
                | 
                | Bonus - validate the snapshotted db file with:
                | 
                |     sqlite3 mydb-snapshot.sqlite "PRAGMA integrity_check;"
        
             | ummonk wrote:
             | I would assume cp uses ioctl (with atomic copies of
             | individual files on filesystems that support CoW like APFS
             | and BTRFS), whereas sqlite probably uses mmap?
        
               | vlovich123 wrote:
               | I was trying to find evidence that reflink copies are
               | atomic and could not and LLMs seem to think they are not.
               | So at best may be a btrfs only feature?
        
               | creatonez wrote:
               | From Linux kernel documentation:
               | 
                | https://man7.org/linux/man-pages/man2/ioctl_ficlone.2.html
               | 
               | > Clones are atomic with regards to concurrent writes, so
               | no locks need to be taken to obtain a consistent cloned
               | copy.
               | 
               | I'm not aware of any of the filesystems that use it
               | (Btrfs, XFS, Bcachefs, ZFS) that deviate from expected
               | atomic behavior, at least with single files being the
               | atom in question for `FICLONE` operation.
        
               | vlovich123 wrote:
               | Thanks! I similarly thought that a file clone would work
               | but I couldn't confirm it.
        
         | wswope wrote:
         | The built-in .backup command is also intended as an official
         | tool for making "snapshotted" versions of a live db that can be
         | copied around.
        
         | jmull wrote:
         | If the corruption is detectable and infrequent enough for your
         | purposes, then it _does_ work, with a simple "retry until
         | success" loop. (That's how TCP works, for example.)
        
           | benbjohnson wrote:
            | Not all corruption is detectable. You could make a copy
            | during a transaction where only a subset of the
            | transaction's saved pages are persisted but all branch &
            | leaf pages are pointed to correctly. That would give you a
            | state of the database that never actually existed and
            | break atomicity.
        
             | jmull wrote:
             | > Not all corruption is detectable.
             | 
             | Well, I don't know rsync that well. If you're saying it
             | doesn't detect changes to files while it's being copied,
             | then I'll believe you.
             | 
             | And, as far as I know, it's impossible to detect absolutely
             | _all_ corruption.
             | 
             | But you can pretty easily detect, e.g., that a file has or
             | has not changed since before you copied it to after, on a
             | system with a basically functioning filesystem and clock,
             | with a reasonable/useful level of confidence.
        
         | lknuth wrote:
         | While I run and love litestream on my own system, I also like
         | that they have a pretty comprehensive guide on how to do
         | something like this manually, via built-in tools:
         | https://litestream.io/alternatives/cron/
        
         | yard2010 wrote:
          | Litestream is really cool! I'm planning to use it to back up
          | and restore my SQLite at the container level, just like that
          | ex-Google guy who started a small KVM startup and had a
          | flood in his warehouse while on vacation, if I'm not
          | mistaken. I would link the perfect guide he wrote here, but
          | there's 0 chance I'll find it. If you understand the
          | reference please post the link.
        
           | mtlynch wrote:
           | Haha, that sounds like me. Here's the writeup you're talking
           | about:
           | 
           | https://mtlynch.io/litestream/
           | 
           | And here's the flooding story:
           | 
           | https://mtlynch.io/solo-developer-year-6/#the-most-
           | terrifyin...
           | 
           | Sidenote: I still use Litestream in every project where I use
           | SQLite.
        
         | rsync wrote:
         | "For replicating sqlite databases safely there is (litestream)
         | ..."
         | 
         | A reminder that litestream can run over plain old SFTP[1] which
         | means you can stream database replication to just about any
         | UNIX endpoint over SSH.
         | 
         | I have a favorite[2] but any SFTP server will do ...
         | 
         | [1] https://github.com/benbjohnson/litestream/issues/140
         | 
         | [2]
         | https://www.rsync.net/resources/notes/2021-q3-rsync.net_tech...
        
       | ollybee wrote:
        | sqlite provides the sqlite3_rsync command to safely copy
        | databases: https://sqlite.org/rsync.html
        
       | npn wrote:
        | ???
        | 
        | Why not just compress the whole database using `gzip` or `lz4`
        | before rsyncing it instead? `zstd` works too, but it seems like
        | it had a bug regarding compressing files with modified content.
        | 
        | Better yet, split your sqlite file into smaller pieces. It is
        | not like it needs to contain all the app data in a single
        | sqlite file.
        
       | hundredwatt wrote:
        | The recently released sqlite3_rsync utility uses a version of the
       | rsync algorithm optimized to work on the internal structure of a
       | SQLite database. It compares the internal data pages efficiently,
       | then only syncs changed or missing pages.
       | 
       | Nice tricks in the article, but you can more easily use the
       | builtin utility now :)
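        | 
        | Usage is a single command, along the lines of (paths and host
        | here are placeholders):
        | 
        |     sqlite3_rsync /path/to/origin.db user@host:/path/to/replica.db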
       | 
       | I blogged about how it works in detail here:
       | https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...
        
         | jgalt212 wrote:
          | sqlite3_rsync can only be used in WAL mode. A further
          | constraint of WAL mode is that the database file must be
          | stored on a local disk. Clearly, you'd want to do this almost
          | all the time, but for the times this is not possible this
          | utility won't work.
        
           | SQLite wrote:
           | I just checked in an experimental change to sqlite3_rsync
           | that allows it to work on non-WAL-mode database files, as
           | long as you do _not_ use the --wal-only command-line option.
           | The downside of this is that the origin database will block
           | all writers while the sync is going on, and the replicate
           | database will block both reads and writers during the sync,
           | because to do otherwise requires WAL-mode. Nevertheless,
           | being able to sync DELETE-mode databases might well be
           | useful, as you observe.
           | 
           | If you are able, please try out this enhancement and let me
           | know if it solves your problem. See
           | <https://sqlite.org/src/info/2025-05-01T16:07Z> for the
           | patch.
        
             | SQLite wrote:
             | Update: This enhancement is now on trunk and will be
             | included in the 3.50.0 release of SQLite due out in about
             | four weeks.
        
         | construct0 wrote:
         | Demands increasing page size if you sync frequently
         | (bandwidth).
        
         | rsync wrote:
         | Also note:
         | 
         | sqlite3_rsync is now built into the rsync.net platform.
         | ssh user@rsync.net sqlite3_rsync ... blah blah ...
         | 
         | ... just added last week and not rolled out in all regions but
         | ... all initial users reported it worked exactly as they
         | expected it to.
        
       | actinium226 wrote:
        | I have recently discovered a tool called mscp which opens
        | multiple scp threads to copy down large files. It works great
        | for speeding up these sorts of downloads.
       | 
       | https://github.com/upa/mscp
        
       | 404mm wrote:
       | zstd would be a better choice. It's bonkers fast (especially when
       | used with multithreading) and still compresses better than gzip.
       | Alternatively, I'd recommend looking into bzip3, but I'm not sure
       | if it would save time.
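        | 
        | For the multithreaded zstd route, something like this
        | (filename borrowed from the article):
        | 
        |     sqlite3 my_remote_database.db .dump \
        |       | zstd -T0 -o my_database.sql.zst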
        
       | ozim wrote:
        | I guess for me it is obvious that you don't try to copy a
        | running DB, only a backup.
        | 
        | So I see basic stuff needs to be repeated, as people still miss
        | those kinds of things.
       | 
       | But I learned that you can easily dump SQLite to a text file -
       | neat!
        
       | yapyap wrote:
        | Very neat walkthrough, clear commands, and I appreciate the
        | explanations as to why this may help in OP's case.
        
       | markhahn wrote:
       | isn't this rather obvious? doesn't everyone do this when it makes
       | sense? obviously, it applies to other DBs, and you don't even
       | need to store the file (just a single ssh from dumper to remote
       | undumper).
       | 
       | if retaining the snapshot file is of value, great.
       | 
        | I'd be a tiny bit surprised if rsync could recognize diffs in
        | the dump, but it's certainly possible, assuming the dumper is
        | "stable" (it probably is, because it's walking the tables as
        | trees). The amount of change detected by rsync might actually
        | be a useful thing to monitor.
        
       | RKFADU_UOFCCLEL wrote:
        | Pretty good point. I just wonder if databases in general can
        | be perfectly reconstructed from a text dump. For instance, do
        | the insertion orders change in any of the operations between
        | dumping and importing?
        
       | M95D wrote:
        | Saving to a text file is inefficient. I save sqlite databases
        | using VACUUM INTO, like this:
        | 
        |     sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"
       | 
        | From https://sqlite.org/lang_vacuum.html :
        | 
        |     The VACUUM command with an INTO clause is an alternative
        |     to the backup API for generating backup copies of a live
        |     database. The advantage of using VACUUM INTO is that the
        |     resulting backup database is minimal in size and hence the
        |     amount of filesystem I/O may be reduced.
        
         | gwbas1c wrote:
         | Does that preserve the indexes? As the TFA mentioned, the
         | indexes are why the sqlite files are huge.
        
           | M95D wrote:
           | You're right. It does. I never thought about it until you
           | asked.
        
             | 4silvertooth wrote:
             | I think it won't preserve the index but it will recreate
             | the index while running the text sql.
        
         | nine_k wrote:
         | It's cool but it does not address the issue of indexes,
         | mentioned in the original post. Not carrying index data over
         | the slow link was the key idea. The VACUUM INTO approach keeps
         | indexes.
         | 
          | A text file may be inefficient as is, but it's perfectly
          | compressible, even with primitive tools like gzip. I'm not
          | sure the SQLite binary format compresses equally well,
          | though it might.
        
           | vlovich123 wrote:
           | > A text file may be inefficient as is, but it's perfectly
           | compressible, even with primitive tools like gzip. I'm not
           | sure the SQLite binary format compresses equality well,
           | though it might.
           | 
            | I hope you're saying that because of indexes? I think you
            | may want to revisit how compression works to fix your
            | intuition. Text+compression will always be larger and
            | slower than equivalent binary+compression, assuming the
            | text and binary represent the same contents. Why? Binary
            | is less compressible as a percentage but starts off
            | smaller in absolute terms, which will result in a smaller
            | absolute size. A way to think about it is information
            | theory - binary should generally represent the data more
            | compactly already, because the structure lives in the
            | code. Compression is about replacing common structure with
            | noise, and it works better if there's a lot of redundant
            | structure. However, while text has a lot of redundant
            | structure, that's actually bad for the compressor, because
            | it has to find that structure and process more data to do
            | so. Additionally, it is using generic mathematical
            | techniques to remove that structure, which are generically
            | optimal but not as optimal as removing that structure by
            | hand in a binary format.
           | 
            | There's some nuance here because the text represents
            | slightly different things than the raw binary SQLite (how
            | to restore the data in the db vs the precise relationships
            | + data structures for allowing insertion/retrieval). But
            | I'd still expect it to end up smaller compressed for
            | non-trivial databases.
        
             | dunham wrote:
              | Below I'm discussing compressed size rather than how
              | "fast" it is to copy databases.
              | 
              | Yeah, there are indexes. And even without indexes there
              | is an entire b-tree sitting above the data. So we're
              | weighing the benefits of having domain-dependent
              | compression (the binary format) vs dropping all of the
              | derived data. I'm not sure how that will go, but let's
              | try one.
             | 
              | Here is a sqlite file containing metadata for Apple's
              | Photos application:
              | 
              |     767979520 May  1 07:28 Photos.sqlite
              | 
              | Doing a VACUUM INTO:
              | 
              |     719785984 May  1 08:56 photos.sqlite
              | 
              | gzip -k photos.sqlite (this took 20 seconds):
              | 
              |     303360460 May  1 08:56 photos.sqlite.gz
              | 
              | sqlite3 -readonly photos.sqlite .dump > photos.dump (10
              | seconds):
              | 
              |     1277903237 May  1 09:01 photos.dump
              | 
              | gzip -k photos.dump (21 seconds):
              | 
              |     285086642 May  1 09:01 photos.dump.gz
             | 
             | About 6% smaller for dump vs the original binary (but there
             | are a bunch of indexes in this one). For me, I don't think
             | it'd be worth the small space savings to spend the extra
             | time doing the dump.
             | 
              | With indexes dropped and vacuumed, the compressed binary
              | is 8% smaller than compressed text (despite btree
              | overhead):
              | 
              |     566177792 May  1 09:09 photos_noindex.sqlite
              |     262067325 May  1 09:09 photos_noindex.sqlite.gz
             | 
             | About 13.5% smaller than compressed binary with indices.
             | And one could re-add the indices on the other side.
        
               | ForOldHack wrote:
               | Brilliant. >60% savings. 700mb? wow.
        
               | cnewey wrote:
               | Is that really necessary?
        
               | nine_k wrote:
               | Depending on the bandwidth at the target site, which may
               | be pretty remote, and not exposing a public internet
               | service.
        
               | cnewey wrote:
               | Ah no, I meant "is the snark necessary?" to the parent
               | comment. I enjoyed the read!
        
               | vlovich123 wrote:
                | Yup, these results are pretty consistent with what I'd
                | expect (& why I noted the impact of indices), because
                | even string data has a lot of superfluous information
                | when expressed as SQL statements ("INSERT INTO foo
                | ...") - I would expect all of that to exceed any
                | bookkeeping within the btree. And non-string values
                | like blobs or numbers are going to be stored more
                | efficiently than in the dump, which is a text encoding
                | (or even hex for blobs) that blows things up further.
        
           | conradev wrote:
           | SQLite tosses out the SQL once it is parsed into bytecode.
           | Using text is just going to take longer, even though I'm sure
           | it works great.
           | 
           | You can modify the database before vacuuming by making a new
           | in-memory database, copying selected tables into it, and then
           | vacuuming that to disk.
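            | 
            | A rough sketch of that (table and file names are
            | placeholders):
            | 
            |     sqlite3 :memory: "
            |       ATTACH 'my_remote_database.db' AS src;
            |       CREATE TABLE my_table AS SELECT * FROM src.my_table;
            |       DETACH src;
            |       VACUUM INTO 'backup.db';"
            | 
            | CREATE TABLE ... AS SELECT only copies the data, so the
            | indexes don't come along for the ride.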
        
             | nine_k wrote:
             | This should be the accepted answer.
        
       | xnx wrote:
       | Great example of Cunningham's Law:
       | https://en.wikipedia.org/wiki/Ward_Cunningham#:~:text=%22Cun...
        
       | bluefirebrand wrote:
       | All this obsession with making processes like this faster
       | 
       | When is a guy supposed to get a coffee and stretch his legs
       | anymore?
        
       | gwbas1c wrote:
       | I wonder if there's a way to export to parquet files? They are
       | designed to be extremely compact.
        
       | kazinator wrote:
       | Does the author not know that rsync can use compression (rsync -z
       | | --compress | --compress-level=<n> ), or does he not think it
       | worthwhile to compare that data point?
       | 
       | I just tried some comparisons (albeit with a fairly small sqlite
       | file). The text compressed to only about 84% of the size of the
       | compressed binary database, which isn't negligible, but not
       | necessarily worth fussing over in every situation. (The binary
       | compressed to 7.1%, so it's 84% relative to that).
       | 
        | bzip2 performed better on both formats; its compression of the
        | binary database was better than gzip's compression of the text
        | (91.5%), and bzip2's text was better than its binary (92.5%).
       | 
       | Though that is not available inside rsync, it indicates that if
       | you're going with an external compression solution, maybe gzip
       | isn't the best choice if you care about every percentage
       | reduction.
       | 
       | If you don't care about every percentage reduction, maybe just
       | rsync compression.
       | 
       | One thing worth mentioning is that if you are updating the file,
       | rsync will only compress what is sent. To replicate that with the
       | text solution, you will have to be retaining the text on both
       | sides to do the update between them.
        
       | isaacvando wrote:
       | Nice!
        
       | alienbaby wrote:
       | I'm surprised sqlite is duplicating data to make indexes? Surely
       | it would just be manipulating groups of pointers?
        
       | nodesocket wrote:
       | I've been looking into a way to replicate a SQLite database and
       | came across the LiteFS project by Fly.io. Seems like a solid
       | drop-in solution backed by FUSE and Consul. Anybody used it in
       | production? My use case is high availability between multiple
       | VMs.
        
       | baxter001 wrote:
       | What technologies we have in 2025!
        
       | quantadev wrote:
       | Since sqlite is just a simple file-level locking DB, I'm pretty
       | shocked they don't have an option to let the indexes be stored in
       | separate files for all kinds of obvious and beneficial reasons,
       | like the fact that you can easily exclude them from backups if
       | they were, and you can make them "rebuild" just by deleting them.
       | Probably their reason for keeping all internal has to do with
       | being sure indexes are never out of sync, but that could just as
       | easily be accomplished with hashing algos.
        
       | conradev wrote:
       | SQLite has an official tool for this, fwiw:
       | https://www.sqlite.org/rsync.html
       | 
       | It works at the page level:
       | 
       | > The protocol is for the replica to send a cryptographic hash of
       | each of its pages over to the origin side, then the origin sends
       | back the complete content of any page for which the hash does not
       | match.
        
       | limaoscarjuliet wrote:
        | In case someone is looking for a proper way of making a db
        | copy in PostgreSQL rather than SQLite, pg_basebackup is the
        | proper way:
        | https://www.postgresql.org/docs/current/app-pgbasebackup.htm...
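        | 
        | For example (host, user and target directory are
        | placeholders):
        | 
        |     pg_basebackup -h source-host -U replicator \
        |       -D /var/lib/postgresql/new-data -X stream -P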
        
       | deepsun wrote:
       | "Skip the index and turn it on later" is an old trick from
       | mainframes and DB2.
        
       ___________________________________________________________________
       (page generated 2025-05-01 23:00 UTC)