[HN Gopher] WAL Mode in LiteFS
___________________________________________________________________
WAL Mode in LiteFS
Author : eatonphil
Score : 95 points
Date : 2023-01-04 16:36 UTC (6 hours ago)
(HTM) web link (fly.io)
(TXT) w3m dump (fly.io)
| mbStavola wrote:
| Having an interest in SQLite and LiteFS, but not having explored
| either very much, I wonder what the DB administration aspect
| looks like?
|
| Say I want to connect a SQL client to my production DB: do I
| expose a port on my production app that speaks the protocol? Or
| maybe I have a separate container that is basically just the
| SQLite replica that I can connect to? Or maybe there is some
| other approach that I'm just not seeing?
|
| Fly, SQLite, and LiteFS are definitely an interesting
| combination, but I'm still sort of mentally stuck in the managed
| (and standalone) DB model.
| bob1029 wrote:
| IMO, the _entire point_ of using something embedded like SQLite
| is to avoid the network stack.
|
| Using SQLite, you can reliably satisfy queries in tens-to-
| hundreds of microseconds.
|
| I cannot imagine any scenario wherein a single-node hosted DB
| (across the LAN) can outperform a properly-configured SQLite
| instance (in same process).
|
| DB administration for us is included in an admin web app that
| is part of the product stack.
| gunnarmorling wrote:
| > I cannot imagine any scenario wherein a single-node hosted
| DB (across the LAN) can outperform a properly-configured
| SQLite instance (in same process).
|
| SQLite only supports one writer at a time, so any use case
| with more than a handful of concurrent writes will likely be
| better off with a remote DB which typically can handle
| thousands of active connections issuing transactions.
| benbjohnson wrote:
| Currently, you need to SSH in and use the sqlite3 CLI on the
| server. There has been some work in this area but it's all
| still rough around the edges. I wrote a server called
| Postlite[1] that exposes remote SQLite databases over the
| Postgres wire protocol but it's _very_ alpha. :)
|
| I'd love to see more work in this area. Ricardo Ander-Egg wrote
| a remote management tool called litexplore[2] that connects
| over SSH to the SQLite CLI behind the scenes. I haven't used it
| but I think there's a lot of potential with that approach.
|
| [1]: https://github.com/benbjohnson/postlite
|
| [2]: https://github.com/litements/litexplore
| PreInternet01 wrote:
| Even if you're not using LiteFS _yet_, the following bit of
| SQLite advice makes this article absolutely worth the 11 minutes
| it allegedly takes to read:
|
| > Use WAL mode
|
| (For bonus points, also bump cache_size up to 20000 or so and set
| synchronous=NORMAL)
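|
| As a minimal sketch of those settings (assuming Python's built-in
| sqlite3 module; the same pragmas apply through any driver):
|
|     import sqlite3
|
|     conn = sqlite3.connect("app.db")
|
|     # Write-ahead logging; this setting persists in the db file.
|     conn.execute("PRAGMA journal_mode=WAL")
|
|     # No fsync() on every commit, only at WAL checkpoints. A
|     # crash may lose the last few transactions, but the database
|     # itself stays intact.
|     conn.execute("PRAGMA synchronous=NORMAL")
|
|     # Roughly 20000 pages of page cache, per the suggestion above.
|     conn.execute("PRAGMA cache_size=20000")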
| bob1029 wrote:
| Just to give you an idea of how much difference this makes, I
| ran the following benchmark as of a few minutes ago:
|
| _3 new SQLite databases each with 1 text column, inserting
| 1000 rows of 1 char per. Testing using .NET6 and latest copy of
| System.Data.SQLite._
|
|     Default / No Pragmas:           8806ms
|     WAL Mode Only:                  2819ms
|     WAL Mode + Synchronous Normal:    44ms
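|
| (The numbers above are from .NET; purely as an illustration, a
| rough Python equivalent of the same experiment, assuming one
| transaction per inserted row, could look like the following.
| Absolute timings will differ, but the relative gap is the point.)
|
|     import os, sqlite3, time
|
|     def bench(path, pragmas):
|         # start from a fresh database file on every run
|         for suffix in ("", "-wal", "-shm"):
|             try:
|                 os.remove(path + suffix)
|             except FileNotFoundError:
|                 pass
|         conn = sqlite3.connect(path, isolation_level=None)
|         for p in pragmas:
|             conn.execute(p)
|         conn.execute("CREATE TABLE t (v TEXT)")
|         start = time.perf_counter()
|         for _ in range(1000):
|             # one explicit transaction per row
|             conn.execute("BEGIN")
|             conn.execute("INSERT INTO t (v) VALUES ('x')")
|             conn.execute("COMMIT")
|         elapsed = time.perf_counter() - start
|         conn.close()
|         return elapsed
|
|     print(bench("default.db", []))
|     print(bench("wal.db", ["PRAGMA journal_mode=WAL"]))
|     print(bench("wal-normal.db", ["PRAGMA journal_mode=WAL",
|                                   "PRAGMA synchronous=NORMAL"]))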
| benbjohnson wrote:
| The SYNCHRONOUS pragma is great but I'll mention that there
| is a durability trade-off. In "NORMAL" mode, there is not an
| fsync() after every transaction, so you could lose recent
| transactions if you unexpectedly shut down. The WAL is append-
| only so you don't risk data corruption (which is great).
| TillE wrote:
| > unexpectedly shut down
|
| Right, a power cut or kernel panic. It's certainly
| something to be aware of, but there are probably much more
| likely ways to lose some data (programming errors, etc).
| tptacek wrote:
| Or a point-in-time snapshot of the block device being
| made.
| bob1029 wrote:
| This is actually how we recommend our customers perform
| backups of our product. Crash-consistent snapshots of the
| entire VM's disk every so often. 100% of our installs are
| in virtualized/cloud environments, so this is a very
| convenient and natural way to go about things.
|
| Some loss around the edges (and between snapshots) is
| understood and accepted by all. We've made it clear to
| our customers that by making this trade-off, we can
| vastly simplify operational costs around the system (i.e.
| we only need 1 self-contained VM per environment).
| tptacek wrote:
| Oh, for sure. I'm just saying that in cloud environments
| the "power cut off" scenario is more common than it
| looks. :)
| bob1029 wrote:
| We took the tradeoff in our product due to the strong
| performance upside.
|
| Our reasoning goes something like this: whatever happened
| _approximately_ at the edge of the power-loss abyss is
| considered part of the explosion. We have never held out
| hope that we'd be able to get information to disk up until
| the last microsecond.
|
| Letting WAL roll us back a few transactions is not a huge
| deal for our customers, even if this breaks logical
| consistency with regard to external systems. We have many
| degrees of "redo" and the back office is always going to be
| a thing in our line of business.
|
| We file this under the "edge cases not economically worth
| targeting" bucket and take our 2 ~free orders of magnitude.
| vlovich123 wrote:
| Are you sure that those failure modes are the only ones
| guaranteed? I'd be worried about durability failures that
| happen due to non-obvious interactions (e.g. committing
| transaction A, losing transaction B, committing
| transaction C) or the database getting left in a state
| where it can't even start.
| bob1029 wrote:
| WAL enforces serialization semantics at crash recovery
| time. The only real variable is how many WAL frames made
| it to disk before the rug pull.
| aidenn0 wrote:
| > The WAL is append-only so you don't risk data corruption
| (which is great).
|
| I'm not sure this is true with all filesystems. I think
| there are some filesystems in which a crash during append
| can end up with the file enlarged, but the data not written
| (IIRC I saw something like this with XFS when I was working
| on a kernel module that kept crashing the kernel).
| benbjohnson wrote:
| > I think there are some filesystems in which a crash
| during append can end up with the file enlarged
|
| The SQLite WAL file itself has a running checksum that
| starts from the beginning so if you had an enlarged file
| or even a block of zero'd out bytes in the middle, SQLite
| would still recover gracefully. It recovers up to the
| last valid WAL frame that contains a set "commit" flag.
| aidenn0 wrote:
| I'm honestly not surprised that SQLite handles this
| situation (I seriously doubt there's any file-system
| oddity I've run into that Hipp hasn't). But just being
| "append only" is insufficient.
| cldellow wrote:
| WAL and synchronous=NORMAL is huge.
|
| I love databases that choose safe defaults.
|
| That said, many real-life applications of databases don't need
| to block for confirmation that the transaction has been durably
| committed -- ETL jobs, integration tests, most logging, etc.
|
| When you discover that you can get a 20-50x speedup for
| basically free, it's like Christmas.
| zamalek wrote:
| What are the practical limits (_not_ theoretical limits) for
| LiteFS? Are we talking 100's of GB, or something smaller?
| benbjohnson wrote:
| We're targeting databases of 1 to 10 GB right now and those
| seem to work pretty well overall. I'm sure we'll expand that
| target over time as LiteFS matures though.
|
| Most of the requests I've seen have been to support a lot of
| smaller databases (e.g. hundreds or thousands of 1GB databases)
| rather than one huge 100GB database. You can do interesting
| things like sharding. Or if you're a SaaS company, you could
| do one SQLite database per customer. That has some nice
| isolation properties and it eases SQLite's single-writer
| restriction since your writes are spread across multiple
| databases.
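|
| (For illustration, a minimal sketch of that per-customer layout,
| assuming Python's built-in sqlite3 module and a hypothetical
| /litefs mount directory with one file per customer.)
|
|     import sqlite3
|     from pathlib import Path
|
|     DATA_DIR = Path("/litefs")  # hypothetical mount point
|
|     def connect_for_customer(customer_id):
|         """Open (or create) this customer's own database file."""
|         path = DATA_DIR / f"customer-{customer_id}.db"
|         conn = sqlite3.connect(path)
|         conn.execute("PRAGMA journal_mode=WAL")
|         conn.execute("PRAGMA synchronous=NORMAL")
|         return conn
|
|     # Writes for different customers hit different files, so
|     # they don't contend on a single database's writer lock.
|     conn = connect_for_customer("acme")
|     conn.execute("CREATE TABLE IF NOT EXISTS orders "
|                  "(id INTEGER PRIMARY KEY)")
|     conn.commit()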
| ignoramous wrote:
| Hi Ben
|
| > _We're targeting databases of 1 to 10 GB right now and
| those seem to work pretty well overall._
|
| What are some reasons you reckon that the current setup won't
| scale beyond 10GB? Or, is it some arbitrary threshold beyond
| which you folks don't stress test things?
|
| Also, if I may, you mentioned on Twitter that this work was 3
| months in the making with 100s of PRs. Leaving aside
| stability-related bugs, what design decisions made previously
| caused painful bugs / roadblocks? Consequently, what things
| majorly surprised you in a way that perhaps has altered your
| approach / outlook towards this project or engineering in
| general?
|
| Thanks.
| benbjohnson wrote:
| > What are some reasons you reckon that the current setup
| won't scale beyond 10GB?
|
| It's more of an arbitrary threshold right now. A lot of
| the testing we do is chaos testing, where we frequently
| kill nodes to ensure that the cluster recovers correctly,
| and we try to test a range of database sizes within that
| threshold. Larger databases should work fine, but you also
| run into SQLite's single-writer limitation. Also, the
| majority of databases we see in the wild are less than
| 10GB.
|
| > Leaving aside stability-related bugs, what design
| decisions made previously caused painful bugs /
| roadblocks?
|
| So far the design decisions have held up pretty well. Most
| of the PRs were either stability related or WAL related.
| That being said, the design is pretty simple. We convert
| transactions into files and then ship those files to other
| nodes and replay them.
|
| We recently added LZ4 compression (which will be in the
| next release). There was a design issue there with how we
| were streaming data that we had to fix up. We relied on the
| internal data format of our transaction files to delineate
| them but that would mean we'd need to uncompress them to
| read that. We had to alter our streaming protocol a bit to
| do chunk encoding.
|
| I think our design decisions will be tested more once we
| expand to doing pure serverless & WASM implementations. I'm
| curious how things will hold up then.
|
| > Consequently, what things majorly surprised you in a way
| that perhaps has altered your approach / outlook towards
| this project or engineering in general?
|
| One thing that's surprised me is that we originally wrote
| LiteFS to be used with Consul so it could dynamically
| change its primary node. We kinda threw in our "static"
| leasing implementation for one of our internal use cases.
| But it turns out that for a lot of ancillary cache use
| cases, the static leasing works great! Losing write
| availability for a couple seconds during a deploy isn't
| necessarily a big deal for all applications.
| infogulch wrote:
| Have you compared LZ4 to other compression algorithms,
| zstd for example? (https://github.com/phiresky/sqlite-zstd)
|
| Given that LiteFS operates at the filesystem layer via
| FUSE, have you considered it against designs that use
| built-in features native to some filesystems? For
| example, I've considered a similar system design based on
| a single ZFS primary node that streams ZFS snapshots to
| reader nodes. With some coordination service (e.g.
| consul) it could still allow for the whole node promotion
| process.
| benbjohnson wrote:
| > Have you compared LZ4 to other compression algorithms,
| zstd for example?
|
| I looked at zstd as well. IIRC, LZ4 compresses faster and
| zstd has a better compression ratio. A lot of the files
| being compressed are short-lived so it's typically better
| to compress faster with LiteFS. Maybe we'll support more
| compression formats in the future if it's useful.
|
| > have you considered it against designs that use built-
| in features native to some filesystems
|
| The ZFS idea is interesting. I targeted FUSE because it
| seemed like a lower barrier to entry and easier for users
| to get set up.
| zamalek wrote:
| Awesome, thanks! These ballpark figures would be a great
| improvement to your documentation.
| pgm8705 wrote:
| I'm curious how one goes about handling the need to query
| data across customers with the one-database-per-customer
| pattern.
| ithrow wrote:
| with a script
| benbjohnson wrote:
| It depends on what you're trying to do. If you're trying to
| query revenue then that's usually stored in a third-party
| system like Stripe. If you're tracking metrics then those
| can be aggregated into a time series database like
| Prometheus and viewed through Grafana.
|
| It's definitely a trade-off. Isolation is pretty compelling
| for a lot of mid-market and enterprise customers but it
| does come at a cost of making it a pain to query across
| customers. But then again, that's kinda the point. :)
| giraffe_lady wrote:
| You can attach multiple dbs and join across them and it
| works very well. The default limit is ten and I think the
| max is a hundred so there's definitely a limit. But if you
| really needed to do more I can think of some workarounds.
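|
| (A sketch of what that looks like, using Python's sqlite3 and
| hypothetical per-customer files that each contain an "orders"
| table; the attach limit is SQLITE_MAX_ATTACHED, 10 by default
| and at most 125 at compile time.)
|
|     import sqlite3
|
|     conn = sqlite3.connect("customer-a.db")
|     conn.execute("ATTACH DATABASE 'customer-b.db' AS b")
|
|     # Tables in the attached database are addressed through
|     # the schema prefix given in ATTACH.
|     rows = conn.execute("""
|         SELECT 'a' AS tenant, count(*) FROM orders
|         UNION ALL
|         SELECT 'b', count(*) FROM b.orders
|     """).fetchall()
|
|     conn.execute("DETACH DATABASE b")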
| Scarbutt wrote:
| What problems regarding resource usage can come up when
| trying to use thousands of sqlite databases simultaneously
| from a single process?
| benbjohnson wrote:
| You'll probably hit file descriptor limits so you'll need
| to up those. There's also some memory overhead per
| connection. I'd recommend limiting the number of concurrent
| connections on a server if you're running on a resource-
| constrained box. In practice, you probably won't have all
| thousand databases querying at the same time, and reopening
| SQLite connections is pretty fast.
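|
| (One way to bound both file descriptors and per-connection
| memory is a small least-recently-used cache of open connections;
| a sketch in Python, with an arbitrary limit of 256. Raising the
| descriptor limit itself is typically done with ulimit -n or the
| service manager's equivalent.)
|
|     import sqlite3
|     from collections import OrderedDict
|
|     class ConnCache:
|         """Keep at most 'limit' connections open, closing the
|         least-recently-used one when the limit is exceeded."""
|
|         def __init__(self, limit=256):
|             self.limit = limit
|             self.conns = OrderedDict()
|
|         def get(self, path):
|             if path in self.conns:
|                 self.conns.move_to_end(path)  # recently used
|                 return self.conns[path]
|             conn = sqlite3.connect(path)
|             conn.execute("PRAGMA journal_mode=WAL")
|             self.conns[path] = conn
|             if len(self.conns) > self.limit:
|                 _, old = self.conns.popitem(last=False)
|                 old.close()  # frees the file descriptor
|             return conn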
| spiffytech wrote:
| If a LiteFS node is way behind on replication, can it let the app
| know when it's caught up?
|
| What I'd like to do: have one Fly instance running and another in
| cold standby. On deploy, the cold instance boots, fails the
| health check until replication catches up, then takes over while
| the first instance goes cold.
| benbjohnson wrote:
| > If a LiteFS node is way behind on replication, can it let the
| app know when it's caught up?
|
| LiteFS exposes the replication position via a file[1] suffixed
| with "-pos" for each database. That has a monotonically
| increasing TXID that you could compare to the primary.
|
| We track latency via a Prometheus metric but we don't currently
| expose that easily to the application. I added an issue for
| that[2].
|
| > have one Fly instance running and another in cold standby. On
| deploy, the cold instance boots, fails the health check until
| replication catches up, then takes over while the first
| instance goes cold.
|
| When LiteFS starts up, it waits for the node to either connect
| to the primary and catch up or become the primary itself before
| it starts the user's application (if running in a supervisor
| mode). That would probably get you most of the way to what
| you're trying to do.
|
| [1]: https://fly.io/docs/litefs/position/
|
| [2]: https://github.com/superfly/litefs/issues/253
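|
| (For anyone wiring that into a health check, a rough sketch of
| reading the position file in Python; it assumes the
| TXID/checksum layout described in the position docs linked
| above and a hypothetical /litefs mount.)
|
|     from pathlib import Path
|
|     def current_txid(db_name, mount="/litefs"):
|         # The "<db>-pos" file holds "<txid>/<checksum>"; the
|         # TXID is hex and monotonically increasing.
|         pos = Path(mount) / f"{db_name}-pos"
|         txid_hex = pos.read_text().strip().split("/")[0]
|         return int(txid_hex, 16)
|
|     # Compare against the primary's TXID (fetched out of band,
|     # e.g. from an internal endpoint) and only report healthy
|     # once the replica has caught up.
|     def caught_up(local_txid, primary_txid, allowed_lag=0):
|         return primary_txid - local_txid <= allowed_lag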
___________________________________________________________________
(page generated 2023-01-04 23:01 UTC)