[HN Gopher] LiteFS
___________________________________________________________________
LiteFS
Author : danielskogly
Score : 401 points
Date : 2022-09-21 14:33 UTC (8 hours ago)
(HTM) web link (fly.io)
(TXT) w3m dump (fly.io)
| no_wizard wrote:
| This is distributed SQLite 3, running (I assume at least
| partially managed?) LiteFS[5] for you. Which is pretty cool!
|
| What I'd like to have seen is how this compares to things like
| rqlite[1] or Cloudflare's D1[2] addressed directly in the article
|
| That said, I think this is pretty good for things like read
| replicas. I know the sales pitch here is as a full database, and
| I don't disagree with it, and if I were starting from scratch
| today and could use this, I totally would give it a try and
| benchmark / test accordingly; however, I can't speak to that use
| case directly.
|
| What I find however, and what I can speak to, is that most
| workloads already have a database of some kind set up, typically
| not SQLite as their main database (MySQL or PostgreSQL seem most
| common). This is a great way to make very - insanely, really -
| fast read replicas across regions of your data. You can use an
| independent raft[3][4] implementation to do this on write. If
| your database supports it, you can even trigger a replication
| directly from a write to the database itself (I think Aurora has
| this ability, and I think - don't quote me! - PostgreSQL can do
| this natively via an extension to kick off a background job)
|
| To that point, in my experience one thing SQLite is actually
| really good at is storing JSON blobs. I have successfully used it
| for replicating JSON representations of read only data in the
| past to great success, cutting down on read times significantly
| for APIs as the data is "pre-baked" and the lightweight nature of
| SQLite allows you to - if you wanted to naively do this - just
| spawn a new database for each customer and transform their data
| accordingly ahead of time. It's like AOT compilation for your
| data.
|
| If you want to avoid some complexity with sharding (you can't
| always avoid it outright, but this can help cap its complexity),
| this approach helps enormously in my experience. Do try before
| you buy!
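|
| To make the "pre-baked" idea concrete, here's a minimal sketch
| of the pattern in Python's stdlib sqlite3 (the table, column and
| per-customer file names are all invented for illustration):
|
|     import json
|     import sqlite3
|
|     def bake_customer_db(customer_id, records):
|         # Write a per-customer SQLite file ahead of time.
|         con = sqlite3.connect(f"customer_{customer_id}.db")
|         con.execute(
|             "CREATE TABLE IF NOT EXISTS docs "
|             "(id TEXT PRIMARY KEY, body TEXT)"
|         )
|         with con:
|             con.executemany(
|                 "INSERT OR REPLACE INTO docs (id, body) VALUES (?, ?)",
|                 [(r["id"], json.dumps(r)) for r in records],
|             )
|         con.close()
|
|     def read_doc(customer_id, doc_id):
|         # Serve the pre-baked JSON straight off local disk.
|         con = sqlite3.connect(f"customer_{customer_id}.db")
|         row = con.execute(
|             "SELECT body FROM docs WHERE id = ?", (doc_id,)
|         ).fetchone()
|         con.close()
|         return json.loads(row[0]) if row else None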
|
| EDIT: Looks like it's running LiteFS[5] not Litestream[0]. This is
| my error of understanding.
|
| [0]: https://litestream.io/
|
| [1]: https://github.com/rqlite/rqlite
|
| [2]: https://blog.cloudflare.com/introducing-d1/
|
| [3]: https://raft.github.io/
|
| [4]: https://raft.github.io/#implementations
|
| [5]: https://github.com/superfly/litefs
| Gys wrote:
| > This is distributed SQLite 3, running (I assume at least
| partially managed?) Litestream[0] for you.
|
| Litestream and LiteFS are by the same author and serve
| different purposes.
|
| https://litestream.io/alternatives/ goes into this. It includes
| LiteFS and rqlite.
|
| [Edited for clarification]
| no_wizard wrote:
| I did see this; it's a little light on details in the way I
| was thinking about this, however I appreciate the call-out,
| especially for others!
|
| It is missing the D1 comparison though.
| benbjohnson wrote:
| I'm happy to add D1 onto the alternatives list once it's
| released. I believe it's still in closed beta though.
| benbjohnson wrote:
| LiteFS/Litestream author here. You bring up a lot of great
| points that I'll try to address.
|
| > What I'd like to have seen is how this compares to things
| like rqlite or Cloudflare's D1 addressed directly in the
| article
|
| I think a post comparing the different options is a great idea.
| I'll try to summarize a bit here though. LiteFS aims to be an
| analogue to Postgres replication but with built-in failover.
| Postgres uses log shipping to copy database state from a
| primary to its replicas, as does LiteFS. LiteReplica is
| probably the closest thing to LiteFS although it uses a dual
| GPL/commercial license model. LiteFS uses Apache 2.
|
| There are Raft-based tools like rqlite or dqlite. These have
| higher consistency guarantees, however, they tend to be more
| complex to set up -- especially in an ephemeral environment
| like Kubernetes. LiteFS has a more relaxed membership model for
| this reason. As for Cloudflare's D1, they haven't released many
| details and I assume it's closed source. It also requires using
| a custom JavaScript client instead of using a native SQLite
| client.
|
| There are also several eventually consistent stores built on
| SQLite such as Mycelial[1]. These work great for applications
| with loose consistency needs. LiteFS still maintains
| serializable isolation within a transaction, although, it has
| looser guarantees across nodes than something like rqlite.
|
| > Most workloads already have a database of some kind set up,
| typically not SQLite as their main database (MySQL or
| PostgreSQL seem most common)
|
| Yes, that's absolutely true. I don't expect anyone to port
| their application from Postgres/MySQL to SQLite so they can use
| LiteFS. Databases and database tools have a long road to
| becoming mainstream and it follows the traditional adoption
| curve. I wrote a Go database library called BoltDB about 10
| years ago and it had the same adoption concerns early on.
| Typically, folks try it out with toy applications and play
| around with it. Once they get more comfortable, then they
| create new applications on top of it. As more people use it,
| then late adopters get more comfortable and it further builds
| trust.
|
| We're committed to LiteFS for the long term so we'll be making
| updates, fixing bugs, and we'll keep trying to build trust with
| the community.
|
| [1]: https://mycelial.com/
| no_wizard wrote:
| Note: I apologize if this is overstepping, it's hard to tell!
|
| I think a strong - extremely strong - selling point is the
| point I made about "prebaked" data for your APIs. Since the
| entire strength of these SQLite-based systems resides in
| their fast read capacity (as mentioned elsewhere and in this
| article, it's very fast for read-heavy applications, which
| is most of them), you could take an angle around that to get
| people "in the door" by showing a pathway for how this fits
| inside your existing data warehouse / storage model.
|
| We found we liked the SQLite durability to do this. It was a
| bit smarter than just a plain cache (such as Redis) with
| better durability and (for our needs) comparable enough
| performance (I think in absolute terms, a tuned Redis
| instance will always be faster, but up to a certain point,
| speed isn't everything, especially when factoring in cost).
|
| We found it was cheaper - by a good margin - to do this over
| caching everything AOT in a Redis cluster, and we could
| therefore much more cheaply go multi-region and have DBs
| sitting next to our customers that acted as a nearline cache.
|
| The complexity - which is an area where this might help in
| the future, and why I'm mentioning it - is shipping changes
| back. What we ended up doing is setting up a write Redis
| cluster that clients write to, and we take those writes and
| trigger a propagation job back to the database. This allowed
| us to run a much slimmer Redis cluster and made us feel more
| comfortable doing cache eviction since we could verify writes
| pretty easily. You could do this with memcache or whatever
| too.
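|
| Very roughly, the shape of the write path looked like this (an
| illustrative sketch only, using the redis-py client; the queue
| name and the database callback are made up):
|
|     import json
|     import redis  # redis-py
|
|     r = redis.Redis()
|
|     def accept_write(payload):
|         # Clients push writes onto a small "write" Redis queue.
|         r.lpush("pending_writes", json.dumps(payload))
|
|     def propagation_worker(write_to_database):
|         # A background job drains the queue back into the
|         # central database, so cached copies can be evicted
|         # once the write is verified.
|         while True:
|             _, raw = r.brpop("pending_writes")
|             write_to_database(json.loads(raw))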
|
| Sounds convoluted, but it worked really well. It allowed us
| to keep our centralized database intact without having to
| spin up expensive instances to be multi-region or commit to
| ever-growing Redis cluster(s). The SQLite flat-file model +
| history of durability made things the perfect tradeoff for
| this use case. Of course, YMMV, however it was a novel
| solution that used "enterprise grade" parts all the way down,
| which made it an easy selling point.
|
| You might find it worth exploring this more.
|
| As far as the comparisons go, I think it'd be cool to see a
| deep dive, and run a test suite against each of the major
| SQLite-as-a-distributed-database options. For that, I don't
| think they have to be open source to do a reasonable comparison?
| benbjohnson wrote:
| > I think a strong - extremely strong - selling point is
| the point I made about "prebaked" data for your APIs.
|
| I think caches are an excellent use case for LiteFS early
| on. Sorry I didn't make that point in my previous reply.
| It's a good way to get benefits out of LiteFS without
| committing to it as your source of truth. Also related,
| Segment built a custom SQLite-based solution[1] for
| distributing out cached data that worked well for them.
|
| [1]: https://segment.com/blog/separating-our-data-and-
| control-pla...
|
| > We found it was cheaper - by a good margin - to do this
| over caching everything AOT in a redis cluster
|
| Do you remember specifics of the cost difference? I can
| imagine it'd be pretty significant since you don't need to
| spin up servers with a bunch of RAM.
|
| > Note: I apologize if this is overstepping, its hard to
| tell!
|
| Not overstepping at all! It's great to hear folks'
| feedback.
| no_wizard wrote:
| I'm a little fuzzy on the numbers, but it was 2-3x (or
| thereabouts), since we didn't have to run multiple
| gigabyte clusters anymore. Even with auto-scaling based
| on demand, the clusters were more expensive simply due to
| the fact that if you wanted users to have a good
| experience you had to make sure you had a reasonable
| amount of their data AOT cached, which itself was a lot of
| complexity to manage.
|
| With SQLite, we could just scale instances during peak
| demand (so you could read from different instances of the
| same data if there was a bottleneck) and scale back down
| again, without (!usually) losing the data.
|
| It was a really complex - but fun - project all told, but
| its underpinnings were really simple.
|
| How Segment approached the problem isn't dissimilar to
| how we did it, honestly.
|
| The only thing that we (may) have done differently is we
| had failover. If SQLite didn't respond, our API layer
| could then talk directly to a database service to get the
| data. That was surprisingly complex to do.
|
| It's entirely possible that setups even more robust than
| ours was at the time would yield higher cost savings. Just
| to add a little more context: we did this before we hit
| our next scale of customers.
| ignoramous wrote:
| Thanks.
|
| > _LiteFS still maintains serializable isolation within a
| transaction, although, it has looser guarantees across nodes
| than something like rqlite._
|
| Picking up a term from the consistency map here [0], what
| guarantees does LiteFS make across nodes?
|
| [0] https://jepsen.io/consistency
| benbjohnson wrote:
| Good question. Right now, LiteFS operates with async
| replication so it's possible to have a transaction written
| to the primary get lost if the primary fails before it's
| replicated out. For that short window, you could read a
| transaction that no longer exists. From that standpoint, I
| believe it would technically be Read Uncommitted.
|
| However, during normal operation it'll function more like
| Snapshot Isolation. LiteFS does provide a transaction ID so
| requests could wait for a replica to catch up before
| issuing a transaction to get something closer to
| Serializable.
| Spivak wrote:
| I think you're being too harsh on yourself; isolation
| levels don't typically take replication lag and network
| partitioning into account. For example, on MySQL you can
| have async replication and serializable isolation.
| Replicas in this mode might never receive updates.
| benbjohnson wrote:
| Isolation levels in ACID are typically used for single
| instances. However, if you're talking about a distributed
| system then the consistency diagram on Jepsen is more
| applicable[1]. Raft, for example, can ensure read
| consistency across a cluster by running reads through the
| consensus mechanism.
|
| LiteFS aims to be the analogue to Postgres/MySQL
| replication and those generally work great for most
| applications.
|
| [1]: https://jepsen.io/consistency
| benbjohnson wrote:
| > This is distributed SQLite 3, running (I assume at least
| partially managed?) Litestream for you.
|
| Oh, I forgot to touch on this point. LiteFS shares some
| concepts with Litestream (e.g. log shipping), however, it
| doesn't use Litestream internally. It has much stricter
| requirements in terms of ensuring consistency since it's
| distributed so it performs an incremental checksum of the
| database on every transaction. It has additional benefits
| with its internal storage format called LTX. These storage
| files can be compacted together which will allow point-in-
| time restores that are nearly instant.
| threatofrain wrote:
| Cloudflare D1 is in closed beta so everything is presumably
| tentative, but atm they don't support transactions (!) and it
| doesn't appear to be on their near-term product roadmap. That
| one really surprised me.
| [deleted]
| hinkley wrote:
| > Second, your application can only serve requests from that one
| server. If you fired up your server in Dallas then that'll be
| snappy for Texans. But your users in Chennai will be cursing your
| sluggish response times since there's a 250ms ping time between
| Texas & India.
|
| > To improve availability, it uses leases to determine the
| primary node in your cluster. By default, it uses Hashicorp's
| Consul.
|
| Having a satellite office become leader of a cluster is one of
| the classic blunders in distributed computing.
|
| There are variants of Raft where you can have quorum members that
| won't nominate themselves for election, but out of the box this
| is a bad plan.
|
| If you have a Dallas, Chennai, Chicago, and Cleveland office and
| Dallas goes dark (ie, the tunnel gets fucked up for the fifth
| time this year), you want Chicago to become the leader, Cleveland
| if you're desperate. But if Chennai gets elected then everyone
| has a bad time, including Dallas when it comes back online.
| hobo_mark wrote:
| Well that was fast... [1]
|
| Are the readable replicas supposed to be long-lived (as in, I
| don't know, hours)? Or does consul happily converge even with
| ephemeral instances coming and going every few minutes (thinking
| of something like Cloud Run and the like, not sure if Fly works
| the same way)? And do they need to make a copy of the entire DB
| when they "boot" or do they stream pages in on demand?
|
| [1] https://news.ycombinator.com/item?id=32240230
| benbjohnson wrote:
| > Are the readable replicas supposed to be long-lived
|
| Right now, yes, they should be relatively long-lived (e.g.
| hours). Each node keeps a full copy of the database so it's
| recommended to use persistent volumes with it so it doesn't
| need to re-snapshot on boot.
|
| We do have plans to make this work with very short-lived
| requests for platforms like Lambda or Vercel. That'll use
| transactionally-aware paging on-demand. You could even run it
| in a browser, although I'm not sure why you would want to. :)
| jensneuse wrote:
| Sounds like a drop-in solution to add high availability to
| WunderBase (https://github.com/wundergraph/wunderbase). Can we
| combine LiteFS with Litestream for Backups, or how would you do
| HA + Backups together?
| benbjohnson wrote:
| Yes, it should fit with WunderBase to provide HA. I'm adding S3
| support which will work the same (in principle) as Litestream.
| I'm hoping to have that in about a month or so.
| rsync wrote:
| Can you comment on how you added SFTP support to litestream
| so that it would work with rsync.net[1][2] ?
|
| How does that compare/contrast with what my grandparent is
| alluding to ?
|
| [1] https://github.com/benbjohnson/litestream/issues/140
|
| [2] https://www.rsync.net/resources/notes/2021-q3-rsync.net_t
| ech...
| benbjohnson wrote:
| Litestream does log shipping so it just takes a chunk of
| the write-ahead log (WAL) and copies it out to a destination.
| That can be S3, GCP, SFTP, etc.
|
| LiteFS will do the same although I'm only targeting S3
| initially. That seemed to be the bulk of what people used.
| nicoburns wrote:
| Where is the data actually being stored in this setup? A copy on
| each machine running the application? If so, is there another
| copy somewhere else (e.g. S3) in case all nodes go down?
|
| Also, what happens if the Consul instance goes down?
|
| If my application nodes can't be ephemeral then this seems like
| it would be harder to operate than Postgres or MySQL in practice.
| If it completely abstracts that away somehow then I suppose
| that'd be pretty cool.
|
| Currently finding it hard to get on board with the idea that
| adding a distributed system here actually makes things simpler.
| benbjohnson wrote:
| > Where is the data actually being stored in this setup? A copy
| on each machine running the application?
|
| Yes, each node has a full copy of the database locally.
|
| > If so, is there another copy somewhere else (e.g. S3) in case
| all nodes go down?
|
| S3 replication support is coming[1]. Probably in the next month
| or so. Until then, it's recommended that you run a persistent
| volume with your nodes.
|
| > What happens if the Consul instance goes down?
|
| If Consul goes down then you'll lose write availability but
| your nodes will still be able to perform read queries.
|
| > If my application nodes can't be ephemeral then this seems
| like it would be harder to operate than Postgres or MySQL in
| practice.
|
| Support for pure ephemeral nodes is coming. If you're comparing
| LiteFS to a single node Postgres/MySQL then you're right, it's
| harder to operate. However, distributing out Postgres/MySQL to
| regions around the world and handling automatic failovers is
| likely more difficult to operate than LiteFS.
| omnimus wrote:
| Will this replicate the filesystem/volume between instances?
| Can I, for example, use text files or a different kind of
| file-based database? Or does it require me to use SQLite?
| simonw wrote:
| This is only for SQLite. But you can put text files (or
| even binary files - see
| https://simonwillison.net/2020/Jul/30/fun-binary-data-and-
| sq...) in a SQLite database and you may find you get better
| performance than reading files from disk:
| https://www.sqlite.org/fasterthanfs.html
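|
| For example, a tiny sketch with Python's stdlib sqlite3
| (assuming a local logo.png exists; the table name is arbitrary):
|
|     import sqlite3
|
|     con = sqlite3.connect("files.db")
|     con.execute(
|         "CREATE TABLE IF NOT EXISTS files "
|         "(name TEXT PRIMARY KEY, content BLOB)"
|     )
|
|     with open("logo.png", "rb") as f:
|         data = f.read()
|     with con:
|         con.execute(
|             "INSERT OR REPLACE INTO files VALUES (?, ?)",
|             ("logo.png", data),
|         )
|
|     blob = con.execute(
|         "SELECT content FROM files WHERE name = ?", ("logo.png",)
|     ).fetchone()[0]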
| whoisjuan wrote:
| Unrelated, but why does the map on their homepage show a region
| in Cuba? That must be wrong.
| dcre wrote:
| I think it's in Miami and the icon is confusingly centered over
| it. The icons look like pins but I think they're supposed to be
| hot-air balloons. This explains why some of them appear to
| point to spots in the ocean.
|
| https://fly.io/docs/reference/regions/
| whoisjuan wrote:
| Ahh. Great catch. Their balloon definitely looks like a pin.
| [deleted]
| theomega wrote:
| Does anyone else bump into the issue that the fly.io website
| does not load if requested via IPv6 on Mac? I tried Safari,
| Chrome and curl and none of them work:
|
|     $ curl -v https://fly.io/blog/introducing-litefs/
|     *   Trying 2a09:8280:1::a:791:443...
|     * Connected to fly.io (2a09:8280:1::a:791) port 443 (#0)
|     * ALPN, offering h2
|     * ALPN, offering http/1.1
|     * successfully set certificate verify locations:
|     *   CAfile: /etc/ssl/cert.pem
|     *   CApath: none
|     * (304) (OUT), TLS handshake, Client hello (1):
|     curl: (35) error:02FFF036:system library:func(4095):Connection reset by peer
|
| Requesting via ipv4 works:
|
|     $ curl -4v https://fly.io/blog/introducing-litefs/
|     *   Trying 37.16.18.81:443...
|     * Connected to fly.io (37.16.18.81) port 443 (#0)
|     * ALPN, offering h2
|     * ALPN, offering http/1.1
|     * successfully set certificate verify locations:
|     *   CAfile: /etc/ssl/cert.pem
|     *   CApath: none
|     * (304) (OUT), TLS handshake, Client hello (1):
|     * (304) (IN), TLS handshake, Server hello (2):
|     * (304) (IN), TLS handshake, Unknown (8):
|     * (304) (IN), TLS handshake, Certificate (11):
|     * (304) (IN), TLS handshake, CERT verify (15):
|     * (304) (IN), TLS handshake, Finished (20):
|     * (304) (OUT), TLS handshake, Finished (20):
|     * SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
|     * ALPN, server accepted to use h2
|     * Server certificate:
|     *  subject: CN=fly.io
|     *  start date: Jul 25 11:20:01 2022 GMT
|     *  expire date: Oct 23 11:20:00 2022 GMT
|     *  subjectAltName: host "fly.io" matched cert's "fly.io"
|     *  issuer: C=US; O=Let's Encrypt; CN=R3
|     *  SSL certificate verify ok.
|     * Using HTTP2, server supports multiplexing
|     * Connection state changed (HTTP/2 confirmed)
|     * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
|     * Using Stream ID: 1 (easy handle 0x135011c00)
|     > GET /blog/introducing-litefs/ HTTP/2
|     > Host: fly.io
|     > user-agent: curl/7.79.1
|     > accept: */*
|     >
|     * Connection state changed (MAX_CONCURRENT_STREAMS == 32)!
|     < HTTP/2 200
|     < accept-ranges: bytes
|     < cache-control: max-age=0, private, must-revalidate
|     < content-type: text/html
|     < date: Wed, 21 Sep 2022 16:50:16 GMT
|     < etag: "632b20f0-1bdc1"
|     < fly-request-id: 01GDGFA3RPZPRDV9M3AQ3159ZK-fra
|     < last-modified: Wed, 21 Sep 2022 14:34:24 GMT
|     < server: Fly/51ee4ef9 (2022-09-20)
|     < via: 1.1 fly.io, 2 fly.io
|     <
|     <!doctype html> ...
| tptacek wrote:
| Works fine over IPv6 from this Mac.
| toast0 wrote:
| Could easily be a path MTU issue? I don't currently have
| working IPv6, but on TCP/IPv4, the server->client handshake
| packets I get back from fly.io are full length packets. If your
| tcpv6 MSS is set incorrectly, and some other not great things
| are happening in the path, then you might not be able to
| receive large packets.
| ignoramous wrote:
| > _Could easily be a path MTU issue?_
|
| Perhaps: https://community.fly.io/t/ipv6-pmtud-issue/5081
| theomega wrote:
| How could I find out if this is the case? And, what can I do
| about it? Disabling IPv6 obviously fixes it, but that is not
| a solution...
| toast0 wrote:
| I'm not aware of a simple to use test site for this on IPv6
| (I've got one for IPv4, but my server host doesn't do IPv6
| either, and using a tunnel will limit the ranges I can
| test, etc)... so you'll need to test ye olde fashioned way.
|
| I don't have a mac, but if the man page[1] is right,
| something like this should work to see how big of a packet
| you can successfully send and receive:
| ping -6 -D -G 1500 -g 1400 fly.io
|
| (you may need to run as root to set packet sizes). You
| should get back a list of replies with say 1408 bytes, then
| 1409, etc. The last number you get back is effectively the
| largest IPv6 payload you can receive (on this path, it
| could be different for other paths), and if you add the
| IPv6 header length of 40, that's your effective path MTU.
|
| Use tcpdump to see what TCP MSS is being sent on your
| outgoing SYN packets, for IPv4, the MSS is MTU - 40 (20 for
| IPV4 header, 20 for TCP header), for IPv6, the MSS should
| be MTU - 60 (40 for IPv6 header, 20 for TCP header). If
| your TCP MSS is higher than the observed path MTU to
| fly.io, that's likely the immediate cause of your problem.
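|
| To make the arithmetic concrete: on a 1492-byte MTU link, the
| outgoing SYN should carry an MSS of at most 1492 - 40 = 1452
| for IPv4 and 1492 - 60 = 1432 for IPv6.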
|
| If you're using a router, make sure it knows the proper MTU
| for the IPv6 connection, and enable MSS clamping on IPv6,
| if possible --- or make sure the router advertisement
| daemon shares the correct MTU.
|
| Hope this gets you started.
|
| [1] https://ss64.com/osx/ping.html
| theomega wrote:
| Thanks, that was of great help!
|
| Just for reference, the ping command is a little different:
|
|     sudo ping6 -D -G 1500,1400 fly.io
|
| I set an MSS to 1492 which pfsense (my router) translates
| to an MSS clamp of 1492-60 for IPv6 and 1492-40 for IPv4.
| This is a German Deutsche Telekom Fiber connection. Now
| everything works fine, I can request fly.io (and also
| discovered that https://ipv6-test.com was not working
| before and now does with the MSS clamping)
|
| Does MSS clamping have any disadvantages? Are there any
| alternatives in my case?
| toast0 wrote:
| Excellent, happy to help, glad you figured out the right
| command! 1492 MTU is consistent with PPPoE, or a network
| where they didn't want to run a separate MTU for PPPoE
| and straight ethernet.
|
| The only downside to MSS clamping is the computational
| expense of inspecting and modifying the packets. On a
| residential connection, where you're running pfsense
| already, it's probably not even noticeable; but your ISP
| wouldn't be able to do clamping for you, because large
| scale routers don't have the processing budget to inspect
| packets at that level. I've seen some MSS clamping
| implementations that only clamp packets going out to the
| internet, and not the return packets... that can lead to
| problems sending large packets (which isn't always very
| noticeable, actually; a lot of basic browsing doesn't
| send packets large enough to hit this, unless you go to a
| site that sets huge cookies or do some real uploading)
|
| The alternative would be to run a 1492 MTU on your LAN,
| but that has the marginal negative of reducing your
| maximum packet size for LAN to LAN transfers.
| wglb wrote:
| Just tried it on my linux box:
|
|     *   Trying 2a09:8280:1::a:791:443...
|     * Connected to fly.io (2a09:8280:1::a:791) port 443 (#0)
|     * ALPN, offering h2
|     * ALPN, offering http/1.1
|     *  CAfile: /etc/ssl/certs/ca-certificates.crt
|     *  CApath: /etc/ssl/certs
|     * TLSv1.0 (OUT), TLS header, Certificate Status (22):
|     * TLSv1.3 (OUT), TLS handshake, Client hello (1):
|     * TLSv1.2 (IN), TLS header, Certificate Status (22):
|     * TLSv1.3 (IN), TLS handshake, Server hello (2):
|     * TLSv1.2 (IN), TLS header, Finished (20):
|     * TLSv1.2 (IN), TLS header, Supplemental data (23):
|     * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
|     * TLSv1.2 (IN), TLS header, Supplemental data (23):
|     * TLSv1.3 (IN), TLS handshake, Certificate (11):
|     * TLSv1.2 (IN), TLS header, Supplemental data (23):
|     * TLSv1.3 (IN), TLS handshake, CERT verify (15):
|     * TLSv1.2 (IN), TLS header, Supplemental data (23):
|     * TLSv1.3 (IN), TLS handshake, Finished (20):
|     * TLSv1.2 (OUT), TLS header, Finished (20):
|     * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
|     * TLSv1.2 (OUT), TLS header, Supplemental data (23):
|     * TLSv1.3 (OUT), TLS handshake, Finished (20):
|     * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
|     * ALPN, server accepted to use h2
|     * Server certificate:
|
| ...
|
| Same on my macbook
| ranger_danger wrote:
| ELI5?
| benbjohnson wrote:
| LiteFS makes it so that you can have a SQLite database that is
| transparently replicated to a cluster of machines. The use case
| is if you have your app running on a server running in Chicago,
| then users in Europe will have 100ms latency to your server and
| users in Asia will have a 250ms latency. That's on top of the
| time it takes for your server to process the request. Many
| people target sub-100ms latency to make their applications feel
| snappy but that's impossible if users are waiting on their
| request to be sent halfway around the world.
|
| Traditionally, it's complicated to replicate your database to
| different regions of the world using something like Postgres.
| However, LiteFS aims to make it as simple as just spinning up
| more server instances around the world. It connects them
| automatically and ships changes in a transactionally safe way
| between them.
| boltzmann-brain wrote:
| > Traditionally, it's complicated to replicate your database
| to different regions of the world using something like
| Postgres
|
| what makes it complicated?
| twobitshifter wrote:
| It's a tool to allow you to keep your database on the app
| server and backed up on s3. This means you don't need a
| separate database server. Having a database on the same machine
| as the app is very fast, but it is usually risky. The backups
| made by LiteFS lessen the risk.
| hobo_mark wrote:
| That's Litestream, not LiteFS.
| vcryan wrote:
| Currently, I am running multiple application servers (and lambda
| functions) using AWS Fargate that access sqlite files (databases)
| on an EFS share. So far so good, although my use cases are fairly
| simple.
| asim wrote:
| fly.io is the company I wanted to build 10 years ago. Something
| with massive technical depth that becomes a developer product.
| They're doing an incredible job and part of that comes down to
| how they evangelise the product outside of all the technical
| hackery. This requires so much continued effort. AND THEN to
| actually run a business on top of all that. Kudos to you guys. I
| struggled so much with this. Wish you nothing but continued
| success.
| ignoramous wrote:
| 5 years ago, fly.io was basically a poor man's deno.com / oven.sh
| [0]. In my (incorrect) opinion, tptacek changed fly.io's
| trajectory single-handedly.
|
| [0] https://ghostarchive.org/varchive/r-1hXDvOoHA
| stingraycharles wrote:
| What is @tptacek's relation with fly.io, and how did he
| change it?
| tptacek wrote:
| No? No. No!
| pphysch wrote:
| > Developing against a relational database requires devs to watch
| out for "N+1" query patterns, where a query leads to a loop that
| leads to more queries. N+1 queries against Postgres and MySQL can
| be lethal to performance. Not so much for SQLite.
|
| This is misleading AFAICT. The article(s) are actually comparing
| a remote RDBMS to a local RDBMS, not Postgres to SQLite.
|
| Postgres can also be served over a UNIX socket, removing the
| individual query overhead due to TCP roundtrip.
|
| SQLite is a great technology, but keep in mind that you can also
| deploy Postgres right next to your app as well. If your app is
| something like a company backend that could evolve a lot and
| benefit from Postgres's advanced features, this may be the right
| choice.
| simonw wrote:
| I get the impression that there's still quite a bit of overhead
| involved in talking to PostgreSQL that you don't get with
| SQLite. A SQLite query is pretty much a C function call that
| works directly with data structures and local disk.
| tptacek wrote:
| Anyone want to chime in with the largest app they've deployed
| where prod Postgres was reachable over a Unix domain socket?
| tshaddox wrote:
| Are there significant limits to the size of an app that could
| be deployed alongside Postgres versus the size of an app that
| could be deployed alongside SQLite?
| zie wrote:
| No, the big difference is write speed. SQLite is limited in
| the number of writes it can do at the same time (typically
| 1). This is generally called "concurrent writes". Last I
| checked SQLite can't really get past 1 write at the same
| time, but it can emulate concurrent writes in WAL mode such
| that it isn't normally a problem for most applications.
| Postgres has no such limit (though there can be limits to
| updating a particular row/column of data concurrently,
| obviously).
|
| Otherwise both are generally limited to the physical
| resources of the machine (memory, disk, etc). Generally
| speaking you can scale boxes much much farther than your
| data size for most applications.
| nicoburns wrote:
| > it can emulate concurrent writes in WAL mode
|
| Seems a bit unfair to call WAL mode emulation of "true"
| concurrent writes as I'm pretty sure a write-ahead-log
| (WAL) is exactly how other databases implement multiple
| concurrent writes. It's just always-on rather than being
| opt-in.
| zie wrote:
| Well, but that's not really what it's doing. WAL1 mode in
| SQLite is different than in your typical RDBMS. WAL mode
| in SQLite is just getting the write path outside of the
| read path, but there is still only 1 write allowed at any
| given time. The "fake" concurrency is done by letting
| write transactions queue up and wait for the lock.
|
| See [here](https://sqlite.org/wal.html) under "2.2.
| Concurrency" where it says:
|
| > "However, since there is only one WAL file, there can
| only be one writer at a time."
|
| SQLite is awesome, I'm a huge fan, but if you need to do
| lots and lots of writes, then SQLite is not your friend.
| Luckily that's a very rare application. There is a reason
| you never see SQLite being the end point for logs and
| other write-heavy applications.
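|
| For what it's worth, here's a minimal sketch of the knobs
| involved (Python's stdlib sqlite3; the 5-second busy timeout
| and the table are arbitrary choices):
|
|     import sqlite3
|
|     # Wait up to 5s for a lock instead of failing immediately.
|     con = sqlite3.connect("app.db", timeout=5.0)
|     con.execute("PRAGMA journal_mode=WAL")   # readers don't block the writer
|     con.execute("PRAGMA busy_timeout=5000")  # writers queue up on the lock
|     con.execute(
|         "CREATE TABLE IF NOT EXISTS events "
|         "(id INTEGER PRIMARY KEY, msg TEXT)"
|     )
|
|     with con:  # still only one write transaction at a time
|         con.execute("INSERT INTO events (msg) VALUES (?)", ("hello",))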
| nucleardog wrote:
| Used to work developing standalone kiosks and things (think
| along the lines of stuff you'd find at a museum as the basic
| type of thing I'd work on), so absolutely not the use case
| you're thinking of.
|
| In a number of cases, we were working with data that would
| really benefit from actual GIS tooling so PostGIS was kind of
| the natural thing to reach for. Many of these kiosks had
| slow, intermittent, or just straight up no internet
| connection available.
|
| So we just deployed PostGIS directly on the kiosk hardware.
| Usually little embedded industrial machines with passive
| cooling working with something like a generation or two out-
| of-date Celeron, 8GB RAM, and a small SSD.
|
| We'd load up shapefiles covering a bunch of features across
| over a quarter million square miles while simultaneously
| running Chromium in kiosk mode.
|
| I know for a fact some of those had around 1k DAU. I mean,
| never more than one simultaneously but we did have them! I'm
| sure it would have handled more just fine if it weren't for
| the damn laws of physics limiting the number of people that
| can be physically accessing the hardware at the same time.
|
| That said, we had the benefit of knowing that our user counts
| would never really increase and due to real-world limitations
| we'd never be working with a remote database because there
| was literally no internet available. In general I'd still say
| colocating your database is not an excuse to allow N+1 query
| patterns to slip in. They'll be fine for now and just come
| back to bite you in the ass when your app _does_ outgrow
| colocating the app/database.
| BeefWellington wrote:
| Back in my consulting days one of my clients had an
| application with nearly all of the business logic being done
| in-database. The app was essentially a frontend for stored
| procedures galore and it sat on an absolute monster of a
| system for its time (~256GB of RAM, 24 or 32 Cores I think).
| It handled something like 100k DAU with no issue. That was
| for just the transactional system. They replicated data to
| other (similar style) systems for less mission-critical
| things like reporting, billing, etc.
|
| I want to be clear though, I would never recommend anyone do
| this. For its time it was impressive but I suspect by now
| it's been re-engineered away from that model, if for no other
| reason than it's a lot of eggs in one basket.
| metafex wrote:
| Had an instance with almost 2TB on disk in 2015, it's
| probably a lot more by now, if it's still running as-is. Was
| for analytics/data-warehousing. Though the big table was
| partitioned so queries were always nice and snappy. The
| machine had 128gb ram back then, but probably never needed
| that much. The applications working with the data ran on that
| same server, so processing latency was usually <10ms even for
| larger queries.
| fdr wrote:
| Not me personally, but my understanding is that some Cisco
| products have Postgres "embedded." If they're not using unix domain
| sockets, they're at least using loopback.
|
| ...come to think of it, they may not be the only network
| device maker that does this.
| pphysch wrote:
| I would wager that the median IT shop/dept is running off a
| single node RDBMS, the majority of which also have the spare
| resources to support running the primary application (which
| has 1-100 DAU).
|
| No practical way to test this, of course.
| whalesalad wrote:
| is it cheating when pgbouncer is being accessed via socket?
| lol
| tptacek wrote:
| Yes, that is cheating.
| jzelinskie wrote:
| I'm surprised no one is chiming in from the LAMP[0] era. Maybe I
| was just in a bubble, but I feel like before modern cloud
| providers, folks deploying on VPSs tried to keep operational
| complexity lower by running their database on the same box
| until they couldn't anymore.
|
| In my experience, I just cached the living hell out of the
| apps to avoid the I/O from the database. Providers in those
| days had huge issues with noisy neighbors, so I/O performance
| was quite poor.
|
| [0]: https://en.wikipedia.org/wiki/LAMP_(software_bundle)
| lbhdc wrote:
| We use managed postgres on GCP, and it is served over unix
| socket.
| cbsmith wrote:
| Kind of by definition, GCP's managed postgres is served up
| over the network. I'm guessing you mean "unix socket" in
| the sense of the socket API, not unix _domain_ sockets.
| shrubble wrote:
| It's a great way to deploy if you have the RAM and disk.
|
| Bump up the shared buffers, have a multithreaded or pooled
| driver for your app server so connections are not being
| opened and closed all the time.
|
| It really 'flies' under such a scenario.
|
| Largest app was a big analytics and drug tracking application
| for a small pharmaceutical company. Not sure of the size but
| it was very reporting heavy.
| pweissbrod wrote:
| Select N+1 is a fundamental anti-pattern of relational database
| usage. Reducing the latency per round trip doesn't change this
| fact.
| simonw wrote:
| I disagree. I think it makes a big difference.
|
| https://sqlite.org/np1queryprob.html#the_need_for_over_200_s...
| talks about this in the context of Fossil, which uses
| hundreds of queries per page and loads extremely fast.
|
| It turns out the N+1 thing really is only an anti-pattern if
| you're dealing with significant overhead per query. If you
| don't need to worry about that you can write code that's much
| easier to write and maintain just by putting a few SQL
| queries in a loop!
|
| Related: the N+1 problem is notorious in GraphQL world as one
| of the reasons building a high performance GraphQL API is
| really difficult.
|
| With SQLite you don't have to worry about that! I built
| https://datasette.io/plugins/datasette-graphql on SQLite and
| was delighted at how well it can handle deeply nested
| queries.
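|
| As a toy illustration of why the loop stops being scary when
| each query is roughly a function call (Python's stdlib sqlite3,
| in-memory database, schema invented for the example):
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     con.executescript("""
|         CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
|         CREATE TABLE post (
|             id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
|         INSERT INTO author VALUES (1, 'Ann'), (2, 'Ben');
|         INSERT INTO post VALUES
|             (1, 1, 'Hello'), (2, 1, 'Again'), (3, 2, 'Hi');
|     """)
|
|     # The classic N+1 shape: one query for the list, then one
|     # query per row. Against in-process SQLite each extra query
|     # is a library call, not a network round trip.
|     authors = con.execute("SELECT id, name FROM author").fetchall()
|     for author_id, name in authors:
|         titles = [t for (t,) in con.execute(
|             "SELECT title FROM post WHERE author_id = ?",
|             (author_id,))]
|         print(name, titles)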
| alberth wrote:
| A better comparison might be Postgres to BedrockDB.
|
| BedrockDB is a MySQL compatible interface wrapped around
| SQLite and as such, is intended for remote RDBMS use.
|
| Expensify has some interesting scaling performance data for
| their use of it.
|
| https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...
| woooooo wrote:
| Isn't "n + 1" typically a degenerate case of ORM rather than a
| relational thing? The whole point of rdbms is you can do a join
| instead of iterated lookups.
| pphysch wrote:
| Yeah most N+1 issues can be mitigated by optimizing the
| queries.
|
| You definitely don't need an ORM to get to N+1 hell, a for-
| loop in vanilla PHP will do.
| woooooo wrote:
| Yeah, but you have to go out of your way to do that. If
| you're in touch with the SQL you're executing at all, the
| join is the obvious easy way to do it, it's not some arcane
| optimization.
| tyingq wrote:
| You do see it often in things like canned ecommerce
| software. They have this ecosystem that encourages you to
| add widgets to category listings, product listings, etc,
| that all create queries keying off the page context.
|
| Where widgets might be "similar items", "people that
| bought x, bought y", "recent purchases of this widget",
| "best sellers in this category" and so on.
| RussianCow wrote:
| That's a really big "if". In my experience, most
| application engineers aren't very good at SQL because
| they don't have to be; most apps just don't ever reach a
| scale where it matters, and something like Rails or
| Django with N+1 queries all over the place performs just
| fine.
| woooooo wrote:
| Hence my 2 decade losing battle against ORM at every
| place I've worked :)
|
| I'm, like, 1 for 10 maybe.
| mgkimsal wrote:
| A for-loop in vanilla _anything_ will do. Nothing special
| about PHP. Ruby, Python, Java, etc...
| pphysch wrote:
| Of course. PHP is the one you will mainly encounter in
| the wilderness, though.
| 0x457 wrote:
| Most common place I saw N+1 is ActiveRecord (Ruby on
| Rails).
|
| Because by default, no relationships are loaded and the
| way AR "loads" stuff makes it extremely easy to create
| N+M queries on 5 lines of ruby. Then, those people will
| tell you that you run out of database before you run out
| of ruby...
| dspillett wrote:
| You see it in direct code too, from new or naive coders (the
| good old "it was fast enough on my test data...") or when
| proof-of-concept code "accidentally" escapes into production.
|
| Even within the database you'll sometimes find examples:
| stored procedures looping over a cursor and running queries
| or other statements in each iteration. This doesn't have the
| same latency implications as N+1 requests over a network, but
| still does add significant time per step compared to a set
| based equivalent.
|
| If you think about it, the database will sometimes do this
| itself when your database is not optimally structured &
| indexed for your queries (in SQL Server, look at "number of
| executions" in query plans).
| nicoburns wrote:
| > You see it in direct code too, from new or naive coders
|
| Many moons ago in one of my first coding projects I not
| only used a separate query for each record, but a separate
| database connection too. In addition to this, I didn't use
| a loop to generate these queries, I _copy and pasted the
| connection setup and query code for each record_ , leading
| to a 22k code file just for this single web page.
|
| There were probably on the order of 1000 records (they
| represented premier league football players), and amazingly
| this code was fast enough (took a couple of seconds to load
| - which wasn't so unusual back in 2007) and worked well
| enough that it actually had ~100 real users (not bad for a 14
| year old's side project).
| benbjohnson wrote:
| You could run Postgres over UNIX sockets although you will
| still get higher latency than SQLite's in-process model. Also,
| running a Postgres on every app instance on the edge probably
| isn't practical. Postgres has some great advanced features if
| you need them, but it's also much more heavyweight.
|
| With LiteFS, we're aiming to easily run on low resource cloud
| hardware such as nodes with 256MB or less of RAM. I haven't
| tried it on a Raspberry Pi yet but I suspect it would run fine
| there as well.
| pphysch wrote:
| SQLite _almost certainly_ is the better edge RDBMS than
| Postgres, if only because it has less features taking up
| space.
|
| However, "local SQLite vs. remote Postgres/MySQL" remains a
| false dichotomy when talking about network latency.
| clord wrote:
| It's an application design choice. It's perfectly
| reasonable to consider those two options when designing a
| system. The constraints of the system determine which is
| better for the application.
|
| I'd pick a centralized network-reachable database with a
| strong relational schema for write-heavy applications, and
| a lighter in-process system for something that is mostly
| reads and where latency matters. It's not a false dichotomy
| -- but more like a continuum that certainly includes both
| extremes on it.
| cbsmith wrote:
| Isn't this all talking to S3 anyway, not to mention the
| network trips intrinsic to the system before it gets to S3? I
| mean, I'm sure there's _some_ performance win here, but I 'm
| surprised it's so significant.
|
| It's not like the address-space separation is without
| benefits... heck, if it weren't, you could simply have
| embedded the whole application inside Postgres and achieved
| the same effect.
| simonw wrote:
| You may be confusing Litestream and LiteFS.
|
| Litestream writes everything to S3 (or similar storage).
|
| LiteFS lets different nodes copy replicated data directly
| to each other over a network, without involving S3.
|
| In either case, the actual SQLite writes and reads all
| happen directly against local disk, without any network
| traffic. Replication happens after that.
| cbsmith wrote:
| I think I was definitely confusing it with Litestream as
| the blog post made reference to it (and I did find that
| confusing).
|
| That said, unless I've misunderstood the LiteFS use case,
| you're still going over the network to reach a node, and
| that node is still going through a FUSE filesystem. That
| would seem to create overhead comparable (potentially
| more significant) to talking to a Postgres database
| hosted on a remote node.
|
| It just doesn't seem that obvious that there's a big
| performance win here. I'd be curious to see the profiling
| data behind this.
| WorldMaker wrote:
| It's an "eventual consistency" transaction scheme for
| SQLite: reads happen locally and immediately with the
| nearest edge copy available, it's only writes that need
| to be forwarded to a (potentially far away) primary DB
| over the network somewhere, and transactions will then
| take some amount of network time to fan back out to
| edges, so _eventually_ the edges will read the latest
| transactions, but during that fan out will read the
| previous transaction state.
| cbsmith wrote:
| Local for the application code on the node you're talking
| to (just like the read would be local for a Postgres
| process running on the node), but there's still a network
| trip to get to the node, no?
|
| Even if the reads are happening locally, if they're going
| through FUSE (even a "thin" pass-through that does
| nothing), that means they're getting routed from kernel
| space to a FUSE daemon, which means you're still doing
| IPC to another process through a kernel...
| simonw wrote:
| The performance boost that matters most here is when your
| application reads from the database. Your application
| code is reading directly from disk there, through a very
| thin FUSE layer that does nothing at all with reads (it
| only monitors writes).
|
| So your read queries should mostly be measured in
| microseconds.
| cbsmith wrote:
| > So your read queries should mostly be measured in
| microseconds.
|
| You should check out the read latency for read-only
| requests over unix domain sockets with PostgreSQL. You
| tend to measure it in microseconds, and depending on
| circumstances it can be single-digit microseconds.
|
| Regardless of whether your FUSE logic does nothing at
| all, it sure _seems_ like there's intrinsic overhead to
| the FUSE model that is very similar to the intrinsic
| overhead of talking to another userspace database
| process... because you're talking to another userspace
| process (through the VFS layer).
|
| When the application reads, those requests go from
| userspace to the FUSE driver & /dev/fuse, with the thread
| being put into a wait state; then the FUSE daemon needs
| to pull the request from /dev/fuse to service it; then
| your FUSE code does whatever minimal work it needs to do
| to process the read and passes it back through /dev/fuse
| and the FUSE driver, and from there back to your
| application. That gets you pretty much the same "block
| and context switch" overhead of an IPC call to Postgres
| (arguably more). FUSE uses splicing to minimize data
| copying (of course, unix domain sockets also minimize
| data copying), though looking at the LiteFS Go daemon,
| I'm not entirely sure there isn't a copy going on anyway.
| Memory copying issues aside, from a latency perspective,
| you're jumping through very similar hoops to talking to
| another user-space process... because that's how FUSE
| works.
|
| There's a potential performance win if the data you're
| reading is already in the VFS cache, since that would
| bypass having to go through the FUSE filesystem (and the
| /dev/fuse-to-userspace jump) entirely. The catch is, at
| that point you're bypassing SQLite transaction engine
| semantics entirely, giving you a dirty read that's really
| just a cached result from a previous read. That's not
| really a new trick, and you can get even better
| performance with client-side caching that can avoid a
| trip to kernel space.
|
| I'm sure there's a win here somewhere, but I'm struggling
| to understand where.
| benbjohnson wrote:
| > There's a potential performance win if the data you're
| reading is already in the VFS cache, since that would
| bypass having to go through the FUSE filesystem.
|
| That's a typical case for a lot of databases. Most of the
| "hot" data is in a small subset of the pages and many of
| those can live in the in-process page cache.
|
| > The catch is, at that point you're bypassing SQLite
| transaction engine semantics entirely, giving you a dirty
| read that's really just a cached result from a previous
| read.
|
| It's not bypassing the transaction engine semantics. For
| WAL mode, SQLite can determine when pages are updated by
| checking the SHM file and then reading updated pages from
| the WAL file. Pages in the cache don't need to be flushed on
| every access or even between transactions to be valid.
|
| > I'm sure there's a win here somewhere, but I'm
| struggling to understand where.
|
| The main goal of LiteFS is to make it easy to globally
| replicate applications. Many apps run in a single region
| of the US (e.g. us-east-1) and that's fast for Americans
| but it's a 100ms round trip to Europe and a 250ms round
| trip to Asia. Sure, you can spin up a multi-region
| Postgres but it's not that easy and you'll likely deploy
| as separate database and application servers because
| Postgres is not very lightweight.
|
| LiteFS aims to have a minimal footprint so it makes it
| possible to deploy many small instances since SQLite is
| built to run on low resource hardware.
|
| As far as comparisons with Postgres over UNIX sockets, I
| agree that the performance of a single instance is
| probably comparable with a FUSE layer.
| simonw wrote:
| That's a really detailed, interesting reply. I am now
| really interested to understand more about the
| comparative overhead here.
| fcoury wrote:
| Wow, thank you so much for explaining the overhead around
| FUSE usage, that's really eye opening and makes having
| some performance benchmarks down the line even more
| interesting.
| hinkley wrote:
| Yep. The poison is still in the cup, it's just a smaller dose.
| Keep drinking if you think that fact is enough to save you...
|
| I see bad judgement calls coming from small numbers, often due
| to failing to do the cost x frequency math properly in your
| head. If you're looking at an operation that takes 3ms and is
| called a million times per minute, or twenty thousand times per
| request, you don't have enough significant figures there and
| people make mistakes. 3ms x 12345 = 37035ms (~37s), but use a
| higher resolution clock and you may find it's actually 3.259ms,
| leading to a total of ~40.23s.
|
| Point is, when we are doing billions of operations per second,
| lots of small problems that hide in clock jitter can become
| really big problems, and a 5-10% error repeated for half a
| dozen concerns can lead to serious miscalculations in capacity
| planning and revenue.
| lijogdfljk wrote:
| This is really cool! Unfortunately i primarily am interested in
| offline databases so perhaps i'm just not the target audience.
| However i have to ask, on that note, does this have any
| application in the offline space?
|
| Ie I wonder if there's a way you can write your applications such
| that they have less/minimal contention, and then allow the
| databases to merge when back online? Of course, what happens when
| there inevitably _is_ contention? etc
|
| Not sure that idea would have a benefit over many SQLite DBs with
| userland schemas mirroring CRDT principles though. But a boy can
| dream.
|
| Regardless, very cool work being done here.
| bigfish24 wrote:
| Shameless plug but would love for you to try out Ditto[1]!
| CRDT-based replicated database that works peer-to-peer
| including automatically with WiFi and Bluetooth.
|
| [1] https://www.ditto.live
| benbjohnson wrote:
| Thanks! It depends on your particular use case. If you need all
| nodes to allow writes then, unfortunately, no it doesn't
| support that. LiteFS uses physical replication so there's no
| way to merge pages from separate writers. You're probably
| better off looking at eventually consistent SQLite tools like
| Mycelial[1].
|
| However, if you have a single writer and just need replicas to
| get updates when they periodically connect then yes, LiteFS
| could fit that use case.
|
| [1]: https://mycelial.com/
| infogulch wrote:
| > To improve latency, we're aiming at a scale-out model that
| works similarly to Fly Postgres. That's to say: writes get
| forwarded to the primary and all read requests get served from
| their local copies.
|
| How can you ensure that a client that just performed a forwarded
| write will be able to read that back on their local replica on
| subsequent reads?
| sk55 wrote:
| They don't have ACID guarantees with this setup.
| benbjohnson wrote:
| ACID is typically used to describe single database instances
| and from that perspective, LiteFS doesn't change the ACID
| semantics that SQLite has. For distributed consistency
| models[1], it gets more complicated as LiteFS currently just
| supports async replication. We'll introduce stronger
| guarantees in the future although we currently provide some
| information such as transaction IDs so you can manage
| replication lag if you need stronger consistency.
|
| [1]: https://jepsen.io/consistency
| hinkley wrote:
| That's a fun one.
|
| A couple years ago someone posted a solution to that here. I'm
| not sure if it works for SQLite, but it worked for Postgres.
| The basics of it were that each replica was aware of the latest
| transaction ID it had seen. On a normal read you'd deal with
| the usual set of eventually consistent issues. But on a read-
| after-write, you would select a replica that was ahead of the
| write transaction.
|
| Ultimately that's a very odd flavor of sharding. What I don't
| recall is how they propagated that shard information
| efficiently, since with sharding and consistent hashing the
| whole idea is that the lookup algorithm is deterministic, and
| therefore can be run anywhere and get the same result. WAL lag
| information is dynamic, so... Raft?
| benbjohnson wrote:
| LiteFS provides a transaction ID that applications can use to
| determine replication lag. If the replica is behind the TXID,
| it can either wait or it can forward to the primary to ensure
| consistency.
| simonw wrote:
| Is that transaction ID made available to SQLite clients? As
| a custom SQL function or a value in a readable table
| somewhere?
| benbjohnson wrote:
| It's available as a file handle. If you have a database
| file named "db" then you can read "db-pos" to read the
| replication position. The position is a tuple of the TXID
| and the database checksum.
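|
| For anyone curious what consuming that looks like, a rough
| sketch in Python (this assumes the mount path, and that the
| position file is a single "TXID/CHECKSUM" line in hex - check
| the LiteFS docs for the exact format):
|
|     def read_txid(db_path="/litefs/db"):
|         # LiteFS exposes "<db>-pos" next to the database file.
|         with open(db_path + "-pos") as f:
|             txid_hex, _checksum_hex = f.read().strip().split("/")
|         return int(txid_hex, 16)
|
|     # A replica could compare this against the TXID returned by
|     # the primary after a write, and wait (or forward the
|     # request) until it has caught up.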
| infogulch wrote:
| Yeah a transaction ID solution would work pretty well. Every
| request would return the highest transaction ID seen by the
| replica; writes forwarded to the primary would also return
| the transaction ID after that write; any request with a
| transaction ID higher than the replica would be forwarded to
| the primary.
| hinkley wrote:
| I think that may have been the case. Try the local replica,
| if it is too old, hit the primary, rather than shopping
| around for the next best replica.
| simonw wrote:
| Another solution that I've used successfully in the past for
| web apps is to set a 10s cookie every time a user performs a
| write, and then route their read requests to the lead server
| until that cookie expires. That way they're sure to see the
| impact of the write they just made.
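|
| A minimal sketch of that pattern (Flask-flavoured Python; the
| two helper functions are placeholders for however your app
| actually talks to the primary and the local replica):
|
|     from flask import Flask, make_response, request
|
|     app = Flask(__name__)
|     STICKY_SECONDS = 10  # long enough for replication to catch up
|
|     def forward_to_primary():
|         # placeholder: proxy or replay the request to the primary
|         return "served by primary"
|
|     def read_from_local_replica():
|         # placeholder: normal read against the local replica
|         return "served locally"
|
|     @app.post("/save")
|     def save():
|         # ... perform the write (forwarded to the primary) ...
|         resp = make_response("saved")
|         resp.set_cookie("recent_write", "1", max_age=STICKY_SECONDS)
|         return resp
|
|     @app.get("/data")
|     def data():
|         if request.cookies.get("recent_write"):
|             return forward_to_primary()
|         return read_from_local_replica()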
| hinkley wrote:
| If write traffic is low enough I could see that working,
| but 10s is an awful long time. Anything less than a 1000:1
| read:write ratio would start running into capacity problems
| wouldn't it?
| simonw wrote:
| That's never been a problem on any system I've worked on.
| If you assume a web application with logged in users who
| are occasionally updating their data, the percentage of
| users who will have performed a write within the last 10s
| is always going to be absolutely tiny.
| hinkley wrote:
| The absolute rate is the problem. You can't shed that
| load to other machines so the percentage of users or
| percentage of traffic doesn't matter. This is basically a
| variant of Little's Law, where server count has a ceiling
| of 1.
|
| 100 edits a minute from distinct sessions is 1000
| sessions pinned at any moment. If they read anything in
| that interval it comes from the primary. The only
| question is what's the frequency and interval of reads
| after a write.
| fny wrote:
| This reads like a professor who's so steeped in research that
| he's forgotten how to communicate to his students!
|
| What exactly are we talking about here? A WebSQL that's actually
| synced to a proper RDBMS? Synced across devices? I'm not clear
| about an end to end use case.
|
| Edit: Honestly, this line from the LiteFS docs[0] needs to be
| added to the top of the article:
|
| > LiteFS is a distributed file system that transparently
| replicates SQLite databases. This lets you run your application
| like it's running against a local on-disk SQLite database but
| behind the scenes the database is replicated to all the nodes in
| your cluster. This lets you run your database right next to your
| application on the edge.
|
| I had no idea what was being talked about otherwise.
|
| [0]: https://fly.io/docs/litefs/
| mwcampbell wrote:
| I wonder if using FUSE has had any appreciable impact on
| performance, particularly read performance. I ask because FUSE
| has historically had a reputation for being slow, e.g. with the
| old FUSE port of ZFS.
| miduil wrote:
| I think FUSE performance has significantly improved since 2010,
| the year in which ZFS-On-Linux became available I believe.
|
| This is just one of many examples:
| https://www.phoronix.com/news/MTI1MzM
|
| On the contrary, in some cases FUSE is even faster than doing a
| regular kernel mount(). There is an experimental research
| distribution called distri that is exclusively relying on fuse
| mounting and figured out that FUSE was faster for doing mounts.
| https://michael.stapelberg.ch/posts/tags/distri/
| seabrookmx wrote:
| You're conflating zfs-fuse and zfs-on-linux (ZoL).
|
| The first one is, as its name suggests, a user-space
| implementation of ZFS.
|
| ZoL (now unified with OpenZFS) is implemented as a kernel
| module and as such does _not_ run in user space. It performs
| significantly better as a result.
|
| FUSE is still slow, which is why there's ongoing effort to
| replace things like NTFS-3G (the default NTFS implementation
| in most linux distros) with an in-kernel implementation:
| https://news.ycombinator.com/item?id=28418674
| miduil wrote:
| How do I conflate zfs-fuse with ZoL? I was trying to
| highlight that if someone's memory of "poor fuse
| performance" is because of zfs-fuse, then that memory
| must be quite old, as ZoL has been available for quite some
| time and since then fuse itself has also received quite
| some performance gains.
|
| Edit: Also I don't want to imply that FUSE is near in-
| kernel filesystems, but it is certainly performing much
| better than 12 years ago.
| endisneigh wrote:
| Seems neat, until you try to do schema migrations. Unless they
| can guarantee that all containers' SQLite instances have the
| same schema without locking, I'm not sure how this doesn't run
| into the same issues as many NoSQL systems.
|
| CouchDB had this same issue with its database per user model and
| eventually consistent writes.
| slively wrote:
| In my experience distributed database migrations tend to be
| two-phase though, so they can stay compatible during the
| migration: a first migration to add something, release the
| apps to use it, then a second migration to clean up.
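|
| For example, the expand/contract shape in SQLite terms (a
| sketch; table and column names are invented, and DROP COLUMN
| needs SQLite 3.35+):
|
|     import sqlite3
|
|     con = sqlite3.connect(":memory:")
|     con.execute(
|         "CREATE TABLE users (id INTEGER PRIMARY KEY, legacy_name TEXT)"
|     )
|
|     # Phase 1 (expand): additive change that old and new app
|     # versions both tolerate.
|     con.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
|
|     # ... deploy the release that uses display_name, let it roll out ...
|
|     # Phase 2 (contract): clean up once nothing reads the old shape.
|     con.execute("ALTER TABLE users DROP COLUMN legacy_name")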
| vcryan wrote:
| This approach is very appealing to me :) curious about how people
| handle schema migrations when using this approach.
|
| I segment sqlite files (databases) that have the same schema into
| the same folder. I haven't really had a case where migrations were
| really a concern, but I could see it happening soon.
|
| Seems like in my deployment, I'm going to need an approach to
| loop over dbs to apply this change... I currently have a step of
| app deployment that attempts to apply migrations... but it is
| more simplistic because the primary RDBMS (postgresql) just
| appears to the application as a single entity which is the
| normative use-case for db-migrate-runners.
| simonw wrote:
| Yeah schema migrations are going to be interesting. Presumably
| those end up making a HUGE change to the WAL (since they could
| affect every stored row in a large table), which means LiteFS
| then has to send a truly giant chunk of data out to the
| replicas.
|
| I wonder how this would present itself in real-world usage?
| Would the replicas go out-of-date for 10-30s but continue
| serving read-only traffic, or could there be some element of
| cluster downtime caused by this?
| benbjohnson wrote:
| > Presumably those end up making a HUGE change to the WAL
| (since they could affect every stored row in a large table),
| which means LiteFS then has to send a truly giant chunk of
| data out to the replicas.
|
| Yes, a large migration could end up sending a lot of data out
| to replicas. One way to mitigate this is to shard your data
| into separate SQLite databases so you're only migrating a
| small subset at a time. Of course, that doesn't work for
| every application.
|
| > Would the replicas go out-of-date for 10-30s but continue
| serving read-only traffic, or could there be some element of
| cluster downtime caused by this?
|
| Once WAL support is in LiteFS, it will be able to replicate
| this out without causing any read down time. SQLite is still
| a single-writer database though so writes would be blocked
| during a migration.
| clord wrote:
| I can imagine a database which tries to solve both of these
| domains.
|
| A centralized database handles consistency, and vends data
| closures to distributed applications for in-process querying (and
| those closures reconcile via something like CRDT back to the core
| db).
|
| Does this exist?
| ignoramous wrote:
| (except for the "reconcile back" to primary) Yes?
| https://martin.kleppmann.com/2015/11/05/database-inside-out-...
|
| See also: materialize.com, readyset.io, aws elastic-views
| (forever in preview).
___________________________________________________________________
(page generated 2022-09-21 23:00 UTC)