[HN Gopher] LiteFS
       ___________________________________________________________________
        
       LiteFS
        
       Author : danielskogly
       Score  : 401 points
       Date   : 2022-09-21 14:33 UTC (8 hours ago)
        
 (HTM) web link (fly.io)
 (TXT) w3m dump (fly.io)
        
       | no_wizard wrote:
       | This is distributed SQLite 3, running (I assume at least
       | partially managed?) LiteFS[5] for you. Which is pretty cool!
       | 
        | What I'd like to have seen is how this compares to things like
        | rqlite[1] or Cloudflare's D1[2], addressed directly in the
        | article.
       | 
        | That said, I think this is pretty good for things like read
        | replicas. I know the sales pitch here is as a full database, and
        | I don't disagree with it. If I were starting from scratch today
        | and could use this, I totally would give it a try and benchmark
        | / test accordingly; however, I can't speak to that use case
        | directly.
       | 
        | What I can speak to, however, is that most workloads already
        | have a database of some kind set up, and it's typically not
        | SQLite as the main database (MySQL or PostgreSQL seem most
        | common). This is a great way to make very - insanely, really -
        | fast read replicas of your data across regions. You can use an
        | independent Raft[3][4] implementation to do this on write. If
        | your database supports it, you can even trigger replication
        | directly from a write to the database itself (I think Aurora has
        | this ability, and I think - don't quote me! - PostgreSQL can do
        | this natively via an extension that kicks off a background job).
       | 
        | To that point, in my experience one thing SQLite is actually
        | really good at is storing JSON blobs. I have used it in the past
        | to replicate JSON representations of read-only data to great
        | success, cutting down read times significantly for APIs because
        | the data is "pre-baked", and the lightweight nature of SQLite
        | allows you to - if you wanted to naively do this - just spawn a
        | new database for each customer and transform their data
        | accordingly ahead of time. It's like AOT compilation for your
        | data.
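        | 
        | Roughly, the shape of that pattern in Python's sqlite3 module -
        | just an illustrative sketch; the file layout and table/column
        | names here are made up:
        | 
        |     import json
        |     import sqlite3
        | 
        |     def bake_customer_db(customer_id, source_rows):
        |         # Pre-transform one customer's data into its own SQLite file.
        |         con = sqlite3.connect(f"customer-{customer_id}.db")
        |         con.execute(
        |             "CREATE TABLE IF NOT EXISTS api_cache "
        |             "(key TEXT PRIMARY KEY, doc TEXT)")
        |         with con:
        |             con.executemany(
        |                 "INSERT OR REPLACE INTO api_cache (key, doc) "
        |                 "VALUES (?, ?)",
        |                 ((row["key"], json.dumps(row)) for row in source_rows))
        |         con.close()
        | 
        |     def read_prebaked(customer_id, key):
        |         # API read path: the pre-baked JSON blob comes straight from disk.
        |         con = sqlite3.connect(f"customer-{customer_id}.db")
        |         row = con.execute(
        |             "SELECT doc FROM api_cache WHERE key = ?",
        |             (key,)).fetchone()
        |         con.close()
        |         return json.loads(row[0]) if row else None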
       | 
        | If you want to avoid some complexity with sharding (you can't
        | always avoid it outright, but this can help cap its complexity),
        | this approach helps enormously in my experience. Do try before
        | you buy!
       | 
        | EDIT: Looks like it's running LiteFS[5], not Litestream[0]. This
        | was an error in my understanding.
       | 
       | [0]: https://litestream.io/
       | 
       | [1]: https://github.com/rqlite/rqlite
       | 
       | [2]: https://blog.cloudflare.com/introducing-d1/
       | 
       | [3]: https://raft.github.io/
       | 
       | [4]: https://raft.github.io/#implementations
       | 
       | [5]: https://github.com/superfly/litefs
        
         | Gys wrote:
         | > This is distributed SQLite 3, running (I assume at least
         | partially managed?) Litestream[0] for you.
         | 
         | Litestream and LiteFS are by the same author and serve
         | different purposes.
         | 
         | https://litestream.io/alternatives/ goes into this. It includes
         | LiteFS and rqlite.
         | 
         | [Edited for clarification]
        
           | no_wizard wrote:
            | I did see this. It's a little light on details in the areas I
            | was thinking about; however, I appreciate the call-out,
            | especially for others!
           | 
           | It is missing the D1 comparison though.
        
             | benbjohnson wrote:
             | I'm happy to add D1 onto the alternatives list once it's
             | released. I believe it's still in closed beta though.
        
         | benbjohnson wrote:
         | LiteFS/Litestream author here. You bring up a lot of great
         | points that I'll try to address.
         | 
         | > What I'd like to have seen is how this compares to things
         | like rqlite or Cloudflare's D1 addressed directly in the
         | article
         | 
         | I think a post comparing the different options is a great idea.
         | I'll try to summarize a bit here though. LiteFS aims to be an
         | analogue to Postgres replication but with built-in failover.
         | Postgres uses log shipping to copy database state from a
         | primary to its replicas, as does LiteFS. LiteReplica is
         | probably the closest thing to LiteFS although it uses a dual
         | GPL/commercial license model. LiteFS uses Apache 2.
         | 
          | There are Raft-based tools like rqlite or dqlite. These have
          | higher consistency guarantees; however, they tend to be more
          | complex to set up -- especially in an ephemeral environment
          | like Kubernetes. LiteFS has a more relaxed membership model for
         | this reason. As for Cloudflare's D1, they haven't released many
         | details and I assume it's closed source. It also requires using
         | a custom JavaScript client instead of using a native SQLite
         | client.
         | 
         | There are also several eventually consistent stores built on
         | SQLite such as Mycelial[1]. These work great for applications
          | with loose consistency needs. LiteFS still maintains
          | serializable isolation within a transaction, although it has
          | looser guarantees across nodes than something like rqlite.
         | 
          | > Most workloads already have a database of some kind set up,
          | typically not SQLite as their main database (MySQL or
          | PostgreSQL seem most common)
         | 
         | Yes, that's absolutely true. I don't expect anyone to port
         | their application from Postgres/MySQL to SQLite so they can use
          | LiteFS. Databases and database tools have a long road to
          | becoming mainstream, and they follow the traditional adoption
          | curve. I wrote a Go database library called BoltDB about 10
         | years ago and it had the same adoption concerns early on.
         | Typically, folks try it out with toy applications and play
         | around with it. Once they get more comfortable, then they
         | create new applications on top of it. As more people use it,
         | then late adopters get more comfortable and it further builds
         | trust.
         | 
         | We're committed to LiteFS for the long term so we'll be making
         | updates, fixing bugs, and we'll keep trying to build trust with
         | the community.
         | 
         | [1]: https://mycelial.com/
        
           | no_wizard wrote:
            | Note: I apologize if this is overstepping, it's hard to tell!
           | 
            | I think a strong - extremely strong - selling point is the
            | point I made about "pre-baked" data for your APIs. Since the
            | entire strength of these SQLite-based systems resides in
            | their fast read capacity (as mentioned elsewhere and in this
            | article, it's very fast for read-heavy applications, which is
            | most of them), you could take an angle around that to get
            | people "in the door" by showing a pathway for how this fits
            | inside your existing data warehouse / storage model.
           | 
            | We found we liked SQLite's durability for this. It was a bit
            | smarter than just a plain cache (such as Redis), with better
            | durability and (for our needs) comparable enough performance
            | (I think in absolute terms a tuned Redis instance will always
            | be faster, but up to a certain point speed isn't everything,
            | especially when factoring in cost).
           | 
            | We found it was cheaper - by a good margin - to do this than
            | to cache everything AOT in a Redis cluster, and we could
            | therefore go multi-region much more cheaply and have DBs
            | sitting next to our customers that acted as a nearline cache.
           | 
            | The complexity - which is an area where this might help in
            | the future, and why I'm mentioning it - is shipping changes
            | back. What we ended up doing is setting up a write Redis
            | cluster that clients write to; we take those writes and
            | trigger a propagation job back to the database. This allowed
            | us to run a much slimmer Redis cluster and made us feel more
            | comfortable doing cache eviction since we could verify writes
            | pretty easily. You could do this with memcache or whatever
            | too.
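            | 
            | A rough sketch of that write path, just to make the shape
            | concrete - the queue name and the redis-py client here are
            | purely illustrative:
            | 
            |     import json
            |     import redis  # assumes the redis-py client
            | 
            |     r = redis.Redis(host="write-cache.internal", port=6379)
            | 
            |     def record_write(customer_id, payload):
            |         # Clients write to the slim Redis cluster; each write is
            |         # queued for later propagation back to the central DB.
            |         r.lpush("pending-writes",
            |                 json.dumps({"customer": customer_id,
            |                             "data": payload}))
            | 
            |     def propagation_worker(write_to_central_db):
            |         # Background job: drain queued writes and apply them to
            |         # the central database; after that, eviction is safe.
            |         while True:
            |             _, raw = r.brpop("pending-writes")
            |             item = json.loads(raw)
            |             write_to_central_db(item["customer"], item["data"])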
           | 
           | Sounds convoluted, but it worked really well. It allowed us
           | to keep our centralized database intact without having to
           | spin up expensive instances to be multi-region or commit to
            | ever-growing Redis cluster(s). The SQLite flat-file model +
            | its history of durability made it the perfect tradeoff for
            | this use case. Of course, YMMV; however, it was a novel
           | solution that used "enterprise grade" parts all the way down,
           | which made it an easy selling point.
           | 
           | You might find it worth exploring this more.
           | 
            | As far as the comparisons go, I think it'd be cool to see a
            | deep dive and run a test suite against each of the major
            | SQLite-as-a-distributed-database options. For that, I don't
            | think they have to be open source to do a reasonable
            | comparison?
        
             | benbjohnson wrote:
             | > I think a strong - extremely strong - selling point is
             | the point I made about "prebaked" data for your APIs.
             | 
             | I think caches are an excellent use case for LiteFS early
             | on. Sorry I didn't make that point in my previous reply.
             | It's a good way to get benefits out of LiteFS without
             | committing to it as your source of truth. Also related,
             | Segment built a custom SQLite-based solution[1] for
             | distributing out cached data that worked well for them.
             | 
             | [1]: https://segment.com/blog/separating-our-data-and-
             | control-pla...
             | 
             | > We found it was cheaper - by a good margin - to do this
             | over caching everything AOT in a redis cluster
             | 
             | Do you remember specifics of the cost difference? I can
             | imagine it'd be pretty significant since you don't need to
             | spin up servers with a bunch of RAM.
             | 
              | > Note: I apologize if this is overstepping, it's hard to
              | tell!
             | 
             | Not overstepping at all! It's great to hear folks'
             | feedback.
        
               | no_wizard wrote:
                | I'm a little fuzzy on the numbers, but it was 2-3x (or
                | thereabouts), since we didn't have to run multiple
                | gigabyte clusters anymore. Even with auto-scaling based
                | on demand, the clusters were more expensive simply
                | because, if you wanted users to have a good experience,
                | you had to make sure you had a reasonable amount of their
                | data AOT cached, which was itself a lot of complexity to
                | manage.
               | 
                | With SQLite, we could just scale instances during peak
                | demand (so you could read from different instances of the
                | same data if there was a bottleneck) and scale back down
                | again, without (usually!) losing the data.
               | 
               | It was a really complex - but fun - project all told, but
               | its underpinnings were really simple.
               | 
               | How Segment approached the problem isn't dissimilar to
               | how we did it, honestly.
               | 
                | The only thing that we (may) have done differently is we
                | had failover. If SQLite didn't respond, our API layer
                | could then talk directly to a database service to get the
                | data. That was surprisingly complex to do.
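                | 
                | Stripped down, the read path looked something like this -
                | a sketch only, and the central "database service" client
                | here is hypothetical:
                | 
                |     import sqlite3
                | 
                |     def get_record(customer_id, key, central_client):
                |         try:
                |             con = sqlite3.connect(
                |                 f"customer-{customer_id}.db", timeout=0.5)
                |             row = con.execute(
                |                 "SELECT doc FROM api_cache WHERE key = ?",
                |                 (key,)).fetchone()
                |             con.close()
                |             if row is not None:
                |                 return row[0]
                |         except sqlite3.Error:
                |             pass  # local copy unavailable; fall through
                |         # Failover: ask the central database service.
                |         return central_client.fetch(customer_id, key)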
               | 
                | It's entirely possible that a setup even more robust than
                | ours was at the time would yield higher cost savings. We
                | did this before we hit our next scale of customers, just
                | to add a little more context.
        
           | ignoramous wrote:
           | Thanks.
           | 
           | > _LiteFS still maintains serializable isolation within a
           | transaction, although, it has looser guarantees across nodes
           | than something like rqlite._
           | 
            | Picking up a term from the consistency map here [0], what
            | guarantees does LiteFS make across nodes?
           | 
           | [0] https://jepsen.io/consistency
        
             | benbjohnson wrote:
              | Good question. Right now, LiteFS operates with async
              | replication, so it's possible for a transaction written to
              | the primary to get lost if the primary fails before it's
              | replicated out. For that short window, you could read a
              | transaction that no longer exists. From that standpoint, I
              | believe it would technically be Read Uncommitted.
             | 
             | However, during normal operation it'll function more like
             | Snapshot Isolation. LiteFS does provide a transaction ID so
             | requests could wait for a replica to catch up before
             | issuing a transaction to get something closer to
             | Serializable.
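              | 
              | As a rough sketch of that "wait for the replica to catch
              | up" idea (assuming the primary's TXID is handed back to the
              | client, and that the replica's current TXID can be read
              | locally - LiteFS exposes a position file for this, but
              | treat the path and format below as assumptions and check
              | the docs):
              | 
              |     import time
              | 
              |     def read_local_txid(pos_path="/litefs/db-pos"):
              |         # Hypothetical position file: "<hex txid>/<checksum>".
              |         with open(pos_path) as f:
              |             return int(f.read().split("/")[0], 16)
              | 
              |     def wait_for_txid(target, timeout=1.0, interval=0.01):
              |         deadline = time.monotonic() + timeout
              |         while time.monotonic() < deadline:
              |             if read_local_txid() >= target:
              |                 return True   # caught up; safe to read here
              |             time.sleep(interval)
              |         return False  # still behind; redirect to the primary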
        
               | Spivak wrote:
                | I think you're being too harsh on yourself; isolation
                | levels don't typically take replication lag and network
                | partitioning into account. For example, on MySQL you can
                | have async replication and serializable isolation.
                | Replicas in this mode might never receive updates.
        
               | benbjohnson wrote:
               | Isolation levels in ACID are typically used for single
                | instances. However, if you're talking about a distributed
                | system, then the consistency diagram on Jepsen is more
                | applicable[1]. Raft, for example, can ensure read
               | consistency across a cluster by running reads through the
               | consensus mechanism.
               | 
               | LiteFS aims to be the analogue to Postgres/MySQL
               | replication and those generally work great for most
               | applications.
               | 
               | [1]: https://jepsen.io/consistency
        
           | benbjohnson wrote:
           | > This is distributed SQLite 3, running (I assume at least
           | partially managed?) Litestream for you.
           | 
            | Oh, I forgot to touch on this point. LiteFS shares some
            | concepts with Litestream (e.g. log shipping); however, it
            | doesn't use Litestream internally. It has much stricter
            | requirements in terms of ensuring consistency since it's
            | distributed, so it performs an incremental checksum of the
            | database on every transaction. It also gets additional
            | benefits from its internal storage format, called LTX. These
            | storage files can be compacted together, which will allow
            | point-in-time restores that are nearly instant.
        
           | threatofrain wrote:
           | Cloudflare D1 is in closed beta so everything is presumably
           | tentative, but atm they don't support transactions (!) and it
           | doesn't appear to be on their near-term product roadmap. That
           | one really surprised me.
        
         | [deleted]
        
       | hinkley wrote:
       | > Second, your application can only serve requests from that one
       | server. If you fired up your server in Dallas then that'll be
       | snappy for Texans. But your users in Chennai will be cursing your
       | sluggish response times since there's a 250ms ping time between
       | Texas & India.
       | 
       | > To improve availability, it uses leases to determine the
       | primary node in your cluster. By default, it uses Hashicorp's
       | Consul.
       | 
       | Having a satellite office become leader of a cluster is one of
       | the classic blunders in distributed computing.
       | 
       | There are variants of Raft where you can have quorum members that
       | won't nominate themselves for election, but out of the box this
       | is a bad plan.
       | 
       | If you have a Dallas, Chennai, Chicago, and Cleveland office and
       | Dallas goes dark (ie, the tunnel gets fucked up for the fifth
       | time this year), you want Chicago to become the leader, Cleveland
       | if you're desperate. But if Chennai gets elected then everyone
       | has a bad time, including Dallas when it comes back online.
        
       | hobo_mark wrote:
       | Well that was fast... [1]
       | 
       | Are the readable replicas supposed to be long-lived (as in, I
       | don't know, hours)? Or does consul happily converge even with
       | ephemeral instances coming and going every few minutes (thinking
       | of something like Cloud Run and the like, not sure if Fly works
       | the same way)? And do they need to make a copy of the entire DB
       | when they "boot" or do they stream pages in on demand?
       | 
       | [1] https://news.ycombinator.com/item?id=32240230
        
         | benbjohnson wrote:
         | > Are the readable replicas supposed to be long-lived
         | 
         | Right now, yes, they should be relatively long-lived (e.g.
         | hours). Each node keeps a full copy of the database so it's
         | recommended to use persistent volumes with it so it doesn't
         | need to re-snapshot on boot.
         | 
         | We do have plans to make this work with very short-lived
         | requests for platforms like Lambda or Vercel. That'll use
         | transactionally-aware paging on-demand. You could even run it
         | in a browser, although I'm not sure why you would want to. :)
        
       | jensneuse wrote:
        | Sounds like a drop-in solution to add high availability to
        | WunderBase (https://github.com/wundergraph/wunderbase). Can we
        | combine LiteFS with Litestream for backups, or how would you do
        | HA + backups together?
        
         | benbjohnson wrote:
         | Yes, it should fit with WunderBase to provide HA. I'm adding S3
         | support which will work the same (in principle) as Litestream.
         | I'm hoping to have that in about a month or so.
        
           | rsync wrote:
           | Can you comment on how you added SFTP support to litestream
           | so that it would work with rsync.net[1][2] ?
           | 
           | How does that compare/contrast with what my grandparent is
           | alluding to ?
           | 
           | [1] https://github.com/benbjohnson/litestream/issues/140
           | 
           | [2] https://www.rsync.net/resources/notes/2021-q3-rsync.net_t
           | ech...
        
             | benbjohnson wrote:
              | Litestream does log shipping, so it just takes a chunk of
              | the write-ahead log (WAL) and copies it out to a
              | destination. That can be S3, GCP, SFTP, etc.
             | 
             | LiteFS will do the same although I'm only targeting S3
             | initially. That seemed to be the bulk of what people used.
        
       | nicoburns wrote:
       | Where is the data actually being stored in this setup? A copy on
       | each machine running the application? If so, is there another
       | copy somewhere else (e.g. S3) in case all nodes go down?
       | 
       | Also, what happens if the Consul instance goes down?
       | 
       | If my application nodes can't be ephemeral then this seems like
       | it would be harder to operate than Postgres or MySQL in practice.
       | If it completely abstracts that away somehow then I suppose
       | that'd be pretty cool.
       | 
       | Currently finding it hard to get on board with the idea that
       | adding a distributed system here actually makes things simpler.
        
         | benbjohnson wrote:
         | > Where is the data actually being stored in this setup? A copy
         | on each machine running the application?
         | 
         | Yes, each node has a full copy of the database locally.
         | 
         | > If so, is there another copy somewhere else (e.g. S3) in case
         | all nodes go down?
         | 
         | S3 replication support is coming[1]. Probably in the next month
         | or so. Until then, it's recommended that you run a persistent
         | volume with your nodes.
         | 
         | > What happens if the Consul instance goes down?
         | 
         | If Consul goes down then you'll lose write availability but
         | your nodes will still be able to perform read queries.
         | 
         | > If my application nodes can't be ephemeral then this seems
         | like it would be harder to operate than Postgres or MySQL in
         | practice.
         | 
         | Support for pure ephemeral nodes is coming. If you're comparing
         | LiteFS to a single node Postgres/MySQL then you're right, it's
         | harder to operate. However, distributing out Postgres/MySQL to
         | regions around the world and handling automatic failovers is
         | likely more difficult to operate than LiteFS.
        
           | omnimus wrote:
            | Will this replicate the filesystem/volume between instances?
            | Can I, for example, use text files or a different kind of
            | file-based database? Or does it require me to use SQLite?
        
             | simonw wrote:
             | This is only for SQLite. But you can put text files (or
             | even binary files - see
             | https://simonwillison.net/2020/Jul/30/fun-binary-data-and-
             | sq...) in a SQLite database and you may find you get better
             | performance than reading files from disk:
             | https://www.sqlite.org/fasterthanfs.html
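              | 
              | As a quick sketch of the "files in SQLite" idea (the schema
              | here is made up for illustration):
              | 
              |     import sqlite3
              | 
              |     con = sqlite3.connect("files.db")
              |     con.execute(
              |         "CREATE TABLE IF NOT EXISTS files "
              |         "(path TEXT PRIMARY KEY, body BLOB)")
              | 
              |     def store_file(path):
              |         # Store the raw bytes of a file as a BLOB row.
              |         with open(path, "rb") as f:
              |             con.execute(
              |                 "INSERT OR REPLACE INTO files (path, body) "
              |                 "VALUES (?, ?)", (path, f.read()))
              |         con.commit()
              | 
              |     def load_file(path):
              |         # Read it back with a single indexed lookup.
              |         row = con.execute(
              |             "SELECT body FROM files WHERE path = ?",
              |             (path,)).fetchone()
              |         return bytes(row[0]) if row else None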
        
       | whoisjuan wrote:
       | Unrelated, but why does the map on their homepage show a region
       | in Cuba? That must be wrong.
        
         | dcre wrote:
         | I think it's in Miami and the icon is confusingly centered over
         | it. The icons look like pins but I think they're supposed to be
         | hot-air balloons. This explains why some of them appear to
         | point to spots in the ocean.
         | 
         | https://fly.io/docs/reference/regions/
        
           | whoisjuan wrote:
           | Ahh. Great catch. Their balloon definitely looks like a pin.
        
       | [deleted]
        
       | theomega wrote:
        | Does anyone else bump into the issue that the fly.io website
        | does not load if requested via IPv6 on a Mac? I tried Safari,
        | Chrome, and curl, and none of them work:
        | 
        |     $ curl -v https://fly.io/blog/introducing-litefs/
        |     *   Trying 2a09:8280:1::a:791:443...
        |     * Connected to fly.io (2a09:8280:1::a:791) port 443 (#0)
        |     * ALPN, offering h2
        |     * ALPN, offering http/1.1
        |     * successfully set certificate verify locations:
        |     *  CAfile: /etc/ssl/cert.pem
        |     *  CApath: none
        |     * (304) (OUT), TLS handshake, Client hello (1):
        |     curl: (35) error:02FFF036:system library:func(4095):Connection reset by peer
       | 
        | Requesting via IPv4 works:
        | 
        |     $ curl -4v https://fly.io/blog/introducing-litefs/
        |     *   Trying 37.16.18.81:443...
        |     * Connected to fly.io (37.16.18.81) port 443 (#0)
        |     * ALPN, offering h2
        |     * ALPN, offering http/1.1
        |     * successfully set certificate verify locations:
        |     *  CAfile: /etc/ssl/cert.pem
        |     *  CApath: none
        |     * (304) (OUT), TLS handshake, Client hello (1):
        |     * (304) (IN), TLS handshake, Server hello (2):
        |     * (304) (IN), TLS handshake, Unknown (8):
        |     * (304) (IN), TLS handshake, Certificate (11):
        |     * (304) (IN), TLS handshake, CERT verify (15):
        |     * (304) (IN), TLS handshake, Finished (20):
        |     * (304) (OUT), TLS handshake, Finished (20):
        |     * SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
        |     * ALPN, server accepted to use h2
        |     * Server certificate:
        |     *  subject: CN=fly.io
        |     *  start date: Jul 25 11:20:01 2022 GMT
        |     *  expire date: Oct 23 11:20:00 2022 GMT
        |     *  subjectAltName: host "fly.io" matched cert's "fly.io"
        |     *  issuer: C=US; O=Let's Encrypt; CN=R3
        |     *  SSL certificate verify ok.
        |     * Using HTTP2, server supports multiplexing
        |     * Connection state changed (HTTP/2 confirmed)
        |     * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
        |     * Using Stream ID: 1 (easy handle 0x135011c00)
        |     > GET /blog/introducing-litefs/ HTTP/2
        |     > Host: fly.io
        |     > user-agent: curl/7.79.1
        |     > accept: */*
        |     >
        |     * Connection state changed (MAX_CONCURRENT_STREAMS == 32)!
        |     < HTTP/2 200
        |     < accept-ranges: bytes
        |     < cache-control: max-age=0, private, must-revalidate
        |     < content-type: text/html
        |     < date: Wed, 21 Sep 2022 16:50:16 GMT
        |     < etag: "632b20f0-1bdc1"
        |     < fly-request-id: 01GDGFA3RPZPRDV9M3AQ3159ZK-fra
        |     < last-modified: Wed, 21 Sep 2022 14:34:24 GMT
        |     < server: Fly/51ee4ef9 (2022-09-20)
        |     < via: 1.1 fly.io, 2 fly.io
        |     <
        |     <!doctype html> ...
        
         | tptacek wrote:
         | Works fine over IPv6 from this Mac.
        
         | toast0 wrote:
          | Could easily be a path MTU issue? I don't currently have
          | working IPv6, but on TCP/IPv4, the server->client handshake
          | packets I get back from fly.io are full-length packets. If your
          | TCP/IPv6 MSS is set incorrectly, and some other not-great
          | things are happening in the path, then you might not be able to
          | receive large packets.
        
           | ignoramous wrote:
           | > _Could easily be a path MTU issue?_
           | 
           | Perhaps: https://community.fly.io/t/ipv6-pmtud-issue/5081
        
           | theomega wrote:
            | How could I find out if this is the case? And what can I do
            | about it? Disabling IPv6 obviously fixes it, but that is not
            | a solution...
        
             | toast0 wrote:
             | I'm not aware of a simple to use test site for this on IPv6
             | (I've got one for IPv4, but my server host doesn't do IPv6
             | either, and using a tunnel will limit the ranges I can
             | test, etc)... so you'll need to test ye olde fashioned way.
             | 
             | I don't have a mac, but if the man page[1] is right,
             | something like this should work to see how big of a packet
             | you can successfully send and receive:
             | ping -6 -D -G 1500 -g 1400 fly.io
             | 
             | (you may need to run as root to set packet sizes). You
             | should get back a list of replies with say 1408 bytes, then
             | 1409, etc. The last number you get back is effectively the
             | largest IPv6 payload you can receive (on this path, it
             | could be different for other paths), and if you add the
             | IPv6 header length of 40, that's your effective path MTU.
             | 
             | Use tcpdump to see what TCP MSS is being sent on your
             | outgoing SYN packets, for IPv4, the MSS is MTU - 40 (20 for
             | IPV4 header, 20 for TCP header), for IPv6, the MSS should
             | be MTU - 60 (40 for IPv6 header, 20 for TCP header). If
             | your TCP MSS is higher than the observed path MTU to
             | fly.io, that's likely the immediate cause of your problem.
             | 
             | If you're using a router, make sure it knows the proper MTU
             | for the IPv6 connection, and enable MSS clamping on IPv6,
             | if possible --- or make sure the router advertisement
             | daemon shares the correct MTU.
             | 
             | Hope this gets you started.
             | 
             | [1] https://ss64.com/osx/ping.html
        
               | theomega wrote:
               | Thanks, that was of great help!
               | 
                | Just for reference, the ping command is a little
                | different:
                | 
                |     sudo ping6 -D -G 1500,1400 fly.io
               | 
                | I set an MSS of 1492, which pfSense (my router) translates
                | to an MSS clamp of 1492-60 for IPv6 and 1492-40 for IPv4.
                | This is a German Deutsche Telekom fiber connection. Now
                | everything works fine, I can reach fly.io (and I also
                | discovered that https://ipv6-test.com was not working
                | before and now does with the MSS clamping).
               | 
               | Does MSS clamping have any disadvantages? Are there any
               | alternatives in my case?
        
               | toast0 wrote:
                | Excellent, happy to help, glad you figured out the right
                | command! 1492 MTU is consistent with PPPoE, or a network
               | where they didn't want to run a separate MTU for PPPoE
               | and straight ethernet.
               | 
               | The only downside to MSS clamping is the computational
               | expense of inspecting and modifying the packets. On a
               | residential connection, where you're running pfsense
               | already, it's probably not even noticeable; but your ISP
               | wouldn't be able to do clamping for you, because large
               | scale routers don't have the processing budget to inspect
               | packets at that level. I've seen some MSS clamping
               | implementations that only clamp packets going out to the
               | internet, and not the return packets... that can lead to
               | problems sending large packets (which isn't always very
               | noticeable, actually; a lot of basic browsing doesn't
               | send packets large enough to hit this, unless you go to a
               | site that sets huge cookies or do some real uploading)
               | 
               | The alternative would be to run a 1492 MTU on your LAN,
               | but that has the marginal negative of reducing your
               | maximum packet size for LAN to LAN transfers.
        
         | wglb wrote:
          | Just tried it on my Linux box:
          | 
          |     *   Trying 2a09:8280:1::a:791:443...
          |     * Connected to fly.io (2a09:8280:1::a:791) port 443 (#0)
          |     * ALPN, offering h2
          |     * ALPN, offering http/1.1
          |     *  CAfile: /etc/ssl/certs/ca-certificates.crt
          |     *  CApath: /etc/ssl/certs
          |     * TLSv1.0 (OUT), TLS header, Certificate Status (22):
          |     * TLSv1.3 (OUT), TLS handshake, Client hello (1):
          |     * TLSv1.2 (IN), TLS header, Certificate Status (22):
          |     * TLSv1.3 (IN), TLS handshake, Server hello (2):
          |     * TLSv1.2 (IN), TLS header, Finished (20):
          |     * TLSv1.2 (IN), TLS header, Supplemental data (23):
          |     * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
          |     * TLSv1.2 (IN), TLS header, Supplemental data (23):
          |     * TLSv1.3 (IN), TLS handshake, Certificate (11):
          |     * TLSv1.2 (IN), TLS header, Supplemental data (23):
          |     * TLSv1.3 (IN), TLS handshake, CERT verify (15):
          |     * TLSv1.2 (IN), TLS header, Supplemental data (23):
          |     * TLSv1.3 (IN), TLS handshake, Finished (20):
          |     * TLSv1.2 (OUT), TLS header, Finished (20):
          |     * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
          |     * TLSv1.2 (OUT), TLS header, Supplemental data (23):
          |     * TLSv1.3 (OUT), TLS handshake, Finished (20):
          |     * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
          |     * ALPN, server accepted to use h2
          |     * Server certificate:
         | 
         | ...
         | 
         | Same on my macbook
        
       | ranger_danger wrote:
       | ELI5?
        
         | benbjohnson wrote:
         | LiteFS makes it so that you can have a SQLite database that is
          | transparently replicated to a cluster of machines. The use case
          | is this: if your app is running on a server in Chicago, then
          | users in Europe will have 100ms latency to your server and
          | users in Asia will have 250ms latency. That's on top of the
         | time it takes for your server to process the request. Many
         | people target sub-100ms latency to make their applications feel
         | snappy but that's impossible if users are waiting on their
         | request to be sent halfway around the world.
         | 
         | Traditionally, it's complicated to replicate your database to
         | different regions of the world using something like Postgres.
         | However, LiteFS aims to make it as simple as just spinning up
         | more server instances around the world. It connects them
         | automatically and ships changes in a transactionally safe way
         | between them.
        
           | boltzmann-brain wrote:
           | > Traditionally, it's complicated to replicate your database
           | to different regions of the world using something like
           | Postgres
           | 
           | what makes it complicated?
        
         | twobitshifter wrote:
         | It's a tool to allow you to keep your database on the app
         | server and backed up on s3. This means you don't need a
         | separate database server. Having a database on the same machine
         | as the app is very fast, but it is usually risky. The backups
         | made by LiteFS lessen the risk.
        
           | hobo_mark wrote:
           | That's Litestream, not LiteFS.
        
       | vcryan wrote:
       | Currently, I am running multiple application servers (and lambda
       | functions) using AWS Fargate that access sqlite files (databases)
       | on an EFS share. So far so good, although my use cases are fairly
       | simple.
        
       | asim wrote:
        | Fly.io is the company I wanted to build 10 years ago. Something
       | with massive technical depth that becomes a developer product.
       | They're doing an incredible job and part of that comes down to
       | how they evangelise the product outside of all the technical
       | hackery. This requires so much continued effort. AND THEN to
       | actually run a business on top of all that. Kudos to you guys. I
       | struggled so much with this. Wish you nothing but continued
       | success.
        
         | ignoramous wrote:
          | 5 years ago, fly.io was basically a poor man's deno.com /
          | oven.sh [0]. In my (incorrect) opinion, tptacek changed
          | fly.io's trajectory single-handedly.
         | 
         | [0] https://ghostarchive.org/varchive/r-1hXDvOoHA
        
           | stingraycharles wrote:
           | What is @tptacek's relation with fly.io, and how did he
           | change it?
        
           | tptacek wrote:
           | No? No. No!
        
       | pphysch wrote:
       | > Developing against a relational database requires devs to watch
       | out for "N+1" query patterns, where a query leads to a loop that
       | leads to more queries. N+1 queries against Postgres and MySQL can
       | be lethal to performance. Not so much for SQLite.
       | 
       | This is misleading AFAICT. The article(s) is actually comparing
       | remote RDBMS to local RDBMS, not Postgres to SQLite.
       | 
       | Postgres can also be served over a UNIX socket, removing the
       | individual query overhead due to TCP roundtrip.
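        | 
        | For reference, something like this connects over the local UNIX
        | socket rather than TCP - the socket directory varies by distro,
        | and the psycopg2 client and database names here are just
        | illustrative:
        | 
        |     import psycopg2  # assumes psycopg2 is installed
        | 
        |     # A host starting with "/" means a UNIX socket directory.
        |     conn = psycopg2.connect(dbname="app", user="app",
        |                             host="/var/run/postgresql")
        |     with conn, conn.cursor() as cur:
        |         cur.execute("SELECT 1")
        |         print(cur.fetchone())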
       | 
        | SQLite is a great technology, but keep in mind that you can
        | deploy Postgres right next to your app as well. If your app is
        | something like a company backend that could evolve a lot and
        | benefit from Postgres's advanced features, that may be the right
        | choice.
        
         | simonw wrote:
         | I get the impression that there's still quite a bit of overhead
         | involved in talking to PostgreSQL that you don't get with
         | SQLite. A SQLite query is pretty much a C function call that
         | works directly with data structures and local disk.
        
         | tptacek wrote:
         | Anyone want to chime in with the largest app they've deployed
         | where prod Postgres was reachable over a Unix domain socket?
        
           | tshaddox wrote:
           | Are there significant limits to the size of an app that could
           | be deployed alongside Postgres versus the size of an app that
           | could be deployed alongside SQLite?
        
             | zie wrote:
              | No, the big difference is write speed. SQLite is limited in
              | the number of writes it can do at the same time (typically
              | 1). This is generally called "concurrent writes". Last I
              | checked, SQLite can't really get past 1 write at the same
              | time, but it can emulate concurrent writes in WAL mode such
              | that it isn't normally a problem for most applications.
              | Postgres has no such limit (though there can be limits to
              | updating a particular row/column of data concurrently,
              | obviously).
              | 
              | Otherwise both are generally limited to the physical
              | resources of the machine (memory, disk, etc.). Generally
              | speaking, you can scale boxes much, much farther than your
              | data size for most applications.
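              | 
              | In practice the "writers queue up" behavior is mostly a
              | matter of turning on WAL mode and a busy timeout - a quick
              | sketch (the table name is made up):
              | 
              |     import sqlite3
              | 
              |     con = sqlite3.connect("app.db", timeout=5.0)
              |     # WAL: readers no longer block the single writer.
              |     con.execute("PRAGMA journal_mode=WAL")
              |     # Writers wait up to 5s for the lock instead of erroring.
              |     con.execute("PRAGMA busy_timeout=5000")
              |     con.execute(
              |         "CREATE TABLE IF NOT EXISTS events (payload TEXT)")
              | 
              |     # Concurrent writers still go one at a time, but they
              |     # queue instead of failing with "database is locked".
              |     with con:
              |         con.execute("INSERT INTO events (payload) VALUES (?)",
              |                     ("hello",))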
        
               | nicoburns wrote:
               | > it can emulate concurrent writes in WAL mode
               | 
               | Seems a bit unfair to call WAL mode emulation of "true"
               | concurrent writes as I'm pretty sure a write-ahead-log
               | (WAL) is exactly how other databases implement multiple
               | concurrent writes. It's just always-on rather than being
               | opt-in.
        
               | zie wrote:
                | Well, but that's not really what it's doing. WAL mode in
                | SQLite is different from WAL in your typical RDBMS. WAL
                | mode in SQLite just gets the write path out of the read
                | path, but there is still only 1 write allowed at any
                | given time. The "fake" concurrency is done by letting
                | write transactions queue up and wait for the lock.
               | 
               | See [here](https://sqlite.org/wal.html) under "2.2.
               | Concurrency" where it says:
               | 
               | > "However, since there is only one WAL file, there can
               | only be one writer at a time."
               | 
               | SQLite is awesome, I'm a huge fan, but if you need to do
               | lots and lots of writes, then SQLite is not your friend.
               | Luckily that's a very rare application. There is a reason
               | you never see SQLite being the end point for logs and
               | other write-heavy applications.
        
           | nucleardog wrote:
           | Used to work developing standalone kiosks and things (think
           | along the lines of stuff you'd find at a museum as the basic
           | type of thing I'd work on), so absolutely not the use case
           | you're thinking of.
           | 
           | In a number of cases, we were working with data that would
           | really benefit from actual GIS tooling so PostGIS was kind of
           | the natural thing to reach for. Many of these kiosks had
           | slow, intermittent, or just straight up no internet
           | connection available.
           | 
           | So we just deployed PostGIS directly on the kiosk hardware.
           | Usually little embedded industrial machines with passive
           | cooling working with something like a generation or two out-
           | of-date Celeron, 8GB RAM, and a small SSD.
           | 
           | We'd load up shapefiles covering a bunch of features across
           | over a quarter million square miles while simultaneously
           | running Chromium in kiosk mode.
           | 
           | I know for a fact some of those had around 1k DAU. I mean,
           | never more than one simultaneously but we did have them! I'm
           | sure it would have handled more just fine if it weren't for
           | the damn laws of physics limiting the number of people that
           | can be physically accessing the hardware at the same time.
           | 
           | That said, we had the benefit of knowing that our user counts
           | would never really increase and due to real-world limitations
           | we'd never be working with a remote database because there
           | was literally no internet available. In general I'd still say
           | colocating your database is not an excuse to allow N+1 query
           | patterns to slip in. They'll be fine for now and just come
           | back to bite you in the ass when your app _does_ outgrow
           | colocating the app/database.
        
           | BeefWellington wrote:
           | Back in my consulting days one of my clients had an
           | application with nearly all of the business logic being done
           | in-database. The app was essentially a frontend for stored
           | procedures galore and it sat on an absolute monster of a
           | system for its time (~256GB of RAM, 24 or 32 Cores I think).
           | It handled something like 100k DAU with no issue. That was
           | for just the transactional system. They replicated data to
           | other (similar style) systems for less mission-critical
           | things like reporting, billing, etc.
           | 
           | I want to be clear though, I would never recommend anyone do
           | this. For its time it was impressive but I suspect by now
            | it's been re-engineered away from that model, if for no other
            | reason than it's a lot of eggs in one basket.
        
           | metafex wrote:
           | Had an instance with almost 2TB on disk in 2015, it's
           | probably a lot more by now, if it's still running as-is. Was
           | for analytics/data-warehousing. Though the big table was
           | partitioned so queries were always nice and snappy. The
           | machine had 128gb ram back then, but probably never needed
           | that much. The applications working with the data ran on that
           | same server, so processing latency was usually <10ms even for
           | larger queries.
        
           | fdr wrote:
            | Not me personally, but my understanding is that some Cisco
            | products have Postgres "embedded." If they're not using unix
            | domain sockets, they're at least using loopback.
           | 
           | ...come to think of it, they may not be the only network
           | device maker that does this.
        
           | pphysch wrote:
           | I would wager that the median IT shop/dept is running off a
           | single node RDBMS, the majority of which also have the spare
           | resources to support running the primary application (which
           | has 1-100 DAU).
           | 
           | No practical way to test this, of course.
        
           | whalesalad wrote:
           | is it cheating when pgbouncer is being accessed via socket?
           | lol
        
             | tptacek wrote:
             | Yes, that is cheating.
        
           | jzelinskie wrote:
           | I'm surprised no one is chiming in from the LAMP era. Maybe I
           | was just in a bubble, but I feel like before modern cloud
           | providers, folks deploying on VPSs tried to keep operational
           | complexity lower by running their database on the same box
           | until they couldn't anymore.
           | 
           | In my experience, I just cached the living hell out of the
           | apps to avoid the I/O from the database. Providers in those
           | days had huge issues with noisy neighbors, so I/O performance
           | was quite poor.
           | 
           | [0]: https://en.wikipedia.org/wiki/LAMP_(software_bundle)
        
           | lbhdc wrote:
           | We use managed postgres on GCP, and it is served over unix
           | socket.
        
             | cbsmith wrote:
             | Kind of by definition, GCP's managed postgres is served up
             | over the network. I'm guessing you mean "unix socket" in
             | the sense of the socket API, not unix _domain_ sockets.
        
           | shrubble wrote:
           | It's a great way to deploy if you have the RAM and disk.
           | 
           | Bump up the shared buffers, have a multithreaded or pooled
           | driver for your app server so connections are not being
           | opened and closed all the time.
           | 
           | It really 'flies' under such a scenario.
           | 
           | Largest app was a big analytics and drug tracking application
           | for a small pharmaceutical company. Not sure of the size but
           | it was very reporting heavy.
        
         | pweissbrod wrote:
         | Select N+1 is a fundamental anti-pattern of relational database
         | usage. Reducing the latency per round trip doesn't change this
         | fact.
        
           | simonw wrote:
           | I disagree. I think it makes a big difference.
           | 
           | https://sqlite.org/np1queryprob.html#the_need_for_over_200_s.
           | .. talks about this in the context of Fossil, which uses
           | hundreds of queries per page and loads extremely fast.
           | 
           | It turns out the N+1 thing really is only an anti-pattern if
           | you're dealing with significant overhead per query. If you
           | don't need to worry about that you can write code that's much
           | easier to write and maintain just by putting a few SQL
           | queries in a loop!
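            | 
            | The shape that would be an anti-pattern over the network but
            | is cheap in-process looks roughly like this (the schema is
            | illustrative):
            | 
            |     import sqlite3
            | 
            |     con = sqlite3.connect("blog.db")
            | 
            |     def page_with_tags():
            |         posts = con.execute(
            |             "SELECT id, title FROM posts LIMIT 50").fetchall()
            |         out = []
            |         for post_id, title in posts:
            |             # One extra query per post: each is just a local
            |             # function call, so this stays fast.
            |             tags = [r[0] for r in con.execute(
            |                 "SELECT tag FROM post_tags WHERE post_id = ?",
            |                 (post_id,))]
            |             out.append({"title": title, "tags": tags})
            |         return out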
           | 
           | Related: the N+1 problem is notorious in GraphQL world as one
           | of the reasons building a high performance GraphQL API is
           | really difficult.
           | 
           | With SQLite you don't have to worry about that! I built
           | https://datasette.io/plugins/datasette-graphql on SQLite and
           | was delighted at how well it can handle deeply nested
           | queries.
        
         | alberth wrote:
         | A better comparison might be Postgres to BedrockDB.
         | 
          | BedrockDB is a MySQL-compatible interface wrapped around
          | SQLite and, as such, is intended for remote RDBMS use.
         | 
         | Expensify has some interesting scaling performance data for
         | their use of it.
         | 
         | https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...
        
         | woooooo wrote:
         | Isn't "n + 1" typically a degenerate case of ORM rather than a
         | relational thing? The whole point of rdbms is you can do a join
         | instead of iterated lookups.
        
           | pphysch wrote:
           | Yeah most N+1 issues can be mitigated by optimizing the
           | queries.
           | 
           | You definitely don't need an ORM to get to N+1 hell, a for-
           | loop in vanilla PHP will do.
        
             | woooooo wrote:
             | Yeah, but you have to go out of your way to do that. If
             | you're in touch with the SQL you're executing at all, the
             | join is the obvious easy way to do it, it's not some arcane
             | optimization.
        
               | tyingq wrote:
               | You do see it often in things like canned ecommerce
               | software. They have this ecosystem that encourages you to
               | add widgets to category listings, product listings, etc,
               | that all create queries keying off the page context.
               | 
               | Where widgets might be "similar items", "people that
               | bought x, bought y", "recent purchases of this widget",
               | "best sellers in this category" and so on.
        
               | RussianCow wrote:
               | That's a really big "if". In my experience, most
               | application engineers aren't very good at SQL because
               | they don't have to be; most apps just don't ever reach a
               | scale where it matters, and something like Rails or
               | Django with N+1 queries all over the place performs just
               | fine.
        
               | woooooo wrote:
               | Hence my 2 decade losing battle against ORM at every
               | place I've worked :)
               | 
               | I'm, like, 1 for 10 maybe.
        
             | mgkimsal wrote:
             | A for-loop in vanilla _anything_ will do. Nothing special
             | about PHP. Ruby, Python, Java, etc...
        
               | pphysch wrote:
               | Of course. PHP is the one you will mainly encounter in
               | the wilderness, though.
        
               | 0x457 wrote:
               | Most common place I saw N+1 is ActiveRecord (Ruby on
               | Rails).
               | 
                | Because by default, no relationships are loaded, and the
                | way AR "loads" stuff makes it extremely easy to create
                | N+M queries in 5 lines of Ruby. Then, those people will
                | tell you that you run out of database before you run out
                | of Ruby...
        
           | dspillett wrote:
           | You see it in direct code too, from new or naive coders (the
           | good old "it was fast enough on my test data...") or when
           | proof-of-concept code "accidentally" escapes into production.
           | 
           | Even within the database you'll sometimes find examples:
           | stored procedures looping over a cursor and running queries
           | or other statements in each iteration. This doesn't have the
           | same latency implications as N+1 requests over a network, but
           | still does add significant time per step compared to a set
           | based equivalent.
           | 
           | If you think about it, the database will sometimes do this
           | itself when your database is not optimally structured &
           | indexed for your queries (in SQL Server, look at "number of
           | executions" in query plans).
        
             | nicoburns wrote:
             | > You see it in direct code too, from new or naive coders
             | 
             | Many moons ago in one of my first coding projects I not
             | only used a separate query for each record, but a separate
             | database connection too. In addition to this, I didn't use
             | a loop to generate these queries, I _copy and pasted the
             | connection setup and query code for each record_ , leading
             | to a 22k code file just for this single web page.
             | 
             | There were probably on the order of 1000 records (they
             | represented premier league football players), and amazingly
             | this code was fast enough (took a couple of seconds to load
             | - which wasn't so unusual back in 2007) and worked well
              | enough that it actually had ~100 real users (not bad for a
              | 14 year old's side project).
        
         | benbjohnson wrote:
          | You could run Postgres over UNIX sockets, although you will
          | still get higher latency than SQLite's in-process model. Also,
          | running a Postgres instance on every app instance at the edge
          | probably isn't practical. Postgres has some great advanced
          | features if you need them, but it's also much more heavyweight.
         | 
         | With LiteFS, we're aiming to easily run on low resource cloud
         | hardware such as nodes with 256MB or less of RAM. I haven't
         | tried it on a Raspberry Pi yet but I suspect it would run fine
         | there as well.
        
           | pphysch wrote:
           | SQLite _almost certainly_ is the better edge RDBMS than
           | Postgres, if only because it has less features taking up
           | space.
           | 
           | However, "local SQLite vs. remote Postgres/MySQL" remains a
           | false dichotomy when talking about network latency.
        
             | clord wrote:
             | It's an application design choice. It's perfectly
             | reasonable to consider those two options when designing a
             | system. The constraints of the system determine which is
             | better for the application.
             | 
             | I'd pick a centralized network-reachable database with a
             | strong relational schema for write-heavy applications, and
             | a lighter in-process system for something that is mostly
             | reads and where latency matters. It's not a false dichotomy
             | -- but more like a continuum that certainly includes both
             | extremes on it.
        
           | cbsmith wrote:
           | Isn't this all talking to S3 anyway, not to mention the
           | network trips intrinsic to the system before it gets to S3? I
            | mean, I'm sure there's _some_ performance win here, but I'm
           | surprised it's so significant.
           | 
           | It's not like the address-space separation is without
           | benefits... heck, if it weren't, you could simply have
           | embedded the whole application inside Postgres and achieved
           | the same effect.
        
             | simonw wrote:
             | You may be confusing Litestream and LiteFS.
             | 
             | Litestream writes everything to S3 (or similar storage).
             | 
             | LiteFS lets different nodes copy replicated data directly
             | to each other over a network, without involving S3.
             | 
             | In either case, the actual SQLite writes and reads all
             | happen directly against local disk, without any network
             | traffic. Replication happens after that.
        
               | cbsmith wrote:
               | I think I was definitely confusing it with Litestream as
               | the blog post made reference to it (and I did find that
               | confusing).
               | 
                | That said, unless I've misunderstood the LiteFS use case,
               | you're still going over the network to reach a node, and
               | that node is still going through a FUSE filesystem. That
               | would seem to create overhead comparable (potentially
               | more significant) to talking to a Postgres database
               | hosted on a remote node.
               | 
               | It just doesn't seem that obvious that there's a big
               | performance win here. I'd be curious to see the profiling
               | data behind this.
        
               | WorldMaker wrote:
               | It's an "eventual consistency" transaction scheme for
               | SQLite: reads happen locally and immediately with the
               | nearest edge copy available, it's only writes that need
               | to be forwarded to a (potentially far away) primary DB
               | over the network somewhere, and transactions will then
               | take some amount of network time to fan back out to
               | edges, so _eventually_ the edges will read the latest
               | transactions, but during that fan out will read the
               | previous transaction state.
        
               | cbsmith wrote:
               | Local for the application code on the node you're talking
               | to (just like the read would be local for a Postgres
               | process running on the node), but there's still a network
               | trip to get to the node, no?
               | 
               | Even if the reads are happening locally, if they're going
               | through FUSE (even a "thin" pass-through that does
               | nothing), that means they're getting routed from kernel
               | space to a FUSE daemon, which means you're still doing
               | IPC to another process through a kernel...
        
               | simonw wrote:
               | The performance boost that matters most here is when your
               | application reads from the database. Your application
               | code is reading directly from disk there, through a very
               | thin FUSE layer that does nothing at all with reads (it
               | only monitors writes).
               | 
               | So your read queries should mostly be measured in
               | microseconds.
        
               | cbsmith wrote:
               | > So your read queries should mostly be measured in
               | microseconds.
               | 
               | You should check out the read latency for read-only
               | requests over unix domain sockets with PostgreSQL. You
               | tend to measure it in microseconds, and depending on
               | circumstances it can be single-digit microseconds.
               | 
                | Even if your FUSE logic does nothing at all, it sure
                | _seems_ like there's intrinsic overhead to the FUSE
                | model that is very similar to the intrinsic overhead
                | of talking to another userspace database process...
                | because you're talking to another userspace process
                | (through the VFS layer).
               | 
               | When the application reads, those requests go from
               | userspace to the FUSE driver & /dev/fuse, with the thread
               | being put into a wait state; then the FUSE daemon needs
               | to pull the request from /dev/fuse to service it; then
               | your FUSE code does whatever minimal work it needs to do
               | to process the read and passes it back through /dev/fuse
               | and the FUSE driver, and from there back to your
               | application. That gets you pretty much the same "block
               | and context switch" overhead of an IPC call to Postgres
               | (arguably more). FUSE uses splicing to minimize data
               | copying (of course, unix domain sockets also minimize
               | data copying), though looking at the LiteFS Go daemon,
               | I'm not entirely sure there isn't a copy going on anyway.
               | Memory copying issues aside, from a latency perspective,
               | you're jumping through very similar hoops to talking to
               | another user-space process... because that's how FUSE
               | works.
               | 
               | There's a potential performance win if the data you're
               | reading is already in the VFS cache, since that would
                | bypass having to go through the FUSE filesystem (and the
               | /dev/fuse-to-userspace jump) entirely. The catch is, at
               | that point you're bypassing SQLite transaction engine
               | semantics entirely, giving you a dirty read that's really
               | just a cached result from a previous read. That's not
               | really a new trick, and you can get even better
               | performance with client-side caching that can avoid a
               | trip to kernel space.
               | 
               | I'm sure there's a win here somewhere, but I'm struggling
               | to understand where.
        
               | benbjohnson wrote:
               | > There's a potential performance win if the data you're
               | reading is already in the VFS cache, since that would
                | > bypass having to go through the FUSE filesystem.
               | 
               | That's a typical case for a lot of databases. Most of the
               | "hot" data is in a small subset of the pages and many of
               | those can live in the in-process page cache.
               | 
               | > The catch is, at that point you're bypassing SQLite
               | transaction engine semantics entirely, giving you a dirty
               | read that's really just a cached result from a previous
               | read.
               | 
               | It's not bypassing the transaction engine semantics. For
               | WAL mode, SQLite can determine when pages are updated by
               | checking the SHM file and then reading updated pages from
                | the WAL file. Pages in the cache don't need to be
                | flushed on every access or even between transactions
                | to be valid.
               | 
               | > I'm sure there's a win here somewhere, but I'm
               | struggling to understand where.
               | 
               | The main goal of LiteFS is to make it easy to globally
               | replicate applications. Many apps run in a single region
               | of the US (e.g. us-east-1) and that's fast for Americans
               | but it's a 100ms round trip to Europe and a 250ms round
               | trip to Asia. Sure, you can spin up a multi-region
               | Postgres but it's not that easy and you'll likely deploy
               | as separate database and application servers because
               | Postgres is not very lightweight.
               | 
               | LiteFS aims to have a minimal footprint so it makes it
               | possible to deploy many small instances since SQLite is
               | built to run on low resource hardware.
               | 
               | As far as comparisons with Postgres over UNIX sockets, I
               | agree that the performance of a single instance is
               | probably comparable with a FUSE layer.
        
               | simonw wrote:
               | That's a really detailed, interesting reply. I am now
               | really interested to understand more about the
               | comparative overhead here.
        
               | fcoury wrote:
               | Wow, thank you so much for explaining the overhead around
               | FUSE usage, that's really eye opening and makes having
               | some performance benchmarks down the line even more
               | interesting.
        
         | hinkley wrote:
         | Yep. The poison is still in the cup, it's just a smaller dose.
         | Keep drinking if you think that fact is enough to save you...
         | 
         | I see bad judgement calls coming from small numbers, often due
         | to failing to do the cost x frequency math properly in your
         | head. If you're looking at an operation that takes 3ms and is
         | called a million times per minute, or twenty thousand times
         | per request, you don't have enough significant figures there
         | and people make mistakes. 3ms x 12345 reads as 37035ms on
         | paper, but the true total is closer to ~40s: use a higher
         | resolution clock and you find the operation actually takes
         | 3.259ms, leading to a total of ~40.23s.
         | 
         | Point is, when we are doing billions of operations per second,
         | lots of small problems that hide in clock jitter can become
         | really big problems, and a 5-10% error repeated for half a
         | dozen concerns can lead to serious miscalculations in capacity
         | planning and revenue.
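         | 
         | To make the rounding error concrete (using the 12345-call
         | example above, in a Python shell):
         | 
         |     >>> 3 * 12345        # "3ms" from a coarse clock, in ms
         |     37035
         |     >>> 3.259 * 12345    # same op, higher-resolution clock
         |     40232.355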
        
       | lijogdfljk wrote:
       | This is really cool! Unfortunately i primarily am interested in
       | offline databases so perhaps i'm just not the target audience.
       | However i have to ask, on that note, does this have any
       | application in the offline space?
       | 
        | I.e. i wonder if there's a way to write your applications such
       | that they have less/minimal contention, and then allow the
       | databases to merge when back online? Of course, what happens when
       | there inevitably _is_ contention? etc
       | 
       | Not sure that idea would have a benefit over many SQLite DBs with
       | userland schemas mirroring CRDT principles though. But a boy can
       | dream.
       | 
       | Regardless, very cool work being done here.
        
         | bigfish24 wrote:
         | Shameless plug but would love for you to try out Ditto[1]!
         | CRDT-based replicated database that works peer-to-peer
         | including automatically with WiFi and Bluetooth.
         | 
         | [1] https://www.ditto.live
        
         | benbjohnson wrote:
         | Thanks! It depends on your particular use case. If you need all
         | nodes to allow writes then, unfortunately, no it doesn't
         | support that. LiteFS uses physical replication so there's no
         | way to merge pages from separate writers. You're probably
         | better off looking at eventually consistent SQLite tools like
         | Mycelial[1].
         | 
         | However, if you have a single writer and just need replicas to
         | get updates when they periodically connect then yes, LiteFS
         | could fit that use case.
         | 
         | [1]: https://mycelial.com/
        
       | infogulch wrote:
       | > To improve latency, we're aiming at a scale-out model that
       | works similarly to Fly Postgres. That's to say: writes get
       | forwarded to the primary and all read requests get served from
       | their local copies.
       | 
       | How can you ensure that a client that just performed a forwarded
       | write will be able to read that back on their local replica on
       | subsequent reads?
        
         | sk55 wrote:
         | They don't have ACID guarantees with this setup.
        
           | benbjohnson wrote:
           | ACID is typically used to describe single database instances
           | and from that perspective, LiteFS doesn't change the ACID
           | semantics that SQLite has. For distributed consistency
           | models[1], it gets more complicated as LiteFS currently just
           | supports async replication. We'll introduce stronger
           | guarantees in the future although we currently provide some
           | information such as transaction IDs so you can manage
           | replication lag if you need stronger consistency.
           | 
           | [1]: https://jepsen.io/consistency
        
         | hinkley wrote:
         | That's a fun one.
         | 
         | A couple years ago someone posted a solution to that here. I'm
         | not sure if it works for SQLite, but it worked for Postgres.
         | The basics of it were that each replica was aware of the latest
         | transaction ID it had seen. On a normal read you'd deal with
         | the usual set of eventually consistent issues. But on a read-
         | after-write, you would select a replica that was ahead of the
         | write transaction.
         | 
         | Ultimately that's a very odd flavor of sharding. What I don't
         | recall is how they propagated that shard information
         | efficiently, since with sharding and consistent hashing the
         | whole idea is that the lookup algorithm is deterministic, and
         | therefore can be run anywhere and get the same result. WAL lag
         | information is dynamic, so... Raft?
        
           | benbjohnson wrote:
           | LiteFS provides a transaction ID that applications can use to
           | determine replication lag. If the replica is behind the TXID,
           | it can either wait or it can forward to the primary to ensure
           | consistency.
        
             | simonw wrote:
             | Is that transaction ID made available to SQLite clients? As
             | a custom SQL function or a value in a readable table
             | somewhere?
        
               | benbjohnson wrote:
               | It's available as a file handle. If you have a database
               | file named "db" then you can read "db-pos" to read the
               | replication position. The position is a tuple of the TXID
               | and the database checksum.
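                | 
                | A minimal sketch of reading it from an app (the path
                | and exact format are assumptions; the position looks
                | like a hex "txid/checksum" pair on a single line, so
                | check the LiteFS docs for the real format):
                | 
                |     def read_pos(path="/litefs/db-pos"):
                |         # e.g. "00000000000003e9/ead14af6..."
                |         txid, checksum = (
                |             open(path).read().strip().split("/"))
                |         return int(txid, 16), checksum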
        
           | infogulch wrote:
           | Yeah a transaction ID solution would work pretty well. Every
           | request would return the highest transaction ID seen by the
           | replica; writes forwarded to the primary would also return
           | the transaction ID after that write; any request with a
           | transaction ID higher than the replica would be forwarded to
           | the primary.
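            | 
            | A rough sketch of that flow on a replica (the header
            | names and the serve/forward callables are made up):
            | 
            |     def local_txid(pos="/litefs/db-pos"):
            |         raw = open(pos).read()
            |         return int(raw.split("/")[0], 16)
            | 
            |     def handle(headers, serve_local, forward):
            |         # client echoes the last txid it saw
            |         want = int(headers.get("x-min-txid", "0"), 16)
            |         if local_txid() < want:
            |             return forward()  # replica is behind
            |         body = serve_local()
            |         return body, {"x-txid": "%x" % local_txid()}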
        
             | hinkley wrote:
             | I think that may have been the case. Try the local replica,
             | if it is too old, hit the primary, rather than shopping
             | around for the next best replica.
        
           | simonw wrote:
           | Another solution that I've used successfully in the past for
           | web apps is to set a 10s cookie every time a user performs a
           | write, and then route their read requests to the lead server
           | until that cookie expires. That way they're sure to see the
           | impact of the write they just made.
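            | 
            | A sketch of that pattern with Flask-style request and
            | response objects (the cookie name is arbitrary):
            | 
            |     PIN_SECONDS = 10
            | 
            |     def after_write(response):
            |         # pin this browser to the primary for 10s
            |         response.set_cookie("use-primary", "1",
            |                             max_age=PIN_SECONDS)
            |         return response
            | 
            |     def pick_backend(request):
            |         if request.cookies.get("use-primary"):
            |             return "primary"
            |         return "local-replica"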
        
             | hinkley wrote:
             | If write traffic is low enough I could see that working,
             | but 10s is an awful long time. Anything less than a 1000:1
              | read:write ratio would start running into capacity
              | problems, wouldn't it?
        
               | simonw wrote:
               | That's never been a problem on any system I've worked on.
               | If you assume a web application with logged in users who
               | are occasionally updating their data, the percentage of
               | users who will have performed a write within the last 10s
               | is always going to be absolutely tiny.
        
               | hinkley wrote:
               | The absolute rate is the problem. You can't shed that
               | load to other machines so the percentage of users or
               | percentage of traffic doesn't matter. This is basically a
               | variant of Little's Law, where server count has a ceiling
               | of 1.
               | 
                | 100 edits a second from distinct sessions is 1000
                | sessions pinned at any moment. If they read anything in
               | that interval it comes from the primary. The only
               | question is what's the frequency and interval of reads
               | after a write.
        
       | fny wrote:
       | This reads like a professor who's so steeped in research that
       | he's forgotten how to communicate to his students!
       | 
        | What exactly are we talking about here? A WebSQL that's actually
       | synced to a proper RDBMS? Synced across devices? I'm not clear
       | about an end to end use case.
       | 
       | Edit: Honestly, this line from the LiteFS docs[0] needs to be
       | added to the top of the article:
       | 
       | > LiteFS is a distributed file system that transparently
       | replicates SQLite databases. This lets you run your application
       | like it's running against a local on-disk SQLite database but
       | behind the scenes the database is replicated to all the nodes in
       | your cluster. This lets you run your database right next to your
       | application on the edge.
       | 
       | I had no idea what was being talked about otherwise.
       | 
       | [0]: https://fly.io/docs/litefs/
        
       | mwcampbell wrote:
       | I wonder if using FUSE has had any appreciable impact on
       | performance, particularly read performance. I ask because FUSE
       | has historically had a reputation for being slow, e.g. with the
       | old FUSE port of ZFS.
        
         | miduil wrote:
         | I think FUSE performance has significantly improved since 2010,
         | the year in which ZFS-On-Linux became available I believe.
         | 
         | This is just one of many examples:
         | https://www.phoronix.com/news/MTI1MzM
         | 
            | On the contrary, in some cases FUSE is even faster than
            | doing a regular kernel mount(). There is an experimental
            | research distribution called distri that relies
            | exclusively on FUSE mounting and found that FUSE was
            | faster for doing mounts.
            | https://michael.stapelberg.ch/posts/tags/distri/
        
           | seabrookmx wrote:
           | You're conflating zfs-fuse and zfs-on-linux (ZoL).
           | 
            | The first one is, as its name suggests, a user-space
           | implementation of ZFS.
           | 
           | ZoL (now unified with OpenZFS) is implemented as a kernel
           | module and as such does _not_ run in user space. It performs
           | significantly better as a result.
           | 
           | FUSE is still slow, which is why there's ongoing effort to
           | replace things like NTFS-3G (the default NTFS implementation
           | in most linux distros) with an in-kernel implementation:
           | https://news.ycombinator.com/item?id=28418674
        
             | miduil wrote:
              | How do I conflate zfs-fuse with ZoL? I was trying to
              | highlight that if someone's memory of "poor fuse
              | performance" comes from zfs-fuse, then that memory must
              | be quite old: ZoL has been available for quite some
              | time, and FUSE itself has also received quite a few
              | performance gains since then.
              | 
              | Edit: Also, I don't want to imply that FUSE is anywhere
              | near in-kernel filesystems, but it is certainly
              | performing much better than 12 years ago.
        
       | endisneigh wrote:
        | Seems neat, until you try to do schema migrations. Unless they
        | can guarantee that all containers' SQLite instances have the
        | same schema without locking, I'm not sure how this doesn't run
        | into the same issues as many NoSQL systems.
       | 
       | CouchDB had this same issue with its database per user model and
       | eventually consistent writes.
        
         | slively wrote:
         | In my experience distributed database migrations tend to be two
         | phase though so it can be compatible during the migration. The
         | first migration to add something, release the apps to use it,
         | then a second migration to cleanup.
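         | 
         | For example, a hypothetical column rename done as two
         | compatible steps (SQLite 3.35+ for DROP COLUMN; table and
         | column names are made up):
         | 
         |     import sqlite3
         | 
         |     db = sqlite3.connect("/litefs/app.db")
         | 
         |     # phase 1: additive, old and new app versions work
         |     db.execute("alter table users"
         |                " add column email2 text")
         |     db.execute("update users set email2 = email")
         |     db.commit()
         | 
         |     # ...deploy the app version that uses email2...
         | 
         |     # phase 2: cleanup once nothing reads the old column
         |     db.execute("alter table users drop column email")
         |     db.commit()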
        
       | vcryan wrote:
       | This approach is very appealing to me :) curious about how people
       | handle schema migrations when using this approach.
       | 
       | I segment sqlite files (databases) that have the same schema into
        | the same folder. I haven't really had a case where migrations
        | were really a concern, but I could see it happening soon.
       | 
       | Seems like in my deployment, I'm going to need an approach to
       | loop over dbs to apply this change... I currently have a step of
       | app deployment that attempts to apply migrations... but it is
       | more simplistic because the primary RDBMS (postgresql) just
       | appears to the application as a single entity which is the
       | normative use-case for db-migrate-runners.
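       | 
       | A sketch of that loop (paths, table, and column here are
       | placeholders; the pragma check keeps the run idempotent):
       | 
       |     import glob, sqlite3
       | 
       |     DDL = ("alter table notes"
       |            " add column pinned int default 0")
       | 
       |     for path in sorted(glob.glob("/data/tenants/*.db")):
       |         db = sqlite3.connect(path)
       |         cols = [r[1] for r in
       |                 db.execute("pragma table_info(notes)")]
       |         if "pinned" not in cols:
       |             db.execute(DDL)
       |             db.commit()
       |         db.close()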
        
         | simonw wrote:
         | Yeah schema migrations are going to be interesting. Presumably
         | those end up making a HUGE change to the WAL (since they could
         | affect every stored row in a large table), which means LiteFS
         | then has to send a truly giant chunk of data out to the
         | replicas.
         | 
         | I wonder how this would present itself in real-world usage?
         | Would the replicas go out-of-date for 10-30s but continue
         | serving read-only traffic, or could there be some element of
         | cluster downtime caused by this?
        
           | benbjohnson wrote:
           | > Presumably those end up making a HUGE change to the WAL
           | (since they could affect every stored row in a large table),
           | which means LiteFS then has to send a truly giant chunk of
           | data out to the replicas.
           | 
           | Yes, a large migration could end up sending a lot of data out
           | to replicas. One way to mitigate this is to shard your data
           | into separate SQLite databases so you're only migrating a
           | small subset at a time. Of course, that doesn't work for
           | every application.
           | 
           | > Would the replicas go out-of-date for 10-30s but continue
           | serving read-only traffic, or could there be some element of
           | cluster downtime caused by this?
           | 
           | Once WAL support is in LiteFS, it will be able to replicate
           | this out without causing any read down time. SQLite is still
           | a single-writer database though so writes would be blocked
           | during a migration.
        
       | clord wrote:
       | I can imagine a database which tries to solve both of these
       | domains.
       | 
       | A centralized database handles consistency, and vends data
       | closures to distributed applications for in-process querying (and
       | those closures reconcile via something like CRDT back to the core
       | db).
       | 
       | Does this exist?
        
         | ignoramous wrote:
         | (except for the "reconcile back" to primary) Yes?
         | https://martin.kleppmann.com/2015/11/05/database-inside-out-...
         | 
         | See also: materialize.com, readyset.io, aws elastic-views
         | (forever in preview).
        
       ___________________________________________________________________
       (page generated 2022-09-21 23:00 UTC)