[HN Gopher] Strong Consistency with Raft and SQLite
       ___________________________________________________________________
        
       Strong Consistency with Raft and SQLite
        
       Author : marcobambini
       Score  : 86 points
       Date   : 2023-03-21 14:02 UTC (8 hours ago)
        
 (HTM) web link (blog.sqlitecloud.io)
 (TXT) w3m dump (blog.sqlitecloud.io)
        
       | VWWHFSfQ wrote:
       | I see a lot of this kind of stuff but does anybody know what the
       | actual DBA/SA story is for some big SQLite cluster? What if you
       | have a non-trivial amount of data and you actually care about it?
       | Do you just have a bunch of 800 gigabyte db.sqlite files sitting
       | around?
        
         | bastawhiz wrote:
          | I mean, nobody thinks of Postgres and MySQL as 800gb of files
          | of tuples and btree data sitting around. Traditional DBMS
          | software handles the replication and coordination of all those
          | files itself. SQLite doesn't, so some bespoke external software
          | handles that instead. If you're a DBA/SA working with one of
          | these systems, you're going to need to be an expert not just in
          | SQLite, but in the software that does the work SQLite isn't
          | doing on its own. That's a specialized skill set.
        
       | AYBABTME wrote:
       | Is this somehow related to rqlite? https://rqlite.io/
       | 
       | The architecture is very similar.
        
         | otoolep wrote:
          | rqlite creator here. It's not related to my work; best of luck
          | to them.
         | 
         | [1] https://rqlite.io
        
       | lukev wrote:
       | Seems similar to https://github.com/rqlite/rqlite or
       | https://dqlite.io.
       | 
       | Would be interesting to see a breakdown of the differences.
        
         | otoolep wrote:
         | rqlite[1] creator here. This offering is completely new to me.
         | Welcome to the party, that's what I say. I started doing what
         | they are doing about 10 years ago. :-)
         | 
         | https://www.philipotoole.com/9-years-of-open-source-database...
         | 
         | https://www.philipotoole.com/replicating-sqlite-using-raft-c...
         | 
         | https://www.philipotoole.com/rqlite-v3-0-1-globally-replicat...
         | 
         | The post makes a fairly big point of the shortcomings of
         | statement-based replication (like what rqlite does).
         | 
         | What they state is correct -- non-deterministic SQL will be a
         | problem, but it's not a difficult problem to solve. Just parse
         | and rewrite the SQL statement before sending it to Raft. It's
         | _exactly_ what rqlite does[2] (though there is more to do).
         | However they claim that there are two concerns with this
         | approach.
         | 
         | >This solution introduces a runtime delay because each SQL
         | statement must be parsed and analyzed.
         | 
         | The delay is _insignificant_ compared with the delay introduced
          | by Raft consensus. This is not a valid concern. I'm surprised
         | they even mentioned it.
         | 
         | >More importantly, it increases the difficulty of staying up-
         | to-date with the new versions of the underline SQL engine
         | (SQLite in this specific case) because the custom parser must
         | be updated each time a new SQL function/feature is introduced.
         | 
         | Theoretically somewhat true, but not true in practice. Any
         | competent programmer can update any parsing code in a matter of
         | hours. Also just because a new feature is introduced does _not_
          | mean that the parser needs to be rewritten. That's because
         | SQLite will not introduce breaking changes into their SQL
         | specification. Any parser needs to be updated _only_ if you
         | want to support any new, non-deterministic feature.
         | 
         | [1] https://rqlite.io
         | 
         | [2] https://rqlite.io/docs/api/non-deterministic/
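
The parse-and-rewrite approach described above can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, and rqlite uses a real SQL parser rather than the regexes shown here.

```python
import random
import re
import time

# Hypothetical sketch of the rewrite step: replace non-deterministic
# SQL functions with literal values on the leader, so every replica
# applies an identical, deterministic statement.
def rewrite_nondeterministic(sql: str) -> str:
    # RANDOM() -> a concrete signed 64-bit integer (SQLite's range)
    sql = re.sub(r"RANDOM\(\)",
                 lambda m: str(random.randint(-2**63, 2**63 - 1)),
                 sql, flags=re.IGNORECASE)
    # CURRENT_TIMESTAMP -> the leader's wall-clock time as a literal
    sql = re.sub(r"CURRENT_TIMESTAMP",
                 lambda m: "'" + time.strftime("%Y-%m-%d %H:%M:%S") + "'",
                 sql, flags=re.IGNORECASE)
    return sql

stmt = rewrite_nondeterministic(
    "INSERT INTO t (id, created) VALUES (RANDOM(), CURRENT_TIMESTAMP)")
# stmt now contains only literals, so all replicas apply the same bytes.
```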
        
       | jcutrell wrote:
       | Totally unrelated to the content, but it's interesting to see
       | these types of articles clearly using something like Midjourney
       | to generate their header images. It almost feels like a dream-
       | state when I first land on the page... my brain can't totally
       | parse what I'm looking at on this one, for example. For me, that
       | was actually pretty distracting.
        
       | aeyes wrote:
       | Where are the specs of the offering?
       | 
       | I couldn't find information on the maximum DB size, maximum
       | number of connections, network throughput, datacenter locations
       | and other basic information. The screenshot on the homepage
       | clearly shows that there is a disk and CPU limitation.
       | 
       | It looks like this is a database with an SQLite API but
       | internally it could be anything.
        
       | xrd wrote:
       | I love that the rqlite author has chimed in here.
       | 
       | Anyone know how this compares to the work done by litestream:
       | https://litestream.io/how-it-works/
       | 
       | Is this a fundamentally different idea: consistency vs.
       | replication?
        
         | otoolep wrote:
         | This is my take: https://rqlite.io/docs/faq/#how-is-it-
         | different-than-litestr...
         | 
         | From the Litestream page:
         | https://litestream.io/alternatives/#raft-based-consensus
        
       | [deleted]
        
       | lmc wrote:
        | Marco, you're probably going to want to rethink your branding. It
        | sounds like you're officially associated with SQLite, when AFAIK
        | this isn't the case (?)
       | 
       | The code is open source, but the name is a registered trademark.
       | Similar issue here: https://github.com/vapor/sqlite-kit/issues/47
        
         | marcobambini wrote:
         | Dr. Richard Hipp (the SQLite author) is one of our
         | shareholders, and we have the right to use the SQLite name.
        
           | lmc wrote:
            | Great, that is good to know! Perhaps a note about this on
            | the site could be helpful for others.
        
             | marcobambini wrote:
             | You are right, thanks.
        
       | eatonphil wrote:
       | I'm not an expert and every distributed database is different. My
       | intuition about strapping a consensus library to a database is
       | that you're going to have way higher latency because every
       | operation, writes and reads, must go through consensus. (IIRC
       | some databases like Cockroach mitigate this by having many mini
       | Raft clusters rather than one single one [which IIRC is how
       | rqlite works anyway].)
       | 
       | The advantage you undeniably get is durability/high availability.
       | 
       | > Scalability: Distributed databases can be scaled horizontally
       | by adding more nodes to the network, making handling larger
       | amounts of data and higher volumes of requests easier.
       | 
       | This is only possible if you have a data partitioning scheme,
       | right? But data partitioning is sort of a different problem from
       | consensus/HA, right? If you do the simplest version of a
       | consensus + database implementation, every disk on every instance
       | of the consensus cluster is basically identical. So you don't
       | benefit in that simple approach from adding more disks. It
       | potentially just makes write latency worse unless you do flexible
       | quorums so the quorum size is independent of the number of nodes
       | in the cluster?
       | 
       | > Localized access: With a distributed database, data can be
       | stored closer to the users who need it, which can improve
       | response times and reduce network latency.
       | 
       | This would also only be true if you relax consistency
       | requirements, right? If you need to be at the highest level of
       | consistency you must send reads through consensus too otherwise
       | if you read from the node nearest you it may be out of date.
       | 
       | It would be cool to see some numbers about the difference between
       | latency of reads/writes in sqlitecloud and sqlite on a single
       | node! It would help my intuition anyway.
        
         | adamgordonbell wrote:
          | > So you don't benefit in that simple approach from adding more
          | disks.
         | 
         | Not sure about this DB, but that's my understanding of
         | consensus as well. The more nodes, the less likely any one
         | going down harms things but the harder it is to reach
         | consensus. You have to write to more places. It becomes slower
         | not faster.
        
           | eatonphil wrote:
           | > The more nodes, the less likely any one going down harms
           | things but the harder it is to reach consensus
           | 
            | Flexible quorums deal with this in the usual read/write
           | case. But yeah leader election then becomes the bottleneck.
           | So if the cluster is unstable overall performance gets even
           | worse as you scale number of nodes.
           | 
           | https://fpaxos.github.io/
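
The flexible-quorum trade-off boils down to a one-line intersection condition. A sketch with illustrative numbers (not tied to any particular system):

```python
# Flexible quorums (FPaxos): the replication quorum Q2 and the
# leader-election quorum Q1 only need to intersect. For simple
# counting quorums that means |Q1| + |Q2| > N, so shrinking Q2
# (fast writes) forces Q1 to grow (slow, but rare, elections).
def quorums_intersect(n: int, q1: int, q2: int) -> bool:
    return q1 + q2 > n

n = 9                 # cluster size
q2 = 3                # each write commits after only 3 acks
q1 = n - q2 + 1       # = 7 nodes needed to elect a new leader
assert quorums_intersect(n, q1, q2)
# Classic majority quorums are the special case q1 == q2 == n // 2 + 1.
```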
        
           | remram wrote:
           | Those places could still all be closer to the client. If you
           | have 50 nodes around the country and you require every write
           | to be acknowledged by 3 nodes for consistency, you can
           | probably find 3 nodes very close to the client and therefore
           | commit writes faster than if they all came to a central
           | location.
        
             | adamgordonbell wrote:
              | > you require every write to be acknowledged by 3 nodes for
              | consistency
             | 
             | How would that work? Someone else could get a different 3
             | nodes and write the same key, so they wouldn't read my
             | write before writing their own? This isn't my area but I
             | would assume the data has to be sharded ahead of time, not
             | just write to any 3 out of 50 nodes that are closest to me.
        
             | eatonphil wrote:
             | Everything has to be directed through a leader though. So
             | it's not a matter of how close the client is to the cluster
              | but how far the cluster's nodes are from each other (since
             | anyone could be the leader). No?
        
               | remram wrote:
               | In practice you don't run the consensus over the whole
               | cluster for every shard of data. This would be
               | inefficient in terms of data storage (every node stores a
               | copy of everything) and latency.
               | 
               | For every shard you pick a few nodes, and you run
               | consensus between those. This is the way Elasticsearch,
               | CockroachDB, ... work.
               | 
               | So if a specific shard is getting writes from a specific
               | region, you can easily move that shard to nodes in that
               | region.
               | 
               | In contrast, etcd runs consensus over the whole cluster.
               | So it is very safe, very slow, and meant for small
               | datasets. In many ways, it is not a database, but a
               | metadata store for distributed systems.
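
The per-shard layout described here can be sketched as follows. Names and the placement policy are illustrative only, not how CockroachDB or Elasticsearch actually assign replicas.

```python
# Each shard runs consensus among a small replica set instead of one
# group spanning all 50 nodes.
NODES = [f"node{i}" for i in range(50)]
REPLICATION_FACTOR = 3

def replicas_for(shard_id: int) -> list:
    # Trivial placement: consecutive nodes. Real systems place
    # replicas by region, load, and failure domain.
    start = (shard_id * REPLICATION_FACTOR) % len(NODES)
    return [NODES[(start + i) % len(NODES)]
            for i in range(REPLICATION_FACTOR)]

# Consensus for shard 7 involves only these 3 nodes, not all 50, and
# moving a hot shard closer to its writers is just a re-placement.
replicas_for(7)  # -> ['node21', 'node22', 'node23']
```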
        
               | eatonphil wrote:
                | Yup that makes sense. It wasn't clear from the article,
                | though, how their system works; other sqlite+raft
                | implementations do the simpler version of one raft
                | cluster for the entire database, I think.
        
               | anonymousDan wrote:
               | This is not entirely accurate. If you want to forego the
               | ability to support cross shard transactions then sure you
               | can run consensus independently on each shard. But I
                | don't think that is what Spanner/CockroachDB do. In
                | addition, even if you want to limit transactions to a
                | single shard, the shard replicas need to be in different
               | datacenters/AZs if you want to remain available when a
               | DC/AZ goes down.
        
               | [deleted]
        
           | bastawhiz wrote:
           | If you have fifty nodes, you don't want all fifty working on
            | every database. Each db should only be associated with
            | enough nodes to have a healthy quorum. But this is fine: you
           | probably only have 3-5 regions. Each region could have 1-2
           | nodes supporting each db. It's almost certainly the case that
           | you don't have dozens of regions/PoPs (and if you do, the
           | material difference in latency between one PoP and another is
           | probably ones of milliseconds for most users).
        
         | aaronblohowiak wrote:
         | > My intuition about strapping a consensus library to a
         | database is that you're going to have way higher latency
         | because every operation, writes and reads, must go through
         | consensus
         | 
          | nah, you have leases (a time-limited and often renewable lock).
          | so like your leader gets elected through consensus and then for
          | the next N milliseconds, you have a guarantee that nobody else
          | can be leader. this means your reads don't need to go through
          | consensus. (hand-waving a lot of the details here.)
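
A minimal sketch of the lease idea (hypothetical names, details hand-waved just like the comment above): after winning an election through consensus, the leader can serve reads locally until the lease expires, because no other node can become leader before then.

```python
import time

class LeaderLease:
    def __init__(self, duration_s: float):
        self.duration_s = duration_s
        self.expires_at = 0.0

    def renew(self, now: float) -> None:
        # Called after a successful consensus round (e.g. heartbeat acks).
        self.expires_at = now + self.duration_s

    def can_serve_local_read(self, now: float) -> bool:
        # Real systems also subtract a clock-drift margin here.
        return now < self.expires_at

lease = LeaderLease(duration_s=0.5)
lease.renew(time.monotonic())
# For the next ~500 ms, reads on the leader skip consensus entirely.
```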
        
       | robertlagrant wrote:
       | I'm curious about sending changesets instead of commands.
       | 
        | I get that some things are nondeterministic, but I'm
        | surprised that they couldn't be converted into deterministic
       | commands before sending them out. E.g. send the result of
       | `RANDOM()` instead of the call itself.
       | 
       | I don't know how much that would buy you, but I'd imagine it
       | might make the amount of information to send much smaller.
        
       | conqrr wrote:
        | The most interesting thing for me is that the article implies
        | they offer globally distributed consensus. Not sure if this
        | is their actual offering, but the example of NYC and
        | Singapore implies it. While reads would be fine, the latency
        | for writes could be at least 2*250ms (roughly the round-trip
        | time between the furthest AWS regions). I can still see this
        | being useful for applications that require strong consistency
        | and global distribution, but likely a niche. Would be really
        | interesting to see some actual numbers!
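
The back-of-envelope arithmetic behind that figure, with illustrative numbers only (the ~230 ms RTT is a typical NYC-Singapore value, not a measurement of this product):

```python
# Two-region (NYC <-> Singapore) Raft deployment.
rtt_ms = 230

# A commit costs the leader one round trip to a quorum follower
# (AppendEntries out, ack back):
commit_ms = rtt_ms

# A client in the region without the leader also pays its own round
# trip to reach the leader:
remote_client_write_ms = rtt_ms + commit_ms  # ~460 ms per write
```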
        
         | pclmulqdq wrote:
         | It looks like you would still want something like CockroachDB
         | instead of this for a global database layer.
        
           | conqrr wrote:
            | Agreed. Cockroach does a better job with multiple Raft
            | clusters, as does Spanner, which I believe uses 3 regional
            | clusters; both are battle-tested.
        
       ___________________________________________________________________
       (page generated 2023-03-21 23:01 UTC)