[HN Gopher] Strong Consistency with Raft and SQLite
___________________________________________________________________
Strong Consistency with Raft and SQLite
Author : marcobambini
Score : 86 points
Date : 2023-03-21 14:02 UTC (8 hours ago)
(HTM) web link (blog.sqlitecloud.io)
(TXT) w3m dump (blog.sqlitecloud.io)
| VWWHFSfQ wrote:
| I see a lot of this kind of stuff but does anybody know what the
| actual DBA/SA story is for some big SQLite cluster? What if you
| have a non-trivial amount of data and you actually care about it?
| Do you just have a bunch of 800 gigabyte db.sqlite files sitting
| around?
| bastawhiz wrote:
| I mean, nobody thinks of Postgres and MySQL as 800gb of files
| of tuples and btree data sitting around. Traditional DBMS
| software handles the replication and coordination of all those
| files. SQLite doesn't do that; some bespoke external software
| handles it instead. If you're a DBA/SA working with one of these
| systems, you'll need to be an expert not just in SQLite but in
| the software that does the work SQLite isn't doing on its own.
| Those are specialized skills.
| AYBABTME wrote:
| Is this somehow related to rqlite? https://rqlite.io/
|
| The architecture is very similar.
| otoolep wrote:
| rqlite creator here. It's not related to my work, best of luck
| to them.
|
| [1] https://rqlite.io
| lukev wrote:
| Seems similar to https://github.com/rqlite/rqlite or
| https://dqlite.io.
|
| Would be interesting to see a breakdown of the differences.
| otoolep wrote:
| rqlite[1] creator here. This offering is completely new to me.
| Welcome to the party, that's what I say. I started doing what
| they are doing about 10 years ago. :-)
|
| https://www.philipotoole.com/9-years-of-open-source-database...
|
| https://www.philipotoole.com/replicating-sqlite-using-raft-c...
|
| https://www.philipotoole.com/rqlite-v3-0-1-globally-replicat...
|
| The post makes a fairly big point of the shortcomings of
| statement-based replication (like what rqlite does).
|
| What they state is correct -- non-deterministic SQL will be a
| problem, but it's not a difficult problem to solve. Just parse
| and rewrite the SQL statement before sending it to Raft. It's
| _exactly_ what rqlite does[2] (though there is more to do).
| However they claim that there are two concerns with this
| approach.
|
| >This solution introduces a runtime delay because each SQL
| statement must be parsed and analyzed.
|
| The delay is _insignificant_ compared with the delay introduced
| by Raft consensus. This is not a valid concern. I'm surprised
| they even mentioned it.
|
| >More importantly, it increases the difficulty of staying up-
| to-date with the new versions of the underlying SQL engine
| (SQLite in this specific case) because the custom parser must
| be updated each time a new SQL function/feature is introduced.
|
| Theoretically somewhat true, but not true in practice. Any
| competent programmer can update any parsing code in a matter of
| hours. Also just because a new feature is introduced does _not_
| mean that the parser needs to be rewritten. That's because
| SQLite will not introduce breaking changes into their SQL
| specification. Any parser needs to be updated _only_ if you
| want to support any new, non-deterministic feature.
|
| [1] https://rqlite.io
|
| [2] https://rqlite.io/docs/api/non-deterministic/
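|
| A minimal sketch of that rewrite idea in Go (not rqlite's actual
| code; rqlite parses the statement properly, while this only does
| a naive string substitution for RANDOM() as an illustration):
|
|   package main
|
|   import (
|       "fmt"
|       "math/rand"
|       "strconv"
|       "strings"
|   )
|
|   // rewriteNonDeterministic picks the random value once, on the
|   // node that receives the write, so every replica applies the
|   // same deterministic statement through Raft.
|   func rewriteNonDeterministic(sql string) string {
|       return strings.ReplaceAll(sql, "RANDOM()",
|           strconv.FormatInt(rand.Int63(), 10))
|   }
|
|   func main() {
|       stmt := "INSERT INTO t(id, x) VALUES (RANDOM(), 1)"
|       // prints the statement with RANDOM() replaced by a
|       // concrete integer literal
|       fmt.Println(rewriteNonDeterministic(stmt))
|   }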
| jcutrell wrote:
| Totally unrelated to the content, but it's interesting to see
| these types of articles clearly using something like Midjourney
| to generate their header images. It almost feels like a dream-
| state when I first land on the page... my brain can't totally
| parse what I'm looking at on this one, for example. For me, that
| was actually pretty distracting.
| aeyes wrote:
| Where are the specs of the offering?
|
| I couldn't find information on the maximum DB size, maximum
| number of connections, network throughput, datacenter locations
| and other basic information. The screenshot on the homepage
| clearly shows that there is a disk and CPU limitation.
|
| It looks like this is a database with an SQLite API but
| internally it could be anything.
| xrd wrote:
| I love that the rqlite author has chimed in here.
|
| Anyone know how this compares to the work done by litestream:
| https://litestream.io/how-it-works/
|
| Is this a fundamentally different idea: consistency vs.
| replication?
| otoolep wrote:
| This is my take: https://rqlite.io/docs/faq/#how-is-it-
| different-than-litestr...
|
| From the Litestream page:
| https://litestream.io/alternatives/#raft-based-consensus
| [deleted]
| lmc wrote:
| Marco, you're probably going to want to rethink your branding.
| It sounds like you're officially associated with SQLite, when
| AFAIK this isn't the case (?)
|
| The code is open source, but the name is a registered trademark.
| Similar issue here: https://github.com/vapor/sqlite-kit/issues/47
| marcobambini wrote:
| Dr. Richard Hipp (the SQLite author) is one of our
| shareholders, and we have the right to use the SQLite name.
| lmc wrote:
| Great, that is good to know! Perhaps a little about this on
| the site could be helpful for others.
| marcobambini wrote:
| You are right, thanks.
| eatonphil wrote:
| I'm not an expert and every distributed database is different. My
| intuition about strapping a consensus library to a database is
| that you're going to have way higher latency because every
| operation, writes and reads, must go through consensus. (IIRC
| some databases like Cockroach mitigate this by having many mini
| Raft clusters rather than a single one [a single cluster being,
| IIRC, how rqlite works anyway].)
|
| The advantage you undeniably get is durability/high availability.
|
| > Scalability: Distributed databases can be scaled horizontally
| by adding more nodes to the network, making handling larger
| amounts of data and higher volumes of requests easier.
|
| This is only possible if you have a data partitioning scheme,
| right? But data partitioning is sort of a different problem from
| consensus/HA, right? If you do the simplest version of a
| consensus + database implementation, every disk on every instance
| of the consensus cluster is basically identical. So you don't
| benefit in that simple approach from adding more disks. It
| potentially just makes write latency worse unless you do flexible
| quorums so the quorum size is independent of the number of nodes
| in the cluster?
|
| > Localized access: With a distributed database, data can be
| stored closer to the users who need it, which can improve
| response times and reduce network latency.
|
| This would also only be true if you relax consistency
| requirements, right? If you need to be at the highest level of
| consistency, you must send reads through consensus too;
| otherwise, if you read from the node nearest you, it may be out
| of date.
|
| It would be cool to see some numbers about the difference between
| latency of reads/writes in sqlitecloud and sqlite on a single
| node! It would help my intuition anyway.
| adamgordonbell wrote:
| > So you don't benefit in that simple approach from adding more
| disks.
|
| Not sure about this DB, but that's my understanding of
| consensus as well. The more nodes, the less likely any one
| going down harms things but the harder it is to reach
| consensus. You have to write to more places. It becomes slower,
| not faster.
| eatonphil wrote:
| > The more nodes, the less likely any one going down harms
| things but the harder it is to reach consensus
|
| Flexible quorums deal with this in the usual read/write
| case. But yeah, leader election then becomes the bottleneck.
| So if the cluster is unstable, overall performance gets even
| worse as you scale the number of nodes.
|
| https://fpaxos.github.io/
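|
| Rough arithmetic on what flexible quorums buy you (my reading of
| flexible Paxos, not something from the article): with N = 7
| nodes, plain majority quorums need 4 acks per write. Flexible
| quorums only require that the write quorum and the leader-
| election quorum intersect (their sizes must sum to more than 7),
| so you could commit on 2 nodes as long as elections require 6.
| Writes get cheaper; elections get more expensive.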
| remram wrote:
| Those places could still all be closer to the client. If you
| have 50 nodes around the country and you require every write
| to be acknowledged by 3 nodes for consistency, you can
| probably find 3 nodes very close to the client and therefore
| commit writes faster than if they all came to a central
| location.
| adamgordonbell wrote:
| > you require every write to be acknowledged by 3 nodes for
| consistency
|
| How would that work? Someone else could get a different 3
| nodes and write the same key, so they wouldn't read my
| write before writing their own? This isn't my area but I
| would assume the data has to be sharded ahead of time, not just
| written to whichever 3 of the 50 nodes are closest to me.
| eatonphil wrote:
| Everything has to be directed through a leader though. So
| it's not a matter of how close the client is to the cluster
| but how far the cluster's nodes are from each other (since
| anyone could be the leader). No?
| remram wrote:
| In practice you don't run the consensus over the whole
| cluster for every shard of data. This would be
| inefficient in terms of data storage (every node stores a
| copy of everything) and latency.
|
| For every shard you pick a few nodes, and you run
| consensus between those. This is the way Elasticsearch,
| CockroachDB, ... work.
|
| So if a specific shard is getting writes from a specific
| region, you can easily move that shard to nodes in that
| region.
|
| In contrast, etcd runs consensus over the whole cluster.
| So it is very safe, very slow, and meant for small
| datasets. In many ways, it is not a database, but a
| metadata store for distributed systems.
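|
| As a toy illustration of the idea (not how CockroachDB or
| Elasticsearch actually place data), each shard carries its own
| small replica set, and consensus runs only among those nodes:
|
|   package main
|
|   import "fmt"
|
|   // Shard maps a chunk of the keyspace to the handful of nodes
|   // that replicate it; a consensus group runs among Replicas
|   // only, not across the whole cluster.
|   type Shard struct {
|       ID       int
|       Replicas []string
|   }
|
|   // placeShard picks a replication-factor-sized subset from a
|   // candidate list, e.g. nodes in the region doing the writes.
|   func placeShard(id, rf int, candidates []string) Shard {
|       if rf > len(candidates) {
|           rf = len(candidates)
|       }
|       return Shard{ID: id, Replicas: candidates[:rf]}
|   }
|
|   func main() {
|       sgNodes := []string{"sg-1", "sg-2", "sg-3", "sg-4"}
|       fmt.Println(placeShard(42, 3, sgNodes))
|       // {42 [sg-1 sg-2 sg-3]}
|   }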
| eatonphil wrote:
| Yup, that makes sense. It wasn't clear from the article how
| their system works, though; other sqlite+raft implementations
| do the simpler version of one Raft cluster for the entire
| database, I think.
| anonymousDan wrote:
| This is not entirely accurate. If you want to forgo the
| ability to support cross-shard transactions then sure, you
| can run consensus independently on each shard. But I don't
| think that is what Spanner/CockroachDB do. In addition, even
| if you want to limit transactions to a single shard, the
| shard replicas need to be in different datacenters/AZs if
| you want to remain available when a DC/AZ goes down.
| [deleted]
| bastawhiz wrote:
| If you have fifty nodes, you don't want all fifty working on
| every database. Each db should only be associated with enough
| nodes to have a healthy quorum. But this is fine: you
| probably only have 3-5 regions. Each region could have 1-2
| nodes supporting each db. It's almost certainly the case that
| you don't have dozens of regions/PoPs (and if you do, the
| material difference in latency between one PoP and another is
| probably ones of milliseconds for most users).
| aaronblohowiak wrote:
| > My intuition about strapping a consensus library to a
| database is that you're going to have way higher latency
| because every operation, writes and reads, must go through
| consensus
|
| nah, you have leases (a time-limited and often renewable lock).
| so like your leader gets elected through consensus and then for
| the next N milliseconds, you have a guarantee that nobody else
| can be leader. this means your reads don't need to go through
| consensus. (hand-waving a lot of the details here.)
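|
| A rough sketch of the lease idea (illustrative only; real systems
| bound the lease by clock skew and renew it via heartbeats):
|
|   package main
|
|   import (
|       "errors"
|       "fmt"
|       "time"
|   )
|
|   // Leader serves reads locally while its lease is valid; once
|   // the lease expires it must re-establish leadership or route
|   // the read through consensus.
|   type Leader struct {
|       leaseStart time.Time
|       leaseTTL   time.Duration
|       data       map[string]string
|   }
|
|   func (l *Leader) Read(key string) (string, error) {
|       if time.Since(l.leaseStart) >= l.leaseTTL {
|           return "", errors.New("lease expired")
|       }
|       return l.data[key], nil
|   }
|
|   func main() {
|       l := &Leader{
|           leaseStart: time.Now(),
|           leaseTTL:   500 * time.Millisecond,
|           data:       map[string]string{"k": "v"},
|       }
|       fmt.Println(l.Read("k")) // v <nil>
|   }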
| robertlagrant wrote:
| I'm curious about sending changesets instead of commands.
|
| I get there are some things are nondeterministic, but I'm
| surprised that they couldn't be converted into deterministic
| commands before sending them out. E.g. send the result of
| `RANDOM()` instead of the call itself.
|
| I don't know how much that would buy you, but I'd imagine it
| might make the amount of information to send much smaller.
| conqrr wrote:
| The most interesting thing for me in this is the article implies
| that they offer globally distributed consensus. Not sure if this
| is their actual offering too, but the example of NYC and
| Singapore implies this. While reads would be fine, the latency
| for writes could be at least 2*250ms (roughly the round-trip
| time between the furthest AWS regions). I can still see this
| being useful for
| applications that require strong consistency and global
| distribution, but likely a niche. Would be real interesting to
| see some actual numbers!
| pclmulqdq wrote:
| It looks like you would still want something like CockroachDB
| instead of this for a global database layer.
| conqrr wrote:
| Agreed. Cockroach does a better job with multiple Raft
| clusters, and so does Spanner, which I believe uses 3 regional
| clusters; both are battle-tested.
___________________________________________________________________
(page generated 2023-03-21 23:01 UTC)