[HN Gopher] Let's build a distributed Postgres proof of concept
___________________________________________________________________
Let's build a distributed Postgres proof of concept
Author : eatonphil
Score : 168 points
Date : 2022-05-17 16:21 UTC (6 hours ago)
(HTM) web link (notes.eatonphil.com)
(TXT) w3m dump (notes.eatonphil.com)
| cube2222 wrote:
| Really nice blog post! Especially because it actually builds
| something that works in the end, while keeping stuff fairly
| simple.
|
| Another approach could be using PostgreSQL as a stateless
| scalable query execution engine, while stubbing out the storage
| layer with a remote replicated consistent storage solution.
|
| This is also what I think Aurora does.
| cdumler wrote:
| Is this going down the route of YugabyteDB[1]?
|
| [1] https://www.yugabyte.com
| vitruvvius wrote:
| now extend into raft atop rdma :)
| bogomipz wrote:
| I really enjoyed this post. I had a question, is the raft-boltdb
| instance the actual raft log where the sql commands are being
| written to then?
|
| It looks like the apply method is responsible for writing the sql
| statements to the raft log as well as executing the sql queries.
| Is waiting for a quorum of writes to the raft log by the other
| members not needed? Or is this all just handled under the hood by
| the raft libraries being used.
| otoolep wrote:
| It's all handled by the Raft libraries (I know this because
| rqlite uses the same Raft library at this blog post). When
| integrating with Hashicorp's Raft code, your code must
| implement Apply()[1]. The Hashicorp code promises not to call
| Apply() until a quorum of nodes have been persisted the data to
| the Raft log. That way when your code is called to modify your
| state machine (FSM), quorum has already been achieved by the
| Raft system.
|
| Architecturally (in rqlite's case) a node looks something like
| this:
| https://docs.google.com/presentation/d/1Q8lQgCaODlecHa2hS-Oe...
|
| [1] https://pkg.go.dev/github.com/hashicorp/raft#FSM
| bogomipz wrote:
| Thanks. So this embedding of a database engine into the Raft
| FSM is the general pattern then?
|
| It looks like Phil's post uses boltdb for the Postgres
| storage engine as well as for Raft log(via Hashicorpo's
| implementation Raft lib.)
|
| Thanks for the link to the slides as well. I've seen rqlite
| mentioned a few times in the last few week and so it was on
| my short list of things to read up on.
| otoolep wrote:
| Yes, embedding any kind of store can work. For example,
| here is an in-memory KV store embedded in Raft:
|
| https://github.com/otoolep/hraftd
|
| This is kind of a reference use of Hashicorp's Raft.
| remram wrote:
| distributed _Postgres-wire-compatible*_ proof of concept
| KingOfCoders wrote:
| Time might be up (saying as a long term PG user) with Litestream
| adding distributed features to SQLite, local inprocess databases
| might be the next thing.
| nijave wrote:
| Litestream allows scale up application architectures whereas
| Postgres allows scale out architectures (by only scaling up the
| database portion instead of also scaling up application
| servers).
|
| It's basically the same thing as monolith vs microservices but
| extending monolith to the data persistence layer. With
| horizontally scaling apps being the predominant architecture
| right now, I don't really see Litestream changing much.
|
| If you're going to horizontally scale Litestream with a
| multiple writers you're going to end up introducing all the
| network and synchronization pieces Postgres architectures
| already have.
| cromd wrote:
| Local inprocess databases might be a bigger thing, but won't
| they do nothing to help cases where 10 different applications
| on different machines need to read and write in the same
| database? I would think the target audiences are very
| different, i.e. people who are finding non-distributed PG
| insufficient probably can't alleviate their problems by
| switching to SQLite
| KingOfCoders wrote:
| It depends on your access pattern, if every application is
| write heavy to the same data set, this won't work. If every
| application is light on writes and those are mostly local
| (like for many SaaS companies), this will work fine with the
| future Litestream that redirect writes to one db and then
| distributes the data (if you can live with slightly stale
| data). Also depends on the size of you company. Slack does
| 300k/s messages, this won't work :-)
| cromd wrote:
| Oh, maybe I am behind on this "future litestream" that
| "redirects writes". I only knew of the ability to
| sync/backup/restore. Do you have a link to something
| discussing those other features?
| ankraft wrote:
| It doesn't redirect writes but you can create read-only
| replicas: https://github.com/benbjohnson/litestream-read-
| replica-examp...
| SahAssar wrote:
| Litestream still only does single-master, right? To me it feels
| like the next step up would be something like foundationdb
| (true multi-master) with a good psql layer.
| justinsaccount wrote:
| Interesting. I figured this would be like rqlite but on top of
| postgresql instead, but this builds a simple sql layer on top of
| bbolt.
| otoolep wrote:
| I was half-thinking about taking the Postgres wire protocol
| code[1] that CockroachDB uses, and adding it to rqlite[2]. Then
| you could talk to rqlite using Postgres tooling (perhaps). The
| only thing is that rqlite doesn't support distributed
| transactions in the usual sense, so there might be some
| difficulty mapping Postgres transaction control to rqlite. But
| still, could be interesting.
|
| [1] https://github.com/jackc/pgproto3
|
| [2] https://github.com/rqlite/rqlite
|
| Disclaimer: I am the creator of rqlite.
| awinter-py wrote:
| > What is CockroachDB under the hood? ... wire protocol, a
| storage layer, a Raft implementation, PostgreSQL's grammar
| definition. ... To be absurdly reductionist, CockroachDB is just
| the glue around these libraries
|
| ^ semi off topic, but love the idea of distinguishing certain
| projects as 'just glue'. (Not a dis, glue matters). Especially
| interesting in the context of OSS tools whiteboxed by cloud
| vendors with 'proprietary glue'
| eatonphil wrote:
| I said absurdly reductionist and definitely meant that. :)
| There is a ton going on under the hood to make it work well, to
| fully implement postgres, and to work efficiently.
| jchw wrote:
| I don't think we should be saying "fully implement
| PostgreSQL." People are feeling burned by this notion because
| it sounds like a drop-in replacement, but it isn't really
| that simple. Being wire compatible is mostly just convenient,
| since you can reuse existing drivers. But existing software
| built on top of PostgreSQL won't necessarily work out of the
| box...
|
| Maybe this is a bit nit picky, but, still.
| eatonphil wrote:
| I've written about the "world of postgres wire compatible"
| [0] and yes there are some databases that implement the
| protocol and don't implement postgres (see: immudb and
| ravendb) I give databases like CockroachDB more credit for
| actually making their intention and generally following
| through with implementing Postgres. Also unlike some others
| on my list CockroachDB really is meant as an OLTP
| replacement. Performance characteristics and operations
| will definitely differ but as solely a user/application
| developer you shouldn't notice as much.
|
| [0]
| https://datastation.multiprocess.io/blog/2022-02-08-the-
| worl...
| [deleted]
| truth_seeker wrote:
| I would rather use LOGICAL replication to achieve 1MASTER-1SLAVE
| or 1MASTER-Multi SLAVE or MULTI MASTER - Multi Slave
| configuration after working schema design details and read-write
| query patterns across tables.
| 3iggy wrote:
| I was expecting a guerilla db instead of a cockroach db. Like
| serialising the db for dropbox. This is like a drifter camping on
| private land... you can say it isnt really decentralised but i'll
| just move somewhere else.
| eloff wrote:
| The word used in the title and article is distributed not
| decentralized.
| fsckboy wrote:
| not challenging you, just asking, what is the distinction
| you're drawing between distributed and decentralized?
|
| one idea would be, the db is completely synchronized on 3
| redundant servers, so it's decentralized
|
| each column of a table is stored on a different server, so
| it's distributed...?
| ragona wrote:
| I'd say distributed is about redundancy and scale within a
| single entity's infrastructure, where as decentralized
| implies multiple entities working together.
| jen20 wrote:
| Decentralised implies lack of a global leader. However,
| each raft group has a leader, so you can only take this
| so far in a multi-group raft design.
| wiz21c wrote:
| sidenote : implementing RAFT myself was really tough. The
| original paper has 2 or 3 obscure points which were really hard
| to figure out...
| [deleted]
| mjb wrote:
| It is tough.
|
| My approach when learning new protocols like Raft or Paxos is
| to implement them in Pluscal (TLA+'s higher-level language) or
| P (https://github.com/p-org/P). I've found that helps separate
| the protocol-level concerns from the implementation-level
| concerns (sockets? wire format?) in a way that reduces the
| difficulty of learning the protocol.
| eatonphil wrote:
| Link to your implementation or notes? :)
| tynpeddler wrote:
| Echoing the other, how did you configure your test suite to
| check for correctness?
| SahAssar wrote:
| Something something jepsen test?
| emteycz wrote:
| Please share the source! I'd appreciate it very much.
| ilovepostgres9 wrote:
| azurezyq wrote:
| Raft is the secret sauce behind many distributed databases, like
| CockroachDB, TiDB, Yugabyte, etc..
|
| But to make it scalable and production grade, you have to run
| multiple Raft operations in parallel _and_ manage them well.
| E.g., sharding them correctly to make the load balanced and
| adjusting dynamically.
|
| Here's a pleasant article to read:
| https://en.pingcap.com/blog/how-tikv-reads-and-writes/
| rad_gruchalski wrote:
| There's also YugabyteDB which is actual Postgres with distributed
| flex. It even supports regular Postgres extensions. Anything in
| golang imitating Postgres will be subpar for years to come.
|
| YugabyteDB splits tables into tablets (partitions), every tablet
| has its own raft. It's a modified rocksdb under the hood.
|
| OTOH, it's cool to see how someone builds something like this
| from scratch and is willing to talk about it.
___________________________________________________________________
(page generated 2022-05-17 23:00 UTC)