[HN Gopher] Show HN: Pg_CRDT - an experimental CRDT extension fo...
___________________________________________________________________
Show HN: Pg_CRDT - an experimental CRDT extension for Postgres
This is an experimental extension for CRDTs, pg_crdt: GitHub
repo[0]. It supports Yjs/Yrs and Automerge. The linked blog post
[1] describes how we're thinking about this extension in a Supabase
context. I want to emphasise this part: "pg_crdt has not been
released onto the Supabase platform (and it may never be). We're
considering many options for offline-sync/support and, while CRDTs
will undoubtedly factor in, we're not sure if this is the right
approach." [0] GitHub repo: https://github.com/supabase/pg_crdt
[1] Blog post: https://supabase.com/blog/postgres-crdt
Author : kiwicopple
Score : 233 points
Date : 2022-12-10 12:20 UTC (10 hours ago)
(HTM) web link (supabase.com)
(TXT) w3m dump (supabase.com)
| kiwicopple wrote:
| For quick viewing, here's how it works in Postgres (using Yjs):
| create table posts ( id serial primary key,
| content crdt.ydoc default crdt.new_ydoc() );
| insert into posts (content) values (crdt.new_ydoc());
| update posts set content = content || crdt.new_ydoc()
| where id = 1;
|
| The heavy-lifting is done by the teams at Yjs[0] and
| Automerge[1], who have re-implemented their JS libs in Rust. This
| made it possible for us to wrap them into a Postgres extension
| using pgx[2] (which is fast becoming a de-facto tool at supabase)
|
| [0] Yjs: https://github.com/yjs/yjs
|
| [1] Automerge: https://github.com/automerge/automerge
|
| [3] pgx: https://github.com/tcdi/pgx
| samwillis wrote:
| Super excited to see this Paul, and amused to see the nod to my
| comment in the post :-)
|
| I'm going to enjoy getting stuck into this! (When I'm not stuck
| driving for six hours... just a quick coffee break reading
| this)
|
| Support for CRDT columns within a db to enable JSON like
| indexing/querying capabilities is fundamental to then building
| some exciting further developments. Supporting both Yjs and
| Automerge is exactly the right route forward.
| samwillis wrote:
| Had a chance to take a little more of a look, did you
| investigate querying/indexing within the documents? (Didn't
| spot any sign in the repository)
|
| Without the ability to index properties, or do ad-hock
| queries, on the documents there is little reason to to the
| merging in-DB. It then makes more sense to merge outside the
| DB and save both the raw CRDT object and a serialised JSON
| next to it for querying. I assume that if you had found it's
| performance better, indexing+querying would have been the
| next step?
|
| However I'm not surprised you found performance not ideal, I
| suspect a layer in-front of PG throttling writes is the only
| way forward for a real-time use case.
|
| Personally I find the offline + eventual consistency the most
| interesting use case. The ability to sync a users document
| set to a local db (SQLite or a future mini-pg/supabase) with
| _local_ querying and then merge them back later. This could
| be as basic as a three column table (guid, doc, clock), the
| clock being a global Lamport like clock to enable syncing
| only changed documents.
| aaronblohowiak wrote:
| Read modify write outside the db seems worse though?
| kiwicopple wrote:
| > _This could be as basic as a three column table (guid,
| doc, clock), the clock being a global Lamport_
|
| This would be achievable, but also perhaps quite strict -
| we're dictating a table structure rather than working with
| existing CRDT libs. That said, we could probably do this
| today, without any mods/extensions
|
| > _querying /indexing_
|
| Yeah, there is nothing in this implementation for that
|
| > _Without the ability to index properties, or do ad-hock
| queries, on the documents there is little reason to to the
| merging in-DB_
|
| One benefit is the ability to send "directly" to the
| database, but I agree that lack of additional features in
| the current form outweighs this benefit
|
| > _a serialised JSON_
|
| This is how we might do it with with "Realtime as an
| authority" approach (and is the most likely approach tbh).
| Horusiath wrote:
| Yrs (Yjs on Rust) maintainer here: we actually had some idea
| about building extension to Postgres ;) Great to see that
| others share the excitement about this idea as well. See:
| https://github.com/y-crdt/y-crdt/issues/220
| kiwicopple wrote:
| perfect, we'd love to collaborate with you. I'll jump into
| your github after our Launch Week (this week), or feel free
| to reach out directly (contact details in my bio)
| zombodb wrote:
| Always nice to see pgx being used in the wild! Awesome work.
| nixpulvis wrote:
| I for one respect the need to resolve merge conflicts.
| danbruc wrote:
| I do not get peoples' obsession with CRDTs. They are useful in a
| limited set of use cases when your operations are commutative but
| having only commutative operations is pretty rare except for some
| almost exotic use cases. Distributed document editing is for some
| reason a reoccurring theme but document editing operations are
| all but commutative. What people then build with all kinds of ad
| hoc conflict resolution strategies is barely worth being called a
| CRDT - nothing there is conflict-free, you are just defining
| conflicts away. Sometimes that works okay, at least for a limited
| uses case, but the resulting operations are rarely general, and
| sometimes they are just weird.
| preseinger wrote:
| > CRDTs ... are useful in a limited set of use cases when your
| operations are commutative
|
| As you correctly observe, very few applications that exist
| today have data models with operations that satisfy the
| requirements of CRDTs, including e.g. commutativity. But CRDTs
| were never meant to be a "swappable" state layer for existing
| applications. The idea has always been that applications would
| need to change how they interact with state, to a model which
| does satisfy those requirements. Such models can absolutely
| generalize, it just takes a bit of thinking.
| kiwicopple wrote:
| I touch on this in the blog post: I
| personally believe CRDTs are the future. *For some things.*
| It's possible that developers will need to be selective about
| their "level" of offline support [example table
| for a blog post] Perhaps it's not that
| important if the title has a "last write wins" strategy,
| because it's rarely updated and less likely to have a merge
| conflict. But it is critical to use a smarter algorithm for the
| content of the blog post, especially in a team where multiple
| users are editing a blog post at the same time.
| danbruc wrote:
| The C in CRDT is for conflict-free, i.e. there must not be
| any conflicts between operations. Incrementing and
| decrementing a counter, the final result will be the same
| independent of the order of operations. You can almost not
| get any further away from a CRDT than last write wins. And
| for more complicated merge strategies you are essentially
| guessing what the desired result is but you do not know and
| everything really depends on the quality of your guesses.
|
| Like the example with the changed fruits in the list, maybe
| one or the other should end up in the final result, maybe
| both, maybe none. The correct conflict resolution strategy is
| not obvious and highly dependent on the use case, that is why
| I said that implementations are usually not general or they
| are only general in the sense that you can plug in any
| behavior you like. Compare this with a real CRDT like a
| counter, there there is absolutely no doubt about the desired
| outcome.
| AlexErrant wrote:
| It's on the application developer to choose the correct
| CRDT for the job. Last write wins is there as a CRDT of
| last resort - there's a trade off between the very general
| but "weak" LWW and the specific but "powerful" counter. (In
| my own app, I plan on observing "failed" LWW writes and
| surfacing them to the user as notifications or something
| transient.) Ideally one could choose a CRDT where the
| semantics of an application conflict are the same as CRDT
| conflicts, but we don't live in an ideal world. I don't
| think you'll find anyone arguing that CRDTs are a silver
| bullet.
| danbruc wrote:
| Maybe my point is just that those things should not be
| called CRDTs. There are conflicts in the basic operations
| and you just decide on a case by case basis how to
| resolve them which makes the entire construct look
| superficially like a CRDT but kind of ignores the core
| characteristic of CRDTs, namely that the operations never
| conflict.
| preseinger wrote:
| LWWRegister does provide the deterministic guarantee that
| operations never conflict. The "problem" is just that the
| decision is basically arbitrary.
| jamil7 wrote:
| The point is that they all eventually converge on the same
| value, and can do that independent of some centralised
| merging location, rather than the exact way in which they
| do that, whether that's LLW or anything else.
| danbruc wrote:
| Just store all updates - as if you do event sourcing -
| ordered by a timestamp. The list of updates will
| eventually converge as will any function of that list,
| including any function that merges all the updates into a
| current state. You might have to [partially] reevaluate
| the function in case a delayed update shows up and has to
| be inserted into the middle of the list of updates. So I
| guess there are additional requirements in order to
| qualify as a CRDT, bounded additional space or bounded
| amount of computation on updates. But I don't know as it
| turned out my understanding of what qualifies as a good
| CRDT was too narrow.
| jamil7 wrote:
| Yeah event sourcing and what you described is what these
| libraries are calling delta CRDTs. The appeal is they're
| hiding away the actual event log, metadata and
| rebasing/merging parts and exposing data types that look
| something like what you'd normally use.
| jclem wrote:
| An LWW register is most certainly a CRDT, it just has
| limited use. I think you're making a good point about how
| one must be careful applying CRDTs, but I don't think it's
| accurate to say that a suboptimal merge strategy in terms
| of user experience means that something does not meet the
| definition of a CRDT.
|
| In other words, the "conflict-free" part of the term refers
| to the fact that all replicas converge on the same value
| when all updates have been exchanged. What you're
| discussing seems to have more to do with concerns around
| intention preservation when using CRDTs. In other words, an
| LWW register on a text field is still a CRDT, but in many
| applications it is potentially a poor choice in terms of
| preserving the intent of user-generated operations, but
| determining this is something done in a case-by-case basis.
|
| If the content is a long document which users collaborate
| on over time, then a register is most certainly a poor
| choice. If it is a short text field akin to a label, an LWW
| register isn't necessarily bad, because user intent is
| likely to replace the entire value, rather than to perform
| minute edits on the value.
| danbruc wrote:
| I just reread the Wikipedia article on CRDTs and was
| actually wrong. I only had a subset of CRDTs in mind but
| they actually encompass a much wider class than I
| thought. In some way I now understand the fuzz about them
| even less, they now look almost trivial, but maybe I am
| still not understanding some important aspect properly.
| Should maybe read the paper introducing them. But at the
| very least I should not complain about people calling
| things CRDTs that actually fall under the actual
| definition.
| kiwicopple wrote:
| > _But at the very least I should not complain about
| people calling things CRDTs that actually fall under the
| actual definition_
|
| that's... a very reasonable response. Nice one.
|
| I think the fuzz comes from the more esoteric CRDTs,
| which solve actually hard problems (traditionally solved
| by Operational Transforms). But I don't think I could
| have created a simple example in the blog post for one of
| these.
| matlin wrote:
| Last Writer Wins is not the most compelling use case for CRDT
| because if those fruits were denormalized into Postgres rows,
| you'd already have the behavior described. A more interesting
| example, might be if both users were inserting a new fruit, one
| at the beginning and one at the end. What should the sequence of
| fruits be after that? Likely a list with both new fruits
| included. But it depends on the domain and use case.
|
| In general, Sequence CRDTs seem to be the most difficult and most
| interesting area of continuous research. In my own work, I've
| implemented a sequence CRDT to create a distributed log of events
| across users that then (because the order is guaranteed to
| eventually be the same for each device) can just be reduced to
| whatever data structure you need. It means, we end up with one
| CRDT to rule them all and then can just write the "conflict
| resolution" in simple term with the business logic.
| jamil7 wrote:
| > Last Writer Wins is not the most compelling use case for CRDT
| because if those fruits were denormalized into Postgres rows,
| you'd already have the behavior described.
|
| I'm not sure from the article, but I'm guessing the difference
| here is that it will apply the updates in the correct order
| (using last writer wins) regardless of when the yjs doc hits
| the server. I guess the idea here is you can get decent
| offline-ish support without directly storing timestamps,
| version numbers, soft-deletes etc and associated logic.
| Hixon10 wrote:
| Hm, I am a bit worried about using PostgreSQL with CRDT
| (Limitations section from the blog). I guess that PostgreSQL
| doesn't like frequently updates, because of MVCC. Maybe it is OK
| to use PG as a cold storage, if you have another application,
| which would work with hot data (like, when users change
| something), and eventually store it inside PG.
| kiwicopple wrote:
| unless we can figure out how to overcome those limitations,
| you're right. Potentially some sort of throttling system for
| CRDT updates (offloaded to a background worker so that it's
| non-blocking?). If anyone else has ideas/approaches which could
| work in a PG-only context, we'd love to hear them
| Hixon10 wrote:
| Maybe you can research a bit pluggable table storage
| interface in PostgreSQL. It is possible either to implement
| your storage, which works fine with CRDT, or try to find some
| storage implementation, which works fine with such usage
| pattern (like, https://wiki.postgresql.org/wiki/Zheap )
| cranium wrote:
| CRDTs are really cool if your merging function behaves
| intuitively _in the context of your application_. You can easily
| implement a distributed VCS by using the Last Writer Wins, but it
| would be a nightmare to work with - imagine having your
| modifications rolled back because someone committed after you.
|
| In the same way some operations are embarrassingly parallelizable
| (think adding elements to a set), some operations are inherently
| conflicting: editing elements of the same text, image,
| directory,... More than trying to decide _in all cases_ which
| modification wins, we should detect when there are multiple
| legitimate options and ask the user to choose. Like a merge
| conflict in Git for example.
|
| The application state feels so clean when conflicts can't arise
| or are automatically dealt with. But we have to acknowledge there
| are only so much datastructures that can merge automatically and
| still be intuitive for the user.
| preseinger wrote:
| +1, I think the most valuable lesson we've learned w.r.t. CRDTs
| at the moment is that valid merge operations basically don't
| generalize, and basically have to leverage properties that are
| application-specific. This isn't a fatal flaw, it just
| constrains the design space.
| Smaug123 wrote:
| Forgive me, I'm still recovering from the flu and am having some
| trouble thinking, but I didn't find your "Why not build this into
| Realtime?" section very compelling. If you've already got a
| central source of truth (so you don't currently need the peer-to-
| peer nature of CRDTs), isn't it faster to iterate on your
| solution if you implement the merge semantics within the central
| source of truth that you have full control over? I'm not
| convinced from your post that turning the database into a
| component of a peer-to-peer distributed system is predictably
| worthwhile; it's certainly predictably going to be very complex.
| It smacks to me of spending a great deal of engineering effort to
| keep your options much further open than I've seen you justify.
| kiwicopple wrote:
| Thanks, perhaps I can try explaining it from another angle.
|
| At the moment, every tool in supabase is decoupled from the
| other tools. The only thing that each tool depends on is
| Postgres. This is beneficial, because now we don't need to
| build interfaces - Postgres is the interface.
|
| Putting this implementation into Realtime would introduce a
| second dependency for some tools or create a burden for the
| developer. For example, this query goes directly to PostgREST:
| supabase.from('posts').update({ title: 'Hello', content:
| 'World' });
|
| If the "content" part of this is a CRDT, then suddenly the
| developer needs to know that _that particular field_ needs to
| be sent through realtime (psuedo-code):
| supabase.from('posts').update({ title: 'Hello' });
| supabase.from('posts').merge({ content: 'World' });
|
| Or we'd need to build some automatic-routing logic into the
| libraries (or PostgREST), which is also complex.
|
| In my view, CRDTs are a data problem, and therefore probably a
| database problem. If we can make Postgres act like "just
| another peer", then it simplifies the architecture.
|
| > _isn 't it faster to iterate on your solution if you
| implement the merge semantics within the central source of
| truth that you have full control over?_
|
| We have full control over both Realtime and Postgres (at least
| the extension), so this is less of a concern. We're also less
| worried about fastest iteration than finding the correct
| solution long-term
___________________________________________________________________
(page generated 2022-12-10 23:00 UTC)