[HN Gopher] You don't need a CRDT to build a collaborative exper...
___________________________________________________________________
You don't need a CRDT to build a collaborative experience
Author : zknill
Score : 142 points
Date : 2023-11-16 13:36 UTC (9 hours ago)
(HTM) web link (zknill.io)
(TXT) w3m dump (zknill.io)
| MontagFTB wrote:
| One of the main points of this article is to "just use locks",
| which glosses over a lot of technical complications about locking
| elements within a document. How long is the lock held? Can it be
| stolen from a user who has gone offline, or is still online but
| off to lunch, and we _really_ need to make this change before the
| presentation in an hour? What if the user comes back online and
| they have changes, but the lock was stolen - how are those
| changes reconciled to the document?
|
| I am generally in favor of simpler is better, and if there is a
| way to build a collaborative experience without using CRDTs, then
| go for it. However, sometimes the cure can be worse than the
| disease, and solutions like locking may introduce more technical
| complexity than originally thought.
| dsmmcken wrote:
| > How long is the lock held?
|
| For however long the user has a browser focus state on the
| element seems like a reasonable answer, and submit changes as
| they are made. However, I don't know how you resolve conflicts
| of two users simultaneously attempting to acquire a lock.
| Arelius wrote:
| > However, I don't know how you resolve conflicts of two
| users simultaneously attempting to acquire a lock.
|
| It turns out you just have to pick one... This all depends on
| a source of truth, and when you are there it's easy to pick
| one, say based on whichever arrived at the network interface
| first.
| filleokus wrote:
| The server must keep track of the locks, and it can only know
| about the lock being released if the client tells it. E.g by
| sending a message that the field is not focused any more. The
| tricky thing is in the "edge" cases, or really the non-
| perfect cases, which there are plenty of (as I think GP
| described).
|
| The server can decide that the client is offline if the
| server misses expected heartbeat messages from the client.
| But how often will those be sent and how long grace period
| will we allow? If it's too short then it will be unreliable
| on shaky 4G connections, if it's too long then it will be
| annoying in the other direction.
|
| And that's not considering the "social" problems with locks.
| I've worked on replacing a system that was lock-based with
| CRDTs where the lunch scenario from MontagFTB actually was a
| common occurrence.
|
| In an "ideal" scenario your lock acquisition problem is not
| hard. Client's just show the UI optimistically and whoever
| the server decide was first gets the lock. The loosing client
| throws any state the user created in the short time-frame.
| Over reliable and fast connections for granular locks, this
| works fine. But in the real world that's just one of the
| issues with a lock based approach...
| jitl wrote:
| Having used an app with locks like this (Quip) I can tell you
| it really sucks to have a lock lease >10s after the last
| edit. The "lock" should be a UI level concept anyways -
| clients should broadcast "hey I'm taking field X"; if you
| have two users submit a write+lock request concurrently
| you'll need to pick a winner anyways.
| chii wrote:
| By exploring that locking and unlocking mechanism, you will
| find that the logical conclusion in the end, when enough
| complexity and edge cases get covered/fixed as bugs, that it
| turns into a crude form of "CRDT" (where it's not actually
| consistent, but merges within reason for 99% of use cases).
|
| It might as well have been CRDT from the get go.
| zknill wrote:
| You're absolutely right, locking is a hard problem. Especially
| when you get into the edge cases of clients not disconnecting
| cleanly.
|
| If you've ever used one of those awful early-2000s Microsoft
| word collaboration systems where you have to check-out the doc,
| and remember to check it back in, and no one can use it until
| you've checked it back in... it's awful!
|
| I'm not directly in this team, but one of the teams at my
| company have been working on this problem. They call it
| "Spaces", and one of the features solves this component locking
| problem.
|
| https://ably.com/examples/component-locking
| charles_f wrote:
| That's true for collaborative experience. Crdts are a mechanism
| to handle eventual consistency (that's even the preface of the
| paper). If you assume that said collaborative experience is
| always online, you don't need them, and "using locks" as you
| described is probably enough.
|
| If you want a mechanism to handle that eventual consistency, it's
| probably better to reuse their principles rather than reinventing
| something that will eventually ressemble Crdts.
|
| You mentioned "offline first", I think it's probably a good place
| to pluck that ib https://www.inkandswitch.com/local-first/
| insanitybit wrote:
| > Ever-growing state: for CRDTs to work well they need to keep a
| record of both what exists, and what has been deleted (so that
| the deletes aren't accidentally added back in later). This means
| that CRDT state will continually expand. There's a bunch of magic
| that CRDT library authors are doing with clever compression
| techniques to make this problem less-bad, but it's basically in-
| escapable. The size of your CRDT state is not purely a function
| of the size of the state the CRDT represents, but also of the
| number of updates that state has gone through.
|
| A) This is only the case for certain CRDTs, such as sets that
| support deletion - so, if you want Set semantics with deletion
| support, you need two sets, one to that tracks all deletions and
| one that tracks all insertions.
|
| B) You can garbage collection your sets. They don't have to grow
| forever.
|
| > Complex implementations: CRDTs are easy to implement wrong, so
| probably don't roll your own.
|
| Personally, I've never done this. I've just added a `merge(&mut
| self, other: &Self)` method to structs in Rust. Guaranteeing CRDT
| properties is often trivial, or at least it was in my case.
|
| > Opaque state: Because the CRDT has to represent both the
| underlying state and the updates that led to that state
|
| Again, this is only if you need specific operations on your CRDTs
| _and_ if your CRDTs are encoded in specific ways.
|
| I've said it before, but a trivial crdt looks like this
| struct Grows(u64); impl Grows { fn
| merge(&mut self, other: &Self) { self.0 =
| max(self.0, other.0); } }
|
| et voila? Obviously you lose all intermediary states, but since
| that is specified to be a negative thing, I just want to be clear
| that it's often optional.
|
| > So maybe you are convinced that CRDTs are not the be-all-and-
| end-all of collaboration, and that you aren't in one of the two
| categories where you probably should use a CRDT, and you've made
| it this far in the post.
|
| I am convinced that CRDTs are not the be-all-and-end-all, because
| Strong Eventual Consistency does not provide strong enough
| guarantees for all use cases.
|
| Once again we have a CRDT article that's about user
| collaboration, which I find somewhat frustrating because CRDTs
| can be used in far more places than that, and user collaboration
| is like _the_ most complicated thing you could ever write since
| it 's all of the problems of a distributed system _and then we
| add humans into the mix_. There is no "good" solution to this
| problem - CRDTs aren't going to solve it, and neither is any
| other algorithm, because it's not possible to encode every
| possible state update in a way that never conflicts and is also
| what a human expects (especially since humans have varying
| expectations).
|
| The algorithm/ approach, as described, seems perfectly fine - it
| will have edge cases just like CRDTs will. In reality, for such
| an impossibly complex problem, you're probably going to end up
| with something really complex to solve it. You're almost
| certainly going to start adding CRDT-like operations, like "ok
| technically this user held a lock on X, but the other user
| performed an operation on X that technically commutes, so we can
| allow both" to alleviate some of the inherent complexities (and
| UX issues) with locking.
| saqadri wrote:
| You may not need CRDT per-se, but building a collaborative
| experience is so difficult. I worked on collaborative systems for
| a bit, and also have read a bit about how Figma and Notion do it
| (this is a good read: https://www.figma.com/blog/how-figmas-
| multiplayer-technology...) -- it's still super hard to get right.
|
| This talk by Karri about Linear's "sync engine" is also a good
| watch: https://www.youtube.com/watch?v=Wo2m3jaJixU.
| jitl wrote:
| I agree broadly with the article's position but I think locks are
| more harmful than helpful. When I was a Quip user (2018) it was
| super frustrating to get locked out of a paragraph because
| someone's cursor idled there. Instead just allow LWW overwrites.
| If users have contention and your sync & presence is fast,
| they'll figure it out pretty quick, and at most lose 1-2
| keystrokes, or one drag gesture, or one color pick.
|
| Notion is "collaborative" and we don't use a CRDT for text, it's
| all last-write-wins decided by the server. However our LWW texts
| are individually small - one block/paragraph in size - and
| adding/moving/removing blocks is intention-preserving if not
| perfectly convergent.
|
| As the article says, the downside for LWW is that "offline" /
| async collaboration isn't so great. That's why we're working on
| switching to CRDT for our texts. If you're interested in bringing
| CRDTs to a product with a lot of users, consider joining Notion's
| Docs team - https://boards.greenhouse.io/notion/jobs/5602426003 /
| @jitl on Twitter / jake@makenotion.com
| btown wrote:
| Are you considering having a CRDT for each text block
| individually, or moving to a CRDT for the entire data model for
| a document? Really curious about the design approach here,
| especially insofar as there's now an external API that the data
| models need to service!
| jitl wrote:
| It's Complicated (tm), a hybrid of the two. Text content
| across multiple CRDT enabled blocks may be connected in a
| single logical CRDT, but the segment visible in any given
| block is stored in that block object, which maintains our
| permission system invariants. We'll still support moving
| blocks to different pages, so one page may have a few
| different CRDTs visually interleaved.
|
| All the edge cases are very interesting, like what happens if
| I use backspace to join a block of CRDT B into block of CRDT
| A.
|
| API consumers are unaffected. Likely we'll support API
| updates of text as best-effort diff-match-patch of the text
| on our side applied as CRDT ops.
| npunt wrote:
| always great hearing your perspective on the notion text
| engine jake, thanks for taking the time to explain
| everything!
| zknill wrote:
| So Last-Write-Wins (LWW) basically _is_ a CRDT, but not in the
| sense that anyone really expects, because they aren't that
| useful or intention preserving. Especially if the two writes
| happen in very quick succession / concurrently.
|
| LWW becomes useful if you can:
|
| a) help humans to see who is doing what on a doc
|
| b) reduce the size of the change that is LWW
|
| As you've said:
|
| > However our LWW texts are individually small - one
| block/paragraph in size
|
| This is really important, because by reducing the size of the
| individual element that any single user is updating you can
| also reduce the chance of conflicts. With a small amount of
| contextual feedback (like the notion icon next to the block) a
| lot of the problem cases are just avoided.
|
| Clearly locking and updating the entire document would be
| terrible, but if you can do it on a small enough scope that
| others can change other elements, it can work really well.
|
| (If you've worked on the notion things you're describing then
| I'm sure you know this better than I do, but just spelling it
| out really clearly.)
| jitl wrote:
| You can have a LWW CRDT, but not every LWW is a CRDT. LWW
| CRDTs generally pick a winner based on causal order which is
| _convergent_ , the C in CRDT, because every peer receiving
| the same ops in any order would pick the same winner.
|
| Picking a winner based on wall clock time order (as suggested
| in the article, and implemented by Notion) is not convergent;
| if peers used that algorithm to apply ops I they would not
| converge to the same state. Instead you need an authority
| (your server) to essentially pick an arbitrary order of ops
| and clients just have to deal with it.
|
| A practical example is to consider three users (A, B, C)
| working on a document.
|
| 1. A, B, C start online working together.
|
| 2. C goes offline
|
| 3. A, B stay online, and make 100s of changes.
|
| 4. C makes a few changes while offline (not sent to other
| peers yet)
|
| 5. C comes back online and sends their changes to (the server
| / other peers).
|
| In wall-clock LWW, C's changes will overwrite A & B's
| changes, even though A & B have done a lot more work.
|
| In a causal ordering CRDT implementing LWW, C's changes will
| "lose" to A & B's changes, since they were actually made
| "earlier" in causal ordering.
|
| > Clearly locking and updating the entire document would be
| terrible, but if you can do it on a small enough scope that
| others can change other elements, it can work really well.
|
| I'm sure good UX is possible with locks, but I haven't used
| one yet for document editing. I'm still traumatized from Quip
| which did per-paragraph (per block?) locking and was really
| annoying. Eventually they added an unlock button next to the
| block so anyone could "steal" the lock at will due to all the
| user complaints, but at that point, why put it in at all?
| maclockard wrote:
| I understand what you are saying here in terms of the
| difference between using wall-clock or causal ordering to
| determine who 'wins' for LWW. However, both of these
| strategies seem convergent to me? In any case, all clients
| will agree on whose changes win.
|
| 1. With wall-clock decided by clients, A + B changes will
| win since C's wall-time is earlier (yes, C could lie, but
| still would converge).
|
| 2. With wall-clock decided by server C will win and
| everyone will agree.
|
| 3. With causal ordering, everyone will agree that A + B
| won.
|
| 2 is not a CRDT since it requires a central server, but I
| think 1 would still count? Or stated another way: I'm not
| sure the _convergence_ is what determines if these
| strategies are CRDTs or not, but rather whether or not this
| decision making is _distributed_ or not.
| asib wrote:
| > ...based on causal order which is convergent, the C in
| CRDT...
|
| Doesn't the C stand for conflict-free? I suppose both are
| kind of getting at the same idea though.
| lewisjoe wrote:
| > everyone's gonna say "but hey, google docs uses operational
| transform not CRDTs".. OK yes, but you are not google.
|
| Well, google docs works not because they somehow figured out how
| to converge OT edits with as much precision as CRDTs do, but
| simply because they have a central server which orders edits
| anyway and don't need true leader-less convergence.
|
| In fact, I agree not many things don't need a CRDT. CRDTs help
| with mathematical rigidity of convergence when you want true
| peer-2-peer collaboration which works without any central
| authority.
|
| However, most apps anyway work on top of a central authority
| (example SaaS apps) so there is no real reason to accomodate all
| the compexity that comes with CRDT. They might get far with a
| simpler OT based model + central server based ordering.
|
| For example even Figma doesn't call its model a 100% pure CRDT.
| It's a partial, simpler CRDT implemented with an assumption that
| there's going to be a server that understands ordering.
|
| It's the same with Google Docs. They don't need a CRDT because
| it's a cloud app after all, which means OT is more convenient
| with major heavylifting (ordering and conflict handlings)
| outsourced to the server.
| josephg wrote:
| Yeah. Text based OT is pretty simple to implement, too. It's
| about 200 lines of code, plus a simple network protocol. It's
| fast and relatively straightforward to code up, and unlike text
| CRDTs it's pretty fast by default.
|
| I use it as my standard test when trying out a new programming
| language. It was unexpectedly ugly in go because of go's lack
| of enums & unions, and that's one of the big reasons I never
| got in to programming Go.
| jay-aye-see-key wrote:
| Operational transforms are one of those interesting
| technologies I've wanted to learn by implementing but haven't
| made the time yet. I also didn't realise it could be
| implemented in that little code.
|
| Can you recommend any learning resources for implementing an
| OT? (Ideally typescript, python, or rust)
| jahfer wrote:
| I haven't explored this space in a while, but I have a
| couple of examples that could be helpful. A Clojure library
| of mine [0] has a decent README with some background
| reading on how to use operational transform.
|
| I also reimplemented it in a surprisingly tiny amount of
| OCaml [1] which was a fun way to learn that language :)
|
| [0]: https://github.com/jahfer/othello [1]:
| https://github.com/jahfer/othello-ocaml
| dugmartin wrote:
| This single file shows the entire set of OT transformations
| (retain, insert, delete):
|
| https://github.com/Operational-
| Transformation/ot.js/blob/mas...
|
| and this is a good post outlining the basics of OT, from
| the creator of CodeMirror:
|
| https://marijnhaverbeke.nl/blog/collaborative-editing-
| cm.htm...
| antidnan wrote:
| I don't think you need a pure CRDT either but I think locking and
| presence is a bit of an oversimplification.
|
| LWW is a good place to start, and updating the smallest piece of
| information possible is the right idea in general but there is a
| lot more nuance to handling complex applications like a
| spreadsheet (I'm working on one) and whiteboard apps.
|
| Things like reparenting or grouping shapes [1], or updating
| elements that aren't at the lowest scale like deleting a row or
| column in a spreadsheet make locking challenging to implement. Do
| you lock the entire row while I'm moving it? Do you lock the
| entire group of shapes?
|
| With the exception of text editing, the popular libraries like
| Yjs don't just give you a perfect CRDT out of the box. You still
| have to construct your data model in a way that enables small
| scale updates [2], and CRDT libraries and literature are the best
| source of thinking for these problems that I've found.
|
| [1] https://www.figma.com/blog/how-figmas-multiplayer-
| technology...
|
| [2] https://mattweidner.com/2022/02/10/collaborative-data-
| design...
| earthboundkid wrote:
| I've seen locks used at the CMSes of large news organizations.
| It's fine, but they all need a mechanism to kick out an editor
| who has an idle tab left open. For my own small scale CMS, I just
| wrapped Google Docs and let them handle all the syncing
| headaches.
| ivanjermakov wrote:
| Conflict-Free Replicated Data Type (CRDT) is a type of data
| structure that enables concurrent updates across multiple
| replicas without the need for coordination between them.
| jes5199 wrote:
| CRDT is a different paradigm. Ideally we'd use it to _replace_
| client-server
| iamwil wrote:
| I think we can. Following all the implications down the rabbit
| hole, you end up with a system architecture where there's no
| front-end, and there's no back-end.
|
| Since CRDTs are still kinda new for most people, the discussion
| hasn't really gotten there yet.
| yencabulator wrote:
| And wrangling CRDTs into a client-server architecture is
| actually extra work. The basic design assumes you can trust
| every peer -- making sure that jes5199 actually was the peer
| who added a comment that is marked as author: "jes5199" is not
| trivial.
| spion wrote:
| For offline first apps, or for applications where very high
| degree of control for the content is needed (e.g. legal docs) and
| realtime collaboration isn't that valuable, there is also the
| option to use 3-way merge instead.
|
| The benefit is that you can even allow the user to resolve
| conflicts in a satisfactory way.
|
| Another benefit is that the document doesn't even have to be
| derived from the original, it could go through exports and re-
| imports and it will still be possible to run a 3-way merge as
| long as a common base version is declared. This can be especially
| covnenient for systems that involve e.g. MS Word.
| namelosw wrote:
| That's not gonna work for real-world projects. Real-world apps
| often have larger edits than locking individual cells/cards e.g.
| Move columns or replace large chunks of spreadsheets in Google
| Sheets, or Ctrl-A to select all and then drag to move.
|
| Also, if you consider latency, locking does not work well because
| client B might do operations before he/she even acknowledges the
| lock from client A because of latency.
| iamwil wrote:
| > Ever-growing state: for CRDTs to work well they need to keep a
| record of both what exists, and what has been deleted (so that
| the deletes aren't accidentally added back in later). This means
| that CRDT state will continually expand.
|
| I guess a couple things:
|
| It depends on the CRDT. Some CRDTs grow with the number of
| replicas and others with the number of events.
|
| State-based CRDTs don't need to keep history and don't need
| causal ordering of messages, but internal bookkeeping grows with
| the number of replicas. And for large states (like sets and
| maps), it can be prohibitive to send the state all over the wire
| for an idempotent merge.
|
| That's why in practice, people implement Op-based CRDTs, which
| makes the trade: in order to send small ops over the wire, we now
| need causal ordering of messages. To make sure we can sync with
| replicas long offline, we keep as much history so that they can
| catch up.
|
| There are other variations, such as delta-state based CRDTs that
| send diffs, and merkle CRDTs, which use merkle data structures to
| calculate diffs and detect concurrency, which have different
| growth characteristics.
|
| ---
|
| As for a growing state: Is this actually a concern for devs that
| aren't using CRDTs for collaborative text? I can see that being
| an issue with the amount of changes that can happen.
|
| But outside of that, lots of data don't grow that fast. We all
| regularly use Git and it keeps a history of everything. Our disks
| are huge, and having an immutable record is great for lots of
| things (providing you can access it).
|
| > Opaque state: ...you're generally left with an opaque blob of
| binary encoded data.
|
| Most CRDT libraries take a document-orientated angle. It assumes
| that you can contain the entire "unit of work", like a document,
| inside of a CRDT. However, if your data is more relational, it
| doesn't quite fit. And while there's immutable data in a CRDT, I
| do wish it was more accessible and queryable. In addition, being
| a binary blob, it's not exactly composable. I think CRDT
| libraries should be composable with each other.
| chromatin wrote:
| We took a super simple (IMO) approach to collaborative editing in
| my current project:
|
| Each block of text has a version number which must be incremented
| by one by the client at the time of submission. The database
| provides conflict prevention by uniqueness constraint which
| bubbles up to the API code. The frontend is informed of conflict,
| so that the user can be notified and let the human being perform
| conflict resolution.
|
| Because most concurrent users are working on different blocks,
| this works great.
| zknill wrote:
| How do you handle getting the changes that one client makes
| onto the other clients? Are you pushing it from the server to
| the clients with websockets, or waiting for the clients to ask
| for new info, or waiting for the conflict to happen when
| someone else tries to make a change, or something else?
|
| I'm thinking a lot about keeping server and client data in sync
| while working on our hopefully-soon-to-be-released LiveSync
| product[1]
|
| [1] https://ably.com/livesync
| chromatin wrote:
| When the client gets HTTP 409 Conflict, it asks for the
| current version and presents to the user for manual
| resolution
| maclockard wrote:
| Wrote about something similar a while ago
| https://hex.tech/blog/a-pragmatic-approach-to-live-collabora...
|
| Using a server to tie break and locking has worked pretty well
| for us
| parhamn wrote:
| At this point, given the maturity of libraries (I was exploring
| this recently), I think you'd have to make the case that CRDTs
| are bad not just "too much".
|
| Interfacing with the 'blob' is a real thing (y-js is solving some
| of this with a rust implementation that has cross language
| binding) but generally the things they noted (e.g. a Figma
| canvas) aren't things you commonly do joins across and if you did
| you'd have an independent indexing store for that functionality.
|
| With tools like SyncedStore [1] and HocusPocus [2] you end up
| with a pretty good, we'll tested, easy to implement base for good
| collaboration.
|
| [1] syncedstore.org
|
| [2] github.com/ueberdosis/hocuspocus
| lmm wrote:
| > You can't inspect your model represented by the CRDT without
| using the CRDT library to decode the blob, and you can't just
| store the underlying model state because the CRDT needs its
| change history also. You're left with an opaque blob of data in
| your database. You can't join on it, you can't search it, you
| can't do much without building extra features around that state
| blob.
|
| So use the CRDT library when building your indices? Or better yet
| use a CRDT-aware datastore. This doesn't seem like a real
| problem.
|
| > Locking for safety
|
| Please don't. You're inevitably going to have lost locks, lost
| updates, or most likely both.
| zknill wrote:
| > So use the CRDT library when building your indices?
|
| Yeah, sure, you can build a secondary index over the data. But
| you're still having to decode the blob and index it. There's no
| version where you can index the data without using the library
| to expose the underlying model state (like you could if you
| weren't using a CRDT).
|
| On locking, yes, it's hard. But it's not the same kind of
| locking that you'd expect in other parts of a system. You're
| locking the UI, not the actual data, so it's a tiny bit more
| forgiving. In general the locks aren't trying to force
| consistency, instead they are trying to prompt the humans to
| reduce the chance of conflict happening in the first place.
| Ofc, you still have to care about the locking, unlocking,
| disconnection problems, etc.
|
| Here's a decent example/demo you can play around with in
| multiple windows:
|
| https://examples.ably.dev/component-locking?space=W0V-t5oY6A...
|
| https://ably.com/spaces
| swyx wrote:
| > I'll run through a bunch of broad categories of applications,
| and describe how to make use of these features.
|
| i love these kinds of taxonomies of apps, because then you can
| get specific about tech stack choices. just offering a couple
| more that i've come across in my years:
|
| - more prototypical: 7GUIs
| https://eugenkiss.github.io/7guis/tasks
|
| - Application holotypes:
| https://twitter.com/arthurwuhoo/status/1470489178186170374
| aboodman wrote:
| It's true that a CRDT is often not the right thing for a classic
| client/server application. But this doesn't mean we should just
| give up on ux and use locking.
|
| There are approaches to multiplayer that are client/server
| native. By leveraging the authoritative server they can offer
| features that CRDTs can't, while preserving the great ux.
|
| I'm partial to server reconciliation:
|
| https://www.gabrielgambetta.com/client-side-prediction-serve...
|
| My product, Reflect, implements server reconciliation as a
| service. You can learn more about how it works here:
|
| https://rocicorp.dev/blog/ready-player-two
| zknill wrote:
| > But this doesn't mean we should just give up on ux
|
| It's a little unfair to describe locking as "[giving] up on
| UX", especially given some well known collaboration products
| use it quite successfully; Google Sheets cell locking, Miro
| element/Text locking, etc.
|
| Ofc, it's going to depend on the scope of what's being locked.
| These two examples are quite finely scoped element locks.
| speps wrote:
| The Wikipedia page for Operational Transformation [1] mentions
| Differential Synchronization [2] as an alternative, does anyone
| have any experience with DS?
|
| [1] https://en.wikipedia.org/wiki/Operational_transformation [2]
| https://neil.fraser.name/writing/sync/
___________________________________________________________________
(page generated 2023-11-16 23:00 UTC)