[HN Gopher] Collaborative text editing with Eg-Walker: Better, f...
       ___________________________________________________________________
        
       Collaborative text editing with Eg-Walker: Better, faster, smaller
        
       Author : czx111331
       Score  : 226 points
       Date   : 2024-09-27 12:53 UTC (1 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | matlin wrote:
       | Seph (author) also has a reference implementation in Typescript:
       | https://github.com/josephg/eg-walker-reference
       | 
       | I've stated before that I think the main thing holding back
       | collaborative text / sequence CRDTs is integration with a
       | production database.
       | 
       | Eg-walker looks interesting because it might lend itself to be
       | integrated into a database because the operations are immutable
       | and only appended. However, to demonstrate the effectiveness of
       | these algorithms library authors (see Yjs, DiamondTypes, etc)
       | build stand-alone data structures (usually specialized search
       | trees) that most databases already provide.
       | 
       | Personally, I've been trying to adapt a Piece Table[1] to be
       | collaborative and stored in Triplit[2] which runs on both client
       | and server and already implements logical clocks but I might see
       | how well I can adapt this algorithm instead!
       | 
       | 1. https://en.wikipedia.org/wiki/Piece_table 2.
       | https://github.com/aspen-cloud/triplit
        
         | benjismith wrote:
         | Awesome, I'm been following Seph's work for many years! Always
         | thoughtful and well-executed. Probably the most prolific and
         | insightful engineer in the "collaborative text editing"
         | universe.
         | 
         | I use ShareDB every day, which originated from Seph's excellent
         | work on OT algorithms. Good stuff!
        
           | josephg wrote:
           | Good to hear it's still in use! That's very kind.
        
         | riedel wrote:
         | There was a recent thread about the 2001 post that afaik
         | eventually lead to this paper (diamond types is the rust
         | implementation): https://news.ycombinator.com/item?id=41372833
        
         | btown wrote:
         | This seems to be a holy grail, to be honest! Super-simple
         | database representations with barely any processing required on
         | the "write path," instant startup, minimal memory requirements
         | on both server and client without a need for CRDT data
         | structures to be in memory, none of the O(n^2) complexity of
         | OT. In fact, if I'm interpreting it correctly, it should be
         | straightforward to get this working in a serverless environment
         | without any notion of session fixation, nor active documents
         | needing to be kept in memory.
         | 
         | I can see this completely reshaping the landscape of what's
         | possible with collaborative documents!
        
           | josephg wrote:
           | Author here. Thanks! Yeah this is my hope too.
           | 
           | Egwalker has one other advantage here: the data format will
           | be stable and consistent. With CRDTs, every different crdt
           | algorithm (Yjs, automerge/rga, fugue, etc) actually stores
           | different fields on disk. So if someone figure out a new way
           | to make text editing work better, we need to rip up our file
           | formats and network protocols.
           | 
           | Egwalker just stores the editing events in their original
           | form. (Eg insert "a" at position 100). It uses a crdt
           | implementation in memory to merge concurrent changes (and
           | everyone needs to use the same crdt algorithm for
           | convergence). But the network protocol and file format is
           | stable no matter what algorithm you use.
        
             | vlovich123 wrote:
             | I've loved learning all your detailed info on CRDT work.
             | Thank you for progressing the field!
             | 
             | Since it stores all the editing events, does this mean that
             | the complexity of opening a document is at least O(N) in
             | terms of number of edits? Or are there interim snapshots /
             | merging / and/or intelligent range computations to reduce
             | the number of edits that need to be processed?
        
               | josephg wrote:
               | You can just store a snapshot on disk (ie, the raw text)
               | and load that directly. You only ever need to look at
               | historical edits when merging concurrent changes into the
               | local document state. (And thats pretty rare in
               | practice).
               | 
               | Even when that happens, the algorithm only needs to look
               | at operations as far back as the most recent "fork point"
               | between the two branches in order to merge. (And we can
               | compute that fork point in O(n) time - where n is the
               | number of events that have happened since then). Its
               | usually very very fast.
               | 
               | In an application like google docs or a wiki, the browser
               | will usually never need to look at any historical changes
               | at all in order to edit a document.
        
               | vlovich123 wrote:
               | Very clever idea. Thanks for explaining
        
       | 1attice wrote:
       | s/e.g./EG/
        
         | deathanatos wrote:
         | s/e.g./Eg/, which is how the paper stylizes it?
        
       | canadiantim wrote:
       | Looks like amazing work, congrats!! Excited to see
       | implementations in the wild, definitely would be keen to play
       | around with.
        
       | eclectic29 wrote:
       | If Martin Kleppmann is the author I know this stuff will be worth
       | watching out for.
        
       | britannio wrote:
       | Joseph explains the algorithm on YouTube too:
       | https://www.youtube.com/watch?v=rjbEG7COj7o
       | 
       | It's great work, combining the best of OT and CRDTs.
        
         | auggierose wrote:
         | I find the formulation in the abstract slightly confusing. As
         | far as I understand EG-Walker _is_ a CRDT, an operation-based
         | one.
        
           | josephg wrote:
           | Author here. It's kinda both a crdt and an operational
           | transform system.
           | 
           | It's a crdt in that all peers share & replicate the set of
           | all editing events. (A grow-only set crdt if we're being
           | precise). Peers can use those editing events to generate the
           | document state at any point in time, merge changes and so on.
           | 
           | But the editing events themselves are stored and expressed in
           | their "original" form (unlike existing CRDTs, which need a
           | prepare function). That means lower memory usage during use.
           | 
           | The replying / merging process itself is kind of a batch
           | operational transform algorithm. It works by building a
           | normal crdt state object in memory in order to transform the
           | events so they can be replayed. In that sense, it's an OT
           | system. (But one which transforms by using a crdt, like Yjs,
           | internally within each peer).
           | 
           | I don't know if that clarifies things. Feel free to ask more
           | questions!
        
             | judofyr wrote:
             | Let me see if I understand this correctly:
             | 
             | CRDTs take an editor event such as "insert at position X"
             | and turns it into something a concrete operation like
             | "insert to the right of node Y created by client C" which
             | is then sent. This makes it super easy to apply concurrent
             | operations since they have a direct reference to where
             | they're located. However, it also means that you know have
             | to keep track of these nodes. All of the nodes that has
             | ever existed is present at all times, and deletion is
             | handled as a state flag which marks it as hidden.
             | 
             | OTs take an editor event such as "insert at position X" and
             | keeps it like that. Whenever a concurrent event is received
             | it then tries to "rebase" it (i.e. patch that event) so
             | that it makes sense on top of the current events. However,
             | this (1) can be quite finicky to get right and (2) it is
             | based on there being One True Order of Events (i.e. a
             | server).
             | 
             | This approach takes an editor event such as "insert at
             | position X" and keeps it like that. When applied, it _can_
             | be inserted into an  "ever-growing list of atoms with state
             | flag". However, in this algorithm the data structure is
             | actually capable of representing two different versions at
             | the same time: A _current_ version and a _final_ version.
             | This is handled by there being _two_ state flags instead of
             | one: Every node has a  "current state = exists/deleted" and
             | "final state = exists/deleted".
             | 
             | This gives us the power of doing a "soft undo" (which is
             | called "retreat" in the paper): We can take our own latest
             | event which we've applied, revert the effect on the
             | _current_ version, while still keeping the _final_ version
             | the same. We 're handling this very similar to CRDTs: We
             | keep all the nodes at all time, we're just using state
             | flags to keep track of whether it exists or not.
             | 
             | This is useful when we observe a concurrent event. This
             | event have references to positions which are valid in the
             | context of its parent. If we "retreat" of all of _our_
             | events until we reach the parent, we then have a data
             | structure which represents the text at that point. We can
             | now apply the  "insert at position Y"-events which we
             | received by interpreting "Y" in terms of the _current_
             | version. After we 've applied all of those events we can
             | then look at the _final_ version and this now actually
             | contains the _combined_ result of both changes!
             | 
             | And here comes the nice part: Since the events themselves
             | are always on the form "insert at position X" it means that
             | we can choose another representation of applying them. For
             | instance, if we know that there are no concurrent events
             | that are happening, we might as well apply them _directly_
             | on a string without bothering with the whole
             | "current/final dual data structure".
        
               | josephg wrote:
               | First paragraph: yes, exactly.
               | 
               | > OTs take an event..
               | 
               | This is how the early Jupiter OT works, yes. And most OT
               | systems work like this. But there are also some papers on
               | more recent OT systems which can work with more than 2
               | peers. Unfortunately, many of these systems have turned
               | out to have convergence bugs and/or they are O(n^2). For
               | our paper one of our example datasets takes tens of
               | milliseconds to replay with CRDTs and egwalker but an
               | hour of time with OT!
               | 
               | > the data structure is capable of representing two
               | different versions...
               | 
               | With egwalker it's important to distinguish between two
               | different data structures. There's a grow only set of
               | original editing events. This is persisted to disk and
               | replicated over the network. Then while actually
               | replaying events or merging, we generate a second,
               | temporary, in memory data structure which resembles a
               | normal CRDT. (Except with an extra state field on each
               | item like you said). This crdt state object isn't
               | persisted. It's usually discarded as soon as the merge
               | (transform) operation is complete. One big advantage of
               | this approach is that this data structure does not need
               | to represent all items ever inserted. Just the concurrent
               | items, back to the most recent common branch. So it's
               | usually tiny. And that allows history to be pruned -
               | which CRDTs typically don't allow.
               | 
               | But yes, everything else is right!
        
               | ProblemFactory wrote:
               | This is probably a question about classic CRDTs as much
               | as eg-walker:
               | 
               | Do all possible topological sorts of the event graph
               | result in the same final consensus document? If yes how
               | do we know that, and if no, how do they resolve the order
               | in which each branch is applied?
        
               | josephg wrote:
               | > Do all possible topological sorts of the event graph
               | result in the same final consensus document?
               | 
               | Yes. Thats usually referred to as the "convergence
               | property".
               | 
               | > If yes how do we know that
               | 
               | Usually, careful design, mathematical proofs and
               | randomized (fuzz) testing. Fuzz testing is absolutely
               | essential - In over a decade of working on systems like
               | this, I don't know if I've ever implemented something
               | correctly first try. Fuzz testing is essential. You
               | shouldn't trust the correctness of any system which
               | haven't been sufficiently fuzzed. (Luckily, fuzzers are
               | easy to write, and the convergence property is very easy
               | to test for.)
               | 
               | For Eg-walker, I think we've pumped around 100M randomly
               | generated events (in horribly complex graphs) through our
               | implementation to flush out any bugs.
        
               | auggierose wrote:
               | This seems to be a field perfect for theorem proving, I
               | think I've seen some work by Kleppmann using Isabelle.
               | 
               | I once tried to understand the Yjs paper, but I came to
               | the conclusion that their proof is just wrong! They do
               | some impressively looking logical reasoning in the paper,
               | but they define some order in terms of itself, so they
               | don't really show anything, if I remember correctly. If
               | you tried that in Isabelle, it would stop you already at
               | the very start of all that nonsense.
        
               | josephg wrote:
               | I talked to Kevin Jahns (the author of the YATA paper &
               | Yjs) about his paper a few years ago. He said he found
               | errors in the algorithm described in the paper, after it
               | was published. The algorithm he uses in Yjs is subtly
               | different from YATA in order to fix the mistakes.
               | 
               | He was quite surprised the mistakes went unnoticed
               | through the peer review process.
               | 
               | There have also been some (quite infamous) OT algorithm
               | papers which contain proofs of correctness, but which
               | later turned out to actually be incorrect. (Ie, the
               | algorithms don't actually converge in some instances).
               | 
               | I'm embarassed to say I don't know Isabelle well enough
               | to know how you would use it to prove convergence
               | properties. But I have gotten very good at fuzz testing
               | over the years. Its wild how many bugs in seemingly-
               | working software I've found using the technique.
               | 
               | I think ideally you'd use both approaches.
        
               | auggierose wrote:
               | Ah, that makes sense! I thought that Yjs must be doing
               | something differently than described, because it seems to
               | work well in practice, but I couldn't see how Yata would.
               | Anyway, I learnt a lot by thinking through that paper :-)
               | 
               | Fuzz testing and proof are complementary, I think, both
               | catch things the other one might not have caught. The
               | advantage of Fuzz testing is that it tests the real
               | thing, not a mathematical replica of it.
        
       | abdullahkhalids wrote:
       | Do collaborative whiteboard like software use the same
       | algorithms, or are there more suitable algorithms for picture
       | collaborations?
        
         | josephg wrote:
         | There's usually more suitable algorithms for picture
         | collaborations.
         | 
         | Text is hard because it's a list of characters, and when items
         | are inserted and deleted the operations change the index of all
         | subsequent elements.
         | 
         | Usually, editing a digital whiteboard is much simpler.
        
       | Palmik wrote:
       | Saw the YouTube video when it was first posted, and it could be a
       | great match for a new project I have in mind.
       | 
       | Is there a practical implementation yet that supports not just
       | strings, but also lists and maps?
       | 
       | Would be great to see it integrated into yjs / y-crdt.
        
       ___________________________________________________________________
       (page generated 2024-09-28 23:01 UTC)