[HN Gopher] Resilient Sync for Local First
       ___________________________________________________________________
        
       Resilient Sync for Local First
        
       Author : ingve
       Score  : 170 points
       Date   : 2024-06-24 05:58 UTC (1 days ago)
        
 (HTM) web link (holtwick.de)
 (TXT) w3m dump (holtwick.de)
        
       | vlovich123 wrote:
       | The challenge of course is that if your write depends on a
       | previous read, offline/online sync can easily in the wrong sync
       | state once you come back online (in the general case). Even CRDTs
       | are not immune from this - for example your offline document copy
       | says step 1 is delete line 3 so you delete it but then you sync
       | and step 1 was corrected to be "delete line 4" your deletion of
       | line 3 is now incorrect even though the merged result is "valid".
       | We see this everyday in code development - they're called merge
       | conflicts.
        
         | eptcyka wrote:
         | This is exactly the kind of scenario CRDTs rectify.
        
           | threatofrain wrote:
           | CRDT's largely can't rectify merge conflicts... what CRDT's
           | can do is make your merges convergent. If I auto-merged your
           | Git code using a blunt policy I don't know if you'd reach for
           | "correct" as your first description as opposed to
           | "convergent".
           | 
           | The longer you're offline the more you have to explain to
           | your users the circumstances where they might need to do
           | manual data cleanup or expect creative merging policy. No
           | matter what tech you're using.
        
             | ErikBjare wrote:
             | This is true if you're working with documents edited by
             | different devices, but many CRDTs do not and can indeed be
             | not only "convergent" but also "correct".
             | 
             | How SSB works is a good example.
        
               | vlovich123 wrote:
               | Can't find what SSB is. Can you clarify?
               | 
               | But yes, it's possible there are applications that can
               | constrain the set of operations such that the merged
               | result is also correct. I was just highlighting that
               | CRDTs are trotted out as a catch all when a) it's just a
               | way of thinking about the problem rather than an actual
               | algorithm (eg while there's a lot of CRDT work for
               | document editing, that work has to be done for scratch
               | for any other problem domain) b) you have to modify your
               | problem constrains and solution such that merge results
               | are correct. B is a very very hard problem in a
               | distributed system even if you ignore the challenge of a
               | which itself is quite hard. I wouldn't trust anything in
               | this space too much without a TLA+ proof that it's
               | correct (unless it's something low stakes like document
               | editing).
        
               | endgame wrote:
               | I think SSB is "Secure Scuttlebutt":
               | http://ssbc.github.io/ssb-db/
        
               | threatofrain wrote:
               | Some of our most important DB sync'ing algorithms were
               | only very recently verified. It was okay to run the whole
               | world's systems, including hospitals and armed forces
               | this way. Very very few industry algos receive formal
               | verification.
        
             | kiitos wrote:
             | CRDTs solve merge conflicts by construction, because they
             | necessarily don't permit merge conflicts in the first
             | place.
        
               | bombela wrote:
               | Alright, try this in Google Doc.
               | 
               | Two users simultaneously select the same word, and
               | replace it by a different word each.
               | 
               | You might end up with: "apple" -> "carrot orange"
               | 
               | Sure it technically merged, and did not lose data. But
               | you still need humans to reconcile the final state
               | anyways. Worse, you might not even notice that you need
               | to cleanup.
        
               | vlovich123 wrote:
               | Bad example. Google docs doesn't use CRDTs but uses OT
               | instead. CRDTs may handle your scenario just fine
               | depending on how they decide to handle this scenario.
               | 
               | The easier scenario that is independent of the specific
               | algorithm is the one I described (have follow up comments
               | describing it in more explicit detail) where you have a
               | logical meta instruction within the document and then the
               | document is updated based on outdated meta instructions
               | offline but when they come back online the meta
               | instructions were changed.
               | 
               | Then there's not even a merge conflict to really worry
               | about.
        
               | mattarm wrote:
               | > Bad example. Google docs doesn't use CRDTs but uses OT
               | instead. CRDTs may handle your scenario just fine
               | depending on how they decide to handle this scenario.
               | 
               | The CRDT may pick one or the other replacement word, but
               | who is to say that either choice is correct? Perhaps
               | including both words is correct.
               | 
               | > Then there's not even a merge conflict...
               | 
               | Agree, this is what CRDTs are all about.
               | 
               | > ...to really worry about.
               | 
               | I think it is important to make clear that CRDTs do not
               | "solve" the merging problem, they merely make it possible
               | to solve in a deterministic way across replicas.
               | 
               | Often, CRDTs do not capture higher level schema
               | invariants, and so a "conflict free" CRDT merge can
               | produce an invalid state for a particular application.
               | 
               | There is also the example above, where at the application
               | level, one particular merge outcome may be preferred over
               | another.
               | 
               | So, it isn't as simple as having nothing to worry about.
               | When using CRDTs, often, there are some pretty subtle
               | things that must be worried about. :-)
        
               | vlovich123 wrote:
               | Yup 100% agreed.
        
               | xnorswap wrote:
               | By definition they don't have conflicts in the technical
               | sense of the word, but they can easily get into bad
               | states after a merge.
               | 
               | It's not technically a conflict since there is a path
               | forward, but the resulting output can be nonsensical.
               | 
               | For example you could employ a strategy where given two
               | concurrent edits, the merge "randomly*" picks one of the
               | two edits for each property/fact, and abandons the other.
               | 
               | That's a construction of a conflict free replicated data
               | type, but it says nothing of the quality of the final
               | result.
               | 
               | Ultimately, the quality comes down to how well you can
               | produce merge strategies for your domain.
               | 
               | * "Randomly" in a deterministic way, e.g. by comparing a
               | hashes of transaction IDs or similar.
        
               | kiitos wrote:
               | > By definition they don't have conflicts in the
               | technical sense of the word...
               | 
               | ...which is the sense that we're talking about, when we
               | talk about CRDTs.
        
               | Dylan16807 wrote:
               | When someone goes into detail about correct versus
               | convergent, they are _explaining_ how the technical sense
               | is not what people really want, and you bluntly replying
               | that it 's solved (because it's technically solved) is
               | unhelpful at best and misleading at worse.
        
           | jamil7 wrote:
           | No, they don't, unfortunately. I feel like the term CRDT
           | leads to misunderstandings like this. CRDTs are more like
           | packaging and abstractions on top of various strategies for
           | building multi-user, local-first UX, they let end users
           | program against recognisable data structures like strings,
           | lists and dictionaries. How they handle conflicts is
           | implementation dependant and isn't anything new, many will
           | fall back to LWW for example.
        
             | eptcyka wrote:
             | The specific case of an edit to a line being applied to a
             | different line than what it was intended for because a
             | previous change deleted said line is exactly what CRDTs
             | rectify. I'm not referring to merge conflicts per say, or
             | that the final document will make human sense. But CRDTs
             | will prevent changesets from different participats from
             | being applied _in the worst possible way_. And there will
             | be no data loss.
        
               | holtwick wrote:
               | For certain scenarios there will be conflicts, take a
               | boolean value. Client A sets it and client B unsets it.
               | There can only be one winner.
               | 
               | But that might be a benefit from the proposed log sync,
               | because these conflicting situations can be shown and
               | marked for human review in the UI. Each step of change is
               | well documented and the history can fully be reviewed.
        
               | vlovich123 wrote:
               | Note that in the example I gave there's not even a
               | conflict to notice. You just have meta instructions
               | telling you what to do within a document. You apply those
               | instructions offline but in the mean time your meta
               | instructions are updated online. You come back and
               | synchronize your changes but there's nothing that
               | suggests your offline sync against the meta instructions
               | was stale so your edits are applied based on stale meta
               | instructions resulting in a logically undesirable result.
               | 
               | This is basically a TOCTOU race in a distributed context
               | and CRDTs do not magically solve this problem because
               | it's not solvable. That's why we have PAXOS and RAFT to
               | do such things and why many many papers have been written
               | trying to solve the Byzantine generals problem in
               | constrained scenarios with well defined criteria.
        
               | threatofrain wrote:
               | > And there will be no data loss.
               | 
               | CRDT's can absolutely lose data!
        
               | gizmo wrote:
               | Not when the log is append-only.
        
               | threatofrain wrote:
               | Yes, and CRDT's can absolutely lose data. That's the
               | point that should be delivered across when someone claims
               | they can't. Nobody needs to be warned that if you save
               | endless data then you could save all data.
        
               | vlovich123 wrote:
               | You're misunderstanding what I'm saying.
               | 
               | Original document has a list of instructions. The first
               | instruction says "delete line 3". User 1 offline deletes
               | line 3. User 2 updates the instructions so that step 1
               | says "delete line 4". Merged result when user 1 comes
               | back online: step 1 says "delete line 4" and line 3 is
               | deleted with the users none the wiser about what happened
               | unless they double check the merged result. Physically
               | the result is merged consistently but logically the end
               | result of the document is not what was intended.
        
               | eptcyka wrote:
               | That's why relative positions are not used in CRDTs in
               | the absolute sense. Entities are added via IDs and
               | operations are recorded via IDs. If user one deletes line
               | 3 and user 2 deletes line 4, user one will see lines 1, 2
               | and 5 and onwards. The commands that are captured by
               | CRDTs only operate on the state of the document as the
               | user sees it at the time, their own changes might be
               | overwritten later during a merge, but their changes won't
               | interfere with parts they were not destined to interfere
               | with.
        
               | vlovich123 wrote:
               | You seem to still not be getting it. Here's the original
               | online document:
               | 
               | > 1. Please delete line 3.
               | 
               | > 2. Blank line
               | 
               | > 3. This is line 3
               | 
               | > 4. This is line 4.
               | 
               | User 1 is accessing the document offline, sees the
               | instruction on line 1 and changes their local copy to:
               | 
               | > 1. Please delete line 3
               | 
               | > 2. Blank line
               | 
               | > 3. This is line 4
               | 
               | User 2 concurrently changes the document to update the
               | instruction:
               | 
               | > 1. Please delete line 4
               | 
               | > 2. Blank line
               | 
               | > 3. This is line 3
               | 
               | > 4. This is line 4
               | 
               | When user 1 & user 2 merge their documents, they'll get:
               | 
               | > 1. Please delete line 4
               | 
               | > 2. Blank line
               | 
               | > 3. This is line 4
               | 
               | You can also split this into two documents with a similar
               | effect (one contains instructions, the other contents).
               | The point is that it's not about how changes are
               | represented. It's that the logical and semantic structure
               | isn't represented in the CRDT and thus while some
               | eventually consistent result is attained by all nodes,
               | the logical and semantic structure may end up broken.
               | That's why you generally don't see CRDTs for code editors
               | - the merge conflicts still need to be resolved by hand
               | and doing it automatically can result in worse and harder
               | to understand results than normal.
        
         | ricardobeat wrote:
         | Under what circumstances would a past entry be corrected?
         | Doesn't CRDT require an immutable stream of operations?
        
           | vlovich123 wrote:
           | Original document has a list of instructions. The first
           | instruction says "delete line 3". User 1 offline deletes line
           | 3. User 2 updates the instructions so that step 1 says
           | "delete line 4". Merged result when user 1 comes back online:
           | step 1 says "delete line 4" and line 3 is deleted with the
           | users none the wiser about what happened unless they double
           | check the merged result. Physically the result is merged
           | consistently but logically the end result of the document is
           | not what was intended.
           | 
           | CRDT just guarantees that the document merges an edit stream
           | in some sensible way without any need for manual conflict
           | resolution. However, such resolution doesn't (and can't
           | really) understand logically that the merged result is not
           | what was intended to have happened. This is a contrived
           | example but you can imagine similar scenarios and it's
           | concerning that people think CRDTs would solve the scenario I
           | presented somehow when they solve a much more constrained
           | technical problem. Imagine you have a CRDT algorithm for
           | managing distributed state. And then you update the code
           | about how the state itself is interpreted or managed. Now you
           | have the problem of differing versions of code running
           | mutating the document containing the state and the document
           | is always in some valid state, but the server are not
           | actually in agreement about how to apply the mutations and
           | the state can diverge in unintended ways.
        
             | ricardobeat wrote:
             | I get it that you can reach conflicting states, but not
             | with this particular example.
             | 
             | Assuming updates are synced through a centralized server,
             | if user 1 has seen the "delete line 3" instruction, it
             | means that instruction was committed by user 2. At this
             | point they can't "edit" the first instruction. You can only
             | move forward. Undoing the line 3 deletion is in itself a
             | new instruction adding the line back; then line 4 gets
             | deleted and both clients' states have converged.
        
       | ergl wrote:
       | The ideas described here are very similar to what SSB
       | (https://ssbc.github.io/docs/ssb/faq.html) implemented.
       | 
       | The main problems with having a log is that it grows with every
       | single change (how granular are your changes? with CRDTs, any
       | mutation, no matter how small, is a change). Questions of data
       | retention (is your protocol tolerant to missing log entries?) or
       | data rewriting (if your log is a merkle tree, rewriting something
       | in the past means rewriting all subsequent entries) are also
       | open.
       | 
       | The post also mentions that the log entries could be CRDTs. But
       | if that's so, then you don't need a log at all, since all the
       | information you need to compute the minimal delta to sync between
       | peers is inside the CRDT itself. For a good overview of how this
       | could be done, see (specifically the "Calculating Diffs"
       | section): https://ditto.live/blog/dittos-delta-state-crdts
       | (disclaimer: I work at ditto)
        
         | holtwick wrote:
         | Thanks for the detailed feedback.
         | 
         | The growth of the log is indeed a weak point that could be
         | improved by regularly merging entries. Missing entries are
         | easily recognizable because a consecutive index is used. The
         | checksums on the previous entry should improve data
         | consistency.
         | 
         | The point that CRDTs themselves already contain all the
         | information required for an update is absolutely correct. I
         | have been working on this protocol for some time and one
         | objective was the reproducibility of the individual changes fro
         | accountability reasons. But this may not be necessary for all
         | applications and could possibly be achieved in other ways.
         | Thank you for pointing this out, I will reconsider the concept
         | in this respect!
        
           | holtwick wrote:
           | I would like to refine my answer regarding the rapidly
           | growing log. If we assume that we have a real-time
           | application, then every keystroke or pointer action can
           | indeed create a change entry.
           | 
           | But this storage format is designed for "long term" and
           | "slow" operations. Where "slow" means in the time lapse of a
           | second instead of a milisecond. This allows us to combine
           | multiple changes into a single log entry.
           | 
           | CRDT implementations like Yjs are good at concentrating such
           | changes into smaller chunks of data. For example, writing
           | text in a rich text editor like Prosemirror is then reduced
           | to something like a string and a position.
           | 
           | But the UI can also be lazy and throttle things. A string
           | input field can only fire changes when the field is left or
           | not typed for a second or so.
           | 
           | These steps will significantly reduce the size of the log.
           | They did in my implementations.
           | 
           | But this is not the end of realtime for such applications.
           | These applications could still pass changes directly over
           | P2P, as long as the log remains consistent, so that the
           | resulting document will always eventually consistent.
        
       | gritzko wrote:
       | Excellent. This approach to CRDTs existed even before the term
       | itself was invented. In the 2010 article[2] on Causal Trees[1],
       | your humble servant calls these per-peer op logs "yarns". In the
       | 2011, observing the proliferation of proposals, Shapiro&friends
       | propose[3] the term "CRDT".
       | 
       | That is essentially a partially-ordered approach to oplogs.
       | Oplogs (binlogs, WALs) underlie the entire universe of databases.
       | The problems are exactly the same as the problems "normal"
       | database developers faced in the far past: how much history do
       | you want to keep? How to synchronize the log and the state
       | reliably? What is the format of the log? How to store/transfer
       | the log reliably?
       | 
       | The last question seems trivial, but it is not. If you read some
       | databases' source code, you may find a lot of paranoid things[4],
       | obviously inspired by humiliating real-life incidents. The other
       | questions are not trivial by any means.
       | 
       | So, yes, this is the way[5].
       | 
       | [1]: Alexei "archagon" Baboulevitch excellent popular summary of
       | Causal Trees, 2018 http://archagon.net/blog/2018/03/24/data-
       | laced-with-history/
       | 
       | [2]: The 2010 paper
       | https://www.researchgate.net/publication/221367739_Deep_hype...
       | 
       | [3]: The CRDT paper
       | https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf
       | 
       | [4]: e.g.
       | https://github.com/facebook/rocksdb/blob/main/db/log_reader....
       | 
       | [5]: the simultaneous post by Nikita Prokopov
       | https://tonsky.me/blog/crdt-filesync/
        
         | unstable wrote:
         | Thanks for the links, awesome!
        
       | ngrilly wrote:
       | The approach seems similar to Delta Lake's consistency model,
       | using object storage like S3, and yet allows concurrent writers
       | and readers: https://jack-
       | vanlightly.com/analyses/2024/4/29/understanding...
        
         | holtwick wrote:
         | Thanks for feedback. The link is broken for me, but I think you
         | are referring to this page? https://jack-
         | vanlightly.com/analyses/2024/4/29/understanding...
         | 
         | Indeed, the approach is similar. Especially the separation of
         | assets (they call it "Data files") from the log ("Data log") is
         | something I consider being a good choice.
        
           | ngrilly wrote:
           | Yes, that's the correct link. Sorry for the broken link.
           | 
           | If the approach works for something as heavy duty as Delta
           | Lake, then it is should work for syncing the data of a end-
           | user app across several devices.
           | 
           | Even SQLite is separating its data and WAL in separate files
           | :)
        
       | SushiHippie wrote:
       | Cross posting my comment from the other thread about local first
       | https://news.ycombinator.com/item?id=40786425
       | 
       | My comment may fit a bit better here as this post talks about a
       | protocol instead.
       | 
       | ---
       | 
       | https://remotestorage.io/ was a protocol intended for this.
       | 
       | IIRC the visison was that all applications could implement this
       | and you could provide that application with your remotestorage
       | URL, which you could self host.
       | 
       | I looked into this some time ago as I was fed up with WebDAV
       | being the only viable open protocol for file
       | shares/synchronization (especially after hosting my own NextCloud
       | instance, which OOMed because the XML blobs for a large folder it
       | wanted to create as a response used too much memory) and found it
       | through this gist [0] which was a statement about Flock [1]
       | shutting down.
       | 
       | It looks like a cool and not that complex protocol, but all the
       | implementations seem to be unmaintained.
       | 
       | And the official javascript client [2] seems to be ironically be
       | used mostly to access Google Drive or DropBox
       | 
       | Remotestorage also has an internet draft
       | https://datatracker.ietf.org/doc/draft-dejong-remotestorage/
       | which is relatively easy to understand and not very long.
       | 
       | [0] https://gist.github.com/rhodey/873ae9d527d8d2a38213
       | 
       | [1] https://github.com/signalapp/Flock
       | 
       | [2] https://github.com/remotestorage/remotestorage.js
        
         | holtwick wrote:
         | Thanks for sharing, this is super interesting. Although it
         | doesn't seem to be super active these days. Probably because it
         | is difficult to commercialize local first. That might be why we
         | need a widely adopted and super flexible standard to become
         | attractive to hosters of such services.
        
       | gwbas1c wrote:
       | I was desktop client lead for Syncplicity (major Dropbox
       | competitor) for almost a decade.
       | 
       | Some thoughts:
       | 
       | Most important: _Git (the protocol) kinda-sorta does this
       | already._ I personally don 't know if git is CRDT; but the .git
       | folder is a database that contains the entire history of the
       | repos. The "git" command can be used to sync a repository among
       | different computers. You don't need to use Github or set up a git
       | server. (But everyone does it that way because it's much, much
       | easier.)
       | 
       | Secondly: _The proposal made assumes that everyone will rewrite
       | their software to be CRDT-based and fit into this schema._
       | Writing software this way is _hard,_ and then we need to convince
       | the general public to adopt versions of software that are based
       | around this system. _Could you port LibreOffice to writing out
       | documents this way?_
       | 
       | Thirdly: _Resolving conflicts when computers are disconnected
       | from the network is always a mess for end users._ (This was a
       | huge pain point for Syncplicity 's users, and is a huge pain
       | point in other "sync" products.) CRDTs don't "solve" the problem
       | for users: The best you can hope for is something like git where
       | it identifies merge conflicts, and then the user has to go in and
       | clean them up.
       | 
       | For example: Two computers are disconnected from the network.
       | Edits are made to the same sentence in a document while
       | disconnected. What happens? The CRDT approach might result in
       | "consistent data," but it cannot "read minds" and know what is
       | the correct result.
       | 
       | ----
       | 
       | IMO, some things I would consider:
       | 
       | Try to somehow replicate the desired workflows with git or a
       | similar tool. (Remember Mercurial?) See what kind of problems pop
       | up.
       | 
       | Consider writing a "git aware" word processor that is based
       | around Markdown; and is somewhat aware of how git behaves around
       | conflicts.
       | 
       | Try "bolting on" a sync protocol (or git) to LibreOffice as a way
       | to understand just how easy / hard a project like this is.
       | 
       | Consider encapsulating the entire database in a SQLite file,
       | instead of using the filesystem.
        
         | vlovich123 wrote:
         | Git is obviously not a CRDT because merge conflicts have to be
         | manually resolved whereas CRDTs require this to happen
         | automatically. A less interesting subset of Git can be viewed
         | as CRDT (e.g. mirrors).
        
           | gwbas1c wrote:
           | Your users will still need to manually merge in a CRDT
           | application; _because computers can not read minds._
           | 
           | The best a CRDT application can do is spit out garbage when
           | changes conflict.
           | 
           | IE, start with: The quick brown fox jumped over the fence.
           | 
           | We both disconnect. At the same exact time...
           | 
           | I change it to: The quick brown fox ran around the fence.
           | 
           | You change it to: The quick brown fox dug under the fence.
           | 
           | The best you can do is make a data structure that is
           | consistent and identifies the conflict. CRDT can't "read our
           | minds" and decide between "ran around" or "dug under".
        
             | Groxx wrote:
             | It can't decide _human intent_ flawlessly, but the point of
             | a CRDT is that it _does_ choose one, and all others choose
             | the same one regardless of how they got there.
             | 
             | Git does not do this, so it is not a CRDT. The content-
             | addressable-database portion of git sorta fits this though
             | (as does any other content-addressable system).
        
               | kaba0 wrote:
               | This is basically git automatically doing "accept theirs"
               | or "yours" for any fork. You can see that it will not
               | generally be what you want, so whether such a strategy
               | could work is domain-dependent.
        
               | vlovich123 wrote:
               | No that's not accurate. If I merge branch X and then
               | branch Y and someone else merges branch Y and then branch
               | X, with CRDT the result should also be the same whereas
               | with git it won't be if you're strategy is always "accept
               | theirs" or "accept yours". CRDT is also order invariant -
               | it doesn't matter which ordering of edit operations you
               | accept, the end result is consistent across all nodes.
               | 
               | You may want to read up on the Wikipedia page rather than
               | taking 1 thing I said and extrapolating.
               | https://en.wikipedia.org/wiki/Conflict-
               | free_replicated_data_...
        
             | threatofrain wrote:
             | Whether or not the user is manually involved at some point
             | is a product decision. I think the trend of consumer
             | companies is not to do that. Possibly damaging user data is
             | simply a tradeoff in this way of thinking.
        
               | vlovich123 wrote:
               | No, that's literally the definition of CRDT. Requirement
               | 2 out of the 3 listed on Wikipedia:
               | 
               | > An algorithm (itself part of the data type)
               | automatically resolves any inconsistencies that might
               | occur.
               | 
               | So no, human resolution vs automatic is not a product
               | decision but a key definitional requirement to be a CRDT.
               | 
               | https://en.wikipedia.org/wiki/Conflict-
               | free_replicated_data_...
        
       ___________________________________________________________________
       (page generated 2024-06-25 23:01 UTC)