[HN Gopher] Resilient Sync for Local First
___________________________________________________________________
Resilient Sync for Local First
Author : ingve
Score : 170 points
Date : 2024-06-24 05:58 UTC (1 days ago)
(HTM) web link (holtwick.de)
(TXT) w3m dump (holtwick.de)
| vlovich123 wrote:
| The challenge of course is that if your write depends on a
| previous read, offline/online sync can easily in the wrong sync
| state once you come back online (in the general case). Even CRDTs
| are not immune from this - for example your offline document copy
| says step 1 is delete line 3 so you delete it but then you sync
| and step 1 was corrected to be "delete line 4" your deletion of
| line 3 is now incorrect even though the merged result is "valid".
| We see this everyday in code development - they're called merge
| conflicts.
| eptcyka wrote:
| This is exactly the kind of scenario CRDTs rectify.
| threatofrain wrote:
| CRDT's largely can't rectify merge conflicts... what CRDT's
| can do is make your merges convergent. If I auto-merged your
| Git code using a blunt policy I don't know if you'd reach for
| "correct" as your first description as opposed to
| "convergent".
|
| The longer you're offline the more you have to explain to
| your users the circumstances where they might need to do
| manual data cleanup or expect creative merging policy. No
| matter what tech you're using.
| ErikBjare wrote:
| This is true if you're working with documents edited by
| different devices, but many CRDTs do not and can indeed be
| not only "convergent" but also "correct".
|
| How SSB works is a good example.
| vlovich123 wrote:
| Can't find what SSB is. Can you clarify?
|
| But yes, it's possible there are applications that can
| constrain the set of operations such that the merged
| result is also correct. I was just highlighting that
| CRDTs are trotted out as a catch all when a) it's just a
| way of thinking about the problem rather than an actual
| algorithm (eg while there's a lot of CRDT work for
| document editing, that work has to be done for scratch
| for any other problem domain) b) you have to modify your
| problem constrains and solution such that merge results
| are correct. B is a very very hard problem in a
| distributed system even if you ignore the challenge of a
| which itself is quite hard. I wouldn't trust anything in
| this space too much without a TLA+ proof that it's
| correct (unless it's something low stakes like document
| editing).
| endgame wrote:
| I think SSB is "Secure Scuttlebutt":
| http://ssbc.github.io/ssb-db/
| threatofrain wrote:
| Some of our most important DB sync'ing algorithms were
| only very recently verified. It was okay to run the whole
| world's systems, including hospitals and armed forces
| this way. Very very few industry algos receive formal
| verification.
| kiitos wrote:
| CRDTs solve merge conflicts by construction, because they
| necessarily don't permit merge conflicts in the first
| place.
| bombela wrote:
| Alright, try this in Google Doc.
|
| Two users simultaneously select the same word, and
| replace it by a different word each.
|
| You might end up with: "apple" -> "carrot orange"
|
| Sure it technically merged, and did not lose data. But
| you still need humans to reconcile the final state
| anyways. Worse, you might not even notice that you need
| to cleanup.
| vlovich123 wrote:
| Bad example. Google docs doesn't use CRDTs but uses OT
| instead. CRDTs may handle your scenario just fine
| depending on how they decide to handle this scenario.
|
| The easier scenario that is independent of the specific
| algorithm is the one I described (have follow up comments
| describing it in more explicit detail) where you have a
| logical meta instruction within the document and then the
| document is updated based on outdated meta instructions
| offline but when they come back online the meta
| instructions were changed.
|
| Then there's not even a merge conflict to really worry
| about.
| mattarm wrote:
| > Bad example. Google docs doesn't use CRDTs but uses OT
| instead. CRDTs may handle your scenario just fine
| depending on how they decide to handle this scenario.
|
| The CRDT may pick one or the other replacement word, but
| who is to say that either choice is correct? Perhaps
| including both words is correct.
|
| > Then there's not even a merge conflict...
|
| Agree, this is what CRDTs are all about.
|
| > ...to really worry about.
|
| I think it is important to make clear that CRDTs do not
| "solve" the merging problem, they merely make it possible
| to solve in a deterministic way across replicas.
|
| Often, CRDTs do not capture higher level schema
| invariants, and so a "conflict free" CRDT merge can
| produce an invalid state for a particular application.
|
| There is also the example above, where at the application
| level, one particular merge outcome may be preferred over
| another.
|
| So, it isn't as simple as having nothing to worry about.
| When using CRDTs, often, there are some pretty subtle
| things that must be worried about. :-)
| vlovich123 wrote:
| Yup 100% agreed.
| xnorswap wrote:
| By definition they don't have conflicts in the technical
| sense of the word, but they can easily get into bad
| states after a merge.
|
| It's not technically a conflict since there is a path
| forward, but the resulting output can be nonsensical.
|
| For example you could employ a strategy where given two
| concurrent edits, the merge "randomly*" picks one of the
| two edits for each property/fact, and abandons the other.
|
| That's a construction of a conflict free replicated data
| type, but it says nothing of the quality of the final
| result.
|
| Ultimately, the quality comes down to how well you can
| produce merge strategies for your domain.
|
| * "Randomly" in a deterministic way, e.g. by comparing a
| hashes of transaction IDs or similar.
| kiitos wrote:
| > By definition they don't have conflicts in the
| technical sense of the word...
|
| ...which is the sense that we're talking about, when we
| talk about CRDTs.
| Dylan16807 wrote:
| When someone goes into detail about correct versus
| convergent, they are _explaining_ how the technical sense
| is not what people really want, and you bluntly replying
| that it 's solved (because it's technically solved) is
| unhelpful at best and misleading at worse.
| jamil7 wrote:
| No, they don't, unfortunately. I feel like the term CRDT
| leads to misunderstandings like this. CRDTs are more like
| packaging and abstractions on top of various strategies for
| building multi-user, local-first UX, they let end users
| program against recognisable data structures like strings,
| lists and dictionaries. How they handle conflicts is
| implementation dependant and isn't anything new, many will
| fall back to LWW for example.
| eptcyka wrote:
| The specific case of an edit to a line being applied to a
| different line than what it was intended for because a
| previous change deleted said line is exactly what CRDTs
| rectify. I'm not referring to merge conflicts per say, or
| that the final document will make human sense. But CRDTs
| will prevent changesets from different participats from
| being applied _in the worst possible way_. And there will
| be no data loss.
| holtwick wrote:
| For certain scenarios there will be conflicts, take a
| boolean value. Client A sets it and client B unsets it.
| There can only be one winner.
|
| But that might be a benefit from the proposed log sync,
| because these conflicting situations can be shown and
| marked for human review in the UI. Each step of change is
| well documented and the history can fully be reviewed.
| vlovich123 wrote:
| Note that in the example I gave there's not even a
| conflict to notice. You just have meta instructions
| telling you what to do within a document. You apply those
| instructions offline but in the mean time your meta
| instructions are updated online. You come back and
| synchronize your changes but there's nothing that
| suggests your offline sync against the meta instructions
| was stale so your edits are applied based on stale meta
| instructions resulting in a logically undesirable result.
|
| This is basically a TOCTOU race in a distributed context
| and CRDTs do not magically solve this problem because
| it's not solvable. That's why we have PAXOS and RAFT to
| do such things and why many many papers have been written
| trying to solve the Byzantine generals problem in
| constrained scenarios with well defined criteria.
| threatofrain wrote:
| > And there will be no data loss.
|
| CRDT's can absolutely lose data!
| gizmo wrote:
| Not when the log is append-only.
| threatofrain wrote:
| Yes, and CRDT's can absolutely lose data. That's the
| point that should be delivered across when someone claims
| they can't. Nobody needs to be warned that if you save
| endless data then you could save all data.
| vlovich123 wrote:
| You're misunderstanding what I'm saying.
|
| Original document has a list of instructions. The first
| instruction says "delete line 3". User 1 offline deletes
| line 3. User 2 updates the instructions so that step 1
| says "delete line 4". Merged result when user 1 comes
| back online: step 1 says "delete line 4" and line 3 is
| deleted with the users none the wiser about what happened
| unless they double check the merged result. Physically
| the result is merged consistently but logically the end
| result of the document is not what was intended.
| eptcyka wrote:
| That's why relative positions are not used in CRDTs in
| the absolute sense. Entities are added via IDs and
| operations are recorded via IDs. If user one deletes line
| 3 and user 2 deletes line 4, user one will see lines 1, 2
| and 5 and onwards. The commands that are captured by
| CRDTs only operate on the state of the document as the
| user sees it at the time, their own changes might be
| overwritten later during a merge, but their changes won't
| interfere with parts they were not destined to interfere
| with.
| vlovich123 wrote:
| You seem to still not be getting it. Here's the original
| online document:
|
| > 1. Please delete line 3.
|
| > 2. Blank line
|
| > 3. This is line 3
|
| > 4. This is line 4.
|
| User 1 is accessing the document offline, sees the
| instruction on line 1 and changes their local copy to:
|
| > 1. Please delete line 3
|
| > 2. Blank line
|
| > 3. This is line 4
|
| User 2 concurrently changes the document to update the
| instruction:
|
| > 1. Please delete line 4
|
| > 2. Blank line
|
| > 3. This is line 3
|
| > 4. This is line 4
|
| When user 1 & user 2 merge their documents, they'll get:
|
| > 1. Please delete line 4
|
| > 2. Blank line
|
| > 3. This is line 4
|
| You can also split this into two documents with a similar
| effect (one contains instructions, the other contents).
| The point is that it's not about how changes are
| represented. It's that the logical and semantic structure
| isn't represented in the CRDT and thus while some
| eventually consistent result is attained by all nodes,
| the logical and semantic structure may end up broken.
| That's why you generally don't see CRDTs for code editors
| - the merge conflicts still need to be resolved by hand
| and doing it automatically can result in worse and harder
| to understand results than normal.
| ricardobeat wrote:
| Under what circumstances would a past entry be corrected?
| Doesn't CRDT require an immutable stream of operations?
| vlovich123 wrote:
| Original document has a list of instructions. The first
| instruction says "delete line 3". User 1 offline deletes line
| 3. User 2 updates the instructions so that step 1 says
| "delete line 4". Merged result when user 1 comes back online:
| step 1 says "delete line 4" and line 3 is deleted with the
| users none the wiser about what happened unless they double
| check the merged result. Physically the result is merged
| consistently but logically the end result of the document is
| not what was intended.
|
| CRDT just guarantees that the document merges an edit stream
| in some sensible way without any need for manual conflict
| resolution. However, such resolution doesn't (and can't
| really) understand logically that the merged result is not
| what was intended to have happened. This is a contrived
| example but you can imagine similar scenarios and it's
| concerning that people think CRDTs would solve the scenario I
| presented somehow when they solve a much more constrained
| technical problem. Imagine you have a CRDT algorithm for
| managing distributed state. And then you update the code
| about how the state itself is interpreted or managed. Now you
| have the problem of differing versions of code running
| mutating the document containing the state and the document
| is always in some valid state, but the server are not
| actually in agreement about how to apply the mutations and
| the state can diverge in unintended ways.
| ricardobeat wrote:
| I get it that you can reach conflicting states, but not
| with this particular example.
|
| Assuming updates are synced through a centralized server,
| if user 1 has seen the "delete line 3" instruction, it
| means that instruction was committed by user 2. At this
| point they can't "edit" the first instruction. You can only
| move forward. Undoing the line 3 deletion is in itself a
| new instruction adding the line back; then line 4 gets
| deleted and both clients' states have converged.
| ergl wrote:
| The ideas described here are very similar to what SSB
| (https://ssbc.github.io/docs/ssb/faq.html) implemented.
|
| The main problems with having a log is that it grows with every
| single change (how granular are your changes? with CRDTs, any
| mutation, no matter how small, is a change). Questions of data
| retention (is your protocol tolerant to missing log entries?) or
| data rewriting (if your log is a merkle tree, rewriting something
| in the past means rewriting all subsequent entries) are also
| open.
|
| The post also mentions that the log entries could be CRDTs. But
| if that's so, then you don't need a log at all, since all the
| information you need to compute the minimal delta to sync between
| peers is inside the CRDT itself. For a good overview of how this
| could be done, see (specifically the "Calculating Diffs"
| section): https://ditto.live/blog/dittos-delta-state-crdts
| (disclaimer: I work at ditto)
| holtwick wrote:
| Thanks for the detailed feedback.
|
| The growth of the log is indeed a weak point that could be
| improved by regularly merging entries. Missing entries are
| easily recognizable because a consecutive index is used. The
| checksums on the previous entry should improve data
| consistency.
|
| The point that CRDTs themselves already contain all the
| information required for an update is absolutely correct. I
| have been working on this protocol for some time and one
| objective was the reproducibility of the individual changes fro
| accountability reasons. But this may not be necessary for all
| applications and could possibly be achieved in other ways.
| Thank you for pointing this out, I will reconsider the concept
| in this respect!
| holtwick wrote:
| I would like to refine my answer regarding the rapidly
| growing log. If we assume that we have a real-time
| application, then every keystroke or pointer action can
| indeed create a change entry.
|
| But this storage format is designed for "long term" and
| "slow" operations. Where "slow" means in the time lapse of a
| second instead of a milisecond. This allows us to combine
| multiple changes into a single log entry.
|
| CRDT implementations like Yjs are good at concentrating such
| changes into smaller chunks of data. For example, writing
| text in a rich text editor like Prosemirror is then reduced
| to something like a string and a position.
|
| But the UI can also be lazy and throttle things. A string
| input field can only fire changes when the field is left or
| not typed for a second or so.
|
| These steps will significantly reduce the size of the log.
| They did in my implementations.
|
| But this is not the end of realtime for such applications.
| These applications could still pass changes directly over
| P2P, as long as the log remains consistent, so that the
| resulting document will always eventually consistent.
| gritzko wrote:
| Excellent. This approach to CRDTs existed even before the term
| itself was invented. In the 2010 article[2] on Causal Trees[1],
| your humble servant calls these per-peer op logs "yarns". In the
| 2011, observing the proliferation of proposals, Shapiro&friends
| propose[3] the term "CRDT".
|
| That is essentially a partially-ordered approach to oplogs.
| Oplogs (binlogs, WALs) underlie the entire universe of databases.
| The problems are exactly the same as the problems "normal"
| database developers faced in the far past: how much history do
| you want to keep? How to synchronize the log and the state
| reliably? What is the format of the log? How to store/transfer
| the log reliably?
|
| The last question seems trivial, but it is not. If you read some
| databases' source code, you may find a lot of paranoid things[4],
| obviously inspired by humiliating real-life incidents. The other
| questions are not trivial by any means.
|
| So, yes, this is the way[5].
|
| [1]: Alexei "archagon" Baboulevitch excellent popular summary of
| Causal Trees, 2018 http://archagon.net/blog/2018/03/24/data-
| laced-with-history/
|
| [2]: The 2010 paper
| https://www.researchgate.net/publication/221367739_Deep_hype...
|
| [3]: The CRDT paper
| https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf
|
| [4]: e.g.
| https://github.com/facebook/rocksdb/blob/main/db/log_reader....
|
| [5]: the simultaneous post by Nikita Prokopov
| https://tonsky.me/blog/crdt-filesync/
| unstable wrote:
| Thanks for the links, awesome!
| ngrilly wrote:
| The approach seems similar to Delta Lake's consistency model,
| using object storage like S3, and yet allows concurrent writers
| and readers: https://jack-
| vanlightly.com/analyses/2024/4/29/understanding...
| holtwick wrote:
| Thanks for feedback. The link is broken for me, but I think you
| are referring to this page? https://jack-
| vanlightly.com/analyses/2024/4/29/understanding...
|
| Indeed, the approach is similar. Especially the separation of
| assets (they call it "Data files") from the log ("Data log") is
| something I consider being a good choice.
| ngrilly wrote:
| Yes, that's the correct link. Sorry for the broken link.
|
| If the approach works for something as heavy duty as Delta
| Lake, then it is should work for syncing the data of a end-
| user app across several devices.
|
| Even SQLite is separating its data and WAL in separate files
| :)
| SushiHippie wrote:
| Cross posting my comment from the other thread about local first
| https://news.ycombinator.com/item?id=40786425
|
| My comment may fit a bit better here as this post talks about a
| protocol instead.
|
| ---
|
| https://remotestorage.io/ was a protocol intended for this.
|
| IIRC the visison was that all applications could implement this
| and you could provide that application with your remotestorage
| URL, which you could self host.
|
| I looked into this some time ago as I was fed up with WebDAV
| being the only viable open protocol for file
| shares/synchronization (especially after hosting my own NextCloud
| instance, which OOMed because the XML blobs for a large folder it
| wanted to create as a response used too much memory) and found it
| through this gist [0] which was a statement about Flock [1]
| shutting down.
|
| It looks like a cool and not that complex protocol, but all the
| implementations seem to be unmaintained.
|
| And the official javascript client [2] seems to be ironically be
| used mostly to access Google Drive or DropBox
|
| Remotestorage also has an internet draft
| https://datatracker.ietf.org/doc/draft-dejong-remotestorage/
| which is relatively easy to understand and not very long.
|
| [0] https://gist.github.com/rhodey/873ae9d527d8d2a38213
|
| [1] https://github.com/signalapp/Flock
|
| [2] https://github.com/remotestorage/remotestorage.js
| holtwick wrote:
| Thanks for sharing, this is super interesting. Although it
| doesn't seem to be super active these days. Probably because it
| is difficult to commercialize local first. That might be why we
| need a widely adopted and super flexible standard to become
| attractive to hosters of such services.
| gwbas1c wrote:
| I was desktop client lead for Syncplicity (major Dropbox
| competitor) for almost a decade.
|
| Some thoughts:
|
| Most important: _Git (the protocol) kinda-sorta does this
| already._ I personally don 't know if git is CRDT; but the .git
| folder is a database that contains the entire history of the
| repos. The "git" command can be used to sync a repository among
| different computers. You don't need to use Github or set up a git
| server. (But everyone does it that way because it's much, much
| easier.)
|
| Secondly: _The proposal made assumes that everyone will rewrite
| their software to be CRDT-based and fit into this schema._
| Writing software this way is _hard,_ and then we need to convince
| the general public to adopt versions of software that are based
| around this system. _Could you port LibreOffice to writing out
| documents this way?_
|
| Thirdly: _Resolving conflicts when computers are disconnected
| from the network is always a mess for end users._ (This was a
| huge pain point for Syncplicity 's users, and is a huge pain
| point in other "sync" products.) CRDTs don't "solve" the problem
| for users: The best you can hope for is something like git where
| it identifies merge conflicts, and then the user has to go in and
| clean them up.
|
| For example: Two computers are disconnected from the network.
| Edits are made to the same sentence in a document while
| disconnected. What happens? The CRDT approach might result in
| "consistent data," but it cannot "read minds" and know what is
| the correct result.
|
| ----
|
| IMO, some things I would consider:
|
| Try to somehow replicate the desired workflows with git or a
| similar tool. (Remember Mercurial?) See what kind of problems pop
| up.
|
| Consider writing a "git aware" word processor that is based
| around Markdown; and is somewhat aware of how git behaves around
| conflicts.
|
| Try "bolting on" a sync protocol (or git) to LibreOffice as a way
| to understand just how easy / hard a project like this is.
|
| Consider encapsulating the entire database in a SQLite file,
| instead of using the filesystem.
| vlovich123 wrote:
| Git is obviously not a CRDT because merge conflicts have to be
| manually resolved whereas CRDTs require this to happen
| automatically. A less interesting subset of Git can be viewed
| as CRDT (e.g. mirrors).
| gwbas1c wrote:
| Your users will still need to manually merge in a CRDT
| application; _because computers can not read minds._
|
| The best a CRDT application can do is spit out garbage when
| changes conflict.
|
| IE, start with: The quick brown fox jumped over the fence.
|
| We both disconnect. At the same exact time...
|
| I change it to: The quick brown fox ran around the fence.
|
| You change it to: The quick brown fox dug under the fence.
|
| The best you can do is make a data structure that is
| consistent and identifies the conflict. CRDT can't "read our
| minds" and decide between "ran around" or "dug under".
| Groxx wrote:
| It can't decide _human intent_ flawlessly, but the point of
| a CRDT is that it _does_ choose one, and all others choose
| the same one regardless of how they got there.
|
| Git does not do this, so it is not a CRDT. The content-
| addressable-database portion of git sorta fits this though
| (as does any other content-addressable system).
| kaba0 wrote:
| This is basically git automatically doing "accept theirs"
| or "yours" for any fork. You can see that it will not
| generally be what you want, so whether such a strategy
| could work is domain-dependent.
| vlovich123 wrote:
| No that's not accurate. If I merge branch X and then
| branch Y and someone else merges branch Y and then branch
| X, with CRDT the result should also be the same whereas
| with git it won't be if you're strategy is always "accept
| theirs" or "accept yours". CRDT is also order invariant -
| it doesn't matter which ordering of edit operations you
| accept, the end result is consistent across all nodes.
|
| You may want to read up on the Wikipedia page rather than
| taking 1 thing I said and extrapolating.
| https://en.wikipedia.org/wiki/Conflict-
| free_replicated_data_...
| threatofrain wrote:
| Whether or not the user is manually involved at some point
| is a product decision. I think the trend of consumer
| companies is not to do that. Possibly damaging user data is
| simply a tradeoff in this way of thinking.
| vlovich123 wrote:
| No, that's literally the definition of CRDT. Requirement
| 2 out of the 3 listed on Wikipedia:
|
| > An algorithm (itself part of the data type)
| automatically resolves any inconsistencies that might
| occur.
|
| So no, human resolution vs automatic is not a product
| decision but a key definitional requirement to be a CRDT.
|
| https://en.wikipedia.org/wiki/Conflict-
| free_replicated_data_...
___________________________________________________________________
(page generated 2024-06-25 23:01 UTC)