[HN Gopher] Paxos vs. Raft: Have we reached consensus on distributed consensus?
___________________________________________________________________
Paxos vs. Raft: Have we reached consensus on distributed consensus?
Author : yagizdegirmenci
Score : 113 points
Date : 2021-07-14 10:39 UTC (2 days ago)
(HTM) web link (charap.co)
(TXT) w3m dump (charap.co)
| aneutron wrote:
| It's not possible to do so, since it would be done
| asynchronously.
|
| I'm sorry, I'll see myself out.
| erik_seaberg wrote:
| Obviously we need a third distributed consensus algorithm as a
| tiebreaker.
| _jal wrote:
| That really should be generalized to support N algorithms.
| Some sort of quorum protocol, maybe.
| josh2600 wrote:
| Ok, so having worked on distributed consensus a bunch, here are
| a couple of thoughts in no particular order:
|
| * In the real world, servers misbehave, like, a lot, so you have
| to deal with that fact. All assumptions about network robustness
| in particular will be proven wrong on a long enough timeline.
|
| * Leader election in a world without a robust network is an
| exercise in managing acceptable failure tolerances. Every
| application has a notion of acceptable failure rates or
| acceptable downtime. Discovering that is a non-trivial effort.
|
| Jeff Dean and others at Google famously came to the conclusion
| that it was ok for some parts of the Gmail service to be down for
| limited periods of time. Accepting that all self-healing/robust
| systems will eventually degrade and have to be restored is the
| first step in building something manageable. The Ericsson
| AXD301 (the Erlang-based ATM switch) is the most robust system
| ever built by humans to my knowledge (I think it did 20 years
| of uptime in production). Most other systems will
| fail long before that. Managing systems as they fail is an art,
| particularly as all systems operate in a degraded state.
|
| In short, in a lab environment networks function really well. In
| the real world, it's a jungle. Plan accordingly.
| goodpoint wrote:
| Most importantly: leader election should be used only as a
| last resort.
|
| Leaderless systems with eventual consistency are often possible
| and more robust.
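|
| For example, a grow-only counter CRDT converges without any
| leader. A toy sketch in Python (illustrative only, not from
| any particular system): each replica increments only its own
| slot, and merging takes element-wise maxima, so replicas
| converge no matter the order in which they gossip:
|
|     # Grow-only counter (G-Counter), a state-based CRDT.
|     class GCounter:
|         def __init__(self, node_id):
|             self.node_id = node_id
|             self.counts = {}  # node_id -> count
|
|         def increment(self, n=1):
|             me = self.node_id
|             self.counts[me] = self.counts.get(me, 0) + n
|
|         def value(self):
|             return sum(self.counts.values())
|
|         def merge(self, other):
|             # Commutative, associative, idempotent: safe to
|             # apply in any order, any number of times.
|             for node, count in other.counts.items():
|                 self.counts[node] = max(
|                     self.counts.get(node, 0), count)
|
|     a, b = GCounter("a"), GCounter("b")
|     a.increment(); b.increment(); b.increment()
|     a.merge(b); b.merge(a)
|     assert a.value() == b.value() == 3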
| brickbrd wrote:
| In practice, for the systems where I built a replication system
| from the ground up, once you factor in all the performance,
| scale, storage layer and networking implications, this Paxos vs.
| Raft thing is largely a theoretical discussion.
|
| Basic Paxos is, well, too basic, and people mostly run
| modifications of it to get higher throughput and better
| latency. After those modifications, it does not look very
| different from Raft with its own modifications applied for
| storage integration and so on.
| arielweisberg wrote:
| This is exactly my take as well. Multi-Paxos and Raft seem
| very similar to me. Calling out the exact differences and
| tradeoffs would be good blog/research fodder.
| ignoramous wrote:
| > _Basic Paxos is, well, too basic, and people mostly run
| modifications of it to get higher throughput and better
| latency. After those modifications, it does not look very
| different from Raft._
|
| Allan Vermeulen, one of the founding AWS engineers, calls
| inventing new solutions to distributed consensus an exercise
| in _re-discovering_ Paxos.
|
| https://youtu.be/QVvFVwyElLY?t=2367
| tschellenbach wrote:
| Stream's consensus algorithms are all Raft-based; the Go Raft
| ecosystem is very solid. We did end up forking some of the
| libraries, but nothing major.
| lowbloodsugar wrote:
| Problem: "Raft protocol described and analyzed in English has
| problems." Solution: "Here is a modification to the protocol,
| described in English, that does not have such a problem."
|
| Seems like there is a common failure mode of "describing
| distributed protocols in plain English and thinking this is a
| proof"?
|
| The actual problem here is not "There was a problem in the
| Raft protocol, and I figured it out and provided a
| workaround". The actual problem here is "Reasonably
| experienced software engineers reviewed the specification and
| didn't see any problems." That problem has not been addressed
| by the article.
| adsharma wrote:
| The author seems to be using https://github.com/ailidani/paxi
| for the actual implementation and proof.
|
| I'm more of a Python/Rust guy. There have been some attempts
| to build model checkers in Rust:
| https://github.com/stateright/stateright
|
| The issue is that Rust is a very large language, and it's hard
| to get it right.
|
| I have a Python implementation of Raft here:
|
| https://github.com/adsharma/raft/tree/master/raft/states
|
| It's small enough to be self-contained, and could perhaps one
| day be run through a model checker and transpiled to many
| statically typed languages.
|
| The issue with TLA+ proofs such as:
|
| https://github.com/fpaxos/raft.tla
|
| is that it's hard to tell if a particular C++ or Rust
| implementation conforms to the spec.
|
| So how do we check and transpile?
|
| * https://www.philipzucker.com/Modelling_TLA_in_z3py/
| * https://github.com/adsharma/py2many
| hwayne wrote:
| > is that it's hard to tell if a particular C++ or Rust
| implementation conforms to the spec.
|
| You can use either TLC's graph output or simulation mode to
| generate lots of TLA+ traces. Then you write an orchestrator
| that reads each trace as JSON, runs the same actions step by
| step on the program, and confirms they have the same final
| states. You can go the other way, too, where you take program
| traces and map them onto the TLA+ state space.
|
| It's a bit tricky to set up, but it's possible. I've had
| several clients who've built something similar.
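|
| Roughly like this (a hypothetical sketch: the trace format
| and the apply_action/get_state hooks are invented names, and
| TLC's output would need to be massaged into this shape first):
|
|     import json
|
|     def check_trace(trace_path, system):
|         # trace: a list of {"action": ..., "state": ...} steps
|         with open(trace_path) as f:
|             trace = json.load(f)
|         for i, step in enumerate(trace):
|             # Drive the implementation with the spec's action,
|             system.apply_action(step["action"])
|             # then compare states. In practice you first map
|             # the implementation state onto the spec variables.
|             if system.get_state() != step["state"]:
|                 raise AssertionError(f"divergence at step {i}")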
| im_down_w_otp wrote:
| This is, incidentally, one of the kinds of use cases that
| informed part of our product functionality.
|
| 1. A way to write down fairly complicated system/behavior-
| level specs for distributed systems.
|
| 2. A way to instrument implementations to get the necessary
| signal out of the system to verify it nominally.
|
| 3. A way to automatically explore adverse conditions to
| validate that the system keeps behaving as needed and
| specifically isolate contributing factors for when it
| doesn't.
|
| The very first prototype was leveraged (in part, as there
| was much more to the system under test) in a consensus
| mechanism that was central to the correct, stable, and most
| importantly... safe... operation of some critical software
| infrastructure. It was necessary to try to explore where
| the model assumptions broke down in practice in a real
| system.
| adsharma wrote:
| https://www.microsoft.com/en-us/research/blog/p-programming-...
| makes the case better than I can for why it's important to
| have the specification and implementation come from the same
| source, as opposed to maintaining two and then verifying via
| traces that they're compatible.
|
| But it's definitely worth doing (as Jepsen does) to find
| problems in the implementation.
| hwayne wrote:
| Yeah, if you can do it economically, having a single
| source is preferable.
| toolslive wrote:
| You can abstract the IO and asynchrony away from the
| implementation and model it as a pure function, then build and
| inspect the state space. A detailed description is here:
|
| https://incubaid.wordpress.com/2012/10/25/caulking-your-dist...
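|
| In miniature, the recipe looks like this (a toy sketch; the
| state, events, and invariant are invented for illustration).
| Keep the protocol as a pure step(state, event) -> state
| function, then enumerate every reachable state and check an
| invariant on each:
|
|     from collections import deque
|
|     def step(state, event):
|         # Pure transition function: no IO, clocks, or threads.
|         # Toy protocol: a counter that wraps at 4.
|         if event == "inc":
|             return (state + 1) % 4
|         if event == "reset":
|             return 0
|         return state
|
|     EVENTS = ["inc", "reset"]
|
|     def explore(initial, invariant, max_states=10_000):
|         seen, frontier = {initial}, deque([initial])
|         while frontier and len(seen) < max_states:
|             s = frontier.popleft()
|             assert invariant(s), f"invariant violated: {s!r}"
|             for e in EVENTS:
|                 t = step(s, e)
|                 if t not in seen:
|                     seen.add(t)
|                     frontier.append(t)
|         return seen
|
|     states = explore(0, lambda s: s < 4)
|     print(f"{len(states)} reachable states, invariant holds")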
| littlestymaar wrote:
| > The issue is that Rust is a very large language, and it's
| hard to get it right.
|
| In my experience, Rust's type system (especially the _Send_
| and _Sync_ traits, which prevent data races) makes it easier -
| not harder - to get big things working.
| adsharma wrote:
| My intention is to reuse the Rust type system where we can,
| while keeping the language small enough to implement and
| verify non-trivial real-world programs.
|
| https://github.com/adsharma/py2many/blob/main/doc/langspec.m...
| https://github.com/adsharma/py2many/blob/main/CHANGELOG.md#r...
| hutrdvnj wrote:
| Can someone provide a short description of the differences
| between Paxos and Raft?
| polynomial wrote:
| Paxos is a family of algorithms aimed at distributed
| consistency / monotonic state guarantees. Paxos allows servers
| with out-of-order logs to be elected leader (provided they
| then reorder their logs), whereas Raft requires a server's log
| to be up-to-date before it can be elected leader. Moreover,
| Raft has a reputation for being more understandable than
| Paxos.
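|
| Concretely, Raft's election restriction (section 5.4.1 of the
| Raft paper) is that up-to-date check. A small sketch in Python
| (the function and parameter names are illustrative):
|
|     def candidate_log_ok(cand_last_term, cand_last_index,
|                          my_last_term, my_last_index):
|         # A voter grants its vote only if the candidate's log
|         # is at least as up-to-date as its own: compare the
|         # terms of the last entries, then the log lengths.
|         if cand_last_term != my_last_term:
|             return cand_last_term > my_last_term
|         return cand_last_index >= my_last_index
|
| Paxos-style protocols skip this check; a newly elected leader
| instead learns about and fills in missing entries during its
| first (prepare/promise) phase.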
|
| edit: It looks like the linked paper covers the main
| differences, albeit in more detail. Also, it seems as if the
| author rejects the idea that Raft is more understandable and
| makes a case for why she thinks Paxos is more understandable.
| anonymousDan wrote:
| Personally I find Paxos more understandable. For example, KTH
| have a really nice incremental development of Multi-Paxos
| called Sequence Paxos: https://arxiv.org/abs/2008.13456
| ignoramous wrote:
| Raft is very similar to Multi-Paxos. These Raft and Paxos user
| studies comparing the "understandability" of the two protocols
| might be helpful: https://youtu.be/JEpsBg0AO6o (Paxos),
| https://youtu.be/YbZ3zDzDnrw (Raft).
|
| See also a comparison of various "Paxos" implementations:
| https://muratbuffalo.blogspot.com/2016/04/consensus-in-cloud...
| jiryu wrote:
| I gave the presentation discussed in the linked article;
| here's a written version of it:
| https://emptysqua.re/blog/paxos-vs-raft/
|
| I hope that the "Paxos vs Raft" debate can die down, now that
| engineers are learning TLA+ and distributed systems more
| thoroughly. These days we can design new protocols and prove
| their correctness, instead of always relying on academics. For
| example, at MongoDB we considered adopting the reconfiguration
| protocol from Raft, but instead we designed our own and checked
| it with TLA+. See "Design and Verification of a Logless Dynamic
| Reconfiguration Protocol in MongoDB Replication" for details:
| https://arxiv.org/pdf/2102.11960.pdf
| butterisgood wrote:
| VR for the win... (no, not that VR):
| http://pmg.csail.mit.edu/papers/vr-revisited.pdf
|
| OK, maybe not for the win, but it's worth a look. I'm actually
| fairly certain one of the Paxos implementations I've worked
| with and used is really more of a VR (Viewstamped Replication)
| take on Paxos anyway.
| benlivengood wrote:
| Since the article mentions Google as the outlier preferring
| Paxos, I may be able to shed some light from a few years ago.
|
| The Paxos, paxosdb, and related libraries (despite the name,
| all are Multi-Paxos) are solid and integrated directly into a
| number of products (Borg, Chubby, CFS, Spanner, etc.). There
| are years of engineering effort and unit tests behind the core
| Paxos library, so it makes sense to keep using and improving
| it instead of going off to Raft. As far as I am aware, the
| Google Paxos implementation predates Raft by quite a while.
|
| I think, in general, if most other people use Raft, it's
| better for the community to have a single, stable, well-tested
| shared implementation, for much the same reason it's good for
| Google to stick with Paxos.
| tyingq wrote:
| This makes sense to me. Very few of us have the resources to
| maintain, for example, the kind of globally synced (way beyond
| typical NTP) clock infrastructure that Google has
| (TrueTime[1]).
|
| [1] https://cloud.google.com/spanner/docs/true-time-external-con...
| gravypod wrote:
| If you want to know just how hard this can be with bare metal
| that you control, take a look at this:
| https://signalsandthreads.com/clock-synchronization/
| jeffbee wrote:
| On the other hand, if you run in Google Compute Engine you
| get accurate clock discipline for free.
| littlestymaar wrote:
| Heidi Howard, the first author of this paper, made two videos
| about it:
|
| - A short 10-minute intro: https://www.youtube.com/watch?v=JQss0uQUc6o
|
| - A more in-depth one: https://www.youtube.com/watch?v=0K6kt39wyH0
___________________________________________________________________
(page generated 2021-07-16 23:00 UTC)