[HN Gopher] Paxos
___________________________________________________________________
Paxos
Author : joeyespo
Score : 83 points
Date : 2022-01-06 19:14 UTC (3 hours ago)
(HTM) web link (martinfowler.com)
(TXT) w3m dump (martinfowler.com)
| tombert wrote:
| I have a bit of a man-crush on Leslie Lamport (and have
| (unsuccessfully) tried to get TLA+ to be used at every job I've
| had for the last five years), and I have read through both the
| original Paxos paper in addition to the Paxos Made Simple paper,
| and I'm just going to say that I still found it about five times
| harder to grok than Raft. Honestly the TLA+ specification of
| Paxos made it click more than the papers for me, but that could
| just be because of the repeated readings before.
|
| It's a beautiful algorithm, but Raft has the advantage of being a
| lot simpler.
| uvdn7 wrote:
| I am also a huge fan of Lamport. Paxos is so concise and of
| such beauty that once you understand it, all other consensus
| protocols seem just different ways of phrasing Paxos with
| additional features (e.g. membership changes, make it into a
| log, etc.). Paxos is the smallest, most elegant consensus
| building block that I have seen so far. Raft's protocol covers
| a much bigger area than just consensus (it gives you a log,
| supports reconfiguration, etc.). Paxos' entire protocol fits a
| single slide. Period.
|
| I wrote a few posts about how to understand paxos, and this is
| my personal favorite - https://blog.the-pans.com/understanding-
| paxos/.
|
| Lamport did comment on it (via email) saying that understanding
| Paxos is less important than proving it works (via TLA+).
| MatteoFrigo wrote:
| Your view that Paxos is an algorithm for read-modify-write is
| a really deep insight. Once you adopt this point of view,
| things like Raft are just RMW updates of a log. (I like to
| think of Paxos as the fault-tolerant version of the load-
| linked/store-conditional pair of instructions.)
|
| However, there is a difficulty in expressing exactly what
| RMW-Paxos does, because it may end up "modifying" a value
| that was never "committed".
|
| For example, assume that I have three acceptors storing a
| replicated counter. The proposer executes phase-1, increments
| the value received from the acceptor with the highest ballot,
| and sends the value to all acceptors in phase-2. Consider now
| an execution where all acceptors initially hold 0. The
| proposer manages to store a new value 1 in one acceptor, and
| then crashes. Thus, 1 was not committed and a reader may
| conclude that the consensus value is still 0. Now a new
| proposer arrives, receives (1, 0, 0) from the three
| acceptors, chooses 1 as the newest value, and writes (2, 2,
| 2) to all acceptors. Thus "2" is now committed, and the
| commit of the "2" retroactively commits the "1". Thus, the
| condition for being committed is no longer "there exists a
| phase-2 quorum ...", but it is more complicated and depends
| on future history.
|
| I have verified with TLA+ that this algorithm does indeed
| implement an atomic read-modify-write operation, but only
| under a complicated notion of being "committed" that boils
| down to either having a phase-2 quorum, or one of the
| successor operations having a phase-2 quorum.
|
| Did you ever figure out a simpler way to say this?
| iambvk wrote:
| Instead of Read-Modify-Write, I like to think of it as
| Read/Determine/Write cycle.
|
| A note worthy point is that, Paxos actually expects
| proposers to _drop_ their value and _carry-forward_
| someone-else 's value to complete the consensus. This is
| typically not an intended behavior from client's point of
| view, but proposers role is to form the consensus, not push
| their value.
|
| Read phase is basically understanding if there is any value
| that could've been chosen when F nodes are unavailable.
|
| Determine phase is to choose to carry-forward any value
| (if-any) from the Read phase or use the new value from the
| client.
|
| Write phase is to actually ask majority of the acceptors to
| save the value selected in the Determine phase.
| MatteoFrigo wrote:
| You are totally correct that in Paxos the proposer is
| supposed to "carry-forward" the value proposed by
| somebody else. In this view, which is how everybody
| understands Paxos, Paxos is an algorithm for consensus.
|
| However, the proposer can also _modify_ the value
| proposed by somebody else, and it turns out that this is
| a perfectly valid algorithm for implementing a
| distributed read-modify-write. This variant does more
| than just consensus---it basically behaves like a fault-
| tolerant memory with an atomic update operation (think
| compare-and-swap or load-linked /store-conditional). In
| his blog post, GP also mentions that paxos is a read-
| modify-write transaction, but perhaps I read too much
| into what he is saying. Either way, this RMW variant is
| correct, I have verified it exhaustively with TLA+, and
| it is used in a few production systems that I have
| implemented.
|
| The difficulty is now to say exactly what this RMW
| variant does. I was hoping that GP (or anybody else) may
| have some insights.
| tombert wrote:
| > Lamport did comment on it (via email) saying that
| understanding Paxos is less important than proving it works
| (via TLA+).
|
| I suppose that's not wrong; if I know for a fact that
| something is going to behave how I expect it too, I don't
| generally care how it works. I can treat the algorithm as a
| black box and just go from there.
|
| That said, I dont' completely agree. If I am implementing
| Paxos in NewLangOfTheMonth, it's likely that a straight one-
| to-one port of another version will not fully work due to
| tiny differences in the language semantics. If I don't
| understand Paxos, then it's unlikely I'll be able to correct
| any mistakes in the implementation.
| iambvk wrote:
| I totally agree. Paxos is such a simple solution that it is
| so amazing.
| loquor wrote:
| > I have a bit of a man-crush on Leslie Lamport
|
| +1 - I really admire him too. I wonder, what makes people like
| him and Edsger Dijkstra so charismatic in this sense?
| User23 wrote:
| Dijkstra was and Lamport is very good at writing clearly and
| distinctly. Many people with that level of raw brainpower
| don't make that effort, so the ones that do are easy to
| admire and be grateful to.
| tombert wrote:
| I think a lot of it has to do with the fact that he's a very
| prolific writer and speaker, in addition to contributing to a
| fairly diverse set of Computer Science fields.
|
| I usually enjoy Lamport's speeches, and I have said before
| that his book Specifying Systems should be mandatory reading
| for nearly anyone interested in CS, due to how approachable
| it is, and yet how deep it goes.
| solidangle wrote:
| You might be interested in this paper [1] which presents Paxos
| in the same way as Raft is presented in the Raft paper. In my
| view it really illustrates that most of the simplicity of Raft
| comes from its excellent presentation and not necessarily from
| its algorithmic simplicity.
|
| [1]https://arxiv.org/pdf/2004.05074
| kkwteh wrote:
| Paxos has a reputation of being hard to understand. I was under
| the impression that Raft was now preferred for distributed
| consensus. The creators of the Raft algorithm explicitly had
| understandability as a goal when designing it and called out
| Paxos as being hard to understand.
|
| https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf
| solidangle wrote:
| (Multi-)Paxos and Raft are incredibly similar. I would even
| argue that Raft is just a variant of Paxos in the same way that
| e.g. Fast Paxos and EPaxos are variants of Paxos. Most of the
| Raft algorithm is identical to the Paxos algorithm. The main
| difference between the two algorithms is how leadership
| election works. In my view the main reason why Raft is
| considered to be simpler is because it was presented in a
| really clear paper. Paxos becomes equally simple when using a
| similar representation as used in that paper. See this paper:
| https://arxiv.org/pdf/2004.05074
| daveevad wrote:
| My grey matter is failing me but does anyone remember a rant
| about 5-10 years ago about developing software in this era that
| had some punch line about consensus algorithms that went a little
| like, "so, there's this guy named Diego?"
|
| edit: found it
|
| https://circleci.com/blog/its-the-future/
| lowbloodsugar wrote:
| >Why don't I just use Google's thing?
|
| >-You think that's going to be around in 6 months?
|
| bahaha.
| jldugger wrote:
| > Unmesh Joshi
|
| It feels kinda strange to see an author other than Martin Fowler
| blogging on martinfowler.com....
| DerpyBaby123 wrote:
| Maybe so, although he has been featuring other authors on his
| blog regularly now. He works with them to keep the quality
| high, check them out.
| Jtsummers wrote:
| This came up with another martinfowler.com post, and I checked
| the past couple years. Roughly half the content is Fowler's,
| and the other half are other writers (from memory, not
| recounting it now). I don't know the ratio going back prior
| years (only checked 2021 and 2020) but I've seen lots of posts
| by others on his site for the past few years.
|
| https://martinfowler.com/articles/patterns-of-distributed-sy...
| - other content by the same author under title _Patterns of
| Distributed Systems_.
| throwaway984393 wrote:
| I think a flow chart would have been easier to understand, or
| perhaps a video or .gif
| uvdn7 wrote:
| No, it doesn't. A video might help you _feel_ like you
| understand it but it doesn't really help you understand it.
| That's just not how distributed systems work.
| spicybright wrote:
| Now that's a little silly to say, some people grok more from
| video, some more from articles with pictures.
| rdtsc wrote:
| Is there a mistake on the second diagram where prepare(1,e) is
| sent to Byzantium? Wonder if it should be prepare(1,a), since
| Athens hasn't learned the (1,e) value which only Ephesus and
| Delphi know about at that point?
|
| Otherwise I think it's a pretty good description of Paxos. I like
| that it walks through a good number of failure scenarios and
| mentions interesting corner cases such as a lack of an explicit
| commit in the original papers and that reading also needs to run
| the full Paxos algorithm.
| Jtsummers wrote:
| It does seem to be a typo, based on the table and text below
| it.
| arsh wrote:
| There are huge gaps between the scientific paper and how you
| actually operate and scale Paxos. This can be seen in Paxos Made
| Live - An Engineering Perspective by Tushar Chandra et.al.
|
| Similarly, it also has real drawbacks like being slow, or lacking
| liveness guarantees.
|
| See other alternatives in https://aws.amazon.com/builders-
| library/leader-election-in-d...
___________________________________________________________________
(page generated 2022-01-06 23:00 UTC)