https://brooker.co.za/blog/2021/11/16/paxos.html
Marc's Blog
About Me
My name is Marc Brooker. I've been writing code, reading code, and
living vicariously through computers for as long as I can remember. I
like to build things that work. I also dabble in brewing, cooking and
skiing.
I'm currently an engineer at Amazon Web Services (AWS) in Seattle,
where I lead engineering on AWS Lambda and our other serverless
products. Before that, I worked on EC2 and EBS.
All opinions are my own.
Links
My Publications and Videos
@MarcJBrooker on Twitter
The Bug in Paxos Made Simple
There's not really a bug in Paxos, but clickbait is fun.
Over the last few weeks, I've been picking up the excellent P
programming language, a language for modelling and specifying
distributed systems. One of the first things I did in P was implement
Paxos - an algorithm I know well, has a lot of subtle failure modes,
and is easy to get wrong. Perfect for practicing specification. To
test out P's model checker, I intentionally implemented a subtly
buggy version of Paxos, following the description in Paxos Made
Simple. The model checker found, as expected, implemented the way I
read Paxos Made Simple, that Paxos is broken.
I mentioned this to a colleague who said they had never heard of this
bug. I think it deserves to be more well known, so I thought I'd
write a bit about it.
The problem lies not in the Paxos algorithm itself, but in the
description in the paper. Michael Deardeuff pointed out this bug to
me, and also wrote it up in what may be the best Stack Overflow
exchange of all time^1 (or, at least, the one with the best
value-to-upvotes ratio). In the Stack Overflow question, user lambda
describes the following sequence of events:
Consider that there are totally 3 acceptors ABC. We will use X
(n:v,m) to denote the status of acceptor X: proposal n:v is the
largest numbered proposal accepted by X where n is the proposal
number and v is the value of the proposal, and m is the number of
the largest numbered prepare request that X has ever responded.
The following can play out:
1. P1 sends 'prepare 1' to AB
2. Both AB respond P1 with a promise to not to accept any request numbered smaller than 1.\
Now the status is: A(-:-,1) B(-:-,1) C(-:-,-)
3. P1 receives the responses, then gets stuck and runs very slowly
4. P2 sends 'prepare 100' to AB
5. Both AB respond P2 with a promise to not to accept any request numbered smaller than 100.
Now the status is: A(-:-,100) B(-:-,100) C(-:-,-)
6. P2 receives the responses, chooses a value b and sends 'accept 100:b' to BC
7. BC receive and accept the accept request, the status is: A(-:-,100) B(100:b,100) C(100:b,-).
Note that proposal 100:b has been chosen.
8. P1 resumes, chooses value a and sends 'accept 1:a' to BC
9. B doesn't accept it, but C accepts it because C has never promise anything.
Status is: A(-:-,100) B(100:b,100) C(1:a,-).
This seems to be a major problem, because now the system could forget
the decided value and decide on another one, violating the most basic
safety property of Paxos. As Micheal points out in his answer, it
turns out that this happens because of two ambiguities in the text of
Paxos Made Simple. First, on the selection of acceptors for the
accept (second) phase (from Paxos Made Simple):
If the proposer receives a response to its prepare requests
(numbered n) from a majority of acceptors, then it sends an
accept request to each of those acceptors for a proposal numbered
n with a value v, where v is the value of the highest-numbered
proposal among the responses, or is any value if the responses
reported no proposals.
If you follow the letter of this statement, and send the accept
messages to the acceptors who responded to your first phase messages,
then the problem can't happen. Unfortunately, this also makes the
algorithm somewhat less robust in practice. Fortunately, there's
another possible fix. Again, from Michael's answer:
By accepting a value the node is also promising to not accept
earlier values.
Lamport doesn't say this in Paxos Made Simple. Instead, he says:
If an acceptor receives an accept request for a proposal numbered
n, it accepts the proposal unless it has already responded to a
prepare request having a number greater than n.
So if you don't quite follow the letter of the text about acceptor
selection, and then do follow the text about how acceptors handle
accept messages, then you end up with the bug described in the Stack
Overflow question. That seems like a narrow case, but I'll admit that
I've implemented Paxos incorrectly in this way multiple times. It's a
very easy mistake to make.
Leslie Lamport is one of my technical writing heroes. I re-read some
of his papers, like What Good is Temporal Logic? from time to time
just because I like the way they are written. Pointing out this
ambiguity isn't criticizing his writing, but rather reminding you
about how hard it is to write crisp descriptions of even relatively
simple distributed protocols in text. As Lamport himself says:
Prose is not the way to precisely describe algorithms.
That's a big part of why I like languages like P and TLA+ so much.
Not only are they great ways to specify, check, and model algorithms,
but they are also great ways to communicate them. If you work with
distributed algorithms, I strongly advise picking up one of these
languages.
Footnotes
1. I was reminded about this excellent exchange by Mahesh
Balakrishnan's recent post Paxos made Abstract., which is well
worth reading to give a different perspective on how Paxos works,
and one way to think about it from a systems perspective.
2. There are a lot more correct variants of jury selection for
different phases that meet different design goals, as Heidi
Howard's work clearly points out.
Other Posts
<< Back to the blog index
* 20 Oct 2021 >> Serial, Parallel, and Quorum Latencies
* 27 Aug 2021 >> Caches, Modes, and Unstable Systems
* 11 Aug 2021 >> My Proposal for Arecibo: Drones
Marc Brooker
The opinions on this site are my own. They do not necessarily
represent those of my employer.
marcbrooker@gmail.com
[feed-icon-] RSS [feed-icon-] Atom
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0
International License.