[HN Gopher] Raft: Understandable Distributed Consensus (2014)
___________________________________________________________________
Raft: Understandable Distributed Consensus (2014)
Author : Hrun0
Score : 218 points
Date : 2024-09-27 12:54 UTC (1 days ago)
(HTM) web link (thesecretlivesofdata.com)
(TXT) w3m dump (thesecretlivesofdata.com)
| ko_pivot wrote:
| The writing and the visualizations are great. The 'continue'
| button is way too frequent.
| MarkMarine wrote:
| This is one of my favorite pieces of software engineering because
| it took something difficult and tried to design something easy to
| understand as a main criterion for success. The PhD thesis has a
| lot more info about this if anyone is curious; it is approachable
| and easy to read:
|
| https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pd...
|
| I think this was core to Raft's success, and I strive to create
| systems like this with understandability as a first goal.
| throwawaymaths wrote:
| Weirdly it's also kinda a case of worse-is-better: raft is
| non-deterministic and has an unbounded election cycle time.
| IIRC:
|
| - it assumes no hysteresis in network latencies, and if there is
| hysteresis it's possible that elections can be deterministically
| infinite.
|
| - this fact and the use of raft in production has caused real,
| large scale network outages.
|
| Paxos is of course a beast and hard to understand. There is an
| alternative, VSR (which was developed around the same time as
| Paxos), which is easy to understand and does not have the issues
| caused by election nondeterminism in raft.
|
| Of course everyone uses raft so raft dominates.
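|
| (That nondeterminism comes from the randomized election timeouts;
| roughly, in Go, with made-up names and the paper's suggested
| 150-300ms range:)
|
|     package main
|
|     import (
|         "math/rand"
|         "time"
|     )
|
|     // Hypothetical follower with just enough state for the sketch.
|     type node struct {
|         heartbeatCh chan struct{} // signalled on AppendEntries from a leader
|     }
|
|     // Each cycle picks a fresh random timeout in the 150-300ms range.
|     func electionTimeout() time.Duration {
|         return 150*time.Millisecond +
|             time.Duration(rand.Int63n(int64(150*time.Millisecond)))
|     }
|
|     // If no heartbeat arrives before the timer fires, the follower
|     // stands for election. Two nodes timing out together can split
|     // the vote and force another randomized round, so there is no
|     // hard upper bound on how long an election takes.
|     func (n *node) runFollower() {
|         for {
|             timer := time.NewTimer(electionTimeout())
|             select {
|             case <-n.heartbeatCh:
|                 timer.Stop()
|             case <-timer.C:
|                 return // becomeCandidate() in a real implementation
|             }
|         }
|     }
|
|     func main() {}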
| eatonphil wrote:
| > - this fact and the use of raft in production has caused
| real, large scale network outages.
|
| While this has surely happened, I am not so confident about
| what the reasons were for this. If you've got links on
| details I'd love to read.
|
| > which is easy to understand
|
| I've implemented core bits of Raft twice now and have looked
| at VSR a couple of times and VSR wasn't easier for me to
| understand. I'm sure I could implement VSR and would like to
| some day, but just comparing the papers alone I personally
| felt like Raft was better presented (i.e. easier to
| understand).
|
| Also keep in mind that nobody ships consensus implementations
| exactly in line with the original paper. There are dozens or
| hundreds of papers on variations and extensions of Raft/Paxos
| and every actual implementation is going to implement some
| selection of these extensions/variations. You have to look at
| each implementation carefully to know how it diverges from
| the original paper.
| throwawaymaths wrote:
| > A known limitation of the base Raft protocol is that
| partial/asymmetric network partitions can cause a loss of
| liveness [27, 32]. For instance, if a leader can no longer
| make progress because it cannot receive messages from the
| other nodes, it continues to send AE heartbeats to
| followers, preventing them from timing out and from
| electing a new leader who can make progress.
|
| (Howard, Abram et al)
|
| Me: note this can also occur if there isn't a complete
| outage, if the latency back to the shit leader is different
| from the latency out of the shit leader.
|
| > nobody ships consensus implementations exactly in line
| with the original paper. There are dozens or hundreds of
| papers on variations
|
| As the paper above explains, once you add extensions you
| might have broken the correctness proofs in raft. More to
| the original point, you're now in a state where it's no
| longer "simple"... I would go so far as to say if you
| _have_ to consider the extensions, which are distributed
| over several papers and sometimes not even papers at all,
| you're in "deceptively simple" land.
|
| As a pedagogical tool, raft is valuable because it can be a
| launching ground for conversations like these... But maybe
| we shouldn't use it in prod when there are better,
| straightforward options? I get the feeling that being hard-
| sold as simple nerd-sniped devs into writing it, someone
| r/very smart put it into prod, and with social proof more
| people did, and now here we are.
| lifeinthevoid wrote:
| > A known limitation of the base Raft protocol is that
| partial/asymmetric network partitions can cause a loss of
| liveness [27, 32]. For instance, if a leader can no
| longer make progress because it cannot receive messages
| from the other nodes, it continues to send AE heartbeats
| to followers, preventing them from timing out and from
| electing a new leader who can make progress.
|
| Real-world raft implementations make the leader step down
| if it hasn't heard from a quorum for a while. Not part of
| vanilla raft though.
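|
| Roughly (a minimal Go sketch with made-up names, assuming the
| leader records when each follower last acknowledged an
| AppendEntries):
|
|     package main
|
|     import "time"
|
|     type leader struct {
|         lastAck         map[string]time.Time // follower -> last ack
|         electionTimeout time.Duration
|     }
|
|     // Called periodically. If the leader (counting itself) cannot
|     // reach a majority within an election timeout, it steps down
|     // rather than keep sending heartbeats that suppress an election
|     // it can no longer win.
|     func (l *leader) shouldStepDown(now time.Time) bool {
|         reachable := 1 // the leader itself
|         for _, t := range l.lastAck {
|             if now.Sub(t) < l.electionTimeout {
|                 reachable++
|             }
|         }
|         clusterSize := len(l.lastAck) + 1
|         return reachable <= clusterSize/2 // no majority reachable
|     }
|
|     func main() {}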
| eatonphil wrote:
| The thesis does describe doing this fwiw while the paper
| does not.
| convolvatron wrote:
| don't forget about CALM. It's a shame that that isn't the
| default we reach for, and that we only struggle when we really
| need stronger consistency.
|
| https://arxiv.org/abs/1901.01930
| prydt wrote:
| I think the CALM theorem and this whole line of research are
| so interesting, and it is still carried on by the CRDT
| people. But I would love to see more of this.
|
| I feel like it doesn't get as much attention as it
| deserves.
| kfrzcode wrote:
| I've made a longer comment in another thread; but have you
| investigated the hashgraph algorithm? Gossip-about-gossip and
| virtual voting combine to result in leaderless consensus,
| fair ordering and aBFT. It's very performant with over 10k
| TPS on-chain. I'm learning about DLT from the perspective of
| hashgraph which is why I don't understand why it doesn't get
| love - it seems to have all of the good and none of the bad.
| MarkMarine wrote:
| > this fact and the use of raft in production has caused
| real, large scale network outages.
|
| Paxos as well: I remember a full GCP cloud outage that had
| something to do with Paxos, and I can't find the data on it,
| but I thought there was a nasty bug in ZooKeeper's Paxos
| implementation.
|
| That isn't to say any of these are perfect or bug-free; they're
| made by humans and we're going to make mistakes. But my
| experience implementing both was that I ended up with a working
| raft implementation, while paxos baked my brain until I gave up.
|
| I think everyone uses raft _because_ it was possible for a
| working dev to implement, so there are a number of
| implementations, and it's easier to understand the phases the
| application is in.
|
| I'll check out VSR, I appreciate the rec.
| lanstin wrote:
| My zookeeper outages are always due to simple things like
| all the workers having the same basic image and then having
| increased write rate filling up the disk of all workers at
| the same time.
| rapsey wrote:
| While understandable, implementing it is nevertheless far from easy.
| MarkMarine wrote:
| Right, but Paxos is doubly hard in comparison. I've read both
| papers multiple times, tried to implement and failed, and I
| still don't think I understand Paxos.
| candiddevmike wrote:
| IMO, Paxos has a lot fewer edge cases than Raft, mostly
| because the complexity of the implementation covers them /
| forces you to think about how to handle them.
| hinkley wrote:
| How many people actually understand Paxos?
|
| The assertion at the time was that only a few people
| understood it well enough to make a correct implementation,
| the others were full of bugs.
|
| The problem we have with Lamport is that he's very good at
| talking to computers but not so good at talking to humans.
| I think the world would be a better place today if someone
| had forced him to learn to speak human.
|
| He did a presentation at MS just after he won the Turing
| Award. He prefaces it with how important writing is to
| thinking, and how you don't really know what you think
| until you write it down. Those are the words and thinking
| of an introvert. Writing is still the shallow end of
| understanding. The deep end is teaching. If you understand
| something and you teach it to others, then you have proven
| that you understand it, and caused that understanding not
| to be lost to posterity. If you only write about it, it
| might work as instructional material, or it may require
| some very clever people who can teach themselves using your
| words. But they may also get it wrong, and not have you for
| feedback.
|
| The latter is where we seem to be with Leslie's works. The
| consensus is that few people actually understand what he's
| talking about well enough to implement it correctly.
|
| I had a coworker once who was shocked to learn I read the
| ACM SIGPLAN proceedings. "You can read those??" I knew what
| he meant and yeah, a lot of those were very unapproachable
| and I understood two thirds of them and only half of each
| of the rest. Before I committed to using Raft I gave
| Lamport's paper a try. It was a slog and he doesn't sell
| the why of each part. He's just giving you a very, very
| long recipe without the mental models necessary to
| reproduce it robustly.
| mrkeen wrote:
| I'm on the other side.
|
| I think Leslie Lamport asserted that Paxos is minimal, and
| that "all other consensus algorithms are just Paxos with more
| steps". I'm inclined to believe him.
|
| I've implemented Paxos but I can't get through "Raft for
| dummies" style blog posts.
|
| Regarding Raft [1]:
|
| > The consensus problem is divided into three sub-problems:
| Leader election, Replication and Safety.
|
| What is leader election? It's a distributed system coming to
| consensus on a fact (i.e. who the leader is.) Then once you
| have the leader, you do additional steps. _The entirety_ of
| Paxos is a distributed system coming to consensus on a fact.
|
| When I read these posts, I see things like "timeout",
| "heartbeat", and I think: timeout according to whom? I read
| "once the leader has been elected", um, hang on, according to
| whom? Has node 1 finally agreed on the leader, just while
| node 3 has given up and started another election? I don't
| doubt that Raft is correct, but the writing about it seems
| simple by glossing over details.
|
| Paxos, on the other hand, seems timeless. (And the writing
| about it doesn't trigger my "distributed system fallacies"
| reaction)
|
| [1] https://www.brianstorti.com/raft/
| spmurrayzzz wrote:
| I tend to agree that many explanations of raft don't get
| into the useful details and handwave some of the hard
| problems. But the original paper does do a good job of this
| and is pretty accessible to read, IMO.
|
| > I read "once the leader has been elected", um, hangon,
| according to whom? Has node 1 finally agreed on the leader,
| just while node 3 has given up and started another
| election?
|
| The simple response I think to "according to whom" is "the
| majority of voting nodes". When the leader assumes its
| role, it sends heartbeats which are then accepted by the
| other nodes in the cluster. Even if (in your example) node
| 3 starts a new election, it will only succeed if it can get
| a majority of votes. If node 2 has already acknowledged a
| leader, it won't vote for node 3 in the same term.
|
| There are some implicit concessions inherent there around
| eventual consistency, but I don't think that's novel to Raft
| compared to other distributed consensus protocols.
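|
| For reference, the RequestVote rule from the paper, as a rough
| Go sketch (names made up): at most one vote per term, and only
| for a candidate whose log is at least as up to date.
|
|     package main
|
|     type voteRequest struct {
|         term         int
|         candidateID  string
|         lastLogIndex int
|         lastLogTerm  int
|     }
|
|     type node struct {
|         currentTerm  int
|         votedFor     string // "" = no vote granted in this term
|         lastLogIndex int
|         lastLogTerm  int
|     }
|
|     func (n *node) handleRequestVote(req voteRequest) bool {
|         if req.term < n.currentTerm {
|             return false // stale candidate
|         }
|         if req.term > n.currentTerm {
|             n.currentTerm = req.term // newer term: forget our old vote
|             n.votedFor = ""
|         }
|         if n.votedFor != "" && n.votedFor != req.candidateID {
|             return false // already voted for someone else this term
|         }
|         // Candidate's log must be at least as up to date as ours.
|         upToDate := req.lastLogTerm > n.lastLogTerm ||
|             (req.lastLogTerm == n.lastLogTerm &&
|                 req.lastLogIndex >= n.lastLogIndex)
|         if !upToDate {
|             return false
|         }
|         n.votedFor = req.candidateID
|         return true
|     }
|
|     func main() {}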
| withinboredom wrote:
| > The simple response I think to "according to whom" is
| "the majority of voting nodes".
|
| Reminds me of this one time we had a Raft cluster arguing
| over who was the leader for 20 minutes in production.
| Raft leader election is non-deterministic, while Paxos is
| deterministic. It can 'randomly' get into a situation it
| cannot resolve for quite a long time.
| spmurrayzzz wrote:
| > Reminds me of this one time we had a Raft cluster
| arguing over who was the leader for 20 minutes in
| production
|
| That's certainly an interesting failure mode. Do you
| recall the details around root cause? I could imagine
| ephemeral network partitions (flapping interfaces?
| peering loss?) causing something like this for sure.
|
| In my own experience, I've been running services that use
| Raft under the hood for the last ~10 years in production
| and haven't seen this happen myself. Though I do
| absolutely remember having misconfigured election
| timeouts causing very painful latency issues in failover
| scenarios.
| ivankelly wrote:
| 100% agree. I haven't read the raft paper in years, but I
| remember thinking there's just too much stuff in there. That
| stuff is important, but if you want people to understand what's
| happening, they need to internalize the fundamental idea of
| being able to block other writers by bumping a number, which is
| all covered in the single-decree Paxos section of The Part-Time
| Parliament.
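|
| That "bump a number" idea, as a rough Go sketch (names made up)
| of the acceptor side of the prepare phase: a proposer that
| collects promises for a higher ballot from a majority blocks any
| writer still working with a lower one.
|
|     package main
|
|     type acceptor struct {
|         promisedBallot int    // highest ballot promised so far
|         acceptedBallot int    // ballot of last accepted value (0 = none)
|         acceptedValue  string
|     }
|
|     type promise struct {
|         ok             bool
|         acceptedBallot int
|         acceptedValue  string
|     }
|
|     func (a *acceptor) prepare(ballot int) promise {
|         if ballot <= a.promisedBallot {
|             return promise{ok: false} // a higher number already bumped us
|         }
|         a.promisedBallot = ballot
|         // Report any previously accepted value so the new proposer
|         // must carry it forward instead of proposing its own.
|         return promise{
|             ok:             true,
|             acceptedBallot: a.acceptedBallot,
|             acceptedValue:  a.acceptedValue,
|         }
|     }
|
|     func main() {}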
| kfrzcode wrote:
| Paxos is nice, sure, but Hedera does DLT _with_ aBFT and
| much more efficiently, as well as being faster and ensuring
| fairness. It's leaderless, and achieves incredible TPS
| (10k+ in practice, 100k+ in theory).
|
| I am curious on your thoughts here.
| Vervious wrote:
| Raft has a problem where, in the protocol description,
| sometimes I have no idea why some line is there, but if you
| take it out the protocol comes to a grinding halt... it's
| really a mumble jumble. It has good intuition, but the
| details are really messy, it's edge cases all the way down.
|
| Whereas I think each line of pseudocode in Paxos is much more
| motivated.
|
| In other words, if a philosopher had to design a crash-fault
| protocol from scratch, without having seen any before, I
| think 80% of the time it would look exactly like Paxos.
| jeffbee wrote:
| Students of raft say they understand it but when they have to
| implement it they make as many mistakes as students of paxos.
| The simplicity is misleading.
| ukd1 wrote:
| Having implemented it twice (for fun/learning, though) I think
| you're right - yet it's also the easiest, imho. The paper, and
| resources around implementing it (posts, other implementations)
| are great. Also, the authors still reply on github when issues
| are opened, which is great.
| jimbokun wrote:
| How do you write a testing framework/system for evaluating
| distributed algorithms like this?
|
| Doing that in a way that's easy to automate, set up, evaluate,
| and tear down various scenarios seems like the hardest part to
| me.
| skilning wrote:
| I was asked to click "continue" after each of the first two
| sentences, and the fade-in of the text took longer than reading
| the text.
|
| This may be a great article, but I'll never know because it's
| frustrating to try and read.
| benbjohnson wrote:
| Author here. I'm happy to answer any questions although this
| project was from 10+ years ago so I could be a little rusty.
|
| Over the years I've been trying to find better ways to do this
| kind of visualization but for other CS topics. Moving to video is
| the most realistic option but using something like After Effects
| takes A LOT of time and energy for long-form visualizations. It
| also doesn't produce a readable output file format that could be
| shared, diff'd, & tweaked.
|
| I spent some time on a project recently to build out an SVG-based
| video generation tool that can use a sidecar file for defining
| animations. It's still a work in progress but hopefully I can get
| it to a place where making this style of visualization isn't so
| time intensive.
| mgenglder wrote:
| This is wonderful. Can I ask how you created it? Stack used and
| source code? I'd love to create something like this to help
| visualize things I'm working with currently.
| benbjohnson wrote:
| It's all done with d3 and JavaScript. The visualizations
| aren't deterministic so I ended up writing a shitty Raft
| implementation in JS. Overall it was a terrible approach
| because it was so time consuming but I made it work. You can
| find all the source code in this repo:
| https://github.com/benbjohnson/thesecretlivesofdata
| kfrzcode wrote:
| What are your thoughts on Dr. Leemon Baird's Hedera Hashgraph?
|
| https://www.swirlds.com/downloads/SWIRLDS-TR-2016-01.pdf
| benbjohnson wrote:
| I haven't read that paper but it seems like it's fixing a
| different problem of Byzantine fault tolerance. Most
| consensus systems that are internal for an organization don't
| have the Byzantine issue so it simplifies the problem.
| huntaub wrote:
| I just want you to know how much this visualization was
| appreciated. In my time working at AWS, I recommended this
| website to every one of our new hires to learn how distributed
| consensus works. Know that this has taught probably 50+ people.
| Thank you for what you've built.
| benbjohnson wrote:
| Thanks so much for letting me know! It's always hard to tell
| when I put something out there if it just gets lost in the
| ether. I'm glad to hear it helped so many folks.
| eatonphil wrote:
| Ben's visualization here is great.
|
| The other biggest help to me aside from the paper and the thesis
| was Ongaro's TLA+ spec:
| https://github.com/ongardie/raft.tla/blob/master/raft.tla. It's
| the only super concise "implementation" I found that is free of
| production-grade tricks, optimizations, and abstractions.
|
| And for building an intuition, TigerBeetle's sim.tigerbeetle.com
| is great. What happens to consensus when there's high latency to
| disk or network? Or as processes crash more frequently? It
| demonstrates.
| pa7ch wrote:
| Interesting that TigerBeetle uses Viewstamped Replication over
| Paxos/Raft. TB says Viewstamped Replication lends itself to a
| more performant implementation and doesn't rely on disk storage
| as much. I'm surprised I'm not seeing this brought up more in
| Paxos/Raft discussions.
| butterisgood wrote:
| VR is often overlooked and really pretty easily understood
| _russross wrote:
| I am in the minority who thinks Raft is overrated.
|
| I tried teaching Raft one year instead of Paxos but ended up
| switching back. While it was much easier to understand how to
| implement Raft, I think my students gained deeper insight when
| focusing on single-decision Paxos. There is a lightbulb moment
| when they first understand that consensus is a property of the
| system that happens first (and they can point at the moment it
| happens) and then the nodes discover that it has been achieved
| later. Exploring various failure modes and coming to understand
| how Paxos is robust against them seems to work better in this
| setting as well.
|
| I think this paper by Heidi Howard and Richard Mortier is a great
| way to move on to Multipaxos:
|
| https://arxiv.org/abs/2004.05074
|
| They present Multipaxos in a similar style to how Raft is laid
| out and show that Multipaxos as it is commonly implemented and
| Raft are almost the same protocol.
|
| Raft was a great contribution to the engineering community to
| make implementing consensus more approachable, but in the end I
| don't think the protocol itself is actually more understandable.
| It was presented better for implementers, but the implementation
| focus obscures some of the deep insights that plain Paxos
| exposes.
| withinboredom wrote:
| Know that I join you in thinking Raft is overrated. I'm working
| on a multipaxos implementation right now. There are some really
| neat capabilities/properties that Paxos has that Raft can never
| achieve (see WPaxos, for example, which lets keys migrate to
| nodes near the client).
| weinzierl wrote:
| This is an interesting insight into the educational side, but
| now I am curious about the implementation side. Raft is easier
| to implement, but that's just one factor. Looking at real-world
| usage there seems to be a draw: I could easily count as many
| Paxos implementations as Raft. Is this just historical or are
| there good reasons for a new project to still implement Paxos?
| bfdes wrote:
| Paxos and Multi-Paxos have been around much longer than Raft.
| The paper that introduced Raft was published in 2014.
| alexchamberlain wrote:
| I've read both the Paxos and Raft papers a few times, and
| hacked on some implementations, but never quite got one over
| the line to working...
|
| Raft strikes me as a particular set of decisions made within a
| Paxos framework, such as having one entity play the roles of
| Proposer, Acceptor, and Follower. It's frustrating that there
| isn't a clearly written de facto paper on Paxos - the story
| style confused the monkeys out of me.
| jimbokun wrote:
| > but never quite got one over the line to working...
|
| I've never implemented something like this. But my first
| thought is "how do you implement the testing system?"
|
| I feel like once you had a robust testing system that can
| verify things work correctly in all the different network
| partition and other scenarios, and allowing rapid iteration
| of setting up those scenarios, the implementation would be
| comparatively easy.
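|
| One common shape for that is a simulated network the test
| controls, so partitions, drops, and reordering are just code; a
| rough Go sketch with made-up names (seed the RNG to make runs
| repeatable):
|
|     package main
|
|     import "math/rand"
|
|     type message struct {
|         from, to string
|         payload  string
|     }
|
|     type simNet struct {
|         rng       *rand.Rand
|         queue     []message
|         partition map[string]bool // node ID -> currently cut off
|         dropRate  float64
|     }
|
|     func (s *simNet) send(m message) {
|         if s.partition[m.from] || s.partition[m.to] {
|             return // link cut by a partition
|         }
|         if s.rng.Float64() < s.dropRate {
|             return // random packet loss
|         }
|         s.queue = append(s.queue, m)
|     }
|
|     // step delivers one pending message, chosen at random to model
|     // reordering; the test loop calls this until the scenario
|     // settles, then asserts on the nodes' state.
|     func (s *simNet) step() (message, bool) {
|         if len(s.queue) == 0 {
|             return message{}, false
|         }
|         i := s.rng.Intn(len(s.queue))
|         m := s.queue[i]
|         s.queue = append(s.queue[:i], s.queue[i+1:]...)
|         return m, true
|     }
|
|     func main() {}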
| alexchamberlain wrote:
| Yeah, you kind of can't test any of it until you test all
| of it...
| heromal wrote:
| Check out maelstrom
| senderista wrote:
| Agree, Raft is less modular and therefore harder to understand
| than MultiPaxos:
|
| https://maheshba.bitbucket.io/blog/2021/12/14/Modularity.htm...
| alexnewman wrote:
| You are telling me your students can safely implement single-
| decree Paxos... I worked on a few Paxos production
| implementations before the Raft paper, single and multi decree.
| The idea that Paxos, as it is to be implemented, is easy for
| students to understand... Well, let me re-read the paper, but I
| assure you, Raft was a big deal.
| hintymad wrote:
| Is there any paper/handout/video that explains Paxos in depth,
| especially its implementations and intuitions? Paxos Made
| Simple gave intuitive explanations, but I feel it still misses
| a lot of intricate details if I were to build Paxos for
| production use.
| mananaysiempre wrote:
| The papers and talks[1] by Heidi Howard and Richard Mortier,
| in particular "Paxos vs Raft: Have we reached consensus on
| distributed consensus?"[2,3] and "Flexible Paxos"[4,5], are
| what finally made things click for me. A real implementation
| also needs other stuff[6], though, such as dynamic membership
| and state machine replication, which I still don't know how
| to do.
|
| [1] https://fpaxos.github.io/
|
| [2] https://dl.acm.org/doi/abs/10.1145/3380787.3393681 or
| https://arxiv.org/abs/2004.05074
|
| [3] https://www.youtube.com/watch?v=0K6kt39wyH0
|
| [4] https://doi.org/10.4230/LIPIcs.OPODIS.2016.25 or
| https://arxiv.org/abs/1608.06696
|
| [5] https://www.youtube.com/watch?v=r6NG_1HM0lA
|
| [6] https://github.com/heidihoward/distributed-consensus-
| reading...
| nvarsj wrote:
| The best approach is to implement Paxos.
|
| I suggest https://github.com/emichael/dslabs.
|
| It will take 100-200 hours to fully implement with all tests
| passing.
| withinboredom wrote:
| Bummer that it requires java. Would be awesome to have a
| 'networked' version where you just need to implement the
| protocol.
| sriram_malhar wrote:
| I have had the opposite trajectory. Used to teach Paxos, but
| was so relieved to switch to Raft when the paper came out.
|
| I discuss distributed data structures in the context of maps
| and sequences.
|
| For maps, I discuss key-value stores (NUMA, Redis). I have them
| implement cache coherence (MESI protocol, TARDIS 2.0), then
| linearizable, fault-tolerant, wait-free shared memory registers
| (the Attiya/Bar-Noy/Dolev algorithm[1]).
|
| For sequences, I cover shared logs and state machine
| replication, including database log shipping, Kafka, queues and
| Raft.
|
| I like Raft because it cuts down the design space by making
| certain very intuitive and pragmatic choices, like using
| timeouts (which are almost beneath Lamport to discuss :), or
| idioms like "follow the leader", "if the leader is unreachable,
| stand for election", "elect the latest & most informed leader",
| (how I wish that was true in real life!), "always append" etc.
| There are simple mechanisms to preserve invariants.
|
| The problem with Paxos is that there is such a large range of
| papers that there is no one paper that makes the leap in easily
| digestible chunks from Basic to MultiPaxos. When I got students
| to implement MultiPaxos, I never could get sufficient
| confidence that it was done right (esp. the "disorderly"
| filling in of log slots).
|
| Paxos is like Monads; when you get it, you feel compelled to
| write a "Paxos explained" paper :)
|
| [1]https://groups.csail.mit.edu/tds/papers/Attiya/PODC90.pdf
| withinboredom wrote:
| I feel like you are doing your students a disservice.
| (multi-)Paxos, while complex to wrap your head around,
| enables far more modes of consensus. The possibilities and
| papers out there are amazing.
|
| Raft essentially only allows a single mode. Moreover, you are
| starting to see people putting things on top of Raft instead
| of something like Paxos, in the enterprise, because they
| don't know any better nor have the foundation to understand
| what they are doing is "wrong."
|
| > When I got students to implement MultiPaxos, I never could
| get sufficient confidence that it was done right
|
| Testing this is fairly straightforward, they should be able
| to join an already existing cluster. If they got it wrong, it
| shouldn't take down the cluster, and they should be able to
| step through their own code. There aren't any timeouts, so
| they can take their time, going through each step of the
| process until a value is committed.
|
| At that point, you simply explain each step as an individual
| algorithm, not the sum of its parts. You can even build each
| part individually because an existing cluster should recover
| from a misbehaving peer.
|
| From there, it is a rather simple visualization process to
| see what is going on.
|
| The hard part of paxos is building it from scratch.
| sriram_malhar wrote:
| Can you tell more about what other interesting modes
| multiPaxos allows? For an example of what I find
| uninteresting, it is proposers or acceptors not having a
| local disk (I know there are some uses for it, but there
| are relatively straightforward ways of solving that issue
| without requiring a whole new protocol). In all examples I
| have seen, multipaxos and raft are fairly alike, except for
| parts of their leadership election as Howard and Mortier
| also describe.
|
| Raft's simplicity and the fact that there's exactly one way
| to do it is what I find most comforting in the most
| important component of a distributed system. I can look at
| an implementation of raft and immediately understand what's
| going on, and what is missing.
| withinboredom wrote:
| What makes multiPaxos a better learning tool isn't
| multiPaxos for the sake of multiPaxos, but rather Paxos
| itself. The write-once consensus primitive is a valuable
| tool (whether you are implementing Raft or multiPaxos) or
| thing to know. Especially when it comes to flexible /
| compartmentalized Paxos. Don't get me wrong, the same
| things can theoretically be done with Raft (if they
| haven't already), but that single-decree primitive makes
| it make that much more sense. This is missing from Raft.
|
| It's like learning about a byte and never learning about
| endianness because everything is pretty much little-
| endian these days.
| kfrzcode wrote:
| DLT technology discussions are entirely incomplete without
| consideration of Hedera Hashgraph [0], an aBFT, leaderless, fair
| and _fast_ DLT using a gossip-about-gossip consensus mechanism.
| It's absolutely a more robust and scalable technology than Paxos
| or any other DLT for that matter. I'd love to know what the HN
| crowd thinks about Hedera as the trust layer of the internet
| but... nobody around here seems to have any opinion. It's like
| ignoring Linux while comparing Mac and Windows based computing.
|
| [0]: https://www.swirlds.com/downloads/SWIRLDS-TR-2016-01.pdf
| ryanthemadone wrote:
| Paxos isn't a DLT, it's a consensus algorithm -- granted, DLTs
| tend to require a consensus algorithm, but they're not the same
| things.
|
| As for Hedera Hashgraph being the trust layer of the internet,
| we tend to build the internet through the IETF and standards
| setting. Unfortunately HH is an endeavour from a private
| company so isn't especially likely to be taken on in that
| context.
|
| I'd also wonder what you mean by the trust layer of the
| internet, what are the use cases that you'd like to see solved
| with such a trust layer?
| kfrzcode wrote:
| Great point, I wasn't being precise. My point stands,
| however! I must also correct your understanding re: "private
| company."
|
| The DLT in question - Hedera - is built on the unique
| consensus algorithm, the "hashgraph". The hashgraph algorithm
| combines a gossip-about-gossip protocol with virtual voting.
|
| In fact, services development is now in the hands of the
| largest open source foundation in the world. These
| implementations are entirely open source [0], and very
| recently the codebase has been donated in whole to the Linux
| Foundation.
|
| Furthermore, Hedera is a Pioneer member of the newly founded
| Linux Foundation Decentralized Trust organization [1] - along
| with Chainlink, Deloitte, Hitachi, many other major
| organizations, some of whom are also on the Hedera governing
| council [2]. This foundation will be a big player in the
| future of decentralized web, and Hedera is the only L1 that I
| know of which is both primed for this future and actually
| scalable.
|
| I understand the IETF and standards approach; and am not
| aware of a current draft or intention for such a draft. The
| idea of Hedera being the 'trust layer' of the internet is
| more about use cases like decentralized recovery, process
| validation, carbon offsets, extremely granular supply chain
| auditing, and any other application you might imagine that
| would benefit from having extremely fast (10k+ TPS), 100%
| guaranteed aBFT consensus on-chain. I'd love to hear what you
| might think up or where this could be particularly useful.
| Strong governance with 39 council members - including Google,
| IBM, Boeing, Tata, AP+, Hitachi and more... with
| decentralized network operation and stable fees (essential
| for enterprise application).
|
| > Additional use cases include, but are not limited to,
| financial markets, matching engines (like those used in Uber
| or AirBnb), or supply chain negotiations (e.g., several
| competing factories bidding on parts from several competing
| parts suppliers).
|
| So, admittedly the use cases maybe aren't evident or
| interesting to you and I right now at the TCP/IP layer, but I
| can certainly say there are a plethora of trust-based
| problems that could be solved with consensus only needing a
| few thousand ms. Think digital identity, healthcare,
| financial markets, IoT, supply chain, real-world asset
| tokenization... For any real-world, scaled application, a DLT
| must have very high throughput at high performance. It's the
| absolute highest performance and security possible in a
| leaderless consensus-based DLT as far as I know.
|
| Literally carbon neutral or negative because of buybacks but
| even without Hedera buying carbon credits, it's the single
| most "green" i.e. power-efficient DLT on the market. Does HN
| still care about Bitcoin using too much power? They would
| like Hedera, to that extent. Predictable, very low fixed
| fees. Long list of the biggest tech players leading the open
| development process. What's not to love?
|
| I urge you to help me invalidate these claims as it's pretty
| important I understand the tech here... But I'm very bullish
| on the token price as the technology is proven, robust, and
| overall extremely undervalued by retail - in my opinion. NFA.
|
| > The Hashgraph consensus algorithm is an algorithm for
| asynchronous Byzantine fault tolerance intended for
| distributed shared ledgers. Its main distinguishing
| characteristic is it achieves consensus without exchanging
| any extra messages; each participant's votes can be
| determined from public information, so votes need not be
| transmitted.
|
| For more rigorous explanation, see [3] and the associated Coq
| proof of the algorithm [4].
|
| [0]: https://github.com/hashgraph/hedera-services
|
| [1]: https://www.lfdecentralizedtrust.org/
|
| [2]: https://hedera.com/ecosystem/governing-council
|
| [3]: https://hedera.com/papers
|
| [4]: https://www.cs.cmu.edu/~crary/papers/2021/hashgraph.pdf
| lapinot wrote:
| I got interested in the hashgraph algorithm quite early and
| wrote a toy implementation in Python (in fact discovering an
| error in the paper in the process). Unless something changed
| it's entirely useless for open-membership internet-scale
| consensus. As I remember, the processing time of a message at a
| node is linear in the number of nodes and same for the local
| storage at a node, meaning it's not very scalable. Moreover,
| the nodes must somehow a priori agree on the list of
| participants of the consensus process, again something which is
| not realistic for internet-wide consensus. The protocol is
| quite neat and not too hard to implement but it's similar in
| scope to paxos/raft: consensus inside an organization, where
| some things are a priori agreed upon.
| cedws wrote:
| Can't proof-of-work be used as a leader election algorithm? If
| the proof is hard enough to generate then one node should be able
| to generate one and broadcast it before the other nodes can, then
| that node becomes the leader.
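|
| Roughly the hashcash-style version of that, as a toy Go sketch
| (constants and names made up): the winner is whoever first finds
| a nonce whose hash has enough leading zero bits, and everyone
| else can verify it cheaply. The bound is only probabilistic.
|
|     package main
|
|     import (
|         "crypto/sha256"
|         "encoding/binary"
|         "math/bits"
|     )
|
|     // Leading zero bits of the hash of (round, nonce).
|     func leadingZeroBits(round, nonce uint64) int {
|         var buf [16]byte
|         binary.BigEndian.PutUint64(buf[:8], round)
|         binary.BigEndian.PutUint64(buf[8:], nonce)
|         sum := sha256.Sum256(buf[:])
|         return bits.LeadingZeros64(binary.BigEndian.Uint64(sum[:8]))
|     }
|
|     // solve searches for a proof; difficulty tunes the average time.
|     func solve(round uint64, difficulty int) uint64 {
|         for nonce := uint64(0); ; nonce++ {
|             if leadingZeroBits(round, nonce) >= difficulty {
|                 return nonce
|             }
|         }
|     }
|
|     func verify(round, nonce uint64, difficulty int) bool {
|         return leadingZeroBits(round, nonce) >= difficulty
|     }
|
|     func main() {
|         nonce := solve(1, 16) // ~2^16 hashes on average
|         _ = verify(1, nonce, 16)
|     }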
| withinboredom wrote:
| There's a paper about that, using paxos as the base. Can't find
| it right now, it is called "chained paxos" or "block paxos" or
| something like that.
| ryanthemadone wrote:
| Proof of work is a leader election algorithm!
| mbivert wrote:
| In case this is of interest, MIT's 6.5840[0], distributed
| systems, has a series of labs, implementing Raft in Go. Haven't
| made it through the whole thing yet, but it's quite entertaining
| so far.
|
| The teachers provide you with some code templates, a bunch of
| tests, and a progressive way to implement it all.
|
| [0]: https://pdos.csail.mit.edu/6.824/index.html
| lifeinthevoid wrote:
| really liked the course, it was also instrumental in landing me
| my previous job :-)
| prydt wrote:
| I've run a reading group for distributed systems for the last 2
| years now and I do think that Raft is a better introduction to
| Consensus than any Paxos paper I have seen (I mean the Paxos Made
| Simple paper literally has bugs in it). But when I learned
| consensus in school, we used Paxos and Multi-Paxos and I do
| believe that there was a lot to be gained by learning both
| approaches.
|
| Heidi Howard has several amazing papers about how the differences
| between Raft and Multi-Paxos are very surface level and that
| Raft's key contribution is its presentation as well as being a
| more "complete" presentation since there are so many fragmented
| different presentations of Multi-Paxos.
|
| As a bonus, one of my favorite papers I have read recently is
| Compartmentalized Paxos:
| https://vldb.org/pvldb/vol14/p2203-whittaker.pdf which is just a
| brilliant piece on how to scale Multi-Paxos
| senderista wrote:
| There are several Multi-Paxos papers (some of them dating
| before Raft) that are intended as guidance for implementers:
|
| https://paper-notes.zhjwpku.com/assets/pdfs/paxos_for_system...
|
| https://www.cs.cornell.edu/home/rvr/Paxos/paxos.pdf
|
| https://www.scs.stanford.edu/~dm/home/papers/paxos.pdf
| prydt wrote:
| Ah thank you. That is a good list although I personally
| dislike the "Paxos Made Moderately Complex" paper... I think
| it adds too many different roles for very little benefit.
| When implementing multi-Paxos for class, I used that paper
| and felt it was more trouble than it needed to be.
|
| I'll check out the other two papers though! Also just looking
| around and I found this paper https://arxiv.org/pdf/1103.2408
| [PDF] which looks useful as well.
| shepherdjerred wrote:
| What's your reading group?
|
| I took a DS class and (poorly) implemented Paxos a few years
| ago. I'm curious about how others continue learning about DS.
| nano_o wrote:
| The bug in the Paxos Made Simple paper is that Lamport forgot
| to mention that, upon accepting a proposal, an acceptor also
| implicitly promises not to accept any proposal in lower
| ballots. It's discussed at length here:
| https://stackoverflow.com/questions/29880949/contradiction-i...
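|
| In code, the fix looks roughly like this (Go sketch, names made
| up): the accept handler also raises the promised ballot, so
| nothing in a lower ballot can be prepared or accepted afterwards.
|
|     package main
|
|     type acceptor struct {
|         promisedBallot int
|         acceptedBallot int
|         acceptedValue  string
|     }
|
|     func (a *acceptor) accept(ballot int, value string) bool {
|         if ballot < a.promisedBallot {
|             return false // promised a higher ballot; reject
|         }
|         a.promisedBallot = ballot // the promise left implicit in the paper
|         a.acceptedBallot = ballot
|         a.acceptedValue = value
|         return true
|     }
|
|     func main() {}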
| dang wrote:
| Related:
|
| _Raft Consensus Animated (2014)_ -
| https://news.ycombinator.com/item?id=32484584 - Aug 2022 (67
| comments)
|
| _Raft Visualization_ -
| https://news.ycombinator.com/item?id=25326645 - Dec 2020 (35
| comments)
|
| _Raft: Understandable Distributed Consensus_ -
| https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79
| comments)
| shiredude95 wrote:
| "Paxos Made Moderately Complex" by Robert van Renesse and Deniz
| Altinbuken:
| http://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf is a
| great starting point for implementing multi-paxos. The authors
| also provide a working python implementation.
| jhanschoo wrote:
| Just earlier this month I was going through
| https://github.com/jepsen-io/maelstrom and following the demo
| implementing (a cut-corners version of) Raft, and I found it
| quite elucidating. The sample (ruby) code contains some bugs, and
| I had to use some understanding to fix them. (The bugs were of
| the kind where a dummy implementation from an earlier step isn't
| correctly changed)
| matthewfcarlson wrote:
| This really teaches Raft well. Is there a good example of this
| but for Paxos?
| quelltext wrote:
| In the log replication example, after healing the partition the
| uncommitted log changes in the minority group are rolled back and
| the leader's log is used.
|
| However it's not clear how that log is transmitted. Until this
| point only heartbeats via append entry were discussed, so it's
| not clear if the followers pull that information from the leader
| somehow via a different mechanism, or whether it's the leader's
| responsibility to detect followers that are left behind and
| replay everything. That would seem rather error prone and a lot
| of coordination effort. So how's it actually done?
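|
| For what it's worth, in the paper's version the same
| AppendEntries RPC that serves as a heartbeat also carries log
| entries: the leader keeps a nextIndex per follower, walks it back
| each time the follower rejects the consistency check
| (prevLogIndex/prevLogTerm), and then resends everything from the
| point where the logs agree. A rough Go sketch (names made up,
| 0-indexed log for brevity):
|
|     package main
|
|     type entry struct {
|         term    int
|         command string
|     }
|
|     type leader struct {
|         log       []entry
|         nextIndex map[string]int // per follower: next index to send
|     }
|
|     type appendArgs struct {
|         prevIndex, prevTerm int
|         entries             []entry
|     }
|
|     // argsFor prepares the AppendEntries arguments for one follower.
|     func (l *leader) argsFor(f string) appendArgs {
|         next := l.nextIndex[f]
|         prev := next - 1
|         prevTerm := 0
|         if prev >= 0 {
|             prevTerm = l.log[prev].term
|         }
|         return appendArgs{
|             prevIndex: prev,
|             prevTerm:  prevTerm,
|             entries:   l.log[next:],
|         }
|     }
|
|     // onReply backs nextIndex up on rejection; a lagging follower
|     // keeps rejecting until prevIndex/prevTerm match an entry it
|     // has, then truncates any conflicting suffix and appends the rest.
|     func (l *leader) onReply(f string, success bool) {
|         if success {
|             l.nextIndex[f] = len(l.log)
|         } else if l.nextIndex[f] > 0 {
|             l.nextIndex[f]--
|         }
|     }
|
|     func main() {}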
| wg0 wrote:
| Off topic - folks in the know, besides Paxos/Raft what are some
| of the other most complex algorithms in computer science that
| are widely used or are a bedrock?
| mbivert wrote:
| I'd be curious to know as well; I'd bet on things pertaining to
| either compilers (e.g. optimization stuff), concurrency/OSes,
| graphics engines & image processing.
| withinboredom wrote:
| I can think of a deceptively simple one: recursive descent
| parsers. They are pretty straightforward to implement but hard
| to read/reason about if you don't already know the grammar.
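|
| A toy Go example of that shape, one function per grammar rule
| (which is exactly why the code is hard to follow without the
| grammar in hand):
|
|     package main
|
|     import (
|         "fmt"
|         "strconv"
|         "strings"
|     )
|
|     //   expr   := term   (('+' | '-') term)*
|     //   term   := factor (('*' | '/') factor)*
|     //   factor := number | '(' expr ')'
|     type parser struct {
|         toks []string
|         pos  int
|     }
|
|     func (p *parser) peek() string {
|         if p.pos < len(p.toks) {
|             return p.toks[p.pos]
|         }
|         return ""
|     }
|
|     func (p *parser) next() string { t := p.peek(); p.pos++; return t }
|
|     func (p *parser) expr() float64 {
|         v := p.term()
|         for p.peek() == "+" || p.peek() == "-" {
|             if p.next() == "+" {
|                 v += p.term()
|             } else {
|                 v -= p.term()
|             }
|         }
|         return v
|     }
|
|     func (p *parser) term() float64 {
|         v := p.factor()
|         for p.peek() == "*" || p.peek() == "/" {
|             if p.next() == "*" {
|                 v *= p.factor()
|             } else {
|                 v /= p.factor()
|             }
|         }
|         return v
|     }
|
|     func (p *parser) factor() float64 {
|         if p.peek() == "(" {
|             p.next()
|             v := p.expr()
|             p.next() // consume ")"
|             return v
|         }
|         n, _ := strconv.ParseFloat(p.next(), 64)
|         return n
|     }
|
|     func main() {
|         p := &parser{toks: strings.Fields("1 + 2 * ( 3 - 1 )")}
|         fmt.Println(p.expr()) // 5
|     }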
| eatonphil wrote:
| Here's a visualization of Paxos and MultiPaxos that seems to be
| based on Ben's work: https://visual.ofcoder.com/.
___________________________________________________________________
(page generated 2024-09-28 23:02 UTC)