[HN Gopher] Raft Consensus Animated
       ___________________________________________________________________
        
       Raft Consensus Animated
        
       Author : pkilgore
       Score  : 214 points
       Date   : 2022-08-16 15:57 UTC (7 hours ago)
        
 (HTM) web link (thesecretlivesofdata.com)
 (TXT) w3m dump (thesecretlivesofdata.com)
        
       | purpleblue wrote:
       | Excellent!
       | 
       | A couple of questions:
       | 
        | 1) In the case of a network partition, does the client that
        | is currently connected to the leader get notified that
        | there's a partition, or that the cluster is not in a healthy
        | state?
       | 
       | 2) If a client writes to the partition that will get rolled back,
       | and all their transactions get rolled back after the partition
       | heals, do they get notified that their data was rolled back?
        
         | avmich wrote:
          | > 1) In the case of a network partition, does the client
          | that is currently connected to the leader get notified that
          | there's a partition, or that the cluster is not in a
          | healthy state?
         | 
          | The cluster - or any server in the cluster - finds out
          | about a network partition only when the timeout passes. At
          | that point the leader - by then the former leader - can
          | notify the client, or the client can see for itself that
          | the timeout has passed.
         | 
         | > 2) If a client writes to the partition that will get rolled
         | back, and all their transactions get rolled back after the
         | partition heals, do they get notified that their data was
         | rolled back?
         | 
          | Note that the client was never notified that their data
          | was committed in the first place. So the client can assume
          | that if the timeout passed without notification, the data
          | wasn't committed in the cluster.
         | 
          | Of course, there could still be problems between the
          | client and the leader; idempotent messages could be useful
          | there.
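          | 
          | To illustrate, here's a minimal Go sketch of that client
          | side. The put RPC and the dedup-by-requestID behavior are
          | hypothetical stand-ins for whatever the cluster's client
          | library actually provides:
          | 
          |     package main
          |     
          |     import (
          |         "context"
          |         "fmt"
          |         "time"
          |     )
          |     
          |     // put stands in for the cluster's write RPC; assume
          |     // the cluster applies each requestID at most once.
          |     func put(ctx context.Context, requestID, key, val string) error {
          |         // ... send {requestID, key, val} to the leader ...
          |         return ctx.Err()
          |     }
          |     
          |     func main() {
          |         requestID := "client-42-op-7" // same across retries
          |     
          |         for attempt := 1; attempt <= 3; attempt++ {
          |             ctx, cancel := context.WithTimeout(
          |                 context.Background(), 2*time.Second)
          |             err := put(ctx, requestID, "x", "5")
          |             cancel()
          |             if err == nil {
          |                 fmt.Println("committed")
          |                 return
          |             }
          |             // A timeout means "outcome unknown", not
          |             // "failed": retrying with the same requestID
          |             // is safe because duplicates are ignored.
          |         }
          |         fmt.Println("gave up; outcome still unknown")
          |     }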
        
       | daenz wrote:
        | Is it a weakness to only commit on majority consensus? I'm
        | thinking of a very unstable global network, where partitions
        | are happening everywhere. In that scenario, only one side of
        | the partition can reach consensus (if you're lucky). If the
        | partitions are such that no side has a majority, nothing can
        | proceed.
        | 
        | Is there a better way: proceed with tentative consensus until
        | a majority cluster can be realized, and then apply a conflict
        | resolution strategy? People operate this way.
        
         | zambal wrote:
          | You either need consistency or you don't. Raft is for
          | systems that need this guarantee. If you don't need it,
          | something like CRDTs can be used.
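          | 
          | For a taste of the CRDT approach, here's a minimal Go
          | sketch of a G-Counter (grow-only counter), one of the
          | simplest CRDTs. Each node increments only its own slot, and
          | merging takes the element-wise max, so replicas converge
          | regardless of message order or duplication:
          | 
          |     package main
          |     
          |     import "fmt"
          |     
          |     // GCounter maps node ID -> that node's local count.
          |     type GCounter map[string]int
          |     
          |     func (c GCounter) Increment(node string) { c[node]++ }
          |     
          |     // Value is the sum of all per-node counts.
          |     func (c GCounter) Value() int {
          |         total := 0
          |         for _, n := range c {
          |             total += n
          |         }
          |         return total
          |     }
          |     
          |     // Merge folds another replica's state into this one.
          |     func (c GCounter) Merge(other GCounter) {
          |         for node, n := range other {
          |             if n > c[node] {
          |                 c[node] = n
          |             }
          |         }
          |     }
          |     
          |     func main() {
          |         a, b := GCounter{}, GCounter{}
          |         a.Increment("a")
          |         a.Increment("a")
          |         b.Increment("b") // concurrent, across a partition
          |     
          |         a.Merge(b)
          |         b.Merge(a)
          |         fmt.Println(a.Value(), b.Value()) // 3 3: converged
          |     }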
        
         | kortex wrote:
          | I'm not sure Raft is the best distributed consensus
          | algorithm for a global, unstable, frequently-partitioning
          | network. I think it's in its niche when leaders are running
          | on _fairly_ stable networks (>1-2 nines) and the main
          | source of node failures is task cycling / rolling deploys.
         | 
          | I've played around with Hashicorp Consul on "edge boxes" -
          | long-haul, wirelessly-connected embedded computers with
          | unreliable power supplies. Allowing edge boxes to be Consul
          | leaders results in all kinds of mayhem: split-brain
          | situations, corrupted state, stale DNS resolutions (Consul
          | handles DNS as well), cats and dogs living together, mass
          | hysteria. A much better topology is to have 3 server nodes
          | on a LAN as the "head cluster" and let all the edge boxes
          | be clients of the head.
         | 
          | I haven't used it, but Consul has a multi-datacenter mode
          | which I believe is designed to better handle this
          | situation, with a dedicated Raft cluster per datacenter.
         | 
         | https://learn.hashicorp.com/tutorials/consul/federation-goss...
        
         | aaronblohowiak wrote:
          | This is a consistent and partition-tolerant system; what
          | you are describing is an available and partition-tolerant
          | system, but not one that can provide consistent results.
          | (That you cannot have all three properties is called the
          | CAP theorem. Some people say they have all three, but
          | they've just put a tight bound on unavailability and claim
          | it doesn't exist.) There are a variety of ways to achieve
          | availability and partition tolerance, with the conflict
          | resolution as a rule implemented by the database or by the
          | application.
        
       | alpb wrote:
       | Unsolicited feedback: use fewer text-appear animations, and allow
       | people to skip through stuff. I've spent a full minute clicking
       | next next and still haven't seen a visualization aside from text
       | slides loading slowly with animations. It's like a long YouTube
       | ad that you cannot skip.
        
         | benbjohnson wrote:
         | Author here. Yeah, I think I'll go with actual video for future
         | visualizations. I made this visualization about 10 years ago
         | and going back to it I feel the same way about the slowness. At
         | least with video you can run it at 2x. :)
        
         | [deleted]
        
           | [deleted]
        
         | kortex wrote:
          | Arrow keys worked for me, but sadly the back arrow didn't
          | go back to the previous animation (Firefox, Mac).
        
         | tootallgavin wrote:
          | Yep - basically no animation, just explanation.
        
           | wyldfire wrote:
           | I like the animation because it shows the dynamic behavior.
           | 
           | But the slow nature of the introduction to the elements on
           | each incremental click is a bit irritating.
           | 
           | I'd recommend static image(s) with legend/highlights for each
           | node and message, etc. And animations for each relevant
           | scenario illustrated.
        
         | thanatos519 wrote:
         | I didn't even make it past the introduction.
        
         | beckingz wrote:
         | Agreed!
         | 
         | I really like this, but not being able to go slightly faster
         | with arrow keys was aggravating.
         | 
         | Cool explanation though!
        
         | wowokay wrote:
         | What did you guys run this on? I have had zero issues with it
         | and the animations and progression felt guided and informative.
        
           | phailhaus wrote:
           | ...for you. You may love the speed, but it's not right for
           | everyone. When designing interactive interfaces like this,
           | it's important to cede control to the user so that they can
           | choose the rate at which they consume content. Otherwise,
           | half your users won't like it and bail.
        
       | edfletcher_t137 wrote:
       | This is genuinely lovely and informative. Thank you!
        
       | grogenaut wrote:
        | Ugh... the animations are so slow... I read much, much
        | faster than that; it feels like playing through an old JRPG
        | that doesn't let you speed up the text playback.
        
       | kretaceous wrote:
        | I've only heard the Raft consensus algorithm thrown around
        | in a few GitHub repos/HN comments but never got a chance to
        | really learn it.
        | 
        | This webpage cleared up some long-standing doubts about what
        | distributed computing means, what a consensus algorithm is,
        | and what this Raft thing is.
       | 
       | Kudos to the developer. You got a newbie interested in the field!
        
       | stevewatson301 wrote:
       | On a related note, I've found
       | https://martinfowler.com/articles/patterns-of-distributed-sy...
       | to be quite instructive in understanding distributed systems in
       | general.
        
       | majewsky wrote:
       | Great timing. I'm part of a German podcast on fundamentals of
       | computing [1], and we just recorded an episode on Distributed
       | Systems that discusses Raft as an example. We will probably be
       | adding an addendum to link to this.
       | 
       | [1] https://www.schluesseltechnologie-podcast.de
        
       | cphoover wrote:
       | It would be cool to also see an animated visualization of the
       | paxos consensus algorithm
        
         | harveywi wrote:
         | Indeed. An animation by Terry Gilliam with each of the
         | distributed processes represented by Leslie Lamport wearing a
         | different disguise.
        
       | dec0dedab0de wrote:
       | I ran into this while setting up Hashicorp Vault a year or two
       | ago. It was good at helping me understand what's happening, but I
        | don't particularly like Raft. I want to be able to recover
        | from one server, and I don't want to have to wait for a
        | majority on every transaction should I add many servers. I
        | know it's an impossible problem to solve in general, but I
        | think in many situations an alert saying some specific data
        | had a conflict and might not have been resolved correctly is
        | a much better outcome than an outage.
        
         | greiskul wrote:
          | If you don't want distributed consensus, don't use a
          | distributed consensus algorithm. Raft/Paxos is not the best
          | fit for every problem, but for problems where you NEED to
          | ensure consistency, it is the best tool for the job. And
          | while it could theoretically have outage problems, Google's
          | Chubby lock service, written using Paxos, has such high
          | availability in its global instance that the SREs introduce
          | artificial mini-outages just so dependent services don't
          | assume it has a 100% SLA.
        
           | outworlder wrote:
            | > Google's Chubby lock service, written using Paxos, has
            | such high availability in its global instance that the
            | SREs introduce artificial mini-outages just so dependent
            | services don't assume it has a 100% SLA
           | 
           | That's fascinating. Got more information on that?
        
             | beckingz wrote:
             | I think there was something about it in the Google SRE
             | book?
             | 
             | https://sre.google/sre-book/table-of-contents/
        
               | joek1301 wrote:
               | See "The Global Chubby Planned Outage" on this page:
               | 
               | https://sre.google/sre-book/service-level-objectives/
        
               | outworlder wrote:
                | Oh. I had read the book before, but that snippet
                | simply disappeared from my mind.
               | 
               | Thank you!
        
       | cooze wrote:
        | Is this similar to how the Ethereum network operates? This
        | is an awesome animation.
        
         | latchkey wrote:
          | Not at all. Today, ETH is PoW-based for consensus; it is
          | moving to PoS in the future.
          | 
          | ETH has to deal with at least one thing that Raft doesn't
          | have to deal with: bad actors trying to inject bad data
          | into the system, also known as the Byzantine generals
          | problem [1].
         | 
         | [1] https://en.wikipedia.org/wiki/Byzantine_fault
        
       | ollemasle wrote:
        | More generally, the Raft page on GitHub lists some good
        | resources on the subject (including that really good
        | animation):
       | 
       | https://raft.github.io/
        
       | bberrry wrote:
       | I've had a surprisingly hard time finding a bare-bones Raft
       | implementation in Java purely for leader election.
       | 
        | The same hunt also surprised me: there seems to be no common
        | way to do leader election among pods in Kubernetes.
        
         | vultour wrote:
          | The Operator Framework (and, I assume, the upstream k8s Go
          | library) provides leader election.
        
         | dharmab wrote:
         | How long ago was this? There is now a native Lease resource
         | which allows you to piggyback off the etcd consensus.
         | 
         | https://kubernetes.io/docs/reference/kubernetes-api/cluster-...
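          | 
          | For reference, a minimal Go sketch using client-go's
          | leaderelection package with a Lease lock (the lease name
          | and namespace below are placeholders):
          | 
          |     package main
          |     
          |     import (
          |         "context"
          |         "os"
          |         "time"
          |     
          |         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          |         "k8s.io/client-go/kubernetes"
          |         "k8s.io/client-go/rest"
          |         "k8s.io/client-go/tools/leaderelection"
          |         "k8s.io/client-go/tools/leaderelection/resourcelock"
          |     )
          |     
          |     func main() {
          |         cfg, err := rest.InClusterConfig()
          |         if err != nil {
          |             panic(err)
          |         }
          |         id, _ := os.Hostname() // unique identity per pod
          |     
          |         lock := &resourcelock.LeaseLock{
          |             LeaseMeta: metav1.ObjectMeta{
          |                 Name:      "my-app-leader", // placeholder
          |                 Namespace: "default",       // placeholder
          |             },
          |             Client:     kubernetes.NewForConfigOrDie(cfg).CoordinationV1(),
          |             LockConfig: resourcelock.ResourceLockConfig{Identity: id},
          |         }
          |     
          |         // Blocks, renewing the Lease while leading; only
          |         // one pod holds the lease at a time.
          |         leaderelection.RunOrDie(context.Background(),
          |             leaderelection.LeaderElectionConfig{
          |                 Lock:          lock,
          |                 LeaseDuration: 15 * time.Second,
          |                 RenewDeadline: 10 * time.Second,
          |                 RetryPeriod:   2 * time.Second,
          |                 Callbacks: leaderelection.LeaderCallbacks{
          |                     OnStartedLeading: func(ctx context.Context) {
          |                         // this replica is now the leader
          |                     },
          |                     OnStoppedLeading: func() {
          |                         // lost the lease; stop leader work
          |                     },
          |                 },
          |             })
          |     }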
        
       | waynesonfire wrote:
        | Have there been any notable amendments made to the protocol,
        | whether to improve correctness or performance?
        
       | benbjohnson wrote:
       | Author here. I made this visualization over a decade ago and I'm
       | glad it's still useful for folks! Let me know if you have any
       | questions.
       | 
        | I've also been trying, on and off, some different techniques
        | for doing these visualizations, as I'd like to do more of
        | them. I'm currently looking at trying to make it work with
        | Remotion[1]. The JavaScript version I did for Raft was time-
        | intensive, and I ended up having to write an entire (albeit
        | terrible) implementation of Raft just to get it to work. lol.
       | 
       | [1] https://www.remotion.dev/
        
         | doctor_eval wrote:
         | It's awesome. Thanks for this. I kinda-sorta understood how it
         | worked from watching logs of systems that used Raft, but seeing
         | it clearly like this made me say "oooh!" a couple of times.
        
         | aaronblohowiak wrote:
          | Fascinating! Thank you. Perhaps eventually the work of
          | Heidi Howard will inspire a domain model that works for
          | multiple consensus algorithms? Great work; visualizations
          | help a lot.
        
         | onlyrealcuzzo wrote:
         | LOL - I was wondering how you would do this without actually
         | implementing Raft.
         | 
         | It appears you actually did implement it!
        
       | Vervious wrote:
       | Previous discussion (in 2020):
       | https://news.ycombinator.com/item?id=25326645
       | 
        | Also, I personally think the current blockchain literature
        | is much more intuitive and easier to follow for learning
        | about consensus. The Byzantine case isn't really that
        | different from the crash case if we assume cryptography. On
        | the other hand, Raft is a spiderweb of a protocol, very easy
        | to get wrong.
        
       | travisgriggs wrote:
       | This is the first I have heard of Raft, but enjoyed the
       | animations and ideas. I work on multi-node radio communications
       | for ag automation. I had two questions after watching this:
       | 
       | - Is Raft alone in this space, or are there other popular
       | algorithms/libraries that fill the same space?
       | 
        | - What happens when the node count gets larger than a
        | handful? What happens when you hit hundreds or even thousands
        | of nodes that are trying to achieve consensus? In particular,
        | the part where all of the nodes respond (semi-)simultaneously
        | to a broadcasting node. In a radio-spectrum world, that would
        | be a disaster: N:1 communication slots are choke points for
        | timely communication.
        
         | throwawaymaths wrote:
          | Paxos and Viewstamped Replication are basically the two
          | other well-known asynchronous consensus mechanisms that
          | have been mathematically verified.
          | 
          | If you just need eventual consistency, CRDTs are also
          | possible.
          | 
          | Going in the other direction, if you don't mind the latency
          | of full consensus with global locking, you could just do
          | that.
        
         | aordano wrote:
          | - There is also Paxos[0], as the most significant
          | alternative.
          | 
          | - You shouldn't have too many nodes making decisions; that
          | is usually reserved for leaders. If you have a large
          | distributed system, you can cluster the nodes or forward
          | decisions to leaders, who decide by consensus. If you
          | cluster, the leaders for each cluster can also be selected
          | by consensus. If you can't do either of those, then a
          | consensus protocol might not even be a good idea; you'd end
          | up with a sort of Merkle tree (or some sort of blockchain)
          | to make sure all the data is registered, or perhaps to
          | audit transactions. In any case, this[1] might be
          | interesting.
          | 
          | [0] https://en.wikipedia.org/wiki/Paxos_(computer_science)
          | [1] https://doi.org/10.1016/j.neucom.2016.10.011
        
         | umanwizard wrote:
          | You don't normally have hundreds or thousands of nodes
          | trying to achieve consensus. You have 3-5 nodes trying to
          | achieve consensus, and they then serve requests to the
          | other hundreds or thousands of nodes.
        
       ___________________________________________________________________
       (page generated 2022-08-16 23:00 UTC)