[HN Gopher] Raft Consensus Animated
___________________________________________________________________
Raft Consensus Animated
Author : pkilgore
Score : 214 points
Date : 2022-08-16 15:57 UTC (7 hours ago)
(HTM) web link (thesecretlivesofdata.com)
(TXT) w3m dump (thesecretlivesofdata.com)
| purpleblue wrote:
| Excellent!
|
| A couple of questions:
|
| 1) In the case of a network partition, does the client that is
| currently connected to the leader get notified that there's a
| partition, or that the cluster is not in a healthy state?
|
| 2) If a client writes to the partition that will get rolled back,
| and all their transactions get rolled back after the partition
| heals, do they get notified that their data was rolled back?
| avmich wrote:
| > 1) In the case of a network partition, does the client that
| is currently connected to the leader get notified that there's
| a partition, or that the cluster is not in a healthy state?
|
| The cluster - or any server in the cluster - finds out about a
| network partition only when a timeout passes. At that point the
| leader - now the former leader - can notify the client, or the
| client can see for itself that the timeout has passed.
|
| > 2) If a client writes to the partition that will get rolled
| back, and all their transactions get rolled back after the
| partition heals, do they get notified that their data was
| rolled back?
|
| Note that the client was never notified that their data was
| committed in the first place. So if the timeout passes without
| notification, the client can assume the data wasn't committed
| in the cluster.
|
| Of course, there could still be problems between the client and
| the leader; idempotent messages could be useful there.
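|
| For illustration, a rough Go sketch of that idempotency idea
| (the Cluster interface and names here are hypothetical, not
| from any particular Raft library): the client tags each write
| with a unique request ID and reuses it on retries, so the
| cluster can deduplicate a write that actually committed:
|
|     package main
|
|     import (
|         "errors"
|         "time"
|
|         "github.com/google/uuid"
|     )
|
|     // Hypothetical client API; any real Raft client will differ.
|     type Cluster interface {
|         // Apply blocks until the command is committed or the
|         // timeout passes.
|         Apply(req WriteRequest, timeout time.Duration) error
|     }
|
|     type WriteRequest struct {
|         RequestID string // unique per logical write; reused on retry
|         Key, Val  string
|     }
|
|     func writeWithRetry(c Cluster, key, val string) error {
|         req := WriteRequest{RequestID: uuid.NewString(), Key: key, Val: val}
|         for attempt := 0; attempt < 3; attempt++ {
|             if err := c.Apply(req, 2*time.Second); err == nil {
|                 return nil // leader confirmed the commit
|             }
|             // Timeout or leader change: retry with the SAME request
|             // ID so a write that committed isn't applied twice.
|         }
|         return errors.New("write not confirmed; state unknown")
|     }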
| daenz wrote:
| Is it a weakness to only commit on majority consensus? I'm
| thinking of a very unstable global network, where partitions are
| happening everywhere. In that scenario, at most one partition can
| reach consensus (if you're lucky). If the partitions are such
| that no side has a majority, nothing can proceed.
|
| Is there a better way to proceed with tentative consensus, until
| a majority cluster can be realized, and then have a conflict
| resolution strategy? People operate this way.
| zambal wrote:
| You either need consistency or you don't. Raft is for systems
| that need this guarantee. If you don't need it, something like
| CRDTs can be used.
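|
| A minimal sketch of the CRDT route, assuming a grow-only
| counter (G-Counter) in Go: each node increments only its own
| slot, and merging takes the per-node max, so replicas converge
| without any coordination:
|
|     package main
|
|     // GCounter: a grow-only counter CRDT, one slot per node ID.
|     type GCounter struct {
|         counts map[string]int
|     }
|
|     func NewGCounter() *GCounter {
|         return &GCounter{counts: map[string]int{}}
|     }
|
|     // Inc: a node only ever increments its own slot.
|     func (g *GCounter) Inc(nodeID string) { g.counts[nodeID]++ }
|
|     // Value is the sum over all slots.
|     func (g *GCounter) Value() int {
|         total := 0
|         for _, c := range g.counts {
|             total += c
|         }
|         return total
|     }
|
|     // Merge takes the element-wise max; it's commutative,
|     // associative, and idempotent, so replicas converge no
|     // matter the merge order.
|     func (g *GCounter) Merge(other *GCounter) {
|         for id, c := range other.counts {
|             if c > g.counts[id] {
|                 g.counts[id] = c
|             }
|         }
|     }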
| kortex wrote:
| I'm not sure Raft is the best distributed consensus algorithm
| for a global, unstable, frequently-partitioning network. I
| think it's in its niche when leaders are running on _fairly_
| stable networks (>1-2 nines) and the main source of node
| failures is task cycles / rolling deploys.
|
| I've played around with Hashicorp Consul on "edge boxes" -
| long-haul, wirelessly-connected embedded computers with
| unreliable power supplies. Allowing edge boxes to be Consul
| leaders results in all kinds of mayhem: split-brain situations,
| corrupted state, stale DNS resolutions (Consul handles DNS as
| well), cats and dogs living together, mass hysteria. A much
| better topology is to have 3 server nodes on a LAN as the "head
| cluster" and to let all the edge boxes be clients of the head.
|
| I haven't used it, but Consul has a multi-datacenter mode that
| I believe is designed to better handle such a situation, with a
| dedicated Raft cluster per datacenter.
|
| https://learn.hashicorp.com/tutorials/consul/federation-goss...
| aaronblohowiak wrote:
| This is a consistent and partition-tolerant system; what you
| are describing is an available and partition-tolerant system,
| but not one that can provide consistent results. (That you
| cannot have all three properties is called the CAP theorem;
| some people say they have all three, but they just put a tight
| bound on unavailability and claim it doesn't exist.) There are
| a variety of ways to achieve availability and partition
| tolerance, with conflict resolution implemented by either the
| database or the application.
| alpb wrote:
| Unsolicited feedback: use fewer text-appear animations, and allow
| people to skip through stuff. I've spent a full minute clicking
| next next and still haven't seen a visualization aside from text
| slides loading slowly with animations. It's like a long YouTube
| ad that you cannot skip.
| benbjohnson wrote:
| Author here. Yeah, I think I'll go with actual video for future
| visualizations. I made this visualization about 10 years ago
| and going back to it I feel the same way about the slowness. At
| least with video you can run it at 2x. :)
| [deleted]
| [deleted]
| kortex wrote:
| Arrow keys worked for me, but sadly the back arrow didn't go to
| the previous animation (Firefox, Mac).
| tootallgavin wrote:
| yep, basically no animation, just explanation
| wyldfire wrote:
| I like the animation because it shows the dynamic behavior.
|
| But the slow nature of the introduction to the elements on
| each incremental click is a bit irritating.
|
| I'd recommend static image(s) with legend/highlights for each
| node and message, etc. And animations for each relevant
| scenario illustrated.
| thanatos519 wrote:
| I didn't even make it past the introduction.
| beckingz wrote:
| Agreed!
|
| I really like this, but not being able to go slightly faster
| with arrow keys was aggravating.
|
| Cool explanation though!
| wowokay wrote:
| What did you guys run this on? I have had zero issues with it
| and the animations and progression felt guided and informative.
| phailhaus wrote:
| ...for you. You may love the speed, but it's not right for
| everyone. When designing interactive interfaces like this,
| it's important to cede control to the user so that they can
| choose the rate at which they consume content. Otherwise,
| half your users won't like it and bail.
| edfletcher_t137 wrote:
| This is genuinely lovely and informative. Thank you!
| grogenaut wrote:
| Ugh... the animations are so slow... I read much, much faster
| than that, but it feels like playing through an old JRPG that
| doesn't let you speed up the text playback.
| kretaceous wrote:
| I've only heard the Raft consensus algorithm thrown around in a
| few GitHub repos/HN comments but never got a chance to really
| get to know it.
|
| This webpage cleared up some long-standing doubts about what
| distributed computing means, what a consensus algorithm is, and
| what this Raft thing is.
|
| Kudos to the developer. You got a newbie interested in the field!
| stevewatson301 wrote:
| On a related note, I've found
| https://martinfowler.com/articles/patterns-of-distributed-sy...
| to be quite instructive in understanding distributed systems in
| general.
| majewsky wrote:
| Great timing. I'm part of a German podcast on fundamentals of
| computing [1], and we just recorded an episode on Distributed
| Systems that discusses Raft as an example. We will probably be
| adding an addendum to link to this.
|
| [1] https://www.schluesseltechnologie-podcast.de
| cphoover wrote:
| It would be cool to also see an animated visualization of the
| Paxos consensus algorithm.
| harveywi wrote:
| Indeed. An animation by Terry Gilliam with each of the
| distributed processes represented by Leslie Lamport wearing a
| different disguise.
| dec0dedab0de wrote:
| I ran into this while setting up Hashicorp Vault a year or two
| ago. It was good at helping me understand what's happening, but I
| don't particularly like Raft. I want to be able to recover from
| one server, and I don't want to have to wait for a majority on
| every transaction should I add many servers. I know it's an
| impossible problem to solve generally, but I think in many
| situations an alert saying some specific data had a conflict and
| might not have been resolved correctly is a much better outcome
| than an outage.
| greiskul wrote:
| If you don't want distributed consensus, don't use a
| distributed consensus algorithm. Raft/Paxos is not the best fit
| for every problem, but for problems where you NEED to ensure
| consistency, it is the best tool for the job. And while it
| could theoretically have outage problems, Google's Chubby lock
| service, written using Paxos, has such high availability in its
| global instance that the SREs introduce artificial mini-
| outages, just so dependent services don't assume it has a 100%
| SLA.
| outworlder wrote:
| > Google's Chubby lock service, written using Paxos, has such
| high availability in its global instance that the SREs
| introduce artificial mini-outages, just so dependent services
| don't assume it has a 100% SLA
|
| That's fascinating. Got more information on that?
| beckingz wrote:
| I think there was something about it in the Google SRE
| book?
|
| https://sre.google/sre-book/table-of-contents/
| joek1301 wrote:
| See "The Global Chubby Planned Outage" on this page:
|
| https://sre.google/sre-book/service-level-objectives/
| outworlder wrote:
| Oh. I had read the book before, but that snippet simply
| disappeared from my mind.
|
| Thank you!
| cooze wrote:
| Is this similar to how the Ethereum network operates? This is
| an awesome animation.
| latchkey wrote:
| Not at all. Today, ETH is PoW-based for consensus. It is moving
| to PoS in the future.
|
| ETH has to deal with at least one thing that Raft doesn't have
| to deal with... bad actors trying to inject bad data into the
| system, also known as the Byzantine generals problem [1].
|
| [1] https://en.wikipedia.org/wiki/Byzantine_fault
| ollemasle wrote:
| More generally, the Raft page on GitHub lists some good resources
| on that subject (including that really good animation):
|
| https://raft.github.io/
| bberrry wrote:
| I've had a surprisingly hard time finding a bare-bones Raft
| implementation in Java purely for leader election.
|
| During the same hunt, I was also surprised to find that there
| is no common way to do leader election among pods in
| Kubernetes.
| vultour wrote:
| Operator Framework (and, I assume, the upstream k8s Go library)
| provides leader election.
| dharmab wrote:
| How long ago was this? There is now a native Lease resource
| which allows you to piggyback off the etcd consensus.
|
| https://kubernetes.io/docs/reference/kubernetes-api/cluster-...
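|
| For reference, a condensed Go sketch using client-go's
| leaderelection package with a Lease lock (the lease name,
| namespace, and timings here are placeholders):
|
|     package main
|
|     import (
|         "context"
|         "os"
|         "time"
|
|         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
|         "k8s.io/client-go/kubernetes"
|         "k8s.io/client-go/rest"
|         "k8s.io/client-go/tools/leaderelection"
|         "k8s.io/client-go/tools/leaderelection/resourcelock"
|     )
|
|     func main() {
|         cfg, err := rest.InClusterConfig()
|         if err != nil {
|             panic(err)
|         }
|         client := kubernetes.NewForConfigOrDie(cfg)
|
|         lock := &resourcelock.LeaseLock{
|             LeaseMeta: metav1.ObjectMeta{
|                 Name:      "my-app-leader",
|                 Namespace: "default",
|             },
|             Client: client.CoordinationV1(),
|             LockConfig: resourcelock.ResourceLockConfig{
|                 Identity: os.Getenv("POD_NAME"), // unique per pod
|             },
|         }
|
|         leaderelection.RunOrDie(context.Background(),
|             leaderelection.LeaderElectionConfig{
|                 Lock:          lock,
|                 LeaseDuration: 15 * time.Second,
|                 RenewDeadline: 10 * time.Second,
|                 RetryPeriod:   2 * time.Second,
|                 Callbacks: leaderelection.LeaderCallbacks{
|                     OnStartedLeading: func(ctx context.Context) {
|                         // leader-only work until ctx is cancelled
|                     },
|                     OnStoppedLeading: func() {
|                         // lost the lease; stop leader-only work
|                     },
|                 },
|             })
|     }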
| waynesonfire wrote:
| Have there been any notable amendments made to the protocol,
| whether to improve correctness or performance?
| benbjohnson wrote:
| Author here. I made this visualization over a decade ago and I'm
| glad it's still useful for folks! Let me know if you have any
| questions.
|
| I've also been trying, on and off, some different techniques
| for doing the visualization, as I'd like to do more of these.
| I'm currently looking at trying to make it work with
| Remotion[1]. The JavaScript version I did for Raft was time-
| intensive and I ended up having to write an entire (albeit
| terrible) implementation of Raft to even get it to work. lol.
|
| [1] https://www.remotion.dev/
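|
| For anyone curious what the moving parts look like, the heart
| of leader election is a per-follower randomized timeout -
| roughly like this Go sketch (simplified far past any real
| implementation, including mine):
|
|     package main
|
|     import (
|         "math/rand"
|         "time"
|     )
|
|     // Raft randomizes the election timeout (e.g. 150-300ms) so
|     // followers rarely time out together and split the vote.
|     func electionTimeout() time.Duration {
|         return time.Duration(150+rand.Intn(150)) * time.Millisecond
|     }
|
|     func runFollower(heartbeats <-chan struct{}, startElection func()) {
|         timer := time.NewTimer(electionTimeout())
|         defer timer.Stop()
|         for {
|             select {
|             case <-heartbeats:
|                 // Leader is alive: reset with a fresh random timeout.
|                 timer.Reset(electionTimeout())
|             case <-timer.C:
|                 // No heartbeat in time: become a candidate and
|                 // request votes.
|                 startElection()
|                 return
|             }
|         }
|     }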
| doctor_eval wrote:
| It's awesome. Thanks for this. I kinda-sorta understood how it
| worked from watching logs of systems that used Raft, but seeing
| it clearly like this made me say "oooh!" a couple of times.
| aaronblohowiak wrote:
| Fascinating! Thank you. Perhaps eventually the work of Heidi
| Howard would inspire a domain model that would work for
| multiple consensus algorithms? Great work, visualizations help
| a lot
| onlyrealcuzzo wrote:
| LOL - I was wondering how you would do this without actually
| implementing Raft.
|
| It appears you actually did implement it!
| Vervious wrote:
| Previous discussion (in 2020):
| https://news.ycombinator.com/item?id=25326645
|
| Also, I personally think the current blockchain literature is
| much more intuitive and easier to follow for learning about
| consensus. The Byzantine case isn't really that different from
| the crash case if we assume cryptography. On the other hand,
| Raft is a spiderweb of a protocol, very easy to get wrong.
| travisgriggs wrote:
| This is the first I have heard of Raft, but enjoyed the
| animations and ideas. I work on multi-node radio communications
| for ag automation. I had two questions after watching this:
|
| - Is Raft alone in this space, or are there other popular
| algorithms/libraries that fill the same space?
|
| - What happens when the node count gets larger than a handful?
| What happens when you hit hundreds or even thousands of nodes
| that are trying to achieve consensus? In particular, the part
| where all of the nodes respond (semi) simultaneously to a
| broadcast node. In a radio spectrum world, that would be a
| disaster. N:1 communication slots are choke points for timely
| communication.
| throwawaymaths wrote:
| Paxos and Viewstamped Replication are basically the two other
| most well-known asynchronous consensus mechanisms that have
| been mathematically verified.
|
| If you just need eventual consistency, CRDTs are also possible.
|
| Going in the other direction, if you don't mind the latency of
| full consensus with global locking, you could just do that.
| aordano wrote:
| - There is also Paxos[0] as the most significant alternative.
|
| - You should not have too many nodes making decisions; that is
| usually reserved for leaders. If you have a large distributed
| system, you may cluster the nodes or forward decisions to
| leaders, who decide by consensus. If you cluster them, the
| leaders of each cluster can also be selected by consensus. If
| you can't do any of those, then having a consensus protocol
| might not even be a good idea; you'd end up with a sort of
| merkle tree (or some sort of blockchain) to make sure all the
| data is registered, or to audit transactions. In any case,
| this[1] might be interesting.
|
| [0] https://en.wikipedia.org/wiki/Paxos_(computer_science) [1]
| https://doi.org/10.1016/j.neucom.2016.10.011
| umanwizard wrote:
| You don't normally have hundreds or thousands of nodes trying
| to achieve consensus. You have 3-5 nodes trying to achieve
| consensus and then serving requests to the other hundreds or
| thousands of nodes.
___________________________________________________________________
(page generated 2022-08-16 23:00 UTC)