[HN Gopher] The Raft Consensus Algorithm (2015)
___________________________________________________________________
The Raft Consensus Algorithm (2015)
Author : oumua_don17
Score : 240 points
Date : 2023-09-03 12:22 UTC (10 hours ago)
(HTM) web link (raft.github.io)
(TXT) w3m dump (raft.github.io)
| henrik_w wrote:
| Can't resist posting this (from Classic Programmer Paintings)
|
| "Raft Consensus Algorithm Failure",Theodore Gericault, 1819
|
| https://classicprogrammerpaintings.com/post/6141087496359280...
| galenmarchetti wrote:
| If you're interested in consensus algorithms, you might enjoy
| "Reasoning about Knowledge", a book I used in a theoretical
| course on distributed systems
| (https://mitpress.mit.edu/9780262562003/reasoning-about-
| knowl...).
|
| You have to invest a bit in learning about modal logic, but once
| you get past that part, the book provides proofs of why things
| like Raft or Paxos work that are super intuitive and
| straightforward. It basically pushes the complexity of proving
| these algorithms into the logic structure used to form the
| proof. Highly recommend; it changed how I think about consensus!
| bjornasm wrote:
| Here is their answer to their own question - "What is Raft?"
|
| >Raft is a consensus algorithm that is designed to be easy to
| understand. It's equivalent to Paxos in fault-tolerance and
| performance. The difference is that it's decomposed into
| relatively independent subproblems, and it cleanly addresses all
| major pieces needed for practical systems. We hope Raft will make
| consensus available to a wider audience, and that this wider
| audience will be able to develop a variety of higher quality
| consensus-based systems than are available today.
|
| After reading that I still have no idea. They are not alone in
| doing this, but I think it's a shame that people don't spend the
| extra time and effort to properly describe their work.
| oasisaimlessly wrote:
| Did you try reading the next paragraph ("Hold on--what is
| consensus?")?
|
| Sometimes concepts are too "big" to be introduced to someone
| with no background in a single paragraph; such is life.
| However, the linked article is a great intro if you manage to
| keep reading.
| BirdieNZ wrote:
| Arguably it's a good description because it automatically
| filters out audiences who don't know what it means; if you're
| building a distributed system and looking at different
| consensus algorithms then it's a simple and clear explanation,
| but if you aren't then it's not a relevant algorithm for you
| anyway!
|
| The general class of consensus algorithms tries to solve the
| problem of what to do when you have multiple replicas of a data
| store across several physical devices and one or more of the
| devices or their connections fail in some fashion. They are
| called "consensus" because the machines need to come to a
| consensus about what decision to make about a piece of data
| when a failure event occurs.
|
| For example, you have three servers all replicating the same
| SQL database:
|
| (A) - (B) - (C)
|
| The network connection linking (C) to the other two drops; (A)
| and (B) are notified and (B) is promoted to be the primary (it
| receives the writes and then distributes them to the replicas).
|
| However, (C) doesn't know what's happened, and continues to
| receive some writes. The network connection is restored, and
| now (A) and (B) and (C) need to decide what to do. (B) and (C)
| have both independently received a different set of writes, and
| the servers need to come to a consensus on what to do with the
| data.
|
| This is what Raft, Paxos, etc. are attempting to solve in a
| consistent and performant fashion.
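|
| A toy sketch of that divergence and how Raft repairs it (the Go
| types here are hypothetical, not from any real Raft library):
|
|       type Entry struct {
|           Term int    // election term that created the entry
|           Cmd  string // the write itself
|       }
|
|       // (B) was promoted in term 2 and accepted new writes,
|       // while the partitioned (C) kept accepting writes in
|       // term 1.
|       logB := []Entry{{1, "x=1"}, {2, "x=2"}, {2, "y=7"}}
|       logC := []Entry{{1, "x=1"}, {1, "x=9"}}
|
|       // Raft's repair rule: find the last index where the logs
|       // agree, discard (C)'s conflicting suffix, and copy the
|       // leader's entries; (C)'s lone term-1 write is dropped.
|       i := 0
|       for i < len(logB) && i < len(logC) && logB[i] == logC[i] {
|           i++
|       }
|       logC = append(logC[:i], logB[i:]...)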
| EGreg wrote:
| Is Raft Byzantine-fault-tolerant though? Can it be made so?
|
| Paxos can:
| http://en.wikipedia.org/wiki/Paxos_(computer_science)#Byzant...
| intelVISA wrote:
| No, Raft is mostly for learning the basics of distributed
| consensus afaik.
| kevdev wrote:
| Raft is not Byzantine fault tolerant.
| posix86 wrote:
| It assumes no malicious agents. If you're malicious, you can
| take control of everything with a single message (advertise
| yourself as the leader of the next term).
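|
| Roughly, in sketch form (hypothetical types, not any real
| implementation -- the point is that a follower believes any
| message carrying a higher term, no proof required):
|
|       type Follower struct {
|           currentTerm int
|           leaderID    string
|       }
|
|       func (f *Follower) handleAppendEntries(term int, leader string) {
|           if term >= f.currentTerm {
|               f.currentTerm = term
|               f.leaderID = leader // no authentication whatsoever
|           }
|       }
|
|       // A malicious node just claims term = math.MaxInt64 and
|       // is treated as the legitimate leader from then on.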
| sethev wrote:
| There's a BFT version: https://www.scs.stanford.edu/17au-
| cs244b/labs/projects/clow_...
| dwheeler wrote:
| Raft cannot handle Byzantine faults by itself. There are variant
| algorithms that can.
| u320 wrote:
| Paxos is a family of algorithms of which Raft is a member. You
| can change Raft in various ways, and then it's not Raft anymore,
| but it can still be Paxos. So of course Paxos can "do more".
| henrik_w wrote:
| When I was studying the Raft algorithm a year and a half ago, I
| found this video on it by John Ousterhout to be a good
| complement:
|
| Designing for Understandability: The Raft Consensus Algorithm
|
| https://www.youtube.com/watch?v=vYp4LYbnnW8
| eatonphil wrote:
| I had a fun time recently implementing Raft leader election and
| log replication (i.e. I didn't get to
| snapshotting/checkpointing). One of the most challenging projects
| I've tried to do.
|
| The Raft paper is very gentle to read, and gives you a great
| intuition on its own. Even if you don't want to implement it, you
| probably use software that uses it: like etcd or consul or
| cockroach or tidb, etc.
|
| I collected all the resources I found useful while doing it here:
| https://github.com/eatonphil/goraft#references. This includes
| Diego Ongaro's thesis and his TLA+ spec.
|
| Some people say Figure 2 of the Raft paper has everything you
| need, but I'm pretty sure that's just not true. To me it's a
| little more vague than the TLA+ spec, anyway.
| agonz253 wrote:
| This looks great. However, without a truly comprehensive test
| suite it's guaranteed to have a great many subtle bugs :). I
| recommend trying to hook it up to the tests from MIT's
| distributed systems course:
| https://pdos.csail.mit.edu/6.824/labs/lab-raft.html.
|
| It looks as though it would take only minor refactoring, at
| least for the leader election and log replication.
| kevdev wrote:
| Figure 2 is great, but I would agree the entire paper is needed
| if you are implementing Raft. There are a few specifics in the
| paper that you need when implementing it.
|
| P.S. love your blog :)
| eatonphil wrote:
| Cheers!
|
| > Figure 2 is great, but I would agree the entire paper is
| needed if you are implementing Raft. There are a few
| specifics in the paper that you need when implementing it.
|
| It was more than that. I'm blanking on what it was but there
| were parts where I really couldn't find _anything_ about the
| intended behavior in the paper (let alone in Figure 2) except
| for in Diego's thesis or in the TLA+ spec.
|
| Though maybe I was just not reading the paper correctly.
| lpage wrote:
| Maelstrom [1], a workbench for learning distributed systems from
| the creator of Jepsen, includes a simple (model-checked)
| implementation of Raft and an excellent tutorial on implementing
| it.
|
| Raft is a simple algorithm, but as others have noted, the
| original paper includes many correctness details often brushed
| over in toy implementations. Furthermore, the fallibility of
| real-world hardware (handling memory/disk corruption and grey
| failures), the requirements of real-world systems with tight
| latency SLAs, and the need for things like flexible quorums and
| dynamic cluster membership make implementing it for production a
| long and daunting task. The commit histories of etcd and
| hashicorp/raft, likely the two most battle-tested open source
| implementations of Raft, still surface correctness bugs on the
| regular, which tells you all you need to know.
|
| The tigerbeetle team [2] talks in detail about the real-world
| aspects of distributed systems on imperfect hardware/non-
| abstracted system models, and why they chose Viewstamped
| Replication, which predates Paxos but looks more like Raft.
|
| [1]: https://github.com/jepsen-io/maelstrom/
|
| [2]:
| https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/DE...
| mananaysiempre wrote:
| > [Viewstamped replication] predates Paxos but looks more like
| Raft.
|
| Heidi Howard and Richard Mortier's paper[1] on the topic of
| Paxos vs Raft has (multi-decree) Paxos and Raft written out in
| a way that makes it clear that they are very, very close. I'm
| very far from knowing what consequences (if any) this has for
| the implementation concerns you state, but the paper is lovely
| and I wanted to plug it. (There was also a presentation[2], but
| IMO the text works better when you want to refer back and
| forth.)
|
| [1] https://doi.org/10.1145/3380787.3393681
|
| [2] https://www.youtube.com/watch?v=0K6kt39wyH0
| LAC-Tech wrote:
| The Viewstamped Replication paper was surprisingly readable -
| I'd never looked at consensus algorithms before in my life and
| I found I could kind of follow it after a couple of reads.
|
| https://dspace.mit.edu/bitstream/handle/1721.1/71763/MIT-CSA...
| m00dy wrote:
| Looks like a great playground to get familiar with distributed
| systems.
| kevdev wrote:
| I love this site. When I was learning & implementing Raft in my
| distributed systems course, this page was invaluable. Plus the
| paper itself is pretty easy to read.
| dwheeler wrote:
| I teach a class on distributed systems, and this site is one of
| the references. Thanks for making Raft so clear!
| candiddevmike wrote:
| Anyone using object storage like S3 for cluster
| coordination/election instead?
| hotpotamus wrote:
| I want to say that jgroups has an S3 coordination method. I
| came across quite a few methods while trying to set up a
| Keycloak cluster in Docker Swarm, but ultimately the Kubernetes
| DNS method worked best for my case (Swarm seems close enough
| that I was able to use it with no problem). I'd also note that
| my understanding is that Swarm uses the Raft algorithm.
| ithkuil wrote:
| Not sure about S3. IIRC they relatively recently implemented
| consistent writes.
|
| But this article explains how to "piggyback" on stores that
| provide consistent writes in order to implement your own little
| leader-election algorithm on top of them:
|
| https://cloud.google.com/blog/topics/developers-practitioner...
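|
| The core trick is small enough to sketch (the Go interface here
| is hypothetical; any store with an atomic conditional write
| works):
|
|       type KV interface {
|           // CAS writes val only if key currently holds old;
|           // it reports whether the write won.
|           CAS(key, old, val string) bool
|       }
|
|       // Whoever wins the conditional write holds the lease and
|       // acts as leader until it expires; everyone else backs
|       // off and retries later.
|       func tryBecomeLeader(kv KV, me string) bool {
|           return kv.CAS("leader-lease", "", me)
|       }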
| cweld510 wrote:
| I think S3 is too slow to put in your control plane's critical
| path; better to use something like Redis.
| maxpert wrote:
| I've written a whole SQLite replication system that works on top
| of Raft ( https://github.com/maxpert/marmot ). The best part is
| that Raft has a well-understood and strong library ecosystem. I
| started off with libraries, and when I noticed I was
| reimplementing distributed streams, I just took an off-the-shelf
| implementation (https://docs.nats.io/nats-concepts/jetstream) and
| embedded it in the system. I love the simplicity and reasoning
| that come with Raft. However, I am playing with EPaxos these days
| (https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf), because
| then I can truly decentralize things into a masterless
| implementation. Right now I've added a sharding mechanism on
| various streams so that in high-load cases masters can be
| distributed across nodes too.
| dang wrote:
| I took a crack at finding the interesting past related threads.
| Any others?
|
| (I've left out posts about particular implementations,
| extensions, and so on--there are too many. The intention is
| threads about the algorithm itself.)
|
| _Raft Is So Fetch: The Raft Consensus Algorithm Explained
| Through Mean Girls_ -
| https://news.ycombinator.com/item?id=33071069 - Oct 2022 (53
| comments)
|
| _Raft Consensus Animated (2014)_ -
| https://news.ycombinator.com/item?id=32484584 - Aug 2022 (67
| comments)
|
| _Why use Paxos instead of Raft?_ -
| https://news.ycombinator.com/item?id=32467962 - Aug 2022 (45
| comments)
|
| _In Search of an Understandable Consensus Algorithm (2014)
| [pdf]_ - https://news.ycombinator.com/item?id=29837995 - Jan 2022
| (12 comments)
|
| _Raft Consensus Protocol_ -
| https://news.ycombinator.com/item?id=29079079 - Nov 2021 (51
| comments)
|
| _Paxos vs. Raft: Have we reached consensus on distributed
| consensus?_ - https://news.ycombinator.com/item?id=27831576 -
| July 2021 (48 comments)
|
| _Raft Visualization_ -
| https://news.ycombinator.com/item?id=25326645 - Dec 2020 (35
| comments)
|
| _Raft: A Fantastical and Absurd Exploration_ -
| https://news.ycombinator.com/item?id=23129707 - May 2020 (1
| comment)
|
| _Understanding Raft Consensus_ -
| https://news.ycombinator.com/item?id=23128787 - May 2020 (3
| comments)
|
| _In Search of an Understandable Consensus Algorithm (2014)
| [pdf]_ - https://news.ycombinator.com/item?id=23113419 - May 2020
| (26 comments)
|
| _Paxos vs. Raft: Have we reached consensus on distributed
| consensus?_ - https://news.ycombinator.com/item?id=22994420 -
| April 2020 (65 comments)
|
| _Raft Is So Fetch: The Raft Consensus Algorithm Explained
| Through Mean Girls_ -
| https://news.ycombinator.com/item?id=22520040 - March 2020 (4
| comments)
|
| _Implementing Raft: Part 2: Commands and Log Replication_ -
| https://news.ycombinator.com/item?id=22451959 - Feb 2020 (16
| comments)
|
| _Building a Large-Scale Distributed Storage System Based on
| Raft_ - https://news.ycombinator.com/item?id=21447528 - Nov 2019
| (5 comments)
|
| _In Search of an Understandable Consensus Algorithm [pdf]_ -
| https://news.ycombinator.com/item?id=14724883 - July 2017 (14
| comments)
|
| _Instructors' Guide to Raft_ -
| https://news.ycombinator.com/item?id=11300428 - March 2016 (3
| comments)
|
| _Fuzzing Raft for Fun and Publication_ -
| https://news.ycombinator.com/item?id=10432062 - Oct 2015 (10
| comments)
|
| _Prove Raft Correct_ -
| https://news.ycombinator.com/item?id=10017549 - Aug 2015 (27
| comments)
|
| _Scaling Raft_ - https://news.ycombinator.com/item?id=9725094 -
| June 2015 (12 comments)
|
| _Raft Consensus Algorithm_ -
| https://news.ycombinator.com/item?id=9613493 - May 2015 (24
| comments)
|
| _Creator of Raft is speaking at our meetup. What questions do
| you want answered?_ -
| https://news.ycombinator.com/item?id=9351794 - April 2015 (6
| comments)
|
| _Replicating SQLite using Raft Consensus_ -
| https://news.ycombinator.com/item?id=9092110 - Feb 2015 (21
| comments)
|
| _Raft Refloated: Do We Have Consensus? [pdf]_ -
| https://news.ycombinator.com/item?id=9015085 - Feb 2015 (4
| comments)
|
| _Analysis of Raft Consensus [pdf]_ -
| https://news.ycombinator.com/item?id=8736868 - Dec 2014 (3
| comments)
|
| _The Raft Consensus Algorithm_ -
| https://news.ycombinator.com/item?id=8527440 - Oct 2014 (27
| comments)
|
| _Raft: Understandable Distributed Consensus_ -
| https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79
| comments)
|
| _Raft - The Understandable Distributed Protocol_ -
| https://news.ycombinator.com/item?id=6859101 - Dec 2013 (10
| comments)
|
| _Raft, a scrutable successor to Paxos_ -
| https://news.ycombinator.com/item?id=5624627 - April 2013 (2
| comments)
| jmholla wrote:
| Are there consensus algorithms that don't require changes to go
| through a leader? In many distributed systems, you want to
| distribute intake as well.
| dboreham wrote:
| Blockchains
| Rhapso wrote:
| No, those only allow 1 leader and 1 intake at a time. They
| just pick the leader by proof of waste first.
| lucb1e wrote:
| I think you may be confusing proof of work with blockchain
| as a general principle. (I'm not a fan of this
| cryptocurrency tech anymore either; my comment is purely
| about the technical properties which can be interesting to
| study.)
| Detrytus wrote:
| No, GP is right. Without Proof of Work, blockchains have
| no leader - think about a Git repo with multiple branches:
| which one is main? Nothing in the blockchain itself
| determines this; only we, humans, chose to name one of
| them "main" (formerly "master").
| lucb1e wrote:
| > Without Proof of Work blockchains have no leader
|
| But, proof of stake?
| mrkeen wrote:
| Paxos doesn't favour any node above any other node.
|
| That said, if you're trying to distribute for the purpose of
| throughput, then it might be more efficient to only need one
| leader rather than the quorum that Paxos requires. Only
| speculating though.
|
| Paxos is also more efficient if calls go to the same place each
| time - you avoid contention and revotes, etc.
| darkmarmot wrote:
| Many systems have lots of leaders to handle concurrency. Ours
| has millions of them for instance.
| havnagiggle wrote:
| There's a sharded paxos with a shardmaster. Essentially each
| shard has its own paxos consensus.
| User23 wrote:
| This paper[1] discusses "termination detection" which consensus
| can directly be built on with trivial proof. It assumes a
| single environment, and while mathematically that is a kind of
| graph node it really maps to all outside sources of input and
| not a cluster member.
|
| [1]
| https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/E...
| LAC-Tech wrote:
| I think in this case there's no longer a 'consensus' - since
| any node can accept any write immediately.
|
| Look into CRDTs, which are essentially algebraic laws for
| systems that do this and are guaranteed to eventually converge.
| Racing0461 wrote:
| The original dynamo paper (not dynamodb, its predecessor)
| didn't use leader election. it used quorom with some speed
| hacks (hinted handoffs and sloppy quorom)
| vhiremath4 wrote:
| There are algos which prioritize high fault tolerance and do so
| by increasing the number of leaders (or the minimum number of
| nodes that must have a copy of the lookup data).
|
| One such algo is Chord:
|
| https://en.m.wikipedia.org/wiki/Chord_(peer-to-peer)
|
| It's a peer-to-peer ring of nodes which have their values
| consistently hashed between them. The network leverages these
| things called "finger tables", which essentially store routing
| information in the form of a table. This table can have
| information which is incorrect or outdated, and the peer you go
| to can tell you to go to another peer (usually the
| "next"/"successor") until you find the value (or don't).
|
| The reason this algo can work with no "leader" is that it can
| also work by just going to a node and doing a linear scan
| across all nodes. You don't need a finger table to speed up
| queries.
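|
| A rough sketch of that successor walk (hypothetical Go; real
| Chord uses the finger table to cut this to O(log n) hops):
|
|       type Node struct {
|           ID        uint64
|           Successor *Node
|       }
|
|       // lookup walks the ring until the key's hash falls
|       // between a node and its successor; the successor owns
|       // the key.
|       func lookup(start *Node, keyHash uint64) *Node {
|           n := start
|           for !between(n.ID, keyHash, n.Successor.ID) {
|               n = n.Successor
|           }
|           return n.Successor
|       }
|
|       // between handles the wrap-around at the top of the ring.
|       func between(a, key, b uint64) bool {
|           if a < b {
|               return a < key && key <= b
|           }
|           return key > a || key <= b
|       }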
| toast0 wrote:
| Electing a leader and sending changes through the leader
| simplifies the system and improves throughput and
| predictability in the presence of contention.
|
| If your transactions are on independent topics, you can
| distribute the load by sharding leaders: assign ranges of the
| key space to different leaders and manipulate the elections so
| each node has a reasonable share of leadership.
|
| You can go leaderless and structure each write as essentially
| an election: broadcast the tentative transaction (or a request
| to transact, if the transaction is large enough) to all nodes;
| if you get a quorum of acceptance, you win and can commit the
| transaction. But if multiple nodes attempt transactions near
| the same time, consensus may be time-consuming. If you have
| many nodes, and they all have pending transactions for the same
| topic, electing a leader and sending all transactions through
| the leader is going to be a lot faster than establishing
| consensus on every transaction individually.
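|
| The leaderless write is a few lines if you squint past the hard
| parts (hypothetical types; real Paxos needs extra rounds to
| stay safe when proposals race):
|
|       type Peer interface {
|           // Accept reports whether the peer accepted the
|           // proposal; each peer accepts at most one proposal
|           // per slot.
|           Accept(txn []byte) bool
|       }
|
|       func propose(peers []Peer, txn []byte) bool {
|           acks := 0
|           for _, p := range peers {
|               if p.Accept(txn) {
|                   acks++
|               }
|           }
|           return acks > len(peers)/2 // quorum: safe to commit
|       }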
| lostcolony wrote:
| Or more succinctly - a quorum agreeing on a leader to
| serialize writes through is both less work than trying to get
| a quorum to agree on every individual write, and just as
| consistent.
| htowerad3242 wrote:
| It's a shame many undergraduate CS curricula are allergic to
| distributed systems and type systems.
|
| Even the grad program I was looking at is hot dog water.
|
| I've been playing with Raft and Paxos. Employers will not care as
| these were learned out-of-band from degree mills.
| badcarbine wrote:
| ELI5
| quest88 wrote:
| You need to at least try.
| badcarbine wrote:
| You need to, too.
| cweagans wrote:
| http://thesecretlivesofdata.com/raft/
| mrkeen wrote:
| * 1 computer will eventually break (or one service will crash,
| or be replaced). To keep processing requests, you need more
| than 1 computer.
|
| * Two or more computers can be fed conflicting information.
|
| * Consensus algorithms prevent such conflicts from being
| accepted by the system.
|
| * Raft is one such algorithm. It operates by choosing a leader
| and then routing all requests via that leader, so that no
| observers will see a different sequence of requests.
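|
| In sketch form (hypothetical types, heavily simplified):
|
|       type Node struct {
|           isLeader bool
|           leader   *Node
|           order    []string // the single agreed-upon sequence
|       }
|
|       func (n *Node) Handle(req string) {
|           if !n.isLeader {
|               n.leader.Handle(req) // followers forward, never decide
|               return
|           }
|           // Only the leader appends, so every observer sees
|           // requests in the same order.
|           n.order = append(n.order, req)
|       }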
| lucb1e wrote:
| If anyone else doesn't understand what the visualisation is
| supposed to show, note that you can click on one of the nodes and
| make it fail. In particular, try this with the current "leader"
| (the thing that's sending and receiving all the packets). Press
| the little pause icon next to the first slider to turn it back
| into a clock and resume the simulation.
|
| Has anyone else figured out what the spreadsheet on the right
| is? It always remains empty, so it looks broken to me (but then
| I thought the same of the rest of the simulation before
| realizing it only shows the happy flow by default). The clickable
| elements I discovered so far are the two sliders, the clock/pause
| icon, and the individual servers.
| sethev wrote:
| It shows the state of each replica's log. Click on the leader
| and select 'request' to simulate sending a command. If you take
| a replica offline, you can see that it falls behind and then
| gets caught up when it comes back.
| lucb1e wrote:
| Ah, thanks, though I pushed that button a couple times (also
| before your comment) but it didn't/doesn't do anything that I
| can tell.
| _daor wrote:
| [flagged]
| capableweb wrote:
| > ??? That isn't easy to understand
|
| Distributed systems aren't simple, but compared to Paxos, Raft
| is definitely on the simpler side. I think they also have a
| specific audience in mind (distributed systems engineers)
| rather than the public when they write "it's easy to
| understand", which makes sense.
| 63 wrote:
| > Raft is a consensus algorithm that is designed to be easy to
| understand.
|
| > Consensus typically arises in the context of replicated state
| machines, a general approach to building fault-tolerant systems.
|
| I recognize that I'm not the intended audience but I do think I
| would be a lot more capable of understanding this article if it
| used less jargon or at least defined what it meant. I'm only
| mentioning this because ease of understanding is an explicit
| goal.
|
| Can someone give a real-world example of where this would be used
| in a production app? I'm sure it's very practical, but I'm
| getting caught up in trying to understand what it's saying.
| galenmarchetti wrote:
| Consensus algorithms underlie distributed databases like
| Cassandra and distributed caches like Redis (in fact, Redis
| does use a version of Raft).
|
| Simple use case: you're running a database across three
| servers, so that users can ping any one of the servers and it
| still works (fault tolerance).
|
| User X pings server A to write something to a table. User Y
| pings server B to read from that table.
|
| How do you make sure that User Y reads what User X wrote, so
| that User X and User Y are on the same page with what happened?
|
| That's the job of the consensus algo.
| couchand wrote:
| Not to be too pedantic, but the step of actually shipping
| those changes across servers is usually outside the scope of
| consensus algorithms. Generally they are limited to picking a
| single server to act as a short-term leader. That server then
| is responsible for managing the data itself.
|
| Though you can conceive of a system where all data flows
| through the consensus algorithm, practically speaking that
| would introduce significant overhead at a granularity where
| it isn't adding value.
|
| There isn't necessarily one dictator for the whole cluster;
| rather, they are usually scoped to each domain that requires
| serialization.
| adunk wrote:
| Wikipedia has a nice list of software using the algorithm:
| https://en.wikipedia.org/wiki/Raft_(algorithm)#Production_us...
| oriettaxx wrote:
| Apparently there is no reference to Docker in swarm mode,
| where it is used to decide which node should be the leader.
| dcchuck wrote:
| Simplest terms - your app needs several different services or
| instances of a service to agree upon a value. There are a lot
| of reasons you can't use things like the system clock to agree
| on when something happened, for example - this is where Raft
| steps in.
|
| You'll see "fault-tolerant" and "replicated state machines"
| often alongside them. Let's break those down in this context.
|
| For "fault-tolerance" - think production environments where I
| need to plan for hardware failure. If one of my services goes
| down, I want to be able to continue operating - so we run a few
| copies of the app, and when one goes down, another operating
| instance will step up.
|
| In that case - how do we pick what's in charge? How do all
| copies agree on things while everything is working smoothly?
| Raft.
|
| For "replicated state machines" - let's stay in this world of
| fault-tolerance, where we have multiple instances of our app
| running. In each service resides a state machine. The
| state machine promises to take any ordered series of events,
| and always arrive at the same value. Meaning - if all of our
| instances get the same events in the same order, they will all
| have the same state. Determinism.
|
| This is where it all comes together, and why I think the jargon
| becomes tightly coupled to an "easy to understand" definition.
|
| You will reach for replicated state machines when you need
| deterministic state across multiple service instances. But the
| replicated state machines need a way to agree on order and
| messages received. That's the contract - if you give everything
| all the messages in the same order, everything will be in the
| same state.
|
| But how do we agree on order? How do we agree on what messages
| were actually received? Just because "Client A" sends
| messages "1" and "2" in a specific order does not guarantee
| they are delivered at all, let alone in that order.
|
| Raft creates "consensus" around these values. It allows the
| copies to settle on which messages were actually received and
| when.
|
| So, you could use other approaches to manage "all your service
| copies getting along" but a replicated state machine is a nice
| approach. That replicated state machine architecture needs some
| way to agree on order, and Raft is a great choice for that.
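|
| A tiny sketch of that contract (hypothetical Go, not any
| particular library):
|
|       // Counter is a deterministic state machine: the same
|       // events in the same order always produce the same value.
|       type Counter struct{ value int }
|
|       func (c *Counter) Apply(event string) {
|           switch event {
|           case "incr":
|               c.value++
|           case "reset":
|               c.value = 0
|           }
|       }
|
|       // If Raft hands every replica ["incr", "incr", "reset",
|       // "incr"] in that order, every replica ends at value 1 --
|       // no clocks involved.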
| bjornasm wrote:
| I made a similar comment. If it is meant for a wider audience,
| you would think it would be beneficial to explain it in a way
| that the wider audience understands.
| uw_rob wrote:
| Think databases which run across many different machines.
|
| Distributed databases are often conceptually modeled as a state
| machine. Writes are then mutations on the state machine.
|
| With a starting state (an empty database), if everyone agrees
| on a fixed list of mutations which are executed in a rigidly
| defined order, you will get the same final state.
|
| Which makes sense, right? If you run the following commands on
| an empty database, you would expect the same final state:
|
| 1. CREATE TABLE FOO (Columns = A, B)
| 2. INSERT INTO FOO (1, 2)
| 3. INSERT INTO FOO (3, 4)
|
| Which would be:
|
|       FOO:
|       |A|B|
|       |-|-|
|       |1|2|
|       |3|4|
|
| So where does "consensus" come into play? Consensus is needed
| to determine mutation 4. If the user can send a request to
| HOST 1 saying "mutation 4 should be `INSERT INTO FOO (5, 6)`",
| then HOST 1 will need to coordinate with all of the other
| hosts, and hopefully all of the hosts can agree that this is
| the 4th mutation and then enact that change on their local
| replica.
|
| This ordering of mutations is called the transaction log.
|
| So, why is this such a hard problem? Most of the reasons are in
| [Fallacies of distributed computing](https://en.wikipedia.org/w
| iki/Fallacies_of_distributed_compu...) but the tl;dr is that
| everything in distributed computing is hard because hardware is
| unreliable and anything that can go wrong will go wrong. Also,
| because multiple things can be happening at the same time in
| multiple places so it's hard to figure out who came first, etc.
|
| Raft is one such algorithm: it lets all of these hosts
| coordinate in a fault-tolerant way to figure out the 4th
| mutation.
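|
| In sketch form (hypothetical types; real Raft layers terms,
| retries, and leader election on top of this core idea):
|
|       type Host interface {
|           // AcceptAt reports whether the host durably logged
|           // the mutation as entry #index.
|           AcceptAt(index int, mutation string) bool
|       }
|
|       // agree succeeds once a majority of hosts have logged the
|       // mutation at the same index -- it is then, by consensus,
|       // the 4th mutation.
|       func agree(hosts []Host, index int, mutation string) bool {
|           acks := 0
|           for _, h := range hosts {
|               if h.AcceptAt(index, mutation) {
|                   acks++
|               }
|           }
|           return acks > len(hosts)/2
|       }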
|
| Disclaimer: The above is just one use of Raft. Another way Raft
| is used in distributed databases is as a mechanism for the
| hosts to coordinate a hierarchy of communication among
| themselves, and when a host in the hierarchy is having
| problems, Raft can be used again to figure out another
| hierarchy. (Think of consensus being reached on leader election
| to improve throughput.)
___________________________________________________________________
(page generated 2023-09-03 23:00 UTC)