[HN Gopher] Implementing a distributed key-value store on top of...
___________________________________________________________________
Implementing a distributed key-value store on top of implementing
Raft in Go
Author : simjue
Score : 219 points
Date : 2023-05-25 13:26 UTC (9 hours ago)
(HTM) web link (notes.eatonphil.com)
(TXT) w3m dump (notes.eatonphil.com)
| didip wrote:
| Big fan of Phil!
|
| A lot of us tinkered like him, but very few came back and write
| the findings nicely in an easy to read form.
| eatonphil wrote:
| Glad to hear it! :)
| yi_xuan wrote:
| Recently, I have been watching the course MIT6.824. This article
| appeared very timely :)
|
| Here is another raft implementation in Go
| https://github.com/eliben/raft
| siftrics wrote:
| This is essentially do-it-yourself etcd.
|
| https://github.com/etcd-io/etcd
| esafak wrote:
| And tidb
| moderation wrote:
| TIKV [0] is closer (written in Rust).
|
| 0. https://github.com/tikv/tikv
| bit_flipper wrote:
| Your note about encoding/gob being inefficient is somewhat
| accurate for how you're using it, but I want to talk a bit about
| how you could improve your use.
|
| encoding/gob is intended for streams, not stateless
| marshals/unmarshals. The first thing that is sent over the stream
| is the type information the receiver should expect, that's why
| your payload was so large. After the first type is received,
| subsequent messages are much smaller. You can see this by
| extending your example to do multiple writes; each write after
| the first is only 10 bytes: https://play.golang.com/p/Po_iaXrTUER
|
| You have to plan differently, but you could get large
| improvements to transmission sizes by changing to append only
| files and creating the gob encoder once per file. If you find
| you're creating a gob encoder/decoder very often, that's a
| telltale sign you're not using it as intended.
| eatonphil wrote:
| Another thing I glossed over (unintentionally) is that I only
| started limiting batch size after I switched to the custom
| encoder. So persist() being a function of length wouldn't be
| quite true anymore.
|
| However, I still keep seeing encoding/gob high up in the
| profiler taking a lot of time doing reflection during RPC
| calls.
|
| So it does still seem like it's not ideal. Though I may still
| just not be understanding how to use net/rpc correctly either.
| candiddevmike wrote:
| > encoding/gob is intended for streams, not stateless
| marshals/unmarshals
|
| I don't understand this within the context of the rest of your
| comment. I use gob for marshaling stuff to storage all the
| time, I'm not aware of a better way to do that (serialize data
| to binary).
| bit_flipper wrote:
| Sorry, I should have been more precise. encoding/gob is not
| optimized for situations where you create an encoder or
| decoder, read/write a single value, then discard that
| encoder/decoder. As the author noted, payloads for a single
| call to Encode() are quite large. Additionally, re-
| instantiating a gob encoder for each call to Encode() is very
| expensive allocation-wise and benchmarks where this happens
| will show gob to perform poorly in these scenarios. You can
| certainly still use gob this way, and if the performance
| works for you then have at it! But it performs significantly
| better in situations where you make multiple calls to
| Encode() with the same encoder.
| throwaway894345 wrote:
| The parent is saying that gob includes the type information
| in the serialized payload. This is extraneous information in
| many cases, and the result is an unnecessarily large payload.
| dotnwat wrote:
| epic
| cheeseprocedure wrote:
| I was lucky enough to take David Beazley's "Rafting Trip," a
| five-day training course that guides each student through
| building their own Raft implementation from scratch:
|
| https://www.dabeaz.com/raft.html
|
| I'd recommend the course to anyone with development experience
| working with or near distributed systems. David is a fantastic
| instructor and facilitator, and the blend of student backgrounds
| led to some great learning and discussion.
|
| (I have no financial or personal interest here; I just loved the
| course.)
| geospeck wrote:
| Thanks for the post!
|
| Another great blog post series about implementig Raft in Go that
| I found is this one
| https://eli.thegreenplace.net/2020/implementing-raft-part-0-...
| eatonphil wrote:
| Yes I definitely referred to his posts and project while
| working on this. I'm a big fan of his blog!
| Xeoncross wrote:
| These kinds of posts are my favorite part of HN. The deep dives
| into LLM's, state machine theory, fourier transforms, locking
| data structures, and memory allocation that is miles above the
| basic posts around the internet you get with searching - but not
| quite a full book yet.
|
| The author spent 7 months tinkering and cared enough to come back
| and share that with us.
| beoberha wrote:
| Phil rocks. Highly suggest following him on twitter. He's
| pretty outspoken about writing. This is a great post of his
| that I need to reflect on more: https://notes.eatonphil.com/is-
| it-worth-writing-about.html
| eatonphil wrote:
| Glad that post resonated! <3
| quaintdev wrote:
| I wish there was a separate catalog of these posts which I
| could browse through.
| restlake wrote:
| Cool post. Wonder if it would have helped to take a look at the
| MIT distributed systems course on the web and YouTube - one of
| the projects is exactly this (Go Raft implementation)
|
| https://pdos.csail.mit.edu/6.824/labs/lab-shard.html
| eatonphil wrote:
| I just don't love learning from videos personally. But I'm sure
| those are good if you do!
| powerset wrote:
| I highly recommend MIT open courseware 6.824. Incredibly valuable
| for learning distributed systems, and one of the lab assignments
| is implementing raft in Go.
| http://nil.csail.mit.edu/6.824/2022/schedule.html
|
| There are a ton of fascinating and potentially frustrating edge
| cases and gotchas to implementing raft correctly. There's no
| better way to understand it than actually implementing it, and I
| probably never would have done it myself without these course
| materials.
| rochak wrote:
| Unfortunately, I hit a roadblock while implementing the Raft
| assignment. I knew it was simply beyond my capabilities but
| would have made through if I had anyone I could reach out to. I
| second this recommendation, but make sure you know what you are
| signing up for. This course is as hard as they come.
| valzam wrote:
| I have found the performance tests very tricky to get to pass
| without having any input from others. The assignment is
| really very unforgiving, I would wager the test suite is
| comparable to how commercial Raft implementations are tested
| (e.g. https://github.com/hashicorp/raft)
| leetrout wrote:
| So kinda like "build your own consul"
| eatonphil wrote:
| I'm pretty sure consul builds on top of hashicorp's Raft
| implementation so I'd say end result, yes, but that this post
| goes into the Raft implementation.
|
| And I'm sure consul does plenty my silly key-value state
| machine doesn't.
|
| I'm not super familiar with it.
| roncesvalles wrote:
| Reminded me of etcd, which is exactly this - Raft-based key-
| value store written in Go.
|
| https://github.com/etcd-io/etcd
| fredski42 wrote:
| Here is nice interactive and visual representation of how raft
| works: http://thesecretlivesofdata.com/raft/
| thinkharderdev wrote:
| That's really cool!
| hintymad wrote:
| A side topic: how can a company find people like the author, who
| can really articulate how to build a system? I've been
| interviewing senior engineers for years, and more than 95% of
| them, if not more, are really box drawers. They throw terms
| around confidently but fail miserably the moment I ask for
| specifics. No, I don't ask for details like how Raft works. That
| level of details probably would fail all the candidates. Just
| simple things like how you organize the data, how you route the
| traffic, how you manage metadata, or how your data path keeps
| metadata in sync. Really basic ones, yet most interviewers fail.
| Indeed, the more senior they are, the more likely they lose touch
| of the details -- even those who claim to work on distributed
| databases.
| pavluha wrote:
| Try to interview more senior people, who still remember their C
| course in an University, or even know what is assembler.
| Currently, most graduates (from my experience) don't know how
| an integer is stored in memory.
| mydogcanpurr wrote:
| I don't even know what a good answer to that is. Some
| integers are stored in registers. Some are stored on the
| stack. Some are stored on the heap. Or you could be asking
| how integers are represented, with two's complement being the
| usual way. Did I pass your ambiguous question? Probably not.
| qznc wrote:
| You find them via their blogs.
| bombela wrote:
| Where do you find the jobs where people actually care about
| engineering? Because real work is mostly careful plumbing
| without breaking what works, but in interviews you must
| leetcode. Real systems only work when kept reasonably simple,
| but system design interviews are all about drawing boxes.
| BossingAround wrote:
| I recently interviewed with a company, and the interviewer
| asked me "how do you organize data." I wasn't sure if they
| wanted to talk about classes, modules, databases, k-v stores,
| hashing data and routing distributed requests to the same pods.
| I asked, and they answered "I mean in general, how do you
| organize data."
|
| After talking for a bit about pretty much all of the above, the
| interviewer asked "have you used dictionaries?"
|
| The reason I'm telling the story is, if a lot of your
| candidates fail to answer your questions, the problem might be
| in the question.
| skippyboxedhero wrote:
| I have had that exact experience before. It is fine, you
| realise the company is toasted and you move on (the place
| that asked me this had just got rinsed by an offshore
| consulting firm, ecomm site built with a graph database,
| comedic stuff).
| maximinus_thrax wrote:
| > After talking for a bit about pretty much all of the above,
| the interviewer asked "have you used dictionaries?"
|
| Hahaha classic.
|
| I was once in an interview (which I failed) and was asked a
| problem related to some sort of monotonic queue. I wrote the
| solution and was going through it, when the interviewer
| asked: 'what CS concept are you using in this solution'? I
| didn't really understand what he was asking for and in turn I
| asked for more clarification like 'Are you referring to the
| data structures I used? Or the technique (it was some sort of
| greedy)?' 'no'. After a couple of back-and-forth questions
| and answers, he finally tells me he was looking for me to say
| 'a state machine'. Seriously?
| no_wizard wrote:
| I would have halted for a moment and asked something like:
| _organize data for what purpose?_ At which point I would
| expect there to be some amount of clarification. As it sounds
| to me like they 're talking about organizing some set of data
| for lookup, as opposed to organizing data around how it flows
| through an application, for example
| aranw wrote:
| I often find when you are given questions like that no
| matter how you probe you get a really defensive interviewer
| who doesn't want to give away the answer they are
| expecting. I've been in very similar situations with vague
| questions and I've tried to probe for more details and just
| been met with defensiveness with an attitude that says they
| expect me to know exactly what they mean
| d0mine wrote:
| It is worse: your attempt to clarify the question may be
| regarded by some interviewers as a sign you are not
| senior dev (they think [erroneously in my opinion] that
| "senior" means that you must figure out fine details
| yourself).
| maximinus_thrax wrote:
| One a rare occasion I actually got feedback after the
| interview (it was for a startup), It was because I asked
| 'too many questions' it was decided that I'd need way
| more mentoring than they can afford for the senior
| position I was applying for.
| no_wizard wrote:
| This seems odd to me, though its entirely possible my
| experience thus far hasn't been this. I've been told at
| the last 3 places I worked that one of the reasons I was
| hired was how inquisitive I was during the interview
| stage, asking _lots_ of questions before coming up with
| solutions
|
| I am most definitely not ruling out what you're saying
| may be common case and I am just very lucky, but I do
| want to act as a sort of counter point to see if others
| can weigh in so we can get more of an industry sense
| around this
| zimpenfish wrote:
| Anecdatum: my experience has been the same as others -
| asking questions about the interview questions generally
| leads to a bad outcome. In 25 years, I reckon it's less
| than a handful of times where asking questions of the
| interviewers has actually got a positive response.
|
| Hell, even when in jobs, asking questions about projects
| I'm assigned has sometimes got a negative response...
| pavluha wrote:
| There are plenty of different types of dictionaries,
| organizing their data differently.
| nitwit005 wrote:
| You seem confident you wouldn't have failed the author of this
| blog post, but I wouldn't be so confident.
|
| Interviews are a high pressure situation where people want an
| immediate answer, without much of any time to think, let alone
| research options or run an experiment. You get people's
| absolute worst possible results.
| convolvatron wrote:
| I have worked in distributed databases. I can't seem to find
| any host organization that's actually building anything and
| really cares about those details. I know several other senior
| systems engineers in the same boat. where are you guys hiding?
| bastardoperator wrote:
| Maybe you're the problem? Looking at your questions, you're
| asking for opinions, not answers or solutions. Putting people
| on the spot is bad form in my opinion. If I have a technical
| challenge I want you to solve, I'll give you the time for that
| so I can actually see your work and discuss it with you.
|
| I've seen people that can talk the talk but not walk the walk
| and vice versa, they can't articulate well, but their approach
| and work speaks for itself.
| hintymad wrote:
| Maybe I am. For a senior enough role, my questions are open
| ended. I'd expect the candidate to drive conversations and
| examine alternatives. I ask follow-up questions based on what
| the candidate mentions. Yeah, such may be indeed their
| opinions, but at least they should be opinions derived from
| assumptions, constraints, and fundamentals.
| darth_avocado wrote:
| That's a terrible way to interview. Interviews are high
| stress, time constrained settings where you're talking to
| people you don't know. If you want thoughtful answers, you
| need to be precise in what you're asking. You can't be like
| "tell me about yourself" and then complain I didn't get
| from the candidate whether steak is their favorite food.
| skippyboxedhero wrote:
| Do you understand the difference between writing a blog post
| about something and being able to talk about it in an
| interview? The majority of work in software is not being an
| "expert" on any one thing because the one thing that employers
| require is changing quite frequently. Rather it is being able
| to build up experience and carrying knowledge forward to new
| areas.
|
| Most technical interviews are conducted in a completely inane
| way where the goal is to memorize pieces of information. If
| that was how software worked, the best engineer would just be
| Google. 95% of people who know enough to build this system
| would likely be unable to answer your questions because, more
| than likely, they do not currently work in a job that requires
| them to use this information every day. This does not mean they
| do not understand it or are unable to build it.
|
| The reason why companies can't find people like this is a
| combination of not understanding what software development is
| (and not understanding people, it is really the blind leading
| the blind but technical people are often as bad as
| interviewing).
|
| You are asking completely wrong questions. You aren't starting
| from the right place.
| hintymad wrote:
| > Rather it is being able to build up experience and carrying
| knowledge forward to new areas.
|
| Then I'd expect the candidate could articulate her experience
| and knowledge in some way, no? Otherwise, how would I know
| the candidate has the built-up expertise? Of course, I assume
| we can only have interviews. Otherwise, we can have other
| means, like mini-project, onsite project, or a writeup. Some
| candidates do like the alternatives more, and some not.
|
| > where the goal is to memorize pieces of information
|
| I disagree. The goal is to see if a candidate does have what
| the candidate herself claims to know. I find it hard to
| imagine that a candidate claims to be an expert in a field
| yet couldn't articulate even one thing in depth for hours if
| not days, let alone 30 minutes of interview time. Note this
| is not about any specific details, but the general picture
| and insights that an expert can convey. This is like a PhD
| oral defense. The candidate talks about the topic that _the
| candidate_ is familiar with, and the professors dive in on
| such topics. I don 't see what's wrong with that.
| skippyboxedhero wrote:
| Candidates aren't defending a PHd. The interviewer is
| nowhere near competent enough for this.
|
| > The goal is to see if a candidate does have what the
| candidate herself claims to know.
|
| The process you are using does not tell you that. You will
| notice that your explanation is full of things that you
| expect, not based in an understanding of how reality
| actually works. And the result is, unsurprisingly, one
| where you do not understand the output.
| samsquire wrote:
| Thank you for the write up eatonphil.
|
| I experimentally implemented Raft in Java but I am not very
| confident that I did it correctly.
|
| I wish there was a way to implement stateful programs that
| guarantee "forward progress" and are "steady state systems". I
| think essentially a state machine that cannot be trapped in a
| state. Debugging the absence of something of forward moving
| progress or lack of causation is very difficult.
|
| When there's essentially different actors in the system and they
| can interact with eachother by communicating, they each have a
| number of states they can get into. There's no guarantee that the
| system shall converge on a state that forward progress can be
| made. Maybe TLA+ is the right answer here.
|
| YMMV but I think (my) reasoning over stateful systems is rather
| difficult, I think there's lots of hidden states that we cannot
| easily detect or reason about because they're in our blind spots.
| Especially related to synchronization. I think it's part of what
| makes multithreading and distributed systems so hard, because
| every component can be in a different state and if something is
| not where it is expected to be, the baton doesn't get passed to
| the correct state. If you check for something too early, you have
| a race condition.
|
| If we could see in slow motion what was going on, an interaction
| between different actors, we could work out why something happens
| the way it does. But usually the logs are too numerous to get to
| this detail. I think animation can save us, but what does a Raft
| animation look like?
|
| How often have you seen an endless spinner? It's as if a
| completion event was raised but didn't get detected and the
| system is waiting for something that shall never occur. I want
| this kind of error to be impossible. This is one form of hidden
| state that prevents progress.
|
| I wrote an eventually consistent mesh protocol in Python and
| tested it with Jepsen, it is not linearizable because the
| consistency level is "eventually consistent".
|
| I don't understand how Raft can scale writes or reads across
| multiple machines due to the round trip time talking to other
| nodes.
| qznc wrote:
| I think you should try TLA+. I found it surprisingly easy:
| https://beza1e1.tuxen.de/tla-plus.html
|
| Still haven't found an opportunity to use it professionally
| though.
| no_wizard wrote:
| Sounds like a job for state machines like you can build out
| with a library like xstate[0] (though I'm sure there are
| similar libraries in whatever language you choose. Python has
| one called automat[1])
|
| These exist to formalize state logic (current state, computing
| state, transitioning state etc), you can even produce diagrams
| based on their definitions. Advanced libraries like xstate even
| have Actors are part of the core of the library
|
| [0]: https://stately.ai/docs/xstate
|
| [1]: https://github.com/glyph/Automat
| tbrockman wrote:
| Don't have anything to comment on the rest, but my
| understanding with respect to:
|
| > I don't understand how Raft can scale writes or reads across
| multiple machines due to the round trip time talking to other
| nodes.
|
| Is that at the point where that becomes a bottleneck you scale
| to multiple clusters, where the read/write destination cluster
| is determined by a key and whatever your preferred hashing
| mechanism is.
| DylanSp wrote:
| Microsoft has a library/tool called Coyote* that helps with
| testing distributed systems; you can write
| tests/specifications, Coyote will systematically explore
| nondeterminism in your system and check if your tests still
| pass. If there's a failure, it'll show the sequence of events
| that led to the failing test.
|
| I started a project to implement Raft with a KV-store on top,
| similar to the article, meaning to use Coyote to test it; I
| didn't get that far before losing interest, though. It's
| reassuring to read that it took Phil several months to write
| the code in the post, it's good to know that this is a
| decidedly nontrivial problem.
|
| * https://github.com/microsoft/coyote
___________________________________________________________________
(page generated 2023-05-25 23:00 UTC)