[HN Gopher] A distributed systems reading list
       ___________________________________________________________________
        
       A distributed systems reading list
        
       Author : davidw
       Score  : 256 points
       Date   : 2024-02-08 15:38 UTC (7 hours ago)
        
 (HTM) web link (ferd.ca)
 (TXT) w3m dump (ferd.ca)
        
       | nonrandomstring wrote:
       | A good list of concepts and resources. I'll just mention Andrew
       | Tanenbaum's "Distributed Operating Systems" (Prentice-Hall 1995)
       | which was my entry point.
        
       | eachro wrote:
       | When people learn about distributed systems outside of work, how
       | do they actually get hands on experience with it (assuming they
       | don't go spinning up a bunch of machines on aws/gcp/azure/etc)? I
       | find it easiest to learn by doing, writing simple proof of
       | concepts but that seems a bit harder to do in this area than
       | others? What is the hello world/mnist of messing around with
       | distributed systems?
        
         | TrustPatches wrote:
         | Gossip Glomers might be fun if you're looking for some hands-on
         | exercises :)
         | 
         | https://fly.io/dist-sys/
        
         | Jtsummers wrote:
         | 1. A bunch of communicating local processes.
         | 
         | 2. A bunch of communicating local VMs (easier with a beefier
         | machine like my current desktop).
         | 
         | 3. Mininet (there are other options) to simulate a network
         | environment, can fully control the topology very easily.
         | Lighter weight than (2), more control for simulating different
         | network effects than (1) alone.
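A minimal sketch of option (1), two communicating local processes, using only the Python standard library. The `pong:` line protocol and the names `CHILD_SOURCE` and `ping_child` are made up for illustration; they are not from the thread.

```python
import subprocess
import sys

# Source of the child "node": echo every line it receives, prefixed with "pong:".
CHILD_SOURCE = r"""
import sys
while True:
    line = sys.stdin.readline()
    if not line:          # parent closed the pipe: shut down
        break
    sys.stdout.write("pong:" + line)
    sys.stdout.flush()
"""


def ping_child(messages):
    """Start a second local process and exchange one line per message with it."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    replies = []
    try:
        for msg in messages:
            proc.stdin.write(msg + "\n")
            proc.stdin.flush()
            replies.append(proc.stdout.readline().rstrip("\n"))
    finally:
        proc.stdin.close()   # signals end-of-input to the child
        proc.wait()
        proc.stdout.close()
    return replies


if __name__ == "__main__":
    print(ping_child(["hello", "world"]))
```

From here the interesting experiments are the failure cases: kill the child mid-conversation, delay its replies, or start several children and have them talk to each other.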
        
         | John23832 wrote:
         | Frankly, just build something.
         | 
         | Use a small k8s distro (kind, minikube, k3s) and build
         | something that talks amongst itself and is resilient.
        
           | ActionHank wrote:
           | Sites like leetcode are great for coding and improving,
           | because you get to compare your solutions to those of others.
           | Sadly just building something on your own helps you learn the
           | moving parts, but not optimal, neat, or best practice
           | solutions.
        
             | John23832 wrote:
             | Sites like leetcode overindex on the rote abilities that
             | you can stamp out. Actually building something is exploring
             | in a creative way which builds a deeper understanding.
             | 
             | If you want to be "optimal, neat, or best practiced" read a
             | book, and get stuck in tutorial hell. If you actually want
             | to learn how to do something, literally go do it. Nobody
             | has ever built anything of value (whether that is
             | financial, intellectual, or emotional) by leetcoding.
        
               | ActionHank wrote:
               | Any suggestions for books to read?
        
               | mannyv wrote:
               | What you should read or start with are the designs for
               | Cassandra, Kafka, Foundation DB, etc.
               | 
               | The problems they're trying to solve are related to
               | really large distributed systems that fail a lot, and
               | their design decisions are basically a "this is how we
               | worked around that problem."
               | 
               | You can also look for the LISA archives
               | (https://www.usenix.org/publications/loginonline/thirty-
               | five-...). System administrators were the first people
               | that had to deal with large distributed systems at scale,
               | and university system administrators led the charge.
               | 
               | You might want to hunt down the comp.sys.admin archives
               | (I can't remember the newsgroup anymore).
               | 
                 | Most of the ideas and issues behind distributed computing
                 | are obvious if you think about it. Many of the actual
                 | implementations and mitigations of those are not obvious,
                 | though.
               | 
               | And there's also the client side of distributed
               | computing, which I don't think is discussed as much.
               | 
               | As an example, exponential backoff is one of the go-to
               | techniques for clients when the servers are under load.
               | Unfortunately that doesn't really work IRL, because
               | instead of spreading the load you get waves of load
               | coming back over and over. Likewise on the server side
               | you have problems with peak load.
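The "waves of load" effect comes from every client computing the same deterministic delay. A commonly cited mitigation, which the comment does not propose itself, is to add random jitter to each delay. The sketch below contrasts the two; the function names and the `base`/`cap` defaults are illustrative assumptions.

```python
import random


def plain_delays(attempts, base=0.1, cap=10.0):
    """Deterministic exponential backoff: every client waits the same time."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def jittered_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Backoff with full jitter: each delay is drawn uniformly from
    [0, min(cap, base * 2**n)], so clients that failed together spread out."""
    if rng is None:
        rng = random.Random()
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Two clients that failed at the same moment retry in lockstep...
    print(plain_delays(5), plain_delays(5))
    # ...unless each one randomizes its own delay.
    print(jittered_delays(5), jittered_delays(5))
```

Jitter spreads retries out over time but does not reduce the total number of requests, so it is not a complete answer to the peak-load problem the comment describes.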
        
               | John23832 wrote:
               | The only one I would really suggest is Designing Data
               | Intensive Applications. But it is very DB centric.
               | 
               | https://www.amazon.com/Designing-Data-Intensive-
               | Applications...
        
           | davidw wrote:
           | "Just build something" is good advice but it's easier to find
           | some kind of fun thing to build with, say, a web framework
           | that's educational and maybe not a _complete_ throwaway
           | either.
           | 
           | Maybe some Internet of Things applications would provide a
           | good avenue for some distributed systems exploration?
        
         | Xamayon wrote:
         | It's not as easy as playing with more 'normal' stuff, but I
         | usually use VMs on a local hypervisor like ESXi, or a bunch of
         | old desktop/server hardware if I have enough
         | space/power/cooling at the time. Winter helps, big stuff often
         | runs loud and hot. To get specialized hardware when needed,
         | ebay or 'trash' from work and such can help a lot.
        
         | mannyv wrote:
         | The easiest way is to fire up a bunch of VMs.
         | 
         | The cheapest way is to pick up an old ThinkStation (or other
         | tower), load it up with 128GB (or more) of ram and install ESXi
         | on it. That's a perfectly good baseline, and you can run about
         | 30 4gb linux VMs on it.
         | 
         | Ideally you'd have a bit less than 1 core per VM, just so it's
         | a bit slow. Lots of people assume your nodes are quick, but in
         | real life they may not be. And really, most of the time your
         | machines won't be doing squat.
         | 
         | You might want to have SSDs in there too, because ESXi doesn't
         | have RAID capability (or at least mine didn't). I don't think
         | you can get a cloud device that uses spinning disks anymore,
         | and you wouldn't use it in real life anyway.
         | 
         | A 2tb drive is cheap these days, or just slap all those old
         | small SSDs in there. Everyone has a bunch of those small SSDs
         | left over, and they're perfect.
        
           | Xamayon wrote:
           | ESXi needs the RAID to be handled by another device, the
           | simplest case is a hardware RAID card with disks locally
           | attached to it. You can also attach remote disks/volumes from
           | other systems, with or without RAID, over the network/SAN/etc
           | using an HBA, special network card, or the software iscsi
           | initiator stuff in ESXi. You can even have something like a
           | windows server act as the iscsi volume host, and attach to it
           | over the normal network if you don't really care about
           | reliability. The ESXi OS will not appreciate it if you ever
           | turn the remote volume host system off, or if the network
           | drops out. It's really too bad the free and cheap ESXi
           | licenses are going away, it was always so nice to work
           | with...
        
             | mannyv wrote:
             | That's right, Broadcom bought them and the party's over.
             | Download your ESXi while you can!
        
         | lijok wrote:
         | Most systems are "distributed" actually, even your CRUD apps
         | and CLI tools that write to disk. The better question is "how
         | do you learn to deal with distributed systems intricacies in
         | places where it matters (such as finance)?", the answer for
         | which is super simple:
         | 
         | 1. Write any stateful program.
         | 
         | 2. Now look at every single LOC and imagine what happens to the
         | system if the service crashes before executing the next LOC.
         | Then modify the system to deal with those scenarios.
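As one concrete instance of that exercise, consider a program that saves its state by overwriting a file in place: a crash halfway through the write leaves a corrupt file. A common fix, sketched here under the assumption of a local POSIX-style filesystem, is to write a temporary file and rename it over the old one. The helper name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace the file at `path` so that a crash at any line leaves either
    the complete old contents or the complete new contents, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the bytes to disk before renaming
        # Crash before this line: old file untouched, stray temp file remains.
        os.replace(tmp_path, path)
        # Crash after this line: new file is in place.
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    atomic_write("state.txt", b"balance=100\n")
```

This sketch does not fsync the containing directory, so the rename itself may not be durable across a power loss on every filesystem; that is the next "what if it crashes here?" question to ask.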
        
           | awesome_dude wrote:
           | > Most systems are "distributed" actually, even your CRUD
           | apps and CLI tools that write to disk.
           | 
           | Yeah, people miss this. If your app interacts with another
           | app - bam distributed.
        
             | patmorgan23 wrote:
             | It's almost as if the world isn't single threaded
        
         | hiAndrewQuinn wrote:
         | You take traditional non-distributed systems and push them to
         | their limits in some regard.
         | 
         | "Singularity" systems are an abstraction afforded to us by the
         | grace of the hardware we run them on. If you start pushing
         | their performance hard enough, however, you inevitably get
         | distributed behavior.
         | 
         | This is also a good potential career reason to try to make
         | software which is as performant as possible - you'll get all the
         | tasty edge cases and complexity war stories to talk about.
        
         | apwell23 wrote:
         | > how do they actually get hands on experience with it
         | 
         | By designing twitter in a 45 min interview.
        
         | m_0x wrote:
         | I don't think it's a trivial thing to do outside of work. At
         | most you can play with kubernetes and cloud but in an interview
         | the lack of experience will come out because I think some stuff
         | can only be learned at work. Especially scalability.
        
           | awesome_dude wrote:
           | > but in an interview the lack of experience will come out
           | 
           | Some people are just up front about it - I've read a lot, and
           | practiced the best I can, but am looking for some real world
           | experience to marry that to.
        
         | otoolep wrote:
         | FWIW, I built hraftd[1] many years ago to make it easy to play
         | with a simple distributed system, but one that uses a
         | production-grade implementation of Raft[2]. You can spin up a
         | cluster in seconds on a single machine, kill nodes, watch a new
         | Leader get elected, and so on.
         | 
         | It's written in Go, so it'll help if you are familiar with Go.
         | But the code is not difficult to understand even if you aren't.
         | 
         | [1] https://github.com/otoolep/hraftd
         | 
         | [2] https://github.com/hashicorp/raft
        
           | otoolep wrote:
           | Oh, and more background here:
           | https://www.philipotoole.com/building-a-distributed-key-
           | valu...
        
       | davidw wrote:
       | Among other things, Fred is the author of "Learn you some Erlang"
       | which is one of the best programming books I've read. It's so
       | obviously a labor of love.
       | 
       | https://learnyousomeerlang.com/
        
       | keschi wrote:
       | I recommended "Understanding Distributed Systems: What every
       | developer should know about large distributed applications" by
       | Roberto Vitillo to all my colleagues back when I worked on SaaS
       | systems.
       | 
       | "Designing Data-Intensive Applications: The Big Ideas Behind
       | Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann
       | as the more advanced deep dive.
       | 
       | Both books provide timeless conceptual advice. Kleppmann's
       | description of developing a database by starting from an append-
       | only text file really stuck with me.
        
         | wooly_bully wrote:
         | The book that Dominik Tornow is writing "Thinking in
         | Distributed Systems" has been an excellent next read after DDIA
         | for me (it's not yet finished I believe).
         | 
         | Really shows the experience of someone who understands this
         | stuff inside and out (was one of the main people behind
         | Temporal).
        
           | killthebuddha wrote:
           | FWIW I don't see mention of incompleteness on the book's site
           | http://book.dtornow.com/
        
         | hiAndrewQuinn wrote:
         | I often like to think that, at a basic level, all a [edit:
         | indexed] db "does" is move our O(n) search of an unordered text
         | file to the O(log n) search of a tree
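A toy illustration of that O(n) versus O(log n) difference. A sorted list searched with `bisect` stands in for the tree here; real databases typically use B-trees or similar structures, and all names below are made up for the example.

```python
import bisect


def linear_lookup(rows, key):
    """O(n): scan an unordered list of (key, value) rows."""
    for k, v in rows:
        if k == key:
            return v
    return None


def build_index(rows):
    """Sort once up front; returns parallel lists of keys and values."""
    ordered = sorted(rows)
    return [k for k, _ in ordered], [v for _, v in ordered]


def indexed_lookup(keys, values, key):
    """O(log n): binary search over the sorted keys."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


if __name__ == "__main__":
    rows = [(17, "q"), (3, "c"), (42, "z"), (8, "h")]
    keys, values = build_index(rows)
    print(linear_lookup(rows, 42), indexed_lookup(keys, values, 42))
```

The index is not free: it costs a sort up front and must be kept in order on every insert, which is the trade-off a database makes when you add an index.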
        
           | maerF0x0 wrote:
           | if the facet is indexed.
        
             | hiAndrewQuinn wrote:
             | ah yes, thanks
        
           | teraflop wrote:
           | Yup.
           | 
           | From a high-altitude view, that's why splitting a huge
           | database table into smaller partitions is not an _automatic_
           | performance win. If you have M partitions with N rows each,
           | then a lookup might require O(log M) time to find a partition
           | and O(log N) time to find a row within the partition. But
           | O(log M + log N) = O(log MN) which is what you would get from
           | a single big table with appropriate indexing.
           | 
           | Of course, in the real world constant factors and
           | implementation details matter, so this is just a heuristic.
           | But it seems to run contrary to a lot of novice programmers'
           | intuition that a large DB table must automatically be a slow
           | one.
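The identity behind that heuristic, written out with one set of illustrative numbers (the values of M and N are assumptions chosen to make the arithmetic easy):

```latex
\[
  O(\log M) + O(\log N) = O(\log M + \log N) = O(\log (MN))
\]
% Example: M = 2^{10} partitions with N = 2^{20} rows each.
\[
  \log_2 2^{10} + \log_2 2^{20} = 10 + 20 = 30 = \log_2 2^{30}
\]
```

So the partitioned lookup and a single indexed table of 2^30 rows both take on the order of 30 comparison steps; any real difference comes from the constant factors and implementation details the comment mentions.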
        
         | ryandv wrote:
         | To add to this list, there is also "Principles of Eventual
         | Consistency" [0] for getting down to the mathematical
         | formalisms.
         | 
         | In addition, Lamport's paper "Time, Clocks, and the Ordering of
         | Events in a Distributed System" [1].
         | 
         | [0] https://www.microsoft.com/en-us/research/wp-
         | content/uploads/...
         | 
         | [1] https://lamport.azurewebsites.net/pubs/time-clocks.pdf
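For readers who find the Lamport paper hard going, the core clock rule fits in a few lines. This is a minimal sketch of a logical clock in the style of that paper, not code taken from it; the class and method names are illustrative.

```python
class LamportClock:
    """Logical clock: a counter that orders events without wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event happened: advance the counter."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump past both our own clock and the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


if __name__ == "__main__":
    a, b = LamportClock(), LamportClock()
    a.tick()                  # a: local event
    stamp = a.send()          # a sends a message to b
    print(b.receive(stamp))   # b's clock is now greater than the send timestamp
```

The guarantee is one-directional: if one event causally precedes another, its timestamp is smaller, but a smaller timestamp alone does not prove a causal relationship.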
        
           | yodsanklai wrote:
           | > Lamport's paper "Time, Clocks, and the Ordering of Events
           | in a Distributed System"
           | 
           | I know this article is a classic. I studied it at school but
           | I've always found it very hard to understand. Maybe I'm wrong
           | but I have the feeling that relatively few engineers use
           | these formalisms as their mental models when designing
           | distributed systems.
        
         | bostik wrote:
         | It was surprising that Kleppmann's book was mentioned only at
         | the _very end_ of the article, but at least it came with an
         | understandable caveat. That book is incredible - although in
         | all honesty it does require a solid foundation in distributed
         | systems to make proper sense.
         | 
         | Until you have personally battled with replication lag, real-
         | life impacts of eventual consistency and distributed writes,
         | Data-Intensive Applications feels like a dry theoretical read.
         | If you do come across the book with the scars and lessons, it
         | does open the world up.
        
       | throw0101b wrote:
       | The words "reading list" implied to me a list of books, articles,
       | _etc._, that one would go over to learn about the topic.
       | 
       | Can anyone familiar with the topic suggest a list? Perhaps starting
       | with a "101" item for those that want a general understanding /
       | scratch a curiosity itch and perhaps proceeding to more technical
       | items for those that want to dig deep.
        
         | ahansen wrote:
         | I feel like the article itself does a pretty good job of
         | introducing a lot of the core topics with a short paragraph for
         | each
        
       | 0xbadcafebee wrote:
       | Is there a wiki for computer science? I feel like I have a couple
       | books worth of knowledge on building and maintaining distributed
       | systems that is just gonna die with me. Could try to start
       | fleshing out articles, but it would be helpful if others contributed
        
         | plumeria wrote:
         | Why not contribute to Wikipedia?
        
         | patmorgan23 wrote:
         | There's tons of CS stuff on regular Wikipedia.
        
       | allendoerfer wrote:
       | If you want to tackle it from a more practical point of view, I
       | can also recommend "Site Reliability Engineering (How Google runs
       | production systems)", which is not only about the method itself,
       | but naturally goes over distributed systems and explains some
       | fundamentals.
        
       | lysecret wrote:
       | I do not like these types of books, and I have tried all of them. However,
       | Designing Data-Intensive Applications is just fantastic.
        
       | revskill wrote:
       | No book mentioned how to do distributed transaction though.
        
         | convolvatron wrote:
         | atomic transactions by Nancy Lynch et al.
        
         | esafak wrote:
         | DDIA does under "Distributed Transactions and Consensus" in
         | Chapter 9: Consistency and Consensus.
        
       | HextenAndy wrote:
       | I haven't followed the link - but why not put it all in one
       | place?
        
       | rochak wrote:
       | I'd recommend checking out MIT's Distributed Systems course. All
       | its videos and assignments are available online and teach you
       | everything you would need to get into these systems and go in
       | depth in them.
        
       | nc0 wrote:
       | I also recommend anyone interested to have a look at the
       | Erlang/OTP ecosystem, especially for their design decisions.
       | While the language and the platform aren't popular, the OTP team
       | does present rich architectural patterns and ideas that can
       | improve your design
        
       | sharas- wrote:
       | "End-to-End Argument in System Design" - Classic. Basically
       | means: nerds, stop playing with yourselves and think about
       | users/clients of your system.
        
       | LAC-Tech wrote:
       | I've gotten a lot of value by going to a topic I don't understand
       | on wikipedia, finding the oldest paper they cite, and printing it
       | out.
       | 
       | Sometimes I only finish half the paper, but damned if I haven't
       | learned a lot.
       | 
       | Disclaimer: I could never go through and systematically work
       | through a giant list like this. If you know yourself and you can,
       | this may be more effective.
        
       | max_ wrote:
       | I think "Specifying Systems" by Leslie Lamport should be on the
       | list.
       | 
       | Along with "Mastering Bitcoin: Programming the Open Blockchain
       | Book" by Andreas Antonopoulos
        
       | nvdnadj92 wrote:
       | The best introductory resource I have found in my career was:
       | "Distributed Systems for Fun and Profit" by Mixu. It's about 50
       | pages long, and is broken down quite well.
       | 
       | https://book.mixu.net/distsys/single-page.html
        
       | macintux wrote:
       | Warning: I haven't checked these links in forever, but here's a
       | list of distributed systems reading lists.
       | 
       | https://gist.github.com/macintux/6227368
        
       ___________________________________________________________________
       (page generated 2024-02-08 23:01 UTC)