[HN Gopher] Notes on Theory of Distributed Systems [pdf]
___________________________________________________________________
Notes on Theory of Distributed Systems [pdf]
Author : htfy96
Score : 68 points
Date : 2022-08-24 19:51 UTC (3 hours ago)
(HTM) web link (www.cs.yale.edu)
(TXT) w3m dump (www.cs.yale.edu)
| infogulch wrote:
| 15 pages of just TOC. 400+ pages of content
|
| > These are notes for the Fall 2022 semester version of the Yale
| course CPSC 465/565 Theory of Distributed Systems
|
| There are a lot of algorithms, but I don't see CRDTs mentioned by
| name. Perhaps it's most closely related to "19.3 Faster snapshots
| using lattice agreement"?
| dragontamer wrote:
| > CRDTs
|
| Wrong level of abstraction. This is clearly a lower level
| course than that and discusses more fundamental ideas.
|
| A quick look through chapter 6 reminds me of CRDTs, at least
| the vector clock concept. Bits from other parts of the course
| would probably need to be combined to build what you'd call a
| CRDT.
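|
| For anyone who hasn't run into them: a vector clock is just a
| map of per-node counters, merged by element-wise max. A rough
| sketch (names and API made up, not taken from the notes):
|
|     <?php
|     // Toy vector clock: one counter per node.
|     class VectorClock {
|         private array $clock = [];
|
|         // Local event on $node: bump that node's counter.
|         public function tick(string $node): void {
|             $this->clock[$node] = ($this->clock[$node] ?? 0) + 1;
|         }
|
|         // On receiving a message stamped with $other at node $self:
|         // take the element-wise max, then bump our own counter.
|         public function merge(VectorClock $other, string $self): void {
|             foreach ($other->clock as $node => $n) {
|                 $this->clock[$node] = max($this->clock[$node] ?? 0, $n);
|             }
|             $this->tick($self);
|         }
|
|         // True iff this clock is <= $other in every component, i.e.
|         // the event it stamps causally precedes (or equals) the other.
|         public function happenedBefore(VectorClock $other): bool {
|             foreach ($this->clock as $node => $n) {
|                 if ($n > ($other->clock[$node] ?? 0)) {
|                     return false;
|                 }
|             }
|             return true;
|         }
|     }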
| phtrivier wrote:
| Is anyone teaching "Practice of boring Distributed Systems 101
| for dummies on a budget with a tight schedule" ?
|
| As in, "we have a PHP monolith used by all of 12 people in the
| accounting department, and for some reason we've been tasked with
| making it run on multiple machines ("for redundancy" or
| something) by next month.
|
| The original developers left to start a Bitcoin scam.
|
| Some exec read about the "cloud", but we'll probably get just
| enough budget to buy an AWS salesman a coffee.
|
| Don't even dream of hiring a "DevOps" to deploy a kubernetes
| cluster to orchestrate anything. Don't dream of hiring anyone,
| actually. Or, paying anything, for that matter.
|
| You had one machine; here is a second machine. That's a 100%
| increase in your budget, now go get us some value with that!
|
| And don't come back in three months to ask for another budget to
| 'upgrade'."
|
| Where would someone start ?
|
| (EDIT: To clarify, this is a tongue-in-cheek hyperbole scenario,
| not a cry for immediate help. Thanks to all who offered help ;)
|
| Yet, I'm curious about any resource on how to attack such
| problems, because I can only find material on how to handle
| large-scale, multi-million-user, high-availability stuff.)
| keule wrote:
| > As in, "we have a PHP monolith used by all of 12 people in
| the accounting department, and for some reason we've been
| tasked with making it run on multiple machines ("for
| redundancy" or something) by next month.
|
| Usually, your monolith has these components: a web server
| (apache/nginx + php), a database, and other custom tooling.
|
| > Where would someone start ?
|
| I think a first step is to move the database to something
| managed, like AWS RDS or Azure Managed Databases. Herein lies
| the basis for scaling out your web tier later. And here you
| will find the most pain because there are likely: custom backup
| scripts, cron jobs, and other tools that access the DB in
| unforeseen ways.
|
| If you get over that hump you have taken your first big step
| towards a more robust model. Your DB will have automated
| backups, managed updates, failover, read replicas etc. You may
| or may not see a performance increase, because you effectively
| split your workload across two machines.
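|
| Concretely, the app-side change for this step is often just
| reading the DB endpoint from configuration instead of assuming
| localhost, so the app and the cron jobs can be pointed at the
| managed instance. A sketch (env var names are made up):
|
|     <?php
|     // Build the DSN from the environment instead of assuming the
|     // database lives on the same box as the web server.
|     $dsn = sprintf(
|         'mysql:host=%s;dbname=%s;charset=utf8mb4',
|         getenv('DB_HOST') ?: '127.0.0.1',
|         getenv('DB_NAME') ?: 'accounting'
|     );
|     $pdo = new PDO(
|         $dsn,
|         getenv('DB_USER') ?: 'app',
|         getenv('DB_PASS') ?: '',
|         [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
|     );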
|
| _THEN_ you can front your web tier with a load balancer, i.e.
| you load balance to one machine. This gives you: better
| networking, custom error pages, support for sticky sessions
| (you likely need them later), and better/more monitoring.
|
| From there on you can start working on removing those custom
| scripts from the web-tier machine and start splitting this into
| an _actual_ load-balanced infrastructure, going to two web-tier
| machines, where traffic is routed using sticky sessions.
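|
| (The app-side alternative to sticky sessions is to move PHP's
| sessions off the local disk into a shared store. A sketch,
| assuming the phpredis extension is installed; host and port are
| made up:)
|
|     <?php
|     // Keep sessions in a shared Redis instance instead of local
|     // files, so it no longer matters which web node a request hits.
|     ini_set('session.save_handler', 'redis');
|     ini_set('session.save_path', 'tcp://10.0.0.20:6379');
|     session_start();
|     $_SESSION['visits'] = ($_SESSION['visits'] ?? 0) + 1;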
|
| Depending on the application design you can start introducing
| containers.
|
| Now, this approach will not give you a _cloud-native awesome
| microservice architecture_ with CI/CD and devops. But it will
| be enough to get higher availability and more robust handling
| of the (predictable) load in the near future. And along the
| way, you will remove bad patterns, which eventually lets you
| move to a better approach.
|
| I would be interested in hearing if more people face this
| challenge. I don't know if guides exist around this on the
| webs.
| qntty wrote:
| Sounds like you could be looking for something like VMware
| vSphere if primary-backup replication is what you want
| throwaway787544 wrote:
| If someone would pay for it I'd write that book. There are lots
| of different methods for different scenarios. There are some
| books on it but they're either very dry and technical or have
| very few examples.
|
| Here's the cliffs notes version for your situation:
|
| 1. Build a server. Make an image/snapshot of it.
|
| 2. Build a second server from the snapshot.
|
| 3. Use rsync to copy files your PHP app writes from one machine
| ('primary') to another ('secondary').
|
| 4. To make a "safe" change, change the secondary server, test
| it.
|
| 5. To "deploy" the change, snapshot the secondary, build a new
| third server, stop writes on the primary, sync over the files
| to the third server one last time, point the primary hostname
| at the third server IP, test this new primary server, destroy
| the old primary server.
|
| 6. If you ever need to "roll back" a change, you can do that
| while there are still three servers up (blue/green), or deploy a
| new server with the last working snapshot.
|
| 7. Set up PagerDuty to wake you up if the primary dies. When it
| does, repoint the primary hostname at the IP of the second box.
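|
| (If you don't have PagerDuty, even a dumb watchdog cron'd on a
| third box gets you most of the way for step 7. A sketch in PHP
| since that's what you already run; the URL is a placeholder:)
|
|     <?php
|     // Poll the primary's health endpoint; complain loudly if it
|     // is unreachable. In real life this is where you page someone
|     // or repoint the hostname at the secondary.
|     $primary = 'https://app.example.internal/health';  // placeholder
|     $ctx = stream_context_create(['http' => ['timeout' => 5]]);
|     $body = @file_get_contents($primary, false, $ctx);
|     if ($body === false) {
|         error_log('primary looks dead, start failover');
|     }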
|
| That's just one way that is very simple. It is a redundant
| active/passive distributed system with redundant storage and
| immutable blue/green deployments. It can be considered high-
| availability, although that term is somewhat loaded; ideally
| you'd make as much of the system HA as possible, such as
| independent network connections to the backbone, independent
| power drops, UPS, etc (both for bare-metal and VMs).
|
| You can get much more complicated but that's good enough for
| what they want (redundancy) and it buys you a lot of other
| benefits.
| fredsmith219 wrote:
| I can't believe that 12 people would actually be stressing the
| system. Could you meet the requirements of the project by
| setting up the second machine as a hot backup at an offsite
| location?
| phtrivier wrote:
| Maybe. How do I find the O'Reilly book that explains that?
| And the petty details about knowing the first one is down and
| starting the backup? And just enough data replication to
| actually have some data in the second machine? Etc, etc...
|
| My pet peeve with distributed and ops books is that they
| usually start by laying out all those problems, but then move
| on to either:
|
| - explain how Big Tech has even bigger problems, before
| explaining how you can fix Big Tech problems with Big Tech
| budgets and headcount by deploying just one more layer of
| distributed cache or queue that virtually ensures your app is
| never going to work again (that's "Designing Data-Intensive
| Applications", in bad faith.)
|
| - or, not really explain anything, wave their hands chanting
| "trade-offs, trade-offs", and start telling kids' stories about
| Byzantine Generals.
| EddySchauHai wrote:
| What you're describing there sounds like general Linux
| sysadmin to me?
| phtrivier wrote:
| Not entirely, I would argue, if you look at it from the
| application developer's side.
|
| You have to adapt parts of your app to handle the fact
| that two machines might be handling the service (either
| at the same time, or in succession.)
|
| This has an impact on how you use memory, how you persist
| stuff, etc...
|
| None of which is rocket science, probably - but even
| things that look "obvious" to lots of people get their
| O'Reilly books, so...
|
| But you're right that a part of the "distribution" of a
| system is in the hands of ops more than devs.
| EddySchauHai wrote:
| I guess it's just experience, to be honest. It happens
| rarely, you might be lucky enough to be involved with
| solving it, and then you focus on the important parts of
| the project again. I've only worked in startups so don't
| know about the 'Big Tech' solutions, but a little
| knowledge of general Linux sysadmin, containers, and
| queues has yet to block me :) Once the company is big
| enough to need some complexity beyond that, I assume
| there's enough money to hire someone to come in and put
| everything into CNCF's 1000-layer tech stack.
|
| Edit: Thinking on this, if I want to scale something it'd
| be specific to the problem I'm having, so some sort of
| debugging process like
| https://netflixtechblog.com/linux-performance-analysis-in-60...
| to find the root cause would be the generic advice. Then
| you can decide to scale vertically/horizontally/refactor
| to solve the problem and move on.
| lmwnshn wrote:
| More entertainment than how-to guide, and oriented more
| towards developers than ops, but if you haven't read
| "Scalability! But at what COST?" [0], I think you'll enjoy
| it.
|
| [0] https://www.frankmcsherry.org/graph/scalability/cost/2015/01...
| arinlen wrote:
| > _explain how Big Tech has even bigger problems, before
| explaining how you can fix Big Tech problems with Big Tech
| budgets and headcount (...)_
|
| What do you have to say about the fact that the career
| goal of those interested in this sort of topic is... to be
| counted as part of the headcount of these Big Tech
| companies while getting paid Big Tech salaries?
|
| Because if you count yourself among those interested in the
| topic, that's precisely the type of stuff you're eager to
| learn so that you're in a better position to address those
| problems.
|
| What's your answer to that? Continue writing "hello world"
| services with Spring Initializr because that's all you
| need?
| phtrivier wrote:
| > Because if you count yourself among those interested in
| the topic, that's precisely the type of stuff you're
| eager to learn so that you're in a better position to
| address those problems.
|
| People will work on problems of different scales in a
| career; will you agree that different scales of problems
| call for different techniques?
|
| I have no problem with FAANGs documenting how to fix FAANG
| issues!
|
| I'm a little bit concerned about wannabe-FAANG devs
| applying the same techniques to non-FAANG issues, though,
| for lack of training resources about the "not trivial but
| a bit boring" techniques.
|
| Your insight about the budget / salaries makes sense,
| though: a book about "building your first boring IT
| project right" is definitely not going to be a best
| seller anytime soon :D!
| enumjorge wrote:
| Nothing wrong with having those aspirations, but sounds
| like the parent commenter has non-Big-Tech-sized problems
| he needs to solve now.
| slt2021 wrote:
| Distributed systems are usually for millions of users, not 12
| users.
|
| For your problem, you can start by configuring nginx to work
| as a load balancer and spinning up a 2nd VM with the PHP app.
| phtrivier wrote:
| "But what if _the_ machine goes down ? What if it goes down
| _during quarter earnings legally requested reporting
| consolidation period_ ? We need _redundancy_ !!"
|
| Also, philosophically, I guess, a "distributed" systems
| starts at "two machines". (And you actually get most of the
| "fun" of distributed systems with "two processes on the same
| machine".)
|
| We're taught how to deal with "N=1" in school, and "N=all
| fans of Taylor Swift in the same seconds" in FAANGS.
|
| Yet I suspect most people will be working on "N=12, 5 hours a
| day during office hours, except twice a year." And I'm not
| sure what's the reference techniques for that.
| arinlen wrote:
| > _Also, philosophically, I guess, a "distributed" system
| starts at "two machines"._
|
| People opening a page in a browser that sends requests to a
| server is already a distributed system.
|
| A monolith sending requests to a database instance is
| already a distributed system.
|
| Having a metrics sidecar running along your monolith is
| already a distributed system.
| phtrivier wrote:
| > A monolith sending requests to a database instance is
| already a distributed system.
|
| True, of course.
|
| And even a simple setup like this brings in "distribution"
| issues for the app developer:
|
| When do you connect? When do you reconnect?
|
| Where do you get your connection credentials from?
|
| What should happen when those credentials have to change?
|
| Do you ever decide to connect to a backup DB?
|
| Do you ever switch your application logic to a mode where
| you know the DB is down, but you still try to work
| without it anyway?
|
| Etc...
|
| Those examples are specific to DBs, but in a distributed
| system any other service brings up the same questions.
|
| With experience you get opinions and intuitions about how
| to attack each issue; my question is still: "if you needed
| to point a newcomer to some reference / book about those
| questions, where would you point them?"
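|
| Even the "reconnect" question alone hides a bunch of policy
| decisions. A naive sketch (PDO, made-up DSN) already forces
| you to pick how long to retry and what "degraded mode" means:
|
|     <?php
|     // Try to (re)connect a few times with crude backoff, then
|     // fall back to a degraded mode where the app serves whatever
|     // does not need the DB.
|     function connectWithRetry(string $dsn, int $attempts = 3): ?PDO
|     {
|         for ($i = 0; $i < $attempts; $i++) {
|             try {
|                 return new PDO($dsn, getenv('DB_USER') ?: 'app',
|                                getenv('DB_PASS') ?: '');
|             } catch (PDOException $e) {
|                 sleep(2 ** $i);  // 1s, 2s, 4s, ...
|             }
|         }
|         return null;  // the caller decides what "degraded" means
|     }
|
|     $db = connectWithRetry('mysql:host=db.internal;dbname=accounting');
|     if ($db === null) {
|         // e.g. serve read-only pages from a cache, or show a banner
|     }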
| random_coder wrote:
| It's a joke.
| arinlen wrote:
| > _As in, "we have a PHP monolith used by all of 12 people in
| the accounting department, and for some reason we've been
| tasked with making it run on multiple machines ("for
| redundancy" or something) by next month._
|
| I find this comment highly ignorant. The need to deploy a
| distributed system is not always tied to performance or
| scalability or reliability.
|
| Sometimes all it takes is having to reuse a system developed by
| a third party, or consume an API.
|
| Do you believe you'll always have the luxury of having a single
| process working on a single machine that does zero
| communication over a network?
|
| Hell, even a SPA calling your backend is a distributed system.
| Is this not a terribly common use case?
|
| Enough about these ignorant comments. They add nothing to the
| discussion and are completely detached from reality.
| phtrivier wrote:
| I failed to make the requester sound more obnoxious than the
| request.
|
| My point is precisely that transitioning from a single app on
| a single machine is a natural and necessary part of a system's
| life, but that I can't find satisfying resources on how to
| handle this phase, as opposed to handling much higher load.
|
| Sorry for the missed joke.
| salawat wrote:
| The easiest starting point is modeling the problem between you
| and your co-workers, paying painstaking attention to the flow
| of knowledge.
|
| Seriously. Most of the difficulty of distributed systems is
| because you're actually having to manage the flow of
| information between distinct members of a networked composite.
| Every time someone is out of the loop, what do you do?
|
| Can you tell if someone is out of the loop? What happens if
| your detector breaks?
|
| Try it with your coworkers. You have to be super serious about
| running down the "but how did you know" parts.
|
| Once you have a handle on the ways you trip, go hit the books,
| and learn the names of all the SNAFUs you just acted out.
| tychota wrote:
| Why teach Paxos and not Raft? I thought Raft was easier to
| grasp, and it's used a lot nowadays.
___________________________________________________________________
(page generated 2022-08-24 23:00 UTC)