[HN Gopher] Building a Fly.io-like scheduler with resource requi...
___________________________________________________________________
Building a Fly.io-like scheduler with resource requirements
Author : dangoodmanUT
Score : 92 points
Date : 2024-02-25 12:53 UTC (10 hours ago)
(HTM) web link (www.aspiring.dev)
(TXT) w3m dump (www.aspiring.dev)
| mjb wrote:
| This is a cool series of posts, thanks for writing it!
|
| We've released a bit about how the AWS Lambda scheduler works (a
| distributed, but stateful, sticky load balancer). There are a
| couple of reasons why Lambda doesn't use this broadcast approach
| to solve a similar problem to the one these posts are solving.
|
| One is that this 'broadcast' approach introduces a tricky
| tradeoff decision about how long to wait for somebody to take the
| work before you create more capacity for that resource. The
| longer you wait, the higher your latency variance is. The shorter
| you wait, the more likely you are to 'strand' good capacity that
| just hasn't had a chance to respond yet. That's a tunable
| tradeoff, but the truly tough problem is that it creates a kind
| of metastable behavior under load: excess load delays responses,
| which makes 'stranding' more frequent, which reduces resource
| usage efficiency, which makes load problems worse. Again, that's
| a solvable problem, but solving it adds significant complexity to
| what was a rather simple protocol.
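|
| To make that knob concrete, here's a toy sketch (Go, every name
| and number made up); the only interesting part is the select
| between an offer arriving and the wait timer firing:
|
|   package main
|
|   import (
|       "fmt"
|       "time"
|   )
|
|   type WorkRequest struct{ CPU, MemGB int }
|   type Offer struct{ WorkerID string }
|
|   // Stand-in for the broadcast: one idle worker answers after
|   // 30ms. In the real protocol this fans out to the fleet.
|   func broadcast(req WorkRequest, offers chan<- Offer) {
|       go func() {
|           time.Sleep(30 * time.Millisecond)
|           offers <- Offer{WorkerID: "worker-42"}
|       }()
|   }
|
|   // place waits `wait` for an existing worker before giving up
|   // and provisioning. Too short and you strand the worker that
|   // was about to answer; too long and tail latency grows.
|   func place(req WorkRequest, wait time.Duration) Offer {
|       offers := make(chan Offer, 1)
|       broadcast(req, offers)
|       select {
|       case o := <-offers:
|           return o
|       case <-time.After(wait):
|           return Offer{WorkerID: "freshly-provisioned"}
|       }
|   }
|
|   func main() {
|       // too short: strands worker-42; long enough: reuses it
|       fmt.Println(place(WorkRequest{4, 12}, 10*time.Millisecond))
|       fmt.Println(place(WorkRequest{4, 12}, 100*time.Millisecond))
|   }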
|
| Another issue is dealing with failures of capacity (say a few
| racks lose power). The central system doesn't know what resources
| it lost (because that knowledge is only distributed in the
| workers), and so needs to discover that information from the flow
| of user requests. That can be OK, but again means modal latency
| behavior in the face of failures.
|
| Third, the broadcast behavior requires O(N^2) messages for N
| requests processed (on the assumption that the fleet size is O(N)
| too). This truly isn't a big deal at smaller scales (packets are
| cheap) but can become expensive at larger scales (N^2 gets
| steep). The related problem is that the protocol also introduces
| another round-trip for discovery, increasing latency. That could
| be as low as a few hundred microseconds, but it's not nothing
| (and, again, the need to optimize for happy-case latency against
| bad-case efficiency makes tuning awkward).
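|
| (For scale: with ~1,000 requests against a ~1,000-node fleet
| that's on the order of a million discovery messages; at 100,000
| each it's on the order of 10^10.)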
|
| Fourth, the dynamic behavior under load is tricky to reason about
| because of the race between "I can do this" and getting the work.
| You can be optimistic (not reserving capacity), at the cost of
| having to re-run the protocol (potentially an unbounded number of
| times!) if you lose the race to another source of work. Or, you
| can be pessimistic (reserving capacity and explicitly releasing
| what you don't need), at the cost of making the failure cases
| tricky (see the classic problem with 2PC coordinator failure),
| and reducing efficiency for popular resources (in proportion to
| the latency and popularity of the resource you're looking for).
| Slow coordinators can also cause significant resource wastage, so
| you're back to tuning timeouts and inventing heuristics. It's a
| game you can win, but a tough one.
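|
| The optimistic flavor, as a toy model (one worker with a single
| slot and a made-up claim API), looks roughly like this; the loop
| is where the potentially unbounded re-runs hide:
|
|   package main
|
|   import (
|       "errors"
|       "fmt"
|       "sync/atomic"
|   )
|
|   // Nothing is reserved at discovery time, so a claim can fail
|   // and the whole discover-then-claim protocol runs again.
|   type worker struct{ busy atomic.Bool }
|
|   func (w *worker) claim() bool {
|       return w.busy.CompareAndSwap(false, true)
|   }
|
|   func placeOptimistic(w *worker, maxAttempts int) error {
|       for i := 0; i < maxAttempts; i++ {
|           // "discovery" would be the broadcast round trip; here
|           // the worker always says "I can do this"
|           if w.claim() {
|               return nil // won the race
|           }
|           // lost to another source of work; go around again
|       }
|       return errors.New("gave up after repeated lost races")
|   }
|
|   func main() {
|       w := &worker{}
|       fmt.Println(placeOptimistic(w, 3)) // <nil>: claim wins
|       fmt.Println(placeOptimistic(w, 3)) // error: already taken
|   }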
|
| This needle-in-a-haystack placement problem really is an
| interesting one, and it's super cool to see people writing about
| it and approaching the trade-offs in designs in a different way.
| dangoodmanUT wrote:
| This is a great summary, thanks for sharing it! Totally agree
| with you that broadcasting is not the best at AWS scale, but
| it's so simple and works elegantly at small-to-medium scale
| (i.e. not AWS scale lol) without requiring a centralized source
| of truth for resource availability. It's a super interesting
| problem overall and I think there is still so much more to
| learn!
|
| Eager to learn more about Lambda scheduling; are you referring
| to this re:Invent talk?
| https://youtu.be/0_jfH6qijVY?si=Uc6xpdpXiJ6oRHWD&t=671
| mjb wrote:
| Yes, I think that's the most recent talk (and the most in-
| depth on this side of the problem).
| Cwizard wrote:
| > We've released a bit about how the AWS Lambda scheduler works
| (a distributed, but stateful, sticky load balancer).
|
| Do you have a link? I would like to read more about this.
| dangoodmanUT wrote:
| Check my youtube link above
| shayonj wrote:
| this is very interesting, thank you for sharing!
| stergios wrote:
| This reminds me of scheduling for contact centers with agent
| skill based constraints and service level requirements.
| mbreese wrote:
| Another place to look for similar workflows is the HPC world. HPC
| schedulers have been doing resource-requirement-based scheduling
| forever. Many of those keep the scheduling state centralized,
| though. Instead of asking the workers "who can
| fulfill these requirements", the coordinator process keeps track
| of the state of what resources are still available for each
| worker. If you have a task that needs 12GB of RAM, 4 processors,
| and is in us-west, the coordinator already knows the answer. So
| instead of N^2 messages, you just have two (submit work and ack).
|
| The major reason HPC keeps this state centrally is that it makes
| time-based scheduling easier too. In general, you have more work
| than nodes and each job only has a finite run time. This lets the
| scheduler do fun things like backfilling jobs and more
| coordinated scheduling (such as keeping an entire rack idle for
| power savings unless necessary).
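|
| A rough sketch of that central view (Go, made-up types):
| placement is a lookup over the coordinator's own bookkeeping,
| followed by a single submit/ack exchange with the chosen node.
|
|   package main
|
|   import "fmt"
|
|   // The coordinator tracks what is still free on every node.
|   type node struct {
|       id        string
|       region    string
|       freeCPU   int
|       freeMemGB int
|   }
|
|   func schedule(fleet []*node, cpu, memGB int, region string) *node {
|       for _, n := range fleet {
|           fits := n.region == region &&
|               n.freeCPU >= cpu && n.freeMemGB >= memGB
|           if !fits {
|               continue
|           }
|           n.freeCPU -= cpu // coordinator's view updates right away
|           n.freeMemGB -= memGB
|           return n
|       }
|       return nil // nothing fits: queue, backfill, or provision
|   }
|
|   func main() {
|       fleet := []*node{
|           {id: "a", region: "us-east", freeCPU: 8, freeMemGB: 32},
|           {id: "b", region: "us-west", freeCPU: 16, freeMemGB: 64},
|       }
|       if n := schedule(fleet, 4, 12, "us-west"); n != nil {
|           fmt.Println("placed on", n.id) // placed on b
|       }
|   }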
| dangoodmanUT wrote:
| Yeah, the problem with a central view is that it can get out of
| sync, but for time-based scheduling it's definitely superior.
| Thaxll wrote:
| It's easier to let the remote compute nodes report their load to
| a central place and do the calculations there, rather than asking
| everyone all the time.
| dangoodmanUT wrote:
| IDK about easier... How is it easier if you now need to add a
| DB, workers need to pulse their availability (load on the DB),
| and you need to write queries against the DB to figure it out,
| handle race conditions on claiming resources, etc.?
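|
| Even the "simple" claim step ends up needing something like this
| conditional update (a rough sketch against a hypothetical
| workers table, Postgres-style placeholders) so that two
| schedulers can't grab the same capacity:
|
|   package scheduler
|
|   import "database/sql"
|
|   // Zero rows affected means another scheduler won the race and
|   // the caller has to go pick a different worker.
|   func claim(db *sql.DB, id string, cpu, mem int) (bool, error) {
|       res, err := db.Exec(
|           `UPDATE workers
|               SET free_cpu = free_cpu - $1,
|                   free_mem_gb = free_mem_gb - $2
|             WHERE id = $3
|               AND free_cpu >= $1 AND free_mem_gb >= $2`,
|           cpu, mem, id)
|       if err != nil {
|           return false, err
|       }
|       n, err := res.RowsAffected()
|       return n == 1, err
|   }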
___________________________________________________________________
(page generated 2024-02-25 23:01 UTC)