[HN Gopher] Building a Fly.io-like scheduler with resource requi...
       ___________________________________________________________________
        
       Building a Fly.io-like scheduler with resource requirements
        
       Author : dangoodmanUT
       Score  : 92 points
       Date   : 2024-02-25 12:53 UTC (10 hours ago)
        
 (HTM) web link (www.aspiring.dev)
 (TXT) w3m dump (www.aspiring.dev)
        
       | mjb wrote:
       | This is a cool series of posts, thanks for writing it!
       | 
       | We've released a bit about how the AWS Lambda scheduler works (a
       | distributed, but stateful, sticky load balancer). There are a
       | couple of reasons why Lambda doesn't use this broadcast approach
       | to solve a similar problem to the one these posts are solving.
       | 
       | One is that this 'broadcast' approach introduces a tricky
       | tradeoff decision about how long to wait for somebody to take the
       | work before you create more capacity for that resource. The
       | longer you wait, the higher your latency variance is. The shorter
       | you wait, the more likely you are to 'strand' good capacity that
       | just hasn't had a chance to respond yet. That's a tunable
       | tradeoff, but the truly tough problem is that it creates a kind
       | of metastable behavior under load: excess load delays responses,
       | which makes 'stranding' more frequent, which reduces resource
       | usage efficiency, which makes load problems worse. Again, that's
       | a solvable problem, but solving it adds significant complexity to
       | what was a rather simple protocol.
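[Editor's note] The wait-timeout tradeoff above can be illustrated with a toy sketch (all names and numbers here are hypothetical, not from Lambda or the post): the coordinator broadcasts a request, collects offers for `wait_ms`, and any capable worker whose reply misses the deadline is effectively stranded.

```python
def place(worker_latencies_ms, wait_ms):
    """Broadcast a work request and collect offers arriving within wait_ms.

    Returns (placed, stranded): whether any worker answered in time, and
    how many capable workers were 'stranded' because their offer would
    have arrived after the deadline.
    """
    offers = [lat for lat in worker_latencies_ms if lat <= wait_ms]
    stranded = len(worker_latencies_ms) - len(offers)
    return (len(offers) > 0, stranded)

# Longer waits raise latency variance; shorter waits strand good capacity.
latencies = [5, 20, 80]  # three capable workers with varying response times
assert place(latencies, 100) == (True, 0)  # patient: nothing stranded
assert place(latencies, 10) == (True, 2)   # impatient: two workers stranded
assert place(latencies, 1) == (False, 3)   # too impatient: must add capacity
```

Under load those latencies grow, more offers miss the deadline, and utilization drops — the metastable loop the comment describes.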
       | 
       | Another issue is dealing with failures of capacity (say a few
       | racks lose power). The central system doesn't know what resources
       | it lost (because that knowledge is only distributed in the
       | workers), and so needs to discover that information from the flow
       | of user requests. That can be OK, but again means modal latency
       | behavior in the face of failures.
       | 
       | Third, the broadcast behavior requires O(N^2) messages for N
       | requests processed (on the assumption that the fleet size is O(N)
       | too). This truly isn't a big deal at smaller scales (packets are
       | cheap) but can become expensive at larger scales (N^2 gets
       | steep). The related problem is that the protocol also introduces
       | another round-trip for discovery, increasing latency. That could
       | be as low as a few hundred microseconds, but it's not nothing
       | (and, again, the need to optimize for happy-case latency against
       | bad-case efficiency makes tuning awkward).
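[Editor's note] A back-of-envelope model of that message count, assuming one request to each of N workers plus up to one reply per worker per placement (this accounting is my assumption, not stated in the comment):

```python
def broadcast_messages(n_workers, n_requests):
    # One request fan-out plus up to one reply per worker, per placement.
    return n_requests * 2 * n_workers

assert broadcast_messages(10, 10) == 200                  # cheap at small scale
assert broadcast_messages(10_000, 10_000) == 200_000_000  # steep at large scale
```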
       | 
       | Fourth, the dynamic behavior under load is tricky to reason about
       | because of the race between "I can do this" and getting the work.
       | You can be optimistic (not reserving capacity), at the cost of
       | having to re-run the protocol (potentially an unbounded number of
       | times!) if you lose the race to another source of work. Or, you
       | can be pessimistic (reserving capacity and explicitly releasing
       | what you don't need), at the cost of making the failure cases
       | tricky (see the classic problem with 2PC coordinator failure),
       | and reducing efficiency for popular resources (in proportion to
       | the latency and popularity of the resource you're looking for).
       | Slow coordinators can also cause significant resource wastage, so
       | you're back to tuning timeouts and inventing heuristics. It's a
       | game you can win, but a tough one.
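[Editor's note] The optimistic variant can be sketched like this (hypothetical names; a single-threaded toy, so the "race" shows up only as a claim that can fail and force another pass through the protocol):

```python
class Worker:
    def __init__(self, free_mem_gb):
        self.free_mem_gb = free_mem_gb

    def can_fit(self, mem_gb):    # the "I can do this" offer
        return self.free_mem_gb >= mem_gb

    def try_claim(self, mem_gb):  # may fail if another task won the race
        if self.free_mem_gb >= mem_gb:
            self.free_mem_gb -= mem_gb
            return True
        return False

def schedule_optimistic(workers, mem_gb, max_retries=3):
    """Re-run the offer/claim protocol until a claim sticks or we give up.

    Nothing is reserved between the offer and the claim, so a lost race
    just means another pass -- potentially unbounded without max_retries.
    """
    for _ in range(max_retries):
        for w in workers:
            if w.can_fit(mem_gb) and w.try_claim(mem_gb):
                return w
    return None

workers = [Worker(12)]
assert schedule_optimistic(workers, 8) is workers[0]  # claim sticks, 4GB left
assert schedule_optimistic(workers, 8) is None        # no fit, retries exhausted
```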
       | 
       | This needle-in-a-haystack placement problem really is an
       | interesting one, and it's super cool to see people writing about
       | it and approaching the trade-offs in designs in a different way.
        
         | dangoodmanUT wrote:
         | This is a great summary, thanks for sharing it! Totally agree
         | with you that broadcasting is not the best at AWS scales, but
          | it's so simple and works elegantly at small-to-medium scale
          | (i.e. not AWS scale lol) without requiring a centralized
          | source of truth for resource availability. It's a super
          | interesting problem overall and I think there is still so
          | much more to

         | learn!
         | 
         | Eager to learn more about lambda scheduling, are you referring
         | to this reinvent talk?
         | https://youtu.be/0_jfH6qijVY?si=Uc6xpdpXiJ6oRHWD&t=671
        
           | mjb wrote:
           | Yes, I think that's the most recent talk (and the most in-
           | depth on this side of the problem).
        
         | Cwizard wrote:
         | > We've released a bit about how the AWS Lambda scheduler works
         | (a distributed, but stateful, sticky load balancer).
         | 
         | Do you have a link? I would like to read more about this.
        
           | dangoodmanUT wrote:
           | Check my youtube link above
        
         | shayonj wrote:
         | this is very interesting, thank you for sharing!
        
         | stergios wrote:
         | This reminds me of scheduling for contact centers with agent
         | skill based constraints and service level requirements.
        
       | mbreese wrote:
       | Another place to look for similar workflows is the HPC world. HPC
        | schedulers have been doing resource-requirement-based scheduling
        | forever. Many of them keep the scheduling state
       | centralized though. Instead of asking the workers "who can
       | fulfill these requirements", the coordinator process keeps track
       | of the state of what resources are still available for each
       | worker. If you have a task that needs 12GB of RAM, 4 processors,
       | and is in us-west, the coordinator already knows the answer. So
       | instead of N^2 messages, you just have two (submit work and ack).
       | 
       | The major reason why HPC keeps this centrally is that it makes
       | time based scheduling easier too. In general, you have more work
       | than nodes and each job only has a finite run time. This lets the
       | scheduler do fun things like backfilling jobs and more
       | coordinated scheduling (such as keeping an entire rack idle for
       | power savings unless necessary).
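[Editor's note] The central bookkeeping described in this comment can be sketched as a plain lookup (hypothetical data model, loosely following the 12GB/4-CPU/us-west example above):

```python
def pick_worker(workers, need_mem_gb, need_cpus, need_region):
    """Coordinator-side placement: no broadcast, just a scan over the
    centrally tracked free resources of each worker."""
    for name, w in workers.items():
        if (w["free_mem_gb"] >= need_mem_gb
                and w["free_cpus"] >= need_cpus
                and w["region"] == need_region):
            # Deduct immediately so later placements see the updated state.
            w["free_mem_gb"] -= need_mem_gb
            w["free_cpus"] -= need_cpus
            return name
    return None

workers = {
    "a": {"free_mem_gb": 16, "free_cpus": 8, "region": "us-west"},
    "b": {"free_mem_gb": 8,  "free_cpus": 2, "region": "us-east"},
}
assert pick_worker(workers, 12, 4, "us-west") == "a"
assert pick_worker(workers, 12, 4, "us-west") is None  # "a" has only 4GB left
```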
        
         | dangoodmanUT wrote:
          | Yeah, the problem with a central store is that it can get out
          | of sync, but for time-based scheduling it's definitely
          | superior
        
       | Thaxll wrote:
        | It's easier to let remote compute nodes report their load to a
        | central place and do the calculation there, rather than asking
        | everyone all the time.
        
         | dangoodmanUT wrote:
         | IDK about easier... How is it easier if you now need to add in
         | a DB, workers need to pulse their availability (load on DB),
         | and you need to write queries to the DB to figure it out,
         | handle race conditions on claiming resources, etc?
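[Editor's note] The claiming race this reply mentions is typically handled with a conditional update; a minimal sketch with SQLite (hypothetical schema, not from the post or the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workers (name TEXT PRIMARY KEY, free_mem_gb INT)")
conn.execute("INSERT INTO workers VALUES ('a', 16), ('b', 8)")

def claim(mem_gb):
    """Claim mem_gb on some worker; the WHERE re-check makes the claim
    conditional, so a stale read can't take capacity that is gone."""
    cur = conn.execute(
        "UPDATE workers SET free_mem_gb = free_mem_gb - ? "
        "WHERE name = (SELECT name FROM workers "
        "              WHERE free_mem_gb >= ? LIMIT 1) "
        "AND free_mem_gb >= ?",
        (mem_gb, mem_gb, mem_gb))
    return cur.rowcount == 1

assert claim(12)      # lands on 'a' (16GB free)
assert claim(8)       # lands on 'b'
assert not claim(8)   # only 4GB left anywhere
```

Even this small sketch shows the costs listed above: a database, claim queries, and (not shown) workers heartbeating their availability into the table.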
        
       ___________________________________________________________________
       (page generated 2024-02-25 23:01 UTC)