[HN Gopher] Scaling Kubernetes to 7,500 Nodes
       ___________________________________________________________________
        
       Scaling Kubernetes to 7,500 Nodes
        
       Author : sillysaurusx
       Score  : 109 points
       Date   : 2021-01-25 19:09 UTC (3 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | sandGorgon wrote:
       | > _Pods communicate directly with one another on their pod IP
       | addresses with MPI via SSH, not service endpoints._
       | 
        | This is super interesting. They are using MPI on Kubernetes for
        | their AI training, I suppose.
        | 
        | So they are not using anything like Kubeflow. Any idea which
        | framework this is?
       | 
        | The state of AI training on Kubernetes is not so hot, and this
        | would be a good thing to learn from. There is Ray Distributed
        | today, which claims better performance (as well as a better
        | developer experience) than OpenMPI -
        | https://www.usenix.org/system/files/osdi18-moritz.pdf
        | 
        | I wonder why these choices were made.
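        | 
        | For reference, "MPI over SSH on pod IPs" roughly boils down to
        | something like the sketch below: collect the pod IPs for a job,
        | write an MPI hostfile, and let mpirun ssh into each pod. Purely
        | illustrative - the namespace, label selector, slot count, and
        | launch command are assumptions, not details from the article.
        | 
        |     import subprocess
        |     from kubernetes import client, config
        | 
        |     config.load_incluster_config()  # or load_kube_config() outside the cluster
        |     pods = client.CoreV1Api().list_namespaced_pod(
        |         namespace="training", label_selector="job-name=big-run")
        | 
        |     # One hostfile line per pod IP; "slots" would map to GPUs per node.
        |     with open("hostfile", "w") as f:
        |         for pod in pods.items:
        |             f.write(f"{pod.status.pod_ip} slots=8\n")
        | 
        |     # mpirun reaches each pod over SSH and starts the workers there.
        |     subprocess.run(
        |         ["mpirun", "--hostfile", "hostfile", "python", "train.py"],
        |         check=True)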
        
         | benchess wrote:
         | Hi, co-author here!
         | 
         | We use a pretty standard tech stack of PyTorch + NCCL + MPI.
         | We've used both OpenMPI and MPICH to varying degrees.
         | 
         | Kubeflow is interesting, but it solves a slightly different
         | problem of scheduling/coordinating ML workflows on top of Kube.
         | It doesn't get involved with how an ML job communicates within
         | itself cross-node.
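          | 
          | For the curious, that stack typically wires together roughly
          | like the sketch below: mpirun starts one process per GPU and
          | PyTorch's NCCL backend handles the collectives. This is a
          | generic illustration, not actual OpenAI training code; the
          | environment variables assume an OpenMPI launch and the
          | rendezvous address is made up.
          | 
          |     import os
          |     import torch
          |     import torch.distributed as dist
          | 
          |     # OpenMPI exposes rank/size via these env vars; other launchers differ.
          |     rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
          |     world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
          |     local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
          | 
          |     torch.cuda.set_device(local_rank)
          |     dist.init_process_group(
          |         backend="nccl",                      # GPU collectives over NCCL
          |         init_method="tcp://10.0.0.1:29500",  # hypothetical rank-0 pod IP
          |         rank=rank,
          |         world_size=world_size,
          |     )
          | 
          |     # e.g. an all-reduce across every GPU in the job
          |     t = torch.ones(1, device="cuda")
          |     dist.all_reduce(t)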
        
           | mmq wrote:
            | The OP was probably referring to the MPIOperator,
            | TFOperator, PyTorchOperator, ... They are under the Kubeflow
            | org, but can be deployed independently of Kubeflow itself.
            | Several other projects use those operators to provide
            | abstractions similar to the ones you mentioned in your blog
            | post, e.g. gang scheduling, cross-node communication, ...
           | 
           | One difference is that these operators use the Kubernetes
           | service interface for communication, generally exposing a
           | headless service for each replica.
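            | 
            | For illustration, a headless service (clusterIP: None) gives
            | each replica a stable DNS name that resolves directly to its
            | pod IP. A rough sketch with the Python client - the names,
            | label, and port are made up:
            | 
            |     from kubernetes import client, config
            | 
            |     config.load_kube_config()
            |     core = client.CoreV1Api()
            | 
            |     svc = client.V1Service(
            |         metadata=client.V1ObjectMeta(name="trainer-0"),
            |         spec=client.V1ServiceSpec(
            |             cluster_ip="None",                 # headless: no virtual IP
            |             selector={"replica": "trainer-0"}, # matches exactly one pod
            |             ports=[client.V1ServicePort(port=2222)],
            |         ),
            |     )
            |     core.create_namespaced_service(namespace="default", body=svc)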
        
       | trhway wrote:
       | >Our biggest jobs run MPI, and all pods within the job are
       | participating in a single MPI communicator. If any of the
       | participating pods die, the entire job halts and needs to be
       | restarted.
       | 
        | Sounds like a strange approach for ML jobs - I mean, you'd
        | expect that all those parallel subtasks aren't individually
        | interconnected with each other mid-run, and that a failed
        | subtask could easily be restarted on its own. With thousands of
        | subtasks running in parallel, some are bound to fail for
        | whatever reason. Their choice of MPI over HTTP suggests, though,
        | that latency is at a premium, which in turn suggests the
        | subtasks are actively cross-communicating - a typical case for
        | MPI.
        
         | jamesblonde wrote:
         | This is probably data-parallel training with collective all-
         | reduce (Horovod probably, as they are using MPI). Membership of
         | the ring in Horovod is static - you can't recover from a failed
         | worker. You would need to build a consistent hashing ring (like
         | a DHT), so that workers could identify and agree on failing
          | workers (heartbeats) and evict them. None of those goodies are
          | in Horovod yet.
          | 
          | The workaround is to have a chief node do periodic
          | checkpointing of the model weights and epoch/iteration, so that
         | you can recover from the checkpoint if a worker fails.
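          | 
          | That workaround is only a few lines in practice. A minimal
          | sketch, assuming Horovod on PyTorch - the checkpoint path and
          | call site are made up:
          | 
          |     import torch
          |     import horovod.torch as hvd
          | 
          |     hvd.init()
          | 
          |     def save_checkpoint(model, optimizer, epoch, iteration):
          |         # Only the chief (rank 0) writes; others can restore from
          |         # the latest file on a shared volume after a restart.
          |         if hvd.rank() != 0:
          |             return
          |         torch.save(
          |             {
          |                 "epoch": epoch,
          |                 "iteration": iteration,
          |                 "model": model.state_dict(),
          |                 "optimizer": optimizer.state_dict(),
          |             },
          |             f"/checkpoints/ckpt-{epoch}-{iteration}.pt",
          |         )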
        
       | andyxor wrote:
       | Excellent engineering but I wish they also worked on AI
        
       | jacques_chester wrote:
       | > _That said, strain on the kube-scheduler is spiky. A new job
       | may consist of many hundreds of pods all being created at once,
       | then return to a relatively low rate of churn._
       | 
       | Last I checked, the default scheduler places Pods one at a time.
       | It might be advantageous to use a gang/batch scheduler like kube-
       | batch[0], Poseidon[1] or DCM[2].
       | 
       | Edit: looks like they're already investigating that approach --
       | 
       | > _We tried a few things needing a custom scheduler, but ran into
       | edge cases that caused conflicts with how normal pods were
       | scheduled. Kubernetes 1.18 introduced a plugin architecture for
       | the core Kubernetes scheduler, making it much easier to add
       | features like this natively. We recently landed on the
       | Coscheduling plugin as a good way to solve this problem._
       | 
       | [0] https://github.com/kubernetes-sigs/kube-batch
       | 
       | [1] https://github.com/kubernetes-sigs/poseidon
       | 
       | [2] https://github.com/vmware/declarative-cluster-management
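        | 
        | The gist of gang scheduling, whichever implementation you pick:
        | every pod in a job is tagged as part of one group, and the
        | scheduler only binds the group once it fits in its entirety. A
        | hedged sketch - the scheduler name and group label below are
        | placeholders, not the real names from any particular plugin:
        | 
        |     from kubernetes import client
        | 
        |     def worker_pod(job: str, index: int) -> client.V1Pod:
        |         return client.V1Pod(
        |             metadata=client.V1ObjectMeta(
        |                 name=f"{job}-worker-{index}",
        |                 labels={"pod-group": job},     # hypothetical group label
        |             ),
        |             spec=client.V1PodSpec(
        |                 scheduler_name="coscheduler",  # hypothetical scheduler name
        |                 restart_policy="Never",
        |                 containers=[client.V1Container(
        |                     name="trainer", image="trainer:latest")],
        |             ),
        |         )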
        
       | sillysaurusx wrote:
        | Gotta hand it to OpenAI. My opinion about them is slowly
       | reversing. I was nervous when they went full API, but CLIP is a
       | fantastic model that they released for free.
        
       | chubot wrote:
       | _A large machine learning job spans many nodes and runs most
       | efficiently when it has access to all of the hardware resources
       | on each node ... So for many of our workloads, a single pod
       | occupies the entire node_
       | 
       | Hm, why not just use the underlying nodes then, without
       | Kubernetes?
       | 
       | Is the underlying cloud that bad at scheduling, and are they
       | keeping the VMs warm all the time?
       | 
       | What are they gaining for this indirection? Is it to get a common
       | interface across GCP and other clouds?
       | 
       |  _Bin-packing or fragmentation is not a common problem_
       | 
       |  _there's relatively low strain on the scheduler._
       | 
       |  _That said, strain on the kube-scheduler is spiky_
        
         | benchess wrote:
         | Hi! Co-author here. We do keep the nodes running 24/7, so
         | Kubernetes still provides the scheduling to decide which nodes
         | are free or not at any given time. Generally starting a
          | container on a pre-warmed node is still much, much faster than
         | booting a VM. Also, some of our servers are bare-metal.
         | 
         | EDIT: Also don't discount the rest of the Kubernetes ecosystem.
         | It's more than just a scheduler. It provides configuration,
         | secrets management, healthchecks, self-healing, service
         | discovery, ACLs... there are absolutely other ways to solve
         | each of these things. But when starting from scratch there's a
         | wide field of additional questions to answer, problems to
         | solve.
        
           | xorcist wrote:
           | Isn't Kubernetes a pretty lousy scheduler when it doesn't
           | take this into consideration? There are a number of
           | schedulers used in high performance computing that should be
           | able to do a better job.
        
             | chubot wrote:
             | Yeah exactly... This seems closer to an HPC problem, not a
             | "cloud" problem.
             | 
             | Related comment from 6 months ago about Kubernetes use
             | cases: https://lobste.rs/s/kx1jj4/what_has_your_experience_
             | with_kub...
             | 
             | Summary: scale has at least 2 different meanings. Scaling
             | in resources doesn't really mean you need Kubernetes.
             | Scaling in terms of workload diversity is a better use case
             | for it.
             | 
             | Kubernetes is basically a knockoff of Borg, but Borg is
             | designed (or evolved) to run diverse services (search,
              | maps, Gmail, etc.; batch and low latency). Ironically, most
             | people who run their own Kube clusters don't seem to have
             | much workload diversity.
             | 
             | On the other hand, HPC is usually about scaling in terms of
             | resources: running a few huge jobs on many nodes. A single
             | job will occupy an entire node (and thousands of nodes),
             | which is what's happening here.
             | 
             | I've never used these HPC systems but it looks like they
             | are starting to run on the cloud. Kubernetes may still have
             | been a defensible choice for other reasons, but as someone
             | who used Borg for a long time, it's weird what it's turned
              | into. Sort of like how protobufs now have a weird "reflection
             | service". Huh?
             | 
             | https://aws.amazon.com/blogs/publicsector/tag/htcondor/
             | 
             | https://aws.amazon.com/marketplace/pp/Center-for-High-
             | Throug...
        
               | [deleted]
        
               | jacobr1 wrote:
                | Exactly. We migrated to k8s not because we needed better
                | scaling (EC2 auto scaling groups were working reasonably
                | well for us) but because we kept inventing our own ways
                | to do rolling deploys or run scheduled jobs, and had a
                | variety of ways to store secrets. On top of that,
                | developers were increasingly running their own containers
                | with docker compose test services talking to each other.
                | We migrated to k8s to A) have a way to standardize how to
                | run containerized builds and get the benefit of "it works
                | on my laptop" matching how it works in production (at
                | least functionally), and B) have a common set of patterns
                | for managing deployed software. Resource scheduling only
                | became of interest after we migrated, when we realized
                | the aggregation of our workloads allowed us to use things
                | like spot instances without jeopardizing availability.
        
               | vergessenmir wrote:
                | It may be an HPC problem, but I'm not sure the available
                | solutions come close to k8s in terms of functionality,
                | and I'm not talking about scheduling.
                | 
                | I used to work in HPC/Grid - it's been a while - but I do
                | remember Condor being clunky even though it had its uses.
                | 
                | And the commercial grid offerings couldn't scale to
                | almost 10k nodes back then (I'm not sure about now, or
                | whether they even exist anymore).
        
               | toomuchtodo wrote:
               | Condor is clunky, but still in use in high energy
               | physics, for example (LHC CMS detector data processing).
               | 
                | For greenfield deployments, I would recommend HashiCorp's
                | Nomad before Kubernetes or Condor if your per-server
                | container intent is ~1 (bare metal with a light
                | hypervisor for orchestration), but would still steer you
                | to Kubernetes for microservices and web-based
                | cookie-cutter apps (I know many finance shops using
                | Nomad, but Cloudflare uses it with Consul, so there are
                | no hard and fast rules).
               | 
               | Disclosure: Worked in HPC space managing a cluster for
               | high energy physics. I also use (free version) Nomad for
               | personal cluster workload scheduling.
        
               | jedbrown wrote:
                | Condor and the like are for independent jobs
                | ("throughput computing"), but the authors here are using
                | MPI for tightly coupled jobs. SLURM and Flux are
                | actively developed schedulers for this kind of job.
               | 
               | https://slurm.schedmd.com/
               | 
               | https://flux-framework.readthedocs.io/en/latest/
        
             | AlphaSite wrote:
              | If all you care about is whether a node is in use or not,
              | I think it's fine. You don't need anything complex from
              | the scheduler.
        
           | stonogo wrote:
           | Are you starting from scratch? This architecture seems like a
           | pretty standard HPC deployment with unnecessary
           | containerization involved.
        
           | hamandcheese wrote:
            | Not to mention it's a well-known skillset that can more
            | easily be hired for, as opposed to "come work on our crazy
            | sauce job scheduler, you'll love it!"
        
           | dijit wrote:
           | I feel like we solved this problem over a decade ago (if
           | you're keeping machines warm anyway) with job brokers. Am I
           | somehow mistaken?
        
       | dssdd wrote:
       | >Pod network traffic shaping
       | 
       | Have you considered EDT-based rate limiting for Pods? This should
       | scale well compared to TBF or HTB. Cilium developers have
       | integrated this natively:
       | https://cilium.io/blog/2020/11/10/cilium-19#bwmanager
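        | 
        | For context, the bandwidth manager enforces the standard
        | kubernetes.io/egress-bandwidth pod annotation, just EDT-based
        | instead of via a TBF/HTB qdisc. A minimal sketch of declaring
        | the limit at pod creation - pod name, namespace, image, and the
        | 10G figure are made up:
        | 
        |     from kubernetes import client, config
        | 
        |     config.load_kube_config()
        |     core = client.CoreV1Api()
        | 
        |     pod = client.V1Pod(
        |         metadata=client.V1ObjectMeta(
        |             name="trainer-0",
        |             annotations={"kubernetes.io/egress-bandwidth": "10G"},
        |         ),
        |         spec=client.V1PodSpec(containers=[
        |             client.V1Container(name="trainer", image="trainer:latest")]),
        |     )
        |     core.create_namespaced_pod(namespace="training", body=pod)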
        
         | benchess wrote:
         | Hi, co-author here. Yes we are excited about the potential of
         | this!
        
       ___________________________________________________________________
       (page generated 2021-01-25 23:00 UTC)