[HN Gopher] Building the largest known Kubernetes cluster
       ___________________________________________________________________
        
       Building the largest known Kubernetes cluster
        
       Author : TangerineDream
       Score  : 86 points
       Date   : 2025-11-21 17:56 UTC (3 days ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | rvz wrote:
       | > While we don't yet officially support 130K nodes, we're very
       | encouraged by these findings. If your workloads require this
       | level of scale, reach out to us to discuss your specific needs
       | 
       | Obviously this is a typical experiment at Google on running a K8s
       | cluster at 130K nodes but if there is a company out there that
       | "requires" this scale, I must question their architecture and
       | their infrastructure costs.
       | 
       | But of course someone will always request that they somehow need
       | this sort of scale to run their enterprise app. But once again,
       | let's remind the pre-revenue startups talking about scale before
       | they hit PMF:
       | 
       | Unless you are ready to donate tens of billions of dollars
       | yearly, you do not need this.
       | 
       | You are not Google.
        
         | mlnj wrote:
         | >You are not Google.
         | 
         | It's literally Google coming out with this capability and how
         | is the criticism still "You are not Google"
        
           | Rastonbury wrote:
           | The criticism is at pre-PMF startups who believe they need
           | something similar
        
         | jcims wrote:
         | I work for a mature public company that most people in the US
         | have at least heard of. We're far from the largest in our
         | industry and we run jobs with more than that almost every
         | night. Not via k8s though.
        
           | Tostino wrote:
           | You have jobs running on more than 130k different machines
           | daily??
           | 
           | Are they cloud based VMs, or your own hardware? If cloud
           | based, do you reprovision all of them daily and incur no cost
           | when you are not running jobs? If it's your own hardware,
           | what else do you do with it when not batch processing?
        
             | jcims wrote:
             | They are provisioned on demand (cloud) and shut down when
             | no longer needed.
        
         | game_the0ry wrote:
         | > You are not Google.
         | 
         | 100% agree.
         | 
         | People at my co are horny to adopt k8s. Really, tech leads want
         | to put it on their resume ("resume driven development") and use
         | a tool that was made to solve a particular problem we never
         | had. The downside is that we now need to be proficient at it,
         | know how to troubleshoot it, etc. It was sold to leadership as
         | something that would make our lives easier but the exact
         | opposite has happened.
        
           | BruSwain wrote:
           | I think k8s has a learning curve, absolutely, and there are
           | absolutely cases where it can be unnecessary overhead. But I
           | actually think those cases are pretty small. If you're
           | running multiple apps, k8s is valuable. There is initial
           | investment in learning the system, but it's very extensible,
           | flexible, & portable. (Yes, every hyperscaler's
           | implementation of k8s has its own nuance in certain places,
           | but the core concept of k8s translates very well)
        
       | hazz99 wrote:
       | I'm sure this work is very impressive, but these QPS numbers
       | don't seem particularly high to me, at least compared to existing
       | horizontally scalable service patterns. Why is it hard for the
       | kube control plane to hit these numbers?
       | 
       | For instance, postgres can hit this sort of QPS easily, afaik.
       | It's not distributed, but I'm sure Vitess could do something
       | similar. The query patterns don't seem particularly complex
       | either.
       | 
       | Not trying to be reductive - I'm sure there's some complexity
       | here I'm missing!
        
         | phrotoma wrote:
         | I am extremely Not A Database Person but I understand that the
         | rationale for Kubernetes adopting etcd as its preferred data
         | store was more about its distributed consistency features and
         | less about query throughput. etcd is slower cause it's doing
         | RAFT things and flushing stuff to disk.
         | 
         | Projects like kine allow K8s users to swap sqlite or postgres
         | in place of etcd which (I assume, please correct me otherwise)
         | would deliver better throughput since those backends don't need
         | to perform consensus operations.
         | 
         | https://github.com/k3s-io/kine
        
           | dijit wrote:
           | You might not be a database person, but you're spot on.
           | 
           | A well managed HA postgresql (active/passive) is going to run
           | circles around etcd for kube controlplane operations.
           | 
           | The caveat here is increased risk of downtime, and a much
           | higher management overhead, which is why it's not the default.
        
           | Sayrus wrote:
           | GKE uses Spanner as an etcd replacement.
        
             | ZeroCool2u wrote:
             | But, and I'm honestly asking, you as a GKE user don't have
             | to manage that spanner instance, right? So, you should in
             | theory be able to just throw higher loads at it and spanner
             | should be autoscaling?
        
               | DougBTX wrote:
               | Yes, from the article:
               | 
               | > To support the cluster's massive scale, we relied on a
               | proprietary key-value store based on Google's Spanner
               | distributed database... We didn't witness any bottlenecks
               | with respect to the new storage system and it showed no
               | signs of it not being able to support higher scales.
        
               | ZeroCool2u wrote:
               | Yeah, I guess my question was a bit more nuanced. What I
               | was curious about was if they were fully relying on
               | normal autoscaling that any customer would get or were
               | they manually scaling the spanner instance in
               | anticipation of the load? I guess it's unlikely we're
               | going to get that level of detailed info from this
               | article though.
        
           | travem wrote:
           | There are also distributed databases that use RAFT but can
           | still scale while delivering distributed consensus, so it is
           | not a challenge that can't be solved. For example, TiDB
           | handles millions of QPS while delivering ACID transactions,
           | e.g. https://vivekbansal.substack.com/p/system-design-study-
           | how-f...
        
         | PunchyHamster wrote:
         | it's not really bottlenecked by the store but by the
         | calculations performed on each pod schedule/creation.
         | 
         | It's basically "take global state of node load and capacity,
         | pick where to schedule it", and I'd imagine probably not
         | running in parallel coz that would be far harder to manage.
        
           | senorrib wrote:
           | Not a k8s dev, but I feel like this is the answer. K8s isn't
           | usually just scheduling pods round robin or at random.
           | There's a lot of state to evaluate, and scheduling pods
           | becomes an NP-hard problem similar to the bin packing
           | problem. I doubt the implementation tries to be optimal
           | here, but it feels like a computationally heavy problem.
        
             | OvervCW wrote:
             | In what way is it NP-hard? From what I can gather it just
             | eliminates nodes where the pod wouldn't be allowed to run,
             | calculates a score for each and then randomly selects one
             | of the nodes that have the highest score, so trivially
             | parallelizable.
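The filter, score, select loop described in this comment can be sketched roughly as below. This is a toy model and not the kube-scheduler's code: the `Node` fields and the scoring function are invented for illustration. The sketch follows the documented kube-scheduler behaviour of preferring the highest-ranked feasible node and breaking ties at random.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float  # cores still unallocated (hypothetical field)
    free_mem: float  # GiB still unallocated (hypothetical field)


def schedule(nodes, cpu_request, mem_request, rng=random):
    """Pick a node for one pod: filter, score, then choose among the best."""
    # Filter: drop nodes that cannot fit the pod at all.
    feasible = [n for n in nodes
                if n.free_cpu >= cpu_request and n.free_mem >= mem_request]
    if not feasible:
        return None  # in a real cluster the pod would stay pending

    # Score: a toy "most headroom left" score (higher is better).
    def score(n):
        return (n.free_cpu - cpu_request) + (n.free_mem - mem_request)

    best = max(score(n) for n in feasible)
    # Select: break ties among the top-scoring nodes at random.
    return rng.choice([n for n in feasible if score(n) == best])
```

Each node's filter and score depend only on that node's own state, which is why the per-pod work parallelizes well. The harder part is that placing one pod changes the state the next pod sees.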
        
           | __turbobrew__ wrote:
           | The k8s scheduler lets you tweak how many nodes to look at
           | when scheduling a pod (percentage of nodes to score) so you
           | can change how big "global state" is according to the
           | scheduler algorithm.
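The knob mentioned here is exposed as `percentageOfNodesToScore` in the scheduler configuration. A rough sketch of its effect on the amount of state considered per pod is below. The function name and the `floor` parameter are illustrative assumptions, not the scheduler's actual implementation.

```python
import math


def nodes_to_score(total_nodes, percentage, floor=100):
    """Roughly how many feasible nodes to find before scoring stops.

    `floor` is an assumed lower bound so that small clusters are still
    searched in full; treat the exact value as illustrative.
    """
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    wanted = math.ceil(total_nodes * percentage / 100)
    return min(total_nodes, max(wanted, floor))
```

Under these assumptions, a 5% setting on a 130,000-node cluster would score about 6,500 nodes per pod instead of all of them, trading placement quality for scheduling throughput.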
        
         | nonameiguess wrote:
         | It says in the blog that they require 13,000 queries per second
         | to _update lease objects_, not that 13,000 is the total for
         | all queries. I don't know why they cite that instead of total,
         | but etcd's normal performance testing indicates it can handle
         | at least 50,000 writes per second and 180,000 reads:
         | https://etcd.io/docs/v3.6/op-guide/performance/. So, without
         | them saying what the real number is, I'm going to guess their
         | reads and writes outside of lease updates are at least much
         | larger than those numbers.
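A back-of-the-envelope check of the lease figure, assuming each node renews one Lease object every 10 seconds (my understanding of the kubelet default; treat the interval as an assumption):

```python
def lease_update_qps(node_count, renew_interval_s=10.0):
    """Steady-state write rate caused by node lease renewals alone."""
    return node_count / renew_interval_s


# Share of the 50,000 writes/s etcd benchmark figure cited in the comment.
def share_of_write_budget(node_count, write_budget_qps=50_000):
    return lease_update_qps(node_count) / write_budget_qps
```

With 130,000 nodes this gives 13,000 lease writes per second, about a quarter of the cited etcd write figure before any pod, event, or status traffic is counted.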
        
       | xyse53 wrote:
       | They mention GCS fuse. We've had nothing but performance and
       | stability problems with this.
       | 
       | We treat it as a best effort alternative when native GCS access
       | isn't possible.
        
         | dijit wrote:
         | fuse based filesystems _in general_ shouldn't be treated as
         | production ready in my experience.
         | 
         | They're wonderful for low volume, low performance and low
         | reliability operations. (browsing, copying, integrating with
         | legacy systems that do not permit native access), but beyond
         | that they consume huge resources and do odd things when the
         | backend is not in its most ideal state.
        
           | thundergolfer wrote:
           | AWS Lambda uses FUSE and that's one of the largest prod
           | systems in the world.
        
             | dijit wrote:
             | An option exists, but they prefer you use the block storage
             | API.
        
           | dotwaffle wrote:
           | I started rewriting gcsfuse using
           | https://github.com/hanwen/go-fuse instead of
           | https://github.com/jacobsa/fuse and found it rock-solid. FUSE
           | has come a long way in the last few years, including things
           | like passthrough.
           | 
           | Honestly, I'd give FUSE a second chance, you'd be surprised
           | at how useful it can be -- after all, it's literally running
           | in userland so you don't need to do anything funky with
           | privileges. However, if I were starting afresh on a similar
           | project I'd probably be looking at using 9p2000.L instead.
        
       | zoobab wrote:
       | The new mainframe.
        
       | blurrybird wrote:
       | AWS and Anthropic did this back in July:
       | https://aws.amazon.com/blogs/containers/amazon-eks-enables-u...
        
         | cowsandmilk wrote:
         | That is 100k vs 130k for Google's new announcement. I can't
         | speak as to whether the additional 30k presented new challenges
         | though.
        
           | Cthulhu_ wrote:
           | I want to believe that this is an order-of-magnitude kind of
           | problem, that is, if 100K is fine then 500K is also fine.
           | 
           | I only skimmed the article though, but I'm confident that
           | it's more a physical hardware, time, space and electricity
           | problem than a software / orchestration one; the article
           | mentions that a cluster that size needs to be multi-
           | datacenter already given the sheer power requirements (2700
           | watts for one GPU in a single node).
        
       | belter wrote:
       | 130k nodes...cute...but can Google conquer the ultimate software
       | engineering challenge they warn you about in CS school? A
       | functional online signup flow?
        
         | jasonvorhe wrote:
         | For what? Access to the control plane API?
        
           | belter wrote:
           | In general... Try to sign up for their AI services...
        
         | chrisandchris wrote:
         | They could team up with Microsoft, because their signup flow is
         | fine but the login flow is badly broken.
        
       | yanhangyhy wrote:
       | there is a doc about how to do with 1M nodes:
       | https://bchess.github.io/k8s-1m/#_why
       | 
       | so i guess the title is not true?
        
         | arccy wrote:
         | That's simulated using kwok, not real.
         | 
         | > Unfortunately running 1M real kubelets is beyond my budget.
        
         | Thaxll wrote:
         | This is a PoC not backed by a reliable etcd replacement.
        
       | jakupovic wrote:
       | Doing this at anything > 1k nodes is a pain in the butt. We
       | decided to run many <100 nodes clusters rather than a few big
       | ones.
        
         | kvrty wrote:
         | Same here. Non Kubernetes project originated control plane
         | components start failing beyond a certain limit - your ingress
         | controllers, service meshes etc. So I don't usually take node
         | numbers from these benchmarks seriously for our kind of
         | workloads. We run a bunch of sub-1k node clusters.
        
         | liveoneggs wrote:
         | Same. The control plane and various controllers just aren't up
         | to the task.
        
         | preisschild wrote:
         | Meh, I've had clusters with close to 1k nodes (w/ cilium as
         | CNI) and didn't have major issues
        
           | __turbobrew__ wrote:
           | When I was involved about a year ago, cilium fell apart at
           | around a few thousand nodes.
           | 
           | One of the main issues of cilium is that the bpf maps scale
           | with the number of nodes/pods in the cluster, so you get
           | exponential memory growth as you add more nodes with the
           | cilium agent on them. https://docs.cilium.io/en/stable/operat
           | ions/performance/scal...
        
             | oasisaimlessly wrote:
             | Wouldn't that be quadratic rather than exponential?
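The distinction can be made concrete with a simplified model. It assumes every node runs one agent and each agent keeps one map entry per node in the cluster. That is a simplification of the claim above, not Cilium's actual data layout.

```python
def per_agent_entries(nodes, entries_per_peer=1):
    """Entries a single agent holds: linear in cluster size."""
    return nodes * entries_per_peer


def cluster_wide_entries(nodes, entries_per_peer=1):
    """Entries summed over all agents: quadratic in cluster size."""
    return nodes * per_agent_entries(nodes, entries_per_peer)
```

In this model, doubling the node count doubles each agent's memory and quadruples the cluster-wide total. Exponential growth would instead double the total for every single node added.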
        
       | blinding-streak wrote:
       | Imagine a Beowulf cluster of these
        
       | sandGorgon wrote:
       | does anyone know the size at openai ? it used to run a 7500 node
       | cluster back in 2021 https://openai.com/index/scaling-kubernetes-
       | to-7500-nodes/
        
       | blamestross wrote:
       | I worked in DHTs in grad school. I still double take that Google
       | and other companies "computers dedicated to a task" numbers are
       | missing 2 digits from what I expected. We have a lot of room left
       | for expansion, we just have to relax centralized management
       | expectations.
        
       | Nextgrid wrote:
       | K8S clusters on VMs strike me as odd.
       | 
       | I see the appeal of K8s in dividing raw, stateful hardware to run
       | multiple parallel workloads, but if you're dealing with stateless
       | cloud VMs, why would you need K8S and its overhead when the VM
       | hypervisor already gives you all that functionality?
       | 
       | And if you insist anyway, run a few big VMs rather than many
       | small ones, since K8s overhead is per-node.
        
         | victorbjorklund wrote:
         | Because k8s gives you lots of other things out of the box like
         | easy scaling of apps etc. Harder to do on VM:s where you would
         | either have to dedicate one VM per app (might be a waste of
         | resources) or you have to try and deploy and run multiple apps
         | on multiple VM:s etc.
         | 
         | (For the record I'm not a k8s fanatic. Most of the time a
         | regular VM is better. But a VM isn't = a kubernetes cluster).
        
         | GauntletWizard wrote:
         | The reason to target k8s on cloud vms is that cloud VMs don't
         | subdivide as easily or as cleanly. Managing them is a pain. K8s
         | is an abstraction layer for that - Rather than building whole
         | machine images for each product, you create lighter weight
         | docker images (how light weight is a point of some contention),
         | and you only have to install your logging, monitoring, and etc
         | once.
         | 
         | Your advice about bigger machines is spot on - K8s' biggest
         | problem is how relatively heavyweight the kubelet is, with
         | memory requirements of roughly half a gig. On a modern 128g
         | server node that's a reasonable overhead, for small companies
         | running a few workloads on 16g nodes it's a cost of doing
         | business, but if you're running 8 or 4g nodes, it looks pretty
         | grim for your utilization.
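The utilization point is easy to quantify. This sketch takes the "roughly half a gig" kubelet figure from the comment at face value and ignores other per-node daemons.

```python
def kubelet_overhead_pct(node_mem_gib, kubelet_gib=0.5):
    """Share of a node's memory consumed by the kubelet, in percent."""
    return 100 * kubelet_gib / node_mem_gib


if __name__ == "__main__":
    for size in (128, 16, 8, 4):
        print(f"{size:>3} GiB node: {kubelet_overhead_pct(size):.1f}% of memory")
```

Under that assumption the overhead is well under 1% on a 128 GiB node, about 3% at 16 GiB, and 12.5% at 4 GiB.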
        
           | nyrikki wrote:
           | You can run pods with podman and avoid the entire k8s stack
           | or even use minikube on a machine if you wanted to. Now that
           | rootless is the default in k8s[0] the workflow is even more
           | convenient and you can even use systemd with isolated users
           | on the VM to provide more modularity and separation.
           | 
           | It really just depends on if you feel that you get value from
           | the orchestration that full k8s offers.
           | 
           | Note that on k8s or podman, you can get rid of most of the
           | 'cost' of that virtualization for single placement and/or
           | long lived pods by simply sharing an emptyDir or volume
           | between pod members.
           | 
           |     # Create Pod
           |     podman pod create --name pgdemo-pod
           |     # Create client
           |     podman run -dt -v pgdemo:/mnt --pod pgdemo-pod \
           |       -e POSTGRES_PASSWORD=password --name client \
           |       docker.io/ubuntu:25.04
           |     # Unsafe hack to fix permissions in quick demo and
           |     # install packages
           |     podman exec client /bin/bash -c \
           |       'chmod 0777 /mnt; apt update ; apt install -y postgresql-client'
           |     # Create postgres server
           |     podman run -dt -v pgdemo:/mnt --pod pgdemo-pod \
           |       -e POSTGRES_PASSWORD=password --name pg \
           |       docker.io/postgres:bookworm \
           |       -c unix_socket_directories='/mnt,/var/run/postgresql/'
           |     # Invoke client using unix socket
           |     podman exec -it client /bin/bash -c "psql -U postgres -h /mnt"
           |     # Invoke client using localhost network
           |     podman exec -it client /bin/bash -c "psql -U postgres -h localhost"
           | 
           | There is enough there for you to test to see that the
           | performance is so close to native sharing unix sockets that
           | way, that there is very little performance cost and a lot of
           | security and workflow benefits to gain.
           | 
           | Podman is daemonless, easily rootless, and on Mac even lets
           | you ssh into the local Linux VM with `podman machine ssh`,
           | so you aren't stuck with the hidden abstractions of
           | docker-desktop, which hides that from you. It has lots of
           | value.
           | 
           | Plus you can dump a k8s-like YAML to use for the above with:
           | 
           |     podman kube generate pgdemo-pod
           | 
           | So you can gain the advantages of k8s without the overhead of
           | the cluster, and there are ways to launch those pods from
           | systemd even from a local user that has zero sudo abilities
           | etc...
           | 
           | I am using it to validate that upstream containers don't have
           | dial home by producing pcap files, and I would also typically
           | run the above with no network on the pgsql host, so it
           | doesn't have internet access.
           | 
           | IMHO what gets missed, because k8s pods are thought of as
           | the minimal unit of deployment, is that in the general form
           | they are just a collection of containers with specific
           | shared namespaces.
           | 
           | As Redhat gave podman to CNCF in 2024, I have shifted to it,
           | so haven't seen if rancher can do the same.
           | 
           | The point is that you don't even need the complexity of
           | minikube on VMs, you can use most of the workflow even for
           | the traditional model.
           | 
           | [0] https://kubernetes.io/blog/2025/04/25/userns-enabled-by-
           | defa...
        
         | locknitpicker wrote:
         | > I see the appeal of K8s in dividing raw, stateful hardware to
         | run multiple parallel workloads, but if you're dealing with
         | stateless cloud VMs, why would you need K8S and its overhead
         | when the VM hypervisor already gives you all that
         | functionality?
         | 
         | I think you're not familiar with Kubernetes and what features
         | it provides.
         | 
         | For example, kubernetes supports blue-green deployments and
         | rollbacks, software-defined networks, DNS, node-specific purges
         | and taints, etc. Those are not hypervisor features.
         | 
         | Also, VMs are the primitives of some cloud providers.
         | 
         | It sounds like you heard about how Borg/Kubernetes was used to
         | simplify the task of putting together clusters with COTS
         | hardware and you didn't bother to learn more about
         | Kubernetes.
        
         | acedTrex wrote:
         | because if you just do a few huge VMs you still have all the
         | problems that k8s solves out of the box. Except now you have to
         | solve them yourself, which will likely end up being a crappier
         | less robust version of kubernetes.
        
         | tayo42 wrote:
         | In a large organization they're more efficient to run on VMs. You
         | can colocate services that fit together on one machine.
         | 
         | And in reality no one sizes their machines correctly. They
         | always do some handwavey thing like we need 4 cores, but maybe
         | we'll burst and maybe there will be an outage so let's double it.
         | Now all that utilization can be watched and you can take
         | advantage of over subscription.
        
       | supportengineer wrote:
       | Imagine a Beowulf cluster of these
        
       | __turbobrew__ wrote:
       | It makes me sad that to get these scalability numbers requires
       | some secret sauce on top of spanner, which nobody else in the
       | k8s community can benefit from. Etcd is the main bottleneck in
       | upstream k8s and it seems like there is no real steam to build an
       | upstream replacement for etcd/boltdb.
       | 
       | I did poke around a while ago to see what interfaces that etcd
       | has calling into boltdb, but the interface doesn't seem super
       | clean right now, so the first step in getting off boltdb would be
       | creating a clean interface that could be implemented by another
       | db.
        
         | iwontberude wrote:
         | For those not aware, if you create too many resources you can
         | easily use up all of the 8GB hard coded maximum size in etcd
         | which causes a cluster failure. With compaction and maintenance
         | this risk is mitigated somewhat but it just takes one
         | misbehaving operator or integration (e.g. hundreds of thousands
         | of dex session resources created for pingdom/crawlers) to mess
         | everything up. Backups of etcd are critical. That dex example
         | is why I stopped it for my IDP.
        
           | scoodah wrote:
           | This is why I've always thought Tekton was a strange project.
           | It feels inevitable that if you buy into Tekton CI/CD you
           | will hit issues with etcd scaling due to the sheer number of
           | resources you can wind up with.
        
         | locknitpicker wrote:
         | > It makes me sad that to get these scalability numbers
         | requires some secret sauce on top of spanner, which nobody
         | else in the k8s community can benefit from.
         | 
         | I'm not so sure. I mean, everything has tradeoffs, and what you
         | need to do to put together the largest cluster known to man is
         | not necessarily what you want to have to put together a mundane
         | cluster.
        
         | nonameiguess wrote:
         | It's _possible_ I'm talking out of my ass and totally wrong
         | because I'm basing this on principles, not benchmarking, but
         | I'm pretty sure the problem is more etcd itself than boltdb.
         | Specifically, the Raft protocol requires that the cluster
         | leader's log has to be replicated to a quorum of voting
         | members, who need to write to disk, including a flush, and then
         | respond to the leader, before a write is considered committed.
         | That's floor(n/2) + 1 disk flushes and twice as many network
         | roundtrips to write any value. When your control plane has to
         | span multiple data centers because the electricity cost of the
         | cluster is too large for a single building to handle, it's hard
         | for that not to become a bottleneck. Other limitations include
         | the 8GiB disk limit another comment mentions and etcd's hard-
         | coded 1.5 MiB request size limit that prevents you from writing
         | large object collections in a single bundle.
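The quorum arithmetic from this paragraph, as a small sketch (function names are illustrative):

```python
def quorum(members):
    """Smallest majority of a Raft cluster: floor(n/2) + 1."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can be lost while a majority is still reachable."""
    return members - quorum(members)
```

A 5-member cluster needs 3 durable writes per commit and survives 2 failures. Going to 7 members raises the quorum to 4, so every write waits on one more flush in exchange for one more tolerated failure.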
         | 
         | etcd is fine for what it is, but that's a system meant to be
         | reliable and simple to implement. Those are important
         | qualities, but it wasn't built for scale or for speed.
         | Ironically, etcd recommends 5 as the ideal number of cluster
         | members and 7 as a maximum based on Google's findings from
         | running chubby, that between-member latency gets too big
         | otherwise. With 5, that means you can't ever store more than
         | 40GiB of data. I have no idea what a typical ratio of cluster
         | nodes to total data is, but that only gives you about 323KiB
         | per node for 130,000 nodes, which doesn't seem like very much.
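The per-node budget can be checked with a quick calculation. The first case takes the comment's 40GiB total at face value. The second assumes that each Raft member holds a full replica, so usable capacity is closer to a single member's 8GiB. Both are estimates under stated assumptions, not measured numbers.

```python
GIB = 1024 ** 3
KIB = 1024


def per_node_budget_kib(total_bytes, nodes):
    """Datastore bytes available per cluster node, in KiB."""
    return total_bytes / nodes / KIB


if __name__ == "__main__":
    for label, total in (("5 x 8GiB summed", 40 * GIB),
                         ("one 8GiB replica", 8 * GIB)):
        budget = per_node_budget_kib(total, 130_000)
        print(f"{label}: {budget:.0f} KiB per node")
```

Either way the budget lands in the tens to hundreds of KiB per node, which has to cover the Node object, its Lease, and every pod, event, and status object attributed to that node.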
         | 
         | There are other options. k3s made kine which acts as a shim
         | intercepting the etcd API calls made by the apiserver and
         | translating it into calls to some other dbms. Originally, this
         | was to make a really small Kubernetes that used an embedded
         | sqlite as its datastore, but you could do the same thing for
         | any arbitrary backend by just changing one side of the shim.
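To illustrate the shim idea, here is a toy key-value facade over SQLite that keeps an append-only revision log, loosely in the spirit of what is described above. The class, table layout, and method names are invented for this sketch. It is not kine's schema or API, and it omits watches, leases, and transactions.

```python
import sqlite3


class SqlKeyValueShim:
    """Toy revisioned key-value store on SQL (hypothetical, not kine)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv ("
            " revision INTEGER PRIMARY KEY AUTOINCREMENT,"
            " key TEXT NOT NULL,"
            " value BLOB,"
            " deleted INTEGER NOT NULL DEFAULT 0)"
        )

    def put(self, key, value):
        """Append a new revision for key and return its revision number."""
        with self.db:  # commits on success, rolls back on error
            cur = self.db.execute(
                "INSERT INTO kv (key, value) VALUES (?, ?)", (key, value))
        return cur.lastrowid

    def delete(self, key):
        """Append a tombstone revision for key."""
        with self.db:
            cur = self.db.execute(
                "INSERT INTO kv (key, value, deleted) VALUES (?, NULL, 1)",
                (key,))
        return cur.lastrowid

    def get(self, key):
        """Return (revision, value) for the latest live revision, or None."""
        row = self.db.execute(
            "SELECT revision, value, deleted FROM kv"
            " WHERE key = ? ORDER BY revision DESC LIMIT 1", (key,)
        ).fetchone()
        if row is None or row[2]:
            return None
        return row[0], row[1]
```

Swapping the backend then means replacing the SQL side while the key-value side seen by the caller stays the same, which is the "change one side of the shim" point made in the comment.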
        
       | bhouston wrote:
       | Sounds like hell. But I do really dislike Kubernetes:
       | https://benhouston3d.com/blog/why-i-left-kubernetes-for-goog...
        
       | jeffbee wrote:
       | You could remove all references to AI/ML topics from this article
       | and it would remain just as interesting and informative. I really
       | hate that we let marketing people cram the buzzword of the day
       | into what should be a purely technical discussion.
        
       ___________________________________________________________________
       (page generated 2025-11-24 23:00 UTC)