[HN Gopher] Building the largest known Kubernetes cluster
___________________________________________________________________
Building the largest known Kubernetes cluster
Author : TangerineDream
Score : 86 points
Date : 2025-11-21 17:56 UTC (3 days ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| rvz wrote:
| > While we don't yet officially support 130K nodes, we're very
| encouraged by these findings. If your workloads require this
| level of scale, reach out to us to discuss your specific needs
|
| Obviously this is a typical experiment at Google on running a K8s
| cluster at 130K nodes, but if there is a company out there that
| "requires" this scale, I must question their architecture and
| their infrastructure costs.
|
| But of course someone will always request that they somehow need
| this sort of scale to run their enterprise app. But once again,
| let's remind the pre-revenue startups talking about scale before
| they hit PMF:
|
| Unless you are ready to donate tens of billions of dollars
| yearly, you do not need this.
|
| You are not Google.
| mlnj wrote:
| >You are not Google.
|
| It's literally Google coming out with this capability, so how
| is the criticism still "You are not Google"?
| Rastonbury wrote:
| The criticism is at pre-PMF startups who believe they need
| something similar
| jcims wrote:
| I work for a mature public company that most people in the US
| have at least heard of. We're far from the largest in our
| industry and we run jobs with more than that almost every
| night. Not via k8s though.
| Tostino wrote:
| You have jobs running on more than 130k different machines
| daily??
|
| Are they cloud based VMs, or your own hardware? If cloud
| based, do you reprovision all of them daily and incur no cost
| when you are not running jobs? If it's your own hardware,
| what else do you do with it when not batch processing?
| jcims wrote:
| They are provisioned on demand (cloud) and shut down when
| no longer needed.
| game_the0ry wrote:
| > You are not Google.
|
| 100% agree.
|
| People at my co are horny to adopt k8s. Really, tech leads want
| to put it on their resume ("resume driven development") and use
| a tool that was made to solve a particular problem we never
| had. The downside is that we now need to be proficient at it,
| know how to troubleshoot it, etc. It was sold to leadership as
| something that would make our lives easier, but the exact
| opposite has happened.
| BruSwain wrote:
| I think k8s has a learning curve, absolutely, and there are
| certainly cases where it can be unnecessary overhead. But I
| actually think those cases are pretty small. If you're
| running multiple apps, k8s is valuable. There is initial
| investment in learning the system, but it's very extensible,
| flexible, and portable. (Yes, every hyperscaler's
| implementation of k8s has its own nuances in certain places,
| but the core concept of k8s translates very well.)
| hazz99 wrote:
| I'm sure this work is very impressive, but these QPS numbers
| don't seem particularly high to me, at least compared to existing
| horizontally scalable service patterns. Why is it hard for the
| kube control plane to hit these numbers?
|
| For instance, postgres can hit this sort of QPS easily, afaik.
| It's not distributed, but I'm sure Vitess could do something
| similar. The query patterns don't seem particularly complex
| either.
|
| Not trying to be reductive - I'm sure there's some complexity
| here I'm missing!
| phrotoma wrote:
| I am extremely Not A Database Person, but I understand that the
| rationale for Kubernetes adopting etcd as its preferred data
| store was more about its distributed consistency features and
| less about query throughput. etcd is slower because it's doing
| Raft things and flushing stuff to disk.
|
| Projects like kine allow K8s users to swap sqlite or postgres
| in place of etcd, which (I assume, please correct me otherwise)
| would deliver better throughput since those backends don't need
| to perform consensus operations.
|
| https://github.com/k3s-io/kine
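|
| A minimal sketch of what that swap looks like in practice, using
| k3s's documented --datastore-endpoint flag (the Postgres
| connection string below is just a placeholder):
|
|     # run the k3s control plane against Postgres; kine translates
|     # the apiserver's etcd API calls into SQL behind the scenes
|     k3s server \
|       --datastore-endpoint="postgres://k3s:password@10.0.0.5:5432/kine"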
| dijit wrote:
| You might not be a database person, but you're spot on.
|
| A well managed HA postgresql (active/passive) is going to run
| circles around etcd for kube controlplane operations.
|
| The caveat here is increased risk of downtime and a much
| higher management overhead, which is why it's not the default.
| Sayrus wrote:
| GKE uses Spanner as an etcd replacement.
| ZeroCool2u wrote:
| But, and I'm honestly asking, you as a GKE user don't have
| to manage that spanner instance, right? So, you should in
| theory be able to just throw higher loads at it and spanner
| should be autoscaling?
| DougBTX wrote:
| Yes, from the article:
|
| > To support the cluster's massive scale, we relied on a
| proprietary key-value store based on Google's Spanner
| distributed database... We didn't witness any bottlenecks
| with respect to the new storage system and it showed no
| signs of it not being able to support higher scales.
| ZeroCool2u wrote:
| Yeah, I guess my question was a bit more nuanced. What I
| was curious about was if they were fully relying on
| normal autoscaling that any customer would get or were
| they manually scaling the spanner instance in
| anticipation of the load? I guess it's unlikely we're
| going to get that level of detailed info from this
| article though.
| travem wrote:
| There are also distributed databases that use Raft but can
| still scale while delivering distributed consensus, so this
| isn't an unsolvable challenge. For example, TiDB handles
| millions of QPS while delivering ACID transactions, e.g.
| https://vivekbansal.substack.com/p/system-design-study-
| how-f...
| PunchyHamster wrote:
| It's not really bottlenecked by the store but by the
| calculations performed on each pod schedule/creation.
|
| It's basically "take the global state of node load and capacity,
| pick where to schedule it", and I'd imagine it's probably not
| running in parallel because that would be far harder to manage.
| senorrib wrote:
| Not a k8s dev, but I feel like this is the answer. K8s isn't
| usually just scheduling pods round robin or at random.
| There's a lot of state to evaluate, and scheduling pods
| becomes an NP-hard problem similar to the bin packing
| problem. I doubt the implementation tries to be optimal here,
| but it feels like a computationally heavy problem.
| OvervCW wrote:
| In what way is it NP-hard? From what I can gather it just
| eliminates nodes where the pod wouldn't be allowed to run,
| calculates a score for each, and then randomly selects one
| of the nodes with the highest score, so it's trivially
| parallelizable.
| __turbobrew__ wrote:
| The k8s scheduler lets you tweak how many nodes to look at
| when scheduling a pod (percentage of nodes to score) so you
| can change how big "global state" is according to the
| scheduler algorithm.
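|
| For illustration, a minimal sketch of a scheduler config using
| that knob (percentageOfNodesToScore is the documented field name;
| the 10% value is just an example, and a real config would also
| set things like clientConnection):
|
|     cat > scheduler-config.yaml <<'EOF'
|     apiVersion: kubescheduler.config.k8s.io/v1
|     kind: KubeSchedulerConfiguration
|     # score at most 10% of the feasible nodes for each pod
|     percentageOfNodesToScore: 10
|     EOF
|     kube-scheduler --config=scheduler-config.yaml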
| nonameiguess wrote:
| It says in the blog that they require 13,000 queries per second
| to _update lease objects_, not that 13,000 is the total for
| all queries. I don't know why they cite that instead of total,
| but etcd's normal performance testing indicates it can handle
| at least 50,000 writes per second and 180,000 reads:
| https://etcd.io/docs/v3.6/op-guide/performance/. So, without
| them saying what the real number is, I'm going to guess their
| reads and writes outside of lease updates are at least much
| larger than those numbers.
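|
| Rough back-of-the-envelope for where the lease figure comes from,
| assuming the default kubelet behaviour of renewing its node Lease
| about every 10 seconds:
|
|     # 130,000 kubelets, one Lease update each per ~10s
|     echo $((130000 / 10))   # => 13000 lease updates/sec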
| xyse53 wrote:
| They mention GCS fuse. We've had nothing but performance and
| stability problems with this.
|
| We treat it as a best effort alternative when native GCS access
| isn't possible.
| dijit wrote:
| FUSE-based filesystems _in general_ shouldn't be treated as
| production-ready, in my experience.
|
| They're wonderful for low-volume, low-performance, low-
| reliability operations (browsing, copying, integrating with
| legacy systems that do not permit native access), but beyond
| that they consume huge resources and do odd things when the
| backend is not in its most ideal state.
| thundergolfer wrote:
| AWS Lambda uses FUSE and that's one of the largest prod
| systems in the world.
| dijit wrote:
| An option exists, but they prefer you use the block storage
| API.
| dotwaffle wrote:
| I started rewriting gcsfuse using
| https://github.com/hanwen/go-fuse instead of
| https://github.com/jacobsa/fuse and found it rock-solid. FUSE
| has come a long way in the last few years, including things
| like passthrough.
|
| Honestly, I'd give FUSE a second chance, you'd be surprised
| at how useful it can be -- after all, it's literally running
| in userland so you don't need to do anything funky with
| privileges. However, if I were starting afresh on a similar
| project I'd probably be looking at using 9p2000.L instead.
| zoobab wrote:
| The new mainframe.
| blurrybird wrote:
| AWS and Anthropic did this back in July:
| https://aws.amazon.com/blogs/containers/amazon-eks-enables-u...
| cowsandmilk wrote:
| That is 100k vs 130k for Google's new announcement. I can't
| speak as to whether the additional 30k presented new challenges
| though.
| Cthulhu_ wrote:
| I want to believe that this is an order-of-magnitude kind of
| problem, that is, if 100K is fine then 500K is also fine.
|
| I only skimmed the article, but I'm confident that
| it's more a physical hardware, time, space and electricity
| problem than a software / orchestration one; the article
| mentions that a cluster that size needs to be multi-
| datacenter already given the sheer power requirements (2700
| watts for one GPU in a single node).
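|
| Back-of-the-envelope on the power point, taking the 2700 W figure
| at face value and assuming just one such GPU per node:
|
|     # 130,000 nodes x 2.7 kW each, GPUs alone
|     echo $((130000 * 2700))   # => 351000000 W, i.e. ~351 MW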
| belter wrote:
| 130k nodes...cute...but can Google conquer the ultimate software
| engineering challenge they warn you about in CS school? A
| functional online signup flow?
| jasonvorhe wrote:
| For what? Access to the control plane API?
| belter wrote:
| In general... Try to sign up for their AI services...
| chrisandchris wrote:
| They could team up with Microsoft, because their signup flow is
| fine but the login flow is badly broken.
| yanhangyhy wrote:
| There is a doc about how to do it with 1M nodes:
| https://bchess.github.io/k8s-1m/#_why
|
| So I guess the title is not true?
| arccy wrote:
| That's simulated using kwok, not real.
|
| > Unfortunately running 1M real kubelets is beyond my budget.
| Thaxll wrote:
| This is a PoC, not backed by a reliable etcd replacement.
| jakupovic wrote:
| Doing this at anything > 1k nodes is a pain in the butt. We
| decided to run many <100-node clusters rather than a few big
| ones.
| kvrty wrote:
| Same here. Control plane components that didn't originate from
| the Kubernetes project start failing beyond a certain limit -
| your ingress controllers, service meshes, etc. So I don't
| usually take node numbers from these benchmarks seriously for
| our kind of workloads. We run a bunch of sub-1k-node clusters.
| liveoneggs wrote:
| Same. The control plane and various controllers just aren't up
| to the task.
| preisschild wrote:
| Meh, I've had clusters with close to 1k nodes (w/ cilium as
| CNI) and didn't have major issues.
| __turbobrew__ wrote:
| When I was involved about a year ago, cilium fell apart at
| around a few thousand nodes.
|
| One of the main issues of cilium is that the bpf maps scale
| with the number of nodes/pods in the cluster, so you get
| exponential memory growth as you add more nodes with the
| cilium agent on them. https://docs.cilium.io/en/stable/operat
| ions/performance/scal...
| oasisaimlessly wrote:
| Wouldn't that be quadratic rather than exponential?
| blinding-streak wrote:
| Imagine a Beowulf cluster of these
| sandGorgon wrote:
| Does anyone know the size at OpenAI? It used to run a 7500-node
| cluster back in 2021: https://openai.com/index/scaling-kubernetes-
| to-7500-nodes/
| blamestross wrote:
| I worked on DHTs in grad school. I still do a double take that
| Google's and other companies' "computers dedicated to a task"
| numbers are missing 2 digits from what I expected. We have a
| lot of room left for expansion, we just have to relax
| centralized management expectations.
| Nextgrid wrote:
| K8S clusters on VMs strike me as odd.
|
| I see the appeal of K8s in dividing raw, stateful hardware to run
| multiple parallel workloads, but if you're dealing with stateless
| cloud VMs, why would you need K8S and its overhead when the VM
| hypervisor already gives you all that functionality?
|
| And if you insist anyway, run a few big VMs rather than many
| small ones, since K8s overhead is per-node.
| victorbjorklund wrote:
| Because k8s gives you lots of other things out of the box, like
| easy scaling of apps, etc. That's harder to do on VMs, where you
| would either have to dedicate one VM per app (which might be a
| waste of resources) or try to deploy and run multiple apps
| across multiple VMs.
|
| (For the record, I'm not a k8s fanatic. Most of the time a
| regular VM is better. But a VM isn't a Kubernetes cluster.)
| GauntletWizard wrote:
| The reason to target k8s on cloud VMs is that cloud VMs don't
| subdivide as easily or as cleanly. Managing them is a pain. K8s
| is an abstraction layer for that - rather than building whole
| machine images for each product, you create lighter-weight
| Docker images (how lightweight is a point of some contention),
| and you only have to install your logging, monitoring, and so
| on once.
|
| Your advice about bigger machines is spot on - K8s' biggest
| problem is how relatively heavyweight the kubelet is, with
| memory requirements of roughly half a gig. On a modern 128g
| server node that's a reasonable overhead; for small companies
| running a few workloads on 16g nodes it's a cost of doing
| business; but if you're running 8 or 4g nodes, it looks pretty
| grim for your utilization.
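|
| Rough numbers on that overhead, taking the ~half-a-gig kubelet
| figure above at face value:
|
|     # kubelet memory as a share of node memory (128g, 16g, 4g nodes)
|     awk 'BEGIN { printf "%.1f%% %.1f%% %.1f%%\n", 100*0.5/128, 100*0.5/16, 100*0.5/4 }'
|     # => 0.4% 3.1% 12.5%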
| nyrikki wrote:
| You can run pods with podman and avoid the entire k8s stack, or
| even use minikube on a machine if you want to. Now that
| rootless is the default in k8s[0], the workflow is even more
| convenient, and you can use systemd with isolated users on the
| VM to provide more modularity and separation.
|
| It really just depends on whether you feel you get value from
| the orchestration that full k8s offers.
|
| Note that on k8s or podman, you can get rid of most of the
| 'cost' of that virtualization for single-placement and/or
| long-lived pods by simply sharing an emptyDir or volume
| between pod members:
|
|     # Create pod
|     podman pod create --name pgdemo-pod
|     # Create client
|     podman run -dt -v pgdemo:/mnt --pod pgdemo-pod \
|       -e POSTGRES_PASSWORD=password --name client \
|       docker.io/ubuntu:25.04
|     # Unsafe hack to fix permissions in a quick demo, and install packages
|     podman exec client /bin/bash -c \
|       'chmod 0777 /mnt; apt update; apt install -y postgresql-client'
|     # Create postgres server
|     podman run -dt -v pgdemo:/mnt --pod pgdemo-pod \
|       -e POSTGRES_PASSWORD=password --name pg \
|       docker.io/postgres:bookworm \
|       -c unix_socket_directories='/mnt,/var/run/postgresql/'
|     # Invoke client using unix socket
|     podman exec -it client /bin/bash -c "psql -U postgres -h /mnt"
|     # Invoke client using localhost network
|     podman exec -it client /bin/bash -c "psql -U postgres -h localhost"
|
| There is enough there for you to test and see that sharing unix
| sockets that way is very close to native performance; there is
| very little performance cost and a lot of security and workflow
| benefit to gain.
|
| As podman is daemonless, easily rootless, and on Mac even lets
| you ssh into the local Linux VM with `podman machine ssh`, you
| aren't stuck with the hidden abstractions of Docker Desktop,
| which hides that VM from you. It has a lot of value.
|
| Plus you can dump k8s-like YAML to use for the above with:
|
|     podman kube generate pgdemo-pod
|
| So you can gain the advantages of k8s without the overhead of
| the cluster, and there are ways to launch those pods from
| systemd even from a local user that has zero sudo abilities
| etc...
|
| I am using it to validate that upstream containers don't dial
| home, by producing pcap files, and I would also typically run
| the above with no network on the pgsql host, so it doesn't
| have internet access.
|
| IMHO people conflate k8s pods as the minimal unit of
| deployment with what they are in general form: just a
| collection of containers with specific shared namespaces, and
| that point gets missed.
|
| As Red Hat gave podman to the CNCF in 2024, I have shifted to
| it, so I haven't seen whether Rancher can do the same.
|
| The point is that you don't even need the complexity of
| minikube on VMs; you can use most of the workflow even for
| the traditional model.
|
| [0] https://kubernetes.io/blog/2025/04/25/userns-enabled-by-
| defa...
| locknitpicker wrote:
| > I see the appeal of K8s in dividing raw, stateful hardware to
| run multiple parallel workloads, but if you're dealing with
| stateless cloud VMs, why would you need K8S and its overhead
| when the VM hypervisor already gives you all that
| functionality?
|
| I think you're not familiar with Kubernetes and what features
| it provides.
|
| For example, Kubernetes supports blue-green deployments and
| rollbacks, software-defined networks, DNS, node-specific drains
| and taints, etc. Those are not hypervisor features.
|
| Also, VMs are the primitives of some cloud providers.
|
| It sounds like you heard about how Borg/Kubernetes was used to
| simplify the task of putting together clusters with COTS
| hardware and you didn't bother to learn more about
| Kubernetes.
| acedTrex wrote:
| Because if you just run a few huge VMs you still have all the
| problems that k8s solves out of the box. Except now you have to
| solve them yourself, which will likely end up being a crappier,
| less robust version of Kubernetes.
| tayo42 wrote:
| In a large organization it's more efficient to run on VMs. You
| can colocate services that fit together on one machine.
|
| And in reality no one sizes their machines correctly. They
| always do some handwavey thing like "we need 4 cores, but maybe
| we'll burst and maybe there will be an outage, so let's double
| it." Now all that utilization can be watched and you can take
| advantage of oversubscription.
| supportengineer wrote:
| Imagine a Beowulf cluster of these
| __turbobrew__ wrote:
| It makes me sad that to get these scalability numbers requires
| some secret sauce on top of Spanner, which nobody else in the
| k8s community can benefit from. Etcd is the main bottleneck in
| upstream k8s and it seems like there is no real steam to build an
| upstream replacement for etcd/boltdb.
|
| I did poke around a while ago to see what interfaces etcd
| has calling into boltdb, but the interface doesn't seem super
| clean right now, so the first step in getting off boltdb would be
| creating a clean interface that could be implemented by another
| db.
| iwontberude wrote:
| For those not aware, if you create too many resources you can
| easily use up the ~8GB maximum backend size (the storage quota)
| in etcd, which causes a cluster failure. With compaction and
| maintenance this risk is mitigated somewhat, but it just takes one
| misbehaving operator or integration (e.g. hundreds of thousands
| of dex session resources created for pingdom/crawlers) to mess
| everything up. Backups of etcd are critical. That dex example
| is why I stopped using it for my IDP.
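|
| For reference, the usual mitigations, sketched with standard
| etcdctl/etcd options (the revision and size values below are
| placeholders):
|
|     # check current DB size and revision
|     etcdctl endpoint status --write-out=table
|     # compact history up to a chosen revision, then reclaim space
|     etcdctl compact 123456
|     etcdctl defrag
|     # the quota itself is set on the server, e.g. 8 GiB
|     etcd --quota-backend-bytes=8589934592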
| scoodah wrote:
| This is why I've always thought Tekton was a strange project.
| It feels inevitable that if you buy into Tekton CI/CD you
| will hit issues with etcd scaling due to the sheer number of
| resources you can wind up with.
| locknitpicker wrote:
| > It makes me sad that to get these scalability numbers
| requires some secret sauce on top of spanner, which no body
| else in the k8s community can benefit from.
|
| I'm not so sure. I mean, everything has tradeoffs, and what you
| need to do to put together the largest cluster known to man is
| not necessarily what you want for putting together a mundane
| cluster.
| nonameiguess wrote:
| It's _possible_ I'm talking out of my ass and totally wrong
| because I'm basing this on principles, not benchmarking, but
| I'm pretty sure the problem is more etcd itself than boltdb.
| Specifically, the Raft protocol requires that the cluster
| leader's log has to be replicated to a quorum of voting
| members, who need to write to disk, including a flush, and then
| respond to the leader, before a write is considered committed.
| That's floor(n/2) + 1 disk flushes and twice as many network
| roundtrips to write any value. When your control plane has to
| span multiple data centers because the electricity cost of the
| cluster is too large for a single building to handle, it's hard
| for that not to become a bottleneck. Other limitations include
| the 8GiB storage quota another comment mentions and etcd's
| default 1.5 MiB request size limit, which prevents you from
| writing large object collections in a single bundle.
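|
| To make the quorum arithmetic concrete for a typical 5-member
| etcd cluster:
|
|     # floor(n/2) + 1 members must fsync before a write commits
|     echo $((5 / 2 + 1))   # => 3 disk flushes on the write path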
|
| etcd is fine for what it is, but that's a system meant to be
| reliable and simple to implement. Those are important
| qualities, but it wasn't built for scale or for speed.
| Ironically, etcd recommends 5 as the ideal number of cluster
| members and 7 as a maximum based on Google's findings from
| running chubby, that between-member latency gets too big
| otherwise. With 5, that means you can't ever store more than
| 40GiB of data. I have no idea what a typical ratio of cluster
| nodes to total data is, but that only gives you about 307MiB
| per node for 130,000 nodes, which doesn't seem like very much.
|
| There are other options. k3s made kine which acts as a shim
| intercepting the etcd API calls made by the apiserver and
| translating them into calls to some other dbms. Originally, this
| was to make a really small Kubernetes that used an embedded
| sqlite as its datastore, but you could do the same thing for
| any arbitrary backend by just changing one side of the shim.
| bhouston wrote:
| Sounds like hell. But I do really dislike Kubernetes:
| https://benhouston3d.com/blog/why-i-left-kubernetes-for-goog...
| jeffbee wrote:
| You could remove all references to AI/ML topics from this article
| and it would remain just as interesting and informative. I really
| hate that we let marketing people cram the buzzword of the day
| into what should be a purely technical discussion.
___________________________________________________________________
(page generated 2025-11-24 23:00 UTC)