[HN Gopher] Automatic K8s pod placement to match external servic...
___________________________________________________________________
Automatic K8s pod placement to match external service zones
Author : toredash
Score : 76 points
Date : 2025-10-08 05:21 UTC (6 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| toredash wrote:
| Hi HN,
|
| I wanted to share something I've been working on to solve a gap
| in Kubernetes: its scheduler has no awareness of the network
| topology of the external services that workloads communicate
| with. If a pod talks to a database (e.g. AWS RDS), K8s does not
| know it should schedule the pod in the same AZ as the database.
| If the pod lands in the wrong AZ, it leads to unnecessary
| cross-AZ network traffic, adding latency (and costing $).
|
| I've made a tool I've called "Automatic Zone Placement", which
| automatically aligns Pod placements with their external
| dependencies.
|
| Testing showed that placing the pod in the same AZ resulted in a
| ~175-375% performance increase, measured with small, frequent
| SQL requests. That's not really strange: same-AZ latency is much
| lower than cross-AZ latency, and for chatty workloads lower
| latency = higher throughput.
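|
| A rough back-of-the-envelope sketch of why numbers in that range
| are plausible. The latencies below are illustrative assumptions,
| not measurements from the project:
|
|     # Illustrative round-trip times for one small SQL query;
|     # real values vary by region and instance type.
|     same_az_rtt_ms = 0.4    # assumed same-AZ round trip
|     cross_az_rtt_ms = 1.5   # assumed cross-AZ round trip
|
|     # A latency-bound workload's throughput scales roughly
|     # inversely with the round-trip time.
|     speedup = cross_az_rtt_ms / same_az_rtt_ms
|     print(f"~{(speedup - 1) * 100:.0f}% more queries per second")
|     # -> ~275%, in the ballpark of the reported 175-375%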
|
| The tool has two components:
|
| 1) A lightweight lookup service: A dependency-free Python service
| that takes a domain name (e.g., your RDS endpoint) and resolves
| its IP to a specific AZ.
|
| 2) A Kyverno mutating webhook: This policy intercepts pod
| creation requests. If a pod has a specific annotation, the
| webhook calls the lookup service and injects the required
| nodeAffinity to schedule the pod onto a node in the correct AZ.
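|
| For illustration, a minimal sketch of the lookup idea in (1),
| assuming a static subnet-to-AZ mapping (the CIDRs and zone names
| here are made up, not taken from the repo):
|
|     # Resolve the endpoint's current IP and map it to an AZ via
|     # a static subnet table.
|     import ipaddress
|     import socket
|
|     SUBNET_TO_AZ = {
|         "10.0.0.0/24": "eu-west-1a",
|         "10.0.1.0/24": "eu-west-1b",
|         "10.0.2.0/24": "eu-west-1c",
|     }
|
|     def zone_for(hostname: str) -> str | None:
|         addr = socket.gethostbyname(hostname)
|         ip = ipaddress.ip_address(addr)
|         for cidr, az in SUBNET_TO_AZ.items():
|             if ip in ipaddress.ip_network(cidr):
|                 return az
|         return None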
|
| The goal is to make this an automatic process; the alternative
| is to manually add a nodeAffinity spec to your workloads, but
| resources move between AZs, e.g. during maintenance events for
| RDS instances. I built this with AWS services in mind, but the
| concept is generic enough to be used in on-premise clusters to
| make scheduling decisions based on rack, row, or data center
| properties.
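|
| For reference, the kind of preferred nodeAffinity the webhook
| would inject, written here as the equivalent JSON structure in
| Python (the zone value stands in for whatever the lookup service
| returned):
|
|     injected_affinity = {
|         "nodeAffinity": {
|             "preferredDuringSchedulingIgnoredDuringExecution": [{
|                 "weight": 100,
|                 "preference": {
|                     "matchExpressions": [{
|                         "key": "topology.kubernetes.io/zone",
|                         "operator": "In",
|                         "values": ["eu-west-1a"],  # from lookup
|                     }]
|                 },
|             }]
|         }
|     }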
|
| I'd love some feedback on this, happy to answer questions :)
| darkwater wrote:
| Interesting project! Kudos for the release. One question: how
| are failure scenarios managed, i.e. if AZP fails for whatever
| reason and it's in a crash loop? Just "no hints" to the
| scheduler, and that's it?
| toredash wrote:
| If the AZP deployment fails, yes, you're correct: there are
| no hints anywhere. If the lookup to AZP fails for whatever
| reason, it would be noted in the Kyverno logs. And depending
| on whether you -require- this policy to take effect or not,
| you have to decide whether you want pods to fail or not at
| the scheduling step. In most cases, you don't want to stop
| scheduling :)
| mathverse wrote:
| Typically you have a multi-AZ setup for app deployments for HA.
| How would you solve this without traffic management control?
| toredash wrote:
| I'm not sure I follow. Are you talking about the AZP service,
| or ... ?
| dserodio wrote:
| It's a best practice to have a Deployment run multiple Pods
| in separate AZs to increase availability.
| toredash wrote:
| Yes I get that. But are we talking about HA for this lookup
| service that I've made?
|
| If yes, that's a simple update of the manifest to have 3
| replicas with an affinity setting to spread them out over
| different AZs. Kyverno would use the internal Service
| object this service provides as an HA endpoint to send
| queries to.
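|
| Roughly this shape on the lookup Deployment's pod template,
| written as the equivalent JSON structure (the "app: azp-lookup"
| label is a placeholder):
|
|     spread_across_zones = {
|         "podAntiAffinity": {
|             "preferredDuringSchedulingIgnoredDuringExecution": [{
|                 "weight": 100,
|                 "podAffinityTerm": {
|                     "labelSelector": {
|                         "matchLabels": {"app": "azp-lookup"},
|                     },
|                     "topologyKey": "topology.kubernetes.io/zone",
|                 },
|             }]
|         }
|     }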
|
| If we are not talking about this AZP service, I don't
| understand what we are talking about.
| stackskipton wrote:
| How do you handle RDS failovers? The mutating webhook only fires
| when Pods are created, so if the AZ does not fail, there are no
| pods to be created and no affinity rules to be changed.
| toredash wrote:
| As it stands now, it doesn't, unless you modify the Kyverno
| policy to use background scanning.
|
| I would create a similar policy where Kyverno checks at
| intervals whether the Deployment's endpoint has changed, and
| alters the affinity rules accordingly. It would then be a
| traditional update of the Deployment spec to reflect the
| desire to run in another AZ, if that makes sense?
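|
| As a rough sketch of that idea outside Kyverno: a small
| reconciler that re-resolves the endpoint on an interval and
| patches the Deployment's node affinity when the zone changes.
| The endpoint, names and interval are placeholders, and zone_for()
| is the lookup helper sketched in the top comment:
|
|     import time
|     from kubernetes import client, config
|
|     config.load_incluster_config()
|     apps = client.AppsV1Api()
|     endpoint = "mydb.example.eu-west-1.rds.amazonaws.com"
|     PREF = "preferredDuringSchedulingIgnoredDuringExecution"
|
|     while True:
|         az = zone_for(endpoint)  # hostname -> AZ, sketched above
|         if az:
|             term = {"weight": 100,
|                     "preference": {"matchExpressions": [{
|                         "key": "topology.kubernetes.io/zone",
|                         "operator": "In",
|                         "values": [az]}]}}
|             patch = {"spec": {"template": {"spec": {"affinity": {
|                 "nodeAffinity": {PREF: [term]}}}}}}
|             # No-op if the zone is unchanged; triggers a normal
|             # rollout when the endpoint has moved to another AZ.
|             apps.patch_namespaced_deployment(
|                 "my-app", "default", patch)
|         time.sleep(300)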
| ruuda wrote:
| > Have you considered alternative solutions?
|
| How about, don't use Kubernetes? The lack of control over where
| the workload runs is a problem caused by Kubernetes. If you
| deploy an application as e.g. systemd services, you can pick the
| optimal host for the workload, and it will not suddenly jump
| around.
| arccy wrote:
| k8s doesn't lack control; you can select individual nodes, AZs,
| regions, etc. with the standard affinity settings.
| mystifyingpoi wrote:
| > The lack of control
|
| This project literally sets the affinity. That's precisely the
| control you claim is lacking.
| toredash wrote:
| Agree, Kubernetes isn't for everyone. This solution came from
| a specific issue with a client who had intermittent performance
| problems when a Pod was placed in the "incorrect" AZ. So this
| solution was created to place Pods in the optimal zone
| at creation time.
| glennpratt wrote:
| Comparing systemd and Kubernetes for this scenario is like
| comparing an apple tree to a citrus grove.
|
| You can specify just about anything, including exact nodes, for
| Kubernetes workloads.
|
| This is just injecting some of that automatically.
|
| I'm not knocking systemd, it's just not relevant.
| Spivak wrote:
| You need it to jump around because your RDS database might fail
| over to a different AZ.
|
| Being able to move workloads around is kinda the point. The
| need exists irrespective of what you use to deploy your app.
| toredash wrote:
| The nice thing about this solution is that it's not limited to
| RDS. I used RDS as an example because many are familiar with it
| and know that it can change AZ during maintenance events.
|
| Any hostname for an AWS service that can relocate to another AZ
| (for whatever reason) can use this.
| aduwah wrote:
| Mind you, you are facing the same problem with any Auto Scaling
| group that lives in multiple AZs. You don't need Kubernetes for
| this.
| indemnity wrote:
| > The lack of control over where the workload runs is a problem
| caused by Kubernetes.
|
| Fine grained control over workload scheduling is one of the K8s
| core features?
|
| Affinity, anti-affinity, priority classes, node selectors,
| scheduling gates - all of which affect scheduling for different
| use cases, and all under the operator's control.
| kentm wrote:
| Sure, but there are scenarios and architectures where you do
| want the workload to jump around, but just to a subset of hosts
| matching certain criteria. Kubernetes does solve that problem.
| mystifyingpoi wrote:
| Cool idea, I like that. Though I'm curious about the lookup
| service. You say:
|
| > To gather zone information, use this command ...
|
| Why couldn't most of this information be gathered by the lookup
| service itself? A point could be made about excessive IAM, but
| the simple case of an RDS reader residing in a given AZ could be
| easily handled by listing the subnets and finding where a given
| IP belongs.
| toredash wrote:
| Totally agree!
|
| This service is published more as a concept to build on top of
| than as a complete solution.
|
| You wouldn't even need IAM rights to read RDS information, only
| subnet information. As subnets are zonal, it does not matter if
| the service is RDS or Redis/ElastiCache: the IP returned from
| the hostname lookup, at the time your pod is scheduled,
| determines which AZ that Pod should (optimally) be deployed to.
|
| This solution was created in a multi-account AWS environment,
| and doing DescribeSubnets API calls across multiple accounts is
| a hassle. It was "good enough" to have a static mapping of
| subnets, as they didn't change frequently.
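|
| If you do want the automatic variant, a rough sketch of the idea
| with boto3 (it needs ec2:DescribeSubnets in every account/VPC you
| care about, and is not what the published service does):
|
|     import ipaddress
|     import socket
|
|     import boto3
|
|     def zone_for(hostname: str) -> str | None:
|         addr = socket.gethostbyname(hostname)
|         ip = ipaddress.ip_address(addr)
|         # Build the subnet -> AZ map from the VPC instead of a
|         # static file.
|         ec2 = boto3.client("ec2")
|         for subnet in ec2.describe_subnets()["Subnets"]:
|             if ip in ipaddress.ip_network(subnet["CidrBlock"]):
|                 return subnet["AvailabilityZone"]
|         return None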
| stackskipton wrote:
| This is one of those ideas that sounds great and appears simple
| at first but can grow into a mad monster. Here are my thoughts
| after 5 minutes.
|
| The Kyverno requirement makes it limited. There is no "automatic-
| zone-placement-disabled" function in case someone wants to
| temporarily disable zone placement but not remove the label. How
| do we handle the RDS zone changing after workload scheduling? No
| automatic lookup of IPs and zones. What if we only have one node
| in a specific zone? Are we willing to handle EC2 failure, or
| should we trigger a scale-out?
| toredash wrote:
| > Kyverno requirement makes it limited.
|
| You don't have to use Kyverno. You could use a standard
| mutating webhook, but you would have to generate your own
| certificate and mutate on every Pod CREATE operation. Not
| really a problem, but it depends.
|
| > There is no "automatic-zone-placement-disabled"
|
| True. That's why I chose to use
| preferredDuringSchedulingIgnoredDuringExecution instead of
| requiredDuringSchedulingIgnoredDuringExecution. In my case,
| where this solution originated, Kubernetes was already a
| multi-AZ setup where there was always at least one node in
| each AZ. It was nice if the Pod could be scheduled into the
| same AZ, but it was not a hard requirement.
|
| > No automatic lookup of IPs and zones.
|
| Yup, it would generate a lot of extra "stuff" to mess with: IAM
| Roles, how to look up IP/subnet information in a multi-account
| AWS setup with VPC Peerings. In our case a static approach was
| "good enough"; the subnet/network topology didn't change
| frequently enough to justify another layer of complexity.
|
| > What if we only have one node in a specific zone?
|
| That's why we defaulted to
| preferredDuringSchedulingIgnoredDuringExecution and not
| required.
| pikdum wrote:
| Wasn't aware that there was noticeably higher latency between
| availability zones in the same AWS region. Kinda thought the
| whole point was to run replicas of your application in multiple
| AZs to achieve higher availability.
| toredash wrote:
| I was surprised too. Of course it makes sense when you look at
| it hard enough: two separate DCs won't have the same latency
| as intra-DC communication. They might have the same physical
| wire speed, but physical distance matters.
| stackskipton wrote:
| It's generally sub-2 ms. Most people accept the slight latency
| increase for higher availability, but I guess in this case
| that was not acceptable.
| dilyevsky wrote:
| They also charge you like 1c/GB for traffic egress between the
| zones. To top it off, there are issues with AWS load balancers
| in multi-zone setups. Ultimately I've come to the conclusion
| that large multi-zonal clusters are a mistake. Run several
| single-zone disposable clusters if you want zone redundancy.
| solatic wrote:
| I don't really understand why you think this tool is needed and
| what exact problem/risk it's trying to solve.
|
| Most people should start with a single-zone setup and just accept
| that there's a risk associated with zone failure. If you have a
| single-zone setup, you have a node group in that one zone, you
| have the managed database in the same zone, and you're done.
| Zone-wide failure is extremely rare in practice and you would be
| surprised at the number (and size of) companies that run single-
| zone production setups to save on cloud bills. Just write the
| zone label selector into the node affinity section by hand; you
| don't need a fancy admission webhook if you want to leave less
| to chance.
|
| _If_ you decide that you want to handle the additional
| complexity of supporting failover in case of zone failure, the
| easiest approach is to just setup another node group in the
| secondary zone. If the primary zone fails, manually scale up the
| node pool in the secondary zone. Kubernetes will automatically
| schedule all the pods on the scaled up node pool (remember:
| primary zone failure, no healthy nodes in the primary zone), and
| you're done.
|
| If you want to handle zone failover completely automatically,
| this tool represents additional cost, because it forces you to
| have nodes running in the secondary zone during normal usage.
| Hopefully you are not running a completely empty, redundant set
| of service VMs in normal operation, because that would be a
| colossal waste of money. So you are _presuming_ that, when RDS
| automatically fails over to zone b to account for zone a failure,
| you will certainly be able to scale up a full-scale
| production environment in zone b as well, in spite of nearly
| every other AWS customer attempting more or less the same
| strategy; roughly half of zone a traffic will spill over to zone
| b, roughly half to zone c, minus all the traffic that is zone-
| locked to a (e.g. single-zone databases without failover
| mechanisms). That is a _big_ assumption to make and you run a
| serious risk of not getting sufficient capacity in what was
| basically an arbitrarily chosen zone (chosen without context on
| whether there is sufficient capacity for the rest of your
| workloads) and being caught with zonal mismatches and not knowing
| what to do. You very well might need to failover to another
| region entirely to get sufficient capacity to handle your full
| workload.
|
| If you are cost- and latency-sensitive enough to stick to a
| single zone, you're likely much better off coming up with a
| migration plan, writing an automated runbook/script to handle
| it, and testing it on gamedays.
| stronglikedan wrote:
| > I don't really understand why you think this tool is needed
| and what exact problem/risk it's trying to solve.
|
| They lay out the problem and solution pretty well in the link.
| If you still don't understand after reading it, then that's
| okay! It just means you're not having this problem and you're
| not in need of this tool, so go you! But at least you'll come
| away with the understanding that someone was having this
| problem and someone needed this tool to solve it, so win win
| win!
| westurner wrote:
| Couldn't something like this make CI builds faster by running
| builds near already-cached container images?
___________________________________________________________________
(page generated 2025-10-14 23:00 UTC)