[HN Gopher] Automatic K8s pod placement to match external servic...
       ___________________________________________________________________
        
       Automatic K8s pod placement to match external service zones
        
       Author : toredash
       Score  : 76 points
       Date   : 2025-10-08 05:21 UTC (6 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | toredash wrote:
       | Hi HN,
       | 
        | I wanted to share something I've been working on to solve a gap
        | in Kubernetes: its scheduler has no awareness of the network
        | topology of external services that workloads communicate with.
        | If a pod talks to a database (e.g. AWS RDS), K8s does not know
        | it should be scheduled in the same AZ as the database. A pod
        | placed in the wrong AZ causes unnecessary cross-AZ network
        | traffic, adding latency (and costs $).
       | 
        | I've made a tool called "Automatic Zone Placement", which
        | automatically aligns Pod placement with a workload's external
        | dependencies.
       | 
        | Testing shows that placing the pod in the same AZ resulted in a
        | ~175-375% performance increase, measured with small, frequent
        | SQL requests. That isn't really strange: same-AZ latency is much
        | lower than cross-AZ latency, and lower latency = increased
        | performance.
       | 
       | The tool has two components:
       | 
       | 1) A lightweight lookup service: A dependency-free Python service
       | that takes a domain name (e.g., your RDS endpoint) and resolves
       | its IP to a specific AZ.
       | 
        | 2) A Kyverno mutating webhook: This policy intercepts pod
        | creation requests. If a pod has a specific annotation, the
        | webhook calls the lookup service and injects the nodeAffinity
        | needed to schedule the pod onto a node in the correct AZ (see
        | the sketch below).
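        | 
        | For illustration, the injected affinity would look roughly like
        | this (the zone value is an example, and the exact patch the
        | policy applies may differ):
        | 
        |   affinity:
        |     nodeAffinity:
        |       preferredDuringSchedulingIgnoredDuringExecution:
        |       - weight: 100
        |         preference:
        |           matchExpressions:
        |           - key: topology.kubernetes.io/zone
        |             operator: In
        |             values:
        |             - eu-west-1a  # zone returned by the lookup service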
       | 
        | The goal is to make this an automatic process; the alternative
        | is to manually add a nodeAffinity spec to your workloads, but
        | resources move between AZs, e.g. during maintenance events for
        | RDS instances. I built this with AWS services in mind, but the
        | concept is generic enough to be used in on-premise clusters to
        | make scheduling decisions based on rack, row, or data center
        | properties.
       | 
       | I'd love some feedback on this, happy to answer questions :)
        
         | darkwater wrote:
          | Interesting project! Kudos for the release. One question: how
          | are failure scenarios managed, i.e. if AZP fails for whatever
          | reason and is in a crash loop? Just "no hints" to the
          | scheduler, and that's it?
        
           | toredash wrote:
            | If the AZP deployment fails, yes, you're correct: there are
            | no hints anywhere. If the lookup to AZP fails for whatever
            | reason, it would be noted in the Kyverno logs. And depending
            | on whether you -require- this policy to take effect or not,
            | you have to decide whether you want pods to fail or not at
            | the scheduling step. In most cases, you don't want to stop
            | scheduling :)
        
         | mathverse wrote:
          | Typically you have a multi-AZ setup for app deployments for
          | HA. How would you solve this without traffic management
          | control?
        
           | toredash wrote:
           | I'm not sure I follow. Are you talking about the AZP service,
           | or ... ?
        
             | dserodio wrote:
             | It's a best practice to have a Deployment run multiple Pods
             | in separate AZs to increase availability
        
               | toredash wrote:
                | Yes, I get that. But are we talking about HA for this
                | lookup service that I've made?
                | 
                | If yes, that's a simple update of the manifest to run 3
                | replicas with an affinity or spread setting to spread
                | them over different AZs (see the sketch below). Kyverno
                | would use the internal Service object this service
                | provides as an HA endpoint to send queries to.
                | 
                | If we are not talking about this AZP service, I don't
                | understand what we are talking about.
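                | 
                | For illustration, a minimal sketch of such a spread (the
                | label selector here is made up):
                | 
                |   topologySpreadConstraints:
                |   - maxSkew: 1
                |     topologyKey: topology.kubernetes.io/zone
                |     whenUnsatisfiable: ScheduleAnyway
                |     labelSelector:
                |       matchLabels:
                |         app: azp-lookup  # illustrative label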
        
         | stackskipton wrote:
          | How do you handle RDS failovers? The mutating webhook only
          | fires when Pods are created, so if the AZ does not fail, there
          | are no pods being created and no affinity rules being changed.
        
           | toredash wrote:
            | As it stands now, it doesn't, unless you modify the Kyverno
            | policy to use background scanning.
            | 
            | I would create a similar policy where Kyverno checks the
            | Deployment spec at intervals to see if the endpoint has
            | changed, and alters the affinity rules. It would then be a
            | traditional update of the Deployment spec to reflect the
            | desire to run in another AZ, if that makes sense?
        
       | ruuda wrote:
       | > Have you considered alternative solutions?
       | 
       | How about, don't use Kubernetes? The lack of control over where
       | the workload runs is a problem caused by Kubernetes. If you
       | deploy an application as e.g. systemd services, you can pick the
       | optimal host for the workload, and it will not suddenly jump
       | around.
        
         | arccy wrote:
          | k8s doesn't lack control: you can select individual nodes,
          | AZs, regions, etc. with the standard affinity settings.
        
         | mystifyingpoi wrote:
         | > The lack of control
         | 
          | This project literally sets the affinity. That's precisely the
          | control you claim doesn't exist.
        
         | toredash wrote:
          | Agree, Kubernetes isn't for everyone. This solution came from
          | a specific issue with a client who had ad hoc performance
          | problems when a Pod was placed in the "incorrect" AZ. So the
          | tool was created to place Pods in the optimal zone at the time
          | they are created.
        
         | glennpratt wrote:
         | Comparing systemd and Kubernetes for this scenario is like
         | comparing an apple tree to a citrus grove.
         | 
         | You can specify just about anything, including exact nodes, for
         | Kubernetes workloads.
         | 
         | This is just injecting some of that automatically.
         | 
         | I'm not knocking systemd, it's just not relevant.
        
         | Spivak wrote:
         | You need it to jump around because your RDS database might fail
         | over to a different AZ.
         | 
         | Being able to move workloads around is kinda the point. The
         | need exists irrespective of what you use to deploy your app.
        
           | toredash wrote:
            | The nice thing about this solution is that it's not limited
            | to RDS. I used RDS as an example because many are familiar
            | with it and know that it can change AZ during maintenance
            | events.
            | 
            | Any hostname for an AWS service that can relocate to
            | another AZ (for whatever reason) can use this.
        
         | aduwah wrote:
          | Mind you, you face the same problem with any autoscaling
          | group that lives in multiple AZs. You don't need Kubernetes
          | for this.
        
         | indemnity wrote:
         | > The lack of control over where the workload runs is a problem
         | caused by Kubernetes.
         | 
         | Fine grained control over workload scheduling is one of the K8s
         | core features?
         | 
         | Affinity, anti-affinity, priority classes, node selectors,
         | scheduling gates - all of which affect scheduling for different
         | use cases, and all under the operator's control.
        
         | kentm wrote:
         | Sure, but there are scenarios and architectures where you do
         | want the workload to jump around, but just to a subset of hosts
         | matching certain criteria. Kubernetes does solve that problem.
        
       | mystifyingpoi wrote:
       | Cool idea, I like that. Though I'm curious about the lookup
       | service. You say:
       | 
       | > To gather zone information, use this command ...
       | 
        | Why couldn't most of this information be gathered by the lookup
        | service itself? A point could be made about excessive IAM, but
        | the simple case of an RDS reader residing in a given AZ could be
        | handled by listing the subnets and finding which one a given IP
        | belongs to.
        
         | toredash wrote:
         | Totally agree!
         | 
          | This service is published more as a concept to build on top
          | of than as a complete solution.
          | 
          | You wouldn't even need IAM rights to read RDS information; you
          | need subnet information. As subnets are zonal, it does not
          | matter if the service is RDS or Redis/ElastiCache. The IP
          | returned from the hostname lookup, at the time your pod is
          | scheduled, determines which AZ that Pod should (optimally) be
          | deployed to.
          | 
          | This solution was created in a multi-account AWS environment.
          | Doing describe-subnets API calls across multiple accounts is a
          | hassle, so it was "good enough" to have a static mapping of
          | subnets, as they didn't change frequently (see the sketch
          | below).
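          | 
          | For illustration, such a static mapping could be as simple as
          | a list of CIDR-to-zone entries (the values here are made up,
          | and this is not necessarily the tool's actual config format):
          | 
          |   subnets:
          |   - cidr: 10.0.0.0/24
          |     zone: eu-west-1a
          |   - cidr: 10.0.16.0/24
          |     zone: eu-west-1b
          |   - cidr: 10.0.32.0/24
          |     zone: eu-west-1c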
        
       | stackskipton wrote:
        | This is one of those ideas that sounds great and appears simple
        | at first but can grow into a mad monster. Here are my initial
        | thoughts after 5 minutes.
        | 
        | The Kyverno requirement makes it limited. There is no
        | "automatic-zone-placement-disabled" function in case someone
        | wants to temporarily disable zone placement but not remove the
        | label. How do we handle the RDS zone changing after workload
        | scheduling? There is no automatic lookup of IPs and zones. What
        | if we only have one node in a specific zone? Are we willing to
        | handle EC2 failure, or should we trigger a scale-out?
        
         | toredash wrote:
          | > The Kyverno requirement makes it limited.
          | 
          | You don't have to use Kyverno. You could use a standard
          | mutating webhook, but you would have to generate your own
          | certificate and mutate on every Pod CREATE operation (see the
          | sketch below). Not really a problem, but it depends.
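          | 
          | For reference, a plain webhook registration would look roughly
          | like this (names and namespace are illustrative, and you would
          | have to provision the serving certificate and caBundle
          | yourself):
          | 
          |   apiVersion: admissionregistration.k8s.io/v1
          |   kind: MutatingWebhookConfiguration
          |   metadata:
          |     name: automatic-zone-placement
          |   webhooks:
          |   - name: azp.example.com
          |     admissionReviewVersions: ["v1"]
          |     sideEffects: None
          |     failurePolicy: Ignore  # don't block pods if it is down
          |     rules:
          |     - apiGroups: [""]
          |       apiVersions: ["v1"]
          |       resources: ["pods"]
          |       operations: ["CREATE"]
          |     clientConfig:
          |       service:
          |         namespace: azp
          |         name: azp-webhook
          |         path: /mutate
          |       caBundle: <base64 CA>  # self-managed certificate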
         | 
         | > There is no "automatic-zone-placement-disabled"
         | 
          | True. That's why I chose to use
          | preferredDuringSchedulingIgnoredDuringExecution instead of
          | requiredDuringSchedulingIgnoredDuringExecution. In my case,
          | where this solution originated, the cluster was already
          | multi-AZ, with at least one node in each AZ. It was nice if
          | the Pod could be scheduled into the same AZ, but it was not a
          | hard requirement.
         | 
          | > There is no automatic lookup of IPs and zones.
          | 
          | Yup, it would add a lot of extra "stuff" to mess with: IAM
          | roles, and how to look up IP/subnet information in a
          | multi-account AWS setup with VPC peerings. In our case a
          | static approach was "good enough"; the subnet/network topology
          | didn't change frequently enough to justify another layer of
          | complexity.
         | 
          | > What if we only have one node in a specific zone?
          | 
          | That's why we defaulted to
          | preferredDuringSchedulingIgnoredDuringExecution and not the
          | required variant (see the sketch below).
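          | 
          | For contrast, a rough sketch of the hard-requirement form;
          | with this variant a pod stays Pending if no node in the target
          | zone is available (zone value is illustrative):
          | 
          |   affinity:
          |     nodeAffinity:
          |       requiredDuringSchedulingIgnoredDuringExecution:
          |         nodeSelectorTerms:
          |         - matchExpressions:
          |           - key: topology.kubernetes.io/zone
          |             operator: In
          |             values:
          |             - eu-west-1a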
        
       | pikdum wrote:
       | Wasn't aware that there was noticeably higher latency between
       | availability zones in the same AWS region. Kinda thought the
        | whole point was to run replicas of your application in multiple
        | zones to achieve higher availability.
        
         | toredash wrote:
          | I was surprised too. Of course it makes sense when you look at
          | it hard enough: two separate DCs won't have the same latency
          | as intra-DC communication. They might have the same physical
          | wire speed, but physical distance matters.
        
         | stackskipton wrote:
          | It's generally sub-2 ms. Most people take the slight latency
          | increase for higher availability, but I guess in this case
          | that was not acceptable.
        
         | dilyevsky wrote:
          | They also charge you like 1c/GB for traffic egress between the
          | zones. To top it off, there are issues with AWS load balancers
          | in multi-zone setups. Ultimately I've come to the conclusion
          | that large multi-zonal clusters are a mistake. Do several
          | single-zone disposable clusters if you want zone redundancy.
        
       | solatic wrote:
       | I don't really understand why you think this tool is needed and
       | what exact problem/risk it's trying to solve.
       | 
       | Most people should start with a single-zone setup and just accept
       | that there's a risk associated with zone failure. If you have a
       | single-zone setup, you have a node group in that one zone, you
       | have the managed database in the same zone, and you're done.
       | Zone-wide failure is extremely rare in practice and you would be
       | surprised at the number (and size of) companies that run single-
        | zone production setups to save on cloud bills. Just write the
        | zone label selector into the node affinity section by hand; you
        | don't need a fancy admission webhook if you just want to reduce
        | the element of chance.
       | 
        |  _If_ you decide that you want to handle the additional
        | complexity of supporting failover in case of zone failure, the
        | easiest approach is to just set up another node group in the
        | secondary zone. If the primary zone fails, manually scale up the
        | node pool in the secondary zone. Kubernetes will automatically
        | schedule all the pods on the scaled-up node pool (remember:
        | primary zone failure, no healthy nodes in the primary zone), and
        | you're done.
       | 
       | If you want to handle zone failover completely automatically,
       | this tool represents additional cost, because it forces you to
       | have nodes running in the secondary zone during normal usage.
       | Hopefully you are not running a completely empty, redundant set
       | of service VMs in normal operation, because that would be a
        | colossal waste of money. So you are _presuming_ that, when RDS
        | automatically fails over to zone b to account for zone a failure,
        | you will certainly be able to scale up a full-scale
        | production environment in zone b as well, in spite of nearly
       | every other AWS customer attempting more or less the same
       | strategy; roughly half of zone a traffic will spill over to zone
       | b, roughly half to zone c, minus all the traffic that is zone-
       | locked to a (e.g. single-zone databases without failover
       | mechanisms). That is a _big_ assumption to make and you run a
       | serious risk of not getting sufficient capacity in what was
       | basically an arbitrarily chosen zone (chosen without context on
       | whether there is sufficient capacity for the rest of your
       | workloads) and being caught with zonal mismatches and not knowing
        | what to do. You very well might need to fail over to another
       | region entirely to get sufficient capacity to handle your full
       | workload.
       | 
        | If you are cost- and latency-sensitive enough to stick to a
        | single zone, you're likely much better off coming up with a
        | migration plan, writing an automated runbook/script to handle
        | it, and testing it on game days.
        
         | stronglikedan wrote:
         | > I don't really understand why you think this tool is needed
         | and what exact problem/risk it's trying to solve.
         | 
         | They lay out the problem and solution pretty well in the link.
         | If you still don't understand after reading it, then that's
         | okay! It just means you're not having this problem and you're
         | not in need of this tool, so go you! But at least you'll come
         | away with the understanding that someone was having this
         | problem and someone needed this tool to solve it, so win win
         | win!
        
       | westurner wrote:
       | Couldn't something like this make CI builds faster by running
       | builds near already-cached container images?
        
       ___________________________________________________________________
       (page generated 2025-10-14 23:00 UTC)