[HN Gopher] Zero-Downtime Kubernetes Deployments on AWS with EKS
       ___________________________________________________________________
        
       Zero-Downtime Kubernetes Deployments on AWS with EKS
        
       Author : pmig
       Score  : 53 points
       Date   : 2025-03-10 12:48 UTC (10 hours ago)
        
 (HTM) web link (glasskube.dev)
 (TXT) w3m dump (glasskube.dev)
        
       | _bare_metal wrote:
       | I run https://BareMetalSavings.com.
       | 
        | The number of companies that use K8s with neither a business
        | nor a technological justification for it is staggering. It is
        | the number one blocker in moving to bare metal/on-prem when
        | costs become too much.
       | 
        | Yes, on-prem has its gotchas, just like the EKS deployment
        | described in the post, but everything is so much simpler and
        | more straightforward that the on-prem side of things is much
        | easier to grasp.
        
         | abtinf wrote:
         | Could you expand a bit on the point of K8S being a blocker to
         | moving to on-prem?
         | 
          | Naively, I would think it would be neutral, since I would
          | assume that if a customer gets k8s running on-prem, then
          | apps designed for running in k8s should have a
          | straightforward migration path?
        
           | MPSimmons wrote:
           | I can expand a little bit, but based on your question, I
           | suspect you may know everything I'm going to type.
           | 
           | In cloud environments, it's pretty common that your cloud
           | provider has specific implementations of Kubernetes objects,
           | either by creating custom resources that you can make use of,
           | or just building opinionated default instances of things like
           | storage classes, load balancers, etc.
           | 
           | It's pretty easy to not think about the implementation
           | details of, say, an object-storage-backed PVC until you need
           | to do it in a K8s instance that doesn't already have your
           | desired storage class. Then you've got to figure out how to
           | map your simple-but-custom $thing from provider-managed to
           | platform-managed. If you're moving into Rancher, for
            | instance, it's relatively batteries-included, but there are
            | definitely considerations you need to make for things like
            | how machines are built from a disk storage perspective and
            | where Longhorn drives are mapped.
           | 
           | It's like that for a ton of stuff, and a whole lot of the
           | Kubernetes/OutsideInfra interface is like that. Networking,
           | storage, maybe even certificate management, those all need
           | considerations if you're migrating from cloud to on-prem.
        
             | anang wrote:
              | I think K8S distributions like K3S make this way simpler.
              | If you want to run distributed object storage on bare
              | metal, then you're in for a lot of complexity, with or
              | without k8s.
              | 
              | I've run 3-server k3s instances on bare metal and they
              | work very well with little maintenance. I didn't do
              | anything special, and while it's more complex than some
              | Ansible scripts and HAProxy, I think the breadth of
              | tooling makes it worth it.
        
               | hadlock wrote:
                | I ran K3S locally during the pandemic, and the only
                | issue at the time was getting PV/PVCs provisioned
                | cleanly; I think Longhorn was just reaching maturity,
                | and five years ago the docs were pretty sparse. But
                | yeah, k3s is a dream to work with in 2025: the docs
                | are great, and as long as you stay on the happy path
                | and your network is set up, it's about as effortless
                | as cluster computing can get.
        
         | reillyse wrote:
          | Out of interest, do you recommend any good places to host a
          | machine in the US? A major part of why I like the cloud is
          | that it really simplifies hardware maintenance.
        
         | adamcharnock wrote:
         | I've come at this from a slightly different angle. I've seen
         | many clients running k8s on expensive cloud instances, but to
         | me that is solving the same problems twice. Both k8s and cloud
         | instances solve a highly related and overlapping set of
         | problems.
         | 
          | Instead you can take k8s, deploy it to bare metal, and have
          | much, much more power for a much lower cost. Of course this
          | requires some technical knowledge, but the benefits are
          | significant (lower costs, stable costs, no vendor lock-in,
          | all the Postgres extensions you want, response times halved,
          | etc.).
         | 
          | k8s smooths over the vagaries of bare metal very nicely.
         | 
          | If you'll excuse a quick plug for my work: We [1] offer a
          | middle ground for this, whereby we set up and manage all of
          | this for you. We take over all DevOps and infrastructure
          | responsibility while also cutting spend by around 50%.
          | (Cloud hardware really is that expensive in comparison.)
         | 
         | [1]: https://lithus.eu
        
       | evacchi wrote:
       | somewhat related https://architect.run/
       | 
       | > Seamless Migrations with Zero Downtime
       | 
       | (I don't work for them but they are friends ;))
        
       | paol wrote:
       | I'm not sure why they state "although the AWS Load Balancer
       | Controller is a fantastic piece of software, it is surprisingly
       | tricky to roll out releases without downtime."
       | 
       | The AWS Load Balancer Controller uses readiness gates by default,
       | exactly as described in the article. Am I missing something?
       | 
        | Edit: Ah, it's not enabled by default; it requires a label on
        | the namespace. I'd forgotten about this. To be fair though,
        | the AWS docs tell you to add this label.
        
         | pmig wrote:
          | Yes, that is what we thought as well, but it turns out there
          | is still a delay between the pod being told to terminate and
          | the load balancer controller actually registering the target
          | as offline. We did some benchmarks to highlight that gap.
        
           | paol wrote:
           | You mean the problem you describe in "Part 3" of the article?
           | 
           | Damn it, now you've made me paranoid. I'll have to check the
           | ELB logs for 502 errors during our deployment windows.
        
             | pmig wrote:
              | Exactly! We initially received some Sentry errors that
              | piqued our curiosity.
        
         | Spivak wrote:
         | I think the "label (edit: annotation) based configuration" has
         | got to be my least favorite thing about the k8s ecosystem.
         | They're super magic, completely undiscoverable outside the
         | documentation, not typed, not validated (for mutually exclusive
         | options), and rely on introspecting the cluster and so aren't
         | part of the k8s solver.
         | 
         | AWS uses them for all of their integrations and they're never
         | not annoying.
        
           | merb wrote:
            | I think you mean annotations. Labels and annotations are
            | different things. And by the way, annotations can be
            | validated and typed, with validation webhooks.
        
       | glenjamin wrote:
       | The fact that the state of the art container orchestration system
       | requires you to run a sleep command in order to not drop traffic
       | on the floor is a travesty of system design.
       | 
        | We had perfectly good rolling deploys before k8s came on the
        | scene, but k8s's insistence on a single-phase deployment
        | process means we end up with this silly workaround.
       | 
       | I yelled into the void about this once and I was told that this
       | was inevitable because it's an eventually consistent distributed
       | system. I'm pretty sure it could still have had a 2 phase pod
       | shutdown by encoding a timeout on the first stage. Sure, it would
       | have made some internals require more complex state - but isn't
       | that the point of k8s? Instead everyone has to rediscover the
       | sleep hack over and over again.
        
         | dilyevsky wrote:
          | There are a few warts like this with the core/apps
          | controllers. Nothing unfixable within the general k8s design
          | imho, but unfortunately most of the community has moved on
          | to newer, shinier things.
        
         | deathanatos wrote:
         | It shouldn't. I've not had the braincells yet to fully
         | internalize the entire article, but it seems like we go wrong
         | about here:
         | 
         | > _The AWS Load Balancer keeps sending new requests to the
         | target for several seconds after the application is sent the
         | termination signal!_
         | 
         | And then concluded a wait is required...? Yes, traffic might
         | not cease immediately, but you drain the connections to the
         | load balancer, and then exit. A decent HTTP framework should be
         | doing this by default on SIGTERM.
         | 
         | > _I yelled into the void about this once and I was told that
         | this was inevitable because it 's an eventually consistent
         | distributed system._
         | 
         | Yeah, I wouldn't agree with that either. A terminating pod is
         | inherently "not ready", that not-ready state should cause the
         | load balancer to remove it from rotation. Similarly, the pod
         | itself can drain its connections to the load balancer. That
         | could take time; there's always going to be some point at which
         | you'd have to give up on a slowloris request.
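          | 
          | To make that concrete, here is a minimal sketch of
          | drain-then-exit with Go's net/http (my own illustration, not
          | from the article; the port and 30s timeout are assumptions):
          | 
          |     package main
          |     
          |     import (
          |         "context"
          |         "net/http"
          |         "os"
          |         "os/signal"
          |         "syscall"
          |         "time"
          |     )
          |     
          |     func main() {
          |         srv := &http.Server{Addr: ":8080"}
          |         go srv.ListenAndServe()
          |     
          |         // Block until the kubelet sends SIGTERM.
          |         stop := make(chan os.Signal, 1)
          |         signal.Notify(stop, syscall.SIGTERM)
          |         <-stop
          |     
          |         // Stop accepting new connections and wait (up to
          |         // 30s) for in-flight requests to finish.
          |         ctx, cancel := context.WithTimeout(
          |             context.Background(), 30*time.Second)
          |         defer cancel()
          |         srv.Shutdown(ctx)
          |     }
          | 
          | Whether the load balancer has actually stopped opening new
          | connections by the time Shutdown starts is the contentious
          | part.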
        
           | singron wrote:
           | Most http frameworks don't do this right. They typically wait
           | until all known in-flight requests complete and then exit.
           | That's usually too fast for a load balancer that's still
           | sending new requests. Instead you should just wait 30 seconds
           | or so while still accepting new requests and replying not
           | ready to load balancer health checks, and then if you want to
           | wait additional time for long running requests, you can. You
           | can also send clients "connection: close" to convince them to
           | reopen connections against different backends.
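            | 
            | Roughly like this, as a Go sketch (my illustration; the
            | /healthz path, port, and 30-second numbers are
            | assumptions): flip the health check to not-ready, keep
            | serving while the load balancer notices, then drain.
            | 
            |     package main
            |     
            |     import (
            |         "context"
            |         "net/http"
            |         "os"
            |         "os/signal"
            |         "sync/atomic"
            |         "syscall"
            |         "time"
            |     )
            |     
            |     var ready atomic.Bool
            |     
            |     func main() {
            |         ready.Store(true)
            |         mux := http.NewServeMux()
            |     
            |         // Health check the load balancer polls; it starts
            |         // failing once shutdown begins.
            |         mux.HandleFunc("/healthz",
            |             func(w http.ResponseWriter, r *http.Request) {
            |                 if !ready.Load() {
            |                     w.WriteHeader(503)
            |                     return
            |                 }
            |                 w.WriteHeader(200)
            |             })
            |     
            |         srv := &http.Server{Addr: ":8080", Handler: mux}
            |         go srv.ListenAndServe()
            |     
            |         stop := make(chan os.Signal, 1)
            |         signal.Notify(stop, syscall.SIGTERM)
            |         <-stop
            |     
            |         // Fail health checks but keep serving while the
            |         // load balancer takes us out of rotation.
            |         ready.Store(false)
            |         time.Sleep(30 * time.Second)
            |     
            |         // Now drain whatever is still in flight.
            |         ctx, cancel := context.WithTimeout(
            |             context.Background(), 30*time.Second)
            |         defer cancel()
            |         srv.Shutdown(ctx)
            |     }
            | 
            | Sending Connection: close on responses during that window
            | covers clients that pin keep-alive connections too.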
        
         | cedws wrote:
          | K8S is overrated. It's actually pretty terrible, but everyone
          | has been convinced it's the solution to all of their problems
          | because it's slightly better than what we had 15 years ago
          | (Ansible/Puppet/Bash/immutable deployments) at 10x the
          | complexity. There are so many weird edge cases just waiting
          | to completely ruin your day. Like subPath mounts: if you use
          | subPath, then changes to a ConfigMap don't get reflected in
          | the container. The container doesn't get restarted either, of
          | course, so you have config drift built in, unless you install
         | one of those weird hacky controllers that restarts pods for
         | you.
        
       | bradleyy wrote:
        | I know this won't be helpful to folks committed to EKS, but
        | AWS ECS (i.e., running Docker containers with AWS handling the
        | orchestration) does a really great job on this. We've been
        | running ECS for years (at multiple companies) with basically
        | no hiccups.
       | 
       | One of my former co-workers went to a K8S shop, and longs for the
       | simplicity of ECS.
       | 
       | No software is a panacea, but ECS seems to be one of those "it
       | just works" technologies.
        
         | GiorgioG wrote:
         | We've been moving away from K8S to ECS...it just works without
         | all the complexity.
        
         | layoric wrote:
          | Completely agree. Unless you are operating a platform for
          | others to deploy to, ECS is a lot simpler and works really
          | well for a lot of common setups.
        
         | FridgeSeal wrote:
         | > One of my former co-workers went to a K8S shop, and longs for
         | the simplicity of ECS.
         | 
          | I was using K8s previously, I'm using ECS on my current
          | team, and I hate it. I would _much_ rather have K8s back.
          | The UX is all over the place, none of my normal tooling
          | works, and the deployment configs are so much worse than
          | their K8s equivalents.
        
           | easton wrote:
           | I think like a lot of things, once you're used to having the
           | knobs of k8s and its DX, you'll want them always. But a lot
           | of teams adopt k8s because they need a containerized service
           | in AWS, and have no real opinions about how, and in those
           | cases ECS is almost always easier (even with all its quirks).
           | 
           | (And it's free, if you don't mind the mild lock-in).
        
         | pmig wrote:
         | I agree that ECS works great for stateless containerized
         | workloads. But you will need other AWS-managed services for
         | state (RDS), caching (ElastiCache), and queueing (SQS).
         | 
         | So your application is now suddenly spread across multiple
         | services, and you'll need an IaC tool like Terraform, etc.
         | 
         | The beauty (and the main reason we use K8s) is that everything
         | is inside our cluster. We use cloudnative-pg, Redis pods, and
         | RabbitMQ if needed, so everything is maintained in a GitOps
         | project, and we have no IaC management overhead.
         | 
         | (We do manually provision S3 buckets for backups and object
         | storage, though.)
        
           | placardloop wrote:
           | Mentioning "no IaC management overhead" is weird. If you're
           | not using IaC, you're doing it wrong.
           | 
           | However, GitOps _is_ IaC, just by another name, so you
           | actually do have IaC "overhead".
        
           | williamdclt wrote:
            | Many companies run k8s for compute and use RDS/SQS/Redis
            | outside of it. For example, RDS is not just hosted
            | Postgres; it has a whole bunch of features that don't come
            | out of the box (you do pay for them; I'm not giving an
            | opinion as to whether it's worth the price).
        
         | icedchai wrote:
          | If you're on GCP, Google Cloud Run also "just works" quite
          | well.
        
           | holografix wrote:
           | Amazing product, doesn't get nearly the attention it
           | deserves. ECS is a hot spaghetti mess in comparison.
        
       | yosefmihretie wrote:
        | Highly recommend Porter if you are a startup that doesn't
        | want to think about things like this.
        
       | NightMKoder wrote:
        | This is actually a fascinatingly complex problem. Some notes
        | about the article:
        | 
        | * The 20s delay before shutdown is called "lame duck mode."
        | As implemented it's close to good, but not perfect.
        | 
        | * When in lame duck mode you should fail the pod's health
        | check. That way you don't rely on the ALB controller to
        | remove your pod. Your pod is still serving other requests,
        | but gracefully asking everyone to forget about it.
        | 
        | * Make an effort to close http keep-alive connections. This
        | is more important if you're running another proxy that won't
        | listen to the health checks above (eg AWS -> Node ->
        | kube-proxy -> pod). Note that you can only do that when a
        | request comes in - but it's as simple as a Connection: close
        | header on the response.
        | 
        | * On a fun note, the new-ish kubernetes graceful node
        | shutdown feature won't remove your pod readiness when
        | shutting down.
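        | 
        | For the keep-alive point, a rough Go sketch of one way to do
        | it (my illustration; the handler names and SIGTERM trigger
        | are assumptions, not from the article): once lame duck mode
        | starts, tag every response with Connection: close so clients
        | reopen their connections against another backend.
        | 
        |     package main
        |     
        |     import (
        |         "net/http"
        |         "os"
        |         "os/signal"
        |         "sync/atomic"
        |         "syscall"
        |     )
        |     
        |     var lameDuck atomic.Bool
        |     
        |     // closeOnLameDuck asks clients to drop their keep-alive
        |     // connections while the pod is draining.
        |     func closeOnLameDuck(next http.Handler) http.Handler {
        |         return http.HandlerFunc(
        |             func(w http.ResponseWriter, r *http.Request) {
        |                 if lameDuck.Load() {
        |                     w.Header().Set("Connection", "close")
        |                 }
        |                 next.ServeHTTP(w, r)
        |             })
        |     }
        |     
        |     func main() {
        |         // Enter lame duck mode on SIGTERM (the actual
        |         // draining and exit are omitted here).
        |         stop := make(chan os.Signal, 1)
        |         signal.Notify(stop, syscall.SIGTERM)
        |         go func() { <-stop; lameDuck.Store(true) }()
        |     
        |         mux := http.NewServeMux()
        |         mux.HandleFunc("/",
        |             func(w http.ResponseWriter, r *http.Request) {
        |                 w.Write([]byte("ok"))
        |             })
        |         http.ListenAndServe(":8080", closeOnLameDuck(mux))
        |     }
        | 
        | (http.Server also has SetKeepAlivesEnabled(false) if you want
        | the same effect for the whole server at once.)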
        
         | nosefrog wrote:
         | By health check, do you mean the kubernetes liveness check?
         | Does that make kube try to kill or restart your container?
        
         | spockz wrote:
          | By "health check" I presume you mean the readiness check,
          | right? Otherwise it will kill the container when the
          | liveness check fails.
        
       | paranoidrobot wrote:
       | We had to figure this out the hard way, and ended up with this
       | approach (approximately).
       | 
       | K8S provides two (well three, now) health checks.
       | 
       | How this interacts with ALB is quite important.
       | 
       | Liveness should always return 200 OK unless you have hit some
       | fatal condition where your container considers itself dead and
       | wants to be restarted.
       | 
       | Readiness should only return 200 OK if you are ready to serve
       | traffic.
       | 
       | We configure the ALB to only point to the readiness check.
       | 
       | So our application lifecycle looks like this:
       | 
       | * Container starts
       | 
       | * Application loads
       | 
       | * Liveness begins serving 200
       | 
       | * Some internal health checks run and set readiness state to True
       | 
       | * Readiness checks now return 200
       | 
       | * ALB checks begin passing and so pod is added to the target
       | group
       | 
       | * Pod starts getting traffic.
       | 
       | time passes. Eventually for some reason the pod needs to shut
       | down.
       | 
       | * Kube calls the preStop hook
       | 
       | * PreStop sends SIGUSR1 to app and waits for N seconds.
       | 
       | * App handler for SIGUSR1 tells readiness hook to start failing.
       | 
       | * ALB health checks begin failing, and no new requests should be
       | sent.
       | 
       | * ALB takes the pod out of the target group.
       | 
       | * PreStop hook finishes waiting and returns
       | 
       | * Kube sends SIGTERM
       | 
       | * App wraps up any remaining in-flight requests and shuts down.
       | 
       | This allows the app to do graceful shut down, and ensures the ALB
       | doesn't send traffic to a pod that knows it is being shut down.
       | 
       | Oh, and on the Readiness check - your app can use this to
       | (temporarily) signal that it is too busy to serve more traffic.
       | Handy as another signal you can monitor for scaling.
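        | 
        | For what it's worth, the application side of that lifecycle
        | is fairly small. A rough Go sketch (my illustration; the
        | /livez and /readyz paths, port, and timeout are assumptions,
        | while the SIGUSR1/SIGTERM handling follows the steps above):
        | 
        |     package main
        |     
        |     import (
        |         "context"
        |         "net/http"
        |         "os"
        |         "os/signal"
        |         "sync/atomic"
        |         "syscall"
        |         "time"
        |     )
        |     
        |     var ready atomic.Bool
        |     
        |     func main() {
        |         mux := http.NewServeMux()
        |     
        |         // Liveness: 200 unless the process considers itself
        |         // dead and wants a restart.
        |         mux.HandleFunc("/livez",
        |             func(w http.ResponseWriter, r *http.Request) {
        |                 w.WriteHeader(200)
        |             })
        |     
        |         // Readiness: the check the ALB target group uses.
        |         mux.HandleFunc("/readyz",
        |             func(w http.ResponseWriter, r *http.Request) {
        |                 if ready.Load() {
        |                     w.WriteHeader(200)
        |                 } else {
        |                     w.WriteHeader(503)
        |                 }
        |             })
        |     
        |         srv := &http.Server{Addr: ":8080", Handler: mux}
        |         go srv.ListenAndServe()
        |     
        |         // Internal startup checks would run here before
        |         // flipping readiness to true.
        |         ready.Store(true)
        |     
        |         sigs := make(chan os.Signal, 1)
        |         signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGTERM)
        |         for sig := range sigs {
        |             if sig == syscall.SIGUSR1 {
        |                 // preStop fired: start failing readiness so
        |                 // the ALB drains us, but keep serving.
        |                 ready.Store(false)
        |                 continue
        |             }
        |             // SIGTERM after the preStop wait: finish any
        |             // in-flight requests, then exit.
        |             ctx, cancel := context.WithTimeout(
        |                 context.Background(), 30*time.Second)
        |             defer cancel()
        |             srv.Shutdown(ctx)
        |             return
        |         }
        |     }
        | 
        | The preStop hook's N-second wait is what gives the ALB time to
        | react before SIGTERM ever arrives, so the app itself never
        | needs to sleep.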
       | 
       | e: Formatting was slightly broken.
        
       ___________________________________________________________________
       (page generated 2025-03-10 23:00 UTC)