[HN Gopher] Zero-Downtime Kubernetes Deployments on AWS with EKS
___________________________________________________________________
Zero-Downtime Kubernetes Deployments on AWS with EKS
Author : pmig
Score : 53 points
Date : 2025-03-10 12:48 UTC (10 hours ago)
(HTM) web link (glasskube.dev)
(TXT) w3m dump (glasskube.dev)
| _bare_metal wrote:
| I run https://BareMetalSavings.com.
|
| The number of companies that use K8s with no business or
| technological justification for it is staggering. It is the
| number one blocker in moving to bare metal/on-prem when costs
| become too much.
|
| Yes, on-prem has its gotchas, just like the EKS deployment
| described in the post, but everything is so much simpler and
| more straightforward that the on-prem side of things is far
| easier to grasp.
| abtinf wrote:
| Could you expand a bit on the point of K8S being a blocker to
| moving to on-prem?
|
| Naively, I would think it would be neutral, since I would
| assume that if a customer gets k8s running on-prem, then apps
| designed to run in k8s should have a straightforward
| migration path?
| MPSimmons wrote:
| I can expand a little bit, but based on your question, I
| suspect you may know everything I'm going to type.
|
| In cloud environments, it's pretty common that your cloud
| provider has specific implementations of Kubernetes objects,
| either by creating custom resources that you can make use of,
| or just building opinionated default instances of things like
| storage classes, load balancers, etc.
|
| It's pretty easy to not think about the implementation
| details of, say, an object-storage-backed PVC until you need
| to do it in a K8s instance that doesn't already have your
| desired storage class. Then you've got to figure out how to
| map your simple-but-custom $thing from provider-managed to
| platform-managed. If you're moving into Rancher, for
| instance, it's relatively batteries-included, but there are
| definitely considerations to make for things like how
| machines are built from a disk storage perspective and where
| Longhorn drives are mapped.
|
| It's like that for a ton of stuff; a whole lot of the
| Kubernetes/outside-infra interface works this way.
| Networking, storage, maybe even certificate management: those
| all need consideration if you're migrating from cloud to
| on-prem.
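|
| To make that concrete, here's a hedged sketch of the kind of
| remapping involved: a workload that assumed an EBS-backed
| storage class on EKS might get an equivalent Longhorn-backed
| class on-prem (the class name and replica count here are
| illustrative, not from the article):
|
|     # Hypothetical on-prem replacement for a cloud-managed
|     # StorageClass such as EKS's EBS-backed "gp3".
|     apiVersion: storage.k8s.io/v1
|     kind: StorageClass
|     metadata:
|       name: gp3  # keep the old name so manifests that
|                  # reference the old class still work
|     provisioner: driver.longhorn.io
|     parameters:
|       numberOfReplicas: "2"
|     reclaimPolicy: Delete
|     volumeBindingMode: WaitForFirstConsumer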
| anang wrote:
| I think K8s distributions like K3s make this way simpler. If
| you want to run distributed object storage on bare metal,
| you're in for a lot of complexity, with or without k8s.
|
| I've run three-server K3s clusters on bare metal and they
| work very well with little maintenance. I didn't do anything
| special, and while it's more complex than some Ansible
| scripts and HAProxy, I think the breadth of tooling makes it
| worth it.
| hadlock wrote:
| I ran K3s locally during the pandemic, and the only issue
| at the time was getting PVs/PVCs provisioned cleanly; I
| think Longhorn was just reaching maturity, and five years
| ago the docs were pretty sparse. But yeah, K3s is a dream
| to work with in 2025: the docs are great, and as long as
| you stay on the happy path and your network is set up,
| it's about as effortless as cluster computing can get.
| reillyse wrote:
| Out of interest, do you recommend any good places to host a
| machine in the US? A major part of why I like the cloud is
| that it really simplifies hardware maintenance.
| adamcharnock wrote:
| I've come at this from a slightly different angle. I've seen
| many clients running k8s on expensive cloud instances, but to
| me that is solving the same problems twice. Both k8s and cloud
| instances solve a highly related and overlapping set of
| problems.
|
| Instead you can take k8s, deploy it to bare metal, and get
| much more power at a much lower cost. Of course this requires
| some technical knowledge, but the benefits are significant
| (lower costs, stable costs, no vendor lock-in, all the
| Postgres extensions you want, response times halved, etc.).
|
| k8s smooths over the vagaries of bare metal very nicely.
|
| If you'll excuse a quick plug for my work: We [1] offer a
| middle ground for this, whereby we do and manage all this for
| you. We take over all DevOps and infrastructure responsibility
| while also cutting spend by around 50%. (cloud hardware really
| is that expensive in comparison).
|
| [1]: https://lithus.eu
| evacchi wrote:
| somewhat related https://architect.run/
|
| > Seamless Migrations with Zero Downtime
|
| (I don't work for them but they are friends ;))
| paol wrote:
| I'm not sure why they state "although the AWS Load Balancer
| Controller is a fantastic piece of software, it is surprisingly
| tricky to roll out releases without downtime."
|
| The AWS Load Balancer Controller uses readiness gates by default,
| exactly as described in the article. Am I missing something?
|
| Edit: Ah, it's not by default, it requires a label in the
| namespace. I'd forgotten about this. To be fair though, the AWS
| docs tell you to add this label.
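|
| For reference, a minimal sketch of the namespace label the
| AWS Load Balancer Controller docs call for (the namespace
| name is illustrative):
|
|     apiVersion: v1
|     kind: Namespace
|     metadata:
|       name: my-app
|       labels:
|         # opts the namespace in to pod readiness gate
|         # injection by the controller
|         elbv2.k8s.aws/pod-readiness-gate-inject: enabled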
| pmig wrote:
| Yes, that is what we thought as well, but it turns out that
| there is still a delay between the load balancer controller
| registering a target as offline and the pod actually
| terminating. We did some benchmarks to highlight that gap.
| paol wrote:
| You mean the problem you describe in "Part 3" of the article?
|
| Damn it, now you've made me paranoid. I'll have to check the
| ELB logs for 502 errors during our deployment windows.
| pmig wrote:
| Exactly! We initially received some Sentry errors that
| piqued our curiosity.
| Spivak wrote:
| I think the "label (edit: annotation) based configuration" has
| got to be my least favorite thing about the k8s ecosystem.
| They're super magic, completely undiscoverable outside the
| documentation, not typed, not validated (for mutually exclusive
| options), and rely on introspecting the cluster and so aren't
| part of the k8s solver.
|
| AWS uses them for all of their integrations and they're never
| not annoying.
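|
| For anyone who hasn't run into this, a typical AWS Load
| Balancer Controller Ingress carries its configuration as
| free-form string annotations; a minimal sketch (values are
| illustrative):
|
|     apiVersion: networking.k8s.io/v1
|     kind: Ingress
|     metadata:
|       name: my-app
|       annotations:
|         # untyped strings: a typo here fails at reconcile
|         # time (or silently), not at admission
|         alb.ingress.kubernetes.io/scheme: internet-facing
|         alb.ingress.kubernetes.io/target-type: ip
|         alb.ingress.kubernetes.io/healthcheck-path: /healthz
|     spec:
|       ingressClassName: alb
|       rules:
|         - http:
|             paths:
|               - path: /
|                 pathType: Prefix
|                 backend:
|                   service:
|                     name: my-app
|                     port:
|                       number: 80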
| merb wrote:
| I think you mean annotations. Labels and annotations are
| different things. And by the way, annotations can be
| validated and typed, with validating webhooks.
| glenjamin wrote:
| The fact that the state of the art container orchestration system
| requires you to run a sleep command in order to not drop traffic
| on the floor is a travesty of system design.
|
| We had perfectly good rolling deploys before k8s came on the
| scene, but k8s's insistence on a single-phase deployment
| process means we end up with this silly workaround.
|
| I yelled into the void about this once and I was told that
| this was inevitable because it's an eventually consistent
| distributed system. I'm pretty sure it could still have had
| a two-phase pod shutdown by encoding a timeout on the first
| stage. Sure, it would have made some internals require more
| complex state, but isn't that the point of k8s? Instead
| everyone has to rediscover the sleep hack over and over
| again.
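|
| For concreteness, the workaround in question is usually a
| container-spec fragment along these lines (the duration is
| illustrative):
|
|     # the ubiquitous sleep hack: hold the pod in Terminating
|     # long enough for the load balancer to deregister it
|     # before SIGTERM is delivered
|     lifecycle:
|       preStop:
|         exec:
|           command: ["sleep", "20"]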
| dilyevsky wrote:
| There are a few warts like this with the core/apps
| controllers. Nothing unfixable within the general k8s design
| IMHO, but unfortunately most of the community has moved on
| to newer, shinier things.
| deathanatos wrote:
| It shouldn't. I've not had the braincells yet to fully
| internalize the entire article, but it seems like we go wrong
| about here:
|
| > _The AWS Load Balancer keeps sending new requests to the
| target for several seconds after the application is sent the
| termination signal!_
|
| And then concluded a wait is required...? Yes, traffic might
| not cease immediately, but you drain the connections to the
| load balancer, and then exit. A decent HTTP framework should be
| doing this by default on SIGTERM.
|
| > _I yelled into the void about this once and I was told that
| this was inevitable because it's an eventually consistent
| distributed system._
|
| Yeah, I wouldn't agree with that either. A terminating pod is
| inherently "not ready", that not-ready state should cause the
| load balancer to remove it from rotation. Similarly, the pod
| itself can drain its connections to the load balancer. That
| could take time; there's always going to be some point at which
| you'd have to give up on a slowloris request.
| singron wrote:
| Most HTTP frameworks don't do this right. They typically wait
| until all known in-flight requests complete and then exit.
| That's usually too fast for a load balancer that's still
| sending new requests. Instead you should wait 30 seconds or
| so while still accepting new requests and replying "not
| ready" to load balancer health checks; then, if you want to
| wait additional time for long-running requests, you can. You
| can also send clients "Connection: close" to convince them to
| reopen connections against different backends.
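|
| On the cluster side, the matching knobs are a readiness
| probe the app fails while draining, plus a termination grace
| period sized to cover the wait; a sketch with illustrative
| names and values:
|
|     spec:
|       terminationGracePeriodSeconds: 60  # covers drain wait
|                                          # + in-flight budget
|       containers:
|         - name: app
|           image: example/app:1.0  # illustrative
|           readinessProbe:
|             httpGet:
|               path: /healthz/ready  # starts failing on drain
|               port: 8080
|             periodSeconds: 5
|             failureThreshold: 2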
| cedws wrote:
| K8s is overrated. It's actually pretty terrible, but everyone
| has been convinced it's the solution to all of their problems
| because it's slightly better than what we had 15 years ago
| (Ansible/Puppet/Bash/immutable deployments) at 10x the
| complexity. There are so many weird edge cases just waiting
| to completely ruin your day. Like subPath mounts: if you use
| subPath, then changes to a ConfigMap aren't reflected in the
| container. The container doesn't get restarted either, of
| course, so you have config drift built in, unless you install
| one of those weird hacky controllers that restarts pods for
| you.
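|
| Concretely, the trap looks like this (names illustrative):
| the subPath mount below gets a one-time copy of the key at
| container start, so later edits to the ConfigMap never reach
| the container:
|
|     containers:
|       - name: app
|         image: example/app:1.0
|         volumeMounts:
|           - name: config
|             mountPath: /etc/app/app.conf
|             # subPath mounts are not updated when the
|             # ConfigMap changes
|             subPath: app.conf
|     volumes:
|       - name: config
|         configMap:
|           name: app-config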
| bradleyy wrote:
| I know this won't be helpful to folks committed to EKS, but
| AWS ECS (i.e., running Docker containers with AWS handling
| the orchestration) does a really great job at this. We've
| been running ECS for years (at multiple companies) with
| basically no hiccups.
|
| One of my former co-workers went to a K8S shop, and longs for
| the simplicity of ECS.
|
| No software is a panacea, but ECS seems to be one of those "it
| just works" technologies.
| GiorgioG wrote:
| We've been moving away from K8S to ECS...it just works without
| all the complexity.
| layoric wrote:
| Completely agree, unless you are operating a platform for
| others to deploy to, ECS is a lot simpler, and works really
| well for a lot of common setups.
| FridgeSeal wrote:
| > One of my former co-workers went to a K8S shop, and longs for
| the simplicity of ECS.
|
| I was using K8s previously and ECS in my current team, and I
| hate it. I would _much_ rather have K8s back. The UX is all
| over the place, none of my normal tooling works, and the
| deployment configs are so much worse than the K8s
| equivalents.
| easton wrote:
| I think like a lot of things, once you're used to having the
| knobs of k8s and its DX, you'll want them always. But a lot
| of teams adopt k8s because they need a containerized service
| in AWS, and have no real opinions about how, and in those
| cases ECS is almost always easier (even with all its quirks).
|
| (And it's free, if you don't mind the mild lock-in).
| pmig wrote:
| I agree that ECS works great for stateless containerized
| workloads. But you will need other AWS-managed services for
| state (RDS), caching (ElastiCache), and queueing (SQS).
|
| So your application is now suddenly spread across multiple
| services, and you'll need an IaC tool like Terraform, etc.
|
| The beauty (and the main reason we use K8s) is that everything
| is inside our cluster. We use cloudnative-pg, Redis pods, and
| RabbitMQ if needed, so everything is maintained in a GitOps
| project, and we have no IaC management overhead.
|
| (We do manually provision S3 buckets for backups and object
| storage, though.)
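|
| As an illustration of what "everything inside the cluster"
| means, a minimal CloudNativePG database is just another
| manifest in the GitOps repo (the name and sizes here are
| illustrative):
|
|     apiVersion: postgresql.cnpg.io/v1
|     kind: Cluster
|     metadata:
|       name: app-db
|     spec:
|       instances: 3
|       storage:
|         size: 10Gi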
| placardloop wrote:
| Mentioning "no IaC management overhead" is weird. If you're
| not using IaC, you're doing it wrong.
|
| However, GitOps _is_ IaC, just by another name, so you
| actually do have IaC "overhead".
| williamdclt wrote:
| Many companies run k8s for compute and use RDS/SQS/Redis
| outside of it. For example, RDS is not just hosted PG; it has
| a whole bunch of features that don't come out of the box (you
| do pay for them; I'm not giving an opinion on whether they're
| worth the price).
| icedchai wrote:
| If you're on GCP, Google Cloud Run also "just works" quite
| well, too.
| holografix wrote:
| Amazing product, doesn't get nearly the attention it
| deserves. ECS is a hot spaghetti mess in comparison.
| yosefmihretie wrote:
| highly recommend porter if you are a startup who doesn't wanna
| think about things like this
| NightMKoder wrote:
| This is actually a fascinatingly complex problem. Some notes
| about the article:
|
| * The 20s delay before shutdown is called "lame duck mode."
| As implemented it's close to good, but not perfect.
|
| * When in lame duck mode you should fail the pod's health
| check. That way you don't rely on the ALB controller to
| remove your pod. Your pod is still serving other requests,
| but gracefully asking everyone to forget about it.
|
| * Make an effort to close HTTP keep-alive connections. This
| is more important if you're running another proxy that won't
| listen to the health checks above (e.g. AWS -> Node ->
| kube-proxy -> pod). Note that you can only do that when a
| request comes in, but it's as simple as a Connection: close
| header on the response.
|
| * On a fun note, the new-ish Kubernetes graceful node
| shutdown feature won't remove your pod readiness when
| shutting down.
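|
| On that last point: graceful node shutdown is configured on
| the kubelet rather than on the pod; a sketch with
| illustrative durations:
|
|     # KubeletConfiguration fragment enabling graceful node
|     # shutdown; note it does not flip pod readiness, per the
|     # note above
|     apiVersion: kubelet.config.k8s.io/v1beta1
|     kind: KubeletConfiguration
|     shutdownGracePeriod: 60s
|     shutdownGracePeriodCriticalPods: 10s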
| nosefrog wrote:
| By health check, do you mean the kubernetes liveness check?
| Does that make kube try to kill or restart your container?
| spockz wrote:
| By "health check" I presume you mean the readiness check,
| right? Otherwise it will kill the container when the
| liveness check fails.
| paranoidrobot wrote:
| We had to figure this out the hard way, and ended up with this
| approach (approximately).
|
| K8s provides two (well, three now) health checks.
|
| How these interact with the ALB is quite important.
|
| Liveness should always return 200 OK unless you have hit some
| fatal condition where your container considers itself dead and
| wants to be restarted.
|
| Readiness should only return 200 OK if you are ready to serve
| traffic.
|
| We configure the ALB to only point to the readiness check.
|
| So our application lifecycle looks like this:
|
| * Container starts
|
| * Application loads
|
| * Liveness begins serving 200
|
| * Some internal health checks run and set readiness state to True
|
| * Readiness checks now return 200
|
| * ALB checks begin passing and so pod is added to the target
| group
|
| * Pod starts getting traffic.
|
| Time passes. Eventually, for some reason, the pod needs to
| shut down.
|
| * Kube calls the preStop hook
|
| * PreStop sends SIGUSR1 to app and waits for N seconds.
|
| * App handler for SIGUSR1 tells readiness hook to start failing.
|
| * ALB health checks begin failing, and no new requests should be
| sent.
|
| * ALB takes the pod out of the target group.
|
| * PreStop hook finishes waiting and returns
|
| * Kube sends SIGTERM
|
| * App wraps up any remaining in-flight requests and shuts down.
|
| This allows the app to do graceful shut down, and ensures the ALB
| doesn't send traffic to a pod that knows it is being shut down.
|
| Oh, and on the Readiness check - your app can use this to
| (temporarily) signal that it is too busy to serve more traffic.
| Handy as another signal you can monitor for scaling.
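|
| Put together, the shutdown half of that lifecycle looks
| roughly like the pod-spec fragment below (our signal choice;
| names and timings are illustrative):
|
|     spec:
|       terminationGracePeriodSeconds: 60
|       containers:
|         - name: app
|           image: example/app:1.0
|           lifecycle:
|             preStop:
|               exec:
|                 # tell the app (PID 1) to start failing
|                 # readiness, then wait N seconds for the ALB
|                 # to drop the target
|                 command:
|                   - /bin/sh
|                   - -c
|                   - kill -USR1 1 && sleep 20
|           readinessProbe:
|             httpGet:
|               path: /ready  # fails after SIGUSR1
|               port: 8080
|             periodSeconds: 5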
|
| e: Formatting was slightly broken.
___________________________________________________________________
(page generated 2025-03-10 23:00 UTC)