[HN Gopher] Kubernetes Failure Stories
___________________________________________________________________
Kubernetes Failure Stories
Author : jakozaur
Score : 155 points
Date : 2021-02-11 19:38 UTC (3 hours ago)
(HTM) web link (k8s.af)
(TXT) w3m dump (k8s.af)
| BasedInfra wrote:
| They walked so we could run our k8s services.
|
| Anyone with failure stories from developing, deploying or running
| software (or anything!) should put them out there. It's a great
| resource.
| kall wrote:
| Nice to see an HN submission for something I just added to my
| reading list from another HN submission, Litestream in this case,
| right?
| Benlights wrote:
| I see you read the article about litestream, I love following the
| rabbit hole. Good work!
| An0mammall wrote:
| I first thought this was kind of a smear attempt but I realise
| now that this is a great learning resource, thanks!
| kureikain wrote:
| The first link in this article is about switching from fluentbit
| to fluentd.
|
| Me too. I have issues with fluentbit, especially around its JSON
| processing, as in this issue:
| https://github.com/fluent/fluent-bit/issues/1588
|
| If you just want simple log forwarding, fluentbit is really good.
| But if you find yourself tweaking the fluentbit config, or
| writing custom plugins in C (and recompiling it yourself), then
| it's time to move to fluentd.
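|
| For what it's worth, both fluentd and fluent-bit speak the same
| forward protocol, so the application side looks the same either
| way. A minimal Go illustration (assuming the official
| fluent-logger-golang client and a forward input listening on
| 127.0.0.1:24224 -- both hypothetical details, not from the
| comment above):
|
|     package main
|
|     import (
|         "log"
|
|         "github.com/fluent/fluent-logger-golang/fluent"
|     )
|
|     func main() {
|         // Connect to whatever speaks the forward protocol on
|         // this port: fluentd's in_forward or fluent-bit's
|         // forward input.
|         logger, err := fluent.New(fluent.Config{
|             FluentHost: "127.0.0.1",
|             FluentPort: 24224,
|         })
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer logger.Close()
|
|         // Post one structured record under a tag; the
|         // aggregator decides how to parse, buffer and route it.
|         if err := logger.Post("app.access", map[string]string{
|             "method": "GET",
|             "path":   "/healthz",
|             "status": "200",
|         }); err != nil {
|             log.Fatal(err)
|         }
|     }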
| opsunit wrote:
| I did a lot of the work to get sub-second timestamps working
| within the fluentd+ElasticSearch ecosystem and the thing is a
| tire fire.
| fuzzythinker wrote:
| Meta. Is this a common usage of .af domains?
| pc86 wrote:
| I only came here to comment that the domain name was fantastic
| for this usage.
| plange wrote:
| Donated by Joe Beda himself.
| yongjik wrote:
| Maybe I should mutter[1] "That's Kubernetes as fuck" next
| time I see more shenanigans involving multiple
| overcomplicated layers with confusing documentation
| interacting with each other in a way nobody can figure out.
|
| [1] Just to myself, of course.
| jrockway wrote:
| Is Kubernetes really overcomplicated, though? Say you
| wanted to do a release of a new version of your app. You'd
| probably boot up a VM for the new version, provision it an
| IP, copy your code to it, and check if it's healthy. Then
| you'd edit your load balancer configuration to send traffic
| to that new IP, and drain traffic from the old IP. Then
| you'd shut the old instance down.
|
| That's basically what a Deployment in Kubernetes would do;
| create a Pod to provision compute resources and an IP, pull
| a container to copy your code into the environment, run a
| health check to see if that all went OK, and then update
| the Service's Endpoints to start sending it traffic. Then
| it would drain the old Pod in a reverse of starting a new
| one, perhaps running your configured cleanup steps along
| the way.
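|
| As a rough illustration of that loop (not from the original
| comment; the namespace, Deployment name, and image tag below are
| placeholders), the whole boot/copy/health-check/shift-traffic/
| drain dance reduces to changing one field and letting the
| Deployment controller do the rest -- here with client-go:
|
|     package main
|
|     import (
|         "context"
|         "fmt"
|
|         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
|         "k8s.io/client-go/kubernetes"
|         "k8s.io/client-go/tools/clientcmd"
|     )
|
|     func main() {
|         // Load the local kubeconfig (~/.kube/config).
|         config, err := clientcmd.BuildConfigFromFlags("",
|             clientcmd.RecommendedHomeFile)
|         if err != nil {
|             panic(err)
|         }
|         clientset, err := kubernetes.NewForConfig(config)
|         if err != nil {
|             panic(err)
|         }
|
|         deployments := clientset.AppsV1().Deployments("default")
|
|         // Fetch the existing Deployment ("myapp" is made up).
|         d, err := deployments.Get(context.TODO(), "myapp",
|             metav1.GetOptions{})
|         if err != nil {
|             panic(err)
|         }
|
|         // Point the first container at the new image. The
|         // controller then creates new Pods, waits for their
|         // readiness probes, moves the Service's Endpoints over,
|         // and drains the old Pods.
|         d.Spec.Template.Spec.Containers[0].Image = "myapp:v2"
|
|         _, err = deployments.Update(context.TODO(), d,
|             metav1.UpdateOptions{})
|         if err != nil {
|             panic(err)
|         }
|         fmt.Println("rolling update started")
|     }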
|
| The difference between the example and what Kubernetes does
| is that Kubernetes is a computer program that can do this
| thousands of times for you without really making you think
| about getting the details right every time. Doesn't seem
| overly complicated to me. A lot of stuff? Sure. But it's
| pretty close to the right level of complexity. (Some may
| argue that this is even too simple. What about blue/green
| testing? What about canaries? When do my database
| migrations run? Kubernetes has no answer, and if it did, it
| certainly wouldn't make it simpler. Perhaps the real
| problem is that Kubernetes isn't complicated enough!)
|
| Anyway, "that's Kubernetes as fuck" is kind of a good meme.
| I'll give you that. But a deep understanding usually makes
| memes less amusing.
| zwp wrote:
| Afghanistan? Air force? Uh, as fuck? Really?
|
| I'm not prudish about words but .lol would've been clearer.
| genmud wrote:
| I absolutely love it when people talk about failures, it is so
| nice to be able to learn about stuff without having to make the
| mistake myself.
| gen220 wrote:
| This is a compilation of gotcha-discovery-reports, distributed
| across the surface area of K8s (which, since K8s is huge, covers
| many nooks and crannies). This is _not_ a compilation of "K8s
| sucks, here are 10 reasons why", which is what my kneejerk
| expectation was. (maybe I am too cynical...).
|
| Overall, this is a fantastic index to some very interesting
| resources.
|
| There's an old saying that you learn more from failure than from
| success (whose analogue is a blogpost with a title "look how easy
| it is to set up k8s in 15 minutes at home!"). If you really want
| to think about how to deploy nuanced technology like this at
| scale, operational crash reports such as these are invaluable.
| tempest_ wrote:
| Which is why postmortem blog posts are always infinitely more
| enlightening than the "How we build X" posts
| barrkel wrote:
| "All happy families are alike; each unhappy family is unhappy
| in its own way."
|
| https://en.wikipedia.org/wiki/Anna_Karenina_principle
| WJW wrote:
| Not to knock on Tolstoy but there are many ways in which
| unhappy families can be grouped together. You have the
| alcoholic parents, the living-vicariously-through-their-
| children folks, the abusive parents, etc etc etc.
|
| To tie it back to Kubernetes, you have the scheduler-does-
| not-schedule lessons, the ingress-not-consistent lessons,
| the autoscaler-killed-all-my-pods lessons, the moving-
| stateful-services-with-60TB-storage-size-took-20-minutes
| lessons and probably many more. It's like Churchill's
| explanation of democracy: terrible, but better than all the
| alternatives. (at scale, at least)
| trixie_ wrote:
| Can we do one for micro-services next?
| UweSchmidt wrote:
| It takes a high level of skill and maturity to articulate such
| failure stories. The thing you missed may be obvious to others,
| the "bug" you found may actually be documented somewhere, and
| it's tempting to just mitigate and move on from a disaster
| without completely drilling down and reconstructing the problem.
| So thanks for posting.
| mitjak wrote:
| really cool resource for learning but a lot of these have nothing
| to do with k8s, beyond the company in question having k8s as part
| of their stack (i'm addressing the possible perception of the
| post title suggesting k8s horror stories).
| jrockway wrote:
| A lot of them do have things to do with k8s, though. Admission
| webhooks, Istio sidecar injection, etc.
|
| The "CPU limits = weird latency spikes" issue also shows up a lot
| there, but it's technically a cgroups problem. (Set GOMAXPROCS=16,
| set the CPU limit to 1, and wonder why your program is asleep for
| 15/16ths of every cgroups throttling interval. I see that happen
| to people a lot; the key point being that GOMAXPROCS and the
| throttling interval are not something they ever manually
| configured, hence it's surprising how they interact. I ship
| https://github.com/uber-go/automaxprocs in all of my open source
| stuff to avoid bug reports about this particular issue.) Fun
| stuff! :)
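|
| A hedged sketch of that fix (using the go.uber.org/automaxprocs
| import path for the library linked above; the printout is just
| for illustration):
|
|     package main
|
|     import (
|         "fmt"
|         "runtime"
|
|         // Sets GOMAXPROCS from the cgroup CPU quota at init
|         // time, instead of the host's core count.
|         _ "go.uber.org/automaxprocs"
|     )
|
|     func main() {
|         // With a CPU limit of 1 this prints 1, so the runtime
|         // no longer schedules 16 threads against a 1-CPU quota
|         // and burns the whole cfs quota in a fraction of each
|         // throttling period.
|         fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
|     }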
|
| DNS also makes a regular appearance, and I agree it's not
| Kubernetes' fault, but on the other hand, people probably just
| hard-coded service IPs for service discovery before Kubernetes,
| so DNS issues are a surprise to them. When they type
| "google.com" into their browser, it works every time, so why
| wouldn't "service.namespace.svc.cluster.local" work just as
| well? (I also love the cloud providers' approach to this rough
| spot -- GKE has a service that exists to scale up kube-dns if
| you manually scale it down!)
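|
| Just to make that name concrete (using the literal placeholder
| name from above; this only resolves from inside a Pod that uses
| the cluster's DNS):
|
|     package main
|
|     import (
|         "fmt"
|         "net"
|     )
|
|     func main() {
|         // Shorter forms like "service.namespace" rely on the
|         // Pod's resolv.conf search list and ndots setting; the
|         // fully qualified name skips that.
|         addrs, err := net.LookupHost(
|             "service.namespace.svc.cluster.local")
|         if err != nil {
|             fmt.Println("lookup failed:", err)
|             return
|         }
|         fmt.Println("resolved to:", addrs)
|     }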
|
| Anyway, it's all good reading. If you don't read this, you are
| bound to have these things happen to you. Many of these things
| will happen to you even if you don't use Kubernetes!
| cbushko wrote:
| The Istio one hits home. It is the single scariest thing to work
| with in our kubernetes clusters. We've caused several outages by
| changing the smallest things.
| tutfbhuf wrote:
| The current trend is toward multi-cluster environments, because
| it's way too easy to destroy a single k8s cluster through bugs,
| updates or human error, just as it's not a very unlikely event
| to lose a single host in the network, e.g. due to
| updates/maintenance.
|
| For instance, we had several outages when upgrading the
| kubernetes version in our clusters. If you have many small
| clusters, it's much easier and safer to apply cluster-wide
| updates, one cluster at a time.
| colllectorof wrote:
| The tech churn cycle is getting more and more insane. It's the
| same thing again and again.
|
| 1. Identify one problem you want to fix and ignore everything
| else.
|
| 2. Make a tool to manage the problem while still ignoring
| everything else.
|
| 3. Hype the tool up and shove it in every niche and domain
| possible.
|
| 4. Observe how "everything else" bites you in the ass.
|
| 5. Identify the worst problem from #4, use it to start the
| whole process again.
|
| Microservices will save us! Oh no, they make things
| complicated. Well, I will solve that with containers! Oh no,
| managing containers is a pain. Container orchestration! Oh no,
| our clusters fail.
|
| Meanwhile, the complexity of our technology goes up and its
| reliability _in practice_ goes down.
| superbcarrot wrote:
| Sounds like madness. Should we expect a new tool that
| orchestrates all of your Kubernetes clusters?
| theptip wrote:
| It used to be called Ubernetes, shame that name didn't stick.
| jknoepfler wrote:
| I operate one at work, actually. It's not so bad.
| thirdlamp wrote:
| it's kubernetes all the way down
| codeulike wrote:
| Kubermeta
| yongjik wrote:
| > There are two ways of constructing a software design: One
| way is to make it so simple that there are obviously no
| deficiencies, and the other way is to make it so complicated
| that there are no obvious deficiencies. The first method is
| far more difficult.
|
| - C. A. R. Hoare
| mullingitover wrote:
| Rancher does this nicely.
| verdverm wrote:
| KubeFed - https://github.com/kubernetes-sigs/kubefed
| bo0tzz wrote:
| https://github.com/kubernetes-sigs/cluster-api
| bcrosby95 wrote:
| A Kubernetes for Kubernetes! But what about when we want to
| upgrade that?
| dabeeeenster wrote:
| I mean that's simply Kubernetes for Kubernetes for
| Kubernetes. Let's go!
| testplzignore wrote:
| I need a Kubernetes Kubernetes Kubernetes Kubernetes
| Kubernetes emulator so I can run Civilization 14 on my
| iPhone 35XSXS.
| higeorge13 wrote:
| Inception!
| YawningAngel wrote:
| You end up needing this anyway, because Kubernetes clusters
| are confined to single geographic regions and you need some
| way of managing the separate cluster(s) you have in each
| region
| WJW wrote:
| Not saying you should, but it's entirely possible to have
| nodes all over the world managed by a single set of master
| nodes. The system is not really designed with that use case in
| mind, though; the added latencies will hurt a lot when getting
| etcd consensus.
| redis_mlc wrote:
| > For instance, we had several outages when upgrading the
| kubernetes version in our clusters.
|
| Yes, this was a big problem in 2018-2019. Are things smoother
| now in your experience?
|
| What magnified the problems where I worked was that the India
| team would silently do the k8s upgrade overnight, and when the
| US office opened in the morning we'd find problems several hours
| later. So, several hours of downtime.
| [deleted]
| nhoughto wrote:
| These are great, I found myself trying to predict the related
| areas, "..Slow connections.." ooh that must be conntrack! Been
| burnt by kube2iam too.. I guess it's k8s failure bingo.
___________________________________________________________________
(page generated 2021-02-11 23:00 UTC)