[HN Gopher] Learnings from our 8 years of Kubernetes in production
       ___________________________________________________________________
        
       Learnings from our 8 years of Kubernetes in production
        
       Author : jonsson101
       Score  : 54 points
       Date   : 2024-02-06 10:10 UTC (12 hours ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | nostrebored wrote:
       | My question when looking at Kubernetes for small teams is always
       | the same. Why?
       | 
       | In the blog, there are multiple days of downtime, a complete
       | cluster rebuild, a description of how individual experts have to
       | be crowned as the technology is too complex to jump in and out of
       | in any real production environment, handling versioning of helm
       | and k8s, a description of managing the underlying scripts to
       | rebuild for disaster (I'm assuming there's a data
       | persistence/backup step here that goes unmentioned!), and on, and
       | on and on.
       | 
       | When you're already using cloud primitives, why not use your
       | existing expertise there, their serverless offerings, and learn
       | the IaC tooling of choice for that provider?
       | 
        | Yes, it will be more expensive on your cloud bill. But when you
        | measure the TCO, is it really?
        
         | belval wrote:
         | Especially considering that the author seems to be using some
         | Azure specific features anyway:
         | 
         | > While being vendor-agnostic is a great idea, for us, it came
         | with a high opportunity cost. After a while, we decided to go
         | all-in on AKS-related Azure products, like the container
         | registry, security scanning, auth, etc. For us, this resulted
         | in an improved developer experience, simplified security (
         | centralized access management with Azure Entra Id), and more,
         | which led to faster time-to-market and reduced costs (volume
         | benefits).
        
         | teaearlgraycold wrote:
         | We're starting to use k8s as a small team because the simpler
         | offerings with GPUs available don't meet our needs. It's clear
         | they're either built for someone else or are less reliable than
         | an EKS cluster would be.
        
         | liveoneggs wrote:
         | k8s is half-baked at best but people enjoy copy-paste yaml
         | recipes, which half-baked products lend themselves to, so it is
         | loved
        
           | auspiv wrote:
           | I work for a US subsidiary of a very large oil company. We
           | are migrating from Azure to AWS for many things (it is deemed
           | "OneCloud"). A very large number of our new EC2 instances,
           | and even our EKS instances, were provisioned within the last
            | 6 months as T2 instances. Some, if we were lucky, were T3. T2
            | was released 10 years ago, T3 nearly six. Copy + paste indeed.
        
         | menschmanfred wrote:
         | Our setup works very very well.
         | 
          | And in smaller setups you would have a shared cluster or a fully
          | managed one like GKE, etc.
        
         | cortesoft wrote:
         | Do people try to push it that strongly for small teams? Lots of
         | us work on bigger teams and enjoy more of the benefits.
         | 
         | However, I also still use Kubernetes for my personal projects,
         | because I really appreciate the level of abstraction it
          | supplies. Everyone always points out that you can do all the
          | things k8s does in other ways, but what I like is that it
          | defines a common way to do everything. I don't care that there
          | are 50 ways to do it, I just like having one way.
         | 
         | What this allows is for tools to seamlessly work together. It
         | is trivial to have all sorts of cool functionality with minimal
         | configuration.
        
           | karolist wrote:
           | This. It's the npm install 100 packages and do everything
           | with JS vs Rails arguments all over again.
        
           | politelemon wrote:
           | > Do people try to push it that strongly for small teams?
           | 
           | Yes. You have to understand that a lot of people without the
           | benefit of experience will often base their technology
           | choices on blog posts. K8S has a lot of mindshare and blog
           | attention, so it gets seen as the only way to run a container
           | in a production environment, while all the important aspects
           | of it are ignored.
        
           | datadeft wrote:
           | > because I really appreciate the level of abstraction it
           | supplies
           | 
           | which are?
           | 
            | I am seriously asking. I use docker-compose for some of the
            | things I do, but it never occurred to me during my 20 years in
            | systems engineering that k8s offers any kind of great
            | abstraction. For small systems it is easy to use docker (for
            | example running a database for testing). For larger projects
            | there are so many alternatives to k8s that are better,
            | including the major cloud vendor offerings, that I have a
            | really hard time justifying even considering k8s. And that is
            | after years of carnage, seeing failure after failure, with
            | customers reaching out to me in panic because there are
            | timeouts or other issues that nobody can resolve, after
            | someone sold them the idea that k8s has a "great level of
            | abstraction" and put it into production.
           | 
           | > I don't care that there are 50 ways to do it, I just like
           | having one way.
           | 
           | Seeing everything as a nail...
        
             | cortesoft wrote:
             | >> because I really appreciate the level of abstraction it
             | supplies
             | 
             | > which are?
             | 
             | When I am creating a new service/application, I just need
             | to define in my resource what I need... listening ports,
             | persistent storage, CPU, memory, ingress, etc... then I am
             | free to change how those are provided without having to
             | change the app. If a new, better, storage provider comes
             | along, I can switch it out without changing anything on my
             | app.
             | 
             | At my work, we have on premise clusters as well as cloud
             | clusters, and I can move my workloads between them
             | seamlessly. In the cloud, we use EBS backed volumes, but my
             | app doesn't need to care. On the on-prem clusters, we use
             | longhorn, but again my app doesn't care. In AWS, we use the
             | ELB as our ingress, but my app doesn't care... on prem, I
             | use metallb, but my app doesn't care.
             | 
             | I just specify that I need a cert and a URL, and each
             | cluster is set up to update DNS and get me a cert. I don't
             | have to worry about DNS or certs expiring. When I deploy my
             | app to a different cluster, that all gets updated
             | automatically.
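              | 
              | A minimal sketch of what I mean (the storage class name,
              | issuer, and hostname below are illustrative, not from any
              | particular cluster):
              | 
              |     kubectl apply -f - <<'EOF'
              |     apiVersion: v1
              |     kind: PersistentVolumeClaim
              |     metadata:
              |       name: app-data
              |     spec:
              |       accessModes: ["ReadWriteOnce"]
              |       resources:
              |         requests:
              |           storage: 10Gi
              |       # "gp3"/EBS in the cloud, "longhorn" on-prem;
              |       # the app doesn't care
              |       storageClassName: gp3
              |     ---
              |     apiVersion: networking.k8s.io/v1
              |     kind: Ingress
              |     metadata:
              |       name: app
              |       annotations:
              |         # cert-manager issues/renews the cert per cluster
              |         cert-manager.io/cluster-issuer: letsencrypt
              |     spec:
              |       # backed by an ELB on AWS, MetalLB on-prem
              |       ingressClassName: nginx
              |       rules:
              |       - host: app.example.com
              |         http:
              |           paths:
              |           - path: /
              |             pathType: Prefix
              |             backend:
              |               service:
              |                 name: app
              |                 port:
              |                   number: 80
              |     EOF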
             | 
             | I also get monitoring for free. Prometheus knows how to
             | discover my services and gather metrics, no matter where I
             | deploy. For log processing, when a new tool comes out, I
             | can plug it in with a few lines of configuration.
             | 
             | The kubernetes resource model provides a standard way to
             | define my stuff. Other services know how to read that
             | resource model and interact with it. If I need something
             | different, I can create my own CRD and controller.
             | 
             | I am able to run a database using a cluster controller with
             | my on prem cluster without having to manage individual
             | nodes. Anyone who has run a database cluster manually knows
             | hardware maintenance or failure is a whole thing... with
             | controllers and k8s nodes, I just need to use node drain
             | and my controller will know how to move the cluster members
             | to different nodes. I can update and upgrade the hardware
             | without having to do anything special. Hardware patching is
             | way easier.
             | 
             | The k8s model forces you to specify how your service should
             | handle node failure, and nodes coming in or out are built
             | into the model from the beginning. It forces you to think
             | about horizontal scaling, failover, and maintenance from
             | the beginning, and gives a standard way for it to work.
             | When you do a node drain, every single app deployed to the
             | cluster knows what to do, and the maintainer doesn't have
             | to think about it.
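              | 
              | For example (a sketch; the PodDisruptionBudget numbers and
              | the node name are made up):
              | 
              |     # declare how much disruption the app tolerates...
              |     kubectl apply -f - <<'EOF'
              |     apiVersion: policy/v1
              |     kind: PodDisruptionBudget
              |     metadata:
              |       name: my-app-pdb
              |     spec:
              |       minAvailable: 2
              |       selector:
              |         matchLabels:
              |           app: my-app
              |     EOF
              | 
              |     # ...and every well-behaved app reschedules itself when
              |     # the node is drained for maintenance:
              |     kubectl drain my-node-01 --ignore-daemonsets \
              |       --delete-emptydir-data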
             | 
             | >> I don't care that there are 50 ways to do it, I just
             | like having one way.
             | 
             | > Seeing everything as a nail...
             | 
             | I don't think that is a fair comparison, because you can
             | create CRDs if your model doesn't fit any existing
             | resource. However, even when you create a CRD, it is still
             | a standard resource that hooks into all of the k8s
             | lifecycle management, and you become part of that
             | ecosystem.
        
         | dilyevsky wrote:
          | I would think it depends on the technology requirements more
          | than on the size of the team. If all you need is some variation
          | of the LAMP stack, then you'd probably be better off with a PaaS
          | like Render, Fly, or the like.
        
         | datadeft wrote:
         | Same exact question I ask every single time. We just decided
         | against k8s, again, in 2024. We are going to go with AWS ECS
         | and Azure Container Apps (the infra has to exist in both
         | clouds).
         | 
          | ECS and Container Apps provide all the benefits of k8s without
          | the cons. What we want is to be able to run container (Docker)
          | images with autoscaling and to control which groups of instances
          | can talk to each other. What we do not want to do:
         | 
         | - learn all of the error modes of k8s
         | 
         | - learn all the network modes of k8s
         | 
         | - learn the tooling of k8s (and the pitfalls)
         | 
          | - learn how to embed yaml into yaml the right way (I have seen
          | some of the tools doing this; see the sketch below)
         | 
          | - do upgrades of k8s and figure out what has changed in a
          | backward-incompatible way
         | 
         | - learn how to manage certificates for k8s the right way
         | 
         | - learn how to debug DNS issues in a distributed system
         | (https://github.com/kubernetes/kubernetes/issues/110550 and
         | many more)
         | 
         | I could go on and on but many people and companies figured out
         | the hard way that k8s complexity is not justified.
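          | 
          | To illustrate the "yaml into yaml" point above - a sketch of the
          | usual Helm pattern (not from any particular chart): values get
          | re-serialized into the manifest with toYaml, and the indentation
          | has to be exactly right or the chart quietly renders broken
          | resources.
          | 
          |     # fragment of a typical templates/deployment.yaml
          |           resources:
          |             {{- toYaml .Values.resources | nindent 12 }}
          |           env:
          |             {{- range .Values.extraEnv }}
          |             - name: {{ .name }}
          |               value: {{ .value | quote }}
          |             {{- end }}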
        
         | ManBeardPc wrote:
         | My experience with Kubernetes has been mostly bad. I always see
         | an explosion of complexity and there is something that needs
         | fixing all the time. The knowledge required comes on top of the
         | existing stack.
         | 
         | Maybe I'm biased and just have the wrong kind of projects, but
         | so far everything I encountered could be built with a simple
         | tech stack on virtual or native hardware. A reverse
         | proxy/webserver, some frontend library/framework, a backend,
         | database, maybe some queues/logs/caching solutions on any
         | server Linux distribution. Maintenance is minimal, dirt cheap,
          | no vendor lock-in, and easy to teach. Is everyone building the
          | next Amazon/Netflix/Google and needing to scale to infinity? I
          | feel there is such a huge amount of software and companies that
          | will never require or benefit from Kubernetes.
        
       | betaby wrote:
       | In 2000s we were talking that `snowflake` servers are bad. New
       | generation is re-learning the same with k8s, which can be
       | summarized as 'snowflake k8s clusters are bad'. Fundamentally
       | it's the same problem.
        
         | menschmanfred wrote:
         | It's not.
         | 
          | The control plane is HA, and you can upgrade its nodes one after
          | the other, independently of your workers.
         | 
         | With workers you can do that too.
         | 
         | That feels much less like a snowflake and more like snow.
        
           | betaby wrote:
            | The control plane is only as HA as your certificates, which
            | had expired in the article.
        
             | menschmanfred wrote:
              | K8s introduced auto-rotation surprisingly late.
              | 
              | Even we ran into that issue 5 years ago.
              | 
              | But k8s is still very young, and that problem has been
              | solved for 5-6 years now.
        
         | dilyevsky wrote:
         | are snowflake products bad too?
        
       | zellyn wrote:
       | I'm curious: what do you do for developer environments? Do you
       | have a need to spin up a partial subgraph of microservices, and
       | have them talk to each other while developing against a slice of
       | the full stack?
        
         | Moto7451 wrote:
          | Can't speak for everyone, but I have worked in this environment.
          | It can work fine if you allocate a sub-slice of CPU time (0.1
          | CPU, for example) and small amounts of (overcommitted) memory,
          | and explicitly avoid using it for things that are more easily
          | handled by cloud provider sub-accounts and managed services.
          | I.e., don't force your devs to manage ownCloud or a similar
          | stand-in for S3 - use something first-party to stand in, or S3
          | itself.
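          | 
          | Concretely, per dev workload that looks something like this (the
          | numbers are just an example of the overcommit idea):
          | 
          |     resources:
          |       requests:
          |         cpu: 100m       # the ".1 CPU" slice
          |         memory: 128Mi
          |       limits:
          |         memory: 512Mi   # memory overcommitted across devs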
         | 
         | This doesn't always work and the failure mode of committing to
         | this can be doubling your hosting bill if it won't run locally
         | and densely packed small instances can't handle your app.
        
         | sdwr wrote:
          | That's how we do it - microservices run locally in Tilt, pointed
          | at staging services / DBs for whatever isn't local.
         | 
         | When it works it's great.
        
           | mieubrisse wrote:
           | Can you clarify more about the "when it works"? What are the
           | pain points you're seeing?
        
         | jscheel wrote:
         | It's worked great for us. Every developer runs a dev cluster on
         | their own machines. Services like s3 are transparently replaced
         | with mock versions. We have two builds that can be run, which
         | really just determines which set of helm charts to deploy: the
         | full stack or a lightweight one with just the bare necessities.
        
           | mieubrisse wrote:
           | What did you use for mock versions? Localstack?
        
         | dilyevsky wrote:
         | I would recommend Tilt + kind clusters (via
         | https://github.com/tilt-dev/ctlptl) - minimum headache setup by
         | a large margin and runs well on linux _and_ macs
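          | 
          | Roughly (a sketch of that setup - the names and paths are made
          | up):
          | 
          |     # cluster.yaml - ctlptl creates a kind cluster plus a
          |     # local registry to push dev images to
          |     apiVersion: ctlptl.dev/v1alpha1
          |     kind: Cluster
          |     product: kind
          |     registry: ctlptl-registry
          | 
          |     # Tiltfile - builds the image and applies the manifests,
          |     # rebuilding on file changes
          |     docker_build('my-service', '.')
          |     k8s_yaml('deploy/dev.yaml')
          | 
          |     # then: ctlptl apply -f cluster.yaml && tilt up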
        
       | Bassilisk wrote:
       | Not a native English speaker, but when exactly did "lessons" get
       | replaced by "learnings"?
       | 
       | To me the latter always sounds very unsophisticated.
        
         | ojbyrne wrote:
         | As a native English speaker, in my opinion it's incredibly
         | pretentious.
        
         | teaearlgraycold wrote:
         | Native English speaker - I refuse to use "learnings". It's a
         | ridiculous office-speak word.
        
           | PaulStatezny wrote:
           | Same - just like "asks". "Here's the ask" versus "here's the
           | request".
        
         | dartos wrote:
         | Native English speaker.
         | 
         | I don't think they got replaced. Colloquially they mean the
         | same thing.
         | 
         | Maybe learnings sounds a little more casual and lessons more
         | academic or formal.
        
         | pmontra wrote:
         | Apparently it's a 21st century thing
         | https://en.wiktionary.org/wiki/learnings
        
         | OJFord wrote:
          | Ha, some time this century for sure. To me it's not
          | 'unsophisticated' _exactly_, but it's definitely a certain sort
          | of person - it's the '_Hi team_ - just sharing some _learnings_
          | - please do _reach out_ if you have any questions' sort of
          | corporate speak.
        
         | JasonSage wrote:
         | Not to discredit your experience, but I'm a native English
         | speaker and I've never had the perception that it's
         | unsophisticated. I think they can have a very slightly
         | different connotation from one another, but in a lot of usage I
         | think they're interchangeable.
        
           | ibejoeb wrote:
           | It's corporate-speak. There are all sorts of these things.
           | 
           | Lessons/learnings
           | 
           | Requests/asks
           | 
           | Solutions/solves
           | 
           | Agreement/alignments
           | 
           | It definitely sounds weird if you don't spend a lot of time
           | in that world. It's like they replace the actual noun forms
           | with an oddly cased verb form, i.e., nominalization.
           | 
           | Oh, one of my most hated:
           | 
           | Thoughts/ideations
           | 
           | Jeez...
        
             | packetlost wrote:
             | This is basically saying that it's the opposite of
             | "unsophisticated" but instead corporate _formal_ speak.
        
             | throwaway11460 wrote:
             | And lessons is academy/state school-speak. Can't stand the
             | word. Take your lessons home Ms Teacher, this is a place of
             | business.
        
         | danielvaughn wrote:
         | I'm a native English speaker and I agree, though it could be a
         | regional/cultural thing. It sounds pretty odd to me.
        
         | doctor_eval wrote:
         | I'm a native English speaker and don't use the phrase, but I've
         | always thought that a _lesson_ is something taught, but a
         | _learning_ is something learned. The former does not always
         | imply the latter.
        
         | radicalbyte wrote:
         | I've always assumed that "learnings" was the American English
         | version of "lessons" in English English.
        
           | burkaman wrote:
           | I think it's more Corporate English. I've never heard anyone
           | say it outside of a work meeting.
        
             | radicalbyte wrote:
             | Those are really American though.. like "co-worker", that
             | isn't a word which was used in England. We'd use
             | "colleague". It came from American English as part of the
             | corporate lingo.
        
         | karlshea wrote:
         | It's very much just bro corporate speak, if I heard someone use
         | "learnings" instead of "lessons" irl they would definitely fall
         | into the slot for a specific type of person in my head. Very
         | LinkedIn.
        
         | niam wrote:
         | I feel like the word unambiguously describes exactly what it
         | is, which is all I can really ask for from a word.
         | 
         | "Lesson" by itself might connote a more concrete transmission
         | of knowledge (like a school lesson). Which is a meaningful
         | distinction if the goal of the article is merely to muse about
          | lessons _they've_ learned rather than imply that this is a
          | lesson from the writers to the audience. "Lesson learned" could
          | imply the same thing, but is longer to say ¯\_(ツ)_/¯
         | 
         | I get what the comments here are saying about it sounding
         | corporate, but I think this is a unique situation where this
         | word actually makes sense.
        
         | youngtaff wrote:
         | It's a bloody American thing... Lessons FTW... uses less
         | characters too
        
       | louwrentius wrote:
        | What I really miss in articles like this - and I understand why,
        | to some degree - is what the actual numbers would be.
        | 
        | Admitting that you need at least two full-time engineers working
        | on Kubernetes, I wonder how that kind of investment pays for
        | itself, especially given all the added complexity.
       | 
       | I desperately would like to rebuild their environment on regular
       | VMs, maybe not even containerized and understand what the
       | infrastructure cost would have been. And what the maintenance
       | burden would have been as compared to kubernetes.
       | 
       | Maybe it's not about pure infrastructure cost but about the
       | development-to-production pipeline. But still.
       | 
        | There is just so much context that seems relevant to understanding
        | whether an investment in Kubernetes is warranted or not.
        
         | karolist wrote:
          | k8s is simply a set of bullet-proof ideas for running
          | production-grade services, enforcing "hope is not a strategy" as
          | much as possible: it standardises things like rollouts, rolling
          | restarts, canary deployments, failover, etc. You can replicate
          | that with a zoo of loosely coupled products, but a single
          | platform you can hire for, with an impeccable production record
          | and industry certs, will always be preferable to orgs. It's
          | Google's way of fighting cloud vendor lock-in from when they saw
          | they were losing market share to AWS. Only large companies
          | really need it; a small 5-person startup will do just fine on a
          | Digital Ocean VPS with some S3 for blob storage and a CDN cache.
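          | 
          | The standardisation shows in how little you have to write to get
          | those behaviours - a rough sketch (names and values are
          | illustrative):
          | 
          |     apiVersion: apps/v1
          |     kind: Deployment
          |     metadata:
          |       name: app
          |     spec:
          |       replicas: 3
          |       selector:
          |         matchLabels:
          |           app: app
          |       strategy:
          |         type: RollingUpdate
          |         rollingUpdate:
          |           maxUnavailable: 0   # never drop below capacity
          |           maxSurge: 1         # roll one new pod at a time
          |       template:
          |         metadata:
          |           labels:
          |             app: app
          |         spec:
          |           containers:
          |           - name: app
          |             image: registry.example.com/app:v2
          |             readinessProbe:   # gates the rollout on health
          |               httpGet:
          |                 path: /healthz
          |                 port: 8080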
        
         | jurschreuder wrote:
        | This is always my exact thought with k8s.
         | 
         | Why not just have auto-scaling servers with a CI/CD pipeline.
         | 
         | Seems so much easier and more convenient.
         | 
         | But I guess developers are just always drawn to complexity.
         | 
         | It's in their nature that's why they became developers in the
         | first place.
        
       | biggestlou wrote:
       | Can we please put the term "learnings" to rest?
        
         | chunha wrote:
         | I don't see the problem with it tbh
        
         | holmb wrote:
         | The OP is Swedish.
        
         | geodel wrote:
         | I see no problem with leveraging best-of-breed terms like
         | learning.
        
       | 0xbadcafebee wrote:
       | You aren't a real K8s admin until your self-managed cluster
       | crashes hard and you have to spend 3 days trying to
       | recover/rebuild it. Just dealing with the certs once they start
       | expiring is a nightmare.
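        | 
        | On kubeadm-built clusters at least, the cert situation is
        | checkable and renewable before it bites (a sketch; other
        | installers like kube-aws or kops have their own mechanisms):
        | 
        |     kubeadm certs check-expiration
        |     kubeadm certs renew all
        |     # then restart kube-apiserver, controller-manager,
        |     # scheduler and etcd so they pick up the new certs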
       | 
       | To avoid chicken-and-egg, your critical services (Drone, Vault,
       | Bind) need to live outside of K8s in something stupid simple,
       | like an ASG or a hot/cold EC2 pair.
       | 
       | I've mostly come to think of K8s as a development tool. It makes
       | it quick and easy for devs to mock up a software architecture and
       | run it anywhere, compared to trying to adopt a single cloud
       | vendor's SaaS tools, and giving devs all the Cloud access needed
       | to control it. Give them access to a semi-locked-down K8s cluster
       | instead and they can build pretty much whatever they need without
       | asking anyone for anything.
       | 
       | For production, it's kind of crap, but usable. It doesn't have
       | any of the operational intelligence you'd want a resilient
       | production system to have, doesn't have real version control,
       | isn't immutable, and makes it very hard to identify and fix
       | problems. A production alternative to K8s should be much more
       | stripped-down, like Fargate, with more useful operational
       | features, and other aspects handled by external projects.
        
         | throwboatyface wrote:
         | Honestly in this day and age rolling your own k8s cluster is
         | negligent. I've worked at multiple companies using EKS, AKS,
         | GKE, and we haven't had 10% of the issues I see people
         | complaining about.
        
           | jauntywundrkind wrote:
           | Once your team has upgrades down, everything is pretty rote.
           | This submission (Urbit, lol) seemed particularly incompetent
           | at managing cert rotation.
           | 
            | The other capital lesson here? Have backups. The team couldn't
            | restore a bunch of their services effectively, because they
            | didn't have the manifests. Sure, a managed provider may have
            | fewer disruptions/avoid some fuckups, but the whole point of
            | Kubernetes is Promise Theory, is Desired State Management. If
            | you can re-state your asks, put the manifests back, most shit
            | should just work again, easy as that. The team had seemingly
            | no operational system, so their whole cluster was a vast
            | special pet. They fucked up. Don't do that.
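            | 
            | Even without a proper backup tool like Velero, a dumb cron job
            | that dumps the declared state goes a long way (a sketch; the
            | resource list is abbreviated):
            | 
            |     # snapshot the desired state so it can be re-stated later
            |     for kind in deployments statefulsets daemonsets services \
            |                 ingresses configmaps; do
            |       kubectl get "$kind" -A -o yaml > "backup-$kind.yaml"
            |     done
            |     # secrets and CRDs need the same treatment, plus etcd and
            |     # volume backups for the actual data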
        
           | dilyevsky wrote:
            | I've had my fair share of outages on managed k8s solutions.
            | The difference there is that once it's hosed, your fate is
            | 100% in the hands of cloud support, and well... good luck with
            | that one.
        
         | nvarsj wrote:
         | [delayed]
        
         | vundercind wrote:
         | In the bad old days of self-managing some servers with a few
         | libvirt VMs and such, I'd have considered a 3-day outage such a
         | shockingly bad outcome that I'd have totally reconsidered what
         | I was doing.
         | 
          | And k8s is supposed to make that situation _better_, but these
          | multi-day outage stories are... common? Why are we adding all
          | this complexity and cost if the result is consumer-PC-tower-in-
          | a-closet-with-no-IaC uptime (or worse)?
        
       | datadeft wrote:
        | This is insane:
        | 
        |       The Root CA certificate, etcd certificate, and API server
        |       certificate expired, which caused the cluster to stop
        |       working and prevented our management of it. The support to
        |       resolve this, at that time, in kube-aws was limited. We
        |       brought in an expert, but in the end, we had to rebuild the
        |       entire cluster from scratch.
       | 
        | I can't even imagine how I would explain such an outage to any of
        | my customers.
        
         | bdangubic wrote:
         | "us-east-1 was down" :)
        
           | datadeft wrote:
           | If most infra I worked on was a single region one, sure. :)
           | DR is so much easier in the cloud. You can have ECS scale to
           | 0 in the DR site and when us-east-1 goes down just move the
           | traffic there. We did that with amazon.com before AWS even
           | existed. With AWS it became easier. There are still some
           | challenges, like having a replica of the main SQL db if you
           | run a traditional stack for example.
        
         | dilyevsky wrote:
          | Just in the last couple of years I can recall DataDog being down
          | for most of a day and Roblox taking something like a 72h outage.
         | If huge public companies managed, you probably can too. I'd
         | argue that unless real monetary damage was done it's actually
         | worse for the customer to experience many small-scale outages
         | than a very occasional big outage.
        
           | geodel wrote:
            | Well, the industry analysts and consultants who develop
            | _metrics_ have decided that multiple outages are the way to
            | go, as they keep people on their toes more often. And
            | management likes busy people, as they are earning their keep.
        
       | dilyevsky wrote:
       | > During our self-managed time on AWS, we experienced a massive
       | cluster crash that resulted in the majority of our systems and
       | products going down. The Root CA certificate, etcd certificate,
       | and API server certificate expired, which caused the cluster to
       | stop working and prevented our management of it. The support to
       | resolve this, at that time, in kube-aws was limited. We brought
       | in an expert, but in the end, we had to rebuild the entire
       | cluster from scratch.
       | 
       | That's crazy, I've personally recovered 1.11-ish kops clusters
       | from this exact fault and it's not that hard when you really
       | understand how it works. Sounds like a case of bad "expert"
       | advice.
        
       | therealfiona wrote:
       | If anyone has any tips on keeping up with control plane upgrades,
       | please share them. We're having trouble keeping up with EKS
       | upgrades. But, I think it's self-inflicted and we've got a lot of
       | work to remove the knives that keep us from moving faster.
       | 
       | Things on my team's todo list (aka: correct the sins that
       | occurred before therealfiona was hired):
       | 
        | - Change manifest files over to Helm. (Managing thousands of
        | lines of yaml sucks; don't do it. Use Helm or some similar tool we
        | have not discovered yet.)
        | 
        | - Set up Renovate to help keep Helm chart versions up to date (see
        | the sketch below).
        | 
        | - Continue improving our process, because there was none as of 2
        | years ago.
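        | 
        | For the Renovate piece, something along these lines is what we're
        | aiming at (a sketch, not our actual config - the helmv3 and
        | helm-values managers cover chart dependencies and image tags in
        | values files):
        | 
        |     {
        |       "extends": ["config:recommended"],
        |       "packageRules": [
        |         {
        |           "matchManagers": ["helmv3", "helm-values"],
        |           "groupName": "helm chart updates"
        |         }
        |       ]
        |     }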
        
         | throwboatyface wrote:
          | IME EKS version upgrades are pretty painless - AWS even has a
          | tool that tells you whether any of your resources would be
          | affected by an upcoming change.
        
           | raffraffraff wrote:
           | It's not the EKS upgrade part that's a pain, it's the
           | deprecated K8S resources that you mention. Layers of
           | terraform, fluxcd, helm charts getting sifted through and
           | upgraded before the EKS upgrade. You get all your clusters
           | safely upgraded, and in the blink of an eye you have to do it
           | all over again.
        
         | catherinecodes wrote:
         | This is definitely a hard problem.
         | 
         | One technique is to never upgrade clusters. Instead, create a
         | new cluster, apply manifests, then point your DNS or load
         | balancers to the new one.
         | 
         | That technique won't work with every kind of architecture, but
         | it works with those that are designed with the "immutable
         | infrastructure" approach in mind.
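          | 
          | In shell terms the whole upgrade collapses to roughly this (a
          | sketch, assuming EKS and eksctl; the names are made up):
          | 
          |     eksctl create cluster -f cluster-1-29.yaml
          |     kubectl --context admin@cluster-1-29 apply -f manifests/
          |     # smoke-test the new cluster, then repoint DNS / the load
          |     # balancer at it and retire the old one
          |     eksctl delete cluster --name cluster-1-28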
         | 
         | There's a good comment in this thread about not having your
         | essential services like vault inside of kubernetes.
        
           | jauntywundrkind wrote:
           | This indeed seems like _The Way_ but I have no idea how it
           | works when storage is involved. How do Rook or any other
           | storage providers deal with this?
           | 
           | If Kubernetes is _only_ for stateless services, well, that 's
           | much less useful for the org to invest in.
        
       | doctor_eval wrote:
       | I'm in the very unusual situation of being tasked to set up a
       | self-sufficient, local development team for a significant
       | national enterprise in a developing country. We don't have AWS,
       | Google or any other cloud service here, so getting something
       | running locally, that they can deploy code to, is part of my job.
       | I also want to ensure that my local team is learning about modern
       | engineering environments. And there is a large mix of unrelated
       | applications to build, so a monolith of some sort is out of the
       | question; there will be a mix of applications and languages and
       | different reliability requirements.
       | 
       | In a nutshell, I'm looking for a general way to provide compute
       | and storage to future, modern, applications and engineers, while
       | at the same time training them to manage this themselves. It's a
       | medium-long term thing. The scale is already there - one of our
       | goals is to replace an application with millions of existing
       | users.
       | 
       | Importantly, the company wants us to be self sufficient. So a
       | RedHat contract to manage an OpenShift cluster won't fly
       | (although maybe openshift itself will?)
       | 
       | For the specific goals that we have, the broad features of
       | Kubernetes fit the bill - in terms of our ability to launch a set
       | of containers or features into a cluster, run CICD, run tests,
       | provide storage, host long- and short lived applications, etc.
       | But I'm worried about the complexity and durability of such a
       | complex system in our environment - in the medium term, they need
       | to be able to do this without me, that's the whole point. This
       | article hasn't helped me feel better about k8s!
       | 
       | I personally avoided using k8s until the managed flavours came
       | about, and I'm really concerned about the complexity of deploying
       | this, but I think some kind of cluster management system is
       | critical; I don't want us to go back to manually installing
       | software on individual machines (using either packaging or just
       | plain docker). I want there to be a bunch of resources that we
       | can consume or grow as we become more proficient.
       | 
       | I've previously used Nomad in production, which was much simpler
       | than K8s, and I was wondering if this or something else might be
       | a better choice? How hard is k8s to set up _today_? What is the
       | risk of the kind of failures these guys hit, _today_?
       | 
       | Are there any other environments where I can manage a set of
       | applications on a cluster of say 10 compute VMs? Any other
       | suggestions?
       | 
       | Without knowing a lot about their systems, I suspect something
       | like Oxide might be the best bet for us - but I doubt we have the
       | budget for a machine like that. But any other thoughts or ideas
       | would be welcome.
        
         | geodel wrote:
         | Well Amazon CEO himself said, there is no shortcut to
         | experience. I am sure gaining experience in developing
         | infrastructure solution will give you respectable return in
         | long term. Of course Cloud vendors will be happy to sell
         | turnkey solutions to you though.
        
         | steveklabnik wrote:
         | (I work at Oxide)
         | 
         | > I doubt we have the budget for a machine like that.
         | 
         | Before even thinking about budget,
         | 
         | > for a significant national enterprise in a developing
         | country.
         | 
         | I suspect we just aren't ready to sell in your country,
         | whatever it is, for very normal "gotta get the product
         | certified through safety regulations" kinds of reasons. We will
         | get there eventually.
         | 
         | buuuuut also,
         | 
         | > Are there any other environments where I can manage a set of
         | applications on a cluster of say 10 compute VMs? Any other
         | suggestions?
         | 
         | Oxide would give you those VMs, but if you want orchestration
         | with them, you'd be running kubes or whatever else, yourself,
         | on top of it. So I don't think our current product would give
         | you exactly what you want anyway, or at least, you'd be in the
         | same spot you are now with regards to the orchestration layer.
        
       | aguacaterojo wrote:
       | Very similar story for my team, incl. the 2x cert expiry cluster
       | disasters early on requiring a rebuild. We migrated from
       | Kubespray to kOPs (with almost no deviations from a default
       | install) and it's been quite smooth for 4 or 5 years now.
       | 
       | I traded ELK for Clickhouse & we use Fluentbit to relay logs,
       | mostly created by our homegrown opentelemetry-like lib. We still
       | use Helm, Quay & Drone.
       | 
       | Software architecture is mostly stateless replicas of ~12x mini
       | services with a primary monolith. DBs etc sit off cluster. Full
       | cluster rebuild and switchover takes about 60min-90min, we do it
       | about 1-2x a year and have 3 developers in a team of 5 that can
       | do it (thanks to good documentation, automation and keeping our
       | use simple).
       | 
       | We have a single cloud dev environment, local dev is just running
       | the parts of the system you need to affect.
       | 
       | Some tradeoffs and yes burned time to get there, but it's great.
        
       ___________________________________________________________________
       (page generated 2024-02-06 23:00 UTC)