[HN Gopher] A Pipeline Made of Airbags
       ___________________________________________________________________
        
       A Pipeline Made of Airbags
        
       Author : packetlost
       Score  : 184 points
       Date   : 2024-09-05 14:11 UTC (4 days ago)
        
 (HTM) web link (ferd.ca)
 (TXT) w3m dump (ferd.ca)
        
       | swiftcoder wrote:
        | It's a real shame that we are steadily losing all the lessons
        | of Erlang/Smalltalk/Lisp machines.
        
         | nine_k wrote:
         | What are the specific lessons worth preserving, but being lost?
         | 
         | (I assume that "keep an image, it's too costly to rebuild
         | everything from version-controlled sources" is not such a
         | lesson.)
        
           | igouy wrote:
            | Yes, specifics would be better.
           | 
           | Of course "keep an image" and "version-controlled sources"
           | are not mutually exclusive.
           | 
            | https://www.google.com/books/edition/Mastering_ENVY_Develope...
        
           | swiftcoder wrote:
           | The biggest common lesson is being able to
           | inspect/interrogate/modify the running system. Debugging
           | distributed system failures purely based on logs/metrics
           | output is not a particularly pleasant job, and most immutable
            | software stacks don't offer a lot more than that.
           | 
           | However, for Erlang specifically, the lesson is pushing
           | statelessness as far down the system as you possibly can.
           | Stateless immutable containers that we can kill at will are
           | great - but what if we could do the same thing at the request
           | handler level?
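            | 
            | A minimal sketch of what that looks like (hypothetical
            | module, not from the article): one lightweight process per
            | request, so a crash takes out only that request's state,
            | never the node.
            | 
            |     -module(req_handler).
            |     -export([handle/1]).
            | 
            |     handle(Req) ->
            |         %% run the handler in its own monitored process
            |         {Pid, Ref} = spawn_monitor(fun() ->
            |             exit({ok, do_work(Req)})
            |         end),
            |         receive
            |             {'DOWN', Ref, process, Pid, {ok, Result}} ->
            |                 {ok, Result};
            |             {'DOWN', Ref, process, Pid, Reason} ->
            |                 {error, Reason}  %% crash stays contained
            |         after 5000 ->
            |             exit(Pid, kill),
            |             {error, timeout}
            |         end.
            | 
            |     do_work(Req) -> Req. %% placeholder request logic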
        
       | dools wrote:
       | Ha, I recently wrote a system that does more or less the exact
       | same thing for pushing updates to IoT devices. I can tell the
       | system to update particular nodes to a given git commit, then I
       | can roll it out to a handful of devices, then I can say "update
       | all of them" but wait 30 seconds in between each update and so
       | on.
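        | 
        | The core of it is just a throttled loop; something like this
        | sketch (Erlang here only for illustration, and push_update/2
        | is an invented stand-in for the real transport):
        | 
        |     %% roll a git commit out to devices, pausing between each
        |     rollout(Commit, Devices, DelaySecs) ->
        |         lists:foreach(fun(Dev) ->
        |             ok = push_update(Dev, Commit), %% hypothetical
        |             timer:sleep(DelaySecs * 1000)
        |         end, Devices).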
        
       | jamesblonde wrote:
        | Joe would be turning in his grave if he knew where the
        | industry is right now on the k8s love-bomb.
        
         | p_l wrote:
          | The real issue is that the languages most people use do not
          | reliably support another approach.
         | 
         | k8s is not the issue. Worse Is Better languages and runtimes
         | are.
        
       | Sebb767 wrote:
       | The big thing about immutable infrastructure is that it is
       | reproducible. I've seen both worlds and I do appreciate the
       | simplicity and quickness of the upgrade solution presented in the
       | post. The problem with this manual approach is that it is quite
       | easy to end up with a few undocumented fixes/upgrades/changes to
       | your pet server and suddenly upgrading or even just rebooting the
       | servers/app becomes something scary.
       | 
       | Now, for immutable infrastructure, you have a whole different set
       | of problems. All your changes are nicely logged in git, but to
       | deploy you need to rebuild containers and roll them out over a
       | cluster. To do this smoothly, the cluster also needs to have some
       | kind of high availability setup, making everything quite complex
       | and, in the end, you wasted minutes to hours of compute for
       | something that a pet setup can do in a few seconds. But you can
       | be sure that a server going down or a reboot are completely safe
       | operations.
       | 
       | What works for you really depends on your situation (team size,
       | importance of the app, etc.), but both approaches do have their
       | uses and reducing the immutable infra approach to "people run k8s
       | because it's hip" misses the point.
        
         | swiftcoder wrote:
         | > undocumented fixes/upgrades/changes to your pet server and
         | suddenly upgrading or even just rebooting the servers/app
         | becomes something scary.
         | 
          | You can mostly prevent this by mandating that fresh nodes
          | come up regularly. Have a management process that keeps a
          | rolling window of ~5% of your fleet in connection-drain,
          | and replaces each node as soon as it's down to a handful
          | of connections.
         | 
         | Whole fleet is replaced every ~3 weeks, you learn about any
         | deployment/startup failures within one day of new code landing
         | in trunk, minimal disruption to client connections.
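          | 
          | The management process is conceptually one small control
          | loop; a hypothetical sketch (every helper here is invented):
          | 
          |     drain_loop(Fleet, DrainPct, MinConns) ->
          |         Draining = [N || N <- Fleet, is_draining(N)],
          |         Quota = max(1, length(Fleet) * DrainPct div 100),
          |         %% top up the drain window with the oldest nodes
          |         Next = pick_oldest(Fleet -- Draining,
          |                            max(0, Quota - length(Draining))),
          |         [start_drain(N) || N <- Next],
          |         %% recycle nodes once they're nearly idle
          |         [replace_node(N) || N <- Draining,
          |                             conn_count(N) =< MinConns],
          |         timer:sleep(60000),
          |         drain_loop(current_fleet(), DrainPct, MinConns).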
        
           | turtlebits wrote:
           | This doesn't prevent anything, it just schedules possible
           | breakages because your infra isn't 100% immutable.
           | 
           | IME, this doesn't work because companies won't implement/will
           | deprioritize any infra changes that impact the development
           | cycle.
        
             | swiftcoder wrote:
             | It's not so very different to dropping your PR into any
             | other automated-CI/CD-all-the-way-to-prod pipeline.
             | 
             | Albeit maybe a little easier to justify to management that
             | you are dropping everything to fix the breakage when your
             | CI/CD pipeline stops.
        
         | packetlost wrote:
         | > Now, for immutable infrastructure, you have a whole different
         | set of problems. All your changes are nicely logged in git, but
         | to deploy you need to rebuild containers and roll them out over
         | a cluster.
         | 
         | The real issue is that it effectively forces externalizing
         | nearly all state. On the surface, this seems like it's just a
         | good thing, but if you think about the limitations and
         | complexity it creates, it starts seeming less unquestionably
         | good. Sometimes that complexity is warranted, but very
         | frequently it is not.
         | 
          | That being said, I think modifying code in a running system
          | without pretty strict procedures/controls around it is...
          | dangerous. More than a couple of times I've seen hotfixes
          | get dropped/forgotten because they only existed on the
          | running system and not in source control.
        
         | toast0 wrote:
         | > The big thing about immutable infrastructure is that it is
         | reproducible. I've seen both worlds and I do appreciate the
         | simplicity and quickness of the upgrade solution presented in
         | the post. The problem with this manual approach is that it is
         | quite easy to end up with a few undocumented
         | fixes/upgrades/changes to your pet server and suddenly
         | upgrading or even just rebooting the servers/app becomes
         | something scary.
         | 
          | The point is not really automation vs manual. Hot loading
          | is amenable to automation too. The point is really that
          | when you replace immutable servers that hold state with
          | another set, there's a lengthy process to migrate that
          | state. If you can mutate the servers, you save a lot of
          | wall clock time, a lot of server cpu time, and a lot of
          | client cpu time.
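          | 
          | Hot loading really is easy to script; a minimal sketch that
          | pushes an already-compiled module out to a set of nodes:
          | 
          |     load_everywhere(Module, Nodes) ->
          |         %% grab the local .beam and load it on each node
          |         {Mod, Bin, File} = code:get_object_code(Module),
          |         [rpc:call(N, code, load_binary, [Mod, File, Bin])
          |          || N <- Nodes].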
         | 
         | I deal with this issue at my current job. I used to work in
         | Erlang and it took a couple minutes to push most changes to
         | production. Once I was ready to move to production, it was less
         | than 30 minutes to prepare, push, load, verify and move on with
         | my life. I could push follow ups right away, or wrap up several
         | issues, one at a time, in a single day. Coming from PHP was
         | pretty similar, with caveats about careful replacement of files
         | (to avoid serving half a PHP file) and PHP caching.
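          | 
          | (The half-a-file caveat has a standard fix: write the new
          | file next to the old one, then rename it into place, which
          | is atomic on POSIX filesystems, so readers never see a
          | partially written file. Sketch:)
          | 
          |     swap(Path, NewBytes) ->
          |         Tmp = Path ++ ".tmp",
          |         ok = file:write_file(Tmp, NewBytes),
          |         ok = file:rename(Tmp, Path). %% atomic rename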
         | 
         | Now I work with Rust, terraform, and GCP; it takes about 12
         | minutes for CI to build production builds, it takes terraform
         | at least 15 minutes to build a new production version
         | deployment, and several more minutes for it to actually finish
          | coming up; only then can I _start_ to move traffic, and the
         | traffic takes a long time to fully move, so I have to come back
          | the next day to tear down the old version. I won't typically
         | push a follow up right away, because then I've got three
         | versions running. I can't push multiple times a day. If I'm
         | working many small issues, everything has to be batched into
         | one release, or I'll be spending way too much of my time doing
         | deploys, and the deployment process will be holding back
         | progress.
        
         | fiddlerwoaroof wrote:
         | The funny thing here is BEAM is "immutable infrastructure as a
         | programming language environment" which, to me, is strictly
         | superior to the current disjunction between "infrastructure
         | configuration" and "application code".
         | 
         | Erlang defaults to pure code and every actor is like a little
         | microservice with good tooling for coordination. There are
         | mutable aspects like a distributed database, but nothing all
         | that different from the mutable state that exists in every
         | "immutable infrastructure" deployment I've seen.
        
       | wpietri wrote:
       | What a lovely and well-written piece. I think the dev vs ops
       | divide has caused so many problems like this. We just write
       | systems differently when we have to run things versus when it
       | gets thrown over the wall to other people to deal with.
       | 
       | Maybe that sounds like I'm blaming developers, but I'm not. I
       | think this is rooted in management theories of work. They
       | optimize for simple top-down understanding, not cross-functional
        | collaboration. If people are rewarded for keeping to an over-
        | optimistic managerial plan (or keeping up "velocity"), then
        | they're mostly going to throw things over the wall.
        
       | from-nibly wrote:
        | I get all of these complaints. Why do I also have to be an
        | infrastructure engineer? And why is my infrastructure not
        | bespoke enough to do this weird thing I want to do? Why can't
        | I use 5 different languages at this 30 person company?
       | 
        | The thing about immutable infrastructure is that it's
        | straightforward. There is a set of assumptions others can
        | make about your app.
        | 
        | Immutable infrastructure is boring. Deployments are
        | uncreative. That's a good thing.
       | 
       | Repeat after me, "my creative energy should be spent on my
       | customers"
        
         | ActionHank wrote:
          | I think there's more to it than that.
         | 
         | You are correct for 90% of the cases, but this also kills
         | innovation.
        
           | from-nibly wrote:
            | If you want to do infrastructure innovation you are more
            | than welcome to. There are lots of engineers dedicated to
            | it. It's also not that hard to go from software engineer
            | to infrastructure engineer, thus bringing your experience
            | and unique perspective. But working at an SMB or a
            | startup (the 90%) doesn't justify innovation for
            | innovation's sake. 1 acre of corn doesn't justify
            | inventing the combine.
        
             | andiareso wrote:
             | I love that last line. That's the best analogy I've heard.
        
           | marcosdumay wrote:
           | The way to enable fast ops evolution is by creating a small
           | bubble with either a mutable facade or the immutability
           | restrictions disabled, and go innovate there. Once you are
           | ready, you can port the changes to the overall environment.
           | 
           | And the way to do the thing the article complains about is
           | with partial deployments.
           | 
           | Both of those are much better behaved on a large-scale ops
           | than the small-scale counterparts. K8s kinda "supports" both
           | of them, but like almost everything in k8s, it's more work,
           | and there are many foot-guns.
        
         | jtbayly wrote:
         | How is downtime beneficial to the customer?
        
           | dvdkon wrote:
           | Nobody wants downtime, but it's easy to spend too much effort
           | on avoiding it, taking time from actually important
           | development. Plenty of customers don't mind occasional
           | downtime, and it can mean the system is simpler and they get
           | features faster.
        
             | fifilura wrote:
             | I have been there! Duly upvoted!
             | 
              | Giving architects too much power worsens the situation:
              | they have formal responsibility to keep downtime low,
              | but they are also appointed to find technical solutions
              | rather than sometimes technically mundane product
              | improvements.
             | 
             | Also in the worst case, this solution becomes so cool that
             | it attracts the best developers internally, away from
             | building products.
        
         | phkahler wrote:
         | >> Repeat after me, "my creative energy should be spent on my
         | customers"
         | 
         | I agree with you. But from the blog:
         | 
         | >> "Product requirements were changed to play with the adopted
         | tech."
         | 
         | That's when things may have gone too far.
        
         | schmidtleonard wrote:
         | It's "weird" to want low downtime?
         | 
         | The general nastiness of updates is one of the largest customer
         | friction points in many systems, but creative energy should be
         | directed away from fixing it?
         | 
         | Gross.
        
           | roland35 wrote:
           | I think like many things in engineering, it depends?
           | 
           | I'm sure most applications in life benefit from accepting a
           | little downtime in order to simplify development. But there
           | are certainly scenarios where we can use some "high quality
           | engineering" to make downtime as low as possible.
        
         | aziaziazi wrote:
         | Don't forget startup innovation culture: everything has to be
         | disturbed. Encourage with tax exemptions for << innovative >>
         | jobs and you'll have cohorts of engenders reinventing wheels
         | from infra to UX in a glorified innovative modern "industry".
        
         | srpablo wrote:
         | > Repeat after me, "my creative energy should be spent on my
         | customers"
         | 
         | "I should save my energy, so I won't exercise."
         | 
         | "I should save money, so I won't deploy it towards
         | investments."
         | 
         | I don't think "creativity" is a zero-sum, finite resource; I
         | think it's possible to generate more by spending it
         | intelligently. And he pointed out how moving towards immutable
         | infrastructure, while more "standard," directly hurt customers
         | (the engineering team lost deployment velocity and
         | functionality), so it's especially weird to end your comment
         | the way you did.
         | 
         | To say "immutable infrastructure is just more straightforward"
         | so definitively, from the limited information you have, is just
         | you stating your biases. The stateful system he describes the
         | company moving away from may also have been pretty
         | "straightforward" and "boring," just with different fixed
         | points. Beauty in the eye of the beholder and all that.
        
       | thih9 wrote:
       | Would a middle ground be possible? E.g. by default use stateless
       | containers, but for certain stacks or popular app frameworks
       | support automated stateful deploys?
        
         | mononcqc wrote:
          | Two years after writing A Pipeline Made of Airbags, I ended
          | up prototyping a minimal way to do hot code loading from
          | kubernetes instances: generic images plus a sidecar that
          | loads pre-built software releases from a manifest, in a way
          | that works both for cold restarts and for hot code loading:
          | https://ferd.ca/my-favorite-erlang-container.html
         | 
         | It's more or less as close to a middle-ground as I could
         | imagine at the time.
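          | 
          | Heavily simplified (the post has the real details), the
          | sidecar's job boils down to polling the manifest and
          | reloading whatever changed; something shaped like this,
          | where fetch/1 and maybe_reload/2 are invented names:
          | 
          |     poll(ManifestUrl, IntervalMs) ->
          |         {ok, Manifest} = fetch(ManifestUrl), %% hypothetical
          |         [maybe_reload(Mod, Vsn) || {Mod, Vsn} <- Manifest],
          |         timer:sleep(IntervalMs),
          |         poll(ManifestUrl, IntervalMs).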
        
       | specialist wrote:
       | Exactly right. OC hits all the points. Fine granularity of
       | failure, (re)warming caches, faster iteration by reducing cost of
       | changes, etc.
       | 
       | I too lived this. Albeit with "poor man's Erlang" (aka Java). Our
       | customers were hospitals, ERs, etc. Our stack could not go down.
       | And it had to be correct. So sometimes that means manual human
       | intervention.
       | 
        | There's another critical distinction, missed by the "whole
        | freaking docker-meets-kubernetes" herd:
       | 
       | Our deployed systems were "pets". Whereas k8s is meant for
       | "cattle".
       | 
       | Different tools for different use cases.
        
       | turtlebits wrote:
       | A well crafted app is great. It is also complex and generally
       | only maintainable/supportable by those who built it.
        
         | kmoser wrote:
         | If the original devs wrote good documentation then pretty much
         | anybody can maintain it easily.
        
         | michaelteter wrote:
          | A truly well crafted app requires very little maintenance
          | or support, and that maintenance/support has already been
          | thought through and made easy to learn and do.
         | 
         | These things are possible, and they fit economically somewhere
         | in the 3-5 year maturity of a system. Years 1-3 are usually
         | necessarily focused on features and releases, but far too many
         | orgs just stop at that point and aren't willing to invest that
         | extra year or two in work that will save time and money for
         | many years to follow.
         | 
         | I believe this resistance is due to the short-sightedness of
         | buyouts/IPOs or simply leadership churn.
        
       | anothername12 wrote:
       | K8s is a Google thing for Google problems. It's just not needed
       | for most software delivery.
       | 
        | Edit: meanwhile I'm waiting 45 minutes on average for my most
        | recent single-line PR to roll out to the k8s cluster at
        | $massive_company that totally does this like everyone else.
        
         | fragmede wrote:
         | Google doesn't use Kubernetes internally. Kubernetes is a
         | simplified version of Borg. If it's taking 45 minutes to deploy
         | a change, that's on your company's platform team, not Google.
        
       ___________________________________________________________________
       (page generated 2024-09-09 23:01 UTC)