[HN Gopher] We built a self-healing system to survive a concurre...
       ___________________________________________________________________
        
       We built a self-healing system to survive a concurrency bug at
       Netflix
        
       Author : zdw
       Score  : 317 points
       Date   : 2024-11-08 14:52 UTC (5 days ago)
        
 (HTM) web link (pushtoprod.substack.com)
 (TXT) w3m dump (pushtoprod.substack.com)
        
       | coolgoose wrote:
       | One of the things I am grateful for is Kubernetes and its
       | killing of pods.
       | 
       | Had a similar problem, but memory-wise, with a pesky memory
       | leak, and the short-term solution was to do nothing, since
       | instances would get killed and replaced anyway.
        
         | maximinus_thrax wrote:
         | During one of my past gigs, this exact feature hid a huge
         | memory leak in old code that had always run on k8s, which we
         | found out only when we moved some instances to bare metal.
        
           | esprehn wrote:
           | We hit this in a past gig too. One of the big services had a
           | leak, but deployed every 24 hours which was hiding it. When
           | the holiday deploy freeze hit the pods lived much longer than
           | normal and caused an OOM storm.
           | 
           | At first I thought maybe we should add a "hack" to cycle all
           | the pods over 24 hours old, but then I wondered if making
           | holiday freezes behave like normal weeks was really a hack at
           | all or just reasonable predictability.
           | 
           | In the end folks managed to fix the leak and we didn't
           | resolve the philosophical question though.
        
       | ksd482 wrote:
       | This was a nice short read. A simple (temporary) solution, yet a
       | clever one.
       | 
       | How was he managing the instances? Was he using Kubernetes, or
       | did he write some script to manage the auto-termination of the
       | instances?
       | 
       | It would also be nice to know why:
       | 
       | 1. Killing was quicker than restarting. Perhaps because of the
       | business logic built into the java application?
       | 
       | 2. Killing was safe. How was the system architected so that
       | requests weren't dropped altogether?
       | 
       | EDIT: formatting
        
         | jumploops wrote:
         | The author mentions 2011 as the time they switched from REST to
         | RPC-ish APIs, and this issue was related to that migration.
         | 
         | Kubernetes launched in 2014, if memory serves, and it took a
         | bit before widespread adoption, so I'm guessing this was some
         | internal solution.
         | 
         | This was a great read, and harkens back to the days of managing
         | 1000s of cores on bare metal!
        
         | braggerxyz wrote:
         | > It would also be nice to know why:
         | 
         | 1. Killing was quicker than restarting.
         | 
         | If you happen to restart one of the instances with a thread
         | stuck in the infinite loop, you can wait a very long time
         | until the Java container actually decides to kill itself,
         | because it did not finish its graceful shutdown within the
         | allotted timeout period. Some Java containers have a default
         | of 300s for this. In this circumstance kill -9 is faster by a
         | lot ;)
         | 
         | Also, we had circumstances where the affected Java container
         | did not stop even when the timeout was reached, because the
         | misbehaving thread consumed the whole CPU and none was left
         | for the supervisor thread. Then you can only kill the host
         | process of the JVM.
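         | 
         | For illustration, a minimal, hypothetical Java sketch (not
         | from the article) of why a graceful shutdown can take the
         | full timeout: the shutdown hook waits for workers to drain, a
         | thread stuck in a hot loop never finishes, and only an
         | external kill -9 ends the process immediately.
         | 
         |   import java.util.concurrent.ExecutorService;
         |   import java.util.concurrent.Executors;
         |   import java.util.concurrent.TimeUnit;
         | 
         |   public class GracefulShutdownDemo {
         |       public static void main(String[] args) {
         |           ExecutorService workers = Executors.newFixedThreadPool(2);
         | 
         |           // A worker stuck in a hot loop that never checks interrupts.
         |           workers.execute(() -> { while (true) { /* spin */ } });
         | 
         |           Runtime.getRuntime().addShutdownHook(new Thread(() -> {
         |               workers.shutdown();            // stop accepting new work
         |               try {
         |                   // graceful-shutdown budget (e.g. 300s by default)
         |                   if (!workers.awaitTermination(300, TimeUnit.SECONDS)) {
         |                       workers.shutdownNow(); // hot loop ignores this too
         |                   }
         |               } catch (InterruptedException ignored) { }
         |           }));
         |           // A plain SIGTERM now waits out the whole timeout;
         |           // kill -9 does not.
         |       }
         |   }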
        
       | kristiandupont wrote:
       | That was a bit underwhelming compared to what the headline set my
       | expectations up for, but definitely a good idea and neat
       | solution.
        
         | vanjajaja1 wrote:
         | from the headline alone I got linkedin ceo vibe. "Built a Self-
         | Healing System to Survive a Concurrency Bug" is how I could
         | describe wrapping a failing method in a retry loop
        
           | fragmede wrote:
           | Put in a couple more if statements checking the output of
           | rand(), call it AI, and you'll be CEO in no time!
        
       | dhruvrrp wrote:
       | Interesting read; the fix seems to be straightforward, but I'd
       | have a few more questions if I were trying to do something
       | similar.
       | 
       | Is software deployed regularly on this cluster? Does that
       | deployment happen faster than the rate at which they were losing
       | CPUs? Why not just periodically force a deployment, given it's a
       | repeated process that probably already happens frequently.
       | 
       | What happens to the clients trying to connect to the stuck
       | instances? Did they just get stuck/timeout? Would it have been
       | better to have more targeted terminations/full terminations
       | instead?
        
         | nikita2206 wrote:
         | An answer to basically all your questions is: doesn't matter,
         | they did their best to stabilize in a short amount of time, and
         | it worked - that's what mattered.
        
       | rukugu wrote:
       | I like the practicality of this
        
       | est wrote:
       | Reminds me of the famous quote by Rasmus Lerdorf, creator of PHP
       | 
       | > I'm not a real programmer. I throw together things until it
       | works then I move on. The real programmers will say "Yeah it
       | works but you're leaking memory everywhere. Perhaps we should fix
       | that." I'll just restart Apache every 10 requests.
        
         | nicman23 wrote:
         | I'll argue that doing the restart is more important until
         | someone else finds the leak.
        
           | fragmede wrote:
           | Or future me. It hurts on the inside to just kick EC2 every
           | hour because every 61 minutes something goes awry in the
           | process. But the show must go on, so you put in the temporary
           | fix knowing that it's not going to be temporary. Still,
           | weeks/months/years down the line you could get lucky and the
           | problem will go away and you can remove the kludge. But if
           | you're ridiculously lucky, not only will the problem get
           | fixed, but you'll get to understand exactly why the
           | mysterious problem was happening in the first place. Like the
           | gunicorn 500 upgrade bug, or the Postgres TOAST json thing.
           | That sort of satisfaction isn't something money can buy.
           | (Though it will help pay for servers in the interim until you
           | find the bug.)
        
             | nicman23 wrote:
             | or at least after the weekend :P
        
           | morning-coffee wrote:
           | Also uttered by others who thought borrowing money was more
           | important until they could figure out a way to control
           | spending.
        
       | raverbashing wrote:
       | > and to my memory, some calls to ConcurrentHashMap.get() seemed
       | to be running infinitely.
       | 
       | Of course they did. And whoever thought "Concurrent" meant it
       | would work fine gets burned by it. Of course.
       | 
       | And of course it doesn't work properly or intuitively for some
       | very stupid reason. Sigh
        
         | xxs wrote:
         | It has to be an error - it could happen to HashMap, it has
         | never been an issue w/ CHM.
        
           | keeganpoppen wrote:
           | this sounds more like citing chapter and verse in an exegesis
           | than anything of direct relevance to the Mortal Plane...
        
       | kenhwang wrote:
       | My workplace currently has a similar problem where a resource
       | leak can be greatly increased with certain unpredictable/unknown
       | traffic conditions.
       | 
       | Our half-day workaround implementation was the same thing, just
       | cycle the cluster regularly automatically.
       | 
       | Since we're running on AWS, we just double the size of the
       | cluster, wait for the instances to initialize, then rapidly
       | decommission the old instances. Every 2 hours.
       | 
       | It's shockingly stable. So much so that resolving the root cause
       | isn't considered a priority and so we've had this running for
       | months.
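       | 
       | A minimal sketch of that cycling loop, with hypothetical
       | wrapper methods standing in for whatever autoscaling API the
       | cluster actually uses (the names below are made up, not AWS
       | SDK calls):
       | 
       |   import java.time.Duration;
       |   import java.util.concurrent.Executors;
       |   import java.util.concurrent.ScheduledExecutorService;
       |   import java.util.concurrent.TimeUnit;
       | 
       |   public class ClusterCycler {
       |       /** Hypothetical wrapper around the provider's autoscaling API. */
       |       interface Cloud {
       |           int desiredCapacity(String group);
       |           void setDesiredCapacity(String group, int capacity);
       |           void waitUntilInstancesHealthy(String group);
       |           void terminateInstancesOlderThan(String group, Duration age);
       |       }
       | 
       |       /** One cycle: double the group, wait, then drop the old instances. */
       |       static void cycle(Cloud cloud, String group) {
       |           int original = cloud.desiredCapacity(group);
       |           cloud.setDesiredCapacity(group, original * 2);
       |           cloud.waitUntilInstancesHealthy(group);
       |           // Fresh instances are only minutes old; anything older is
       |           // the previous generation.
       |           cloud.terminateInstancesOlderThan(group, Duration.ofMinutes(30));
       |           cloud.setDesiredCapacity(group, original);
       |       }
       | 
       |       /** Run the cycle every 2 hours, as described above. */
       |       static void schedule(Cloud cloud, String group) {
       |           ScheduledExecutorService s =
       |                   Executors.newSingleThreadScheduledExecutor();
       |           s.scheduleAtFixedRate(() -> cycle(cloud, group), 0, 2,
       |                   TimeUnit.HOURS);
       |       }
       |   }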
        
         | gorkempacaci wrote:
         | How about the costs? Isn't this a very expensive bandaid? How
         | is it not a priority? :)
        
           | bratbag wrote:
           | Depends what else it's solving for.
           | 
           | I've seen multiple issues solved like this after engineering
           | teams have been cut to the bone.
           | 
           | If the cost of maintaining enough engineers to keep systems
           | stable for more than 24 hours is more than the cost of
           | doubling the container count, then this is what happens.
        
             | JSDevOps wrote:
             | This. All the domain knowledge has left. This sounds like
             | a hacky workaround at best, and AWS will welcome you with
             | open arms come invoice day.
        
           | kenhwang wrote:
           | Depends on how long it takes for the incoming instances to
           | initialize and outgoing instances to fully decommission.
           | 
           | x = time it takes to switchover
           | 
           | y = length of the cycles
           | 
           | x/y = % increase in cost
           | 
           | For us, it's 15 minutes / 120 minutes = 12.5% increase, which
           | was deemed acceptable enough for a small service.
        
           | toast0 wrote:
           | Shouldn't be too high cost if you only run 2x the instances
           | for a short amount of time. A reasonable use of Cloud, IMHO,
           | if you can't figure out a less disruptive bandaid.
        
             | dochne wrote:
             | AWS charges instances in 1 hour increments - so you're
             | paying 150% the EC2 costs if you're doing this every 2
             | hours
        
               | kenhwang wrote:
               | AWS has been charging by the second since 2017:
               | https://aws.amazon.com/blogs/aws/new-per-second-billing-
               | for-...
        
         | JSDevOps wrote:
         | This sounds terrible
        
           | forkerenok wrote:
           | If you squint hard enough, this is an implementation of a
           | higher order garbage collection:
           | MarkNothingAndSweepEverything.
           | 
           | There, formalized the approach, so you can't call it terrible
           | anymore.
        
             | crabbone wrote:
             | Oh no it isn't. Garbage collector needs to prove that
             | what's being collected is garbage. If objects get collected
             | because of an error... that's not really how you want GC to
             | work.
             | 
             | If you are looking for an apt metaphor, Stalin sort might
             | be more in line with what's going on here. Or maybe
             | "ostrich algorithm".
        
               | zoky wrote:
               | I think it's more like Tech Support Sort, as in "Try
               | turning it off and on again and see if it's sorted".
        
               | mech422 wrote:
               | LOL - I like that one! :-)
        
               | nukethegrbj wrote:
               | >Garbage collector needs to prove that what's being
               | collected is garbage
               | 
               | Some collectors may need to do this, but there are
               | several collectors that don't. EpsilonGC is a prime
               | example of a GC that doesn't need to prove anything.
        
               | crabbone wrote:
               | EpsilonGC is a GC in the same sense as a suitable-size
               | stick is a fully automatic rifle when you hold it to your
               | shoulder and say pew-pew...
               | 
               | I mean, I interpret your comment to be a joke, but you
               | could've made it a bit more obvious for people not
               | familiar with the latest fancy in Java world.
        
           | rakoo wrote:
           | To be fair this is what the BEAM vm structures everything on:
           | If something is wonky, crash it and restart from a known ok
           | state. Except when BEAM does it everyone says it's brilliant
        
             | ElevenLathe wrote:
             | It's one thing to design a crash-only system, and quite
             | another to design a system that crashes all the time but
             | papers over it with a cloud orchestration layer later.
        
         | anal_reactor wrote:
         | I've realized that the majority of engineers lack critical
         | thinking, and are unable to see things beyond their domain of
         | speciality. Arguments like "even when accounting for potential
         | incident, your solution is more expensive, while our main goal
         | is making money" almost never work, and I've been in countless
         | discussions where some random document with "best practices",
         | whatever they are supposed to be, was treated like a sacred
         | scripture.
        
           | MathMonkeyMan wrote:
           | We are dogmatic and emotional, but the temptation to base
           | your opinions on the "deeper theory" is large.
           | 
           | Pragmatically, restart the service periodically and spend
           | your time on more pressing matters.
           | 
           | On the other hand, we fully understand the reason for the
           | fault, but we don't know exactly where the fault is. And it
           | is, our fault. It takes a certain kind of discipline to say
           | "there are many things I understand but don't have the time
           | to master now, let's leave it."
           | 
           | It's, mostly, embarrassing.
        
             | keeganpoppen wrote:
             | "certain kind" of discipline, indeed... not the good kind.
             | and while your comment goes to great pains to highlight how
             | that particular God is dead (and i agree, for the record),
             | the God of Quality (the one that Pirsig goes to great
             | lengths to not really define) toward which the engineer's
             | heart of heart prays that lives within us all is...
             | unimpressed, to say the least.
        
               | raverbashing wrote:
               | Sure, you worship the God of Quality until you realize
               | that memory leak is being caused by a 3rd party library
               | (extra annoying when you could have solved it yourself)
               | or a quirky stdlib implementation
               | 
               | Then you realize it's a paper idol and the best you can
               | do is suck less than the average.
               | 
               | Thanks for playing Wing Commander!
        
               | mech422 wrote:
               | >> Thanks for playing Wing Commander!
               | 
               |  _captain america voice_ I got that reference :-)
        
               | c0balt wrote:
               | > "certain kind" of discipline, indeed... not the good
               | kind.
               | 
               | Not OP but this is a somewhat normal case of making a
               | tradeoff? They aren't able to repair it at the moment (or
               | rather don't want/can't allocate the time for it) and
             | instead trade their resource usage for stability and
               | technical debt.
        
           | keeganpoppen wrote:
           | that's because the judge(s) and executioner(s) aren't
           | engineers, and the jury is not of their peers. and for the
           | record i have a hard time faulting the non-engineers above
           | so-described... they are just grasping for things they can
           | understand and have input on. who wouldn't want that? it's
           | not at all reasonable for the keepers of the pursestrings to
           | expect a certain amount of genuflection by way of self-
           | justification. no one watches the watchers... but they're the
           | ones watching, so may as well present them with a
           | verisimilitudinous rendition of reality... right?
           | 
           | but, as a discipline, engineers manage to encourage the
           | ascent of the least engineer-ly (or, perhaps, "hacker"-ly)
           | among them ("us") ...-selves... through their sui generis
           | combination of learned helplessness, willful ignorance,
           | incorrigible myopia, innate naivete, and cynical self-
           | servitude that signify the Institutional (Software) Engineer.
           | coddled more than any other specialty within "the
           | enterprise", they manage to simultaneously underplay their
           | hand with respect to True Leverage (read: "Power") and
           | overplay their hand with respect to complices of superiority.
           | i am ashamed and dismayed to recall the numerous times i have
           | heard (and heard of) comments to the effect of "my time is
           | too expensive for this meeting" in the workplace... every
           | single one of which has come not from the managerial class--
           | as one might reasonably, if superficially, expect-- but from
           | the software engineer rank and file.
           | 
           | to be clear: i don't think it's fair to expect high-minded
           | idealism from _anyone_. but if you are looking for the
           | archetypical  "company person"... engineers need look no
           | further than their fellow podmates / slack-room-mates / etc.
           | and thus no one should be surprised to see the state of the
           | world we all collectively hath wrought.
        
             | resize2996 wrote:
             | I dig your vibe. whaddya working on these days?
        
         | bongodongobob wrote:
         | "It's shockingly stable." You're running a soup. I'm not sure
         | if this is satire or not. This reminds me of using a plug-in
         | light timer to reboot your servers because some java program
         | eats all the memory.
        
           | keeganpoppen wrote:
           | or installing software to jiggle the mouse every so often so
           | that the computer with the spreadsheet that runs the company
           | doesn't go to sleep
        
             | Cthulhu_ wrote:
             | Still infinitely cheaper than rebuilding the spreadsheet
             | tbh.
        
           | HL33tibCe7 wrote:
           | Sometimes running a soup is the correct decision
        
         | Cthulhu_ wrote:
         | There's nothing as permanent as a temporary solution.
        
           | netdevnet wrote:
           | Production environments are full of PoCs that were meant to
           | be binned
        
         | netdevnet wrote:
         | > It's shockingly stable. So much so that resolving the root
         | cause isn't considered a priority and so we've had this running
         | for months.
         | 
         | I don't know why my senses tell me that this is wrong even if
         | you can afford it
        
           | crabbone wrote:
           | Guys might be looking to match the fame of the SolarWinds.
        
           | Retric wrote:
           | > I don't know why my senses tell me that this is wrong
           | 
           | The fix is also hiding other issues that show up. So it
           | degrades over time and eventually you're stuck trying to
           | solve multiple problems at the same time.
        
             | pmarreck wrote:
             | ^ This is the problem. Not only that, solving 10 bugs
             | (especially those more difficult nondeterministic
             | concurrency bugs) at the same time is hideously harder than
             | solving 1 at a time.
             | 
             | As a Director of Engineering at my last startup, I had an
             | "all hands on deck" policy as soon as any concurrency bug
             | was spotted. You do NOT want to let those fester. They are
             | nondeterministic, infrequent, and exponentially dangerous
             | as more and more appear and are swept under the rug via
             | "reset-to-known-good" mitigations.
        
         | cryptonym wrote:
         | People will argue you should spend time on something else once
         | you put bandaid on a wooden leg.
         | 
         | You should do proper risk assessment, such bug may be leveraged
         | by an attacker, that may actually be a symptom of a running
         | attack. That may also lead to data corruption or exposure. That
         | may mean some part of the system are poorly optimised and over-
         | consuming resources, maybe impacting user-experience. With a
         | dirty workaround, your technical debt increases, expect more
         | and more random issues that requires aggressive "self-healing".
        
           | kenhwang wrote:
           | It's just yet another piece of debt that gets prioritized
           | against other pieces of debt. As long as the cost of this
           | debt is purely fiscal, it's easy enough to position in the
           | debt backlog. Maybe a future piece of debt will increase the
           | cost of this. Maybe paying off another piece of debt will
           | also pay off some of this. The tech debt payoff
           | prioritization process will get to it when it gets to it.
        
             | cryptonym wrote:
             | Without proper risk assessment, that's poor management and
             | a recipe for disaster. Without that assessment, you don't
             | know the "cost", if that can even be measured. Of course
             | one can still run a business without doing such a risk
             | assessment while poorly managing technical debt; just be
             | prepared for a higher chance of disaster.
        
         | whatever1 wrote:
         | I think this is a prime example of why the cloud won.
         | 
         | You don't need wizards in your team anymore.
         | 
         | Something seems off in the instance? Just nuke it and spin up
         | a new one. Leave the system debugging to the Amazon folks.
        
           | znpy wrote:
           | Amazon folks won't debug your code though, they'll just
           | happily bill you more.
        
             | snicker7 wrote:
             | The point is not to spend time frantically fixing code at 3
             | AM.
        
           | chronid wrote:
           | This has been done forever. Ops team had cronjobs to restart
           | misbehaving applications out of business hours since before I
           | started working. In a previous job, the solution for disks
           | being full on a VM on-prem (no, not databases) was an
           | automatic reimage. I've seen scheduled index rebuilds on
           | Oracle. The list goes on.
        
             | braggerxyz wrote:
             | > I've seen scheduled index rebuilds on Oracle
             | 
             | If you look into the Oracle DBA handbook, scheduled
             | index rebuilds are somewhat recommended. We do it on
             | weekends on our Oracle instances. Otherwise you will
             | encounter severe performance degradation in tables where
             | data is inserted and deleted at high throughput, thus
             | leading to fragmented indexes. And since Oracle 12c with
             | ONLINE REBUILD this is no problem anymore, even at peak
             | hours.
        
             | xeromal wrote:
             | Rebooting Windows IIS instances every night has been a
             | mainstay for most of my career. haha
        
           | l33t7332273 wrote:
           | Amazon needs wizards then.
        
           | Gud wrote:
           | This is not exactly a new tactic, and not something that
           | requires a cloud solution to implement. A randomized
           | 'kill -HUP' could do the same thing, for example.
        
         | rsynnott wrote:
         | > So much so that resolving the root cause isn't considered a
         | priority and so we've had this running for months.
         | 
         | I mean, you probably know this, but sooner or later this
         | attitude is going to come back to bite you. What happens when
         | you need to do it every hour? Every ten minutes? Every 30
         | seconds?
         | 
         | This sort of solution is really only suitable for use as short-
         | term life-support; unless you understand exactly what is
         | happening (but for some reason have chosen not to fix it), it's
         | very, very dangerous.
        
           | actionfromafar wrote:
           | In a way, yes. But it's also like a sledge hammer approach to
           | stateless design. New code will be built within the
           | constraint that stuff will be rebooted fairly often. That's
           | not only a bad thing.
        
           | jasonjayr wrote:
           | Well that's the thing: a bug that happens every 2 hrs and
           | cannot be traced easily gives a developer roughly 4
           | opportunities in an 8hr day to reproduce + diagnose.
           | 
           | Once it's happening every 30 seconds, then they have up to
           | 120 opportunities per hour, and it'll be fixed that much
           | quicker!
        
         | rothron wrote:
         | This fix means that you won't notice when you accumulate
         | other such resource leaks. When the shit eventually hits the
         | fan, you'll have to deal with problems you didn't even know
         | you had.
        
         | DrBazza wrote:
         | Sounds like process-level garbage collection. Just kill it
         | and restart. Which also sounds like the apocryphal tale about
         | the leaky code and the missile.
         | 
         | "This sparked and interesting memory for me. I was once working
         | with a customer who was producing on-board software for a
         | missile. In my analysis of the code, I pointed out that they
         | had a number of problems with storage leaks. Imagine my
         | surprise when the customers chief software engineer said "Of
         | course it leaks"
         | 
         | He went on to point out that they had calculated the amount of
         | memory the application would leak in the total possible flight
         | time for the missile and then doubled that number. They added
         | this much additional memory to the hardware to "support" the
         | leaks. Since the missile will explode when it hits it's target
         | or at the end of it's flight, the ultimate in garbage
         | collection is performed without programmer intervention."
         | 
         | https://x.com/pomeranian99/status/858856994438094848
        
           | eschneider wrote:
           | At least with the missile case, someone _did the analysis and
           | knows exactly what's wrong_ before deciding the "solution"
           | was letting the resources leak. That's fine.
           | 
           | What always bothers me, is when (note, I'm not saying this is
           | the case for the grandparent comment, but it's implied)
           | people don't understand what exactly is broken, but just
           | reboot every so often to fix things. :0
           | 
           | For a lot of bugs, there's often the component you see (like
           | the obvious resource leak) combined with subtle problems you
           | don't see (data corruption, perhaps?) and you won't really
           | know until the problem is tracked down.
        
           | xelamonster wrote:
           | That's super interesting and I love the idea of physically
           | destructive GC. But to me that calculation and tracking
           | sounds a lot harder than simply fixing the leaks :)
        
         | braggerxyz wrote:
         | > It's shockingly stable. So much so that resolving the root
         | cause isn't considered a priority and so we've had this running
         | for months.
         | 
         | The trick is to not tell your manager that your bandaid works
         | so well, but that it barely keeps the system alive and you need
         | to introduce a proper fix. Been doing this for the last 10
         | years and we got our system so stable that I haven't had a
         | midnight call in the last two years.
        
           | tfandango wrote:
           | Classic trick. As a recent dev turned manager, these are the
           | kind of things I've had a hard time learning.
        
         | pmarreck wrote:
         | Heroku reboots servers every night no matter what stack is
         | running on them. Same idea.
         | 
         | The problem is that you've merely borrowed yourself some
         | time. As time goes on, more inefficiencies/bugs of this
         | nature will creep in unnoticed, and some will perhaps
         | silently corrupt data before they are noticed (!). It will be
         | vastly more difficult at that point to troubleshoot 10 bugs
         | of varying degrees of severity and frequency all happening at
         | the same time, forcing you to reboot said servers at faster
         | and faster intervals, which simultaneously makes it harder to
         | diagnose them individually.
         | 
         | > It's shockingly stable.
         | 
         | Well of course it is. You're "turning it off and then on
         | again," the classic way to return to a known-good state. It is
         | not a root-cause fix though, it is a band-aid.
        
           | lanstin wrote:
           | Also, it means you are married to the reboot process. If
           | you lose control of your memory management process too
           | much, you'll never be able to fix it absent a complete
           | rewrite. I
           | worked at a place that had a lot of (c++) CGI programs with a
           | shocking level of disregard for freeing memory, but that was
           | ok because when the CGI request was over the process
           | restarted. But then they reused that same code in SOA/long
           | lived services, but they could never have one worker process
           | handle more than 10 requests due to memory leaks (and
           | inability to re-initialize all the memory used in a request).
           | So they could never use in-process caching or any sort of
           | optimization that long-lived processes could enable.
        
             | pmarreck wrote:
             | I never considered "having to reboot" as "introducing
             | another dependency" (in the sense of wanting to keep those
             | at a minimum) but sure enough, it is.
             | 
             | Also, great point about (depending on your architecture)
             | losing the ability to do things like cache results
        
         | pronoiac wrote:
         | I guess this works right up until it doesn't? It's been a
         | while, but I've seen AWS hit capacity for a specific instance
         | size in a specific availability zone. I remember spot pricing
         | being above the on-demand pricing, which might have been part
         | of the issue.
        
         | Jnr wrote:
         | I am running an old statically compiled perl binary that has a
         | memory leak. So every day the container is restarted
         | automatically so I would not have to deal with the problem. It
         | has been running like this for many many years now.
        
       | xxs wrote:
       | They were just lucky not to have data corruption due to the
       | concurrency issue, and that the manifestation was an infinite
       | get. Overall, if you can randomly "kill -9", the case is rather
       | trivial.
       | 
       | Likely replacing HashMap with CHM would not solve the
       | concurrency issue either, but it'd prevent an infinite loop.
       | (Edit) It appears that part is just wrong: "some calls to
       | ConcurrentHashMap.get() seemed to be running infinitely." <--
       | it's possible for that to happen on a plain HashMap during
       | concurrent put(s), but not with ConcurrentHashMap.
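       | 
       | For reference, the classic hazard being described: a plain
       | HashMap mutated from multiple threads without synchronization
       | is a data race, and on older JDKs a concurrent resize could
       | corrupt a bucket chain so badly that a later get() spun
       | forever. A minimal sketch (behaviour is undefined, so it may
       | or may not hang on any given run); ConcurrentHashMap does not
       | have this failure mode:
       | 
       |   import java.util.HashMap;
       |   import java.util.Map;
       | 
       |   public class UnsafeMapDemo {
       |       public static void main(String[] args) throws InterruptedException {
       |           Map<Integer, Integer> map = new HashMap<>(); // not thread-safe
       | 
       |           // Two writers triggering resizes concurrently: a data race.
       |           Runnable writer = () -> {
       |               for (int i = 0; i < 1_000_000; i++) {
       |                   map.put((int) (Math.random() * 100_000), i);
       |               }
       |           };
       |           Thread t1 = new Thread(writer);
       |           Thread t2 = new Thread(writer);
       |           t1.start();
       |           t2.start();
       |           t1.join();
       |           t2.join();
       | 
       |           // On old JDKs a corrupted bucket chain could turn a get()
       |           // into an infinite loop; ConcurrentHashMap avoids the race.
       |           System.out.println(map.get(42));
       |       }
       |   }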
        
         | ay wrote:
         | it wasn't luck, it was very deliberately engineered for. The
         | article does lack a good bit of context about the Netflix
         | infra:
         | 
         | https://netflixtechblog.com/the-netflix-simian-army-16e57fba...
         | 
         | https://github.com/Netflix/chaosmonkey
        
         | vladak wrote:
         | yep, the link in the "some calls to ConcurrentHashMap.get()
         | seemed to be running infinitely." sentence points to
         | HashMap.html#get(java.lang.Object)
        
           | xxs wrote:
           | I have seen that part myself (infinite loops), also I have
           | quite extensive experience with CHM (and HashMap).
           | 
           | Overall such a mistake alone undermines the effort/article.
        
       | conradfr wrote:
       | I have a project where one function (reading metadata from an
       | Icecast stream [0]) was causing a memory leak and ultimately
       | consuming all of it.
       | 
       | I don't remember all the details but I've still not been able
       | to find the bug.
       | 
       | But this being in Elixir I "fixed it" with Task, TaskSupervisor
       | and try/catch/rescue.
       | 
       | Not really a win but it is still running fine to this day.
       | 
       | [0]
       | https://github.com/conradfr/ProgRadio/blob/1fa12ca73a40aedb9...
        
         | cpursley wrote:
         | Half of HN posts are people showing off things where they
         | spent a herculean amount of effort reinventing something that
         | elixir/erlang solved 30+ years ago.
        
           | nesarkvechnep wrote:
           | Some are even proud of their ignorance and belittle Erlang
           | and Elixir.
        
             | rikthevik wrote:
             | I'm fine with it.
             | 
             | If people want to belittle something, either we aren't
             | trying to solve the same problem (sure) or they're actively
             | turning people away from what could be a serious advantage
             | (more for me!)
             | 
             | If the cost of switching wasn't so high, I'd love to write
             | Elixir all day. It's a joy.
        
       | girishso wrote:
       | Not very familiar with Elixir/OTP, but isn't the approach OP
       | took similar to the Let It Crash philosophy of OTP?
        
         | ramchip wrote:
         | Not really, you wouldn't normally kill or restart processes
         | randomly in an OTP system. "Let it crash" is more about
         | separating error handling from business logic.
        
       | btbuilder wrote:
       | I mean, it worked for Boeing[1] too.
       | 
       | 1 -
       | https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
        
         | keeganpoppen wrote:
         | "worked"
         | 
         | (not that i don't get the sarcasm)
        
         | shiroiushi wrote:
         | True, but for somewhat different reasons. For the OP, they take
         | this approach because they simply don't know yet what the
         | problem is, and it would take some time to track it down and
         | fix it and they don't want to bother.
         | 
         | For Boeing, it's probably something fairly simple actually, but
         | they don't want to fix it because their software has to go
         | through a strict development process based on requirements and
         | needing certification and testing, so fixing even a trivial bug
         | is extremely time-consuming and expensive, so it's easier to
         | just put a directive in the manual saying the equipment needs
         | to be power-cycled every so often and let the users deal with
         | it. The OP isn't dealing with this kind of situation.
        
       | iLoveOncall wrote:
       | Yeah, I had the same issue with the EC2 instance I use to host
       | my personal websites: it would randomly get to 100% CPU and
       | become unreachable.
       | 
       | I put a CloudWatch alarm at 90% CPU usage which would trigger a
       | reboot (which completed way before anyone would notice a
       | downtime).
       | 
       | Never had issues again.
        
       | iluvcommunism wrote:
       | Kill and restart the service. This seems to be the coder solution
       | to everything. We do it for our service as well. The programmer
       | could fix their stuff but alas, that's too much to ask.
        
         | edf13 wrote:
         | Yes - lots of writing for a common solution to a bug...
         | 
         | Memory leaks are often "resolved" this way... until time allows
         | for a proper fix.
        
       | Cthulhu_ wrote:
       | "Did you try turning it off and on again?"
        
       | jumploops wrote:
       | This reminds me of a couple startups I knew running Node.js circa
       | ~2014, where they would just restart their servers every night
       | due to memory issues.
       | 
       | iirc it was mostly folks with websocket issues, but fixing the
       | upstream was harder
       | 
       | 10 years later and specific software has gotten better, but this
       | type of problem is certainly still prevalent!
        
       | JonChesterfield wrote:
       | Title is grossly misleading.
       | 
       | That Netflix had already built a self-healing system means they
       | were able to handle a memory leak by killing random servers
       | faster than memory was leaking.
       | 
       | This post isn't about how they've managed that, it's just showing
       | off that their existing system is robust enough that you can do
       | hacks like this to it.
        
         | 4star3star wrote:
         | Your take is much different than mine. The issue was a
         | practical one of sparing people from working too much over one
         | weekend since the bug would have to wait until Monday, and the
         | author willingly described the solution as the worst.
        
       | rullelito wrote:
       | > Why not just reboot them? Terminating was faster.
       | 
       | If you don't know why you should reboot servers/services properly
       | instead of terminating them..
        
         | cnity wrote:
         | Well, why? This comment seems counter to the now-popular
         | "cattle not pets" approach.
        
           | ooFieTh6 wrote:
           | state
        
       | merizian wrote:
       | This reminds me of LLM pretraining and how there are so many
       | points at which the program could fail and so you need clever
       | solutions to keep uptime high. And it's not possible to just fix
       | the bugs--GPUs will often just crash (e.g. in graphics, if a
       | pixel flips the wrong color for a frame, it's fine, whereas such
       | things can cause numerical instability in deep learning so ECC
       | catches them). You also often have a fixed sized cluster which
       | you want to maximize utilization of.
       | 
       | So improving uptime involves holding out a set of GPUs to swap
       | out failed ones while they reboot. But also the whole run can
       | just randomly deadlock, so you might solve that by listening to
       | the logs and restarting after a certain amount of inactivity. And
       | you have to be clever with how to save/load checkpoints, since
       | that can start to become a huge bottleneck.
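       | 
       | A log-inactivity watchdog of that sort is simple to sketch.
       | Everything here is hypothetical (the log path, the restart
       | command, the threshold); it just restarts the run when the log
       | file hasn't been written to for too long:
       | 
       |   import java.nio.file.Files;
       |   import java.nio.file.Path;
       |   import java.time.Duration;
       |   import java.time.Instant;
       | 
       |   public class LogWatchdog {
       |       public static void main(String[] args) throws Exception {
       |           Path log = Path.of("/var/log/training/run.log"); // hypothetical
       |           Duration maxSilence = Duration.ofMinutes(30);    // "inactivity"
       | 
       |           while (true) {
       |               Instant lastWrite = Files.getLastModifiedTime(log).toInstant();
       |               if (Duration.between(lastWrite, Instant.now())
       |                       .compareTo(maxSilence) > 0) {
       |                   // No output for too long: assume a deadlock and restart.
       |                   new ProcessBuilder("systemctl", "restart", "training-run")
       |                           .inheritIO().start().waitFor();
       |               }
       |               Thread.sleep(Duration.ofMinutes(1).toMillis());
       |           }
       |       }
       |   }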
       | 
       | After many layers of self healing, we managed to take a vacation
       | for a few days without any calls :)
        
       | rvnx wrote:
       | Meta has a similar strategy, and this is why memory leak bugs in
       | HHVM are not fixed (they consider that instances are going to be
       | regularly killed anyway)
        
       | fidotron wrote:
       | This is a bit odd coming from the company of chaos engineering -
       | has the chaos monkey been abandoned at Netflix?
       | 
       | I have long advocated randomly restarting things with different
       | thresholds partly for reasons like this* and to ensure people are
       | not complacent wrt architecture choices. The resistance, which
       | you can see elsewhere here, is huge, but at scale it will happen
       | regardless of how clever you try to be. (A lesson from the erlang
       | people that is often overlooked).
       | 
       | * Many moons ago I worked on a video player which had a low level
       | resource leak in some decoder dependency. Luckily the leak was
       | attached to the process, so it was a simple matter of cycling the
       | process every 5 minutes and seamlessly attaching a new one. That
       | just kept going for months on end, and eventually the dependency
       | vendor fixed the leak, but many years later.
        
         | ricardobeat wrote:
         | In cases like this won't Chaos Monkey actually hide the
         | problem, since it's basically doing exactly the same as their
         | mitigation strategy - randomly restarting services?
        
           | fidotron wrote:
           | Right. The point of the question is why not ramp up the
           | monkey? They seem to imply it isn't there now, which wouldn't
           | surprise me with the cultural shifts that have occurred in
           | the tech world.
        
       | louison11 wrote:
       | You gotta pick your battles. Part of being in a startup is to be
       | comfortable with quick and dirty when necessary. It's when things
       | get bigger, too corporate and slow that companies stop moving
       | fast.
        
         | Sl1mb0 wrote:
         | We are talking about Netflix. You know, the 'N' in FAANG/MAANG
         | or whatever.
        
           | afavour wrote:
           | As a non-FAANGer Netflix has always intrigued me because of
           | this. While Google, Facebook and others seem to have bogged
           | themselves down in administrative mess, Netflix still seems
           | agile. From the outside at least.
           | 
           | (also worth noting this post seems to be discussing an event
           | that occurred many years ago, circa 2011, so might not be a
           | reflection of where they are today)
        
             | wil421 wrote:
             | Netflix isn't trying to be a search engine, hardware
             | manufacturer, consumer cloud provider (email, OneDrive,
             | etc), cloud infrastructure provider, and an ad company at
             | the same time. Or an Online Walmart who does all the rest
             | and more.
        
             | jldugger wrote:
             | Netflix is a much smaller enterprise. It got included
             | because it was high growth at the time, not because it was
             | destined to become a trillion dollar company.
        
       | alecco wrote:
       | Self-healing system: increase cluster size and replace servers
       | randomly. It works because it was a problem of threads
       | occasionally entering an infinite loop but not corrupting data.
       | And the whole system can tolerate this kind of whole-server
       | crashes. IMHO an unusual combination of preconditions.
       | 
       | It's not explained why they couldn't write a monitor script
       | instead to find the servers having the issue and kill only
       | those.
        
         | 4star3star wrote:
         | I think they just needed a quick and dirty solution that was
         | good enough for a few days. They figured that for 1% failure
         | per hour, they needed to kill x processes every y minutes to
         | keep ahead of the failures. I'm sure it would be much more
         | efficient but also more complicated to try to target the
         | specific failures, and the "good enough" solution was
         | acceptable.
        
       | btbuildem wrote:
       | On a long enough timescale, everything eventually converges to
       | Erlang
        
         | pmarreck wrote:
         | Hah, hinted at that in my comment:
         | https://news.ycombinator.com/item?id=42126301
         | 
         | It really is a fundamental advantage against the worst kinds of
         | this category of bug
        
       | pmarreck wrote:
       | I had to deal with a concurrency bug in Ruby once and it was so
       | bad* that it pushed me into Elixir, which makes the vast majority
       | of concurrency bugs impossible at the language-design level, thus
       | enabling more sanity.
       | 
       | Ingeniously simple solution for this particular bug though.
       | 
       | *as I recall, it had to do with merging a regular Hash in the ENV
       | with a HashWithIndifferentAccess, which as it turns out was ill-
       | conceived at the time and had undefined corner cases (example:
       | what should happen when you merge a regular Hash containing
       | either a string or symbol key (or both) into a
       | HashWithIndifferentAccess containing the same key but internally
       | only represented as a string? Which takes precedence was
       | undefined at the time.)
        
       | ken47 wrote:
       | If the principles of languages like Erlang were taught in
       | American schools, things like this would be much less likely to
       | occur. Silly that Computer Science is regarded more highly by
       | many than Software Engineering for Software Engineering jobs.
        
         | nextos wrote:
         | Ideas stemming from Erlang and Mozart/Oz are indeed a big blind
         | spot in most undergrad programs. Sadly, even in EU all this is
         | becoming a niche topic, which is weird as today's applications
         | are more concurrent and data-intensive than ever.
        
       | pronoiac wrote:
       | I've dealt with something similar. We were able to spin up zombie
       | reapers, looking for the cores / CPUs that were pegged at 100%,
       | and prioritize the instances that were worst hit.
        
       | TZubiri wrote:
       | Netflix is supposed to be the bastion of microservices and the
       | trailblazer of all-aws infrastructure.
       | 
       | But as time goes by I just ask, all this work and costs and
       | complexity, to serve files? Yeah, don't get me wrong, the files
       | are really big, AND they are streamed, noted. But it's
       | not the programming complexity challenge that one would expect,
       | almost all of the complexity seems to stem from metadata like
       | when users stop watching, and how to recommend them titles to
       | keep them hooked, and when to cut the titles and autoplay the
       | next video to make them addicted to binge watching.
       | 
       | Case in point, the blogpost speaks of a CPU concurrency bug and
       | clients being servers? But never once refers to an actual
       | business domain purpose. Like are these servers even loading
       | video content? My bet is they are more on the optimizing
       | engagement side of things. And I make this bet knowing that these
       | are servers with high video-like load, but I'm confident that
       | these guys are juggling 10TB/s of mouse metadata into some ML
       | system more than I'm confident that they have some problem with
       | the core of their technology which has worked since launch.
       | 
       | As I say this, I know I'm probably wrong, surely the production
       | issues are caused by high peak loads like a new chapter of the
       | latest series or whatever.
       | 
       | I'm all over the place, I just don't like netflix is what I'm
       | saying
        
         | PittleyDunkin wrote:
         | > But as time goes by I just ask, all this work and costs and
         | complexity, to serve files?
         | 
         | You could say the same thing about the entire web.
        
           | nicce wrote:
           | Not really. People are not posting data into Netflix.
           | Netflix is mostly read-only. That is a huge complexity
           | reducer.
        
             | TZubiri wrote:
             | I thought about the complexity in terms of compute, but I
             | guess if there's no input then there's no compute possible,
             | as all functions are idempotent and static. At the very
             | least their results are cacheable, or the input is
             | centralized (admins/show producers)
        
             | jacksontheel wrote:
             | Every time you like/dislike/watchlist a movie you're
             | posting data. When you're watching a movie your progress is
             | constantly updated, posting data. Simple stuff but there's
             | possibly hundreds of thousands of concurrent users doing
             | that at any given moment.
        
               | nicce wrote:
               | Yes, but it still accounts for only a fraction of the
               | purpose of their infrastructure. There are no hard
               | global real-time sync requirements.
               | 
               | > When you're watching a movie your progress is
               | constantly updated, posting data
               | 
               | This can be implemented on server side and with read
               | requests only.
               | 
               | A proper comparison would be YouTube where people upload
               | videos and comment stuff in real-time.
        
               | PittleyDunkin wrote:
               | > A proper comparison would be YouTube where people
               | upload videos and comment stuff in real-time.
               | 
               | Even in this one sentence you're conflating two types of
               | interaction. Surely downloading videos is yet a third,
               | and possibly the rest of the assets on the site a fourth.
               | 
               | Why not just say the exact problem you think is worthy
               | of discussion with your full chest if you so clearly
               | have one in mind?
        
             | PittleyDunkin wrote:
             | Is it? It's pretty rare to download assets from servers
             | that you're uploading to. Sometimes you have truly
             | interactive app servers but that's a pretty small
             | percentage of web traffic. Shared state is not the typical
             | problem to solve on the internet, though it is a popular
             | one to discuss.
        
               | nicce wrote:
               | Whatever your service is, usually the database is the
               | bottleneck. The database limits the latency, scaling and
               | availability.
               | 
               | Of course, how much, depends on the service.
               | Particularly, how much concurrent writing is happening,
               | and do you need to update this state globally, in real-
               | time as result of this writing. Also, is local caching
               | happening and do you need to invalidate the cache as well
               | as a result of this writing.
               | 
               | Most of the relevant problems disappear if you can
               | just replicate most of the data without worrying that
               | someone is updating it, and you also don't have cache
               | invalidation issues. No race conditions. No real-time
               | replication issues.
        
               | PittleyDunkin wrote:
               | > Whatever your service is, usually the database is the
               | bottleneck. The database limits the latency, scaling and
               | availability.
               | 
               | Database-driven traffic is still a tiny percentage of
               | internet traffic. It's harder to tell these days with
               | encryption but on any given page-load on any project _I
               | 've_ worked on, most of the traffic is in assets, not
               | application data.
               | 
               | Now, latency might be a different issue, but it seems
               | ridiculous to me to consider "downloading a file" to be a
               | niche concern--it's just that most people offload that
               | concern to other people.
        
               | nicce wrote:
               | > It's harder to tell these days with encryption but on
               | any given page-load on any project I've worked on, most
               | of the traffic is in assets, not application data.
               | 
               | Yet you have to design the whole infrastructure so that
               | that tiny margin works flawlessly, because otherwise
               | the service usually is not serving its purpose.
               | 
               | Read-only assets are the easy part, which was my original
               | claim.
        
         | toast0 wrote:
         | > But as time goes by I just ask, all this work and costs and
         | complexity, to serve files?
         | 
         | IMHO, a large amount of the complexity is all the other stuff.
         | Account information, browsing movies, recommendations,
         | viewed/not/how much seen, steering to local CDN nodes, DRM
         | stuff, etc.
         | 
         | The file servers have a lot less complexity; copy content to
         | CDN nodes, send the client to the right node for the content,
         | serve 400Gbps+ per node. Probably some really interesting stuff
         | for their real time streams (but I haven't seen a
         | blog/presentation on those)
         | 
         | Transcoding is probably interesting too. Managing job queues
         | isn't new, but there's probably some fun stuff around cost
         | effectiveness.
        
         | bobdvb wrote:
         | Netflix has done massive amounts of work on BSD to improve
         | its network throughput; that's part of what enables their
         | file delivery from their CDN appliances.
         | https://people.freebsd.org/~gallatin/talks/euro2022.pdf
         | 
         | They've also contributed significantly to open source tools for
         | video processing, one of the biggest things that stands out is
         | probably their VMAF tool for quantifying perceptual quality in
         | video. It's probably the best open source tool for measuring
         | video quality out there right now.
         | 
         | It's also absolutely true that in any streaming service, the
         | orchestration, account management, billing and catalogue
         | components are waaaay more complex than actually delivering
         | video on-demand. To counter one thing you've said: mouse
         | movement... most viewing of premium content isn't done on web
         | or even mobile devices. Most viewing time of paid content is
         | done on a TV, where you're not measuring focus. But that's just
         | a piece of trivia.
         | 
         | As you said, you just don't like them, but they've done a lot
         | for the open source community and that should be understood.
        
       | posix_compliant wrote:
       | What's neat is that this is a differential equation. If you
       | kill 5% of instances each hour, the reduction in bad instances
       | is proportional to the current number of bad instances.
       | 
       | i.e.
       | 
       | if bad(t) = fraction of bad instances at time t
       | 
       | and
       | 
       | bad(0) = 0
       | 
       | then
       | 
       | d(bad(t))/dt = -0.05 * bad(t) + 0.01 * (1 - bad(t))
       | 
       | so
       | 
       | bad(t) = 0.166667 - 0.166667 e^(-0.06 t)
       | 
       | Which looks a mighty lot like the graph of bad instances in the
       | blog post.
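       | 
       | A quick numerical check of that closed form (a sketch: 1% of
       | healthy instances go bad per hour, 5% of all instances are
       | killed at random per hour):
       | 
       |   public class BadFractionModel {
       |       public static void main(String[] args) {
       |           double bad = 0.0;  // fraction of bad instances, bad(0) = 0
       |           double dt = 0.01;  // hours per Euler step
       |           for (double t = 0; t <= 24; t += dt) {
       |               double closedForm = (1.0 / 6.0) * (1 - Math.exp(-0.06 * t));
       |               if (Math.round(t / dt) % 100 == 0) { // once per simulated hour
       |                   System.out.printf("t=%5.1fh  euler=%.4f  closed=%.4f%n",
       |                           t, bad, closedForm);
       |               }
       |               // d(bad)/dt = -0.05*bad (kills) + 0.01*(1 - bad) (new failures)
       |               bad += dt * (-0.05 * bad + 0.01 * (1 - bad));
       |           }
       |           // Both converge toward 0.01 / 0.06 ~ 0.1667, matching the graph.
       |       }
       |   }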
        
         | uvdn7 wrote:
         | Love it! I wonder if the team knew this explicitly or
         | intuitively when they deployed the strategy.
         | 
         | > We created a rule in our central monitoring and alerting
         | system to randomly kill a few instances every 15 minutes. Every
         | killed instance would be replaced with a healthy, fresh one.
         | 
         | It doesn't look like they worked out the numbers ahead of
         | time.
        
       | eigenvalue wrote:
       | I understand how their approach worked well enough, but I don't
       | get why they couldn't selectively target the VMs that were
       | currently experiencing problems rather than randomly select any
       | VM to terminate. If they were exhausting all their CPU resources,
       | wouldn't that be easy enough to search for using something like
       | ansible?
        
         | rdoherty wrote:
         | I agree, I've been at places that can tie alerts at a host
         | level to an automated task runner. Basically a workflow system
         | that gets kicked off on an alert. Alert fires, host is rebooted
         | or terminated. Helpful for things like this.
        
       | pjdesno wrote:
       | Vaguely related anecdote:
       | 
       | 30 years ago or so I worked at a tiny networking company where
       | several coworkers came from a small company (call it C) that made
       | AppleTalk routers. They recounted being puzzled that their
       | competitor (company S) had a reputation for having a rock-solid
       | product, but when they got it into the lab they found their
       | competitor's product crashed maybe 10 times more often than their
       | own.
       | 
       | It turned out that the competing device could reboot faster than
       | the end-to-end connection timeout in the higher-level protocol,
       | so in practice failures were invisible. Their router, on the
       | other hand, took long enough to reboot that your print job or
       | file server copy would fail. It was as simple as that, and in
       | practice the other product was rock-solid and theirs wasn't.
       | 
       | (This is a fairly accurate summary of what I was told, but
       | there's a chance my coworkers were totally wrong. The conclusion
       | still stands, I think - fast restarts can save your ass.)
        
         | cruffle_duffle wrote:
         | Seems like the next priority would be to make your product
         | reboot just as fast, if not faster, than theirs.
        
           | rtkwe wrote:
           | Clearly but maybe the thing that makes your product crash
           | less makes it take longer to reboot.
           | 
           | Also, the story isn't that they couldn't; it's that they
           | were measuring the actual failure rate, not the effective
           | failure rate, because the device could recover faster than
           | the failure could cause actual issues.
        
         | kevin_nisbet wrote:
         | This is along the lines of how one of the wireless telecom
         | products I really liked worked.
         | 
         | Each running process had a backup on another blade in the
         | chassis. All internal state was replicated. And the process was
         | written in a crash only fashion, anything unexpected happened
         | and the process would just minicore and exit.
         | 
         | One day I think I noticed that we had over a hundred thousand
         | crashes in the previous 24 hours, but no one complained and we
         | just sent over the minicores to the devs and got them fixed. In
         | theory some users would be impacted that were triggering the
         | crashes, their devices might have a glitch and need to re-
         | associate with the network, but the crashes caused no
         | widespread impacts in that case.
         | 
         | To this day I'm a fan of crash only software as a philosophy,
         | even though I haven't had the opportunity to implement it in
         | the software I work on.
        
       | otterley wrote:
       | > It was Friday afternoon
       | 
       | > Rolling back was cumbersome
       | 
       | It's a fundamental principle of modern DevOps practice that
       | rollbacks should be quick and easy, done immediately when you
       | notice a production regression, and ideally automated. And at
       | Netflix's scale, one would have wanted this rollout to be done in
       | waves to minimize risk.
       | 
       | Apparently this happened back in 2021. Did the team investigate
       | later why you couldn't do this, and address it?
        
         | jldugger wrote:
         | >It's a fundamental principle of modern DevOps practice that
         | rollbacks should be quick and easy
         | 
         | Then DevOps principles are in conflict with reality.
        
           | otterley wrote:
           | Go on...
        
       | __turbobrew__ wrote:
       | > Could we roll back? Not easily. I can't recall why
       | 
       | I can appreciate the hack to deal with this (I actually came up
       | with the same solution in my head while reading), but if you
       | cannot
       | rollback and you cannot roll forward you are stuck in a special
       | purgatory of CD hell that you should be spending every moment of
       | time getting out of before doing anything else.
        
       | Scubabear68 wrote:
       | The real key here is to understand Netflix's business, and also
       | many social media companies too.
       | 
       | These companies have achieved vast scale because correctness
       | doesn't matter that much so long as it is "good enough" for a
       | large enough statistical population, and their Devops practices
       | and coding practices have evolved with this as a key factor.
       | 
       | It is not uncommon at all for Netflix or Hulu or Facebook or
       | Instagram to throw an error or do something bone headed. When it
       | happens you shrug and try again.
       | 
       | Now imagine if this was applied to credit card payments systems,
       | or your ATM network, or similar. The reality of course is that
       | some financial systems do operate this way, but it's recognized
       | as a problem and usually gets on people's radar to fix as failed
       | transaction rates creep up and it starts costing money directly
       | or clients.
       | 
       | "Just randomly kill shit" is perfectly fine in the Netflix world.
       | In other domains, not so much (but again it can and will be used
       | as an emergency measure!).
        
       ___________________________________________________________________
       (page generated 2024-11-13 23:01 UTC)