[HN Gopher] We built a self-healing system to survive a concurre...
___________________________________________________________________
We built a self-healing system to survive a concurrency bug at
Netflix
Author : zdw
Score : 317 points
Date : 2024-11-08 14:52 UTC (5 days ago)
(HTM) web link (pushtoprod.substack.com)
(TXT) w3m dump (pushtoprod.substack.com)
| coolgoose wrote:
| One of the things I am grateful for is kubernetes and its
| killing of pods.
|
| Had a similar problem, but memory-wise, with a pesky memory
| leak; the short-term solution was to do nothing, as instances
| would get killed and replaced anyway.
| maximinus_thrax wrote:
| During one of my past gigs, this exact feature hid a huge
| memory leak in old code that had always run on k8s, which we
| found out only when we moved some instances to bare metal.
| esprehn wrote:
| We hit this in a past gig too. One of the big services had a
| leak, but deployed every 24 hours, which hid it. When the
| holiday deploy freeze hit, the pods lived much longer than
| normal and caused an OOM storm.
|
| At first I thought maybe we should add a "hack" to cycle all
| the pods over 24 hours old, but then I wondered if making
| holiday freezes behave like normal weeks was really a hack at
| all or just reasonable predictability.
|
| In the end folks managed to fix the leak, though we never
| resolved the philosophical question.
| ksd482 wrote:
| This was a nice short read. A simple (temporary) solution, yet a
| clever one.
|
| How was he managing the instances? Was he using kubernetes, or
| did he write some script to manage the auto terminating of the
| instances?
|
| It would also be nice to know why:
|
| 1. Killing was quicker than restarting. Perhaps because of the
| business logic built into the Java application?
|
| 2. Killing was safe. How was the system architected so that the
| requests weren't dropped altogether?
|
| EDIT: formatting
| jumploops wrote:
| The author mentions 2011 as the time they switched from REST to
| RPC-ish APIs, and this issue was related to that migration.
|
| Kubernetes launched in 2014, if memory serves, and it took a
| bit before widespread adoption, so I'm guessing this was some
| internal solution.
|
| This was a great read, and harkens back to the days of managing
| 1000s of cores on bare metal!
| braggerxyz wrote:
| > It would also be nice to know why:
|
| 1. Killing was quicker than restarting.
|
| If you happen to restart one of the instances with a thread
| hanging in an infinite loop, you can wait a very long time
| until the Java container actually decides to kill itself,
| because it did not finish its graceful shutdown within the
| allotted timeout period. Some Java containers have a default
| of 300s for this. In this circumstance kill -9 is faster by a
| lot ;)
|
| Also, we had circumstances where the affected Java container
| did not stop even when the timeout was reached, because the
| misbehaving thread consumed the whole CPU and none was left
| for the supervisor thread. Then you can only kill the host
| process of the JVM.
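|
| A minimal sketch of that failure mode (hypothetical code, not
| the actual service): a shutdown hook that waits on a wedged
| worker stalls a SIGTERM-initiated shutdown until the container
| gives up, while kill -9 bypasses hooks entirely.
|
|   import java.util.concurrent.CountDownLatch;
|
|   public class StuckShutdown {
|       public static void main(String[] args) {
|           // Never counted down: stands in for a wedged worker.
|           CountDownLatch workDone = new CountDownLatch(1);
|
|           // Graceful path: runs on SIGTERM (plain kill), but
|           // blocks forever, so the JVM never exits on its own.
|           Runtime.getRuntime().addShutdownHook(new Thread(() -> {
|               try {
|                   workDone.await();
|               } catch (InterruptedException ignored) {
|               }
|           }));
|
|           // The misbehaving thread, spinning at 100% CPU.
|           while (true) { }
|
|           // kill -9 (SIGKILL) skips shutdown hooks entirely,
|           // so termination is immediate.
|       }
|   }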
| kristiandupont wrote:
| That was a bit underwhelming compared to what the headline set my
| expectations up for, but definitely a good idea and neat
| solution.
| vanjajaja1 wrote:
| from the headline alone I got linkedin ceo vibe. "Built a Self-
| Healing System to Survive a Concurrency Bug" is how I could
| describe wrapping a failing method in a retry loop
| fragmede wrote:
| Put in a couple more if statements checking the output of
| rand(), call it AI, and you'll be CEO in no time!
| dhruvrrp wrote:
| Interesting read, the fix seems to be straightforward, but I'd
| have a few more questions if I was trying to do something
| similar.
|
| Is software deployed regularly on this cluster? Does that
| deployment happen faster than the rate at which they were
| losing CPUs? Why not just periodically force a deployment,
| given it's a repeated process that probably already happens
| frequently?
|
| What happens to the clients trying to connect to the stuck
| instances? Did they just get stuck/timeout? Would it have been
| better to have more targeted terminations/full terminations
| instead?
| nikita2206 wrote:
| An answer to basically all your questions is: doesn't matter,
| they did their best to stabilize in a short amount of time, and
| it worked - that's what mattered.
| rukugu wrote:
| I like the practicality of this
| est wrote:
| Reminds me of the famous quote by Rasmus Lerdorf, creator of PHP
|
| > I'm not a real programmer. I throw together things until it
| works then I move on. The real programmers will say "Yeah it
| works but you're leaking memory everywhere. Perhaps we should fix
| that." I'll just restart Apache every 10 requests.
| nicman23 wrote:
| i ll argue that doing the restart is more important until
| someone else finds the leak
| fragmede wrote:
| Or future me. It hurts on the inside to just kick EC2 every
| hour because every 61 minutes something goes awry in the
| process. But the show must go on, so you put in the temporary
| fix knowing that it's not going to be temporary. Still,
| weeks/months/years down the line you could get lucky and the
| problem will go away and you can remove the kludge. But if
| you're ridiculously lucky, not only will the problem get
| fixed, but you'll get to understand exactly why the
| mysterious problem was happening in the first place. Like the
| gunicorn 500 upgrade bug, or the Postgres TOAST json thing.
| That sort of satisfaction isn't something money can buy.
| (Though it will help pay for servers in the interim until you
| find the bug.)
| nicman23 wrote:
| or at least after the weekend :P
| morning-coffee wrote:
| Also uttered by others who thought borrowing money was more
| important until they could figure out a way to control
| spending.
| raverbashing wrote:
| > and to my memory, some calls to ConcurrentHashMap.get() seemed
| to be running infinitely.
|
| Of course they did. And whoever thought "Concurrent" meant it
| would work fine gets burned by it. Of course.
|
| And of course it doesn't work properly or intuitively for some
| very stupid reason. Sigh
| xxs wrote:
| It has to be an error - this could happen to a plain HashMap,
| but it has never been an issue w/ CHM.
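|
| For reference, a minimal sketch of the classic failure mode
| (may or may not reproduce depending on JDK version and timing;
| pre-Java-8 HashMap resizes were the notorious case):
|
|   import java.util.HashMap;
|   import java.util.Map;
|
|   public class HashMapRace {
|       public static void main(String[] args) {
|           // Unsynchronized writes to a plain HashMap are
|           // undefined behavior. On older JDKs a racy resize
|           // could link a bucket into a cycle, after which
|           // get() walks the chain forever at 100% CPU.
|           Map<Integer, Integer> map = new HashMap<>();
|           for (int t = 0; t < 4; t++) {
|               new Thread(() -> {
|                   for (int i = 0; i < 1_000_000; i++) {
|                       map.put(i, i);
|                       map.get(i);
|                   }
|               }).start();
|           }
|           // Swapping in new ConcurrentHashMap<>() removes the
|           // infinite loop, though not any higher-level race.
|       }
|   }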
| keeganpoppen wrote:
| this sounds more like citing chapter and verse in an exegesis
| than anything of direct relevance to the Mortal Plane...
| kenhwang wrote:
| My workplace currently has a similar problem where a resource
| leak can be greatly increased with certain unpredictable/unknown
| traffic conditions.
|
| Our half-day workaround implementation was the same thing, just
| cycle the cluster regularly automatically.
|
| Since we're running on AWS, we just double the size of the
| cluster, wait for the instances to initialize, then rapidly
| decommission the old instances. Every 2 hours.
|
| It's shockingly stable. So much so that resolving the root cause
| isn't considered a priority and so we've had this running for
| months.
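|
| For the curious, a rough sketch of that cycle with the AWS SDK
| for Java v2 (assuming the cluster is an Auto Scaling group;
| the group name and fixed sleep are illustrative, not our
| actual tooling):
|
|   import java.util.List;
|   import java.util.stream.Collectors;
|   import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
|   import software.amazon.awssdk.services.autoscaling.model.AutoScalingGroup;
|   import software.amazon.awssdk.services.autoscaling.model.Instance;
|
|   public class CycleCluster {
|       public static void main(String[] args) throws Exception {
|           AutoScalingClient asg = AutoScalingClient.create();
|           String group = "my-service-asg"; // illustrative
|
|           AutoScalingGroup g = asg.describeAutoScalingGroups(
|                   r -> r.autoScalingGroupNames(group))
|                   .autoScalingGroups().get(0);
|           List<String> oldIds = g.instances().stream()
|                   .map(Instance::instanceId)
|                   .collect(Collectors.toList());
|
|           // 1. Double the group: fresh instances come up
|           //    alongside the old ones.
|           asg.setDesiredCapacity(r -> r
|                   .autoScalingGroupName(group)
|                   .desiredCapacity(g.desiredCapacity() * 2));
|
|           Thread.sleep(15 * 60 * 1000); // wait for warm-up
|
|           // 2. Rapidly decommission the old instances,
|           //    shrinking back to the original size.
|           for (String id : oldIds) {
|               asg.terminateInstanceInAutoScalingGroup(r -> r
|                       .instanceId(id)
|                       .shouldDecrementDesiredCapacity(true));
|           }
|       }
|   }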
| gorkempacaci wrote:
| How about the costs? Isn't this a very expensive bandaid? How
| is it not a priority? :)
| bratbag wrote:
| Depends what else it's solving for.
|
| I've seen multiple issues solved like this after engineering
| teams have been cut to the bone.
|
| If the cost of maintaining enough engineers to keep systems
| stable for more than 24 hours is more than the cost of
| doubling the container count, then this is what happens.
| JSDevOps wrote:
| This. All the domain knowledge has left. This sounds like a
| hacky workaround at best, and AWS will welcome you with open
| arms come invoice day.
| kenhwang wrote:
| Depends on how long it takes for the incoming instances to
| initialize and outgoing instances to fully decommission.
|
| x = time it takes to switchover
|
| y = length of the cycles
|
| x/y = % increase in cost
|
| For us, it's 15 minutes / 120 minutes = 12.5% increase, which
| was deemed acceptable enough for a small service.
| toast0 wrote:
| Shouldn't be too high cost if you only run 2x the instances
| for a short amount of time. A reasonable use of Cloud, IMHO,
| if you can't figure out a less disruptive bandaid.
| dochne wrote:
| AWS charges instances in 1 hour increments - so you're
| paying 150% the EC2 costs if you're doing this every 2
| hours
| kenhwang wrote:
| AWS has been charging by the second since 2017:
| https://aws.amazon.com/blogs/aws/new-per-second-billing-
| for-...
| JSDevOps wrote:
| This sounds terrible
| forkerenok wrote:
| If you squint hard enough, this is an implementation of a
| higher order garbage collection:
| MarkNothingAndSweepEverything.
|
| There, formalized the approach, so you can't call it terrible
| anymore.
| crabbone wrote:
| Oh no it isn't. Garbage collector needs to prove that
| what's being collected is garbage. If objects get collected
| because of an error... that's not really how you want GC to
| work.
|
| If you are looking for an apt metaphor, Stalin sort might
| be more in line with what's going on here. Or maybe
| "ostrich algorithm".
| zoky wrote:
| I think it's more like Tech Support Sort, as in "Try
| turning it off and on again and see if it's sorted".
| mech422 wrote:
| LOL - I like that one! :-)
| nukethegrbj wrote:
| >Garbage collector needs to prove that what's being
| collected is garbage
|
| Some collectors may need to do this, but there are
| several collectors that don't. EpsilonGC is a prime
| example of a GC that doesn't need to prove anything.
| crabbone wrote:
| EpsilonGC is a GC in the same sense as a suitable-size
| stick is a fully automatic rifle when you hold it to your
| shoulder and say pew-pew...
|
| I mean, I interpret your comment to be a joke, but you
| could've made it a bit more obvious for people not
| familiar with the latest fancy in Java world.
| rakoo wrote:
| To be fair this is what the BEAM vm structures everything on:
| If something is wonky, crash it and restart from a known ok
| state. Except when BEAM does it everyone says it's brilliant
| ElevenLathe wrote:
| It's one thing to design a crash-only system, and quite
| another to design a system that crashes all the time and
| paper over it with a cloud orchestration layer later.
| anal_reactor wrote:
| I've realized that the majority of engineers lack critical
| thinking, and are unable to see things beyond their domain of
| speciality. Arguments like "even when accounting for potential
| incident, your solution is more expensive, while our main goal
| is making money" almost never work, and I've been in countless
| discussions where some random document with "best practices",
| whatever they are supposed to be, was treated like a sacred
| scripture.
| MathMonkeyMan wrote:
| We are dogmatic and emotional, but the temptation to base
| your opinions on the "deeper theory" is large.
|
| Pragmatically, restart the service periodically and spend
| your time on more pressing matters.
|
| On the other hand, we fully understand the reason for the
| fault, but we don't know exactly where the fault is. And it
| is, our fault. It takes a certain kind of discipline to say
| "there are many things I understand but don't have the time
| to master now, let's leave it."
|
| It's, mostly, embarrassing.
| keeganpoppen wrote:
| "certain kind" of discipline, indeed... not the good kind.
| and while your comment goes to great pains to highlight how
| that particular God is dead (and i agree, for the record),
| the God of Quality (the one that Pirsig goes to great
| lengths to not really define) toward which the engineer's
| heart of heart prays that lives within us all is...
| unimpressed, to say the least.
| raverbashing wrote:
| Sure, you worship the God of Quality until you realize
| that memory leak is being caused by a 3rd party library
| (extra annoying when you could have solved it yourself)
| or a quirky stdlib implementation
|
| Then you realize it's a paper idol and the best you can
| do is suck less than the average.
|
| Thanks for playing Wing Commander!
| mech422 wrote:
| >> Thanks for playing Wing Commander!
|
| _captain america voice_ I got that reference :-)
| c0balt wrote:
| > "certain kind" of discipline, indeed... not the good
| kind.
|
| Not OP but this is a somewhat normal case of making a
| tradeoff? They aren't able to repair it at the moment (or
| rather don't want/can't allocate the time for it) and
| instead trade their resource usage for stability and
| technical debt.
| keeganpoppen wrote:
| that's because the judge(s) and executioner(s) aren't
| engineers, and the jury is not of their peers. and for the
| record i have a hard time faulting the non-engineers above
| so-described... they are just grasping for things they can
| understand and have input on. who wouldn't want that? it's
| not at all unreasonable for the keepers of the pursestrings to
| expect a certain amount of genuflection by way of self-
| justification. no one watches the watchers... but they're the
| ones watching, so may as well present them with a
| verisimilitudinous rendition of reality... right?
|
| but, as a discipline, engineers manage to encourage the
| ascent of the least engineer-ly (or, perhaps, "hacker"-ly)
| among them ("us") ...-selves... through their sui generis
| combination of learned helplessness, willful ignorance,
| incorrigible myopia, innate naivete, and cynical self-
| servitude that signify the Institutional (Software) Engineer.
| coddled more than any other specialty within "the
| enterprise", they manage to simultaneously underplay their
| hand with respect to True Leverage (read: "Power") and
| overplay their hand with respect to complexes of superiority.
| i am ashamed and dismayed to recall the numerous times i have
| heard (and heard of) comments to the effect of "my time is
| too expensive for this meeting" in the workplace... every
| single one of which has come not from the managerial class--
| as one might reasonably, if superficially, expect-- but from
| the software engineer rank and file.
|
| to be clear: i don't think it's fair to expect high-minded
| idealism from _anyone_. but if you are looking for the
| archetypical "company person"... engineers need look no
| further than their fellow podmates / slack-room-mates / etc.
| and thus no one should be surprised to see the state of the
| world we all collectively hath wrought.
| resize2996 wrote:
| I dig your vibe. whaddya working on these days?
| bongodongobob wrote:
| "It's shockingly stable." You're running a soup. I'm not sure
| if this is satire or not. This reminds me of using a plug-in
| light timer to reboot your servers because some java program
| eats all the memory.
| keeganpoppen wrote:
| or installing software to jiggle the mouse every so often so
| that the computer with the spreadsheet that runs the company
| doesn't go to sleep
| Cthulhu_ wrote:
| Still infinitely cheaper than rebuilding the spreadsheet
| tbh.
| HL33tibCe7 wrote:
| Sometimes running a soup is the correct decision
| Cthulhu_ wrote:
| There's nothing as permanent as a temporary solution.
| netdevnet wrote:
| Production environments are full of PoCs that were meant to
| be binned
| netdevnet wrote:
| > It's shockingly stable. So much so that resolving the root
| cause isn't considered a priority and so we've had this running
| for months.
|
| I don't know why my senses tell me that this is wrong even if
| you can afford it
| crabbone wrote:
| Guys might be looking to match the fame of the SolarWinds.
| Retric wrote:
| > I don't know why my senses tell me that this is wrong
|
| The fix is also hiding other issues that show up. So it
| degrades over time and eventually you're stuck trying to
| solve multiple problems at the same time.
| pmarreck wrote:
| ^ This is the problem. Not only that, solving 10 bugs
| (especially those more difficult nondeterministic
| concurrency bugs) at the same time is hideously harder than
| solving 1 at a time.
|
| As a Director of Engineering at my last startup, I had an
| "all hands on deck" policy as soon as any concurrency bug
| was spotted. You do NOT want to let those fester. They are
| nondeterministic, infrequent, and exponentially dangerous
| as more and more appear and are swept under the rug via
| "reset-to-known-good" mitigations.
| cryptonym wrote:
| People will argue you should spend time on something else once
| you put a bandaid on a wooden leg.
|
| You should do a proper risk assessment: such a bug may be
| leveraged by an attacker, or may actually be a symptom of a
| running attack. It may also lead to data corruption or
| exposure. It may mean some parts of the system are poorly
| optimised and over-consuming resources, maybe impacting user
| experience. With a dirty workaround, your technical debt
| increases; expect more and more random issues that require
| aggressive "self-healing".
| kenhwang wrote:
| It's just yet another piece of debt that gets prioritized
| against other pieces of debt. As long as the cost of this
| debt is purely fiscal, it's easy enough to position in the
| debt backlog. Maybe a future piece of debt will increase the
| cost of this. Maybe paying off another piece of debt will
| also pay off some of this. The tech debt payoff
| prioritization process will get to it when it gets to it.
| cryptonym wrote:
| Without proper risk assessment, that's poor management and
| a recipe for disaster. Without that assessment, you don't
| know the "cost", if that can even be measured. Of course
| one can still run a business without doing such risk
| assessment and poorly managing technical debt, just be
| prepared for higher disaster chances.
| whatever1 wrote:
| I think this is a prime example of why the cloud won.
|
| You don't need wizards in your team anymore.
|
| Something seems off in the instance? Just nuke it and spin up
| a new one. Leave the system debugging to the Amazon folks.
| znpy wrote:
| Amazon folks won't debug your code though, they'll just
| happily bill you more.
| snicker7 wrote:
| The point is not to spend time frantically fixing code at 3
| AM.
| chronid wrote:
| This has been done forever. Ops team had cronjobs to restart
| misbehaving applications out of business hours since before I
| started working. In a previous job, the solution for disks
| being full on a VM on-prem (no, not databases) was an
| automatic reimage. I've seen scheduled index rebuilds on
| Oracle. The list goes on.
| braggerxyz wrote:
| > I've seen scheduled index rebuilds on Oracle
|
| If you look into the Oracle DBA handbook, scheduled index
| rebuilds are somewhat recommended. We do it on weekends on
| our Oracle instances. Otherwise you will encounter severe
| performance degradation in tables where data is inserted and
| deleted at high throughput, leading to fragmented indexes.
| And since Oracle 12c with ONLINE REBUILD this is no problem
| anymore, even at peak hours.
| xeromal wrote:
| Rebooting Windows IIS instances every night has been a
| mainstay for most of my career. haha
| l33t7332273 wrote:
| Amazon needs wizards then.
| Gud wrote:
| This is not exactly a new tactic, and not something that
| requires a cloud solution to implement. A randomized
| 'kill -HUP' could do the same thing, for example.
| rsynnott wrote:
| > So much so that resolving the root cause isn't considered a
| priority and so we've had this running for months.
|
| I mean, you probably know this, but sooner or later this
| attitude is going to come back to bite you. What happens when
| you need to do it every hour? Every ten minutes? Every 30
| seconds?
|
| This sort of solution is really only suitable for use as short-
| term life-support; unless you understand exactly what is
| happening (but for some reason have chosen not to fix it), it's
| very, very dangerous.
| actionfromafar wrote:
| In a way, yes. But it's also like a sledge hammer approach to
| stateless design. New code will be built within the
| constraint that stuff will be rebooted fairly often. That's
| not only a bad thing.
| jasonjayr wrote:
| Well that's the thing: a bug that happens every 2 hrs and
| cannot be traced easily gives a developer roughly 4
| opportunities in an 8hr day to reproduce + diagnose.
|
| Once it's happening every 30 seconds, then they have up to
| 120 opportunities per hour, and it'll be fixed that much
| quicker!
| rothron wrote:
| This fix means that you won't notice when you accumulate other
| such resource leaks. When the shit eventually hits the fan,
| you'll have to deal with problems you didn't even know you had.
| DrBazza wrote:
| Sounds like process-level garbage collection. Just kill it and
| restart. Which also sounds like the apocryphal tale about the
| leaky code and the missile.
|
| "This sparked and interesting memory for me. I was once working
| with a customer who was producing on-board software for a
| missile. In my analysis of the code, I pointed out that they
| had a number of problems with storage leaks. Imagine my
| surprise when the customers chief software engineer said "Of
| course it leaks"
|
| He went on to point out that they had calculated the amount of
| memory the application would leak in the total possible flight
| time for the missile and then doubled that number. They added
| this much additional memory to the hardware to "support" the
| leaks. Since the missile will explode when it hits its target
| or at the end of its flight, the ultimate in garbage
| collection is performed without programmer intervention."
|
| https://x.com/pomeranian99/status/858856994438094848
| eschneider wrote:
| At least with the missile case, someone _did the analysis and
| knows exactly what's wrong_ before deciding the "solution"
| was letting the resources leak. That's fine.
|
| What always bothers me, is when (note, I'm not saying this is
| the case for the grandparent comment, but it's implied)
| people don't understand what exactly is broken, but just
| reboot every so often to fix things. :0
|
| For a lot of bugs, there's often the component you see (like
| the obvious resource leak) combined with subtle problems you
| don't see (data corruption, perhaps?) and you won't really
| know until the problem is tracked down.
| xelamonster wrote:
| That's super interesting and I love the idea of physically
| destructive GC. But to me that calculation and tracking
| sounds a lot harder than simply fixing the leaks :)
| braggerxyz wrote:
| > It's shockingly stable. So much so that resolving the root
| cause isn't considered a priority and so we've had this running
| for months.
|
| The trick is to not tell your manager that your bandaid works
| so well, but that it barely keeps the system alive and you need
| to introduce a proper fix. Been doing this for the last 10
| years and we got our system so stable that I haven't had a
| midnight call in the last two years.
| tfandango wrote:
| Classic trick. As a recent dev turned manager, these are the
| kind of things I've had a hard time learning.
| pmarreck wrote:
| Heroku reboots servers every night no matter what stack is
| running on them. Same idea.
|
| The problem is that you merely borrowed yourself some time. As
| time goes on, more inefficiencies/bugs of this nature will
| creep in unnoticed, and some will perhaps silently corrupt
| data before being noticed (!). At that point it is vastly more
| difficult to troubleshoot 10 bugs of varying severity and
| frequency all happening at the same time, forcing you to
| reboot said servers at faster and faster intervals, which
| simultaneously makes the bugs harder to diagnose individually.
|
| > It's shockingly stable.
|
| Well of course it is. You're "turning it off and then on
| again," the classic way to return to a known-good state. It is
| not a root-cause fix though, it is a band-aid.
| lanstin wrote:
| Also, it means you are married to the reboot process. If you
| loose control of your memory management process too much,
| you'll never be able to fix it absent a complete rewrite. I
| worked at a place that had a lot of (c++) CGI programs with a
| shocking level of disregard for freeing memory, but that was
| ok because when the CGI request was over the process
| restarted. But then they reused that same code in SOA/long
| lived services, but they could never have one worker process
| handle more than 10 requests due to memory leaks (and
| inability to re-initialize all the memory used in a request).
| So they could never use in-process caching or any sort of
| optimization that long-lived processes could enable.
| pmarreck wrote:
| I never considered "having to reboot" as "introducing
| another dependency" (in the sense of wanting to keep those
| at a minimum) but sure enough, it is.
|
| Also, great point about (depending on your architecture)
| losing the ability to do things like cache results
| pronoiac wrote:
| I guess this works right up until it doesn't? It's been a
| while, but I've seen AWS hit capacity for a specific instance
| size in a specific availability zone. I remember spot pricing
| being above the on-demand pricing, which might have been part
| of the issue.
| Jnr wrote:
| I am running an old statically compiled perl binary that has a
| memory leak. So every day the container is restarted
| automatically so I would not have to deal with the problem. It
| has been running like this for many many years now.
| xxs wrote:
| They were just lucky not to have data corruption from the
| concurrency issue, given the manifestation was an infinite
| get(). Overall, if you can randomly "kill -9", the case is
| rather trivial.
|
| Likely replacing HashMap with CHM would not solve the
| concurrency issue either, but it'd prevent an infinite loop.
| (Edit) It appears that part is just wrong: "some calls to
| ConcurrentHashMap.get() seemed to be running infinitely." <--
| it's possible for that to happen with a plain HashMap during
| concurrent put(s), but not with ConcurrentHashMap.
| ay wrote:
| it wasn't luck, it was very deliberately engineered for. The
| article does lack a good bit of context about the Netflix
| infra:
|
| https://netflixtechblog.com/the-netflix-simian-army-16e57fba...
|
| https://github.com/Netflix/chaosmonkey
| vladak wrote:
| yep, the link in the "some calls to ConcurrentHashMap.get()
| seemed to be running infinitely." sentence points to
| HashMap.html#get(java.lang.Object)
| xxs wrote:
| I have seen that part myself (infinite loops), also I have
| quite extensive experience with CHM (and HashMap).
|
| Overall such a mistake alone undermines the effort/article.
| conradfr wrote:
| I have a project where one function (reading metadata from an
| Icecast stream [0]) was causing a memory leak and ultimately
| consuming all of it.
|
| I don't remember all the details but I've still not been able
| to find the bug.
|
| But this being in Elixir I "fixed it" with Task, TaskSupervisor
| and try/catch/rescue.
|
| Not really a win but it is still running fine to this day.
|
| [0]
| https://github.com/conradfr/ProgRadio/blob/1fa12ca73a40aedb9...
| cpursley wrote:
| Half of hn posts are people showing off things where they spent
| a herculean amount of effort reinventing something that
| elixir/erlang solved 30+ years ago.
| nesarkvechnep wrote:
| Some are even proud of their ignorance and belittle Erlang
| and Elixir.
| rikthevik wrote:
| I'm fine with it.
|
| If people want to belittle something, either we aren't
| trying to solve the same problem (sure) or they're actively
| turning people away from what could be a serious advantage
| (more for me!)
|
| If the cost of switching wasn't so high, I'd love to write
| Elixir all day. It's a joy.
| girishso wrote:
| Not much familiar with Elixir OTP, but isn't the approach OP took
| similar to Let It Crash philosophy of OTP?
| ramchip wrote:
| Not really, you wouldn't normally kill or restart processes
| randomly in an OTP system. "Let it crash" is more about
| separating error handling from business logic.
| btbuilder wrote:
| I mean, it worked for Boeing[1] too.
|
| 1 -
| https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
| keeganpoppen wrote:
| "worked"
|
| (not that i don't get the sarcasm)
| shiroiushi wrote:
| True, but for somewhat different reasons. For the OP, they take
| this approach because they simply don't know yet what the
| problem is, and it would take some time to track it down and
| fix it and they don't want to bother.
|
| For Boeing, it's probably something fairly simple actually, but
| they don't want to fix it because their software has to go
| through a strict development process based on requirements and
| needing certification and testing, so fixing even a trivial bug
| is extremely time-consuming and expensive, so it's easier to
| just put a directive in the manual saying the equipment needs
| to be power-cycled every so often and let the users deal with
| it. The OP isn't dealing with this kind of situation.
| iLoveOncall wrote:
| Yeah, I had the same issue with the EC2 instance that I use to
| host my personal websites: it would randomly get to 100% CPU
| and become unreachable.
|
| I put a CloudWatch alarm at 90% CPU usage which would trigger a
| reboot (which completed way before anyone would notice any
| downtime).
|
| Never had issues again.
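|
| For reference, the alarm is a one-time setup; a sketch with
| the AWS SDK for Java v2 (instance ID and region are
| placeholders; the console works just as well):
|
|   import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
|   import software.amazon.awssdk.services.cloudwatch.model.ComparisonOperator;
|   import software.amazon.awssdk.services.cloudwatch.model.Dimension;
|   import software.amazon.awssdk.services.cloudwatch.model.Statistic;
|
|   public class RebootOnHighCpu {
|       public static void main(String[] args) {
|           CloudWatchClient cw = CloudWatchClient.create();
|           cw.putMetricAlarm(r -> r
|               .alarmName("reboot-on-high-cpu")
|               .namespace("AWS/EC2")
|               .metricName("CPUUtilization")
|               .dimensions(Dimension.builder()
|                   .name("InstanceId")
|                   .value("i-0123456789abcdef0") // placeholder
|                   .build())
|               .statistic(Statistic.AVERAGE)
|               .period(300)          // 5-minute average
|               .evaluationPeriods(1)
|               .threshold(90.0)      // the 90% CPU trigger
|               .comparisonOperator(ComparisonOperator
|                   .GREATER_THAN_OR_EQUAL_TO_THRESHOLD)
|               // Built-in EC2 action: reboot when it fires.
|               .alarmActions(
|                   "arn:aws:automate:us-east-1:ec2:reboot"));
|       }
|   }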
| iluvcommunism wrote:
| Kill and restart the service. This seems to be the coder solution
| to everything. We do it for our service as well. The programmer
| could fix their stuff but alas, that's too much to ask.
| edf13 wrote:
| Yes - lots of writing for a common solution to a bug...
|
| Memory leaks are often "resolved" this way... until time allows
| for a proper fix.
| Cthulhu_ wrote:
| "Did you try turning it off and on again?"
| jumploops wrote:
| This reminds me of a couple startups I knew running Node.js circa
| ~2014, where they would just restart their servers every night
| due to memory issues.
|
| iirc it was mostly folks with websocket issues, but fixing the
| upstream was harder
|
| 10 years later and specific software has gotten better, but this
| type of problem is certainly still prevalent!
| JonChesterfield wrote:
| Title is grossly misleading.
|
| That Netflix had already built a self-healing system means they
| were able to handle a resource leak by killing random servers
| faster than the resource was leaking.
|
| This post isn't about how they've managed that, it's just showing
| off that their existing system is robust enough that you can do
| hacks like this to it.
| 4star3star wrote:
| Your take is much different than mine. The issue was a
| practical one of sparing people from working too much over one
| weekend since the bug would have to wait until Monday, and the
| author willingly described the solution as the worst.
| rullelito wrote:
| > Why not just reboot them? Terminating was faster.
|
| If you don't know why you should reboot servers/services properly
| instead of terminating them..
| cnity wrote:
| Well, why? This comment seems counter to the now-popular
| "cattle not pets" approach.
| ooFieTh6 wrote:
| state
| merizian wrote:
| This reminds me of LLM pretraining and how there are so many
| points at which the program could fail and so you need clever
| solutions to keep uptime high. And it's not possible to just fix
| the bugs--GPUs will often just crash (e.g. in graphics, if a
| pixel flips the wrong color for a frame, it's fine, whereas such
| things can cause numerical instability in deep learning so ECC
| catches them). You also often have a fixed sized cluster which
| you want to maximize utilization of.
|
| So improving uptime involves holding out a set of GPUs to swap
| out failed ones while they reboot. But also the whole run can
| just randomly deadlock, so you might solve that by listening to
| the logs and restarting after a certain amount of inactivity. And
| you have to be clever with how to save/load checkpoints, since
| that can start to become a huge bottleneck.
|
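| A sketch of such a watchdog (file name and restart hook are
| hypothetical; real setups often hook into the trainer itself):
|
|   import java.nio.file.Files;
|   import java.nio.file.Path;
|   import java.nio.file.Paths;
|   import java.time.Duration;
|   import java.time.Instant;
|
|   public class TrainingWatchdog {
|       public static void main(String[] args) throws Exception {
|           Path log = Paths.get("train.log"); // hypothetical
|           Duration maxSilence = Duration.ofMinutes(10);
|
|           while (true) {
|               Instant lastWrite =
|                   Files.getLastModifiedTime(log).toInstant();
|               if (Duration.between(lastWrite, Instant.now())
|                       .compareTo(maxSilence) > 0) {
|                   // Hypothetical hook: kill the deadlocked
|                   // run and relaunch from the last checkpoint.
|                   // (A real watchdog would also debounce.)
|                   new ProcessBuilder("restart_training.sh")
|                           .inheritIO().start();
|               }
|               Thread.sleep(60_000); // poll once a minute
|           }
|       }
|   }
|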
| After many layers of self healing, we managed to take a vacation
| for a few days without any calls :)
| rvnx wrote:
| Meta has a similar strategy, and this is why memory leak bugs in
| HHVM are not fixed (they consider that instances are going to be
| regularly killed anyway)
| fidotron wrote:
| This is a bit odd coming from the company of chaos engineering -
| has the chaos monkey been abandoned at Netflix?
|
| I have long advocated randomly restarting things with different
| thresholds partly for reasons like this* and to ensure people are
| not complacent wrt architecture choices. The resistance, which
| you can see elsewhere here, is huge, but at scale it will happen
| regardless of how clever you try to be. (A lesson from the erlang
| people that is often overlooked).
|
| * Many moons ago I worked on a video player which had a low level
| resource leak in some decoder dependency. Luckily the leak was
| attached to the process, so it was a simple matter of cycling the
| process every 5 minutes and seamlessly attaching a new one. That
| just kept going for months on end, and eventually the dependency
| vendor fixed the leak, but many years later.
| ricardobeat wrote:
| In cases like this won't Chaos Monkey actually hide the
| problem, since it's basically doing exactly the same as their
| mitigation strategy - randomly restarting services?
| fidotron wrote:
| Right. The point of the question is why not ramp up the
| monkey? They seem to imply it isn't there now, which wouldn't
| surprise me with the cultural shifts that have occurred in
| the tech world.
| louison11 wrote:
| You gotta pick your battles. Part of being in a startup is to be
| comfortable with quick and dirty when necessary. It's when things
| get bigger, too corporate and slow that companies stop moving
| fast.
| Sl1mb0 wrote:
| We are talking about Netflix. You know, the 'N' in FAANG/MAANG
| or whatever.
| afavour wrote:
| As a non-FAANGer Netflix has always intrigued me because of
| this. While Google, Facebook and others seem to have bogged
| themselves down in administrative mess, Netflix still seems
| agile. From the outside at least.
|
| (also worth noting this post seems to be discussing an event
| that occurred many years ago, circa 2011, so might not be a
| reflection of where they are today)
| wil421 wrote:
| Netflix isn't trying to be a search engine, hardware
| manufacturer, consumer cloud provider (email, OneDrive,
| etc), cloud infrastructure provider, and an ad company at
| the same time. Or an Online Walmart who does all the rest
| and more.
| jldugger wrote:
| Netflix is a much smaller enterprise. It got included
| because it was high growth at the time, not because it was
| destined to become a trillion dollar company.
| alecco wrote:
| Self-healing system: increase cluster size and replace servers
| randomly. It works because it was a problem of threads
| occasionally entering an infinite loop but not corrupting data,
| and the whole system can tolerate these kinds of whole-server
| crashes. IMHO an unusual combination of preconditions.
|
| It's not explained why they couldn't write a monitor script
| instead, to find the servers having the issue and kill only
| those.
| 4star3star wrote:
| I think they just needed a quick and dirty solution that was
| good enough for a few days. They figured that for 1% failure
| per hour, they needed to kill x processes every y minutes to
| keep ahead of the failures. I'm sure it would be much more
| efficient but also more complicated to try to target the
| specific failures, and the "good enough" solution was
| acceptable.
| btbuildem wrote:
| On a long enough timescale, everything eventually converges to
| Erlang
| pmarreck wrote:
| Hah, hinted at that in my comment:
| https://news.ycombinator.com/item?id=42126301
|
| It really is a fundamental advantage against the worst kinds of
| this category of bug
| pmarreck wrote:
| I had to deal with a concurrency bug in Ruby once and it was so
| bad* that it pushed me into Elixir, which makes the vast majority
| of concurrency bugs impossible at the language-design level, thus
| enabling more sanity.
|
| Ingeniously simple solution for this particular bug though.
|
| *as I recall, it had to do with merging a regular Hash in the ENV
| with a HashWithIndifferentAccess, which as it turns out was ill-
| conceived at the time and had undefined corner cases (example:
| what should happen when you merge a regular Hash containing
| either a string or symbol key (or both) into a
| HashWithIndifferentAccess containing the same key but internally
| only represented as a string? Which takes precedence was
| undefined at the time.)
| ken47 wrote:
| If the principles of languages like Erlang were taught in
| American schools, things like this would be much less likely
| to occur.
| Silly that Computer Science is regarded more highly by many than
| Software Engineering for Software Engineering jobs.
| nextos wrote:
| Ideas stemming from Erlang and Mozart/Oz are indeed a big blind
| spot in most undergrad programs. Sadly, even in EU all this is
| becoming a niche topic, which is weird as today's applications
| are more concurrent and data-intensive than ever.
| pronoiac wrote:
| I've dealt with something similar. We were able to spin up zombie
| reapers, looking for the cores / CPUs that were pegged at 100%,
| and prioritize the instances that were worst hit.
| TZubiri wrote:
| Netflix is supposed to be the bastion of microservices and the
| trailblazer of all-aws infrastructure.
|
| But as time goes by I just ask, all this work and costs and
| complexity, to serve files? Yeah don't get me wrong, the size of
| the files are really big, AND they are streamed, noted. But it's
| not the programming complexity challenge that one would expect,
| almost all of the complexity seems to stem from metadata like
| when users stop watching, and how to recommend them titles to
| keep them hooked, and when to cut the titles and autoplay the
| next video to make them addicted to binge watching.
|
| Case in point, the blogpost speaks of a CPU concurrency bug and
| clients being servers? But never once refers to an actual
| business domain purpose. Like are these servers even loading
| video content? My bet is they are more on the optimizing
| engagement side of things. And I make this bet knowing that these
| are servers with high video-like load, but I'm confident that
| these guys are juggling 10TB/s of mouse metadata into some ML
| system more than I'm confident that they have some problem with
| the core of their technology which has worked since launch.
|
| As I say this, I know I'm probably wrong, surely the production
| issues are caused by high peak loads like a new episode of the
| latest series or whatever.
|
| I'm all over the place, I just don't like netflix is what I'm
| saying
| PittleyDunkin wrote:
| > But as time goes by I just ask, all this work and costs and
| complexity, to serve files?
|
| You could say the same thing about the entire web.
| nicce wrote:
| Not really. People are not posting data into Netflix. Netflix
| is mostly read-only. That is a huge complexity reducer.
| TZubiri wrote:
| I thought about the complexity in terms of compute, but I
| guess if there's no input then there's no compute possible,
| as all functions are idempotent and static. At the very
| least their results are cacheable, or the input is
| centralized (admins/show producers)
| jacksontheel wrote:
| Every time you like/dislike/watchlist a movie you're
| posting data. When you're watching a movie your progress is
| constantly updated, posting data. Simple stuff but there's
| possibly hundreds of thousands of concurrent users doing
| that at any given moment.
| nicce wrote:
| Yes, but it still accounts for only a fraction of the
| purpose of their infrastructure. There are no hard global
| real-time sync requirements.
|
| > When you're watching a movie your progress is
| constantly updated, posting data
|
| This can be implemented on server side and with read
| requests only.
|
| A proper comparison would be YouTube where people upload
| videos and comment stuff in real-time.
| PittleyDunkin wrote:
| > A proper comparison would be YouTube where people
| upload videos and comment stuff in real-time.
|
| Even in this one sentence you're conflating two types of
| interaction. Surely downloading videos is yet a third,
| and possibly the rest of the assets on the site a fourth.
|
| Why not just say the exact problem you think is worthy of
| discussion with your full chest, if you so clearly have
| one in mind?
| PittleyDunkin wrote:
| Is it? It's pretty rare to download assets from servers
| that you're uploading to. Sometimes you have truly
| interactive app servers but that's a pretty small
| percentage of web traffic. Shared state is not the typical
| problem to solve on the internet, though it is a popular
| one to discuss.
| nicce wrote:
| Whatever your service is, usually the database is the
| bottleneck. The database limits the latency, scaling and
| availability.
|
| Of course, how much, depends on the service.
| Particularly, how much concurrent writing is happening,
| and do you need to update this state globally, in real-
| time as a result of this writing. Also, is local caching
| happening and do you need to invalidate the cache as well
| as a result of this writing.
|
| The most of the relevant problems disappear, if you can
| just replicate most of the data without worrying that
| someone is updating it, and you also don't have cache
| invalidation issues. No race conditions. No real-time
| replication issues.
| PittleyDunkin wrote:
| > Whatever your service is, usually the database is the
| bottleneck. The database limits the latency, scaling and
| availability.
|
| Database-driven traffic is still a tiny percentage of
| internet traffic. It's harder to tell these days with
| encryption but on any given page-load on any project _I've_
| worked on, most of the traffic is in assets, not
| application data.
|
| Now, latency might be a different issue, but it seems
| ridiculous to me to consider "downloading a file" to be a
| niche concern--it's just that most people offload that
| concern to other people.
| nicce wrote:
| > It's harder to tell these days with encryption but on
| any given page-load on any project I've worked on, most
| of the traffic is in assets, not application data.
|
| Yet you have to design the whole infrastructure so that
| that tiny margin works flawlessly, because otherwise
| the service usually is not serving its purpose.
|
| Read-only assets are the easy part, which was my original
| claim.
| toast0 wrote:
| > But as time goes by I just ask, all this work and costs and
| complexity, to serve files?
|
| IMHO, a large amount of the complexity is all the other stuff.
| Account information, browsing movies, recommendations,
| viewed/not/how much seen, steering to local CDN nodes, DRM
| stuff, etc.
|
| The file servers have a lot less complexity; copy content to
| CDN nodes, send the client to the right node for the content,
| serve 400Gbps+ per node. Probably some really interesting stuff
| for their real time streams (but I haven't seen a
| blog/presentation on those)
|
| Transcoding is probably interesting too. Managing job queues
| isn't new, but there's probably some fun stuff around cost
| effectiveness.
| bobdvb wrote:
| Netflix has done massive amounts of work on BSD to improve its
| network throughput; that's part of what enables their file
| delivery from their CDN appliances.
| https://people.freebsd.org/~gallatin/talks/euro2022.pdf
|
| They've also contributed significantly to open source tools for
| video processing, one of the biggest things that stands out is
| probably their VMAF tool for quantifying perceptual quality in
| video. It's probably the best open source tool for measuring
| video quality out there right now.
|
| It's also absolutely true that in any streaming service, the
| orchestration, account management, billing and catalogue
| components are waaaay more complex than actually delivering
| video on-demand. To counter one thing you've said: mouse
| movement... most viewing of premium content isn't done on web
| or even mobile devices. Most viewing time of paid content is
| done on a TV, where you're not measuring focus. But that's just
| a piece of trivia.
|
| As you said, you just don't like them, but they've done a lot
| for the open source community and that should be understood.
| posix_compliant wrote:
| What's neat is that this is a differential equation. If you kill
| 5% of instances each hour, the reduction in bad instances is
| proportional to the current number of bad instances.
|
| i.e.
|
| if bad(t) = fraction of bad instances at time t
|
| and
|
| bad(0) = 0
|
| then
|
| d(bad(t))/dt = -0.05 * bad(t) + 0.01 * (1 - bad(t))
|
| so
|
| bad(t) = 0.166667 - 0.166667 e^(-0.06 t)
|
| Which looks a mighty lot like the graph of bad instances in the
| blog post.
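|
| A quick numerical check of that closed form (simple Euler
| integration of the same ODE; rates as in the comment above):
|
|   public class BadFractionOde {
|       public static void main(String[] args) {
|           double kill = 0.05, fail = 0.01; // per-hour rates
|           double bad = 0.0, dt = 0.001;    // bad(0) = 0
|           for (double t = 0; t < 100; t += dt) {
|               bad += dt * (-kill * bad + fail * (1 - bad));
|           }
|           double exact =
|               (1.0 / 6.0) * (1 - Math.exp(-0.06 * 100));
|           System.out.printf("euler=%.6f exact=%.6f%n",
|                   bad, exact);
|           // Both approach the steady state 0.01/0.06 = 1/6.
|       }
|   }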
| uvdn7 wrote:
| Love it! I wonder if the team knew this explicitly or
| intuitively when they deployed the strategy.
|
| > We created a rule in our central monitoring and alerting
| system to randomly kill a few instances every 15 minutes. Every
| killed instance would be replaced with a healthy, fresh one.
|
| It doesn't look like they worked out the numbers ahead of
| time.
| eigenvalue wrote:
| I understand how their approach worked well enough, but I don't
| get why they couldn't selectively target the VMs that were
| currently experiencing problems rather than randomly select any
| VM to terminate. If they were exhausting all their CPU resources,
| wouldn't that be easy enough to search for using something like
| ansible?
| rdoherty wrote:
| I agree, I've been at places that can tie alerts at a host
| level to an automated task runner. Basically a workflow system
| that gets kicked off on an alert. Alert fires, host is rebooted
| or terminated. Helpful for things like this.
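|
| A sketch of what that targeted variant could look like on AWS
| (SDK v2; instance discovery elided, threshold illustrative):
|
|   import java.time.Instant;
|   import java.time.temporal.ChronoUnit;
|   import java.util.List;
|   import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
|   import software.amazon.awssdk.services.cloudwatch.model.Dimension;
|   import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsResponse;
|   import software.amazon.awssdk.services.cloudwatch.model.Statistic;
|   import software.amazon.awssdk.services.ec2.Ec2Client;
|
|   public class TargetedReaper {
|       public static void main(String[] args) {
|           CloudWatchClient cw = CloudWatchClient.create();
|           Ec2Client ec2 = Ec2Client.create();
|
|           for (String id : candidateInstances()) {
|               GetMetricStatisticsResponse stats =
|                   cw.getMetricStatistics(r -> r
|                       .namespace("AWS/EC2")
|                       .metricName("CPUUtilization")
|                       .dimensions(Dimension.builder()
|                           .name("InstanceId").value(id).build())
|                       .startTime(Instant.now()
|                           .minus(30, ChronoUnit.MINUTES))
|                       .endTime(Instant.now())
|                       .period(300)
|                       .statistics(Statistic.AVERAGE));
|               // Pegged for the whole window: likely a stuck
|               // spin, so reap just this instance.
|               boolean pegged = !stats.datapoints().isEmpty()
|                   && stats.datapoints().stream()
|                       .allMatch(d -> d.average() >= 99.0);
|               if (pegged) {
|                   ec2.terminateInstances(r -> r.instanceIds(id));
|               }
|           }
|       }
|
|       static List<String> candidateInstances() {
|           return List.of(); // stand-in for discovery
|       }
|   }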
| pjdesno wrote:
| Vaguely related anecdote:
|
| 30 years ago or so I worked at a tiny networking company where
| several coworkers came from a small company (call it C) that made
| AppleTalk routers. They recounted being puzzled that their
| competitor (company S) had a reputation for having a rock-solid
| product, but when they got it into the lab they found their
| competitor's product crashed maybe 10 times more often than their
| own.
|
| It turned out that the competing device could reboot faster than
| the end-to-end connection timeout in the higher-level protocol,
| so in practice failures were invisible. Their router, on the
| other hand, took long enough to reboot that your print job or
| file server copy would fail. It was as simple as that, and in
| practice the other product was rock-solid and theirs wasn't.
|
| (This is a fairly accurate summary of what I was told, but
| there's a chance my coworkers were totally wrong. The conclusion
| still stands, I think - fast restarts can save your ass.)
| cruffle_duffle wrote:
| Seems like the next priority would be to make your product
| reboot just as fast as, if not faster than, theirs.
| rtkwe wrote:
| Clearly but maybe the thing that makes your product crash
| less makes it take longer to reboot.
|
| Also, the story isn't that they couldn't; it's that they were
| measuring the actual failure rate, not the effective failure
| rate, because the device could recover faster than the
| failure could cause actual issues.
| kevin_nisbet wrote:
| This is along the lines of how one of the wireless telecom
| products I really liked worked.
|
| Each running process had a backup on another blade in the
| chassis. All internal state was replicated. And the process was
| written in a crash only fashion, anything unexpected happened
| and the process would just minicore and exit.
|
| One day I think I noticed that we had over a hundred thousand
| crashes in the previous 24 hours, but no one complained and we
| just sent over the minicores to the devs and got them fixed. In
| theory some users would be impacted that were triggering the
| crashes, their devices might have a glitch and need to re-
| associate with the network, but the crashes caused no
| widespread impacts in that case.
|
| To this day I'm a fan of crash only software as a philosophy,
| even though I haven't had the opportunity to implement it in
| the software I work on.
| otterley wrote:
| > It was Friday afternoon
|
| > Rolling back was cumbersome
|
| It's a fundamental principle of modern DevOps practice that
| rollbacks should be quick and easy, done immediately when you
| notice a production regression, and ideally automated. And at
| Netflix's scale, one would have wanted this rollout to be done in
| waves to minimize risk.
|
| Apparently this happened back in 2021. Did the team investigate
| later why you couldn't do this, and address it?
| jldugger wrote:
| >It's a fundamental principle of modern DevOps practice that
| rollbacks should be quick and easy
|
| Then DevOps principles are in conflict with reality.
| otterley wrote:
| Go on...
| __turbobrew__ wrote:
| > Could we roll back? Not easily. I can't recall why
|
| I can appreciate the hack to deal with this (I actually came up
| with the same solution in my head as reading) but if you cannot
| rollback and you cannot roll forward you are stuck in a special
| purgatory of CD hell that you should be spending every moment of
| time getting out of before doing anything else.
| Scubabear68 wrote:
| The real key here is to understand Netflix's business, and also
| many social media companies too.
|
| These companies have achieved vast scale because correctness
| doesn't matter that much so long as it is "good enough" for a
| large enough statistical population, and their Devops practices
| and coding practices have evolved with this as a key factor.
|
| It is not uncommon at all for Netflix or Hulu or Facebook or
| Instagram to throw an error or do something bone headed. When it
| happens you shrug and try again.
|
| Now imagine if this was applied to credit card payments systems,
| or your ATM network, or similar. The reality of course is that
| some financial systems do operate this way, but it's recognized
| as a problem and usually gets on people's radar to fix as failed
| transaction rates creep up and it starts costing money or
| clients directly.
|
| "Just randomly kill shit" is perfectly fine in the Netflix world.
| In other domains, not so much (but again it can and will be used
| as an emergency measure!).
___________________________________________________________________
(page generated 2024-11-13 23:01 UTC)