[HN Gopher] Roblox October Outage Postmortem
___________________________________________________________________
Roblox October Outage Postmortem
Author : kbuck
Score : 235 points
Date : 2022-01-20 20:01 UTC (2 hours ago)
(HTM) web link (blog.roblox.com)
(TXT) w3m dump (blog.roblox.com)
| ryanworl wrote:
| It seems that Consul does not have the ability to use the newer
| hashmap implementation of the freelist that Alibaba implemented for
| etcd. I cannot find any reference to setting this option in
| Consul's configuration.
|
| Unfortunate, given it has been around for a while.
|
| https://www.alibabacloud.com/blog/594750
| throwdbaaway wrote:
| I think they just made the switch to the fork that does contain
| the freelist improvement in
| https://github.com/hashicorp/consul/pull/11720
|
| Took a major incident to swallow your pride? (consul, powered
| by go.etcd.io/bbolt)
| ryanworl wrote:
| Is this option enabled by default? I don't think it is, and I
| don't think they actually set it manually anywhere.
|
| EDIT: I think we're talking about two different options. I
| meant the ability to leave sync turned on but change the data
| structure.
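|
| For reference, a minimal sketch of the two separate bbolt knobs in
| question (assuming go.etcd.io/bbolt; whether Consul exposes either
| of them is exactly the open question here):
|
|     package main
|
|     import (
|         "log"
|
|         bolt "go.etcd.io/bbolt"
|     )
|
|     func main() {
|         db, err := bolt.Open("raft.db", 0600, &bolt.Options{
|             // keep persisting the freelist to disk on each commit
|             NoFreelistSync: false,
|             // but store it as a hashmap rather than a flat array
|             FreelistType: bolt.FreelistMapType,
|         })
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer db.Close()
|     }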
| ctvo wrote:
| It's a spicy read. Really could have happened to anyone. All very
| reasonable assumptions and steps taken. You could argue they
| could have more thoroughly load tested Consul, but doubtful any
| of us would have done more due diligence than they did with the
| slow rollout of streaming support.
|
| (Ignoring the points around observability dependencies on the
| system that went down causing the failure to be extended)
| yashap wrote:
| The main mistake IMO is that, the day before the outage, they
| made a significant Consul-related infra change. Then they have
| this massive outage, where Consul is clearly the root cause,
| but nobody ever tries rolling that recent change back? That's
| weird.
|
| I went into more detail here:
| https://news.ycombinator.com/item?id=30015826
|
| The outage occurring could certainly happen to anyone, but it
| taking 72 hours to resolve seems like a pretty fundamental SRE
| mistake. It's also strange that "try rollbacks of changes
| related to the affected system" isn't even acknowledged as a
| learning in their learnings/action items section.
| statguy wrote:
| So the outage lasted 3 days and the postmortem took 3 months!
| koshergweilo wrote:
| Read the article " It has been 2.5 months since the outage.
| What have we been up to? We used this time to learn as much as
| we could from the outage, to adjust engineering priorities
| based on what we learned, and to aggressively harden our
| systems. One of our Roblox values is Respect The Community, and
| while we could have issued a post sooner to explain what
| happened, we felt we owed it to you, our community, to make
| significant progress on improving the reliability of our
| systems before publishing."
|
| They wanted to make sure everything was fixed before publishing
| Operyl wrote:
| They just got out of their busiest time of year, and taking the
| time to write an accurate post mortem with data gleaned
| afterwards seems sensible to me.
| encryptluks2 wrote:
| willcipriano wrote:
| I have this little idea I think about called the "status update
| chain". When I worked in small organizations and we had issues
| the status update chain looked like this: ceo-->me, as the
| organizations got larger the chain got longer first it was
| ceo-->manager-->me then ceo-->director-->manager-->me and so on.
| I wonder how long the status update chains are at companies like
| this? How long does a status update take to make it end to end?
| tacLog wrote:
| I am sorry, I didn't have enough context to understand what
| you're saying.
|
| When you say: status update chain: ceo --> me. What information
| is flowing from the CEO to you? or is it the other way around?
| willcipriano wrote:
| Both directions, he is asking "What is going on" and I am
| telling him.
| wizwit999 wrote:
| > On October 27th at 14:00, one day before the outage, we enabled
| this feature on a backend service that is responsible for traffic
| routing. As part of this rollout, in order to prepare for the
| increased traffic we typically see at the end of the year, we
| also increased the number of nodes supporting traffic routing by
| 50%.
|
| Seems like the smoking gun, this should have been identified and
| rolled back much earlier.
| conorh wrote:
| Excellent write up. Reading a thorough, detailed and open
| postmortem like this makes me respect the company. They may have
| issues but it sounds like the type of company that (hopefully)
| does not blame, has open processes, and looks to improve - the
| type of company I'd want to work for!
| sam0x17 wrote:
| Too bad they exploit young game developers by taking a 75.5%
| cut of their earnings. Big yikes of a red flag for me.
| https://www.nme.com/news/gaming-news/roblox-is-allegedly-exp...
| badcc wrote:
| This % includes cost of all game server hosting, databases,
| memory stores, etc. even with millions of concurrents, app
| store fees, etc. All included in that number. Developer gets
| effectively pure profit for the sole cost of
| programming/designing a great game. Taught me how to program,
| & changed my entire future. Disclosure: My game is one of
| the most popular on the platform.
| ygjb wrote:
| And that's a reasonable decision for an adult to make, and
| if they were targeting an adult developer community.
|
| I don't think anyone objects to adults making that choice
| over say, using Unity or Unreal, and targeting other
| platforms.
|
| In practice, explaining to my son who is growing into an
| avid developer why I won't a) help him build on Roblox, or
| b) fund his objectives of advertising and promoting his
| work in Roblox (by spending Roblox company scrip) on the
| platform has necessitated helping him to learn and
| understand what exploitation means and how to recognize it.
|
| It's a learning experience for him, and a challenging issue
| for me as a technically proficient and financially literate
| parent who actually owns and runs businesses related to
| intellectual property. It's got to be much more painful for
| parents who are lacking in any of those three areas.
| RussianCow wrote:
| Are you really suggesting that Roblox's cut should be
| lower purely because the target market is children? Why?
| If anything, the fact that a kid can code a game in a
| high-level environment and immediately start making money
| --without any of the normal challenges of setting up
| infrastructure, let alone marketing and discovery--is
| _amazing_, and a feat for which Roblox should definitely
| be rewarded.
|
| In any case, what's the alternative? To teach your son
| how to build the game from scratch in Unity, spin up a
| server infrastructure that won't crumble with more than a
| few concurrent players (not to mention the cash required
| for this), figure out distribution, and then actually get
| people to find and play the game? That seems quite
| unreasonable for most children/parents.
|
| If this were easy, a competitor would have come in and
| offered the same service with significantly lower fees.
| adgjlsfhk1 wrote:
| The problem is that Roblox essentially lies to kids (by
| omission) in an attempt to get free labor out of them.
| RussianCow wrote:
| Yes, I agree that the deception is a problem, although I
| admit I'm not well versed in the issue. (I'm watching the
| documentary linked elsewhere now.) But the original claim
| was that they were exploiting young developers by taking
| a big cut of revenues, which I disagree with.
| noobhacker wrote:
| Does your son have other alternatives to learn
| programming and make money other than Roblox?
|
| If there are, then it's a great lesson about looking
| outside of one's immediate circumstance and striving
| towards something better.
| lolinder wrote:
| > And that's a reasonable decision for an adult to make,
| and if they were targeting an adult developer community.
|
| If it's a reasonable decision for an adult to make
| because the trade-offs might be worth it, doesn't that
| mean that it would also be reasonable for a child to make
| the same decision for the same reason?
|
| It's either exploitative or it isn't, the age of the
| developer doesn't alter the trade-offs involved.
| JauntyHatAngle wrote:
| No, because a child is not deemed to have the necessary
| faculties to make these decisions.
|
| The question should not be posed to a child; that is the
| reasoning behind child labour laws, and why we do not have
| children gambling on roulette wheels.
| [deleted]
| DerArzt wrote:
| To add, there is a nice documentary here[1] which also has a
| followup[2] that show even more of the issue at hand. Kids
| making games and only getting 24.5% of the profit is one
| thing, but everything else that Roblox does is much worse.
|
| [1] https://youtu.be/_gXlauRB1EQ
|
| [2] https://youtu.be/vTMF6xEiAaY
| Qualadore wrote:
| The 24.5% cut is fine, you have to consider the 30% app
| store fees for a majority mobile playerbase, all hosting is
| free, moderation is a major expense, and engine and
| platform development.
|
| Successful games subsidize everyone else, which is not
| comparable to Steam or anything else.
|
| Collectible items are fine and can't be exchanged for USD,
| Roblox can't arbitrate developer disputes, "black markets"
| are an extremely tiny niche. A lot of misinformation.
|
| It's annoying to see these videos brought up every single
| time Roblox is mentioned anywhere for these reasons. Part
| of the blame lies with Roblox for one of the worst PR
| responses I have seen in tech, I suppose.
| brimble wrote:
| > The 24.5% cut is fine, you have to consider the 30% app
| store fees for a majority mobile playerbase, all hosting
| is free, moderation is a major expense, and engine and
| platform development.
|
| You have successfully made the case for a 45% fee and
| being considered approximately normal, or a 60% fee and
| being considered pretty high still. 75+% is crazy.
| Qualadore wrote:
| I can't think of any other platform with comparable
| expenses. Traditional game engines have the R&D
| component, but not moderation, developer services, or
| subsidizing games that don't succeed.
|
| It helps that seriously launching a Roblox game costs <
| $1k USD always, usually < $200 USD. It's not easy to
| generate a loss, even when including production costs.
| That's the tradeoff.
| [deleted]
| nostrebored wrote:
| The idea that these children would otherwise be making their
| own games is knowingly, generally wrong.
| munk-a wrote:
| No matter what the cut is I think there are some legitimate
| social questions to ask about whether we want young people to be
| potentially exposed to economic pressure to earn or whether
| we'd rather push back aggressively against youth monetization
| to preserve a childhood where, ideally, children get to play.
|
| I know there are lots of child actors and plenty of household
| situations that make enjoying childhood difficult for many
| youths - but just because we're already bad at a thing
| doesn't mean we should let it get worse. Child labour laws
| were some of the first steps of regulation in the industrial
| revolution because inflation works in such a way where
| opening the door up to child labour can put significant
| financial pressure on families that choose not to participate
| when demand adjusts to that participation being normal.
| Aunche wrote:
| By that logic, Dreams is "exploiting" developers by taking a
| 100% cut of their earnings. Making money isn't the point of
| either of these platforms.
| loceng wrote:
| The solution is creating a competing platform and offering a
| better cut. You up for the task?
|
| Edit to add: lazy people downvote.
| flippinburgers wrote:
| I am naive about the reality on the ground when it comes to
| this issue, but doesn't this hinge on transparency? If they
| can show they are covering costs + the going market rate,
| which seems to be 30% (at best), then wouldn't it be
| reasonable? So "is a 45% cut for infra ok or not" seems to be
| the question.
| perihelions wrote:
| More egregiously, they're (per your article) manipulating
| kids into _buying real ads_ for their creations, with the
| false promise that "you could get rich if you pay us".
|
| > _" As there are no discoverability tools, users are only
| able to see a tiny selection of the millions of experiences
| available. One of the ways to boost discoverability is to pay
| to advertise on the platform using its virtual currency,
| Robux."_
|
| (Note that "virtual" currency is real money, bidirectionally
| exchangeable with USD).
|
| The sales pitch is "get rich fast":
|
| > _" Under the platform's 'Create' tab, it sells the idea
| that users can "make anything", "reach millions of players",
| and "earn serious cash", while its official tutorials and
| support website both "assume" they are looking for help with
| monetisation."_
|
| I agree that this doesn't really look like a labor issue.
| That's a distracting and contentious tangent; it's easier to
| just label it a type of _consumer_ exploitation. (Most of the
| people aren't earning money -- but they are all _paying
| money_.) It's a scam either way.
| tptacek wrote:
| Again, as across-thread: this is a tangent unrelated to the
| actual story, which is interesting for reasons having nothing
| at all to do with Roblox (I'll never use it, but operating
| HashiStack at this scale is intensely relevant to me). We
| need to be careful with tangents like this, because they're
| easier for people to comment on than the vagaries of Raft and
| Go synchronization primitives, and --- as you can see here
| --- they quickly drown out the on-topic comments.
| breakfastduck wrote:
| Or how about giving a free platform to get into games
| development for young people that otherwise wouldn't have
| become interested.
| digitalengineer wrote:
| badcc wrote:
| As one of the top developers on the platform (& 22 y/o,
| taught myself how to program through Roblox, ~13 years ago),
| I can say that it seems a majority of us in the developer
| community are quite unhappy with the image this video
| portrays. We love Roblox.
| dan_pixelflow wrote:
| That's kind of on Roblox then for not answering their
| questions transparently.
| duxup wrote:
| My son loves it, I think it is a great way to learn.
| empressplay wrote:
| I think what bothers me the most is the effective 'pay to
| play' aspect
| [deleted]
| tptacek wrote:
| This is an interesting debate to have somewhere, but it has
| _nothing to do with this thread_. We need to be careful about
| tangents like this, because it's a lot easier to have an
| opinion about the cut a platform should take from UGC than it
| is to have an opinion about Raft, channel-based concurrency, and
| on-disk freelists. If we're not careful, we basically can't
| have the technical threads, because they're crowded out by
| the "easier" debate.
| digitalengineer wrote:
| True, it is off topic to the postmortem. However, the top
| comment talks about wanting to work there. I get it is very
| relevant to see a bigger picture. Personally, I could never
| work for them. I have a kid and the services and culture
| they created around their product is sickening and should
| be made illegal.
| nightpool wrote:
| While I personally think digitalengineer's comment was low-
| effort and knee-jerk, I think this general thread of
| discussion is on topic for the comment replied to, which
| was specifically about how the postmortem increased the
| commenter's respect for Roblox as a company and made them
| want to work there. I think an acceptable compromise
| between "ethical considerations drown out any technical
| discussion" and "any non-technical discussion gets
| downvoted/flagged to oblivion" would be to quarantine
| comments about the ethics of Roblox's business model to a
| single thread of discussion, and this one seems as good as
| any.
| pvg wrote:
| The guidelines and zillions of moderation comments are
| pretty explicit that this doesn't count as 'on topic'. You can
| always hang some rage-subthread off the unrelated
| perfidy, real or perceived, of some entity or another.
| This one is extra tenuous and forced given that 'the type
| of company I'd want to work for' is a generic expression
| of approval/admiration.
| BolexNOLA wrote:
| You've pretty much articulated for me why I've been
| commenting on Reddit less and less frequently.
| duxup wrote:
| I loathe the constant riffing on <related and yet nothing
| indicates it is actually related/> topics.
|
| Sadly it is happening here on HN too, < insert the next
| blurb about corporatism/>
| BolexNOLA wrote:
| Guess we need to find the next space lol
| micromacrofoot wrote:
| Yeah as long as Roblox is exploiting children they're just
| flat-out not respectable. This video is a good look at a
| phenomenon most people are unaware of.
| [deleted]
| charcircuit wrote:
| Players of your game creating content for it is not
| exploitation. It's just how it works in the gaming world.
| When I was a kid I spent time creating a minecraft mod that
| hundreds of people used. Did Mojang or anyone else ever pay
| me? No. I did it because I wanted to.
| jawngee wrote:
| Mojang was likely not selling you on making a mod with
| promises of making money though. Roblox did that, maybe
| they still do it.
| digitalengineer wrote:
| Please review the video. The problem is not 'players
| creating content'.
| [deleted]
| micromacrofoot wrote:
| The way they're paying kids and what they're telling them
| is a big part of the problem... they're pushing a lot of
| the problematic game development industry onto kids that
| are sometimes as young as 10.
|
| If this was free content creation when kids want to do
| it, then it would be an entirely different story.
| ehsankia wrote:
| > the type of company I'd want to work for!
|
| I recommend watching the following:
|
| https://www.youtube.com/watch?v=_gXlauRB1EQ
|
| https://www.youtube.com/watch?v=vTMF6xEiAaY
| ineedasername wrote:
| ">circular dependencies in our observability stack"
|
| This appears to be why the outage was extended, and was
| referenced elsewhere too. It's hard to diagnose something when
| part of the diagnostic tool kit is also malfunctioning.
| AaronFriel wrote:
| This outage has it all: distributed systems, non-uniform memory
| access contention (aka "you wanted scale up? how about instead we
| make your CPU a distributed system that you have to reason
| about?"), a defect in a log-structured merge tree based data
| store, malfunctioning heartbeats affecting scheduling, wow wow
| wow.
|
| Big props to the on-calls during this.
| tacLog wrote:
| > Big props to the on-calls during this.
|
| Kind of curious about this. I know this is probably company
| specific but how do outages get handled at large orgs? Would
| the on-calls have been called in first, and then the rest of
| the relevant team?
|
| Is there a leadership structure that takes command of the
| incident to make big coordinated decisions to manage the risk
| of different approaches?
|
| Would this have represented crunch time to all the relevant
| people or would this be a core team with other people helping
| as needed?
| WaxProlix wrote:
| Oncalls get paged first and then escalate. As they assess
| impact to other teams and orgs, they usually post their
| tickets to a shared space. Once multiple team/org impact is
| determined, leadership and relevant ops groups (networking,
| eg) get pulled in to a call. A single ticket gets designated
| the Master Ticket for the Event, and oncalls dump diagnostic
| info there. Root cause is found (hopefully), affected teams
| work to mitigate while RC team rushes to fix.
|
| The largest of these calls I've seen was well into the
| hundreds of sw engineers, managers, network engineers, etc.
| yazaddaruvala wrote:
| Typically:
|
| Yes. This was a multi-day outage and eventually the oncall
| does need sleep, so you need more of the team to help with
| it. Typically, at any reasonable team, everyone that chipped
| in nights gets to take off equivalent days, and sprint tasks
| are all punted.
|
| Yes. Not just to manage risks, but also to get quick
| prioritization from all teams at the company. "You need
| legal? Ok, meet ..." "You need string translations? Ok
| escalated to ..." "You need financial approval? Ok, looped in
| ..."
|
| Kinda. Definitely would have represented crunch time, but a
| very very demoralizing crunch time. Managers also try to
| insulate most of their teams from it, but everyone pays
| attention anyways. There is no "core team" other than the
| leadership structure from your question 2. Otherwise, it is
| very much "people/teams helping as needed".
| quirino wrote:
| Google has its Site Reliability Engineering book, which might
| answer some of your questions
|
| https://sre.google/sre-book/table-of-contents/
| sjtindell wrote:
| Super interesting. A setup where per-host ipvs or eBPF rules
| handle service discovery seems much more resilient than this
| heavy reliance on a functional Consul service. The team shared a
| great postmortem here. I know the feeling well of testing
| something like a full redeploy and seeing no improvement...easy
| to lose hope at that point. 70+ hours of a full outage, multiple
| failed attempts to restore, has got to add years to your life in
| stress. Well done to all involved.
| johnmarcus wrote:
| aaaalllllllll the way down at the bottom is this gem: > Some core
| Roblox services are using Consul's KV store directly as a
| convenient place to store data, even though we have other storage
| systems that are likely more appropriate.
|
| Yeah, don't use consul as redis, they are not the same.
| stuff4ben wrote:
| But you can... which is what some engineers were thinking. In
| my experience they do this because:
|
| A) they're afraid to ask for permission and would rather ask
| for forgiveness
|
| B) management refused to provision extra infra to support the
| engineers' needs, but they needed to do this "one thing" anyways
|
| C) security was lax and permissions were wide open so people
| just decided to take advantage of it to test a thing that then
| became a feature and so they kept it but "put it on the
| backlog" to refactor to something better later
| stuff4ben wrote:
| Sounds like they need to switch to Kubernetes?
|
| I kid of course. One of the best post-mortems I've seen. I'm sure
| there are K8s horror stories out there of etcd giving up the
| ghost in a similar fashion.
| spydum wrote:
| you joke, but it's precisely this:
|
| >Critical monitoring systems that would have provided better
| visibility into the cause of the outage relied on affected
| systems, such as Consul. This combination severely hampered the
| triage process.
|
| which gives me goosebumps whenever I hear people proselytizing
| running everything on Kubernetes. At some point, it makes good
| sense to keep capabilities isolated from each other, especially
| when those functions are key to keeping the lights on. Mapping
| out system dependencies (either systems, software components,
| etc) is really the soft underbelly of most tech stacks.
| YATA0 wrote:
| >Sounds like they need to switch to Kubernetes?
|
| Hah! Good one!
| schoolornot wrote:
| The one thing you can say about Nomad is that it's generally
| incredibly scalable compared to Kubernetes. At 1000+ nodes over
| multiple datacenters, things in Kube seem to break down.
| tapoxi wrote:
| Do they still? GKE supports 15,000 nodes per cluster.
| samkone wrote:
| Mayhem. Hipsters
| chainwax wrote:
| Love the "Note on Public Cloud", and their stance on owning and
| operating their own hardware in general. I know there have to be
| people thinking this could all be avoided/the blame could be
| passed if they used a public cloud solution. Directly addressing
| that and doubling down on your philosophies is a badass move,
| especially after a situation like this.
| regnull wrote:
| It's weird it took them so long to disable streaming. One of the
| first things you do in this case is roll back the last software
| and config updates, even innocent looking ones.
| yashap wrote:
| That's what stood out to me too. Although they'd been slowly
| rolling it out for a while, their last major rollout was quite
| close to the outage start:
|
| > Several months ago, we enabled a new Consul streaming feature
| on a subset of our services. This feature, designed to lower
| the CPU usage and network bandwidth of the Consul cluster,
| worked as expected, so over the next few months we
| incrementally enabled the feature on more of our backend
| services. On October 27th at 14:00, one day before the outage,
| we enabled this feature on a backend service that is
| responsible for traffic routing. As part of this rollout, in
| order to prepare for the increased traffic we typically see at
| the end of the year, we also increased the number of nodes
| supporting traffic routing by 50%
|
| Consul was clearly the culprit early on, and they'd just made a
| significant Consul-related infrastructure change; you'd think
| rolling that back would be one of the first things you'd try.
| One of the absolute first steps in any outage is "is there any
| recent change we could possibly see causing this? If so, try
| rolling it back."
|
| They've obviously got a lot of strong engineers there, and it's
| easy to critique from the outside, but this certainly struck me
| as odd. Sounds like they never even tried "let's try rolling
| back Consul-related changes", it was more that, 50+ hours into
| a full outage, they'd done some deep profiling, and discovered
| the streaming issue. But IMO root cause analysis is for later,
| "resolve ASAP" is the first response, and that often involves
| rollbacks.
|
| I wonder if this actually hindered their response:
|
| > Roblox Engineering and technical staff from HashiCorp
| combined efforts to return Roblox to service. We want to
| acknowledge the HashiCorp team, who brought on board incredible
| resources and worked with us tirelessly until the issues were
| resolved.
|
| i.e. earlier on, were there HashiCorp peeps saying "naw, we
| tested streaming very thoroughly, can't be that"?
| notacoward wrote:
| In a not-too-distant alternate universe, they made the rookie
| assumption that every change to every system is trivially
| reversible, only to find that it's not always true
| (especially for storage or storage-adjacent systems), and
| ended up making things worse. Naturally, people in alternate-
| universe HN bashed them for that too.
| otterley wrote:
| When you're at Roblox's scale, it is often difficult to know
| in advance whether you will have a lower MTTR by rolling back
| or fixing forward. If it takes you longer to resolve a
| problem by rolling back a significant change than by tweaking
| a configuration file, then rolling back is not the best
| action to take.
|
| Also, multiple changes may have confounded the analysis.
| Adjusting the Consul configuration may have been one of many
| changes that happened in the recent past, and certainly
| changes in client load could have been a possible culprit.
| yashap wrote:
| Some changes are extremely hard to rollback, but this
| doesn't sound like one of them. From their report, sounds
| like the rollback process involved simply making a config
| change to disable the streaming feature, it took a bit to
| roll out to all nodes, and then Consul performance almost
| immediately returned to normal.
|
| Blind rollbacks are one thing, but they identified Consul
| as the issue early on, and clearly made a significant
| Consul config change shortly before the outage started,
| that was also clearly quite reversible. Not even trying to
| roll that back is quite strange to me - that's gotta be
| something you try within the first hour of the outage,
| nevermind the first 50 hours.
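|
| For what it's worth, a rollback like that would presumably have
| been a small config flip on the agents (the key names below are
| my assumption of Consul's streaming settings, not something
| stated in the post, so check the docs for your version):
|
|     # client agents
|     use_streaming_backend = false
|
|     # servers
|     rpc {
|       enable_streaming = false
|     }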
| [deleted]
| Twirrim wrote:
| The post indicates they'd been rolling it out for months, and
| indicates the feature went live "several months ago".
|
| With the behaviour matching other types of degradation
| (hardware), it's entirely reasonable that it could have taken
| quite a while to recognise that software and configuration that
| had proven stable for several months, and was still running,
| wasn't quite so stable as it seemed.
| nightpool wrote:
| Right, but it only went live on the DB that failed the day
| before. Obviously, hindsight is 20/20, but it's strange that
| the oversight didn't rate a mention in the postmortem.
| Twirrim wrote:
| "We enjoyed seeing some of our most dedicated players figure out
| our DNS steering scheme and start exchanging this information on
| Twitter so that they could get "early" access as we brought the
| service back up."
|
| Why do I have a feeling "enjoyed" wasn't really enjoyed so much
| as "WTF", followed by "oh shit..." at the thought that their main
| way to balance load may have gone out the window.
| Symbiote wrote:
| It's difficult to know how quickly word could have spread, but
| I enjoy knowing a few 11 year olds learned something about the
| Internet in order to play a game an hour early.
| jandrese wrote:
| The BoltDB issue seems like straight up bad design. Needing a
| freelist is fine, needing to sync the entire freelist to disk
| after every append is pants on head.
| benbjohnson wrote:
| BoltDB author here. Yes, it is a bad design. The project was
| never intended to go to production but rather it was a port of
| LMDB so I could understand the internals. I simplified the
| freelist handling since it was a toy project. At Shopify, we
| had some serious issues at the time (~2014) with either LMDB or
| the Go driver that we couldn't resolve after several months so
| we swapped out for Bolt. And alas, my poor design stuck around.
|
| LMDB uses a regular bucket for the freelist whereas Bolt simply
| saved the list as an array. It simplified the logic quite a bit
| and generally didn't cause a problem for most use cases. It
| only became an issue when someone wrote a ton of data and then
| deleted it and never used it again. Roblox reported having 4GB
| of free pages which translated into a giant array of 4-byte
| page numbers.
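|
| To put rough numbers on why that array hurts, here is a back-of-
| envelope sketch (the page size is an assumption, not a figure
| from the incident):
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         // An array freelist is rewritten wholesale on every
|         // commit, so all of these entries are re-serialized and
|         // synced on each transaction.
|         freeBytes := int64(4) << 30       // ~4 GB of free pages (reported)
|         pageSize := int64(4096)           // assumed default page size
|         freePages := freeBytes / pageSize // ~1 million page ids
|         fmt.Println("page ids rewritten per commit:", freePages)
|     }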
| tacLog wrote:
| > BoltDB author here.
|
| How does this happen so often? It's awesome to get the
| author's take on things. Also thank you for explaining and
| owning it. Were you part of this incident response?
| otterley wrote:
| I, for one, appreciate you owning this. It takes humility and
| strength of character to admit one's errors. And Heaven knows
| we all make them, large and small.
| kjw wrote:
| I would not have guessed Roblox was on-prem with such little
| redundancy. Later in the post, they address the obvious "why not
| public cloud question"? They argue that running their own
| hardware gives them advantages to cost and performance. But those
| seem irrelevant if usage and revenue go to zero when you can't
| keep a service up. It will be interesting to see how well this
| architectural decision ages if they keep scaling to their
| ambitions. I wonder about their ability to recruit the level of
| talent required to run a service at this scale.
| dylan604 wrote:
| >I wonder about their ability to recruit the level of talent
| required to run a service at this scale.
|
| According to this user's comments, it doesn't look like it'll
| be that tough for them:
|
| https://news.ycombinator.com/item?id=30014748
| nomel wrote:
| > But those seem irrelevant if usage and revenue go to zero
| when you can't keep a service up
|
| You're assuming the average profits lost are more than the
| average cost of doing things differently, which, according to
| their statement, is not the case.
| noahtallen wrote:
| I think the public cloud is a good choice for startups, teams,
| and projects which don't have infrastructure experience. Plenty
| of companies still have their own infrastructure expertise and
| roll their own CDNs, as an example.
|
| Not only can one save a significant amount of money, it can
| also be simpler to troubleshoot and resolve issues when you
| have a simpler backend tech stack. Perhaps that doesn't apply
| in this case, but there are plenty of use cases which don't
| need a hundred micro services on AWS, none of which anyone
| fully understands.
| otterley wrote:
| Since the issue's root cause was a pathological database
| software issue, Roblox would have suffered the same issue in
| the public cloud. (I am assuming for this analysis that their
| software stack would be identical.) Perhaps they would have
| been better off with other distributed databases than Consul
| (e.g., DynamoDB), but at their scale, that's not guaranteed,
| either. Different choices present different potential
| difficulties.
|
| Playing "what-if" thought experiments is fun, but when the
| rubber hits the road, you often find that things that are
| stable for 99.99%+ of load patterns encounter previously
| unforeseen problems once you get into that far-right-hand side
| of the scale. And it's not like we've completely mastered
| squeezing performance out of huge CPU core counts on NUMA
| architectures while avoiding bottlenecking on critical sections
| in software. This shit is hard, man.
| baskethead wrote:
| This is not true if they had handled the rollout properly.
| Companies like Uber have two entirely different data centers
| and during outages they fail over to either datacenter.
|
| Everything is duplicated which is potentially wasteful but
| ensures complete redundancy and it's an insurance policy. If
| you roll out, you roll out to each datacenter separately. So in
| this case rolling out in one complete datacenter and waiting
| a day for their Consul streaming changes probably would have
| caught it.
| otterley wrote:
| The Consul streaming changes were rolled out months before
| the incident occurred.
| Symbiote wrote:
| > So in this case rolling out in one complete datacenter
| and waiting a day for their Consul streaming changes
| probably would have caught it.
|
| But this has nothing to do with cloud vs. colo.
| erwincoumans wrote:
| >> We are working to move to multiple availability zones and data
| centers.
|
| Surprised it was a single availability zone, without redundancy.
| Having multiple fully independent zones seems more reliable and
| failsafe.
| mbesto wrote:
| There have been multiple discussions on HN about cloud vs not
| cloud and there are endless opinions of "cloud is a
| waste blah blah".
|
| This is exactly one of the reasons people go cloud. Introducing
| an additional AZ is a click of a button and some relatively
| trivial infrastructure as code scripting, even at this scale.
|
| Running your own data center and AZ on the other hand requires
| a very tight relationship with your data center provider at
| global scale.
|
| For a platform like Roblox where downtime equals money loss
| (i.e. every hour of the day people make purchases), there is a
| real tangible benefit to using something _like_ AWS. 72 hours of
| downtime is A LOT, and we're talking potentially millions of
| dollars of real value lost and millions more in potential brand
| value lost. I'm not saying definitively they
| would save money (in this case profit impact) by going to AWS,
| but there is definitely a story to be had here.
| treis wrote:
| But it wasn't a hardware issue. It was a software one and
| that would have crossed AZ boundaries.
| mbesto wrote:
| So then why does the post mortem suggest setting up multi-
| az to address the problems they encountered?
| treis wrote:
| I took that to mean sharding Roblox instead of spanning
| it across data center AZs.
| abarringer wrote:
| Was on a call with a bank VP that had moved to AWS. Asked how
| it was going. Said it was going great after six months but just
| learning about availability zones so they were going to have to
| rework a bunch of things.
|
| Astonishing how our important infrastructure is moved to AWS
| with zero knowledge of how AWS works.
| kreeben wrote:
| >> Having multiple fully independent zones seems more reliable
|
| I don't think these independent zones exist. See AWS's recent
| outages, where east cripples west and vice versa.
| count wrote:
| That's not how they work. They exist, and work extremely well
| within their defined engineering / design goals. It's much
| more nuanced than 'everything works independently'.
| kreeben wrote:
| If the design goal of these zones is that they should be
| independent of each other then, no, they do not work
| extremely well.
| Karrot_Kream wrote:
| Availability Zones aren't the same thing as regions. AWS
| regions have multiple Availability Zones. Independent
| availability zones publish lower reliability SLAs, so you
| need to load balance across multiple independent availability
| zones in a region to reach higher reliability. Per AZ SLAs
| are discussed in more detail here [1]
|
| (N.B. I find HN commentary on AWS outages pretty depressing
| because it becomes pretty obvious that folks don't understand
| cloud networking concepts at all.)
|
| [1]: https://aws.amazon.com/compute/sla/
| kreeben wrote:
| >> you need to load balance across multiple independent
| availability zones
|
| The only problem with that is, there are no independent
| availability zones.
|
| What we do have, though, is an architecture where errors
| propagate cross-zone until they can't propagate any
| further, because services can't take any more requests,
| because they froze, because they weren't designed for a
| split brain scenario, and then, half the internet goes
| down.
| outworlder wrote:
| > The only problem with that is, there are no independent
| availability zones.
|
| There are - they can be as independent as you need them
| to be.
|
| Errors won't necessarily propagate cross-zone. If they
| do, someone either screwed up, or they made a trade-off.
| Screwing up is easy, so you need to do chaos testing to
| make sure your system will survive as intended.
| kreeben wrote:
| I'm not talking about my global app. I'm talking about
| the system I deploy to, the actual plumbing, and how a
| huge turd in a western toilet causes east's sewerage
| system to over-flow.
| mlyle wrote:
| > (N.B. I find HN commentary on AWS outages pretty
| depressing because it becomes pretty obvious that folks
| don't understand cloud networking concepts at all.)
|
| What he said was perfectly cogent.
|
| Outages in us-east-1 AZ us-east-1a have caused outages in
| us-west-1a, which is a different region _and_ a different
| AZ.
|
| Or, to put it in the terms of reliability engineering: even
| though these are abstracted as independent systems, in
| reality there are common-mode failures that can cause
| outages to propagate.
|
| So, if you span multiple availability zones, you are not
| spared from events that will impact all of them.
| Karrot_Kream wrote:
| > Or, to put it in the terms of reliability engineering:
| even though these are abstracted as independent systems,
| in reality there are common-mode failures that can cause
| outages to propagate.
|
| It's up to the _user_ of AWS to design around this level
| of reliability. This isn't any different than not using
| AWS. I can run my web business on the super cheap by
| running it out of my house. Of course, then my site's
| availability is based around the uptime of my residential
| internet connection, my residential power, my own ability
| to keep my server plugged into power, and general
| reliability of my server's components. I can try to make
| things more reliable by putting it into a DC, but if a
| backhoe takes out the fiber to that DC, then the DC will
| become unavailable.
|
| It's up to the _user_ to architect their services to be
| reliable. AWS isn't magic reliability sauce you sprinkle
| on your web apps to make them stay up for longer. AWS
| clearly states in their SLA pages what their EC2 instance
| SLAs are in a given AZ; it's 99.5% availability for a
| given EC2 instance in a given region and AZ. This is
| roughly ~1.82 days, or ~ 43.8 hours, of downtime in a
| year. If you add a SPOF around a single EC2 instance in a
| given AZ then your system has a 99.5% availability SLA.
| Remember the cloud is all about leveraging large amounts
| commodity hardware instead of leveraging large, high-
| reliability mainframe style design. This isn't a secret.
| It's openly called out, like in Nishtala et al's "Scaling
| Memcache at Facebook" [1] from 2013!
|
| The background of all of this is that it costs money, in
| terms of knowledgable engineers (not like the kinds in
| this comment thread who are conflating availability zones
| and regions) who understand these issues. Most companies
| don't care; they're okay with being down for a couple
| days a year. But if you want to design high reliability
| architectures, there are plenty of senior engineers
| willing to help, _if_ you're willing to pay their
| salaries.
|
| If you want to come up with a lower cognitive overhead
| cloud solution for high reliability services that's
| economical for companies, be my guest. I think we'd all
| welcome innovation in this space.
|
| [1]: https://www.usenix.org/system/files/conference/nsdi1
| 3/nsdi13...
| mlyle wrote:
| Yes, but the underlying point you're willfully missing
| is:
|
| You can't engineer around AWS AZ common-mode failures
| using AWS.
|
| The moment that you have failures that are not
| independent and common mode, you can't just multiply
| together failure probabilities to know your outage times.
| roughly wrote:
| During a recent AWS outage, the STS service running in
| us-east-1 was unavailable. Unfortunately, all of the
| other _regions_ - not AZs, but _regions_, rely on the STS
| service in us-east-1, which meant that customers which
| had built around Amazon's published reliability model had
| services in every region impacted by an outage in one
| specific availability zone.
|
| This is what kreeben was referring to - not some abstract
| misconception about the difference between AZs and
| Regions, but an actual real world incident in which a
| failure in one AZ had an impact in other Regions.
| Karrot_Kream wrote:
| > Unfortunately, all of the other _regions_ - not AZs,
| but _regions_, rely on the STS service in us-east-1,
| which meant that customers which had built around
| Amazon's published reliability model had services in
| every region impacted by an outage in one specific
| availability zone.
|
| That's not true. STS offers regional endpoints, for
| example if you're in Australia and don't want to pay the
| latency cost to transit to us-east-1 [1]. It's up to the
| user to opt into them though. And that goes back to what
| I was saying earlier, you need engineers willing to read
| their docs closely and architect systems properly.
|
| [1]: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_
| credenti...
| otterley wrote:
| It's more subtle than that.
|
| For high availability, STS offers regional endpoints --
| and AWS recommends using them[1] -- but the SDKs don't
| use them by default. The author of the client code, or
| the person configuring the software, has to enable them.
|
| [1] https://docs.aws.amazon.com/IAM/latest/UserGuide/id_c
| redenti...
|
| (I work for AWS. Opinions are my own and not necessarily
| those of my employer.)
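|
| For anyone curious what opting in looks like in practice, a
| hedged sketch (verify the exact setting names against your SDK
| version's documentation):
|
|     # environment variable
|     export AWS_STS_REGIONAL_ENDPOINTS=regional
|
|     # or in ~/.aws/config
|     [default]
|     sts_regional_endpoints = regional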
| johnmarcus wrote:
| Yup, so true. People think redundant == 100% uptime, or
| that when they advertise 99.9% uptime, it's the same thing
| as 100% minus a tiny bit for "glitches".
|
| It's not. .1% of 365*24 = 87.6 hours of downtime - that's
| over 3 days of complete downtime every year!
|
| For a more complete list of their SLA's for every service:
| https://aws.amazon.com/legal/service-level-
| agreements/?aws-s...
|
| They only refund 100% when they fall below 95% availability!
| The 95-99% band only gets a 30% credit. I believe the real
| target is above 99.9% though, as that results in 0 refund to
| the customer. What that means is, 3 days of downtime is
| acceptable!
|
| Alternatively, you can return to your own datacenter and
| find out first hand that it's not as easy to
| deliver that as you may think. You too will have power
| outages, network provider disruptions, and the occasional
| "oh shit, did someone just kick that power cord out?" or
| complete disk array meltdowns.
|
| Anywho, they have a lot more room in their published SLAs
| than you think.
|
| Edit: as someone correctly pointed out, I made a typo in my
| math. It is only ~9 hours of allotted downtime. Keep in
| mind that this is _per service_ though - meaning each
| service can have a different 9 hours of downtime before
| they need to pay out 10% on that one service. I still stand
| by my statement that their SLAs have a lot of wiggle room
| that people should take more seriously.
| mqnfred wrote:
| Your computation is incorrect, 3 days out of 365 is 1% of
| downtime, not 0.1%. I believe your error stems from
| reporting .1% as 0.1. Indeed:
|
| 0.001 (.1%) * 8760 (365d*24h) = 8.76h
|
| Alternatively, the common industry standard in
| infrastructure (at the place I work, at least) is 4
| nines, so 99.99% availability, which is around 52 mins a
| year or 4 mins a month iirc. There's not as much room as
| you'd think! :)
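|
| A small helper makes the conversion easy to sanity-check (just
| an illustrative sketch, not anything from the article):
|
|     package main
|
|     import (
|         "fmt"
|         "time"
|     )
|
|     // downtimePerYear returns the downtime budget implied by an
|     // availability target, e.g. 0.999 for "three nines".
|     func downtimePerYear(availability float64) time.Duration {
|         return time.Duration((1 - availability) * 365 * 24 * float64(time.Hour))
|     }
|
|     func main() {
|         fmt.Println(downtimePerYear(0.999))  // ~8h45m
|         fmt.Println(downtimePerYear(0.9999)) // ~52m
|         fmt.Println(downtimePerYear(0.995))  // ~43h48m
|     }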
| foobarian wrote:
| > Surprised it was a single availability zone, without
| redundancy. Having multiple fully independent zones seems more
| reliable and failsafe.
|
| It's also a lot more expensive. Probably an order of magnitude
| more expensive than the cost of a 1 day outage
| sam0x17 wrote:
| Most startups I've worked at literally have a script to
| deploy their whole setup to a new region when desired. Then
| you just need latency-based routing running on top of it to
| ensure people are processed in the closest region to them.
| Really not expensive. You can do this with under $200/month
| in terms of complexity and the bandwidth + database costs are
| going to be roughly the same as they normally are because
| you're splitting your load between regions. Now if you
| stupidly just duplicate your current infrastructure entirely,
| yes it would be expensive because you'd be massively
| overpaying on DB.
|
| In theory the only additional cost should be the latency-
| based routing itself, which is $50/month. Other than that,
| you'll probably save money if you choose the right regions.
| Symbiote wrote:
| So Roblox need a button to press to (re)deploy 18,000
| servers and 170,000 containers? They already have multiple
| core data centres, as well as many edge locations.
|
| You will note the problem was with the software provided
| and supported by Hashicorp.
| e4e78a06 wrote:
| Correctly handling failure edge cases in an active-active
| multi-region distributed database requires work. SaaS DBs
| do a lot of the heavy lifting but they are still highly
| configurable and you need to understand the impact of the
| config you use. Not to mention your scale-up runbooks need
| to be established so a stampede from a failure in one
| region doesn't cause the other region to go down. You also
| need to avoid cross-region traffic even though you might
| have stateful services that aren't replicated across
| regions. That might mean changes in config or business
| logic across all your services.
|
| It is absolutely not as simple as spinning up a cluster on
| AWS at Roblox's scale.
| Twirrim wrote:
| Roblox is not a startup, and has a significantly sized
| footprint (18,000 servers isn't something that's just
| available, even within clouds. They're not magically
| scalable places; capacity tends to land just ahead of
| demand). It's not even remotely a simple case of "run
| a script and whee, we have redundancy". There are _lots_ of
| things to consider.
|
| 18k servers is also not cheap, at all. They suggest at
| least some of their clusters are running on 64 cores, some
| on 128. I'm guessing they probably have a fair spread of
| cores.
|
| Just to give a sense of cost, AWS's calculator estimates
| 18,000 32-core instances would set you back $9m per
| month. That's just the EC2 cost, and assuming a lower core
| count is used by other components in the platform. 64 cores
| would bump that to $18m. Per month. Doing nothing but
| sitting there waiting, ready. That's not considering network
| bandwidth costs, load balancers etc. etc.
|
| When you're talking infrastructure on that scale, you have
| to contact cloud companies in advance, and work with them
| around capacity requirements, or you'll find you're barely
| started on provisioning and you won't find capacity
| available (you'll want to on that scale _anyway_ because
| you'll get discounts but it's still going to be very
| expensive)
| bradly wrote:
| Are the same services available in all regions?
|
| Are the same instance sizes available in all regions?
|
| Are there enough instances of the sizes you need?
|
| Do you have reserved instances in the other region?
|
| Are your increased quotas applied to all regions?
|
| What region are your S3 assets in? Are you going to migrate
| those as well?
|
| Is it acceptable for all user sessions to be terminated?
|
| Have you load tested the other region?
|
| How often are you going to test the region fail over?
| Yearly? Quarterly? With every code change?
|
| What is the acceptable RTO and RPO with executives and
| board-members?
|
| And all of that is without thinking about cache warming,
| database migration/mirror/replication, solr indexing (are
| you going to migrate the index or rebuild? Do you know how
| long it takes to rebuild your solr index?).
|
| The startups you worked at probably had different needs than
| Roblox. I was the tech lead on a Rails app that was
| embedded in TurboTax and QuickBooks and was rendered on
| each TT screen transition, and reading your comment in that
| context shows a lot of inexperience in large production
| systems.
| [deleted]
| bradly wrote:
| Yes. If you are running in two zones in the hopes that you
| will be up if one goes down, you need to be handling less
| than 50% load in each zone. If you can scale up fast enough
| for your use case, great. But when a zone goes down and
| everyone is trying to launch in the zone still up, there may
| not be instances for you available at that time. Our site had
| a billion in revenue or something based on a single day, so
| for us it was worth the cost, but it's not easy (or at least it
| wasn't at the time).
| outworlder wrote:
| > It's also a lot more expensive. Probably order of magnitude
| more expensive than the cost of a 1 day outage
|
| Not sure I agree. Yes, network costs are higher, but your
| overall costs may not be depending on how you architect.
| Independent services across AZs? Sure. You'll have multiples
| of your current costs. Deploying your clusters spanning AZs?
| Not that much - you'll pay for AZ traffic though.
| adrr wrote:
| It is when you run your own data centers and have to shell
| out large capital outlays to spin up a new datacenter.
| Symbiote wrote:
| The usual way this works (and I assume this is the case
| for Roblox) is not by constructing buildings, but by
| renting space in someone else's datacentre.
|
| Pretty much every city worldwide has at least one place
| providing power, cooling, racks and (optionally) network.
| You rent space for one or more servers, or you rent
| racks, or parts of a floor, or whole floors. You buy your
| own servers, and either install them yourself, or pay the
| datacentre staff to install them.
| Hamuko wrote:
| How expensive? Remember that the Roblox Corporation does
| about a billion dollars in revenue per year and takes about
| 50% of all revenue developers generate on their platform.
| dev_by_day wrote:
| Right, outages get more expensive the larger you grow. What
| else needs to be thought of is not just the loss of revenue
| for the time your service is down but also it's affect on
| user trust and usability. Customers will gladly leave you
| for a more reliable competitor once they get fed up.
| johnmarcus wrote:
| Multi-AZ is free at Amazon. Having things split amongst 3
| AZs costs no more than having them in a single AZ.
|
| Multi-Region is a different story.
| otterley wrote:
| There are definitely cost and other considerations you have
| to think about when going multi-AZ.
|
| Cross-AZ network traffic has charges associated with it.
| Inter-AZ network latency is higher than intra-AZ latency.
| And there are other limitations as well, such as EBS
| volumes being attachable only to an instance in the same AZ
| as the volume.
|
| That said, AWS does recommend using multiple Availability
| Zones to improve overall availability and reduce Mean Time
| to Recovery (MTTR).
|
| (I work for AWS. Opinions are my own and not necessarily
| those of my employer.)
| znep wrote:
| This is very true, the costs and performance impacts can
| be significant if your architecture isn't designed to
| account for it. And sometimes even if it is.
|
| In addition, unless you can cleanly survive an AZ going
| down, which can take a bunch more work in some cases,
| then being multi-AZ can actually reduce your availability
| by giving more things to fail.
|
| AZs are a powerful tool but are not a no-brainer for
| applications at scale that are not designed for them, it
| is literally spreading your workload across multiple
| nearby data centers with a bit (or a lot) more tooling
| and services to help than if you were doing it in your
| own data centers.
| suifbwish wrote:
| Having bare metal may not be less stress but AWS is by no
| means cheap.
| johnmarcus wrote:
| AWS is not cheap, but splitting amongst AZs is of no
| additional cost.
| orangepurple wrote:
| False
|
| Data Transfer within the same AWS Region Data transferred
| "in" to and "out" from Amazon EC2, Amazon RDS, Amazon
| Redshift, Amazon DynamoDB Accelerator (DAX), and Amazon
| ElastiCache instances, Elastic Network Interfaces or VPC
| Peering connections across Availability Zones in the same
| AWS Region is charged at $0.01/GB in each direction.
|
| https://aws.amazon.com/ec2/pricing/on-
| demand/#Data_Transfer_...
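|
| To put a hypothetical number on it: 100 TB/month of cross-AZ
| traffic at $0.01/GB in each direction works out to roughly
| 100,000 GB x $0.02 = $2,000/month - not free, though small next
| to the compute figures discussed elsewhere in the thread.
| (Illustrative numbers only, not anyone's actual traffic.)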
| vorpalhex wrote:
| I'm more impressed that it hasn't been an issue until now.
| bob1029 wrote:
| > Having multiple fully independent zones seems more reliable
| and failsafe.
|
| This also introduces new modes of failure which did not exist
| before. There are no silver bullets for this problem.
| rhizome wrote:
| There are no silver bullets to _any_ problem, but there are
| other ways of implementing services and architecture that can
| sidestep these things.
| maxclark wrote:
| Not surprised at all. Multi-AZ is a PITA. You'd be surprised how
| many 7fig+/month infras are single region/AZ.
| hedwall wrote:
| A guess would be that game servers are distributed across the
| globe but backend services are in one place. A common pattern
| in game companies.
___________________________________________________________________
(page generated 2022-01-20 23:00 UTC)