[HN Gopher] Roblox has been down for days and it's not because o...
___________________________________________________________________
Roblox has been down for days and it's not because of Chipotle
Author : Terretta
Score : 180 points
Date : 2021-10-31 17:22 UTC (5 hours ago)
(HTM) web link (www.theverge.com)
(TXT) w3m dump (www.theverge.com)
| tetron wrote:
| Been checking the #roblox hashtag on Twitter and the two main
| themes are addicts going through withdrawal and devs saying how
| they wouldn't have their llama appreciation fan site be down this
| long let alone your core business.
| TekMol wrote:
| Has any billion dollar company ever been down for 2 days?
|
| Or is this the highscore?
| [deleted]
| wodenokoto wrote:
| Maersk had their entire computer infrastructure down for 9
| days.
|
| https://www.i-cio.com/management/insight/item/maersk-springi...
| cinntaile wrote:
| The supermarket chain Coop in Sweden was down longer than 2
| days. They closed their shops during the ransomware attack.
| https://www.bbc.com/news/technology-57707530
| [deleted]
| sithlord wrote:
| I don't remember how long Garmin was down, but I feel like it
| had to have been close!
| grayfaced wrote:
| NotPetya really hindered operations for some companies of that
| size. Depends on how you define an outage.
| jlokier wrote:
| Tesco online grocery shopping in the UK was down for about 2
| days recently, and Tesco as a whole has a market cap of $29.1B.
| TekMol wrote:
| Did they publish what happened?
| omnicognate wrote:
| They claimed it was an attack (an "attempt to interfere
| with our systems") but haven't given any details.
| mysterydip wrote:
| I assume you could still go in the store and get groceries
| though, right? Not familiar if they're online-only right now.
| arichard123 wrote:
| Yes, Tesco was basically open the whole time.
| romanhn wrote:
| The lack of communication for an outage this big is absolutely
| shameful. I put this on the leadership, not any of the engineers
| working round the clock. Having been in the middle of a critical
| service outage that lasted over 24 hours I totally get the
| craziness of the situation, but Roblox seriously needs to revisit
| their incident management and customer update process. Even
| though kids are the main consumers of the app, the near total
| silence speaks volumes about their business's lack of
| preparedness for disaster scenarios. If nothing else, I hope
| they'll see this as an opportunity to learn and do better next
| time.
| swiley wrote:
| Meh. This is part of playing a game where all the multiplayer
| goes through one service.
| golemiprague wrote:
| What exactly needs to be communicated? It is just gossip. They
| will fix it when they fix it, they probably don't know when
| exactly and the kids who play it don't care about the details.
| CheezeIt wrote:
| Good grief. It's a game for kids. Nobody needs updates. They
| could even take the weekend off.
| kjaftaedi wrote:
| It might be a game for kids, but their company is publicly
| traded and valued higher than Electronic Arts. (More than
| Ubisoft and Take-Two Interactive combined)
| TeMPOraL wrote:
| You mean a weekend - in plenty of countries, an extended
| weekend due to holidays - in which many kids hoped they could
| play some Roblox?
| jerbearito wrote:
| I can't tell if this aggressively dismissive comment is a
| joke or not.
| charcircuit wrote:
| >It's a game for kids
|
| There are companies that rely on Roblox for hosting their
| games so they can make money. This is the equivalent of cloud
| hosting going down.
| Hamuko wrote:
| I thought they were just making money for Roblox and other
| big tech companies (who take like three fourths of the
| total pie).
| WhisperingShiba wrote:
| I think you have a kernel of truth in your statement, but
| perhaps you are overlooking how silly it is for someone's
| well-being to rely on a couple weeks of video game
| microtransactions.
|
| My heart goes out to the Devs in this category, but
| hopefully this is further impetus not to stake too much on
| tech services.
| kjeetgill wrote:
| It is a business relationship like any other. There's
| always some risk with investing and relying on a business
| partner, but it's also reasonable to have expectations of
| cooperation and good faith.
| WhisperingShiba wrote:
| 100%. Really, I'm more expressing my desire to live in a
| society which doesn't feel a constant (simulated) risk
| just to exist. This economy is just a game we play, and I
| think a lot of people don't know this, and also don't
| want to be part of it.
|
| I think to be working at the top of the Hierarchy of
| needs (in game development, for instance), people should
| demonstrably master the lower levels of the pyramid. We
| really can live in a way, where this is possible. We just
| have to collectively want it.
| gillytech wrote:
| As with all businesses Roblox has a paying customer base and
| the company has a responsibility to those customers.
| notjesse wrote:
| I would also argue a responsibility to their shareholders.
|
| How much revenue would they have missed out on during this
| time? And how much will it affect their longer term growth?
|
| If I were a Roblox shareholder, I would be pretty annoyed
| by the silence.
| newobj wrote:
| It's actually a platform and ecosystem in which some
| people make their entire living.
| intunderflow wrote:
| Roblox made their developer tooling require you to be always
| online and signed in, even though it doesn't actually need this
| to function. This means that all development workflows have been
| bricked during this outage too.
| https://twitter.com/RBXStatus/status/1454815143607607300/pho...
| lotophage wrote:
| Development has been bloxxed
| Matheus28 wrote:
| Fun fact: I operate many web mmo games (or "io games", as people
| like to call them), and traffic is up around 20-100% since the
| Roblox outage started.
| truetraveller wrote:
| This guy's too humble. He's Matheus Valadares, and he's the
| creator of Agar.io [1]. This is the game that pretty much
| started the ".io game" genre [2].
|
| [1]: https://en.wikipedia.org/wiki/Agar.io
|
| [2]: "Around 2015 a multiplayer game, Agar.io, spawned many
| other games with a similar playstyle and .io domain". See
| https://en.wikipedia.org/wiki/.io
| schnebbau wrote:
| That's cool, I love io games. Which ones?
| winrid wrote:
| I see one of their DBs is Mongo. I wonder if they ran into some
| sharding related nightmare.
| amelius wrote:
| Is there a possibility this is caused by ransomware?
| PeterisP wrote:
| From TFA - "We believe we have identified an underlying
| internal cause of the outage with no evidence of an external
| intrusion," says a Roblox spokesperson.
| amelius wrote:
| Thanks, but can we trust any company in this situation to be
| transparent, given that such an attack would also mean that
| user data could have been leaked?
| [deleted]
| PeterisP wrote:
| Reports of IT incidents from public companies tend to be
| obscure, vague, missing or technically true but misleading
| but they almost universally stop short of outright lying
| (mainly because for public company officers lying to
| shareholders i.e. the general public has harsher personal
| risks and consequences than the company losing lots of money or
| even shutting down; being incompetent is completely legal
| but lying to your shareholders about material aspects of
| company finances is a crime and also makes them personally
| financially liable), so if they say that they have seen no
| evidence of an external intrusion, then I would presume
| that it's definitely not ransomware, where signs of
| external intrusion - namely, the ransom demand - would be
| obvious and undeniable.
|
| Perhaps (though IMHO not that likely) it may be some other
| kind of attack e.g. one intended to secretly steal customer
| data, which does not give signs of external intrusion if
| the company doesn't look for them much (and it might have
| motivation to not look very hard), but if it was
| ransomware, I'm quite sure they would not say what they
| said.
| Groxx wrote:
| It's possible, but also that's a theory _entirely_ founded
| in paranoia, as it both needs no evidence and accepts no
| evidence.
|
| _Everything_ could be an attack. So, without further
| evidence in favor of it, it's kinda pointless to wonder
| about, since the answer is always "yes, possibly"
| regardless of what has happened or what anyone has said.
| whimsicalism wrote:
| A ransomware intrusion into a modern large tech company like
| Roblox would be unusual and impressive.
| AlexAndScripts wrote:
| At the same time, Twitch and Kaseya both got broken into
| recently.
| whimsicalism wrote:
| Twitch was not breached by ransomware, and Kaseya is not
| the caliber of company I am discussing.
| mattnewton wrote:
| A two-day outage of a product the size of Roblox is unusual
| too. I hope they publish a postmortem - my guess is some kind
| of database issue to have stayed down this long.
| christkv wrote:
| I feel for the people in the trenches on this one. It's got to
| suck bad.
| tgsovlerkhgsel wrote:
| In the immediate short term, yes, obviously - stress, long
| hours, etc. In the immediate aftermath, probably too:
| Burdensome mandates from management that isn't always aware of
| reality to "make sure this never happens again".
|
| But if the organization is functional, in the medium term, this
| may also mean staffing understaffed teams, hiring SREs, etc. -
| which can mean less stress, no more 24/7 pager duty, better pay
| etc.
| i386 wrote:
| > Burdensome mandates from management that isn't always aware
| of reality to "make sure this never happens again".
|
| When something like this threatens to end the entire party
| (no one pays or gets paid) you god damn want to figure out why
| and not make it happen again. That's not burdensome, that's
| business.
| christkv wrote:
| They are still a public company so I guess we will see the
| damage on Monday. Might be an opportunity to buy low.
| tschellenbach wrote:
| That's a very long outage, I wonder how this happened.
|
| - Perhaps internal systems they've developed, and the people who
| created them left. So it's not just fix the thing, but first
| understand what the thing is doing and then fix it.
| - Data recovery can take forever if you run into edge cases with
| your databases
|
| Anyone found any articles about their architecture?
| ekovarski wrote:
| Hashicorp has a case study with some details of what they use,
|
| https://www.hashicorp.com/case-studies/roblox
| ytjohn wrote:
| > "We have people who are first-time system administrators
| deploying applications, building containers, maintaining
| Nomad. There is a guy on our team who worked in IT help desk
| for eight years - just today he upgraded an entire cluster
| himself."
| sidlls wrote:
| I know everyone has to start somewhere, but this just
| sounds like a sure way to have a platform blow up
| impressively.
| breadzeppelin__ wrote:
| oof (from the link). Debugging nightmare to have a fairly
| inexperienced team trying to diagnose stuff in these
| incredibly complex systems.
|
| > 4 SREs managing Nomad, Consul, and Vault for 11,000+ nodes
| across 22 clusters, serving 420+ internal developers
|
| > "We have people who are first-time system administrators
| deploying applications, building containers, maintaining
| Nomad. There is a guy on our team who worked in the IT help
| desk for eight years -- just today he upgraded an entire
| cluster himself."
| kawsper wrote:
| I bet it is some sort of Vault/Consul shenanigans that's going
| on.
| wernerb wrote:
| When I saw "secret store" I guessed it had to be Vault.
| Vault is amazing but it lets you configure things that can
| blow up on you in X time. For example, issuing 50 secrets per
| second but having every secret expire after a week (or never).
| It would mean (multiple) goroutines per secret checking status
| on the lease. This kind of thing, unfortunately, is easy to
| miss in Vault.
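|
| A rough back-of-envelope for the numbers above, as a sketch only:
| it assumes a constant issue rate and no early revocation, and the
| function name is made up for illustration. It says nothing about
| how Vault tracks leases internally.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def steady_state_leases(issue_rate_per_sec: float, ttl_seconds: int) -> int:
    """Leases alive at once under a constant issue rate.

    Hypothetical helper: applies Little's law (items in system =
    arrival rate * time each item stays), assuming no lease is
    revoked before its TTL runs out.
    """
    return int(issue_rate_per_sec * ttl_seconds)


if __name__ == "__main__":
    # 50 secrets per second, each expiring after one week
    print(steady_state_leases(50, SECONDS_PER_WEEK))
```

| At 50 per second with a one-week TTL that works out to roughly 30
| million live leases, and with no expiry the count never levels off.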
| xakahnx wrote:
| Adding to the speculation here, I'm willing to bet some component
| of their issue is not entirely technical. Regardless of the
| underlying cause (PKI was mentioned), for downtime to last this
| long it almost definitely means some persistent data was lost or
| corrupted. Of course they can recover from a backup (I'm
| confident they have clean backups) but what does that mean for
| the business? "We irrecoverably lost 12 hours of data" could have
| severe implications, for example legal or compliance risks.
| pm90 wrote:
| Why are you confident they have clean backups? It's been my
| experience that backup infrastructure is usually not given much
| thought, and engineers infrequently test that recovery from
| backups works as expected. Not saying that's what it is, but I'm
| not sure it can be ruled out.
| aderdale wrote:
| Seems to work at least partly now. I just jumped into one of the
| more popular games and a bunch of people playing.
| jitl wrote:
| I wonder if they have a major Postgres database that hit
| transaction ID wraparound? Postgres uses int32 for transaction
| IDs, and IDs are only reclaimed by a vacuum maintenance process
| which can fall behind if the DB is under heavy write load. Other
| companies have been bitten by this before, eg Sentry in 2015
| (https://blog.sentry.io/2015/07/23/transaction-id-
| wraparound-...). Depending on the size of the database, you could
| be down several days waiting for Postgres to clean things up.
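|
| To make the wraparound concrete, here is a minimal sketch of
| circular 32-bit ID comparison. It is a simplified model, not
| Postgres source: the function names are made up, and it ignores
| the handful of reserved special IDs.

```python
XID_SPACE = 2 ** 32   # transaction IDs are 32-bit counters that wrap around
HALF_SPACE = 2 ** 31  # roughly 2 billion IDs count as "in the past"


def xid_precedes(a: int, b: int) -> bool:
    """True if a is older than b under modulo-2^32 comparison.

    Equivalent to checking that the signed 32-bit difference (a - b)
    is negative.
    """
    return (a - b) % XID_SPACE >= HALF_SPACE


def xid_age(current: int, oldest_unfrozen: int) -> int:
    """How many IDs have been consumed since oldest_unfrozen."""
    return (current - oldest_unfrozen) % XID_SPACE


if __name__ == "__main__":
    row_xid = 100
    # Shortly after the row is written, it is correctly seen as older.
    print(xid_precedes(row_xid, row_xid + 1_000))
    # More than 2^31 transactions later, the same row no longer compares
    # as older, which is why old rows must be frozen by vacuum in time.
    print(xid_precedes(row_xid, (row_xid + HALF_SPACE + 1) % XID_SPACE))
```

| The second comparison is the failure case: once the counter runs
| more than about 2 billion IDs ahead of an unfrozen row, that row
| would look like it came from the future.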
|
| Even though it's a well documented issue with Postgres and you
| have an experienced team keeping an eye on it, a new write
| pattern could accelerate things into the danger zone quite
| quickly. At Notion we had a scary close call with this about a
| year ago that led to us splitting a production DB over the
| weekend to avoid hard downtime.
|
| Whatever the issue is, I'm wishing the engineers working on it
| all the best.
| AznHisoka wrote:
| That's not a bad theory. Even the homepage is down so that
| suggests their entire database was taken down.
| bink wrote:
| I doubt their homepage is being served from the same DB as
| their game platform.
|
| I think throwing together a static page better than "we're
| making the game more awesome" would be simple. It kinda makes
| me wonder if it's an internal auth/secret issue as has been
| speculated. That could theoretically make it harder to update
| the website, especially if it's deployed by CI/CD.
| mulmen wrote:
| Do you have any reason to believe this is the case?
| [deleted]
| vbg wrote:
| The latest version of Postgres addresses this issue in various
| ways, although it's not entirely solved, it should be
| significantly mitigated.
| jiripospisil wrote:
| Somebody mentioned a secret store issue that affected all of
| their services.
|
| https://news.ycombinator.com/item?id=29044500
| xyst wrote:
| I thought it was a certificate issue? I am looking at robox.com
| and the issuing CA is GoDaddy...
| isuckatcoding wrote:
| Yeah but wouldn't the impact be more widespread then?
| mateo411 wrote:
| Certificate issues don't take that long to resolve.
| dilyevsky wrote:
| They do if your entire PKI infra is down too
| duskwuff wrote:
| A company's internal PKI infrastructure wouldn't be
| responsible for issuing a public-facing certificate. They
| literally can't sign those -- a real CA has to do it.
| dilyevsky wrote:
| You are of course correct but usually public and private
| would reuse some core components of the infra (eg still
| need to store signed key pair somewhere safe). I'm
| speculating here but given how long it's been down some
| very core and very difficult to recover service must have
| failed. Security infra tends to have those properties
| charcircuit wrote:
| Downtime is expensive. You could just bypass your infra
| and manually get it working so that you can fix your
| infra while production is up instead of when it's down.
| crehn wrote:
| That's in fact how most high-impact events should be
| handled: mitigate the issue with a potentially short-term
| solution, once things are back up find the root cause,
| fix the root cause, and perform a thorough analysis of
| events to ensure it won't happen again.
| dilyevsky wrote:
| Depending on the level of automation that may not be
| possible. That's like saying if factory line robot fails
| "you just bypass the line and manually weld those car
| bodies"
| djbusby wrote:
| Wait. You can sign your own. They are just not trusted by
| the wider world. Your devices have an OS provided set of
| trusted root-CA.
| mig39 wrote:
| robox or roblox?
| jitl wrote:
| I'm just speculating. I didn't do any in depth research -
| none of the articles or tweets by Roblox I saw offered
| anything more than "an internal issue".
| hintymad wrote:
| I'd speculate that it's more likely a data corruption problem.
| A system that was overwhelmed or misconfigured led to corruption of
| critical configuration data, which led to propagation of such
| corruption to a large number of dependencies. Roblox tried to
| restore its data from backup, a process that was not
| necessarily rehearsed regularly or rigorously and therefore took
| longer than expected. All other services would have to restore
| their systems in a cascaded fashion while sorting out complex
| dependencies and constraints, which would take days.
| crehn wrote:
| If it's a known issue, is there no way to increase the
| transaction ID size?
|
| Quite surprising a seemingly battle-tested database can choke
| in such a manner.
| semi-extrinsic wrote:
| IDK about Postgres internals, but typically switching to
| int64 means recompiling all your binaries, plus your existing
| data format on disk needs to be converted.
| anyfoo wrote:
| That's assuming a lot, including that the binaries aren't
| 64 bit already (a bit unlikely nowadays), and the database
| wouldn't just use a 32 bit datatype for this specific
| purpose in this specific configuration. (If this issue has
| anything to do with transaction IDs at all, as covered
| elsewhere.)
| jatone wrote:
| they're not wrong. it's not the software that's the issue,
| it's the data stored on disk. the transaction ID is stored
| in the data for various reasons.
|
| in theory nowadays it wouldn't be too hard to change if
| you use logical replication to upgrade the database but
| it'd be a huge undertaking for a lot of companies.
| anyfoo wrote:
| Right, but the "you need to recompile binaries if you
| switch something to 64 bit" part was a bit too general.
| alophawen wrote:
| You seem to confuse 64 bit integers with 64 bit programs
| anyfoo wrote:
| I didn't.
| [deleted]
| mikeklaas wrote:
| I find it pretty odd to speculate that they are experiencing a
| very specific failure mode of a particular database. Do you
| even know whether they use Postgres?
| anyfoo wrote:
| Maybe their load balancer got hit by a 2009 Toyota Camry with
| a sticky accelerator pedal?
| [deleted]
| Threeve303 wrote:
| It is a very unlikely Y2K bug
| 1cvmask wrote:
| Do any other databases have similar issues to Postgres? Or is
| this specific to Postgres?
| ElbertF wrote:
| Years ago I ran into this issue with MySQL, storing four
| billion rows with a unique ID.
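|
| For scale, a quick sketch of where "four billion" comes from. This
| assumes the ID column was a 32-bit unsigned integer, which the
| comment does not actually say; the helper name is hypothetical.

```python
SIGNED_INT32_MAX = 2 ** 31 - 1    # 2,147,483,647
UNSIGNED_INT32_MAX = 2 ** 32 - 1  # 4,294,967,295
UNSIGNED_INT64_MAX = 2 ** 64 - 1  # ceiling after widening to 64 bits


def ids_remaining(next_id: int, max_id: int) -> int:
    """IDs still available before an auto-incrementing column is exhausted."""
    return max(0, max_id - next_id + 1)


if __name__ == "__main__":
    # With 4 billion rows already stored, a 32-bit unsigned column
    # has under 300 million IDs of headroom left.
    print(ids_remaining(4_000_000_000, UNSIGNED_INT32_MAX))
```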
| hourislate wrote:
| A couple of possibilities...
|
| I know they said they weren't hacked but they were hacked.
|
| or
|
| They are completely inept and have no disaster recovery plan in
| place, etc.
| reilly3000 wrote:
| It feels like some kind of catastrophic data loss. I can't
| imagine that app servers or network infrastructure could have
| been the root cause, especially because they are running on AWS
| and there haven't been any reports of outages or other customers
| impacted. Restoring an old backup and rebuilding data from logs
| seems like the only thing that could take so long. That or an
| entirely dysfunctional IT org that can't get out of its own way
| in a crisis.
|
| Best to them.
| bink wrote:
| This article from 2019 suggests they use a mix of cloud and a
| dedicated data center.
|
| https://portworx.com/blog/architects-corner-roblox-runs-
| plat...
| wly_cdgr wrote:
| "Roblox is very popular, especially with kids -- more than 50
| percent of Roblox players are under the age of 13. More than 40
| million people play it daily" ....from misleading logical non
| sequitur to parroting Roblox marketing numbers in under 30 words,
| nice. Verge is such trash
| breakingcups wrote:
| As long as we're speculating... One of the few things I can think
| of that can't reasonably be sped up is data integrity recovery.
| Say some data got in an inconsistent state and now they have to
| manually restore a whole bunch of financial transactions or
| something before opening the game up again, because otherwise
| customers would get very mad at missing stuff they've paid for,
| traded, etc.
|
| If they were to resume the game before resolving these issues,
| they would only exacerbate them, with state moving even further
| from where it was originally.
| g123g wrote:
| Wouldn't it be cheaper to directly compensate such customers
| rather than keeping the whole website down for 3 days?
| bryanrasmussen wrote:
| maybe, but wouldn't you have to recover the data to be
| able to figure out what customers you owed and how much?
| charcircuit wrote:
| Just ask people to send in support tickets
| TeMPOraL wrote:
| Right, but as soon as people get wind of what you're
| doing, you'll be drowning in fraudulent tickets.
| xyst wrote:
| I wonder how the market is going to react when it opens tomorrow.
| I am thinking a quick scalp with weekly $RBLX puts, then when it
| recovers double up on cheap long call options.
| ctvo wrote:
| Pretty remarkable to think the market prices things like this,
| or that public information everyone is privy to gives you an
| edge.
| pkulak wrote:
| I like slots myself.
| cinntaile wrote:
| It went up when it was known that the site was down, so it can
| be a bit hard to predict.
| downrightmike wrote:
| probably a bunch of bots that bought on any news is good
| news.
| jeffal wrote:
| Like when Camelcamelcamel was down for a week in 2019
|
| https://news.ycombinator.com/item?id=19038198
| EastOfTruth wrote:
| Apparently, according to roblox.com, some players are able to
| play: "We are incrementally opening to groups of players and will
| continue rolling out."
|
| https://i.imgur.com/KgDxNsg.png
| abricot wrote:
| They write that, but according to various game Discords it's
| absolutely not true. No one is allowed to log in.
| swatkat wrote:
| https://twitter.com/Bloxy_News/status/1454861081021587456
|
| _"STATUS UPDATE: Roblox is incrementally opening the website to
| groups of users and will continue to open up to more over the
| course of the day..."_
| abricot wrote:
| They write that, but according to various game Discords it's
| absolutely not true. No one is allowed to log in.
| huhtenberg wrote:
| My friend's kid got in half an hour ago.
| fragmede wrote:
| _absolutely?_ how many people are in these discords, and
| where do their players fall in the database sharding key?
| because thundering herd is definitely a problem when letting
| people back on to a system, and oh also we have no idea what
| the underlying issue _actually_ is because Roblox has been
| basically silent this entire time. (Unsubstantiated Internet
| rumor about it being the secrets store also doesn't count.)
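|
| One common way to let users back in gradually, sketched under
| heavy assumptions: nothing here is known about how Roblox does it,
| and the function names and the 100-bucket scheme are invented for
| illustration. Hashing the user ID gives each user a stable bucket,
| so raising the percentage only ever adds users.

```python
import hashlib


def rollout_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_admitted(user_id: str, percent_open: int) -> bool:
    """True if this user falls inside the currently open percentage."""
    return rollout_bucket(user_id) < percent_open


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10_000)]
    for percent in (1, 10, 50, 100):
        admitted = sum(is_admitted(u, percent) for u in users)
        print(percent, admitted)
```

| Under a scheme like this, a Discord full of players who all land
| in closed buckets would see nobody getting in even while the
| rollout is real.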
| yibg wrote:
| You have confirmation 0 Roblox users are able to log in?
| meheleventyone wrote:
| Confirmation on the official account:
| https://twitter.com/roblox/status/1454900890180063238
| chipotle_coyote wrote:
| Well, I'm glad I'm finally off the hook.
| roamerz wrote:
| Internal cause does not necessarily mean a technical mishap. Read
| rogue sysadmin or other employee initiated event.
___________________________________________________________________
(page generated 2021-10-31 23:00 UTC)