[HN Gopher] Roblox has been down for days and it's not because o...
       ___________________________________________________________________
        
       Roblox has been down for days and it's not because of Chipotle
        
       Author : Terretta
       Score  : 180 points
       Date   : 2021-10-31 17:22 UTC (5 hours ago)
        
 (HTM) web link (www.theverge.com)
 (TXT) w3m dump (www.theverge.com)
        
       | tetron wrote:
       | Been checking the #roblox hashtag on Twitter and the two main
       | themes are addicts going through withdrawal and devs saying how
       | they wouldn't have their llama appreciation fan site be down this
       | long let alone your core business.
        
       | TekMol wrote:
       | Has any billion dollar company ever been down for 2 days?
       | 
       | Or is this the highscore?
        
         | [deleted]
        
         | wodenokoto wrote:
         | Maersk had their entire computer infrastructure down for 9
         | days.
         | 
         | https://www.i-cio.com/management/insight/item/maersk-springi...
        
         | cinntaile wrote:
         | The supermarket chain Coop in Sweden was down longer than 2
         | days. They closed their shops during the ransomware attack.
         | https://www.bbc.com/news/technology-57707530
        
         | [deleted]
        
         | sithlord wrote:
         | I don't remember how long Garmin was down, but I feel like it
         | had to have been close!
        
         | grayfaced wrote:
         | Notpetya really hindered operations for some companies of that
         | size. Depends on how you define an outage.
        
         | jlokier wrote:
         | Tesco online grocery shopping in the UK was down for about 2
         | days recently, and Tesco as a whole has a market cap of $29.1B.
        
           | TekMol wrote:
           | Did they publish what happened?
        
             | omnicognate wrote:
             | They claimed it was an attack (an "attempt to interfere
             | with our systems") but haven't given any details.
        
           | mysterydip wrote:
           | I assume you could still go in the store and get groceries
           | though, right? Not familiar if they're online-only right now.
        
             | arichard123 wrote:
             | Yes, Tesco was basically open the whole time.
        
       | romanhn wrote:
       | The lack of communication for an outage this big is absolutely
       | shameful. I put this on the leadership, not any of the engineers
       | working round the clock. Having been in the middle of a critical
       | service outage that lasted over 24 hours I totally get the
       | craziness of the situation, but Roblox seriously needs to revisit
       | their incident management and customer update process. Even
       | though kids are the main consumers of the app, the near total
       | silence speaks volumes about their business's lack of
       | preparedness for disaster scenarios. If nothing else, I hope
       | they'll see this as an opportunity to learn and do better next
       | time.
        
         | swiley wrote:
         | Meh. This is part of playing a game where all the multiplayer
         | goes through one service.
        
         | golemiprague wrote:
         | What exactly needs to be communicated? It is just gossip. They
         | will fix it when they fix it, they probably don't know when
         | exactly and the kids who play it don't care about the details.
        
         | CheezeIt wrote:
         | Good grief. It's a game for kids. Nobody needs updates. They
         | could even take the weekend off.
        
           | kjaftaedi wrote:
           | It might be a game for kids, but their company is publicly
           | traded and valued higher than Electronic Arts. (More than
           | ubisoft and take-two interactive combined)
        
           | TeMPOraL wrote:
           | You mean a weekend - in plenty of countries, an extended
           | weekend due to holidays - in which many kids hoped they could
           | play some Roblox?
        
           | jerbearito wrote:
           | I can't tell if this aggressively dismissive comment is a
           | joke or not.
        
           | charcircuit wrote:
           | >It's a game for kids
           | 
           | There are companies that rely on Roblox for hosting their
           | games so they can make money. This is the equivalent of cloud
           | hosting going down.
        
             | Hamuko wrote:
             | I thought they were just making money for Roblox and other
             | big tech companies (who take like three fourths of the
             | total pie).
        
             | WhisperingShiba wrote:
             | I think you have a kernel of truth in your statement, but
             | perhaps you are overlooking how silly it is for someone's
             | well-being to rely on a couple weeks of video game
             | microtransactions.
             | 
             | My heart goes out to the Devs in this category, but
             | hopefully this is further impetus not to stake too much on
             | tech services.
        
               | kjeetgill wrote:
               | It is a business relationship like any other. There's
               | always some risk with investing and relying on a business
               | partner, but it's also reasonable to have expectations of
               | cooperation and good faith.
        
               | WhisperingShiba wrote:
               | 100%. Really, I'm more expressing my desire to live in a
               | society which doesn't feel a constant (simulated) risk
               | just to exist. This economy is just a game we play, and I
               | think a lot of people don't know this, and also don't
               | want to be part of it.
               | 
               | I think to be working at the top of the Hierarchy of
               | needs (in game development, for instance), people should
               | demonstrably master the lower levels of the pyramid. We
               | really can live in a way, where this is possible. We just
               | have to collectively want it.
        
           | gillytech wrote:
           | As with all businesses Roblox has a paying customer base and
           | the company has a responsibility to those customers.
        
             | notjesse wrote:
             | I would also argue a responsibility to their shareholders.
             | 
             | How much revenue would they have missed out on during this
             | time? And how much will it affect their longer term growth?
             | 
             | If I were a Roblox shareholder, I would be pretty annoyed
             | by the silence.
        
           | newobj wrote:
           | It's actually a platform and ecosystem in which some
           | people make their entire living.
        
       | intunderflow wrote:
       | Roblox made their developer tooling require you to be always
       | online and signed in, even though it doesn't actually need this
       | to function. This means that all development workflows have been
       | bricked during this outage too.
       | https://twitter.com/RBXStatus/status/1454815143607607300/pho...
        
         | lotophage wrote:
         | Development has been bloxxed
        
       | Matheus28 wrote:
       | Fun fact: I operate many web mmo games (or "io games" as people
       | like to call them), and traffic is up around 20-100% since the
       | Roblox outage started.
        
         | truetraveller wrote:
         | This guy's too humble. He's Matheus Valadares, and he's the
         | creator of Agar.io [1]. This is the game that pretty much
         | started the ".io game" genre [2].
         | 
         | [1]: https://en.wikipedia.org/wiki/Agar.io
         | 
         | [2]: "Around 2015 a multiplayer game, Agar.io, spawned many
         | other games with a similar playstyle and .io domain". See
         | https://en.wikipedia.org/wiki/.io
        
         | schnebbau wrote:
         | That's cool, I love io games. Which ones?
        
       | winrid wrote:
       | I see one of their DBs is Mongo. I wonder if they ran into some
       | sharding related nightmare.
        
       | amelius wrote:
       | Is there a possibility this is caused by ransomware?
        
         | PeterisP wrote:
         | From TFA - "We believe we have identified an underlying
         | internal cause of the outage with no evidence of an external
         | intrusion," says a Roblox spokesperson.
        
           | amelius wrote:
           | Thanks, but can we trust any company in this situation to be
           | transparent, given that such an attack would also mean that
           | user data could have been leaked?
        
             | [deleted]
        
             | PeterisP wrote:
             | Reports of IT incidents from public companies tend to be
             | obscure, vague, missing or technically true but misleading
             | but they almost universally stop short of outright lying
             | (mainly because for public company officers lying to
             | shareholders i.e. the general public has harsher personal
             | risks&consequences than the company losing lots of money or
             | even shutting down; being incompetent is completely legal
             | but lying to your shareholders about material aspects of
             | company finances is a crime and also makes them personally
             | financially liable), so if they say that they have seen no
             | evidence of an external intrusion, then I would presume
             | that it's definitely not ransomware, where signs of
             | external intrusion - namely, the ransom demand - would be
             | obvious and undeniable.
             | 
             | Perhaps (though IMHO not that likely) it may be some other
             | kind of attack e.g. one intended to secretly steal customer
             | data, which does not give signs of external intrusion if
             | the company doesn't look for them much (and it might have
             | motivation to not look very hard), but if it was
             | ransomware, I'm quite sure they would not say what they
             | said.
        
             | Groxx wrote:
             | It's possible, but also that's a theory _entirely_ founded
             | in paranoia, as it both needs no evidence and accepts no
             | evidence.
             | 
             |  _Everything_ could be an attack. So, without further
             | evidence in favor of it, it's kinda pointless to wonder
             | about, since the answer is always "yes, possibly"
             | regardless of what has happened or what anyone has said.
        
         | whimsicalism wrote:
         | A ransomware intrusion into a modern large tech company like
         | Roblox would be unusual and impressive.
        
           | AlexAndScripts wrote:
           | At the same time, Twitch and Kaseya both got broken into
           | recently.
        
             | whimsicalism wrote:
             | Twitch was not breached by ransomware, and Kaseya is not
             | the caliber of company I am discussing.
        
           | mattnewton wrote:
           | Two day outage of a product the size of Roblox is unusual
           | too. I hope they publish a postmortem - my guess is some kind
           | of database issue to have stayed down this long.
        
       | christkv wrote:
       | I feel for the people in the trenches on this one. It's got to
       | suck bad.
        
         | tgsovlerkhgsel wrote:
         | In the immediate short term, yes, obviously - stress, long
         | hours, etc. In the immediate aftermath, probably too:
         | Burdensome mandates from management that isn't always aware of
         | reality to "make sure this never happens again".
         | 
         | But if the organization is functional, in the medium term, this
         | may also mean staffing understaffed teams, hiring SREs, etc. -
         | which can mean less stress, no more 24/7 pager duty, better pay
         | etc.
        
           | i386 wrote:
           | > Burdensome mandates from management that isn't always aware
           | of reality to "make sure this never happens again".
           | 
           | When something like this threatens to end the entire party
           | (no one pays or gets paid) you god damn want to figure out why
           | and not make it happen again. That's not burdensome, that's
           | business.
        
           | christkv wrote:
           | They are still a public company so I guess we will see the
           | damage on Monday. Might be an opportunity to buy low.
        
       | tschellenbach wrote:
       | That's a very long outage, I wonder how this happened.
       | 
       | - Perhaps internal systems they've developed, and the people who
       | created them left. So it's not just fix the thing, but first
       | understand what the thing is doing and then fix it.
       | 
       | - Data recovery can take forever if you run into edge cases with
       | your databases
       | 
       | Anyone found any articles about their architecture?
        
         | ekovarski wrote:
         | Hashicorp has a case study with some details of what they use,
         | 
         | https://www.hashicorp.com/case-studies/roblox
        
           | ytjohn wrote:
           | > "We have people who are first-time system administrators
           | deploying applications, building containers, maintaining
           | Nomad. There is a guy on our team who worked in IT help desk
           | for eight years - just today he upgraded an entire cluster
           | himself."
        
             | sidlls wrote:
             | I know everyone has to start somewhere, but this just
             | sounds like a sure way to have a platform blow up
             | impressively.
        
           | breadzeppelin__ wrote:
           | oof (from the link). Debugging nightmare to have a fairly
           | inexperienced team trying to diagnose stuff in these
           | incredibly complex systems.
           | 
           | > 4 SREs managing Nomad, Consul, and Vault for 11,000+ nodes
           | across 22 clusters, serving 420+ internal developers
           | 
           | > "We have people who are first-time system administrators
           | deploying applications, building containers, maintaining
           | Nomad. There is a guy on our team who worked in the IT help
           | desk for eight years -- just today he upgraded an entire
           | cluster himself."
        
         | kawsper wrote:
         | I bet it is some sort of Vault/Consul shenanigans that's going
         | on.
        
           | wernerb wrote:
           | When I saw "secret store" I guessed it had to be Vault.
           | Vault is amazing but it lets you configure things that can
           | blow up on you in X time. For example, issuing 50 secrets per
           | second but having every secret expire after a week (or never).
           | It would mean (multiple) goroutines per secret checking status
           | on the lease. This kind of thing, unfortunately, is easy to
           | miss and can easily happen in Vault.
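
A rough back-of-the-envelope sketch of the lease buildup described in the comment above, assuming a constant issue rate and a fixed TTL. The function name is illustrative and nothing here calls Vault; it only applies Little's law (items in the system = arrival rate x time in the system) to the 50-per-second, one-week figures from the comment.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60  # 604,800 seconds in a week


def steady_state_leases(issue_rate_per_sec: int, ttl_seconds: int) -> int:
    """Leases alive at once when issuance and expiry have balanced out.

    Little's law: outstanding = arrival rate * time each item stays alive.
    """
    return issue_rate_per_sec * ttl_seconds


# Figures from the comment: 50 secrets per second, each expiring after a week.
live = steady_state_leases(50, WEEK_SECONDS)
print(f"{live:,} leases alive at steady state")  # 30,240,000 leases alive at steady state
```

With a TTL of "never" the count has no steady state at all and simply grows with uptime, which is the failure mode the comment warns about.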
        
       | xakahnx wrote:
       | Adding to the speculation here, I'm willing to bet some component
       | of their issue is not entirely technical. Regardless of the
       | underlying cause (PKI was mentioned), for downtime to last this
       | long it almost definitely means some persistent data was lost or
       | corrupted. Of course they can recover from a backup (I'm
       | confident they have clean backups) but what does that mean for
       | the business? "We irrecoverably lost 12 hours of data" could have
       | severe implications, for example legal or compliance risks.
        
         | pm90 wrote:
         | Why are you confident they have clean backups? It's been my
         | experience that backup infrastructure is usually not given much
         | thought, engineers infrequently test that recovery from backups
         | work as expected. Not saying that's what it is but not sure if
         | it can be ruled out.
        
       | aderdale wrote:
       | Seems to work at least partly now. I just jumped into one of the
       | more popular games and a bunch of people playing.
        
       | jitl wrote:
       | I wonder if they have a major Postgres database that hit
       | transaction ID wraparound? Postgres uses int32 for transaction
       | IDs, and IDs are only reclaimed by a vacuum maintenance process
       | which can fall behind if the DB is under heavy write load. Other
       | companies have been bitten by this before, eg Sentry in 2015
       | (https://blog.sentry.io/2015/07/23/transaction-id-
       | wraparound-...). Depending on the size of the database, you could
       | be down several days waiting for Postgres to clean things up.
       | 
       | Even though it's a well documented issue with Postgres and you
       | have an experienced team keeping an eye on it, a new write
       | pattern could accelerate things into the danger zone quite
       | quickly. At Notion we had a scary close call with this about a
       | year ago that led to us splitting a production DB over the
       | weekend to avoid hard downtime.
       | 
       | Whatever the issue is, I'm wishing the engineers working on it
       | all the best.
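
To make the wraparound arithmetic in the comment above concrete, here is a small sketch. It assumes only what the comment states (32-bit transaction IDs that are reclaimed by vacuum) plus the usual circular comparison of such counters; the helper names and the example write rate are made up, and special reserved IDs are ignored. On a real server the age figure could come from a query such as `SELECT max(age(datfrozenxid)) FROM pg_database;`.

```python
XID_SPACE = 2**32    # transaction IDs are 32-bit counters that wrap around
MAX_XID_AGE = 2**31  # at any moment about half the space counts as "the past"


def xid_precedes(a: int, b: int) -> bool:
    """Circular comparison: True if XID a is logically older than XID b."""
    # Equivalent to checking that the signed 32-bit difference a - b is negative.
    return (a - b) % XID_SPACE >= MAX_XID_AGE


def xids_remaining(oldest_unfrozen_age: int) -> int:
    """IDs left before the oldest unfrozen row would appear to be in the future.

    The real server refuses writes somewhat before this limit as a safety margin.
    """
    return MAX_XID_AGE - oldest_unfrozen_age


def hours_until_wraparound(oldest_unfrozen_age: int, xids_per_second: float) -> float:
    """Time left if vacuum makes no progress and the write rate stays constant."""
    return xids_remaining(oldest_unfrozen_age) / xids_per_second / 3600


# Hypothetical numbers: oldest unfrozen XID is 1.5 billion old, 5,000 transactions/s.
print(xids_remaining(1_500_000_000))
print(round(hours_until_wraparound(1_500_000_000, 5000), 1), "hours")
```

The point of the sketch is the shape of the problem: the headroom is fixed, so a new write pattern that multiplies the transaction rate shrinks the time available for vacuum to catch up by the same factor.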
        
         | AznHisoka wrote:
         | That's not a bad theory. Even the homepage is down so that
         | suggests their entire database was taken down.
        
           | bink wrote:
           | I doubt their homepage is being served from the same DB as
           | their game platform.
           | 
           | I think throwing together a static page better than "we're
           | making the game more awesome" would be simple. It kinda makes
           | me wonder if it's an internal auth/secret issue as has been
           | speculated. That could theoretically make it harder to update
           | the website, especially if it's deployed by CI/CD.
        
         | mulmen wrote:
         | Do you have any reason to believe this is the case?
        
           | [deleted]
        
         | vbg wrote:
         | The latest version of Postgres addresses this issue in various
         | ways, although it's not entirely solved, it should be
         | significantly mitigated.
        
         | jiripospisil wrote:
         | Somebody mentioned a secret store issue that affected all of
         | their services.
         | 
         | https://news.ycombinator.com/item?id=29044500
        
         | xyst wrote:
         | I thought it was a certificate issue? I am looking at robox.com
         | and the issuing CA is GoDaddy...
        
           | isuckatcoding wrote:
           | Yeah but wouldn't the impact be more widespread then?
        
           | mateo411 wrote:
           | Certificate issues don't take that long to resolve.
        
             | dilyevsky wrote:
             | They do if your entire PKI infra is down too
        
               | duskwuff wrote:
               | A company's internal PKI infrastructure wouldn't be
               | responsible for issuing a public-facing certificate. They
               | literally can't sign those -- a real CA has to do it.
        
               | dilyevsky wrote:
               | You are of course correct but usually public and private
               | would reuse some core components of the infra (eg still
               | need to store signed key pair somewhere safe). I'm
               | speculating here but given how long it's been down some
               | very core and very difficult to recover service must have
               | failed. Security infra tends to have those properties
        
               | charcircuit wrote:
               | Downtime is expensive. You could just bypass your infra
               | and manually get it working so that you can fix your
               | infra while production is up instead of when it's down.
        
               | crehn wrote:
               | That's in fact how most high-impact events should be
               | handled: mitigate the issue with a potentially short-term
               | solution, once things are back up find the root cause,
               | fix the root cause, and perform a thorough analysis of
               | events to ensure it won't happen again.
        
               | dilyevsky wrote:
               | Depending on the level of automation that may not be
               | possible. That's like saying if factory line robot fails
               | "you just bypass the line and manually weld those car
               | bodies"
        
               | djbusby wrote:
               | Wait. You can sign your own. They are just not trusted by
               | the wider world. Your devices have an OS provided set of
               | trusted root-CA.
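
The trust-store point above can be seen from the client side with Python's standard `ssl` module. This is only a sketch: the commented-out file name `internal-ca.pem` is hypothetical, and nothing here says how Roblox's clients are configured.

```python
import ssl

# The default client context verifies server certificates against the
# platform's trusted root CAs and also checks the hostname.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)
print(ctx.check_hostname)

# A certificate signed by your own CA is rejected by this context unless the
# client explicitly opts in to trusting that CA, for example:
#   ctx.load_verify_locations(cafile="internal-ca.pem")  # hypothetical path
```

That opt-in step is why self-signed certificates work fine for internal services you control but not for a public website whose visitors only have the OS-provided roots.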
        
           | mig39 wrote:
           | robox or roblox?
        
           | jitl wrote:
           | I'm just speculating. I didn't do any in depth research -
           | none of the articles or tweets by Roblox I saw offered
           | anything more than "an internal issue".
        
         | hintymad wrote:
         | I'd speculate that it's more likely a data corruption problem.
         | An overwhelmed or misconfigured system led to corruption of
         | critical configuration data, which led to propagation of such
         | corruption to a large number of dependencies. Roblox tried to
         | restore its data from backup, a process that was not
         | necessarily rehearsed regularly or rigorously and therefore took
         | longer than expected. All other services would have to restore
         | their systems in a cascaded fashion while sorting out complex
         | dependencies and constraints, which would take days.
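
The "cascaded restore" idea in the comment above is essentially a dependency-ordering problem. A minimal sketch, assuming an entirely hypothetical service graph (none of these names come from Roblox), using the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter

# Hypothetical services, each mapped to what must be healthy before it.
depends_on = {
    "secrets-store": set(),
    "service-discovery": {"secrets-store"},
    "database": {"secrets-store"},
    "auth": {"database", "service-discovery"},
    "game-servers": {"auth", "service-discovery"},
    "website": {"auth"},
}


def restore_waves(graph):
    """Group services into waves; services in one wave can be restored in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(restore_waves(depends_on), start=1):
    print(number, wave)
```

Even in this toy graph the restore takes four sequential waves, and each wave can only start once the slowest member of the previous one is verified healthy, which is one way a recovery stretches from hours into days.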
        
         | crehn wrote:
         | If it's a known issue, is there no way to increase the
         | transaction ID size?
         | 
         | Quite surprising a seemingly battle-tested database can choke
         | in such a manner.
        
           | semi-extrinsic wrote:
           | IDK about Postgres internals, but typically switching to
           | int64 means recompiling all your binaries, plus your existing
           | data format on disk needs to be converted.
        
             | anyfoo wrote:
             | That's assuming a lot, including that the binaries aren't
             | 64 bit already (a bit unlikely nowadays), and the database
             | wouldn't just use a 32 bit datatype for this specific
             | purpose in this specific configuration. (If this issue has
             | anything to do with transaction IDs at all, as covered
             | elsewhere.)
        
               | jatone wrote:
               | they're not wrong. it's not the software that's the issue,
               | it's the data stored on disk. the transaction ID is stored
               | in the data for various reasons.
               | 
               | in theory nowadays it wouldn't be too hard to change if
               | you use logical replication to upgrade the database but
               | it'd be a huge undertaking for a lot of companies.
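
A toy illustration of why widening an ID that is stored on disk means rewriting the data and not just the program. The row header below is a made-up, simplified format (an ID followed by a payload length); it is not Postgres's actual tuple layout.

```python
import struct

# Hypothetical row headers: (transaction_id, payload_length), little-endian.
HEADER_32 = struct.Struct("<II")  # 4-byte ID + 4-byte length = 8 bytes
HEADER_64 = struct.Struct("<QI")  # 8-byte ID + 4-byte length = 12 bytes


def write_row(header: struct.Struct, xid: int, payload: bytes) -> bytes:
    """Serialize one row as header followed by payload."""
    return header.pack(xid, len(payload)) + payload


def upgrade_row(old: bytes) -> bytes:
    """Rewrite a row from the 32-bit header format to the 64-bit one."""
    xid, length = HEADER_32.unpack_from(old)
    payload = old[HEADER_32.size:HEADER_32.size + length]
    return HEADER_64.pack(xid, length) + payload


old_row = write_row(HEADER_32, 4_000_000_000, b"hello")
new_row = write_row(HEADER_64, 4_000_000_000, b"hello")

# A reader that expects the wide format misinterprets bytes written in the old one.
misread_xid, _ = HEADER_64.unpack_from(old_row)
print(misread_xid == 4_000_000_000)

# Every existing row has to pass through a conversion like this one.
print(upgrade_row(old_row) == new_row)
```

Multiply that conversion by every row in a multi-terabyte database and the "huge undertaking" in the comment above becomes clear, whether it is done in place or via replication to a new cluster.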
        
               | anyfoo wrote:
               | Right, but the "you need to recompile binaries if you
               | switch something to 64 bit" part was a bit too general.
        
               | alophawen wrote:
               | You seem to confuse 64 bit integers with 64 bit programs
        
               | anyfoo wrote:
               | I didn't.
        
         | [deleted]
        
         | mikeklaas wrote:
         | I find it pretty odd to speculate that they are experiencing a
         | very specific failure mode of a particular database. Do you
         | even know whether they use Postgres?
        
           | anyfoo wrote:
           | Maybe their load balancer got hit by a 2009 Toyota Camry with
           | a sticky accelerator pedal?
        
             | [deleted]
        
             | Threeve303 wrote:
             | It is a very unlikely Y2K bug
        
         | 1cvmask wrote:
         | Do any other databases have similar issues to Postgres? Or this
         | is specific to Postgres?
        
           | ElbertF wrote:
           | Years ago I ran into this issue with MySQL, storing four
           | billion rows with a unique ID.
        
       | hourislate wrote:
       | A couple of possibilities...
       | 
       | I know they said they weren't hacked but they were hacked.
       | 
       | or
       | 
       | They are completely inept and have no disaster recovery plan in
       | place, etc.
        
         | reilly3000 wrote:
         | It feels like some kind of catastrophic data loss. I can't
         | imagine that app servers or network infrastructure could have
         | been the root cause, especially because they are running on AWS
           | and there haven't been any reports of outages or other customers
         | impacted. Restoring an old backup and rebuilding data from logs
         | seems like the only thing that could take so long. That or an
         | entirely dysfunctional IT org that can't get out of its own way
         | in a crisis.
         | 
         | Best to them.
        
           | bink wrote:
           | This article from 2019 suggests they use a mix of cloud and a
           | dedicated data center.
           | 
           | https://portworx.com/blog/architects-corner-roblox-runs-
           | plat...
        
       | wly_cdgr wrote:
       | "Roblox is very popular, especially with kids -- more than 50
       | percent of Roblox players are under the age of 13. More than 40
       | million people play it daily" ....from misleading logical non
       | sequitur to parroting Roblox marketing numbers in under 30 words,
       | nice. Verge is such trash
        
       | breakingcups wrote:
       | As long as we're speculating... One of the few things I can think
       | of that can't reasonably be sped up is data integrity recovery.
       | Say some data got in an inconsistent state and now they have to
       | manually restore a whole bunch of financial transactions or
       | something before opening the game up again, because otherwise
       | customers would get very mad at missing stuff they've paid for,
       | traded, etc.
       | 
       | If they were to resume the game before resolving these issues,
       | they would only exacerbate them, with state moving even further
       | from where it was originally.
        
         | g123g wrote:
         | Wouldn't it be cheaper to directly compensate such customers
         | rather than keeping the whole website down for 3 days?
        
           | bryanrasmussen wrote:
           | maybe, but wouldn't you have to recover the data to be
           | able to figure out what customers you owed and how much?
        
             | charcircuit wrote:
             | Just ask people to send in support tickets
        
               | TeMPOraL wrote:
               | Right, but as soon as people get wind of what you're
               | doing, you'll be drowning in fraudulent tickets.
        
       | xyst wrote:
       | I wonder how the market is going to react when it opens tomorrow.
       | I am thinking a quick scalp with weekly $RBLX puts, then when it
       | recovers double up on cheap long call options.
        
         | ctvo wrote:
         | Pretty remarkable to think the market prices things like this,
         | or that public information everyone is privy to gives you an
         | edge.
        
         | pkulak wrote:
         | I like slots myself.
        
         | cinntaile wrote:
         | It went up when it was known that the site was down, so it can
         | be a bit hard to predict.
        
           | downrightmike wrote:
           | probably a bunch of bots that bought on any news is good
           | news.
        
       | jeffal wrote:
       | Like when Camelcamelcamel was down for a week in 2019
       | 
       | https://news.ycombinator.com/item?id=19038198
        
       | EastOfTruth wrote:
       | Apparently, according to roblox.com, some players are able to
       | play: "We are incrementally opening to groups of players and will
       | continue rolling out."
       | 
       | https://i.imgur.com/KgDxNsg.png
        
         | abricot wrote:
         | They write that, but according to various game Discords it's
         | absolutely not true. No one is allowed to log in.
        
       | swatkat wrote:
       | https://twitter.com/Bloxy_News/status/1454861081021587456
       | 
       |  _" STATUS UPDATE: Roblox is incrementally opening the website to
       | groups of users and will continue to open up to more over the
       | course of the day..."_
        
         | abricot wrote:
         | They write that, but according to various game Discords it's
         | absolutely not true. No one is allowed to log in.
        
           | huhtenberg wrote:
           | My friend's kid got in half an hour ago.
        
           | fragmede wrote:
           | _absolutely?_ how many people are in these discords, and
           | where do their players fall in the database sharding key?
           | because thundering herd is definitely a problem when letting
           | people back on to a system, and oh also we have no idea what
           | the underlying issue _actually_ is because Roblox has been
           | basically silent this entire time. (Unsubstantiated Internet
           | rumor about it being the secrets store also doesn't count.)
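
One common way to reopen a service "to groups of users" without a thundering herd is a deterministic, hash-based percentage gate. The sketch below is an assumption about how such a gate could look; it is not a description of Roblox's actual mechanism, and the function names and user IDs are invented.

```python
import hashlib


def rollout_bucket(user_id: str, buckets: int = 100) -> int:
    """Deterministically map a user to a bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_admitted(user_id: str, percent_open: int) -> bool:
    """True if this user falls inside the currently opened share of traffic."""
    return rollout_bucket(user_id) < percent_open


# Synthetic user IDs: raising the percentage only ever adds users, so nobody
# who got in at 10% is locked out again at 25%.
users = [f"user-{n}" for n in range(10_000)]
for percent in (10, 25, 50, 100):
    admitted = sum(is_admitted(u, percent) for u in users)
    print(f"{percent:>3}% open -> {admitted} of {len(users)} admitted")
```

Under a scheme like this, whether the members of a particular Discord can log in depends on which buckets they hash into, so a handful of reports in either direction says little about the overall state of the rollout.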
        
           | yibg wrote:
           | You have confirmation 0 Roblox users are able to log in?
        
         | meheleventyone wrote:
         | Confirmation on the official account:
         | https://twitter.com/roblox/status/1454900890180063238
        
       | chipotle_coyote wrote:
       | Well, I'm glad I'm finally off the hook.
        
       | roamerz wrote:
       | Internal cause does not necessarily mean a technical mishap. Read
       | rogue sysadmin or other employee initiated event.
        
       ___________________________________________________________________
       (page generated 2021-10-31 23:00 UTC)