[HN Gopher] Feb 27 2023 GCP Outage Incident Report
___________________________________________________________________
Feb 27 2023 GCP Outage Incident Report
Author : fastest963
Score : 93 points
Date : 2023-03-07 16:31 UTC (6 hours ago)
(HTM) web link (status.cloud.google.com)
(TXT) w3m dump (status.cloud.google.com)
| numbsafari wrote:
| Maybe the folks at Fly shouldn't feel so alone.
| bakul wrote:
| To update a critical file _atomically_ we used to first create an
| updated copy on the same filesystem and then rename it. Is this
| not possible on GCP?
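|
| A minimal sketch of the classic pattern in Python (assuming
| POSIX rename semantics; names are illustrative):
|
|     import os
|     import tempfile
|
|     def atomic_write(path, data):
|         # Write to a temp file on the same filesystem, then
|         # rename over the target. rename() replaces the file
|         # atomically on POSIX, so readers see either the old
|         # contents or the new ones, never a torn write.
|         fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
|         try:
|             with os.fdopen(fd, "wb") as f:
|                 f.write(data)
|                 f.flush()
|                 os.fsync(f.fileno())  # persist before renaming
|             os.rename(tmp, path)
|         except BaseException:
|             os.unlink(tmp)
|             raise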
| we_never_see_it wrote:
| [flagged]
| pb7 wrote:
| An ad company with some of the most reliable infrastructure on
| the planet since its inception? Yeah, I think it can.
| nijave wrote:
| Can a book store?
| SteveNuts wrote:
| They already were an infrastructure company; they just realized
| they could sell it to outsiders (as Amazon had already very
| much proved).
| fulafel wrote:
| I wonder how much people in the know really believe in singling
| out a single root cause for these HA system failures.
| notimetorelax wrote:
| Look at the 'Remediation and Prevention' section for the fixes.
| The root cause of an incident is always one thing, but the
| means to prevent it are multiple.
| fulafel wrote:
| Right, but is it the right mental model to demand a single
| root cause as the outcome of the investigation, when there
| are lots of things going wrong, plus process & architecture
| problems that make the involved failures fatal?
| notimetorelax wrote:
| This follows Google's published post mortem template:
| https://sre.google/sre-book/example-postmortem
|
| What you're saying can go into the lessons learned section.
| toast0 wrote:
| Well, I don't think there is any question about it. It can only
| be attributable to human error. This sort of thing has cropped
| up before, and it has always been due to human error.
|
| Done.
| FigmentEngine wrote:
| tl;dr: no system failure is human error. If a human can cause
| this, then your system lacks adequate controls and mechanisms.
| The root cause is the lack of controls, not the human error.
| toast0 wrote:
| Building a system without sufficient controls is a classic
| (human) error in system design.
| nailer wrote:
| It's really odd to see comments like this faded out from
| downvotes. Anyone from the devops, SRE, or distributed systems
| world would ask the same.
|
| For example, why are there no processes to check for snapshot
| integrity, or if there are, why were they not used?
| pgwhalen wrote:
| It's downvoted because it's a straw man; the linked article
| isn't suggesting there was a single root cause.
| verdverm wrote:
| Did you read it?
|
| They pointed out several issues that caused this and several
| mitigations to prevent it from happening again.
|
| I have yet to see anyone else with public RCAs as good as
| Google Cloud
| fulafel wrote:
| Yes - under the root cause heading, isn't the following raised
| as "the" root cause?
|
| > During a routine update to the critical elements snapshot
| data, an incomplete snapshot was inadvertently shared which
| removed several sites from the topology map.
| verdverm wrote:
| Yes, that is one sentence in the full analysis which
| describes the core of what happened. There are other
| sentences which describe several contributing factors.
| sillybov3456 wrote:
| Right, the thing about root causes is that you can always
| keep digging. For instance why was an incomplete snapshot
| shared? And then the why for that why, and on and on until
| you reach the singularity at the beginning of the universe,
| which can logically be the only real root cause of
| anything. Root cause just means whatever is enough to make
| your boss satisfied.
| ccooffee wrote:
| > During a routine update to the critical elements snapshot data,
| an incomplete snapshot was inadvertently shared which removed
| several sites from the topology map.
|
| I wish this went into more detail about how an incomplete
| snapshot was created and how the incomplete snapshot was valid-
| enough to sort-of work.
|
| I'm supposing that whatever interchange format was in use does
| not have any "END" delimiters (e.g. closing quotes/braces), nor
| any checksumming to ensure the entire message was delivered. I'm
| mildly surprised that there wasn't a failsafe to prevent
| automatically replacing a currently-in-use snapshot with one that
| lacks many of the services. (Adding a "type 'yes, I mean this'"
| user interaction widget is my preferred approach to avoid this
| class of problem in admin interfaces.)
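|
| A sketch of that failsafe idea (the threshold and names here
| are hypothetical, not from the report):
|
|     def safe_to_apply(current_sites, new_sites, max_shrink=0.1):
|         # Refuse to automatically replace an in-use snapshot
|         # with one that silently drops many known sites.
|         removed = set(current_sites) - set(new_sites)
|         if len(removed) > max_shrink * len(current_sites):
|             raise ValueError(
|                 f"snapshot drops {len(removed)} sites; "
|                 "refusing without manual confirmation")
|         return True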
| spullara wrote:
| I ran into a problem like this with a service that used YAML
| for their config file. Basically, when I edited and saved it,
| the service would automatically pick up the change and load the
| config. However, the save hadn't completed, so it only read a
| partial file, which was still valid because, YAML.
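|
| A toy reproduction of the failure mode (using PyYAML; the
| actual service and config were of course different):
|
|     import yaml
|
|     full = "replicas: 3\nsites:\n  - us-east\n  - eu-west\n"
|     partial = "".join(full.splitlines(True)[:3])  # torn write
|     print(yaml.safe_load(partial))
|     # {'replicas': 3, 'sites': ['us-east']} -- parses fine,
|     # but half the data is silently gone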
| unxdfa wrote:
| I still like XML and XSD. People look at me these days like
| I'm insane. But in this case a partially loaded XML document
| would not parse let alone pass schema validation.
|
| Again vindicated. YAML needs to go away. It is misery. I'd
| rather have XSLT than Helm templates.
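|
| For contrast, a truncated XML document simply refuses to
| parse (a sketch using lxml; any strict parser behaves the
| same):
|
|     from lxml import etree
|
|     doc = b"<sites><site>us-east</site><site>eu-we"
|     try:
|         etree.fromstring(doc)
|     except etree.XMLSyntaxError as e:
|         print("rejected:", e)  # truncation is a hard error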
| piva00 wrote:
| Both you and I are very aware that the XSLT + XML config
| file would have a tool for translating/generating it from a
| YAML; and most users would use that tool instead of
| configuring in XML.
| unxdfa wrote:
| Not if I amble around the office with a chair leg making
| menacing grunts they won't.
| bragr wrote:
| >I'm supposing that whatever interchange format was in use does
| not have any "END" delimiters (e.g. closing quotes/braces), nor
| any checksumming to ensure the entire message was delivered.
|
| Those only ensure you get the whole message, not that the
| message makes sense.
| ccooffee wrote:
| Quite true. I was assuming that the incomplete snapshot was a
| transmission error or storage error. It's quite possible that
| the bug was an error of omission inside the data itself (e.g.
| someone accidentally removed an important key-value mapping
| from the data generator).
| jgrahamc wrote:
| One of the first things I did at Cloudflare was change the
| format of a file that contained vital information about the
| mapping between IPs and zones (no longer used these days) so
| that it had something like:
|
|     START <count> <sha1>
|     . . .
|     END
|
| because it was just slurped up line by line assuming EOF was
| good.
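|
| The reading side then validates the framing instead of
| trusting EOF; roughly (illustrative, not the actual code):
|
|     import hashlib
|
|     def load_framed(path):
|         lines = open(path).read().splitlines()
|         if not lines or not lines[0].startswith("START "):
|             raise ValueError("missing START header")
|         _, count, digest = lines[0].split()
|         if lines[-1] != "END":
|             raise ValueError("missing END (truncated?)")
|         body = lines[1:-1]
|         if len(body) != int(count):
|             raise ValueError("record count mismatch")
|         actual = hashlib.sha1(
|             "\n".join(body).encode()).hexdigest()
|         if actual != digest:
|             raise ValueError("checksum mismatch")
|         return body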
| cheeselip420 wrote:
| One reason why JSON is superior to things like TOML or YAML
| for these use cases...
| e12e wrote:
| Not to worry, JSONL[1] fixes that ;)
|
| In all seriousness - just dropping a 1 GB JSON file with N
| million records in one end probably isn't great either. I
| suppose one could somehow marry JS and subresource
| integrity protection[2] to get a JSON-serialized structure
| with a hash integrity check. It would probably be a terrible
| idea.
|
| [1] https://jsonlines.org/
|
| [2] https://developer.mozilla.org/en-
| US/docs/Web/Security/Subres...
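|
| Rolling a poor man's version of that is easy enough (a toy
| sketch, not actual subresource integrity):
|
|     import hashlib
|     import json
|
|     def wrap(payload):
|         body = json.dumps(payload, sort_keys=True)
|         digest = hashlib.sha256(body.encode()).hexdigest()
|         return json.dumps({"sha256": digest, "payload": body})
|
|     def unwrap(blob):
|         outer = json.loads(blob)  # truncated JSON dies here
|         body = outer["payload"]
|         actual = hashlib.sha256(body.encode()).hexdigest()
|         if actual != outer["sha256"]:
|             raise ValueError("integrity check failed")
|         return json.loads(body)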
| ehPReth wrote:
| > Text editing programs call the first line of a text
| file "line 1". The first value in a JSON Lines file
| should also be called "value 1".
|
| I wonder why not zero?
| metadat wrote:
| In my experience, this has been called "ndjson", or
| newline-delimited JSON. It's a remarkably effective and
| useful pattern.
|
| Apparently jsonlines is only a website, whereas ndjson is
| a website+spec, though given the conceptual simplicity that
| distinction seems a bit dubious. Having two of these identical
| things, each one clawing for mindshare, is dumb and
| counter-productive. Why can't we all be friendly and get
| along for the greater good? Oh well, it's par for the
| course.
|
| http://ndjson.org/
| sitkack wrote:
| Every mission critical file eventually gets version numbers
| and checksums.
| kridsdale1 wrote:
| And eventually a blockchain
| smt88 wrote:
| Files can be immutable, cryptographically-verifiable, and
| distributed across servers without a blockchain.
|
| Adding a blockchain to this would slow things down, add
| surface area for errors, and provide absolutely no value.
| This is true of every usage of blockchain other than
| creating tokens.
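|
| Plain content addressing already gives you immutable,
| verifiable blobs (a minimal sketch):
|
|     import hashlib
|
|     def put(store, data):
|         # The key *is* the hash of the contents, so a blob
|         # can't change without its address changing too.
|         key = hashlib.sha256(data).hexdigest()
|         store[key] = data
|         return key
|
|     def get(store, key):
|         data = store[key]
|         if hashlib.sha256(data).hexdigest() != key:
|             raise ValueError("blob corrupted in storage")
|         return data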
| dekhn wrote:
| No. That's not how it would be done at Google. They invented a
| binary protocol to handle things like this reliably. And that
| protocol goes over an integrity-checking network transport.
| It's more likely an odd edge case occurred and somehow got past
| the normal QC checks - say, an RPC returned an error but the
| error handler didn't do a retry, it just fell through.
| omoikane wrote:
| End markers will help detect truncated files but not other
| kinds of brokenness; you need checksums for those.
|
| Also, I think the config files for certain off-the-shelf
| network devices don't come with end markers or checksums, and
| might not be very good at handling corrupted configs. The usual
| practice is to push updates to a small fraction of those, then
| wait and see what happens before proceeding further.
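|
| That practice sketches out to something like this (the
| device API here is hypothetical):
|
|     import time
|
|     def staged_push(devices, config, canary_fraction=0.05,
|                     soak_seconds=600):
|         # Push to a small canary slice, wait and watch, then
|         # continue to the rest of the fleet.
|         n = max(1, int(len(devices) * canary_fraction))
|         canary, rest = devices[:n], devices[n:]
|         for dev in canary:
|             dev.apply(config)
|         time.sleep(soak_seconds)
|         if any(dev.unhealthy() for dev in canary):
|             for dev in canary:
|                 dev.rollback()
|             raise RuntimeError("canary failed; halting push")
|         for dev in rest:
|             dev.apply(config)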
| zamnos wrote:
| Unfortunately even checksums won't help you if you mistakenly
| dredge up old but valid config. That doesn't mean they're not
| worth doing, but we don't know, due to a lack of detail, if
| Google already has checksums or if they would even have
| helped in this situation.
| londons_explore wrote:
| Google internally avoids use of ASCII files, so the type of
| error you are suggesting is unlikely.
|
| I suspect it was more a case of incorrect error handling in a
| loop... e.g.
|
|     output = []
|     try:
|         for site in sites:
|             output += process(site)
|     except:
|         print("error!")    # swallows the failure...
|     write_to_file(output)  # ...and writes the partial result
| jeffbee wrote:
| Or it could have just been a recordio file that was being
| written by something that crashed in the middle of doing it,
| and the committed offset was at a valid end of record.
|
| Really there's 1000 ways for this to happen all of which are
| going to sound obvious in retrospect but are easy to commit.
| q3k wrote:
| Even straight up Proto is vulnerable to this (either with
| an unlucky crash right between fields being actually
| written to a file, or when attempting to stream top-level
| fields).
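|
| A toy illustration: with plain length-prefixed framing, a
| crash that lands exactly on a record boundary yields a file
| that parses cleanly but is missing data (no real proto
| library here, just the framing):
|
|     import struct
|
|     def read_records(buf):
|         out, off = [], 0
|         while off + 4 <= len(buf):
|             (n,) = struct.unpack_from("<I", buf, off)
|             if off + 4 + n > len(buf):
|                 raise ValueError("torn record")  # mid-record
|             out.append(buf[off + 4:off + 4 + n])
|             off += 4 + n
|         # A boundary-aligned truncation returns fewer records
|         # with no error at all.
|         return out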
| [deleted]
| amalcon wrote:
| It's more likely that the snapshot was incomplete in that it
| was based on incomplete precursors, rather than that the
| message itself was truncated. The details of something like
| that aren't always appropriate for this kind of report (i.e.
| they usually create more questions than they answer for a
| reader unfamiliar with the basics of the system).
| ehPReth wrote:
| I wish literally everywhere had (mandated?) detailed _public_
| RFOs like this. My residential ISP down for 20 minutes? Tell me
| more about the cable headend or your bad firmware push, please!
| LostLetterbox2 wrote:
| Anyone want to share what a programming cycle is? (The complete
| snapshot was restored with the next programming cycle)
| 7a1c9427 wrote:
| > Google's automation systems mitigated this failure by pushing a
| complete topology snapshot during the next programming cycle. The
| proper sites were restored and the network converged by 05:05
| US/Pacific.
|
| I think this is the most understated part of the whole report.
| The bad thing happened due to "automated clever thing" and then
| the system "automagically" mitigated it in ~7 minutes. Likely
| before a human had even figured out what had gone wrong.
| Jensson wrote:
| How would you otherwise do it? Anything that automatically
| pushes updates should monitor for a rapid increase in errors
| afterwards and roll back if so. You should do at least that if
| you are working on a critical system.
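|
| A minimal version of that watchdog (names and thresholds
| are illustrative):
|
|     import time
|
|     def push_with_watchdog(apply, rollback, error_rate,
|                            baseline, window=300, factor=3.0):
|         # Apply the change, then watch the error rate for a
|         # while; roll back automatically on a clear spike.
|         apply()
|         deadline = time.time() + window
|         while time.time() < deadline:
|             if error_rate() > factor * baseline:
|                 rollback()
|                 raise RuntimeError("error spike; rolled back")
|             time.sleep(5)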
| metadat wrote:
| Sure, in an ideal world this is how nearly everything would
| work.
|
| Getting a complex system to a level of maturity where this is
| feasible to do at scale in real life and actually work well
| is a respectable and non-trivial achievement.
|
| I don't know if Amazon or Azure are able to confidently and
| effectively put in such automatic remediation measures
| globally. My sense is there are humans involved to triage and
| fix unusual types of outages at every other cloud provider,
| including the other bigs.
|
| Leaving a comment on a message board saying how things ought
| to work is one thing (there's nothing wrong with your
| comment, I like it!); I only want to highlight, bold, and
| underscore how successfully achieving this level of automatic
| remediation atop a large and dynamic system is uncommon and
| noteworthy.
| fdgsdfogijq wrote:
| I always think about how impossible it will be for GCP to compete
| with AWS. The work culture at AWS has been brutal for a decade.
| High standards for work, and insane amounts of oncall/ops
| reduction. A burn-and-churn machine is what created AWS. Google
| is a laid back company with great technology, but not the culture
| to grind out every detail of getting cloud to really work.
| Microsoft is another story altogether, as they already have a ton
| of corporate relationships to bring clients.
| baggy_trough wrote:
| Well personally, I find GCP highly reliable and easier to
| understand than AWS.
| abofh wrote:
| Hopefully you can fund their operations for the coming
| quarter then; LinkedIn suggests even GCP SREs are on the
| chopping block.
| zamnos wrote:
| Yeah. No one ever went wrong with hosting in US-east-1. Oh
| wait.
| ec109685 wrote:
| AWS is architected in a way that makes global failures
| harder, e.g. VPCs only span a region.
| jvolkman wrote:
| So instead people just stuff all of their infrastructure
| into one giant region and hope for the best.
| nijave wrote:
| That, or it's the customer's fault for not reading the
| docs and building their thing wrong
| nightpool wrote:
| The last time us-east-1 went down, if memory serves, it
| took down the entirety of EC2 provisioning with it.
| dboreham wrote:
| "Automated clever thing wasn't as clever as it needed to be"
| [deleted]
| 1equalsequals1 wrote:
| Oxford comma as well, how controversial can it get
| londons_explore wrote:
| At Google's scale and reliability target, I would hope they have
| multiple independent worldwide networks.
|
| Each network would have its own config plane and data plane.
| Changes would only be made to one at a time. Perhaps even different
| teams managing them so that one rogue or hacked employee can't
| take down the whole lot.
|
| Then, if someone screws up and pushes a totally nuts config, it
| will only impact one network. User traffic would flow just fine
| over the other networks.
|
| Obviously there would need to be thought given to which network
| data will be routed over, failover logic between one and
| another, etc. And that failover logic would all be
| site-specific, and rolled out site by site, so there is again
| no single point of global failure.
| toast0 wrote:
| I've seen a couple companies (credibly) claim multiple
| independent networks, but it seems to be pretty expensive to do
| in practice. There's a lot of cost to be saved by having a
| single network, and it's too tempting to share things, making
| the independence illusory.
|
| Probably you get better independence with smaller providers
| where the provider isn't yet big enough to actually run a
| global network. Then, each PoP is likely to run or fail
| independently. Google and peers have enough scope that they can
| run a single global network where one can experience the haiku:
|
|     It's not BGP
|     There's no way it's BGP
|     It was BGP
|
| (DNS also fits in there; any sort of automation to push DNS or
| BGP changes will probably fit, if the name was carefully
| chosen)
| sillybov3456 wrote:
| That isn't really how production networks work in my uneducated
| opinion. If they are connected to the production network then
| they are the production network, and the level of isolation
| required to make that not the case would be so extreme as to
| make things potentially more unreliable.
|
| Others can correct me if I'm wrong about this. All I know is
| that the production network where I work is not air gapped in
| the way that would be required to truthfully consider testing
| networks a non production environment, so non prod changes
| typically wind up in front of the change review board anyway.
|
| Ask your own site's network engineers and see if they have
| similar constraints because I would be interested to hear more
| perspectives on that.
|
| One other thing I will say is that the abstractions of "config
| plane" and "data plane" and "control plane" don't really exist
| on real physical systems. That is mostly an abstraction created
| for applications people; those systems are not going to be
| totally blocked from interacting with each other, they kind of
| have to. So if any of your "planes" are shared with production,
| it is a production environment.
| rkeene2 wrote:
| That would mean that all networks which peer with the
| Internet would necessarily be considered Production. This
| isn't that reasonable outside certain niches (e.g., national
| government networks).
|
| Instead, what's commonly done is to provide a Controlled
| Interface (to borrow a term from those national government
| networks) that gates which things are at which level of
| trust. This is where security boundaries are enforced -- and
| if they are sound security boundaries things on either side
| can't reasonably damage the other side.
| sillybov3456 wrote:
| That's super interesting, and you're definitely right about
| the internet thing. I suppose our network guys must have
| some way to see if a change will propagate beyond a
| particular interface?
| sangnoir wrote:
| > One other thing I will say is that the abstractions of
| "config plane" and "data plane" and "control plane" don't
| really exist on real physical systems
|
| If you use any sort of virtualization: the control plane
| (infra) vs data plane (apps) will naturally evolve from the
| architecture. The config plane and control plane can get
| squashed into the same thing, though they can also be
| disparate, at both the infra and application level.
| dekhn wrote:
| Data plane and control plane are definitely a thing in real
| physical systems - look at a classical router, where the
| packet processor works independently of, and is occasionally
| programmed or assisted by, messages passed from the data
| plane to the control plane. That control plane is typically
| elsewhere on the main board, talking to the data plane
| through a well-specified protocol.
|
| Google's network is complicated, making many assumptions
| about "what is prod" etc hard to reason about.
| jeffbee wrote:
| Google does have this, but one of the networks is way bigger
| than the other, so they can't exactly fail B4 over to B2.
___________________________________________________________________
(page generated 2023-03-07 23:00 UTC)