[HN Gopher] Feb 27 2023 GCP Outage Incident Report
       ___________________________________________________________________
        
       Feb 27 2023 GCP Outage Incident Report
        
       Author : fastest963
       Score  : 93 points
       Date   : 2023-03-07 16:31 UTC (6 hours ago)
        
 (HTM) web link (status.cloud.google.com)
 (TXT) w3m dump (status.cloud.google.com)
        
       | numbsafari wrote:
        | Maybe the folks at Fly shouldn't feel so alone.
        
       | bakul wrote:
       | To update a critical file _atomically_ we used to first create an
       | updated copy on the same filesystem and then rename it. Is this
       | not possible on GCP?
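        | 
        | A minimal sketch of that pattern (names are hypothetical;
        | os.replace is Python's atomic-rename call):
        | 
        |     import os, tempfile
        | 
        |     def atomic_update(path, data):
        |         # Write the new contents to a temp file on the same
        |         # filesystem as the target...
        |         fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        |         with os.fdopen(fd, "w") as f:
        |             f.write(data)
        |             f.flush()
        |             os.fsync(f.fileno())  # ensure bytes are on disk first
        |         # ...then swap it into place atomically. Readers see the
        |         # old file or the new one, never a partial write.
        |         os.replace(tmp, path)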
        
       | we_never_see_it wrote:
       | [flagged]
        
         | pb7 wrote:
         | An ad company with some of the most reliable infrastructure on
         | the planet since its inception? Yeah, I think it can.
        
         | nijave wrote:
         | Can a book store?
        
         | SteveNuts wrote:
         | They already were an infrastructure company, they just realized
         | they could sell it to outsiders (as Amazon had already very
         | much proved).
        
       | fulafel wrote:
       | I wonder how much people in the know really believe in singling
       | out a single root cause to these HA system failures.
        
         | notimetorelax wrote:
         | Look at the 'Remediation and Prevention' section for the fixes.
          | The root cause of an incident is always singular, but the
          | means to prevent it are multiple.
        
           | fulafel wrote:
            | Right, but is it the right mental model to demand that a
            | single root cause be delivered as the outcome of the
            | investigation, when there are lots of things going wrong,
            | plus process and architecture problems that make the
            | involved failures fatal?
        
             | notimetorelax wrote:
             | This follows Google's published post mortem template:
             | https://sre.google/sre-book/example-postmortem
             | 
             | What you're saying can go into the lessons learned section.
        
         | toast0 wrote:
         | Well, I don't think there is any question about it. It can only
         | be attributable to human error. This sort of thing has cropped
         | up before, and it has always been due to human error.
         | 
         | Done.
        
           | FigmentEngine wrote:
            | tl;dr: no system failure is human error. If a human can cause
            | this, then your system lacks adequate controls and mechanisms.
            | The root cause is the lack of controls, not the human error.
        
             | toast0 wrote:
             | Building a system without sufficient controls is a classic
             | (human) error in system design.
        
         | nailer wrote:
          | It's really odd to see comments like this faded out from
          | downvotes. Anyone from the devops, SRE, or distributed systems
          | world would ask the same.
          | 
          | For example, why are there no processes to check for snapshot
          | integrity, or if there are, why were they not used?
        
           | pgwhalen wrote:
           | It's downvoted because it's a straw man, the linked article
           | isn't suggesting there was a single root cause.
        
         | verdverm wrote:
         | Did you read it?
         | 
         | They pointed out several issues that caused this and several
         | mitigations to prevent it from happening again.
         | 
          | I have yet to see anyone else with public RCAs as good as
          | Google Cloud's.
        
           | fulafel wrote:
           | Yes - under the root cause heading isn't the following raised
           | as "the" root cause?
           | 
           | > During a routine update to the critical elements snapshot
           | data, an incomplete snapshot was inadvertently shared which
           | removed several sites from the topology map.
        
             | verdverm wrote:
             | Yes, that is one sentence in the full analysis which
             | describes the core of what happened. There are other
             | sentences which describe several contributing factors.
        
             | sillybov3456 wrote:
             | Right, the thing about root causes is that you can always
             | keep digging. For instance why was an incomplete snapshot
             | shared? And then the why for that why, and on and on until
             | you reach the singularity at the beginning of the universe,
             | which can logically be the only real root cause of
             | anything. Root cause just means whatever is enough to make
             | your boss satisfied.
        
       | ccooffee wrote:
       | > During a routine update to the critical elements snapshot data,
       | an incomplete snapshot was inadvertently shared which removed
       | several sites from the topology map.
       | 
       | I wish this went into more detail about how an incomplete
       | snapshot was created and how the incomplete snapshot was valid-
       | enough to sort-of work.
       | 
       | I'm supposing that whatever interchange format was in use does
       | not have any "END" delimiters (e.g. closing quotes/braces), nor
       | any checksumming to ensure the entire message was delivered. I'm
       | mildly surprised that there wasn't a failsafe to prevent
       | automatically replacing a currently-in-use snapshot with one that
       | lacks many of the services. (Adding a "type 'yes, I mean this'"
       | user interaction widget is my preferred approach to avoid this
       | class of problem in admin interfaces.)
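        | 
        | A sanity gate of the sort I mean, as a sketch (the names and
        | the threshold are hypothetical; `current` and `proposed` are
        | site-to-config maps):
        | 
        |     def check_snapshot(current, proposed, max_removed=0.05):
        |         """Refuse a snapshot that silently drops many sites."""
        |         removed = current.keys() - proposed.keys()
        |         if len(removed) > max_removed * len(current):
        |             raise ValueError(
        |                 f"snapshot drops {len(removed)} of {len(current)} "
        |                 "sites; require explicit operator confirmation")
        |         return proposed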
        
         | spullara wrote:
          | I ran into a problem like this with a service that used YAML
          | for its config file. Basically, when I edited and saved it, the
          | service would automatically pick up the change and load the
          | config. However, the save hadn't completed, so it only read a
          | partial file, which was still valid because, YAML.
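          | 
          | Easy to reproduce with PyYAML (the truncation point here is
          | arbitrary):
          | 
          |     import yaml
          | 
          |     full = "sites:\n  - a\n  - b\n  - c\n"
          |     partial = full[:13]  # save interrupted mid-file
          |     yaml.safe_load(full)     # {'sites': ['a', 'b', 'c']}
          |     yaml.safe_load(partial)  # {'sites': ['a']} -- still parses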
        
           | unxdfa wrote:
           | I still like XML and XSD. People look at me these days like
           | I'm insane. But in this case a partially loaded XML document
            | would not parse, let alone pass schema validation.
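            | 
            | For instance (standard library only; the truncation mirrors
            | the YAML case above):
            | 
            |     import xml.etree.ElementTree as ET
            | 
            |     doc = "<sites><site>a</site><site>b</site></sites>"
            |     ET.fromstring(doc)       # parses fine
            |     ET.fromstring(doc[:21])  # raises ParseError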
           | 
           | Again vindicated. YAML needs to go away. It is misery. I'd
           | rather have XSLT than Helm templates.
        
             | piva00 wrote:
             | Both you and I are very aware that the XSLT + XML config
              | file would have a tool for translating/generating it from
              | YAML, and most users would use that tool instead of
             | configuring in XML.
        
               | unxdfa wrote:
               | Not if I amble around the office with a chair leg making
               | menacing grunts they won't.
        
         | bragr wrote:
         | >I'm supposing that whatever interchange format was in use does
         | not have any "END" delimiters (e.g. closing quotes/braces), nor
         | any checksumming to ensure the entire message was delivered.
         | 
         | Those only ensure you get the whole message, not that the
         | message makes sense.
        
           | ccooffee wrote:
           | Quite true. I was assuming that the incomplete snapshot was a
           | transmission error or storage error. It's quite possible that
           | the bug was an error of omission inside the data itself (e.g.
           | someone accidentally removed an important key-value mapping
           | from the data generator).
        
         | jgrahamc wrote:
         | One of the first things I did at Cloudflare was change the
         | format of a file that contained vital information about the
         | mapping between IPs and zones (no longer used these days) so
          | that it had something like:
          | 
          |     START <count> <sha1>
          |     .
          |     .
          |     .
          |     END
         | 
         | because it was just slurped up line by line assuming EOF was
         | good.
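          | 
          | A reader for such a format might look like this sketch (the
          | exact field layout is my assumption):
          | 
          |     import hashlib
          | 
          |     def load_lines(path):
          |         with open(path) as f:
          |             lines = f.read().splitlines()
          |         if not lines or lines[-1] != "END":
          |             raise ValueError("truncated: END marker missing")
          |         _, count, digest = lines[0].split()
          |         body = lines[1:-1]
          |         if len(body) != int(count):
          |             raise ValueError("record count mismatch")
          |         joined = "\n".join(body).encode()
          |         if hashlib.sha1(joined).hexdigest() != digest:
          |             raise ValueError("checksum mismatch")
          |         return body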
        
           | cheeselip420 wrote:
           | one reason why JSON is superior to things like TOML or YAML
           | for these use-cases...
        
             | e12e wrote:
             | Not to worry, JSONL[1] fixes that ;)
             | 
              | In all seriousness - just dropping a 1 GB JSON file with N
              | million records in one end probably isn't great either. I
              | suppose one could somehow marry JS and subresource
              | integrity protection[2] to get a JSON-serialized structure
              | with a hash integrity check. It would probably be a
              | terrible idea.
             | 
             | [1] https://jsonlines.org/
             | 
             | [2] https://developer.mozilla.org/en-
             | US/docs/Web/Security/Subres...
        
               | ehPReth wrote:
               | > Text editing programs call the first line of a text
               | file "line 1". The first value in a JSON Lines file
               | should also be called "value 1".
               | 
               | I wonder why not zero?
        
               | metadat wrote:
               | In my experience, this has been called "ndjson", or
               | newline-delimited JSON. It's a remarkably effective and
               | useful pattern.
               | 
               | Apparently jsonlines is only a website, whereas ndjson is
               | a website+spec. Given the conceptual simplicity, this
               | claim seems a bit dubious. Having two of these identical
                | things, each one clawing for mindshare, is dumb and
               | counter-productive. Why can't we all be friendly and get
               | along for the greater good? Oh well, it's par for the
               | course.
               | 
               | http://ndjson.org/
        
           | sitkack wrote:
           | Every mission critical file eventually gets version numbers
           | and checksums.
        
             | kridsdale1 wrote:
             | And eventually a blockchain
        
               | smt88 wrote:
               | Files can be immutable, cryptographically-verifiable, and
               | distributed across servers without a blockchain.
               | 
               | Adding a blockchain to this would slow things down, add
               | surface area for errors, and provide absolutely no value.
               | This is true of every usage of blockchain other than
               | creating tokens.
        
         | dekhn wrote:
          | No, that's not how it would be done at Google. They invented a
          | binary protocol to handle things like this reliably, and that
          | protocol goes over an integrity-checking network transport.
          | It's more likely that an odd edge case occurred and somehow got
          | past the normal QC checks - say, an RPC returned an error but
          | the error handler didn't retry, it just fell through.
        
         | omoikane wrote:
          | End markers will help detect truncated files but not other
          | kinds of brokenness; you need checksums for those.
         | 
          | Also, I think the config files for certain off-the-shelf
         | network devices don't come with end markers or checksums, and
         | might not be very good at handling corrupted configs. The usual
         | practice is to push updates to a small fraction of those, then
         | wait and see what happens before proceeding further.
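          | 
          | In outline (the device interface here is invented for
          | illustration):
          | 
          |     import random, time
          | 
          |     def staged_push(devices, config, fraction=0.01, soak=600):
          |         # Push to a small random canary group first.
          |         n = max(1, int(len(devices) * fraction))
          |         canaries = random.sample(devices, n)
          |         for d in canaries:
          |             d.apply(config)
          |         time.sleep(soak)  # wait and see what happens
          |         if any(not d.healthy() for d in canaries):
          |             for d in canaries:
          |                 d.rollback()
          |             raise RuntimeError("canaries unhealthy; abort")
          |         # Only then proceed to the rest of the fleet.
          |         for d in devices:
          |             if d not in canaries:
          |                 d.apply(config)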
        
           | zamnos wrote:
           | Unfortunately even checksums won't help you if you mistakenly
           | dredge up old but valid config. That doesn't mean they're not
           | worth doing, but we don't know, due to a lack of detail, if
           | Google already has checksums or if they would even have
           | helped in this situation.
        
         | londons_explore wrote:
          | Google internally avoids the use of ASCII files, so the type of
          | error you are suggesting is unlikely.
          | 
          | I suspect it was more a case of incorrect error handling in a
          | loop... E.g.:
          | 
          |     output = []
          |     try:
          |         for site in sites:
          |             output += process(site)
          |     except:
          |         print("error!")
          |     write_to_file(output)
        
           | jeffbee wrote:
           | Or it could have just been a recordio file that was being
           | written by something that crashed in the middle of doing it,
           | and the committed offset was at a valid end of record.
           | 
            | Really, there are a thousand ways for this to happen, all of
            | which are going to sound obvious in retrospect but are easy
            | to commit.
        
             | q3k wrote:
             | Even straight up Proto is vulnerable to this (either with
             | an unlucky crash right between fields being actually
             | written to a file, or when attempting to stream top-level
             | fields).
        
           | [deleted]
        
         | amalcon wrote:
         | It's more likely that the snapshot was incomplete in that it
         | was based on incomplete precursors, rather than that the
         | message itself was truncated. The details of something like
         | that aren't always appropriate for this kind of report (i.e.
         | they usually create more questions than they answer for a
         | reader unfamiliar with the basics of the system).
        
       | ehPReth wrote:
       | I wish literally everywhere had (mandated?) detailed _public_
       | RFOs like this. My residential ISP down for 20 minutes? Tell me
       | more about the cable headend or your bad firmware push, please!
        
       | LostLetterbox2 wrote:
       | Anyone want to share what a programming cycle is? (The complete
       | snapshot was restored with the next programming cycle)
        
       | 7a1c9427 wrote:
       | > Google's automation systems mitigated this failure by pushing a
       | complete topology snapshot during the next programming cycle. The
       | proper sites were restored and the network converged by 05:05
       | US/Pacific.
       | 
       | I think this is the most understated part of the whole report.
       | The bad thing happened due to "automated clever thing" and then
       | the system "automagically" mitigated it in ~7 minutes. Likely
       | before a human had even figured out what had gone wrong.
        
         | Jensson wrote:
          | How would you otherwise do it? Anything that automatically
          | pushes updates should monitor for a rapid increase in errors
          | afterwards and roll back if so. You should do at least that if
          | you are working on a critical system.
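          | 
          | In outline (error_rate and the deploy/rollback hooks are
          | stand-ins for whatever telemetry and deploy machinery exists):
          | 
          |     import time
          | 
          |     def push_with_watch(deploy, rollback, error_rate,
          |                         baseline, factor=2.0, watch=300):
          |         """Deploy, then watch error rates; revert on a spike."""
          |         deploy()
          |         deadline = time.time() + watch
          |         while time.time() < deadline:
          |             if error_rate() > factor * baseline:
          |                 rollback()
          |                 raise RuntimeError("error spike; rolled back")
          |             time.sleep(5)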
        
           | metadat wrote:
           | Sure, in an ideal world this is how nearly everything would
           | work.
           | 
           | Getting a complex system to a level of maturity where this is
           | feasible to do at scale in real life and actually work well
           | is a respectable and non-trivial achievement.
           | 
           | I don't know if Amazon or Azure are able to confidently and
           | effectively put in such automatic remediation measures
           | globally. My sense is there are humans involved to triage and
           | fix unusual types of outages at every other cloud provider,
           | including the other bigs.
           | 
           | Leaving a comment on a message board saying how things ought
           | to work is one thing (there's nothing wrong with your
           | comment, I like it!); I only want to highlight, bold, and
           | underscore how successfully achieving this level of automatic
           | remediation atop a large and dynamic system is uncommon and
           | noteworthy.
        
       | fdgsdfogijq wrote:
        | I always think about how impossible it will be for GCP to compete
        | with AWS. The work culture at AWS has been brutal for a decade:
        | high standards for work and insane amounts of oncall/ops
        | reduction, a burn-and-churn machine that created AWS. Google is a
        | laid-back company with great technology, but not the culture to
        | grind out every detail of getting cloud to really work. Microsoft
        | is another story altogether, as they already have a ton of
        | corporate relationships to bring in clients.
        
         | baggy_trough wrote:
         | Well personally, I find GCP highly reliable and easier to
         | understand than AWS.
        
           | abofh wrote:
            | Hopefully you can fund their operations for the coming
            | quarter then; LinkedIn suggests even GCP SREs are on the
            | chopping block.
        
         | zamnos wrote:
         | Yeah. No one ever went wrong with hosting in US-east-1. Oh
         | wait.
        
           | ec109685 wrote:
            | AWS is architected in a way that makes global failures
            | harder; e.g., VPCs only span a region.
        
             | jvolkman wrote:
             | So instead people just stuff all of their infrastructure
             | into one giant region and hope for the best.
        
               | nijave wrote:
                | That, or it's the customer's fault for not reading the
                | docs and building their thing wrong.
        
             | nightpool wrote:
             | The last time us-east-1 went down, if memory serves, it
             | took down the entirety of ec2 provisioning with it.
        
       | dboreham wrote:
       | "Automated clever thing wasn't as clever as it needed to be"
        
       | [deleted]
        
       | 1equalsequals1 wrote:
       | Oxford comma as well, how controversial can it get
        
       | londons_explore wrote:
        | At Google's scale and reliability target, I would hope they have
       | multiple independent worldwide networks.
       | 
       | Each network would have its own config plane and data plane.
        | Changes would only be made to one at a time. Perhaps different
        | teams would even manage them, so that one rogue or hacked
        | employee can't take down the whole lot.
       | 
       | Then, if someone screws up and pushes a totally nuts config, it
       | will only impact one network. User traffic would flow just fine
       | over the other networks.
       | 
        | Obviously there would need to be thought given to which network
        | data is routed over, failover logic between one network and
        | another, etc. And that failover logic would all be site-specific,
        | and rolled out site by site, so there is again no single point of
        | global failure.
        
         | toast0 wrote:
         | I've seen a couple companies (credibly) claim multiple
         | independent networks, but it seems to be pretty expensive to do
         | in practice. There's a lot of cost to be saved by having a
          | single network, and it's too tempting to share things, making
          | the independence illusory.
         | 
         | Probably you get better independence with smaller providers
         | where the provider isn't yet big enough to actually run a
         | global network. Then, each PoP is likely to run or fail
         | independently. Google and peers have enough scope that they can
         | run a single global network where one can experience the Haiku:
          | 
          |     It's not BGP
          |     There's no way it's BGP
          |     It was BGP
         | 
         | (DNS also fits in there; any sort of automation to push DNS or
         | BGP changes will probably fit, if the name was carefully
         | chosen)
        
         | sillybov3456 wrote:
         | That isn't really how production networks work in my uneducated
         | opinion. If they are connected to the production network then
         | they are the production network, and the level of isolation
         | required to make that not the case would be so extreme as to
         | make things potentially more unreliable.
         | 
          | Others can correct me if I'm wrong about this. All I know is
          | that the production network where I work is not air-gapped in
          | the way that would be required to truthfully consider testing
          | networks a non-production environment, so non-prod changes
          | typically wind up in front of the change review board anyway.
         | 
          | Ask your own site's network engineers and see if they have
          | similar constraints, because I would be interested to hear more
          | perspectives on that.
         | 
         | One other thing I will say is that the abstractions of "config
         | plane" and "data plane" and "control plane" don't really exist
          | on real physical systems. That is mostly an abstraction created
          | for applications people; those systems are not going to be
          | totally blocked from interacting with each other, they kind of
          | have to. So if any of your "planes" are shared with production,
          | it is a production environment.
        
           | rkeene2 wrote:
           | That would mean that all networks which peer with the
           | Internet would necessarily be considered Production. This
           | isn't that reasonable outside certain niches (i.e., national
           | government networks).
           | 
           | Instead, what's commonly done is to provide a Controlled
           | Interface (to borrow a term from those national government
           | networks) that gates which things are at which level of
           | trust. This is where security boundaries are enforced -- and
           | if they are sound security boundaries things on either side
           | can't reasonably damage the other side.
        
             | sillybov3456 wrote:
             | That's super interesting, and you're definitely right about
             | the internet thing. I suppose our network guys must have
             | some way to see if a change will propagate beyond a
             | particular interface?
        
           | sangnoir wrote:
           | > One other thing I will say is that the abstractions of
           | "config plane" and "data plane" and "control plane" don't
           | really exist on real physical systems
           | 
            | If you use any sort of virtualization, the control plane
            | (infra) vs. data plane (apps) split will naturally evolve
            | from the architecture. The config plane and control plane can
            | get squashed into the same thing, though they can also be
            | disparate at both the infra and application level.
        
           | dekhn wrote:
            | Data plane and control plane are definitely a thing in real
            | physical systems: look at a classical router, where the
            | packet processor works independently of the control plane and
            | is occasionally programmed or assisted by messages passed
            | between the two. That control plane typically lives elsewhere
            | on the main board, talking to the data plane through a
            | well-specified protocol.
           | 
           | Google's network is complicated, making many assumptions
           | about "what is prod" etc hard to reason about.
        
         | jeffbee wrote:
         | Google does have this, but one of the networks is way bigger
         | than the other, so they can't exactly fail B4 over to B2.
        
       ___________________________________________________________________
       (page generated 2023-03-07 23:00 UTC)