[HN Gopher] Sharing details on a recent incident impacting one o...
       ___________________________________________________________________
        
       Sharing details on a recent incident impacting one of our customers
        
       Author : nonfamous
       Score  : 86 points
       Date   : 2024-05-24 14:48 UTC (8 hours ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | foobazgt wrote:
       | Sounds like a pretty thorough review in that they didn't stop at
       | just an investigation of the specific tool / process, but also
       | examined the rest for any auto deletion problems and also
       | confirmed soft delete behavior.
       | 
       | They could have gone one step further by reviewing all cases of
       | default behavior for anything that might be surprising. That
       | said, it can be difficult to assess what is "surprising", as it's
       | often the people who know the least about a tool/API who also
       | utilize its defaults.
        
         | x0x0 wrote:
          | Sounds more like some pants browning, because incidents like
          | this are a great reason to just use AWS. Like come on:
         | 
         | > _After the end of the system-assigned 1 year period, the
         | customer's GCVE Private Cloud was deleted. No customer
         | notification was sent because the deletion was triggered as a
         | result of a parameter being left blank by Google operators
          | using the internal tool, and not due to a customer deletion
         | request. Any customer-initiated deletion would have been
         | preceded by a notification to the customer._
         | 
          | ... Tada! We're so incompetent we let giant deletes happen
         | with no human review. Thank god this customer didn't trust us
         | and kept off-gcp backups or they'd be completely screwed.
         | 
         | > _There has not been an incident of this nature within Google
         | Cloud prior to this instance. It is not a systemic issue._
         | 
          | Translated to English: oh god, every AWS and Azure salesperson
         | has sent 3 emails to all their prospects citing our utter
         | fuckup.
        
           | markfive wrote:
           | > Thank god this customer didn't trust us and kept off-gcp
           | backups or they'd be completely screwed.
           | 
           | Except that, from the article, the customer's backups that
           | were used to recover were in GCP, and in the same region.
        
             | ceejayoz wrote:
             | I'm curious about that bit.
             | 
             | https://www.unisuper.com.au/contact-us/outage-update says
             | "UniSuper had backups in place with an additional service
             | provider. These backups have minimised data loss, and
             | significantly improved the ability of UniSuper and Google
             | Cloud to complete the restoration."
        
               | politelemon wrote:
                | That's the bit that sticks out to me as contradictory.
                | I'm inclined not to believe what GCP has said here: an
                | account deletion is an account deletion, so why would
                | some objects be left behind?
               | 
               | No doubt this little bit must be causing some annoyance
               | among UniSuper's tech teams.
        
               | graemep wrote:
                | They had another provider because the regulator requires
               | it. I suspect a lot of businesses in less regulated
               | industries do not.
        
       | gnabgib wrote:
        | Related stories:
        | 
        | _UniSuper members go a week with no account access after
        | Google Cloud misconfig_ [0] (186 points, 16 days ago, 42
        | comments)
        | 
        | _Google Cloud accidentally deletes customer's account_ [1]
        | (128 points, 15 days ago, 32 comments)
       | 
       | [0]: https://news.ycombinator.com/item?id=40304666
       | 
       | [1]: https://news.ycombinator.com/item?id=40313171
        
       | tempnow987 wrote:
        | Wow - I was wrong. I thought this would have been something
        | like Terraform with a default of immediate delete and no
        | recovery period. Still a default, but a third-party thing, and
        | maybe someone at UniSuper testing something and mis-scoping
        | the delete.
       | 
       | Crazy that it really was google side. UniSuper must have been
       | like WHAT THE HELL?
        
       | mannyv wrote:
       | I guessed it was provisioning or keys. Looks like I was somewhat
       | correct!
        
       | janalsncm wrote:
       | I think it stretches credulity to say that the first time such an
       | event happened was with a multi billion dollar mutual fund. In
       | other words, I'm glad Unisuper's problem was resolved, but there
       | were probably many others which were small enough to ignore.
       | 
       | I can only hope this gives GCP the kick in the pants it needs.
        
         | resolutebat wrote:
          | GCVE (managed VMware) is a pretty obscure service; it's only
          | used by the kind of multi-billion-dollar companies that want
          | to lift and shift their legacy VMware fleets into the cloud
          | as is.
        
       | sgt101 wrote:
       | Super motivating to have off cloud backup strategies...
        
         | tgv wrote:
         | Or cross-cloud. S3's ingress and storage costs are low, so
         | that's an option when you don't use AWS.
        
       | jawns wrote:
       | > The customer's CIO and technical teams deserve praise for the
       | speed and precision with which they executed the 24x7 recovery,
       | working closely with Google Cloud teams.
       | 
       | I wonder if they just get praise in a blog post, or if the
       | customer is now sitting on a king's ransom in Google Cloud
       | credit.
        
       | noncoml wrote:
       | What surprises me the most is that the customer managed to
       | actually speak to a person from Google support. Must have been a
       | pretty big private cloud deployment.
       | 
       | Edit: saw from the other replies that the customer was Unisuper.
       | No wonder they managed to speak to an actual person.
        
       | cebert wrote:
       | > "Google Cloud continues to have the most resilient and stable
       | cloud infrastructure in the world."
       | 
        | I don't think GCP has that reputation compared to AWS or Azure.
       | They aren't at the same level.
        
       | snewman wrote:
       | Given the level of impact that this incident caused, I am
       | surprised that the remediations did not go deeper. They ensured
       | that the same problem could not happen again in the same way, but
       | that's all. So some equivalent glitch somewhere down the road
       | could lead to a similar result (or worse; not all customers might
       | have the same "robust and resilient architectural approach to
       | managing risk of outage or failure").
       | 
       | Examples of things they could have done to systematically guard
       | against inappropriate service termination / deletion in the
       | future:
       | 
       | 1. When terminating a service, temporarily place it in a state
       | where the service is unavailable but all data is retained and can
       | be restored at the push of a button. Discard the data after a few
       | days. This provides an opportunity for the customer to report the
       | problem.
       | 
       | 2. Audit all deletion workflows for all services (they only
       | mention having reviewed GCVE). Ensure that customers are notified
       | in advance whenever any service is terminated, even if "the
       | deletion was triggered as a result of a parameter being left
       | blank by Google operators using the internal tool".
       | 
       | 3. Add manual review for _any_ termination of a service that is
       | in active use, above a certain size.
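        | 
        | A minimal sketch of the grace-period pattern in (1), in Python
        | (all names here are illustrative, not any actual GCP API):

```python
import time

GRACE_SECONDS = 3 * 24 * 3600  # keep data for a few days after termination


class Resource:
    """Hypothetical stand-in for a provisioned service (e.g. a private cloud)."""

    def __init__(self, name):
        self.name = name
        self.deleted_at = None  # None means the service is active


def terminate(res, now=None):
    """Soft delete: make the service unavailable but retain all data."""
    res.deleted_at = time.time() if now is None else now


def restore(res, now=None):
    """Push-button restore, allowed only inside the grace window."""
    now = time.time() if now is None else now
    if res.deleted_at is None:
        return False  # nothing to restore
    if now - res.deleted_at > GRACE_SECONDS:
        raise RuntimeError("grace period expired; data already purged")
    res.deleted_at = None
    return True


def purge_expired(resources, now=None):
    """Permanently discard data only once the grace period has lapsed."""
    now = time.time() if now is None else now
    return [r for r in resources
            if r.deleted_at is None or now - r.deleted_at <= GRACE_SECONDS]
```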
       | 
       | Absent these broader measures, I don't find this postmortem to be
       | in the slightest bit reassuring. Given the are-you-f*ing-kidding-
       | me nature of the incident, I would have expected any sensible
       | provider who takes the slightest pride in their service, or even
       | is merely interested in protecting their reputation, to visibly
       | go over the top in ensuring nothing like this could happen again.
       | Instead, they've done the bare minimum. That says something bad
       | about the culture at Google Cloud.
        
         | 2OEH8eoCRo0 wrote:
         | That sounds reasonable. Perhaps they felt that a larger change
         | to process would be riskier overall.
        
       | JCM9 wrote:
       | The quality and rigor of GCP's engineering is not even remotely
       | close to that of an AWS or Azure and this incident shows it.
        
         | justinclift wrote:
         | And Azure has a very poor reputation, so that bar is _not_ at
         | all high.
        
         | SoftTalker wrote:
         | Honestly I've never worked anywhere that didn't have some kind
         | of "war story" that was told about how some admin or programmer
         | mistake resulted in the deletion of some vast swathe of data,
         | and then the panic-driven heroics that were needed to recover.
         | 
         | It shouldn't happen, but it does, all the time, because humans
         | aren't perfect, and neither are the things we create.
        
         | kccqzy wrote:
         | Azure has had a continuous stream of security breaches. I don't
         | trust them either. It's AWS and AWS alone.
        
       | l00tr wrote:
        | If it were a small or medium business, Google wouldn't even care.
        
       | lopkeny12ko wrote:
       | > Google Cloud services have strong safeguards in place with a
       | combination of soft delete, advance notification, and human-in-
       | the-loop, as appropriate.
       | 
       | I mean, clearly not? By Google's own admission, in this very
       | article, the resources were not soft deleted, no advance
       | notification was sent, and there was no human in the loop for
       | approving the automated deletion.
       | 
       | And Google's remediation items include adding even _more_
       | automation for this process. This sounds totally backward to me.
       | Am I missing something?
        
       | maximinus_thrax wrote:
       | > Google Cloud continues to have the most resilient and stable
       | cloud infrastructure in the world.
       | 
        | As a company, Google has a lot of work to do on its customer
        | care reputation, regardless of what some metrics somewhere say
        | about whose cloud is more reliable. I would not trust my
        | business to Google Cloud; I would not trust anything involving
        | money to anything with the Google logo. Anyone who's been reading
       | hacker news for a couple of years can remember how many times
       | folks were asking for insider contacts to recover their
       | accounts/data. Extrapolating this to a business would keep me up
       | at night.
        
       | dekhn wrote:
       | If you're a GCP customer with a TAM, here's how to make them
       | squirm. Ask them what protections GCP has in place, on your
       | account, that would prevent GCP from inadvertently deleting large
       | amounts of resources if GCP makes an administrative error.
       | 
        | They'll point to something saying this specific problem was
        | alleviated (by deprecating the tool that caused it and
        | automating more of the process), and then you can persist: we
        | know you've fixed this particular problem, but will a human
        | review any future large-scale deletion before the resources
        | are deleted?
       | 
        | From what I can tell (I worked for GCP aeons ago, and have
        | been an active user of AWS for even longer), GCP's human-based
        | protection measures are close to non-existent, and much weaker
        | than AWS's.
       | Either way, it's definitely worth asking your TAM about this very
       | real risk.
        
       ___________________________________________________________________
       (page generated 2024-05-24 23:00 UTC)