[HN Gopher] Sharing details on a recent incident impacting one o...
___________________________________________________________________
Sharing details on a recent incident impacting one of our customers
Author : nonfamous
Score : 86 points
Date : 2024-05-24 14:48 UTC (8 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| foobazgt wrote:
| Sounds like a pretty thorough review: they didn't stop at an
| investigation of the specific tool/process, but also examined
| the rest for auto-deletion problems and confirmed soft-delete
| behavior.
|
| They could have gone one step further by reviewing all cases of
| default behavior for anything that might be surprising. That
| said, it can be difficult to assess what is "surprising", as it's
| often the people who know the least about a tool/API who also
| utilize its defaults.
| x0x0 wrote:
| Sounds more like some pants browning because incidents like
| this are a great reason to just use aws. Like come on:
|
| > _After the end of the system-assigned 1 year period, the
| customer's GCVE Private Cloud was deleted. No customer
| notification was sent because the deletion was triggered as a
| result of a parameter being left blank by Google operators
| using the internal tool, and not due to a customer deletion
| request. Any customer-initiated deletion would have been
| preceded by a notification to the customer._
|
| ... Tada! We're so incompetent we let giant deletes happen
| with no human review. Thank god this customer didn't trust us
| and kept off-gcp backups or they'd be completely screwed.
|
| > _There has not been an incident of this nature within Google
| Cloud prior to this instance. It is not a systemic issue._
|
| Translated to English: oh god, every aws and Azure salesperson
| has sent 3 emails to all their prospects citing our utter
| fuckup.
| markfive wrote:
| > Thank god this customer didn't trust us and kept off-gcp
| backups or they'd be completely screwed.
|
| Except that, from the article, the customer's backups that
| were used to recover were in GCP, and in the same region.
| ceejayoz wrote:
| I'm curious about that bit.
|
| https://www.unisuper.com.au/contact-us/outage-update says
| "UniSuper had backups in place with an additional service
| provider. These backups have minimised data loss, and
| significantly improved the ability of UniSuper and Google
| Cloud to complete the restoration."
| politelemon wrote:
| That's the bit that's sticking out to me as contradictory.
| I'm inclined not to believe what GCP have said here: an
| account deletion is an account deletion, so why would some
| objects be left behind?
|
| No doubt this little bit must be causing some annoyance
| among UniSuper's tech teams.
| graemep wrote:
| They had another provider because the regulator requires
| it. I suspect a lot of businesses in less regulated
| industries do not.
| gnabgib wrote:
| Related stories: _UniSuper members go a week with no account
| access after Google Cloud misconfig_ [0] (186 points, 16 days
| ago, 42 comments); _Google Cloud accidentally deletes customer's
| account_ [1] (128 points, 15 days ago, 32 comments)
|
| [0]: https://news.ycombinator.com/item?id=40304666
|
| [1]: https://news.ycombinator.com/item?id=40313171
| tempnow987 wrote:
| Wow - I was wrong. I thought this would have been something like
| Terraform defaulting to an immediate delete with no recovery
| period. Still a default, but a third-party thing, with maybe
| someone at UniSuper testing something and mis-scoping the delete.
|
| Crazy that it really was google side. UniSuper must have been
| like WHAT THE HELL?
| mannyv wrote:
| I guessed it was provisioning or keys. Looks like I was somewhat
| correct!
| janalsncm wrote:
| I think it stretches credulity to say that the first time such an
| event happened was with a multi-billion-dollar mutual fund. In
| other words, I'm glad UniSuper's problem was resolved, but there
| were probably many others which were small enough to ignore.
|
| I can only hope this gives GCP the kick in the pants it needs.
| resolutebat wrote:
| GCVE (managed VMware) is a pretty obscure service; it's only
| used by the kind of multi-billion-dollar companies that want to
| lift and shift their legacy VMware fleets into the cloud as-is.
| sgt101 wrote:
| Super motivating to have off-cloud backup strategies...
| tgv wrote:
| Or cross-cloud. S3's ingress and storage costs are low, so
| that's an option when you don't use AWS.
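|
| A rough sketch of that kind of cross-cloud mirror (hypothetical
| bucket names; assumes the google-cloud-storage and boto3 client
| libraries, with credentials set up for both sides):
|
|     # Mirror a GCS backup bucket into an S3 bucket at another
|     # provider, skipping objects that were already copied.
|     import boto3
|     from botocore.exceptions import ClientError
|     from google.cloud import storage
|
|     GCS_BUCKET = "prod-backups"          # assumed source bucket
|     S3_BUCKET = "prod-backups-offsite"   # assumed destination
|
|     def mirror_backups():
|         gcs = storage.Client()
|         s3 = boto3.client("s3")
|         for blob in gcs.list_blobs(GCS_BUCKET):
|             try:
|                 s3.head_object(Bucket=S3_BUCKET, Key=blob.name)
|                 continue                 # already mirrored
|             except ClientError as err:
|                 if err.response["Error"]["Code"] != "404":
|                     raise
|             # Buffers each object in memory; fine for modest
|             # backup files, use multipart streaming for huge ones.
|             s3.put_object(Bucket=S3_BUCKET, Key=blob.name,
|                           Body=blob.download_as_bytes())
|
|     if __name__ == "__main__":
|         mirror_backups()
|
| Run on a schedule, a job like that keeps an off-provider copy
| that a mistake on the primary cloud can't touch.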
| jawns wrote:
| > The customer's CIO and technical teams deserve praise for the
| speed and precision with which they executed the 24x7 recovery,
| working closely with Google Cloud teams.
|
| I wonder if they just get praise in a blog post, or if the
| customer is now sitting on a king's ransom in Google Cloud
| credit.
| noncoml wrote:
| What surprises me the most is that the customer managed to
| actually speak to a person from Google support. Must have been a
| pretty big private cloud deployment.
|
| Edit: saw from the other replies that the customer was Unisuper.
| No wonder they managed to speak to an actual person.
| cebert wrote:
| > "Google Cloud continues to have the most resilient and stable
| cloud infrastructure in the world."
|
| I don't think GCP has that reputation compared to AWS or Azure.
| They aren't at the same level.
| snewman wrote:
| Given the level of impact that this incident caused, I am
| surprised that the remediations did not go deeper. They ensured
| that the same problem could not happen again in the same way, but
| that's all. So some equivalent glitch somewhere down the road
| could lead to a similar result (or worse; not all customers might
| have the same "robust and resilient architectural approach to
| managing risk of outage or failure").
|
| Examples of things they could have done to systematically guard
| against inappropriate service termination / deletion in the
| future:
|
| 1. When terminating a service, temporarily place it in a state
| where the service is unavailable but all data is retained and can
| be restored at the push of a button. Discard the data after a few
| days. This gives the customer a window to report the problem (a
| rough sketch follows the list).
|
| 2. Audit all deletion workflows for all services (they only
| mention having reviewed GCVE). Ensure that customers are notified
| in advance whenever any service is terminated, even if "the
| deletion was triggered as a result of a parameter being left
| blank by Google operators using the internal tool".
|
| 3. Add manual review for _any_ termination of a service that is
| in active use, above a certain size.
|
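| A minimal sketch of the grace-period idea in point 1
| (hypothetical names, in-memory state only; a real service would
| persist this state and drive the purge from a scheduled job):
|
|     # "Deleting" only marks the resource and sets a purge
|     # deadline; data survives until then and can be restored.
|     import datetime as dt
|     from dataclasses import dataclass
|     from typing import List, Optional
|
|     GRACE = dt.timedelta(days=7)
|
|     @dataclass
|     class Resource:
|         name: str
|         state: str = "ACTIVE"   # ACTIVE | PENDING_DELETE | PURGED
|         purge_after: Optional[dt.datetime] = None
|
|     def request_delete(r: Resource) -> None:
|         r.state = "PENDING_DELETE"    # unavailable, data retained
|         r.purge_after = dt.datetime.utcnow() + GRACE
|         # notify_customer(r) would fire here, before any data loss
|
|     def restore(r: Resource) -> None:
|         if r.state == "PENDING_DELETE":
|             r.state, r.purge_after = "ACTIVE", None
|
|     def purge_expired(rs: List[Resource],
|                       now: dt.datetime) -> None:
|         for r in rs:
|             if (r.state == "PENDING_DELETE"
|                     and now >= r.purge_after):
|                 r.state = "PURGED"    # only now is data discarded
|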
| Absent these broader measures, I don't find this postmortem to be
| in the slightest bit reassuring. Given the are-you-f*ing-kidding-
| me nature of the incident, I would have expected any sensible
| provider who takes the slightest pride in their service, or even
| is merely interested in protecting their reputation, to visibly
| go over the top in ensuring nothing like this could happen again.
| Instead, they've done the bare minimum. That says something bad
| about the culture at Google Cloud.
| 2OEH8eoCRo0 wrote:
| That sounds reasonable. Perhaps they felt that a larger change
| to process would be riskier overall.
| JCM9 wrote:
| The quality and rigor of GCP's engineering is not even remotely
| close to that of AWS or Azure, and this incident shows it.
| justinclift wrote:
| And Azure has a very poor reputation, so that bar is _not_ at
| all high.
| SoftTalker wrote:
| Honestly I've never worked anywhere that didn't have some kind
| of "war story" that was told about how some admin or programmer
| mistake resulted in the deletion of some vast swathe of data,
| and then the panic-driven heroics that were needed to recover.
|
| It shouldn't happen, but it does, all the time, because humans
| aren't perfect, and neither are the things we create.
| kccqzy wrote:
| Azure has had a continuous stream of security breaches. I don't
| trust them either. It's AWS and AWS alone.
| l00tr wrote:
| If it were a small or medium business, Google wouldn't even care.
| lopkeny12ko wrote:
| > Google Cloud services have strong safeguards in place with a
| combination of soft delete, advance notification, and human-in-
| the-loop, as appropriate.
|
| I mean, clearly not? By Google's own admission, in this very
| article, the resources were not soft deleted, no advance
| notification was sent, and there was no human in the loop for
| approving the automated deletion.
|
| And Google's remediation items include adding even _more_
| automation for this process. This sounds totally backward to me.
| Am I missing something?
| maximinus_thrax wrote:
| > Google Cloud continues to have the most resilient and stable
| cloud infrastructure in the world.
|
| As a company, Google has a lot of work to do on its customer
| care reputation, regardless of what some metrics somewhere say
| about whose cloud is more reliable. I would not trust my
| business to Google Cloud; I would not trust anything with money
| to anything with the Google logo. Anyone who's been reading
| Hacker News for a couple of years can remember how many times
| folks were asking for insider contacts to recover their
| accounts/data. Extrapolating this to a business would keep me up
| at night.
| dekhn wrote:
| If you're a GCP customer with a TAM, here's how to make them
| squirm. Ask them what protections GCP has in place, on your
| account, that would prevent GCP from inadvertently deleting large
| amounts of resources if GCP makes an administrative error.
|
| They'll point to something that says this specific problem was
| alleviated (by deprecating the tool that did it and automating
| more of the process), and then you can persist: we know you've
| fixed this problem, but as a follow-up, will a human review a
| large-scale deletion like this before the resources are deleted?
|
| From what I can tell (I worked for GCP aeons ago, and have been
| an active user of AWS for even longer), GCP's human-based
| protection measures are close to non-existent, and much weaker
| than AWS's. Either way, it's definitely worth asking your TAM
| about this very real risk.
___________________________________________________________________
(page generated 2024-05-24 23:00 UTC)