[HN Gopher] Sharing details on a recent incident impacting one o...
       ___________________________________________________________________
        
       Sharing details on a recent incident impacting one of our customers
        
       Author : nonfamous
       Score  : 284 points
       Date   : 2024-05-24 14:48 UTC (1 day ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | foobazgt wrote:
       | Sounds like a pretty thorough review in that they didn't stop at
       | just an investigation of the specific tool / process, but also
       | examined the rest for any auto deletion problems and also
       | confirmed soft delete behavior.
       | 
       | They could have gone one step further by reviewing all cases of
       | default behavior for anything that might be surprising. That
       | said, it can be difficult to assess what is "surprising", as it's
       | often the people who know the least about a tool/API who also
       | utilize its defaults.
        
         | x0x0 wrote:
         | Sounds more like some pants browning because incidents like
         | this are a great reason to just use aws. Like come on:
         | 
         | > _After the end of the system-assigned 1 year period, the
         | customer's GCVE Private Cloud was deleted. No customer
         | notification was sent because the deletion was triggered as a
         | result of a parameter being left blank by Google operators
         | using the internal tool, and not due to a customer deletion
         | request. Any customer-initiated deletion would have been
         | preceded by a notification to the customer._
         | 
         | ... Tada! We're so incompetent we let giant deletes happen
         | with no human review. Thank god this customer didn't trust us
         | and kept off-gcp backups or they'd be completely screwed.
         | 
         | > _There has not been an incident of this nature within Google
         | Cloud prior to this instance. It is not a systemic issue._
         | 
         | Translated to English: oh god, every aws and Azure salesperson
         | has sent 3 emails to all their prospects citing our utter
         | fuckup.
        
           | markfive wrote:
           | > Thank god this customer didn't trust us and kept off-gcp
           | backups or they'd be completely screwed.
           | 
           | Except that, from the article, the customer's backups that
           | were used to recover were in GCP, and in the same region.
        
             | ceejayoz wrote:
             | I'm curious about that bit.
             | 
             | https://www.unisuper.com.au/contact-us/outage-update says
             | "UniSuper had backups in place with an additional service
             | provider. These backups have minimised data loss, and
             | significantly improved the ability of UniSuper and Google
             | Cloud to complete the restoration."
        
               | politelemon wrote:
               | That's the bit that's sticking out to me as
               | contradictory. I'm inclined to not believe what GCP have
               | said here, as an account deletion is an account deletion;
               | why would some objects be left behind?
               | 
               | No doubt this little bit must be causing some annoyance
               | among UniSuper's tech teams.
        
               | flaminHotSpeedo wrote:
               | I'm inclined to not believe GCP because they edited their
               | status updates retroactively and lied in their postmortem
               | about the Clichy fire in Paris not affecting multiple
               | "zones"
        
               | graemep wrote:
               | They had another provider because the regulator requires
               | it. I suspect a lot of businesses in less regulated
               | industries do not.
        
             | skywhopper wrote:
             | I think you misread. Here's the relevant statement from the
             | article:
             | 
             | "Data backups that were stored in Google Cloud Storage in
             | the same region were not impacted by the deletion, and,
             | _along with third party backup software_, were
             | instrumental in aiding the rapid restoration."
        
         | rezonant wrote:
         | > and also confirmed soft delete behavior.
         | 
         | Where exactly do they mention they have confirmed soft delete
         | behavior systemically? All they said was they have ensured that
         | this specific automatic deletion scenario can no longer happen,
         | and it seems the main reason is because "these deployments are
         | now automated". They were automated before, now they are even
         | more automated. That does zero to assure me that their deletion
         | mechanisms are consistently safe, only that there's no operator
         | at the wheel any more.
        
       | gnabgib wrote:
       | Related stories:
       | 
       | _UniSuper members go a week with no account access after Google
       | Cloud misconfig_ [0] (186 points, 16 days ago, 42 comments)
       | 
       | _Google Cloud accidentally deletes customer's account_ [1] (128
       | points, 15 days ago, 32 comments)
       | 
       | [0]: https://news.ycombinator.com/item?id=40304666
       | 
       | [1]: https://news.ycombinator.com/item?id=40313171
        
       | tempnow987 wrote:
       | Wow - I was wrong. I thought this would have been something like
       | terraform with a default to immediate delete with no recovery
       | period or something. Still a default, but a third party thing and
       | maybe someone in unisuper testing something and mis-scoping the
       | delete.
       | 
       | Crazy that it really was google side. UniSuper must have been
       | like WHAT THE HELL?
        
         | rezonant wrote:
         | One assumes they are getting a massive credit to their GCP
         | bill, if not an outright remediation payment from Google.
        
           | abraae wrote:
           | The effusive praise for the customer in Google's statement
           | makes me think they have free GCP for the next year, in
           | exchange for not going public with their frustrations.
        
         | markmark wrote:
         | The article describes what happened and it had nothing to do
         | with Unisuper. Google deployed the private cloud with an
         | internal Google tool. And that internal Google tool configured
         | things to auto-delete after a year.
        
       | mannyv wrote:
       | I guessed it was provisioning or keys. Looks like I was somewhat
       | correct!
        
       | janalsncm wrote:
       | I think it stretches credulity to say that the first time such an
       | event happened was with a multi billion dollar mutual fund. In
       | other words, I'm glad Unisuper's problem was resolved, but there
       | were probably many others which were small enough to ignore.
       | 
       | I can only hope this gives GCP the kick in the pants it needs.
        
         | resolutebat wrote:
         | GCVE (managed VMware) is a pretty obscure service; it's only
         | used by the kind of multi-billion-dollar companies that want to
         | lift and shift their legacy VMware fleets into the cloud as-is.
        
         | crazygringo wrote:
         | I doubt it, because even a smaller customer would have taken
         | this to the press, which would have picked up on it.
         | 
         | "Google deleted our cloud service" is a major news story for a
         | business of any size.
        
         | joshuamorton wrote:
         | A critical piece of the incident here was that this involved
         | special customization that most customers didn't have or use,
         | and which bypassed some safety checks; as a result it couldn't
         | impact "normal" small customers.
        
       | sgt101 wrote:
       | Super motivating to have off-cloud backup strategies...
        
         | tgv wrote:
         | Or cross-cloud. S3's ingress and storage costs are low, so
         | that's an option when you don't use AWS.
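         | 
         | Something as simple as a nightly job copying backup objects out
         | of GCS into an S3 bucket goes a long way. A rough sketch,
         | assuming boto3 and google-cloud-storage are installed (the
         | bucket names are made up):
         | 
         |   import boto3
         |   from google.cloud import storage
         | 
         |   def mirror_gcs_to_s3(gcs_bucket="prod-backups",
         |                        s3_bucket="prod-backups-offsite"):
         |       gcs = storage.Client()
         |       s3 = boto3.client("s3")
         |       for blob in gcs.list_blobs(gcs_bucket):
         |           # S3 ingress is free; you only pay for storage at rest.
         |           with blob.open("rb") as data:
         |               s3.upload_fileobj(data, s3_bucket, blob.name)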
        
       | jawns wrote:
       | > The customer's CIO and technical teams deserve praise for the
       | speed and precision with which they executed the 24x7 recovery,
       | working closely with Google Cloud teams.
       | 
       | I wonder if they just get praise in a blog post, or if the
       | customer is now sitting on a king's ransom in Google Cloud
       | credit.
        
         | rezonant wrote:
         | There's no reality where a competent customer isn't going to
         | ensure Google pays for this. I'd be surprised if they have a
         | bill at all this year.
        
         | wolfi1 wrote:
         | there should have been some punitive damage
        
       | noncoml wrote:
       | What surprises me the most is that the customer managed to
       | actually speak to a person from Google support. Must have been a
       | pretty big private cloud deployment.
       | 
       | Edit: saw from the other replies that the customer was Unisuper.
       | No wonder they managed to speak to an actual person.
        
       | cebert wrote:
       | > "Google Cloud continues to have the most resilient and stable
       | cloud infrastructure in the world."
       | 
       | I don't think GCP has that reputation compared to AWS or Azure.
       | They aren't at the same level.
        
         | sa46 wrote:
         | Microsoft is prone to severe breaches a few times per year.
         | 
         | https://firewalltimes.com/microsoft-data-breach-timeline/
        
           | pquki4 wrote:
           | You can't just equate Microsoft and Azure like that.
        
             | skywhopper wrote:
             | Azure has had multiple embarrassingly bad tenant-boundary
             | leaks, including stuff like being able to access another
             | customer's metadata service, including credentials, just by
             | changing a port number. They clearly have some major issues
             | with lack of internal architecture review.
        
         | markmark wrote:
         | Does Azure? I think there's AWS then everyone else.
        
         | pm90 wrote:
         | I have used Azure, AWS and GCP. The only reason people use
         | Azure is because others force them to. It's an extremely shitty
         | cloud product. They pretend to compete with AWS but aren't even
         | as good as GCP.
        
       | snewman wrote:
       | Given the level of impact that this incident caused, I am
       | surprised that the remediations did not go deeper. They ensured
       | that the same problem could not happen again in the same way, but
       | that's all. So some equivalent glitch somewhere down the road
       | could lead to a similar result (or worse; not all customers might
       | have the same "robust and resilient architectural approach to
       | managing risk of outage or failure").
       | 
       | Examples of things they could have done to systematically guard
       | against inappropriate service termination / deletion in the
       | future:
       | 
       | 1. When terminating a service, temporarily place it in a state
       | where the service is unavailable but all data is retained and can
       | be restored at the push of a button. Discard the data after a few
       | days. This provides an opportunity for the customer to report the
       | problem.
       | 
       | 2. Audit all deletion workflows for all services (they only
       | mention having reviewed GCVE). Ensure that customers are notified
       | in advance whenever any service is terminated, even if "the
       | deletion was triggered as a result of a parameter being left
       | blank by Google operators using the internal tool".
       | 
       | 3. Add manual review for _any_ termination of a service that is
       | in active use, above a certain size.
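       | 
       | To make (1) concrete, a minimal sketch of a two-phase termination
       | (the class, helpers and grace window below are all hypothetical,
       | not how GCVE actually works):
       | 
       |   from datetime import datetime, timedelta
       | 
       |   GRACE_PERIOD = timedelta(days=7)  # made-up retention window
       | 
       |   class PrivateCloud:
       |       # Toy stand-in for a provisioned service.
       |       def __init__(self, name):
       |           self.name = name
       |           self.access_enabled = True
       |           self.purge_after = None
       | 
       |   def terminate(svc):
       |       # Phase 1: cut access but keep every byte. The customer
       |       # notices, screams, and restoring is one flag flip.
       |       svc.access_enabled = False
       |       svc.purge_after = datetime.utcnow() + GRACE_PERIOD
       |       print(f"notify owner: {svc.name} purges at {svc.purge_after}")
       | 
       |   def purge_expired(fleet):
       |       # Phase 2: data is only discarded once the grace period
       |       # has elapsed with no restore request.
       |       for svc in list(fleet):
       |           if svc.purge_after and datetime.utcnow() > svc.purge_after:
       |               fleet.remove(svc)  # the real hard delete goes here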
       | 
       | Absent these broader measures, I don't find this postmortem to be
       | in the slightest bit reassuring. Given the are-you-f*ing-kidding-
       | me nature of the incident, I would have expected any sensible
       | provider who takes the slightest pride in their service, or even
       | is merely interested in protecting their reputation, to visibly
       | go over the top in ensuring nothing like this could happen again.
       | Instead, they've done the bare minimum. That says something bad
       | about the culture at Google Cloud.
        
         | 2OEH8eoCRo0 wrote:
         | That sounds reasonable. Perhaps they felt that a larger change
         | to process would be riskier overall.
        
           | TheCleric wrote:
           | No it would probably be even worse from Google's perspective:
           | more expensive.
        
         | rezonant wrote:
         | Hard agree. They clearly were more interested in making clear
         | that there's not a systemic problem in how GCP's operators
         | manage the platform, which reads, strongly and alarmingly, as if
         | there is a systemic problem in how GCP's operators manage the
         | platform. The lack of the common-sense measures you outline in
         | their postmortem just tells me that they aren't doing anything
         | to fix it.
        
           | ok_dad wrote:
           | "There's no systemic problem."
           | 
           | Meanwhile, the operators were allowed to leave a parameter
           | blank and the default was to set a deletion time bomb.
           | 
           | Not systemic my butt! That's a process failure, and every
           | process failure like this is a systemic problem because the
           | system shouldn't allow a stupid error like this.
        
             | joshuamorton wrote:
             | If you're arguing that _that_ was the systemic problem,
             | then it's been fully fixed, as the manual operation was
             | removed and so validation can no longer be bypassed.
        
         | phito wrote:
         | It's a joke that they're not doing these things. How can you be
         | a giant cloud provider and not think of putting safeguards
         | around data deletion? I guess that realistically they thought
         | of it many times but never implemented it because it costs
         | money.
        
           | pm90 wrote:
           | It's probably because implementing such safeguards wouldn't
           | help anyone's promo packet.
           | 
           | I really dislike that most of our major cloud infrastructure
           | is provided by big tech rather than e.g. infrastructure
           | vendors. I trust Equinix a lot more than Google because that's
           | all they do.
        
             | metadat wrote:
             | Understandable, however public clouds are a huge mix of
             | both hardware and software, and it takes deep proficiency
             | at both to pull it off. Equinix are definitely in the
             | hardware and routing business; it may be tough to work
             | upstream.
             | 
             | Hardware always gets commoditized to the max (sad but true).
        
             | Thorrez wrote:
             | I work in GCP and have seen a lot of OKRs about improving
             | reliability. So implementing something like this would help
             | someone's promo packet.
        
               | cbarrick wrote:
               | This is exactly the kind of work that would get SREs
               | promoted.
        
               | passion__desire wrote:
               | It is funny Google has internal memegen but not ideagen.
               | Ideate away your problems, guys.
        
             | lima wrote:
             | As a customer of Equinix Cloud... No thank you.
             | Infrastructure vendors are terrible software engineers.
        
         | Ocha wrote:
         | I wouldn't be surprised if VMware support is getting deprecated
         | in GCP so they just don't care - waiting for all customers to
         | move off of it
        
           | snewman wrote:
           | My point is that if they had this problem in their VMware
           | support, they might have a similar problem in one of their
           | other services. But they didn't check (or at least they
           | didn't claim credit for having checked, which likely means
           | they didn't check).
        
         | sangnoir wrote:
         | > When terminating a service, temporarily place it in a state
         | where the service is unavailable but all data is retained and
         | can be restored at the push of a button. Discard the data after
         | a few days. This provides an opportunity for the customer to
         | report the problem
         | 
         | Replacing actual deletion with deletion flags may lead to
         | _other_ fun bugs like "Google Cloud fails to delete customer
         | data, running afoul of EU rules". I suspect Google
         | would err on the side of accidental deletions rather than
         | accidental non-deletions: at least in the EU.
        
           | pm90 wrote:
           | I highly doubt this was the reason. Google has similar
           | deletion protection for other resources eg GCP projects are
           | soft deleted for 30 days before being nuked.
        
           | boesboes wrote:
           | Not really how it works. GDPR protects individuals and allows
           | them to request deletion from the data owner. They then need
           | to respond to any request within 60(?) days. Google has
           | nothing to do with that beyond having to make sure their
           | infra is secure. There even are provisions for dealing with
           | personal data in backups.
           | 
           | EU law has nothing to do with this.
        
           | mcherm wrote:
           | > I suspect Google would err on the side of accidental
           | deletions rather than accidental non-deletions: at least in
           | the EU.
           | 
           | I certainly hope not, because that would be incredibly
           | stupid. Customers understand the significance of different
           | kinds of risk. This story got an incredible amount of
           | attention among the community of people who choose between
           | different cloud services. A story about how Google had failed
           | to delete data on time would not have gotten nearly as much
           | attention.
           | 
           | But let us suppose for a moment that Google has no concern
           | for their reputation, only for their legal liability. Under
           | EU privacy rules, there might be some liability for failing
           | to delete data on schedule -- although I strongly suspect
           | that the kind of "this was an unavoidable one-off mistake"
           | justifications that we see in this article would convince a
           | court to reduce that liability.
           | 
           | But what liability would they face for the deletion? This was
           | a hedge fund managing billions of dollars. Fortunately, they
           | had off-site backups to restore their data. If they hadn't,
           | and it had been impossible to restore the data, how much
           | liability could Google have faced?
           | 
           | Surely, even the lawyers in charge of minimizing liability
           | would agree: it is better to fail by keeping customers'
           | accounts than to fail by deleting them.
        
           | rlpb wrote:
           | A deletion flag is acceptable under EU rules. For example,
           | they are acceptable as a means of dealing with deletion
           | requests for data that also exists in backups. Provided that
           | the restore process also honors such flags.
        
         | steveBK123 wrote:
         | >> 1. When terminating a service, temporarily place it in a
         | state where the service is unavailable but all data is retained
         | and can be restored at the push of a button. Discard the data
         | after a few days. This provides an opportunity for the customer
         | to report the problem.
         | 
         | This is so obviously "enterprise software 101" that it is
         | telling Google is operating in 2024 without it.
         | 
         | Since my new hire grad days, the idea of immediately deleting
         | data that is no longer needed was out of the question.
         | 
         | Soft deletes in databases with a column you mark as deleted.
         | Move/rename data on disk until you're super duper sure you need
         | to delete it (and maybe still let the backup remain). Etc.
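         | 
         | The pattern costs almost nothing to sketch (the table, column
         | and file names are invented, not any particular ORM or schema):
         | 
         |   import os
         | 
         |   def soft_delete(cursor, row_id):
         |       # Mark the row instead of removing it; every read path
         |       # then filters on "deleted_at IS NULL".
         |       cursor.execute(
         |           "UPDATE accounts SET deleted_at = now() WHERE id = %s",
         |           (row_id,),
         |       )
         | 
         |   def retire_file(path):
         |       # Same idea on disk: rename, don't unlink, until you are
         |       # very sure nobody needs it back.
         |       os.rename(path, path + ".trash")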
        
           | nikanj wrote:
           | There are many voices in the industry arguing against soft
           | deletes. Mostly coming from a very Chesterton's Fence
           | perspective.
           | 
           | For some examples
           | https://www.metabase.com/learn/analytics/data-model-
           | mistakes...
           | 
           | https://www.cultured.systems/2024/04/24/Soft-delete/
           | 
           | https://brandur.org/soft-deletion
           | 
           | Many more can easily be found.
        
             | snewman wrote:
             | For the use case we're discussing here, of terminating an
             | entire service, the soft delete would typically be needed
             | only at some high level, such as on the access list for the
             | service. The impact on performance, etc. should be minimal.
        
               | steveBK123 wrote:
               | Precisely, before you delete a customer account, you
               | disable its access to the system. This is a scream test.
               | 
               | Once you've gone through some time and due diligence you
               | can contemplate actually deleting the customer data and
               | account.
        
             | danparsonson wrote:
             | OK, but those examples you gave all boil down to the
             | following:
             | 
             | 1. you might accidentally access soft-deleted data and/or
             | the data model is more complicated
             | 
             | 2. data protection
             | 
             | 3. you'll never need it
             | 
             | to which I say
             | 
             | 1. you'll make all kinds of mistakes if you don't
             | understand the data model, and, it's really not that hard
             | to tuck those details away inside data access code/SPs/etc
             | that the rest of your app doesn't need to care about
             | 
             | 2. you can still delete the data later on, and indeed that
             | may be preferable as deleting under load can cause
             | performance (e.g. locking) issues
             | 
             | 3. at least one of those links says they never used it,
             | then gives an example of when soft-deleted data was used to
             | help recover an account (albeit by creating a new record as
             | a copy, but only because they'd never tried an undelete
             | before and were worried about breaking something; sensible
             | but not exactly making the point they wanted to make)
             | 
             | So I'm gonna say I don't get it; sure it's not a panacea,
             | yes there are alternatives, but in my opinion neither is it
             | an anti-pattern. It's just one of dozens of trade-offs made
             | when designing a system.
        
           | crazygringo wrote:
           | It sounds like the problem is that the deletion was
           | configured with an internal tool that bypassed all those
           | kinds of protections -- that went straight to the actual
           | delete. Including warnings to the customer, etc.
           | 
           | Which is bizarre. Even internal tools used by reps shouldn't
           | be performing hard deletes.
           | 
           | And then I'd also love to know how the heck a default value
           | to expire in a year ever made it past code review. I think
           | that's the biggest howler of all. How did one person ever
           | think there should be a default like that, and how did
           | someone else see it and say yeah that sounds good?
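           | 
           | The fix for that part is almost boring (a sketch; the real
           | internal tool's parameters obviously aren't public):
           | 
           |   def provision_private_cloud(name, term_days=None):
           |       # Refuse to guess: a blank/unset lifetime should be an
           |       # error, never a silent "auto-delete after one year".
           |       if term_days is None:
           |           raise ValueError("term_days must be set explicitly")
           |       ...
           | 
           | A required parameter with no default is the cheapest guardrail
           | there is.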
        
           | roughly wrote:
           | > This is so obviously "enterprise software 101" that it is
           | telling Google is operating in 2024 without it.
           | 
           | My impression of GCP generally is that they've got some very
           | smart people working on some very impressive advanced
           | features and all the standard boring stuff nobody wants to do
           | is done to the absolute bare minimum required to check the
           | spec sheet. For all its bizarre modern enterprise-ness, I
           | don't think Google ever really grew out of its early academic
           | lab habits.
        
             | steveBK123 wrote:
             | I know a bunch of way-too-smart PhD types that worked at
             | GOOG exclusively in R&D roles that, they bragged to me
             | earnestly, were not revenue generating.
        
         | ajross wrote:
         | FWIW, you're solving the bug by fiat, and that doesn't work.
         | Surely analogs to all those protections are already in place.
         | But a firm and obvious requirement of a software system that is
         | capable of deleting data is _the ability to delete data_. And if
         | it can do it, you can write a bug that short-circuits any
         | architectural protection you put in place. Which is the
         | definition of a bug.
         | 
         | Basically I don't see this as helpful. This is just a form of
         | the "I would never have written this bug" postmortem response.
         | And yeah, you would. We all would. And do.
        
         | mehulashah wrote:
         | I'm completely baffled by Google's "postmortem" myself. Not
         | only is it obviously insufficient to anyone that has operated
         | online services as you point out, but the conclusions are full
         | of hubris. I.e. this was a one time incident, it won't happen
         | again, we're very sorry, but we're awesome and continue to be
         | awesome. This doesn't seem to help Google Cloud's face-in-palm
         | moment.
        
           | playingalong wrote:
           | It looks like they could stand to read the SRE book by Google.
           | BTW it's available for free at
           | https://sre.google/sre-book/table-of-contents/
           | 
           | A bit chaotic (a mix of short essays) and simplistic
           | (assuming one kind of approach or design), but definitely
           | still worth a read. No exaggeration to state it was category
           | defining.
        
         | markhahn wrote:
         | most of this complaint is explicitly answered in the article.
         | must have been TL...
        
         | PcChip wrote:
         | Could it have been a VMware expiration setting somewhere, and
         | thus VMware itself deleted the customer's tenant? If so then
         | Google wouldn't have a way to prove it won't happen again
         | except by always setting the expiration flag to "never" instead
         | of leaving it blank
        
         | yalok wrote:
         | I would add one more -
         | 
         | 4. Add an option to auto-backup all the data from the account
         | to the outside backup service of the user's choice.
         | 
         | This would help not just with these kinds of accidents, but also
         | any kind of data corruption/availability issues.
         | 
         | I would pay for this even for my personal gmail account.
        
         | belter wrote:
         | Can you imagine if there was no backup? Would Google be on the
         | hook to cover the +/- 200 billion in losses?
         | 
         | This is why the smart people at Berkshire Hathaway don't offer
         | Cyber Insurance: https://youtu.be/INztpkzUaDw?t=5418
        
       | JCM9 wrote:
       | The quality and rigor of GCP's engineering is not even remotely
       | close to that of an AWS or Azure and this incident shows it.
        
         | justinclift wrote:
         | And Azure has a very poor reputation, so that bar is _not_ at
         | all high.
        
         | SoftTalker wrote:
         | Honestly I've never worked anywhere that didn't have some kind
         | of "war story" that was told about how some admin or programmer
         | mistake resulted in the deletion of some vast swathe of data,
         | and then the panic-driven heroics that were needed to recover.
         | 
         | It shouldn't happen, but it does, all the time, because humans
         | aren't perfect, and neither are the things we create.
        
           | 20after4 wrote:
           | Sure, it's the tone and content of their response that is
           | worrying, more than the fact that an incident happened. What
           | was needed was an honest and transparent root cause analysis
           | with technically sound and thorough mitigations, including
           | changes in policy with regard to defaults. Their response
           | seems like only the most superficial, bare-minimum
           | approximation of an appropriate response to deleting a large
           | customer's entire account. If I were on the incident response
           | team I'd be strongly advocating for at least these additional
           | changes:
           | 
           | Make deletes opt-in rather than opt out. Make all large-scale
           | deletions have some review process with automated tests and a
           | final human review. And not just some low-level technical
           | employee, the account managers should have seen this on their
           | dashboard somewhere long before it happened. Finally,
           | undertake a thorough and systematic review of other services
           | to look for similar failure modes, especially with regard to
           | anything which is potentially destructive and can conceivably
           | be default-on in the absence of a supplied configuration
           | parameter.
        
         | kccqzy wrote:
         | Azure has had a continuous stream of security breaches. I don't
         | trust them either. It's AWS and AWS alone.
        
           | IcyWindows wrote:
           | Huh? I have seen ones for the rest of Microsoft, but not
           | Azure.
        
             | twisteriffic wrote:
             | One of many
             | https://msrc.microsoft.com/blog/2021/09/additional-
             | guidance-...
        
       | l00tr wrote:
       | if it were a small or medium business, Google wouldn't even care
        
       | lopkeny12ko wrote:
       | > Google Cloud services have strong safeguards in place with a
       | combination of soft delete, advance notification, and human-in-
       | the-loop, as appropriate.
       | 
       | I mean, clearly not? By Google's own admission, in this very
       | article, the resources were not soft deleted, no advance
       | notification was sent, and there was no human in the loop for
       | approving the automated deletion.
       | 
       | And Google's remediation items include adding even _more_
       | automation for this process. This sounds totally backward to me.
       | Am I missing something?
        
         | jerbear4328 wrote:
         | They automated away the part that had a human error (the
         | internal tool with a field left blank), so that human error
         | can't mess it up in the same way again. They should move that
         | human labor to checking before tons of stuff gets deleted.
        
           | 20after4 wrote:
           | It seems to me that the default-delete is the real WTF. Why
           | would a blank field result in a default auto-delete in any
           | sane world? The delete should be opt-in, not opt-out.
        
             | macintux wrote:
             | It took me way too many years to figure out that any
             | management script I write for myself and my co-workers
             | should, by default, execute as a dry run operation.
             | 
             | I now put a -e/--execute flag on every destructive command;
             | without that, the script will conduct some basic sanity
             | checks and halt before making changes.
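             | 
             | Something like this (an argparse sketch; the two stubs stand
             | in for whatever the real tool does):
             | 
             |   import argparse
             | 
             |   def find_expired_vms():  # stub for the real lookup
             |       return ["vm-1", "vm-2"]
             | 
             |   def delete_vm(vm):  # stub for the destructive call
             |       print(f"DELETED {vm}")
             | 
             |   parser = argparse.ArgumentParser(description="prune old VMs")
             |   parser.add_argument(
             |       "-e", "--execute", action="store_true",
             |       help="actually delete; default is a dry run")
             |   args = parser.parse_args()
             | 
             |   doomed = find_expired_vms()
             |   for vm in doomed:
             |       print(f"would delete {vm}")
             |   if not args.execute:
             |       print("dry run; re-run with --execute to delete")
             |       raise SystemExit(0)
             |   for vm in doomed:
             |       delete_vm(vm)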
        
       | maximinus_thrax wrote:
       | > Google Cloud continues to have the most resilient and stable
       | cloud infrastructure in the world.
       | 
       | As a company, Google has a lot of work to do about its customer
       | care reputation regardless of what some metrics somewhere say
       | about whose cloud is more reliable or not. I would not trust my
       | business to Google Cloud, I would not trust anything with money
       | to anything with the Google logo. Anyone who's been reading
       | hacker news for a couple of years can remember how many times
       | folks were asking for insider contacts to recover their
       | accounts/data. Extrapolating this to a business would keep me up
       | at night.
        
       | dekhn wrote:
       | If you're a GCP customer with a TAM, here's how to make them
       | squirm. Ask them what protections GCP has in place, on your
       | account, that would prevent GCP from inadvertently deleting large
       | amounts of resources if GCP makes an administrative error.
       | 
       | They'll point to something that says this specific problem was
       | alleviated (by deprecating the tool that did it, and automating
       | more of the process), and then you can persist: we know you've
       | fixed this problem; then follow up: will a human review this
       | large-scale deletion before the resources are deleted?
       | 
       | From what I can tell (I worked for GCP aeons ago, and am an
       | active user of AWS for even longer) GCP's human-based protection
       | measures are close to non-existent, and much less than AWS.
       | Either way, it's definitely worth asking your TAM about this very
       | real risk.
        
         | RajT88 wrote:
         | Give 'em hell.
         | 
         | This motivates the TAMs to learn to work the system better.
         | They will never be able to change things on their own, but
         | sometimes you get escalation path promises and gentlemen's
         | agreements.
         | 
         | Enough screaming TAMs may eventually motivate someone high up
         | to take action. Someday.
        
           | ethbr1 wrote:
           | Way in which TAMs usually actually fix things:
           | 
           |   - Single customer complains loudly
           |   - TAM searches for other customers with similar concerns
           |   - Once total ARR is sufficient...
           |   - It gets added to dev's roadmap
        
             | RajT88 wrote:
             | If you are lucky!
             | 
             | I work for a CSP (not GCP) so I may be a little cynical on
             | the topic.
        
               | ethbr1 wrote:
               | Steps 3 and 4 are usually the difficult ones.
        
             | nikanj wrote:
             | - It gets closed as wontfix. Google never hires a human to
             | do a job well, if AI/ML can do the same job badly
        
             | mulmen wrote:
             | Does the ARR calculation consider the lifetime of cloud
             | credits given to burned customers to prevent them from
             | moving to a competitor?
             | 
             | In other words can UniSuper be confident in getting support
             | from Google next time?
        
               | ethbr1 wrote:
               | My heart says absolutely not.
        
         | DominoTree wrote:
         | Pitch it as an opportunity for a human at Google to reach out
         | and attempt to retain a customer when someone has their assets
         | scheduled for deletion. Would probably get more traction
         | internally, and has a secondary effect of ensuring it's clear
         | to everyone that things are about to be nuked.
        
       | tkcranny wrote:
       | > 'Google teams worked 24x7 over several days'
       | 
       | I don't know if they get what the seven means there.
        
         | ReleaseCandidat wrote:
         | They worked so much, the days felt like weeks.
        
         | rezonant wrote:
         | I suppose if mitigation fell over the weekend it might still
         | make sense.
        
         | shermantanktop wrote:
         | 24 engineers, 7 hours a day. Plus massages and free cafeteria
         | food from a chef.
        
         | Etherlord87 wrote:
         | Perhaps the team members cycle so that the team was working on
         | the thing without any night or weekend break. Which should be a
         | standard thing at all times for a big project like this IMHO.
        
         | crazygringo wrote:
         | Ha, you're right it's a bit nonsensical if you take it
         | completely literally.
         | 
         | But of course x7 means working every day of the week. So you
         | can absolutely work 24x7 from Thursday afternoon through
         | Tuesday morning. It just means they didn't take the weekend
         | off.
        
         | iJohnDoe wrote:
         | Or they offloaded to India like they do for most of their
         | stuff.
        
         | shombaboor wrote:
         | this comment made my day
        
       | postatic wrote:
       | UniSuper customer here in Aus. Didn't know what it was, but kept
       | receiving emails every day while they were trying to resolve this.
       | Only found out from the news what actually happened. Feels like
       | they downplayed the whole thing as "system downtime". Imagine if
       | something had actually happened to people's money, the billions of
       | dollars saved as their superannuation fund.
        
       | lukeschlather wrote:
       | The initial statement on this incident was pretty misleading, it
       | sounded like Google just accidentally deleted an entire GCP
       | account. Reading this writeup I'm reassured, it sounds like they
       | only lost a region's worth of virtual machines, which is
       | absolutely something that happens (and that I think my systems
       | can handle without too much trouble.) The original writeup made
       | it sound like all of their GCS buckets, SQL databases, etc. in
       | all regions were just gone, which is a different thing and
       | something I hope Google can be trusted not to do.
        
         | wmf wrote:
         | It was a red flag when UniSuper said their subscription was
         | deleted, not their account. Many people jumped to conclusions
         | about that.
        
       | hiddencost wrote:
       | > It is not a systemic issue.
       | 
       | I kinda think the opposite. The culture that kept these kinds of
       | problems at bay has largely left the company or stopped trying to
       | keep it alive, as they no longer really care about what they're
       | building.
       | 
       | Morale is real bad.
        
       | mercurialsolo wrote:
       | If only internal tools went through the same scrutiny as public
       | tools.
       | 
       | More often than not, critical parameters or misconfigurations
       | happen because of internal tools which work on unpublished
       | params.
       | 
       | Internal tools should be treated as tech debt. You won't be able
       | to eliminate issues, but you can vastly reduce the surface area
       | of errors.
        
       | jwnin wrote:
       | End of day Friday disclosure before a long holiday weekend; well
       | timed.
        
       | nurettin wrote:
       | It sounds like a giant PR piece about how Google is ready to
       | respond to a single customer and is ready to work through their
       | problems instead of creating an auto-response account suspension
       | infinite loop nightmare.
        
       | kjellsbells wrote:
       | Interesting, but I draw different lessons from the post.
       | 
       | Use of internal tools. Sure, everyone has internal tools, but if
       | you are doing customer stuff, you really ought to be using the
       | same API surface as the public tooling, which at cloud scale is
       | guaranteed to have been exercised and tested much more than some
       | little dev group's scripts. Was that the case here?
       | 
       | Passive voice. This post should have a name attached to it. Like,
       | Thomas Kurian. Palming it off to the anonymous "customer support
       | team" still shows a lack of understanding of how trust is
       | maintained with customers.
       | 
       | The recovery seems to have been due to exceptional good fortune
       | or foresight on the part of the customer, not Google. It seems
       | that the customer had images or data stored outside of GCP. How
       | many of us cloud users could say that? How many of us cloud users
       | have encouraged customers to move further and deeper along the
       | IaaS > PaaS > SaaS curve, making them more vulnerable to total
       | account loss like this? There's an uncomfortable lesson here.
        
         | kleton wrote:
         | > name attached
         | 
         | Blameless (and nameless) postmortems are a cultural thing at
         | google
        
           | rohansingh wrote:
           | That's great internally, but serious external communication
           | with customers should have a name attached and responsibility
           | accepted (i.e., "the buck stops here").
        
             | hluska wrote:
             | Culture can't just change like that.
        
           | saurik wrote:
           | So, I read your comment and realized that I think it made me
           | misinterpret the comment you are replying to? I thereby wrote
           | a big paragraph explaining how even as someone who cares
           | about personal accountability within large companies, I
           | didn't think a name made sense to assign blame here for a
           | variety of reasons...
           | 
           | ...but, then I realized that that isn't what is being asked
           | for here: the comment isn't talking about the nameless
           | "Google operators" that aren't being blamed, it is talking
           | about the lack of anyone who wrote this post itself! There, I
           | think I do agree: someone should sign off on a post like
           | this, whether it is a project lead or the CEO of the entire
           | company... it shouldn't just be "Google Cloud Customer
           | Support".
           | 
           | Having articles that aren't really written by anyone frankly
           | makes it difficult for my monkey brain to feel there are
           | actual humans on the inside whom I can trust to care about
           | what is going on; and, FWIW, this hasn't always been a
           | general part of Google's culture: if this had been a screw up
           | in the search engine a decade ago, we would have gotten a
           | statement from Matt Cutts, and knowing that there was that
           | specific human who cared on the inside meant a lot to some of
           | us.
        
       | emmelaich wrote:
       | Using "TL;DR" in professional communication is a little
       | unprofessional.
       | 
       | Some non-nerd exec is going to wonder what the heck that means.
        
         | logrot wrote:
         | It used to be called an executive summary. It's brilliant, but
         | the kids found the phrase too formal.
         | 
         | IMHO almost every article should start with one.
        
       | logrot wrote:
       | Executive summary?
        
         | petesergeant wrote:
         | there's literally a tl;dr in the linked article
        
           | none_to_remain wrote:
           | There should be an Executive Summary
        
         | taspeotis wrote:
         | Google employee scheduled the deletion of UniSuper's resources
         | and Google (ironically) did not cancel it.
        
       | xyst wrote:
       | Transparency for Google is releasing this incident report on the
       | Friday of a long weekend [in the US].
       | 
       | I wonder if UniSuper was compensated for G's fuckup.
       | 
       | "A single default parameter vs multibillion organization. The
       | winner may surprise you!1"
        
       | walrus01 wrote:
       | The idea that you could have an automated tool delete services at
       | the end of a term for a corporate/enterprise customer of this
       | size and scale is absolutely absurd and inexcusable. No matter
       | whether the parameter was set correctly or incorrectly in the
       | first place. It should go through several levels of account
       | manager/representative/management for _manual review by a human_
       | on the Google side before removal.
        
       ___________________________________________________________________
       (page generated 2024-05-25 23:01 UTC)