[HN Gopher] Sharing details on a recent incident impacting one of our customers
___________________________________________________________________
Sharing details on a recent incident impacting one of our customers
Author : nonfamous
Score : 284 points
Date : 2024-05-24 14:48 UTC (1 day ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| foobazgt wrote:
| Sounds like a pretty thorough review, in that they didn't stop
| at just an investigation of the specific tool / process, but
| also examined the rest for auto-deletion problems and confirmed
| soft-delete behavior.
|
| They could have gone one step further by reviewing all cases of
| default behavior for anything that might be surprising. That
| said, it can be difficult to assess what is "surprising", as it's
| often the people who know the least about a tool/API who also
| utilize its defaults.
| x0x0 wrote:
| Sounds more like some pants browning because incidents like
| this are a great reason to just use aws. Like come on:
|
| > _After the end of the system-assigned 1 year period, the
| customer's GCVE Private Cloud was deleted. No customer
| notification was sent because the deletion was triggered as a
| result of a parameter being left blank by Google operators
| using the internal tool, and not due to a customer deletion
| request. Any customer-initiated deletion would have been
| preceded by a notification to the customer._
|
| ... Tada! We're so incompetent we let giant deletes happen
| with no human review. Thank god this customer didn't trust us
| and kept off-gcp backups or they'd be completely screwed.
|
| > _There has not been an incident of this nature within Google
| Cloud prior to this instance. It is not a systemic issue._
|
| Translated to English: oh god, every aws and Azure salesperson
| has sent 3 emails to all their prospects citing our utter
| fuckup.
| markfive wrote:
| > Thank god this customer didn't trust us and kept off-gcp
| backups or they'd be completely screwed.
|
| Except that, from the article, the customer's backups that
| were used to recover were in GCP, and in the same region.
| ceejayoz wrote:
| I'm curious about that bit.
|
| https://www.unisuper.com.au/contact-us/outage-update says
| "UniSuper had backups in place with an additional service
| provider. These backups have minimised data loss, and
| significantly improved the ability of UniSuper and Google
| Cloud to complete the restoration."
| politelemon wrote:
| That's the bit that's sticking out to me as
| contradictory. I'm inclined not to believe what GCP have
| said here; an account deletion is an account deletion,
| so why would some objects be left behind?
|
| No doubt this little bit must be causing some annoyance
| among UniSuper's tech teams.
| flaminHotSpeedo wrote:
| I'm inclined to not believe GCP because they edited their
| status updates retroactively and lied in their postmortem
| about the Clichy fire in Paris not affecting multiple
| "zones"
| graemep wrote:
| They had another provider because the regulator requires
| it. I suspect a lot of businesses in less regulated
| industries do not.
| skywhopper wrote:
| I think you misread. Here's the relevant statement from the
| article:
|
| "Data backups that were stored in Google Cloud Storage in
| the same region were not impacted by the deletion, and,
| _along with third party backup software_, were
| instrumental in aiding the rapid restoration."
| rezonant wrote:
| > and also confirmed soft delete behavior.
|
| Where exactly do they mention they have confirmed soft delete
| behavior systemically? All they said was they have ensured that
| this specific automatic deletion scenario can no longer happen,
| and it seems the main reason is because "these deployments are
| now automated". They were automated before, now they are even
| more automated. That does zero to assure me that their deletion
| mechanisms are consistently safe, only that there's no operator
| at the wheel any more.
| gnabgib wrote:
| Related stories:
|
| _UniSuper members go a week with no account access after
| Google Cloud misconfig_ [0] (186 points, 16 days ago, 42
| comments)
|
| _Google Cloud accidentally deletes customer's account_ [1]
| (128 points, 15 days ago, 32 comments)
|
| [0]: https://news.ycombinator.com/item?id=40304666
|
| [1]: https://news.ycombinator.com/item?id=40313171
| tempnow987 wrote:
| Wow - I was wrong. I thought this would have been something like
| terraform with a default to immediate delete with no recovery
| period or something. Still a default, but a third-party thing,
| with maybe someone at UniSuper testing something and mis-scoping
| the delete.
|
| Crazy that it really was google side. UniSuper must have been
| like WHAT THE HELL?
| rezonant wrote:
| One assumes they are getting a massive credit to their GCP
| bill, if not an outright remediation payment from Google.
| abraae wrote:
| The effusive praise for the customer in Google's statement
| makes me think they have free GCP for the next year, in
| exchange for not going public with their frustrations.
| markmark wrote:
| The article describes what happened and it had nothing to do
| with Unisuper. Google deployed the private cloud with an
| internal Google tool. And that internal Google tool configured
| things to auto-delete after a year.
| mannyv wrote:
| I guessed it was provisioning or keys. Looks like I was somewhat
| correct!
| janalsncm wrote:
| I think it stretches credulity to say that the first time such an
| event happened was with a multi billion dollar mutual fund. In
| other words, I'm glad Unisuper's problem was resolved, but there
| were probably many others which were small enough to ignore.
|
| I can only hope this gives GCP the kick in the pants it needs.
| resolutebat wrote:
| GCVE (managed VMware) is a pretty obscure service; it's only
| used by the kind of multi-billion-dollar companies that want to
| lift and shift their legacy VMware fleets into the cloud as-is.
| crazygringo wrote:
| I doubt it, because even a smaller customer would have taken
| this to the press, which would have picked up on it.
|
| "Google deleted our cloud service" is a major news story for a
| business of any size.
| joshuamorton wrote:
| A critical piece of the incident here was that this involved
| special customization that most customers didn't have or use,
| and which bypassed some safety checks, as a result it couldn't
| impact "normal" small customers.
| sgt101 wrote:
| Super motivating to have off-cloud backup strategies...
| tgv wrote:
| Or cross-cloud. S3's ingress and storage costs are low, so
| that's an option when you don't use AWS.
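|
| For instance, shipping nightly dumps into an S3 bucket is a
| few lines with boto3; the bucket and file names below are
| made up, and credentials come from the usual AWS config:
|
|     import boto3
|
|     # Copies one backup file from local disk into a bucket
|     # that lives outside your primary cloud's blast radius.
|     s3 = boto3.client("s3")
|     s3.upload_file(
|         Filename="/backups/db-2024-05-24.dump.gz",
|         Bucket="example-offsite-backups",
|         Key="prod/db-2024-05-24.dump.gz",
|     )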
| jawns wrote:
| > The customer's CIO and technical teams deserve praise for the
| speed and precision with which they executed the 24x7 recovery,
| working closely with Google Cloud teams.
|
| I wonder if they just get praise in a blog post, or if the
| customer is now sitting on a king's ransom in Google Cloud
| credit.
| rezonant wrote:
| There's no reality where a competent customer isn't going to
| ensure Google pays for this. I'd be surprised if they have a
| bill at all this year.
| wolfi1 wrote:
| there should have been some punitive damages
| noncoml wrote:
| What surprises me the most is that the customer managed to
| actually speak to a person from Google support. Must have been a
| pretty big private cloud deployment.
|
| Edit: saw from the other replies that the customer was Unisuper.
| No wonder they managed to speak to an actual person.
| cebert wrote:
| > "Google Cloud continues to have the most resilient and stable
| cloud infrastructure in the world."
|
| I don't think GCP has that reputation compared to AWS or
| They aren't at the same level.
| sa46 wrote:
| Microsoft is prone to severe breaches a few times per year.
|
| https://firewalltimes.com/microsoft-data-breach-timeline/
| pquki4 wrote:
| You can't just equate Microsoft and Azure like that.
| skywhopper wrote:
| Azure has had multiple embarrassingly bad tenant-boundary
| leaks, including stuff like being able to access another
| customer's metadata service, including credentials, just by
| changing a port number. They clearly have some major issues
| with lack of internal architecture review.
| markmark wrote:
| Does Azure? I think there's AWS then everyone else.
| pm90 wrote:
| I have used Azure, AWS and GCP. The only reason people use
| Azure is because others force them to. It's an extremely shitty
| cloud product. They pretend to compete with AWS but aren't even
| as good as GCP.
| snewman wrote:
| Given the level of impact that this incident caused, I am
| surprised that the remediations did not go deeper. They ensured
| that the same problem could not happen again in the same way, but
| that's all. So some equivalent glitch somewhere down the road
| could lead to a similar result (or worse; not all customers might
| have the same "robust and resilient architectural approach to
| managing risk of outage or failure").
|
| Examples of things they could have done to systematically guard
| against inappropriate service termination / deletion in the
| future:
|
| 1. When terminating a service, temporarily place it in a state
| where the service is unavailable but all data is retained and can
| be restored at the push of a button. Discard the data after a few
| days. This provides an opportunity for the customer to report the
| problem (see the sketch after this list).
|
| 2. Audit all deletion workflows for all services (they only
| mention having reviewed GCVE). Ensure that customers are notified
| in advance whenever any service is terminated, even if "the
| deletion was triggered as a result of a parameter being left
| blank by Google operators using the internal tool".
|
| 3. Add manual review for _any_ termination of a service that is
| in active use, above a certain size.
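|
| As a rough sketch of (1): deletion becomes a two-phase state
| machine rather than one destructive call. Minimal Python; every
| name here is hypothetical, not Google's actual tooling:
|
|     from datetime import datetime, timedelta, timezone
|
|     GRACE_PERIOD = timedelta(days=7)
|
|     class Service:
|         def __init__(self, name):
|             self.name = name
|             self.state = "ACTIVE"
|             self.purge_at = None
|
|         def terminate(self):
|             # Phase 1: suspend access but keep all data.
|             self.state = "SUSPENDED"
|             self.purge_at = (datetime.now(timezone.utc)
|                              + GRACE_PERIOD)
|             notify_customer(self)
|
|         def restore(self):
|             # Push-of-a-button recovery in the grace period.
|             if self.state == "SUSPENDED":
|                 self.state = "ACTIVE"
|                 self.purge_at = None
|
|         def purge_if_due(self):
|             # Phase 2: only this reaper path hard-deletes, and
|             # only once the grace period passed unchallenged.
|             now = datetime.now(timezone.utc)
|             if self.state == "SUSPENDED" and now >= self.purge_at:
|                 hard_delete(self)
|
|     def notify_customer(svc):
|         print(f"{svc.name}: suspended; purge at {svc.purge_at}")
|
|     def hard_delete(svc):
|         print(f"{svc.name}: data permanently deleted")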
|
| Absent these broader measures, I don't find this postmortem to be
| in the slightest bit reassuring. Given the are-you-f*ing-kidding-
| me nature of the incident, I would have expected any sensible
| provider who takes the slightest pride in their service, or even
| is merely interested in protecting their reputation, to visibly
| go over the top in ensuring nothing like this could happen again.
| Instead, they've done the bare minimum. That says something bad
| about the culture at Google Cloud.
| 2OEH8eoCRo0 wrote:
| That sounds reasonable. Perhaps they felt that a larger change
| to process would be riskier overall.
| TheCleric wrote:
| No it would probably be even worse from Google's perspective:
| more expensive.
| rezonant wrote:
| Hard agree. They clearly were more interested in making clear
| that there's not a systemic problem in how GCP's operators
| manage the platform, which read, strongly and alarmingly, as
| though there is a systemic problem in how GCP's operators
| manage the platform. The lack of the common-sense measures you
| outline in their postmortem just tells me that they aren't
| doing anything to fix it.
| ok_dad wrote:
| "There's no systemic problem."
|
| Meanwhile, the operators were allowed to leave a parameter
| blank and the default was to set a deletion time bomb.
|
| Not systemic my butt! That's a process failure, and every
| process failure like this is a systemic problem because the
| system shouldn't allow a stupid error like this.
| joshuamorton wrote:
| If you're arguing that _that_ was the systemic problem,
| then it's been fully fixed, as the manual operation was
| removed and so validation can no longer be bypassed.
| phito wrote:
| It's a joke that they're not doing these things. How can you be
| a giant cloud provider and not think of putting safeguards
| around data deletion? I guess that realistically they thought
| of it many times but never implemented it because it costs
| money.
| pm90 wrote:
| It's probably because implementing such safeguards wouldn't
| help anyones promo packet.
|
| I really dislike that most of our major cloud infrastructure
| is provided by big tech rather than eg infrastructure
| vendors. I trust equinix a lot more than Google because thats
| all they do.
| metadat wrote:
| Understandable, however public clouds are a huge mix of
| both hardware and software, and it takes deep proficiency
| at both to pull it off. Equinix are definitely in the
| hardware and routing business... it may be tough to move
| upstream.
|
| Hardware always gets commoditized to the max (sad but true).
| Thorrez wrote:
| I work in GCP and have seen a lot of OKRs about improving
| reliability. So implementing something like this would help
| someone's promo packet.
| cbarrick wrote:
| This is exactly the kind of work that would get SREs
| promoted.
| passion__desire wrote:
| It is funny Google has internal memegen but not ideagen.
| Ideate away your problems, guys.
| lima wrote:
| As a customer of Equinix Cloud... No thank you.
| Infrastructure vendors are terrible software engineers.
| Ocha wrote:
| I wouldn't be surprised if VMware support is getting deprecated
| in GCP so they just don't care - waiting for all customers to
| move off of it
| snewman wrote:
| My point is that if they had this problem in their VMware
| support, they might have a similar problem in one of their
| other services. But they didn't check (or at least they
| didn't claim credit for having checked, which likely means
| they didn't check).
| sangnoir wrote:
| > When terminating a service, temporarily place it in a state
| where the service is unavailable but all data is retained and
| can be restored at the push of a button. Discard the data after
| a few days. This provides an opportunity for the customer to
| report the problem
|
| Replacing actual deletion with deletion flags may lead to
| _other_ fun bugs like "Google Cloud fails to delete customer
| data, running afoul of EU rules". I suspect Google would err
| on the side of accidental deletions rather than accidental
| non-deletions, at least in the EU.
| pm90 wrote:
| I highly doubt this was the reason. Google has similar
| deletion protection for other resources eg GCP projects are
| soft deleted for 30 days before being nuked.
| boesboes wrote:
| Not really how it works: GDPR protects individuals and allows
| them to lodge a deletion request with the data owner, who then
| needs to respond to any request within 60(?) days. Google has
| nothing to do with that beyond having to make sure their infra
| is secure. There are even provisions for dealing with personal
| data in backups.
|
| EU law has nothing to do with this.
| mcherm wrote:
| > I suspect Google would err on the side of accidental
| deletions rather than accidental non-deletions: at least in
| the EU.
|
| I certainly hope not, because that would be incredibly
| stupid. Customers understand the significance of different
| kinds of risk. This story got an incredible amount of
| attention among the community of people who choose between
| different cloud services. A story about how Google had failed
| to delete data on time would not have gotten nearly as much
| attention.
|
| But let us suppose for a moment that Google has no concern
| for their reputation, only for their legal liability. Under
| EU privacy rules, there might be some liability for failing
| to delete data on schedule -- although I strongly suspect
| that the kind of "this was an unavoidable one-off mistake"
| justifications that we see in this article would convince a
| court to reduce that liability.
|
| But what liability would they face for the deletion? This was
| a hedge fund managing billions of dollars. Fortunately, they
| had off-site backups to restore their data. If they hadn't,
| and it had been impossible to restore the data, how much
| liability could Google have faced?
|
| Surely, even the lawyers in charge of minimizing liability
| would agree: it is better to fail by keeping customers'
| accounts than to fail by deleting them.
| rlpb wrote:
| A deletion flag is acceptable under EU rules. For example,
| they are acceptable as a means of dealing with deletion
| requests for data that also exists in backups. Provided that
| the restore process also honors such flags.
| steveBK123 wrote:
| >> 1. When terminating a service, temporarily place it in a
| state where the service is unavailable but all data is retained
| and can be restored at the push of a button. Discard the data
| after a few days. This provides an opportunity for the customer
| to report the problem.
|
| This is so obviously "enterprise software 101" that it is
| telling that Google is operating in 2024 without it.
|
| Since my new hire grad days, the idea of immediately deleting
| data that is no longer needed was out of the question.
|
| Soft deletes in databases with a column you mark deleted.
| Move/rename data on disk until you're super duper sure you need
| to delete it (and maybe still let the backup remain). Etc.
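|
| The classic column pattern, sketched here with sqlite (table
| and names invented for illustration):
|
|     import sqlite3
|
|     conn = sqlite3.connect(":memory:")
|     conn.execute("""
|         CREATE TABLE accounts (
|             id INTEGER PRIMARY KEY,
|             name TEXT NOT NULL,
|             deleted_at TEXT DEFAULT NULL  -- NULL means live
|         )
|     """)
|     conn.execute("INSERT INTO accounts (name) VALUES ('acme')")
|
|     # "Delete" just stamps the row; nothing is destroyed,
|     # and an undelete is one UPDATE away.
|     conn.execute(
|         "UPDATE accounts SET deleted_at = datetime('now')"
|         " WHERE name = ?",
|         ("acme",),
|     )
|
|     # Normal reads filter out flagged rows.
|     live = conn.execute(
|         "SELECT * FROM accounts WHERE deleted_at IS NULL"
|     ).fetchall()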
| nikanj wrote:
| There are many voices in the industry arguing against soft
| deletes. Mostly coming from a very Chesterton's Fence
| perspective.
|
| For some examples
| https://www.metabase.com/learn/analytics/data-model-
| mistakes...
|
| https://www.cultured.systems/2024/04/24/Soft-delete/
|
| https://brandur.org/soft-deletion
|
| Many more can easily be found.
| snewman wrote:
| For the use case we're discussing here, of terminating an
| entire service, the soft delete would typically be needed
| only at some high level, such as on the access list for the
| service. The impact on performance, etc. should be minimal.
| steveBK123 wrote:
| Precisely, before you delete a customer account, you
| disable its access to the system. This is a scream test.
|
| Once you've gone through some time and due diligence you
| can contemplate actually deleting the customer data and
| account.
| danparsonson wrote:
| OK, but those examples you gave all boil down to the
| following:
|
| 1. you might accidentally access soft-deleted data and/or the
| data model is more complicated
|
| 2. data protection
|
| 3. you'll never need it
|
| to which I say
|
| 1. you'll make all kinds of mistakes if you don't
| understand the data model, and, it's really not that hard
| to tuck those details away inside data access code/SPs/etc
| that the rest of your app doesn't need to care about
|
| 2. you can still delete the data later on, and indeed that
| may be preferable as deleting under load can cause
| performance (e.g. locking) issues
|
| 3. at least one of those links says they never used it,
| then gives an example of when soft-deleted data was used to
| help recover an account (albeit by creating a new record as
| a copy, but only because they'd never tried an undelete
| before and were worried about breaking something; sensible
| but not exactly making the point they wanted to make)
|
| So I'm gonna say I don't get it; sure it's not a panacea,
| yes there are alternatives, but in my opinion neither is it
| an anti-pattern. It's just one of dozens of trade-offs made
| when designing a system.
| crazygringo wrote:
| It sounds like the problem is that the deletion was
| configured with an internal tool that bypassed all those
| kinds of protections -- that went straight to the actual
| delete. Including warnings to the customer, etc.
|
| Which is bizarre. Even internal tools used by reps shouldn't
| be performing hard deletes.
|
| And then I'd also love to know how the heck a default value
| to expire in a year ever made it past code review. I think
| that's the biggest howler of all. How did one person ever
| think there should be a default like that, and how did
| someone else see it and say yeah that sounds good?
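|
| For contrast, a sane version of that parameter refuses to
| guess. A hypothetical sketch, not the actual internal tool:
|
|     from datetime import datetime, timezone
|
|     def provision_private_cloud(name, expires):
|         # 'expires' is required: None means "never delete",
|         # so a blank field can never silently turn into a
|         # scheduled deletion.
|         if expires is not None:
|             if expires <= datetime.now(timezone.utc):
|                 raise ValueError("expiry must be in future")
|         return {"name": name, "expires": expires}
|
|     # The dangerous variant would look like
|     #     def provision_private_cloud(name,
|     #                                 expires=ONE_YEAR_OUT):
|     # where an omitted argument schedules a deletion.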
| roughly wrote:
| > This is so obviously "enterprise software 101" that it is
| telling Google is operating in 2024 without it.
|
| My impression of GCP generally is that they've got some very
| smart people working on some very impressive advanced
| features and all the standard boring stuff nobody wants to do
| is done to the absolute bare minimum required to check the
| spec sheet. For all its bizarre modern enterprise-ness, I
| don't think Google ever really grew out of its early academic
| lab habits.
| steveBK123 wrote:
| I know a bunch of way-too-smart PhD types who worked at
| GOOG exclusively in R&D roles which, they bragged to me
| earnestly, were not revenue generating.
| ajross wrote:
| FWIW, you're solving the bug by fiat, and that doesn't work.
| Surely analogs to all those protections are already in place.
| But a firm and obvious requirement of a software system that is
| capable of deleting data is _the ability to delete data_. And
| if it can do that, you can write a bug that short-circuits any
| architectural protection you put in place. Which is the
| definition of a bug.
|
| Basically I don't see this as helpful. This is just a form of
| the "I would never have written this bug" postmortem response.
| And yeah, you would. We all would. And do.
| mehulashah wrote:
| I'm completely baffled by Google's "postmortem" myself. Not
| only is it obviously insufficient to anyone that has operated
| online services as you point out, but the conclusions are full
| of hubris. I.e. this was a one time incident, it won't happen
| again, we're very sorry, but we're awesome and continue to be
| awesome. This doesn't seem to help Google Cloud's face-in-palm
| moment.
| playingalong wrote:
| It sounds like they could stand to read the SRE book by
| Google. BTW, it's available for free at
| https://sre.google/sre-book/table-of-contents/
|
| A bit chaotic (a mix of short essays) and simplistic (assuming
| one kind of approach or design), but definitely still worth a
| read. No exaggeration to state it was category-defining.
| markhahn wrote:
| most of this complaint is explicitly answered in the article.
| must have been TL...
| PcChip wrote:
| Could it have been a VMware expiration setting somewhere, and
| thus VMware itself deleted the customer's tenant? If so then
| Google wouldn't have a way to prove it won't happen again
| except by always setting the expiration flag to "never" instead
| of leaving it blank
| yalok wrote:
| I would add one more -
|
| 4. Add an option to auto-backup all the data from the account
| to an outside backup service of the user's choice.
|
| This would help not just with these kinds of accidents, but
| also with any kind of data corruption/availability issue.
|
| I would pay for this even for my personal gmail account.
| belter wrote:
| Can you imagine if there was no backup? Would Google be on the
| hook to cover the +/- 200 billion in losses?
|
| This is why the smart people at Berkshire Hathaway don't offer
| Cyber Insurance: https://youtu.be/INztpkzUaDw?t=5418
| JCM9 wrote:
| The quality and rigor of GCP's engineering is not even remotely
| close to that of an AWS or Azure and this incident shows it.
| justinclift wrote:
| And Azure has a very poor reputation, so that bar is _not_ at
| all high.
| SoftTalker wrote:
| Honestly I've never worked anywhere that didn't have some kind
| of "war story" that was told about how some admin or programmer
| mistake resulted in the deletion of some vast swathe of data,
| and then the panic-driven heroics that were needed to recover.
|
| It shouldn't happen, but it does, all the time, because humans
| aren't perfect, and neither are the things we create.
| 20after4 wrote:
| Sure, it's the tone and content of their response that is
| worrying, more than the fact that an incident happened. What's
| called for is an honest and transparent root cause analysis
| with technically sound and thorough mitigations, including
| changes in policy with regard to defaults. Their response
| seems like only the most superficial, bare-minimum
| approximation of an appropriate response to deleting a large
| customer's entire account. If I were on the incident response
| team I'd be strongly advocating for at least these additional
| changes:
|
| Make deletes opt-in rather than opt-out. Make all large-scale
| deletions go through some review process with automated tests
| and a final human review. And not just by some low-level
| technical employee: the account managers should have seen this
| on their dashboard somewhere long before it happened. Finally,
| undertake a thorough and systematic review of other services
| to look for similar failure modes, especially with regard to
| anything which is potentially destructive and can conceivably
| be default-on in the absence of a supplied configuration
| parameter.
| kccqzy wrote:
| Azure has had a continuous stream of security breaches. I don't
| trust them either. It's AWS and AWS alone.
| IcyWindows wrote:
| Huh? I have seen ones for the rest of Microsoft, but not
| Azure.
| twisteriffic wrote:
| One of many
| https://msrc.microsoft.com/blog/2021/09/additional-
| guidance-...
| l00tr wrote:
| If it were a small or medium business, Google wouldn't even care
| lopkeny12ko wrote:
| > Google Cloud services have strong safeguards in place with a
| combination of soft delete, advance notification, and human-in-
| the-loop, as appropriate.
|
| I mean, clearly not? By Google's own admission, in this very
| article, the resources were not soft deleted, no advance
| notification was sent, and there was no human in the loop for
| approving the automated deletion.
|
| And Google's remediation items include adding even _more_
| automation for this process. This sounds totally backward to me.
| Am I missing something?
| jerbear4328 wrote:
| They automated away the part that had a human error (the
| internal tool with a field left blank), so that human error
| can't mess it up in the same way again. They should move that
| human labor to checking before tons of stuff gets deleted.
| 20after4 wrote:
| It seems to me that the default-delete is the real WTF. Why
| would a blank field result in a default auto-delete in any
| sane world? The delete should be opt-in, not opt-out.
| macintux wrote:
| It took me way too many years to figure out that any
| management script I write for myself and my co-workers
| should, by default, execute as a dry run operation.
|
| I now put a -e/--execute flag on every destructive command;
| without that, the script will conduct some basic sanity
| checks and halt before making changes.
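|
| Something like this, for anyone who hasn't picked up the
| habit (an argparse sketch; the doomed-resource list stands
| in for real discovery logic):
|
|     import argparse
|
|     parser = argparse.ArgumentParser(
|         description="prune stale resources")
|     parser.add_argument(
|         "-e", "--execute", action="store_true",
|         help="actually make changes (default: dry run)")
|     args = parser.parse_args()
|
|     doomed = ["vm-123", "disk-456"]  # stand-in
|
|     for resource in doomed:
|         if args.execute:
|             print(f"deleting {resource}")
|             # the real deletion call would go here
|         else:
|             print(f"[dry run] would delete {resource}")
|
|     if not args.execute:
|         print("no changes made; re-run with --execute")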
| maximinus_thrax wrote:
| > Google Cloud continues to have the most resilient and stable
| cloud infrastructure in the world.
|
| As a company, Google has a lot of work to do about its customer
| care reputation regardless of what some metrics somewhere say
| about whose cloud is more reliable or not. I would not trust my
| business to Google Cloud, I would not trust anything with money
| to anything with the Google logo. Anyone who's been reading
| hacker news for a couple of years can remember how many times
| folks were asking for insider contacts to recover their
| accounts/data. Extrapolating this to a business would keep me up
| at night.
| dekhn wrote:
| If you're a GCP customer with a TAM, here's how to make them
| squirm. Ask them what protections GCP has in place, on your
| account, that would prevent GCP from inadvertently deleting large
| amounts of resources if GCP makes an administrative error.
|
| They'll point to something that says this specific problem was
| alleviated (by deprecating the tool that did it, and automating
| more of the process), and then you can persist: we know you've
| fixed this problem, but will a human review the next
| large-scale deletion before the resources are deleted?
|
| From what I can tell (I worked for GCP aeons ago, and am an
| active user of AWS for even longer) GCP's human-based protection
| measures are close to non-existent, and much less than AWS.
| Either way, it's definitely worth asking your TAM about this very
| real risk.
| RajT88 wrote:
| Give 'em hell.
|
| This motivates the TAM's to learn to work the system better.
| They will never be able to change things on their own, but
| sometimes you get escalation path promises and gentlemen's
| agreements.
|
| Enough screaming TAM's may eventually motivate someone high up
| to take action. Someday.
| ethbr1 wrote:
| Way in which TAMs usually actually fix things:
|
| - Single customer complains loudly
| - TAM searches for other customers with similar concerns
| - Once total ARR is sufficient...
| - It gets added to dev's roadmap
| RajT88 wrote:
| If you are lucky!
|
| I work for a CSP (not GCP) so I may be a little cynical on
| the topic.
| ethbr1 wrote:
| Steps 3 and 4 are usually the difficult ones.
| nikanj wrote:
| - It gets closed as wontfix. Google never hires a human to
| do a job well, if AI/ML can do the same job badly
| mulmen wrote:
| Does the ARR calculation consider the lifetime of cloud
| credits given to burned customers to prevent them from
| moving to a competitor?
|
| In other words can UniSuper be confident in getting support
| from Google next time?
| ethbr1 wrote:
| My heart says absolutely not.
| DominoTree wrote:
| Pitch it as an opportunity for a human at Google to reach out
| and attempt to retain a customer when someone has their assets
| scheduled for deletion. Would probably get more traction
| internally, and has a secondary effect of ensuring it's clear
| to everyone that things are about to be nuked.
| tkcranny wrote:
| > 'Google teams worked 24x7 over several days'
|
| I don't know if they get what the seven means there.
| ReleaseCandidat wrote:
| They worked so much, the days felt like weeks.
| rezonant wrote:
| I suppose if mitigation fell over the weekend it might still
| make sense.
| shermantanktop wrote:
| 24 engineers, 7 hours a day. Plus massages and free cafeteria
| food from a chef.
| Etherlord87 wrote:
| Perhaps the team members cycled so that the team was working on
| the thing without any night or weekend break. Which should be a
| standard thing at all times for a big project like this, IMHO.
| crazygringo wrote:
| Ha, you're right it's a bit nonsensical if you take it
| completely literally.
|
| But of course x7 means working every day of the week. So you
| can absolutely work 24x7 from Thursday afternoon through
| Tuesday morning. It just means they didn't take the weekend
| off.
| iJohnDoe wrote:
| Or they offloaded to India like they do for most of their
| stuff.
| shombaboor wrote:
| this comment made my day
| postatic wrote:
| UniSuper customer here in Aus. Didn't know what it was, but kept
| receiving emails every day while they were trying to resolve
| this. Only found out from the news what had actually happened.
| Feels like they downplayed the whole thing as "system downtime".
| Imagine if something had actually happened to people's money,
| the billions of dollars saved in their superannuation fund.
| lukeschlather wrote:
| The initial statement on this incident was pretty misleading, it
| sounded like Google just accidentally deleted an entire GCP
| account. Reading this writeup I'm reassured, it sounds like they
| only lost a region's worth of virtual machines, which is
| absolutely something that happens (and that I think my systems
| can handle without too much trouble.) The original writeup made
| it sound like all of their GCS buckets, SQL databases, etc. in
| all regions were just gone which is a different thing and
| something I hope Google can be trusted not to do.
| wmf wrote:
| It was a red flag when UniSuper said their subscription was
| deleted, not their account. Many people jumped to conclusions
| about that.
| hiddencost wrote:
| > It is not a systemic issue.
|
| I kinda think the opposite. The culture that kept these kinds of
| problems at bay has largely left the company or stopped trying to
| keep it alive, as they no longer really care about what they're
| building.
|
| Morale is real bad.
| mercurialsolo wrote:
| If only internal tools went through the same scrutiny as public
| tools.
|
| More often than not, critical parameters or misconfigurations
| happen because of internal tools which work on unpublished
| params.
|
| Internal tools should be treated as tech debt. You won't be able
| to eliminate issues, but you can vastly reduce the surface area
| of errors.
| jwnin wrote:
| End of day Friday disclosure before a long holiday weekend; well
| timed.
| nurettin wrote:
| It sounds like a giant PR piece about how Google is ready to
| respond to a single customer and is ready to work through their
| problems instead of creating an auto-response account suspension
| infinite loop nightmare.
| kjellsbells wrote:
| Interesting, but I draw different lessons from the post.
|
| Use of internal tools. Sure, everyone has internal tools, but if
| you are doing customer stuff, you really ought to be using the
| same API surface as the public tooling, which at cloud scale is
| guaranteed to have been exercised and tested much more than some
| little dev group's scripts. Was that the case here?
|
| Passive voice. This post should have a name attached to it. Like,
| Thomas Kurian. Palming it off to the anonymous "customer support
| team" still shows a lack of understanding of how trust is
| maintained with customers.
|
| The recovery seems to have been due to exceptional good fortune
| or foresight on the part of the customer, not Google. It seems
| that the customer had images or data stored outside of GCP. How
| many of us cloud users could say that? How many of us cloud users
| have encouraged customers to move further and deeper along the
| IaaS > PaaS > SaaS curve, making them more vulnerable to total
| account loss like this? There's an uncomfortable lesson here.
| kleton wrote:
| > name attached
|
| Blameless (and nameless) postmortems are a cultural thing at
| google
| rohansingh wrote:
| That's great internally, but serious external communication
| with customers should have a name attached and responsibility
| accepted (i.e., "the buck stops here").
| hluska wrote:
| Culture can't just change like that.
| saurik wrote:
| So, I read your comment and realized that I think it made me
| misinterpret the comment you are replying to? I thereby wrote
| a big paragraph explaining how even as someone who cares
| about personal accountability within large companies, I
| didn't think a name made sense to assign blame here for a
| variety of reasons...
|
| ...but, then I realized that that isn't what is being asked
| for here: the comment isn't talking about the nameless
| "Google operators" that aren't being blamed, it is talking
| about the lack of anyone who wrote this post itself! There, I
| think I do agree: someone should sign off on a post like
| this, whether it is a project lead or the CEO of the entire
| company... it shouldn't just be "Google Cloud Customer
| Support".
|
| Having articles that aren't really written by anyone frankly
| makes it difficult for my monkey brain to feel there are
| actual humans on the inside whom I can trust to care about
| what is going on; and, FWIW, this hasn't always been a
| general part of Google's culture: if this had been a screw up
| in the search engine a decade ago, we would have gotten a
| statement from Matt Cutts, and knowing that there was that
| specific human who cared on the inside meant a lot to some of
| us.
| emmelaich wrote:
| Using "TL;DR" in professional communication is a little
| unprofessional.
|
| Some non-nerd exec is going to wonder what the heck that means.
| logrot wrote:
| It used to be called an executive summary. It's brilliant, but
| the kids found the phrase too formal.
|
| IMHO almost every article should start with one.
| logrot wrote:
| Executive summary?
| petesergeant wrote:
| there's literally a tl;dr in the linked article
| none_to_remain wrote:
| There should be an Executive Summary
| taspeotis wrote:
| Google employee scheduled the deletion of UniSuper's resources
| and Google (ironically) did not cancel it.
| xyst wrote:
| Transparency for Google is releasing this incident report on the
| Friday of a long weekend [in the US].
|
| I wonder if UniSuper was compensated for G's fuckup.
|
| "A single default parameter vs multibillion organization. The
| winner may surprise you!1"
| walrus01 wrote:
| The idea that you could have an automated tool delete services at
| the end of a term for a corporate/enterprise customer of this
| size and scale is absolutely absurd and inexcusable. No matter
| whether the parameter was set correctly or incorrectly in the
| first place. It should go through several levels of account
| manager/representative/management for _manual review by a human_
| on the Google side before removal.
___________________________________________________________________
(page generated 2024-05-25 23:01 UTC)