[HN Gopher] Post-incident review on the Atlassian April 2022 outage
___________________________________________________________________
Post-incident review on the Atlassian April 2022 outage
Author : johnmoon
Score : 104 points
Date : 2022-04-29 20:53 UTC (1 day ago)
(HTM) web link (www.atlassian.com)
(TXT) w3m dump (www.atlassian.com)
| andrewstuart wrote:
| For anyone wanting to attack:
|
| * hindsight is 20/20
|
| * modern software systems are very complex
|
| * have you never made a mistake?
| BOOSTERHIDROGEN wrote:
| What apps were used to create that timeline?
| Andugal wrote:
| If I understand correctly, since they deleted only a small subset
| of their customers, they could not restore from a clean backup
| without losing data from the other customers who were not
| impacted by the outage.
|
| So how would one build a clean "partial backup" strategy in case
| something similar happened at one's own company?
| dopylitty wrote:
| I can only imagine the way the person who pulled the trigger on
| the deletion script felt the moment they realized what had
| happened.
|
| I've been there with much less significant incidents, when a
| "routine" change turned into a potentially resume-generating
| event. It's not fun.
|
| Ultimately the responsibility is with the organization that made
| it possible for an event of that scale to happen rather than the
| individual person who happened to trigger it but that doesn't
| make it feel any better.
| llamaLord wrote:
| As someone who observed this particular incident from the
| inside (holy shit-balls have the last 3 weeks not been fun),
| one of the few positive elements of it has been the universal
| and effectively instinctual agreement internally that it was a
| massive screw-up in the system that we all have to own, rather
| than one or several individual screw-ups that need to be put at
| the feet of individuals.
| trebligdivad wrote:
| Well, the one constant of IT is 'shit happens'; mark this one
| down as something interesting you've seen.
| SnowHill9902 wrote:
| That moment when you see DELETE 18388272773 0
| Kwpolska wrote:
| Thomas J. Watson famously said:
|
| _Recently, I was asked if I was going to fire an employee who
| made a mistake that cost the company $600,000. No, I replied, I
| just spent $600,000 training him. Why would I want somebody else
| to hire his experience?_
| originalvichy wrote:
| I think in this instance it's the person who sent the IDs who
| feels worse. The people doing the deleting were handed a list in
| which about 30 were the correct app IDs and the rest were site
| IDs.
|
| As they mentioned, the delete script worked for all types of
| unique IDs, so hopefully that also dilutes the feeling of "it's
| all my fault."
| baskethead wrote:
| Oooof. Passing in Application IDs will delete applications and
| passing in site IDs will delete sites. That's a really really bad
| design. I'm bookmarking this so that I can use it as a showcase
| going forward.
|
| Just this week, I changed a spec in one of our proposed endpoints
| that did exactly that. We were passing in IDs of various object
| types to perform actions, and I changed the API so that callers
| are forced to pass a struct containing both an object type and an
| object ID. Explicitness is so much safer in the long run,
| especially in enterprise apps.
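|
| A minimal sketch of that shape in Python (the names are made up,
| not Atlassian's or our actual API):
|
|     from dataclasses import dataclass
|     from enum import Enum
|
|     class ObjectType(Enum):
|         APP = "app"
|         SITE = "site"
|
|     @dataclass(frozen=True)
|     class DeletionTarget:
|         object_type: ObjectType
|         object_id: str
|
|     def delete_apps(targets: list[DeletionTarget]) -> None:
|         # Refuse the whole batch if any entry isn't an app ID,
|         # rather than deleting whatever each ID points at.
|         wrong = [t for t in targets
|                  if t.object_type is not ObjectType.APP]
|         if wrong:
|             raise ValueError(f"refusing batch: {len(wrong)} non-app IDs")
|         for target in targets:
|             ...  # call the app-specific deletion endpoint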
| ilayn wrote:
| I am very curious if they used a Jira board during this crisis
| for issue tracking. Because then they would have more than 4
| lessons learned.
| civilized wrote:
| Good thing they acquired Trello.
| codeflo wrote:
| What you're basically suggesting is that feature development at
| Atlassian moves at such a glacial speed because of course
| they're using Jira to manage it. This kind of blows my mind
| right now.
| duxup wrote:
| >The script that was executed followed our standard peer-review
| process that focused on which endpoint was being called and how.
| It did not cross-check the provided cloud site IDs to validate
| whether they referred to the Insight App or to the entire site,
| and the problem was that the script contained the ID for a
| customer's entire site.
|
| Yup that deletes something... anyway...
|
| > Establish universal "soft deletes" across all systems.
|
| It's just easier that way to observe what might happen.
| pokoleo wrote:
| Soft deletion always feels at odds with privacy-related "right to
| have data deleted" laws.
|
| Would be super interested in a technical writeup on how they do
| this.
| Jcowell wrote:
| It shouldn't be. These laws at least have the nuance to
| understand that data can't be immediately deleted from backups,
| and that in instances where deletion is complicated, the customer
| is notified.
| h1fra wrote:
| "Right to have data deleted" can be 'circumvented' if the data
| is critical part of the system or is needed for legal purpose
| (for example it can be mandatory to keep 1 year of IP logs and
| data associated with it)
|
| In previous companies I have worked for, we did instant soft-
| delete, then hard anonymisation after 15-30days and then hard
| delete after a year. That means the data was not recoverable
| for customer but could still be recovered for legal purpose.
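|
| A minimal sketch of that lifecycle (sqlite here purely for
| illustration; the schema and retention periods are made up to
| match the above):
|
|     import sqlite3
|     import time
|
|     DAY = 86400  # seconds
|
|     conn = sqlite3.connect(":memory:")
|     conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY,"
|                  " email TEXT, deleted_at REAL)")  # NULL = live
|
|     def soft_delete(user_id):
|         # Stage 1: "delete" is just a timestamp; still recoverable.
|         conn.execute("UPDATE users SET deleted_at = ? WHERE id = ?",
|                      (time.time(), user_id))
|
|     def anonymise(now=None):
|         # Stage 2: after ~30 days, scrub personal fields in place.
|         now = now or time.time()
|         conn.execute("UPDATE users SET email = NULL"
|                      " WHERE deleted_at IS NOT NULL AND deleted_at < ?",
|                      (now - 30 * DAY,))
|
|     def purge(now=None):
|         # Stage 3: after a year, hard-delete the rows themselves.
|         now = now or time.time()
|         conn.execute("DELETE FROM users WHERE deleted_at IS NOT NULL"
|                      " AND deleted_at < ?", (now - 365 * DAY,))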
| baskethead wrote:
| There's a time period within which you need to permanently delete
| the data. A soft delete allows you to delete the data quickly and
| see what happens. If everything is okay, you can then purge your
| database of all soft-deleted data.
| jasonwatkinspdx wrote:
| IANAL but the laws have carve outs for backup retention, etc.
|
| A simple technical solution is to store all data with per user
| encryption keys, and then just delete the key. This obviously
| doesn't let you prove to anyone else that you've deleted all
| copies of the key, but you can use it as a way to have higher
| confidence you don't inadvertently leak it.
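|
| A toy version of that idea using the cryptography package's
| Fernet (per-user keys assumed to be stored apart from the data):
|
|     from cryptography.fernet import Fernet, InvalidToken
|
|     # One key per user, stored separately from what it protects.
|     user_key = Fernet.generate_key()
|     ciphertext = Fernet(user_key).encrypt(b"user's records")
|
|     # Backups can hold the ciphertext forever; "deleting" the
|     # user means destroying every copy of the key.
|     user_key = None
|
|     try:
|         Fernet(Fernet.generate_key()).decrypt(ciphertext)
|     except InvalidToken:
|         print("without the key, every copy is unreadable")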
| notreallyserio wrote:
| Ideally they'd encrypt the customer content with a key provided
| by the customer and destroyed when the customer requests
| account deletion. The customer would still be able to use their
| key to decrypt backups that they get prior to the request. If
| the customer changes their mind, they just upload the key again
| (along with the backup, if necessary).
|
| Of course, this means trusting Atlassian to actually delete the
| key on request, but there's not much reason for them not to.
| rob_c wrote:
| Seriously, where were the -24hr backups that could be rolled back
| to once it's clear the script is fubar, or using it on just 10%
| of the estate first? ...
| largbae wrote:
| It is never that simple. Say the backup existed, and was
| global. By the time you get everyone briefed on how fubar it is
| and get agreement to load the backups, there are hours of
| changes from the unaffected customers that will be wiped by the
| restore, or have to be reconciled by hand for months. Sure, you
| can concoct the perfect antidote with hindsight, but their
| retro and next steps are sound.
| rob_c wrote:
| Build it properly. By definition, anyone following best practice
| is, or always should be, not far from being able to do this. If
| they're not, someone is to blame.
| kwertyoowiyop wrote:
| That lesson really stuck out for me also. My definition of
| "restore" has been too simplistic.
| baskethead wrote:
| What it means is that they never tested their disaster
| recovery system, because this would have been found right
| away. Or, someone would have reported it and an upper level
| exec would have signed off on it being okay to take 14 days
| to restore a small subset of users.
| largbae wrote:
| Again, not that simple. The customer restore procedure was
| almost certainly tested (and in active use, as customers blow
| up their own data often enough). It was _not_ tested on 800
| customer stacks simultaneously, as that was considered a
| sitewide disaster by whoever dreamed up the failure modes
| to test for. Meanwhile the actual whole site disaster
| restore plan may or may not have been tested, but it was
| useless for this case since some customers were unaffected
| and would be damaged by the whole site plan.
| [deleted]
| joering2 wrote:
| I wish they could assure us the engineer who pressed the button
| wasn't fired.
| devjam wrote:
| > Atlassian is proud of our incident management process which
| emphasizes that a blameless culture and a focus on improving
| our technical systems ...
| seanwilson wrote:
| Do any databases have something like native support for soft
| deletes or the ability to undo (other than SQL transaction
| rollbacks, where you have to specify the undo checkpoint in
| advance)? Something like what Git does, where it keeps a history
| of edits? If this isn't common, is this a neglected area that
| should be addressed, or is it just too hard a problem? It feels
| like with SQL there are minimal guardrails, and it's just your
| own fault if you're not extra careful, compared to, say, using
| Git with code or "restore from trash" with filesystems.
| VTimofeenko wrote:
| Snowflake has time travel[1], keeping the original data for a
| specified period of time.
|
| 1: https://docs.snowflake.com/en/user-guide/data-
| availability.h...
| jordanthoms wrote:
| Quite a few databases support time travel queries; in particular,
| Oracle has had them for years, and CockroachDB has them too. We
| can query the state of a table as it was at any point in the last
| 72 hrs.
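|
| For example, against CockroachDB from Python (connection string
| and table name are made up):
|
|     import psycopg2  # CockroachDB speaks the Postgres protocol
|
|     conn = psycopg2.connect(
|         "postgresql://user@localhost:26257/defaultdb")
|     with conn.cursor() as cur:
|         # Read the table as it existed an hour ago, served from
|         # MVCC history retained for the cluster's GC window.
|         cur.execute(
|             "SELECT id, title FROM issues AS OF SYSTEM TIME '-1h'")
|         for row in cur.fetchall():
|             print(row)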
| tyre wrote:
| Datomic treats all data as immutable, so it can wind back to
| any version.
|
| When new data is written, the entire block is copied and
| rewritten rather than changing the data in-place.
| mdavidn wrote:
| This is similar to SQL checkpoints in that a rollback is all or
| nothing. It wouldn't be segmented by tenant unless each tenant
| has its own transactor.
| candiddevmike wrote:
| WAL archiving / point in time recovery can help with this.
| wolf550e wrote:
| In MVCC systems like PostgreSQL, if you don't vacuum (garbage-
| collect the old tuples), your database is append-only and you can
| query as if your transaction had started at some time in the
| past. I don't know how to set auto-vacuum to have a fixed delay,
| e.g. keeping 24h of changes, but I bet it could be added if it's
| not built-in.
| jasonwatkinspdx wrote:
| So, the fully fleshed out form of this in databases is usually
| called Bitemporality. The official SQL standard has included
| this for some years now, but it's not widely implemented by
| databases.
|
| An intuitive way to think of bitemporality is that it's like
| MVCC, but with four timestamps per row version. One pair
| describes a range of time in "outside" or "valid" time, i.e.
| whatever semantic domain the database is modeling; the other pair
| describes a range of "system" time, which is when this record was
| present in the database. This lets you capture and reason about
| the distinction between when a fact the database models was true
| in the real world and when the database was updated to reflect
| that fact (some people call this "as of" vs. "as at"; the terms
| here aren't fully settled, but the basic distinction is the
| same). So you can revise history, do complex time travel queries,
| all sorts of stuff. It's a very useful model that directly aligns
| with the sort of questions businesses need to answer in the
| context of a court case, or when revising their ground source of
| truth due to a past bug or error.
|
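| A bare-bones sketch of the two-axis idea (plain Python, not any
| particular database's API):
|
|     from dataclasses import dataclass
|     import datetime as dt
|
|     @dataclass
|     class Row:
|         value: str
|         valid_from: dt.datetime   # true in the world from...
|         valid_to: dt.datetime     # ...until
|         system_from: dt.datetime  # recorded in the DB from...
|         system_to: dt.datetime    # ...until superseded
|
|     def query(rows, valid_at, system_at):
|         # "What did the database believe at system_at about the
|         # state of the world at valid_at?"
|         return [r for r in rows
|                 if r.valid_from <= valid_at < r.valid_to
|                 and r.system_from <= system_at < r.system_to]
|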
| The downside is that your database balloons with row versions,
| and many queries become far more complicated, perhaps needing
| additional joins, etc. Also, from the perspective of database
| implementors, there's a ton more complexity in the code. So
| that's why it's not widely supported despite the standard.
|
| There's also a niche of databases built around this model from
| the ground up, usually based on Datalog instead of SQL. There's
| also overlapping work with RDF and Semantic Web thinking (as
| awry as all that went).
|
| In practice how most organizations address this is
| operationally, by keeping generational and incremental backups
| that let them restore previous database states as needed.
| Though as the original post we're here for proves, that kind of
| operational solution can bite back hard when it goes wrong.
| justinludwig wrote:
| I can't say that I've ever been a fan of Atlassian or their
| products, but this blog post makes it sound like they've at
| least learned the right lessons from this:
|
| 1. Establish universal "soft deletes" across all systems.
|
| 2. Better DR for multi-site, multi-product incidents.
|
| 3. Fix their incident-management process for large-scale
| incidents.
|
| 4. Fix their incident communications.
|
| Regarding #4 in particular:
|
| "Rather than wait until we had a full picture, we should have
| been transparent about what we did know and what we didn't know.
| Providing general restoration estimates (even if directional) and
| being clear about when we expected to have a more complete
| picture would have allowed our customers to better plan around
| the incident....
|
| [In the future], we will acknowledge incidents early, through
| multiple channels. We will release public communications on
| incidents within hours. To better reach impacted customers, we
| will improve the backup of key contacts and retrofit support
| tooling to enable customers... [to make emergency] contact with
| our technical support team."
| OrderlyTiamat wrote:
| I am actually pleasantly surprised at the openness of this
| response, and at their taking responsibility for mistakes and
| detailing what will change in the future. It's not just
| corporate speak. I think that speaks well for the company, and it
| has improved my view of them.
| [deleted]
| perlgeek wrote:
| I tend to agree.
|
| However, there's one point that makes me skeptical: there are
| no organizational changes, or changes to leadership, or
| anything in that direction.
|
| This sounds like "the tech guys screwed up, culture and
| management is fine here". Which it might be, or it might not.
|
| I would have loved to see
|
| 5. We will stop pushing customers so hard towards using our
| cloud
|
| for one, but that wouldn't be convenient for Atlassian.
| eastbound wrote:
| Note that the ToS also forbid Cloud users from "disseminating
| information" about "the performance of the products".
|
| So you can't say it's slow or performing poorly.
| geerlingguy wrote:
| I thought, in terms of Jira and Confluence at least, it was
| just accepted that being slow and underperforming was the
| status quo and if it was running at a speed you'd consider
| normal, that's an exception (and cause for alarm... like
| "did that actually save or is there a silent JS error not
| being displayed?").
___________________________________________________________________
(page generated 2022-04-30 23:00 UTC)