[HN Gopher] Fundamentals of Incident Management
___________________________________________________________________
Fundamentals of Incident Management
Author : bitfield
Score : 109 points
Date : 2021-08-09 16:38 UTC (6 hours ago)
(HTM) web link (bitfieldconsulting.com)
(TXT) w3m dump (bitfieldconsulting.com)
| blamestross wrote:
| I've done a LOT of incident management and I'm not happy about
| it. The biggest issue I have run into other than burnout is this:
|
| Thinking and reasoning under pressure are the enemy. Make as many
| decisions in advance as possible. Make flowcharts and decision
| trees with "decision criteria" already written down.
|
| If you have to figure something out or make a "decision" then
| things are really really bad. That happens sometimes, but when
| teams don't prep at all for incident management (pre-determined
| plans for common classes of problem) every incident is "really
| really bad"
|
| If I have a low-risk, low-cost action with low confidence of high
| reward, I'm going to do it and just tell people it happened.
| Asking means I just lost a half-hour+ worth of money and if I
| just did it and I was wrong we would have lost 2 minutes of
| money. When management asks me why I did that, I point at the doc
| I wrote that my coworkers reviewed and mostly forgot about.
|
| A really common example is "it looks like most of the errors are
| in datacenter X", so you fail out of the datacenter. Maybe it was
| sampling bias or some other issue and it doesn't help, maybe the
| problem follows the traffic, maybe it just suddenly makes things
| better. No matter what, we get signal. Establish well in advance
| of a situation what the common "solutions" to problems are and if
| you are oncall and responding, then just DO them and
| document+communicate as you do.
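|
| (A minimal sketch of what "decision criteria already written
| down" can look like in code; the thresholds and names here are
| made up. The point is that the numbers were agreed on and
| reviewed long before anyone got paged.)
|
|     # decide_failover.py - pre-agreed criteria, reviewed in
|     # advance; thresholds and metric names are hypothetical.
|     from dataclasses import dataclass
|
|     @dataclass
|     class DatacenterStats:
|         name: str
|         error_rate: float  # fraction of requests failing
|         share_of_total_errors: float  # share of all errors here
|
|     ERROR_CONCENTRATION = 0.70  # >=70% of all errors in one DC...
|     MIN_ERROR_RATE = 0.05       # ...and that DC is itself unhealthy
|
|     def should_fail_out(dc: DatacenterStats) -> bool:
|         """Low-risk, low-cost, pre-approved action: drain the DC
|         and tell people afterwards, rather than asking first."""
|         return (dc.share_of_total_errors >= ERROR_CONCENTRATION
|                 and dc.error_rate >= MIN_ERROR_RATE)
|
|     if __name__ == "__main__":
|         west = DatacenterStats("dc-west", error_rate=0.12,
|                                share_of_total_errors=0.83)
|         if should_fail_out(west):
|             print(f"Draining {west.name}; documenting as we go.")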
| spa3thyb wrote:
| There is a month and day, Feb 15, in the header, but no year. I
| can't figure out if that's ironic or apropos, since this story
| reads like a thriller from perhaps ten years ago, but the post
| date appears to have been 2020-02-15 - yikes.
| quartz wrote:
| Nice to see articles like this describing a company's incident
| response process and the positive approach to incident culture
| via gamedays (disclaimer: I'm a cofounder at Kintaba[1], an
| incident management startup).
|
| Regarding gamedays specifically: I've found that many company
| leaders don't embrace them because culturally they're not really
| aligned to the idea that incidents and outages aren't 100%
| preventable.
|
| It's a mistake to think of the incident management muscle as one
| you'd like exercised as little as possible, when in reality it's
| something that should be kept in top form, because doing so comes
| with all kinds of downstream value for the company (a positive
| culture towards resiliency, openness, team building, honesty
| about technical risk, etc.).
|
| Sadly this can be a difficult mindset to break out of especially
| if you come from a company mired in "don't tell the exec unless
| it's so bad they'll find out themselves anyway."
|
| Relatedly, the desire to drop the incident count to zero
| discourages recordkeeping of "near-miss" incidents, which
| generally deserve to have the same learning process (postmortem,
| followup action items, etc) associated with them as the outcomes
| of major incidents and game days.
|
| Hopefully this outdated attitude continues to die off.
|
| If you're just getting started with incident response or are
| interested in the space, I highly recommend:
|
| - For basic practices: Google's SRE chapters on incident
| management [2]
|
| - For the history of why we prepare for incidents and how we
| learn from them effectively: Sidney Dekker's Field Guide to
| Understanding Human Error [3]
|
| [1] https://kintaba.com
|
| [2] https://sre.google/sre-book/managing-incidents/
|
| [3] https://www.amazon.com/Field-Guide-Understanding-Human-
| Error...
| athenot wrote:
| > Relatedly, the desire to drop the incident count to zero
| discourages recordkeeping of "near-miss" incidents, which
| generally deserve to have the same learning process
| (postmortem, followup action items, etc) associated with them
| as the outcomes of major incidents and game days.
|
| Zero recorded incidents is a vanity metric in many orgs, and
| yes, this loses many fantastic learning opportunities. The end
| result is that these learning opportunities eventually do
| happen, but with significant impact associated with them.
| pm90 wrote:
| > Regarding gamedays specifically: I've found that many company
| leaders don't embrace them because culturally they're not
| really aligned to the idea that incidents and outages aren't
| 100% preventable.
|
| So. Much. This. Unless leaders were engineers in the past or
| have kept abreast of how the technology has evolved, the default
| mindset is still "incidents should never happen" rather than
| "incidents will happen, how can we handle them better?". This is
| especially pronounced in politics-heavy environments, since
| outages are seen as a professional failure and a way to score
| brownie points against the team that failed. As a result, you
| often have a culture that tries to avoid being responsible for
| outages at any cost, which (ironically) leads to worse overall
| quality of the system, since the root cause is never dealt with.
| denton-scratch wrote:
| It doesn't match my experience with a real incident.
|
| I was a dev in a small web company (10 staff), moonlighting as
| sysadmin. Our webserver had 40 sites on it. It was hit by a not-
| very-clever zero-day exploit, and most of the websites were now
| running the attacker's scripts.
|
| It fell to me to sort it out - the rest of the crew were to keep
| on coding websites. The ISP had cut off the server's outbound
| email, because it was spewing spam. So I spent about an hour
| trying to find the malicious scripts, before I realised that I
| could never be certain that I'd found them all.
|
| You get an impulse to panic when you realise that the company's
| future (and your job) depends on you not screwing up; and you're
| facing a problem you've never faced before.
|
| So I commissioned a new machine, and configured it. I started
| moving sites across from the old machine to the new one. After
| about three sites, I decided to script the moving work. Cool.
|
| But the sites weren't all the same - some were Drupal (different
| versions), some were WordPress, some were custom PHP. The script
| worked for about 30 of the sites, with a lot of per-site manual
| tinkering.
|
| Note that for the most part, the sites weren't under revision
| control - there were backups in zip files, from various dates,
| for some of the sites. And I'd never worked on most of those
| sites, each of which had its own quirks. So I spent the next week
| making every site deploy correctly from the RCS.
|
| I then spent about a week getting this automated, so that in a
| future incident we could get running again quickly. Happily we
| had a generously-configured Xen server, and I could test the
| process on VMs.
|
| My colleagues weren't allowed to help out; they were supposed to
| go on making websites. And I got resistance from my boss, who
| kept demanding status updates ("are we there yet?").
|
| The happy outcome is that that work became the kernel of a proper
| CI pipeline, and provoked a fairly deep change in the way the
| company worked. And by the end, I knew all about every site the
| company hosted.
|
| We were just a web-shop; most web-shops are (or were) like this.
| If I was doing routine sysadmin, instead of coding websites, I
| was watched like a hawk to make sure I wasn't doing anything
| 'unnecessary'.
|
| This incident gave me the authority to do the sysadmin job
| properly; and in fact it saved me a lot of sysadmin time -
| because previously, if a dev wanted a new version of a site
| deployed, I had to interrupt whatever I was doing to deploy it.
| With the CI pipeline, provided the site had passed some testing
| and review stage, it could be deployed to production by the dev
| himself.
|
| It would have been cool to be able to do recovery drills,
| rotating roles and so on; but it was enough for my bosses that
| more than one person knew how to rebuild the server from scratch,
| and that it could be done in 30 minutes.
|
| Life in a small web-shop could get exciting, occasionally.
| pm90 wrote:
| It sounds like you're working in a different environment than
| the author. The environment they describe involves an ops
| _team_ rather than an ops _individual_ (what you've described).
| If you had to work with a team to resolve the incident, and had
| to do so on a fairly regular cadence, processes like this would
| likely be more useful.
| denton-scratch wrote:
| I have worked in a properly-organised devops environment
| (same number of staff, totally different culture).
|
| Anyway, I was just telling a story about a different kind of
| "incident response".
| rachelbythebay wrote:
| I may never understand why some places are all about assigning
| titles and roles in this kind of thing. You need one, maybe two,
| plus a whole whack of technical skills from everyone else.
|
| Also, conference calls are death.
| dilyevsky wrote:
| I find the Comms Lead role to be super useful because I don't
| want to be bogged down replying to customers in the middle of
| the incident, and I probably don't even have all the
| context/access. Everything else except ICM seems like a waste
| of time to me, especially Recorder.
| mimir wrote:
| It sort of baffles me how much engineer time is seemingly spent
| here designing and running these "gamedays" vs just improving and
| automating the underlying systems. Don't glorify getting paged,
| glorify systems that can automatically heal themselves.
|
| I spend a good amount of time doing incident management and
| reliability work.
|
| Red team/blue team gamedays seem like a waste of time. Either
| you are so early in your reliability journey that trivial things
| like "does my database failover" are interesting things to test
| (in which case just fix it), or you're a more experienced team
| and there's little low-hanging reliability fruit left. In the
| latter case, gamedays seem unlikely to closely mimic a real-world
| incident. Since the low-hanging fruit is gone, all your serious
| incidents tend to be complex failure interactions between various
| system components. To resolve them quickly, you simply want all
| the people with deep context on those systems quickly coming up
| with and testing out competing hypotheses about what might be
| wrong. Incident management only really matters in the sense that
| you want to let the people with the most system context focus on
| fixing the actual system. Serious incident management really only
| comes into play when the issue is large enough to threaten the
| company + require coordinated work from many orgs/teams.
|
| My team and I spend most of our time thinking about how we can
| automate any repetitive tasks or failover. When something
| can't be automated, we think about how we can increase the
| observability of the system, so that future issues can be
| resolved faster.
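|
| (A rough sketch of that pattern: try the automated remediation
| first, and emit structured events either way so whoever does
| get paged has context. The health-check URL and service name
| below are made up.)
|
|     # selfheal.py - illustrative watchdog, not a real system.
|     import json, subprocess, sys, time
|
|     CHECK_CMD = ["curl", "-sf", "https://example.internal/healthz"]
|     REMEDIATE_CMD = ["systemctl", "restart", "example-app"]
|
|     def emit(event, **fields):
|         # Structured log line; in practice this would feed your
|         # log pipeline / metrics system.
|         print(json.dumps({"event": event, "ts": time.time(),
|                           **fields}))
|
|     def healthy():
|         return subprocess.run(CHECK_CMD,
|                               capture_output=True).returncode == 0
|
|     if __name__ == "__main__":
|         if healthy():
|             sys.exit(0)
|         emit("health_check_failed", action="attempting_restart")
|         subprocess.run(REMEDIATE_CMD, check=False)
|         time.sleep(10)
|         if healthy():
|             emit("self_heal_succeeded")
|         else:
|             emit("self_heal_failed", action="paging_oncall")
|             sys.exit(1)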
| krisoft wrote:
| > which captures all the key log files and status information
| from the ailing machine.
|
| Machine? As in singular machine goes down and you wake up 5
| people? That just sounds like bad planning.
|
| > Pearson is spinning up a new cloud server, and Rawlings checks
| the documentation and procedures for migrating websites, getting
| everything ready to run so that not even a second is wasted.
|
| Heroic. But in reality you have already wasted minutes. Why is
| this not all automated?
|
| I understand that this is a simulated scenario. Maybe the
| situation was simplified for clarity, but really, if a single
| machine going down leads to this amount of heroics then you
| should work on those fundamentals. In my opinion.
| RationPhantoms wrote:
| Not only that but they appear to be okay with the fact that a
| single ISP has knocked them offline. If I was a customer of
| theirs and found out, I would probably change providers.
| LambdaComplex wrote:
| Agreed.
|
| While reading this, I was thinking "This is so important that
| you'll wake all these people up in the middle of the night, but
| you only have a single ISP? No backup ISP with automated
| failover?"
| commiefornian wrote:
| They skipped over a few steps of ICS. ICS starts with a single
| person playing all roles.
|
| It prescribes a way to scale the team up and down that
| streamlines communication: everyone knows their role, nothing
| gets lost when people come in and out of the system, and you
| don't have all-hands conference calls, multiple people telling
| the customers different things, or multiple people asking each
| responder for status updates.
| gengelbro wrote:
| The 'cool tactical' framing this article attempts to convey is
| not inspiring to me.
|
| I've worked as an oncall for a fundamental backbone service of
| the internet in the past and been paged into middle-of-the-night
| outages. It's harrowing and exhausting. Cool names like 'incident
| commander' do not change this.
|
| We also had a "see ya in the morning" culture. Instead I'd be
| much more impressed to have a "see ya in the afternoon, get some
| sleep" culture.
| choeger wrote:
| It seems to be a bit of cargo cult, to be honest. They seem to
| take inspiration from ER teams or the military.
|
| I think that this kind of drill helps a lot for cases where you
| can take a pre-planned route, like deploying that backup server
| or rerouting traffic. But the obvious question then is: Why not
| automate _that_ as well?
|
| When it comes to diagnosis or, worse, triage, in my experience
| you want independent free agents looking at the system all at
| once. You don't want a warroom-like atmosphere with a single
| screen but rather n+1 hackers focusing on what their first
| intuition tells them is the root cause. In a second step you
| want these hackers to convene and discuss their root cause
| hypotheses. If necessary, you want them to run experiments to
| confirm these hypotheses. And _then_ you decide the appropriate
| reaction.
| joshuamorton wrote:
| I agree. I think this particular framing gets things slightly
| wrong. You want parallelism, but you still need central
| organization (so that you can have clear delegation) and
| delegation of work to various researchers. For a complex
| incident, I've seen 5+ subteams researching various threads
| of the incident. But, importantly, before any of those
| subteams take any action, they report to the IC so that two
| groups don't accidentally take actions that might be good in
| isolation but are harmful when combined.
| 1123581321 wrote:
| My experience is there's little conflict between a central
| conference call or room, and multiple independent
| investigators, since those investigators need to present and
| compare their findings _somewhere_. It would indeed be a
| mistake to demand everyone look at one high-level view,
| though. Based on the organization depicted in the article,
| this would be the "researcher" role, split among multiple
| people.
| dvtrn wrote:
| _They seem to take inspiration from ER teams or the military_
|
| It's probably just me overestimating, but I feel like I'm
| seeing more of this later in my career than I did early on,
| or maybe I'm just paying more attention?
|
| Whatever it is: past experience (which includes coming from a
| military family in the States) has taught me to avoid
| companies that crib unnecessary amounts of jargon, lingo and
| colloquialisms from the military.
|
| Curious if others have noticed this or feel the same, and what
| your experiences have been that led you to feel that way.
| pinko wrote:
| Agree completely. It's a strong signal that someone has a
| military cosplay fetish (which very few people with
| experience in the actual military do), which in turn tends
| to come along with other dysfunctional traits. It's a
| warning for me that the person is not likely to be a good
| vendor, customer, or collaborator.
| pjc50 wrote:
| Yup. It's misplaced machismo, with all that implies.
| dvtrn wrote:
| My favorite one was when a superior was explaining a plan
| to right-size some new machines as we slowly migrated
| customers onto the appliance, along with some particularly
| aggravating issues we were having with memory consumption
| that, even after inspection and a lot of time spent, made no
| real sense to us.
|
| "dvtrn you are to take the flank and breach this issue
| with Paul"
|
| And this ran all the way up to the top of the org. Senior
| leaders were _constantly_ quoting that Jocko Willink
| fella. It was...something.
|
| My old man (a former Drill Instructor, made for an
| interesting childhood) found it utterly hilarious when
| I'd call him up randomly with the latest phrase of the
| day, uttered by some director or another. To my
| knowledge, and I sure-damn asked, the only affinity
| anyone on the executive team had with the military was
| two of them having buddies who served.
| igetspam wrote:
| I don't know how old you are but my career now exceeds two
| decades. I definitely see this more now but that's because
| I institute it. Earlier in my career, we failed at incident
| management and at ownership. We now share the burden of on-
| call not just with the operators (sysadmins or old) but
| also with the people who wrote the code. We've spent a lot
| of time building better models based on proven methods,
| quite a few of which come from work done in high-intensity
| roles paid for by tax dollars: risk analysis, disaster recovery,
| firefighting, command and control, incident management, war
| games, red teams.
| dvtrn wrote:
| You've got a couple of years on me, I've been in the game
| a little over 13 years now.
|
| I support the notion there's a strong difference between
| lingo that's properly applied to the situation and lingo
| that is recklessly applied because it "sounds cool".
|
| The examples you gave seem to be fair game for the work
| being done, in the interest of brief, specific language;
| the examples I gave in another comment ("flanking",
| "breaching"), however, are just grating and...weird to use
| in a work environment.
|
| Your point is nevertheless well taken.
| athenot wrote:
| Yes, it's a map-reduce algorithm. Multiple people check
| multiple areas of the system in parallel and then both
| evidence & rule-outs start to emerge.
| totetsu wrote:
| The worst part is hearing from your manager the next day that
| the NOC operator complained about your rude tone of voice when
| waking up and answering the phone at 3am.
| dmuth wrote:
| I don't disagree with your post, but one thing I want to
| mention is the origin of the term "Incident Commander"--it
| doesn't exist to be cool, but rather derives from how FEMA
| handles disasters. I suspect its usage in IT became a thing
| because it was already used in real life, and it made more
| sense than creating a new term.
|
| If you have two hours, you can take the training that describes
| the nomenclature behind the Incident Command System, and why it
| became a thing:
|
| https://training.fema.gov/is/courseoverview.aspx?code=is-100...
|
| This online training takes about 2 hours and is open to the
| general public. I took it on a Saturday afternoon some years
| ago and it gave me useful context to why certain things are
| standardized.
| [deleted]
| jvreagan wrote:
| > Instead I'd be much more impressed to have a "see ya in the
| afternoon, get some sleep" culture
|
| I've led teams for over a decade that have oncall duties. One
| principle we have lived by is that if you are paged outside of
| hours, you take off the time you need to not hold a grudge
| against the team/company. Some people don't need it, some
| people take a day off, some people sleep in, some people cash
| in on their next vacation. To each their own according to their
| needs. Seems to work well.
|
| We also swap out oncall in real time if, say, someone gets
| paged a couple nights in a row.
| athenot wrote:
| Yup, this is important or people will burn out real quick.
| And when there are major incidents, as IC it's especially
| important to dismiss people from bridges as early as possible
| when I know I'm going to need them again the following day, or
| swap in a more junior person so that
| the senior one is nice and fresh for when the next wave is
| anticipated.
| [deleted]
| Sebguer wrote:
| Yeah, the tone of this article is really odd, and like, the
| bulk of the content is just a narrativization of the incident
| roles in the Google SRE book. The only 'trick' is running game
| days?
| lemax wrote:
| These incident role names are fairly common in product
| companies these days. I guess you are correct that they do
| suggest a certain culture around incidents, but in my
| experience it's definitely a good thing. It's a "don't blame
| people, let's focus on the root cause, get things back up, and
| figure out how to prevent this next time" sort of thing. People
| try to meet SLAs and they treat each other like humans. We
| focus on improving process/frameworks over blaming individual
| people. And yup, think this comes along with, "incident
| yesterday was intense, I'm gonna catch up on sleep".
| pm90 wrote:
| I agree with your comment. These names are just ways for
| teams to delegate different responsibilities w.r.t incident
| management quickly and in a way that's understood by
| everyone. Having concrete names for such roles is both a good
| thing (everyone knows who can make the call for hard
| decisions) and helps you talk intelligently about the
| evolution of such roles. e.g. "our Incident Commanders used
| to spend 15% of their time in p0 incidents, but that has
| reduced to 10% due to improvements in rollout
| procedures/runbooks/etc."
| zippergz wrote:
| The concepts and terms of incident command are not from the
| military (or ER as another poster suggested). It's from the
| fire service and emergency management in general. I don't know
| if that changes people's perceptions, and I agree that no amount
| of terminology changes how exhausting being on call is. But if
| people are reacting negatively to "military" connotations, I
| think that is unwarranted.
|
| https://en.wikipedia.org/wiki/Incident_Command_System
| nucleardog wrote:
| I think actually learning what the ICS is _for_ might help
| people understand a bit better why it's not necessarily just
| "unnecessary tacticool". It's not just a bunch of important-
| sounding names for things.
|
| ICS, at its core, is a system for helping people self-
| organize into an effective organization in the face of
| quickly changing circumstances and emergent problems.
|
| Some simple rules are things like:
|
| * The most senior/qualified person on-site is generally in
| charge. (How you determine that kinda varies depending on
| organization.)
|
| * Positions are only created when required. You don't assign
| people roles unless there's a need for that role.
|
| * Positions are split and responsibilities delegated as the
| span of control increases beyond a set point.
|
| * Control should stay as local to the problem as it
| realistically can while still solving the problem.
|
| From there, it goes on to standardize a template hierarchy
| and defines things like specific colours associated with
| specific roles so as roles change and chaos ensues, people
| can continue to operate effectively and in an organized
| manner. In-person, this means things like the
| commander/executive roles running around in red vests with
| their role on the back. If the role changes hands, so does
| the vest.
|
| Some of the roles in that template organization are things
| like:
|
| * The "Public Information Officer" who is responsible for
| preparing and communicating to the public. This makes a
| single person responsible to ensure conflicting or confusing
| messaging is not making its way out.
|
| * A "Liason Officer" who is responsible for coordinating with
| other organizations. This provides another central point of
| coordination for requests flowing outside of your response.
|
| I think we could all imagine how this starts to become
| valuable in, say, a building collapse scenario with police,
| fire, EMS, the gas company, search and rescue, emergency
| social services, etc all on scene.
|
| In an IT context, what this means is that, generally, the
| most senior person online is going to be in charge of
| receiving reports from people and directing them. If there
| aren't many people around, they'd generally be pitching in to
| help as well.
|
| As more people show up and the communication and coordination
| overhead increases, they step out of doing any specific
| technical work. If enough show up, they may then delegate
| people out as leaders of specific teams tasked with specific
| goals (they may also just tell them they're not needed and
| send them to wait on standby).
|
| All roles, including the "Public Information" and "Liaison"
| roles fall to the Incident Commander unless delegated out. At
| some point, if the requests for reporting from management
| start interfering with their role as Incident Commander, they
| delegate that role out. If it turns out the incident is going
| to require heavy communication or coordination with a vendor,
| they may delegate out the Liaison role to someone else.
|
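| (A toy model of that delegation rule, using only the role names
| from the description above; purely illustrative.)
|
|     # Every role defaults to the Incident Commander until it is
|     # explicitly handed off.
|     class Incident:
|         ROLES = ("public_information", "liaison")
|
|         def __init__(self, commander):
|             self.commander = commander
|             # All roles start with the IC.
|             self.assignments = {r: commander for r in self.ROLES}
|
|         def delegate(self, role, person):
|             if role not in self.ROLES:
|                 raise ValueError(f"unknown role: {role}")
|             self.assignments[role] = person
|
|     incident = Incident(commander="alice")
|     # Management reporting starts to interfere, so comms is
|     # handed off, exactly as described above.
|     incident.delegate("public_information", "bob")
|     print(incident.assignments)
|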
| ICS is probably largely unnecessary if your response never
| grows larger than the number of people who can effectively
| communicate in a Google Meet call. But as you get more and
| more people involved, it contains a lot of valuable lessons
| learned through real-world experience in situations much more
| stressful and dangerous than we ever face, lessons that help
| you effectively manage and coordinate the people responding
| to an incident.
|
| (Disclaimer: That's all basically from memory. The city sent
| me on an ICS course, a course on ICS in an emergency
| operations centre context,
| and a few more courses a few years back as part of
| volunteering with an emergency communications group. It's
| probably 90% accurate.)
| raffraffraff wrote:
| EU working time act means "I'll see you in the afternoon",
| whether they like it or not.
| tetha wrote:
| > We also had a "see ya in the morning" culture. Instead I'd be
| much more impressed to have a "see ya in the afternoon, get
| some sleep" culture.
|
| German labor law forbids employees from just working 10-13
| hours when a long on-call situation follows a normal work day.
| Add in time compensation, and a bad on-call situation at night
| easily ends up as the next day off, paid.
|
| I've found this takes a lot of the edge off on-call. Sure, it
| /sucks/ to get called at 1am and do stuff until 3, but that's a
| day to sleep in and recover. Maybe hop on a call if the team
| needs input, but that's optional.
| ipaddr wrote:
| So they are testing against fully awake people at 2:30pm and
| expecting similar results at 4:30am after heavy drinking.
| x3n0ph3n3 wrote:
| Do you really drink heavily when you are on-call?
| LambdaComplex wrote:
| Based on my understanding of the UK's drinking culture: it
| wouldn't be the most surprising thing ever
| shimylining wrote:
| Why not? I am not working just to work, I am working to enjoy
| my life with the people around me. I am also from Europe so
| it might be different views, but just because I am on call
| doesn't mean I'm going to stop living my life. Work doesn't
| define me as a person. :)
| pm90 wrote:
| If I'm on call (OC), I'm responsible for the uptime of the
| system even after hours. So If I'm planning on going
| hiking, I will inform the secondary OC, or delay plans to a
| weekend when I'm not OC. Generally I do tend to avoid
| getting heavily inebriated (although of course there are
| times when this is unavoidable).
|
| I'm not judging, but just pointing out that I've certainly
| experienced a different OC culture in the US.
| burnished wrote:
| You're mixed up, they're drilling. You drill so that when an
| emergency happens and it's 4:30am and you're bleary-eyed, your
| hands already know what to do (and are doing it) before your
| eyes even open all the way.
| candiddevmike wrote:
| This sounds like a great way to wipe a database accidentally
| (like GitLab). The worst thing you can do to help fix a
| problem is having people asleep at the wheel.
| pm90 wrote:
| Noted, but the point of the drill is precisely to uncover
| these failure modes and attempt to fix them. e.g. you might
| have automated runbooks to fix the problem rather than
| access the DB directly. You might have frequent backups and
| processes to easily restore from backups in case of
| database wipes.
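|
| (A rough sketch of the "automated runbook instead of poking the
| database by hand" idea: only pre-reviewed, named actions can be
| run, and every run is logged. Runbook names and commands here
| are hypothetical.)
|
|     # runbook.py - illustrative only.
|     import json, subprocess, sys, time
|
|     # Each entry was written and reviewed before any incident.
|     RUNBOOKS = {
|         "restore-latest-backup": ["pg_restore", "--clean",
|                                   "--dbname=appdb",
|                                   "/backups/appdb-latest.dump"],
|         "restart-app": ["systemctl", "restart", "example-app"],
|     }
|
|     def run(name):
|         if name not in RUNBOOKS:
|             print(f"unknown runbook {name!r}; "
|                   f"known: {sorted(RUNBOOKS)}")
|             return 2
|         print(json.dumps({"event": "runbook_started",
|                           "name": name, "ts": time.time()}))
|         result = subprocess.run(RUNBOOKS[name])
|         print(json.dumps({"event": "runbook_finished",
|                           "name": name, "rc": result.returncode,
|                           "ts": time.time()}))
|         return result.returncode
|
|     if __name__ == "__main__":
|         sys.exit(run(sys.argv[1] if len(sys.argv) > 1 else ""))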
___________________________________________________________________
(page generated 2021-08-09 23:00 UTC)