[HN Gopher] War Rooms vs. Deep Investigations
___________________________________________________________________
War Rooms vs. Deep Investigations
Author : ingve
Score : 145 points
Date : 2025-02-23 12:01 UTC (10 hours ago)
(HTM) web link (rachelbythebay.com)
(TXT) w3m dump (rachelbythebay.com)
| adolph wrote:
| It's interesting to think about how broad a net must be cast to
| understand the state of a system.
|
| _That was another rathole, and the answer was also a thing to
| behold: I couldn't see it in the checked-in source code because
| it had been fixed. Some other engineer on a completely unrelated
| project had tripped over it, figured it out, and sent a fix to
| the team which owned that program. They had committed it, so the
| source code looked fine._
| esafak wrote:
| And the fact that everyone benefits when people aren't just
| doing their own, narrowly-defined jobs.
| tantalor wrote:
| This is an obvious, first thing to check when you are looking
| directly at source code.
|
| Oh, the code changed 1 week ago? Let's see the diff. Oooooooh!
| belval wrote:
| > Could I run my terminals in there? Yes. Did I? Yes, for a
| while. Was I effective? Not really. I missed my desk, my normal
| chair, my big Thunderbolt monitor, my full-size (and yet entirely
| boring) keyboard, and a relatively odor-free environment.
|
| Not Meta but at Amazon I always felt like war rooms are a place
| for some leader to scream at you and not much else. The reality
| is that debugging some retry storm, resource exhaustion or
| whatever won't happen in a room with 18 people talking over one
| another.
|
| Give me a meeting link, I'll join and provide info as I find it,
| but this type of sweaty hackathon-style all-hands-on-deck was
| never productive for me.
| nine_zeros wrote:
| >Not Meta but at Amazon I always felt like war rooms are a
| place for some leader to scream at you and not much else.
|
| It is for the some "leader". The vast large tech industry is
| filled with phony leaders who don't understand how the job is
| done and what makes the doers tick.
|
| But they occupy the place of "leadership". They must be seen as
| doing something. So they are doing the something that they can
| - scream at people in a locked room.
|
| If they could actually solve technical problems or talk to
| their bosses like a real engineering leader, they would. But
| they literally are incapable of doing so.
|
| So war rooms and BS performative art it is.
| bloomingkales wrote:
| The role of a leader is an age old role. When someone is
| thrown into leadership, I do believe a lot of adrenaline
| kicks in. You begin acting as if you are a leader similar to
| how a parent has parent senses and will run into a street to
| save any kid from a car (poor example, any human should, but
| hopefully you get my point). I think what you get in a war
| room is that primal "phenomenon" of "oh shit I'm the leader
| now". You have to weather the primal emotions, and get a cool
| head back on to fulfill the leadership role.
|
| If it's your first time, then yeah, you will probably handle
| it like a dick (or twat). You gotta take on an ancient role
| with humility.
| teeray wrote:
| > It is for the some "leader".
|
| Exactly. It's so the leader can ask "do we have an update?"
| every 10 minutes when nothing has changed.
| Hasu wrote:
| > I always felt like war rooms are a place for some leader to
| scream at you and not much else. The reality is that debugging
| some retry storm, resource exhaustion or whatever won't happen
| in a room with 18 people talking over one another.
|
| I once walked out of a war room (at a much smaller company that
| I wouldn't be at for much longer) that had devolved into
| finger-pointing and blame games. Half an hour later, my boss
| came out to find out what I was doing and I pointed at my
| screen and said, "This. This is what's wrong. Ship my fix and
| we're done here." The entire war room came to my desk to see
| and discuss the fix, which we shipped, which solved the issue.
|
| At my next job, I had to hold back laughter when the VP of
| Engineering, who was pushing mob programming, said, "Think
| about it. When we have an incident, when something is really
| important, what do we do? We all get in a room together. No one
| leaves the war room to go solve the incident on their own."
| bloomingkales wrote:
| _"This. This is what's wrong. Ship my fix and we're done
| here."_
|
| Lol. This is not hyperbole. Just about everyone has several
| stories like this, and they are quite hilarious in their
| utter absurdity. It's like these people get possessed by the
| spirit of Gordon Gekko in that exact moment and must
| absolutely play out the role to a tee. Then they become
| unpossessed and go skiing on weekends.
| aledalgrande wrote:
| In my experience it's only (and exactly) leaders with neither
| tech nor people skills that do this. Have experienced both
| good and bad. A world of difference.
| alabastervlog wrote:
| Leaders are really into playing pretend and it seems to just
| get worse the higher they are on the ladder.
|
| Like, literally, an effective way to sell to them is to make
| them feel like they're in a movie doing Super Important Things.
| LOL. Executive Disneyland.
| SpicyLemonZest wrote:
| Much of a leader's job is to visibly perform leadership. It
| seems silly until the first time you need something big from
| a team whose managers do it poorly, and you realize that
| they're incapable of making commitments or setting
| priorities.
|
| The expectation that leaders will play pretend about a "war"
| and call everyone into a "war room" is just a part of what it
| means for an organization to commit that consistent high
| uptime is a top priority.
| bossyTeacher wrote:
| > I can't imagine doing that kind of multi-window parallel
| investigation stuff on a teeny little laptop screen with people
| right next to me on either side
|
| This is it. Managers (I mean non technical folk) don't understand
| this. They don't understand that putting people physically
| together won't help you solve the issue faster. This is the same
| mentality that believes that typing code faster or generating
| more code is a good thing. The kind that believes that all
| employees need to always be physically together for "good stuff"
| to happen.
|
| Sadly, they will never learn. Those managers and c-suite people
| will never read Rachel's post or investigate if their RTO
| policies are necessarily good for the business. These folks are
| just reading numbers on a spreadsheet without fully understanding
| what those numbers actually mean in their business.
|
| Sadly, I don't see that ever changing because that mentality
| provides a comforting worldview where the office gives you a sense of
| control and having all your cows in the farm under your watchful
| eye (or that of your trusty shepherds) feels so intuitive that
| any alternatives are simply too uncomfortable to even think
| about.
| bloomingkales wrote:
| There was a head of a department that once forced everyone to
| uninstall iTunes because he believed it was reducing
| productivity. Feels like a never-ending battle with these
| types.
| trollied wrote:
| I used to be a Rachel-a-like in a past life. Really tight SLAs
| (mobile network infra etc, so people have to be able to make
| emergency calls, for example).
|
| So many times I got bridged into a conference call whilst fixing
| things & doing RCAs against tight SLAs, as non-technical people
| didn't have any sort of idea that it was wasting my time. "I am
| fixing it, I will send updates as per contractual agreements"
| _puts phone down_.
|
| On several occasions I got 2 emails after the fact - one praising
| me for resolving quickly, another asking me to please be nicer to
| executives. The calls stopped after 5 or 6 times.
|
| Things have moved on these days, and it's much easier to
| coordinate such events on Slack etc. Thankfully!
| Nifty3929 wrote:
| Please be empathetic to people who do not understand what is
| going on, but who have tremendous responsibility to the
| business. The business problem is always a superset of the
| technology problem.
|
| Yes, of course you aren't able to fix it while you are on the
| phone with them. A conference call will not fix the code. They
| know that too. But they also need meaningful information and
| updates in order to do their jobs, which often requires them to
| provide updates to others like important customers,
| shareholders, the CEO, or even the government. They may also
| need this in order to plan out other activities.
|
| Providing useful information and frequent updates (not
| "contractual" updates) to them with this in mind would go a
| long way toward solving the whole business problem that is
| created by the technical problem. It might also get them off
| your back sooner, and with more respect for you.
|
| There are two critical pieces of information that would help an
| executive very much: Do we know what the problem is? and Do we
| know what the solution is? A simple yes/no on both of those
| would be a great start.
| aqueueaqueue wrote:
| Surely more than one person is working on the fix? If you
| have a pair one can pop off and give updates to a third
| technical person (maybe their manager or an incident
| manager) who can liaise.
| ameliaquining wrote:
| Communicating that information to executives needs to be the
| responsibility of someone who isn't currently heads-down
| debugging. Google's SRE Book suggests creating a
| "communications lead" role.
| cratermoon wrote:
| At one employer our site outage recovery runbook
| specifically stated that one person was to be designated to
| communicate status outside the tiger team and be a buffer
| between panicky people across the company and the technical
| folks fixing the problem.
| CapricornNoble wrote:
| I'm not familiar with the "War Room" in the context of computer
| network operations specifically, but I have deep experience
| running military operations centers and I'm reading this through
| that lens.
|
| >People figured out that yes, they had run the machines out of
| memory, specifically with the push - the distribution of new
| bytecode to the web servers. Other people started taking steps to
| beat back some of the bloat that had been creeping in that
| summer, so the memory situation wouldn't be so bad. I suspect
| some others also dialed back the number of threads (simultaneous
| requests) on the smaller web servers to keep them from running
| quite as "hot".
|
| Cross-functional information exchange. Who is coordinating or
| directing all these disparate actions? Who is fusing the
| knowledge gained from these actions? Who is disseminating a
| clearer picture of "what really happened"? Who is using that
| updated picture to frame new taskings for all the people doing
| these independent investigations? The answer to all those
| questions should be "the staff in the War Room", and the
| leadership in the War Room in particular. My take-away is that
| the author is arguing that their ability to pursue single-
| function actions within their domain of expertise was optimized
| in their work environment, and was degraded in the War Room. They
| aren't wrong.
|
| >I guess a "war room" might work out if you have a bunch of stuff
| that has to happen to deal with a possible "crisis" and then it's
| just a matter of coordinating it. You don't have people doing
| "heads-down hack" stuff nearly as much in a case like that.
|
| Exactly. Coordinating a bunch of stuff for crisis management =
| put those people in the War Room. Focused heads-down tasks = put
| those people where they can ...focus. Now that said....one thing
| I've come to HATE about working in a military headquarters is
| open offices for everyone who isn't the G-shop lead and his/her
| deputy. Everyone else is shoved into a cubicle farm, probably
| with ESPN blaring in the background on top of a half-dozen
| conversations and people constantly dropping by your desk to BS
| about cover sheets for TPS Reports. So even if you're NOT in the
| War Room, you can't focus.
| icegreentea2 wrote:
| I feel like some things that consistently get in the way of
| the clean separation between (crudely speaking) deciders and
| doers, and keeping the doers out of the war room (so they can
| work effectively) are:
|
| Poor, or fear of poor communication. The "do-ers" become
| compelled to be in the war room to try to mitigate
| communication failures.
|
| Unclear decision making processes and ownership. People with
| high technical expertise (who would be top tier do-ers who
| maybe should be kept out of the war room) are kept around because
| their immediate feedback in the war room can significantly
| shift the decision making process and decisions made.
|
| I should be more specific - I believe there's often a desire
| (and makes instinctive sense) to fall back to decision by
| consensus. Once everyone understands that this is how these
| things work, then obviously you want to pack the smartest, most
| competent people in the room, either because you're playing
| political games, and you want more "votes", or because truly
| you believe that you need the best people in the room to guide
| consensus.
|
| These are structural and cultural (non-cynical) issues that
| drive both doers and decision makers to -want- to keep smart,
| competent doers in the room, even though separation -should-
| lead to better outcomes.
| coldcode wrote:
| I've watched many war rooms at various employers.
|
| At one (20 years ago), they met for six months to determine why
| our field offices' network connection to the home office was so
| pathetic and unusable. It was led by the head of networking.
| After all those meetings, it was decided that all 1000
| independent field offices should upgrade their internet to T1
| connections. It didn't help. Another six months goes by, and I
| hear from my connections in networking that the real problem was
| the head of networking had installed a half-duplex low-speed
| ethernet card: all 1000 offices' data had been going through a
| pinhole. It was replaced, and suddenly everything was fine again,
| other than the hole in the offices' pockets for an unnecessary
| upgrade.
|
| No one ever mentioned it publicly.
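
A rough back-of-the-envelope sketch of why upgrading the field offices could not have helped. The T1 rate of 1.544 Mbit/s is the standard figure; the 10 Mbit/s speed for the "low-speed" card is an assumption (the comment gives no number), and half-duplex overhead is ignored, so the real picture was probably worse.

```python
# Back-of-the-envelope bottleneck arithmetic for the story above.
T1_MBIT = 1.544     # standard T1 line rate, Mbit/s
OFFICES = 1000      # number of field offices, from the comment
CARD_MBIT = 10.0    # ASSUMPTION: the "low-speed" card ran at 10 Mbit/s


def aggregate_demand_mbit(offices: int, per_office_mbit: float) -> float:
    """Peak traffic the offices could offer if every link were saturated."""
    return offices * per_office_mbit


def fair_share_kbit(bottleneck_mbit: float, offices: int) -> float:
    """Even split of the bottleneck across all offices, in kbit/s."""
    return bottleneck_mbit * 1000 / offices


if __name__ == "__main__":
    demand = aggregate_demand_mbit(OFFICES, T1_MBIT)
    share = fair_share_kbit(CARD_MBIT, OFFICES)
    print(f"offices could offer up to {demand:.0f} Mbit/s")
    print(f"the assumed card passes at most {CARD_MBIT:.0f} Mbit/s")
    print(f"even share per office: {share:.0f} kbit/s")
```

Under these assumptions, a faster link at each office only raises the offered load; the even share through the single card stays the same.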
| gherkinnn wrote:
| War room. In the trenches. War stories. Pasty programmers and
| plump PMs using such terminology is a bit silly.
|
| Fixing a printer that sometimes does something unexpected is not
| even a sailor's yarn, let alone a war story.
| alabastervlog wrote:
| "Telemetry" for "keylogging our fart app's users" because
| everyone wishes they were doing something cool and/or
| meaningful.
| yuliyp wrote:
| A war room / call is for coordination, e.g. when you need the person
| draining the bad region to know that "oh that other region is
| bad, so we can't drain too fast" or "the metrics look better
| now".
|
| For truly understanding an incident? Hell no. That requires heads
| down *focus*. I've made a habit of taking a few hours later in
| the day after incidents to calmly go through and try to write a
| correct timeline. It's amazing how broken people's perceptions of
| what was actually happening at the time are (including my own!).
| Being able to go through it calmly, alone provided tons of
| insights.
|
| It's this type of analysis that leads to proper understanding and
| the right follow-ups rather than knee-jerk reactions.
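
One way to start on a timeline like the one described above is to merge timestamped events from several sources into a single ordered list. This is only a minimal sketch: the `Event` type, the source names, and the sample events are all hypothetical and invented for illustration, and each input list is assumed to be sorted by time already.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    at: datetime   # when it happened
    source: str    # where the record came from (chat, deploy log, ...)
    note: str      # what was observed


def merge_timelines(*sources: Iterable[Event]) -> Iterator[Event]:
    """Merge per-source event lists (each already sorted by time) into one."""
    return heapq.merge(*sources, key=lambda e: e.at)


if __name__ == "__main__":
    # Hypothetical events, invented purely for illustration.
    chat = [
        Event(datetime(2025, 1, 1, 10, 5), "chat", "someone reports errors"),
        Event(datetime(2025, 1, 1, 10, 20), "chat", "drain started"),
    ]
    deploys = [Event(datetime(2025, 1, 1, 10, 0), "deploy", "push begins")]
    for e in merge_timelines(chat, deploys):
        print(e.at.isoformat(timespec="minutes"), e.source, e.note)
```

Seeing the sources interleaved by timestamp is often where remembered ordering and recorded ordering turn out to disagree.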
| DylanDmitri wrote:
| I've been in some good coordinating calls for widespread
| incidents. Many unique individuals (15+) talked in a ten minute
| period, sharing context on what their teams were seeing, what
| remediations had worked for them, etc.
| master_crab wrote:
| This. People keep commenting about it being performative.
| That's orthogonal to its purpose. Even the original blogpost
| points out the limitation of singular focused effort without
| acknowledging it. It took the author weeks to figure out the
| actual issue.
|
| If FB had been down that long, they'd be out of business.
| thwarted wrote:
| It took me seven weeks (not full time, but from the initial
| incident to the final publishing) to do the research and write
| up for a recent event. This included in-person interviews, data
| correlation, reading code, and revision control spelunking
| across multiple repositories to understand the series of events
| and decisions that led to the event, some of them months
| earlier. Some people were advocating "get it out because we
| have to move on", which I pushed back on. Once published, the
| feedback was positive and some folks acknowledged that knee-
| jerk follow-up reactions would have made things worse. But to
| get to the point where the post-incident review is valuable
| someone has to put in the actual work and time to make it so.
| It should be a learning experience, not checking a box;
| otherwise, we're just spinning our wheels without making any
| progress.
| hackpelican wrote:
| In the places I've worked, a war room was always the place where
| we cut the bleeding and revert the system to a working state.
| Never was the RCA the intended outcome of a war room, though we'd
| often reach the RCA in the silence of the meeting bridge while
| something deployed/rolled back.
|
| Root cause analysis is definitely not a group activity, it's best
| done in a place where one can have complete focus.
|
| However, cutting the bleeding requires plenty of communication,
| weighing different options, having a higher-up sign off on a
| tradeoff, getting our ops team to coordinate towards some common
| goal, monitoring the recovery... etc.
| sunshowers wrote:
| So interestingly, I think root cause analysis can be a group
| effort, but I think it has to be done on a remote call where
| everyone is in front of a big monitor or two, and people can
| take breaks and such. I've been part of teams that have done
| root cause analysis over a call (sometimes many calls), and
| it's been quite effective.
| afro88 wrote:
| IIRC, Facebook don't (or didn't) do rollbacks. They always fix
| forward. I guess hours long incidents like this are the other
| edge of that double edged sword.
| claytonjy wrote:
| Language can be tricky here. If I revert to an older commit,
| literally rewriting history to remove newer, bad commits, I
| think we'd all consider that a rollback. But if I instead add
| a new commit which undoes the bad commits, is that a rollback
| or a roll forward?
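
The distinction in the comment above can be shown with a toy model. This is a sketch, not how any real version control system stores data: `Commit`, `roll_back`, and `roll_forward` are hypothetical names, and a "change" is reduced to an integer delta. The two functions loosely mirror what `git reset --hard` and `git revert` do to history.

```python
from typing import List, NamedTuple


class Commit(NamedTuple):
    name: str
    delta: int  # toy "change": the deployed state is the sum of all deltas


def state(history: List[Commit]) -> int:
    """The state you end up with after applying every commit in order."""
    return sum(c.delta for c in history)


def roll_back(history: List[Commit], keep: int) -> List[Commit]:
    """Rewrite history: drop everything after the first `keep` commits."""
    return history[:keep]


def roll_forward(history: List[Commit], keep: int) -> List[Commit]:
    """Keep history intact and append one commit that undoes the rest."""
    undo = -sum(c.delta for c in history[keep:])
    return history + [Commit(f"revert-after-{keep}", undo)]


if __name__ == "__main__":
    history = [Commit("good", 5), Commit("bad-1", 3), Commit("bad-2", -1)]
    back = roll_back(history, keep=1)
    forward = roll_forward(history, keep=1)
    print("rolled back:   ", state(back), [c.name for c in back])
    print("rolled forward:", state(forward), [c.name for c in forward])
```

Both paths end in the same state; they differ only in whether the bad commits remain visible in the history.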
| steveBK123 wrote:
| The purpose of the war room is not to solve the problem but to
| perform the act of problem solving visibly for certain audiences.
| dakiol wrote:
| As a software engineer I generally can help little when a non-
| trivial incident occurs whether it is via war rooms or deep
| investigations. I do have some kind of access to some logs,
| traces and metrics (datadog, for instance), but at the end only
| the SREs or platform engineers are the ones who determine the root
| cause of any incident because they have 100% observability.
| willvarfar wrote:
| The "war room" or "tiger team" or whatever it's called is often a
| way to parachute in a handful of engineers that top management
| trusts to sort out the mess made by the masses. Often crusty old-
| timer engineers are kept around just to be called on in these
| scenarios.
| Nifty3929 wrote:
| Yes, and this also gives the lie to ageism. I hear this from
| older-than-me people fairly often, that the reason that they
| can't get the job they want, or a promotion, or whatever, is
| 'ageism.'
|
| Meanwhile, I routinely see people older than me (I'm not young)
| being hired, promoted and generally shown great respect -
| because their years of experience have given them wisdom. They
| also remember how things developed over time and have more
| experience with details farther down the abstraction stack
| because those abstractions weren't around when they cut their
| teeth.
|
| I aspire to be one of those grey beards in the not so distant
| future. And I doubt my age will ever hold back my career, aside
| from my own personal choices (for retirement, fewer hours,
| etc).
| pbronez wrote:
| Yes, but that expertise may not be easily transferable. Two
| decades of experience with a firm is much more valuable to
| that specific firm than anywhere else. If you leave that
| place, you only have general lessons to apply elsewhere.
| hedayet wrote:
| Ex-Google SRE here with experience in multiple revenue-critical
| war rooms. At Google, war rooms were particularly useful because
| saying, "X is in a war room" (at least as late as 2017) gave X
| the credibility to say no to everything else. Having technically
| competent leaders made the experience enjoyable--because they
| weren't just there to demand updates but actively contributed by
| writing queries, and nudging the team in the right direction by
| asking a series of the right questions.
|
| My worst experience with crisis management was with one
| particular team at another big tech company, where the leaders
| were ignorant about the technology--completely clueless about the
| service and its architecture. In cases like this, the issue
| becomes a binary 0/1 problem: the service is either broken (0) or
| running smoothly (1). When a leader lacks the technical knowledge
| to grasp the intermediate steps, their only contribution is
| yelling for updates--and that's exactly what they did.
|
| Bottom line: War rooms can be a space for deep work with good
| leadership (a combination of technical soundness and co-
| ordination skills under pressure). But they can quickly turn into
| hell when leadership lacks one of these two essential qualities--
| and resorts to yelling to cover their asses.
| cratermoon wrote:
| "This fbagent process ran as root, ran a bunch of subprocesses,
| called fork(), didn't handle a -1 return code, and then later
| went to kill that "wayward child". Sending a signal (SIGKILL in
| this case) to "pid -1" on Linux sends it to everything but init
| and yourself. If you're root (yep) and not running in some kind
| of PID namespace (yep to that too), that's pretty much the whole
| world."
|
| Key phrase "didn't handle a -1 return code".
|
| Yuan, Ding, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu
| Zhao, Yongle Zhang, Pranay U Jain, and Michael Stumm. "Simple
| Testing Can Prevent Most Critical Failures." Proceedings of the
| 11th Symposium on Operating Systems Design and Implementation
| (OSDI), 2014, 17.
| https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...
| edflsafoiewq wrote:
| > This fbagent process ran as root, ran a bunch of subprocesses,
| called fork(), didn't handle a -1 return code, and then later
| went to kill that "wayward child".
|
| In-band error codes strike again.
| pedrocr wrote:
| This is a case of both in-band error codes and overloaded
| meanings of inputs colliding. Modern languages make both things
| much better but even in C the kill(2) interface seems much too
| clever. It seems it could have easily been a couple of
| different functions.
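
To make the overloading concrete, here is a small sketch of the pid argument's meanings under POSIX kill(2), plus a hypothetical wrapper (`kill_one_process` is not a real library function) that only ever signals a single process. Note that Python's own `os.fork` raises `OSError` on failure rather than returning -1, so the -1 below merely stands in for the C return value from the quoted story.

```python
import os
import signal


def describe_kill_target(pid: int) -> str:
    """Say who kill(2) would signal for a given pid argument (POSIX rules)."""
    if pid > 0:
        return f"process {pid}"
    if pid == 0:
        return "every process in the caller's process group"
    if pid == -1:
        # Exact exclusions are system-specific (Linux skips init and the caller).
        return "every process the caller is allowed to signal"
    return f"every process in process group {-pid}"


def kill_one_process(pid: int, sig: int = signal.SIGTERM) -> None:
    """Hypothetical wrapper: signal exactly one process, never a group."""
    if pid <= 0:
        raise ValueError(f"refusing to signal pid {pid}: not a single process")
    os.kill(pid, sig)


if __name__ == "__main__":
    failed_fork_result = -1  # what C's fork() returns when it fails
    print("kill would hit:", describe_kill_target(failed_fork_result))
    try:
        kill_one_process(failed_fork_result)
    except ValueError as err:
        print(err)
```

Splitting the single-process case out this way is one reading of the "couple of different functions" suggestion: an unchecked error value then fails loudly instead of being reinterpreted as "everyone".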
| Simon_O_Rourke wrote:
| Why do all these posts descend into the "I'm so awesome"
| archetype? Describe the damned problem and how it was resolved,
| and for goodness' sake stop trying to stroke that ego while you're
| doing it.
___________________________________________________________________
(page generated 2025-02-23 23:00 UTC)