[HN Gopher] War Rooms vs. Deep Investigations
       ___________________________________________________________________
        
       War Rooms vs. Deep Investigations
        
       Author : ingve
       Score  : 145 points
       Date   : 2025-02-23 12:01 UTC (10 hours ago)
        
 (HTM) web link (rachelbythebay.com)
 (TXT) w3m dump (rachelbythebay.com)
        
       | adolph wrote:
       | It's interesting to think about how broad a net must be cast to
       | understand the state of a system.
       | 
       |  _That was another rathole, and the answer was also a thing to
       | behold: I couldn't see it in the checked-in source code because
       | it had been fixed. Some other engineer on a completely unrelated
       | project had tripped over it, figured it out, and sent a fix to
       | the team which owned that program. They had committed it, so the
       | source code looked fine._
        
         | esafak wrote:
         | And the fact that everyone benefits when people aren't just
         | doing their own, narrowly-defined jobs.
        
         | tantalor wrote:
         | This is an obvious, first thing to check when you are looking
         | directly at source code.
         | 
         | Oh, the code changed 1 week ago? Let's see the diff. Oooooooh!
        
       | belval wrote:
       | > Could I run my terminals in there? Yes. Did I? Yes, for a
       | while. Was I effective? Not really. I missed my desk, my normal
       | chair, my big Thunderbolt monitor, my full-size (and yet entirely
       | boring) keyboard, and a relatively odor-free environment.
       | 
       | Not Meta but at Amazon I always felt like war rooms are a place
       | for some leader to scream at you and not much else. The reality
       | is that debugging some retry storm, resource exhaustion or
       | whatever won't happen in a room with 18 people talking over one
       | another.
       | 
       | Give me a meeting link, I'll join and provide info as I find it,
       | but this type of sweaty hackathon-style all-hands-on-deck was
       | never productive for me.
        
         | nine_zeros wrote:
         | >Not Meta but at Amazon I always felt like war rooms are a
         | place for some leader to scream at you and not much else.
         | 
         | It is for some "leader". The vast tech industry is
         | filled with phony leaders who don't understand how the job is
         | done and what makes the doers tick.
         | 
         | But they occupy the place of "leadership". They must be seen as
         | doing something. So they are doing the something that they can
         | - scream at people in a locked room.
         | 
         | If they could actually solve technical problems or talk to
         | their bosses like a real engineering leader, they would. But
         | they literally are incapable of doing so.
         | 
         | So war rooms and BS performative art it is.
        
           | bloomingkales wrote:
           | The role of a leader is an age old role. When someone is
           | thrown into leadership, I do believe a lot of adrenaline
           | kicks in. You begin acting as if you are a leader similar to
           | how a parent has parent senses and will run into a street to
           | save any kid from a car (poor example, any human should, but
           | hopefully you get my point). I think what you get in a war
           | room is that primal "phenomenon" of "oh shit I'm the leader
           | now". You have to weather the primal emotions, and get a cool
           | head back on to fulfill the leadership role.
           | 
           | If it's your first time, then yeah, you will probably handle
           | it like a dick (or twat). You gotta take on an ancient role
           | with humility.
        
           | teeray wrote:
           | > It is for some "leader".
           | 
           | Exactly. It's so the leader can ask "do we have an update?"
           | every 10 minutes when nothing has changed.
        
         | Hasu wrote:
         | > I always felt like war rooms are a place for some leader to
         | scream at you and not much else. The reality is that debugging
         | some retry storm, resource exhaustion or whatever won't happen
         | in a room with 18 people talking over one another.
         | 
         | I once walked out of a war room (at a much smaller company that
         | I wouldn't be at for much longer) that had devolved into
         | finger-pointing and blame games. Half an hour later, my boss
         | came out to find out what I was doing and I pointed at my
         | screen and said, "This. This is what's wrong. Ship my fix and
         | we're done here." The entire war room came to my desk to see
         | and discuss the fix, which we shipped, which solved the issue.
         | 
         | At my next job, I had to hold back laughter when the VP of
         | Engineering, who was pushing mob programming, said, "Think
         | about it. When we have an incident, when something is really
         | important, what do we do? We all get in a room together. No one
         | leaves the war room to go solve the incident on their own."
        
           | bloomingkales wrote:
           | _"This. This is what's wrong. Ship my fix and we're done
           | here."_
           | 
           | Lol. This is not hyperbole. Just about everyone has several
           | stories like this, and they are quite hilarious in their
           | utter absurdity. It's like these people get possessed by the
           | spirit of Gordon Gekko in that exact moment and must
           | absolutely play out the role to a tee. Then they become
           | unpossessed and go skiing on weekends.
        
             | aledalgrande wrote:
             | In my experience it's only (and exactly) leaders with neither
             | tech nor people skills that do this. Have experienced both
             | good and bad. A world of difference.
        
         | alabastervlog wrote:
         | Leaders are really into playing pretend and it seems to just
         | get worse the higher they are on the ladder.
         | 
         | Like, literally, an effective way to sell to them is to make
         | them feel like they're in a movie doing Super Important Things.
         | LOL. Executive Disneyland.
        
           | SpicyLemonZest wrote:
           | Much of a leader's job is to visibly perform leadership. It
           | seems silly until the first time you need something big from
           | a team whose managers do it poorly, and you realize that
           | they're incapable of making commitments or setting
           | priorities.
           | 
           | The expectation that leaders will play pretend about a "war"
           | and call everyone into a "war room" is just a part of what it
           | means for an organization to commit that consistent high
           | uptime is a top priority.
        
       | bossyTeacher wrote:
       | > I can't imagine doing that kind of multi-window parallel
       | investigation stuff on a teeny little laptop screen with people
       | right next to me on either side
       | 
       | This is it. Managers (I mean non technical folk) don't understand
       | this. They don't understand that putting people physically
       | together won't help you solve the issue faster. This is the same
       | mentality that believes that typing code faster or generating
       | more code is a good thing. The kind that believes that all
       | employees need to always be physically together for "good stuff"
       | to happen.
       | 
       | Sadly, they will never learn. Those managers and C-suite people
       | will never read Rachel's post or investigate if their RTO
       | policies are necessarily good for the business. These folks are
       | just reading numbers on a spreadsheet without fully understanding
       | what those numbers actually mean in their business.
       | 
       | Sadly, I don't see that ever changing because that mentality
       | provides a comforting worldview where the office gives you a sense of
       | control and having all your cows in the farm under your watchful
       | eye (or that of your trusty shepherds) feels so intuitive that
       | any alternatives are simply too uncomfortable to even think
       | about.
        
         | bloomingkales wrote:
         | There was a head of a department that once forced everyone to
         | uninstall iTunes because he believed it was reducing
         | productivity. Feels like a never-ending battle with these
         | types.
        
       | trollied wrote:
       | I used to be a Rachel-a-like in a past life. Really tight SLAs
       | (mobile network infra etc, so people have to be able to make
       | emergency calls, for example).
       | 
       | So many times I got bridged into a conference call whilst fixing
       | things & doing RCAs against tight SLAs, as non-technical people
       | didn't have any sort of idea that it was wasting my time. "I am
       | fixing it, I will send updates as per contractual agreements"
       | _puts phone down_.
       | 
       | On several occasions I got 2 emails after the fact - one praising
       | me for resolving quickly, another asking me to please be nicer to
       | executives. The calls stopped after 5 or 6 times.
       | 
       | Things have moved on these days, and it's much easier to
       | coordinate such events on Slack etc. Thankfully!
        
         | Nifty3929 wrote:
         | Please be empathetic to people who do not understand what is
         | going on, but who have tremendous responsibility to the
         | business. The business problem is always a superset of the
         | technology problem.
         | 
         | Yes, of course you aren't able to fix it while you are on the
         | phone with them. A conference call will not fix the code. They
         | know that too. But they also need meaningful information and
         | updates in order to do their jobs, which often requires them to
         | provide updates to others like important customers,
         | shareholders, the CEO, or even the government. They may also
         | need this in order to plan out other activities.
         | 
         | Providing useful information and frequent updates (not
         | "contractual" updates) to them with this in mind would go a
         | long way toward solving the whole business problem that is
         | created by the technical problem. It might also get them off
         | your back sooner, and with more respect for you.
         | 
         | There are two critical pieces of information that would help an
         | executive very much: Do we know what the problem is? and Do we
         | know what the solution is? A simple yes/no on both of those
         | would be a great start.
        
           | aqueueaqueue wrote:
           | Surely more than one person is working on the fix? If you
           | have a pair one can pop off and give updates to a third
           | technical person (maybe their manager or an incident
           | manager) who can liaise.
        
           | ameliaquining wrote:
           | Communicating that information to executives needs to be the
           | responsibility of someone who isn't currently heads-down
           | debugging. Google's SRE Book suggests creating a
           | "communications lead" role.
        
             | cratermoon wrote:
             | At one employer our site outage recovery runbook
             | specifically stated that one person was to be designated to
             | communicate status outside the tiger team and be a buffer
             | between panicky people across the company and the technical
             | folks fixing the problem.
        
       | CapricornNoble wrote:
       | I'm not familiar with the "War Room" in the context of computer
       | network operations specifically, but I have deep experience
       | running military operations centers and I'm reading this through
       | that lens.
       | 
       | >People figured out that yes, they had run the machines out of
       | memory, specifically with the push - the distribution of new
       | bytecode to the web servers. Other people started taking steps to
       | beat back some of the bloat that had been creeping in that
       | summer, so the memory situation wouldn't be so bad. I suspect
       | some others also dialed back the number of threads (simultaneous
       | requests) on the smaller web servers to keep them from running
       | quite as "hot".
       | 
       | Cross-functional information exchange. Who is coordinating or
       | directing all these disparate actions? Who is fusing the
       | knowledge gained from these actions? Who is disseminating a
       | clearer picture of "what really happened"? Who is using that
       | updated picture to frame new taskings for all the people doing
       | these independent investigations? The answer to all those
       | questions should be "the staff in the War Room", and the
       | leadership in the War Room in particular. My take-away is that
       | the author is arguing that their ability to pursue single-
       | function actions within their domain of expertise was optimized
       | in their work environment, and was degraded in the War Room. They
       | aren't wrong.
       | 
       | >I guess a "war room" might work out if you have a bunch of stuff
       | that has to happen to deal with a possible "crisis" and then it's
       | just a matter of coordinating it. You don't have people doing
       | "heads-down hack" stuff nearly as much in a case like that.
       | 
       | Exactly. Coordinating a bunch of stuff for crisis management =
       | put those people in the War Room. Focused heads-down tasks = put
       | those people where they can ...focus. Now that said....one thing
       | I've come to HATE about working in a military headquarters is
       | open offices for everyone who isn't the G-shop lead and his/her
       | deputy. Everyone else is shoved into a cubicle farm, probably
       | with ESPN blaring in the background on top of a half-dozen
       | conversations and people constantly dropping by your desk to BS
       | about cover sheets for TPS Reports. So even if you're NOT in the
       | War Room, you can't focus.
        
         | icegreentea2 wrote:
         | I feel like some things that consistently get in the way of
         | the clean separation between (crudely speaking) deciders and
         | doers, and keeping the doers out of the war room (so they can
         | work effectively) are:
         | 
         | Poor, or fear of poor communication. The "do-ers" become
         | compelled to be in the war room to try to mitigate
         | communication failures.
         | 
         | Unclear decision making processes and ownership. People with
         | high technical expertise (who would be top tier do-ers who
         | maybe should be kept out of the war room) are kept around because
         | their immediate feedback in the war room can significantly
         | shift the decision making process and decisions made.
         | 
         | I should be more specific - I believe there's often a desire
         | (and makes instinctive sense) to fall back to decision by
         | consensus. Once everyone understands that this is how these
         | things work, then obviously you want to pack the smartest, most
         | competent people in the room, either because you're playing
         | political games, and you want more "votes", or because truly
         | you believe that you need the best people in the room to guide
         | consensus.
         | 
         | These are structural and cultural (non-cynical) issues that
         | drive both doers and decision makers to -want- to keep smart,
         | competent doers in the room, even though separation -should-
         | lead to better outcomes.
        
       | coldcode wrote:
       | I've watched many war rooms at various employers.
       | 
       | At one (20 years ago), they met for six months to determine why
       | our field offices' network connection to the home office was so
       | pathetic and unusable. It was led by the head of networking.
       | After all those meetings, it was decided that all 1000
       | independent field offices should upgrade their internet to T1
       | connections. It didn't help. Another six months goes by, and I
       | hear from my connections in networking that the real problem was
       | the head of networking had installed a half-duplex low-speed
       | ethernet card: all 1000 offices' data had been going through a
       | pinhole. It was replaced, and suddenly everything was fine again,
       | other than the hole in the offices' pockets for an unnecessary
       | upgrade.
       | 
       | No one ever mentioned it publicly.
        
       | gherkinnn wrote:
       | War room. In the trenches. War stories. Pasty programmers and
       | plump PMs using such terminology is a bit silly.
       | 
       | Fixing a printer that sometimes does something unexpected is not
       | even a sailor's yarn, let alone a war story.
        
         | alabastervlog wrote:
         | "Telemetry" for "keylogging our fart app's users" because
         | everyone wishes they were doing something cool and/or
         | meaningful.
        
       | yuliyp wrote:
       | A war room / call is for coordination: for when you need the person
       | draining the bad region to know that "oh that other region is
       | bad, so we can't drain too fast" or "the metrics look better
       | now".
       | 
       | For truly understanding an incident? Hell no. That requires heads
       | down *focus*. I've made a habit of taking a few hours later in
       | the day after incidents to calmly go through and try to write a
       | correct timeline. It's amazing how broken people's perceptions of
       | what was actually happening at the time are (including my own!).
       | Being able to go through it calmly, alone provided tons of
       | insights.
       | 
       | It's this type of analysis that leads to proper understanding and
       | the right follow-ups rather than knee-jerk reactions.
        
         | DylanDmitri wrote:
         | I've been in some good coordinating calls for widespread
         | incidents. Many unique individuals (15+) talked in a ten minute
         | period, sharing context on what their teams were seeing, what
         | remediations had worked for them, etc.
        
         | master_crab wrote:
         | This. People keep commenting about it being performative.
         | That's orthogonal to its purpose. Even the original blogpost
         | points out the limitation of singular focused effort without
         | acknowledging it. It took the author weeks to figure out the
         | actual issue.
         | 
         | If FB had been down that long, they'd be out of business.
        
         | thwarted wrote:
         | It took me seven weeks (not full time, but from the initial
         | incident to the final publishing) to do the research and write
         | up for a recent event. This included in-person interviews, data
         | correlation, reading code, and revision control spelunking
         | across multiple repositories to understand the series of events
         | and decisions that led to the event, some of them months
         | earlier. Some people were advocating "get it out because we
         | have to move on", which I pushed back on. Once published, the
         | feedback was positive and some folks acknowledged that knee-
         | jerk follow-up reactions would have made things worse. But to
         | get to the point where the post-incident review is valuable
         | someone has to put in the actual work and time to make it so.
         | It should be a learning experience, not checking a box;
         | otherwise, we're just spinning our wheels without making any
         | progress.
        
       | hackpelican wrote:
       | In the places I've worked, a war room was always the place where
       | we cut the bleeding and revert the system to a working state.
       | Never was the RCA the intended outcome of a war room, though we'd
       | often reach the RCA in the silence of the meeting bridge while
       | something deployed/rolled back.
       | 
       | Root cause analysis is definitely not a group activity, it's best
       | done in a place where one can have complete focus.
       | 
       | However, cutting the bleeding requires plenty of communication,
       | weighing different options, having a higher-up sign off on a
       | tradeoff, getting our ops team to coordinate towards some common
       | goal, monitoring the recovery... etc.
        
         | sunshowers wrote:
         | So interestingly, I think root cause analysis can be a group
         | effort, but I think it has to be done on a remote call where
         | everyone is in front of a big monitor or two, and people can
         | take breaks and such. I've been part of teams that have done
         | root cause analysis over a call (sometimes many calls), and
         | it's been quite effective.
        
         | afro88 wrote:
         | IIRC, Facebook don't (or didn't) do rollbacks. They always fix
         | forward. I guess hours-long incidents like this are the other
         | edge of that double-edged sword.
        
           | claytonjy wrote:
           | Language can be tricky here. If I revert to an older commit,
           | literally rewriting history to remove newer, bad commits, I
           | think we'd all consider that a rollback. But if I instead add
           | a new commit which undoes the bad commits, is that a rollback
           | or a roll forward?
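
The distinction in the comment above can be pictured with a toy model. This is only a sketch: the history is a plain list of commit labels, and `rollback` and `revert` are hypothetical helper names that loosely mirror what `git reset --hard` and `git revert` do. It is not how git stores anything.

```python
from typing import List


def rollback(history: List[str], bad: str) -> List[str]:
    """Rewrite history: drop the bad commit and everything after it."""
    return history[: history.index(bad)]


def revert(history: List[str], bad: str) -> List[str]:
    """Keep history intact and append a new commit that undoes the bad one."""
    return history + [f"Revert {bad}"]


history = ["a", "b", "bad"]
rolled_back = rollback(history, "bad")   # the bad commit is gone from history
reverted = revert(history, "bad")        # the bad commit stays, plus an undo commit
```

In both cases the resulting code is the same as at commit "b". Only the second keeps a record that the bad change ever shipped, which may be why some teams call it rolling forward.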
        
       | steveBK123 wrote:
       | The purpose of the war room is not to solve the problem but to
       | perform the act of problem solving visibly for certain audiences.
        
       | dakiol wrote:
       | As a software engineer I generally can help little when a non-
       | trivial incident occurs whether it is via war rooms or deep
       | investigations. I do have some kind of access to some logs,
       | traces and metrics (datadog, for instance), but in the end only
       | the SREs or platform engineers are the ones who determine the root
       | cause of any incident because they have 100% observability.
        
       | willvarfar wrote:
       | The "war room" or "tiger team" or whatever it's called is often a
       | way to parachute in a handful of engineers that top management
       | trusts to sort out the mess made by the masses. Often crusty old-
       | timer engineers are kept around just to be called on in these
       | scenarios.
        
         | Nifty3929 wrote:
         | Yes, and this also gives the lie to ageism. I hear this from
         | older-than-me people fairly often, that the reason that they
         | can't get the job they want, or a promotion, or whatever, is
         | 'ageism.'
         | 
         | Meanwhile, I routinely see people older than me (I'm not young)
         | being hired, promoted and generally shown great respect -
         | because their years of experience have given them wisdom. They
         | also remember how things developed over time and have more
         | experience with details farther down the abstraction stack
         | because those abstractions weren't around when they cut their
         | teeth.
         | 
         | I aspire to be one of those grey beards in the not so distant
         | future. And I doubt my age will ever hold back my career, aside
         | from changes due to my personal choices (for retirement, fewer hours,
         | etc).
        
           | pbronez wrote:
           | Yes, but that expertise may not be easily transferable. Two
           | decades of experience with a firm is much more valuable to
           | that specific firm than anywhere else. If you leave that
           | place, you only have general lessons to apply elsewhere.
        
       | hedayet wrote:
       | Ex-Google SRE here with experience in multiple revenue-critical
       | war rooms. At Google, war rooms were particularly useful because
       | saying, "X is in a war room" (at least as late as 2017) gave X
       | the credibility to say no to everything else. Having technically
       | competent leaders made the experience enjoyable--because they
       | weren't just there to demand updates but actively contributed by
       | writing queries, and nudging the team in the right direction by
       | asking a series of the right questions.
       | 
       | My worst experience with crisis management was with one
       | particular team at another big tech company, where the leaders
       | were ignorant about the technology--completely clueless about the
       | service and its architecture. In cases like this, the issue
       | becomes a binary 0/1 problem: the service is either broken (0) or
       | running smoothly (1). When a leader lacks the technical knowledge
       | to grasp the intermediate steps, their only contribution is
       | yelling for updates--and that's exactly what they did.
       | 
       | Bottom line: War rooms can be a space for deep work with good
       | leadership (a combination of technical soundness and co-
       | ordination skills under pressure). But they can quickly turn into
       | hell when leadership lacks one of these two essential qualities--
       | and resorts to yelling to cover their asses.
        
       | cratermoon wrote:
       | "This fbagent process ran as root, ran a bunch of subprocesses,
       | called fork(), didn't handle a -1 return code, and then later
       | went to kill that "wayward child". Sending a signal (SIGKILL in
       | this case) to "pid -1" on Linux sends it to everything but init
       | and yourself. If you're root (yep) and not running in some kind
       | of PID namespace (yep to that too), that's pretty much the whole
       | world."
       | 
       | Key phrase "didn't handle a -1 return code".
       | 
       | Yuan, Ding, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu
       | Zhao, Yongle Zhang, Pranay U Jain, and Michael Stumm. "Simple
       | Testing Can Prevent Most Critical Failures." Proceedings of the
       | 11th Symposium on Operating Systems Design and Implementation
       | (OSDI), 2014, 17.
       | https://www.eecg.utoronto.ca/~yuan/papers/failure_analysis_o...
        
       | edflsafoiewq wrote:
       | > This fbagent process ran as root, ran a bunch of subprocesses,
       | called fork(), didn't handle a -1 return code, and then later
       | went to kill that "wayward child".
       | 
       | In-band error codes strike again.
        
         | pedrocr wrote:
         | This is a case of both in-band error codes and overloaded
         | meanings of inputs colliding. Modern languages make both things
         | much better but even in C the kill(2) interface seems much too
         | clever. It seems it could have easily been a couple of
         | different functions.
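
To make the overloading concrete, here is a minimal sketch of what kill(2) targets for each range of its pid argument under POSIX semantics, plus a guard that refuses anything but a single process. `describe_kill_target` and `kill_child` are hypothetical names chosen for illustration; they are not from the post or from fbagent. Note that Python's own `os.fork()` raises `OSError` on failure instead of returning -1, so the in-band error discussed here is a property of the C interface.

```python
import os
import signal


def describe_kill_target(pid: int) -> str:
    """Describe what kill(2) would signal for this pid argument (POSIX semantics)."""
    if pid > 0:
        return f"process {pid}"
    if pid == 0:
        return "every process in the caller's process group"
    if pid == -1:
        return "every process the caller may signal (except init)"
    return f"every process in process group {-pid}"


def kill_child(pid: int, sig: int = signal.SIGTERM) -> None:
    """Hypothetical guard: signal exactly one process, never a group or 'everything'."""
    if pid <= 0:
        # A failed fork() in C returns -1; passing that on to kill() is the bug described above.
        raise ValueError(
            f"refusing to signal pid {pid}: would target {describe_kill_target(pid)}"
        )
    os.kill(pid, sig)
```

With a guard like this, an unchecked -1 from a failed fork would surface as an exception instead of a signal to nearly every process on the machine.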
        
       | Simon_O_Rourke wrote:
       | Why do all these posts descend into the "I'm so awesome"
       | archetype? Describe the damned problem and how it was resolved,
       | and for goodness' sake stop trying to stroke that ego while you're
       | doing it.
        
       ___________________________________________________________________
       (page generated 2025-02-23 23:00 UTC)