[HN Gopher] Root cause is for plants, not software
       ___________________________________________________________________
        
       Root cause is for plants, not software
        
       Author : kiyanwang
       Score  : 29 points
       Date   : 2021-11-22 11:37 UTC (1 day ago)
        
 (HTM) web link (www.verica.io)
 (TXT) w3m dump (www.verica.io)
        
       | monocasa wrote:
       | RCA is a tool. It can be everything the article says it is. It
       | can also be incredibly useful by not being what the article says
       | it has to be.
       | 
       | For instance I've heard there's at the very least an FAA
       | institutional push against plane crash RCAs simply saying "pilot
       | error". Pilots are known to make errors (we all do), so the FAA
       | wants to know how the error wasn't accounted for in the greater
       | process and resulted in a plane falling out of the sky. If you
       | use RCA not for blame, or don't call it quits after finding a
       | "smoking gun", it's a fantastic tool for enumerating single
       | points of failure that you can prove actually failed. There are
       | normally multiple things you can do better, if multiple people
       | care about what comes out of an RCA.
        
       | mjb wrote:
       | There's some good stuff here, but it also feels like it's a deep
       | analysis of a straw man rather than of real industry practice.
       | 
       | > At its core, RCA posits that an incident has a single, specific
       | cause or trigger, without which the incident wouldn't have
       | happened
       | 
       | Does it?
       | 
       | None of the teams I've worked with that practice root-cause
       | analysis, or use the term 'root cause' assume that incidents have
       | a single specific cause or trigger. In fact, most teams seem very
       | comfortable exploring complex trees of causality, no matter what
       | they call their post-mortem process.
       | 
       | > None of these alternatives are fixed as easily as reverting the
       | one line. None of them suggest that they stem from the same
       | 'root.' And yet, all of these alternatives are more fundamental
       | to resilience than the one line of a configuration file.
       | 
       | "In the beginning the Universe was created. This has made a lot
       | of people very angry and been widely regarded as a bad move."
       | (Douglas Adams)
       | 
       | There has to be a logical stopping point to investigations,
       | because at some point they just become metaphysics. Even getting
       | to the point of "we don't make enough revenue to fix this" isn't
       | really helpful, because everybody knows that already. Instead,
       | and somewhat crucially, choosing which actions to take after an
       | event needs to be done in context of the constraints on the
       | business. Sometimes those constraints are _a_ problem, but it may
       | be possible to improve the situation even within those
       | constraints. Throwing our hands up and saying that we can't make
       | improvements because of the constraints isn't really that
       | helpful.
       | 
       | Dekker is spot-on that it's a construction, but that's the point,
       | not the problem.
       | 
       | > Our focus on language in reports is informed by research into
       | the relationship between language and how people perceive and/or
       | assign blame.
       | 
       | I just don't believe this. Bad organizations are going to assign
       | blame no matter what language they use, better organizations
       | won't. The problem is the culture, not the language, and it
       | really doesn't seem to me that changing the language will change
       | the culture. Instead, it'll just lead to double-speak.
       | 
       | They even say this:
       | 
       | > RCA is often a path that ends squarely at a person's feet.
       | "Human error" is a quick and easy scapegoat for all kinds of
       | incidents and accidents, and its deceptive simplicity as a root
       | cause is an inherent part of its larger-scale harmful effects.
       | It's comforting to frame an incident as someone straying from
       | well-established rules, policies, or guidelines; simply provide
       | more training, and more guardrails and checklists in the future!
       | 
       | then this:
       | 
       | > we can say now that there's only a small amount of incidents
       | (less than a percent) that directly call out human or operator
       | error as a "root cause."
       | 
       | Which just don't seem consistent.
        
         | [deleted]
        
         | xg15 wrote:
         | I agree. This seems to conflate immediate direct causes with a
         | lack of mitigations or robustness.
         | 
         | By that logic, I'm missing "questionable dietary habits of the
         | lead engineer" and "failure of the building architects to
         | create a more calming office environment" as causes in the
         | article.
        
           | afarrell wrote:
           | "Every problem at Toyota can be blamed on Matthew Perry if
           | you keep asking 'Why?' enough times."
        
         | Spooky23 wrote:
         | The author strikes me as someone focused on something other
         | than incident and problem management. The answers depend on
         | what you're looking at.
         | 
         | The cause of the _incident_ used in the example is the
         | configuration change. Resolution is rollback.
         | 
         | When you look at problem management, the root causes of the
         | _problem_ are likely process issues that need to be addressed.
         | (Or not)
         | 
         | Think of a non-tech example to help understand. Look at a
         | picture of an early superhighway from the 1950s like the New
         | York State Thruway, New Jersey Turnpike, etc. Notice that there
         | are no guardrails or they are made of wood.
         | 
         | When a 1950s driver fell asleep and ran off the road, flipping
         | and getting ejected from the car, he died. The cause of the
         | incident was him falling asleep and losing control.
         | 
         | But for the highway engineer, many problems are introduced by
         | this incident. The shoulder could be graded to avoid rollover.
         | Appropriate barriers could prevent the rollover. Rumble strips
         | wake up the driver. For the mechanical engineer at a car
         | company, seatbelts prevent ejection, safety glass prevents
         | evisceration, crumple zones prevent trauma.
         | 
         | None of those things prevent the guy from falling asleep in
         | that incident. But they address the greater problem of highway
         | fatality.
        
         | wahern wrote:
         | > There has to be a logical stopping point to investigations,
         | because at some point they just become metaphysics.
         | 
         | The law understands this point very well, which is why it has
         | developed concepts like Proximate cause:
         | https://en.wikipedia.org/wiki/Proximate_cause
         | 
         | Rules of evidence also arose to prevent investigations and
         | analyses from getting lost in the weeds. See
         | https://en.wikipedia.org/wiki/Evidence_(law) Though they're
         | highly tuned to witness testimony and similar mushy issues--
         | intent, etc.--so they don't translate well. But whenever you hear
         | about a court tossing out evidence, or doing something
         | seemingly inane like (in a recent high-profile case) preventing
         | a juror from using a smartphone to "zoom in" to a piece of
         | evidence, it's precisely this role that is being served (or
         | attempted)--preventing scope creep and the endless
         | bickering that will inevitably result. The rabbit hole is
         | bottomless, and whatever stopping points you choose are often
         | arbitrary--they especially seem so to outsiders.
        
       | scrubs wrote:
       | I do not agree with the thrust of this article. Some of the
       | issues raised only arise because of the sloppy or incorrect
       | identification of root cause by software people, which then falls
       | into a definitional game of whack-a-mole. Not helpful at all.
       | 
       | Now, the author rightly argues that coming up with an RCA isn't
       | easy. In particular it tends to follow a twin inverted-V shape:
       | 
       |   \_/   <--- starting wide: various technical/operational
       |              issues considered
       | 
       |    +    <--- narrowing to the root-"ish" technical cause
       | 
       |   /-\   <--- widening back into organizational issues as to
       |              why an agent of change could do, or did do,
       |              something
       | 
       | In addition, unlike manufacturing, there is no six-sigma
       | discipline riding on top of a domain of work governed by natural
       | science, so arriving at a quantitatively convincing argument for
       | a root cause is generally not possible.
       | 
       | Even the book on TQM in 6-sigma work:
       | 
       | https://www.amazon.com/What-Total-Quality-Control-Japanese/d...
       | 
       | points out that technical issues often dissolve into
       | organizational issues. See pages 57-58 and elsewhere.
       | 
       | So where does that leave one? Dealing with the many contributing
       | factors in an organization that impinge on decisions which may
       | lead to bugs/outages/defects:
       | 
       | - Eliminate defect opportunities. Simplify.
       | 
       | - Inputs to the next process should be controlled (in
       | manufacturing parlance they are X-sigma quality). It's harder for
       | me to screw something up if what I start with conforms to its
       | requirements aligned ultimately with customer satisfaction.
       | 
       | - Quality is everybody's problem. See again pages 57-58.
       | Sectionalism is a major impediment to enterprise-wide improvement.
       | 
       | - The ultimate aim is continuous improvement, of which RCA
       | (fishbone diagrams and the rest) is but one tool. The salient
       | business question is: OK, a client outage occurred. Not OK, but
       | does it repeat?
       | 
       | Root cause analysis? Plants are doing just fine. Human
       | organizations can be helped by root cause analysis. That's just
       | how it is.
        
       | lmilcin wrote:
       | The findings from the Salesforce example:
       | 
       |   1. A lack of automation with safeguards for DNS changes
       |   2. Insufficient guardrails to enforce the Change Management
       |      process
       |   3. Subversion of the Emergency Break Fix (EBF) process
       | 
       | It looks a lot like somebody tried to push their agenda (personal
       | or organizational) and wrote points that suit their preferred
       | solution, rather than trying to understand the actual causes of
       | the outage.
       | 
       | These findings are essentially "Somebody made a mistake, let's
       | remove accesses and/or put more controls on the process so that
       | people are not even able to make a mistake".
       | 
       | This seems to be a very low-trust environment. My experience
       | tells me the most likely course of action is that even if the
       | problem is fixed, it is going to happen at the cost of more
       | overhead in the process, making it even more low-trust and
       | causing more damage in the long run.
       | 
       | You can apply this kind of flawed reasoning ("somebody made a
       | mistake -- nobody can be trusted with it again") to any problem,
       | and soon nobody can do anything on their own. Productivity
       | plummets. People become uninterested in their work (try to stay
       | "engaged" when management takes all your tools away).
       | Management blames people for not being able to do even basic
       | things -- losing even more trust and pushing even more solutions
       | like that. A vicious cycle.
       | 
       | Take for example point #3.
       | 
       | How about figuring out _WHY_ people need, or feel they need, to
       | subvert the EBF process? Isn't it rational to assume that if
       | people have been subverting the process they might have an actual
       | reason to do so? Maybe their regular process is too onerous and
       | they are using the emergency process to meet deadlines? Maybe the
       | effort should be directed at improving the regular process so
       | that they don't have reason to subvert it?
        
         | kevin_thibedeau wrote:
         | This isn't low trust but rather poka-yoke mistake proofing. If
         | you have a fallible process you either replace it with an
         | infallible one or put in controls to mitigate the risk. When
         | you effectively run a factory for configuring software you have
         | an interest in not letting a single production worker shut down
         | the whole operation.
        
           | lmilcin wrote:
           | That is exactly a low-trust environment.
           | 
           | Trust requires that you accept that you can be hurt by the
           | person you trust. If you work to remove that ability you are
           | saying you do not trust that other person.
           | 
           | Now, I am not saying that you should design processes so that
           | employees can fail them.
           | 
           | What I am saying is that when somebody is a developer and the
           | company says it is ok to "burn" their time and effort because
           | they do not trust the developer to make the right decision,
           | it sucks and it makes it very difficult for that person to
           | care about doing a good job.
           | 
           | There are usually different ways to prevent things from
           | failing that do not require the organization to manifest that
           | it does not trust their employees. And there are ways for the
           | organization to show they trust their employees to make up
           | for the cases when these types of solutions can't be
           | implemented.
           | 
           | I worked at Samsung. In the office where I worked, there was
           | a ban on having knives in the kitchen, even cutlery that is
           | basically unusable for cutting anything more substantial than
           | overcooked potatoes. I heard the same for all other offices.
           | Apparently, two guys half a world away had a knife fight in
           | the kitchen, and so it was decided that no employee was to be
           | trusted handling cutlery. The funny thing is, people started
           | bringing their own, really sharp knives. So... knives are
           | still there, but employees have one more reminder that they
           | are not being treated as adults.
           | 
           | I have worked at a large number of companies, from small to
           | very, very large ones. They all vary a lot in the level of
           | permission any single employee has to do things.
           | 
           | But one thing I noticed is that companies where employees
           | have very little permission to do anything do not have fewer
           | process failures.
           | 
           | There are many reasons for this. Now that you have made your
           | employees impotent, you need a lot more of them to do the
           | same work. Or maybe you need to shift the responsibilities
           | from real thinking people to automation that can also make
           | mistakes, but with a much larger impact. People who are
           | constantly reminded they are not trusted start behaving that
           | way. They tend to lower standards and stop caring.
           | 
           | Taking away somebody's permission to do something can be done
           | well, but it usually requires much more care and thought than
           | what typically happens at large corporations.
        
             | afarrell wrote:
             | When I do carpentry, I make and use wooden jigs when I
             | identify situations where I know my hands will not be
             | steady enough. This is good poka-yoke.
             | 
             | If someone forced me to use their jigs, tailor-made to
             | their workflow (however scientifically managed), that would
             | be a low-trust environment.
        
               | lmilcin wrote:
               | I understand this. The difference is that when somebody
               | implements RCA findings, this usually results in
               | mandatory overhead. Not a wooden jig that you can decide
               | to skip if it does not suit the next item you are working
               | on.
        
         | blacksmith_tb wrote:
         | I see that a lot in RCAs - it's generally easy to rush out
         | _how_ something went wrong, but it can be much more subtle
         | _why_ it did; but since heads will roll if it isn't sent within
         | 24 hours, we get useless recaps.
        
           | [deleted]
        
       | a-dub wrote:
       | i always thought that finding "the root cause" was simply a
       | declaration by those investigating that they're either satisfied
       | with what they've learned or are bored of investigating.
       | 
       | at the end of the day, only a few things matter. which component
       | is actually broken? (it's not always obvious, and finding it lets
       | the immediate situation be resolved expeditiously.) what was the
       | event that triggered it, such that the process or tooling around
       | it might be improved? where might investment be made to prevent
       | the same or similar from happening in the future?
       | 
       | that's basically it.
        
       | annoyingnoob wrote:
       | In my opinion, its called root cause _analysis_ and not a root
       | cause _pointer_. No matter what you call it, its a good idea to
       | understand the events that lead to an incident. We cannot improve
       | if we do not know where /how we failed.
        
       | thrower123 wrote:
       | Root cause analysis is almost always just blame-shifting and
       | trying to find somebody to fall on the sword and accept
       | culpability, and often as not, pay for the downtime/missed
       | SLAs/etc...
       | 
       | Maybe there are some enlightened organizations out there that
       | treat it as a learning opportunity, but I've never seen it from
       | any of the Fortune 500s that demand an RCA from me when something
       | derps up on their end.
        
         | scrubs wrote:
         | I find this outlook deplorable and a significant contributing
         | factor to why companies can't do better.
         | 
         | Read: https://www.amazon.com/What-Total-Quality-Control-
         | Japanese/d...
         | 
         | and do better. You have choice. Use it.
        
       | bob1029 wrote:
       | RCA is a complex path that can be explored in a variety of ways.
       | 
       | The 5 whys analysis is one of the better ways to go about RCA in
       | my experience, especially if everyone can be mature adults and
       | amend it as appropriate to the circumstances at hand. For
       | example, you might only need to ask "why" 3 times before it
       | becomes overly reductive. Sometimes 15. Also, you might find that
       | the answer to each "why" is a collection of things, each seeding
       | its own new tree of exploration.
       | 
       | I have witnessed some hilariously-deep RCA pools. If you ever
       | work in systems engineering for a semiconductor manufacturer, you
       | will see some of the most insane shit. Like tracing a series of
       | customer device failures back to the exact human contractor who
       | brought a naughty tool into an inappropriate area of the facility
       | for a brief duration, causing a chain of events 100+ deep,
       | ultimately resulting in elevated defects in all batches of wafers
       | that were run through a specific port on a specific tool.
       | 
       | In this context, I think the C is plural. Very few complex issues
       | are attributable to exactly 1 logical thing.
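       The branching "5 whys" described in the comment above can be
       sketched as a small cause tree. This is a minimal, hypothetical
       illustration (all type names and cause descriptions are invented,
       not from the thread): each answer to "why?" may itself be a
       collection of contributing causes, so the analysis forms a tree
       rather than a single chain ending at one root.

```python
# Minimal sketch of a branching "5 whys": each answer to "why?" may
# itself be a collection of contributing causes, forming a tree rather
# than a single chain. All descriptions here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Cause:
    description: str
    whys: list["Cause"] = field(default_factory=list)  # answers to "why?"

def deepest_causes(node: Cause) -> list[str]:
    """Collect the causes reached when the team stopped asking 'why'."""
    if not node.whys:
        return [node.description]
    found: list[str] = []
    for child in node.whys:
        found.extend(deepest_causes(child))
    return found

incident = Cause("customer-visible outage", [
    Cause("bad config change shipped", [
        Cause("no automated validation of config changes"),
        Cause("emergency process used to skip review", [
            Cause("regular change process too slow for deadlines"),
        ]),
    ]),
])

print(deepest_causes(incident))
# ['no automated validation of config changes',
#  'regular change process too slow for deadlines']
```

       Note that the traversal yields several "deepest" causes, not one:
       the stopping points are where the investigators chose to stop, not
       a single root.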
        
       ___________________________________________________________________
       (page generated 2021-11-23 23:01 UTC)