[HN Gopher] Root cause is for plants, not software
___________________________________________________________________
Root cause is for plants, not software
Author : kiyanwang
Score : 29 points
Date : 2021-11-22 11:37 UTC (1 days ago)
(HTM) web link (www.verica.io)
(TXT) w3m dump (www.verica.io)
| monocasa wrote:
| RCA is a tool. It can be everything the article says it is. It
| can also be incredibly useful by not being what the article says
| it has to be.
|
| For instance I've heard there's at the very least an FAA
| institutional push against plane crash RCAs simply saying "pilot
| error". Pilots are known to make errors (we all do), so the FAA
| wants to know how the error wasn't accounted for in the greater
| process and resulted in a plane falling out of the sky. If you
| use RCA not for blame, or don't call it quits after finding a
| "smoking gun", it's a fantastic tool for enumerating single
| points of failure that you can prove actually failed. There are
| normally multiple things you can do better if multiple people
| care about what comes out of an RCA.
| mjb wrote:
| There's some good stuff here, but it also feels like it's a deep
| analysis of a straw man rather than of real industry practice.
|
| > At its core, RCA posits that an incident has a single, specific
| cause or trigger, without which the incident wouldn't have
| happened
|
| Does it?
|
| None of the teams I've worked with that practice root-cause
| analysis, or use the term 'root cause' assume that incidents have
| a single specific cause or trigger. In fact, most teams seem very
| comfortable exploring complex trees of causality, no matter what
| they call their post-mortem process.
|
| > None of these alternatives are fixed as easily as reverting the
| one line. None of them suggest that they stem from the same
| 'root.' And yet, all of these alternatives are more fundamental
| to resilience than the one line of a configuration file.
|
| "In the beginning the Universe was created. This has made a lot
| of people very angry and been widely regarded as a bad move."
| (Douglas Adams)
|
| There has to be a logical stopping point to investigations,
| because at some point they just become metaphysics. Even getting
| to the point of "we don't make enough revenue to fix this" isn't
| really helpful, because everybody knows that already. Instead,
| and somewhat crucially, choosing which actions to take after an
| event needs to be done in context of the constraints on the
| business. Sometimes those constraints are _a_ problem, but it may
| be possible to improve the situation even within those
| constraints. Throwing our hands up and saying that we can't make
| improvements because of the constraints isn't really that
| helpful.
|
| Dekker is spot-on that it's a construction, but that's the point,
| not the problem.
|
| > Our focus on language in reports is informed by research into
| the relationship between language and how people perceive and/or
| assign blame.
|
| I just don't believe this. Bad organizations are going to assign
| blame no matter what language they use, better organizations
| won't. The problem is the culture, not the language, and it
| really doesn't seem to me that changing the language will change
| the culture. Instead, it'll just lead to double-speak.
|
| They even say this:
|
| > RCA is often a path that ends squarely at a person's feet.
| "Human error" is a quick and easy scapegoat for all kinds of
| incidents and accidents, and its deceptive simplicity as a root
| cause is an inherent part of its larger-scale harmful effects.
| It's comforting to frame an incident as someone straying from
| well-established rules, policies, or guidelines; simply provide
| more training, and more guardrails and checklists in the future!
|
| then this:
|
| > we can say now that there's only a small amount of incidents
| (less than a percent) that directly call out human or operator
| error as a "root cause."
|
| These two statements just don't seem consistent.
| [deleted]
| xg15 wrote:
| I agree. This seems to conflate immediate direct causes and
| lack of mitigations or robustness.
|
| By that logic, I'm missing "questionable dietary habits of the
| lead engineer" and "failure of the building architects to
| create a more calming office environment" as causes in the
| article.
| afarrell wrote:
| "Every problem at Toyota can be blamed on Matthew Perry if
| you keep asking 'Why?' enough times."
| Spooky23 wrote:
| The author strikes me as someone focused on something other
| than incident and problem management. The answers depend on
| what you're looking at.
|
| The cause of the _incident_ used in the example is the
| configuration change. Resolution is rollback.
|
| When you look at problem management, the root causes of the
| _problem_ are likely process issues that need to be addressed.
| (Or not)
|
| Think of a non tech example to help understand. Look at a
| picture of an early superhighway from the 1950s like the New
| York State Thruway, New Jersey Turnpike, etc. Notice that there
| are no guardrails or they are made of wood.
|
| When a 1950s driver fell asleep and ran off the road, flipping
| and getting ejected from the car, he died. The cause of the
| incident was him falling asleep and losing control.
|
| But for the highway engineer, many problems are introduced by
| this incident. The shoulder could be graded to avoid rollover.
| Appropriate barriers could prevent the rollover. Rumble strips
| wake up the driver. For the mechanical engineer at a car
| company, seatbelts prevent ejection, safety glass prevents
| evisceration, crumple zones prevent trauma.
|
| None of those things prevent the guy from falling asleep in
| that incident. But they address the greater problem of highway
| fatality.
| wahern wrote:
| > There has to be a logical stopping point to investigations,
| because at some point they just become metaphysics.
|
| The law understands this point very well, which is why it has
| developed concepts like Proximate cause:
| https://en.wikipedia.org/wiki/Proximate_cause
|
| Rules of evidence also arose to prevent investigations and
| analyses from getting lost in the weeds. See
| https://en.wikipedia.org/wiki/Evidence_(law) Though they're
| highly tuned to witness testimony and similar mushy issues--
| intent, etc--so don't translate well. But whenever you hear
| about a court tossing out evidence, or doing something
| seemingly inane like (in a recent high-profile case) preventing
| a juror from using a smartphone to "zoom in" to a piece of
| evidence, it's precisely this role that is being (or attempted
| to be) primarily served--preventing scope creep and the endless
| bickering that will inevitably result. The rabbit hole is
| bottomless, and whatever stopping points you choose are often
| arbitrary--they especially seem so to outsiders.
| scrubs wrote:
| I do not agree with the thrust of this article. Some of the
| issues raised only arise because of the sloppy or incorrect
| identification of root cause by software people, which then falls
| into a definitional whack-a-mole-game. Not helpful at all.
|
| Now, the author rightly argues that coming up with RCA isn't
| easy. In particular it tends to follow a twin inverted V shape:
|
|   \_/  <--- starting wide: various technical/operation issues
|             considered
|    +   <--- narrowing to the root"ist" technical cause
|   /-\  <--- widening back into organization issues as to why
|             an agent of change could do or did do something
|
| In addition, unlike manufacturing, there is no sense of 6-sigma
| riding on top of a domain of work governed by natural science so
| that arriving at a quantitatively convincing argument of root
| cause is generally not possible.
|
| Even the book on TQM in 6-sigma work:
|
| https://www.amazon.com/What-Total-Quality-Control-Japanese/d...
|
| points out that technical issues often dissolve into
| organizational issues. See pages 57-58 and elsewhere.
|
| So where does that leave one? Dealing with the many contributing
| factors in an organization that impinge on decisions which may
| lead to bugs/outages/defects:
|
| - Eliminate opportunities for defects. Simplify.
|
| - Inputs to the next process should be controlled (in
| manufacturing parlance they are X-sigma quality). It's harder for
| me to screw something up if what I start with conforms to its
| requirements aligned ultimately with customer satisfaction.
|
| - Quality is everybody's problem. See again pp. 57-58. Sectionalism
| is a major impediment to enterprise-wide improvement.
|
| - the ultimate aim is continuous improvement of which RCA (fish
| bone diagrams and the rest) are but tools. The salient business
| question is: ok, a client outage occurred. Not OK, but does it
| repeat?
|
| Root cause analysis? Plants are doing just fine. Human
| organizations can be helped by root cause analysis. That's just
| how it is.
| lmilcin wrote:
| The findings from the Salesforce example:
|
| 1. A lack of automation with safeguards for DNS changes
| 2. Insufficient guardrails to enforce the Change Management process
| 3. Subversion of the Emergency Break Fix (EBF) process
|
| It looks a lot like somebody tried to push their agenda (personal
| or organizational) and write points that suit their preferred
| solution rather than try to understand actual causes of the
| outage.
|
| These findings are essentially "Somebody made a mistake, let's
| remove accesses and/or put more controls on the process so that
| people are not even able to make a mistake".
|
| This seems to be a very low-trust environment. My experience tells
| me the most likely course of action is that even if the problem
| is fixed, it is going to happen at the cost of more overhead in
| the process and making it even more low-trust, causing more
| damage in the long run.
|
| You can apply this kind of flawed reasoning ("somebody made a
| mistake -- nobody can be trusted with it again") to any problem
| and soon nobody can do anything on their own. Productivity
| plummets. People are becoming disinterested in their work (try to
| be "engaged" when management takes all your tools away).
| Management blames people for not being able to do even basic
| things -- losing even more trust and pushing even more solutions
| like that. Vicious cycle.
|
| Take for example point #3.
|
| How about figuring out _WHY_ people need or feel they need to
| subvert the EBF process? Isn't it rational to assume that if people
| have been subverting the process they might have actual reason to
| do so? Maybe their regular process is too onerous and they are
| using emergency process to meet deadlines? Maybe the effort
| should be directed at improving the regular process so that they
| don't have reason to subvert it?
| kevin_thibedeau wrote:
| This isn't low trust but rather poka-yoke mistake proofing. If
| you have a fallible process you either replace it with an
| infallible one or put in controls to mitigate the risk. When
| you effectively run a factory for configuring software you have
| an interest in not letting a single production worker shut down
| the whole operation.
| lmilcin wrote:
| That is exactly a low-trust environment.
|
| Trust requires that you accept that you can be hurt by the
| person you trust. If you work to remove that ability you are
| saying you do not trust that other person.
|
| Now, I am not saying that you should design processes so that
| employees can fail them.
|
| What I am saying is that when somebody is a developer and the
| company says it is ok to "burn" their time and effort because
| they do not trust the developer to make a right decision, it
| sucks and it makes it very difficult for that person to care
| to do a good job.
|
| There are usually different ways to prevent things from
| failing that do not require the organization to manifest that
| it does not trust their employees. And there are ways for the
| organization to show they trust their employees to make up
| for the cases when these types of solutions can't be
| implemented.
|
| I worked at Samsung. In the office where I worked, there is a ban
| on having knives in the kitchen, even the cutlery that is
| basically unusable to cut anything more substantial than
| overcooked potatoes. I heard the same for all other offices.
| Apparently, two guys half a world away had a knife fight in
| the kitchen and so it was decided every single employee is
| not to be trusted handling cutlery. The funny thing is,
| people started bringing their own, really sharp knives. So...
| knives are still there but employees have one more reminder
| that they are not being treated as adults.
|
| I worked in a large number of companies, from small to very,
| very large ones. They all vary a lot in the level of
| permission any single employee has to do things.
|
| But one thing I noticed is that companies where employees
| have very little permission to do anything do not have fewer
| process failures.
|
| There are many reasons for this. Now that you made your
| employees impotent, you need a lot more of them to do the
| same work. Or maybe you need to shift the responsibilities
| from real thinking people to automation that can also make
| mistakes but with a much larger impact. People that are
| constantly reminded they are not trusted start behaving this
| way. They tend to lower standards and stop caring.
|
| Taking away somebody's permission to do something can be done well
| but it usually requires much more care and thought than what
| typically happens at large corporations.
| afarrell wrote:
| When I do carpentry, I make and use wooden jigs when I
| identify situations where I know my hands will not be
| steady enough. This is good poka-yoke.
|
| If someone forced me to use their jigs taylor-made to their
| workflow (however scientifically managed), that would be a
| low-trust environment.
| lmilcin wrote:
| I understand this. The difference is that when somebody
| implements RCA findings this usually results in mandatory
| overhead. Not a wooden jig that you can decide to skip if
| it does not suit your next item you are working on.
| blacksmith_tb wrote:
| I see that a lot in RCAs - it's generally easy to rush out
| _how_ something went wrong, but it can be much more subtle
| _why_ it did; but since heads will roll if it isn't sent in
| 24hr, we get useless recaps.
| [deleted]
| a-dub wrote:
| i always thought that finding "the root cause" was simply a
| declaration by those investigating that they're either satisfied
| with what they've learned or are bored of investigating.
|
| at the end of the day, only a few things matter. which component
| is actually broken? since it's not always obvious, so that the
| immediate situation may be resolved expeditiously. what was the
| event that triggered it such that process or tooling around it
| might be improved? where might investment be made to prevent the
| same or similar happening in future?
|
| that's basically it.
| annoyingnoob wrote:
| In my opinion, it's called root cause _analysis_ and not a root
| cause _pointer_. No matter what you call it, it's a good idea to
| understand the events that led to an incident. We cannot improve
| if we do not know where/how we failed.
| thrower123 wrote:
| Root cause analysis is almost always just blame-shifting and
| trying to find somebody to fall on the sword and accept
| culpability, and often as not, pay for the downtime/missed
| SLAs/etc...
|
| Maybe there are some enlightened organizations out there that
| treat it as a learning opportunity, but I've never seen it from
| any of the Fortune 500s that demand an RCA from me when something
| derps up on their end.
| scrubs wrote:
| I find this outlook deplorable and a significant contributing
| factor to why companies can't do better.
|
| Read: https://www.amazon.com/What-Total-Quality-Control-Japanese/d...
|
| and do better. You have choice. Use it.
| bob1029 wrote:
| RCA is a complex path that can be explored in a variety of ways.
|
| The 5 whys analysis is one of the better ways to go about RCA in
| my experience. Especially if everyone can be mature adults and
| amend it as appropriate to the circumstances at hand. For
| example, you might only need to ask "why" 3 times before it
| becomes overly-reductive. Sometimes 15. Also, you might find that
| the answer to each "why" is a collection of things, each seeking
| their own new tree of exploration.
|
| I have witnessed some hilariously-deep RCA pools. If you ever
| work in systems engineering for a semiconductor manufacturer, you
| will see some of the most insane shit. Like tracing a series of
| customer device failures back to the exact human contractor who
| brought a naughty tool into an inappropriate area of the facility
| for a brief duration, causing a chain of events 100+ deep,
| ultimately resulting in elevated defects in all batches of wafers
| that were run through a specific port on a specific tool.
|
| In this context, I think the C is plural. Very few complex issues
| are attributable to exactly 1 logical thing.
___________________________________________________________________
(page generated 2021-11-23 23:01 UTC)