[HN Gopher] Minesweeper automates root cause analysis as a first...
___________________________________________________________________
Minesweeper automates root cause analysis as a first-line defense
against bugs
Author : muglug
Score : 70 points
Date : 2021-02-09 19:38 UTC (3 hours ago)
(HTM) web link (engineering.fb.com)
(TXT) w3m dump (engineering.fb.com)
| nomy99 wrote:
| typo: Engineering not Enginering
| pronoiac wrote:
| I emailed the mods.
| stingraycharles wrote:
| While this is a great strategy for figuring out the cause of a
| bug, I'd argue that "root cause analysis" in engineering is
| typically a much more qualitative analysis, and more about high
| impact failures than mere bug reports.
|
| A more accurate title may be "automatic data collection and
| analysis for bug reports"; I'm also confident that Microsoft has
| been doing this exact same thing for at least a few decades.
| cat199 wrote:
| > and more about high impact failures than mere bug reports.
|
| It's often more about people problems than software problems..
| tehjoker wrote:
| I did like the idea that they recorded low memory conditions.
| That seems incredibly useful for debugging issues that occur in
| the wild. A natural next step would be checking GPU memory as
| well if they haven't already.
|
| Is it possible to measure overall system memory pressure in JS
| or is that sandboxed?
| puttycat wrote:
| I'm pretty confident that they use the same OS data for
| fingerprinting as well.
| schemescape wrote:
| This seems like a great system for isolating steps to reproduce a
| bug, but I'm not sure I would consider this "root cause
| analysis".
| Tarsul wrote:
| It appears to have nothing to do with the game.
| 1_2__4 wrote:
| It's been a disappointing trend that SRE execs have adopted
| magical ML thinking, believing they're just a few short years
| away from ditching all those annoying engineers and replacing
| them with a model. I am skeptical.
| qbasic_forever wrote:
| So any concrete stats on how this has helped shorten bug
| investigation time, improve quality of releases, etc? It looks
| like an interesting data-driven approach to bug finding but
| there's curiously no qualitative analysis of how it's actually
| working in practice. I'd be a little concerned that systems like
| this can fall into the background as a flurry of noise and
| process that doesn't actually improve the quality of the product.
| vladd wrote:
| Seems the article is confusing the "trigger" of an event with its
| "root cause".
|
| I like to give the example of the Concorde airplane crash [0] to
| exemplify the difference: the incident was _triggered_ by debris
| on the runway (which caused the tires to explode, igniting the
| fuel tank above). But _the root cause_ was the placement of fuel
| in proximity to inflatable tire materials.
|
| [0] https://en.wikipedia.org/wiki/Air_France_Flight_4590
| k1t wrote:
| Is there really a difference though?
|
| To me a "trigger" is the initial event that begins a sequence.
| Isn't that also a "root cause"?
|
| Since everything is connected to everything else, it seems like
| the point that you decide is the "root" is fairly arbitrary.
|
| It seems you could easily conclude that the root cause was that
| the runway wasn't cleaned/inspected often enough. Or that the
| departures were scheduled too close together, preventing such
| an inspection, etc..
|
| If anything I would say the _root cause_ was the piece of
| engine cowl falling from the preceding flight - since that
| seems to be the first thing that "went wrong" in the process.
| azinman2 wrote:
| If I said 'screw you', and that caused you to flip out and
| kill everyone around you, you couldn't say the root cause was
| me saying 'screw you'. The root cause might be childhood
| trauma, extreme emotional imbalance, irrational thoughts,
| etc. The statement 'screw you' was the trigger.
|
| Similarly here, a trigger might be uploading a photo to FB,
| but the root cause of an issue might be a bug in encoding
| JPGs.
| breischl wrote:
| One approach to this is the "Five Whys" approach, wherein
| you ask "why" five times, e.g.:
|
| Q1: Why did the plane explode?
|
| A1: The engine cowling fell into the engine.
|
| Q2: Why'd that happen?
|
| A2: The tire exploded and damaged it.
|
| etc etc.
|
| Obviously the number 5 is arbitrary and not always
| applicable; it's just a heuristic to get to something
| "root-ish" without getting to ridiculously distant things like
| "the laws of physics prevent two objects from inhabiting the
| same position in space-time".
|
| More generally, defining something as root vs. not is
| somewhat of a judgement call. Usually you try to find
| something that will prevent future problems of this sort and
| call it the root cause. Ideally something that your
| organization can mitigate with a reasonable time/cost.
|
| Note that the actual mitigation is a separate question. If
| runway debris is the root cause, then one mitigation is
| reworking the fuel system. Another would be using tougher
| tires. Perhaps another would be adding a shield between the
| tires and the aircraft body. Another might be an automated
| runway monitoring system that detects debris. etc.
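|
| The "keep asking why" loop described above can be sketched in a
| few lines of Python. This is only an illustration under stated
| assumptions: `CAUSES` and `five_whys` are hypothetical names, and
| the causal links are paraphrased from examples in this thread,
| not taken from an accident report or from the Minesweeper article.

```python
from typing import Dict, List

# Hypothetical causal table: each event maps to the answer to
# "why did that happen?". Links are paraphrased from this thread.
CAUSES: Dict[str, str] = {
    "crash": "fuel tank ignited",
    "fuel tank ignited": "tire exploded",
    "tire exploded": "debris on runway",
    "debris on runway": "engine cowl fell from preceding flight",
}


def five_whys(event: str, causes: Dict[str, str], max_depth: int = 5) -> List[str]:
    """Follow "why?" links from `event`.

    Stops after `max_depth` answers, when no further cause is recorded,
    or when a cause repeats (to avoid looping forever on a cycle).
    """
    chain = [event]
    current = event
    for _ in range(max_depth):
        cause = causes.get(current)
        if cause is None or cause in chain:
            break
        chain.append(cause)
        current = cause
    return chain


if __name__ == "__main__":
    for depth, step in enumerate(five_whys("crash", CAUSES)):
        print(f"{'  ' * depth}{step}")
```

| Where the walk stops depends entirely on `max_depth` and on how
| far back the table happens to go, which mirrors the point that
| picking the "root" is a judgement call.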
| dathinab wrote:
| Debris on the runway is something to be expected in rare
| cases; given how many flights there are a day, it's just a
| matter of time until it happens.
|
| As such, the problem is an airplane designed in a way that
| makes it too prone to fatal accidents in certain "rare but
| guaranteed to happen at some point" situations.
|
| But in the end, whether you say both are triggers that together
| led to the catastrophe, or that one is a trigger and the other
| is the root cause, is indeed irrelevant.
|
| The problem is that if you do what I will call trigger
| analysis but refer to it as root cause analysis, treating it
| as if it gives you _the_ root cause, it can very easily lead
| to situations where you fix one of the problems but not all,
| and potentially not even the biggest problem.
|
| I.e. you make it slightly less likely that there is debris on
| the runway but you don't fix the problem of the airplane
| being too prone to certain kinds of catastrophic failure.
| kryogen1c wrote:
| > Is there really a difference though?
|
| yes.
|
| > To me a "trigger" is the initial event that begins a
| sequence. Isn't that also a "root cause"?
|
| no. root causes are irreducible, hence the word "root".
|
| if someone is endlessly trained for an event and then fails
| at game time, it's probably a root cause. people make mistakes
| that cannot be avoided (this is why defense in depth is a
| thing)
|
| if the training program is a 5 second sentence before the
| event, the person's mistake is not the root cause; it's the
| training program.
| spockz wrote:
| To me the placement of the tanks or lack of protection would
| be the root cause, because this failure could have been
| triggered by any other debris as well. So the root cause is
| the thing that, if you fix it, fixes the problem at a
| fundamental level.
___________________________________________________________________
(page generated 2021-02-09 23:00 UTC)