[HN Gopher] New method identifies the root causes of statistical...
___________________________________________________________________
New method identifies the root causes of statistical outliers
Author : jonbaer
Score : 82 points
Date : 2022-07-19 16:39 UTC (6 hours ago)
(HTM) web link (www.amazon.science)
(TXT) w3m dump (www.amazon.science)
| ubbr wrote:
| Why focus on outliers and not all quantitative causal
| relationships in the graph like "tell me what causes X"?
| kqr wrote:
| Because there's nothing in particular that causes what's
| happening in the central part of the distribution. The
| centrally located outcomes are the result of combining a myriad
| of small factors this way and that way. Outliers usually have
| an identifiable, single cause.
|
 | This is why in statistical process control these two kinds of
 | variation are known as "common-cause variation" and
 | "assignable-cause variation".
| jazzyjackson wrote:
| I think it would be very useful to look at an outlier and know
| if it's a legitimate measurement that should be kept in the
| dataset or an aberration that can be safely removed from the
| dataset.
|
| I am not a statistician, so I don't know under what
| circumstances outliers are usually thrown out.
| kqr wrote:
| > I am not a statistician, so I don't know under what
| circumstances outliers are usually thrown out.
|
| As an industrial statistician, I can tell you: way too often.
|
| Outliers are the signal among the noise. They indicate
| _something_. It is nearly always worth finding out what,
| instead of removing them. If they indicate a flaw with
| measurement or the process, then fix that flaw and re-do the
| measurement or re-run the process. Outlier gone! But in a
| much more informative way.
| mcswell wrote:
| I have absolutely no knowledge of this stuff (in particular
| of your industry), but I would have thought that most
| outliers indicate either a measurement error (which you
| suggest) or a cause that hasn't been taken into account.
| For the latter, I'm thinking for example about countries
| whose life expectancy is completely out of line with their
 | per capita income, or some such. Do unexpected causes like
 | these show up as outliers in your field?
| tpoacher wrote:
| Is it me or has this paper gone to great lengths to effectively
| rediscover the humble (but oft misunderstood) p-value?
| uoaei wrote:
| This "new" method seems to be an old method, indistinguishable
| from Pearl's do-calculus.
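 |
 | For anyone who hasn't seen do-calculus: here is a minimal sketch
 | of an interventional query P(Y | do(X)), estimated by simulating
 | a structural causal model. The model and coefficients are
 | invented for illustration; this is not the paper's method.
 |
 |     import numpy as np
 |
 |     rng = np.random.default_rng(0)
 |     n = 100_000
 |
 |     def sample(do_x=None):
 |         # SCM: Z -> X, Z -> Y, X -> Y; Z confounds X and Y.
 |         z = rng.normal(0, 1, n)
 |         if do_x is None:
 |             x = 0.8 * z + rng.normal(0, 1, n)  # observational
 |         else:
 |             x = np.full(n, do_x)               # do(X = do_x)
 |         y = 2.0 * x + 1.5 * z + rng.normal(0, 1, n)
 |         return x, y
 |
 |     x_obs, y_obs = sample()
 |     print(np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs))  # ~2.7,
 |                                                        # biased by Z
 |     _, y0 = sample(do_x=0.0)
 |     _, y1 = sample(do_x=1.0)
 |     print(y1.mean() - y0.mean())  # ~2.0, the true causal effect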
|
| What is the innovation here, aside from a new software library?
| The quantification of each candidate root cause's influence on
| the outcome? I am surprised the authors found nothing similar
| throughout the entire corpus of academic research.
|
| I must say, Amazon's "science" blog is the most unimpressive of
| the big tech companies. It churns out PR like the rest, but the
| others at least have some substance behind them.
| tpoacher wrote:
 | It's ironic that they say they could find no good definition for
 | the term, but then go on to cite the very people who use that
 | exact term to describe the exact same thing:
| https://www.semanticscholar.org/paper/Causality%2C-Responsib...
| svnt wrote:
| It would be more ironic if they made a mistake like this in a
| way that wasn't quite so ego-reinforcing.
| clircle wrote:
| Very bold of them to assume that outliers even exist. Most
| statisticians would say that there's not really a difference
| between an outlier and a non-outlier, but actually the data just
| have fatter tails than you originally thought, and you need to
| use a robust modeling method to account for that.
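 |
 | A sketch of that view (mine, with invented numbers): data drawn
 | from a Student-t looks full of "outliers" under a Gaussian fit,
 | yet a heavier-tailed model fits the same data comfortably.
 |
 |     import numpy as np
 |     from scipy import stats
 |
 |     x = stats.t.rvs(df=3, size=5000, random_state=0)
 |
 |     # Naive z-scores flag many points; a Gaussian predicts
 |     # essentially none beyond 5 sigma in 5,000 samples.
 |     z = (x - x.mean()) / x.std()
 |     print((np.abs(z) > 5).sum())  # typically a dozen or more
 |
 |     # A Student-t fit recovers the heavy tails instead.
 |     df, loc, scale = stats.t.fit(x)
 |     print(df)  # close to 3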
| civilized wrote:
| > Most statisticians would say that there's not really a
| difference between an outlier and a non-outlier
|
| Definitely untrue. Suppose we throw 1,000 baseballs, shoot one
| rocket (45 degree angle), record the distance each one flew,
| and present the 1,001 data points to someone for analysis. Most
| people, statisticians or not, will recognize the rocket's
| distance as an outlier, even if they don't know the cause of
| it. And they'll be right to do so.
|
| It's true that there is no _universal_ definition of outlier.
| But outliers are very clearly defined in some contexts.
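 |
 | A sketch of the example (my numbers are invented): 1,000 throws
 | plus one rocket, with the rocket flagged by a robust
 | (median/MAD) z-score.
 |
 |     import numpy as np
 |
 |     rng = np.random.default_rng(0)
 |     throws = rng.normal(30, 8, 1000).clip(min=0)  # metres
 |     data = np.append(throws, 40_000.0)            # the rocket
 |
 |     med = np.median(data)
 |     mad = np.median(np.abs(data - med))
 |     robust_z = 0.6745 * (data - med) / mad
 |     print(np.where(np.abs(robust_z) > 10)[0])     # [1000]: rocket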
| timy2shoes wrote:
| > Suppose we throw 1,000 baseballs, shoot one rocket (45
| degree angle), record the distance each one flew, and present
| the 1,001 data points to someone for analysis.
|
| But I would argue that's not really a case of an outlier,
| that's a mixture distribution because the underlying
| identifying feature (baseball or rocket) is missing.
| civilized wrote:
| You've almost grasped the point. Mixture distributions with
| a missing underlying feature are _how_ outliers (in the
| meaningful, well-defined sense) are generated. And if you
| dig into this paper (I have not yet), I can almost
| guarantee you that this is foundational to the paper.
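 |
 | A sketch of that mechanism (mine, not from the paper): generate
 | a mixture, hide the component label, and recover it with
 | scikit-learn's GaussianMixture.
 |
 |     import numpy as np
 |     from sklearn.mixture import GaussianMixture
 |
 |     rng = np.random.default_rng(0)
 |     x = np.concatenate([rng.normal(0, 1, 950),   # bulk component
 |                         rng.normal(8, 1, 50)])   # rare component
 |
 |     gm = GaussianMixture(n_components=2, random_state=0)
 |     labels = gm.fit_predict(x.reshape(-1, 1))
 |     print(gm.means_.ravel())    # roughly [0, 8] (order may vary)
 |     print(np.bincount(labels))  # roughly [950, 50]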
| WastingMyTime89 wrote:
| By definition outliers are just points which differ
| significantly from the others. Obviously if your distribution
| has heavy tails they are likely to be there. They could also be
 | noise or simply errors. The only thing you know before analysing
 | them is that they are different, which is why they are outliers.
| whatshisface wrote:
| If you carefully read the article, what they're doing is
| classifying points by how unlikely they'd be according to
| their model of their system, and they are using that to find
| inaccuracies in their model.
|
| Without a model there is no difference between outliers and
| "regular" data points, because a data point will always match
| a distribution made from itself.
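 |
 | A minimal sketch of that reading (my own illustration): score
 | each point by its tail probability under a model fitted to the
 | data, then flag the very unlikely ones. As tpoacher notes above,
 | this is essentially a p-value.
 |
 |     import numpy as np
 |     from scipy import stats
 |
 |     rng = np.random.default_rng(0)
 |     x = np.append(rng.normal(10, 2, 500), 25.0)  # one planted
 |                                                  # anomaly
 |
 |     mu, sigma = x.mean(), x.std(ddof=1)
 |     # Two-sided tail probability under the fitted Gaussian.
 |     p = 2 * stats.norm.sf(np.abs(x - mu) / sigma)
 |     print(np.where(p < 1e-6)[0])  # [500], the planted point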
| mhh__ wrote:
| Sometimes there's a horse rubbing against your air pressure
| sensor though.
| aquafox wrote:
 | That's not true. There are a variety of reasons why outliers can
 | exist, for example data-quality issues where an experiment failed
 | or the measurement method did not work. The approach presented
 | here has, in my opinion, one big caveat: it tries to learn the
 | cause of outliers from the data itself instead of incorporating
 | domain knowledge! Acquiring and using domain knowledge is time
 | consuming and difficult, but nothing can replace it. One cannot
 | automate away the need to critically explore and understand the
 | data, although machine learning folks think they can.
| magicalhippo wrote:
 | Let's say you time some code and get a nice Gaussian at around
 | 1 ms, but with a smaller peak far to the right (i.e. +20 ms or
 | so). The second peak is because your thread got scheduled out.
 | You don't care about those samples; they don't matter.
|
| Can't you just say those are outliers and not consider them?
| Ideally you'd build some bimodal model, but given that they
| don't matter, what's wrong with throwing away those samples?
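 |
 | A sketch of exactly that (cutoff and timings invented): simulate
 | the two modes and drop the delayed samples with a crude threshold
 | between the peaks.
 |
 |     import numpy as np
 |
 |     rng = np.random.default_rng(0)
 |     n = 10_000
 |     t = rng.normal(1.0, 0.05, n)        # ~1 ms of real work
 |     delayed = rng.random(n) < 0.002     # rare reschedules
 |     t += np.where(delayed, 20.0, 0.0)   # +20 ms when scheduled out
 |
 |     kept = t[t < 5.0]                   # cutoff between the modes
 |     print(t.mean(), kept.mean())        # ~1.04 ms vs ~1.00 ms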
___________________________________________________________________
(page generated 2022-07-19 23:01 UTC)