[HN Gopher] New method identifies the root causes of statistical...
       ___________________________________________________________________
        
       New method identifies the root causes of statistical outliers
        
       Author : jonbaer
       Score  : 82 points
       Date   : 2022-07-19 16:39 UTC (6 hours ago)
        
 (HTM) web link (www.amazon.science)
 (TXT) w3m dump (www.amazon.science)
        
       | ubbr wrote:
       | Why focus on outliers and not all quantitative causal
       | relationships in the graph like "tell me what causes X"?
        
         | kqr wrote:
         | Because there's nothing in particular that causes what's
         | happening in the central part of the distribution. The
         | centrally located outcomes are the result of combining a myriad
         | of small factors this way and that way. Outliers usually have
         | an identifiable, single cause.
         | 
         | This is why, in statistical process control, these two types
         | of outcome are known as "common-cause variation" and
         | "assignable-cause variation", respectively.
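
A minimal sketch of the common-cause / assignable-cause split described above, assuming a simplified control chart with limits at the baseline mean plus or minus three standard deviations. The function names and the sample readings are hypothetical, and production control charts often estimate the spread differently (for example from moving ranges).

```python
import statistics

def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits: baseline mean +/- k sample standard deviations."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return center - k * spread, center + k * spread

def assignable_cause_candidates(baseline, new_points, k=3.0):
    """Points outside the limits are candidates for an assignable cause."""
    lower, upper = control_limits(baseline, k)
    return [x for x in new_points if x < lower or x > upper]

# Hypothetical readings from a process that is assumed to be in control.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
# New readings to classify; only values outside the limits get flagged.
new_points = [10.0, 10.3, 12.5, 9.7]

flagged = assignable_cause_candidates(baseline, new_points)
print(flagged)
```

Points inside the limits are treated as common-cause variation and left alone; a flagged point is a prompt to go looking for a specific cause, not proof that one exists.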
        
         | jazzyjackson wrote:
         | I think it would be very useful to look at an outlier and know
         | if it's a legitimate measurement that should be kept in the
         | dataset or an aberration that can be safely removed from the
         | dataset.
         | 
         | I am not a statistician, so I don't know under what
         | circumstances outliers are usually thrown out.
        
           | kqr wrote:
           | > I am not a statistician, so I don't know under what
           | circumstances outliers are usually thrown out.
           | 
           | As an industrial statistician, I can tell you: way too often.
           | 
           | Outliers are the signal among the noise. They indicate
           | _something_. It is nearly always worth finding out what,
           | instead of removing them. If they indicate a flaw with
           | measurement or the process, then fix that flaw and re-do the
           | measurement or re-run the process. Outlier gone! But in a
           | much more informative way.
        
             | mcswell wrote:
             | I have absolutely no knowledge of this stuff (in particular
             | of your industry), but I would have thought that most
             | outliers indicate either a measurement error (which you
             | suggest) or a cause that hasn't been taken into account.
             | For the latter, I'm thinking for example about countries
             | whose life expectancy is completely out of line with their
             | per capita income, or some such. Do unexpected causes
             | appear to cause outliers in your field?
        
       | tpoacher wrote:
       | Is it me or has this paper gone to great lengths to effectively
       | rediscover the humble (but oft misunderstood) p-value?
        
       | uoaei wrote:
       | This "new" method seems to be an old method, indistinguishable
       | from Pearl's do-calculus.
       | 
       | What is the innovation here, aside from a new software library?
       | The quantification of each candidate root cause's influence on
       | the outcome? I am surprised the authors found nothing similar
       | throughout the entire corpus of academic research.
       | 
       | I must say, Amazon's "science" blog is the most unimpressive of
       | the big tech companies. It churns out PR like the rest, but the
       | others at least have some substance behind them.
        
         | tpoacher wrote:
         | It's ironic that they say they find no good definition for the
         | term, but then go on to cite the same people that use those
         | exact terms to describe the exact same thing:
         | https://www.semanticscholar.org/paper/Causality%2C-Responsib...
        
           | svnt wrote:
           | It would be more ironic if they made a mistake like this in a
           | way that wasn't quite so ego-reinforcing.
        
       | clircle wrote:
       | Very bold of them to assume that outliers even exist. Most
       | statisticians would say that there's not really a difference
       | between an outlier and a non-outlier, but actually the data just
       | have fatter tails than you originally thought, and you need to
       | use a robust modeling method to account for that.
        
         | civilized wrote:
         | > Most statisticians would say that there's not really a
         | difference between an outlier and a non-outlier
         | 
         | Definitely untrue. Suppose we throw 1,000 baseballs, shoot one
         | rocket (45 degree angle), record the distance each one flew,
         | and present the 1,001 data points to someone for analysis. Most
         | people, statisticians or not, will recognize the rocket's
         | distance as an outlier, even if they don't know the cause of
         | it. And they'll be right to do so.
         | 
         | It's true that there is no _universal_ definition of outlier.
         | But outliers are very clearly defined in some contexts.
        
           | timy2shoes wrote:
           | > Suppose we throw 1,000 baseballs, shoot one rocket (45
           | degree angle), record the distance each one flew, and present
           | the 1,001 data points to someone for analysis.
           | 
           | But I would argue that's not really a case of an outlier,
           | that's a mixture distribution because the underlying
           | identifying feature (baseball or rocket) is missing.
        
             | civilized wrote:
             | You've almost grasped the point. Mixture distributions with
             | a missing underlying feature are _how_ outliers (in the
             | meaningful, well-defined sense) are generated. And if you
             | dig into this paper (I have not yet), I can almost
             | guarantee you that this is foundational to the paper.
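
A small simulation of the baseball-and-rocket example, as a sketch only. The distances are made-up numbers (throws around 40 m, one rocket at 5000 m), and the median/MAD "modified z-score" with a 3.5 cutoff is one common convention, not the method from the linked paper. The label (baseball or rocket) is deliberately dropped, so the analyst sees only a mixture.

```python
import random
import statistics

def modified_z_scores(values):
    """Median/MAD-based z-scores; 0.6745 rescales the MAD to match a normal sigma."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [0.6745 * (v - med) / mad for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed distances in metres: 1,000 throws plus one rocket, labels discarded.
distances = [rng.gauss(40.0, 5.0) for _ in range(1000)] + [5000.0]

scores = modified_z_scores(distances)
flagged = [d for d, z in zip(distances, scores) if abs(z) > 3.5]
print(len(flagged), max(flagged))
```

Because the median and MAD barely move when one extreme value is added, the rocket stands out clearly, whereas a mean and standard deviation computed from the same 1,001 points would be distorted by the rocket itself.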
        
         | WastingMyTime89 wrote:
         | By definition, outliers are just points that differ
         | significantly from the others. Obviously, if your distribution
         | has heavy tails, they are likely to be there. They could also be
         | noise or simply errors. The only thing you know before
         | analysing is that they are different, which is why they are
         | outliers.
        
           | whatshisface wrote:
           | If you carefully read the article, what they're doing is
           | classifying points by how unlikely they'd be according to
           | their model of their system, and they are using that to find
           | inaccuracies in their model.
           | 
           | Without a model there is no difference between outliers and
           | "regular" data points, because a data point will always match
           | a distribution made from itself.
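
One way to make "how unlikely a point is under the model" concrete, as a sketch: assume the model is a plain Gaussian with known mean and standard deviation (an assumption made here for illustration, not something taken from the article), compute the two-sided tail probability of an observation, and report its negative log as a "surprise" score. The function names are hypothetical.

```python
import math

def gaussian_tail_probability(x, mu, sigma):
    """P(|X - mu| >= |x - mu|) for X ~ Normal(mu, sigma)."""
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))

def surprise_bits(x, mu, sigma):
    """Negative log2 of the tail probability; larger means more surprising."""
    return -math.log2(gaussian_tail_probability(x, mu, sigma))

# Assumed model: mean 10, standard deviation 1.
for x in (10.0, 12.0, 15.0):
    p = gaussian_tail_probability(x, 10.0, 1.0)
    print(x, p, surprise_bits(x, 10.0, 1.0))
```

The score depends entirely on the chosen model: change the mean, the spread, or the distribution family and the same observation can go from unremarkable to extreme, which is the point made above about there being no outliers without a model.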
        
         | mhh__ wrote:
         | Sometimes there's a horse rubbing against your air pressure
         | sensor though.
        
         | aquafox wrote:
         | That's not true. There's a variety of reasons why outliers can
         | exist, for example data quality issues where an experiment
         | failed or the method to measure the data did not work. In my
         | opinion, the approach presented there has one big caveat: it
         | tries to learn the cause of outliers from the data itself
         | instead of incorporating domain knowledge! Acquiring and using
         | domain knowledge is time-consuming and difficult, but nothing
         | can replace it. One cannot automate away the need to critically
         | explore and understand the data, although machine learning
         | folks think they can.
        
         | magicalhippo wrote:
         | Let's say you time some code and get a nice Gaussian at around
         | 1 ms, but with a smaller peak far to the right (i.e. +20 ms or
         | so). The second peak is because your thread got scheduled. You
         | don't care about those samples; they don't matter.
         | 
         | Can't you just say those are outliers and not consider them?
         | Ideally you'd build some bimodal model, but given that they
         | don't matter, what's wrong with throwing away those samples?
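
A sketch of the timing scenario above, using made-up samples and a hypothetical 10 ms cutoff between the two peaks. It compares dropping the slow samples with simply reporting a robust statistic (the median) over all of them.

```python
import statistics

# Hypothetical timings in milliseconds: a cluster near 1 ms plus two slow runs.
samples_ms = [1.00, 1.02, 0.98, 1.01, 0.99, 21.3, 1.03, 0.97, 20.8, 1.00]

cutoff_ms = 10.0  # assumed threshold separating the two peaks
fast = [t for t in samples_ms if t < cutoff_ms]
slow = [t for t in samples_ms if t >= cutoff_ms]

mean_all = statistics.mean(samples_ms)      # pulled up by the slow runs
median_all = statistics.median(samples_ms)  # barely affected by them
mean_fast = statistics.mean(fast)           # mean after dropping slow runs

print(mean_all, median_all, mean_fast, len(slow))
```

With these numbers the median over all samples lands close to the mean of the fast cluster, so a robust summary can give a similar answer without discarding data, and the count of slow samples stays available in case it turns out to matter.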
        
       ___________________________________________________________________
       (page generated 2022-07-19 23:01 UTC)