[HN Gopher] Statisticians use a technique that leverages randomn...
       ___________________________________________________________________
        
       Statisticians use a technique that leverages randomness to deal
       with the unknown
        
       Author : Duximo
       Score  : 134 points
       Date   : 2024-10-03 07:08 UTC (3 days ago)
        
 (HTM) web link (www.quantamagazine.org)
 (TXT) w3m dump (www.quantamagazine.org)
        
       | xiaodai wrote:
        | I don't know. I find Quanta articles very high-noise. They're
        | always hyping something.
        
         | jll29 wrote:
         | I don't find the language of the article full of "hype"; they
         | describe the history of different forms of imputation from
         | single to multiple to ML-based.
         | 
          | The table is particularly useful as it describes what the
          | article is all about in a way that can stick in students'
          | minds. I'm very grateful to QuantaMagazine for its popular
          | science reporting.
        
           | billfruit wrote:
           | The Quanta articles usually have a gossipy style and are very
           | low information density.
        
             | SAI_Peregrinus wrote:
              | They're usually more science history than science. Who did
              | what, when, and a basic overview of why it's important.
        
         | vouaobrasil wrote:
          | I agree with that. I skip the Quanta Magazine articles, mainly
          | because the titles seem to be a little too hyped for my taste
          | and don't represent the content as well as they should.
        
           | amelius wrote:
           | Yes, typically a short conversation with an LLM gives me more
           | info and understanding of a topic than reading a Quanta
           | article.
        
         | MiddleMan5 wrote:
         | Curious, what sites would you recommend?
        
       | light_hue_1 wrote:
       | I wish they actually engaged with this issue instead of writing a
       | fluff piece. There are plenty of problems with multiple
       | imputation.
       | 
        | Not the least of which is that it's far too easy to do the
        | equivalent of p-hacking and get your data to be significant by
        | playing games with how you do the imputation. Garbage in, garbage
        | out.
       | 
        | I think all of these methods should be abolished from the
        | curriculum entirely. When I review papers in ML/AI, I
        | automatically reject any paper or dataset that uses imputation.
       | 
        | This is all a consequence of the terrible statistics used in most
        | fields. Bayesian methods don't need to do this.
        
         | jll29 wrote:
          | There are plenty of legitimate articles that discuss/survey
          | imputation in ML/AI:
         | https://scholar.google.com/scholar?hl=de&as_sdt=0%2C5&q=%22m...
        
           | light_hue_1 wrote:
           | The prestigious journal "Artificial intelligence in
           | medicine"? No. Just because it's on Google scholar doesn't
           | mean it's worth anything. These are almost all trash. On the
           | first page there's one maybe legit paper in an ok venue as
           | far as ML is concerned (KDD; an adjacent field to ML) that's
           | 30 years old.
           | 
           | No. AI/ML folks don't do imputation on our datasets. I cannot
            | think of a single major dataset in vision, NLP, or robotics
           | that does so. Despite missing data being a huge issue in
            | those fields. It's an antiquated method for an antiquated
            | idea of how statistics should work that is doing far more
            | damage than good.
        
             | disgruntledphd2 wrote:
              | OK, that's interesting. I profoundly disagree with your
              | tone, but would really like to hear what you regard as
              | good approaches to the problem of missing data
              | (particularly where you have dropout from a study or
              | experiment).
        
               | nyrikki wrote:
                | Perhaps looking into the issues with uncongeniality and
                | multiple imputation may help, although I haven't looked
                | at MI for a long time, so consider my reply an attempt
                | to be helpful rather than authoritative.
               | 
                | Another related intuition for a probable footgun
                | relates to learning linearly inseparable functions like
                | XOR, which require MLPs.
               | 
               | A single missing value in an XOR situation is far more
               | challenging than participant dropouts causing missing
               | data.
               | 
                | Specifically, the problem is counterintuitively non-
                | convex, with multiple possibilities for convergence and
                | no information in the corpus to know which may be true.
               | 
                | That is a useful lens in my mind, where I think of the
                | manifold being pushed down in opposite sectors, as in
                | the kernel trick.
               | 
               | Another potential lens to think about it is that in
               | medical studies the assumption is that there is a smooth
               | and continuous function, while in learning, we are trying
               | to find a smooth continuous function with minimal loss.
               | 
                | We can't assume that the function we need to learn is
                | smooth, but autograd specifically limits what is
                | learnable, and simplicity bias, especially with feed-
                | forward networks, is an additional concern.
               | 
                | One thing that is common for people to conflate is the
                | assumption that a continuous function is probably
                | smooth and differentiable.
               | 
                | But the set of continuous functions that are
                | differentiable _anywhere_ is a meagre set.
               | 
               | Like anything in math and logic, the assumptions you can
               | make will influence what methods work.
               | 
                | As ML is existential quantification, and because it is
                | insanely good at finding efficient glitches in the
                | matrix, within the limits of my admittedly limited
                | knowledge MI would need to be a very targeted solution,
                | with a lot of care to keep set shattering from causing
                | uncongeniality, especially in the unsupervised context.
               | 
                | Hopefully someone else can provide better, more
                | productive insights.
        
         | DAGdug wrote:
         | Maybe in academia, where sketchy incentives rule. In industry,
         | p-hacking is great till you're eventually caught for doing
         | nonsense that isn't driving real impact (still, the lead time
         | is enough to mint money).
        
           | light_hue_1 wrote:
           | Very doubtful. There are plenty of drugs that get approved
           | and are of questionable value. Plenty of procedures that turn
           | out to be not useful. The incentives in industry are even
           | worse because everything depends on lying with data if you
           | can do it.
        
             | hggigg wrote:
              | Indeed. Even worse, some entire academic fields are built
              | on pillars of lies. I was married to a researcher in one
              | of them. Anything that compromises the existence of the
              | field just gets written off. The end game is that this
              | fed into life-changing healthcare decisions, so one
              | should never assume academia is harmless. Watching it
              | from the perspective of a mathematician was utterly
              | painful.
        
             | nerdponx wrote:
             | I assume by "in industry" they meant in jobs where you are
             | doing data analysis to support decisions that your employer
             | is making. This would be any typical "data scientist" job
             | nowadays. There the consequences of BSing are felt by the
             | entity that pays you, and will eventually come back around
             | to you.
             | 
             | The incentives in medicine are more similar to those in
             | academia, where your job is to cook up data that convinces
             | _someone else_ of your results, with highly imbalanced
             | incentives that reward fraud.
        
               | DAGdug wrote:
               | Yes, precisely this! I've seen more than a few people
               | fired for generating BS analyses that didn't help their
               | employer, especially in tech where scrutiny is immense
               | when things start to fail.
        
         | fn-mote wrote:
         | Clearly you know your stuff. Are there any not-super-technical
         | references where an argument against using imputation is
         | clearly explained?
        
         | parpfish wrote:
         | I feel like multiple imputation is fine when you have data
         | missing at random.
         | 
          | The problem is that data is _never_ actually missing at random,
          | and there's always some sort of interesting variable that
          | confounds which pieces are missing.
        
           | underbiding wrote:
            | True, true, but how do you account for missing data based on
            | variables you care about and those you don't?
           | 
           | More specifically, how do you determine if the pattern you
           | seem to be identifying is actually related to the phenomenon
           | being measured and not an error in the measurement tools
           | themselves?
           | 
            | For example, a significant portion of answers to "Yes / No:
            | have you ever been assaulted?" are blank. This could be (A)
            | respondents who were assaulted are more likely to leave it
            | blank out of shame, or (B) someone handling the spreadsheet
            | accidentally dropped some rows in the data (because let's be
            | serious here, it's all spreadsheets and emails...).
           | 
            | While you could say that (B) should be theoretically "more
            | truly random", we can't assume that there isn't a pattern to
            | the way those rows were dropped (i.e. a pattern imposed by
            | some algorithm that bugged out and dropped those rows).
        
             | Xcelerate wrote:
             | > how do you determine if the pattern you seem to be
             | identifying is actually related to the phenomenon being
             | measured and not an error in the measurement tools
             | themselves?
             | 
             | If the "which data is missing" information can be used be
             | to compress the data that isn't missing further than it can
             | be compressed be alone, then the missing data is missing at
             | least in part due to the phenomenon being measured.
             | Otherwise, it's not.
             | 
              | We're basically just asking if K(non-missing data | which
              | data is missing) < K(non-missing data). This is
              | uncomputable, so it doesn't actually answer your question
              | regarding "how to determine", but it does provide a
              | necessary and sufficient theoretical criterion.
             | 
             | A decent practical approximation might be to see if you can
             | develop a model that predicts the non-missing data better
             | when augmented with the "which information is missing"
             | information than via self-prediction. That could be an
             | interesting research project...
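              | 
              | Something like this toy sketch, say (the data, the MNAR
              | mechanism, and the linear model are all made up purely
              | for illustration):
              | 
              |   import numpy as np
              |   from sklearn.linear_model import LinearRegression
              |   from sklearn.model_selection import cross_val_score
              |   
              |   rng = np.random.default_rng(1)
              |   n = 500
              |   X = rng.normal(size=(n, 3))
              |   y = X @ [1.0, -2.0, 0.5] + rng.normal(size=n)
              |   # MNAR-ish: column 0 is dropped more often
              |   # when y is large
              |   miss = rng.random(n) < 1 / (1 + np.exp(-y))
              |   
              |   # predict y from the complete columns, with
              |   # vs. without the missingness indicator
              |   base = X[:, 1:]
              |   aug = np.column_stack([base, miss])
              |   m = LinearRegression()
              |   s0 = cross_val_score(m, base, y, cv=5).mean()
              |   s1 = cross_val_score(m, aug, y, cv=5).mean()
              |   # a clear gap suggests the missingness itself
              |   # carries signal about the phenomenon
              |   print(s0, s1)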
        
               | parpfish wrote:
                | There's already a bunch of stats research on this
                | problem. Some useful terms to look up are MCAR (missing
                | completely at random), MAR (missing at random), and
                | MNAR (missing not at random).
        
         | aabaker99 wrote:
         | My intuition would be that there are certain conditions under
         | which Bayesian inference for the missing data and multiple
         | imputation lead to the same results.
         | 
         | What is the distinction?
         | 
          | The scenario described in the article could be represented in
          | a Bayesian method or not. "For a given missing value in one
          | copy, randomly assign a guess from your distribution." Here
          | "my distribution" could be Bayesian or not, but either way
          | it's still up to the statistician to make good choices about
          | the model. The Bayesian can p-hack here all the same.
        
       | clircle wrote:
       | Does any living statistician come close to the level of Donald
       | Rubin in terms of research impact? Missing data analysis, causal
        | inference, EM algorithm, and probably more. He just walks around
       | creating new subfields.
        
         | selimthegrim wrote:
         | Efron?
        
           | richrichie wrote:
           | & Tibshirani
        
             | selimthegrim wrote:
             | Stein too
        
         | aquafox wrote:
         | Andrew Gelman?
        
           | nabla9 wrote:
            | Gelman has contributed to Bayesianism and hierarchical
            | models, and Stan is great, but that's not even close to what
            | Rubin has done.
           | 
           | ps. Gelman was Rubin's doctoral student.
        
         | selectionbias wrote:
         | Also approximate Bayesian computation, principal
         | stratification, and the Bayesian Bootstrap.
        
         | j7ake wrote:
         | Mike Jordan, Tibshirani, Emmanuel Candes
        
       | paulpauper wrote:
       | why not use regression on the existing entries to infer what the
       | missing ones should be?
        
         | ivan_ah wrote:
          | That would push things towards the mean... not necessarily a
          | bad thing, but presumably later steps of the analysis will be
          | pooling/averaging data together anyway, so it's not that
          | useful.
         | 
          | A more interesting approach, let's call it OPTION2, would be to
          | sample from the predictive distribution of a regression
          | (regression mean + noise), which would result in more
          | variability in the imputations, though it's random, so it might
          | not be what you want.
         | 
          | The multiple imputation approach seems to be a resampling
          | method of obtaining OPTION2, without the need to assume a
          | linear regression model.
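          | 
          | A quick sketch of the contrast, with made-up data and a plain
          | least-squares fit:
          | 
          |   import numpy as np
          |   
          |   rng = np.random.default_rng(2)
          |   x = rng.normal(size=200)            # fully observed
          |   y = 2.0 * x + rng.normal(scale=0.5, size=200)
          |   miss = rng.random(200) < 0.3        # y missing here
          |   
          |   # fit y ~ x on the complete cases
          |   A = np.column_stack([np.ones(200), x])
          |   coef, res, *_ = np.linalg.lstsq(
          |       A[~miss], y[~miss], rcond=None)
          |   sigma = np.sqrt(res[0] / ((~miss).sum() - 2))
          |   
          |   mu = A[miss] @ coef
          |   y_mean = mu                    # plain regression
          |   y_draw = mu + rng.normal(0, sigma, mu.size)  # OPTION2
          |   print(y_mean.std(), y_draw.std())  # draws vary more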
        
           | stdbrouw wrote:
           | Multiple imputation simply means you impute multiple times
           | and run the analysis on each complete (imputed) dataset so
           | you can incorporate the uncertainty that comes from guessing
           | at missing values into your final confidence intervals and
           | such. How you actually do the imputation will depend on the
            | type of variable, the amount of missingness, etc. A draw from
           | the predictive distribution of a linear model of other
           | variables without missing data is definitely a common method,
           | but in a state-of-the-art multiple imputation package like mi
           | in R you can choose from dozens.
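            | 
            | To make that concrete, here is a minimal sketch of the loop
            | with Rubin's pooling rules (toy data and a deliberately
            | crude normal imputation model, not what mi actually does):
            | 
            |   import numpy as np
            |   
            |   rng = np.random.default_rng(0)
            |   
            |   def impute_once(y):
            |       # fill NaNs with draws from a normal fit
            |       # to the observed values
            |       obs = y[~np.isnan(y)]
            |       out = y.copy()
            |       k = np.isnan(y).sum()
            |       out[np.isnan(y)] = rng.normal(
            |           obs.mean(), obs.std(ddof=1), k)
            |       return out
            |   
            |   y = np.array([1.2, np.nan, 0.7, 2.1,
            |                 np.nan, 1.5, 0.9])
            |   M = 20  # number of imputed copies
            |   est, var = [], []
            |   for _ in range(M):
            |       yc = impute_once(y)
            |       est.append(yc.mean())  # the "analysis"
            |       var.append(yc.var(ddof=1) / len(yc))
            |   
            |   # Rubin's rules: total variance is within-
            |   # imputation + (1 + 1/M) * between-imputation
            |   qbar = np.mean(est)
            |   T = np.mean(var) + (1 + 1/M) * np.var(est, ddof=1)
            |   print(qbar, np.sqrt(T))  # estimate, std. error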
        
       | Jun8 wrote:
        | Not one mention of the EM algorithm, which, as far as I can
        | understand, is being described here (https://en.m.wikipedia.org/w
        | iki/Expectation%E2%80%93maximiza...). It has so many
        | applications, among which is estimating the number of clusters
        | for a Gaussian mixture model.
       | 
       | An ELI5 intro: https://abidlabs.github.io/EM-Algorithm/
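        | 
        | For the curious, a bare-bones EM loop for a two-component 1-D
        | Gaussian mixture (toy data, fixed iteration count, no
        | convergence check):
        | 
        |   import numpy as np
        |   
        |   rng = np.random.default_rng(3)
        |   x = np.concatenate([rng.normal(-2, 1, 300),
        |                       rng.normal(3, 1, 200)])
        |   
        |   mu = np.array([-1.0, 1.0])   # initial guesses
        |   sd = np.array([1.0, 1.0])
        |   w = np.array([0.5, 0.5])
        |   for _ in range(100):
        |       # E-step: responsibility of each component
        |       # for each point (constants cancel)
        |       z = (x[:, None] - mu) / sd
        |       d = w * np.exp(-0.5 * z ** 2) / sd
        |       r = d / d.sum(axis=1, keepdims=True)
        |       # M-step: re-estimate parameters from the
        |       # responsibility-weighted data
        |       n = r.sum(axis=0)
        |       w = n / len(x)
        |       mu = (r * x[:, None]).sum(axis=0) / n
        |       v = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n
        |       sd = np.sqrt(v)
        |   print(w, mu, sd)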
        
         | Sniffnoy wrote:
         | It does not appear to be what's being described here? Could you
         | perhaps expand on the equivalence between the two if it is?
        
         | miki123211 wrote:
         | > It has so many applications, among which is estimating number
         | of clusters for a Gaussian mixture model
         | 
          | Any sources for that? As far as I remember, EM is used to
          | calculate the actual cluster parameters (means, covariances,
          | etc.), but I'm not aware of any usage to estimate what number
          | of clusters works best.
         | 
         | Source: I've implemented EM for GMMs for a college assignment
         | once, but I'm a bit hazy on the details.
        
           | fleischhauf wrote:
            | You are right, you still need the number of clusters.
        
             | BrokrnAlgorithm wrote:
              | I've been out of the loop on stats for a while, but is
              | there a viable approach for estimating ex ante the number
              | of clusters when creating a GMM? I can think of
              | constructing ex post metrics, i.e. using a grid and
              | goodness-of-fit measurements, but these feel more like
              | brute-forcing it.
        
               | disgruntledphd2 wrote:
               | Unsupervised learning is hard, and the pick K problem is
               | probably the hardest part.
               | 
                | For PCA or factor analysis, there are lots of ways, but
                | without some way of determining ground truth it's
                | difficult to know if you've done a good job.
        
               | lukego wrote:
                | Is the question fundamentally: what's the relative
                | likelihood of each number of clusters?
               | 
               | If so then estimating the marginal likelihood of each one
               | and comparing them seems pretty reasonable?
               | 
               | (I mean in the sense of Jaynes chapter 20.)
        
               | CrazyStat wrote:
               | There are Bayesian nonparametric methods that do this by
                | putting a Dirichlet process prior on the parameters of
               | the mixture components. Both the prior specification and
               | the computation (MCMC) are tricky, though.
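                | 
                | For what it's worth, scikit-learn ships a
                | variational (not MCMC) approximation to this
                | idea; a minimal sketch on toy data:
                | 
                |   import numpy as np
                |   from sklearn.mixture import (
                |       BayesianGaussianMixture)
                |   
                |   rng = np.random.default_rng(4)
                |   x = np.concatenate([
                |       rng.normal(-3, 1, (200, 1)),
                |       rng.normal(2, 1, (300, 1))])
                |   
                |   # over-provision components; the DP
                |   # prior prunes extras toward zero weight
                |   bgm = BayesianGaussianMixture(
                |       n_components=10,
                |       weight_concentration_prior_type=
                |           "dirichlet_process",
                |   ).fit(x)
                |   print(np.round(bgm.weights_, 3))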
        
         | CrazyStat wrote:
         | EM can be used to impute data, but that would be single
         | imputation. Multiple imputation as described here would not use
         | EM since the goal is to get samples from a distribution of
         | possible values for the missing data.
        
           | wdkrnls wrote:
           | In other words, EM makes more sense. All this imputation
           | stuff seems to me more like an effort to keep using obsolete
           | modeling techniques.
        
       | TaurenHunter wrote:
        | Donald Rubin is kind of a modern-day Leibniz...
       | 
       | Rubin Causal Model
       | 
       | Propensity Score Matching
       | 
        | Contributions to:
       | 
       | Bayesian Inference
       | 
       | Missing data mechanisms
       | 
       | Survey sampling
       | 
        | Causal inference in observational studies
       | 
       | Multiple comparisons and hypothesis testing
        
       | a-dub wrote:
       | life is nothing but shaped noise
        
       | bgnn wrote:
        | Why not interpolate the missing data points with similar
        | patients' data?
        | 
        | This must be about the confidence of the approach. Maybe
        | interpolation would be overconfident, too.
        
       | userbinator wrote:
       | It reminds me somewhat of dithering in signal processing.
        
       | SillyUsername wrote:
       | Isn't this just Monte Carlo, or did I miss something?
        
         | hatmatrix wrote:
         | Monte Carlo is one way to implement multiple imputation.
        
       | karaterobot wrote:
       | Does anyone else find it maddeningly difficult to read Quanta
       | articles on desktop, because the nav bar keeps dancing around the
       | screen? One of my least favorite web design things is the "let's
       | move the bar up and down the screen depending on what direction
       | he's scrolling, that'll really mess with him." I promise I can
       | find the nav bar on my own when I need it.
        
       ___________________________________________________________________
       (page generated 2024-10-06 23:02 UTC)