[HN Gopher] Statisticians use a technique that leverages randomn...
___________________________________________________________________
Statisticians use a technique that leverages randomness to deal
with the unknown
Author : Duximo
Score : 134 points
Date : 2024-10-03 07:08 UTC (3 days ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| xiaodai wrote:
| I don't know. I find Quanta articles very high-noise; they're
| always hyping something.
| jll29 wrote:
| I don't find the language of the article full of "hype"; they
| describe the history of different forms of imputation from
| single to multiple to ML-based.
|
| The table is particularly useful as it describes what the
| article is all about in a way that can stick to students'
| minds. I'm very grateful for QuantaMagazine for its popular
| science reporting.
| billfruit wrote:
| The Quanta articles usually have a gossipy style and are very
| low information density.
| SAI_Peregrinus wrote:
| They're usually more science history than science: who did
| what, when, and a basic overview of why it's important.
| vouaobrasil wrote:
| I agree with that. I skip the Quanta magazine articles, mainly
| because the titles seem a little too hyped for my taste and
| don't represent the content as well as they should.
| amelius wrote:
| Yes, typically a short conversation with an LLM gives me more
| info and understanding of a topic than reading a Quanta
| article.
| MiddleMan5 wrote:
| Curious, what sites would you recommend?
| light_hue_1 wrote:
| I wish they actually engaged with this issue instead of writing a
| fluff piece. There are plenty of problems with multiple
| imputation.
|
| Not the least of which is that it's far too easy to do the
| equivalent of p hacking and get your data to be significant by
| playing games with how you do the imputation. Garbage in, garbage
| out.
|
| I think all of these methods should be abolished from the
| curriculum entirely. When I review papers in ML/AI, I
| automatically reject any paper or dataset that uses imputation.
|
| This is all a consequence of the terrible statistics used in
| most fields. Bayesian methods don't need to do this.
| jll29 wrote:
| There are plenty of legit. articles that discuss/survey
| imputation in ML/AI:
| https://scholar.google.com/scholar?hl=de&as_sdt=0%2C5&q=%22m...
| light_hue_1 wrote:
| The prestigious journal "Artificial intelligence in
| medicine"? No. Just because it's on Google scholar doesn't
| mean it's worth anything. These are almost all trash. On the
| first page there's one maybe legit paper in an ok venue as
| far as ML is concerned (KDD; an adjacent field to ML) that's
| 30 years old.
|
| No. AI/ML folks don't do imputation on our datasets. I cannot
| think of a single major dataset in vision, NLP, or robotics
| that does so, despite missing data being a huge issue in those
| fields. It's an antiquated method for an antiquated idea of how
| statistics should work that is doing far more damage than good.
| disgruntledphd2 wrote:
| Ok, that's interesting. I profoundly disagree with your tone,
| but would really like to hear what you regard as good
| approaches to the problem of missing data (particularly
| where you have dropout from a study or experiment).
| nyrikki wrote:
| Perhaps looking into the issues with uncongeniality and
| multiple imputation may help, although I haven't looked
| at MI for a long time, so consider my reply an attempt to
| be helpful rather than authoritative.
|
| Another related intuition for a probable foot-gun relates
| to learning linearly inseparable functions like XOR, which
| require MLPs.
|
| A single missing value in an XOR situation is far more
| challenging than participant dropouts causing missing
| data.
|
| Specifically the problem is counterintuitively non-
| convex, with multiple possibilities for convergence
| without information in the corpus to know which may be
| true.
|
| That is a useful lens in my mind, where I think of the
| manifold being pushed down in opposite sectors as the
| kernel trick.
|
| Another potential lens to think about it is that in
| medical studies the assumption is that there is a smooth
| and continuous function, while in learning, we are trying
| to find a smooth continuous function with minimal loss.
|
| We can't assume that the function we need to learn is
| smooth, but autograd specifically limits what is learnable,
| and simplicity bias, especially with feed-forward networks,
| is an additional concern.
|
| One thing that is common for people to conflate is the
| fact that a differentiable function is probably smooth
| and continuous.
|
| But the set of continuous functions that are
| differentiable _anywhere_ is a meagre set.
|
| Like anything in math and logic, the assumptions you can
| make will influence what methods work.
|
| As ML is existential quantification, and because it is
| insanely good at finding efficient glitches in the
| matrix, within the limits of my admittedly limited
| knowledge, MI would need to be a very targeted solution
| with a lot of care to avoid set shattering from causing
| uncongeniality, especially in the unsupervised context.
|
| Hopefully someone else can provide more productive
| insights.
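The XOR point above can be made concrete with a tiny sketch (toy code, not from the thread): once one input of an XOR is missing, the observed input carries no information about the label, so any single imputed value silently commits to one of two equally consistent worlds.

```python
import itertools

# Full XOR truth table: (a, b, a XOR b).
rows = [(a, b, a ^ b) for a, b in itertools.product([0, 1], repeat=2)]

# Condition on a = 1 with b missing: both labels remain consistent
# with the observed input, so nothing in the data distinguishes
# the two possible completions.
outputs = {y for a, b, y in rows if a == 1}
# outputs == {0, 1}
```

Both labels survive the conditioning, which is exactly the non-identifiability the comment describes: an imputer must pick one branch without evidence.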
| DAGdug wrote:
| Maybe in academia, where sketchy incentives rule. In industry,
| p-hacking is great till you're eventually caught for doing
| nonsense that isn't driving real impact (still, the lead time
| is enough to mint money).
| light_hue_1 wrote:
| Very doubtful. There are plenty of drugs that get approved
| and are of questionable value. Plenty of procedures that turn
| out to be not useful. The incentives in industry are even
| worse because everything depends on lying with data if you
| can do it.
| hggigg wrote:
| Indeed. Even worse some entire academic fields are built on
| pillars of lies. I was married to a researcher in one of
| them. Anything that compromises the existence of the field
| just gets written off. The end game is this fed into life
| changing healthcare decisions so one should never assume
| academia is harmless. This was utterly painful watching it
| from the perspective of a mathematician.
| nerdponx wrote:
| I assume by "in industry" they meant in jobs where you are
| doing data analysis to support decisions that your employer
| is making. This would be any typical "data scientist" job
| nowadays. There the consequences of BSing are felt by the
| entity that pays you, and will eventually come back around
| to you.
|
| The incentives in medicine are more similar to those in
| academia, where your job is to cook up data that convinces
| _someone else_ of your results, with highly imbalanced
| incentives that reward fraud.
| DAGdug wrote:
| Yes, precisely this! I've seen more than a few people
| fired for generating BS analyses that didn't help their
| employer, especially in tech where scrutiny is immense
| when things start to fail.
| fn-mote wrote:
| Clearly you know your stuff. Are there any not-super-technical
| references where an argument against using imputation is
| clearly explained?
| parpfish wrote:
| I feel like multiple imputation is fine when you have data
| missing at random.
|
| The problem is that data is _never_ actually missing at random
| and there's always some sort of interesting variable that
| confounds which pieces are missing
| underbiding wrote:
| True, true, but how do you account for missing data based on
| variables you care about and those you don't?
|
| More specifically, how do you determine if the pattern you
| seem to be identifying is actually related to the phenomenon
| being measured and not an error in the measurement tools
| themselves?
|
| For example, a significant pattern of answers to "Yes / No:
| have you ever been assaulted?" are blank. This could be (A),
| respondents who were assaulted are more likely to leave it
| blank out of shame, or (B) someone handling the spreadsheet
| accidentally dropped some rows in the data (because, let's be
| serious here, it's all spreadsheets and emails...).
|
| While you could say that (B) should be theoretically "more
| truly random", we can't assume that there isn't a pattern to
| the way those rows were dropped (i.e. a pattern imposed on
| some algorithm that bugged out and dropped those rows).
| Xcelerate wrote:
| > how do you determine if the pattern you seem to be
| identifying is actually related to the phenomenon being
| measured and not an error in the measurement tools
| themselves?
|
| If the "which data is missing" information can be used to
| compress the data that isn't missing further than it can
| be compressed alone, then the missing data is missing at
| least in part due to the phenomenon being measured.
| Otherwise, it's not.
|
| We're basically just asking if K(non-missing data | which
| data is missing) < K(non-missing data). This is
| uncomputable, so it doesn't actually answer your question
| regarding "how to determine", but it does provide a
| necessary and sufficient theoretical criterion.
|
| A decent practical approximation might be to see if you can
| develop a model that predicts the non-missing data better
| when augmented with the "which information is missing"
| information than via self-prediction. That could be an
| interesting research project...
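The proposed approximation — check whether the "which data is missing" mask improves prediction of the observed data — can be sketched on synthetic data (the logistic missingness model and all names below are illustrative assumptions, not from the comment):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
y = rng.normal(size=n)                 # a fully observed variable

# A second variable is missing more often when y is large (MNAR-style),
# so its missingness mask carries information about y.
mask_mnar = rng.random(n) < 1.0 / (1.0 + np.exp(-y))
# Under MCAR the mask is an independent coin flip.
mask_mcar = rng.random(n) < 0.5

def residual_var(y, mask):
    """Variance of y left over after conditioning on the mask
    (group-wise demeaning by the missing/observed indicator)."""
    r = np.empty_like(y)
    for g in (mask, ~mask):
        r[g] = y[g] - y[g].mean()
    return r.var()

base = y.var()
# residual_var(y, mask_mnar) < base: the mask "compresses" y further.
# residual_var(y, mask_mcar) ~= base: an MCAR mask is uninformative.
```

Conditioning on an informative mask shrinks the unexplained variance, a crude stand-in for the K(data | mask) < K(data) criterion; an MCAR mask leaves it essentially unchanged.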
| parpfish wrote:
| There's already a bunch of stats research on this
| problem. Some useful terms to look up are MCAR (missing
| completely at random) and MNAR (missing not at random)
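A minimal simulation of the MCAR/MNAR distinction (toy data; the logistic drop-out model is an illustrative assumption): under MCAR the observed mean stays unbiased, while under MNAR-style selection it drifts.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=100_000)

# MCAR: every value is missing with the same probability,
# independent of x.
mcar_mask = rng.random(x.size) < 0.3

# MNAR: larger values are more likely to go missing (e.g. people
# with sensitive answers skip the question).
p_miss = 1.0 / (1.0 + np.exp(-(x - 10.0)))   # increases with x
mnar_mask = rng.random(x.size) < p_miss

mean_full = x.mean()
mean_mcar = x[~mcar_mask].mean()   # still close to the full mean
mean_mnar = x[~mnar_mask].mean()   # biased downward: high values dropped
```

Under MNAR, no amount of averaging over the observed cases recovers the true mean, which is why the "data is never actually missing at random" objection bites.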
| aabaker99 wrote:
| My intuition would be that there are certain conditions under
| which Bayesian inference for the missing data and multiple
| imputation lead to the same results.
|
| What is the distinction?
|
| The scenario described in the article could be represented in a
| Bayesian method or not. "For a given missing value in one copy,
| randomly assign a guess from your distribution." Here "your
| distribution" could be Bayesian or not, but either way it's
| still up to the statistician to make good choices about the
| model. The Bayesian can p-hack here all the same.
| clircle wrote:
| Does any living statistician come close to the level of Donald
| Rubin in terms of research impact? Missing data analysis, causal
| inference, the EM algorithm, and probably more. He just walks around
| creating new subfields.
| selimthegrim wrote:
| Efron?
| richrichie wrote:
| & Tibshirani
| selimthegrim wrote:
| Stein too
| aquafox wrote:
| Andrew Gelman?
| nabla9 wrote:
| Gelman has contributed to Bayesianism, hierarchical models, and
| Stan is great, but that's not even close to what Rubin has
| done.
|
| ps. Gelman was Rubin's doctoral student.
| selectionbias wrote:
| Also approximate Bayesian computation, principal
| stratification, and the Bayesian Bootstrap.
| j7ake wrote:
| Mike Jordan, Tibshirani, Emmanuel Candes
| paulpauper wrote:
| why not use regression on the existing entries to infer what the
| missing ones should be?
| ivan_ah wrote:
| That would push things towards the mean... not necessarily a
| bad thing, but presumably later steps of the analysis will be
| pooling/averaging data together so not that useful.
|
| A more interesting approach, let's call it OPTION2, would be to
| sample from the predictive distribution of a regression
| (regression mean + noise), which would result in more
| variability in the imputations, although it's random, so it
| might not be what you want.
|
| The multiple imputation approach seems to be a resampling
| method of obtaining OPTION2, without needing to assume a
| linear regression model.
| stdbrouw wrote:
| Multiple imputation simply means you impute multiple times
| and run the analysis on each complete (imputed) dataset so
| you can incorporate the uncertainty that comes from guessing
| at missing values into your final confidence intervals and
| such. How you actually do the imputation will depend on the
| type of variable, the amount of missingness etc. A draw from
| the predictive distribution of a linear model of other
| variables without missing data is definitely a common method,
| but in a state-of-the-art multiple imputation package like mi
| in R you can choose from dozens.
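The procedure described above can be sketched end to end (toy data and a deliberately simple linear imputation model, purely illustrative — packages like mi or mice do far more): impute several times by drawing from a regression's predictive distribution, estimate on each completed dataset, and pool with Rubin's rules so the final variance reflects imputation uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends linearly on x; ~30% of y is missing (assumed MCAR).
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

def impute_once(x, y, missing, rng):
    """Fill missing y's with draws from the predictive distribution
    of a linear regression fitted on the observed cases."""
    xo, yo = x[~missing], y[~missing]
    X = np.column_stack([np.ones_like(xo), xo])
    beta, res, *_ = np.linalg.lstsq(X, yo, rcond=None)
    sigma = np.sqrt(res[0] / (len(yo) - 2))     # residual std. dev.
    Xm = np.column_stack([np.ones(missing.sum()), x[missing]])
    y_new = y.copy()
    y_new[missing] = Xm @ beta + rng.normal(scale=sigma, size=missing.sum())
    return y_new

# Impute m times, estimate the mean of y on each completed dataset,
# then pool estimates and variances with Rubin's rules.
m = 20
estimates, within = [], []
for _ in range(m):
    yc = impute_once(x, y_obs, missing, rng)
    estimates.append(yc.mean())
    within.append(yc.var(ddof=1) / n)   # variance of the mean estimate

q_bar = np.mean(estimates)              # pooled point estimate
u_bar = np.mean(within)                 # within-imputation variance
b = np.var(estimates, ddof=1)           # between-imputation variance
total_var = u_bar + (1 + 1 / m) * b     # Rubin's total variance
```

The (1 + 1/m)·b term is the point of the exercise: it inflates the pooled variance by the disagreement between imputations, which single imputation discards.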
| Jun8 wrote:
| Not one mention of the EM algorithm, which, as far as I can
| understand, is what is being described here (https://en.m.wikipedia.
| org/wiki/Expectation%E2%80%93maximiza...). It has so many
| applications, among which is estimating the number of clusters
| for a Gaussian mixture model.
|
| An ELI5 intro: https://abidlabs.github.io/EM-Algorithm/
| Sniffnoy wrote:
| It does not appear to be what's being described here? Could you
| perhaps expand on the equivalence between the two if it is?
| miki123211 wrote:
| > It has so many applications, among which is estimating number
| of clusters for a Gaussian mixture model
|
| Any sources for that? As far as I remember, EM is used to
| calculate actual cluster parameters (means, covariances etc),
| but I'm not aware of any usage to estimate what number of
| clusters works best.
|
| Source: I've implemented EM for GMMs for a college assignment
| once, but I'm a bit hazy on the details.
| fleischhauf wrote:
| You're right, you still need the number of clusters.
| BrokrnAlgorithm wrote:
| I've been out of the loop on stats for a while, but is
| there a viable approach for estimating ex ante the number of
| clusters when creating a GMM? I can think of constructing
| ex post metrics, i.e. using a grid and goodness-of-fit
| measurements, but these feel more like brute-forcing it.
| disgruntledphd2 wrote:
| Unsupervised learning is hard, and the pick K problem is
| probably the hardest part.
|
| For PCA or factor analysis, there's lots of ways but
| without some way of determining ground truth it's
| difficult to know if you've done a good job.
| lukego wrote:
| Is the question fundamentally: what's the relative
| likelihood of each number of clusters?
|
| If so then estimating the marginal likelihood of each one
| and comparing them seems pretty reasonable?
|
| (I mean in the sense of Jaynes chapter 20.)
| CrazyStat wrote:
| There are Bayesian nonparametric methods that do this by
| putting a Dirichlet process prior on the parameters of
| the mixture components. Both the prior specification and
| the computation (MCMC) are tricky, though.
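The ex-post grid approach raised above — fit a GMM for each candidate K and compare a penalized goodness-of-fit score — can be sketched with scikit-learn's BIC (toy data with three planted clusters; an illustrative sketch, not a recommendation over the Bayesian nonparametric route):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-D Gaussian clusters.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(200, 2)),
])

# Fit a GMM with EM for each candidate K and score it with BIC,
# whose complexity penalty punishes unnecessary components.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)   # K with the lowest BIC
```

This is exactly the "grid plus goodness-of-fit" brute force the commenter describes; BIC's penalty is what keeps the grid from always preferring the largest K.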
| CrazyStat wrote:
| EM can be used to impute data, but that would be single
| imputation. Multiple imputation as described here would not use
| EM since the goal is to get samples from a distribution of
| possible values for the missing data.
| wdkrnls wrote:
| In other words, EM makes more sense. All this imputation
| stuff seems to me more like an effort to keep using obsolete
| modeling techniques.
| TaurenHunter wrote:
| Donald Rubin is kind of a modern-day Leibniz...
|
| Rubin Causal Model
|
| Propensity Score Matching
|
| Contributions to:
|
| Bayesian inference
|
| Missing data mechanisms
|
| Survey sampling
|
| Causal inference in observational studies
|
| Multiple comparisons and hypothesis testing
| a-dub wrote:
| life is nothing but shaped noise
| bgnn wrote:
| Why not interpolate the missing data points with similar patients
| data?
|
| This must be about the confidence of the approach. Maybe
| interpolation would be overconfident too.
| userbinator wrote:
| It reminds me somewhat of dithering in signal processing.
| SillyUsername wrote:
| Isn't this just Monte Carlo, or did I miss something?
| hatmatrix wrote:
| Monte Carlo is one way to implement multiple imputation.
| karaterobot wrote:
| Does anyone else find it maddeningly difficult to read Quanta
| articles on desktop, because the nav bar keeps dancing around the
| screen? One of my least favorite web design things is the "let's
| move the bar up and down the screen depending on what direction
| he's scrolling, that'll really mess with him." I promise I can
| find the nav bar on my own when I need it.
___________________________________________________________________
(page generated 2024-10-06 23:02 UTC)