[HN Gopher] What are the most important statistical ideas of the...
       ___________________________________________________________________
        
       What are the most important statistical ideas of the past 50 years?
        
       Author : Anon84
       Score  : 215 points
       Date   : 2022-02-21 16:46 UTC (6 hours ago)
        
 (HTM) web link (www.tandfonline.com)
 (TXT) w3m dump (www.tandfonline.com)
        
       | ModernMech wrote:
        | Kalman published his filter in 1960... a little over 50 years
        | ago, but I'd say it's worth mentioning given its huge impact.
        | The idea
       | that we can use multiple noisy sensors to get a more accurate
       | reading than any one of them could provide enables all kinds of
       | autonomous systems. Basically everything that uses sensors (which
       | is essentially every device these days) is better due to this
       | fact.
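        | 
        | A toy sketch of the core trick (my own illustration in
        | Python, not from the article): the scalar Kalman measurement
        | update, which fuses two noisy readings so the fused variance
        | is below either sensor's:
        | 
        |     import numpy as np
        |     
        |     def fuse(x1, var1, x2, var2):
        |         # Kalman gain: how much to trust the second reading
        |         k = var1 / (var1 + var2)
        |         x = x1 + k * (x2 - x1)   # fused estimate
        |         var = (1 - k) * var1     # below min(var1, var2)
        |         return x, var
        |     
        |     truth = 5.0
        |     a = truth + np.random.normal(0, 1.0)  # sensor A, sd 1.0
        |     b = truth + np.random.normal(0, 2.0)  # sensor B, sd 2.0
        |     print(fuse(a, 1.0**2, b, 2.0**2))     # fused var = 0.8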
        
         | savant_penguin wrote:
         | And it uses almost all of the tricks in the book in a single
         | analytic model
         | 
         | Statistics+optimization+dynamic systems+linear algebra
        
       | oxff wrote:
       | "Throw more compute at it"
        
         | Q6T46nT668w6i3m wrote:
         | The authors agree. It's mentioned a handful of times.
        
       | westcort wrote:
       | Not in the past 50 years, but more like the past 80 years,
       | nonparametric statistics in general are pretty amazing, though
        | underused. Look at the Mann-Whitney U test (also known as the
        | Wilcoxon rank-sum test). Such tests require very few
        | assumptions.
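        | 
        | A quick sketch (mine, assuming scipy) of how little you need
        | to assume -- just independent samples, no normality:
        | 
        |     import numpy as np
        |     from scipy.stats import mannwhitneyu
        |     
        |     rng = np.random.default_rng(0)
        |     a = rng.exponential(1.0, size=30)  # skewed group A
        |     b = rng.exponential(1.5, size=30)  # skewed group B
        |     stat, p = mannwhitneyu(a, b, alternative="two-sided")
        |     print(stat, p)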
        
         | dannykwells wrote:
         | These are the standard tests in most of biology at this time.
         | Not underused at all. Very powerful and lovely to not have to
         | assume normality.
        
       | bell-cot wrote:
       | The most important statistical idea of the past 50 years is the
       | same as the most important statistical idea of the 50 years
       | before that:
       | 
       | "Due to reduced superstition, better education, and general
       | awareness & progress, humans who are neither meticulous
       | statistics experts, nor working in very constrained & repetitive
       | circumstances, will understand and apply statistics more
       | objectively and correctly than they generally have in the past."
       | 
       | Sadly, this idea is still wrong.
        
       | deepsquirrelnet wrote:
       | Good review article. I always enjoy browsing the references, and
       | found "Computer Age Statistical Inference" among them. Looks like
       | a good read, with a pdf available online.
        
         | mjb wrote:
         | It's a great book. Short and to-the-point. Highly recommended.
        
       | aabajian wrote:
       | Didn't read the document, but hopefully it mentions PageRank, the
       | prime example of using probabilistic graphical models to rank
       | nodes in a directed graph. More info:
       | https://www.amazon.com/Probabilistic-Graphical-Models-Princi...
       | 
       | I've heard that Google and Baidu essentially started at the same
       | time, with the same algorithm discovery (PageRank). Maybe someone
        | can comment on whether there was idea sharing or whether both
        | teams derived it independently.
        
         | nabla9 wrote:
          | PageRank is just an application of eigenvalue methods to
          | ranking.
          | 
          | The idea first came up in the 1970s
          | https://www.sciencedirect.com/science/article/abs/pii/030645...
          | and resurfaced several times before PageRank was developed.
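          | 
          | A toy power-iteration sketch of that eigenvalue view (my
          | own example, not from the paper linked above):
          | 
          |     import numpy as np
          |     
          |     A = np.array([[0, 1, 1],   # row i links to column j
          |                   [1, 0, 0],
          |                   [1, 1, 0]], dtype=float)
          |     M = A / A.sum(axis=1, keepdims=True)  # row-stochastic
          |     d, n = 0.85, len(A)
          |     r = np.ones(n) / n
          |     for _ in range(100):
          |         r = (1 - d) / n + d * M.T @ r  # power iteration
          |     print(r / r.sum())  # PageRank vector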
        
         | mianos wrote:
          | The sort of methods 'PageRank' uses already existed. It
          | reminds me of Apple 'inventing' (air quotes) the MP3 player.
          | It didn't; it applied existing technology, refined it, and
          | publicized it. They did not invent it, but maybe 'inventing'
          | something is only a very small part of making something
          | useful for many people.
        
         | bjourne wrote:
          | PageRank actually had a predecessor called HITS (according to
          | some sources HITS was developed before PageRank, according to
          | others they were contemporaries), an algorithm developed by Jon
         | Kleinberg for ranking hypertext documents.
         | https://en.wikipedia.org/wiki/HITS_algorithm However, Kleinberg
         | stayed in academia and never attempted to commercialize his
         | research like Page and Brin did. HITS was more complex than
         | PageRank and context-sensitive so queries required much more
         | computing resources than PageRank. PageRank is _kind of_ what
         | you get if you take HITS and remove the slow parts.
         | 
         | What I find very interesting about PageRank is how you can
         | trade accuracy for performance. The traditional way of
          | calculating PageRank, iterating with the transition matrix
          | until it reaches convergence (power iteration), gives you
          | correct results but is
         | sloooooow. For a modestly sized graph it could take days. But
         | if accuracy isn't that important you can use Monte Carlo
         | simulation and get most of the PageRank correct in a fraction
         | of the time of the iterative method. It's also easy to
         | parallelize.
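          | 
          | A rough sketch of the Monte Carlo variant (my own toy
          | example): run many short random walks with teleportation
          | and count visits; visit frequencies approximate PageRank.
          | 
          |     import numpy as np
          |     
          |     links = {0: [1, 2], 1: [0], 2: [0, 1]}  # toy graph
          |     d, n = 0.85, 3
          |     rng = np.random.default_rng(0)
          |     visits = np.zeros(n)
          |     for _ in range(2000):        # walks
          |         node = rng.integers(n)
          |         for _ in range(20):      # steps per walk
          |             if rng.random() < d:
          |                 out = links[node]
          |                 node = out[rng.integers(len(out))]
          |             else:
          |                 node = rng.integers(n)  # teleport
          |             visits[node] += 1
          |     print(visits / visits.sum())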
        
           | jll29 wrote:
           | Page's PageRank patent references HITS:
           | 
           | Jon M. Kleinberg, "Authoritative sources in a hyperlinked
           | environment," 1998, Proc. Of the 9th Annual ACM-SIAM
           | Symposium on Discrete Algorithms, pp. 668-677.
        
         | mach1ne wrote:
         | Didn't Larry Page and Sergey Brin openly publicize the PageRank
         | algorithm? It'd seem more likely that Baidu just copypasted the
         | idea.
        
         | oneoff786 wrote:
          | The basic concept behind PageRank is pretty obvious. If you
          | stare at a graph for a while and try to imagine centrality
          | calculations, it'll probably occur to you.
         | 
         | Implementing it and catching edge cases isn't trivial
        
         | screye wrote:
          | Given that PageRank was literally invented by and named after
          | Larry Page, I would think that Google had a head start.
          | 
          | That being said, PageRank is more a stellar example of
          | adapting an academic idea into practice than a statistical
          | idea in and of itself.
          | 
          | After all, it is 'merely' the stationary distribution of a
          | random walk over a directed graph. I say 'merely' with a lot
         | of respect, because the best ideas often feel simple in
         | hindsight. But, it is that simplicity that makes them even more
         | impressive.
        
         | andi999 wrote:
          | I heard that Google's first approach was adapting a published
          | algorithm that was used to rank scientific publications from
          | the network of citations. Not sure if this is the algorithm
          | you mentioned though.
        
           | divbzero wrote:
           | The ranking of scientific publications based on citations
           | you're describing is impact factor [1]. I haven't heard that
           | as an inspiration for Larry Page's PageRank [2] but that is
           | plausible.
           | 
           | [1]: https://en.wikipedia.org/wiki/Impact_factor
           | 
           | [2]: https://en.wikipedia.org/wiki/PageRank
        
         | ppsreejith wrote:
         | From the wikipedia page of Robin Li, co-founder of Baidu:
         | https://en.wikipedia.org/wiki/Robin_Li#RankDex
         | 
         | > In 1996, while at IDD, Li created the Rankdex site-scoring
         | algorithm for search engine page ranking, which was awarded a
         | U.S. patent. It was the first search engine that used
         | hyperlinks to measure the quality of websites it was indexing,
         | predating the very similar algorithm patent filed by Google two
         | years later in 1998.
        
       | oraoraoraoraora wrote:
       | Statistical process control has been significant over the last
       | 100 years.
        
         | avs733 wrote:
         | I would agree with you but they are speaking to a different
         | audience. This is in a journal for statistics researchers and
         | theorists. These would all be things that would inform the
         | creation of pragmatic tools like SPC.
        
       | csee wrote:
       | Meta-analysis techniques like funnel plots.
        
         | pacbard wrote:
         | Meta-analysis is an application of idea #4 (Bayesian Multilevel
         | Models) in the article.
         | 
         | What makes meta-analysis special within a multilevel framework
         | is that you know the level 1 variance. This creates a special
         | case of a generalized multilevel model where you leverage your
         | knowledge of L1 mean and variance (from each individual study's
         | results) to estimate the possible mean and variance of the
         | population effect.
         | 
          | The population mean and variance are usually presented in
          | funnel plots, where you can see the expected distribution of
          | effect sizes/point estimates given a sample size/standard
          | error.
         | 
          | Researchers have also started to plot actual point estimates
          | from published papers in this plot, showing that published
          | results often cluster asymmetrically around the funnel, with
          | small null results conspicuously missing; this asymmetry is
          | usually cited as evidence of publication bias. In other
          | words, the missing studies end up in researchers' file
          | drawers instead of being published somewhere.
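          | 
          | A minimal fixed-effect sketch of that "known L1 variance"
          | idea (my own numbers, purely illustrative):
          | 
          |     import numpy as np
          |     
          |     est = np.array([0.30, 0.12, 0.25, 0.40])  # effects
          |     se = np.array([0.10, 0.15, 0.08, 0.20])   # known SEs
          |     w = 1 / se**2                # inverse-variance weights
          |     pooled = np.sum(w * est) / np.sum(w)
          |     pooled_se = np.sqrt(1 / np.sum(w))
          |     print(pooled, pooled_se)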
        
       | bobbyd2323 wrote:
       | Bootstrap
        
       | datastoat wrote:
       | Validation on holdout sets.
       | 
       | When I was a student in the 1990s, I was taught about hypothesis
       | testing (and all the hassle of p-fishing etc.), and about
       | Bayesian inference (which is lovely, until you have to invent
       | priors over the model space -- e.g. a prior over neural network
       | architectures). These are both systems that tie themselves in
       | epistemological knots when trying to answer the simple question
       | "What model shall I use?"
       | 
       | Holdout set validation is such a clean simple idea, and so easy
       | to use (as long as you have big data), and it does away with all
       | the frequentist and Bayesian tangle, which is why it's so
       | widespread in ML nowadays.
       | 
       | It also aligns statistical inference with Popper's idea of
        | scientific falsifiability -- scientists test their models
        | against new experimental data; data scientists can test their
        | models against qualitatively different holdout sets. (Just
        | make sure you
       | don't get your holdout set by shuffling, since that's not what
       | Popper would call a "genuine risky validation".)
       | 
       | The article mentions Breiman's "alternative view of the
       | foundations of statistics based on prediction rather than
       | modeling". That's not general enough, since it doesn't
       | accommodate generative modelling (e.g. GPT, GANs). I think it's
       | better to frame ML in terms of "evaluating model fit on a holdout
       | set", since that accommodates both predictive and generative
       | modelling.
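        | 
        | A minimal sketch of the workflow (mine, assuming
        | scikit-learn): pick the model that scores best on data it has
        | never seen.
        | 
        |     import numpy as np
        |     from sklearn.linear_model import Ridge
        |     from sklearn.metrics import mean_squared_error
        |     from sklearn.model_selection import train_test_split
        |     
        |     rng = np.random.default_rng(0)
        |     X = rng.normal(size=(500, 10))
        |     y = X @ rng.normal(size=10) + rng.normal(size=500)
        |     X_tr, X_ho, y_tr, y_ho = train_test_split(
        |         X, y, test_size=0.3, random_state=0)
        |     for alpha in [0.01, 1.0, 100.0]:
        |         m = Ridge(alpha=alpha).fit(X_tr, y_tr)
        |         mse = mean_squared_error(y_ho, m.predict(X_ho))
        |         print(alpha, mse)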
        
         | anxrn wrote:
         | Very much agree with the simplicity and power of separation of
         | training, validation and test sets. Is this really a 'big data'
         | era notion though? This was fairly standard in 90s era language
         | and speech work.
        
       | pandoro wrote:
        | Solomonoff Induction. Although proven to be uncomputable (there
        | are people working on formalizing efficient approximations), it
        | is such a mind-blowing idea. It brings together Occam's razor,
       | Epicurus' Principle of multiple explanations, Bayes' theorem,
       | Algorithmic Information Theory and Universal Turing machines in a
       | theory of universal induction. The mathematical proof and details
       | are way above my head but I cannot help but feel like it is very
       | underrated.
        
         | spekcular wrote:
         | Statistics is an applied science, and Solomonoff induction has
         | had zero practical impact. So I feel it's not underrated at
         | all, and perhaps overrated among a certain crowd.
        
       | ThouYS wrote:
       | Bootstrap resampling is such a black magic thing
        
         | graycat wrote:
         | There is a nice treatment of resampling, i.e., _permutation_
         | tests, in (from my TeX format bibliography)
         | 
         | Sidney Siegel, {\it Nonparametric Statistics for the Behavioral
         | Sciences,\/} McGraw-Hill, New York, 1956.\ \
         | 
         | Right, there was already a good book on such tests over 50
         | years ago.
         | 
          | You can also justify it with an independent, identically
          | distributed (i.i.d.) assumption. But a weaker assumption of
          | _exchangeability_ can also work -- I published a paper using
          | that.
         | 
          | The broad idea of such a statistical hypothesis test is to
          | decide on the _null_ hypothesis, _null_ as in no effect (if
          | looking for an effect, you want to reject the null hypothesis
          | of no effect), and to make assumptions that permit
          | calculating the probability of what you observe. If that
          | probability is way too small, then reject the null hypothesis
          | and conclude that there was an effect. Right, it's fishy.
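          | 
          | A small sketch of a two-sample permutation test (my own toy
          | example): how often does a random relabeling give a mean
          | difference at least as large as the observed one?
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     a = rng.normal(0.5, 1, 30)
          |     b = rng.normal(0.0, 1, 30)
          |     obs = abs(a.mean() - b.mean())
          |     pooled = np.concatenate([a, b])
          |     hits = 0
          |     for _ in range(10000):
          |         p = rng.permutation(pooled)
          |         hits += abs(p[:30].mean() - p[30:].mean()) >= obs
          |     print(hits / 10000)  # permutation p-value
          | 
          | Under the null, the group labels are exchangeable, so the
          | permutation distribution is the right reference.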
        
           | jll29 wrote:
           | The 2nd edition is never far from my desk:
           | 
           | Siegel, S., & Castellan, N. J. (1988). Nonparametric
           | statistics for the behavioral sciences (2nd ed.) New York:
           | McGraw-Hill.
        
         | CrazyStat wrote:
         | One way to approach the bootstrap is as sampling from the
         | posterior mean of a Dirichlet Process model with a
         | noninformative prior (alpha=0).
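          | 
          | One concrete instance of that view is Rubin's Bayesian
          | bootstrap, which draws Dirichlet(1,...,1) weights over the
          | observations instead of resampling them (a sketch, my own
          | toy example):
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     x = rng.exponential(1.0, size=50)
          |     means = [rng.dirichlet(np.ones(len(x))) @ x
          |              for _ in range(5000)]
          |     print(np.percentile(means, [2.5, 97.5]))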
        
         | derbOac wrote:
         | It's just Monte Carlo simulation using the observed
         | distribution as the population distribution.
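          | 
          | In its plain-vanilla form (a sketch, mine): resample the
          | data with replacement and read off the spread of the
          | statistic.
          | 
          |     import numpy as np
          |     
          |     rng = np.random.default_rng(0)
          |     x = rng.exponential(1.0, size=50)
          |     boot = [rng.choice(x, size=len(x), replace=True).mean()
          |             for _ in range(5000)]
          |     print(np.percentile(boot, [2.5, 97.5]))  # 95% CI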
        
           | grayclhn wrote:
           | 1) It's not -- there are lots of procedures called "the
           | bootstrap" that act differently.
           | 
           | 2) The fact that "substitute the data for the population
           | distribution" both works and is sometimes provably better
            | than other, more sensible approaches is a little
            | mind-blowing.
           | 
            | Most things called the bootstrap feel like cheating, i.e.
            | "this part seems hard, let's do the easiest thing possible
            | instead and hope it works."
        
           | civilized wrote:
           | It's not the mere description of the procedure that people
           | find mysterious.
        
           | btown wrote:
           | This has big "A monad is just a monoid in the category of
           | endofunctors" energy
        
       | vanattab wrote:
       | The most important to me is "There are three kinds of lies in
       | this world. Lies, damn lies, and statistics."
       | 
        | Not attacking the mathematical field of statistics, just
        | pointing out that lots of people abuse statistics in an
        | attempt to get others to behave as they would prefer.
        
       | jll29 wrote:
        | Off-the-cuff, i.e. without digging deeply into a stack of
        | history-of-statistics books:
       | 
       | Tied 1st place:
       | 
        | * Markov chain Monte Carlo (MCMC) and the Metropolis-Hastings
        | algorithm (minimal sketch at the end of this comment)
       | 
       | * Hidden Markov Models and the Viterbi algorithm for most
       | probable sequence in linear time
       | 
       | * Vapnik-Chervonenkis theory of statistical learning (Vladimir
       | Naumovich Vapnik & Alexey Chervonenkis) and SVMs
       | 
       | 4th place:
       | 
       | * Edwin Jaynes: maximum entropy for constructing priors
       | (borderline: 1957)
       | 
       | Honorable mentions:
       | 
        | * Breiman et al.'s CART (Classification and Regression Trees)
        | algorithm (and Quinlan's related C4.5/C5.0 family)
       | 
       | * Box-Jenkins method (autoregressive moving average (ARMA) /
       | autoregressive integrated moving average (ARIMA) to find the best
       | fit of a time-series model to past values of a time series)
       | 
       | (The beginning of the 20th century was much more fertile in
       | comparison - Kolmogorov, Fisher, Gosset, Aitken, Cox, de Finetti,
       | Kullback, the Pearsons, Spearman etc.)
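        | 
        | The promised MCMC sketch (my own toy example): random-walk
        | Metropolis sampling from an unnormalized density p(x)
        | proportional to exp(-x^4/4).
        | 
        |     import numpy as np
        |     
        |     rng = np.random.default_rng(0)
        |     logp = lambda x: -x**4 / 4  # unnormalized log-density
        |     x, chain = 0.0, []
        |     for _ in range(20000):
        |         prop = x + rng.normal(0, 1.0)  # symmetric proposal
        |         if np.log(rng.random()) < logp(prop) - logp(x):
        |             x = prop                   # accept
        |         chain.append(x)                # else keep current x
        |     print(np.mean(chain), np.var(chain))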
        
         | dlg wrote:
         | I generally agree with you. However, as a pedantic note,
          | Metropolis, Rosenbluth, Rosenbluth, Teller and Teller was in
         | 1953 and Hastings was 1970.
        
       | uoaei wrote:
       | IMO, kernel-based computational methods are by far the most
       | important _overlooked_ advances in the statistical sciences.
       | 
       | Kernel methods are linear methods on data projected into very-
       | high-dimensional spaces, and you get basically all the benefits
       | of linear methods (convexity, access to analytical
       | techniques/manipulations, etc.) while being much more
       | computationally tractable and data-efficient than a naive
       | approach. Maximum mean discrepancy (MMD) is a particularly shiny
       | result from the last few years.
       | 
       | The tradeoff is that you must use an adequate kernel for whatever
       | procedure you intend, and these can sometimes have sneaky
        | pitfalls. A crass example would be the relative failure of
        | t-SNE and similar kernel-based visualization tools: in the
        | case of t-SNE, the Cauchy kernel's tails are extremely fat,
        | which ends up
       | degrading the representation of intra- vs inter-cluster
       | distances.
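        | 
        | For the curious, a rough sketch of the (biased) empirical
        | MMD^2 estimate with an RBF kernel (my own toy example):
        | 
        |     import numpy as np
        |     
        |     def rbf(a, b, s=1.0):
        |         d2 = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
        |         return np.exp(-d2 / (2 * s**2))
        |     
        |     rng = np.random.default_rng(0)
        |     X = rng.normal(0.0, 1, (200, 2))
        |     Y = rng.normal(0.5, 1, (200, 2))
        |     mmd2 = (rbf(X, X).mean() + rbf(Y, Y).mean()
        |             - 2 * rbf(X, Y).mean())
        |     print(mmd2)  # near 0 iff the samples look alike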
        
         | teshier-A wrote:
         | My experience with MMD is that unless you've been using it and
         | are familiar with other kernel methods, you probably won't know
          | what to do with it: what kernel do I use? How can I test for
          | significance (in any sense of the word)? Add the (last I
          | checked) horrendous computational complexity, and to me it
          | looks like a less usable mutual information (or KL
          | divergence) without all the nice information theory around
          | it.
        
       | enriquto wrote:
       | My favourite is Mandelbrot's heuristic converse of the central
       | limit theorem: the _only_ variables that are normal are those
       | that are sums of many variables of finite variance.
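        | 
        | A quick illustration (my own sketch): averages of
        | finite-variance draws look normal, while averages of Cauchy
        | draws (infinite variance) stay heavy-tailed.
        | 
        |     import numpy as np
        |     
        |     rng = np.random.default_rng(0)
        |     unif = rng.uniform(-1, 1, (10000, 100)).mean(axis=1)
        |     cauchy = rng.standard_cauchy((10000, 100)).mean(axis=1)
        |     print(np.percentile(unif, [1, 99]))    # narrow, normal-ish
        |     print(np.percentile(cauchy, [1, 99]))  # still wild tails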
        
       | andi999 wrote:
        | Identifying p-value hacking.
        
         | hackernewds wrote:
         | what does that mean?
        
           | pacbard wrote:
           | p-hacking is a research "dark pattern" where a researcher
           | fits several similar models and reports only the one that has
           | the significant p-value for the relationship of interest.
           | 
            | This strategy is possible because p-values are themselves
            | stochastic: even with no real effect, about one in every 20
            | models a researcher runs will come out significant at the
            | 0.05 level, on average.
           | 
            | p-hacking can also refer to pushing a p-value just below
            | the significance cut-off (usually 0.05) by modifying the
           | statistical model slightly until the desired result is
           | achieved. This process usually involves the inclusion of
           | control variables that are not really related to the outcome
           | but that will change the standard errors/p-values.
           | 
           | Another way to p-hack is to drop specific observations until
           | the desired p-value is reached. This process usually involves
           | removing participants from a sample for a seemingly
           | legitimate reason until the desired p-value is achieved.
           | Usually identifying and eliminating a few high leverage
           | observations is enough to change the significance level of a
           | point estimate.
           | 
           | Multiple strategies to address p-hacking have been proposed
           | and discussed. One of the most popular ones is pre-
           | registration of research designs and models. The idea here is
           | that a researcher would publish their research design and
           | models before conducting the experiment and they will report
           | only the results from the pre-registered models. This process
           | eliminates the "fishing expedition" nature of p-hacking.
           | 
           | Other strategies involve better research designs that are not
           | sensitive to model respecification. These are usually
           | experimental and quasi-experimental methods that leverage an
           | external source of variation (external to both the researcher
           | and the studied system, like random assignment to conditions)
           | to isolate the relationship between two variables.
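            | 
            | A quick simulation of the "one in twenty" point above (my
            | own sketch, assuming scipy):
            | 
            |     import numpy as np
            |     from scipy.stats import ttest_ind
            |     
            |     rng = np.random.default_rng(0)
            |     hits = 0
            |     for _ in range(20):   # 20 null "models"
            |         a = rng.normal(0, 1, 50)  # no real effect:
            |         b = rng.normal(0, 1, 50)  # same distribution
            |         hits += ttest_ind(a, b).pvalue < 0.05
            |     print(hits)  # ~1 significant result, on average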
        
             | pthread_t wrote:
             | I saw this firsthand as an undergrad research assistant in
             | a neuroscience lab. How did it go when I brought it up?
             | Swept under the rug and published in a high-impact journal.
        
           | ldiracdelta wrote:
           | I believe it is referencing "The Replication Crisis"
           | https://en.wikipedia.org/wiki/Replication_crisis
        
       | oneoff786 wrote:
        | SHAP values and other methods to parse out the inner workings
        | of "black box" machine learning models. They're good enough
        | that I've grown fond of just throwing a LightGBM model at
        | everything and calling it a day for that sweet spot of
        | predictive power and ease of implementation.
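        | 
        | A sketch of that workflow (mine, assuming the lightgbm and
        | shap packages): fit a boosted model, then explain it.
        | 
        |     import lightgbm as lgb
        |     import numpy as np
        |     import shap
        |     
        |     rng = np.random.default_rng(0)
        |     X = rng.normal(size=(500, 5))
        |     y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 500)
        |     model = lgb.LGBMRegressor(n_estimators=200).fit(X, y)
        |     sv = shap.TreeExplainer(model).shap_values(X)
        |     print(np.abs(sv).mean(axis=0))  # global importances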
        
         | hackernewds wrote:
          | Would you be so kind as to share an example or resources to
          | learn about this?
        
           | screye wrote:
            | [1] Scott Lundberg is the leading authority on all things
            | SHAP (he wrote the seminal paper on it).
            | 
            | [2] Chris Molnar's interpretable ML book has chapters on
            | Shapley values and SHAP, if you'd prefer text instead of
            | video.
           | 
           | [1] https://www.youtube.com/watch?v=B-c8tIgchu0
           | 
           | [2] https://christophm.github.io/interpretable-ml-
           | book/shapley.h...
        
           | magneticnorth wrote:
           | Seconding Chris Molnar's excellent writeup. I also find the
           | readme & example notebooks in Scott Lundberg's github repo to
           | be a great way to get started. There are also references
           | there for the original papers, which are surprisingly
           | readable, imo. https://github.com/slundberg/shap
        
         | teruakohatu wrote:
          | > I've grown fond of just throwing a LightGBM model at
          | everything and calling it a day
         | 
          | It is not always a good idea to do that. Always try
          | different methods; there is no ultimate method. At the very
          | least, OLS should be tried, along with some other fully
          | explainable methods, even a simple CART-like method.
        
           | csee wrote:
            | OLS is my default go-to. It outperforms a random forest so
            | often in small-data, real-world applications with
            | nonstationary data, and model explainability is built into
            | it. If I'm working in a domain with stationary data then
            | I'd tilt more to the forest (due to not having to engineer
           | features, and the inbuilt ability to detect non-linear
           | relationships and interactions between features).
        
         | hervature wrote:
         | Strongly disagree. Shapley values and LIME give a very crude
         | and extremely limited understanding of the model. They
         | basically amount to a random selection of local slopes. For
         | instance, if I tell you the (randomly selected) slopes of a
          | function are [0.5, 0.2, 12.4, 1.1, 2.6], whose average is
          | 3.3, can you guess anything? You might notice it is monotonic
         | (maybe) but you certainly don't guess it is e^x.
        
           | LeanderK wrote:
            | I would say that our ML models are not yet predictable
            | enough in a local neighbourhood to really trust LIME.
            | Adversarial examples prove that you just can't select a
            | small enough range, since you can find them even at super
            | tiny distances.
        
       ___________________________________________________________________
       (page generated 2022-02-21 23:00 UTC)