[HN Gopher] Ideas in statistics that have powered AI
       ___________________________________________________________________
        
       Ideas in statistics that have powered AI
        
       Author : MAXPOOL
       Score  : 156 points
       Date   : 2021-07-07 13:33 UTC (9 hours ago)
        
 (HTM) web link (news.columbia.edu)
 (TXT) w3m dump (news.columbia.edu)
        
       | vcdimension wrote:
        | I'm surprised they didn't mention support vector machines and the
        | kernel trick, which were discovered by statisticians.
        
         | srean wrote:
          | Although Vapnik's treatise is called Statistical Learning
          | Theory, neither statisticians nor Vapnik himself identify him
          | as a statistician. In fact, his proposals were quite radically
          | different from the established norm in contemporary statistics.
          | The same holds for Corinna Cortes.
          | 
          | The kernel 'trick', the representer theorem, etc. are far older
          | and have their origins in functional analysis.
        
           | JHonaker wrote:
           | I highly doubt that a person with a PhD in statistics doesn't
           | identify as a statistician.
        
             | mturmon wrote:
             | Look at Vapnik's publications and affiliations around this
             | time (1990 to 1997). He was working at a research group at
             | ATT Labs with people like Corinna Cortes, Leon Bottou, and
             | Yann LeCun. He had just come over from Russia, as did many
             | technically proficient Russian Jews in these years.
             | 
             | None of the people in this ATT Labs milieu were associated
             | with the Statistics community. They didn't go to Stats
             | conferences, and they didn't publish in Stats journals. I
             | don't believe that any of them were formally trained in
             | what you might call conventional statistics approaches of
             | the time, i.e., no Stanford PhD, no JASA or Ann. Stat.
             | publications.
             | 
             | They knew _about_ that stuff! But they were approaching
              | problems like digit recognition (which are not amenable to
              | model-based statistics) from a more applied math/physics
             | point of view. Vapnik's work that appeared in English at
             | this time introduces learning as a form of Tikhonov
             | regularization applied to a loss function. Not as a type of
             | maximum likelihood, not as a riff on Bayesian inference.
             | And his SVM work introduces the kernel trick as an
             | interpretation of Mercer's theorem -- a very applied math
             | motivation.
             | 
             | I knew Vladimir around this time, but I can't really say if
             | he would have described himself as a "statistician" -- that
             | would be hard for anyone to know. But I can say that he and
             | his closest colleagues were not part of the Stats community
             | of the time.
             | 
             | On the other side, very few Stats authorities were deeply
              | interested in this stuff. Leo Breiman, of course, Andrew
              | Barron, Art Owen, Trevor Hastie, Rob Tibshirani. It's the
             | Stats community's loss that so few recognized the value of
             | these approaches.
        
               | srean wrote:
               | Beautifully summarized !
        
             | sjg007 wrote:
             | Some may identify as mathematicians.
        
             | srean wrote:
              | The only way to dispute that would be to appeal to
              | authority, so I will avoid doing so. Perhaps there was a
              | time when he did identify as a mathematical/theoretical
              | statistician, but his contributions were quite a dramatic
              | break from what was the norm in statistical practice at the
              | time.
             | 
             | I would argue that his contributions were central to the
             | birth of rigorous machine learning (the non deep learning
             | kind) as a field of its own with a focus that's different
             | from that of statistics.
             | 
              | One quantitative test I can suggest -- one can count the
              | number of occurrences of the word 'statistics' in the
              | journals and conferences he has published in and compare
              | that with the number of publications he has authored. My
              | sense is that the ratio will be close to 0 and getting
              | closer (if not already there yet).
        
               | dkshdkjshdk wrote:
                | > One quantitative test I can suggest -- one can count
                | the number of occurrences of the word 'statistics' in
                | the journals and conferences he has published in and
                | compare that with the number of publications he has
                | authored. My sense is that the ratio will be close to 0
                | and getting closer (if not already there yet).
                | 
                | I'm not sure this "quantitative test" is the best
                | approach... after all, I'm pretty sure "Biometrika" and
                | "Econometrica" don't have "statistics" in their names.
        
               | srean wrote:
               | Fair point.
        
         | eachro wrote:
         | Putting aside discovery, I've always considered SVMs to be in
         | the realm of optimization rather than ML or statistics (though
         | I suppose you could then also put modern deep learning under
         | optimization too).
        
           | dkshdkjshdk wrote:
           | Why? No one uses SVM as a solver/optimization method (though
           | you do need a solver/optimization method to train a SVM).
           | 
           | Same with "modern deep learning" (whatever that may be): just
           | because you need to optimize something doesn't make the field
           | "optimization". Just because I'm using stochastic gradient
           | descent (or some other optimization method) in the course of
           | my work, doesn't mean that I'm working in the field of
           | Optimization.
        
             | eachro wrote:
             | To really understand it (primal, dual formulations) you
             | need tools from convex optimization. So it doesn't really
             | feel appropriate to teach it in a standard machine learning
             | class (unless you just toss out the details). In
             | optimization classes, you go through tons of different
             | applications of the methods you learn about: SVM slots in
             | perfectly there. It hits duality, quadratic programming,
             | even gradient descent (Pegasos).
             | 
             | Re deep learning: right, I was bringing up deep learning as
             | a clear example of why you might not want to classify it
             | under optimization. No one considers applications of deep
             | learning to be optimization. However, work on the various
             | optimizers (Adam, adagrad, second order methods, etc) which
             | are all fundamental to doing any deep learning work would
             | be firmly in the field of optimization.
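              | 
              | To make the Pegasos point concrete, here is a rough toy
              | sketch (my own illustration, not code from the paper):
              | Pegasos is essentially stochastic sub-gradient descent on
              | the regularized hinge-loss primal, so a bare-bones trainer
              | fits in a few lines of NumPy. The data and hyperparameters
              | below are made up.
              | 
              |   import numpy as np
              | 
              |   def pegasos(X, y, lam=0.1, epochs=20, seed=0):
              |       # min_w lam/2 ||w||^2 + mean_i max(0, 1 - y_i <w, x_i>)
              |       # X: (n, d) features; y: labels in {-1, +1}
              |       rng = np.random.default_rng(seed)
              |       n, d = X.shape
              |       w, t = np.zeros(d), 0
              |       for _ in range(epochs):
              |           for i in rng.permutation(n):
              |               t += 1
              |               eta = 1.0 / (lam * t)       # decaying step size
              |               if y[i] * X[i].dot(w) < 1:  # hinge term active
              |                   w = (1 - eta * lam) * w + eta * y[i] * X[i]
              |               else:
              |                   w = (1 - eta * lam) * w
              |       return w
              | 
              |   # toy data: two Gaussian blobs labeled -1 / +1
              |   rng = np.random.default_rng(0)
              |   X = np.vstack([rng.normal(-2, 1, (50, 2)),
              |                  rng.normal(+2, 1, (50, 2))])
              |   y = np.array([-1] * 50 + [+1] * 50)
              |   w = pegasos(X, y)
              |   print("train accuracy:", np.mean(np.sign(X @ w) == y))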
        
               | srean wrote:
               | > ... you need tools from convex optimization. So it
               | doesn't feel appropriate to teach it in a standard
               | machine learning class.
               | 
                | Those topics were prerequisites to taking ML at the grad
               | level when I took them. You either had to have relevant
               | courses in your bag or convince the prof that you could
               | handle it.
        
               | eachro wrote:
                | Yeah, at the grad level, absolutely (pretty much anything
                | can fly at the grad level). Undergrad? Maybe we should
                | teach it because of its historical importance to the
                | field and how the community developed, but I really do
                | think most ML classes would be better off without it
                | because of the extra background you'd have to spend
                | precious time on. Kernel PCA and kernel regression are
                | better for demonstrating the power of kernels.
               | 
                | I suppose the idea of a maximum-margin separating
                | hyperplane is kind of unique to SVMs, and if you just
                | teach SVMs through the primal and leave it at that, you
                | don't need to spend all that much time motivating the
                | dual.
        
               | dkshdkjshdk wrote:
               | > To really understand it (primal, dual formulations) you
               | need tools from convex optimization. So it doesn't really
               | feel appropriate to teach it in a standard machine
               | learning class (unless you just toss out the details).
               | 
               | Sure. But to understand it, you probably also need to
               | know a bit about arithmetic, algebra, geometry, etc.
                | Still, you wouldn't say that SVMs belong to these fields,
               | even though these fields are probably a requirement if
               | you want to understand SVMs.
               | 
               | > So it doesn't really feel appropriate to teach it in a
               | standard machine learning class (unless you just toss out
               | the details).
               | 
               | If the people you are talking about already had an
               | optimization class (including convex optimization), then
               | it should be appropriate to teach it using those
               | formalisms, no?
               | 
               | Another example: you're not going far in understanding
               | Schroedinger's equation if you don't have the necessary
                | linear algebra background. Does that make Schroedinger's
               | equation part of linear algebra?
               | 
               | > It hits duality, quadratic programming, even gradient
               | descent (Pegasos).
               | 
                | Sure... then it's a machine learning topic that is good
                | for refreshing your knowledge of optimization and linear
                | algebra. It still feels kinda weird if you're going to
                | introduce people to SVMs in the context of an
               | Optimization class (other than possibly as an example of
               | a specific optimization problem, or as an application of
               | specific optimization methods).
               | 
               | > However, work on the various optimizers (Adam, adagrad,
               | second order methods, etc) which are all fundamental to
               | doing any deep learning work would be firmly in the field
               | of optimization.
               | 
                | Exactly. If you're doing _that_, then you _are_ doing
                | research in Optimization, and not research in "deep
               | learning", as far as I'm concerned. But, let's face it...
               | those types of papers are a minority in the field.
        
       | [deleted]
        
       | andyxor wrote:
       | ..aand on the same page there is a link promoting Critical Race
       | Theory.
       | 
       | It's kind of like coming to listen to math lecture and seeing
       | swastika signs on the walls.
        
       | bmc7505 wrote:
       | https://statmodeling.stat.columbia.edu/2020/12/09/what-are-t...
        
       | heinrichhartman wrote:
        | Out of the 10 papers, I am able to download only 3 freely.
       | 
       | - For the papers I am quoted 26EUR - 39EUR
       | 
       | - For the books I am quoted 129EUR - 133EUR
       | 
        | This is audacious. Some of these papers are from the '70s. And I
       | highly doubt that the authors get any royalties from those sales.
        
         | ur-whale wrote:
         | sci-hub FTW
         | 
         | why would you want to feed the parasites?
        
         | the_svd_doctor wrote:
         | Authors never get _any_ royalties from paper sales as far as I
         | know :) (for books maybe).
        
         | nolroz wrote:
         | <donates to sci-hub>
        
       | sgt101 wrote:
       | How have they attributed GANs and Deep Learning to Statistics? I
       | thought Goodfellow was doing an AI PhD and that Hinton is a
       | biologically inspired / neuroscience fellow?
        
         | 6gvONxR4sf7o wrote:
         | The only easy real division between stats and ML is in
         | universities, where it's just a question of which department.
         | If it's CS, then it's ML or AI. If it's stats, it's stats.
         | 
         | If it's industry, it's whatever the marketing department
         | decides, inevitably AI :(
        
           | sjg007 wrote:
           | Or data science.
        
         | fighterpilot wrote:
         | See the table in this link for a humorous comparison by
         | Tibshirani:
         | 
         | https://brenocon.com/blog/2008/12/statistics-vs-machine-lear...
        
         | MAXPOOL wrote:
          | Deep learning models are statistical and probabilistic models.
          | You can categorize deep learning under both computer science
          | and statistics; for example, stat.ML and cs.LG on arXiv.
         | 
         | Machine learning and statistics are closely related fields,
         | both historically and in current practice and methodology.
        
           | sgt101 wrote:
           | My working definition is that statisticians choose and
           | engineer models while machine learning searches a vast space
           | of models.
        
             | nightski wrote:
             | That doesn't seem right to me. In both cases you have a
             | model and are just searching for optimal parameters
             | considering the bias/variance tradeoff. There may be a few
             | instances of a neural network or other ML model being set
             | up to dynamically change it's architecture during training
             | but that seems to me like it would not work out well at
             | all.
             | 
             | If anything in specific cases the statistical model (if
             | Bayesian) is more comprehensive in that it doesn't try to
             | find a point estimate of the parameters but instead forms a
             | full distribution around the plausibility of the
             | parameters.
        
               | sgt101 wrote:
                | So I was thinking like this - if you have a data set with
                | 200,000 variables and 70,000 examples, how would you find
                | a model without using a search process? On the other
                | hand, if you have 20 variables, and/or if you understand
                | which of the 200,000 variables are the ones to worry
                | about, then you can build a model by hand. I also guess
                | that statisticians are working to summarise or create
                | insight about data, while ML is working to create a
                | prediction (although statistical models can be used to do
                | that too).
        
               | sjg007 wrote:
               | You can have Bayesian DNNs.
        
               | whimsicalism wrote:
                | Not easily. I don't think the literature on this is very
                | compelling - how do you define a prior over all of the
                | parameters of an NN?
        
               | sjg007 wrote:
               | There's a vast literature you can read.
        
               | whimsicalism wrote:
               | I have read some of the literature - ie. the bayes by
               | backprop method, etc.
               | 
               | It doesn't seem like getting a posterior over the
               | parameter space of a neural network is tractable as of
               | now.
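                | 
                | For what it's worth, the one prior that is easy to write
                | down is an independent Gaussian on every weight. A toy
                | NumPy sketch (my own, with made-up shapes and data) of
                | why that alone only buys you MAP estimation - i.e. the
                | usual loss plus weight decay - rather than a tractable
                | posterior:
                | 
                |   import numpy as np
                | 
                |   # tiny 2-10-1 net, all parameters in one flat vector w
                |   def predict(w, X):
                |       W1, b1 = w[:20].reshape(2, 10), w[20:30]
                |       W2, b2 = w[30:40].reshape(10, 1), w[40]
                |       return (np.tanh(X @ W1 + b1) @ W2).ravel() + b2
                | 
                |   def log_prior(w, sigma=1.0):
                |       # independent N(0, sigma^2) on every weight:
                |       # log p(w) = -||w||^2 / (2 sigma^2) + const
                |       return -0.5 * np.sum(w**2) / sigma**2
                | 
                |   def log_lik(w, X, y, noise=0.1):
                |       # Gaussian observation model, regression target
                |       r = y - predict(w, X)
                |       return -0.5 * np.sum(r**2) / noise**2
                | 
                |   def log_posterior(w, X, y):
                |       # unnormalized log posterior; maximizing it is MAP,
                |       # i.e. ordinary training with an L2 penalty. The
                |       # hard part is the full posterior over w, which
                |       # MCMC / VI / bayes-by-backprop only approximate.
                |       return log_lik(w, X, y) + log_prior(w)
                | 
                |   rng = np.random.default_rng(0)
                |   X = rng.normal(size=(100, 2))
                |   y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
                |   print(log_posterior(np.zeros(41), X, y))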
        
             | sdenton4 wrote:
             | Another way I've heard it framed is that statistics cares
             | about the parameters, and ML cares about the answers.
        
               | fighterpilot wrote:
               | ML cares about predictive generalization, stats cares
               | about understanding and interpreting drivers
        
               | gmfawcett wrote:
               | That framing seems a wee bit dismissive of statistics,
               | esp. applied stats -- I'll hazard a guess that this came
               | from the ML camp. :)
        
               | srean wrote:
                | There is nothing dismissive about it; they are just
                | different things.
               | 
               | If we have strong theoretical understanding of the
               | physics/model of a problem, but are unsure about some
               | parameters, it makes sense to develop methods to find
               | those unknown parameters accurately. This ability is a
               | big deal in traditional statistics. People working in
                | this field try to prove that their proposed method can
                | actually do this. If the model happens to be
               | largely correct and the method robust, this even allows
               | prediction.
               | 
               | Often, however, we do not know the model and the
               | 'parameter' is a piece of fiction anyway. If we are
                | interested in prediction alone, it's fine to let go of
                | the ability to accurately estimate the parameters as long
                | as predictions are accurate. Think epicycle models of
               | planetary motion. ML folks try to prove that their
               | methods have good prediction properties and are happy to
               | sacrifice on parameter recovery.
               | 
               | Nonparametric statistics and prequential statistics are
               | somewhere in the middle.
               | 
                | Sometimes it does seem, though, that people are trying to
                | outdo each other, coming up with methods that estimate
                | the spectrum of a unicorn's rainbow more accurately than
                | the best known result in the research literature. This
                | may look odd because the unicorn and its rainbow are
                | pieces of fiction.
        
               | gmfawcett wrote:
               | Fair points made, and your unicorn analogy is wonderful.
               | :)
        
       | bjornsing wrote:
       | I'm sorely missing Maximum Likelihood Estimation (MLE). It's a
       | statistical technique that goes back to Gauss and Laplace but was
       | popularized by Fisher. In AI/ML it's often referred to as
       | "minimizing cross-entropy loss", but this is just a
       | misappropriation / reinvention of the wheel. The math is the same
       | and MLE is a much more sane theoretical framework.
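        | 
        | A quick numerical illustration (toy NumPy, made-up numbers): for
        | a Bernoulli/logistic model, the "binary cross-entropy loss" that
        | the frameworks minimize is exactly the negative log-likelihood,
        | so minimizing one is maximizing the other.
        | 
        |   import numpy as np
        | 
        |   y = np.array([1, 0, 1, 1, 0])            # observed labels
        |   p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # predicted P(y=1)
        | 
        |   # Bernoulli likelihood of the data, then its negative log
        |   likelihood = np.prod(p**y * (1 - p)**(1 - y))
        |   neg_log_lik = -np.log(likelihood)
        | 
        |   # the "binary cross-entropy loss" a DL framework computes
        |   bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        | 
        |   print(neg_log_lik, bce)   # same number, two names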
        
         | whimsicalism wrote:
         | What are you missing about it? MLE is the bread and butter of
         | any deep learning architecture - it is how you train the
         | network!
         | 
         | e: Ah I'm a dunce - missing from the article.
        
         | MontyCarloHall wrote:
         | "Cross entropy" specifically refers to the log-likelihood
         | function of a binary random variable, and is only used as the
         | cost function for binary classifiers. It does not refer to
         | likelihood functions in general.
        
           | ansk wrote:
           | Do people not google terms before trying to speak
           | authoritatively on a topic they aren't familiar with? The
            | original commenter is correct: cross entropy is a generic
            | measure of the discrepancy between two probability
            | distributions - in the case of maximum likelihood estimation,
            | these are the data distribution and the distribution of the
            | learned model.
        
             | MontyCarloHall wrote:
             | You are incorrect.
             | 
              | For a given probability distribution parameterized by θ
              | with probability mass/density p(x|θ), the likelihood of θ
              | given a set of data X = {x_1, ..., x_n} (assuming X is
              | independently/identically distributed) is simply the
              | product of independent probabilities,
              | 
              | L(θ|X) = ∏_{i=1}^n p(x_i|θ)
              | 
              | Maximizing this product with respect to θ yields the
              | maximum likelihood estimate of θ. Since sums are generally
              | easier to work with than products, and log is a monotonic
              | function, we generally work with the log-likelihood
              | function
              | 
              | log L(θ|X) = Σ_{i=1}^n log p(x_i|θ)
              | 
              | since the log-likelihood will achieve its maximum for the
              | same value of θ as the likelihood.
              | 
              | The cross entropy of two discrete probability distributions
              | p and q is
              | 
              | H(p, q) = -Σ_{i=1}^n p_i log q_i
              | 
              | (For continuous distributions, replace the sum with an
              | integral.)
              | 
              | This is completely unrelated to the generic log-likelihood
              | function defined above. The two are only related if p
              | happens to be the probability distribution of a binary
              | random variable x ∈ {0, 1} with probability π of equalling
              | 1:
              | 
              | p(x|π) = π^x (1-π)^(1-x)
              | 
              | Its log-likelihood is therefore
              | 
              | x log π + (1-x) log(1-π)
              | 
              | which for this particular case is (up to sign) a cross
              | entropy. Note that this is the log-likelihood of a single
              | observation in a single class; for multiple
              | observations/multiple classes, we sum across them, e.g.
              | 
              | Σ_{i=1}^n [x_i log π_i + (1-x_i) log(1-π_i)]
              | 
              | for a single observation across n total classes.
              | 
              | But again, the relationship to cross entropy only holds for
              | this particular choice of p. It is not generally the case
              | that the generic log-likelihood function,
              | 
              | log L(θ|X) = Σ_{i=1}^n log p(x_i|θ)
              | 
              | is a cross entropy!
        
               | contravariant wrote:
                | You can take the cross entropy between the probability
                | distribution and the Dirac delta distribution for the
                | actual data. This recovers the log-likelihood (up to sign
                | and scale).
               | 
               | Things get a little iffy with continuous probability
               | distributions, but that's just because both your cross-
               | entropy and your MLE estimate will depend on your choice
               | of variables if you don't pick a prior. Just as for MLE
               | you can blindly plug in the probability density and it'll
               | work just fine.
        
               | whimsicalism wrote:
               | > your MLE estimate will depend on your choice of
               | variables if you don't pick a prior.
               | 
               | If you're doing MLE, then you don't have a prior (or
               | rather, you have a uniform prior over the parameter(s) of
               | interest).
        
               | MontyCarloHall wrote:
                | True! Given a ~Dirac comb~ mixture of Dirac distributions
                | 
                | c(x) = 1/n Σ_{i=1}^n δ(x - x_i)
                | 
                | and some function f, you can express the sum of f over
                | the x_i as
                | 
                | Σ_{i=1}^n f(x_i) = n ∫_{-∞}^{∞} f(x') c(x') dx'
                | 
                | If f were a log probability, this would indeed be a
                | (continuous) cross entropy, up to sign:
                | 
                | 1/n Σ_i log p(x_i|θ) = ∫_{-∞}^{∞} log p(x'|θ) c(x') dx'
                | 
                | However, this isn't generally how we think about
                | likelihood functions, since there is nothing gained from
                | expressing a simple sum of log probability densities in
                | terms of a mixture of Dirac deltas. Indeed, every ML
                | text/paper I've read only ever refers to "cross entropy"
                | in the context of the cost function for one-hot
                | categorical random variables, since the formula for cross
                | entropy is immediately present in the likelihood
                | function. Cost functions involving other random variables
                | are simply called "cost functions" or just "likelihoods"
                | if the author comes from a stats background.
        
               | dkshdkjshdk wrote:
               | > Given a Dirac comb
               | 
                | > c(x) = 1/n Σ_{i=1}^n δ(x - x_i)
               | 
               | Sorry for the pedantry, but a mixture of Dirac
               | distributions is almost always _not_ a Dirac comb. Notice
               | that a mixture of Dirac distributions is a Dirac comb
               | only if you have an _infinite_ number of equally-
               | separated samples (and empirical distributions tend to
               | have a finite number of samples).
        
               | ansk wrote:
               | You've correctly shown that maximizing the likelihood is
               | equivalent to minimizing cross entropy in the discrete
               | case, but frankly that is unrelated to your claim that
                | the equivalence doesn't hold in the general case. As
               | noted in the sibling comment, the generalization to the
               | continuous case is evident when viewing the empirical
                | data distribution as a mixture of Dirac densities.
        
               | whimsicalism wrote:
               | Hm, I never really thought about it this way - but I
               | guess it does generalize to continuous space in a pretty
               | natural way.
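                | 
                | A toy discrete check of that view (my own NumPy sketch,
                | made-up numbers): put the empirical distribution on one
                | side of the cross-entropy formula and the model on the
                | other, and you recover the average negative
                | log-likelihood.
                | 
                |   import numpy as np
                | 
                |   data = np.array([0, 2, 2, 1, 0, 2])  # draws, 3 outcomes
                |   q = np.array([0.2, 0.3, 0.5])        # model probabilities
                | 
                |   # empirical distribution (the mixture of point masses)
                |   p_hat = np.bincount(data, minlength=3) / len(data)
                | 
                |   # H(p_hat, q) = -sum_k p_hat[k] log q[k]
                |   cross_entropy = -np.sum(p_hat * np.log(q))
                | 
                |   # average negative log-likelihood of the model
                |   avg_nll = -np.mean(np.log(q[data]))
                | 
                |   print(np.isclose(cross_entropy, avg_nll))  # True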
        
       | ehw3 wrote:
       | > 2. John Tukey (1977). Exploratory Data Analysis.
       | 
       | > This book has been hugely influential and is a fun read that
       | can be digested in one sitting.
       | 
       | Wow. The PDF is over 700 pages. That seems fairly impressive for
       | single-sitting digestion.
        
       | hyttioaoa wrote:
       | "Generalized adversarial networks, or GANs, are a conceptual
       | advance that allow reinforcement learning problems to be solved
       | automatically." -
       | 
       | "Generalized" :D Also the description is nonsense. This has
       | nothing to do with reinforcement learning. Makes me wonder about
       | the rest.
        
         | [deleted]
        
         | totoglazer wrote:
         | The paper has it right, at least.
        
         | cscurmudgeon wrote:
         | Now, if a press release from a top univ is so wrong on
         | something that is easily checkable, how accurate are other
         | forms of news?
        
           | [deleted]
        
           | nerdponx wrote:
           | Think of "press release wrongness" with a probability
           | distribution. Some press releases are really good, some are
           | really bad. A sensible prior would be somewhere in the
           | middle. If you start to see a lot of bad press releases, then
           | you can update your posterior towards "I can't trust any of
           | these."
        
             | bee_rider wrote:
             | Is that a good prior? I expect due to Dunning-Kruger that
             | the willingness to produce an article on a topic would
             | follow a pretty intense bimodal distribution.
        
       | 317070 wrote:
       | > Generative adversarial networks, or GANs, are a conceptual
       | advance that allow reinforcement learning problems to be solved
       | automatically. They mark a step toward the longstanding goal of
       | artificial general intelligence while also harnessing the power
       | of parallel processing so that a program can train itself by
       | playing millions of games against itself. At a conceptual level,
       | GANs link prediction with generative models.
       | 
       | What? Every sentence here is so wrong I have a hard time seeing
       | what kind of misunderstanding would lead to this.
       | 
        | GANs are a conceptual advance in generative models (i.e. models
        | that can generate more, similar data). Reinforcement learning is
        | a separate field. Parallel processing is ubiquitous, and has
        | nothing to do with GANs or reinforcement learning (they are both
        | usually pretty parallelized). Self-play sounds like they wanted
        | to talk about the AlphaGo/AlphaZero papers? And GANs are
        | infamously not really predictive/discriminative. If anything,
        | they thoroughly disconnected prediction from generative models.
        
         | whimsicalism wrote:
          | > GANs are a conceptual advance in generative models (i.e.
         | models that can generate more, similar data).
         | 
         | This is something I've long had confusion with, coming from a
         | probabilistic perspective.
         | 
         | How does a GAN model the joint probability of the data? My
         | understanding was that was what a generative model does. There
         | doesn't seem to be a clear probabilistic interpretation of a
         | GAN whatsoever.
        
           | gyom wrote:
           | Part of the cleverness of GANs was to have found a way to
           | train a neural network that generates data without explicitly
           | modeling the probability density.
           | 
            | In a stats textbook, when you know that your training data
            | comes from a normal distribution, you can maximize the
            | likelihood w.r.t. the parameters to get the MLE, and then use
            | that for sampling. That's basic theory.
           | 
           | In practice, it was very hard to learn a good pdf for
           | experimental data when you had a training set of images. GANs
           | provided a way to bypass this.
           | 
           | Of course, people could have said "hey let's generate samples
           | without maximizing a loglikelihood first", but they didn't
           | know how to do it properly, how to train the network in any
           | other way besides minimizing cross-entropy (which is
           | equivalent to maximizing loglikelihood).
           | 
            | Then GANs provided a new loss function that could actually be
            | optimized. Total paradigm shift!
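            | 
            | Roughly what that looks like in code (a toy PyTorch sketch of
            | my own, 1-D Gaussian "real" data, made-up sizes and
            | hyperparameters): note there is no density or log-likelihood
            | anywhere, just a generator trained to fool a discriminator.
            | 
            |   import torch
            |   import torch.nn as nn
            | 
            |   G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
            |   D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
            |   opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
            |   opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
            |   bce = nn.BCEWithLogitsLoss()
            |   ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)
            | 
            |   for step in range(2000):
            |       real = 3.0 + 2.0 * torch.randn(64, 1)  # "data": N(3, 4)
            |       fake = G(torch.randn(64, 8))           # generated samples
            | 
            |       # discriminator: label real as 1, generated as 0
            |       d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
            |       opt_d.zero_grad()
            |       d_loss.backward()
            |       opt_d.step()
            | 
            |       # generator: make the discriminator call fakes real
            |       g_loss = bce(D(fake), ones)
            |       opt_g.zero_grad()
            |       g_loss.backward()
            |       opt_g.step()
            | 
            |   # the generated samples should drift toward the data
            |   # distribution, with no density ever written down
            |   print(G(torch.randn(10000, 8)).mean().item())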
        
             | whimsicalism wrote:
              | I'm on board with all of this; I think even before GANs it
              | was becoming popular to optimize a loss that wasn't
              | necessarily a log-likelihood.
             | 
             | But I'm confused by the usage of the phrase generative
             | model, which I took to always mean a probabilistic model of
             | the joint that can be sampled over. I get that GANs
             | _generate_ data samples, but it seems different.
        
               | hervature wrote:
               | This is the problem when people use technical terms
               | loosely and interchangeably with their English
               | definitions. Generative model classifiers are precisely
               | as you describe. They model a joint distribution that one
               | can sample.
               | 
                | GANs cannot even fit this definition because a GAN is not
                | a classifier. It is composed of a generator and a
                | discriminator. The discriminator is a discriminative
                | classifier. The generator is, well, a generator. It has
                | nothing to do with generative model classifiers. Then you
               | get some variation of neural network generator > model
               | that generates > generative model. This leads to
               | confusion.
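                | 
                | A tiny example of the stricter sense of "generative
                | model" (toy Gaussian naive Bayes in NumPy, my own sketch
                | with made-up data): it models the joint p(x, y), so you
                | can both classify via Bayes' rule and sample new (x, y)
                | pairs, which a GAN's generator/discriminator pair does
                | not give you.
                | 
                |   import numpy as np
                | 
                |   rng = np.random.default_rng(0)
                |   # training data: one Gaussian feature per class
                |   X0 = rng.normal(-2.0, 1.0, size=200)
                |   X1 = rng.normal(+2.0, 1.0, size=200)
                | 
                |   # fit the joint p(x, y) = p(y) p(x | y)
                |   prior = np.array([0.5, 0.5])
                |   mu = np.array([X0.mean(), X1.mean()])
                |   sd = np.array([X0.std(), X1.std()])
                | 
                |   def gauss(x, m, s):
                |       z = (x - m) / s
                |       return np.exp(-0.5 * z**2) / (s * np.sqrt(2 * np.pi))
                | 
                |   # classify: p(y | x) by Bayes' rule
                |   post = prior * gauss(0.7, mu, sd)
                |   print("p(y | x=0.7):", post / post.sum())
                | 
                |   # sample from the joint: draw y, then x given y
                |   ys = rng.choice(2, size=5, p=prior)
                |   xs = rng.normal(mu[ys], sd[ys])
                |   print(list(zip(ys.tolist(), xs.round(2).tolist())))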
        
         | gyom wrote:
         | You're right that this is spectacularly wrong.
         | 
         | I dare not even read the rest of the page just in case my brain
         | accidentally absorbs other bad information like that paragraph
         | about GANs.
        
         | [deleted]
        
       | master_yoda_1 wrote:
        | Half of these are relevant to small-data problems, which is not
        | exactly what we mean when we say AI.
        
         | dkshdkjshdk wrote:
          | What _do_ you mean when you say AI? I'm curious.
         | 
         | As far as I can tell, most people (e.g., whoever wrote this
         | article) seem to use AI as a synonym for "machine learning",
         | basically.
        
       ___________________________________________________________________
       (page generated 2021-07-07 23:01 UTC)