[HN Gopher] Ideas in statistics that have powered AI
___________________________________________________________________
Ideas in statistics that have powered AI
Author : MAXPOOL
Score : 156 points
Date : 2021-07-07 13:33 UTC (9 hours ago)
(HTM) web link (news.columbia.edu)
(TXT) w3m dump (news.columbia.edu)
| vcdimension wrote:
| I'm surprised they didn't mention support vector machines and the
| kernel trick which was discovered by statisticians.
| srean wrote:
| Although Vapnik's treatise is called Statistical Learning
| Theory, neither statisticians nor Vapnik himself consider him
| a statistician. In fact, his proposals were quite a radical
| departure from the established norms of contemporary
| statistics. The same holds for Corinna Cortes.
|
| The kernel 'trick', the representer theorem, etc. are far
| older and have their origins in functional analysis.
| JHonaker wrote:
| I highly doubt that a person with a PhD in statistics doesn't
| identify as a statistician.
| mturmon wrote:
| Look at Vapnik's publications and affiliations around this
| time (1990 to 1997). He was working at a research group at
| ATT Labs with people like Corinna Cortes, Leon Bottou, and
| Yann LeCun. He had just come over from Russia, as did many
| technically proficient Russian Jews in these years.
|
| None of the people in this ATT Labs milieu were associated
| with the Statistics community. They didn't go to Stats
| conferences, and they didn't publish in Stats journals. I
| don't believe that any of them were formally trained in
| what you might call conventional statistics approaches of
| the time, i.e., no Stanford PhD, no JASA or Ann. Stat.
| publications.
|
| They knew _about_ that stuff! But they were approaching
| problems like digit recognition (which are not amenable to
| model-based statistics) from a more applied math/physics
| point of view. Vapnik's work that appeared in English at
| this time introduces learning as a form of Tikhonov
| regularization applied to a loss function. Not as a type of
| maximum likelihood, not as a riff on Bayesian inference.
| And his SVM work introduces the kernel trick as an
| interpretation of Mercer's theorem -- a very applied math
| motivation.
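|
| To make that last point concrete, here is a minimal sketch,
| in generic regularized-risk notation rather than anything
| taken from Vapnik's own papers, of "learning as Tikhonov
| regularization applied to a loss function", together with
| the kernel expansion the representer theorem gives for the
| minimizer:
|
|     \min_{f \in \mathcal{H}_k} \;
|         \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)
|         + \lambda \, \lVert f \rVert_{\mathcal{H}_k}^{2},
|     \qquad
|     f^{*}(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x)
|
| where \mathcal{H}_k is the reproducing-kernel Hilbert space
| of the kernel k, so the minimizer is a finite kernel
| expansion over the training points.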
|
| I knew Vladimir around this time, but I can't really say if
| he would have described himself as a "statistician" -- that
| would be hard for anyone to know. But I can say that he and
| his closest colleagues were not part of the Stats community
| of the time.
|
| On the other side, very few Stats authorities were deeply
| interested in this stuff. Leo Breiman, of course, Andrew
| Barron, Art Owen, Trevor Hastie, Rob Tibshirani. It's the
| Stats community's loss that so few recognized the value of
| these approaches.
| srean wrote:
| Beautifully summarized!
| sjg007 wrote:
| Some may identify as mathematicians.
| srean wrote:
| The only way to dispute that would be to appeal to
| authority, so I will avoid doing so. Perhaps there was a
| time when he did identify as a mathematical/theoretical
| statistician, but his contributions were a dramatic break
| from the norms of statistical practice at the time.
|
| I would argue that his contributions were central to the
| birth of rigorous machine learning (the non deep learning
| kind) as a field of its own with a focus that's different
| from that of statistics.
|
| One quantitative test I can suggest: count the number of
| occurrences of the word 'statistics' in the names of the
| journals and conferences he has published in, and compare
| that with the number of publications he has authored. My
| sense is that the ratio will be close to 0 and getting
| closer (if it isn't there already).
| dkshdkjshdk wrote:
| > One quantitative test I can suggest: count the number of
| occurrences of the word 'statistics' in the names of the
| journals and conferences he has published in, and compare
| that with the number of publications he has authored. My
| sense is that the ratio will be close to 0 and getting
| closer (if it isn't there already).
|
| I'm not sure this "quantitative test" is the best
| approach... after all, I'm pretty sure "Biometrika" and
| "Econometrica" don't have "statistics" in their names.
| srean wrote:
| Fair point.
| eachro wrote:
| Putting aside discovery, I've always considered SVMs to be in
| the realm of optimization rather than ML or statistics (though
| I suppose you could then also put modern deep learning under
| optimization too).
| dkshdkjshdk wrote:
| Why? No one uses SVM as a solver/optimization method (though
| you do need a solver/optimization method to train an SVM).
|
| Same with "modern deep learning" (whatever that may be): just
| because you need to optimize something doesn't make the field
| "optimization". Just because I'm using stochastic gradient
| descent (or some other optimization method) in the course of
| my work, doesn't mean that I'm working in the field of
| Optimization.
| eachro wrote:
| To really understand it (primal, dual formulations) you
| need tools from convex optimization. So it doesn't really
| feel appropriate to teach it in a standard machine learning
| class (unless you just toss out the details). In
| optimization classes, you go through tons of different
| applications of the methods you learn about: SVM slots in
| perfectly there. It hits duality, quadratic programming,
| even gradient descent (Pegasos).
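|
| As an aside, Pegasos itself is a nice illustration of how
| directly the primal SVM objective maps onto optimization
| machinery. Below is a rough, illustrative numpy sketch of a
| Pegasos-style stochastic sub-gradient update on the
| hinge-loss + L2 objective - no bias term, no projection
| step, and toy data/hyperparameters invented for the example:
|
|     import numpy as np
|
|     def pegasos_svm(X, y, lam=0.1, n_iters=10000, seed=0):
|         """Stochastic sub-gradient descent on the primal SVM
|         objective lam/2 * ||w||^2 + mean hinge loss.
|         Labels y must be in {-1, +1}."""
|         rng = np.random.default_rng(seed)
|         n, d = X.shape
|         w = np.zeros(d)
|         for t in range(1, n_iters + 1):
|             i = rng.integers(n)         # one random example
|             eta = 1.0 / (lam * t)       # decaying step size
|             margin = y[i] * X[i].dot(w)
|             w *= (1 - eta * lam)        # gradient of the L2 term
|             if margin < 1:              # hinge loss is active
|                 w += eta * y[i] * X[i]
|         return w
|
|     # Toy usage on linearly separable data.
|     rng = np.random.default_rng(1)
|     X = np.vstack([rng.normal(-2, 1, (50, 2)),
|                    rng.normal(2, 1, (50, 2))])
|     y = np.array([-1] * 50 + [1] * 50)
|     w = pegasos_svm(X, y)
|     print("train accuracy:", np.mean(np.sign(X @ w) == y))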
|
| Re deep learning: right, I was bringing up deep learning as
| a clear example of why you might not want to classify it
| under optimization. No one considers applications of deep
| learning to be optimization. However, work on the various
| optimizers (Adam, adagrad, second order methods, etc) which
| are all fundamental to doing any deep learning work would
| be firmly in the field of optimization.
| srean wrote:
| > ... you need tools from convex optimization. So it
| doesn't feel appropriate to teach it in a standard
| machine learning class.
|
| Those topics were prerequisites to taking ML at the grad
| level when I took them. You either had to have relevant
| courses in your bag or convince the prof that you could
| handle it.
| eachro wrote:
| Yea grad level absolutely (pretty much anything can fly
| at the grad level). Undergrad? Maybe we should teach it
| b/c of the historical importance it has to the field and
| how the community developed but I really do think most ML
| classes would be better off without it b/c of the extra
| background you'd have to use precious time on. Kernel
| PCA, kernel regression are better for demonstrating the
| power of kernels.
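|
| A hedged illustration of that point, assuming scikit-learn
| is available: the same KernelRidge estimator with a linear
| vs. an RBF kernel on data a linear model cannot fit. Data
| and hyperparameters are invented for the example:
|
|     import numpy as np
|     from sklearn.kernel_ridge import KernelRidge
|
|     rng = np.random.default_rng(0)
|     X = rng.uniform(-3, 3, (200, 1))
|     y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)
|
|     # A linear kernel underfits; an RBF kernel captures the curve.
|     for kernel in ["linear", "rbf"]:
|         model = KernelRidge(kernel=kernel, alpha=0.1).fit(X, y)
|         print(kernel, "R^2:", round(model.score(X, y), 3))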
|
| I suppose the idea of a maximum separating hyperplane is
| kind of unique to SVMs and if you just teach SVMs through
| the primal and leave it at that, you don't need to spend
| all that much time motivating the dual.
| dkshdkjshdk wrote:
| > To really understand it (primal, dual formulations) you
| need tools from convex optimization. So it doesn't really
| feel appropriate to teach it in a standard machine
| learning class (unless you just toss out the details).
|
| Sure. But to understand it, you probably also need to
| know a bit about arithmetic, algebra, geometry, etc.
| Still, you wouldn't say that SVMs belong to these fields,
| even though these fields are probably a requirement if
| you want to understand SVMs.
|
| > So it doesn't really feel appropriate to teach it in a
| standard machine learning class (unless you just toss out
| the details).
|
| If the people you are talking about already had an
| optimization class (including convex optimization), then
| it should be appropriate to teach it using those
| formalisms, no?
|
| Another example: you're not going far in understanding
| Schroedinger's equation if you don't have the necessary
| linear algebra basics. Does that make Schroedinger's
| equation part of linear algebra?
|
| > It hits duality, quadratic programming, even gradient
| descent (Pegasos).
|
| Sure... then it's a machine learning topic that is good for
| refreshing your knowledge of optimization and linear
| algebra. It still feels kinda weird if you're going
| to introduce people to SVM in the context of an
| Optimization class (other than possibly as an example of
| a specific optimization problem, or as an application of
| specific optimization methods).
|
| > However, work on the various optimizers (Adam, adagrad,
| second order methods, etc) which are all fundamental to
| doing any deep learning work would be firmly in the field
| of optimization.
|
| Exactly. If you're doing _that_, then you _are_ doing
| research in Optimization, and not research in "deep
| learning", as far as I'm concerned. But, let's face it...
| those types of papers are a minority in the field.
| [deleted]
| andyxor wrote:
| ..aand on the same page there is a link promoting Critical Race
| Theory.
|
| It's kind of like coming to listen to math lecture and seeing
| swastika signs on the walls.
| bmc7505 wrote:
| https://statmodeling.stat.columbia.edu/2020/12/09/what-are-t...
| heinrichhartman wrote:
| Out of the 10 papers, I am able to download 3 freely.
|
| - For the papers I am quoted 26EUR - 39EUR
|
| - For the books I am quoted 129EUR - 133EUR
|
| This is audacious. Some of these papers are from the '70s,
| and I highly doubt that the authors get any royalties from
| those sales.
| ur-whale wrote:
| sci-hub FTW
|
| why would you want to feed the parasites?
| the_svd_doctor wrote:
| Authors never get _any_ royalties from paper sales as far as I
| know :) (for books maybe).
| nolroz wrote:
| <donates to sci-hub>
| sgt101 wrote:
| How have they attributed GANs and Deep Learning to Statistics? I
| thought Goodfellow was doing an AI PhD and that Hinton is a
| biologically inspired / neuroscience fellow?
| 6gvONxR4sf7o wrote:
| The only easy real division between stats and ML is in
| universities, where it's just a question of which department.
| If it's CS, then it's ML or AI. If it's stats, it's stats.
|
| If it's industry, it's whatever the marketing department
| decides, inevitably AI :(
| sjg007 wrote:
| Or data science.
| fighterpilot wrote:
| See the table in this link for a humorous comparison by
| Tibshirani:
|
| https://brenocon.com/blog/2008/12/statistics-vs-machine-lear...
| MAXPOOL wrote:
| Deep learning models are statistical and probabilistic
| models. You can categorize deep learning under both computer
| science and statistics; for example, stat.ML and cs.LG on
| arXiv.
|
| Machine learning and statistics are closely related fields,
| both historically and in current practice and methodology.
| sgt101 wrote:
| My working definition is that statisticians choose and
| engineer models while machine learning searches a vast space
| of models.
| nightski wrote:
| That doesn't seem right to me. In both cases you have a
| model and are just searching for optimal parameters
| considering the bias/variance tradeoff. There may be a few
| instances of a neural network or other ML model being set
| up to dynamically change its architecture during training
| but that seems to me like it would not work out well at
| all.
|
| If anything in specific cases the statistical model (if
| Bayesian) is more comprehensive in that it doesn't try to
| find a point estimate of the parameters but instead forms a
| full distribution around the plausibility of the
| parameters.
| sgt101 wrote:
| So I was thinking like this: if you have a data set with
| 200,000 variables and 70,000 examples, how would you find a
| model without using a search process? On the other hand, if
| you have 20 variables, and/or if you understand which of the
| 200,000 variables are the ones to worry about, then you can
| build a model by hand. I guess also that statisticians are
| working to summarise or create insight about data, while ML
| is working to create a prediction (although statistical
| models can be used to do that too).
| sjg007 wrote:
| You can have Bayesian DNNs.
| whimsicalism wrote:
| Not easily. I don't find the literature on this very
| convincing - how do you define a prior over all of the
| parameters of an NN?
| sjg007 wrote:
| There's a vast literature you can read.
| whimsicalism wrote:
| I have read some of the literature - i.e., the Bayes by
| Backprop method, etc.
|
| It doesn't seem like getting a posterior over the
| parameter space of a neural network is tractable as of
| now.
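|
| For what it's worth, defining the prior is the easy part:
| the usual setup in that line of work is an independent
| Gaussian over every weight and bias. It's the posterior
| that's the problem, as you say. A toy numpy sketch of such
| a log-prior - names and shapes are purely illustrative:
|
|     import numpy as np
|
|     def log_gaussian_prior(params, sigma=1.0):
|         """Log-density of an independent N(0, sigma^2) prior
|         over every parameter, treated as one flat vector."""
|         theta = np.concatenate([p.ravel() for p in params])
|         d = theta.size
|         return (-0.5 * np.sum(theta ** 2) / sigma ** 2
|                 - 0.5 * d * np.log(2 * np.pi * sigma ** 2))
|
|     # Parameters of a toy two-layer network.
|     rng = np.random.default_rng(0)
|     params = [rng.normal(size=(10, 5)), rng.normal(size=5),
|               rng.normal(size=(5, 1)), rng.normal(size=1)]
|     print(log_gaussian_prior(params))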
| sdenton4 wrote:
| Another way I've heard it framed is that statistics cares
| about the parameters, and ML cares about the answers.
| fighterpilot wrote:
| ML cares about predictive generalization, stats cares
| about understanding and interpreting drivers
| gmfawcett wrote:
| That framing seems a wee bit dismissive of statistics,
| esp. applied stats -- I'll hazard a guess that this came
| from the ML camp. :)
| srean wrote:
| There is nothing dismissive about it, they are just
| different things.
|
| If we have strong theoretical understanding of the
| physics/model of a problem, but are unsure about some
| parameters, it makes sense to develop methods to find
| those unknown parameters accurately. This ability is a
| big deal in traditional statistics. People working in
| this field try to prove that their proposed methods can
| actually do this. If the model happens to be
| largely correct and the method robust, this even allows
| prediction.
|
| Often, however, we do not know the model and the
| 'parameter' is a piece of fiction anyway. If we are
| interested in prediction alone, it's fine to let go of the
| ability to accurately estimate the parameters as long as
| predictions are accurate. Think epicycle models of
| planetary motion. ML folks try to prove that their
| methods have good prediction properties and are happy to
| sacrifice on parameter recovery.
|
| Nonparametric statistics and prequential statistics are
| somewhere in the middle.
|
| Sometimes it does seem, though, that people are trying to
| outdo each other, coming up with methods that estimate the
| spectrum of a unicorn's rainbow more accurately than the
| best known result in the research literature. This may look
| odd, because the unicorn and its rainbow are pieces of
| fiction.
| gmfawcett wrote:
| Fair points made, and your unicorn analogy is wonderful.
| :)
| bjornsing wrote:
| I'm sorely missing Maximum Likelihood Estimation (MLE). It's a
| statistical technique that goes back to Gauss and Laplace but was
| popularized by Fisher. In AI/ML it's often referred to as
| "minimizing cross-entropy loss", but this is just a
| misappropriation / reinvention of the wheel. The math is the same
| and MLE is a much more sane theoretical framework.
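|
| A quick numeric check of the "same math" claim for the
| Bernoulli case - a purely illustrative numpy snippet with
| made-up labels and predictions:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     y = rng.integers(0, 2, size=1000)              # labels in {0, 1}
|     p = np.clip(rng.random(1000), 1e-9, 1 - 1e-9)  # predicted P(y=1)
|
|     # Average negative log-likelihood of the Bernoulli model...
|     nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
|     # ...is exactly the "binary cross-entropy loss" of deep learning.
|     bce = -np.mean(np.where(y == 1, np.log(p), np.log(1 - p)))
|     assert np.isclose(nll, bce)
|     print(nll, bce)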
| whimsicalism wrote:
| What are you missing about it? MLE is the bread and butter of
| any deep learning architecture - it is how you train the
| network!
|
| e: Ah I'm a dunce - missing from the article.
| MontyCarloHall wrote:
| "Cross entropy" specifically refers to the log-likelihood
| function of a binary random variable, and is only used as the
| cost function for binary classifiers. It does not refer to
| likelihood functions in general.
| ansk wrote:
| Do people not google terms before trying to speak
| authoritatively on a topic they aren't familiar with? The
| original commenter is correct: cross entropy is a generic
| measure of the discrepancy between two probability
| distributions - in the case of maximum likelihood
| estimation, these are the empirical data distribution and
| the distribution of the learned model.
| MontyCarloHall wrote:
| You are incorrect.
|
| For a given probability distribution parameterized by θ,
| with probability mass/density p(x|θ), the likelihood of θ
| given a set of data X = {x_1, ..., x_n} (assuming the x_i
| are independent and identically distributed) is simply the
| product of the individual probabilities,
|
| L(θ|X) = ∏_{i=1}^n p(x_i|θ)
|
| Maximizing this product with respect to θ yields the
| maximum likelihood estimate of θ. Since sums are generally
| easier to work with than products, and log is a monotonic
| function, we usually work with the log-likelihood function
|
| log L(θ|X) = Σ_{i=1}^n log p(x_i|θ)
|
| since the log-likelihood achieves its maximum at the same
| value of θ as the likelihood.
|
| The cross entropy of two discrete probability distributions
| p and q is
|
| H(p, q) = -Σ_i p_i log q_i
|
| (For continuous distributions, replace the sum with an
| integral.)
|
| This is unrelated to the generic log-likelihood function
| defined above. The two are only related if p happens to be
| the distribution of a binary random variable x ∈ {0, 1}
| that equals 1 with probability θ:
|
| p(x|θ) = θ^x (1-θ)^(1-x)
|
| Its log-likelihood is therefore
|
| x log θ + (1-x) log(1-θ)
|
| which, for this particular case, is (up to sign) a cross
| entropy. Note that this is the log-likelihood of a single
| binary observation; for multiple observations (or multiple
| independent binary labels), we sum across them, e.g.
|
| Σ_{i=1}^n [x_i log θ_i + (1-x_i) log(1-θ_i)]
|
| But again, the relationship to cross entropy only holds for
| this particular choice of p. It is not generally the case
| that the generic log-likelihood function,
|
| log L(θ|X) = Σ_{i=1}^n log p(x_i|θ)
|
| is a cross entropy!
| contravariant wrote:
| You can take the cross entropy between the model's
| probability distribution and the Dirac-delta (empirical)
| distribution of the actual data. This equals the negative
| average log-likelihood.
|
| Things get a little iffy with continuous probability
| distributions, but that's just because both your cross-
| entropy and your MLE estimate will depend on your choice
| of variables if you don't pick a prior. Just as for MLE
| you can blindly plug in the probability density and it'll
| work just fine.
| whimsicalism wrote:
| > your MLE estimate will depend on your choice of
| variables if you don't pick a prior.
|
| If you're doing MLE, then you don't have a prior (or
| rather, you have a uniform prior over the parameter(s) of
| interest).
| MontyCarloHall wrote:
| True! Given a ~Dirac comb~ mixture of Dirac distributions
|
| c(x) = (1/n) Σ_{i=1}^n δ(x - x_i)
|
| and some function f, you can express the average of f over
| the x_i as
|
| (1/n) Σ_{i=1}^n f(x_i) = ∫_{-∞}^{∞} f(x) c(x) dx
|
| If f were a log probability, this would indeed be (the
| negative of) a continuous cross entropy:
|
| (1/n) Σ_i log p(x_i|θ) = ∫_{-∞}^{∞} log p(x|θ) c(x) dx
|
| However, this isn't generally how we think about
| likelihood functions, since there is nothing gained from
| expressing a simple sum of log probability densities in
| terms of a Dirac comb. Indeed, every ML text/paper I've
| read only ever refers to "cross entropy" in the context
| of the cost function for one-hot categorical random
| variables, since the formula for cross entropy is
| immediately present in the likelihood function. Cost
| functions involving other random variables are simply
| called "cost functions" or just "likelihoods" if the
| author comes from a stats background.
| dkshdkjshdk wrote:
| > Given a Dirac comb
|
| > c(x) = (1/n) Σ_{i=1}^n δ(x - x_i)
|
| Sorry for the pedantry, but a mixture of Dirac
| distributions is almost always _not_ a Dirac comb. Notice
| that a mixture of Dirac distributions is a Dirac comb
| only if you have an _infinite_ number of equally-
| separated samples (and empirical distributions tend to
| have a finite number of samples).
| ansk wrote:
| You've correctly shown that maximizing the likelihood is
| equivalent to minimizing cross entropy in the discrete
| case, but frankly that is unrelated to your claim that
| the equivalency doesn't hold in the general case. As
| noted in the sibling comment, the generalization to the
| continuous case is evident when viewing the empirical
| data distribution as a mixture of dirac densities.
| whimsicalism wrote:
| Hm, I never really thought about it this way - but I
| guess it does generalize to continuous space in a pretty
| natural way.
| ehw3 wrote:
| > 2. John Tukey (1977). Exploratory Data Analysis.
|
| > This book has been hugely influential and is a fun read that
| can be digested in one sitting.
|
| Wow. The PDF is over 700 pages. That seems fairly impressive for
| single-sitting digestion.
| hyttioaoa wrote:
| "Generalized adversarial networks, or GANs, are a conceptual
| advance that allow reinforcement learning problems to be solved
| automatically." -
|
| "Generalized" :D Also the description is nonsense. This has
| nothing to do with reinforcement learning. Makes me wonder about
| the rest.
| [deleted]
| totoglazer wrote:
| The paper has it right, at least.
| cscurmudgeon wrote:
| Now, if a press release from a top univ is so wrong on
| something that is easily checkable, how accurate are other
| forms of news?
| [deleted]
| nerdponx wrote:
| Think of "press release wrongness" with a probability
| distribution. Some press releases are really good, some are
| really bad. A sensible prior would be somewhere in the
| middle. If you start to see a lot of bad press releases, then
| you can update your posterior towards "I can't trust any of
| these."
| bee_rider wrote:
| Is that a good prior? I expect due to Dunning-Kruger that
| the willingness to produce an article on a topic would
| follow a pretty intense bimodal distribution.
| 317070 wrote:
| > Generative adversarial networks, or GANs, are a conceptual
| advance that allow reinforcement learning problems to be solved
| automatically. They mark a step toward the longstanding goal of
| artificial general intelligence while also harnessing the power
| of parallel processing so that a program can train itself by
| playing millions of games against itself. At a conceptual level,
| GANs link prediction with generative models.
|
| What? Every sentence here is so wrong I have a hard time seeing
| what kind of misunderstanding would lead to this.
|
| GANs are a conceptual advance in generative models (i.e. models
| that can generate more, similar data). Reinforcement learning is
| a separate field. Parallel processing is ubiquitous, and has
| nothing to do with GANs or reinforcement learning (they are both
| usually pretty parallelized). Self-play sounds like they wanted
| to talk about the AlphaGo/AlphaZero papers? And GANs are
| infamously not really predictive/discriminative. If anything,
| they thoroughly disconnected prediction from generative models.
| whimsicalism wrote:
| > GANs are a conceptual advance in generative models (i.e.
| models that can generate more, similar data).
|
| This is something I've long had confusion with, coming from a
| probabilistic perspective.
|
| How does a GAN model the joint probability of the data? My
| understanding was that was what a generative model does. There
| doesn't seem to be a clear probabilistic interpretation of a
| GAN whatsoever.
| gyom wrote:
| Part of the cleverness of GANs was to have found a way to
| train a neural network that generates data without explicitly
| modeling the probability density.
|
| In a stats textbook, when you know that your training data
| comes from a normal distribution, you can maximize the
| likelihood with respect to the parameters (MLE), and then
| sample from the fitted distribution. That's basic theory.
|
| In practice, it was very hard to learn a good pdf for
| experimental data when you had a training set of images. GANs
| provided a way to bypass this.
|
| Of course, people could have said "hey let's generate samples
| without maximizing a loglikelihood first", but they didn't
| know how to do it properly, how to train the network in any
| other way besides minimizing cross-entropy (which is
| equivalent to maximizing loglikelihood).
|
| Then GANs actually provided a new loss function that could be
| trained. Total paradigm shift!
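|
| To make "a new loss function that could be trained"
| concrete, here is a deliberately tiny PyTorch sketch of the
| two-player setup - architecture, data, and hyperparameters
| are arbitrary toy choices for illustration:
|
|     import torch
|     import torch.nn as nn
|
|     # "Real" data: samples from N(4, 0.5). Note that no density
|     # is ever written down for the generator's output distribution.
|     real_sampler = lambda n: 4 + 0.5 * torch.randn(n, 1)
|
|     G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
|                       nn.Linear(32, 1))   # noise -> sample
|     D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
|                       nn.Linear(32, 1))   # sample -> logit
|     opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
|     opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
|     bce = nn.BCEWithLogitsLoss()
|
|     for step in range(5000):
|         real = real_sampler(64)
|         fake = G(torch.randn(64, 8))
|
|         # Discriminator: real -> label 1, generated -> label 0.
|         d_loss = (bce(D(real), torch.ones(64, 1)) +
|                   bce(D(fake.detach()), torch.zeros(64, 1)))
|         opt_d.zero_grad(); d_loss.backward(); opt_d.step()
|
|         # Generator: try to fool the discriminator.
|         g_loss = bce(D(fake), torch.ones(64, 1))
|         opt_g.zero_grad(); g_loss.backward(); opt_g.step()
|
|     # Mean of generated samples should drift toward ~4.
|     print(G(torch.randn(1000, 8)).mean().item())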
| whimsicalism wrote:
| I'm on board with all of this, I think even before GANs it
| was becoming popular to optimize loss that wasn't
| necessarily a log likelihood.
|
| But I'm confused by the usage of the phrase generative
| model, which I took to always mean a probabilistic model of
| the joint that can be sampled over. I get that GANs
| _generate_ data samples, but it seems different.
| hervature wrote:
| This is the problem when people use technical terms
| loosely and interchangeably with their English
| definitions. Generative model classifiers are precisely
| as you describe. They model a joint distribution that one
| can sample.
|
| GANs cannot even fit this definition, because a GAN is not a
| classifier. It is composed of a generator and a
| discriminator. The discriminator is a discriminative
| classifier. The generator is, well, a generator. It has
| nothing to do with generative model classifiers. Then you
| get some variation of neural network generator > model
| that generates > generative model. This leads to
| confusion.
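|
| To pin down the terminology: below is a toy "generative
| model" in the classical sense - class-conditional Gaussians,
| where you can both sample and evaluate the joint density -
| in contrast to a GAN generator, which only gives you
| samples. Illustrative numpy/scipy sketch with invented
| parameters:
|
|     import numpy as np
|     from scipy.stats import multivariate_normal
|
|     # Classical generative model: p(x, y) = p(y) p(x|y).
|     priors = {0: 0.5, 1: 0.5}
|     means = {0: np.array([-2.0, 0.0]), 1: np.array([2.0, 0.0])}
|
|     def sample(n, rng):
|         ys = rng.integers(0, 2, n)
|         xs = np.array([rng.normal(means[y], 1.0) for y in ys])
|         return xs, ys
|
|     def joint_density(x, y):
|         return priors[y] * multivariate_normal(means[y],
|                                                np.eye(2)).pdf(x)
|
|     rng = np.random.default_rng(0)
|     xs, ys = sample(5, rng)
|     print(joint_density(xs[0], ys[0]))  # p(x, y) exists explicitly
|
|     # A GAN's generator, by contrast, is just a map noise -> sample:
|     # you get draws, but no p(x) you can evaluate.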
| gyom wrote:
| You're right that this is spectacularly wrong.
|
| I dare not even read the rest of the page just in case my brain
| accidentally absorbs other bad information like that paragraph
| about GANs.
| [deleted]
| master_yoda_1 wrote:
| Half of these are relevant to the small-data setting, which
| is not exactly what we mean when we say AI.
| dkshdkjshdk wrote:
| What _do_ you mean when you say AI? I'm curious.
|
| As far as I can tell, most people (e.g., whoever wrote this
| article) seem to use AI as a synonym for "machine learning",
| basically.
___________________________________________________________________
(page generated 2021-07-07 23:01 UTC)