[HN Gopher] An overview of the theory of overparameterized machi...
___________________________________________________________________
An overview of the theory of overparameterized machine learning
Author : sebg
Score : 98 points
Date : 2021-09-18 23:32 UTC (2 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| rawoke083600 wrote:
| One can say hard work beats talent in some cases, and in some
| cases talent beats hard work :P
|
| Talent: a "smarter/clever" model. Hard work: more and more
| parameters.
| derbOac wrote:
| I have to reread it but the model misspecification interpretation
| (that highly overparameterized models exhibit DD because they
| reduce misspecification error) is the only thing I've read in
| this area that makes some sense to me theoretically.
|
| I think DD is a huge issue for a number of fields and is really
| underappreciated. Without meaning to sound disrespectful, much of
| this literature seems a little superficial or dismissive, unaware
| of the broad implications of the claims often being made.
|
| This is because of the ties between information theory and
| statistics/modeling. In some sense, at least in the way I've
| thought about it, DD seems to imply some kind of violation of
| fundamental information theory, and comes across to me a bit as
| if someone in chemistry started claiming that some basic laws of
| thermodynamics didn't apply anymore. Basically, DD seems to imply
| that someone can extract more information from a string than the
| string contains. Put that way, it makes no sense, which is why I
| think this is such a hugely important issue.
|
| On the other hand, the empirical results are there, so figuring
| out what's going on is worthwhile and I have an open mind.
|
| This paper seems nice with the misspecification angle, because it
| is realistic and seems to open a path to some interpretations
| that might not violate some fundamental identities in IT.
| Misspecification (mismatched coding in IT) can lead to some weird
| phenomena that aren't always intuitive.
|
| Another thing in the paper that's made clear is that DD might not
| always happen, and it seems informative to figure out when that's
| the case.
|
| In the background I have to say I'm still skeptical of the
| empirical breadth of DD. These weird cases of ML failures due to
| subtle challenge inputs (the errors in identifying Obama based on
| positioning and ties (?), for example) seem to me like prime
| examples of overfitting. I still have a hunch that something
| about the training and test samples relative to the universe of
| actual intended samples is at play, or that the whole phenomenon
| of DD is misleading because the overfitting problem is really in
| terms of model flexibility versus data complexity, and not
| necessarily in terms of number of parameters per se versus sample
| size.
| yldedly wrote:
| DD disappears when doing Bayesian model averaging:
| https://arxiv.org/pdf/2002.08791.pdf It seems DD is a
| phenomenon specific to point estimates.
|
| >overfitting problem is really in terms of model flexibility
| versus data complexity, and not necessarily in terms of number
| of parameters per se versus sample size
|
| Yep, well put.
| unemphysbro wrote:
| Briefly skimmed the paper; I'm not an ML or math dude, but I'm
| familiar with the double descent problem.
|
| I wonder if there's a good way to measure information content in
| models and how it scales with model parameters, if there are any
| invariants, scaling laws that arise, etc. Reminds me of
| renormalization group methods which could be applied to a space
| of models, etc...
|
| Again, I'm no expert in any of this stuff, just arm-chairing
| away.
| marius_k wrote:
| Would it work to measure the size of the compressed model?
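|
| As a rough sketch of that idea (assuming the weights are plain
| NumPy arrays): compressing the serialized weights gives a crude
| upper bound on their information content, though raw float
| weights barely compress unless you quantize them first.
|
|     import io, zlib
|     import numpy as np
|
|     # Stand-in for trained weights; substitute the real arrays.
|     rng = np.random.default_rng(0)
|     weights = [rng.normal(size=(784, 512)), rng.normal(size=(512, 10))]
|
|     buf = io.BytesIO()
|     np.savez(buf, *weights)
|     raw = buf.getvalue()
|     packed = zlib.compress(raw, 9)
|     print(len(raw), "bytes raw ->", len(packed), "bytes compressed")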
| xyzzy21 wrote:
| I'm not surprised. In physics/engineering-related model fitting,
| there is a double-edged risk of over-parameterization as well as
| under-parameterization. The former is almost always non-physical
| - it fits, but the parameters are meaningless as predictors,
| conclusions, or ties to ANYTHING physical as a design strategy.
| The latter fits only certain corners of the data space, and if
| you don't know which ones, you can get yourself into trouble that
| way as well.
|
| The former can be worse than the latter in terms of meaningful
| use for engineering.
| cs702 wrote:
| I'll come out and say what's on many practitioners' minds:
|
| It could very well be that generations of academics have been
| WRONG about the relationship between the number of parameters (or
| complexity) of a model and its ability to generalize to new,
| previously unseen data, _particularly when the data is drawn from
| many naturally occurring distributions_ -- as opposed to, say,
| distributions randomly chosen from the space of all possible
| distributions, as assumed by many theoretical frameworks (e.g.,
| No Free Lunch theorem). It could be that generations of students
| have been taught _The Wrong Thing_ (tm).
|
| In many cases we must _increase_ , not decrease, model complexity
| to improve generalization!
| [deleted]
| strange_things wrote:
| It's obvious, no? For double descent: the network with a billion
| parameters is so large that it learns to simulate a neural
| network itself, which first uses the test data set to train
| itself and then produces the final overfitted values??
| bsmith89 wrote:
| I'm not sure if it's related, but I've seen discussions of modern
| ML methods (in particular those trained using stochastic
| algorithms...maybe also models with low float precision...?)
| approximating Bayesian methods. The way I've imagined it is that
| the training path, by virtue of its stochasticity, resembles MCMC
| sampling and therefore tends to end up in regions of high
| posterior volume (the "typical set"), rather than high posterior
| density. I could see this resulting in a fit with parameters
| closer to their conditional expectations (in the Bayesian sense),
| which should be more generalizable to new data, hence fewer
| issues with overfitting.
|
| A consequence of this would be that if somehow a method were able
| to successfully find the _global_ loss-function minimum on the
| training data, it would perform worse on the test set.
| Fortunately, our optimization methods _don't_ find the global
| minimum at all.
|
| Can anybody point me to literature on this idea? I don't know if
| my uninformed interpretation is actually close to what experts
| are thinking.
| bsmith89 wrote:
| Oh, and I enjoyed reading this primer on the Double Descent
| Phenomenon for anybody, like me, who hadn't heard of it before:
| https://openai.com/blog/deep-double-descent/
| gwern wrote:
| You're looking for flat minima / wide basins. (Amusingly, this
| one actually does go back to Schmidhuber etc.) Explains a lot
| of phenomena like poorer generalization of second-order
| optimizers, SGD sometimes working surprisingly better,
| stochastic weight averaging / EMA, grokking, or patient
| teachers.
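|
| A toy 1-D sketch of the flat-vs-sharp intuition (a made-up loss,
| not any real model): the sharp minimum wins on exact loss but
| loses badly once the parameters are jittered, which is a crude
| stand-in for SGD noise or a shifted test distribution.
|
|     import numpy as np
|
|     # Toy loss: a sharp minimum at x=0 and a flat one near x=4.
|     def loss(x):
|         return np.minimum(50.0 * x**2, 0.5 * (x - 4.0)**2 + 0.05)
|
|     rng = np.random.default_rng(0)
|     jitter = rng.normal(scale=0.3, size=100_000)
|     for m in (0.0, 4.0):
|         print(m, loss(m), np.mean(loss(m + jitter)))
|     # The sharp minimum has lower loss at the exact point but far
|     # higher loss under jitter; the flat basin is robust.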
| talolard wrote:
| A quick summary/translation for those of us who don't speak ML.
|
| We keep hearing about these giant models like GPT-3 with 1.5
| billion parameters. Parameters are the things that change when we
| train a model; you can think of them as degrees of freedom. If
| you have a lot of parameters, theory led us to believe that the
| model would just "overfit" the training data, i.e. memorize it.
| That's bad, because when new data comes in in production we'd
| expect the model not to be able to "generalize" to it, i.e. make
| accurate predictions on data it hasn't seen before, because it's
| just memorized the training data instead of uncovering the
| "guiding principles" of the data, so to speak.
|
| In practice, these huge models are, in layman's terms, fucking
| awesome and work really well, i.e. they generalize and work in
| production. No one understands why.
|
| This paper is a survey or overview of what "too many parameters"
| means, and of the research into why these models work even though
| they shouldn't.
| sdenton4 wrote:
| My big beef with a lot of the 'leading edge' ML research is
| that it tends to be waaaaay too focused on classification
| problems, and ImageNet in particular. And, last I checked, you
| /do/ still fight with overfitting in classification models, by
| cleverly choosing learning rate schedules and using early
| stopping schemes, 'double descent' be damned.
|
| You can solve classification with a hash function: Hash the
| image, and then just memorize which label goes with which hash.
| You can try to dodge this obviously dodgy solution by adding
| augmentation to the dataset. Then you instead learn to find a
| representation invariant under the set of augmentations, and
| learn the hash of that representation. It turns out these
| augmentation-invariant representations are actually pretty
| good, so we can solve the classification problem in what looks
| like a general way.
|
| However, there are many other classes of problems where the
| hash problem doesn't exist, because the information density of
| the outputs is too high to memorize in the same way.
| Specifically, generative models, and the sorts of
| predictive/infill problems used for self-supervision. In these
| spaces, the problems are more like: "Given this pile of
| augmented input, generate half a megabyte of coherent output."
| These kinds of problems simply don't overfit: Train a speech
| separation model on a big dataset, and the train+eval quality
| metrics will just asymptote their way up and to the right until
| you run out of training budget.
| quocanh wrote:
| When you say hash function, do you mean a cryptographic hash
| function? How on earth could the performance of that be
| anywhere near the simplest probabilistic algorithm on unseen
| examples?
| sdenton4 wrote:
| No, nothing cryptographic here. All I'm saying is that you
| can memorize the dataset by extracting a small fingerprint
| of each training example and associating it with an output
| label: i.e., learn by lookup table. Then you don't need to
| memorize the whole training set, you just need to
| find/learn the fingerprinting function. With no
| augmentation, you might as well use MD5... With
| augmentation, you do need to do some actual work to learn
| to extract an augmentation invariant projection of the
| training examples, but the basic principle is the same.
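|
| A literal toy version of the no-augmentation case (a sketch,
| nothing clever): a hash -> label table "solves" the training
| set perfectly and is useless on anything it hasn't memorized.
|
|     import hashlib
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     train = [(rng.integers(0, 256, (8, 8), dtype=np.uint8), i % 10)
|              for i in range(100)]
|
|     table = {hashlib.md5(img.tobytes()).hexdigest(): label
|              for img, label in train}
|
|     def classify(img):
|         # Arbitrary guess for anything not in the lookup table.
|         return table.get(hashlib.md5(img.tobytes()).hexdigest(), 0)
|
|     acc = np.mean([classify(img) == label for img, label in train])
|     print("train accuracy:", acc)   # 1.0 by construction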
| underanalyzer wrote:
| I have nothing to do with machine learning but it seems
| like the hashing approach would only work if you are
| "training" on the evaluation set instead of a separate
| training set. AFAIK, in ImageNet-like challenges the set
| of labeled training images does not contain any of the
| evaluation images, so there wouldn't be any hashes
| matching any of the evaluation data.
| thisiszilff wrote:
| Yes, you're right. You should never see the
| test/evaluation dataset during training so it would be
| impossible to "memorize" the test cases. You would get
| near-perfect accuracy on the training data, but not on
| the test set. I think the closest analogue would be
| models that produce conceptual embeddings somewhere in
| them -- those are kind of like hashes with the property
| that similar things have similar embeddings. Many
| classification neural networks kind of operate like that
| -- the initial layers produce a representation of the
| data and then the final layer actually performs the
| classification.
| r-zip wrote:
| Has this been implemented? What kinds of hashing
| functions are you talking about? How would you guarantee
| the same hash for all the augmentations?
|
| It seems like the approach you describe just moves the
| complexity of the task solved by neural networks into the
| hashing function.
| sdenton4 wrote:
| https://openreview.net/forum?id=Sy8gdB9xx
|
| "our experiments establish that state-of-the-art
| convolutional networks for image classification trained
| with stochastic gradient methods easily fit a random
| labeling of the training data. This phenomenon is
| qualitatively unaffected by explicit regularization, and
| occurs even if we replace the true images by completely
| unstructured random noise."
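|
| That result is easy to reproduce in miniature (a sketch with
| scikit-learn rather than a ConvNet on ImageNet): an
| overparameterized network happily drives training error to
| zero on pure noise with random labels.
|
|     import numpy as np
|     from sklearn.neural_network import MLPClassifier
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(200, 32))      # unstructured random inputs
|     y = rng.integers(0, 10, size=200)   # completely random labels
|
|     clf = MLPClassifier(hidden_layer_sizes=(512, 512),
|                         max_iter=2000, random_state=0).fit(X, y)
|     print("train accuracy:", clf.score(X, y))   # typically ~1.0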
| [deleted]
| IX-103 wrote:
| Memorization is only an issue if you allow it to be. If you
| design the model with a "narrow" enough inner stage, then that
| limits the level of detail (in terms of distinct
| representable values) passed to subsequent stages. This
| should give you an ML algorithm that consists of a
| fingerprint stage (approximating your hashing) followed by a
| classifier that works from the fingerprint input
| (approximating a table lookup). Such an algorithm should not
| have the over-fitting problem you describe.
| [deleted]
| aomobile wrote:
| Thanks for the summary!
| Salgat wrote:
| To add to this, there's a misleading phenomenon where performance
| first gets worse with too much data/parameters/epochs, but then
| oddly improves again if you throw even more at the model.
| nexuist wrote:
| Is this the ML equivalent of Dunning-Kruger effect? A model
| with a bit of data is too afraid of being wrong to be
| overconfident. A model with a bit more data is overconfident
| in itself and gets things wrong. Finally, a model with tons
| and tons of data understands the complexity of the problem
| set and once again becomes too afraid of being wrong.
| visarga wrote:
| Model confidence as reported by softmax probability scores
| is notoriously noisy and miscalibrated. With larger models
| and more data the confidence estimation gets more nuanced.
| jointpdf wrote:
| For the interested, this phenomenon is known as (deep) double
| descent:
|
| https://openai.com/blog/deep-double-descent/
|
| https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understand.
| ..
|
| (Edit: Oh, the definition appears in the abstract of the
| linked paper.)
| diffCtx wrote:
| My take as a 90s math grad (out of touch with modern teaching):
| Theory is useful to show human society is stagnant.
|
| There's an infinite number of sentences, but our ML models are
| having tons of "success" because society relies on a finite set
| of them in daily life: the ones that instigate commerce.
|
| Like religion relied on an acceptable finite set of sentences,
| so too does our society. We're a bunch of weird little
| missionaries living in one geometric world, still believing in
| bigger purpose.
|
| ML isn't really outputting novelty, it's spewing our own
| inanity at us, and helping correct some bad math in engineering
| cases.
|
| We're easily mesmerized apes.
| andrewnc wrote:
| Super pedantic comment. GPT-3 has 175 Billion parameters. GPT-2
| was the 1.5 Billion model.
| yldedly wrote:
| >In practice, these huge models are, in layman's terms, fucking
| awesome and work really well, i.e. they generalize and work in
| production. No one understands why.
|
| To add nuance to this, these models are awesome at
| interpolation, but not so much at extrapolation. Or in
| different terms, they generalize very well to an IID test set,
| but don't generalize under (even slight) distribution shift.
|
| The main reason for this is that these models tend to solve
| classification and regression problems quite differently from
| how humans do it. Broadly speaking, a large, flexible NN will
| find a "shortcut", i.e. a simple relation between some part of
| the input and the output, which may not be informative in the
| way we want; such as a watermark in the corner of an image, or
| statistical regularities in textures which disappear in
| slightly different lighting conditions. See e.g.
| https://thegradient.pub/shortcuts-neural-networks-love-to-ch...
|
| I think it's fair to say that these models are great when you
| have an enormous dataset that covers the entire domain, but
| sub-Google-scale problems are usually still solved by
| underparametrized models (even at Google).
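|
| A tiny synthetic version of the watermark story (a sketch; even
| a linear model falls for it): plant a "marker" feature that
| tracks the label in training but not at test time.
|
|     import numpy as np
|     from sklearn.linear_model import LogisticRegression
|
|     rng = np.random.default_rng(0)
|
|     def make_data(n, marker_tracks_label):
|         y = rng.integers(0, 2, size=n)
|         X = rng.normal(size=(n, 20))
|         X[:, 0] += 0.5 * (2 * y - 1)     # weak genuine signal
|         marker = y if marker_tracks_label else rng.integers(0, 2, size=n)
|         X[:, 1] = 5.0 * marker           # the "watermark" shortcut
|         return X, y
|
|     Xtr, ytr = make_data(2000, True)
|     Xte, yte = make_data(2000, False)    # shortcut breaks at test time
|
|     clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
|     print("train:", clf.score(Xtr, ytr))  # ~1.0 via the shortcut
|     print("test: ", clf.score(Xte, yte))  # not far above chance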
| rich_sasha wrote:
| It depends. It really doesn't take that much data to train a
| pretty stunning (if simple) RNN character-level "language
| model" that beats any n-gram. Or on mnist. ANNs really are a
| useful tool for a vast class of problems, many of which can
| be solved with comparatively little data.
|
| Maybe your point stands, and it's just that some domains need
| less data, just saying.
| yldedly wrote:
| >ANNs really are a useful tool for a vast class of
| problems, many of which can be solved with comparatively
| little data.
|
| For sure, it all depends on how robust the model needs to
| be, how strongly it needs to generalize. If your dataset
| covers the entire domain, you don't need a robust model. If
| you need strong generalization, then you need to build in
| stronger priors.
|
| Take f(x) = x^2. If your model only needs to work in a finite
| interval, you just need a decent sample that covers that
| interval. But if it needs to generalize outside that
| interval, no amount of parameters will give you good
| performance. Outside the boundaries of the interval, the NN
| will either be constant (with a sigmoid activation) or
| linear (with ReLU type activations).
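|
| Concretely (a sketch with a small scikit-learn MLP, training on
| x^2 over [-2, 2] only):
|
|     import numpy as np
|     from sklearn.neural_network import MLPRegressor
|
|     rng = np.random.default_rng(0)
|     x = rng.uniform(-2, 2, size=(1000, 1))
|     net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
|                        max_iter=5000, random_state=0)
|     net.fit(x, (x**2).ravel())
|
|     for v in (1.0, 2.0, 5.0, 10.0):
|         print(v, v**2, net.predict([[v]])[0])
|     # Inside the training interval the fit is reasonable; outside
|     # it, a ReLU net is piecewise linear, so the predictions at 5
|     # and 10 land nowhere near 25 and 100.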
| jcranberry wrote:
| My sister works in the NLP arm of ML and analogized it to the
| Clever Hans effect.
| 988747 wrote:
| > In practice, these huge models are, in layman's terms, fucking
| awesome and work really well
| production. No one understands why.
|
| How about the resulting weights? If most of them are close to
| 0, then that would mean that part of the training is for the NN
| to learn which of the 1.5B parameters are relevant, and which
| are not.
| rich_sasha wrote:
| There is something called the golden ticket theory (maybe
| mentioned in the paper, I'm on my phone), that says indeed
| that the large models are effectively ensembles of massive
| random models, and the top levels of the network pick the one
| or two that randomly happen to work.
|
| Maybe true, but even then it's only part of the story; kernels
| in CNNs genuinely seem to learn features like edges and textures.
| talolard wrote:
| There are two answers to this. First, empirically we see that
| the more parameters we add, the better the model performs ==>
| the weights continue to contribute (and aren't dead).
|
| Second, there is a very popular paper called "The Lottery
| Ticket Hypothesis" [1] showing that in any network you can find
| subnetworks that work just as well, i.e. the parameters are
| redundant. This was written in 2018, which is a long time ago
| in big-NN world, so I'm not sure how it holds up to current
| insanity-sized models.
|
| [1] https://arxiv.org/abs/1803.03635
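|
| One cheap way to poke at both points (the question above about
| near-zero weights, and the redundancy claim) is crude magnitude
| pruning (not the full lottery-ticket procedure, which also
| rewinds the surviving weights to their original initialization),
| just as a sketch:
|
|     import numpy as np
|     from sklearn.datasets import load_digits
|     from sklearn.model_selection import train_test_split
|     from sklearn.neural_network import MLPClassifier
|
|     X, y = load_digits(return_X_y=True)
|     Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
|     clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=1000,
|                         random_state=0).fit(Xtr, ytr)
|     print("dense test accuracy:", clf.score(Xte, yte))
|
|     # Zero out the 80% smallest-magnitude weights in each layer.
|     for W in clf.coefs_:
|         W[np.abs(W) < np.quantile(np.abs(W), 0.8)] = 0.0
|     print("pruned test accuracy:", clf.score(Xte, yte))
|     # The pruned accuracy often drops only modestly, which hints
|     # at how much of the parameter count is redundant.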
| baron_harkonnen wrote:
| > In practice, these huge models are, in laymans terms, fucking
| awesome and work really well
|
| A similarly surprising result from an adjacent community,
| Bayesian Statistics, is that in the case of hierarchical
| models, increasing your number of parameters can paradoxically
| _reduce_ overfitting.
|
| The scale of parameters in Bayesian models is nowhere near
| that of these deep neural nets, but nonetheless this is a
| similarly shocking result, since adding parameters is typically
| penalized when building models.
|
| It's a bit more explainable in Bayesian stats, since some of the
| parameters serve to limit the impact of the more granular ones
| (i.e. you're learning a prior probability distribution for the
| other parameters, which prevents extreme overfitting in cases
| with less information).
|
| I wouldn't be too surprised if eventually we realized a similar
| cause was preventing overfitting in overparameterized ML
| models.
| dxbydt wrote:
| > in the case of hierarchical models, increasing your number
| of parameters can paradoxically reduce overfitting.
|
| But in practice you don't keep increasing the number of
| parameters... Occam's razor still applies. Typical example -
| say your data has 100 samples. 50 come from N(a,1) and 50
| from N(b,1). So you have a mixture distribution with 2
| unknown means a,b, and known variance 1.
|
| The frequentist would say a & b are constants. They are
| unknown, but still constants. To know them, just collect more
| & more samples. By the weak law, abar will converge to a &
| bbar to b.
|
| The Bayesian obviously differs & says a is an rv. So is b. Ok,
| if a & b are rvs, where do they come from? So the Bayesian
| punts & says a comes from N(c,d) and b comes from N(e,f). At
| this point your hierarchy is 1 level deep & you have added 4
| extra parameters c,d,e,f. So we then ask: what are c,d,e,f?
| You can punt again & say c comes from N(g,h) & so on... but
| ultimately you have to walk the talk & put some actual
| numbers out. You can't just keep increasing the number of
| parameters like in the DL world.
|
| So usually you stop your hierarchy at 1 or 2 levels. So
| instead of saying a comes from N(c,d), you would say a and b
| come from N(c,1). So what about c? Ok, it comes from N(0,1).
| Boom, done. So you've added just 1 extra parameter c & you get
| a lot of benefit. It isn't at all like the DL world imho.
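|
| For concreteness, that two-level version written out (a sketch
| assuming PyMC; the priors follow the example above, the data is
| simulated):
|
|     import numpy as np
|     import pymc as pm
|
|     rng = np.random.default_rng(0)
|     data_a = rng.normal(1.0, 1.0, size=50)
|     data_b = rng.normal(3.0, 1.0, size=50)
|
|     with pm.Model():
|         c = pm.Normal("c", mu=0.0, sigma=1.0)  # the one extra parameter
|         a = pm.Normal("a", mu=c, sigma=1.0)
|         b = pm.Normal("b", mu=c, sigma=1.0)
|         pm.Normal("obs_a", mu=a, sigma=1.0, observed=data_a)
|         pm.Normal("obs_b", mu=b, sigma=1.0, observed=data_b)
|         idata = pm.sample()  # a and b get partially pooled toward c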
| oleg_myrk wrote:
| Any chance you could share a link to a relevant paper?
| sjg007 wrote:
| This is likely part of the reason. The only problem is that said
| models require a lot of data, but humans can learn from a very
| small number of examples.
| sdenton4 wrote:
| Humans are continuously pretrained on a variety of tasks,
| though. Teaching a kid to say one word takes about a
| year...
| talolard wrote:
| I don't know if it's correct, but I often think of a
| classification model as learning the parameters of a Dirichlet
| distribution, with the final softmax layer being a sample from
| it.
| naomisperfume wrote:
| Do you have any good references for this phenomenon in
| hierarchical models?
| bigfudge wrote:
| Gelman and Hill have a brief intro and some references.
| [deleted]
___________________________________________________________________
(page generated 2021-09-21 23:01 UTC)