[HN Gopher] An overview of the theory of overparameterized machi...
       ___________________________________________________________________
        
       An overview of the theory of overparameterized machine learning
        
       Author : sebg
       Score  : 98 points
       Date   : 2021-09-18 23:32 UTC (2 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | rawoke083600 wrote:
       | One can say, hardwork-beats-talent in some cases and in some
       | cases talent-beats-hardwork :P
       | 
       | Talent: a "smarter/clever" model Hardwork: more and more
       | parameters
        
       | derbOac wrote:
        | I have to reread it, but the model misspecification
        | interpretation (that highly overparameterized models exhibit
        | double descent (DD) because they reduce misspecification error)
        | is the only thing I've read in this area that makes some sense
        | to me theoretically.
       | 
        | I think DD is a huge issue for a number of fields and is really
        | underappreciated. Without meaning to sound disrespectful, much
        | of this literature seems a little superficial or dismissive,
        | unaware of the broad implications of the claims often being
        | made.
       | 
        | This is because of the ties between information theory and
        | statistics/modeling. In some sense, at least in the way I've
        | thought about it, DD seems to imply some kind of violation of
        | fundamental information theory, and comes across to me a bit as
        | if someone in chemistry started claiming that some basic laws of
        | thermodynamics in physics didn't apply anymore. Basically, DD
        | seems to imply that someone can extract more information from a
        | string than the string contains. Put that way, it makes no
        | sense, which is why I think this is such a hugely important
        | issue.
       | 
       | On the other hand, the empirical results are there, so figuring
       | out what's going on is worthwhile and I have an open mind.
       | 
        | This paper seems nice with the misspecification angle, because
        | it is realistic and seems to open a path to interpretations that
        | might not violate fundamental identities in IT.
        | Misspecification (mismatched coding in IT) can lead to some
        | weird phenomena that aren't always intuitive.
       | 
       | Another thing in the paper that's made clear is that DD might not
       | always happen, and it seems informative to figure out when that's
       | the case.
       | 
        | In the background I have to say I'm still skeptical of the
        | empirical breadth of DD. The weird cases of ML failures on
        | subtle challenge inputs (the errors identifying Obama based on
        | positioning and ties (?) being one example) seem to me like
        | prime examples of overfitting. I still have a hunch that
        | something about the training and test samples relative to the
        | universe of actually intended samples is at play, or that the
        | whole phenomenon of DD is misleading because the overfitting
        | problem is really in terms of model flexibility versus data
        | complexity, and not necessarily in terms of number of parameters
        | per se versus sample size.
        
         | yldedly wrote:
          | DD disappears when doing Bayesian model averaging:
          | https://arxiv.org/pdf/2002.08791.pdf
          | It seems DD is a phenomenon specific to point estimates.
         | 
         | >overfitting problem is really in terms of model flexibility
         | versus data complexity, and not necessarily in terms of number
         | of parameters per se versus sample size
         | 
         | Yep, well put.
        
       | unemphysbro wrote:
        | Briefly skimmed the paper; I'm not an ML or math dude, but I'm
        | familiar with the double descent problem.
        | 
        | I wonder if there's a good way to measure information content
        | in models and how it scales with model parameters, whether
        | there are any invariants, scaling laws that arise, etc. Reminds
        | me of renormalization group methods, which could be applied to
        | a space of models, etc...
       | 
       | Again, I'm no expert in any of this stuff, just arm-chairing
       | away.
        
         | marius_k wrote:
          | Would it work to measure the size of the compressed model?
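          | 
          | A crude version of that idea, as a sketch (the weights
          | here are just a random stand-in, not a real model):
          | 
          |   import zlib
          |   import numpy as np
          | 
          |   # Stand-in for a trained model's flattened weights.
          |   w = np.random.randn(1_000_000).astype(np.float32)
          | 
          |   raw = w.tobytes()
          |   packed = zlib.compress(raw, 9)
          |   print(len(raw), len(packed))
          |   # The compressed size is a crude upper bound on the
          |   # description length of whatever weights you feed in
          |   # (random float32 values like these barely compress).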
        
       | xyzzy21 wrote:
        | I'm not surprised. In physics/engineering-related model
        | fitting, there is a double-edged risk of over-parameterization
        | as well as under-parameterization. The former is almost always
        | non-physical - it fits, but the parameters are meaningless as
        | predictors, as conclusions, or for tying to ANYTHING physical
        | as a design strategy. The latter fits only certain corners of
        | the data space, and if you don't know which ones, you can get
        | yourself into trouble that way as well.
       | 
       | The former can be worse than the latter in terms of meaningful
       | use for engineering.
        
       | cs702 wrote:
        | I'll come out and say what's on many practitioners' minds:
       | 
       | It could very well be that generations of academics have been
       | WRONG about the relationship between the number of parameters (or
       | complexity) of a model and its ability to generalize to new,
       | previously unseen data, _particularly when the data is drawn from
       | many naturally occurring distributions_ -- as opposed to, say,
       | distributions randomly chosen from the space of all possible
       | distributions, as assumed by many theoretical frameworks (e.g.,
       | No Free Lunch theorem). It could be that generations of students
       | have been taught _The Wrong Thing_ (tm).
       | 
       | In many cases we must _increase_ , not decrease, model complexity
       | to improve generalization!
        
         | [deleted]
        
       | strange_things wrote:
        | It's obvious, no? For double descent: the network with a
        | billion parameters is so large that it learns to simulate a
        | neural network itself, which first uses the test data set to
        | train itself and then produces the final overfitted values??
        
       | bsmith89 wrote:
       | I'm not sure if it's related, but I've seen discussions of modern
       | ML methods (in particular those trained using stochastic
       | algorithms...maybe also models with low float precision...?)
       | approximating Bayesian methods. The way I've imagined it is that
       | the training path, by virtue of its stochasticity, resembles MCMC
       | sampling and therefore tends to end up in regions of high
       | posterior volume (the "typical set"), rather than high posterior
       | density. I could see this resulting in a fit with parameters
       | closer to their conditional expectations (in the Bayesian sense),
       | which should be more generalizable to new data, hence fewer
       | issues with overfitting.
       | 
        | A consequence of this would be that if somehow a method were
        | able to successfully find the _global_ loss-function minimum on
        | the training data, it would perform worse on the test set.
        | Fortunately, our optimization methods _don't_ find the global
        | minimum at all.
       | 
       | Can anybody point me to literature on this idea? I don't know if
       | my uninformed interpretation is actually close to what experts
       | are thinking.
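        | 
        | One formal version of that picture is Langevin-type
        | dynamics (SGLD and relatives): gradient steps plus noise
        | scaled to the step size sample from a posterior rather
        | than converging to its mode. A minimal 1-D sketch of the
        | idea (toy loss, not a real network):
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        | 
        |   # Toy negative log-posterior U(w) = (w - 3)^2 / 2,
        |   # i.e. the target "posterior" is N(3, 1).
        |   def grad_U(w):
        |       return w - 3.0
        | 
        |   # Langevin update: a gradient step plus injected noise,
        |   # so the iterates wander over high-probability regions
        |   # instead of settling at the single mode.
        |   eps, w, samples = 0.01, 0.0, []
        |   for _ in range(20_000):
        |       w = w - 0.5 * eps * grad_U(w) \
        |             + np.sqrt(eps) * rng.normal()
        |       samples.append(w)
        | 
        |   print(np.mean(samples[2000:]), np.std(samples[2000:]))
        |   # roughly 3.0 and 1.0, matching N(3, 1)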
        
         | bsmith89 wrote:
         | Oh, and I enjoyed reading this primer on the Double Descent
         | Phenomenon for anybody, like me, who hadn't heard of it before:
         | https://openai.com/blog/deep-double-descent/
        
         | gwern wrote:
          | You're looking for flat minima / wide basins. (Amusingly,
          | this one actually does go back to Schmidhuber etc.) It
          | explains a lot of phenomena like the poorer generalization of
          | second-order optimizers, SGD sometimes working surprisingly
          | better, stochastic weight averaging / EMA, grokking, or
          | patient teachers.
        
       | talolard wrote:
        | A quick summary/translation for those of us who don't speak ML.
        | 
        | We keep hearing about these giant models like GPT-3 with 1.5
        | billion parameters. Parameters are the things that change when
        | we train a model; you can think of them as degrees of freedom.
        | If you have a lot of parameters, theory made us believe that
        | the model would just "overfit" the training data, i.e. memorize
        | it. That's bad, because when new data comes in in production
        | we'd expect the model to not be able to "generalize" to it,
        | i.e. make accurate predictions on data it hasn't seen before,
        | because it's just memorized training data instead of uncovering
        | the "guiding principles" of the data, so to speak.
        | 
        | In practice, these huge models are, in layman's terms, fucking
        | awesome and work really well, i.e. they generalize and work in
        | production. No one understands why.
        | 
        | This paper is a survey or overview of what "too many
        | parameters" means, and of the research into why these models
        | work even though they shouldn't.
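        | 
        | For the curious, the textbook picture being challenged here
        | fits in a few lines of polynomial fitting (a sketch; the
        | paper is about what happens far past this regime):
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   x_tr = np.sort(rng.uniform(-1, 1, 15))
        |   y_tr = np.sin(3 * x_tr) + 0.1 * rng.normal(size=15)
        |   x_te = np.linspace(-1, 1, 200)
        |   y_te = np.sin(3 * x_te)
        | 
        |   # Low degree underfits, a moderate degree is fine, and a
        |   # degree close to the sample count chases the noise:
        |   # training error keeps falling while test error typically
        |   # blows up. The surprise is that far bigger models stop
        |   # behaving like that last row.
        |   for deg in (1, 4, 12):
        |       c = np.polyfit(x_tr, y_tr, deg)
        |       tr = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
        |       te = np.mean((np.polyval(c, x_te) - y_te) ** 2)
        |       print(deg, round(tr, 4), round(te, 4))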
        
         | sdenton4 wrote:
         | My big beef with a lot of the 'leading edge' ML research is
         | that it tends to be waaaaay too focused on classification
         | problems, and ImageNet in particular. And, last I checked, you
         | /do/ still fight with overfitting in classification models, by
         | cleverly choosing learning rate schedules and using early
         | stopping schemes, 'double descent' be damned.
         | 
         | You can solve classification with a hash function: Hash the
         | image, and then just memorize which label goes with which hash.
         | You can try to dodge this obviously dodgy solution by adding
         | augmentation to the dataset. Then you instead learn to find a
         | representation invariant under the set of augmentations, and
         | learn the hash of that representation. It turns out these
         | augmentation-invariant representations are actually pretty
         | good, so we can solve the classification problem in what looks
         | like a general way.
         | 
         | However, there are many other classes of problems where the
         | hash problem doesn't exist, because the information density of
         | the outputs is too high to memorize in the same way.
         | Specifically, generative models, and the sorts of
         | predictive/infill problems used for self-supervision. In these
         | spaces, the problems are more like: "Given this pile of
         | augmented input, generate half a megabyte of coherent output."
         | These kinds of problems simply don't overfit: Train a speech
         | separation model on a big dataset, and the train+eval quality
         | metrics will just asymptote their way up and to the right until
         | you run out of training budget.
        
           | quocanh wrote:
            | When you say hash function, do you mean a cryptographic
            | hash function? How on earth could its performance be
            | anywhere near that of the simplest probabilistic algorithm
            | on unseen examples?
        
             | sdenton4 wrote:
             | No, nothing cryptographic here. All I'm saying is that you
             | can memorize the dataset by extracting a small fingerprint
             | of each training example and associating it with an output
              | label: i.e., learn by lookup table. Then you don't need to
             | memorize the whole training set, you just need to
             | find/learn the fingerprinting function. With no
             | augmentation, you might as well use MD5... With
             | augmentation, you do need to do some actual work to learn
             | to extract an augmentation invariant projection of the
             | training examples, but the basic principle is the same.
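              | 
              | Spelled out, the no-augmentation version is literally
              | a dict (a toy sketch with invented data):
              | 
              |   import hashlib
              |   import numpy as np
              | 
              |   rng = np.random.default_rng(0)
              | 
              |   def fp(img):
              |       # exact fingerprint of the raw pixels
              |       return hashlib.md5(img.tobytes()).hexdigest()
              | 
              |   # invented "training set" of (image, label) pairs
              |   train = [(rng.integers(0, 256, (32, 32, 3),
              |                          dtype=np.uint8), i % 10)
              |            for i in range(100)]
              |   table = {fp(img): lab for img, lab in train}
              | 
              |   def predict(img):
              |       # perfect on memorized images, useless otherwise
              |       return table.get(fp(img))
              | 
              |   print(predict(train[0][0]))  # recovers label 0
              |   new = rng.integers(0, 256, (32, 32, 3),
              |                      dtype=np.uint8)
              |   print(predict(new))          # None: never seen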
        
               | underanalyzer wrote:
               | I have nothing to do with machine learning but it seems
               | like the hashing approach would only work if you are
               | "training" on the evaluation set instead of a separate
                | training set. AFAIK in ImageNet-like challenges the set
                | of labeled training images does not contain any of the
                | evaluation images, so there wouldn't be any hashes
                | matching any of the evaluation data.
        
               | thisiszilff wrote:
               | Yes, you're right. You should never see the
               | test/evaluation dataset during training so it would be
               | impossible to "memorize" the test cases. You would get
                | near-perfect accuracy on the training data, but not on
                | the test set. I think the closest analogue would be
               | models that produce conceptual embeddings somewhere in
               | them -- those are kind of like hashes with the property
               | that similar things have similar embeddings. Many
               | classification neural networks kind of operate like that
               | -- the initial layers produce a representation of the
               | data and then the final layer actually performs the
               | classification.
        
               | r-zip wrote:
               | Has this been implemented? What kinds of hashing
               | functions are you talking about? How would you guarantee
               | the same hash for all the augmentations?
               | 
               | It seems like the approach you describe just moves the
               | complexity of the task solved by neural networks into the
               | hashing function.
        
               | sdenton4 wrote:
               | https://openreview.net/forum?id=Sy8gdB9xx
               | 
               | "our experiments establish that state-of-the-art
               | convolutional networks for image classification trained
               | with stochastic gradient methods easily fit a random
               | labeling of the training data. This phenomenon is
               | qualitatively unaffected by explicit regularization, and
               | occurs even if we replace the true images by completely
               | unstructured random noise."
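                | 
                | A scaled-down sketch of that finding (tiny random
                | data instead of ImageNet; exact accuracy varies):
                | 
                |   import numpy as np
                |   from sklearn.neural_network import MLPClassifier
                | 
                |   rng = np.random.default_rng(0)
                |   X = rng.normal(size=(500, 50))   # noise inputs
                |   y = rng.integers(0, 10, size=500)  # random labels
                | 
                |   clf = MLPClassifier(hidden_layer_sizes=(512,),
                |                       max_iter=3000,
                |                       random_state=0)
                |   clf.fit(X, y)
                |   # Training accuracy should land near 1.0: with
                |   # enough parameters the net simply memorizes the
                |   # noise, even though there is nothing to learn.
                |   print(clf.score(X, y))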
        
           | [deleted]
        
           | IX-103 wrote:
            | Memorization is only an issue if you allow it to be. If you
            | design the model with a "narrow" enough inner stage, that
            | limits the level of detail (in terms of distinct
            | representable values) passed to subsequent stages. This
            | should give you an ML algorithm that consists of a
            | fingerprint stage (approximating your hashing) followed by
            | a classifier that works from the fingerprint input
            | (approximating a table lookup). Such an algorithm should
            | not have the over-fitting problem you describe.
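            | 
            | In architecture terms that's just an explicit bottleneck
            | between the two stages (a PyTorch sketch, sizes made up):
            | 
            |   import torch.nn as nn
            | 
            |   model = nn.Sequential(
            |       nn.Flatten(),
            |       nn.Linear(3 * 32 * 32, 256),
            |       nn.ReLU(),
            |       nn.Linear(256, 16),  # the "narrow" inner stage
            |       nn.ReLU(),
            |       nn.Linear(16, 10),   # lookup-ish classifier head
            |   )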
        
         | [deleted]
        
         | aomobile wrote:
         | Thanks for the summary!
        
         | Salgat wrote:
          | To add to this, there's a misleading phenomenon where
          | performance at first gets worse with more
          | data/parameters/epochs, but oddly improves again if you throw
          | even more at the model.
        
           | nexuist wrote:
           | Is this the ML equivalent of Dunning-Kruger effect? A model
           | with a bit of data is too afraid of being wrong to be
           | overconfident. A model with a bit more data is overconfident
           | in itself and gets things wrong. Finally, a model with tons
           | and tons of data understands the complexity of the problem
           | set and once again becomes too afraid of being wrong.
        
             | visarga wrote:
             | Model confidence as reported by softmax probability scores
             | is notoriously noisy and miscalibrated. With larger models
             | and more data the confidence estimation gets more nuanced.
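              | 
              | A common post-hoc fix, temperature scaling, just
              | softens the logits (toy numbers):
              | 
              |   import numpy as np
              | 
              |   def softmax(logits, T=1.0):
              |       z = logits / T
              |       z = z - z.max()
              |       e = np.exp(z)
              |       return e / e.sum()
              | 
              |   logits = np.array([4.0, 1.0, 0.5])
              |   print(softmax(logits))         # top class ~0.93
              |   print(softmax(logits, T=3.0))  # softer, ~0.60
              |   # The raw score isn't a trustworthy probability;
              |   # T gets tuned on held-out data to calibrate it.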
        
           | jointpdf wrote:
           | For the interested, this phenomenon is known as (deep) double
           | descent:
           | 
           | https://openai.com/blog/deep-double-descent/
           | 
           | https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understand.
           | ..
           | 
           | (Edit: Oh, the definition appears in the abstract of the
           | linked paper.)
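            | 
            | One of the simplest settings where the whole curve shows
            | up is minimum-norm least squares on random ReLU features;
            | a rough sketch (toy data, exact numbers will vary):
            | 
            |   import numpy as np
            | 
            |   rng = np.random.default_rng(0)
            | 
            |   def data(n):
            |       x = rng.uniform(-1, 1, (n, 1))
            |       y = np.sin(2 * np.pi * x[:, 0])
            |       return x, y + 0.1 * rng.normal(size=n)
            | 
            |   x_tr, y_tr = data(100)   # n = 100 training points
            |   x_te, y_te = data(2000)
            | 
            |   # Fixed random ReLU features; p plays the role of
            |   # "number of parameters" in the linear readout.
            |   W = rng.normal(size=(1, 2000))
            |   b = rng.uniform(-1, 1, 2000)
            | 
            |   def feats(x, p):
            |       return np.maximum(x @ W[:, :p] + b[:p], 0.0)
            | 
            |   for p in (10, 50, 90, 100, 110, 200, 500, 2000):
            |       F_tr, F_te = feats(x_tr, p), feats(x_te, p)
            |       # minimum-norm fit; interpolates once p >= n
            |       theta = np.linalg.pinv(F_tr) @ y_tr
            |       mse = np.mean((F_te @ theta - y_te) ** 2)
            |       print(p, round(mse, 3))
            |   # Test error typically spikes around p = n (the
            |   # interpolation threshold) and descends again as p
            |   # keeps growing -- the double descent shape.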
        
         | diffCtx wrote:
         | My take as a 90s math grad (out of touch with modern teaching):
         | Theory is useful to show human society is stagnant.
         | 
          | There's an infinite number of sentences, but our ML models
          | are having tons of "success" because society relies on a
          | finite set in daily life: those that instigate commerce.
         | 
         | Like religion relied on an acceptable finite set of sentences,
         | so too does our society. We're a bunch of weird little
         | missionaries living in one geometric world, still believing in
         | bigger purpose.
         | 
         | ML isn't really outputting novelty, it's spewing our own
         | inanity at us, and helping correct some bad math in engineering
         | cases.
         | 
         | We're easily mesmerized apes.
        
         | andrewnc wrote:
          | Super pedantic comment: GPT-3 has 175 billion parameters.
          | GPT-2 was the 1.5 billion model.
        
         | yldedly wrote:
         | >In practice, these huge models are, in laymans terms, fucking
         | awesome and work really well e.g. they generalize and work in
         | production. No one understands why.
         | 
         | To add nuance to this, these models are awesome at
         | interpolation, but not so much at extrapolation. Or in
         | different terms, they generalize very well to an IID test set,
         | but don't generalize under (even slight) distribution shift.
         | 
         | The main reason for this is that these models tend to solve
          | classification and regression problems quite differently from
         | how humans do it. Broadly speaking, a large, flexible NN will
         | find a "shortcut", i.e. a simple relation between some part of
         | the input and the output, which may not be informative in the
         | way we want; such as a watermark in the corner of an image, or
         | statistical regularities in textures which disappear in
         | slightly different lighting conditions. See e.g.
         | https://thegradient.pub/shortcuts-neural-networks-love-to-ch...
         | 
         | I think it's fair to say that these models are great when you
         | have an enormous dataset that covers the entire domain, but
         | sub-Google-scale problems are usually still solved by
         | underparametrized models (even at Google).
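          | 
          | The shortcut failure mode is easy to reproduce with a
          | single spurious feature (a contrived sketch; the
          | "watermark" column is an invented stand-in, not a real
          | dataset):
          | 
          |   import numpy as np
          |   from sklearn.linear_model import LogisticRegression
          | 
          |   rng = np.random.default_rng(0)
          | 
          |   def split(n, mark_tracks_label):
          |       y = rng.integers(0, 2, n)
          |       signal = y + 0.8 * rng.normal(size=n)
          |       if mark_tracks_label:
          |           mark = y.astype(float)     # spurious, perfect
          |       else:
          |           mark = rng.integers(0, 2, n).astype(float)
          |       return np.column_stack([signal, mark]), y
          | 
          |   X_tr, y_tr = split(5000, True)   # shortcut available
          |   X_te, y_te = split(5000, False)  # shortcut broken
          | 
          |   clf = LogisticRegression().fit(X_tr, y_tr)
          |   print("in-distribution:", clf.score(X_tr, y_tr))
          |   print("under shift    :", clf.score(X_te, y_te))
          |   # The fit leans on the watermark column because it is
          |   # a perfect predictor in training, so accuracy drops
          |   # once that correlation disappears at test time.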
        
           | rich_sasha wrote:
            | It depends. It really doesn't take that much data to train
            | a pretty stunning (if simple) character-level RNN "language
            | model" that beats any n-gram. Or a model on MNIST. ANNs
            | really are a useful tool for a vast class of problems, many
            | of which can be solved with comparatively little data.
           | 
           | Maybe your point stands, and it's just that some domains need
           | less data, just saying.
        
             | yldedly wrote:
             | >ANNs really are a useful tool for a vast class of
             | problems, many of which can be solved with comparatively
             | little data.
             | 
             | For sure, it all depends on how robust the model needs to
             | be, how strongly it needs to generalize. If your dataset
             | covers the entire domain, you don't need a robust model. If
             | you need strong generalization, then you need to build in
             | stronger priors.
             | 
              | Take f(x) = x^2. If your model only needs to work in a
              | finite interval, you just need a decent sample that covers
              | that interval. But if it needs to generalize outside that
             | interval, no amount of parameters will give you good
             | performance. Outside the boundaries of the interval, the NN
             | will either be constant (with a sigmoid activation) or
             | linear (with ReLU type activations).
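              | 
              | A quick numerical version of that (sklearn MLP;
              | exact numbers will differ from run to run):
              | 
              |   import numpy as np
              |   from sklearn.neural_network import MLPRegressor
              | 
              |   rng = np.random.default_rng(0)
              |   x = rng.uniform(-1, 1, (2000, 1))  # inside [-1, 1]
              |   y = x[:, 0] ** 2
              | 
              |   net = MLPRegressor(hidden_layer_sizes=(64, 64),
              |                      activation="relu",
              |                      max_iter=5000,
              |                      random_state=0).fit(x, y)
              | 
              |   for t in (0.5, 1.0, 2.0, 4.0):
              |       print(t, t * t,
              |             round(net.predict([[t]])[0], 2))
              |   # Inside the training interval the fit is close;
              |   # outside it, a ReLU net can only extrapolate along
              |   # straight lines, so it drifts away from t^2.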
        
           | jcranberry wrote:
           | My sister works in the NLP arm of ML and analogized it to the
           | Clever Hans effect.
        
         | 988747 wrote:
         | > In practice, these huge models are, in laymans terms, fucking
         | awesome and work really well e.g. they generalize and work in
         | production. No one understands why.
         | 
            | What about the resulting weights? If most of them are close
            | to 0, that would mean that part of the training is for the
            | NN to learn which of the 1.5B parameters are relevant and
            | which are not.
        
           | rich_sasha wrote:
            | There is something called the lottery ticket hypothesis
            | (maybe mentioned in the paper, I'm on my phone), which
            | indeed says that the large models are effectively ensembles
            | of massive random models, and the top levels of the network
            | pick the one or two that randomly happen to work.
            | 
            | Maybe true, but even then it's only part of the story;
            | kernels in CNNs genuinely seem to learn features like edges
            | and textures.
        
           | talolard wrote:
            | There are two answers to this. First, empirically we see
            | that the more parameters we add, the better the model
            | performs ==> the weights continue to contribute (and aren't
            | dead).
            | 
            | Second, there is a very popular paper called "The lottery
            | ticket hypothesis" [1] which found that in any network you
            | can find subnetworks that work just as well, i.e., the
            | parameters are redundant. This was written in 2018, which
            | is a long time ago in big-NN world, so I'm not sure how it
            | holds up to current insanity-sized models.
            | 
            | [1] https://arxiv.org/abs/1803.03635
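            | 
            | A cheap way to poke at the redundancy claim is crude
            | magnitude pruning on a small model (a sketch, not the
            | paper's actual retrain-the-subnetwork procedure; numbers
            | will vary):
            | 
            |   import numpy as np
            |   from sklearn.datasets import load_digits
            |   from sklearn.model_selection import train_test_split
            |   from sklearn.neural_network import MLPClassifier
            | 
            |   X, y = load_digits(return_X_y=True)
            |   X_tr, X_te, y_tr, y_te = train_test_split(
            |       X, y, random_state=0)
            | 
            |   clf = MLPClassifier(hidden_layer_sizes=(256,),
            |                       max_iter=500,
            |                       random_state=0).fit(X_tr, y_tr)
            |   print("dense :", clf.score(X_te, y_te))
            | 
            |   # Zero out the smallest 80% of weights in each layer
            |   # and re-score. If accuracy barely moves, most of
            |   # those parameters were redundant for this task.
            |   for W in clf.coefs_:
            |       W[np.abs(W) < np.quantile(np.abs(W), 0.8)] = 0.0
            |   print("pruned:", clf.score(X_te, y_te))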
        
         | baron_harkonnen wrote:
         | > In practice, these huge models are, in laymans terms, fucking
         | awesome and work really well
         | 
         | A similarly surprising result from an adjacent community,
         | Bayesian Statistics, is that in the case of hierarchical
         | models, increasing your number of parameters can paradoxically
         | _reduce_ overfitting.
         | 
          | The scale of parameters in Bayesian models is nowhere near
          | that of these deep neural nets, but nonetheless this is a
          | similarly shocking result, since typically adding parameters
          | is penalized when model building.
         | 
         | It's a bit more explainable in Bayesian stats since what you're
         | using some parameters for is limiting the impact of more
         | granular parameters (i.e. you're learning a prior probability
         | distribution for the other parameters, which prevents extreme
         | overfitting in cases with less information).
         | 
          | I wouldn't be too surprised if we eventually realized there
          | was a similar cause preventing overfitting in
          | overparameterized ML models.
        
           | dxbydt wrote:
           | > in the case of hierarchical models, increasing your number
           | of parameters can paradoxically reduce overfitting.
           | 
            | But in practice you don't keep increasing the number of
            | parameters... Occam's razor still applies. Typical example:
            | say your data has 100 samples. 50 come from N(a,1) and 50
            | from N(b,1). So you have a mixture distribution with 2
            | unknown means a,b, and known variance 1.
           | 
           | The frequentist would say a & b are constants. They are
           | unknown, but still constants. To know them, just collect more
           | & more samples. By the weak law, abar will converge to a &
           | bbar to b.
           | 
            | The Bayesian obviously differs & says a is an rv. So is b.
            | OK, if a & b are rvs, where do they come from? So the
            | Bayesian punts & says a comes from N(c,d) and b comes from
            | N(e,f). At this point your hierarchy is 1 level deep & you
            | have added 4 extra parameters c,d,e,f. So we then ask, what
            | are c,d,e,f? You can punt again & say c comes from N(g,h) &
            | so on... but ultimately you have to walk the talk & put
            | some actual numbers out. You can't just increase the number
            | of parameters like in the DL world.
           | 
            | So usually, you stop your hierarchy at 1 or 2 levels. So
            | instead of saying a comes from N(c,d), you would say a and
            | b come from N(c,1). So what about c? OK, it comes from
            | N(0,1). Boom, done. So you've added just 1 extra parameter
            | c & you get a lot of benefit. It isn't at all like the DL
            | world, imho.
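            | 
            | Written out as the generative story (just sampling the
            | model above, not fitting it):
            | 
            |   import numpy as np
            | 
            |   rng = np.random.default_rng(0)
            | 
            |   c = rng.normal(0.0, 1.0)     # c ~ N(0, 1)
            |   a = rng.normal(c, 1.0)       # a ~ N(c, 1)
            |   b = rng.normal(c, 1.0)       # b ~ N(c, 1)
            |   xa = rng.normal(a, 1.0, 50)  # 50 draws from N(a, 1)
            |   xb = rng.normal(b, 1.0, 50)  # 50 draws from N(b, 1)
            | 
            |   print(round(a, 2), round(xa.mean(), 2))
            |   print(round(b, 2), round(xb.mean(), 2))
            |   # Fitting this pulls the estimates of a and b toward
            |   # their shared parent c -- the partial pooling the
            |   # parent comment credits with reducing overfitting.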
        
           | oleg_myrk wrote:
           | Any chance you could share a link to a relevant paper?
        
           | sjg007 wrote:
            | This is likely part of the reason. The only problem is that
            | said models require a lot of data, but humans can learn
            | from a very small number of examples.
        
             | sdenton4 wrote:
             | Humans are continuously pretrained on a variety of tasks,
             | though. Teaching a kid to say one word takes about a
             | year...
        
           | talolard wrote:
            | I don't know if it's correct, but I often think of a
            | classification model as learning the parameters of a
            | Dirichlet distribution, with the final softmax layer being
            | a sample from it.
        
           | naomisperfume wrote:
            | Do you have any good references for this phenomenon in
            | hierarchical models?
        
             | bigfudge wrote:
              | Gelman and Hill have a brief intro and some references.
        
             | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-09-21 23:01 UTC)