[HN Gopher] Understanding deep learning requires rethinking gene...
___________________________________________________________________
Understanding deep learning requires rethinking generalization
Author : tmfi
Score : 92 points
Date : 2021-03-04 18:32 UTC (4 hours ago)
(HTM) web link (cacm.acm.org)
(TXT) w3m dump (cacm.acm.org)
| benlivengood wrote:
| This is only tangentially my field, so pure speculation.
|
| I suppose it's possible that generalized minima are numerically
| more common than overfitted minima in an over-parameterized
| model, so probabilistically SGD will find a general minimum more
| often than not, regardless of regularization.
| hervature wrote:
| I think the general consensus (from my interactions) is that a
| local minimum requires the gradient to vanish. When you have
| many dimensions, it's unlikely that all of its components are 0
| at once. Coupled with modern optimization methods (primarily
| momentum), this encourages the result to end up in a shallow
| valley as opposed to a spiky minimum. The leap of faith is
| equating shallow=general and spiky=overfitted.
| [deleted]
| vonsydov wrote:
| The whole point of neural networks was that you don't need to
| think hard about generalizations.
| [deleted]
| clircle wrote:
| > Conventional wisdom attributes small generalization error
| either to properties of the model family or to the regularization
| techniques used during training.
|
| I'd say it's more about the simplicity of the task and quality of
| the data.
| magicalhippo wrote:
| _The experiments we conducted emphasize that the effective
| capacity of several successful neural network architectures is
| large enough to shatter the training data. Consequently, these
| models are in principle rich enough to memorize the training
| data._
|
| So they're fitting elephants[1].
|
| I've been trying to use DeepSpeech[2] lately for a project; it
| would be interesting to see the results for that.
|
| I guess it could also be a decent test for your model? Retrain it
| with random labels and if it succeeds the model is just
| memorizing, so either reduce model complexity or add more
| training data?
|
| [1]: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-
| elep...
|
| [2]: https://github.com/mozilla/DeepSpeech
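| Something like this rough sketch is the kind of check I have in
| mind (PyTorch-style; the data, sizes and model here are just
| random placeholders standing in for whatever you're actually
| training):
|
|   import torch
|   from torch import nn
|   from torch.utils.data import DataLoader, TensorDataset
|
|   # Placeholder data: swap in your real inputs. The labels below
|   # are drawn independently of X, i.e. deliberately meaningless.
|   X = torch.randn(500, 32)
|   y_rand = torch.randint(0, 10, (500,))
|
|   model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(),
|                         nn.Linear(512, 10))
|   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
|   loss_fn = nn.CrossEntropyLoss()
|
|   loader = DataLoader(TensorDataset(X, y_rand), batch_size=64,
|                       shuffle=True)
|   for epoch in range(300):
|       for xb, yb in loader:
|           opt.zero_grad()
|           loss_fn(model(xb), yb).backward()
|           opt.step()
|
|   with torch.no_grad():
|       acc = (model(X).argmax(dim=1) == y_rand).float().mean().item()
|   print(f"train accuracy on random labels: {acc:.2f}")
|   # Close to 1.0 here means the model has enough capacity to
|   # memorize this training set outright; by itself that doesn't
|   # say whether it actually memorizes when given the real labels.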
| mxwsn wrote:
| Model capacity large enough to perfectly memorize/interpolate
| the data may not be a bad thing. A phenomenon known as "deep double
| descent" says that increasing modeling capacity relative to the
| dataset size can reduce generalization error, even after the
| model achieves perfect training performance (see work by
| Mikhail Belkin [0] and empirical demonstrations on large deep
| learning tasks by researchers from Harvard/OpenAI [1]). Other
| work argues that memorization is critical to good performance
| on real-world tasks where the data distribution is often long-
| tailed [2]: to perform well at tasks where the training data
| set only has 1 or 2 examples, it's best to memorize those
| labels, rather than extrapolate from other data (which a lower
| capacity model may prefer to do).
|
| [0]: https://arxiv.org/abs/1812.11118
|
| [1]: https://openai.com/blog/deep-double-descent/
|
| [2]: https://arxiv.org/abs/2008.03703
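| If you want to see the shape of the curve without training any
| deep nets, a toy version (minimum-norm least squares on random
| ReLU features, loosely in the spirit of [0]; not the exact setup
| from either paper) already shows the test error spiking around
| the interpolation threshold and then coming back down:
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|
|   # Toy 1-D regression task: y = sin(x) + noise
|   n_train = 40
|   x_train = rng.uniform(-3, 3, n_train)
|   y_train = np.sin(x_train) + 0.3 * rng.standard_normal(n_train)
|   x_test = np.linspace(-3, 3, 500)
|   y_test = np.sin(x_test)
|
|   def relu_features(x, W, b):
|       # Random ReLU features: phi_j(x) = max(0, W_j * x + b_j)
|       return np.maximum(0.0, np.outer(x, W) + b)
|
|   for p in [5, 10, 20, 40, 80, 160, 640]:
|       W, b = rng.standard_normal(p), rng.standard_normal(p)
|       Phi_tr = relu_features(x_train, W, b)
|       Phi_te = relu_features(x_test, W, b)
|       # pinv gives the minimum-norm solution once p > n_train
|       theta = np.linalg.pinv(Phi_tr) @ y_train
|       tr = np.mean((Phi_tr @ theta - y_train) ** 2)
|       te = np.mean((Phi_te @ theta - y_test) ** 2)
|       print(f"{p:4d} features  train MSE {tr:.3f}  test MSE {te:.3f}")
|   # Test error typically peaks near p = n_train (the interpolation
|   # threshold) and falls again as p keeps growing; exact numbers
|   # depend on the seed and the noise level.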
| derbOac wrote:
| I think there's something critical about the implicit data
| universe being considered in the test and training data, and
| in these randomized datasets. Memorizing elephants isn't
| necessarily a bad thing if you can be assured there are
| actually elephants in the data, or if your job is to
| reproduce some data that has some highly non-random, low
| entropy (in an abstract sense) features.
|
| I think where the phenomenon in this paper, and deep double
| descent, start to clash with intuition is the more realistic
| case where the adversarial data universe is structured, not
| random, but not conforming to the observed training target
| label alphabet (to borrow a term loosely from the IT
| literature). That is, it's interesting to know that these
| models can perfectly reproduce random data, but generalizing
| from training to test data isn't interesting in a real-world
| sense if both are constrained by some implicit features of
| the data universe involved in the modeling process (e.g.,
| that the non-elephant data only differs randomly from the
| elephant data, or doesn't contain any nonelephants that
| aren't represented in the training data). So then you end up
| with this:
| https://www.theverge.com/2020/9/20/21447998/twitter-photo-
| pr...
|
| I guess it seems to me there are a lot of implicit assumptions
| about the data space and what's actually being inferred in a
| lot of these DL models. The insight about SGD is useful, but
| maybe only underscores certain things, and seems to get lost
| in some of the discussion about DDD. Even Rademacher
| complexity isn't taken with regard to the _entire_ dataspace,
| just over a uniformly random sampling of it -- so it will
| underrepresent certain corners of the data space that are
| highly nonuniform, low entropy, which is exactly where the
| trouble lies.
|
| There's lots of fascinating stuff in this area of research,
| lots to say, glad to see it here on HN again.
| caddemon wrote:
| Would that necessarily imply it is memorizing on the non-random
| labels though? I know analogies to human learning are overdone,
| but I definitely have seen humans fall back on memorization
| when they are struggling to "actually learn" some topic.
|
| So genuinely asking from a technical/ML perspective - is it
| possible a network optimizes training loss without memorization
| when it can, but falls back on memorizing when that fails?
| hervature wrote:
| > is it possible a network could optimize training loss
| without memorization when possible
|
| What does that even mean? ML is simply a set of tools to
| minimize sum_i (f(x_i) - y_i)^2 (if you don't like squared
| loss, pick whatever you prefer). In particular, f() here is our
| neural network. There is no loss without previous data. The
| only thing the network is trying to do is "memorize" old
| data.
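|
| To make that concrete, the whole "learning" step is a loop like
| this (toy sketch with a linear f and squared loss; a neural
| network only changes the form of f and how the gradient is
| computed, not the objective):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|
|   # Old data (x_i, y_i) -- the only thing the objective ever sees.
|   X = rng.standard_normal((100, 5))
|   y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) \
|       + 0.1 * rng.standard_normal(100)
|
|   w = np.zeros(5)   # parameters of f(x) = x . w
|   lr = 0.05
|   for _ in range(500):
|       resid = X @ w - y                     # f(x_i) - y_i
|       w -= lr * 2 * X.T @ resid / len(y)    # grad of mean sq. loss
|
|   print("training MSE:", np.mean((X @ w - y) ** 2))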
| caddemon wrote:
| It depends, of course, on how you are defining memorization,
| but the network doesn't necessarily need to use the entirety of
| every input to do what you are describing. I would think
| what people mean when they say "it isn't learning, just
| memorizing" is that the vast majority of information about
| all previous inputs is being directly encoded in the
| network.
|
| The person I was responding to mentioned training on random
| labels, and if training still goes well the network must be
| a memorizer. But I don't see why it couldn't be the case
| that a network is able to act as a memorizer, but doesn't
| if there are certain patterns it can generalize on in the
| training data.
|
| Also, there is no human learning without previous data
| either, but I wouldn't characterize all of human learning
| as memorization.
| dumb1224 wrote:
| I don't understand the random label training part.
| Presumably you train on randomised labels which have no
| relationship with the input, but surely it won't
| generalise well at all, given the small probability of
| predicting the labels correctly by chance? (That's the
| setup for a permutation test, am I wrong?)
| magicalhippo wrote:
| This was the thing I misread the first time around.
|
| If you look at Table 1, you see that the models manage to
| fit the randomized labels almost 100% correctly, but
| crucially the corresponding test score is down in the 10%
| region, i.e. chance level. This is in stark contrast to
| the roughly 80-90% test score for the properly labeled
| data.
|
| So it seems to me that when faced with structured data
| they manage to generalize the structure somehow, while
| when faced with random training data they're powerful
| enough to simply memorize the training data.
|
| edit: just to point out, obviously it's to be expected
| that the test score will be bad for random input; after
| all, how can you properly test classification of random
| data?
|
| So the point, as I understand it, isn't that the
| randomized input leads to poor test results, but rather
| that the model trained on non-randomized labels manages
| to generalize despite being capable of simply memorizing
| the input.
| caddemon wrote:
| AFAIK that's right, it would be very unlikely to
| generalize on random labels, which is why I read the
| comment as suggesting the network shouldn't have low
| training loss in that situation.
| tlb wrote:
| There's a spectrum of generalization-memorization. The
| extreme case of memorization is that it would only correctly
| classify images with precisely identical pixel values to the
| ones in the training set. When we say that people have
| "memorized" something, we are still far more general than
| that. We might key off a few words, say.
|
| The behavior of highly overparameterized networks is
| basically what you suggest: they will memorize if needed
| (which you can test by randomizing labels) but will usually
| generalize better when possible.
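|
| That extreme end is literally a lookup table keyed on the
| exact input. A toy sketch:
|
|   import numpy as np
|
|   class LookupClassifier:
|       """Perfect on the training set, useless everywhere else."""
|
|       def fit(self, X, y):
|           self.table = {x.tobytes(): int(label)
|                         for x, label in zip(X, y)}
|           return self
|
|       def predict(self, X):
|           # -1 means "never saw this exact input during training"
|           return np.array([self.table.get(x.tobytes(), -1)
|                            for x in X])
|
|   rng = np.random.default_rng(0)
|   X_train = rng.integers(0, 256, (100, 28 * 28), dtype=np.uint8)
|   y_train = rng.integers(0, 10, 100)
|
|   clf = LookupClassifier().fit(X_train, y_train)
|   print((clf.predict(X_train) == y_train).mean())  # 1.0 on train
|
|   X_test = X_train.copy()
|   X_test[:, 0] ^= 1   # change a single pixel by one bit
|   print((clf.predict(X_test) == y_train).mean())   # 0.0: lost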
| bloaf wrote:
| That's not my reading. I think they are saying that models
| which *can* over-fit the data in both theory and practice
| _appear not to do so when there are in fact generalizations in
| the data._
| magicalhippo wrote:
| Ah hmm yes, good point. I forgot this piece from the article:
|
| _We observe a steady deterioration of the generalization
| error as we increase the noise level. This shows that neural
| networks are able to capture the remaining signal in the data
| while at the same time fit the noisy part using brute-force._
|
| Since they score well on the test data, they must have
| generalized to some degree. But since they're just as good at
| fitting random labels, they also have the capacity to just
| memorize the training data.
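|
| You can see that curve on a small scale too, e.g. by corrupting
| a growing fraction of the labels on scikit-learn's digits set
| (rough sketch, nothing like the paper's actual setup):
|
|   import numpy as np
|   from sklearn.datasets import load_digits
|   from sklearn.model_selection import train_test_split
|   from sklearn.neural_network import MLPClassifier
|
|   rng = np.random.default_rng(0)
|   X, y = load_digits(return_X_y=True)
|   X_tr, X_te, y_tr, y_te = train_test_split(X / 16.0, y,
|                                             test_size=0.3,
|                                             random_state=0)
|
|   for noise in [0.0, 0.25, 0.5, 0.75, 1.0]:
|       y_noisy = y_tr.copy()
|       flip = rng.random(len(y_tr)) < noise
|       # replace a fraction of the labels with random ones
|       y_noisy[flip] = rng.integers(0, 10, flip.sum())
|       clf = MLPClassifier(hidden_layer_sizes=(512,),
|                           max_iter=2000,
|                           random_state=0).fit(X_tr, y_noisy)
|       print(f"label noise {noise:.2f}  "
|             f"train acc {clf.score(X_tr, y_noisy):.2f}  "
|             f"test acc {clf.score(X_te, y_te):.2f}")
|   # Train accuracy tends to stay high (the brute-force fitting),
|   # while test accuracy degrades steadily toward the ~10% chance
|   # level.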
| blt wrote:
| Please add the (still) to the HN post title. The original version
| of the paper without (still) in the title is several years old.
| davnn wrote:
| 2016 for v1 to be exact, link: https://arxiv.org/abs/1611.03530
| and discussion: https://news.ycombinator.com/item?id=13566917
___________________________________________________________________
(page generated 2021-03-04 23:00 UTC)