[HN Gopher] Grokfast: Accelerated Grokking by Amplifying Slow Gr...
___________________________________________________________________
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
Author : johnsutor
Score : 113 points
Date : 2024-06-03 20:27 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| svara wrote:
| Grokking is certainly an interesting phenomenon, but have
| practical applications of it been discovered yet?
|
| I remember seeing grokking demonstrated for MNIST (are there any
| other non-synthetic datasets for which it has been shown?), but
| the authors of that paper had to shrink the training data and
| ended up with test accuracy far below the state of the art.
|
| I'm very interested in this research, just curious about how
| practically relevant it is (yet).
| fwlr wrote:
| My gut instinct from reading about the phenomenon says that a
| "grokked" model of X parameters on Y tokens is not going to
| outperform an "ungrokked" model with 2X parameters on 2Y tokens
| - since "grokking" uses the same resources as parameter and
| token scaling, it's simply not a competitive scaling mechanism
| at the moment. It might make sense in some applications where
| some other hard limit (e.g. memory capacity at inference time)
| occurs before your resource limit AND you would still see good
| returns on improvements in quality, but I suspect those are
| still fairly narrow and/or rare applications.
| joelthelion wrote:
| Wouldn't it be super useful in cases where data is limited?
| d3m0t3p wrote:
| According to https://arxiv.org/abs/2405.15071, their grokked
| model outperformed GPT-4 and Gemini 1.5 on the reasoning task.
| We can then argue whether the task makes sense and whether the
| conclusion holds for other use cases, but I think grokking can
| be useful.
| Legend2440 wrote:
| Nobody is really looking for practical applications for it, and
| you shouldn't necessarily expect them from this kind of
| academic research.
| svara wrote:
| That doesn't sound right at all.
|
| Improving generalization in deep learning is a big deal. The
| phenomenon is academically interesting either way, but e.g.
| making sota nets more economical with training data seems like a
| practical result that might be entirely within reach.
| whimsicalism wrote:
| i think y'all are both right. grokking is a phenomenon that
| by definition applies to severely overfit neural networks,
| which is a very different regime than modern ML - but we
| might learn something from this that we can use to improve
| regularization
| barfbagginus wrote:
| Looks like grokking could give better reasoning and
| generalization to LLMs, but I'm not sure how practical it
| would be to overfit a larger LLM
|
| See: https://arxiv.org/abs/2405.15071
| whimsicalism wrote:
| grokking doesn't and will not have practical uses, imo - it is
| just an experiment that revealed cool things that we mostly
| already suspected about implicit regularization
|
| however, techniques we learn from grokking about implicit
| regularization might be helpful for the training regimes we
| actually use
| esafak wrote:
| I missed the beginning of the story. Why and when does grokking
| occur? It seems to be a case of reaching a new basin, casting
| doubt on the shallow basin hypothesis in over-parameterized
| neural networks? Last I checked, all the extrema in such models
| were supposed to be good, and easy to reach?
| killerstorm wrote:
| IIRC it was observed in a training mode with weight decay.
| Perhaps a basin with proper generalization is more stable.
| whimsicalism wrote:
| i've worked in this field for 6 years and have never heard of
| the 'shallow basin hypothesis', care to explain more? is it
| just the idea that there are many good solutions that can be
| reached in very different parts of parameter space?
|
| all that grokking really means is that the 'correct',
| generalizable solution is often simpler than the overfit
| 'memorize all the datapoints' solution, so if you apply some
| sort of regularization to a model that you overfit, the
| regularization will make the memorized solution unstable and
| you will eventually tunnel over to the 'correct' solution
|
| actual DNNs nowadays are usually not obviously overfit because
| they are trained on only one epoch
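|
| a minimal sketch of that dynamic (assumptions: PyTorch, modular
| addition as the toy task, a small MLP, heavy AdamW weight decay
| as the regularizer; every hyperparameter here is illustrative,
| not taken from any particular paper):
|
|   import torch
|   import torch.nn as nn
|
|   p = 97
|   pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
|   labels = (pairs[:, 0] + pairs[:, 1]) % p
|   perm = torch.randperm(len(pairs))
|   n_train = len(pairs) // 2        # deliberately small train split
|   tr, te = perm[:n_train], perm[n_train:]
|
|   model = nn.Sequential(
|       nn.Embedding(p, 64), nn.Flatten(),
|       nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, p))
|   opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
|                           weight_decay=1.0)  # strong regularization
|   loss_fn = nn.CrossEntropyLoss()
|
|   for step in range(100_000):
|       opt.zero_grad()
|       loss = loss_fn(model(pairs[tr]), labels[tr])
|       loss.backward()
|       opt.step()
|       if step % 1000 == 0:
|           with torch.no_grad():
|               acc = (model(pairs[te]).argmax(-1)
|                      == labels[te]).float().mean()
|           # train loss collapses early; test accuracy typically
|           # stays near chance and only jumps much later
|           print(step, loss.item(), acc.item())
|
| the point is only the regime: the memorized fit happens almost
| immediately, and the weight decay is what eventually destabilizes
| it in favor of the simpler, generalizing solution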
| dontwearitout wrote:
| I haven't heard the term "shallow basin hypothesis", but I
| know what it refers to; these two papers spring to mind for
| me:
|
| 1) Loss Surfaces, Mode Connectivity, and Fast Ensembling of
| DNNs https://arxiv.org/abs/1802.10026
|
| 2) Visualizing the Loss Landscape of Neural Nets
| https://arxiv.org/abs/1712.09913
|
| There's also a very interesting body of work on merging
| trained models, such as by interpolating between points in
| weight space, which relates to the concept of "basins" of
| similar solutions. Skim the intro of this if you're
| interested in learning more: https://arxiv.org/abs/2211.08403
| whimsicalism wrote:
| cheers! i'm familiar with those first two papers, just not
| with the specific term. my intuition was more relatively
| deep points connected by tunnels than shallow basin - but
| it might just be the difficulty of describing high
| dimensional spaces
| esafak wrote:
| Yes, you both understood what I meant. I just coined the
| term, having in mind illustrations like Fig. 1 in _Low-Pass
| Filtering SGD for Recovering Flat Optima in the Deep
| Learning Optimization Landscape_
| (https://proceedings.mlr.press/v151/bisla22a.html)
|
| Reviewing the literature, I see the concept is more
| commonly referred to as "flat/wide minima"; e.g.,
| https://www.pnas.org/doi/10.1073/pnas.1908636117
| buildbot wrote:
| Why only MNIST and a Graph CNN? Those are small and somewhat odd
| choices. Scale these days should be at least 100 million param
| models and something like OpenWebText as a dataset in my opinion.
| Not sure what the SoTA is for vision, but the same argument
| applies there.
| dzdt wrote:
| This paper is from a small group at an academic institution.
| They are trying to innovate in the idea space and are probably
| quite compute constrained. Besides, for proving out ideas,
| smaller problems make for easier analysis, even leaving aside
| compute resources. Not all research can jump straight to SOTA
| applications. It looks quite interesting, and I wouldn't be
| surprised to see it applied soon to larger problems.
| buildbot wrote:
| I've been in a small group at an academic institution. With
| our meager resources we trained larger models than this on
| many different vision problems. I personally train LLMs larger
| than this on OpenWebText using a few 4090s (not work related).
| Is that too much for a small group?
|
| MNIST is solvable using two pixels. It shouldn't be one of
| two benchmarks in a paper, again just in my opinion. It's
| useful for debugging only.
| all2 wrote:
| Again, a small academic institution may not have the
| experience or know-how for these things.
| olnluis wrote:
| I thought so at first, but the repo's[0] owner, who is also
| the first author listed on the paper, has Seoul National
| University on their GitHub profile. That's far from a small
| academic institution.
|
| [0]: https://github.com/ironjr/grokfast
| whimsicalism wrote:
| > MNIST is solvable using two pixels.
|
| really? do you have any details?
|
| agree it has no business being in a modern paper
| muskmusk wrote:
| It's a free world. Nothing stops you from applying their
| findings to bigger datasets. It would be a valuable
| contribution.
| gwern wrote:
| How can MNIST be solved using just two binary pixels when
| there's 10 classes, 0-9?
| whimsicalism wrote:
| i'm also curious but my understanding was MNIST pixels
| are not binary due to some postprocessing artifacts
| krasin wrote:
| > They are trying to innovate in the idea space and are
| probably quite compute constrained.
|
| Training a GPT-2-sized model costs ~$20 in compute nowadays:
| https://github.com/karpathy/llm.c/discussions/481
| orlp wrote:
| $20 per attempt. A paper typically comes after trying
| hundreds of things. That said, the final version of your
| idea could certainly try it.
| olaulaja wrote:
| Baseline time to grok something looks to be around 1000x
| normal training time, so make that $20k per attempt.
| Probably takes a while too. Their headline number (50x
| faster than baseline, $400) looks pretty doable if you can
| make grokking happen reliably at that speed.
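|
| (Spelling out the arithmetic, using only the rough estimates
| quoted above:
|
|   gpt2_run_cost = 20                  # ~$20 per normal-length run
|   grok_cost = gpt2_run_cost * 1000    # ~1000x longer => ~$20,000
|   grokfast_cost = grok_cost / 50      # claimed 50x speedup => ~$400
|
| so the $400 figure assumes the claimed speedup carries over to a
| GPT-2-scale run.)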
| QuadmasterXLII wrote:
| It's because that's where the effect is showing up right now.
| This is a situation where the analogy to pre-paradigmatic
| optics is pretty strong: it's as if your telescope for taking
| pictures of Jupiter was having problems with rainbow fringes,
| so you designed a diffraction grating to investigate the
| fringes.
| whimsicalism wrote:
| i don't think grokking has been demonstrated in a large model
| yet
| HarHarVeryFunny wrote:
| There's a recent paper here where they've seen it in a deeper
| model.
|
| https://arxiv.org/abs/2405.19454
|
| It's a bit surprising that apparently this hasn't been done
| before - to see how universal of a phenomenon this is.
| Imnimo wrote:
| Grokking may not even occur for datasets of that scale. Even
| the MNIST experiments require dropping the training data size
| from 50k examples to 1k. The reason for this is that the
| phenomenon seems to occur at a critical zone of having just
| barely enough training data to make generalization possible.
| See https://arxiv.org/abs/2205.10343 for details.
|
| Even figuring out how to induce grokking behavior on a 100M
| model or OpenWebText would be a big leap in the understanding
| of grokking. It's perfectly reasonable for a paper like this to
| show results on the standard tasks for which grokking has
| already been characterized.
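|
| For concreteness, that data reduction is just subsampling (a
| sketch assuming torchvision; the 1k figure comes from the MNIST
| grokking experiments cited above, the rest is illustrative):
|
|   import torch
|   from torchvision import datasets, transforms
|
|   full = datasets.MNIST("data", train=True, download=True,
|                         transform=transforms.ToTensor())
|   idx = torch.randperm(len(full))[:1000]   # keep only 1k examples
|   small_train = torch.utils.data.Subset(full, idx)
|   # training an ordinary classifier on `small_train` with strong
|   # weight decay is the barely-enough-data regime where delayed
|   # generalization (grokking) has been reported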
| gessha wrote:
| Several reasons - computational cost, the effort it takes to
| reproduce results on complex datasets, and the academic
| publishing model.
|
| Academic research is oftentimes about making incremental steps
| and limiting uncertainty.
|
| Making something work for MNIST is already so much work,
| researchers don't have the time, money, and energy to run
| experiments for 10 datasets.
|
| Complex datasets are much harder to get a proper model trained
| on - larger images, more tasks, more classes, etc.
|
| Also, as soon as you run your experiments on more datasets, you
| create an opportunity for reviewers to take you down - "why
| didn't you test it on this other dataset?"
| curious_cat_163 wrote:
| Cute! The signal processing folks have entered the room... :)
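|
| For anyone wondering what the signal-processing angle is: the
| title's "amplifying slow gradients" reads as low-pass filtering
| the per-parameter gradient signal and boosting that component.
| A sketch of that idea (assumptions: PyTorch, an EMA as the
| low-pass filter; `alpha` and `lam` are illustrative knobs, and
| this is a reading of the title, not the paper's own code):
|
|   import torch
|
|   def amplify_slow_gradients(model, ema_state, alpha=0.98,
|                              lam=2.0):
|       # add a scaled EMA of past gradients onto the current ones
|       for name, p in model.named_parameters():
|           if p.grad is None:
|               continue
|           g = p.grad.detach()
|           ema = ema_state.get(name)
|           if ema is None:
|               ema = g.clone()
|           else:
|               ema = alpha * ema + (1 - alpha) * g
|           ema_state[name] = ema
|           p.grad.add_(lam * ema)   # boost the slow-varying part
|       return ema_state
|
|   # usage, between loss.backward() and optimizer.step():
|   #   ema_state = {}   # created once, before training
|   #   ema_state = amplify_slow_gradients(model, ema_state)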
| utensil4778 wrote:
| I'm really annoyed that AI types are just stealing
| well-established vocabulary from _everywhere_ and assigning
| new, arbitrary definitions to it.
|
| You have countless LLMs; use one of them to generate new names
| that don't require ten billion new disambiguation pages in
| Wikipedia.
| eigenvalue wrote:
| I have a suspicion that this technique will prove most valuable
| for market-oriented datasets (like price-related time series),
| where there isn't the massive data scale of text corpora, and
| where there are very tight limits on the amount of training
| data because you only want to include recent data to reduce the
| impact of market regime changes. This approach seems to shine
| when you don't quite have enough training data to completely
| map out the general case, but if you train naively for long
| enough, you can get lucky and fall into it.
___________________________________________________________________
(page generated 2024-06-04 23:02 UTC)