[HN Gopher] Grokfast: Accelerated Grokking by Amplifying Slow Gradients
___________________________________________________________________
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
Author : johnsutor
Score : 44 points
Date : 2024-06-03 20:27 UTC (2 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
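As context for the title: a minimal sketch, in PyTorch, of the kind of
gradient low-pass filtering the paper describes, i.e. keep an exponential
moving average (EMA) of each parameter's gradient and add an amplified
copy of that slow component back before the optimizer step. The function
name and hyperparameter values below are illustrative assumptions, not
the official repository's API.

    import torch

    @torch.no_grad()
    def ema_gradfilter(model, state=None, alpha=0.98, lamb=2.0):
        # Hypothetical helper: maintain an EMA (low-pass filter) of each
        # parameter's gradient and amplify that slow component in place.
        if state is None:
            state = {}
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            if name in state:
                state[name] = alpha * state[name] + (1 - alpha) * p.grad
            else:
                state[name] = p.grad.clone()
            p.grad.add_(lamb * state[name])  # g <- g + lambda * slow(g)
        return state

    # Typical placement in a training loop (illustrative):
    #   loss.backward()
    #   filt = ema_gradfilter(model, filt, alpha=0.98, lamb=2.0)
    #   optimizer.step(); optimizer.zero_grad()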
| svara wrote:
| Grokking is certainly an interesting phenomenon, but have
| practical applications of it been discovered yet?
|
| I remember seeing grokking demonstrated for MNIST (are there any
| other non-synthetic datasets for which it has been shown?), but
| the authors of that paper had to shrink the training set, and the
| resulting test accuracy was far below state of the art.
|
| I'm very interested in this research, just curious about how
| practically relevant it is (yet).
| fwlr wrote:
| My gut instinct from reading about the phenomenon says that a
| "grokked" model of X parameters on Y tokens is not going to
| outperform an "ungrokked" model with 2X parameters on 2Y tokens
| - since "grokking" uses the same resources as parameter and
| token scaling, it's simply not a competitive scaling mechanism
| at the moment. It might make sense in some applications where
| some other hard limit (e.g. memory capacity at inference time)
| occurs before your resource limit AND you would still see good
| returns on improvements in quality, but I suspect those are
| still fairly narrow and/or rare applications.
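A rough way to put numbers on the resource argument in the comment above
(a back-of-envelope sketch only; the ~6*N*D training-FLOPs rule of thumb
and the assumed extra-training factor are illustrative assumptions, not
figures from the paper):

    def train_flops(params, tokens):
        # Common rule of thumb: training compute ~ 6 * parameters * tokens.
        return 6 * params * tokens

    base    = train_flops(1e9, 1e10)   # X params on Y tokens
    scaled  = train_flops(2e9, 2e10)   # 2X params on 2Y tokens
    grokked = train_flops(1e9, 4e10)   # same model trained ~4x longer to grok
    print(scaled / base, grokked / base)  # both ~4.0: comparable budgets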
| esafak wrote:
| I missed the beginning of the story. Why and when does grokking
| occur? It seems to be a case of reaching a new basin, casting
| doubt on the shallow basin hypothesis in over-parameterized
| neural networks? Last I checked, all the extrema in such models
| were supposed to be good and easy to reach?
| buildbot wrote:
| Why only MNIST and a Graph CNN? Those are small and somewhat odd
| choices. Scale these days should be at least 100 million param
| models and something like OpenWebText as a dataset in my opinion.
| Not sure what the SoTA is for vision, but the same argument
| applies there.
| dzdt wrote:
| This paper is from a small group at an academic institution.
| They are trying to innovate in the idea space and are probably
| quite compute-constrained. But for proving out ideas, smaller
| problems make analysis easier, even leaving aside compute
| resources. Not all research can jump straight to SOTA
| applications. It looks quite interesting, and I wouldn't be
| surprised to see it applied soon to larger problems.
| buildbot wrote:
| I've been in a small group at an academic institution. With
| our meager resources we trained larger models than this on
| many different vision problems. I personally train larger LLMs
| than this on OpenWebText using a few 4090s (not work related).
| Is that too much for a small group?
|
| MNIST is solvable using two pixels. It shouldn't be one of
| two benchmarks in a paper, again just in my opinion. It's
| useful for debugging only.
| all2 wrote:
| Again, a small academic institution may not have the
| experience or know-how for these things.
| olnluis wrote:
| I thought so at first, but the repo[0] owner, who is also the
| first author listed on the paper, has Seoul National University
| on their GitHub profile. That is far from a small academic
| institution.
|
| [0]: https://github.com/ironjr/grokfast
| krasin wrote:
| > They are trying to innovate in the idea space and are
| probably quite compute constrained.
|
| Training a GPT-2-sized model costs ~$20 in compute nowadays:
| https://github.com/karpathy/llm.c/discussions/481
___________________________________________________________________
(page generated 2024-06-03 23:00 UTC)