[HN Gopher] Grokfast: Accelerated Grokking by Amplifying Slow Gr...
       ___________________________________________________________________
        
       Grokfast: Accelerated Grokking by Amplifying Slow Gradients
        
       Author : johnsutor
       Score  : 113 points
        Date   : 2024-06-03 20:27 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | svara wrote:
       | Grokking is certainly an interesting phenomenon, but have
       | practical applications of it been discovered yet?
       | 
        | I remember seeing grokking demonstrated for MNIST (are there any
        | other non-synthetic datasets for which it has been shown?), but
        | the authors of that paper had to shrink the training data and
        | ended up with test accuracy well below the state of the art.
       | 
       | I'm very interested in this research, just curious about how
       | practically relevant it is (yet).
        
         | fwlr wrote:
         | My gut instinct from reading about the phenomenon says that a
         | "grokked" model of X parameters on Y tokens is not going to
         | outperform an "ungrokked" model with 2X parameters on 2Y tokens
         | - since "grokking" uses the same resources as parameter and
         | token scaling, it's simply not a competitive scaling mechanism
         | at the moment. It might make sense in some applications where
         | some other hard limit (e.g. memory capacity at inference time)
         | occurs before your resource limit AND you would still see good
         | returns on improvements in quality, but I suspect those are
         | still fairly narrow and/or rare applications.
        
           | joelthelion wrote:
           | Wouldn't it be super useful in cases where data is limited?
        
           | d3m0t3p wrote:
            | According to https://arxiv.org/abs/2405.15071, their grokked
            | model outperformed GPT-4 and Gemini 1.5 on the reasoning task.
            | We can argue about whether the task makes sense and whether
            | the conclusion holds for other use cases, but I think grokking
            | can be useful.
        
         | Legend2440 wrote:
         | Nobody is really looking for practical applications for it, and
         | you shouldn't necessarily expect them from this kind of
         | academic research.
        
           | svara wrote:
           | That doesn't sound right at all.
           | 
           | Improving generalization in deep learning is a big deal. The
           | phenomenon is academically interesting either way, but e.g.
            | making SOTA nets more economical with training data seems like
            | a practical result that might be entirely within reach.
        
             | whimsicalism wrote:
             | i think y'all are both right. grokking is a phenomenon that
             | by definition applies to severely overfit neural networks,
             | which is a very different regime than modern ML - but we
             | might learn something from this that we can use to improve
             | regularization
        
               | barfbagginus wrote:
               | Looks like grokking could give better reasoning and
               | generalization to LLMs, but I'm not sure how practical it
               | would be to overfit a larger LLM
               | 
               | See: https://arxiv.org/abs/2405.15071
        
         | whimsicalism wrote:
         | grokking doesn't and will not have practical uses, imo - it is
         | just an experiment that revealed cool things that we mostly
         | already suspected about implicit regularization
         | 
         | however, techniques we learn from grokking about implicit
         | regularization might be helpful for the training regimes we
         | actually use
        
       | esafak wrote:
       | I missed the beginning of the story. Why and when does grokking
       | occur? It seems to be a case of reaching a new basin, casting
       | doubt on the shallow basin hypothesis in over-parameterized
        | neural networks? Last I checked, all the extrema in such models
        | were supposed to be good and easy to reach?
        
         | killerstorm wrote:
         | IIRC it was observed in a training mode with weight decay.
         | Perhaps a basin with proper generalization is more stable.
        
         | whimsicalism wrote:
         | i've worked in this field for 6 years and have never heard of
         | the 'shallow basin hypothesis', care to explain more? is it
         | just the idea that there are many good solutions that can be
         | reached in very different parts of parameter space?
         | 
         | all that grokking really means is that the 'correct',
         | generalizable solution is often simpler than the overfit
         | 'memorize all the datapoints' solution, so if you apply some
         | sort of regularization to a model that you overfit, the
         | regularization will make the memorized solution unstable and
         | you will eventually tunnel over to the 'correct' solution
         | 
         | actual DNNs nowadays are usually not obviously overfit because
         | they are trained on only one epoch
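          | 
          | if you want to watch it happen, the usual demo is a tiny net on
          | modular addition with a small train split and strong weight
          | decay, trained way past the point where train accuracy hits
          | 100%. rough, untested sketch (pytorch assumed, hyperparameters
          | are arbitrary):
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   P = 97  # modulus for the toy modular-addition task
          |   a = torch.arange(P)
          |   pairs = torch.cartesian_prod(a, a)
          |   labels = (pairs[:, 0] + pairs[:, 1]) % P
          |   idx = torch.randperm(len(pairs))
          |   train, test = idx[:3000], idx[3000:]  # small train split
          | 
          |   model = nn.Sequential(
          |       nn.Embedding(P, 128), nn.Flatten(),
          |       nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, P))
          |   # heavy weight decay is the regularizer that eventually
          |   # makes the memorized solution unstable
          |   opt = torch.optim.AdamW(
          |       model.parameters(), lr=1e-3, weight_decay=1.0)
          |   loss_fn = nn.CrossEntropyLoss()
          | 
          |   for step in range(100_000):  # long past memorization
          |       opt.zero_grad()
          |       loss_fn(model(pairs[train]), labels[train]).backward()
          |       opt.step()
          |       if step % 1000 == 0:
          |           with torch.no_grad():
          |               acc = (model(pairs[test]).argmax(-1)
          |                      == labels[test]).float().mean()
          |           print(step, acc.item())  # jumps much later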
        
           | dontwearitout wrote:
           | I haven't heard the term "shallow basin hypothesis" but I
           | know what it refers to, these two papers spring to mind for
           | me:
           | 
           | 1) Loss Surfaces, Mode Connectivity, and Fast Ensembling of
           | DNNs https://arxiv.org/abs/1802.10026
           | 
           | 2) Visualizing the Loss Landscape of Neural Nets
           | https://arxiv.org/abs/1712.09913
           | 
           | There's also a very interesting body of work on merging
           | trained models, such as by interpolating between points in
           | weight space, which relates to the concept of "basins" of
           | similar solutions. Skim the intro of this if you're
           | interested in learning more: https://arxiv.org/abs/2211.08403
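            | 
            | The basic probe behind a lot of that work is simple:
            | linearly interpolate between two checkpoints in weight space
            | and watch the loss along the path; a low-loss path suggests
            | the two solutions sit in the same basin. Rough sketch (the
            | model, checkpoints, and evaluate() are placeholders, and it
            | assumes float-only state dicts):
            | 
            |   import torch
            | 
            |   def interpolate(sd_a, sd_b, t):
            |       # element-wise blend of two state dicts
            |       return {k: (1 - t) * sd_a[k] + t * sd_b[k]
            |               for k in sd_a}
            | 
            |   for t in torch.linspace(0, 1, 11):
            |       model.load_state_dict(interpolate(sd_a, sd_b, t))
            |       print(t.item(), evaluate(model))  # your eval loop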
        
             | whimsicalism wrote:
             | cheers! i'm familiar with those first two papers, just not
             | with the specific term. my intuition was more relatively
             | deep points connected by tunnels than shallow basin - but
             | it might just be the difficulty of describing high
             | dimensional spaces
        
             | esafak wrote:
             | Yes, you both understood what I meant. I just coined the
             | term, having in mind illustrations like Fig. 1 in _Low-Pass
             | Filtering SGD for Recovering Flat Optima in the Deep
             | Learning Optimization Landscape_
             | (https://proceedings.mlr.press/v151/bisla22a.html)
             | 
             | Reviewing the literature, I see the concept is more
             | commonly referred to as "flat/wide minima"; e.g.,
             | https://www.pnas.org/doi/10.1073/pnas.1908636117
        
       | buildbot wrote:
       | Why only MNIST and a Graph CNN? Those are small and somewhat odd
       | choices. Scale these days should be at least 100 million param
       | models and something like OpenWebText as a dataset in my opinion.
        | Not sure what the SoTA is for vision, but same argument there.
        
         | dzdt wrote:
         | This paper is from a small group at an academic institution.
         | They are trying to innovate in the idea space and are probably
         | quite compute constrained. But for proving ideas smaller
         | problems can make easier analysis even leaving aside compute
         | resources. Not all research can jump straight to SOTA
         | applications. It looks quite interesting, and I wouldn't be
         | surprised to see it applied soon to larger problems.
        
           | buildbot wrote:
           | I've been in a small group at an academic institution. With
           | our meager resources we trained larger models than this on
            | many different vision problems. I personally train LLMs
            | larger than this on OpenWebText using a few 4090s (not work
            | related).
           | Is that too much for a small group?
           | 
           | MNIST is solvable using two pixels. It shouldn't be one of
           | two benchmarks in a paper, again just in my opinion. It's
           | useful for debugging only.
        
             | all2 wrote:
              | Again, a small academic institution may not have the
              | experience or know-how to realize these things.
        
               | olnluis wrote:
                | I thought so at first, but the repo's[0] owner, who is
                | also the first name listed on the paper, has Seoul
                | National University on their GitHub profile. That's far
                | from a small academic institution.
               | 
               | [0]: https://github.com/ironjr/grokfast
        
             | whimsicalism wrote:
             | > MNIST is solvable using two pixels.
             | 
             | really? do you have any details?
             | 
             | agree it has no business being in a modern paper
        
             | muskmusk wrote:
             | It's a free world. Nothing stops you from applying their
             | findings to bigger datasets. It would be a valuable
             | contribution.
        
             | gwern wrote:
             | How can MNIST be solved using just two binary pixels when
             | there's 10 classes, 0-9?
        
               | whimsicalism wrote:
               | i'm also curious but my understanding was MNIST pixels
               | are not binary due to some postprocessing artifacts
        
           | krasin wrote:
           | > They are trying to innovate in the idea space and are
           | probably quite compute constrained.
           | 
            | Training a GPT-2-sized model costs ~$20 in compute nowadays:
            | https://github.com/karpathy/llm.c/discussions/481
        
             | orlp wrote:
              | $20 per attempt. A paper typically comes after trying
              | hundreds of things. That said, the final version of your
              | idea could certainly be tried at that scale.
        
             | olaulaja wrote:
              | Baseline time to grok something looks to be around 1000x
              | normal training time, so make that ~$20k per attempt.
             | Probably takes a while too. Their headline number (50x
             | faster than baseline, $400) looks pretty doable if you can
             | make grokking happen reliably at that speed.
        
         | QuadmasterXLII wrote:
          | It's because that's where the effect is showing up right now.
          | This is a situation where the analogy to pre-paradigmatic
          | optics is pretty strong: if your telescope for taking pictures
          | of Jupiter had problems with rainbow fringes, you'd design a
          | diffraction grating to investigate the fringes.
        
         | whimsicalism wrote:
         | i don't think grokking has been demonstrated in a large model
         | yet
        
           | HarHarVeryFunny wrote:
            | There's a recent paper where they've seen it in a deeper
            | model:
           | 
           | https://arxiv.org/abs/2405.19454
           | 
           | It's a bit surprising that apparently this hasn't been done
           | before - to see how universal of a phenomenon this is.
        
         | Imnimo wrote:
         | Grokking may not even occur for datasets of that scale. Even
         | the MNIST experiments require dropping the training data size
         | from 50k examples to 1k. The reason for this is that the
         | phenomenon seems to occur at a critical zone of having just
         | barely enough training data to make generalization possible.
         | See https://arxiv.org/abs/2205.10343 for details.
         | 
         | Even figuring out how to induce grokking behavior on a 100M
         | model or OpenWebText would be a big leap in the understanding
         | of grokking. It's perfectly reasonable for a paper like this to
         | show results on the standard tasks for which grokking has
         | already been characterized.
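          | 
          | For reference, the MNIST experiments get into that critical
          | zone by simply subsampling the training set. Rough sketch of
          | that piece (torchvision assumed; 1k is the ballpark figure
          | from the paper, not a magic number):
          | 
          |   import torch
          |   from torchvision import datasets, transforms
          | 
          |   full = datasets.MNIST(
          |       "data", train=True, download=True,
          |       transform=transforms.ToTensor())
          |   keep = torch.randperm(len(full))[:1000].tolist()
          |   small_train = torch.utils.data.Subset(full, keep)
          |   # then: train far longer than usual on small_train, with
          |   # strong regularization, and track test accuracy throughout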
        
         | gessha wrote:
          | Several reasons: computational cost, the effort it takes to
          | reproduce results on complex datasets, and the academic
          | publishing model.
          | 
          | Academic research is oftentimes about making incremental steps
          | and limiting uncertainty.
         | 
         | Making something work for MNIST is already so much work,
         | researchers don't have the time, money, and energy to run
         | experiments for 10 datasets.
         | 
          | Complex datasets are much harder to train a proper model on -
          | larger images, more tasks and classes, etc.
         | 
         | Also, as soon as you run your experiments on more datasets, you
         | create an opportunity for reviewers to take you down - "why
         | didn't you test it on this other dataset?"
        
       | curious_cat_163 wrote:
       | Cute! The signal processing folks have entered the room... :)
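        | 
        | As I read the title, the trick is basically a low-pass filter
        | over each parameter's gradient stream: keep a running EMA of
        | the gradients and add an amplified copy of it back before the
        | optimizer step. Rough, untested sketch of that idea (not the
        | authors' code; alpha and lamb are made-up values, see the paper
        | and repo for the actual filters):
        | 
        |   import torch
        | 
        |   def amplify_slow_grads(model, ema, alpha=0.98, lamb=2.0):
        |       # ema holds the low-pass-filtered ("slow") gradient
        |       # component, keyed by parameter name
        |       for name, p in model.named_parameters():
        |           if p.grad is None:
        |               continue
        |           prev = ema.get(name, p.grad.clone())
        |           ema[name] = alpha * prev + (1 - alpha) * p.grad
        |           # boost the slow component on top of the raw grad
        |           p.grad = p.grad + lamb * ema[name]
        |       return ema
        | 
        |   # usage, between loss.backward() and opt.step():
        |   #   ema = amplify_slow_grads(model, ema)  # ema starts as {}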
        
       | utensil4778 wrote:
        | I'm really annoyed that AI types keep stealing well-established
        | vocabulary from _everywhere_ and assigning new arbitrary
        | definitions to it.
       | 
       | You have countless LLMs, use one of them to generate new names
       | that don't require ten billion new disambiguation pages in
       | Wikipedia.
        
       | eigenvalue wrote:
        | I have a suspicion that this technique will prove most valuable
        | for market-oriented data sets (like price-related time series),
        | where there isn't the massive data scale of text corpora, and
        | where there are very tight limits on the amount of training data
        | because you only want to include recent data, to reduce the
        | impact of market regime changes. This approach seems to shine
        | when you don't quite have enough training data to completely map
        | out the general case, but if you train naively for long enough
        | you can get lucky and fall into it.
        
       ___________________________________________________________________
       (page generated 2024-06-04 23:02 UTC)