[HN Gopher] Grokfast: Accelerated Grokking by Amplifying Slow Gr...
       ___________________________________________________________________
        
       Grokfast: Accelerated Grokking by Amplifying Slow Gradients
        
       Author : johnsutor
       Score  : 44 points
       Date   : 2024-06-03 20:27 UTC (2 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | svara wrote:
       | Grokking is certainly an interesting phenomenon, but have
       | practical applications of it been discovered yet?
       | 
        | I remember seeing grokking demonstrated on MNIST (are there any
        | other non-synthetic datasets for which it has been shown?), but
        | the authors of that paper had to shrink the training set and
        | ended up with a test error far worse than state of the art.
       | 
       | I'm very interested in this research, just curious about how
       | practically relevant it is (yet).
        
         | fwlr wrote:
          | My gut instinct from reading about the phenomenon is that a
          | "grokked" model of X parameters trained on Y tokens is not
          | going to outperform an "ungrokked" model with 2X parameters
          | trained on 2Y tokens. Since grokking spends the same resource
          | (training compute) that parameter and token scaling spend,
          | it's simply not a competitive scaling mechanism at the moment.
          | It might make sense in applications where some other hard
          | limit (e.g. memory capacity at inference time) is hit before
          | the compute limit AND improvements in quality still pay off,
          | but I suspect those applications are fairly narrow and/or
          | rare.
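          A rough back-of-envelope for the trade-off described in the
          comment above, sketched under two assumptions that are not from
          this thread: that training compute can be approximated
          Chinchilla-style as C ~ 6 * N * D (N = parameters, D = tokens
          seen), and that reaching the grokked solution takes roughly
          10-100x more optimization steps on the same model and data. The
          model size, token count, and step multipliers below are purely
          illustrative.

              # Back-of-envelope (Python): grokking a fixed model vs.
              # doubling parameters and tokens. Assumes C ~ 6 * N * D and
              # that grokking means k times more optimization steps over
              # the same model and data; all numbers are illustrative.

              def train_flops(n, d, k=1.0):
                  """Approximate training FLOPs: 6 * N * D, times extra steps."""
                  return 6.0 * n * d * k

              N, D = 1e8, 2e9  # hypothetical baseline: 100M params, 2B tokens

              base    = train_flops(N, D)          # "ungrokked" baseline
              scaled  = train_flops(2 * N, 2 * D)  # 2X params on 2Y tokens
              grok_lo = train_flops(N, D, k=10)    # grokking, mild delay
              grok_hi = train_flops(N, D, k=100)   # grokking, long delay

              print(f"2X/2Y scaling: {scaled / base:.0f}x baseline compute")
              print(f"grokking:      {grok_lo / base:.0f}x to "
                    f"{grok_hi / base:.0f}x baseline compute")

          Under these assumptions the extended grokking run costs more
          compute than doubling both parameters and tokens (10-100x vs.
          4x the baseline), which is the comparison the comment is
          drawing; whether the grokked model's quality justifies that
          spend is the open question.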
        
       | esafak wrote:
       | I missed the beginning of the story. Why and when does grokking
        | occur? It seems to be a case of reaching a new basin, which
        | would cast doubt on the shallow-basin hypothesis for
        | over-parameterized neural networks? Last I checked, all the
        | extrema in such models were supposed to be good and easy to
        | reach?
        
       | buildbot wrote:
       | Why only MNIST and a Graph CNN? Those are small and somewhat odd
        | choices. In my opinion, scale these days should mean at least
        | 100-million-parameter models and a dataset like OpenWebText.
        | Not sure what the SoTA is for vision, but the same argument
        | applies there.
        
         | dzdt wrote:
         | This paper is from a small group at an academic institution.
         | They are trying to innovate in the idea space and are probably
          | quite compute constrained. And even leaving compute aside,
          | smaller problems make for easier analysis when you're proving
          | out an idea. Not all research can jump straight to SOTA
         | applications. It looks quite interesting, and I wouldn't be
         | surprised to see it applied soon to larger problems.
        
           | buildbot wrote:
           | I've been in a small group at an academic institution. With
           | our meager resources we trained larger models than this on
            | many different vision problems. I personally train larger
            | LLMs than this on OpenWebText using a few 4090s (not
            | work-related). Is that too much for a small group?
           | 
           | MNIST is solvable using two pixels. It shouldn't be one of
           | two benchmarks in a paper, again just in my opinion. It's
           | useful for debugging only.
        
             | all2 wrote:
             | Again, a small academic institution may not have the
              | experience or know-how to realize these things.
        
               | olnluis wrote:
                | I thought so at first, but the repo's[0] owner, who is
                | also the first author listed on the paper, has Seoul
                | National University on their GitHub profile. That's far
                | from a small academic institution.
               | 
               | [0]: https://github.com/ironjr/grokfast
        
           | krasin wrote:
           | > They are trying to innovate in the idea space and are
           | probably quite compute constrained.
           | 
            | Training a GPT-2-sized model costs ~$20 in compute nowadays:
            | https://github.com/karpathy/llm.c/discussions/481
        
       ___________________________________________________________________
       (page generated 2024-06-03 23:00 UTC)