[HN Gopher] IsoFLOP curves of large language models are flat
       ___________________________________________________________________
        
       IsoFLOP curves of large language models are flat
        
       Author : alexmolas
       Score  : 25 points
       Date   : 2024-08-01 14:05 UTC (1 day ago)
        
 (HTM) web link (severelytheoretical.wordpress.com)
 (TXT) w3m dump (severelytheoretical.wordpress.com)
        
       | z4y5f3 wrote:
        | What they missed is that current scaling laws (OpenAI,
        | DeepMind Chinchilla) are based on the assumption that the
        | model is trained for a single epoch. This essentially means
        | that in order to scale compute, you have to scale the model
        | size and/or the size of the dataset. So Meta cannot simply
        | spend 3.8e25 FLOPs on a 70B model: to do that, they would
        | need roughly 86T pretraining tokens, which they do not have.
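        | 
        | Back-of-the-envelope, assuming the usual C ~= 6*N*D rule of
        | thumb (the exact figure depends on the constant and the
        | precise parameter count, but it lands in the same ballpark):
        | 
        |   # Rough sanity check with C ~= 6*N*D.
        |   # N = params, D = training tokens, C = total FLOPs.
        |   C = 3.8e25        # compute budget (FLOPs)
        |   N = 70e9          # model size (parameters)
        |   D = C / (6 * N)   # tokens needed to spend C in one epoch
        |   print(f"{D:.2e} tokens")   # ~9e13, i.e. tens of trillions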
       | 
       | Of course, ultimately we will figure out scaling laws for LLMs
       | trained on multiple epochs of data, but not today.
        
         | ActivatedAI wrote:
          | There is some good published research on doing multiple
          | passes over the training data and how quickly learning
          | saturates. The TL;DR is that diminishing returns kick in
          | after about 4 epochs.
         | 
         | https://arxiv.org/abs/2305.16264
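          | 
          | To tie it back to the token-count issue above: if repeated
          | tokens counted at full value (which, per the paper, only
          | roughly holds for up to ~4 epochs), the unique-data
          | requirement shrinks accordingly. A naive sketch, same
          | C ~= 6*N*D assumption as before:
          | 
          |   # Naive unique-token budget with data repetition,
          |   # ignoring the decay in value of repeated tokens.
          |   C, N = 3.8e25, 70e9   # FLOPs and parameters, as above
          |   epochs = 4            # roughly where returns diminish
          |   unique = C / (6 * N * epochs)
          |   print(f"{unique:.2e} unique tokens")   # ~2.3e13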
        
           | z4y5f3 wrote:
            | Yep, I have seen this paper before, and thank you for
            | linking it here for reference. My personal opinion is
            | that, compared to single-epoch scaling laws, we still
            | need more evidence and literature on the effects of
            | multiple epochs, but this paper is one of the best
            | results we have so far on multi-epoch training.
        
       | radarsat1 wrote:
       | > So, these models are basically within 1% of each other in terms
       | of final pretraining loss.
       | 
        | How is this loss calculated, though? Since it is called
        | "loss" and not "performance metric", I'm going to assume it
        | is the teacher-forced cross-entropy loss.
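        | 
        | Something like this, if it's the standard next-token
        | objective (a PyTorch-style sketch of what I mean, not
        | anything taken from the post):
        | 
        |   import torch
        |   import torch.nn.functional as F
        | 
        |   # Teacher forcing: condition on the ground-truth prefix
        |   # and score the model's prediction of the next token.
        |   # logits: (batch, seq, vocab), tokens: (batch, seq)
        |   def pretraining_loss(logits, tokens):
        |       pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
        |       target = tokens[:, 1:].reshape(-1)
        |       return F.cross_entropy(pred, target)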
       | 
        | I'm not too familiar with LLM training, but having done a
        | fair amount of seq2seq training lately in other domains,
        | I've observed that the relationship between "loss" and
        | autoregressive inference performance gets very tight
        | towards the end of training. What I mean is that smaller
        | and smaller reductions in loss still lead to clear
        | improvements in the autoregressive output. So I suspect
        | that, in practice, that 1% difference in loss is actually
        | incredibly significant with respect to how well the model
        | actually performs at inference time.
       | 
        | But I'm pretty interested in this topic, and if people here
        | have observed this or the contrary, I'd be curious to know.
        
         | ActivatedAI wrote:
          | Page 9 of the Llama 3 tech report has an interesting
          | graph that predicts task-level performance from the
          | cross-entropy loss. The sigmoidal model fits well, and at
          | the steepest part of the S-curve, a 0.01 change in NLL is
          | worth about 5% in task-level accuracy.
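          | 
          | A toy version of that sensitivity (made-up sigmoid
          | parameters, not the paper's actual fit), just to show how
          | a small NLL shift moves accuracy near the steep part:
          | 
          |   import math
          | 
          |   # acc falls as NLL rises; k and x0 are invented here,
          |   # the paper fits its own curve per benchmark.
          |   k, x0 = 20.0, 0.65
          | 
          |   def acc(nll):
          |       return 1 / (1 + math.exp(k * (nll - x0)))
          | 
          |   # ~0.05: about 5% accuracy for a 0.01 change in NLL
          |   print(acc(0.65) - acc(0.66))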
         | 
         | Here is a quick screenshot if you are lazy.
         | 
         | https://snipboard.io/C6mipQ.jpg
         | 
          | And here is the paper if you want to dig deep into the
          | highest-quality published research on the LLM frontier.
         | 
         | https://ai.meta.com/research/publications/the-llama-3-herd-o...
        
       ___________________________________________________________________
       (page generated 2024-08-02 23:01 UTC)