[HN Gopher] IsoFLOP curves of large language models are flat
___________________________________________________________________
IsoFLOP curves of large language models are flat
Author : alexmolas
Score : 25 points
Date : 2024-08-01 14:05 UTC (1 day ago)
(HTM) web link (severelytheoretical.wordpress.com)
(TXT) w3m dump (severelytheoretical.wordpress.com)
| z4y5f3 wrote:
| What they missed is that current scaling laws (OpenAI, DeepMind
| Chinchilla) are based on the assumption that the model is trained
| for a single epoch. This essentially means that in order to scale
| compute, you have to scale the model size and/or the size of the
| dataset. So Meta cannot simply spend 3.8e25 FLOPs on a 70B model -
| to do that they would have to find 86T pretraining tokens, which
| they do not have.
|
| Of course, ultimately we will figure out scaling laws for LLMs
| trained on multiple epochs of data, but not today.
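|
| A quick back-of-envelope check of the token figure, as a minimal
| Python sketch. It assumes the common dense-transformer compute
| approximation C ~= 6*N*D; the exact constant varies by
| accounting, which is why this lands near, but not exactly at,
| 86T.
|
|     # Tokens implied by spending a fixed compute budget on a
|     # fixed-size dense transformer, assuming C ~= 6 * N * D.
|     compute_flops = 3.8e25      # training compute cited above
|     params = 70e9               # 70B-parameter model
|     tokens_needed = compute_flops / (6 * params)
|     print(f"{tokens_needed:.2e} tokens")   # ~9.0e13, i.e. ~90T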
| ActivatedAI wrote:
| There is some good published research on doing multiple passes
| over the training data and on how quickly learning saturates. The
| TL;DR is that diminishing returns kick in after about 4 epochs.
|
| https://arxiv.org/abs/2305.16264
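|
| For intuition only, here is a toy Python sketch of the kind of
| "effective data" model used in that line of work: each additional
| epoch over the same tokens is worth less than the last, so the
| value of repeating saturates after a few passes. The functional
| form and the decay constant are illustrative assumptions, not the
| paper's fitted parameters.
|
|     import math
|
|     def effective_tokens(unique_tokens, epochs, decay=5.0):
|         """Value of `epochs` passes over `unique_tokens`, in
|         fresh-token units, with exponentially decaying returns
|         on repeated data (toy constants)."""
|         repeats = epochs - 1
|         saturation = decay * (1 - math.exp(-repeats / decay))
|         return unique_tokens * (1 + saturation)
|
|     for e in (1, 2, 4, 8, 16):
|         print(e, round(effective_tokens(1.0, e), 2))
|     # marginal gain per extra epoch shrinks quickly past ~4 passes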
| z4y5f3 wrote:
| Yep, I have seen this paper before - thank you for linking it
| here for reference. My personal opinion is that, compared to
| single-epoch scaling laws, we still need more evidence and
| literature on the effects of multiple epochs, but this paper is
| one of the best results we have so far on the topic.
| radarsat1 wrote:
| > So, these models are basically within 1% of each other in terms
| of final pretraining loss.
|
| How is this loss calculated, though? Since it is called "loss"
| and not "performance metric", I'm going to assume it is the
| teacher-forced cross-entropy loss.
|
| I'm not too familiar with LLM training, but having done a fair
| amount of seq2seq training lately in other domains, I've observed
| that the relationship between "loss" and autoregressive inference
| performance becomes very steep towards the end of training: ever
| smaller reductions in loss still yield noticeable improvements in
| the autoregressive output. So I suspect that in practice that 1%
| loss improvement may actually be incredibly significant with
| respect to how well the model performs at inference time.
|
| But I'm pretty interested in this topic and if people here have
| observed this or the contrary I'd be curious to know.
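|
| For reference, the pretraining loss being compared is presumably
| next-token cross-entropy under teacher forcing, as the comment
| above assumes. A minimal PyTorch-style sketch, where `model` is
| assumed to return per-position logits of shape (batch, seq,
| vocab):
|
|     import torch.nn.functional as F
|
|     def teacher_forced_loss(model, tokens):
|         """Cross-entropy of next-token predictions when every
|         position is conditioned on the ground-truth prefix."""
|         inputs, targets = tokens[:, :-1], tokens[:, 1:]
|         logits = model(inputs)
|         return F.cross_entropy(
|             logits.reshape(-1, logits.size(-1)),
|             targets.reshape(-1),
|         )
|
| At inference time the model instead conditions on its own sampled
| tokens, which is why a small gap in teacher-forced loss can
| correspond to a much larger gap in generated-output quality.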
| ActivatedAI wrote:
| Page 9 of the Llama 3 tech report has an interesting graph that
| predicts task-level performance from the cross-entropy loss. The
| sigmoidal model fits well, and at the steepest part of the S, a
| 0.01 change in NLL is worth about 5% task-level accuracy.
|
| Here is a quick screenshot if you are lazy.
|
| https://snipboard.io/C6mipQ.jpg
|
| And here is the paper if you want to dig into the highest-quality
| published research on the LLM frontier.
|
| https://ai.meta.com/research/publications/the-llama-3-herd-o...
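|
| A toy Python sketch of that sigmoidal relationship; the slope and
| midpoint below are made-up illustrative constants, not the
| paper's fit, chosen so the steepest part of the S has roughly the
| quoted sensitivity.
|
|     import math
|
|     def task_accuracy(nll, k=20.0, midpoint=1.0):
|         """Toy logistic map from per-token NLL to task accuracy:
|         lower NLL -> higher accuracy."""
|         return 1.0 / (1.0 + math.exp(k * (nll - midpoint)))
|
|     # Near the midpoint the slope is about k/4 = 5 per unit NLL,
|     # so a 0.01 drop in NLL buys roughly 5 points of accuracy.
|     print(task_accuracy(1.00), task_accuracy(0.99))   # 0.50, 0.55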
___________________________________________________________________
(page generated 2024-08-02 23:01 UTC)