[HN Gopher] Transcending Scaling Laws with 0.1% Extra Compute
___________________________________________________________________
Transcending Scaling Laws with 0.1% Extra Compute
Author : ashvardanian
Score : 62 points
Date : 2023-01-27 19:00 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jxf wrote:
| This sounds very interesting but I lack the technical depth in
| language models to understand it. In particular I can't parse the
| following excerpt:
|
| > The key idea is to continue training a state-of-the-art large
| language model (e.g., PaLM) on a few more steps with UL2's
| mixture-of-denoiser objective. We show that, with almost
| negligible extra computational costs and no new sources of data,
| we are able to substantially improve the scaling properties of
| large language models on downstream metrics. In this paper, we
| continue training PaLM with UL2R, introducing a new set of models
| at 8B, 62B, and 540B scale which we call U-PaLM.
|
| Things I don't understand:
|
| * PaLM (and its advantages/disadvantages relative to other LMs)
|
| * What a "mixture-of-denoiser objective" is
|
| * How "the scaling properties" are measured
|
| I'd be interested in a more accessible summary of how this works,
| if HN has any references.
| p1esk wrote:
| Did you try reading the paper (past the abstract)? It provides
| the reference to the original PaLM paper and answers the rest
| of your questions.
| mcint wrote:
| Linking Papers with Code for its listing of relevant tasks,
| datasets, and metrics, with global rankings against other models
| on defined tasks.
|
| https://paperswithcode.com/paper/transcending-scaling-laws-w...
| 6gvONxR4sf7o wrote:
| > Impressively, at 540B scale, we show an approximately 2x
| computational savings rate where U-PaLM achieves the same
| performance as the final PaLM 540B model at around half its
| computational budget (i.e., saving ~4.4 million TPUv4 hours).
|
| Are you allowed to call your own work impressive in your
| abstract? Cool work, but that line is "transcendent."
|
| Anyways, aren't scaling laws more like O(whatever) asymptotics?
| Like if you reduce your sorting algo from 6.4 n^2 seconds to 3.2
| n^2, you don't say you "transcended the scaling laws," even
| though you sped it up a very significant amount. Am I
| misunderstanding?
| p1esk wrote:
| _aren't scaling laws more like O(whatever) asymptotics?_
|
| Not if scaling is linear.
| [deleted]
| hobs wrote:
| Yes, that's day one stuff that the coefficient gets thrown
| away.
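|
| A tiny illustrative sketch of that point (the 6.4/3.2 coefficients
| are reused from the hypothetical sort example above, nothing from
| the paper):
|
|     # Halving the coefficient is a real 2x wall-clock win, but
|     # both versions are still O(n^2): big-O drops the constant.
|     def old_cost(n: int) -> float:
|         return 6.4 * n ** 2  # hypothetical "before" runtime (s)
|
|     def new_cost(n: int) -> float:
|         return 3.2 * n ** 2  # hypothetical "after" runtime (s)
|
|     for n in (10, 100, 1000, 10_000):
|         speedup = old_cost(n) / new_cost(n)
|         print(f"n={n:>6}: speedup = {speedup:.1f}x")
|
| The printed speedup is a flat 2.0x at every n, which is exactly why
| the coefficient gets thrown away: the ratio never grows with n.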
| dgreensp wrote:
| By "transcending scaling laws" and "improve the scaling
| properties," do they just mean higher-quality output compared to
| using the same (or smaller) model size with previous methods?
| whatshisface wrote:
| Here is a summary:
|
| - Training on a mixture of fill-in-the-gaps (a few missing words)
| and denoising (every word slightly corrupted) produces better
| LLMs than either one alone.
|
| - This advantage (that of using both objectives at once) can be
| gained with just a little extra training on a model previously
| trained with only one of them.
|
| This results in 2-4% improvements on most tasks, with a couple of
| really big improvements (+20%) and one quite surprising one (+60%)
| on a few BigBench tasks. The large percentage improvements on the
| BigBench tasks seem to have more to do with the low initial
| performance than with the new performance being outstanding; the
| +60% was from 7.6% right to 12.5% right.
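|
| A very rough sketch of what a mixture-of-denoisers objective could
| look like (illustrative only, not the actual UL2/UL2R recipe; the
| sentinel format, span lengths, and corruption rates below are
| made-up assumptions):
|
|     import random
|
|     SENTINEL = "<extra_id_{}>"  # assumed T5-style sentinel tokens
|
|     def corrupt(tokens, span_len, rate):
|         """Mask random spans; return (corrupted input, target)."""
|         inp, tgt, i, sid = [], [], 0, 0
|         while i < len(tokens):
|             if random.random() < rate:
|                 s = SENTINEL.format(sid)
|                 inp.append(s)  # gap marker in the input
|                 tgt.extend([s] + tokens[i:i + span_len])
|                 i += span_len
|                 sid += 1
|             else:
|                 inp.append(tokens[i])
|                 i += 1
|         return inp, tgt
|
|     # Two toy "denoisers": short spans / light corruption ("fill
|     # in the gaps") vs. long spans / heavy corruption.
|     DENOISERS = [dict(span_len=3, rate=0.15),
|                  dict(span_len=12, rate=0.5)]
|
|     def make_example(tokens):
|         # Sampling one denoiser per example is the "mixture" part.
|         return corrupt(tokens, **random.choice(DENOISERS))
|
|     words = "the quick brown fox jumps over the lazy dog".split()
|     inp, tgt = make_example(words)
|     print("input :", " ".join(inp))
|     print("target:", " ".join(tgt))
|
| Per the abstract, the "UL2R" step is just continuing to train the
| already-trained PaLM checkpoints for a few more steps on an
| objective along these lines, rather than pretraining from scratch.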
| mcint wrote:
| Thank you for the summary!
|
| I quite like this idea: train on a mixture of simple critics,
| or really, here, simple sources of noise.
| [deleted]
___________________________________________________________________
(page generated 2023-01-27 23:00 UTC)