[HN Gopher] Chinchilla Scaling: A replication attempt
___________________________________________________________________
Chinchilla Scaling: A replication attempt
Author : tosh
Score : 87 points
Date : 2024-04-18 15:05 UTC (7 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| cs702 wrote:
| Interesting! If the authors are right, it seems that the number
| of training tokens required per parameter (slowly) _declines_ as
| models become larger (Figure 5).
|
| That's good news. I think it deserves wider dissemination, so I'm
| upvoting your post.
|
| Thank you for sharing this on HN!
| dzdt wrote:
| Could it be that the independence of available training points
| declines as the dataset size grows? At some point it becomes
| hard to add data that isn't essentially similar to something
| you've already added.
| cs702 wrote:
| Yes, could be. Not sure how or even if anyone could prove it,
| though.
| sebzim4500 wrote:
| I guess you could artificially limit the training data (e.g.
| by removing languages or categories) and see if the utility
| of extra tokens drops off as a result.
| godelski wrote:
| This should be fairly de facto true. Remember that your
| dataset is a proxy for some real (but almost surely
| intractable) distribution.
|
| Now let's think about filling the space with p-balls, each
| bounded by its nearest points, so that no other data point
| lies inside a ball. Then we've turned this into a
| sphere-packing problem and we can talk about the sizes and
| volumes of those spheres.
|
| If we fill the real distribution uniformly with data, the
| average volume of those spheres decreases. If we fill it
| non-uniformly, the average ball still shrinks, but the largest
| ball shrinks more slowly (that's the case where we aren't
| properly covering the data in that region). In either case,
| the more data you add, the more the balls shrink, which
| essentially means the differences between data points
| decrease. The harder question is about the under-represented
| regions: finding them and determining how to sample them
| properly.
|
| Another quick trick you can use to convince yourself is to
| think about basis vectors (this won't be robust, btw, but it's
| a good starting point). In high dimensions, two randomly
| sampled vectors are almost certainly close to orthogonal. So
| think of drawing basis vectors (independent vectors that span
| our space): as we start filling in data, we are initially very
| likely to get vectors (or data) that are independent in some
| way, but as we add more, the likelihood that a new one is
| orthogonal to the rest decreases. Of course your basis vectors
| don't need to be orthogonal, but that's mostly semantics,
| since we can always work in a space where that's true.
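|
| A quick numpy sketch of both arguments (the dimensions and
| sample counts are arbitrary illustrative choices, not from any
| real training setup):
|
|     import numpy as np
|     rng = np.random.default_rng(0)
|
|     # Two random vectors in high dimension are almost
|     # certainly close to orthogonal.
|     d = 1000
|     a, b = rng.standard_normal(d), rng.standard_normal(d)
|     cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
|     print(f"cosine similarity in {d} dims: {cos:.3f}")  # ~0
|
|     # As samples accumulate, the nearest neighbour of a fresh
|     # point gets closer, i.e. the "balls" shrink. A low
|     # dimension is used so the effect shows up with few points.
|     d = 10
|     for n in (100, 1_000, 10_000):
|         data = rng.standard_normal((n, d))
|         query = rng.standard_normal(d)
|         print(n, np.linalg.norm(data - query, axis=1).min())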
| Kronopath wrote:
| This is not good news: it means that we could end up with a
| dangerously superintelligent AI just by scaling up the number
| of parameters, without increasing the amount of training data.
| exe34 wrote:
| Like a corporation then. We should ban them until we can
| figure out how to align them!
| tehsauce wrote:
| ASI is nothing like a corporation
| wizzwizz4 wrote:
| No, they're not. Corporations have known, concrete
| impacts on the world, whereas the dangers of AI are, so
| far, corporations. ASIs are (as yet) fictional.
|
| Another difference: most corporations will avoid doing
| illegal stuff if the penalties are large enough: the
| corporation alignment problem is political. Pretty much
| no extant AI systems can be instructed in this way: we
| don't know how to align AIs even in theory.
| TeMPOraL wrote:
| It is very much like a corporation; a corp is effectively an
| AGI, just running very slowly - at the speed of
| bureaucracy.
| kelseyfrog wrote:
| No, but LLMs require orders of magnitude more language input
| than humans[1]. It's very reasonable to assume that
| architectural differences (size among them) are the more
| likely constraint on performance.
|
| 1. Specifically, larger than the upper bound on _lifetime_
| language input for humans, even assuming 24/7 reading at max
| speed.
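|
| Rough back-of-envelope for [1] (the reading speed and
| tokens-per-word figures below are my own round assumptions):
|
|     wpm = 350                        # assumed fast reading speed
|     minutes = 80 * 365 * 24 * 60     # 80 years, 24/7
|     tokens = wpm * minutes * 1.3     # ~1.3 tokens per word
|     print(f"{tokens:.1e}")           # ~1.9e10 tokens
|
| Chinchilla itself was trained on 1.4e12 tokens and newer LLMs
| on 1e13+, i.e. roughly two to three orders of magnitude more.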
| HeatrayEnjoyer wrote:
| Do they? What is the total size of all visual, audio,
| touch, locomotive, scent, and taste data collected between
| birth and when a human reaches IQ 100? There are multiple
| high-bandwidth feeds running into the brain 24/7.
| cubefox wrote:
| > language input
| p1esk wrote:
| How much language input does a human need to become
| intelligent if he doesn't receive any other input?
| TeMPOraL wrote:
| Yes, but LLMs come out of training as experts in
| approximately any single thing you can think of, and then
| some, and all that in dozens of languages. Humans don't
| achieve even a fraction of this kind of breadth.
| godelski wrote:
| This is not quite accurate, but it's complicated because
| measurement is hard. The things they are being tested on
| are almost surely within the dataset. Take the bar exam,
| for instance. Sure, we don't know what's in GPT's data,
| but we know it has Reddit, and we know Reddit has many
| similar if not exact questions on it. We know the first
| GPT-4 did not have good semantic similarity matching for
| its contamination check, because they just matched three
| 50-character substrings (Appendix C), and they only
| considered the false-positive side. Then there's this
| line...
|     The RLHF post-training dataset is vastly smaller
|     than the pretraining set and unlikely to have any
|     particular question contaminated. However we did
|     not check explicitly.
|
| But my favorite is HumanEval. I'll just remind everyone
| that this was written by 60 authors, mostly from OpenAI:
|     We evaluate functional correctness on a set of 164
|     handwritten programming problems, which we call the
|     HumanEval dataset. ... __It is important for these
|     tasks to be hand-written, since our models are
|     trained on a large fraction of GitHub, which already
|     contains solutions to problems from a variety of
|     sources.__
|
| The problems? Well, they're leetcode style... Can you
| really write leetcode-style questions that aren't already
| on GitHub? For example:
| HumanEval/2 prompt:
|     def truncate_number(number: float) -> float:
|         """ Given a positive floating point number, it can
|         be decomposed into an integer part (largest integer
|         smaller than given number) and decimals (leftover
|         part always smaller than 1).
|
|         Return the decimal part of the number.
|         >>> truncate_number(3.5)
|         0.5
|         """
| Solution:
|         return number % 1.0
|
| HumanEval/4 prompt:
|     from typing import List
|
|     def mean_absolute_deviation(numbers: List[float]) -> float:
|         """ For a given list of input numbers, calculate Mean
|         Absolute Deviation around the mean of this dataset.
|         Mean Absolute Deviation is the average absolute
|         difference between each element and a centerpoint
|         (mean in this case):
|         MAD = average | x - x_mean |
|         >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
|         1.0
|         """
| Solution:
|         mean = sum(numbers) / len(numbers)
|         return sum(abs(x - mean) for x in numbers) / len(numbers)
|
| You really want to bet that that isn't on GitHub? Because
| I'll bet you any dollar amount you want that solutions in
| near-exact form were on GitHub prior to their cutoff date
| (don't trust me, you can find them yourself; they're
| searchable, even). Hell, I've poisoned the dataset here!
|
| LLMs are (lossy) compression systems, so they're great
| for information retrieval. And a lot of what we consider
| intelligence (and possibly even creativity) is based on
| information retrieval. That doesn't make these things any
| less impressive; it's just a note on how we should be
| interpreting results and understanding the limitations of
| our tools. Measuring intelligence is really difficult,
| the term isn't universally agreed upon, and so people are
| often talking past one another, or conflating different
| notions as if they were the same.
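|
| For reference, here's a toy version of the kind of substring
| check described above (not OpenAI's actual code, just an
| illustration of why a verbatim 50-character match is a weak
| contamination test):
|
|     import random
|
|     def looks_contaminated(eval_text, corpus, k=3, n=50):
|         """Flag eval_text if any of k random n-character
|         substrings appears verbatim in the training corpus."""
|         if len(eval_text) <= n:
|             return eval_text in corpus
|         starts = [random.randrange(len(eval_text) - n)
|                   for _ in range(k)]
|         return any(eval_text[s:s + n] in corpus for s in starts)
|
| A solution that was lightly reworded or reformatted before it
| landed in the training set slips straight past a check like
| this.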
| newfocogi wrote:
| Key claims:
|
| "We have found three potential issues with Hoffmann et al.'s
| estimates of the Chinchilla scaling law that rely on Approach 3:
| 1. Their estimated model fits the reconstructed data very poorly.
| These conclusions hold even when accounting for potential noise
| in data reconstruction and excluding outlier models. 2. The
| confidence are implausibly tight given the number of data points.
| Obtaining confidence intervals that tight would require many
| hundreds of thousands of observations, while they likely had only
| ~400. 3. Their estimated model implies a scaling policy that is
| inconsistent with their other approach"
|
| Data point most people are probably looking for: "We find a range
| consistent with the 20 tokens per parameter rule of thumb.
| Indeed, our point estimates imply that 25.6 tokens per parameters
| is optimal."
| moffkalast wrote:
| Their rule of thumb would imply that a 70B model is saturated
| with 1.7T tokens, which is inconsistent with reality.
| og_kalu wrote:
| The Chinchilla laws were _compute optimal_ scaling laws.
| They're not supposed to tell you what parameter-token
| combination will saturate a model.
| moffkalast wrote:
| Compute optimal for what, training? There's nothing optimal
| about blowing up model size beyond the absolute minimum
| needed, or you'll spend a country's worth of electricity
| trying to scale inference later.
| rfw300 wrote:
| Yes, compute-optimal for training only. The purpose of
| the paper wasn't to determine the most economically
| practical model one could build, but the most
| "intelligent" model one could build given some amount of
| compute.
| ijk wrote:
| Quite. The big question at the time was "how much data do
| we need to train GPT-3 equivalent models". Open models
| had failed to live up to GPT performance, even ones with
| a massive number of parameters. So getting results that
| suggested a reason why other models were massively
| undertrained was important.
|
| Meanwhile, people noticed that for deployed models,
| inference cost often outweighs the initial training
| costs. It's sometimes better to train a smaller, faster
| model longer on more data, because it has lower overall
| cost (including environmental impact) if you're expecting
| to run the model a few million or billion times (e.g.,
| [1]). So training past the Chinchilla optimum point
| became a lot more common, particularly after Llama.
|
| [1] https://arxiv.org/abs/2401.00448
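|
| A rough sketch of that trade-off (6*N*D training FLOPs and
| 2*N FLOPs per inference token are the usual approximations;
| the serving volume and the smaller model's configuration are
| made-up assumptions, not numbers from [1]):
|
|     def lifetime_flops(n_params, train_tokens, served_tokens):
|         train = 6 * n_params * train_tokens
|         inference = 2 * n_params * served_tokens
|         return train + inference
|
|     served = 1e12  # assume ~1T tokens served over the lifetime
|     big = lifetime_flops(70e9, 1.4e12, served)    # 70B, 1.4T
|     small = lifetime_flops(13e9, 7.5e12, served)  # 13B, 7.5T
|     print(f"{big:.2e} vs {small:.2e}")
|
| The two use roughly the same training compute, but the smaller
| model costs far less per token at inference, so its lifetime
| total comes out lower; whether it actually matches the bigger
| one in quality is exactly what the scaling fits are meant to
| predict.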
| FeepingCreature wrote:
| Blow up model size, get lots of space and parameters to
| do the double-descent grok thing in, then distill it way
| way down?
| og_kalu wrote:
| Training, yes.
|
| Doubling your parameter count past that ratio will yield
| a better model than doubling your data, and it's much
| easier and cheaper to do.
| naasking wrote:
| That suggests that it's likely memorizing more special
| cases rather than distilling general principles. They
| generalize to some degree but clearly there's room for
| improvement.
| og_kalu wrote:
| It doesn't really suggest anything. Neither model will be
| even close to saturation, and all else being equal, bigger
| models perform better in every way, including
| generalization.
| eldenring wrote:
| No, their claim is that for a fixed (training) compute
| budget there are diminishing returns to scaling up data
| past that threshold vs. scaling up params.
|
| This doesn't take inference into account either, obviously.
| magnio wrote:
| > To extract the data from the figure, we first downloaded the
| PDF from Hoffmann et al.'s arXiv submission and saved it in SVG
| format. We then parsed the SVG content to navigate and search the
| SVG structure. Within the SVG, we identified the group of points
| representing the scatter plot data and iterated over each point
| to extract its fill color and position (x and y coordinates)
| using the attributes of the corresponding SVG elements.
|
| > To map the SVG coordinates to the model size and training FLOP
| values, we used the location of the labels or ticks on the
| respective axes. This allowed us to establish a correspondence
| between the SVG coordinates and the actual data values
| represented in the plot.
|
| They ... reconstructed the data ... from a plot ... using a
| ruler and their eyes? Why not just email the original authors
| for the raw data? I can't help but feel like this is
| @yuvaltheterrible debunking papers.
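|
| (For what it's worth, the extraction step they describe is
| pretty mechanical. A minimal sketch with xml.etree; the
| circle-shaped markers, the tick pairs, and the linear axes are
| assumptions, and log axes would need the tick values mapped in
| log space first:)
|
|     import xml.etree.ElementTree as ET
|
|     SVG = "{http://www.w3.org/2000/svg}"
|
|     def to_data(coord, tick_a, tick_b):
|         # Linear map from SVG coords to data values, given two
|         # known axis ticks as (svg_coord, data_value) pairs.
|         (s0, d0), (s1, d1) = tick_a, tick_b
|         return d0 + (coord - s0) * (d1 - d0) / (s1 - s0)
|
|     def extract_points(svg_path, x_ticks, y_ticks):
|         root = ET.parse(svg_path).getroot()
|         pts = []
|         for c in root.iter(SVG + "circle"):
|             x = to_data(float(c.get("cx")), *x_ticks)
|             y = to_data(float(c.get("cy")), *y_ticks)
|             pts.append((x, y, c.get("fill")))
|         return pts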
| mxwsn wrote:
| Funnily enough, I've done this for a paper I wrote as well.
| Emailing authors is kind of a crapshoot. It's normal to get no
| response if it's been several years since the paper came out.
| In this case, a pdf plot is essentially lossless, and it's much
| faster than waiting for authors to maybe respond.
| V1ndaar wrote:
| And not only that, in many cases they will tell you (if they
| reply) "oh, we can't find the source of that plot anymore".
| Happened to me quite a few times (although in physics).
|
| I'm pretty sure I'm not the only one who's written themselves
| a mini tool to extract data even from a bitmap plot, based on
| the axes. It involves some manual steps (cropping, mainly),
| but it's very convenient for the cases where people don't
| even use vector graphics, but sometimes just screenshots of
| plots... Do I like it? Hell no! It's why I've put quite some
| effort into doing it better for my PhD thesis.
| godelski wrote:
| Yeah, it's very annoying, especially these days when there's
| no real excuse not to have a copy. You can easily store all
| code and data for free and in an accessible manner. Even
| just GitHub is good enough for 90+% of cases. Hugging Face
| helps, and there are many other ways too.
|
| I remember in my first year of grad school I was trying to
| replicate a work by a very prestigious university. It
| definitely wasn't reproducible from the text, but I did my
| best. I couldn't get close to their claims, so I emailed the
| lead author (another grad student). No response. Luckily my
| advisor knew their advisor. Got a meeting, and then I got
| sent code. It was nothing like what they described in the
| paper, so I have no idea what they gave me. Anyways, my
| paper never got published because I couldn't beat them. It
| is what it is.
| Ajoo wrote:
| They claimed that they did ask several times in one of the
| replies.
| polygamous_bat wrote:
| > Why not just emailed the original authors for the raw data?
|
| Industry research labs, especially Google DeepMind, are
| notoriously closed up about their "proprietary" data. I've hit
| this wall multiple times in my own work in AI.
| sp332 wrote:
| https://twitter.com/borgeaud_s/status/1780988694163321250
| says they're going to open the data from the paper. Not sure
| why they didn't do it before, but good news.
| acc_297 wrote:
| In fairness, they did not use a ruler or eyes. Based on the
| excerpts you quote, they extracted exact coordinates of the
| data from an SVG, which, if the SVG was created correctly,
| should at least give an unbiased dataset, maybe with less
| precision than the source.
| levocardia wrote:
| I do that all the time using WebPlotDigitizer [1]. Works great.
|
| [1] https://apps.automeris.io/wpd/
| dynm wrote:
| Seconded. When I first saw this, I thought it looked
| unintuitive and difficult to use, but when I tried it, it was
| very easy and I had the extracted data in a few minutes.
| williamdclt wrote:
| I particularly like the second quote; I appreciate them taking
| the time to explain "what is a graph" in a scientific paper!
| ege_erdil wrote:
| we did, and gave them a two-week grace period to respond, but
| they only responded to us after we published on arxiv
|
| also, we didn't reconstruct the data using a ruler; you can
| automate that entire process so that it's much more reliable
| than that
| cgearhart wrote:
| TL;DR--couldn't exactly replicate their results, but broadly
| confirmed their findings. They agree that the optimal range is
| 5-40 tokens per parameter, and close to 20 for the "chinchilla"
| model from the original paper.
|
| Very unusual choice to reconstruct the dataset by eyeballing the
| graph in the source paper (why not just ask for it...?) and it's
| not really clear why the result is dressed up behind the
| salacious-seeming abstract.
| ege_erdil wrote:
| we didn't eyeball the graph, there are more accurate ways of
| extracting the data from a pdf file than that
|
| we did ask for the data but got no response until we published
| on arxiv
|
| what is supposed to be "salacious" about the abstract?
| warbaker wrote:
| Calling this a "replication attempt" implied to me that they
| tried to replicate the Chinchilla Scaling paper and found that it
| did not replicate, which would be a very big deal!
|
| Instead, they just redid the analysis based on a figure in the
| paper and found that the old model with slightly different
| parameters gave a better fit to the data. This is a valuable
| contribution, but a bit over-stated by the paper title, and the
| confrontational, "gotcha" tone of the paper is unwarranted.
|
| A better framing would have been something like "Chinchilla
| Scaling: Reanalyzed".
| ege_erdil wrote:
| one of their three approaches does not replicate and it's
| because of a software bug in the optimizer they used, i don't
| know what else we were supposed to say
| gwern wrote:
| The original Chinchilla authors have now identified the original
| bug, apparently:
| https://twitter.com/borgeaud_s/status/1780988694163321250
___________________________________________________________________
(page generated 2024-04-18 23:00 UTC)