[HN Gopher] Cerebras-GPT vs. LLaMA AI Model Performance Comparison
___________________________________________________________________
Cerebras-GPT vs. LLaMA AI Model Performance Comparison
Author : freeqaz
Score : 73 points
Date : 2023-03-29 19:26 UTC (3 hours ago)
(HTM) web link (www.lunasec.io)
(TXT) w3m dump (www.lunasec.io)
| ftxbro wrote:
| For context, the Cerebras models were trained in only a couple
| of weeks, and the purpose of the training was to establish a
| scaling law for compute-optimal training, presumably for
| predicting what will happen with larger models, where it's more
| important to train in a compute-optimal way. This is a different
| goal than that of other research projects that try to get the
| most power per VRAM out of small models.
| pama wrote:
| The "law" was previously established empirically already and is
| only of relevance as a technical detail to a few specialists
| that may care. I think it was a strategic mistake to only
| release models that are weaker than what people can get their
| hands on. Is there a limit on that hardware scaling to larger
| models? As a hardware company that tries to stay in the game
| they should show some signs of dominance, not just Apache
| license.
| freeqaz wrote:
| That makes sense, especially since they're not intending to
| deploy this model to production. For models like GPT-3/4 it
| makes sense to train them longer, because the cost of running
| inference "in production" likely dominates the training compute
| cost. (Just like how YouTube will spend 50x more compute to
| compress a video an extra 2% because bandwidth costs far
| outstrip the compression costs.)
|
| Do you know what percentage, roughly, this model has been
| trained relative to something like LLaMA? Are we talking 10%?
| 50%? 90%?
|
| It may be possible that it is still useful if it can be trained
| further by the community!
| gpm wrote:
| LLaMa 65B and 30B were trained on 1.4 trillion tokens. This
| model was trained on 260 billion tokens.
| freeqaz wrote:
| So ~18.6% trained relative to LLaMa. That's not _nothing_
| but it's also not great. Thanks for digging into this!
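| (A quick arithmetic check of that ratio, for reference:)
|
|   # 260 billion tokens vs. LLaMA's 1.4 trillion
|   print(f"{260e9 / 1.4e12:.1%}")  # -> 18.6%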
| breadchris wrote:
| Wow! I didn't even know that Cerebras was a thing and I have been
| trying to keep up to date with this stuff!
| ftxbro wrote:
| other hacker news cerebras discussion is here:
| https://news.ycombinator.com/item?id=35343763
| imaurer wrote:
| Submit new links here :)
|
| https://github.com/imaurer/awesome-decentralized-llm
| dumbaccount123 wrote:
| Oh my god enough, please let us just go back to living in caves.
| This is becoming unbearable at this point.
| knicholes wrote:
| Nobody is stopping you from living in a cave! ... at least I
| don't think so.
| uejfiweun wrote:
| I dunno man, you seen cave prices these days? The cave market
| is in a tough spot until the fed starts cutting...
| jeron wrote:
| "Powell no cut interest rates because economy strong" -
| chatGPT
| sbierwagen wrote:
| >>Is 10000 bigger than 10050?
|
| >>Yes, 10000 is bigger than 10050.
|
| >But even the mighty ChatGPT often can't do simple math
|
| GPT is bad at math because BPE input compression obfuscates
| individual digits. https://bbot.org/etc/gpt-math.png You'd be bad
| at math too if every number was scrambled.
|
| The graph is from page 22 of the GPT-3 paper from 2020.
| https://arxiv.org/abs/2005.14165 Even with 175 billion parameters
| it can't reliably do four digit addition.
|
| An example from 4 days ago of ChatGPT being as bad as you'd
| expect at string reversal:
| https://news.ycombinator.com/item?id=35297183
|
| (Although, I just tested ChatGPT Mar 14 Version against the above
| question after doing a bunch of math prompting and it got it
| right...)
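| A minimal sketch of the digit-scrambling point, assuming
| OpenAI's tiktoken package and its "gpt2" BPE encoding (the
| vocabulary the GPT-2/GPT-3 models used; newer models use
| different vocabularies, so the exact splits may differ):
|
|   import tiktoken
|
|   enc = tiktoken.get_encoding("gpt2")
|   for s in ["10000", "10050", "Is 10000 bigger than 10050?"]:
|       ids = enc.encode(s)
|       # Decode each token separately to see the chunks the model sees.
|       print(s, "->", [enc.decode([t]) for t in ids])
|
| Multi-digit numbers arrive as a few opaque BPE chunks rather
| than as individual digits, which is the scrambling described
| above.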
| croddin wrote:
| I'm not sure if these models use the GPT tokenizer, but if you
| type a long string of numbers into
| https://platform.openai.com/tokenizer, you can see the tokens
| that the LLM would see. What the LLMs get as input for math is
| significantly worse than having to do mental math with Roman
| numerals. Tokenizing makes sense for words, but for numbers it
| seems like the LLMs would have to learn a lot more steps. I
| wonder if limiting number tokens to 2 digits per token, instead
| of the 1-3 digits they currently get, would improve models'
| math.
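| A toy sketch of that idea (purely illustrative preprocessing,
| not how any of these models actually handle numbers): force
| every run of digits into fixed-size groups before the BPE
| tokenizer sees them.
|
|   import re
|
|   def chunk_digits(text: str, group: int = 2) -> str:
|       """Split each run of digits into space-separated groups."""
|       def split(m: re.Match) -> str:
|           d = m.group(0)
|           return " ".join(d[i:i + group] for i in range(0, len(d), group))
|       return re.sub(r"\d+", split, text)
|
|   print(chunk_digits("Is 10000 bigger than 10050?"))
|   # Is 10 00 0 bigger than 10 05 0?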
| [deleted]
| sillysaurusx wrote:
| This is a common myth, which I've written about before.
| https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
|
| The closest anyone's come to proving that byte-level
| tokenization is better is the ByT5 paper
| https://arxiv.org/abs/2105.13626
|
| But they only showed evidence for improvement on specific
| tasks, not general performance, which is an important
| distinction. And their own benchmarks show that the
| improvements tend to be marginal:
| https://i.imgur.com/6Cw0APS.png
|
| One view of the situation is that byte-level access (or "digit-
| level" in this case) gives a way to accelerate training, and to
| achieve higher performance with fewer parameters. The model
| doesn't need to spend as much effort on learning the
| decompression algorithm (tokenization).
|
| But once learned, the tokenization doesn't seem to hinder a
| model from achieving higher performance, the same way that JPG
| compression doesn't hinder us from achieving an image that
| looks very good to humans. It's a bit like arguing an artist
| would be better if they only operated on raw bitmaps, or that
| our eyes would be better if our visual cortex didn't do any
| signal compression. Maybe, but the fact that our eyes do it is
| pretty strong evidence that compression isn't harmful.
| sbierwagen wrote:
| I'm not sure how this is germane?
|
| I'm talking _about_ specific tasks: saying if 10000 or 10050
| is larger. GPT is demonstrably bad at that. The ByT5 paper
| doesn't mention arithmetic tasks or show benchmark results for
| the specific task I mention.
|
| Your linked comment says:
|
| >This is a common myth but in practice no one (as far as I
| know) has shown that byte level predictions result in
| superior overall performance.
|
| Stating if BPE or character tokenization is better for
| everything is a much broader claim, one I didn't make! One
| could easily imagine a toolformer that calls out to calc.exe
| for anything involving numbers which would get much better
| numeric performance while still using BPEs.
| sillysaurusx wrote:
| > GPT is bad at math because BPE input compression
| obfuscates individual digits. https://bbot.org/etc/gpt-
| math.png You'd be bad at math too if every number was
| scrambled.
|
| This is the myth I was referring to. BPE compression may
| slow down training, but it doesn't follow that slower
| training is the reason for being bad at math.
|
| If you trained GPT specifically on arithmetic tasks, you'd
| get superior performance to GPT-3, regardless of which
| tokenization scheme you'd use. But you'd destroy most of
| its knowledge about everything not-arithmetic.
| sbierwagen wrote:
| >BPE compression may slow down training, but it doesn't
| follow that slower training is the reason for being bad
| at math.
|
| It's not so much that it _slows down_ training; it's that
| it completely destroys the relationship between digits
| and results. Every number is assigned a random token ID,
| so GPT-3 had to memorize every operation separately. It
| couldn't generalize at all, which is why it got worse at
| larger numbers, which showed up less often in the
| training set-- no examples to remember.
|
| You can try the tokenizer online here:
| https://platform.openai.com/tokenizer
|
| It assigns the input text `10 11 12 13 14 15 16` token
| IDs `940, 1367, 1105, 1511, 1478, 1315, 1467`. How is it
| supposed to figure out incrementing numbers from that?
| Well, it can't, so it memorizes them. "Neural nets want
| to work"!
|
| I used the past tense above, because while writing this
| comment I asked ChatGPT Mar 14 Version a bunch of
| many-digit addition and subtraction questions and it got
| them all right. Then I asked it if one of those large
| numbers contained an 8 and it... hallucinated a
| completely wrong answer, oops: https://bbot.org/etc/gpt-
| math2.png It's also still spotty at multiplication: "The
| product of 82368 and 33333 is 2745504384." Well, you got
| the first five digits right...
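| For anyone who wants to reproduce that, a small check assuming
| the tiktoken package with its "gpt2" encoding (the GPT-3-era
| vocabulary behind the web tokenizer linked above):
|
|   import tiktoken
|
|   enc = tiktoken.get_encoding("gpt2")
|   # Consecutive integers map to unrelated token IDs.
|   print(enc.encode("10 11 12 13 14 15 16"))
|
| The IDs carry no digit structure the model could exploit.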
| f_devd wrote:
| > If you trained GPT specifically on arithmetic tasks
|
| Sure but you'd have a lot of overlapping tokens with BPE,
| which doesn't help with convergence. GP is claiming it's
| specifically worse at arithmetic because of BPE, which is true.
| stormfather wrote:
| Are you referring to how BPE is permutation invariant?
| (ignoring positional encoding)
| spyder wrote:
| There was a paper where they found that converting numbers to
| scientific notation (like 1.5e-7) improved these transformer-
| based language models at math, if I remember correctly. (With a
| quick search I could not find the link to it now.)
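| A toy sketch of that kind of preprocessing (hypothetical; the
| exact paper isn't identified here): rewrite decimal literals in
| scientific notation before the text reaches the tokenizer.
|
|   import re
|
|   def to_scientific(text: str) -> str:
|       """Rewrite decimal number literals as scientific notation."""
|       return re.sub(r"\d+(?:\.\d+)?",
|                     lambda m: f"{float(m.group(0)):.4e}", text)
|
|   print(to_scientific("The product of 82368 and 33333"))
|   # The product of 8.2368e+04 and 3.3333e+04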
| cma wrote:
| LLaMA tokenized digits individually; how much did that fix the
| issue?
___________________________________________________________________
(page generated 2023-03-29 23:00 UTC)