[HN Gopher] DeepMind's New Language Model, Chinchilla
___________________________________________________________________
DeepMind's New Language Model, Chinchilla
Author : georgehill
Score : 195 points
Date : 2022-04-11 12:41 UTC (10 hours ago)
(HTM) web link (www.marktechpost.com)
(TXT) w3m dump (www.marktechpost.com)
| g051051 wrote:
| Is there a good reference as to what a "parameter" is in this
| context? I've looked a few times, but the explanations don't make
| any sense to me.
| guipsp wrote:
| You can think of a parameter as a number you can tweak while
| training. This network has 70B such numbers.
| sirk390 wrote:
| And if every parameter is one byte, the minimum, it will take
| at least 70 GB to save or share this model. So it's still way
| too big to package directly in an app.
| cshimmin wrote:
| From the paper, they are using bfloat16, so I guess two
| bytes. But distributing and "packaging into an app" are not
| at all of practical interest for these kinds of models. You
| (a consumer) would interact via some API service, with the
| model running on a hardware-accelerated compute cloud.
|
| In any case, during training (where the model is run in
| possibly large batches), and even during inference, the
| size of the parameters is completely dwarfed by the
| intermediate tensor representations.
| brrrrrm wrote:
| > even during inference, the size of the parameters is
| completely dwarfed by the intermediate tensor
| representations
|
| What makes you say this?
| cshimmin wrote:
| It's especially true for models that do some kind of
| weight sharing, which is very common (CNNs, RNNs,
| transformers, etc). For a concrete example, consider a
| layer from an image convolutional network, which maps
| from a 3-dim colorspace to a 128-dim feature space.
| Assuming a 5x5 kernel that's about 10k parameters.
| However, after applying this layer, you go from having an
| (B,H,W,3) tensor to a (B,H-4,W-4,128) tensor, where H,W
| are the height and width of the image, and B is the
| number of images in the batch. If you're working with
| even moderately high resolution images, the memory
| required for these intermediate tensors at each layer is
| much larger than the parameters.
|
| Something similar applies for RNNs (same weights applied
| at each element of a sequence), GNNs and transformers
| (same weights applied at each _pair_ of data).
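|
| A quick back-of-the-envelope sketch of that comparison (the
| batch size and resolution below are made-up, assumed values;
| only the 3->128 channels and 5x5 kernel come from the example
| above):
|
|     # Illustrative only: parameters vs. activation memory for one conv layer.
|     B, H, W = 32, 1024, 1024      # assumed batch size and input resolution
|     c_in, c_out, k = 3, 128, 5    # channels and kernel size from the example
|
|     params = c_in * c_out * k * k + c_out    # weights + biases, ~9.7k values
|     acts = B * (H - 4) * (W - 4) * c_out     # elements in the output tensor
|
|     bytes_per_val = 4                        # assuming float32
|     print(params * bytes_per_val / 1e6, "MB of parameters")   # ~0.04 MB
|     print(acts * bytes_per_val / 1e9, "GB of activations")    # ~17 GB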
| lostmsu wrote:
| Have you seen modern games?
| sva_ wrote:
| I doubt they load that amount of data in memory
| replygirl wrote:
| I'm thinking about upgrading from 64gb to 128gb so i can
| use all my Cities: Skylines assets in the same map
| lostmsu wrote:
| Right, they usually stream assets as they are requested.
| Large models do the same.
| cshimmin wrote:
| It's a degree of freedom of the learnable model. For example, a
| "vanilla" neural network layer (MLP) that maps from M to N
| feature dimensions will contain an MxN matrix of learnable
| parameters that model the connections between the M inputs and
| the N outputs. Every time the model is updated during
| backpropagation, the loss gradient which has to be computed has
| the same dimensionality as the number of parameters. Also,
| generally more parameters means more operations in the forward
| pass. Therefore, a model with more parameters in general will
| require more FLOPs per iteration of training. The main point of
| this paper is that you can actually do better by training a
| smaller model for longer, rather than a bigger model for less
| time, assuming you have a fixed FLOP budget.
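|
| A minimal sketch of the parameter count for one such layer (the
| sizes M and N here are made up for illustration):
|
|     M, N = 512, 1024        # assumed input and output feature dimensions
|     weights = M * N         # one learnable value per input-output connection
|     biases = N              # plus one bias per output (if the layer has biases)
|     print(weights + biases) # 525,312 parameters for this single layer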
| zamalek wrote:
| The other thing about more parameters is that they give the NN
| more ability to overfit. That means that instead of, say,
| learning what a dog is, it instead memorises all the
| sentences containing "dog" that it has ever seen.
| mangoham wrote:
| Cached version since the original is down (I'm assuming it's down
| due to load issues and not due to the author taking it down).
| https://webcache.googleusercontent.com/search?q=cache:PLSLy9...
| ritwikgupta wrote:
| Off-topic to Chinchilla, but relevant to the source site:
| MarkTechPost consistently borderline plagiarizes articles and
| shares them on their website as "paper summaries". They copy-
| paste from the source material and change some of the wording
| around so as to appear original. My work, as well as other work
| from Berkeley AI Research, has been posted in this manner on
| their site.
|
| This seems highly unethical, and I'm surprised that they
| continue to operate.
| andreyk wrote:
| To add to this - they do this regularly, multiple times per
| week. While they do link to and acknowledge the source work,
| they do not make clear their writing is quoted or nearly
| quoted.
| brrrrrm wrote:
| Thanks for the heads up! In that case, I'd prefer not to share
| this link with peers. Do you have an alternative source with
| similar high-level content to share?
| lstamour wrote:
| Tough to say. Technically
| https://arxiv.org/pdf/2203.15556.pdf has the same content, it
| just isn't highlighted the same way.
| boplicity wrote:
| Fill out a DMCA notice:
|
| https://abuse.cloudflare.com/
|
| Cloudflare will forward it to their host, I believe, who will
| then ask that they remove the infringing material or provide a
| counterclaim.
| parhamn wrote:
| I don't know about this site, and I agree it's unethical. But it
| does make me realize that I much prefer using the language of
| the paper directly, as opposed to having a non-expert poorly
| translate what your paper said, especially given how much time
| papers put into the accuracy and specificity of their language
| and word choices.
|
| Would it also annoy you if they screwed up the interpretation
| of what you wrote? Is the alternative less reach for your work?
| For hardcore research the tradeoffs seem tougher. If it's just
| a matter of not caring, that's strictly messed up.
| realYitzi wrote:
| We'd better get used to it, because news companies will say an
| AI wrote it. No law allows suing an AI for plagiarism. Go prove
| that something is not an AI.
| nudpiedo wrote:
| No one sues the car, the dog, or the child; they sue the owner,
| the responsible party, the parent, etc.
| georgehill wrote:
| OP here. Thanks for sharing. I wasn't aware of this, but
| despite this behavior they are getting 600k visits.
|
| https://www.similarweb.com/website/marktechpost.com/#overvie...
| isaacfrond wrote:
| They trained over 400 language models ranging from 70 million
| to over 16 billion parameters on 5 to 500 billion tokens, while
| staying under a given compute budget. The results are modelled,
| and they pick the best configuration. It turns out that, for a
| fixed budget, a somewhat smaller model trained on more tokens
| performs better.
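|
| A rough sketch of the resulting rule of thumb (the ~20
| tokens-per-parameter ratio and the C ~ 6*N*D approximation are
| first-order simplifications of the paper's fitted scaling laws):
|
|     def chinchilla_optimal(compute_flops):
|         # For a fixed FLOP budget C, parameters N and tokens D should be
|         # scaled roughly equally, with D ~ 20*N and C ~ 6*N*D.
|         n_params = (compute_flops / (6 * 20)) ** 0.5
|         n_tokens = 20 * n_params
|         return n_params, n_tokens
|
|     # A Gopher-scale budget of ~5.76e23 FLOPs gives roughly a 70B-parameter
|     # model trained on ~1.4T tokens, i.e. Chinchilla.
|     print(chinchilla_optimal(5.76e23))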
| gbasin wrote:
| Thank you :)
| sirk390 wrote:
| Is outperforming GPT-3 still a good reference? It seems there are
| many models outperforming GPT-3 on the SuperGLUE benchmark:
| https://super.gluebenchmark.com/leaderboard/ GPT-3 is in position
| #21 with a 71.8% score. The best model is at 91.2%. Note the
| human baseline at #6 with 89.8%.
| WithinReason wrote:
| > Is outperforming GPT-3 still a good reference?
|
| It is if you outperform it with far fewer parameters
| changoplatanero wrote:
| Aren't most of the models at the top not suitable for text
| generation? That's what makes GPT different from BERT.
| colordrops wrote:
| What are the models at the top used for? Excuse my ignorance.
| priansh wrote:
| Mostly mask fill, but Transformers can be fine-tuned for
| downstream tasks relatively easily (T5 was built for
| translation but is used for autocomplete in many cases).
| gfodor wrote:
| would you mind sharing some references (or even just
| googleable terms) for this process of fine tuning?
| redredrobot wrote:
| It's a good reference because people are familiar with GPT-3.
| The paper mostly compares Chinchilla to LaMDA, Jurassic,
| Gopher, MT-NLG, and GPT-3. In the broader tech industry and
| even to a certain extent within the AI field, GPT-3 is the only
| one that most people know by name.
| screye wrote:
| Note that this isn't an apples-to-apples comparison. The GPT-3
| position is for a few-shot use case that has not been trained
| for this particular task. When fine-tuned, GPT-3 would be
| expected to perform a lot better. Lastly, GPT-3 currently runs
| on the text-002 models, roughly the third iteration of GPT-3,
| which is generally the one considered current. These benchmarks
| are for the original GPT-3 model.
| wiz21c wrote:
| I understand I can query such a model one query at a time. But
| is there a way to query these models with several queries in a
| row such that the (N+1)-th query benefits from the knowledge
| used to answer the first N questions? Basically, following a
| conversation. For example, YouTube subtitles can badly translate
| some terms, but if "it" had the overall subject of the video in
| mind, then it'd probably pick the correct word...
| rolisz wrote:
| Yes. That's how you use GPT-3: for the 2nd token, you feed in
| your prompt and the first token it returned. Then you feed it
| your prompt and the first two output tokens, and so on. The
| same applies across a conversation: you keep the whole history
| in the prompt.
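|
| In pseudocode it looks something like this (a minimal sketch;
| `next_token` is a hypothetical stand-in for whatever model call
| or API you use, and real APIs hide this loop behind one call):
|
|     def generate(next_token, prompt, max_new_tokens=100):
|         # Autoregressive decoding: every new token is conditioned on the
|         # prompt plus everything generated so far.
|         context = list(prompt)
|         for _ in range(max_new_tokens):
|             tok = next_token(context)
|             if tok == "<eos>":
|                 break
|             context.append(tok)
|         return context[len(prompt):]
|
| For a conversation you do the same thing at a higher level: keep
| appending both your turns and the model's replies to the prompt,
| so later answers are conditioned on the whole history (up to the
| context-window limit).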
| [deleted]
| hwers wrote:
| Can't wait for DeepMind to take a stab at outcompeting DALL-E.
| mrfusion wrote:
| Does this imply we will run out of data to keep up with larger
| model sizes?
|
| Is there much more data out there than what they're already
| using?
| adamsmith143 wrote:
| Probably not an issue just yet, think of how much data is
| generated by Twitter on a daily basis for example.
| zarzavat wrote:
| If you want to teach your kid to learn English, and they came
| back to you and said _" Dad/mum, I finished reading the
| entire internet but I still don't understand English fully"_,
| would you say _" OK son, now go and stare at the Twitter
| firehose until you grok perfect English"_ ?
|
| It's clear that these models have orders of magnitude too
| much data already.
|
| It somewhat reminds me of the proposals for larger and larger
| colliders in the hopes of seeing new physics that is always
| one collider in the future.
| lostmsu wrote:
| I disagree with this take because you grok English not only
| from the text you read, but also from the context of the
| physical world around you. And that context is enormous:
| assuming 8000x8000x2 vision with 3 one-byte color channels at
| 24fps without compression, you get 3e+17 bytes (300
| petabytes) of data along with your reading per year.
| ralfd wrote:
| Blind children can learn English fine, though. And there
| are highly immaterial areas (mathematics) that people
| still reason about.
| lostmsu wrote:
| You ignored the point. I only brought up sight as an example
| (though, admittedly, it is the largest data inflow).
| mijoharas wrote:
| > It somewhat reminds me of the proposals for larger and
| larger colliders in the hopes of seeing new physics that is
| always one collider in the future.
|
| I agree with your main point, but think this analogy isn't
| an apt one. If you want to see what particles are created
| at higher energies you kinda need the bigger particle
| accelerators. (This isn't to say that we shouldn't be
| investigating lower energy collisions, but at a certain
| point you do need "bigger colliders" to see new things)
| nullc wrote:
| > It's clear that these models have orders of magnitude too
| much data already.
|
| I have a toy disproof for your claim that this is clear.
|
| Imagine that you are training an ML system using oracle
| access to Mum. The ML training system can request 10
| million representative samples of Mum's output, and then we
| could judge if the ML system has adequately reproduced Mum.
|
| Now also imagine that Mum frequently tells people that she
| knows a 23-letter secret, and while she won't tell people
| what it is outright, she'll answer queries about whether a
| guess is lexicographically higher or lower. We could even
| imagine that the ML has seen Mum's side of some interactions
| with her doing that.
|
| Would the ML know Mum's secret? No.
|
| Would a child that could interact with Mum? Yes -- after
| at most about 23*log2(alphabet size) comparison queries,
| if the child is efficient.
|
| Learning in an interactive context is not the same as
| learning from written material, so you can't be sure that
| the fact that children learn English from less text means
| that a non-interactive ML system could learn English from
| the same amount. Q.E.D.
|
| Now, if someone figures out how to efficiently train these
| natural language models with reinforcement learning...
| adamsmith143 wrote:
| The general point is that there is a huge volume of
| training data generated daily not that Twitter is a great
| source of it. Though I believe that GPT-3 for example was
| trained on the Common Crawl dataset which would contain
| both Twitter and Reddit.
|
| >It's clear that these models have orders of magnitude too
| much data already.
|
| Seems like a strange claim. The scaling laws are showing
| that you can still make gains with more data and more
| parameters.
|
| >It somewhat reminds me of the proposals for larger and
| larger colliders in the hopes of seeing new physics that is
| always one collider in the future.
|
| This is literally true though: we couldn't have found the
| Higgs without the LHC, and most GUT candidates would only
| start being ruled out at higher energy levels.
| gwern wrote:
| Common Crawl actually does not contain Twitter, you can
| go check the indexes https://github.com/ikreymer/cdx-
| index-client . Twitter is extremely aggressive about
| scraping/caching, and I guess that blocks CC. Models like
| GPT-3 still know a decent amount of Twitter material, and
| I figure that this is due to tweets being excerpts or
| mirrored manually in non-Twitter.com URLs (eg all the
| Twitter-mirroring bots on Reddit).
| zarzavat wrote:
| > Seems like a strange claim. The scaling laws are
| showing that you can still make gains with more data and
| more parameters.
|
| But then we've given up on matching human intelligence
| which is all about working efficiently with _small_
| training data, and certainly training a human does not
| need anywhere near as much data as GPT-3.
|
| GPT-3 was interesting as a proof-of-concept of what
| happens when you use a gigantic amount of training data.
| We don't need a bigger one until we can figure out how to
| make a smaller one that is just as effective.
|
| If scaling laws are telling us to keep putting even more
| training data into the thing, then the conclusion should
| be that the architecture is just not working out.
| adamsmith143 wrote:
| >But then we've given up on matching human intelligence
| which is all about working efficiently with small
| training data, and certainly training a human does not
| need anywhere near as much data as GPT-3.
|
| I don't think we should really take so much inspiration
| from the brain. We didn't make airplanes work by building
| bird machines, so why should we do that here?
|
| >GPT-3 was interesting as a proof-of-concept of what
| happens when you use a gigantic amount of training data.
| We don't need a bigger one until we can figure out how to
| make a smaller one that is just as effective.
|
| This feels like a non sequitur. We can certainly keep
| making larger models and we will, because we can continue
| to make performance gains doing so.
|
| >If scaling laws are telling us to keep putting even more
| training data into the thing, then the conclusion should
| be that the architecture is just not working out.
|
| I don't think anyone in the field would agree to this
| point. Researchers see an easy avenue to gain better
| performance, so they take it. DeepMind's model shows you
| can get similar results with a more refined use of compute,
| but it was released well after GPT-3. When teams
| significantly advance the state of the art with a much
| smaller model I think we should take notice but that
| hasn't happened yet.
| teraflop wrote:
| On the other hand, consider the difficulty of taking massive
| amounts of data from the modern web and filtering out the
| subset that was actually generated by humans, rather than
| previous generations of language models.
| adamsmith143 wrote:
| Definitely an interesting future problem. I'm sure OpenAI
| and others are thinking about it but I don't think these
| models are ubiquitous enough to have much impact just yet.
| axg11 wrote:
| Some estimates:
|
| - 500M tweets per day
|
| - 30 words/tokens per tweet
|
| - 40% of all tweets thrown away due to being
| duplicate/spam/bots
|
| = 9B tokens generated per day
| replygirl wrote:
| There's a ton of data that can be exponentially more useful,
| but we'll need networks that can (analogously) be late to work
| enough times to get fired, or experience heartbreak in
| succession while misunderstanding why prior heartbreak
| happened, or hallucinate stray cats when they're walking around
| the neighborhood at night
| kelseyfrog wrote:
| It implies our models are wrong.
|
| Consider that a human adolescence is ~9.46x10^6 minutes and a
| fast speaking rate is ~200 words/minute. That sets an upper
| bound of 1.9 billion words heard during adolescence, i.e. human
| adults are trained on a corpus of less than 1.9B words.
|
| To some extent, more data can offset worse models, but I don't
| think that's the regime we're currently in. GPT-3 was trained
| on (among other languages) 181 billion English words - or about
| 100 times more words than a human will hear by the time they
| reach adulthood. How is the human brain able to achieve a
| higher level of success with 1% of the data?
|
| 1.
| https://github.com/openai/gpt-3/blob/master/dataset_statisti...
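|
| Spelled out (all inputs are the rough estimates above):
|
|     minutes_of_adolescence = 18 * 365.25 * 24 * 60   # ~9.47e6 minutes
|     words_per_minute = 200                           # fast speech
|     upper_bound_words = minutes_of_adolescence * words_per_minute
|     print(upper_bound_words)            # ~1.9e9 words heard, at most
|     print(181e9 / upper_bound_words)    # GPT-3's English corpus is ~95x larger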
| Symmetry wrote:
| My understanding is that the binding constraint in training
| these models is the quantity of computation they consume.
| While a human makes do with drastically less input data, we
| also have drastically more computational resources in our
| heads to work on the problem than Google is using to train
| its models.
| gwern wrote:
| > How is the human brain able to achieve a higher level of
| success with 1% of the data?
|
| The most obvious answer is "the human brain uses a shit-ton
| more compute", for 18+ years as well.
|
| We spend data, which we have in abundance, to save on
| compute, which we do not. Even at the most generous low-end
| estimates of the human brain's computing power, we are only
| barely there; on the high-end estimates that people in love
| with the ineffable mysteries of the brain love to cite, we
| are multiple orders of magnitude away from even the biggest
| supercomputers matching the brain. So no matter which way you
| slice it, we are extremely compute-poor.
|
| Feeding a lot of data through an extremely lightweight
| optimizer like first-order SGDs is one way to cope with
| lacking compute:
| https://www.gwern.net/docs/ai/scaling/2013-bottou.pdf Bottou
| asks why (even in 2013!) SGD is so hard to dethrone when we
| can empirically see plenty of optimizers, like second-order
| gradient descent algorithms, that beat SGD quite solidly.
| His observation is that while they are much better than SGD
| in terms of iterations or _n_, they lose in compute/wallclock
| because SGD can just go-brrrr through the data much faster
| than they can.
| nynx wrote:
| Yeah, there are ~100B neurons, ~1Q synapses, but how much
| compute is the brain actually using over time?
|
| Some quick googling gives this:
|
| - Generation of an action potential seems to use ~2.5x10^-7
| J [0]
|
| - The brain consumes around 20W during normal activity
|
| This seems to imply that there are around 8x10^7, call it
| 10^8, activations per second [1].
|
| Apparently, the average neuron has 1000 synapses. Let's say
| each synapse requires 10 mulacc operations per activation.
| Doing that math gives about 10^12 FLOPs/s [2].
|
| Integrate that over 18 years, and you get roughly 5.7x10^20
| FLOPs [3].
|
| PaLM required 2.56x10^24 FLOPs to train [4]. So, we have
| (way more than) enough compute, we're just not using it
| efficiently. We're wasting a lot of FLOPs on dense matrix
| multiplication.
|
| There's plenty of wiggle room in these calculations. I
| checked over the math, but I'd appreciate it if someone would
| let me know if I've missed something.
| [0]:
| https://link.springer.com/article/10.1007/s11571-018-9503-3
| [1]: https://www.wolframalpha.com/input?i2d=true&i=Divide%5
| B20+W%2C2.5%E2%80%89%C3%97%E2%80%89Power%5B10%2C%E2%88%927%
| 5D+Joules%5D [2]: https://www.wolframalpha.com/inpu
| t?i2d=true&i=Power%5B10%2C8%5D+Hz+*+1000+*+10+flop
| [3]: https://www.wolframalpha.com/input?i2d=true&i=Power%5B
| 10%2C12%5D+Divide%5BFLOP%2Cs%5D+*+18+years [4]:
| https://blog.heim.xyz/palm-training-
| cost/#:~:text=PaLM%20(2022)-,2.5e24,-10x***
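|
| Reproducing that estimate in a few lines (same assumptions and
| the same rounding as above):
|
|     joules_per_spike = 2.5e-7                  # [0]
|     brain_power_watts = 20
|     spikes_per_s = brain_power_watts / joules_per_spike  # ~8e7, round to 1e8
|     spikes_per_s = 1e8
|     flops_per_s = spikes_per_s * 1000 * 10     # 1000 synapses, 10 mul-accs each
|     lifetime_flops = flops_per_s * 18 * 365.25 * 24 * 3600
|     print(lifetime_flops)                      # ~5.7e20 FLOPs over 18 years
|     print(2.56e24 / lifetime_flops)            # PaLM's training used ~4,500x more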
| nynx wrote:
| Yeah, this implies backpropagation is deeply suboptimal.
| kelseyfrog wrote:
| That is certainly a possibility. The other (non-mutually
| exclusive) implications may also be that human language
| acquisition benefits from being part of a multi-task model.
| Or that the problem has been overreduced ie: human language
| acquisition cannot simply be distilled into a words-
| in->words-out problem and that vision/hearing are actually
| integral parts of language acquisition that cannot be left
| out. Or that model arch still has major improvements to be
| made and attention is not all you need, for example.
| fpgaminer wrote:
| > and that vision/hearing are actually integral parts of
| language acquisition
|
| Deaf-blind authors would beg to differ.
|
| But yes, a human brain is exposed to lots of other
| sensory input, and we know from other research that
| multi-modal models can learn shared representations that
| benefit from the knowledge of each domain.
|
| In Transformer's favor, at least, they are far closer to
| tabula rasa than the human brain is and likely have to
| dedicate a lot of their training time to things that are
| otherwise "baked" into human brains. For example, humans
| come pre-packaged with V1 and V2 as part of their visual
| system, but CNNs and ViTs have to learn those filter
| packages from scratch.
|
| I agree with you though. Human brains are able to take
| single instances of experiences and build a wealth of
| understanding from them in ways that even modern
| Transformer architectures cannot yet match.
| kristintynski wrote:
| It seems like internal language (thinking in language) is
| also a way our brains train themselves. I've probably
| thought 100x more words than I've spoken.
| snovv_crash wrote:
| This would map to a sort of semi-supervised approach. For
| a lot of problems this has been shown to drastically reduce
| the data requirements, but it can bump up compute.
|
| All those conversations in the shower were actually
| regularizers!
| ianbutler wrote:
| This is exciting, if only because as we discover more compute-
| optimal models that outperform the behemoths that have been
| state of the art, it opens up the ability for smaller independent
| groups to train and release their own versions, more fully
| democratizing AI. Looking forward to a group like EleutherAI or
| Hugging Face releasing a version of this.
| adamsmith143 wrote:
| >This is exciting, if only because as we discover more compute-
| optimal models that outperform the behemoths that have been
| state of the art, it opens up the ability for smaller
| independent groups to train and release their own versions,
| more fully democratizing AI.
|
| I think I support this in principle but it seems like the
| scaling curves keep going so it's easier to just make larger
| models with more data.
|
| >Looking forward to a group like EleutherAI or Hugging Face
| releasing a version of this
|
| Both of those groups have access to dozens if not hundreds of
| cloud GPUs; I'd hardly call them small.
|
| It would be impossible to replicate these models as, say, an
| independent researcher, or even in an academic research group
| outside of maybe Stanford/Berkeley/MIT/etc., and I'd even doubt
| their ability to replicate models like this based purely on
| cost alone.
| ianbutler wrote:
| Small is relative -- and to Google, Facebook and Microsoft
| they're positively tiny. Perfect is the enemy of good or some
| such and I think this is a move in the right direction even
| if I can't personally train this on my 3090.
| mark_l_watson wrote:
| The design of the original Transformer model in the "Attention
| Is All You Need" paper was predicated on efficiency (all layers
| the same size, and word/token embeddings combined with sinusoidal
| embeddings of position in the input stream). It is good to see
| improvements!
| narrator wrote:
| I'd love to take a language model, load it up, and train it on
| everything I write in online learning mode. Does one need some
| massive hardware to do online learning with these models instead
| of just running the distilled final models?
| alpineidyll3 wrote:
| If these things get put on specialized hardware for inference
| with much lower energy costs, the world will never be the same.
| hwers wrote:
| Imagine any diffusion-style text-to-image model on specialized
| ASIC hardware.
| astrange wrote:
| That's what an ANE/TPU is.
|
| If you mean putting the model weights into gates directly,
| it'd be useless because users would get bored of the model as
| soon as they figured out what its style looked like. Also,
| large models can memorize their training data so eventually
| you'll get it to output something copyrighted.
| lobstey wrote:
| The biggest problem, first of all, might be the memory
| requirements, given so many parameters. It couldn't be as cheap
| as a high-end computer in the foreseeable future.
| f38zf5vdt wrote:
| There is probably a space-time trade-off that needs to be
| explored here. It might be possible to preload some of the
| most likely tokens to be selected next into the cache and/or
| RAM. These are glorified auto-complete algorithms that are
| poorly understood, as DeepMind's optimizations appear to show.
| For the English language, it is probable that there are only
| so many possible grammatically correct selections for the next
| token, for example.
| visarga wrote:
| Glorified autocomplete? Autocomplete can guess the next
| word... sometimes; GPT-3 goes hundreds of words ahead. On
| generic topics it can be hard to distinguish from human
| text.
|
| And it can't cache tokens because all tokens are evaluated
| in the context of all the other tokens, so they don't have
| the same representations when they reoccur at different
| positions.
| f38zf5vdt wrote:
| They're evaluated in the context of the last 2^n tokens;
| for many models the scanning window is 1024, 2048, or 4096
| tokens. The tokens (words, subwords, and sometimes
| punctuation) are represented by integer values, so the
| last 2^n tokens would certainly qualify for storage
| in a cache. The next-token selection only has so many
| possible assignable choices in any given language
| model because of grammatical limitations. This is only
| one such optimization; there could also be optimizations
| around the likelihood of certain words being used given
| the presence of certain previous tokens, and so on.
|
| But, yes, tokens are chosen one at a time based on
| the previous content, similar to earlier auto-completion
| algorithms.
| priansh wrote:
| I've been saying this for years, language models are the ML
| equivalent of the billionaire space race, it's just a bunch
| of orgs with unlimited funding spending millions of dollars
| on compute to get more parameters than their rivals. It
| could be decades before we start to see them scale down or
| make meaningful optimizations. This paper is a good start
| but I'd be willing to bet everyone will ignore it and
| continue breaking the bank.
|
| Can you say that about any other task in ML? When
| Inceptionv3 came out I was able to run the model pretty
| comfortably on a 1060. Even pix2pix and most GANs fit
| comfortably on consumer compute, and the top-of-the-line
| massive models can still run inference on a 3090. It's so
| unbelievably ironic that one of the major points
| Transformers aimed to solve when introduced was the compute
| inefficiency of recurrent networks, and it's devolved into
| "how many TPUs can daddy afford" instead.
| native_samples wrote:
| Is that fair? My Pixel phone seems to run nothing but ML
| models of various kinds and they run _locally_ which is
| madness, pure madness. It can recognize songs and my
| speech without talking to the cloud at all. That's
| pretty much the definition of optimization!
| galcerte wrote:
| I have to ask, why call it that? I had a chuckle once I saw the
| name.
| redredrobot wrote:
| It outperforms the Gopher model
| cshimmin wrote:
| Yeah, similar "thematic" naming to MacOS versions. I don't
| know why the original one was called Gopher, though.
| goodside wrote:
| Because it retrieves facts from memory in a way that's
| analogized to a gopher retrieving objects.
| gwern wrote:
| There were a lot of complaints about earlier models being
| named, say, 'Meena'. (It's very sexist, you know, to name a
| chatbot a female name.) People won't complain about
| 'Chinchilla' because chinchillas are adorable. PaLMs aren't
| adorable, but at least it's neutral.
| MrBuddyCasino wrote:
| It's not so bad. If they were radio astronomers they'd call it
| Very Big Neuronal Language Model. IBM would call it Watson
| Advanced AI. If they were a gamer accessory company they'd call
| it DeepTek Ultra Pro VDH-Max AI A320M. Chinchilla is nice and
| fluffy.
| farmin wrote:
| It's the name of a town in QLD.
| binarymax wrote:
| Large language models have a (recent) history of silly names.
| BERT, BART, ELMO, RoBERTa, BIGBIRD, PaLM, Megatron etc. Might
| as well go full nonsense.
| DSingularity wrote:
| A touch of irony that cutting-edge research on language can't
| produce better names.
| omarhaneef wrote:
| True. I will add that it is customary to justify it by
| demonstrating it is some sort of acronym or contraction.
| yeetsfromhellL2 wrote:
| It's a recursive, selective acronym:
|     C
|     CH
|     CHI
|     CHIN
|     CHINC
|     CHINCH
|     CHINCHI
|     CHINCHIL
|     CHINCHILL
|     ==> CHINCHILLA
|     HINCHILLA
|     INCHILLA
|     NCHILLA
|     CHILLA
|     HILLA
|     ILLA
|     LLA
|     LA
|     A
| omarhaneef wrote:
| I know what recursive means, I know what selective means,
| I know what an acronym is, and I think I see the pattern
| in that picture, but when I put it all together I am
| lost.
|
| Alternatively, is this a joke and the "recursive,
| selective acronym" can be used to justify any word?
| veonik wrote:
|     A
|     AR
|     ARB
|     ARBI
|     ARBIT
|     ARBITR
|     ARBITRA
|     ARBITRAR
|     ==> ARBITRARY
|     RBITRARY
|     BITRARY
|     ITRARY
|     TRARY
|     RARY
|     ARY
|     RY
|     Y
|
| Yup, seems it works for any word.
| MisterTea wrote:
| My theory is that since no one reads literature anymore,
| timeless, interesting, and unique names from history and other
| cultures are lost to a deluge of soon-to-be-forgotten gag,
| pop-culture, and meme names. Perhaps this is why we have
| Chinchilla and not Oberon.
| jankeymeulen wrote:
| Like the Oberon OS and programming language?
| jstx1 wrote:
| Image models too - the Inception paper from 2014 directly
| refers to knowyourmeme.com and the "we need to go deeper"
| meme from the movie Inception -
| https://knowyourmeme.com/memes/we-need-to-go-deeper - it's
| the first reference in the paper [1] and it's also why the
| model is called that way.
|
| [1] https://arxiv.org/pdf/1409.4842.pdf
| ShamelessC wrote:
| Seems the link is down. Found a decent synopsis/discussion on
| lesswrong.
|
| https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-scalin...
|
| > On March 29th, DeepMind published a paper, "Training Compute-
| Optimal Large Language Models", that shows that essentially
| everyone -- OpenAI, DeepMind, Microsoft, etc. -- has been
| training large language models with a deeply suboptimal use of
| compute.
|
| > Following the new scaling laws that they propose for the
| optimal use of compute, DeepMind trains a new, 70-billion
| parameter model that outperforms much larger language models,
| including the 175-billion parameter GPT-3 and DeepMind's own
| 280-billion parameter "Gopher".
| gyang wrote:
| I think there remains an immense amount of such suboptimality
| still hanging from the tree, so to speak.
|
| For example, our recent paper "Tensor Programs V: Tuning Large
| Neural Networks via Zero-Shot Hyperparameter Transfer"[1] shows
| that even the learning rate and initialization used by existing
| models are _deeply wrong_. By just picking them correctly
| (which involves some really beautiful mathematics), we can
| effectively double the model size of the GPT-3 6.7B model (to
| be comparable in quality to the 13B model across the suite of
| benchmark tasks).
|
| Large neural networks behave in ways we are only beginning to
| understand, largely because each empirical probe of such a
| model is so much more expensive and time-consuming than for
| typical models. But principled theory here can have a lot of
| leverage by pointing out the right direction to look, as it
| did in our work.
|
| [1] http://arxiv.org/abs/2203.03466
| p1esk wrote:
| What do you think about the concept of "critical batch size"?
| https://openai.com/blog/science-of-ai/
| gyang wrote:
| I think the concept makes sense. The basic insight, that
| the right batch size depends on the difficulty and
| noisiness of a task, is already used by teams. For example,
| the PaLM paper from last week increased its batch size
| throughout training.
|
| But as far as I know, the more precise predictions of
| optimal batch size aren't used much, probably because it's
| expensive to measure accurately, or because the predictive
| equation isn't accurate enough to begin with. I wonder if
| we can "transfer" the optimal batch size from a smaller
| setting (smaller model or data) to the full setting, like
| in our paper. This would make it much more practical.
| eigenvalue wrote:
| According to the LessWrong post, the smaller model trained on
| more data performs better on most of the tasks, but it's worse
| on "college level math" questions. I wonder why that is. Is it
| because the extra capacity of the larger model was used to
| basically memorize theorems? Or is it because the extra "brain
| power" let it model the math better? Oddly, one of the tasks
| that the smaller most outperformed the larger model on is "high
| school level math"! Very counterintuitive, and I am curious if
| there are any big takeaways lurking in that disparity.
| ShamelessC wrote:
| Gwern responded to a similar question in the comments
| section.
|
| (parent)
|
| > the fact that data and compute need to scale proportionally
| seems... like a big point in favor of NNs as
| memorizers/interpolators.
|
| (gwern)
|
| > Surely it's the opposite? The more bang you get out of each
| parameter, the less it looks like 'just' (whatever that
| means) memorization/interpolation. When you needed to
| increase parameters a lot, disproportionately, to cope with
| some more data, that does not speak well of abstractions or
| understanding. (If I can train a 1t model to get the same
| loss as what I thought was going to take a 100t model, why
| would I think that that 100t model must be
| memorizing/interpolating less?) Let's take your claim to its
| logical extreme: suppose we discovered tomorrow a scaling law
| that made parameters near-constant (log, let's say); would
| that not suggest that those parameters are super useful and
| it's doing an amazing job of learning the underlying
| algorithm and is not memorizing/interpolating?
| sillysaurusx wrote:
| This isn't addressing their question. And Gwern's goal here
| is to (incorrectly) try to get rid of the idea that models
| are just memorizing and interpolating, when in fact
| memorization and interpolation is what we all do, including
| models. He's just bothered by the idea that people think of
| models as less than magic.
|
| On the other hand,
| https://twitter.com/model_mechanic/status/151297688118364569...
| is admittedly pretty magical, even
| if the basis of that magic is memorization and
| interpolation.
| VirusNewbie wrote:
| Why do you say they just memorize and interpret? I can
| teach GPT-2 new things, including new objects and their
| physical properties and it does a good job with that.
| That also means it has definitely not just regurgitated a
| matching sentence back to me.
| replygirl wrote:
| when i see a new object for the first time, i MEMORIZE
| what i INTERPRET as its identifying traits, and ask
| someone who has already MEMORIZED what that object is to
| INTERPRET a concept with which i can associate those
| traits. the next time i encounter an object with those
| traits i can then recall the associations, then compose
| those trait-level interpretations into an interpretation
| of an object.
|
| at a fundamental level that's all this is, compositions
| of associated memorizations and interpretations, which
| map to compositions of sentence parts the machine can
| regurgitate back to you
| rictic wrote:
| To rebut someone's argument you must address the argument
| and not just talk about them and their motivations
|
| From your comment a reader will understand that you think
| they're just memorizing and interpolating and that you
| disagree with gwern on this point, but you've given your
| reader nothing that argues in favor of your position
|
| Why should someone believe that models are just
| memorizing and interpolating?
| yldedly wrote:
| It's impossible for a piecewise linear function to be
| anything other than linear outside the training sample.
| They are by their definition unable to do anything but
| interpolate.
| danuker wrote:
| It might just be by chance: the initial weights of one model
| could have been lucky in some areas, and unlucky in others.
| There's no way to tell other than training again, which is a
| costly proposition.
| eigenvalue wrote:
| That seems pretty unlikely to me actually. As the models
| and training data get much bigger, I think the initial
| weights become less important (at least assuming your
| random weights have certain desirable statistical
| properties, which they do by construction usually).
| [deleted]
| adamsmith143 wrote:
| Probably right. Most people dump on these language models for
| this reason, but it would be absurd for a HS student to have
| to re-derive the quadratic formula every time they worked on
| an algebra problem, so naturally you memorize it. Why should
| it be any different for a language model?
| eutectic wrote:
| I never memorized the quadratic formula, and I did OK.
| whimsicalism wrote:
| Did you go to school in the US in the last 2-3 decades?
| replygirl wrote:
| Once you start calculus they let you use a real
| calculator
| whimsicalism wrote:
| That may be true, but in the US there are typically math
| courses before calculus.
| replygirl wrote:
| But then we get a calculator.
| whimsicalism wrote:
| Even then, it is typically not a symbolic calculator so
| if your answer is a closed form function of variables,
| you're SOL with a TI-84.
| adamsmith143 wrote:
| Maybe we went to radically different schools but I
| certainly had to calculate by hand using the quadratic
| formula countless times where calculators were not
| allowed to be used.
|
| Anyway it distracts from the point so it's not relevant.
| VikingCoder wrote:
| 70 billion parameters... Is each of those a 4-byte float?
|
| So, is that 280 billion bytes of just parameters?
| sudosysgen wrote:
| I'm fairly confident each of those is a 2-byte float, but yes
| that's over 100 GB of parameters.
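|
| The quick arithmetic (assuming bfloat16 weights, as mentioned
| above from the paper):
|
|     n_params = 70e9
|     bytes_per_param = 2                         # bfloat16
|     print(n_params * bytes_per_param / 1e9)     # 140 GB of raw weights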
| sillysaurusx wrote:
| Welcome to the party! I joined ML because I realized I
| could help. You can too. I bet you're both already thinking
| of clever ways to deal with massive models from an
| infrastructure standpoint. That's just one of hundreds of
| interesting problems.
| native_samples wrote:
| Is 100GB of parameters really that large? 128GB of RAM on
| a server class machine is not unusual. Seems such a model
| could fit entirely in RAM.
| andbberger wrote:
| GPU memory is generally much smaller and more expensive
| kristjansson wrote:
| To elaborate on the sibling comment: main memory is much
| bigger, but CPUs are much, much slower. It would be a
| challenge to merely run a model like this on CPU, and
| totally infeasible to train one. So the challenge is to
| fit into the memory of a single GPU you can afford,
| coordinate multiple GPUs, or efficiently page from main
| memory into GPU.
| Delitio wrote:
| Is there any source which explains what billions of
| parameters actually are?
|
| In my mind a parameter is: language, dialect, perhaps
| context parameters (food, dinner, lunch, travel), and if we
| then talk about language and audio, perhaps sound waves,
| gender.
|
| Or are they context parameters which give you insight? Like, a
| billion parameters are literally something like
| travel=false, travel-europe=true, people speaking=e, age,
| height,
| nl wrote:
| It's rare that a single parameter maps to a human-
| understandable concept. Occasionally someone finds one
| that does map fairly well, for example this case back in
| 2017: https://openai.com/blog/unsupervised-sentiment-
| neuron/#senti...
| jefft255 wrote:
| In this case, the parameters are the weights of a neural
| network.
| matt123456789 wrote:
| A parameter is a scalar value; most of them are in the
| attention matrices and feedforward matrices, and you will
| also hear them called "weights". Any intro to DL course will
| cover these in detail. I recommend starting with Andrew
| Ng's Coursera class on Intro to Machine Learning,
| although there may be better ones out there now.
| Delitio wrote:
| Input parameter vs. weights then?
|
| I see tx
| lostmsu wrote:
| These networks (text models) usually have around a few
| thousand inputs.
| brrrrrm wrote:
| A good visual introduction to neural networks can be
| found here: https://playground.tensorflow.org
|
| A parameter is a "weight" in this case (the lines drawn
| from neuron to neuron). The neurons are effectively
| runtime values or "activations." Parameters (weights) are
| updated during training and then set as constant during
| "inference" (also called "prediction").
|
| There's unfortunately a ton of jargon and different
| groups use different words almost exclusively.
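|
| A toy forward pass that makes the distinction concrete (sizes
| are made up):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     W = rng.normal(size=(512, 1024))  # parameters ("weights"): learned, then frozen
|     b = np.zeros(1024)                # more parameters: one bias per output
|
|     def layer(x):
|         # The output is an "activation": a runtime value recomputed per input.
|         return np.maximum(0, x @ W + b)
|
|     x = rng.normal(size=(1, 512))     # a runtime input
|     h = layer(x)                      # h is an activation, not a parameter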
| dotnet00 wrote:
| Parameters are just floating point numbers, at most they
| can be seen as degrees of freedom or kind of like the
| order of a polynomial used in curve fitting.
|
| They're too abstract to assign much meaning to individual
| parameters, as our understanding of why their values are
| exactly the way they are is extremely limited.
___________________________________________________________________
(page generated 2022-04-11 23:00 UTC)