[HN Gopher] Cerebras-GPT: A Family of Open, Compute-Efficient, L...
___________________________________________________________________
Cerebras-GPT: A Family of Open, Compute-Efficient, Large Language
Models
Author : asb
Score : 341 points
Date : 2023-03-28 16:34 UTC (6 hours ago)
(HTM) web link (www.cerebras.net)
(TXT) w3m dump (www.cerebras.net)
| ivanvas wrote:
| Is it currently possible to fine-tune any of the foundation
| models available on a few GB of unsupervised text?
| simonw wrote:
| Does the chinchilla recipe still hold today? I got the impression
| that the LLaMA paper proposed a different result where throwing
| far more tokens at the problem had a very meaningful impact, or
| did I misunderstand that?
| evanmays wrote:
| There's discussion elsewhere in this thread about what Chinchilla
| actually means. I'll only compare it to LLaMA.
|
| Tldr; Chinchilla isn't wrong, it's just useful for a different
| goal than the llama paper.
|
| There are three hyperparameters to tweak here: model size
| (parameter count), number of tokens pre-trained on, and amount of
| compute available. End performance is, in theory, a function of
| these three hyperparameters.
|
| You can think of this as an optimization function.
|
| Chinchilla says, if you have a fixed amount of compute, here's
| what size and number of tokens to train for maximum
| performance.
|
| A lot of the time, though, we have a fixed model size, because
| size impacts inference cost and latency. LLaMA operates in this
| territory: they chose to fix the model size instead of the amount
| of compute.
|
| This could explain gaps in performance between Cerebras models
| of size X and LLaMA models of size X. LLaMA models of size X
| have way more compute behind them.
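|
| A rough back-of-the-envelope version of that trade-off (the
| C ~= 6*N*D approximation and the ~20 tokens/param ratio are
| common rules of thumb, not exact numbers from either paper):
|
|     # Split a fixed FLOP budget C between parameters N and
|     # training tokens D, using C ~= 6*N*D and the
|     # Chinchilla-style ratio D ~= 20*N.
|     def chinchilla_split(compute_flops):
|         n_params = (compute_flops / (6 * 20)) ** 0.5
|         n_tokens = 20 * n_params
|         return n_params, n_tokens
|
|     # e.g. ~2e22 FLOPs -> roughly a 13B-param model on ~260B tokens
|     print(chinchilla_split(6 * 13e9 * 260e9))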
| espadrine wrote:
| I don't think it holds for two reasons.
|
| First, it only holds for a given architecture and
| implementation. Obviously, a different architecture will have a
| different training slope. This is clear when comparing LSTM
| with Transformers, but is also true between transformers that
| use prenorm/SwiGLU/rotary-positional, and those that follow
| Vaswani 2017.
|
| In terms of implementation, some algorithms yield the same
| result with fewer operations (IO, like FlashAttention and other
| custom CUDA kernels, and parallelism, like PaLM, which both
| came after Chinchilla), which unambiguously affect the Tflops
| side of the Chinchilla equation. Also, faster algorithms and
| better parallelization will yield a given loss sooner, while
| less power-hungry setups will do that more cheaply.
|
| Second, even in the original Chinchilla paper, in figure 2, some
| lines are stopped early before reaching the Pareto frontier
| (likely because they ran out of tokens, but LLaMA makes it seem
| that >1 epoch of training is fine).
| whalesalad wrote:
| This "AI spring" is really snowballing with the crazy nouns and
| terminology. Alpaca, llama and now chinchilla??
| murkt wrote:
| Chinchilla actually came before alpaca and llama.
|
| Every new variation of model gets some new name, just like
| every library gets a new name. There were all kinds of BERTs
| before - DistilBert, Roberta, SciBERT, Schmobert, Schmuber,
| etc. Many hundreds of them, I think.
| whoisnnamdi wrote:
| Chinchilla actually came first!
| jhbadger wrote:
| As mentioned, chinchilla is not part of this trend, and
| chinchillas are rodents. Alpacas and llamas are South American
| camelids (animals related to camels). So if additional names
| are needed, I would expect them to be vicuna and guanaco, as
| they are also in the group.
| mometsi wrote:
| I think the relevant category is "Adorable Fuzzy Critters of
| the Andes". See also
| https://en.wikipedia.org/wiki/Spectacled_bear
| ramesh1994 wrote:
| The term "chinchilla" predates llama/alpaca. It doesn't
| directly map to a specific model, rather a family of compute-
| optimal models.
| mometsi wrote:
| Summary: This is a company that makes AI accelerator ICs. They
| reimplemented Chinchilla and released the model weights under a
| permissive license.
| bogwog wrote:
| In other words, they're actually incentivized to help make LLMs
| as accessible as possible, rather than try to keep them locked
| up to hide them from competitors.
|
| Which makes me wonder if Nvidia is doing anything with LLMs
| too?
| vintermann wrote:
| NVidia has certainly been pushing the envelope on image
| generation. StyleGAN3 was really cool when it came out.
|
| But it is an issue that their chips are hardly optimized for
| LLMs.
| meghan_rain wrote:
| How can a GPU be optimized for StyleGAN but not LLMs?
| Serious question.
| MacsHeadroom wrote:
| RAM. GPT-3 is over 600GB, ie just the max RAM of 8xA100s,
| because that's all the hardware can fit.
|
| StableDiffusion plus a whole chain of imagenets can make
| any visual imagery imaginable in 2GB of RAM. Meanwhile
| 2GB of RAM barely runs a basic tiny text completion NN
| that can't do anything intelligent. Text requires a lot
| more parameters (and more memory/RAM) than images.
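|
| Rough weight-only arithmetic behind that (a sketch; real
| deployments also need activation and KV-cache memory on top):
|
|     # GPT-3-scale weight memory, weights only
|     params = 175e9
|     print(params * 4 / 1e9, "GB in fp32")  # ~700 GB
|     print(params * 2 / 1e9, "GB in fp16")  # ~350 GB
|     # vs. 8 x 80 GB A100s = 640 GB of HBM in a single node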
| brucethemoose2 wrote:
| The Cerebras node's actual "RAM" (the 40GB of SRAM) is
| pretty modest too, but being an _enormous_ chip with the
| networked storage pools is certainly a better situation
| than a bunch of A100s reaching out to every other A100.
|
| Honestly, all the AI ASIC makers drastically
| underestimated the RAM requirements of future models.
| Graphcore's 4GB and Tenstorrent's 8GB per IC is kinda
| laughable, and it takes them longer to adjust than
| Nvidia. And Cerebras' original pitch was "fit the entire
| model into SRAM!"
| option wrote:
| Obviously yes; there are larger (20B) models, trained on more
| tokens, released by NVIDIA under a commercial-friendly license
| on HF
|
| https://huggingface.co/nvidia/nemo-megatron-gpt-20B
| antimatter15 wrote:
| Looking at their charts, it seems like their 6.7B model is
| considerably worse than GPT-J, which is an existing open 6B model
| from 2021.
|
| I wish rather than stopping training early they would have run
| more data through a small model so we could have something more
| competitive with LLaMA 7B.
| cs-fan-101 wrote:
| Someone posted this response from the Cerebras Discord earlier,
| but sharing again for visibility -
|
| "We chose to train these models to 20 tokens per param to fit a
| scaling law to the Pile data set. These models are optimal for
| a fixed compute budget, not necessarily "best for use". If you
| had a fixed parameter budget (e.g., because you wanted to fit
| models on certain hardware) you would train on more tokens. We
| do that for our customers that seek that performance and want
| to get LLaMA-like quality with a commercial license"
| HanClinto wrote:
| Sounds like we should crowd-fund the cost to train and open
| source one of these models with LLaMa-like quality.
|
| I'd chip in!
| brucethemoose2 wrote:
| TBH that seems like a good job for Cerebras.
|
| There are plenty of such efforts, but the organizer needs
| some kind of significance to attract a critical mass, and an
| AI ASIC chip designer seems like a good candidate.
|
| Then again, maybe they prefer a bunch of privately trained
| models over an open one since that sells more ASIC time?
| brucethemoose2 wrote:
| > Cerebras Discord
|
| This is really weird to hear out loud.
|
| I still think of Discord as a niche gaming chatroom, even
| though I know that (for instance) a _wafer scale IC design
| company_ is hosting a Discord now.
| simonw wrote:
| You can try out some of these models on Hugging Face here:
| https://huggingface.co/cerebras/Cerebras-GPT-1.3B
|
| That was the largest that had inference enabled - I'd really like
| to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B
| amilios wrote:
| Comparing the 13B model here
| https://huggingface.co/cerebras/Cerebras-GPT-13B to LLaMA-13B
| https://github.com/facebookresearch/llama/blob/main/MODEL_CA...
| you can see that in all of the reasoning tasks Cerebras-GPT lags
| behind. Any reason to use Cerebras instead of LLaMA? Doesn't seem
| like it.
| option wrote:
| It lags behind because, according to their blog post, it was
| trained on <300B tokens. The LLaMAs, as far as I know, were
| trained on more than a trillion.
| gpm wrote:
| The LLaMa paper says 1 trillion for the smaller models (7B,
| 13B) and 1.4 trillion for the larger models (30B, 65B)
| [deleted]
| mdagostino wrote:
| LLaMA is non-commercial
| potatoman22 wrote:
| Can the LLaMA weights be used for commercial products?
| espadrine wrote:
| There are two aspects to it.
|
| The first one is whether they would actually sue. The optics
| would be terrible. A similar situation occurred in the 90s
| when the RC4 cipher's code was leaked. Everyone used the
| leaked code pretending that it was a new cipher called
| arc4random, even though they had confirmation from people
| that licensed the cipher that its output was identical.
| Nobody was sued, and the RSA company never acknowledged it.
|
| The second one is related to the terms. The LLaMA weights
| themselves are licensed under terms that exclude commercial
| use:[0]
|
| > _You will not [...] use [...] the Software Products (or any
| derivative works thereof, works incorporating the Software
| Products, or any data produced by the Software), [...] for
| [...] any commercial or production purposes._
|
| But the definition of derivative works is gray. AFAIK, if
| LLaMA is distilled, there is an unsettled argument to be had
| that the end result is not a LLaMA derivative, and cannot be
| considered copyright or license infringement, similar to how
| models trained on blog articles and tweets are not infringing
| on those authors' copyright or licensing. The people that
| make the new model may be in breach of the license if they
| agreed to it, but maybe not the people that use that new
| model. Otherwise, ad absurdum, a model trained on the
| Internet will have content that was generated by LLaMA in its
| training set, so all models trained on the Internet after Feb
| 2023 will break the license.
|
| IANAL, but ultimately, Meta wins more by benefiting from what
| the community contributes on top of their work (similar to
| what happened with React), than by suing developers that use
| derivatives of their open models.
|
| [0]: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2
| jP4Z...
| gpm wrote:
| Unclear, likely jurisdiction dependent, almost certainly not
| if you need to operate world wide.
| binarymax wrote:
| Here are the zero-shot accuracy numbers posted in the Huggingface
| evaluations for Cerebras-GPT 13B vs. the results of LLaMa 13B in
| their paper:
|
|     Model             BoolQ  PIQA  SIQA  HellaSwag  WinoGrande  ARC-e  ARC-c  OBQA
|     LLaMa 13B         78.1   80.1  50.4  79.2       73.0        74.8   52.7   56.4
|     Cerebras-GPT 13B  -      76.6  -     51.3       64.6        71.4   36.7   28.6
| wsgeorge wrote:
| I guess it's something. It still goes to show how far open
| models are behind the proprietary SOTA.
| topynate wrote:
| From their discord:
|
| > It would be interesting to know why you chose those FLOPS
| targets, unfortunately it looks like the models are quite
| under pre-trained (260B tokens for 13B model)
|
| > We chose to train these models to 20 tokens per param to
| fit a scaling law to the Pile data set. These models are
| optimal for a fixed compute budget, not necessarily "best for
| use". If you had a fixed parameter budget (e.g., because you
| wanted to fit models on certain hardware) you would train on
| more tokens. We do that for our customers that seek that
| performance and want to get LLaMA-like quality with a
| commercial license
|
| Which is the point made elsewhere in these comments, e.g.
| https://news.ycombinator.com/item?id=35344192, and also
| usefully shows how open Cerebras are. They're pretty open,
| but not as much as they would be if they were optimising for
| filling in other companies' moats.
| binarymax wrote:
| Indeed but this is zero-shot performance. Fine-tuning for a
| task should get you pretty good results. I'm interested in
| seeing the results of an Alpaca method against this Cerebras
| 13B model.
| MacsHeadroom wrote:
| >I'm interested in seeing the results of an Alpaca method
|
| You're talking apples to oranges. The "Alpaca method" is a
| dataset generation method. Nothing about Alpaca's training
| method is novel, interesting, or efficient. Alpaca used the
| same standard training method everyone else uses, A100
| clusters.
|
| If you mean LoRA/PEFT training which people used to
| replicate Alpaca then that is also apples to oranges
| because LoRA/PEFT is a finetuning method not a pre-training
| method.
| [deleted]
| Vetch wrote:
| Base model performance is what's most important and also
| impacts fine-tuning quality. Practically, a model that's
| good out of the box with minimal fine-tuning is also useful
| to more people. Since they focused on being training
| compute optimal for some budget, expect their models to lag
| behind Llama overall. Their 6.7B version should lag behind
| GPT-J, assuming 20 tokens per parameter.
|
| The Pythia models are also worth checking out, they might
| be better than or matched to CerebrasGPTs at each size
| (although they warn it is not intended for deployment).
|
| Conclusion: the landscape of top open models remains
| unchanged.
| rnosov wrote:
| I might be missing something but it looks to me that actually
| running this "open" model requires special hardware only
| accessible with a cloud subscription with 60 000 USD / week
| minimum spend[1]. Can anyone confirm if you can run it on your
| own hardware? If software is open but hardware is locked I don't
| see the point.
|
| [1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-
| engi....
|
| EDIT: Ok, looks like I've missed the hugging face repo. The
| language they use is a bit confusing.
| bubblethink wrote:
| You can run inference on GPUs. These are just models and
| weights.
| simonw wrote:
| The PyTorch model files are already available to download from
| Hugging Face - the largest one looks to be 52GB. They should
| run on any hardware that can run regular PyTorch models.
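|
| A minimal sketch of that (assuming the standard transformers
| causal-LM interface applies to these checkpoints; the model name
| is taken from the Hugging Face links above):
|
|     from transformers import AutoTokenizer, AutoModelForCausalLM
|
|     name = "cerebras/Cerebras-GPT-1.3B"
|     tokenizer = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     inputs = tokenizer("Generative AI is ", return_tensors="pt")
|     outputs = model.generate(**inputs, max_new_tokens=50,
|                              do_sample=True)
|     print(tokenizer.decode(outputs[0]))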
| JamesCoyne wrote:
| Slightly off-topic:
|
| I remember seeing news about the enormous chip Cerebras was/is
| selling (pdf
| https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).
|
| Has there been any indication that the LLMs released in the last
| few months use exotic hardware like this, or is it all "standard"
| hardware?
| wmf wrote:
| OpenAI uses Nvidia GPUs and Google uses their TPUs.
| ipsum2 wrote:
| Everyone except Google uses Nvidia for training. Cerebras,
| Gaudi, and other custom AI accelerators have been unable to
| surpass Nvidia in performance/$ and performance/watt yet.
| brucethemoose2 wrote:
| You might see more training on Intel XPUs when they come out,
| since they have such enormous RAM pools. Maybe AMD MI300s and
| Intel Ponte Vecchio (both 128GB) in the shorter term, though I
| think they will mostly be in HPC supercomputers instead of
| cloud instances.
| eternalban wrote:
| > It takes substantial technical expertise to train very large
| models on GPUs. In the recently released GPT-4 Technical Report,
| OpenAI credits over thirty contributors just for compute
| infrastructure and scaling.
|
| This is called a silver lining for some (in case you were worried
| about GPT taking your job). Privacy requirements alone will in
| the near term force major companies to run their own inference
| (if not training). The expertise required is nearly identical to
| that needed to run large-scale distributed computational graphs.
|
| This is an interesting divergence from what happened with the
| web. The backends started out simple, before map-reduce and
| before deconstructing databases and processing distributed logs.
| With ML, we'll jump right into the complex backends, in tandem
| with easy-pickings early-stage edge applications (which we see
| daily on HN).
| 2bitencryption wrote:
| This type of article (or press release, or whatever you want to
| call it) is exactly what makes the future so interesting.
|
| The cat is out of the bag, the genie is out of the bottle, the
| confetti has left the cannon[0].
|
| It's tempting to see a world dominated by Google Bard, ChatGPT,
| Bing Search, etc. And no doubt, they will be huge players, with
| services that are far more powerful than anything that can be run
| on the edge.
|
| But. BUT. The things that we can do on the edge are _incredible_
| now. Just imagine a year from now, or two. These earth-shattering
| models, which seem to be upending a whole industry, will soon
| have equivalents that run on the edge. Without services spying on
| your data. Without censorship on what the model can /cannot say.
| Because it's all local.
|
| When was the last time this happened? There _will_ be players who
| publish weights for models that are free to use. The moment that
| torrent magnet link is published, it's out in the wild. And
| smart people _will_ package them as "one click installers" for
| people who aren't tech-savvy. This is already happening.
|
| So every time you're amazed by something chat-gpt4 says, remember
| that soon this will be in your pocket.
|
| [0] the "confetti" idiom brought to you by chat-gpt4.
| jazzkingrt wrote:
| Serious question: is it typical to describe client-side
| computing as "on the edge"?
|
| I thought running something on the edge referred to running it
| in close network proximity to the user, rather than users
| having control and running things themselves.
| dragonwriter wrote:
| > Serious question: is it typical to describe client-side
| computing as "on the edge"?
|
| Somewhat; its consistent with, e.g., Google's "Edge TPU"
| designation for its client-side neural processors.
|
| > I thought running something on the edge referred to running
| it in close network proximity to the user
|
| Typically, but on the client device _is_ the limit-case of
| "close network proximity to the user", so the use is
| consistent.
| capableweb wrote:
| Yes, "edge computing" can refer to both computing done as
| close to the user as possible geographically, or even on the
| device itself. If someone says "I wanna do edge computing"
| it's not clear enough to know if they just want to have
| servers they control as close to the user as possible, or do
| the computing on the device itself. I think Apple would say
| "edge computing" is on the actual device while CloudFlare
| would say "edge computing" is on their infrastructure, but
| distributed to be physically closer to the end user.
| iamerroragent wrote:
| I guess I've been out of the loop for a bit and didn't
| realize that "edge computing" became a term since cloud
| computing took off.
|
| It is kind of cyclical then, is it not?
|
| By that I mean computers used to be shared and to log into
| it through a terminal.
|
| Then the PC came around.
|
| Then about 15 years ago cloud computing became the rage (really
| an extension of, or a more sophisticated version of, the first
| time-shared computers).
|
| Now we're back to local computing. I even see more self
| hosting and moving away from cloud due to costs.
|
| All that rant is to say is it's interesting.
|
| Side note: getting this AI to be as localized as possible will,
| I imagine, be really useful in the medical industry because it
| helps with HIPAA requirements.
| nordsieck wrote:
| > It is kind of cyclical then is not?
|
| > By that I mean computers used to be shared and to log
| into it through a terminal.
|
| > Then the PC came around.
|
| > Then about 15 years ago Cloud computing became the rage
| (really an extension or more sophisticated system than
| the first time shared computers)
|
| There's a really neat article called "The Eternal
| Mainframe"[1] that you might be interested. It explores
| this idea in greater depth.
|
| ---
|
| 1. http://www.winestockwebdesign.com/Essays/Eternal_Mainf
| rame.h...
| iamerroragent wrote:
| Thanks, that was an interesting read!
|
| I wonder if the author's perspective has changed with
| regards to freedom to compute.
|
| Social media is often used as an example of privacy
| invasion, though I've failed to see why concerns over
| Facebook handling your private data are worrying when they
| don't have a product you need to have.
|
| Email on the other hand, is pretty much a necessity today
| so privacy concerns are vital there imo. Of course you
| can host your own server whereas you can't host your own
| Facebook.
| wsgeorge wrote:
| I believe this has been extended to mean "on device", which
| is interesting. See Gerganov's article on Github [0]. I wrote
| about this here [1] where I made a contrast between the core
| and the edge. I think the term maps well to this meaning.
|
| What I find more interesting is that in the classic "close
| network proximity", some parts of the world may not have
| benefited as much from that trend since the closest nodes of
| a global delivery network could be several countries away.
|
| [0] https://github.com/ggerganov/llama.cpp/discussions/205
|
| [1] https://medium.com/sort-of-like-a-tech-diary/consumer-ai-
| is-...
| TeMPOraL wrote:
| > _I believe this has been extended to mean "on device",
| which is interesting._
|
| I don't like the connotations this carries. This is almost
| openly talking about reaching all the way into peoples'
| hardware to run your software, for your benefit, on them,
| without their knowledge, consent or control...
| wsgeorge wrote:
| I see. Hadn't considered this. Yes, I see how that might
| be a concern.
|
| What I think is important in this AI Spring is that we
| make it possible for people to run their own models on
| their own hardware too, without having to submit anything
| to a large, centralised model for inference.
| matchagaucho wrote:
| An LLM running locally providing type-ahead completions seems
| inevitable.
| lioeters wrote:
| Yes, yes, and yes. I'm waiting for an actually open AI that can
| run on the edge, purely on commodity hardware like our laptops
| and phones - it's inevitable.
|
| I imagine this "cat out of the bag" situation, the
| democratization and commodification of powerful technology
| accessible and affordable to the public, is similar to what's
| happening with single-board computers and microcontrollers like
| Raspberry Pi, Arduino, ESP32.
|
| It might be similar to what happened with mobile phones, but
| there the power was quite restricted. The (mostly) duopoly of
| iOS and Android, with devices and apps locked down in various
| ways. Sure we can "jail break" and "root" our phone, but that's
| not for the general public.
|
| Maybe solar energy production is going through a similar
| process, with panels and batteries becoming more efficient and
| affordable every year.
|
| Certainly, it reminds one of the history of personal computers,
| the way such a powerful general-purpose tool became ubiquitous
| and local.
| akiselev wrote:
| After using ChatGPT 4 extensively for a few days, I think
| we're probably only a few years away from the first
| generation of truly conversational assistants ala Jarvis in
| Iron Man. Between LangChain and existing voice recognition
| software, we're already 95% of the way there; it just needs
| to be packaged up into a UI/UX that makes sense.
|
| These local models are absolutely critical for that to happen
| though. I'm hitting daily situations where I have to
| reconsider my use of ChatGPT because what I'm asking would
| leak very private personal information or somehow trip its
| morality filter.
|
| Just swapped in a 2TB nvme for a separate /home mount and
| reinstalled Arch just so I could have enough disk space to
| test a dozen models locally. I'm so ready!
| barking_biscuit wrote:
| Only a few years? Nobody can predict accurately in years
| anymore. Feels more like "only a few months" away.
| meghan_rain wrote:
| I pray to the AI gods that OpenAI will fail at calibrating
| the censorship layer and will continue to overcensor, which
| in turn will hopefully lead to many usecases requiring
| local LLMs, which in turn would increase the incentive to
| build them.
| visarga wrote:
| > the democratization and commodification of powerful
| technology accessible and affordable to the public
|
| Empowering everyone equally, very low barrier of entry. The
| spread of AI will make some skills become less important and
| other skills more important. There will be painful social
| changes. Memorising stuff, being exact, being careful,
| diligent - all these things can be done by LLM. But also
| creative recombination of ideas. For now we still hold
| supremacy in long form content, complex software projects and
| detecting inconsistencies, but they could be emergent
| qualities in GPT-5.
| sbierwagen wrote:
| If you thought crypto miners drove up the prices of GPUs,
| just wait until a gaming card can host a person who can get a
| job.
|
| Integer-quantized LLaMA can fit on a 4090, and isn't quite
| human-level. If a fully aligned, reliably non-hallucinatory LLM
| can fit on the next generation 5 series GPUs, then expect
| their market price to quadruple.
| hiAndrewQuinn wrote:
| I for one dream of a future without maps. I want to walk
| through a distant forest to find an ancient, unconnected ESP-32
| in the bark of a tree containing a tiny specialized AI that can
| only tell me about things relevant to the area, how far to walk
| upstream to the nearest town. And only if I can find it and
| scan an RFID tag to wake it up.
| vinc wrote:
| A beautiful dream!
|
| > I like to think (right now please!) of a cybernetic forest
| filled with pines and electronics where deer stroll
| peacefully past computers as if they were flowers with
| spinning blossoms.
| t_minus_2 wrote:
| The cat is out of the bag, the genie is out of the bottle, the
| confetti has left the cannon, the ship has sailed, the horse has
| bolted, the toothpaste is out of the tube, the beans have been
| spilled, the train has left the station, the die is cast, the
| bell has been rung.
| cjf101 wrote:
| Yes, this is true. But I worry about how long it will take for
| the utility of "GPT-4" on my phone to be close enough to
| whatever is only possible through models running on large cloud
| platforms to make that choice relatively drawback-free.
|
| Is the curve of what this class of algorithms can provide
| sigmoid? If so, then yeah, eventually researchers should be
| able to democratize it sufficiently that the choice to use
| versions that can run on private hardware is rational. But if
| the utility increases linearly or better over time/scale, the
| future will belong to whoever owns the biggest datacenters.
| hintymad wrote:
| I'd go one step further if it is not happening yet: smaller
| companies should really pool their resources to train open
| LLMs. Say, form a consortium and work with the open source
| community to build a ChatGPT equivalent. Companies would be crazy
| to assume that they can hand their future to the APIs offered
| by a handful of companies during this monumental technological
| paradigm shift in history.
|
| That is, a real OpenAI with an open governing body.
| yieldcrv wrote:
| > And smart people will package them as "one click installers"
| for people who aren't tech-savvy. This is already happening.
|
| Any projects I can follow? Because I haven't seen any one click
| installers yet that didn't begin with "first install a package
| manager on the command line"
| slickdork wrote:
| Not an LLM, but this one-click installer for Stable Diffusion is
| literally a one-click installer. It works impressively well.
|
| https://github.com/cmdr2/stable-diffusion-ui
| simon83 wrote:
| Google: "confetti has left the cannon"
|
| > No results found for "confetti has left the cannon".
|
| I'm amazed that a "stochastic parrot" can come up with such a
| beautiful idiom.
| visarga wrote:
| That name aged like milk. First of all, what you said. And
| second, a parrot can make more parrots without human help;
| language models can't make GPU chips. Insulting to both LLMs
| and parrots.
| educaysean wrote:
| Language models exist not in the physical world but in the
| digital one. A more apt comparison would be whether LLMs can
| compile new (or slightly modified) LLMs, and I believe the answer
| will soon be (if not already) a resounding yes.
| visarga wrote:
| I agree, LLMs can generate the source code for their
| network, can generate the training data, supervise a
| model being trained and explain step by step what they
| do. That's why I specifically said they can't make GPUs,
| if they could, they would be self replicators. We'd have
| to consider them alive.
| groestl wrote:
| > language models can't make GPU chips
|
| Not yet. I'm sure somebody is taking a hard look already.
| flir wrote:
| Could be lifted from another language. Is nice though.
| barking_biscuit wrote:
| Out of distribution generations are a thing.
| simon83 wrote:
| I understand that all of this is based on some fundamental
| mathematics, a couple of algorithms here, and some
| statistical analysis there. And I find it absolutely
| amazing that we can use all of that knowledge and encode it
| into something that resembles intelligence. This makes me
| think if our brains and the concept of intelligence are
| really as special and mysterious as we assume.
| athom wrote:
| Try looking up "pinkie pie party cannon"
| tjr wrote:
| Not exactly the same, but from:
|
| https://www.instagram.com/p/CQdBiVyh5C2/?hl=en
|
| _Now that the cat is out of the bag, or, should I say the
| confetti is out of the... can?_
| 2bitencryption wrote:
| For the curious, here was the conversation I had:
|
| ME: What's another saying similar to: "The cat is out of the
| bag" or "The genie is out of the bottle"?
|
| chatgpt: Another similar saying is "the toothpaste is out of
| the tube." Like the other two expressions, this phrase
| conveys the idea that something has been revealed or
| unleashed, and it cannot be undone or put back into its
| original state.
|
| ME: Can you invent a new phrase that has the same meaning,
| which is completely new?
|
| chatgpt: Certainly! Here's a new phrase with a similar
| meaning: "The confetti has left the cannon." This phrase
| suggests that once something is out in the open or has
| happened, it's impossible to reverse or contain it, much like
| confetti that has been shot out of a cannon during a
| celebration.
|
| ME: What's the name for this kind of expression? Aphorism?
| Fable?
|
| chatgpt: This kind of expression is known as an idiom. Idioms
| are phrases or expressions with a figurative meaning that is
| different from their literal meaning. They are commonly used
| in everyday language to convey a specific message or
| sentiment in a more colorful or memorable way.
| Tommstein wrote:
| That is impressive, especially given that neither Google
| nor DuckDuckGo find any previous trace of that idiom.
| johtso wrote:
| Not exactly sure why it would be surprising that it can
| come up with a convincing idiom when it can produce
| remarkably good _poetry_
| bilsbie wrote:
| I don't understand why people aren't more impressed with it
| clearly understanding and then even inventing idioms. That
| shows some real intelligence.
| fnordpiglet wrote:
| It's because they're confused in thinking human
| intelligence isn't learned stochastic expectation.
| gary_0 wrote:
| That's how the brain evolved, after all: take incoming
| input, process according to previous patterns, output a
| reaction to peripheral nervous system that will enhance
| survival. The evolutionary purpose of a cortex is to
| store a statistical model of an animal's environment
| based on past stimulus. Human intelligence is just a
| complicated permutation of these initial systemic
| parameters.
| [deleted]
| johnchristopher wrote:
| OT: I don't know about their scaling strategy for LLM but their
| scaling strategy for displaying pictures is disappointing.
|
| (it's all blurry)
| lanshanbob wrote:
| I think it's fixed now
| ricopags wrote:
| Came here to point this out, though not as pithily :D
|
| Really, really bad mark on whoever is in charge of their web
| marketing. Images should never look that bad, not even in
| support, but definitely not in marketing.
|
| edit: so this post is more useful, 4k res using Edge browser
| Kelamir wrote:
| Last time I viewed it, I believe it wasn't blurry. Perhaps to
| scale the traffic the images are now displayed in lower
| quality?
|
| But I'm not sure anymore that it wasn't initially blurry...
| Perhaps I'm hallucinating, like large language models.
|
| Current image displayed is https://www.cerebras.net/wp-
| content/uploads/2023/03/Scaling-... , will see if it changes.
| Kelamir wrote:
| I can confirm, it does change. As of now, it displays one of
| higher quality: https://www.cerebras.net/wp-
| content/uploads/2023/03/Scaling-...
| thewataccount wrote:
| They're dynamically scaled and something must be broken. If you
| inspect source you can find the raw images, here's a few:
|
| https://www.cerebras.net/wp-content/uploads/2023/03/Downstre...
|
| https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
|
| https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
|
| EDIT: Looks like it scores better with less training - up until
| it matches GPT-J/Pythia/OPT, past which there doesn't appear to
| be much benefit. It maybe scores slightly better than GPT-J,
| which is pretty "eh"; I'm not sure if GPT-J-level performance is
| really useful for anything? NeoX 20B outperforms it in everything
| if you don't care about the amount of training needed.
|
| Does the better performance for less training matter if that
| benefit only applies when it's performing a lot worse than
| GPT-J? It appears to lose its scaling benefits before the
| performance is interesting enough to matter.
| [deleted]
| johnchristopher wrote:
| Disabling ublock on Firefox fixes the issue for me.
|
| edit: scratch that, it seems the AJAX endpoint returns 504
| more often than not.
| thewataccount wrote:
| Yeah it seems hit/miss depending on the window size and
| stuff. My guess is whatever is supposed to decide what size
| to request is messed up.
| ftxbro wrote:
| > Our paper, which will be available soon, will detail our
| training methods and performance results.
|
| Yay there will be a paper let's gooooooo!
| chessgecko wrote:
| I wonder what led to such a gap between llama 7b and Cerebras
| 13b. I hope they discuss it in the paper.
| gpm wrote:
| Is there a benchmark comparing the two that I missed?
|
| Edit: The huggingface page has 0-shot benchmarks which you can
| compare against the llama paper
|
| https://huggingface.co/cerebras/Cerebras-GPT-13B
|
| https://arxiv.org/pdf/2302.13971.pdf
| freeqaz wrote:
| I'm on mobile and struggled to compare these two tables
| properly. Would you mind posting a summary of your findings?
|
| Here are some values but I don't know what they mean. LLaMA
| 65B on the left, Cerebras 13B on the right.
|
| PiQA: 82.8 / 76.6
| WinoGrande: 77.0 / 64.6
| ARC-e: 78.9 / 71.4
| gpm wrote:
| Really short summary: LLaMA is better, even the smaller LLaMA
| models.
|
|     Benchmark    Cerebras 13B  LLaMA 7B  LLaMA 13B  LLaMA 65B
|     HellaSwag    51.3          76.1      79.2       84.2
|     PIQA         76.6          79.8      80.1       82.8
|     WinoGrande   64.6          70.1      73.0       77.0
|     ARC-e        71.4          72.8      74.8       78.9
|     ARC-c        36.7          47.6      52.7       56.0
|     OpenBookQA   28.6          57.2      56.4       60.2
| ftxbro wrote:
| This gap makes sense to me. The academic point of the Cerebras
| paper is to show their nice empirical scaling law for compute-
| optimal training, whereas the academic point of the LLaMA paper
| was to show that you can make small models punch above their
| weight by training them in a way that is deliberately not
| compute-optimal. Of course both of those publications had other
| academic and marketing purposes.
|
| From the Cerebras blog post: "Trained using the Chinchilla
| formula, these models provide the highest accuracy for a given
| compute budget."
|
| From the LLaMA paper: "The focus of this work is to train a
| series of language models that achieve the best possible
| performance at various inference budgets, by training on more
| tokens than what is typically used."
| visarga wrote:
| Of course this is great news, I hope these models can be fine-
| tuned to be like lighter versions of chatGPT. But I remember
| reading in the LLaMA paper that a small model can still improve
| when trained more than the Chinchilla budget.
|
| > For instance, although Hoffmann et al. (2022) recommends
| training a 10B model on 200B tokens, we find that the performance
| of a 7B model continues to improve even after 1T tokens.
|
| Cerebras says:
|
| > For instance, training a small model with too much data results
| in diminishing returns and less accuracy gains per FLOP
|
| But this is only of concern when you care about the training
| cost, such as when you are a budget-limited researcher or a
| company that doesn't deploy models at scale. But when you care
| about the total cost of deployment, then making a small model
| even better with lots of data is a smart move. In the end it
| matters more to have the most efficient model at prediction, not
| the most efficient model in training.
| skybrian wrote:
| What's in the Pile training data they used? How much source code
| does it include?
| Garcia98 wrote:
| I've been following open source LLMs for a while and at first
| glance this doesn't seem too powerful compared to other open
| models; Flan-Alpaca[0] is licensed under Apache 2.0, and it seems
| to perform much better. Although I'm not sure about the
| legalities of that licensing, since it's basically Flan-T5
| fine-tuned using the Alpaca dataset (which is under a
| non-commercial license).
|
| Nonetheless, it's exciting to see all these open models popping
| up, and I hope that a LLM equivalent to Stable Diffusion comes
| sooner than later.
|
| [0]: https://github.com/declare-lab/flan-alpaca
| ftxbro wrote:
| Their goal isn't to make a powerful model. It's to show how
| well compute-optimal models do on test-loss as a function of
| increasing model size. This function can be used with some
| caveats to forecast the test-loss of larger models for which
| compute-optimality becomes more important.
| alchemist1e9 wrote:
| Sounds like you might be the right person to ask the "big"
| question.
|
| For a small organization or individual who is technically
| competent and wants to try and do self-hosted inference.
|
| What open model is showing the most promise, and how do its
| results compare to the various OpenAI GPTs?
|
| A simple example problem would be asking for a summary of code.
| I've found OpenAI's GPT 3.5 and 4 to give pretty impressive
| English descriptions of code. Running that locally in batch
| would retain privacy, and even if slow it could just be kept
| running.
| Garcia98 wrote:
| Google's Flan-T5, Flan-UL2 and derivatives, are so far the
| most promising open (including commercial use) models that I
| have tried, however they are very "general purpose" and don't
| perform well in specific tasks like code understanding or
| generation. You could fine-tune Flan-T5 with a dataset that
| suits your specific task and get much better results, as
| shown by Flan-Alpaca.
|
| Sadly, there's no open model yet that acts like a Swiss knife
| and gets good-enough results for multiple use cases.
| brucethemoose2 wrote:
| FYI: Cerebras's nodes are _very_ different than your typical
| Nvidia training nodes:
|
| https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
|
| Each individual "chip" has _40GB of SRAM_ vs ~76MB for the Nvidia
| H100, and networked pools of external RAM, SSDs and such. That's
| why the training architecture is so different.
| arbuge wrote:
| https://www.cerebras.net/product-chip/
|
| There's a comparison picture there of one of their chips
| alongside a regular GPU chip. Effectively they use up the
| entire wafer.
| brucethemoose2 wrote:
| Yeah, and that doesn't even do the nutty IO on these things
| justice.
|
| A 16x CS2 cluster like they describe is like a huge Nvidia
| cluster in terms of throughput, but more like a single Nvidia
| node structurally.
| ipsum2 wrote:
| Cerebras makes impressive hardware, but Nvidia still
| performs better in every regard. The most telling factor is
| that Cerebras claims they're too busy to run common
| benchmarks (e.g. MLPerf) to compare against Nvidia.
| alchemist1e9 wrote:
| It's unbelievable stuff. Does anyone know how much a single
| box costs? They are selling them it looks like.
| freeqaz wrote:
| If you have to ask, you can't afford it!
|
| Mostly teasing but my guess would be $500k+ since they'll
| likely price it so that it is the same $ as the
| equivalent NVIDIA cluster (or very close to it).
| alchemist1e9 wrote:
| Actually if they are around $2M looks like my company can
| afford one. Given this is just getting started it looks
| promising as I'm sure future generations will be more
| affordable.
| ericd wrote:
| On the order of several million USD for the second gen
| system. Last I heard, they're still at lowish volumes,
| selling some to national labs and the like.
| sbierwagen wrote:
| CS-1 costs "$2-3 million", CS-2 costs "several" million.
|
| A single Nvidia H100 costs somewhere around $30,000 each,
| so a GPU server with every slot populated costs about
| $300,000.
| brucethemoose2 wrote:
| ServeTheHome claims "HGX A100 platforms, when they are
| sold as single servers are generally in the $130K-$180K
| even leaving a very healthy margin for OEMs/ resellers"
|
| https://www.servethehome.com/graphcore-celebrates-a-
| stunning...
|
| Not sure about the H100, but it seems to be more supply
| constrained (hence pricier) atm.
|
| Now, the real question is how many HGX nodes "equals" a
| single CS2 node. The math here is extremely fuzzy, as the
| benefit to such extreme node consolidation depends on the
| workload, and the CS-2 takes up less space, but the HGX
| cluster will have more directly accessible RAM and better
| turnkey support for stuff since its Nvidia.
| bubblethink wrote:
| There is cloud pricing on the website.
| https://www.cerebras.net/product-cloud/
| alchemist1e9 wrote:
| This is actually really important from my perspective. It
| looks like an end user can work backwards from available
| inference hardware, or inference budget, and required
| speed, then figure out a viable model size. Bring their
| own data and then fine-tune or train from scratch.
|
| This is getting so real so fast.
| tombert wrote:
| Has anyone tried this? I have 96GB of GPU memory; will that be
| enough to run the biggest model?
| cuuupid wrote:
| 13B fits nicely even in a 3090 (24gb vram)!
| Filligree wrote:
| Only if you can use int8, which may or may not be convenient.
| lxe wrote:
| with load_in_8bits ?
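|
| (For reference, the transformers flag is load_in_8bit; a minimal
| sketch, assuming bitsandbytes and a CUDA GPU are available:)
|
|     from transformers import AutoModelForCausalLM
|
|     # 8-bit weights roughly halve memory vs fp16: ~13 GB of
|     # weights for a 13B model, which fits in a 24 GB card.
|     model = AutoModelForCausalLM.from_pretrained(
|         "cerebras/Cerebras-GPT-13B",
|         device_map="auto",     # needs accelerate
|         load_in_8bit=True,     # needs bitsandbytes
|     )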
| spi wrote:
| I have not tried, but 96GB of GPU memory is plenty, for
| inference there should certainly be no issue. Their biggest
| model has 13B parameters, you should be able to run inference
| (float16) already with 32GB of memory.
|
| With 96GB of memory you should also be able to fine-tune it
| (possibly some tricks like gradient accumulation and/or
| checkpointing might be needed), but you have to be ready for
| many days of computation...
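|
| The inference figure is just ~2 bytes per parameter in float16
| (a sketch; it ignores activations and framework overhead):
|
|     params = 13e9
|     bytes_per_param = 2  # float16
|     print(params * bytes_per_param / 2**30, "GiB")  # ~24 GiB of weights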
| alchemist1e9 wrote:
| > but you have to be ready for many days of computation...
|
| I was thinking since we have API prices in tokens and now it
| looks like self hosted inference on high end GPUs for similar
| models. Then based on electricity prices there will be a
| self-hosted prices in tokens. Then how close are these
| already? What is the markup today from roughly the raw
| electricity cost that OpenAI has.
| wg0 wrote:
| Noob to ML in practice. These models containing weights, all of
| them, do they have a standard file/binary format?
| examplary_cable wrote:
| [I'm not an expert] but I believe .ckpt and .safetensors.
|
| The problem with .ckpt is that it can execute arbitrary code on
| your machine (very unsafe). .safetensors, on the other hand, was
| made by Hugging Face to provide a safe format for storing the
| weights. I've also seen people load up the LLaMA 7B via a .bin
| file.
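|
| A small sketch of the difference (the filenames here are
| placeholders):
|
|     import torch
|     from safetensors.torch import load_file
|
|     # .safetensors: plain tensor storage, no code execution
|     safe_weights = load_file("model.safetensors")
|
|     # pickle-based checkpoints can run arbitrary code on load,
|     # so only torch.load() files you trust
|     risky_weights = torch.load("model.ckpt", map_location="cpu")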
| eldenring wrote:
| > Trained using the Chinchilla formula, these models provide the
| highest accuracy for a given compute budget.
|
| I'm confused as to why 111 million parameter models are trained
| with the Chinchilla formula. Why not scale up the training data?
| If you're training smaller models, surely optimizing performance
| is better than optimizing total compute.
|
| Seems like a silly misunderstanding of the Chinchilla paper, but
| I'm sure I'm missing something
| gamegoblin wrote:
| True. There was a good blog post published about this a few
| weeks ago: https://finbarr.ca/llms-not-trained-enough/
|
| Money quote for those who don't want to read the whole thing:
|
| '''
|
| When people talk about training a Chinchilla-optimal model,
| this is what they mean: training a model that matches their
| estimates for optimality. They estimated the optimal model size
| for a given compute budget, and the optimal number of training
| tokens for a given compute budget.
|
| However, when we talk about "optimal" here, what is meant is
| "what is the cheapest way to obtain a given loss level, in
| FLOPS." In practice though, we don't care about the answer!
| This is exactly the answer you care about if you're a
| researcher at DeepMind/FAIR/AWS who is training a model with
| the goal of reaching the new SOTA so you can publish a paper
| and get promoted. If you're training a model with the goal of
| actually deploying it, the training cost is going to be
| dominated by the inference cost. This has two implications:
|
| 1) there is a strong incentive to train smaller models which
| fit on single GPUs
|
| 2) we're fine trading off training time efficiency for
| inference time efficiency (probably to a ridiculous extent).
|
| Chinchilla implicitly assumes that the majority of the total
| cost of ownership (TCO) for a LLM is the training cost. In
| practice, this is only the case if you're a researcher at a
| research lab who doesn't support products (e.g. FAIR/Google
| Brain/DeepMind/MSR). For almost everyone else, the amount of
| resources spent on inference will dwarf the amount of resources
| spent during training.
|
| '''
| haldujai wrote:
| While true, I think this also misses that "for almost everyone
| else" you're probably not (or at least should not be) trying
| to optimize zero-shot performance if you have an intended
| high-inference use case, so I don't think Chinchilla would be
| all that relevant.
| vintermann wrote:
| I have a suspicion that good zero-shot performance is a
| good starting point for fine-tuning. If you have more than
| one intended high inference use case, or can imagine a
| couple of new ones on the horizon, it might still be best
| to not target the first use case directly.
| haldujai wrote:
| You're not wrong; the Chinchilla rationale is that it may be
| more compute-efficient to obtain a given loss using larger
| model sizes if the budget allows. As another commenter states,
| this ignores the inference part of the equation.
|
| As an example the BERT/RoBERTa family were trained for much
| longer than Chinchilla, you do get diminishing returns though.
|
| There is a point of overtraining where downstream performance
| is impacted but that's pretty high.
|
| I think part of the answer to this is also that xxx million
| parameter decoder-only models don't seem to be that useful so
| it may not be worthwhile to optimize them for performance?
| ftxbro wrote:
| The point of those smaller models is for the "Cerebras Scaling
| Law for Compute-Optimal Training" which is the straight line
| plot in the image at the top of their webpage when you click
| the link.
|
| They want you to think it's reasonable that because the line is
| so straight (on a flops log scale) for so long, it could be
| tempting to extrapolate the pile-loss consequences of
| continuing compute-optimal training for larger models beyond
| their largest 13B one, with the obvious caveat that the
| extrapolation can't continue linearly much further if for no
| other reason than the test loss isn't going to go below zero
| (it will flatten out sooner than that).
|
| If you trained beyond compute-optimality on smaller models, it
| would mess up their straight line and make it look like we are
| sooner hitting diminishing returns on test loss.
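|
| The extrapolation itself is just a straight-line fit in log-log
| space (an illustrative sketch only; the numbers below are made-up
| placeholders, not Cerebras's measurements):
|
|     import numpy as np
|
|     # fit a power law loss ~= a * C^b to (compute, loss) points
|     compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # placeholder
|     loss = np.array([3.5, 3.0, 2.6, 2.3, 2.05])         # placeholder
|
|     b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
|     predict = lambda c: np.exp(log_a) * c ** b
|     print(predict(1e23))  # real curves flatten out eventually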
| simonw wrote:
| "Cerebras open sources seven GPT-3 models from 111 million to 13
| billion parameters."
|
| I don't understand why they describe them as GPT-3 models here as
| opposed to calling them GPT models. Or even LLMs - but I guess
| that acronym isn't as widely recognized.
| wsgeorge wrote:
| I think GPT-3 is used as a benchmark for performance, so saying
| a model is on par with GPT-3 should give you an idea of what
| you can get out of it.
|
| IIRC most open source models to date - including the semi-open
| LLaMAs - have GPT-3-like performance. Nothing gets close to
| GPT-3.5 and beyond.
| patientplatypus wrote:
| [dead]
___________________________________________________________________
(page generated 2023-03-28 23:00 UTC)