[HN Gopher] Numbers every LLM developer should know
___________________________________________________________________
Numbers every LLM developer should know
Author : richardliaw
Score : 227 points
Date : 2023-05-17 17:50 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| YetAnotherNick wrote:
| > ~$1 million: Cost to train a 13 billion parameter model on 1.4
| trillion tokens
|
| The LLaMA paper mentioned 135,168 A100-hours for training the
| 13 billion parameter model on 1 trillion tokens, which means
| ~$150k on Lambda Labs on-demand instances.
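|
| Back-of-the-envelope (a rough sketch; the ~$1.10/hr on-demand
| A100 40GB rate is an assumption based on prices quoted
| elsewhere in this thread):
|
|     # Training cost from the LLaMA paper's reported GPU-hours.
|     gpu_hours = 135_168      # A100-hours, LLaMA-13B, 1T tokens
|     usd_per_gpu_hour = 1.10  # assumed on-demand A100 40GB rate
|     print(f"~${gpu_hours * usd_per_gpu_hour:,.0f}")  # ~$148,685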
| waleedk wrote:
| [Author] Good luck trying to use clusters of Lambda machines.
| Lambda Labs is cheap for a reason: their API is not very
| featureful (we looked at them and saw they didn't even
| support machine tagging). If you're looking for a box or two,
| Lambda Labs is fine. If you're looking for 1,000, not so much.
|
| Plus they don't actually have any A100s available at the
| moment (2023-05-17).
|
| CoreWeave is a nice middle ground. You can at least get the
| A100 machines into a k8s cluster.
| curiousgal wrote:
| > _LLM Developer_
|
| This is the fastest I've rolled my eyes in a long time!
| ryanklee wrote:
| The amount of get-off-my-lawn grognardness that LLM activity
| inspires is really ridiculous.
|
| I really would ask you to take a second look at the spirit of
| your comment and think carefully about how much you really
| understand about the work being done on top of LLMs and if it
| justifies this kind of response.
| astrea wrote:
| I had the same reaction as the OP. I'm not a data scientist
| by trade or title, but I would personally be a little
| offended. If you designed the Porsche 911, would you not be
| offended by the shade tree mechanic who simply knows how to
| change the oil calling himself a Porsche designer/engineer?
| RyanCavanaugh wrote:
| Context matters. Is a "web developer" someone who makes web
| pages, or works on a browser rendering engine?
| ryanklee wrote:
| There are people making applications based on LLMs. You may
| quibble with the term LLM Developer, but to sneer or roll
| your eyes at it as if it were prima facie inaccurate or
| laughable is unjustified.
| cwkoss wrote:
| Are there any open source host-your-own LLMs that have licensing
| that allows for commercial use?
| Der_Einzige wrote:
| Dolly from Databricks is one at least
| waleedk wrote:
| [Author] TL;DR: open source LLMs are coming.
|
| Dolly's not that great -- I've hit lots of issues using it, to
| be honest.
|
| MosaicML has a nice commercially usable model here:
| https://www.mosaicml.com/blog/mpt-7b
|
| I think they're one of the leading ones (bias: they're kinda
| competitors to my employer Anyscale, but you gotta say
| something's good when it is).
|
| Red Pajama are leading an effort to build a fully open source
| model similar to LLaMa.
| https://www.together.xyz/blog/redpajama
| int_19h wrote:
| https://github.com/BlinkDL/RWKV-LM
| elorant wrote:
| Vicuna-13b is on Apache License 2.0.
| twbarr wrote:
| Vicuna is a delta model that you have to apply on top of
| LLaMA.
| throwaway888abc wrote:
| Excellent! Thank you so much for making/posting this
| waleedk wrote:
| [Author] You're welcome -- glad it was useful!
| MacsHeadroom wrote:
| > Of course there are efforts to reduce this, notably llama.cpp
| which runs a 13 billion parameter model on a 6GB GPU by
| quantizing aggressively down to 4 bits (and 8 bits without too
| much impact), but that's atypical.
|
| No, 4bit quantization is the typical case.
|
| At 4bit you can fit twice the parameters of 8bit in the same
| space for far better performance/perplexity/quality.
|
| Running LLMs at higher than 4bit is atypical and almost always
| sub-optimal: for the same memory, a model twice the size at
| 4bit beats the smaller model at 8bit.
|
| Even pretraining and finetuning in 4bit is likely to become
| the norm soon, as fp4 becomes better understood.
| moffkalast wrote:
| > llama.cpp which runs a 13 billion parameter model on a 6GB
| GPU
|
| I think that's a typo there too: the 13B model needs about
| 10 GB of memory at 4 bits; it's the 7B one that fits into
| 6 GB. Well, unless you do the split thing with some layers on
| the CPU, I guess.
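|
| For a quick sanity check on these sizes (a rough sketch; the
| ~20% overhead factor for activations/KV cache is an assumption
| and varies with context length):
|
|     def approx_vram_gb(params_billion, bits, overhead=1.2):
|         # weights only, plus an assumed ~20% fudge factor
|         return params_billion * bits / 8 * overhead
|
|     print(approx_vram_gb(7, 4))    # ~4.2 GB, fits a 6 GB card
|     print(approx_vram_gb(13, 4))   # ~7.8 GB, wants 8-10 GB
|     print(approx_vram_gb(13, 16))  # ~31 GB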
| Der_Einzige wrote:
| No it isn't, quantization is not free. When you quantize to
| that level, you lose a significant amount of performance that
| automated benchmarks don't measure properly.
|
| You can see it in real time when you take most LLMs and compare
| them at different quantization levels. I can see the
| degradation even in the largest llama quite badly even at 8
| bits.
| astrange wrote:
| If you take a model and quantize it, it's obviously going to
| get worse, but what if you train it again after that?
| MacsHeadroom wrote:
| Quantization is not free, but VRAM is even less free.
|
| If you have X amount of VRAM and can fit a 16bit model of
| size 2X in 8bit or a model of size 4X in 4bit then the 4X
| model in 4bit is ALWAYS superior with lower perplexity and
| better performance.
|
| You LOSE performance by using a smaller model in 8bit vs a
| larger model in 4bit.
| waleedk wrote:
| [Author] Completely disagree. Any analysis shows a perplexity
| increase at 4 bits. Have a look at llama.cpp's results here:
|
| https://github.com/ggerganov/llama.cpp#quantization
|
| 4 bit has a perplexity score 0.13 or so higher.
| MacsHeadroom wrote:
| You're just wrong. You're looking at the wrong numbers. The
| perplexity score of a model with twice the parameters in half
| the bits (4bit) is FAR LOWER (ie better).
|
| If you are limited to X RAM and have two 16bit models of size
| 4X and 2X then the 4X model in 4bit will always be far
| superior to the 2X model in 8bit, with far lower perplexity.
|
| Compare 13B's 4bit perplexity of 5.3607 to 7B's 8bit
| perplexity of 5.9069. That is over 0.54 lower perplexity for
| the same RAM amount by using 4bit! That is MASSIVE!
| Taek wrote:
| There's also research showing that the perplexity degradation
| is smaller at higher parameter counts. E.g. going from 16-bit
| to 4-bit barely has any impact at all on a 65B parameter
| model.
| mmoskal wrote:
| Well, if you have a fixed RAM size, you're better off with
| the largest model you can fit at 4 bits (13B at 4 bits is way
| better than 7B at 16 bits despite taking less than half the
| memory).
| kherud wrote:
| Can somebody please explain how quantization below 8 bit works?
| Since a byte is the smallest addressable unit I think, is the
| dimensionality of the weights somehow reduced?
| f_devd wrote:
| I believe it's locally (inner-loop or simd op) up-cast to
| float8/float16/int8, but I haven't looked at the internals of
| llama.cpp myself
| waleedk wrote:
| [Author] You approximate the weights using fewer bits. You
| also switch to ints instead of floats and then do some fancy
| stuff when multiplying to make it all work together.
|
| More detail than you probably wanted:
| https://huggingface.co/blog/hf-bitsandbytes-integration
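|
| A toy illustration of the packing trick (not llama.cpp's
| actual scheme, which uses per-block scales; this just shows
| how two 4-bit values share one byte):
|
|     import numpy as np
|
|     def quantize_4bit(w):
|         # Map weights onto integers 0..15, then pack two 4-bit
|         # values into each byte. Real schemes (llama.cpp, GPTQ)
|         # use per-block scales; this only shows the idea.
|         scale = (w.max() - w.min()) / 15
|         q = np.round((w - w.min()) / scale).astype(np.uint8)
|         packed = ((q[0::2] << 4) | q[1::2]).astype(np.uint8)
|         return packed, scale, w.min()
|
|     def dequantize_4bit(packed, scale, zero):
|         q = np.empty(packed.size * 2, dtype=np.uint8)
|         q[0::2], q[1::2] = packed >> 4, packed & 0x0F
|         return q.astype(np.float32) * scale + zero
|
|     w = np.random.randn(8).astype(np.float32)
|     packed, scale, zero = quantize_4bit(w)
|     print(dequantize_4bit(packed, scale, zero) - w)  # small error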
| MacsHeadroom wrote:
| The latest release of bitsandbytes uses a new fp4 format.
| 4bit floating point scaling results in much lower
| perplexity than int4.
|
| Also note that for a fixed memory (RAM) size, 4bit (even
| int4) is always superior, resulting in lower perplexity
| than 8bit.
|
| E.g. LLaMA-13B int4 is far better/lower perplexity than
| LLaMA-7B fp8 while using the same amount of RAM.
| contravariant wrote:
| How come the token to word ratio is smaller than 1 if tokens are
| either words or part of words? Shouldn't you expect _more_ tokens
| than words?
| waleedk wrote:
| [Author] Fair point -- I clarified the language and gave a
| concrete example. Hope that helps!
| yonixw wrote:
| That is how I understood it: a token is on average 3/4 of a
| word ("token to word"). So if you buy 1000 tokens you
| effectively get about 750 words.
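|
| Easy to check empirically (a sketch assuming the tiktoken
| package; the sample sentence is arbitrary):
|
|     import tiktoken  # pip install tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4
|     text = "Numbers every LLM developer should know."
|     n_tokens = len(enc.encode(text))
|     n_words = len(text.split())
|     print(n_words / n_tokens)  # typically ~0.75 for English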
| [deleted]
| renewiltord wrote:
| It's the token to word multiplier, yeah. i.e. x tokens = 0.75x
| words.
| furyofantares wrote:
| I think all the ratios given are x:1 and they tell you x.
| qeternity wrote:
| It's the other way around.
|
| 1 GPT-4 token costs about as much as 50 GPT-3.5 tokens.
|
| 1 token is equivalent to 0.75 words.
| contravariant wrote:
| That would make it 0.75 tokens to 1 word right?
| ramesh1994 wrote:
| I think parts of the write-up are great.
|
| There are some unique assumptions being made in parts of the gist
|
| > 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding
|
| > 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries
|
| I don't know how useful these numbers are if you take away the
| assumptions that self-hosted will work as well as API.
|
| > 10x: Throughput improvement from batching LLM requests
|
| I see that the write-up mentions memory being a caveat to
| this, but it also depends on the card specs. The memory
| bandwidth and TFLOPs offered by, say, a 4090 are superior
| while it has the same amount of VRAM as a 3090. The caveat
| about token length mentioned in the gist itself makes the 10x
| claim not a very useful rule of thumb.
| ramesh1994 wrote:
| > This means it is way cheaper to look something up in a vector
| store than to ask an LLM to generate it. E.g. "What is the
| capital of Delaware?" when looked up in a neural information
| retrieval system costs about 5x less than if you asked
| GPT-3.5-Turbo. The cost difference compared to GPT-4 is a
| whopping 250x!
|
| That only holds in the narrow use case of a strict look-up.
| It seems to exaggerate the cost difference, and the two
| approaches have completely different trade-offs.
| abetlen wrote:
| I would add the following two numbers if you're generating
| realtime text or speech for human consumption:
|
| - Human Reading Speed (English): ~250 words per minute
|
| - Human Speaking Speed (English): ~150 words per minute
|
| Should be treated like the Doherty Threshold [1] for generative
| content.
|
| [1] https://lawsofux.com/doherty-threshold/
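|
| In token terms (a rough sketch using the ~0.75 words-per-token
| rule of thumb from the gist):
|
|     words_per_token = 0.75  # rule of thumb from the gist
|     for label, wpm in [("reading", 250), ("speaking", 150)]:
|         tok_per_sec = wpm / words_per_token / 60
|         print(f"{label}: ~{tok_per_sec:.1f} tokens/sec")
|     # reading: ~5.6 tokens/sec, speaking: ~3.3 tokens/sec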
| armchairhacker wrote:
| But I'd say LLMs produce content faster than I can read or
| write it, because they can produce content which is really
| dense.
|
| Ask GPT-4 a question and then answer it yourself. Maybe your
| answer will be as good or better than GPT-4's but GPT-4 writes
| its answer a lot faster.
| Flux159 wrote:
| I think it would be helpful to add fine-tuning costs for an
| open source model (think LLaMA to Alpaca).
|
| From the phrasing around fine-tuning right now, it seems like
| it's using OpenAI's fine-tuning API to determine that cost,
| but it's not very clear.
|
| Also this would be helpful for other foundation models if that
| doesn't already exist - how much VRAM to run Stable Diffusion
| v2.1 at different resolutions, running Whisper or Bark for audio,
| etc.
| sebzim4500 wrote:
| They mention that they could finetune a 6B model for $7.
| Obviously the number depends on the amount of data and the
| model size but it's probably not going to be a significant
| expense in practice.
| PoignardAzur wrote:
| > _~$1 million: Cost to train a 13 billion parameter model on 1.4
| trillion tokens_
|
| MosaicML claims they trained a 7 billion parameter model on 1
| trillion tokens with a budget of $200k.
|
| https://www.mosaicml.com/blog/mpt-7b
|
| Does training cost scale linearly with model size and token
| count? If so, that suggests a lower bound of $600k to train the
| 13 billion params model. (Still roughly the same magnitude)
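|
| For scale, a common approximation is training FLOPs ~= 6 x
| parameters x tokens, which is linear in both. A rough sketch
| (the utilization and $/GPU-hour figures are assumptions):
|
|     params = 13e9                # 13B parameters
|     tokens = 1.4e12              # 1.4T training tokens
|     flops = 6 * params * tokens  # ~1.1e23 FLOPs
|
|     a100_peak = 312e12           # A100 bf16 peak FLOP/s
|     mfu = 0.45                   # assumed utilization
|     gpu_hours = flops / (a100_peak * mfu) / 3600
|     print(f"{gpu_hours:,.0f} A100-hours")  # ~216,000
|     # ~$320k at $1.50/hr, ~$860k at ~$4/hr on-demand rates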
| waleedk wrote:
| [Author] Mosaic must be getting some kind of sweetheart deal
| on A100 80GB and A100 40GB. The prices they quote are nowhere
| near, say, the AWS on-demand prices: $2 per GPU for the A100
| 40GB and $2.50 for the A100 80GB. That's literally half the
| AWS on-demand rate for A100s here:
| https://aws.amazon.com/ec2/instance-types/p4/
|
| And these are impossible to get. We tried to get some for
| Anyscale and were told there was no on-demand capacity
| available, and the lead time for reserved (ouch on the price!
| You're talking a quarter of a million dollars a year for one
| machine at list) was weeks.
|
| Once you take the model size and hefty sweetheart deals into
| account, you're within 10%. Mosaic does have some nice whitebox
| optimizations, but nothing that radically changes the equation.
| fpgaminer wrote:
| A100-40GB is like $1.10 on LambdaLabs, on demand. Their
| availability is horrific on singles, but I've seen 8x
| instances pop up more often than not. And you can rent A100s
| for a buck a pop interruptible from other clouds, plenty of
| availability. $2 doesn't seem like much of a sweetheart deal.
| born-jre wrote:
| RANDOM THOUGHT:
|
| i wonder when we are getting Docker for LLMs ... a Modelfile?
|
| FROM "PAAMA/16b"
|
| APPLY "MNO/DATASET"
|
| each layer could be a LoRA-adapter-like thing, maybe.
|
| maybe when AI chips are finally here.
| jjtheblunt wrote:
| PyTorch tutorial looks similar (lower on the page)
|
| https://pytorch.org/tutorials/beginner/pytorch_with_examples...
| kristjansson wrote:
| SQLFlow[0] looks sort of like that:
|
|     SELECT *
|     FROM iris.train
|     TO TRAIN DNNClassifier
|     WITH model.hidden_units = [10, 10],
|          model.n_classes = 3,
|          train.epoch = 10
|     COLUMN sepal_length, sepal_width, petal_length, petal_width
|     LABEL class
|     INTO sqlflow_models.my_dnn_model;
|
| No idea how well it works.
|
| [0]: https://sql-machine-learning.github.io/
| jncraton wrote:
| > There's usually no need to go beyond 16-bit accuracy, and most
| of the time when you go to 8-bit accuracy there is too much loss
| of resolution.
|
| I'm not sure this is accurate. From what I have seen, 8-bit
| quantization is usually fine, and even 4-bit is a viable
| tradeoff. Here are some benchmarks from TextSynth showing no
| significant degradation between 16 and 8 bit:
|
| https://textsynth.com/technology.html
|
| 8-bit uses half as much memory and doubles the throughput for
| limited quality loss.
| superkuh wrote:
| It's true if you're doing training. But for inference, severe
| quantization is mostly okay. Even then, there are some
| internal parts of a transformer running a quantized model
| where you may want to up-cast the low-bit inputs and do the
| calculation in 16 bits, like the dot-product similarity
| between vectors.
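|
| A minimal sketch of that pattern (int8 storage, float math;
| the symmetric per-tensor scale is a simplification):
|
|     import numpy as np
|
|     def quantize_int8(x):
|         scale = np.abs(x).max() / 127
|         return np.round(x / scale).astype(np.int8), scale
|
|     a, b = np.random.randn(4096), np.random.randn(4096)
|     (qa, sa), (qb, sb) = quantize_int8(a), quantize_int8(b)
|
|     # weights live in int8; the dot product is done up-cast
|     qa32, qb32 = qa.astype(np.float32), qb.astype(np.float32)
|     print((qa32 @ qb32) * sa * sb, a @ b)  # small error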
| Jackson__ wrote:
| Even that is being tackled by newer GPU architectures. For
| example, novelai is currently training an LLM in fp8
| precision, using H100 GPUs.[1]
|
| [1] https://blog.novelai.net/anlatan-acquires-hgx-h100-cluster-4...
|
| https://blog.novelai.net/text-model-progress-is-going-good-8...
| superkuh wrote:
| Cool stuff. I looked at
| https://en.wikipedia.org/wiki/Hopper_%28microarchitecture%29
| and noticed that the fp8 support is only for the tensor cores
| and not the CUDA side. Does that mean training with an H100
| GPU in fp8 mode would use some software ecosystem that's not
| the vast existing CUDA one? Or am I just misunderstanding
| CUDA cores vs tensor cores?
|
| PS, as a joke, they should implement GPU fluint8 and get
| baked in non-linearity for the activation function without
| even using a non-linear function,
| https://www.youtube.com/watch?v=Ae9EKCyI1xU ("GradIEEEnt
| half decent: The hidden power of imprecise lines" by
| suckerpinch)
| qeternity wrote:
| The problem with 8bit at the moment is massive performance
| degradation with bitsandbytes. Recent improvements in 4bit
| inference mean that 8bit is now a massive laggard (although
| there's no reason not to expect this to resolve).
| f_devd wrote:
| The article is right: 8-bit (and especially 4-bit) is atypical
| for deep learning models. How well it works depends heavily on
| the number of parameters (larger models can handle more
| quantization) and can even depend on specific training
| hyperparameters (mainly dropout & weight decay, which can
| induce sparsity).
| int_19h wrote:
| Thing is, even when the impact from 4-bit is substantial, the
| larger parameter count it allows on the same hardware more
| than makes up for it. E.g. llama-30b is better at 4-bit than
| _any_ derivative of llama-13b, no matter how fine-tuned or
| quantized.
| waleedk wrote:
| [Author] Fair point. Adjusted the language.
|
| Nonetheless people do tend to use 16-bit Hugging Face models,
| and if you do go to 8 bits and it's wrong, you're never quite
| sure if it's the quantization or the model.
| fzliu wrote:
| AFAIK for over-parameterized models, performing quantization or
| any other form of compression won't reduce accuracy by much
| (don't quote me on this though).
___________________________________________________________________
(page generated 2023-05-17 23:00 UTC)