[HN Gopher] Fast Llama 2 on CPUs with Sparse Fine-Tuning and Dee...
___________________________________________________________________
Fast Llama 2 on CPUs with Sparse Fine-Tuning and DeepSparse
Author : mwitiderrick
Score : 209 points
Date : 2023-11-23 04:44 UTC (18 hours ago)
(HTM) web link (neuralmagic.com)
(TXT) w3m dump (neuralmagic.com)
| tarruda wrote:
| Seems promising. Do they say anywhere how many tokens/second are
| achieved on the CPU?
| littlestymaar wrote:
| They do. It's on the third graph: https://neuralmagic.com/wp-
| content/uploads/2023/11/CHART-Lla...
| RossBencina wrote:
| Interesting company. Yannic Kilcher interviewed Nir Shavit last
| year and they went into some depth:
| https://www.youtube.com/watch?v=0PAiQ1jTN5k
|
| DeepSparse is on GitHub:
| https://github.com/neuralmagic/deepsparse
| anonymousDan wrote:
| Nir Shavit is co-author of IMO the best book on concurrent
| programming: https://dl.acm.org/doi/book/10.5555/2385452
| mark_l_watson wrote:
| Thanks for the GitHub link, I want to try this on an M2 with 32GB
| of memory. It will "waste" the GPU and neural cores, but it might
| still run fast.
|
| Off topic: I so much appreciate everyone's work getting LLMs
| running on inexpensive hardware! I have been having crazy
| amounts of fun with Ollama and a wide variety of tuned models.
| And, so fast!
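
For anyone who, like the commenter above, wants to try the GitHub project,
here is a minimal sketch of running a sparse, quantized Llama 2 through
DeepSparse's text-generation pipeline. The SparseZoo model stub, the prompt,
and the argument names are assumptions for illustration only; check the
repository README for the exact pipeline entry point and current stubs.

    # Minimal sketch (pip install "deepsparse[llm]"). The model stub and the
    # argument names below are placeholders/assumptions, not verified values.
    from deepsparse import TextGeneration

    pipeline = TextGeneration(
        model="zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"  # placeholder stub
    )
    result = pipeline(prompt="Natalia sold clips to 48 of her friends...",
                      max_new_tokens=128)
    print(result.generations[0].text)
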
| arkmm wrote:
| I might be missing something but it seems like most of the speed
| up is from quantization which is commonly used already, and the
| CPU instance used here isn't that much cheaper (~10-15%?) than a
| GPU instance that could run the model. For high utilization
| workloads the extra throughput might be useful though.
| yunohn wrote:
| I think you are missing something, or I am.
|
| If you look at the performance comparison graph (1), the jump
| from 2.8->9 tok/s comes from quantization, but the remaining
| jumps from 9->16.6->24.6 tok/s come from the sparse fine-
| tuning.
|
| (1) https://neuralmagic.com/wp-content/uploads/2023/11/CHART-
| Lla...
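
Working the multipliers out from the numbers quoted above (only the tok/s
figures read off the chart are used; the labels for the two sparse points are
inferred from the comment, not from the article itself):

    # Relative speedups implied by the chart figures quoted in the comment above.
    dense = 2.8        # tok/s, dense baseline
    quantized = 9.0    # tok/s, after quantization
    sparse_a = 16.6    # tok/s, first sparse + quantized point on the chart
    sparse_b = 24.6    # tok/s, second sparse + quantized point on the chart

    print(f"quantization alone:       {quantized / dense:.1f}x")     # ~3.2x
    print(f"first sparse point:       {sparse_a / quantized:.1f}x")  # ~1.8x
    print(f"sparsity on top of that:  {sparse_b / quantized:.1f}x")  # ~2.7x
    print(f"end to end:               {sparse_b / dense:.1f}x")      # ~8.8x
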
| jonatron wrote:
| Hetzner do cheap servers, but don't have many GPUs. If you're
| looking at saving money by running on CPU, you shouldn't be
| looking at one of the most expensive server providers.
| thelastparadise wrote:
| How is the $ per inference, say 4k tokens, on a Hetzner box
| vs an A100?
|
| Not all tasks require low latency.
| jonatron wrote:
| I'd like to know the answer to that too. I shouldn't have
| implied that CPU inference on Hetzner is cheaper than GPU
| on AWS, when I don't have any idea on the cost of either.
| mikeravkine wrote:
| Hetzner offers incredibly cheap ARM machines in the
| Falkenstein DC: for 25 EUR a month you can snag the top of
| the line with 16 vCPUs and 32GB of RAM.
|
| If your use case fits inside that 32GB (no 70B models,
| sadly), the price-to-performance of a GGUF Q4_K_M model is
| really attractive on this setup.
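
A back-of-the-envelope way to answer the cost-per-token question above, using
only the 25 EUR/month price quoted here and the article's ~24.6 tok/s (that
throughput was measured on the article's CPU instance, not this ARM box, so
treat it as optimistic; the A100 side is deliberately left as placeholders):

    # Rough EUR per 1M generated tokens for a dedicated box running flat out.
    SECONDS_PER_MONTH = 30 * 24 * 3600

    def cost_per_million_tokens(eur_per_month: float, tokens_per_second: float) -> float:
        tokens_per_month = tokens_per_second * SECONDS_PER_MONTH
        return eur_per_month / tokens_per_month * 1_000_000

    print(cost_per_million_tokens(25.0, 24.6))   # ~0.39 EUR per 1M tokens

    # For the A100 comparison, plug in an hourly rental price * ~730 h/month and
    # a measured batched throughput; neither is assumed here.
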
| londons_explore wrote:
| With two or three instances, you can probably fit a 70B
| model into RAM, and you don't need super low latency
| between machines to do inference split layer-wise across
| them.
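
The rough memory arithmetic behind that, assuming ~4-bit weights; the KV-cache
figure is a placeholder that grows with context length and batch size:

    # Does a layer-wise split of a 70B model fit across two 32 GB boxes?
    params_billion = 70
    bytes_per_param = 0.5      # ~4-bit quantization, ignoring format overhead
    weights_gb = params_billion * bytes_per_param        # ~35 GB of weights
    kv_cache_gb = 3.0          # placeholder; depends on context length and batch

    per_box_gb = (weights_gb + kv_cache_gb) / 2          # split evenly by layer
    print(f"{per_box_gb:.1f} GB per box")                # ~19 GB, fits in 32 GB
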
| leobg wrote:
| Any way to run a fine tuned Mistral model?
| rkwasny wrote:
| Very interesting. As I understand it, the model was fine-tuned on
| GSM8K, so we know we can remove 60% of the model and still answer
| GSM8K questions.
|
| The bigger question is: does the sparse model retain any general
| knowledge?
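
For intuition about what "remove 60% of the model" means at the weight level,
the simplest illustration is one-shot unstructured magnitude pruning. This is
not the article's sparse fine-tuning recipe, just a sketch of what zeroing 60%
of a weight matrix looks like:

    import numpy as np

    def magnitude_prune(weights: np.ndarray, sparsity: float = 0.6) -> np.ndarray:
        """Zero the smallest-magnitude fraction of weights (unstructured sparsity)."""
        threshold = np.quantile(np.abs(weights), sparsity)
        return weights * (np.abs(weights) >= threshold)

    w = np.random.randn(4096, 4096).astype(np.float32)      # stand-in for one weight matrix
    w_sparse = magnitude_prune(w, sparsity=0.6)
    print(f"zeroed fraction: {(w_sparse == 0).mean():.2f}")  # ~0.60
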
| andy99 wrote:
| Is it compared anywhere with a smaller model fine-tuned on the
| same thing? Edit: I didn't see any comparisons when skimming the
| paper. More specifically, fine-tuned smaller models can often be
| pretty good. I'd want to see how 3B and 1B (etc.) Llama models
| fine-tuned on the same dataset perform. Is sparsity the key
| here, or is it just fewer parameters?
|
| The quantization part was interesting - they also quantized
| activations and have some adaptive quantization that
| accommodates outliers.
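
On the outlier point: the exact scheme isn't described in the thread, but a
common way to make activation quantization outlier-tolerant is to clip at a
high percentile rather than the max before picking the int8 scale. A generic
sketch under that assumption (not necessarily the paper's method):

    import numpy as np

    def quantize_int8(x: np.ndarray, clip_percentile: float = 99.0):
        """Symmetric int8 quantization; clipping at a percentile keeps a handful
        of outliers from inflating the step size for every other value."""
        clip = np.percentile(np.abs(x), clip_percentile)
        scale = clip / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    acts = np.random.randn(2048).astype(np.float32)
    acts[::500] *= 40.0                       # inject a few outlier activations

    _, clipped_scale = quantize_int8(acts, clip_percentile=99.0)
    _, naive_scale = quantize_int8(acts, clip_percentile=100.0)   # scale to the max
    print(f"step size with percentile clipping: {clipped_scale:.4f}")
    print(f"step size scaled to the max:        {naive_scale:.4f}")  # much coarser
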
| jakey_bakey wrote:
| Does anyone have a good resource on fine-tuning using open-source
| LLMs?
| shultays wrote:
| What "no drop in accuracy" means? Is it like they do some kind of
| lossless compression and it is guaranteed to behave same? Or are
| they claiming that as a result of a (subjective?) test that
| measures accuracy? If so how does such test works?
| halflings wrote:
| The page fully explains what they mean by this, showing results
| on benchmarks etc.
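
For context on how such a test typically works: GSM8K is scored by exact match
on the final numeric answer over the benchmark's test set, so "no drop in
accuracy" would mean the sparse model scores the same on the benchmark as the
dense one. A generic sketch of that scoring (not Neural Magic's evaluation
harness):

    def gsm8k_exact_match(predictions: list[str], references: list[str]) -> float:
        """Fraction of problems where the model's final number matches the reference."""
        def final_number(text: str) -> str:
            tail = text.split("####")[-1]     # GSM8K references mark the answer with '####'
            tokens = tail.replace(",", "").replace("$", "").split()
            numbers = [t for t in tokens if t.lstrip("-").replace(".", "", 1).isdigit()]
            return numbers[-1] if numbers else ""
        hits = sum(final_number(p) == final_number(r)
                   for p, r in zip(predictions, references))
        return hits / len(references)

    print(gsm8k_exact_match(["...so she earned #### 18"], ["Step 1 ... #### 18"]))  # 1.0
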
| avipars wrote:
| Is it RAM-heavy?
|
| I tried other models and they crashed because they stored too
| much data in RAM.
| wills_forward wrote:
| This seems like a big step forward for running inference with
| models trained for specific use cases locally, right? At least
| given the hardware currently deployed in most businesses.
| lawlessone wrote:
| Is this the same as the Fast feed forwards post from yesterday?
| If not, could both be applied?
| visarga wrote:
| They are alternative approaches.
| buildbot wrote:
| I did a bit of digging but was unable to find what kind of
| sparsity this applies - unstructured? Semi-structured? Block
| sparsity?
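
For reference, the three patterns differ only in where the zeros are allowed
to land. A tiny side-by-side illustration at 50% sparsity (generic, not a
claim about which pattern the article actually uses):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)

    # Unstructured: zero the globally smallest half of the weights, anywhere.
    unstructured = w * (np.abs(w) >= np.quantile(np.abs(w), 0.5))

    # Semi-structured 2:4: in every group of 4 consecutive weights, keep the 2 largest.
    groups = np.abs(w).reshape(-1, 4)
    keep = np.argsort(groups, axis=1)[:, 2:]              # indices of the 2 largest per group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    semi_structured = w * mask.reshape(w.shape)

    # Block sparsity: zero whole 2x2 blocks with the smallest total magnitude.
    block_scores = np.abs(w).reshape(2, 2, 4, 2).sum(axis=(1, 3))  # one score per 2x2 block
    block_mask = block_scores >= np.quantile(block_scores, 0.5)
    block_sparse = w * block_mask.repeat(2, axis=0).repeat(2, axis=1)

    print(unstructured, semi_structured, block_sparse, sep="\n\n")
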
___________________________________________________________________
(page generated 2023-11-23 23:02 UTC)