[HN Gopher] Fast Llama 2 on CPUs with Sparse Fine-Tuning and Dee...
       ___________________________________________________________________
        
       Fast Llama 2 on CPUs with Sparse Fine-Tuning and DeepSparse
        
       Author : mwitiderrick
       Score  : 209 points
       Date   : 2023-11-23 04:44 UTC (18 hours ago)
        
 (HTM) web link (neuralmagic.com)
 (TXT) w3m dump (neuralmagic.com)
        
       | tarruda wrote:
        | Seems promising. Do they say anywhere how many tokens per second
        | are achieved on the CPU?
        
         | littlestymaar wrote:
         | They do. It's on the third graph: https://neuralmagic.com/wp-
         | content/uploads/2023/11/CHART-Lla...
        
       | RossBencina wrote:
       | Interesting company. Yannic Kilcher interviewed Nir Shavit last
       | year and they went into some depth:
       | https://www.youtube.com/watch?v=0PAiQ1jTN5k
       | 
       | DeepSparse is on GitHub:
       | https://github.com/neuralmagic/deepsparse
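        | 
        | For anyone who wants to kick the tires, usage is roughly like
        | this, based on their README (a minimal sketch; the model stub
        | below is a placeholder, not necessarily the one from the post):
        | 
        |     from deepsparse import TextGeneration
        | 
        |     # placeholder stub - swap in a real sparse/quantized model
        |     pipeline = TextGeneration(
        |         model="zoo:some-sparse-llama2-7b-stub")
        | 
        |     out = pipeline(prompt="How many legs does a spider have?",
        |                    max_new_tokens=64)
        |     print(out.generations[0].text)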
        
         | anonymousDan wrote:
         | Nir Shavit is co-author of IMO the best book on concurrent
         | programming: https://dl.acm.org/doi/book/10.5555/2385452
        
         | mark_l_watson wrote:
          | Thanks for the GitHub link, I want to try this on an M2 with
          | 32GB of memory. It will "waste" the GPU and neural cores, but
          | might still run fast.
         | 
          | Off topic: I so much appreciate everyone's work getting LLMs
         | running on inexpensive hardware! I have been having crazy
         | amounts of fun with Ollama and a wide variety of tuned models.
         | And, so fast!
        
       | arkmm wrote:
        | I might be missing something, but it seems like most of the
        | speedup is from quantization, which is already commonly used,
        | and the CPU instance used here isn't that much cheaper
        | (~10-15%?) than a GPU instance that could run the model. For
        | high-utilization workloads the extra throughput might be useful,
        | though.
        
         | yunohn wrote:
         | I think you are missing something, or I am.
         | 
          | If we look at the performance comparison graph (1), the jump
          | from 2.8->9 tok/s comes from quantization, but the remaining
          | jumps from 9->16.6->24.6 tok/s come from the sparse fine-
          | tuning.
         | 
         | (1) https://neuralmagic.com/wp-content/uploads/2023/11/CHART-
         | Lla...
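          | 
          | Back-of-the-envelope on those chart numbers (just the tok/s
          | values quoted above, in Python):
          | 
          |     dense = 2.8          # tok/s, dense baseline
          |     quantized = 9.0      # after quantization
          |     sparse_quant = 24.6  # best sparse + quantized result
          | 
          |     print(quantized / dense)         # ~3.2x from quantization
          |     print(sparse_quant / quantized)  # ~2.7x from sparsity
          |     print(sparse_quant / dense)      # ~8.8x end to end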
        
         | jonatron wrote:
         | Hetzner do cheap servers, but don't have many GPUs. If you're
         | looking at saving money by running on CPU, you shouldn't be
         | looking at one of the most expensive server providers.
        
           | thelastparadise wrote:
           | How is the $ per inference, say 4k tokens, on a Hetzner box
           | vs an A100?
           | 
           | Not all tasks require low latency.
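            | 
            | Rough formula to plug your own numbers into (the prices and
            | throughputs below are placeholders, not measurements):
            | 
            |     def cost_per_request(hourly_price_usd, tok_per_s,
            |                          tokens=4000):
            |         # dollars per request at full, sustained utilization
            |         return hourly_price_usd * (tokens / tok_per_s) / 3600
            | 
            |     # e.g. cost_per_request(0.05, 25)   - cheap CPU box
            |     #      cost_per_request(2.00, 300)  - rented A100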
        
             | jonatron wrote:
             | I'd like to know the answer to that too. I shouldn't have
             | implied that CPU inference on Hetzner is cheaper than GPU
             | on AWS, when I don't have any idea on the cost of either.
        
             | mikeravkine wrote:
              | Hetzner offers incredibly cheap ARM machines in the
              | Falkenstein DC; for 25 EUR a month you can snag the top of
              | the line with 16 vCPUs and 32GB RAM.
              | 
              | If your use case fits inside that 32GB (no 70B models,
              | sadly), the price-to-performance of a GGUF Q4_K_M model is
              | really attractive on this setup.
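              | 
              | For anyone trying that setup: with the llama-cpp-python
              | bindings it looks roughly like this (model path and thread
              | count are illustrative):
              | 
              |     from llama_cpp import Llama
              | 
              |     # any Q4_K_M GGUF that fits in 32GB of RAM
              |     llm = Llama(model_path="llama-2-13b.Q4_K_M.gguf",
              |                 n_threads=16)
              |     out = llm("Q: What is the capital of France? A:",
              |               max_tokens=16)
              |     print(out["choices"][0]["text"])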
        
               | londons_explore wrote:
                | With two/three instances, you can probably fit a 70B
                | model into RAM, and you don't need super low latency
                | between the machines to do inference with the layers
                | split across them.
        
       | leobg wrote:
       | Any way to run a fine tuned Mistral model?
        
       | rkwasny wrote:
        | Very interesting. As I understand it, the model was fine-tuned
        | on GSM8k, so we know we can remove 60% of the model and still
        | answer GSM8k questions.
        | 
        | The bigger question is: does the sparse model retain any general
        | knowledge?
        
         | andy99 wrote:
          | Is it compared anywhere with a smaller model fine-tuned on the
          | same thing? Edit: I didn't see any comparisons skimming the
          | paper. More specifically, fine-tuned smaller models can often
          | be pretty good. I'd want to see how 3B and 1B Llama models,
          | etc., fine-tuned on the same dataset perform. Is sparsity the
          | key here, or is it just fewer parameters?
         | 
         | The quantization part was interesting - they also quantized
         | activations and have some adaptive quantization that
         | accommodates outliers.
        
       | jakey_bakey wrote:
       | Does anyone have a good resource on fine-tuning using open-source
       | LLMs?
        
       | shultays wrote:
       | What "no drop in accuracy" means? Is it like they do some kind of
       | lossless compression and it is guaranteed to behave same? Or are
       | they claiming that as a result of a (subjective?) test that
       | measures accuracy? If so how does such test works?
        
         | halflings wrote:
         | The page fully explains what they mean by this, showing results
         | on benchmarks etc.
        
       | avipars wrote:
       | Is it RAM-usage heavy?
       | 
        | I tried other models and they crashed because they stored too
        | much data in RAM.
        
       | wills_forward wrote:
        | This seems like a big step forward for running inference locally
        | with models trained for specific use cases, right? At least
        | given the hardware currently deployed in most businesses.
        
       | lawlessone wrote:
       | Is this the same as the Fast feed forwards post from yesterday?
       | If not, could both be applied?
        
         | visarga wrote:
         | They are alternative approaches.
        
       | buildbot wrote:
        | I did a bit of digging but was unable to find what kind of
        | sparsity this uses - unstructured? Semi-structured? Block
        | sparsity?
        
       ___________________________________________________________________
       (page generated 2023-11-23 23:02 UTC)