[HN Gopher] Efficient LLM Inference (2023)
       ___________________________________________________________________
        
       Efficient LLM Inference (2023)
        
       Author : inaciom
       Score  : 67 points
       Date   : 2024-01-04 13:02 UTC (9 hours ago)
        
 (HTM) web link (www.artfintel.com)
 (TXT) w3m dump (www.artfintel.com)
        
       | ilaksh wrote:
        | I was experimenting with getting a few models to output Rhai
        | scripting and found that the non-quantized or 6-bit models
        | were able to do it as I requested with a few hints, but the
        | 4- or 5-bit ones got confused.
       | 
        | Whereas the 4- or 5-bit models could handle equivalent
        | requests in Python.
       | 
        | My conclusion was that I should fine-tune a 4- or 5-bit model
        | on Rhai scripting question/output pairs, and that if I made
        | enough good ones, the performance on my task would improve.
       | 
        | Maybe if I just switch to Exllama2 or something, then the
        | 6-bit model will run fast enough.
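 
        A minimal sketch of the fine-tuning setup described in the
        comment above, assuming a QLoRA-style recipe: load a 4-bit
        base model, attach LoRA adapters, and train on (question,
        Rhai script) pairs. The model name, data file, prompt format,
        and hyperparameters here are illustrative assumptions, not
        details from the comment.
 
          import torch
          from datasets import load_dataset
          from peft import LoraConfig, get_peft_model
          from transformers import AutoModelForCausalLM
          from transformers import AutoTokenizer, BitsAndBytesConfig

          base = "mistralai/Mistral-7B-v0.1"   # assumed base model
          tok = AutoTokenizer.from_pretrained(base)
          model = AutoModelForCausalLM.from_pretrained(
              base,
              quantization_config=BitsAndBytesConfig(
                  load_in_4bit=True,
                  bnb_4bit_compute_dtype=torch.bfloat16),
              device_map="auto")

          # Train small LoRA adapters over the frozen 4-bit weights.
          model = get_peft_model(model, LoraConfig(
              r=16, lora_alpha=32, task_type="CAUSAL_LM",
              target_modules=["q_proj", "v_proj"]))

          # Hypothetical JSONL file of {"question": ..., "script": ...}.
          data = load_dataset("json", data_files="rhai_pairs.jsonl")
          data = data["train"]

          def to_prompt(ex):
              text = (f"### Question:\n{ex['question']}\n"
                      f"### Rhai:\n{ex['script']}")
              return tok(text, truncation=True, max_length=1024)

          data = data.map(to_prompt)
          # Training then proceeds with the usual transformers
          # Trainer / SFT loop over `data`.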
        
       | fnbr wrote:
       | Ha! I got a ton of new subscribers this morning and was wondering
       | why. Let me know if I can answer any questions (I am the author).
        
         | ramesh1994 wrote:
          | I think distillation in the original sense isn't being done
          | anymore, but fine-tuning on outputs from larger models like
          | GPT-4 is a form of distillation (the top-1 logit instead of
          | all logits, and curated synthetic data instead of the
          | original dataset).
         | 
          | On quantization, though, it's still odd that only the
          | weights are quantized in methods like GPTQ / int8, while
          | other methods quantize the activations as well. There's
          | also the matter of the KV cache still being kept in the
          | original 16-bit precision regardless, which also remains
          | unsolved here. Do you have any thoughts or insights into
          | this?
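 
        A toy sketch of the distinction drawn above, in plain
        PyTorch: classic distillation matches the teacher's full
        output distribution with a KL term over all logits, while
        fine-tuning on generated text only supervises the teacher's
        sampled (top-1) token, which reduces to an ordinary LM loss.
        The shapes and temperature are made up for illustration.
 
          import torch
          import torch.nn.functional as F

          def full_logit_distill_loss(student_logits, teacher_logits,
                                      T=2.0):
              # Classic distillation: KL between the softened teacher
              # and student distributions over the whole vocabulary.
              vocab = student_logits.size(-1)
              s = F.log_softmax(student_logits / T, dim=-1)
              t = F.softmax(teacher_logits / T, dim=-1)
              return F.kl_div(s.view(-1, vocab), t.view(-1, vocab),
                              reduction="batchmean") * (T * T)

          def hard_label_loss(student_logits, teacher_token_ids):
              # "Distillation" via synthetic data: only the teacher's
              # chosen (top-1 / sampled) token is observed, so this is
              # just ordinary next-token cross-entropy.
              vocab = student_logits.size(-1)
              return F.cross_entropy(student_logits.view(-1, vocab),
                                     teacher_token_ids.view(-1))

          # Toy usage: batch of 2, sequence length 8, vocab of 32000.
          student = torch.randn(2, 8, 32000)
          teacher = torch.randn(2, 8, 32000)
          print(full_logit_distill_loss(student, teacher))
          print(hard_label_loss(student, teacher.argmax(dim=-1)))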
        
           | fnbr wrote:
            | It's not clear to me what's happening on the
            | distillation front. I agree no one is doing it
            | externally, but I suspect that the foundation model
            | companies are doing it internally; the performance is
            | just too good.
           | 
            | There's a bunch of recent work that quantizes the
            | activations as well, like FP8-LM. I think that this
            | will come.
           | Quantization support in PyTorch is pretty experimental right
           | now, so I think we'll see a lot of improvements as it gets
           | better support.
           | 
            | The KV cache piece is tied to the activations, imo:
            | once those start getting quantized effectively, the KV
            | cache will follow.
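 
        A minimal sketch of the kind of KV-cache quantization being
        discussed, assuming a simple symmetric int8 scheme with one
        scale per cached token; this is illustrative only, not the
        approach of any particular library or of the article.
 
          import torch

          def quantize_int8(x):
              # Symmetric per-row (per-token) int8 quantization.
              scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
              scale = scale.clamp(min=1e-8)
              q = torch.round(x / scale).clamp(-127, 127)
              return q.to(torch.int8), scale

          def dequantize_int8(q, scale):
              return q.to(torch.float16) * scale.to(torch.float16)

          # Toy cache for one layer: 8 heads, 128 tokens, head_dim 64.
          k = torch.randn(8, 128, 64, dtype=torch.float16)
          q8, scale = quantize_int8(k.float())
          k_restored = dequantize_int8(q8, scale)

          print((k - k_restored).abs().max())      # quantization error
          print(q8.numel() + scale.numel() * 4,    # int8 + fp32 scales
                "bytes vs", k.numel() * 2, "bytes fp16")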
        
       | liuliu wrote:
        | One of these days I will find time to write more about model
        | inference optimizations that don't go through distillation /
        | quantization. Case in point: switching llama.cpp from its
        | custom kernel to cuBLAS's GEMM implementation drops
        | throughput from 70 tok/s to 49 tok/s (RTX 6000 Ada,
        | Mistral-7B, FP16).
        
         | golly_ned wrote:
         | I'd be interested in this, so you've got at least one reader.
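 
        A rough sketch of how a kernel-level difference like the one
        reported above might be measured: time a decode-shaped
        matmul (a single token against a 7B-style weight matrix)
        with CUDA events. The shapes, dtype, and iteration counts
        are assumptions; this is not llama.cpp's own benchmark.
 
          import torch

          assert torch.cuda.is_available()
          d_model, d_ff = 4096, 14336   # Mistral-7B-like dimensions
          dt = torch.float16
          x = torch.randn(1, d_model, device="cuda", dtype=dt)
          w = torch.randn(d_model, d_ff, device="cuda", dtype=dt)

          def bench(fn, iters=1000):
              # Warm up, then time with CUDA events so GPU time is
              # measured rather than kernel-launch overhead.
              for _ in range(10):
                  fn()
              start = torch.cuda.Event(enable_timing=True)
              end = torch.cuda.Event(enable_timing=True)
              start.record()
              for _ in range(iters):
                  fn()
              end.record()
              torch.cuda.synchronize()
              return start.elapsed_time(end) / iters   # ms per call

          print("cuBLAS-backed matmul:", bench(lambda: x @ w), "ms")
          # A custom kernel (e.g. a fused dequant + GEMV) would be
          # timed the same way; at batch size 1 the op is memory-
          # bound, which is why a hand-tuned kernel can beat a
          # general GEMM.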
        
       ___________________________________________________________________
       (page generated 2024-01-04 23:00 UTC)