[HN Gopher] Efficient LLM Inference (2023)
___________________________________________________________________
Efficient LLM Inference (2023)
Author : inaciom
Score : 67 points
Date : 2024-01-04 13:02 UTC (9 hours ago)
(HTM) web link (www.artfintel.com)
(TXT) w3m dump (www.artfintel.com)
| ilaksh wrote:
| I was experimenting with getting a few models to output Rhai
| scripts and found that the non-quantized or 6-bit models were
| able to do it as I requested with a few hints, but the 4- or
| 5-bit ones got confused.
|
| Whereas the 4- or 5-bit models could handle equivalent requests
| in Python.
|
| My conclusion was that I should fine-tune a 4- or 5-bit model on
| Rhai scripting question/output pairs, and that if I made enough
| good ones, the performance on my task would improve.
|
| Maybe if I just switch to ExLlamaV2 or something, the 6-bit
| model will run fast enough.
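|
| Roughly what I had in mind, as a minimal sketch only (Hugging
| Face transformers + peft, QLoRA-style; it needs a CUDA GPU with
| bitsandbytes, and the model id, rhai_pairs.jsonl file, and
| hyperparameters are placeholders I haven't validated):
|
|   import torch
|   from datasets import load_dataset
|   from peft import (LoraConfig, get_peft_model,
|                     prepare_model_for_kbit_training)
|   from transformers import (AutoModelForCausalLM, AutoTokenizer,
|                             BitsAndBytesConfig,
|                             DataCollatorForLanguageModeling,
|                             Trainer, TrainingArguments)
|
|   model_id = "mistralai/Mistral-7B-v0.1"  # placeholder base model
|
|   # Load the base model in 4-bit (NF4) so training fits on one GPU.
|   bnb = BitsAndBytesConfig(
|       load_in_4bit=True,
|       bnb_4bit_quant_type="nf4",
|       bnb_4bit_compute_dtype=torch.bfloat16)
|   tok = AutoTokenizer.from_pretrained(model_id)
|   tok.pad_token = tok.eos_token
|   model = AutoModelForCausalLM.from_pretrained(
|       model_id, quantization_config=bnb, device_map="auto")
|   model = prepare_model_for_kbit_training(model)
|
|   # Only the small LoRA adapters are trained; the 4-bit base
|   # weights stay frozen.
|   model = get_peft_model(model, LoraConfig(
|       r=16, lora_alpha=32, lora_dropout=0.05,
|       target_modules=["q_proj", "v_proj"],
|       task_type="CAUSAL_LM"))
|
|   # rhai_pairs.jsonl (hypothetical file): one JSON object per
|   # line, {"text": question + "\n" + Rhai answer}.
|   ds = load_dataset("json", data_files="rhai_pairs.jsonl",
|                     split="train")
|   ds = ds.map(lambda ex: tok(ex["text"], truncation=True,
|                              max_length=1024),
|               remove_columns=ds.column_names)
|
|   Trainer(
|       model=model,
|       args=TrainingArguments(
|           "rhai-lora", per_device_train_batch_size=1,
|           gradient_accumulation_steps=8, num_train_epochs=3,
|           learning_rate=2e-4, bf16=True, logging_steps=10),
|       train_dataset=ds,
|       data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
|   ).train()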
| fnbr wrote:
| Ha! I got a ton of new subscribers this morning and was wondering
| why. Let me know if I can answer any questions (I am the author).
| ramesh1994 wrote:
| I think distillation in the original sense isn't being done
| anymore, but finetuning on outputs from larger models like GPT-4
| is a form of distillation (the top-1 logit vs. all logits, and a
| curated synthetic dataset instead of the original dataset).
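|
| As a toy sketch of the difference (PyTorch, with random tensors
| standing in for real model outputs): top-1 distillation is just
| cross-entropy on the teacher's chosen tokens, while classic
| distillation matches the full softened distribution with a KL
| term.
|
|   import torch
|   import torch.nn.functional as F
|
|   vocab, seq = 32000, 16
|   student_logits = torch.randn(seq, vocab)  # stand-in student
|   teacher_logits = torch.randn(seq, vocab)  # stand-in teacher
|
|   # "Finetune on GPT-4 outputs" style: only the teacher's chosen
|   # token survives, so this is plain cross-entropy against hard
|   # labels (the teacher's top-1).
|   hard_targets = teacher_logits.argmax(dim=-1)
|   loss_top1 = F.cross_entropy(student_logits, hard_targets)
|
|   # Classic distillation: match the teacher's full, temperature-
|   # softened distribution over all logits with a KL term.
|   T = 2.0
|   loss_kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
|                      F.log_softmax(teacher_logits / T, dim=-1),
|                      log_target=True,
|                      reduction="batchmean") * T * T
|
|   print(float(loss_top1), float(loss_kl))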
|
| On quantization, though, it's still odd that only the weights
| are quantized in methods like GPTQ / int8, while other methods
| quantize the activations as well. There's also the matter of the
| KV cache still being kept in the original 16-bit precision
| regardless, which also seems unsolved here. Do you have any
| thoughts or insights into this?
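|
| (For reference, this is the kind of weights-only setup I mean: a
| toy round-to-nearest int8 sketch with made-up shapes, not GPTQ
| itself.)
|
|   import torch
|
|   def quantize_weights_int8(w):
|       # Round-to-nearest, symmetric, per-output-channel int8
|       # weight quantization (a toy stand-in for what GPTQ-style
|       # methods do far more carefully).
|       scale = w.abs().amax(dim=1, keepdim=True) / 127.0
|       scale = scale.clamp(min=1e-6)
|       q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
|       return q, scale
|
|   w = torch.randn(4096, 4096) * 0.02  # stand-in weight matrix
|   x = torch.randn(4, 4096)            # activations stay full precision
|   qw, scale = quantize_weights_int8(w)
|
|   # At matmul time the int8 weights are dequantized; activations
|   # (and the KV cache) are untouched, which is the "weights only"
|   # setup described above.
|   y_exact = x @ w.t()
|   y_quant = x @ (qw.to(x.dtype) * scale).t()
|   rel_err = (y_exact - y_quant).norm() / y_exact.norm()
|   print("relative error:", float(rel_err))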
| fnbr wrote:
| It's not clear to me what's happening on the distillation
| front. I agree no one is doing it externally, but I suspect
| that the foundation model companies are doing it internally;
| the performance is just too good.
|
| There's a bunch of recent work that quantizes the activations
| as well, like FP8-LM, and I think it will come. Quantization
| support in PyTorch is pretty experimental right now, so I
| expect a lot of improvements as it gets better support.
|
| The KV cache piece is tied to the activations, imo: once those
| start getting quantized effectively, the KV cache will follow.
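|
| A toy sketch of what I mean, nothing more: per-token symmetric
| int8, applied the same way to an activation tensor and to cached
| keys. None of this is FP8-LM or any specific published method.
|
|   import torch
|
|   def int8_per_token(x):
|       # One scale per row/token, chosen dynamically at runtime;
|       # quantization math done in fp32 for simplicity.
|       xf = x.float()
|       scale = xf.abs().amax(dim=-1, keepdim=True) / 127.0
|       scale = scale.clamp(min=1e-6)
|       q = torch.round(xf / scale).clamp(-127, 127).to(torch.int8)
|       return q, scale
|
|   def dequant(q, scale):
|       return q.to(scale.dtype) * scale
|
|   acts = torch.randn(4, 4096)        # activations for 4 tokens
|   keys = torch.randn(1, 8, 128, 64)  # cached keys
|
|   for name, t in [("activations", acts), ("kv cache", keys)]:
|       q, s = int8_per_token(t)
|       err = (dequant(q, s) - t).abs().max()
|       print(name, "max abs err:", float(err))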
| liuliu wrote:
| One of these days I will find time to write more about model
| inference optimizations that don't go through distillation /
| quantization. Case in point: switching llama.cpp from its
| custom kernel to cuBLAS's GEMM implementation drops throughput
| from 70 tok/s to 49 tok/s (RTX 6000 Ada, Mistral-7B, FP16).
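|
| If anyone wants to poke at the shape issue, here is a tiny,
| unscientific timing sketch. It needs a CUDA GPU, and
| torch.matmul routes to cuBLAS, so it only shows how skinny the
| decode-time matmul is, not what llama.cpp's custom kernel does.
|
|   import time
|   import torch
|
|   assert torch.cuda.is_available()
|   hidden = 4096  # Mistral-7B-ish hidden size
|   W = torch.randn(hidden, hidden, device="cuda",
|                   dtype=torch.float16)
|   x = torch.randn(1, hidden, device="cuda",
|                   dtype=torch.float16)    # one decode token
|   X = torch.randn(512, hidden, device="cuda",
|                   dtype=torch.float16)    # a prefill-sized batch
|
|   def bench(fn, iters=200):
|       for _ in range(10):                 # warm-up
|           fn()
|       torch.cuda.synchronize()
|       t0 = time.perf_counter()
|       for _ in range(iters):
|           fn()
|       torch.cuda.synchronize()
|       return (time.perf_counter() - t0) / iters * 1e6  # us
|
|   # During decoding each layer's matmul is effectively a
|   # matrix-vector product, so it is memory-bound and a general
|   # GEMM path can leave throughput on the table.
|   print("decode-like  (1 x 4096):  ", bench(lambda: x @ W), "us")
|   print("prefill-like (512 x 4096):", bench(lambda: X @ W), "us")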
| golly_ned wrote:
| I'd be interested in this, so you've got at least one reader.
___________________________________________________________________
(page generated 2024-01-04 23:00 UTC)