[HN Gopher] 26x Faster Inference with Layer-Condensed KV Cache f...
___________________________________________________________________
26x Faster Inference with Layer-Condensed KV Cache for Large
Language Models
Author : georgehill
Score : 94 points
Date : 2024-05-20 15:33 UTC (7 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| vessenes wrote:
| Upshot of the paper -- right now KV caches are kept for every
| layer in LLMs. Why not keep only the top layer's? It would save
| memory.
|
| Initial result -- those KV caches in lower layers matter, and
| output suffered.
|
| Updated plan -- cull half the KV layers! This works 'nearly' as
| well as keeping all of them, with memory and compute savings.
|
| Downside -- triple the training time, worse out-of-band /
| long-context performance.
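|
| Rough sketch of the single-cache version as I read it (my own
| toy code, not the authors'; the "cull half" variant keeps some
| layers' own caches, which this ignores):
|
|     import torch
|
|     def attend(q, K, V):
|         # plain scaled dot-product attention over the cached keys/values
|         w = torch.softmax(q @ K.T / q.shape[-1] ** 0.5, dim=-1)
|         return w @ V
|
|     def decode_step(layers, x, K_cache, V_cache):
|         # x: hidden state of the newest token, shape (1, d)
|         # K_cache, V_cache: the single shared cache, shape (t, d)
|         h = x
|         for layer in layers:
|             # every layer attends to the SAME condensed cache
|             h = h + attend(layer["wq"](h), K_cache, V_cache)
|             h = h + layer["mlp"](h)
|         # only the top layer's key/value of this token join the cache,
|         # so cache memory grows with seq_len but not with num_layers
|         k_new, v_new = layers[-1]["wk"](h), layers[-1]["wv"](h)
|         K_cache = torch.cat([K_cache, k_new])
|         V_cache = torch.cat([V_cache, v_new])
|         return h, K_cache, V_cache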
|
| This feels to me like a technique you'd use on a particular
| architecture deployed at the edge where compute matters and you
| have a little extra room on performance. Phi-3 on raspberry pi,
| basically.
|
| Interesting! As always, I wish models showed prompt output in
| their papers, not just perplexity numbers. But, here we are.
| adtac wrote:
| > triple the training
|
| From what I understand, not quite. It looks like the total cost
| of training might be similar, but it's less parallelisable
| within a specific token sequence. This is because they have to
| compute the KV of token T before they can use it for T+1,
| whereas in regular training you can compute the KV at each layer
| for every position in parallel. You're right that it took 2.7x
| longer to train the smallest model, but I wouldn't be surprised
| if the GPU utilisation was proportionally lower too.
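|
| Toy contrast of the dependency (my own sketch, not the paper's
| training code):
|
|     import torch
|
|     d, T = 16, 8
|     layers = [torch.nn.Linear(d, d) for _ in range(4)]
|     E = torch.randn(T, d)            # embeddings for one sequence
|
|     # Standard transformer: each layer's K/V for ALL T positions come
|     # from that layer's own hidden states -> one batched matmul each.
|     H = E
|     per_layer_kv = []
|     for layer in layers:
|         H = torch.relu(layer(H))     # all T positions in parallel
|         per_layer_kv.append(H)       # stand-in for that layer's K/V
|
|     # Condensed cache: token t's cache entry comes from the TOP
|     # layer's output for token t, which needs the cache entries of
|     # tokens < t, so you loop over positions within the sequence.
|     cache = torch.zeros(0, d)
|     for t in range(T):
|         h = E[t:t+1]
|         for layer in layers:
|             ctx = torch.cat([cache, h])   # tokens < t plus token t
|             # crude stand-in for attention over the cache
|             h = torch.relu(layer(ctx)).mean(0, keepdim=True)
|         cache = torch.cat([cache, h])     # only top-layer output cached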
| algo_trader wrote:
| In general, are the per-layer KV caches all independent of each
| other?
|
| If we split a model layer-wise for inference across X GPUs,
| does the cache split as well?
| YetAnotherNick wrote:
| I wish papers could be accepted with negative results. There's
| a lot of value in not repeating the same mistakes, especially
| in a field like deep learning, which is not an exact science
| and is mostly driven by intuition.
| tripplyons wrote:
| This can be combined with Grouped Query Attention or Multi-Query
| Attention for an even further reduction in the size of the KV
| Cache!
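|
| Back-of-envelope with made-up but typical numbers, just to show
| how the two savings multiply (per sequence, fp16):
|
|     def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
|                        bytes_per=2):
|         # 2 tensors (K and V) per cached layer
|         return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per
|
|     full = kv_cache_bytes(32, 32, 128, 4096)  # plain multi-head attn
|     gqa  = kv_cache_bytes(32, 8, 128, 4096)   # grouped-query, 8 KV heads
|     both = kv_cache_bytes(2, 8, 128, 4096)    # + caching only ~2 layers
|     print(full >> 20, gqa >> 20, both >> 20)  # 2048, 512, 32 MiB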
| WhitneyLand wrote:
| But it's not free: it cuts quality significantly.
|
| It's not hard to find ways to speed up transformers if you're
| willing to give up quality.
|
| You could argue some tradeoffs are worth it, and that's true
| sometimes, but I don't see that they've made the case for it
| here.
| WhitneyLand wrote:
| Not sure if @dang is the right way to say the title is incorrect
| here, but shouldn't it match the paper?
|
| 1. The correct title is Layer-Condensed KV Cache for Efficient
| Inference of Large Language Models.
|
| 2. The paper does make a 26x claim later in the introduction, but
| it's an outlier.
|
| The 26x figure is for only one benchmark, and that benchmark is
| CPU-based, not GPU-based like 99% of real transformer workloads.
|
| If you look at GPU-only workloads, the improvements range from
| 1.4x to 4.7x.
| kiraaa wrote:
| on gpu that is still huge.
| 0cf8612b2e1e wrote:
| Even if it is "only" the 40% improvement at the lower end, that
| is a gargantuan savings. So many groups are compute-constrained;
| every bit helps.
| josephg wrote:
| Sure; but a 40% improvement is much less than a 26x
| improvement. If 40% is the realistic figure, cite that.
| Changing the title to include the 26x outlier is
| clickbaity.
| VeejayRampay wrote:
| it sure is huge, but it's still far from 26x
| vlovich123 wrote:
| Is the KV cache something that runs on the GPU or on the CPU? Or
| traditionally on the CPU & this enables it to run on the GPU?
| adtac wrote:
| The KV cache is just another tensor to be used with matmuls.
| Unlike the model weights, which are fixed, the KV cache is
| uniquely constructed for every input. Think of it as the model
| growing new weights to represent the new knowledge it learns
| about the user's input at inference time, because not
| everything can be baked into the pretrained model.
|
| You want to store your KV cache in the same processor that does
| the rest of your matmuls.
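|
| A generic decoding loop makes that concrete (toy sketch, not tied
| to this paper):
|
|     import torch
|
|     d = 64
|     wq, wk, wv = (torch.nn.Linear(d, d) for _ in range(3))
|     K = torch.zeros(0, d)            # the cache: one row per token seen
|     V = torch.zeros(0, d)
|
|     for step in range(5):
|         x = torch.randn(1, d)        # hidden state of the newest token
|         K = torch.cat([K, wk(x)])    # append this token's key/value...
|         V = torch.cat([V, wv(x)])
|         q = wq(x)
|         # ...so each step's attention is a matmul over the whole cache,
|         # which is why you want it next to the weights in GPU memory
|         out = torch.softmax(q @ K.T / d ** 0.5, dim=-1) @ V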
| jasonjmcghee wrote:
| > please use the original title, unless it is misleading or
| linkbait; don't editorialize.
|
| "Layer-Condensed KV Cache for Efficient Inference of Large
| Language Models"
| joaquincabezas wrote:
| LLM inference optimization was key to the OpenAI GPT-4o
| presentation (2x faster, 50% cheaper), and it's driving lots of
| industry research because it translates directly into cost
| savings, but it's refreshing to see so many techniques published
| as papers (e.g. from Stanford, Berkeley...)
| jsemrau wrote:
| "Our implementation is based on HuggingFace transformers where we
| register a new model opt-llama that supports the Layer-Condensed
| KV Cache."
|
| I'm not sure what this means. Would this work for a Mistral
| model as well?
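|
| Probably the standard HuggingFace custom-model registration
| pattern, something like this (a guess at the shape of it, not the
| authors' actual code):
|
|     from transformers import (AutoConfig, AutoModelForCausalLM,
|                               LlamaConfig, LlamaForCausalLM)
|
|     class OptLlamaConfig(LlamaConfig):
|         model_type = "opt-llama"   # name from the quoted sentence
|
|     class OptLlamaForCausalLM(LlamaForCausalLM):
|         config_class = OptLlamaConfig
|         # the real code would override the attention/cache logic here
|
|     AutoConfig.register("opt-llama", OptLlamaConfig)
|     AutoModelForCausalLM.register(OptLlamaConfig, OptLlamaForCausalLM)
|
| If it's wired up that way, the idea isn't tied to Llama in
| principle, but a Mistral version would presumably need its own
| modified model class rather than working out of the box.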
___________________________________________________________________
(page generated 2024-05-20 23:00 UTC)