[HN Gopher] 26x Faster Inference with Layer-Condensed KV Cache f...
       ___________________________________________________________________
        
       26x Faster Inference with Layer-Condensed KV Cache for Large
       Language Models
        
       Author : georgehill
       Score  : 94 points
       Date   : 2024-05-20 15:33 UTC (7 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | vessenes wrote:
        | Upshot of the paper -- right now a KV cache is kept for every
        | layer of an LLM. Why not keep only the top layer's? It would
        | save memory (rough numbers sketched below).
       | 
       | Initial result -- those KV caches in lower layers matter, and
       | output suffered.
       | 
       | Updated plan -- cull half the KV layers! This works 'nearly' as
       | well as keeping all of them, with memory and compute savings.
       | 
        | Downside -- triple the training time, worse out-of-band /
        | long-context performance.
       | 
        | This feels to me like a technique you'd use on a particular
        | architecture deployed at the edge, where compute matters and you
        | have a little room to give up on quality. Phi-3 on a Raspberry
        | Pi, basically.
       | 
        | Interesting! As always, I wish papers showed sample prompt
        | outputs, not just perplexity numbers. But here we are.
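        | 
        | Back-of-the-envelope sketch of the memory argument (numbers are
        | illustrative, roughly Llama-2-7B-shaped, not taken from the
        | paper):
        | 
        |     def kv_cache_bytes(n_layers, n_kv_heads, head_dim,
        |                        seq_len, batch, dtype_bytes=2):
        |         # one K and one V tensor per cached layer, fp16
        |         return (2 * n_layers * n_kv_heads * head_dim
        |                 * seq_len * batch * dtype_bytes)
        | 
        |     full = kv_cache_bytes(32, 32, 128, 4096, 1)  # ~2.0 GiB
        |     top  = kv_cache_bytes( 1, 32, 128, 4096, 1)  # ~64 MiB
        | 
        | Keeping only the top layer's K/V divides the cache by roughly
        | the number of layers, which is where the memory saving comes
        | from.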
        
         | adtac wrote:
         | > triple the training
         | 
          | From what I understand, not quite. It looks like the cost of
          | training might be similar, just less parallelisable within a
          | given token sequence. This is because they have to compute the
          | KV of token T before they can use it at T+1, whereas in regular
          | training you can compute the KVs at every layer for the whole
          | sequence in parallel. You're right that it took 2.7x longer to
          | train the smallest model, but I wouldn't be surprised if the
          | GPU utilisation was proportionally lower too.
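          | 
          | A toy sketch of that dependency (heavily simplified: no causal
          | mask, no MLPs, and none of the parallel-training tricks the
          | paper actually uses):
          | 
          |     import torch
          | 
          |     B, T, L, D = 1, 8, 4, 16   # batch, seq, layers, dim
          |     Wk = [torch.randn(D, D) for _ in range(L)]
          |     Wv = [torch.randn(D, D) for _ in range(L)]
          | 
          |     def attend(q, k, v):
          |         a = (q @ k.transpose(-1, -2)) / D ** 0.5
          |         return torch.softmax(a, dim=-1) @ v
          | 
          |     def standard(h):          # h: [B, T, D]
          |         # per layer, K/V for all T positions come out of
          |         # one parallel pass
          |         for l in range(L):
          |             h = attend(h, h @ Wk[l], h @ Wv[l])
          |         return h
          | 
          |     def condensed(h):
          |         # every layer reads K/V from the *top* layer, so
          |         # position t must finish the whole stack before
          |         # t+1 can start -> sequential over positions
          |         ks = torch.zeros(B, 0, D)
          |         vs = torch.zeros(B, 0, D)
          |         out = []
          |         for t in range(T):
          |             x = h[:, t:t+1]
          |             for _ in range(L):
          |                 if t > 0:
          |                     x = attend(x, ks, vs)
          |             ks = torch.cat([ks, x @ Wk[-1]], dim=1)
          |             vs = torch.cat([vs, x @ Wv[-1]], dim=1)
          |             out.append(x)
          |         return torch.cat(out, dim=1)
          | 
          |     x0 = torch.randn(B, T, D)
          |     y_par = standard(x0)    # one pass per layer
          |     y_seq = condensed(x0)   # T sequential steps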
        
         | algo_trader wrote:
         | In general, are the per-layer KV caches all independent of each
         | other?
         | 
         | If we split a model layer-wise for inference across X GPUs,
         | does the cache split as well?
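          | 
          | For a vanilla transformer the cache is literally a per-layer
          | structure, and layer i's attention only ever reads entry i, so
          | a layer-wise split carries its slice of the cache with it. The
          | (legacy) HuggingFace layout, as a sketch:
          | 
          |     import torch
          | 
          |     B, H, Dh, t, n_layers = 1, 8, 64, 128, 32
          |     # one (key, value) pair per layer, each sized
          |     # [batch, n_kv_heads, seq_len, head_dim]
          |     past_key_values = tuple(
          |         (torch.zeros(B, H, t, Dh),
          |          torch.zeros(B, H, t, Dh))
          |         for _ in range(n_layers)
          |     )
          | 
          | If I'm reading the paper right, that per-layer independence is
          | exactly what the layer-condensed cache gives up: every layer
          | attends to the top layer's K/V.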
        
         | YetAnotherNick wrote:
          | I wish papers could be accepted with negative results. There's
          | a lot of value in not repeating the same mistakes, especially
          | in a field like deep learning, which is not an exact science
          | and is mostly driven by intuition.
        
       | tripplyons wrote:
       | This can be combined with Grouped Query Attention or Multi-Query
       | Attention for an even further reduction in the size of the KV
       | Cache!
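        | 
        | Rough math on how the savings would multiply (illustrative
        | numbers, not from the paper):
        | 
        |     def kv_bytes(layers, kv_heads, head_dim, seq, fp=2):
        |         return 2 * layers * kv_heads * head_dim * seq * fp
        | 
        |     mha  = kv_bytes(32, 32, 128, 4096)  # ~2 GiB
        |     gqa  = kv_bytes(32,  8, 128, 4096)  # 8 KV heads: ~512 MiB
        |     both = kv_bytes( 1,  8, 128, 4096)  # + top layer: ~16 MiB
        | 
        | The two reductions are orthogonal: GQA/MQA shrinks each layer's
        | cache, while the layer-condensed cache shrinks the number of
        | layers that need a cache at all.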
        
         | WhitneyLand wrote:
          | But it's not free: it cuts quality significantly.
         | 
         | It's not hard to find ways to speed up transformers if you're
         | willing to give up quality.
         | 
          | You could argue some tradeoffs are worth it, and that's true
          | sometimes, but I don't see that they've made the case for it
          | here.
        
       | WhitneyLand wrote:
       | Not sure if @dang is the right way to say the title is incorrect
       | here, but shouldn't it match the paper?
       | 
       | 1. The correct title is Layer-Condensed KV Cache for Efficient
       | Inference of Large Language Models.
       | 
       | 2. The paper does make a 26x claim later in the introduction, but
       | it's an outlier.
       | 
        | The 26x is for only one benchmark, and that benchmark is
        | CPU-based, not GPU-based like 99% of transformer workloads
        | actually are.
       | 
        | If you look at the GPU-only workloads, the improvements range
        | from 1.4x to 4.7x.
        
         | kiraaa wrote:
         | on gpu that is still huge.
        
           | 0cf8612b2e1e wrote:
            | Even if it is "only" the 40% lower end, that is a gargantuan
            | savings. So many groups are compute-constrained; every bit
            | helps.
        
             | josephg wrote:
              | Sure, but a 40% improvement is much less than a 26x
              | improvement. If 40% is the realistic figure, cite that.
              | Changing the title to include the 26x outlier is
              | clickbaity.
        
           | VeejayRampay wrote:
           | it sure is huge, but it's still far from 26x
        
       | vlovich123 wrote:
       | Is the KV cache something that runs on the GPU or on the CPU? Or
       | traditionally on the CPU & this enables it to run on the GPU?
        
         | adtac wrote:
          | The KV cache is just another tensor to be used in matmuls.
          | Unlike the model weights, which are fixed, the KV cache is
          | constructed uniquely for every input. Think of it as the model
          | growing new weights at inference time to represent the new
          | knowledge it learns about the user's input, because not
          | everything can be baked into the pretrained model.
          | 
          | You want to store your KV cache on the same processor that does
          | the rest of your matmuls.
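          | 
          | Concretely, during decoding it's just a pair of tensors per
          | layer that get appended to on whatever device holds the
          | weights (toy sketch, shapes made up):
          | 
          |     import torch
          | 
          |     dev = "cuda" if torch.cuda.is_available() else "cpu"
          |     B, H, Dh = 1, 8, 64
          |     k_cache = torch.zeros(B, H, 0, Dh, device=dev)
          |     v_cache = torch.zeros(B, H, 0, Dh, device=dev)
          | 
          |     for step in range(16):   # one new token per step
          |         k = torch.randn(B, H, 1, Dh, device=dev)
          |         v = torch.randn(B, H, 1, Dh, device=dev)
          |         k_cache = torch.cat([k_cache, k], dim=2)
          |         v_cache = torch.cat([v_cache, v], dim=2)
          |         # attention then reads the whole cache, on the same
          |         # device as the weight matmuls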
        
       | jasonjmcghee wrote:
       | > please use the original title, unless it is misleading or
       | linkbait; don't editorialize.
       | 
       | "Layer-Condensed KV Cache for Efficient Inference of Large
       | Language Models"
        
       | joaquincabezas wrote:
        | LLM inference optimization was key to the OpenAI GPT-4o
        | presentation (2x faster, 50% cheaper), and it's driving lots of
        | industry research because it translates directly into cost
        | savings. It's refreshing to see so many of these techniques
        | published as papers (e.g. from Stanford, Berkeley...).
        
       | jsemrau wrote:
       | "Our implementation is based on HuggingFace transformers where we
       | register a new model opt-llama that supports the Layer-Condensed
       | KV Cache."
       | 
        | Not sure what this means. Would this work for a Mistral model as
        | well?
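        | 
        | If it follows the usual custom-model pattern in HuggingFace
        | transformers, "register" just means hooking a new model type
        | into the Auto* classes, roughly like this (class names here are
        | made up, not necessarily what their repo uses):
        | 
        |     from transformers import (AutoConfig, AutoModelForCausalLM,
        |                               LlamaConfig, LlamaForCausalLM)
        | 
        |     class OptLlamaConfig(LlamaConfig):
        |         model_type = "opt-llama"
        | 
        |     class OptLlamaForCausalLM(LlamaForCausalLM):
        |         config_class = OptLlamaConfig
        |         # attention rewritten to read K/V from the top layer
        | 
        |     AutoConfig.register("opt-llama", OptLlamaConfig)
        |     AutoModelForCausalLM.register(OptLlamaConfig,
        |                                   OptLlamaForCausalLM)
        | 
        | The registration mechanism itself isn't Llama-specific, but the
        | attention/cache changes would presumably have to be
        | re-implemented (and the model trained with them) per
        | architecture, so it likely isn't a drop-in change for an
        | existing Mistral checkpoint.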
        
       ___________________________________________________________________
       (page generated 2024-05-20 23:00 UTC)