[HN Gopher] Parameter-free KV cache compression for memory-effic...
       ___________________________________________________________________
        
       Parameter-free KV cache compression for memory-efficient long-
       context LLMs
        
       Author : PaulHoule
       Score  : 58 points
       Date   : 2025-03-27 18:07 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | az226 wrote:
       | Is this some joke? They use Llama 2 7B? What year is it?
        
         | x1000 wrote:
         | If they had experimented using a newer model (gemma 3,
          | deepseek-r1 7b, etc.) and reported better results, would that be
         | because their newer baseline model was better than the llama 2
         | model used in the previous methods' experiments? A more
         | comprehensive study would include results for as many baseline
         | models as possible. But there are likely other researchers in
         | the lab all waiting to use those expensive GPUs for their
         | experiments as well.
        
           | josephg wrote:
           | Sure. But papers take a really long time to write and go
           | through peer review. I think my paper on collaborative
           | editing took about 4 months from the point where we were done
           | writing to the point at which it appeared on arxiv.
           | 
           | This research was almost certainly done well before Gemma 3
           | and Deepseek were released.
        
         | PaulHoule wrote:
         | The best model is the one you can fit in memory.
         | 
         | About as soon as GPT-4 came out I said that OpenAI was doomed
         | on the trajectory it was on because they could not afford to
         | develop a GPT-5, GPT-6, etc.
         | 
         | Real innovation comes out of doing a _lot_ of experiments and
         | that means doing experiments quickly with the resources you
         | have. So you do most of your experiments with non-frontier
         | models, enough to make a good prediction of what would happen
          | if you maxed out your model size, then you go big. That's how
         | you make everyone else have a "DeepSeek moment".
         | 
         | A company like Apple wants to pick something on the frontier
         | and keep advancing on a straight line. Works great if you want
         | to make an M1, M2, M3, ... ARM chip but that's not how progress
         | works in AI today.
        
           | monocasa wrote:
            | I mean, there are other, better 7B models than Llama 2 at this
           | point.
        
           | hinkley wrote:
           | Will we see models built on b-trees to deal with memory
           | requirements? Have we already?
        
             | sujayakar wrote:
             | Deepseek is already using SSDs for their KV cache:
             | https://github.com/deepseek-ai/3FS
        
               | vlovich123 wrote:
                | You are deeply misunderstanding what the KV cache
                | referred to here is. It's not for storing data. This is
                | the KV cache used during inference to reduce self-
                | attention's compute from quadratic to linear per
                | generated token. It is not stored on SSD - it lives in
                | VRAM (or in system RAM if you're not running on a GPU).
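                | 
                | Roughly, a minimal sketch of what that cache buys you
                | during decoding (illustrative PyTorch for a single
                | attention head; the names and shapes are assumptions,
                | not anything from the paper):
                | 
                |   import torch
                |   import torch.nn.functional as F
                | 
                |   d = 64                       # head dim (assumed)
                |   Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
                |   k_cache = torch.empty(0, d)  # keys seen so far
                |   v_cache = torch.empty(0, d)  # values seen so far
                | 
                |   def decode_step(x):
                |       # project only the new token and append it to
                |       # the cache instead of recomputing the prefix
                |       global k_cache, v_cache
                |       q, k, v = x @ Wq, x @ Wk, x @ Wv
                |       k_cache = torch.cat([k_cache, k])
                |       v_cache = torch.cat([v_cache, v])
                |       # attend over the cache: linear work per token
                |       a = F.softmax(q @ k_cache.T / d**0.5, -1)
                |       return a @ v_cache
                | 
                |   for _ in range(16):          # generate 16 tokens
                |       out = decode_step(torch.randn(1, d))
                | 
                | The cache grows linearly with context length, which is
                | the memory cost this paper is trying to shrink.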
        
               | boroboro4 wrote:
                | They do, in fact, mention the inference KV cache as a
                | use case in the readme. The most advanced KV caching
                | setups use a hierarchy of GPU RAM / regular RAM / SSD.
                | It seems they were able to use their storage abstraction
                | for the last tier.
        
         | krasin wrote:
         | > Is this some joke? They use Llama 2 7B? What year is it?
         | 
         | They use llama2 to demonstrate that their compression method
          | works. There are two potential cases:
         | 
         | 1. The method works on all / most LLMs. In this case, it does
         | not matter on which model they demonstrated the effect.
         | 
         | 2. The method only works on llama2, but not on other models.
         | Given that they published the code, I expect that people will
         | quickly test the method on many other models, so we will know
          | that soon. And yet - there would be scientific significance
          | even if it worked only on llama2, as it would mean that there's
          | something special and good about that architecture.
         | 
          | But I would bet it's #1 - the method works on most models, and
          | they just picked whatever they already had code bindings for,
          | to save effort.
        
       | kristianp wrote:
       | Code at https://github.com/SusCom-Lab/ZeroMerge
        
       | hinkley wrote:
       | This feels like something that could be done in part by hand. We
       | store documents in KV that are often built deterministically by
       | merging two or three pieces of data, one of which is a form of
        | string interpolation (e.g., templates).
       | 
       | Effectively if you had a microservice that did extremely light
       | data processing, and you moved the KV store behind it instead of
       | in front of it, you'd achieve a similar aim. A small cache in
        | front of it, or even upstream, would reduce the calculations
       | in the face of thundering herds.
        
         | vlovich123 wrote:
          | This sounds like you're thinking of KV as in key-value, like
          | redis or s3. This paper is about the KV cache in an LLM, which
          | is for reducing the computational complexity of self-attention.
          | It has nothing to do with what you wrote, unless I
          | misunderstood it (I'm confused about what upstream would mean
          | here - the contents of the KV cache are specific to the
          | context provided to the LLM / what the LLM is generating in
          | response).
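          | 
          | For a sense of scale, a back-of-the-envelope number using
          | Llama 2 7B's published config (32 layers, 32 KV heads, head
          | dim 128, fp16 values):
          | 
          |   layers, kv_heads, head_dim, fp16_bytes = 32, 32, 128, 2
          |   # the extra 2 is one key and one value per head
          |   per_token = layers * 2 * kv_heads * head_dim * fp16_bytes
          |   print(per_token)                 # 524288 B = 0.5 MiB/token
          |   print(per_token * 4096 / 2**30)  # 2.0 GiB at 4k context
          | 
          | That per-token cost is why shrinking the cache matters so
          | much for long contexts.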
        
       ___________________________________________________________________
       (page generated 2025-03-27 23:01 UTC)