[HN Gopher] Parameter-free KV cache compression for memory-effic...
___________________________________________________________________
Parameter-free KV cache compression for memory-efficient long-
context LLMs
Author : PaulHoule
Score : 58 points
Date : 2025-03-27 18:07 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| az226 wrote:
| Is this some joke? They use Llama 2 7B? What year is it?
| x1000 wrote:
| If they had experimented using a newer model (Gemma 3,
| DeepSeek-R1 7B, etc.) and reported better results, would that be
| because their newer baseline model was better than the llama 2
| model used in the previous methods' experiments? A more
| comprehensive study would include results for as many baseline
| models as possible. But there are likely other researchers in
| the lab all waiting to use those expensive GPUs for their
| experiments as well.
| josephg wrote:
| Sure. But papers take a really long time to write and go
| through peer review. I think my paper on collaborative
| editing took about 4 months from the point where we were done
| writing to the point at which it appeared on arxiv.
|
| This research was almost certainly done well before Gemma 3
| and Deepseek were released.
| PaulHoule wrote:
| The best model is the one you can fit in memory.
|
| About as soon as GPT-4 came out I said that OpenAI was doomed
| on the trajectory it was on because they could not afford to
| develop a GPT-5, GPT-6, etc.
|
| Real innovation comes out of doing a _lot_ of experiments and
| that means doing experiments quickly with the resources you
| have. So you do most of your experiments with non-frontier
| models, enough to make a good prediction of what would happen
| if you maxxed out your model size, then you go big. That's how
| you make everyone else have a "DeepSeek moment".
|
| A company like Apple wants to pick something on the frontier
| and keep advancing on a straight line. Works great if you want
| to make an M1, M2, M3, ... ARM chip but that's not how progress
| works in AI today.
| monocasa wrote:
| I mean, there are other, better 7B models than Llama 2 at this
| point.
| hinkley wrote:
| Will we see models built on b-trees to deal with memory
| requirements? Have we already?
| sujayakar wrote:
| Deepseek is already using SSDs for their KV cache:
| https://github.com/deepseek-ai/3FS
| vlovich123 wrote:
| You are deeply misunderstanding what the KV cache referred to
| here is. It's not for storing data. This is the KV cache that's
| part of the model, used to reduce the quadratic compute
| complexity of self-attention to linear per generated token. It
| is not stored on SSD - it lives in VRAM (or in CPU RAM if
| you're not using a GPU).
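|
| A minimal sketch of the mechanism, for anyone unfamiliar (the
| names and shapes here are illustrative, not from the paper or
| any particular framework): each decode step appends the new
| token's key/value to the cache and attends against everything
| cached so far, so per-token work is linear in context length
| instead of recomputing full quadratic attention.
|
|     import torch
|
|     d_model = 64
|     k_cache = torch.empty(0, d_model)  # keys of all prior tokens
|     v_cache = torch.empty(0, d_model)  # values of all prior tokens
|
|     def decode_step(q_new, k_new, v_new):
|         # One decoding step: O(context_len) work thanks to the cache.
|         global k_cache, v_cache
|         k_cache = torch.cat([k_cache, k_new[None, :]])
|         v_cache = torch.cat([v_cache, v_new[None, :]])
|         scores = k_cache @ q_new / d_model ** 0.5
|         weights = torch.softmax(scores, dim=0)
|         return weights @ v_cache  # attention output for the new token
|
| That ever-growing cache is what eats VRAM at long context, and
| it's what the paper's compression targets.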
| boroboro4 wrote:
| They do, in fact, mention the inference KV cache as a use case
| in the readme. The most advanced KV caching setups use a
| hierarchy of GPU RAM / regular RAM / SSD. It seems like they
| were able to use their storage abstraction for the last tier.
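|
| A toy sketch of that tiering idea (the class, budgets, and file
| layout are made up for illustration - this is not 3FS's actual
| API or any real inference framework):
|
|     import os
|     import torch
|
|     class TieredKVCache:
|         # Hot/warm/cold placement: GPU RAM, then host RAM, then SSD.
|         def __init__(self, gpu_budget=1024, ram_budget=8192, ssd_dir="/tmp/kv"):
|             self.gpu, self.ram = {}, {}
|             self.gpu_budget, self.ram_budget = gpu_budget, ram_budget
|             self.ssd_dir = ssd_dir
|             os.makedirs(ssd_dir, exist_ok=True)
|
|         def put(self, block_id, kv_block):
|             if torch.cuda.is_available() and len(self.gpu) < self.gpu_budget:
|                 self.gpu[block_id] = kv_block.cuda()   # hot tier
|             elif len(self.ram) < self.ram_budget:
|                 self.ram[block_id] = kv_block.cpu()    # warm tier
|             else:
|                 # cold tier: spill the block to disk
|                 torch.save(kv_block.cpu(), os.path.join(self.ssd_dir, f"{block_id}.pt"))
|
|         def get(self, block_id):
|             if block_id in self.gpu:
|                 return self.gpu[block_id]
|             if block_id in self.ram:
|                 return self.ram[block_id]
|             return torch.load(os.path.join(self.ssd_dir, f"{block_id}.pt"))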
| krasin wrote:
| > Is this some joke? They use Llama 2 7B? What year is it?
|
| They use Llama 2 to demonstrate that their compression method
| works. There are two potential cases:
|
| 1. The method works on all / most LLMs. In this case, it does
| not matter which model they demonstrated the effect on.
|
| 2. The method only works on Llama 2, but not on other models.
| Given that they published the code, I expect that people will
| quickly test the method on many other models, so we will know
| soon. And yet - there would be scientific significance even if
| it only works on Llama 2, as it would mean there's something
| special and good about that architecture.
|
| But I would bet it's #1 - the method works on most models and
| they just picked whatever they already had code bindings to,
| to save the effort.
| kristianp wrote:
| Code at https://github.com/SusCom-Lab/ZeroMerge
| hinkley wrote:
| This feels like something that could be done in part by hand.
| We store documents in a KV store that are often built
| deterministically by merging two or three pieces of data, one
| of which is a form of string interpolation (e.g., templates).
|
| Effectively, if you had a microservice that did extremely light
| data processing, and you moved the KV store behind it instead
| of in front of it, you'd achieve a similar aim. A small cache
| in front of it, or even at the upstream, would reduce the
| calculations in the face of thundering herds.
| vlovich123 wrote:
| This sounds like you're thinking of KV as in a key-value store
| like Redis or S3. This paper is about the KV cache in an LLM,
| which exists to reduce the computational complexity of
| self-attention. It has nothing to do with what you wrote,
| unless I misunderstood you (I'm confused about what "upstream"
| would mean here - the contents of the KV cache are specific to
| the context provided to the LLM / what the LLM is generating
| in response).
___________________________________________________________________
(page generated 2025-03-27 23:01 UTC)