[HN Gopher] Show HN: KVSplit - Run 2-3x longer contexts on Apple...
___________________________________________________________________
Show HN: KVSplit - Run 2-3x longer contexts on Apple Silicon
I discovered that in LLM inference, keys and values in the KV cache
have very different quantization sensitivities. Keys need higher
precision than values to maintain quality. I patched llama.cpp to
enable different bit-widths for keys vs. values on Apple Silicon.

The results are surprising:

- K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only
  0.86% perplexity loss
- K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06%
  perplexity loss
- The configurations use the same number of bits, but K8V4 is 7x
  better for quality

This means you can run LLMs with 2-3x longer context on the same
Mac. Memory usage scales with sequence length, so savings compound
as context grows.

Implementation was straightforward:

1. Added --kvq-key and --kvq-val flags to llama.cpp
2. Applied existing quantization logic separately to K and V tensors
3. Validated with perplexity metrics across context lengths
4. Used Metal for acceleration (with -mlong-calls flag to avoid
   vectorization issues)

Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context
windows. Compatible with Metal/MPS and optimized for Apple Silicon.

GitHub: https://github.com/dipampaul17/KVSplit
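
For intuition on why the two equal-bit configurations behave so
differently, here is a toy numpy sketch (an illustration, not the
patch or llama.cpp's code path): quantization error on the keys
perturbs the attention logits, and the softmax then redistributes
weight toward the wrong tokens, whereas error on the values never
moves the weights at all and only blurs the weighted sum they
produce.

    import numpy as np

    def fake_quant(x, bits):
        # symmetric per-row round-to-nearest, then dequantize
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
        return np.round(x / scale) * scale

    rng = np.random.default_rng(0)
    d, n = 64, 256                     # head_dim, cached tokens
    q = rng.standard_normal((1, d))    # one query vector
    K = rng.standard_normal((n, d))    # cached keys

    def attn_weights(K):
        logits = q @ K.T / np.sqrt(d)
        w = np.exp(logits - logits.max())
        return w / w.sum()

    w_ref = attn_weights(K)
    for bits in (8, 4):
        drift = np.abs(attn_weights(fake_quant(K, bits)) - w_ref).sum()
        print(f"{bits}-bit keys: weight drift {drift:.4f}")

The same drift never happens when only V is quantized, which is
consistent with keys deserving the higher-precision half of the
bit budget.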
Author : dipampaul17
Score : 157 points
Date : 2025-05-16 20:04 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| nico wrote:
| Great work. This seems very interesting, but I need something
| slightly more high level to relate to it
|
| Will it just allow me to run let's say a model with a 2048 token
| context window with a 4-6k context window? Or a 128k model (like
| gemma3) with a 256k+ context window?
|
| What's the ideal use case for local models?
|
| Thank you
| dipampaul17 wrote:
| With the K8V4 configuration providing 59% memory savings, you
| can effectively run contexts 2.4x longer on the same hardware.
| A model with a 2048 token context can now handle about 5000
| tokens, while an 8K context model can reach approximately 19.5K
| tokens.
|
| In practical terms, this means processing entire books at once
| on a MacBook, analyzing large codebases without splitting
| files, or maintaining comprehensive conversation history in
| chat applications.
|
| The memory savings scale linearly with context length - the
| longer your context window, the more absolute memory you save.
| On my M4 MacBook with 8K context, I reduced KV cache from 176MB
| to 72MB. At 128K context, that same percentage saving would
| free up gigabytes.
|
| This optimization is most valuable when you're context-window
| limited rather than model-parameter limited. If you're hitting
| OOM errors due to long inputs rather than large model weights,
| KVSplit directly addresses your bottleneck.
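|
| Those numbers can be sanity-checked with a quick back-of-the-
| envelope sketch (an illustration, not the patch itself),
| assuming TinyLlama-1.1B's published shape (22 layers, 4 KV
| heads, head_dim 64) and llama.cpp's Q8_0/Q4_0 block sizes (34
| and 18 bytes per 32 elements):
|
|     N_LAYERS, N_KV_HEADS, HEAD_DIM = 22, 4, 64
|     KV_DIM = N_KV_HEADS * HEAD_DIM
|     BYTES = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}
|
|     def cache_mib(n_tokens, k_type, v_type):
|         per_tok = N_LAYERS * KV_DIM * (BYTES[k_type]
|                                        + BYTES[v_type])
|         return n_tokens * per_tok / 2**20
|
|     fp16 = cache_mib(8192, "f16", "f16")    # ~176 MiB
|     k8v4 = cache_mib(8192, "q8_0", "q4_0")  # ~71.5 MiB
|     print(fp16, k8v4, fp16 / k8v4)          # ratio ~2.46x
|
| The ratio of the two footprints is where the ~2.4x context
| multiplier comes from: 2048 tokens of FP16 cache buy roughly
| 5000 tokens of K8V4 cache in the same memory budget.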
| kmacdough wrote:
| > Will it just allow me to run let's say a model with a 2048
| token context window with a 4-6k context window
|
| It reduces the memory footprint of a particular model. You can
| do what you like with that. Extending the context window post-
| training isn't trivial, so unless you know what you're doing,
| you'd be better off finding a model trained on a larger context
| window.
|
| There are many uses for local models, like working offline or
| privacy/security. Most folks, though, are using them to
| experiment with tweaking models.
| nico wrote:
| Will that make the model run/feel faster?
|
| I can run models with 30-40b parameters on my computer, but
| they feel a lot slower than the 1-7b ones
|
| So would this make the 30-40b parameter models run faster? Or
| at least "feel" faster?
| badmonster wrote:
| I'm curious: is it possible to apply differentiated KV
| quantization (like K8V4) to models after they're already
| converted to .gguf format, or does this require rebuilding the
| model with special support? If it's compatible with any .gguf
| file, are there any limitations on model types (e.g. Mistral,
| Phi-3, etc.) or tokenizer configs?
| dipampaul17 wrote:
| Yes, that's one of the key benefits - KVSplit works with any
| existing .gguf model without requiring reconstruction or
| special conversion. The quantization happens at runtime on the
| KV cache, not during model loading or conversion.
|
| This works because the KV cache is created during inference as
| tokens are processed, completely separate from the model
| weights themselves. The --kvq-key and --kvq-val flags simply
| tell llama.cpp how to store these intermediate tensors in
| memory.
|
| I've tested it successfully with:
|
| - Llama-3 models
| - Mistral models
| - Phi-2/Phi-3
| - TinyLlama
| - Qwen variants
|
| The only limitation is that it requires llama.cpp's Metal
| backend, and you need to disable Flash Attention with -fa 0
| since the current FA implementation in llama.cpp bypasses the
| custom KV cache format. The technique itself should work with
| any transformer architecture that uses a standard attention
| mechanism.
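|
| To make the runtime aspect concrete, here is a toy sketch in
| numpy (conceptual only -- not llama.cpp's Q8_0/Q4_0 layout or
| its code path) of a cache that is filled one token at a time
| and stores K and V at independently chosen precisions:
|
|     import numpy as np
|
|     class ToyKVCache:
|         def __init__(self, k_bits=8, v_bits=4):
|             self.k_bits, self.v_bits = k_bits, v_bits
|             self.k_rows, self.v_rows = [], []
|
|         @staticmethod
|         def _quant(row, bits):
|             # per-row round-to-nearest, kept as int8 codes
|             qmax = 2 ** (bits - 1) - 1
|             scale = np.abs(row).max() / qmax
|             codes = np.round(row / scale).astype(np.int8)
|             return codes, scale
|
|         def append(self, k_row, v_row):
|             # called once per generated token, at inference time
|             self.k_rows.append(self._quant(k_row, self.k_bits))
|             self.v_rows.append(self._quant(v_row, self.v_bits))
|
|         def dequant(self):
|             k = np.stack([c * s for c, s in self.k_rows])
|             v = np.stack([c * s for c, s in self.v_rows])
|             return k, v
|
|     cache = ToyKVCache(k_bits=8, v_bits=4)   # the K8V4 idea
|     rng = np.random.default_rng(0)
|     for _ in range(16):                      # 16 tokens arrive
|         cache.append(rng.standard_normal(64),
|                      rng.standard_normal(64))
|     K, V = cache.dequant()
|     print(K.shape, V.shape)                  # (16, 64) (16, 64)
|
| Nothing here touches the model weights, which is why any
| existing .gguf works unchanged.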
| entrepy123 wrote:
| Are these significantly faster/better on 64GB or 128GB Apple
| silicon (over 36GB or 48GB)?
|
| I've been reading that large contexts and large models are just
| painfully slow, even on the fastest and largest Apple silicon
| that money can buy.
|
| So I wonder if this helps make more use of greater memory, or if
| really smallish models are still where it's at for Apple silicon,
| practically speaking.
| behnamoh wrote:
| Is this patch possible to do on MLX? I'm getting better speeds on
| MLX. That, combined with your approach, would finally let Mac
| users have long conversations at usable speeds.
| matheist wrote:
| Looks interesting! Is there any intuition for why this should be
| the case? Did you discover it via that intuition, or just random
| experimentation?
|
| A note: your install script appears to still have a placeholder
| at the "apply patch" step. A suggestion: it might be more user-
| friendly to fork llama.cpp and include it as a git submodule
| rather than make it a "git clone and apply patch" step.
|
| A further note: everyone and their dog has a different local
| Python setup, so it might be nice to let people separate the
| llama.cpp stuff from the Python stuff rather than bake in a
| dependence on Homebrew Python.
| ondra wrote:
| Is this any different from using --cache-type-k and --cache-
| type-v?
| azinman2 wrote:
| That's what I want to know!
| smcleod wrote:
| +0.86% perplexity is quite a bit at such a small context size
| though, isn't it? How is it at more reasonable context sizes
| like 64-128k?
| nomel wrote:
| > This means you can run LLMs with 2-3x longer context on the
| same Mac. Memory usage scales with sequence length, so savings
| compound as context grows.
|
| The point seems to be that this reduces memory footprint. This
| makes it _possible_ to run longer context, for the same limited
| memory, if you couldn't before. Or, you can use that free
| memory to do something else, like an IDE.
| 3abiton wrote:
| This is a brilliant idea and initiative. Does this also apply
| to GPUs? And I assume it should be compatible with other
| quantization techniques, though they'd probably require their
| own patches?
___________________________________________________________________
(page generated 2025-05-16 23:00 UTC)