[HN Gopher] Show HN: KVSplit - Run 2-3x longer contexts on Apple...
       ___________________________________________________________________
        
       Show HN: KVSplit - Run 2-3x longer contexts on Apple Silicon
        
       I discovered that in LLM inference, keys and values in the KV cache
       have very different quantization sensitivities. Keys need higher
       precision than values to maintain quality. I patched llama.cpp to
       enable different bit-widths for keys vs. values on Apple Silicon.
       The results are surprising:

       - K8V4 (8-bit keys, 4-bit values): 59% memory reduction with only
         0.86% perplexity loss
       - K4V8 (4-bit keys, 8-bit values): 59% memory reduction but 6.06%
         perplexity loss
       - Both configurations use the same number of bits, yet K8V4 shows
         roughly 7x less perplexity degradation

       This means you can run LLMs with 2-3x longer context on the same
       Mac. Memory usage scales with sequence length, so savings compound
       as context grows.

       The implementation was straightforward:

       1. Added --kvq-key and --kvq-val flags to llama.cpp
       2. Applied the existing quantization logic separately to the K and
          V tensors
       3. Validated with perplexity metrics across context lengths
       4. Used Metal for acceleration (with the -mlong-calls flag to
          avoid vectorization issues)

       Benchmarked on an M4 MacBook Pro running TinyLlama with 8K context
       windows. Compatible with Metal/MPS and optimized for Apple Silicon.

       GitHub: https://github.com/dipampaul17/KVSplit
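
       A back-of-the-envelope sketch of where the ~59% figure comes from,
       assuming Q8_0/Q4_0-style storage for the quantized cache (a 2-byte
       scale per 32-element block, so roughly 8.5 and 4.5 bits per
       element) and a TinyLlama-1.1B-shaped model; the storage-format
       details are assumptions, not something spelled out above:

         # Rough KV-cache sizing. Assumed: FP16 baseline, and
         # Q8_0/Q4_0-style quantization at ~8.5 / ~4.5 bits per element
         # (2-byte scale per 32-element block).
         def kv_cache_mib(seq_len, n_layers, n_kv_heads, head_dim,
                          key_bits=16.0, val_bits=16.0):
             """Approximate KV cache size in MiB for one sequence."""
             elems = seq_len * n_layers * n_kv_heads * head_dim  # per K or V
             return elems * (key_bits + val_bits) / 8 / 1024 / 1024

         # TinyLlama-1.1B shape: 22 layers, 4 KV heads (GQA), head_dim 64
         shape = dict(n_layers=22, n_kv_heads=4, head_dim=64)
         fp16 = kv_cache_mib(8192, **shape)
         k8v4 = kv_cache_mib(8192, **shape, key_bits=8.5, val_bits=4.5)
         print(f"FP16: {fp16:.0f} MiB  K8V4: {k8v4:.0f} MiB  "
               f"saving: {1 - k8v4 / fp16:.0%}")
         # FP16: 176 MiB  K8V4: 72 MiB  saving: 59%

       Because the element count grows linearly with seq_len, the absolute
       savings keep growing as the context gets longer.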
        
       Author : dipampaul17
       Score  : 157 points
       Date   : 2025-05-16 20:04 UTC (2 hours ago)
        
        
       | nico wrote:
       | Great work. This seems very interesting, but I need something
       | slightly more high level to relate to it
       | 
       | Will it just allow me to run let's say a model with a 2048 token
       | context window with a 4-6k context window? Or a 128k model (like
       | gemma3) with a 256k+ context window?
       | 
       | What's the ideal use case for local models?
       | 
       | Thank you
        
         | dipampaul17 wrote:
         | With the K8V4 configuration providing 59% memory savings, you
         | can effectively run contexts 2.4x longer on the same hardware.
         | A model with a 2048 token context can now handle about 5000
         | tokens, while an 8K context model can reach approximately 19.5K
         | tokens.
         | 
         | In practical terms, this means processing entire books at once
         | on a MacBook, analyzing large codebases without splitting
         | files, or maintaining comprehensive conversation history in
         | chat applications.
         | 
         | The memory savings scale linearly with context length - the
         | longer your context window, the more absolute memory you save.
         | On my M4 MacBook with 8K context, I reduced KV cache from 176MB
         | to 72MB. At 128K context, that same percentage saving would
         | free up gigabytes.
         | 
         | This optimization is most valuable when you're context-window
         | limited rather than model-parameter limited. If you're hitting
         | OOM errors due to long inputs rather than large model weights,
         | KVSplit directly addresses your bottleneck.
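         | 
         | A quick check of that arithmetic (taking the KV cache, not the
         | model weights, as the binding memory constraint):
         | 
         |   # context-extension factor implied by a 59% KV-cache saving
         |   savings = 0.59
         |   factor = 1 / (1 - savings)              # ~2.44x
         |   for base_ctx in (2048, 8192):
         |       print(base_ctx, "->", int(base_ctx * factor), "tokens")
         |   # 2048 -> 4995 tokens, 8192 -> 19980 tokens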
        
         | kmacdough wrote:
         | > Will it just allow me to run let's say a model with a 2048
         | token context window with a 4-6k context window
         | 
         | It reduces the memory footprint of a particular model. You can
         | do what you like with that. Extending the context window post-
         | training isn't trivial, so unless you know what you're doing,
         | you'd be better off finding a model trained on a larger context
         | window.
         | 
         | There are many uses for local models, like working offline or
         | privacy/security. Most folks, though, are using them to
         | experiment with tweaking models.
        
           | nico wrote:
           | Will that make the model run/feel faster?
           | 
           | I can run models with 30-40b parameters on my computer, but
           | they feel a lot slower than the 1-7b ones
           | 
           | So would this make the 30-40b parameter models run faster? Or
           | at least "feel" faster?
        
       | badmonster wrote:
       | I'm curious: is it possible to apply differentiated KV
       | quantization (like K8V4) to models after they're already
       | converted to .gguf format, or does this require rebuilding the
       | model with special support? If it's compatible with any .gguf
       | file, are there any limitations on model types (e.g. Mistral,
       | Phi-3, etc.) or tokenizer configs?
        
         | dipampaul17 wrote:
         | Yes, that's one of the key benefits - KVSplit works with any
         | existing .gguf model without requiring reconstruction or
         | special conversion. The quantization happens at runtime on the
         | KV cache, not during model loading or conversion.
         | 
         | This works because the KV cache is created during inference as
         | tokens are processed, completely separate from the model
         | weights themselves. The --kvq-key and --kvq-val flags simply
         | tell llama.cpp how to store these intermediate tensors in
         | memory.
         | 
         | I've tested it successfully with:
         | 
         | - Llama-3 models
         | - Mistral models
         | - Phi-2/Phi-3
         | - TinyLlama
         | - Qwen variants
         | 
         | The only limitation is that it requires llama.cpp's Metal
         | backend, and you need to disable Flash Attention with -fa 0
         | since the current FA implementation in llama.cpp bypasses the
         | custom KV cache format. The technique itself should work with
         | any transformer architecture that uses a standard attention
         | mechanism.
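         | 
         | For concreteness, a minimal invocation sketch: the --kvq-key,
         | --kvq-val and -fa 0 flags are the ones described above, while
         | the binary name, model path and remaining arguments are just
         | placeholders for illustration:
         | 
         |   # Hypothetical run of the patched llama.cpp with K8V4.
         |   # Only --kvq-key/--kvq-val and -fa come from this thread;
         |   # the paths and other arguments are illustrative.
         |   import subprocess
         | 
         |   subprocess.run([
         |       "./build/bin/llama-cli",        # patched llama.cpp build
         |       "-m", "models/tinyllama.gguf",  # any existing .gguf file
         |       "--kvq-key", "8",               # 8-bit keys
         |       "--kvq-val", "4",               # 4-bit values
         |       "-fa", "0",                     # Flash Attention disabled
         |       "-c", "8192",                   # context length
         |       "-p", "Summarize this file:",
         |   ], check=True)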
        
       | entrepy123 wrote:
       | Are these significantly faster/better on 64GB or 128GB Apple
       | silicon (over 36GB or 48GB)?
       | 
       | I've been reading that large contexts and large models are just
       | painfully slow, even on the fastest and largest Apple silicon
       | that money can buy.
       | 
       | So I wonder if this helps make more use of greater memory, or if
       | really smallish models are still where it's at for Apple silicon,
       | practically speaking.
        
       | behnamoh wrote:
       | Is this patch possible to do on MLX? I'm getting better speeds on
       | MLX. That, combined with your approach, would finally let Mac
       | users have long conversations at usable speeds.
        
       | matheist wrote:
       | Looks interesting! Is there any intuition for why this should be
       | the case? Did you discover it via that intuition, or just random
       | experimentation?
       | 
       | A note, your install script appears to still have a placeholder
       | at the "apply patch" step. A suggestion, might be more user-
       | friendly to fork llama.cpp and then include that as a git
       | submodule rather than make it a "git clone and apply patch" step.
       | 
       | A further note, everyone and their dog has a different local
       | python set-up, might be nice to let people separate the llama.cpp
       | stuff from the python stuff rather than bake in a dependence on
       | homebrew python.
        
       | ondra wrote:
       | Is this any different from using --cache-type-k and --cache-
       | type-v?
        
         | azinman2 wrote:
         | That's what I want to know!
        
       | smcleod wrote:
       | +0.86% perplexity is quite a bit at such a small context size
       | though, isn't it? How is it at more reasonable context sizes like
       | 64-128k?
        
         | nomel wrote:
         | > This means you can run LLMs with 2-3x longer context on the
         | same Mac. Memory usage scales with sequence length, so savings
         | compound as context grows.
         | 
         | The point seems to be that this reduces memory footprint. This
         | makes it _possible_ to run longer context, for the same limited
         | memory, if you couldn't before. Or, you can use that free
         | memory to do something else, like an IDE.
        
       | 3abiton wrote:
       | This is a brilliant idea and initiative. Does this also apply to
       | GPUs? And I assume it should be compatible with other quantization
       | techniques, although they'd probably require their own patches?
        
       ___________________________________________________________________
       (page generated 2025-05-16 23:00 UTC)