[HN Gopher] Bringing K/V context quantisation to Ollama
       ___________________________________________________________________
        
       Bringing K/V context quantisation to Ollama
        
       Author : mchiang
       Score  : 213 points
       Date   : 2024-12-05 01:40 UTC (21 hours ago)
        
 (HTM) web link (smcleod.net)
 (TXT) w3m dump (smcleod.net)
        
       | smcleod wrote:
       | Shout out to everyone from Ollama and the wider community that
       | helped with the reviews, feedback and assistance along the way.
       | It's great to contribute to such a fantastic project.
        
         | octocop wrote:
         | shout out to llama.cpp
        
       | satvikpendem wrote:
        | What's the best way to use Ollama with a GUI - just OpenWebUI?
        | Are there any options for mobile platforms like Android as well
        | (or can we even run LLMs on a phone in the first place)?
        
         | sadeshmukh wrote:
          | A lot of the UIs, including OpenWebUI, can be exposed over the
          | LAN with user accounts - that's what I did to use my GPU while
          | still being on my phone. Not entirely sure about native UIs
          | though.
         | 
         | Also, I normally use Groq's (with a q) API since it's really
         | cheap with no upfront billing info required - it's a whole
         | order of magnitude cheaper iirc than OpenAI/Claude. They
         | literally have a /openai endpoint if you need compatibility.
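          | 
          | A minimal sketch of hitting that OpenAI-compatible endpoint
          | (the model name here is just a placeholder):
          | 
          |     curl https://api.groq.com/openai/v1/chat/completions \
          |       -H "Authorization: Bearer $GROQ_API_KEY" \
          |       -H "Content-Type: application/json" \
          |       -d '{"model": "llama-3.1-8b-instant",
          |            "messages": [{"role": "user", "content": "Hi"}]}'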
         | 
         | You can look in the direction of Google's Gemma if you need a
         | lightweight open weights LLM - there was something there that I
         | forgot.
        
         | smcleod wrote:
          | I personally use a mix of Open WebUI, Big AGI, BoltAI and
          | AnythingLLM on the desktop. The mobile space is where things
          | are really lacking at the moment; I just end up browsing to
          | Open WebUI, but that's not ideal. I'd love a native iOS client
          | that's well integrated into Siri, Shortcuts, Sharing etc...
        
         | qudat wrote:
          | For hosting a web GUI for Ollama I use https://tuns.sh
          | 
          | It's really convenient because it's just an SSH tunnel, and
          | then you get automatic TLS and it protects your home IP.
          | 
          | With that you can access it from your mobile phone - you just
          | have to require a password to access it.
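          | 
          | The gist of it (a sketch only - the exact remote-forward
          | syntax and subdomain naming are per tuns.sh's docs, and the
          | local web UI port 3000 is an assumption):
          | 
          |     # reverse-tunnel the local web UI out through tuns.sh,
          |     # which gives you a public HTTPS hostname
          |     ssh -R myui:80:localhost:3000 tuns.sh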
        
           | satvikpendem wrote:
            | I'm running OpenWebUI via Docker via OrbStack; it also
            | automatically provides TLS and works pretty well.
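            | 
            | Roughly the usual quick-start invocation (a sketch - the
            | image tag, port mapping and volume name follow Open WebUI's
            | docs and may differ for your setup):
            | 
            |     docker run -d -p 3000:8080 \
            |       -v open-webui:/app/backend/data \
            |       --name open-webui \
            |       ghcr.io/open-webui/open-webui:main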
        
           | minwidth0px wrote:
           | I wrote my own UI[0] that connects over WebRTC using
           | Smoke.[1]
           | 
           | [0] https://github.com/minwidth0px/gpt-playground and
           | https://github.com/minwidth0px/Webrtc-NAT-Traversal-Proxy-
           | Se...
           | 
           | [1] https://github.com/sinclairzx81/smoke
        
         | huijzer wrote:
         | I have Open WebUI on a Hetzner instance connected to Deep
         | Infra. Works on mobile by turning the web page into an app. I
         | find the web framework that WebUI uses quite bloated/slow, but
         | apart from that it does work reliably. Price at Deep Infra is
         | typically about $0.04 per month even when actively asking lots
         | of questions during programming.
        
         | paradite wrote:
         | I built a custom GUI for coding tasks specifically, with built-
         | in code context management and workspaces:
         | 
         | https://prompt.16x.engineer/
         | 
         | Should work well if you have 64G vRAM to run SOTA models
         | locally.
        
           | throwaway314155 wrote:
           | > Should work well if you have 64G vRAM to run SOTA models
           | locally.
           | 
           | Does anyone have this?
           | 
           | edit: Ah, it's a Mac app.
        
             | paradite wrote:
             | Yeah Mac eats Windows on running LLMs.
             | 
              | My app does support Windows though; you can connect to
              | OpenAI, Claude, OpenRouter, Azure and other 3rd-party
              | providers. It's just that running SOTA LLMs locally can be
              | challenging.
        
             | gzer0 wrote:
             | M4 Max with 128 GB RAM here. ;) Love it. A very expensive
             | early Christmas present.
        
           | accrual wrote:
           | Great looking GUI, I find simple black/white/boxy/monospace
           | UIs very effective.
        
         | rkwz wrote:
         | If you're using a Mac, I've built a lightweight native app -
         | https://github.com/sheshbabu/Chital
        
           | antirez wrote:
           | That's _very_ cool, finally a native app that runs fast.
           | Thanks.
        
             | rkwz wrote:
             | Thanks for the kind words :)
        
         | magicalhippo wrote:
          | As a Windows user who just wanted something bare bones to play
          | with, I found this[1] small project useful. It does support
          | multi-modal models, which is nice.
         | 
         | [1]: https://github.com/jakobhoeg/nextjs-ollama-llm-ui
        
         | seb314 wrote:
          | For running LLMs _locally_ on Android, there's "PocketPal"
          | (~7 tok/s on a Pixel 7 Pro for some quant of Llama 3.2 3B).
          | 
          | (Not sure if it uses Ollama though)
        
         | vunderba wrote:
          | As far as open source goes, I'd probably recommend LibreChat.
          | It has connections for Ollama, OpenAI, Anthropic, etc. It lets
          | you set up auth so you can theoretically use it from anywhere
          | (phone, etc.).
          | 
          | Fair warning: it's relatively heavyweight insofar as it has to
          | spin up a number of Docker containers, but it works very well.
         | 
         | https://github.com/danny-avila/LibreChat
        
         | zerop wrote:
          | There are many; apart from what others mentioned, I am
          | exploring AnythingLLM - https://anythingllm.com/. I liked the
          | workspace concept in it: you can group documents into
          | workspaces, and the RAG scope is managed per workspace.
        
       | wokwokwok wrote:
       | Nice.
       | 
       | That said... I mean...
       | 
       | > The journey to integrate K/V context cache quantisation into
       | Ollama took around 5 months.
       | 
       | ??
       | 
        | They incorrectly tagged #7926, which is a 2-line change, instead
        | of #6279, where it was implemented. That made me dig a bit
        | deeper, and reading the actual change it seems:
       | 
        | The commit (1) is:
        | 
        |     > params := C.llama_context_default_params()
        |     > ...
        |     > params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
        |     > params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
       | 
        | Which has been part of llama.cpp since Dec 7, 2023 (2).
       | 
       | So... mmmm... while this is great, somehow I'm left feeling kind
       | of vaguely put-off by the comms around what is really 'we finally
       | support some config flag from llama.cpp that's been there for
        | really _quite a long time_'.
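        | 
        | For reference, the knob in question: llama.cpp's server takes
        | the K/V cache types as command-line flags, while Ollama
        | surfaces them as server environment variables. Roughly (a
        | sketch - the model file and quant choice are placeholders):
        | 
        |     # llama.cpp: set per launch
        |     ./llama-server -m model.gguf -fa \
        |       --cache-type-k q8_0 --cache-type-v q8_0
        | 
        |     # Ollama: set once on the server, per the post
        |     OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve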
       | 
       | > It took 5 months, but we got there in the end.
       | 
       | ... I guess... yay? The challenges don't seem like they were
       | technical, but I guess, good job getting it across the line in
       | the end?
       | 
       | [1] -
       | https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...
       | 
       | [2] - since
       | https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...
        
         | meesles wrote:
          | The author describes why it took as long as it did in the
          | post, so I don't think they're trying to be disingenuous.
          | Getting minor changes merged upstream in large projects is
          | difficult for newer concepts, since you need adoption and
          | support.
          | 
          | The full release seems to contain more code [1], and the
          | author references the llama.cpp pre-work and its author as
          | well.
          | 
          | This person is also not a core contributor, so this reads as a
          | hobbyist and fan of AI dev who is writing about their work.
          | Nothing to be ashamed of IMO.
         | 
         | [1] -
         | https://github.com/ollama/ollama/compare/v0.4.7...v0.4.8-rc0
        
           | smcleod wrote:
           | > this reads as a hobbyist and fan of AI dev that is writing
           | about their work
           | 
           | Bingo, that's me!
           | 
           | I suspect the OP didn't actually read the post.
           | 
           | 1. As you pointed out, it's about getting the feature
           | working, enabled and contributed into Ollama, not in
           | llama.cpp
           | 
            | 2. Digging through git commits isn't useful when you work
            | hard to squash commits before merging a PR; there were a
            | _lot_ over the last 5 months.
            | 
            | 3. While I'm not a Go dev (and the introduction of cgo part
            | way through threw me a bit), there certainly were
            | technicalities along the way. I suspect they not only didn't
            | bother to read the post, they also didn't bother to read the
            | PR.
           | 
            | Also, just to clarify - I didn't even share this here; it's
            | just my personal blog of things I did, kept so I remember
            | them when I look back years later.
        
         | smcleod wrote:
         | I'm going to be generous here and assume you didn't bother to
          | actually read the post (or even the PR) before writing a snarky,
         | non-constructive comment, but skimming through your HN comment
         | history this appears to be on-brand.
        
           | wokwokwok wrote:
            | I'll be generous and just say: maybe people should use
            | llama.cpp and not Ollama if they care about having nice
            | things, if merging support for existing features is that
            | difficult.
           | 
           | It seems like it's probably a better choice overall.
           | 
           | That said, I'm sure people worked very hard on this, and it's
           | nice to see it as a part of ollama for the people that use
           | it.
           | 
           | Also:
           | 
           | > Please don't comment on whether someone read an article.
           | "Did you even read the article? It mentions that" can be
           | shortened to "The article mentions that".
           | 
           | https://news.ycombinator.com/newsguidelines.html
        
             | smcleod wrote:
              | I'm not sure what kind of vendetta you have against
              | Ollama, but I'll paste here what I've written before when
              | I've heard claims along the lines of "Ollama is just a
              | wrapper for llama.cpp":
             | 
              | With llama.cpp running on a machine, how do you connect
              | your LLM clients to it and request that a model gets
              | loaded with a given set of parameters and templates?
              | 
              | ... you can't, because llama.cpp is the inference engine -
              | and its bundled llama-server binary only provides
              | relatively basic server functionality - it's really more
              | of a demo/example or MVP.
              | 
              | llama.cpp is all configured at the time you run the binary
              | and manually provide it command-line args for the one
              | specific model and configuration you start it with.
             | 
              | Ollama provides a server and client for interfacing with
              | and packaging models, such as:
              | 
              |     - Hot loading models (e.g. when you request a model
              |       from your client, Ollama will load it on demand).
              |     - Automatic model parallelisation.
              |     - Automatic model concurrency.
              |     - Automatic memory calculations for layer and
              |       GPU/CPU placement.
              |     - Layered model configuration (basically docker
              |       images for models).
              |     - Templating and distribution of model parameters
              |       and templates in a container image.
              |     - A near feature-complete OpenAI-compatible API, as
              |       well as its native API that supports more advanced
              |       features such as model hot loading, context
              |       management, etc...
              |     - Native libraries for common languages.
              |     - Official container images for hosting.
              |     - A client/server model for running remote or local
              |       inference servers with either Ollama or
              |       OpenAI-compatible clients.
              |     - Support for both official and self-hosted model
              |       and template repositories.
              |     - Support for multi-modal / vision LLMs - something
              |       llama.cpp is not focusing on providing currently.
              |     - Support for serving safetensors models, as well as
              |       running and creating models directly from their
              |       Huggingface model ID.
              |     - In addition to the llama.cpp engine, Ollama are
              |       working on adding additional model backends.
             | 
             | Ollama is not "better" or "worse" than llama.cpp because
             | it's an entirely different tool.
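              | 
              | To make the contrast concrete (a rough sketch - binary
              | names, model files and ports are illustrative, not
              | exact):
              | 
              |     # llama.cpp: one model, fixed by flags at launch
              |     ./llama-server -m llama-3.1-8b-q4_k_m.gguf \
              |       -c 8192 --port 8080
              | 
              |     # Ollama: the server runs once; clients load models
              |     # on demand through the API
              |     curl http://localhost:11434/api/generate \
              |       -d '{"model": "llama3.2", "prompt": "Hello"}'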
        
         | guywhocodes wrote:
          | This is par for the course for Ollama - look at the log_probs
          | issues/PRs and you get an idea of how well Ollama is run.
         | 
         | Ollama is IMO a model downloader for llama.cpp so you can do
         | roleplay with ease.
        
         | yard2010 wrote:
         | "Judge not, that ye be not judged"
        
       | lastdong wrote:
       | Great project! Do you think there might be some advantages to
       | bringing this over to LLaMA-BitNet?
        
       ___________________________________________________________________
       (page generated 2024-12-05 23:01 UTC)