[HN Gopher] Bringing K/V context quantisation to Ollama
___________________________________________________________________
Bringing K/V context quantisation to Ollama
Author : mchiang
Score : 213 points
Date : 2024-12-05 01:40 UTC (21 hours ago)
(HTM) web link (smcleod.net)
(TXT) w3m dump (smcleod.net)
| smcleod wrote:
| Shout out to everyone from Ollama and the wider community that
| helped with the reviews, feedback and assistance along the way.
| It's great to contribute to such a fantastic project.
| octocop wrote:
| shout out to llama.cpp
| satvikpendem wrote:
| What's the best way to use Ollama with a GUI, just OpenWebUI? Any
| options as well for mobile platforms like Android (or, I don't
| even know if we can run LLMs on the phone in the first place).
| sadeshmukh wrote:
| A lot of the UIs, including OpenWebUI, can be exposed over the
| LAN with multiple users - that's what I did to use my GPU
| while still being on my phone. Not entirely sure about native
| UIs though.
|
| Also, I normally use Groq's (with a q) API since it's really
| cheap with no upfront billing info required - it's a whole
| order of magnitude cheaper iirc than OpenAI/Claude. They
| literally have a /openai endpoint if you need compatibility.
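|
| To illustrate what that compatibility buys you: any client that
| can POST an OpenAI-style chat/completions request should work.
| Here's a minimal Go sketch (the base URL and model name are
| assumptions - check the provider's docs for the real values):
|
|     package main
|
|     import (
|         "bytes"
|         "encoding/json"
|         "fmt"
|         "io"
|         "net/http"
|         "os"
|     )
|
|     func main() {
|         // Assumed OpenAI-compatible base URL and example model
|         // name; substitute whatever the provider documents.
|         payload, _ := json.Marshal(map[string]any{
|             "model": "llama-3.1-8b-instant",
|             "messages": []map[string]string{
|                 {"role": "user", "content": "Hello!"},
|             },
|         })
|         req, _ := http.NewRequest("POST",
|             "https://api.groq.com/openai/v1/chat/completions",
|             bytes.NewReader(payload))
|         req.Header.Set("Content-Type", "application/json")
|         req.Header.Set("Authorization",
|             "Bearer "+os.Getenv("GROQ_API_KEY"))
|
|         resp, err := http.DefaultClient.Do(req)
|         if err != nil {
|             panic(err)
|         }
|         defer resp.Body.Close()
|         body, _ := io.ReadAll(resp.Body)
|         fmt.Println(string(body)) // raw JSON completion response
|     }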
|
| You can look in the direction of Google's Gemma if you need a
| lightweight open weights LLM - there was something there that I
| forgot.
| smcleod wrote:
| I personally use a mix of Open WebUI, Big AGI, BoltAI,
| AnythingLLM on the desktop. The mobile space is where things
| are really lacking at the moment; really, I just end up browsing
| to Open WebUI, but that's not ideal. I'd love an iOS native
| client that's well integrated into Siri, Shortcuts, Sharing
| etc...
| qudat wrote:
| For hosting a web gui for ollama I use https://tuns.sh
|
| It's really convenient because it's just an SSH tunnel, and then
| you get automatic TLS and it protects your home IP.
|
| With that you can access it from your mobile phone, just gotta
| require a password to access it.
| satvikpendem wrote:
| I'm running OpenWebUI via Docker through OrbStack; it also
| automatically provides TLS and works pretty well.
| minwidth0px wrote:
| I wrote my own UI[0] that connects over WebRTC using
| Smoke.[1]
|
| [0] https://github.com/minwidth0px/gpt-playground and
| https://github.com/minwidth0px/Webrtc-NAT-Traversal-Proxy-
| Se...
|
| [1] https://github.com/sinclairzx81/smoke
| huijzer wrote:
| I have Open WebUI on a Hetzner instance connected to Deep
| Infra. Works on mobile by turning the web page into an app. I
| find the web framework that WebUI uses quite bloated/slow, but
| apart from that it does work reliably. Price at Deep Infra is
| typically about $0.04 per month even when actively asking lots
| of questions during programming.
| paradite wrote:
| I built a custom GUI for coding tasks specifically, with built-
| in code context management and workspaces:
|
| https://prompt.16x.engineer/
|
| Should work well if you have 64G vRAM to run SOTA models
| locally.
| throwaway314155 wrote:
| > Should work well if you have 64G vRAM to run SOTA models
| locally.
|
| Does anyone have this?
|
| edit: Ah, it's a Mac app.
| paradite wrote:
| Yeah Mac eats Windows on running LLMs.
|
| My app does support Windows though; you can connect to
| OpenAI, Claude, OpenRouter, Azure and other 3rd-party
| providers. It's just that running SOTA LLMs locally can be
| challenging.
| gzer0 wrote:
| M4 Max with 128 GB RAM here. ;) Love it. A very expensive
| early Christmas present.
| accrual wrote:
| Great looking GUI, I find simple black/white/boxy/monospace
| UIs very effective.
| rkwz wrote:
| If you're using a Mac, I've built a lightweight native app -
| https://github.com/sheshbabu/Chital
| antirez wrote:
| That's _very_ cool, finally a native app that runs fast.
| Thanks.
| rkwz wrote:
| Thanks for the kind words :)
| magicalhippo wrote:
| As a Windows user who just wanted something bare bones to play
| with, I found this[1] small project useful. It does support
| multi-modal models, which is nice.
|
| [1]: https://github.com/jakobhoeg/nextjs-ollama-llm-ui
| seb314 wrote:
| For running LLMs _locally_ on Android, there's "pocketpal"
| (~7 tok/s on a Pixel 7 Pro for some quant of Llama 3.2 3B).
|
| (Not sure if it uses ollama though)
| vunderba wrote:
| As far as open source goes, I'd probably recommend LibreChat.
| It has connections for ollama, openai, anthropic, etc. It lets
| you set up auth so you can theoretically use it from anywhere
| (phone, etc.).
|
| Fair warning: it's relatively heavyweight insofar as it has to
| spin up a number of Docker containers, but it works very well.
|
| https://github.com/danny-avila/LibreChat
| zerop wrote:
| There are many; apart from what others mentioned, I am exploring
| AnythingLLM - https://anythingllm.com/. I liked the workspace
| concept in it: you can group documents into workspaces and the
| RAG scope is managed per workspace.
| wokwokwok wrote:
| Nice.
|
| That said... I mean...
|
| > The journey to integrate K/V context cache quantisation into
| Ollama took around 5 months.
|
| ??
|
| They incorrectly tagged #7926, which is a 2-line change, instead
| of #6279 where it was implemented. That made me dig a bit
| deeper, and reading the actual change it seems:
|
| The commit (1) is:
|
|     params := C.llama_context_default_params()
|     ...
|     params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType))
|     params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType))
|
| where the two type_k / type_v assignments are what the change
| adds.
|
| Which has been part of llama.cpp since Dec 7, 2023 (2).
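|
| For context, kvCacheTypeFromStr is presumably just a small
| string-to-enum mapping onto ggml's quantisation types. A rough
| sketch of what it could look like - illustrative only, not the
| actual Ollama code, and it assumes the ggml headers are visible
| to cgo:
|
|     package llm
|
|     // #include "ggml.h"
|     import "C"
|
|     // kvCacheTypeFromStr maps a user-facing cache-type string
|     // (already lower-cased by the caller) onto ggml's type enum.
|     func kvCacheTypeFromStr(s string) C.enum_ggml_type {
|         switch s {
|         case "q8_0":
|             return C.GGML_TYPE_Q8_0
|         case "q4_0":
|             return C.GGML_TYPE_Q4_0
|         default:
|             return C.GGML_TYPE_F16 // unquantised K/V cache
|         }
|     }
|
| The returned value is then assigned to params.type_k and
| params.type_v before the llama.cpp context is created.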
|
| So... mmmm... while this is great, somehow I'm left feeling kind
| of vaguely put-off by the comms around what is really 'we finally
| support some config flag from llama.cpp that's been there for
| really _quite a long time_ '.
|
| > It took 5 months, but we got there in the end.
|
| ... I guess... yay? The challenges don't seem like they were
| technical, but I guess, good job getting it across the line in
| the end?
|
| [1] -
| https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...
|
| [2] - since
| https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...
| meesles wrote:
| The author describes why it took as long as it did in the post,
| so I don't think they're trying to be disingenuous. Getting minor
| changes merged upstream in large projects is difficult for
| newer concepts since you need adoption and support.
|
| The full release seems to contain more code[1], and the author
| references the llama.cpp pre-work and its author as well.
|
| This person is also not a core contributor, so this reads as a
| hobbyist and fan of AI dev that is writing about their work.
| Nothing to be ashamed of IMO.
|
| [1] -
| https://github.com/ollama/ollama/compare/v0.4.7...v0.4.8-rc0
| smcleod wrote:
| > this reads as a hobbyist and fan of AI dev that is writing
| about their work
|
| Bingo, that's me!
|
| I suspect the OP didn't actually read the post.
|
| 1. As you pointed out, it's about getting the feature
| working, enabled and contributed into Ollama, not in
| llama.cpp
|
| 2. Digging through git commits isn't useful when you work
| hard to squash commits before merging a PR; there were a
| _lot_ over the last 5 months.
|
| 3. While I'm not a Go dev (and the introduction of cgo
| partway through threw me a bit), there certainly were
| technicalities along the way. I suspect they not only didn't
| bother to read the post, they also didn't bother to read the
| PR.
|
| Also, just to clarify - I didn't even share this here; it's
| just my personal blog of things I've done, kept so I can
| remember them when I look back years later.
| smcleod wrote:
| I'm going to be generous here and assume you didn't bother to
| actually read the post (or even the PR) before writing a snarky,
| non-constructive comment, but skimming through your HN comment
| history this appears to be on-brand.
| wokwokwok wrote:
| I'll be generous and just say, maybe people should just use
| llama.cpp and not ollama if they care about having nice
| things, if merging support for existing features is that
| difficult.
|
| It seems like it's probably a better choice overall.
|
| That said, I'm sure people worked very hard on this, and it's
| nice to see it as a part of ollama for the people that use
| it.
|
| Also:
|
| > Please don't comment on whether someone read an article.
| "Did you even read the article? It mentions that" can be
| shortened to "The article mentions that".
|
| https://news.ycombinator.com/newsguidelines.html
| smcleod wrote:
| I'm not sure what kind of vendetta you have against Ollama,
| but I'll paste here what I've written before when I've heard
| claims along the lines of "Ollama is just a wrapper for
| llama.cpp":
|
| With llama.cpp running on a machine, how do you connect
| your LLM clients to it and request that a model be loaded
| with a given set of parameters and templates?
|
| ... you can't, because llama.cpp is the inference engine -
| and its bundled llama-cpp-server binary only provides
| relatively basic server functionality - it's really more of a
| demo/example or MVP.
|
| Llama.cpp is all configured at the time you run the binary
| and manually provide it command line args for the one
| specific model and configuration you start it with.
|
| Ollama provides a server and client for interfacing with and
| packaging models, such as:
|
|   - Hot loading models (e.g. when you request a model from
|     your client, Ollama will load it on demand).
|   - Automatic model parallelisation.
|   - Automatic model concurrency.
|   - Automatic memory calculations for layer and GPU/CPU
|     placement.
|   - Layered model configuration (basically Docker images for
|     models).
|   - Templating and distribution of model parameters and
|     templates in a container image.
|   - A near feature-complete OpenAI-compatible API as well as
|     its native API, which supports more advanced features
|     such as model hot loading, context management, etc.
|   - Native libraries for common languages.
|   - Official container images for hosting.
|   - A client/server model for running remote or local
|     inference servers with either Ollama or OpenAI-compatible
|     clients.
|   - Support for both official and self-hosted model and
|     template repositories.
|   - Support for multi-modal / vision LLMs - something that
|     llama.cpp is not focusing on providing currently.
|   - Support for serving safetensors models, as well as
|     running and creating models directly from their Hugging
|     Face model ID.
|
| In addition to the llama.cpp engine, Ollama are working on
| adding additional model backends.
|
| Ollama is not "better" or "worse" than llama.cpp because
| it's an entirely different tool.
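|
| To make the client/server point concrete, here's a minimal
| sketch of a client hitting the Ollama server's native
| /api/generate endpoint; the server loads the requested model
| on demand (the model name is just an example):
|
|     package main
|
|     import (
|         "bytes"
|         "encoding/json"
|         "fmt"
|         "net/http"
|     )
|
|     func main() {
|         // Ask the local Ollama server to run a model; it is
|         // loaded on demand if it isn't already resident.
|         reqBody, _ := json.Marshal(map[string]any{
|             "model":  "llama3.2",
|             "prompt": "Why is the sky blue?",
|             "stream": false,
|         })
|         resp, err := http.Post("http://localhost:11434/api/generate",
|             "application/json", bytes.NewReader(reqBody))
|         if err != nil {
|             panic(err)
|         }
|         defer resp.Body.Close()
|
|         // With "stream": false the server returns one JSON
|         // object whose "response" field holds the generated text.
|         var out struct {
|             Response string `json:"response"`
|         }
|         if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
|             panic(err)
|         }
|         fmt.Println(out.Response)
|     }
|
| The same server also exposes an OpenAI-compatible /v1 API, so
| existing OpenAI clients can be pointed at it unchanged.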
| guywhocodes wrote:
| This is par for the course for Ollama; look at the log_probs
| issues/PRs and you get an idea of how well Ollama is run.
|
| Ollama is IMO a model downloader for llama.cpp so you can do
| roleplay with ease.
| yard2010 wrote:
| "Judge not, that ye be not judged"
| lastdong wrote:
| Great project! Do you think there might be some advantages to
| bringing this over to LLaMA-BitNet?
___________________________________________________________________
(page generated 2024-12-05 23:01 UTC)