[HN Gopher] Ollama and gguf
___________________________________________________________________
Ollama and gguf
Author : indigodaddy
Score : 59 points
Date : 2025-08-11 17:54 UTC (5 hours ago)
| indigodaddy wrote:
| ggerganov explains the issue:
| https://github.com/ollama/ollama/issues/11714#issuecomment-3...
| magicalhippo wrote:
| I noticed it the other way: llama.cpp failed to load the
| Ollama-downloaded gpt-oss 20b model. Thought it was odd, given
| that all the others I tried worked fine.
|
| Figured it had to be Ollama doing Ollama things, seems that was
| indeed the case.
| polotics wrote:
| ggerganov is my hero, and... it's a good thing this got posted,
| because I saw in the comments that --flash-attn --cache-reuse
| 256 could help with my setup (M3 36GB + RPC to M1 16GB); an
| example invocation is sketched at the end of this comment.
| Figuring out which params to set, and at what values, is a lot
| of trial and error; Gemini does help a bit to clarify what
| params like top-k will do in practice. Still, the whole
| load-balancing with RPC is something I think I'll have to read
| the llama.cpp source to really understand (oops, I almost wrote
| grok, damn you Elon). Anyway, ollama still doesn't do
| distributed load, and yeah, I guess using it is a stepping
| stone...
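| [Example invocation for the setup described above; the model
| path and RPC address are placeholders, and the flags shown
| (--flash-attn, --cache-reuse, --rpc) should be checked against
| your build's llama-server --help:]
|
|     # assumes llama.cpp's rpc-server is already running on the
|     # remote M1 at 192.168.1.20:50052
|     llama-server -m gpt-oss-20b.gguf --flash-attn \
|         --cache-reuse 256 --rpc 192.168.1.20:50052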
| LeoPanthera wrote:
| The named anchor in this URL doesn't work in Safari. Safari
| correctly scrolls down to the comment in question, but then
| some Javascript on the page throws you back up to the top
| again.
| dcreater wrote:
| I think the title buries the lede? It's specific to GPT-OSS and
| exposes the shady stuff Ollama is doing to acquiesce to, curry
| favor with, partner with, or get paid by corporate interests.
| freedomben wrote:
| I think "shady" is a little too harsh - sounds like they forked
| an important upstream project, made incompatible changes that
| they didn't push upstream or even communicate with upstream
| about, and now have to deal with the consequences of that. If
| that's "shady" (despite being all out in the open) then nearly
| every company I've worked for has been "shady."
| wsgeorge wrote:
| There's a reddit thread from a few months ago that sort of
| explains what people don't like about ollama, the
| "shadiness" the parent references:
|
| https://www.reddit.com/r/LocalLLaMA/comments/1jzocoo/finally.
| ..
| llmthrowaway wrote:
| Confusing title - thought this was about Ollama finally
| supporting sharded GGUF (i.e. the Hugging Face default for
| large GGUFs over 48GB).
|
| https://github.com/ollama/ollama/issues/5245
|
| Sadly it is not, and the issue still remains open after over a
| year, meaning ollama cannot run the latest SOTA open-source
| models unless they are converted to its proprietary format,
| which they do not consistently do.
|
| No surprise I guess, given they've taken VC money, refuse to
| properly attribute their use of things like llama.cpp and ggml,
| have their own model format for... reasons, and have over 1800
| open issues...
|
| Llama-server, ramalama, or whatever model switcher ggerganov is
| working on (he showed previews recently) feels like the way
| forward.
| tarruda wrote:
| I recently discovered that ollama no longer uses llama.cpp as a
| library; instead they link to the low-level library (ggml),
| which requires them to reinvent a lot of wheels for absolutely
| no benefit (if there's some benefit I'm missing, please let me
| know).
|
| Even using llama.cpp as a library seems like overkill for most
| use cases. Ollama could make its life much easier by spawning
| llama-server as a subprocess listening on a unix socket and
| forwarding requests to it (a rough sketch is at the end of this
| comment).
|
| One thing I'm curious about: Does ollama support strict
| structured output or strict tool calls adhering to a json schema?
| Because it would be insane to rely on a server for agentic use
| unless your server can guarantee the model will only produce
| valid json. AFAIK this feature is implemented by llama.cpp, which
| they no longer use.
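| [A rough sketch of the subprocess idea above; purely
| illustrative, not Ollama's actual code. The -m/--host/--port
| flags are real llama-server flags, but the model path and ports
| are placeholders, and the child listens on loopback TCP since
| I'm not sure llama-server can bind a unix socket:]
|
|     package main
|
|     import (
|         "log"
|         "net/http"
|         "net/http/httputil"
|         "net/url"
|         "os/exec"
|     )
|
|     func main() {
|         // Spawn llama-server as a child process on a loopback
|         // port (startup checks and shutdown handling omitted).
|         cmd := exec.Command("llama-server", "-m", "model.gguf",
|             "--host", "127.0.0.1", "--port", "8091")
|         if err := cmd.Start(); err != nil {
|             log.Fatal(err)
|         }
|
|         // Forward every incoming request to the child server
|         // and relay its response unchanged.
|         backend, _ := url.Parse("http://127.0.0.1:8091")
|         proxy := httputil.NewSingleHostReverseProxy(backend)
|         log.Fatal(http.ListenAndServe("127.0.0.1:11434", proxy))
|     }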
| arcanemachiner wrote:
| > I recently discovered that ollama no longer uses llama.cpp as
| a library; instead they link to the low-level library (ggml),
| which requires them to reinvent a lot of wheels for absolutely
| no benefit (if there's some benefit I'm missing, please let me
| know).
|
| Here is some relevant drama on the subject:
|
| https://github.com/ollama/ollama/issues/11714#issuecomment-3...
| hodgehog11 wrote:
| I got to speak with some of the leads at Ollama and asked more
| or less this same question. The reason they abandoned llama.cpp
| is because it does not align with their goals.
|
| llama.cpp is designed to rapidly adopt research-level
| optimisations and features, but the downside is that reported
| speeds change all the time (sometimes faster, sometimes slower)
| and things break really often. You can't hope to establish
| contracts with simultaneous releases if there is no guarantee
| the model will even function.
|
| By reimplementing this layer, Ollama gets to enjoy a kind of
| LTS status that their partners rely on. It won't be as feature-
| rich, and definitely won't be as fast, but that's not their
| goal.
| A4ET8a8uTh0_v2 wrote:
| Thank you. This is genuinely a valid reason even from a
| simple consistency perspective.
|
| (edit: I think -- after I read some of the links -- I
| understand why Ollama comes across as less of a hero. Still,
| I am giving them some benefit of the doubt since they made
| local models very accessible to plebs like me; and maybe I
| can graduate to no ollama.)
| hodgehog11 wrote:
| I think this is the thing: if you can use llama.cpp, you
| probably shouldn't use Ollama. It's designed for the
| beginner.
| jychang wrote:
| That's a dumb answer from them.
|
| What's wrong with using an older well-tested build of
| llama.cpp, instead of reinventing the wheel? Like every Linux
| distro that has ever run into this issue?
|
| Red Hat doesn't ship the latest build of the linux kernel to
| production. And Red Hat didn't reinvent the linux kernel for
| shits and giggles.
| hodgehog11 wrote:
| The Linux kernel does not break userspace.
|
| > What's wrong with using an older well-tested build of
| llama.cpp, instead of reinventing the wheel?
|
| Yeah, they tried this; that was the old setup as I
| understand it. But every time they needed support for a new
| model and had to update llama.cpp, an old model would break
| and one of their partners would go ape on them. They said
| it happened more than once, but one particular case (wish I
| could remember what it was) was so bad they felt they had
| no choice but to reimplement. It's the lowest risk
| strategy.
| halyconWays wrote:
| >(if there's some benefit I'm missing, please let me know).
|
| Makes their VCs think they're doing more, and have more
| ownership, rather than being a do-nothing wrapper with some
| analytics and S3 buckets that rehost models from HF.
| wubrr wrote:
| > Does ollama support strict structured output or strict tool
| calls adhering to a json schema?
|
| As far as I understand, this is generally not possible at the
| model level. The best you can do is wrap the call in a (non-llm)
| json schema validator and emit an error json in case the llm
| output does not match the schema, which is what some APIs do
| for you, though it's not very complicated to do yourself.
|
| Someone correct me if I'm wrong
| mangoman wrote:
| No, that's incorrect: llama.cpp supports providing a
| context-free grammar at sampling time and only samples tokens
| that conform to the grammar, never tokens that would violate it.
| tarruda wrote:
| The inference engine (llama.cpp) has full control over the
| possible tokens during inference. It can "force" the llm to
| output only valid tokens so that it produces valid json.
| kristjansson wrote:
| and in fact leverages that control to constrain outputs to
| those matching user-specified BNFs
|
| https://github.com/ggml-org/llama.cpp/tree/master/grammars
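| [For example, purely illustrative: a request to llama-server's
| /completion endpoint carrying a tiny GBNF grammar; the URL,
| prompt, and exact field names should be checked against the
| llama.cpp server README for your build:]
|
|     package main
|
|     import (
|         "bytes"
|         "encoding/json"
|         "fmt"
|         "io"
|         "net/http"
|     )
|
|     func main() {
|         // A GBNF grammar that only admits "yes" or "no"; the
|         // server constrains sampling to tokens matching it.
|         body, _ := json.Marshal(map[string]any{
|             "prompt":    "Is the sky blue? Answer yes or no: ",
|             "n_predict": 4,
|             "grammar":   `root ::= "yes" | "no"`,
|         })
|         resp, err := http.Post("http://127.0.0.1:8080/completion",
|             "application/json", bytes.NewReader(body))
|         if err != nil {
|             panic(err)
|         }
|         defer resp.Body.Close()
|         out, _ := io.ReadAll(resp.Body)
|         // The generated "content" field can only be yes or no.
|         fmt.Println(string(out))
|     }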
| cdoern wrote:
| > Ollama could make its life much easier by spawning
| llama-server as a subprocess listening on a unix socket and
| forwarding requests to it
|
| I'd recommend taking a look at
| https://github.com/containers/ramalama; it's more similar to
| what you're describing in the way it uses llama-server, and it
| is container-native by default, which is nice for portability.
| 12345hn6789 wrote:
| Just days ago, ollama devs claimed[0] that ollama no longer
| relies on ggml / llama.cpp. Here is their pull request
| (+165,966 -47,980) to reimplement (copy) llama.cpp code in
| their repository.
|
| https://github.com/ollama/ollama/pull/11823
|
| [0] https://news.ycombinator.com/item?id=44802414#44805396
| flakiness wrote:
| I'm not against the overall sentiment here, but to be fair,
| here is the counterpoint from the linked HN comment:
|
| > Ollama does not use llama.cpp anymore; we do still keep it
| and occasionally update it to remain compatible for older
| models for when we used it.
|
| The linked PR is doing the "occasionally update it" part, I
| guess? Note that "vendored" in a PR title often means taking a
| snapshot to pin a specific version.
___________________________________________________________________