[HN Gopher] Ollama and gguf
       ___________________________________________________________________
        
       Ollama and gguf
        
       Author : indigodaddy
       Score  : 59 points
       Date   : 2025-08-11 17:54 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | indigodaddy wrote:
       | ggerganov explains the issue:
       | https://github.com/ollama/ollama/issues/11714#issuecomment-3...
        
         | magicalhippo wrote:
          | I noticed it the other way: llama.cpp failed to load the
          | Ollama-downloaded gpt-oss 20b model. Thought it was odd,
          | given that all the others I tried worked fine.
          | 
          | Figured it had to be Ollama doing Ollama things, and it seems
          | that was indeed the case.
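          | 
          | As an aside, the GGUF container is simple enough to poke at
          | by hand if you want to see where two files diverge. A minimal
          | sketch in plain Python (no third-party deps; the path is a
          | placeholder, and it assumes the GGUF v2/v3 header layout)
          | that prints the header fields a loader checks first:
          | 
          |   import struct, sys
          | 
          |   def gguf_header(path):
          |       with open(path, "rb") as f:
          |           magic = f.read(4)  # b"GGUF" for a valid file
          |           if magic != b"GGUF":
          |               raise ValueError(f"not GGUF: {magic!r}")
          |           # little-endian: u32 version, u64 tensor count,
          |           # u64 metadata key/value count
          |           version, = struct.unpack("<I", f.read(4))
          |           n_tensors, = struct.unpack("<Q", f.read(8))
          |           n_kv, = struct.unpack("<Q", f.read(8))
          |       return version, n_tensors, n_kv
          | 
          |   if __name__ == "__main__":
          |       print(gguf_header(sys.argv[1]))
          | 
          | The mismatch ggerganov describes presumably lives in the
          | metadata and tensor data that follow this header, but
          | dumping both files is a quick way to confirm you're at least
          | looking at GGUF on both sides.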
        
         | polotics wrote:
          | ggerganov is my hero, and... it's a good thing this got
          | posted, because I saw in the comments that --flash-attn
          | --cache-reuse 256 could help with my setup (M3 36GB + RPC to
          | M1 16GB). Figuring out which params to set, and at what
          | values, is a lot of trial and error; Gemini does help a bit
          | in clarifying what params like top-k are going to do in
          | practice. Still, the whole load-balancing with RPC is
          | something I think I'm going to have to read the llama.cpp
          | source to really understand (oops, I almost wrote grok, damn
          | you Elon). Anyway, ollama is still not doing distributed
          | load, and yeah, I guess using it is a stepping stone...
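          | 
          | For anyone trying something similar, here's a rough sketch of
          | how those flags wire together (wrapped in Python just to show
          | the arguments; the model path and the RPC peer address are
          | placeholders, with llama.cpp's rpc-server running on the
          | remote machine):
          | 
          |   import subprocess
          | 
          |   MODEL = "/models/gpt-oss-20b.gguf"  # placeholder path
          |   RPC_PEER = "192.168.1.20:50052"     # rpc-server on the M1
          | 
          |   subprocess.run([
          |       "llama-server",
          |       "-m", MODEL,
          |       "--flash-attn",          # enable flash attention
          |       "--cache-reuse", "256",  # try to reuse KV cache chunks
          |                                # of at least 256 tokens
          |       "--rpc", RPC_PEER,       # offload part of the model to
          |                                # the remote box
          |       "--port", "8080",
          |   ], check=True)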
        
         | LeoPanthera wrote:
         | The named anchor in this URL doesn't work in Safari. Safari
         | correctly scrolls down to the comment in question, but then
          | some JavaScript on the page throws you back up to the top
         | again.
        
       | dcreater wrote:
        | I think the title buries the lede? It's specific to GPT-OSS,
        | and it exposes the shady stuff Ollama is doing to acquiesce
        | to / curry favor with / partner with / get paid by corporate
        | interests.
        
         | freedomben wrote:
         | I think "shady" is a little too harsh - sounds like they forked
         | an important upstream project, made incompatible changes that
         | they didn't push upstream or even communicate with upstream
         | about, and now have to deal with the consequences of that. If
         | that's "shady" (despite being all out in the open) then nearly
         | every company I've worked for has been "shady."
        
           | wsgeorge wrote:
            | There's a reddit thread from a few months ago that sort of
            | explains what people don't like about ollama, the
            | "shadiness" the parent references:
           | 
           | https://www.reddit.com/r/LocalLLaMA/comments/1jzocoo/finally.
           | ..
        
       | llmthrowaway wrote:
        | Confusing title - thought this was about Ollama finally
        | supporting sharded GGUF (i.e. the Hugging Face default for
        | large GGUFs over 48 GB).
        | 
        | https://github.com/ollama/ollama/issues/5245
        | 
        | Sadly it is not, and that issue still remains open after more
        | than a year, meaning Ollama cannot run the latest SOTA open
        | source models unless they convert them to their proprietary
        | format, which they do not consistently do.
        | 
        | No surprise, I guess, given they've taken VC money, refuse to
        | properly attribute their use of things like llama.cpp and
        | ggml, have their own model format for... reasons? And have
        | over 1800 open issues...
        | 
        | llama-server, ramalama, or whatever model switcher ggerganov
        | is working on (he showed previews recently) feel like the way
        | forward.
        
       | tarruda wrote:
        | I recently discovered that Ollama no longer uses llama.cpp as
        | a library; instead they link to the low-level library (ggml),
        | which requires them to reinvent a lot of the wheel for
        | absolutely no benefit (if there's some benefit I'm missing,
        | please let me know).
        | 
        | Even using llama.cpp as a library seems like overkill for most
        | use cases. Ollama could make its life much easier by spawning
        | llama-server as a subprocess listening on a unix socket and
        | forwarding requests to it.
        | 
        | One thing I'm curious about: does Ollama support strict
        | structured output or strict tool calls adhering to a JSON
        | schema? It would be insane to rely on a server for agentic use
        | unless that server can guarantee the model will only produce
        | valid JSON. AFAIK this feature is implemented by llama.cpp,
        | which they no longer use.
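        | 
        | To sketch what I mean (the model path and port are
        | placeholders, and I'm using a plain TCP port here rather than
        | a unix socket):
        | 
        |   import subprocess, time, urllib.request
        | 
        |   # spawn llama-server as a child process (placeholder args)
        |   proc = subprocess.Popen(
        |       ["llama-server", "-m", "/models/some-model.gguf",
        |        "--port", "8080"]
        |   )
        |   try:
        |       # wait until the /health endpoint answers with 200
        |       for _ in range(60):
        |           try:
        |               urllib.request.urlopen(
        |                   "http://127.0.0.1:8080/health", timeout=1)
        |               break
        |           except OSError:
        |               time.sleep(1)
        |       # ...from here, just forward client requests to :8080...
        |   finally:
        |       proc.terminate()
        |       proc.wait()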
        
         | arcanemachiner wrote:
          | > I recently discovered that Ollama no longer uses llama.cpp
          | as a library; instead they link to the low-level library
          | (ggml), which requires them to reinvent a lot of the wheel
          | for absolutely no benefit (if there's some benefit I'm
          | missing, please let me know).
         | 
         | Here is some relevant drama on the subject:
         | 
         | https://github.com/ollama/ollama/issues/11714#issuecomment-3...
        
         | hodgehog11 wrote:
         | I got to speak with some of the leads at Ollama and asked more
         | or less this same question. The reason they abandoned llama.cpp
         | is because it does not align with their goals.
         | 
         | llama.cpp is designed to rapidly adopt research-level
         | optimisations and features, but the downside is that reported
         | speeds change all the time (sometimes faster, sometimes slower)
         | and things break really often. You can't hope to establish
         | contracts with simultaneous releases if there is no guarantee
         | the model will even function.
         | 
         | By reimplementing this layer, Ollama gets to enjoy a kind of
         | LTS status that their partners rely on. It won't be as feature-
         | rich, and definitely won't be as fast, but that's not their
         | goal.
        
           | A4ET8a8uTh0_v2 wrote:
           | Thank you. This is genuinely a valid reason even from a
           | simple consistency perspective.
           | 
            | (edit: I think -- after I read some of the links -- I
            | understand why Ollama comes across as less of a hero.
            | Still, I am giving them some benefit of the doubt, since
            | they made local models very accessible to plebs like me;
            | and maybe I can eventually graduate to not needing
            | Ollama.)
        
             | hodgehog11 wrote:
             | I think this is the thing: if you can use llama.cpp, you
             | probably shouldn't use Ollama. It's designed for the
             | beginner.
        
           | jychang wrote:
            | That's a dumb answer from them.
            | 
            | What's wrong with using an older, well-tested build of
            | llama.cpp instead of reinventing the wheel? Like every
            | Linux distro that's ever run into this issue?
            | 
            | Red Hat doesn't ship the latest build of the Linux kernel
            | to production. And Red Hat didn't reinvent the Linux
            | kernel for shits and giggles.
        
             | hodgehog11 wrote:
             | The Linux kernel does not break userspace.
             | 
              | > What's wrong with using an older, well-tested build of
              | llama.cpp instead of reinventing the wheel?
              | 
              | Yeah, they tried this; that was the old setup as I
              | understand it. But every time they needed support for a
              | new model and had to update llama.cpp, an old model
              | would break and one of their partners would go ape on
              | them. They said it happened more than once, but one
              | particular case (wish I could remember what it was) was
              | so bad they felt they had no choice but to reimplement.
              | It's the lowest-risk strategy.
        
         | halyconWays wrote:
         | >(if there's some benefit I'm missing, please let me know).
         | 
         | Makes their VCs think they're doing more, and have more
         | ownership, rather than being a do-nothing wrapper with some
         | analytics and S3 buckets that rehost models from HF.
        
         | wubrr wrote:
         | > Does ollama support strict structured output or strict tool
         | calls adhering to a json schema?
         | 
          | As far as I understand, this is generally not possible at
          | the model level. The best you can do is wrap the call in a
          | (non-LLM) JSON schema validator and emit an error JSON when
          | the LLM output doesn't match the schema, which is what some
          | APIs do for you, but it's not very complicated to do
          | yourself.
          | 
          | Someone correct me if I'm wrong.
        
           | mangoman wrote:
            | No, that's incorrect - llama.cpp supports providing a
            | context-free grammar at sampling time, and it only samples
            | tokens that conform to the grammar rather than tokens that
            | would violate it.
        
           | tarruda wrote:
            | The inference engine (llama.cpp) has full control over the
            | possible tokens during inference. It can "force" the LLM
            | to output only valid tokens so that it produces valid
            | JSON.
        
             | kristjansson wrote:
             | and in fact leverages that control to constrain outputs to
             | those matching user-specified BNFs
             | 
             | https://github.com/ggml-org/llama.cpp/tree/master/grammars
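              | 
              | For example, a minimal sketch against a local
              | llama-server on port 8080 (field names per the server's
              | /completion endpoint; the prompt and the toy grammar are
              | just illustrations):
              | 
              |   import json, urllib.request
              | 
              |   payload = {
              |       "prompt": "Is the sky blue? Answer yes or no: ",
              |       "n_predict": 4,
              |       # GBNF: the sampler may only emit "yes" or "no"
              |       "grammar": 'root ::= "yes" | "no"',
              |   }
              |   req = urllib.request.Request(
              |       "http://127.0.0.1:8080/completion",
              |       data=json.dumps(payload).encode(),
              |       headers={"Content-Type": "application/json"},
              |   )
              |   with urllib.request.urlopen(req) as resp:
              |       print(json.loads(resp.read())["content"])
              | 
              | (There's also a json_schema field that, if I recall
              | correctly, gets converted to a grammar internally.)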
        
         | cdoern wrote:
         | > Ollama could make its life much easier by spawning llama-
         | server as a subprocess listening on a unix socket, and forward
         | requests to it
         | 
          | I'd recommend taking a look at
          | https://github.com/containers/ramalama - it's more similar
          | to what you're describing in the way it uses llama-server,
          | and it's container-native by default, which is nice for
          | portability.
        
       | 12345hn6789 wrote:
        | Just days ago, Ollama devs claimed[0] that Ollama no longer
        | relies on ggml / llama.cpp. Here is their pull request
        | (+165,966 -47,980) to reimplement (copy) llama.cpp code in
        | their repository:
        | 
        | https://github.com/ollama/ollama/pull/11823
        | 
        | [0] https://news.ycombinator.com/item?id=44802414#44805396
        
         | flakiness wrote:
          | Not against the overall sentiment here, but to be fair,
          | here's the counterpoint from the linked HN comment:
         | 
         | > Ollama does not use llama.cpp anymore; we do still keep it
         | and occasionally update it to remain compatible for older
         | models for when we used it.
         | 
          | The linked PR is doing the "occasionally update it" part, I
          | guess? Note that "vendored" in a PR title often means taking
          | a snapshot to pin a specific version.
        
       ___________________________________________________________________
       (page generated 2025-08-11 23:00 UTC)