[HN Gopher] Structured Outputs with Ollama
       ___________________________________________________________________
        
       Structured Outputs with Ollama
        
       Author : Patrick_Devine
       Score  : 231 points
       Date   : 2024-12-07 01:12 UTC (21 hours ago)
        
 (HTM) web link (ollama.com)
 (TXT) w3m dump (ollama.com)
        
       | bluechair wrote:
        | Has anyone seen how these constraints affect the quality of the
        | output from the LLM?
       | 
       | In some instances, I'd rather parse Markdown or plain text if it
       | means the quality of the output is higher.
        
         | parthsareen wrote:
          | We've been keeping a close eye on this as research comes out.
          | We're looking into improving sampling as a whole, on both speed
          | and accuracy.
          | 
          | Hopefully with those changes we can also enable general
          | structured generation, not limited only to JSON.
        
           | hackernewds wrote:
           | Who is "we"?
        
             | parthsareen wrote:
             | I authored the blog with some other contributors and worked
             | on the feature (PR:
             | https://github.com/ollama/ollama/pull/7900).
             | 
             | The current implementation uses llama.cpp GBNF grammars.
             | The more recent research (Outlines, XGrammar) points to
             | potentially speeding up the sampling process through FSTs
             | and GPU parallelism.
        
               | netghost wrote:
               | Thank you for the details!
        
               | mmoskal wrote:
                | If you want to avoid startup cost, llguidance [0] has no
                | compilation phase and by far the fullest JSON support [1]
                | of any library. I did a PoC llama.cpp integration [2],
                | though our focus is mostly server-side [3].
               | 
               | [0] https://github.com/guidance-ai/llguidance [1]
               | https://github.com/guidance-
               | ai/llguidance/blob/main/parser/s... [2]
               | https://github.com/ggerganov/llama.cpp/pull/10224 [3]
               | https://github.com/guidance-ai/llgtrt
        
               | parthsareen wrote:
               | This looks really useful. Thank you!
        
               | HanClinto wrote:
                | I have been thinking about your PR regularly, and
                | pondering how we should go about getting it merged in.
               | 
               | I really want to see support for additional grammar
               | engines merged into llama.cpp, and I'm a big fan of the
               | work you did on this.
        
         | nikolayasdf123 wrote:
          | Same here. I noticed that when you ask the model to generate an
          | elaborate response in natural text and then come up with an
          | answer, the quality is orders of magnitude better, more in line
          | with what you would expect from human-like reasoning.
          | 
          | Asking the LLM to directly generate JSON gives much worse
          | results, closer to a random guess or intuition.
        
         | benreesman wrote:
         | I can say that I was categorically wrong about the utility of
         | things like instructor.
         | 
          | It's easy to burn a lot of tokens, but if the thing you're
          | doing merits the cost? You can be a bully with it, and while
          | it's never the best, 95% as good for zero effort is a tool in
          | one's kit.
        
         | thot_experiment wrote:
          | YMMV; it's a negative effect in terms of "reasoning", but the
          | delta isn't super significant in most cases. It really depends
          | on the LLM and whether your prompt is likely to generate a JSON
          | response to begin with: the more you have to coerce the LLM,
          | the less likely it is to generate sane output. With smaller
          | models you more quickly end up at the edge of the space where
          | the LLM has meaningful predictive power, and the outputs start
          | getting closer to random noise.
          | 
          | FWIW this is measured by me using a vibes-based method, nothing
          | rigorous, just a lot of hours spent on various LLM projects. I
          | haven't used these particular tools yet, but ollama was
          | previously able to guarantee JSON output through what I assume
          | are similar techniques, and my partner and I previously worked
          | on a jsonformer-like thing for oobabooga, another LLM runtime
          | tool.
        
         | lolinder wrote:
         | Working with OpenAI's models I've found a very good strategy is
         | to have two passes if you can afford the extra tokens: one pass
         | uses a heavy model and natural language with markdown sections
         | discussing the reasoning and providing a final natural language
         | answer (ideally labeled clearly with a markdown header). The
         | second pass can use a cheaper and faster model to put the
         | answer into a structured output format for consumption by the
         | non-LLM parts of the pipeline.
         | 
         | You basically use JSON schema mode to draw a clean boundary
         | around the wishy-washy language bits, using the LLM as a
         | preprocessor to capture its own output in a useful format.
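          | 
          | A minimal sketch of that two-pass pattern, written against
          | Ollama's /api/chat endpoint rather than OpenAI's API since
          | that's the topic here; the model names, prompt and schema are
          | placeholders:
          | 
          |     import json
          |     import requests
          | 
          |     OLLAMA = "http://localhost:11434/api/chat"
          | 
          |     def chat(model, content, fmt=None):
          |         body = {"model": model, "stream": False,
          |                 "messages": [{"role": "user",
          |                               "content": content}]}
          |         if fmt is not None:
          |             body["format"] = fmt   # JSON schema for the reply
          |         r = requests.post(OLLAMA, json=body, timeout=300)
          |         r.raise_for_status()
          |         return r.json()["message"]["content"]
          | 
          |     question = "Should we cache per-user results, and why?"
          | 
          |     # Pass 1: heavy model reasons in free-form markdown.
          |     reasoning = chat(
          |         "llama3.1:70b",   # placeholder "heavy" model
          |         question + "\n\nThink it through in markdown sections "
          |         "and finish with a section titled '## Final answer'.")
          | 
          |     # Pass 2: cheap model extracts the answer into a schema.
          |     schema = {
          |         "type": "object",
          |         "properties": {
          |             "answer": {"type": "string"},
          |             "confidence": {"type": "string",
          |                            "enum": ["low", "medium", "high"]},
          |         },
          |         "required": ["answer", "confidence"],
          |     }
          |     extracted = chat(
          |         "llama3.2:3b",    # placeholder "cheap" model
          |         "Extract the final answer from the text below as "
          |         "JSON.\n\n" + reasoning,
          |         fmt=schema)
          | 
          |     print(json.loads(extracted))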
        
         | mmoskal wrote:
         | It depends how fine-tuned the model is to JSON output.
         | 
          | Also, you need to tell the model the schema. If you don't, you
          | will get more weird tokenization issues.
          | 
          | For example, if the schema expects a JSON key "foobarbaz" and
          | the canonical BPE tokenization is ["foobar", "baz"], the token
          | mask generated by all current constrained output libraries will
          | let the model choose from "f", "foo", "foobar" (assuming these
          | are all valid tokens). The model might then choose "foo", and
          | the constraint will then force e.g. "bar" and "baz" as the next
          | tokens. Now the model sees ["foo", "bar", "baz"] instead of
          | ["foobar", "baz"] and gets confused. [0]
          | 
          | If the model knows from the prompt that "foobarbaz" is one of
          | the schema keys, it will generally prefer "foobar" over "foo".
          | 
          | [0] In modern models these tokens are related because of
          | regularization, but they are not the same.
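          | 
          | To make the masking concrete, here is a toy sketch with a
          | hypothetical six-token vocabulary (not a real tokenizer): a
          | token stays unmasked as long as it keeps the output a prefix of
          | the forced key "foobarbaz".
          | 
          |     def allowed(vocab, remaining):
          |         """Tokens the mask permits while 'remaining' is forced."""
          |         return [t for t in vocab if remaining.startswith(t)]
          | 
          |     vocab = ["f", "foo", "foobar", "bar", "baz", "barbaz"]
          | 
          |     print(allowed(vocab, "foobarbaz"))  # ['f', 'foo', 'foobar']
          |     # If the model picks "foo" instead of the canonical
          |     # "foobar", the remaining forced text is "barbaz":
          |     print(allowed(vocab, "barbaz"))     # ['bar', 'barbaz']
          |     # The model ends up seeing ["foo", "bar", "baz"] rather
          |     # than the ["foobar", "baz"] split it was trained on.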
        
         | crystal_revenge wrote:
         | There was a paper going around claiming that structured outputs
         | _did_ hurt the quality of the output, but it turns out their
         | experiment setup was laughably bad [0].
         | 
         | It looks like, so long as you're reasonable with the prompting,
         | you tend to get _better_ outputs when using structure.
         | 
         | 0. https://blog.dottxt.co/say-what-you-mean.html
        
           | coredog64 wrote:
           | I've seen one case where structured output was terrible: OCR
           | transcription of handwritten text in a form with blanks. You
           | want a very low temperature for transcription, but as soon as
           | the model starts to see multiple blank sequences, it starts
           | to hallucinate that "" is the most likely next token.
        
       | quaintdev wrote:
        | So I can use this with any supported model? The reason I'm
        | asking is that I can only run 1b-3b models reliably on my
        | hardware.
        
         | parthsareen wrote:
         | Hey! Author of the blog post here. Yes you should be able to
         | use any model. Your mileage may vary with the smaller models
         | but asking them to "return x in json" tends to help with
         | accuracy (anecdotally).
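          | 
          | A minimal sketch of that, using the ollama Python client and
          | Pydantic (assuming a recent ollama-python where `format`
          | accepts a JSON schema dict); the model name and schema are just
          | examples:
          | 
          |     from ollama import chat
          |     from pydantic import BaseModel
          | 
          |     class Pet(BaseModel):
          |         name: str
          |         species: str
          |         age_years: int
          | 
          |     response = chat(
          |         model="llama3.2:3b",   # any pulled model should work
          |         messages=[{"role": "user",
          |                    "content": "Milo is a three year old tabby "
          |                               "cat. Return the pet in JSON."}],
          |         # constrain decoding to the schema:
          |         format=Pet.model_json_schema(),
          |     )
          | 
          |     pet = Pet.model_validate_json(response.message.content)
          |     print(pet)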
        
           | pamelafox wrote:
            | Do you happen to know if gpt-4o would be negatively affected
            | by the addition of "return x in json"? I'm debating whether I
            | could use the same prompt across all models, hosted and
            | ollama.
        
           | dcreater wrote:
            | Why do smaller models fail to follow? Isn't the objective of
            | constraints that the model always provides the right output
            | type?
        
             | parthsareen wrote:
             | The constraints will always be met. It's the data inside
             | that might be inaccurate. YMMV with smaller models in that
             | sense.
        
       | vincentpants wrote:
        | Wow, neat! The first step to format ambivalence! Curious to see
        | how well this performs on the edge, where our overhead is always
        | so scarce!
       | 
       | Amazing work as always, looking forward to taking this for a
       | spin!
        
       | lxe wrote:
        | I'm still running oobabooga because of its ExLlamaV2 (exl2)
        | support, which does much more efficient inference on dual 3090s.
        
         | thot_experiment wrote:
          | I haven't touched ooba in a while. What's the situation like
          | with exl2 vs the non-homogeneous quantization methods people
          | are using, like Q3_K_S? IIRC, while exl2 is faster, the GPTQ
          | quants were outperforming it in terms of accuracy, especially
          | at lower bit depths.
        
       | chirau wrote:
       | This is wonderful news.
       | 
        | I was actually scratching my head over how to structure a
        | regular prompt to produce CSV data without extra nonsense like
        | "Here is your data" and "Please note blah blah" at the beginning
        | and end, so this is most welcome: I can define exactly what I
        | want returned and then just push the structured output to CSV.
        
         | firejake308 wrote:
            | Remember that you still need to include an instruction to
            | produce CSV, so that the prompt is in the right context to
            | generate CSV that makes sense. Otherwise, you may get output
            | that is technically in CSV format but doesn't make any
            | sense, because the model was actually trying to write a
            | paragraph response and the token sampler just selected
            | really low-probability tokens that the model didn't really
            | want to say.
        
           | mmoskal wrote:
           | It seems ollama only supports JSON Schema.
           | 
           | Interestingly, JSON Schema has much less of this problem than
           | say CSV - when the model is forced to produce `{"first_key":`
           | it will generally understand it's supposed to continue in
           | JSON. It still helps to tell it the schema though, especially
           | due to weird tokenization issues you can get otherwise.
        
             | diggan wrote:
             | > It seems ollama only supports JSON Schema.
             | 
             | "Encoding" CSV as JSON is trivial though, so make it output
             | JSON then parse the array-of-arrays into CSV :)
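              | 
              | A quick sketch of that (placeholder model name; the schema
              | says "array of arrays of strings", then Python's csv module
              | does the CSV part):
              | 
              |     import csv, io, json, requests
              | 
              |     schema = {"type": "array",
              |               "items": {"type": "array",
              |                         "items": {"type": "string"}}}
              | 
              |     prompt = ("List three countries with their capitals "
              |               "as a JSON array of [country, capital] rows.")
              |     body = {"model": "llama3.2",   # placeholder model
              |             "messages": [{"role": "user",
              |                           "content": prompt}],
              |             "format": schema,
              |             "stream": False}
              |     resp = requests.post(
              |         "http://localhost:11434/api/chat", json=body).json()
              | 
              |     rows = json.loads(resp["message"]["content"])
              |     buf = io.StringIO()
              |     csv.writer(buf).writerows(rows)
              |     print(buf.getvalue())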
        
         | KTibow wrote:
         | A lot of the time you can prevent this by prefilling the output
         | with ```\n and stopping at ```.
        
           | chirau wrote:
           | care to explain further? I am not sure I understand you fully
        
             | bonzini wrote:
              | ``` is markdown for preformatted text. It puts the LLM in
              | the mood of generating machine-readable data instead of
              | prose.
        
             | jstanley wrote:
             | A lot of the time you can prevent this by prefilling the
             | output with "```\n", and stopping at "```".
        
             | miki123211 wrote:
              | LLMs are essentially text completion machines: they take
              | some text in as a list of tokens, and their job is to
              | predict which token comes next.
              | 
              | Typically, you're interested in getting more than just one
              | token out of them, so you run them in a loop. You start
              | with the token list containing just the user's message, run
              | the LLM with that list, append the newly obtained token at
              | the end, run the LLM again to see what comes after that new
              | token, and so on, until a special end-of-sequence token is
              | generated and the loop terminates.
              | 
              | There's no reason why you have to start with nothing but
              | the user's message, though. You can let the user specify
              | the beginning of the LLM's supposed completion and ask the
              | LLM to continue from that point, instead of generating the
              | completion from scratch. This essentially ensures that what
              | the LLM says begins with a specific string.
              | 
              | Not all APIs expose this feature (there are good safety
              | reasons not to), but all LLMs are capable of doing it in
              | principle, and doing it with the open ones is trivial.
              | 
              | LLMs are typically trained to output markdown, which uses
              | ```language_name to denote code blocks in language_name, so
              | that user interfaces like ChatGPT's web UI can do proper
              | syntax highlighting.
             | 
             | Therefore, if you make your LLM think that it already
             | started a completion, and that completion began with
             | ```json, it will predict what's most likely to come after
             | that delimiter, and that would be a JSON block.
        
             | magicalhippo wrote:
              | The way LLMs work is that you feed them a vector, an array
              | of numbers, that represents a sequence of tokens.
              | 
              | You turn the crank and you get a probability distribution
              | for the next token in the sequence. You then sample the
              | distribution to get the next token, append it to the
              | vector, and do it again and again.
              | 
              | Thus the typical LLM has no memory as such; it infers what
              | it was "thinking" by looking at what it has already said,
              | and uses that to figure out what to say next, so to speak.
              | 
              | The characters in the input prompt are converted to these
              | tokens, but there are also special tokens such as start of
              | input, end of input, start of output and end of output. The
              | end-of-output token is how the LLM "tells you" it's done
              | talking.
              | 
              | Normally, in a chat scenario, these special tokens are
              | inserted by the LLM front-end, say Ollama/llama.cpp in this
              | case.
              | 
              | However, if you interface more directly you need to add
              | these yourself, and hence you can prefill the output before
              | feeding the vector to the LLM for the first time. The LLM
              | will then "think" it has already started writing code, say,
              | and is likely to continue doing so.
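              | 
              | The loop described above, as runnable toy code: the "model"
              | here is a fake forward pass that replays a canned reply one
              | token at a time and then emits an end-of-output token. A
              | real LLM returns a probability distribution to sample from
              | instead, but the control flow is the same.
              | 
              |     EOS = "<end_of_output>"
              |     _CANNED = list("Sure, here you go.") + [EOS]
              | 
              |     def fake_model(tokens, prompt_len):
              |         """Stand-in forward pass: next token given the rest."""
              |         return _CANNED[len(tokens) - prompt_len]
              | 
              |     def generate(prompt):
              |         tokens = list(prompt)   # toy tokenizer: 1 char/token
              |         prompt_len = len(tokens)
              |         while True:
              |             nxt = fake_model(tokens, prompt_len)
              |             if nxt == EOS:      # the model says it's done
              |                 break
              |             tokens.append(nxt)  # feed the output back in
              |         return "".join(tokens[prompt_len:])
              | 
              |     print(generate("Say something: "))  # Sure, here you go.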
        
               | polotics wrote:
                | You have described an RNN, I think. Don't attention heads
                | add something that you could compare to a rough-and-ready
                | understanding?
        
               | magicalhippo wrote:
               | Auto-regressive LLMs do this as I understand it, though
               | it can vary if they feed the combined input and output[1]
               | through the whole net like GPT-2 and friends, or just the
               | decoder[2]. I described the former, and I should have
               | clarified that.
               | 
               | In either case you can "prime it" like it was suggested.
               | 
               | A regular RNN has more feedback[3], like each layer
               | feeding back to itself, as I understand it.
               | 
               | Happy to be corrected though.
               | 
               | [1]: https://jalammar.github.io/illustrated-gpt2/#one-
               | difference-...
               | 
               | [2]: https://medium.com/@ikim1994914/understanding-the-
               | modern-llm...
               | 
               | [3]: https://karpathy.github.io/2015/05/21/rnn-
               | effectiveness/
        
       | xnx wrote:
       | Is there a best approach for providing structured input to LLMs?
       | Example: feed in 100 sentences and get each one classified in
       | different ways. It's easy to get structured data out, but my
       | approach of prefixing line numbers seems clumsy.
        
         | mmoskal wrote:
          | Models are trained on Markdown, JSON and various programming
          | languages, so any one of these should work.
          | 
          | However, in this case you're best off giving the model the
          | sentences one by one, to avoid confusing it. If you structure
          | the prompt like "Classify the following sentence, here are the
          | rules ...." + sentence, then you should hit the prefix cache
          | and get even better performance than with a single query. Of
          | course, this only works if you have a prefix cache and are not
          | paying per input token (though most providers now let you
          | indicate you want to use the prefix cache and pay less).
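          | 
          | A sketch of the one-sentence-at-a-time approach (placeholder
          | model name): every request shares the same instruction prefix,
          | so a runtime with a prefix cache only re-processes the short
          | suffix that changes.
          | 
          |     import requests
          | 
          |     INSTRUCTIONS = (
          |         "Classify the following sentence as one of: "
          |         "positive, negative, neutral. "
          |         "Reply with the label only.\n\nSentence: ")
          | 
          |     def classify(sentence, model="llama3.2"):
          |         body = {"model": model,
          |                 "prompt": INSTRUCTIONS + sentence,
          |                 "stream": False}
          |         resp = requests.post(
          |             "http://localhost:11434/api/generate", json=body)
          |         return resp.json()["response"].strip()
          | 
          |     sentences = ["The release went smoothly.",
          |                  "The build is broken again."]
          |     print({s: classify(s) for s in sentences})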
        
           | xnx wrote:
            | Good idea. I might try that. I think classification quality
            | improves when the model can see the following sentences, so
            | I'll have to check whether feeding them sequentially makes
            | it worse.
        
       | quaintdev wrote:
        | Yay! It works. I used gemma2:2b and gave it the text below:
        | 
        |     You have spent 190 at Fresh Mart. Current balance: 5098
        | 
        | and it gave the output below:
        | 
        |     {\n\"amount\": 190,\n\"balance\": 5098 ,\"category\":
        |     \"Shopping\",\n\"place\":\"Fresh Mart\"\n}
        
         | diggan wrote:
         | That's some pretty inconsistent JSON, but I guess that makes
         | sense when using a really small model and gemma on top of that.
        
       | rdescartes wrote:
        | If anyone needs more powerful constrained outputs, llama.cpp
        | supports GBNF grammars:
       | 
       | https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
        
         | jimmySixDOF wrote:
          | That's exactly what they are using.
        
         | sa-code wrote:
         | This is amazing, thank you for the link
        
         | lolinder wrote:
         | Have you found the output for arbitrary grammars to be
         | satisfactory? My naive assumption has been that these models
         | will produce better JSON than other formats simply by virtue of
         | having seen so much of it.
        
           | rdescartes wrote:
            | If you want good results, the grammar should follow the
            | expected output of the prompt, especially if you use a small
            | model. Normally I manually fine-tune the prompt to output the
            | grammar's format first, and then apply the grammar in
            | production.
        
           | throwaway314155 wrote:
           | Who would downvote this perfectly reasonable question?
           | 
           | edit: Nm
        
         | dcreater wrote:
         | How is it more powerful?
        
           | evilduck wrote:
            | Grammars don't have to be just JSON, which means you could
            | have it format responses as anything with a formal grammar:
            | XML, HTTP responses, SQL, algebraic notation for math, etc.
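            | 
            | For instance, here is a small non-JSON GBNF grammar that
            | constrains output to an arithmetic expression and its result.
            | It's written as a Python string only for illustration; saved
            | to a file, it could be passed to llama.cpp with something
            | like `llama-cli --grammar-file calc.gbnf` (flag per
            | llama.cpp's docs; Ollama itself only exposes JSON schemas
            | right now):
            | 
            |     # Each line is one GBNF rule; literals are quoted,
            |     # [0-9]+ is one or more digits.
            |     grammar = "\n".join([
            |         'root ::= expr " = " num',
            |         'expr ::= num (" " op " " num)*',
            |         'op   ::= "+" | "-" | "*" | "/"',
            |         'num  ::= [0-9]+',
            |     ])
            | 
            |     with open("calc.gbnf", "w") as f:
            |         f.write(grammar + "\n")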
        
       | JackYoustra wrote:
       | PRs on this have been open for something like a year! I'm a bit
       | sad about how quiet the maintainers have been on this.
        
         | dcreater wrote:
          | Reading the tea leaves, they seem to be headed down the
          | corporate path, so view everything through that lens: how to
          | maximize profit.
        
         | parthsareen wrote:
         | Hey! Author of the post and one of the maintainers here. I
         | agree - we (maintainers) got to this late and in general want
         | to encourage more contributions.
         | 
         | Hoping to be more on top of community PRs and get them merged
         | in the coming year.
        
       | ein0p wrote:
        | That's very useful. To see why, try to get an LLM to _reliably_
        | generate JSON output without this. Sometimes it will, but
        | sometimes it'll just YOLO and produce something you didn't ask
        | for that can't be parsed.
        
       | guerrilla wrote:
        | No way. This is amazing and one of the things I actually wanted.
        | I love ollama because it makes using an LLM feel like using any
        | other UNIX program. It makes LLMs feel like they belong on UNIX.
       | 
       | Question though. Has anyone had luck running it on AMD GPUs? I've
       | heard it's harder but I really want to support the competition
       | when I get cards next year.
        
         | rcarmo wrote:
          | Yes, even on iGPUs. I was running it fairly well on a mini-PC
          | with a 780M and the BIOS set to allocate 16GB of shared memory
          | to it.
        
       | lormayna wrote:
        | This is fantastic news! I spent hours fine-tuning my prompt to
        | summarise text and output JSON, and I still have issues
        | sometimes. Is this feature also available from Go?
        
         | lioeters wrote:
          | It looks like the structured output feature is available in Go,
          | with the `format` field:
          | 
          |     type GenerateRequest struct {
          |         ...
          |         // Format specifies the format to return a response in.
          |         Format json.RawMessage `json:"format,omitempty"`
          |     }
          | 
          | https://github.com/ollama/ollama/blob/de52b6c2f90ff220ed9469...
        
       | rcarmo wrote:
       | I must say it is nice to see the curl example first. As much as I
       | like Pydantic, I still prefer to hand-code the schemas, since it
       | makes it easier to move my prototypes to Go (or something else).
        
       | highlanderNJ wrote:
       | What's the value-add compared to `outlines`?
       | 
       | https://www.souzatharsis.com/tamingLLMs/notebooks/structured...
        
         | parthsareen wrote:
          | Hey! Author of the blog here. The current implementation uses
          | llama.cpp's GBNF grammars, which allowed for a quick
          | implementation. The biggest value-add at this time was getting
          | the feature out.
          | 
          | With the newer research (Outlines, XGrammar) coming out, I hope
          | to update the sampling to support more formats, increase
          | accuracy, and improve performance.
        
       | diimdeep wrote:
        | Very annoying marketing, and pretending to be anything other
        | than just a wrapper around llama.cpp.
        
         | evilduck wrote:
         | Can you ollama haters stop with this bullshit?
         | 
         | Does llama.cpp do dynamic model loading and unloading? Will it
         | fetch a model you request but isn't downloaded? Does it provide
         | SDKs? Does it have startup services it provides? There's space
         | for things that wrap llama.cpp and solve many of its pain
         | points. You can find piles of reports of people struggling to
         | build and compile llama.cpp for some reason or another who then
         | clicked an Ollama installer and it worked right away.
         | 
         | It's also a free OSS project giving all this away, why are you
         | being an ass and discouraging them?
        
           | dcreater wrote:
           | They're going to go corporate
        
           | diimdeep wrote:
            | Sure, llama.cpp does not do all of that, except that it does
            | let you curl models from public, free-to-use endpoints. But
            | an SDK? Fuck that. Load, unload and startup services - who
            | even needs that? All this value is minuscule compared to the
            | core functionality provided by ggml/llama.cpp.
            | 
            | But this submitted link is not even about any of that; it is
            | about the one thing llama.cpp really does not do: write more
            | lines of marketing material than lines of code. That is what
            | this marketing material is about - lines of code that really
            | just wrap 10x more lines of code down the line, without
            | making that clear as day.
        
             | evilduck wrote:
              | It's not even worth countering all these lies you're
              | telling. Enjoy your self-inflicted irrational misery.
        
       ___________________________________________________________________
       (page generated 2024-12-07 23:01 UTC)