[HN Gopher] Structured Outputs with Ollama
___________________________________________________________________
Structured Outputs with Ollama
Author : Patrick_Devine
Score : 231 points
Date : 2024-12-07 01:12 UTC (21 hours ago)
(HTM) web link (ollama.com)
(TXT) w3m dump (ollama.com)
| bluechair wrote:
| Has anyone seen how these constraints affect the quality of the
| output out of the LLM?
|
| In some instances, I'd rather parse Markdown or plain text if it
| means the quality of the output is higher.
| parthsareen wrote:
| We've been keeping a close eye on this as research comes out.
| We're looking into improving sampling as a whole, on both speed
| and accuracy.
|
| Hopefully those changes will also enable general structured
| generation, not limited only to JSON.
| hackernewds wrote:
| Who is "we"?
| parthsareen wrote:
| I authored the blog with some other contributors and worked
| on the feature (PR:
| https://github.com/ollama/ollama/pull/7900).
|
| The current implementation uses llama.cpp GBNF grammars.
| The more recent research (Outlines, XGrammar) points to
| potentially speeding up the sampling process through FSTs
| and GPU parallelism.
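|
| For anyone who wants to try it, here is a rough sketch of the
| feature over the REST API (assuming a local Ollama server on the
| default port and a model such as llama3.1 already pulled; the
| schema and prompt are just illustrative):
|
|     import json
|     import requests
|
|     # JSON schema describing the structure we want back.
|     schema = {
|         "type": "object",
|         "properties": {
|             "name": {"type": "string"},
|             "capital": {"type": "string"},
|             "languages": {"type": "array", "items": {"type": "string"}},
|         },
|         "required": ["name", "capital", "languages"],
|     }
|
|     resp = requests.post(
|         "http://localhost:11434/api/chat",
|         json={
|             "model": "llama3.1",
|             "messages": [{"role": "user", "content": "Tell me about Canada."}],
|             # the schema is enforced during sampling (currently via GBNF)
|             "format": schema,
|             "stream": False,
|         },
|     )
|     print(json.loads(resp.json()["message"]["content"]))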
| netghost wrote:
| Thank you for the details!
| mmoskal wrote:
| If you want to avoid startup cost, llguidance [0] has no
| compilation phase and by far the fullest JSON support [1]
| of any library. I did a PoC llama.cpp integration [2]
| though our focus is mostly server-side [3].
|
| [0] https://github.com/guidance-ai/llguidance
| [1] https://github.com/guidance-ai/llguidance/blob/main/parser/s...
| [2] https://github.com/ggerganov/llama.cpp/pull/10224
| [3] https://github.com/guidance-ai/llgtrt
| parthsareen wrote:
| This looks really useful. Thank you!
| HanClinto wrote:
| I have been thinking about your PR regularly, and
| pondering about how we should go about getting this
| merged in.
|
| I really want to see support for additional grammar
| engines merged into llama.cpp, and I'm a big fan of the
| work you did on this.
| nikolayasdf123 wrote:
| Same here. I noticed that when you ask the model to generate an
| elaborate response in natural text and then come up with an
| answer, the quality is orders of magnitude better, and more in
| line with what you would expect from human-like reasoning.
|
| Asking the LLM to directly generate JSON gives much worse
| results, similar to either a random guess or intuition.
| benreesman wrote:
| I can say that I was categorically wrong about the utility of
| things like instructor.
|
| It's easy to burn a lot of tokens, but if the thing you're doing
| merits the cost? You can be a bully with it, and while it's never
| the best, 95% as good for zero effort is a tool in one's kit.
| thot_experiment wrote:
| YMMV, it's a negative effect in terms of "reasoning", but the
| delta isn't super significant in most cases. It really depends
| on the LLM and whether your prompt is likely to generate a JSON
| response to begin with: the more you have to coerce the LLM, the
| less likely it is to generate sane output. With smaller models
| you more quickly end up at the edge of the space where the LLM
| has meaningful predictive power, and so the outputs start
| getting closer to random noise.
|
| FWIW this is measured by me using a vibes-based method, nothing
| rigorous, just a lot of hours spent on various LLM projects. I
| have not used these particular tools yet, but ollama was
| previously able to guarantee JSON output through what I assume
| are similar techniques, and my partner and I previously worked
| on a jsonformer-like thing for oobabooga, another LLM runtime
| tool.
| lolinder wrote:
| Working with OpenAI's models I've found a very good strategy is
| to have two passes if you can afford the extra tokens: one pass
| uses a heavy model and natural language with markdown sections
| discussing the reasoning and providing a final natural language
| answer (ideally labeled clearly with a markdown header). The
| second pass can use a cheaper and faster model to put the
| answer into a structured output format for consumption by the
| non-LLM parts of the pipeline.
|
| You basically use JSON schema mode to draw a clean boundary
| around the wishy-washy language bits, using the LLM as a
| preprocessor to capture its own output in a useful format.
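|
| A hypothetical sketch of that two-pass pattern, shown here
| against a local Ollama server rather than OpenAI (model names
| and the schema are purely illustrative):
|
|     import json
|     import requests
|
|     def chat(model, content, fmt=None):
|         body = {"model": model, "stream": False,
|                 "messages": [{"role": "user", "content": content}]}
|         if fmt is not None:
|             body["format"] = fmt  # JSON schema for structured output
|         r = requests.post("http://localhost:11434/api/chat", json=body)
|         return r.json()["message"]["content"]
|
|     # Pass 1: heavy model, free-form markdown reasoning with a
|     # clearly labeled final answer section.
|     reasoning = chat(
|         "llama3.1:70b",
|         "Discuss the trade-offs of X vs Y and end with a section "
|         "titled '## Final Answer'.")
|
|     # Pass 2: cheaper model just extracts the answer into a schema.
|     schema = {"type": "object",
|               "properties": {"answer": {"type": "string"}},
|               "required": ["answer"]}
|     structured = chat(
|         "llama3.2:3b",
|         "Extract the final answer from the text below as JSON.\n\n"
|         + reasoning,
|         fmt=schema)
|     print(json.loads(structured))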
| mmoskal wrote:
| It depends how fine-tuned the model is to JSON output.
|
| Also, you need to tell the model the schema. If you don't, you
| will get more weird tokenization issues.
|
| For example, if the schema expects a JSON key "foobarbaz" and
| the canonical BPE tokenization is ["foobar", "baz"], the token
| mask generated by all current constrained output libraries will
| let the model choose from "f", "foo", "foobar" (assuming these
| are all valid tokens). The model might then choose "foo", and
| then the constraint will force e.g. "bar" and "baz" as the next
| tokens. Now the model will see ["foo", "bar", "baz"] instead of
| ["foobar", "baz"] and will get confused. [0]
|
| If the model knows from the prompt "foobarbaz" is one of the
| schema keys, it will generally prefer "foobar" over "foo".
|
| [0] In modern models these tokens are related because of
| regularization, but they are not the same.
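|
| A quick way to sanity-check this is to look at the canonical
| tokenization of your schema keys (sketch below uses tiktoken's
| cl100k_base encoding purely as an example; the real tokenizer is
| model-specific):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     ids = enc.encode("foobarbaz")
|     # The canonical BPE split the model expects to see; a token
|     # mask that forces a different split produces sequences the
|     # model rarely saw in training.
|     print([enc.decode([i]) for i in ids])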
| crystal_revenge wrote:
| There was a paper going around claiming that structured outputs
| _did_ hurt the quality of the output, but it turns out their
| experiment setup was laughably bad [0].
|
| It looks like, so long as you're reasonable with the prompting,
| you tend to get _better_ outputs when using structure.
|
| 0. https://blog.dottxt.co/say-what-you-mean.html
| coredog64 wrote:
| I've seen one case where structured output was terrible: OCR
| transcription of handwritten text in a form with blanks. You
| want a very low temperature for transcription, but as soon as
| the model starts to see multiple blank sequences, it starts
| to hallucinate that "" is the most likely next token.
| quaintdev wrote:
| So I can use this with any supported models? The reason I'm
| asking is because I can only run 1b-3b models reliably on my
| hardware.
| parthsareen wrote:
| Hey! Author of the blog post here. Yes you should be able to
| use any model. Your mileage may vary with the smaller models
| but asking them to "return x in json" tends to help with
| accuracy (anecdotally).
| pamelafox wrote:
| Do you happen to know if gpt-4o would be negatively affected
| by the addition of "return x in json"? I'm debating whether I
| could use the same prompt across all models, hosted and
| ollama.
| dcreater wrote:
| Why do smaller models fail to follow? Isn't the objective of
| constraints that it always provides the right output type?
| parthsareen wrote:
| The constraints will always be met. It's the data inside
| that might be inaccurate. YMMV with smaller models in that
| sense.
| vincentpants wrote:
| Wow neat! The first step to format ambivalence! Curious to see
| how well this performs on the edge, where our overhead is always
| so scarce!
|
| Amazing work as always, looking forward to taking this for a
| spin!
| lxe wrote:
| I'm still running oobabooga because of its exl2 (ExLlamaV2)
| support, which does much more efficient inference on dual 3090s.
| thot_experiment wrote:
| I haven't touched ooba in a while. What's the situation like
| with exl2 vs the non-homogeneous quantization methods people
| are using, like Q3_K_S or whatever? IIRC, while exl2 is faster,
| the GPTQ quants were outperforming it in terms of accuracy,
| especially at lower bit depths.
| chirau wrote:
| This is wonderful news.
|
| I was actually scratching my head over how to structure a regular
| prompt to produce CSV data without extra nonsense like "Here is
| your data" and "Please note blah blah" at the beginning and end,
| so this is most welcome: I can define exactly what I want
| returned and then just push the structured output to CSV.
| firejake308 wrote:
| Remember that you still need to include an instruction to
| produce a CSV to get the prompt into the right context to
| generate a CSV that makes sense. Otherwise, you may get output
| that is technically in the CSV format but doesn't make any
| sense because the model was actually trying to write a
| paragraph response and the token sampler just selected really
| low-probability tokens that the model didn't really want to
| say.
| mmoskal wrote:
| It seems ollama only supports JSON Schema.
|
| Interestingly, JSON Schema has much less of this problem than
| say CSV - when the model is forced to produce `{"first_key":`
| it will generally understand it's supposed to continue in
| JSON. It still helps to tell it the schema though, especially
| due to weird tokenization issues you can get otherwise.
| diggan wrote:
| > It seems ollama only supports JSON Schema.
|
| "Encoding" CSV as JSON is trivial though, so make it output
| JSON then parse the array-of-arrays into CSV :)
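|
| Something like this (a minimal sketch; assumes the model was
| asked, via the format schema, for an object containing an
| array-of-arrays under a "rows" key):
|
|     import csv, io, json
|
|     # `content` stands in for the model's structured output.
|     content = '{"rows": [["name", "score"], ["alice", "10"], ["bob", "7"]]}'
|
|     rows = json.loads(content)["rows"]
|     buf = io.StringIO()
|     csv.writer(buf).writerows(rows)
|     print(buf.getvalue())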
| KTibow wrote:
| A lot of the time you can prevent this by prefilling the output
| with ```\n and stopping at ```.
| chirau wrote:
| care to explain further? I am not sure I understand you fully
| bonzini wrote:
| ``` is markdown for preformatted text. It puts the LLM in
| the mood of generating machine-readable data instead of
| prose.
| jstanley wrote:
| A lot of the time you can prevent this by prefilling the
| output with "```\n", and stopping at "```".
| miki123211 wrote:
| LLMs are essentially text completion machines, they get
| some text in as a list of tokens, and their job is to
| predict what token comes next.
|
| Typically, you're interested in getting more than just one
| token out of them, so you run them in a loop. You start
| with the token list containing just the user's message, run
| the LLM with that list, append the newly obtained token at
| the end, run the LLM again to see what comes after that new
| token, and so on, until a special end-of-sequence token is
| generated and the loop terminates.
|
| There's no reason why you have to start with nothing but
| the user's message, though. You can let the user specify
| the beginning of the LLM's supposed completion, and ask the
| LLM to start from that point, instead of generating its
| completion from scratch. This essentially ensures that what
| the LLM says begins with a specific string.
|
| Not all APIs expose this feature (there are good safety
| reasons not to), but all LLMs are capable of doing it in
| principle, and doing it with the open ones is trivial.
|
| LLMs are typically trained to output markdown, which uses
| ```language_name to denote code blocks in language_name, so
| that user interfaces like Chat GPT's web UI can do proper
| syntax highlighting.
|
| Therefore, if you make your LLM think that it already
| started a completion, and that completion began with
| ```json, it will predict what's most likely to come after
| that delimiter, and that would be a JSON block.
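|
| A rough sketch of that prefill trick against Ollama's
| /api/generate endpoint (raw mode sends the prompt verbatim with
| no chat template, so the prompt itself can end with the code
| fence; the model name is illustrative):
|
|     import requests
|
|     prompt = (
|         "Extract the amount and merchant from this text as JSON.\n"
|         "Text: You have spent 190 at Fresh Mart.\n"
|         "```json\n"   # prefill: the model continues inside the block
|     )
|
|     resp = requests.post("http://localhost:11434/api/generate", json={
|         "model": "llama3.1",
|         "prompt": prompt,
|         "raw": True,                   # no template, prompt sent as-is
|         "stream": False,
|         "options": {"stop": ["```"]},  # stop when the block closes
|     })
|     print(resp.json()["response"])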
| magicalhippo wrote:
| The way LLMs work is you feed them a vector, an array of
| numbers, that represents a sequence of tokens.
|
| You turn the crank and you get a probability distribution
| for the next token in the sequence. You then sample the
| distribution to get the next token, append it to the
| vector, and do it again and again.
|
| Thus the typical LLM has no memory as such; it infers
| what it was thinking by looking at what it has already said
| and uses that to figure out what to say next, so to speak.
|
| The characters in the input prompt are converted to these
| tokens, but there are also special tokens such as start of
| input, end of input, start of output and end of output. The
| end of output token is how the LLM "tells you" it's done
| talking.
|
| Normally in a chat scenario these special tokens are
| inserted by the LLM front-end, say Ollama/llama.cpp in this
| case.
|
| However, if you interface more directly you need to add
| these yourself, and hence can prefill the output before
| feeding the vector to the LLM for the first time. The LLM
| will then "think" it has already started writing code, say,
| and thus it is likely to continue doing so.
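|
| A toy sketch of that loop (using the small GPT-2 checkpoint from
| Hugging Face transformers purely for illustration, with greedy
| sampling and a "prefilled" output):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|     ids = tok.encode("```json\n{", return_tensors="pt")  # prefilled start
|     for _ in range(40):
|         with torch.no_grad():
|             logits = model(ids).logits[0, -1]   # next-token distribution
|         next_id = int(torch.argmax(logits))     # greedy "sampling"
|         if next_id == tok.eos_token_id:         # end-of-output token
|             break
|         ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
|     print(tok.decode(ids[0]))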
| polotics wrote:
| You have described an RNN, I think. Don't attention heads
| add something that you could compare to rough-and-ready
| understanding?
| magicalhippo wrote:
| Auto-regressive LLMs do this as I understand it, though
| it can vary if they feed the combined input and output[1]
| through the whole net like GPT-2 and friends, or just the
| decoder[2]. I described the former, and I should have
| clarified that.
|
| In either case you can "prime it" like it was suggested.
|
| A regular RNN has more feedback[3], like each layer
| feeding back to itself, as I understand it.
|
| Happy to be corrected though.
|
| [1]: https://jalammar.github.io/illustrated-gpt2/#one-difference-...
|
| [2]: https://medium.com/@ikim1994914/understanding-the-modern-llm...
|
| [3]: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
| xnx wrote:
| Is there a best approach for providing structured input to LLMs?
| Example: feed in 100 sentences and get each one classified in
| different ways. It's easy to get structured data out, but my
| approach of prefixing line numbers seems clumsy.
| mmoskal wrote:
| Models are trained on Markdown, JSON and various programming
| languages, so either one of these should work.
|
| However, in this case, you're best off giving the model the
| sentences one by one to avoid confusing it. If you structure
| the prompt like "Classify the following sentence, here are the
| rules ...." + sentence, then you should be hitting the prefix
| cache and getting even better performance than with a single
| query. Of course, this only works if you have a prefix cache
| and are not paying per input token (though most providers now
| let you indicate you want to use the prefix cache and pay
| less).
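|
| A sketch of that per-sentence approach combined with structured
| outputs (local Ollama server; the model name, rules and schema
| are illustrative):
|
|     import json
|     import requests
|
|     INSTRUCTIONS = ("Classify the following sentence. "
|                     "Rules: ... ")
|     schema = {
|         "type": "object",
|         "properties": {
|             "sentiment": {"type": "string",
|                           "enum": ["positive", "negative", "neutral"]},
|             "topic": {"type": "string"},
|         },
|         "required": ["sentiment", "topic"],
|     }
|
|     def classify(sentence):
|         # The instruction prefix is identical on every call, so a
|         # prefix cache only has to process the new sentence.
|         r = requests.post("http://localhost:11434/api/chat", json={
|             "model": "llama3.1",
|             "messages": [{"role": "system", "content": INSTRUCTIONS},
|                          {"role": "user", "content": sentence}],
|             "format": schema,
|             "stream": False,
|         })
|         return json.loads(r.json()["message"]["content"])
|
|     for s in ["The release fixed every bug I hit.",
|               "Support never replied to my ticket."]:
|         print(classify(s))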
| xnx wrote:
| Good idea. I might try that. I think classification quality
| improves when the model can also see the following sentences.
| I'll have to see if feeding them sequentially makes it worse.
| quaintdev wrote:
| Yay! It works. I used gemma2:2b and gave it the text below:
|
|     You have spent 190 at Fresh Mart. Current balance: 5098
|
| and it gave this output:
|
|     {\n\"amount\": 190,\n\"balance\": 5098 ,\"category\": \"Shopping\",\n\"place\":\"Fresh Mart\"\n}
| diggan wrote:
| That's some pretty inconsistent JSON, but I guess that makes
| sense when using a really small model and gemma on top of that.
| rdescartes wrote:
| If anyone needs more powerful constrained outputs, llama.cpp
| supports GBNF grammars:
|
| https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
| jimmySixDOF wrote:
| That's exactly what they are using.
| sa-code wrote:
| This is amazing, thank you for the link
| lolinder wrote:
| Have you found the output for arbitrary grammars to be
| satisfactory? My naive assumption has been that these models
| will produce better JSON than other formats simply by virtue of
| having seen so much of it.
| rdescartes wrote:
| If you want to get a good result, the grammar should follow
| the expected output from the prompt, especially if you use a
| small model. Normally I would manually fine-tune the prompt
| to output the grammar's format first, and then apply the
| grammar in production.
| throwaway314155 wrote:
| Who would downvote this perfectly reasonable question?
|
| edit: Nm
| dcreater wrote:
| How is it more powerful?
| evilduck wrote:
| Grammars don't have to just be JSON, which means you could
| have it format responses as anything with a formal grammar.
| XML, HTTP responses, SQL, algebraic notation of math, etc.
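|
| For example, a rough sketch with llama-cpp-python (the model
| path is illustrative), constraining the output to something
| that isn't JSON at all:
|
|     from llama_cpp import Llama, LlamaGrammar
|
|     # A GBNF grammar that only allows three possible completions.
|     grammar = LlamaGrammar.from_string(
|         'root ::= "yes" | "no" | "maybe"')
|
|     llm = Llama(model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf")
|     out = llm("Is the sky green? Answer: ",
|               grammar=grammar, max_tokens=4)
|     print(out["choices"][0]["text"])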
| JackYoustra wrote:
| PRs on this have been open for something like a year! I'm a bit
| sad about how quiet the maintainers have been on this.
| dcreater wrote:
| Reading the tea leaves, they seem to be headed down the corporate
| path, so view everything through that lens and ask how it
| maximizes profit.
| parthsareen wrote:
| Hey! Author of the post and one of the maintainers here. I
| agree - we (maintainers) got to this late and in general want
| to encourage more contributions.
|
| Hoping to be more on top of community PRs and get them merged
| in the coming year.
| ein0p wrote:
| That's very useful. To see why, try to get an LLM to _reliably_
| generate JSON output without this. Sometimes it will, but
| sometimes it'll just YOLO and produce something you didn't ask
| for and that can't be parsed.
| guerrilla wrote:
| No way. This is amazing and one of the things I actually wanted.
| I love ollama because it makes using an LLM feel like using
| any other UNIX program. It makes LLMs feel like they belong on
| UNIX.
|
| Question though. Has anyone had luck running it on AMD GPUs? I've
| heard it's harder but I really want to support the competition
| when I get cards next year.
| rcarmo wrote:
| Yes, even on iGPUs. I was running it fairly well on a mini-PC
| with a 780M and the BIOS set to allocate 16GB of shared
| memory to it.
| lormayna wrote:
| This is fantastic news! I spent hours fine-tuning my prompt to
| summarise text and output JSON, and I still have issues
| sometimes. Is this feature also available from Go?
| lioeters wrote:
| It looks like the structured output feature is available in Go,
| with the `format` field:
|
|     type GenerateRequest struct {
|         ...
|         // Format specifies the format to return a response in.
|         Format json.RawMessage `json:"format,omitempty"`
|     }
|
| https://github.com/ollama/ollama/blob/de52b6c2f90ff220ed9469...
| rcarmo wrote:
| I must say it is nice to see the curl example first. As much as I
| like Pydantic, I still prefer to hand-code the schemas, since it
| makes it easier to move my prototypes to Go (or something else).
| highlanderNJ wrote:
| What's the value-add compared to `outlines`?
|
| https://www.souzatharsis.com/tamingLLMs/notebooks/structured...
| parthsareen wrote:
| Hey! Author of the blog here. The current implementation uses
| llama.cpp GBNF which has allowed for a quick implementation.
| The biggest value-add at this time was getting the feature out.
|
| With the newer research (Outlines, XGrammar) coming out, I hope
| to be able to update the sampling to support more formats,
| increase accuracy, and improve performance.
| diimdeep wrote:
| Very annoying marketing, and pretending to be anything other
| than just a wrapper around llama.cpp.
| evilduck wrote:
| Can you ollama haters stop with this bullshit?
|
| Does llama.cpp do dynamic model loading and unloading? Will it
| fetch a model you request but isn't downloaded? Does it provide
| SDKs? Does it have startup services it provides? There's space
| for things that wrap llama.cpp and solve many of its pain
| points. You can find piles of reports of people struggling to
| build and compile llama.cpp for some reason or another who then
| clicked an Ollama installer and it worked right away.
|
| It's also a free OSS project giving all this away, why are you
| being an ass and discouraging them?
| dcreater wrote:
| They're going to go corporate
| diimdeep wrote:
| Sure, llama.cpp does not do all of that, except that it lets
| you curl models from public, free-to-use endpoints; it does do
| that. But an SDK? Fuck that. Load, unload and startup services:
| who even needs that? All this value is minuscule compared to
| the core functionality provided by ggml/llama.cpp.
|
| But this submitted link is not even about any of that; it is
| about the one thing llama.cpp really does not do: write more
| lines of marketing material than lines of code. That is what
| this marketing material is, lines of code that really just
| wrap 10x more lines of code further down, all without making
| that clear as day.
| evilduck wrote:
| It's not even worth countering all these lies you're
| telling. Enjoy your self inflicted irrational misery.
___________________________________________________________________
(page generated 2024-12-07 23:01 UTC)