[HN Gopher] Show HN: LLMs can generate valid JSON 100% of the time
___________________________________________________________________
Show HN: LLMs can generate valid JSON 100% of the time
Outlines is a Python library that focuses on text generation with
large language models. Brandon and I are not LLM experts and
started the project a few months ago because we wanted to
understand better how the generation process works. Our original
background is probabilistic, relational and symbolic programming.
Recently we came up with a fast way to generate text that matches a
regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-
guide...). The basic idea is simple: regular expressions have an
equivalent Deterministic-Finite Automaton (DFA) representation. We
can transform this DFA into a generative model: in each state we
get a list of symbols which correspond to completions that
partially match the regular expression. We mask the other symbols
in the logits returned by a large language model, sample a new
symbol and move to the next state. The subtlety is that language
models work with tokens, not symbols, so we derive a new FSM whose
alphabet is the model's vocabulary. We can do this in only one pass
over the vocabulary. Generating the token masks thus only requires
a dictionary lookup at each state. Our method blows other libraries
like Microsoft's guidance out of the water. From there it was only
a small leap to be able to generate text that follows a JSON schema
(https://json-schema.org/), or is parseable into a Pydantic model
(https://docs.pydantic.dev/latest/usage/models/). The method works
with union types, optional types, nested schemas, arrays,
everything. It is guaranteed that the output is parseable. I think
it's cool, and I've spent a lot of time watching even tiny models
output valid JSON over the weekend. Hope you will too. I look
forward to feedback, bug reports, feature requests and discussions!
Edit: Link to our pre-print explaining the method and how this can
be extended to generate text that follows a Context-Free Grammar
https://arxiv.org/abs/2307.09702
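For the curious, here is a minimal self-contained sketch of the
indexing idea, with a toy DFA for the regex [0-9]+ and a toy
vocabulary (an illustration only, not the library's actual code):

    DFA = {("start", "digit"): "num", ("num", "digit"): "num"}

    def walk(state, token_text):
        # Advance the DFA over a whole token; None means no match.
        for ch in token_text:
            kind = "digit" if ch.isdigit() else "other"
            state = DFA.get((state, kind))
            if state is None:
                return None
        return state

    vocab = {0: "12", 1: "3", 2: "foo", 3: "4a", 4: "007"}

    # One pass over the vocabulary builds the whole index:
    # state -> {token_id: state reached after emitting that token}
    index = {
        s: {t: nxt for t, txt in vocab.items()
            if (nxt := walk(s, txt)) is not None}
        for s in ("start", "num")
    }

    print(index)
    # {'start': {0: 'num', 1: 'num', 4: 'num'},
    #  'num':   {0: 'num', 1: 'num', 4: 'num'}}
    # At generation time the allowed token ids in a state are just
    # index[state].keys(): a dictionary lookup, no vocabulary scan.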
Author : remilouf
Score : 384 points
Date : 2023-08-14 18:52 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Ilasky wrote:
| OpenAI has this capability built in with functions[0], I believe!
| Building my own project[1] I have implemented functions in
| combination with guidance[2] and haven't had a hiccup yet! I have
| a JSON parser function there, just in case, but it seems to be
| working reliably.
|
| Here's a bit more of a description of using the functions API for
| JSON returns: https://yonom.substack.com/p/native-json-output-
| from-gpt-4
|
| [0] https://openai.com/blog/function-calling-and-other-api-
| updat...
|
| [1] https://resgen.app
|
| [2] https://github.com/guidance-ai/guidance
| londons_explore wrote:
| >OpenAI has this capability built in with functions
|
| From OpenAI's docs:
|
| > note: the model may generate invalid JSON
|
| I would guess they _don't_ use your method - and perhaps they
| should!
| Ilasky wrote:
| Good catch! It really is a combination of guidance
| guaranteeing JSON output and OpenAI getting it right a good
| majority of the time[0]. But yeah, I can see how it can be
| frustrating that the JSON output is not guaranteed by the
| docs.
|
| [0] >>99% in my experience
| Ilasky wrote:
| That said, I am definitely going to look into this library
| and compare its results to guidance, since they claim it
| blows it out of the water (which is very enticing!)
| remilouf wrote:
| Figure 2 in our paper (https://arxiv.org/abs/2307.09702)
| shows the difference for a single regex.
| thomasfromcdnjs wrote:
| I do the same, just tell OpenAI to call a parser at the end and
| voilà.
| Scene_Cast2 wrote:
| One potential drawback I can see is if the viable tokens are far
| down the list of predictions. In that case, filtering down to
| just those tokens causes a distribution shift, with the
| resulting output being less stable / less sensible.
| Scarblac wrote:
| It can't be less sensible JSON than syntactically invalid JSON.
| All the tokens higher on the list are syntax errors.
| haswell wrote:
| That depends highly on the values contained within the JSON.
| Syntactically correct is only useful if the rest of the
| content is useful.
| skybrian wrote:
| It seems unlikely for JSON, but this might indicate that the
| model has somehow painted itself into a corner and the best
| thing to do is backtrack?
|
| Regenerating the entire response could be seen as an extreme
| form of backtracking.
| remilouf wrote:
| Indeed, this remains an empirical question.
| coding123 wrote:
| Can someone re-explain all of this? If I go to GPT-3.5 and ask it
| to give me some information in JSON, vs whatever this library is
| doing?
| odyssey7 wrote:
| Each time you run an LLM on a sequence of tokens, it generates
| a probability distribution giving each token's likelihood of
| occurring next in the sequence. To actually determine the next
| token in the sequence, any of various strategies can be used to
| select from that probability distribution.
|
| The challenge in guided generation is conforming the output
| sequence with a formal language such as a JSON schema or even a
| rigorously grammatical version of English; typically in a
| formal language, most tokens in the vocabulary will be
| _impossible_ as next token candidates rather than merely
| unlikely. The authors explain that most guided generation
| systems are checking each token in the vocabulary to see if it
| would be a valid continuation of the sequence, filtering the
| probability distribution according to formal constraints before
| making the next token selection. The authors improve upon this
| process by indexing valid next tokens according to a formal
| language recognizer's possible states, so that the list of
| valid next tokens can be looked up in constant time rather than
| testing every token in the vocabulary.
|
| With the valid next token options in hand, the probability
| distribution for next tokens is filtered and then a selection
| is made.
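|
| A rough sketch of the difference (hypothetical names standing in
| for a real recognizer and the paper's precomputed index):
|
|     # Naive guided generation: every step scans the vocabulary
|     # and asks the recognizer about each candidate token.
|     def allowed_naive(state, vocab, is_valid_continuation):
|         return [t for t in vocab
|                 if is_valid_continuation(state, t)]  # O(|vocab|)
|
|     # Indexed guided generation: the same question was answered
|     # ahead of time for every recognizer state.
|     def allowed_indexed(state, index):
|         return index[state]  # a single dict lookup per step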
| popinman322 wrote:
| Does this work in tandem with beam search or does it do greedy
| sampling?
| btwillard wrote:
| The underlying approach can improve the performance of anything
| that requires the set of non-zero probability tokens at each
| step, and anything that needs to continue matching/parsing from
| a previous state.
| Kiro wrote:
| Does this mean that I need to call the LLM API once for each
| token?
| baobabKoodaa wrote:
| No. You need to hook into the LLM at a lower level. One API
| call typically triggers a generation of a sequence of tokens
| and this library has to poke into things between each generated
| token.
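|
| With Hugging Face transformers, for example, that hook is a
| LogitsProcessor passed to model.generate(); a rough sketch, where
| allowed_ids_for is a placeholder for the constraint logic:
|
|     import torch
|     from transformers import LogitsProcessor
|
|     class MaskingProcessor(LogitsProcessor):
|         # Runs between every forward pass and the sampling step.
|         def __init__(self, allowed_ids_for):
|             self.allowed_ids_for = allowed_ids_for
|
|         def __call__(self, input_ids, scores):
|             mask = torch.full_like(scores, float("-inf"))
|             for row, ids in enumerate(input_ids):
|                 mask[row, self.allowed_ids_for(ids.tolist())] = 0.0
|             return scores + mask  # invalid tokens get ~0 probability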
| Deukhoofd wrote:
| Looks interesting! How would you say it compares to Microsoft's
| TypeChat (beyond the obvious Python/TypeScript difference)?
|
| https://microsoft.github.io/TypeChat/blog/introducing-typech...
| remilouf wrote:
| Thanks for bringing this library to my attention! From my
| understanding, TypeChat proceeds by (1) generating (2)
| attempting validation (3) if it fails, call the LLM again to
| fix the output (4) etc.
|
| Our method on the other hand _guarantees_ that the output will
| follow the specs of the JSON schema. No need to call the LLM
| several times.
| 1wheel wrote:
| There's also https://lmql.ai/
| remilouf wrote:
| LMQL (and guidance https://github.com/guidance-ai/guidance)
| are much less efficient: they loop over the entire
| vocabulary at each step, while we only do it once at
| initialization.
| 2bitencryption wrote:
| TypeChat: let's try really hard to convince the model to
| make the highest-scoring tokens follow the grammar we want.
|
| Guidance (and this project?): Let's not even bother with trying
| to convince the model; instead, we'll only sample from the set
| of tokens that are guaranteed to be correct for the grammar we
| want to emit.
| btwillard wrote:
| Yeah, and our addition to all that is to almost completely
| remove the cost of determining the next valid tokens on each
| step.
| xigency wrote:
| Thanks for building this. The mechanics are such an obvious idea
| that it's astounding that the first-party platforms haven't done
| this yet. I would be interested to see how this could be used for
| other tasks outside of JSON that require structured input.
| umvi wrote:
| > it's astounding that the first-party platforms haven't done
| this yet
|
| I was under the impression LLM tech is currently in a breakneck
| arms race and that things are dramatically changing every few
| months. It could simply just be a consequence of limited
| developer resources. It would be "astounding" if decade-old
| tech were missing such a fundamental feature, but for AI tech
| in arms-race mode it seems reasonable that they are still
| missing QoL features.
| winwang wrote:
| I think they meant that you'd expect simpler/more obvious
| ideas to be implemented first.
| remilouf wrote:
| Thanks! We have extended the approach to grammar-based
| sampling. We describe the approach in the paper linked above.
| The following PR is relevant: https://github.com/normal-
| computing/outlines/pull/178
| Lerc wrote:
| Could this same approach be applied at training? If the
| guidance does a lot of the syntactical heavy lifting, would
| that create the opportunity for a model to use the weights
| for something else? Essentially not bothering to reduce the
| error of things that the guidance will stomp on anyway.
| LakshyAAAgrawal wrote:
| Hi, the paper at https://arxiv.org/abs/2306.10763 titled
| "Guiding Language Models of Code with Global Context using
| Monitors" shows how to have the language models generate code
| without hallucinated dereferences.
| Animats wrote:
| OK, you get syntactically valid JSON, but does it contain the
| correct info? This is effectively a polisher, like spell check,
| which gives the output superficially correct form but doesn't
| understand the content. Right?
| burke wrote:
| You can go pretty deep once you get context free grammars. For
| example, I'm using torch-grammar (but outlines should be able
| to do the same thing once CFG support is merged) to not just
| restrict the format of a generation to a DSL's syntax, but to
| restrict the keys it updates to valid keys in a known set.
|
| e.g.: int_key ::= DQUO ("f" ("e" ("atured-"
| ("b" ("log." ("p" ("ost_limit" | "a" ...
|
| Obviously, yeah, it doesn't "understand" the content, but
| that's what the LLM is for. It's remarkable how plausible the
| generations you can get out of random noise are with a
| sufficiently-restrictive grammar. Bolting that onto a well-
| trained LLM is pretty powerful.
| btwillard wrote:
| FYI: We've had grammar constraints available in Outlines for
| a while, but not using the FSM and indexing approach that
| makes the regex case so fast. My open PR only adds that.
| empath-nirvana wrote:
| This isn't really an interesting question is it? Everyone knows
| that chatgpt is not an oracle. It doesn't need to output the
| correct information 100% of the time.
| coder543 wrote:
| This analogy falls apart because the spellchecker is separate
| from the author, and doesn't know what the author intended.
|
| Here, the LLM is still dictating the token probabilities, so
| the content will be as correct as the LLM can make it, given
| the constraints. AIUI, the sampler is just choosing tokens on a
| combination of probability and syntactic correctness, instead
| of strictly on probability.
|
| If the LLM is forced to provide a numeric temperature for
| Seattle, and the input doesn't contain that data, then
| obviously the LLM will be forced by the sampler to provide a
| random answer if the sampler will accept nothing else, much
| like a human who is forced to mark "true"/"false" on an online
| form, with no option to reject the question and explain that
| the question isn't even a true/false question.
|
| I don't know about this specific implementation, but it seems
| important to design systems like this to always "accept"
| (sample for) an error response from the LLM so that it can
| hopefully reject invalid requests.
|
| But, yes, all the usual caveats about LLMs apply. It can't
| provide correct answers to things it doesn't know. Forcing it
| to respond with the answer to life, the universe, and
| everything is not going to provide a meaningful response. Even
| things it "knows", it can still get wrong sometimes.
| anticrymactic wrote:
| I'm stupid with LLMs, but would it be possible to have this
| output with gpt4's intelligence, or would it have to be
| specifically trained?
| coder543 wrote:
| It's something OpenAI should really implement themselves.
| Implementing it from the client side will mean sending the
| same request over and over until you get a syntactically
| correct answer, which is going to be much slower and likely
| to cost a lot. The server can guide the generation, but the
| client can (currently) only hint at what it wants. ChatGPT4
| is fairly good at following schemas, and that's what OpenAI
| currently relies on, but they make no guarantees.
|
| It likely wouldn't require additional training. It's a
| change to the way the server uses the model, not a change
| to the model itself... but we don't know ChatGPT4's true
| architecture because OpenAI won't publish anything about
| it, so it's hard to say for sure.
| chipsrafferty wrote:
| Why isn't it possible to design LLMs that say "I don't know"?
| Lerc wrote:
| They can say I don't know when they contain the fact that
| they don't know something. For instance saying "I don't
| know" could be a response to"What is the meaning of life"
|
| On the other hand if you ask a LLM how to do something
| about fish maintenance that it does not know how to do, it
| might produce an answer like "Sure, first take your fish
| and " at which point all of the options for the next word
| are all over the place because there isn't the information
| available to guide the choice. The sentence started as if
| it knew the answer because there was no information to say
| that it didn't. By the time the absence of information has
| an impact, the LLM is already committed to the sentence
| where it is confidently giving you an answer.
| [deleted]
| coder543 wrote:
| It _is_ possible... ChatGPT4 says that all the time. It's
| just not _guaranteed_ that an LLM will recognize that it
| doesn't know a particular answer every time. I had even
| already mentioned in the comment you're replying to that
| you should leave room in the sampler to _allow_ the LLM to
| provide error responses. I never said it wasn't possible.
|
| Not to anthropomorphize LLMs too much, but humans will also
| sometimes respond confidently with a wrong answer too. Both
| LLMs and humans will sometimes say the wrong thing when
| they don't actually know an answer, but sometimes
| (hopefully most of the time) they will instead say that
| they don't know the answer.
|
| Contrary to another response here, I do not believe that
| LLMs respond "I don't know" only when they have
| specifically memorized that they don't know a fact. I don't
| believe that's a good or useful mental model for how this
| stuff works. When you're dealing with tens or hundreds of
| billions of parameters, the "why" is often elusive and
| complicated. It's also probabilistic; it may respond that
| it doesn't know one time, but the next time, it may
| unfortunately claim to know an answer it doesn't know --
| which is a form of hallucination. Reducing hallucinations
| is one of the major goals of LLM research today, and
| ChatGPT4 performs much better in this area than ChatGPT3.5
| did.
|
| Here is a quick example of ChatGPT4 saying it doesn't know:
| https://chat.openai.com/share/7b72b109-fb84-4988-891b-f2eec
| c...
|
| I'm sure no one at OpenAI specifically trained ChatGPT4 to
| recognize a question about the Stanley Cup and respond that
| it doesn't know the answer, but it still said that it
| didn't know. It _absolutely did not_ start a sentence with
| "the winner of the 2023 Stanley Cup was..." and then wander
| its way into a bad answer. That's not a good representation
| of how this stuff works, even though it does sample one
| token at a time.
| skybrian wrote:
| They do, but it's a form of imitation, not actually knowing
| what they don't know.
|
| Ask an LLM to imitate a confident physicist and it will
| try, regardless of how much physics it knows.
|
| Or if you tell ChatGPT that it's wrong multiple times, it
| may learn the pattern and assume it's always wrong,
| resulting in a downward spiral. (This can happen when using
| Code Interpreter and it makes several failed attempts to
| correct a mistake.)
|
| The difficult research problem is training it to have an
| accurate model of what it knows.
| ianbutler wrote:
| https://github.com/newhouseb/clownfish
|
| Which I've been using for a while now, also restricts the
| sampling space to force correct generation, but does so as the
| result of a different process than yours.
| 2bitencryption wrote:
| it still blows my mind that OpenAI exposes an API with Functions
| calling, and yet _does not guarantee the model will call your
| function correctly_; in fact, it does not even guarantee the
| output will be valid JSON.
|
| When this is, really, a solved problem. I've been using
| github.com/microsoft/guidance for weeks, and it genuinely, truly
| guarantees correct output, because _it simply does not sample
| from tokens that would be invalid._
|
| It just seems so obvious, I still have no clue why OpenAI does
| not do this. Like, why fuss around with validating JSON after the
| fact, when you can simply guarantee it is correct in the first
| place, by only sampling tokens _if they conform to the grammar
| you are trying to emit?_
| newhouseb wrote:
| I think this is likely a consequence of a couple of factors:
|
| 1. Fancy token selection w/in batches (read: beam search) is
| probably fairly hard to implement at scale without a
| significant loss in GPU utilization. Normally you can batch up
| a bunch of parallel generations and just push them all through
| the LLM at once because every generated token (of similar
| prompt size + some padding perhaps) takes a predictable time.
| If you stick a parser in between every token that can take
| variable time then your batch is slowed by the most complex
| grammar of the bunch.
|
| 2. OpenAI appears to work under the thesis articulated in the
| Bitter Lesson [i] that more compute (either via fine-tuning or
| bigger models) is the least foolish way to achieve improved
| capabilities, hence their approach of function-calling just
| being... a fine tuned model.
|
| [i] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| BoorishBears wrote:
| I just left a comment along these lines, but realistically it's
| probably cheaper to just re-emit than to add the machinery that
| enables this to their existing architecture.
|
| At most I could have seen them maybe running a schema validator
| against the output and re-requesting on your behalf, but even
| that's probably cheaper for them to do client side (I will say,
| I'm surprised their API wrapper hasn't been updated to do this
| yet)
| 2bitencryption wrote:
| > maybe running a schema validator against the output and re-
| requesting on your behalf
|
| this is the part that blows my mind. You don't have to do
| this! You don't have to sample the entire output, and then
| validate after the fact.
|
| You're not _required_ to greedily pick the token with the
| highest score. You get the scores of all tokens, on every
| forward pass! So why even waste time picking invalid tokens
| if you're just going to validate and retry later on??
|
| (note: when I say "you" here, I mean whoever is hosting the
| model. It _is_ true that OpenAI does not expose all token
| scores, it only gives you back the highest-scoring one. So a
| client-side library is not able to perform this grammar-based
| sampling.
|
| BUT, OpenAI themselves host the model, and they see all
| token outputs, with all scores. And in the same API request,
| they allow you to pass the "function definition" as a JSON
| schema. So why not simply apply that function definition as a
| mask on the token outputs? They could do this without
| exposing all token scores to you, which they seem very
| opposed to for some reason.)
| BoorishBears wrote:
| Maybe re-read what I said?
|
| > realistically it's probably cheaper to just re-emit than
| _to add the machinery that enables this to their existing
| architecture_
|
| There are literally dozens of random projects that have
| implemented logit based masking, it's a trivial thing to
| implement.
|
| What's probably not as trivial is deploying it at scale
| with whatever architecture OpenAI already has in place.
| Especially if they're using the router-based MoE
| architecture most people are assuming they use.
|
| OpenAI doesn't expose token probabilities for their RLHF
| models, yet they did for GPT-3. Originally that led to
| speculation that this was to make building competitors harder,
| but they've now said they're actually still working on
| it... which leans even further into the idea they may have
| an architecture that makes the kind of sampling these
| projects rely on more difficult to implement than normal.
|
| Given how fast and cheap they've made access to these
| models, their current approach is a practical workaround if
| that's the case.
| behnamoh wrote:
| when GPT-4 first became available, I had a feeling that
| something about it felt "hacky". Compared to GPT-3 which
| was more streamlined, mature, and well thought out, GPT-4
| was like a system put together to outperform the previous
| one at all costs. I wouldn't be surprised if that led to
| design decisions that made their model hard to improve.
| Maybe GPT-5 will not be around any time soon.
| padolsey wrote:
| IANA{LLM}, but if you're only sampling from a "correct"
| grammar, you are potentially (very potentially) forgoing what
| might otherwise have been a more desirable and more
| semantically useful token. Most of the models have been trained
| on myriads of human language, not structured data necessarily,
| and so I'd rather elect for a more semantically enriched format
| (e.g. XML or YAML) because those are designed to be ~more human
| readable. Or perhaps more preferably: have the boss LLM pump
| out what it excels at (strings of prose most of the time) and
| have a secondary model with a stricter grammar convert that to
| JSON.
| [deleted]
| Q6T46nT668w6i3m wrote:
| Is this Brandon Willard the breakdancer from Detroit Brandon
| Willard?
|
| Edit: It is! https://brandonwillard.github.io/
| btwillard wrote:
| Ha, yeah, in a distant, but really fun, past!
| J_Shelby_J wrote:
| So to explain this another way:
|
| After each token generated by the LLM you update the logit bias
| "mask" to only allow the next token to be a valid json token?
|
| Very slick!
| remilouf wrote:
| Indeed. And we're able to update the mask with a dictionary
| lookup instead of looping over the entire vocabulary (slow!).
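|
| Schematically, the decoding loop looks something like this
| (an illustrative sketch, not our actual code):
|
|     import torch
|
|     def generate(model_step, sample, index, state, max_tokens=256):
|         # index[state] maps allowed token ids to the FSM state
|         # reached after emitting them (precomputed once).
|         out = []
|         for _ in range(max_tokens):
|             logits = model_step(out)
|             allowed = index[state]          # dict lookup, no scan
|             mask = torch.full_like(logits, float("-inf"))
|             mask[list(allowed)] = 0.0
|             token = sample(logits + mask)   # ordinary sampling
|             out.append(token)
|             state = allowed[token]          # advance the FSM
|             if not index.get(state):        # nothing can follow
|                 break
|         return out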
| [deleted]
| behnamoh wrote:
| It's actually a very old trick. Lots of libraries do this. idk
| what's the big deal about this one.
| remilouf wrote:
| Perhaps I didn't explain clearly enough in the original post?
| dontreact wrote:
| You would also need to keep generating until the whole string
| is valid. And what if it gets caught in a loop?
|
| Not sure how this can really guarantee 100%
| orlp wrote:
| > And what if it gets caught in a loop? Not sure how this can
| really guarantee 100%
|
| It's not great but after some timeout you can just set the
| mask to only include closing brackets.
| aassddffasdf wrote:
| You would still have to ensure balancing somehow. Both "]"
| and "}" are valid "closing brackets" and the correct one to
| choose is context-dependent.
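|
| Tracking which closer is legal only takes a stack of the open
| delimiters, e.g. (simplified sketch, ignores brackets inside
| strings):
|
|     def forced_closers(generated_so_far: str) -> str:
|         stack, closer = [], {"{": "}", "[": "]"}
|         for ch in generated_so_far:
|             if ch in closer:
|                 stack.append(closer[ch])
|             elif ch in ("}", "]") and stack and stack[-1] == ch:
|                 stack.pop()
|         return "".join(reversed(stack))
|
|     print(forced_closers('{"a": [1, {"b": 2'))   # -> }]}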
| visarga wrote:
| Enforcing JSON schema, regex and grammars is very useful. But how
| can we enforce decoding spans from a document? Decoded text
| should be copied from a list of spans in the input document. It
| would be useful for extractive tasks.
| Havoc wrote:
| That looks intriguing. Managing that interface has proven
| challenging - especially on data cleaning tasks where the model
| ends up talking rather than doing. A bit more guardrails would be
| helpful on that
| remilouf wrote:
| That's what we noticed as well, and we were not satisfied with
| the `guardrails` approach of just rejecting invalid outputs.
| The method makes the interface robust.
| lefttoreader wrote:
| The "trick" seems to blatantly rip off FlashText without citing
| it?
|
| https://arxiv.org/pdf/1711.00046.pdf
|
| I'm a fan of the approach. I normally wouldn't care if this was
| just another LLM library taking inspiration, but if you're going
| to go out of your way to put a paper on the ArXiv, feels like
| doing a literature review is a good step?
| [deleted]
| huevosabio wrote:
| Very cool! How much latency does it add?
| btwillard wrote:
| With our indexing approach, it only costs a dictionary lookup
| to get the next valid tokens during each sampling step, so very
| little latency.
| rckrd wrote:
| I also released a hosted version of my open-source libraries
| ReLLM and ParserLLM, which already support APIs for:
|
| * Regex completion for LLMs
|
| * Context-free Grammar completion for LLMs
|
| https://thiggle.com/
|
| [0] https://github.com/r2d4/rellm
|
| [1] https://github.com/r2d4/parserllm
|
| [2] https://github.com/thiggle/api
|
| There's also another API on Thiggle that I've built that supports
| classification via a similar logit-based strategy.
| [deleted]
| [deleted]
| tantalor wrote:
| "Generating valid JSON" is not impressive. Here's some valid
| JSON: []
|
| The tricky part is generating _useful_ JSON.
| ape4 wrote:
| Or JSON that correctly answers what the prompt is asking.
| AtNightWeCode wrote:
| "" valid!
| notpushkin wrote:
| Generating valid JSON that _conforms to a given schema_ is
| pretty useful, although not impressive by itself. If the model
| can deduce field values from schema alone though, I think it's
| pretty neat.
| quickthrower2 wrote:
| > LLMs can generate valid JSON 100% of the time
|
| If that seems surprising, it is worth doing a course like
| Karpathy's Zero to Hero NN, and having all the magic peeled
| away a layer at a time.
|
| The reason you can do this is because LLMs don't just generate
| the next word or token; they produce a probability distribution
| over all tokens. A JSON parser can give you a list of next valid
| tokens. The tokens in each case might be from a different set,
| e.g. the LLM thinks of " The" whereas the JSON parser might think of
| "{", so you need some conversion there. But if you sample
| randomly from only the valid tokens, the output must be valid
| JSON.
|
| What you can't build a parser for though is ... the truth! You
| may still be told lies or made up stuff.
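|
| That last step is just a masked softmax, e.g. (sketch, with
| valid_ids coming from whatever parser you hooked up):
|
|     import torch
|
|     def sample_valid(logits, valid_ids):
|         masked = torch.full_like(logits, float("-inf"))
|         masked[valid_ids] = logits[valid_ids]
|         probs = torch.softmax(masked, dim=-1)  # invalid -> 0
|         return int(torch.multinomial(probs, num_samples=1))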
| coder543 wrote:
| As a more general comment, the repo README provides examples that
| all use gpt2. It would be nice to see at least one example that
| invokes llama2, since I feel like that would make sure the reader
| knows that this library can use models that are more modern and
| interesting.
| Havoc wrote:
| Inclined to disagree - gpt2 is far more likely to produce
| gibberish. So if you can force specific outputs on that then it
| is a good demo that higher quality models will be even better
| coder543 wrote:
| Maybe... but then if I want to use something better, I have
| to figure out how by myself. I said "at least one example",
| not "please change _all_ the examples to llama2. " I agree
| with your general point. It would be nice if there were an
| example of how to use a better model.
|
| Models often have different shapes and requirements, so is it
| really as simple as changing the string "gpt2" to
| "llama2-13B-Chat" and it will magically work? If so, that's
| great, and I wish that was made clear. Unfortunately, that
| hasn't always been my experience with other libraries.
| remilouf wrote:
| Agree, working on a Colab with a "better" model as we
| speak.
| btbuildem wrote:
| I feel like I'm missing something very basic here, but is this
| library intended to be used with an existing model? If so, could
| you point to an example?
| remilouf wrote:
| It can be used with any open source model (if you can get the
| logits), and to some extent with OpenAI's API. Here is an
| example with `transformers`: https://github.com/normal-
| computing/outlines#efficient-json-...
|
| We plan on adding more model integrations, but it is completely
| decoupled from the method implementation.
| nikcheerla wrote:
| Does this work with GPT-4?
| thatcherthorn wrote:
| This is awesome. I have a vision to build self-managed software.
| This will be a great tool.
| remilouf wrote:
| Thank you! Hope this helps and opens many applications :)
| malux85 wrote:
| This is really great too, I am building self-generating
| experiments and molecular simulations with
| https://atomictessellator.com and I am going to try out this
| framework after work
| spott wrote:
| How does this relate to ggmls bnf sampling?
| remilouf wrote:
| Two differences:
|
| (1) This feature only requires regex-guided generation. We have
| a PR for BNF sampling that is about to be merged. (2) ggml
| loops over the entire vocabulary (~50k tokens) at each step,
| which introduces a noticeable overhead, and makes it unusable
| for complex grammars. Our method works by building an index at
| initialization and building the masks at each step with a
| dictionary lookup. Once the index is built, generation is just
| as fast as standard generation. This doesn't depend on the
| complexity of the grammar, the size of the LLM or its
| vocabulary size.
| spott wrote:
| Regex-guided gen is slick... is it arbitrary? Or are you
| custom building it for json?
|
| If arbitrary, how are you pre-defining a set of masks? I
| would expect splitting an arbitrary regex into a bunch
| of contexts for a masking dictionary to be non-trivial.
| anotherpaulg wrote:
| For complex tasks like coding, my experience is that asking for a
| complex output format hurts performance on the underlying task.
|
| https://aider.chat/docs/benchmarks.html
|
| I am curious if you have measured whether this sort of
| "constrained generation" suffers from similar downsides?
| leetharris wrote:
| How does this compare in terms of latency, cost, and
| effectiveness to jsonformer? https://github.com/1rgs/jsonformer
| bhickey wrote:
| jsonformer uses a template rather than a DFA. The logit masking
| seems to be identical, though.
| remilouf wrote:
| Figure 2 in our paper (https://arxiv.org/abs/2307.09702) shows
| the difference between guidance and outlines when generating a
| sequence that matches a regex. Jsonformer uses the same
| technique as guidance. Extrapolate this to several fields.
|
| Note that we still need to manage the KV cache in outlines.
| It's a small interface change that will be made this week
| hopefully, but we've been focusing on constrained generation so
| far.
| Der_Einzige wrote:
| Sad to see that my related work on token-level constrained
| text generation is not cited in the paper:
| https://github.com/Hellisotherpeople/Constrained-Text-
| Genera...
|
| https://aclanthology.org/2022.cai-1.2/
| remilouf wrote:
| We're unfortunately only human and didn't catch every
| single paper on the topic while writing the draft. Thanks
| for bringing it to our attention.
| panarky wrote:
| I can make GPT4 return valid JSON simply by providing examples in
| the system message. This works nine times out of ten.
|
| But it's still probabilistic, and nine times out of ten isn't
| good enough.
|
| Occasionally it will hallucinate responses like this:
|
| {"key1": "value1", "key2": "value2" for i in range(n)}
|
| Re-prompting with the parsing error message is usually enough to
| get it on the second try.
|
| But escaping double-quotes and newline characters is less
| reliable. Even after giving it multiple examples, it correctly
| escapes only about half the time.
|
| Re-prompting for escaping errors still yields a ~50% success
| rate.
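|
| The retry loop amounts to something like this (sketch; `chat`
| stands in for whatever client call you use):
|
|     import json
|
|     def get_json(chat, messages, max_retries=2):
|         for _ in range(max_retries + 1):
|             reply = chat(messages)
|             try:
|                 return json.loads(reply)
|             except json.JSONDecodeError as err:
|                 messages = messages + [
|                     {"role": "assistant", "content": reply},
|                     {"role": "user", "content":
|                      f"Invalid JSON ({err}). Reply again with "
|                      "only the corrected JSON."},
|                 ]
|         raise ValueError("no valid JSON after retries")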
| phillipcarter wrote:
| This is what we do, but for GPT-3.5. And it doesn't need to be
| system messages either. We even have it emitting _only_ JSON in
| a specific structure (except for when it fails to produce an
| output altogether). This is without the function calling model.
| caesil wrote:
| With ChatGPT function calling I get valid JSON 100% of the time
| from GPT-4 unless I have made some error in prompting.
|
| The chief error is not providing escape hatches. LLMs look for
| a right answer. If you are feeding it some texts and asking it
| to return structured data about the texts, but then one of the
| texts is blank, it will be difficult to determine a right
| answer, so you get hallucinations. The solution is an escape
| hatch where one of the arguments is a `textIsMissing` boolean
| or something.
|
| As long as you've accounted for these failure modes, it works
| flawlessly.
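|
| Concretely, the escape hatch is just one more field in the
| function's parameter schema, e.g. (hypothetical definition):
|
|     extract_info = {
|         "name": "extract_info",
|         "parameters": {
|             "type": "object",
|             "properties": {
|                 "title": {"type": "string"},
|                 "summary": {"type": "string"},
|                 "textIsMissing": {
|                     "type": "boolean",
|                     "description": "True if the input text is "
|                                    "blank or unusable.",
|                 },
|             },
|             "required": ["textIsMissing"],
|         },
|     }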
| MuffinFlavored wrote:
| I wonder if the next iteration of OpenAI features is something
| like:
|
| right now you can inject prompts that the LLM takes into
| consideration before the output
|
| I wonder if you can make it have a "post" generation function
| that says like "keep re-trying in a loop (aka hallucinating
| with randomness) until the output message passes XYZ
| format/checks/scoring"
| padjo wrote:
| It's starting to feel like LLMs are to "classical" software
| engineering what quantum physics was to classical physics
| keiferwiseman wrote:
| It took some iterations but I've managed to get the OpenAI API
| to give me valid JSON 100% of the time now (based on my
| testing). I think I put in the prompt to never use newlines
| because it was causing issues lol.
| thumbsup-_- wrote:
| Yeah same thing. I have done the same with GPT-3.5. Simply ask
| it to output using the provided schema only and give a few
| examples. It always outputs in the provided JSON format.
| nextaccountic wrote:
| What about reprompting with a different temperature value?
|
| If this works, how to select the optimal value? Maybe you can
| train a model that can excel at the task of querying gpt4 for
| valid jsons
| simonw wrote:
| That re-prompting on error trick is what this new Microsoft
| library does, too: https://github.com/microsoft/TypeChat
|
| Here's their prompt for that:
| https://github.com/microsoft/TypeChat/blob/c45460f4030938da3...
|
| I think the approach using grammars (seen here, but also in
| things like https://github.com/ggerganov/llama.cpp/pull/1773 )
| is a much more elegant solution.
| padolsey wrote:
| I've had more luck with getting it to output XML as (1) You can
| imbue XML with actual language/meaning (which LLMs adore) and
| (2) parsers can be made to be more forgiving. I get why people
| want to make JSON, but to me it's a bit like trying to get a
| cat to swim - you might eventually succeed, but it's not their
| natural inclination.
| gowld wrote:
| How do you imbue XML with meaning?
| padolsey wrote:
| XML Elements themselves: their naming, their attributes,
| comments, indentation. There's more opportunity at every
| level of the hierarchy to demarcate and establish meaning.
| Having closing-tags as well, I've found, is a massive boon;
| LLMs can better understand what "finishing" looks like if
| its delimited in a semantic way - with a name.
| orasis wrote:
| What about using ChatGPT's new function calling mechanism?
| superasn wrote:
| That returns broken JSON a lot of the time too.
| activatedgeek wrote:
| Mechanistically, I think this library takes the simple idea of
| masking part of the vocabulary space at each time step and
| implements it efficiently. Great!
|
| I am curious, however, for the ones who have played around with
| such libraries wrapping base LLMs with output structure: do base
| models like Llama2 work very well? My experience says "hell no!"
| and you do need a fair bit of instruction-tuning for specific use
| cases to actually get things to work.
|
| And even then, it seems very counter-intuitive to me that given
| an instruction-tuned model, post-hoc masking of the state-space
| during generation then amounts to just changing the generation
| distribution, and is potentially detrimental to instruction-tuning?
| Havoc wrote:
| >you do need a fair bit of instruction-tuning for specific use
| cases to actually get things to work.
|
| The instruction tuning part is "trivial"...it's the dealing
| with edge cases part that gets me.
|
| With classic code, edge cases are, well, insignificant edge cases.
| With LLM you never know what will make it go off on a tangent &
| the parsing code needs to deal with that chaos.
|
| Or put differently the % of cases that are edge cases seems to
| have gone up dramatically
| ethbr1 wrote:
| > _...given an instruction-tuned model, post-hoc masking of the
| state-space during generation then amounts to just changing the
| generation distribution..._
|
| Isn't that what we did with test driven development?
|
| The primary difference was our generator functions were human
| instead of LLM. Why not cut out the middle-human?
| make3 wrote:
| I'm not sure why you would want to use raw llama-2 though
| when there is a million super strong instruction fine-tuned
| versions of llama-2 on HF hub that would do the job a million
| times better? Like Stability-AI's Beluga-2. See
| https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
|
| About your second point, the goal is that the model can only
| generate JSON (for example), which can 100% be done by
| constraining which output tokens can and cannot be used.
| nabakin wrote:
| Don't rely too much on automated benchmarks for LLMs. They
| are often gamed, made to overfit, and result in worse
| performance in the general case.
|
| Human evaluation is the gold standard and the Llama 2 paper
| gave significant evidence that Llama 2 70b chat is on par
| with, if not better than, ChatGPT for that metric, so I tend
| to stick to it unless there is good reason not to.
| huevosabio wrote:
| The problem with Llama 2 chat versions is that they have
| been RLHF-ed to death. You can't ask questions without
| getting a sermon of how your question may be inappropriate
| for this or that reason.
|
| I think it's worse on the smaller models, but still present
| in the 70B one.
| lettergram wrote:
| Few thoughts, you're effectively creating representations that
| can convert to JSON (kudos!)
|
| Can't mention how we did it (there are a lot of public patents,
| if interested), but back in 2018 we had a way to generate
| synthetic data (statistically, structurally similar) off any
| dataset - https://medium.com/capital-one-tech/why-you-dont-
| necessarily... You could also design datasets if you wanted.
|
| It'd keep similar relations and worked pretty darn well. Not the
| exact same, but always produced valid JSON.
| simonw wrote:
| I really hope OpenAI add something like this to their endpoints
| soon.
|
| Being able to pass up some kind of grammar (a regular expression,
| or a JSON schema, or some other format) and have this trick run
| during their token sampling process to ensure the output was
| compliant would be incredibly useful.
| joshuanapoli wrote:
| Isn't the Function Calling feature meant for this purpose? It
| guides the LLM to output according to the given schema. The
| name of the feature is a little misleading.
|
| https://platform.openai.com/docs/guides/gpt/function-calling
| tornato7 wrote:
| Function Calling is fine-tuned to a certain output format,
| but it very often strays from that format. My function-
| calling-handling code has a mess of edge case handlers that
| catch when GPT-4 is calling functions incorrectly.
| simonw wrote:
| Surprisingly the function calling mechanism doesn't appear to
| use this trick - apparently it's still possible to get the
| wrong JSON structure back from it occasionally.
| M4v3R wrote:
| It's not though, they even say it in their docs that sending
| a schema does not guarantee that the model will actually
| adhere to the schema or even produce valid JSON.
| potatoman22 wrote:
| They recently added logit biases, so that's a start.
| dvt wrote:
| [flagged]
| aduffy wrote:
| This is exciting, we built a similar tool[1] recently
| specifically targeted at constraining llama output to match a
| TypeScript interface.
|
| I firmly believe that output format guarantees are going to be
| important for real (non-toy) use cases for LLMs.
|
| [1] https://github.com/ggerganov/llama.cpp/discussions/2494
| BoorishBears wrote:
| I'm not sure how this is different than:
|
| https://github.com/1rgs/jsonformer
|
| or
|
| https://github.com/newhouseb/clownfish
|
| or
|
| https://github.com/mkuchnik/relm
|
| or
|
| https://github.com/ggerganov/llama.cpp/pull/1773
|
| or
|
| https://github.com/Shopify/torch-grammar
|
| Overall there are a _ton_ of these logit-based guidance systems;
| the reason they don't get tons of traction is the SOTA models
| are behind REST APIs that don't enable this fine-grained
| approach.
|
| Those models perform so much better that people generally settle
| for just re-requesting until they get the correct format (and
| with GPT-4 that ends up being a fairly rare occurrence in my
| experience)
| remilouf wrote:
| Thanks for bringing clownfish and relm to my attention! afaik
| other libraries loop over the entire vocabulary at every step
| of the generation. We on the other hand build an index at
| initialization by looping once over the vocabulary. Then
| generation is just as fast as standard generation.
| burke wrote:
| torch-grammar generates a mask per PDA stack... we don't try
| to compute all the possible stacks. I'm sure there's
| something smarter that could be done here and you've probably
| figured it out (though IIRC regular languages don't have the
| arbitrarily recursive stack problem that you get when you get
| to context-free languages?) anyway, in practice we spend a
| few milliseconds on the first few requests building caches
| and then just apply masks from caches after that.
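|
| Roughly (a hedged sketch, not torch-grammar's actual code): the
| cache is keyed by the PDA stack, so repeated states are free:
|
|     import torch
|
|     mask_cache = {}  # tuple(stack) -> boolean mask over the vocab
|
|     def mask_for(stack, vocab_size, allowed_token_ids):
|         key = tuple(stack)
|         if key not in mask_cache:
|             m = torch.zeros(vocab_size, dtype=torch.bool)
|             m[allowed_token_ids(stack)] = True  # slow path, once
|             mask_cache[key] = m
|         return mask_cache[key]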
| sneedchucker wrote:
| Relevant; LLama.cpp implemented grammar-based sampling last
| month.
|
| https://news.ycombinator.com/item?id=36819906
| https://github.com/ggerganov/llama.cpp/pull/1773
| remilouf wrote:
| We can extend our approach to grammar-based sampling, as
| explained in the paper linked above. Relevant PR:
| https://github.com/normal-computing/outlines/pull/178
|
| Our method is much more efficient. llama.cpp loops over the
| entire vocabulary (~50k tokens) _at each step_ to generate the
| mask. We generate an index at initialization, and building the
| masks at each step only requires a dictionary lookup (trading
| memory for speed). Sampling is just as fast as standard
| sampling.
| popinman322 wrote:
| It should hopefully be a quick change to llama.cpp to add a
| mask per grammar state to bring it in line with your
| generation method; I don't think the two are incompatible,
| thankfully.
|
| I do wonder how much you win here by masking the tokens? You
| still need to iterate along the output vector to apply the
| mask. Masking on the accelerator still requires filtering on
| the CPU side? Compared to running the language model, the
| cost of iterating over the edges in the grammar seems small.
| burke wrote:
| Yes! This is closer to the approach I took in my port of
| llama.cpp's grammar support to PyTorch:
| https://github.com/Shopify/torch-
| grammar/blob/main/torch_gra... ... it generates a tensor
| mapping each PDA stack to a map of which tokens are
| acceptable from that state. It seems like a much better way
| to do it than looping over the sampled tokens on each turn.
___________________________________________________________________
(page generated 2023-08-14 23:00 UTC)