[HN Gopher] Show HN: LLMs can generate valid JSON 100% of the time
       ___________________________________________________________________
        
       Show HN: LLMs can generate valid JSON 100% of the time
        
       Outlines is a Python library that focuses on text generation with
       large language models. Brandon and I are not LLM experts and
       started the project a few months ago because we wanted to
       understand better how the generation process works. Our original
       background is probabilistic, relational and symbolic programming.
       Recently we came up with a fast way to generate text that matches a
       regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-
       guide...). The basic idea is simple: regular expressions have an
       equivalent Deterministic Finite Automaton (DFA) representation. We
       can transform this DFA into a generative model: in each state we
       get a list of symbols which correspond to completions that
       partially match the regular expression. We mask the other symbols
       in the logits returned by a large language model, sample a new
       symbol and move to the next state. The subtlety is that language
       models work with tokens, not symbols, so we derive a new FSM whose
       alphabet is the model's vocabulary. We can do this in only one pass
       over the vocabulary.  Generating the token masks thus only requires
       a dictionary lookup at each state. Our method blows other libraries
       like Microsoft's guidance out of the water.  From there it was only
       a small leap to be able to generate text that follows a JSON schema
       (https://json-schema.org/), or is parseable into a Pydantic model
       (https://docs.pydantic.dev/latest/usage/models/). The method works
       with union types, optional types, nested schemas, arrays,
       everything. It is guaranteed that the output is parseable.  I think
       it's cool, and I've spent a lot of time watching even tiny models
       output valid JSON over the weekend. Hope you will too.  I look
       forward to feedback, bug reports, feature requests and discussions!
       Edit: Link to our pre-print explaining the method and how this can
       be extended to generate text that follows a Context-Free Grammar
       https://arxiv.org/abs/2307.09702
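       To make this concrete, here is a toy sketch of the indexing step in
       Python. It is illustrative only (a hand-written two-state DFA for
       the regex [0-9]+ and a four-token vocabulary), not our actual
       implementation or API:
        
           from collections import defaultdict
        
           # Hand-written DFA for the regex [0-9]+ : state 0 = start, 1 = accepting.
           TRANSITIONS = {(s, d): 1 for s in (0, 1) for d in "0123456789"}
           STATES = (0, 1)
        
           def walk(state, token):
               """Advance the DFA over a token, char by char; None if the match dies."""
               for ch in token:
                   state = TRANSITIONS.get((state, ch))
                   if state is None:
                       return None
               return state
        
           def build_index(vocabulary):
               """One pass over the vocabulary: for each token, record the states
               from which it keeps the partial match alive, and where it leads."""
               index = defaultdict(dict)            # state -> {token_id: next_state}
               for token_id, token in vocabulary.items():
                   for state in STATES:
                       next_state = walk(state, token)
                       if next_state is not None:
                           index[state][token_id] = next_state
               return index
        
           # At generation time each step is a dictionary lookup: mask every logit
           # not in index[current_state], sample a token, follow the stored transition.
           vocab = {0: "1", 1: "42", 2: "foo", 3: "7a"}
           print(build_index(vocab)[0])             # {0: 1, 1: 1} -> only "1" and "42"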
        
       Author : remilouf
       Score  : 384 points
       Date   : 2023-08-14 18:52 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Ilasky wrote:
       | OpenAI has this capability built in with functions[0], I believe!
       | Building my own project[1] I have implemented functions in
       | combination with guidance[2] and haven't had a hiccup yet! I have
       | a JSON parser function there, just in case, but it seems to be
       | working reliably.
       | 
       | Here's a bit more of a description of using the functions API for
        | JSON returns: https://yonom.substack.com/p/native-json-output-from-gpt-4
       | 
       | [0] https://openai.com/blog/function-calling-and-other-api-
       | updat...
       | 
       | [1] https://resgen.app
       | 
       | [2] https://github.com/guidance-ai/guidance
        
         | londons_explore wrote:
         | >OpenAI has this capability built in with functions
         | 
         | From OpenAI's docs:
         | 
         | > note: the model may generate invalid JSON
         | 
          | I would guess they _don't_ use your method - and perhaps they
         | should!
        
           | Ilasky wrote:
           | Good catch! It really is a combination of guidance
           | guaranteeing JSON output and OpenAI getting it right a good
           | majority of the time[0]. But yeah, I can see how it can be
           | frustrating that the JSON output is not guaranteed by the
           | docs.
           | 
           | [0] >>99% in my experience
        
             | Ilasky wrote:
             | That said, I am definitely going to look into this library
             | and compare its results to guidance, since they claim it
             | blows it out of the water (which is very enticing!)
        
               | remilouf wrote:
               | Figure 2 in our paper (https://arxiv.org/abs/2307.09702)
               | shows the difference for a single regex.
        
         | thomasfromcdnjs wrote:
          | I do the same, just tell OpenAI to call a parser at the end and
          | voilà.
        
       | Scene_Cast2 wrote:
       | One potential drawback I can see is if the viable tokens are far
       | down the list of predictions. In that case, filtering down to
       | just those tokens is a distribution shift with resulting output
       | being less stable / less sensible.
        
         | Scarblac wrote:
         | It can't be less sensible JSON than syntactically invalid JSON.
         | All the tokens higher on the list are syntax errors.
        
           | haswell wrote:
           | That depends highly on the values contained within the JSON.
           | Syntactically correct is only useful if the rest of the
           | content is useful.
        
           | skybrian wrote:
           | It seems unlikely for JSON, but this might indicate that the
           | model has somehow painted itself into a corner and the best
           | thing to do is backtrack?
           | 
           | Regenerating the entire response could be seen as an extreme
           | form of backtracking.
        
         | remilouf wrote:
         | Indeed, this remains an empirical question.
        
       | coding123 wrote:
        | Can someone re-explain all of this? If I go to GPT-3.5 and ask it
        | to give me some information in JSON, how does that compare to
        | whatever this library is doing?
        
         | odyssey7 wrote:
         | Each time you run an LLM on a sequence of tokens, it generates
         | a probability distribution giving each token's likelihood of
         | occurring next in the sequence. To actually determine the next
         | token in the sequence, any of various strategies can be used to
         | select from that probability distribution.
         | 
         | The challenge in guided generation is conforming the output
         | sequence with a formal language such as a JSON schema or even a
         | rigorously grammatical version of English; typically in a
         | formal language, most tokens in the vocabulary will be
         | _impossible_ as next token candidates rather than merely
         | unlikely. The authors explain that most guided generation
         | systems are checking each token in the vocabulary to see if it
         | would be a valid continuation of the sequence, filtering the
         | probability distribution according to formal constraints before
         | making the next token selection. The authors improve upon this
         | process by indexing valid next tokens according to a formal
         | language recognizer's possible states, so that the list of
         | valid next tokens can be looked up in constant time rather than
         | testing every token in the vocabulary.
         | 
         | With the valid next token options in hand, the probability
         | distribution for next tokens is filtered and then a selection
         | is made.
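          | 
          | Roughly, that per-step filtering looks like this (an illustrative
          | sketch with a made-up four-token vocabulary, not the library's
          | actual code):
          | 
          |     import numpy as np
          | 
          |     def sample_constrained(logits, allowed_token_ids, temperature=1.0):
          |         """Mask every invalid token, renormalize, and sample one token id."""
          |         masked = np.full_like(logits, -np.inf)
          |         masked[allowed_token_ids] = logits[allowed_token_ids]
          |         probs = np.exp((masked - masked.max()) / temperature)
          |         probs /= probs.sum()
          |         return int(np.random.choice(len(logits), p=probs))
          | 
          |     logits = np.array([2.0, -1.0, 0.5, 3.0])    # scores for a 4-token vocabulary
          |     allowed = [0, 2]                            # only these fit the formal language
          |     print(sample_constrained(logits, allowed))  # always 0 or 2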
        
       | popinman322 wrote:
       | Does this work in tandem with beam search or does it do greedy
       | sampling?
        
         | btwillard wrote:
         | The underlying approach can improve the performance of anything
         | that requires the set of non-zero probability tokens at each
         | step, and anything that needs to continue matching/parsing from
         | a previous state.
        
       | Kiro wrote:
       | Does this mean that I need to call the LLM API once for each
       | token?
        
         | baobabKoodaa wrote:
         | No. You need to hook into the LLM at a lower level. One API
         | call typically triggers a generation of a sequence of tokens
         | and this library has to poke into things between each generated
         | token.
        
       | Deukhoofd wrote:
       | Looks interesting! How would you say it compares to Microsoft's
       | TypeChat (beyond the obvious Python/TypeScript difference)?
       | 
       | https://microsoft.github.io/TypeChat/blog/introducing-typech...
        
         | remilouf wrote:
         | Thanks for bringing this library to my attention! From my
         | understanding, TypeChat proceeds by (1) generating (2)
         | attempting validation (3) if it fails, call the LLM again to
         | fix the output (4) etc.
         | 
          | Our method on the other hand _guarantees_ that the output will
         | follow the specs of the JSON schema. No need to call the LLM
         | several times.
        
           | 1wheel wrote:
           | There's also https://lmql.ai/
        
             | remilouf wrote:
              | LMQL (and guidance https://github.com/guidance-ai/guidance)
              | are much less efficient. They loop over the entire
             | vocabulary at each step, we only do it once at
             | initialization.
        
         | 2bitencryption wrote:
          | TypeChat: let's try really hard to convince the model to
         | make the highest-scoring tokens follow the grammar we want.
         | 
         | Guidance (and this project?): Let's not even bother with trying
         | to convince the model; instead, we'll only sample from the set
         | of tokens that are guaranteed to be correct for the grammar we
         | want to emit.
        
           | btwillard wrote:
           | Yeah, and our addition to all that is to almost completely
           | remove the cost of determining the next valid tokens on each
           | step.
        
       | xigency wrote:
       | Thanks for building this. The mechanics are such an obvious idea
       | that it's astounding that the first-party platforms haven't done
       | this yet. I would be interested to see how this could be used for
       | other tasks outside of JSON that require structured input.
        
         | umvi wrote:
         | > it's astounding that the first-party platforms haven't done
         | this yet
         | 
         | I was under the impression LLM tech is currently in a breakneck
         | arms race and that things are dramatically changing every few
         | months. It could simply just be a consequence of limited
         | developer resources. It would be "astounding" if decade-old
         | tech were missing such a fundamental feature, but for AI tech
         | in arms-race mode it seems reasonable that they are still
         | missing QoL features.
        
           | winwang wrote:
           | I think they meant that you'd expect simpler/more obvious
           | ideas to be implemented first.
        
         | remilouf wrote:
         | Thanks! We have extended the approach to grammar-based
         | sampling. We describe the approach in the paper linked above.
          | The following PR is relevant:
          | https://github.com/normal-computing/outlines/pull/178
        
           | Lerc wrote:
           | Could this same approach be applied at training? If the
           | guidance does a lot of the syntactical heavy lifting, would
           | that create the opportunity for a model to use the weights
            | for something else? Essentially not bothering to reduce the
           | error of things that the guidance will stomp on anyway.
        
         | LakshyAAAgrawal wrote:
         | Hi, the paper at https://arxiv.org/abs/2306.10763 titled
         | "Guiding Language Models of Code with Global Context using
         | Monitors" shows how to have the language models generate code
         | without hallucinated dereferences.
        
       | Animats wrote:
       | OK, you get syntactically valid JSON, but does it contain the
       | correct info? This is effectively a polisher, like spell check,
       | which gives the output superficially correct form but doesn't
       | understand the content. Right?
        
         | burke wrote:
         | You can go pretty deep once you get context free grammars. For
         | example, I'm using torch-grammar (but outlines should be able
         | to do the same thing once CFG support is merged) to not just
         | restrict the format of a generation to a DSL's syntax, but to
         | restrict the keys it updates to valid keys in a known set.
         | 
         | e.g.:                   int_key ::= DQUO ("f" ("e" ("atured-"
         | ("b" ("log." ("p" ("ost_limit" | "a" ...
         | 
         | Obviously, yeah, it doesn't "understand" the content, but
         | that's what the LLM is for. It's remarkable how plausible the
         | generations you can get out of random noise are with a
         | sufficiently-restrictive grammar. Bolting that onto a well-
         | trained LLM is pretty powerful.
        
           | btwillard wrote:
           | FYI: We've had grammar constraints available in Outlines for
           | a while, but not using the FSM and indexing approach that
           | makes the regex case so fast. My open PR only adds that.
        
         | empath-nirvana wrote:
         | This isn't really an interesting question is it? Everyone knows
         | that chatgpt is not an oracle. It doesn't need to output the
         | correct information 100% of the time.
        
         | coder543 wrote:
         | This analogy falls apart because the spellchecker is separate
         | from the author, and doesn't know what the author intended.
         | 
         | Here, the LLM is still dictating the token probabilities, so
         | the content will be as correct as the LLM can make it, given
         | the constraints. AIUI, the sampler is just choosing tokens on a
         | combination of probability and syntactic correctness, instead
         | of strictly on probability.
         | 
         | If the LLM is forced to provide a numeric temperature for
         | Seattle, and the input doesn't contain that data, then
         | obviously the LLM will be forced by the sampler to provide a
         | random answer if the sampler will accept nothing else, much
         | like a human who is forced to mark "true"/"false" on an online
         | form, with no option to reject the question and explain that
         | the question isn't even a true/false question.
         | 
         | I don't know about this specific implementation, but it seems
         | important to design systems like this to always "accept"
         | (sample for) an error response from the LLM so that it can
         | hopefully reject invalid requests.
         | 
         | But, yes, all the usual caveats about LLMs apply. It can't
         | provide correct answers to things it doesn't know. Forcing it
         | to respond with the answer to the life, the universe, and
         | everything is not going to provide a meaningful response. Even
         | things it "knows", it can still get wrong sometimes.
        
           | anticrymactic wrote:
           | I'm stupid with LLMs, but would it be possible to have this
           | output with gpt4's intelligence, or would it have to be
           | specifically trained?
        
             | coder543 wrote:
             | It's something OpenAI should really implement themselves.
             | Implementing it from the client side will mean sending the
             | same request over and over until you get a syntactically
             | correct answer, which is going to be much slower and likely
             | to cost a lot. The server can guide the generation, but the
             | client can (currently) only hint at what it wants. ChatGPT4
             | is fairly good at following schemas, and that's what OpenAI
             | currently relies on, but they make no guarantees.
             | 
             | It likely wouldn't require additional training. It's a
             | change to the way the server uses the model, not a change
             | to the model itself... but we don't know ChatGPT4's true
             | architecture because OpenAI won't publish anything about
             | it, so it's hard to say for sure.
        
           | chipsrafferty wrote:
           | Why isn't it possible to design LLMs that say "I don't know"?
        
             | Lerc wrote:
             | They can say I don't know when they contain the fact that
             | they don't know something. For instance saying "I don't
             | know" could be a response to"What is the meaning of life"
             | 
             | On the other hand if you ask a LLM how to do something
             | about fish maintenance that it does not know how to do, it
             | might produce an answer like "Sure, first take your fish
             | and " at which point all of the options for the next word
             | are all over the place because there isn't the information
             | available to guide the choice. The sentence started as if
             | it knew the answer because there was no information to say
             | that it didn't. By the time the absence of information has
             | an impact, the LLM is already committed to the sentence
             | where it is confidently giving you an answer.
        
               | [deleted]
        
             | coder543 wrote:
             | It _is_ possible... ChatGPT4 says that all the time. It's
             | just not _guaranteed_ that an LLM will recognize that it
             | doesn't know a particular answer every time. I had even
             | already mentioned in the comment you're replying to that
             | you should leave room in the sampler to _allow_ the LLM to
             | provide error responses. I never said it wasn't possible.
             | 
             | Not to anthropomorphize LLMs too much, but humans will also
             | sometimes respond confidently with a wrong answer too. Both
             | LLMs and humans will sometimes say the wrong thing when
             | they don't actually know an answer, but sometimes
             | (hopefully most of the time) they will instead say that
             | they don't know the answer.
             | 
             | Contrary to another response here, I do not believe that
              | LLMs respond "I don't know" only when they have
             | specifically memorized that they don't know a fact. I don't
             | believe that's a good or useful mental model for how this
             | stuff works. When you're dealing with tens or hundreds of
             | billions of parameters, the "why" is often elusive and
             | complicated. It's also probabilistic; it may respond that
             | it doesn't know one time, but the next time, it may
             | unfortunately claim to know an answer it doesn't know --
             | which is a form of hallucination. Reducing hallucinations
             | is one of the major goals of LLM research today, and
             | ChatGPT4 performs much better in this area than ChatGPT3.5
             | did.
             | 
             | Here is a quick example of ChatGPT4 saying it doesn't know:
             | https://chat.openai.com/share/7b72b109-fb84-4988-891b-f2eec
             | c...
             | 
             | I'm sure no one at OpenAI specifically trained ChatGPT4 to
             | recognize a question about the Stanley Cup and respond that
             | it doesn't know the answer, but it still said that it
             | didn't know. It _absolutely did not_ start a sentence with
             | "the winner of the 2023 Stanley Cup was..." and then wander
             | its way into a bad answer. That's not a good representation
             | of how this stuff works, even though it does sample one
             | token at a time.
        
             | skybrian wrote:
             | They do, but it's a form of imitation, not actually knowing
             | what they don't know.
             | 
             | Ask an LLM to imitate a confident physicist and it will
             | try, regardless of how much physics it knows.
             | 
             | Or if you tell ChatGPT that it's wrong multiple times, it
             | may learn the pattern and assume it's always wrong,
             | resulting in a downward spiral. (This can happen when using
             | Code Interpreter and it makes several failed attempts to
             | correct a mistake.)
             | 
             | The difficult research problem is training it to have an
             | accurate model of what it knows.
        
       | ianbutler wrote:
       | https://github.com/newhouseb/clownfish
       | 
       | Which I've been using for a while now, also restricts the
       | sampling space to force correct generation, but does so as the
       | result of a different process than yours.
        
       | 2bitencryption wrote:
       | it still blows my mind that OpenAI exposes an API with Functions
       | calling, and yet _does not guarantee the model will call your
        | function correctly_, in fact, it does not even guarantee the
       | output will be valid JSON.
       | 
       | When this is, really, a solved problem. I've been using
       | github.com/microsoft/guidance for weeks, and it genuinely, truly
       | guarantees correct output, because _it simply does not sample
       | from tokens that would be invalid._
       | 
       | It just seems so obvious, I still have no clue why OpenAI does
       | not do this. Like, why fuss around with validating JSON after the
       | fact, when you can simply guarantee it is correct in the first
       | place, by only sampling tokens _if they conform to the grammar
       | you are trying to emit?_
        
         | newhouseb wrote:
         | I think this is likely a consequence of a couple of factors:
         | 
         | 1. Fancy token selection w/in batches (read: beam search) is
         | probably fairly hard to implement at scale without a
         | significant loss in GPU utilization. Normally you can batch up
         | a bunch of parallel generations and just push them all through
         | the LLM at once because every generated token (of similar
         | prompt size + some padding perhaps) takes a predictable time.
         | If you stick a parser in between every token that can take
         | variable time then your batch is slowed by the most complex
         | grammar of the bunch.
         | 
         | 2. OpenAI appears to work under the thesis articulated in the
         | Bitter Lesson [i] that more compute (either via fine-tuning or
         | bigger models) is the least foolish way to achieve improved
          | capabilities, hence their approach of function-calling just
         | being... a fine tuned model.
         | 
         | [i] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
         | BoorishBears wrote:
         | I just left a comment along these lines, but realistically it's
         | probably cheaper to just re-emit than to add the machinery that
         | enables this to their existing architecture.
         | 
         | At most I could have seen them maybe running a schema validator
         | against the output and re-requesting on your behalf, but even
         | that's probably cheaper for them to do client side (I will say,
         | I'm surprised their API wrapper hasn't been updated to do this
         | yet)
        
           | 2bitencryption wrote:
           | > maybe running a schema validator against the output and re-
           | requesting on your behalf
           | 
           | this is the part that blows my mind. You don't have to do
           | this! You don't have to sample the entire output, and then
           | validate after the fact.
           | 
           | You're not _required_ to greedily pick the token with the
           | highest score. You get the scores of all tokens, on every
           | forward pass! So why even waste time picking invalid tokens
            | if you're just going to validate and retry later on??
           | 
           | (note: when I say "you" here, I mean whoever is hosting the
           | model. It _is_ true that OpenAI does not expose all token
           | scores, it only gives you back the highest-scoring one. So a
           | client-side library is not able to perform this grammar-based
           | sampling.
           | 
            | BUT, OpenAI themselves host the model, and they see all
           | token outputs, with all scores. And in the same API request,
           | they allow you to pass the "function definition" as a JSON
           | schema. So why not simply apply that function definition as a
           | mask on the token outputs? They could do this without
           | exposing all token scores to you, which they seem very
           | opposed to for some reason.)
        
             | BoorishBears wrote:
             | Maybe re-read what I said?
             | 
             | > realistically it's probably cheaper to just re-emit than
             | _to add the machinery that enables this to their existing
             | architecture_
             | 
             | There are literally dozens of random projects that have
             | implemented logit based masking, it's a trivial thing to
             | implement.
             | 
             | What's probably not as trivial is deploying it at scale
             | with whatever architecture OpenAI already has in place.
             | Especially if they're using the router-based MoE
             | architecture most people are assuming they use.
             | 
             | OpenAI doesn't expose token probabilities for their RLHF
              | models, yet they did for GPT-3. Originally that led to
             | speculation that was to make building competitors harder,
             | but they've now said they're actually still working on
             | it... which leans even further into the idea they may have
             | an architecture that makes the kind of sampling these
             | projects rely on more difficult to implement than normal.
             | 
             | Given how fast and cheap they've made access to these
             | models, their current approach is a practical workaround if
             | that's the case.
        
               | behnamoh wrote:
               | when GPT-4 first became available, I had a feeling that
               | something about it felt "hacky". Compared to GPT-3 which
               | was more streamlined, mature, and well thought out, GPT-4
               | was like a system put together to outperform the previous
               | one at all costs. I wouldn't be surprised if that led to
               | design decisions that made their model hard to improve.
               | Maybe GPT-5 will not be around any time soon.
        
         | padolsey wrote:
         | IANA{LLM}, but if you're only sampling from a "correct"
         | grammar, you are potentially (very potentially) forgoing what
         | might otherwise have been a more desirable and more
         | semantically useful token. Most of the models have been trained
         | on myriads of human language, not structured data necessarily,
         | and so I'd rather elect for a more semantically enriched format
         | (e.g. XML or YAML) because those are designed to be ~more human
         | readable. Or perhaps more preferably: have the boss LLM pump
         | out what it excels at (strings of prose most of the time) and
         | have a secondary model with a stricter grammar convert that to
         | JSON.
        
       | [deleted]
        
       | Q6T46nT668w6i3m wrote:
       | Is this Brandon Willard the breakdancer from Detroit Brandon
       | Willard?
       | 
       | Edit: It is! https://brandonwillard.github.io/
        
         | btwillard wrote:
         | Ha, yeah, in a distant, but really fun, past!
        
       | J_Shelby_J wrote:
       | So to explain this another way:
       | 
       | After each token generated by the LLM you update the logit bias
       | "mask" to only allow the next token to be a valid json token?
       | 
       | Very slick!
        
         | remilouf wrote:
         | Indeed. And we're able to update the mask with a dictionary
         | lookup instead of looping over the entire vocabulary (slow!).
        
           | [deleted]
        
         | behnamoh wrote:
         | It's actually a very old trick. Lots of libraries do this. idk
         | what's the big deal about this one.
        
           | remilouf wrote:
           | Perhaps I didn't explain clearly enough in the original post?
        
         | dontreact wrote:
         | You would also need to keep generating until the whole string
         | is valid. And what if it gets caught in a loop?
         | 
         | Not sure how this can really guarantee 100%
        
           | orlp wrote:
           | > And what if it gets caught in a loop? Not sure how this can
           | really guarantee 100%
           | 
           | It's not great but after some timeout you can just set the
           | mask to only include closing brackets.
        
             | aassddffasdf wrote:
             | You would still have to ensure balancing somehow. Both "]"
             | and "}" are valid "closing brackets" and the correct one to
             | choose is context-dependent.
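              | 
              | Tracking what has been opened so far is enough to know which
              | closers a timeout should force, e.g. a rough sketch that
              | ignores some string-escaping corner cases:
              | 
              |     def forced_closers(text):
              |         """Closers needed, in order, to balance whatever `text` leaves open."""
              |         stack, in_string, prev = [], False, ""
              |         for ch in text:
              |             if in_string:
              |                 if ch == '"' and prev != "\\":
              |                     in_string = False
              |             elif ch == '"':
              |                 in_string = True
              |             elif ch in "{[":
              |                 stack.append("}" if ch == "{" else "]")
              |             elif ch in "}]" and stack:
              |                 stack.pop()
              |             prev = ch
              |         return (['"'] if in_string else []) + list(reversed(stack))
              | 
              |     print(forced_closers('{"name": "Ada", "tags": ["x"'))   # [']', '}']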
        
       | visarga wrote:
       | Enforcing JSON schema, regex and grammars is very useful. But how
        | can we enforce decoding spans from a document? The decoded text
       | should be copied from a list of spans in the input document. It
       | would be useful for extractive tasks.
        
       | Havoc wrote:
       | That looks intriguing. Managing that interface has proven
       | challenging - especially on data cleaning tasks where the model
        | ends up talking rather than doing. A few more guardrails would be
        | helpful there.
        
         | remilouf wrote:
         | That's what we noticed as well, and we were not satisfied with
         | the `guardrails` approach of just rejecting invalid outputs.
         | The method makes the interface robust.
        
       | lefttoreader wrote:
       | The "trick" seems to blatantly rip off FlashText without citing
       | it?
       | 
       | https://arxiv.org/pdf/1711.00046.pdf
       | 
       | I'm a fan of the approach. I normally wouldn't care if this was
       | just another LLM library taking inspiration, but if you're going
       | to go out of your way to put a paper on the ArXiv, feels like
       | doing a literature review is a good step?
        
         | [deleted]
        
       | huevosabio wrote:
       | Very cool! How much latency does it add?
        
         | btwillard wrote:
         | With our indexing approach, it only costs a dictionary lookup
         | to get the next valid tokens during each sampling step, so very
         | little latency.
        
       | rckrd wrote:
       | I also released a hosted version of my open-source libraries
       | ReLLM and ParserLLM that already supports APIs for
       | 
       | * Regex completion for LLMs
       | 
       | * Context-free Grammar completion for LLMs
       | 
       | https://thiggle.com/
       | 
       | [0] https://github.com/r2d4/rellm
       | 
       | [1] https://github.com/r2d4/parserllm
       | 
       | [2] https://github.com/thiggle/api
       | 
        | There's also another API on Thiggle that I've built that supports
       | classification via a similar logit-based strategy.
        
         | [deleted]
        
       | [deleted]
        
       | tantalor wrote:
       | "Generating valid JSON" is not impressive. Here's some valid
       | JSON: []
       | 
       | The tricky part is generating _useful_ JSON.
        
         | ape4 wrote:
         | Or JSON that correctly answers what the prompt is asking.
        
         | AtNightWeCode wrote:
         | "" valid!
        
         | notpushkin wrote:
         | Generating valid JSON that _conforms to a given schema_ is
         | pretty useful, although not impressive by itself. If the model
          | can deduce field values from schema alone though, I think it's
         | pretty neat.
        
       | quickthrower2 wrote:
       | > LLMs can generate valid JSON 100% of the time
       | 
       | If that seems surprising, it is worth doing a course like
        | Karpathy's zero to hero NN, and having all the magic peeled away
       | layer at a time.
       | 
       | The reason you can do this is because LLMs don't just generate
        | the next word or token; they produce a probability distribution
       | over all tokens. A JSON parser can give you a list of next valid
       | tokens. The tokens in each case might be from a different set,
        | e.g. the LLM thinks of " The" whereas the JSON parser might think of
       | "{", so you need some conversion there. But if you sample
       | randomly from only the valid tokens, the output must be valid
       | JSON.
       | 
       | What you can't build a parser for though is ... the truth! You
       | may still be told lies or made up stuff.
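        | 
        | A toy sketch of that valid-token filtering, using a small regex in
        | place of a full JSON parser and the third-party regex package's
        | partial matching (this is the simple scan-every-token version, not
        | the indexed one from the post):
        | 
        |     import regex  # third-party `regex` package: supports partial matches
        | 
        |     PATTERN = regex.compile(r'\{ *"name" *: *"[^"]*" *\}')  # stand-in "schema"
        | 
        |     def allowed_next_tokens(prefix, vocabulary):
        |         """Keep every token whose addition can still be completed into a
        |         full match of the pattern. Scans the whole vocabulary each step."""
        |         return [
        |             token_id
        |             for token_id, token in vocabulary.items()
        |             if PATTERN.fullmatch(prefix + token, partial=True)
        |         ]
        | 
        |     vocab = {0: '{"', 1: 'name', 2: '": "', 3: 'Ada', 4: '"}', 5: '<html>'}
        |     print(allowed_next_tokens('{"name": "', vocab))  # [0, 1, 3, 4, 5]; only token 2 dies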
        
       | coder543 wrote:
       | As a more general comment, the repo README provides examples that
       | all use gpt2. It would be nice to see at least one example that
       | invokes llama2, since I feel like that would make sure the reader
       | knows that this library can use models that are more modern and
       | interesting.
        
         | Havoc wrote:
         | Inclined to disagree - gpt2 is far more likely to produce
         | gibberish. So if you can force specific outputs on that then it
         | is a good demo that higher quality models will be even better
        
           | coder543 wrote:
           | Maybe... but then if I want to use something better, I have
           | to figure out how by myself. I said "at least one example",
           | not "please change _all_ the examples to llama2. " I agree
           | with your general point. It would be nice if there were an
           | example of how to use a better model.
           | 
           | Models often have different shapes and requirements, so is it
           | really as simple as changing the string "gpt2" to
           | "llama2-13B-Chat" and it will magically work? If so, that's
           | great, and I wish that was made clear. Unfortunately, that
           | hasn't always been my experience with other libraries.
        
             | remilouf wrote:
             | Agree, working on a Colab with a "better" model as we
             | speak.
        
       | btbuildem wrote:
       | I feel like I'm missing something very basic here, but is this
       | library intended to be used with an existing model? If so, could
       | you point to an example?
        
         | remilouf wrote:
         | It can be used with any open source model (if you can get the
         | logits), and to some extent with OpenAI's API. Here is an
         | example with `transformers`: https://github.com/normal-
         | computing/outlines#efficient-json-...
         | 
         | We plan on adding more model integrations, but it is completely
         | decoupled from the method implementation.
        
       | nikcheerla wrote:
       | Does this work with GPT-4?
        
       | thatcherthorn wrote:
       | This is awesome. I have a vision to build self-managed software.
       | This will be a great tool.
        
         | remilouf wrote:
         | Thank you! Hope this helps and opens many applications :)
        
         | malux85 wrote:
         | This is really great too, I am building self-generating
         | experiments and molecular simulations with
         | https://atomictessellator.com and I am going to try out this
         | framework after work
        
       | spott wrote:
       | How does this relate to ggmls bnf sampling?
        
         | remilouf wrote:
         | Two differences:
         | 
         | (1) This feature only requires regex-guided generation. We have
         | a PR for BNF sampling that is about to be merged. (2) ggml
         | loops over the entire vocabulary (~50k tokens) at each step,
         | which introduces a noticeable overhead, and makes it unusable
         | for complex grammars. Our method works by building an index at
          | initialization, and building the masks at each step with a
          | dictionary lookup. Once the index is built, generation is just
          | as fast as standard generation. It doesn't depend on the
         | complexity of the grammar, the size of the LLM or its
         | vocabulary size.
        
           | spott wrote:
           | Regex-guided gen is slick... is it arbitrary? Or are you
           | custom building it for json?
           | 
           | If arbitrary, how are you pre-defining a set of masks? I
            | would expect splitting an arbitrary regex into a bunch
           | of contexts for a masking dictionary to be non-trivial.
        
       | anotherpaulg wrote:
       | For complex tasks like coding, my experience is that asking for a
       | complex output format hurts performance on the underlying task.
       | 
       | https://aider.chat/docs/benchmarks.html
       | 
       | I am curious if you have measured whether this sort of
       | "constrained generation" suffers from similar downsides?
        
       | leetharris wrote:
       | How does this compare in terms of latency, cost, and
       | effectiveness to jsonformer? https://github.com/1rgs/jsonformer
        
         | bhickey wrote:
         | jsonformer uses a template rather than a DFA. The logit masking
         | seems to be identical, though.
        
         | remilouf wrote:
         | Figure 2 in our paper (https://arxiv.org/abs/2307.09702) shows
         | the difference between guidance and outlines to generate a
          | sequence that matches a regex. Jsonformer uses the same
         | technique as guidance. Extrapolate this to several fields.
         | 
         | Note that we still need to manage the KV cache in outlines.
         | It's a small interface change that will be made this week
         | hopefully, but we've been focusing on constrained generation so
         | far.
        
           | Der_Einzige wrote:
           | Sad to see that my related work on token-level constrained
           | text generation is not cited in the paper:
           | https://github.com/Hellisotherpeople/Constrained-Text-
           | Genera...
           | 
           | https://aclanthology.org/2022.cai-1.2/
        
             | remilouf wrote:
             | We're unfortunately only human and didn't catch every
             | single paper on the topic while writing the draft. Thanks
             | for bringing it to our attention.
        
       | panarky wrote:
       | I can make GPT4 return valid JSON simply by providing examples in
       | the system message. This works nine times out of ten.
       | 
       | But it's still probabilistic, and nine times out of ten isn't
       | good enough.
       | 
       | Occasionally it will hallucinate responses like this:
       | 
       | {"key1": "value1", "key2": "value2" for i in range(n)}
       | 
       | Re-prompting with the parsing error message is usually enough to
       | get it on the second try.
       | 
       | But escaping double-quotes and newline characters is less
       | reliable. Even after giving it multiple examples, it correctly
       | escapes only about half the time.
       | 
       | Re-prompting for escaping errors still yields a ~50% success
       | rate.
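        | 
        | The retry loop is roughly this (a sketch; ask_llm stands in for
        | whatever chat-completion call you already make):
        | 
        |     import json
        | 
        |     def generate_json(prompt, ask_llm, max_retries=2):
        |         """Ask for JSON; on a parse failure, re-prompt with the error message.
        |         `ask_llm(messages) -> str` stands in for your chat-completion call."""
        |         messages = [{"role": "user", "content": prompt}]
        |         for _ in range(max_retries + 1):
        |             reply = ask_llm(messages)
        |             try:
        |                 return json.loads(reply)
        |             except json.JSONDecodeError as err:
        |                 messages += [
        |                     {"role": "assistant", "content": reply},
        |                     {"role": "user", "content":
        |                      f"That was not valid JSON ({err}). "
        |                      "Reply with only the corrected JSON."},
        |                 ]
        |         raise ValueError("no valid JSON after retries")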
        
         | phillipcarter wrote:
         | This is what we do, but for GPT-3.5. And it doesn't need to be
         | system messages either. We even have it emitting _only_ JSON in
         | a specific structure (except for when it fails to produce an
         | output altogether). This is without the function calling model.
        
         | caesil wrote:
         | With ChatGPT function calling I get valid JSON 100% of the time
         | from GPT-4 unless I have made some error in prompting.
         | 
         | The chief error is not providing escape hatches. LLMs look for
         | a right answer. If you are feeding it some texts and asking it
         | to return structured data about the texts, but then one of the
         | texts is blank, it will be difficult to determine a right
         | answer, so you get hallucinations. The solution is an escape
         | hatch where one of the arguments is a `textIsMissing` boolean
         | or something.
         | 
         | As long as you've accounted for these failure modes, it works
         | flawlessly.
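          | 
          | A sketch of what such an escape hatch can look like in a
          | function definition (the names here are made up):
          | 
          |     # The model is never forced to invent values for missing text:
          |     # it can set textIsMissing instead.
          |     extract_quote_info = {
          |         "name": "extract_quote_info",
          |         "description": "Extract structured data about the provided text.",
          |         "parameters": {
          |             "type": "object",
          |             "properties": {
          |                 "textIsMissing": {
          |                     "type": "boolean",
          |                     "description": "True if no text was actually provided.",
          |                 },
          |                 "summary": {"type": "string"},
          |                 "sentiment": {
          |                     "type": "string",
          |                     "enum": ["positive", "neutral", "negative"],
          |                 },
          |             },
          |             "required": ["textIsMissing"],
          |         },
          |     }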
        
         | MuffinFlavored wrote:
         | I wonder if the next iteration of OpenAI features is something
         | like:
         | 
         | right now you can inject prompts that the LLM takes into
         | consideration before the output
         | 
         | I wonder if you can make it have a "post" generation function
         | that says like "keep re-trying in a loop (aka hallucinating
         | with randomness) until the output message passes XYZ
         | format/checks/scoring"
        
           | padjo wrote:
           | It's starting to feel like LLMs are to "classical" software
           | engineering what quantum physics was to classical physics
        
         | keiferwiseman wrote:
         | It took some iterations but I've managed to get the OpenAI API
         | to give me valid JSON 100% of the time now(based on my
         | testing). I think I put in the prompt to never use newlines
         | because it was causing issues lol.
        
         | thumbsup-_- wrote:
         | Yeah same thing. I have done the same with GPT-3.5. Simply ask
         | it to output using provided schema only and give a few
         | examples. Always outputs in provided json format
        
         | nextaccountic wrote:
         | What about reprompting with a different temperature value?
         | 
         | If this works, how to select the optimal value? Maybe you can
         | train a model that can excel at the task of querying gpt4 for
         | valid jsons
        
         | simonw wrote:
         | That re-prompting on error trick is what this new Microsoft
         | library does, too: https://github.com/microsoft/TypeChat
         | 
         | Here's their prompt for that:
         | https://github.com/microsoft/TypeChat/blob/c45460f4030938da3...
         | 
         | I think the approach using grammars (seen here, but also in
         | things like https://github.com/ggerganov/llama.cpp/pull/1773 )
         | is a much more elegant solution.
        
         | padolsey wrote:
         | I've had more luck with getting it to output XML as (1) You can
         | imbue XML with actual language/meaning (which LLMs adore) and
         | (2) parsers can be made to be more forgiving. I get why people
         | want to make JSON, but to me it's a bit like trying to get a
         | cat to swim - you might eventually succeed, but it's not their
         | natural inclination.
        
           | gowld wrote:
           | How do you imbue XML with meaning?
        
             | padolsey wrote:
             | XML Elements themselves: their naming, their attributes,
             | comments, indentation. There's more opportunity at every
             | level of the hierarchy to demarkate and establish meaning.
             | Having closing-tags as well, I've found, is a massive boon;
             | LLMs can better understand what "finishing" looks like if
              | it's delimited in a semantic way - with a name.
        
         | orasis wrote:
         | What about using ChatGPT's new function calling mechanism?
        
           | superasn wrote:
            | That returns broken JSON a lot of the time too
        
       | activatedgeek wrote:
        | Mechanistically, I think this library takes the simple idea of
        | masking part of the vocabulary space at each time step and
        | implements it efficiently. Great!
       | 
       | I am curious, however, for the ones who have played around with
       | such libraries wrapping base LLMs with output structure: do base
       | models like Llama2 work very well? My experience says "hell no!"
       | and you do need a fair bit of instruction-tuning for specific use
       | cases to actually get things to work.
       | 
        | And even then, it seems very counter-intuitive to me that, given
        | an instruction-tuned model, post-hoc masking of the state space
        | during generation just changes the generation distribution, which
        | is potentially detrimental to the instruction-tuning?
        
         | Havoc wrote:
         | >you do need a fair bit of instruction-tuning for specific use
         | cases to actually get things to work.
         | 
         | The instruction tuning part is "trivial"...it's the dealing
         | with edge cases part that gets me.
         | 
          | With classic code, edge cases are, well, insignificant edge
          | cases. With an LLM you never know what will make it go off on a
          | tangent & the parsing code needs to deal with that chaos.
         | 
          | Or put differently, the % of cases that are edge cases seems to
         | have gone up dramatically
        
         | ethbr1 wrote:
         | > _...given an instruction-tuned model, post-hoc masking of the
         | state-space during generation then amounts to just changing the
         | generation distribution..._
         | 
         | Isn't that what we did with test driven development?
         | 
         | The primary difference was our generator functions were human
         | instead of LLM. Why not cut out the middle-human?
        
         | make3 wrote:
         | I'm not sure of why you would want to use raw llama-2 though
          | when there are a million super strong instruction fine-tuned
         | versions of llama-2 on HF hub that would do the job a million
         | times better? Like Stability-AI's Beluga-2. See
         | https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...
         | 
         | About your second point, the goal is that the model can only
         | generate JSON (for example), which can 100% be done by
          | constraining which output tokens can and cannot be used.
        
           | nabakin wrote:
           | Don't rely too much on automated benchmarks for LLMs. They
           | are often gamed, made to overfit, and result in worse
           | performance in the general case.
           | 
           | Human evaluation is the gold standard and the Llama 2 paper
            | gave significant evidence that Llama 2 70b chat is on par
            | with, if not better than, ChatGPT for that metric, so I tend
            | to stick to it unless there is good reason not to.
        
             | huevosabio wrote:
             | The problem with Llama 2 chat versions is that they have
             | been RLHF-ed to death. You can't ask questions without
             | getting a sermon of how your question may be inappropriate
             | for this or that reason.
             | 
             | I think it's worse on the smaller models, but still present
             | in the 70B one.
        
       | lettergram wrote:
       | Few thoughts, you're effectively creating representations that
       | can convert to JSON (kudos!)
       | 
       | Can't mention how we did it (there are a lot of public patents,
       | if interested), but back in 2018 we had a way to generate
       | synthetic data (statistically, structurally similar) off any
       | dataset - https://medium.com/capital-one-tech/why-you-dont-
       | necessarily... You could also design datasets if you wanted.
       | 
       | It'd keep similar relations and worked pretty darn well. Not the
       | exact same, but always produced valid JSON.
        
       | simonw wrote:
       | I really hope OpenAI add something like this to their endpoints
       | soon.
       | 
       | Being able to pass up some kind of grammar (a regular expression,
       | or a JSON schema, or some other format) and have this trick run
       | during their token sampling process to ensure the output was
       | compliant would be incredibly useful.
        
         | joshuanapoli wrote:
         | Isn't the Function Calling feature meant for this purpose? It
         | guides the LLM to output according to the given schema. The
         | name of the feature is a little misleading.
         | 
         | https://platform.openai.com/docs/guides/gpt/function-calling
        
           | tornato7 wrote:
           | Function Calling is fine-tuned to a certain output format,
           | but it very often strays from that format. My function-
           | calling-handling code has a mess of edge case handlers that
           | catch when GPT-4 is calling functions incorrectly.
        
           | simonw wrote:
           | Surprisingly the function calling mechanism doesn't appear to
           | use this trick - apparently it's still possible to get the
           | wrong JSON structure back from it occasionally.
        
           | M4v3R wrote:
           | It's not though, they even say it in their docs that sending
           | a schema does not guarantee that the model will actually
            | adhere to the schema or even produce valid JSON
        
         | potatoman22 wrote:
         | They recently added logit biases, so that's a start.
        
       | dvt wrote:
       | [flagged]
        
       | aduffy wrote:
       | This is exciting, we built a similar tool[1] recently
       | specifically targeted at constraining llama output to match a
       | TypeScript interface.
       | 
       | I firmly believe that output format guarantees are going to be
        | important for real (non-toy) use cases for LLMs
       | 
       | [1] https://github.com/ggerganov/llama.cpp/discussions/2494
        
       | BoorishBears wrote:
       | I'm not sure how this is different than:
       | 
       | https://github.com/1rgs/jsonformer
       | 
       | or
       | 
       | https://github.com/newhouseb/clownfish
       | 
       | or
       | 
       | https://github.com/mkuchnik/relm
       | 
       | or
       | 
       | https://github.com/ggerganov/llama.cpp/pull/1773
       | 
       | or
       | 
       | https://github.com/Shopify/torch-grammar
       | 
       | Overall there are a _ton_ of these logit based guidance systems,
        | the reason they don't get tons of traction is the SOTA models
       | are behind REST APIs that don't enable this fine-grained
       | approach.
       | 
       | Those models perform so much better that people generally settle
       | for just re-requesting until they get the correct format (and
       | with GPT-4 that ends up being a fairly rare occurrence in my
       | experience)
        
         | remilouf wrote:
         | Thanks for bringing clownfish and relm to my attention! afaik
         | other libraries loop over the entire vocabulary at every step
         | of the generation. We on the other hand build an index at
         | initialization by looping once over the vocabulary. Then
         | generation is just as fast as standard generation.
        
           | burke wrote:
           | torch-grammar generates a mask per PDA stack... we don't try
           | to compute all the possible stacks. I'm sure there's
           | something smarter that could be done here and you've probably
           | figured it out (though IIRC regular languages don't have the
           | arbitrarily recursive stack problem that you get when you get
            | to context-free languages?). Anyway, in practice we spend a
           | few milliseconds on the first few requests building caches
           | and then just apply masks from caches after that.
        
       | sneedchucker wrote:
       | Relevant; LLama.cpp implemented grammar-based sampling last
       | month.
       | 
       | https://news.ycombinator.com/item?id=36819906
       | https://github.com/ggerganov/llama.cpp/pull/1773
        
         | remilouf wrote:
         | We can extend our approach to grammar-based sampling, as
         | explained in the paper linked above. Relevant PR:
         | https://github.com/normal-computing/outlines/pull/178
         | 
         | Our method is much more efficient. llama.cpp loops over the
         | entire vocabulary (~50k tokens) _at each step_ to generate the
         | mask. We generate an index at initialization, and building the
          | masks at each step only requires a dictionary lookup (trading
          | memory for speed). Sampling is just as fast as standard
         | sampling.
        
           | popinman322 wrote:
           | It should hopefully be a quick change to llama.cpp to add a
           | mask per grammar state to bring it in line with your
           | generation method; I don't think the two are incompatible,
           | thankfully.
           | 
           | I do wonder how much you win here by masking the tokens? You
           | still need to iterate along the output vector to apply the
           | mask. Masking on the accelerator still requires filtering on
           | the CPU side? Compared to running the language model, the
           | cost of iterating over the edges in the grammar seems small.
        
           | burke wrote:
           | Yes! This is closer to the approach I took in my port of
           | llama.cpp's grammar support to PyTorch:
           | https://github.com/Shopify/torch-
           | grammar/blob/main/torch_gra... ... it generates a tensor
           | mapping each PDA stack to a map of which tokens are
           | acceptable from that state. It seems like a much better way
           | to do it than looping over the sampled tokens on each turn.
        
       ___________________________________________________________________
       (page generated 2023-08-14 23:00 UTC)