[HN Gopher] Jsonformer: Generate structured output from LLMs
       ___________________________________________________________________
        
       Jsonformer: Generate structured output from LLMs
        
       Author : yunyu
       Score  : 160 points
       Date   : 2023-05-02 16:29 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | rickcarlino wrote:
       | Has anyone seen a tool like this that uses Node rather than
       | Python? I have this exact problem in a GPT-based web application
       | I am building and have had to resort to some "creative"
       | solutions. At the very least I am glad to see people are tackling
       | this problem.
        
         | msikora wrote:
         | Same here. Considering switching my project (or at least part
         | of it) to Python. For anything to do with LLMs or ML in general
         | Python has by far the best libraries. JS is probably second, at
         | least for LLM stuff, but it is a distant second place...
        
         | SparkyMcUnicorn wrote:
         | I've had good luck with Langchain's output parsers[0], but in
         | addition to the format instructions I also append something
         | like "Do not provide any explanations, just the JSON output.",
         | which helps eliminate content being generated outside of the
         | JSON block.
         | 
         | [0]
         | https://js.langchain.com/docs/modules/prompts/output_parsers...
        
       | benob wrote:
        | How about going one step further and constraining transformer
        | output with a context-free grammar? That way you could generate
        | conformant code in languages such as Python or C.
        
         | Der_Einzige wrote:
          | This may be possible with constrained beam search, which
          | huggingface has quietly supported for a long time.
        
           | gamegoblin wrote:
            | Wouldn't even need beam search if you restrict it to
            | deterministic context-free grammars, which would satisfy >
            | 95% of these "generate some JSON schema" use-cases. For DCFGs
            | you can just zero out the probability of any token that is
            | invalid in the context, no lookahead or search needed.
            | Wouldn't work for truly context-free things like most
            | programming languages, though.
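            | 
            | A rough sketch of that token-masking idea, as a HuggingFace
            | LogitsProcessor; allowed_token_ids here is a hypothetical
            | hook that encodes your grammar, not part of any library:
            | 
            |   import torch
            |   from transformers import LogitsProcessor, LogitsProcessorList
            | 
            |   class GrammarMask(LogitsProcessor):
            |       def __init__(self, tokenizer, allowed_token_ids):
            |           self.tokenizer = tokenizer
            |           # allowed_token_ids: text so far -> set of valid next token ids
            |           self.allowed_token_ids = allowed_token_ids
            | 
            |       def __call__(self, input_ids, scores):
            |           text = self.tokenizer.decode(input_ids[0])
            |           allowed = list(self.allowed_token_ids(text))
            |           mask = torch.full_like(scores, float("-inf"))
            |           mask[:, allowed] = 0.0  # leave only grammar-valid tokens
            |           return scores + mask
            | 
            |   # usage:
            |   # model.generate(**inputs, logits_processor=LogitsProcessorList(
            |   #     [GrammarMask(tokenizer, allowed_token_ids)]))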
        
       | sundarurfriend wrote:
       | > Bulletproof JSON generation: Jsonformer ensures that the
       | generated JSON is always syntactically correct and conforms to
       | the specified schema.
       | 
       | This is an important definition to take note of: "bulletproof"
       | doesn't mean that you'll get good or correct data. It only means
       | that it'll be valid JSON and in a particular schema that you
       | specify (because the LLM isn't building the JSON in the first
       | place, the library is).
       | 
       | It's an interesting idea. But it's not clear if they've validated
       | the heuristics they use, to see how well it performs in terms of
       | accuracy against, say, some kind of BeautifulSoup-like attempt to
       | make sense of the JSON-ish that the LLM produces and correct that
       | to be valid JSON, or any other approach to the problem.
        
         | dragonwriter wrote:
          | I wonder if LLMs are at the point where reprompting the LLM
          | with much the same error message that a user-friendly JSON
          | processing tool would show a human is usually a good way to
          | fix errors.
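          | 
          | A minimal sketch of that retry loop; `complete` is a stand-in
          | for whatever LLM call you use, not a real API:
          | 
          |   import json
          | 
          |   def generate_json(prompt, complete, max_retries=3):
          |       text = complete(prompt)
          |       for _ in range(max_retries):
          |           try:
          |               return json.loads(text)
          |           except json.JSONDecodeError as e:
          |               # feed back the same kind of error a human would see
          |               text = complete(
          |                   f"{prompt}\n\nYour previous output was not valid "
          |                   f"JSON ({e.msg} at line {e.lineno}, column {e.colno}). "
          |                   f"Reply with only the corrected JSON:\n{text}"
          |               )
          |       raise ValueError("could not obtain valid JSON")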
        
           | execveat wrote:
            | Yeah, but that could require multiple queries, which isn't
            | very efficient. Training a model just to fix JSON would be
            | better.
        
           | newhouseb wrote:
           | Sometimes, but it very much depends on the context (no pun
           | intended). If it's a pure syntax issue, OpenAI models will
           | almost certainly make the right correction. If it's more
            | abstract, like the LLM has hallucinated a property that is
            | invalid as part of some larger schema, you can quickly descend
            | into the LLM gaslighting you, saying that it has fixed things
            | when it hasn't.
        
       | apalmer wrote:
       | Trying to understand why this is necessary? LLMs cannot reliably
       | generate valid Jason?
        
         | dwallin wrote:
         | Two ways in which it is useful over existing techniques:
         | 
         | - It is guaranteed to match your schema
         | 
         | - It is much lighter weight
        
           | Der_Einzige wrote:
            | Also, it costs tokens to ask a model to do something, and it
            | may choose not to respect your constraints anyway.
            | 
            | You can force it, for free, by banning the vocabulary that
            | would violate a constraint.
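            | 
            | With HuggingFace models, one way to do that banning at decode
            | time is the bad_words_ids argument to generate(); the banned
            | strings below are only illustrative:
            | 
            |   from transformers import AutoModelForCausalLM, AutoTokenizer
            | 
            |   tok = AutoTokenizer.from_pretrained("gpt2")
            |   model = AutoModelForCausalLM.from_pretrained("gpt2")
            | 
            |   banned = ["```", "Note:", "Explanation"]  # whatever violates your constraint
            |   bad_words_ids = tok(banned, add_special_tokens=False).input_ids
            | 
            |   inputs = tok("Output the JSON object:", return_tensors="pt")
            |   out = model.generate(**inputs, bad_words_ids=bad_words_ids,
            |                        max_new_tokens=64, pad_token_id=tok.eos_token_id)
            |   print(tok.decode(out[0], skip_special_tokens=True))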
        
         | tysam_and wrote:
         | Yes, but Mike on the other hand....
        
       | kcorbitt wrote:
       | I've thought about building this for a while, glad it's out
       | there!
       | 
       | Not only does this guarantee your output is JSON, it lowers your
       | generation cost and latency by filling in many of the repetitive
       | schema tokens without passing them through the LLM.
       | 
       | For the very common case of "extracting multiple structured
       | fields from a piece of unstructured text," I believe there's an
       | even stronger optimization possible that would further decrease
       | costs, latency and potentially even improve accuracy.
       | 
       | Assuming the fields you want to extract are independent (and they
       | often are), you don't _need_ to generate them all in one go
        | autoregressively. Eg. instead of running the following pseudo-
        | prompt:
        | 
        |   "Input: 'It's sunny and cold today'
        |   Output schema: {"sunny": boolean, "temperature": string}"
        | 
        | You could instead run the following two:
        | 
        |   "Input: 'It's sunny and cold today'
        |   Output schema: {"sunny": boolean}"
        | 
        |   "Input: 'It's sunny and cold today'
        |   Output schema: {"temperature": string}"
       | 
       | We don't do that today because when done naively it's very
       | inefficient -- you'd be tokenizing, passing to the GPU, and
       | computing the KV cache of the shared part of the prompt twice.
       | But a library with the right abstraction could run the second two
       | queries in a batch in parallel and reuse the same tokenization
       | and KV cache for both of them. It would actually be _more_
       | efficient than generating both fields in one go, since when you
       | factor out the shared prefixes both the generated text and its
       | context are shorter!
       | 
       | I mentioned above that this could also improve accuracy. Of
       | course it doesn't do that by default (except that by excluding
       | all the irrelevant fields it makes self-attention's job easier).
        | But what it _does_ do is give you an independent prompt for each
        | field you're interested in. And so for particularly tricky
       | fields you're trying to extract, you have the flexibility to eg.
       | add several examples to make the generation N-shot.
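        | 
        | A simplified sketch of that per-field idea, batching the prompts
        | that share a prefix (a smarter implementation would also reuse
        | the prefix's KV cache rather than just batching; the model choice
        | here is only for illustration):
        | 
        |   from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |   tok = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
        |   tok.pad_token = tok.eos_token
        |   model = AutoModelForCausalLM.from_pretrained("gpt2")
        | 
        |   prefix = "Input: 'It's sunny and cold today'\nOutput schema: "
        |   prompts = [prefix + '{"sunny": ', prefix + '{"temperature": "']
        | 
        |   inputs = tok(prompts, return_tensors="pt", padding=True)
        |   out = model.generate(**inputs, max_new_tokens=8,
        |                        pad_token_id=tok.eos_token_id)
        |   for seq in out:
        |       print(tok.decode(seq, skip_special_tokens=True))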
        
         | travisjungroth wrote:
          | Maybe this will make CUE popular. It's similar to JSON, but
          | the ideas of schema and values are put together through
          | unification, or you could say narrowing constraints. CUE
          | would handle taking
         | all of those values individually, then combining them into
         | something concrete, incomplete, or erroring out.
        
         | execveat wrote:
         | You'd need to put the input first for this approach to work,
         | but in my testing models work better if you lead with a
         | question.
        
           | kcorbitt wrote:
           | Hmm. I admit that I haven't thought about this deeply, but
           | I'm not sure that's true? It seems to me that you could
           | extend the KV cache either backwards or forwards equally
           | easily.
        
             | Siira wrote:
             | You can't. The later values depend on the earlier ones, so
             | changing the early tokens invalidates your whole cache.
             | 
             | This is also probably why leading with a question works
             | better in the first place. All later processing conditions
             | on the question in this way.
             | 
             | BTW, in my very limited testing, GPT4 doesn't care about
             | the order.
        
           | tysam_and wrote:
           | I could be reading this wrong, but my assumption is/has been
           | that the prompt goes up to the end of the JSON field name,
           | and the LLM is only filling in the actual value, not the key.
           | I could be wrong on this one, however.
        
         | bckr wrote:
         | Can you briefly describe how you got to the point of having
         | this kind of intuition about language models?
        
           | kcorbitt wrote:
           | Kind of a meta-answer, but my personal learning style is
           | "think of something cool to build, then figure out what I
           | need to know to build it." It just so happens that a lot of
           | the interesting/cool stuff going on right now builds on top
           | of LLMs, so my projects have naturally gravitated that way.
            | But I've never sat down to take an AI course or read the
           | "Attention Is All You Need" paper or anything. I absorb much
           | more when I learn something because I need it for something
           | I'm working on.
        
             | catchnear4321 wrote:
             | it is a correct answer.
        
           | tysam_and wrote:
           | I can't speak for OP, but something that I think helps is if
           | you think about the generation process as a jumping off point
           | that one can control the placement of, but not really much
           | that is generated afterwards.
           | 
           | Adding a scheme like this reduces the area of potential off-
           | roading that the LLM can do to a much smaller zone.
           | Additionally, it breaks up the chain of dependencies between
           | the two example outputs, because now we do not need to depend
           | upon past inputs to correctly output this scheme.
           | 
            | Since the information for JSON semantic structure is no
            | longer required to be driven by the LLM (it still has to
            | understand it to be able to generate things with a modicum
            | of sense, IIRC), we can look at our dependency graph for
            | outputs. _This changes because now the fields really and
            | truly are independent (if they are truly informationally
            | independent)_.
           | 
            | So now some kind of conjoined information requirement of
            | 
            |   ( autoregressive output ) <- (( field A ) <- ( field B ))
            | 
            | becomes
            | 
            |   ( autoregressive output ) <- (( field A ) && ( field B ))
            | 
            | which can then be factored out into separate calls instead of
            | sequential ones, yielding a batched call of
            | 
            |   (( autoregressive output A ) <- ( field A ) &&
            |    ( autoregressive output B ) <- ( field B )).
           | 
           | From there it is just implementation. I likely would not have
           | thought about the OP's way of handling things for a good
           | while, though maybe I would have stumbled into it had I
           | enough reason to think about structured/templated kinds of
           | generation, which I do believe that I do now! <3 :) It really
           | breaks a lot of assumptions that are easy to quietly make and
           | I had not thought appropriately about the consequences of
           | reframing things in this way, to be honest.
           | 
           | As for "how" to think about this, if I were to give my take,
            | it would be to always just turn whatever problem is in front
            | of you into a puzzle where you simplify it further each time.
           | Optimizing for less computation, time, code, or even just
           | what all of those are a kind of proxy for: less information
           | to sufficiently solve a problem. We can see that this problem
           | is reduced in complexity appropriately because we remove a
           | redundancy that does not need to be there at all.
           | 
           | One way to look at this is in the relationships between parts
           | of an idea. If you're able to understand, even vaguely, the
           | concepts behind some other concept and how they interact, and
           | maybe even have a 'standard toolkit' of relating to them, you
           | can start finding/transferring/applying other skills to these
           | parts of a concept. I don't think there's a guaranteed-
           | efficient way to maybe reduce a concept down to its parts, or
           | down to a more efficient representation without already,
           | well, knowing that representation. It's an NP-hard problem to
           | me personally, and is the reason why research and other
           | academic pursuits can take a while. It is a good skill to
           | learn I suppose and I certainly enjoy trying to use it,
           | personally.
           | 
           | To tie this back to your question about language models --
           | yes, some things have to do with the language model, but
           | oftentimes it's actually just the raw mathematical components
            | underlying a model. If you look for that (please please
            | please please please!!!!), then you don't necessarily _have_
           | to concern yourself with the implementation details (beyond
           | runtime limits, etc), as long as the math still applies you
           | should be able to reason really quite well about what else is
           | happening/could happen with a model type like these are.
           | 
           | In particular, LLMs being an autoregressive model where each
           | output depends upon its inputs lets us set up a dependency
           | graph. Then based upon some prior assumptions, we can maybe
           | make some substitutions/changes that allow us to fragment the
           | dependency graph and move it around as we wish. This is not
           | just applicable to LLMs, however, dependency graphs are
           | useful in a wide number of areas.
           | 
           | So one other thing that we're not talking about here is that
           | we're optimizing for an objective we want (clean JSON) by
           | explicitly...well, injecting that objective instead of living
           | on just hopes and dreams, y'aknow. This is a pretty
           | straightforward way of solving the problem by putting the
           | answer in the question, though poor input content still can
           | be a problem.
           | 
           | Stated a different way, we're collapsing the entropy of what
           | the network can introduce (which should be JSON, but remember
           | [!!!!!!!], neural networks are noisy estimators, and JSON
           | errors are mathematically guaranteed (even if rare), which
           | means any pipeline depending upon output like code can and
           | will fail, and is brittle to all sorts of other kinds of
           | complicated parsing errors. This is because to
           | catch/detect/enumerate/correct these errors, we need to have
           | all of the information needed to implement a JSON structure
           | itself. So basically we'd be using the same exact
           | information, just enforcing it in a horrendously inefficient
           | manner, which is how people have been doing it until the
           | present, which is okay as we humans are certainly not NP-
           | optimal machines IMO. In any case, we're still in the
           | parentheses, and the point was that any kind of variance can
           | be a problem here beyond some extremely tiny limit, and
           | that's not what LLMs are made to do. So at some point it's
           | guaranteed to break, and high volumes -- it's basically
           | guaranteed to break in a way that's either unusable or
           | requires so much effort to fix that you might as well have
           | embedded a JSON prior into your network generation process
           | because it would have required the same amount of information
           | as external validation would, albeit with less effort
           | (!!!!)), which is perfectly fine in our case if we're
           | exclusively generating JSON as it gives us what we want. Most
           | methods like this thankfully should have a low level of
           | invasiveness to the model as well, freeing us up to use
           | either the same or a similar model for multiple tasks.
           | 
           | This can create a bit of an ideological illusion as we
           | technically are destroying information by collapsing the
           | distributions of sentences/strings of tokens/etc that we are
            | generating, and maybe lend itself to an "oh, we can add
            | whatever functionality we want!" kind of belief about this
            | kind of modeling. It's important what we're adding and taking
            | away. Also important is why training these models on
            | next-token prediction over large text corpora is so powerful:
            | we can trim them down to some smaller subproblem
           | much much more easily than we can expand them to cover a
           | larger subset. Which is pretty darn cool!
           | 
           | I know this sorta flew around a lot of places and touched on
           | a lot of things, probably not as cogently as I'd want to if I
           | had more time to review and revise it. Hope it was/is helpful
           | for you and feel free to let me know if you have any
           | questions. It's a very cool topic on the whole to me, tbh,
           | and there's a number of interesting conversations that can
           | branch off from this one. Honestly this whole general area is
           | where I see the real value in LLM development in research.
           | It's practical and it's helpful! :D :) <3 :)
           | 
           | Source for experience is a number of years of experience
           | across a wide variety of ML models, though I'm sure I made an
            | embarrassing blunder or two in this post. ;P
        
           | kolinko wrote:
            | Not op, but I can share my approach - I went line by line
            | through Recmo's Cria: https://github.com/recmo/cria - which
            | is an implementation of Llama in Numpy, so very low level.
            | Took me I think 3-4 days x 10 hours, plus 1-2 days of reading
            | about Transformers, to understand what's going on - but from
            | that you can see how models generate text and gain a deep
            | understanding of it.
        
       | visarga wrote:
        | I wanted to see the opposite - parsing the JSON and YAML that
        | LLMs generate. It doesn't happen much with GPT-4, but lesser
        | models can mess up the format, and then you can't simply parse
        | it.
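        | 
        | A common band-aid, sketched here, is to strip code fences and
        | pull out the outermost braces before parsing (this obviously
        | won't repair truly broken JSON):
        | 
        |   import json, re
        | 
        |   def lenient_json(text):
        |       # drop markdown-style code fences if the model added them
        |       text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip(),
        |                     flags=re.MULTILINE)
        |       try:
        |           return json.loads(text)
        |       except json.JSONDecodeError:
        |           # fall back to the outermost {...} block, if any
        |           match = re.search(r"\{.*\}", text, flags=re.DOTALL)
        |           if match:
        |               return json.loads(match.group(0))
        |           raise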
        
         | ImaCake wrote:
          | It sorta feels like LLMs or some kind of NN should be useful
          | (with training) for parsing malformed JSON. I suspect it's a
          | hard problem, but honestly it would be such a massive help for
          | those of us dealing with data at work!
        
       | andrewcamel wrote:
        | I've seen a lot of things trying to do this by pressure testing
        | the outputs, but they all feel like anti-patterns. This is the
        | first that seems like the "right" way to do it. Better to manage
        | how the model is generating vs. creating one more potentially
        | faulty "glue" layer.
        
         | lt wrote:
         | Can you elaborate about what you mean by pressure testing?
         | Haven't heard this term yet.
        
           | andrewcamel wrote:
            | Maybe not the right term... Just that a lot of other libs act
            | like guardrails, i.e. they let the model generate what it
            | wants (full free-form text / GPT output), then try to parse
            | out what you want and error if the output doesn't conform to
            | the expected format. As opposed to basically only allowing
            | the model to generate into the already-built JSON form
            | fields. Understandable why this guardrails/parsing approach
            | is so popular though... you can't do what this library is
            | doing with the OpenAI API. You need to be able to manipulate
            | the token generation; otherwise you're forced to take full
            | text output and try to parse it.
        
         | tysam_and wrote:
         | Mathematically it requires less information to impose a certain
         | prior on data in the process of generation than it does to read
         | the data, do error detection and correction according to a
         | prior, and then return it, if I understand correctly.
         | 
         | Something always felt incredibly icky to me about any kind of
         | ad-hoc 'fixer' scripts that were part of a pipeline that was
         | fully controlled by a user.
        
       | phh wrote:
        | I hope this is new to no one generating JSON with an LLM, because
        | it felt like the obvious first thing to do when I implemented
        | that kind of stuff. That being said, it's nice to have it as a
        | ready-to-go library.
        
       | ianbutler wrote:
        | Nice, this codifies something similar to what I've been doing in
        | my prompts! Will be using this instead.
       | 
       | What I currently have been doing:
       | 
        | The JSON template for your response is provided below. The parts
        | to fill out are capitalized. Please do not modify the template.
        | Please fill in the template with one of the above options for
        | your response.
        | 
        |   <result>
        |   { "rating": "N. RATING", "reason": "REASON" }
        |   </result>
        
       | layoric wrote:
        | I might be reading the code wrong, but it looks like it crawls
        | the schema, making one generation per primitive type. While
        | that's a clever way to ensure valid JSON, I don't know if I'd go
        | as far as to describe it as efficient.
        | 
        | That said, if the model is unable to generate JSON due to its
        | training/fine-tuning, this is indeed a clever solution!
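        | 
        | Roughly the shape of that schema-crawling approach (not
        | Jsonformer's actual code; generate_value stands in for a
        | hypothetical constrained per-type generation call):
        | 
        |   def fill_schema(schema, prompt, generate_value):
        |       t = schema["type"]
        |       if t == "object":
        |           return {k: fill_schema(v, prompt, generate_value)
        |                   for k, v in schema["properties"].items()}
        |       if t == "array":
        |           return [fill_schema(schema["items"], prompt, generate_value)]
        |       # primitive: one generation per number/string/boolean
        |       return generate_value(prompt, t)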
        
       | pklee wrote:
        | This is pretty cool. I tried it with Dolly and then with T5-base,
        | and neither gave me a result - it broke for me. Has anyone else
        | tried it?
        
       | Jayakumark wrote:
        | It's great that this does not use OpenAI and runs locally.
        
       | tough wrote:
        | I know of a similar one called GPTyped; I just posted it on HN:
        | https://news.ycombinator.com/item?id=35793056#35793057
        
       | wy35 wrote:
       | Very interesting. I've only been using OpenAI APIs so this logit
       | stuff is new to me.
        
         | Der_Einzige wrote:
         | I've complained bitterly and openly about how annoying it is
         | that OpenAI locks down access to the full probability
         | distribution. Glad to see that others are running into this
         | stupid limitation and are doing work related to it.
        
         | esafak wrote:
         | It's a testament to the democratization of ML that
         | practitioners today can get by without knowing what a logit is.
        
           | tysam_and wrote:
           | And I am personally glad for that, for one! This means it's
           | accessible to more people without requiring specialized
           | knowledge, and while, yes, I think that always triggers an
           | internal reaction from most of us when it comes to thinking
           | about field dilution, it's almost a necessary tradeoff (like,
           | say, the uncertainty principle) when expanding the field out
           | to more people.
           | 
           | So, hurray! We've made it more accessible. And hopefully in
           | years to come, even very much more so! <3 :)
        
         | koboll wrote:
         | I'm flabbergasted that OpenAI does not yet offer an API that
         | reliably returns JSON based on some schema you feed it. It's
         | sort of possible to force it to do this but not really to a
         | production-ready degree.
        
       | newhouseb wrote:
       | Oh nice! I built a similar system a few weeks ago:
       | https://github.com/newhouseb/clownfish
       | 
       | I think the main differentiating factor here is that this is
       | better if you have a simpler JSON schema without enums or oneOf
       | constraints. If you do have these constraints, i.e. let's say you
        | wanted an array of different types that represented items on a
       | menu { kind: pizza, toppings: [pepperoni] } or { kind: ice_cream,
       | flavor: vanilla | strawberry } then you would need something more
       | sophisticated like clownfish that can ask the LLM to pick
       | specific properties (and an ability to do some backtracking so
       | you can do proper beam search).
       | 
       | For completeness, another common approach can be found here:
       | https://github.com/ShreyaR/guardrails which essentially boils
       | down to "provide the schema in the prompt and ask the LLM to
       | correct things if it fails to get the schema right the first
       | time."
        
         | gamegoblin wrote:
         | I hate that gpt-3.5-turbo is so cheap that using systems like
         | guardrails is a sane thing to do. I can almost always prompt
          | davinci-003 without guardrails in a way that gets my exact
          | schema 1-shot, whereas guardrails + 3.5-turbo will often
          | consume 2-4x more tokens - but that still makes it
          | significantly cheaper.
        
           | brigadier132 wrote:
            | The problem people are having is hitting the rate limits for
            | ChatGPT.
        
         | joshuanapoli wrote:
         | Thank you for the really clear and complete description of
         | "ControLogit"s and your approach in clownfish!
        
         | killthebuddha wrote:
         | One thing that I really like about the approach you took with
         | clownfish is that it doesn't constrain or modify the structure
         | of the prompt.
         | 
         | One of the primary difficulties with writing LLM applications
         | is that prompts are basically not composable, and any LLM
         | library that modifies your prompt is going to be a nightmare to
         | work with.
        
           | killthebuddha wrote:
           | Follow-up thought I just had: It seems that prompt structure
           | standards are going to have to emerge if any of these tools
           | have a shot at interoperability. I don't have hard data, but
           | IME if a prompt is structured
           | 
           | MEMORY EXAMPLE INSTRUCTION [COMPLETION]
           | 
           | it will basically not work to wrap it in a prompt that's
           | structured
           | 
           | INSTRUCTION MEMORY EXAMPLE [COMPLETION]
        
             | ianbutler wrote:
             | Interoperability can also be achieved with small adapters
             | written for the prompting style of the particular model
             | being interfaced with, I'd be surprised if like LangChain
             | or AutoGPT don't already do something like this in their
             | systems.
             | 
             | I'm currently building something that leverages an ensemble
             | of different LLMs depending on the difficulty of a task and
             | ran into this issue.
             | 
             | Dolly V2 takes "###Instruction: <your stuff> ###Response"
              | as the structure fed to the model, whereas GPT3.5 Turbo
             | wasn't trained to treat that particular structure as
             | important.
             | 
              | The nice thing is that GPT3.5 Turbo will just roll with the
              | prompt structure Dolly uses, but that only works in very
              | large LLMs; I'd imagine I wouldn't get away with it in
              | other 12B-parameter models.
             | 
              | But realistically this could look like taking the
              | "INSTRUCTION MEMORY EXAMPLE [COMPLETION]" schema
              | represented in a library and having each adapter transform
              | it into the "MEMORY EXAMPLE INSTRUCTION [COMPLETION]"
              | schema, or whatever is needed by the particular model.
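              | 
              | A sketch of that adapter idea: one generic prompt spec plus
              | per-model formatters (the Dolly format follows the comment
              | above; the rest is illustrative):
              | 
              |   def dolly_v2(instruction, memory="", example=""):
              |       return f"{memory}{example}###Instruction: {instruction} ###Response:"
              | 
              |   def chat_messages(instruction, memory="", example=""):
              |       return [
              |           {"role": "system", "content": memory},
              |           {"role": "user", "content": f"{example}\n{instruction}".strip()},
              |       ]
              | 
              |   ADAPTERS = {"dolly-v2": dolly_v2, "gpt-3.5-turbo": chat_messages}
              | 
              |   def build_prompt(model_name, **spec):
              |       return ADAPTERS[model_name](**spec)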
        
         | sudb wrote:
         | Another very similar approach to guardrails which manages to
         | avoid XML that I've been using with some success is langchain's
         | OutputFixingParser:
         | https://python.langchain.com/en/latest/modules/prompts/outpu...
        
       | Der_Einzige wrote:
       | Love to see further work on constrained decoding like this and
       | other systems introduced in the comments!
       | 
       | See my work and the paper about it. I've got a lot of y'all beat
       | on this (constrained decoding, not the templating and
       | structuring) by about a year:
       | 
       | https://github.com/hellisotherpeople/constrained-text-genera...
        
       ___________________________________________________________________
       (page generated 2023-05-02 23:00 UTC)