[HN Gopher] Jsonformer: Generate structured output from LLMs
___________________________________________________________________
Jsonformer: Generate structured output from LLMs
Author : yunyu
Score : 160 points
Date : 2023-05-02 16:29 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| rickcarlino wrote:
| Has anyone seen a tool like this that uses Node rather than
| Python? I have this exact problem in a GPT-based web application
| I am building and have had to resort to some "creative"
| solutions. At the very least I am glad to see people are tackling
| this problem.
| msikora wrote:
| Same here. Considering switching my project (or at least part
| of it) to Python. For anything to do with LLMs or ML in general
| Python has by far the best libraries. JS is probably second, at
| least for LLM stuff, but it is a distant second place...
| SparkyMcUnicorn wrote:
| I've had good luck with Langchain's output parsers[0], but in
| addition to the format instructions I also append something
| like "Do not provide any explanations, just the JSON output.",
| which helps eliminate content being generated outside of the
| JSON block.
|
| [0]
| https://js.langchain.com/docs/modules/prompts/output_parsers...
| benob wrote:
| How about going one step further and constraining transformer
| output with a context-free grammar? That way you can generate more
| conformant code such as Python or C.
| Der_Einzige wrote:
| This may be possible using constrained beam search, which
| huggingface has quietly supported for a long time.
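|
| A rough sketch of what that looks like with transformers (the
| model name is just a stand-in and the forced phrase is purely
| illustrative):
|
|     from transformers import (AutoModelForCausalLM,
|                               AutoTokenizer, PhrasalConstraint)
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|     # force the phrase '"sunny":' to appear in the output
|     constraint = PhrasalConstraint(
|         tok('"sunny":', add_special_tokens=False).input_ids)
|
|     inputs = tok("Describe the weather as JSON:",
|                  return_tensors="pt")
|     out = model.generate(**inputs,
|                          constraints=[constraint],
|                          num_beams=4,  # beam search is required
|                          max_new_tokens=40)
|     print(tok.decode(out[0], skip_special_tokens=True))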
| gamegoblin wrote:
| Wouldn't even need beam search if you restrict it to
| deterministic context free grammars, which would satisfy >
| 95% of these "generate some JSON schema" use-cases. For DCFGs
| you can just zero-out the probability for any token that is
| invalid in the context, no lookahead or search needed.
| Wouldn't work for fully general (non-deterministic) context-
| free things like most programming languages, though.
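|
| In huggingface terms that's just a logits processor; a minimal
| sketch, where allowed_ids_fn is a hypothetical hook into the
| grammar's parser state:
|
|     import torch
|     from transformers import LogitsProcessor
|
|     class GrammarMask(LogitsProcessor):
|         def __init__(self, allowed_ids_fn):
|             self.allowed_ids_fn = allowed_ids_fn
|
|         def __call__(self, input_ids, scores):
|             # -inf out every token the grammar forbids in the
|             # current state; no lookahead or search needed
|             mask = torch.full_like(scores, float("-inf"))
|             for i, seq in enumerate(input_ids):
|                 allowed = self.allowed_ids_fn(seq.tolist())
|                 mask[i, allowed] = 0.0
|             return scores + mask
|
| Pass it to generate() wrapped in a LogitsProcessorList.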
| sundarurfriend wrote:
| > Bulletproof JSON generation: Jsonformer ensures that the
| generated JSON is always syntactically correct and conforms to
| the specified schema.
|
| This is an important definition to take note of: "bulletproof"
| doesn't mean that you'll get good or correct data. It only means
| that it'll be valid JSON and in a particular schema that you
| specify (because the LLM isn't building the JSON in the first
| place, the library is).
|
| It's an interesting idea. But it's not clear if they've validated
| the heuristics they use, to see how well it performs in terms of
| accuracy against, say, some kind of BeautifulSoup-like attempt to
| make sense of the JSON-ish that the LLM produces and correct that
| to be valid JSON, or any other approach to the problem.
| dragonwriter wrote:
| I wonder if LLMs are at the point where reprompting the LLM
| with the same kind of error message a user-friendly JSON
| processing tool would show a human would usually be a good way
| to fix errors.
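|
| Roughly what I have in mind, with `llm` standing in for any
| prompt-to-text call (hypothetical):
|
|     import json
|
|     def generate_json(prompt, llm, max_retries=3):
|         text = llm(prompt)
|         for _ in range(max_retries):
|             try:
|                 return json.loads(text)
|             except json.JSONDecodeError as e:
|                 # feed back the same kind of error message a
|                 # friendly JSON tool would show a human
|                 text = llm(
|                     f"{prompt}\n\nYour previous answer:\n{text}"
|                     f"\n\nIt failed to parse: {e.msg} at line "
|                     f"{e.lineno}, column {e.colno}. Reply with "
|                     f"corrected JSON only.")
|         raise ValueError("model never returned valid JSON")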
| execveat wrote:
| Yeah, but that could require multiple queries, which isn't
| very efficient. Training a model just to fix JSON would be
| better.
| newhouseb wrote:
| Sometimes, but it very much depends on the context (no pun
| intended). If it's a pure syntax issue, OpenAI models will
| almost certainly make the right correction. If it's more
| abstract, like the LLM has hallucinated a property that is
| invalid as part of some larger schema, you can quickly descend
| into the LLM gaslighting you, saying that it has fixed things
| when it hasn't.
| apalmer wrote:
| Trying to understand why this is necessary? LLMs cannot reliably
| generate valid Jason?
| dwallin wrote:
| Two ways in which it is useful over existing techniques:
|
| - It is guaranteed to match your schema
|
| - It is much lighter weight
| Der_Einzige wrote:
| Also, it costs tokens to ask a model to respect constraints,
| and it may choose not to do it.
|
| You can force it, for free, by banning the vocabulary that
| violates a constraint.
| tysam_and wrote:
| Yes, but Mike on the other hand....
| kcorbitt wrote:
| I've thought about building this for a while, glad it's out
| there!
|
| Not only does this guarantee your output is JSON, it lowers your
| generation cost and latency by filling in many of the repetitive
| schema tokens without passing them through the LLM.
|
| For the very common case of "extracting multiple structured
| fields from a piece of unstructured text," I believe there's an
| even stronger optimization possible that would further decrease
| costs, latency and potentially even improve accuracy.
|
| Assuming the fields you want to extract are independent (and they
| often are), you don't _need_ to generate them all in one go
| autoregressively. Eg. instead of running the following pseudo-
| prompt:
|
|     Input: 'It's sunny and cold today'
|     Output schema: {"sunny": boolean, "temperature": string}
|
| You could instead run the following two:
|
|     Input: 'It's sunny and cold today'
|     Output schema: {"sunny": boolean}
|
|     Input: 'It's sunny and cold today'
|     Output schema: {"temperature": string}
|
| We don't do that today because when done naively it's very
| inefficient -- you'd be tokenizing, passing to the GPU, and
| computing the KV cache of the shared part of the prompt twice.
| But a library with the right abstraction could run the second two
| queries in a batch in parallel and reuse the same tokenization
| and KV cache for both of them. It would actually be _more_
| efficient than generating both fields in one go, since when you
| factor out the shared prefixes both the generated text and its
| context are shorter!
|
| I mentioned above that this could also improve accuracy. Of
| course it doesn't do that by default (except that by excluding
| all the irrelevant fields it makes self-attention's job easier).
| But what it _does_ do is give you an independent prompt for each
| field you're interested in. And so for particularly tricky
| fields you're trying to extract, you have the flexibility to eg.
| add several examples to make the generation N-shot.
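|
| A minimal sketch of the batched version (naive: it recomputes
| the shared prefix instead of reusing its KV cache, and "gpt2"
| is just a stand-in model):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|     tok.pad_token = tok.eos_token
|     tok.padding_side = "left"   # needed for batched decoding
|
|     prefix = "Input: 'It's sunny and cold today'\n"
|     fields = {"sunny": "boolean", "temperature": "string"}
|     prompts = [prefix +
|                f'Output schema: {{"{k}": {t}}}\nOutput: {{"{k}":'
|                for k, t in fields.items()]
|
|     # one parallel batch instead of one long sequential pass
|     batch = tok(prompts, return_tensors="pt", padding=True)
|     out = model.generate(**batch, max_new_tokens=8,
|                          pad_token_id=tok.eos_token_id)
|     new = out[:, batch["input_ids"].shape[1]:]
|     for (k, _), ids in zip(fields.items(), new):
|         print(k, tok.decode(ids, skip_special_tokens=True))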
| travisjungroth wrote:
| Maybe this will make CUE popular. It's similar to JSON, but the
| ideas of schema and values are put together through unification,
| or you could say narrowing constraints. CUE would handle taking
| all of those values individually, then combining them into
| something concrete, incomplete, or erroring out.
| execveat wrote:
| You'd need to put the input first for this approach to work,
| but in my testing models work better if you lead with a
| question.
| kcorbitt wrote:
| Hmm. I admit that I haven't thought about this deeply, but
| I'm not sure that's true? It seems to me that you could
| extend the KV cache either backwards or forwards equally
| easily.
| Siira wrote:
| You can't. The later values depend on the earlier ones, so
| changing the early tokens invalidates your whole cache.
|
| This is also probably why leading with a question works
| better in the first place. All later processing conditions
| on the question in this way.
|
| BTW, in my very limited testing, GPT4 doesn't care about
| the order.
| tysam_and wrote:
| I could be reading this wrong, but my assumption is/has been
| that the prompt goes up to the end of the JSON field name,
| and the LLM is only filling in the actual value, not the key.
| I could be wrong on this one, however.
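|
| i.e. something like this (not the library's actual code, and
| `gen_value` is a hypothetical call that asks the model for a
| single value given everything generated so far):
|
|     import json
|
|     def fill_values(schema, gen_value):
|         obj, parts = {}, []
|         for key, typ in schema.items():
|             # the prompt already contains the JSON up to the
|             # key; the model only fills in the value
|             partial = ("{" + ", ".join(parts) +
|                        (", " if parts else "") + f'"{key}": ')
|             value = gen_value(partial, key, typ)
|             obj[key] = json.loads(value)
|             parts.append(f'"{key}": {value}')
|         return obj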
| bckr wrote:
| Can you briefly describe how you got to the point of having
| this kind of intuition about language models?
| kcorbitt wrote:
| Kind of a meta-answer, but my personal learning style is
| "think of something cool to build, then figure out what I
| need to know to build it." It just so happens that a lot of
| the interesting/cool stuff going on right now builds on top
| of LLMs, so my projects have naturally gravitated that way.
| But I've never sat down to take an AI course or read the
| "Attention Is All You Need" paper or anything. I absorb much
| more when I learn something because I need it for something
| I'm working on.
| catchnear4321 wrote:
| it is a correct answer.
| tysam_and wrote:
| I can't speak for OP, but something that I think helps is if
| you think about the generation process as a jumping-off point:
| one can control its placement, but not really much of what is
| generated afterwards.
|
| Adding a scheme like this reduces the area of potential off-
| roading that the LLM can do to a much smaller zone.
| Additionally, it breaks up the chain of dependencies between
| the two example outputs, because now we do not need to depend
| upon past inputs to correctly output this scheme.
|
| Since the information for JSON semantic structure is no
| longer required to be driven by the LLM (it still has to
| understand it to be able to generate things with a modicum of
| sense, IIRC), we can look at our dependency graph for outputs.
| _This changes because now the fields really and truly are
| independent (if they are truly informationally independent)_.
|
| So now some kind of conjoined information requirement of
|
|     ( autoregressive output ) <- (( field A ) <- ( field B ))
|
| becomes
|
|     ( autoregressive output ) <- (( field A ) && ( field B ))
|
| which can then be factored out into separate calls instead of a
| sequential one, yielding a batched call of
|
|     (( autoregressive output A ) <- ( field A )) &&
|     (( autoregressive output B ) <- ( field B )).
|
| From there it is just implementation. I likely would not have
| thought about the OP's way of handling things for a good
| while, though maybe I would have stumbled into it had I
| enough reason to think about structured/templated kinds of
| generation, which I do believe that I do now! <3 :) It really
| breaks a lot of assumptions that are easy to quietly make and
| I had not thought appropriately about the consequences of
| reframing things in this way, to be honest.
|
| As for "how" to think about this, if I were to give my take,
| it would be to always turn whatever problem is in front of you
| into a puzzle where you simplify it further each time.
| Optimizing for less computation, time, code, or even just
| what all of those are a kind of proxy for: less information
| to sufficiently solve a problem. We can see that this problem
| is reduced in complexity appropriately because we remove a
| redundancy that does not need to be there at all.
|
| One way to look at this is in the relationships between parts
| of an idea. If you're able to understand, even vaguely, the
| concepts behind some other concept and how they interact, and
| maybe even have a 'standard toolkit' of relating to them, you
| can start finding/transferring/applying other skills to these
| parts of a concept. I don't think there's a guaranteed-
| efficient way to maybe reduce a concept down to its parts, or
| down to a more efficient representation without already,
| well, knowing that representation. It's an NP-hard problem to
| me personally, and is the reason why research and other
| academic pursuits can take a while. It is a good skill to
| learn I suppose and I certainly enjoy trying to use it,
| personally.
|
| To tie this back to your question about language models --
| yes, some things have to do with the language model, but
| oftentimes it's actually just the raw mathematical components
| underlying a model. If you look for that (please please
| please please please!!!!), then you don't necessarily _have_
| to concern yourself with the implementation details (beyond
| runtime limits, etc), as long as the math still applies you
| should be able to reason really quite well about what else is
| happening/could happen with model types like these.
|
| In particular, LLMs being an autoregressive model where each
| output depends upon its inputs lets us set up a dependency
| graph. Then based upon some prior assumptions, we can maybe
| make some substitutions/changes that allow us to fragment the
| dependency graph and move it around as we wish. This is not
| just applicable to LLMs, however; dependency graphs are
| useful in a wide number of areas.
|
| So one other thing that we're not talking about here is that
| we're optimizing for an objective we want (clean JSON) by
| explicitly...well, injecting that objective instead of living
| on just hopes and dreams, y'aknow. This is a pretty
| straightforward way of solving the problem by putting the
| answer in the question, though poor input content still can
| be a problem.
|
| Stated a different way, we're collapsing the entropy of what
| the network can introduce (which should be JSON, but remember
| [!!!!!!!], neural networks are noisy estimators, and JSON
| errors are mathematically guaranteed (even if rare), which
| means any pipeline depending upon that output, such as code,
| can and will fail, and is brittle to all sorts of other kinds of
| complicated parsing errors. This is because to
| catch/detect/enumerate/correct these errors, we need to have
| all of the information needed to implement a JSON structure
| itself. So basically we'd be using the same exact
| information, just enforcing it in a horrendously inefficient
| manner, which is how people have been doing it until the
| present, which is okay as we humans are certainly not NP-
| optimal machines IMO. In any case, we're still in the
| parentheses, and the point was that any kind of variance can
| be a problem here beyond some extremely tiny limit, and
| that's not what LLMs are made to do. So at some point it's
| guaranteed to break, and at high volume it's basically
| guaranteed to break in a way that's either unusable or
| requires so much effort to fix that you might as well have
| embedded a JSON prior into your network generation process
| because it would have required the same amount of information
| as external validation would, albeit with less effort
| (!!!!)), which is perfectly fine in our case if we're
| exclusively generating JSON as it gives us what we want. Most
| methods like this thankfully should have a low level of
| invasiveness to the model as well, freeing us up to use
| either the same or a similar model for multiple tasks.
|
| This can create a bit of an ideological illusion as we
| technically are destroying information by collapsing the
| distributions of sentences/strings of tokens/etc that we are
| generating, and may lend itself to an "oh, we can add whatever
| functionality we want!" kind of belief about this kind of
| modeling. It's important what we're adding and taking away.
| Also important is part of how/why/what is so powerful about
| training these models on next token prediction on large text
| corpora. We can trim them down to some smaller subproblem
| much much more easily than we can expand them to cover a
| larger subset. Which is pretty darn cool!
|
| I know this sorta flew around a lot of places and touched on
| a lot of things, probably not as cogently as I'd want to if I
| had more time to review and revise it. Hope it was/is helpful
| for you and feel free to let me know if you have any
| questions. It's a very cool topic on the whole to me, tbh,
| and there's a number of interesting conversations that can
| branch off from this one. Honestly this whole general area is
| where I see the real value in LLM development and research.
| It's practical and it's helpful! :D :) <3 :)
|
| Source: a number of years of experience across a wide variety
| of ML models, though I'm sure I made an embarrassing blunder
| or two in this post. ;P
| kolinko wrote:
| Not op, but I can share my approach - I went line by line through
| Recmo's Cria: https://github.com/recmo/cria - which is an
| implementation of Llama in Numpy - so very low level. Took me
| I think 3-4 days x 10 hours + 1-2 days of reading about
| Transformers to understand what's going on - but from that
| you can see how models generate text and have a deep
| understanding of what's going on.
| visarga wrote:
| I wanted to see the opposite - parsing the JSON and YAML that
| LLMs generate. GPT-4 doesn't mess up the format much, but
| lesser models might, and then you can't simply parse it.
| ImaCake wrote:
| It sorta feels like LLMs or some kind of NN should be useful
| (with training) for parsing malformed JSON. I suspect it's a
| hard problem but honestly it would be such a massive help for
| those of us dealing with data at work!
| andrewcamel wrote:
| Seen a lot of things trying to do this by pressure testing the
| outputs, but all feel like anti-patterns. This is the first that
| seems like the "right" way to do it. Better to manage how the
| model is generating vs creating one more potentially faulty
| "glue" layer.
| lt wrote:
| Can you elaborate about what you mean by pressure testing?
| Haven't heard this term yet.
| andrewcamel wrote:
| Maybe not the right term... Just that a lot of other libs act
| like guardrails, i.e. let the model generate what it does (in
| full form text / GPT output), and then try to parse out what
| you want, erroring if the output doesn't conform to the
| expected format.
| As opposed to basically only allowing the model to generate
| into the already-built JSON form fields. Understandable why
| this guardrails/parsing approach is so popular though...
| can't do what this library is doing with OpenAI API. Need to
| be able to manipulate the token generation; otherwise you're
| forced to take full text output and try to parse it.
| tysam_and wrote:
| Mathematically it requires less information to impose a certain
| prior on data in the process of generation than it does to read
| the data, do error detection and correction according to a
| prior, and then return it, if I understand correctly.
|
| Something always felt incredibly icky to me about any kind of
| ad-hoc 'fixer' scripts that were part of a pipeline that was
| fully controlled by a user.
| phh wrote:
| I hope that this is new to no-one generating JSON using an LLM,
| because it was the first thing I reached for when I implemented
| that kind of stuff. That being said, it's nice to have it as a
| ready-to-go library.
| ianbutler wrote:
| Nice, this codifies something similar to what I've been doing
| in my prompts! Will be using this instead.
|
| What I currently have been doing:
|
| The JSON template for your response is provided below. The parts
| to fill out are capitalized. Please do not modify the template.
| Please fill in the template with one of the above options for
| your response.
|
|     <result>
|     {
|       "rating": "N. RATING",
|       "reason": "REASON"
|     }
|     </result>
| layoric wrote:
| I might be reading the code wrong but it looks like it crawls the
| schema making a generation per primitive type. While that's a
| clever way to ensure valid JSON, I don't know if I'd go as far as
| to describe it as efficient.
|
| That said, if the model is unable to generate JSON due to its
| training/fine-tuning, this is indeed a clever solution!
| pklee wrote:
| This is pretty cool. I tried it with Dolly and then with
| T5-base; neither gave me a result. It broke for me. Has anyone
| tried it?
| Jayakumark wrote:
| It's great that this does not use OpenAI and runs locally.
| tough wrote:
| I know of a similar one called GPTyped; just posted it on HN:
| https://news.ycombinator.com/item?id=35793056#35793057
| wy35 wrote:
| Very interesting. I've only been using OpenAI APIs so this logit
| stuff is new to me.
| Der_Einzige wrote:
| I've complained bitterly and openly about how annoying it is
| that OpenAI locks down access to the full probability
| distribution. Glad to see that others are running into this
| stupid limitation and are doing work related to it.
| esafak wrote:
| It's a testament to the democratization of ML that
| practitioners today can get by without knowing what a logit is.
| tysam_and wrote:
| And I am personally glad for that, for one! This means it's
| accessible to more people without requiring specialized
| knowledge, and while, yes, I think that always triggers an
| internal reaction from most of us when it comes to thinking
| about field dilution, it's almost a necessary tradeoff (like,
| say, the uncertainty principle) when expanding the field out
| to more people.
|
| So, hurray! We've made it more accessible. And hopefully in
| years to come, even very much more so! <3 :)
| koboll wrote:
| I'm flabbergasted that OpenAI does not yet offer an API that
| reliably returns JSON based on some schema you feed it. It's
| sort of possible to force it to do this but not really to a
| production-ready degree.
| newhouseb wrote:
| Oh nice! I built a similar system a few weeks ago:
| https://github.com/newhouseb/clownfish
|
| I think the main differentiating factor here is that this is
| better if you have a simpler JSON schema without enums or oneOf
| constraints. If you do have these constraints, i.e. let's say you
| wanted an array of different types that represented items on a
| menu { kind: pizza, toppings: [pepperoni] } or { kind: ice_cream,
| flavor: vanilla | strawberry } then you would need something more
| sophisticated like clownfish that can ask the LLM to pick
| specific properties (and an ability to do some backtracking so
| you can do proper beam search).
|
| For completeness, another common approach can be found here:
| https://github.com/ShreyaR/guardrails which essentially boils
| down to "provide the schema in the prompt and ask the LLM to
| correct things if it fails to get the schema right the first
| time."
| gamegoblin wrote:
| I hate that gpt-3.5-turbo is so cheap that using systems like
| guardrails is a sane thing to do. I can almost always prompt
| davinci-003 without guardrails in a way to get my exact schema
| 1-shot, whereas guardrails + 3.5-turbo will often consume 2-4x
| more tokens, but that still makes it significantly cheaper.
| brigadier132 wrote:
| The problem people are having is hitting the rate limits for
| ChatGPT.
| joshuanapoli wrote:
| Thank you for the really clear and complete description of
| "ControLogit"s and your approach in clownfish!
| killthebuddha wrote:
| One thing that I really like about the approach you took with
| clownfish is that it doesn't constrain or modify the structure
| of the prompt.
|
| One of the primary difficulties with writing LLM applications
| is that prompts are basically not composable, and any LLM
| library that modifies your prompt is going to be a nightmare to
| work with.
| killthebuddha wrote:
| Follow-up thought I just had: It seems that prompt structure
| standards are going to have to emerge if any of these tools
| have a shot at interoperability. I don't have hard data, but
| IME if a prompt is structured
|
| MEMORY EXAMPLE INSTRUCTION [COMPLETION]
|
| it will basically not work to wrap it in a prompt that's
| structured
|
| INSTRUCTION MEMORY EXAMPLE [COMPLETION]
| ianbutler wrote:
| Interoperability can also be achieved with small adapters
| written for the prompting style of the particular model
| being interfaced with, I'd be surprised if like LangChain
| or AutoGPT don't already do something like this in their
| systems.
|
| I'm currently building something that leverages an ensemble
| of different LLMs depending on the difficulty of a task and
| ran into this issue.
|
| Dolly V2 takes "###Instruction: <your stuff> ###Response"
| as the structure fed to the model, whereas GPT-3.5 Turbo
| wasn't trained to treat that particular structure as
| important.
|
| The nice thing is that GPT3.5 Turbo will just roll with the
| prompt structure Dolly uses, but that only works in very
| large LLMs; I'd imagine I wouldn't get away with it in
| other 12B-parameter models.
|
| But realistically this could look like taking the
| "INSTRUCTION MEMORY EXAMPLE [COMPLETION]" schema
| represented in a library and each adapter would transform
| it into
|
| "MEMORY EXAMPLE INSTRUCTION [COMPLETION]" schema or
| whatever is needed by the different model.
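|
| In code an adapter can be as small as a template per model
| (the model names and templates here are illustrative):
|
|     PROMPT_ADAPTERS = {
|         "dolly-v2": ("### Instruction:\n{instruction}\n\n"
|                      "{memory}\n{example}\n### Response:\n"),
|         "gpt-3.5-turbo": "{memory}\n{example}\n{instruction}\n",
|     }
|
|     def build_prompt(model, memory, example, instruction):
|         return PROMPT_ADAPTERS[model].format(
|             memory=memory, example=example,
|             instruction=instruction)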
| sudb wrote:
| Another approach, very similar to guardrails but without the
| XML, that I've been using with some success is langchain's
| OutputFixingParser:
| https://python.langchain.com/en/latest/modules/prompts/outpu...
| Der_Einzige wrote:
| Love to see further work on constrained decoding like this and
| other systems introduced in the comments!
|
| See my work and the paper about it. I've got a lot of y'all beat
| on this (constrained decoding, not the templating and
| structuring) by about a year:
|
| https://github.com/hellisotherpeople/constrained-text-genera...
___________________________________________________________________
(page generated 2023-05-02 23:00 UTC)