[HN Gopher] Llama: Add grammar-based sampling
___________________________________________________________________
Llama: Add grammar-based sampling
Author : davepeck
Score : 81 points
Date : 2023-07-21 21:17 UTC (1 hour ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| spion wrote:
| Specifically for multi-choice string enums (essentially
| dropdowns), I wonder if this would work better if the full
| (joint/product) probability given the logits is considered when
| picking the final choice, rather than using a greedy algorithm.
| This would favor the right choice, as opposed to e.g. whichever
| choice happens to begin with the most common start token - when a
| start token is shared among many items in the list.
|
| Of course, the probability needs to be renormalized once a subset
| of the logits goes to zero so it actually makes sense...
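|
| Roughly what I mean, as a sketch (next_logprobs() is a stand-in
| for whatever returns next-token log-probabilities from the
| model, not a real llama.cpp call):
|
|     import math
|
|     def pick_enum_choice(next_logprobs, prompt_tokens, choices):
|         # Score each full choice by its joint probability
|         # (sum of per-token log-probs) instead of committing
|         # greedily to the best first token.
|         best, best_score = None, -math.inf
|         for choice in choices:  # each choice: list of token ids
|             context = list(prompt_tokens)
|             score = 0.0
|             for tok in choice:
|                 # add log P(tok | context so far)
|                 score += next_logprobs(context)[tok]
|                 context.append(tok)
|             if score > best_score:
|                 best, best_score = choice, score
|         return best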
| meepmorp wrote:
| Does anyone know Japanese well enough to comment on the output
| from the Japanese example?
| vore wrote:
| It is vaguely Japanese, I guess, but pretty incoherent:
| 1. What is the purpose? 2. Remember the customer 3.
| About the customer [incomplete sentence?]
| svc0 wrote:
| I think it should be noted that this enforces grammatical
| constraints on the model's generated text, but it doesn't do
| anything to properly align the content. This would be useful if
| you needed to ensure a server delivered well-formatted JSON, but I
| suspect it won't solve a lot of alignment issues with current
| language generation. For example, current iterations of Llama and
| GPT often do not label markdown code blocks correctly. Using
| grammar-based sampling, you could enforce that it labels code
| blocks but you couldn't enforce correct labeling since this is
| context-dependent. You also couldn't invent a novel domain-
| specific language without aligning against that language and
| expect good output.
| version_five wrote:
| I'm interested in this and I'm going to try incorporating it into
| something I'm doing. That said, I feel like this could be one of
| those Bitter Lesson situations where it's not the most effective
| approach in anything but the very short term:
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| woah wrote:
| Not an expert at all, but I believe that OpenAI uses this in
| some of their GPT APIs which are meant for programmatic use.
| I've seen it theorized that offloading the rote grammar stuff
| to a simple process that is meant for it lets the LLM use its
| "brainpower" on the complicated stuff more effectively. No idea
| if this is true.
| TechBro8615 wrote:
| It makes sense to my uninformed intuition, which is that a
| strict grammar reduces the search space for the token
| generation and so the AI can eliminate possibilities that
| would otherwise be ambiguous.
| Der_Einzige wrote:
| It may be a stop-gap, but it's an important one, as it is not
| obvious that LLMs in the next few years will "organically"
| solve their issues with generating text with constraints.
| ilaksh wrote:
| Has anyone tested FreeWilly2 (the new Llama2 fine-tune released
| today by Stability AI) on code generation?
| 1024core wrote:
| Can someone ELI5 what's going on here? I'm reasonably familiar
| with LLMs, but I can't quite grok what Georgi is doing here and
| why it's so exciting for some.
| simonw wrote:
| See my comment here
| https://news.ycombinator.com/item?id=36820884
| modeless wrote:
| If you ask an LLM to generate JSON or another language that has
| a grammar, it will sometimes produce invalid syntax. This
| constrains the LLM so that it can only output valid syntax
| according to whatever grammar you supply.
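|
| For example, you supply a grammar file (in the PR's BNF-like
| format) along the lines of:
|
|     root ::= "yes" | "no"
|
| and sampling can then only ever produce "yes" or "no".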
|
| The way an LLM generates text is one token (short sequence of
| characters) at a time. First the giant neural net assigns a
| probability to every possible token (this is the hard part).
| Then a sampling procedure uses the probabilities to pick one of
| the tokens, and the process repeats.
|
| The sampling procedure is not a neural net and can be modified
| in many different ways. You might think that the sampling
| procedure should always simply pick the token with the highest
| probability (greedy sampling). You can do that, but it's
| usually better to pick at random weighted by the probabilities.
| This gives more diversity and is less likely to get stuck in
| loops. But this means that literally any token with nonzero
| probability might get picked, so you can see how this might
| lead to invalid JSON being generated. This change simply zeros
| out the probabilities of all the tokens that wouldn't be valid
| according to your grammar.
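|
| In pseudo-Python, one sampling step then looks something like
| this (a sketch, not the actual llama.cpp code; allowed_ids
| stands in for whatever token set the grammar permits in its
| current state):
|
|     import numpy as np
|
|     rng = np.random.default_rng()
|
|     def sample_token(logits, allowed_ids):
|         # softmax over the raw scores
|         probs = np.exp(logits - logits.max())
|         probs /= probs.sum()
|         # zero out everything the grammar disallows right now
|         mask = np.zeros_like(probs)
|         mask[allowed_ids] = 1.0
|         probs *= mask
|         # renormalize and sample from what's left
|         probs /= probs.sum()
|         return rng.choice(len(probs), p=probs)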
| astrange wrote:
| This is a sort of modern version of
| https://wiki.c2.com/?AlternateHardAndSoftLayers, one of the
| most useful software patterns.
| simonw wrote:
| Here's my understanding of how this works (please someone correct
| me if I'm getting this wrong).
|
| Language models emit tokens one at a time, starting with the
| prompt that you give them.
|
| If you have a conversation with an LLM, effectively you can think
| of that as you giving it a sequence of tokens, then it generates
| some, then you generate more, and so on.
|
| This grammar trick effectively takes advantage of this by giving
| you much more fine-grained control over the tokens. So you can
| do things like this:
|
|     Give me the address of the White House as JSON: {"street": "
|
| Then the LLM can return:
|
|     1600 Pennsylvania Ave NW"
|
| The moment you see that closing double quote, you take over again
| and inject:
|
|     ", "City": "
|
| It fills in:
|
|     Washington, DC"
|
| And so on.
|
| But because this is all based on a grammar, you can do way more
| with it than just JSON.
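|
| In code, the back-and-forth looks something like this
| (generate_until() here is a made-up helper that samples tokens
| until the stop string appears; it's not a real llama.cpp API):
|
|     def address_as_json(llm):
|         text = ('Give me the address of the White House '
|                 'as JSON: {"street": "')
|         # model fills in the street; we inject the next key
|         text += llm.generate_until(text, stop='"') + '", "City": "'
|         # model fills in the city; we close the object
|         text += llm.generate_until(text, stop='"') + '"}'
|         return text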
|
| I saw a brilliant suggestion relating to this on Twitter a while
| ago:
|
| > @OpenAI should add an API argument allowing passing up a
| deterministic context free grammar.
|
| > [...]
|
| > While I think DCFL is what you want here in the short term, the
| really best thing is passing up a small WASM binary that simply
| _is_ the sampler.
|
| > Allow a user to pass up a few KB of WASM binary and give it a
| few megabytes of RAM to run. Would enable next level LLM
| superpowers.
|
| https://twitter.com/grantslatton/status/1637692033115762688
| pshc wrote:
| I don't think this is correct; previously you could already
| control output by reading tokens one at a time from the LLM
| until you hit a stop character.
|
| My take from the grammar-based sampling PR is that you ask
| llama.cpp to constrain the next output token to a restricted set
| of possible tokens, using the grammar.
| simonw wrote:
| Right, which is the same idea - it's just that the code in
| llama.cpp is running your grammar as part of its token
| generation decisions as opposed to pausing and waiting for
| your other code to pick the next token.
|
| (I'm trying for a very high level explanation here.)
| jiggawatts wrote:
| Not just that: the LLM outputs not individual tokens, but a
| weighted recommendation. The most probable ("best") token has
| the highest weight, but there may be many alternatives
| including JSON symbols like quote characters.
|
| The "temperature" setting adjusts how likely it is that an
| output token is chosen that is _not_ the top-rated option. That
| prevents repetitive output.
|
| Forcing an LLM to obey a grammar is mostly about filtering the
| list before the token choice is made. There may still be a
| random element controlled by the temperature!
|
| A more advanced feature not commonly used is to also enable
| back-tracking if the AI gets stuck and can't produce a valid
| output.
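|
| As a sketch in Python (allowed_ids standing in for the grammar
| filter; not a real API):
|
|     import numpy as np
|
|     def sample_temp(logits, allowed_ids, temperature=0.8):
|         rng = np.random.default_rng()
|         # grammar filter first: disallowed tokens get -inf
|         masked = np.full_like(logits, -np.inf)
|         masked[allowed_ids] = logits[allowed_ids]
|         # temperature near 0 -> almost always the top token;
|         # higher values flatten the distribution
|         probs = np.exp((masked - masked.max()) / temperature)
|         probs /= probs.sum()
|         return rng.choice(len(probs), p=probs)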
| bavarianbob wrote:
| Could someone help me with context? I'm OOTL and don't understand
| what is going on here.
| painted-now wrote:
| Can anyone recommend some paper or overview on how "sampling" /
| "decoding" is done in the e2e neural network age? I know how
| decoding was done for machine translation and speech recognition
| back in the HMM times (i.e.
| https://en.wikipedia.org/wiki/Viterbi_algorithm and
| https://en.wikipedia.org/wiki/Beam_search). These days I get the
| impression people just do "greedy" - but I don't really know. Any
| recommendations for info on that topic?
|
| Edit: Forgot Viterbi
| janalsncm wrote:
| Just reading through the GPT4 documentation, it doesn't seem
| like there's a ton of difference from what you've mentioned.
|
| https://platform.openai.com/docs/api-reference/completions/c...
|
| Of course we now know that GPT4 is a Mixture of Experts, so
| under the hood they're parallelizing computation. They also
| include a way to modify the logits with presence/frequency
| penalty terms.
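|
| Those penalties amount to something like this (a sketch of the
| formula described in OpenAI's docs):
|
|     from collections import Counter
|
|     def apply_penalties(logits, generated,
|                         presence=0.0, frequency=0.0):
|         # Already-generated tokens get their logits reduced: a
|         # flat amount for appearing at all, plus an amount
|         # proportional to their count.
|         for tok, n in Counter(generated).items():
|             logits[tok] -= presence + frequency * n
|         return logits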
| spion wrote:
| It's greedy and random :) Instead of a paper, I would recommend
| the sampling code of most LLM implementations (rwkv.cpp has a
| relatively clean implementation in Python:
| https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/samp...)
| painted-now wrote:
| I guess I need to sit down and study this stuff in more
| detail, but do I understand correctly that the code you
| shared makes the decisions for each position independently? I
| am just astonished that this produces any coherent output.
| Also it is not clear to me how the length of the output
| sequence is determined.
| pizza wrote:
| Once the stop token is likeliest
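|
| Roughly, the outer loop looks like this (model() and sample()
| are stand-ins for the network forward pass and the sampling
| procedure):
|
|     def generate(model, prompt_tokens, eos_id, max_tokens=512):
|         # Each sampled token is fed back in, so the choice at
|         # position N is conditioned on positions 0..N-1:
|         # sequential, not independent.
|         tokens = list(prompt_tokens)
|         for _ in range(max_tokens):
|             logits = model(tokens)    # scores for next token
|             next_id = sample(logits)  # temperature / top-p etc.
|             if next_id == eos_id:
|                 break  # end-of-sequence token ends the output
|             tokens.append(next_id)
|         return tokens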
| moffkalast wrote:
| Ah finally, this was discussed a lot and is well overdue. Remains
| to be seen how well the models will adapt to this new constraint,
| though the demo seems promising.
| ec109685 wrote:
| Isn't this approach forcing the LLM to adapt? E.g. it is
| throwing tokens away that don't match the grammar.
| moffkalast wrote:
| Well the grammar will be correct as enforced by the sampler,
| but the content it's filled with could be anything at all.
| Sort of like how, when you change the prompt template, the output
| can be garbage for some models. I haven't tried it out yet
| myself, but apparently even OpenAI's implementation of this
| exact principle on their API still has function hallucination
| issues even with GPT 4.
| Der_Einzige wrote:
| I am in love with this. I tried my hand at building a Constrained
| Text Generation Studio
| (https://github.com/Hellisotherpeople/Constrained-Text-Genera...)
| and got published at COLING 2022 for my paper on it
| (https://paperswithcode.com/paper/most-language-models-can-be...),
| but I always knew that something like this, or the related idea
| enumerated in this paper: https://arxiv.org/abs/2306.03081,
| was the way to go.
|
| I will have to think about how I can build grammars that force
| things like syllable counts or syntactic rules. Current LLMs do
| very poorly on those kinds of tasks due to the tokenization
| schemes...
___________________________________________________________________
(page generated 2023-07-21 23:00 UTC)