[HN Gopher] Llama: Add grammar-based sampling
       ___________________________________________________________________
        
       Llama: Add grammar-based sampling
        
       Author : davepeck
       Score  : 81 points
       Date   : 2023-07-21 21:17 UTC (1 hour ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | spion wrote:
       | Specifically for multi-choice string enums (essentially
       | dropdowns), I wonder if this would work better if the full
       | (joint/product) probability given the logits is considered when
       | picking the final choice, rather than using a greedy algorithm.
       | This would favor the right choice, as opposed to, e.g., one of
       | the choices that merely contains the most common start token -
       | when a start token is shared among many items in the list.
       | 
       | Of course, the probabilities need to be re-normalized once a
       | subset of the logits is zeroed out so that this actually makes
       | sense...
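       | 
       | Something like this sketch, with hypothetical tokenize and
       | token_logprob helpers standing in for the tokenizer and the
       | model's per-token conditional log-probabilities:
       | 
       |       import math
       | 
       |       def pick_enum_value(choices, tokenize, token_logprob):
       |           """Score each allowed string by its full (joint)
       |           log-probability instead of committing greedily to a
       |           shared first token, then return the best one."""
       |           best, best_score = None, -math.inf
       |           for choice in choices:
       |               toks = tokenize(choice)
       |               score = sum(token_logprob(toks[:i], t)
       |                           for i, t in enumerate(toks))
       |               if score > best_score:
       |                   best, best_score = choice, score
       |           return best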
        
       | meepmorp wrote:
       | Does anyone know Japanese well enough to comment on the output
       | from the Japanese example?
        
         | vore wrote:
         | It is vaguely Japanese, I guess, but pretty incoherent:
         | 
         |       1. What is the purpose?
         |       2. Remember the customer
         |       3. About the customer [incomplete sentence?]
        
       | svc0 wrote:
       | I think it should be noted that this enforces grammatical
       | constraints on the model's generated text, but it doesn't do
       | anything to properly align the content. This would be useful if
       | you needed to ensure a server delivered well-formatted JSON, but
       | I suspect it won't solve a lot of alignment issues with current
       | language generation. For example, current iterations of Llama and
       | GPT often do not label markdown code blocks correctly. Using
       | grammar-based sampling, you could enforce that it labels code
       | blocks but you couldn't enforce correct labeling since this is
       | context-dependent. You also couldn't invent a novel domain-
       | specific language and expect good output without first aligning
       | the model to that language.
        
       | version_five wrote:
       | I'm interested in this and I'm going to try incorporating it into
       | something I'm doing. That said, I feel like this could be one of
       | those Bitter Lesson situations where it's not the most effective
       | approach in anything but the very short term:
       | http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
         | woah wrote:
         | Not an expert at all, but I believe that OpenAI uses this in
         | some of their GPT APIs which are meant for programmatic use.
         | I've seen it theorized that offloading the rote grammar stuff
         | to a simple process that is meant for it lets the LLM use its
         | "brainpower" on the complicated stuff more effectively. No idea
         | if this is true.
        
           | TechBro8615 wrote:
           | It makes sense to my uninformed intuition, which is that a
           | strict grammar reduces the search space for the token
           | generation and so the AI can eliminate possibilities that
           | would otherwise be ambiguous.
        
         | Der_Einzige wrote:
         | It may be a stop-gap, but it's an important one, as it is not
         | obvious that LLMs in the next few years will "organically"
         | solve their issues with generating text under constraints.
        
       | ilaksh wrote:
       | Has anyone tested FreeWilly2 (the new Llama2 fine-tune released
       | today by Stable Foundation) on code generation?
        
       | 1024core wrote:
       | Can someone ELI5 what's going on here? I'm reasonably familiar
       | with LLMs, but I can't quite grok what Georgi is doing here and
       | why it's so exciting for some.
        
         | simonw wrote:
         | See my comment here
         | https://news.ycombinator.com/item?id=36820884
        
         | modeless wrote:
         | If you ask an LLM to generate JSON or another language that has
         | a grammar, it will sometimes produce invalid syntax. This
         | constrains the LLM so that it can only output valid syntax
         | according to whatever grammar you supply.
         | 
         | The way an LLM generates text is one token (short sequence of
         | characters) at a time. First the giant neural net assigns a
         | probability to every possible token (this is the hard part).
         | Then a sampling procedure uses the probabilities to pick one of
         | the tokens, and the process repeats.
         | 
         | The sampling procedure is not a neural net and can be modified
         | in many different ways. You might think that the sampling
         | procedure should always simply pick the token with the highest
         | probability (greedy sampling). You can do that, but it's
         | usually better to pick at random weighted by the probabilities.
         | This gives more diversity and is less likely to get stuck in
         | loops. But this means that literally any token with nonzero
         | probability might get picked, so you can see how this might
         | lead to invalid JSON being generated. This change simply zeros
         | out the probabilities of all the tokens that wouldn't be valid
         | according to your grammar.
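          | 
          | A minimal sketch of that last step in Python, where allowed()
          | is a hypothetical stand-in for the real grammar / parse-state
          | check:
          | 
          |       import numpy as np
          | 
          |       def sample_with_grammar(logits, vocab, allowed, temp=0.8):
          |           """Pick the next token, but only from the tokens the
          |           grammar still allows (assumes at least one is legal)."""
          |           p = np.exp((np.asarray(logits) - np.max(logits)) / temp)
          |           mask = np.array([allowed(t) for t in vocab], dtype=float)
          |           p = p * mask        # zero out grammar-invalid tokens
          |           p = p / p.sum()     # renormalise over what is left
          |           return int(np.random.choice(len(vocab), p=p))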
        
           | astrange wrote:
           | This is a sort of modern version of
           | https://wiki.c2.com/?AlternateHardAndSoftLayers, one of the
           | most useful software patterns.
        
       | simonw wrote:
       | Here's my understanding of how this works (please someone correct
       | me if I'm getting this wrong).
       | 
       | Language models emit tokens one at a time, starting with the
       | prompt that you give them.
       | 
       | If you have a conversation with an LLM, effectively you can think
       | of that as you giving it a sequence of tokens, then it generates
       | some, then you generate more and so-on.
       | 
       | This grammar trick effectively takes advantage of this by giving
       | you much more finely grained control over the tokens. So you can
       | do things like this:
       | 
       |       Give me the address of the White House as JSON:
       | 
       |       {"street": "
       | 
       | Then the LLM can return:
       | 
       |       1600 Pennsylvania Ave NW"
       | 
       | The moment you see that closing double quote, you take over again
       | and inject:
       | 
       |       ",
       |       "City": "
       | 
       | It fills in:
       | 
       |       Washington, DC"
       | 
       | And so on.
       | 
       | But because this is all based on a grammar, you can do way more
       | with it than just JSON.
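       | 
       | As I understand it, the grammars themselves are written in a
       | simple BNF-style notation. A toy example (syntax from memory, so
       | the details may differ) that forces an answer of the form
       | "city: Washington" looks something like this:
       | 
       |       root ::= "city: " word
       |       word ::= [A-Z] [a-z]*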
       | 
       | I saw a brilliant suggestion relating to this on Twitter a while
       | ago:
       | 
       | > @OpenAI should add an API argument allowing passing up a
       | deterministic context free grammar.
       | 
       | > [...]
       | 
       | > While I think DCFL is what you want here in the short term, the
       | really best thing is passing up a small WASM binary that simply
       | _is_ the sampler.
       | 
       | > Allow a user to pass up a few KB of WASM binary and give it a
       | few megabytes of RAM to run. Would enable next level LLM
       | superpowers.
       | 
       | https://twitter.com/grantslatton/status/1637692033115762688
        
         | pshc wrote:
         | I don't think this is correct; previously you could already
         | control output by reading tokens one at a time from the LLM
         | until you hit a stop character.
         | 
         | My take from the grammar-based sampling PR is that you ask
         | llama.cpp to constrain the next output token to a restricted
         | set of possible tokens, using the grammar.
        
           | simonw wrote:
           | Right, which is the same idea - it's just that the code in
           | llama.cpp is running your grammar as part of its token
           | generation decisions as opposed to pausing and waiting for
           | your other code to pick the next token.
           | 
           | (I'm trying for a very high level explanation here.)
        
         | jiggawatts wrote:
         | Not just that: the LLM doesn't output individual tokens, it
         | outputs a weighted recommendation over all possible next
         | tokens. The most probable ("best") token has the highest
         | weight, but there may be many alternatives, including JSON
         | symbols like quote characters.
         | 
         | The "temperature" setting adjusts how likely it is that an
         | output token is chosen that is _not_ the top-rated option. That
         | prevents repetitive output.
         | 
         | Forcing an LLM to obey a grammar is mostly about filtering the
         | list before the token choice is made. There may still be a
         | random element controlled by the temperature!
         | 
         | A more advanced feature not commonly used is to also enable
         | back-tracking if the AI gets stuck and can't produce a valid
         | output.
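          | 
          | A tiny Python illustration of the temperature knob, using
          | made-up logits for a three-token vocabulary:
          | 
          |       import numpy as np
          | 
          |       def temperature_probs(logits, temp):
          |           """Small temp approaches greedy (argmax);
          |           large temp flattens the distribution."""
          |           z = np.asarray(logits, dtype=float) / temp
          |           z -= z.max()
          |           p = np.exp(z)
          |           return p / p.sum()
          | 
          |       # For logits [2.0, 1.0, 0.5]:
          |       #   temp=0.2 -> [0.99, 0.01, 0.00]  (near-greedy)
          |       #   temp=1.0 -> [0.63, 0.23, 0.14]
          |       #   temp=2.0 -> [0.48, 0.29, 0.23]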
        
       | bavarianbob wrote:
       | Could someone help me with context? I'm OOTL and don't understand
       | what is going on here.
        
       | painted-now wrote:
       | Can anyone recommend some paper or overview on how "sampling" /
       | "decoding" is done in the e2e neural network age? I know how
       | decoding was done for machine translation and speech recognition
       | back in the HMM times (i.e.
       | https://en.wikipedia.org/wiki/Viterbi_algorithm and
       | https://en.wikipedia.org/wiki/Beam_search). These days I get the
       | impression people just do "greedy" - but I don't really know. Any
       | recommendations for info on that topic?
       | 
       | Edit: Forgot Viterbi
        
         | janalsncm wrote:
         | Just reading through the GPT-4 documentation, it doesn't seem
         | like there's a ton of difference from what you've mentioned.
         | 
         | https://platform.openai.com/docs/api-reference/completions/c...
         | 
         | Of course we now know that GPT4 is a Mixture of Experts, so
         | under the hood they're parallelizing computation. They also
         | include a way to modify the logits with presence/frequency
         | penalty terms.
        
         | spion wrote:
         | It's greedy and random :) Instead of a paper, I would recommend
         | the algorithms of most LLM implementations (rwkv.cpp has a
         | relatively clean implementation in Python:
         | https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/samp...)
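         | 
         | For a rough idea, most of them boil down to something like this
         | temperature-plus-top-p sketch (details vary between
         | implementations):
         | 
         |       import numpy as np
         | 
         |       def sample_top_p(logits, temp=0.8, top_p=0.9):
         |           """Temperature-scaled softmax, keep the smallest set
         |           of top tokens whose probability reaches top_p,
         |           renormalise, then draw one at random."""
         |           p = np.exp((np.asarray(logits) - np.max(logits)) / temp)
         |           p /= p.sum()
         |           order = np.argsort(p)[::-1]   # most probable first
         |           cut = np.searchsorted(np.cumsum(p[order]), top_p) + 1
         |           keep = order[:cut]
         |           kept = p[keep] / p[keep].sum()
         |           return int(np.random.choice(keep, p=kept))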
        
           | painted-now wrote:
           | I guess I need to sit down and study this stuff in more
           | detail, but do I understand correctly that the code you
           | shared makes the decisions for each position independently? I
           | am just astonished that this produces any coherent output.
           | Also it is not clear to me how the length of the output
           | sequence is determined.
        
             | pizza wrote:
             | Once the stop (end-of-sequence) token gets sampled - with
             | greedy decoding, once it becomes the likeliest.
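             | 
             | Roughly, the outer loop looks like this sketch, where model
             | and sample are hypothetical helpers (model returns the
             | next-token logits given the whole sequence so far):
             | 
             |       def generate(model, prompt, sample, eos):
             |           """Each step is conditioned on everything
             |           generated so far (it is not independent per
             |           position); generation ends once the
             |           end-of-sequence token comes out."""
             |           toks = list(prompt)
             |           while True:
             |               tok = sample(model(toks))
             |               if tok == eos:
             |                   break
             |               toks.append(tok)
             |           return toks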
        
       | moffkalast wrote:
       | Ah finally, this was discussed a lot and is well overdue. Remains
       | to be seen how well the models will adapt to this new constraint,
       | though the demo seems promising.
        
         | ec109685 wrote:
         | Isn't this approach forcing the LLM to adapt? E.g. it is
         | throwing tokens away that don't match the grammar.
        
           | moffkalast wrote:
           | Well, the grammar will be correct, as enforced by the
           | sampler, but the content it's filled with could be anything
           | at all. Sort of like how the output can be garbage for some
           | models when you change the prompt template. I haven't tried
           | it out myself yet, but apparently even OpenAI's
           | implementation of this exact principle in their API still
           | has function-hallucination issues, even with GPT-4.
        
       | Der_Einzige wrote:
       | I am in love with this. I tried my hand at building a Constrained
       | Text Generation Studio
       | (https://github.com/Hellisotherpeople/Constrained-Text-
       | Genera...), and got published at COLING 2022 for my paper on it
       | (https://paperswithcode.com/paper/most-language-models-can-
       | be...), but I always knew that something like this, or the
       | related idea enumerated in this paper
       | (https://arxiv.org/abs/2306.03081), was the way to go.
       | 
       | I will have to think about how I can build grammars that force
       | things like syllable counts or syntactic rules. Current LLMs do
       | very poorly on those kinds of tasks due to the tokenization
       | schemes...
        
       ___________________________________________________________________
       (page generated 2023-07-21 23:00 UTC)