[HN Gopher] Character Prefix Conditioning
       ___________________________________________________________________
        
       Character Prefix Conditioning
        
       Author : mntruell
       Score  : 22 points
       Date   : 2025-01-08 09:27 UTC (3 days ago)
        
 (HTM) web link (www.cursor.com)
 (TXT) w3m dump (www.cursor.com)
        
       | kcarnold wrote:
        | This was the subject of https://arxiv.org/abs/2412.03719. (I
        | suspect you can do something simpler than the paper's solution
        | if you're only interested in the top-k.)
       | 
       | A related topic is "token healing", although some implementations
       | (unfortunately including the one in HuggingFace Transformers)
       | make some big assumptions that aren't always true (like treating
       | spaces as special).
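
        A rough sketch of the token-healing idea mentioned above (not any
        particular library's implementation; encode, decode, and vocab
        are assumed tokenizer helpers): back off the last prompt token
        and require the first sampled token to extend the characters it
        covered.

          def heal_prompt(prompt, encode, decode):
              # Drop the last prompt token; the characters it covered
              # become a suffix the first generated token must extend.
              ids = encode(prompt)
              return ids[:-1], decode(ids[-1:])

          def allowed_first_tokens(suffix, vocab):
              # Token ids whose string starts with the dangling suffix,
              # or which are themselves a prefix of a longer suffix.
              return [i for i, tok in enumerate(vocab)
                      if tok.startswith(suffix) or suffix.startswith(tok)]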
        
       | yorwba wrote:
       | Ideally you'd have a language model that can predict a good
       | continuation after any byte. If an existing model can't do that
       | because it's too reliant on a specific tokenization, you might
       | nonetheless be able to fine-tune it until it can gracefully
       | handle the unexpected tokenizations that result from splitting at
       | a random byte.
        
         | kevmo314 wrote:
          | Such a model will always be less performant than one trained
          | on tokens, as you're effectively switching to one byte per
          | token. Solving this problem in code is much cheaper.
        
           | yorwba wrote:
           | I don't mean switching to one byte per token, but switching
           | to training on the token distribution that results from
           | cutting off the input at arbitrary bytes. The bytes per token
           | should be basically unchanged, as only the end gets a bit
           | shorter.
        
             | kevmo314 wrote:
              | Yeah, I've tried that approach. The model ends up needing
              | to learn every combination of tokens. For example, the
              | word "apple" now has six byte positions it can be split
              | on, and the model suddenly needs to learn that all six
              | will yield the same output attention state.
             | 
             | It ends up being O(max token length) more complex and so
             | you end up needing a proportionally larger model to
             | accommodate it.
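
        A minimal sketch of the fine-tuning data yorwba describes,
        assuming an encode tokenizer helper: cut each training text at a
        random byte and tokenize the truncated string, so the model sees
        the unusual trailing tokens such cuts produce.

          import random

          def random_cut_example(text, encode):
              # Cut at a random byte; only the tail tokens differ from
              # the normal tokenization of the full text.
              raw = text.encode("utf-8")
              cut = random.randrange(1, len(raw) + 1)
              prefix = raw[:cut].decode("utf-8", errors="ignore")
              return encode(prefix)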
        
       | teaearlgraycold wrote:
       | Not sure if this is free labor or a means to source candidates.
        
       | do_not_redeem wrote:
       | So here is ChatGPT's token list:
       | https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0...
       | 
       | Is there some reason it isn't alphabetical? (More specifically,
       | lexically sorted by codepoint) If you had a model with sorted
       | tokens, you'd be able to solve this by constraining output to
       | tokens with the desired prefix, probably with some mechanism
       | similar to how this works:
       | https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
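
        A minimal sketch of the constrained-sampling idea above, in the
        spirit of llama.cpp's grammar sampling (vocab, a list of decoded
        token strings, and the raw logits are assumed inputs, not any
        specific API): mask every token that is incompatible with the
        characters still owed to the prefix, sample, then shorten the
        remaining prefix by the sampled token's text.

          import math

          def mask_for_prefix(logits, vocab, remaining):
              # A token is compatible if it is a prefix of the remaining
              # characters, or the remaining characters are a prefix of
              # it. With remaining == "" nothing is masked.
              out = list(logits)
              for i, tok in enumerate(vocab):
                  if not (tok.startswith(remaining)
                          or remaining.startswith(tok)):
                      out[i] = -math.inf
              return out

        Sorting would let the compatible range be found by binary search,
        but a precomputed trie over the vocabulary works regardless of
        token order. Note that this greedy masking keeps the constraint
        satisfied but isn't exactly the conditional the post asks for,
        since it ignores the probability mass of each candidate's longer
        continuations.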
        
         | pizza wrote:
         | They _are_ sorted, but in a way such that, roughly speaking,
         | rank is proportional to inverse frequency (like Zipf's law, but
         | also permitting merging of subwords to be ranked). This is
         | actually extremely important because it makes the otherwise
         | very-high-cardinality categorical feature of target predicted
         | argmax vocab dictionary key index slightly smoother and
         | slightly more predictable for the model
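
        A quick way to inspect the ordering pizza describes, assuming the
        tiktoken package and its cl100k_base encoding (illustrative; not
        necessarily the exact list in the gist above):

          import tiktoken

          enc = tiktoken.get_encoding("cl100k_base")
          # Ids follow BPE merge rank, not codepoint order: low ids tend
          # to be single bytes and very common merges.
          for i in (0, 100, 1000, 50000):
              print(i, enc.decode_single_token_bytes(i))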
        
         | versteegen wrote:
         | 516 instances of "\r\n"!
        
       | viraptor wrote:
        | > Can you construct an efficient algorithm for sampling from
        | q(t_k | t_1, ..., t_{k-1}), that minimizes calls to the
        | original language model?
       | 
        | I feel like I'm missing some issue here... Can't you query
        | stopping at the last full token boundary, then reject any
        | results which don't match the character prefix, and continue
        | from there with the completion? Kind of like masking invalid
        | actions when doing reinforcement learning on games? Or is that
        | losing too much info?
        
         | sshh12 wrote:
         | I asked o1 to figure this out and this is essentially what it
         | came up with as well.
         | 
         | https://chat.sshh.io/share/HIzUotMYVxFhRde94ZYJJ
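
        A minimal sketch of the reject-and-retry approach viraptor
        describes (and that o1 apparently converged on), with
        sample_tokens and decode as assumed model/tokenizer helpers:

          def sample_with_char_prefix(prompt_ids, char_prefix,
                                      sample_tokens, decode, max_tries=64):
              # Start from the last full token boundary and resample a
              # short continuation until its text is consistent with the
              # required character prefix.
              for _ in range(max_tries):
                  cont = sample_tokens(prompt_ids, n=8)
                  text = decode(cont)
                  if (text.startswith(char_prefix)
                          or char_prefix.startswith(text)):
                      return cont
              raise RuntimeError("no continuation matched the prefix")

        Plain rejection roughly preserves the model's own conditional
        distribution, but it can burn many model calls when the prefix is
        unlikely under the model, which seems to be why the post asks to
        minimize calls.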
        
       ___________________________________________________________________
       (page generated 2025-01-11 23:01 UTC)