[HN Gopher] Character Prefix Conditioning
___________________________________________________________________
Character Prefix Conditioning
Author : mntruell
Score : 22 points
Date : 2025-01-08 09:27 UTC (3 days ago)
(HTM) web link (www.cursor.com)
(TXT) w3m dump (www.cursor.com)
| kcarnold wrote:
| This was the subject of https://arxiv.org/abs/2412.03719. (I
| suspect you can do something simpler than the paper's solution if
| you're only interested in the top-k.)
|
| A related topic is "token healing", although some implementations
| (unfortunately including the one in HuggingFace Transformers)
| make some big assumptions that aren't always true (like treating
| spaces as special).
| yorwba wrote:
| Ideally you'd have a language model that can predict a good
| continuation after any byte. If an existing model can't do that
| because it's too reliant on a specific tokenization, you might
| nonetheless be able to fine-tune it until it can gracefully
| handle the unexpected tokenizations that result from splitting at
| a random byte.
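|
| A minimal sketch of what that fine-tuning data could look like
| (tiktoken used purely for illustration; a real setup would cut at
| bytes rather than characters, which needs a little care around
| multi-byte UTF-8 sequences):
|
|     import random
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|
|     def make_split_example(text: str) -> list[int]:
|         """Cut the text at a random position and tokenize each side
|         independently, so the training sequence contains the
|         'unnatural' tokenization a mid-token cut produces."""
|         cut = random.randrange(1, len(text))  # assumes len(text) > 1
|         return enc.encode(text[:cut]) + enc.encode(text[cut:])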
| kevmo314 wrote:
| Such a model will always be less performant than one trained on
| tokens, as you're effectively switching to one byte per token.
| Solving this problem in code is much cheaper.
| yorwba wrote:
| I don't mean switching to one byte per token, but switching
| to training on the token distribution that results from
| cutting off the input at arbitrary bytes. The bytes per token
| should be basically unchanged, as only the end gets a bit
| shorter.
| kevmo314 wrote:
| Yeah I've tried that approach. The model ends up needing to
| learn every combination of tokens. For example, the word
| "apple" now has six bytes positions it can be split on and
| the model suddenly needs to learn that all six will yield
| the same output attention state.
|
| It ends up being O(max token length) more complex and so
| you end up needing a proportionally larger model to
| accommodate it.
| teaearlgraycold wrote:
| Not sure if this is free labor or a means to source candidates.
| do_not_redeem wrote:
| So here is ChatGPT's token list:
| https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0...
|
| Is there some reason it isn't alphabetical? (More specifically,
| lexically sorted by codepoint.) If you had a model with sorted
| tokens, you'd be able to solve this by constraining output to
| tokens with the desired prefix, probably with some mechanism
| similar to how this works:
| https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
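|
| A minimal sketch of that constraint (hypothetical names; the real
| logits/vocab interface depends on the runtime):
|
|     def allowed_ids(remaining: str, vocab: dict[int, str]) -> list[int]:
|         """Token ids whose text either fits inside the remaining
|         character prefix or extends past its end."""
|         return [tid for tid, s in vocab.items()
|                 if remaining.startswith(s) or s.startswith(remaining)]
|
|     def constrained_next_token(logits: list[float], remaining: str,
|                                vocab: dict[int, str]) -> int:
|         ok = allowed_ids(remaining, vocab)  # assumes this is non-empty
|         # Greedy pick shown; a real sampler would renormalize the
|         # probabilities over `ok` and sample from them.
|         return max(ok, key=lambda tid: logits[tid])
|
| With a codepoint-sorted vocab, the s.startswith(remaining) case
| becomes a binary search for the contiguous block of tokens sharing
| that prefix, and the other case only has len(remaining) candidate
| strings to check, so the linear scan disappears.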
| pizza wrote:
| They _are_ sorted, but in a way such that, roughly speaking,
| rank is proportional to inverse frequency (like Zipf's law, but
| also allowing merged subwords to be ranked). This is actually
| extremely important, because it makes the otherwise
| very-high-cardinality categorical target (the predicted argmax
| index into the vocabulary) slightly smoother and slightly more
| predictable for the model.
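|
| You can see the roughly-by-frequency ordering by decoding a few
| ids (sketch with tiktoken's gpt2 encoding, where the first 256 ids
| are the byte-level tokens and merges follow in merge order):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("gpt2")
|     # Ids just past the byte tokens decode to very common fragments;
|     # ids near the end of the vocab decode to much rarer strings.
|     print([enc.decode([i]) for i in range(256, 266)])
|     print([enc.decode([i]) for i in range(50000, 50010)])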
| versteegen wrote:
| 516 instances of "\r\n"!
| viraptor wrote:
| > Can you construct an efficient algorithm for sampling from
| q(t_k | t_1, ..., t_{k-1}), that minimizes calls to the original
| language model?
|
| I feel like I'm missing some issue here... Can't you query the
| model stopping at the last full token boundary, then reject any
| sampled results that don't match the character prefix, and
| continue from there with the completion? Kind of like masking
| invalid actions when reinforcement learning on games? Or is that
| losing too much info?
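|
| Something like this rough sketch, where model_step stands in for
| whatever the real next-token call is:
|
|     import random
|
|     def complete_with_char_prefix(tokens, remaining, model_step, decode):
|         """tokens: ids up to the last full token boundary;
|         remaining: leftover characters of the user's prefix.
|         (Assumes some prefix-consistent token has nonzero prob.)"""
|         while remaining:
|             probs = model_step(tokens)  # depends only on tokens so far
|             while True:  # rejection loop, no extra model calls needed
|                 tid = random.choices(range(len(probs)), weights=probs)[0]
|                 text = decode(tid)
|                 if remaining.startswith(text) or text.startswith(remaining):
|                     break  # consistent with the character prefix
|             tokens.append(tid)
|             if remaining.startswith(text):
|                 remaining = remaining[len(text):]
|             else:  # token ran past the end of the prefix
|                 remaining = ""
|         return tokens  # hand back to ordinary sampling from here
|
| Rejecting and resampling like that amounts to renormalizing each
| step's distribution over the prefix-consistent tokens.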
| sshh12 wrote:
| I asked o1 to figure this out and this is essentially what it
| came up with as well.
|
| https://chat.sshh.io/share/HIzUotMYVxFhRde94ZYJJ
___________________________________________________________________
(page generated 2025-01-11 23:01 UTC)