[HN Gopher] Tokenization for language modeling: BPE vs. Unigram ...
___________________________________________________________________
Tokenization for language modeling: BPE vs. Unigram Language
Modeling (2020)
Author : phewlink
Score : 67 points
Date : 2025-05-30 08:59 UTC (14 hours ago)
(HTM) web link (ndingwall.github.io)
(TXT) w3m dump (ndingwall.github.io)
| anonymoushn wrote:
| How does SentencePiece choose the initial vocabulary, which is
| trimmed down to determine the final vocabulary which has these
| desirable properties?
| mcyc wrote:
| Just a minor nit: SentencePiece is a library, not a
| tokenization algorithm. It implements two tokenization
| algorithms, Unigram and BPE.
|
| BPE builds vocabularies from the base up so I assume you are
| talking about Unigram which starts with a big vocabulary and
| trims it.
|
| The details of UnigramLM are here
| https://arxiv.org/pdf/1804.10959, and the part about vocabulary
| seeding is Section 3.2.
|
| Basically, it just selects all substrings that appear in the
| corpus up to a certain length (and then maybe trims it a little
| by discarding rare substrings or something to reduce the
| initial size a bit and make things faster).
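|
| As a very rough sketch of that seeding step (Python; the function
| name and thresholds are illustrative, and the real implementation
| uses an enhanced suffix array rather than this naive
| enumeration):
|
|     from collections import Counter
|
|     def seed_vocab(corpus, max_len=8, min_count=5, max_size=100_000):
|         """Collect frequent substrings up to max_len characters
|         as the initial (oversized) Unigram vocabulary."""
|         counts = Counter()
|         for text in corpus:
|             for i in range(len(text)):
|                 for j in range(i + 1, min(i + max_len, len(text)) + 1):
|                     counts[text[i:j]] += 1
|         frequent = [(s, c) for s, c in counts.items() if c >= min_count]
|         frequent.sort(key=lambda sc: -sc[1])
|         return dict(frequent[:max_size])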
| anonymoushn wrote:
| If the library has two vocabulary learners, only one of which
| does the described thing, then isn't it unambiguous which
| implementation within the library the question refers to? And
| wouldn't it be ambiguous to instead say "how does Unigram do
| it" without referring to any particular implementation?
|
| Anyway, the paper says "Frequent substrings can be enumerated
| in O(T) time and O(20T) space with the Enhanced Suffix Array
| algorithm (Nong et al., 2009)", which is hilariously
| underspecified, at least in part because a suffix array
| algorithm isn't a top-k algorithm.
| wejick wrote:
| The vibe of AI discussions was so different back then compared
| with today's (or at least with the past 2.5 years). It's quite
| surreal how fast things have moved.
| macawfish wrote:
| It's always seemed like such low-hanging fruit: so much semantic
| information squandered in the thirst for larger models.
| kevmo314 wrote:
| Unfortunately the choice of tokenizer is baked into the model.
| If you want to innovate on the tokenizer, you have to train a
| whole base model yourself to prove it's better which makes the
| barrier for innovation pretty high.
| mdaniel wrote:
| Apologies if this is a dumb question, but is there no "hello
| world"-ish sandbox for testing this theory? I can easily imagine
| that going head-to-head with R1 or the like would take a boatload
| of GPU, but for testing tokenizers head-to-head, isn't there a
| smaller model that can be used in a bake-off?
| anonymoushn wrote:
| You can use modded-nanogpt for testing this sort of change.
| If you don't want to do anything really weird, you can just
| train whatever tokenizer you like on the training set, then
| retokenize the input, then run the existing training code.
| One person did this earlier this year using a Tokenmonster
| vocab and got better downstream performance in less train
| time.
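|
| As a rough sketch of that workflow with the Hugging Face
| tokenizers library (file names and vocab size are placeholders;
| modded-nanogpt's actual data pipeline will differ):
|
|     from tokenizers import Tokenizer
|     from tokenizers.models import BPE
|     from tokenizers.trainers import BpeTrainer
|     from tokenizers.pre_tokenizers import ByteLevel
|
|     # Train a custom BPE vocab on the same text the model will see.
|     tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
|     tokenizer.pre_tokenizer = ByteLevel()
|     trainer = BpeTrainer(vocab_size=50304, special_tokens=["[UNK]"])
|     tokenizer.train(files=["train.txt"], trainer=trainer)
|     tokenizer.save("custom_tokenizer.json")
|
|     # Retokenize the training text into ids for the existing
|     # training code to consume.
|     with open("train.txt") as f:
|         ids = tokenizer.encode(f.read()).ids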
| pama wrote:
| Which exact attempt with which Tokenmonster vocab are you
| referring to? Sometimes it is hard to conclude much from
| these efforts. For example, having a smaller vocabulary
| is typically only useful for small models where the
| compute cost of the softmax layer at the end of the
| decoder may still factor into the equation for the
| performance. Fixing the size of the vocabulary while
| increasing the rest of the model makes this inefficiency
| disappear.
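|
| To put rough, illustrative numbers on that: the output projection
| costs about hidden_dim x vocab_size parameters, so a
| GPT-2-small-sized model (hidden size 768) with a ~50k vocab
| spends roughly 38M of its ~124M parameters there, around 30%,
| while a 7B-class model (hidden size 4096) with the same vocab
| spends about 205M out of 7B, under 3%.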
| anonymoushn wrote:
| https://x.com/alexjc/status/1881410039639863622
|
| I agree that it would be more useful to compare vocabs of
| identical size.
| pama wrote:
| Thanks! It seems to me that the performance gain here was
| due to a smaller vocab size. This type of change is
| almost guaranteed to backfire for larger models and
| larger datasets / lower loss requirements, so it probably
| is not very useful. Generally the historical trend has
| been to use larger vocabularies as the models got larger
| themselves.
| anonymoushn wrote:
| Well, he says he achieved some downstream win at the same
| vocab size, but it didn't translate into a win in perplexity,
| so he tried to do something else. Like I said, it's
| unfortunate.
| z3c0 wrote:
| To expand on the other comment, if you look under the data
| folder in nanoGPT, you can see examples of how to train the
| model using various data sources and encoders.
| "shakespeare_char" is probably the most rudimentary, only
| converting the characters of the input into integers.
|
| e.g. https://github.com/karpathy/nanoGPT/blob/master/data/s
| hakesp...
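|
| The character-level scheme is simple enough to sketch in a few
| lines (this is the general idea, not the exact prepare.py code):
|
|     # Map every distinct character in the corpus to an integer id.
|     with open("input.txt") as f:
|         data = f.read()
|     chars = sorted(set(data))
|     stoi = {ch: i for i, ch in enumerate(chars)}
|     itos = {i: ch for ch, i in stoi.items()}
|
|     encode = lambda s: [stoi[c] for c in s]
|     decode = lambda ids: "".join(itos[i] for i in ids)
|
|     ids = encode(data)  # the whole corpus as a list of ints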
| pbronez wrote:
| Is there a more current post with similar information? I'd love
| to see how contemporary tokenizers have improved on these
| algorithms.
| anonymoushn wrote:
| They have not.
| woodson wrote:
| There has been some innovation, e.g. SuperBPE
| (https://arxiv.org/abs/2503.13423) claims substantial gains
| on 8B-sized models.
| janalsncm wrote:
| Interesting recent research from Meta going in what I would call
| the opposite direction from an interpretability standpoint:
| https://arxiv.org/abs/2412.09871
|
| Rather than splitting based on a predefined list you can check
| with a text editor, BLTs use a neural net to tokenize. So it will
| be pretty hard to debug.
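|
| Roughly, the paper segments the byte stream into patches wherever
| a small byte-level language model finds the next byte surprising.
| A toy sketch of that thresholding step (the entropies would come
| from such a model, which is not shown, and the threshold is made
| up):
|
|     def patch_boundaries(entropies, threshold=2.0):
|         """Start a new patch wherever next-byte entropy exceeds
|         the threshold (simplified entropy patching)."""
|         boundaries = [0]
|         for i, h in enumerate(entropies):
|             if i > 0 and h > threshold:
|                 boundaries.append(i)
|         return boundaries
|
|     # e.g. per-byte entropies from a small byte LM
|     ents = [3.1, 0.4, 0.2, 0.1, 0.3, 2.8, 0.5, 0.2, 0.3, 0.2]
|     print(patch_boundaries(ents))  # -> [0, 5]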
| YossarianFrPrez wrote:
| This is a cool paper!
|
| Basically, prior to feeding text to the tokenizer people have
| split the text on whitespaces. But whitespaces aren't exactly
| meaningful separators. By getting rid of this restriction as
| the tokenizer is 'learning', some of the tokens end up being
| 'by the way' or 'in the long run.' The researchers find
| that this makes the model much more efficient.
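|
| As a toy illustration of that effect (not the paper's actual
| method), a character-level BPE that is allowed to merge across
| spaces will happily form multi-word tokens on a repetitive
| corpus:
|
|     from collections import Counter
|
|     def bpe_merges(text, num_merges=10):
|         """Greedy BPE over raw characters, with no whitespace
|         pre-splitting, so merges are free to cross spaces."""
|         seq = list(text)
|         merges = []
|         for _ in range(num_merges):
|             pairs = Counter(zip(seq, seq[1:]))
|             if not pairs:
|                 break
|             (a, b), _ = pairs.most_common(1)[0]
|             merges.append(a + b)
|             out, i = [], 0
|             while i < len(seq):
|                 if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
|                     out.append(a + b)
|                     i += 2
|                 else:
|                     out.append(seq[i])
|                     i += 1
|             seq = out
|         return merges
|
|     # Multi-word units like "by the way " emerge as single tokens.
|     print(bpe_merges("by the way " * 50, num_merges=12))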
| mrkramer wrote:
| So LLMs hallucinate by design, which is of course bad for any
| serious work. But then again, I was thinking about the situations
| in which hallucinations would actually be useful... let's say
| poetry writing, lyrics writing, etc. (the crazier the better), or
| generating business ideas, "crazy ideas succeed" type logic.
| Lerc wrote:
| While training something on TinyStories one of the early stage
| outputs was
|
| _" Once upon a time, in a small town, there was a huge town"_
|
| Which I thought was a brilliant start for an interesting
| story. Just that slight deviation from expectation provides an
| excellent scaffold for your imagination to take hold.
|
| Of course the rest of the generation was utter rubbish, but if
| you could deviate from the standard on demand at key points while
| keeping the majority internally consistent, then it would be
| great.
| mrkramer wrote:
| Poetry is full of figures of speech which are words and
| phrases that intentionally differ from common logic and
| common knowledge so LLMs can actually be probabilistic and
| accidental generators of figures of speech. Every piece of
| nonsense that LLMs generate can be interpreted as a potential
| metaphor or an oxymoron. The sentence that you mentioned
| sounds like a badass metaphor to me: "Once upon a time, in a
| small town, there was a huge town", or in other words: the
| town might be small in size but it has soul.
| Lerc wrote:
| I have been doing experiments along these lines. Initially I have
| avoided the issue of the semantics of words altogether because
| there is plenty of work to be done before you even get to the
| words themselves.
|
| With BPE you get different encodings based upon punctuation and
| whitespace whenever it decides to pair the characters before and
| after a word into the word itself.
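|
| This is easy to see with an off-the-shelf BPE vocabulary, e.g.
| via tiktoken (shown only to illustrate that "word", " word" and
| "word." tokenize differently; the exact ids depend on the
| encoding):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("gpt2")
|     for s in ["word", " word", "word.", " word."]:
|         # The same surface word gets a different token sequence
|         # depending on surrounding whitespace and punctuation.
|         print(repr(s), enc.encode(s))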
|
| The tokenizer I have been using breaks words into components of:
|
|     prefix character
|     prefix character repeat count
|     index of lowercase(word)
|     index of capitalization mode (0 = all lower, 1 = initial cap,
|       2 = all caps)
|     suffix character
|
| The encoder encodes the character before and after the word,
| double-encoding the information into the word's previous and next
| neighbours; the decoder decreases the prefix count by one if it
| matches the suffix of the previous word. Anything that doesn't
| match a known word or is weirdly capitalized gets character
| encoded as a series of prefix-suffix characters with no word body
| (a separate channel for chars would probably be better).
|
| Each channel gets its own embedding table and the embeddings
| from all the channels are summed before being passed into the
| transformer.
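|
| A rough sketch of how a single word might be split into some of
| those channels and embedded (the field names and sizes are my
| guesses at the scheme described above, not the actual code, and
| the prefix/suffix/repeat-count channels are omitted for brevity):
|
|     import torch
|     import torch.nn as nn
|
|     def word_to_channels(word, vocab):
|         """Split a token like 'Hello' into (cap_mode, word_id)."""
|         if len(word) > 1 and word.isupper():
|             cap = 2                      # ALL CAPS
|         elif word[:1].isupper():
|             cap = 1                      # Initial cap
|         else:
|             cap = 0                      # all lower
|         return cap, vocab.get(word.lower(), 0)   # 0 = unknown
|
|     # One embedding table per channel, summed before the transformer.
|     d_model = 64
|     cap_emb = nn.Embedding(3, d_model)
|     word_emb = nn.Embedding(10_000, d_model)
|
|     cap, wid = word_to_channels("Hello", {"hello": 42})
|     x = cap_emb(torch.tensor(cap)) + word_emb(torch.tensor(wid))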
|
| Decoding is just a separate last-stage layer translating the
| final embedding back into channels. In hindsight it should have
| been a little more than that, because if the options for the next
| word were split between " YELLED!" and " whispered." my current
| system could theoretically produce " WHISPERED.". In practice it
| doesn't seem to do that much, but that means it has had to learn
| something to deal with it (I suspect by limiting variation).
| Adding a little smarts to the final tokenization step would help:
| perhaps choose the word index first, then use the embedding of
| the matched word to filter the predicted embedding before
| calculating the other channels.
|
| I have not yet done anything on breaking words themselves up. I
| have been using tinystories for training so there hasn't been
| need for it with so few unique words. I have given it a lot of
| thought though, and I think I would contest the 'Gold standard'
| encodings. I think a word like 'nodular' should be encoded as
| something like 'nodule' '<of or relating to modifier> <minus e>
| ar'
|
| It's a little hard to comprehend what this might be for other
| languages, but I think there are probably insights to be had if
| you tried to make something that encoded English, Latin and Kanji
| equally well.
|
| I'd be curious to know the total number of homonyms there are
| across languages, too. Just a single signal to say 'this one
| is a known homonym' would probably be beneficial. If the total
| number is low enough, having their own token range might work
| too.
___________________________________________________________________
(page generated 2025-05-30 23:01 UTC)