[HN Gopher] Tokenization for language modeling: BPE vs. Unigram ...
       ___________________________________________________________________
        
       Tokenization for language modeling: BPE vs. Unigram Language
       Modeling (2020)
        
       Author : phewlink
       Score  : 67 points
       Date   : 2025-05-30 08:59 UTC (14 hours ago)
        
 (HTM) web link (ndingwall.github.io)
 (TXT) w3m dump (ndingwall.github.io)
        
       | anonymoushn wrote:
        | How does SentencePiece choose the initial vocabulary that is
        | then trimmed down into the final vocabulary with these
        | desirable properties?
        
         | mcyc wrote:
         | Just a minor nit: SentencePiece is a library, not a
         | tokenization algorithm. It implements two tokenization
         | algorithms, Unigram and BPE.
         | 
          | BPE builds vocabularies up from the base characters, so I
          | assume you are talking about Unigram, which starts with a big
          | vocabulary and trims it down.
         | 
         | The details of UnigramLM are here
         | https://arxiv.org/pdf/1804.10959, and the part about vocabulary
         | seeding is Section 3.2.
         | 
          | Basically, it just selects all substrings that appear in the
          | corpus up to a certain length (and then maybe trims that set a
          | little by discarding rare substrings, to reduce the initial
          | size and make things faster).
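          | 
          | A toy sketch of that seeding step (not SentencePiece's actual
          | implementation, which uses an enhanced suffix array rather
          | than brute-force counting) might look like:
          | 
            from collections import Counter

            def seed_vocab(corpus, max_len=8, max_size=1_000_000):
                # Count every substring up to max_len characters.
                counts = Counter()
                for sentence in corpus:
                    n = len(sentence)
                    for i in range(n):
                        for j in range(i + 1, min(i + max_len, n) + 1):
                            counts[sentence[i:j]] += 1
                # Always keep single characters, then the most frequent
                # longer substrings up to the requested seed size.
                chars = {s for s in counts if len(s) == 1}
                longer = [s for s, _ in counts.most_common() if len(s) > 1]
                return chars | set(longer[: max_size - len(chars)])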
        
           | anonymoushn wrote:
           | If the library has two vocabulary learners, only one of which
           | does the described thing, then isn't it unambiguous which
           | implementation within the library the question refers to? And
           | wouldn't it be ambiguous to instead say "how does Unigram do
           | it" without referring to any particular implementation?
           | 
           | Anyway, the paper says "Frequent substrings can be enumerated
           | in O(T) time and O(20T) space with the Enhanced Suffix Array
           | algorithm (Nong et al., 2009)", which is hilariously
           | underspecified, at least in part because a suffix array
           | algorithm isn't a top-k algorithm.
        
       | wejick wrote:
        | The vibe of AI discussions back then was so different from
        | today's (or at least from the past 2.5 years). It's quite
        | surreal how fast things have moved.
        
       | macawfish wrote:
        | It's always seemed like such low-hanging fruit; so much
        | semantic information just squandered in the thirst for larger
        | models.
        
         | kevmo314 wrote:
          | Unfortunately the choice of tokenizer is baked into the model.
          | If you want to innovate on the tokenizer, you have to train a
          | whole base model yourself to prove it's better, which makes
          | the barrier to innovation pretty high.
        
           | mdaniel wrote:
            | Apologies if this is a dumb question, but is there no "hello
            | world"-ish sandbox for testing this theory? I can very easily
            | imagine that going head-to-head with R1 or such is going to
            | take a boatload of GPU, but for just testing tokenizers
            | head-to-head, isn't there a smaller-sized model that can be
            | used in a bake-off?
        
             | anonymoushn wrote:
             | You can use modded-nanogpt for testing this sort of change.
             | If you don't want to do anything really weird, you can just
             | train whatever tokenizer you like on the training set, then
             | retokenize the input, then run the existing training code.
             | One person did this earlier this year using a Tokenmonster
             | vocab and got better downstream performance in less train
             | time.
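              | 
              | As a rough sketch (using the Hugging Face tokenizers
              | library; the file name and vocab size here are just
              | placeholders), the retokenization step is only a few
              | lines:
              | 
                from tokenizers import Tokenizer, models
                from tokenizers import pre_tokenizers, trainers

                # Train a byte-level BPE on the same text the model sees.
                tok = Tokenizer(models.BPE())
                tok.pre_tokenizer = pre_tokenizers.ByteLevel()
                trainer = trainers.BpeTrainer(
                    vocab_size=50304, special_tokens=["<|endoftext|>"])
                tok.train(["train.txt"], trainer=trainer)  # placeholder path

                # Re-encode the training text with the new vocab and feed
                # the ids to the unmodified training loop.
                ids = tok.encode(open("train.txt").read()).ids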
        
               | pama wrote:
                | Which exact attempt, with which Tokenmonster vocab, are
                | you referring to? Sometimes it is hard to conclude much
                | from these efforts. For example, having a smaller
                | vocabulary is typically only useful for small models,
                | where the compute cost of the softmax layer at the end of
                | the decoder still factors into overall performance.
                | Fixing the size of the vocabulary while increasing the
                | rest of the model makes this inefficiency disappear.
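                | 
                | Rough back-of-the-envelope numbers (ignoring embedding
                | tying and attention details) for why the vocab matters
                | less as the model grows:
                | 
                  def lm_head_fraction(d_model, n_layer, vocab):
                      # ~12 * d_model^2 params per transformer layer,
                      # plus a d_model x vocab output projection.
                      body = 12 * d_model * d_model * n_layer
                      head = d_model * vocab
                      return head / (body + head)

                  print(lm_head_fraction(768, 12, 50257))   # ~0.31 at GPT-2-small size
                  print(lm_head_fraction(4096, 32, 50257))  # ~0.03 at 7B size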
        
               | anonymoushn wrote:
               | https://x.com/alexjc/status/1881410039639863622
               | 
               | I agree that it would be more useful to compare vocabs of
               | identical size.
        
               | pama wrote:
               | Thanks! It seems to me that the performance gain here was
               | due to a smaller vocab size. This type of change is
               | almost guaranteed to backfire for larger models and
               | larger datasets / lower loss requirements, so it probably
               | is not very useful. Generally the historical trend has
               | been to use larger vocabularies as the models got larger
               | themselves.
        
               | anonymoushn wrote:
                | Well, he says he achieved some downstream win at the same
                | vocab size, but it didn't translate into a win in
                | perplexity, so he tried something else. Like I said, it's
                | unfortunate.
        
             | z3c0 wrote:
             | To expand on the other comment, if you look under the data
             | folder in nanoGPT, you can see examples of how to train the
             | model using various data sources and encoders.
             | "shakespeare_char" is probably the most rudimentary, only
             | converting the characters of the input into integers.
             | 
             | e.g. https://github.com/karpathy/nanoGPT/blob/master/data/s
             | hakesp...
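              | 
              | The char-level scheme is just a lookup table in both
              | directions, roughly (a sketch in the spirit of that
              | prepare.py, with a placeholder file name):
              | 
                text = open("input.txt").read()   # e.g. tiny Shakespeare
                chars = sorted(set(text))
                stoi = {ch: i for i, ch in enumerate(chars)}
                itos = {i: ch for i, ch in enumerate(chars)}

                encode = lambda s: [stoi[c] for c in s]      # str -> ints
                decode = lambda ids: "".join(itos[i] for i in ids)

                assert decode(encode(text[:100])) == text[:100]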
        
       | pbronez wrote:
       | Is there a more current post with similar information? I'd love
       | to see how contemporary tokenizers have improved on these
       | algorithms.
        
         | anonymoushn wrote:
         | They have not.
        
           | woodson wrote:
           | There has been some innovation, e.g. SuperBPE
           | (https://arxiv.org/abs/2503.13423) claims substantial gains
           | on 8B-sized models.
        
       | janalsncm wrote:
       | Interesting recent research from Meta going in what I would call
       | the opposite direction from an interpretability standpoint:
       | https://arxiv.org/abs/2412.09871
       | 
        | Rather than splitting based on a predefined list you can check
        | with a text editor, BLTs use a neural net to tokenize, so they
        | will be pretty hard to debug.
        
         | YossarianFrPrez wrote:
         | This is a cool paper!
         | 
          | Basically, prior to feeding text to the tokenizer, people have
          | split the text on whitespace. But whitespace isn't exactly a
          | meaningful separator. By getting rid of this restriction while
          | the tokenizer is 'learning', some of the tokens end up being
          | 'by the way' or 'in the long run.' The researchers find that
          | this makes the model much more efficient.
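          | 
          | You can see the effect of that restriction with plain BPE too
          | (this is just an illustration of dropping whitespace
          | pre-tokenization, not the BLT paper's method, which replaces
          | the tokenizer with learned byte patches):
          | 
            from tokenizers import Tokenizer, models
            from tokenizers import pre_tokenizers, trainers

            def train_bpe(lines, split_on_whitespace):
                tok = Tokenizer(models.BPE())
                if split_on_whitespace:
                    # No merge can ever cross a space.
                    tok.pre_tokenizer = pre_tokenizers.Whitespace()
                trainer = trainers.BpeTrainer(vocab_size=300)
                tok.train_from_iterator(lines, trainer=trainer)
                return tok

            lines = ["by the way, in the long run it works"] * 1000
            # Without the whitespace split, multi-word tokens such as
            # "by the way" can emerge as single vocabulary entries.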
        
       | mrkramer wrote:
        | So LLMs hallucinate by design, which is of course bad for any
        | serious work, but then again I was thinking: in which situations
        | would hallucinations actually be useful? Let's say poetry
        | writing, lyrics writing, etc., where the crazier the better, or
        | generating business ideas, "crazy ideas succeed" type logic.
        
         | Lerc wrote:
         | While training something on TinyStories one of the early stage
         | outputs was
         | 
         |  _" Once upon a time, in a small town, there was a huge town"_
         | 
          | Which I thought was a brilliant start for an interesting
          | story. Just that slight deviation from expectation provides an
          | excellent scaffold for your imagination to take hold.
         | 
          | Of course the rest of the generation was utter rubbish, but if
          | you could deviate from the standard on demand at key points,
          | while keeping the majority internally consistent, it would be
          | great.
        
           | mrkramer wrote:
            | Poetry is full of figures of speech, which are words and
            | phrases that intentionally differ from common logic and
            | common knowledge, so LLMs can actually be probabilistic and
            | accidental generators of figures of speech. Every piece of
            | nonsense that LLMs generate can be interpreted as a potential
            | metaphor or an oxymoron. The sentence you mentioned sounds
            | like a badass metaphor to me: "Once upon a time, in a small
            | town, there was a huge town", or in other words: the town
            | might be small in size but it has soul.
        
       | Lerc wrote:
        | I have been doing experiments along these lines. Initially I
        | avoided the issue of word semantics altogether, because there is
        | plenty of work to be done before you even get to the words
        | themselves.
       | 
        | With BPE you get different encodings of the same word depending
        | on the surrounding punctuation and whitespace, whenever it
        | decides to merge the characters before or after a word into the
        | word itself.
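        | 
        | For example, with the off-the-shelf GPT-2 BPE via tiktoken, the
        | "same" word maps to different ids depending on the leading space
        | and capitalization:
        | 
          import tiktoken

          enc = tiktoken.get_encoding("gpt2")
          for s in ["hello", " hello", "Hello", " Hello", "hello!"]:
              print(repr(s), enc.encode(s))
          # The leading-space and capitalized variants all come back as
          # different token ids, and punctuation changes the split.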
       | 
       | The tokenizer I have been using breaks words into components of:
          prefix character
          prefix character repeat count
          index of lowercase(word)
          index of capitalization mode (0 = all lower, 1 = initial cap,
            2 = all caps)
          suffix character
       | 
        | The encoder encodes the character before and after the word,
        | double-encoding that information into the previous and next
        | words; the decoder decreases the prefix count by one if it
        | matches the suffix of the previous word. Anything that doesn't
        | match a known word, or is weirdly capitalized, gets character-
        | encoded as a series of prefix-suffix characters with no word
        | body. (A separate channel for chars would probably be better.)
       | 
        | Each channel gets its own embedding table, and the embeddings
        | from all the channels are summed before being passed into the
        | transformer.
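        | 
        | Condensed into a sketch (the names are made up, and this is only
        | the rough shape of the scheme, not its exact code):
        | 
          import torch.nn as nn

          class ChannelEmbedding(nn.Module):
              """One embedding table per channel; the channel embeddings
              are summed before going into the transformer."""
              def __init__(self, d_model, n_words, n_chars, max_repeat=8):
                  super().__init__()
                  self.prefix_char = nn.Embedding(n_chars, d_model)
                  self.prefix_rep  = nn.Embedding(max_repeat, d_model)
                  self.word        = nn.Embedding(n_words, d_model)
                  self.caps_mode   = nn.Embedding(3, d_model)  # lower/Initial/ALL
                  self.suffix_char = nn.Embedding(n_chars, d_model)

              def forward(self, pc, pr, w, cm, sc):
                  return (self.prefix_char(pc) + self.prefix_rep(pr) +
                          self.word(w) + self.caps_mode(cm) +
                          self.suffix_char(sc))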
       | 
        | Decode is just a separate last-stage layer translating the final
        | embedding into channels. In hindsight it should have been a
        | little more than that, because if the options for the next word
        | were split between " YELLED!" and " whispered." my current
        | system could theoretically produce " WHISPERED.". In practice it
        | doesn't seem to do that much, but that means it has had to learn
        | something to deal with it (I suspect by limiting variation).
        | Adding a little smarts to the final tokenization step would
        | help: perhaps choose the word index first, then use the
        | embedding of the matched word to filter the predicted embedding
        | before calculating the other channels.
       | 
        | I have not yet done anything on breaking words themselves up. I
        | have been using TinyStories for training, so there hasn't been a
        | need for it with so few unique words. I have given it a lot of
        | thought though, and I think I would contest the 'Gold standard'
        | encodings. I think a word like 'nodular' should be encoded as
        | something like 'nodule' '<of or relating to modifier> <minus e>
        | ar'.
       | 
        | It's a little hard to imagine what this might look like for
        | other languages, but I think there are probably insights to be
        | had if you tried to make something that encoded English, Latin
        | and Kanji equally well.
       | 
        | I'd be curious to know how many homonyms there are in total
        | across languages, too. Just a single signal to say "this one is
        | a known homonym" would probably be beneficial. If the total
        | number is low enough, giving them their own token range might
        | work too.
        
       ___________________________________________________________________
       (page generated 2025-05-30 23:01 UTC)