[HN Gopher] Automatically Detecting Under-Trained Tokens in Larg...
       ___________________________________________________________________
        
       Automatically Detecting Under-Trained Tokens in Large Language
       Models
        
       Author : veryluckyxyz
       Score  : 160 points
       Date   : 2024-05-12 06:46 UTC (16 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | helsinkiandrew wrote:
       | Good Computerphile video on glitch tokens a year ago:
       | 
       | https://www.youtube.com/watch?v=WO2X3oZEJOA
        
         | varelaz wrote:
          | This video somehow looks even more interesting than the
          | preprint itself.
        
           | 3abiton wrote:
           | It describes the problem in a nicer way tbh. I haven't yet
           | read the preprint but the video is neat.
        
       | djamconway wrote:
       | Amazing name for the paper
        
         | ukuina wrote:
         | Full title is: "Fishing for Magikarp: Automatically Detecting
         | Under-trained Tokens in Large Language Models"
        
       | anewhnaccount3 wrote:
       | Isn't the solution to just train the tokeniser on the same corpus
       | as the LLM? I'm not sure why reusing tokenisers is so common.
       | Anybody know?
        
         | bjornsing wrote:
         | From the abstract I get the feeling these techniques are useful
         | when you don't have access to the corpus, as e.g. in the case
         | where you download some open source weights but the corpus is
         | secret. Otherwise I don't understand why you wouldn't just
         | compute a histogram over the tokens in (a statistical sample
         | of) the corpus.
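          | 
          | Something like this rough sketch is what I have in mind (the
          | GPT-2 tokenizer and the sample file name are just placeholders
          | for the model's own tokenizer and corpus sample):
          | 
          |     from collections import Counter
          |     from transformers import AutoTokenizer
          | 
          |     # stand-in for the model's actual tokenizer
          |     tok = AutoTokenizer.from_pretrained("gpt2")
          |     counts = Counter()
          |     # a statistical sample of the (possibly secret) corpus
          |     with open("corpus_sample.txt") as f:
          |         for line in f:
          |             counts.update(tok.encode(line))
          | 
          |     # ids that never show up are candidates for under-training
          |     unseen = [i for i in range(tok.vocab_size) if counts[i] == 0]
          |     print(len(unseen), "of", tok.vocab_size, "ids never appear")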
        
           | dTal wrote:
           | > open source weights but the corpus is secret
           | 
            | This is oxymoronic; the corpus _is_ the "source". Yet this
           | usage of "open source" is widespread. Maybe we should start
           | calling such models by their rightful name, "freeware".
        
             | inbetween wrote:
             | Freeware versus open source is a good point. But freeware
             | typically can't be modified by the recipient, whereas
             | downloadable models and open source code can. So I think
             | there's still a need for a different term, neither open
             | source nor freeware...
        
             | sdenton4 wrote:
             | No, the corpus is not the source. It's data. So we can have
             | concepts of open models, open source, and open data. Any
             | combination of these can be chosen independently.
             | 
             | (Open data and open model but not open source is a bit
             | weird, but not unthinkable: there may be unreleased
             | training tricks or specialized infrastructure such that the
             | source code release is hard or undesirable.)
        
           | karpathy wrote:
            | The paper mentions some reasons why these quick-fix ideas are
            | not as simple as they sound. For example, many rare tokens
            | are "intermediate" merges inside the BPE algorithm: shorter
            | prefixes of longer words. The long word is common, but its
            | earlier, intermediate merge rarely shows up on its own.
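            | 
            | A toy way to see this (a made-up corpus, not the paper's
            | setup) is to train a tiny BPE tokenizer and then count which
            | vocab entries are actually emitted when encoding the corpus:
            | 
            |     from collections import Counter
            |     from tokenizers import (Tokenizer, models,
            |                             pre_tokenizers, trainers)
            | 
            |     # toy corpus where one long word dominates
            |     corpus = ["solidgoldmagikarp"] * 1000 + ["magikarp"] * 5
            | 
            |     tokenizer = Tokenizer(models.BPE())
            |     tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
            |     trainer = trainers.BpeTrainer(vocab_size=40)
            |     tokenizer.train_from_iterator(corpus, trainer)
            | 
            |     counts = Counter()
            |     for line in corpus:
            |         counts.update(tokenizer.encode(line).tokens)
            | 
            |     # intermediate merges (prefixes of the long word) sit in
            |     # the vocab but are never emitted in the final encoding
            |     for token in tokenizer.get_vocab():
            |         if counts[token] == 0:
            |             print("never emitted:", repr(token))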
        
             | kacperlukawski wrote:
             | Are there any specific reasons for using BPE, not Unigram,
             | in LLMs? I've been trying to understand the impact of the
             | tokenization algorithm, and Unigram was reported to be a
             | better alternative (e.g., Byte Pair Encoding is Suboptimal
             | for Language Model Pretraining:
             | https://arxiv.org/abs/2004.03720). I understand that the
             | unigram training process should eliminate under-trained
             | tokens if trained on the same data as the LLM itself.
        
         | sp332 wrote:
         | Sure, but if your corpus is very large, that's not feasible.
        
           | swhan wrote:
           | Tokenizer training doesn't scale as well as model training,
           | so general practice is to train on a subset of the full
           | corpus.
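            | 
            | With the Hugging Face tokenizers library, a rough sketch of
            | that practice (the file name and sampling rate are made up)
            | looks like:
            | 
            |     import random
            |     from tokenizers import Tokenizer, models, trainers
            |     from tokenizers import pre_tokenizers
            | 
            |     # keep roughly 1% of corpus lines for tokenizer training
            |     def sample_lines(path, keep=0.01):
            |         with open(path) as f:
            |             for line in f:
            |                 if random.random() < keep:
            |                     yield line
            | 
            |     tokenizer = Tokenizer(models.BPE())
            |     tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
            |     trainer = trainers.BpeTrainer(vocab_size=32000)
            |     tokenizer.train_from_iterator(sample_lines("corpus.txt"),
            |                                   trainer)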
        
             | ubutler wrote:
              | I've trained tokenizers on medium-sized datasets (5+ GB of
              | text, although that could be considered small or large
              | depending on who you ask) and have always found training
              | quite fast. As in, it takes a couple of minutes.
              | 
              | Maybe if we're talking terabytes it might not scale as
              | well, but so far in my experience training tokenizers has
              | never been an issue. It's training models that takes ages.
        
             | kacperlukawski wrote:
             | Why is that an issue? Training the tokenizer seems much
             | more straightforward than training the model as it is based
             | on the statistics of the input data. I guess it may take a
             | while for massive datasets, but is calculating the
             | frequencies impossible to be done on a bigger scale?
        
         | yorwba wrote:
         | I think people usually start out wanting to use the same corpus
         | for their tokenizer and for the LLM, but after training the
         | tokenizer and while testing the LLM they discover that parts of
         | the corpus are useless garbage (no offense to
         | SolidGoldMagikarp's efforts on the counting subreddit) so those
         | get excluded from further training, but at this point the
         | tokenizer has become part of the API and replacing it with a
         | new version would break other things, so the superfluous tokens
         | stay in the tokenizer vocabulary.
        
         | ubutler wrote:
         | There are two reasons I can think of why someone might reuse a
         | tokeniser:
         | 
          | 1. They want to continue pretraining a model instead of
          | starting from scratch. But actually people might not know that
          | you can pretty easily reuse model weights even when training
          | with a new tokeniser, roughly along the lines of the sketch at
          | the end of this comment (I've got a blog post on how to do
          | that: https://umarbutler.com/how-to-reuse-model-weights-when-train...
          | ).
         | 
         | 2. Because it's convenient for end users. Tokenising and
         | chunking really large corpora can take a long time and it's
         | nice that I can use the GPT2 tokeniser and then train a bunch
         | of different models on that data without having to retokenise
         | everything.
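          | 
          | For (1), one rough way to do it (placeholder names; just a
          | sketch, not necessarily what the post does) is to copy over the
          | embeddings of whichever tokens the old and new vocabularies
          | share:
          | 
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     old_tok = AutoTokenizer.from_pretrained("gpt2")
          |     new_tok = AutoTokenizer.from_pretrained("./my-new-tokeniser")
          |     model = AutoModelForCausalLM.from_pretrained("gpt2")
          | 
          |     old_emb = model.get_input_embeddings().weight.data
          |     # fall back to the mean embedding for brand-new tokens
          |     new_emb = old_emb.mean(dim=0).repeat(len(new_tok), 1)
          | 
          |     old_vocab = old_tok.get_vocab()
          |     for token, new_id in new_tok.get_vocab().items():
          |         if token in old_vocab:
          |             new_emb[new_id] = old_emb[old_vocab[token]]
          | 
          |     model.resize_token_embeddings(len(new_tok))
          |     model.get_input_embeddings().weight.data.copy_(new_emb)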
        
         | sebzim4500 wrote:
          | On top of what everyone else has said, even if you are able to
          | train your tokenizer on exactly your training dataset, it
          | wouldn't remove all these issues.
         | 
          | The way BPE works, you can end up with very rare tokens when
          | they get merged into another token. Imagine you have tokens X
          | and Y, and it happens that almost every X is followed by Y.
          | Then the BPE process would make a new token XY but wouldn't
          | remove the old token X, which would now be undertrained.
         | 
         | I guess to solve this we'd need to use a more sophisticated
         | merging algorithm than the greedy one.
        
       | londons_explore wrote:
        | We shouldn't just be looking for under-trained tokens. The token
        | embeddings are effectively the first layer of the network, but
        | we should also be looking for training-data imbalances at every
        | weight in every other layer of the network.
       | 
       | When we find them, it might be best to delete weights with hardly
       | any data flowing through them (which might make the model smaller
       | or help generalisation).
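        | 
        | Something like this crude sketch (random stand-in data and an
        | arbitrary threshold) is roughly what I have in mind:
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(),
        |                           nn.Linear(1024, 512))
        |     usage = torch.zeros(1024)
        | 
        |     # accumulate how much signal each hidden unit carries
        |     def hook(module, inputs, output):
        |         usage.add_(output.abs().mean(dim=0))
        | 
        |     handle = model[1].register_forward_hook(hook)
        |     for _ in range(100):                # stand-in for real data
        |         model(torch.randn(32, 512))
        |     handle.remove()
        | 
        |     quiet = usage < 0.01 * usage.mean() # arbitrary threshold
        |     with torch.no_grad():
        |         model[0].weight[quiet] = 0      # rows feeding quiet units
        |         model[2].weight[:, quiet] = 0   # columns reading them
        |     print(int(quiet.sum()), "units pruned")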
        
         | dissuade wrote:
         | We can already compress and/or merge holomorphic models.
        
       | 65a wrote:
        | I find it hard to believe that a Canadian company's model
        | contained an undertrained token related to hockey (albeit in
        | German). In all seriousness, this is pretty cool, and I'm
        | excited to see our understanding of tokenization's impact on
        | models improve. One notable finding is that a lot of the earlier
        | open source models have issues with carriage returns, which are
        | not that uncommonly introduced, depending on where the data is
        | coming from, etc.
        
       | esafak wrote:
        | There is a diagnostic of training, derived from random matrix
        | theory, that relies on the spectral density of the correlation
        | matrix of the weights. Each layer's spectral density is fit to a
        | truncated power law, and the layer is deemed properly trained if
        | the power-law exponent alpha is just above two.
       | 
        | https://jmlr.org/beta/papers/v22/20-410.html
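        | 
        | For a single weight matrix, the flavour of it is something like
        | this (random W as a stand-in, and a much cruder fit than the
        | paper's truncated power law):
        | 
        |     import numpy as np
        | 
        |     W = np.random.randn(4096, 1024) / np.sqrt(1024)
        |     # eigenvalues of the correlation matrix W^T W / N
        |     eigs = np.linalg.svd(W, compute_uv=False) ** 2 / W.shape[0]
        | 
        |     # continuous power-law MLE on the tail above x_min
        |     x_min = np.quantile(eigs, 0.5)    # crude choice of x_min
        |     tail = eigs[eigs >= x_min]
        |     alpha = 1 + len(tail) / np.log(tail / x_min).sum()
        |     print(f"alpha = {alpha:.2f}")     # ~2 would mean well trained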
        
       ___________________________________________________________________
       (page generated 2024-05-12 23:00 UTC)