[HN Gopher] Automatically Detecting Under-Trained Tokens in Larg...
___________________________________________________________________
Automatically Detecting Under-Trained Tokens in Large Language
Models
Author : veryluckyxyz
Score : 160 points
Date : 2024-05-12 06:46 UTC (16 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| helsinkiandrew wrote:
| Good Computerphile video on glitch tokens a year ago:
|
| https://www.youtube.com/watch?v=WO2X3oZEJOA
| varelaz wrote:
| This video somehow looks even more interesting than the pre-
| print of the article
| 3abiton wrote:
| It describes the problem in a nicer way tbh. I haven't yet
| read the preprint but the video is neat.
| djamconway wrote:
| Amazing name for the paper
| ukuina wrote:
| Full title is: "Fishing for Magikarp: Automatically Detecting
| Under-trained Tokens in Large Language Models"
| anewhnaccount3 wrote:
| Isn't the solution to just train the tokeniser on the same corpus
| as the LLM? I'm not sure why reusing tokenisers is so common.
| Anybody know?
| bjornsing wrote:
| From the abstract I get the feeling these techniques are useful
| when you don't have access to the corpus, as e.g. in the case
| where you download some open source weights but the corpus is
| secret. Otherwise I don't understand why you wouldn't just
| compute a histogram over the tokens in (a statistical sample
| of) the corpus.
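|
| A rough sketch of what I mean (the tokeniser and the toy
| "sample" below are just placeholders for the real ones):
|
|     from collections import Counter
|     from transformers import AutoTokenizer
|
|     tokenizer = AutoTokenizer.from_pretrained("gpt2")
|
|     def token_histogram(docs):
|         counts = Counter()
|         for doc in docs:
|             counts.update(tokenizer.encode(doc))
|         return counts
|
|     # stand-in for a statistical sample of the corpus
|     sample = ["The quick brown fox jumps over the lazy dog."]
|     counts = token_histogram(sample)
|     unseen = sum(counts[t] == 0 for t in range(tokenizer.vocab_size))
|     print(unseen, "of", tokenizer.vocab_size,
|           "token ids never occur in the sample")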
| dTal wrote:
| > open source weights but the corpus is secret
|
| This is oxymoronic; the corpus _is_ the "source". Yet this
| usage of "open source" is widespread. Maybe we should start
| calling such models by their rightful name, "freeware".
| inbetween wrote:
| Freeware versus open source is a good point. But freeware
| typically can't be modified by the recipient, whereas
| downloadable models and open source code can. So I think
| there's still a need for a different term, neither open
| source nor freeware...
| sdenton4 wrote:
| No, the corpus is not the source. It's data. So we can have
| concepts of open models, open source, and open data. Any
| combination of these can be chosen independently.
|
| (Open data and open model but not open source is a bit
| weird, but not unthinkable: there may be unreleased
| training tricks or specialized infrastructure such that the
| source code release is hard or undesirable.)
| karpathy wrote:
| The paper mentions some reasons why these quick-fix ideas are
| not as simple as they sound. For example, many rare tokens are
| "intermediate" merges inside the BPE algorithm: shorter
| prefixes of longer words. The long word is common, but its
| earlier, intermediate merge rarely appears by itself.
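|
| A quick way to get a feel for the scale of this (a rough
| sketch, using the GPT-2 vocab as an example): count how many
| vocabulary entries are strict prefixes of some longer entry,
| i.e. candidates for such intermediate merges.
|
|     from transformers import AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     vocab = sorted(tok.get_vocab())   # byte-level token strings
|
|     # in sorted order, a token is a strict prefix of some longer
|     # token iff the entry right after it starts with it
|     n_prefix = sum(b.startswith(a)
|                    for a, b in zip(vocab, vocab[1:]))
|     print(n_prefix, "of", len(vocab),
|           "tokens are prefixes of longer tokens")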
| kacperlukawski wrote:
| Are there any specific reasons for using BPE, not Unigram,
| in LLMs? I've been trying to understand the impact of the
| tokenization algorithm, and Unigram was reported to be a
| better alternative (e.g., Byte Pair Encoding is Suboptimal
| for Language Model Pretraining:
| https://arxiv.org/abs/2004.03720). I understand that the
| unigram training process should eliminate under-trained
| tokens if trained on the same data as the LLM itself.
| sp332 wrote:
| Sure, but if your corpus is very large, that's not feasible.
| swhan wrote:
| Tokenizer training doesn't scale as well as model training,
| so the general practice is to train it on a subset of the
| full corpus.
| ubutler wrote:
| I've trained tokenizers on medium-sized datasets (5GB+ of
| text, although that could be considered small or large
| depending on who you ask) and have always found training
| quite fast. As in, it takes a couple of minutes.
|
| Maybe if we're talking terabytes it might not scale as well
| but so far in my experience training tokenizers has never
| been an issue. It's training models that takes ages.
| kacperlukawski wrote:
| Why is that an issue? Training the tokenizer seems much
| more straightforward than training the model, as it is based
| on the statistics of the input data. I guess it may take a
| while for massive datasets, but is calculating the
| frequencies really infeasible at a larger scale?
| yorwba wrote:
| I think people usually start out wanting to use the same corpus
| for their tokenizer and for the LLM, but after training the
| tokenizer and while testing the LLM they discover that parts of
| the corpus are useless garbage (no offense to
| SolidGoldMagikarp's efforts on the counting subreddit), so those
| get excluded from further training. By that point, though, the
| tokenizer has become part of the API, and replacing it with a
| new version would break other things, so the superfluous tokens
| stay in the tokenizer vocabulary.
| ubutler wrote:
| There are two reasons I can think of why someone might reuse a
| tokeniser:
|
| 1. They want to continue pretraining a model instead of
| starting from scratch. But actually people might not know that
| you can pretty easily reuse model weights even when training
| with a new tokeniser (I've got a blog post on how to do that:
| https://umarbutler.com/how-to-reuse-model-weights-when-train...
| ).
|
| 2. Because it's convenient for end users. Tokenising and
| chunking really large corpora can take a long time, and it's
| nice that I can use the GPT2 tokeniser once and then train a
| bunch of different models on that data without having to
| retokenise everything (rough sketch of this below).
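|
| A sketch of point 2 (file name and chunk length are made up;
| any tokeniser works as long as every model reuses it):
|
|     import numpy as np
|     from transformers import AutoTokenizer
|
|     tokenizer = AutoTokenizer.from_pretrained("gpt2")
|     CHUNK_LEN = 1024
|
|     def tokenise(docs):
|         ids = []
|         for doc in docs:
|             ids.extend(tokenizer.encode(doc))
|             ids.append(tokenizer.eos_token_id)  # doc separator
|         return np.array(ids, dtype=np.uint16)   # GPT-2 ids fit
|
|     docs = ["first document ...", "second document ..."]
|     np.save("corpus_gpt2_ids.npy", tokenise(docs))
|
|     # every later training run just loads and chunks the ids,
|     # no retokenising needed
|     ids = np.load("corpus_gpt2_ids.npy")
|     n = len(ids) // CHUNK_LEN
|     chunks = ids[:n * CHUNK_LEN].reshape(n, CHUNK_LEN)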
| sebzim4500 wrote:
| On top of what everyone else has said, even if you are able to
| train your tokenizer on exactly your training dataset it
| wouldn't remove all these issues.
|
| The way BPE works, you can end up with very rare tokens if they
| get merged with another token. Imagine you have tokens X and Y,
| and it happens that almost every X is followed by Y. Then the
| BPE process would make a new token XY but wouldn't remove the
| old token, which would now be undertrained.
|
| I guess to solve this we'd need to use a more sophisticated
| merging algorithm than the greedy one.
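|
| A toy greedy BPE run (my own sketch, not the paper's code)
| showing exactly that: the corpus is a thousand "ab"s plus one
| lone "a", so after the merge ("a","b") -> "ab" the tokens "a"
| and "b" stay in the vocab but barely occur in the tokenised
| corpus any more.
|
|     from collections import Counter
|
|     def merge_pair(seq, a, b):
|         out, i = [], 0
|         while i < len(seq):
|             if seq[i:i+2] == [a, b]:
|                 out.append(a + b); i += 2
|             else:
|                 out.append(seq[i]); i += 1
|         return out
|
|     def train_bpe(corpus, num_merges):
|         seqs = [list(s) for s in corpus]
|         vocab = {t for s in seqs for t in s}
|         for _ in range(num_merges):
|             pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
|             if not pairs:
|                 break
|             (a, b), _ = pairs.most_common(1)[0]
|             vocab.add(a + b)   # the old tokens a and b are kept
|             seqs = [merge_pair(s, a, b) for s in seqs]
|         return vocab, seqs
|
|     corpus = ["ab"] * 1000 + ["a"]
|     vocab, seqs = train_bpe(corpus, num_merges=1)
|     counts = Counter(t for s in seqs for t in s)
|     print(sorted(vocab))   # ['a', 'ab', 'b']
|     print(counts["ab"], counts["a"], counts["b"])   # 1000 1 0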
| londons_explore wrote:
| We shouldn't just be looking for under-trained tokens. Tokens are
| effectively the first layer of the network, but we should also be
| looking for training data imbalances at every weight at every
| other layer of the network.
|
| When we find them, it might be best to delete weights with hardly
| any data flowing through them (which might make the model smaller
| or help generalisation).
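|
| A very rough sketch of that last idea (pruning units with
| hardly any data flowing through them), with random inputs
| standing in for real activations from a corpus sample and an
| arbitrary threshold:
|
|     import torch
|     import torch.nn as nn
|
|     layer, act = nn.Linear(512, 2048), nn.ReLU()
|
|     with torch.no_grad():
|         x = torch.randn(10_000, 512)   # corpus-sample stand-in
|         usage = act(layer(x)).abs().mean(dim=0)  # per hidden unit
|         dead = usage < 1e-6             # "hardly any data flow"
|         # structured prune: zero those units' weights and biases
|         layer.weight[dead] = 0.0
|         layer.bias[dead] = 0.0
|     print("pruned", int(dead.sum()), "of", dead.numel(), "units")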
| dissuade wrote:
| We can already compress and/or merge holomorphic models.
| 65a wrote:
| I find it hard to believe that a Canadian company's model
| contained an undertrained token related to hockey (albeit in
| German). In all seriousness, this is pretty cool, and I'm
| excited to see understanding of tokenization's impact on models
| improve. One notable finding is that a lot of the earlier open
| source models have issues with carriage returns, which are
| introduced fairly often depending on where the data is coming
| from, etc.
| esafak wrote:
| There is a diagnostic of training quality, derived from random
| matrix theory, that relies on the spectral density of the
| correlation matrix of the weights. Each layer's spectral density
| is fit to a truncated power law, and the layer is deemed
| properly trained if the power-law exponent alpha is just above
| two.
|
| https://jmlr.org/beta/papers/v22/20-410.html
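|
| Roughly, the recipe looks like this (my own simplified sketch
| of the linked method, with a random matrix standing in for a
| trained layer and a crude tail cutoff instead of their fitting
| procedure):
|
|     import numpy as np
|
|     W = np.random.randn(768, 3072) / np.sqrt(768)   # stand-in
|     eigs = np.linalg.eigvalsh(W @ W.T / W.shape[1])  # corr. matrix
|     eigs = eigs[eigs > 0]
|
|     lam_min = np.quantile(eigs, 0.5)        # crude tail cutoff
|     tail = eigs[eigs >= lam_min]
|     # Hill / maximum-likelihood estimate of the power-law tail
|     alpha = 1.0 + len(tail) / np.sum(np.log(tail / lam_min))
|     print("alpha =", round(float(alpha), 2))
|     # alpha just above 2 would indicate a properly trained layer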
___________________________________________________________________
(page generated 2024-05-12 23:00 UTC)