[HN Gopher] Understanding GPT Tokenizers
___________________________________________________________________
Understanding GPT Tokenizers
Author : simonw
Score : 97 points
Date : 2023-06-08 20:40 UTC (2 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| mike_hearn wrote:
| A few extra notes on tokens.
|
| You don't have to use tiktoken if you aren't actually tokenizing
| things. The token lists are just text files in which each line is
| the token's text, base64 encoded, followed by its numeric ID. If
| you want to explore the lists you can just download and decode
| them yourself.
|
| I find that sorting tokens by length makes it a bit easier to get
| a feel for what's in there.
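|
| A rough sketch of doing that in Python, assuming the encoding
| file follows the base64-token-then-ID layout described above
| (the file name cl100k_base.tiktoken is just an example):
|
|     import base64
|
|     # Each line looks like "<base64 token> <rank>"
|     tokens = {}
|     with open("cl100k_base.tiktoken", "rb") as f:
|         for line in f:
|             if not line.strip():
|                 continue
|             b64_token, rank = line.split()
|             tokens[int(rank)] = base64.b64decode(b64_token)
|
|     # Sort by token length to get a feel for what's in there
|     for token in sorted(tokens.values(), key=len, reverse=True)[:25]:
|         print(token)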
|
| GPT-4 has a token vocabulary about twice the size of GPT-3.5.
|
| The most interesting thing to me about the GPT-4 token list is
| how dominated it is by non-natural languages. It's not as simple
| as English tokenizing more efficiently than Spanish because of
| frequency. The most common language after English is code. A huge
| number of tokens are allocated to even not very common things
| found in code, like "ValidateAntiForgeryToken" or
| "_InternalArray". From eyeballing the list I'd guess about half
| the tokens seem to be from source code.
|
| My guess is that it's not a coincidence that GPT-4 both trained
| on a lot of code and is also the leading model. I suspect we're
| going to discover at some point, or maybe OpenAI already did,
| that training on code isn't just a neat trick to get an LLM that
| can knock out scripts. Maybe it's fundamentally useful to train
| the model to reason logically and think clearly. The highly
| structured and unambiguous yet also complex thought that code
| represents is probably a great way for the model to really level
| up its thought processes. Ilya Sutskever mentioned in an
| interview that one of the bottlenecks they face on training
| something smarter than GPT-4 is getting access to "more complex
| thought". If this is true then it's possible the Microsoft
| collaboration will prove an enduring competitive advantage for
| OpenAI, as it gives them access to the bulk GitHub corpus which
| is probably quite hard to scrape otherwise.
| gwern wrote:
| Worth mentioning the many other consequences of BPE tokenization:
| gwern.net/gpt-3#bpes
| https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-...
| pmoriarty wrote:
| In the article on your blog, you wrote:
|
| _" GPT-3 rhymes reasonably well and often when appropriate,
| but the improvement is much smaller on rhyming than it is on
| pretty much everything else. Apparently it is easier for GPT-3
| to learn things like arithmetic and spreadsheets than it is to
| learn how to rhyme."_
|
| I've experimented extensively with Claude, and a bit with
| Claude+, ChatGPT (GPT-3.5) and GPT-4 on poe.com, and I've not
| had the slightest problem getting them to rhyme. However, once
| they've started writing rhyming poetry it's hard to get them to
| stop rhyming. They seem to have formed a strong association
| between rhyming and poetry. I've also been unable to get them
| to follow a specific rhyme scheme like ABBAB.
| burnished wrote:
| That seems incredibly challenging. I'd expect some fundamental
| difficulty, since rhyming is determined by how a word sounds
| and not by what it means.
| bluepoint wrote:
| So the space character is part of the token?
| minimaxir wrote:
| This can vary by BPE tokenizer. The original GPT-2/GPT-3
| tokenizer was weirder about it.
| simonw wrote:
| Yup. Most common words have several tokens: the word, the word
| with a capital letter, the word with a leading space, and
| sometimes the word all in caps too.
|
| Try searching for different words using the search box here:
| https://observablehq.com/@simonw/gpt-tokenizer#cell-135
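|
| A quick sketch of checking that with the tiktoken library (the
| specific IDs you get will depend on the encoding you pick):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     for variant in ["hello", " hello", "Hello", " Hello", "HELLO"]:
|         # each variant maps to its own token ID(s)
|         print(repr(variant), enc.encode(variant))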
| jiggawatts wrote:
| I wonder if the embeddings could be explicitly configured to
| account for these "symmetries". E.g.: instead of storing
| separate full copies of the "variants", maybe keep a reduced
| representation with a common prefix and only a small subset
| of the embedding vector that is allowed to be learned?
|
| This could force the model to correctly learn how to
| capitalise, make all-caps, etc...
| ywmario wrote:
| I have been under the impression that the embedding vector is
| the one that actually matters. The token is just another format.
| [deleted]
| hsjqllzlfkf wrote:
| Could anyone who's an expert comment on why there seems to be
| such a focus on discussing tokenizers? It seems every other day
| there's a new article or implementation of a tokenizer on HN.
| But downstream from that, rarely anything. As a non-expert I
| would have thought that tokenizing is just one step.
| SkyPuncher wrote:
| Tokens are the primitives that most LLMs (and broadly a lot of
| NLP) work with. While you and I would expect whole words to be
| tokens, many tokens are shorter - 3 to 4 characters - and don't
| always match the sentence structure you and I expect.
|
| This can create some interesting challenges and unexpected
| behavior. It also makes certain things, like vectorization, a
| challenge, since tokens may not map 1:1 with the words you
| intend to weight them against.
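|
| A tiny example of that mismatch, sketched with tiktoken (the
| exact splits depend on the encoding):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")
|     text = "Tokenization is counterintuitive"
|     ids = enc.encode(text)
|     print(len(text.split()), "words ->", len(ids), "tokens")
|     # show how the words were split into sub-word pieces
|     print([enc.decode([i]) for i in ids])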
| hsjqllzlfkf wrote:
| Your answer explains what tokenizers are, which isn't what I
| asked. You also told me something interesting about
| tokenizers, which is also not what I asked. Can you tell me
| anything NOT about tokenizers? That's my point.
| mike_hearn wrote:
| The reason it's not discussed much is that what goes on
| downstream of tokenization is extremely opaque. It's lots
| of layers of the transformer network so the overall
| structure is documented but what exactly those numbers mean
| is hard to figure out.
|
| There's an article here where the structure of an image
| generation network is explored a bit:
|
| https://openai.com/research/sparse-transformer
|
| They have a visualization of what the different layers are
| paying attention to.
|
| There are also some good explanations of transformers
| elsewhere online. This one is old but I found it helpful:
|
| http://jalammar.github.io/illustrated-transformer/
| hsjqllzlfkf wrote:
| This was my suspicion, thank you.
| thaumasiotes wrote:
| > While you and I would expect whole words to be tokens,
| many tokens are shorter - 3 to 4 characters - and don't
| always match the sentence structure you and I expect.
|
| There is a phenomenon called Broca's Aphasia which is,
| essentially, the inability to connect words into sentences.
| This mostly prevents the patient from communicating via
| language. But patients with this condition can reveal quite a
| bit about the structure of the language they can no longer
| speak.
|
| One example discussed in _The Language Instinct_ is someone
| who works at (and was injured at) a mill. He is unable to
| produce utterances that are more than one word long, though
| he seems to do well at understanding what people say to him.
| One of his single-word utterances, describing the mill where
| he works, is "Four hundred tons a day!".
|
| This is the opposite of what you describe, a single token
| that is longer than one word in the base language instead of
| being shorter. But it appears to be the same kind of thing.
|
| By the way, if you study a highly inflectional language such
| as Latin or Russian, you will lose the assumption that
| interpretive tokens should be whole words. You'd still expect
| them to align closely with sentence structure, though.
| whimsicalism wrote:
| You are using the word "vectorization" in an idiosyncratic way;
| are you referring to the process of embedding words?
| simonw wrote:
| I just think they're interesting.
|
| From a practical point of view they only really matter in that
| we have to think carefully about how to use our token budget.
| ftxbro wrote:
| The reason it's trending today is the phenomenon of Glitch
| Tokens. They thought all Glitch Tokens had been removed in
| GPT-4, but apparently one is still left. If you go down the
| rabbit hole on Glitch Tokens it gets ... really really weird.
| [deleted]
| ftxbro wrote:
| I just want to say I love your pet pelican names: Pelly, Beaky,
| SkyDancer, Scoop, and Captain Gulliver.
| simonw wrote:
| Captain Gulliver is genuinely an excellent name for a pelican!
| api wrote:
| Has anyone ever tried a GPT trained on, say, 256 tokens
| representing bytes in a byte stream - or, even more simply, on
| binary digits?
|
| I imagine there are efficiency trade-offs but I just wonder if it
| works at all.
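|
| The tokenizer side of that is trivial - every byte is already an
| integer in 0..255 - the trade-off is much longer sequences. A
| minimal sketch:
|
|     text = "Hello, world"
|     byte_tokens = list(text.encode("utf-8"))   # vocab size is just 256
|     print(byte_tokens)
|     print(len(byte_tokens), "byte tokens for", len(text.split()), "words")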
| sandinmyjoints wrote:
| Not a GPT, but I think Megabyte does that.
| simonw wrote:
| Here's the Observable notebook I built to explore how the
| tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
| jcims wrote:
| I really really wish someone would try tokenizing off of a
| phonetic representation rather than a textual one. I think it
| would be interesting to compare the output.
| bpiche wrote:
| spacy's sense2vec gets pretty close to that
|
| https://spacy.io/universe/project/sense2vec/
|
| granted, it is 8 years old, but it's still interesting
| ftxbro wrote:
| It doesn't matter. The 'bitter lesson', as coined by Rich
| Sutton, is that stacking more layers with more parameters,
| compute and dataset size is going to swamp any kind of clever
| 'feature engineering', like trying to be clever about phonetic
| tokens. Karpathy, for example, just wants to go back to byte
| tokens.
| nsinreal wrote:
| Yes, but how many extra layers and how much computing power do
| you need? Of course, phonetic tokens are an awkward idea, but
| there is a reason why the word "human" is encoded as only one
| token.
| spywaregorilla wrote:
| I don't think that is intuitive at all. "Clever feature
| engineering" like trying to create columns from calculations
| of tabular data, sure. You're not going to move the needle.
| But the basic representation of unstructured data like text
| could very believably alter the need for parameters, layers,
| and calculation speed by orders of magnitude.
| whimsicalism wrote:
| You would be wrong at the scales we are talking about.
|
| The whole point is that it is unintuitive.
| ftxbro wrote:
| > "I don't think that is intuitive at all."
|
| That's exactly the point. Every intuition is always on the
| side of feature engineering.
| sp332 wrote:
| Most current implementations can't count syllables at all, so
| it would get you at least that far.
| TechBro8615 wrote:
| Kudos to simonw for all the LLM content you've been publishing. I
| like reading your perspective and notes on your own learning
| experiences.
| throwaway2016a wrote:
| Pardon the n00b question, but...
|
| How does this relate to vectors? It was my understanding that
| the tokens were vectors, and this seems to show them as integers.
|
| It's probably a really obvious question to anyone who knows AI
| but I figured if I have it someone else does too.
| binarymax wrote:
| Very basic overview: a token is assigned a number, that number
| gets passed into the encoder model along with other token
| numbers, and the encoder model transforms those number sequences
| into embeddings (vectors).
| nighthawk454 wrote:
| Each token is an integer. The first layer of the model is an
| 'embedding', which is essentially a giant lookup table. So if a
| string gets tokenized to Token #3, that means get the vector in
| row 3 of the embedding table. (Those vectors are learned during
| model training.)
|
| More completely, you can think of the integers as being
| implicitly a one-hot vector encoding. So say you have a vocab
| size of 20,000 and you want Token #3. The one-hot vector would
| be a 20,000 length vector of zeros with a one in position 3.
| This vector is then multiplied against the embedding
| table/matrix. In practice this is equivalent to just selecting
| one row directly, so it's implemented as such and there's no
| reason to explicitly build the large one-hot vectors.
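|
| A toy numpy sketch of that equivalence (the sizes here are made
| up):
|
|     import numpy as np
|
|     vocab_size, embed_dim = 20000, 512
|     rng = np.random.default_rng(0)
|     # stand-in for an embedding table learned during training
|     embedding = rng.normal(size=(vocab_size, embed_dim))
|
|     token_id = 3
|     one_hot = np.zeros(vocab_size)
|     one_hot[token_id] = 1.0
|
|     via_matmul = one_hot @ embedding     # explicit one-hot multiply
|     via_lookup = embedding[token_id]     # direct row selection
|     assert np.allclose(via_matmul, via_lookup)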
| z3c0 wrote:
| Each integer represents a position within a vector of "all known
| tokens". Typically, following a simple bag-of-words approach,
| each position in the vector would be toggled to 1 or 0 based on
| the presence of a token in a given document. Since most such
| vectors would be almost completely zeroed out, the simpler way
| to represent them is as a list of positions in the now-abstracted
| vector - aka a sparse vector, i.e. a list of integers.
|
| In the case of more advanced language models like LLMs, a given
| token can be paired with many other features of the token (such
| as dependencies or parts-of-speech) to make an integer
| represent one of many permutations on the same word based on
| its usage.
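|
| A toy illustration of that dense-versus-sparse bag-of-words
| representation (the vocabulary here is made up):
|
|     vocab = ["the", "cat", "sat", "on", "mat", "dog"]
|     doc = "the cat sat on the mat".split()
|
|     dense = [1 if word in doc else 0 for word in vocab]
|     sparse = [i for i, word in enumerate(vocab) if word in doc]
|     print(dense)   # [1, 1, 1, 1, 1, 0]
|     print(sparse)  # [0, 1, 2, 3, 4]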
___________________________________________________________________
(page generated 2023-06-08 23:00 UTC)