[HN Gopher] OpenAI Tokenizer
___________________________________________________________________
OpenAI Tokenizer
Author : tosh
Score : 226 points
Date : 2023-04-05 13:00 UTC (10 hours ago)
(HTM) web link (platform.openai.com)
(TXT) w3m dump (platform.openai.com)
| gnrlst wrote:
| Completely useless, but I was curious about the token indexes. I
| tried to look for Token #0. After a couple minutes of trial and
| error, it turns out it's the exclamation mark.
| swyx wrote:
| how does one "look" for a token? there isn't a lookup table
| somewhere?
| v4dok wrote:
| It completely butchers Greek. No wonder it charges so much for
| so little output. Every Greek character is a token.
|
| I wonder if there is space for innovation there. I would imagine
| that it is similarly difficult for other non-English languages as
| well. I fear for the effect this will have on them.
| goldfeld wrote:
| There is a market opportunity here for a GPTesque thinking
| machine who actually masters and knows their greek ancients
| well. I knew it was a lack of refined Platonic understanding
| when ChatGPT said it could not comment further on the Russian
| war.
| 0xDEF wrote:
| It's crazy that OpenAI hasn't fixed their tokenizer yet. They
| are leaving the door wide open for some Chinese big tech
| company to capture the non-Latin script parts of the world.
|
| i18n (and accessibility) was something American tech companies
| were serious about in the 90s and early 2000s. That is how they
| captured most of the global market. US tech dropping the ball
| on this leaves the door wide open for Chinese competitors.
| hombre_fatal wrote:
| Do OpenAI's tokenizer issues cash out into worse results for
| Greek, rather than just being more expensive for gpt-4?
| (gpt-3.5-turbo already costs peanuts)
|
| If not, then this response seems overblown. The competitive
| advantage in LLMs at this point probably isn't tokenizer
| optimizations, but having results worth a damn.
| v4dok wrote:
| The usability is worse. Token limits are so much easier to
| reach. It's like using a model with dementia.
| Fauntleroy wrote:
| It could be that they are actively working on this problem,
| but the product has not yet been released.
| swyx wrote:
| " SolidGoldMagikarp"
|
| Characters: 18
|
| Tokens: 1
|
| heh. all i know is this is a fun magic token but 1) i dont really
| know how they found this and 2) i dont know what its implications
| are. i heard that you can use it to detect if you are talking to
| an AI.
| RugnirViking wrote:
| it comes from this blog post:
| https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
| kevingadd wrote:
| Some of the magic tokens are related to Twitch Plays Pokemon.
| https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
| swyx wrote:
| hmm. so this is evidence that openai scraped twitch chat of
| all places? (notoriously ephemeral)
|
| also opens a question as to how tokenizers are trained.
| should you discard or break up super niche words like this?
| squeaky-clean wrote:
| It doesn't necessarily mean it scraped twitch chat. That is
| the name of a moderator. They also moderate the subreddit
| and probably some other places. And being a moderator for
| such a popular event they probably had their name mentioned
| in other places as well. Every time they comment on Reddit
| their username would also appear.
|
| https://www.reddit.com/r/twitchplayspokemon/comments/2cxkpp
| /...
| Ari_Rahikkala wrote:
| "They" as in OpenAI, when they trained the tokenizer, just
| dumped a big set of text data into a BPE (byte pair encoding)
| tokenizer training script, and it saw that string in the data
| so many times that it ended up making a token for it.
|
| "They" as in the rest of us afterward... probably just looked
| at the token list. It's a little over fifty thousand items,
| mostly short words and fragments of words, and can be fun to
| explore.
|
| The GPT-2 and GPT-3 models proper were trained on different
| data than the tokenizer they use, one of the major differences
| being that some strings (like " SolidGoldMagikarp") showed up
| very rarely in the data that the model saw. As a result, the
| models can respond to the tokens for those strings a bit
| strangely, which is why they're called "glitch tokens". From
| what I've seen, the base models tend to just act as if the
| glitch token wasn't there, but instruction-tuned models can act
| in weirdly deranged ways upon seeing them.
|
| The lesson to learn overall AIUI is just that you should train
| your tokenizer and model on the same data. But (also AIUI - we
| don't know what OpenAI actually did) you can also simply just
| remove the glitch tokens from your tokenizer, and it'll just
| encode the string into a few more tokens afterward. The model
| won't ever have seen that specific sequence, but it'll at least
| be familiar with all the tokens in it, and unlike never-before-
| seen single tokens, it's quite used to dealing with never-
| before-seen sentences.
| nickvincent wrote:
| I think it's related to Reddit users who posted (very
| frequently!) on a counting focused subreddit (people literally
| post "1", "2" , "3" in sequence so usernames appear 50k+
| times). Some screenshots and links in this Twitter thread:
| https://twitter.com/SoC_trilogy/status/1623118034960322560
|
| Plus additional commentary here:
| https://twitter.com/nickmvincent/status/1623409493584519168 (in
| short: I think this situation is comparable to a "Trap Street"
| https://en.wikipedia.org/wiki/Trap_street that reveals when a
| map seller copies another cartographer)
|
| I hadn't seen the Twitch plays pokemon hypothesis though (from
| another comment here), I wonder if it could be both!
| SirMaster wrote:
| rawdownloadcloneembedreportprint
|
| Tokens 1 Characters 32
|
| Weird...
| stackedinserter wrote:
| "not_a_word" token
| dschnurr wrote:
| Hi folks - I work at OpenAI and helped build this page, awesome
| to see it on here! Heads up that it's a bit out of date as GPT4
| has a different tokenizer than GPT3. I'd recommend checking out
| tiktoken (https://github.com/openai/tiktoken) or this other
| excellent app that a community member made
| (https://tiktokenizer.vercel.app)
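|
| A quick sketch of picking the matching tokenizer per model with
| tiktoken (assuming the package is installed; the model names are
| the public API names, and the token counts will differ between
| the older and newer encodings):
|
|     import tiktoken
|
|     text = "OpenAI Tokenizer"
|
|     # encoding_for_model returns the encoding a model uses
|     for model in ("davinci", "gpt-3.5-turbo", "gpt-4"):
|         enc = tiktoken.encoding_for_model(model)
|         print(model, enc.name, len(enc.encode(text)))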
| teruakohatu wrote:
| That is very helpful, thank you. I had not realised the latest
| models were now tokenizing numbers as 3-digit groups. Can you
| give any insight into why 3 digits?
| ubj wrote:
| > A helpful rule of thumb is that one token generally corresponds
| to ~4 characters of text for common English text. This translates
| to roughly 3/4 of a word (so 100 tokens ~= 75 words).
|
| Just for fun I tried entering in
| "pneumonoultramicroscopicsilicovolcanoconiosis" and
| "antidisestablishmentarianism". The first was pretty evenly split
| into tokens of length 1-5 characters, but the second put all of
| "establishment" into a single token.
|
| No useful conclusions drawn, but it was an interesting test.
| luke_cq wrote:
| I desperately want to be able to get a concrete token count for
| my prompt before making a call - things like this make it very
| hard to request the right max_tokens for longer
| prompt/generation pairs.
| sharkjacobs wrote:
| https://github.com/openai/tiktoken
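|
| As a sketch of the idea (assuming tiktoken is installed; the
| 8192-token window and the prompt are made-up values, and this
| ignores the extra per-message tokens that the chat formats add):
|
|     import tiktoken
|
|     CONTEXT_WINDOW = 8192          # assumed 8k-context model
|     prompt = "Summarize the following text: ..."
|
|     enc = tiktoken.encoding_for_model("gpt-4")
|     prompt_tokens = len(enc.encode(prompt))
|
|     # leave whatever is left of the window for the completion
|     max_tokens = CONTEXT_WINDOW - prompt_tokens
|     print(prompt_tokens, max_tokens)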
| [deleted]
| tosh wrote:
| It is very interesting to compare how various languages
| (including programming languages) are tokenized.
| jonathankoren wrote:
| It looks like this thing doesn't tokenize into anything with any
| semantic meaning, but rather just a sequence of bytes that match
| some sort of information-theoretic criterion. It doesn't appear
| to have any linguistic (written or verbal) pattern. I guess it's
| fine for their specific use case, but whatever.
|
| Tokenization is such a basic and domain specific operation, it
| feels like someone had to demo _something_.
|
| Bonus (code) points for just saying "fuck it" on emojis. They
| didn't even split it into code points.
| darknoon wrote:
| We noticed that this webpage is out of date for recent models so
| we (diagram.com) commissioned a better one that lets you pick any
| of OpenAI's models including chats:
|
| https://tiktokenizer.vercel.app/
| JonathanFly wrote:
| Wow, I can't thank you enough for this. Somehow I never noticed
| that GPT-4 doesn't use separate tokens for each tab like 3.5. I
| was wasting a lot of time minimizing excessively tabbed code to
| save on the token count! Like seriously way too much time, all
| based on a bad assumption.
|
| https://twitter.com/jonathanfly/status/1643633463260577794
| m3kw9 wrote:
| It's made so people don't game the system. Say "hello world"
| would be the same as "hello_world" for the LLM. If they didn't
| count tokens by letter, I would be using it for free.
| [deleted]
| coffeeri wrote:
| OpenAI seems to use Tiktoken [0]. It also covers GPT-4 token
| encoding.
|
| [0] https://github.com/openai/tiktoken
| lspears wrote:
| Seems odd they don't reference this on the page. Instead they
| list:
|
| "If you need a programmatic interface for tokenizing text,
| check out the transformers package for python or the
| gpt-3-encoder package for node.js."
|
| with the links:
|
| https://huggingface.co/docs/transformers/model_doc/gpt2#tran...
|
| https://www.npmjs.com/package/gpt-3-encoder
| [deleted]
| brianbest101 wrote:
| [dead]
| jscheel wrote:
| Tiktoken is pretty nice. I've been exposing it as an internal
| service in our infrastructure so that we can get token counts
| easily. The bigger problem is figuring out how to chunk longer
| contexts so that you stay within the context window limit defined
| by the model you are using.
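|
| A naive chunking sketch along those lines (assuming tiktoken;
| the 3,000-token chunk size and model name are arbitrary, and it
| cuts on token boundaries rather than sentence boundaries):
|
|     import tiktoken
|
|     def chunk_text(text, max_tokens=3000, model="gpt-3.5-turbo"):
|         # split text into pieces of at most max_tokens tokens
|         enc = tiktoken.encoding_for_model(model)
|         tokens = enc.encode(text)
|         return [
|             enc.decode(tokens[i:i + max_tokens])
|             for i in range(0, len(tokens), max_tokens)
|         ]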
| alli_star wrote:
| [dead]
| otaviogood wrote:
| '1984 is 1 token. 1884 is 2 tokens.'
|
| I would be surprised if they still use this tokenization, as
| it's not math-friendly.
| sharkjacobs wrote:
| It's a language model, not a mathematical model
| dudeinjapan wrote:
| Interesting that Japanese seems to get every character/letter
| tokenized individually.
| fenomas wrote:
| Huh, unless the demo is broken it seems to tokenize unicode
| per-byte, rather than per-character.
|
| E.g.: "a" => [40948] "Ya " => [12859,
| 250] "a" => [171, 121, 109]
| numpad0 wrote:
| "0123456789" => [ 171, 120, 238, 171, 120, 239,
| 171, 120, 240, 171, 120, 241, 171, 120, 242, 171,
| 120, 243, 171, 120, 244, 171, 120, 245,
| 171, 120, 246, 171, 120, 247 ] "~" => [171,
| 121, 252, 198]
|
| Whoa. Literally just giving it "Potato" gives thrice as much
| token count as the letter count, 18 tokens for 6 letters.
| dudeinjapan wrote:
| Yeah I noticed similar. That can't be right...
| rhdunn wrote:
| It's probably operating on UTF-8 data on a byte-per-byte
| level without any additional processing. Just feeding it
| the raw string data and letting it assign the tokens.
|
| It's similar to how it is splitting words at arbitrary
| points, rather than at clear morphological or lexical
| locations (e.g. on the Jane Austen text `"Now, ma'am," said
| Jane to her aunt, "shall we join Mrs. Elton?"` I've seen it
| tokenize that as `"|Now|,| ma|'|am|,"| said| Jane| to| her|
| aunt|,| "|shall| we| join| Mrs|.| El|ton|?"`).
| dudeinjapan wrote:
| I would find that hard to believe, as the bytes have zero
| semantic meaning, and moreover, pairing the wrong bytes
| in the output will result in complete gibberish. It would
| be akin to tokenizing each English letter "N|o|w|,|
| |m|a|'|a|m|..." except far worse.
|
| Moreover it's trivially easy to tokenize the glyphs.
| hn_throwaway_99 wrote:
| For which alphabet, or for all alphabets? For kanji that would
| make sense, as each character is (sort of) a word. Hiragana and
| Katakana are phonetic, with each character usually representing
| a consonant-vowel pair, so even then there is more information
| content than in a single English letter.
| planbattack wrote:
| Japanese to English translator here. The general rule of
| thumb (that is often used for billing estimates) is that N
| Japanese characters = N/2 English words.
|
| So if you have a Japanese source text that is 2,000
| characters, the English translation will be around 1,000
| words.
|
| I tested a translation (one sentence) from a previous job:
|
| Japanese: 94 characters, 128 tokens
|
| English: 39 words (232 characters), 47 tokens
|
| Seems quite unbalanced given that the amount of "information"
| in the two is equivalent.
| dudeinjapan wrote:
| Oof... that's a rough job to have in the world of
| ChatGPT...
| bilater wrote:
| Dammit they copied me lol https://www.gptcalculator.xyz/
| bilater wrote:
| mine has an API though which hopefully is useful
| thatwasunusual wrote:
| Also, their solution works. Yours just says "Loading..."
| whenever I try it.
| bilater wrote:
| Is working for me
| non- wrote:
| I found this tool recently when it was linked from this
| Computerphile video[1] about "glitch tokens".
|
| tldw:
|
| Certain junk data was thrown out post-tokenization, e.g. the
| /r/counting[2] community data and debug logs from Rocket League.
|
| Some tokens specific to those contexts stuck around, however, and
| are now like "a color you've never seen before" as far as GPT-X
| models are concerned.
|
| Giving the model one of these "glitch" tokens causes it to kind
| of freak out and return gibberish or some completely random
| response, because it never encountered them during training (they
| were removed when the data was cleaned).
|
| [1] https://www.youtube.com/watch?v=WO2X3oZEJOA [2]
| https://reddit.com/r/counting
| non- wrote:
| Another interesting tweet[1] I saw today shows how you can ask
| ChatGPT to compress text and it invents its own (effective!)
| shorthand.
|
| I bet it's related somehow to glitch tokens and the way GPT is
| grouping tokens internally.
|
| [1]
| https://mobile.twitter.com/VictorTaelin/status/1642664054912...
| artdigital wrote:
| I experimented with something similar previously and it
| doesn't really work. It usually can't decompress it properly
| MuffinFlavored wrote:
| I took a random Java file I had laying around that I was working
| on lately.
|
| ~100 lines of code + whitespace
|
| 1300-1900 tokens
|
| So if I fed this to OpenAI and said "how can I make this file
| better/improve upon it", it would have cost:
|
| between $0.03 and $0.12 for this one file using GPT-4
|
| not sure I could use gpt-3.5-turbo since it says it is for chat
| and not code?
|
| Does that sound right? $0.05 for every file of source code
| scanned sounds too high for realistic usage. Even $0.01 sounds
| high? A modern company might have 1,000,000+ files of code, no?
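|
| A back-of-the-envelope sketch of that estimate (the prices are
| the April 2023 list prices per 1K tokens, and the completion
| size is an assumption):
|
|     # gpt-4 (8k): $0.03 / 1K prompt, $0.06 / 1K completion
|     # gpt-3.5-turbo: $0.002 / 1K tokens
|     prompt_tokens = 1900       # the Java file above, upper bound
|     completion_tokens = 500    # assumed size of the answer
|
|     gpt4 = (prompt_tokens / 1000 * 0.03
|             + completion_tokens / 1000 * 0.06)
|     turbo = (prompt_tokens + completion_tokens) / 1000 * 0.002
|     print(f"gpt-4: ${gpt4:.3f}  gpt-3.5-turbo: ${turbo:.4f}")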
| engrefoobiz wrote:
| GPT-4 costs 30 times more than gpt-3.5-turbo and 60 times more
| if you use the 32k-token gpt-4 model. It's by far their most
| expensive service! I'm using gpt-3.5-turbo, also for coding,
| and honestly it does just fine.
| maherbeg wrote:
| That seems remarkably cheap compared to engineering hours.
| smallerfish wrote:
| 3.5-turbo can definitely understand code, just like ChatGPT
| can. GPT4 is better at complex tasks, but 3.5-turbo is always
| worth evaluating.
| SubiculumCode wrote:
| excuse my ignorance, but I thought it was $20 per month.
| MuffinFlavored wrote:
| It would be cool if they told their $20/mo users "here's how
| much your past 30-day usage would have cost if we billed you
| via the API" (aka how many tokens/sessions/chats/whatever you
| used).
| bikingbismuth wrote:
| That is for the interactive chat experience. API calls are
| sold a la carte. Details here https://openai.com/pricing.
| SubiculumCode wrote:
| thanks!
| RugnirViking wrote:
| Really interesting. How do these work? Are they a separate
| AI/neural net/model from the transformer? They don't seem to
| follow any humanlike structure or process.
| aqme28 wrote:
| What's the benefit of OpenAI charging per-token instead of per-
| character or per-word?
|
| Since token algorithms change model-to-model and version-to-
| version, it seems like they've added a lot of complication for no
| actual benefit to the user except for a little peek under the
| hood.
|
| Is there a benefit to this scheme that I'm not seeing? Is there
| some way to game the system otherwise?
| smallerfish wrote:
| The models know how to decode base64, so if they were naive,
| you could pass them one base64 "word" representing a prompt
| thousands of lines long.
|
| There are still ways to compress prompts though.
| qeternity wrote:
| Because tokens are the unit of work in an LLM and it's not
| correct to say that tokens or even embeddings change between
| models.
| hn_throwaway_99 wrote:
| It's not a "benefit", it's simply how the technology works -
| the underlying model just fundamentally works on tokens as it's
| atomic inputs.
|
| The models don't know anything about words, just tokens.
| typest wrote:
| It's not that they're just charging per token -- the actual
| models are operating on a token level. The model sees things in
| terms of tokens, and in openai's case, these tokens are subword
| (pieces of words), not words themselves, not characters.
|
| So the real question is, what is the benefit of modeling your
| tokens as subwords, rather than as characters or words?
|
| I think there is a lot of nuance here, and I don't understand
| it all. But, some benefits:
|
| * Words, at least in English, are composed of different pieces,
| like roots, prefixes, and stems. Modeling at the subword level
| more naturally aligns your model with this aspect of language.
| If I tokenize "warmest", I get "warm" and "est". So, the
| meaning of the token "est" can be learned by the model --
| whereas if you modeled by words, the model would have to
| individually relearn this aspect of information for every word
| ending in "est".
|
| * Modeling at the subword level makes your sequences a lot
| shorter than modeling at the character level, which should help
| with things like efficiency.
|
| * Modeling at the subword level makes your vocabulary a lot
| bigger than just modeling at the character level, which I
| suspect helps the model, as it can assign the subwords
| themselves meaning. E.g., it can learn the meaning of the token
| "warm" on its own, rather than having to learn this meaning
| only through learning the relationship of the tokens "w" "a"
| "r" and "m".
|
| Hope this helps! Would love for anyone else to chime in/add
| on/correct me.
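|
| To see those splits concretely, a small sketch with tiktoken
| (r50k_base should match the GPT-3 tokenizer this page uses; the
| exact splits are whatever that vocabulary produces):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("r50k_base")
|     for word in ["warmest", "coldest",
|                  "antidisestablishmentarianism"]:
|         tokens = enc.encode(word)
|         print(word, "->", [enc.decode([t]) for t in tokens])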
| rhdunn wrote:
| I've noticed that it correctly splits warm|est, cold|est,
| bleak|est, but darkest is a single token.
|
| I've also seen it group `?"`, `."`, `!"`, and `.--` into
| single tokens.
|
| It also splits some words like "Elton" as El|ton. Presumably
| in that case it has mis-identified a -ton suffix.
| bootsmann wrote:
| The tokenizer doesn't actually change model to model; by the
| looks of it this is still the GPT-2 tokenizer. Also, the
| per-token cost makes sense because predicting a token is a
| single forward pass through the model, while for other cost
| measures they would need to do some science to make it work
| out on average.
| nextworddev wrote:
| One interesting fact I stumbled upon recently is that the
| GPT2Tokenizer library and the Tiktoken library produce the
| same number of tokens for the `text-davinci-003` model,
| despite GPT2Tokenizer being GPT-2 and text-davinci-003 being
| GPT-3.5.
|
| For code, however, the Tiktoken library and GPT2Tokenizer
| produce different tokenizations.
| shevis wrote:
| Key difference here is in tokenization encoders. The newer
| models make use of the `cl100k_base` encoding.
| devit wrote:
| Interestingly they seem to have different token ids for "Word",
| "word", " Word" and " word". That seems kind of a wasteful
| design.
|
| It seems like it would make more sense to have a single token for
| all variants and then a "capitalized where not expected" token
| (e.g. "foo Foo"), a "not capitalized where expected" token (e.g.
| "foo. foo") and a "missing space where expected" token (e.g.
| "foo.Foo").
|
| The lack of any normalization also means that WrItInG tExT lIkE
| tHiS will make future GPT versions not be able to make full use
| of the text during future training unless they change the
| tokenization (or the model is so overpowered that it doesn't
| matter).
| neerd wrote:
| They charge by the token so I'm not so sure about that
| rafram wrote:
| Not all languages use capitalization the same way (or have it
| at all) and not all LLM input/output is natural language.
| gradys wrote:
| The model is indeed so overpowered that it doesn't matter in
| practice. See the Sentencepiece paper for some discussion of
| the design decisions on stuff like whitespace.
| rickdeckard wrote:
| I don't think it's wasteful, if I ask GPT to process/generate a
| non-human language like a linux shell, capitalization is
| crucial...
| williamstein wrote:
| I am glad it tokenizes Python and all other programming
| languages in a systematic way.
| AbrahamParangi wrote:
| The tokenization is a statistical product of the frequency of
| byte sequences in the training corpus. It might seem
| unintuitive but I wouldn't go so far as to say it's "wasteful".
| It may very well be but frankly you'd have to have a good
| explanation for why byte pair encoding is so much more
| successful than other tokenization schemes.
| swyx wrote:
| > why byte pair encoding is so much more successful than
| other tokenization schemes.
|
| what's the evidence for that please? just asking because i
| dont know, not because i disagree. ive read a bunch of BPE
| explainers but nobody has bothered to explain _why_ or _how_
| we landed on BPE
| hn_throwaway_99 wrote:
| I'm not an AI expert, so I don't know what research has
| been done to verify it, but this comment below,
| https://news.ycombinator.com/item?id=35454839 , helped me
| understand it, and intuitively I think it makes sense.
|
| That is, byte pair encoding tokenization is _itself_ based
| on how common it is to see particular characters in
| sequential order in the training data. Thus, if the
| training data really frequently sees characters together
| (as, of course, it does in common words), then these words
| get a single token. Which, given how an LLM works, really
| makes sense because it looks for statistical relationships
| among strings of _tokens_. Thus, the way I think of it is
| that byte pair encoding is essentially like a pre-
| processing step that _already_ optimizes for statistical
| relationships among individual _characters_.
| RC_ITR wrote:
| In practice, GPT uses byte-pair encoding [0] for each Unicode
| character.
|
| That's why cases are treated differently - they're different in
| Unicode.
|
| This is also the only way to teach a model how to properly
| capitalize things (since there are no human defined rules).
|
| [0] https://towardsdatascience.com/byte-pair-encoding-subword-
| ba....
| totony wrote:
| The actual tokenizer often does not matter since you can add
| pre processors/normalizers. I assume they did it like this
| because capitalization matters in a lot of contexts
| tel wrote:
| Similarly, pre-processing can be harmful. I think there are
| reasonable predictive differences when predicting the next-
| word follow up to a sentence that's properly capitalized
| versus one that's all lowercase. Not only will the "all
| lowercase" convention likely prevail in forward
| predictions, it also indicates something about the context
| of the writing, the author, their sense of style.
|
| It's hard to argue that this information isn't (a) being
| captured by GPTs and (b) important. If you just threw it
| away, GPTs would have less information available to absorb.
| isuckatcoding wrote:
| I wonder if this is why if you wrote things in all caps in
| chatgpt, it sometimes has some effect on the response.
| king_magic wrote:
| It's not surprising or bad design at all. Words mean different
| things depending on context, punctuation, etc.
| stared wrote:
| How are these encodings created?
|
| My guess is it's related to text compression, but I would be
| happy to see the algorithm that is responsible for generating
| them.
| lairv wrote:
| https://en.wikipedia.org/wiki/Byte_pair_encoding
|
| tldr: start with individual characters and greedily merge the
| pairs that are most frequent
|
| A consequence is that an encoding is suited to the dataset it
| was trained on, so if a language is under-represented in the
| data it will take a higher number of tokens to encode it
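|
| A toy version of that greedy merge loop, as a sketch (the real
| tokenizers work on bytes, pre-split text with a regex, and cap
| the vocabulary size; this just shows the idea):
|
|     from collections import Counter
|
|     def train_bpe(corpus, num_merges):
|         # start from individual characters, word by word
|         words = [list(w) for w in corpus.split()]
|         merges = []
|         for _ in range(num_merges):
|             pairs = Counter()
|             for w in words:
|                 for a, b in zip(w, w[1:]):
|                     pairs[(a, b)] += 1
|             if not pairs:
|                 break
|             (a, b), _count = pairs.most_common(1)[0]
|             merges.append(a + b)
|             # apply the merge everywhere it occurs
|             for w in words:
|                 i = 0
|                 while i < len(w) - 1:
|                     if w[i] == a and w[i + 1] == b:
|                         w[i:i + 2] = [a + b]
|                     else:
|                         i += 1
|         return merges
|
|     print(train_bpe("low lower lowest low low", 4))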
| thomasahle wrote:
| Reading the sentencepiece paper they say:
|
| > The main difference to other compression algorithms, such
| as Huffman encoding, which have been proposed to produce a
| variable-length encoding of words for NMT (Chitnis and
| DeNero, 2015), is that our symbol sequences are still
| interpretable as subword units, and that the network can
| generalize to translate and produce new words (unseen at
| training time) on the basis of these subword units.
|
| I don't see why Huffman encoding doesn't give you that same
| interpretability?
|
| Actually the algorithm for producing a Huffman tree is very
| similar to that for BPE:
|
| > The process begins with the leaf nodes containing the
| probabilities of the symbol they represent. Then, the process
| takes the two nodes with smallest probability, and creates a
| new internal node having these two nodes as children. The
| weight of the new node is set to the sum of the weight of the
| children. We then apply the process again, on the new
| internal node and on the remaining nodes (i.e., we exclude
| the two leaf nodes), we repeat this process until only one
| node remains, which is the root
|
| (from https://en.m.wikipedia.org/wiki/Huffman_coding)
|
| I guess the issue is that Huffman requires the alphabet to be
| predefined, where BPE "discovers it" as it goes along.
| astrange wrote:
| > I don't see why Huffman encoding doesn't give you that
| same interpretability?
|
| It might just be that a Huffman encoding is a bit-string
| and not a byte-string.
|
| BPE encoding causes interesting failures, like how it can't
| do anagrams or spell words backwards properly. And yet it
| can make rhyming poems now.
| thomasahle wrote:
| > BPE encoding causes interesting failures, like how it
| can't do anagrams or spell words backwards properly. And
| yet it can make rhyming poems now.
|
| I don't think BPE encoding makes anagrams impossible.
| Just harder.
| [deleted]
| GistNoesis wrote:
| Accidentally quadratic !
|
| Byte pair encoding by construction is quadratic in the length
| of the words, and usually the input is pre-split into words
| before being given to the byte pair encoder.
|
| Hopefully they use a different implementation in prod. It
| needs to be sanitized against very long words (like
| 10k-character-long words :) ).
|
| In previous tokenizer like CLIP
| (https://github.com/openai/CLIP/blob/main/clip/simple_tokeniz...
| ) , they used additional preprocessing steps like html escaping
| and various cleanup preprocessing using some python library
| (ftfy, html and regex), which made porting the code exactly to
| other languages a real pain.
|
| Sadly this library doesn't solve that :'-(
| simonw wrote:
| This tool is really useful for helping develop a better intuition
| for how GPT models actually work.
|
| Paste in some text and switch to the token IDs view. Note how
| common words (like "the ") have low integer token IDs, while
| things like emojis are split into several numbers.
|
| An LLM is a function that takes an array of integers and returns
| a new array of integers. Seeing the tokens like this helped me
| reinforce that mental model.
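|
| A short sketch of that round trip with tiktoken (assuming the
| package is installed; r50k_base is the GPT-3-era encoding):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("r50k_base")
|     ids = enc.encode("the cat sat on the mat 🐈")
|     print(ids)              # common words get their own ids,
|                             # the emoji becomes several tokens
|     print(enc.decode(ids))  # decoding returns the original text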
| psychphysic wrote:
| Saying it takes an array and returns a single integer is more
| correct, and usefully so?
|
| What I still can't wrap my head around is that tokens often
| don't align with word structures.
|
| "Antidepressants" I'd imagine tokenizes as "anti" "depress"
| "ant". But nope. And "antipsychotic" tokenizes differently from
| it too!
|
| I assumed the output is a token i.e. a single integer and
| that's rarely even a full word?
| shagie wrote:
| Imagine a circular list (in however you want to construct
| that) that matches the input size for the model.
|
| The prompt is initially loaded at the start of the list, the
| model is run, and it produces high activation on a single
| output token. That token is then fed to the end of the input
| circular list and also appended to the text that the model
| returns.
|
| This process of running the model, getting the output token,
| and sending one copy to the input list and one copy to the
| return string is repeated until the number of tokens
| generated hits a numeric limit or the stop token is
| encountered.
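|
| That outer loop, sketched with a stand-in for the model (the
| token ids and limits are illustrative, not OpenAI's actual
| values):
|
|     import random
|
|     STOP_TOKEN = 50256        # GPT-2's <|endoftext|> id
|     MAX_NEW_TOKENS = 128
|     CONTEXT_SIZE = 2048
|
|     def fake_model(tokens):
|         # stand-in: the real network returns a probability
|         # distribution; here we just pick a random token id
|         return random.randrange(50257)
|
|     def generate(prompt_tokens):
|         context = list(prompt_tokens)
|         output = []
|         for _ in range(MAX_NEW_TOKENS):
|             # feed only the most recent CONTEXT_SIZE tokens
|             next_id = fake_model(context[-CONTEXT_SIZE:])
|             if next_id == STOP_TOKEN:
|                 break
|             context.append(next_id)
|             output.append(next_id)
|         return output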
| raldi wrote:
| How does it decide whether to split "antid" into "ant" "id"
| or "anti" "d"?
| bastawhiz wrote:
| > "Antidepressants" I'd imagine tokenizes as "anti" "depress"
| "ant". But nope. And "antipsychotic" tokenizes differently
| from it too..
|
| Tokens are symbols. You're thinking of them like embedding
| vectors. Tokens represent the step before a meaning is
| assigned to the text: it turns some unit of text into what's
| essentially an identifier.
|
| Which is to say, two homonyms would have the same token id,
| even though they have different meanings. Tokens have no
| notion of context.
| crakenzak wrote:
| What is the benefit of splitting text along such seemingly
| meaningless lines? Isn't there a better way to do it?
| rmellow wrote:
| Breaking text into sub-word units...
|
| 1. Greatly reduces memory usage. Instead of memorizing
| every inflection of the word "walk", it memorizes the
| root (walk) and the modifiers (ing, ed, er, ...). These
| modifiers can be reused for other words.
|
| 2. Allows for word compositions that weren't in the
| training set. This is great for uncommon or new
| expressions like "googlification" or "unalive".
| shagie wrote:
| The walk example doesn't quite hold up.
|
| If you put:
|
|     test walk walker walking walked
|
| into the tokenizer you will see the following tokens:
|
|     [test][ walk][ walk][er][ walking][ walked]
|
| Only walker is broken up into two different tokens.
|
| I added "test" to that because walk at the start doesn't
| include the leading space, and [walk] and [ walk] are
| different tokens.
|
| For even _more_ fun, [walker] is a distinct token if it
| doesn't include the leading space.
|
|     test walker floorwalker foowalker
|
| becomes:
|
|     [test][ walk][er][ floor][walker][ fo][ow][alker]
|
| How _we_ think of words doesn't cleanly map to tokens.
|
| (Late edit)
|
|     walker floorwalker
|
| becomes tokenized as:
|
|     [walker][ floor][walker]
|
| So in that case, they're the same token. It's curious how
| white space influences the word-to-token mapping.
| nonfamous wrote:
| There's no syntax or structure to the token set. The
| actual tokens were algorithmically selected based on the
| training data to (putting things loosely) optimize
| compression of the training data given a token set size.
| sharkjacobs wrote:
| ChatGPT models syntax, not semantics
|
| There's no "better way" to do it because the tokens are
| all meaningless to ChatGPT, it only cares about how
| efficiently they can be parsed and processed.
|
| The competing desires are to model all language with the
| biggest tokens possible, and the fewest tokens possible.
| The lines aren't meaningless, text is split into the
| largest possible chunks using a set of the most common
| tokens.
|
| Common words, like "the", "fast", "unity", "flying" are
| all tokens, but it's not because they're words, it's
| because they're common letter clusters, undistinguished
| from "fl", "ing", "un", "ple"
|
| "gadflying" is tokenized into [g, ad, flying], even
| though it's only loosely semantically related to
| "flying", it's just the most efficient way to tokenize
| it.
| roel_v wrote:
| There is no 'meaning' inside these AI's. It's terribly
| confusing to think about these LLM's as having 'meaning'
| in the same way we humans do. It's all just statistics.
| Given a sequence of numbers (each representing some
| abstract token), what is most likely to come next. That's
| how 'simple' it is. It's also what makes it so amazing
| that these things work as well as they do. I giggle like
| a schoolgirl every time I get it to add some
| functionality to a function, or write an entire new
| function, and that's several times a day for what is now
| months on end. But the key to using them is seeing that
| there is no 'meaning' in them. It's all just streams of
| (to the machine) meaningless tokens.
| goldfeld wrote:
| If it just decides on a single token at a time, can it
| backtrack and choose differently under that operation,
| given the next tokens? What I wonder is, how can it plan
| ahead and output meaningful (to us) responses, like
| working code or useful articles? How can it "reason"
| logically when it needs to solve a problem, a riddle etc,
| by only selecting a token at a time? Wouldn't that dumbed
| down approach prove myopic for complex compositions?
| Doesn't it need some over-ruling goal-based heuristic
| system?
| wingspar wrote:
| There's no planning, no reason. It's all 'what word is
| next...'
|
| I found Stephen Wolfram's explanation helpful. He has a
| YouTube video version which I enjoyed too. This blog post
| was on HN last month, but I never get good search results
| on HN.
|
| https://writings.stephenwolfram.com/2023/02/what-is-
| chatgpt-...
| zeven7 wrote:
| It is wild that a process like that can generate working
| code. Humans speak their words in order, but they don't
| write their code in order. Why would writing code in
| order work?
| flangola7 wrote:
| With GPT-4 this process also allows it to understand what
| is inside a graphical image and talk intelligently and
| coherently about it.
|
| Next token prediction produces the most head exploding
| emergent effects.
| goldfeld wrote:
| If we get a bit quantum (or an act of God for some), then
| backtracking could happen by collapsing the dead-ends and
| "changing" history to stay with what turns out to be the
| solid plan. Could emergent consciousness in an AI's neurons
| do the planning and reasoning that it rather seems to be
| doing but ML experts will say it is not? If our
| consciousness could by any chance reside not in the
| electrical currents of the wetware, could AI's reason
| also not reside in tokens? Is there some mysterious
| process possible to be taking place?
| psychphysic wrote:
| Bard at least produces multiple drafts. I believe that is
| preferred over backtracking.
|
| Generation is ultimately deterministic (seeded prng) so
| backtracking wouldn't make sense.
| psychphysic wrote:
| You say that but we have models of meaning in humans too.
|
| You can put people in an fMRI and ask them to think
| "car".
|
| You can ask someone to think of objects and detect when
| they think "car".
|
| What happened there is pairing a bunch of tensors to
| meanings and matching them.
|
| We can do something similar with embeddings.
|
| To be clear I don't intend to give the impression that
| these LLMs are doing something miraculous. Just that we
| are increasingly peeling back the veil of how brains
| think.
| wizzwizz4 wrote:
| > _You can put people in an fMRI and ask them to think
| "car"._
|
| I don't know about other people, but when I think "car"
| really hard, I can feel the muscles in my throat adjust
| slightly to match the sound of the word "car". Perhaps
| that sort of thing is what the MRI machine is picking
| up, rather than being able to pick up some kind of
| "internal representation" of _car_.
| Zircom wrote:
| Maybe they should do the same study on people that lack
| an internal monologue to see if they have the same
| results.
| psychphysic wrote:
| In fact it also picks up the parts of your brain to do
| with driving (if you're a driver). Maybe also the part to
| do with the smell of fuel in me, but not you.
|
| It'll also light up in the parts of my brain to do with
| reading, writing, hearing the word in the languages I
| speak.
|
| What does car mean to me if it doesn't connect to all the
| concepts that relate to cars?
| [deleted]
| TaylorAlexander wrote:
| There's no meaning to the tokens, but research has shown
| that the models themselves capture meaning. Technically
| they are producing the next word but in order to do that
| for a dataset of a trillion words they actually have to
| develop internal models of how the world works. There was
| a post on HN a couple days ago that talked about the
| research done to show this.
| ruuda wrote:
| You could split on words instead of tokens, but then you
| need a large vocabulary, you can't deal with inputs that
| contain a word which is not in the vocabulary, and it's
| not so clear what a "word" even is.
|
| Instead of coming up with more and more heuristics to
| chop a sequence of bytes up in "words" in a vocabulary,
| we could simply set a limit on the size of the vocabulary
| (number of tokens), put all bytes in there (so we can at
| least handle any input byte by byte), and pack the
| remaining space with the most common multi-byte byte
| sequences. Then you end up with tokens like here.
| [deleted]
| eternalban wrote:
| Try this:
|
| 'this is a day that this sentence with clarify that day. Is
| this not a good day?'
|
| [5661, 318, 257, 1110, 326, 428, 6827, 351, 18282, 326, 1110,
| 13, 1148, 428, 407, 257, 922, 1110, 30]
|
| Note 'day' is solidly 1110 here. Now start a sentence with day.
|
| "day began with laughter"
|
| [12393, 2540, 351, 20263, 13]
|
| So the logical word -> token(s, p) -> id(s) function definitely
| has 1 position parameter as well.
|
| "Day after day after this day"
|
| [12393, 706, 1110, 706, 428, 1110]
|
| "Day day Home home home"
|
| [12393, 1110, 5995, 1363, 1363]
|
| "day day day home home home"
|
| [820, 1110, 1110, 1363, 1363, 1363]
|
| [corrected/edited: so case-sensitive and position sensitive as
| well.]
|
| btw doesn't the output array contain the prompt as well
| (because of the transformer architecture? not entirely sure
| ~iirc)
| fenomas wrote:
| > So the logical word -> token(s, p) -> id(s) function
| definitely has 1 position parameter as well.
|
| You're missing that it groups in spaces. The position isn't
| relevant, but "day" is a different token than " day".
| eternalban wrote:
| Ah, you're right. [try "datedate". interesting how it
| partitions that as 'dated' + 'ate'. Compare with "matemate"
| -> 'mat', 'emate'.]
|
| p.s. "ifthiswasagermanword if this was a german word."
|
| It's not even spaces. That second sequence ' german' is
| chopped up as 'ag' 'erman'.
| hn_throwaway_99 wrote:
| It's a lot simpler than that. You can see in the tokenizer
| that the boundary for words includes the preceding space. So,
| since the first word doesn't have a preceding space, it has a
| different token.
| sebzim4500 wrote:
| There's no position parameter, it's just that " day" and
| "day" are different tokens.
| smaddox wrote:
| > An LLM is a function that takes an array of integers and
| returns a new array of integers.
|
| To refine this a bit more, a LLM is a function that takes an
| array of integers (or really, a batch of arrays of integers),
| and returns a probability distribution for each possible
| integer, with the array shifted left by one place to enable
| prediction.
| lukasb wrote:
| Could the properties of the distribution (the spread? not
| stats literate enough) be used to calculate a confidence
| metric for the answer?
| fnbr wrote:
| Yes! This is something that is done. The problem is that a)
| it's tough to find a sane denominator as the likelihood of
| the entire sequence can be quite small, even though it's
| the best answer and b) the answer isn't grounded in
| anything, so the confidence score isn't super helpful.
|
| A score like this can be useful for active learning though,
| where you find areas of low confidence in your dataset and
| get more data to train on.
| nonfamous wrote:
| I've always wondered how stop tokens fit in here. Does the
| LLM generate a probability for "stop" in addition to every
| other token in the space? Or is stopping handled
| heuristically by the outer loop that generates the output
| tokens sequentially?
|
| The API docs talk about letting you specify your own stop
| token (like "<!-->") but I don't think "token" is meant in
| the same sense here.
| amilios wrote:
| Yes, the model has something like an EOF token which it
| emits for the output to end. It is part of the probability
| distribution that the model predicts.
| flir wrote:
| A probability distribution for each token in the array, not
| just the last one?
|
| I don't understand that, because wouldn't the probabilities
| later in the sentence be impacted by the tokens chosen
| earlier in the sentence?
| ruuda wrote:
| You unfold it one token at a time by sampling from the
| returned distribution. To control the amount of variation,
| you can make the probability distribution more extreme, in
| the most extreme case you only select the most likely
| token, and the sequence becomes deterministic.
|
| Yes, what happens later in the sentence depends on the
| particular choice you made earlier in the sentence.
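|
| The sampling step described above, as a minimal sketch in plain
| Python (a toy logits list stands in for the model's output; the
| real API exposes this knob as the temperature parameter):
|
|     import math
|     import random
|
|     def sample(logits, temperature=1.0):
|         # temperature -> 0 approaches greedy (argmax) decoding
|         if temperature == 0:
|             return max(range(len(logits)),
|                        key=lambda i: logits[i])
|         # softmax with temperature
|         scaled = [l / temperature for l in logits]
|         m = max(scaled)
|         exps = [math.exp(s - m) for s in scaled]
|         total = sum(exps)
|         probs = [e / total for e in exps]
|         return random.choices(range(len(probs)),
|                               weights=probs, k=1)[0]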
| smaddox wrote:
| Yes, one distribution per position. This was a key
| innovation that allowed training over the entire sequence
| in parallel, rather than on one token prediction at a time,
| thereby massively speeding up training.
|
| More recently, there are models like RWKV that can run in
| both parallel (GPT-like) mode for training and serial (RNN-
| like) mode for inference.
|
| But transformers always output a probability distribution
| at each position in the context.
| sorokod wrote:
| To refine further: takes an array of integers and draws the
| rest of the f**king owl.
| turzmo wrote:
| I would argue that my hard drive too can take an array of
| integers and produce an owl.
| VHRanger wrote:
| Moreover:
|
| The integers are really indices into the embedding space.
|
| So you'd want to think more that the model maintains a giant
| matrix (one row = one token; one column is an embedding
| feature).
|
| The array of indices gets the relevant embeddings to shove
| through the rest of the model's forward pass.
| yewenjie wrote:
| NOTE: this is only valid for the old models (GPT-3 and Codex).
| IIRC, there is no simple way to know the token usage for the new
| models (gpt3.5-turbo and beyond).
| iamjackg wrote:
| You can use https://github.com/openai/tiktoken
| sebzim4500 wrote:
| Assuming the API docs are honest, all the publicly available
| GPTs use the same tokens.
| tedsanders wrote:
| Different GPTs use different tokens:
| https://github.com/openai/openai-
| cookbook/blob/main/examples...
| NameError wrote:
| The API does tell you how many prompt and completion tokens
| each request used, if you're okay knowing after-the-fact
| tedsanders wrote:
| This guide explains how to count tokens for 3.5-turbo and
| beyond: https://github.com/openai/openai-
| cookbook/blob/main/examples...
___________________________________________________________________
(page generated 2023-04-05 23:01 UTC)