[HN Gopher] OpenAI Tokenizer
       ___________________________________________________________________
        
       OpenAI Tokenizer
        
       Author : tosh
       Score  : 226 points
       Date   : 2023-04-05 13:00 UTC (10 hours ago)
        
 (HTM) web link (platform.openai.com)
 (TXT) w3m dump (platform.openai.com)
        
       | gnrlst wrote:
       | Completely useless, but I was curious about the token indexes. I
       | tried to look for Token #0. After a couple minutes of trial and
       | error, it turns out it's the exclamation mark.
        
         | swyx wrote:
         | how does one "look" for a token? there isn't a lookup table
         | somewhere?
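          A minimal sketch of such a lookup with the tiktoken package (the
          "gpt2" encoding name is an assumption about which encoding this
          page uses); decoding a one-element ID list is effectively the
          lookup table:
         
              import tiktoken
         
              enc = tiktoken.get_encoding("gpt2")
              print(enc.decode([0]))      # '!' -- token #0, as found above
              print(enc.encode("Hello"))  # and the other direction: text -> IDs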
        
       | v4dok wrote:
        | It completely butchers Greek. No wonder it charges so much
        | for so little output. Every Greek character is a token.
       | 
        | I wonder if there is space for innovation there. I would
        | imagine it is similarly difficult for other non-English
        | languages as well. I fear for the effect this will have on
        | them.
        
         | goldfeld wrote:
          | There is a market opportunity here for a GPTesque thinking
          | machine that actually masters and knows its Greek ancients
          | well. I knew it was a lack of refined Platonic understanding
          | when ChatGPT said it could not comment further on the
          | Russian war.
        
         | 0xDEF wrote:
         | It's crazy that OpenAI hasn't fixed their tokenizer yet. They
         | are leaving the door wide open for some Chinese big tech
         | company to capture the non-Latin script parts of the world.
         | 
         | i18n (and accessibility) was something American tech companies
         | were serious about in the 90s and early 2000s. That is how they
         | captured most of the global market. US tech dropping the ball
         | on this leaves the door wide open for Chinese competitors.
        
           | hombre_fatal wrote:
            | Do OpenAI's tokenizer issues cash out into having worse
            | results for Greek, rather than just being more expensive
            | for gpt-4? (gpt-3.5-turbo already costs peanuts.)
            | 
            | If not, then this response seems overblown. The
            | competitive advantage in LLMs at this point is probably
            | not tokenizer optimizations but having results worth a
            | damn.
        
             | v4dok wrote:
             | The usability is worse. Token limits are so much easier to
             | reach. It's like using a model with dementia.
        
           | Fauntleroy wrote:
           | It could be that they are actively working on this problem,
           | but the product has not yet been released.
        
       | swyx wrote:
       | " SolidGoldMagikarp"
       | 
       | Characters: 18
       | 
       | Tokens: 1
       | 
       | heh. all i know is this is a fun magic token but 1) i dont really
       | know how they found this and 2) i dont know what its implications
       | are. i heard that you can use it to detect if you are talking to
       | an AI.
        
         | RugnirViking wrote:
         | it comes from this blog post:
         | https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
        
         | kevingadd wrote:
         | Some of the magic tokens are related to Twitch Plays Pokemon.
         | https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
        
           | swyx wrote:
           | hmm. so this is evidence that openai scraped twitch chat of
           | all places? (notoriously ephemeral)
           | 
           | also opens a question as to how tokenizers are trained.
           | should you discard or break up super niche words like this?
        
             | squeaky-clean wrote:
             | It doesn't necessarily mean it scraped twitch chat. That is
             | the name of a moderator. They also moderate the subreddit
             | and probably some other places. And being a moderator for
             | such a popular event they probably had their name mentioned
             | in other places as well. Every time they comment on Reddit
             | their username would also appear.
             | 
             | https://www.reddit.com/r/twitchplayspokemon/comments/2cxkpp
             | /...
        
         | Ari_Rahikkala wrote:
         | "They" as in OpenAI, when they trained the tokenizer, just
         | dumped a big set of text data into a BPE (byte pair encoding)
         | tokenizer training script, and it saw that string in the data
         | so many times that it ended up making a token for it.
         | 
         | "They" as in the rest of us afterward... probably just looked
         | at the token list. It's a little over fifty thousand items,
         | mostly short words and fragments of words, and can be fun to
         | explore.
         | 
         | The GPT-2 and GPT-3 models proper were trained on different
         | data than the tokenizer they use, one of the major differences
         | being that some strings (like " SolidGoldMagikarp") showed up
         | very rarely in the data that the model saw. As a result, the
         | models can respond to the tokens for those strings a bit
         | strangely, which is why they're called "glitch tokens". From
         | what I've seen, the base models tend to just act as if the
         | glitch token wasn't there, but instruction-tuned models can act
         | in weirdly deranged ways upon seeing them.
         | 
         | The lesson to learn overall AIUI is just that you should train
         | your tokenizer and model on the same data. But (also AIUI - we
         | don't know what OpenAI actually did) you can also simply just
         | remove the glitch tokens from your tokenizer, and it'll just
         | encode the string into a few more tokens afterward. The model
         | won't ever have seen that specific sequence, but it'll at least
         | be familiar with all the tokens in it, and unlike never-before-
         | seen single tokens, it's quite used to dealing with never-
         | before-seen sentences.
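          A minimal sketch of "looking at the token list" with tiktoken (the
          "r50k_base" encoding and the length cutoff are assumptions): walk
          the roughly fifty-thousand-entry vocabulary and print unusually
          long single tokens, which is roughly how strings like
          " SolidGoldMagikarp" get spotted.
         
              import tiktoken
         
              enc = tiktoken.get_encoding("r50k_base")
              for token_id in range(enc.n_vocab):
                  try:
                      piece = enc.decode_single_token_bytes(token_id)
                  except KeyError:
                      continue  # IDs with no byte string (e.g. gaps, specials)
                  if len(piece) >= 15:
                      print(token_id, piece)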
        
         | nickvincent wrote:
         | I think it's related to Reddit users who posted (very
         | frequently!) on a counting focused subreddit (people literally
         | post "1", "2" , "3" in sequence so usernames appear 50k+
         | times). Some screenshots and links in this Twitter thread:
         | https://twitter.com/SoC_trilogy/status/1623118034960322560
         | 
         | Plus additional commentary here:
         | https://twitter.com/nickmvincent/status/1623409493584519168 (in
         | short: I think this situation is comparable to a "Trap Street"
         | https://en.wikipedia.org/wiki/Trap_street that reveals when a
         | map seller copies another cartographer)
         | 
         | I hadn't seen the Twitch plays pokemon hypothesis though (from
         | another comment here), I wonder if it could be both!
        
       | SirMaster wrote:
       | rawdownloadcloneembedreportprint
       | 
       | Tokens 1 Characters 32
       | 
       | Weird...
        
         | stackedinserter wrote:
         | "not_a_word" token
        
       | dschnurr wrote:
       | Hi folks - I work at OpenAI and helped build this page, awesome
        | to see it on here! Heads up that it's a bit out of date, as
        | GPT-4 has a different tokenizer than GPT-3. I'd recommend
        | checking out tiktoken (https://github.com/openai/tiktoken) or
        | this other excellent app that a community member made
        | (https://tiktokenizer.vercel.app).
        
         | teruakohatu wrote:
          | That is very helpful, thank you. I had not realised the
          | latest models were now tokenizing numbers as 3-digit groups.
          | Can you give any insight into why 3 digits?
        
       | ubj wrote:
       | > A helpful rule of thumb is that one token generally corresponds
       | to ~4 characters of text for common English text. This translates
       | to roughly 3/4 of a word (so 100 tokens ~= 75 words).
       | 
       | Just for fun I tried entering in
       | "pneumonoultramicroscopicsilicovolcanoconiosis" and
       | "antidisestablishmentarianism". The first was pretty evenly split
       | into tokens of length 1-5 characters, but the second put all of
       | "establishment" into a single token.
       | 
       | No useful conclusions drawn, but it was an interesting test.
        
         | luke_cq wrote:
          | I desperately want to be able to get a concrete number of
          | tokens for my prompt before making a call - things like this
          | make it very hard to request the right max_tokens value for
          | longer prompt/generation pairs.
        
           | sharkjacobs wrote:
           | https://github.com/openai/tiktoken
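            A minimal sketch of counting tokens locally with tiktoken before
            the call (the model name is just an example; chat-formatted
            requests add a few extra tokens per message on top of this):
         
                import tiktoken
         
                prompt = "Summarize the following text ..."
                enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
                prompt_tokens = len(enc.encode(prompt))
         
                context_window = 4096  # assumed context length for the model
                print(prompt_tokens, "prompt tokens; up to",
                      context_window - prompt_tokens, "left for the completion")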
        
           | [deleted]
        
       | tosh wrote:
       | It is very interesting to compare how various languages
       | (including programming languages) are tokenized.
        
       | jonathankoren wrote:
        | It looks like this thing doesn't tokenize into anything with
        | semantic meaning, but rather just into sequences of bytes that
        | match some sort of information-theoretic criterion. It doesn't
        | appear to follow any linguistic (written or spoken) pattern. I
        | guess it's fine for their specific use case, but whatever.
        | 
        | Tokenization is such a basic and domain-specific operation, it
        | feels like someone had to demo _something_.
        | 
        | Bonus (code) points for just saying "fuck it" on emojis. They
        | didn't even split them into code points.
        
       | darknoon wrote:
       | We noticed that this webpage is out of date for recent models so
       | we (diagram.com) commissioned a better one that lets you pick any
       | of OpenAI's models including chats:
       | 
       | https://tiktokenizer.vercel.app/
        
         | JonathanFly wrote:
         | Wow, I can't thank you enough for this. Somehow I never noticed
         | that GPT-4 doesn't use separate tokens for each tab like 3.5. I
         | was wasting a lot of time minimizing excessively tabbed code to
         | save on the token count! Like seriously way too much time, all
         | based on a bad assumption.
         | 
         | https://twitter.com/jonathanfly/status/1643633463260577794
        
       | m3kw9 wrote:
        | It's made so people don't game the system. Say "hello world"
        | would be the same as "hello_world" for the LLM. If they didn't
        | count tokens by letter, I would be using it for free.
        
       | [deleted]
        
       | coffeeri wrote:
       | OpenAI seems to use Tiktoken [0]. It also covers GPT-4 token
       | encoding.
       | 
       | [0] https://github.com/openai/tiktoken
        
         | lspears wrote:
         | Seems odd they don't reference this on the page. Instead they
         | list:
         | 
         | "If you need a programmatic interface for tokenizing text,
         | check out the transformers package for python or the
         | gpt-3-encoder package for node.js."
         | 
         | with the links:
         | 
         | https://huggingface.co/docs/transformers/model_doc/gpt2#tran...
         | 
         | https://www.npmjs.com/package/gpt-3-encoder
        
         | [deleted]
        
       | brianbest101 wrote:
       | [dead]
        
       | jscheel wrote:
       | Tiktoken is pretty nice. I've been exposing it as an internal
       | service in our infrastructure so that we can get token counts
       | easily. The bigger problem is figuring out how to chunk longer
       | contexts so that you stay within the context window limit defined
       | by the model you are using.
        
       | alli_star wrote:
       | [dead]
        
       | otaviogood wrote:
       | '1984 is 1 token. 1884 is 2 tokens.'
       | 
        | I would be surprised if they still use this tokenization, as
        | it's not math-friendly.
        
         | sharkjacobs wrote:
         | It's a language model, not a mathematical model
        
       | dudeinjapan wrote:
       | Interesting that Japanese seems to get every character/letter
       | tokenized individually.
        
         | fenomas wrote:
         | Huh, unless the demo is broken it seems to tokenize unicode
         | per-byte, rather than per-character.
         | 
          | E.g.:
          | 
          |     "a" => [40948]
          |     "Ya " => [12859, 250]
          |     "a" => [171, 121, 109]
        
           | numpad0 wrote:
           | "0123456789" => [           171, 120, 238, 171, 120, 239,
           | 171, 120, 240, 171, 120, 241,            171, 120, 242, 171,
           | 120, 243,            171, 120, 244, 171, 120, 245,
           | 171, 120, 246, 171, 120, 247       ]            "~" => [171,
           | 121, 252, 198]
           | 
           | Whoa. Literally just giving it "Potato" gives thrice as much
           | token count as the letter count, 18 tokens for 6 letters.
        
           | dudeinjapan wrote:
           | Yeah I noticed similar. That can't be right...
        
             | rhdunn wrote:
             | It's probably operating on UTF-8 data on a byte-per-byte
             | level without any additional processing. Just feeding it
             | the raw string data and letting it assign the tokens.
             | 
             | It's similar to how it is splitting words at arbitrary
             | points, rather than at clear morphological or lexical
             | locations (e.g. on the Jane Austen text `"Now, ma'am," said
             | Jane to her aunt, "shall we join Mrs. Elton?"` I've seen it
             | tokenize that as `"|Now|,| ma|'|am|,"| said| Jane| to| her|
             | aunt|,| "|shall| we| join| Mrs|.| El|ton|?"`).
        
               | dudeinjapan wrote:
               | I would find that hard to believe, as the bytes have zero
               | semantic meaning, and moreover, pairing the wrong bytes
               | in the output will result in complete gibberish. It would
               | be akin to tokenizing each English letter "N|o|w|,|
               | |m|a|'|a|m|..." except far worse.
               | 
               | Moreover it's trivially easy to tokenize the glyphs.
        
         | hn_throwaway_99 wrote:
          | For which alphabet, or for all of them? For kanji that would
          | make sense, as each character is (sort of) a word. Hiragana
          | and katakana are phonetic, with each character usually
          | representing a consonant-vowel pair, so even then there is
          | more information content than in a single English letter.
        
           | planbattack wrote:
           | Japanese to English translator here. The general rule of
           | thumb (that is often used for billing estimates) is that N
           | Japanese characters = N/2 English words.
           | 
           | So if you have a Japanese source text that is 2,000
           | characters, the English translation will be around 1,000
           | words.
           | 
           | I tested a translation (one sentence) from a previous job:
           | 
           | Japanese: 94 characters, 128 tokens
           | 
           | English: 39 words (232 characters), 47 tokens
           | 
           | Seems quite unbalanced given that the amount of "information"
           | in the two is equivalent.
        
             | dudeinjapan wrote:
             | Oof... that's a rough job to have in the world of
             | ChatGPT...
        
       | bilater wrote:
       | Dammit they copied me lol https://www.gptcalculator.xyz/
        
         | bilater wrote:
         | mine has an API though which hopefully is useful
        
           | thatwasunusual wrote:
           | Also, their solution works. Yours just says "Loading..."
           | whenever I try it.
        
             | bilater wrote:
             | Is working for me
        
       | non- wrote:
       | I found this tool recently when it was linked from this
       | Computerphile video[1] about "glitch tokens".
       | 
       | tldw:
       | 
        | Certain junk data was thrown out post-tokenization, e.g. the
        | /r/counting[2] community data and debug logs from Rocket
        | League.
        | 
        | Some tokens specific to those contexts stuck around, however,
        | and are now like "a color you've never seen before" as far as
        | GPT-X models are concerned.
        | 
        | Giving the model one of these "glitch" tokens causes it to
        | kind of freak out and return gibberish or some completely
        | random response, because it never encountered them during
        | training - they were removed when the data was cleaned.
       | 
       | [1] https://www.youtube.com/watch?v=WO2X3oZEJOA [2]
       | https://reddit.com/r/counting
        
         | non- wrote:
          | Another interesting tweet[1] I saw today shows how you can
          | ask ChatGPT to compress text and it invents its own
          | (effective!) shorthand.
          | 
          | I bet it's related somehow to glitch tokens and the way GPT
          | is grouping tokens internally.
         | 
         | [1]
         | https://mobile.twitter.com/VictorTaelin/status/1642664054912...
        
           | artdigital wrote:
           | I experimented with something similar previously and it
           | doesn't really work. It usually can't decompress it properly
        
       | MuffinFlavored wrote:
       | I took a random Java file I had laying around that I was working
       | on lately.
       | 
       | ~100 lines of code + whitespace
       | 
       | 1300-1900 tokens
       | 
       | So if I fed this to OpenAI and said "how can I make this file
       | better/improve upon it", it would have cost:
       | 
       | between $0.03 and $0.12 for this one file using GPT-4
       | 
       | not sure I could use gpt-3.5-turbo since it says it is for chat
       | and not code?
       | 
        | Does that sound right? $0.05 for every file of source code
        | scanned sounds too high for realistic usage. Even $0.01 sounds
        | high? A modern company might have 1,000,000+ files of code,
        | no?
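        A rough cost sketch in Python (prices assumed from openai.com/pricing
        at the time: gpt-4 8k context at $0.03 per 1K prompt tokens and $0.06
        per 1K completion tokens, gpt-3.5-turbo at $0.002 per 1K tokens; the
        completion length is a guess):
         
            prompt_tokens = 1900
            completion_tokens = 500   # assumed length of the suggestions
         
            gpt4 = prompt_tokens / 1000 * 0.03 + completion_tokens / 1000 * 0.06
            gpt35 = (prompt_tokens + completion_tokens) / 1000 * 0.002
            print(f"gpt-4: ${gpt4:.3f}  gpt-3.5-turbo: ${gpt35:.4f}")
            # roughly $0.087 vs $0.0048 for this one file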
        
         | engrefoobiz wrote:
          | GPT-4 costs 30 times more than gpt-3.5-turbo, and 60 times
          | more if you use the 32k-token gpt-4 model. It's by far their
          | most expensive service! I'm using gpt-3.5-turbo, also for
          | coding, and honestly it does just fine.
        
         | maherbeg wrote:
          | That seems remarkably cheap compared to engineering hours.
        
         | smallerfish wrote:
         | 3.5-turbo can definitely understand code, just like ChatGPT
         | can. GPT4 is better at complex tasks, but 3.5-turbo is always
         | worth evaluating.
        
         | SubiculumCode wrote:
         | excuse my ignorance, but I thought it was $20 per month.
        
           | MuffinFlavored wrote:
            | It would be cool if they told their $20/mo users "here's
            | how much your past 30-day usage would have cost if we had
            | billed you via the API" (i.e. how many tokens/sessions/
            | chats you used).
        
           | bikingbismuth wrote:
           | That is for the interactive chat experience. API calls are
           | sold a la carte. Details here https://openai.com/pricing.
        
             | SubiculumCode wrote:
             | thanks!
        
       | RugnirViking wrote:
        | Really interesting. How do these work? Are they a separate
        | AI/neural net/model from the transformer? They don't seem to
        | follow any humanlike structure or process.
        
       | aqme28 wrote:
       | What's the benefit of OpenAI charging per-token instead of per-
       | character or per-word?
       | 
        | Since tokenization algorithms change from model to model and
        | version to version, it seems like they've added a lot of
        | complication for no actual benefit to the user, except for a
        | little peek under the hood.
       | 
       | Is there a benefit to this scheme that I'm not seeing? Is there
       | some way to game the system otherwise?
        
         | smallerfish wrote:
         | The models know how to decode base64, so if they were naive,
         | you could pass them one base64 "word" representing a prompt
         | thousands of lines long.
         | 
         | There are still ways to compress prompts though.
        
         | qeternity wrote:
         | Because tokens are the unit of work in an LLM and it's not
         | correct to say that tokens or even embeddings change between
         | models.
        
         | hn_throwaway_99 wrote:
         | It's not a "benefit", it's simply how the technology works -
         | the underlying model just fundamentally works on tokens as it's
         | atomic inputs.
         | 
         | The models don't know anything about words, just tokens.
        
         | typest wrote:
         | It's not that they're just charging per token -- the actual
         | models are operating on a token level. The model sees things in
         | terms of tokens, and in openai's case, these tokens are subword
         | (pieces of words), not words themselves, not characters.
         | 
         | So the real question is, what is the benefit of modeling your
         | tokens as subwords, rather than as characters or words?
         | 
         | I think there is a lot of nuance here, and I don't understand
         | it all. But, some benefits:
         | 
          | * Words, at least in English, are composed of different pieces,
          | like roots, prefixes, and suffixes. Modeling at the subword
          | level more naturally aligns your model with this aspect of
          | language.
         | If I tokenize "warmest", I get "warm" and "est". So, the
         | meaning of the token "est" can be learned by the model --
         | whereas if you modeled by words, the model would have to
         | individually relearn this aspect of information for every word
         | ending in "est".
         | 
         | * Modeling at the subword level makes your sequences a lot
         | shorter than modeling at the character level, which should help
         | with things like efficiency.
         | 
         | * Modeling at the subword level makes your vocabulary a lot
         | bigger than just modeling at the character level, which I
         | suspect helps the model, as it can assign the subwords
         | themselves meaning. E.g., it can learn the meaning of the token
         | "warm" on its own, rather than having to learn this meaning
         | only through learning the relationship of the tokens "w" "a"
         | "r" and "m".
         | 
         | Hope this helps! Would love for anyone else to chime in/add
         | on/correct me.
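          A minimal sketch for checking splits like these with tiktoken (the
          "gpt2" encoding is an assumption; exact splits differ between
          encodings, so treat the pieces as illustrative):
         
              import tiktoken
         
              enc = tiktoken.get_encoding("gpt2")
              words = ["warmest", "coldest", "darkest",
                       "antidisestablishmentarianism"]
              for word in words:
                  pieces = [enc.decode([t]) for t in enc.encode(word)]
                  print(word, "->", pieces)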
        
           | rhdunn wrote:
           | I've noticed that it correctly splits warm|est, cold|est,
           | bleak|est, but darkest is a single token.
           | 
           | I've also seen it group `?"`, `."`, `!"`, and `.--` into
           | single tokens.
           | 
            | It also splits some words like "Elton" as El|ton.
            | Presumably in that case it has mis-identified a -ton
            | suffix.
        
         | bootsmann wrote:
          | The tokenizer doesn't actually change model to model; by the
          | looks of it this is still the GPT-2 tokenizer. Also, the per-
          | token cost makes sense because predicting a token is a
          | single forward pass through the model, while for other cost
          | measures they would need to do some science to make it work
          | out on average.
        
       | nextworddev wrote:
        | One interesting fact I stumbled upon recently is that the
        | GPT2Tokenizer library and the Tiktoken library produce the
        | same number of tokens for the `text-davinci-003` model,
        | despite GPT2Tokenizer being GPT-2 and text-davinci-003 being
        | GPT-3.5.
        | 
        | For code, however, the Tiktoken library and GPT2Tokenizer
        | produce different tokenizations.
        
         | shevis wrote:
         | Key difference here is in tokenization encoders. The newer
         | models make use of the `cl100k_base` encoding.
        
       | devit wrote:
       | Interestingly they seem to have different token ids for "Word",
       | "word", " Word" and " word". That seems kind of a wasteful
       | design.
       | 
       | It seems like it would make more sense to have a single token for
       | all variants and then a "capitalized where not expected" token
       | (e.g. "foo Foo"), a "not capitalized where expected" token (e.g.
       | "foo. foo") and a "missing space where expected" token (e.g.
       | "foo.Foo").
       | 
        | The lack of any normalization also means that WrItInG tExT
        | lIkE tHiS will prevent future GPT versions from making full
        | use of the text during training, unless they change the
        | tokenization (or the model is so overpowered that it doesn't
        | matter).
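        A minimal sketch of the case/space sensitivity being described, with
        tiktoken (the "gpt2" encoding is an assumption; the IDs are printed
        rather than hard-coded):
         
            import tiktoken
         
            enc = tiktoken.get_encoding("gpt2")
            for s in ["Word", "word", " Word", " word"]:
                print(repr(s), enc.encode(s))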
        
         | neerd wrote:
         | They charge by the token so I'm not so sure about that
        
         | rafram wrote:
         | Not all languages use capitalization the same way (or have it
         | at all) and not all LLM input/output is natural language.
        
         | gradys wrote:
         | The model is indeed so overpowered that it doesn't matter in
         | practice. See the Sentencepiece paper for some discussion of
         | the design decisions on stuff like whitespace.
        
         | rickdeckard wrote:
         | I don't think it's wasteful, if I ask GPT to process/generate a
         | non-human language like a linux shell, capitalization is
         | crucial...
        
         | williamstein wrote:
         | I am glad it tokenizes Python and all other programming
         | languages in a systematic way.
        
         | AbrahamParangi wrote:
         | The tokenization is a statistical product of the frequency of
         | byte sequences in the training corpus. It might seem
         | unintuitive but I wouldn't go so far as to say it's "wasteful".
         | It may very well be but frankly you'd have to have a good
         | explanation for why byte pair encoding is so much more
         | successful than other tokenization schemes.
        
           | swyx wrote:
           | > why byte pair encoding is so much more successful than
           | other tokenization schemes.
           | 
           | what's the evidence for that please? just asking because i
           | dont know, not because i disagree. ive read a bunch of BPE
           | explainers but nobody has bothered to explain _why_ or _how_
           | we landed on BPE
        
             | hn_throwaway_99 wrote:
             | I'm not an AI expert, so I don't know what research has
             | been done to verify it, but this comment below,
             | https://news.ycombinator.com/item?id=35454839 , helped me
             | understand it, and intuitively I think it makes sense.
             | 
             | That is, byte pair encoding tokenization is _itself_ based
             | on how common it is to see particular characters in
             | sequential order in the training data. Thus, if the
             | training data really frequently sees characters together
             | (as, of course, it does in common words), then these words
             | get a single token. Which, given how an LLM works, really
             | makes sense because it looks for statistical relationships
             | among strings of _tokens_. Thus, the way I think of it is
             | that byte pair encoding is essentially like a pre-
             | processing step that _already_ optimizes for statistical
             | relationships among individual _characters_.
        
         | RC_ITR wrote:
         | In practice, GPT uses byte-pair encoding [0] for each Unicode
         | character.
         | 
         | That's why cases are treated differently - they're different in
         | Unicode.
         | 
         | This is also the only way to teach a model how to properly
         | capitalize things (since there are no human defined rules).
         | 
         | [0] https://towardsdatascience.com/byte-pair-encoding-subword-
         | ba....
        
           | totony wrote:
           | The actual tokenizer often does not matter since you can add
           | pre processors/normalizers. I assume they did it like this
           | because capitalization matters in a lot of contexts
        
             | tel wrote:
             | Similarly, pre-processing can be harmful. I think there are
             | reasonable predictive differences when predicting the next-
             | word follow up to a sentence that's properly capitalized
             | versus one that's all lowercase. Not only will the "all
             | lowercase" convention likely prevail in forward
             | predictions, it also indicates something about the context
             | of the writing, the author, their sense of style.
             | 
             | It's hard to argue that this information isn't (a) being
             | captured by GPTs and (b) important. If you just threw it
             | away, GPTs would have less information available to absorb.
        
         | isuckatcoding wrote:
          | I wonder if this is why writing things in all caps in
          | ChatGPT sometimes has some effect on the response.
        
         | king_magic wrote:
         | It's not surprising or bad design at all. Words mean different
         | things depending on context, punctuation, etc.
        
       | stared wrote:
       | How are these encodings created?
       | 
        | My guess is that it is related to text compression, but I
        | would be happy to see the algorithm that is responsible for
        | generating them.
        
         | lairv wrote:
         | https://en.wikipedia.org/wiki/Byte_pair_encoding
         | 
          | tldr: start with individual characters and greedily merge
          | the most frequent pairs
          | 
          | A consequence is that an encoding is suited to the dataset
          | it was trained on, so if a language is under-represented in
          | the data, it will take a higher number of tokens to encode
          | it.
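          A toy byte-pair-encoding trainer, just to illustrate the greedy
          merge loop described above (not OpenAI's implementation; the
          training text and merge count are made up):
         
              from collections import Counter
         
              def train_bpe(text, num_merges):
                  seq = list(text)                  # start from single characters
                  merges = []
                  for _ in range(num_merges):
                      pairs = Counter(zip(seq, seq[1:]))
                      if not pairs:
                          break
                      (a, b), _ = pairs.most_common(1)[0]
                      merges.append(a + b)
                      out, i = [], 0                # merge every occurrence of (a, b)
                      while i < len(seq):
                          if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                              out.append(a + b)
                              i += 2
                          else:
                              out.append(seq[i])
                              i += 1
                      seq = out
                  return merges, seq
         
              print(train_bpe("low lower lowest low low", 6))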
        
           | thomasahle wrote:
           | Reading the sentencepiece paper they say:
           | 
           | > The main difference to other compression algorithms, such
           | as Huffman encoding, which have been proposed to produce a
           | variable-length encoding of words for NMT (Chitnis and
           | DeNero, 2015), is that our symbol sequences are still
           | interpretable as subword units, and that the network can
           | generalize to translate and produce new words (unseen at
           | training time) on the basis of these subword units.
           | 
           | I don't see why Huffman encoding doesn't give you that same
           | interpretability?
           | 
            | Actually the algorithm for producing a Huffman tree is
            | very similar to that for BPE:
           | 
           | > The process begins with the leaf nodes containing the
           | probabilities of the symbol they represent. Then, the process
           | takes the two nodes with smallest probability, and creates a
           | new internal node having these two nodes as children. The
           | weight of the new node is set to the sum of the weight of the
           | children. We then apply the process again, on the new
           | internal node and on the remaining nodes (i.e., we exclude
           | the two leaf nodes), we repeat this process until only one
           | node remains, which is the root
           | 
           | (from https://en.m.wikipedia.org/wiki/Huffman_coding)
           | 
           | I guess the issue is that Huffman requires the alphabet to be
           | predefined, where BPE "discovers it" as it goes along.
        
             | astrange wrote:
             | > I don't see why Huffman encoding doesn't give you that
             | same interpretability?
             | 
             | It might just be that a Huffman encoding is a bit-string
             | and not a byte-string.
             | 
             | BPE encoding causes interesting failures, like how it can't
             | do anagrams or spell words backwards properly. And yet it
             | can make rhyming poems now.
        
               | thomasahle wrote:
               | > BPE encoding causes interesting failures, like how it
               | can't do anagrams or spell words backwards properly. And
               | yet it can make rhyming poems now.
               | 
               | I don't think BPE encoding makes anagrams impossible.
               | Just harder.
        
         | [deleted]
        
       | GistNoesis wrote:
       | Accidentally quadratic !
       | 
        | Byte pair encoding is by construction quadratic in the length
        | of the words. And usually the input is pre-split into words
        | before being given to the byte pair encoder.
        | 
        | Hopefully they use a different implementation in prod. It
        | needs to be sanitized against very long words (like
        | 10k-character-long words :) ).
       | 
       | In previous tokenizer like CLIP
       | (https://github.com/openai/CLIP/blob/main/clip/simple_tokeniz...
       | ) , they used additional preprocessing steps like html escaping
       | and various cleanup preprocessing using some python library
       | (ftfy, html and regex), which made porting the code exactly to
       | other languages a real pain.
       | 
       | Sadly this library doesn't solve that :'-(
        
       | simonw wrote:
       | This tool is really useful for helping develop a better intuition
       | for how GPT models actually work.
       | 
       | Paste in some text and switch to the token IDs view. Note how
       | common words (like "the ") have low integer token IDs, while
       | things like emojis are split into several numbers.
       | 
       | An LLM is a function that takes an array of integers and returns
       | a new array of integers. Seeing the tokens like this helped me
       | reinforce that mental model.
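        A minimal sketch of that token-IDs view with tiktoken (the "gpt2"
        encoding is an assumption):
         
            import tiktoken
         
            enc = tiktoken.get_encoding("gpt2")
            ids = enc.encode("The quick brown fox 🦊")
            print(ids)                             # the array of integers the model sees
            print([enc.decode([i]) for i in ids])  # the text piece behind each ID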
        
         | psychphysic wrote:
         | Takes an array and returns a single integer is more correct and
         | usefully so?
         | 
         | What I still can't wrap my head around is that tokens often
         | don't align with word structures.
         | 
         | "Antidepressants" I'd imagine tokenizes as "anti" "depress"
         | "ant". But nope. And "antipsychotic" tokenizes differently from
         | it too!
         | 
         | I assumed the output is a token i.e. a single integer and
         | that's rarely even a full word?
        
           | shagie wrote:
           | Imagine a circular list (in however you want to construct
           | that) that matches the input size for the model.
           | 
           | The prompt is initially loaded at the start of the list and
           | the model is run and produces high activation on a single
           | output. That token output is then fed to the end of the input
           | circular list and also added to the "this is what the model
           | returned."
           | 
           | This process of running the model, getting the token output
           | and sending one copy to the input list and one copy to the
           | return string is repeated until the number of tokens
           | generated hits a numeric limit or a token that represents the
           | stop token is encountered.
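            A sketch of that outer loop in Python (model() here is a stand-in
            for a forward pass returning the next token ID, not a real API):
         
                def generate(model, prompt_ids, max_new_tokens, stop_id):
                    ids = list(prompt_ids)    # the rolling input context
                    output = []               # "this is what the model returned"
                    for _ in range(max_new_tokens):
                        next_id = model(ids)  # run the model on the current context
                        if next_id == stop_id:
                            break
                        ids.append(next_id)   # feed the token back into the input
                        output.append(next_id)
                    return output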
        
           | raldi wrote:
           | How does it decide whether to split "antid" into "ant" "id"
           | or "anti" "d"?
        
           | bastawhiz wrote:
           | > "Antidepressants" I'd imagine tokenizes as "anti" "depress"
           | "ant". But nope. And "antipsychotic" tokenizes differently
           | from it too..
           | 
           | Tokens are symbols. You're thinking of them like embedding
           | vectors. Tokens represent the step before a meaning is
           | assigned to the text: it turns some unit of text into what's
           | essentially an identifier.
           | 
           | Which is to say, two homonyms would have the same token id,
           | even though they have different meanings. Tokens have no
           | notion of context.
        
             | crakenzak wrote:
             | what is the benefit to such splitting of text based on
             | seemingly meaningless lines? isn't there a better way to do
             | it?
        
               | rmellow wrote:
               | Breaking text into sub-word units...
               | 
               | 1. Greatly reduces memory usage. Instead of memorizing
               | every inflection of the word "walk", it memorizes the
               | root (walk) and the modifiers (ing, ed, er, ...). These
               | modifiers can be reused for other words.
               | 
               | 2. Allows for word compositions that weren't in the
               | training set. This is great for uncommon or new
               | expressions like "googlification" or "unalive".
        
               | shagie wrote:
               | The walk example doesn't quite hold up.
               | 
               | If you put:                   test walk walker walking
               | walked
               | 
               | into the tokenizer you will see the following tokens:
               | [test][ walk][ walk][er][ walking][ walked]
               | 
               | Only walker is broken up into two different tokens.
               | 
               | I added "test" to that because walk at the start doesn't
               | include the leading space and [walk] and [ walk] are
               | different tokens.
               | 
                | For even _more_ fun, [walker] is a distinct token if
                | it doesn't include the leading space.
                | test walker floorwalker foowalker
                | 
                | becomes:
                | [test][ walk][er][ floor][walker][ fo][ow][alker]
                | 
                | How _we_ think of words doesn't cleanly map to tokens.
               | 
               | (Late edit)                   walker floorwalker
               | 
               | becomes tokenized as:                   [walker][
               | floor][walker]
               | 
                | So in that case, they're the same token. It's curious
                | how whitespace influences the word-to-token mapping.
        
               | nonfamous wrote:
               | There's no syntax or structure to the token set. The
               | actual tokens were algorithmically selected based on the
               | training data to (putting things loosely) optimize
               | compression of the training data given a token set size.
        
               | sharkjacobs wrote:
               | ChatGPT models syntax, not semantics
               | 
               | There's no "better way" to do it because the tokens are
               | all meaningless to ChatGPT, it only cares about how
               | efficiently they can be parsed and processed.
               | 
               | The competing desires are to model all language with the
               | biggest tokens possible, and the fewest tokens possible.
               | The lines aren't meaningless, text is split into the
               | largest possible chunks using a set of the most common
               | tokens.
               | 
               | Common words, like "the", "fast", "unity", "flying" are
               | all tokens, but it's not because they're words, it's
               | because they're common letter clusters, undistinguished
               | from "fl", "ing", "un", "ple"
               | 
               | "gadflying" is tokenized into [g, ad, flying], even
               | though it's only loosely semantically related to
               | "flying", it's just the most efficient way to tokenize
               | it.
        
               | roel_v wrote:
                | There is no 'meaning' inside these AIs. It's terribly
                | confusing to think about these LLMs as having
                | 'meaning' in the same way we humans do. It's all just
                | statistics: given a sequence of numbers (each
                | representing some abstract token), predict what is
                | most likely to come next. That's how 'simple' it is.
                | It's also what makes it so amazing that these things
                | work as well as they do. I giggle like a schoolgirl
                | every time I get it to add some functionality to a
                | function, or write an entire new function, and that's
                | several times a day for what is now months on end. But
                | the key to using them is seeing that there is no
                | 'meaning' in them. It's all just streams of (to the
                | machine) meaningless tokens.
        
               | goldfeld wrote:
               | If it just decides on a single token at a time, can it
               | backtrack and choose differently under that operation,
               | given the next tokens? What I wonder is, how can it plan
               | ahead and output meaningful (to us) responses, like
               | working code or useful articles? How can it "reason"
               | logically when it needs to solve a problem, a riddle etc,
               | by only selecting a token at a time? Wouldn't that dumbed
               | down approach prove myopic for complex compositions?
               | Doesn't it need some over-ruling goal-based heuristic
               | system?
        
               | wingspar wrote:
               | There's no planning, no reason. It's all 'what word is
               | next...'
               | 
                | I found Stephen Wolfram's explanation helpful. He has
                | a YouTube video version which I enjoyed too. This blog
                | post was on HN last month, but I never get good search
                | results on HN.
               | 
               | https://writings.stephenwolfram.com/2023/02/what-is-
               | chatgpt-...
        
               | zeven7 wrote:
               | It is wild that a process like that can generate working
               | code. Humans speak their words in order, but they don't
               | write their code in order. Why would writing code in
               | order work?
        
               | flangola7 wrote:
               | With GPT-4 this process also allows it to understand what
               | is inside a graphical image and talk intelligently and
               | coherently about it.
               | 
               | Next token prediction produces the most head exploding
               | emergent effects.
        
               | goldfeld wrote:
                | If we get a bit quantum (or an act of God for some),
                | then backtracking could happen by collapsing the dead
                | ends and "changing" history to stay with what turns
                | out to be the solid plan. Could emergent consciousness
                | in the AI's neurons do the planning and reasoning that
                | it rather seems to be doing but ML experts will say it
                | is not? If our consciousness could by any chance
                | reside not in the electrical currents of the wetware,
                | could an AI's reason also not reside in tokens? Is
                | there some mysterious process possibly taking place?
        
               | psychphysic wrote:
               | Bard at least produces multiple drafts. I believe that is
               | preferred over backtracking.
               | 
               | Generation is ultimately deterministic (seeded prng) so
               | backtracking wouldn't make sense.
        
               | psychphysic wrote:
               | You say that but we have models of meaning in humans too.
               | 
               | You can put people in an fMRI and ask them to think
               | "car".
               | 
               | You can ask someone to think of objects and detect when
               | they think "car".
               | 
                | What happened there is pairing a bunch of tensors to
                | meanings and matching them.
               | 
               | We can do something similar with embeddings.
               | 
               | To be clear I don't intend to give the impression that
               | these LLMs are doing something miraculous. Just that we
               | are increasingly peeling back the veil of how brains
               | think.
        
               | wizzwizz4 wrote:
               | > _You can put people in an fMRI and ask them to think
               | "car"._
               | 
               | I don't know about other people, but when I think "car"
               | really hard, I can feel the muscles in my throat adjust
               | slightly to match the sound of the word "car". Perhaps
                | that sort of thing is what the MRI machine is picking
                | up, rather than being able to pick up some kind of
               | "internal representation" of _car_.
        
               | Zircom wrote:
               | Maybe they should do the same study on people that lack
               | an internal monologue to see if they have the same
               | results.
        
               | psychphysic wrote:
               | In fact it also picks up the parts of your brain to do
               | with driving (if you're a driver). Maybe also the part to
               | do with the smell of fuel in me, but not you.
               | 
               | It'll also light up in the parts of my brain to do with
               | reading, writing, hearing the word in the languages I
               | speak.
               | 
               | What does car mean to me if it doesn't connect to all the
               | concepts that relate to cars?
        
               | [deleted]
        
               | TaylorAlexander wrote:
               | There's no meaning to the tokens, but research has shown
               | that the models themselves capture meaning. Technically
               | they are producing the next word but in order to do that
               | for a dataset of a trillion words they actually have to
               | develop internal models of how the world works. There was
               | a post on HN a couple days ago that talked about the
               | research done to show this.
        
               | ruuda wrote:
               | You could split on words instead of tokens, but then you
               | need a large vocabulary, you can't deal with inputs that
               | contain a word which is not in the vocabulary, and it's
               | not so clear what a "word" even is.
               | 
               | Instead of coming up with more and more heuristics to
               | chop a sequence of bytes up in "words" in a vocabulary,
               | we could simply set a limit on the size of the vocabulary
               | (number of tokens), put all bytes in there (so we can at
               | least handle any input byte by byte), and pack the
               | remaining space with the most common multi-byte byte
               | sequences. Then you end up with tokens like here.
        
             | [deleted]
        
         | eternalban wrote:
         | Try this:
         | 
         | 'this is a day that this sentence with clarify that day. Is
         | this not a good day?'
         | 
         | [5661, 318, 257, 1110, 326, 428, 6827, 351, 18282, 326, 1110,
         | 13, 1148, 428, 407, 257, 922, 1110, 30]
         | 
         | Note 'day' is solidly 1110 here. Now start a sentence with day.
         | 
         | "day began with laughter"
         | 
         | [12393, 2540, 351, 20263, 13]
         | 
         | So the logical word -> token(s, p) -> id(s) function definitely
         | has 1 position parameter as well.
         | 
         | "Day after day after this day"
         | 
         | [12393, 706, 1110, 706, 428, 1110]
         | 
         | "Day day Home home home"
         | 
         | [12393, 1110, 5995, 1363, 1363]
         | 
         | "day day day home home home"
         | 
         | [820, 1110, 1110, 1363, 1363, 1363]
         | 
         | [corrected/edited: so case-sensitive and position sensitive as
         | well.]
         | 
          | btw doesn't the output array contain the prompt as well
          | (because of the transformer architecture? not entirely sure,
          | ~iirc)
        
           | fenomas wrote:
           | > So the logical word -> token(s, p) -> id(s) function
           | definitely has 1 position parameter as well.
           | 
           | You're missing that it groups in spaces. The position isn't
           | relevant, but "day" is a different token than " day".
        
             | eternalban wrote:
             | Ah, you're right. [try "datedate". interesting how it
             | partitions that as 'dated' + 'ate'. Compare with "matemate"
             | -> 'mat', 'emate'.]
             | 
             | p.s. "ifthiswasagermanword if this was a german word."
             | 
             | It's not even spaces. That second sequence ' german' is
             | chopped up as 'ag' 'erman'.
        
           | hn_throwaway_99 wrote:
           | It's a lot simpler than that. You can see in the tokenizer
           | that the boundary for words includes the preceding space. So,
           | since the first word doesn't have a preceding space, it has a
           | different token.
        
           | sebzim4500 wrote:
           | There's no position parameter, it's just that " day" and
           | "day" are different tokens.
        
         | smaddox wrote:
         | > An LLM is a function that takes an array of integers and
         | returns a new array of integers.
         | 
         | To refine this a bit more, a LLM is a function that takes an
         | array of integers (or really, a batch of arrays of integers),
         | and returns a probability distribution for each possible
         | integer, with the array shifted left by one place to enable
         | prediction.
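          A shape-only sketch of "one distribution per position, shifted
          left", with random numbers standing in for a real model (numpy
          only; the sizes are illustrative):
         
              import numpy as np
         
              batch, seq_len, vocab = 1, 5, 50257
              logits = np.random.randn(batch, seq_len, vocab)   # raw model output
              probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
              # probs[:, i, :] is the distribution over the token at position i + 1;
              # the last position's distribution is what gets sampled from next.
              next_token = int(probs[0, -1].argmax())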
        
           | lukasb wrote:
           | Could the properties of the distribution (the spread? not
           | stats literate enough) be used to calculate a confidence
           | metric for the answer?
        
             | fnbr wrote:
             | Yes! This is something that is done. The problem is that a)
             | it's tough to find a sane denominator as the likelihood of
             | the entire sequence can be quite small, even though it's
             | the best answer and b) the answer isn't grounded in
             | anything, so the confidence score isn't super helpful.
             | 
             | A score like this can be useful for active learning though,
             | where you find areas of low confidence in your dataset and
             | get more data to train on.
        
           | nonfamous wrote:
           | I've always wondered how stop tokens fit in here. Does the
           | LLM generate a probability for "stop" in addition to every
           | other token in the space? Or is stopping handled
           | heuristically by the outer loop that generates the output
           | tokens sequentially?
           | 
           | The API docs talk about letting you specify your own stop
           | token (like "<!-->") but I don't think "token" is meant in
           | the same sense here.
        
             | amilios wrote:
             | Yes, the model has something like an EOF token which it
             | emits for the output to end. It is part of the probability
             | distribution that the model predicts.
        
           | flir wrote:
           | A probability distribution for each token in the array, not
           | just the last one?
           | 
           | I don't understand that, because wouldn't the probabilities
           | later in the sentence be impacted by the tokens chosen
           | earlier in the sentence?
        
             | ruuda wrote:
             | You unfold it one token at a time by sampling from the
             | returned distribution. To control the amount of variation,
             | you can make the probability distribution more extreme, in
             | the most extreme case you only select the most likely
             | token, and the sequence becomes deterministic.
             | 
             | Yes, what happens later in the sentence depends on the
             | particular choice you made earlier in the sentence.
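              A minimal temperature-sampling sketch for the idea above (numpy
              only; not how OpenAI exposes it): dividing the logits by a
              temperature below 1 sharpens the distribution, and at 0 you
              just take the argmax, making generation deterministic.
         
                  import numpy as np
         
                  def sample(logits, temperature=1.0, rng=np.random.default_rng()):
                      if temperature == 0:
                          return int(np.argmax(logits))   # greedy, deterministic
                      z = logits / temperature
                      probs = np.exp(z - z.max())         # numerically stable softmax
                      probs /= probs.sum()
                      return int(rng.choice(len(probs), p=probs))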
        
             | smaddox wrote:
             | Yes, one distribution per position. This was a key
             | innovation that allowed training over the entire sequence
             | in parallel, rather than on one token prediction at a time,
             | thereby massively speeding up training.
             | 
             | More recently, there are models like RWKV that can run in
             | both parallel (GPT-like) mode for training and serial (RNN-
             | like) mode for inference.
             | 
             | But transformers always output a probability distribution
             | at each position in the context.
        
           | sorokod wrote:
           | To refine further: takes an array of integers and draws the
           | rest of the f**king owl.
        
             | turzmo wrote:
             | I would argue that my hard drive too can take an array of
             | integers and produce an owl.
        
           | VHRanger wrote:
           | Moreover:
           | 
            | The integers are really indices into the embedding space.
            | 
            | So you'd want to think of it more as the model maintaining
            | a giant matrix (one row = one token; one column = one
            | embedding feature).
            | 
            | The array of indices gets the relevant embeddings to shove
            | through the rest of the model's forward pass.
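            A minimal sketch of that lookup (numpy only; the sizes and IDs
            are illustrative, not OpenAI's):
         
                import numpy as np
         
                vocab_size, d_model = 50257, 768
                embedding = np.random.randn(vocab_size, d_model)  # one row per token
         
                token_ids = [464, 2068, 7586]   # whatever the tokenizer produced
                x = embedding[token_ids]        # shape (3, d_model), fed to the model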
        
       | yewenjie wrote:
       | NOTE: this is only valid for the old models (GPT-3 and Codex).
       | IIRC, there is no simple way to know the token usage for the new
       | models (gpt3.5-turbo and beyond).
        
         | iamjackg wrote:
         | You can use https://github.com/openai/tiktoken
        
         | sebzim4500 wrote:
         | Assuming the API docs are honest, all the publicly available
         | GPTs use the same tokens.
        
           | tedsanders wrote:
           | Different GPTs use different tokens:
           | https://github.com/openai/openai-
           | cookbook/blob/main/examples...
        
         | NameError wrote:
         | The API does tell you how many prompt and completion tokens
         | each request used, if you're okay knowing after-the-fact
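          A minimal sketch with the openai Python library of that era (the
          model name and message are just examples):
         
              import openai
         
              resp = openai.ChatCompletion.create(
                  model="gpt-3.5-turbo",
                  messages=[{"role": "user", "content": "Hello"}],
              )
              # the usage block reports the counts after the fact:
              # {'prompt_tokens': ..., 'completion_tokens': ..., 'total_tokens': ...}
              print(resp["usage"])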
        
         | tedsanders wrote:
         | This guide explains how to count tokens for 3.5-turbo and
         | beyond: https://github.com/openai/openai-
         | cookbook/blob/main/examples...
        
       ___________________________________________________________________
       (page generated 2023-04-05 23:01 UTC)