Post AV6HfUtoKTdgjKGD6u by simon@fedi.simonwillison.net
 (DIR) Post #AV6El4wBT30o5PL9ns by simon@fedi.simonwillison.net
       2023-04-28T04:29:47Z
       
       0 likes, 0 repeats
       
       I built a little tool for playing with the GPT-3 tokenizer using an Observable notebook. You can convert a string of text into numeric tokens, convert tokens back to text, and search the full list of tokens: https://observablehq.com/@simonw/gpt-3-token-encoder-decoder
       
 (DIR) Post #AV6FsCcXOrxltd6zE8 by TobiasFrech@ijug.social
       2023-04-28T04:41:58Z
       
       0 likes, 0 repeats
       
       @simon I wonder why my "My name is Andy" is 4 tokens and the German "Mein Name ist Andy" is 6 tokens, but "Me in Name is t Andy" is also 6 tokens, with the 2nd and 5th being different ones.
       
 (DIR) Post #AV6G33uY7dFu4NiNEm by simon@fedi.simonwillison.net
       2023-04-28T04:42:55Z
       
       0 likes, 0 repeats
       
       @TobiasFrech lots of common English words have one token for the word on its own and a different token for that same word with a preceding space
       
 (DIR) Post #AV6H9VP4v1nB1ldAOm by sxpert@mastodon.sxpert.org
       2023-04-28T04:43:03.199060Z
       
       0 likes, 0 repeats
       
       @TobiasFrech @simon now why would the token include the space ?
       
 (DIR) Post #AV6H9WINbQ1nnHVKb2 by simon@fedi.simonwillison.net
       2023-04-28T04:56:28Z
       
       0 likes, 0 repeats
       
       @sxpert presumably because they want to use as few tokens as possible, and input is usually words separated by spaces - so having tokens that include the space can halve the number they need to represent a sentence
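
A toy sketch of that point (this is a hypothetical mini-vocabulary, not the real GPT-3 vocab): with space-prefixed word tokens, a greedy longest-match encoder needs roughly half as many tokens as one that spends a separate token on every space.

```python
# Hypothetical mini-vocabulary: greedy longest-match tokenization,
# with and without space-prefixed word variants.

def encode(text, vocab):
    """Greedily match the longest vocab entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

words = ["my", "name", "is", "andy"]
plain = set(words) | {" "}                      # words plus a bare space token
spaced = set(words) | {" " + w for w in words}  # plus space-prefixed variants

sentence = "my name is andy"
print(encode(sentence, plain))   # ['my', ' ', 'name', ' ', 'is', ' ', 'andy'] -> 7 tokens
print(encode(sentence, spaced))  # ['my', ' name', ' is', ' andy'] -> 4 tokens
```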
       
 (DIR) Post #AV6HfUG6i5sckBgVE0 by sxpert@mastodon.sxpert.org
       2023-04-28T04:53:39.011143Z
       
       0 likes, 0 repeats
       
       @simon i mean, if tokens represent words, there’s no point in keeping a trace of the existence of whitespace
       
 (DIR) Post #AV6HfUtoKTdgjKGD6u by simon@fedi.simonwillison.net
       2023-04-28T05:02:25Z
       
       0 likes, 0 repeats
       
       @sxpert you need to be able to tell the difference between "caravan" and "car a van"
       
 (DIR) Post #AV6HqFy8BEPlbsPZwm by simon@fedi.simonwillison.net
       2023-04-28T05:03:45Z
       
       0 likes, 0 repeats
       
       @sxpert if you play with the search interface in my notebook you'll see there are 50,257 tokens total - that's not enough for a token for every word, so common words get their own token but others are formed from multiple tokens
       
 (DIR) Post #AV6JS0RMSmDkPJeF3w by sxpert@mastodon.sxpert.org
       2023-04-28T05:11:27.794600Z
       
       0 likes, 0 repeats
       
       @simon well, "caravan" would be 1 token, presumably "car" "a" "van" would be 3...
       
 (DIR) Post #AV6JS11sH1QaEYjOyW by simon@fedi.simonwillison.net
       2023-04-28T05:22:19Z
       
       0 likes, 0 repeats
       
       @sxpert try it in the notebook: "caravan" isn't a common enough word, so it takes two tokens; "car a van" uses three
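
The same greedy scheme shows how a rarer word falls back to multiple sub-word pieces while the leading space keeps the two phrases distinct (again a hypothetical mini-vocabulary for illustration, not the real 50,257-entry one):

```python
# Toy vocab: "caravan" has no entry of its own, so it splits into
# sub-word pieces; the space-prefixed tokens distinguish "car a van".

def encode(text, vocab):
    """Greedy longest-match tokenization over the toy vocab."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            if text[i:i + length] in vocab:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

vocab = {"car", "avan", " a", " van"}
print(encode("caravan", vocab))    # ['car', 'avan'] -> 2 tokens
print(encode("car a van", vocab))  # ['car', ' a', ' van'] -> 3 tokens
```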
       
 (DIR) Post #AV7p5DkaCaXoibmHMO by nelson@tech.lgbt
       2023-04-28T22:51:28Z
       
       0 likes, 0 repeats
       
       @simon neat! What's with all the tokens that start with Ġ? I.e. tokens that start with the numbers 35 or 95? I used this to understand how GPT-3.5 invented new English words for me when I asked, combining the tokens in plausible ways.
       
 (DIR) Post #AV7pFn5TftRpD5aiMi by simon@fedi.simonwillison.net
       2023-04-28T22:52:23Z
       
       0 likes, 0 repeats
       
       @nelson that's a space character as far as I can tell
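
Specifically, the GPT-2/GPT-3 tokenizer works on bytes, and remaps unprintable bytes to printable stand-in characters so every token has a displayable form. A minimal sketch of that mapping (the same scheme as the bytes_to_unicode function in OpenAI's released GPT-2 code) shows why a leading space renders as Ġ:

```python
# Byte-to-unicode remapping: printable bytes keep their own code point;
# everything else (including space, 0x20) is shifted into an unused
# printable range starting at U+0100.

def bytes_to_unicode():
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    mapping, shift = {}, 0
    for b in range(256):
        if b in printable:
            mapping[b] = chr(b)          # printable byte: keep as-is
        else:
            mapping[b] = chr(256 + shift)  # remap to a printable stand-in
            shift += 1
    return mapping

m = bytes_to_unicode()
print(m[ord(" ")])   # 'Ġ' -- how a leading space appears in a token
print(m[ord("\n")])  # 'Ċ' -- how a newline appears
```

So a token shown as "Ġand" is the byte sequence for " and", space included.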