Post AWUeu6jzKN40OAnjHM by carlmjohnson@mastodon.social
 (DIR) Post #AWUcKl8VSe9XL8Hno8 by simon@fedi.simonwillison.net
       2023-06-08T20:38:57Z
       
       0 likes, 1 repeats
       
       Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
       
 (DIR) Post #AWUcW9sJp4I7Ff9MES by simon@fedi.simonwillison.net
       2023-06-08T20:40:11Z
       
       0 likes, 0 repeats
       
       The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up using two integer tokens for each character of text.
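
One likely contributor to the Japanese result can be sketched with the standard library alone. This assumes the GPT tokenizers are byte-level BPE, meaning they start from UTF-8 bytes and merge frequent byte sequences into single tokens. The sketch only counts bytes per character; it does not compute real token counts, and the Japanese sample sentence and the helper name `utf8_bytes_per_char` are illustrative assumptions, not taken from the post.

```python
# Sketch: count the UTF-8 bytes behind each character. A byte-level BPE
# tokenizer (assumed here) starts from these bytes, so a character with
# no learned merge must be spread over more than one token.

def utf8_bytes_per_char(text: str) -> list[tuple[str, int]]:
    """Return (character, number of UTF-8 bytes) for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

samples = {
    "English": "The dog eats the apples",
    "Spanish": "El perro come las manzanas",
    # Hypothetical Japanese sample, chosen only for illustration.
    "Japanese": "犬はりんごを食べる",
}

for label, sample in samples.items():
    total_bytes = len(sample.encode("utf-8"))
    print(f"{label}: {len(sample)} characters, {total_bytes} UTF-8 bytes")
```

ASCII letters take one byte each, while the Japanese characters in this sample take three, so any character the tokenizer has not learned as a single unit has to be split across several tokens.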
       
 (DIR) Post #AWUd0P8nd4ZBHkHUjg by simon@fedi.simonwillison.net
       2023-06-08T20:46:48Z
       
       0 likes, 0 repeats
       
       Here's the interactive @observablehq notebook I built to help demonstrate how the tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
       
 (DIR) Post #AWUdGh32oo5TSNM9zM by fuzzychef@m6n.io
       2023-06-08T20:49:37Z
       
       0 likes, 0 repeats
       
       @simon How many tokens is Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz?
       
 (DIR) Post #AWUdRXyMtvEQUlLQdU by simon@fedi.simonwillison.net
       2023-06-08T20:50:25Z
       
       0 likes, 0 repeats
       
       And here's a demo of my "llm" tool (https://github.com/simonw/llm) showing output from GPT-4 a token at a time - note how the word "Pelly" is two tokens but the word "Captain" in "Captain Gulliver" is only one.
       
 (DIR) Post #AWUdug2215Ftt6wr56 by simon@fedi.simonwillison.net
       2023-06-08T20:57:00Z
       
       0 likes, 0 repeats
       
       (Captain Gulliver is a genuinely excellent name for a pet pelican)
       
 (DIR) Post #AWUe6OXKt9fl6RB6fY by simon@fedi.simonwillison.net
       2023-06-08T20:58:42Z
       
       0 likes, 0 repeats
       
       @fuzzychef 28!
       
 (DIR) Post #AWUeu6jzKN40OAnjHM by carlmjohnson@mastodon.social
       2023-06-08T21:07:51Z
       
       0 likes, 0 repeats
       
       @simon https://www.imore.com/animal-crossing-new-horizons-how-help-gulliver
       
 (DIR) Post #AWVNXA3bKLv3x2PJQm by mborus@mastodon.social
       2023-06-09T05:27:51Z
       
       0 likes, 0 repeats
       
       @simon @fuzzychef now the challenge is to find a German word with the most tokens that has a Wikipedia page. Can you try https://en.wikipedia.org/wiki/Donaudampfschiffahrtselektrizit%C3%A4tenhauptbetriebswerkbauunterbeamtengesellschaft?
       
 (DIR) Post #AWep3O7TMR5yz6beLo by osma@mas.to
       2023-06-13T18:48:42Z
       
       0 likes, 0 repeats
       
       @simon The longest words in the official Finnish dictionary:
       
       pyyhkäisyelektronimikroskooppi - 17 tokens = electron microscope - 3 tokens
       Elintarviketurvallisuusvirasto - 13 tokens = food safety authority - 3 tokens
       
       A constructed compound word:
       
       lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas - 29 tokens = airplane jet turbine engine assistant mechanic non-commissioned officer in training - 15 tokens
       
       Those are pretty extreme differences.
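
To put a number on those differences, here is a small sketch that divides the Finnish token count by the English one for each pair. The token counts are the ones reported in the post above, not recomputed with a tokenizer; the names `reported` and `token_ratio` are illustrative.

```python
# Token counts as reported in the post (not recomputed here):
# (Finnish word, Finnish tokens, English gloss, English tokens)
reported = [
    ("pyyhkäisyelektronimikroskooppi", 17, "electron microscope", 3),
    ("Elintarviketurvallisuusvirasto", 13, "food safety authority", 3),
    ("lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas", 29,
     "airplane jet turbine engine assistant mechanic "
     "non-commissioned officer in training", 15),
]

def token_ratio(finnish_tokens: int, english_tokens: int) -> float:
    """How many times more tokens the Finnish form needs."""
    return finnish_tokens / english_tokens

for finnish, fi_tokens, english, en_tokens in reported:
    ratio = token_ratio(fi_tokens, en_tokens)
    print(f"{finnish}: {fi_tokens} vs {en_tokens} tokens ({ratio:.2f}x)")
```

On the reported figures, the two dictionary words cost roughly four to six times as many tokens as their English glosses, while the constructed compound costs a little under twice as many.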