Post AWUeu6jzKN40OAnjHM by carlmjohnson@mastodon.social
(DIR) Post #AWUcKl8VSe9XL8Hno8 by simon@fedi.simonwillison.net
2023-06-08T20:38:57Z
0 likes, 1 repeats
Understanding GPT tokenizers: I wrote about how the tokenizers used by the various GPT models actually work, including an interactive tool for experimenting with their output https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
(DIR) Post #AWUcW9sJp4I7Ff9MES by simon@fedi.simonwillison.net
2023-06-08T20:40:11Z
0 likes, 0 repeats
The tokenizers have a strong bias towards English: "The dog eats the apples" is 5 tokens, "El perro come las manzanas" is 8 tokens, and many Japanese characters end up using two integer tokens for each character of text.
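One likely contributor to the per-character cost for Japanese: the GPT tokenizers are byte-level BPE, so they work on UTF-8 bytes rather than on characters, and most Japanese characters take three bytes in UTF-8. The sketch below is only an illustration using the Python standard library. It does not run a real tokenizer, and the helper name `utf8_bytes_per_char` is made up for this example.

```python
def utf8_bytes_per_char(text):
    """Pair each character with the number of bytes it occupies in UTF-8.

    Hypothetical helper for illustration only: byte counts are NOT token
    counts, they just show how much raw input a byte-level tokenizer sees.
    """
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    # ASCII letters are one byte each; accented Latin letters are two;
    # most Japanese characters are three.
    for sample in ("dog", "ü", "犬"):
        print(sample, utf8_bytes_per_char(sample))
```

A character whose bytes were never merged into a single vocabulary entry during training has to be spelled out with more than one token, which is consistent with the two-tokens-per-character behavior described above.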
(DIR) Post #AWUd0P8nd4ZBHkHUjg by simon@fedi.simonwillison.net
2023-06-08T20:46:48Z
0 likes, 0 repeats
Here's the interactive @observablehq notebook I built to help demonstrate how the tokenizers work: https://observablehq.com/@simonw/gpt-tokenizer
(DIR) Post #AWUdGh32oo5TSNM9zM by fuzzychef@m6n.io
2023-06-08T20:49:37Z
0 likes, 0 repeats
@simon How many tokens is Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz?
(DIR) Post #AWUdRXyMtvEQUlLQdU by simon@fedi.simonwillison.net
2023-06-08T20:50:25Z
0 likes, 0 repeats
And here's a demo of my "llm" tool (https://github.com/simonw/llm) showing output from GPT-4 a token at a time - note how the word "Pelly" is two tokens but the word "Captain" in "Captain Gulliver" is only one.
(DIR) Post #AWUdug2215Ftt6wr56 by simon@fedi.simonwillison.net
2023-06-08T20:57:00Z
0 likes, 0 repeats
(Captain Gulliver is a genuinely excellent name for a pet pelican)
(DIR) Post #AWUe6OXKt9fl6RB6fY by simon@fedi.simonwillison.net
2023-06-08T20:58:42Z
0 likes, 0 repeats
@fuzzychef 28!
(DIR) Post #AWUeu6jzKN40OAnjHM by carlmjohnson@mastodon.social
2023-06-08T21:07:51Z
0 likes, 0 repeats
@simon https://www.imore.com/animal-crossing-new-horizons-how-help-gulliver
(DIR) Post #AWVNXA3bKLv3x2PJQm by mborus@mastodon.social
2023-06-09T05:27:51Z
0 likes, 0 repeats
@simon @fuzzychef now the challenge is to find a German word with the most tokens that has a Wikipedia page. Can you try https://en.wikipedia.org/wiki/Donaudampfschiffahrtselektrizit%C3%A4tenhauptbetriebswerkbauunterbeamtengesellschaft?
(DIR) Post #AWep3O7TMR5yz6beLo by osma@mas.to
2023-06-13T18:48:42Z
0 likes, 0 repeats
@simon The longest words in the official Finnish dictionary:
pyyhkäisyelektronimikroskooppi - 17 tokens = electron microscope - 3 tokens
Elintarviketurvallisuusvirasto - 13 tokens = food safety authority - 3 tokens
A constructed compound word:
lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas - 29 tokens = airplane jet turbine engine assistant mechanic non-commissioned officer in training - 15 tokens
Those are pretty extreme differences.
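To put numbers on those differences, here is a small sketch that computes the Finnish-to-English token ratio from the counts quoted in this post. The counts are copied from the post as-is; the function name `token_ratio` and the short labels are made up for this example, and nothing here calls a tokenizer.

```python
# (Finnish token count, English token count), copied from the post above.
COUNTS = {
    "electron microscope": (17, 3),
    "food safety authority": (13, 3),
    "constructed compound": (29, 15),
}


def token_ratio(finnish_tokens, english_tokens):
    """Return how many Finnish tokens are spent per English token."""
    return finnish_tokens / english_tokens


if __name__ == "__main__":
    for label, (fi, en) in COUNTS.items():
        print(f"{label}: {token_ratio(fi, en):.2f}x")
```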