[HN Gopher] Show HN: LLaMA tokenizer that runs in browser
___________________________________________________________________
Show HN: LLaMA tokenizer that runs in browser
Author : belladoreai
Score : 26 points
Date : 2023-06-13 20:22 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| belladoreai wrote:
| Hi HN! I was looking for a tokenizer that would accurately(!)
| count tokens in browser, and I couldn't find one. So I thought
| "how hard can it be", and here we are 2 weeks later...
| zerojames wrote:
| I love this sentiment! Amazing work!
| Solvency wrote:
| For those who would also think the same thing, what're some of
| the tl;dr bullet points on why this is more complicated than
| it'd seem?
| belladoreai wrote:
| I'll answer with an example.
|
| Consider the input string " grabbed".
|
| If we wanted to map this string to tokens by greedily going
| from left to right and choosing tokens from the vocabulary
| with the strategy of minimizing the number of tokens, our
| algorithm would be very simple. We would end up with the
| following tokenization: [17229, 2580] == [" grab", "bed"]
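| A minimal sketch of that greedy longest-match strategy (the
| vocabulary here is a made-up toy example, not the real LLaMA
| vocabulary):

```javascript
// Greedy left-to-right, longest-match tokenizer sketch.
// The vocabulary below is a made-up toy example, NOT LLaMA's real vocabulary.
const vocab = new Set([
  " grab", "bed", " gra", "bb", "ed",
  " ", "g", "r", "a", "b", "e", "d",
]);

function greedyTokenize(str) {
  const tokens = [];
  let i = 0;
  while (i < str.length) {
    // Try the longest possible substring first, shrinking until we
    // find one that is in the vocabulary.
    let j = str.length;
    while (j > i && !vocab.has(str.slice(i, j))) j--;
    if (j === i) throw new Error(`no token matches at position ${i}`);
    tokens.push(str.slice(i, j));
    i = j;
  }
  return tokens;
}

console.log(greedyTokenize(" grabbed")); // [ ' grab', 'bed' ]
```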
|
| Surprisingly, the LLaMA tokenizer does not work this way. It
| actually finds a "worse" tokenization for this input string:
| [2646, 1327, 287] == [" gra", "bb", "ed"]
|
| The tokenizer arrives at this 3 token output by applying
| "merges" in a priority order. For example, this is a merge:
| [" g", "r"] -> " gr". The trained data contains tens of
| thousands of these merges. When we apply the merges in the
| priority order, we end up with 3 tokens.
|
| Now you might be thinking, that's easy, we'll just iterate
| the list of merges and see if any of them apply. Only problem
| with that approach is that _applying a merge can open up a
| new opportunity to merge something else that wasn't possible
| before_. This right here is the key thing that makes this
| problem complicated. We can solve this problem by iterating
| all possible merges from the beginning after every time we
| apply a merge. This would produce the correct solution. Only
| problem is: our algorithm is now very slow and takes minutes
| to run...
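| The slow-but-correct algorithm described above (restart the
| merge scan from the top after every applied merge) can be
| sketched like this; the merge table is a made-up toy example,
| not the real LLaMA merge list:

```javascript
// Naive BPE: apply merges in priority order, restarting the scan
// after every successful merge. Correct, but rescanning the whole
// merge list after each merge is what makes it very slow.
// The merge table is a made-up toy example, NOT the real LLaMA merges.
const merges = [
  [" ", "g"],   // highest priority: " " + "g" -> " g"
  [" g", "r"],  // " g" + "r" -> " gr"
  [" gr", "a"], // " gr" + "a" -> " gra"
  ["b", "b"],   // "b" + "b" -> "bb"
  ["e", "d"],   // "e" + "d" -> "ed"
];

function tokenize(str) {
  // Start from individual characters (the real tokenizer starts from bytes).
  let tokens = Array.from(str);
  let changed = true;
  while (changed) {
    changed = false;
    outer:
    for (const [left, right] of merges) { // priority order
      for (let i = 0; i < tokens.length - 1; i++) {
        if (tokens[i] === left && tokens[i + 1] === right) {
          tokens.splice(i, 2, left + right);
          changed = true;
          // Restart from the highest-priority merge: this merge may
          // have opened up a merge that wasn't possible before.
          break outer;
        }
      }
    }
  }
  return tokens;
}

console.log(tokenize(" grabbed")); // [ ' gra', 'bb', 'ed' ]
```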
___________________________________________________________________
(page generated 2023-06-13 23:01 UTC)