[HN Gopher] Show HN: TokenDagger - A tokenizer faster than OpenA...
___________________________________________________________________
Show HN: TokenDagger - A tokenizer faster than OpenAI's Tiktoken
TokenDagger is a drop-in replacement for OpenAI's Tiktoken (the
tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It's written in
C++ 17 with thin Python bindings, keeps the exact same BPE
vocab/special-token rules, and focuses on raw speed. I'm teaching
myself LLM internals by re-implementing the stack from first
principles. Profiling Tiktoken's Python/Rust implementation showed
that a lot of time was spent on regex matching. Most of my perf
gains come from a) using a faster JIT-compiled regex engine; and
b) simplifying the algorithm to forgo regex matching of special
tokens entirely. Benchmarking code is included. Notable results:
- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
Author : matthewolfe
Score : 235 points
Date : 2025-06-30 12:33 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| chrismustcode wrote:
| There's something beautiful about creating a drop-in replacement
| for something that substantially improves its performance.
|
| ScyllaDB comes to mind
| matthewolfe wrote:
| Agreed. I figured nobody would use it otherwise.
| parhamn wrote:
| Put it in the readme & description. It's a big selling
| point.
| matthewolfe wrote:
| Thanks, I clarified it.
| pvg wrote:
| To be fair, many people have token stabbing needs.
| npalli wrote:
| Kudos. I think (in the short term at least) there is a lot of
| perf optimization to be found by coding parts of the AI/ML
| infrastructure in C++ like this one, not as a rewrite (god no!)
| but by dropping in and fixing key bottlenecks. Anytime I see
| someone (Chinese engineers seem good at this) put something out
| in C++, there's a good chance some solid engineering tradeoffs
| have been made and a dramatic improvement will follow.
| matthewolfe wrote:
| Agreed. A former mentor of mine told me a nice way of viewing
| software development:
|
| 1. Make it work. 2. Make it fast. 3. Make it pretty.
|
| Transformers & LLMs have been developed to a point where they
| work quite well. I feel as though we're at a stage where most
| substantial progress is being made on the performance side.
| diggan wrote:
| Heh, seems the people I've been learning from have been biased
| away from beauty, as I know that one as "Make It Work, Make It
| Right, Make It Fast".
| abybaddi009 wrote:
| What's the difference between make it work and make it
| right? Aren't they the same thing?
| stavros wrote:
| Yeah, if it's not right, it doesn't work.
| darknoon wrote:
| In ML, often it does work to a degree even if it's not
| 100% correct. So getting it working at all is all about
| hacking b/c most ideas are bad and don't work. Then
| you'll find wins by incrementally correcting issues with
| the math / data / floating point precision / etc.
| DSingularity wrote:
| Not true. Things can work with hacks. Your standards
| might consider it unacceptable to have hacks. So you can
| have a "make it right" stage.
| gabrielhidasy wrote:
| Depends on your definition of "right" and "work". It
| could be a big ball of mud that always returns exactly
| the required response (so it 'works'), but be hellishly
| hard to change and very picky about dependencies and
| environment (so it's not 'right').
| stavros wrote:
| Nope, it's right, but it's not pretty.
| robertfw wrote:
| Making it work can be a hacky, tech debt laden
| implementation. Making it right involves
| refactoring/rewriting with an eye towards
| maintainability, testability, etc etc
| gopalv wrote:
| > make it work and make it right?
|
| My mentor used to say it's the difference between a screw
| and glue.
|
| You can glue some things together and prove that it
| works, but eventually you learn that anytime you had to
| break something to fix it, you should've used a screw.
|
| It is a trade-off in coupling - the glue binds tightly over
| the entire surface but a screw concentrates the loads, so
| needs maintenance to stay tight.
|
| You only really know which is "right" if you test it
| to destruction.
|
| All of that advice probably sounds dated now; even in
| materials science the glue might be winning (see the Tesla
| bumper or Lotus Elise bonding videos - every screw is
| extra grams).
| kevindamm wrote:
| I've usually heard/said it as 1. Make it
| 2. Make it work 3. Make it work better
|
| (different circumstances have different nuances about what
| "better" means, it isn't always performance optimization;
| some do substitute "faster" for "better" here, but I think
| it loses generality then).
| gabrielhidasy wrote:
| I always heard the "Make it Right" as "Make it Beautiful",
| where Right and Beautiful would mean "non-hacky, easily
| maintainable, easily extendable, well tested, and well
| documented"
| matthewolfe wrote:
| Fair chance I'm remembering it wrong :D
| mindcrime wrote:
| I've always heard it (and said it) as: 1.
| Make it work 2. Make it correct 3. Make it fast
| binarymax wrote:
| The Huggingface transformers lib is currently undergoing a
| refactor to get rid of cruft and make it more extensible,
| hopefully with some perf gains.
| jotux wrote:
| A similar concept dates back to 30BC:
| https://en.wikipedia.org/wiki/De_architectura
|
| Firmitas, utilitas, venustas - Strong, useful, and beautiful.
| saretup wrote:
| And while we're at it, let's move away from Python altogether.
| In the long run it doesn't make sense to keep it just because
| it's the language ML engineers are familiar with.
| tbalsam wrote:
| No! This is not good.
|
| Iteration speed trumps all in research. Most of what Python
| does is launch GPU operations; if you're seeing slowdowns
| from Python-land, then you're doing something terribly wrong.
|
| Python is an excellent (and yes, fast!) language for
| orchestrating and calling ML stuff. If C++ code is needed,
| call it as a module.
| bigyabai wrote:
| It makes plenty of sense. Python handles strings well, has a
| great package ecosystem, and is easy to write/learn for non-
| programmers. It can be easily embedded into a notebook (which
| is huge for academics) and is, at least in theory, a "write
| once run anywhere" platform. It's great.
|
| If you think _Python_ is a bad language for AI integrations,
| try writing one in a compiled language.
| janalsncm wrote:
| Most of that is already happening under the hood. A lot of
| performance-sensitive code is already written in C or Cython:
| numpy, scikit-learn, and pandas, for example. Lots of torch
| code is either C++ or CUDA.
|
| ML researchers aren't using Python because they are dumb.
| They use it because what takes 8 lines in Java can be done
| in 2 or 3 lines of Python (including "import json"), for
| example.
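|
| For instance, the JSON case (a generic illustration, nothing
| project-specific):
|
|     import json
|
|     # load a config file in a couple of lines
|     with open("config.json") as f:
|         config = json.load(f)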
| ipsum2 wrote:
| Sort of. The key bottlenecks are not in tokenization, but
| running the actual CUDA kernels. Python actually has very
| little overhead (see vLLM, which is primarily Python). So
| when people (like DeepSeek) "rewrite in C++", they're usually
| just rewriting CUDA kernels to be more efficient.
| notatallshaw wrote:
| It looks like Tiktoken is written in Rust
| (https://github.com/openai/tiktoken/tree/main/src). Are the
| gains here actually from porting to C++?
| konsalexee wrote:
| > simplifying the algorithm to forego regex matching special
| tokens at all
|
| Does that mean there could be cases where tokenization quality
| suffers?
| matthewolfe wrote:
| The output should be identical, assuming no bugs.
|
| The Tiktoken implementation takes a collection of all special
| tokens upon initialization and compiles them into a regex by
| joining them with `|` [0]. Then the actual encoding process
| checks for matches on this expression.
|
| Models like Llama 4 define a list of 1,135 special tokens.
| Notably, 1,115 of those are "reserved" special tokens! So this
| yields a huge regexp of special tokens that shouldn't be
| considered at all.
|
| TokenDagger does not do this. Instead, simple string matching
| is used. This works because we don't need to consider the
| entire special vocabulary every time. The caller of `encode`
| must explicitly define which special tokens should be
| considered [1]. So it's faster to check against the much
| smaller list we _know_ is being used.
|
| [0]
| https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476
|
| [1]
| https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...
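|
| Roughly, the difference looks like this (an illustrative
| sketch, not the actual TokenDagger code; the token names are
| just examples):
|
|     import re
|
|     special_tokens = {"<|begin_of_text|>": 128000,
|                       "<|end_of_text|>": 128001}
|     allowed = {"<|end_of_text|>"}  # what the caller passed in
|
|     # Tiktoken-style: one alternation over every special token
|     special_re = re.compile(
|         "|".join(re.escape(t) for t in special_tokens))
|
|     # Plain string matching: only scan for the allowed tokens
|     def find_next_special(text: str, start: int = 0):
|         hits = [(text.find(t, start), t) for t in allowed]
|         hits = [(i, t) for i, t in hits if i != -1]
|         return min(hits) if hits else None  # (index, token)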
| manishsharan wrote:
| Is there a tokenizer someone can recommend for code? I have
| tried CodeBERT, but maybe I am using it wrong, as my results
| with it were pretty bad.
| fkyoureadthedoc wrote:
| Would be cool to see WASM bindings for this here
| https://github.com/dqbd/tiktoken
|
| Or maybe even your speedups from "b" in the pure js
| implementation
| p0 wrote:
| How does this compare to the BPE crate [1]? Its main selling
| point is support for incrementally re-tokenising text, but it's
| also faster than tiktoken.
|
| [1] https://crates.io/crates/bpe
| matthewolfe wrote:
| I'm working on incremental re-tokenizing next. Then I'll run
| some benchmarks against this crate too.
| frabcus wrote:
| Is there any way we can get local tokenizers for other LLMs?
| E.g. Gemini only offers a remote API for its tokenizer. Is it
| proprietary? Could we infer the token mapping efficiently by
| making lots of calls?
| weberer wrote:
| I thought Gemini used SentencePiece
|
| https://github.com/google/sentencepiece
| Deathmax wrote:
| Gemini uses SentencePiece [1], and the proprietary Gemini
| models share the same tokenizer vocabulary as Gemma [2, 3, 4].
|
| Out of the large proprietary western AI labs (OpenAI,
| Anthropic, Google), only Anthropic, with Claude 3 and newer,
| lacks local tokenizers.
|
| [1] https://github.com/google/sentencepiece
|
| [2] https://github.com/googleapis/python-
| aiplatform/blob/main/ve...
|
| [3] https://storage.googleapis.com/deepmind-
| media/gemma/gemma-2-...: "We inherit from the large Gemini
| vocabulary (256k entries)."
|
| [4] https://storage.googleapis.com/deepmind-
| media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini
| 2.0."
| matthewolfe wrote:
| A lot of model-specific tokenizers have reference
| implementations ([0], [1]). Underlying them is a core algorithm
| like SentencePiece or Byte-pair encoding (BPE). Tiktoken and
| TokenDagger are BPE implementations. The wrapping "tokenizer"
| mostly deals with the quirks of the vocabulary and handling
| special tokens.
|
| For this project, I think there is value in building some of
| these model-specific quirks into the library. We could see
| some minor performance gains, and it would generally be
| easier to integrate with. It's probably not too much work to
| keep up with newer models, since tokenizers change much less
| frequently.
|
| [0] https://github.com/meta-llama/llama-
| models/blob/01dc8ce46fec...
|
| [1] https://github.com/mistralai/mistral-
| common/tree/main/src/mi...
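|
| As a rough sketch of what such a wrapper tends to look like
| (loosely modeled on Llama 3's chat header tokens; simplified,
| not the reference implementation):
|
|     class ChatTokenizer:
|         def __init__(self, bpe, special_tokens: dict):
|             self.bpe = bpe                 # core BPE engine
|             self.special = special_tokens  # model-specific ids
|
|         def encode_message(self, role: str, content: str):
|             # frame the text with the model's special tokens
|             ids = [self.special["<|start_header_id|>"]]
|             ids += self.bpe.encode(role)
|             ids.append(self.special["<|end_header_id|>"])
|             ids += self.bpe.encode("\n\n" + content)
|             ids.append(self.special["<|eot_id|>"])
|             return ids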
| pama wrote:
| Cool. Would it be possible to eliminate that little vocab
| format conversion I see in the test against tiktoken? It
| would be nice to have a fully compatible drop-in replacement
| without having to think about details. It would also be nice
| to have examples that work the other way around: initialize
| tiktoken as you normally would, including any specialized
| extension of standard tokenizers, and then use that
| initialized tokenizer to initialize a new TokenDagger and
| test that the results are identical.
| matthewolfe wrote:
| Ah good catch. Updating this right now.
| matthewolfe wrote:
| Alright, 0.1.1 should now be a true drop-in replacement. I'll
| write up some examples soon.
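|
| In the meantime, something along these lines should work as a
| sanity check (assuming the package mirrors tiktoken's
| get_encoding/encode interface; the exact TokenDagger entry
| point may differ):
|
|     import tiktoken
|     import tokendagger  # assumed to expose a tiktoken-like API
|
|     text = "The quick brown fox jumps over the lazy dog."
|
|     tik = tiktoken.get_encoding("cl100k_base")
|     dag = tokendagger.get_encoding("cl100k_base")  # hypothetical
|
|     assert dag.encode(text) == tik.encode(text)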
| janwilmake wrote:
| You know what's also fast for roughly estimating the number
| of tokens? string.length/5
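|
| (As a sketch of that heuristic:)
|
|     def rough_token_count(text: str) -> int:
|         # rule of thumb: roughly 4-5 characters per token
|         return len(text) // 5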
| _flux wrote:
| It is not helpful in actual tokenization, though.
| EGreg wrote:
| What about pairing this with BigBird and Mamba?
| pamelafox wrote:
| Just curious whether it's possible to push any of your
| performance improvements to tiktoken itself?
| matthewolfe wrote:
| I probably will. I was hesitant initially, because adding
| PCRE2 as a dependency might cause issues for existing
| projects. I believe this was discussed briefly in a closed PR
| with other performance improvements.
| b0a04gl wrote:
| If dagger builds a byte-level DFA for special tokens and
| resolves overlaps via longest match, how does it handle
| inputs with partial matches at chunk boundaries? Say a stream
| ends mid-token like <|endo - does it buffer forward or
| require lookahead?
| kevmo314 wrote:
| Nice work! I tried something similar a while back:
| https://github.com/kevmo314/tokie
|
| The takeaway I also found was that the running cost was really
| dominated by pretokenization (the regex). It's cool to see that
| you found a faster way to run the regex, but have you tried
| comparing the performance of just swapping out the regex engine
| and leaving the actual BPE to tiktoken? I wonder if that is
| upstreamable?
| matthewolfe wrote:
| Cool!
|
| I've reached out to the guy who maintains Tiktoken to talk
| about this.
| polynomial wrote:
| Just to note that Tiktoken is still the tokenizer behind the
| GPT-4x series; it just uses a different token model. (The
| post only says GPT-3, implying they were using something else
| for subsequent iterations.)
| silentsea90 wrote:
| "I'm teaching myself LLM internals by re-implementing the stack
| from first principles." - curious what resources you're using?
| Any books or courses, or just building it straight up? Great
| work!
| matthewolfe wrote:
| Modal's GPU glossary is a good overview of how GPUs work
| [0]. Karpathy's video is a good high-level overview of LLMs
| [1]. 3b1b's video on transformers (and the ones that follow
| it) was excellent at helping me understand the math at a high
| level [2]. This matrix multiplication optimization worklog
| helped me understand writing better CUDA (not a beginner
| intro, though) [3].
|
| During this process I also asked ChatGPT a lot of questions.
|
| I'm definitely open to suggestions about "how to learn" with
| all the new tools we have. I felt this has not been
| straightforward to figure out.
|
| [0] https://modal.com/gpu-glossary
|
| [1] https://www.youtube.com/watch?v=7xTGNNLPyMI
|
| [2] https://www.youtube.com/watch?v=wjZofJX0v4M
|
| [3] https://siboehm.com/articles/22/CUDA-MMM
| matrix2596 wrote:
| Is it possible for your tokenizer to ever give a different
| tokenization than the OpenAI tokenizer? I am asking because
| there are multiple ways to tokenize the same string. Sorry if
| I am mistaken.
| matthewolfe wrote:
| Should be the same. Both use Byte-Pair Encoding (BPE) as the
| underlying algorithm, with the same vocabulary and merge
| ranks, so the greedy merges resolve identically.
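|
| For reference, a minimal sketch of the greedy merge loop that
| BPE tokenizers follow (simplified; real implementations avoid
| rescanning every pair on each iteration):
|
|     def bpe_encode(piece: bytes, ranks: dict) -> list:
|         # ranks maps a byte sequence to its merge priority;
|         # every single byte is assumed to be in the vocabulary
|         parts = [bytes([b]) for b in piece]
|         while len(parts) > 1:
|             best = None  # (rank, index) of best adjacent pair
|             for i in range(len(parts) - 1):
|                 r = ranks.get(parts[i] + parts[i + 1])
|                 if r is not None and (best is None or r < best[0]):
|                     best = (r, i)
|             if best is None:
|                 break  # no mergeable pair left
|             i = best[1]
|             parts[i:i + 2] = [parts[i] + parts[i + 1]]
|         return [ranks[p] for p in parts]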
| Tiberium wrote:
| Can you also compare the performance with
| https://github.com/huggingface/tokenizers/? Would be helpful,
| since the benchmark in the tiktoken readme seems to be very
| outdated.
| binarymax wrote:
| Anecdotally I've always found tiktoken to be far slower than
| huggingface tokenizers. I'm not sure why, as I haven't dug into
| tiktoken, but I'm a heavy user of HF's Rust tokenizers.
| superlopuh wrote:
| Can someone familiar with performance of LLMs please tell me how
| important this is to the overall perf? I'm interested in looking
| into optimizing tokenizers, and have not yet run the
| measurements. I would have assumed that the cost is generally
| dominated by matmuls but am encouraged by the reception of this
| post in the comments.
| serjester wrote:
| Tokenizing text is a ridiculously small part of the overall
| computation that goes into serving a request. That said, if
| you're doing this on petabytes of data, it never hurts to
| have something faster.
| odyssey7 wrote:
| A language that isn't memory-safe can definitely hurt. AI
| needs more security, not less.
| refibrillator wrote:
| Tokenization is typically done on CPU and is rarely (if ever) a
| bottleneck for training or inference.
|
| GPU kernels typically dominate in terms of wall clock time, the
| only exception might be very small models.
|
| Thus the latency of tokenization can essentially be "hidden",
| by having the CPU prepare the next batch while the GPU finishes
| the current batch.
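|
| A minimal sketch of that overlap, using a background thread
| and a bounded queue (illustrative only, not any particular
| framework's data loader):
|
|     import queue, threading
|
|     def prefetch(batches, tokenize, depth=2):
|         q = queue.Queue(maxsize=depth)
|
|         def worker():
|             for batch in batches:
|                 q.put(tokenize(batch))  # CPU-bound tokenization
|             q.put(None)                 # sentinel: end of input
|
|         threading.Thread(target=worker, daemon=True).start()
|         while (item := q.get()) is not None:
|             # the consumer (GPU step) runs while the worker
|             # tokenizes the next batch
|             yield item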
| benreesman wrote:
| Tokenization performance is complicated, but your guidepost is
| that the institutions with the resources and talent to do so
| choose to write extremely fast tokenizers: sentencepiece and
| tiktoken both pay dearly in complexity (particularly complexity
| of deployment because now you've got another axis of
| architecture-specific build/bundle/dylib to manage in addition
| to whatever your accelerator burden always was: it's now aarch64
| cross x86_64 cross CUDA capability...)
|
| Sometimes it can be overlapped with accelerator work, but pros
| look at flame graphs: a CPU core running the AVX lanes hard
| isn't keeping the bus fed - a million things. People
| pre-tokenize big runs all the time.
|
| I don't know why this thread is full of "nothing to see
| here"; this obliterates the SOTA from the money-is-no-object
| status quo. I'd like to think better of the community than
| the obvious explanation, which is that C++ is threatening a
| modest mindshare comeback against a Rust narrative that's
| already under pressure from the explosion of interest in Zig.
| Maybe there's a better reason.
| matthewolfe wrote:
| To echo the other replies, the tokenizer is definitely not the
| bottleneck. It just happens to be the first step in inference,
| so it's what I did first.
| isjustintime wrote:
| Very cool. We use Tiktoken and I'd love to see the performance
| impact. Pretty great decision to make it drop-in compatible.
| sheerun wrote:
| Now that byte-patch-level embeddings are discovered?
| semiinfinitely wrote:
| I'm relieved to see that it's not written in Rust.
| matthewolfe wrote:
| haha, I thought about it.
___________________________________________________________________
(page generated 2025-06-30 23:00 UTC)