[HN Gopher] Show HN: TokenDagger - A tokenizer faster than OpenA...
___________________________________________________________________
Show HN: TokenDagger - A tokenizer faster than OpenAI's Tiktoken
TokenDagger is a drop-in replacement for OpenAI's Tiktoken (the
tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It's written in
C++ 17 with thin Python bindings, keeps the exact same BPE
vocab/special-token rules, and focuses on raw speed. I'm teaching
myself LLM internals by re-implementing the stack from first
principles. Profiling Tiktoken's Python/Rust implementation showed
that a lot of time was spent on regex matching. Most of my perf
gains come from a) using a faster JIT-compiled regex engine; and
b) simplifying the algorithm to forgo regex matching of special
tokens entirely. Benchmarking code is included. Notable results:
- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
Author : matthewolfe
Score : 235 points
Date : 2025-06-30 12:33 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| chrismustcode wrote:
| There's something beautiful about creating a drop-in replacement
| for something that substantially improves its performance.
|
| ScyllaDB comes to mind
| matthewolfe wrote:
| Agreed. I figured nobody would use it otherwise.
| parhamn wrote:
| Put it in the readme & description. It's a big selling
| point.
| matthewolfe wrote:
| Thanks, I clarified it.
| pvg wrote:
| To be fair, many people have token stabbing needs.
| npalli wrote:
| Kudos. I think (in the short term at least) there is a lot of
| perf optimization to be found by coding parts of the AI/ML
| infrastructure in C++ like this one, not as a rewrite (god no!)
| but by dropping in and fixing key bottlenecks. Anytime I see
| someone (Chinese engineers seem good at this) put something out
| in C++, there's a good chance some solid engineering tradeoffs
| have been made and a dramatic improvement will follow.
| matthewolfe wrote:
| Agreed. A former mentor of mine told me a nice way of viewing
| software development:
|
| 1. Make it work. 2. Make it fast. 3. Make it pretty.
|
| Transformers & LLMs have been developed to a point where they
| work quite well. I feel as though we're at a stage where most
| substantial progress is being made on the performance side.
| diggan wrote:
| Heh, seems the people I've been learning from have been biased
| away from beauty, as I know that one as "Make It Work, Make It
| Right, Make It Fast".
| abybaddi009 wrote:
| What's the difference between make it work and make it
| right? Aren't they the same thing?
| stavros wrote:
| Yeah, if it's not right, it doesn't work.
| darknoon wrote:
| In ML, often it does work to a degree even if it's not
| 100% correct. So getting it working at all is all about
| hacking b/c most ideas are bad and don't work. Then
| you'll find wins by incrementally correcting issues with
| the math / data / floating point precision / etc.
| DSingularity wrote:
| Not true. Things can work with hacks. Your standards
| might consider it unacceptable to have hacks. So you can
| have a "make it right" stage.
| gabrielhidasy wrote:
| Depends on your definition of "right" and "work". It
| could be a big ball of mud that always returns exactly
| the required response (so it 'works'), but be hellishly
| hard to change and very picky about dependencies and
| environment (so it's not 'right').
| stavros wrote:
| Nope, it's right, but it's not pretty.
| robertfw wrote:
| Making it work can be a hacky, tech debt laden
| implementation. Making it right involves
| refactoring/rewriting with an eye towards
| maintainability, testability, etc etc
| gopalv wrote:
| > make it work and make it right?
|
| My mentor used to say it's the difference between a screw
| and glue.
|
| You can glue some things together and prove that it
| works, but eventually you learn that anytime you had to
| break something to fix it, you should've used a screw.
|
| It is a trade-off in coupling - the glue binds tightly over
| the entire surface but a screw concentrates the loads, so
| needs maintenance to stay tight.
|
| You only really know which is "right" if you test it
| to destruction.
|
| All of that advice probably sounds dated now; even in
| materials science the glue might be winning (see the Tesla
| bumper or Lotus Elise bonding videos - every screw is
| extra grams).
| kevindamm wrote:
| I've usually heard/said it as 1. Make it
| 2. Make it work 3. Make it work better
|
| (different circumstances have different nuances about what
| "better" means, it isn't always performance optimization;
| some do substitute "faster" for "better" here, but I think
| it loses generality then).
| gabrielhidasy wrote:
| I always heard the "Make it Right" as "Make it Beautiful",
| where Right and Beautiful would mean "non-hacky, easily
| maintainable, easily extendable, well tested, and well
| documented"
| matthewolfe wrote:
| Fair chance I'm remembering it wrong :D
| mindcrime wrote:
| I've always heard it (and said it) as: 1.
| Make it work 2. Make it correct 3. Make it fast
| binarymax wrote:
| The Huggingface transformers lib is currently undergoing a
| refactor to get rid of cruft and make it more extensible,
| hopefully with some perf gains.
| jotux wrote:
| A similar concept dates back to 30BC:
| https://en.wikipedia.org/wiki/De_architectura
|
| Firmitas, utilitas, venustas - Strong, useful, and beautiful.
| saretup wrote:
| And while we're at it, let's move away from Python altogether.
| In the long run it doesn't make sense to keep it just because
| it's the language ML engineers are familiar with.
| tbalsam wrote:
| No! This is not good.
|
| Iteration speed trumps all in research. Most of what Python
| does is launch GPU operations; if you're seeing slowdowns
| from Python-land, then you're doing something terribly wrong.
|
| Python is an excellent (and yes, fast!) language for
| orchestrating and calling ML stuff. If C++ code is needed,
| call it as a module.
| bigyabai wrote:
| It makes plenty of sense. Python handles strings well, has a
| great package ecosystem, and is easy to write/learn for non-
| programmers. It can be easily embedded into a notebook (which
| is huge for academics) and is, at least in theory, a "write
| once run anywhere" platform. It's great.
|
| If you think _Python_ is a bad language for AI integrations,
| try writing one in a compiled language.
| janalsncm wrote:
| Most of that is already happening under the hood. A lot of
| performance-sensitive code is already written in C or Cython:
| numpy, scikit-learn, and pandas, for example. Lots of torch
| code is either C++ or CUDA.
|
| ML researchers aren't using Python because they are dumb.
| They use it because what takes 8 lines in Java can be done
| in 2 or 3 lines of Python (including "import json"), for
| example.
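|
| For instance, the JSON case (a generic illustration, nothing
| project-specific):
|
|     import json
|
|     # load a config file in a couple of lines
|     with open("config.json") as f:
|         config = json.load(f)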
| ipsum2 wrote:
| Sort of. The key bottlenecks are not in tokenization, but
| running the actual CUDA kernels. Python actually has very
| little overhead (see vLLM, which is primarily Python). So
| when people (like DeepSeek) "rewrite in C++", they're usually
| just rewriting CUDA kernels to be more efficient.
| notatallshaw wrote:
| It looks like Tiktoken is written in Rust
| (https://github.com/openai/tiktoken/tree/main/src). Are the
| gains here actually from porting to C++?
| konsalexee wrote:
| > simplifying the algorithm to forego regex matching special
| tokens at all
|
| Does that mean there could be cases where tokenization quality
| suffers?
| matthewolfe wrote:
| The output should be identical, assuming no bugs.
|
| The Tiktoken implementation takes a collection of all special
| tokens upon initialization and compiles them into a regex by
| joining them with `|` [0]. Then the actual encoding process
| checks for matches on this expression.
|
| Models like Llama 4 define a list of 1,135 special tokens.
| Notably, 1,115 of those are "reserved" special tokens! So this
| yields a huge regexp of special tokens that shouldn't be
| considered at all.
|
| TokenDagger does not do this. Instead, simple string matching
| is used. This works because we don't need to consider the
| entire special vocabulary every time. The caller of `encode`
| must explicitly define which special tokens should be
| considered [1]. So it's faster to check against the much
| smaller list we _know_ is being used.
|
| [0]
| https://github.com/openai/tiktoken/blob/main/src/lib.rs#L476
|
| [1]
| https://github.com/openai/tiktoken/blob/main/tiktoken/core.p...
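|
| Roughly, the difference looks like this (an illustrative
| sketch, not the actual TokenDagger code; the token names are
| just examples):
|
|     import re
|
|     special_tokens = {"<|begin_of_text|>": 128000,
|                       "<|end_of_text|>": 128001}
|     allowed = {"<|end_of_text|>"}  # what the caller passed in
|
|     # Tiktoken-style: one alternation over every special token
|     special_re = re.compile(
|         "|".join(re.escape(t) for t in special_tokens))
|
|     # Plain string matching: only scan for the allowed tokens
|     def find_next_special(text: str, start: int = 0):
|         hits = [(text.find(t, start), t) for t in allowed]
|         hits = [(i, t) for i, t in hits if i != -1]
|         return min(hits) if hits else None  # (index, token)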
| manishsharan wrote:
| Is there a tokenizer someone can recommend for code? I have
| tried CodeBERT, but maybe I am using it wrong, as my results
| with it were pretty bad.
| fkyoureadthedoc wrote:
| Would be cool to see WASM bindings for this here
| https://github.com/dqbd/tiktoken
|
| Or maybe even your speedups from "b" in the pure js
| implementation
| p0 wrote:
| How does this compare to the BPE crate [1]? Its main selling
| point is support for incrementally re-tokenising text, but it's
| also faster than tiktoken.
|
| [1] https://crates.io/crates/bpe
| matthewolfe wrote:
| I'm working on incremental re-tokenizing next. Then I'll run
| some benchmarks against this crate too.
| frabcus wrote:
| Is there any way we can get local tokenizers for other LLMs?
| E.g. Gemini only offers a remote API for its tokenizer. Is it
| proprietary? Could we infer the token mapping efficiently by
| making lots of calls?
| weberer wrote:
| I thought Gemini used SentencePiece
|
| https://github.com/google/sentencepiece
| Deathmax wrote:
| Gemini uses SentencePiece [1], and the proprietary Gemini
| models share the same tokenizer vocabulary as Gemma [2, 3, 4].
|
| Out of the large proprietary western AI labs (OpenAI,
| Anthropic, Google), only Anthropic, with Claude 3 and newer,
| lacks local tokenizers.
|
| [1] https://github.com/google/sentencepiece
|
| [2] https://github.com/googleapis/python-
| aiplatform/blob/main/ve...
|
| [3] https://storage.googleapis.com/deepmind-
| media/gemma/gemma-2-...: "We inherit from the large Gemini
| vocabulary (256k entries)."
|
| [4] https://storage.googleapis.com/deepmind-
| media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini
| 2.0."
| matthewolfe wrote:
| A lot of model-specific tokenizers have reference
| implementations ([0], [1]). Underlying them is a core algorithm
| like SentencePiece or Byte-pair encoding (BPE). Tiktoken and
| TokenDagger are BPE implementations. The wrapping "tokenizer"
| mostly deals with the quirks of the vocabulary and handling
| special tokens.
|
| For this project, I think there is value in building some of
| these model-specific quirks into the library. We could see
| some minor performance gains, and it would generally be
| easier to integrate with. It's probably not too much work to
| keep up with newer models, since tokenizers change much less
| frequently.
|
| [0] https://github.com/meta-llama/llama-
| models/blob/01dc8ce46fec...
|
| [1] https://github.com/mistralai/mistral-
| common/tree/main/src/mi...
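|
| As a rough sketch of what such a wrapper tends to look like
| (loosely modeled on Llama 3's chat header tokens; simplified,
| not the reference implementation):
|
|     class ChatTokenizer:
|         def __init__(self, bpe, special_tokens: dict):
|             self.bpe = bpe                 # core BPE engine
|             self.special = special_tokens  # model-specific ids
|
|         def encode_message(self, role: str, content: str):
|             # frame the text with the model's special tokens
|             ids = [self.special["<|start_header_id|>"]]
|             ids += self.bpe.encode(role)
|             ids.append(self.special["<|end_header_id|>"])
|             ids += self.bpe.encode("\n\n" + content)
|             ids.append(self.special["<|eot_id|>"])
|             return ids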
| pama wrote:
| Cool. Would it be possible to eliminate that little vocab
| format conversion I see in the test against tiktoken? It
| would be nice to have a fully compatible drop-in replacement
| without having to think about details. It would also be nice
| to have examples that work the other way around: initialize
| tiktoken as you normally would, including any specialized
| extension of standard tokenizers, and then use that
| initialized tokenizer to initialize a new TokenDagger and
| test that the results are identical.
| matthewolfe wrote:
| Ah good catch. Updating this right now.
| matthewolfe wrote:
| Alright, 0.1.1 should now be a true drop-in replacement. I'll
| write up some examples soon.
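|
| In the meantime, something along these lines should work as a
| sanity check (assuming the package mirrors tiktoken's
| get_encoding/encode interface; the exact TokenDagger entry
| point may differ):
|
|     import tiktoken
|     import tokendagger  # assumed to expose a tiktoken-like API
|
|     text = "The quick brown fox jumps over the lazy dog."
|
|     tik = tiktoken.get_encoding("cl100k_base")
|     dag = tokendagger.get_encoding("cl100k_base")  # hypothetical
|
|     assert dag.encode(text) == tik.encode(text)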
| janwilmake wrote:
| You know what's also fast for roughly estimating the number
| of tokens? string.length/5
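|
| (As a sketch of that heuristic:)
|
|     def rough_token_count(text: str) -> int:
|         # rule of thumb: roughly 4-5 characters per token
|         return len(text) // 5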
| _flux wrote:
| It is not helpful in actual tokenization, though.
| EGreg wrote:
| What about pairing this with BigBird and Mamba?
| pamelafox wrote:
| Just curious whether it's possible to push any of your
| performance improvements to tiktoken itself?
| matthewolfe wrote:
| I probably will. I was hesitant initially, because adding
| PCRE2 as a dependency might cause issues for existing
| projects. I believe this was discussed briefly in a closed PR
| with other performance improvements.
| b0a04gl wrote:
| If dagger builds a byte-level DFA for special tokens and
| resolves overlaps via longest match, how does it handle
| inputs with partial matches at chunk boundaries? Say a stream
| ends mid-token like <|endo - does it buffer forward or
| require lookahead?
| kevmo314 wrote:
| Nice work! I tried something similar a while back:
| https://github.com/kevmo314/tokie
|
| The takeaway I also found was that the running cost was really
| dominated by pretokenization (the regex). It's cool to see that
| you found a faster way to run the regex, but have you tried
| comparing the performance of just swapping out the regex engine
| and leaving the actual BPE to tiktoken? I wonder if that is
| upstreamable?
| matthewolfe wrote:
| Cool!
|
| I've reached out to the guy who maintains Tiktoken to talk
| about this.
| polynomial wrote:
| Just to note that Tiktoken is still the tokenizer behind the
| GPT-4x series; it just uses a different token model. (The
| post only says GPT-3, implying they were using something else
| for subsequent iterations.)
| silentsea90 wrote:
| "I'm teaching myself LLM internals by re-implementing the stack
| from first principles." - curious what resources you're using?
| Any books or courses, or just building it straight up? Great
| work!
| matthewolfe wrote:
| Modal's GPU glossary is a good overview of how GPUs work
| [0]. Karpathy's video is a good high-level overview of LLMs
| [1]. 3b1b's video on transformers (and the ones that follow
| it) was excellent at helping me understand the math at a high
| level [2]. This matrix multiplication optimization worklog
| helped me understand writing better CUDA (not a beginner
| intro, though) [3].
|
| During this process I also asked ChatGPT a lot of questions.
|
| I'm definitely open to suggestions about "how to learn" with
| all the new tools we have. I felt this has not been
| straightforward to figure out.
|
| [0] https://modal.com/gpu-glossary
|
| [1] https://www.youtube.com/watch?v=7xTGNNLPyMI
|
| [2] https://www.youtube.com/watch?v=wjZofJX0v4M
|
| [3] https://siboehm.com/articles/22/CUDA-MMM
| matrix2596 wrote:
| Is it possible for your tokenizer to ever give a different
| tokenization than the OpenAI tokenizer? I am asking because
| there are multiple ways to tokenize the same string. Sorry if
| I am mistaken.
| matthewolfe wrote:
| Should be the same. Both use Byte-Pair Encoding (BPE) as the
| underlying algorithm, with the same vocabulary and merge
| ranks, so the greedy merges resolve identically.
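|
| For reference, a minimal sketch of the greedy merge loop that
| BPE tokenizers follow (simplified; real implementations avoid
| rescanning every pair on each iteration):
|
|     def bpe_encode(piece: bytes, ranks: dict) -> list:
|         # ranks maps a byte sequence to its merge priority;
|         # every single byte is assumed to be in the vocabulary
|         parts = [bytes([b]) for b in piece]
|         while len(parts) > 1:
|             best = None  # (rank, index) of best adjacent pair
|             for i in range(len(parts) - 1):
|                 r = ranks.get(parts[i] + parts[i + 1])
|                 if r is not None and (best is None or r < best[0]):
|                     best = (r, i)
|             if best is None:
|                 break  # no mergeable pair left
|             i = best[1]
|             parts[i:i + 2] = [parts[i] + parts[i + 1]]
|         return [ranks[p] for p in parts]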
| Tiberium wrote:
| Can you also compare the performance with
| https://github.com/huggingface/tokenizers/? Would be helpful,
| since the benchmark in the tiktoken readme seems to be very
| outdated.
| binarymax wrote:
| Anecdotally I've always found tiktoken to be far slower than
| huggingface tokenizers. I'm not sure why, as I haven't dug into
| tiktoken, but I'm a heavy user of HF's Rust tokenizers.
| superlopuh wrote:
| Can someone familiar with performance of LLMs please tell me how
| important this is to the overall perf? I'm interested in looking
| into optimizing tokenizers, and have not yet run the
| measurements. I would have assumed that the cost is generally
| dominated by matmuls but am encouraged by the reception of this
| post in the comments.
| serjester wrote:
| Tokenizing text is a ridiculously small part of the overall
| computation that goes into serving a request. That said, if
| you're doing this on petabytes of data, it never hurts to
| have something faster.
| odyssey7 wrote:
| A language that isn't memory-safe can definitely hurt. AI
| needs more security, not less.
| refibrillator wrote:
| Tokenization is typically done on CPU and is rarely (if ever) a
| bottleneck for training or inference.
|
| GPU kernels typically dominate in terms of wall clock time, the
| only exception might be very small models.
|
| Thus the latency of tokenization can essentially be "hidden",
| by having the CPU prepare the next batch while the GPU finishes
| the current batch.
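|
| A minimal sketch of that overlap, using a background thread
| and a bounded queue (illustrative only, not any particular
| framework's data loader):
|
|     import queue, threading
|
|     def prefetch(batches, tokenize, depth=2):
|         q = queue.Queue(maxsize=depth)
|
|         def worker():
|             for batch in batches:
|                 q.put(tokenize(batch))  # CPU-bound tokenization
|             q.put(None)                 # sentinel: end of input
|
|         threading.Thread(target=worker, daemon=True).start()
|         while (item := q.get()) is not None:
|             # the consumer (GPU step) runs while the worker
|             # tokenizes the next batch
|             yield item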
| benreesman wrote:
| Tokenization performance is complicated, but your guidepost is
| that the institutions with the resources and talent to do so
| choose to write extremely fast tokenizers: sentencepiece and
| tiktoken both pay dearly in complexity (particularly complexity
| of deployment because now you've got another axis of
| architecture-specific build/bundle/dylib to manage in addition
| to whatever your accelerator burden always was: it's now aarch64
| cross x86_64 cross CUDA capability...)
|
| Sometimes it can be overlapped with accelerator work, but pros
| look at flame graphs: a CPU core running the AVX lanes hard
| isn't keeping the bus fed - a million things. People
| pre-tokenize big runs all the time.
|
| I don't know why this thread is full of "nothing to see
| here"; this obliterates the SOTA from the money-is-no-object
| status quo. I'd like to think better of the community than
| the obvious explanation, which is that C++ is threatening a
| modest mindshare comeback against a Rust narrative that's
| already under pressure from the explosion of interest in Zig.
| Maybe there's a better reason.
| matthewolfe wrote:
| To echo the other replies, the tokenizer is definitely not the
| bottleneck. It just happens to be the first step in inference,
| so it's what I did first.
| isjustintime wrote:
| Very cool. We use Tiktoken and I'd love to see the performance
| impact. Pretty great decision to make it drop-in compatible.
| sheerun wrote:
| Now that byte-patch-level embeddings are discovered?
| semiinfinitely wrote:
| I'm relieved to see that it's not written in Rust.
| matthewolfe wrote:
| haha, I thought about it.
___________________________________________________________________
(page generated 2025-06-30 23:00 UTC)