[HN Gopher] Strategies for Fast Lexers
       ___________________________________________________________________
        
       Strategies for Fast Lexers
        
       Author : xnacly
       Score  : 115 points
       Date   : 2025-07-14 14:42 UTC (8 hours ago)
        
 (HTM) web link (xnacly.me)
 (TXT) w3m dump (xnacly.me)
        
       | skeptrune wrote:
        | I really appreciate the commitment to benchmarking in this
        | one. The memoization speedup for number processing was
        | particularly surprising.
        
         | felineflock wrote:
          | Are you referring to the part where he said "crazy 15ms/64%
          | faster"?
        
       | duped wrote:
       | Do you have benchmarks that show the hand rolled jump table has a
       | significant impact?
       | 
       | The only reason this raises an eyebrow is that I've seen
       | conflicting anec/data on this, depending pretty hard on target
       | microarchitecture and the program itself.
        
         | xnacly wrote:
          | Sadly that was one of the things I did benchmark but didn't
          | write down the results for. I read a lot that a naive switch
          | is faster because the compiler knows how to optimise it
          | better, but for my architecture and benchmarks the computed
          | gotos were faster.
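          | 
          | For reference, a minimal sketch of the computed-goto
          | dispatch in question (GNU C extension, works in gcc and
          | clang; toy handlers, not the actual purple-garden lexer):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         /* label addresses taken with the GNU && operator */
          |         void *jump_table[256];
          |         for (int i = 0; i < 256; i++) jump_table[i] = &&other;
          |         for (int c = '0'; c <= '9'; c++) jump_table[c] = &&digit;
          |         jump_table[0] = &&done; /* NUL acts as EOF */
          | 
          |         const char *p = "12x3";
          |     #define DISPATCH() goto *jump_table[(unsigned char)*p]
          |         DISPATCH();
          |     digit:
          |         printf("digit '%c'\n", *p);
          |         p++; DISPATCH();
          |     other:
          |         printf("other '%c'\n", *p);
          |         p++; DISPATCH();
          |     done:
          |         return 0;
          |     }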
        
       | cratermoon wrote:
       | ... written in C.
       | 
       | Not sure how many of these translate to other languages.
        
         | xnacly wrote:
          | Most of them do: jump tables work in Rust, and so does
          | mmapping. Deferred numeric parsing, keeping allocations to a
          | minimum, string slices, interning and inline hashing all
          | work in Rust, Go, C, C++; you name it.
        
       | thechao wrote:
       | I like to have my lexers operate on `FILE*`, rather than string-
       | views. This has some real-world performance implications (not
       | good ones); but, it does mean I can operate on streams. If the
       | user has a c-string, the string can be easily wrapped by
       | `funopen()` or `fopencookie()` to provide a `FILE*` adapter
       | layer. (Most of my lexers include one of these, out-of-the-box.)
       | 
       | Everything else, I stole from Bob Nystrom: I keep a local copy of
       | the token's string in the token, aka, `char word[64]`. I try to
       | minimize "decision making" during lexing. Really, at the
       | consumption point we're only interested in an extremely small
       | number of things: (1) does the lexeme start with a letter or a
       | number?; (2) is it whitespace, and is that whitespace a new
       | line?; or, (3) does it look like an operator?
       | 
       | The only place where I've ever considered goto-threading was in
       | keyword identification. However, if your language keeps keywords
       | to <= 8 bytes, you can just bake the keywords into `uint64_t`'s
       | and compare against those values. You can do a crapload of 64b
       | compares/ns.
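        | 
        | Roughly, the keyword trick looks like this (hypothetical
        | helper, not my exact code; lexemes are zero-padded into an
        | 8-byte buffer so each keyword test is one 64-bit compare):
        | 
        |     #include <stdint.h>
        |     #include <stdio.h>
        |     #include <string.h>
        | 
        |     /* pack up to 8 bytes into a uint64_t, zero-padded */
        |     static uint64_t pack8(const char *s, size_t len) {
        |         uint64_t v = 0;
        |         if (len <= 8) memcpy(&v, s, len);
        |         return v; /* len > 8 yields 0: never a keyword */
        |     }
        | 
        |     int main(void) {
        |         /* bake the keywords once, at startup */
        |         const uint64_t KW_WHILE = pack8("while", 5);
        |         const uint64_t KW_RETURN = pack8("return", 6);
        | 
        |         const char *lexeme = "while";
        |         uint64_t v = pack8(lexeme, strlen(lexeme));
        |         printf("%d %d\n", v == KW_WHILE, v == KW_RETURN);
        |     }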
       | 
       | The next level up (parsing) is slow enough to eat & memoize the
       | decision making of the lexer; and, materially, it doesn't
       | complicate the parser. (In fact: there's a lot of decision making
       | that happens in the parser that'd have to be replicated in the
       | lexer, otherwise.)
       | 
        | The result, overall, is you can have a pretty general-purpose
        | lexer that you can reuse for any old C-ish language, and tune
        | to your heart's content, without needing a custom rewrite each
        | time.
        
         | tempodox wrote:
         | The tragic thing is that you can't do `fgetwc()` on a `FILE *`
         | produced by `fopencookie()` on Linux. glibc will crash your
         | program deliberately as soon as there is a non-ASCII char in
         | that stream (because, reasons?). But it does work with
         | `funopen()` on a BSD, like macOS. I'm using that to read wide
         | characters from UTF-8 streams.
        
           | o11c wrote:
            | Wide characters are best avoided even on platforms where it
            | doesn't mean UTF-16. It's better to stay in UTF-8 mode, and
            | only _verify_ that it's well-formed.
        
             | tempodox wrote:
              | But at some point you'll want to know whether that code
              | point you read is `iswalpha()` or whatever, so you'll
              | have to decode UTF-8 anyway.
        
               | thechao wrote:
                | At the parser level, though; not down in the lexer. I
                | intern unique user-defined strings (just with a
                | hashcons or whatever the cool kids call it these
                | days). That defers the determination of correctness of
                | UTF-kness to "someone else".
        
         | o11c wrote:
         | Have you considered making your lexer operate in push mode
         | instead?
         | 
         | This does mean you have to worry about partial tokens ... but
         | if you limit yourself to feeding full lines that mostly goes
         | away.
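          | 
          | A rough sketch of what a push-style interface could look
          | like (hypothetical API, nothing from the article; tokens are
          | just whitespace-separated here to keep it short):
          | 
          |     #include <ctype.h>
          |     #include <stdio.h>
          |     #include <string.h>
          | 
          |     typedef void (*token_cb)(const char *s, size_t n, void *user);
          | 
          |     /* the caller feeds one full line at a time, so no
          |        partial-token state survives between calls */
          |     static void lex_push_line(const char *line, size_t len,
          |                               token_cb emit, void *user) {
          |         size_t i = 0;
          |         while (i < len) {
          |             if (isspace((unsigned char)line[i])) { i++; continue; }
          |             size_t start = i;
          |             while (i < len && !isspace((unsigned char)line[i])) i++;
          |             emit(line + start, i - start, user);
          |         }
          |     }
          | 
          |     static void print_token(const char *s, size_t n, void *user) {
          |         (void)user;
          |         printf("token: %.*s\n", (int)n, s);
          |     }
          | 
          |     int main(void) {
          |         const char *line = "let x = 42";
          |         lex_push_line(line, strlen(line), print_token, NULL);
          |     }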
         | 
         | Besides, for reasonable-size workloads, "read the whole file
         | ahead of time" is usually a win. The only time it's tempting
         | not to do so is for REPLs.
        
           | thechao wrote:
            | I agree. But, I also like the _discipline_ of lexing from
            | `FILE*`. I've ended up with cleaner separation of concerns
            | throughout the front-end stack, because I can't dip back
            | into the well unless I'm thinking very clearly about that
            | operation. For instance, I keep around _coordinates_ of
            | things, rather than _pointers_, etc.
        
         | codr7 wrote:
         | I'd do this in almost any other language than C :)
         | 
          | In C, I like just passing a const char * around as input;
          | this also gives me the ability to return progress and unget
          | chars as an added bonus.
         | 
         | https://github.com/codr7/shi-c/blob/b1d5cb718b7eb166a0a93c77...
        
       | ummonk wrote:
        | Wait, do modern compilers not use jump tables for large switch
        | statements?
        
         | packetlost wrote:
         | Some definitely do.
        
       | skybrian wrote:
       | This is fun and all, but I wonder what's the largest program
       | that's ever been written in this new language (purple garden)?
       | Seems like it will be a while before the optimizations pay off.
        
         | xnacly wrote:
          | I haven't written a lot yet, but it needs to be fast so that
          | the quick iteration keeps me motivated to program more.
        
       | sparkie wrote:
        | As an alternative to the computed gotos, you can use regular
        | functions with the `[[musttail]]` attribute in Clang or GCC to
        | achieve basically the same thing - the call in the tail
        | position is replaced with a `jmp` instruction to the next
        | function rather than to the label, and stack usage remains
        | constant because the current frame is reutilized for the
        | called function. `musttail` requires that the calling function
        | and callee have the same signature, and a prototype.
       | 
        | You'd replace the JUMP_TARGET macro:
        | 
        |     #define JUMP_TARGET goto *jump_table[(int32_t)l->input.p[l->pos]]
        | 
        | With:
        | 
        |     #ifdef __clang__
        |     #define musttail [[clang::musttail]]
        |     #elif __GNUC__
        |     #define musttail [[gnu::musttail]]
        |     #else
        |     #define musttail
        |     #endif
        |     #define JUMP_TARGET musttail return \
        |         jump_table[(int32_t)l->input.p[l->pos]](l, a, out)
       | 
       | Then move the jump table out to the top level and replace each
       | `&&` with `&`.
       | 
       | See diff (untested): https://www.diffchecker.com/V4yH3EyF/
       | 
        | This approach has the advantage that it will work everywhere,
        | not only on compilers that support computed gotos - it just
        | won't be optimized on compilers that don't support `musttail`.
        | (Though it has been proposed to standardize it in a future
        | version of C.)
       | 
       | It might also work better with code navigation tools that show
       | functions, but not labels, and enables modularity as we can split
       | rules over multiple translation units.
       | 
        | Performance-wise it should be basically the same - though it's
        | been argued that it may do better in some cases because the
        | compiler's register allocator doesn't do a great job in large
        | functions with computed gotos - whereas in the musttail
        | approach each function is a smaller unit and optimized
        | separately.
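        | 
        | For a self-contained feel of it, here's a toy version
        | (hypothetical rule names, not the article's lexer; needs
        | clang, or gcc 15 or later for `[[gnu::musttail]]`):
        | 
        |     #include <stddef.h>
        |     #include <stdio.h>
        | 
        |     #ifdef __clang__
        |     #define MUSTTAIL [[clang::musttail]]
        |     #elif __GNUC__
        |     #define MUSTTAIL [[gnu::musttail]]
        |     #else
        |     #define MUSTTAIL
        |     #endif
        | 
        |     typedef struct { const char *p; size_t pos; } Lexer;
        | 
        |     /* every rule shares one signature, as musttail requires */
        |     typedef void rule_fn(Lexer *l);
        |     static rule_fn rule_ident, rule_digit, rule_done;
        |     static rule_fn *jump_table[256];
        | 
        |     /* compiles to a jmp: frame is reused, stack stays flat */
        |     #define DISPATCH(l) MUSTTAIL return \
        |         jump_table[(unsigned char)(l)->p[(l)->pos]](l)
        | 
        |     static void rule_done(Lexer *l) { (void)l; }
        | 
        |     static void rule_ident(Lexer *l) {
        |         printf("ident '%c'\n", l->p[l->pos]); l->pos++;
        |         DISPATCH(l);
        |     }
        | 
        |     static void rule_digit(Lexer *l) {
        |         printf("digit '%c'\n", l->p[l->pos]); l->pos++;
        |         DISPATCH(l);
        |     }
        | 
        |     int main(void) {
        |         for (int i = 0; i < 256; i++) jump_table[i] = rule_done;
        |         for (int c = 'a'; c <= 'z'; c++) jump_table[c] = rule_ident;
        |         for (int c = '0'; c <= '9'; c++) jump_table[c] = rule_digit;
        |         Lexer l = { "ab12", 0 };
        |         jump_table[(unsigned char)l.p[0]](&l); /* NUL ends it */
        |         return 0;
        |     }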
        
         | bestouff wrote:
          | Can't wait for guaranteed TCO to come to Rust. But it's not
          | there yet. https://github.com/phi-go/rfcs/blob/guaranteed-
         | tco/text/0000...
        
           | sparkie wrote:
           | Not sure I like the `become` keyword. Seems bizarre - someone
           | encountering this word in code for the first time would have
           | no idea what it's doing.
           | 
            | Why not just use `tailcall`? That would make it obvious
            | what it's doing, because we've been using the term for
            | nearly half a century, and the entire literature on the
            | subject uses the term "tail call".
           | 
           | Even better would be to just automatically insert a tail call
           | - like every other language that has supported tail calls for
           | decades - provided the callee has the same signature as the
           | caller. If it's undesirable because we want a stack trace,
           | then instead have some keyword or attribute to suppress the
           | tail call - such as `no_tail`, `nontail` or `donttail`.
           | 
           | Requiring tail calls to be marked will basically mean the
           | optimization will be underutilized. Other than having a stack
           | trace for debugging, there's basically no reason not to have
           | the optimization on by default.
        
             | kibwen wrote:
              | Rust does allow tail call optimization, but that's LLVM's
              | decision to make on a case-by-case basis. An explicit
              | syntax to denote tail calls would be the difference
              | between tail call _optimization_ and guaranteed tail call
              | _elimination_, which is important because if you're
              | writing a tail-recursive function then it's pretty
              | trivial to blow the stack at any moderate recursion depth
              | unless you can guarantee the elimination.
             | 
             | As for why it's not trivial for Rust to do this by default,
             | consider the question of what should happen in the case of
             | local destructors, which in an ordinary function would be
             | called _after_ `return myfunc()` returns, but in a tail-
             | recursive function would need to be called beforehand. The
             | proposals for `become` tend to handle this by making it a
             | compiler error to have any locals with destructors in scope
             | at the point of the tail-call, further motivating the
             | explicit syntax.
        
       | norir wrote:
       | Lexing being the major performance bottleneck in a compiler is a
       | great problem to have.
        
         | norskeld wrote:
         | Is lexing ever a bottleneck though? Even if you push for lexing
         | and parsing 10M lines/second [1], I'd argue that semantic
         | analysis and codegen (for AOT-compiled languages) will dominate
         | the timings.
         | 
         | That said, there's no reason not to squeeze every bit of
         | performance out of it!
         | 
         | [1]: In this talk about the Carbon language, Chandler Carruth
         | shows and explains some goals/challenges regarding performance:
         | https://youtu.be/ZI198eFghJk?t=1462
        
           | munificent wrote:
           | It depends a lot on the language.
           | 
           | For a statically typed language, it's very unlikely that the
           | lexer shows up as a bottleneck. Compilation time will likely
           | be dominated by semantic analysis, type checking, and code
           | generation.
           | 
           | For a dynamically typed language where there isn't as much
           | for the compiler to do, then the lexer might be a more
           | noticeable chunk of compile times. As one of the V8 folks
           | pointed out to me years ago, the lexer is the only part of
           | the compiler that has to operate on every single individual
           | byte of input. Everything else gets the luxury of greater
           | granularity, so the lexer can be worth optimizing.
        
           | SnowflakeOnIce wrote:
           | A simple hello world in C++ can pull in dozens of megabytes
           | of header files.
           | 
           | Years back I worked at a C++ shop with a big codebase
           | (hundreds of millions of LOC when you included vendored
           | dependencies). Compile times there were sometimes dominated
           | by parsing speed! Now, I don't remember the exact breakdown
           | of lexing vs parsing, but I _did_ look at it under a
           | profiler.
           | 
           | It's very easy in C++ projects to structure your code such
           | that you inadvertently cause hundreds of megabytes of sources
           | to be parsed by each single #include. In such a case, lexing
           | and parsing costs can dominate build times. Precompiled
           | headers help, but not enough...
        
             | adev_ wrote:
             | > Now, I don't remember the exact breakdown of lexing vs
             | parsing, but I did look at it under a profiler.
             | 
              | Lexing, parsing and even type checking are interleaved in
              | most C++ compilers due to the ambiguous nature of many
              | constructs in the language.
              | 
              | It is very hard to profile only one of these in
              | isolation. And even with compiler built-in
              | instrumentation, the results are not very representative
              | of the work done behind the scenes.
              | 
              | C++ compilers are amazing machines. They are blazing fast
              | at parsing a language which is a nightmare of
              | ambiguities. And they are like that mainly because of how
              | stupidly verbose and inefficient the C++ include system
              | is.
        
       | zX41ZdbW wrote:
       | I recommend taking a look at the ClickHouse SQL Lexer:
       | 
       | https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
       | 
       | https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
       | 
       | It supports SIMD for accelerated character matching, it does not
       | do any allocations, and it is very small (compiles to a few KB of
       | WASM code).
        
         | tuveson wrote:
         | How much of an improvement does SIMD offer for something like
         | this? It looks like it's only being used for strings and
         | comments, but I would kind of assume that for most programming
         | languages, the proportion of code that is long strings /
         | comments is not large. Also curious if there's any performance
         | penalty for trying to do SIMD if most of the comments and
         | strings are short.
        
           | camel-cdr wrote:
            | Usually lexing isn't part of the performance equation
            | compared to all the other parts of the compiler, but SIMD
            | can be used to speed up number parsing.
        
       | o11c wrote:
       | Unfortunately, operating a byte at a time means there's a hard
       | limit on performance.
       | 
       | A truly performant lexer needs to jump ahead as far as possible.
       | This likely involves SIMD (or SWAR) since unfortunately the C
       | library fails to provide most of the important interfaces.
       | 
       | As an example that the C library _can_ handle tolerably, while
       | lexing a string, you should repeatedly call `strcspn(input,
       | "\"\\\\\n")` to skip over chunks of ordinary characters, then
       | only special-case the quote, backslash, newline and (implicit!)
       | NUL after each jump. Be sure to correctly distinguish between an
       | embedded NUL and the one you probably append to represent EOF
       | (or, if streaming [which requires quite a bit more logic], end of
       | current chunk).
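        | 
        | A sketch of that loop (illustrative; assumes a NUL-terminated
        | buffer where NUL marks EOF):
        | 
        |     #include <stdio.h>
        |     #include <string.h>
        | 
        |     /* p points just past the opening quote; returns a pointer
        |        past the closing quote, or NULL if unterminated */
        |     static const char *lex_string_body(const char *p) {
        |         for (;;) {
        |             p += strcspn(p, "\"\\\n"); /* skip ordinary chars */
        |             switch (*p) {
        |             case '"':  return p + 1;   /* closing quote */
        |             case '\\':                 /* escape sequence */
        |                 if (p[1] == '\0') return NULL;
        |                 p += 2; break;
        |             case '\n':                 /* newline in string */
        |             case '\0':                 /* EOF sentinel */
        |                 return NULL;
        |             }
        |         }
        |     }
        | 
        |     int main(void) {
        |         /* source text after an opening quote */
        |         const char *src = "hello \\\"world\\\" bye\" rest";
        |         const char *end = lex_string_body(src);
        |         printf("%s\n", end ? end : "(unterminated)");
        |     }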
       | 
       | Unfortunately, there's a decent chance your implementation of
       | `strcspn` doesn't optimize for the possibility of small sets, and
       | instead constructs a full 256-bit bitset. And even if it does,
       | this strategy won't work for larger sets such as "all characters
       | in an identifier" (you'd actually use `strspn` since this is
       | positive), for which you'll want to take advantage of the
       | characters being adjacent.
       | 
       | Edit: yikes, is this using a hash without checking for
       | collisions?!?
        
         | dist1ll wrote:
         | You can get pretty far with a branch per byte, as long as the
         | bulk of the work is done w/ SIMD (like character
         | classification). But yeah, LUT lookup per byte is not
         | recommended.
        
         | xnacly wrote:
          | You are somewhat right, I used tagging masks to
          | differentiate between different types of atoms [1]. But yes,
          | interning will be backed by a correct implementation of a
          | hashmap with some collision handling in the future.
         | 
         | [1]: https://github.com/xNaCly/purple-
         | garden/blob/master/cc.c#L76...
        
         | kingstnap wrote:
         | You can go pretty far processing one byte at a time in
         | hardware. You just keep making the pipeline deeper and pushing
         | the frequency. And then to combat dependent parsing you add
         | speculative execution to avoid bubbles.
         | 
         | Eventually you land on recreating the modern cpu.
        
       | adev_ wrote:
       | Cool exercise and thank you for the blog post.
       | 
        | I did a similar thing (for fun) for the tokenizer of a Swift-
        | derivative language, written in C++.
        | 
        | My approach was however very different from yours:
        | 
        | - No macros, no ASM, just explicit vectorization using
        | std.simd
        | 
        | - No hand-rolled allocator. Just std::vector and SOA.
        | 
        | - No hashing for keywords. They are short. A single SIMD load
        | / compare is often enough (see the sketch below).
        | 
        | - All the lookup tables are compile-time generated from the
        | token list using constexpr to keep the code small and
        | maintainable.
       | 
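        | The keyword compare looks roughly like this in plain C with
        | SSE2 intrinsics (illustrative only; the real version uses
        | std.simd and the compile-time generated tables):
        | 
        |     #include <emmintrin.h>
        |     #include <stdbool.h>
        |     #include <stdio.h>
        |     #include <string.h>
        | 
        |     /* one 16-byte load + compare + movemask per keyword;
        |        both sides are zero-padded to 16 bytes */
        |     static bool eq16(const char a16[16], const char b16[16]) {
        |         __m128i a = _mm_loadu_si128((const __m128i *)a16);
        |         __m128i b = _mm_loadu_si128((const __m128i *)b16);
        |         return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
        |     }
        | 
        |     int main(void) {
        |         static const char kw_return[16] = "return"; /* padded */
        |         char pad[16] = {0};
        |         const char *lexeme = "return";
        |         size_t n = strlen(lexeme);
        |         if (n < 16) memcpy(pad, lexeme, n);
        |         printf("%d\n", eq16(pad, kw_return)); /* 1 */
        |     }
        | 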
       | I was able to reach around 8 Mloc/s on server grade hardware,
       | single core.
        
       | JonChesterfield wrote:
       | Byte at a time means not-fast but I suppose it's all relative.
       | The benchmarks would benefit from a re2c version, I'd expect that
       | to beat the computed goto one. Easier for the compiler to deal
       | with, mostly.
        
       | zahlman wrote:
       | Is lexing really ever the bottleneck? Why focus effort here?
        
       | s3graham wrote:
       | simdjson is another project to look at for ideas.
       | 
        | I found it quite tricky to apply its ideas to the more general
        | syntax of a programming language, but with a bunch of hacking
        | and a few subtle changes to the language itself, the
        | performance difference over one-character-at-a-time was quite
        | substantial (about 6-10x).
        
       | psanchez wrote:
        | The jump table is interesting, although I'd guess the
        | performance of a switch would be similar if properly optimized
        | by the compiler; I couldn't tell without trying. Also,
        | different compilers might take different approaches.
       | 
       | A few months ago I built a toy boolean expression parser as a
       | weekend project. The main goal was simple: evaluate an expression
       | and return true or false. It supported basic types like int,
       | float, string, arrays, variables, and even custom operators.
       | 
       | The syntax and grammar were intentionally kept simple. I wanted
       | the whole implementation to be self-contained and compact,
       | something that could live in just a .h and .cc file. Single pass
       | for lexing, parsing, and evaluation.
       | 
        | Once the first version was functional, I challenged myself to
        | optimize it for speed and tried many things. Here are some of
        | the performance-related tricks I remember using:
        | 
        | - No string allocations: used the input *str directly, relying
        | on pointer manipulation instead of allocating memory for
        | substrings.
        | 
        | - Stateful parsing: maintained a parsing state structure
        | passed by reference to avoid unnecessary copies or
        | allocations.
        | 
        | - Minimized allocations: tried to avoid heap allocations
        | wherever possible. Some were unavoidable during evaluation,
        | but I kept them to a minimum.
        | 
        | - Branch prediction-friendly design: used lookup tables to
        | assist with token identification, mapping the first character
        | to a token type and validating identifier characters (see the
        | sketch below).
        | 
        | - Inline literal parsing: converted integer and float literals
        | to their native values directly during lexing instead of
        | deferring conversion to a later phase.
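        | 
        | The lookup-table trick looks roughly like this (hypothetical
        | token kinds, not the actual tinyrulechecker code):
        | 
        |     #include <stdio.h>
        | 
        |     typedef enum { T_EOF, T_IDENT, T_NUMBER, T_STRING,
        |                    T_OP, T_SPACE, T_ERROR } TokKind;
        | 
        |     /* maps a token's first byte to a coarse class, replacing
        |        a chain of branches with one predictable load */
        |     static TokKind first_char[256];
        | 
        |     static void init_table(void) {
        |         for (int c = 0; c < 256; c++) first_char[c] = T_ERROR;
        |         first_char[0] = T_EOF;
        |         for (int c = 'a'; c <= 'z'; c++) first_char[c] = T_IDENT;
        |         for (int c = 'A'; c <= 'Z'; c++) first_char[c] = T_IDENT;
        |         first_char['_'] = T_IDENT;
        |         for (int c = '0'; c <= '9'; c++) first_char[c] = T_NUMBER;
        |         first_char['"'] = T_STRING;
        |         first_char[' '] = first_char['\t'] = T_SPACE;
        |         first_char['\n'] = T_SPACE;
        |         for (const char *p = "+-*/%!&|<>=()."; *p; p++)
        |             first_char[(unsigned char)*p] = T_OP;
        |     }
        | 
        |     int main(void) {
        |         init_table();
        |         const char *src = "myint.eq(32)";
        |         printf("%d\n", first_char[(unsigned char)src[0]]);
        |     }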
       | 
       | I think all the tricks are mentioned in the article already.
       | 
       | For what is worth, here is the project:
       | https://github.com/pausan/tinyrulechecker
       | 
        | I used this expression to assess the performance on an
        | Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (launched Q3 2018):
        | 
        |     myfloat.eq(1.9999999) || myint.eq(32)
       | 
        | I know it is a simple expression, and a larger expression
        | would likely perform worse due to variable lookups... but I
        | could get a speed of 287MB/s, or 142ns per evaluation (7M
        | evaluations per second). I was pleasantly surprised to reach
        | those speeds given that one evaluation is a full cycle of
        | lexing, parsing and evaluating the expression itself.
       | 
        | The next step I considered was using SIMD for tokenizing, but
        | I'm not sure it would have helped much with the overall
        | expression evaluation times; I seem to recall most of the time
        | was spent in the parser or evaluation phases anyway, not the
        | lexer.
       | 
       | It was a fun project.
        
       | aappleby wrote:
       | Lexing is almost never a bottleneck. I'd much rather see a
       | "Strategies for Readable Lexers".
        
       | kklisura wrote:
       | > As introduced in the previous chapters, all identifers are
       | hashed, thus we can also hash the known keywords at startup and
       | make comparing them very fast.
       | 
        | One trick that postgres uses [1][2] is perfect hashing [3].
        | Since you know in advance what your keywords are, you can
        | design a hash function such that for each keyword w(i) in a
        | list W of n keywords, h(w(i)) = i. It essentially means no
        | collisions, and it's O(n) for the memory requirement.
       | 
       | [1]
       | https://github.com/postgres/postgres/blob/master/src/tools/P...
       | 
       | [2]
       | https://github.com/postgres/postgres/blob/master/src/tools/g...
       | 
       | [3] https://en.wikipedia.org/wiki/Perfect_hash_function
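        | 
        | A toy illustration of the idea (hand-derived for a made-up
        | four-keyword set, not postgres' generated code): for these
        | keywords, h(w) = (len + 3*w[0]) % 4 happens to be a minimal
        | perfect hash, so lookup is one hash plus one compare, with no
        | collision handling at all.
        | 
        |     #include <stdio.h>
        |     #include <string.h>
        | 
        |     /* slot i holds the keyword that hashes to i */
        |     static const char *keywords[4] =
        |         { "return", "if", "while", "else" };
        | 
        |     static int kw_hash(const char *s, size_t len) {
        |         return (int)((len + 3u * (unsigned char)s[0]) % 4);
        |     }
        | 
        |     /* returns the keyword index, or -1 if not a keyword */
        |     static int kw_lookup(const char *s, size_t len) {
        |         const char *kw = keywords[kw_hash(s, len)];
        |         if (strlen(kw) == len && memcmp(kw, s, len) == 0)
        |             return kw_hash(s, len);
        |         return -1;
        |     }
        | 
        |     int main(void) {
        |         printf("%d\n", kw_lookup("while", 5)); /* 2 */
        |         printf("%d\n", kw_lookup("whale", 5)); /* -1 */
        |     }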
        
       ___________________________________________________________________
       (page generated 2025-07-14 23:00 UTC)