[HN Gopher] Strategies for Fast Lexers
___________________________________________________________________
Strategies for Fast Lexers
Author : xnacly
Score : 115 points
Date : 2025-07-14 14:42 UTC (8 hours ago)
(HTM) web link (xnacly.me)
(TXT) w3m dump (xnacly.me)
| skeptrune wrote:
| I really appreciate the commitment to benchmarking in this one.
| The memoization speedup for number processing was particularly
| surprising.
| felineflock wrote:
| Are you referring to the part where he said "crazy 15ms/64%
| faster"?
| duped wrote:
| Do you have benchmarks that show the hand rolled jump table has a
| significant impact?
|
| The only reason this raises an eyebrow is that I've seen
| conflicting anecdata on this, depending pretty heavily on the
| target microarchitecture and the program itself.
| xnacly wrote:
| Sadly that was one of the things I did benchmark but didn't
| write down the results for. I read a lot that a naive switch is
| faster because the compiler knows how to optimise it better,
| but for my architecture and benchmarks the computed gotos were
| faster.
| cratermoon wrote:
| ... written in C.
|
| Not sure how many of these translate to other languages.
| xnacly wrote:
| Most of them: jump tables work in Rust, and so does mmapping.
| Deferred numeric parsing, keeping allocations to a minimum,
| string slices, interning, and inline hashing all work in Rust,
| Go, C, C++; you name it.
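|
| A minimal sketch of string slices plus deferred numeric parsing
| in C (hypothetical types, not from the article):
|
|     #include <stdlib.h>
|     #include <string.h>
|
|     // A token is just a view into the source buffer: no copy,
|     // no allocation.
|     typedef struct {
|         const char *start;  // points into the loaded source
|         size_t      len;
|         int         kind;   // e.g. T_NUMBER, T_IDENT, ...
|     } Token;
|
|     // Deferred numeric parsing: only pay for strtod if and when
|     // the value is actually needed.
|     static double token_as_double(const Token *t) {
|         char buf[64];
|         size_t n = t->len < sizeof buf - 1 ? t->len : sizeof buf - 1;
|         memcpy(buf, t->start, n);
|         buf[n] = '\0';
|         return strtod(buf, NULL);
|     }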
| thechao wrote:
| I like to have my lexers operate on `FILE*` rather than string-
| views. This has some real-world performance implications (not
| good ones), but it does mean I can operate on streams. If the
| user has a C string, it can easily be wrapped with `funopen()`
| or `fopencookie()` to provide a `FILE*` adapter layer. (Most of
| my lexers include one of these, out-of-the-box.)
|
| Everything else, I stole from Bob Nystrom: I keep a local copy of
| the token's string in the token, aka, `char word[64]`. I try to
| minimize "decision making" during lexing. Really, at the
| consumption point we're only interested in an extremely small
| number of things: (1) does the lexeme start with a letter or a
| number?; (2) is it whitespace, and is that whitespace a new
| line?; or, (3) does it look like an operator?
|
| The only place where I've ever considered goto-threading was in
| keyword identification. However, if your language keeps keywords
| to <= 8 bytes, you can just bake the keywords into `uint64_t`'s
| and compare against those values. You can do a crapload of 64b
| compares/ns.
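|
| A hedged sketch of that trick (hypothetical helpers; both sides
| are packed the same way, so endianness doesn't matter):
|
|     #include <stdint.h>
|     #include <string.h>
|
|     // Pack up to 8 bytes into a uint64_t, zero-padded. Longer
|     // lexemes can't be keywords, so return a sentinel that no
|     // ASCII keyword can produce.
|     static uint64_t pack8(const char *s, size_t len) {
|         uint64_t v = 0;
|         if (len > 8) return UINT64_MAX;
|         memcpy(&v, s, len);
|         return v;
|     }
|
|     static int is_return_kw(const char *s, size_t len) {
|         static uint64_t kw;              // baked on first use
|         if (!kw) kw = pack8("return", 6);
|         return pack8(s, len) == kw;
|     }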
|
| The next level up (parsing) is slow enough to eat & memoize the
| decision making of the lexer; and, materially, it doesn't
| complicate the parser. (In fact: there's a lot of decision making
| that happens in the parser that'd have to be replicated in the
| lexer, otherwise.)
|
| The result, overall, is a pretty general-purpose lexer that you
| can reuse for any old C-ish language and tune to your heart's
| content, without needing a custom rewrite each time.
| tempodox wrote:
| The tragic thing is that you can't do `fgetwc()` on a `FILE *`
| produced by `fopencookie()` on Linux. glibc will crash your
| program deliberately as soon as there is a non-ASCII char in
| that stream (because, reasons?). But it does work with
| `funopen()` on a BSD, like macOS. I'm using that to read wide
| characters from UTF-8 streams.
| o11c wrote:
| Wide characters are best avoided, even on platforms where they
| don't mean UTF-16. It's better to stay in UTF-8 mode and only
| _verify_ that it's well-formed.
| tempodox wrote:
| But at some point you'll want to know whether the code point
| you read is `iswalpha()` or whatever, so you'll have to decode
| UTF-8 anyway.
| thechao wrote:
| At the parser level, though, not down in the lexer. I intern
| unique user-defined strings (just with a hashcons, or whatever
| the cool kids call it these days). That defers determining the
| correctness of the UTF-k-ness to "someone else".
| o11c wrote:
| Have you considered making your lexer operate in push mode
| instead?
|
| This does mean you have to worry about partial tokens ... but
| if you limit yourself to feeding full lines that mostly goes
| away.
|
| Besides, for reasonable-size workloads, "read the whole file
| ahead of time" is usually a win. The only time it's tempting
| not to do so is for REPLs.
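|
| A rough sketch of what such a push-style interface could look
| like (names hypothetical; tokens here are just whitespace-
| separated words to keep it short):
|
|     #include <ctype.h>
|     #include <stddef.h>
|
|     typedef struct { const char *start; size_t len; } Token;
|     typedef void (*token_sink)(const Token *t, void *userdata);
|
|     typedef struct {
|         token_sink sink;
|         void      *userdata;
|     } PushLexer;
|
|     // Feed one complete line; the lexer pushes each finished
|     // token into sink(). Feeding whole lines sidesteps most
|     // partial-token headaches.
|     static void push_lexer_feed(PushLexer *l, const char *line,
|                                 size_t len) {
|         size_t i = 0;
|         while (i < len) {
|             while (i < len && isspace((unsigned char)line[i])) i++;
|             size_t start = i;
|             while (i < len && !isspace((unsigned char)line[i])) i++;
|             if (i > start) {
|                 Token t = { line + start, i - start };
|                 l->sink(&t, l->userdata);
|             }
|         }
|     }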
| thechao wrote:
| I agree. But I also like the _discipline_ of lexing from
| `FILE*`. I've ended up with cleaner separation of concerns
| throughout the front-end stack, because I can't dip back into
| the well unless I'm thinking very clearly about that operation.
| For instance, I keep around _coordinates_ of things, rather
| than _pointers_, etc.
| codr7 wrote:
| I'd do this in almost any other language than C :)
|
| In C, I like just passing a const char * around as input; this
| also gives me the ability to return progress and unget chars as
| an added bonus.
|
| https://github.com/codr7/shi-c/blob/b1d5cb718b7eb166a0a93c77...
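|
| A tiny sketch of that style (hypothetical helpers, not from the
| linked repo): the input "stream" is the pointer itself, so
| ungetting is pointer arithmetic, and returning the advanced
| pointer doubles as a progress report.
|
|     typedef struct { const char *p; } Cursor;
|
|     static int  cur_peek(const Cursor *c) { return (unsigned char)*c->p; }
|     static int  cur_next(Cursor *c)       { return (unsigned char)*c->p++; }
|     static void cur_unget(Cursor *c)      { c->p--; }
|
|     // Helpers can take the cursor by value and return the
|     // advanced copy:
|     static Cursor skip_spaces(Cursor c) {
|         while (*c.p == ' ' || *c.p == '\t') c.p++;
|         return c;
|     }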
| ummonk wrote:
| Wait, do modern compilers not use jump tables for large switch
| statements?
| packetlost wrote:
| Some definitely do.
| skybrian wrote:
| This is fun and all, but I wonder what's the largest program
| that's ever been written in this new language (purple garden)?
| Seems like it will be a while before the optimizations pay off.
| xnacly wrote:
| I haven't written a lot yet, but it needs to be fast so I can
| stay motivated to program more by the fast iteration.
| sparkie wrote:
| As an alternative to computed gotos, you can use regular
| functions with the `[[musttail]]` attribute in Clang or GCC to
| achieve basically the same thing: the call in the tail position
| is replaced with a `jmp` instruction to the next function rather
| than to the label, and stack usage remains constant because the
| current frame is reutilized for the called function. `musttail`
| requires that the calling function and callee have the same
| signature, and a prototype.
|
| You'd replace the JUMP_TARGET macro:
|
|     #define JUMP_TARGET goto *jump_table[(int32_t)l->input.p[l->pos]]
|
| With:
|
|     #ifdef __clang__
|     #define musttail [[clang::musttail]]
|     #elif __GNUC__
|     #define musttail [[gnu::musttail]]
|     #else
|     #define musttail
|     #endif
|     #define JUMP_TARGET musttail return \
|         jump_table[(int32_t)l->input.p[l->pos]](l, a, out)
|
| Then move the jump table out to the top level and replace each
| `&&` with `&`.
|
| See diff (untested): https://www.diffchecker.com/V4yH3EyF/
|
| This approach has the advantage that it will work everywhere,
| not only on compilers that support computed gotos - it just
| won't be optimized on compilers that don't support `musttail`.
| (Though it has been proposed to standardize it in a future
| version of C.)
|
| It might also work better with code navigation tools that show
| functions, but not labels, and enables modularity as we can split
| rules over multiple translation units.
|
| Performance-wise it should be basically the same - though it's
| been argued that it may do better in some cases, because the
| compiler's register allocator doesn't do a great job in large
| functions with computed gotos, whereas in the musttail approach
| each function is a smaller unit and is optimized separately.
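|
| For a self-contained picture of the shape this takes, here is a
| minimal sketch (hypothetical rule names; MUSTTAIL only expands
| under Clang here, elsewhere it degrades to a plain call): each
| rule is a small function that tail-calls the next rule through
| a file-scope table.
|
|     #include <stdio.h>
|
|     #ifdef __clang__
|     #define MUSTTAIL __attribute__((musttail))
|     #else
|     #define MUSTTAIL /* plain call; correct, but stack growth
|                         is no longer guaranteed constant */
|     #endif
|
|     typedef struct { const char *p; } Lexer;
|     typedef void (*rule_fn)(Lexer *);
|
|     static void rule_end(Lexer *l);
|     static void rule_chr(Lexer *l);
|
|     // Function addresses (&f) instead of label addresses (&&l);
|     // [0 ... 255] is a GCC/Clang range-designator extension.
|     static const rule_fn table[256] = {
|         [0 ... 255] = rule_chr,
|         [0]         = rule_end,
|     };
|
|     static void rule_end(Lexer *l) { (void)l; }
|
|     static void rule_chr(Lexer *l) {
|         putchar(*l->p++);
|         MUSTTAIL return table[(unsigned char)*l->p](l);
|     }
|
|     int main(void) {
|         Lexer l = { .p = "hi" };
|         table[(unsigned char)*l.p](&l);  // prints "hi"
|     }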
| bestouff wrote:
| Can't wait for mandatory TCO to come to Rust. But it's not there
| yet. https://github.com/phi-go/rfcs/blob/guaranteed-
| tco/text/0000...
| sparkie wrote:
| Not sure I like the `become` keyword. Seems bizarre - someone
| encountering this word in code for the first time would have
| no idea what it's doing.
|
| Why don't they just use `tailcall`? That would make it obvious
| what it's doing, because we've been using the term for nearly
| half a century and the entire literature on the subject uses
| the term "tail call".
|
| Even better would be to just automatically insert a tail call
| - like every other language that has supported tail calls for
| decades - provided the callee has the same signature as the
| caller. If it's undesirable because we want a stack trace,
| then instead have some keyword or attribute to suppress the
| tail call - such as `no_tail`, `nontail` or `donttail`.
|
| Requiring tail calls to be marked will basically mean the
| optimization will be underutilized. Other than having a stack
| trace for debugging, there's basically no reason not to have
| the optimization on by default.
| kibwen wrote:
| Rust does allow tail call optimization, but it's LLVM's
| decision whether to optimize tail calls, on a case-by-case
| basis. An explicit syntax to denote tail calls would be the
| difference between tail call _optimization_ and guaranteed
| tail call _elimination_, which is important because if
| you're writing a tail-recursive function then it's pretty
| trivial to blow the stack at any moderate recursion depth
| unless you can guarantee the elimination.
|
| As for why it's not trivial for Rust to do this by default,
| consider the question of what should happen in the case of
| local destructors, which in an ordinary function would be
| called _after_ `return myfunc()` returns, but in a tail-
| recursive function would need to be called beforehand. The
| proposals for `become` tend to handle this by making it a
| compiler error to have any locals with destructors in scope
| at the point of the tail-call, further motivating the
| explicit syntax.
| norir wrote:
| Lexing being the major performance bottleneck in a compiler is a
| great problem to have.
| norskeld wrote:
| Is lexing ever a bottleneck though? Even if you push for lexing
| and parsing 10M lines/second [1], I'd argue that semantic
| analysis and codegen (for AOT-compiled languages) will dominate
| the timings.
|
| That said, there's no reason not to squeeze every bit of
| performance out of it!
|
| [1]: In this talk about the Carbon language, Chandler Carruth
| shows and explains some goals/challenges regarding performance:
| https://youtu.be/ZI198eFghJk?t=1462
| munificent wrote:
| It depends a lot on the language.
|
| For a statically typed language, it's very unlikely that the
| lexer shows up as a bottleneck. Compilation time will likely
| be dominated by semantic analysis, type checking, and code
| generation.
|
| For a dynamically typed language, where there isn't as much
| for the compiler to do, the lexer might be a more
| noticeable chunk of compile times. As one of the V8 folks
| pointed out to me years ago, the lexer is the only part of
| the compiler that has to operate on every single individual
| byte of input. Everything else gets the luxury of greater
| granularity, so the lexer can be worth optimizing.
| SnowflakeOnIce wrote:
| A simple hello world in C++ can pull in dozens of megabytes
| of header files.
|
| Years back I worked at a C++ shop with a big codebase
| (hundreds of millions of LOC when you included vendored
| dependencies). Compile times there were sometimes dominated
| by parsing speed! Now, I don't remember the exact breakdown
| of lexing vs parsing, but I _did_ look at it under a
| profiler.
|
| It's very easy in C++ projects to structure your code such
| that you inadvertently cause hundreds of megabytes of sources
| to be parsed by each single #include. In such a case, lexing
| and parsing costs can dominate build times. Precompiled
| headers help, but not enough...
| adev_ wrote:
| > Now, I don't remember the exact breakdown of lexing vs
| parsing, but I did look at it under a profiler.
|
| Lexing, parsing and even type checking are interleaved in
| most C++ compilers due to the ambiguous nature of many
| constructs in the language.
|
| It is very hard to profile only one of these in isolation.
| And even with the compiler's built-in instrumentation, the
| results are not very representative of the work done behind
| the scenes.
|
| C++ compilers are amazing machines. They are blazing fast
| at parsing a language which is a nightmare of ambiguities.
| And they are like that mainly because of how stupidly verbose
| and inefficient the C++ include system is.
| zX41ZdbW wrote:
| I recommend taking a look at the ClickHouse SQL Lexer:
|
| https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
|
| https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
|
| It supports SIMD for accelerated character matching, it does not
| do any allocations, and it is very small (compiles to a few KB of
| WASM code).
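|
| As a rough illustration of the SIMD character-matching idea
| (SSE2; a hypothetical helper, real implementations like the one
| linked above handle alignment and buffer tails more carefully):
|
|     #include <emmintrin.h>  // SSE2
|     #include <stddef.h>
|
|     // Offset of the first `needle` in `buf`, 16 bytes per step.
|     // Assumes the buffer is padded so reading a full 16-byte
|     // block past `len` is safe.
|     static size_t find_byte(const char *buf, size_t len, char needle) {
|         const __m128i n = _mm_set1_epi8(needle);
|         for (size_t i = 0; i < len; i += 16) {
|             __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
|             int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, n));
|             if (mask) {
|                 size_t off = i + (size_t)__builtin_ctz(mask);
|                 return off < len ? off : len;  // ignore hits in padding
|             }
|         }
|         return len;  // not found
|     }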
| tuveson wrote:
| How much of an improvement does SIMD offer for something like
| this? It looks like it's only being used for strings and
| comments, but I would kind of assume that for most programming
| languages, the proportion of code that is long strings /
| comments is not large. Also curious if there's any performance
| penalty for trying to do SIMD if most of the comments and
| strings are short.
| camel-cdr wrote:
| Usually lexing isn't part of the performance equation
| compared to all other parts of the compiler, but SIMD can be
| used to speedup the number parsing.
| o11c wrote:
| Unfortunately, operating a byte at a time means there's a hard
| limit on performance.
|
| A truly performant lexer needs to jump ahead as far as possible.
| This likely involves SIMD (or SWAR) since unfortunately the C
| library fails to provide most of the important interfaces.
|
| As an example that the C library _can_ handle tolerably: while
| lexing a string, you should repeatedly call `strcspn(input,
| "\"\\\n")` to skip over chunks of ordinary characters, then
| only special-case the quote, backslash, newline and (implicit!)
| NUL after each jump. Be sure to correctly distinguish between
| an embedded NUL and the one you probably append to represent
| EOF (or, if streaming [which requires quite a bit more logic],
| end of the current chunk).
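|
| A hedged sketch of that loop (simplified: no streaming, and the
| appended NUL doubles as EOF):
|
|     #include <string.h>
|
|     // Scan a string literal's body starting just after the
|     // opening quote; returns the closing quote, or NULL on
|     // error.
|     static const char *scan_string_body(const char *p) {
|         for (;;) {
|             p += strcspn(p, "\"\\\n");  // bulk-skip ordinary chars
|             switch (*p) {
|             case '"':  return p;        // closing quote
|             case '\\':                  // skip escape (simplified)
|                 if (p[1] == '\0') return NULL;
|                 p += 2;
|                 break;
|             case '\n':                  // unterminated string
|             case '\0': return NULL;     // NUL here means EOF
|             }
|         }
|     }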
|
| Unfortunately, there's a decent chance your implementation of
| `strcspn` doesn't optimize for the possibility of small sets, and
| instead constructs a full 256-bit bitset. And even if it does,
| this strategy won't work for larger sets such as "all characters
| in an identifier" (you'd actually use `strspn` since this is
| positive), for which you'll want to take advantage of the
| characters being adjacent.
|
| Edit: yikes, is this using a hash without checking for
| collisions?!?
| dist1ll wrote:
| You can get pretty far with a branch per byte, as long as the
| bulk of the work is done w/ SIMD (like character
| classification). But yeah, LUT lookup per byte is not
| recommended.
| xnacly wrote:
| You are somewhat right. I used tagging masks to differentiate
| between different types of atoms [1]. But yes, interning will
| be backed by a correct implementation of a hashmap with some
| collision handling in the future.
|
| [1]: https://github.com/xNaCly/purple-
| garden/blob/master/cc.c#L76...
| kingstnap wrote:
| You can go pretty far processing one byte at a time in
| hardware. You just keep making the pipeline deeper and pushing
| the frequency. And then to combat dependent parsing you add
| speculative execution to avoid bubbles.
|
| Eventually you land on recreating the modern CPU.
| adev_ wrote:
| Cool exercise and thank you for the blog post.
|
| I did a similar thing (for fun) for the tokenizer of a
| Swift-derivative language, written in C++.
|
| My approach was however very different from yours:
|
| - No macro, no ASM, just explicit vectorization using std.simd
|
| - No hand rolled allocator. Just std::vector and SOA.
|
| - No hashing for keywords. They are short. A single SIMD load /
| compare is often enough.
|
| - All the lookup tables are compile time generated from the token
| list using constexpr to keep the code small and maintainable.
|
| I was able to reach around 8 Mloc/s on server grade hardware,
| single core.
| JonChesterfield wrote:
| Byte at a time means not-fast but I suppose it's all relative.
| The benchmarks would benefit from a re2c version, I'd expect that
| to beat the computed goto one. Easier for the compiler to deal
| with, mostly.
| zahlman wrote:
| Is lexing really ever the bottleneck? Why focus effort here?
| s3graham wrote:
| simdjson is another project to look at for ideas.
|
| I found it quite tricky to apply its ideas to the more general
| syntax of a programming language, but with a bunch of hacking
| and a few subtle changes to the language itself, the
| performance difference over one-character-at-a-time was quite
| substantial (about 6-10x).
| psanchez wrote:
| The jump table is interesting, although I guess the performance
| of a switch will be similar if properly optimized by the
| compiler; I would not be able to tell without trying. Also,
| different compilers might take different approaches.
|
| A few months ago I built a toy boolean expression parser as a
| weekend project. The main goal was simple: evaluate an expression
| and return true or false. It supported basic types like int,
| float, string, arrays, variables, and even custom operators.
|
| The syntax and grammar were intentionally kept simple. I wanted
| the whole implementation to be self-contained and compact,
| something that could live in just a .h and .cc file. Single pass
| for lexing, parsing, and evaluation.
|
| Once the first version was functional, I challenged myself to
| optimize it for speed. Here are some of the performance-related
| tricks I remember using (see the sketch after this list):
|
| - No string allocations: used the input *str directly, relying
| on pointer manipulation instead of allocating memory for
| substrings.
|
| - Stateful parsing: maintained a parsing state structure passed
| by reference to avoid unnecessary copies or allocations.
|
| - Minimized allocations: tried to avoid heap allocations
| wherever possible. Some were unavoidable during evaluation, but
| I kept them to a minimum.
|
| - Branch prediction-friendly design: used lookup tables to
| assist with token identification (mapping the first character
| to a token type and validating identifier characters).
|
| - Inline literal parsing: converted integer and float literals
| to their native values directly during lexing instead of
| deferring conversion to a later phase.
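|
| A minimal sketch of the lookup-table trick from the list above
| (hypothetical table, ASCII input assumed): one 256-entry array
| maps the first byte to a coarse token class, so dispatch is a
| single indexed load instead of a chain of range comparisons.
|
|     #include <stdint.h>
|
|     enum { C_OTHER, C_SPACE, C_DIGIT, C_ALPHA, C_OP };
|
|     // [a ... b] range designators are a GCC/Clang extension.
|     static const uint8_t char_class[256] = {
|         [' ']  = C_SPACE, ['\t'] = C_SPACE, ['\n'] = C_SPACE,
|         ['0' ... '9'] = C_DIGIT,
|         ['a' ... 'z'] = C_ALPHA, ['A' ... 'Z'] = C_ALPHA,
|         ['_'] = C_ALPHA,
|         ['+'] = C_OP, ['-'] = C_OP, ['='] = C_OP, ['.'] = C_OP,
|     };
|
|     static int classify(const char *p) {
|         return char_class[(unsigned char)*p];
|     }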
|
| I think all the tricks are mentioned in the article already.
|
| For what it's worth, here is the project:
| https://github.com/pausan/tinyrulechecker
|
| I used this expression to assess the performance on an Intel(R)
| Core(TM) i7-8565U CPU @ 1.80GHz (launched Q3 2018):
|
|     myfloat.eq(1.9999999) || myint.eq(32)
|
| I know it is a simple expression, and a larger expression would
| likely perform worse due to variable lookups, etc. I could get
| a speed of 287MB/s, or 142ns per evaluation (7M evaluations per
| second). I was pleasantly surprised to reach those speeds given
| that one evaluation is a full cycle of lexing, parsing and
| evaluating the expression itself.
|
| The next step I considered was to use SIMD for tokenizing, but
| I'm not sure it would have helped much with overall expression
| evaluation times; I seem to recall most of the time was spent
| in the parser or evaluation phases anyway, not the lexer.
|
| It was a fun project.
| aappleby wrote:
| Lexing is almost never a bottleneck. I'd much rather see a
| "Strategies for Readable Lexers".
| kklisura wrote:
| > As introduced in the previous chapters, all identifiers are
| hashed, thus we can also hash the known keywords at startup and
| make comparing them very fast.
|
| One trick that Postgres uses [1][2] is perfect hashing [3].
| Since you know in advance what your keywords are, you can
| design a hash function such that for each keyword w_i in the
| list of n keywords W, h(w_i) = i. That means no collisions, and
| the memory requirement is O(n).
|
| [1]
| https://github.com/postgres/postgres/blob/master/src/tools/P...
|
| [2]
| https://github.com/postgres/postgres/blob/master/src/tools/g...
|
| [3] https://en.wikipedia.org/wiki/Perfect_hash_function
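|
| A minimal sketch of the idea (hypothetical keyword set; real
| generators such as PostgreSQL's PerfectHash.pm compute a
| collision-free function offline for the full keyword list):
|
|     #include <string.h>
|
|     enum { KW_IF, KW_ELSE, KW_WHILE, KW_RETURN, KW_NONE };
|
|     static const char *keywords[] = { "if", "else", "while", "return" };
|
|     // For this tiny set, length alone happens to be a perfect
|     // hash: 2, 4, 5 and 6 are all distinct.
|     static int keyword_lookup(const char *s, size_t len) {
|         static const signed char by_len[] =
|             { -1, -1, KW_IF, -1, KW_ELSE, KW_WHILE, KW_RETURN };
|         if (len >= sizeof by_len || by_len[len] < 0) return KW_NONE;
|         int idx = by_len[len];
|         // One final compare confirms it's really the keyword and
|         // not another identifier landing in the same slot.
|         return memcmp(s, keywords[idx], len) == 0 ? idx : KW_NONE;
|     }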
___________________________________________________________________
(page generated 2025-07-14 23:00 UTC)