Lezer
Tuesday, September 3, 2019 (tags: parsing, javascript, performance, lezer)
I keep coming across people who consider parser technology a
forbidding, scary field of programming. This is nonsense--a small
parser can be very, very simple, and provide a wholesome exercise in
recursive thinking. At the same time, it is true that you can make
parsing extremely complicated. Mostly, this tends to happen when you
generalize parsing techniques to work for different grammars. Since
compiler textbooks usually describe general approaches to parsing,
they may be to blame for putting people off.
This post describes a new parsing system I wrote for CodeMirror, a
source code editor. It frames the system with some history, and
digresses into some neat architectural details.
Editor features like syntax highlighting, bracket matching, code
folding, and autocompletion all involve some level of parsing.
Unfortunately, since editors have to handle many different languages,
they require a generalized approach to parsing.
CodeMirror is in the process of being rewritten, and I wanted to
improve the way it parses its content. Parsing inside of an editor
comes with its own unique set of constraints, which can be hard to
satisfy. Though I had been planning new approaches for years, all I
had to show for it so far was a pile of dead ends.
The constraints that make the parsing problem in a code editor hard
are roughly these:
* The document is constantly changing.
* You can't do anything expensive. If the parsing work takes too
long, it'll introduce latency that makes editing feel sluggish
and unresponsive.
* The input is often not in a finished, syntactically correct form.
But you still have to make some sense of it--nobody wants an
editor where most features stop working when you have a syntax
error in your document.
* You often want to be able to mix several languages/grammars in a
single document (think HTML with JavaScript and CSS embedded in
it).
Keeping those in mind, let's go over the approaches I've tried.
A Brief History of CodeMirror Parsing
The system as it exists in CodeMirror 5 (which is pretty much what
we've been using from the very beginning) is a simple one. For
each language, you write a tokenizer which splits the input into
pieces, and labels each piece with some syntactic category (such as
variable, keyword, or number). The tokenizers can be stateful, which
allows them to secretly be full parsers if they want to.
This state must be copyable, so that the editor can strategically
store tokenizer states from a previous run, and after a change,
resume one close to that change to avoid re-tokenizing the entire
document. Because we are usually only interested in the code in the
visible viewport, this means the complexity of re-tokenizing is
bounded by the distance between the change and the end of the
viewport. Since most changes happen inside of that viewport, this
works well in practice.
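To make that concrete, here is a toy mode written against CodeMirror
5's mode interface (the language and token rules are invented for
illustration; real modes are larger, but have the same shape):
CodeMirror.defineMode("toy", function() {
  return {
    // The state tracks whether we are inside a block comment.
    startState: function() {
      return {inComment: false};
    },
    // Reads one token from the stream and returns its style.
    token: function(stream, state) {
      if (state.inComment) {
        if (stream.match("*/")) state.inComment = false;
        else stream.next();
        return "comment";
      }
      if (stream.match("/*")) { state.inComment = true; return "comment"; }
      if (stream.match(/^\d+/)) return "number";
      if (stream.match(/^[a-zA-Z_]\w*/)) return "variable";
      stream.next();
      return null;
    },
    // States must be copyable so the editor can snapshot them and
    // resume tokenizing near a change. A plain object like this one
    // is easy to clone.
    copyState: function(state) {
      return {inComment: state.inComment};
    }
  };
});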
---------------------------------------------------------------------
Such tokenizers are awkward to write directly, so over the years
several attempts have been made to build abstractions over them. The
first was the Common JavaScript Syntax Highlighting Specification, an
attempt by the authors of Mozilla Skywriter (formerly Bespin, later
merged into ACE) to define a declarative format for describing
tokenizers as state machines with regular expressions (describing the
tokens) as edges. The ACE project ended up with an incompatible but
similar format (too entangled with their internals to use in
CodeMirror, unfortunately). I did an implementation of the original
spec for CodeMirror, and then another incompatible extension because
the base spec was too limiting. There are a few CodeMirror modes
still based on that code, but it was no real success.
I think the reason such state machines (and the somewhat related
TextMate grammars which are in wide use in desktop editors) never
felt like a great solution is that, once you get past trivial
grammars (where their declarative simplicity does look really nice),
they don't really help that much with abstraction. Manually designing
complicated state machines is a chore. Regular expressions, which are
bad enough on their own, become downright terrifying when you have to
construct all your edges out of them, often stuffing multiple tokens
into a single expression to avoid creating intermediate states. This
"abstraction" has a tendency to produce uglier, less maintainable
code than what you'd get when writing the tokenizer as plain code.
---------------------------------------------------------------------
So in 2017, I started an ambitious project to create a better way to
abstractly define incremental tokenizers. I had concluded that
classical parser generators based on context-free grammars were never
going to work in this context (for reasons that I'll come back to
later on). But I kept coming across parsing expression grammars,
which weren't based on context-free grammars and had some interesting
properties, such as being able to combine multiple grammars to create
a new grammar (which is great for mixed-language documents).
So I spent several months building a parsing system that took a
PEG-like grammar, compiled it down to a state machine, and made it
possible to run that state machine as a CodeMirror language mode.
This system is a marvel. It uses a moderately sophisticated
optimizing compiler to generate the state machines. The result works
quite well, and is used in several real-world systems today. But
unfortunately, if I'm honest, it is a tragically bad idea taken way
too far.
Parsing expression grammars are parsed by backtracking. And as such,
they are very poorly suited for implementing a stateful tokenizer. In
a backtracking system, you never know when you've definitely parsed a
piece of content--later input might require you to backtrack again. So
what I ended up with was actually not PEG at all, but a system where
you had to explicitly annotate where the parser should look ahead.
Though grammars written this way were relatively readable, they
involved a lot of finicky, error-prone kludges to resolve local
ambiguity.
Also, parsing with PEGs is just really inefficient. Such grammars are
"scannerless", meaning they don't make a distinction between
tokenizing and parsing. When parsing in that way naively, you
basically have to run your whole parsing logic for every input
character. Probably multiple times, due to backtracking. A lot of the
magic in the compiler was intended to recover the tokens that were
implicit in the grammar, in order to recover some efficiency. But the
system never came close to hand-written language modes in terms of
speed.
Tree-sitter
So, though I knew I needed a new approach, I went into the CodeMirror
6 rewrite without any specific idea on what that approach would look
like.
And then I saw tree-sitter, and was enlightened.
Tree-sitter is a parser system written with the code editor use case
in mind, and is in the process of being integrated into the Atom
editor. It takes a much more ambitious approach to what a parser
inside an editor should do: It builds up a full, accurate syntax tree
for the content.
You can do so much more with an actual syntax tree than with a
sequence of tokens. Whereas tokens, possibly augmented with some
information stored in the tokenizer state, allow you to sort of
approximately understand some aspects of the code's structure, a
tree usually gives you precisely the information you need.
Most of the ideas that tree-sitter uses aren't new; in fact, a paper
from 2000 describes a somewhat similar system. But as far as I know,
tree-sitter is the first system that puts them all together into a
practical piece of software.
Unfortunately, tree-sitter is written in C, which is still awkward to
run in the browser (and CodeMirror targets non-WASM browsers). It
also generates very hefty grammar files because it makes the size/
speed trade-off in a different way than a web system would.
But good ideas can be ported. Lezer is a JavaScript-based system
heavily inspired by tree-sitter.
LR Parsing and Context-Free Grammars
For a long time, I was firmly convinced that classical parser systems
based on context-free grammars and LL or LR parsing algorithms were
just not suitable for the editor use case. My arguments for this
were...
Context-free grammars are a limiting abstraction that breaks down as
soon as the language does anything funky. Needing the grammar to be
LR or LL to please the parser generator further pins you into a
corner.
This is not wrong. Expressing operator precedence in a pure
context-free grammar requires writing a silly formulaic rule for each
level of precedence. And when you need to implement something like
automatic semicolon insertion or whitespace-sensitivity, which would
be a couple of lines of code in a hand-written parser, you can't
express that directly, and have to somehow escape the context-free
abstraction.
Making such a grammar suitable for an LR parser generator can be even
more tricky, and often requires you to have a rather deep
understanding of how the parser generator works.
But like many things, once you get to know them, they aren't that
bad. Parser generators can support precedence declarations, which
make operator parsing a lot less terrible. They can even output
decent error messages.
Supporting dynamic resolution of ambiguities through something like
GLR parsing can provide a practical way out of situations that parser
generators are traditionally bad at.
And contrary to some of the abstractions I mentioned before, this one
actually gets us something. Context-free grammars, when combined with
a proper parser generator, really do give us fast parsers from
readable, compact grammar declarations.
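As a rough illustration in the notation Lezer ends up using (a
simplified, hypothetical grammar, not one that ships with Lezer),
precedence declarations let a single expression rule cover several
operator levels:

@precedence { times @left, plus @left }

@top Program { expression }

expression { Number | BinaryExpression }

BinaryExpression {
  expression !times ("*" | "/") expression |
  expression !plus ("+" | "-") expression
}

@skip { space }

@tokens {
  Number { $[0-9]+ }
  space { $[ \t\n]+ }
}

The !times and !plus markers tell the parser generator how to resolve
the conflicts between the two branches, with names that appear
earlier in the @precedence declaration binding more tightly.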
A strict separation between the tokenizer and parser is problematic.
It is, in many languages (think of JavaScript's ambiguity between
regular expressions and the division operator). It also tends to make
mixed-language parsing harder.
But just because this type of parser is traditionally run with a
completely separate tokenizer doesn't mean it has to be. Having the
parse state drive the tokenizer is largely unproblematic. You can
even have the parser generator set this up automatically, without
user involvement.
Generated parsers are way too big.
A naively generated LR parser is huge, and many tools spit out
embarrassingly big files. But with careful parser state deduplication
and table compression such a parser can be made about as compact as a
hand-written one.
Making such a parser error-tolerant is extremely cumbersome.
If you search the scholarly literature for approaches to
error-tolerance in LR parser systems, you get a lot of results, with
a lot of different approaches, but none of them are very practical.
Most require the grammar writer to explicitly annotate the grammar
with error-recovery strategies, bloating the grammar and putting the
responsibility for getting it right on every grammar author.
Tree-sitter ingeniously abuses GLR parsing, where the parser can try
multiple interpretations simultaneously, to integrate automatic
error-correction without a lot of extra complexity. Lezer copies this
approach.
Lezer
I called my tree-sitter copycat project Lezer, which is the Dutch
word for reader (and pronounced a lot like laser). It is a bit less
advanced than tree-sitter in some areas, a bit more advanced in
others, and simply different on quite a lot of points, as determined
by a different set of priorities and tastes.
CodeMirror 6 will retain the ability to run a classical stateful
tokenizer, but its recommended way to define a language mode is to
write a Lezer grammar and wrap it in a CodeMirror-specific package
that adds some editor-related metadata.
Lezer is an LR (with opt-in GLR) parser generator. It has support for
incremental parsing, where you can cheaply re-parse a document after
local changes have been made to it by reusing pieces of the old parse
tree. It automatically tries to recover and continue parsing when it
runs into a syntax error, leaving markers in the output tree that
indicate where the recovery happened.
Lezer consists of an off-line parser generator tool, which takes a
grammar description and outputs a JavaScript module containing a
parser for that grammar, and a parser run-time system (which such
output files depend on) to do the actual parsing. Only the run-time
system and the generated parser need to be loaded by the editor.
The parser outputs non-abstract syntax trees, meaning that it just
creates a raw tree structure containing the constructs it parsed
(with information on where it found them), without organizing them
into a clean, easy-to-use data structure.
The system is optimized for compactness, both in parser table size
and syntax tree size. It needs to be practical to ship a bunch of
parsers to a user on the web without producing megabytes of network
traffic, and it needs to be realistic to keep syntax trees for large
documents around without running out of memory.
The Lezer guide provides a more thorough introduction, as well as a
description of its grammar notation. In this blog post, I want to go
into the neat implementation details that aren't relevant in user
documentation.
Error Recovery
The point where I became convinced that I definitely needed to use or
copy tree-sitter was when I understood its error recovery strategy.
Say you reach a point where you can no longer proceed normally
because there is a syntax error. The rest of the input, after the
error, is probably full of meaningful constructs that could still be
parsed. We want those constructs in our syntax tree. But our regular
parsing process is stuck--it doesn't know how to get from the error to
a state where the parse can continue.
I definitely did not want to require the grammar author to add error
recovery hints to their grammar. These tend to clutter up the grammar
and are error-prone to write. Writing a grammar is hard enough
without that distraction.
You can see error recovery as a search problem. There might be a
parse state and input position (past the error) where the parse can
meaningfully continue. We just have to find it.
The actions encoded in the parse tables, along with some
recovery-specific actions that the parser wouldn't normally take,
provide a kind of search tree. You start at the state(s) where the
error occurred, and keep exploring new states from there.
But what does the accept condition look like? When do you know that
you've found an acceptable solution? You could define that precisely,
for example as the state that can handle the next N tokens without
further errors. But we can also be vague.
The solution found by Max Brunsfeld in tree-sitter is to use the same
mechanism that's used to parse ambiguous grammars. A GLR parser can
split its parse stack and run both sides alongside each other for a
while until it becomes clear which one works out.
That's pretty much exactly what a search algorithm does--it tracks a
number of branches that it still has to explore, and continues to
explore them, possibly pruning unpromising branches with some
heuristic, until it finds a solution.
To be able to get good results, or at least some result, in messy
situations like longer stretches of invalid input, each branch has a
badness score associated with it, which is increased (linearly) each
time a recovery action is taken, and decreased (asymptotically) every
time it can consume a token normally.
What we want to do is, after an error, try all kinds of possible
recovery tricks, which recursively branch off a large amount of
states. But then, after a bit of that, we should consolidate to one
or, at most, a few parse states again, because parsing input in a
whole bunch of different ways is expensive.
To get this effect, Lezer forbids states with a badness higher than a
given multiple of the best state's badness (or some maximum
threshold) from applying further recovery actions, effectively
dropping those branches when they can't proceed normally. In the case
where one branch finds a good way to continue, that branch's badness
will converge to zero and eventually stop all worse branches. In
cases where the input continues to make no sense, all branches will
eventually get a badness score exceeding the maximum, and the parser
will only continue one of them.
The recovery strategies used are:
* Skip the next token, and try again with the same state after
that.
* Invent a token--take any of the tokens that are valid in this
state, and continue to the state that consuming them would
produce. This is the main source of branching, since many states
allow a lot of tokens.
* Force the end of the innermost production that's currently being
parsed.
There are situations where the result of this approach isn't entirely
optimal, but it usually does well. The important thing is that it
always keeps parsing, and does so in a way that remains tractable
(exponential searches are quickly dampened). The system is biased a
bit towards the token-skipping rule, so that if all else fails it'll,
in effect, just continue skipping tokens until it stumbles into a
situation where it can continue parsing.
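A hypothetical sketch of that pruning logic (all names and numbers
here are invented for illustration; this is not Lezer's actual code)
might look like this:

// One step of the recovering parser: each branch either consumes the
// next token normally (and its badness decays), or, if it is still
// close enough to the best branch, spawns recovery attempts (and its
// badness grows). Everything else is dropped.
const RECOVERY_PENALTY = 100;  // linear increase per recovery action
const DECAY = 0.9;             // asymptotic decrease on normal progress
const MAX_BADNESS = 500;

function canRecover(branch, bestBadness) {
  return branch.badness <= Math.min(MAX_BADNESS, bestBadness * 3);
}

function step(branches, tokenFits) {
  let best = Math.min(...branches.map(b => b.badness));
  let survivors = [];
  for (let branch of branches) {
    if (tokenFits(branch)) {
      survivors.push({...branch, badness: branch.badness * DECAY});
    } else if (canRecover(branch, best)) {
      // A real implementation would push one copy per recovery
      // strategy (skip a token, insert a token, force-reduce).
      survivors.push({...branch, badness: branch.badness + RECOVERY_PENALTY});
    }
    // Branches that can neither proceed nor recover are dropped.
  }
  return survivors;
}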
Post-Order Parser Output
When you have a parser that may split its state--a lot--and build up
parts of the tree multiple times, that duplicate tree
building and the bookkeeping involved in it can cause a lot of
unnecessary work.
The order in which an LR parser creates nodes is inner-to-outer. It
will, for example, first create the node for the operands, and then
later the node for the operator expression. This suggests an
approach: What if, instead of building a tree structure right away,
the parser just keeps a flat log of the nodes it creates? This can be
an array in which the nodes are arranged in post-order, with children
coming before parents.
The parser just appends to this array. When splitting the state, one
state keeps the existing array, and the other gets a new empty array
along with a pointer to the state that has the rest of the array, and
the length of that array at the time of the split.
Now splitting involves no node copying at all. You do need to copy
the state stack, which LR parsers use to track context, but that is
generally shallow.
In addition, node allocation becomes as cheap as appending a few
numbers to an array. For actions that don't result in tree nodes
(Lezer allows you to mark rules as uninteresting, to keep the tree
small), you don't have to do anything at all. The control stack
stores the output array position at the start of each rule, and can
use that to emit enough data to later reconstruct parent-child
relationships.
After a parse finishes successfully, the final state's parent-array
pointers can be used to find all the nodes that make up the tree, and
construct an actual tree structure out of them.
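In code, the arrangement might look roughly like this (the field
names are invented, and Lezer's actual buffers store more per node):

class OutputBuffer {
  constructor(parent = null, parentLength = 0) {
    this.nodes = [];      // flat post-order log: children before parents
    this.parent = parent; // buffer holding what was written before the split
    this.parentLength = parentLength;
  }
  // Emitting a node is just a cheap push of a few numbers.
  append(type, start, end) {
    this.nodes.push(type, start, end);
  }
  // Splitting copies nothing: the new branch shares everything
  // written so far through the parent pointer.
  split() {
    return new OutputBuffer(this, this.nodes.length);
  }
  // After a successful parse, walk the parent chain to reassemble
  // the full log for the winning branch.
  collect() {
    let parts = [this.nodes], buf = this;
    while (buf.parent) {
      parts.unshift(buf.parent.nodes.slice(0, buf.parentLength));
      buf = buf.parent;
    }
    return parts.flat();
  }
}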
One tricky issue occurs when skipped content (whitespace and
comments) produces nodes. If you have code like this...
if (true) something()
// Comment
otherStatement()
... the comment should not be part of the if statement's node. Yet
the parser only knows for sure that it can finish that node after
seeing the next statement (there might be an else still coming).
In cases like this, where the output array contains skipped nodes
immediately in front of a reduction, the parser has to move them
forward and store the end of the node before them. Fortunately, this
occurs relatively rarely (unless you add nodes for whitespace, in
which case it'll happen at the end of every rule that has a possible
continuation).
Buffer Trees
A nice thing about the flat post-order tree representation is that it
is compact. Tree structures constructed the usual way, as separately
allocated nodes, incur a lot of extra overhead for pointers and
allocation headers. They can also have terrible locality, since who
knows how far from each other the memory allocator will put the
nodes.
Unfortunately, we can't just use a flat representation for our syntax
trees. The incremental parser has to be able to reuse parts of it
without copying those parts into a different buffer.
But we can use it for parts of the tree. Storing the coarse structure
as a classical tree, but the content of smaller nodes (say less than
a few thousand characters long) as flat arrays, gives us the best of
both worlds. Since most nodes, by number, live in the fine structure,
this saves a large amount of overhead (and helps with locality).
That does mean that we can't reuse small nodes. But since their size
is limited, the amount of work that is involved in re-parsing them is
also limited. And by removing them from consideration, the
incremental parser can avoid quite a bit of the work involved in
preparing and scanning the tree for reuse.
A small node stores its content in a typed array of 16-bit unsigned
integers. It uses 4 such numbers (64 bits) per node, storing a type,
a start position, an end position, and a child count for each node.
Contrary to the array created by the parser, these arrays are in
pre-order, because that makes forward iteration (which tends to be
more common than backward iteration) cheaper. The child count is
almost redundant (the end position can sort of tell you which nodes
are children), but Lezer supports zero-length nodes, which might land
on the end of their parent node and make it ambiguous whether they
belong to it or not.
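To make the layout concrete, here is a hypothetical reader for such a
buffer (simplified, with made-up names; Lezer wraps this in a
higher-level interface):

// Each node occupies four 16-bit slots: type, start, end, child count.
// Nodes are stored in pre-order, so a subtree can be read back with a
// simple recursive descent.
function readNode(buffer, pos = 0) {
  let type = buffer[pos], start = buffer[pos + 1];
  let end = buffer[pos + 2], childCount = buffer[pos + 3];
  let children = [], next = pos + 4;
  for (let i = 0; i < childCount; i++) {
    let child = readNode(buffer, next);
    children.push(child.node);
    next = child.next;
  }
  return {node: {type, start, end, children}, next};
}

// A parent node (type 1, positions 0-5) containing one child (type 2).
let buffer = new Uint16Array([1, 0, 5, 1,  2, 0, 5, 0]);
console.log(readNode(buffer).node);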
Client code, of course, doesn't want to deal with this
representation. Lezer provides an abstract interface to searching in
and walking through trees that hides the buffer structure, allowing
you to conceptually work with a uniform tree of nodes.
Lezer, like tree-sitter, stores the result of repetitions in the
grammar (produced by the * and + operators) as balanced subtrees.
This means that, unless your input is pathological (say, a thousand
applications of a single binary operator in a row), you tend to get
shallow, well-balanced syntax trees, which are cheap to search and
allow effective reuse.
Contextual Tokens
Depending on the grammar's complexity, an LR parser generator creates
between a dozen and a few thousand parse states for your grammar.
These represent syntactic positions like "after the opening paren of
an argument list" or "after an expression, possibly expecting some
expression suffix".
The parser generator can figure out which tokens are valid in a given
state. It can also, for tokens specified as part of the grammar,
automatically determine which tokens conflict (match the same input,
or some prefix of each other).
A well-known example of conflicting tokens is the division operator
versus regular expression syntax in JavaScript. But others are
keywords that can also appear as property names, and the bitwise
right shift operator (>>) versus two closing angle brackets in C++.
Lezer will not complain about overlapping tokens if the tokens do not
appear in the same parse states. This implicitly resolves the regular
expression and property name issues, without any user interaction.
When conflicting tokens do appear in the same place, such as division
operators and C-style comments, you have to specify an explicit
precedence ordering (comments take precedence) to tell the tool that
you know what you're doing.
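In a grammar, that declaration might look roughly like this
(simplified sketch):

@tokens {
  Divide { "/" }
  LineComment { "//" ![\n]* }
  // Explicit ordering: comments win where both tokens could match.
  @precedence { LineComment, Divide }
}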
Contextual tokenization is implemented with a concept called token
groups. Tokens that have unresolved conflicts with other tokens are
assigned to one or more groups, where each group contains only
non-conflicting tokens. Each state is assigned a single group (if it
expects tokens that conflict with each other that's an error). This
group is passed to the tokenizer, which then takes care to only
return tokens that are either in that group, or don't conflict with
any other tokens. The check is optimized by storing group membership
in a bitset, and seeing if the right bit is set with binary and.
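In other words, the test boils down to something like this (a sketch
with invented names):

// Each conflicted token carries a bitmask of the groups it belongs to.
// Tokens without conflicts can simply have every bit set.
function tokenAllowed(tokenGroupMask, stateGroup) {
  return (tokenGroupMask & (1 << stateGroup)) != 0;
}

tokenAllowed(0b101, 2);  // true: the token is in group 2
tokenAllowed(0b101, 1);  // false: not in group 1, so not returned here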
Tokens are compiled down to a single deterministic state machine,
which is run on the input character stream. In cases like the
regexp-versus-division issue, you don't want the machine to go
running through regexp-specific states in a situation where you only
allow division, since that would be wasteful. Therefore, each
tokenizer state is also tagged with a bitset that tells you which
groups the tokens reachable from that state belong to, and the
tokenizer stops running when it hits a state that has no overlap with
the allowed tokens for the parse state.
Skip Expressions
Almost all programming languages have special syntactic elements like
whitespace and comments that may occur between any tokens. Encoding
these directly in the grammar is extremely tedious for most
languages.
Traditionally, tokenizers just skip such elements when reading the
next token. That works well in most contexts, but makes it awkward to
include the elements in the parse tree.
Lezer treats skipped things like they are part of the grammar (though
in an optimized way to avoid increasing the size of the parse
tables). It is possible to skip things that aren't single tokens (to
implement something like nestable comments, for example, or to make
sure your block comment nodes consist of smaller nodes so that you
can incrementally parse giant block comments).
Each rule or group of rules may have its own set of skipped
expressions, so that you can express different sublanguages in a
grammar, for example something like the content of interpolated
strings, without allowing spacing in places where the language
doesn't allow it.
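A sketch of what that can look like in a grammar (simplified,
hypothetical rules):

// Whitespace is skipped everywhere, except inside String, which gets
// its own (empty) skip set so spaces stay part of the string content.
@skip { space }

@skip {} {
  String { '"' stringContent* '"' }
}

@tokens {
  space { $[ \t\n]+ }
  stringContent { !["\\]+ }
}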
Each parse state has a pointer to a (shared) set of skip actions,
which, for the skipped tokens or tokens that start a compound skipped
expression, contains the actions to take for those tokens. For
single-token skipped elements, that action just tells the parser to
skip the token and stay in the same state. For compound elements, it
causes the state that handles the rest of the element to be pushed
onto the control stack.
Tree Node Tagging
The languages that a tool like Lezer needs to handle are wildly
different, from JavaScript to Haskell to CSS to YAML. As such, it is
difficult to find a cross-language vocabulary to describe their
constructs. In fact, it seems like that would be a separate
multi-year project, and pull in a serious amount of complexity.
Yet it would be nice if the parser output comes with some information
that can be interpreted without knowing what language you are working
with.
After several iterations, what I decided on was a system where nodes
have names, which only have a meaning within the language, and props,
which are values associated with tags defined by external code.
Integrating a language grammar into CodeMirror involves assigning
values for some of these props to the node types used by the
language--things like syntax highlighting style information and how to
indent such nodes.
Since the number of node types in a language is limited, we can
allocate an object for each node type to hold this information, and
have all nodes of that type point to the same object.
To allow code outside the grammar to add props without mutating
global state, parser instances can be extended with additional props,
creating a copy that will output nodes with the props attached. This
is especially useful in the context of mixed-language trees.
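Conceptually (this is an illustration of the idea, not Lezer's actual
API), the arrangement looks something like this:

// One shared object per node type; every node of that type points at it.
class NodeType {
  constructor(name, props = {}) {
    this.name = name;    // only meaningful within the language
    this.props = props;  // e.g. highlighting style, indentation strategy
  }
  // Adding props produces a new type object instead of mutating the
  // shared one, so different clients can attach different metadata.
  extend(extraProps) {
    return new NodeType(this.name, {...this.props, ...extraProps});
  }
}

let call = new NodeType("CallExpression");
let styled = call.extend({highlight: "call"});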
Lezer has support for a limited form of grammar nesting. If language
A can appear inside a document in language B, and the end of the
region covered by A can be unambiguously found by scanning for a
specific token, Lezer can temporarily switch to another set of parse
tables while parsing such a region.
The syntax tree will then contain nodes from both grammars. Having
props directly attached to the nodes makes it much easier to work
with such trees (as opposed to using a language-specific table that
associates node names with metadata).