[HN Gopher] Pratt Parsers: Expression Parsing Made Easy (2011)
___________________________________________________________________
Pratt Parsers: Expression Parsing Made Easy (2011)
Author : memalign
Score : 127 points
Date : 2024-01-20 11:16 UTC (11 hours ago)
(HTM) web link (journal.stuffwithstuff.com)
(TXT) w3m dump (journal.stuffwithstuff.com)
| zevv wrote:
| Ha, nice to see this on HN: this article was pretty helpful to me
| to understand the concept a few years back when extending my PEG
| parsing library [1] with a Pratt parser; this mitigates the
| problem of PEG parsers not allowing left recursion and allows for
| a much more concise notation of grammars with operator
| precedence. Thank you Bob:
|
| 1. https://github.com/zevv/npeg#precedence-operators
| vidarh wrote:
| I love that way of embedding precedence tables without the
| "classic" repetitive EBNF pattern. I will shamelessly steal
| this for my next parser library. Also, your documentation is
| outstanding.
|
| (tiny typo - "Pratt" is spelled "Prett" in one sentence on your
| page)
| kragen wrote:
| did you try warth's left-recursion hack for packrat parsers?
| i'm curious what your experience was with it if so
| throwup238 wrote:
| A few years back I had to implement a new parser for a custom
| expression language to replace an old parser that didn't give
| very good errors and was hard to extend. I never did the
| traditional CS track so parsers were kind of mysterious black
| magic so naturally first thing I did was search HN.
|
| I stumbled on an older parsing thread [1] with a link to a toy
| Pratt implementation [2] made by an HNer... and shamelessly
| ripped it off to write my own production Pratt parser
| implementation.
|
| Great success! Writing a Pratt parser by hand is a lot easier
| than I thought and like the comment says, adding type information
| was trivial.
|
| [1] https://news.ycombinator.com/item?id=24480504
|
| [2] https://github.com/mx-scissortail/WebBS
| PartiallyTyped wrote:
| This is always a nice read. I was going through Bob's interpreter
| book and I wanted a more complete or interesting parser. This is
| a very nice addition to that.
| kevindamm wrote:
| Pratt or TDOP parsers are my favorite design when coding the
| parser directly in a host language, with parser combinators a
| close second (but it edges ahead depending on the host language).
|
| But when it comes to writing a larger parser with many production
| rules, I'd rather use a parser generator based around an Earley
| parser. You can split production rules and organize their post-
| processing into an AST a lot easier .. but providing good error
| statements is still a bit of a hassle.
|
| I haven't found a parser generator that makes it painless to
| provide good error messages. Writing the parser directly in the
| host language has always been much easier in that regard.
| nickpsecurity wrote:
| sklogic told us here that his strategy was to use two parsers:
| a fast, generated one by default; a separate one for good,
| error handling. If using that strategy, then we might be able
| to build some error handling in as part of the grammar. The
| generator ignores it for one engine and uses it for another.
| kevindamm wrote:
| That's good advice.. it's especially useful in an Earley
| design because it is much faster when it doesn't have to keep
| explicit back references for derivation, but those references
| are needed for explaining the current context. Adding a
| switch for keeping/discarding history (and a compiler pass
| for whether or not to fold the epsilon-DFA into collapsed
| states, "NNF" style) would automate the fast-vs-
| slow|informative variants. Add in some debug-only output in
| postprocessors and it wouldn't really need any syntactical
| changes to the parser generator. I may try this soon..
| danybittel wrote:
| I've been working on a parser that parses while typing and
| gives type and syntax errors as soon as possible but never any
| false positives. I didn't expect it to be that difficult. I
| still haven't really grokked it and any bugfix usually ends in
| a game of whack a mole over integration tests. Any hints?
| ReleaseCandidat wrote:
| It depends on the language you're parsing, of course. But the
| most important thing is to know when your parser is on the
| right track again. The (most of the time) easiest (but not
| necessarily first possible) such point is the next definition
| statement. And make sure to _always_ consume (at least) one
| token (even if just adding it to the error expression in the
| AST), so there is no possibility of an endless loop.
| ivanjermakov wrote:
| I can recommend this article:
| https://matklad.github.io/2023/05/21/resilient-ll-parsing-
| tu...
| orthoxerox wrote:
| Roslyn (C# compiler) has been designed to quickly parse
| invalid code while still providing intellisense.
| ww520 wrote:
| One way I found helpful in giving precise parsing errors
| without causing excessive false errors is to have the parser
| auto-completing the missing tokens with placeholder error
| nodes. E.g. if (cond1 And ) { ... }
|
| The And operator is missing its right operand, but everything
| else is valid.
|
| The parser knows the And operator needs two arms. It can
| complete the missing arm with an error node.
| fn parseBiCondition() { let leftNode =
| parseBoolExpr() if (lookaheadToken() is And)
| return parseAnd(leftNode) if (lookaheadToken() is
| Or) ... } fn
| parseAnd(leftNode) { consumeToken()
| let rightNode = parseBoolExpr() if (rightNode !=
| null) return AndNode(leftNode, rightNode)
| else return AndNode(leftNode,
| ErrorNode("missing operand", location)) }
|
| In this way, the conditional expression is parsed to
| completion without interruption. Multiple errors can be
| handled with completion and recorded. At the end it's a
| matter of traversing the AST to print out all the ErrorNodes.
| ReleaseCandidat wrote:
| > I haven't found a parser generator that makes it painless to
| provide good error messages.
|
| This. And letting the user add their own infix function with
| configurable precedence and associativity is easy using Pratt
| too.
| timhh wrote:
| I had to write an expression parser recently and found the number
| of names for essentially the same algorithm very confusing.
| There's Pratt parsing, precedence climbing, and shunting yard.
| But really they're all the same algorithm.
|
| Pratt parsing and precedence climbing are the same except one
| merges left/right associativity precedence into a single
| precedence table (which makes way more sense). Shunting yard is
| the same as the others except it's iterative instead of
| recursive, and it seems like common implementations omit some
| syntax checking (but there's no reason you have to omit it).
|
| Here's what I ended up with:
|
| https://github.com/Timmmm/expr
|
| I had to write that because somewhat surprisingly I couldn't find
| an expression parser library that supports 64-bit integers and
| bools. Almost all of them only do floats. Also I needed it in C++
| - that library was a prototype that I later translated into C++
| (easier to write in Rust and then translate).
| secondcoming wrote:
| There is Boost.Spirit. We use it but I find its cleverness just
| results in incomprehensible code.
|
| https://www.boost.org/doc/libs/1_84_0/libs/spirit/classic/do...
| timhh wrote:
| Ah yeah I should have said "a library for which the
| integration is easier than writing my own library from
| scratch"!
| mgaunard wrote:
| There are dozens of parser generators in C++ that support
| arbitrary types.
| azhenley wrote:
| Eli Bendersky also has a nice post on Pratt parsing from 2010. I
| used both of these when implementing it for the first time.
|
| https://eli.thegreenplace.net/2010/01/02/top-down-operator-p...
| lioeters wrote:
| Some time ago I went down a fascinating rabbit hole on Pratt
| parsers while writing a math expression evaluator. In addition
| to the OP and above-linked article, I found the following
| helpful.
|
| - How Desmos Uses Pratt Parsers (2018) - TypeScript -
| https://engineering.desmos.com/articles/pratt-parser/ - Source:
| https://github.com/desmosinc/pratt-parser-blog-code
|
| - Simple But Powerful Pratt Parsing (2020) - Rust -
| https://matklad.github.io/2020/04/13/simple-but-powerful-pra...
| - Source: https://github.com/matklad/minipratt
| mgaunard wrote:
| Isn't this just the traditional way to parse infix operators with
| LL parsers, which are commonplace?
| vidarh wrote:
| It's commonplace, and one of a few very closely related
| algorithms (Pratt/TDOP, precedence climbing, shunting yard,
| etc.) that I guess you can call "traditional", but outside of
| people who regularly write parsers people are still often
| unaware of it.
|
| Not least, I suspect, because while the pattern is obvious once
| you know it, a lot of people will implement their parser from
| looking at a (E)BNF representation that doesn't make it
| explicit. E.g. this is a small subset of Oberon 07's grammar:
| expression = SimpleExpression [relation SimpleExpression].
| relation = "=" | "#" | "<" | "<=" | ">" | ">=" | IN | IS.
| SimpleExpression = ["+"|"-"] term {AddOperator term}.
| AddOperator = "+" | "-" | OR. term = factor
| {MulOperator factor}. MulOperator = "*" | "/" | DIV |
| MOD | "&" .
|
| When you know _any_ of the precedence parser techniques and how
| to map a grammar to it, it 's instantly obvious that this is
| modeling precedence rules, and it's trivial to turn it straight
| into a table, but if you've never written a parser before,
| isn't familiar with the theory, and just about figured out how
| to read EBNF, it's a lot easier to stumble your way to a
| working recursive descent parser than figuring out to
| generalize the grammar fragment above into reusable functions
| and a precedence table.
| chubot wrote:
| FWIW this survey links to ~10 articles on Pratt Parsing,
| including the featured article -
|
| https://www.oilshell.org/blog/2017/03/31.html - _Pratt Parsing
| Index and Updates_
|
| I wrote it several years ago and updated it recently (probably
| should give it a better title)
|
| The later articles are in TypeScript, Rust, Elm
| gsliepen wrote:
| I recently have implemented a Pratt parser for a language
| prototype. I initially looked at several parser generators, but
| while they are easier for writing down the grammar of your
| language, you need to do a lot of work to make them actually spit
| out working code that integrates well with your project. A Pratt
| parser can be implemented with just a few tens of lines of code
| in whatever language of your choice. You just have to mentally
| switch from thinking of your grammar in terms of production rules
| to thinking in terms of operators. For example, "if" becomes a
| prefix (nud) operator, and for function calls, the opening
| parenthesis after a function name becomes a binary (led)
| operator.
|
| One issue I have with the tutorials for writing Pratt parsers is
| that they usually only describe how to parse simple and very
| regular expressions. You can still use Pratt parsers to parse
| much more complicated programming languages, but this requires a
| bit more work and some modifications to the main parsing
| function.
| vidarh wrote:
| I have parsed a whole language using a precedence-based
| approach before - the first time I wrote a precedence-based
| parser.
|
| It very easily slips away from you and ends up a maddening mess
| of extra rules and flags (and I keep making that mistake, to be
| honest, because the conceptual simplicity of a precedence-based
| parser - be it Pratt or shunting-yard or variations - is
| seductive), unless you remember that you can easily "mode-
| shift" between different methods.
|
| E.g. "escaping" out from a Pratt parser to recursive descent
| (or anything else) for a sub-expression that is sufficiently
| messy to make it painful to model with precedence tables is
| fairly simple. It's worth keeping in mind (for myself too...)
| if your main parsing function and/or precedence table starts
| becoming too complex.
| kragen wrote:
| i feel like maybe this is messier if the parser you're trying
| to escape from is lalr or something else bottom-up?
| vidarh wrote:
| It really depends on how easy it is to determine the
| handover points. If the grammar gives you clear,
| unambiguous boundaries that don't require much context to
| determine when to switch, you should be able to switch
| reasonably easily almost no matter the grammar type.
|
| I agree it's easier to apply this top-down, though - you
| just call into the other parser and get whatever
| subexpression back, and you "just" need to have it treat
| tokens that can't continue an expression as the end of the
| stream.
|
| E.g. a trivial top down example would be any language where
| IF/ELSE is not an expression - you might parser IF
| (condition), then call into a Pratt parser, treat "ELSE" or
| "END" as terminating the expression, and just continue.
|
| Bottom-up you can still handle grammars that unambiguous
| fairly easily if the delimiting tokens aren't valid in
| other contexts, but I suspect you're right you'll get into
| trouble more easily if the grammars are more complex.
| kerkeslager wrote:
| I found this Rust tutorial[1] on Pratt parsers to be really easy
| to follow as well. I'm not a Rustacean but I didn't find not
| knowing Rust to be a barrier. I used that as a guide to write the
| parser for my experimental programming language Fur[2].
|
| However, I'll caution anyone writing their own programming
| languages to read some wisdom from someone who has written a
| production-quality programming language[3]. _Most_ programming
| language developers get bogged down in writing the parser and
| never even get into solving the real hard problems of language
| development, which are things like bytecode generation and
| garbage collection. The fact is, a naive recursive descent parser
| with some low lookahead number to handle left recursion will be
| performant and can be written in an afternoon for 99% of parsable
| languages. I 'm not sure what it is about parsers that makes them
| susceptible to yak shaving, but it seems to be a trap a lot of
| people fall into (including myself!).
|
| [1] https://matklad.github.io/2020/04/13/simple-but-powerful-
| pra...
|
| [2] https://github.com/kerkeslager/fur/blob/main/parser.c
|
| [3] https://craftinginterpreters.com/compiling-
| expressions.html#...
| GloomyBoots wrote:
| Maybe in this case it should be spelled yacc shaving.
| norir wrote:
| I don't know. This is becoming conventional wisdom but I think
| a reverse argument is equally true: aspiring language designers
| get too bogged down in bytecode generation or gc that they are
| unable to make a compelling language that adequately
| differentiates itself from existing languages that already do
| those things well.
|
| Are you actually claiming that handwriting a recursive descent
| parser for most popular languages can be done in afternoon?
| Unless the language is a lisp, I find that unlikely in most
| cases. You can write a parser generator that generates a
| recursive descent parser. Writing the grammar will almost
| surely be significantly faster in the parser generator dsl than
| a handrolled recursive descent parser.
|
| To improve your language design skills, you need to design
| multiple languages. Parser generators will get you a prototype
| faster and allow you to explore the language design space. You
| can target a high level language to do the hard parts. Once you
| are satisfied with a language, then maybe it makes sense to
| design your own custom bytecode or gc to solve its unique
| problems.
|
| I do agree though that optimizing parsing at an early stage
| makes no sense and many do get bogged down there.
| kerkeslager wrote:
| > Are you actually claiming that handwriting a recursive
| descent parser for most popular languages can be done in
| afternoon? Unless the language is a lisp, I find that
| unlikely in most cases.
|
| Yes. I've written enough parsers that I can say that with a
| lot of confidence, and I'm not sure where your doubt is
| coming from. The parser for Fur was written in about 6 hours
| and probably 4 hours of that was tracking down pathological
| semicolon elision edge cases (0/10, would not implement
| semicolon elision in future languages). I'm sure that Nystrom
| has written many more parsers than I have.
|
| The edge cases might be tricky to iron out for existing,
| mature languages, but when you're writing your own language
| I'd shy away from features which require crazy parser
| shenanigans like JavaScript RegEx literals or C++ templates,
| at least until you're far along enough to have justification
| for those kinds of features.
|
| This is also the sort of thing that I imagine ChatGPT, even
| the free version, could do in even less time.
|
| There's one caveat here, which is that error reporting _is_
| one pretty hard part of parsing. But if you understood that
| you wouldn 't be recommending parser generators; poor error
| reporting capabilities is exactly why many production-quality
| languages have hand-rolled their parsers.
|
| > Writing the grammar will almost surely be significantly
| faster in the parser generator dsl than a handrolled
| recursive descent parser.
|
| I'm not interested in hearing speculation on things I've
| benchmarked. When it comes to performance, benchmark or GTFO.
|
| Performance isn't significantly different in most cases, and
| in pathological worst cases, it's often easily solved by
| memoizing (i.e. converting the recursive descent parser into
| a Packrat Parser). You shouldn't even memoize everything in
| most cases, because a lot of sub-parsers won't trigger
| backtracking, so there's no need to waste memory on
| memoization.
|
| > To improve your language design skills, you need to design
| multiple languages. Parser generators will get you a
| prototype faster and allow you to explore the language design
| space.
|
| To design a language, I don't need to write a parser at all
| in most cases. I certainly don't need to prototype syntax to
| design it effectively. Writing programs in your target
| syntax, even if you can't actually run them, is far more
| effective.
|
| Related to the fact that parsers aren't the hard part, syntax
| isn't the _important_ part, so if you get bogged down in
| designing the syntax that 's also a lose. I genuinely think
| most novel syntaxes create more usability/entry barriers for
| users than the benefits they offer. Go with the syntax of an
| existing popular language if you at all can--the simpler/more
| popular, the better. Semantics, idioms, and ecosystem are all
| much more important in what makes a language useful.
|
| > You can target a high level language to do the hard parts.
|
| That's exactly why this approach is terrible. The hard parts
| are _the entire point of creating a new language_. If the
| hard parts are all implemented in a different language then
| people should just use that language instead of yours.
|
| > Once you are satisfied with a language, then maybe it makes
| sense to design your own custom bytecode or gc to solve its
| unique problems.
|
| Still no--I don't think you understood what I meant when I
| said "the real hard problems of language development, which
| are things like bytecode generation and garbage collection".
|
| If you're targeting an existing VM such as the JVM or CLR you
| won't need to implement custom bytecode or GC, and even if
| you decide to implement your own bytecode or GC, you may be
| able to use an off-the-shelf implementation. You'll note that
| my post said _bytecode generation_ , which is hard no matter
| what syntax you're implementing or bytecode you're targeting,
| because it's translating the _semantics_ of the syntax to
| semantics that the target architecture can understand. When I
| said implementing garbage collection was hard, what I meant
| was _not_ that implementing the algorithms is hard (you can
| literally plug in an off-the-shelf garbage collector in most
| cases) but that calling the garbage collection algorithms at
| the right times is hard.
| _a_a_a_ wrote:
| Quoting [3]:
|
| "I'm going to make a claim here that will be unpopular with
| some compiler and language people. It's OK if you don't agree.
| Personally, I learn more from strongly stated opinions that I
| disagree with than I do from several pages of qualifiers and
| equivocation. My claim is that parsing doesn't matter."
|
| agreed 100%
| naasking wrote:
| Parsing matters for tooling, and that's about it. The recent
| Go retrospective on this goes into this somewhat, but if you
| have a reusable parser then you or others can easily create
| linters, formatters, and static analysis tools for your
| language, which really fleshes out the ecosystem.
| kerkeslager wrote:
| I'd say that Nystrom's point in context is more "How you
| parse doesn't matter." He says, later in that link:
|
| "Recursive descent, Pratt parsing, and the popular parser
| generators like ANTLR or Bison are all fine."
|
| Any of these can be reusable parsers that you and/or others
| can easily use to create tools.
| lichtenberger wrote:
| The technique is also described in the awesome book "Crafting
| Interpreters" by Robert Nystrom.
| reichstein wrote:
| Which is probably not a coincidence, since the article was
| written by the same Robert Nystrom.
| pdpi wrote:
| I saw the title and it reminded me that I wanted to work
| through that book. Was thoroughly unsurprised when I clicked
| through and saw his face.
| o11c wrote:
| These are certainly useful, but there's the major caveat that the
| "easy" stops when you have anything but the simplest precedence
| rules. For example, if unary `NOT` has a weaker precedence than
| other unary operators, or if `a < b < c` is to be treated
| differently than `(a < b) < c`, or if you want to forbid bitwise
| operators mixing with relationals to avoid confusion from fact
| that they have a different precedence in C (for historical
| reasons) than in many other languages.
|
| Note also that Bison with precedence tags is more efficient than
| writing out the same grammar without precedence. And `bison
| --xml` lets you run the (trivial) machine yourself so you can do
| it in any language.
| ReleaseCandidat wrote:
| > anything but the simplest precedence rules. For example, if
| unary `NOT` has a weaker precedence than other unary operators
|
| Unary operators use precedence levels (binding powers) too,
| they are no different (regarding the existence of precedence
| levels) from infix functions. This is just a simplification of
| the precedence table in the linked article (which I haven't
| read).
|
| > or if `a < b < c` is to be treated differently than `(a < b)
| < c`
|
| If you want "<" to be right associative, that's exactly one of
| the things that's easy to do with a Pratt parser, change the
| binding power on one side for the "<".
|
| > if you want to forbid bitwise operators mixing with
| relationals
|
| I don't know how using a Pratt parser would interfere with
| throwing an error?
| orthoxerox wrote:
| > If you want "<" to be right associative, that's exactly one
| of the things that's easy to do with a Pratt parser, change
| the binding power on one side for the "<".
|
| It's not about associativity at all. "a < b < c" is a ternary
| operator that compares three operands.
| ReleaseCandidat wrote:
| That's what I would think too (as it doesn't make much
| sense to compare a boolean to c), but what OP wrote doesn't
| make much sense else: or if `a < b < c` is
| to be treated differently than `(a < b) < c`
| professoretc wrote:
| I met Vaughan Pratt at a math seminar a few years back, where he
| was talking about his current interest, Chu spaces. While we were
| waiting in line to get coffee, with the coffee stations labeled
| "regular" and "decaf", Pratt said, "But where's the context-free
| coffee?" and then laughed at his own joke. Later at lunch he was
| telling us about a physics experiment he set up in his pool shed;
| "My wife told me to do it in the pool shed so that if it explodes
| it won't destroy the house."
| dang wrote:
| Related:
|
| _Pratt Parsers: Expression Parsing Made Easy (2011)_ -
| https://news.ycombinator.com/item?id=32125386 - July 2022 (7
| comments)
|
| _Pratt Parsers: Expression Parsing Made Easy (2011)_ -
| https://news.ycombinator.com/item?id=16398830 - Feb 2018 (12
| comments)
|
| _Pratt Parsers: Expression Parsing Made Easy_ -
| https://news.ycombinator.com/item?id=2344837 - March 2011 (20
| comments)
___________________________________________________________________
(page generated 2024-01-20 23:00 UTC)