[HN Gopher] Pratt Parsers: Expression Parsing Made Easy (2011)
       ___________________________________________________________________
        
       Pratt Parsers: Expression Parsing Made Easy (2011)
        
       Author : memalign
       Score  : 127 points
       Date   : 2024-01-20 11:16 UTC (11 hours ago)
        
 (HTM) web link (journal.stuffwithstuff.com)
 (TXT) w3m dump (journal.stuffwithstuff.com)
        
       | zevv wrote:
       | Ha, nice to see this on HN: this article was pretty helpful to me
       | to understand the concept a few years back when extending my PEG
       | parsing library [1] with a Pratt parser; this mitigates the
       | problem of PEG parsers not allowing left recursion and allows for
       | a much more concise notation of grammars with operator
       | precedence. Thank you Bob:
       | 
       | 1. https://github.com/zevv/npeg#precedence-operators
        
         | vidarh wrote:
         | I love that way of embedding precedence tables without the
         | "classic" repetitive EBNF pattern. I will shamelessly steal
         | this for my next parser library. Also, your documentation is
         | outstanding.
         | 
         | (tiny typo - "Pratt" is spelled "Prett" in one sentence on your
         | page)
        
         | kragen wrote:
         | did you try warth's left-recursion hack for packrat parsers?
         | i'm curious what your experience was with it if so
        
       | throwup238 wrote:
       | A few years back I had to implement a new parser for a custom
       | expression language to replace an old parser that didn't give
       | very good errors and was hard to extend. I never did the
        | traditional CS track, so parsers were kind of mysterious black
        | magic; naturally, the first thing I did was search HN.
       | 
       | I stumbled on an older parsing thread [1] with a link to a toy
       | Pratt implementation [2] made by an HNer... and shamelessly
       | ripped it off to write my own production Pratt parser
       | implementation.
       | 
       | Great success! Writing a Pratt parser by hand is a lot easier
        | than I thought, and like the comment says, adding type information
       | was trivial.
       | 
       | [1] https://news.ycombinator.com/item?id=24480504
       | 
       | [2] https://github.com/mx-scissortail/WebBS
        
       | PartiallyTyped wrote:
       | This is always a nice read. I was going through Bob's interpreter
       | book and I wanted a more complete or interesting parser. This is
       | a very nice addition to that.
        
       | kevindamm wrote:
       | Pratt or TDOP parsers are my favorite design when coding the
        | parser directly in a host language, with parser combinators a
        | close second (though combinators can edge ahead depending on the
        | host language).
       | 
       | But when it comes to writing a larger parser with many production
       | rules, I'd rather use a parser generator based around an Earley
       | parser. You can split production rules and organize their post-
        | processing into an AST a lot more easily... but providing good
        | error messages is still a bit of a hassle.
       | 
       | I haven't found a parser generator that makes it painless to
       | provide good error messages. Writing the parser directly in the
       | host language has always been much easier in that regard.
        
         | nickpsecurity wrote:
         | sklogic told us here that his strategy was to use two parsers:
          | a fast, generated one by default, and a separate one for good
          | error handling. With that strategy, we might be able
         | to build some error handling in as part of the grammar. The
         | generator ignores it for one engine and uses it for another.
        
           | kevindamm wrote:
            | That's good advice... it's especially useful in an Earley
           | design because it is much faster when it doesn't have to keep
           | explicit back references for derivation, but those references
           | are needed for explaining the current context. Adding a
           | switch for keeping/discarding history (and a compiler pass
           | for whether or not to fold the epsilon-DFA into collapsed
           | states, "NNF" style) would automate the fast-vs-
           | slow|informative variants. Add in some debug-only output in
           | postprocessors and it wouldn't really need any syntactical
            | changes to the parser generator. I may try this soon...
        
         | danybittel wrote:
         | I've been working on a parser that parses while typing and
         | gives type and syntax errors as soon as possible but never any
         | false positives. I didn't expect it to be that difficult. I
         | still haven't really grokked it and any bugfix usually ends in
          | a game of whack-a-mole over integration tests. Any hints?
        
           | ReleaseCandidat wrote:
           | It depends on the language you're parsing, of course. But the
           | most important thing is to know when your parser is on the
            | right track again. The easiest such point (though not
            | necessarily the first possible one) is, most of the time, the
            | next definition statement. And make sure to _always_ consume
            | at least one token (even if you just add it to the error
            | expression in the AST), so there is no possibility of an
            | endless loop.
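            | 
            | A minimal Python-flavoured sketch of that recovery rule (the
            | token list and the "def" synchronization keyword are just
            | illustrative stand-ins):
            | 
            |     # Skip ahead to a likely synchronization point (here the
            |     # next "def" token), but always consume at least one
            |     # token so recovery can never loop forever.
            |     def recover(tokens, pos, error_tokens):
            |         error_tokens.append(tokens[pos]); pos += 1
            |         while pos < len(tokens) and tokens[pos] != "def":
            |             error_tokens.append(tokens[pos]); pos += 1
            |         return pos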
        
           | ivanjermakov wrote:
           | I can recommend this article:
           | https://matklad.github.io/2023/05/21/resilient-ll-parsing-
           | tu...
        
           | orthoxerox wrote:
           | Roslyn (C# compiler) has been designed to quickly parse
           | invalid code while still providing intellisense.
        
           | ww520 wrote:
           | One way I found helpful in giving precise parsing errors
           | without causing excessive false errors is to have the parser
            | auto-complete the missing tokens with placeholder error
            | nodes. E.g.
            | 
            |     if (cond1 And ) { ... }
           | 
           | The And operator is missing its right operand, but everything
           | else is valid.
           | 
           | The parser knows the And operator needs two arms. It can
           | complete the missing arm with an error node.
            | 
            |     fn parseBiCondition() {
            |         let leftNode = parseBoolExpr()
            |         if (lookaheadToken() is And)
            |             return parseAnd(leftNode)
            |         if (lookaheadToken() is Or)
            |             ...
            |     }
            | 
            |     fn parseAnd(leftNode) {
            |         consumeToken()
            |         let rightNode = parseBoolExpr()
            |         if (rightNode != null)
            |             return AndNode(leftNode, rightNode)
            |         else
            |             return AndNode(leftNode,
            |                 ErrorNode("missing operand", location))
            |     }
           | 
           | In this way, the conditional expression is parsed to
           | completion without interruption. Multiple errors can be
           | handled with completion and recorded. At the end it's a
           | matter of traversing the AST to print out all the ErrorNodes.
        
         | ReleaseCandidat wrote:
         | > I haven't found a parser generator that makes it painless to
         | provide good error messages.
         | 
          | This. And letting the user add their own infix functions with
         | configurable precedence and associativity is easy using Pratt
         | too.
        
       | timhh wrote:
       | I had to write an expression parser recently and found the number
       | of names for essentially the same algorithm very confusing.
       | There's Pratt parsing, precedence climbing, and shunting yard.
       | But really they're all the same algorithm.
       | 
       | Pratt parsing and precedence climbing are the same except one
       | merges left/right associativity precedence into a single
       | precedence table (which makes way more sense). Shunting yard is
       | the same as the others except it's iterative instead of
       | recursive, and it seems like common implementations omit some
       | syntax checking (but there's no reason you have to omit it).
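        | 
        | To illustrate, here is a rough Python sketch of the shared core
        | (toy binding powers, pre-split tokens; this is not the code from
        | the repo below):
        | 
        |     # operator -> (left, right) binding power; a right-assoc
        |     # operator like "^" gets a lower right power than left
        |     BINDING = {"+": (1, 2), "-": (1, 2),
        |                "*": (3, 4), "/": (3, 4), "^": (6, 5)}
        | 
        |     def parse_expr(tokens, min_bp=0):
        |         lhs = tokens.pop(0)       # assume number/identifier
        |         while tokens and tokens[0] in BINDING:
        |             lbp, rbp = BINDING[tokens[0]]
        |             if lbp < min_bp:      # binds too weakly: hand back
        |                 break
        |             op = tokens.pop(0)
        |             # the recursion is the "climbing"; shunting yard
        |             # keeps an explicit operator stack instead
        |             lhs = (op, lhs, parse_expr(tokens, rbp))
        |         return lhs
        | 
        |     # parse_expr("1 + 2 * 3".split())
        |     #   -> ('+', '1', ('*', '2', '3'))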
       | 
       | Here's what I ended up with:
       | 
       | https://github.com/Timmmm/expr
       | 
       | I had to write that because somewhat surprisingly I couldn't find
       | an expression parser library that supports 64-bit integers and
       | bools. Almost all of them only do floats. Also I needed it in C++
       | - that library was a prototype that I later translated into C++
       | (easier to write in Rust and then translate).
        
         | secondcoming wrote:
         | There is Boost.Spirit. We use it but I find its cleverness just
         | results in incomprehensible code.
         | 
         | https://www.boost.org/doc/libs/1_84_0/libs/spirit/classic/do...
        
           | timhh wrote:
           | Ah yeah I should have said "a library for which the
           | integration is easier than writing my own library from
           | scratch"!
        
         | mgaunard wrote:
         | There are dozens of parser generators in C++ that support
         | arbitrary types.
        
       | azhenley wrote:
       | Eli Bendersky also has a nice post on Pratt parsing from 2010. I
       | used both of these when implementing it for the first time.
       | 
       | https://eli.thegreenplace.net/2010/01/02/top-down-operator-p...
        
         | lioeters wrote:
         | Some time ago I went down a fascinating rabbit hole on Pratt
         | parsers while writing a math expression evaluator. In addition
         | to the OP and above-linked article, I found the following
         | helpful.
         | 
         | - How Desmos Uses Pratt Parsers (2018) - TypeScript -
         | https://engineering.desmos.com/articles/pratt-parser/ - Source:
         | https://github.com/desmosinc/pratt-parser-blog-code
         | 
         | - Simple But Powerful Pratt Parsing (2020) - Rust -
         | https://matklad.github.io/2020/04/13/simple-but-powerful-pra...
         | - Source: https://github.com/matklad/minipratt
        
       | mgaunard wrote:
       | Isn't this just the traditional way to parse infix operators with
       | LL parsers, which are commonplace?
        
         | vidarh wrote:
         | It's commonplace, and one of a few very closely related
         | algorithms (Pratt/TDOP, precedence climbing, shunting yard,
          | etc.) that I guess you can call "traditional", but outside of
          | the circle of people who regularly write parsers it's still
          | often not well known.
         | 
         | Not least, I suspect, because while the pattern is obvious once
         | you know it, a lot of people will implement their parser from
         | looking at a (E)BNF representation that doesn't make it
         | explicit. E.g. this is a small subset of Oberon 07's grammar:
          | 
          |     expression = SimpleExpression [relation SimpleExpression].
          |     relation = "=" | "#" | "<" | "<=" | ">" | ">=" | IN | IS.
          |     SimpleExpression = ["+"|"-"] term {AddOperator term}.
          |     AddOperator = "+" | "-" | OR.
          |     term = factor {MulOperator factor}.
          |     MulOperator = "*" | "/" | DIV | MOD | "&".
         | 
         | When you know _any_ of the precedence parser techniques and how
          | to map a grammar to it, it's instantly obvious that this is
          | modeling precedence rules, and it's trivial to turn it straight
          | into a table. But if you've never written a parser before,
          | aren't familiar with the theory, and have only just figured out
          | how to read EBNF, it's a lot easier to stumble your way to a
          | working recursive descent parser than to figure out how to
          | generalize the grammar fragment above into reusable functions
          | and a precedence table.
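          | 
          | For example, those nested rules collapse into a table roughly
          | like this (an illustrative Python fragment, not taken from any
          | real Oberon compiler):
          | 
          |     # one tier per EBNF level: relations bind weakest,
          |     # AddOperator next, MulOperator tightest
          |     PRECEDENCE = {
          |         "=": 1, "#": 1, "<": 1, "<=": 1,
          |         ">": 1, ">=": 1, "IN": 1, "IS": 1,
          |         "+": 2, "-": 2, "OR": 2,
          |         "*": 3, "/": 3, "DIV": 3, "MOD": 3, "&": 3,
          |     }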
        
       | chubot wrote:
       | FWIW this survey links to ~10 articles on Pratt Parsing,
       | including the featured article -
       | 
       | https://www.oilshell.org/blog/2017/03/31.html - _Pratt Parsing
       | Index and Updates_
       | 
       | I wrote it several years ago and updated it recently (probably
       | should give it a better title)
       | 
       | The later articles are in TypeScript, Rust, Elm
        
       | gsliepen wrote:
        | I recently implemented a Pratt parser for a language
       | prototype. I initially looked at several parser generators, but
       | while they are easier for writing down the grammar of your
       | language, you need to do a lot of work to make them actually spit
       | out working code that integrates well with your project. A Pratt
       | parser can be implemented with just a few tens of lines of code
        | in whatever language you choose. You just have to mentally
       | switch from thinking of your grammar in terms of production rules
       | to thinking in terms of operators. For example, "if" becomes a
       | prefix (nud) operator, and for function calls, the opening
       | parenthesis after a function name becomes a binary (led)
       | operator.
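        | 
        | As a rough illustration (a toy Python sketch, not my actual
        | prototype), the nud/led dispatch looks something like this:
        | 
        |     NUD = {}   # prefix: token -> fn(tokens) -> node
        |     LED = {}   # infix: token -> (power, fn(tokens, left))
        | 
        |     def expr(tokens, rbp=0):
        |         tok = tokens.pop(0)
        |         left = NUD[tok](tokens) if tok in NUD else tok
        |         while (tokens and tokens[0] in LED
        |                and LED[tokens[0]][0] > rbp):
        |             _power, led = LED[tokens.pop(0)]
        |             left = led(tokens, left)
        |         return left
        | 
        |     def nud_if(tokens):
        |         # "if" as a prefix operator: if <c> then <a> else <b>
        |         cond = expr(tokens)
        |         assert tokens.pop(0) == "then"
        |         a = expr(tokens)
        |         assert tokens.pop(0) == "else"
        |         return ("if", cond, a, expr(tokens))
        | 
        |     def led_call(tokens, callee):
        |         # "(" after an expression acts as a binary call operator
        |         args = []
        |         while tokens[0] != ")":
        |             args.append(expr(tokens))
        |             if tokens[0] == ",":
        |                 tokens.pop(0)
        |         tokens.pop(0)   # drop ")"
        |         return ("call", callee, args)
        | 
        |     NUD["if"] = nud_if
        |     LED["("] = (100, led_call)   # calls bind tightly
        | 
        |     # expr("if x then f ( y ) else 0".split())
        |     #   -> ('if', 'x', ('call', 'f', ['y']), '0')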
       | 
       | One issue I have with the tutorials for writing Pratt parsers is
       | that they usually only describe how to parse simple and very
       | regular expressions. You can still use Pratt parsers to parse
       | much more complicated programming languages, but this requires a
       | bit more work and some modifications to the main parsing
       | function.
        
         | vidarh wrote:
         | I have parsed a whole language using a precedence-based
         | approach before - the first time I wrote a precedence-based
         | parser.
         | 
         | It very easily slips away from you and ends up a maddening mess
         | of extra rules and flags (and I keep making that mistake, to be
         | honest, because the conceptual simplicity of a precedence-based
         | parser - be it Pratt or shunting-yard or variations - is
         | seductive), unless you remember that you can easily "mode-
         | shift" between different methods.
         | 
         | E.g. "escaping" out from a Pratt parser to recursive descent
         | (or anything else) for a sub-expression that is sufficiently
         | messy to make it painful to model with precedence tables is
         | fairly simple. It's worth keeping in mind (for myself too...)
         | if your main parsing function and/or precedence table starts
         | becoming too complex.
        
           | kragen wrote:
           | i feel like maybe this is messier if the parser you're trying
           | to escape from is lalr or something else bottom-up?
        
             | vidarh wrote:
             | It really depends on how easy it is to determine the
             | handover points. If the grammar gives you clear,
             | unambiguous boundaries that don't require much context to
             | determine when to switch, you should be able to switch
             | reasonably easily almost no matter the grammar type.
             | 
             | I agree it's easier to apply this top-down, though - you
             | just call into the other parser and get whatever
             | subexpression back, and you "just" need to have it treat
             | tokens that can't continue an expression as the end of the
             | stream.
             | 
             | E.g. a trivial top down example would be any language where
              | IF/ELSE is not an expression - you might parse IF
             | (condition), then call into a Pratt parser, treat "ELSE" or
             | "END" as terminating the expression, and just continue.
             | 
              | Bottom-up you can still handle unambiguous grammars fairly
              | easily if the delimiting tokens aren't valid in
             | other contexts, but I suspect you're right you'll get into
             | trouble more easily if the grammars are more complex.
        
       | kerkeslager wrote:
       | I found this Rust tutorial[1] on Pratt parsers to be really easy
       | to follow as well. I'm not a Rustacean but I didn't find not
       | knowing Rust to be a barrier. I used that as a guide to write the
       | parser for my experimental programming language Fur[2].
       | 
       | However, I'll caution anyone writing their own programming
       | languages to read some wisdom from someone who has written a
       | production-quality programming language[3]. _Most_ programming
       | language developers get bogged down in writing the parser and
       | never even get into solving the real hard problems of language
       | development, which are things like bytecode generation and
       | garbage collection. The fact is, a naive recursive descent parser
       | with some low lookahead number to handle left recursion will be
       | performant and can be written in an afternoon for 99% of parsable
        | languages. I'm not sure what it is about parsers that makes them
       | susceptible to yak shaving, but it seems to be a trap a lot of
       | people fall into (including myself!).
       | 
       | [1] https://matklad.github.io/2020/04/13/simple-but-powerful-
       | pra...
       | 
       | [2] https://github.com/kerkeslager/fur/blob/main/parser.c
       | 
       | [3] https://craftinginterpreters.com/compiling-
       | expressions.html#...
        
         | GloomyBoots wrote:
         | Maybe in this case it should be spelled yacc shaving.
        
         | norir wrote:
         | I don't know. This is becoming conventional wisdom but I think
         | a reverse argument is equally true: aspiring language designers
          | get so bogged down in bytecode generation or gc that they are
         | unable to make a compelling language that adequately
         | differentiates itself from existing languages that already do
         | those things well.
         | 
         | Are you actually claiming that handwriting a recursive descent
          | parser for most popular languages can be done in an afternoon?
         | Unless the language is a lisp, I find that unlikely in most
         | cases. You can write a parser generator that generates a
         | recursive descent parser. Writing the grammar will almost
         | surely be significantly faster in the parser generator dsl than
         | a handrolled recursive descent parser.
         | 
         | To improve your language design skills, you need to design
         | multiple languages. Parser generators will get you a prototype
         | faster and allow you to explore the language design space. You
         | can target a high level language to do the hard parts. Once you
         | are satisfied with a language, then maybe it makes sense to
         | design your own custom bytecode or gc to solve its unique
         | problems.
         | 
         | I do agree though that optimizing parsing at an early stage
         | makes no sense and many do get bogged down there.
        
           | kerkeslager wrote:
           | > Are you actually claiming that handwriting a recursive
           | descent parser for most popular languages can be done in
           | afternoon? Unless the language is a lisp, I find that
           | unlikely in most cases.
           | 
           | Yes. I've written enough parsers that I can say that with a
           | lot of confidence, and I'm not sure where your doubt is
           | coming from. The parser for Fur was written in about 6 hours
           | and probably 4 hours of that was tracking down pathological
           | semicolon elision edge cases (0/10, would not implement
           | semicolon elision in future languages). I'm sure that Nystrom
           | has written many more parsers than I have.
           | 
           | The edge cases might be tricky to iron out for existing,
           | mature languages, but when you're writing your own language
           | I'd shy away from features which require crazy parser
           | shenanigans like JavaScript RegEx literals or C++ templates,
           | at least until you're far along enough to have justification
           | for those kinds of features.
           | 
           | This is also the sort of thing that I imagine ChatGPT, even
           | the free version, could do in even less time.
           | 
           | There's one caveat here, which is that error reporting _is_
           | one pretty hard part of parsing. But if you understood that
            | you wouldn't be recommending parser generators; poor error
            | reporting capabilities are exactly why many production-quality
           | languages have hand-rolled their parsers.
           | 
           | > Writing the grammar will almost surely be significantly
           | faster in the parser generator dsl than a handrolled
           | recursive descent parser.
           | 
           | I'm not interested in hearing speculation on things I've
           | benchmarked. When it comes to performance, benchmark or GTFO.
           | 
           | Performance isn't significantly different in most cases, and
           | in pathological worst cases, it's often easily solved by
           | memoizing (i.e. converting the recursive descent parser into
           | a Packrat Parser). You shouldn't even memoize everything in
           | most cases, because a lot of sub-parsers won't trigger
           | backtracking, so there's no need to waste memory on
           | memoization.
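            | 
            | A minimal sketch of that kind of memoization (Python, purely
            | illustrative; it assumes each rule takes a token position and
            | returns a (node, next_pos) result or None on failure):
            | 
            |     import functools
            | 
            |     # Packrat-style memoization keyed on input position.
            |     # Only wrap the rules that can actually trigger
            |     # backtracking; the rest don't need the memory. One
            |     # cache per rule, valid for a single input.
            |     def packrat(rule):
            |         cache = {}
            |         @functools.wraps(rule)
            |         def wrapper(tokens, pos):
            |             if pos not in cache:
            |                 cache[pos] = rule(tokens, pos)
            |             return cache[pos]
            |         return wrapper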
           | 
           | > To improve your language design skills, you need to design
           | multiple languages. Parser generators will get you a
           | prototype faster and allow you to explore the language design
           | space.
           | 
           | To design a language, I don't need to write a parser at all
           | in most cases. I certainly don't need to prototype syntax to
           | design it effectively. Writing programs in your target
           | syntax, even if you can't actually run them, is far more
           | effective.
           | 
           | Related to the fact that parsers aren't the hard part, syntax
           | isn't the _important_ part, so if you get bogged down in
            | designing the syntax, that's also a loss. I genuinely think
           | most novel syntaxes create more usability/entry barriers for
           | users than the benefits they offer. Go with the syntax of an
           | existing popular language if you at all can--the simpler/more
           | popular, the better. Semantics, idioms, and ecosystem are all
           | much more important in what makes a language useful.
           | 
           | > You can target a high level language to do the hard parts.
           | 
           | That's exactly why this approach is terrible. The hard parts
           | are _the entire point of creating a new language_. If the
           | hard parts are all implemented in a different language then
           | people should just use that language instead of yours.
           | 
           | > Once you are satisfied with a language, then maybe it makes
           | sense to design your own custom bytecode or gc to solve its
           | unique problems.
           | 
           | Still no--I don't think you understood what I meant when I
           | said "the real hard problems of language development, which
           | are things like bytecode generation and garbage collection".
           | 
           | If you're targeting an existing VM such as the JVM or CLR you
           | won't need to implement custom bytecode or GC, and even if
           | you decide to implement your own bytecode or GC, you may be
           | able to use an off-the-shelf implementation. You'll note that
            | my post said _bytecode generation_, which is hard no matter
           | what syntax you're implementing or bytecode you're targeting,
           | because it's translating the _semantics_ of the syntax to
           | semantics that the target architecture can understand. When I
           | said implementing garbage collection was hard, what I meant
           | was _not_ that implementing the algorithms is hard (you can
           | literally plug in an off-the-shelf garbage collector in most
           | cases) but that calling the garbage collection algorithms at
           | the right times is hard.
        
         | _a_a_a_ wrote:
         | Quoting [3]:
         | 
         | "I'm going to make a claim here that will be unpopular with
         | some compiler and language people. It's OK if you don't agree.
         | Personally, I learn more from strongly stated opinions that I
         | disagree with than I do from several pages of qualifiers and
         | equivocation. My claim is that parsing doesn't matter."
         | 
         | agreed 100%
        
           | naasking wrote:
           | Parsing matters for tooling, and that's about it. The recent
            | Go retrospective goes into this somewhat, but if you
           | have a reusable parser then you or others can easily create
           | linters, formatters, and static analysis tools for your
           | language, which really fleshes out the ecosystem.
        
             | kerkeslager wrote:
             | I'd say that Nystrom's point in context is more "How you
             | parse doesn't matter." He says, later in that link:
             | 
             | "Recursive descent, Pratt parsing, and the popular parser
             | generators like ANTLR or Bison are all fine."
             | 
             | Any of these can be reusable parsers that you and/or others
             | can easily use to create tools.
        
       | lichtenberger wrote:
       | The technique is also described in the awesome book "Crafting
       | Interpreters" by Robert Nystrom.
        
         | reichstein wrote:
         | Which is probably not a coincidence, since the article was
         | written by the same Robert Nystrom.
        
           | pdpi wrote:
           | I saw the title and it reminded me that I wanted to work
           | through that book. Was thoroughly unsurprised when I clicked
           | through and saw his face.
        
       | o11c wrote:
       | These are certainly useful, but there's the major caveat that the
       | "easy" stops when you have anything but the simplest precedence
       | rules. For example, if unary `NOT` has a weaker precedence than
       | other unary operators, or if `a < b < c` is to be treated
       | differently than `(a < b) < c`, or if you want to forbid bitwise
        | operators mixing with relationals to avoid confusion from the fact
       | that they have a different precedence in C (for historical
       | reasons) than in many other languages.
       | 
       | Note also that Bison with precedence tags is more efficient than
       | writing out the same grammar without precedence. And `bison
       | --xml` lets you run the (trivial) machine yourself so you can do
       | it in any language.
        
         | ReleaseCandidat wrote:
         | > anything but the simplest precedence rules. For example, if
         | unary `NOT` has a weaker precedence than other unary operators
         | 
          | Unary operators use precedence levels (binding powers) too;
         | they are no different (regarding the existence of precedence
         | levels) from infix functions. This is just a simplification of
         | the precedence table in the linked article (which I haven't
         | read).
         | 
         | > or if `a < b < c` is to be treated differently than `(a < b)
         | < c`
         | 
         | If you want "<" to be right associative, that's exactly one of
         | the things that's easy to do with a Pratt parser, change the
         | binding power on one side for the "<".
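          | 
          | For example, with (left, right) binding-power pairs and a loop
          | that recurses using the right-hand power (numbers purely
          | illustrative):
          | 
          |     LT_LEFT_ASSOC  = (5, 6)   # a < b < c  ->  (a < b) < c
          |     LT_RIGHT_ASSOC = (6, 5)   # a < b < c  ->  a < (b < c)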
         | 
         | > if you want to forbid bitwise operators mixing with
         | relationals
         | 
         | I don't know how using a Pratt parser would interfere with
         | throwing an error?
        
           | orthoxerox wrote:
           | > If you want "<" to be right associative, that's exactly one
           | of the things that's easy to do with a Pratt parser, change
           | the binding power on one side for the "<".
           | 
           | It's not about associativity at all. "a < b < c" is a ternary
           | operator that compares three operands.
        
             | ReleaseCandidat wrote:
             | That's what I would think too (as it doesn't make much
             | sense to compare a boolean to c), but what OP wrote doesn't
              | make much sense otherwise:
              | 
              |     or if `a < b < c` is to be treated differently than
              |     `(a < b) < c`
        
       | professoretc wrote:
       | I met Vaughan Pratt at a math seminar a few years back, where he
       | was talking about his current interest, Chu spaces. While we were
       | waiting in line to get coffee, with the coffee stations labeled
       | "regular" and "decaf", Pratt said, "But where's the context-free
       | coffee?" and then laughed at his own joke. Later at lunch he was
       | telling us about a physics experiment he set up in his pool shed;
       | "My wife told me to do it in the pool shed so that if it explodes
       | it won't destroy the house."
        
       | dang wrote:
       | Related:
       | 
       |  _Pratt Parsers: Expression Parsing Made Easy (2011)_ -
       | https://news.ycombinator.com/item?id=32125386 - July 2022 (7
       | comments)
       | 
       |  _Pratt Parsers: Expression Parsing Made Easy (2011)_ -
       | https://news.ycombinator.com/item?id=16398830 - Feb 2018 (12
       | comments)
       | 
       |  _Pratt Parsers: Expression Parsing Made Easy_ -
       | https://news.ycombinator.com/item?id=2344837 - March 2011 (20
       | comments)
        
       ___________________________________________________________________
       (page generated 2024-01-20 23:00 UTC)