[HN Gopher] Why we need to know LR and recursive descent parsing...
       ___________________________________________________________________
        
       Why we need to know LR and recursive descent parsing techniques
        
       Author : todsacerdoti
       Score  : 162 points
       Date   : 2023-01-17 09:51 UTC (13 hours ago)
        
 (HTM) web link (tratt.net)
 (TXT) w3m dump (tratt.net)
        
       | vidarh wrote:
        | I agree strongly with the point that making parsing a major
        | component of teaching compilers is misguided (as anyone who has
        | read my compiler series would know), not because learning parsing
        | is unnecessary but because depth on everything else is
        | sacrificed in favour of what is one of the least interesting
        | parts of the challenge of writing a new language.
       | 
        | There's more to be done to improve parsing, but it's still one of
        | the best understood aspects of writing a compiler.
       | 
       | That said, I do agree with the article that it's a problem that
       | many syntaxes have complex parser requirements. No grammar I know
        | of _requires_ recursive descent, but many _require_ violations
        | of layering etc. that tooling doesn't cater for, which strongly
        | favours manual parsers.
       | 
       | Now, I prefer manually written parsers because I think (having
       | tried and failed to write better options many times) that current
       | parser generation tools are rarely suitable for writing
       | production parsers.
       | 
        | That said, I wish people _wrote_ their grammars for a tool as a
        | validation exercise, to give an incentive to design grammars that
        | are more regular, simpler, and easier to parse.
       | 
       | As an example, I love Ruby. Having worked on my Ruby compiler
       | project (though it's been in hibernation for a few years), I
        | utterly hate Ruby's grammar. I don't hate the _result_, for the
        | most part, but that's mostly because most of us avoid the more
        | insane things Ruby allows [1].
       | 
        | And while some of the complexity is part of creating what I love
        | about Ruby, the grammar is full of horrible quirks (again [1]), at
       | least some of which might have been reduced if one of the "test
       | cases" of the design process for Ruby was to maintain and update
       | a compact language description for a parser generator tool. Even
       | if that'd not be the production grammar for MRI, using it both to
       | maintain a sane compact grammar description _and_ to validate
       | other parsers against would be great.
       | 
        | I pick on Ruby because I love it, wish it were cleaner, and find
        | it a particularly bad offender in this respect; but many other
        | languages have occasional warts that they likely wouldn't have if
        | a formal specification of the grammar had made them more obvious.
       | 
        | [1] E.g. 'p(% % )' is valid (it prints "%", because '%' without
        | a left-hand operand reads the following character as the start
        | of a quoted string ending with that same character). 'p % % '
        | is not, because the "p" can be parsed as an operand, so the
        | _first_ '%' is interpreted as an infix operator, while the
        | _second_ '%' is interpreted as the start of a quoted string.
        | Fun times. Thankfully I've never seen anyone abuse this in real
        | code.
        
       | olliej wrote:
       | Ehn, the problem with LR parsing is that it's terribly slow and
       | can only practically be done via generators, and I have yet to
        | encounter a parser generator that can actually handle error
        | correction properly (necessary unless you want the first error to
        | break everything after it).
       | 
        | The theoretical argument for LR parsing is that it is more
        | powerful than LL, LALR, etc, but that only holds because PLT
        | seems hell-bent on ignoring LL* due to the arbitrary restrictions
        | placed on what a "real" BNF grammar is allowed to do.
       | 
        | Knowing about LR, and PDA/state machine based parsing generally,
        | is useful, but pretending that LR parsers are somehow better than
        | recursive descent is nonsense.
        
       | recursivedoubts wrote:
       | _> unfortunately many big brain school teach only parser
       | generator tool. here grug usual love of tool is not: parser
       | generator tool generate code of awful snakes nest: impossible
       | understand, bottom up, what? hide recursive nature of grammar
       | from grug and debug impossible, very bad according grug! _
       | 
       | https://grugbrain.dev/#grug-on-parsing
       | 
       | Bob Nystrom has done us all a tremendous service by writing a
       | wonderful book (easily top 5 programming books available) on
       | recursive descent:
       | 
       | https://craftinginterpreters.com/
        
       | DannyBee wrote:
       | IMHO the majority of engineers not understanding lexing/parsing
       | beyond generators has held back good engineering of incremental
       | parsers/lexers for a long time.
       | 
       | Things like tree-sitter took forever to come along, but the
       | research projects that built the algorithms they rely on are a
        | quarter century old
        | (https://www2.eecs.berkeley.edu/Pubs/TechRpts/1997/5885.html).
       | 
        | Meanwhile, incremental lexing and incremental recursive descent
        | and LL parsing are actually not that hard (incremental LR is
        | trickier).
       | 
        | It's just that few people understand how any of it works well
        | enough to implement it.
       | 
        | This is not an "it was not needed" issue. Lexing times of files
        | for syntax highlighting are still "seconds" for larger files, for
        | example.
       | 
        | Meanwhile, I can explain near-optimal incremental lexing to
       | someone who knows how lexers work in a few bullet points, and
       | it's trivial to implement:
       | 
       | 1. As your lexer looks at characters, keep track of the maximum
       | lookahead and lookbehind used during a lex.
       | 
        | 2. When some range of the string to lex is invalidated, adjust
        | the invalidation bounds to include the maximum
        | lookahead/lookbehind as well.
       | 
       | 3. Relex the invalidated portion of the string.
       | 
        | That's it. For optimality, compute the maximum
        | lookahead/lookbehind on a per-token basis instead of on a per-lex
        | basis (this is trivial if your editor is already storing the
        | tokens under the text somewhere).
       | 
       | If you store position info in your tokens, you can easily adjust
       | as well.
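        | 
        | For illustration, a minimal Python sketch of the scheme above
        | (the Token fields and the lex() helper are hypothetical, not
        | taken from any particular editor or tool):
        | 
        |     import re
        |     from dataclasses import dataclass
        | 
        |     @dataclass
        |     class Token:
        |         kind: str
        |         start: int           # offset of first character
        |         end: int             # offset one past the last character
        |         max_lookahead: int   # chars peeked past `end` while lexing
        |         max_lookbehind: int  # chars inspected before `start`
        | 
        |     def lex(text, start, stop):
        |         # Trivial batch lexer for the sketch: words, runs of
        |         # spaces, or single chars; peeks at most 1 ahead, 0 behind.
        |         out, pos = [], start
        |         while pos < stop:
        |             m = re.match(r"\w+|\s+|.", text[pos:], re.DOTALL)
        |             out.append(Token("tok", pos, pos + m.end(), 1, 0))
        |             pos += m.end()
        |         return out
        | 
        |     def relex(tokens, text, edit_lo, edit_hi):
        |         # Step 2: a token survives only if the edited range lies
        |         # entirely outside the characters it ever inspected.
        |         before = [t for t in tokens
        |                   if t.end + t.max_lookahead <= edit_lo]
        |         after = [t for t in tokens
        |                  if t.start - t.max_lookbehind >= edit_hi]
        |         # Step 3: relex only the gap between the survivors.
        |         # (Shifting `after` positions for the edit's size change
        |         # is omitted; see the position-info point above.)
        |         lo = before[-1].end if before else 0
        |         hi = after[0].start if after else len(text)
        |         return before + lex(text, lo, hi) + after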
       | 
        | For incremental recursive descent parsing, it's similar: only
        | rules that ever looked at tokens that changed can be affected by
        | them. You can easily keep the interval of tokens that a given
        | rule looked at. (If you want it truly optimal, you can store the
        | set of intervals looked at, but this is almost always overkill.)
       | 
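        | A rough sketch of that interval bookkeeping in Python (names are
        | hypothetical):
        | 
        |     class Node:
        |         def __init__(self, kind, children, first_tok, last_tok):
        |             self.kind = kind
        |             self.children = children
        |             # inclusive indices of the tokens this rule looked at
        |             self.first_tok = first_tok
        |             self.last_tok = last_tok
        | 
        |     def needs_reparse(node, changed_lo, changed_hi):
        |         # A subtree can only be affected if its token interval
        |         # overlaps the changed token range; otherwise the whole
        |         # subtree can be reused as-is.
        |         return not (node.last_tok < changed_lo
        |                     or node.first_tok > changed_hi)
        | 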
       | If all you ever have is a generator, you will never implement any
       | of the above until your generator does.
       | 
       | I have watched tons of projects spend lots of time optimizing
       | grammars, doing weird caching, etc, when a few simple/better
       | algorithms would have saved them from all of it.
       | 
        | (Which is why, in that sense, tree-sitter has been a revolution.)
        
         | brhsagain wrote:
         | This is extremely interesting to me. I looked at the papers
         | tree-sitter referenced but they were all written assuming an
         | LR/generator-based parser and I couldn't figure out how to
         | adapt it to recursive descent.
         | 
          | Let's say I've parsed an AST
          | 
          |     binary_expression
          |       lhs: number
          |       rhs: binary_expression
          |         lhs: number
          |         rhs: call_expression
          |           func: identifier
          |           args: call_argument_list
          |             [0]: number
          |             [1]: number
         | 
         | and I've detected that `rhs` on line 3 has changed. So I go
         | back to the start of the node. What do I do then? How do I know
         | I need to parse an expression there and that it belongs to the
         | rhs of a binary expression?
         | 
         | In the literature, this is solved by storing the parse state.
         | How do you do that with recursive descent?
         | 
         | Is there something you'd recommend reading to understand how to
         | do incremental parsing for recursive descent specifically?
        
         | bsder wrote:
         | > IMHO the majority of engineers not understanding
         | lexing/parsing beyond generators has held back good engineering
         | of incremental parsers/lexers for a long time.
         | 
         | That's a touch unfair. You could easily extend that statement
         | to "The predominant languages of C and C++ have held back good
         | engineering of parsers/lexers." _period_. This is due to the
         | fact that you effectively have to parse the universe to resolve
         | the symbol table in both languages.
         | 
         | In addition, people didn't need incremental parsing until
         | syntax highlighting became widespread via LSPs. Before that,
         | you were always working at the compile level, so why put in all
         | the work for an incremental parser you were never going to make
         | use of?
         | 
          | People have recently been designing the _languages themselves_
          | to better support parsing/lexing paradigms that can be
          | incremental and restartable.
         | 
          | This is why most modern languages have converged on having
          | specific syntax to delineate the difference between variable
         | name and variable type, for example.
         | 
         | Rust, for example, has fairly funky things around "unary minus"
         | because of its design for parsing and lexing.
        
       | le-mark wrote:
       | I had one job where my ability to write recursive descent parsers
       | made me a "100x ninja rockstar" because they had a lot of parsing
        | problems that no one recognized as such. I expressly did not use
        | a parser generator because of the complexity it would add to the
        | build, the rest of the team's unfamiliarity with one, and because
        | I considered them overkill for the domain.
        
       | chubot wrote:
       | I agree with 90%-95% of the article. I would phrase it more
       | pithily as:
       | 
       | 1. Use recursive descent for recognizing existing languages,
       | where you have a good test corpus. (And very good point about
       | testing INVALID inputs too!)
       | 
       | 2. Use grammars to design new languages
       | 
       | I follow this with https://www.oilshell.org/ -- the compatible
       | OSH is a huge recursive descent parser, and Oil is designed with
       | a grammar.
       | 
       | Guy Steele and Russ Cox also mentioned this methodology -- design
       | the language with a grammar, and then implement it a second time
       | by hand. Then you REALLY know it! :)
       | 
        | It is definitely possible to not know the language you're
        | implementing/creating!
       | 
       | ---
       | 
       | As an undergrad, I never tried to design a language, or even
       | thought about it honestly. But I agree that undergrads should
       | know of both approaches, without necessarily getting into the
       | details of CFGs.
       | 
       | It would be a mistake to just teach recursive descent.
        
       | hinkley wrote:
       | No, we don't.
       | 
       | I've tried to learn both of these several times each, and what
       | happens is that each time I end up yelling at my computer, "but
       | we don't parse/read programs that way!" And I simply cannot
       | participate in this level of fiction so I find another solution
        | or hobby. Neither of them is a good fit for C-derived languages,
        | and no developer I've done DX studies with decomposes code the
        | way either of them does. I absolutely do not.
       | 
       | Last time I looked, the only one I found that's close to right is
       | SGLR.
        
       | xavxav wrote:
        | I like the idea of specifying a parser through a formal grammar,
        | but I think that current technology has a few annoyingly severe
        | flaws.
        | 
        | * Parser generator grammars are usually quite far from 'paper'
        | versions of a formal grammar. The moment you have to deal with
        | associativity/precedence or left-recursive rules, the beautiful
        | grammar tends to fall apart. In theory most generators have
        | mechanisms to help make this ergonomic, but irl most grammars
        | I've seen are totally illegible.
        | 
        | * Error messages suck. I think this is actually the bigger
        | issue: parser generators have pretty poor error messages compared
        | to even naive recursive descent parsers written with a library
        | like megaparsec. There has been research on generating better
        | error messages in generated parsers like menhir [1], but I
        | haven't seen anyone apply it successfully; it requires you to
        | manually annotate each error state with a message, a real parser
        | can easily have thousands of them, and they can change each time
        | you update your grammar.
        | 
        | * Parser files are usually fairly illegible because they mix host
        | language code with parser syntax and meta-variables. Some more
        | modern libraries like Pest separate the two, which I think is a
        | net improvement.
       | 
       | Nevertheless, I hope that someone figures out a way to square the
       | circle and frees us from handwriting parsers in all but the most
       | exceptional cases.
       | 
       | [1] https://hal.inria.fr/hal-01248101/
        
         | moring wrote:
         | I disagree to some extent:
         | 
         | > Parser generator grammars are usually quite far from 'paper'
         | versions
         | 
          | The more useful parser input formats allow precedence and
          | associativity annotations as well as optionals, quantifiers,
          | and quantifiers with separators.
          | 
          | This really only works well together with your third point:
         | 
         | > Parser files are usually fairly illegible because they mix
         | host language code with parser syntax
         | 
         | You don't need host code if the parser just outputs an AST.
         | Neither quantifiers with or without separators nor optionals
         | make the parser spec more complex if no host code is needed --
         | the complexity is in the parser generator and AST generator.
         | 
         | > Error messages suck
         | 
          | I found error messages okay-ish when you view them in an IDE,
          | where it is more important _where_ an error occurred than what
          | exactly is wrong (this only applies to syntax errors). I also
          | found it helpful to show the allowed inputs at that place that
          | would NOT have resulted in a syntax error -- and the parser
          | generator can emit a list of both terminals and nonterminals
          | automatically, because it uses this very information to detect
          | the syntax error in the first place.
         | 
          | I also tripped over the often-cited problem that LALR parsers
          | produce less useful error messages than LR(1) parsers. These
          | days, just using LR(1) is underrated IMHO. In my experiments
          | the table didn't get too large (a 400kB parse table for a
          | simple hardware-description language), and I would have tried
          | applying compression algorithms to the table instead of
          | resorting to LALR, but that size was small enough to be
          | acceptable to me. Only the run time of the table generation
          | process was a bit annoying (10-20 seconds).
        
           | jsmith45 wrote:
           | I will say it. Using LALR(1) is a bug.
           | 
            | Either use classic LR(1), or IELR(1), which bison supports.
            | IELR(1) accepts all LR(1) grammars (i.e. it will produce
            | identical parses to classic LR(1)) but generates parse tables
            | that are nearly the same size as LALR(1)'s.
           | 
            | I'm not sure about the quality of IELR's error messages, but
            | at least you don't get the LALR-introduced conflicts that are
            | hard to wrap your head around.
           | 
           | See this nice analysis of it:
           | https://news.ycombinator.com/item?id=24909161
        
           | HarHarVeryFunny wrote:
           | Your experience with the usefulness of error messages from
           | top-down (recursive descent) vs bottom-up (table-driven)
           | parsers is opposite to common experience. The trouble with
           | bottom-up parsers is that they inherently (by way of being
           | bottom-up) have no top-down context, so all they can by
            | default generate is a bottom-up "symbol XXX expected" type of
            | error. A top-down parser knows what high-level construct it's
            | trying to parse, so it is in a position to generate a more
            | meaningful error.
           | 
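            | As a tiny illustration (a Python sketch; all names are
            | hypothetical): a hand-written routine for an if-statement can
            | phrase its errors in terms of the construct being parsed,
            | rather than just naming an unexpected symbol.
            | 
            |     def expect(toks, want, msg):
            |         if not toks or toks[0] != want:
            |             got = repr(toks[0]) if toks else "end of input"
            |             raise SyntaxError(f"{msg}, got {got}")
            |         toks.pop(0)
            | 
            |     def parse_if(toks):
            |         # This function *knows* it is parsing an
            |         # if-statement, so its messages can say so.
            |         expect(toks, "if", "expected 'if'")
            |         expect(toks, "(", "expected '(' after 'if'")
            |         cond = toks.pop(0)   # stub: one-token condition
            |         expect(toks, ")", "expected ')' to close 'if'")
            |         body = toks.pop(0)   # stub: one-token body
            |         return ("if", cond, body)
            | 
            |     # parse_if(["if", "x", ")", "y"]) raises:
            |     #   SyntaxError: expected '(' after 'if', got 'x'
            | 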
           | Of course in either case the grammar _could_ be augmented
           | with thoughtful error constructs designed to generate
            | meaningful errors when parsed, but that doesn't seem to be
           | common practice given how awful compiler error messages tend
           | to be. Try writing "std::cout" instead of "std::endl" and see
           | what gibberish error message g++ generates. clang is
           | generally better than g++ in this regard.
        
         | logicalshift wrote:
         | I wrote a parser generator quite a long time ago that I think
         | improves the syntax quite a lot, and which has an interesting
         | approach to generalisation: you can write conditions on the
         | lookahead (which are just grammars that need to be matched in
         | order to pick a given rule when a conflict needs to be
         | resolved). This construct makes it much easier to write a
         | grammar that matches how a language is designed.
         | 
         | Here's an ANSI-C parser, for example:
         | https://github.com/Logicalshift/TameParse/blob/master/Exampl...
         | - this is an interesting example because `(foo)(bar)` is fully
         | ambiguous in ANSI C: it can be a cast or a function call
          | depending on whether `foo` is a type or a variable.
         | 
         | The new construct makes it possible to extend grammars and
         | disambiguate them - here's a C99 grammar that extends the
          | ANSI-C grammar:
          | https://github.com/Logicalshift/TameParse/blob/master/Exampl...
         | 
         | It also allows matching at least some context-sensitive
         | languages - see
         | https://github.com/Logicalshift/TameParse/blob/master/Exampl...
         | 
          | An advantage over GLR or backtracking approaches is that this
          | still detects ambiguities in the language, so it's much easier
          | to write a grammar that doesn't end up running in exponential
          | time or space; plus, when an ambiguity is resolved by the
          | generalisation, which version wins is specified by the grammar
          | and is not arbitrary (backtracking) or left until later (GLR).
         | 
         | I was working on improving error handling when I stopped work
         | on this, but my approach here was not working out.
         | 
          | (This is a long-abandoned project of mine, but the approach to
          | ambiguities and the syntax seem novel to me, and they were
          | definitely an improvement over anything else I found at the
          | time. The lexer language has a few neat features in it too.)
        
       | [deleted]
        
       | sshine wrote:
       | What an excellent article that I agree almost entirely with.
       | 
       | I'd like to add nuance in two places:
       | 
       | > If you're writing a parser for a language which is ambiguous,
       | you might be able to use the conflict resolution offered by YACC
       | and friends. However, in general, I suggest avoiding YACC's
       | conflict resolution if possible, as you have to at least somewhat
       | understand the underlying LR algorithm to understand how
       | conflicts have been resolved.
       | 
       | I'd say: Specifically for binary operator precedence and
       | associativity, definitely use conflict resolution of Yacc and
       | friends instead of rewriting the grammar. An ambiguous grammar
        | for binary operators might look like:
        | 
        |     E -> E ^ E
        |     E -> E x E
        |     E -> E + E
        |     E -> (E)
        |     E -> num
       | 
        | Disambiguating in Yacc and friends involves a few lines like:
        | 
        |     %left  '+'
        |     %left  'x'
        |     %right '^'
       | 
        | whereas rewriting the grammar above to be unambiguous is a mess:
        | 
        |     E0 -> EA E1
        |     EA -> E1 + EA
        |     EA -> e
        |     E1 -> EM E2
        |     EM -> E2 x EM
        |     EM -> e
        |     E2 -> E3 EP
        |     EP -> ^ E3 EP
        |     E3 -> num
        |     E3 -> (E0)
       | 
       | Disambiguating by resolving shift/reduce conflicts means fewer
       | states, whereas disambiguating by rewriting the grammar means
        | more states. Perhaps, as Laurence points out in this article wrt.
        | LALR and SLR, this probably doesn't matter at all on modern
        | computers.
       | 
       | Any other shift/reduce or reduce/reduce conflict caused by some
       | other ambiguity than specifically a binary operator, e.g. when
       | you have multiple if-then-else-if-then-else and you want to know
        | where an else belongs: you need to be quite into LR parser
        | generators to understand what's happening. It took quite a lot of
        | guessing and trial-and-error when I worked with them, and I've
        | probably forgotten it again. I'd document these as non-trivial
        | choices if I were maintaining a Yacc-like parser, and I'd
        | generally avoid syntax that is ambiguous in this way (although I
        | don't think you can always do that).
       | 
       | > If you need the best possible performance or error recovery,
       | recursive descent parsing is the best choice. If you know you
       | will need this and you're designing a new language, at least
       | create an LL grammar, since from that you can derive a recursive
       | descent parser. [Bear in mind that LL grammars are more annoying
       | to write than LR grammars which is why I suggest avoiding LL
       | unless you have this very specific use case.]
       | 
       | I would say: If you get used to parser combinator libraries, you
       | can get a fast, readable LL parser that is also fast to write and
       | does not require extra compilation steps like practically every
       | LR parser generator. I.e. you can skip the part where you convert
       | your grammar file into a native-$YOUR_LANG parser, because parser
       | combinators are embedded DSLs, some of which are just as fast as
        | a hand-written recursive-descent parser. Parser combinator
       | libraries usually have excellent error reporting capability. (The
       | best example I know is Haskell's diagnose package. [2]) And some
       | of them have backtracking capability, so that for those few parts
       | where your grammar isn't LL, you can go back and try something
       | else.
       | 
        | Modern parser combinator libraries like Haskell's Parsec /
        | Megaparsec have helper functions for generating a parser that
        | effectively translates an operator associativity/precedence table
        | into an LL parser; so you get the nicely readable description,
        | and the transformed LL parser that goes with it; see Megaparsec's
        | makeExprParser [3]:
        | 
        |     expr = makeExprParser term table <?> "expression"
        | 
        |     term = parens expr <|> integer <?> "term"
        | 
        |     table = [ [ prefix  "-"  negate
        |               , prefix  "+"  id ]
        |             , [ postfix "++" (+1) ]
        |             , [ binary  "*"  (*)
        |               , binary  "/"  div  ]
        |             , [ binary  "+"  (+)
        |               , binary  "-"  (-)  ] ]
        | 
        |     binary  name f = InfixL  (f <$ symbol name)
        |     prefix  name f = Prefix  (f <$ symbol name)
        |     postfix name f = Postfix (f <$ symbol name)
       | 
        | [1]: https://stackoverflow.com/questions/898489/what-programming-...
        | 
        | [2]: https://github.com/mesabloo/diagnose#error-reporting-made-ea...
        | 
        | [3]: https://hackage.haskell.org/package/parser-combinators-1.3.0...
        
         | ratmice wrote:
         | > require extra compilation steps like practically every LR
         | parser generator.
         | 
          | It is worth noting that the author's lrpar tool (linked in the
          | article) doesn't actually require extra compilation steps or
          | code generation. This perhaps comes at the expense of less
          | flexible lexing and parser actions. But lrpar comes with a
          | tool, 'nimbleparse', which, given a grammar and an input file,
          | parses the input file bypassing code generation. This shortens
          | the grammar editing/testing cycle by quite a lot. Inspired by
          | that, I have been working on an LSP-based version of the tool
          | for a while, to make the whole thing as responsive as
          | possible...
        
         | HarHarVeryFunny wrote:
          | Encoding expression operator precedence into a grammar (via
          | add-expr, mul-expr, etc.) and then directly implementing it is
          | very inefficient. For an N-level-deep expression grammar, a
          | recursive descent parser will end up making N-deep recursive
          | calls for every terminal (variable, constant)! In a recursive
          | descent parser it's more efficient just to have a single
          | ParseExpression() function that uses the simple "precedence
          | climbing" technique to handle operator precedence, which will
          | result in at most one recursive call per operator (vs N per
          | terminal). You could even eliminate the recursion completely by
          | using an explicit stack instead.
         | 
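          | For reference, a minimal precedence-climbing sketch in Python
          | (token handling is hypothetical; operands are single tokens).
          | Note it makes at most one recursive call per operator:
          | 
          |     PREC = {"+": (1, "left"), "-": (1, "left"),
          |             "*": (2, "left"), "/": (2, "left"),
          |             "^": (3, "right")}
          | 
          |     def parse_expr(toks, min_prec=1):
          |         lhs = toks.pop(0)            # a terminal
          |         while toks and toks[0] in PREC:
          |             prec, assoc = PREC[toks[0]]
          |             if prec < min_prec:
          |                 break
          |             op = toks.pop(0)
          |             # left-assoc: the recursive call only grabs
          |             # strictly higher precedence; right-assoc: it
          |             # may grab the same precedence again.
          |             nxt = prec + 1 if assoc == "left" else prec
          |             lhs = (op, lhs, parse_expr(toks, nxt))
          |         return lhs
          | 
          |     # parse_expr("2 + 3 * 4".split())
          |     #   == ('+', '2', ('*', '3', '4'))
          | 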
          | The same inefficiency applies to bottom-up table-driven parsers
          | generated by parser generators too - in a naive implementation
          | the parser will have to go through N steps of reduce actions to
          | elevate any terminal up to a top-level expression. A smarter
          | generator may eliminate 1:1 "unit productions" of the type
          | mul-expr := add-expr in order to remove this inefficiency. As
          | you say, the other way to handle precedence using a parser
          | generator is operator precedence & associativity ambiguity
          | resolution, but this is a pain to develop.
        
           | sshine wrote:
           | Thank you for mentioning "precedence climbing".
           | 
           | I never had a need for a fast expression parser, so I never
           | explored this option.
           | 
           | For those curious,
           | 
            | https://eli.thegreenplace.net/2012/08/02/parsing-expressions...
           | 
           | It should be possible to combine this with some parser
           | combinator libraries just the same.
        
       | breck wrote:
       | I enjoyed the article even though I'm in the opposite camp.
       | 
       | > Context-free grammars, and their associated parsing techniques,
       | don't align well with real-world compilers, and thus we should
       | deemphasise CFGs (Context-Free Grammars) and their associated
       | parsing algorithms.
       | 
        | I agree with this. I think CFGs are highly overrated. Top down
       | recursive descent parsers are simple and allow you to craft more
       | human languages. I think building top down parsers is something
       | every dev should do. It's a simple technique with tremendous
       | power.
       | 
       | I think the source code for Scroll
       | (https://github.com/breck7/scroll/tree/main/grammar) demonstrates
       | how liberating moving away from CFGs can be. Easy to extend,
       | compose, build new backends, debug, et cetera. Parser, compiler,
       | and interpreter for each node all in one place. Swap nodes around
       | between languages. Great evolutionary characteristics.
       | 
       | I'll stop there (realizing I need to improve the docs and write a
       | blog post).
        
         | choeger wrote:
         | And no one will ever be able to implement a tool that works
         | with the input language without reimplementing _everything_.
         | 
         | CFGs are equivalent to modularity in language design. A choice,
         | clearly, but one with significant upsides.
        
           | breck wrote:
           | > And no one will ever be able to implement a tool that works
           | with the input language
           | 
            | But what do you get with such a tool? Syntax highlighting and
            | cryptic error messages on the syntax only. And then you hit a
            | wall: if you want to offer more to your users
            | (refactoring, autocomplete, etc.) you have to reimplement
            | everything anyway.
           | 
           | > CFGs are equivalent to modularity in language design. A
           | choice, clearly, but one with significant upsides.
           | 
           | My point above aside, you bring up a great point, and
           | something I need to improve on a lot in Grammar (modularity
           | so Tree Languages can be run on many host systems). Now my
           | wheels are turning. Thanks!
        
       | tlb wrote:
       | Recursive descent parsers have trouble with left-associative
       | operators if you only allow recursion. But if you allow
       | iteration, you can build up the parse tree while iterating
       | through terms at the same precedence level. For example, this
       | defines + and - as left-associative at the same precedence level:
        |     // EBNF is SUM = PRODUCT , { ("+" | "-") , PRODUCT }
        |     AstNode *parseSum()
        |     {
        |       if (auto lhs = parseProduct()) {
        |         while (1) {
        |           if (consume(T_PLUS)) {
        |             // You'd call parseSum here instead to get right
        |             // associativity; left associativity happens below.
        |             if (auto rhs = parseProduct()) {
        |               lhs = mkAst(A_add, lhs, rhs);
        |             }
        |             else {
        |               return nullptr;
        |             }
        |           }
        |           else if (consume(T_MINUS)) {
        |             if (auto rhs = parseProduct()) {
        |               lhs = mkAst(A_sub, lhs, rhs);
        |             }
        |             else {
        |               return nullptr;
        |             }
        |           }
        |           else {
        |             return lhs;
        |           }
        |         }
        |       }
        |       else {
        |         return nullptr;
        |       }
        |     }
       | 
       | As a bonus, it doesn't blow out your stack when parsing long
       | expressions.
       | 
       | Very few binary operators should be right-associative. Basically
       | just ?: and = .
        
         | peterfirefly wrote:
         | Seems like a relevant place to put a link to this paper:
         | 
         | David R. Hanson, "Compact Recursive-descent Parsing of
         | Expressions", 1985
         | 
         | https://drhanson.s3.amazonaws.com/storage/documents/compact....
         | 
         | There are languages (such as ML) that allow the definition of
         | new operators with a choice of precedence and associativity.
         | Using a table like the one Hanson suggests and modifying it at
         | compile-time is an obvious way to handle that feature.
        
         | bruce343434 wrote:
         | When you do precedence like this in recdesc, you will get
         | functions like parse_prec_1, parse_prec_2 and so forth. If you
         | generalize this function body, and pass along a "current
         | precedence", then generalize the precedences for each operator
         | into a table, you have just reinvented Pratt Parsing.
        
         | ufo wrote:
         | The price for this is that you have to refactor the grammar,
         | replacing all uses of left recursion with right recursion
         | (iteration).
        
           | bear8642 wrote:
            | Which might then change the semantics of your grammar - I
            | remember this being pointed out in my 3rd-year compilers
            | course.
        
       | pklausler wrote:
       | For the Fortran parser in f18 ("LLVM Flang"), I ended up writing
       | a small suite of C++17 constexpr parser combinators to build a
       | recursive descent parser over a well-cooked character stream. A
       | grammar encoded with parser combinators is no less readable than
       | the input to a parser generator, and has the advantages that you
       | can easily automate the construction of a strongly typed parse
       | tree, perform custom error recovery, enable and disable language
       | features, and use custom token recognizers for some of the
       | ancient weirdness that makes Fortran so fun.
       | 
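        | (For readers who haven't seen parser combinators: here is a
        | deliberately minimal Python sketch of the general shape -- not
        | Flang's actual C++17 implementation; all names are made up. A
        | parser is a function from (text, pos) to (value, new_pos), or
        | None on failure.)
        | 
        |     def char(c):
        |         def p(s, i):
        |             if i < len(s) and s[i] == c:
        |                 return (c, i + 1)
        |             return None
        |         return p
        | 
        |     def seq(*parsers):         # run parsers one after another
        |         def p(s, i):
        |             vals = []
        |             for q in parsers:
        |                 r = q(s, i)
        |                 if r is None:
        |                     return None
        |                 v, i = r
        |                 vals.append(v)
        |             return (vals, i)
        |         return p
        | 
        |     def alt(*parsers):         # first alternative that matches
        |         def p(s, i):
        |             for q in parsers:
        |                 r = q(s, i)
        |                 if r is not None:
        |                     return r
        |             # backtracking is trivial: pos was never mutated
        |             return None
        |         return p
        | 
        |     # ab_or_ac = seq(char('a'), alt(char('b'), char('c')))
        |     # ab_or_ac("ac", 0) == (['a', 'c'], 2)
        | 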
       | One thing that I did this time that turned out to be a good idea:
       | run all source preprocessing and file inclusion first, before
       | doing any parsing; deal with continuation lines and comments, and
       | put the "cooked" character stream that results into a single
        | contiguous block of characters, and use _that_ for parsing. It's
        | really fast, and makes backtracking trivial. Add a data structure
        | that maps every character in the cooked stream back to its source
        | provenance (input file, macro expansion, whatever) and you can
        | derive good source locations for error messages from just a
        | basic char pointer. (Add a reverse mapping and you've got the
        | basis of an IDE's language server, too.)
       | 
       | TL;DR: Build-time parser combinators are worthy of your
       | consideration for your next parsing task.
        
       | coldcode wrote:
        | Oddly enough, my first project as a programmer in 1981 was
        | writing a recursive descent parser in FORTRAN 77 (for the JOVIAL
        | language). Since I had no degree in CS, I previously had zero
        | clue what a parser even was. Fun way to start 4 decades of
        | programming.
        
       | jll29 wrote:
        | I respect this article a lot (and I appreciated the Rust parser
        | generator pointer in it!); my own personal take is one of partial
       | agreement, but I would like to raise two points here:
       | 
        | 1. Part of compiler construction courses is to instill in the
        | students the notion of meta-linguistic abstraction (SICP),
        | namely to solve problems by creating and implementing domain-
        | specific languages (DSLs) that permit declarative statements
        | about the domain in a natural and therefore maintainable
        | notation, which is separate from an engine that can interpret or
        | compile it. Compiler classes are ideal for teaching about DSLs
        | because the topic is parsing and compiling/interpreting anyhow,
        | but not only that, students can see that compiler people do as
        | they say: they use the techniques for compiler construction
        | themselves, often with more than one DSL (one for lexers, one for
        | grammars). Parser generators are also used outside of programming
       | languages, for example in computational linguistics formal
       | grammars are often used to model human language morphology,
       | phonology or syntax (with ambiguous grammars). Tomita's GLR
       | parsing - mentioned elsewhere in this thread - comes from that
       | application area.
       | 
        | 2. Another part of tertiary compiler construction education is to
        | get across the utility of theory for practical applications and
        | the closeness of theory and practice in computer science: formal
        | language theory is beautiful and well-developed, and showing
        | students that it is also useful -- because it gives runtime and
        | space complexity guarantees and systematic approaches that reduce
        | faulty implementations -- is a wonderful thing.
        
         | mjburgess wrote:
         | One could also interpret the "failure" of compiler writers to
         | do this as evidence that the "DSL paradigm" for programming
         | itself is, at best, extremely limited.
         | 
          | One obvious problem is that the semantics of a DSL aren't
          | extendable in that DSL, making it highly brittle. So, e.g., with
          | parsing, as soon as you need bespoke syntax-error handling,
          | etc., you're leaving the world of any DSL you can create,
          | since "bespoke" here will ultimately mean a semantic
          | specialisation that no generic DSL will support.
          | 
          | At the point where your DSL enables these arbitrary
          | specialisations, it is just the host language.
        
           | crapaud23 wrote:
            | In Clojure, the adage says "When building a DSL, favor data
            | over functions over macros". When I write a DSL I start by
            | writing a sample of it as pure data (think of Clojure as a
            | data-processing-oriented language with embedded JSON (to be
            | exact, the format is called EDN)). Then I write the parsing
            | function. What I do, though, is to make sure the parser is
            | extensible. To do so I cast any data element that this
            | function parses into a corresponding function, except if
            | that element is already a function. This way I can directly
            | write custom functions within the DSL.
            | 
            | This is very handy when handling conditions and their
            | composition through function combinators. I can just reuse
            | the language's boolean logic (if, if-not, switch-cases etc.)
            | and don't have to reimplement it within the DSL. Once the
            | DSL data has been parsed into functions, they are then
            | composed through classical function composition available in
            | the language.
            | 
            | Sometimes I also wish I had access to the DSL through
            | macros, mostly to optimize those DSLs. The situations where
            | I have found myself needing this required me to impact the
            | compiler's state in some way, so I didn't carry my
            | investigations further on this subject, and I'm very
            | interested in the development of Lux/luxlang since it comes
            | with macros that are passed the compiler's state monadically.
        
           | hinkley wrote:
           | The DSL paradigm _is_ extremely limited, because there are
           | almost no people who are good at writing programming
           | languages that work for developers instead of the other way
           | around.
           | 
           | XML didn't fail because of some technical limitation. It
           | failed because it tried to turn us all into language
           | designers. At worst JSON asks you for a schema, and that was
           | an afterthought in both cases.
        
             | oldsecondhand wrote:
             | The problem with XML was slow parsing and the horrible
             | syntax of XSD schemas. Turning attributes into elements is
             | also a pain in the ass. And for a lot of applications
             | turning whitespace into elements is unnecessary.
        
               | hinkley wrote:
               | Those are confounding factors. If you made any other
               | product with the same goals, but faster parsing and
               | better schema syntax, probably by integrating the two,
               | then you'd still end up with the same mess because
               | defining a language is worrying about a million corner
               | cases and not being able to fix more than half of them
               | once they ship.
               | 
               | It's a hard problem. Probably the hardest.
        
             | duped wrote:
             | XML didn't fail.
        
           | danielheath wrote:
            | That's why DSL-in-Ruby is a moderately popular technique; the
           | host language is expressive enough to host a DSL without a
           | custom parser.
        
             | seanmcdirmid wrote:
             | DSL in Kotlin is popular as well.
        
             | silon42 wrote:
             | Give me an xml/json schema any day over this.
        
           | [deleted]
        
       | sakras wrote:
       | > The first is performance. Although there aren't, to the best of
       | my knowledge, modern performance numbers, it is reasonable to
       | assume that recursive descent parsing is generally faster than LR
       | parsing.
       | 
       | Why is this safe to assume? My understanding is that shift-reduce
       | parsers in general are faster than recursive-descent ones because
       | they have a constant overhead per production rule, rather than
       | potentially repeatedly creating a tree and then bailing out to
       | try a different production rule if it doesn't work.
        
       | choeger wrote:
       | Excellent article.
       | 
        | One point though: the author compares context-free grammars with
        | LR grammars on the one hand, and recursive descent (LL) parsers
        | with LR parsers on the other.
       | 
       | But in theory, recursive descent parsers are context-free as
       | well. It's just so damn simple to add context to such a parser.
       | And whenever someone does that, the language design itself
       | suffers.
       | 
       | As an example, consider a language like ML, where you want to
        | distinguish module-level identifiers from value-level ones. With
        | a context-free grammar, the tokens actually differ (e.g., by
        | starting with capital letters). If you add context to your parser
       | you might think that constraint can be relaxed. But how do you
       | handle implicit imports then?
        
       | bjourne wrote:
       | > In practise, a recursive descent parser can be thought of as
       | meaning "a hand-written written parser": unless you go out of
       | your way to do otherwise, nearly all hand-written parsers are
       | recursive descent parsers.
       | 
        | I don't think this is true. Yes, almost all hand-written parsers
        | are recursive descent (top-down), but not all recursive descent
        | parsers are hand-written. PEG parsers are top-down, but can be
        | generated much like bottom-up parsers are. See for example
        | PyParsing. Packrat parsing seems to me to be the best of both
        | worlds, and there is not much use for bottom-up parsing anymore.
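        | 
        | For illustration, a tiny packrat sketch in Python (the grammar
        | and all names are made up): memoising each (rule, position) pair
        | is what makes a backtracking PEG parser run in linear time.
        | 
        |     from functools import lru_cache
        | 
        |     SRC = "1+2+3"              # Sum <- Num ('+' Sum)?
        | 
        |     @lru_cache(maxsize=None)   # the packrat part: memoise by pos
        |     def num(i):
        |         if i < len(SRC) and SRC[i].isdigit():
        |             return i + 1       # position after the match
        |         return None
        | 
        |     @lru_cache(maxsize=None)
        |     def sum_(i):
        |         j = num(i)
        |         if j is None:
        |             return None
        |         if j < len(SRC) and SRC[j] == "+":
        |             k = sum_(j + 1)
        |             if k is not None:
        |                 return k
        |         return j               # fall back to just the Num
        | 
        |     # sum_(0) == 5, i.e. the whole input parses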
        
       | jcranmer wrote:
       | Here are my thoughts as to what should be covered with regards to
       | parsing in a compiler course context:
       | 
        | * The lexing/parsing split, and that lexing is basically applying
        | regexes over and over again; see the sketch after this list. (The
        | technique of using a regular expression engine to write your own
        | lexer is a fun idea, but probably should be cut for time.)
       | 
       | * What "context-free grammar" means, and the more-or-less
       | standard syntax for writing CFGs.
       | 
       | * Explain how to build a parser with recursive descent as the
       | example. But...
       | 
       | * Demonstrate expression parsing that supports precedence and
       | both left- and right-associativity. We know how to do this very
       | easily algorithmically (there's several different methods, in
       | fact), it's very natural to describe in this manner, and it's so
       | goddamn common that languages have precedence. (As a side note,
       | expressing precedence via AddExpr/MulExpr in grammar should be
       | covered here, since it's commonly written as such in grammars,
       | but if I were ever to build a parser generator, I'd support a
       | syntax that didn't require manual factoring like that).
       | 
        | * Introduce LL and LR parsing _in theory_. Do not introduce
        | first/follow sets, or the state machine construction for an LR
        | parser. Instead, the points to cover are what makes languages
        | hard to parse (e.g., ambiguity). Strict adherence to an LL/LR
        | grammar isn't necessary and is sometimes counterproductive
        | (left-factoring LL grammars is the case in point here).
       | 
       | * Do cover dangling-else ambiguity at some point. It's a great
       | example of something that wasn't discovered until people started
       | applying theoretical tools to existing languages (in this case,
       | ALGOL).
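        | 
        | A sketch of the "lexing is applying regexes over and over" point
        | from the first bullet (Python; the token set is made up):
        | 
        |     import re
        | 
        |     SPEC = [("NUM", r"\d+"), ("ID", r"[A-Za-z_]\w*"),
        |             ("OP", r"[+\-*/=]"), ("WS", r"\s+")]
        |     MASTER = re.compile("|".join(f"(?P<{n}>{p})"
        |                                  for n, p in SPEC))
        | 
        |     def tokens(src):
        |         pos = 0
        |         while pos < len(src):
        |             m = MASTER.match(src, pos)   # try regexes at pos
        |             if m is None:
        |                 raise SyntaxError(f"bad character at {pos}")
        |             if m.lastgroup != "WS":      # skip whitespace
        |                 yield (m.lastgroup, m.group())
        |             pos = m.end()
        | 
        |     # list(tokens("x = 42"))
        |     #   == [('ID', 'x'), ('OP', '='), ('NUM', '42')]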
       | 
       | Like the author, I think there is some merit to LR parsing as a
       | theoretical concept, but the emphasis placed on it in existing
       | compiler courses and textbooks is way too high. The end goal of
       | parsing theory, for most students, should be to understand how to
       | build languages that admit unambiguous parsers that require no
       | backtracking, and LR parsers cover a very large subset of those
       | languages. It is also helpful to get people to think about lexing
       | and parsing in more formal terms--I have definitely seen a few
       | people go "oh, that makes things much easier!" when they see the
       | divide between lexing and parsing.
        
       | scotty79 wrote:
        | I really enjoyed using the Rust nom crate, which lets you build
        | arbitrary parsers using the parser combinator approach.
        
       | pcj-github wrote:
       | As a self-taught programmer with no formal education, I figured
       | writing a lexer/parser toolkit would be a good way to bootstrap
       | my CS knowledge (about 20 years ago). I went down the rabbit hole
       | for months on it, really glad I did.
       | 
        | I can totally relate to the author's interest in error correction
        | in LR parsers; it's a fascinating topic that I also was nerd-
       | sniped by at the time:
       | https://github.com/inxar/syntacs/blob/6461578a04d3b0fda5af20...
        
       | ThePhysicist wrote:
       | I was quite impressed with GLR parsers, as they can deal really
       | well with ambiguity and are relatively easy to implement. They
       | have limited adoption though as they have high constant overhead
       | and hand-written recursive-descent parsers can easily outcompete
       | them for most languages. The Elkhound paper [1] explains GLR
       | parsing in detail and is quite enjoyable to read.
       | 
       | 1:
       | https://people.eecs.berkeley.edu/~necula/Papers/elkhound_cc0...
        
         | 082349872349872 wrote:
          | The point of TFA is: why reach for GLR, why even pay a cost
          | for ambiguity, if you have control over the grammar and can
          | specify an unambiguous one to begin with?
         | 
         | IMNSHO we should be aware of "big hammer" solutions but not be
         | afraid to pick concrete representations when they make things
         | easy on the machine. (in particular, tests using the former are
         | wonderful double checks for deployment candidates using the
         | latter)
         | 
         | [when would specifying an unambiguous grammar be premature
         | optimisation? when changing the language later is cheap
         | (unlikely!), or when the strings in that language are always
         | going to be of size <= k. Nerd snipe: what --to order of
         | magnitude-- is _k_ ca. 2023?]
        
           | nine_k wrote:
            | Unambiguous grammars are easy not just for machines. Most
            | importantly, they matter to humans. Reading something like
           | Lua or Golang or SQL is usually much easier than C++, to say
           | nothing of Perl.
        
           | sshine wrote:
           | > we should be aware of "big hammer" solutions
           | 
           | Parser combinators have become my preferred "medium hammer"
           | strategy:
           | 
           | Sometimes you get a more readable parser than if you use
           | regex, even though your language is regular.
           | 
            | Sometimes you get corner cases where you have to think, where
            | an LR parser generator wouldn't require thinking.
           | 
           | On the other hand, you have one tool that almost always does
           | the job well, is readable, and embeds nicely.
        
       | HarHarVeryFunny wrote:
       | In terms of education I disagree about the importance of "LR"
       | (really meaning bottom-up, table driven, automatically generated)
       | parsers vs recursive descent (top-down, typically hand written),
       | for a couple of reasons:
       | 
       | 1) Real world compilers are almost always hand written recursive
       | descent. They just turn out to be more flexible, easier to
       | maintain, generate better error messages, etc.
       | 
        | 2) Teaching compiler writing as part of a computer science degree
        | is somewhat of a tradition, but the number of students who will
        | ever write a compiler (even for a small DSL, let alone a real
        | language) is minimal. I think it's a useful exercise to show how
        | these things are structured and to teach general divide-and-
        | conquer decomposition of complex programs, but that's about it.
       | Given the number of moving parts in a compiler, the emphasis
       | should be on an overview of all the parts and how they fit
       | together, not going overboard on parsing and teaching two
        | different parsing techniques. To the extent that any Comp-Sci
        | student will ever need to write a parser at some point in their
        | career (say for a DSL, or to parse some data format), recursive
        | descent is the more useful technique to know.
       | 
       | Parser generators seem largely an historical approach from a time
       | (1970's) when computers were so slow that the speed of a table
       | driven parser was a major advantage, and the idea is firmly
       | implanted in the heads of those of us that grew up reading Aho &
       | Ullman's compiler-writing "Dragon Book". I think they still serve
       | a purpose, but perhaps for more fringe use cases rather than
       | writing actual compilers for full-blown languages!
       | 
       | Having said all that, there's a lot of _fun_ in writing a parser
       | generator, regardless of how useful the end result may be! I
       | wrote a full LR(1) - not YACC-like LALR(1) - parser generator in
        | Modula-2 back in the early 80's, at a time when the textbooks
       | told you this was impractical/impossible due to the size of a
       | canonical Knuth LR(1) parser.
        
         | [deleted]
        
       | not2b wrote:
       | There's been a movement away from LR parsing in favor of
       | recursive descent, and it isn't only that many languages are not
       | LR(1) (for example, C++ can require infinite lookahead). It's
       | also that even if the grammar of the language is LR(1), the
       | errors that users will make often don't fit cleanly into the
       | Yacc/bison structure, so error messages will often be poor
       | quality and improving them can feel like unstructured hacking.
       | It's much easier to produce high-quality error messages with a
       | recursive descent parser. Recursive descent is just closer to how
       | people think, so language designers generally create languages
       | that work well with it.
       | 
       | Now, if LR parsing were a simple topic, it would be OK for
       | compiler classes to spend time on it. But it's complex enough
       | that teaching it adequately takes a lot of time, and that time
       | could be better spent on any number of other topics. Get the
       | students to the point where they have a clean abstract syntax
       | tree quickly and go from there.
        
         | bitL wrote:
          | LL(1) grammars for recursive descent aren't very powerful; you
          | might hack SQL with one, but you still won't do C++ with it.
        
       | krupan wrote:
       | In his linked Which Parsing Approach article he says this,
       | 
       | "Of course, I'm clearly somewhat biased, so perhaps these words
       | from Guy Steele might be more compelling:
       | 
       | '[Be] sure that your language will parse. It seems stupid ... to
       | sit down and start designing constructs and not worry about how
       | they will fit together. You can get a language that's difficult
       | if not impossible to parse, not only for a computer, but for a
       | person. I use Yacc constantly as a check of all my language
       | designs, but I very seldom use Yacc in the implementation. I use
       | it as a tester, to be sure that it's LR(1) ... because if a
       | language is LR(1) it's more likely that a person can deal with
       | it.'"
       | 
       | I'm just starting to dabble in writing a compiler and this sounds
       | pretty reasonable to me, basically using Yacc as a linter for
       | your grammar but not for actually generating your parser. What do
       | more experienced people here think of that?
        
         | gavinhoward wrote:
         | > What do more experienced people here think of that?
         | 
         | "Experienced" person here. I've done a parser generator grammar
         | or two. I've implemented several parsers.
         | 
         | I mostly agree, but it depends on your use case.
         | 
         | Most people writing parsers are doing so for a bespoke
         | language, usually for config of an internal-use-only program or
         | something like that. Such parsers won't be great, but they will
         | work for people that:
         | 
         | 1. Know the language, or
         | 
         | 2. Are expected to know the language.
         | 
         | In that case, I agree with Guy Steele.
         | 
         | Why these two caveats? Because as people have said, error
         | reporting from the parser will be subpar. This includes hand-
         | written recursive descent ones. So knowing the language is
         | important.
         | 
         | (The second item is specifically for situations where new
         | employees might not know the language yet, but it's part of
         | their job, so they _will_ learn it.)
         | 
          | This covers, by far, the greatest number of parsers in use. I
          | would estimate 99.5% or more.
         | 
         | Of course, most people don't think about those parsers; they
         | think about the hardened parsers for general-purpose languages
         | in wide use, but while they get a lot of attention, they are
         | few and far between.
         | 
          | Guy Steele's advice doesn't apply as well to such situations.
          | In my case, I wrote a parser for an existing language. I had to
          | write the parser to match it, and that language had a lot of
          | baggage. The advice made no sense in that case.
         | 
          | For the parser I'm currently writing, it _also_ makes no sense.
          | This is for what I hope will become a widely-used general-
          | purpose language. Sure, I could try, but I don't think it would
          | work.
         | 
         | To see why, look at perhaps the two newest big languages: Rust
         | and Zig. Both are not that old (Rust 15 years-ish, Zig 8 years-
         | ish?). And yet, I suspect both use handwritten parsers. I
         | suspect both are recursive descent.
         | 
         | Why is this? These languages shouldn't have that much legacy
         | baggage, like C or C++, right?
         | 
         | It doesn't matter. General-purpose languages have to be
         | _everything_ to _all possible users_. This means they will have
         | to be able to do some crazy things. And to do crazy things with
         | plain text, well, you need as much power as you can get.
         | 
         | This is why general-purpose languages basically _always_ become
         | context-sensitive. Context-sensitivity allows a recursive
         | descent parser to bring the full power of Turing-completeness
         | to bear on a problem.
         | 
         | Guy Steele also alludes to this in his classic "Growing a
         | Language" talk. [1] He talks about how you need to create a
         | language that _users_ can grow. There are easy ways to do this
         | (like having functions), but the easy ways are all limited. If
         | there is something that must be a language feature, like format
          | printing [2], it can't be added by users.
         | 
          | Adding those other features usually requires context-
         | sensitivity. Format printing definitely does when it allows you
         | to use variable names in the format string.
         | 
         | But what if you make a language that any user can extend in any
         | way, even if it's not easy? The result is my language, Yao.
         | (Example at [3].) It's context-sensitive because it _has_ to
         | be.
         | 
         | So basically, Guy Steele is right for everybody _but_ the
         | creators of big general-purpose languages. But most people
          | think he's talking about the big general-purpose languages. He
         | might have been, but if he was, he's wrong.
         | 
         | That doesn't mean that I don't consider the ease of parsing
         | when I add a feature. I definitely do. But I come at it from
         | the _human_ perspective, how readable it is to a _human_. This
         | tends to give me much the same thing while giving me power.
         | 
         | [1]: https://www.youtube.com/watch?v=lw6TaiXzHAE
         | 
         | [2]: https://brevzin.github.io/c++/2023/01/02/rust-cpp-format/
         | 
         | [3]:
         | https://git.yzena.com/Yzena/Yc/src/commit/66185f4ff4fb779e04...
        
       ___________________________________________________________________
       (page generated 2023-01-17 23:01 UTC)