[HN Gopher] Why we need to know LR and recursive descent parsing...
___________________________________________________________________
Why we need to know LR and recursive descent parsing techniques
Author : todsacerdoti
Score : 162 points
Date : 2023-01-17 09:51 UTC (13 hours ago)
(HTM) web link (tratt.net)
(TXT) w3m dump (tratt.net)
| vidarh wrote:
| I agree strongly with the point that making parsing a major
| component of teaching compilers is misguided (as anyone who has
| read my compiler series would know): not because learning parsing
| is unnecessary, but because going more in depth on the rest is
| sacrificed in favour of what is one of the least interesting
| parts of the challenge of writing a new language.
|
| There's more to do in improving parsing, but it's still one of
| the best understood aspects of writing a compiler.
|
| That said, I do agree with the article that it's a problem that
| many syntaxes have complex parser requirements. No grammar I know
| of _requires_ recursive descent, but many _require_ violations of
| layering etc. that aren't catered for by tooling, which strongly
| favours manual parsers.
|
| Now, I prefer manually written parsers because I think (having
| tried and failed to write better options many times) that current
| parser generation tools are rarely suitable for writing
| production parsers.
|
| That said, I wish people _wrote_ their grammars for a tool as a
| validation exercise, to give an incentive to design grammars that
| are more regular, simpler and easier to parse.
|
| As an example, I love Ruby. Having worked on my Ruby compiler
| project (though it's been in hibernation for a few years), I
| utterly hate Ruby's grammar. I don't hate the _result_, for the
| most part, but that's mostly because most of us avoid the more
| insane things Ruby allows [1].
|
| And while some of the complexity is part of creating what I love
| about Ruby, the grammar is full of horrible quirks (again, [1]),
| at least some of which might have been reduced if one of the
| "test cases" of the design process for Ruby had been to maintain
| and update a compact language description for a parser generator
| tool. Even if that weren't the production grammar for MRI, using
| it both to maintain a sane, compact grammar description _and_ to
| validate other parsers against would be great.
|
| I pick on Ruby because I love it and wish it were cleaner, and
| because it's a particularly bad offender in this respect, but so
| many languages have occasional warts that they likely wouldn't
| have if a formal specification of the grammar had made them more
| obvious.
|
| [1] E.g. 'p(% % )' is valid (it prints "%"), because '%' without
| a left-hand operand reads the following character as the start of
| a quoted string ending with the same character. 'p % % ' is not,
| because the "p" can be parsed as an operand, and so the _first_
| '%' is interpreted as an infix operator, while the _second_ '%'
| is interpreted as the start of a quoted string. Fun times.
| Thankfully I've never seen anyone abuse this in real code.
| olliej wrote:
| Ehn, the problem with LR parsing is that it's terribly slow and
| can only practically be done via generators, and I have yet to
| encounter a parser generator that can actually handle error
| correction properly (necessary unless you want the first error to
| break everything after it).
|
| The theoretical argument for LR parsing is that it is more
| powerful than LL, LALR, etc., but that's only because PLT seems
| hell-bent on ignoring LL* due to the arbitrary restrictions
| placed on what a "real" BNF grammar is allowed to do.
|
| Knowing about LR - and PDA/state-machine-based parsing generally
| - is useful, but pretending that LR parsers are somehow better
| than recursive descent is nonsense.
| recursivedoubts wrote:
| _> unfortunately many big brain school teach only parser
| generator tool. here grug usual love of tool is not: parser
| generator tool generate code of awful snakes nest: impossible
| understand, bottom up, what? hide recursive nature of grammar
| from grug and debug impossible, very bad according grug! _
|
| https://grugbrain.dev/#grug-on-parsing
|
| Bob Nystrom has done us all a tremendous service by writing a
| wonderful book (easily top 5 programming books available) on
| recursive descent:
|
| https://craftinginterpreters.com/
| DannyBee wrote:
| IMHO the majority of engineers not understanding lexing/parsing
| beyond generators has held back good engineering of incremental
| parsers/lexers for a long time.
|
| Things like tree-sitter took forever to come along, but the
| research projects that built the algorithms they rely on are a
| quarter century old.
| (https://www2.eecs.berkeley.edu/Pubs/TechRpts/1997/5885.html)
|
| Meanwhile, incremental lexing and incremental recursive descent
| parsing and LL are actually not that hard (incremental LR is
| trickier).
|
| It's just that few people really understand how any of it works
| well enough to implement it.
|
| This is not an "it was not needed" issue. Lexing times for syntax
| highlighting are still "seconds" for larger files, for example.
|
| Meanwhile, I can explain near-optimal incremental lexing to
| someone who knows how lexers work in a few bullet points, and
| it's trivial to implement:
|
| 1. As your lexer looks at characters, keep track of the maximum
| lookahead and lookbehind used during a lex.
|
| 2. When some range of the string to lex is invalidated, adjust
| the invalidation bounds to include the maximum
| lookahead/lookbehind as well.
|
| 3. Relex the invalidated portion of the string.
|
| That's it. For optimality, compute the maximum
| lookahead/lookbehind on a per-token basis, instead of on a
| per-lex basis (this is trivial if your editor is already storing
| the tokens under the text somewhere).
|
| If you store position info in your tokens, you can easily adjust
| as well.
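|
| A minimal sketch of steps 1-3 in C++ (the names and the per-token
| bookkeeping are mine, purely for illustration, not from any
| particular lexer):
|
|       #include <algorithm>
|       #include <cstddef>
|       #include <utility>
|       #include <vector>
|
|       struct Token {
|           std::size_t begin, end;  // byte range of the token in the source
|           std::size_t lookahead;   // max chars peeked past `end` during its lex
|           std::size_t lookbehind;  // max chars examined before `begin`
|       };
|
|       // Step 2: given an edit that invalidated [edit_lo, edit_hi), widen
|       // the range to cover every token whose lex ever looked at an edited
|       // character.  The caller then relexes just [lo, hi) (step 3) and
|       // splices the fresh tokens back in.
|       std::pair<std::size_t, std::size_t>
|       widen(const std::vector<Token> &toks,
|             std::size_t edit_lo, std::size_t edit_hi) {
|           std::size_t lo = edit_lo, hi = edit_hi;
|           for (const Token &t : toks) {
|               std::size_t seen_lo =
|                   t.begin > t.lookbehind ? t.begin - t.lookbehind : 0;
|               std::size_t seen_hi = t.end + t.lookahead;
|               if (seen_hi > edit_lo && seen_lo < edit_hi) { // saw an edited char
|                   lo = std::min(lo, t.begin);
|                   hi = std::max(hi, t.end);
|               }
|           }
|           return {lo, hi};
|       }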
|
| For incremental recursive descent parsing, it's similar - only
| rules that ever looked at tokens that changed could be affected
| by the change. You can easily keep the interval of tokens that a
| given rule looked at. (If you want it truly optimal, you can
| store the set of intervals looked at, but this is almost always
| overkill.)
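|
| A rough sketch of that bookkeeping (again, my names, purely
| illustrative):
|
|       #include <cstddef>
|
|       struct AstNode {
|           std::size_t first_tok, last_tok;  // tokens this rule looked at
|           // ... kind, children, etc.
|       };
|
|       // After an edit, tokens [dirty_lo, dirty_hi) were relexed; a node can
|       // be reused verbatim iff its rule never looked at any of them.
|       bool reusable(const AstNode &n,
|                     std::size_t dirty_lo, std::size_t dirty_hi) {
|           return n.last_tok < dirty_lo || n.first_tok >= dirty_hi;
|       }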
|
| If all you ever have is a generator, you will never implement any
| of the above until your generator does.
|
| I have watched tons of projects spend lots of time optimizing
| grammars, doing weird caching, etc, when a few simple/better
| algorithms would have saved them from all of it.
|
| (which is why in that sense, tree-sitter has been a revolution)
| brhsagain wrote:
| This is extremely interesting to me. I looked at the papers
| tree-sitter referenced but they were all written assuming an
| LR/generator-based parser and I couldn't figure out how to
| adapt it to recursive descent.
|
| Let's say I've parsed an AST:
|
|       binary_expression
|         lhs: number
|         rhs: binary_expression
|           lhs: number
|           rhs: call_expression
|             func: identifier
|             args: call_argument_list
|               [0]: number
|               [1]: number
|
| and I've detected that `rhs` on line 3 has changed. So I go
| back to the start of the node. What do I do then? How do I know
| I need to parse an expression there and that it belongs to the
| rhs of a binary expression?
|
| In the literature, this is solved by storing the parse state.
| How do you do that with recursive descent?
|
| Is there something you'd recommend reading to understand how to
| do incremental parsing for recursive descent specifically?
| bsder wrote:
| > IMHO the majority of engineers not understanding
| lexing/parsing beyond generators has held back good engineering
| of incremental parsers/lexers for a long time.
|
| That's a touch unfair. You could easily extend that statement
| to "The predominant languages of C and C++ have held back good
| engineering of parsers/lexers." _period_. This is due to the
| fact that you effectively have to parse the universe to resolve
| the symbol table in both languages.
|
| In addition, people didn't need incremental parsing until
| syntax highlighting became widespread via LSPs. Before that,
| you were always working at the compile level, so why put in all
| the work for an incremental parser you were never going to make
| use of?
|
| People have recently been designing the _languages themselves_ to
| better support parsing/lexing paradigms that can be incremental
| and restarted.
|
| This is why most modern languages have converged to having
| specific syntax to delineate the difference between variable
| name and variable type, for example.
|
| Rust, for example, has fairly funky things around "unary minus"
| because of its design for parsing and lexing.
| le-mark wrote:
| I had one job where my ability to write recursive descent parsers
| made me a "100x ninja rockstar", because they had a lot of
| parsing problems that no one recognized as such. I expressly did
| not use a parser generator because of the complexity it adds to
| the build, the rest of the team's unfamiliarity with such tools,
| and because I considered them overkill for the domain.
| chubot wrote:
| I agree with 90%-95% of the article. I would phrase it more
| pithily as:
|
| 1. Use recursive descent for recognizing existing languages,
| where you have a good test corpus. (And very good point about
| testing INVALID inputs too!)
|
| 2. Use grammars to design new languages
|
| I follow this with https://www.oilshell.org/ -- the compatible
| OSH is a huge recursive descent parser, and Oil is designed with
| a grammar.
|
| Guy Steele and Russ Cox also mentioned this methodology -- design
| the language with a grammar, and then implement it a second time
| by hand. Then you REALLY know it! :)
|
| It is definitely possible to not know the language you're
| implementing / creating!
|
| ---
|
| As an undergrad, I never tried to design a language, or even
| thought about it honestly. But I agree that undergrads should
| know of both approaches, without necessarily getting into the
| details of CFGs.
|
| It would be a mistake to just teach recursive descent.
| hinkley wrote:
| No, we don't.
|
| I've tried to learn both of these several times each, and what
| happens is that each time I end up yelling at my computer, "but
| we don't parse/read programs that way!" And I simply cannot
| participate in this level of fiction so I find another solution
| or hobby. Neither of them is a good fit for C-derived languages,
| and no developer I've done DX studies with decomposes code the
| way either of them does. I absolutely do not.
|
| Last time I looked, the only one I found that's close to right is
| SGLR.
| xavxav wrote:
| I like the idea of specifying a parser through a formal grammar,
| but I think that current technology has a few pretty annoying,
| even severe, flaws.
|
| * Parser generator grammars are usually quite far from 'paper'
| versions of a formal grammar. The moment you have to deal with
| associativity/precedence or left-recursive rules, the beautiful
| grammar tends to fall apart. In theory most generators have
| mechanisms to help make this ergonomic, but irl most grammars
| I've seen are totally illegible.
|
| * Error messages suck. I think this is actually a bigger issue:
| parser generators have pretty poor error messages compared to
| even naive recursive descent parsers written with a library like
| megaparsec. There has been research on generating better error
| messages in generated parsers like menhir [1], but I haven't seen
| anyone apply it successfully; it requires you to manually
| annotate each error state with a message, a real parser can
| easily have thousands of them, and they can change each time you
| update your grammar.
|
| * Parser files are usually fairly illegible because they mix host
| language code with parser syntax and meta-variables. Some more
| modern libraries like Pest separate the two, which I think is a
| net improvement.
|
| Nevertheless, I hope that someone figures out a way to square the
| circle and frees us from handwriting parsers in all but the most
| exceptional cases.
|
| [1] https://hal.inria.fr/hal-01248101/
| moring wrote:
| I disagree to some extent:
|
| > Parser generator grammars are usually quite far from 'paper'
| versions
|
| The more useful parser input formats allow precedence and
| associativity annotations as well as optionals, quantifiers and
| quantifiers with separators.
|
| This really only works well together with your third point:
|
| > Parser files are usually fairly illegible because they mix
| host language code with parser syntax
|
| You don't need host code if the parser just outputs an AST.
| Neither quantifiers with or without separators nor optionals
| make the parser spec more complex if no host code is needed --
| the complexity is in the parser generator and AST generator.
|
| > Error messages suck
|
| I found error messages okay-ish when you view them in an IDE
| where it is more important _where_ an error occurred than what
| exactly is wrong (only applies to syntax errors). I also found
| it helpful to show the allowed input at that place that would
| NOT have resulted in a syntax error -- and the parser generator
| can emit a list of both terminals and nonterminals
| automatically, because this information is used by it to detect
| a syntax error in the first place.
|
| I also tripped over the often-cited problem that LALR parsers
| produce less useful error messages than LR(1) parsers. These
| days, just using LR(1) is underrated IMHO. In my
| experiments, the table didn't get too large (400kB parse table
| for a simple hardware-description language) and I would have
| tried to apply compression algorithms on the table instead of
| resorting to LALR, but that size was small enough to be
| acceptable to me. Only the run time of the table generation
| process was a bit annoying (10-20 seconds).
| jsmith45 wrote:
| I will say it. Using LALR(1) is a bug.
|
| Either use classic LR(1), or bison's IELR(1), which supports all
| LR(1) grammars (i.e. it will produce identical parses to classic
| LR(1)) but generates parse tables that are nearly the same size
| as LALR(1)'s.
|
| I'm not sure about the quality of IELR's error messages, but at
| least you don't get the new LALR-introduced conflicts that are
| hard to wrap your head around.
|
| See this nice analysis of it:
| https://news.ycombinator.com/item?id=24909161
| HarHarVeryFunny wrote:
| Your experience with the usefulness of error messages from
| top-down (recursive descent) vs bottom-up (table-driven) parsers
| is the opposite of common experience. The trouble with bottom-up
| parsers is that they inherently (by way of being bottom-up) have
| no top-down context, so all they can generate by default is a
| low-level "symbol XXX expected" type of error. A top-down parser
| knows what high-level construct it's trying to parse, so it is in
| a position to generate a more meaningful error.
|
| Of course, in either case the grammar _could_ be augmented with
| thoughtful error constructs designed to generate meaningful
| errors when parsed, but that doesn't seem to be common practice,
| given how awful compiler error messages tend to be. Try writing
| "std::cout" instead of "std::endl" and see what gibberish error
| message g++ generates. clang is generally better than g++ in this
| regard.
| logicalshift wrote:
| I wrote a parser generator quite a long time ago that I think
| improves the syntax quite a lot, and which has an interesting
| approach to generalisation: you can write conditions on the
| lookahead (which are just grammars that need to be matched in
| order to pick a given rule when a conflict needs to be
| resolved). This construct makes it much easier to write a
| grammar that matches how a language is designed.
|
| Here's an ANSI-C parser, for example:
| https://github.com/Logicalshift/TameParse/blob/master/Exampl...
| - this is an interesting example because `(foo)(bar)` is fully
| ambiguous in ANSI C: it can be a cast or a function call
| depending on if `foo` is a type or a variable.
|
| The new construct makes it possible to extend grammars and
| disambiguate them - here's a C99 grammar that extends the
| ANSI-C grammar: https://github.com/Logicalshift/TameParse/blob/
| master/Exampl....
|
| It also allows matching at least some context-sensitive
| languages - see
| https://github.com/Logicalshift/TameParse/blob/master/Exampl...
|
| An advantage over GLR or backtracking approaches is that this
| still detects ambiguities in the language, so it's much easier to
| write a grammar that doesn't end up running in exponential time
| or space; plus, when an ambiguity is resolved by the
| generalisation, which version wins is specified by the grammar
| rather than being arbitrary (backtracking) or left until later
| (GLR).
|
| I was working on improving error handling when I stopped work
| on this, but my approach here was not working out.
|
| (This is a long-abandoned project of mine but the approach to
| ambiguities and the syntax seem like they're novel to me and
| were definitely an improvement over anything else I found at
| the time. The lexer language has a few neat features in it too)
| [deleted]
| sshine wrote:
| What an excellent article that I agree almost entirely with.
|
| I'd like to add nuance in two places:
|
| > If you're writing a parser for a language which is ambiguous,
| you might be able to use the conflict resolution offered by YACC
| and friends. However, in general, I suggest avoiding YACC's
| conflict resolution if possible, as you have to at least somewhat
| understand the underlying LR algorithm to understand how
| conflicts have been resolved.
|
| I'd say: Specifically for binary operator precedence and
| associativity, definitely use conflict resolution of Yacc and
| friends instead of rewriting the grammar. An ambiguous grammar
| for binary operators might look like:
|
|       E -> E ^ E
|       E -> E x E
|       E -> E + E
|       E -> (E)
|       E -> num
|
| Disambiguating in Yacc and friends involves a few lines like:
|
|       %left '+'
|       %left 'x'
|       %right '^'
|
| whereas rewriting the grammar above to be unambiguous is a mess:
|
|       E0 -> EA E1
|       EA -> E1 + EA
|       EA -> e
|       E1 -> EM E2
|       EM -> E2 x EM
|       EM -> e
|       E2 -> E3 EP
|       EP -> ^ E3 EP
|       E3 -> num
|       E3 -> (E0)
|
| Disambiguating by resolving shift/reduce conflicts means fewer
| states, whereas disambiguating by rewriting the grammar means
| more states. Perhaps, as Laurence points out in this article wrt.
| LALR and SLR: This probably doesn't matter at all on modern
| computers.
|
| Any other shift/reduce or reduce/reduce conflict caused by some
| ambiguity other than specifically a binary operator, e.g. when
| you have multiple if-then-else-if-then-else and you want to know
| where an else belongs: you need to be quite into LR parser
| generators to understand what's happening; it took quite a lot of
| guessing and trial-and-error when I worked with them, and I've
| probably forgotten again. I'd probably document these as
| non-trivial choices if I were maintaining a Yacc-like parser, and
| I'd generally avoid syntax that is ambiguous in this way
| (although I don't think you can always do that).
|
| > If you need the best possible performance or error recovery,
| recursive descent parsing is the best choice. If you know you
| will need this and you're designing a new language, at least
| create an LL grammar, since from that you can derive a recursive
| descent parser. [Bear in mind that LL grammars are more annoying
| to write than LR grammars which is why I suggest avoiding LL
| unless you have this very specific use case.]
|
| I would say: If you get used to parser combinator libraries, you
| can get a fast, readable LL parser that is also fast to write and
| does not require extra compilation steps like practically every
| LR parser generator does. I.e. you can skip the part where you
| convert your grammar file into a native-$YOUR_LANG parser,
| because parser combinators are embedded DSLs, some of which are
| just as fast as a hand-written recursive-descent parser. Parser
| combinator libraries usually have excellent error reporting
| capability. (The best example I know is Haskell's diagnose
| package. [2]) And some of them have backtracking capability, so
| that for those few parts where your grammar isn't LL, you can go
| back and try something else.
|
| Modern parser combinator libraries like Haskell's Parsec /
| Megaparsec have helper functions for generating a parser that
| effectively translates an operator associativity/precedence table
| into an LL parser; so you get the nicely readable description,
| and the transformed LL parser that goes with it; see Megaparsec's
| makeExprParser [3]:
|
|       expr = makeExprParser term table <?> "expression"
|
|       term = parens expr <|> integer <?> "term"
|
|       table = [ [ prefix  "-"  negate
|                 , prefix  "+"  id ]
|               , [ postfix "++" (+1) ]
|               , [ binary  "*"  (*)
|                 , binary  "/"  div ]
|               , [ binary  "+"  (+)
|                 , binary  "-"  (-) ] ]
|
|       binary  name f = InfixL  (f <$ symbol name)
|       prefix  name f = Prefix  (f <$ symbol name)
|       postfix name f = Postfix (f <$ symbol name)
|
| [1]: https://stackoverflow.com/questions/898489/what-
| programming-... [2]: https://github.com/mesabloo/diagnose#error-
| reporting-made-ea... [3]:
| https://hackage.haskell.org/package/parser-combinators-1.3.0...
| ratmice wrote:
| > require extra compilation steps like practically every LR
| parser generator.
|
| It is worth noting that the author's lrpar tool (linked in the
| article) doesn't actually require extra compilation steps or
| code generation. This perhaps comes at the expense of less
| flexible lexing and parser actions. But lrpar comes with a tool,
| 'nimbleparse', which, given a grammar and an input file, parses
| the input file bypassing code generation. This reduces the
| grammar editing/testing cycle by quite a lot. Inspired by that I
| have been working on an LSP-based version of the tool for a
| while, to make the whole thing as responsive as possible...
| HarHarVeryFunny wrote:
| Encoding expression operator precedence into a grammar (via
| add-expr, mul-expr, etc., etc.) and then directly implementing it
| is very inefficient. For an N-level deep expression grammar, a
| recursive descent parser will end up making N-deep recursive
| calls for every terminal (variable, constant)! In a recursive
| descent parser it's more efficient just to have a single
| ParseExpression() function that uses the simple "precedence
| climbing" technique to handle operator precedence, which will
| result in at most one recursive call per operator (vs N per
| terminal). You could even eliminate the recursion completely by
| using an explicit stack instead.
|
| The same inefficiency applies to bottom-up table-driven parsers
| generated by parser generators too - in a naive implementation
| the parser will have to go through N steps of reduce actions to
| elevate any terminal up to a top-level expression. A smarter
| generator may eliminate 1:1 "unit productions" of the type
| add-expr := mul-expr in order to remove this inefficiency. As you
| say,
| the other way to handle precedence using a parser generator is
| using operator precedence & associativity ambiguity resolution,
| but this is a pain to develop.
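|
| A minimal, self-contained sketch of precedence climbing (my own
| illustration, not from the article; it evaluates as it goes
| rather than building an AST, to keep it short):
|
|       #include <iostream>
|
|       static const char *p;         // cursor into the input
|
|       long parseExpr(int minPrec);  // forward declaration
|
|       int prec(char op) {           // 0 means "not a binary operator"
|           if (op == '+' || op == '-') return 1;
|           if (op == '*' || op == '/') return 2;
|           return 0;
|       }
|
|       long parsePrimary() {
|           if (*p == '(') {
|               ++p;                  // consume '('
|               long v = parseExpr(1);
|               ++p;                  // consume ')'
|               return v;
|           }
|           return *p++ - '0';        // a single-digit number
|       }
|
|       // All operators here are left-associative, so the recursive call
|       // uses prec(op) + 1: the right operand may only contain operators
|       // that bind strictly tighter.  There is one recursive call per
|       // operator, not one per precedence level per terminal.
|       long parseExpr(int minPrec) {
|           long lhs = parsePrimary();
|           while (prec(*p) >= minPrec) {
|               char op = *p++;
|               long rhs = parseExpr(prec(op) + 1);
|               if (op == '+') lhs += rhs;
|               if (op == '-') lhs -= rhs;
|               if (op == '*') lhs *= rhs;
|               if (op == '/') lhs /= rhs;
|           }
|           return lhs;
|       }
|
|       int main() {
|           p = "1+2*3-(4+5)/3";
|           std::cout << parseExpr(1) << "\n";   // prints 4
|       }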
| sshine wrote:
| Thank you for mentioning "precedence climbing".
|
| I never had a need for a fast expression parser, so I never
| explored this option.
|
| For those curious,
|
| https://eli.thegreenplace.net/2012/08/02/parsing-
| expressions...
|
| It should be possible to combine this with some parser
| combinator libraries just the same.
| breck wrote:
| I enjoyed the article even though I'm in the opposite camp.
|
| > Context-free grammars, and their associated parsing techniques,
| don't align well with real-world compilers, and thus we should
| deemphasise CFGs (Context-Free Grammars) and their associated
| parsing algorithms.
|
| I agree with this. I think CFGs are highly overrated. Top-down
| recursive descent parsers are simple and allow you to craft more
| human languages. I think building top-down parsers is something
| every dev should do. It's a simple technique with tremendous
| power.
|
| I think the source code for Scroll
| (https://github.com/breck7/scroll/tree/main/grammar) demonstrates
| how liberating moving away from CFGs can be. Easy to extend,
| compose, build new backends, debug, et cetera. Parser, compiler,
| and interpreter for each node all in one place. Swap nodes around
| between languages. Great evolutionary characteristics.
|
| I'll stop there (realizing I need to improve the docs and write a
| blog post).
| choeger wrote:
| And no one will ever be able to implement a tool that works
| with the input language without reimplementing _everything_.
|
| CFGs are equivalent to modularity in language design. A choice,
| clearly, but one with significant upsides.
| breck wrote:
| > And no one will ever be able to implement a tool that works
| with the input language
|
| But what do you get with such a tool? Syntax highlighting and
| cryptic error messages on the syntax only. And then you hit a
| wall: if you want to offer more to your users (refactoring,
| autocomplete, etc.) you have to reimplement everything anyway.
|
| > CFGs are equivalent to modularity in language design. A
| choice, clearly, but one with significant upsides.
|
| My point above aside, you bring up a great point, and
| something I need to improve on a lot in Grammar (modularity
| so Tree Languages can be run on many host systems). Now my
| wheels are turning. Thanks!
| tlb wrote:
| Recursive descent parsers have trouble with left-associative
| operators if you only allow recursion. But if you allow
| iteration, you can build up the parse tree while iterating
| through terms at the same precedence level. For example, this
| defines + and - as left-associative at the same precedence level:
|     // EBNF is SUM = PRODUCT , { ("+" | "-") , PRODUCT }
|     AstNode *parseSum() {
|       if (auto lhs = parseProduct()) {
|         while (1) {
|           if (consume(T_PLUS)) {
|             // You'd call parseSum instead to get right associativity
|             if (auto rhs = parseProduct()) {
|               // Instead, left associativity happens here
|               lhs = mkAst(A_add, lhs, rhs);
|             } else {
|               return nullptr;
|             }
|           } else if (consume(T_MINUS)) {
|             if (auto rhs = parseProduct()) {
|               lhs = mkAst(A_sub, lhs, rhs);
|             } else {
|               return nullptr;
|             }
|           } else {
|             return lhs;
|           }
|         }
|       } else {
|         return nullptr;
|       }
|     }
|
| As a bonus, it doesn't blow out your stack when parsing long
| expressions.
|
| Very few binary operators should be right-associative. Basically
| just ?: and = .
| peterfirefly wrote:
| Seems like a relevant place to put a link to this paper:
|
| David R. Hanson, "Compact Recursive-descent Parsing of
| Expressions", 1985
|
| https://drhanson.s3.amazonaws.com/storage/documents/compact....
|
| There are languages (such as ML) that allow the definition of
| new operators with a choice of precedence and associativity.
| Using a table like the one Hanson suggests and modifying it at
| compile-time is an obvious way to handle that feature.
| bruce343434 wrote:
| When you do precedence like this in recdesc, you will get
| functions like parse_prec_1, parse_prec_2 and so forth. If you
| generalize this function body, and pass along a "current
| precedence", then generalize the precedences for each operator
| into a table, you have just reinvented Pratt Parsing.
| ufo wrote:
| The price for this is that you have to refactor the grammar,
| replacing all uses of left recursion with right recursion
| (iteration).
| bear8642 wrote:
| Which might then change the semantics of your grammar - I
| remember this being pointed out in my 3rd-year compilers course.
| pklausler wrote:
| For the Fortran parser in f18 ("LLVM Flang"), I ended up writing
| a small suite of C++17 constexpr parser combinators to build a
| recursive descent parser over a well-cooked character stream. A
| grammar encoded with parser combinators is no less readable than
| the input to a parser generator, and has the advantages that you
| can easily automate the construction of a strongly typed parse
| tree, perform custom error recovery, enable and disable language
| features, and use custom token recognizers for some of the
| ancient weirdness that makes Fortran so fun.
|
| One thing that I did this time that turned out to be a good idea:
| run all source preprocessing and file inclusion first, before
| doing any parsing; deal with continuation lines and comments, put
| the "cooked" character stream that results into a single
| contiguous block of characters, and use _that_ for parsing. It's
| really fast, and makes backtracking trivial. Add a data structure
| that maps every character in the cooked stream back to its source
| provenance (input file, macro expansion, whatever) and you can
| derive good source locations for error messages from just a basic
| char pointer. (Add a reverse mapping and you've got the basis of
| an IDE's language server, too.)
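|
| Not the f18 code, but a minimal sketch of the shape (ordinary
| run-time combinators rather than constexpr ones, with invented
| names): a parser is a function from an index into the cooked
| stream to an optional (value, new state), so backtracking is just
| reusing the old index.
|
|     #include <cctype>
|     #include <cstddef>
|     #include <functional>
|     #include <iostream>
|     #include <optional>
|     #include <string>
|     #include <utility>
|
|     // The cooked source is one contiguous string; parser state is an index.
|     struct State { const std::string *src; std::size_t pos; };
|
|     template <typename T>
|     using Parser = std::function<std::optional<std::pair<T, State>>(State)>;
|
|     // Match a single character satisfying a predicate.
|     Parser<char> satisfy(std::function<bool(char)> pred) {
|         return [pred](State s) -> std::optional<std::pair<char, State>> {
|             if (s.pos < s.src->size() && pred((*s.src)[s.pos]))
|                 return std::make_pair((*s.src)[s.pos],
|                                       State{s.src, s.pos + 1});
|             return std::nullopt;
|         };
|     }
|
|     // Ordered choice: try p; on failure "backtrack" by reusing the old state.
|     template <typename T>
|     Parser<T> orElse(Parser<T> p, Parser<T> q) {
|         return [p, q](State s) { auto r = p(s); return r ? r : q(s); };
|     }
|
|     int main() {
|         std::string cooked = "x";
|         auto isAlpha = [](char c) {
|             return std::isalpha((unsigned char)c) != 0;
|         };
|         auto isDigit = [](char c) {
|             return std::isdigit((unsigned char)c) != 0;
|         };
|         Parser<char> alnum = orElse(satisfy(isAlpha), satisfy(isDigit));
|         auto r = alnum(State{&cooked, 0});
|         std::cout << (r ? r->first : '?') << "\n";   // prints x
|     }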
|
| TL;DR: Build-time parser combinators are worthy of your
| consideration for your next parsing task.
| coldcode wrote:
| Oddly enough my first project as a programmer in 1981 was writing
| a recursive descent parser in Fortran 78 (for Jovial language).
| Since I had no degree in CS, I previously had zero clue what a
| parser even was. Fun way to start 4 decades of programming.
| jll29 wrote:
| I respect this article a lot (and I appreciated the Rust compiler
| generator pointer in it!); my own personal take is one of partial
| agreement, but I would like to raise two points here:
|
| 1. Part of compiler construction courses is to instill in the
| students the notion of meta-linguistic abstraction (SICP), namely
| solving problems by creating and implementing domain-specific
| languages (DSLs) that permit declarative statements about the
| domain in a natural and therefore maintainable notation, separate
| from an engine that can interpret or compile it. Compiler classes
| are ideal for teaching about DSLs because the topic is parsing
| and compiling/interpreting anyhow. But not only that: students
| can see that compiler people do as they say, since compiler
| construction itself is often, though not always, done using more
| than one DSL (one for lexers, one for grammars, sometimes even
| more). Parser generators are also used outside of programming
| languages; for example, in computational linguistics formal
| grammars are often used to model human language morphology,
| phonology or syntax (with ambiguous grammars). Tomita's GLR
| parsing - mentioned elsewhere in this thread - comes from that
| application area.
|
| 2. Another part of compiler construction tertiary education is to
| get across the utility of theory for practical applications and
| the closeness of theory and practice in computer science: formal
| language theory is beautiful and well-developed, and showing
| students that it is also useful, because it gives runtime and
| space complexity guarantees and systematic approaches that reduce
| faulty implementations, is a wonderful thing.
| mjburgess wrote:
| One could also interpret the "failure" of compiler writers to
| do this as evidence that the "DSL paradigm" for programming
| itself is, at best, extremely limited.
|
| One obvious problem is that the semantics of a DSL aren't
| extendable in that DSL, making it highly brittle. So, e.g., with
| parsing, as soon as you need bespoke syntax-error handling, etc.,
| you're leaving the world of any DSL you can create, since
| "bespoke" here will ultimately mean a semantic specialisation
| that no generic DSL will support.
|
| At the point where your DSL enables these arbitrary
| specialisations, it is just the host language.
| crapaud23 wrote:
| In Clojure, the adage says "when building a DSL, favor data over
| functions over macros". When I write a DSL I start by writing a
| sample of it as pure data (think of Clojure as a data-processing
| oriented language with embedded JSON; to be exact, the format is
| called EDN). Then I write the parsing function. What I do,
| though, is make sure the parser is extensible. To do so I cast
| any data element that this function parses into a corresponding
| function, except if that element is already a function. This way
| I can directly write custom functions within the DSL. This is
| very handy when handling conditions and their composition through
| function combinators: I can just reuse the language's boolean
| logic (if, if-not, switch-cases etc.) and don't have to
| reimplement it within the DSL. Once the DSL data has been parsed
| into functions, they are then composed through classical function
| composition available in the language.
|
| Sometimes I also wish I had access to the DSL through macros,
| mostly to optimize those DSLs. The situations where I have found
| myself needing this required me to affect the compiler's state in
| some way, so I didn't carry my investigations forward on this
| subject, and I'm very interested in the development of
| Lux/luxlang since it comes with macros that are passed the
| compiler's state monadically.
| hinkley wrote:
| The DSL paradigm _is_ extremely limited, because there are
| almost no people who are good at writing programming
| languages that work for developers instead of the other way
| around.
|
| XML didn't fail because of some technical limitation. It
| failed because it tried to turn us all into language
| designers. At worst JSON asks you for a schema, and that was
| an afterthought in both cases.
| oldsecondhand wrote:
| The problem with XML was slow parsing and the horrible
| syntax of XSD schemas. Turning attributes into elements is
| also a pain in the ass. And for a lot of applications
| turning whitespace into elements is unnecessary.
| hinkley wrote:
| Those are confounding factors. If you made any other
| product with the same goals, but faster parsing and
| better schema syntax, probably by integrating the two,
| then you'd still end up with the same mess because
| defining a language is worrying about a million corner
| cases and not being able to fix more than half of them
| once they ship.
|
| It's a hard problem. Probably the hardest.
| duped wrote:
| XML didn't fail.
| danielheath wrote:
| That's why DSL-in-ruby is a moderately popular technique; the
| host language is expressive enough to host a DSL without a
| custom parser.
| seanmcdirmid wrote:
| DSL in Kotlin is popular as well.
| silon42 wrote:
| Give me an xml/json schema any day over this.
| [deleted]
| sakras wrote:
| > The first is performance. Although there aren't, to the best of
| my knowledge, modern performance numbers, it is reasonable to
| assume that recursive descent parsing is generally faster than LR
| parsing.
|
| Why is this safe to assume? My understanding is that shift-reduce
| parsers in general are faster than recursive-descent ones because
| they have a constant overhead per production rule, rather than
| potentially repeatedly creating a tree and then bailing out to
| try a different production rule if it doesn't work.
| choeger wrote:
| Excellent article.
|
| One point though: the author compares context-free grammars with
| LR grammars on the one hand, and recursive descent (LL) parsers
| with LR parsers on the other.
|
| But in theory, recursive descent parsers are context-free as
| well. It's just so damn simple to add context to such a parser.
| And whenever someone does that, the language design itself
| suffers.
|
| As an example, consider a language like ML, where you want to
| distinguish module-level identifiers from value-level ones. In
| context-free languages, the tokens actually differ (e.g.,
| starting with capital letters). If you add context to your parser
| you might think that constraint can be relaxed. But how do you
| handle implicit imports then?
| bjourne wrote:
| > In practise, a recursive descent parser can be thought of as
| meaning "a hand-written written parser": unless you go out of
| your way to do otherwise, nearly all hand-written parsers are
| recursive descent parsers.
|
| I don't think this is true. Yes, almost all hand-written parsers
| are recursive descent (top-down), but not all recursive descent
| parsers are handwritten. PEG parsers are top-down, but can be
| generated very similarly to bottom-up parsers. See for example
| PyParsing. Packrat parsing seems to me to be the best of both
| worlds, and there is not much use for bottom-up parsing anymore.
| jcranmer wrote:
| Here are my thoughts as to what should be covered with regards to
| parsing in a compiler course context:
|
| * The lexing/parsing split, and that lexing is basically applying
| regexes over and over again. (The technique of using a regular
| expression engine to write your own lexer is a fun idea, but
| probably should be cut for time).
|
| * What "context-free grammar" means, and the more-or-less
| standard syntax for writing CFGs.
|
| * Explain how to build a parser with recursive descent as the
| example. But...
|
| * Demonstrate expression parsing that supports precedence and
| both left- and right-associativity. We know how to do this very
| easily algorithmically (there are several different methods, in
| fact), it's very natural to describe in this manner, and it's so
| goddamn common for languages to have precedence. (As a side note,
| expressing precedence via AddExpr/MulExpr in grammar should be
| covered here, since it's commonly written as such in grammars,
| but if I were ever to build a parser generator, I'd support a
| syntax that didn't require manual factoring like that).
|
| * Introduce LL and LR parsing _in theory_. Do not introduce
| first/follow sets, or the state machine construction for an LR
| parser. Instead, the points to cover are what makes languages
| hard to parse (e.g., ambiguity). Strict adherence to an LL/LR
| grammar isn't necessary and is sometimes counterproductive
| (left-factoring LL grammars is the case in point here).
|
| * Do cover dangling-else ambiguity at some point. It's a great
| example of something that wasn't discovered until people started
| applying theoretical tools to existing languages (in this case,
| ALGOL).
|
| Like the author, I think there is some merit to LR parsing as a
| theoretical concept, but the emphasis placed on it in existing
| compiler courses and textbooks is way too high. The end goal of
| parsing theory, for most students, should be to understand how to
| build languages that admit unambiguous parsers that require no
| backtracking, and LR parsers cover a very large subset of those
| languages. It is also helpful to get people to think about lexing
| and parsing in more formal terms--I have definitely seen a few
| people go "oh, that makes things much easier!" when they see the
| divide between lexing and parsing.
| scotty79 wrote:
| I really enjoyed using Rust nom crate that lets you build
| arbitrary parsers using parser combinator approach.
| pcj-github wrote:
| As a self-taught programmer with no formal education, I figured
| writing a lexer/parser toolkit would be a good way to bootstrap
| my CS knowledge (about 20 years ago). I went down the rabbit hole
| for months on it, really glad I did.
|
| I can totally relate to the author's interest in error correction
| of LR parsers, it's a fascinating topic that I also was nerd-
| sniped by at the time:
| https://github.com/inxar/syntacs/blob/6461578a04d3b0fda5af20...
| ThePhysicist wrote:
| I was quite impressed with GLR parsers, as they can deal really
| well with ambiguity and are relatively easy to implement. They
| have limited adoption though as they have high constant overhead
| and hand-written recursive-descent parsers can easily outcompete
| them for most languages. The Elkhound paper [1] explains GLR
| parsing in detail and is quite enjoyable to read.
|
| 1:
| https://people.eecs.berkeley.edu/~necula/Papers/elkhound_cc0...
| 082349872349872 wrote:
| The point of TFA is: why reach for GLR, why even pay a cost for
| ambiguity, if you have control over the grammar and can specify
| an unambiguous one to begin with?
|
| IMNSHO we should be aware of "big hammer" solutions but not be
| afraid to pick concrete representations when they make things
| easy on the machine. (in particular, tests using the former are
| wonderful double checks for deployment candidates using the
| latter)
|
| [when would specifying an unambiguous grammar be premature
| optimisation? when changing the language later is cheap
| (unlikely!), or when the strings in that language are always
| going to be of size <= k. Nerd snipe: what --to order of
| magnitude-- is _k_ ca. 2023?]
| nine_k wrote:
| Unambiguous grammars are easier not just for machines. Most
| importantly, they matter to humans. Reading something like
| Lua or Golang or SQL is usually much easier than C++, to say
| nothing of Perl.
| sshine wrote:
| > we should be aware of "big hammer" solutions
|
| Parser combinators have become my preferred "medium hammer"
| strategy:
|
| Sometimes you get a more readable parser than if you use
| regex, even though your language is regular.
|
| Sometimes you get corner cases where you have to think and an
| LR parser generator wouldn't require thinking.
|
| On the other hand, you have one tool that almost always does
| the job well, is readable, and embeds nicely.
| HarHarVeryFunny wrote:
| In terms of education I disagree about the importance of "LR"
| (really meaning bottom-up, table driven, automatically generated)
| parsers vs recursive descent (top-down, typically hand written),
| for a couple of reasons:
|
| 1) Real world compilers are almost always hand written recursive
| descent. They just turn out to be more flexible, easier to
| maintain, generate better error messages, etc.
|
| 2) Teaching compiler writing as part of a computer science degree
| is somewhat of a tradition, but the number of students who will
| ever write a compiler (even for a small DSL, let alone a real
| language) is minimal. I think it's a useful exercise to show how
| these things are structured and to teach general
| "divide-and-conquer" decomposition of complex programs, but
| that's about it.
| Given the number of moving parts in a compiler, the emphasis
| should be on an overview of all the parts and how they fit
| together, not going overboard on parsing and teaching two
| different parsing techniques. To the extent that any Comp-Sci
| student will ever end up needing to write a parser at some point
| in their career (say for a DSL, or to parse some data format),
| then recursive descent is the more useful one to know.
|
| Parser generators seem largely a historical approach from a time
| (the 1970's) when computers were so slow that the speed of a
| table-driven parser was a major advantage, and the idea is firmly
| implanted in the heads of those of us that grew up reading Aho &
| Ullman's compiler-writing "Dragon Book". I think they still serve
| a purpose, but perhaps for more fringe use cases rather than
| writing actual compilers for full-blown languages!
|
| Having said all that, there's a lot of _fun_ in writing a parser
| generator, regardless of how useful the end result may be! I
| wrote a full LR(1) - not YACC-like LALR(1) - parser generator in
| Modula-2 back in the early 80's, at a time when the textbooks
| told you this was impractical/impossible due to the size of a
| canonical Knuth LR(1) parser.
| [deleted]
| not2b wrote:
| There's been a movement away from LR parsing in favor of
| recursive descent, and it isn't only that many languages are not
| LR(1) (for example, C++ can require infinite lookahead). It's
| also that even if the grammar of the language is LR(1), the
| errors that users will make often don't fit cleanly into the
| Yacc/bison structure, so error messages will often be poor
| quality and improving them can feel like unstructured hacking.
| It's much easier to produce high-quality error messages with a
| recursive descent parser. Recursive descent is just closer to how
| people think, so language designers generally create languages
| that work well with it.
|
| Now, if LR parsing were a simple topic, it would be OK for
| compiler classes to spend time on it. But it's complex enough
| that teaching it adequately takes a lot of time, and that time
| could be better spent on any number of other topics. Get the
| students to the point where they have a clean abstract syntax
| tree quickly and go from there.
| bitL wrote:
| LL(1) recursive descent isn't very powerful; you might hack SQL
| with it but you still won't do C++ with it.
| krupan wrote:
| In his linked Which Parsing Approach article he says this,
|
| "Of course, I'm clearly somewhat biased, so perhaps these words
| from Guy Steele might be more compelling:
|
| '[Be] sure that your language will parse. It seems stupid ... to
| sit down and start designing constructs and not worry about how
| they will fit together. You can get a language that's difficult
| if not impossible to parse, not only for a computer, but for a
| person. I use Yacc constantly as a check of all my language
| designs, but I very seldom use Yacc in the implementation. I use
| it as a tester, to be sure that it's LR(1) ... because if a
| language is LR(1) it's more likely that a person can deal with
| it.'"
|
| I'm just starting to dabble in writing a compiler and this sounds
| pretty reasonable to me, basically using Yacc as a linter for
| your grammar but not for actually generating your parser. What do
| more experienced people here think of that?
| gavinhoward wrote:
| > What do more experienced people here think of that?
|
| "Experienced" person here. I've done a parser generator grammar
| or two. I've implemented several parsers.
|
| I mostly agree, but it depends on your use case.
|
| Most people writing parsers are doing so for a bespoke
| language, usually for config of an internal-use-only program or
| something like that. Such parsers won't be great, but they will
| work for people that:
|
| 1. Know the language, or
|
| 2. Are expected to know the language.
|
| In that case, I agree with Guy Steele.
|
| Why these two caveats? Because as people have said, error
| reporting from the parser will be subpar. This includes hand-
| written recursive descent ones. So knowing the language is
| important.
|
| (The second item is specifically for situations where new
| employees might not know the language yet, but it's part of
| their job, so they _will_ learn it.)
|
| This is, by far, the greatest number of parsers in use. I would
| estimate 99.5% or more.
|
| Of course, most people don't think about those parsers; they
| think about the hardened parsers for general-purpose languages
| in wide use, but while they get a lot of attention, they are
| few and far between.
|
| Guy Steele's advice doesn't apply as well to such situations.
| For me, I wrote a parser for an existing language. I had to
| write the parser to match that language, and that language had a
| lot of baggage. His advice made no sense in that case.
|
| For the parser I'm currently writing, it _also_ makes no sense.
| This is for what I hope will become a widely-used general-purpose
| language. Sure, I could try, but I don't think it would work.
|
| To see why, look at perhaps the two newest big languages: Rust
| and Zig. Both are not that old (Rust 15 years-ish, Zig 8 years-
| ish?). And yet, I suspect both use handwritten parsers. I
| suspect both are recursive descent.
|
| Why is this? These languages shouldn't have that much legacy
| baggage, like C or C++, right?
|
| It doesn't matter. General-purpose languages have to be
| _everything_ to _all possible users_. This means they will have
| to be able to do some crazy things. And to do crazy things with
| plain text, well, you need as much power as you can get.
|
| This is why general-purpose languages basically _always_ become
| context-sensitive. Context-sensitivity allows a recursive
| descent parser to bring the full power of Turing-completeness
| to bear on a problem.
|
| Guy Steele also alludes to this in his classic "Growing a
| Language" talk. [1] He talks about how you need to create a
| language that _users_ can grow. There are easy ways to do this
| (like having functions), but the easy ways are all limited. If
| there is something that must be a language feature, like format
| printing [2], it can't be added by users.
|
| Adding those other features usually requires context-
| sensitivity. Format printing definitely does when it allows you
| to use variable names in the format string.
|
| But what if you make a language that any user can extend in any
| way, even if it's not easy? The result is my language, Yao.
| (Example at [3].) It's context-sensitive because it _has_ to
| be.
|
| So basically, Guy Steele is right for everybody _but_ the
| creators of big general-purpose languages. But most people
| think he's talking about the big general-purpose languages. He
| might have been, but if he was, he's wrong.
|
| That doesn't mean that I don't consider the ease of parsing
| when I add a feature. I definitely do. But I come at it from
| the _human_ perspective, how readable it is to a _human_. This
| tends to give me much the same thing while giving me power.
|
| [1]: https://www.youtube.com/watch?v=lw6TaiXzHAE
|
| [2]: https://brevzin.github.io/c++/2023/01/02/rust-cpp-format/
|
| [3]:
| https://git.yzena.com/Yzena/Yc/src/commit/66185f4ff4fb779e04...
___________________________________________________________________
(page generated 2023-01-17 23:01 UTC)