[HN Gopher] Parser generators vs. handwritten parsers: surveying...
___________________________________________________________________
Parser generators vs. handwritten parsers: surveying major
languages in 2021
Author : eatonphil
Score : 91 points
Date : 2021-08-21 17:45 UTC (5 hours ago)
(HTM) web link (notes.eatonphil.com)
(TXT) w3m dump (notes.eatonphil.com)
| SirensOfTitan wrote:
| Does anyone recommend any resources for learning how to write
| handwritten parsers nowadays? I've written a couple simple ones
| for various tasks before, but I'd really love to learn more about
| it.
| adamgordonbell wrote:
| This book is great: https://interpreterbook.com/
|
| It is sort of a walk-through of a small programming langauge
| written in go, using a recursive descent parser.
|
| Crafting Interpreters is good as well, though I only read part
| of it, because it wasn't done at the time I read it.
| https://craftinginterpreters.com/
|
| I've also interviewed both of them, if you like an audio intro:
| https://corecursive.com/032-bob-nystrom-on-building-an-inter...
| junon wrote:
| Learn what lexing is. That's step one. Once you have a token
| stream, look into how you'd transform them into structures.
|
| PUB FN ID(main) OPAREN CPAREN COLON ID(i32) OBRACE NL RETURN
| NUMBER(0) CBRACE
|
| Look familiar? How might you parse that into something like
|
| { kind: "function", name: "main", args: [], public: true,
| return_type: { kind: "integer_type", width: 32 }, body: [ {
| kind: "return", expression: { kind: "number_literal", value: 0
| } } ] }
|
| A recursive descent parser is just a parser that starts at a
| root node type and recursively traverses a token stream and
| calls handler functions based on what's next. Those functions
| might also call other handlers (e.g. for nested expression
| parsing, etc.) which is why it's called recursive.
| eigenhombre wrote:
| I enjoyed learning that Julia's parser is written in Scheme.
| adamgordonbell wrote:
| The Ruby yacc file is scary to look at. 13+ thousand lines in a
| single file.
|
| Would it be better with hand rolled and they could have
| abstracted and organized somethings or does it all make sense in
| its current format if you are familiar with it?
|
| https://github.com/ruby/ruby/blob/v3_0_2/parse.y
| chrisseaton wrote:
| I have a PhD in compiling Ruby and I struggle to read and
| modify that file.
| adamgordonbell wrote:
| Ruby is very flexible. Is that the issue?
|
| I suspect not, because Scala is super flexible and probably
| more difficult to parse, but has a better organized parser.
| chrisseaton wrote:
| The language is just quite large and quite complex to parse
| - context-sensitive and requires interaction back and forth
| with the lexer as it parses. TreeSitter seems to have a
| much simpler Ruby parser - but I believe it ignores the
| context-sensitive parts of Ruby.
| reikonomusha wrote:
| For serious projects, I find myself typically resorting to making
| a hand-written parser (usually recursive descent) because it's
| the only way I can really get excellent and contextual error
| reporting and recovery. Parser generators _are_ computer science
| marvels, but for the most part they're typically too hard to
| customize well enough to produce a sensible error to the user,
| with line numbers, token stream provenance, and suggested
| recovery information. I can't imagine the difficulties of trying
| to get good messages for something as large and pervasive as,
| say, C.
|
| I think that with enough work, we can adapt and evolve parser
| generators to mesh well with better error reporting, and give the
| programmer more options and control, but it'll take some elbow
| grease and probably breaking backwards compatibility with the
| production syntax of these parser generator systems. There's also
| a risk, of course, that if there's _too_ much customization, you
| sort of lose the value of having a parser generator DSL in the
| first place.
| eddieh wrote:
| I'm always shocked to find major projects that use parser
| generators. A hand-written recursive descent parser is easier
| to write, read, understand, and modify. It's almost trivial to
| translate a language specification's grammar productions into a
| recursive descent parser.
|
| So looking at the list I'm dumbfounded that CPython and Ruby
| use generators.
| jasonwatkinspdx wrote:
| You won't find me defending the use of yacc, but in general
| hand written parsers are far more error prone.
| dataflow wrote:
| > It's almost trivial to translate a language specification's
| grammar productions into a recursive descent parser.
|
| It's also potentially quite slow.
| billconan wrote:
| I really hope to know why people choose one parsing algorithm
| vs another.
|
| I implemented an earley parser, because from what I read on
| wikipedia, it seems to be more advanced.
|
| "Earley parsers are appealing because they can parse all
| context-free languages, unlike LR parsers and LL parsers, which
| are more typically used in compilers but which can only handle
| restricted classes of languages."
|
| however I seldom see languages use Earley parser, there must be
| a reason, but I've never seen anybody explaining why choosing
| one algorithm over another.
| armchairhacker wrote:
| Most languages have a simple, unambiguous syntax, so LL or LR
| is fine. LL or LR is almost certainly faster then Earley,
| since in general more restricted = faster.
|
| As the above commenter mentioned, most language designers
| make hand-rolled parsers. Making a hand-rolled LL or LR
| parser is easier too.
|
| In general most people think advanced = bad, they want the
| easiest solution which gets the job done well.
| UncleEntity wrote:
| > LL or LR is almost certainly faster then Earley, since in
| general more restricted = faster.
|
| Earley's algorithm should be able to be trivially
| parallelized which might make it a contender for languages
| which are ambiguous (like most real world languages) where
| they are hand-rolling parsers. I haven't tried since I have
| no real need but looking at the steps I can't see a reason
| it couldn't do its work across multiple threads.
|
| Honestly, other than JavaScript 1.1 I can't think of a
| popular language which has an unambiguous syntax and I
| _really_ like playing with language grammars for some odd
| reason -- probably wrong though...
| eesmith wrote:
| From what I infer from articles like
| https://jeffreykegler.github.io/personal/timeline_v3 , the
| original Earley paper had a bug, and wasn't a good fit for
| 1960s hardware, and had poor performance for some types of
| grammars.
|
| By 1991, "Most researchers see the Parsing Problem as
| "solved" -- a closed issue. Earley parsing is almost
| forgotten, and Leo's discovery is ignored. Two decades will
| pass before anyone attempts a practical implementation of Leo
| 1991."
|
| It takes Aycock and Horspool's work in 2002 and Kegler's work
| in 2010 in Marpa to have a "practical implementation"
| (quoting that link).
|
| (I quote that also because Aycock distributed SPARK, an
| Earley parser, which was included as part of the Python
| distribution, in the Parser/ subdirectory, and a couple of
| people here on HN report having used it.)
| UncleEntity wrote:
| > I quote that also because Aycock distributed SPARK, an
| Earley parser, which was included as part of the Python
| distribution, in the Parser/ subdirectory, and a couple of
| people here on HN report having used it.
|
| That one is really the only Earley parser I've found used
| in the wild (don't know what marpa is used for) and
| unfortunately it is mostly unhackable because they did some
| serious optimization voodoo on it so it was replaced by a
| hand-written recursive decent parser a while back because
| nobody in the world could figure how it works[0] -- which
| is kind of strange since ASDL is super simple to parse and
| the generator which used spark was meant to check files
| into source control but, whatever.
|
| Its easy to play around with but not a great source if you
| want to see how an Earley parser is put together. There are
| also some bugs with parser action on duplicate rules not
| working properly that were pretty easy to fix but python
| pulled it out of the source tree so no upstream to send
| patches to?
|
| [0] might be making that part up, dunno?
| DylanSp wrote:
| nearley.js is an Earley parser that sees at least some
| use. https://nearley.js.org/
| eesmith wrote:
| You are one of the "couple of people" I was referring to.
| :)
|
| I know SPARK's docstring use influenced PLY.
|
| PLY doesn't use Earley, but "Earley" does come up in the
| show notes of an interview with Beazley, PLY's author, at
| https://www.pythonpodcast.com/episode-95-parsing-and-
| parsers... . No transcript, and I'm not going to listen
| to it just to figure out the context.
|
| https://github.com/lark-parser/lark "implements both
| Earley(SPPF) and LALR(1)".
|
| Kegler, the author of that timeline I linked to, is the
| author of Marpa. Home page is
| http://savage.net.au/Marpa.html . The most recent HN
| comments about it are from a year ago, at
| https://news.ycombinator.com/item?id=24321395 .
| sirwhinesalot wrote:
| Early and similar algorithms give you a parse forest rather
| than a parse tree. For a programming language you don't want
| to have your grammar be ambiguous. When Early gives you a
| bunch of alternative parse trees, how do you disambiguate?
| pfdietz wrote:
| If you are parsing incorrect programs, you want resilience
| in the parser, including considering alternate parsings and
| generation of parses of sections of the file.
|
| Consider the problem of parsing a C program that has NOT
| been put through the preprocessor. You want to be able to
| do this for structure editors and source-to-source
| transformation systems.
| rightbyte wrote:
| It is quite possible to redifine much of the syntax with
| the cpp. Parsing without the cpp is hopeless.
|
| My favourite is Bourne's longing for Algol:
| https://minnie.tuhs.org/cgi-
| bin/utree.pl?file=V7/usr/src/cmd...
| pfdietz wrote:
| Of course parsing without cpp is, in general, hopeless.
| But for something like a structure-aware editor you don't
| need to parse it perfectly, just "good enough". Ditto for
| a source-to-source transformation system.
| chrisseaton wrote:
| If your grammar is ambiguous, then an ambiguous program
| isn't incorrect.
| pfdietz wrote:
| It's not that the grammar is ambiguous, it's that the
| thing you're parsing may not even be in the language
| generated by the grammar. But you want to parse it as
| well as possible, and there may be more than one way to
| do that.
| billconan wrote:
| in my implementation, I give each rule a priority. I always
| choose the one with the highest priority to expand. I
| basically followed this https://loup-
| vaillant.fr/tutorials/earley-parsing/parser
| xyzzy_plugh wrote:
| I've written parsers for context-sensitive grammars, and
| as it turns out, this is not a desirable feature of a
| language. The reason you typically see simple parsers is
| because simple grammars are usually unambiguous and
| context-free, which is more convenient for humans to
| understand and internalize.
|
| After all, when you're writing software, you pretend to
| be the compiler to some extent.
|
| Natural languages, like English, are a good example of
| something which humans struggle with because they are
| complex, ambiguous, and often require context. Sure, it's
| extremely expressive, but that is the sharpest double-
| edged sword in programming languages.
| sirwhinesalot wrote:
| Neat, though this just seems like PEG with extra steps
| and more pain. Still this answers my question. Thanks!
| account-5 wrote:
| I'd love to learn about parsers but I need an ELI5 version.
|
| I've written a parser for CSV which I think works very well but
| I'm self taught and have no idea if it's any good.
|
| Any 'parsers for dummies' guides?
| prestonbriggs wrote:
| Compilers have parsers, not languages. Many languages have more
| than one compiler (e.g., C), and the various compilers may use
| different techniques for parsing.
| Zababa wrote:
| To be even more pedentic: implementations have parsers. Some
| programming languages are not compiled.
| eatonphil wrote:
| That's right! But HN has a title length limit. See the (actual)
| blog post title and the blog post contents. :)
| chrisseaton wrote:
| Nobody was mistaken about this fact.
| Zababa wrote:
| > Developers often think parser generators are the sole legit way
| to build programming language frontends, possibly because
| compiler courses in university teach lex/yacc variants.
|
| They're far from the only way, handwritten parsers are often
| mentionned as having way better error recovery. But if you're
| making a new configuration language, programming language, or
| something like that, ensuring that it has LR-compliant grammar
| and that this grammar is the source of truth can avoid a lot of
| pain later down the road.
| mathgladiator wrote:
| When I switched from ANTLR to hand written for Adama (
| http://www.adama-lang.org/ ), I felt way better about things. I
| was able to get sane error messages, and I could better annotate
| my syntax tree with comments and line/char numbers.
|
| A killer feature for a parser generator would be the ability to
| auto-generate a pretty printer which requires stuffing comments
| into the tree as a "meta token".
| fjfaase wrote:
| I implemented an unparse function in IParse, which is not a
| parser generator, but a parser that interprets a grammar. See
| for example
| https://github.com/FransFaase/IParse/blob/master/software/c_...
| where symbols starting with a back slash are a kind of white
| space terminals during the unparse. For example, \inc stands
| for incrementing the indentation where \dec decrements it. The
| \s is used to indicate that at given location a space should be
| included.
| rurban wrote:
| Perl uses yacc, but with extreme lexer tricks to be dynamic.
|
| All lisps use handwritten parsers, in under 100 lines.
| aarchi wrote:
| Are there any significant languages that do parsing with
| derivatives?
| mhh__ wrote:
| What are derivatives in this context?
| drdeca wrote:
| From what I can find, in the case of regexes, the derivative
| of a regex R with respect to a character c, is the regex
| \partial_c R , which matches a string S iff R matches c S.
| So, the derivative of (ab)* with respect to the character 'a'
| , would be b(ab)* .
|
| I imagine that the same idea applies for more general
| languages, so, I expect the derivative of a language with
| respect to a character, is the language of strings such that
| prepending one by the character results in a string of the
| original language.
|
| So, if R_1 and R_2 are two different regexes, then the
| derivative with respect to c of the regex R_1 | R_2 , i.e.
| \partial_c (R_1 | R_2) , would be (\partial_c R_1) |
| (\partial_c R_2),
|
| and, if + is used for concatenation of two regexes, the
| derivative of R_1 + R_2 would be, ((\partial_c R_1) + R_2) |
| ((if R_1 accepts the empty string, then the language which
| accepts only the empty string, otherwise, the language that
| does not accept any string) + (\partial_c R_2))
|
| etc.
|
| ... I think.
| olzd wrote:
| Probably Brzozowski's derivative of regular expressions, e.g
| https://matt.might.net/papers/might2011derivatives.pdf
| muuglay wrote:
| It seems strange that in an effort to solve a common problem that
| 80% of the success stories opt to not use the automation.
| Perhaps, there is ground for another toolkit
| mhh__ wrote:
| Work on a compiler for a bit and I guarantee you'll get it.
| Parser generators pretty much always get in your way
| shoulderchipper wrote:
| Building an AST is an easy task once you've figured out your
| grammar and error recovery mechanism, but these are essentially
| a kind of language-specific design work that can not be
| automated.
| [deleted]
| EamonnMR wrote:
| There's something to be said for a tool that a novice can use (my
| first hand written recursive descent parser was, no joke, because
| I wasn't up to the task of including a json library in my Java
| project) but still useful for professionals.
| veltas wrote:
| > Clang. Not only handwritten but the same file handles parsing
| C, Objective-C and C++.
|
| And it's under 3000 lines!
|
| https://github.com/llvm/llvm-project/blob/llvmorg-12.0.1/cla...
| NilsIRL wrote:
| It seems to me that the parsing code in clang is distributed
| over multiple files which together are way more than 3000
| lines: https://github.com/llvm/llvm-
| project/tree/llvmorg-12.0.1/cla...
| recursivedoubts wrote:
| I took the compilers class at Stanford and never really
| understood the algorithms of bottom up parsing, or even really
| how grammars worked. I just made the tool work.
|
| I then went to work at a private company, and an older guy who
| had gone to a state school that taught recursive descent (his was
| the last class to teach it) taught me how to do it. In a month or
| so I had learned more about how grammars actually work, what
| ambiguity is, and so forth, than in my whole class at Stanford.
|
| I now teach compilers at a university, and I teach recursive
| descent.
| recursivedoubts wrote:
| as an aside, the parser for https://hyperscript.org is
| recursive descent, and our motto is: "remember what our parser
| professors taught us, then do the opposite of that."
|
| There's some ugly stuff in there, but it's fun, all because we
| wrote it in recursive descent and can do whatever we darn well
| please.
| tiborsaas wrote:
| I know it's not Reddit but username checks out :)
| exdsq wrote:
| I took a compilers class at an ex-Polytechnic in the UK that
| taught recursive descent - I now wonder how old the textbook
| they were using was!
| Koshkin wrote:
| I hope you are not making the same mistakes as your teacher at
| Stanford did (and leave students clueless).
| recursivedoubts wrote:
| i hope so too
| atarian wrote:
| Any resources you recommend for someone who is interested in
| learning about recursive descent?
| recursivedoubts wrote:
| Yes, the second part of crafting interpreters has a great
| introduction to the topic and is available online for free:
|
| https://craftinginterpreters.com/contents.html
| atarian wrote:
| Thanks!
| dataflow wrote:
| I still recommend spending the time to (re-)learn bottom-up
| parsing though, even if you never use it again. Specifically,
| LR and then GLR (no need to bother with LALR/SLR/all those
| variants). Maybe even recursive ascent too, or recursive
| ascent-descent if you find those interesting. It's painful and
| confusing and takes a long time to pick them up, but it's quite
| gratifying once you do learn it. The algorithms are super
| clever and cool, and help you gain a lot of appreciation and
| intuition for what is/isn't possible to do (whether efficiently
| or inefficiently).
|
| Using recursive descent for everything is like using Python (or
| Go or pick your favorite easy language to learn) for
| everything. Yeah you can totally do it, and a heck of a lot of
| people do precisely because of how nice and convenient it is,
| but it's good to have a sense what else is out there and how
| you can think about a problem differently IMO, even if you
| rarely use other languages in practice.
| recursivedoubts wrote:
| i agree, but for an undergraduate understanding of parsing
| and grammars, recursive descent is perfect
|
| i think the more complex parsing algorithms were good for
| producing papers, which explains their success in academia
| but relative lack of success in industry
| ckok wrote:
| Parser generators generally suffer from at least speed, recovery,
| error handling and lack of debuggability. They also require you
| to learn a new language, the parser generators language.
|
| Last but not least, what you actually end up wanting to build is
| an ast, as at some point you want to do something with the input,
| for most parsers you then have to implement even more code to
| build up the ast.
|
| It is much easier to hand write it. In the end its faster to
| write and usually faster to run.
|
| Every few years I evaluate the new options for the languages I
| use (c#, pascal), every time so far I am disappointed with the
| tooling. Maybe one year.
| Zababa wrote:
| I think you have to see them a bit like regexes. It may be a
| bit annoying to learn the grammar at first, but it's cross-
| language learning in a way, as you can use parser generators
| libraries in any language.
| chrisseaton wrote:
| But it's still one extra language to learn, for in practice
| no benefit.
| Zababa wrote:
| Copy/pasting the grammar of a language and using a library
| will be faster to do than any hand-written implementation,
| so I don't agree that there are no benefits.
| chrisseaton wrote:
| But you can't do that because in practice these grammars
| have lots of imperative action code inserted all over the
| place because the parser generator's model fundamentally
| doesn't match the language class you're parsing and it
| needs to be worked around.
|
| For example I work with a parser definition that's
| supposedly shared between C and Java, but it's a massive
| nightmare because it's 50% imperative actions.
| Zababa wrote:
| That depends on the language you're trying to parse. If
| it has a LL or LR grammar, it'll work well. If it's
| something else that can't be described with something
| like that or something less powerful, you're going to
| have a bad time.
| chrisseaton wrote:
| > That depends on the language you're trying to parse.
|
| Yes... but most practical languages do not have simple LL
| or LR grammars. Hence this blog post and discussion and
| these problems.
| codr7 wrote:
| I wasted so much time, first on using other people's frameworks
| and later on writing my own.
|
| Recursive descent might be a tiny bit more code, but it will
| never stand in your way.
|
| And you don't have to write absolutely everything by hand; it's
| perfectly possible to simplify the process somewhat using the
| full power of the host language to get incremental parsing, look
| ahead etc:
|
| https://github.com/codr7/swifties/blob/main/Sources/Swifties...
| mseepgood wrote:
| I think more programmers should learn how to write parsers by
| hand. You need them everywhere; for protocols, file formats, user
| input, little languages. They often use regular expressions or
| other crutches when a parser would be the better choice.
| suncherta wrote:
| What would be good resources for this?
| PartiallyTyped wrote:
| Do you know Haskell? If not, I suggest you get accustomed to
| the language, and then read about monadic parsing [1] through
| Graham Hutton's work. Graham is a famous CS professor at U
| Notthingham, appears often in ComputerPhile [3,4], and wrote
| a book on Haskell [2].
|
| I had to write an interpreter, optimizer and engine for a
| declarative language plus bottom up knowledge base in Haskell
| as part of an assignment, and an exam in a graduate course on
| advanced programming. Haskell made the problem significantly
| easier compared to languages I am much more comfortable with,
| like Python or C.
|
| [1] www.cs.nott.ac.uk/~pszgmh/pearl.pdf
|
| [2] https://www.amazon.com/Programming-Haskell-Graham-
| Hutton/dp/...
|
| [3] https://www.youtube.com/channel/UC9-y-6csu5WGm29I7JiwpnA
|
| [4] https://www.youtube.com/watch?v=eis11j_iGMs
| rofrol wrote:
| Crafting interpreters?
| srcreigh wrote:
| Course notes for University of Waterloo CS241 "Foundations of
| sequential programs"
|
| Part of the course is understanding exactly when Regexps are
| good enough and when you need something more powerful.
|
| http://anthony-zhang.me/University-Notes/CS241/CS241.html
| eatonphil wrote:
| Start with a simple language like parsing JSON or
| S-expressions and do it in a language you already know. There
| are a ton of tutorials you can find online along these lines!
___________________________________________________________________
(page generated 2021-08-21 23:01 UTC)