hngopher.com

       [HN Gopher] Parser generators vs. handwritten parsers: surveying...
       ___________________________________________________________________
        
       Parser generators vs. handwritten parsers: surveying major
       languages in 2021
        
       Author : eatonphil
       Score  : 91 points
       Date   : 2021-08-21 17:45 UTC (5 hours ago)
        
 (HTM) web link (notes.eatonphil.com)
 (TXT) w3m dump (notes.eatonphil.com)
        
       | SirensOfTitan wrote:
       | Does anyone recommend any resources for learning how to write
       | handwritten parsers nowadays? I've written a couple simple ones
       | for various tasks before, but I'd really love to learn more about
       | it.
        
         | adamgordonbell wrote:
         | This book is great: https://interpreterbook.com/
         | 
         | It is sort of a walk-through of a small programming langauge
         | written in go, using a recursive descent parser.
         | 
         | Crafting Interpreters is good as well, though I only read part
         | of it, because it wasn't done at the time I read it.
         | https://craftinginterpreters.com/
         | 
         | I've also interviewed both of them, if you like an audio intro:
         | https://corecursive.com/032-bob-nystrom-on-building-an-inter...
        
         | junon wrote:
         | Learn what lexing is. That's step one. Once you have a token
         | stream, look into how you'd transform them into structures.
         | 
         | PUB FN ID(main) OPAREN CPAREN COLON ID(i32) OBRACE NL RETURN
         | NUMBER(0) CBRACE
         | 
         | Look familiar? How might you parse that into something like
         | 
         | { kind: "function", name: "main", args: [], public: true,
         | return_type: { kind: "integer_type", width: 32 }, body: [ {
         | kind: "return", expression: { kind: "number_literal", value: 0
         | } } ] }
         | 
         | A recursive descent parser is just a parser that starts at a
         | root node type and recursively traverses a token stream and
         | calls handler functions based on what's next. Those functions
         | might also call other handlers (e.g. for nested expression
         | parsing, etc.) which is why it's called recursive.
        
       | eigenhombre wrote:
       | I enjoyed learning that Julia's parser is written in Scheme.
        
       | adamgordonbell wrote:
       | The Ruby yacc file is scary to look at. 13+ thousand lines in a
       | single file.
       | 
       | Would it be better with hand rolled and they could have
       | abstracted and organized somethings or does it all make sense in
       | its current format if you are familiar with it?
       | 
       | https://github.com/ruby/ruby/blob/v3_0_2/parse.y
        
         | chrisseaton wrote:
         | I have a PhD in compiling Ruby and I struggle to read and
         | modify that file.
        
           | adamgordonbell wrote:
           | Ruby is very flexible. Is that the issue?
           | 
           | I suspect not, because Scala is super flexible and probably
           | more difficult to parse, but has a better organized parser.
        
             | chrisseaton wrote:
             | The language is just quite large and quite complex to parse
             | - context-sensitive and requires interaction back and forth
             | with the lexer as it parses. TreeSitter seems to have a
             | much simpler Ruby parser - but I believe it ignores the
             | context-sensitive parts of Ruby.
        
       | reikonomusha wrote:
       | For serious projects, I find myself typically resorting to making
       | a hand-written parser (usually recursive descent) because it's
       | the only way I can really get excellent and contextual error
       | reporting and recovery. Parser generators _are_ computer science
       | marvels, but for the most part they're typically too hard to
       | customize well enough to produce a sensible error to the user,
       | with line numbers, token stream provenance, and suggested
       | recovery information. I can't imagine the difficulties of trying
       | to get good messages for something as large and pervasive as,
       | say, C.
       | 
       | I think that with enough work, we can adapt and evolve parser
       | generators to mesh well with better error reporting, and give the
       | programmer more options and control, but it'll take some elbow
       | grease and probably breaking backwards compatibility with the
       | production syntax of these parser generator systems. There's also
       | a risk, of course, that if there's _too_ much customization, you
       | sort of lose the value of having a parser generator DSL in the
       | first place.
        
         | eddieh wrote:
         | I'm always shocked to find major projects that use parser
         | generators. A hand-written recursive descent parser is easier
         | to write, read, understand, and modify. It's almost trivial to
         | translate a language specification's grammar productions into a
         | recursive descent parser.
         | 
         | So looking at the list I'm dumbfounded that CPython and Ruby
         | use generators.
        
           | jasonwatkinspdx wrote:
           | You won't find me defending the use of yacc, but in general
           | hand written parsers are far more error prone.
        
           | dataflow wrote:
           | > It's almost trivial to translate a language specification's
           | grammar productions into a recursive descent parser.
           | 
           | It's also potentially quite slow.
        
         | billconan wrote:
         | I really hope to know why people choose one parsing algorithm
         | vs another.
         | 
         | I implemented an earley parser, because from what I read on
         | wikipedia, it seems to be more advanced.
         | 
         | "Earley parsers are appealing because they can parse all
         | context-free languages, unlike LR parsers and LL parsers, which
         | are more typically used in compilers but which can only handle
         | restricted classes of languages."
         | 
         | however I seldom see languages use Earley parser, there must be
         | a reason, but I've never seen anybody explaining why choosing
         | one algorithm over another.
        
           | armchairhacker wrote:
           | Most languages have a simple, unambiguous syntax, so LL or LR
           | is fine. LL or LR is almost certainly faster then Earley,
           | since in general more restricted = faster.
           | 
           | As the above commenter mentioned, most language designers
           | make hand-rolled parsers. Making a hand-rolled LL or LR
           | parser is easier too.
           | 
           | In general most people think advanced = bad, they want the
           | easiest solution which gets the job done well.
        
             | UncleEntity wrote:
             | > LL or LR is almost certainly faster then Earley, since in
             | general more restricted = faster.
             | 
             | Earley's algorithm should be able to be trivially
             | parallelized which might make it a contender for languages
             | which are ambiguous (like most real world languages) where
             | they are hand-rolling parsers. I haven't tried since I have
             | no real need but looking at the steps I can't see a reason
             | it couldn't do its work across multiple threads.
             | 
             | Honestly, other than JavaScript 1.1 I can't think of a
             | popular language which has an unambiguous syntax and I
             | _really_ like playing with language grammars for some odd
             | reason -- probably wrong though...
        
           | eesmith wrote:
           | From what I infer from articles like
           | https://jeffreykegler.github.io/personal/timeline_v3 , the
           | original Earley paper had a bug, and wasn't a good fit for
           | 1960s hardware, and had poor performance for some types of
           | grammars.
           | 
           | By 1991, "Most researchers see the Parsing Problem as
           | "solved" -- a closed issue. Earley parsing is almost
           | forgotten, and Leo's discovery is ignored. Two decades will
           | pass before anyone attempts a practical implementation of Leo
           | 1991."
           | 
           | It takes Aycock and Horspool's work in 2002 and Kegler's work
           | in 2010 in Marpa to have a "practical implementation"
           | (quoting that link).
           | 
           | (I quote that also because Aycock distributed SPARK, an
           | Earley parser, which was included as part of the Python
           | distribution, in the Parser/ subdirectory, and a couple of
           | people here on HN report having used it.)
        
             | UncleEntity wrote:
             | > I quote that also because Aycock distributed SPARK, an
             | Earley parser, which was included as part of the Python
             | distribution, in the Parser/ subdirectory, and a couple of
             | people here on HN report having used it.
             | 
             | That one is really the only Earley parser I've found used
             | in the wild (don't know what marpa is used for) and
             | unfortunately it is mostly unhackable because they did some
             | serious optimization voodoo on it so it was replaced by a
             | hand-written recursive decent parser a while back because
             | nobody in the world could figure how it works[0] -- which
             | is kind of strange since ASDL is super simple to parse and
             | the generator which used spark was meant to check files
             | into source control but, whatever.
             | 
             | Its easy to play around with but not a great source if you
             | want to see how an Earley parser is put together. There are
             | also some bugs with parser action on duplicate rules not
             | working properly that were pretty easy to fix but python
             | pulled it out of the source tree so no upstream to send
             | patches to?
             | 
             | [0] might be making that part up, dunno?
        
               | DylanSp wrote:
               | nearley.js is an Earley parser that sees at least some
               | use. https://nearley.js.org/
        
               | eesmith wrote:
               | You are one of the "couple of people" I was referring to.
               | :)
               | 
               | I know SPARK's docstring use influenced PLY.
               | 
               | PLY doesn't use Earley, but "Earley" does come up in the
               | show notes of an interview with Beazley, PLY's author, at
               | https://www.pythonpodcast.com/episode-95-parsing-and-
               | parsers... . No transcript, and I'm not going to listen
               | to it just to figure out the context.
               | 
               | https://github.com/lark-parser/lark "implements both
               | Earley(SPPF) and LALR(1)".
               | 
               | Kegler, the author of that timeline I linked to, is the
               | author of Marpa. Home page is
               | http://savage.net.au/Marpa.html . The most recent HN
               | comments about it are from a year ago, at
               | https://news.ycombinator.com/item?id=24321395 .
        
           | sirwhinesalot wrote:
           | Early and similar algorithms give you a parse forest rather
           | than a parse tree. For a programming language you don't want
           | to have your grammar be ambiguous. When Early gives you a
           | bunch of alternative parse trees, how do you disambiguate?
        
             | pfdietz wrote:
             | If you are parsing incorrect programs, you want resilience
             | in the parser, including considering alternate parsings and
             | generation of parses of sections of the file.
             | 
             | Consider the problem of parsing a C program that has NOT
             | been put through the preprocessor. You want to be able to
             | do this for structure editors and source-to-source
             | transformation systems.
        
               | rightbyte wrote:
               | It is quite possible to redifine much of the syntax with
               | the cpp. Parsing without the cpp is hopeless.
               | 
               | My favourite is Bourne's longing for Algol:
               | https://minnie.tuhs.org/cgi-
               | bin/utree.pl?file=V7/usr/src/cmd...
        
               | pfdietz wrote:
               | Of course parsing without cpp is, in general, hopeless.
               | But for something like a structure-aware editor you don't
               | need to parse it perfectly, just "good enough". Ditto for
               | a source-to-source transformation system.
        
               | chrisseaton wrote:
               | If your grammar is ambiguous, then an ambiguous program
               | isn't incorrect.
        
               | pfdietz wrote:
               | It's not that the grammar is ambiguous, it's that the
               | thing you're parsing may not even be in the language
               | generated by the grammar. But you want to parse it as
               | well as possible, and there may be more than one way to
               | do that.
        
             | billconan wrote:
             | in my implementation, I give each rule a priority. I always
             | choose the one with the highest priority to expand. I
             | basically followed this https://loup-
             | vaillant.fr/tutorials/earley-parsing/parser
        
               | xyzzy_plugh wrote:
               | I've written parsers for context-sensitive grammars, and
               | as it turns out, this is not a desirable feature of a
               | language. The reason you typically see simple parsers is
               | because simple grammars are usually unambiguous and
               | context-free, which is more convenient for humans to
               | understand and internalize.
               | 
               | After all, when you're writing software, you pretend to
               | be the compiler to some extent.
               | 
               | Natural languages, like English, are a good example of
               | something which humans struggle with because they are
               | complex, ambiguous, and often require context. Sure, it's
               | extremely expressive, but that is the sharpest double-
               | edged sword in programming languages.
        
               | sirwhinesalot wrote:
               | Neat, though this just seems like PEG with extra steps
               | and more pain. Still this answers my question. Thanks!
        
       | account-5 wrote:
       | I'd love to learn about parsers but I need an ELI5 version.
       | 
       | I've written a parser for CSV which I think works very well but
       | I'm self taught and have no idea if it's any good.
       | 
       | Any 'parsers for dummies' guides?
        
       | prestonbriggs wrote:
       | Compilers have parsers, not languages. Many languages have more
       | than one compiler (e.g., C), and the various compilers may use
       | different techniques for parsing.
        
         | Zababa wrote:
         | To be even more pedentic: implementations have parsers. Some
         | programming languages are not compiled.
        
         | eatonphil wrote:
         | That's right! But HN has a title length limit. See the (actual)
         | blog post title and the blog post contents. :)
        
         | chrisseaton wrote:
         | Nobody was mistaken about this fact.
        
       | Zababa wrote:
       | > Developers often think parser generators are the sole legit way
       | to build programming language frontends, possibly because
       | compiler courses in university teach lex/yacc variants.
       | 
       | They're far from the only way, handwritten parsers are often
       | mentionned as having way better error recovery. But if you're
       | making a new configuration language, programming language, or
       | something like that, ensuring that it has LR-compliant grammar
       | and that this grammar is the source of truth can avoid a lot of
       | pain later down the road.
        
       | mathgladiator wrote:
       | When I switched from ANTLR to hand written for Adama (
       | http://www.adama-lang.org/ ), I felt way better about things. I
       | was able to get sane error messages, and I could better annotate
       | my syntax tree with comments and line/char numbers.
       | 
       | A killer feature for a parser generator would be the ability to
       | auto-generate a pretty printer which requires stuffing comments
       | into the tree as a "meta token".
        
         | fjfaase wrote:
         | I implemented an unparse function in IParse, which is not a
         | parser generator, but a parser that interprets a grammar. See
         | for example
         | https://github.com/FransFaase/IParse/blob/master/software/c_...
         | where symbols starting with a back slash are a kind of white
         | space terminals during the unparse. For example, \inc stands
         | for incrementing the indentation where \dec decrements it. The
         | \s is used to indicate that at given location a space should be
         | included.
        
       | rurban wrote:
       | Perl uses yacc, but with extreme lexer tricks to be dynamic.
       | 
       | All lisps use handwritten parsers, in under 100 lines.
        
       | aarchi wrote:
       | Are there any significant languages that do parsing with
       | derivatives?
        
         | mhh__ wrote:
         | What are derivatives in this context?
        
           | drdeca wrote:
           | From what I can find, in the case of regexes, the derivative
           | of a regex R with respect to a character c, is the regex
           | \partial_c R , which matches a string S iff R matches c S.
           | So, the derivative of (ab)* with respect to the character 'a'
           | , would be b(ab)* .
           | 
           | I imagine that the same idea applies for more general
           | languages, so, I expect the derivative of a language with
           | respect to a character, is the language of strings such that
           | prepending one by the character results in a string of the
           | original language.
           | 
           | So, if R_1 and R_2 are two different regexes, then the
           | derivative with respect to c of the regex R_1 | R_2 , i.e.
           | \partial_c (R_1 | R_2) , would be (\partial_c R_1) |
           | (\partial_c R_2),
           | 
           | and, if + is used for concatenation of two regexes, the
           | derivative of R_1 + R_2 would be, ((\partial_c R_1) + R_2) |
           | ((if R_1 accepts the empty string, then the language which
           | accepts only the empty string, otherwise, the language that
           | does not accept any string) + (\partial_c R_2))
           | 
           | etc.
           | 
           | ... I think.
        
           | olzd wrote:
           | Probably Brzozowski's derivative of regular expressions, e.g
           | https://matt.might.net/papers/might2011derivatives.pdf
        
       | muuglay wrote:
       | It seems strange that in an effort to solve a common problem that
       | 80% of the success stories opt to not use the automation.
       | Perhaps, there is ground for another toolkit
        
         | mhh__ wrote:
         | Work on a compiler for a bit and I guarantee you'll get it.
         | Parser generators pretty much always get in your way
        
         | shoulderchipper wrote:
         | Building an AST is an easy task once you've figured out your
         | grammar and error recovery mechanism, but these are essentially
         | a kind of language-specific design work that can not be
         | automated.
        
         | [deleted]
        
       | EamonnMR wrote:
       | There's something to be said for a tool that a novice can use (my
       | first hand written recursive descent parser was, no joke, because
       | I wasn't up to the task of including a json library in my Java
       | project) but still useful for professionals.
        
       | veltas wrote:
       | > Clang. Not only handwritten but the same file handles parsing
       | C, Objective-C and C++.
       | 
       | And it's under 3000 lines!
       | 
       | https://github.com/llvm/llvm-project/blob/llvmorg-12.0.1/cla...
        
         | NilsIRL wrote:
         | It seems to me that the parsing code in clang is distributed
         | over multiple files which together are way more than 3000
         | lines: https://github.com/llvm/llvm-
         | project/tree/llvmorg-12.0.1/cla...
        
       | recursivedoubts wrote:
       | I took the compilers class at Stanford and never really
       | understood the algorithms of bottom up parsing, or even really
       | how grammars worked. I just made the tool work.
       | 
       | I then went to work at a private company, and an older guy who
       | had gone to a state school that taught recursive descent (his was
       | the last class to teach it) taught me how to do it. In a month or
       | so I had learned more about how grammars actually work, what
       | ambiguity is, and so forth, than in my whole class at Stanford.
       | 
       | I now teach compilers at a university, and I teach recursive
       | descent.
        
         | recursivedoubts wrote:
         | as an aside, the parser for https://hyperscript.org is
         | recursive descent, and our motto is: "remember what our parser
         | professors taught us, then do the opposite of that."
         | 
         | There's some ugly stuff in there, but it's fun, all because we
         | wrote it in recursive descent and can do whatever we darn well
         | please.
        
         | tiborsaas wrote:
         | I know it's not Reddit but username checks out :)
        
         | exdsq wrote:
         | I took a compilers class at an ex-Polytechnic in the UK that
         | taught recursive descent - I now wonder how old the textbook
         | they were using was!
        
         | Koshkin wrote:
         | I hope you are not making the same mistakes as your teacher at
         | Stanford did (and leave students clueless).
        
           | recursivedoubts wrote:
           | i hope so too
        
         | atarian wrote:
         | Any resources you recommend for someone who is interested in
         | learning about recursive descent?
        
           | recursivedoubts wrote:
           | Yes, the second part of crafting interpreters has a great
           | introduction to the topic and is available online for free:
           | 
           | https://craftinginterpreters.com/contents.html
        
             | atarian wrote:
             | Thanks!
        
         | dataflow wrote:
         | I still recommend spending the time to (re-)learn bottom-up
         | parsing though, even if you never use it again. Specifically,
         | LR and then GLR (no need to bother with LALR/SLR/all those
         | variants). Maybe even recursive ascent too, or recursive
         | ascent-descent if you find those interesting. It's painful and
         | confusing and takes a long time to pick them up, but it's quite
         | gratifying once you do learn it. The algorithms are super
         | clever and cool, and help you gain a lot of appreciation and
         | intuition for what is/isn't possible to do (whether efficiently
         | or inefficiently).
         | 
         | Using recursive descent for everything is like using Python (or
         | Go or pick your favorite easy language to learn) for
         | everything. Yeah you can totally do it, and a heck of a lot of
         | people do precisely because of how nice and convenient it is,
         | but it's good to have a sense what else is out there and how
         | you can think about a problem differently IMO, even if you
         | rarely use other languages in practice.
        
           | recursivedoubts wrote:
           | i agree, but for an undergraduate understanding of parsing
           | and grammars, recursive descent is perfect
           | 
           | i think the more complex parsing algorithms were good for
           | producing papers, which explains their success in academia
           | but relative lack of success in industry
        
       | ckok wrote:
       | Parser generators generally suffer from at least speed, recovery,
       | error handling and lack of debuggability. They also require you
       | to learn a new language, the parser generators language.
       | 
       | Last but not least, what you actually end up wanting to build is
       | an ast, as at some point you want to do something with the input,
       | for most parsers you then have to implement even more code to
       | build up the ast.
       | 
       | It is much easier to hand write it. In the end its faster to
       | write and usually faster to run.
       | 
       | Every few years I evaluate the new options for the languages I
       | use (c#, pascal), every time so far I am disappointed with the
       | tooling. Maybe one year.
        
         | Zababa wrote:
         | I think you have to see them a bit like regexes. It may be a
         | bit annoying to learn the grammar at first, but it's cross-
         | language learning in a way, as you can use parser generators
         | libraries in any language.
        
           | chrisseaton wrote:
           | But it's still one extra language to learn, for in practice
           | no benefit.
        
             | Zababa wrote:
             | Copy/pasting the grammar of a language and using a library
             | will be faster to do than any hand-written implementation,
             | so I don't agree that there are no benefits.
        
               | chrisseaton wrote:
               | But you can't do that because in practice these grammars
               | have lots of imperative action code inserted all over the
               | place because the parser generator's model fundamentally
               | doesn't match the language class you're parsing and it
               | needs to be worked around.
               | 
               | For example I work with a parser definition that's
               | supposedly shared between C and Java, but it's a massive
               | nightmare because it's 50% imperative actions.
        
               | Zababa wrote:
               | That depends on the language you're trying to parse. If
               | it has a LL or LR grammar, it'll work well. If it's
               | something else that can't be described with something
               | like that or something less powerful, you're going to
               | have a bad time.
        
               | chrisseaton wrote:
               | > That depends on the language you're trying to parse.
               | 
               | Yes... but most practical languages do not have simple LL
               | or LR grammars. Hence this blog post and discussion and
               | these problems.
        
       | codr7 wrote:
       | I wasted so much time, first on using other people's frameworks
       | and later on writing my own.
       | 
       | Recursive descent might be a tiny bit more code, but it will
       | never stand in your way.
       | 
       | And you don't have to write absolutely everything by hand; it's
       | perfectly possible to simplify the process somewhat using the
       | full power of the host language to get incremental parsing, look
       | ahead etc:
       | 
       | https://github.com/codr7/swifties/blob/main/Sources/Swifties...
        
       | mseepgood wrote:
       | I think more programmers should learn how to write parsers by
       | hand. You need them everywhere; for protocols, file formats, user
       | input, little languages. They often use regular expressions or
       | other crutches when a parser would be the better choice.
        
         | suncherta wrote:
         | What would be good resources for this?
        
           | PartiallyTyped wrote:
           | Do you know Haskell? If not, I suggest you get accustomed to
           | the language, and then read about monadic parsing [1] through
           | Graham Hutton's work. Graham is a famous CS professor at U
           | Notthingham, appears often in ComputerPhile [3,4], and wrote
           | a book on Haskell [2].
           | 
           | I had to write an interpreter, optimizer and engine for a
           | declarative language plus bottom up knowledge base in Haskell
           | as part of an assignment, and an exam in a graduate course on
           | advanced programming. Haskell made the problem significantly
           | easier compared to languages I am much more comfortable with,
           | like Python or C.
           | 
           | [1] www.cs.nott.ac.uk/~pszgmh/pearl.pdf
           | 
           | [2] https://www.amazon.com/Programming-Haskell-Graham-
           | Hutton/dp/...
           | 
           | [3] https://www.youtube.com/channel/UC9-y-6csu5WGm29I7JiwpnA
           | 
           | [4] https://www.youtube.com/watch?v=eis11j_iGMs
        
           | rofrol wrote:
           | Crafting interpreters?
        
           | srcreigh wrote:
           | Course notes for University of Waterloo CS241 "Foundations of
           | sequential programs"
           | 
           | Part of the course is understanding exactly when Regexps are
           | good enough and when you need something more powerful.
           | 
           | http://anthony-zhang.me/University-Notes/CS241/CS241.html
        
           | eatonphil wrote:
           | Start with a simple language like parsing JSON or
           | S-expressions and do it in a language you already know. There
           | are a ton of tutorials you can find online along these lines!
        
       ___________________________________________________________________
       (page generated 2021-08-21 23:01 UTC)