hngopher.com

       [HN Gopher] Parsing: The Solved Problem That Isn't (2011)
       ___________________________________________________________________
        
       Parsing: The Solved Problem That Isn't (2011)
        
       Author : contr-error
       Score  : 79 points
       Date   : 2024-02-21 15:42 UTC (7 hours ago)
        
 (HTM) web link (tratt.net)
 (TXT) w3m dump (tratt.net)
        
       | temp123789246 wrote:
       | Have there been any notable innovations in parsing since this was
       | written?
        
         | marcusf wrote:
         | An extremely layman answer is that most interesting innovation
         | in parsing in relatively modern times has happened seems to be
         | in the context of IDE's. I.e. incremental, high-performance
         | parsing to support syntax highlighting, refactoring, etc. etc.
         | 
         | (I may be talking out of my ass here.)
        
           | ReleaseCandidat wrote:
           | Actually the most important step of parsers (as even non-
           | incremental, slow (or better: not fast) parsers are fast
           | enough) is error recovery (error resilience) from syntax
           | errors (mostly half written or half deleted code). What is
           | time consuming is e.g. type-checking. Semantic checking in
           | general, like exhaustiveness checks of pattern matches,
           | syntax checking is fast.
        
         | davidkellis wrote:
         | These are not new, but my takeaways from
         | https://tratt.net/laurie/blog/2020/which_parsing_approach.ht...
         | and https://rust-
         | analyzer.github.io/blog/2020/09/16/challeging-L... are to
         | embrace various forms of LR parsing.
         | https://github.com/igordejanovic/parglare is a very capable GLR
         | parser, and I've been keeping a close eye on it for use in my
         | projects.
        
         | troupo wrote:
         | Yes. You roll your own manual parser. It's not as difficult as
         | people make it out to be.
        
         | thechao wrote:
         | Aycock & Horspool came up with a 'practical' method for
         | implementing Earley parsing (conversion to a state-machine)
         | that has pretty humorously good performance delta over "naive"
         | Earley, and is still reasonable to implement. Joop Leo figured
         | out how to get the worst-case of Earley parsing down to either
         | O(n) (left-recursive, non-ambiguous) or O(n^2) (right-
         | recursive, non-ambiguous). That means the Earley algorithm is
         | only O(n^3) on right-recursive, ambiguous grammars; and, if
         | you're doing that, you're holding your language wrong.
         | 
         | A somewhat breathless description of all of this is in the
         | Marpa parser documentation:
         | https://jeffreykegler.github.io/Marpa-web-site/
         | 
         | In practice, I've found that computers are so fast, that with
         | just the Joop Leo optimizations, 'naive' Earley parsing is Good
         | Enough(tm):                   https://loup-
         | vaillant.fr/tutorials/earley-parsing/
        
         | norir wrote:
         | I feel that most of the time the two options are presented as
         | either write a handwritten parser or use a parser generator. A
         | nice third way is to write a custom parser generator for the
         | language you wish to parse. Handwritten parsers do tend to get
         | unwieldy and general purpose parser generators can have
         | inscrutable behavior for any specific language.
         | 
         | Because the grammar for a parser generator is usually much
         | simpler than most general purpose programming languages, it is
         | typically relatively straightforward to handwrite a parser for
         | it.
        
         | danielvaughn wrote:
         | I'm not super familiar with the space, but tree-sitter seems to
         | take an interesting approach in that they are an incremental
         | parser. So instead of re-parsing the entire document on change,
         | it only parses the affected text, thereby making it much more
         | efficient for text editors.
         | 
         | I don't know if that's specific to tree-sitter though, I'm sure
         | there are other incremental parsers. I have to say that I've
         | tried ANTLR and tree-sitter, and I absolutely love tree-sitter.
         | It's a joy to work with.
        
           | ReleaseCandidat wrote:
           | > [incremental parsing] I don't know if that's specific to
           | tree-sitter though
           | 
           | No, it isn't. And incremental parsing is older than 2011 too
           | (like at least the 70s).
           | 
           | For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066
        
           | IshKebab wrote:
           | In my experience incremental parsing doesn't really make much
           | sense. Non-incremental parsing can easily parse huge
           | documents in milliseconds.
           | 
           | Also Tree Sitter only does half the parsing job - you get a
           | tree on nodes, but you have to do your own parse of that tree
           | to get useful structures out.
           | 
           | I prefer Chumsky or Nom which go all the way.
        
         | bsder wrote:
         | Yes, we resdesigned our programming languages to be easy to
         | parse with limited lookahead.
        
         | o11c wrote:
         | Not sure, but I at least am certainly aware of possibilities
         | that such writeups exclude.
         | 
         | In particular, you can do (a subset of) the following in
         | sequence:
         | 
         | * write your own grammar in whatever bespoke language you want
         | 
         | * compose those grammars into a single grammar
         | 
         | * generate a Bison grammar from that grammar
         | 
         | * run `bison --xml` instead of actually generating code
         | 
         | * read the XML file and implement your own (trivial) runtime so
         | you can easily handle ownership issues
         | 
         | In particular, I am vehemently opposed to the idea of
         | implementing parsers separately using some non-proven
         | tool/theory, since that way leads to subtle grammar
         | incompatibilities later.
        
         | sse wrote:
         | The one I'm most excited about is improved error recovery:
         | 
         | https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
         | 
         | https://drops.dagstuhl.de/storage/00lipics/lipics-vol166-eco...
        
       | taeric wrote:
       | My current view of what makes parsing so difficult is that people
       | want to jump straight over a ton of intermediate things from
       | parsing to execution. That is, we often know what we want to
       | happen at the end. And we know what we are given. It is hoped
       | that it is a trivially mechanical problem to go from one to the
       | other.
       | 
       | But this ignores all sorts of other steps you can take. Targeting
       | multiple execution environments is an obvious step. Optimization
       | is another. Trivial local optimizations like shifts over
       | multiplications by 2 and fusing operations to take advantage of
       | the machine that is executing it. Less trivial full program
       | optimizations that can propagate constants across source files.
       | 
       | And preemptive execution is a huge consideration, of course. Very
       | little code runs in a way that can't be interrupted for some
       | other code to run in the meantime. To the point that we don't
       | even think of what this implies anymore. Despite accumulators
       | being a very basic execution unit on most every computer.
       | (Though, I think I'm thankful that reentrancy is the norm
       | nowadays in functions.)
        
         | Karellen wrote:
         | What have those other things got to do with parsing though?
         | Granted, they rely on parsing _having already happened_ , but I
         | don't see how there's much feedback from those considerations
         | to the way that parsers work, or are written, or - as the
         | article discussed - can be combined?
        
           | taeric wrote:
           | You can easily view it as having nothing to do with it. My
           | push is that it is the point of parsing. You don't parse
           | directly into understanding/execution. You parse into another
           | representation, one that we never directly talk about, so
           | that you can then move it into the next level.
           | 
           | Even English can be parsed first into the sounds. This is why
           | puns work. Consider the joke, "why should you wear glasses to
           | math class? It helps with division." That only works if you
           | go to the sounds first. And you will have optionality in
           | where to go from there.
           | 
           | So, for parsing programs, we often first decide on primitives
           | for execution. For teaching, this is often basic math
           | operations. But in reality, you have far more than the basic
           | math operations. And, as I was saying, you can do more with
           | the intermediate representation than you probably realize at
           | the outset.
        
       | pcfwik wrote:
       | There was an interesting discussion two years ago regarding
       | nonobvious issues with PEGs:
       | 
       | https://news.ycombinator.com/item?id=30414683
       | 
       | https://news.ycombinator.com/item?id=30414879
       | 
       | I spent a year or two working with PEGs, and ran into similar
       | issues multiple times. Adding a new production could totally
       | screw up seemingly unrelated parses that worked fine before.
       | 
       | As the author points out, Earley parsing with some disambiguation
       | rules (production precedence, etc.) has been much less
       | finicky/annoying to work with. It's also reasonably fast for
       | small parses even with a dumb implementation. Would suggest for
       | prototyping/settings when runtime ambiguity is not a showstopper,
       | despite the remaining issues described in the article re: having
       | a separate lexer.
        
         | dataflow wrote:
         | > I spent a year or two working with PEGs
         | 
         | > Earley parsing with some disambiguation rules
         | 
         | Any idea why GLR always gets ignored?
        
       | EdSchouten wrote:
       | What I find annoying about using parser generators is that it
       | always feels messy integrating the resulting parser into your
       | application. So you write a file that contains the grammar and
       | generate a parser out of that. Now you build it into your app and
       | call into it to parse some input file, but that ends up giving
       | you some poorly typed AST that is cluttered/hard to work with.
       | 
       | Certain parser generators make life easier by supporting actions
       | on parser/lexer rules. This is great and all, but it has the
       | downside that the grammar you provide is no longer reusable.
       | There's no way for others to import that grammar and provide
       | custom actions for them.
       | 
       | I don't know. In my opinion parsing theory is already solved.
       | Whether it's PEG, LL, LR, LALR, whatever. One of those is
       | certainly good enough for the kind of data you're trying to
       | parse. I think the biggest annoyance is the tooling.
        
         | mrkeen wrote:
         | Parser combinators is what I've been loving in the last few
         | years.
         | 
         | Pros: * They're just a technique/library that you can use in
         | your own language without the seperate generation step.
         | 
         | * They're simple enough that I often roll my own rather than
         | using an existing library.
         | 
         | * They let you stick code into your parsing steps - logging,
         | extra information, constructing your own results directly, e.g.
         | 
         | * The same technique works for lexing and parsing - just write
         | a parser from bytes to tokens, and a second parser from tokens
         | to objects.
         | 
         | * Depending on your languages syntax, you can get your parser
         | code looking a lot like the bnf grammar you're trying to
         | implement.
         | 
         | Cons: * You will eventually run into left-recursion problems.
         | It can be nightmarish trying to change the code so it 'just
         | works'. You really need to step back and grok left-recursion
         | itself - no handholding from parser combinators.
         | 
         | * Same thing with precedence - you just gotta learn how to do
         | it. Fixing left-recursion didn't click for me until I learned
         | how to do precedence.
        
       | kazinator wrote:
       | Parsing computer languages is an entirely self-inflicted problem.
       | You can easily design a language so it doesn't require any
       | parsing techniques that were not known and practical in 1965, and
       | it will greatly benefit the readability also.
        
         | amelius wrote:
         | End-comment
        
           | lgas wrote:
           | )
        
         | loevborg wrote:
         | Do you mean lisp? If yes I agree
        
           | kazinator wrote:
           | Including, but not limited to.
        
             | Nevermark wrote:
             | Many an argument has been won by the response: "but
             | consider Lisp and Forth"
        
         | Legend2440 wrote:
         | But I don't want to be able to parse only highly restricted
         | languages. I want to be able to parse _anything_ , including
         | natural language or even non-languages like raw audio.
         | 
         | My brain can do it, why can't my computer?
        
           | reissbaker wrote:
           | Your computer _can_ do it, if it has a beefy enough Nvidia
           | card. That 's what LLMs are for!
           | 
           | Grammar-based parsing for natural language isn't anywhere
           | close to working, sadly, and may never be.
        
             | deathanatos wrote:
             | You've never had an LLM misinterpret what you correctly
             | wrote?
             | 
             | (... I have.)
        
           | throwbadubadu wrote:
           | Yes, never do humans misunderstand each other, or
           | instructions are not clear to everyone and totally
           | unambiguous, and luckily no language has pure differentiation
           | of meaning by intonation, and, and.. and...
        
             | thfuran wrote:
             | Yeah, natural languages don't have a specification or
             | canonical parser implementation, so they cannot be reliably
             | parsed.
        
               | Legend2440 wrote:
               | They don't have a _short_ parser. They can be parsed, it
               | just requires a huge amount of priors and world
               | knowledge. Rule-based parsers are too simple.
        
               | thfuran wrote:
               | No, they cannot be reliably parsed. There is no
               | unambiguously correct parsing for many (or, arguably,
               | any) strings. Two people could say the same thing in the
               | same context and mean different things by it. You can't
               | even definitively say whether what they said/wrote is
               | valid English. Sure, there are strings most would agree
               | are and strings most would agree aren't, but even taking
               | consensus opinion as the source of truth, most isn't all,
               | and there's no universally agreed upon threshold for
               | acceptance.
        
               | lisper wrote:
               | My favorite example, due to Douglas Hofstadter:
               | 
               | Politicians lie.
               | 
               | Cast iron sinks.
               | 
               | Politicians lie in cast iron sinks.
               | 
               | It's not actually ambiguous, but I think it's a lovely
               | illustration of the subtleties of the problem.
               | 
               | An actually ambiguous example: I saw a politician lying
               | in a cast iron sink.
        
               | Legend2440 wrote:
               | That's okay - that just means your parser needs to model
               | what the speaker was thinking when they said it. That's
               | extra information that's required to decode the message.
               | It is not necessary for the same text to always mean the
               | same thing.
        
         | dkjaudyeqooe wrote:
         | This is entirely the case. Given a sensible grammar stated in a
         | sensible way, it's very easy to write a nice recursive decent
         | parser. They are fast and easy to maintain. It doesn't limit
         | the expressiveness of your grammar unduly.
         | 
         | Both GCC and LLVM implement recursive decent parsers for their
         | C compilers.
         | 
         | Parser generators are an abomination inflected upon us by
         | academia, solving a non problem, and poorly.
        
           | derriz wrote:
           | Agree completely. Having used a bunch of parser generators
           | (Antlr and bison most extensively) and written a parser
           | combinator library, I came to the conclusion that they're a
           | complete waste of time for practical applications.
           | 
           | A hand-written recursive descent parser (with an embedded
           | Pratt parser to handle expressions/operators) solves all the
           | problems that parser generators struggle with. The big/tricky
           | "issue" mentioned in the article - composing or embedding one
           | parser in another - is a complete non-issue with recursive-
           | descent - it's just a function call. Other basic features of
           | parsing: informative/useful error messages, recovery (i.e.
           | don't just blow up with the first error), seamless
           | integration with the rest of the host language, remain issues
           | with all parser generators but are simply not issues with
           | recursive descent.
           | 
           | And that's before you consider non-functional advantages of
           | recursive-descent: debuggability, no change/additions to the
           | build system, fewer/zero dependencies, no requirement to
           | learn a complex DSL, etc.
        
       | dang wrote:
       | Related:
       | 
       |  _Parsing: The Solved Problem That Isn 't (2011)_ -
       | https://news.ycombinator.com/item?id=8505382 - Oct 2014 (70
       | comments)
       | 
       |  _Parsing: the solved problem that isn 't_ -
       | https://news.ycombinator.com/item?id=2327313 - March 2011 (47
       | comments)
        
       | Nevermark wrote:
       | Common example of complications of two grammars being combined: C
       | code and character strings.
       | 
       | Double quotes in C code mean begin and end of a string. But
       | strings contain quotes too. And newlines. Etc.
       | 
       | So we got the cumbersome invention of escape codes, and so
       | characters strings in source (itself a character string) are not
       | literally the strings they represent.
        
         | PH95VuimJjqBqy wrote:
         | at no point in my life have I ever considered escape codes to
         | be problematic.
         | 
         | ugly, yes. problematic? no.
        
           | convolvatron wrote:
           | until you need to get your string through several levels of
           | escape. how many backslashes to add? depends on how deep your
           | pipe is and how each of those layers is defined
        
             | samus wrote:
             | The only alternative is extracting them to other files or
             | designing specialized string formats.
        
               | lisper wrote:
               | There is one obvious "specialized string format" that
               | solves 99% of all escaping issues: use <<balanced
               | quotes>>. The real problem isn't escaping, it is that the
               | same character is used both to open and close strings.
        
               | ctrw wrote:
               | Qwerty has ` and ', but for some reason we've decided
               | that '' isn't " and " is the one true quotation symbol.
        
               | ReleaseCandidat wrote:
               | Or old Fortran style Hollerith constants. They consist of
               | the string length, a "H" and the string itself. Like
               | 13HHello, world!
        
           | Nevermark wrote:
           | Not problematic. Just a little cumbersome. (And ugly,
           | agreed.)
           | 
           | You can't always just copy and paste some text into code,
           | without adding escape encodings.
           | 
           | Now write code that generates C code with strings, that
           | generates C code with strings, and ... (ahem!)
           | 
           | It's not a big deal, but it isn't zero friction either.
           | Relevant here because it might be the most prevalent example
           | of what happens when even two simple grammars collide.
        
       ___________________________________________________________________
       (page generated 2024-02-21 23:00 UTC)