[HN Gopher] Parsing: The Solved Problem That Isn't (2011)
___________________________________________________________________
Parsing: The Solved Problem That Isn't (2011)
Author : contr-error
Score : 79 points
Date : 2024-02-21 15:42 UTC (7 hours ago)
(HTM) web link (tratt.net)
(TXT) w3m dump (tratt.net)
| temp123789246 wrote:
| Have there been any notable innovations in parsing since this was
| written?
| marcusf wrote:
| An extremely layman answer is that most interesting innovation
| in parsing in relatively modern times has happened seems to be
| in the context of IDE's. I.e. incremental, high-performance
| parsing to support syntax highlighting, refactoring, etc. etc.
|
| (I may be talking out of my ass here.)
| ReleaseCandidat wrote:
| Actually the most important step of parsers (as even non-
| incremental, slow (or better: not fast) parsers are fast
| enough) is error recovery (error resilience) from syntax
| errors (mostly half written or half deleted code). What is
| time consuming is e.g. type-checking. Semantic checking in
| general, like exhaustiveness checks of pattern matches,
| syntax checking is fast.
| davidkellis wrote:
| These are not new, but my takeaways from
| https://tratt.net/laurie/blog/2020/which_parsing_approach.ht...
| and https://rust-
| analyzer.github.io/blog/2020/09/16/challeging-L... are to
| embrace various forms of LR parsing.
| https://github.com/igordejanovic/parglare is a very capable GLR
| parser, and I've been keeping a close eye on it for use in my
| projects.
| troupo wrote:
| Yes. You roll your own manual parser. It's not as difficult as
| people make it out to be.
| thechao wrote:
| Aycock & Horspool came up with a 'practical' method for
| implementing Earley parsing (conversion to a state-machine)
| that has pretty humorously good performance delta over "naive"
| Earley, and is still reasonable to implement. Joop Leo figured
| out how to get the worst-case of Earley parsing down to either
| O(n) (left-recursive, non-ambiguous) or O(n^2) (right-
| recursive, non-ambiguous). That means the Earley algorithm is
| only O(n^3) on right-recursive, ambiguous grammars; and, if
| you're doing that, you're holding your language wrong.
|
| A somewhat breathless description of all of this is in the
| Marpa parser documentation:
| https://jeffreykegler.github.io/Marpa-web-site/
|
| In practice, I've found that computers are so fast, that with
| just the Joop Leo optimizations, 'naive' Earley parsing is Good
| Enough(tm): https://loup-
| vaillant.fr/tutorials/earley-parsing/
| norir wrote:
| I feel that most of the time the two options are presented as
| either write a handwritten parser or use a parser generator. A
| nice third way is to write a custom parser generator for the
| language you wish to parse. Handwritten parsers do tend to get
| unwieldy and general purpose parser generators can have
| inscrutable behavior for any specific language.
|
| Because the grammar for a parser generator is usually much
| simpler than most general purpose programming languages, it is
| typically relatively straightforward to handwrite a parser for
| it.
| danielvaughn wrote:
| I'm not super familiar with the space, but tree-sitter seems to
| take an interesting approach in that they are an incremental
| parser. So instead of re-parsing the entire document on change,
| it only parses the affected text, thereby making it much more
| efficient for text editors.
|
| I don't know if that's specific to tree-sitter though, I'm sure
| there are other incremental parsers. I have to say that I've
| tried ANTLR and tree-sitter, and I absolutely love tree-sitter.
| It's a joy to work with.
| ReleaseCandidat wrote:
| > [incremental parsing] I don't know if that's specific to
| tree-sitter though
|
| No, it isn't. And incremental parsing is older than 2011 too
| (like at least the 70s).
|
| For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066
| IshKebab wrote:
| In my experience incremental parsing doesn't really make much
| sense. Non-incremental parsing can easily parse huge
| documents in milliseconds.
|
| Also Tree Sitter only does half the parsing job - you get a
| tree on nodes, but you have to do your own parse of that tree
| to get useful structures out.
|
| I prefer Chumsky or Nom which go all the way.
| bsder wrote:
| Yes, we resdesigned our programming languages to be easy to
| parse with limited lookahead.
| o11c wrote:
| Not sure, but I at least am certainly aware of possibilities
| that such writeups exclude.
|
| In particular, you can do (a subset of) the following in
| sequence:
|
| * write your own grammar in whatever bespoke language you want
|
| * compose those grammars into a single grammar
|
| * generate a Bison grammar from that grammar
|
| * run `bison --xml` instead of actually generating code
|
| * read the XML file and implement your own (trivial) runtime so
| you can easily handle ownership issues
|
| In particular, I am vehemently opposed to the idea of
| implementing parsers separately using some non-proven
| tool/theory, since that way leads to subtle grammar
| incompatibilities later.
| sse wrote:
| The one I'm most excited about is improved error recovery:
|
| https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
|
| https://drops.dagstuhl.de/storage/00lipics/lipics-vol166-eco...
| taeric wrote:
| My current view of what makes parsing so difficult is that people
| want to jump straight over a ton of intermediate things from
| parsing to execution. That is, we often know what we want to
| happen at the end. And we know what we are given. It is hoped
| that it is a trivially mechanical problem to go from one to the
| other.
|
| But this ignores all sorts of other steps you can take. Targeting
| multiple execution environments is an obvious step. Optimization
| is another. Trivial local optimizations like shifts over
| multiplications by 2 and fusing operations to take advantage of
| the machine that is executing it. Less trivial full program
| optimizations that can propagate constants across source files.
|
| And preemptive execution is a huge consideration, of course. Very
| little code runs in a way that can't be interrupted for some
| other code to run in the meantime. To the point that we don't
| even think of what this implies anymore. Despite accumulators
| being a very basic execution unit on most every computer.
| (Though, I think I'm thankful that reentrancy is the norm
| nowadays in functions.)
| Karellen wrote:
| What have those other things got to do with parsing though?
| Granted, they rely on parsing _having already happened_ , but I
| don't see how there's much feedback from those considerations
| to the way that parsers work, or are written, or - as the
| article discussed - can be combined?
| taeric wrote:
| You can easily view it as having nothing to do with it. My
| push is that it is the point of parsing. You don't parse
| directly into understanding/execution. You parse into another
| representation, one that we never directly talk about, so
| that you can then move it into the next level.
|
| Even English can be parsed first into the sounds. This is why
| puns work. Consider the joke, "why should you wear glasses to
| math class? It helps with division." That only works if you
| go to the sounds first. And you will have optionality in
| where to go from there.
|
| So, for parsing programs, we often first decide on primitives
| for execution. For teaching, this is often basic math
| operations. But in reality, you have far more than the basic
| math operations. And, as I was saying, you can do more with
| the intermediate representation than you probably realize at
| the outset.
| pcfwik wrote:
| There was an interesting discussion two years ago regarding
| nonobvious issues with PEGs:
|
| https://news.ycombinator.com/item?id=30414683
|
| https://news.ycombinator.com/item?id=30414879
|
| I spent a year or two working with PEGs, and ran into similar
| issues multiple times. Adding a new production could totally
| screw up seemingly unrelated parses that worked fine before.
|
| As the author points out, Earley parsing with some disambiguation
| rules (production precedence, etc.) has been much less
| finicky/annoying to work with. It's also reasonably fast for
| small parses even with a dumb implementation. Would suggest for
| prototyping/settings when runtime ambiguity is not a showstopper,
| despite the remaining issues described in the article re: having
| a separate lexer.
| dataflow wrote:
| > I spent a year or two working with PEGs
|
| > Earley parsing with some disambiguation rules
|
| Any idea why GLR always gets ignored?
| EdSchouten wrote:
| What I find annoying about using parser generators is that it
| always feels messy integrating the resulting parser into your
| application. So you write a file that contains the grammar and
| generate a parser out of that. Now you build it into your app and
| call into it to parse some input file, but that ends up giving
| you some poorly typed AST that is cluttered/hard to work with.
|
| Certain parser generators make life easier by supporting actions
| on parser/lexer rules. This is great and all, but it has the
| downside that the grammar you provide is no longer reusable.
| There's no way for others to import that grammar and provide
| custom actions for them.
|
| I don't know. In my opinion parsing theory is already solved.
| Whether it's PEG, LL, LR, LALR, whatever. One of those is
| certainly good enough for the kind of data you're trying to
| parse. I think the biggest annoyance is the tooling.
| mrkeen wrote:
| Parser combinators is what I've been loving in the last few
| years.
|
| Pros: * They're just a technique/library that you can use in
| your own language without the seperate generation step.
|
| * They're simple enough that I often roll my own rather than
| using an existing library.
|
| * They let you stick code into your parsing steps - logging,
| extra information, constructing your own results directly, e.g.
|
| * The same technique works for lexing and parsing - just write
| a parser from bytes to tokens, and a second parser from tokens
| to objects.
|
| * Depending on your languages syntax, you can get your parser
| code looking a lot like the bnf grammar you're trying to
| implement.
|
| Cons: * You will eventually run into left-recursion problems.
| It can be nightmarish trying to change the code so it 'just
| works'. You really need to step back and grok left-recursion
| itself - no handholding from parser combinators.
|
| * Same thing with precedence - you just gotta learn how to do
| it. Fixing left-recursion didn't click for me until I learned
| how to do precedence.
| kazinator wrote:
| Parsing computer languages is an entirely self-inflicted problem.
| You can easily design a language so it doesn't require any
| parsing techniques that were not known and practical in 1965, and
| it will greatly benefit the readability also.
| amelius wrote:
| End-comment
| lgas wrote:
| )
| loevborg wrote:
| Do you mean lisp? If yes I agree
| kazinator wrote:
| Including, but not limited to.
| Nevermark wrote:
| Many an argument has been won by the response: "but
| consider Lisp and Forth"
| Legend2440 wrote:
| But I don't want to be able to parse only highly restricted
| languages. I want to be able to parse _anything_ , including
| natural language or even non-languages like raw audio.
|
| My brain can do it, why can't my computer?
| reissbaker wrote:
| Your computer _can_ do it, if it has a beefy enough Nvidia
| card. That 's what LLMs are for!
|
| Grammar-based parsing for natural language isn't anywhere
| close to working, sadly, and may never be.
| deathanatos wrote:
| You've never had an LLM misinterpret what you correctly
| wrote?
|
| (... I have.)
| throwbadubadu wrote:
| Yes, never do humans misunderstand each other, or
| instructions are not clear to everyone and totally
| unambiguous, and luckily no language has pure differentiation
| of meaning by intonation, and, and.. and...
| thfuran wrote:
| Yeah, natural languages don't have a specification or
| canonical parser implementation, so they cannot be reliably
| parsed.
| Legend2440 wrote:
| They don't have a _short_ parser. They can be parsed, it
| just requires a huge amount of priors and world
| knowledge. Rule-based parsers are too simple.
| thfuran wrote:
| No, they cannot be reliably parsed. There is no
| unambiguously correct parsing for many (or, arguably,
| any) strings. Two people could say the same thing in the
| same context and mean different things by it. You can't
| even definitively say whether what they said/wrote is
| valid English. Sure, there are strings most would agree
| are and strings most would agree aren't, but even taking
| consensus opinion as the source of truth, most isn't all,
| and there's no universally agreed upon threshold for
| acceptance.
| lisper wrote:
| My favorite example, due to Douglas Hofstadter:
|
| Politicians lie.
|
| Cast iron sinks.
|
| Politicians lie in cast iron sinks.
|
| It's not actually ambiguous, but I think it's a lovely
| illustration of the subtleties of the problem.
|
| An actually ambiguous example: I saw a politician lying
| in a cast iron sink.
| Legend2440 wrote:
| That's okay - that just means your parser needs to model
| what the speaker was thinking when they said it. That's
| extra information that's required to decode the message.
| It is not necessary for the same text to always mean the
| same thing.
| dkjaudyeqooe wrote:
| This is entirely the case. Given a sensible grammar stated in a
| sensible way, it's very easy to write a nice recursive decent
| parser. They are fast and easy to maintain. It doesn't limit
| the expressiveness of your grammar unduly.
|
| Both GCC and LLVM implement recursive decent parsers for their
| C compilers.
|
| Parser generators are an abomination inflected upon us by
| academia, solving a non problem, and poorly.
| derriz wrote:
| Agree completely. Having used a bunch of parser generators
| (Antlr and bison most extensively) and written a parser
| combinator library, I came to the conclusion that they're a
| complete waste of time for practical applications.
|
| A hand-written recursive descent parser (with an embedded
| Pratt parser to handle expressions/operators) solves all the
| problems that parser generators struggle with. The big/tricky
| "issue" mentioned in the article - composing or embedding one
| parser in another - is a complete non-issue with recursive-
| descent - it's just a function call. Other basic features of
| parsing: informative/useful error messages, recovery (i.e.
| don't just blow up with the first error), seamless
| integration with the rest of the host language, remain issues
| with all parser generators but are simply not issues with
| recursive descent.
|
| And that's before you consider non-functional advantages of
| recursive-descent: debuggability, no change/additions to the
| build system, fewer/zero dependencies, no requirement to
| learn a complex DSL, etc.
| dang wrote:
| Related:
|
| _Parsing: The Solved Problem That Isn 't (2011)_ -
| https://news.ycombinator.com/item?id=8505382 - Oct 2014 (70
| comments)
|
| _Parsing: the solved problem that isn 't_ -
| https://news.ycombinator.com/item?id=2327313 - March 2011 (47
| comments)
| Nevermark wrote:
| Common example of complications of two grammars being combined: C
| code and character strings.
|
| Double quotes in C code mean begin and end of a string. But
| strings contain quotes too. And newlines. Etc.
|
| So we got the cumbersome invention of escape codes, and so
| characters strings in source (itself a character string) are not
| literally the strings they represent.
| PH95VuimJjqBqy wrote:
| at no point in my life have I ever considered escape codes to
| be problematic.
|
| ugly, yes. problematic? no.
| convolvatron wrote:
| until you need to get your string through several levels of
| escape. how many backslashes to add? depends on how deep your
| pipe is and how each of those layers is defined
| samus wrote:
| The only alternative is extracting them to other files or
| designing specialized string formats.
| lisper wrote:
| There is one obvious "specialized string format" that
| solves 99% of all escaping issues: use <<balanced
| quotes>>. The real problem isn't escaping, it is that the
| same character is used both to open and close strings.
| ctrw wrote:
| Qwerty has ` and ', but for some reason we've decided
| that '' isn't " and " is the one true quotation symbol.
| ReleaseCandidat wrote:
| Or old Fortran style Hollerith constants. They consist of
| the string length, a "H" and the string itself. Like
| 13HHello, world!
| Nevermark wrote:
| Not problematic. Just a little cumbersome. (And ugly,
| agreed.)
|
| You can't always just copy and paste some text into code,
| without adding escape encodings.
|
| Now write code that generates C code with strings, that
| generates C code with strings, and ... (ahem!)
|
| It's not a big deal, but it isn't zero friction either.
| Relevant here because it might be the most prevalent example
| of what happens when even two simple grammars collide.
___________________________________________________________________
(page generated 2024-02-21 23:00 UTC)