[HN Gopher] Applicative Parsing
___________________________________________________________________
Applicative Parsing
Author : gbrown_
Score : 74 points
Date : 2021-03-01 10:59 UTC (12 hours ago)
(HTM) web link (jobjo.github.io)
(TXT) w3m dump (jobjo.github.io)
| lindig wrote:
| I believe the interesting twist is the use of a GADT to represent
| the parse result. I see this in this line of history:
|
| "Higher-Order Functions for Parsing", Graham Hutton - parser
| combinators before monadic parsing
|
| "Monadic parser combinators", Graham Hutton and Erik Meijer - the
| discovery of monadic parsing, which is now the standard for
| parser combinators
| gambler wrote:
| I've recently played with making simple JS-based parsers without
| reading anything about the theory. My conclusion was that making
| a parser by itself is a relatively easy task. The hard part is
| making grammars easy to read and write, as well as handling
| various errors.
|
| My best stab at representing something like EBNF in JS looked
| like this: class ArrayGrammar { //e.g: [1, 2,
| 3, [4, 5,],] array =()=> ['[', this.entries, ']'];
| entries =()=> many(this.entry); entry =()=>
| [either(this.array, this.number, ''), /\s*,\s*/];
| number =()=> /\d+/; }
|
| This is a grammar for "arrays" of positive integers. I tested it
| out and concluded that this is too awful to be of any practical
| use. Now I'm seeing something with significantly more involved
| representation making it to the front page of HN. Hm.
|
| ---
|
| Edit: Not much to look at, but here is the source code of the toy
| implementation if anyone is interested:
|
| https://jsfiddle.net/306jrnt9/1/
|
| The only "cool" part about it is that the parser implementation
| is close to 100 lines of code.
| rmasters wrote:
| Tree-sitter uses JS for grammars and I'm sure there are others:
|
| https://tree-sitter.github.io/tree-sitter/creating-parsers#t...
| gambler wrote:
| I was aiming for sub-100 lines of code implementation. IMO, a
| full-fledged library would be much better off if they just
| supported EBNF or McKeeman Form.
| hardmath123 wrote:
| Parsers in nearley.js [1] are written in a very readable EBNF-
| like DSL; then they get desugared down to a JS file that's a
| lot like your snippet.
|
| [1] https://github.com/kach/nearley
| platz wrote:
| Does this fall under the category of a recursive descent parser?
| chowells wrote:
| No.
|
| The other answer is also correct, that it's Yes.
|
| In other words, the question is wrong.
|
| Applicative parser combinators are an interface. The backing
| implementation is pretty open. It might be recursive descent.
| It might be the Earley algorithm. It might be a packrat parser.
|
| The interesting thing about Applicative parser combinators is
| that the way they are composed is amenable to static analysis.
| That allows a wide variety of implementation strategies, all
| behind a simple uniform interface...
|
| ... So long as you ignore that there are still things left
| unspecified by the interface, especially related to
| backtracking behavior. (I really hate how much mindspace
| Parsec-like libraries have in Haskell. They have the worst
| backtracking behavior.)
| platz wrote:
| And yet all the implementations are actually just functional
| recursion, in practice, no?
| chowells wrote:
| Yes, but that isn't saying much. It's basically the same as
| saying "they're all assembly instructions when they're sent
| to the CPU".
|
| Functional recursion is just looping. "Recursive Descent"
| is a very specific parsing algorithm.
| platz wrote:
| then you'd also claim that a parser generator grammer
| could also be implemented with functional recursion. so
| what is the value of the original claim though?
| chowells wrote:
| The original question was about "recursive descent".
|
| You keep talking about "functional recursion".
|
| These are not the same words, nor do they mean the same
| thing. I do not understand your underlying question, or
| what point you might be trying to make if the question is
| intended to be rhetorical.
| seagreen wrote:
| > ... So long as you ignore that there are still things left
| unspecified by the interface, especially related to
| backtracking behavior. (I really hate how much mindspace
| Parsec-like libraries have in Haskell. They have the worst
| backtracking behavior.)
|
| The one that auto-backtracks on Alternative is attoparsec
| (IIRC), parsec and megaparsec don't unless you use `try`. Is
| your problem with the attoparsec style or with all of them,
| and if the latter what do you prefer instead?
| kreetx wrote:
| Came here to say this (that parsec, the old standard,
| didn't backtrack).
| chowells wrote:
| Attoparsec and megaparsec don't backtrack. They hard commit
| to choices and never reconsider them after having done so.
|
| Let's say I want to match the strings "a" or "aa" followed
| by "ab". Let's just go ahead and write that the most
| trivially obviously correct way: thing =
| (string "a" <|> string "aa") <> string "ab"
|
| Woo! See how easy it is? ... except that's utterly broken
| in any parsec-derived library. They don't backtrack.
| There's nothing in the world you can do to make them
| backtrack. The only thing you can do is refactor the
| structure of the grammar to reflect the parser's
| limitations.
|
| Yes, this example is artificial, but that makes it easy to
| understand. The broader issue of which this is one example
| is that you cannot apply distributive laws in parsec or its
| derivatives. (a <*> c) <|> (b <*> c)
|
| Is not interchangeable with (a <|> b) <*>
| c
|
| On the other hand, (a <*> b) <|> (a <*>
| c)
|
| Is interchangeable with a <*> (b <|> c)
|
| This might be good for efficiency, but it's not very good
| for making it pleasant to write parsers.
| seagreen wrote:
| > Attoparsec and megaparsec don't backtrack.
|
| There are two notion of backtracking at play here. One is
| backtracking out of a failing first branch of <|> even
| though it's consumed some input. The megaparsec authors
| consider this a form of backtracking (see https://hackage
| .haskell.org/package/megaparsec-9.0.1/docs/Te...).
|
| The other is backtracking from a failure in the second
| argument to <> or <*>.
|
| I agree that none of the libraries being considered do
| the latter. But the effect on memory use wouldn't just be
| slightly inefficient, it would mean keeping the entire
| input following a <|> in memory until the whole thing
| completes. Do you think this is the best way to do things
| in general, or just for specialized situations like
| parsing things with a fixed size such as config files?
| chowells wrote:
| Very, very few parsers need to handle multiple gigabyte
| inputs. For those, sure, focus on performance first. Your
| tools should always be suited to your problem.
|
| None of that changes the fact that Parsec-like designs
| are especially obnoxious when it comes to backtracking.
| You need to be aware of their limitations and design the
| structure of your grammar around them.
|
| As an idea for a design somewhere in between, maybe try
| taking inspiration from Prolog. Add a cut combinator for
| limiting the scope of backtracking explicitly. It still
| solves the size issues on large inputs, but it allows
| more freedom in refactoring as long as you don't cross
| one of those explicit boundaries.
| lirazel wrote:
| Yes
___________________________________________________________________
(page generated 2021-03-01 23:01 UTC)