[HN Gopher] Parsing and all that
___________________________________________________________________
Parsing and all that
Author : foresterre
Score : 24 points
Date : 2024-04-18 01:01 UTC (1 days ago)
(HTM) web link (blog.jeffsmits.net)
(TXT) w3m dump (blog.jeffsmits.net)
| kazinator wrote:
| Combining a finite state machine for recognizing sentential
| patterns, with a stack, gives us LR parsing.
|
| But push-down automata are significant not only because they have
| practical uses in parsing, but because they represent a
| theoretical class. They are more powerful than finite automata,
| but are not Turing complete.
|
| If, say, we use a push-down automaton to make a Boolean decision:
| does this input string fall into the to-be-recognized set, or
| not? then there are some kinds of strings it won't be able to
| decide, that a full Turing machine could decide.
|
| The limitation of push down automata is a direct consequence of
| the stack discipline. There is only one stack, and when the stack
| is popped, information is thrown away.
|
| The tape machine that Turing described as part of developing his
| theory of computation doesn't have that limitation; when the tape
| head moves backwards, material it has written to the tape is not
| erased in a stack-like discipline.
| mcguire wrote:
| Neat observation: finite automata, push-down automata, and
| Turing machines can be distinguished purely by the number of
| stacks they need.
|
| Finite automata: no stack.
|
| Pushdown automata: one stack.
|
| Turing machine: two stacks. (Use them like the tape: to move
| forward, pop the f-stack and push the symbol on the b-stack; to
| move backward pop the b-stack and push to the f-stack.)
| norir wrote:
| There is a part of me that feels we should just forget that LR
| ever existed and insist on LL(k) for k <= 2 for all new
| languages. The supposed expressivity benefits are not worth the
| complexity cost in my not so humble opinion. For me, the best
| approach is to write the grammar in a dsl that checks that it is
| unambiguous and in LL(k) and then implement by hand with
| recursive descent.
| marcosdumay wrote:
| Well, IMO, but you write a parser only a couple of times for
| each language, but you read code all the time.
|
| So, anything that makes code easier to read is a must. That
| includes the extra expressiveness of an LR grammar.
| mhh__ wrote:
| In 2012 yes, but these days having grammars that lots of
| tools can parse and output easily is a boon for tooling.
| vidarh wrote:
| I agree there are a few places like this, but they're usually
| solved when you want an LL parser for as much as possible of
| the language by either making them part of the tokenization
| _or_ by "escaping" a small subset of the grammar.
|
| I wish that language designers (looking at you, Ruby - as
| much as I love _using_ Ruby, the grammar is an abomination)
| would at least _try_ to stick to LL as much as possible, and
| consciously, carefully and in a limited fashion "escape"
| into LR for as small subsets as possible.
|
| Break rules if it actually helps developers, sure, but it'd
| be nice if people tried to make the weird bits as tiny as
| possible...
|
| EDIT: Also, in practice this is often what we end up doing
| when we write recursive descent anyway, only in a less than
| formal way. It'd be nice if more tooling let us write
| grammars that default to LL w/checking, but let you insert
| _clearly delineated_ exceptions.
| marcosdumay wrote:
| Oh, I do agree with that!
|
| A language should be LL as much as possible. And also
| depend on the parsed data as little as possible. Stuff like
| C's variable declaration must be avoided.
|
| But I don't think either of those rules can be made
| universal. Some language for expressing exceptions would be
| great.
| haberman wrote:
| Do you know of any particularly readable constructs that are
| possible with LR but incompatible with LL?
|
| My general opinion is that any constructs that confuse a
| parser will also confuse a human.
| marcosdumay wrote:
| Operators with precedence rules.
|
| And given the modern idea of "everything is either an
| identifier or an operator", precedence became very
| important.
| haberman wrote:
| Operators with precedence rules frequently appear in top-
| down parsers. I agree that pure-LL form doesn't make this
| easy, but most languages I have seen use Pratt parsers
| that allow precedence parsing within a top-down
| framework. It doesn't require full LR.
| Rusky wrote:
| Pratt parsing is essentially hand-written LR. It's
| certainly bottom-up in the same way, despite Pratt's
| paper title.
|
| (I went into more depth on this a few years ago:
| https://www.abubalay.com/blog/2021/12/31/lr-control-flow)
| dbcurtis wrote:
| I tend to agree. Having done a bunch of DSL's over the years,
| and becoming extremely fluent with Bison+FLEX, I have to say
| that a hand-written LL parser with a couple of mutually
| recursive table-driving functions to parse expressions via
| precedence climbing is the way to go. Precedence climbing (or I
| think it is also called Pratt parsing?) is super simple and
| flexible. But the BIG bonus is in error reporting, and let's
| face it, most of the user interaction with your parser is with
| the error messages, otherwise it is push 'enter' and forget.
| Getting a reasonable error message out of an LR parser is about
| as much fun as poking yourself in the eye repeatedly with a
| sharp stick. With LL, you know what structure you are trying to
| build... with LR, you have to rummage around in parse stack
| looking for obscure clues.
| Calavar wrote:
| I am aware this is an unpopular opinion, but I think the idea
| that recursive descent is inherently better suited to error
| reporting than table-driven LR is massively cargo-culted. You
| can handcraft error messages for an LR parser if you
| overspecify the grammar, meaning that you don't just write
| productions for the actual language; you also write
| productions for likely errors and have the predicates for
| those productions report your handcrafted error message.
|
| I'm guessing that a lot of people's first impression of that
| will be "That's insane, are you seriously suggesting writing
| two grammars in one?" But that's exactly what you do when you
| build custom error reporting code into a recursive descent
| parser. It's just less obvious because the code is procedural
| instead of declarative.
| keybored wrote:
| How relevant are Context Free grammars for modern programming
| languages? It seems that most modern languages want things like
| "raw strings" where you can deal with arbitrary strings by using
| syntax like (Rust) `r###` (however many `#` you need). It is my
| understanding that syntax like that makes these grammars fall
| outside of Context Free, no?
| Rusky wrote:
| Features like that _technically_ make the language context-
| sensitive, but in practice that context sensitivity is entirely
| contained within the lexer, and you can still describe the rest
| of the language with a context free grammar of tokens.
| keybored wrote:
| Okay, thanks. I think that's good.
| chowells wrote:
| That sort of thing is typically handled during lexing. By the
| time the parser deals with it, it's a string literal. This is
| one of the advantages of splitting those into separate phases.
|
| There are other hacks in use as well. Haskell's syntax is
| context-sensitive because operator precedence and associativity
| can be declared in the same context as the operator is used.
| But GHC ignores that during the phase where it actually parses
| the input. It just ignores precedence and associativity when
| parsing operators. After it finishes parsing the translation
| unit, another phase goes back, finds nested operator
| applications in the parse tree, and fixes them up based on the
| precedence and associativity information it has found.
|
| It turns out that when your language is only just barely
| context-sensitive, you can find a way to use context-free tools
| to handle it, with a couple extra passes to fix things up.
___________________________________________________________________
(page generated 2024-04-19 23:01 UTC)