hngopher.com

       [HN Gopher] Parsing and all that
       ___________________________________________________________________
        
       Parsing and all that
        
       Author : foresterre
       Score  : 24 points
       Date   : 2024-04-18 01:01 UTC (1 days ago)
        
 (HTM) web link (blog.jeffsmits.net)
 (TXT) w3m dump (blog.jeffsmits.net)
        
       | kazinator wrote:
       | Combining a finite state machine for recognizing sentential
       | patterns, with a stack, gives us LR parsing.
       | 
       | But push-down automata are significant not only because they have
       | practical uses in parsing, but because they represent a
       | theoretical class. They are more powerful than finite automata,
       | but are not Turing complete.
       | 
       | If, say, we use a push-down automaton to make a Boolean decision:
       | does this input string fall into the to-be-recognized set, or
       | not? then there are some kinds of strings it won't be able to
       | decide, that a full Turing machine could decide.
       | 
       | The limitation of push down automata is a direct consequence of
       | the stack discipline. There is only one stack, and when the stack
       | is popped, information is thrown away.
       | 
       | The tape machine that Turing described as part of developing his
       | theory of computation doesn't have that limitation; when the tape
       | head moves backwards, material it has written to the tape is not
       | erased in a stack-like discipline.
        
         | mcguire wrote:
         | Neat observation: finite automata, push-down automata, and
         | Turing machines can be distinguished purely by the number of
         | stacks they need.
         | 
         | Finite automata: no stack.
         | 
         | Pushdown automata: one stack.
         | 
         | Turing machine: two stacks. (Use them like the tape: to move
         | forward, pop the f-stack and push the symbol on the b-stack; to
         | move backward pop the b-stack and push to the f-stack.)
        
       | norir wrote:
       | There is a part of me that feels we should just forget that LR
       | ever existed and insist on LL(k) for k <= 2 for all new
       | languages. The supposed expressivity benefits are not worth the
       | complexity cost in my not so humble opinion. For me, the best
       | approach is to write the grammar in a dsl that checks that it is
       | unambiguous and in LL(k) and then implement by hand with
       | recursive descent.
        
         | marcosdumay wrote:
         | Well, IMO, but you write a parser only a couple of times for
         | each language, but you read code all the time.
         | 
         | So, anything that makes code easier to read is a must. That
         | includes the extra expressiveness of an LR grammar.
        
           | mhh__ wrote:
           | In 2012 yes, but these days having grammars that lots of
           | tools can parse and output easily is a boon for tooling.
        
           | vidarh wrote:
           | I agree there are a few places like this, but they're usually
           | solved when you want an LL parser for as much as possible of
           | the language by either making them part of the tokenization
           | _or_ by  "escaping" a small subset of the grammar.
           | 
           | I wish that language designers (looking at you, Ruby - as
           | much as I love _using_ Ruby, the grammar is an abomination)
           | would at least _try_ to stick to LL as much as possible, and
           | consciously, carefully and in a limited fashion  "escape"
           | into LR for as small subsets as possible.
           | 
           | Break rules if it actually helps developers, sure, but it'd
           | be nice if people tried to make the weird bits as tiny as
           | possible...
           | 
           | EDIT: Also, in practice this is often what we end up doing
           | when we write recursive descent anyway, only in a less than
           | formal way. It'd be nice if more tooling let us write
           | grammars that default to LL w/checking, but let you insert
           | _clearly delineated_ exceptions.
        
             | marcosdumay wrote:
             | Oh, I do agree with that!
             | 
             | A language should be LL as much as possible. And also
             | depend on the parsed data as little as possible. Stuff like
             | C's variable declaration must be avoided.
             | 
             | But I don't think either of those rules can be made
             | universal. Some language for expressing exceptions would be
             | great.
        
           | haberman wrote:
           | Do you know of any particularly readable constructs that are
           | possible with LR but incompatible with LL?
           | 
           | My general opinion is that any constructs that confuse a
           | parser will also confuse a human.
        
             | marcosdumay wrote:
             | Operators with precedence rules.
             | 
             | And given the modern idea of "everything is either an
             | identifier or an operator", precedence became very
             | important.
        
               | haberman wrote:
               | Operators with precedence rules frequently appear in top-
               | down parsers. I agree that pure-LL form doesn't make this
               | easy, but most languages I have seen use Pratt parsers
               | that allow precedence parsing within a top-down
               | framework. It doesn't require full LR.
        
               | Rusky wrote:
               | Pratt parsing is essentially hand-written LR. It's
               | certainly bottom-up in the same way, despite Pratt's
               | paper title.
               | 
               | (I went into more depth on this a few years ago:
               | https://www.abubalay.com/blog/2021/12/31/lr-control-flow)
        
         | dbcurtis wrote:
         | I tend to agree. Having done a bunch of DSL's over the years,
         | and becoming extremely fluent with Bison+FLEX, I have to say
         | that a hand-written LL parser with a couple of mutually
         | recursive table-driving functions to parse expressions via
         | precedence climbing is the way to go. Precedence climbing (or I
         | think it is also called Pratt parsing?) is super simple and
         | flexible. But the BIG bonus is in error reporting, and let's
         | face it, most of the user interaction with your parser is with
         | the error messages, otherwise it is push 'enter' and forget.
         | Getting a reasonable error message out of an LR parser is about
         | as much fun as poking yourself in the eye repeatedly with a
         | sharp stick. With LL, you know what structure you are trying to
         | build... with LR, you have to rummage around in parse stack
         | looking for obscure clues.
        
           | Calavar wrote:
           | I am aware this is an unpopular opinion, but I think the idea
           | that recursive descent is inherently better suited to error
           | reporting than table-driven LR is massively cargo-culted. You
           | can handcraft error messages for an LR parser if you
           | overspecify the grammar, meaning that you don't just write
           | productions for the actual language; you also write
           | productions for likely errors and have the predicates for
           | those productions report your handcrafted error message.
           | 
           | I'm guessing that a lot of people's first impression of that
           | will be "That's insane, are you seriously suggesting writing
           | two grammars in one?" But that's exactly what you do when you
           | build custom error reporting code into a recursive descent
           | parser. It's just less obvious because the code is procedural
           | instead of declarative.
        
       | keybored wrote:
       | How relevant are Context Free grammars for modern programming
       | languages? It seems that most modern languages want things like
       | "raw strings" where you can deal with arbitrary strings by using
       | syntax like (Rust) `r###` (however many `#` you need). It is my
       | understanding that syntax like that makes these grammars fall
       | outside of Context Free, no?
        
         | Rusky wrote:
         | Features like that _technically_ make the language context-
         | sensitive, but in practice that context sensitivity is entirely
         | contained within the lexer, and you can still describe the rest
         | of the language with a context free grammar of tokens.
        
           | keybored wrote:
           | Okay, thanks. I think that's good.
        
         | chowells wrote:
         | That sort of thing is typically handled during lexing. By the
         | time the parser deals with it, it's a string literal. This is
         | one of the advantages of splitting those into separate phases.
         | 
         | There are other hacks in use as well. Haskell's syntax is
         | context-sensitive because operator precedence and associativity
         | can be declared in the same context as the operator is used.
         | But GHC ignores that during the phase where it actually parses
         | the input. It just ignores precedence and associativity when
         | parsing operators. After it finishes parsing the translation
         | unit, another phase goes back, finds nested operator
         | applications in the parse tree, and fixes them up based on the
         | precedence and associativity information it has found.
         | 
         | It turns out that when your language is only just barely
         | context-sensitive, you can find a way to use context-free tools
         | to handle it, with a couple extra passes to fix things up.
        
       ___________________________________________________________________
       (page generated 2024-04-19 23:01 UTC)