hngopher.com

       [HN Gopher] Let's write a compiler, part 3: A parser
       ___________________________________________________________________
        
       Let's write a compiler, part 3: A parser
        
       Author : ingve
       Score  : 99 points
       Date   : 2021-08-16 13:37 UTC (9 hours ago)
        
 (HTM) web link (briancallahan.net)
 (TXT) w3m dump (briancallahan.net)
        
       | eatonphil wrote:
       | Anyone can do whatever they want, and if you do want to write a
       | parser in C this may be a good guide for you. But it's hard to
       | understand the benefit to writing a language frontend in C,
       | generally speaking. (An obvious exception might be if you are
       | writing one of the first languages for a new
       | architecture/operating system or one where only a C compiler
       | exists.)
       | 
       | If you are writing a compiler, you'll just be emitting some other
       | language anyway so the parser/emitter being in C doesn't really
       | give you anything even if you are emitting C.
       | 
       | If you are writing an interpreter and you want performance,
       | you're going to have to emit bytecode and have a bytecode
       | interpreter. Wanting your bytecode _interpreter_ to be in C could
       | make sense but your frontend could just emit the bytecode to a
       | file and your interpreter written in C could read from that file.
       | 
       | It's a handwritten parser too so it's not like you picked the
       | language so you could use a certain complex or nice parsing
       | library.
       | 
       | And it's not like the performance (or binary size) of the
       | parser/emitter typically matters so much that you need even that
       | part to be so much faster (or smaller) than a parser/emitter in
       | Go/Java/C#/whatever.
       | 
       | The overhead to writing C makes it a less friendly introduction
       | to the topic if you have never written a compiler or interpreter
       | (or parser, specifically) before.
       | 
       | Again, you can do what you want. And if you must do it in C, you
       | will probably value any posts like this. Just wanted to throw out
       | reasons you may not want to do this.
        
         | arkj wrote:
         | > But it's hard to understand the benefit to writing a language
         | frontend in C, generally speaking.
         | 
         | The advantage of writing in C is that the language provides
         | very little "high" level features. All you have is just
         | functions and memory so you end up doing everything which is a
         | laborious process but rewarding to the mind especially if you
         | are someone who never did low level programming.
        
         | david2ndaccount wrote:
         | C is a simple language and at least used to be a language
         | almost everyone knew.
         | 
         | I don't know why they're writing the parser using globals
         | though instead of passing a pointer to a parser state struct.
        
           | arcticbull wrote:
           | C is an incredibly complex language, it just appears simple.
           | It's simple to get started and incredibly hard to do
           | correctly, like skiing.
        
             | IdiocyInAction wrote:
             | C itself is not particularly complex, especially compared
             | to other languages, like C++. It does have a ton of
             | footguns and nasty UB things though.
             | 
             | But writing good, safe code in C can be very complex. That
             | doesn't make the language itself is complex.
        
             | retrac wrote:
             | No, those things are not a complexity of C. They're just
             | complex to implement at a lower level. They're complicated
             | in any language, often, if approached at the level C
             | provides. You have to hand-hold the machine through it. A
             | C-like implementation of linked lists in an array of memory
             | using pointers, for example, looks just as bad if written
             | in that way in unsafe {} Rust, or even some mutating thing
             | in Haskell. You can do that if you want, but you don't have
             | to. Those languages allow you to abstract that away very
             | quickly (often already pre-abstracted for you with default
             | approaches). That's what people mean when they say C is
             | simple. Plain and simple. Minimal. You have to do much of
             | the heavy lifting.
        
               | arcticbull wrote:
               | Respectfully disagree. A language that fails to
               | adequately provide safeguards to a user is a complex
               | language because it forces all the complexity of dealing
               | with the underlying machine onto the user. These are
               | things a user needs to know to be effective in the
               | language, which in turn makes using the language complex.
               | It's a _deceptively_ simple language.
               | 
               | C is complex because of what it doesn't do, not what it
               | does. That's IMO worse.
        
               | retrac wrote:
               | It's not so much that we're disagreeing as arguing
               | different points about the language, I think. You are
               | right that it is not a simple language to use, in many
               | ways. Archaic, unnecessarily complicated to implement
               | many standard approaches, and very repetitive. The header
               | system is just the wrong approach. Etc. I wouldn't pick
               | it for the main language for a new project, if at all
               | avoidable personally.
               | 
               | It's very easy to implement, though. A compiler fits in
               | 32 kilobytes of RAM on some architectures. Only a handful
               | of control structures. Limited types. That's the sense of
               | simple that I meant, and which most people meant. It's
               | not a compliment. It's a neutral observation. It's
               | actually C's greatest weakness. It's _too_ simple.
        
               | arcticbull wrote:
               | I think we're in violent agreement :)
        
           | int_19h wrote:
           | Parsers involve a lot of string processing, which is really
           | inconvenient and easy to get wrong in C.
        
         | fishmaster wrote:
         | > The overhead to writing C makes it a less friendly
         | introduction to the topic if you have never written a compiler
         | or interpreter (or parser, specifically) before.
         | 
         | That's true and was my first thought. C seems like an arcane
         | choice for an introduction course.
        
         | foxfluff wrote:
         | Brian's choice to use C is perfectly well justified:
         | 
         | "I am thinking that we should choose C to implement our PL/0
         | compiler. Eventually we will want to have this compiler run on
         | many different platforms, including Unix, MS-DOS, and
         | potentially even CP/M. C is a language that will run on every
         | platform I would want to run our PL/0 compiler on, so C works
         | for me. C also requires no special setup so our usual
         | development platform of vim on OpenBSD will work just fine.
         | Finally, it will help with bootstrapping the self-hosting
         | version of the PL/0 compiler, since we in theory could natively
         | compile the C implementation on our target platform then use
         | that to compile the PL/0 implementation, also on the target
         | platform."
         | 
         | The reader may learn something from these posts whether they
         | choose to write a compiler in C or not.
        
           | eatonphil wrote:
           | I don't mean to debate Brian's choice (your quote shows it
           | does fall into my exception anyway) but mention to readers
           | here why they may be more interested in following guides on
           | compiling/interpreting implemented in other languages than C.
        
             | fishmaster wrote:
             | C is actually a huge disadvantage here for me, since I have
             | no use for it outside of this tutorial which makes it hard
             | to justify learning it on top of learning how a compiler
             | works...
        
       | ibains wrote:
       | This post is good for learning, we have too many "systems"
       | developed in Java (Hadoop et al - I'm looking at you - you
       | sluggish bugs in the rug), we need more programmers to learn C
       | and build systems in it.
       | 
       | Also, you can do this so much easier in scala using parser
       | combinators, if the goal is time to market and the code to be
       | compiled is not very large. Here is an example to see how the
       | approaches compare: https://www.prophecy.io/blogs/scala-packrat-
       | parser-combinato...
        
       | nikkinana wrote:
       | yacc is key
        
         | chrisseaton wrote:
         | Never use Yacc. Just write your own parser.
        
           | arcticbull wrote:
           | Lemon is an infinitely better parser generator than Yacc; it
           | solves a lot of the issues that make Yacc terrible to work
           | with: named non-terminals, non-terminal destructors, re-
           | entrancy, etc. It's part of the SQLite codebase. Highly
           | recommended.
           | 
           | In many cases, though, I think parser combinators thread the
           | needle elegantly - nom in Rust for instance. Or even just a
           | hand-rolled LL(1).
        
           | eatonphil wrote:
           | Adding on, the fraction of "real" language implementations
           | using handwritten parsers vs. parser generators basically
           | seems to be half and half.
           | 
           | The Wikipedia list of Bison users even calls out that a
           | number of them switched from Bison to handwritten parsers.
           | 
           | https://en.wikipedia.org/wiki/GNU_Bison#Use
        
           | p4bl0 wrote:
           | Why would you say that? Yacc and other parser generators
           | exist for a good reason: hand written parsers can be quite
           | hairy to debug and extend, while parser generators offer a
           | domain specific language to specify your grammar and can
           | generate efficient parser code based on it.
           | 
           | What are the arguments in favor of manually writing parsers?
        
             | chrisseaton wrote:
             | Fundamentally I don't think parsing is a problem that's
             | complex enough to warrant a custom tool and language. And
             | even if it was Yacc is the wrong tool for the job in most
             | cases.
             | 
             | Since most languages are context-sensitive, you almost
             | always have to bend Yacc, which is designed for context-
             | free languages, out of shape to apply it.
             | 
             | It's a hammer but we hardly have any nails, and the nails
             | can just be pushed in by hand without a hammer, and they're
             | the wrong type of nails anyway so that hammer isn't really
             | compatible, and top of that it's a really expensive hammer.
        
               | rightbyte wrote:
               | Isn't the context sensitivity handled when you build an
               | AST with the action statements in Yacc?
        
               | chrisseaton wrote:
               | Yes - so straight away you have to leave the DSL and
               | bypass the formal mode of Yacc. At which point why bother
               | with it? If the first thing you need to do with your tool
               | is hack around it, maybe the tool isn't so well-designed?
               | 
               | And actions often do things like side-effect changes to
               | the lexer state, so it gets worse from there.
        
             | codr7 wrote:
             | What few seem to realize is that it's perfectly possible to
             | abstract out some of the work and create an extensible
             | foundation for manual parsers.
             | 
             | Since it's all regular code, you can use the full power of
             | the host language to deal with the problem.
             | 
             | https://github.com/codr7/swifties/blob/main/Sources/Swiftie
             | s...
        
               | p4bl0 wrote:
               | I understand that, but I still find it easier to
               | write/maintain/extend a grammar than a parser for that
               | grammar.
        
               | brundolf wrote:
               | If you've got the right utility functions, a handwritten
               | parser visually maps very directly to its grammar. It
               | won't be as concise of course, but cognitively you can
               | work with it almost as if it were just a grammar.
        
       | fishmaster wrote:
       | I'm thinking about following the tutorial in Julia or Kotlin, but
       | it would be interesting how the code looks in a different
       | language like F#.
        
         | the_benno wrote:
         | F# is an ML-family language, and ML stands for meta-language --
         | it was designed for language tools and compilers! Andrew
         | Appel's Modern Compiler Implementation in ML is a great
         | resource for code examples that aren't exactly F#, but will
         | look damn close. The book has also been published with C and
         | Java, if you want side-by-side comparisons
        
           | pfdietz wrote:
           | ML was designed to be the metalanguage for the LCF proof
           | system, used to write proof tactics. It was then recognized
           | to be interesting in its own right as a programming language.
        
             | the_benno wrote:
             | Oh, thanks for the correction and apologies for the
             | inadvertent misinfo.
             | 
             | Your comment sent me reading the HOPL paper about SML -- I
             | think I (based on folklore knowledge/informal
             | conversations) conflate the original development of ML with
             | subsequent development and standardization around SML. All
             | of this was well before my time so it's quite interesting
             | to read about some of the details.
        
         | BazookaMusic wrote:
         | The code isn't perfect, it's a learning project and still WIP,
         | but you can check the code for a parser for the LOX language (
         | c-like) from 'Crafting Interpreters' in this project. 'Ast.fs'
         | and 'Parser.fs' should be enough to get an idea of what it's
         | doing.You can also google the grammar of the language which is
         | quite simple:
         | 
         | https://github.com/BazookaMusic/FLOX
        
       | musicale wrote:
       | I expect this series is similar to Wirth's own chapter on writing
       | a PL/0 compiler. It seems to be following the same recursive
       | descent approach, but using C rather than Pascal.
       | 
       | Since the implementation is in C, perhaps it might make sense to
       | compile a version of PL/0 ("C/0") with C syntax rather than
       | Pascal syntax. It might be possible to get something close to a
       | C/0 compiler that could compile its own source code.
       | 
       | Though one issue with C is you also need a cpp implementation.
        
       | kazinator wrote:
       | The author forgot default cases in various switch (tok)
       | statements:                 static void       factor(void)
       | {          switch (type) {          case TOK_IDENT:          case
       | TOK_NUMBER:            next();            break;          case
       | TOK_LPAREN:            expect(TOK_LPAREN);
       | expression();            expect(TOK_RPAREN);          }        }
       | 
       | The TOK_ enumeration has a good many more members than just those
       | three. If any other one occurs in the code, or an outright
       | corrupt type value occurs, it silently falls through. If the
       | error is handled at all, it will result in some strange
       | diagnostic, in some higher level code that thinks it has seen an
       | expression with a factor in it.
       | 
       | Adding a break after the last block in a switch is a good idea;
       | someone adding a new case might forget. (Wit diagnostic support,
       | like GCC requiring a fallthrough comment, this is less important
       | nowadays).
       | 
       | If a switch deliberately falls through for unmatched cases, have
       | default:          break;
       | 
       | to express that explicitly.
       | 
       | This whole approach of the parser consisting of void functions is
       | poor for 2021. It's as if the author is deliberately trolling
       | everyone who has ever taken any level of interest in functional
       | programming.
       | 
       | The author will have a hard time fitting the necessary semantic
       | actions into this code, like constructing an abstract syntax
       | tree, and might succumb to the temptation to do it with some
       | hacks which build the structure procedurally using global
       | variables.
       | 
       | The parser would be a more future proof if it anticipated the
       | need for backtracking. So that is to say, a function such as
       | factor() above could be considered to be a pattern matching
       | construct, and return a useful value indicating whether it
       | matched or failed to match. Furthermore, if it it failed to
       | match, then it would backtrack, leaving the state of the input
       | stream as it was prior to entry.
       | 
       | You can then do interesting things, because you're essentially
       | LL(k) now, like speculatively parse the input using several
       | possible constructs in priority order, until one matches.
       | 
       | The code inconsistently mixes the use of concrete character
       | constants like '.' and '{' and #define symbols that expand to
       | character constants like #define TOK_PLUS '+'.
       | 
       | Using the character constants in the code as in expect('+') or
       | whatever is clearer. The Yacc approach of using constants for
       | abstract tokens, (which are enumerated starting at some value
       | above 256), and concrete character constants for single-character
       | tokens that denote themselves, is worth mimicking.
       | 
       | You're not going to redefine TOK_PLUS to something other than '+'
       | in the future, and doing so would be like like changing #define
       | FOUR 4 to #define FOUR 5. TOK_PLUS doesn't inform any better than
       | '+', and isn't any easier to find with grep.
        
         | wott wrote:
         | > The author will have a hard time fitting the necessary
         | semantic actions into this code, like constructing an abstract
         | syntax tree,
         | 
         | It's a declared non-goal.
         | 
         | > The code inconsistently mixes the use of concrete character
         | constants like '.' and '{' and #define symbols that expand to
         | character constants like #define TOK_PLUS '+'. > Using the
         | character constants in the code as in expect('+') or whatever
         | is clearer.
         | 
         | That's not how it works, TOK_* represent whole elements, the
         | element may happen to be a single character, but it can be a
         | word. It would be highly confusing to pass single characters to
         | expect() as you wouldn't know at first sight whether it expects
         | a simple character or an element. The TOK_* could as well be
         | assigned a random number as you say in your following sentence.
         | And they were until yesterday :-) I guess it is easier to debug
         | with mnemonic characters.
         | 
         | https://github.com/ibara/pl0c/commit/aba92627deac3f8cfe08679...
        
           | [deleted]
        
           | kazinator wrote:
           | Using character constants for denoting one-character tokens
           | is a time-honored tradition that everyone understands.
           | 
           | E.g. GNU Awk:
           | 
           | http://git.savannah.gnu.org/cgit/gawk.git/tree/awkgram.y?h=g.
           | ..
           | 
           | This is readable; it says "I am just what I look like: a
           | terminal symbol corresponding to the actual +".
        
           | p4bl0 wrote:
           | > [building an AST is] a declared non-goal.
           | 
           | Then what will be going on after parsing? What will be
           | compiled?
        
             | kazinator wrote:
             | Compilers can go straight to output without building an
             | AST, which is just one form of output. I should have more
             | generally written about output.
             | 
             | Output can be produced procedurally in a single pass.
             | However, it will still require refactoring to work that in.
             | For instance, say we go to a register machine (one that has
             | an unlimited number of registers). The expression()
             | function needs to know the destination register of the
             | calculation. If it returns void as before, then it has to
             | take it as an argument. Or else, it has to come up with the
             | register and return it.
             | 
             | I think that if the target is a stack-based machine, that
             | would be the easiest to work into the parsing scheme. This
             | is because without any context at all. Let's use term() as
             | an example:
             | 
             | Original recognizer skeleton:                 static void
             | term(void)       {         factor();              while
             | (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
             | next();           factor();         }       }
             | 
             | Stack-based output:                 static void
             | term(void)       {         factor(); // this now outputs
             | the code which puts the factor on the stack
             | while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
             | next();           factor(); // of course, likewise outputs
             | code.                output_stack_operation(type); // we
             | just add this         }       }
             | 
             | What is output_stack_operation_type:                 static
             | void       output_stack_operation_type(int type)       {
             | switch (type) {          case TOK_MULTIPLY:
             | output_line("mul"); break;          case TOK_DIVIDE:
             | output_line("div"); break;            // ...          }
             | }
             | 
             | For instance if we see                 3 * 4 / 6
             | 
             | we have the grammar symbols                 factor
             | TOK_MULTIPLY factor TOK_DIVIDE factor
             | 
             | the term function calls factor() first, and that function
             | produces output, which might be the line:
             | push 3
             | 
             | the term function calls factor() again, so this time the
             | output:                 push 4
             | 
             | is produced. Then it calls
             | output_stack_operation_type(type) where type is
             | TOK_MULTIPLY. This outputs:                 mul
             | 
             | The loop continues and further produces
             | push 6       div
             | 
             | Because stack code implicitly accesses and returns operands
             | using the stack, the syntax-directed translation for it
             | doesn't have to pass around environmental contexts like
             | register allocation info and whatnot.
             | 
             | The stack-based operations do not take any argument, and
             | therefore a parser whose functions don't return anything
             | and don't take any arguments can accommodate stack-based
             | code generation.
             | 
             | Stack-based code can be an intermediate representation that
             | is parsed again, and turned into something else, like
             | register-based.
             | 
             | If the plan is to go to stack-based output, the code can
             | work.
        
         | p4bl0 wrote:
         | I second all this. Reading the blogpost I was wondering how the
         | author will do with this parser to actually build an AST rather
         | than just validating syntax.
         | 
         | The thing is that building an AST is typically the kind of
         | thing where " _do it with some hacks which build the structure
         | procedurally using global variables_ " as kazinator said is a
         | sure way to have horrible bugs. The inherently recursive nature
         | of an AST doesn't mix very well with procedural manipulation of
         | a global state.
        
           | Someone wrote:
           | I don't think there will be an AST.
           | https://briancallahan.net/blog/20210814.html: _"We will be
           | writing a single-pass compiler for a simple language and
           | immediately output our final output code as soon as our
           | compiler has enough information to do so"_
           | 
           | That " _as soon as_ " implies code will be generated before
           | the entire program has been parsed.
           | 
           | Also, for me single-pass implies "no AST", as you would need
           | at least one pass to construct one, and iterating over an AST
           | counts as another pass for me.
        
             | pwdisswordfish8 wrote:
             | Single-pass to me also implies a code generator that writes
             | the object file directly, rather than compiling to C or
             | assembly (or some other language) as an intermediate
             | format. But words don't mean anything anymore, so long as
             | people can plausibly convince others that making them feel
             | bad by calling out improper use means that they should be
             | allowed to use words however they want, like a child owed
             | the opportunity to exercise their unbridled spirit.
        
       ___________________________________________________________________
       (page generated 2021-08-16 23:01 UTC)