[HN Gopher] Let's write a compiler, part 3: A parser
___________________________________________________________________
Let's write a compiler, part 3: A parser
Author : ingve
Score : 99 points
Date : 2021-08-16 13:37 UTC (9 hours ago)
(HTM) web link (briancallahan.net)
(TXT) w3m dump (briancallahan.net)
| eatonphil wrote:
| Anyone can do whatever they want, and if you do want to write a
| parser in C this may be a good guide for you. But it's hard to
| understand the benefit to writing a language frontend in C,
| generally speaking. (An obvious exception might be if you are
| writing one of the first languages for a new
| architecture/operating system or one where only a C compiler
| exists.)
|
| If you are writing a compiler, you'll just be emitting some other
| language anyway so the parser/emitter being in C doesn't really
| give you anything even if you are emitting C.
|
| If you are writing an interpreter and you want performance,
| you're going to have to emit bytecode and have a bytecode
| interpreter. Wanting your bytecode _interpreter_ to be in C could
| make sense but your frontend could just emit the bytecode to a
| file and your interpreter written in C could read from that file.
|
| It's a handwritten parser too so it's not like you picked the
| language so you could use a certain complex or nice parsing
| library.
|
| And it's not like the performance (or binary size) of the
| parser/emitter typically matters so much that you need even that
| part to be so much faster (or smaller) than a parser/emitter in
| Go/Java/C#/whatever.
|
| The overhead to writing C makes it a less friendly introduction
| to the topic if you have never written a compiler or interpreter
| (or parser, specifically) before.
|
| Again, you can do what you want. And if you must do it in C, you
| will probably value any posts like this. Just wanted to throw out
| reasons you may not want to do this.
| arkj wrote:
| > But it's hard to understand the benefit to writing a language
| frontend in C, generally speaking.
|
| The advantage of writing in C is that the language provides
| very little "high" level features. All you have is just
| functions and memory so you end up doing everything which is a
| laborious process but rewarding to the mind especially if you
| are someone who never did low level programming.
| david2ndaccount wrote:
| C is a simple language and at least used to be a language
| almost everyone knew.
|
| I don't know why they're writing the parser using globals
| though instead of passing a pointer to a parser state struct.
| arcticbull wrote:
| C is an incredibly complex language, it just appears simple.
| It's simple to get started and incredibly hard to do
| correctly, like skiing.
| IdiocyInAction wrote:
| C itself is not particularly complex, especially compared
| to other languages, like C++. It does have a ton of
| footguns and nasty UB things though.
|
| But writing good, safe code in C can be very complex. That
| doesn't make the language itself is complex.
| retrac wrote:
| No, those things are not a complexity of C. They're just
| complex to implement at a lower level. They're complicated
| in any language, often, if approached at the level C
| provides. You have to hand-hold the machine through it. A
| C-like implementation of linked lists in an array of memory
| using pointers, for example, looks just as bad if written
| in that way in unsafe {} Rust, or even some mutating thing
| in Haskell. You can do that if you want, but you don't have
| to. Those languages allow you to abstract that away very
| quickly (often already pre-abstracted for you with default
| approaches). That's what people mean when they say C is
| simple. Plain and simple. Minimal. You have to do much of
| the heavy lifting.
| arcticbull wrote:
| Respectfully disagree. A language that fails to
| adequately provide safeguards to a user is a complex
| language because it forces all the complexity of dealing
| with the underlying machine onto the user. These are
| things a user needs to know to be effective in the
| language, which in turn makes using the language complex.
| It's a _deceptively_ simple language.
|
| C is complex because of what it doesn't do, not what it
| does. That's IMO worse.
| retrac wrote:
| It's not so much that we're disagreeing as arguing
| different points about the language, I think. You are
| right that it is not a simple language to use, in many
| ways. Archaic, unnecessarily complicated to implement
| many standard approaches, and very repetitive. The header
| system is just the wrong approach. Etc. I wouldn't pick
| it for the main language for a new project, if at all
| avoidable personally.
|
| It's very easy to implement, though. A compiler fits in
| 32 kilobytes of RAM on some architectures. Only a handful
| of control structures. Limited types. That's the sense of
| simple that I meant, and which most people meant. It's
| not a compliment. It's a neutral observation. It's
| actually C's greatest weakness. It's _too_ simple.
| arcticbull wrote:
| I think we're in violent agreement :)
| int_19h wrote:
| Parsers involve a lot of string processing, which is really
| inconvenient and easy to get wrong in C.
| fishmaster wrote:
| > The overhead to writing C makes it a less friendly
| introduction to the topic if you have never written a compiler
| or interpreter (or parser, specifically) before.
|
| That's true and was my first thought. C seems like an arcane
| choice for an introduction course.
| foxfluff wrote:
| Brian's choice to use C is perfectly well justified:
|
| "I am thinking that we should choose C to implement our PL/0
| compiler. Eventually we will want to have this compiler run on
| many different platforms, including Unix, MS-DOS, and
| potentially even CP/M. C is a language that will run on every
| platform I would want to run our PL/0 compiler on, so C works
| for me. C also requires no special setup so our usual
| development platform of vim on OpenBSD will work just fine.
| Finally, it will help with bootstrapping the self-hosting
| version of the PL/0 compiler, since we in theory could natively
| compile the C implementation on our target platform then use
| that to compile the PL/0 implementation, also on the target
| platform."
|
| The reader may learn something from these posts whether they
| choose to write a compiler in C or not.
| eatonphil wrote:
| I don't mean to debate Brian's choice (your quote shows it
| does fall into my exception anyway) but mention to readers
| here why they may be more interested in following guides on
| compiling/interpreting implemented in other languages than C.
| fishmaster wrote:
| C is actually a huge disadvantage here for me, since I have
| no use for it outside of this tutorial which makes it hard
| to justify learning it on top of learning how a compiler
| works...
| ibains wrote:
| This post is good for learning, we have too many "systems"
| developed in Java (Hadoop et al - I'm looking at you - you
| sluggish bugs in the rug), we need more programmers to learn C
| and build systems in it.
|
| Also, you can do this so much easier in scala using parser
| combinators, if the goal is time to market and the code to be
| compiled is not very large. Here is an example to see how the
| approaches compare: https://www.prophecy.io/blogs/scala-packrat-
| parser-combinato...
| nikkinana wrote:
| yacc is key
| chrisseaton wrote:
| Never use Yacc. Just write your own parser.
| arcticbull wrote:
| Lemon is an infinitely better parser generator than Yacc; it
| solves a lot of the issues that make Yacc terrible to work
| with: named non-terminals, non-terminal destructors, re-
| entrancy, etc. It's part of the SQLite codebase. Highly
| recommended.
|
| In many cases, though, I think parser combinators thread the
| needle elegantly - nom in Rust for instance. Or even just a
| hand-rolled LL(1).
| eatonphil wrote:
| Adding on, the fraction of "real" language implementations
| using handwritten parsers vs. parser generators basically
| seems to be half and half.
|
| The Wikipedia list of Bison users even calls out that a
| number of them switched from Bison to handwritten parsers.
|
| https://en.wikipedia.org/wiki/GNU_Bison#Use
| p4bl0 wrote:
| Why would you say that? Yacc and other parser generators
| exist for a good reason: hand written parsers can be quite
| hairy to debug and extend, while parser generators offer a
| domain specific language to specify your grammar and can
| generate efficient parser code based on it.
|
| What are the arguments in favor of manually writing parsers?
| chrisseaton wrote:
| Fundamentally I don't think parsing is a problem that's
| complex enough to warrant a custom tool and language. And
| even if it was Yacc is the wrong tool for the job in most
| cases.
|
| Since most languages are context-sensitive, you almost
| always have to bend Yacc, which is designed for context-
| free languages, out of shape to apply it.
|
| It's a hammer but we hardly have any nails, and the nails
| can just be pushed in by hand without a hammer, and they're
| the wrong type of nails anyway so that hammer isn't really
| compatible, and top of that it's a really expensive hammer.
| rightbyte wrote:
| Isn't the context sensitivity handled when you build an
| AST with the action statements in Yacc?
| chrisseaton wrote:
| Yes - so straight away you have to leave the DSL and
| bypass the formal mode of Yacc. At which point why bother
| with it? If the first thing you need to do with your tool
| is hack around it, maybe the tool isn't so well-designed?
|
| And actions often do things like side-effect changes to
| the lexer state, so it gets worse from there.
| codr7 wrote:
| What few seem to realize is that it's perfectly possible to
| abstract out some of the work and create an extensible
| foundation for manual parsers.
|
| Since it's all regular code, you can use the full power of
| the host language to deal with the problem.
|
| https://github.com/codr7/swifties/blob/main/Sources/Swiftie
| s...
| p4bl0 wrote:
| I understand that, but I still find it easier to
| write/maintain/extend a grammar than a parser for that
| grammar.
| brundolf wrote:
| If you've got the right utility functions, a handwritten
| parser visually maps very directly to its grammar. It
| won't be as concise of course, but cognitively you can
| work with it almost as if it were just a grammar.
| fishmaster wrote:
| I'm thinking about following the tutorial in Julia or Kotlin, but
| it would be interesting how the code looks in a different
| language like F#.
| the_benno wrote:
| F# is an ML-family language, and ML stands for meta-language --
| it was designed for language tools and compilers! Andrew
| Appel's Modern Compiler Implementation in ML is a great
| resource for code examples that aren't exactly F#, but will
| look damn close. The book has also been published with C and
| Java, if you want side-by-side comparisons
| pfdietz wrote:
| ML was designed to be the metalanguage for the LCF proof
| system, used to write proof tactics. It was then recognized
| to be interesting in its own right as a programming language.
| the_benno wrote:
| Oh, thanks for the correction and apologies for the
| inadvertent misinfo.
|
| Your comment sent me reading the HOPL paper about SML -- I
| think I (based on folklore knowledge/informal
| conversations) conflate the original development of ML with
| subsequent development and standardization around SML. All
| of this was well before my time so it's quite interesting
| to read about some of the details.
| BazookaMusic wrote:
| The code isn't perfect, it's a learning project and still WIP,
| but you can check the code for a parser for the LOX language (
| c-like) from 'Crafting Interpreters' in this project. 'Ast.fs'
| and 'Parser.fs' should be enough to get an idea of what it's
| doing.You can also google the grammar of the language which is
| quite simple:
|
| https://github.com/BazookaMusic/FLOX
| musicale wrote:
| I expect this series is similar to Wirth's own chapter on writing
| a PL/0 compiler. It seems to be following the same recursive
| descent approach, but using C rather than Pascal.
|
| Since the implementation is in C, perhaps it might make sense to
| compile a version of PL/0 ("C/0") with C syntax rather than
| Pascal syntax. It might be possible to get something close to a
| C/0 compiler that could compile its own source code.
|
| Though one issue with C is you also need a cpp implementation.
| kazinator wrote:
| The author forgot default cases in various switch (tok)
| statements: static void factor(void)
| { switch (type) { case TOK_IDENT: case
| TOK_NUMBER: next(); break; case
| TOK_LPAREN: expect(TOK_LPAREN);
| expression(); expect(TOK_RPAREN); } }
|
| The TOK_ enumeration has a good many more members than just those
| three. If any other one occurs in the code, or an outright
| corrupt type value occurs, it silently falls through. If the
| error is handled at all, it will result in some strange
| diagnostic, in some higher level code that thinks it has seen an
| expression with a factor in it.
|
| Adding a break after the last block in a switch is a good idea;
| someone adding a new case might forget. (Wit diagnostic support,
| like GCC requiring a fallthrough comment, this is less important
| nowadays).
|
| If a switch deliberately falls through for unmatched cases, have
| default: break;
|
| to express that explicitly.
|
| This whole approach of the parser consisting of void functions is
| poor for 2021. It's as if the author is deliberately trolling
| everyone who has ever taken any level of interest in functional
| programming.
|
| The author will have a hard time fitting the necessary semantic
| actions into this code, like constructing an abstract syntax
| tree, and might succumb to the temptation to do it with some
| hacks which build the structure procedurally using global
| variables.
|
| The parser would be a more future proof if it anticipated the
| need for backtracking. So that is to say, a function such as
| factor() above could be considered to be a pattern matching
| construct, and return a useful value indicating whether it
| matched or failed to match. Furthermore, if it it failed to
| match, then it would backtrack, leaving the state of the input
| stream as it was prior to entry.
|
| You can then do interesting things, because you're essentially
| LL(k) now, like speculatively parse the input using several
| possible constructs in priority order, until one matches.
|
| The code inconsistently mixes the use of concrete character
| constants like '.' and '{' and #define symbols that expand to
| character constants like #define TOK_PLUS '+'.
|
| Using the character constants in the code as in expect('+') or
| whatever is clearer. The Yacc approach of using constants for
| abstract tokens, (which are enumerated starting at some value
| above 256), and concrete character constants for single-character
| tokens that denote themselves, is worth mimicking.
|
| You're not going to redefine TOK_PLUS to something other than '+'
| in the future, and doing so would be like like changing #define
| FOUR 4 to #define FOUR 5. TOK_PLUS doesn't inform any better than
| '+', and isn't any easier to find with grep.
| wott wrote:
| > The author will have a hard time fitting the necessary
| semantic actions into this code, like constructing an abstract
| syntax tree,
|
| It's a declared non-goal.
|
| > The code inconsistently mixes the use of concrete character
| constants like '.' and '{' and #define symbols that expand to
| character constants like #define TOK_PLUS '+'. > Using the
| character constants in the code as in expect('+') or whatever
| is clearer.
|
| That's not how it works, TOK_* represent whole elements, the
| element may happen to be a single character, but it can be a
| word. It would be highly confusing to pass single characters to
| expect() as you wouldn't know at first sight whether it expects
| a simple character or an element. The TOK_* could as well be
| assigned a random number as you say in your following sentence.
| And they were until yesterday :-) I guess it is easier to debug
| with mnemonic characters.
|
| https://github.com/ibara/pl0c/commit/aba92627deac3f8cfe08679...
| [deleted]
| kazinator wrote:
| Using character constants for denoting one-character tokens
| is a time-honored tradition that everyone understands.
|
| E.g. GNU Awk:
|
| http://git.savannah.gnu.org/cgit/gawk.git/tree/awkgram.y?h=g.
| ..
|
| This is readable; it says "I am just what I look like: a
| terminal symbol corresponding to the actual +".
| p4bl0 wrote:
| > [building an AST is] a declared non-goal.
|
| Then what will be going on after parsing? What will be
| compiled?
| kazinator wrote:
| Compilers can go straight to output without building an
| AST, which is just one form of output. I should have more
| generally written about output.
|
| Output can be produced procedurally in a single pass.
| However, it will still require refactoring to work that in.
| For instance, say we go to a register machine (one that has
| an unlimited number of registers). The expression()
| function needs to know the destination register of the
| calculation. If it returns void as before, then it has to
| take it as an argument. Or else, it has to come up with the
| register and return it.
|
| I think that if the target is a stack-based machine, that
| would be the easiest to work into the parsing scheme. This
| is because without any context at all. Let's use term() as
| an example:
|
| Original recognizer skeleton: static void
| term(void) { factor(); while
| (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
| next(); factor(); } }
|
| Stack-based output: static void
| term(void) { factor(); // this now outputs
| the code which puts the factor on the stack
| while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
| next(); factor(); // of course, likewise outputs
| code. output_stack_operation(type); // we
| just add this } }
|
| What is output_stack_operation_type: static
| void output_stack_operation_type(int type) {
| switch (type) { case TOK_MULTIPLY:
| output_line("mul"); break; case TOK_DIVIDE:
| output_line("div"); break; // ... }
| }
|
| For instance if we see 3 * 4 / 6
|
| we have the grammar symbols factor
| TOK_MULTIPLY factor TOK_DIVIDE factor
|
| the term function calls factor() first, and that function
| produces output, which might be the line:
| push 3
|
| the term function calls factor() again, so this time the
| output: push 4
|
| is produced. Then it calls
| output_stack_operation_type(type) where type is
| TOK_MULTIPLY. This outputs: mul
|
| The loop continues and further produces
| push 6 div
|
| Because stack code implicitly accesses and returns operands
| using the stack, the syntax-directed translation for it
| doesn't have to pass around environmental contexts like
| register allocation info and whatnot.
|
| The stack-based operations do not take any argument, and
| therefore a parser whose functions don't return anything
| and don't take any arguments can accommodate stack-based
| code generation.
|
| Stack-based code can be an intermediate representation that
| is parsed again, and turned into something else, like
| register-based.
|
| If the plan is to go to stack-based output, the code can
| work.
| p4bl0 wrote:
| I second all this. Reading the blogpost I was wondering how the
| author will do with this parser to actually build an AST rather
| than just validating syntax.
|
| The thing is that building an AST is typically the kind of
| thing where " _do it with some hacks which build the structure
| procedurally using global variables_ " as kazinator said is a
| sure way to have horrible bugs. The inherently recursive nature
| of an AST doesn't mix very well with procedural manipulation of
| a global state.
| Someone wrote:
| I don't think there will be an AST.
| https://briancallahan.net/blog/20210814.html: _"We will be
| writing a single-pass compiler for a simple language and
| immediately output our final output code as soon as our
| compiler has enough information to do so"_
|
| That " _as soon as_ " implies code will be generated before
| the entire program has been parsed.
|
| Also, for me single-pass implies "no AST", as you would need
| at least one pass to construct one, and iterating over an AST
| counts as another pass for me.
| pwdisswordfish8 wrote:
| Single-pass to me also implies a code generator that writes
| the object file directly, rather than compiling to C or
| assembly (or some other language) as an intermediate
| format. But words don't mean anything anymore, so long as
| people can plausibly convince others that making them feel
| bad by calling out improper use means that they should be
| allowed to use words however they want, like a child owed
| the opportunity to exercise their unbridled spirit.
___________________________________________________________________
(page generated 2021-08-16 23:01 UTC)