https://briancallahan.net/blog/20210814.html
Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social life
2021-08-14
Let's write a compiler, part 1: Introduction, selecting a language,
and doing some planning
All source code for this blog post can be found here.
I think we all knew this might be coming once we finished up with our
assembler. While we can in theory solve any computable problem by
writing the appropriate assembly for our assembler, writing assembly
code can be a bit more cumbersome than writing in a high-level
language. There is a reason we did not write our assembler in assembly
but instead chose D as the implementation language.
Today we will begin writing a compiler for a real high-level
language. It will be a simple language for sure. But let's not let
simplicity stand in our way. We will be able to write real programs
with our compiler. How real, you ask? Let's plan on writing a
compiler for our selected language, then writing a compiler for that
same language in the language itself, and using our compiler to
compile it! From zero to compiler to self-hosted
compiler. It will take a good bit of work for sure, but I think we
are up to the task. We'll spend this series writing our initial,
non-self-hosting compiler. We'll take some time to enjoy our work,
then we'll come back and embark on a second series that develops the
self-hosted compiler using the compiler we will write in this series.
For today, let's select a language for our compiler. Two languages,
actually. One language will be the language that the compiler
understands. The other will be the language that the compiler is
written in. Once we've done that, let's do some organizational work,
discuss some of the different ways we could implement our compiler,
and choose one of those implementation methods.
What our compiler will be, and, importantly, what it will not be
Before we get started, I want to make it clear that we will not be
producing an incredibly complex compiler using state-of-the-art
techniques. Nor will we be diving headfirst into compiler theory.
Instead, much like our assembler, we will produce a "good enough"
compiler using basic techniques. This blog series is a hands-on,
practical guide to creating your first compiler. Once you've mastered
this compiler, you can expand your horizons and begin to incorporate
further enhancements to all the different pieces of the compiler. We
will be writing a single-pass compiler for a simple language, emitting
our final output code as soon as our compiler has enough information
to do so. That strategy was good enough for Turbo Pascal, so it's good
enough for me.
A language to compile
There are a lot of programming languages out there. Rosetta Code, a
programming chrestomathy site, is aware of 838 languages as of this
writing, and that almost certainly represents only a fraction of all
the programming languages that have been implemented, let alone
devised.
We could devise our own language, but I think it might be easier for
us to choose a language that already exists. To make our lives a bit
easier, let's select a language that can be implemented in what is
known as a single-pass compiler; that is, we will choose a language
whose syntax requires our compiler to look at the input source code
only one time. If you remember, our assembler was a two-pass
assembler because we had to solve the problem of what happens when
you reference a label before declaring it: the assembler had to read
the source code twice, first to figure out all the label addresses
and second to use that information to generate the binary code.
Whichever language we choose for our compiler to understand, it won't
have this problem. Better still, let's select a
language that has a history of usage in teaching compiler
construction.
That might lead us to something like Pascal, which was developed
specifically for teaching programming and was designed to be parsed
with a single-pass compiler. But Pascal has a lot of extra complexity
which, while very useful for a richer programming experience,
probably is not necessary while we are just learning. However, there
is a
Pascal-like language, also developed by Niklaus Wirth, that was
developed specifically for teaching compiler construction. That
language is PL/0, and humorously it is not known by Rosetta Code
despite being first described in the mid-1970s and used to teach
compiler construction ever since. Though it will be outside the scope
of this series, if you wanted you could use our finished PL/0
compiler as a starting point to build up to a complete Pascal
compiler.
A language to implement our PL/0 compiler in
I am thinking that we should choose C to implement our PL/0 compiler.
Eventually we will want to have this compiler run on many different
platforms, including Unix, MS-DOS, and potentially even CP/M. C is a
language that will run on every platform I would want to run our PL/0
compiler on, so C works for me. C also requires no special setup so
our usual development platform of vim on OpenBSD will work just fine.
Finally, it will help with bootstrapping the self-hosting version of
the PL/0 compiler, since we could in theory natively compile the C
implementation on our target platform, then use that to compile the
PL/0 implementation, also on the target platform.
Some organizational considerations and a basic walkthrough of a
compiler
Let's plan out our PL/0 compiler a little bit before we start writing
any code. Let's start by reminding ourselves what a compiler should
do by comparing it to our assembler.
The workflow of our assembler looked roughly something like this:
+-------+
| Start |
+-------+
|
+---+
|
V
+----------------+ +------------+ ------- +--------+
| Read in a line | | Parse into | / \ No | Write |
| |--->| |--->| Pass 1? |--->| |
| of assembly | | tokens | \ / | binary |
+----------------+ +------------+ ------- +--------+
^ | |
| | Yes |
| | |
| V |
| +---------+ |
| | Record | |
| | address | |
| +---------+ |
| | V
| | -----------
| | No / End of \
+------------------------------------+-----| |
\ assembly? /
-----------
|
| Yes
|
V
+-----+
| End |
+-----+
The assembler as we wrote it is surprisingly straightforward. We were
able to treat each line as its own independent universe and generate
the binary code for that line without needing to possess any
additional context other than potentially the address of a symbol,
which we collected during the first pass. We might be able to use a
similar strategy with our compiler. However, it won't be as neat and
tidy as with our assembler. With the assembler, the newline character
was the instruction terminator. That won't quite be the same for
PL/0.
By the way, here is the syntax for PL/0:
program     = block "." .
block       = [ "const" ident "=" number { "," ident "=" number } ";" ]
              [ "var" ident { "," ident } ";" ]
              { "procedure" ident ";" block ";" } statement .
statement   = [ ident ":=" expression
              | "call" ident
              | "begin" statement { ";" statement } "end"
              | "if" condition "then" statement
              | "while" condition "do" statement ] .
condition   = "odd" expression
            | expression ( "=" | "#" | "<" | ">" ) expression .
expression  = [ "+" | "-" ] term { ( "+" | "-" ) term } .
term        = factor { ( "*" | "/" ) factor } .
factor      = ident
            | number
            | "(" expression ")" .
ident       = "A-Za-z_" { "A-Za-z0-9_" } .
number      = "0-9" { "0-9" } .
Don't worry if this doesn't make sense to you right now. But still,
try to read through it and see if you can mentally parse what the
syntax is trying to tell you. If you still can't, don't worry; we
will see this again when we write our parser and talk through it
then.
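To make the grammar a bit more concrete, here is a small example
program of my own devising that follows the rules above (the { ... }
comment uses an extension we will add at the end of this post):

    { compute the square of ten }
    const ten = 10;
    var x, squared;

    procedure square;
    begin
        squared := x * x
    end;

    begin
        x := ten;
        call square
    end.

Note how the whole program is a single block followed by a period,
and how the statements between begin and end are separated, not
terminated, by semicolons. That detail will matter when we write the
parser.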
Now, let's consider the rough workflow of our compiler:
+-------+
| Start |
+-------+
| No
+--+ +------------------------------+
| | |
V | |
+-------------+ V ------------
| Read in | +-------------+ +-------+ / Ready to \
| |-->| Lex a token |--->| Parse |--->| |
| source code | +-------------+ +-------+ \ emit code? /
+-------------+ ^ ------------
| |
| +-----------+ Yes |
| | Emit code |<------+
| +-----------+
|No |
| |
| V
| --------------
| / End of \ Yes +-----+
+--------| |---->| End |
\ source code? / +-----+
--------------
We would be wise to break up these different parts of the compiler
organizationally. There are three main parts for this compiler: the
lexer, the parser, and the code generator. The lexer's job is to take
the free-form source code and turn it into a series of tokens, the
individual units of the language. The parser's job is to take that
series of tokens and ensure that their ordering follows the rules of
the language syntax. The code generator's job is to produce
equivalent... well, equivalent what, exactly?
Well, it has to be some other language, but which one should we pick?
That depends on how much effort we want to spend on our backend. And
it doesn't even have to be just one language, either. The GCC
compiler uses multiple intermediate languages in order to enable
powerful optimization techniques.
We could choose a CPU and have our compiler output the equivalent
assembly language for that CPU. But since we are already writing our
initial compiler in C, I think we should output C. Outputting to C is
a well-recognized compilation strategy. And it will allow us to have
the PL/0 code we will write and feed to our compiler ultimately
produce native code for whatever platform we happen to be using. It
turns out that it will also save us a lot of work in our code
generator.
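Just as a hypothetical illustration--the exact output scheme is
something we will settle on when we write the code generator--a PL/0
fragment such as:

    while x > 0 do
        x := x - 1

could come out of our compiler as the equivalent C:

    while (x > 0) {
        x = x - 1;
    }

Because the C compiler then handles expression evaluation, register
allocation, and native code generation for us, our code generator can
be little more than careful printing.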
Breaking it down into steps
There is more than one approach to converting source code into the
final output code. Based on the workflow graph above, we will have
all three main parts--the lexer, the parser, and the code
generator--intertwined with each other. There will be a clear
demarcation in the code as to which functions belong to which part,
but they will still be intertwined in action. The parser will drive
the action: it will call on the lexer to hand over tokens one at a
time, as it needs them. And once the parser has enough information to
make a decision on what C code to output next, it will call the code
generator to produce that C code. The code generator never gives any
information back to the parser; the parser assumes that the code
generator will do what it is told. The parser will repeat this cycle
until it reaches the end of the source code.
In graphic form, this is how the different parts of the compiler will
communicate with each other:
+-------+ +--------+ +----------------+
| Lexer |<--->| Parser |---->| Code generator |
+-------+ +--------+ +----------------+
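To see the shape of that cycle in code, here is a tiny self-contained
sketch in C. Every name in it--lex(), cg_assign(), the token
buffer--is made up for illustration, not the interface we will settle
on in this series. It "compiles" the single PL/0 statement x := 42:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    enum { TOK_IDENT, TOK_NUMBER, TOK_ASSIGN, TOK_EOF };

    static const char *raw = "x := 42"; /* source being compiled */
    static char token[64];              /* text of current token */

    /* lexer: scan the next token from raw, return its type */
    static int
    lex(void)
    {
        size_t i = 0;

        while (isspace((unsigned char)*raw))
            raw++;

        if (*raw == '\0')
            return TOK_EOF;

        if (isalpha((unsigned char)*raw) || *raw == '_') {
            while (isalnum((unsigned char)*raw) || *raw == '_')
                token[i++] = *raw++;
            token[i] = '\0';
            return TOK_IDENT;
        }

        if (isdigit((unsigned char)*raw)) {
            while (isdigit((unsigned char)*raw))
                token[i++] = *raw++;
            token[i] = '\0';
            return TOK_NUMBER;
        }

        if (raw[0] == ':' && raw[1] == '=') {
            raw += 2;
            return TOK_ASSIGN;
        }

        fputs("syntax error\n", stderr);
        exit(1);
    }

    /* code generator: does what it is told, tells nothing back */
    static void
    cg_assign(const char *name, const char *value)
    {
        printf("%s = %s;\n", name, value);
    }

    int
    main(void)
    {
        char name[sizeof(token)];

        /* parser: recognize ident ":=" number, driving the rest */
        if (lex() != TOK_IDENT)
            exit(1);
        strcpy(name, token);
        if (lex() != TOK_ASSIGN)
            exit(1);
        if (lex() != TOK_NUMBER)
            exit(1);
        cg_assign(name, token);

        return 0;
    }

Compiled and run, it prints x = 42;. The real parser will recognize
the whole grammar, but the division of labor stays the same: the
parser asks, the lexer answers, the code generator obeys.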
But this is not the only way the work can be managed and the
different parts of a compiler can communicate with each other. We
could instead take an approach where a standalone lexer reads in
source code and writes out a temporary file that contains a list of
tokens with some additional contextual information. We could then
have a standalone parser that reads in the temporary file produced by
the lexer and parses that, producing some sort of temporary data
structure like an abstract syntax tree. We could then have a
standalone code generator that reads in that temporary data structure
produced by the parser and writes out the final output code. This is
somewhat similar to the strategy employed by the Cowgol compiler;
each major task is its own separate binary and intermediate code is
written out to files to be used as input for the next binary in the
chain.
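If we took that route, a compilation might look like running each
stage by hand, something like this (the command names are
hypothetical, purely for illustration):

    lex < program.pl0 > program.tokens
    parse < program.tokens > program.ast
    codegen < program.ast > program.c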
We could go even further and break things down into the tiniest units
of work and write a separate pass, or separate standalone utility, to
handle each of these individual units of work. This would be a
nanopass compiler.
Our first language extension
Let's also add, in this planning stage, our first extension to the
PL/0 language our compiler will understand. One thing that is missing
from the syntax above is any comment characters. It is always good to
be able to put comments into our source code. Pascal has two comment
character sets, { ... } and (* ... *), which permit nested comments
and the ability to mix and match opening and closing comment
characters. I don't think I want to implement quite that much. I will
choose { ... } for our PL/0 comment characters and not have nested
comments. I think that is good enough for us.
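As a little preview of what this extension costs us, here is a
self-contained sketch in C of skipping such a comment; the helper
name and the test rig around it are my own invention, and the real
version will arrive when we write the lexer:

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Skip a { ... } comment on the input stream. Comments do not
     * nest, so we simply consume up to the first '}'.
     */
    static void
    comment(FILE *in)
    {
        int ch;

        do {
            if ((ch = fgetc(in)) == EOF) {
                fputs("error: unterminated comment\n", stderr);
                exit(1);
            }
        } while (ch != '}');
    }

    int
    main(void)
    {
        int ch;

        /* test rig: echo stdin to stdout, comments stripped */
        while ((ch = fgetc(stdin)) != EOF) {
            if (ch == '{')
                comment(stdin);
            else
                fputc(ch, stdout);
        }

        return 0;
    }

Fed the line "var x; { a counter }", it echoes back "var x; " with
the comment gone.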
On the next episode: A lexer
Even though we didn't write any code in this post, we are now ready
to begin our coding adventure. Coming up, we will turn our attention
to writing our first bits of code. We will write a lexer for the PL/0
language, which will allow us to feed the compiler any free-form PL/0
source code and have the compiler return to us the first important
structure of a programming language: all the tokens in order from
that free-form source code.