https://briancallahan.net/blog/20210816.html
Brian Robert Callahan
2021-08-16
Let's write a compiler, part 3: A parser
All source code for this blog post can be found here.
Tokenizing the source code input is a great start for our compiler.
Now let's turn our attention to parsing the PL/0 grammar. Once we
have finished with the parser, the front end of our compiler will be
finished and we will have a PL/0 validator. This is a program that
can read in PL/0 source code and inform the user whether or not it is
valid PL/0 code.
There are many strategies available for writing a parser. And lexers
too, for that matter. While we wrote a lexer by hand because it's a
useful thing to do at least once, you could have used the flex tool
instead to generate a lexer for you. Much the same, for a simple
grammar like that found in PL/0, you could use a tool such as yacc to
take the PL/0 grammar and output a parser for you.
But since the point of this series is to learn, we will write a
hand-crafted recursive-descent parser. Among other features, a
recursive-descent parser is a top-down parser. That means we will
start with the very first token in the PL/0 source code and continue
in order until we do something meaningful with the last token in the
PL/0 source code.
Some helper functions
Before we get to writing the parser itself, we are going to need two
helper functions, which I am going to name next() and expect(). The
next() function is how the parser will control the lexer, as that is
how the parser can tell the lexer to fetch the next token from the PL
/0 source code and record all of the useful information for that
token: the token type and, if needed, the token string. This is
exactly what our lexer is already doing, so we don't need to make any
changes to the lexer at all. We just need to hook it up to the
parser.
The expect() function is how we will enforce proper syntax. The
function will take one parameter, a token type, and it will check
that the currently recorded token type matches the token type we
expect to have at this point in the grammar. If it does not, we will
report a syntax error through our already existing error mechanism.
If it does match, have the lexer fetch the next token.
The code for these two helper functions might look like this:
/*
 * Parser.
 */

static void
next(void)
{

        type = lex();
        ++raw;
}

static void
expect(int match)
{

        if (match != type)
                error("syntax error");

        next();
}
Reading the PL/0 grammar
We will need to be able to read the Extended Backus-Naur Form (EBNF)
style grammar, the long grammar comment in our compiler source code,
in order to write our parser. Without going into lots of formal
theory or detail, let's keep in mind the following rules for reading
the grammar:
* Everything to the left of an = sign will become a function in our
parser.
* A | means "or." For example, a factor can be an identifier or a
number or an expression wrapped in parentheses.
* Anything wrapped in "double quotes" means literally this thing.
So for example the "const" in the block rule means that if you
get here you must literally see const as the next token, i.e., a
TOK_CONST token.
* Anything wrapped in [ square brackets ] means you must have
exactly zero or one of these. To take again for example the
entire line in the block rule that begins with [ "const"
essentially means that you may have exactly zero or one constant
declaration sections in a block.
* Anything wrapped in { curly braces } means you may have zero or
more of these, with no upper limit. In the block rule, you may
have as many procedures as you want or no procedures, if that's
what you want.
* Anything wrapped in ( parentheses ) means you must have exactly
one of these. This only appears in the condition, expression, and
term rules.
* The bare dot at the end of each rule simply symbolizes the end of
the rule and doesn't have any meaning for us.
* We don't need to worry about what an identifier or a number is.
If you remember, the lexer knows how to determine those things
and provides us with the token string for all identifiers,
reserved words, and numbers. The lexer also validates numbers for
us. For the sake of repetition, in our version of the PL/0
language an identifier begins with an A-Za-z_ character and is
followed by zero or more A-Za-z0-9_ characters. A number begins
with a 0-9 character and contains one or more 0-9 characters with
optional _ characters for human readability. The reserved words
are all lowercase.
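To tie these rules together, here is a small hypothetical PL/0
program (made up for illustration, not taken from the compiler's
code or test suite) that exercises a const section, a var section, a
single procedure, and a while loop:

```
{ a hypothetical example program }
const limit = 10;
var i, total;

procedure add;
begin
        total := total + i
end;

begin
        total := 0;
        i := 1;
        while i < limit do
        begin
                call add;
                i := i + 1
        end
end.
```

Walk it against the grammar: the block rule matches the const, var,
and procedure sections in order, and the final begin ... end is the
statement that closes out the program before the literal dot.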
Writing our first parser functions
Let's start at the top. The very first rule in our grammar is
program. The program rule is defined to be a block rule followed
by a literal dot. Let's write that out in code:
static void
parse(void)
{

        next();
        block();
        expect(TOK_DOT);

        if (type != 0)
                error("extra tokens at end of file");
}
OK, so I slightly violated my rule of naming the function after
the rule. This will be the only time I do this. This is because
parse() was the name of the function that held the lexer printing
logic from the previous entry and I've simply replaced that logic
with this new logic that begins the compilation process.
Let's see what we have here. The first thing we do is call next()
because if we do not, then the lexer will not fetch a token for
us and we won't have anything to work with. Next we have what
looks extremely similar to the rule itself: a block function
followed by us enforcing a literal dot. Because the expect()
function fetches the next token automatically, we can also check
that no other tokens exist after that literal dot. If there are,
that's an error because you have code after the point at which
the grammar says you cannot have any more code. You can have
whitespace and comments after the final literal dot, because our
lexer does not return on comments as comments are not tokens. The
lexer will instead loop again, reach the end of the input, and
return 0, signifying the end of the input. And thus we are OK for
this special case of having comments after the final dot.
Now all we have to do is descend down into the functions that are
as-of-now undefined, which is the block() function. Which just so
happens to correspond to the next rule in the PL/0 grammar, the
block rule.
Dealing with optional sections
The block rule has several optional sections. The first is the
const section. As it is wrapped in square brackets, that means
you can have exactly zero or exactly one const sections. We can
turn this into code by following the same rules we did for
writing the previous function, except we will wrap it inside an
if statement that checks to see if the first token is a TOK_CONST
token. If it is, we will execute everything in the block for
exactly one const section. If it is not, we will skip the entire
block for exactly zero const sections.
This opening part of the block() function might look like this:
static void
block(void)
{

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }
You might have already figured out how to deal with the curly
brace wrapped parts of the grammar reading through the code
above. Curly braces mean zero or more, with no upper limit. And
so handling curly braces is exactly the same as dealing with
square brackets, except the if for square brackets becomes a
while for curly braces, which allows zero iterations or any
number of iterations.
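This if/while mapping can be seen in miniature outside of pl0c. The
following toy validator (hypothetical code written just for
illustration; the rule and the names take() and list() are invented)
checks single-character token strings against the made-up EBNF rule
list = [ "+" ] "a" { "," "a" } ";" :

```c
/* Toy validator for the invented rule:
 *
 *     list = [ "+" ] "a" { "," "a" } ";" .
 *
 * Square brackets become an if, curly braces become a while, and each
 * quoted literal becomes a check-and-advance.
 */
static const char *tp;          /* current token (one char per token) */

static int
take(int match)
{

        if (*tp != match)
                return 0;       /* syntax error */
        ++tp;
        return 1;
}

static int
list(const char *s)
{

        tp = s;

        if (*tp == '+') {               /* [ "+" ]  ->  if */
                if (!take('+'))
                        return 0;
        }

        if (!take('a'))
                return 0;

        while (*tp == ',') {            /* { "," "a" }  ->  while */
                if (!take(','))
                        return 0;
                if (!take('a'))
                        return 0;
        }

        return take(';') && *tp == '\0';
}
```

Here list("+a,a,a;") and list("a;") are accepted while list("a,,a;")
and list("a") are rejected; our block() will follow the same shape,
with expect() playing the role of the check-and-advance.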
We can follow these rules to finish the block() function:
static void
block(void)
{

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }

        if (type == TOK_VAR) {
                expect(TOK_VAR);
                expect(TOK_IDENT);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                }
                expect(TOK_SEMICOLON);
        }

        while (type == TOK_PROCEDURE) {
                expect(TOK_PROCEDURE);
                expect(TOK_IDENT);
                expect(TOK_SEMICOLON);
                block();
                expect(TOK_SEMICOLON);
        }

        statement();
}
Both the const and var sections are in square brackets, while the
procedure section is in curly braces. The statement section is
not inside anything, so it is not optional. You might also notice
that a procedure is little more than a named block, and it
recursively calls block() to do its work. This means that you can
have infinitely nested procedures as per the grammar. That's a
common feature of Pascal-like languages. But I am going to do a
little bit of future planning here and forbid nested procedures.
You may have as many procedures as you want, they just can't be
nested inside each other. If I remember correctly, Pascal
compilers enforced a limit on how deep the nested procedures
could go (though I'm not sure if any banned nested procedures
outright, someone on Hacker News will let us know, I'm sure).
We will not alter the grammar in order to achieve this forbidding
of nested procedures, however. I want us to implement a parser
for the PL/0 grammar exactly as the grammar is written. We will
instead use some metadata to enforce no nested procedures. Let's
create a new global variable named depth and we will increment
the depth level at the beginning of each block and decrement the
depth level at the end of each block. We can use this to check
that we don't have nested procedures.
That makes our finished block() function look like this:
static void
block(void)
{

        if (depth++ > 1)
                error("nesting depth exceeded");

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }

        if (type == TOK_VAR) {
                expect(TOK_VAR);
                expect(TOK_IDENT);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                }
                expect(TOK_SEMICOLON);
        }

        while (type == TOK_PROCEDURE) {
                expect(TOK_PROCEDURE);
                expect(TOK_IDENT);
                expect(TOK_SEMICOLON);
                block();
                expect(TOK_SEMICOLON);
        }

        statement();

        if (--depth < 0)
                error("nesting depth fell below 0");
}
Additionally, you might notice that while procedures have names,
they do not accept any parameters nor do they return any values.
They are equivalent to a void function in C that accepts no
parameters, i.e., void procedure(void). Keep that in mind when
you are writing PL/0 code to give to our PL/0 compiler.
Descending further down the grammar
Are there any functions in the block() function that we have not
yet implemented? We are going to keep asking this question until
every grammar rule has its own function. The answer is yes, just
one: statement(). Lucky us, the statement rule is the next rule
in the grammar.
Handling or
The statement rule has a number of possibilities. Each
possibility is separated by a |, which means "or." To figure out
what to do, we must look at the first token in each possibility.
What we are looking for is that every possibility begins with a
different token. If that were not the case, we might have to
rewrite the affected possibilities until we reached a point where
each possibility begins with a different token. We are lucky, or
possibly Dr. Wirth did it by design (it was done by design),
because each possibility begins with a different token. That
lends itself well to a switch statement. Handling | here means a
switch statement on the first token with each possibility being
its own case.
That might look like this:
static void
statement(void)
{

        switch (type) {
        case TOK_IDENT:
                expect(TOK_IDENT);
                expect(TOK_ASSIGN);
                expression();
                break;
        case TOK_CALL:
                expect(TOK_CALL);
                expect(TOK_IDENT);
                break;
        case TOK_BEGIN:
                expect(TOK_BEGIN);
                statement();
                while (type == TOK_SEMICOLON) {
                        expect(TOK_SEMICOLON);
                        statement();
                }
                expect(TOK_END);
                break;
        case TOK_IF:
                expect(TOK_IF);
                condition();
                expect(TOK_THEN);
                statement();
                break;
        case TOK_WHILE:
                expect(TOK_WHILE);
                condition();
                expect(TOK_DO);
                statement();
        }
}
All we need to do is follow the logic we already learned
implementing the program and block rules to implement the
statement rule, with each possibility living in its own case.
Note that there is no default case. This is to leave open the
possibility for the empty statement.
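As a hypothetical illustration, the empty statement is what makes a
trailing semicolon inside begin ... end legal: after the last
semicolon, statement() sees TOK_END, matches no case, and returns
without consuming anything.

```
var x;

begin
        x := 1;
end.
```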
Descending even further down the grammar
Let's ask the question again: are there any undefined functions
in our grammar? Yes, there are two. Both of them were first used
in the statement() function: expression() and condition(). And
sure enough, there is a grammar rule named expression and a
grammar rule named condition. The condition rule is next in the
grammar, so let's do that one first.
The condition rule has two possibilities, and, just like the
statement rule, each possibility begins with a different token.
Since there are only two possibilities, a switch statement is
unnecessary. Also, because the expression rule has many possible
starting tokens, none of which is a TOK_ODD token, let's check for
a TOK_ODD token; if we don't find one, we must have an expression.
That might look like this:
static void
condition(void)
{

        if (type == TOK_ODD) {
                expect(TOK_ODD);
                expression();
        } else {
                expression();
                switch (type) {
                case TOK_EQUAL:
                case TOK_HASH:
                case TOK_LESSTHAN:
                case TOK_GREATERTHAN:
                        next();
                        break;
                default:
                        error("invalid conditional");
                }
                expression();
        }
}
No new functions. Just the expression() function that we have yet
to implement. So let's implement it. Implementing the expression
rule will require writing the term and factor rules as well, which
are all the rules we have left. Let's finish the job:
static void
factor(void)
{

        switch (type) {
        case TOK_IDENT:
        case TOK_NUMBER:
                next();
                break;
        case TOK_LPAREN:
                expect(TOK_LPAREN);
                expression();
                expect(TOK_RPAREN);
        }
}

static void
term(void)
{

        factor();

        while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
                next();
                factor();
        }
}

static void
expression(void)
{

        if (type == TOK_PLUS || type == TOK_MINUS)
                next();

        term();

        while (type == TOK_PLUS || type == TOK_MINUS) {
                next();
                term();
        }
}
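It is worth pausing on the shape of these three functions:
expression() loops over terms, term() loops over factors, and that
nesting is what gives * and / higher precedence than + and -, while
the while loops make the operators left-associative. Our parser only
validates, but the same shape can evaluate. Here is a standalone
sketch (hypothetical code, not part of pl0c; the names and the
single-digit restriction are invented for brevity):

```c
/* Evaluate single-digit arithmetic with the factor/term/expression
 * shape. Because term() sits "inside" expression(), multiplication
 * and division bind tighter than addition and subtraction.
 */
static const char *ep;

static long eval_expression(void);

static long
eval_factor(void)
{

        if (*ep == '(') {
                long v;

                ++ep;                   /* skip '(' */
                v = eval_expression();
                ++ep;                   /* skip ')' */
                return v;
        }

        return *ep++ - '0';             /* a single-digit "number" */
}

static long
eval_term(void)
{
        long v;

        v = eval_factor();
        while (*ep == '*' || *ep == '/') {
                if (*ep++ == '*')
                        v *= eval_factor();
                else
                        v /= eval_factor();
        }

        return v;
}

static long
eval_expression(void)
{
        long v;
        int neg = 0;

        if (*ep == '+' || *ep == '-')
                neg = (*ep++ == '-');

        v = eval_term();
        if (neg)
                v = -v;

        while (*ep == '+' || *ep == '-') {
                if (*ep++ == '+')
                        v += eval_term();
                else
                        v -= eval_term();
        }

        return v;
}

static long
eval(const char *s)
{

        ep = s;
        return eval_expression();
}
```

With this sketch, eval("2+3*4") gives 14 while eval("(2+3)*4") gives
20, which is exactly the precedence the grammar encodes.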
Wrapping up the parser
There are no undefined functions to implement in our code, and
all the grammar rules have a corresponding function in our
compiler, so our parser is finished. And now with both our lexer
and our parser finished, we have a completed front end for our PL
/0 compiler. We should be able to use our front end as a
validator, as it should accept all valid PL/0 code and reject all
invalid PL/0 code.
For completeness, here is all the code we have written thus far:
#include <sys/stat.h>

#include <ctype.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define TOK_IDENT 'I'
#define TOK_NUMBER 'N'
#define TOK_CONST 'C'
#define TOK_VAR 'V'
#define TOK_PROCEDURE 'P'
#define TOK_CALL 'c'
#define TOK_BEGIN 'B'
#define TOK_END 'E'
#define TOK_IF 'i'
#define TOK_THEN 'T'
#define TOK_WHILE 'W'
#define TOK_DO 'D'
#define TOK_ODD 'O'
#define TOK_DOT '.'
#define TOK_EQUAL '='
#define TOK_COMMA ','
#define TOK_SEMICOLON ';'
#define TOK_ASSIGN ':'
#define TOK_HASH '#'
#define TOK_LESSTHAN '<'
#define TOK_GREATERTHAN '>'
#define TOK_PLUS '+'
#define TOK_MINUS '-'
#define TOK_MULTIPLY '*'
#define TOK_DIVIDE '/'
#define TOK_LPAREN '('
#define TOK_RPAREN ')'
/*
 * pl0c -- PL/0 compiler.
 *
 * program     = block "." .
 * block       = [ "const" ident "=" number { "," ident "=" number } ";" ]
 *               [ "var" ident { "," ident } ";" ]
 *               { "procedure" ident ";" block ";" } statement .
 * statement   = [ ident ":=" expression
 *               | "call" ident
 *               | "begin" statement { ";" statement } "end"
 *               | "if" condition "then" statement
 *               | "while" condition "do" statement ] .
 * condition   = "odd" expression
 *               | expression ( "=" | "#" | "<" | ">" ) expression .
 * expression  = [ "+" | "-" ] term { ( "+" | "-" ) term } .
 * term        = factor { ( "*" | "/" ) factor } .
 * factor      = ident
 *               | number
 *               | "(" expression ")" .
 */
static char *raw, *token;
static int depth, type;
static size_t line = 1;
static void expression(void);
/*
 * Misc. functions.
 */

static void
error(const char *fmt, ...)
{
        va_list ap;

        (void) fprintf(stderr, "pl0c: error: %lu: ", line);

        va_start(ap, fmt);
        (void) vfprintf(stderr, fmt, ap);
        va_end(ap);

        (void) fputc('\n', stderr);

        exit(1);
}

static void
readin(char *file)
{
        int fd;
        struct stat st;

        if (strrchr(file, '.') == NULL)
                error("file must end in '.pl0'");

        if (!!strcmp(strrchr(file, '.'), ".pl0"))
                error("file must end in '.pl0'");

        if ((fd = open(file, O_RDONLY)) == -1)
                error("couldn't open %s", file);

        if (fstat(fd, &st) == -1)
                error("couldn't get file size");

        if ((raw = malloc(st.st_size + 1)) == NULL)
                error("malloc failed");

        if (read(fd, raw, st.st_size) != st.st_size)
                error("couldn't read %s", file);
        raw[st.st_size] = '\0';

        (void) close(fd);
}
/*
 * Lexer.
 */

static void
comment(void)
{
        int ch;

        while ((ch = *raw++) != '}') {
                if (ch == '\0')
                        error("unterminated comment");
                if (ch == '\n')
                        ++line;
        }
}

static int
ident(void)
{
        char *p;
        size_t i, len;

        p = raw;
        while (isalnum(*raw) || *raw == '_')
                ++raw;

        len = raw - p;

        --raw;

        free(token);
        if ((token = malloc(len + 1)) == NULL)
                error("malloc failed");

        for (i = 0; i < len; i++)
                token[i] = *p++;
        token[i] = '\0';

        if (!strcmp(token, "const"))
                return TOK_CONST;
        else if (!strcmp(token, "var"))
                return TOK_VAR;
        else if (!strcmp(token, "procedure"))
                return TOK_PROCEDURE;
        else if (!strcmp(token, "call"))
                return TOK_CALL;
        else if (!strcmp(token, "begin"))
                return TOK_BEGIN;
        else if (!strcmp(token, "end"))
                return TOK_END;
        else if (!strcmp(token, "if"))
                return TOK_IF;
        else if (!strcmp(token, "then"))
                return TOK_THEN;
        else if (!strcmp(token, "while"))
                return TOK_WHILE;
        else if (!strcmp(token, "do"))
                return TOK_DO;
        else if (!strcmp(token, "odd"))
                return TOK_ODD;

        return TOK_IDENT;
}

static int
number(void)
{
        const char *errstr;
        char *p;
        size_t i, j = 0, len;

        p = raw;
        while (isdigit(*raw) || *raw == '_')
                ++raw;

        len = raw - p;

        --raw;

        free(token);
        if ((token = malloc(len + 1)) == NULL)
                error("malloc failed");

        for (i = 0; i < len; i++) {
                if (isdigit(*p))
                        token[j++] = *p;
                ++p;
        }
        token[j] = '\0';

        (void) strtonum(token, 0, LONG_MAX, &errstr);
        if (errstr != NULL)
                error("invalid number: %s", token);

        return TOK_NUMBER;
}

static int
lex(void)
{

again:
        /* Skip whitespace. */
        while (*raw == ' ' || *raw == '\t' || *raw == '\n') {
                if (*raw++ == '\n')
                        ++line;
        }

        if (isalpha(*raw) || *raw == '_')
                return ident();

        if (isdigit(*raw))
                return number();

        switch (*raw) {
        case '{':
                comment();
                goto again;
        case '.':
        case '=':
        case ',':
        case ';':
        case '#':
        case '<':
        case '>':
        case '+':
        case '-':
        case '*':
        case '/':
        case '(':
        case ')':
                return (*raw);
        case ':':
                if (*++raw != '=')
                        error("unknown token: ':%c'", *raw);
                return TOK_ASSIGN;
        case '\0':
                return 0;
        default:
                error("unknown token: '%c'", *raw);
        }

        return 0;
}
/*
 * Parser.
 */

static void
next(void)
{

        type = lex();
        ++raw;
}

static void
expect(int match)
{

        if (match != type)
                error("syntax error");

        next();
}

static void
factor(void)
{

        switch (type) {
        case TOK_IDENT:
        case TOK_NUMBER:
                next();
                break;
        case TOK_LPAREN:
                expect(TOK_LPAREN);
                expression();
                expect(TOK_RPAREN);
        }
}

static void
term(void)
{

        factor();

        while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
                next();
                factor();
        }
}

static void
expression(void)
{

        if (type == TOK_PLUS || type == TOK_MINUS)
                next();

        term();

        while (type == TOK_PLUS || type == TOK_MINUS) {
                next();
                term();
        }
}

static void
condition(void)
{

        if (type == TOK_ODD) {
                expect(TOK_ODD);
                expression();
        } else {
                expression();
                switch (type) {
                case TOK_EQUAL:
                case TOK_HASH:
                case TOK_LESSTHAN:
                case TOK_GREATERTHAN:
                        next();
                        break;
                default:
                        error("invalid conditional");
                }
                expression();
        }
}

static void
statement(void)
{

        switch (type) {
        case TOK_IDENT:
                expect(TOK_IDENT);
                expect(TOK_ASSIGN);
                expression();
                break;
        case TOK_CALL:
                expect(TOK_CALL);
                expect(TOK_IDENT);
                break;
        case TOK_BEGIN:
                expect(TOK_BEGIN);
                statement();
                while (type == TOK_SEMICOLON) {
                        expect(TOK_SEMICOLON);
                        statement();
                }
                expect(TOK_END);
                break;
        case TOK_IF:
                expect(TOK_IF);
                condition();
                expect(TOK_THEN);
                statement();
                break;
        case TOK_WHILE:
                expect(TOK_WHILE);
                condition();
                expect(TOK_DO);
                statement();
        }
}

static void
block(void)
{

        if (depth++ > 1)
                error("nesting depth exceeded");

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }

        if (type == TOK_VAR) {
                expect(TOK_VAR);
                expect(TOK_IDENT);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                }
                expect(TOK_SEMICOLON);
        }

        while (type == TOK_PROCEDURE) {
                expect(TOK_PROCEDURE);
                expect(TOK_IDENT);
                expect(TOK_SEMICOLON);
                block();
                expect(TOK_SEMICOLON);
        }

        statement();

        if (--depth < 0)
                error("nesting depth fell below 0");
}

static void
parse(void)
{

        next();
        block();
        expect(TOK_DOT);

        if (type != 0)
                error("extra tokens at end of file");
}
/*
 * Main.
 */

int
main(int argc, char *argv[])
{
        char *startp;

        if (argc != 2) {
                (void) fputs("usage: pl0c file.pl0\n", stderr);
                exit(1);
        }

        readin(argv[1]);
        startp = raw;

        parse();

        free(startp);

        return 0;
}
On the next episode: Testing
How do we know our front end is correct, anyway? Yes, you can
take my word for it. But don't take my word for it. Have test
programs that you can use to verify the correctness of the
compiler. As we turn our attention to the code generator, we will
have to make tweaks to the parser to store metadata about what we
are parsing and pass that to the code generator. It would be good
to ensure, at each step along the way, that we don't let any
errors creep into the parser. And what if we wanted to add new
reserved words or fix the unary minus deficiency in the PL/0
grammar? It would be good to know we didn't break anything in
those cases.