https://briancallahan.net/blog/20210816.html
Brian Robert Callahan
2021-08-16
Let's write a compiler, part 3: A parser
All source code for this blog post can be found here.
Tokenizing the source code input is a great start for our compiler.
Now let's turn our attention to parsing the PL/0 grammar. Once we
have finished with the parser, the front end of our compiler will be
finished and we will have a PL/0 validator. This is a program that
can read in PL/0 source code and inform the user whether or not it is
valid PL/0 code.
There are many strategies available for writing a parser. And lexers
too, for that matter. While we wrote a lexer by hand because it's a
useful thing to do at least once, you could have used the flex tool
instead to generate a lexer for you. Much the same, for a simple
grammar like that found in PL/0, you could use a tool such as yacc to
take the PL/0 grammar and output a parser for you.
But since the point of this series is to learn, we will write a
hand-crafted recursive-descent parser. Among other features, a
recursive-descent parser is a top-down parser. That means we will
start with the very first token in the PL/0 source code and continue
in order until we do something meaningful with the last token in the
PL/0 source code.
Some helper functions
Before we get to writing the parser itself, we are going to need two
helper functions, which I am going to name next() and expect(). The
next() function is how the parser will control the lexer, as that is
how the parser can tell the lexer to fetch the next token from the PL
/0 source code and record all of the useful information for that
token: the token type and, if needed, the token string. This is
exactly what our lexer is already doing, so we don't need to make any
changes to the lexer at all. We just need to hook it up to the
parser.
The expect() function is how we will enforce proper syntax. The
function will take one parameter, a token type, and it will check
that the currently recorded token type matches the token type we
expect to have at this point in the grammar. If it does not, we will
report a syntax error through our already existing error mechanism.
If it does match, have the lexer fetch the next token.
The code for these two helper functions might look like this:
/*
 * Parser.
 */

static void
next(void)
{

        type = lex();
        ++raw;
}

static void
expect(int match)
{

        if (match != type)
                error("syntax error");

        next();
}
Reading the PL/0 grammar
We will need to be able to read the Extended Backus-Naur Form (EBNF)
style grammar, the long grammar comment in our compiler source code,
in order to write our parser. Without going into lots of formal
theory or detail, let's keep in mind the following rules for reading
the grammar:
* Everything to the left of an = sign will become a function in our
parser.
* A | means "or." For example, a factor can be an identifier or a
number or an expression wrapped in parentheses.
* Anything wrapped in "double quotes" means literally this thing.
So for example the "const" in the block rule means that if you
get here you must literally see const as the next token, i.e., a
TOK_CONST token.
* Anything wrapped in [ square brackets ] means you must have
exactly zero or one of these. To take again for example the
entire line in the block rule that begins with [ "const"
essentially means that you may have exactly zero or one constant
declaration sections in a block.
* Anything wrapped in { curly braces } means you may have zero or
more of these, with no upper limit. In the block rule, you may
have as many procedures as you want or no procedures, if that's
what you want.
* Anything wrapped in ( parentheses ) means you must have exactly
one of these. This only appears in the condition, expression, and
term rules.
* The bare dot at the end of each rule simply symbolizes the end of
the rule and doesn't have any meaning for us.
* We don't need to worry about what an identifier or a number is.
If you remember, the lexer knows how to determine those things
and provides us with the token string for all identifiers,
reserved words, and numbers. The lexer also validates numbers for
us. For the sake of repetition, in our version of the PL/0
language an identifier begins with an A-Za-z_ character and is
followed by zero or more A-Za-z0-9_ characters. A number begins
with a 0-9 character and contains one or more 0-9 characters with
optional _ characters for human readability. The reserved words
are all lowercase.
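To tie these rules together, here is a small hypothetical PL/0
program (made up for illustration, not taken from the compiler's
code or test suite) that exercises a const section, a var section, a
single procedure, and a while loop:

```
{ a hypothetical example program }
const limit = 10;
var i, total;

procedure add;
begin
        total := total + i
end;

begin
        total := 0;
        i := 1;
        while i < limit do
        begin
                call add;
                i := i + 1
        end
end.
```

Walk it against the grammar: the block rule matches the const, var,
and procedure sections in order, and the final begin ... end is the
statement that closes out the program before the literal dot.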
Writing our first parser functions
Let's start at the top. The very first rule in our grammar is
program. The program rule is defined to be a block rule followed
by a literal dot. Let's write that out in code:
static void
parse(void)
{

        next();
        block();
        expect(TOK_DOT);

        if (type != 0)
                error("extra tokens at end of file");
}
OK, so I slightly violated my rule of naming the function after
the rule. This will be the only time I do this. This is because
parse() was the name of the function that held the lexer printing
logic from the previous entry and I've simply replaced that logic
with this new logic that begins the compilation process.
Let's see what we have here. The first thing we do is call next()
because if we do not, then the lexer will not fetch a token for
us and we won't have anything to work with. Next we have what
looks extremely similar to the rule itself: a block function
followed by us enforcing a literal dot. Because the expect()
function fetches the next token automatically, we can also check
that no other tokens exist after that literal dot. If there are,
that's an error because you have code after the point at which
the grammar says you cannot have any more code. You can have
whitespace and comments after the final literal dot, because our
lexer does not return on comments as comments are not tokens. The
lexer will instead loop again, reach the end of the input, and
return 0, signifying the end of the input. And thus we are OK for
this special case of having comments after the final dot.
Now all we have to do is descend down into the functions that are
as-of-now undefined, which is the block() function. Which just so
happens to correspond to the next rule in the PL/0 grammar, the
block rule.
Dealing with optional sections
The block rule has several optional sections. The first is the
const section. As it is wrapped in square brackets, that means
you can have exactly zero or exactly one const sections. We can
turn this into code by following the same rules we did for
writing the previous function, except we will wrap it inside an
if statement that checks to see if the first token is a TOK_CONST
token. If it is, we will execute everything in the block for
exactly one const section. If it is not, we will skip the entire
block for exactly zero const sections.
This opening part of the block() function might look like this:
static void
block(void)
{

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }
You might have already figured out how to deal with the curly
brace wrapped parts of the grammar reading through the code
above. Curly braces mean zero or more, with no upper limit. And
so handling curly braces is exactly the same as dealing with
square brackets, except the if for square brackets becomes a
while for curly braces, which allows zero iterations or any
number of iterations.
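This if/while mapping can be seen in miniature outside of pl0c. The
following toy validator (hypothetical code written just for
illustration; the rule and the names take() and list() are invented)
checks single-character token strings against the made-up EBNF rule
list = [ "+" ] "a" { "," "a" } ";" :

```c
/* Toy validator for the invented rule:
 *
 *     list = [ "+" ] "a" { "," "a" } ";" .
 *
 * Square brackets become an if, curly braces become a while, and each
 * quoted literal becomes a check-and-advance.
 */
static const char *tp;          /* current token (one char per token) */

static int
take(int match)
{

        if (*tp != match)
                return 0;       /* syntax error */
        ++tp;
        return 1;
}

static int
list(const char *s)
{

        tp = s;

        if (*tp == '+') {               /* [ "+" ]  ->  if */
                if (!take('+'))
                        return 0;
        }

        if (!take('a'))
                return 0;

        while (*tp == ',') {            /* { "," "a" }  ->  while */
                if (!take(','))
                        return 0;
                if (!take('a'))
                        return 0;
        }

        return take(';') && *tp == '\0';
}
```

Here list("+a,a,a;") and list("a;") are accepted while list("a,,a;")
and list("a") are rejected; our block() will follow the same shape,
with expect() playing the role of the check-and-advance.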
We can follow these rules to finish the block() function:
static void
block(void)
{

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }

        if (type == TOK_VAR) {
                expect(TOK_VAR);
                expect(TOK_IDENT);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                }
                expect(TOK_SEMICOLON);
        }

        while (type == TOK_PROCEDURE) {
                expect(TOK_PROCEDURE);
                expect(TOK_IDENT);
                expect(TOK_SEMICOLON);
                block();
                expect(TOK_SEMICOLON);
        }

        statement();
}
Both the const and var sections are in square brackets, while the
procedure section is in curly braces. The statement section is
not inside anything, so it is not optional. You might also notice
that a procedure is little more than a named block, and it
recursively calls block() to do its work. This means that you can
have infinitely nested procedures as per the grammar. That's a
common feature of Pascal-like languages. But I am going to do a
little bit of future planning here and forbid nested procedures.
You may have as many procedures as you want, they just can't be
nested inside each other. If I remember correctly, Pascal
compilers enforced a limit on how deep the nested procedures
could go (though I'm not sure if any banned nested procedures
outright, someone on Hacker News will let us know, I'm sure).
We will not alter the grammar in order to achieve this forbidding
of nested procedures, however. I want us to implement a parser
for the PL/0 grammar exactly as the grammar is written. We will
instead use some metadata to enforce no nested procedures. Let's
create a new global variable named depth and we will increment
the depth level at the beginning of each block and decrement the
depth level at the end of each block. We can use this to check
that we don't have nested procedures.
That makes our finished block() function look like this:
static void
block(void)
{

        if (depth++ > 1)
                error("nesting depth exceeded");

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }

        if (type == TOK_VAR) {
                expect(TOK_VAR);
                expect(TOK_IDENT);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                }
                expect(TOK_SEMICOLON);
        }

        while (type == TOK_PROCEDURE) {
                expect(TOK_PROCEDURE);
                expect(TOK_IDENT);
                expect(TOK_SEMICOLON);
                block();
                expect(TOK_SEMICOLON);
        }

        statement();

        if (--depth < 0)
                error("nesting depth fell below 0");
}
Additionally, you might notice that while procedures have names,
they do not accept any parameters nor do they return any values.
They are equivalent to a void function in C that accepts no
parameters, i.e., void procedure(void). Keep that in mind when
you are writing PL/0 code to give to our PL/0 compiler.
Descending further down the grammar
Are there any functions in the block() function that we have not
yet implemented? We are going to keep asking this question until
every grammar rule has its own function. The answer is yes, just
one: statement(). Lucky us, the statement rule is the next rule
in the grammar.
Handling or
The statement rule has a number of possibilities. Each
possibility is separated by a |, which means "or." To figure out
what to do, we must look at the first token in each possibility.
What we are looking for is that every possibility begins with a
different token. If that were not the case, we might have to
rewrite the affected possibilities until we reached a point where
each possibility begins with a different token. We are lucky, or
possibly Dr. Wirth did it by design (it was done by design),
because each possibility begins with a different token. That
lends itself well to a switch statement. Handling | here means a
switch statement on the first token with each possibility being
its own case.
That might look like this:
static void
statement(void)
{

        switch (type) {
        case TOK_IDENT:
                expect(TOK_IDENT);
                expect(TOK_ASSIGN);
                expression();
                break;
        case TOK_CALL:
                expect(TOK_CALL);
                expect(TOK_IDENT);
                break;
        case TOK_BEGIN:
                expect(TOK_BEGIN);
                statement();
                while (type == TOK_SEMICOLON) {
                        expect(TOK_SEMICOLON);
                        statement();
                }
                expect(TOK_END);
                break;
        case TOK_IF:
                expect(TOK_IF);
                condition();
                expect(TOK_THEN);
                statement();
                break;
        case TOK_WHILE:
                expect(TOK_WHILE);
                condition();
                expect(TOK_DO);
                statement();
        }
}
All we need to do is follow the logic we already learned
implementing the program and block rules to implement the
statement rule, with each possibility living in its own case.
Note that there is no default case. This is to leave open the
possibility for the empty statement.
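As a hypothetical illustration, the empty statement is what makes a
trailing semicolon inside begin ... end legal: after the last
semicolon, statement() sees TOK_END, matches no case, and returns
without consuming anything.

```
var x;

begin
        x := 1;
end.
```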
Descending even further down the grammar
Let's ask the question again: are there any undefined functions
in our grammar? Yes, there are two. Both of them were first used
in the statement() function: expression() and condition(). And
sure enough, there is a grammar rule named expression and a
grammar rule named condition. The condition rule is next in the
grammar, so let's do that one first.
The condition rule has two possibilities, and, just like the
statement rule, each possibility begins with a different token.
Since there are only two possibilities, a switch statement is
unnecessary. Also, because the expression rule has many possible
starting tokens, none of which is a TOK_ODD token, let's check for
a TOK_ODD token; if we don't find one, we must have an expression.
That might look like this:
static void
condition(void)
{

        if (type == TOK_ODD) {
                expect(TOK_ODD);
                expression();
        } else {
                expression();
                switch (type) {
                case TOK_EQUAL:
                case TOK_HASH:
                case TOK_LESSTHAN:
                case TOK_GREATERTHAN:
                        next();
                        break;
                default:
                        error("invalid conditional");
                }
                expression();
        }
}
No new functions. Just the expression() function that we have yet
to implement. So let's implement it. Implementing the expression
rule will require writing the term and factor rules as well, which
are all the rules we have left. Let's finish the job:
static void
factor(void)
{

        switch (type) {
        case TOK_IDENT:
        case TOK_NUMBER:
                next();
                break;
        case TOK_LPAREN:
                expect(TOK_LPAREN);
                expression();
                expect(TOK_RPAREN);
        }
}

static void
term(void)
{

        factor();

        while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
                next();
                factor();
        }
}

static void
expression(void)
{

        if (type == TOK_PLUS || type == TOK_MINUS)
                next();

        term();

        while (type == TOK_PLUS || type == TOK_MINUS) {
                next();
                term();
        }
}
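It is worth pausing on the shape of these three functions:
expression() loops over terms, term() loops over factors, and that
nesting is what gives * and / higher precedence than + and -, while
the while loops make the operators left-associative. Our parser only
validates, but the same shape can evaluate. Here is a standalone
sketch (hypothetical code, not part of pl0c; the names and the
single-digit restriction are invented for brevity):

```c
/* Evaluate single-digit arithmetic with the factor/term/expression
 * shape. Because term() sits "inside" expression(), multiplication
 * and division bind tighter than addition and subtraction.
 */
static const char *ep;

static long eval_expression(void);

static long
eval_factor(void)
{

        if (*ep == '(') {
                long v;

                ++ep;                   /* skip '(' */
                v = eval_expression();
                ++ep;                   /* skip ')' */
                return v;
        }

        return *ep++ - '0';             /* a single-digit "number" */
}

static long
eval_term(void)
{
        long v;

        v = eval_factor();
        while (*ep == '*' || *ep == '/') {
                if (*ep++ == '*')
                        v *= eval_factor();
                else
                        v /= eval_factor();
        }

        return v;
}

static long
eval_expression(void)
{
        long v;
        int neg = 0;

        if (*ep == '+' || *ep == '-')
                neg = (*ep++ == '-');

        v = eval_term();
        if (neg)
                v = -v;

        while (*ep == '+' || *ep == '-') {
                if (*ep++ == '+')
                        v += eval_term();
                else
                        v -= eval_term();
        }

        return v;
}

static long
eval(const char *s)
{

        ep = s;
        return eval_expression();
}
```

With this sketch, eval("2+3*4") gives 14 while eval("(2+3)*4") gives
20, which is exactly the precedence the grammar encodes.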
Wrapping up the parser
There are no undefined functions to implement in our code, and
all the grammar rules have a corresponding function in our
compiler, so our parser is finished. And now with both our lexer
and our parser finished, we have a completed front end for our PL
/0 compiler. We should be able to use our front end as a
validator, as it should accept all valid PL/0 code and reject all
invalid PL/0 code.
For completeness, here is all the code we have written thus far:
#include <sys/stat.h>

#include <ctype.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define TOK_IDENT 'I'
#define TOK_NUMBER 'N'
#define TOK_CONST 'C'
#define TOK_VAR 'V'
#define TOK_PROCEDURE 'P'
#define TOK_CALL 'c'
#define TOK_BEGIN 'B'
#define TOK_END 'E'
#define TOK_IF 'i'
#define TOK_THEN 'T'
#define TOK_WHILE 'W'
#define TOK_DO 'D'
#define TOK_ODD 'O'
#define TOK_DOT '.'
#define TOK_EQUAL '='
#define TOK_COMMA ','
#define TOK_SEMICOLON ';'
#define TOK_ASSIGN ':'
#define TOK_HASH '#'
#define TOK_LESSTHAN '<'
#define TOK_GREATERTHAN '>'
#define TOK_PLUS '+'
#define TOK_MINUS '-'
#define TOK_MULTIPLY '*'
#define TOK_DIVIDE '/'
#define TOK_LPAREN '('
#define TOK_RPAREN ')'
/*
 * pl0c -- PL/0 compiler.
 *
 * program     = block "." .
 * block       = [ "const" ident "=" number { "," ident "=" number } ";" ]
 *               [ "var" ident { "," ident } ";" ]
 *               { "procedure" ident ";" block ";" } statement .
 * statement   = [ ident ":=" expression
 *               | "call" ident
 *               | "begin" statement { ";" statement } "end"
 *               | "if" condition "then" statement
 *               | "while" condition "do" statement ] .
 * condition   = "odd" expression
 *               | expression ( "=" | "#" | "<" | ">" ) expression .
 * expression  = [ "+" | "-" ] term { ( "+" | "-" ) term } .
 * term        = factor { ( "*" | "/" ) factor } .
 * factor      = ident
 *               | number
 *               | "(" expression ")" .
 */
static char *raw, *token;
static int depth, type;
static size_t line = 1;
static void expression(void);
/*
 * Misc. functions.
 */

static void
error(const char *fmt, ...)
{
        va_list ap;

        (void) fprintf(stderr, "pl0c: error: %lu: ", line);

        va_start(ap, fmt);
        (void) vfprintf(stderr, fmt, ap);
        va_end(ap);

        (void) fputc('\n', stderr);

        exit(1);
}

static void
readin(char *file)
{
        int fd;
        struct stat st;

        if (strrchr(file, '.') == NULL)
                error("file must end in '.pl0'");

        if (!!strcmp(strrchr(file, '.'), ".pl0"))
                error("file must end in '.pl0'");

        if ((fd = open(file, O_RDONLY)) == -1)
                error("couldn't open %s", file);

        if (fstat(fd, &st) == -1)
                error("couldn't get file size");

        if ((raw = malloc(st.st_size + 1)) == NULL)
                error("malloc failed");

        if (read(fd, raw, st.st_size) != st.st_size)
                error("couldn't read %s", file);
        raw[st.st_size] = '\0';

        (void) close(fd);
}
/*
 * Lexer.
 */

static void
comment(void)
{
        int ch;

        while ((ch = *raw++) != '}') {
                if (ch == '\0')
                        error("unterminated comment");
                if (ch == '\n')
                        ++line;
        }
}

static int
ident(void)
{
        char *p;
        size_t i, len;

        p = raw;
        while (isalnum(*raw) || *raw == '_')
                ++raw;

        len = raw - p;

        --raw;

        free(token);
        if ((token = malloc(len + 1)) == NULL)
                error("malloc failed");

        for (i = 0; i < len; i++)
                token[i] = *p++;
        token[i] = '\0';

        if (!strcmp(token, "const"))
                return TOK_CONST;
        else if (!strcmp(token, "var"))
                return TOK_VAR;
        else if (!strcmp(token, "procedure"))
                return TOK_PROCEDURE;
        else if (!strcmp(token, "call"))
                return TOK_CALL;
        else if (!strcmp(token, "begin"))
                return TOK_BEGIN;
        else if (!strcmp(token, "end"))
                return TOK_END;
        else if (!strcmp(token, "if"))
                return TOK_IF;
        else if (!strcmp(token, "then"))
                return TOK_THEN;
        else if (!strcmp(token, "while"))
                return TOK_WHILE;
        else if (!strcmp(token, "do"))
                return TOK_DO;
        else if (!strcmp(token, "odd"))
                return TOK_ODD;

        return TOK_IDENT;
}

static int
number(void)
{
        const char *errstr;
        char *p;
        size_t i, j = 0, len;

        p = raw;
        while (isdigit(*raw) || *raw == '_')
                ++raw;

        len = raw - p;

        --raw;

        free(token);
        if ((token = malloc(len + 1)) == NULL)
                error("malloc failed");

        for (i = 0; i < len; i++) {
                if (isdigit(*p))
                        token[j++] = *p;
                ++p;
        }
        token[j] = '\0';

        (void) strtonum(token, 0, LONG_MAX, &errstr);
        if (errstr != NULL)
                error("invalid number: %s", token);

        return TOK_NUMBER;
}

static int
lex(void)
{

again:
        /* Skip whitespace. */
        while (*raw == ' ' || *raw == '\t' || *raw == '\n') {
                if (*raw++ == '\n')
                        ++line;
        }

        if (isalpha(*raw) || *raw == '_')
                return ident();

        if (isdigit(*raw))
                return number();

        switch (*raw) {
        case '{':
                comment();
                goto again;
        case '.':
        case '=':
        case ',':
        case ';':
        case '#':
        case '<':
        case '>':
        case '+':
        case '-':
        case '*':
        case '/':
        case '(':
        case ')':
                return (*raw);
        case ':':
                if (*++raw != '=')
                        error("unknown token: ':%c'", *raw);
                return TOK_ASSIGN;
        case '\0':
                return 0;
        default:
                error("unknown token: '%c'", *raw);
        }

        return 0;
}
/*
 * Parser.
 */

static void
next(void)
{

        type = lex();
        ++raw;
}

static void
expect(int match)
{

        if (match != type)
                error("syntax error");

        next();
}

static void
factor(void)
{

        switch (type) {
        case TOK_IDENT:
        case TOK_NUMBER:
                next();
                break;
        case TOK_LPAREN:
                expect(TOK_LPAREN);
                expression();
                expect(TOK_RPAREN);
        }
}

static void
term(void)
{

        factor();

        while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
                next();
                factor();
        }
}

static void
expression(void)
{

        if (type == TOK_PLUS || type == TOK_MINUS)
                next();

        term();

        while (type == TOK_PLUS || type == TOK_MINUS) {
                next();
                term();
        }
}

static void
condition(void)
{

        if (type == TOK_ODD) {
                expect(TOK_ODD);
                expression();
        } else {
                expression();
                switch (type) {
                case TOK_EQUAL:
                case TOK_HASH:
                case TOK_LESSTHAN:
                case TOK_GREATERTHAN:
                        next();
                        break;
                default:
                        error("invalid conditional");
                }
                expression();
        }
}

static void
statement(void)
{

        switch (type) {
        case TOK_IDENT:
                expect(TOK_IDENT);
                expect(TOK_ASSIGN);
                expression();
                break;
        case TOK_CALL:
                expect(TOK_CALL);
                expect(TOK_IDENT);
                break;
        case TOK_BEGIN:
                expect(TOK_BEGIN);
                statement();
                while (type == TOK_SEMICOLON) {
                        expect(TOK_SEMICOLON);
                        statement();
                }
                expect(TOK_END);
                break;
        case TOK_IF:
                expect(TOK_IF);
                condition();
                expect(TOK_THEN);
                statement();
                break;
        case TOK_WHILE:
                expect(TOK_WHILE);
                condition();
                expect(TOK_DO);
                statement();
        }
}

static void
block(void)
{

        if (depth++ > 1)
                error("nesting depth exceeded");

        if (type == TOK_CONST) {
                expect(TOK_CONST);
                expect(TOK_IDENT);
                expect(TOK_EQUAL);
                expect(TOK_NUMBER);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                        expect(TOK_EQUAL);
                        expect(TOK_NUMBER);
                }
                expect(TOK_SEMICOLON);
        }

        if (type == TOK_VAR) {
                expect(TOK_VAR);
                expect(TOK_IDENT);
                while (type == TOK_COMMA) {
                        expect(TOK_COMMA);
                        expect(TOK_IDENT);
                }
                expect(TOK_SEMICOLON);
        }

        while (type == TOK_PROCEDURE) {
                expect(TOK_PROCEDURE);
                expect(TOK_IDENT);
                expect(TOK_SEMICOLON);
                block();
                expect(TOK_SEMICOLON);
        }

        statement();

        if (--depth < 0)
                error("nesting depth fell below 0");
}

static void
parse(void)
{

        next();
        block();
        expect(TOK_DOT);

        if (type != 0)
                error("extra tokens at end of file");
}
/*
 * Main.
 */

int
main(int argc, char *argv[])
{
        char *startp;

        if (argc != 2) {
                (void) fputs("usage: pl0c file.pl0\n", stderr);
                exit(1);
        }

        readin(argv[1]);
        startp = raw;

        parse();

        free(startp);

        return 0;
}
On the next episode: Testing
How do we know our front end is correct, anyway? Yes, you can
take my word for it. But don't take my word for it. Have test
programs that you can use to verify the correctness of the
compiler. As we turn our attention to the code generator, we will
have to make tweaks to the parser to store metadata about what we
are parsing and pass that to the code generator. It would be good
to ensure, at each step along the way, that we don't let any
errors creep into the parser. And what if we wanted to add new
reserved words or fix the unary minus deficiency in the PL/0
grammar? It would be good to know we didn't break anything in
those cases.