https://briancallahan.net/blog/20220330.html

Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social
life

2022-03-30
I wrote a peephole optimizer (for QBE), part 1

All source code for this blog post can be found here.

One of the things we did not tackle when writing our PL/0 compiler
was an optimizer. Our code generator output C and then we let our C
compiler handle the rest. If we wanted optimizations, we could turn
them on in the C compiler. That's great but it didn't help us learn
anything about optimizations.

I test a lot of C compilers when testing oksh. One of those compilers
is called cproc. It's a C11 compiler that compiles C into QBE and
then has QBE transform its intermediate language into assembly, which
then follows the usual pipeline of being run through the system
assembler and ultimately the system linker.

We saw QBE before when we wrote our brainfuck compiler in QBE IL.
When observing how QBE transforms its IL into assembly, we noticed
that it missed a chance at optimization in the case where you moved a
zero into a register. The generated assembly used a mov instruction
to move an immediate of zero into the register. But the fastest and
smallest way to put a zero into a register (at least, on amd64
machines) is to xor the register with itself.
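
To make "smallest" concrete, here is a small sketch that records the byte encodings an amd64 assembler typically emits for both forms, using %rax and %eax as example registers. The array and function names are made up for illustration; only the byte values matter.

```c
#include <assert.h>
#include <stddef.h>

/* movq $0, %rax */
static const unsigned char movq_zero_rax[] = {
        0x48, 0xc7, 0xc0, 0x00, 0x00, 0x00, 0x00
};

/* xorq %rax, %rax */
static const unsigned char xorq_rax_rax[] = { 0x48, 0x31, 0xc0 };

/* movl $0, %eax */
static const unsigned char movl_zero_eax[] = {
        0xb8, 0x00, 0x00, 0x00, 0x00
};

/* xorl %eax, %eax */
static const unsigned char xorl_eax_eax[] = { 0x31, 0xc0 };

/* Bytes saved each time a 64-bit zeroing mov becomes an xor. */
size_t
bytes_saved_movq(void)
{

        return sizeof(movq_zero_rax) - sizeof(xorq_rax_rax);
}

/* Bytes saved each time a 32-bit zeroing mov becomes an xor. */
size_t
bytes_saved_movl(void)
{

        return sizeof(movl_zero_eax) - sizeof(xorl_eax_eax);
}
```

That is four bytes saved per 64-bit occurrence and three per 32-bit occurrence. On amd64, writing a 32-bit register also clears the upper half of the 64-bit one, so the xorl form would be shorter still, but that is a step beyond what we do here.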

I noticed the latest cproc, which uses the latest QBE, still fails
to make this optimization. At one point in time, the author of cproc
sent me a diff for QBE that did make the optimization but I guess it
was never applied to the QBE repository. As the savings for this
optimization really can add up over a large project, and since it
gave us an opportunity to look at peephole optimizers, I decided to
write a small program that could be run in the cproc pipeline in
between QBE and the system assembler to fix up this optimization.
Maybe we could find more optimizations as well.

A peephole optimizer

The simplest type of optimizer that we can write without having to
dive into the vast literature on compiler optimizations is a peephole
optimizer. In this technique, we read in some number of lines of
assembly and perform basic pattern matching on those lines looking
for instructions that we know can be replaced with better ones (or
entirely eliminated). "Better" in this case brings up the first
interesting question about optimization: better can be contextual
based on your goals. Do you want instructions that take less time to
execute? Do you want instructions that require fewer bytes to encode,
making for a smaller program? Something else? What is faster may not
be smaller. What is smaller may not be faster. For us with our first
optimization, we are lucky that the optimized version is both faster
and smaller.

The peephole, or window, for our first optimization is just one line.
There are some other peephole optimizations that can be performed
with a peephole of just one line: removing useless instructions, such
as moving a register into itself or adding zero to a register,
immediately come to mind.
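
As a rough sketch of what such a one-line check could look like, here is a hypothetical is_self_movq function (not part of O) that recognizes a 64-bit register being moved into itself. It deliberately handles only movq: a movl of a register into itself clears the upper half of the 64-bit register, and adding zero still updates the flags, so whether those are truly useless depends on context.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true if line is "\tmovq %rXX, %rXX" with an
 * identical source and destination register.
 */
bool
is_self_movq(const char *line)
{
        const char *src, *comma, *dst, *end;
        size_t srclen, dstlen;

        assert(line != NULL);

        if (strncmp("\tmovq %r", line, 8) != 0)
                return false;

        src = line + 6;         /* points at the '%' of the source */
        comma = strchr(src, ',');
        if (comma == NULL || comma[1] != ' ')
                return false;

        dst = comma + 2;        /* first character of the destination */
        end = strchr(dst, '\n');
        if (end == NULL)
                end = dst + strlen(dst);

        srclen = (size_t) (comma - src);
        dstlen = (size_t) (end - dst);

        return srclen == dstlen && strncmp(src, dst, srclen) == 0;
}
```

A line for which this returns true could simply be dropped instead of printed.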

After we've completed the mov to xor optimization, we'll tackle an
additional optimization that requires a larger peephole to enable.

We will name our peephole optimizer O in honor of the C compiler flag
(-O) that is commonly used to tell the C compiler to turn on its
optimizer.

Choosing a language

I am going to write O in C. Not because C is necessarily the best
language for this task; it probably isn't. But because if we write O
in C, then we can run cproc on O itself and determine the size
savings on itself. I am also going to rely on direct knowledge of
cproc to make our lives easier when figuring out what strings to
match on. There is also a little bit of argc and argv handling with
that knowledge so that O can be plugged into cproc when we are
finished.

With all of that said, our main function looks like this:

static int
usage(void)
{

        (void) fputs("usage: O in.s [-o out.s]\n", stderr);

        return 1;
}

int
main(int argc, char *argv[])
{
        FILE *fp;

        if (argc == 4) {
                if (strcmp(argv[2], "-o") != 0)
                        return usage();

                if (freopen(argv[3], "w+", stdout) == NULL) {
                        (void) fprintf(stderr, "O: error: couldn't open %s\n",
                            argv[3]);

                        return 1;
                }
        } else if (argc != 2) {
                return usage();
        }

        if (!strcmp(argv[1], "-")) {
                O(stdin);

                return 0;
        }

        if ((fp = fopen(argv[1], "r")) == NULL) {
                (void) fprintf(stderr, "O: error: couldn't open %s\n",
                    argv[1]);

                return 1;
        }

        O(fp);

        (void) fclose(fp);

        return 0;
}

Because of how cproc works, when the final stage of the compilation
is to output assembly (-S), cproc automatically appends -o [file.s]
so O needs to support that. If cproc is going to write an object file
or binary, then O will write to stdout and the assembler will read
the output in via a pipe. Additionally, input from cproc is always
stdin, demarcated as -, so we need to support that too.

To avoid too much complexity, if we do have to output to a file, we
can freopen(3) that file to stdout. Most of the time, we will be
outputting to stdout so it makes sense to make outputting to a file
the special case.

Reading in one line at a time

We can use the getline(3) function to read in one line at a time into
a buffer. While the getline(3) function is "new" with POSIX.1-2008, I
think all modern Unix systems have the function. If your system does
not, you can get a copy of getline(3) here.
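
If you are curious what such a replacement involves, here is a minimal sketch. The name fallback_getline is made up so it cannot clash with a real getline(3), and this is only an approximation of the POSIX behavior, not the copy linked above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for getline(3): reads up to and including the
 * next newline, growing *lineptr as needed. Returns the number of
 * characters read, or -1 on end of file or error.
 */
ssize_t
fallback_getline(char **lineptr, size_t *n, FILE *fp)
{
        size_t len = 0;
        char *tmp;
        int ch;

        assert(lineptr != NULL && n != NULL && fp != NULL);

        if (*lineptr == NULL || *n == 0) {
                /* realloc(NULL, ...) behaves like malloc */
                if ((tmp = realloc(*lineptr, 128)) == NULL)
                        return -1;
                *lineptr = tmp;
                *n = 128;
        }

        while ((ch = fgetc(fp)) != EOF) {
                /* keep room for this character plus the final NUL */
                if (len + 2 > *n) {
                        if ((tmp = realloc(*lineptr, *n * 2)) == NULL)
                                return -1;
                        *lineptr = tmp;
                        *n *= 2;
                }

                (*lineptr)[len++] = (char) ch;
                if (ch == '\n')
                        break;
        }

        if (len == 0)
                return -1;

        (*lineptr)[len] = '\0';

        return (ssize_t) len;
}
```

The caller owns the buffer and reuses it from call to call, which is why it has to be freed once at the very end.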

Reading in one line at a time might look something like this:

static void
O(FILE *fp)
{
        char *line = NULL;
        size_t size = 0;

        while (getline(&line, &size, fp) != -1)
                (void) fputs(line, stdout);

        free(line);
}

Yes, you really do need the call to free(3) at the end there.

This would read in one line at a time and then immediately print it
out. Of course, we need to perform a check to match for potential
lines to replace, so let's improve this a bit:

static void
one(const char *line)
{

        if (xorq(line))
                return;

        if (xorl(line))
                return;

        (void) fputs(line, stdout);
}

static void
O(FILE *fp)
{
        char *line = NULL;
        size_t size = 0;

        while (getline(&line, &size, fp) != -1)
                one(line);

        free(line);
}

Now we are reading in one line at a time, and checking if it's a movq
line that can be replaced with an xorq line. If not, perhaps it is a
movl line that can be replaced with an xorl line. If not, then it is
not a line that can be optimized this way so we should just print it
out as-is.

Find and replace

Because we know exactly what QBE is going to output, we can use that
to our advantage. We don't need to tokenize the output; we can simply
check for exact strings. That saves us a lot of logic. A movq line
that we want to match will always look like this:

        movq $0, %r__

Where __ reflects any of the 64-bit registers. Therefore, we can
match exactly that and since that's the only form this line can be
in, it is the only thing we need to check for. We want to replace
that movq line with this:

        xorq %r__, %r__

If we were to write that as a function, it might look like this:

static int
xorq(const char *line)
{

        if (!strncmp("\tmovq $0, %r", line, 12)) {
                (void) fprintf(stdout, "\txorq %%r%c%c, %%r%c%c\n", line[12],
                    line[13], line[12], line[13]);

                return 1;
        }

        return 0;
}

If you wanted to be paranoid, you could check that strlen(line) > 13
before calling fprintf.

We check to see if the first twelve characters of the line match a
movq line that would be better if it was optimized into a xorq line.
If it does, we instead write the equivalent xorq line. We are able to
use the fact that we can learn the target register directly from the
movq line and use that in the rewritten xorq line.
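
Here is a sketch of the paranoid variant. To make it easy to test, this hypothetical xorq_checked writes the replacement into a caller-supplied buffer instead of stdout; that change and the function name are assumptions for illustration, not how O itself is written.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills out on a match; returns 0 otherwise. */
int
xorq_checked(const char *line, char *out, size_t outsz)
{
        static const char prefix[] = "\tmovq $0, %r";
        const size_t plen = sizeof(prefix) - 1;         /* 12 characters */

        assert(line != NULL && out != NULL);

        if (strncmp(prefix, line, plen) != 0)
                return 0;

        /* the paranoid check: two more characters must follow the prefix */
        if (strlen(line) <= plen + 1)
                return 0;

        (void) snprintf(out, outsz, "\txorq %%r%c%c, %%r%c%c\n",
            line[plen], line[plen + 1], line[plen], line[plen + 1]);

        return 1;
}
```

The length check only guards against reading past the end of a truncated line. A two-character register name such as %r8 would still pass it, with the newline taken as the second register character.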

We can use the same strategy for optimizing movl lines into xorl
lines. Here, we want to transform:

        movl $0, %e__

Into:

        xorl %e__, %e__

In code, that looks like this:

static int
xorl(const char *line)
{

        if (!strncmp("\tmovl $0, %e", line, 12)) {
                (void) fprintf(stdout, "\txorl %%e%c%c, %%e%c%c\n", line[12],
                    line[13], line[12], line[13]);

                return 1;
        }

        return 0;
}

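We have written a peephole optimizer! Every line it rewrites becomes
both smaller and faster. One caveat to keep in mind: unlike mov, xor
overwrites the flags, so the rewrite is only safe where no later
instruction depends on flags that were set before it.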

Improving these optimizations

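But we can do better. The xorq function doesn't account for %r8 or
%r9, whose names are one character shorter, while the xorl function
doesn't account for any of the %r8d-%r15d registers. They'll need
special tweaks, but it is nothing too difficult. The versions below
come from the finished program, so instead of printing they rewrite
window->line1 in place, using a struct peephole and an xstrdup helper
that are not shown in this post: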

static int
xorq(struct peephole *window)
{
        char buf[32], r1a, r1b;

        if (window->line1 == NULL)
                return 0;

        if (!strncmp("\tmovq $0, %r", window->line1, 12)) {
                if (strlen(window->line1) < 14)
                        return 0;

                r1a = window->line1[12];
                r1b = window->line1[13];

                if (r1b == '\n')
                        r1b = ' ';

                (void) snprintf(buf, sizeof(buf), "\txorq %%r%c%c, %%r%c%c\n",
                    r1a, r1b, r1a, r1b);

                free(window->line1);
                window->line1 = xstrdup(buf);

                return 1;
        }

        return 0;
}

static int
xorl(struct peephole *window)
{
        char buf[32], e1a, e1b;

        if (window->line1 == NULL)
                return 0;

        if (!strncmp("\tmovl $0, %e", window->line1, 12)) {
                if (strlen(window->line1) != 15)
                        return 0;

                e1a = window->line1[12];
                e1b = window->line1[13];

                (void) snprintf(buf, sizeof(buf), "\txorl %%e%c%c, %%e%c%c\n",
                    e1a, e1b, e1a, e1b);

                free(window->line1);
                window->line1 = xstrdup(buf);

                return 1;
        } else if (!strncmp("\tmovl $0, %r", window->line1, 12)) {
                if (strlen(window->line1) < 14)
                        return 0;

                e1a = window->line1[12];
                e1b = window->line1[13];

                if (e1b == 'd') {
                        (void) snprintf(buf, sizeof(buf),
                            "\txorl %%r%cd, %%r%cd\n", e1a, e1a);
                } else {
                        (void) snprintf(buf, sizeof(buf),
                            "\txorl %%r%c%cd, %%r%c%cd\n", e1a, e1b, e1a, e1b);
                }

                free(window->line1);
                window->line1 = xstrdup(buf);

                return 1;
        }

        return 0;
}

It is really important that we make these tweaks, since QBE really
enjoys using the %r8d register. That way, we can benefit from this
optimization in those cases.
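
Since the struct peephole and xstrdup used above are not shown in this post, here is a guess at a minimal version of both. The field names line2 and line3, the error message, and the replace_line1 helper are assumptions for illustration; the real definitions are in the linked source.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed layout: one slot per line in the window, NULL when unused. */
struct peephole {
        char *line1;
        char *line2;
        char *line3;
};

/* Assumed helper: duplicate s or exit, so callers need no NULL check. */
char *
xstrdup(const char *s)
{
        size_t len;
        char *p;

        assert(s != NULL);

        len = strlen(s) + 1;
        if ((p = malloc(len)) == NULL) {
                (void) fputs("O: error: out of memory\n", stderr);
                exit(1);
        }

        return memcpy(p, s, len);
}

/* Replace the first line of the window, as xorq and xorl do above. */
void
replace_line1(struct peephole *window, const char *text)
{

        assert(window != NULL);

        free(window->line1);
        window->line1 = xstrdup(text);
}
```

Keeping the rewritten line in the window rather than printing it right away means a later, larger pattern can still look at it.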

Testing

While we are not yet done with our peephole optimizer, since we will
add another optimization, it would be good to test it out. We can
test it out on the command line by running O -. Here are the results
when I type in some assembly directly:

/home/brian $ O -
        movq $0, %rax
        xorq %rax, %rax
        movl $0, %ebx
        xorl %ebx, %ebx
        movq $1, %rax
        movq $1, %rax

I typed in the odd-numbered lines and O responded with the
even-numbered lines. Don't forget that you need to type in the
leading tab, since O requires the leading tab to make a successful
match. We can see that right now the optimizer successfully replaces
lines that match our logic and, importantly, successfully does not
replace lines that do not match.

Plugging O into cproc

We can execute a much bigger test by plugging O into the cproc
pipeline. We should immediately begin to see improvements to the code
we compile. And, since my OpenBSD port of cproc does the whole three
stage self-hosting compile, we will immediately learn if cproc can
handle this improvement to itself. I'm betting that it can, but it's
always better to be sure.

There are two files that need to be patched. First, we need to patch
the configure script and teach it about O, since that will place a
needed data structure in a header file to be used later:

Index: configure
--- configure.orig
+++ configure
@@ -159,6 +159,7 @@ static const char *const preprocesscmd[] = {
        "-D", "__extension__=",
 $defines};
 static const char *const codegencmd[]    = {"$DEFAULT_QBE"};
+static const char *const optimizecmd[]   = {"O"};
 static const char *const assemblecmd[]   = {"$DEFAULT_ASSEMBLER"};
 static const char *const linkcmd[]       = {"$DEFAULT_LINKER", $linkflags};
 EOF

Second, we need to patch driver.c to teach the compiler driver about
this new stage in the compilation pipeline:

Index: driver.c
--- driver.c.orig
+++ driver.c
@@ -32,6 +32,7 @@ enum stage {
        PREPROCESS,
        COMPILE,
        CODEGEN,
+       OPTIMIZE,
        ASSEMBLE,
        LINK,
 };
@@ -60,6 +61,7 @@ static struct stageinfo stages[] = {
        [PREPROCESS] = {.name = "preprocess"},
        [COMPILE]    = {.name = "compile"},
        [CODEGEN]    = {.name = "codegen"},
+       [OPTIMIZE]   = {.name = "optimize"},
        [ASSEMBLE]   = {.name = "assemble"},
        [LINK]       = {.name = "link"},
 };
@@ -381,6 +383,7 @@ main(int argc, char *argv[])
        arrayaddbuf(&stages[PREPROCESS].cmd, preprocesscmd, sizeof(preprocesscmd));
        arrayaddptr(&stages[COMPILE].cmd, compilecommand(argv[0]));
        arrayaddbuf(&stages[CODEGEN].cmd, codegencmd, sizeof(codegencmd));
+       arrayaddbuf(&stages[OPTIMIZE].cmd, optimizecmd, sizeof(optimizecmd));
        arrayaddbuf(&stages[ASSEMBLE].cmd, assemblecmd, sizeof(assemblecmd));
        arrayaddbuf(&stages[LINK].cmd, linkcmd, sizeof(linkcmd));

@@ -400,6 +403,7 @@ main(int argc, char *argv[])
        arrayaddptr(&stages[COMPILE].cmd, arch);
        arrayaddptr(&stages[CODEGEN].cmd, "-t");
        arrayaddptr(&stages[CODEGEN].cmd, qbearch);
+       arrayaddptr(&stages[OPTIMIZE].cmd, "-");

        for (;;) {
                ++argv, --argc;
@@ -414,10 +418,10 @@ main(int argc, char *argv[])
                        switch (input->filetype) {
                        case ASM:    input->stages =                                     1<<ASSEMBLE|1<<LINK; break;
                        case ASMPP:  input->stages = 1<<PREPROCESS|                      1<<ASSEMBLE|1<<LINK; break;
-                       case C:      input->stages = 1<<PREPROCESS|1<<COMPILE|1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
+                       case C:      input->stages = 1<<PREPROCESS|1<<COMPILE|1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
                        case CHDR:   input->stages = 1<<PREPROCESS                                          ; break;
-                       case CPPOUT: input->stages =               1<<COMPILE|1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
-                       case QBE:    input->stages =                          1<<CODEGEN|1<<ASSEMBLE|1<<LINK; break;
+                       case CPPOUT: input->stages =               1<<COMPILE|1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
+                       case QBE:    input->stages =                          1<<CODEGEN|1<<OPTIMIZE|1<<ASSEMBLE|1<<LINK; break;
                        case OBJ:    input->stages =                                                 1<<LINK; break;
                        default:     usage("reading from standard input requires -x");
                        }
@@ -505,7 +509,7 @@ main(int argc, char *argv[])
                                arrayaddptr(&stages[PREPROCESS].cmd, "-P");
                                break;
                        case 'S':
-                               last = CODEGEN;
+                               last = OPTIMIZE;
                                break;
                        case 's':
                                arrayaddptr(&stages[LINK].cmd, "-s");

Perhaps not too surprisingly, adding these patches and running cproc
through a three stage compile produced a working compiler.
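
To see why the -S case has to change from CODEGEN to OPTIMIZE, here is a toy model of a stage bitmask. It reuses the enum names from the patch, but describe_pipeline and everything else in it is invented for illustration and is not cproc's actual driver loop.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum stage { PREPROCESS, COMPILE, CODEGEN, OPTIMIZE, ASSEMBLE, LINK, NSTAGES };

static const char *const stagename[NSTAGES] = {
        "preprocess", "compile", "codegen", "optimize", "assemble", "link"
};

/*
 * Write the names of the stages selected in mask, up to and including
 * last, into buf as a space-separated list.
 */
void
describe_pipeline(unsigned mask, enum stage last, char *buf, size_t bufsz)
{
        size_t used = 0;
        int n, s;

        assert(buf != NULL && bufsz > 0);

        buf[0] = '\0';
        for (s = PREPROCESS; s <= (int) last; s++) {
                if (!(mask & (1u << s)))
                        continue;

                n = snprintf(buf + used, bufsz - used, "%s%s",
                    used ? " " : "", stagename[s]);
                if (n < 0 || (size_t) n >= bufsz - used)
                        break;  /* out of room; keep what fits */

                used += (size_t) n;
            }
}
```

With all six bits set and last equal to OPTIMIZE, the list ends at "optimize". Had last stayed at CODEGEN, the assembly written by -S would skip O entirely.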

Some preliminary numbers

Comparing sizes for cproc itself would not be a fair fight, since we
literally added code to it. But what we can do is test O itself, once
with the original cproc and again with the new cproc.

Here is the output of size(1) on O.o compiled with the original
cproc:

text    data    bss     dec     hex
975     138     0       1113    459

And again with the new cproc:

text    data    bss     dec     hex
948     138     0       1086    43e

That's a savings of 27 bytes, or about a 2.77% reduction in text size
(27 of 975 bytes). That's not a huge amount, but it's also not
nothing.

For comparison, with clang -O2 we get this size:

text    data    bss     dec     hex
914     16      0       930     3a2

And with clang -Oz, that becomes:

text    data    bss     dec     hex
866     24      0       890     37a

But the big winner is gcc -Os (gcc (GCC) 12.0.1 20220310
(experimental)):

text    data    bss     dec     hex
713     0       0       713     2c9

So while QBE can't quite compete with the likes of GCC and Clang, it
isn't designed to. It's designed to be fast and good enough, and it
is. With O, QBE is a little bit better without sacrificing too much
compilation speed. After all, one of the selling points of cproc is
that compiles are much faster compared to GCC and Clang.

But O is a very small codebase. Let's try something larger. Let's try
oksh.

Here are the size measurements of oksh built with the old cproc:

text    data    bss     dec     hex
278585  17363   29920   325868  4f8ec

And with the new cproc:

text    data    bss     dec     hex
269745  17363   29920   317028  4d664

That's a savings of 8840 bytes, or about a 3.17% reduction in text
size. Again, it's not huge but it's also not nothing. Remember too
it's also a speed increase on top of a size decrease, so we win with
this optimization on both levels. While I'm sure the speed increase
isn't much more than negligible, over a large enough span of time it
probably adds up.

How about something large, like SQLite? With the old cproc:

text    data    bss     dec     hex
1164826 90104   888     1255818 13298a

And with the new cproc:

text    data    bss     dec     hex
1132438 90104   888     1223430 12ab06

That's a savings of 32388 bytes, or about a 2.78% reduction in text
size. I'm starting to see a trend. We're consistently getting a
reduction of roughly 3% in text size with this one optimization.
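
The percentages in this section are computed from the text column of size(1). As a quick sketch of the arithmetic, with percent_saved being a name made up for this example:

```c
#include <assert.h>

/* Percentage by which the text size shrank going from before to after. */
double
percent_saved(unsigned long before, unsigned long after)
{

        assert(before > 0 && after <= before);

        return 100.0 * (double) (before - after) / (double) before;
}
```

Feeding it the three pairs of text sizes above (975 and 948, 278585 and 269745, 1164826 and 1132438) gives roughly 2.77, 3.17 and 2.78.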

A more complicated second optimization

When reading through assembly generated by cproc, I discovered a
peculiar construct:

        movq %rdi, %r12
.Lbb28:
        movl %ebx, %edi
        movl %edi, %ebx
        movq %r12, %rdi

It looks like this piece of assembly wants to use %rdi as a temporary
register for a register-to-register copy, so it is copying itself to
%r12, performing the register-to-register copy, then returning to its
original value. But there's an opportunity for an optimization for
this particular register-to-register copy: all we're doing is moving
%ebx into itself. We can say with certainty that the last three lines
of this construct can be reduced to just the last line, since copying
a register into itself via a temporary register and then immediately
reassigning the temporary register is equivalent to simply assigning
the final value to the temporary register and not bothering with the
copy.

Interestingly, we cannot actually say with certainty if moving %rdi
into %r12 and back again is useless. It could be the case that it is
useless, in which case we might be able to remove all four lines of
assembly, leaving just the label. But it also might be the case that
the very next line of assembly is performing an operation on %r12
that is dependent on %r12 having previously gotten a copy of the
value of %rdi. But there's also a label in the middle: so perhaps
there is logic later where a value for the next iteration of a loop
is placed in %r12 en route to %rdi. We can't know any of that from
the context we have in this small five-line window.

That brings us to another interesting problem with peephole
optimizers: we need to be fairly conservative in our selection of
optimizations. We should only implement optimizations that we can be
absolutely certain have no undesirable side effects. The elimination
of the third and fourth lines in that five-line window above has no
side effects so that is the optimization we will teach O.

Setting up a larger window

First, we will need to slightly amend our O function:

static int
three(const char *line, FILE *fp)
{

        if (mov(line, fp) == 1)
                return 1;

        return 0;
}

static void
O(FILE *fp)
{
        char *line = NULL;
        size_t size = 0;

        while (getline(&line, &size, fp) != -1) {
                if (three(line, fp) == 1)
                        break;
        }

        free(line);
}

The three function will call the one function when appropriate, so we
will still benefit from our previous work.

We only have one optimization for a three-line peephole, which
removes a useless register self-copy:

static int
mov(const char *line, FILE *fp)
{
        char *line2 = NULL, *line3 = NULL;
        size_t size = 0;

        if (strncmp("\tmovl %e", line, 8) != 0 &&
            strncmp("\tmovq %r", line, 8) != 0) {
                one(line);

                return 0;
        }

        if (getline(&line2, &size, fp) == -1) {
                (void) fputs(line, stdout);

                /* getline(3) may allocate even when it fails */
                free(line2);

                return 1;
        }

        if (strncmp("\tmovl %e", line2, 8) != 0 &&
            strncmp("\tmovq %r", line2, 8) != 0) {
                (void) fputs(line, stdout);
                one(line2);

                free(line2);

                return 0;
        }

        if (getline(&line3, &size, fp) == -1) {
                (void) fputs(line, stdout);
                (void) fputs(line2, stdout);

                /* getline(3) may allocate even when it fails */
                free(line3);
                free(line2);

                return 1;
        }

        if (strncmp("\tmovl", line3, 5) != 0 &&
            strncmp("\tmovq", line3, 5) != 0) {
                (void) fputs(line, stdout);
                (void) fputs(line2, stdout);
                one(line3);

                free(line3);
                free(line2);

                return 0;
        }

        if (line[4] == line2[4] &&
            line[7] == line2[13] &&
            line[8] == line2[14] &&
            line[9] == line2[15] &&
            line[13] == line2[7] &&
            line[14] == line2[8] &&
            line[15] == line2[9] &&
            line[14] == line3[14] &&
            line[15] == line3[15]) {
                (void) fputs(line3, stdout);
        } else {
                (void) fputs(line, stdout);
                (void) fputs(line2, stdout);
                one(line3);
        }

        free(line3);
        free(line2);

        return 0;
}

There's a lot here. Let's break it down.

We first recognize if the first line we read is a mov with a 32-bit
or 64-bit register as the source register. If not, then it definitely
cannot be a candidate for this second optimization. However, it might
still be a candidate for the first optimization we coded, so we
should feed this line into the one function. Regardless of whether or
not this line is a successful candidate for the first optimization,
after this we are done with this peephole so we should return and
start all over again with a new peephole.

You'll notice that we will return zero to say "we did not see the end
of the file" and one to say "we have seen the end of the file."

Next is to realize that this optimization only works if we have at
least three lines to interrogate. But we may be close to the end of
the file; there may not be three lines left. We need to account for
that. If there is no next line, then we are definitely done and we
should simply print what we have. We know at this point the line we
have is not a candidate for the first optimization, or else it would
have entered into the previous if statement.

And we continue on. It is exactly the same logic for the second line:
we check to make sure it is a line that continues candidacy for this
second optimization; we then check to ensure there is a third line
that we can use to finish checking for candidacy.

Once we have read in all three lines, we need to know the following:

 1. If lines one and two are the same type of mov (movl or movq)
 2. If the source register in line one is the destination register in
    line two
 3. If the destination register in line one is the source register in
    line two
 4. If the destination register in line one is the destination
    register in line three

The large if statement performs all these checks.

Something interesting is that we don't care if line three is a 32-bit
movl or a 64-bit movq. Just so long as the destination register is
the same in both line one and line three, that is enough to ensure
the first two lines are useless. We do care that the type of mov is
the same for lines one and two, because if they are not it is
possible that there are side effects that would render those two
lines not useless.

This function also doesn't account for the fact that the 32-bit
versions of %r12-%r15 are named %r12d-%r15d. I haven't seen any cproc
code that uses those 32-bit registers in the type of construct that
this optimization eliminates, but it is certainly possible. We'll
probably have to write another function at some point that checks for
this special case and acts accordingly.
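
One possible shape for such a function is to parse the operands instead of comparing fixed character positions. The sketch below is an assumption-laden illustration, not code from O: the names movinsn, parse_mov, canon and is_useless_swap are invented, it insists that all three lines are register-to-register moves, and it adds one guard beyond the four checks listed above (line three must not read the register the pair wrote).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct movinsn {
        char suffix;    /* 'l' or 'q' */
        char src[8];    /* register name without the '%' */
        char dst[8];
};

/* Parse "\tmov? %SRC, %DST\n"; any other shape is rejected. */
static bool
parse_mov(const char *line, struct movinsn *m)
{
        char nl;

        if (line[0] != '\t')
                return false;

        if (sscanf(line, "\tmov%c %%%7[a-z0-9], %%%7[a-z0-9]%c",
            &m->suffix, m->src, m->dst, &nl) != 4)
                return false;

        return nl == '\n' && (m->suffix == 'l' || m->suffix == 'q') &&
            (m->src[0] == 'e' || m->src[0] == 'r') &&
            (m->dst[0] == 'e' || m->dst[0] == 'r');
}

/* Strip width markers: "edi" and "rdi" -> "di", "r12d" and "r12" -> "12". */
static void
canon(const char *reg, char out[8])
{
        size_t len;

        (void) snprintf(out, 8, "%s", reg + 1);
        len = strlen(out);
        if (len > 1 && out[0] >= '0' && out[0] <= '9' && out[len - 1] == 'd')
                out[len - 1] = '\0';
}

/* True if lines one and two can be dropped, keeping only line three. */
bool
is_useless_swap(const char *l1, const char *l2, const char *l3)
{
        struct movinsn a, b, c;
        char d1[8], d3[8], s3[8];

        assert(l1 != NULL && l2 != NULL && l3 != NULL);

        if (!parse_mov(l1, &a) || !parse_mov(l2, &b) || !parse_mov(l3, &c))
                return false;

        if (a.suffix != b.suffix)               /* check 1 */
                return false;
        if (strcmp(a.src, b.dst) != 0)          /* check 2 */
                return false;
        if (strcmp(a.dst, b.src) != 0)          /* check 3 */
                return false;

        canon(a.dst, d1);
        canon(c.dst, d3);
        canon(c.src, s3);
        if (strcmp(d1, d3) != 0)                /* check 4 */
                return false;

        /* extra guard, not in the list above */
        return strcmp(s3, d1) != 0;
}
```

Because register names are compared as strings, %r12d and %r13d work the same way as %ebx and %edi.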

I was able to rebuild cproc with this new optimization enabled. It
produced exactly the same binary as the build whose O only knew the
first optimization. All that means is the construct that this second
optimization eliminates wasn't present in the cproc assembly. It also
means that we didn't break anything by adding this second
optimization. In the codebase that did generate this peculiar
assembly construct, the second optimization worked and the useless
copy was successfully optimized away.

Bugs and caveats, to be fixed in part 2

Astute readers may have noticed that our peephole optimizer, as good
as it is, may fail to optimize constructs it knows how to optimize.
Notably, O will fail to optimize this:

        movl %ecx, %eax
        movl %ebx, %edi
        movl %edi, %ebx
        movq %r12, %rdi

This is because the first three lines get read in, then we learn that
this does not match the pattern we are looking for, so we print out
all three lines and then read in the fourth line on its own, which
doesn't match anything either.

It would be better if we printed out just the first line after
discovering that the first three lines do not satisfy any
optimization patterns, then check lines two through four for any
optimization patterns; if we did that, we could optimize the last
three lines.
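
Here is one way the sliding idea could be sketched. It works on an array of lines already held in memory rather than on a FILE, purely to keep the example short and testable. The names useless_pair and slide are invented, and this is not necessarily the approach the next post takes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The fixed-index test from the mov function, guarded by length checks. */
static int
useless_pair(const char *l1, const char *l2, const char *l3)
{

        if (strncmp("\tmovl %e", l1, 8) != 0 &&
            strncmp("\tmovq %r", l1, 8) != 0)
                return 0;
        if (strncmp("\tmovl %e", l2, 8) != 0 &&
            strncmp("\tmovq %r", l2, 8) != 0)
                return 0;
        if (strncmp("\tmovl", l3, 5) != 0 && strncmp("\tmovq", l3, 5) != 0)
                return 0;
        if (strlen(l1) < 16 || strlen(l2) < 16 || strlen(l3) < 16)
                return 0;

        return l1[4] == l2[4] &&
            l1[7] == l2[13] && l1[8] == l2[14] && l1[9] == l2[15] &&
            l1[13] == l2[7] && l1[14] == l2[8] && l1[15] == l2[9] &&
            l1[14] == l3[14] && l1[15] == l3[15];
}

/*
 * Slide a three-line window over in[0..n-1], advancing one line at a
 * time. Kept lines are stored in out, which must have room for n
 * pointers. Returns the number of lines kept.
 */
size_t
slide(const char *const *in, size_t n, const char **out)
{
        size_t i = 0, kept = 0;

        assert(in != NULL && out != NULL);

        while (i < n) {
                if (i + 2 < n && useless_pair(in[i], in[i + 1], in[i + 2]))
                        i += 2; /* drop the pair; line three opens the next window */
                else
                        out[kept++] = in[i++];
        }

        return kept;
}
```

Run on the four lines above, this keeps only the first and the last, which is the result we were hoping for.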

It will take some additional coding and problem solving to be able to
accomplish this goal. We will endeavor to do so in the next post.
