https://briancallahan.net/blog/20220330.html
Brian Robert Callahan
academic, developer, with an eye towards a brighter techno-social
life
---------------------------------------------------------------------
Home | Blog archives | LinkedIn | CV | Code | Extras
---------------------------------------------------------------------
[prev]
[next]
2022-03-30
I wrote a peephole optimizer (for QBE), part 1
All source code for this blog post can be found here.
One of the things we did not tackle when writing our PL/0 compiler
was an optimizer. Our code generator output C and then we let our C
compiler handle the rest. If we wanted optimizations, we could turn
them on in the C compiler. That's great but it didn't help us learn
anything about optimizations.
I test a lot of C compilers when testing oksh. One of those compilers
is called cproc. It's a C11 compiler that compiles C into QBE and
then has QBE transform its intermediate language into assembly, which
then follows the usual pipeline of being run through the system
assembler and ultimately the system linker.
We saw QBE before when we wrote our brainfuck compiler in QBE IL.
When observing how QBE transforms its IL into assembly, we noticed
that it missed a chance at optimization in the case where you moved a
zero into a register. The generated assembly used a mov instruction
to move an immediate of zero into the register. But the fastest and
smallest way to put a zero into a register (at least, on amd64
machines) is to xor the register with itself.
I noticed the lastest cproc, which uses the latest QBE, still fails
to make this optimization. At one point in time, the author of cproc
sent me a diff for QBE that did make the optimization but I guess it
was never applied to the QBE repository. As the savings for this
optimization really can add up over a large project, and since it
gave us an opportunity to look at peephole optimizers, I decided to
write a small program that could be run in the cproc pipeline in
between QBE and the system assembler to fix up this optimization.
Maybe we could find more optimizations as well.
A peephole optimizer
The simplest type of optimizer that we can write without having to
dive into the vast literature on compiler optimizations is a peephole
optimizer. In this technique, we read in some number of lines of
assembly and perform basic pattern matching on those lines looking
for instructions that we know can be replaced with better ones (or
entirely eliminated). "Better" in this case brings up the first
interesting question about optimization: better can be contextual
based on your goals. Do you want instructions that take less time to
execute? Do you want instructions that require fewer bytes to encode,
making for a smaller program? Someting else? What is faster may not
be smaller. What is smaller may not be faster. For us with our first
optimization, we are lucky that the optimized version is both faster
and smaller.
The peephole, or window, for our first optimization is just one line.
There are some other peephole optimizations that can be performed
with a peephole of just one line: removing useless instructions, such
as moving a register into itself or adding zero to a register,
immediately come to mind.
After we've completed the mov to xor optimization, we'll tackle an
additional optimization that requires a larger peephole to enable.
We will name our peephole optimizer O in honor of the C compiler flag
(-O) that is commonly used to tell the C compiler to turn on its
optimizer.
Choosing a language
I am going to write O in C. Not because C is necessarily the best
language for this task; it probably isn't. But because if we write O
in C, then we can run cproc on O itself and determine the size
savings on itself. I am also going to rely on direct knowledge of
cproc to make our lives easier when figuring out what strings to
match on. There is also a little bit of argc and argv handling with
that knowledge so that O can be plugged into cproc when we are
finished.
With all of that said, our main function looks like this:
static int
usage(void)
{
(void) fputs("usage: O in.s [-o out.s]\n", stderr);
return 1;
}
int
main(int argc, char *argv[])
{
FILE *fp;
if (argc == 4) {
if (strcmp(argv[2], "-o") != 0)
return usage();
if (freopen(argv[3], "w+", stdout) == NULL) {
(void) fprintf(stderr, "O: error: couldn't open %s\n",
argv[3]);
}
} else if (argc != 2) {
return usage();
}
if (!strcmp(argv[1], "-")) {
O(stdin);
return 0;
}
if ((fp = fopen(argv[1], "r")) == NULL) {
(void) fprintf(stderr, "O: error: couldn't open %s\n",
argv[1]);
return 1;
}
O(fp);
(void) fclose(fp);
return 0;
}
Because of how cproc works, when the final stage of the compilation
is to output assembly (-S), cproc automatically appends -o [file.s]
so O needs to support that. If cproc is going to write an object file
or binary, then O will write to stdout and the assembler will read
the output in via a pipe. Additionally, input from cproc is always
stdin, demarcated as -, so we need to support that too.
To avoid too much complexity, if we do have to output to a file, we
can freopen(3) that file to stdout. Most of the time, we will be
outputting to stdout so it makes sense to make outputting to a file
the special case.
Reading in one line at a time
We can use the getline(3) function to read in one line at a time into
a buffer. While the getline(3) function is "new" with POSIX.1-2008, I
think all modern Unix systems have the function. If your system does
not, you can get a copy of getline(3) here.
Reading in one line at a time might look something like this:
static void
O(FILE *fp)
{
char *line = NULL;
size_t size = 0;
while (getline(&line, &size, fp) != -1)
(void) fputs(line, stdout);
free(line);
}
Yes, you really do need the call to free(3) at the end there.
This would read in one line at a time and then immediately print it
out. Of course, we need to perform a check to match for potential
lines to replace, so let's improve this a bit:
static void
one(const char *line)
{
if (xorq(line))
return;
if (xorl(line))
return;
(void) fputs(line, stdout);
}
static void
O(FILE *fp)
{
char *line = NULL;
size_t size = 0;
while (getline(&line, &size, fp) != -1)
one(line);
free(line);
}
Now we are reading in one line at a time, and checking if it's a movq
line that can be replaced with an xorq line. If not, perhaps it is a
movl line that can be replaced with an xorl line. If not, then it is
not a line that can be optimized this way so we should just print it
out as-is.
Find and replace
Because we know exactly what QBE is going to output, we can use that
to our advantage. We don't need to tokenize the output; we can simply
check for exact strings. That saves us a lot of logic. A movq line
that we want to match will always look like this:
movq $0, %r__
Where __ reflects any of the 64-bit registers. Therefore, we can
match exactly that and since that's the only form this line can be
in, it is the only thing we need to check for. We want to replace
that movq line with this:
xorq %r__, %r__
If we were to write that as a function, it might look like this:
static int
xorq(const char *line)
{
if (!strncmp("\tmovq $0, %r", line, 12)) {
(void) fprintf(stdout, "\txorq %%r%c%c, %%r%c%c\n", line[12],
line[13], line[12], line[13]);
return 1;
}
return 0;
}
If you wanted to be paranoid, you could check that strlen(line) > 13
before calling fprintf.
We check to see if the first twelve characters of the line match a
movq line that would be better if it was optimized into a xorq line.
If it does, we instead write the equivalent xorq line. We are able to
use the fact that we can learn the target register directly from the
movq line and use that in the rewritten xorq line.
We can use the same strategy for optimizing movl lines into xorl
lines. Here, we want to transform:
movl $0, %e__
Into:
xorl %e__, %e__
In code, that looks like this:
static int
xorl(const char *line)
{
if (!strncmp("\tmovl $0, %e", line, 12)) {
(void) fprintf(stdout, "\txorl %%e%c%c, %%e%c%c\n", line[12],
line[13], line[12], line[13]);
return 1;
}
return 0;
}
We have written a peephole optimizer! It is guaranteed to improve the
code that QBE produces.
Improving these optimizations
But we can do better. The xorq function doesn't account for %r8 or
%r9 while the xorl function doesn't account for any of the %r8d-%r15d
registers. They'll need special tweaks, but it is nothing too
difficult:
static int
xorq(struct peephole *window)
{
char buf[32], r1a, r1b;
if (window->line1 == NULL)
return 0;
if (!strncmp("\tmovq $0, %r", window->line1, 12)) {
if (strlen(window->line1) < 14)
return 0;
r1a = window->line1[12];
r1b = window->line1[13];
if (r1b == '\n')
r1b = ' ';
(void) snprintf(buf, sizeof(buf), "\txorq %%r%c%c, %%r%c%c\n",
r1a, r1b, r1a, r1b);
free(window->line1);
window->line1 = xstrdup(buf);
return 1;
}
return 0;
}
static int
xorl(struct peephole *window)
{
char buf[32], e1a, e1b;
if (window->line1 == NULL)
return 0;
if (!strncmp("\tmovl $0, %e", window->line1, 12)) {
if (strlen(window->line1) != 15)
return 0;
e1a = window->line1[12];
e1b = window->line1[13];
(void) snprintf(buf, sizeof(buf), "\txorl %%e%c%c, %%e%c%c\n",
e1a, e1b, e1a, e1b);
free(window->line1);
window->line1 = xstrdup(buf);
return 1;
} else if (!strncmp("\tmovl $0, %r", window->line1, 12)) {
if (strlen(window->line1) < 14)
return 0;
e1a = window->line1[12];
e1b = window->line1[13];
if (e1b == 'd') {
(void) snprintf(buf, sizeof(buf),
"\txorl %%r%cd, %%r%cd\n", e1a, e1a);
} else {
(void) snprintf(buf, sizeof(buf),
"\txorl %%r%c%cd, %%r%c%cd\n", e1a, e1b, e1a, e1b);
}
free(window->line1);
window->line1 = xstrdup(buf);
return 1;
}
return 0;
}
It is really important that we make these tweaks, since QBE really
enjoys using the %r8d register. That way, we can benefit from this
optimization in those cases.
Testing
While we are not yet done with our peephole optimizer, since we will
add another optimization, it would be good to test it out. We can
test it out on the commandline by running O -. Here is the results
when I type in some assembly directly:
/home/brian $ O -
movq $0, %rax
xorq %rax, %rax
movl $0, %ebx
xorl %ebx, %ebx
movq $1, %rax
movq $1, %rax
I typed in the odd-numbered lines and O responded with the
even-numbered lines. Don't forget that you need to type in the
leading tab, since O requires the leading tab to make a successful
match. We can see that right now the optimizer successfully replaces
lines that match our logic and, importantly, successfully does not
replace lines that do not match.
Plugging O into cproc
We can execute a much bigger test by plugging O into the cproc
pipeline. We should immediately begin to see improvements to the code
we compile. And, since my OpenBSD port of cproc does the whole three
stage self-hosting compile, we will immediately learn if cproc can
handle this improvement to itself. I'm betting that it can, but it's
always better to be sure.
There are two files that need to be patched. First, we need to patch
the configure script and teach it about O, since that will place a
needed data structure in a header file to be used later:
Index: configure
--- configure.orig
+++ configure
@@ -159,6 +159,7 @@ static const char *const preprocesscmd[] = {
"-D", "__extension__=",
$defines};
static const char *const codegencmd[] = {"$DEFAULT_QBE"};
+static const char *const optimizecmd[] = {"O"};
static const char *const assemblecmd[] = {"$DEFAULT_ASSEMBLER"};
static const char *const linkcmd[] = {"$DEFAULT_LINKER", $linkflags};
EOF
Second, we need to patch driver.c to teach the compiler driver about
this new stage in the compilation pipeline:
Index: driver.c
--- driver.c.orig
+++ driver.c
@@ -32,6 +32,7 @@ enum stage {
PREPROCESS,
COMPILE,
CODEGEN,
+ OPTIMIZE,
ASSEMBLE,
LINK,
};
@@ -60,6 +61,7 @@ static struct stageinfo stages[] = {
[PREPROCESS] = {.name = "preprocess"},
[COMPILE] = {.name = "compile"},
[CODEGEN] = {.name = "codegen"},
+ [OPTIMIZE] = {.name = "optimize"},
[ASSEMBLE] = {.name = "assemble"},
[LINK] = {.name = "link"},
};
@@ -381,6 +383,7 @@ main(int argc, char *argv[])
arrayaddbuf(&stages[PREPROCESS].cmd, preprocesscmd, sizeof(preprocesscmd));
arrayaddptr(&stages[COMPILE].cmd, compilecommand(argv[0]));
arrayaddbuf(&stages[CODEGEN].cmd, codegencmd, sizeof(codegencmd));
+ arrayaddbuf(&stages[OPTIMIZE].cmd, optimizecmd, sizeof(optimizecmd));
arrayaddbuf(&stages[ASSEMBLE].cmd, assemblecmd, sizeof(assemblecmd));
arrayaddbuf(&stages[LINK].cmd, linkcmd, sizeof(linkcmd));
@@ -400,6 +403,7 @@ main(int argc, char *argv[])
arrayaddptr(&stages[COMPILE].cmd, arch);
arrayaddptr(&stages[CODEGEN].cmd, "-t");
arrayaddptr(&stages[CODEGEN].cmd, qbearch);
+ arrayaddptr(&stages[OPTIMIZE].cmd, "-");
for (;;) {
++argv, --argc;
@@ -414,10 +418,10 @@ main(int argc, char *argv[])
switch (input->filetype) {
case ASM: input->stages = 1<stages = 1<stages = 1<stages = 1<stages = 1<stages = 1<stages = 1<stages = 1<stages = 1<stages = 1<