https://briancallahan.net/blog/20210710.html

Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social
life

---------------------------------------------------------------------
Home | Blog archives | LinkedIn | CV | Code | Extras
---------------------------------------------------------------------

[prev]

[next]

2021-07-10
I wrote a 231-byte Brainfuck compiler by abusing everything

All source code for this blog post can be found here.

We wrote a Brainfuck compiler back in April. That one developed a
simple architecture to support multiple backends. Today, I want to
write a new Brainfuck compiler. One that can only run on a single
system, OpenBSD/amd64. And one that is as small as I can possibly
make it. I broke into the sub-256 bytes club with my compiler, so I
thought it was worthwhile sharing and talking about how I was able to
pull it off.

While this works on my machine, there may be subtle differences about
how your machine works that cause this compiler not to work. Make
sure to check your machine's behavior if there are any differences!

We are optimizing exclusively for size. That means that we may make
decisions that are less performant than other ways of writing a
Brainfuck compiler. That's OK.

The assembly code will be written using AT&T syntax. Last time I had
a blog post with assembly code, someone complained that it was not
Intel syntax. I am not much bothered by either and I teach my
undergraduates both syntaxes because I think it is important to be
able to read both. If you are an Intel syntax person, take this post
as an opportunity to improve your AT&T syntax skills.

Reviewing Brainfuck

Let's quickly review the syntax of Brainfuck so we know what we need
to implement. The table below presumes that we are translating
Brainfuck to C, have a global array acting as the Brainfuck tape, and
have a pointer, *p, that is intialized to the beginning of the array.

Symbol  C equivalent
<      --p;
>      ++p;
-      --*p;
+      ++*p;
,      *p = getchar();
.      putchar(*p);
[      while (*p) {
]      }

Choosing an implementation language

I think the only way we stand a chance at creating a compiler as
small as possible is to use hand-optimized assembly. Those precious
bytes aren't going to save themselves!

We will be using the GNU binutils for our assembler and the OpenBSD
system linker, which is LLVM lld, for our linker. I am using the GNU
assembler from ports, which is a much newer version of the assembler.
This newer assembler is a little bit smarter when it comes to
encoding and that ends up saving a byte for us down the line. I'll
point it out when we get there.

Coding our strings

Our compiler will compile Brainfuck to C. I chose C because it will
allow us to have some of the smallest output code, and by extension
the shortest strings needed to have built-in to our compiler. We will
use the table above for our translation.

Our compiler therefore begins as follows:

        .text
        .globl  main
main:

.LSprologue:
        .ascii  "a[65536],*p=a;main(){" # 21

.LS:
        .ascii  "--p;"          # 4
        .ascii  "++p;"          # 4
        .ascii  "--*p;"         # 5
        .ascii  "++*p;"         # 5
        .ascii  "*p=getchar();" # 13
        .ascii  "putchar(*p); " # 13
        .ascii  "while(*p){"    # 10
        .ascii  "return 0;"     # 9
        .ascii  "}"             # 1

The top is some boilerplate to inform the toolchain that we have a
global symbol named main and it (and everything else, for that
matter) lives in the text section of our binary.

Beginning with .LSprototype is our strings. Labels beginning with .L
are considered local labels for ELF targets like my machine. After
each string I put a comment with the number of characters in the
string. We will use the OpenBSD kernel facilities for reading and
writing, and the write(2) system call has a parameter for number of
bytes to write. Yes, we could write a strlen(3) function, and in fact
we did that for our echo program. However, writing such a function
would take too many bytes compared to simply shoving the correct
value into the register.

The .LSprologue string is separate from the rest of the .LS strings
because the way I organized the compiler the prologue is used
directly and for all other strings we index into the string starting
at .LS.

You'll also notice I am using the .ascii pseudo-op rather than the
.asciz pseudo-op. This means that there is no NUL-termination of any
of these strings, so they are not C strings. Since we are providing
the exact number of bytes to write with each write(2) call, we can
get away with this and save a number of bytes in the process.

We are also leaving them in the .text section. That technically means
they are code. We could place them in the .rodata section but since
these string are never written to, I think it is fine.

We can also see some more byte-shaving at work. While we are
compiling to C, we didn't specify which standard we were targeting.
We are going very old by having implicit ints everywhere. We are also
going to be implicitly declaring getchar(3) and putchar(3) which is
invalid in C99 and later, according to clang. So we will target C89
and/or K&R C.

As a result of the implicit ints the cells on the Brainfuck tape are
32-bits in size. There is no requirement for any particular cell
size, so this is acceptable from a Brainfuck perspective.

I also made the getchar and putchar strings equal length, even though
the putchar string is actually one character shorter. We will spend a
byte now to save several bytes later.

Reading code into a buffer

Our parser needs to read in the program one character at a time and
process each character. If it is one of the Brainfuck characers, we
output the appropriate translation, otherwise we move on to the next
character. We are going to take advantage of the fact that we can
read from stdin and write to stdout without having to set anything up
outselves; these facilities are set up for us by our environment.

We will also need to look up the syscall numbers for read(2) and
write(2), which are 3 and 4.

We must read in a character before we can parse it and write
anything. Our read can look like this:

.Lparse:
        movb    $3, %al         # Load read(2) system call
        movb    $1, %dl         # Read one character
        xorl    %edi, %edi      # Read from stdin
        leaq    (%rsp), %rsi    # Read into top of stack
        syscall                 # read(0, (%rsp), 1);
        incl    %edi            # Set %edi to 1, for write

We can take advantage of some assumptions on our system for this to
work. On our system all general-purpose registers are initialized to
0 by the environment before program start. The movb instructions
require only 2 bytes but do not 0-out the remainder of the register.
If there was something already in %rax or %rdx above the lowest byte,
it would remain as-is. But since our environment set these registers
to 0, we are OK. These movb instructions are 2 bytes each.

We also chose to use xorl instead of movb for the %rdi register. This
is because you cannot use movb on %rdi since it is at its smallest a
16-bit register. Since we are making it 0, xor'ing any 32-bit
register with itself is the smallest way to do this at 2 bytes.

The leaq instruction is expensive at 4 bytes. It's actually one of
the smallest versions of this instruction and we will see a 7-byte
version later. Finally, we need to call the syscall for 2 bytes.
That's 12 bytes total to get a character from stdin.

We are also assuming that our environment set up some stack for us
and that our stack pointer points to something reasonable. That
happens to be true. As we learned before, the environment does
contain our argc and argv variables and that is the case here too. So
we have a reasonable stack pointer.

While it is not directly related to reading in a character, it turns
out that %rdi stays the same across read(2) calls, so we can
increment it to prepare for an upcoming write(2) call. Incrementing
is 2 bytes, which is the smallest way to turn a 0 into a 1 for us.
Even in the case where we do not actually end up writing anything,
for example because we read in a comment character, it does not much
matter since we xor the register before the read(2) call every time.

Checking for EOF

Now would be a good time to check if we are at the end of the file,
since if we are we have finished our work and should exit the
program. Fortunately, read(2) returns the number of bytes read in
%rax. If that number is 1, then we are not yet finished. Otherwise,
if it is 0 or -1 then we are finished. A 0 means end of file and a -1
means an error occured. We don't actually care if we reached end of
file or encountered an error. Both of those mean it is time to exit
for us.

Quick aside: The System V AMD64 ABI calling convention

Our machine uses the System V AMD64 ABI calling convention. This
dictates how our registers are used and how registers are saved or
not across function calls, including syscalls. For us, we need to
know that %rax will contain the return value of a call, so we should
presume that it will change after each call. Additionally, %rax
contains the syscall number prior to making the syscall. The calling
convention also says that the %rbx, %rsp, %rbp, and %r12-%r15
registers must be restored to their original value by the callee if
they are used. That means as far as we are concerned, we can use
those registers across calls and presume that their value will stay
the same. Any other registers may be clobbered by calls. We must also
respect the registers used for call parameters. Since read(2) and
write(2) use three parameters, we must make sure to use %rdi, %rsi,
and %rdx for those parameters, like we did with our read(2) call
above.

It happens to be the case that that %rdi and %rdx do not get
clobbered by read(2) and write(2) calls on our machine, which we will
use to our advantage.

Checking for EOF, part 2

Now we can check for EOF after each read:

        cmpl    %edx, %eax      # EOF ? (%eax < 1)
        jl      .Leof

That's all we need. Since %rdx does not get clobbered on our system,
we know it is 1. If %rax is less than that, we know we either reached
the end of the file or an error occured. In that case, we should jump
to the code that exits the program.

Parsing the input

Something that we need to know is that the smallest cmp is when you
compare a byte against %al. But as of now, %rax contains the return
code of the read(2) call. We stand to save eight bytes using the
smallest cmp. We will end up spending three bytes to save eight
bytes, which is great.

We can make this happen by slightly tweaking our end of file check
and then parsing our input after the end of file check:

        xchg    %ebp, %eax      # Store return value in %ebp
        movb    (%rsi), %al     # cmpb imm, %al is the smallest cmp
        leal    .LS, %esi       # Preload string
        cmpl    %edx, %ebp      # EOF ? (%ebp < 1)
        jl      .Leof
        cmpb    $60, %al        # '<' ?
        je      .Lleft
        cmpb    $62, %al        # '>' ?
        je      .Lright
        cmpb    $45, %al        # '-' ?
        je      .Ldec
        cmpb    $43, %al        # '+' ?
        je      .Linc
        cmpb    $44, %al        # ',' ?
        je      .Lgetchar
        cmpb    $46, %al        # '.' ?
        je      .Lputchar
        cmpb    $91, %al        # '[' ?
        je      .Lopenloop
        cmpb    $93, %al        # ']' ?
        je      .Lcloseloop
        jmp     .Lparse         # Comment character, skip

We can xchg to get our smallest cmp. This is the reason for the newer
GNU assembler. The older assembler in base always encodes this to a
2-byte sequence but the newer assembler is smart enough to know this
can be encoded as a 1-byte instruction and does so. We then put the
character we read in into %rax and preload the address of all our
strings via a 7-byte leal. That's very expensive, in fact it is the
most expensive instruction we will encounter. But it is a necessary
evil for us. Now we perform our end of file check and if we are not
at the end of the file we continue on figuring out which character we
have and what to do with it.

Processing each possibility

Let's leave the end of file processing for later and now tackle the
Brainfuck symbol processing. There are eight symbols we need to
handle:

.Lright:
        addl    $4, %esi
.Lleft:
        movb    $4, %dl
        jmp     .Lwrite

Let's take a look at just this for right now. This code handles both
the < and > symbols. Recall that .LS contains all our strings,
prologue notwithstanding. The first character of .LS corresponds to
the < symbol. We are calling that the .Lleft label. That has an
offset of 0 so we immediately move to telling write(2) we want to
write 4 characters and then head to the syscall. If we instead see a
> that's the .Lright label and we need to offset the starting point
of the string by 4. We then fall through because both outputs are 4
characters in length so we can combine the code to save bytes.

We will use these tricks a few more times. Let's continue writing the
symbol processing code:

.Ldec:
        subl    $5, %esi        # 13 - 5 = 8
.Linc:
        addl    $13, %esi
        movb    $5, %dl
        jmp     .Lwrite

Let's stop here too and take a look at one more trick. Both .Ldec and
.Linc output strings that are the same length. We want to be able to
combine the code like we did in the .Lleft and .Lright case. However,
we don't have an index of 0 for either. We get around this by
creatively subtracting the length of the string from the offset of
the string farther from the beginning of .LS to get the offset of the
string closer to the beginning of .LS and then fall through from
there. These subtractions and additions, because they are within the
range of -128 and 127 all encode to 2 bytes.

Let's continue on:

.Lgetchar:
        subl    $13, %esi       # 31 - 13 = 18
.Lputchar:
        addl    $31, %esi
        movb    $13, %dl
        jmp     .Lwrite

Exactly the same tricks as .Linc and .Ldec. Note though that if I did
not extend the putchar string by one character at the beginning, we
could not use any of these tricks and it would cost us a whole lot
more than the one byte it cost to elongate the putchar string.

Let's finish up our processing code:

.Lopenloop:
        incl    %ebx            # Increment loop counter
        addl    $44, %esi
        movb    $10, %dl
        jmp     .Lwrite
.Lcloseloop:
        decl    %ebx            # Decrement loop counter
        js      .Lexit          # %ebx < 0 ? (%rdi == 1 from the write(2) call)
        addl    $63, %esi
        movb    $1, %dl
        jmp     .Lwrite

Unfortunately, we cannot use the fall through trick here but there
are some other tricks we can use. We use the fact that all our
general-purpose registers are initialized to 0 and the fact that %rbx
is saved across calls to just start using it effectively out of
nowhere to act as our loop counter. The .Lopenloop label is otherwise
straightforward.

The .Lcloseloop label has an additional trick. Because we want to
have a good compiler that has proper return values, 0 for success and
1 for failure, we decrement the loop counter and check to see if it
has fallen below 0. If it has, there is an imbalance of brackets and
we should stop processing and error out. Unfortunately, dec preserves
the value of the carry flag, so we cannot use it to determine if we
have gone below 0. However, it alters all other flags and we can use
the sign flag to check if we have gone negative. The js opcode jumps
if the sign flag is set, which it only will be if we have gone
negative. We can avoid a cmp against -1 this way. Fortunately, %rsi
has already been set to 1 and is exactly the value we want for _exit
(2) when we error out so we do not have to make any adjustments to
%rdi. The rest is straightfoward.

Writing output

We will use the fact that stdout comes available to use from our
environment without us having to do anything and as such we will
write our compiled code to stdout. We are also going to use another
trick here: we will place the .Lwrite label between the .Lparse and
.Lright labels. This is because if you have a jmp that is within the
range of -128 and 127 bytes, that encodes to a short jump which is
only 2 bytes. That is the smallest jmp there is. We want all our jmps
to be short jumps. Our .Lwrite labels ends with a jmp to .Lparse in
order to read in the next character and if it came after all the code
we just wrote, it will be more than -128 bytes away.

Our .Lwrite label is very short:

.Lwrite:
        movb    $4, %al
        syscall                 # write(1, string, length);
        jmp     .Lparse

That's it. All the writes have to put 4 into %rax, make the syscall,
and then read in the next character.

Dealing with end of file and exiting

Now let's write the code to handle end of file and the code to handle
exit, since we will use some fall through to handle them. Let's also
put this new code immediately after .Lwrite and before .Lright, again
to ensure all our jmps are short jumps.

It looks like this:

.Leof:
        cmpl    %edx, %ebx      # Loop counter < 1 ? (i.e., 0)
        jge     .Lexit
        addl    $54, %esi
        movb    $10, %dl
        movb    $4, %al
        syscall
        xorl    %edi, %edi      # Get ready to exit
.Lexit:
        movb    $1, %al         # _exit(%edi);
        syscall

When we reach the end of the file, we need to make sure our loop
counter is 0. If it is not, we have a bracket imbalance and should
error out immediately. Like before, %rdi happens to be 1 from the set
up for the write(2) call so we do not need to make any adjustments.

The remainder of the .Leof label is writing the C epilogue. It then
sets %rdi to 0 because we are going to exit successfully and falls
through to .Lexit which sets the syscall number to the _exit(2)
syscall and calls it, which exits the program.

Writing the prologue

The last thing we need to write in order to finish our compiler is
the prologue. This code goes immediately after the main label and
before the .Lparse label.

We need to make a call to write(2) with our .LSprologue string:

        popq    %rdi            # Write prologue (argc == 1, hopefully)
        leal    .LSprologue, %esi
        movb    $21, %dl
        jmp     .Lwrite

Our final trick is immediately popping the top of the stack into %rdi
right at the start. As we learned before, our environment puts argc
in there. So long as argc is 1, which it should be, that allows us to
set up %rdi correctly for this initial write(2) call in just 1 byte.
If not, well then your output will be incorrect. It is the one error
that goes undetected by our compiler. We are going to accept that we
have cut this corner.

And that's it! Our compiler is finished.

Putting it all together

This is what our final completed compiler looks like:

        .text
        .globl  main
main:
        popq    %rdi            # Write prologue (argc == 1, hopefully)
        leal    .LSprologue, %esi
        movb    $21, %dl
        jmp     .Lwrite
.Lparse:
        movb    $3, %al         # Load read(2) system call
        movb    $1, %dl         # Read one character
        xorl    %edi, %edi      # Read from stdin
        leaq    (%rsp), %rsi    # Read into top of stack
        syscall                 # read(0, (%rsp), 1);
        incl    %edi            # Set %edi to 1, for write
        xchg    %ebp, %eax      # Store return value in %ebp
        movb    (%rsi), %al     # cmpb imm, %al is the smallest cmp
        leal    .LS, %esi       # Preload string
        cmpl    %edx, %ebp      # EOF ? (%ebp < 1)
        jl      .Leof
        cmpb    $60, %al        # '<' ?
        je      .Lleft
        cmpb    $62, %al        # '>' ?
        je      .Lright
        cmpb    $45, %al        # '-' ?
        je      .Ldec
        cmpb    $43, %al        # '+' ?
        je      .Linc
        cmpb    $44, %al        # ',' ?
        je      .Lgetchar
        cmpb    $46, %al        # '.' ?
        je      .Lputchar
        cmpb    $91, %al        # '[' ?
        je      .Lopenloop
        cmpb    $93, %al        # ']' ?
        je      .Lcloseloop
        jmp     .Lparse         # Comment character, skip
.Lwrite:
        movb    $4, %al
        syscall                 # write(1, string, length);
        jmp     .Lparse
.Leof:
        cmpl    %edx, %ebx      # Loop counter < 1 ? (i.e., 0)
        jge     .Lexit
        addl    $54, %esi
        movb    $10, %dl
        movb    $4, %al
        syscall
        xorl    %edi, %edi      # Get ready to exit
.Lexit:
        movb    $1, %al         # _exit(%edi);
        syscall
.Lright:
        addl    $4, %esi
.Lleft:
        movb    $4, %dl
        jmp     .Lwrite
.Ldec:
        subl    $5, %esi        # 13 - 5 = 8
.Linc:
        addl    $13, %esi
        movb    $5, %dl
        jmp     .Lwrite
.Lgetchar:
        subl    $13, %esi       # 31 - 13 = 18
.Lputchar:
        addl    $31, %esi
        movb    $13, %dl
        jmp     .Lwrite
.Lopenloop:
        incl    %ebx            # Increment loop counter
        addl    $44, %esi
        movb    $10, %dl
        jmp     .Lwrite
.Lcloseloop:
        decl    %ebx            # Decrement loop counter
        js      .Lexit          # %ebx < 0 ? (%rdi == 1 from the write(2) call)
        addl    $63, %esi
        movb    $1, %dl
        jmp     .Lwrite

.LSprologue:
        .ascii  "a[65536],*p=a;main(){" # 21

.LS:
        .ascii  "--p;"          # 4
        .ascii  "++p;"          # 4
        .ascii  "--*p;"         # 5
        .ascii  "++*p;"         # 5
        .ascii  "*p=getchar();" # 13
        .ascii  "putchar(*p); " # 13
        .ascii  "while(*p){"    # 10
        .ascii  "return 0;"     # 9
        .ascii  "}"             # 1

How many bytes is our compiler?

On OpenBSD, there is some mandatory overhead, 24 bytes worth. We will
measure the size of our compiler three ways: by itself, with the
OpenBSD overhead, and with all the ELF overhead.

The compiler by itself:

$ size bf256.o
text    data    bss     dec     hex
231     0       0       231     e7

Only 231 bytes! Not bad. If I remember correctly, supposedly the
original Brainfuck compiler was 240 bytes.

The compile with the OpenBSD overhead:

$ size bf256
text    data    bss     dec     hex
255     0       0       255     ff

So even if you insist that it is unfair that I excluded the OpenBSD
overhead in my count, we still come in at only 255 bytes.

With all the ELF overhead:

$ ls -l bf256
-rwxr-xr-x  1 brian  brian  952 Jul 10 19:38 bf256

Well, 952 bytes. OK, that's where the ELF overhead gets you. I don't
want to get out a hex editor and start whittling down all that. I am
going to say it is good enough and giving us all permission to ignore
all the ELF overhead.

Conclusion

It is time for me to put away the text editor. My target was 256
bytes and I just scraped by. Can you find additional bytes to shave
off?

Update: A 216-byte compiler

I was able to save an additional jmp and an extraneous movb $1, %dl,
saving an additional 4 bytes, which allowed me to add back one byte
and change the initial popq %rdi to an incl %edi, guaranteeing that
the initial prologue write will always work regardless of the number
of arguments to the program and still save a byte. See it here.

I was able to save an additional byte by removing the extra space in
the putchar string and using the last character in the getchar string
to pad out the putchar string. As this is a ; character, in C that
becomes an empty statement, which is legal C. This gets us down to
944 bytes with all the ELF overhead. See it here.

I was able to save an additional 11 bytes by removing the epilogue
string entirely and simply using the closeloop string for the
epilogue. It seems that clang does not miss it. That gets us to 936
bytes with all the ELF overhead. See it here.

Top

RSS
---------------------------------------------------------------------

OpenBSD httpd