# How to write a JIT compiler

First up, you probably don't want to. JIT, or more accurately "dynamic code generation," is typically not the most effective way to optimize a project, and common techniques end up trading away a lot of portability and require fairly detailed knowledge about processor-level optimization. That said, though, writing a JIT compiler is a lot of fun and a great way to learn stuff.

The first thing to do is to write an interpreter.

NOTE: If you don't have a solid grasp of UNIX system-level programming, you might want to read about how to write a shell, which covers a lot of the fundamentals.

## MandelASM

GPUs are fine for machine learning, but serious fractal enthusiasts design their own processors to generate Mandelbrot sets. And the first step in processor design, of course, is to write an emulator for it. Our emulator will interpret the machine code we want to run and emit an image to stdout.
To keep it simple, our processor has four complex-valued registers called a, b, c, and d, and it supports three in-place operations:

* `=ab`: assign register a to register b
* `+ab`: add register a to register b
* `*ab`: multiply register b by register a

For each pixel, the interpreter will zero all of the registers and then set a to the current pixel's coordinates. It then iterates the machine code for up to 256 iterations, waiting for register b to "overflow" (i.e. for its complex absolute value to exceed 2). That means that the code for a standard Mandelbrot set is `*bb+ab`.

## Simple interpreter

The first thing to do is write up a bare-bones interpreter in C. It would be simpler to use complex.h here, but I'm going to write it in terms of individual numbers because the JIT compiler will end up generating the longhand logic. In production code we'd include bounds-checks and stuff, but I'm omitting those here for simplicity.

```
// simple.c
#include <stdio.h>
#include <stdlib.h>

#define sqr(x) ((x) * (x))

typedef struct { double r; double i; } complex;

void interpret(complex *registers, char const *code) {
  complex *src, *dst;
  double r, i;
  for (; *code; code += 3) {
    dst = &registers[code[2] - 'a'];
    src = &registers[code[1] - 'a'];
    switch (*code) {
      case '=':
        dst->r = src->r;
        dst->i = src->i;
        break;
      case '+':
        dst->r += src->r;
        dst->i += src->i;
        break;
      case '*':
        r = dst->r * src->r - dst->i * src->i;
        i = dst->r * src->i + dst->i * src->r;
        dst->r = r;
        dst->i = i;
        break;
      default:
        fprintf(stderr, "undefined instruction %s (ASCII %x)\n", code, *code);
        exit(1);
    }
  }
}

int main(int argc, char **argv) {
  complex registers[4];
  int i, x, y;
  char line[1600];
  printf("P5\n%d %d\n%d\n", 1600, 900, 255);
  for (y = 0; y < 900; ++y) {
    for (x = 0; x < 1600; ++x) {
      registers[0].r = 2 * 1.6 * (x / 1600.0 - 0.5);
      registers[0].i = 2 * 0.9 * (y / 900.0 - 0.5);
      for (i = 1; i < 4; ++i) registers[i].r = registers[i].i = 0;
      for (i = 0; i < 256 && sqr(registers[1].r) + sqr(registers[1].i) < 4; ++i)
        interpret(registers, argv[1]);
      line[x] = i;
    }
    fwrite(line, 1, sizeof(line), stdout);
  }
  return 0;
}
```

Now we can see the results by using display from ImageMagick (apt-get install imagemagick), or by saving to a file:

```
$ gcc simple.c -o simple
$ ./simple *bb+ab | display -          # imagemagick version
$ ./simple *bb+ab > output.pgm         # save a grayscale PGM image
$ time ./simple *bb+ab > /dev/null     # quick benchmark
real    0m2.369s
user    0m2.364s
sys     0m0.000s
$
```

*(image: the rendered Mandelbrot set)*

## Performance analysis

In the real world, JIT is absolutely the wrong move for this problem. Array languages like APL, Matlab, and to a large extent Perl, Python, etc., manage to achieve reasonable performance by having interpreter operations that apply over a large number of data elements at a time. We've got exactly that situation here: in the real world it's a lot more practical to vectorize the operations to apply simultaneously to a screen-worth of data at a time -- then we'd have nice options like offloading stuff to a GPU, etc.
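Just to make that concrete, here's a rough sketch -- not something we'll build in this tutorial -- of what a vectorized `+ab` step over one scanline's worth of per-pixel register files might look like; the dispatch happens once per opcode instead of once per pixel:

```
// Illustrative sketch only; assumes the same complex struct as simple.c.
// Apply "+ab" to every pixel's register file in one pass.
typedef struct { double r; double i; } complex;

void add_a_to_b_row(complex regs[][4], int n) {
  for (int x = 0; x < n; ++x) {    // one tight loop, no per-pixel dispatch
    regs[x][1].r += regs[x][0].r;  // b.r += a.r
    regs[x][1].i += regs[x][0].i;  // b.i += a.i
  }
}
```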
However, since the point here is to compile stuff, on we go.

JIT can basically eliminate the interpreter overhead, which we can easily model here by replacing interpret() with a hard-coded Mandelbrot calculation. This will provide an upper bound on realistic JIT performance, since we're unlikely to optimize as well as gcc does.

```
// hardcoded.c
#include <stdio.h>
#include <stdlib.h>

#define sqr(x) ((x) * (x))

typedef struct { double r; double i; } complex;

void interpret(complex *registers, char const *code) {
  complex *a = &registers[0];
  complex *b = &registers[1];
  double r, i;
  r = b->r * b->r - b->i * b->i;
  i = b->r * b->i + b->i * b->r;
  b->r = r;
  b->i = i;
  b->r += a->r;
  b->i += a->i;
}

int main(int argc, char **argv) {
  complex registers[4];
  int i, x, y;
  char line[1600];
  printf("P5\n%d %d\n%d\n", 1600, 900, 255);
  for (y = 0; y < 900; ++y) {
    for (x = 0; x < 1600; ++x) {
      registers[0].r = 2 * 1.6 * (x / 1600.0 - 0.5);
      registers[0].i = 2 * 0.9 * (y / 900.0 - 0.5);
      for (i = 1; i < 4; ++i) registers[i].r = registers[i].i = 0;
      for (i = 0; i < 256 && sqr(registers[1].r) + sqr(registers[1].i) < 4; ++i)
        interpret(registers, argv[1]);
      line[x] = i;
    }
    fwrite(line, 1, sizeof(line), stdout);
  }
  return 0;
}
```

This version runs about twice as fast as the simple interpreter:

```
$ gcc hardcoded.c -o hardcoded
$ time ./hardcoded *bb+ab > /dev/null
real    0m1.329s
user    0m1.328s
sys     0m0.000s
$
```

## JIT design and the x86-64 calling convention

The basic strategy is to replace interpret(registers, code) with a function compile(code) that returns a pointer to a function whose signature is this: void compiled(complex *registers). The memory for the function needs to be allocated using mmap so we can set permission for the processor to execute it.

The easiest way to start with something like this is probably to emit the assembly for simple.c to see how it works:

```
$ gcc -S simple.c
```

Edited/annotated highlights from the assembly simple.s, which is much more complicated than what we'll end up generating:

```
interpret:
        // The stack contains local variables referenced to the "base pointer"
        // stored in hardware register %rbp. Here's the layout:
        //
        //   double i  = -8(%rbp)
        //   double r  = -16(%rbp)
        //   src       = -24(%rbp)
        //   dst       = -32(%rbp)
        //   registers = -40(%rbp)   <- comes in as an argument in %rdi
        //   code      = -48(%rbp)   <- comes in as an argument in %rsi

        pushq   %rbp
        movq    %rsp, %rbp          // standard x86-64 function prologue
        subq    $48, %rsp           // allocate space for six local vars
        movq    %rdi, -40(%rbp)     // registers arg -> local var
        movq    %rsi, -48(%rbp)     // code arg -> local var
        jmp     for_loop_condition  // commence loopage
```

Before getting to the rest, I wanted to call out the %rsi and %rdi stuff and explain a bit about how calls work on x86-64. %rsi and %rdi seem arbitrary, which they are to some extent -- C obeys a platform-specific calling convention that specifies how arguments get passed in. On x86-64, up to six arguments come in as registers; after that they get pushed onto the stack. If you're returning a value, it goes into %rax.

The return address is automatically pushed onto the stack by call instructions like e8 <32-bit relative>. So internally, call is the same as push ADDRESS; jmp <function>; ADDRESS: ..., and ret is the same as pop %rip, except that you can't actually pop into %rip. This means that the return address is always the most immediate value on the stack.

Part of the calling convention also requires callees to save a couple of registers and to keep %rbp as a copy of %rsp at function-call time, but our JIT can mostly ignore this stuff because it doesn't call back into C.
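For a concrete (made-up) example, a function like long second(long a, long b) would receive a in %rdi and b in %rsi, and return its result in %rax:

```
# Illustrative only -- not taken from simple.s.
# long second(long a, long b) { return b; }
second:
        movq    %rsi, %rax      # second argument arrives in %rsi; result in %rax
        ret                     # pop the return address and jump to it
```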
```
for_loop_body:
        // (a bunch of stuff to set up *src and *dst)
        cmpl    $43, %eax           // case '+'
        je      add_branch
        cmpl    $61, %eax           // case '='
        je      assign_branch
        cmpl    $42, %eax           // case '*'
        je      mult_branch
        jmp     switch_default      // default

assign_branch:
        // the "bunch of stuff" above calculated *src and *dst, which are
        // stored in -24(%rbp) and -32(%rbp).
        movq    -24(%rbp), %rax     // %rax = src
        movsd   (%rax), %xmm0       // %xmm0 = src.r
        movq    -32(%rbp), %rax     // %rax = dst
        movsd   %xmm0, (%rax)       // dst.r = %xmm0
        movq    -24(%rbp), %rax     // %rax = src
        movsd   8(%rax), %xmm0      // %xmm0 = src.i
        movq    -32(%rbp), %rax     // %rax = dst
        movsd   %xmm0, 8(%rax)      // dst.i = %xmm0
        jmp     for_loop_step

add_branch:
        movq    -32(%rbp), %rax     // %rax = dst
        movsd   (%rax), %xmm1       // %xmm1 = dst.r
        movq    -24(%rbp), %rax     // %rax = src
        movsd   (%rax), %xmm0       // %xmm0 = src.r
        addsd   %xmm1, %xmm0        // %xmm0 += %xmm1
        movq    -32(%rbp), %rax     // %rax = dst
        movsd   %xmm0, (%rax)       // dst.r = %xmm0
        movq    -32(%rbp), %rax     // same thing for src.i and dst.i
        movsd   8(%rax), %xmm1
        movq    -24(%rbp), %rax
        movsd   8(%rax), %xmm0
        addsd   %xmm1, %xmm0
        movq    -32(%rbp), %rax
        movsd   %xmm0, 8(%rax)
        jmp     for_loop_step

mult_branch:
        movq    -32(%rbp), %rax
        movsd   (%rax), %xmm1
        movq    -24(%rbp), %rax
        movsd   (%rax), %xmm0
        mulsd   %xmm1, %xmm0
        movq    -32(%rbp), %rax
        movsd   8(%rax), %xmm2
        movq    -24(%rbp), %rax
        movsd   8(%rax), %xmm1
        mulsd   %xmm2, %xmm1
        subsd   %xmm1, %xmm0
        movsd   %xmm0, -16(%rbp)    // double r = src.r*dst.r - src.i*dst.i
        movq    -32(%rbp), %rax
        movsd   (%rax), %xmm1
        movq    -24(%rbp), %rax
        movsd   8(%rax), %xmm0
        mulsd   %xmm0, %xmm1
        movq    -32(%rbp), %rax
        movsd   8(%rax), %xmm2
        movq    -24(%rbp), %rax
        movsd   (%rax), %xmm0
        mulsd   %xmm2, %xmm0
        addsd   %xmm1, %xmm0
        movsd   %xmm0, -8(%rbp)     // double i = src.r*dst.i + src.i*dst.r
        movq    -32(%rbp), %rax
        movsd   -16(%rbp), %xmm0
        movsd   %xmm0, (%rax)       // dst.r = r
        movq    -32(%rbp), %rax
        movsd   -8(%rbp), %xmm0
        movsd   %xmm0, 8(%rax)      // dst.i = i
        jmp     for_loop_step

for_loop_step:
        addq    $3, -48(%rbp)
for_loop_condition:
        movq    -48(%rbp), %rax     // %rax = code (the pointer)
        movzbl  (%rax), %eax        // %eax = *code (move one byte)
        testb   %al, %al            // is %eax 0?
        jne     for_loop_body       // if no, then continue
        leave                       // otherwise rewind stack
        ret                         // pop and jmp
```

## Compilation strategy

Most of the above is register-shuffling fluff that we can get rid of. We're compiling the code up front, which means all of our register addresses are known quantities and we won't need any unknown indirection at runtime. So all of the shuffling into and out of %rax can be replaced by a much simpler move directly to or from N(%rdi) -- since %rdi is the argument that points to the first register's real component.

If you haven't already, at this point I'd recommend downloading the Intel software developer's manual, of which volume 2 describes the semantics and machine code representation of every instruction.

NOTE: GCC uses AT&T assembly syntax, whereas the Intel manuals use Intel assembly syntax. An important difference is that AT&T reverses the arguments: mov %rax, %rbx (AT&T syntax) assigns to %rbx, whereas mov rax, rbx (Intel syntax) assigns to rax. All of my code examples use AT&T, and none of this will matter once we're working with machine code.
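To make the "known quantities" point concrete, here's a throwaway helper -- illustrative only; mandeljit.c below just does this arithmetic inline -- that maps a MandelASM register letter to its displacement off %rdi:

```
// Illustrative only: displacement of a register's real or imaginary part,
// relative to %rdi (which points at registers[0].r).
#include <stddef.h>

typedef struct { double r; double i; } complex;

static int reg_disp(char reg, int imaginary) {
  return (int) (sizeof(complex) * (reg - 'a')              // which register
              + (imaginary ? offsetof(complex, i) : 0));   // which component
}

// reg_disp('b', 0) == 16, so b.r lives at 16(%rdi) -- exactly the
// movsd 16(%rdi), %xmm0 in the example below.
```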
## Example: the Mandelbrot function *bb+ab

```
// Step 1: multiply register B by itself
movsd 16(%rdi), %xmm0       // %xmm0 = b.r
movsd 24(%rdi), %xmm1       // %xmm1 = b.i
movsd 16(%rdi), %xmm2       // %xmm2 = b.r
movsd 24(%rdi), %xmm3       // %xmm3 = b.i

movsd %xmm0, %xmm4          // %xmm4 = b.r
mulsd %xmm2, %xmm4          // %xmm4 = b.r*b.r
movsd %xmm1, %xmm5          // %xmm5 = b.i
mulsd %xmm3, %xmm5          // %xmm5 = b.i*b.i
subsd %xmm5, %xmm4          // %xmm4 = b.r*b.r - b.i*b.i
movsd %xmm4, 16(%rdi)       // b.r = %xmm4

mulsd %xmm0, %xmm3          // %xmm3 = b.r*b.i
mulsd %xmm1, %xmm2          // %xmm2 = b.i*b.r
addsd %xmm3, %xmm2          // %xmm2 = b.r*b.i + b.i*b.r
movsd %xmm2, 24(%rdi)       // b.i = %xmm2

// Step 2: add register A to register B
movpd (%rdi), %xmm0         // %xmm0 = (a.r, a.i)
addpd 16(%rdi), %xmm0       // %xmm0 += (b.r, b.i)
movpd %xmm0, 16(%rdi)       // (b.r, b.i) = %xmm0
```

The multiplication code isn't optimized for the squaring-a-register use case; instead, I left it fully general so we can use it as a template when we start generating machine code.

## JIT mechanics

Before we compile a real language, let's just get a basic code generator working.

```
// jitproto.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

typedef long(*fn)(long);

fn compile_identity(void) {
  // Allocate some memory and set its permissions correctly. In particular, we
  // need PROT_EXEC (which isn't normally enabled for data memory, e.g. from
  // malloc()), which tells the processor it's ok to execute it as machine
  // code.
  char *memory = mmap(NULL,             // address
                      4096,             // size
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS,
                      -1,               // fd (not used here)
                      0);               // offset (not used here)
  if (memory == MAP_FAILED) {
    perror("failed to allocate memory");
    exit(1);
  }

  int i = 0;

  // mov %rdi, %rax
  memory[i++] = 0x48;   // REX.W prefix
  memory[i++] = 0x8b;   // MOV opcode, register/register
  memory[i++] = 0xc7;   // ModR/M byte for %rdi -> %rax

  // ret
  memory[i++] = 0xc3;   // RET opcode

  return (fn) memory;
}

int main() {
  fn f = compile_identity();
  int i;
  for (i = 0; i < 10; ++i)
    printf("f(%d) = %ld\n", i, (*f)(i));
  munmap(f, 4096);
  return 0;
}
```

This does what we expect: we've just produced an identity function.

```
$ gcc jitproto.c -o jitproto
$ ./jitproto
f(0) = 0
f(1) = 1
f(2) = 2
f(3) = 3
f(4) = 4
f(5) = 5
f(6) = 6
f(7) = 7
f(8) = 8
f(9) = 9
```

TODO: explanation about userspace page mapping/permissions, and how ELF instructions tie into this (maybe also explain stuff like the FD table while we're at it)

## Generating MandelASM machine code

This is where we start to get some serious mileage out of the Intel manuals. We need encodings for the following instructions:

* `f2 0f 11`: movsd reg -> memory
* `f2 0f 10`: movsd memory -> reg
* `f2 0f 59`: mulsd reg -> reg
* `f2 0f 58`: addsd reg -> reg
* `f2 0f 5c`: subsd reg -> reg
* `66 0f 11`: movpd reg -> memory (technically movupd for unaligned move)
* `66 0f 10`: movpd memory -> reg
* `66 0f 58`: addpd memory -> reg

## The gnarly bits: how operands are specified

Chapter 2 of the Intel manual volume 2 contains a roundabout, confusing description of operand encoding, so I'll try to sum up the basics here.
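The short version: a ModR/M byte packs three fields -- two mod bits choosing the addressing form, three reg bits naming the register operand, and three r/m bits naming the other register or memory operand. A throwaway helper (illustrative only; micro-asm.h below just hardcodes the bytes) makes the packing explicit:

```
// Illustrative only: assemble a ModR/M byte from its three fields.
static unsigned char modrm(unsigned mod, unsigned reg, unsigned rm) {
  return (unsigned char) (((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}

// modrm(1, 0, 7) == 0x47: mod=01 (8-bit displacement), reg=%xmm0, r/m=%rdi --
// the 0x47 byte used below for movsd %xmm0, 16(%rdi).
```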
(TODO)

For the operators above, we've got two ModR/M configurations:

* movsd reg <-> X(%rdi): mod = 01, r/m = 111, disp8 = X
* addsd reg -> reg: mod = 11

At the byte level, they're written like this:

```
movsd %xmm0, 16(%rdi)           # f2 0f 11 47 10
                                # modr/m = b01 000 111 = 47
                                # disp   = 16 = 0x10
addsd %xmm3, %xmm4              # f2 0f 58 e3
                                # modr/m = b11 100 011 = e3
```

## A simple micro-assembler

```
// micro-asm.h
#include <stdarg.h>

typedef struct { char *dest; } microasm;

// this makes it more obvious what we're doing later on
#define xmm(n) (n)

void asm_write(microasm *a, int n, ...) {
  va_list bytes;
  int i;
  va_start(bytes, n);
  for (i = 0; i < n; ++i) *(a->dest++) = (char) va_arg(bytes, int);
  va_end(bytes);
}

void movsd_reg_memory(microasm *a, char reg, char disp) {
  asm_write(a, 5, 0xf2, 0x0f, 0x11, 0x47 | reg << 3, disp);
}

void movsd_memory_reg(microasm *a, char disp, char reg) {
  asm_write(a, 5, 0xf2, 0x0f, 0x10, 0x47 | reg << 3, disp);
}

void movsd_reg_reg(microasm *a, char src, char dst) {
  asm_write(a, 4, 0xf2, 0x0f, 0x11, 0xc0 | src << 3 | dst);
}

void mulsd(microasm *a, char src, char dst) {
  asm_write(a, 4, 0xf2, 0x0f, 0x59, 0xc0 | dst << 3 | src);
}

void addsd(microasm *a, char src, char dst) {
  asm_write(a, 4, 0xf2, 0x0f, 0x58, 0xc0 | dst << 3 | src);
}

void subsd(microasm *a, char src, char dst) {
  asm_write(a, 4, 0xf2, 0x0f, 0x5c, 0xc0 | dst << 3 | src);
}

void movpd_reg_memory(microasm *a, char reg, char disp) {
  asm_write(a, 5, 0x66, 0x0f, 0x11, 0x47 | reg << 3, disp);
}

void movpd_memory_reg(microasm *a, char disp, char reg) {
  asm_write(a, 5, 0x66, 0x0f, 0x10, 0x47 | reg << 3, disp);
}

void addpd_memory_reg(microasm *a, char disp, char reg) {
  asm_write(a, 5, 0x66, 0x0f, 0x58, 0x47 | reg << 3, disp);
}
```

## Putting it all together

Now that we can write assembly-level stuff, we can take the structure from the prototype JIT compiler and modify it to compile MandelASM.

```
// mandeljit.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#include "micro-asm.h"

#define sqr(x) ((x) * (x))

typedef struct { double r; double i; } complex;
typedef void(*compiled)(complex*);

#define offsetof(type, field) ((unsigned long) &(((type *) 0)->field))

compiled compile(char *code) {
  char *memory = mmap(NULL, 4096,
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  microasm a = { .dest = memory };
  char src_dsp, dst_dsp;
  char const r = offsetof(complex, r);
  char const i = offsetof(complex, i);

  for (; *code; code += 3) {
    src_dsp = sizeof(complex) * (code[1] - 'a');
    dst_dsp = sizeof(complex) * (code[2] - 'a');
    switch (*code) {
      case '=':
        movpd_memory_reg(&a, src_dsp, xmm(0));
        movpd_reg_memory(&a, xmm(0), dst_dsp);
        break;
      case '+':
        movpd_memory_reg(&a, src_dsp, xmm(0));
        addpd_memory_reg(&a, dst_dsp, xmm(0));
        movpd_reg_memory(&a, xmm(0), dst_dsp);
        break;
      case '*':
        movsd_memory_reg(&a, src_dsp + r, xmm(0));
        movsd_memory_reg(&a, src_dsp + i, xmm(1));
        movsd_memory_reg(&a, dst_dsp + r, xmm(2));
        movsd_memory_reg(&a, dst_dsp + i, xmm(3));
        movsd_reg_reg   (&a, xmm(0), xmm(4));
        mulsd           (&a, xmm(2), xmm(4));
        movsd_reg_reg   (&a, xmm(1), xmm(5));
        mulsd           (&a, xmm(3), xmm(5));
        subsd           (&a, xmm(5), xmm(4));
        movsd_reg_memory(&a, xmm(4), dst_dsp + r);
        mulsd           (&a, xmm(0), xmm(3));
        mulsd           (&a, xmm(1), xmm(2));
        addsd           (&a, xmm(3), xmm(2));
        movsd_reg_memory(&a, xmm(2), dst_dsp + i);
        break;
      default:
        fprintf(stderr, "undefined instruction %s (ASCII %x)\n", code, *code);
        exit(1);
    }
  }
  // Return to caller (important! otherwise we'll segfault)
  asm_write(&a, 1, 0xc3);
  return (compiled) memory;
}

int main(int argc, char **argv) {
  compiled fn = compile(argv[1]);
  complex registers[4];
  int i, x, y;
  char line[1600];
  printf("P5\n%d %d\n%d\n", 1600, 900, 255);
  for (y = 0; y < 900; ++y) {
    for (x = 0; x < 1600; ++x) {
      registers[0].r = 2 * 1.6 * (x / 1600.0 - 0.5);
      registers[0].i = 2 * 0.9 * (y / 900.0 - 0.5);
      for (i = 1; i < 4; ++i) registers[i].r = registers[i].i = 0;
      for (i = 0; i < 256 && sqr(registers[1].r) + sqr(registers[1].i) < 4; ++i)
        (*fn)(registers);
      line[x] = i;
    }
    fwrite(line, 1, sizeof(line), stdout);
  }
  return 0;
}
```

Now let's benchmark the interpreted and JIT-compiled versions:

```
$ gcc mandeljit.c -o mandeljit
$ time ./simple *bb+ab > /dev/null
real    0m2.348s
user    0m2.344s
sys     0m0.000s
$ time ./mandeljit *bb+ab > /dev/null
real    0m1.462s
user    0m1.460s
sys     0m0.000s
```

Very close to the limit performance of the hardcoded version. And, of course, the JIT-compiled result is identical to the interpreted one:

```
$ ./simple *bb+ab | md5sum
12a1013d55ee17998390809ffd671dbc  -
$ ./mandeljit *bb+ab | md5sum
12a1013d55ee17998390809ffd671dbc  -
```

## Further reading

### Debugging JIT compilers

First, you need a good scotch; this one should work.

Once you've got that set up, gdb can probably be scripted to do what you need. I've used it somewhat successfully to debug a bunch of hand-written self-modifying machine code with no debugging symbols -- the limitations of the approach ended up being whiskey-related rather than any deficiency of GDB itself.

I've also had some luck using radare2 to figure out when I was generating bogus instructions. Offline disassemblers like NASM and YASM won't help you.

### Low-level

* The Intel guides cover a lot of stuff we didn't end up using here: addressing modes, instructions, etc. If you're serious about writing JIT compilers, it's worth an in-depth read.
* Agner Fog's guides to processor-level optimization: an insanely detailed tour through processor internals, instruction parsing pipelines, and pretty much every variant of every processor in existence.
* The V8 source code: how JIT assemblers are actually written
* The JVM source code
* Jonesforth: a well-documented example of low-level code generation and interpreter structure (sort of a JIT alternative)
* Canard machine code: similar to Jonesforth, but uses machine code for its data structures