[HN Gopher] Faster interpreters in Go: Catching up with C++
       ___________________________________________________________________
        
       Faster interpreters in Go: Catching up with C++
        
       Author : ksec
       Score  : 227 points
       Date   : 2025-04-05 17:59 UTC (1 days ago)
        
 (HTM) web link (planetscale.com)
 (TXT) w3m dump (planetscale.com)
        
       | hinkley wrote:
       | Compiling queries is one of those things that is both Make it
       | Right and Make it Fast.
       | 
       | Because bind variables and Prepared Statements go hand in hand,
       | you want to do everything you can to coax your users into doing
       | the right thing. If each unique query has to be compiled before
       | first use, that's an extra inducement to using prepared
       | statements properly.
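        | 
        | A minimal Go sketch of that pattern with database/sql (the query
        | and schema are invented, and real code would cache the *sql.Stmt
        | rather than preparing it on every call):
        | 
        |     package queries
        | 
        |     import "database/sql"
        | 
        |     // queryUserName shows the "compile once, bind many" shape:
        |     // the query text is prepared once, then each execution only
        |     // binds values to the placeholder.
        |     func queryUserName(db *sql.DB, id int) (string, error) {
        |         stmt, err := db.Prepare(
        |             "SELECT name FROM users WHERE id = ?")
        |         if err != nil {
        |             return "", err
        |         }
        |         defer stmt.Close()
        | 
        |         var name string
        |         err = stmt.QueryRow(id).Scan(&name)
        |         return name, err
        |     }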
        
       | rockwotj wrote:
       | The resulting VM implementation reminds me of this post from
       | cloudflare where they similarly use closures to build their
       | interpreter.
       | 
       | https://blog.cloudflare.com/building-fast-interpreters-in-ru...
        
       | cosmos0072 wrote:
       | Compiling an expression to a tree of closures, and a list of
       | statements to a slice of closures, is exactly how I optimized
        | [gomacro](https://github.com/cosmos72/gomacro), my Go interpreter
        | written in Go.
       | 
        | There are more tricks available there, for example unrolling
        | the loop that calls the list of closures, and having a `nop`
        | closure that is executed when there's nothing to run but
        | execution is not yet at the end of the unrolled loop.
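        | 
        | A rough Go sketch of the unrolled loop with `nop` padding
        | (simplified, not gomacro's actual code):
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     // stmt is a compiled statement acting on some state
        |     // (just an int here to keep the sketch small).
        |     type stmt func(env *int)
        | 
        |     // nop fills the padding slots of the unrolled loop.
        |     func nop(*int) {}
        | 
        |     // exec runs the compiled statements four at a time; the
        |     // tail is padded with nop so the unrolled body needs no
        |     // per-slot bounds check.
        |     func exec(code []stmt, env *int) {
        |         for len(code)%4 != 0 {
        |             code = append(code, nop)
        |         }
        |         for i := 0; i < len(code); i += 4 {
        |             code[i](env)
        |             code[i+1](env)
        |             code[i+2](env)
        |             code[i+3](env)
        |         }
        |     }
        | 
        |     func main() {
        |         env := 0
        |         add := func(n int) stmt { return func(e *int) { *e += n } }
        |         exec([]stmt{add(1), add(2), add(3)}, &env)
        |         fmt.Println(env) // 6
        |     }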
        
         | ALLTaken wrote:
          | Impressive! I would love to learn how to implement that in
          | Julia. Could you help me understand how you did that?
          | 
          | I'd love to see if it's possible to create a libc-free,
          | dependency-free executable without Nim (https://nim-lang.org/).
        
           | cosmos0072 wrote:
           | The core idea is simple: do a type analysis on each
           | expression you want to "compile" to a closure, and
           | instantiate the correct closure for each type combination.
           | 
           | Here is a pseudocode example, adapted from gomacro sources:
           | 
           | https://gist.github.com/cosmos72/f971c172e71d08030f92a1fc5fa.
           | ..
           | 
           | This works best for "compiling" statically typed languages,
           | and while much faster than an AST interpreter, the "tree of
           | closures" above is still ~10 times slower that natively
           | compiled code. And it's usually also slower than JIT-compiled
           | code
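            | 
            | A condensed Go sketch of that idea (freely adapted, not the
            | gist's actual code; types are simplified):
            | 
            |     package main
            | 
            |     import "fmt"
            | 
            |     type kind int
            | 
            |     const (
            |         kindInt kind = iota
            |         kindString
            |     )
            | 
            |     // expr is a compiled expression: a closure plus the
            |     // static type inferred for it at "compile" time.
            |     type expr struct {
            |         kind kind
            |         eval func() interface{}
            |     }
            | 
            |     // compileAdd inspects the operand types once and returns
            |     // a closure specialised for that combination, so the
            |     // type switch is not repeated on every evaluation.
            |     func compileAdd(a, b expr) expr {
            |         switch a.kind {
            |         case kindInt:
            |             return expr{kindInt, func() interface{} {
            |                 return a.eval().(int) + b.eval().(int)
            |             }}
            |         case kindString:
            |             return expr{kindString, func() interface{} {
            |                 return a.eval().(string) + b.eval().(string)
            |             }}
            |         default:
            |             panic("unsupported operand types")
            |         }
            |     }
            | 
            |     func constInt(v int) expr {
            |         return expr{kindInt, func() interface{} { return v }}
            |     }
            | 
            |     func main() {
            |         sum := compileAdd(constInt(1), constInt(2))
            |         fmt.Println(sum.eval()) // 3; the int branch was
            |                                 // chosen once, at compile time
            |     }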
        
           | stevekemp wrote:
           | I put together this example after reading the article:
           | 
           | https://github.com/skx/simple-vm
           | 
           | Simpler than the full gomacro codebase, but perhaps helpful.
        
             | cosmos0072 wrote:
             | For optimal speed, you should move as much code as possible
             | _outside_ the closures.
             | 
             | In particular, you should do the `switch op` at
             | https://github.com/skx/simple-
             | vm/blob/b3917aef0bd6c4178eed0c... outside the closure, and
             | create a different, specialised closure for each case.
             | Otherwise the "fast interpreter" may be almost as slow as a
             | vanilla AST walker.
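              | 
              | A sketch of that advice in Go (illustrative only, not
              | simple-vm's actual code): resolve the opcode once while
              | compiling, so execution never re-runs the switch.
              | 
              |     package main
              | 
              |     import "fmt"
              | 
              |     // vm holds the interpreter state: an operand stack.
              |     type vm struct{ stack []int }
              | 
              |     // op is one compiled instruction.
              |     type op func(*vm)
              | 
              |     // compileOp does the `switch opcode` at compile time
              |     // and returns a closure that already knows its job.
              |     func compileOp(opcode byte, operand int) op {
              |         switch opcode {
              |         case 0x01: // PUSH operand
              |             return func(m *vm) {
              |                 m.stack = append(m.stack, operand)
              |             }
              |         case 0x02: // ADD the top two stack values
              |             return func(m *vm) {
              |                 n := len(m.stack)
              |                 m.stack[n-2] += m.stack[n-1]
              |                 m.stack = m.stack[:n-1]
              |             }
              |         default:
              |             panic("unknown opcode")
              |         }
              |     }
              | 
              |     func main() {
              |         program := []op{
              |             compileOp(0x01, 2),
              |             compileOp(0x01, 3),
              |             compileOp(0x02, 0),
              |         }
              |         m := &vm{}
              |         for _, o := range program {
              |             o(m) // no switch here: each closure is
              |                  // already specialised
              |         }
              |         fmt.Println(m.stack) // [5]
              |     }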
        
       | MatthiasPortzel wrote:
       | It's crazy that this post seems to have stumbled across an
       | equivalent to the Copy-and-Patch technique[0] used to create a
        | Lua interpreter faster than LuaJIT[1].
       | 
       | [0]: https://sillycross.github.io/2023/05/12/2023-05-12/ [1]:
       | https://sillycross.github.io/2022/11/22/2022-11-22/
       | 
       | The major difference is that LuaJIT Remake's Copy-and-Patch
       | requires "manually" copying blocks of assembly code and patching
       | values, while this post relies on the Go compiler's closures to
       | create copies of the functions with runtime-known values.
       | 
        | I think there's fascinating progress being made in this area--I
       | think in the future this technique (in some form) will be the go-
       | to way to create new interpreted languages, and AST interpreters,
       | switch-based bytecode VMs, and JIT compilation will be things of
       | the past.
        
         | deagle50 wrote:
         | Faster than LuaJIT's interpreter, not the compiled code.
        
         | bob1029 wrote:
         | > switch-based bytecode VMs
         | 
         | I am finding a switched byte interpreter to be _very_ expedient
         | on my computer. It seems that if the # of cases is kept small
         | enough, your chances of getting a good branch prediction can go
         | up substantially. Something like a brainfuck interpreter runs
         | extremely fast. In the worst case of randomly guessing, you are
         | still going to time travel with a 12.5% success rate.
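          | 
          | For example, a minimal switch-dispatched interpreter in Go (a
          | Brainfuck-like subset, loops omitted for brevity):
          | 
          |     package main
          | 
          |     import (
          |         "fmt"
          |         "strings"
          |     )
          | 
          |     // run interprets the program with one switch per opcode.
          |     // A small, dense set of cases keeps the dispatch branch
          |     // reasonably predictable.
          |     func run(program string) {
          |         tape := make([]byte, 30000)
          |         ptr := 0
          |         for pc := 0; pc < len(program); pc++ {
          |             switch program[pc] {
          |             case '>':
          |                 ptr++
          |             case '<':
          |                 ptr--
          |             case '+':
          |                 tape[ptr]++
          |             case '-':
          |                 tape[ptr]--
          |             case '.':
          |                 fmt.Printf("%c", tape[ptr])
          |             }
          |         }
          |     }
          | 
          |     func main() {
          |         run(strings.Repeat("+", 65) + ".") // prints "A"
          |     }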
        
           | josefx wrote:
           | > In the worst case of randomly guessing, you are still going
           | to time travel with a 12.5% success rate.
           | 
           | Random guessing clearly isn't the worst case then, a bad
           | prediction can miss every single time.
        
         | frumplestlatz wrote:
          | The JVM has had a template interpreter since the mid-90s; it's
          | nothing new, and template interpreters are only performant
          | enough to provide acceptable execution speed until you JIT.
         | 
         | Template interpreters are not a substitute for real JIT -- JIT
         | compilation isn't going anywhere.
        
           | cogman10 wrote:
           | My understanding of most optimizing compilers is that this is
           | an extremely common "last step" sort of optimization. A lot
           | of the optimizing work is beating the code into a canonical
           | form where these sorts of templates can be readily applied.
           | 
            | It was also my understanding that this is the point of
            | "superoptimizers" [1], which look for these common patterns
            | in something like LLVM IR to generate optimization targets
            | for the mainline optimizer.
           | 
           | [1] https://en.wikipedia.org/wiki/Superoptimization
        
         | rockwotj wrote:
          | It's not really copy-and-patch. The whole point of copy-and-
          | patch is that you can inline the code into your compilation
          | output: it's a fast baseline compiler because the individual
          | builtin functions are optimized (via the compiler output
          | you're copying from) and inlined (which is why you need to
          | patch, to update which registers are being used). In that
          | model you really only JIT the control flow, then inline the
          | implementation of each bytecode operation (in contrast to
          | Sparkplug [https://v8.dev/blog/sparkplug], which just calls a
          | builtin instead of copying and patching). It's still JIT,
          | which is vastly different from an interpreter.
         | 
         | > JIT will be things of the past
         | 
          | Sorry, no. JIT is not going anywhere. They mentioned in the
          | article that a JIT would give better performance, just with
          | more effort than they wanted to put in (a good tradeoff!).
          | JIT powers the Java, Wasm and JavaScript VMs, and it is
          | certainly the way to get the fastest code, because you can
          | give the CPU code whose next instruction it can predict much
          | better. With interpreters you're often limited by the
          | indirection of loads when looking up the next function to
          | call, and generating code for the control flow outside of
          | calls to your "builtins" is precisely what Sparkplug does.
         | 
         | At the end of the day, like most of engineering, choose the
         | right tool for the job, which in this case is simplicity (which
         | is often the right choice!), but that doesn't mean it's always
         | the right choice. For example if browsers did this then
         | Javascript performance would tank compared to what we get
         | today.
        
         | cryptonector wrote:
         | > It's crazy that this post seems to have stumbled across an
         | equivalent to the Copy-and-Patch technique[0] used to create a
          | Lua interpreter faster than LuaJIT[1]
         | 
         | > this post relies on the Go compiler's closures to create
         | copies of the functions with runtime-known values
         | 
          | To be clear, the technique of using closures like this is
          | ancient in the world of Lisp. You can see it in Paul Graham's
          | books on Lisp from the 90s, in Lisp in Small Pieces, and in
          | many interpreters of 80s/90s vintage. I would say that it is
          | quite standard.
        
       | politician wrote:
        | It would be ideal if Go would add support for computed goto, so
        | that we could build direct-threaded interpreters.
        
         | zabzonk wrote:
         | I don't think "ideal" would be the exact word - I used to
         | shudder at it in FORTRAN.
        
         | kiitos wrote:
         | hard no - goto is a feature for code generators, not anything
         | meant for human use
        
           | greenavocado wrote:
            | I never understood this argument. Without RAII you can easily
            | get screwed by resource leaks when returning early, so in
            | this regard goto is expedient. How do C programmers avoid
            | this problem without goto?
            | 
            |     bool function_with_cleanup(void) {
            |         int *buffer1 = NULL;
            |         int *buffer2 = NULL;
            |         FILE *file = NULL;
            |         bool success = false;
            | 
            |         // Allocate first resource
            |         buffer1 = malloc(sizeof(int) * 100);
            |         if (!buffer1) {
            |             goto cleanup;  // Error, jump to cleanup
            |         }
            | 
            |         // Allocate second resource
            |         buffer2 = malloc(sizeof(int) * 200);
            |         if (!buffer2) {
            |             goto cleanup;  // Error, jump to cleanup
            |         }
            | 
            |         // Open a file
            |         file = fopen("data.txt", "r");
            |         if (!file) {
            |             goto cleanup;  // Error, jump to cleanup
            |         }
            | 
            |         // Do work with all resources...
            | 
            |         success = true;  // Only set to true if
            |                          // everything succeeded
            | 
            |     cleanup:
            |         // Free resources in reverse order of acquisition
            |         if (file) fclose(file);
            |         free(buffer2);  // free() is safe on NULL pointers
            |         free(buffer1);
            | 
            |         return success;
            |     }
        
             | dwattttt wrote:
             | Open a scope when you check resource acquisition passed,
             | rather than the opposite (jump to the end of the function
             | if it failed).
             | 
             | It can get quite hilly, which doesn't look great. It does
             | have the advantage that each resource is explicitly only
             | valid in a visible scope, and there's a marker at the end
             | to denote the valid region of the resource is ending.
             | 
             | EDIT: you mentioned early return, this style forbids early
             | return (at least, any early return after the first resource
             | acquisition)
        
             | gigatexal wrote:
              | In this example, couldn't the `goto cleanup` instead be a
              | `return cleanup_func(...)` where the same cleanup code is
              | executed?
        
             | pjmlp wrote:
              | Maybe that is exactly the problem: stop using a language
              | designed in the 1970s that purposely ignored the ecosystem
              | outside Bell Labs, except where it is unavoidable.
              | 
              | And in that case, the C compiler doesn't limit how many
              | functions you write, so you can modularize their
              | implementations better.
              | 
              |     bool function_with_cleanup(void) {
              |         int *buffer1 = NULL;
              |         int *buffer2 = NULL;
              |         FILE *file = NULL;
              | 
              |         // Allocate first resource
              |         buffer1 = malloc(sizeof(int) * 100);
              |         if (!buffer1) {
              |             return false;
              |         }
              | 
              |         // Allocate second resource
              |         buffer2 = malloc(sizeof(int) * 200);
              |         if (!buffer2) {
              |             free(buffer1);
              |             return false;
              |         }
              | 
              |         // Open a file
              |         file = fopen("data.txt", "r");
              |         if (!file) {
              |             free(buffer2);
              |             free(buffer1);
              |             return false;
              |         }
              | 
              |         // Do work with all resources...
              | 
              |         fclose(file);
              |         free(buffer2);
              |         free(buffer1);
              |         return true;
              |     }
              | 
              | Ah, but all those _free()_ calls get tedious, and can be
              | forgotten or mistyped.
              | 
              |     bool function_with_cleanup(void) {
              |         int *buffer1 = NULL;
              |         int *buffer2 = NULL;
              |         FILE *file = NULL;
              | 
              |         // Allocate first resource
              |         buffer1 = arena_alloc(&current_arena,
              |                               sizeof(int) * 100);
              |         if (!buffer1) {
              |             return false;
              |         }
              | 
              |         // Allocate second resource
              |         buffer2 = arena_alloc(&current_arena,
              |                               sizeof(int) * 200);
              |         if (!buffer2) {
              |             arena_reset(&current_arena);
              |             return false;
              |         }
              | 
              |         // Open a file
              |         file = fopen("data.txt", "r");
              |         if (!file) {
              |             arena_reset(&current_arena);
              |             return false;
              |         }
              | 
              |         // Do work with all resources...
              | 
              |         fclose(file);
              |         arena_reset(&current_arena);
              |         return true;
              |     }
              | 
              | That can still be improved with a mix of macros and varargs
              | functions.
              | 
              | Or, if using language extensions is an option, with the
              | various ways to do _defer_ in C.
               | LtWorf wrote:
                | You understand Go doesn't have exceptions, right?
        
               | pjmlp wrote:
                | You understand that I wrote C code? And as far as Go is
                | concerned, panic/recover are exceptions that don't want
                | to admit they are exceptions.
        
               | LtWorf wrote:
               | You can't handle a panic, it's the whole point.
        
             | shakna wrote:
              | Attributes, mostly, which have become so common that defer
              | is very likely to be in the next C standard. [0]
              | 
              |     // Cleanup handlers receive a pointer to the annotated
              |     // variable, hence the small wrapper functions.
              |     static void free_ptr(int **p) { free(*p); }
              |     static void close_file(FILE **f) { if (*f) fclose(*f); }
              | 
              |     bool function_with_cleanup(void) {
              |         // Allocate first resource
              |         int *buffer1 __attribute__((__cleanup__(free_ptr)))
              |             = malloc(sizeof(int) * 100);
              |         if (!buffer1) { return false; }
              |         // Allocate second resource
              |         int *buffer2 __attribute__((__cleanup__(free_ptr)))
              |             = malloc(sizeof(int) * 200);
              |         if (!buffer2) { return false; }
              |         // Open a file
              |         FILE *file __attribute__((__cleanup__(close_file)))
              |             = fopen("data.txt", "r");
              |         if (!file) { return false; }
              |         // Do work with all resources...
              |         return true;
              |     }
             | 
             | [0] https://thephd.dev/c2y-the-defer-technical-
             | specification-its...
        
             | mkl wrote:
              | Nested ifs are my preference:
              | 
              |     bool function_with_cleanup(void) {
              |         int *buffer1 = NULL;
              |         int *buffer2 = NULL;
              |         FILE *file = NULL;
              |         bool success = false;
              | 
              |         // Allocate first resource
              |         buffer1 = malloc(sizeof(int) * 100);
              |         if (buffer1) {
              |             // Allocate second resource
              |             buffer2 = malloc(sizeof(int) * 200);
              |             if (buffer2) {
              |                 // Open a file
              |                 file = fopen("data.txt", "r");
              |                 if (file) {
              |                     // Do work with all resources...
              |                     fclose(file);
              |                     success = true;  // Only set to true if
              |                                      // everything succeeded
              |                 }
              |                 free(buffer2);
              |             }
              |             free(buffer1);
              |         }
              | 
              |         return success;
              |     }
              | 
              | Much shorter and more straightforward.
              | 
              | One-time loops with break also work if you're not doing the
              | resource allocation in another loop:
              | 
              |     bool function_with_cleanup(void) {
              |         int *buffer1 = NULL;
              |         int *buffer2 = NULL;
              |         FILE *file = NULL;
              |         bool success = false;
              | 
              |         do {  // One-time loop to break out of on error
              |             // Allocate first resource
              |             buffer1 = malloc(sizeof(int) * 100);
              |             if (!buffer1) {
              |                 break;  // Error, jump to cleanup
              |             }
              | 
              |             // Allocate second resource
              |             buffer2 = malloc(sizeof(int) * 200);
              |             if (!buffer2) {
              |                 break;  // Error, jump to cleanup
              |             }
              | 
              |             // Open a file
              |             file = fopen("data.txt", "r");
              |             if (!file) {
              |                 break;  // Error, jump to cleanup
              |             }
              | 
              |             // Do work with all resources...
              | 
              |             success = true;  // Only set to true if
              |                              // everything succeeded
              |         } while (false);
              | 
              |         // Free resources in reverse order of acquisition
              |         if (file) fclose(file);
              |         free(buffer2);  // free() is safe on NULL pointers
              |         free(buffer1);
              | 
              |         return success;
              |     }
              | 
              | Still simpler to follow than goto IMHO. Both these patterns
              | work in other languages without goto too, e.g. Python.
        
               | gpderetta wrote:
               | Nested if, aside from being awful, doesn't scale.
               | 
               | And break is extremely thin sugar on top of go-to.
        
           | jandrewrogers wrote:
           | There are situations in practical code where goto is the only
           | reasonable choice if you want to avoid spaghetti code.
           | Absolutes have no place in software, you can find scenarios
           | where almost every typically bad idea is actually a good
           | idea.
           | 
           | It is a tool, not a religion.
        
             | 9rx wrote:
             | Even Dijkstra was clear that he was only referring to
             | unbridled gotos.
        
         | 9dev wrote:
         | Go to goto, go!
        
         | pklausler wrote:
         | Did you mean "assigned GOTO", not computed GOTO? Because that's
         | just a switch list.
        
           | knome wrote:
           | https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
           | 
            | 'computed goto' is used in GCC to mean the same as assigned
            | goto in Fortran. The Fortran variation appears to have more
            | restrictions than the GNU C one.
        
             | johnisgood wrote:
             | I may be dumb, but what is "goto *ptr;" useful for? Or
             | &&foo.
        
               | withinboredom wrote:
                | Error conditions when you don't have exceptions:
                | 
                |     goto *errOrSuccess;
                | 
                | is a pretty normal one. This basically allows you to
                | emulate a "finally".
                | 
                | State machines are another big one, which look much nicer
                | with computed goto than with a switch statement.
        
               | trealira wrote:
               | It's used in VMs, emulators, and other interpreters for
               | making a dispatch loop with less branching than a
               | standard loop with a switch inside it.
               | 
               | https://eli.thegreenplace.net/2012/07/12/computed-goto-
               | for-e...
        
         | nemo1618 wrote:
         | Is computed goto used for anything _other_ than interpreter
         | loops? Because if not, I would rather have a special  "it looks
         | like you're trying to implement an interpreter loop" case in
         | the compiler than add new syntax.
        
           | grandempire wrote:
           | Yes. Goto is useful for efficient control flow in many
           | stateful algorithms.
        
             | dwattttt wrote:
             | I don't know that _efficient_ is the best word. If you use
             | goto to force a particular control flow rather than a more
             | constrained form of control flow (e.g. if), you make it
             | harder for the optimiser to work; it can only make "as-if"
             | changes, ones that mean the code that executes looks as if
             | it's what you wrote.
             | 
             | The most efficient control flow is one that describes only
             | what your algorithm needs, coupled with an optimiser that
             | can exploit the particular flow you're describing.
             | 
             | Among the many things discovered by the author of
             | https://blog.nelhage.com/post/cpython-tail-call/,
             | Clang/LLVM was able to optimise the standard switch based
             | interpreter loop as if it had been written with computed
             | gotos.
        
               | grandempire wrote:
               | > The most efficient control flow is one that describes
               | only what your algorithm needs,
               | 
               | Yes. I'm not saying goto is just faster in general. But
               | some algorithms are difficult to describe with while and
               | if (bunch of examples in Knuth).
               | 
               | > Clang/LLVM was able to optimise
               | 
               | Because it's implicit, this is the kind of optimization
               | that's easy to silently regress a few weeks before ship
               | by accidentally violating a rule.
               | 
               | I think unrestricted goto is not good. But an alternative
               | is to make the semantics and scope stricter, something
                | like a Common Lisp prog/tagbody block.
        
       | kiitos wrote:
       | vitess is not the answer
        
         | ksec wrote:
          | Maybe you want to explain yourself rather than just make a
          | statement? Vitess is battle-tested and used by a lot of large
          | enterprises.
        
           | gigatexal wrote:
            | Yeah, it's literally the monetization of the DB setup that
            | powered YouTube -- pretty much the best battle testing imo.
        
       | giovannibonetti wrote:
       | The article mentions the struggle to speed up expression
       | computation on MySQL due to its dynamic typing. I wonder if
       | Postgres' more static type system works better in that regard.
        
       | rajandatta wrote:
       | Nice article. Instructive to see the latest trends and direction
       | for high performance interpreters. Thanks for sharing.
        
       | Thaxll wrote:
       | This looks very similar to how "basic" emulators work.
        
       | brianolson wrote:
       | I built a 'slice of function pointers' bytecode interpreter in Go
       | in 2019 for the Algorand VM (Blockchain smart contract stuff) and
       | before that the same pattern in C for a toy JVM around 2005.
       | 
       | It's a good pattern!
       | 
       | The Algorand VM was focused on low overhead running thousands of
       | tiny programs per second. Version 1 had no loops and a 1000
       | instruction limit.
       | 
       | The JVM was focused on low memory, towards possible embedded
       | microcontroller use.
       | 
       | So, 'array of function pointers' is nothing new, but it is a good
       | pattern.
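        | 
        | A bare-bones Go version of the pattern (not the Algorand VM or
        | the toy JVM, just the shape of the idea):
        | 
        |     package main
        | 
        |     import "fmt"
        | 
        |     // state is the machine state; acc stands in for a register.
        |     type state struct{ acc int }
        | 
        |     // ops is the table: each opcode indexes directly into it.
        |     var ops = [...]func(*state, int){
        |         0: func(s *state, arg int) { s.acc += arg }, // ADD
        |         1: func(s *state, arg int) { s.acc *= arg }, // MUL
        |     }
        | 
        |     // instr is one decoded instruction.
        |     type instr struct {
        |         op  byte
        |         arg int
        |     }
        | 
        |     // run is the whole dispatch loop: one indexed load plus one
        |     // indirect call per instruction.
        |     func run(program []instr, s *state) {
        |         for _, in := range program {
        |             ops[in.op](s, in.arg)
        |         }
        |     }
        | 
        |     func main() {
        |         s := &state{}
        |         run([]instr{{0, 2}, {1, 10}}, s)
        |         fmt.Println(s.acc) // (0+2)*10 = 20
        |     }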
        
       | returningfory2 wrote:
        | Really interesting technique. I guess one of the (theoretical?)
        | drawbacks is that the memory usage is not ideal? In a traditional
        | bytecode compiler the result of compilation is a very cache-
        | friendly binary array, whereas here it looks like a vector of
        | pointers to heap-allocated closures.
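        | 
        | To make the contrast concrete, the two representations might look
        | like this in Go (types are illustrative):
        | 
        |     package layout
        | 
        |     // Classic bytecode: one contiguous, cache-friendly byte
        |     // slice.
        |     type bytecodeProgram []byte
        | 
        |     // Closure compilation: a slice of function values, each
        |     // pointing at a separately heap-allocated closure (code
        |     // pointer plus captured variables), so execution follows an
        |     // extra level of indirection and the closures may be
        |     // scattered across the heap.
        |     type closureProgram []func(env *int)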
        
       | medv wrote:
        | This is a very interesting article! For https://expr-lang.org we
        | did our own research and wrote a paper comparing different VM
        | implementations in Go:
       | 
       | - https://expr-lang.org/blog/exploring-virtual-machines-for-em...
        
       ___________________________________________________________________
       (page generated 2025-04-06 23:01 UTC)