[HN Gopher] Faster interpreters in Go: Catching up with C++
___________________________________________________________________
Faster interpreters in Go: Catching up with C++
Author : ksec
Score : 227 points
Date : 2025-04-05 17:59 UTC (1 day ago)
(HTM) web link (planetscale.com)
(TXT) w3m dump (planetscale.com)
| hinkley wrote:
| Compiling queries is one of those things that is both Make it
| Right and Make it Fast.
|
| Because bind variables and Prepared Statements go hand in hand,
| you want to do everything you can to coax your users into doing
| the right thing. If each unique query has to be compiled before
| first use, that's an extra inducement to using prepared
| statements properly.
| rockwotj wrote:
| The resulting VM implementation reminds me of this post from
| cloudflare where they similarly use closures to build their
| interpreter.
|
| https://blog.cloudflare.com/building-fast-interpreters-in-ru...
| cosmos0072 wrote:
| Compiling an expression to a tree of closures, and a list of
| statements to a slice of closures, is exactly how I optimized
| [gomacro](https://github.com/cosmos72/gomacro) my Go interpreter
| written in go.
|
| There are more tricks available there, such as unrolling the
| loop that calls the list of closures, and having a `nop`
| closure that is executed when there's nothing to run but
| execution is not yet at the end of the unrolled loop.
| ALLTaken wrote:
| Impressive! I would love to learn how to implement that in
| Julia. Could you help me understand how you did that?
|
| I'd love to see if it's possible to create a libc-free,
| dependency-free executable without Nim (https://nim-lang.org/).
| cosmos0072 wrote:
| The core idea is simple: do a type analysis on each
| expression you want to "compile" to a closure, and
| instantiate the correct closure for each type combination.
|
| Here is a pseudocode example, adapted from gomacro sources:
|
| https://gist.github.com/cosmos72/f971c172e71d08030f92a1fc5fa.
| ..
|
| This works best for "compiling" statically typed languages,
| and while much faster than an AST interpreter, the "tree of
| closures" above is still ~10 times slower than natively
| compiled code. And it's usually also slower than JIT-compiled
| code.
| stevekemp wrote:
| I put together this example after reading the article:
|
| https://github.com/skx/simple-vm
|
| Simpler than the full gomacro codebase, but perhaps helpful.
| cosmos0072 wrote:
| For optimal speed, you should move as much code as possible
| _outside_ the closures.
|
| In particular, you should do the `switch op` at
| https://github.com/skx/simple-
| vm/blob/b3917aef0bd6c4178eed0c... outside the closure, and
| create a different, specialised closure for each case.
| Otherwise the "fast interpreter" may be almost as slow as a
| vanilla AST walker.
| MatthiasPortzel wrote:
| It's crazy that this post seems to have stumbled across an
| equivalent to the Copy-and-Patch technique[0] used to create a
| Lua interpreter faster than LuaJIT[1].
|
| [0]: https://sillycross.github.io/2023/05/12/2023-05-12/ [1]:
| https://sillycross.github.io/2022/11/22/2022-11-22/
|
| The major difference is that LuaJIT Remake's Copy-and-Patch
| requires "manually" copying blocks of assembly code and patching
| values, while this post relies on the Go compiler's closures to
| create copies of the functions with runtime-known values.
|
| I think there's fascinating progress being made in this
| area--I think in the future this technique (in some form) will
| be the go-to way to create new interpreted languages, and AST
| interpreters, switch-based bytecode VMs, and JIT compilation
| will be things of the past.
| deagle50 wrote:
| Faster than LuaJIT's interpreter, not the compiled code.
| bob1029 wrote:
| > switch-based bytecode VMs
|
| I am finding a switch-based bytecode interpreter to be _very_
| expedient on my computer. It seems that if the # of cases is
| kept small enough, your chances of getting a good branch
| prediction go up substantially. Something like a brainfuck
| interpreter runs extremely fast. In the worst case of randomly
| guessing, you are still going to time travel with a 12.5%
| success rate.
| josefx wrote:
| > In the worst case of randomly guessing, you are still going
| to time travel with a 12.5% success rate.
|
| Random guessing clearly isn't the worst case then, a bad
| prediction can miss every single time.
| frumplestlatz wrote:
| The JVM has had a template interpreter since the mid-90s; it's
| nothing new, and template interpreters are only performant
| enough to provide acceptable execution speed until you JIT.
|
| Template interpreters are not a substitute for real JIT -- JIT
| compilation isn't going anywhere.
| cogman10 wrote:
| My understanding of most optimizing compilers is that this is
| an extremely common "last step" sort of optimization. A lot
| of the optimizing work is beating the code into a canonical
| form where these sorts of templates can be readily applied.
|
| It was also my understanding that this is the point of
| "super optimizer"s [1], which look for these common patterns
| in something like LLVM IR to generate optimization targets
| for the mainline optimizer.
|
| [1] https://en.wikipedia.org/wiki/Superoptimization
| rockwotj wrote:
| It's not really copy-and-patch. The whole point of copy-and-
| patch is that you can inline the compiler's output: it's a
| fast baseline interpreter because the individual builtin
| functions are optimized (via the compiler output you're
| copying from) and inlined (which is why you need to patch, to
| update which registers are being used). In this model you JIT
| only the control flow, really, then inline the implementation
| of each bytecode operation (in contrast to Sparkplug
| [https://v8.dev/blog/sparkplug], which just calls a builtin
| instead of copying/patching). It's still JIT, which is vastly
| different from an interpreter.
|
| > JIT will be things of the past
|
| Sorry, no. JIT is not going anywhere. The article mentions
| that JIT would give better performance, just more effort than
| they wanted to put in (a good tradeoff!). JIT powers the
| Java, Wasm, and JavaScript VMs, and is certainly the way to
| get the fastest code, because you can give the CPU code whose
| next instruction it can predict much better. With
| interpreters you're often limited by the indirection of loads
| when looking up the next function to call, and generating
| code for the control flow outside calling your "builtins" is
| precisely what Sparkplug is doing.
|
| At the end of the day, like most of engineering, choose the
| right tool for the job, which in this case is simplicity (which
| is often the right choice!), but that doesn't mean it's always
| the right choice. For example if browsers did this then
| Javascript performance would tank compared to what we get
| today.
| cryptonector wrote:
| > It's crazy that this post seems to have stumbled across an
| equivalent to the Copy-and-Patch technique[0] used to create a
| Lua interpreter faster than LuaJit[1]
|
| > this post relies on the Go compiler's closures to create
| copies of the functions with runtime-known values
|
| To be clear, the technique of using closures like this is
| ancient in the world of LISP. You can see it in Paul Graham's
| books on LISP from the 90s, in LiSP in Small Pieces, and in
| many interpreters of 80s/90s vintage. I would say that it is
| quite standard.
| politician wrote:
| It would be ideal if Go would add support for computed goto, so
| that we could build direct threaded interpreters.
| zabzonk wrote:
| I don't think "ideal" would be the exact word - I used to
| shudder at it in FORTRAN.
| kiitos wrote:
| hard no - goto is a feature for code generators, not anything
| meant for human use
| greenavocado wrote:
| I never understood this argument. Without RAII, you can
| easily get screwed by resource leaks when returning early
| unless you use goto. In this regard, using goto is expedient.
| How do C programmers avoid this problem without goto?
| bool function_with_cleanup(void) {
|     int *buffer1 = NULL;
|     int *buffer2 = NULL;
|     FILE *file = NULL;
|     bool success = false;
|
|     // Allocate first resource
|     buffer1 = malloc(sizeof(int) * 100);
|     if (!buffer1) {
|         goto cleanup;  // Error, jump to cleanup
|     }
|
|     // Allocate second resource
|     buffer2 = malloc(sizeof(int) * 200);
|     if (!buffer2) {
|         goto cleanup;  // Error, jump to cleanup
|     }
|
|     // Open a file
|     file = fopen("data.txt", "r");
|     if (!file) {
|         goto cleanup;  // Error, jump to cleanup
|     }
|
|     // Do work with all resources...
|     success = true;  // Only set to true if everything succeeded
|
| cleanup:
|     // Free resources in reverse order of acquisition
|     if (file) fclose(file);
|     free(buffer2);  // free() is safe on NULL pointers
|     free(buffer1);
|     return success;
| }
| dwattttt wrote:
| Open a scope when you check resource acquisition passed,
| rather than the opposite (jump to the end of the function
| if it failed).
|
| It can get quite hilly, which doesn't look great. It does
| have the advantage that each resource is explicitly only
| valid in a visible scope, and there's a marker at the end
| to denote the valid region of the resource is ending.
|
| EDIT: you mentioned early return, this style forbids early
| return (at least, any early return after the first resource
| acquisition)
| gigatexal wrote:
| In this example, couldn't the goto cleanup instead be a
| return cleanup_func() where the same cleanup code was
| executed?
| pjmlp wrote:
| Maybe that is exactly the problem: stop using a language
| designed in the 1970s that purposely ignored the ecosystem
| outside Bell Labs, except where it is unavoidable.
|
| And in such case, nothing in the C compiler limits writing
| more functions and better modularizing their implementations.
|
| bool function_with_cleanup(void) {
|     int *buffer1 = NULL;
|     int *buffer2 = NULL;
|     FILE *file = NULL;
|
|     // Allocate first resource
|     buffer1 = malloc(sizeof(int) * 100);
|     if (!buffer1) {
|         return false;
|     }
|
|     // Allocate second resource
|     buffer2 = malloc(sizeof(int) * 200);
|     if (!buffer2) {
|         free(buffer1);
|         return false;
|     }
|
|     // Open a file
|     file = fopen("data.txt", "r");
|     if (!file) {
|         free(buffer2);
|         free(buffer1);
|         return false;
|     }
|
|     // Do work with all resources...
|     fclose(file);
|     free(buffer2);
|     free(buffer1);
|     return true;
| }
|
| Ah, but all those _free()_ calls get tedious, can be
| forgotten, and mistyped:
|
| bool function_with_cleanup(void) {
|     int *buffer1 = NULL;
|     int *buffer2 = NULL;
|     FILE *file = NULL;
|
|     // Allocate first resource
|     buffer1 = arena_alloc(&current_arena, sizeof(int) * 100);
|     if (!buffer1) {
|         return false;
|     }
|
|     // Allocate second resource
|     buffer2 = arena_alloc(&current_arena, sizeof(int) * 200);
|     if (!buffer2) {
|         arena_reset(&current_arena);
|         return false;
|     }
|
|     // Open a file
|     file = fopen("data.txt", "r");
|     if (!file) {
|         arena_reset(&current_arena);
|         return false;
|     }
|
|     // Do work with all resources...
|     fclose(file);
|     arena_reset(&current_arena);
|     return true;
| }
|
| Can still be improved with a mix of macros and varargs
| functions.
|
| Or if using language extensions is a thing, the various
| ways to do _defer_ in C.
| LtWorf wrote:
| You understand go doesn't have exceptions right?
| pjmlp wrote:
| You understand that I wrote C code, and that as far as Go is
| concerned, panic/recover are exceptions that don't want to
| admit themselves as such?
| LtWorf wrote:
| You can't handle a panic, it's the whole point.
| shakna wrote:
| Attributes, mostly. Which have become so common that defer
| is very likely to be in the next C standard. [0]
| static void free_int_buf(int **p) { free(*p); }
| static void close_file(FILE **f) { if (*f) fclose(*f); }
|
| bool function_with_cleanup(void) {
|     // Allocate first resource (freed automatically on scope exit;
|     // the cleanup handler receives a pointer to the variable)
|     int *buffer1 __attribute__((__cleanup__(free_int_buf))) =
|         malloc(sizeof(int) * 100);
|     if (!buffer1) { return false; }
|
|     // Allocate second resource
|     int *buffer2 __attribute__((__cleanup__(free_int_buf))) =
|         malloc(sizeof(int) * 200);
|     if (!buffer2) { return false; }
|
|     // Open a file (closed automatically on scope exit)
|     FILE *file __attribute__((__cleanup__(close_file))) =
|         fopen("data.txt", "r");
|     if (!file) { return false; }
|
|     return true;
| }
|
| [0] https://thephd.dev/c2y-the-defer-technical-
| specification-its...
| mkl wrote:
| Nested ifs are my preference:
|
| bool function_with_cleanup(void) {
|     int *buffer1 = NULL;
|     int *buffer2 = NULL;
|     FILE *file = NULL;
|     bool success = false;
|
|     // Allocate first resource
|     buffer1 = malloc(sizeof(int) * 100);
|     if (buffer1) {
|         // Allocate second resource
|         buffer2 = malloc(sizeof(int) * 200);
|         if (buffer2) {
|             // Open a file
|             file = fopen("data.txt", "r");
|             if (file) {
|                 // Do work with all resources...
|                 fclose(file);
|                 success = true;  // Only set to true if everything succeeded
|             }
|             free(buffer2);
|         }
|         free(buffer1);
|     }
|     return success;
| }
|
| Much shorter and more straightforward.
|
| One-time loops with break also work if you're not doing the
| resource allocation in another loop:
|
| bool function_with_cleanup(void) {
|     int *buffer1 = NULL;
|     int *buffer2 = NULL;
|     FILE *file = NULL;
|     bool success = false;
|
|     do {  // One-time loop to break out of on error
|         // Allocate first resource
|         buffer1 = malloc(sizeof(int) * 100);
|         if (!buffer1) {
|             break;  // Error, jump to cleanup
|         }
|
|         // Allocate second resource
|         buffer2 = malloc(sizeof(int) * 200);
|         if (!buffer2) {
|             break;  // Error, jump to cleanup
|         }
|
|         // Open a file
|         file = fopen("data.txt", "r");
|         if (!file) {
|             break;  // Error, jump to cleanup
|         }
|
|         // Do work with all resources...
|         success = true;  // Only set to true if everything succeeded
|     } while (false);
|
|     // Free resources in reverse order of acquisition
|     if (file) fclose(file);
|     free(buffer2);  // free() is safe on NULL pointers
|     free(buffer1);
|     return success;
| }
|
| Still simpler to follow than goto IMHO. Both these patterns
| work in other languages without goto too, e.g. Python.
| gpderetta wrote:
| Nested if, aside from being awful, doesn't scale.
|
| And break is extremely thin sugar on top of go-to.
| jandrewrogers wrote:
| There are situations in practical code where goto is the only
| reasonable choice if you want to avoid spaghetti code.
| Absolutes have no place in software, you can find scenarios
| where almost every typically bad idea is actually a good
| idea.
|
| It is a tool, not a religion.
| 9rx wrote:
| Even Dijkstra was clear that he was only referring to
| unbridled gotos.
| 9dev wrote:
| Go to goto, go!
| pklausler wrote:
| Did you mean "assigned GOTO", not computed GOTO? Because that's
| just a switch list.
| knome wrote:
| https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html
|
| 'computed goto' is used in GCC to mean the same as assigned
| goto in Fortran. The Fortran variation appears to have more
| restrictions than the GNU C one.
| johnisgood wrote:
| I may be dumb, but what is "goto *ptr;" useful for? Or
| &&foo.
| withinboredom wrote:
| error conditions when you don't have exceptions:
| goto *errOrSuccess;
|
| is a pretty normal one. This basically allows you to
| emulate a "finally".
|
| State machines are another big one, which look much nicer
| with computed goto than a switch statement.
| trealira wrote:
| It's used in VMs, emulators, and other interpreters for
| making a dispatch loop with less branching than a
| standard loop with a switch inside it.
|
| https://eli.thegreenplace.net/2012/07/12/computed-goto-
| for-e...
| nemo1618 wrote:
| Is computed goto used for anything _other_ than interpreter
| loops? Because if not, I would rather have a special "it looks
| like you're trying to implement an interpreter loop" case in
| the compiler than add new syntax.
| grandempire wrote:
| Yes. Goto is useful for efficient control flow in many
| stateful algorithms.
| dwattttt wrote:
| I don't know that _efficient_ is the best word. If you use
| goto to force a particular control flow rather than a more
| constrained form of control flow (e.g. if), you make it
| harder for the optimiser to work; it can only make "as-if"
| changes, ones that mean the code that executes looks as if
| it's what you wrote.
|
| The most efficient control flow is one that describes only
| what your algorithm needs, coupled with an optimiser that
| can exploit the particular flow you're describing.
|
| Among the many things discovered by the author of
| https://blog.nelhage.com/post/cpython-tail-call/,
| Clang/LLVM was able to optimise the standard switch based
| interpreter loop as if it had been written with computed
| gotos.
| grandempire wrote:
| > The most efficient control flow is one that describes
| only what your algorithm needs,
|
| Yes. I'm not saying goto is just faster in general. But
| some algorithms are difficult to describe with while and
| if (bunch of examples in Knuth).
|
| > Clang/LLVM was able to optimise
|
| Because it's implicit, this is the kind of optimization
| that's easy to silently regress a few weeks before ship
| by accidentally violating a rule.
|
| I think unrestricted goto is not good. But an alternative
| is to make the semantics and scope stricter, something
| like a Common Lisp prog/tag body block.
| kiitos wrote:
| vitess is not the answer
| ksec wrote:
| Maybe you want to explain yourself rather than just make a
| statement? Vitess is battle-tested and used by a lot of large
| enterprises.
| gigatexal wrote:
| Yeah, it's literally the monetization of the db setup that
| powered YouTube -- pretty much the best battle testing imo.
| giovannibonetti wrote:
| The article mentions the struggle to speed up expression
| computation on MySQL due to its dynamic typing. I wonder if
| Postgres' more static type system works better in that regard.
| rajandatta wrote:
| Nice article. Instructive to see the latest trends and direction
| for high performance interpreters. Thanks for sharing.
| Thaxll wrote:
| This looks very similar to how "basic" emulators work.
| brianolson wrote:
| I built a 'slice of function pointers' bytecode interpreter in Go
| in 2019 for the Algorand VM (Blockchain smart contract stuff) and
| before that the same pattern in C for a toy JVM around 2005.
|
| It's a good pattern!
|
| The Algorand VM was focused on low overhead running thousands of
| tiny programs per second. Version 1 had no loops and a 1000
| instruction limit.
|
| The JVM was focused on low memory, towards possible embedded
| microcontroller use.
|
| So, 'array of function pointers' is nothing new, but it is a good
| pattern.
| returningfory2 wrote:
| Really interesting technique. I guess one of the (theoretical?)
| drawbacks is that the memory usage is not ideal? In a
| traditional bytecode compiler the result of compilation is a
| very cache-friendly binary array, whereas here it looks like a
| vector of pointers to heap-allocated closures.
| medv wrote:
| This is a very interesting article! At https://expr-lang.org we
| did our own research and wrote a paper comparing different VM
| implementations in Go:
|
| - https://expr-lang.org/blog/exploring-virtual-machines-for-em...
___________________________________________________________________
(page generated 2025-04-06 23:01 UTC)