[HN Gopher] MiniVM: A zero-dependency cross-language runtime on ...
___________________________________________________________________
MiniVM: A zero-dependency cross-language runtime on par with LuaJIT
and C
Author : isaacimagine
Score : 212 points
Date : 2022-01-08 11:05 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| fastkebab wrote:
| rahimiali wrote:
| The main interpreter loop looks a lot like Python's. It uses the
| same preprocessor tricks and jump tables to help both the reader
| and the compiler make sense of the flow. It's also neat that
| integers are unboxed, at the expense of their top two bits
| being reserved as a type hint.
| jng wrote:
| Not only the reader and the compiler: that kind of threaded
| code also really helps performance, since it leverages the CPU's
| branch target buffer to obtain great branch prediction of
| interpreted code, which is key to getting current CPUs to
| perform well.
| celeritascelery wrote:
| Is there any documentation on the bytecode? I would be interested
| to see their design choices.
| isaacimagine wrote:
| Check the code, there's a big list of opcodes with pretty
| descriptive names. The code is effectively the documentation
| for now.
| adontz wrote:
| Does it support any kind of debugging?
| wruza wrote:
| 4984, there is a roadmap, but I couldn't find info on the
| strategic goals of this project or where it's supposed to be the
| best fit. Can you please share your views on this?
|
| Also, is there an API for emitting VM bytecodes? Or an
| instruction set reference?
|
| I'm asking out of pure fan-like interest in this field and
| because I see a lack of simple-enough (for me) tools for language
| building. Don't take me seriously or anything like that, I'm not
| a pro. I tried to dig into LLVM a few years ago, but so much C++
| boilerplate, meh. I know I can just transpile to Lua or JS with
| sourcemaps, but it's zero fun. Projects like this remind me of
| 8080 asm times.
| 4984 wrote:
| There is no roadmap. There are no written long-term goals. This
| project came out of a need for a faster virtual machine for my
| Paka language. There is an API to emit VM bytecode, but it's
| not in C; it is from Paka.
|
| A roadmap could go like this tho: more core opcodes (bit ops,
| closures), splitting overloaded opcodes, more frontends, an
| optimizer, a dynamic compiler, and a JIT.
| childintime wrote:
| This is a nice Forth-style bytecode interpreter, that is easy
| to generate code for. Close to WASM, but much simpler.
|
| The thing that attracts me most is that the bytecode format can
| be easily translated to a more compact format, allowing it to
| serve as an intermediate format for _many_ homegrown
| instruction sets.
|
| So a GCC or LLVM backend needs to be on the roadmap. Code
| generation is a blast. Only the number of registers should be
| user-configurable to make this useful.
| rahimiali wrote:
| In what way is this forth-like? The opcodes take immediates
| and registers as arguments, whereas forth-like would be much
| more stack-based than this, right?
| jng wrote:
| The bytecodes aren't, but the interpreter is; look up
| threaded code.
| b3morales wrote:
| Have a look at https://c9x.me/compile/ and
| https://c9x.me/compile/bib/
| [deleted]
| iainmerrick wrote:
| This looks like a fun project, but claiming that it "beats
| luajit" is very premature based on the benchmarks listed.
|
| The README lists only an allocation stress test (fast allocs are
| good, to be sure) and recursive Fibonacci (always good fun to try
| but doesn't resemble real usage).
|
| Looks like it doesn't have a JIT compiler, and without that I'd
| put a lot of money against it ever being as fast as LuaJIT, V8,
| the JVM, etc etc in the general case.
|
| _Edit to add:_ on reading more closely, there's more benchmark-
| gaming going on here than I realised. On the second benchmark, it
| says:
|
| _As you can see, minivm (no JIT) is a hair slower than Node JS
| (JIT) but beats luajit --joff by a fair margin (no JIT)._
|
| They already noted that Node has a longer startup time, so that
| obviously dominates in a benchmark that's only a few milliseconds
| long; and disabling the JIT in LuaJIT doesn't tell us much.
| isaacimagine wrote:
| Personally, I find it pretty impressive that it performs as
| well as these runtimes despite _not_ having a JIT compiler. I'm
| pretty sure Shaw's written more benchmarks, but as the
| README explains, it's really hard to tell what the performance
| characteristics of a language are without writing a larger
| application. So far the largest applications written with
| MiniVM are Paka[0], a self-hosted language similar to Lua that
| targets MiniVM; os49[1], an operating system built on
| Paka/MiniVM in the spirit of lisp machines; and xori[2], an
| online playground for the language.
|
| _Edit: to address your edit, the Tree benchmark starts
| small, but grows exponentially (the graph is logarithmic). The
| final runs of the benchmark take about 15 seconds to run, and
| are run 10 times per language, at which point startup time is
| no longer a large factor. The fib test compares luajit both
| with the JIT on and off: as MiniVM is not JIT'd, I think
| that this is a fair comparison to make._
|
| [0]: https://github.com/FastVM/paka
|
| [1]: https://github.com/ShawSumma/os49
|
| [2]: https://fastvm.github.io/xori
| iainmerrick wrote:
| Sure, it's a nice piece of work!
|
| I just don't think you should claim it's very fast unless you
| can show that it actually is very fast. The current
| benchmarks look like toy benchmarks. (I don't mean that in a
| derogatory way -- they're exactly the kind of tiny programs
| you need to build and test while bringing up a new VM.)
| mst wrote:
| The disclaimer at the top of the benchmark section -
| https://github.com/FastVM/minivm#benchmarks - seemed pretty
| reasonable to me.
|
| It's very clear that microbenchmarks don't tell the whole
| story, and that whole-application benchmarks giving more
| realistic results will be desirable later.
|
| I mean, I agree these benchmarks don't mean that much
| except for being a useful way to verify that the core code
| is working and reasonably performant - but they're up front
| about that fact so calling it 'benchmark-gaming' seems a
| trifle unfair.
| 4984 wrote:
| MiniVM will JIT eventually. Right now the biggest barrier is
| the VM snapshots. Currently MiniVM has its own stack and heap and
| can snapshot them from any point in the program.
|
| One thing to note about MiniVM is that it has very strong
| types. V8 has many steps to go through when it needs to perform
| something like addition (weak typing). MiniVM supports only 4
| basic types.
|
| MiniVM's reference frontend, Paka, is fully self-hosted; it is
| currently the only large MiniVM program.
| deckarep wrote:
| Question on VM snapshotting: what's the purpose/point in even
| having such an ability? What does it allow you to do?
|
| I only know of snapshotting perhaps being necessary to
| support coroutine based context switching.
|
| Thanks and very cool project!
| slightknack wrote:
| A snapshot is the entire state of a program at a single
| moment in time. Continuations are basically exposed
| snapshots, i.e. taking a snapshot, storing it in a
| variable, doing some work, and then 'calling' the snapshot
| to return to an earlier point. Snapshots let you
| implement a naive version of single-shot delimited
| continuations - coroutines! This can be very useful for
| modeling concurrency.
|
| Aside from coroutines and continuations, snapshots are neat
| for distributed computing: spin up a vm, take a snapshot,
| and replicate it over the network. You could also send
| snapshots of different tasks to other computers to execute.
| In the context of edge computing, you could snapshot the
| program once it's 'warm' to cut back on VM startup time.
|
| Snapshots allow you to peek into your program. Imagine a
| debugger that takes snapshots on breakpoints, lets you
| inspect the stack and heap, and replays the program forward
| from a given point in a deterministic manner.
| also send a snapshot to a friend so they can run an
| application from a given point on their machine. If you do
| snapshots + live reloading there are tons of other things
| you can do (e.g. live patching and replaying of functions
| while debugging).
| samatman wrote:
| That sounds like a solid basis for adding asymmetric
| coroutines as well, is that something you're thinking about
| adding?
|
| I like the philosophy and gumption shown by this project but
| coroutines aren't a feature I'd lightly give up.
|
| Digging what I see so far!
| 4984 wrote:
| MiniVM uses coroutines already: every 1000 branches the VM
| will return to the scheduler. The build system can accept
| CFLAGS+=-DVM_BRANCH_DEFER, which makes MiniVM able to run
| part of the bytecode, up to N instructions.
|
| It is not exposed at the instruction level yet tho. Would be
| quite easy to add.
| samatman wrote:
| Ah thank you, that's very promising! That was the kind of
| thing I wanted to hear: I write Lua every day, and four
| data types didn't appear to support 'thread'.
|
| I might just kick the tires on this over the weekend.
| exikyut wrote:
| It might be interesting to be able to tell the VM to
| resume into code for exactly N instructions or branches.
| I can see instrumentation/debugging tooling using this
| with N=1 for example.
| rgrmrts wrote:
| That's cool. Have you played around with that number? I'm
| curious how you pick that constant and what the trade
| offs are
| exikyut wrote:
| Out of curiosity, are you planning on (progressively, slowly)
| rolling your own JIT, or using something like DynASM
| (https://luajit.org/dynasm_features.html), libFirm
| (https://pp.ipd.kit.edu/firm/Features.html), or some other
| preexisting thing (eg https://github.com/wdv4758h/awesome-
| jit) in the space?
|
| FWIW, I understand that LuaJIT gets some of its insane real-
| world performance from a JIT and VM design that's effectively
| shrink-wrapped around Lua semantics/intrinsics - it's not
| general-purpose. I've read (unfortunately don't remember
| exactly where) that people have tried to run off with the JIT
| and VM and use it in other projects, but never succeeded
| because of the tight coupling.
|
| In the same way, while a bespoke 64/32-bit x86+ARM JIT would
| be a reasonable undertaking, it could make for a pretty
| interesting target with a potentially wide-ranging set of
| uses.
|
| For example, it could be the VM+JIT combination that all
| those people trying to dissect LuaJIT were looking for :).
|
| I could see something like this becoming an attractive option
| for games that want an exceptionally simple runtime. Sort of
| like a scaled-down NekoVM (a la Haxe).
|
| Broadly speaking, I get the (potentially incorrect)
| impression (from a very naive/inexperienced POV) that
| MiniVM+JIT would be looking to close a more widely-scoped,
| higher-level loop out of the box than something like libFirm
| would be (despite the very close correlation in functionality
| when looking at libFirm). So it'd be closer to Cling
| (https://root.cern/cling/) than raw LLVM, perhaps (albeit
| with 0.01% of the code size :D). It is for this reason that I
| kind of pause for a minute and ponder that a fully integrated
| JIT could be a pretty good idea. It would absolutely make the
| irreducible complexity of the project balloon, with
| reasonable motivation likely necessary to maintain cohesion.
|
| On a different note, If I were to backseat-drive for a minute
| :) the first thing I'd rant about is how attractive modern
| JITs need trivial ways to verify code correctness, both
| online (how exactly was *this* specific generated code
| constructed - so, straightforward logging) but also (and
| particularly) offline (humans staring at the JIT source code
| and mentally stepping through its behavior - _and succeeding_
| (miracles!!) because the code is small and comprehensibly
| architected). If the JIT was implemented in such a
| straightforward manner, end users wanting to run potentially
| malicious user-supplied code with high performance in
| potentially security-sensitive settings might be attracted to
| the project. (Mike Pall made bank for a while while
| CloudFlare was using LuaJIT for its WAF*... ahem...)
|
| I came across this reference of how to break out of LuaJIT
| 2.1 (2015) a while back:
| https://www.corsix.org/content/malicious-luajit-bytecode -
| and every time I take a look at the code I switch away from
| the tab :) (and sometimes step away from the computer for a
| minute :D). It's solely a demonstration of "this is how it
| would work", and clarifies that LuaJIT makes no sandbox
| guarantees about the code it executes, but reading through
| it, the amount of Stuff(tm) going on represents a surface
| area that to me (naively) seems just... like LuaJIT as a
| whole is generally too large to easily reason about from a
| security standpoint (oh yeah, besides being written in
| assembly language...). This might be inexperience speaking,
| but I can't help but wonder whether a smaller, simpler
| implementation _might_ be able to implement a secure JIT; for
| all I know this might be an impossible P=NP pipe dream I
| haven't fully grasped yet. I guess what I'm trying to figure
| out is whether "small enough for non-100x programmers to
| mentally reason through" and "large enough to do a few
| practical things quickly" have practical overlap?
|
| [* CloudFlare only ran internally-generated code through
| LuaJIT, for completeness. A VM+JIT that could safely run
| untrusted code would thus be even _more_ interesting than
| LuaJIT was to CloudFlare, from a general perspective. ]
|
| ---
|
| Something completely different that I randomly discovered
| recently and which I thought I'd idly mention is that JITs
| _might_ seem to have a bit of a hard time on Android in
| certain obscure circumstances. I can't (yet) tell if this is
| LuaJIT-specific or "anything that's a JIT"-specific: KOReader
| (Android eBook reader, implemented entirely using LuaJIT -
| https://github.com/koreader/android-luajit-launcher) has a
| bunch of very scary magic Things(tm) it seems to need to do
| to make LuaJIT even work at all on Android
| (https://github.com/koreader/android-luajit-
| launcher/blob/mas...), due to an apparently-current issue
| causing issues across different domains
| (https://github.com/LuaJIT/LuaJIT/issues/285), which has been
| apparently cropping up going back years
| (https://www.freelists.org/post/luajit/Android-performance-
| dr... (2013)). KOReader has even nontrivially patched
| LuaJIT's C code in places
| (https://github.com/koreader/android-luajit-
| launcher/blob/bb0...) with purposes I am yet to fully
| understand (it might just be for debugging). I happened to be
| considering idly playing around with Lua on Android
| (currently fascinated with interpreted/JITed runtimes) and
| after stumbling on this I'm debating whether to use Lua
| instead, ha. I've been meaning to ask around on the LuaJIT
| list and wherever KOReader discusses stuff to learn more,
| after spending the time actually getting LuaJIT linked into
| an Android project and poking around. Haven't got to it yet.
| This could be an absolutely massive red herring that I'm
| going deer-in-headlights about because it just _looks_ off-
| in-the-weeds, or potentially significant. It might also be
| absolutely irrelevant to not-LuaJIT pursuits, but as I noted
| I'm not (yet) experienced enough to tell.
| CalChris wrote:
| Fibonacci is the poster child of hotspots. A JIT should be
| crushing an interpreter at Fibonacci as it is simple to
| generate good code for. Yes, it's just a microbenchmark, so you
| shouldn't make SPEC-type conclusions from it. But you shouldn't
| dismiss it either.
| steeleduncan wrote:
| The Readme mentions that they've experimented with a Lua
| frontend. I can't find any reference to this in the repo. Is this
| a private experiment, or is it available somewhere?
| kaba0 wrote:
| Wouldn't a templated interpreter be perhaps faster (eg. what is
| used by the JVM)?
| eps wrote:
| Anything beats something on some benchmarks. That's not a very
| captivating pitch point.
|
| Good stuff otherwise. Very nice project.
| isaacimagine wrote:
| Thanks for pointing this out, I'll adjust the title to make it
| more accurate. I think the benchmarks show that the runtime is
| at least on par with existing popular runtimes.
|
| Some other cool pitch points I've learned while chatting with
| Shaw:
|
| - MiniVM supports saving and loading compressed
| snapshots/images, like Smalltalk
|
| - JIT is in the works (it had a JIT in the past, which was
| removed in a refactor)
|
| - Zero dependencies, compiles to Wasm with cosmo libc.
|
| - VM supports multiple languages and sharing libraries between
| languages
|
| - The main language that targets MiniVM is self-hosted on
| MiniVM; there's a minimal OS built on MiniVM.
|
| - Fast startup times make it good for embedded scripting, like
| Lua.
|
| - The ISA itself is also really interesting; would
| recommend reading the C if that's your thing.
| 4984 wrote:
| Thank you for having a look.
|
| The reason MiniVM is so benchmark-oriented in its presentation
| is that MiniVM is very benchmark-oriented in my workflow.
|
| I have tried to find a faster interpreter without JIT. It is
| hard to find benchmarks that don't just say that MiniVM is
| 2-10x faster than the interpreted languages.
| iainmerrick wrote:
| Yeah, it's a bit disappointing how slow most non-JIT
| interpreters are -- I think JIT understandably took the wind
| out of their sails.
|
| The state of the art as I understand it (but maybe you've
| figured out something faster?) is threaded interpreters. I
| suggest looking at Forth implementations, which is where I
| think most interpreter innovation has happened.
| MaxBarraclough wrote:
| > I suggest looking at Forth implementations, which is
| where I think most interpreter innovation has happened.
|
| Fortunately the Forth community write about their
| techniques, so you don't have to just eyeball gforth's
| source-code. A handful of links for those interested:
|
| https://www.complang.tuwien.ac.at/papers/
|
| https://github.com/ForthPapersMirror/Papers
|
| https://www.complang.tuwien.ac.at/projects/forth.html
|
| https://www.complang.tuwien.ac.at/anton/euroforth/
|
| http://soton.mpeforth.com/flag/jfar/vol7.html
|
| http://soton.mpeforth.com/flag/fd/
|
| http://www.forth.org/literature.html
|
| http://www.forth.org/fd/contents.html
| isaacimagine wrote:
| So I was talking about threaded code with Shaw the other
| day, and he gave me a long explanation about the various
| tradeoffs being made and the resulting techniques used in
| MiniVM. As I understand it, MiniVM is basically threaded
| code, with cached opcodes. Essentially, it uses
| _precomputed_ gotos:
|
|         // classical computed goto
|         goto *x[y[z]]
|
|         // MiniVM precomputed goto
|         goto *a[z]
|
| This optimization makes MiniVM a tad faster than raw
| computed gotos, because it's essentially threaded code with
| some optimizations. He can probably expand on this; if you
| want to hear the full really interesting explanation
| yourself, he's pretty active in the Paka Discord[0].
|
| [0]: https://discord.gg/UyvxuC5W5q
| aardvark179 wrote:
| That feels like the sort of optimisation that wins in
| small cases and loses in larger ones. The jump array is
| going to be 4 or 8 times the size of the opcode array so
| will put a lot more pressure on the caches as your
| programs get larger.
| sitkack wrote:
| Sounds like a great test case for using a tensorflow-lite
| model to switch between both techniques.
| kgeist wrote:
| > Built on a register-based ISA that beats luajit -- with the
| JIT on -- in some benchmarks
|
| This is interesting, considering that LuaJIT in interpreter mode
| is famous for being faster than simple JITs. Worth looking into
| what the differences are.
| isaacimagine wrote:
| From a discussion with Shaw:
|
| - Simpler more consistent type system
|
| - MiniVM has more, larger instructions to take advantage of
|
| - MiniVM precomputes and caches jumps where possible
|
| - Calling functions is more efficient, less overhead
|
| - Has a custom GC/allocator which allows for some optimizations
|
| - Fixed-sized arrays vs Lua's dynamic tables
|
| - Not a JIT, so faster startup times (no tracing)
| thayne wrote:
| Looks like an interesting project, but I can't find any
| documentation on how to actually use it (or on the paka language
| either). Is it just still in early stages and documentation
| hasn't been written yet?
| slightknack wrote:
| I've written some documentation for it in the past, but had to
| scrap it because things tend to move so quickly that it's
| better to read the code and ask Shaw than use incorrect
| documentation. As things stabilize, these issues will be
| resolved, but it's very bleeding-edge right now.
| sitkack wrote:
| How does this compare to the techniques used in wasm3 (the
| name, not a description)?
|
| https://github.com/wasm3/wasm3
|
| architecture of M3 the interpreter core of wasm3
| https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md
|
| https://github.com/wasm3/wasm3/blob/main/docs/Performance.md
| 4984 wrote:
| I have looked into Meta Machines. They are not what I want.
|
| Meta Machines can be almost as fast, but I just prefer the
| cache locality of local functions.
|
| Wasm3 is slower than MiniVM when you don't count the GC. Wasm3
| is very cool!
| tonyg wrote:
| Hard to tell whether it supports proper tail calls or not. The
| readme mentions Scheme, so it's not impossible I suppose! One of
| the nice things about wasm is the formal semantics; a project
| like this could benefit from one of those, too.
| 4984 wrote:
| Author here. Tail calls can be implemented in MiniVM bytecode.
| There was once an instruction that performed a tail call; it
| bit-rotted due to not being used.
|
| MiniVM is moving very quickly currently. A lot still needs to
| be done, like JIT and optimizers.
|
| Eventually a formal specification of MiniVM should happen, just
| not yet as things are changing too fast.
| user-the-name wrote:
| Speaking of wasm, any idea how feasible it would be to
| convert wasm to minivm bytecode and use it as a wasm runner?
| ford_o wrote:
| How does this compare to MIR?
| mst wrote:
| This example seems a bit pessimised:
|
|         head_of_loop:
|           load_var x
|           push_int 1000
|           less_than
|           jump_if_false :end_of_loop
|           load_var x
|           increment
|           store_var x
|         end_of_loop:
|
| I'd more expect a stack engine to be something like -
|
|           load_var x
|         head_of_loop:
|           jump_lt_keep 1000 :end_of_loop
|           increment
|         end_of_loop:
|           store_var x
|
| in the sense that anything that can figure out how to populate
| and use r0 would likely also be able to figure out it can wrap
| the load/store calls around the outside of the loop.
|
| (I'm far from an expert on stack versus register VMs though so I
| may be missing several somethings here)
| tln wrote:
| That would be a nice optimization. Here's what Python does
| 0 LOAD_FAST 0 (x) 2 LOAD_CONST
| 1 (1000) 4 COMPARE_OP 0 (<)
| 6 POP_JUMP_IF_FALSE 18 8 LOAD_FAST
| 0 (x) 10 LOAD_CONST 2 (1)
| 12 BINARY_ADD 14 STORE_FAST 0
| (x) 16 JUMP_ABSOLUTE 0
| 18 LOAD_CONST 0 (None)
|
| Python's VM is very slow, but note only one arg per op and the
| compare + jump are not fused.
|
| BTW, if we're considering "advanced" optimization we might as
| well elide the whole loop :) e.g. https://godbolt.org/z/n3GznqKsM
|
|         test(int):
|                 push    rbp
|                 mov     rbp, rsp
|                 mov     DWORD PTR [rbp-4], edi
|                 jmp     .L2
|         .L3:
|                 add     DWORD PTR [rbp-4], 1
|         .L2:
|                 cmp     DWORD PTR [rbp-4], 999
|                 jle     .L3
|                 nop
|                 pop     rbp
|                 ret
| mst wrote:
| Well, sure, but I was thinking along the lines of "reserving
| a spot on the stack and reserving a register seem like about
| the same level of clever", especially thinking about how
| mixed register/stack argument passing works in even
| relatively stupid C compilers.
|
| Eliding the entire loop seems relatively similarly clever for
| either architecture, but would've counted as cheating in the
| frame I was pondering from.
|
| Still neat to see even if x86 asm continues to make my eyes
| cross ;)
| djwatson24 wrote:
| Binary trees is a classic GC benchmark - ideally a GC'd language
| should be able to do _better_ than a malloc/free implementation
| in C (tree.c), since it can release the whole tree in bulk (like
| ptree.c does).
|
| Also note that tree.c speeds up substantially if you link with a
| better malloc like jemalloc or tcmalloc.
___________________________________________________________________
(page generated 2022-01-08 23:00 UTC)