[HN Gopher] Compilers for free with weval
       ___________________________________________________________________
        
       Compilers for free with weval
        
       Author : todsacerdoti
       Score  : 153 points
       Date   : 2024-05-19 11:33 UTC (11 hours ago)
        
 (HTM) web link (bernsteinbear.com)
 (TXT) w3m dump (bernsteinbear.com)
        
       | tekknolagi wrote:
       | Author here! Feel free to ask questions or leave comments or
       | whatever.
        
         | burakemir wrote:
         | Thanks for the write-up.
         | 
         | Neil Jones' (free) book about partial evaluation deserves a
         | mention here, since it explains how/why it works in some
         | detail.
         | 
         | Query languages are a very cool application of this stuff. I
         | was planning to use wasm for Mangle implementation (a language
         | extending datalog, with its implementation a library for
         | deductive database), following Tiark Rompf and Nada Amin's "a
         | SQL to C compiler in 500 lines of code"... but of course I only
         | have the interpreter for now and started a Rust rewrite
         | instead.
        
           | quag wrote:
           | The free book on partial evaluations:
           | http://www.itu.dk/people/sestoft/pebook/
        
             | sestep wrote:
             | Is the book enlightening beyond what one can learn from
             | sources like this post and the aforementioned paper by
             | Rompf and Amin? I'm trying to decide whether I should spend
             | the time to dig deeper into it.
        
         | azakai wrote:
         | I understand that by making the bytecode constant a lot of
          | optimizations become possible. But I can't see how you can
         | unroll the interpreter loop past unknown inputs to the program?
         | 
         | That is, if the program is a pure function like power(x, y)
         | (from the top of the article) then I understand if the bytecode
         | is constant and the inputs x, y are constant then you can
         | unroll completely. In fact, in that case, you can just do a
         | full evaluation of the entire program to a constant. But, if x
         | and y are unknown at compile time then you can do very little,
         | I think?
         | 
         | I guess the example at the top of the article is saying that if
         | you know y but not x then partial evaluation can still help a
         | lot. But in a real-world program the first unknown input that
         | causes a significant branch in execution will block progress,
         | won't it? (edit: in particular unrolling has to stop, unless
         | you consider multiple paths)
        
           | tekknolagi wrote:
           | Weval has intrinsics that unroll the loop and specialize on
           | the PC, merging paths with the same PC. So if you pick a good
           | specialization context like PC, the unrolling stops
           | eventually.
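That merge-on-PC behaviour can be sketched as memoized per-PC residual blocks. This is a hand-rolled toy, not weval's actual intrinsics, and the three-op bytecode (`dec`/`jnz`/`ret`) is invented for illustration:

```python
# Invented bytecode: ("dec",) decrements, ("jnz", t) jumps to t if the value
# is nonzero, ("ret",) returns. The program is a count-down loop.
CODE = [("dec",), ("jnz", 0), ("ret",)]

def specialize(code):
    """Build one residual function per PC. The opcode dispatch happens once,
    at specialization time; memoizing on PC means the backward jump at pc=1
    reuses the existing block for pc=0 instead of unrolling forever."""
    blocks = {}
    def block(pc):
        if pc in blocks:
            return blocks[pc]          # merge paths with the same PC
        op = code[pc]                  # static: resolved once per block
        if op[0] == "dec":
            run = lambda x: block(pc + 1)(x - 1)
        elif op[0] == "jnz":
            run = lambda x: block(op[1])(x) if x != 0 else block(pc + 1)(x)
        else:                          # ret
            run = lambda x: x
        blocks[pc] = run
        return run
    return block(0)

countdown = specialize(CODE)
assert countdown(5) == 0
```

The loop in the source program survives as a residual loop (the memoized back-edge to pc=0) rather than being unrolled without bound.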
        
           | n_plus_1_acc wrote:
           | For a given function like `power`, you can unroll the
           | interpreter loop and eliminate branches on the instruction,
           | so the code has less branching overall.
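The dispatch elimination described above can be sketched with a toy stack interpreter. This is a hand-written illustration, not weval's mechanism; the opcode names and the `POWER3` program are invented:

```python
# Toy stack-machine opcodes (invented): PUSH_ARG, PUSH_CONST, MUL, RET.
PUSH_ARG, PUSH_CONST, MUL, RET = range(4)

# Bytecode for power(x, 3) = x * x * x, with y=3 already baked in.
POWER3 = [(PUSH_ARG, 0), (PUSH_ARG, 0), (MUL, None),
          (PUSH_ARG, 0), (MUL, None), (RET, None)]

def interp(code, args):
    stack, pc = [], 0
    while True:                      # dispatch loop: one branch per opcode
        op, imm = code[pc]
        if op == PUSH_ARG:
            stack.append(args[imm])
        elif op == PUSH_CONST:
            stack.append(imm)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == RET:
            return stack.pop()
        pc += 1

# What a partial evaluator can produce once `code` is known: the loop is
# unrolled over the six instructions, every dispatch branch folds away, and
# straight-line code over the still-unknown x remains.
def power3_residual(x):
    return (x * x) * x

assert interp(POWER3, [5]) == power3_residual(5) == 125
```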
        
       | posnet wrote:
       | I will never not be amazed by Futamura Projections.
        
         | metadat wrote:
         | They are pretty remarkable.
         | 
         | https://en.wikipedia.org/wiki/Partial_evaluation#Futamura_pr...
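Concretely, the first projection says: specialize an interpreter to a fixed source program and you get a compiled version of that program. A toy sketch, where `mix` is just closure creation standing in for a real specializer and the "object language" is simply Python expressions over one input:

```python
def interp(source, inp):
    # Trivial object language: `source` is a Python expression over `inp`.
    return eval(source, {"inp": inp})

def mix(program, static):
    """Toy partial evaluator: fix a two-argument program's first argument.
    A real mix would also simplify the residual program, not just close
    over the static data."""
    return lambda dynamic: program(static, dynamic)

# First Futamura projection: mix(interp, source) behaves like compiled source.
compiled = mix(interp, "inp * inp + 1")
assert compiled(6) == 37
```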
        
       | rst wrote:
       | The most elaborate deployed version of this idea is probably
       | PyPy, an alternative Python implementation that works by giving a
       | JIT-compiling partial evaluator a hint-annotated Python
       | interpreter as the program, and the Python code to be executed as
       | its input. Slightly abstruse overview here:
       | https://doc.pypy.org/en/latest/architecture.html
        
         | tekknolagi wrote:
         | I think PyPy moved away from partial evaluation to trace
         | optimization, fwiw
        
         | alserio wrote:
         | Futamura projections should still be the basis for GraalVM
        
           | evacchi wrote:
            | they are for Truffle-based language implementations:
            | https://chrisseaton.com/rubytruffle/pldi17-truffle/pldi17-tr...
        
         | chubot wrote:
         | PyPy used partial evaluation at one point, but now it's "meta-
         | tracing":
         | 
         |  _Why did we Abandon Partial Evaluation?_ -
         | https://www.pypy.org/posts/2018/09/the-first-15-years-of-pyp...
        
           | 082349872349872 wrote:
           | Taking meta-tracing to the next level would reconstruct not
           | just hot loops but even the original CFG, as speculated in:
           | https://news.ycombinator.com/item?id=38778372
           | 
           | Anyone know of non-fictional work in this area?
        
       | cfallin wrote:
       | weval author here (thanks Max for the blog post!). Also AMA!
       | 
       | The talk about weval that Max mentions was at NEU and also CMU;
       | the latter was recorded and is here: https://vimeo.com/940568191
       | 
       | I also plan to write up a blog post on weval in depth, plus its
       | application to SpiderMonkey-on-Wasm, in a few months; it's pretty
       | exciting though, currently getting 4-5x speedups on some
       | benchmarks on a decidedly nontrivial interpreter!
        
       | JonChesterfield wrote:
       | I remember being really excited about the derive-a-compiler-from-
       | a-partial-evaluator premise when I first ran across it. You
       | genuinely do get a compiler from an interpreter plus a partial
       | evaluator. That trick does work.
       | 
       | The gotcha I didn't spot at the time is that a compiler is (I
       | think, could be argued either way) easier to write than a partial
       | evaluator. The "get it for free bit" only really applies if
       | someone else gave you the partial evaluator _and_ if you somehow
        | don't need to spend ages debugging why that partial evaluator
       | isn't behaving like you hoped.
       | 
       | Now that I'm somewhat more beaten down by toolchain dev, I'm
       | starting to think compilers are easier to write than
       | interpreters. It's definitely not a clear win in favour of the
       | interpreter. If you compile to x64, or to C, or to javascript or
       | whatever, you now have a thing you can debug with whatever tools
       | are native to that target. If you run in an interpreter, you get
       | to debug that interpreter running the program, with whatever
       | somewhat ad hoc debugging tools you put into the interpreter
       | yourself.
       | 
       | Getting useful semantic error message out of an interpreter at
       | partial evaluation time (aka the "compile time" of the aggregate
       | tool) is probably solvable but not likely to work out of the box.
       | Partial eval isn't really a phase separation friendly thing.
        
         | cfallin wrote:
         | This is indeed a good point and something I want to write about
         | when I eventually do a blog post on weval.
         | 
         | A few counterpoints that I'd offer (and what led me to still
         | take this approach):
         | 
         | - If the target has sub-par debugging infrastructure, it can be
          | easier to debug an interpreter (which is portable) and then
          | apply the semantics-preserving PE. In particular when targeting
          | Wasm
         | the semantics-preserving PE. In particular when targeting Wasm
         | outside the browser, there is... not really a good debug
         | experience, anywhere, for that. It was way easier to get an
         | interpreter right by developing on native with gdb/rr/whatever,
         | and then separately ensure weval preserves semantics (which I
         | tested with lockstep differential execution).
         | 
         | - Maintenance: if one is going to have an interpreter and a
         | compiler anyway (and one often wants this or needs this e.g. to
         | handle eval()), easier for them both to come from the same
         | source.
         | 
         | - Amortized cost: in the Wasm world we want AOT compilers for
         | many languages eventually; there are interpreter ports with no
         | Wasm backends; developing weval was a one-time cost and we can
         | eventually apply it multiple times.
         | 
         | - If the semantics of the existing interpreter are quite
         | nontrivial, that can push the balance the other way. I designed
         | weval as part of my work on SpiderMonkey; extremely nontrivial
         | interpreter with all sorts of edge cases; replicating that in a
         | from-scratch compiler seemed a far harder path. (It's since
         | been done by someone else and you can find the "wasm32 codegen"
         | patchset in Bugzilla but there are other phasing issues with it
         | from our use-case's PoV; it's not true AOT, it requires codegen
         | at runtime.)
         | 
         | I don't think the tradeoff is always clear and if one is
         | building a language from scratch, and targeting a simple ISA,
         | by all means write a direct compiler! But other interesting
         | use-cases do exist.
        
           | JonChesterfield wrote:
           | The divergent semantics risk between the interpreter and the
           | compiler is a really big deal. It's genuinely difficult to
           | get a language implementation to behave exactly as specified,
           | even when the spec is do the same as some other
           | implementation. Treating "compiled code" as specialising the
           | interpreter with respect to the program is a great solution
           | to that, since the bugs in the optimiser/partial-evaluator
           | (they're kind of the same thing) are unlikely to be of the
           | same class as bugs from independent implementations.
           | 
           | Wasm is a really solid target for heroic compiler
           | optimisations. It's relatively precisely specified, user
           | facing semantic diagnostics are in some language front end
           | out of sight, aliasing is limited and syscalls are finite
           | with known semantics. Pretty much because it was designed by
           | compiler people. You've picked a good target for this
           | technique.
        
         | titzer wrote:
         | > The gotcha I didn't spot at the time is that a compiler is (I
         | think, could be argued either way) easier to write than a
         | partial evaluator.
         | 
         | I agree that this is sometimes true, but there are several,
         | IMHO bigger, issues:
         | 
         | 1. The partial evaluation of the interpreter might degenerate
         | back to getting a copy of the interpreter loop (or multiple!)
         | and not achieve any speedup, just memory overhead
         | 
         | 2. The partial evaluator might need additional annotations on
         | what is and is not mutable; it might need a lot of help in
          | order to constant-fold away a key piece of data. Getting that
          | tuning wrong can result in #1.
         | 
         | 3. Partially evaluating the interpreter is a _pretty slow_
         | compiler. You need to do the second Futamura projection (apply
         | the partial evaluator to an interpreter loop without the user
          | code) in order to get a fast compiler. That means the partial
          | evaluator needs to be expressed in the same IR as its own
          | input language.
         | 
         | That said, I've chatted a bit with Chris offline about this,
         | and I think this work is really cool and promising. I might
         | tinker with this a little in Wizard too.
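Point 3 (the second projection) can be made concrete with a toy, self-applicable specializer. Here `mix` is a mere closure factory, so self-application is trivial; in a real system this staging is exactly what turns "specialize the interpreter on every run" into a standalone, fast compiler:

```python
def interp(source, inp):
    # Trivial object language: `source` is a Python expression over `inp`.
    return eval(source, {"inp": inp})

def mix(program, static):
    # Toy specializer: fix a two-argument program's first argument.
    return lambda dynamic: program(static, dynamic)

# First projection: a compiled program (interp specialized to one source).
compiled = mix(interp, "inp + 1")

# Second projection: a compiler (mix specialized to interp). Note the
# self-application: mix must accept programs written in the language it
# itself is written in, which is the "same IR" requirement above.
compiler = mix(mix, interp)
assert compiler("inp + 1")(41) == compiled(41) == 42
```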
        
           | JonChesterfield wrote:
            | That is better expressed than my rambling! The Futamura
           | projection is satisfied by passing the "program" directly to
           | the interpreter without optimising either, which is sometimes
           | useful as a packaging exercise and does look somewhat like a
           | non-optimising compiler.
           | 
           | If you want this trickery to make a useful compiler, the
           | partial evaluator picks up global value numbering, dead code
            | elimination, loop transforms: an effective partial
            | evaluator is very much the optimisation passes of a compiler.
           | It can still be a win if someone else wrote said partial
           | evaluator.
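A micro example of that point: constant folding plus one algebraic identity, phrased as a partial evaluator over an invented expression AST. With the static environment y=1, the residual of x*y + (y+2) is just x + 3:

```python
# Expression AST: ("var", name) | ("const", n) | ("add", l, r) | ("mul", l, r)
def pe(e, env):
    """Fold everything the static env pins down; leave the rest residual."""
    tag = e[0]
    if tag == "var":
        return ("const", env[e[1]]) if e[1] in env else e
    if tag == "const":
        return e
    l, r = pe(e[1], env), pe(e[2], env)
    if l[0] == "const" and r[0] == "const":          # constant folding
        return ("const", l[1] + r[1] if tag == "add" else l[1] * r[1])
    if tag == "mul" and ("const", 1) in (l, r):      # x * 1 == x
        return r if l == ("const", 1) else l
    return (tag, l, r)

# x*y + (y+2) with y static and x dynamic:
expr = ("add", ("mul", ("var", "x"), ("var", "y")),
               ("add", ("var", "y"), ("const", 2)))
assert pe(expr, {"y": 1}) == ("add", ("var", "x"), ("const", 3))
```

The simplifications doing the real work here are exactly the classic compiler optimisations; the "partial evaluation" framing only decides which subterms count as static.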
           | 
           | Point 2 is the usual "my compiler is not sufficiently smart,
           | I must annotate the program" problem, with the slight twist
           | that you're annotating the interpreter in the hope that it
           | does useful things with the end program. Interacts with
           | hoping someone else built the dependency well.
           | 
           | And yeah, generated compilers in this fashion have a
           | reputation for being slow, and for not being great optimising
           | compilers, where self application might dig oneself out of
           | the hole. Very like adding optimisations to your compiler to
           | make your compiler faster when building your compiler.
           | 
           | All in all the compiler for free tagline (not specifically
           | this post, it's written on a lot of futamura references)
           | feels a bit like an inside joke. It reminds me of the sad
           | discovery that a metacircular interpreter can't interpret
           | itself after all since what you've written is a heavily
           | obfuscated infinite loop.
        
       | teo_zero wrote:
       | If the wasi+weval version is better than the compiled one,
        | doesn't it mean that the compiler did a bad job of optimizing
       | out the constant parts?
       | 
       | To be fair, in order for weval to do its magic, the code has to
       | be changed and annotated. I'm wondering if a similar set of
       | annotations could convince clang to generate a binary that runs
       | in less than 40 ms.
        
         | foota wrote:
         | Clang doesn't know about the static code, so it can't do the
         | optimizations this does.
         | 
         | You could probably do the same thing with some hacky C++ code
         | gen and templates, but I'm not sure clang et al would be able
         | to optimize them.
        
           | tekknolagi wrote:
           | In the limit, that's called Copy and Patch!
        
       | abeppu wrote:
        | This is cool, but the post swaps in a tiny interpreter because it
        | says
       | that CPython and SpiderMonkey are both huge. It then goes on to
       | talk about modifying the interpreter to make some specific info
       | visible to weval, and I can imagine that those changes would be
       | considerable for CPython or SpiderMonkey. But if we ignore those
       | specialization sections, and just consider the material up
       | through the "Enter weval" section ... could weval have handled
       | CPython or SpiderMonkey (or some other real interpreter that
       | people use regularly)? Or does the partial evaluation have some
       | scaling behavior that explodes or bogs down with large or complex
       | interpreters? I.e. was the tiny interpreter needed just to make
       | the following sections compact, or does even the basic use of
       | weval require a small input?
        
         | tekknolagi wrote:
          | You should wait for Chris's upcoming post :) The minimal
          | changes are the same as the ones I have for my tiny
          | interpreter. There are some more advanced changes you can make
          | for more performance.
        
       ___________________________________________________________________
       (page generated 2024-05-19 23:00 UTC)