[HN Gopher] Beating the Compiler
       ___________________________________________________________________
        
       Beating the Compiler
        
       Author : mkeeter
       Score  : 181 points
        Date   : 2024-07-12 18:54 UTC (1 day ago)
        
 (HTM) web link (www.mattkeeter.com)
 (TXT) w3m dump (www.mattkeeter.com)
        
       | 0x3444ac53 wrote:
       | I was overjoyed to open this and find a reference to 100r and UXN
        
       | gumby wrote:
        | IMHO the best way to think of it (well, it's how _I_ have long
        | thought of it) is as two lemmas:
       | 
       | 1 - The compiler has more "fingers" than a human does. Back when
        | I wrote programs in assembly I would use a printout to keep track
        | of what I was doing and for debugging, and so I would often put a
       | finger on the paper to mark where a jump was and then go to the
       | destination to see if it matched up, etc. This process isn't at
       | all scalable, so the compiler is inevitably going to do better
       | than I ever could on anything more than something trivial.
       | 
        | 2 - I know more about the program's constraints than I could ever
        | express (much less be bothered expressing) in an HLL. "All the
       | values will be less than 16" or "I can use this memory location
       | for something else after I've read from it" or who knows what. So
       | sometimes I can chop stuff out or even rewrite the compiler
       | output locally to speed it up.
       | 
        | Also, I can only do this for a few architectures... and with
        | modern x86 there are so many variants with all sorts of
        | microoptimization affordances that I can rarely beat the
        | compiler (this is a corollary of lemma 1).
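        | 
        | A small Rust illustration of the kind of thing I mean by lemma 2
        | (hypothetical code): when you _can_ state the constraint in the
        | code itself, the compiler gets to act on it:
        | 
            // Masking proves to the compiler that `bucket` < 16, so the
            // bounds check on `counts[bucket]` can be optimized away.
            fn histogram(data: &[u8], counts: &mut [u32; 16]) {
                for &d in data {
                    let bucket = (d & 15) as usize;
                    counts[bucket] += 1;
                }
            }
        | 
        | Most such constraints have no natural encoding in an HLL, which
        | is the point.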
        
         | almostgotcaught wrote:
         | A compiler is a combinatorial optimizer (think bin-packing). In
         | general, optimizers/solvers basically search for the best
         | solution. Most production compilers don't have solvers in them,
         | they use heuristics instead, but even the best solvers use tons
         | of heuristics. Naturally a computer will search/try heuristics
          | faster and more thoroughly than you, but sometimes you can do
         | better because performant searching is all about "knowing where
         | to look".
        
           | BeeOnRope wrote:
           | Modern compilers are not doing much searching in general.
            | It's mostly applying some feed-forward heuristic to determine
            | whether to apply a transformation or not.
            | 
            | I think a slower, search-based compiler could have a lot of
            | potential for the hottest parts, where you're willing to
            | spend exorbitant time on a search.
        
             | jonjojojon wrote:
              | I have heard search-based compiler optimization called
              | "superoptimization"[1]. It seems interesting, but as far as
              | I know it has not seen much industrial use.
             | 
             | 1. https://blog.regehr.org/archives/2578
        
               | kaba0 wrote:
                | It simply doesn't scale. You can only superoptimize very
                | short runs of code, nowhere near even small code bases,
                | let alone big ones.
        
               | fooker wrote:
                | It scales well enough. You can apparently run Souper on
                | SQLite in 24 hours with a beefy machine, according to a
                | talk I recently attended by one of the developers.
        
             | almostgotcaught wrote:
             | > Modern compilers are not doing much searching in general.
             | 
             | This is false. Any compiler that does register allocation
             | and instruction scheduling (all of them) is _searching_ for
             | an optimal (or just good enough) solution to an
             | optimization problem.
        
               | drpixie wrote:
               | Can you give an example?
               | 
               | In general, searching any significant space is _very_
               | slow, applying heuristics is much quicker.
        
               | azakai wrote:
               | I'm not sure there is a clear separation between applying
               | heuristics and searching a space. Often in compilers you
               | search a subset of a space using heuristics, and you can
               | adjust those to control how much of the space you cover.
               | 
               | For example, here is a pass that reorders WebAssembly
               | globals in the Binaryen optimizer:
               | 
               | https://github.com/WebAssembly/binaryen/blob/main/src/pas
               | ses...
               | 
                | We have a simple criterion for the quality of a solution
                | - how big the binary is with a given order - but the
                | space of possible orders is huge (every permutation that
                | keeps every global after its dependencies). What we do is
                | a targeted search of that space using some heuristics,
                | with parameters that work well enough and aren't too slow
                | in practice.
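                | 
                | In toy form (nothing like the real pass, just the shape
                | of a heuristic-bounded search), the idea is roughly:
                | 
                    // Hill-climb over orderings: score candidates with a
                    // cost function, but only try a heuristic-limited set
                    // of neighbors instead of all permutations.
                    fn search_order(
                        start: Vec<u32>,
                        cost: impl Fn(&[u32]) -> u64,
                        tries: usize,
                    ) -> Vec<u32> {
                        let mut best = start;
                        let mut best_cost = cost(&best);
                        let n = best.len().saturating_sub(1);
                        for i in 0..tries.min(n) {
                            let mut cand = best.clone();
                            cand.swap(i, i + 1); // heuristic move
                            let c = cost(&cand);
                            if c < best_cost {
                                best_cost = c;
                                best = cand;
                            }
                        }
                        best
                    }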
        
               | almostgotcaught wrote:
               | > I'm not sure there is a clear separation between
               | applying heuristics
               | 
               | There is and it's quite simple: if your heuristic reduces
               | the size of your search space faster than it takes to
               | perform the search (ie try solutions) then you have a
               | real algo on your hands. Otherwise you're just searching.
               | This is basically the border between P and NP and it's
               | just that in compilers most of the problems are NP hard
               | so none of the heuristics are really that good.
        
               | gumby wrote:
               | I disagree with this part of your comment:
               | 
               | > then you have a real algo on your hands
               | 
               | To me an algorithm is closed, while a heuristic (aka rule
               | of thumb) is just a fast way to probably get a better
               | solution / subset of the solution space at the cost of
               | possibly missing the optimal result or even ending up in
               | a pessimal corner case.
               | 
                | With an NP-complete problem you'd rather have _some_
                | solution than use up your lifetime searching for the
                | best.
        
               | almostgotcaught wrote:
               | > To me an algorithm is closed
               | 
               | i dunno what "closed" means here? converges? lots of
               | things with heuristics converge...
               | 
               | most things people think of as algorithms are just
               | heuristics codified. take for example unit propagation in
               | DPLL (fairly modern SAT approach); quoting wiki[1]
               | 
               | > In practice, this often leads to deterministic cascades
               | of units, thus avoiding a large part of the naive search
               | space.
               | 
               | that *in practice* there means there's no guarantee
               | because ofc not - otherwise they would've proven
               | something about NP. but they didn't, they just came up
               | with a way to search that's sometimes/often not bad. a
               | lot of people call this an algorithm ("... is a
               | refinement of the earlier Davis-Putnam algorithm")
                | because it fits the dictionary definition (a repeatable
               | process) but i do not because it's a repeatable process
               | that isn't proven to produce anything (ie faster than
               | brute force). and the intuition i'm proposing for why it
               | doesn't is because it doesn't actually shrink the search
               | space fast enough.
               | 
               | note, my definitions/intuitions don't translate/have the
               | same force elsewhere (especially in
               | continuous/differentiable spaces) but they're a pretty
               | standard/obvious perspective in combinatorial
               | optimization (np-hard/np-complete problems).
               | 
               | [1] https://en.wikipedia.org/wiki/DPLL_algorithm#The_algo
               | rithm
        
               | almostgotcaught wrote:
               | I don't know what you're asking for - this isn't some
               | kind of controversial topic - any iterative algo that
               | isn't polynomial time (or is approximate) is search.
               | 
               | In the context of compilers there are many. Look at this
               | block diagram for Chaitin's register allocator:
               | 
               | https://en.wikipedia.org/wiki/Register_allocation#Princip
               | le_...
               | 
               | That's a search because it tries an allocation, possibly
               | incurs a spill, tries again.
        
               | WalterBright wrote:
               | Where things get fun is when two optimizations combine to
               | make things worse. They never tell you about that in
               | compiler class!
               | 
               | It's like designing a house. If you want the master
               | closet bigger, the master bath has to shrink. Everything
               | is a tradeoff.
        
               | almostgotcaught wrote:
               | You have this exact issue in VLIW and it's exactly where
               | you can't cheat by using heuristics:
               | 
               | Combinatorial Register Allocation and Instruction
               | Scheduling
               | 
               | https://dl.acm.org/doi/10.1145/3332373
        
             | gumby wrote:
              | I understand the compiler Microsoft uses to build release
              | versions of Windows, Office, etc. is like this and can take
              | days to run.
        
               | flamedoge wrote:
               | and well worth the cost
        
         | albinahlback wrote:
          | While I do think there are a lot of reasons why one should
          | not write in assembly, time-critical base routines can often
          | benefit a lot from being written in assembly (take GMP's mpn
          | module, for instance). However, this puts some constraints on
          | how the code should be structured (in GMP's case, everything
          | the mpz module does is derived from the mpn module), which
          | cannot be said for all applications.
        
         | ashton314 wrote:
         | I am going to be the token programming language researcher and
         | say that what you really want is a dependently typed assembly
         | language that your dependently typed higher level language
         | lowers to. One school of thought that has yet to bear fruit in
         | "mainstream" programming, but gives tantalizing hints of what
          | is possible, is expressing increasing amounts of your program's
         | constraints in the type system, thereby informing the compiler
         | precisely about your intent.
        
           | almostgotcaught wrote:
           | > want is a dependently typed assembly language
           | 
            | Doesn't make any sense. The "type constraints" on assembly
            | operands (registers and numbers) are the ISA, and thus those
            | constraints are combinatorial, not logical.
        
             | tucnak wrote:
              | There's a reason "token programming language researchers"
              | are incapable of understanding modern computer
              | architectures: a huge gap where EE education would have
              | been.
        
               | almostgotcaught wrote:
               | Correct - most of them think the goal is an infinite
               | tower of sophisticated "abstractions" not realizing
               | there's nothing abstract about a finite piece of silicon.
        
       | jll29 wrote:
       | This was a good read, thanks.
       | 
       | Instead of using unsafe Rust to optimize the interpreter loop, I
       | would prefer to write a transpiler that compiles the source UXN
        | binaries to native CPU binaries, which would not require making
       | code unsafe/less readable and would permit further speed
       | enhancements by getting rid of an interpreter loop altogether.
        
         | mkeeter wrote:
         | This is an interesting idea, but gets tricky if someone writes
         | self-modifying code!
         | 
          | There are Uxn instructions which write to RAM; _normally_,
         | this is used for storing data, but nothing prevents programs
         | from editing their own code as they're running.
        
           | duped wrote:
           | > but nothing prevents programs from editing their own code
           | as they're running
           | 
            | On some platforms, writing to memory mapped with PROT_EXEC
            | will trigger a page fault and the process will be killed. In
            | other words, self-modifying executables are effectively
            | forbidden by design (the workaround is to unmap the memory,
            | map it as PROT_WRITE, modify, unmap, map it back in as
            | PROT_EXEC, and resume execution - which is what JITs do on
            | macOS, for example).
        
             | duskwuff wrote:
             | There's a better workaround, incidentally, which is to map
             | the same memory to two different address ranges with
             | different access permissions. This allows code in the JIT
             | region to be updated while threads may be running in it.
        
               | saagarjha wrote:
               | An even better workaround is to make that switch faster.
        
             | Hemospectrum wrote:
             | The parent comment was referring to programs running inside
             | the UXN virtual machine, which explicitly supports and
             | endorses self-modifying code. Many of the creators' own
             | programs take advantage of this capability, so any
             | conforming implementation has to find a way to make it
             | work.
        
       | projektfu wrote:
       | Good article, brings the data to back it up.
       | 
       | Unfortunately, it was hard to read with the monokai.css theme
       | because comments were nearly invisible, and a lot of your
       | information was in comments. Changing the color from #75715e to
       | #95917e did the trick. I guess Monokai is for programmers who
       | never read comments.
        
         | mkeeter wrote:
         | Thanks for the feedback; I tweaked the brightness and pushed
         | the change.
        
       | lilyball wrote:
       | Does the compiler do anything differently if you stick to Rust
       | but change the dispatch implementation to be an #[inline] fn
       | next() that you put at the end of each opcode?
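        | 
        | Something like this sketch, I mean (names hypothetical; note that
        | without guaranteed tail calls each next() nests a call, so this
        | relies on the calls being optimized into jumps):
        | 
            struct Vm { pc: usize, ram: Vec<u8>, stack: Vec<u8> }
            type Handler = fn(&mut Vm);

            static TABLE: [Handler; 256] = {
                let mut t = [op_brk as Handler; 256];
                t[0x18] = op_add; // remaining opcodes elided
                t
            };

            // Inlined into every handler, so each opcode ends with its
            // own copy of the indirect dispatch branch.
            #[inline(always)]
            fn next(vm: &mut Vm) {
                let op = vm.ram[vm.pc];
                vm.pc += 1;
                TABLE[op as usize](vm);
            }

            fn op_brk(_vm: &mut Vm) {} // halt by returning

            fn op_add(vm: &mut Vm) {
                let (b, a) = (vm.stack.pop().unwrap(),
                              vm.stack.pop().unwrap());
                vm.stack.push(a.wrapping_add(b));
                next(vm);
            }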
        
       | ajbt200128 wrote:
       | > // SAFETY: do you trust me?
       | 
       | No.
       | 
       | Have you seen the safer-ffi crate? Then you won't have to commit
       | the deadly sin of writing unsafe Rust
        
         | Wowfunhappy wrote:
         | But this is an entry point to assembly code. It's inherently
         | unsafe.
        
         | gpm wrote:
         | > safer-ffi crate
         | 
         | It looks to me like that crate is supposed to help with
         | exposing rust functions to C safely, not calling foreign
         | functions in foreign code safely. Am I missing something?
        
       | 201984 wrote:
       | Very cool! I enjoy seeing people writing asm, and letting us get
       | the most out of our CPUs.
       | 
       | I see you already tried what I thought of, which is getting rid
       | of the jump table and making each instruction handler the same
       | size. Do you think that could still work if you limited each
       | instruction handler to 64 or 32 bytes instead of 256, and then
       | for longer handlers jumped to a larger body of code somewhere
       | else?
        
         | abbeyj wrote:
         | This is how Dalvik (the old Android Java interpreter) worked,
         | with 64 chosen as the size of an instruction handler. See
         | https://wladimir-tm4pda.github.io/porting/dalvik.html (search
         | for "computed goto").
        
       | JackYoustra wrote:
       | Did you try PGO? This post seems like the thing it was built for.
        
         | saagarjha wrote:
         | How would it help here?
        
       | tempodox wrote:
        | I find this strange:
            &mut h as *mut _ as *mut _
       | 
       | What is going on here?
        
         | evrimoztamur wrote:
         | It dereferences a mutable reference to h twice, ignoring its
         | type with _s. I suppose this implies that h is a reference type
         | itself.
        
           | muricula wrote:
           | No dereferences, just casts. It shouldn't generate any
           | loads/reads from memory.
        
         | mkeeter wrote:
          | It's doing the following cast:
              &mut DeviceHandle -> *mut DeviceHandle -> *mut c_void
         | 
         | (with the pointer types being solved for automatically by the
         | compiler)
        
         | dwattttt wrote:
         | It's a bit of a song and dance; you can turn a &mut T into a
         | *mut T, and you can cast a *mut T into a *mut anything, but you
         | can't do it in one step.
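          | 
          | Spelled out with concrete types (a hypothetical DeviceHandle),
          | the two steps are:
          | 
              use std::ffi::c_void;

              struct DeviceHandle;

              fn as_void_ptr(h: &mut DeviceHandle) -> *mut c_void {
                  let p = h as *mut DeviceHandle; // &mut T -> *mut T
                  p as *mut c_void // then *mut T -> *mut anything
              }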
        
       | o11c wrote:
       | > Making all of the opcode implementations the same size (padding
       | to the size of the largest opcode implementation with .balign
       | 256), then removing the jump table entirely.
       | 
       | Another idea would be to sort the opcodes by size (32-byte,
       | 64-byte, 128-byte, 256-byte - or perhaps +1 to avoid the set-
       | associativity problem below) rather than simply indexing.
       | 
       | > This was also slower, also probably because of cache
       | friendliness: the opcode implementations go from 16.6 KiB total
       | to 64 KiB.
       | 
       | This is probably cache-unfriendly due to alignment too
       | (associativity limits mean more conflict misses for highly-
       | aligned memory), not just size. Most documentation only talks
       | about this problem regarding the data cache though ...
        
       | 256_ wrote:
       | Why does making the FETCH part of the cycle a macro make it
       | faster? Surely the branch predictor is fine with unconditional
       | immediate branches. What am I missing here?
       | 
       | Also, it has jump-to-register immediately after the instruction
       | that sets that register. Wouldn't it be faster if it went like:
        get jmp address
        execute VM opcode
        jmp to next instruction
       | 
       | So the pipeline can fetch it ahead of time?
        
         | phire wrote:
          | On modern out-of-order CPUs (like the M1), the frontend can't
          | see branches until far too late.
          | 
          | The M1's frontend is at least 24 instructions past the
          | unconditional branch before the earliest possible moment it
          | can even see it.
         | 
          | So the branch predictor isn't just responsible for predicting
          | which way conditional branches go. It must remember where all
          | branches are, and their targets, so that the frontend can
          | follow them with zero-cycle delay. This means all branches,
          | including call and return instructions.
         | 
         | Which means that unconditional immediate branches cost about
         | the same as a correctly predicted conditional branch.
         | 
         | But that's not actually why the fetch has been moved.
         | 
         | The other thing to note is that the frontend and backend of a
         | modern CPU are completely disconnected. The frontend doesn't
         | even try to get the correct address of an indirect jump from
         | the backend. It always uses the branch predictor to predict the
         | indirect branch.
         | 
          | And by inlining, each VM instruction has its own indirect jump,
          | which means it gets a different slot in the branch predictor,
          | allowing for better predictions.
         | 
         | At least that's the theory behind threaded code. I'm unsure how
         | much of this speedup is coming from eliminating the extra
         | unconditional immediate branch and how much is from better
         | prediction of indirect branches.
        
           | aengelke wrote:
           | Side note: Intel CPUs since Skylake and also recent AMD CPUs
           | (since Zen 3 or so?) store a history for indirect branches.
           | On such processors, using threaded jumps does not really
           | improve performance anymore (I've even seen 1-2% slowdowns on
           | some cores).
        
             | phire wrote:
              | Pretty sure it's Haswell and Zen 2. They both implement
              | ITTAGE-based branch predictors.
             | 
             | I just assumed the M1 branch predictor would also be in the
             | same class, but I guess not. In another comment
             | (https://news.ycombinator.com/item?id=40952404), I did some
             | tests to confirm that it was actually the threaded jumps
             | responsible for the speedup.
             | 
              | I'm tempted to dig deeper, to see what the M1's branch
              | predictor can and can't do.
        
               | phire wrote:
               | _too late to edit_
               | 
                | Turns out that the M1 can track the history of indirect
               | branches just fine, but it takes 3 cycles for a correct
               | prediction. With threaded jumps, the M1 gets a slightly
                | higher hit rate for the initial 1-cycle prediction.
               | 
               | https://news.ycombinator.com/item?id=40953764
        
       | interpunct wrote:
       | > I've proven to my satisfaction that writing an interpreter in
       | assembly is both fun and performant!
       | 
       | Fun maybe/maybe not, but define "performant". I might drop into
       | assembly for 1 or 2 orders of magnitude faster for a non-toy
       | project. Even then, what are my requirements? What is the bus
       | factor of LuaJIT again? Try getting support for s390x, or your
       | patch accepted.
       | 
       | Look at the speedup of Lua v. LuaJIT (C language VM v. C/Lua VM
       | with JIT code gen):
       | 
       | https://web.archive.org/web/20180430211146/https://luajit.or...
        
       | kosolam wrote:
        | I would love to understand this uxn thing better. Is it some
        | academic experiment, or does it have real-world applications?
        | What's the deal with the additional return stack?
        
       | rerdavies wrote:
       | It would be interesting to see what happens if you turn on
       | profile-guided optimization. This seems like a very good
       | application for PGO.
       | 
       | The 518ms profile result for the branch-table load is very
        | peculiar. To my eyes, it suggests that the branch-table load
        | is incurring cache misses at a furious rate. But I can't honestly
       | think of why that would be. On a Pi-4 everything should fit in L2
       | comfortably. Are you using a low-performance ARM processor like
       | an A53?
        
       | Validark wrote:
       | I wish you wouldn't broadcast the sentiment contained in the
        | first paragraph. Compilers fail to consistently perform many
        | basic optimizations, to an embarrassing extent, including even
        | the ones you would think would be the first optimizations you'd
        | implement when writing a compiler. Open up
       | Godbolt and tell me if you still think the compiler knows best. I
       | try to submit at least one issue to LLVM every time I look at the
       | assembly it gives me. I say "try to" because sometimes it's too
       | much to deal with.
       | 
       | And by the way, probably the absolute worst things the compiler
       | does are the "smart" things. I write my code knowing what the
       | emit should look like, and the compiler sometimes thinks it has a
       | better idea than how I wrote the code and makes the emit a lot
       | more convoluted and slower than it would be if it just did what I
       | said. Actually, I do know the machine decently well, thank you.
       | 
       | Saying "centuries of engineering" is misleading. A lot of those
       | centuries are people arguing about theoretical optimizations of
        | dubious value while we still haven't even finished the most
       | basic optimizations that are obviously beneficial.
       | 
        | The second paragraph, making it sound all mysterious that
        | compilers are even more awful at interpreters than at normal
        | code, really speaks to a complete lack of exposure to the subject
       | bothered to look at godbolt.org. Which is totally fine by the
       | way, just don't go telling me how high quality compilers are if
       | you've never even looked.
       | 
       | That would be like me telling web developers that Dreamweaver
       | produces phenomenal HTML and CSS, without ever looking at what it
       | emits.
       | 
       | Sorry for the rant but it just bothers me how much praise is
       | heaped upon compilers like they are some kind of gift from God,
       | forged by monks, wizards, and unrivaled geniuses. A compiler is
       | just a tool. There is no magic.
        
         | pjmlp wrote:
         | "Any sufficiently advanced technology is indistinguishable from
         | magic."
         | 
          | When people call themselves engineers without having set foot
          | in an engineering school with compiler development coursework,
          | but rather in a six-week bootcamp, compilers feel to them like
          | a sufficiently advanced technology.
        
           | anonymoushn wrote:
           | You really don't need a degree for this stuff. A typical
           | degree has 1 class about this, and half of it is dedicated to
           | formal language theory or the unfortunate practice of using
           | parser generators.
        
             | pjmlp wrote:
              | Then my degree was atypical, as we had several classes
              | about this, scattered across two years of other stuff.
              | 
              | The theoretical stuff, the history and evolution of
              | programming languages, compiler and language design, and
              | the actual toy language implementation all the way to
              | native code.
              | 
              | That was 30 years ago; maybe the quality of teaching has
              | gone down since, I guess.
        
           | intelVISA wrote:
           | Any good 2 week bootcamp for mastering PLT??
        
             | pjmlp wrote:
              | Get _Modern Compiler Implementation in C_ [0] (with Java
              | and ML versions also available), or the Kaleidoscope LLVM
              | tutorial [1], and do the whole thing in 2 weeks, as a
              | starting point.
             | 
             | [0] - https://www.cs.princeton.edu/~appel/modern/c/
             | 
             | [1] - https://llvm.org/docs/tutorial/
        
       | teo_zero wrote:
       | I was intrigued by this paragraph:
       | 
       | > Making all of the opcode implementations the same size (padding
       | to the size of the largest opcode implementation with .balign
       | 256), then removing the jump table entirely. This was also
       | slower, also probably because of cache friendliness: the opcode
       | implementations go from 16.6 KiB total to 64 KiB.
       | 
        | Probably normalizing the opcode implementations' length to the
        | _maximum_ length is not optimal. A possible experiment would be
       | to normalize to the _modal_ length. I would expect most opcodes
       | to be arithmetic or logic operations and to share the same
       | implementation length. The (hopefully few) more complex ones
       | would need a trampoline.
       | 
       | Is this a crazy idea?
        
       | phire wrote:
       | _> The dispatch loop takes a single indirect branch to the
       | opcode-specific implementation. This means that the branch will
       | be nigh unpredictable!_
       | 
       | Modern branch predictors can actually predict indirect branches
       | with multiple destinations, because they hash recent branch
       | history into the prediction. The exact same indirect branch will
       | end up with multiple BTB entries, based on previous control flow.
       | 
       | I was curious where the threaded code speedup is actually coming
       | from. It's possible much of the speedup is coming from
       | eliminating the extra branch back to the dispatch loop. Or maybe
       | the history tracking in the M1's branch predictor doesn't work
       | well for this type of control flow.
       | 
       | So I checked out this code and modified the next macro, adding a
       | dummy branch over a nop, to roughly isolate this factor.
       | 
       | On my M1, the extra unconditional branch benchmarks at 1.08 sec
       | for fib, and 1.19 sec for Mandelbrot (all other times were the
       | same).
       | 
       | Looks like the speedup is a mixture of the two. Eliminating that
       | extra branch is responsible for about 20-30% of the speedup, and
       | improving prediction on indirect branches is responsible for the
        | other 70-80%.
        
         | phire wrote:
         | Ok, I spent quite a bit of time looking at performance
         | counters, trying to understand what the M1's branch predictor
         | was doing.
         | 
         | The branch predictor is really accurate with a common
         | dispatcher, it predicts those indirect branches correctly
         | 99.25% of the time. Switching to threaded jumps improves this
         | slightly to 99.75%, but not because the indirect branches are
         | at different addresses. This improvement in accuracy is
          | entirely because of the unconditional branch back to the
          | dispatcher that was removed as a side effect. With that gone,
         | the branch predictor can now track branch histories that are
         | about twice as many VM instructions long.
         | 
         | My modified version with a dummy branch in the threaded
         | dispatcher negates this longer history and (according to the
         | performance counters) results in the exact same 99.25%
         | correctly predicted branches as the common dispatcher, yet it's
         | still significantly faster, only 20% slower than threaded
         | jumps.
         | 
         | -------------------
         | 
          | Why are threaded jumps faster on the M1, if they aren't
          | increasing branch prediction accuracy?
         | 
         | Well, the M1 essentially has two branch predictors[1]. The
         | faster one can return a prediction in one cycle, but it's not
          | checking branch history at all, and it's almost always wrong
         | for these unpredictable indirect branches. The slower predictor
         | does take branch history into account, but takes three cycles
         | to produce the correct result.
         | 
         | Which means there is a short pipeline stall when the second
         | prediction comes in. This stall doesn't show up as a
         | BRANCH_INDIR_MISPRED_NONSPEC, because the branch was correctly
         | predicted. Instead, it seems to show up in the FETCH_RESTART
         | counter.
         | 
          | So while threaded jumps don't improve overall branch
          | prediction accuracy (except via the longer history), they do
          | slightly improve the accuracy of the faster branch
         | predictor. With a common dispatcher, it predicts wrong almost
         | 100% of the time, but with threaded code, the accuracy improves
         | to 40%.
         | 
          |  _[1] Or at least that's a decent mental model. I suspect it's
          | actually a single ITTAGE branch predictor that returns the
          | zero-history result in 1 cycle._
        
           | mkeeter wrote:
           | Very cool, thanks for digging into this!
        
       | the8472 wrote:
       | 0x100002ec0: b 0x100002d1c       ; jump back to the dispatch loop
       | 
        | I'd expect tail-call-based control flow - i.e. computing the jump
       | destination at the end of each opcode function - to be nicer to
       | the branch predictor because it could track targets separately.
       | 
       | Currently that's hard to achieve in Rust. There's an experiment
       | in progress to see if guaranteed tail calls can be added to the
        | language:
       | 
       | https://github.com/rust-lang/rust/issues/112788
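        | 
        | Until then, a stable-Rust approximation (hypothetical names) is
        | to have each handler return the next handler and bounce through a
        | trampoline - though that funnels every dispatch back through a
        | single indirect call site, which is exactly what guaranteed tail
        | calls would avoid:
        | 
            struct Vm { pc: usize, ram: Vec<u8> }

            // A handler returns the next handler (or None to halt); the
            // Option is wrapped in a struct because type aliases can't
            // refer to themselves.
            struct Next(Option<fn(&mut Vm) -> Next>);

            fn op_nop(vm: &mut Vm) -> Next {
                let op = vm.ram[vm.pc]; // do the op's work, then fetch
                vm.pc += 1;
                Next(Some(TABLE[op as usize]))
            }

            fn op_brk(_vm: &mut Vm) -> Next {
                Next(None)
            }

            static TABLE: [fn(&mut Vm) -> Next; 256] = {
                let mut t = [op_nop as fn(&mut Vm) -> Next; 256];
                t[0x00] = op_brk; // everything else elided
                t
            };

            fn run(vm: &mut Vm) {
                let mut h = op_nop as fn(&mut Vm) -> Next;
                while let Next(Some(next)) = h(vm) {
                    h = next;
                }
            }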
        
       ___________________________________________________________________
       (page generated 2024-07-13 23:01 UTC)