[HN Gopher] Beating the Compiler
___________________________________________________________________
Beating the Compiler
Author : mkeeter
Score : 181 points
Date : 2024-07-12 18:54 UTC (1 days ago)
(HTM) web link (www.mattkeeter.com)
(TXT) w3m dump (www.mattkeeter.com)
| 0x3444ac53 wrote:
| I was overjoyed to open this and find a reference to 100r and UXN
| gumby wrote:
| IMHO the best way to think of it (well, it's how _I_ have long
| thought of it) is two lemmas:
|
| 1 - The compiler has more "fingers" than a human does. Back when
| I wrote programs in assembly I would use a printout to keep track
| of what I was doing and for debugging, and so would often put a
| finger on the paper to mark where a jump was and then go to the
| destination to see if it matched up, etc. This process isn't at
| all scalable, so the compiler is inevitably going to do better
| than I ever could on anything more than something trivial.
|
| 2 - I know more about the program constraints than I could ever
| express (much less be bothered expressing) in a HLL. "All the
| values will be less than 16" or "I can use this memory location
| for something else after I've read from it" or who knows what. So
| sometimes I can chop stuff out or even rewrite the compiler
| output locally to speed it up.
|
| Also I can only do this for a few architectures...and with modern
| x86 there are so many variants with all sorts of
| microoptimization affordances that I can rarely beat the compiler
| (this is a corollary of lemma 1)
| almostgotcaught wrote:
| A compiler is a combinatorial optimizer (think bin-packing). In
| general, optimizers/solvers basically search for the best
| solution. Most production compilers don't have solvers in them,
| they use heuristics instead, but even the best solvers use tons
| of heuristics. Naturally a computer will search/try heuristics
| faster and more thoroughly than you but sometimes you can do
| better because performant searching is all about "knowing where
| to look".
| BeeOnRope wrote:
| Modern compilers are not doing much searching in general.
| It's mostly applying some feed-forward heuristic to determine
| whether to apply a transformation or not.
|
| I think a slower, search-based compiler could have a lot of
| potential for the hottest parts, where you're willing to spend
| exorbitant time on a search.
| jonjojojon wrote:
| I have heard search-based compiler optimization called
| "superoptimization"[1]. It seems interesting, but as far as
| I know has not seen much industrial use.
|
| 1. https://blog.regehr.org/archives/2578
| kaba0 wrote:
| It simply doesn't scale. You can only superoptimize very
| short runs of code, nowhere near even small code bases,
| let alone big ones.
| fooker wrote:
| It scales well enough. You can apparently run Souper on
| SQLite in 24 hours with a beefy machine, according to a
| talk by one of the developers that I recently attended.
| almostgotcaught wrote:
| > Modern compilers are not doing much searching in general.
|
| This is false. Any compiler that does register allocation
| and instruction scheduling (all of them) is _searching_ for
| an optimal (or just good enough) solution to an
| optimization problem.
| drpixie wrote:
| Can you give an example?
|
| In general, searching any significant space is _very_
| slow; applying heuristics is much quicker.
| azakai wrote:
| I'm not sure there is a clear separation between applying
| heuristics and searching a space. Often in compilers you
| search a subset of a space using heuristics, and you can
| adjust those to control how much of the space you cover.
|
| For example, here is a pass that reorders WebAssembly
| globals in the Binaryen optimizer:
|
| https://github.com/WebAssembly/binaryen/blob/main/src/passes...
|
| We have a simple criterion for the quality of a solution -
| how big the binary size is with an order - but the space
| of possible orders is huge (every permutation that keeps
| every global after its dependencies). What we do is a
| targeted search of that space using heuristics with
| parameters that work well enough and aren't too slow in
| practice.
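|
| As an illustration of that kind of heuristic-guided search, here
| is a minimal sketch (hypothetical Rust, not the actual Binaryen
| pass) that orders items by a use-count heuristic while keeping
| every item after its dependencies:
|
|     // Hypothetical sketch: greedily pick the most-used item among
|     // those whose dependencies are already placed (a Kahn-style
|     // topological order with a heuristic tie-break), instead of
|     // scoring every permutation.
|     fn order(uses: &[u32], deps: &[Vec<usize>]) -> Vec<usize> {
|         let n = uses.len();
|         let mut pending: Vec<usize> =
|             deps.iter().map(|d| d.len()).collect();
|         let mut dependents = vec![Vec::new(); n];
|         for (i, d) in deps.iter().enumerate() {
|             for &j in d {
|                 dependents[j].push(i);
|             }
|         }
|         let mut ready: Vec<usize> =
|             (0..n).filter(|&i| pending[i] == 0).collect();
|         let mut out = Vec::with_capacity(n);
|         while !ready.is_empty() {
|             // Heuristic: hot items first, so they get small indices.
|             ready.sort_by_key(|&i| uses[i]);
|             let i = ready.pop().unwrap();
|             out.push(i);
|             for &j in &dependents[i] {
|                 pending[j] -= 1;
|                 if pending[j] == 0 {
|                     ready.push(j);
|                 }
|             }
|         }
|         out
|     }
|
| The heuristic prunes the permutation space down to a single greedy
| pass; a fuller search could try several tie-breaks and keep the
| best-scoring order.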
| almostgotcaught wrote:
| > I'm not sure there is a clear separation between
| applying heuristics
|
| There is and it's quite simple: if your heuristic reduces
| the size of your search space faster than it takes to
| perform the search (i.e., try solutions), then you have a
| real algo on your hands. Otherwise you're just searching.
| This is basically the border between P and NP and it's
| just that in compilers most of the problems are NP hard
| so none of the heuristics are really that good.
| gumby wrote:
| I disagree with this part of your comment:
|
| > then you have a real algo on your hands
|
| To me an algorithm is closed, while a heuristic (aka rule
| of thumb) is just a fast way to probably get a better
| solution / subset of the solution space at the cost of
| possibly missing the optimal result or even ending up in
| a pessimal corner case.
|
| With an NP-complete problem you'd rather have _some_
| solution than use up your lifetime searching for
| the best.
| almostgotcaught wrote:
| > To me an algorithm is closed
|
| i dunno what "closed" means here? converges? lots of
| things with heuristics converge...
|
| most things people think of as algorithms are just
| heuristics codified. take for example unit propagation in
| DPLL (fairly modern SAT approach); quoting wiki[1]
|
| > In practice, this often leads to deterministic cascades
| of units, thus avoiding a large part of the naive search
| space.
|
| that *in practice* there means there's no guarantee
| because ofc not - otherwise they would've proven
| something about NP. but they didn't, they just came up
| with a way to search that's sometimes/often not bad. a
| lot of people call this an algorithm ("... is a
| refinement of the earlier Davis-Putnam algorithm")
| because it fits the dictionary definition (a repeatable
| process) but i do not because it's a repeatable process
| that isn't proven to produce anything (ie faster than
| brute force). and the intuition i'm proposing for why it
| doesn't is because it doesn't actually shrink the search
| space fast enough.
|
| note, my definitions/intuitions don't translate/have the
| same force elsewhere (especially in
| continuous/differentiable spaces) but they're a pretty
| standard/obvious perspective in combinatorial
| optimization (np-hard/np-complete problems).
|
| [1] https://en.wikipedia.org/wiki/DPLL_algorithm#The_algorithm
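|
| For concreteness, a hedged sketch of the unit-propagation step
| described above (illustrative Rust, not a real solver):
|
|     use std::collections::HashMap;
|
|     // Literals are nonzero i32s: +v means "v is true", -v means
|     // "v is false". Returns false on a conflict (falsified clause).
|     fn unit_propagate(
|         clauses: &[Vec<i32>],
|         assignment: &mut HashMap<u32, bool>,
|     ) -> bool {
|         loop {
|             let mut changed = false;
|             for clause in clauses {
|                 let mut unassigned = Vec::new();
|                 let mut satisfied = false;
|                 for &lit in clause {
|                     match assignment.get(&lit.unsigned_abs()) {
|                         Some(&val) if val == (lit > 0) => {
|                             satisfied = true;
|                             break;
|                         }
|                         Some(_) => {}                 // literal is false
|                         None => unassigned.push(lit), // still free
|                     }
|                 }
|                 if satisfied { continue; }
|                 match unassigned.as_slice() {
|                     [] => return false,    // every literal false: conflict
|                     [lit] => {             // unit clause: forced assignment
|                         assignment.insert(lit.unsigned_abs(), *lit > 0);
|                         changed = true;    // may cascade into more units
|                     }
|                     _ => {}
|                 }
|             }
|             if !changed { return true; }   // fixed point reached
|         }
|     }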
| almostgotcaught wrote:
| I don't know what you're asking for - this isn't some
| kind of controversial topic - any iterative algo that
| isn't polynomial time (or is approximate) is search.
|
| In the context of compilers there are many. Look at this
| block diagram for Chaitin's register allocator:
|
| https://en.wikipedia.org/wiki/Register_allocation#Principle_...
|
| That's a search because it tries an allocation, possibly
| incurs a spill, tries again.
| WalterBright wrote:
| Where things get fun is when two optimizations combine to
| make things worse. They never tell you about that in
| compiler class!
|
| It's like designing a house. If you want the master
| closet bigger, the master bath has to shrink. Everything
| is a tradeoff.
| almostgotcaught wrote:
| You have this exact issue in VLIW and it's exactly where
| you can't cheat by using heuristics:
|
| Combinatorial Register Allocation and Instruction
| Scheduling
|
| https://dl.acm.org/doi/10.1145/3332373
| gumby wrote:
| I understand the compiler Microsoft uses to build release
| versions of Windows, Office, etc. is like this and can take
| days to run.
| flamedoge wrote:
| and well worth the cost
| albinahlback wrote:
| While I do think there are a lot of reasons why one should
| not write in assembly, time-critical base routines can often
| benefit a lot from being written in assembly (take GMP's mpn-
| module for instance). However, this puts some constraints on how
| the code should be formatted (in GMP's case, this means that
| everything the mpz-module does is derived from the mpn-module),
| which cannot be said for all applications.
| ashton314 wrote:
| I am going to be the token programming language researcher and
| say that what you really want is a dependently typed assembly
| language that your dependently typed higher level language
| lowers to. One school of thought that has yet to bear fruit in
| "mainstream" programming, but gives tantalizing hints of what
| is possible, is expressing increasing amounts of your program's
| constraints in the type system, thereby informing the compiler
| precisely about your intent.
| almostgotcaught wrote:
| > want is a dependently typed assembly language
|
| Doesn't make any sense. The "type constraints" on assembly
| operands (registers and numbers) are the ISA, and thus those
| constraints are combinatorial, not logical.
| tucnak wrote:
| There's a reason "token programming language researchers"
| are incapable of understanding modern computer
| architectures: a huge gap where an EE education would have
| been.
| almostgotcaught wrote:
| Correct - most of them think the goal is an infinite
| tower of sophisticated "abstractions" not realizing
| there's nothing abstract about a finite piece of silicon.
| jll29 wrote:
| This was a good read, thanks.
|
| Instead of using unsafe Rust to optimize the interpreter loop, I
| would prefer to write a transpiler that compiles the source UXN
| binaries to local CPU binaries, which would not require making
| code unsafe/less readable and would permit further speed
| enhancements by getting rid of an interpreter loop altogether.
| mkeeter wrote:
| This is an interesting idea, but gets tricky if someone writes
| self-modifying code!
|
| There are Uxn instructions which write to RAM; _normally_,
| this is used for storing data, but nothing prevents programs
| from editing their own code as they're running.
| duped wrote:
| > but nothing prevents programs from editing their own code
| as they're running
|
| On some platforms writing to memory mapped in with PROT_EXEC
| will trigger a page fault and the process will be killed. In
| other words, self-modifying executables are effectively
| forbidden by design (the workaround is to unmap the memory,
| map it as PROT_WRITE, modify, unmap, map it in as PROT_EXEC,
| and resume execution - which is what JITs do on macOS, for
| example).
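|
| A minimal sketch of that write-then-execute dance (assuming the
| libc crate, and showing the plain Linux-style mmap/mprotect flow;
| hardened macOS JITs additionally need MAP_JIT and
| pthread_jit_write_protect_np):
|
|     use std::ptr;
|
|     fn main() {
|         unsafe {
|             let size = 4096;
|             // Map a page writable (not executable) and emit code into it.
|             let mem = libc::mmap(
|                 ptr::null_mut(),
|                 size,
|                 libc::PROT_READ | libc::PROT_WRITE,
|                 libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
|                 -1,
|                 0,
|             );
|             assert_ne!(mem, libc::MAP_FAILED);
|             // x86-64 bytes for `mov eax, 42; ret` (example payload only).
|             let code = [0xb8u8, 0x2a, 0x00, 0x00, 0x00, 0xc3];
|             ptr::copy_nonoverlapping(code.as_ptr(), mem as *mut u8, code.len());
|             // Flip the page to read+execute before jumping into it.
|             assert_eq!(
|                 libc::mprotect(mem, size, libc::PROT_READ | libc::PROT_EXEC),
|                 0
|             );
|             let f: extern "C" fn() -> i32 = std::mem::transmute(mem);
|             assert_eq!(f(), 42);
|             libc::munmap(mem, size);
|         }
|     }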
| duskwuff wrote:
| There's a better workaround, incidentally, which is to map
| the same memory to two different address ranges with
| different access permissions. This allows code in the JIT
| region to be updated while threads may be running in it.
| saagarjha wrote:
| An even better workaround is to make that switch faster.
| Hemospectrum wrote:
| The parent comment was referring to programs running inside
| the UXN virtual machine, which explicitly supports and
| endorses self-modifying code. Many of the creators' own
| programs take advantage of this capability, so any
| conforming implementation has to find a way to make it
| work.
| projektfu wrote:
| Good article, brings the data to back it up.
|
| Unfortunately, it was hard to read with the monokai.css theme
| because comments were nearly invisible, and a lot of your
| information was in comments. Changing the color from #75715e to
| #95917e did the trick. I guess Monokai is for programmers who
| never read comments.
| mkeeter wrote:
| Thanks for the feedback; I tweaked the brightness and pushed
| the change.
| lilyball wrote:
| Does the compiler do anything differently if you stick to Rust
| but change the dispatch implementation to be an #[inline] fn
| next() that you put at the end of each opcode?
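|
| A rough sketch of the shape being suggested (made-up names): each
| handler ends by calling an #[inline(always)] next(), so the
| dispatch gets duplicated into every handler instead of living in
| one shared loop. Without guaranteed tail calls the recursion can
| still grow the native stack, which is part of what makes this
| tricky; see the tail-call discussion further down.
|
|     struct Vm {
|         pc: usize,
|         code: Vec<u8>,
|         stack: Vec<u8>,
|     }
|
|     impl Vm {
|         #[inline(always)]
|         fn next(&mut self) {
|             let op = self.code[self.pc];
|             self.pc += 1;
|             match op {
|                 0x00 => (),              // BRK: stop
|                 0x80 => self.op_lit(),   // LIT
|                 0x18 => self.op_add(),   // ADD
|                 _ => panic!("unimplemented opcode {op:#04x}"),
|             }
|         }
|
|         fn op_lit(&mut self) {
|             let v = self.code[self.pc];
|             self.pc += 1;
|             self.stack.push(v);
|             self.next() // dispatch duplicated at the end of each handler
|         }
|
|         fn op_add(&mut self) {
|             let b = self.stack.pop().unwrap();
|             let a = self.stack.pop().unwrap();
|             self.stack.push(a.wrapping_add(b));
|             self.next()
|         }
|     }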
| ajbt200128 wrote:
| > // SAFETY: do you trust me?
|
| No.
|
| Have you seen the safer-ffi crate? Then you won't have to commit
| the deadly sin of writing unsafe Rust
| Wowfunhappy wrote:
| But this is an entry point to assembly code. It's inherently
| unsafe.
| gpm wrote:
| > safer-ffi crate
|
| It looks to me like that crate is supposed to help with
| exposing rust functions to C safely, not calling foreign
| functions in foreign code safely. Am I missing something?
| 201984 wrote:
| Very cool! I enjoy seeing people writing asm, and letting us get
| the most out of our CPUs.
|
| I see you already tried what I thought of, which is getting rid
| of the jump table and making each instruction handler the same
| size. Do you think that could still work if you limited each
| instruction handler to 64 or 32 bytes instead of 256, and then
| for longer handlers jumped to a larger body of code somewhere
| else?
| abbeyj wrote:
| This is how Dalvik (the old Android Java interpreter) worked,
| with 64 chosen as the size of an instruction handler. See
| https://wladimir-tm4pda.github.io/porting/dalvik.html (search
| for "computed goto").
| JackYoustra wrote:
| Did you try PGO? This post seems like the thing it was built for.
| saagarjha wrote:
| How would it help here?
| tempodox wrote:
| I find this strange: &mut h as *mut _ as *mut _
|
| What is going on here?
| evrimoztamur wrote:
| It dereferences a mutable reference to h twice, ignoring its
| type with _s. I suppose this implies that h is a reference type
| itself.
| muricula wrote:
| No dereferences, just casts. It shouldn't generate any
| loads/reads from memory.
| mkeeter wrote:
| It's doing the following cast:
|     &mut DeviceHandle -> *mut DeviceHandle -> *mut c_void
|
| (with the pointer types being solved for automatically by the
| compiler)
| dwattttt wrote:
| It's a bit of a song and dance; you can turn a &mut T into a
| *mut T, and you can cast a *mut T into a *mut anything, but you
| can't do it in one step.
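|
| Spelled out with the inferred types filled in (hypothetical type
| name):
|
|     use std::ffi::c_void;
|
|     struct DeviceHandle;
|
|     fn demo() {
|         let mut h = DeviceHandle;
|         // &mut h       : &mut DeviceHandle
|         //   as *mut _  -> *mut DeviceHandle
|         //   as *mut _  -> *mut c_void
|         let p: *mut c_void = &mut h as *mut _ as *mut _;
|         // Equivalent, with both target types written out:
|         let q = &mut h as *mut DeviceHandle as *mut c_void;
|         assert_eq!(p, q);
|     }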
| o11c wrote:
| > Making all of the opcode implementations the same size (padding
| to the size of the largest opcode implementation with .balign
| 256), then removing the jump table entirely.
|
| Another idea would be to sort the opcodes by size (32-byte,
| 64-byte, 128-byte, 256-byte - or perhaps +1 to avoid the set-
| associativity problem below) rather than simply indexing.
|
| > This was also slower, also probably because of cache
| friendliness: the opcode implementations go from 16.6 KiB total
| to 64 KiB.
|
| This is probably cache-unfriendly due to alignment too
| (associativity limits mean more conflict misses for highly-
| aligned memory), not just size. Most documentation only talks
| about this problem regarding the data cache though ...
| 256_ wrote:
| Why does making the FETCH part of the cycle a macro make it
| faster? Surely the branch predictor is fine with unconditional
| immediate branches. What am I missing here?
|
| Also, it has jump-to-register immediately after the instruction
| that sets that register. Wouldn't it be faster if it went like:
|     get jmp address
|     execute VM opcode
|     jmp to next instruction
|
| So the pipeline can fetch it ahead of time?
| phire wrote:
| Modern out-of-order CPUs (like the M1) can't see branches
| until far too late.
|
| The M1's frontend is at least 24 instructions past the
| unconditional branch by the earliest possible moment it can
| even see it.
|
| So the branch predictor isn't just responsible for predicting
| which way conditional branches go. It must remember where all
| branches are, and their targets, so that the frontend can follow
| them with zero-cycle delay. This means all branches, including
| call and return instructions too.
|
| Which means that unconditional immediate branches cost about
| the same as a correctly predicted conditional branch.
|
| But that's not actually why the fetch has been moved.
|
| The other thing to note is that the frontend and backend of a
| modern CPU are completely disconnected. The frontend doesn't
| even try to get the correct address of an indirect jump from
| the backend. It always uses the branch predictor to predict the
| indirect branch.
|
| And by inlining, each VM instruction has its own indirect jump,
| which means it gets a different slot in the branch predictor,
| allowing for better predictions.
|
| At least that's the theory behind threaded code. I'm unsure how
| much of this speedup is coming from eliminating the extra
| unconditional immediate branch and how much is from better
| prediction of indirect branches.
| aengelke wrote:
| Side note: Intel CPUs since Skylake and also recent AMD CPUs
| (since Zen 3 or so?) store a history for indirect branches.
| On such processors, using threaded jumps does not really
| improve performance anymore (I've even seen 1-2% slowdowns on
| some cores).
| phire wrote:
| Pretty sure it's Haswell and Zen 2. They both implement
| ITTAGE-based branch predictors.
|
| I just assumed the M1 branch predictor would also be in the
| same class, but I guess not. In another comment
| (https://news.ycombinator.com/item?id=40952404), I did some
| tests to confirm that it was actually the threaded jumps
| responsible for the speedup.
|
| I'm tempted to dig deeper and see what the M1's branch
| predictor can and can't do.
| phire wrote:
| _too late to edit_
|
| Turns out that M1 can track the history of indirect
| branches just fine, but it takes 3 cycles for a correct
| prediction. With threaded jumps, the M1 gets a slightly
| higher hit rate for the initial 1 cycle prediction.
|
| https://news.ycombinator.com/item?id=40953764
| interpunct wrote:
| > I've proven to my satisfaction that writing an interpreter in
| assembly is both fun and performant!
|
| Fun maybe/maybe not, but define "performant". I might drop into
| assembly for 1 or 2 orders of magnitude faster for a non-toy
| project. Even then, what are my requirements? What is the bus
| factor of LuaJIT again? Try getting support for s390x, or your
| patch accepted.
|
| Look at the speedup of Lua v. LuaJIT (C language VM v. C/Lua VM
| with JIT code gen):
|
| https://web.archive.org/web/20180430211146/https://luajit.or...
| kosolam wrote:
| I would love to understand this uxn thing better. Is it some
| academic experiment or does it have real-world applications? What's the
| deal with the additional return stack?
| rerdavies wrote:
| It would be interesting to see what happens if you turn on
| profile-guided optimization. This seems like a very good
| application for PGO.
|
| The 518ms profile result for the branch-table load is very
| peculiar. To my eyes, it suggests that the branch-table load
| is incurring cache misses at a furious rate. But I can't honestly
| think of why that would be. On a Pi-4 everything should fit in L2
| comfortably. Are you using a low-performance ARM processor like
| an A53?
| Validark wrote:
| I wish you wouldn't broadcast the sentiment contained in the
| first paragraph. Compilers lack the ability to consistently
| perform many basic optimizations to an embarrassing extent.
| Including even the ones you would think would be the first
| optimizations you'd implement when writing a compiler. Open up
| Godbolt and tell me if you still think the compiler knows best. I
| try to submit at least one issue to LLVM every time I look at the
| assembly it gives me. I say "try to" because sometimes it's too
| much to deal with.
|
| And by the way, probably the absolute worst things the compiler
| does are the "smart" things. I write my code knowing what the
| emit should look like, and the compiler sometimes thinks it has a
| better idea than how I wrote the code and makes the emit a lot
| more convoluted and slower than it would be if it just did what I
| said. Actually, I do know the machine decently well, thank you.
|
| Saying "centuries of engineering" is misleading. A lot of those
| centuries are people arguing about theoretical optimizations of
| dubious value while we still haven't even finished on the most
| basic optimizations that are obviously beneficial.
|
| The second paragraph making it sound all mysterious that
| compilers are even more awful at interpreters than normal code
| really speaks to a complete lack of exposure to the subject
| matter. This stuff is only esoteric because you couldn't be
| bothered to look at godbolt.org. Which is totally fine by the
| way, just don't go telling me how high quality compilers are if
| you've never even looked.
|
| That would be like me telling web developers that Dreamweaver
| produces phenomenal HTML and CSS, without ever looking at what it
| emits.
|
| Sorry for the rant but it just bothers me how much praise is
| heaped upon compilers like they are some kind of gift from God,
| forged by monks, wizards, and unrivaled geniuses. A compiler is
| just a tool. There is no magic.
| pjmlp wrote:
| "Any sufficiently advanced technology is indistinguishable from
| magic."
|
| When people call themselves engineers without ever setting foot
| in an engineering school with compiler development coursework,
| but rather a six-week bootcamp, their compilers feel like a
| sufficiently advanced technology.
| anonymoushn wrote:
| You really don't need a degree for this stuff. A typical
| degree has 1 class about this, and half of it is dedicated to
| formal language theory or the unfortunate practice of using
| parser generators.
| pjmlp wrote:
| Then my degree was atypical, as we had several classes
| about this, scattered around two years of other stuff.
|
| The theoretical stuff, history and evolution of programming
| languages, compiler and language design, and the actual toy
| language implementation all the way to native code.
|
| That was 30 years ago; maybe the quality of teaching has gone
| down since, I guess.
| intelVISA wrote:
| Any good 2 week bootcamp for mastering PLT??
| pjmlp wrote:
| Get _Modern Compiler Implementation in C_ [0] (with Java
| and ML versions also available), or the Kaleidoscope LLVM
| tutorial [1], and do the whole thing in 2 weeks, as a
| starting point.
|
| [0] - https://www.cs.princeton.edu/~appel/modern/c/
|
| [1] - https://llvm.org/docs/tutorial/
| teo_zero wrote:
| I was intrigued by this paragraph:
|
| > Making all of the opcode implementations the same size (padding
| to the size of the largest opcode implementation with .balign
| 256), then removing the jump table entirely. This was also
| slower, also probably because of cache friendliness: the opcode
| implementations go from 16.6 KiB total to 64 KiB.
|
| Probably normalizing the opcode implementation lengths to the
| _maximum_ length is not optimal. A possible experiment would be
| to normalize to the _modal_ length. I would expect most opcodes
| to be arithmetic or logic operations and to share the same
| implementation length. The (hopefully few) more complex ones
| would need a trampoline.
|
| Is this a crazy idea?
| phire wrote:
| _> The dispatch loop takes a single indirect branch to the
| opcode-specific implementation. This means that the branch will
| be nigh unpredictable!_
|
| Modern branch predictors can actually predict indirect branches
| with multiple destinations, because they hash recent branch
| history into the prediction. The exact same indirect branch will
| end up with multiple BTB entries, based on previous control flow.
|
| I was curious where the threaded code speedup is actually coming
| from. It's possible much of the speedup is coming from
| eliminating the extra branch back to the dispatch loop. Or maybe
| the history tracking in the M1's branch predictor doesn't work
| well for this type of control flow.
|
| So I checked out this code and modified the next macro, adding a
| dummy branch over a nop, to roughly isolate this factor.
|
| On my M1, the extra unconditional branch benchmarks at 1.08 sec
| for fib, and 1.19 sec for Mandelbrot (all other times were the
| same).
|
| Looks like the speedup is a mixture of the two. Eliminating that
| extra branch is responsible for about 20-30% of the speedup, and
| improving prediction on indirect branches is responsible for the
| other 70-80%.
| phire wrote:
| Ok, I spent quite a bit of time looking at performance
| counters, trying to understand what the M1's branch predictor
| was doing.
|
| The branch predictor is really accurate with a common
| dispatcher, it predicts those indirect branches correctly
| 99.25% of the time. Switching to threaded jumps improves this
| slightly to 99.75%, but not because the indirect branches are
| at different addresses. This improvement in accuracy is
| entirely because of the unconditional branch back to the
| dispatcher that was removed as a side effect. With that gone,
| the branch predictor can now track branch histories that are
| about twice as many VM instructions long.
|
| My modified version with a dummy branch in the threaded
| dispatcher negates this longer history and (according to the
| performance counters) results in the exact same 99.25%
| correctly predicted branches as the common dispatcher, yet it's
| still significantly faster, only 20% slower than threaded
| jumps.
|
| -------------------
|
| Why are threaded jumps faster on the M1, if it's not increasing
| branch prediction accuracy?
|
| Well, the M1 essentially has two branch predictors[1]. The
| faster one can return a prediction in one cycle, but it's not
| checking branch history at all, and it's almost always wrong
| for these unpredictable indirect branches. The slower predictor
| does take branch history into account, but takes three cycles
| to produce the correct result.
|
| Which means there is a short pipeline stall when the second
| prediction comes in. This stall doesn't show up as a
| BRANCH_INDIR_MISPRED_NONSPEC, because the branch was correctly
| predicted. Instead, it seems to show up in the FETCH_RESTART
| counter.
|
| So while threaded code doesn't improve overall branch
| prediction accuracy (except because of the longer history), it
| does slightly improve the accuracy of the faster branch
| predictor. With a common dispatcher, it predicts wrong almost
| 100% of the time, but with threaded code, the accuracy improves
| to 40%.
|
| _[1] Or at least that's a decent mental model. I suspect it's
| actually a single ITTAGE branch predictor that returns the
| zero-history result in 1 cycle_
| mkeeter wrote:
| Very cool, thanks for digging into this!
| the8472 wrote:
| 0x100002ec0: b 0x100002d1c ; jump back to the dispatch loop
|
| I'd expect tail call based control flow - i.e. computing the jump
| destination at the end of each opcode function - to be nicer to
| the branch predictor because it could track targets separately.
|
| Currently that's hard to achieve in Rust. There's an experiment
| in progress to see if guaranteed tail calls can be added to the
| language
|
| https://github.com/rust-lang/rust/issues/112788
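|
| In the meantime, the usual workaround is a trampoline: each
| handler returns the next handler as a function pointer and a tiny
| loop makes the call. A hedged sketch with made-up names; note it
| keeps the native stack flat but reintroduces a single shared
| indirect call site, so it gives up the separate-target benefit
| described above:
|
|     struct Vm {
|         pc: usize,
|         code: Vec<u8>,
|     }
|
|     // Newtype so the "returns the next handler" signature can
|     // refer to itself (a plain `type` alias can't be recursive).
|     struct Handler(fn(&mut Vm) -> Option<Handler>);
|
|     fn dispatch(vm: &mut Vm) -> Option<Handler> {
|         let op = vm.code[vm.pc];
|         vm.pc += 1;
|         match op {
|             0x00 => None,                  // halt
|             0x01 => Some(Handler(op_nop)),
|             _ => None,
|         }
|     }
|
|     fn op_nop(_vm: &mut Vm) -> Option<Handler> {
|         Some(Handler(dispatch))
|     }
|
|     fn run(vm: &mut Vm) {
|         let mut next = Handler(dispatch);
|         while let Some(h) = (next.0)(vm) {
|             next = h;
|         }
|     }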
___________________________________________________________________
(page generated 2024-07-13 23:01 UTC)