[HN Gopher] Cranelift code generation comes to Rust
___________________________________________________________________
Cranelift code generation comes to Rust
Author : ridruejo
Score : 299 points
Date : 2024-03-18 11:39 UTC (11 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| k_bx wrote:
| Any fresh compilation time benchmarks and comparisons to LLVM?
| zokier wrote:
| From the article:
|
| > A full debug build of Cranelift itself using the Cranelift
| backend took 29.6 seconds on my computer, compared to 37.5 with
| LLVM (a reduction in wall-clock time of 20%)
|
| That seems like a much smaller difference than I would have
| expected.
| __s wrote:
| Those numbers are for the entire compile time:
| parsing/typechecking/MIR + MIR optimization/linking.
|
| I don't have numbers handy, so it's hard to say how much faster
| Cranelift makes the codegen portion, but this gets into
| Amdahl's Law territory.
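The point about Amdahl's Law can be made concrete with a tiny calculation. The fractions below are purely hypothetical, for illustration only:

```rust
// Amdahl's Law: the overall speedup when only a fraction `p` of the
// total work is accelerated by a factor `s`.
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Hypothetical numbers: if codegen were 40% of a debug build and
    // Cranelift made that portion 10x faster, the whole build would
    // only get ~1.56x faster -- the other 60% still dominates.
    let overall = amdahl_speedup(0.4, 10.0);
    println!("overall speedup: {:.2}x", overall);
    assert!((overall - 1.5625).abs() < 1e-9);
}
```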
| Filligree wrote:
| And there are faster linkers than the default.
|
| ...why is it still the default?
| 3836293648 wrote:
| Compatibility with niche use cases.
| Filligree wrote:
| That sounds like a good reason to keep the existing one
| as an opt-in possibility.
| kibwen wrote:
| Because the linker provided by the system itself is the
| one that's guaranteed to be the most broadly compatible.
| For a toolchain to begin distributing an alternative
| linker is a lot of work on every supported platform
| (future versions of Rust may sidestep this issue by
| making LLD the default on certain platforms, which they
| get "for free" by dint of shipping alongside LLVM). But
| also, because there are about ten people on planet Earth
| qualified to write a production-grade linker, which is
| why LLD still isn't broadly recommended yet.
| boxed wrote:
| The very last paragraph:
|
| > A full debug build of Cranelift itself using the Cranelift
| backend took 29.6 seconds on my computer, compared to 37.5 with
| LLVM (a reduction in wall-clock time of 20%). Those wall-clock
| times don't tell the full story, however, because of
| parallelism in the build system. Compiling with Cranelift took
| 125 CPU-seconds, whereas LLVM took 211 CPU-seconds, a
| difference of 40%. Incremental builds -- rebuilding only
| Cranelift itself, and none of its dependencies -- were faster
| with both backends. 66ms of CPU time compared to 90ms.
| AlexErrant wrote:
| Not quite fresh, but
|
| > A paper from 2020 [0] showed that Cranelift was an order of
| magnitude faster than LLVM, while producing code that was
| approximately twice as slow on some benchmarks.
|
| [0] https://arxiv.org/pdf/2011.13127.pdf
| Findecanor wrote:
| The changes in Cranelift since 2020 have been quite
| significant so I would not put any trust in those benchmarks.
| nindalf wrote:
| A few reports comparing Cranelift to LLVM from this day-old
| reddit thread [1]:
|
| - 29.52s -> 24.47s (17.1%)
|
| - 27s -> 19s (29.6%)
|
| - 11.5s -> 8.4s (26.9%)
|
| - 37.5s -> 29.6s (21.1%) - this measurement from TFA.
|
| To put these numbers in context, all the perf improvements over
| the last 4 years have helped the compiler become faster on a
| variety of workloads by 7%, 17%, 13% and 15%, for an overall
| speed gain of 37% over 4 years. [2] So one large change
| providing a 20-30% improvement is very impressive.
|
| When you add that to the parallel frontend [3] and support for
| linking with LLD [4], Rust compilation could be substantially
| faster by this time next year.
|
| [1] -
| https://old.reddit.com/r/rust/comments/1bgyo8a/try_cranelift...
|
| [2] - https://nnethercote.github.io/2024/03/06/how-to-speed-up-
| the...
|
| [3] - https://blog.rust-lang.org/2023/11/09/parallel-rustc.html
|
| [4] - https://github.com/rust-lang/rust/issues/39915
| k_bx wrote:
| Thanks, that's really great news. The only thing I wish for
| now is aarch64 support on Apple Silicon
| chrisaycock wrote:
| This article provides an excellent overview of the latest in
| _speed of optimizer_ vs _quality of optimization_.
|
| In particular, copy-and-patch compilation is still the fastest
| approach because it uses pre-compiled code, though it leaves
| little room for optimization.
|
| Cranelift uses e-graphs to represent equivalence on the IR. This
| allows for more optimizations than the copy-and-patch approach.
|
| Of course, the most optimized output is going to come from a more
| traditional compiler toolchain like LLVM or GCC. But for users
| who want to get "fast enough" output as quickly as possible,
| newer compiler techniques provide a promising alternative.
| weinzierl wrote:
| From what I understand, the big advantage of the e-graphs
| approach is that the quality of the output is (within limits) a
| function of the time and memory given.
|
| The more memory, the more nodes can be generated in the e-graph;
| and the more time for the search, the better the selected node.
|
| It might never be as fast as copy-and-patch or as good as LLVM
| or GCC, but this flexibility is a value in itself.
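To make the idea above concrete, here is a toy sketch of an e-graph (not Cranelift's actual data structures): rewrites only *add* equalities rather than destroying the old form, a union-find tracks which nodes are equivalent, and a final extraction pass picks the cheapest node in the class:

```rust
// Toy e-graph sketch: node ids, a union-find over equivalence
// classes, and cost-based extraction.
#[derive(Clone, Debug, PartialEq)]
enum Node {
    Var,         // some input value x
    Mul2(usize), // (child id) * 2
    Shl1(usize), // (child id) << 1
}

struct EGraph {
    nodes: Vec<Node>,  // node id -> node
    class: Vec<usize>, // node id -> union-find parent
}

impl EGraph {
    fn add(&mut self, n: Node) -> usize {
        self.nodes.push(n);
        self.class.push(self.class.len());
        self.class.len() - 1
    }
    fn find(&self, mut i: usize) -> usize {
        while self.class[i] != i {
            i = self.class[i];
        }
        i
    }
    // Record that two nodes compute the same value.
    fn union(&mut self, a: usize, b: usize) {
        let (a, b) = (self.find(a), self.find(b));
        self.class[a] = b;
    }
    // Extraction: cheapest node in the class containing `id`.
    fn extract(&self, id: usize, cost: impl Fn(&Node) -> u32) -> Node {
        let root = self.find(id);
        self.nodes
            .iter()
            .enumerate()
            .filter(|&(i, _)| self.find(i) == root)
            .min_by_key(|&(_, n)| cost(n))
            .map(|(_, n)| n.clone())
            .unwrap()
    }
}

fn main() {
    let mut eg = EGraph { nodes: vec![], class: vec![] };
    let x = eg.add(Node::Var);
    let mul = eg.add(Node::Mul2(x));
    // The rewrite rule "v * 2 == v << 1" only adds an equality;
    // the multiply is kept, nothing is thrown away.
    let shl = eg.add(Node::Shl1(x));
    eg.union(mul, shl);
    // With shifts modeled as cheaper than multiplies, extraction
    // chooses the shift. More time/memory = more rewrites explored
    // = a better chance the cheap form is in the graph.
    let best = eg.extract(mul, |n| match n {
        Node::Mul2(_) => 4,
        Node::Shl1(_) => 1,
        Node::Var => 0,
    });
    assert_eq!(best, Node::Shl1(x));
    println!("extracted: {:?}", best);
}
```

This is also why phase ordering stops mattering: since no rewrite destroys information, rules can fire in any order.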
| thechao wrote:
| When we first reviewed the equality saturation paper, we
| thought there was one major pro, and one major con: [pro]
| phase-ordering invariant; [con] at the time there was no
| (believable) way to extract the transformed graph in a non-
| hand-wavy-way.
|
| Personally, I think e-graphs should be combined with Massalin
| superoptimization -- they're natural "duals" -- and just turn
| the whole exercise into a hill-climbing process. You can tune
| the total effort by the set of passes used, the amount of
| time to drive the graph to saturation, and the method (and
| time) for graph extraction.
| kapilsinha wrote:
| Agreed. Especially when developing, I'd assume speed of the
| optimizer matters more than quality of optimization. I wonder,
| though: if LLVM spent less time on that "phase ordering"
| problem, what would the tradeoff between these two factors be?
| boomanaiden154 wrote:
| LLVM doesn't really spend any runtime solving the phase-
| ordering problem, since the pass pipelines are static. There
| have been proposals to dynamically adjust the pipeline based
| on various factors, but those are a ways out from happening.
| vsnf wrote:
| Is there any literature or guidelines on what to do if I'm
| willing to spend effectively unlimited CPU cycles in return for
| a more optimized final output?
| thesz wrote:
| Superoptimizers:
| https://en.wikipedia.org/wiki/Superoptimization
|
| Also, program distillation: https://www.researchgate.net/publ
| ication/220989887_Distillat...
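As a sketch of what superoptimization means in practice, here is a miniature Massalin-style brute-force search over a made-up three-instruction accumulator ISA. (Real superoptimizers verify equivalence properly, e.g. with SAT/SMT solvers or exhaustive bit-level checks; matching on sample inputs as done here is only a cheap filter.)

```rust
// A toy superoptimizer: enumerate programs shortest-first and return
// the first one that agrees with the target on all test inputs.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Double, // acc = acc + acc
    AddX,   // acc = acc + x
    Neg,    // acc = -acc
}

// Execute a program; the accumulator starts as the input `x`.
fn run(prog: &[Op], x: i64) -> i64 {
    let mut acc = x;
    for op in prog {
        acc = match op {
            Op::Double => acc.wrapping_add(acc),
            Op::AddX => acc.wrapping_add(x),
            Op::Neg => acc.wrapping_neg(),
        };
    }
    acc
}

// All programs of exactly `len` instructions (cartesian product).
fn programs(len: usize) -> Vec<Vec<Op>> {
    let ops = [Op::Double, Op::AddX, Op::Neg];
    let mut out = vec![vec![]];
    for _ in 0..len {
        out = out
            .into_iter()
            .flat_map(|p| {
                ops.iter().map(move |&o| {
                    let mut q = p.clone();
                    q.push(o);
                    q
                })
            })
            .collect();
    }
    out
}

fn superoptimize(target: impl Fn(i64) -> i64) -> Vec<Op> {
    let tests = [-7, -1, 0, 1, 2, 9, 1000];
    for len in 0.. {
        for prog in programs(len) {
            if tests.iter().all(|&x| run(&prog, x) == target(x)) {
                return prog;
            }
        }
    }
    unreachable!()
}

fn main() {
    // Shortest way to compute 3*x in this ISA: double, then add x.
    let best = superoptimize(|x| 3 * x);
    assert_eq!(best.len(), 2);
    assert_eq!(run(&best, 5), 15);
    println!("{:?}", best);
}
```

The catch is visible in the structure: the search space grows exponentially with program length, which is why this approach only pays off for very short, hot sequences.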
| mort96 wrote:
| Isn't it a bit weird that this isn't just the standard?
| Like imagine if Chrome was optimized with such a
| superoptimizer; let the optimizer spend a couple hours
| every month or so when cutting a new release. Surely that
| has to be worth it?
| kibwen wrote:
| Doing a full-fledged release-this-to-millions-of-users
| build of Chrome or Firefox already takes on the order of
| 24 hours (or at least it did when last I checked a few
| years ago).
| remorses wrote:
| I am pretty sure the existing release process already
| takes much more than 2 hours.
| Filligree wrote:
| Is it just a couple of hours? Even just ordering the
| optimisation passes is already an NP-complete problem, so I
| could easily imagine superoptimisers taking trillions of years
| for a large program...
| MaxBarraclough wrote:
| I'm not a compiler engineer, but I think it's a
| diminishing returns issue. Modern optimising compilers
| are already impressive, and they get more impressive year
| on year, but only quite slowly. We don't see compilers
| producing code that runs 30% faster than the code
| generated last year, for instance.
|
| Such improvements can happen with compiler updates, but
| not when the starting point is a compiler that's already
| of decent quality. I recall some GPU driver updates from
| ATi (years ago) delivered pretty drastic performance
| improvements, but I believe that's because the original
| drivers were rather primitive.
|
| (Perhaps a drastic improvement to autovectorisation could
| give a 30% boost, or better, but this would apply only to
| certain programs.)
|
| You could grant a compute budget 100x the typical build
| set-up, but no one has built a production-ready compiler
| to take advantage of that, and if they did I suspect the
| improvements would be unimpressive. They may also run
| into a 'complexity ceiling' issue, as the compiler would
| presumably be even more complex than today's ordinary
| optimising compilers, which are already enormous.
|
| As Filligree says, superoptimisers tend to only be
| practical for very short programs. They can't be applied
| to monstrous codebases like Chromium.
| alserio wrote:
| But maybe you can superoptimize some hot sections and encode
| the superoptimizer findings somewhere. Then the compiler can
| validate the optimizations and apply them to that particular
| piece of code for the rest of the program's life, as long as
| the preconditions hold.
| thesz wrote:
| The ability of compilers to make code faster is 12
| (twelve) times slower than Moore's law: they double the
| program's speed in 18 years [1].
|
| [1] https://proebsting.cs.arizona.edu/law.html
|
| This might seem discouraging, but it is not - one can
| still reap the benefits of code optimization twelve times
| as long after Moore's law stops working.
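The "12 times slower" framing is easy to sanity-check, assuming Moore's law is taken as a doubling every 18 months:

```rust
fn main() {
    // Doubling every 18 years vs doubling every 18 months:
    // the doubling periods differ by a factor of 12.
    assert_eq!(18.0_f64 / 1.5, 12.0);

    // Per-year improvement implied by each doubling period.
    let compilers = 2_f64.powf(1.0 / 18.0) - 1.0; // ~3.9% per year
    let hardware = 2_f64.powf(1.0 / 1.5) - 1.0;   // ~58.7% per year
    println!(
        "compilers: {:.1}%/yr, hardware: {:.1}%/yr",
        compilers * 100.0,
        hardware * 100.0
    );
}
```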
| sanxiyn wrote:
| Have a look at Unison which uses constraint programming to do
| optimal code generation.
|
| https://unison-code.github.io/
| pjmlp wrote:
| Finally; looking forward to wider adoption.
| diggan wrote:
| Tried out the instructions from the article on a tiny Bevy
| project, and compared it to a "normal" build:
|
| > cargo build --release 23.93s user 22.85s system 66% cpu 1:09.88
| total
|
| > cargo +nightly build -Zcodegen-backend 23.52s user 21.98s
| system 68% cpu 1:06.86 total
|
| Seems just marginally faster than a normal release build. I
| wonder if there is something particular about Bevy that makes
| this so? The author of the article mentions a 40% difference in
| build speed, but I'm not seeing anything near that.
|
| Edit: just realized I'm caching my release builds with sccache
| and a local NAS, hence the release builds being as fast as
| Cranelift+debug builds. Trying it again with just debug builds
| and without any caching:
|
| > cargo +nightly build 1997.35s user 200.38s system 1878% cpu
| 1:57.02 total
|
| > cargo +nightly build -Zcodegen-backend 280.96s user 73.06s
| system 657% cpu 53.850 total
|
| Definitely an improvement once I realized what I did wrong, about
| half the time spent compiling now :) Neat!
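For anyone wanting to try the same thing: as far as I remember the setup from the Rust blog announcement (worth double-checking against the current rustc_codegen_cranelift docs), it is roughly:

```shell
# Install the Cranelift backend preview component on nightly
rustup component add rustc-codegen-cranelift-preview --toolchain nightly

# Use Cranelift for the dev profile of a single build
CARGO_PROFILE_DEV_CODEGEN_BACKEND=cranelift \
    cargo +nightly build -Zcodegen-backend
```

This matches the `-Zcodegen-backend` flag in the commands above; without the env var (or an equivalent profile setting), the flag alone may not select Cranelift.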
| naasking wrote:
| Maybe your build is not limited by code generation, which seems
| like the only thing that changed here. There was a good thread
| recently about the variation in what slows down compilation:
|
| https://news.ycombinator.com/item?id=39721922
| diggan wrote:
| Edited my comment now, forgot I was caching the release
| builds with sccache so wasn't actually compiling all the
| units, but fetching a lot of them instead :/
| vlovich123 wrote:
| Which suggests that maybe cargo should ship with sccache turned
| on by default, so that the "normal" experience of developing
| Rust is this seamless. Intelligent caching should always win.
| I'd also like to see rustc get sccache integration that works
| at the MIR level, so that changing something that doesn't
| meaningfully change the codegen still gets a cached response
| (e.g. changing some comments/whitespace, moving functions
| around, etc.).
|
| Even cooler would be if LLVM itself could also cache
| internal expensive parts of compilation and optimization
| across process instances. That would make a huge impact in
| cutting down incremental builds.
| SkiFire13 wrote:
| > I'd like to see rustc get sccache integration that
| works at the MIR level too so that changing something
| that doesn't change the code gen meaningfully still gets
| a cached response (e.g. changing some
| comments/whitespace, moving functions around, etc).
|
| Isn't incremental compilation already like that?
| kapilsinha wrote:
| Is that Bevy project open-source by any chance? I'd love to try
| it out myself.
| doctor_phil wrote:
| Bevyengine is open source and very findable by most search
| engines.
|
| Here is a direct link: https://github.com/bevyengine/bevy
| kapilsinha wrote:
| Ah yep, I'm aware of that. Bevy is a framework/engine that
| devs build on top of. I was referring to the project that OP
| has built on top of Bevy.
| Dowwie wrote:
| Would it be naive to assume a general compile-time reduction of
| 20% for all Rust projects by swapping llvm with cranelift?
| mrklol wrote:
| There are still projects out there which won't hit 20%, or even
| a few percent, if the bottleneck isn't codegen, for example. So
| the "all" part can be wrong, but besides that, 20% is a good
| number.
| Deukhoofd wrote:
| Does anyone by chance have benchmarks of runtime (so not the
| compile time) when using Cranelift? I'm seeing a mention of
| "twice as slow" in the article, but that's based on data from
| 2020. Wondering if it has substantially improved since then.
| cube2222 wrote:
| Slightly off-topic, but if you fancy writing compilers in your
| free time, Cranelift has a great Rust library[0] for doing code
| generation - it's a pleasure to use!
|
| [0]: https://docs.rs/cranelift-
| frontend/0.105.3/cranelift_fronten...
| loeg wrote:
| I've seen it used for really simple JIT in Advent of Code
| puzzles.
| Tehnix wrote:
| Very excited for Cranelift for debug builds to speed up
| development iteration - in particular for WASM/frontend Rust,
| where iteration speed competes with the new era of Rust-based
| tooling for JS, which sometimes lands in sub-1-second builds
| (iteration speed in frontend work is crucial).
|
| Sadly, it does not yet support ARM macOS, so us M1-3 users will
| have to wait a bit :/
| kapilsinha wrote:
| But usually, at least I don't build an executable while
| iterating. And if I do, I try to set up the feature flags so
| the build is minimal for the tests I am running
| Ericson2314 wrote:
| Really looking forward to the death of non-e-graph-based
| compilation :)
| convolvatron wrote:
| I've tried to look into this a couple of times, including
| today. To me this looks a lot like unification? But I don't
| really understand how, operationally, one gets from equivalence
| classes to instructions. Is there an e-graphs-for-dummies
| writeup?
| tekknolagi wrote:
| If by instructions you mean machine instructions, you don't;
| it's used for internal optimization passes only.
| convolvatron wrote:
| That helps. So this is really an alternative to pushing
| analysis attributes around a dataflow graph.
| CodesInChaos wrote:
| You can use different backends and optimization for different
| crates. It often makes sense to use optimized LLVM builds for
| dependencies, and debug LLVM or even Cranelift for your own code.
|
| See
| https://www.reddit.com/r/rust/comments/1bhpfeb/vastly_improv...
| zozbot234 wrote:
| I would not expect this to work without issues, as Rust does
| not support a stable binary ABI across different compiler
| versions. How can we be sure that the two "codegen backends"
| will always be implementing the same binary ABI?
| Ar-Curunir wrote:
| This is for local work, so you use the same compiler for your
| dependencies and for your own code. Only the codegen backend
| differs.
| sanxiyn wrote:
| In theory, yes, this could be problematic with alternative Rust
| implementations, but in this case it works fine, and many
| people are using it. We can be sure the LLVM and Cranelift
| codegen backends will implement the same binary ABI, because
| binary ABI decisions are made in the shared code.
| aw1621107 wrote:
| I suppose that might depend on how the ABI is represented
| internally. If the ABI is fully described by (one of?) the
| lowered IR the backends consume (e.g., does MIR fully
| describe struct layouts/etc.) then I wouldn't expect there to
| be any issues outside "regular" bugs since all the relevant
| information would be contained in the inputs.
| sanxiyn wrote:
| It is done in the shared code, see https://rustc-dev-
| guide.rust-lang.org/backend/backend-agnost... for details.
| CodesInChaos wrote:
| Struct layout happening in generic code makes sense (and
| is actually required since you can reference it in
| `const`s). It seems unlikely that they made function
| calling fully backend agnostic, since it'd require
| assigning the registers for parameters and result in
| generic code, and not in the backend.
|
| I'd expect the generic code to lower the function
| parameters to primitive types (pointers, ints, floats,
| etc.), but the backend would then distribute those over
| registers and/or stack. Keeping that compatible would
| still require an (unstable) specification implemented by
| all compatible backends.
|
| Unwinding might be tricky as well.
| aw1621107 wrote:
| Ah, I think that's about what I had in mind. Just didn't
| know what to look for. Thanks!
| devit wrote:
| Presumably the LLVM and Cranelift backends in the same rustc
| version generate code with the same ABI, and it would be a
| bug if that wasn't the case.
| PartiallyTyped wrote:
| If you do the compilation locally, you know the structure of
| the ABI for the compiled code. You know that regardless of
| the backend you are using. At this stage, no functions with
| generics are compiled, only what is non-generic.
|
| What is left is to account for the ABI in any
| monomorphizations that occur at the boundary, i.e. when your
| own structures monomorphize generic functions in the
| dependency.
|
| When the compiler creates this monomorphic variant, and
| lowers down, it can provide the necessary details of the ABI.
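A minimal sketch of the scenario described above; `largest` is a made-up stand-in for a generic function exported by a dependency:

```rust
// "Dependency" side: for a generic function, only generic MIR is
// shipped -- no machine code exists yet for any concrete T.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
    let mut best = items[0];
    for &it in &items[1..] {
        if it > best {
            best = it;
        }
    }
    best
}

fn main() {
    // "User" side: these calls force two monomorphic copies,
    // largest::<i32> and largest::<f64>, both generated in this
    // crate by whichever backend compiles it -- which is why the
    // ABI details can be filled in at this point.
    assert_eq!(largest(&[3, 9, 2]), 9);
    assert_eq!(largest(&[1.5, 0.5]), 1.5);
}
```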
| aseipp wrote:
| Because the ABI is defined and implemented by shared code in
| the runtime and compiler. There is no need for cross version
| compatibility between the two backends, only compatibility
| within the same version. This isn't particularly new either,
| FWIW. The Glasgow Haskell Compiler also has an unstable ABI
| that is not standardized, but LLVM compiled code
| interoperates seamlessly with code generated by the non-LLVM
| compiler backend (the mechanism by which that is achieved is
| likely different though.)
| metadat wrote:
| The Equality Graphs link [0] led me to discover ESC/Java [1] [2].
| Has anyone actually tried or had any success with ESC/Java? It's
| piqued my curiosity to compare with SpotBugs (formerly known as
| FindBugs).
|
| [0] https://en.wikipedia.org/wiki/E-graph
|
| [1] https://en.wikipedia.org/wiki/ESC/Java
|
| [2] https://www.kindsoftware.com/products/opensource/escjava2/
| ad-ops wrote:
| I see that there are many comments on full debug builds, but
| for me the most important difference is incremental build time
| when making minor changes. In my opinion this is what speeds up
| development iterations.
|
| Here are my build times when making a trivial change to a print
| statement in a root function, comparing nightly dev vs adding
| cranelift + mold, for rust-analyzer[0] (347_290 LoC) and gleam[1]
| (76_335 LoC):
|
|     $ time cargo build
|     Compiling rust-analyzer v0.0.0 (/home/user/repos/rust-analyzer/crates/rust-analyzer)
|     # nightly
|     Finished `dev` profile [unoptimized] target(s) in 6.60s
|     cargo build  4.18s user 2.51s system 100% cpu 6.650 total
|     # cranelift+mold
|     Finished `dev` profile [unoptimized] target(s) in 2.25s
|     cargo build  1.77s user 0.36s system 92% cpu 2.305 total
|
|     Compiling gleam v1.0.0 (/home/user/repos/gleam/compiler-cli)
|     # nightly
|     Finished `dev` profile [unoptimized + debuginfo] target(s) in 4.69s
|     cargo build --bin gleam  3.02s user 1.74s system 100% cpu 4.743 total
|     # cranelift+mold
|     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.99s
|     cargo build --bin gleam  0.71s user 0.20s system 88% cpu 1.033 total
|
| For me this is the most important metric, and it shows a huge
| improvement. Comparing it to Go building Terraform[2] (371_594
| LoC), it is looking promising. The comparison is a bit unfair,
| since the Go build is a release build, which is really nice in
| CI/CD. I love Go's compilation times, and I thought it would be
| nice to compare with another language to show the huge
| improvements that Rust has made:
|
|     $ time go build
|     go build  3.62s user 0.76s system 171% cpu 2.545 total
|
| I was looking forward to parallel front-end[3], but I have not
| seen any improvement for these small changes.
|
| [0] https://github.com/rust-lang/rust-analyzer
|
| [1] https://github.com/gleam-lang/gleam
|
| [2] https://github.com/hashicorp/terraform
|
| [3] https://blog.rust-lang.org/2023/11/09/parallel-rustc.html
|
| *edit: code-comments & links + making it easier to see the
| differences
| ad-ops wrote:
| For completeness, here is the compilation time for a small
| project from the axum/examples[0] (125 LoC), also comparing
| nightly dev vs adding cranelift + mold:
|
|     $ time cargo build
|     # nightly
|     Compiling example-todos v0.1.0 (/home/user/ws/rust/example-todos)
|     Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.65s
|     cargo build  1.49s user 0.58s system 123% cpu 1.685 total
|     # cranelift+mold
|     Compiling example-todos v0.1.0 (/home/user/ws/rust/example-todos)
|     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.55s
|     cargo build  0.47s user 0.13s system 102% cpu 0.590 total
|
| [0] https://github.com/tokio-rs/axum/tree/main/examples/todos
| kapilsinha wrote:
| This is most definitely self-promotion (I am the author), but
| also extremely relevant if you are looking at incremental build
| speed-ups: https://news.ycombinator.com/item?id=39743431
| rishav_sharan wrote:
| It sucks that there is no way to use cranelift from outside of
| rust to create your own toy language. I would have loved to use
| cranelift in a toy compiler, but I am not ready to pay the Rust
| price of complexity.
| sigsev_251 wrote:
| There is always qbe if you need something simpler
| kibwen wrote:
| As far as I know, you should be able to generate textual CLIF
| files and feed them to Cranelift:
| https://github.com/bytecodealliance/wasmtime/blob/main/crane...
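For the curious, a textual CLIF function looks roughly like this (written from memory, so treat it as an approximation and check the CLIF language reference for the exact current syntax):

```
function %add(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = iadd v0, v1
    return v2
}
```

Emitting this text from any language and feeding it to the Cranelift tools avoids the Rust API entirely.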
| posix_monad wrote:
| Can anyone explain why Cranelift is expected to be faster than
| LLVM? And why those improvements can't also be applied to LLVM?
| Filligree wrote:
| It uses E-graphs, as explained in the article. That's a
| completely different approach to compilation, and to use it in
| LLVM you'd have to rewrite LLVM.
|
| It's also unlikely that the resulting code will ever be as fast
| as a traditional compiler's output. It's great for development,
| but I wouldn't use it in a release build.
| Rusky wrote:
| I don't believe the performance difference between Cranelift
| and LLVM really has much to do with E-graphs. Cranelift had
| this same performance profile (faster than LLVM but
| generating slower output) before they switched to E-graphs.
|
| Rather it's about how much effort Cranelift puts toward
| optimizing its output- it has fewer, less involved passes
| (regardless of whether those "passes" are expressed as part
| of the E-graph framework). More subtly, this means it is also
| written to generate "okay" code without as much reliance on
| those passes- while LLVM on the other hand generates a lot of
| naive code at -O0 which contributes to its slower compile
| times in that mode.
| norman784 wrote:
| > why those improvements can't also be applied to LLVM?
|
| LLVM is a huge and bloated ecosystem (it has tons of tools and
| millions of LoC). The code base itself is also pretty old, or
| rather the project shows its age, so there's a lot of legacy
| code. Another aspect is that it's hard to try new/radical
| things because of how big the project itself is.
| papruapap wrote:
| IIRC Zig wants to replace LLVM for similar reasons; it's also
| hard to debug.
| posix_monad wrote:
| X is too bloated! We need Y, which is leaner and more
| tailored to our use-case.
|
| _(years pass...)_
|
| Y is too bloated! We need Z...
| dymk wrote:
| I don't get it. Is your point that new software shouldn't
| be written to fulfill new usecases? We should always bolt
| new patterns and architecture onto existing codebases?
| alserio wrote:
| Yes. But in the meantime one can experiment with new things and
| test some assumptions. I don't know about the LLVM project
| specifically, but it's quite a bit more difficult to try
| different ideas in a big, old organization and codebase than it
| is in a greenfield project. Maybe they won't be successful;
| maybe they will move the whole field ahead.
| mort96 wrote:
| I've heard compiler writers complain that LLVM's API is
| inherently slow. Something about using runtime polymorphism
| everywhere. I don't know how much there is to this though, I
| haven't investigated it myself.
| tsegratis wrote:
| I feel like I'm reading an advertising blurb when reading that
| article.
|
| I wish them every success, but I hope for a more balanced
| overview of pros and cons rather than gushing praise at every
| step...
| rayiner wrote:
| Very interesting article. I had not heard of equality graphs
| before. Here's some pretty good background reading on the
| subject:
| https://inst.eecs.berkeley.edu/~cs294-260/sp24/2024-03-04-eq...
| mmoskal wrote:
| > JIT compilers often use techniques, such as speculative
| optimizations, that make it difficult to reuse the compiler
| outside its original context, since they encode so many
| assumptions about the specific language for which they were
| designed.
|
| > The developers of Cranelift chose to use a more generic
| architecture, which means that Cranelift is usable outside of the
| confines of WebAssembly.
|
| One would think this has more to do with Wasm being the source
| language, as it's fairly generic (compared to JS or Python), so
| there are no specific assumptions to encode.
|
| Great article though. It's quite interesting to see E-matching
| used in compilers, took me down a memory lane (and found myself
| cited on Wikipedia page for e-graphs).
___________________________________________________________________
(page generated 2024-03-18 23:00 UTC)