[HN Gopher] Data-driven performance optimization with Rust and Miri
___________________________________________________________________
Data-driven performance optimization with Rust and Miri
Author : dmit
Score : 75 points
Date : 2022-12-09 14:48 UTC (8 hours ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| zackangelo wrote:
| An often overlooked option for profiling Rust is Apple's
| Instruments.app. It's amazing and usually the first thing I reach
| for when I need a profiler on Mac OS X.
| dceddia wrote:
| Yeah, Instruments is pretty great and it can even do that thing
| the article mentions, showing the hot lines of code annotated
| with their percentage of runtime. It's not always perfect but
| at least it's a quick way to know where to start looking.
| lesuorac wrote:
| > In order to get any useful results I had to run my Advent of
| Code solution 1000 times in a loop
|
| Yeah, that's my general problem with all these flamegraphs and
| other time-based tools. There's a bunch of noise!
|
| I'd imagine for something with deterministic GC (or, hell, no GC)
| you should be able to get an "instruction count"-based approach
| that'd be much more deterministic about which version of the
| code is fastest (for that workflow).
| dahfizz wrote:
| An instruction count would be a decent first-order
| approximation, but I don't think it will be very useful for
| optimization.
|
| The bottleneck of modern hardware (generally) is memory. You
| can get huge speedups by tweaking the way your program
| structures and operates on data to make it more cache friendly.
| This won't really affect instruction count but could make your
| program run 2x faster.
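|
| A sketch of that point (hypothetical Particle type): both sums
| execute roughly the same number of instructions, but the
| struct-of-arrays layout touches far fewer cache lines, so it
| tends to run much faster on large inputs.
|
|     // Array-of-structs: each score sits 56 bytes from the next,
|     // so summing drags every particle's pos/vel through the
|     // cache alongside it.
|     #[allow(dead_code)]
|     struct ParticleAos {
|         pos: [f64; 3],
|         vel: [f64; 3],
|         score: f64,
|     }
|
|     // Struct-of-arrays: scores are contiguous, so each 64-byte
|     // cache line carries eight useful values.
|     struct ParticlesSoa {
|         score: Vec<f64>,
|         // pos/vel would live in their own Vecs
|     }
|
|     fn sum_aos(ps: &[ParticleAos]) -> f64 {
|         ps.iter().map(|p| p.score).sum()
|     }
|
|     fn sum_soa(ps: &ParticlesSoa) -> f64 {
|         ps.score.iter().sum()
|     }
|
|     fn main() {
|         let n = 1_000_000;
|         let aos: Vec<ParticleAos> = (0..n)
|             .map(|i| ParticleAos { pos: [0.0; 3], vel: [0.0; 3], score: i as f64 })
|             .collect();
|         let soa = ParticlesSoa { score: (0..n).map(|i| i as f64).collect() };
|         assert_eq!(sum_aos(&aos), sum_soa(&soa));
|     }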
| lesuorac wrote:
| Somebody else pointed out that I probably want to use
| Valgrind. It seems to be able to count instructions as well
| as cache misses [1] [2].
|
| [1]: https://web.stanford.edu/class/archive/cs/cs107/cs107.1202/r...
|
| [2]: http://www.codeofview.com/fix-rs/2017/01/24/how-to-optimize-...
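|
| From the Valgrind docs, a typical run looks something like this
| (standard cachegrind options; the binary path is hypothetical):
|
|     # simulate caches and branch prediction while counting
|     # instructions; ~20-100x slowdown, but deterministic output
|     valgrind --tool=cachegrind --branch-sim=yes ./target/release/day09
|
|     # annotate the per-function / per-line counts it recorded
|     cg_annotate cachegrind.out.<pid>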
| dahfizz wrote:
| Yup! valgrind is an amazing tool. Just keep in mind that
| the program itself will be slower when run under valgrind.
| It's great for debugging, but your benchmarks should be run
| raw.
| dmit wrote:
| And when all else fails, you reach for Intel VTune. :) It has
| had a free community version for a couple of years now, I
| might add.
| hinkley wrote:
| I have been wondering for a while if the solution to the
| memory bottleneck is to stop treating cache memory as an
| abstraction and make it directly addressable. Removing L1
| cache is probably impossible, but I suspect removing L2 and
| L3 caches and replacing them with working memory might fare
| better.
|
| The latest iteration of this thought process was wondering
| what would happen if you exposed microcode for memory
| management to the kernel, and it ran a moral equivalent of
| eBPF directly in a beefed-up MMU. Legacy code and code that
| doesn't deign to do its own management would elect to use
| routines that maintain the cache abstraction. The kernel
| could also segment the caches for different processes,
| reducing the surface area for cache-related bugs.
| foldr wrote:
| One nice thing about Go's built-in benchmarking is that it
| automatically takes care of running benchmarks repeatedly until
| runtime is stable and then averaging the time over a number of
| hot runs.
| pornel wrote:
| Rust has libraries like https://lib.rs/criterion that take care
| of running code 1000 times in a loop, with proper timing,
| elimination of outliers, etc.
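|
| A minimal criterion benchmark looks roughly like this (the
| solve function and its input are stand-ins):
|
|     use criterion::{black_box, criterion_group, criterion_main, Criterion};
|
|     // stand-in for the function under test
|     fn solve(input: &str) -> usize {
|         input.lines().count()
|     }
|
|     fn bench_solve(c: &mut Criterion) {
|         let input = "1\n2\n3\n".repeat(1000);
|         // criterion picks the iteration count itself and reports a
|         // statistical summary rather than a single noisy timing
|         c.bench_function("solve", |b| b.iter(|| solve(black_box(&input))));
|     }
|
|     criterion_group!(benches, bench_solve);
|     criterion_main!(benches);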
| kaba0 wrote:
| Valgrind (and "friends") are exactly this, they can measure
| cache misses, branch mispredictions, etc.
| hinkley wrote:
| Something about flamegraphs has been bugging me for a while,
| and as more languages adopt async/await semantics it's coming
| to a head: flamegraphs work well for synchronous, sequential
| code. I don't think it's an accident that a lot of people first
| encountered them for Javascript, in the era immediately before
| Promises became a widely known technique. They actually worked
| then, but now they're quickly becoming more trouble than
| they're worth.
|
| A long time ago when people still tried to charge for
| profilers, I remember one whose primary display was a DAG, not
| unlike the way some microservices are visualized today. Only
| the simplest cases of cause and effect can be adequately
| displayed as a flamegraph. For anything else it's, as you say,
| all noise.
| Arnavion wrote:
| Your asynchronous code is made up of synchronous segments
| between the yield points. The flamegraph measures those just
| fine.
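|
| A sketch of what that means (hypothetical workload; tokio
| assumed just to have an executor):
|
|     // Each stretch between .await points runs synchronously, so a
|     // sampling profiler attributes it to the enclosing function as
|     // usual; only time spent suspended is invisible.
|     async fn handle(n: u64) -> u64 {
|         let parsed = expensive_parse(n);  // sync segment 1
|         tokio::task::yield_now().await;   // yield point
|         expensive_render(parsed)          // sync segment 2
|     }
|
|     fn expensive_parse(n: u64) -> u64 { (0..n).sum() }
|     fn expensive_render(n: u64) -> u64 { n.wrapping_mul(31) }
|
|     #[tokio::main]
|     async fn main() {
|         println!("{}", handle(10_000_000).await);
|     }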
| gavinray wrote:
| Whaaat, Chrome has a built-in flamegraph profiler that you can
| use with profiling data from languages like Rust (and presumably
| others)?!
|
| Sweet tip.
| oxff wrote:
| https://fasterthanli.me/articles/when-rustc-explodes &
| https://fasterthanli.me/articles/why-is-my-rust-build-so-slo...
|
| These two articles also showcase how to use Chrome for this.
| nszceta wrote:
| Similarly, py-spy is a sampling profiler for Python programs.
| It lets you visualize what your Python program is spending time
| on without restarting the program or modifying the code in any
| way. py-spy is extremely low overhead: it is written in Rust
| for speed and doesn't run in the same process as the profiled
| Python program. This means py-spy is safe to use against
| production Python code.
|
| I'm not sure if it exports results in a format Chrome can
| render but it does produce great interactive SVGs and is
| compatible with speedscope.app
|
| https://github.com/benfred/py-spy
|
| https://github.com/jlfwong/speedscope
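|
| Typical usage, for reference (standard py-spy flags; the PID is
| a placeholder):
|
|     # attach to a running process, no restart or code changes needed
|     py-spy record --pid 1234 -o profile.svg
|
|     # or emit a profile for speedscope.app instead of an SVG
|     py-spy record --pid 1234 --format speedscope -o profile.json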
| dralley wrote:
| There is also https://profiler.firefox.com/
| ZeroGravitas wrote:
| Talking of data-driven, I think I read that the Rust compiler
| team checks itself against some massive list of popular crates to
| check that it doesn't break anything.
|
| Would it be a reasonable use of resources to run all those test
| suites and identify hot spots for community-wide optimization?
| nicoburns wrote:
| > I read that the Rust compiler team checks itself against some
| massive list of popular crates to check that it doesn't break
| anything.
|
| Yes, that "massive list" being every single crate in the
| crates.io repository.
|
| > Would it be a reasonable use of resources to run all those
| test suites and identify hot spots for community-wide
| optimization?
|
| I believe the approach on the perf side of things has been to
| take reports of crates that are particularly slow to compile
| (even in just one part of the compiler) and create benchmarks
| from those. I believe this is partly because running against
| single crate would be too slow, and partly because it would be
| a moving target (as new versions of crates are released) and
| thus would make it hard to track performance accurately.
| estebank wrote:
| As crater was already linked to in a sibling comment, the
| performance suite dashboard can be seen at
| https://perf.rust-lang.org/, and the suite itself at
| https://github.com/rust-lang/rustc-perf/tree/master/collecto...
| tialaramex wrote:
| I think what ZeroGravitas was getting at was the idea of
| optimising _the actual software_ rather than the compiler.
|
| For example you could imagine noticing that people make a lot
| of Vecs with 400 or 800 things in them, but not so many with
| 500 or 1000 things in them and so maybe the Vec growth rule
| needs tweaking to better accommodate that.
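|
| That growth pattern is easy to observe empirically (the
| doubling below is current std behavior, not a documented
| guarantee):
|
|     fn main() {
|         let mut v: Vec<u64> = Vec::new();
|         let mut last = v.capacity();
|         for i in 0..1000 {
|             v.push(i);
|             if v.capacity() != last {
|                 last = v.capacity();
|                 // prints 4, 8, 16, ..., 512, 1024: a 500-element Vec
|                 // gets the same allocation as one holding 1000
|                 println!("len {:4} -> capacity {}", v.len(), last);
|             }
|         }
|     }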
| lmkg wrote:
| The tool you're referring to is called Crater:
| https://github.com/rust-lang/crater.
| Georgelemental wrote:
| Miri is not really meant for performance profiling; it runs on
| unoptimized MIR, which has very different performance from LLVM-
| optimized machine code.
| pornel wrote:
| This is really important. I think the author found the results
| surprising and unintuitive because they've been looking at the
| wrong thing.
|
| Rust libraries are designed to be fast when optimized with
| LLVM. Rust has a lot of layers of abstractions, and they're
| _zero_ cost only when everything is fully optimized. If you
| look at unoptimized Miri execution, or insert code-level
| instrumentation that gets in the way of the optimizer, these
| aren't zero cost any more, and overheads add up where they
| normally don't exist.
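|
| A tiny illustration (hypothetical example): in a release build
| these two typically compile to the same tight loop; under Miri
| or a debug build, the iterator version pays for every adapter
| layer literally.
|
|     // three adapter layers: iter -> copied -> filter -> sum
|     fn sum_even_iter(xs: &[u64]) -> u64 {
|         xs.iter().copied().filter(|&x| x % 2 == 0).sum()
|     }
|
|     // the loop the optimizer reduces the above to
|     fn sum_even_loop(xs: &[u64]) -> u64 {
|         let mut total = 0;
|         for &x in xs {
|             if x % 2 == 0 {
|                 total += x;
|             }
|         }
|         total
|     }
|
|     fn main() {
|         let xs: Vec<u64> = (0..1_000).collect();
|         assert_eq!(sum_even_iter(&xs), sum_even_loop(&xs));
|     }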
| LegionMammal978 wrote:
| As a caveat, it can still be useful for profiling constant-
| evaluated code to improve compile time, since Miri is built on
| the same evaluator that the compiler uses.
| ttfkam wrote:
| I don't understand why you felt the need to point this out when
| the linked article explicitly mentioned it.
|
| > Important note: Miri is not intended to accurately replicate
| optimized Rust runtime code. Optimizing for Miri can sometimes
| make your real code slower, and vice versa. It's a helpful tool
| to guide your optimization, but you should always benchmark
| your changes with release builds, not with Miri.
| IshKebab wrote:
| I think they undersell the difference. You'd be laughed out
| of the room if you profiled debug code.
| Tobu wrote:
| The author was gently panned when this was originally posted
| on Reddit, and they added that note, but now that the article
| is being reposted uncritically where people may not know the
| difference, it's worth pointing out again. The way the author
| got a profile out of Miri is creative, but Miri was never a
| helpful guide for profiling. It seems that the second
| benchmark they used for confirmation was also unoptimised
| (run without --release, which is a rookie mistake). They then
| drew wrong conclusions from flawed observations about the
| costs of abstractions like range.contains.
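|
| Concretely, for the range.contains case: with --release both of
| these compile to the same pair of comparisons, so any "cost" of
| the abstraction measured without optimizations is an artifact
| (a sketch, not the article's exact code):
|
|     fn in_range_contains(x: u32) -> bool {
|         (10..100).contains(&x) // the "expensive" abstraction
|     }
|
|     fn in_range_manual(x: u32) -> bool {
|         x >= 10 && x < 100 // what the optimizer reduces it to
|     }
|
|     fn main() {
|         for x in [0, 9, 10, 99, 100, 200] {
|             assert_eq!(in_range_contains(x), in_range_manual(x));
|         }
|     }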
| mullr wrote:
| Every Linux C/C++/Rust developer should know about
| https://github.com/KDAB/hotspot. It's convenient and fast. I use
| it for Rust all the time, and it provides all of these features
| on the back of regular old `perf`.
| galangalalgol wrote:
| What perf record settings do you use? Trying to use dwarf has
| never worked well for me with Rust, so I've been using lbr, but
| even then it seems like it attributes instructions to the wrong
| function a significant portion of the time.
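|
| For reference, the two variants look like this (standard perf
| options; binary name is a placeholder). dwarf unwinding also
| wants debug info kept in release builds (debug = true under
| [profile.release] in Cargo.toml):
|
|     # DWARF-based unwinding: bigger samples, works on most CPUs
|     perf record --call-graph dwarf ./target/release/mybin
|
|     # LBR-based unwinding: cheaper, needs a recent-ish Intel CPU
|     perf record --call-graph lbr ./target/release/mybin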
| LAC-Tech wrote:
| My key takeaway from this is different - be very sceptical of
| third-party packages! Both performance issues were traced back to
| them, and his replacements for their functionality - while not
| "battle tested" and surely constituting "re-inventing the
| wheel" - were faster, easier to read, and easier to understand.
|
| Any front-end devs reading this? :)
| Arcuru wrote:
| > The most surprising thing for me is how unintuitive it is to
| optimize Rust code given that it's honestly hard to find a Rust
| project that doesn't loudly strive to be "blazingly fast". No
| language is intrinsically fast 100% of the time, at least not
| when a mortal like me is behind the keyboard. It takes work to
| optimize code, and too often that work is guess-and-check.
|
| I get the feeling that a lot of Rust projects claim to be
| "blazingly fast" just because they are written in Rust, and not
| because they've made any attempts to actually optimize it. I
| rarely see any realistic benchmarks, and the few times I've
| looked deeply into the designs they are not implemented with
| execution speed in mind, or in some cases prematurely optimized
| in a way that is actively detrimental [1].
|
| Personally I think it's because so many of the new Rust
| programmers are coming from scripting languages, so everything
| feels fast. I don't have any problems with that, but I'd advise
| anyone seeing a "blazingly fast" Rust project to check if the
| project has even a single reasonable benchmark to back that up.
|
| [1] https://jackson.dev/post/rust-coreutils-dd/
| mcqueenjordan wrote:
| https://github.com/BurntSushi/ripgrep
|
| https://github.com/rust-lang/hashbrown
|
| https://github.com/briansmith/ring
|
| https://github.com/rust-random/rand
|
| Lots of Rust programmers are also coming from C, C++, and Go,
| btw.
| kaba0 wrote:
| Good examples, but one of those languages is not like the
| others, no matter how well it tries to blend in.
| burntsushi wrote:
| So wait, are you complaining that Rust projects advertise
| themselves as "blazingly fast" but actually aren't? Or are you
| complaining that not all Rust projects are fast? If it's the
| former, I don't think uutils advertises itself as "blazingly
| fast." And if it's the latter, then that... kind of seems
| unreasonable?
| mustache_kimono wrote:
| > the few times I've looked deeply into the designs they are
| not implemented with execution speed in mind, or in some cases
| prematurely optimized in a way that is actively detrimental
| [1].
|
| I'm not sure citing uutils/coreutils as an example is fair. I
| love that project. I learned Rust contributing to that project.
| However, as the blog entry you cite itself notes:
|
| > I saw the maintainers themselves mention that a lot of the
| code quality isn't great since a lot of contributions are from
| people who are very new to Rust
|
| I'm sure plenty of my slow code is still in `ls` and `sort` and
| that's okay?
| Jarred wrote:
| "blazing fast" often means "I want to say it's fast but I've
| never benchmarked it"
| [deleted]
| [deleted]
___________________________________________________________________
(page generated 2022-12-09 23:01 UTC)