[HN Gopher] Data-driven performance optimization with Rust and Miri
       ___________________________________________________________________
        
       Data-driven performance optimization with Rust and Miri
        
       Author : dmit
       Score  : 75 points
       Date   : 2022-12-09 14:48 UTC (8 hours ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | zackangelo wrote:
       | An often overlooked option for profiling Rust is Apple's
       | Instruments.app. It's amazing and usually the first thing I reach
       | for when I need a profiler on Mac OS X.
        
         | dceddia wrote:
         | Yeah, Instruments is pretty great and it can even do that thing
         | the article mentions, showing the hot lines of code annotated
         | with their percentage of runtime. It's not always perfect but
         | at least it's a quick way to know where to start looking.
        
       | lesuorac wrote:
       | > In order to get any useful results I had to run my Advent of
       | Code solution 1000 times in a loop
       | 
        | Yeah, that's my general problem with all these flamegraphs and
        | other time-based tools. There's a bunch of noise!
        | 
        | I'd imagine for something with deterministic GC (or, hell, no
        | GC) you should be able to get an "instruction count"-based
        | approach that'd be much more deterministic about which version
        | of the code is fastest (for that workflow).
        
         | dahfizz wrote:
          | An instruction count would be a decent first-order
          | approximation, but I don't think it will be very useful for
          | optimization.
         | 
         | The bottleneck of modern hardware (generally) is memory. You
         | can get huge speedups by tweaking the way your program
         | structures and operates on data to make it more cache friendly.
         | This won't really affect instruction count but could make your
         | program run 2x faster.
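          | 
          | A sketch of the kind of change I mean (hypothetical, not
          | from the article): summing one field of a struct-of-arrays
          | touches far fewer cache lines than walking an array of
          | structs, even though the instruction counts are nearly
          | identical.
          | 
          |     // Array-of-structs: each iteration drags a whole
          |     // 24-byte Point through the cache to read 8 bytes.
          |     struct Point { x: f64, y: f64, z: f64 }
          |     fn sum_x_aos(points: &[Point]) -> f64 {
          |         points.iter().map(|p| p.x).sum()
          |     }
          | 
          |     // Struct-of-arrays: the x values are contiguous, so a
          |     // 64-byte cache line holds 8 useful f64s instead of ~2-3.
          |     struct Points { xs: Vec<f64>, ys: Vec<f64>, zs: Vec<f64> }
          |     fn sum_x_soa(points: &Points) -> f64 {
          |         points.xs.iter().sum()
          |     }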
        
           | lesuorac wrote:
            | Somebody else pointed out that I probably want to use
            | Valgrind. It seems to be able to count instructions as well
            | as cache misses [1] [2].
            | 
            | [1]: https://web.stanford.edu/class/archive/cs/cs107/cs107.1202/r...
            | [2]: http://www.codeofview.com/fix-rs/2017/01/24/how-to-optimize-...
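            | 
            | For example, to get instruction counts and cache stats for
            | a release build (binary name hypothetical):
            | 
            |     valgrind --tool=callgrind ./target/release/day09
            |     callgrind_annotate callgrind.out.<pid>
            | 
            |     valgrind --tool=cachegrind ./target/release/day09
            |     cg_annotate cachegrind.out.<pid>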
        
             | dahfizz wrote:
              | Yup! Valgrind is an amazing tool. Just keep in mind that
              | the program itself will be slower when run under
              | Valgrind. It's great for debugging, but your benchmarks
              | should be run raw.
        
               | dmit wrote:
               | And when all else fails, you reach for Intel VTune. :)
               | Which has provided a free community version for a couple
               | years now, I might add.
        
           | hinkley wrote:
           | I have been wondering for a while if the solution to the
           | memory bottleneck is to stop treating cache memory as an
           | abstraction and make it directly addressable. Removing L1
           | cache is probably impossible, but I suspect removing L2 and
           | L3 caches and replacing them with working memory might fare
           | better.
           | 
            | The latest iteration of this thought process was wondering
            | what would happen if you exposed microcode for memory
            | management to the kernel, and it ran a moral equivalent of
            | eBPF directly in a beefed-up MMU. Legacy code, and code
            | that doesn't deign to do its own management, would elect
            | to use routines that maintain the cache abstraction. The
            | kernel could also segment the caches for different
            | processes, reducing the surface area for cache-related
            | bugs.
        
         | foldr wrote:
          | One nice thing about Go's built-in benchmarking is that it
          | automatically takes care of running benchmarks repeatedly
          | until the runtime is stable, then averages the time over a
          | number of hot runs.
        
         | pornel wrote:
          | Rust has libraries like https://lib.rs/criterion that take
          | care of running code 1000 times in a loop, with proper
          | timing, elimination of outliers, etc.
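          | 
          | A minimal Criterion benchmark looks something like this
          | (`solve` and its input are stand-ins for your own code):
          | 
          |     // benches/solve.rs
          |     use criterion::{black_box, criterion_group,
          |                     criterion_main, Criterion};
          | 
          |     fn solve(input: &str) -> usize { input.len() }
          | 
          |     fn bench(c: &mut Criterion) {
          |         // Criterion picks the iteration count, warms up,
          |         // and reports outlier-filtered statistics.
          |         c.bench_function("solve", |b| {
          |             b.iter(|| solve(black_box("puzzle input")))
          |         });
          |     }
          | 
          |     criterion_group!(benches, bench);
          |     criterion_main!(benches);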
        
         | kaba0 wrote:
         | Valgrind (and "friends") are exactly this, they can measure
         | cache misses, branch mispredictions, etc.
        
         | hinkley wrote:
          | Something about flamegraphs has been bugging me for a while,
          | and as more languages adopt async/await semantics it's
          | coming to a head: flamegraphs work well for synchronous,
          | sequential code. I don't think it's an accident that a lot
          | of people first encountered them for Javascript, in the era
          | immediately before Promises became a widely known technique.
          | They actually worked then, but now they're quickly becoming
          | more trouble than they're worth.
         | 
         | A long time ago when people still tried to charge for
         | profilers, I remember one whose primary display was a DAG, not
         | unlike the way some microservices are visualized today. Only
         | the simplest cases of cause and effect can be adequately
         | displayed as a flamegraph. For anything else it's, as you say,
         | all noise.
        
           | Arnavion wrote:
           | Your asynchronous code is made up of synchronous segments
           | between the yield points. The flamegraph measures those just
           | fine.
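            | 
            | A sketch of what I mean (function names made up):
            | 
            |     async fn handler() {
            |         // Yield point: time spent waiting here is
            |         // attributed to the executor, not to handler().
            |         let data = fetch().await;
            |         // Synchronous segment: this shows up as an
            |         // ordinary frame in the flamegraph.
            |         parse(&data);
            |     }
            | 
            |     async fn fetch() -> Vec<u8> { vec![1, 2, 3] }
            |     fn parse(_data: &[u8]) {}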
        
       | gavinray wrote:
       | Whaaat, Chrome has a built-in flamegraph profiler that you can
       | use with profiling data from languages like Rust (and presumably
       | others)?!
       | 
       | Sweet tip.
        
         | oxff wrote:
         | https://fasterthanli.me/articles/when-rustc-explodes &
         | https://fasterthanli.me/articles/why-is-my-rust-build-so-slo...
         | 
          | These two articles also showcase how to use Chrome for this.
        
         | nszceta wrote:
         | Similarly, py-spy is a sampling profiler for Python programs.
         | It lets you visualize what your Python program is spending time
         | on without restarting the program or modifying the code in any
         | way. py-spy is extremely low overhead: it is written in Rust
         | for speed and doesn't run in the same process as the profiled
         | Python program. This means py-spy is safe to use against
         | production Python code.
         | 
         | I'm not sure if it exports results in a format Chrome can
         | render but it does produce great interactive SVGs and is
         | compatible with speedscope.app
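          | 
          | Typical usage looks something like this (PID made up):
          | 
          |     py-spy record -o profile.svg --pid 12345
          |     py-spy record -o out.json --format speedscope --pid 12345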
         | 
         | https://github.com/benfred/py-spy
         | 
         | https://github.com/jlfwong/speedscope
        
         | dralley wrote:
         | There is also https://profiler.firefox.com/
        
       | ZeroGravitas wrote:
        | Talking of data-driven: I think I read that the Rust compiler
        | team checks itself against some massive list of popular crates
        | to check it doesn't break anything.
        | 
        | Would it be a reasonable use of resources to run all those
        | test suites and identify hot spots for community-wide
        | optimization?
        
         | nicoburns wrote:
          | > I read that the Rust compiler team checks itself against
          | some massive list of popular crates to check it doesn't
          | break anything.
         | 
         | Yes, that "massive list" being every single crate in the
         | crates.io repository.
         | 
          | > Would it be a reasonable use of resources to run all
          | those test suites and identify hot spots for community-wide
          | optimization?
         | 
         | I believe the approach on the perf side of things has been to
         | take reports on crates that are particularly slow (even at one
         | particular part of the compiler) and create benchmarks from
         | those. I believe this is partly because running against every
         | single crate would be too slow, and partly because it would be
         | a moving target (as new versions of crates are released) and
         | thus would make it hard to track performance accurately.
        
           | estebank wrote:
            | As crater was already linked to in a sibling comment, the
            | performance suite dashboard can be seen at
            | https://perf.rust-lang.org/, and the suite itself at
            | https://github.com/rust-lang/rustc-perf/tree/master/collecto...
        
           | tialaramex wrote:
           | I think what ZeroGravitas was getting at was the idea of
           | optimising _the actual software_ rather than the compiler.
           | 
           | For example you could imagine noticing that people make a lot
           | of Vecs with 400 or 800 things in them, but not so many with
           | 500 or 1000 things in them and so maybe the Vec growth rule
           | needs tweaking to better accommodate that.
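            | 
            | For illustration, a tiny probe of the current growth rule
            | (the exact capacities are an implementation detail, not a
            | guarantee):
            | 
            |     fn main() {
            |         let mut v = Vec::new();
            |         let mut cap = 0;
            |         for i in 0..1000 {
            |             v.push(i);
            |             if v.capacity() != cap {
            |                 cap = v.capacity();
            |                 // Today's std doubles capacity: 4, 8, 16,
            |                 // ... so 500 elements reserve room for 512.
            |                 println!("len {:4} -> cap {}", v.len(), cap);
            |             }
            |         }
            |     }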
        
         | lmkg wrote:
         | The tool you're referring to is called Crater:
         | https://github.com/rust-lang/crater.
        
       | Georgelemental wrote:
       | Miri is not really meant for performance profiling; it runs on
       | unoptimized MIR, which has very different performance from LLVM-
       | optimized machine code.
        
         | pornel wrote:
          | This is really important. I think the author found the
          | results surprising and unintuitive because they were looking
          | at the wrong thing.
          | 
          | Rust libraries are designed to be fast when optimized with
          | LLVM. Rust has a lot of layers of abstraction, and they're
          | _zero_ cost only when everything is fully optimized. If you
          | look at unoptimized Miri execution, or insert code-level
          | instrumentation that gets in the way of the optimizer, these
          | aren't zero cost any more, and overheads add up where they
          | normally don't exist.
        
         | LegionMammal978 wrote:
         | As a caveat, it can still be useful for profiling constant-
         | evaluated code to improve compile time, since Miri is built on
         | the same evaluator that the compiler uses.
        
         | ttfkam wrote:
         | I don't understand why you felt the need to point this out when
         | the linked article explicitly mentioned it.
         | 
         | > Important note: Miri is not intended to accurately replicate
         | optimized Rust runtime code. Optimizing for Miri can sometimes
         | make your real code slower, and vice versa. It's a helpful tool
         | to guide your optimization, but you should always benchmark
         | your changes with release builds, not with Miri.
        
           | IshKebab wrote:
           | I think they undersell the difference. You'd be laughed out
           | of the room if you profiled debug code.
        
           | Tobu wrote:
            | The author was gently panned when this was originally
            | posted on Reddit, and they added that note. But now that
            | the article is being reposted uncritically, where people
            | may not know the difference, it's worth pointing out
            | again. The way the author got a profile out of Miri is
            | creative, but Miri was never a helpful guide for
            | profiling. It seems the second benchmark they used for
            | confirmation was also unoptimised (run without --release,
            | which is a rookie mistake). They then drew wrong
            | conclusions from flawed observations about the costs of
            | abstractions like range.contains.
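            | 
            | For anyone new to this, the difference in question:
            | 
            |     cargo run             # debug build; don't benchmark
            |     cargo run --release   # optimized build; benchmark this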
        
       | mullr wrote:
       | Every Linux C/C++/Rust developer should know about
       | https://github.com/KDAB/hotspot. It's convenient and fast. I use
       | it for Rust all the time, and it provides all of these features
       | on the back of regular old `perf`.
        
         | galangalalgol wrote:
          | What perf record settings do you use? Trying to use DWARF
          | has never worked well for me with Rust, so I've been using
          | LBR, but even then it seems to get which instructions are
          | part of which function wrong a significant portion of the
          | time.
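          | 
          | For reference, the two call-graph modes I'm comparing
          | (binary path hypothetical):
          | 
          |     perf record --call-graph dwarf ./target/release/prog
          |     perf record --call-graph lbr ./target/release/prog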
        
       | LAC-Tech wrote:
        | My key takeaway from this is different - be very sceptical of
        | third party packages! Both performance issues were traced back
        | to them, and his replacements for their functionality - while
        | not "battle tested" and surely constituting "re-inventing the
        | wheel" - were faster, easier to read, and easier to
        | understand.
        | 
        | Any front-end devs reading this? :)
        
       | Arcuru wrote:
       | > The most surprising thing for me is how unintuitive it is to
       | optimize Rust code given that it's honestly hard to find a Rust
       | project that doesn't loudly strive to be "blazingly fast". No
       | language is intrinsically fast 100% of the time, at least not
       | when a mortal like me is behind the keyboard. It takes work to
       | optimize code, and too often that work is guess-and-check.
       | 
        | I get the feeling that a lot of Rust projects claim to be
        | "blazingly fast" just because they are written in Rust, not
        | because they've made any attempt to actually optimize them. I
        | rarely see any realistic benchmarks, and the few times I've
        | looked deeply into the designs, they were not implemented with
        | execution speed in mind, or in some cases were prematurely
        | optimized in a way that is actively detrimental [1].
       | 
       | Personally I think it's because so many of the new Rust
       | programmers are coming from scripting languages, so everything
       | feels fast. I don't have any problems with that, but I'd advise
       | anyone seeing a "blazingly fast" Rust project to check if the
       | project has even a single reasonable benchmark to back that up.
       | 
       | [1] https://jackson.dev/post/rust-coreutils-dd/
        
         | mcqueenjordan wrote:
         | https://github.com/BurntSushi/ripgrep
         | 
         | https://github.com/rust-lang/hashbrown
         | 
         | https://github.com/briansmith/ring
         | 
         | https://github.com/rust-random/rand
         | 
          | Lots of Rust programmers are also coming from C, C++, and
          | Go, btw.
        
           | kaba0 wrote:
           | Good examples, but one of those languages is not like the
           | others, no matter how well it tries to blend in.
        
         | burntsushi wrote:
         | So wait, are you complaining that Rust projects advertise
         | themselves as "blazingly fast" but actually aren't? Or are you
         | complaining that not all Rust projects are fast? If it's the
         | former, I don't think uutils advertises itself as "blazingly
         | fast." And if it's the latter, then that... kind of seems
         | unreasonable?
        
         | mustache_kimono wrote:
         | > the few times I've looked deeply into the designs they are
         | not implemented with execution speed in mind, or in some cases
         | prematurely optimized in a way that is actively detrimental
         | [1].
         | 
          | I'm not sure citing uutils/coreutils as an example is fair.
          | I love that project, and I learned Rust contributing to it.
          | However, as the blog entry you cite itself notes:
         | 
         | > I saw the maintainers themselves mention that a lot of the
         | code quality isn't great since a lot of contributions are from
         | people who are very new to Rust
         | 
         | I'm sure plenty of my slow code is still in `ls` and `sort` and
         | that's okay?
        
         | Jarred wrote:
         | "blazing fast" often means "I want to say it's fast but I've
         | never benchmarked it"
        
       | [deleted]
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-12-09 23:01 UTC)