[HN Gopher] Lessons learned from profiling an algorithm in Rust
___________________________________________________________________
Lessons learned from profiling an algorithm in Rust
Author : urcyanide
Score : 84 points
Date : 2024-10-13 15:03 UTC (6 hours ago)
(HTM) web link (blog.mapotofu.org)
(TXT) w3m dump (blog.mapotofu.org)
| andrewaylett wrote:
| That's really interesting -- I do enjoy a good optimisation.
|
| I was looking at one of the diffs, and thinking that a sufficiently
| advanced compiler should be able to generate the same efficient
| code for both -- and indeed it does, if you turn the optimiser
| on: https://godbolt.org/z/hjP5qjabz
|
|     - let shift = if (i / 32) % 2 == 0 { 32 } else { 0 };
|     + let shift = ((i >> 5) & 1) << 5;
| NovaX wrote:
| I'm confused because isn't the bitwise version the inverted
| logic? If the LSB is 1 then it is an odd value, which should be
| zero, yet that is shifted to become 32. The original modulus is
| for an even value becoming 32. Shouldn't the original code or
| compiler invert it first? I'd expect
|
|     let shift = ((~(i >> 5) & 1) << 5);
|
| EDIT: The compiler uses "vpandn" with the conditional version
| and "vpand" with the bitwise version. The difference is that
| "vpandn" applies a bitwise logical NOT to its first source
| operand. It looks like the compiler and I are correct: the
| author's bitwise version is inverted, and the incorrect code
| was merged in the author's commit. Also, I think this could be
| reduced to just (~i & 32).
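
The disagreement above can be checked directly. The following is a minimal sketch, assuming `i` is a `usize` index (the thread does not show its type); the function names are made up for illustration. Rust spells bitwise NOT as `!`, so the `(~i & 32)` suggestion becomes `!i & 32`.

```rust
/// Conditional form from the diff: 32 when the 32-wide block index is even, else 0.
pub fn shift_conditional(i: usize) -> usize {
    if (i / 32) % 2 == 0 { 32 } else { 0 }
}

/// Bitwise form from the diff: yields 32 when bit 5 of `i` is set (odd block).
pub fn shift_bitwise_as_posted(i: usize) -> usize {
    ((i >> 5) & 1) << 5
}

/// Inverted form suggested in the reply, written with Rust's `!` operator.
pub fn shift_bitwise_inverted(i: usize) -> usize {
    !i & 32
}

/// First index in `0..limit` where the two functions disagree, if any.
pub fn first_mismatch(
    f: fn(usize) -> usize,
    g: fn(usize) -> usize,
    limit: usize,
) -> Option<usize> {
    (0..limit).find(|&i| f(i) != g(i))
}
```

Calling `first_mismatch(shift_conditional, shift_bitwise_as_posted, 4096)` reports a disagreement at index 0, while the inverted form agrees with the conditional over the same range.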
| carlmr wrote:
| Great writeup with easy to understand steps. The one thing it's
| lacking is in the conclusion: I'd like to see a comparison
| to the C++ implementation.
| efnx wrote:
| Yes, exactly. How close does it come after all those
| optimisations?
| wrs wrote:
| I'm a Rust newbie, wondering how f32::clone could show up in a
| profile. Wouldn't that be an inline no-op under any kind of
| optimization? I mean, cloning a float is, at worst, a MOV
| instruction, no?
| MiguelX413 wrote:
| Floats aren't stored in the same kinds of registers.
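
For context on the question above: `f32` is `Copy`, and its `Clone` implementation simply returns the value, so an explicit `.clone()` and a plain copy are semantically identical. A small sketch, with hypothetical function names, that makes the equivalence concrete:

```rust
/// Sum a slice by explicitly cloning each element.
pub fn sum_with_clone(xs: &[f32]) -> f32 {
    xs.iter().map(|x| x.clone()).sum()
}

/// Sum the same slice using plain copies of each element.
pub fn sum_with_copy(xs: &[f32]) -> f32 {
    xs.iter().copied().sum()
}
```

Both functions visit the elements in the same order, so they return bit-identical results. With optimizations on, one would expect them to compile to the same machine code, though that is worth confirming in a disassembly rather than assuming.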
| mwkaufma wrote:
| I don't understand why half of these aren't optimized by the
| compiler automatically. (x - y).norm_squared()? Why is
| f32::clone() not just an inline mov? Begging a lot of questions.
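
To illustrate the kind of rewrite being asked about, here is a plain-slice sketch of a squared Euclidean distance computed two ways. It is not the article's code: it assumes, for illustration only, that the subtraction produces a temporary vector, and both function names are hypothetical.

```rust
/// Squared Euclidean distance that first materialises `x - y` in a temporary Vec.
pub fn dist_sq_with_temp(x: &[f32], y: &[f32]) -> f32 {
    let diff: Vec<f32> = x.iter().zip(y).map(|(a, b)| a - b).collect();
    diff.iter().map(|d| d * d).sum()
}

/// Fused version: subtract, square, and accumulate in one pass, with no allocation.
pub fn dist_sq_fused(x: &[f32], y: &[f32]) -> f32 {
    x.iter()
        .zip(y)
        .map(|(a, b)| {
            let d = a - b;
            d * d
        })
        .sum()
}
```

For `x = [1.0, 2.0, 3.0]` and `y = [4.0, 6.0, 3.0]` both return 25.0. Whether an optimiser removes the temporary in the first version on its own depends on the types and on inlining, which is the open question in the comment above.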
| JackYoustra wrote:
| I've previously had problems with the compiler not inlining /
| eliding instructions solely due to profiling code (see a blog
| post: https://www.jackyoustra.com/blog/llama-ios#-bug-bug-
| slowdown...). I wonder if it's that?
|
| (I've also always had a sneaking suspicion I did something
| wrong in my example, so if anyone knows let me know)
| vlovich123 wrote:
| Pcwalton's explanation is much more likely to be correct
| https://news.ycombinator.com/context?id=41830704
|
| Profiling native code with optimizations on is very very
| tricky.
| pcwalton wrote:
| I'm guessing that f32::clone showing up in the profile isn't
| actually a call to f32::clone, because you have optimizations on
| (if it actually is a call to a "movd xmm0,dword ptr [rdi]; ret"
| instruction pair, that's a bug in the compiler). Rather it's the
| result of the compiler choosing to attribute seemingly-random
| lines to f32::clone, because when lines from multiple functions
| are fused into one instruction the compiler will just pick one,
| and it happened to pick f32::clone to write into the debug info.
| You really want to look at instruction-level profiling when
| you're profiling at that level instead of the individual
| functions, since debug info is going to be very unreliable.
| eftychis wrote:
| Seconded. This could have been essentially anything and
| everything else bunched together. Or we have a compiler or
| debug symbol bug on our hands.
___________________________________________________________________
(page generated 2024-10-13 22:00 UTC)