[HN Gopher] Performance optimization, and how to do it wrong
___________________________________________________________________
Performance optimization, and how to do it wrong
Author : todsacerdoti
Score : 57 points
Date : 2025-03-04 17:14 UTC (3 days ago)
(HTM) web link (genna.win)
(TXT) w3m dump (genna.win)
| hansvm wrote:
| That pattern of doing something for the tail (and possibly also
| the head, if you're on a platform where aligned loads are
| preferred or required) of your data is very common in SIMD linear
| algebra. It's so common that my default way of writing those
| things is:
|
| 1. Write a function implementing the happy-path SIMD
|
| 2. (eventually) Write a function implementing the cleanup code
|
| 3. (eventually) Implement the final result by appropriately
| calling into (1) and (2)
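|
| For a toy float-sum kernel, that shape might look like the
| following (an untested sketch, assuming x86 SSE):
|
|     #include <immintrin.h>  // SSE intrinsics (x86 assumed)
|     #include <cstddef>
|
|     // (1) Happy path: requires n to be a multiple of 4.
|     static float sum_simd(const float* x, std::size_t n) {
|       __m128 acc = _mm_setzero_ps();
|       for (std::size_t i = 0; i < n; i += 4)
|         acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
|       float lane[4];
|       _mm_storeu_ps(lane, acc);
|       return lane[0] + lane[1] + lane[2] + lane[3];
|     }
|
|     // (2) Cleanup: a plain scalar loop, which is also a
|     // correct standalone solution to fuzz (1) against.
|     static float sum_scalar(const float* x, std::size_t n) {
|       float acc = 0.0f;
|       for (std::size_t i = 0; i < n; ++i) acc += x[i];
|       return acc;
|     }
|
|     // (3) Glue: vector body, then the scalar tail.
|     float sum(const float* x, std::size_t n) {
|       std::size_t body = n & ~std::size_t{3};
|       return sum_simd(x, body)
|            + sum_scalar(x + body, n - body);
|     }
|
| (When fuzzing (3) against (2), allow a small tolerance: the
| SIMD version reassociates the float additions.)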
|
| That gives you a few benefits. Notably, every problem in TFA
| (other than the inlining limit) magically goes away. Also, in the
| vast majority of problems, your cleanup code is itself a valid
| solution to the problem and also easy to understand, so for free
| you have a correct implementation to fuzz your fast
| implementation against (a testing pattern I highly recommend with
| all optimized code).
|
| Most importantly though, 99 times out of 100 I actually don't
| need (2) or (3). You can push the padding and alignment
| constraints all the way to the top level of your program, and the
| rest of your code can work on slices of vectors instead of slices
| of floats (or whatever other safe representation you want to
| choose). The resulting code is much easier to maintain, and the
| compiled binary is usually smaller and faster.
| goyel wrote:
| In C++, I would write a single template for both paths and
| have the compiler generate a function for each case, so the
| ifs and checks are removed automatically. Then I would
| instantiate it with a template parameter like "bool
| boundary". It's more maintainable.
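|
| Something like this, say (a hypothetical kernel, just to
| show the shape; if constexpr needs C++17):
|
|     #include <cstddef>
|
|     // One template covers both paths; the compiler drops
|     // the per-element guard from the <false> instantiation.
|     template <bool boundary>
|     void scale4(float* dst, const float* src,
|                 std::size_t i, std::size_t n) {
|       for (std::size_t k = 0; k < 4; ++k) {
|         if constexpr (boundary)
|           if (i + k >= n) return;  // tail: guard elements
|         dst[i + k] = src[i + k] * 2.0f;  // body: no checks
|       }
|     }
|
|     void scale(float* dst, const float* src, std::size_t n) {
|       std::size_t i = 0;
|       for (; i + 4 <= n; i += 4)
|         scale4<false>(dst, src, i, n);
|       scale4<true>(dst, src, i, n);  // 0-3 leftovers
|     }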
| jesse__ wrote:
| So (2) in this case would be the scalar version of the function
| that takes a starting pointer, row stride, channel stride, and
| a row_length and channel_length. By setting the strides to 1
| and lengths to max you effectively tell that function the whole
| matrix is leftovers and it just does the whole computation.
| Correct?
| hansvm wrote:
| That's not the only option, but it's very reasonable and
| almost certainly the thing I'd start with, maybe the thing
| I'd finish with. You then pass an appropriate sub-matrix to
| this thing which does a whole-matrix computation, and that's
| the entire "add another loop" trick from TFA.
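|
| Concretely, something like (hypothetical names, just
| sketching that shape):
|
|     #include <cstddef>
|
|     // (2) as a strided scalar kernel: point it at the
|     // leftover region, or at the whole matrix, and it does
|     // the whole-matrix computation.
|     void process_view(float* base,
|                       std::size_t row_stride,
|                       std::size_t chan_stride,
|                       std::size_t row_length,
|                       std::size_t chan_length) {
|       for (std::size_t r = 0; r < row_length; ++r)
|         for (std::size_t c = 0; c < chan_length; ++c)
|           base[r * row_stride + c * chan_stride]
|             *= 2.0f;  // stand-in op
|     }
|
|     // Whole matrix, dense row-major layout:
|     //   process_view(m, chan_length, 1,
|     //                row_length, chan_length);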
| jesse__ wrote:
| Yeah, got it. Thanks for the explanation
| grandempire wrote:
| This is actually a great way to write code in general. Write
| the happy-path function, then write various wrappers to
| handle all the ugly checks and transforms needed to get onto
| the happy path.
| n4r9 wrote:
| I'm not too sure what the author thinks they did wrong! Perhaps
| implementing the first attempt before profiling, but... would it
| have made a difference in this scenario?
| Palomides wrote:
| Not making good use of uProf (or VTune, for Intel users)
| seems like a misstep; it can give you the branch-prediction
| success rate. Though that's arguably more a lack of
| knowledge/experience than a mistake.
|
| They're cool tools; I encourage anyone to mess with them.
| adgjlsfhk1 wrote:
| Branch-prediction rates wouldn't have helped here. The
| problem is that even though the branch is predicted
| correctly ~99% of the time, it still drops you to an IPC of
| 1, because you can't speculate past a second branch.
| Palomides wrote:
| I think that would be reported as some kind of stall? I
| can't say I'm an expert either
| turol wrote:
| I think vtune and toplev will report that as a frontend
| stall in their topdown metrics.
| Validark wrote:
| In other words, READ THE ASSEMBLY!
| dijit wrote:
| The assembly can mislead you: I have seen complex, long
| assembly that even does memory reads run at 2.5x the
| performance of "simple" assembly that would have been 4-5
| lines.
|
| Benchmarking is the only thing you can really do in some
| cases.
| neonsunset wrote:
| This is an edge case. Disassembly may not tell the whole
| story, but it is the definitive answer to what the compiler
| thinks about your code. Given enough knowledge, it is not
| difficult to account for cache/memory accesses, see where
| store-to-load forwarding may or may not kick in, reason
| about how independent operations will be scheduled, check
| whether the code is too branch-heavy, etc.
|
| And once you're done, you can do another round with tools
| which can do more detailed analysis like
| https://uica.uops.info
|
| I do this when writing performance-sensitive C# all the time
| (it's very easy to quickly get compiler disassembly) and it
| saves me a lot of time.
| panstromek wrote:
| I will just plug https://www.computerenhance.com/ which goes into
| great depth on all of this stuff. I highly recommend it, I
| learned a lot.
| flakiness wrote:
| > This blog post is mostly written from memory, since I didn't
| keep around every version of the code being discussed.
|
| How do you track these optimization attempts, given that you
| don't commit many dead-end changes? A CS major often doesn't
| teach how to experiment, which is a bummer, as there are
| supposed to be established ways to do this in other
| scientific disciplines. Maybe you could use something like a
| Jupyter notebook? But what if the code is in, say, Rust and
| lives outside the Jupyter kernel?
| gmm1990 wrote:
| Does a variance of 5 microseconds, with a min of 38
| milliseconds, a max of 47 milliseconds, and 40 samples, seem
| like a valid statistic?
___________________________________________________________________
(page generated 2025-03-07 23:01 UTC)