[HN Gopher] Performance optimization, and how to do it wrong
       ___________________________________________________________________
        
       Performance optimization, and how to do it wrong
        
       Author : todsacerdoti
       Score  : 57 points
       Date   : 2025-03-04 17:14 UTC (3 days ago)
        
 (HTM) web link (genna.win)
 (TXT) w3m dump (genna.win)
        
       | hansvm wrote:
       | That pattern of doing something for the tail (and possibly also
       | the head, if you're on a platform where aligned loads are
       | preferred or required) of your data is very common in SIMD linear
       | algebra. It's so common that my default way of writing those
       | things is:
       | 
       | 1. Write a function implementing the happy-path SIMD
       | 
       | 2. (eventually) Write a function implementing the cleanup code
       | 
       | 3. (eventually) Implement the final result by appropriately
       | calling into (1) and (2)
       | 
       | That gives you a few benefits. Notably, every problem in TFA
       | (other than the inlining limit) magically goes away. Also, in the
       | vast majority of problems, your cleanup code is itself a valid
       | solution to the problem and also easy to understand, so for free
       | you have a correct implementation to fuzz your fast
       | implementation against (a testing pattern I highly recommend with
       | all optimized code).
       | 
       | Most importantly though, 99 times out of 100 I actually don't
       | need (2) or (3). You can push the padding and alignment
       | constraints all the way to the top level of your program, and the
       | rest of your code can work on slices of vectors instead of slices
       | of floats (or whatever other safe representation you want to
       | choose). The resulting code is much easier to maintain, and the
       | compiled binary is usually smaller and faster.
        
         | goyel wrote:
          | In C++, I would write a single template for both paths and
          | make the compiler generate a separate function for each case,
          | so the ifs and checks are automatically removed. Then I would
          | instantiate each function with a parameter like "bool
          | boundary". It's more maintainable.
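A sketch of that idea (hypothetical names; `if constexpr` on a non-type template parameter stands in for whatever checks the real code would guard):

```cpp
#include <cassert>
#include <cstddef>

// One template for both paths. `Boundary` is a compile-time flag, so the
// compiler instantiates two separate functions and dead-code-eliminates
// the branch in each of them.
template <bool Boundary>
float sum_impl(const float* x, std::size_t n) {
    float acc = 0;
    if constexpr (!Boundary) {
        // Happy path: n is guaranteed to be a multiple of 4.
        for (std::size_t i = 0; i < n; i += 4)
            acc += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    } else {
        // Boundary path: same source file, element-wise tail handling.
        for (std::size_t i = 0; i < n; ++i) acc += x[i];
    }
    return acc;
}

float sum_both(const float* x, std::size_t n) {
    std::size_t bulk = n - n % 4;
    return sum_impl<false>(x, bulk) + sum_impl<true>(x + bulk, n - bulk);
}
```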
        
         | jesse__ wrote:
         | So (2) in this case would be the scalar version of the function
         | that takes a starting pointer, row stride, channel stride, and
         | a row_length and channel_length. By setting the strides to 1
         | and lengths to max you effectively tell that function the whole
         | matrix is leftovers and it just does the whole computation.
         | Correct?
        
           | hansvm wrote:
           | That's not the only option, but it's very reasonable and
           | almost certainly the thing I'd start with, maybe the thing
           | I'd finish with. You then pass an appropriate sub-matrix to
           | this thing which does a whole-matrix computation, and that's
           | the entire "add another loop" trick from TFA.
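Under those assumptions, the strided scalar fallback might look like this (all names hypothetical): pointed at the full matrix with full lengths it does the whole computation itself; pointed at a corner sub-matrix it only cleans up the leftovers.

```cpp
#include <cassert>
#include <cstddef>

// Scalar fallback over an arbitrary sub-matrix, described by a start
// pointer, a row stride (elements per row of the *parent* matrix), and
// the sub-matrix dimensions. Called with the parent's full dimensions,
// it treats the entire matrix as "leftovers".
float sum_submatrix(const float* start, std::size_t row_stride,
                    std::size_t rows, std::size_t cols) {
    float acc = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            acc += start[r * row_stride + c];
    return acc;
}
```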
        
             | jesse__ wrote:
             | Yeah, got it. Thanks for the explanation
        
         | grandempire wrote:
         | This is actually a great way to write code in general. Write
         | the happy function. Then write various wrappers to handle all
         | the ugly checks and transforms to get it on the happy path.
        
       | n4r9 wrote:
       | I'm not too sure what the author thinks they did wrong! Perhaps
       | implementing the first attempt before profiling, but... would it
       | have made a difference in this scenario?
        
         | Palomides wrote:
         | not using uprof (or vtune for intel users) well seems like a
         | misstep; it can give you the rate of branch prediction success,
         | though that is arguably more a lack of knowledge/experience
         | than a mistake
         | 
         | they're cool tools, I encourage anyone to mess with them
        
           | adgjlsfhk1 wrote:
           | branch prediction rates wouldn't have helped here. the
           | problem here is that even though the branch is predicted
           | correctly ~99% of the time, the branch drops you to ipc of 1
           | because you can't speculate past a 2nd branch
        
             | Palomides wrote:
             | I think that would be reported as some kind of stall? I
             | can't say I'm an expert either
        
               | turol wrote:
               | I think vtune and toplev will report that as a frontend
               | stall in their topdown metrics.
        
       | Validark wrote:
       | In other words, READ THE ASSEMBLY!
        
         | dijit wrote:
          | The assembly can mislead you: I have seen complex, long
          | assembly that even does memory reads deliver 2.5x the
          | performance of "simple" assembly that would have been 4-5
          | lines.
         | 
         | Benchmarking is the only thing you can really do in some cases.
        
           | neonsunset wrote:
           | This is an edge case. Disassembly may not tell the whole
           | story but it is the definitive answer to what the compiler
           | thinks about your code. Given enough knowledge it is not
           | difficult to account for cache/memory accesses, see where
           | store-to-load forwarding may or may not kick in, consider
           | optimal independent operations scheduling, check if the code
           | is too branch heavy or not, etc etc.
           | 
           | And once you're done, you can do another round with tools
           | which can do more detailed analysis like
           | https://uica.uops.info
           | 
           | I do this when writing performance-sensitive C# all the time
           | (it's very easy to quickly get compiler disassembly) and it
           | saves me a lot of time.
        
       | panstromek wrote:
       | I will just plug https://www.computerenhance.com/ which goes into
       | great depth on all of this stuff. I highly recommend it, I
       | learned a lot.
        
       | flakiness wrote:
       | > This blog post is mostly written from memory, since I didn't
       | keep around every version of the code being discussed.
       | 
        | How do you track these optimization attempts, given you don't
        | commit many dead-end changes? CS programs often don't teach how
        | to experiment, which is a bummer, as there are supposed to be
        | established ways to do it in other scientific disciplines.
        | Maybe you could use something like a Jupyter notebook? But what
        | if the code is in, say, Rust and lives outside the Jupyter
        | kernel?
        
       | gmm1990 wrote:
        | Does a variance of 5 microseconds, with a min of 38
        | milliseconds and a max of 47 milliseconds over 40 samples, seem
        | like a valid statistic?
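One arithmetic sanity check on numbers like these: given a fixed min and max, the sample variance has a lower bound, because at least one sample sits at each extreme, and the bound is smallest when the mean is centered between them. A sketch (assuming the units are as stated):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Lower bound on sample variance (dividing by n) given only the observed
// min, max, and sample count: var >= 2 * ((max - min) / 2)^2 / n.
double min_variance(double min, double max, std::size_t n) {
    double half_range = (max - min) / 2.0;
    return 2.0 * half_range * half_range / static_cast<double>(n);
}
```

For min 0.038 s, max 0.047 s, and n = 40, this bound is about 1.01e-6 s², i.e. a standard deviation of at least ~1 ms, which is hard to square with a reported variance of 5 microseconds; the figure may be a standard error or a mislabeled unit.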
        
       ___________________________________________________________________
       (page generated 2025-03-07 23:01 UTC)