[HN Gopher] How to optimize a CUDA matmul kernel for cuBLAS-like...
       ___________________________________________________________________
        
       How to optimize a CUDA matmul kernel for cuBLAS-like performance
       (2022)
        
       Author : mpweiher
       Score  : 86 points
       Date   : 2024-07-26 14:50 UTC (8 hours ago)
        
 (HTM) web link (siboehm.com)
 (TXT) w3m dump (siboehm.com)
        
       | joe_the_user wrote:
       | In other discussion here, people asserted that a CUDA replacement
       | was unworkable because you couldn't replace Nvidia's CuBLAS
       | implementation. I'm not qualified to say whether this would give
       | enough info for constructing an adequate replacement, but I'd be
       | interested in people's opinions.
        
         | hedgehog wrote:
         | Yes and no. Read his notes about his goals and what he didn't
         | implement, and note that his job when he wrote that was as part
         | of a team that ports models to different hardware. That's a
         | smaller job than writing a shippable framework, and Anthropic
         | has a team of people doing it. You can read the cuDNN docs to
         | get an idea of some of the other stuff you'd need; there's a
         | lot there, but generally that's the point. So in one sense, yes:
         | if you already have a strong performance engineering team and a
         | specific workload, you can make use of a variety of hardware. In
         | another sense, no: a small team isn't likely to be able to write
         | a direct alternative to the CUDA ecosystem that reaches devex
         | parity. Lots of people have a pretty clear idea of what it takes
         | to do this, but none of the major vendors seem to be doing the
         | work.
         | 
         | Knowing that reaching broad devex parity is very expensive, I
         | think the real win is figuring out what specific problem you
         | have and building community and robust software support around
         | that.
        
         | alfalfasprout wrote:
         | > In other discussion here, people asserted that a CUDA
         | replacement was unworkable because you couldn't replace
         | Nvidia's CuBLAS implementation
         | 
         | Targeting nvidia GPUs? Or in general? For whom?
         | 
         | Building a performant BLAS library is hard but certainly not
         | impossible. The tricks discussed in this post are hardly
         | anything new either. Now, making a BLAS competitive with
         | Nvidia's on its own GPUs is bound to be tough, but it's not
         | technically infeasible (after all, you can drop down to PTX if
         | needed).
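         | 
         | For context, the central trick the post builds up is shared-
         | memory tiling. A rough, minimal sketch of the idea (not the
         | article's exact kernel; assumes square row-major N x N matrices
         | with N a multiple of TILE):
         | 
         |     // Minimal shared-memory-tiled SGEMM sketch: C = A * B.
         |     template <int TILE>
         |     __global__ void sgemm_tiled(int N, const float *A,
         |                                 const float *B, float *C) {
         |       __shared__ float As[TILE][TILE];
         |       __shared__ float Bs[TILE][TILE];
         | 
         |       int row = blockIdx.y * TILE + threadIdx.y;
         |       int col = blockIdx.x * TILE + threadIdx.x;
         |       float acc = 0.0f;
         | 
         |       for (int t = 0; t < N; t += TILE) {
         |         // Stage one TILE x TILE block of A and B in shared
         |         // memory, then let the whole thread block reuse it.
         |         As[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
         |         Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
         |         __syncthreads();
         | 
         |         for (int k = 0; k < TILE; ++k)
         |           acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
         |         __syncthreads();
         |       }
         |       C[row * N + col] = acc;
         |     }
         | 
         |     // Launch (TILE <= 32 so the block fits in 1024 threads):
         |     //   dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
         |     //   sgemm_tiled<32><<<grid, block>>>(N, dA, dB, dC);
         | 
         | The later kernels in the article go well beyond this (register
         | tiling, vectorized loads, and so on), but this is the shape of
         | the optimization.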
        
           | deredede wrote:
           | When I was optimizing GPU kernels a few years back, Nvidia's
           | own kernels were getting those last few percent of
           | performance by making use of hardware-specific features (I
           | remember operand caches being one) that are not available
           | through PTX.
        
           | ap4 wrote:
           | Beating CuBLAS is easy. I wrote a 30-line kernel that
           | multiplies two 4096x4096 matrices 14% faster than CuBLAS on a
           | 4090. The question is how to earn money on that.
        
             | coffeeaddict1 wrote:
             | I'm a little sceptical of your claims. Care to share the
             | kernel you wrote?
        
               | david-gpu wrote:
               | I would bet actual money that they are not doing an
               | apples-to-apples comparison.
               | 
               | I have seen how those high-performance libraries are made
               | and I'm still in awe at the quality and quantity of the
               | staffing involved. Those were the smartest and most
               | knowledgeable engineers I met in my career.
        
             | ladberg wrote:
             | If this were true (and I highly doubt it), it's obvious how
             | to make money from it: collect a 7-figure paycheck from
             | Nvidia, AMD, or any FAANG.
        
               | ap4 wrote:
               | I swapped one of the kernels in the code from the article
               | for my kernel, and left only the multiplication of
               | 4096x4096 matrices.
               | 
               | On average over 20 runs:
               | 
               | CuBLAS (./sgemm 0) gets 50.9 TFLOPS.
               | 
               | My kernel gets 61.8 TFLOPS, so it's actually a +21%
               | speedup in this benchmark.
               | 
               | How do I collect my paycheck?
        
               | JuanJohnJames wrote:
               | Post the code and your CV.
        
               | aaa370 wrote:
               | I gotta see it to believe it ;)
        
         | imtringued wrote:
         | The problem with AMD isn't that they are only hitting 90% of
         | the hardware capabilities vs Nvidia hitting 98%.
         | 
         | It's the fact that AMD doesn't prioritize the reliability of
         | its hardware and software stack. If I run llama.cpp on Vulkan I
         | get a reasonable speedup, but if I raise the batch size to 512,
         | the GPU starts making strange noises and shuts the PC down
         | midway. Very cool. 98% of zero is still zero.
        
         | KeplerBoy wrote:
         | Of course it can be done, it's just a lot of effort. You need
         | parametric kernels so you can find optimal configurations for
         | all hardware and input-size combinations.
         | 
         | Then there are also numerics: being fast is not enough if your
         | implementation accumulates a lot of rounding errors doing so.
         | Floating point arithmetic can and will mess up your results in
         | unexpected ways. -funsafe-math-optimizations is famously
         | neither fun nor safe.
         | 
         | Maybe tooling will catch up and make it easier. Think tinygrad
         | with BEAM search, Triton, or Halide.
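         | 
         | A minimal sketch of what "parametric" looks like in practice,
         | reusing the templated tiled kernel sketched earlier in the
         | thread (declaration only here; names are illustrative, and a
         | real tuner sweeps many more knobs than the tile size):
         | 
         |     #include <cuda_runtime.h>
         | 
         |     // Defined as in the earlier tiled sketch.
         |     template <int TILE>
         |     __global__ void sgemm_tiled(int N, const float *A,
         |                                 const float *B, float *C);
         | 
         |     // Time one instantiation with CUDA events; an autotuner runs
         |     // this over a set of configurations per GPU and matrix size.
         |     template <int TILE>
         |     float time_variant_ms(int N, const float *dA,
         |                           const float *dB, float *dC) {
         |       dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
         |       cudaEvent_t start, stop;
         |       cudaEventCreate(&start);
         |       cudaEventCreate(&stop);
         |       cudaEventRecord(start);
         |       sgemm_tiled<TILE><<<grid, block>>>(N, dA, dB, dC);
         |       cudaEventRecord(stop);
         |       cudaEventSynchronize(stop);
         |       float ms = 0.0f;
         |       cudaEventElapsedTime(&ms, start, stop);
         |       cudaEventDestroy(start);
         |       cudaEventDestroy(stop);
         |       return ms;
         |     }
         | 
         |     // e.g. compare time_variant_ms<8>, <16>, <32> per matrix
         |     // size and keep the fastest.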
        
         | GaggiX wrote:
         | As the article notes: "However, for smaller matrices, we're
         | doing poorly in comparison to Nvidia's library! This happens
         | because cuBLAS contains not one single implementation of SGEMM,
         | but hundreds of them." It would take considerable effort to
         | replace cuBLAS.
        
         | ladberg wrote:
         | cuBLAS is not really Nvidia's moat: every competitor has a
         | workable BLAS implementation and some are even drop-in
         | replacements for cuBLAS.
         | 
         | In fact, cuBLAS and CUDA are kinda orthogonal, in that you're
         | either calling a pre-built cuBLAS kernel or writing your own
         | CUDA kernel, but not really combining the two.
         | 
         | I'd say CUDA shines more because of stability, documentation,
         | community support + examples, and the ability to use modern C++
         | features in GPU code.
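         | 
         | To make the "pre-built kernel" side concrete, the whole cuBLAS
         | path for an SGEMM like the article's is a single library call
         | (a sketch, names mine; cuBLAS assumes column-major storage):
         | 
         |     #include <cublas_v2.h>
         | 
         |     // C = alpha * A * B + beta * C for N x N matrices already
         |     // resident on the device.
         |     void gemm_via_cublas(int N, const float *dA, const float *dB,
         |                          float *dC) {
         |       cublasHandle_t handle;
         |       cublasCreate(&handle);
         |       const float alpha = 1.0f, beta = 0.0f;
         |       cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
         |                   N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);
         |       cublasDestroy(handle);
         |     }
         | 
         | Everything interesting (which of cuBLAS's many pre-built kernels
         | gets dispatched) happens behind that call.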
        
       | flakiness wrote:
       | Note that this is from 2022.
       | 
       | My guess is that people nowadays are gradually moving away from
       | raw CUDA programming towards things like Triton, etc., and that
       | you won't be focusing on pure GEMM anyway, since you tend to do
       | some fusion.
       | 
       | The Triton tutorial claims its performance is on par with
       | cuBLAS.
       | 
       | https://triton-lang.org/main/getting-started/tutorials/03-ma...
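       | 
       | A toy illustration of the fusion point, in plain CUDA rather than
       | Triton: instead of launching a second kernel that re-reads C just
       | to add a bias and apply ReLU, fold that into the matmul's
       | epilogue while the result is still in registers (bias here is a
       | hypothetical per-column bias vector):
       | 
       |     // Fused epilogue: applied to the per-thread accumulator at
       |     // the end of a matmul kernel like the tiled sketch above.
       |     __device__ __forceinline__ float bias_relu(float acc, float b) {
       |       float v = acc + b;
       |       return v > 0.0f ? v : 0.0f;  // ReLU
       |     }
       | 
       |     // ...so the final store in the matmul kernel becomes:
       |     //   C[row * N + col] = bias_relu(acc, bias[col]);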
        
         | machinekob wrote:
         | But Triton is just an abstraction over CUDA for Python (like
         | CuPy, Numba, etc.). If you need low-level access you'll still
         | use CUDA; if you want high-level you can use Triton or Numba;
         | even higher and you'll just use wrappers like PyTorch/JAX.
        
         | bee_rider wrote:
         | I'm sure they are using cool super-modern stuff but I wonder if
         | Eigen can somehow be made to spit out CUDA code?
        
         | imjonse wrote:
         | It is still an excellent article if you care about how the GPU
         | works. Triton doesn't magically hide all the hardware details.
        
         | dang wrote:
         | Year added above. Thanks!
        
       | aaa370 wrote:
       | Another point to consider here is that this project of writing a
       | cuBLAS-level GEMM kernel becomes much more challenging if you are
       | doing it with fp16, and are thus competing with the cuBLAS
       | kernels that use tensor cores. The (theoretical) arithmetic
       | throughput of tensor cores is ~8x higher than fp32 math on the
       | Turing arch; I don't know off the top of my head, but I think
       | this ratio is the same or greater for Ampere/Hopper tensor
       | cores.
       | 
       | This makes the project proportionally harder in my opinion
       | because you need to be that much more efficient with moving data
       | through the memory hierarchy. With tensor cores, to get anywhere
       | close to cuBLAS, you need to start with something like the most
       | efficient kernel in Simon's article, and then do stuff like
       | shared memory swizzling, async global memory copies, double
       | buffering, and writing a really efficient kernel epilogue to
       | accumulate the C matrix into the product.
       | 
       | I came across this article a while ago and it inspired me to take
       | a stab at this^. As of now I have gotten to ~80% of the cuBLAS
       | tensor core performance where the kernel is mostly compute bound,
       | and I am close to giving up on the last ~20%, because I think I
       | may need to write the inner loop in SASS to make sure the
       | instruction mix between shared memory loads, mma instructions,
       | and synchronizations is perfectly balanced so that none of the
       | hardware pipelines get overloaded (see link below), and I have
       | enough compassion for myself to not spend my free time doing
       | stuff like that :). There are also certain things implemented in
       | CUTLASS that seem important (look up serpentine traversal), but
       | NVIDIA engineers won't talk about the hardware details required
       | to understand why this helps.
       | 
       | Article on this is forthcoming
       | 
       | https://github.com/NervanaSystems/maxas/wiki/SGEMM
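       | 
       | For anyone who hasn't seen the tensor-core path being described,
       | a minimal warp-level sketch using the WMMA intrinsics (sm_70+).
       | This is just the 16x16x16 building block for illustration, not
       | the kernel described above; a real kernel layers the swizzling /
       | async-copy / double-buffering techniques on top of it:
       | 
       |     #include <cuda_fp16.h>
       |     #include <mma.h>
       |     using namespace nvcuda;
       | 
       |     // One warp multiplies a single 16x16 fp16 tile pair and
       |     // accumulates into a 16x16 fp32 tile using tensor cores.
       |     __global__ void wmma_tile_16(const half *A, const half *B,
       |                                  float *C) {
       |       wmma::fragment<wmma::matrix_a, 16, 16, 16, half,
       |                      wmma::row_major> a_frag;
       |       wmma::fragment<wmma::matrix_b, 16, 16, 16, half,
       |                      wmma::row_major> b_frag;
       |       wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
       | 
       |       wmma::fill_fragment(c_frag, 0.0f);
       |       wmma::load_matrix_sync(a_frag, A, 16);  // leading dim 16
       |       wmma::load_matrix_sync(b_frag, B, 16);
       |       wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
       |       wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
       |     }
       | 
       |     // Launch with one warp: wmma_tile_16<<<1, 32>>>(dA, dB, dC);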
        
       ___________________________________________________________________
       (page generated 2024-07-26 23:08 UTC)