[HN Gopher] How to optimize a CUDA matmul kernel for cuBLAS-like...
___________________________________________________________________
How to optimize a CUDA matmul kernel for cuBLAS-like performance
(2022)
Author : mpweiher
Score : 86 points
Date : 2024-07-26 14:50 UTC (8 hours ago)
(HTM) web link (siboehm.com)
(TXT) w3m dump (siboehm.com)
| joe_the_user wrote:
| In other discussion here, people asserted that a CUDA replacement
| was unworkable because you couldn't replace Nvidia's CuBLAS
| implementation. I'm not qualified to say whether this would give
| enough info for constructing an adequate replacement, but I'd be
| interested in people's opinions.
| hedgehog wrote:
| Yes and no. Read his notes about his goals and what he didn't
| implement, and note that his job when he wrote that was as part
| of a team that ports models to different hardware. That's a
| smaller job than writing a shippable framework and Anthropic
| has a team of people doing it. You can read the cuDNN docs to
| get an idea of some of the other stuff you'd need, there's a
| lot there but generally that's the point. So in one sense, yes,
| if you have a strong performance engineering team and a
| specific workload already you can make use of a variety of
| hardware. In another sense, no, a small team isn't likely to be
| able to write a direct alternative for the CUDA ecosystem that
| reaches devex parity. Lots of people have a pretty clear idea
| of what it takes to do this but none of the major vendors seem
| to be doing the work.
|
| Knowing that reaching broad devex parity is very expensive, I
| think the real win is figuring out what specific problem you
| have and building community and robust software support around
| that.
| alfalfasprout wrote:
| > In other discussion here, people asserted that a CUDA
| replacement was unworkable because you couldn't replace
| Nvidia's CuBLAS implementation
|
| Targeting nvidia GPUs? Or in general? For whom?
|
| Building a performant BLAS library is hard but certainly not
| impossible. The tricks discussed in this post are hardly
| anything new either. Now, making a BLAS competitive with
| Nvidia's on its own GPUs is bound to be tough, but it's not
| technically infeasible (after all, you can drop down to PTX if
| needed).
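|
| For concreteness, a minimal sketch of what "dropping down to PTX"
| looks like from CUDA C++ (fma.rn.f32 is standard PTX; the toy
| elementwise kernel around it is made up for illustration):
|
|     __global__ void fma_ptx(const float* a, const float* b,
|                             const float* c, float* out, int n) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) {
|             float r;
|             // r = a[i] * b[i] + c[i], expressed directly as a PTX
|             // instruction via inline asm
|             asm volatile("fma.rn.f32 %0, %1, %2, %3;"
|                          : "=f"(r)
|                          : "f"(a[i]), "f"(b[i]), "f"(c[i]));
|             out[i] = r;
|         }
|     }
|
| Real uses are less trivial (e.g. hand-scheduling mma or cp.async
| instructions), but the mechanism is the same inline-asm escape
| hatch.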
| deredede wrote:
| When I was optimizing GPU kernels a few years back, Nvidia's
| own kernels were getting those last few percent of
| performance by making use of hardware-specific features (I
| remember operand caches being one) that are not available
| through PTX.
| ap4 wrote:
| Beating CuBLAS is easy. I wrote a 30-line kernel that
| multiplies two 4096x4096 matrices 14% faster than CuBLAS on a
| 4090. The question is how to earn money on that.
| coffeeaddict1 wrote:
| I'm a little sceptical of your claims. Care to share the
| kernel you wrote?
| david-gpu wrote:
| I would bet actual money that they are not doing an
| apples-to-apples comparison.
|
| I have seen how those high-performance libraries are made
| and I'm still in awe at the quality and quantity of the
| staffing involved. Those were the smartest and most
| knowledgeable engineers I met in my career.
| ladberg wrote:
| If this were true (and I highly doubt it) it's obvious how
| to make money from it: collect a 7 figure paycheck from
| Nvidia, AMD, or any FAANG.
| ap4 wrote:
| I swapped one of the kernels in the code from the article
| for my kernel, and left only the multiplication of
| matrices of size 4096x4096.
|
| On average over 20 runs:
|
| CuBLAS (./sgemm 0) has 50.9 TFLOPS.
|
| My kernel has 61.8 TFLOPS, so it's actually +21% speedup
| in this benchmark.
|
| How do I collect my paycheck?
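|
| For reference, TFLOPS numbers like these are usually computed as
| 2*N^3 floating point operations over the measured kernel time. A
| rough sketch of that kind of harness against cuBLAS (not the
| article's actual benchmark; error checks omitted and the matrices
| left uninitialized):
|
|     #include <cublas_v2.h>
|     #include <cuda_runtime.h>
|     #include <cstdio>
|
|     int main() {
|         const int N = 4096;
|         float *A, *B, *C;
|         cudaMalloc(&A, sizeof(float) * N * N);
|         cudaMalloc(&B, sizeof(float) * N * N);
|         cudaMalloc(&C, sizeof(float) * N * N);
|
|         cublasHandle_t handle;
|         cublasCreate(&handle);
|         const float alpha = 1.0f, beta = 0.0f;
|
|         cudaEvent_t start, stop;
|         cudaEventCreate(&start);
|         cudaEventCreate(&stop);
|
|         // One warm-up call, then 20 timed runs.
|         cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
|                     &alpha, A, N, B, N, &beta, C, N);
|         const int runs = 20;
|         cudaEventRecord(start);
|         for (int i = 0; i < runs; ++i)
|             cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
|                         &alpha, A, N, B, N, &beta, C, N);
|         cudaEventRecord(stop);
|         cudaEventSynchronize(stop);
|
|         float ms = 0.0f;
|         cudaEventElapsedTime(&ms, start, stop);
|         double tflops = 2.0 * N * (double)N * N * runs
|                         / (ms * 1e-3) / 1e12;
|         printf("cuBLAS SGEMM: %.1f TFLOPS\n", tflops);
|         return 0;
|     }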
| JuanJohnJames wrote:
| Post the code and your curriculum
| aaa370 wrote:
| I gotta see it to believe it ;)
| imtringued wrote:
| The problem with AMD isn't that they are only hitting 90% of
| the hardware capabilities vs Nvidia hitting 98%.
|
| It's the fact that AMD doesn't prioritize the reliability of
| its hardware and software stack. If I run llama.cpp on Vulkan I
| get a reasonable speedup, but if I raise the batch size to 512,
| the GPU starts making strange noises and shuts the PC down
| midway. Very cool. 98% of zero is still zero.
| KeplerBoy wrote:
| Of course it can be done, it's just a lot of effort. You need
| parametric kernels so you can find optimal configurations for
| all hardware and input-size combinations.
|
| Then there are also numerics: being fast is not enough if your
| implementation accumulates a lot of rounding errors doing so.
| Floating point arithmetic can and will mess up your results in
| unexpected ways. -funsafe-math-optimizations is famously neither
| fun nor safe.
|
| Maybe tooling will catch up and make it easier. Think tinygrad
| with beamsearch, triton or halide.
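|
| As a minimal sketch of the "parametric kernels" idea (all names
| hypothetical, and the kernel body is a deliberately naive
| placeholder rather than a tuned tiled SGEMM): tile sizes become
| compile-time parameters, and the host times each instantiation
| per problem size.
|
|     #include <cuda_runtime.h>
|
|     // Naive placeholder body; a real variant would do
|     // shared-memory tiling driven by the BM/BN parameters.
|     template <int BM, int BN>
|     __global__ void sgemm_param(int M, int N, int K, const float* A,
|                                 const float* B, float* C) {
|         int row = blockIdx.y * BM + threadIdx.y;
|         int col = blockIdx.x * BN + threadIdx.x;
|         if (row < M && col < N) {
|             float acc = 0.0f;
|             for (int k = 0; k < K; ++k)
|                 acc += A[row * K + k] * B[k * N + col];
|             C[row * N + col] = acc;
|         }
|     }
|
|     // Time one configuration; the tuner keeps whichever
|     // instantiation is fastest for a given GPU and problem size.
|     template <int BM, int BN>
|     float time_variant(int M, int N, int K, const float* A,
|                        const float* B, float* C) {
|         dim3 block(BN, BM);
|         dim3 grid((N + BN - 1) / BN, (M + BM - 1) / BM);
|         cudaEvent_t start, stop;
|         cudaEventCreate(&start);
|         cudaEventCreate(&stop);
|         cudaEventRecord(start);
|         sgemm_param<BM, BN><<<grid, block>>>(M, N, K, A, B, C);
|         cudaEventRecord(stop);
|         cudaEventSynchronize(stop);
|         float ms = 0.0f;
|         cudaEventElapsedTime(&ms, start, stop);
|         return ms;
|     }
|
| e.g. compare time_variant<16, 16>(...) against
| time_variant<32, 8>(...) and dispatch to the winner, which is
| roughly what autotuners like Triton's do for you.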
| GaggiX wrote:
| As the article itself notes: "However, for smaller matrices,
| we're doing poorly in comparison to Nvidia's library! This
| happens because cuBLAS contains not one single implementation
| of SGEMM, but hundreds of them." It would take considerable
| effort to replace cuBLAS.
| ladberg wrote:
| cuBLAS is not really Nvidia's moat: every competitor has a
| workable BLAS implementation and some are even drop-in
| replacements for cuBLAS.
|
| In fact cuBLAS and CUDA are kinda orthogonal in that you're
| either calling a pre-built cuBLAS kernel or writing your own
| CUDA kernel but not really combining the two.
|
| I'd say CUDA shines more because of stability, documentation,
| community support + examples, and ability to use modern C++
| features in GPU code.
| flakiness wrote:
| Note that this is from 2022.
|
| My guess is that people nowadays are gradually moving away from
| raw CUDA programming towards things like Triton, and that you
| won't be focusing on pure GEMM anyway since you tend to do some
| fusion (sketched below).
|
| The Triton tutorial claims their performance is on par with
| cuBLAS.
|
| https://triton-lang.org/main/getting-started/tutorials/03-ma...
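|
| To make the fusion point concrete, a naive sketch (one thread per
| output element, names made up) of folding a bias add and ReLU
| epilogue into the GEMM kernel instead of running separate passes:
|
|     __global__ void gemm_bias_relu(int M, int N, int K,
|                                    const float* A, const float* B,
|                                    const float* bias, float* C) {
|         int row = blockIdx.y * blockDim.y + threadIdx.y;
|         int col = blockIdx.x * blockDim.x + threadIdx.x;
|         if (row < M && col < N) {
|             float acc = 0.0f;
|             for (int k = 0; k < K; ++k)
|                 acc += A[row * K + k] * B[k * N + col];
|             acc += bias[col];                     // fused bias add
|             C[row * N + col] = fmaxf(acc, 0.0f);  // fused ReLU
|         }
|     }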
| machinekob wrote:
| But Triton is just an abstraction over CUDA for Python (like
| cupy, numba, etc.). If you need low-level access you'll still
| use CUDA; if you want high level you can use Triton or numba;
| even higher, you'll just use wrappers like PyTorch/JAX.
| bee_rider wrote:
| I'm sure they are using cool super-modern stuff but I wonder if
| Eigen can somehow be made to spit out CUDA code?
| imjonse wrote:
| It is still an excellent article if you care about how the GPU
| works. Triton doesn't magically hide all the hardware details.
| dang wrote:
| Year added above. Thanks!
| aaa370 wrote:
| Another point to consider here is that this project of writing a
| cuBLAS level GEMM kernel becomes much more challenging if you are
| doing it with fp16, and are thus competing with the cuBLAS
| kernels that use tensor cores. The (theoretical) arithmetic
| throughput of tensor cores is ~8x higher compared to fp32 math
| on the Turing arch; I don't know off the top of my head, but I
| think this ratio is the same or greater for Ampere/Hopper tensor
| cores.
|
| This makes the project proportionally harder in my opinion
| because you need to be that much more efficient with moving data
| through the memory hierarchy. With tensor cores, to get anywhere
| close to cuBLAS, you need to start with something like the most
| efficient kernel in Simon's article, and then do stuff like
| shared memory swizzling, async global memory copies, double
| buffering, and writing a really efficient kernel epilogue to
| accumulate the C matrix into the product.
|
| I came across this article a while ago and it inspired me to take
| a stab at this^, and as of now I have gotten to ~80% of the
| cuBLAS tensor core performance where the kernel is mostly compute
| bound, and I am close to giving up on the last ~20%, because I
| think I may need to write the inner loop in SASS to make sure the
| instruction mix between shared memory loads, mma instructions,
| and synchronizations is perfectly balanced so that none of the
| hardware pipelines get overloaded (see link below), and I have
| enough compassion for myself to not spend my free time doing
| stuff like that :). There are also certain things implemented in
| CUTLASS that seem important (look up serpentine traversal) but
| NVIDIA engineers won't talk about the hardware details required to
| understand why this helps.
|
| Article on this is forthcoming
|
| https://github.com/NervanaSystems/maxas/wiki/SGEMM
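|
| For anyone curious what the tensor-core building block looks
| like, here is a minimal sketch using the public WMMA API (fp16
| inputs, fp32 accumulate, one warp per 16x16 output tile). It
| assumes M, N, K are multiples of 16, blockDim.x is a multiple of
| 32, and the grid covers all of C; the hard parts discussed above
| (swizzled shared-memory staging, cp.async double buffering, the
| epilogue) are deliberately not shown.
|
|     #include <mma.h>
|     #include <cuda_fp16.h>
|     using namespace nvcuda;
|
|     __global__ void wmma_gemm(const half* A, const half* B,
|                               float* C, int M, int N, int K) {
|         // One warp owns one 16x16 tile of C.
|         int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
|         int warpN = blockIdx.y * blockDim.y + threadIdx.y;
|
|         wmma::fragment<wmma::matrix_a, 16, 16, 16, half,
|                        wmma::row_major> a_frag;
|         wmma::fragment<wmma::matrix_b, 16, 16, 16, half,
|                        wmma::row_major> b_frag;
|         wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
|         wmma::fill_fragment(c_frag, 0.0f);
|
|         for (int k = 0; k < K; k += 16) {
|             // Loads straight from global memory; a fast kernel
|             // stages these tiles through shared memory instead.
|             wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
|             wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
|             wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
|         }
|         wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16,
|                                 c_frag, N, wmma::mem_row_major);
|     }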
___________________________________________________________________
(page generated 2024-07-26 23:08 UTC)