[HN Gopher] Optimizing a WebGPU Matmul Kernel for 1 TFLOP
       ___________________________________________________________________
        
       Optimizing a WebGPU Matmul Kernel for 1 TFLOP
        
       Author : zanussbaum
       Score  : 166 points
        Date   : 2024-11-11 17:29 UTC (1 day ago)
        
 (HTM) web link (zanussbaum.substack.com)
 (TXT) w3m dump (zanussbaum.substack.com)
        
       | billconan wrote:
        | WebGPU doesn't seem to talk about bank conflicts, hiding some
        | hardware details that might be necessary to write the best
        | kernel. Will it be able to match the perf of CUDA on the same
       | hardware?
        
         | zanussbaum wrote:
          | great question, to me WebGPU sits at a hair higher level than
          | CUDA or Vulkan. so you don't have the exact same level of
          | control but can get to 80% of its performance without having
          | to write different kernels specific to the hardware
        
         | brrrrrm wrote:
          | WebGPU cannot even come close unfortunately since they don't
          | have support for hardware-specific memory or warp-level
          | primitives (like TMA or tensor cores). it's not like it gets
          | 80% of perf, it gets < 30% of the peak perf for anything
          | related to compute-heavy matrix multiplications
        
           | zanussbaum wrote:
           | you're definitely right, 80% was a bit of an overestimation,
           | especially with respect to CUDA
           | 
            | it would be cool to see if there's some way to get better
            | access to those lower-level primitives, but i would be
            | surprised if there is
            |
            | it does seem like subgroup support is a step in the right
            | direction though!
        
           | Const-me wrote:
           | > don't have support for hardware specific memory
           | 
           | I have no experience with WebGPU but if you mean group shared
           | memory, I think the support is available. See the demo:
           | https://compute.toys/view/25
        
             | zanussbaum wrote:
              | i tried using workgroup shared memory and found it slower
              | than just recomputing everything in each thread, although
              | i may have been doing something dumb
             | 
             | i'm excited to try subgroups though:
             | https://developer.chrome.com/blog/new-in-
             | webgpu-128#experime...
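              | 
              | for context, the usual workgroup-memory tiling pattern
              | looks roughly like this in WGSL (a minimal sketch, not the
              | exact kernel from the post; the 16x16 tile size, names and
              | bindings are only illustrative):
              | 
              |     struct Dims { M : u32, N : u32, K : u32 }
              | 
              |     @group(0) @binding(0) var<storage, read> A : array<f32>;
              |     @group(0) @binding(1) var<storage, read> B : array<f32>;
              |     @group(0) @binding(2)
              |     var<storage, read_write> C : array<f32>;
              |     @group(0) @binding(3) var<uniform> dims : Dims;
              | 
              |     // tile edge; must match @workgroup_size below
              |     const TILE : u32 = 16u;
              | 
              |     // one TILE x TILE tile of each input, staged in
              |     // workgroup (shared) memory; 256 = TILE * TILE
              |     var<workgroup> tileA : array<f32, 256>;
              |     var<workgroup> tileB : array<f32, 256>;
              | 
              |     @compute @workgroup_size(16, 16)
              |     fn main(@builtin(global_invocation_id) gid : vec3<u32>,
              |             @builtin(local_invocation_id) lid : vec3<u32>) {
              |       let row = gid.y;
              |       let col = gid.x;
              |       var acc = 0.0;
              | 
              |       let tiles = (dims.K + TILE - 1u) / TILE;
              |       for (var t = 0u; t < tiles; t++) {
              |         // each thread stages one element of A and one of
              |         // B, zero-padding out-of-bounds positions
              |         let aCol = t * TILE + lid.x;
              |         let bRow = t * TILE + lid.y;
              |         tileA[lid.y * TILE + lid.x] =
              |           select(0.0, A[row * dims.K + aCol],
              |                  row < dims.M && aCol < dims.K);
              |         tileB[lid.y * TILE + lid.x] =
              |           select(0.0, B[bRow * dims.N + col],
              |                  bRow < dims.K && col < dims.N);
              |         workgroupBarrier();
              | 
              |         // partial dot product over the staged tile
              |         for (var k = 0u; k < TILE; k++) {
              |           acc += tileA[lid.y * TILE + k] *
              |                  tileB[k * TILE + lid.x];
              |         }
              |         workgroupBarrier();
              |       }
              | 
              |       if (row < dims.M && col < dims.N) {
              |         C[row * dims.N + col] = acc;
              |       }
              |     }
              | 
              | the structure is: stage a tile of each input, barrier,
              | accumulate partial dot products from the tiles, barrier,
              | repeat.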
        
           | kayvr wrote:
            | I've heard the WebGPU working group wants to close the gap
            | on tensor core support.
        
       | Const-me wrote:
        | A couple of years ago, I did about the same thing in HLSL, for a
        | Direct3D 11.0 compute shader. Here's the fastest version I
       | managed to make back then: https://github.com/Const-
       | me/Cgml/blob/master/Mistral/Mistral...
       | 
        | As you can see, it implements 32x32 tiling, using thread groups
        | of 32x8 threads and two groupshared buffers to load tiles of the
        | input matrices, and accumulates into local variables, 32 / 8 = 4
        | accumulators per thread.
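        | 
        | Assuming FP32 tiles, the two groupshared buffers take 2 * 32 *
        | 32 * 4 = 8 KB per thread group, comfortably under the 32 KB
        | groupshared limit of D3D11 compute shaders.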
        
         | lostmsu wrote:
         | What's the perf like?
        
           | Const-me wrote:
           | Sorry, I have not benchmarked against cuBLAS or Eigen or
           | similar, I did that thing for ML inference.
           | 
           | I have implemented a profiler on top of D3D11_QUERY_TIMESTAMP
           | and D3D11_QUERY_TIMESTAMP_DISJOINT queries, and tweaked the
           | compute shader to minimize the time reported by these queries
           | for my specific use case.
        
       | shihab wrote:
       | Great article!
       | 
       | For context: this WebGPU version achieves ~17% of peak
        | theoretical performance of the M2. With CUDA (i.e. cuBLAS), you
        | can reach ~75% of peak performance for the same matrix config
        | (without tensor cores).
        
         | zanussbaum wrote:
         | thanks! and yes definitely not at CUDA levels :)
        
         | weinzierl wrote:
         | 75% can't be the best we can do. What would reach 100% or
         | nearly 100%? Handcoded assembly?
        
           | jsheard wrote:
           | With GPUs it's not uncommon to run out of memory bandwidth
           | before you max out the theoretical FLOPS. They may have a ton
           | of bandwidth but it's never enough.
           | 
           | That can lead you to some pretty counter-intuitive
           | optimizations because it's often faster to do _more_ compute
           | work if it means you touch less memory in the process.
        
             | wbl wrote:
             | Shouldn't the roofline inform capacity assessments?
        
               | KeplerBoy wrote:
               | Sure, but rooflines don't account for stuff like memory
                | granularity. You not only have to do a lot of flops per
                | byte to achieve the necessary arithmetic intensity, you
               | also have to access those bytes in a coalesced way. I.e.,
               | you want to access consecutive bytes, which are ideally
               | already in registers.
        
             | stephencanon wrote:
             | For sufficiently large GEMM you should never run out of
             | bandwidth before you max out FLOPS if your blocking is
             | organized correctly, because the arithmetic scales like
             | O(n^3) while the memory access scales like O(n^2).
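              | 
              | For example, a 4096x4096x4096 SGEMM does 2 * 4096^3 ~
              | 1.4e11 flops while the three matrices together are only
              | 3 * 4096^2 * 4 bytes ~ 0.2 GB, i.e. roughly 680 flops per
              | byte touched, far more arithmetic intensity than is needed
              | to stay compute-bound.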
        
               | saagarjha wrote:
               | In theory, yes. In practice you will probably be forced
               | to tile your GEMM and incur the penalty of redundant
               | memory accesses.
        
               | stephencanon wrote:
               | Sure, but still on each tile, you do O(k^3) compute with
               | O(k^2) memory, and you generally arrange things so that
               | at least one tile is in L1 and at least one other is in
               | L2/LLC (using CPU idioms), so again, you have plenty of
               | bandwidth (typical choices of k are in the ballpark of
                | ~32, and a 32:1 compute-to-memory ratio is just fine on
                | most hardware, especially if some of those accesses are
                | coming from fast memory).
        
             | adev_ wrote:
             | > That can lead you to some pretty counter-intuitive
             | optimizations because it's often faster to do more compute
             | work if it means you touch less memory in the process.
             | 
              | It is not specific to GPUs: this kind of optimization is
              | pretty common on CPUs too, where latency kills you and 200
              | cycles spent on compute can actually be faster than a
              | single cache miss trying to fetch data. This is pretty
              | common for many SIMD algorithms actually.
              |
              | Memory is currently lagging behind compute on almost every
              | type of modern hardware, and it will very likely become
              | worse, not better.
        
           | david-gpu wrote:
           | The parameters of the matrix multiply, such as the size of
            | the matrices, impose some limits on how close you can get to
            | the peak theoretical performance of a particular GPU. Not all
            | possible matrix multiplies are equally valuable to optimize
            | _a priori_, so the hardware is designed to perform best on
           | problems that are financially significant, such as modern
           | LLMs.
           | 
           | As for handcoded assembly, do you believe that it would be
           | financially sound to hand code and maintain thousands of
           | kernels that way, even if you believed that they would be
           | faster?
        
             | littlestymaar wrote:
             | > As for handcoded assembly, do you believe that it would
             | be financially sound to hand code and maintain thousands of
             | kernels that way, even if you believed that they would be
             | faster?
             | 
              | Why not? We do so for cryptographic primitives and video
              | codecs. And why are you talking about "thousands of
              | kernels"? AI programs only need a small number of
              | different kernels, so it doesn't sound intractable.
        
               | david-gpu wrote:
               | _> AI programs only need a small amount of different
               | kernels_
               | 
                | That is not the case. What looks like a simple matmul
                | operation actually requires these libraries to select
                | which specific kernel to execute out of the many
                | available internally.
               | 
                | If you are curious to learn more, NVidia open-sourced a
                | library called CUTLASS some years ago. And remember, that
               | is only what they are willing to open source.
        
               | littlestymaar wrote:
               | Is that really different from AV codecs in terms of scale
               | though?
        
               | david-gpu wrote:
               | I am not at liberty to discuss more than that.
        
               | saagarjha wrote:
               | Yes, you can peek under the hood of cuBLAS and notice
               | that it has dozens of kernels for different problem
               | sizes. It's not generally the case that when you do h264
               | at a different crf you have a completely different tiling
               | strategy that you have to implement.
        
         | Const-me wrote:
          | > you can reach ~75% of peak performance for the same matrix
          | > config
         | 
          | Not on the same computer: CUDA doesn't run on the integrated
         | GPU of the Apple M2 Pro.
        
           | ladyanita22 wrote:
           | That's exactly what I was wondering. That cannot be.
        
           | stephencanon wrote:
           | Probably more relevant here is that a single CPU core on that
            | computer exceeds 1 TFLOP/s on GEMM with plenty of margin
           | using a single lib call, and leaves the rest of the CPU cores
           | and all of the GPU free to do other work.
        
             | Const-me wrote:
              | OP implemented that stuff with WebGPU. At runtime, that
              | weird language compiles into a compute shader which
              | computes that stuff on the GPU.
        
               | saagarjha wrote:
               | I believe the point being made was that this could be
                | done on the CPU faster than was achieved here.
        
               | Const-me wrote:
               | Yeah, but not on a single core.
               | 
                | In my desktop computer, I have a Ryzen 7 8700G CPU, which
                | has 8 Zen 4 cores, 4.2 GHz base frequency, 65W TDP.
                | Theoretically, when doing FP32 FMA, each CPU core can do
                | 32 FLOP/cycle. At the base frequency, this translates
                | into 134 GFLOP/s per core. You're gonna need all 8 cores
                | to reach a theoretical 1 TFLOP/s.
               | 
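                | (32 FLOP/cycle comes from 2 x 256-bit FMA per cycle = 16
                | FP32 FMAs = 32 flops; 32 x 4.2 GHz = 134.4 GFLOP/s per
                | core, or ~1.07 TFLOP/s across all 8 cores.)
                | 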
                | BTW, the integrated GPU inside the same 8700G processor
                | can theoretically do 8.2 TFLOP/s of FP32.
        
             | adrian_b wrote:
             | Nope, no Apple CPU core has such performance.
             | 
             | That single lib call must have used the AMX accelerator,
             | which is separate from the cores and shared by a group of
             | cores.
             | 
              | So that AMX accelerator's performance may be greater than
              | that of all the CPU cores together. AFAIK, some Apple CPUs
              | have one AMX accelerator for the big cores and another AMX
              | accelerator for the smaller cores, but in any case, if you
              | have obtained 1 TFLOP/s when running the program on 1
              | core, there is no hope of getting much more when running
              | it on multiple cores, because all cores of the same type
              | use the same shared accelerator.
        
               | Const-me wrote:
               | One interesting thing about these newfangled matrix/AI/ML
               | accelerators that's very rarely mentioned on the
                | internets: they only deliver that many TFLOP/s because
                | they operate at very low precision.
               | 
                | nVidia tensor cores support int8, a couple of versions of
                | FP16 (BF16 and the standard IEEE one), and FP19, which
                | they call TensorFloat-32. I think Intel AMX only supports
                | int8 and BF16.
               | 
               | None of them supports FP32 let alone FP64 input numbers,
               | which makes them completely useless for traditional GEMM
               | stuff.
        
               | wtallis wrote:
               | https://github.com/corsix/amx indicates that Apple's AMX
               | supports up to 64-bit FP, but I don't see any performance
               | metrics. They also have the ANE, which is the low-
               | precision ML-focused accelerator.
        
               | Const-me wrote:
               | I wasn't aware there're two completely different things
                | from different companies both called AMX. I assumed it
                | was this AMX:
               | https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions
               | 
                | Apple's version is indeed interesting. I wonder why Apple
                | hasn't exposed it to programmers, or implemented a BLAS
                | library on top of that thing?
        
               | wtallis wrote:
                | > I wonder why Apple hasn't exposed it to programmers,
                | > or implemented a BLAS library on top of that thing?
               | 
               | Using the Accelerate framework (which includes Apple's
               | BLAS) is the _only_ supported way for programmers to
               | access the AMX. Reverse engineering the instruction set
                | to access it directly is discouraged, because it's not a
               | documented stable interface.
        
               | stephencanon wrote:
               | Apple's matrix unit supports FP16, 32, and 64 sources and
               | accumulators.
        
               | stephencanon wrote:
               | Right. So, like I said, using one CPU core, you can
               | exceed 1 TFLOP/s, leaving all the other CPU cores and the
               | GPU free for other work.
        
           | shihab wrote:
           | Yes, that's why I was focusing on percentage of peak hardware
           | performance, not actual flops.
        
         | brrrrrm wrote:
         | how are you running CUDA on the integrated Apple silicon GPU
         | these days?
        
           | KeplerBoy wrote:
           | You are not.
        
       | maelito wrote:
       | WebGPU will make Web maps even more competitive than they are
       | already.
       | 
       | The smoothness of an iPhone map zoom, on any device.
        
         | jsheard wrote:
         | > The smoothness of an iPhone map zoom, on any device.
         | 
         | Any device except an iPhone, until Apple finally gets around to
         | shipping WebGPU in Safari. Any year now...
        
           | astlouis44 wrote:
           | Safari is officially enabling support for WebGPU in iOS 18.2,
           | which is rolling out within the first weeks of December.
        
             | jsheard wrote:
             | Where'd you hear that? It's not listed here:
             | 
             | https://developer.apple.com/documentation/safari-release-
             | not...
        
               | astlouis44 wrote:
               | Source is here, from a Unity WebGPU thread. Look at the
               | comment from October 27 from Brendan Duncan, a Unity
               | employee: https://discussions.unity.com/t/early-access-
               | to-the-new-webg...
               | 
               | "I have found that WebGPU is enabled by default now with
               | iOS 18.2. Apple has been working in the open on WebGPU.
               | The WebKit source code has their latest WebGPU work in
               | it. What hasn't been known is their release schedule, but
               | now with 18.2 it's looking very promising that it will be
               | on by default in that version."
        
             | luketaylor wrote:
             | Source?
             | 
             | Edit: I just pressed "Reset All to Defaults" under "WebKit
             | Feature Flags" on my device running 18.2 beta, and the
             | switch for WebGPU is on!! <3
        
       | coffeeaddict1 wrote:
        | You can do slightly better fairly easily, I think. See here for
        | an example: https://github.com/AnswerDotAI/gpu.cpp/pull/35
        
       | inglor wrote:
       | Can you explain why you did the naive algorithm here and not any
       | of the fast matrix multiplication ones that trade multiplications
       | for more additions? Just for educational purposes or is there a
       | performance benefit in the technique?
        
         | zanussbaum wrote:
          | at least on my m2, the compiled kernel ends up using fast math
         | anyways so using WGSL's fma didn't change anything about the
         | actual kernel that gets run
        
           | hedgehog wrote:
           | inglor is probably referring to Strassen or Coppersmith-
           | Winograd.
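            | 
            | (Strassen replaces the 8 multiplications in a 2x2 block
            | recursion with 7, at the cost of extra additions and
            | temporaries, giving O(n^log2(7)) ~ O(n^2.81) arithmetic
            | instead of O(n^3).)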
        
             | zanussbaum wrote:
             | oh in that case it was because i didn't know about them :)
             | something to try next!
        
             | wbl wrote:
              | Last I checked, the extra memory accesses really hurt in a
              | lot of cases, especially for the more complex ones, but
              | I'm no expert.
        
         | saagarjha wrote:
         | Because those algorithms are generally not worth implementing
         | even though their algorithmic complexity is theoretically
         | lower.
        
       | pama wrote:
       | To clarify the title: TFLOP/s is the unit the author goes after,
       | not TFLOP. People in the threads compare CUDA performance on GPUs
        | to WebGPU performance: please recall that the H100 has a
       | theoretical performance of about 1000 TFLOP/s for bfloat16, and
       | even moderately complicated algorithms in typical modern
       | transformer architectures can reach about half of that
       | performance.
        
         | saagarjha wrote:
          | The H100 can do well over 1500 TFLOP/s in FP16.
        
           | nulltype wrote:
           | Which H100 and how much over 1500 TFLOP/s?
           | 
           | The datasheet for the H100 SXM seems to indicate that it can
           | only do ~1000 TFLOP/s peak.
        
             | saagarjha wrote:
             | I just went to Nvidia's site and downloaded the data sheet:
             | https://resources.nvidia.com/en-us-tensor-core/nvidia-
             | tensor.... It says 1600/1900 in half precision?
        
               | wtallis wrote:
               | Read the fine print: "With sparsity". They double the
               | claimed throughput by assuming that half of the FLOPs can
               | be skipped.
        
               | saagarjha wrote:
               | Oh, that is really annoying. Thanks for catching that!
        
               | pama wrote:
               | There are two populations of people reading the NVIDIA
               | specs (and now you switched groups). If NVIDIA ever
               | changes their marketing strategy and the asterisk denotes
                | something else, there might be a third population,
                | because I know a lot of people who, I suspect, will keep
                | dividing those starred FLOP/s by two :-)
        
               | menaerus wrote:
                | I also recently went through the specs and noticed "with
                | sparsity", but I didn't quite understand what it
                | specifically refers to. The premise is that a lot of
                | weights in matmul operations will be zero or
                | insignificant - also known as sparse matrices - and in
                | that case the A100/H100 has circuitry that can boost the
                | throughput up to 2x, essentially "skipping" half of the
                | FLOPs as you say.
               | 
                | I am not an expert in LLMs, but I don't think you can
                | end up with a significant share of zeroed weights (~50%)
                | in a converged network, so I think it is safe to say
                | that the theoretical throughput for 99% of cases is
                | really ~800 TFLOP/s and not ~1600 TFLOP/s as advertised.
        
       | FL33TW00D wrote:
       | I wrote something similar a while back:
       | https://github.com/FL33TW00D/wgpu-mm
       | 
       | Also does quantized matmuls.
        
         | brrrrrm wrote:
         | would be fun to do a leaderboard of some specific size (e.g.
         | 4096x4096x4096) just to get all the code and tricks in one spot
         | for folks to learn about things
        
       | mkeeter wrote:
       | For a very deep dive into the subject, this is a great writeup:
       | 
       | How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance
       | (https://siboehm.com/articles/22/CUDA-MMM)
       | 
       | (It's CUDA-specific, so there may be aspects that can't yet be
       | ported to WGPU)
        
         | zanussbaum wrote:
         | this was a huge inspiration for the post! i tried to highlight
         | it in the blog but it might have gotten buried
         | 
          | there are a few things that i wasn't able to figure out how to
          | get access to, or wasn't sure were possible. for example, a
          | lot of Simon's article takes advantage of the warp scheduler
          | and warp tiling.
         | 
          | i had a hard time finding information on whether that's even
          | possible with my M2/Metal, or on the general memory access
         | patterns. it seems like CUDA does have better documentation in
         | this regard
        
         | almostgotcaught wrote:
         | That's a nice tutorial but just to be clear: that is not a deep
          | dive in any sense. It's just the bog-standard tricks. It
          | doesn't cover MMA and WMMA, which today are table stakes for
          | fast matmul. It also doesn't cover software pipelining. It's
         | basically a good summary of the basics.
        
           | saagarjha wrote:
           | It's a deep dive as of like 2015 probably. I don't know if
           | anyone has done something similar for modern GEMMs. Maybe the
           | CUTLASS or Colfax people?
        
       ___________________________________________________________________
       (page generated 2024-11-12 23:01 UTC)