[HN Gopher] Optimizing a WebGPU Matmul Kernel for 1 TFLOP
___________________________________________________________________
Optimizing a WebGPU Matmul Kernel for 1 TFLOP
Author : zanussbaum
Score : 166 points
Date : 2024-11-11 17:29 UTC (1 day ago)
(HTM) web link (zanussbaum.substack.com)
(TXT) w3m dump (zanussbaum.substack.com)
| billconan wrote:
| WebGPU doesn't seem to talk about bank conflicts, hiding some
| hardware details that might be necessary to write the best
| kernel. will it be able to match the perf of CUDA on the same
| hardware?
| zanussbaum wrote:
| great question! to me, WebGPU sits a hair higher level than CUDA
| or Vulkan, so you don't have the exact same level of control,
| but you can get to ~80% of its performance without having to
| write different kernels specific to the hardware
| brrrrrm wrote:
| WebGPU unfortunately cannot even come close, since it doesn't
| have support for hardware-specific memory or warp-level
| primitives (like TMA or tensor cores). it's not like it gets 80%
| of peak perf; it gets < 30% of peak for anything involving
| compute-heavy matrix multiplications
| zanussbaum wrote:
| you're definitely right, 80% was a bit of an overestimation,
| especially with respect to CUDA
|
| it would be cool to see if there's some way to get better
| access to those lower-level primitives, but i would be
| surprised if that happens
|
| it does seem like subgroup support is a step in the right
| direction though!
| Const-me wrote:
| > don't have support for hardware specific memory
|
| I have no experience with WebGPU but if you mean group shared
| memory, I think the support is available. See the demo:
| https://compute.toys/view/25
| zanussbaum wrote:
| i tried using workgroup shared memory and found it slower than
| just recomputing everything in each thread, although i may
| have been doing something dumb
|
| i'm excited to try subgroups though:
| https://developer.chrome.com/blog/new-in-
| webgpu-128#experime...
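|
| for readers who haven't seen the extension: a minimal sketch of
| what WGSL subgroups look like, not code from the post, assuming
| a device created with the "subgroups" feature and the
| hypothetical bindings below (partial dot products reduced
| across a subgroup without workgroup memory or barriers):
|
|     enable subgroups;
|
|     @group(0) @binding(0) var<storage, read> a : array<f32>;
|     @group(0) @binding(1) var<storage, read> b : array<f32>;
|     @group(0) @binding(2) var<storage, read_write> partials : array<f32>;
|
|     // Each invocation computes one product; subgroupAdd() reduces it
|     // across the subgroup (warp / SIMD-group) in registers. Assumes
|     // the dispatch exactly covers the length of the input arrays.
|     @compute @workgroup_size(64)
|     fn dot_partials(@builtin(global_invocation_id) gid : vec3<u32>) {
|       let s = subgroupAdd(a[gid.x] * b[gid.x]);
|       // Every invocation in a subgroup sees the same reduced value.
|       partials[gid.x] = s;
|     }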
| kayvr wrote:
| I've heard the WebGPU working group wants to close the gap on
| tensor core support.
| Const-me wrote:
| A couple of years ago, I wanted to do about the same thing in
| HLSL, for a Direct3D 11.0 compute shader. Here's the fastest
| version I managed to make back then: https://github.com/Const-
| me/Cgml/blob/master/Mistral/Mistral...
|
| As you can see, I implemented 32x32 tiling using thread groups
| of 32x8 threads and two groupshared buffers to load tiles of the
| input matrices, and I accumulate into local variables: 32 / 8 =
| 4 accumulators per thread.
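|
| (For readers following along in WebGPU rather than HLSL, here is
| a rough WGSL sketch of the same idea, not the linked shader: a
| workgroup-shared tile of each input matrix is staged once and
| reused by every thread in the group. It assumes square,
| row-major N x N matrices with N a multiple of 16, and it uses a
| simpler 16x16 tile with one accumulator per thread rather than
| the 32x32 / 32x8 layout described above.)
|
|     struct Params { n : u32 }
|
|     @group(0) @binding(0) var<storage, read> A : array<f32>;
|     @group(0) @binding(1) var<storage, read> B : array<f32>;
|     @group(0) @binding(2) var<storage, read_write> C : array<f32>;
|     @group(0) @binding(3) var<uniform> params : Params;
|
|     // One 16x16 tile of each input matrix, staged in workgroup memory.
|     var<workgroup> tileA : array<f32, 256>;
|     var<workgroup> tileB : array<f32, 256>;
|
|     // Dispatched as (n / 16, n / 16, 1) workgroups.
|     @compute @workgroup_size(16, 16)
|     fn matmul(@builtin(workgroup_id) wg : vec3<u32>,
|               @builtin(local_invocation_id) lid : vec3<u32>) {
|       let n = params.n;
|       let row = wg.y * 16u + lid.y;
|       let col = wg.x * 16u + lid.x;
|       var acc = 0.0;
|       // Walk the shared K dimension one 16-wide tile at a time.
|       for (var t = 0u; t < n; t = t + 16u) {
|         // Each thread stages one element of each input tile.
|         tileA[lid.y * 16u + lid.x] = A[row * n + t + lid.x];
|         tileB[lid.y * 16u + lid.x] = B[(t + lid.y) * n + col];
|         workgroupBarrier();
|         for (var k = 0u; k < 16u; k = k + 1u) {
|           acc = fma(tileA[lid.y * 16u + k], tileB[k * 16u + lid.x], acc);
|         }
|         workgroupBarrier();
|       }
|       C[row * n + col] = acc;
|     }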
| lostmsu wrote:
| What's the perf like?
| Const-me wrote:
| Sorry, I have not benchmarked against cuBLAS, Eigen, or
| similar; I built that for ML inference.
|
| I have implemented a profiler on top of D3D11_QUERY_TIMESTAMP
| and D3D11_QUERY_TIMESTAMP_DISJOINT queries, and tweaked the
| compute shader to minimize the time reported by these queries
| for my specific use case.
| shihab wrote:
| Great article!
|
| For context: this WebGPU version achieves ~17% of the peak
| theoretical performance of the M2. With CUDA (i.e., cuBLAS), you
| can reach ~75% of peak performance for the same matrix config
| (without tensor cores).
| zanussbaum wrote:
| thanks! and yes definitely not at CUDA levels :)
| weinzierl wrote:
| 75% can't be the best we can do. What would reach 100% or
| nearly 100%? Handcoded assembly?
| jsheard wrote:
| With GPUs it's not uncommon to run out of memory bandwidth
| before you max out the theoretical FLOPS. They may have a ton
| of bandwidth but it's never enough.
|
| That can lead you to some pretty counter-intuitive
| optimizations because it's often faster to do _more_ compute
| work if it means you touch less memory in the process.
| wbl wrote:
| Shouldn't the roofline inform capacity assessments?
| KeplerBoy wrote:
| Sure, but rooflines don't account for stuff like memory
| granularity. You not only have to do a lot of flops per byte
| to achieve the necessary arithmetic intensity, you also have
| to access those bytes in a coalesced way. I.e., you want to
| access consecutive bytes, which are ideally already in
| registers.
| stephencanon wrote:
| For sufficiently large GEMM you should never run out of
| bandwidth before you max out FLOPS if your blocking is
| organized correctly, because the arithmetic scales like
| O(n^3) while the memory access scales like O(n^2).
| saagarjha wrote:
| In theory, yes. In practice you will probably be forced
| to tile your GEMM and incur the penalty of redundant
| memory accesses.
| stephencanon wrote:
| Sure, but still on each tile, you do O(k^3) compute with
| O(k^2) memory, and you generally arrange things so that
| at least one tile is in L1 and at least one other is in
| L2/LLC (using CPU idioms), so again, you have plenty of
| bandwidth (typical choices of k are in the ballpark of
| ~32, and a 32:1 compute to memory ratio is just fine on
| most hardware, especially if some of those accesses are
| coming from fast memory)
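|
| Spelled out for a k x k tile pair (2k^2 elements loaded, 2k^3
| flops done on them), the compute-to-traffic ratio grows linearly
| in k, which is where the 32:1 figure comes from:
|
|     \[
|     \frac{\text{FLOPs}}{\text{elements loaded}}
|       \approx \frac{2k^3}{2k^2} = k,
|     \qquad k = 32 \;\Rightarrow\; \text{about 32 FLOPs per element.}
|     \]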
| adev_ wrote:
| > That can lead you to some pretty counter-intuitive
| optimizations because it's often faster to do more compute
| work if it means you touch less memory in the process.
|
| It is not specific to GPUs: this kind of optimization is
| pretty common on CPUs too, where latency kills you and 200
| cycles spent on compute can actually be faster than a single
| cache miss trying to fetch data. This is pretty common in
| many SIMD algorithms.
|
| Memory is currently lagging behind compute on almost every
| type of modern hardware, and it will very likely get worse,
| not better.
| david-gpu wrote:
| The parameters of the matrix multiply, such as the size of
| the matrices, impose limits on how close you can get to the
| peak theoretical performance of a particular GPU. Not all
| possible matrix multiplies are equally valuable to optimize
| _a priori_, so the hardware is designed to perform best on
| problems that are financially significant, such as modern
| LLMs.
|
| As for handcoded assembly, do you believe that it would be
| financially sound to hand code and maintain thousands of
| kernels that way, even if you believed that they would be
| faster?
| littlestymaar wrote:
| > As for handcoded assembly, do you believe that it would
| be financially sound to hand code and maintain thousands of
| kernels that way, even if you believed that they would be
| faster?
|
| Why not? We do so for cryptographic primitives and video
| codecs. And why are you talking about "thousands of
| kernels"? AI programs only need a small number of different
| kernels, so it doesn't sound intractable.
| david-gpu wrote:
| _> AI programs only need a small amount of different
| kernels_
|
| That is not the case. What appears to be a simple matmul
| operation actually requires these libraries to select
| which specific kernel to execute out of the many available
| internally.
|
| If you are curious to learn more, NVIDIA open-sourced a
| library called CUTLASS some years ago. And remember, that
| is only what they are willing to open source.
| littlestymaar wrote:
| Is that really different from AV codecs in terms of scale
| though?
| david-gpu wrote:
| I am not at liberty to discuss more than that.
| saagarjha wrote:
| Yes, you can peek under the hood of cuBLAS and notice
| that it has dozens of kernels for different problem
| sizes. It's not generally the case that when you encode
| H.264 at a different CRF you have to implement a
| completely different tiling strategy.
| Const-me wrote:
| > you can reach ~75% of peak performance for same matrix config
|
| Not on the same computer, CUDA doesn't run on the integrated
| GPU of the Apple M2 Pro.
| ladyanita22 wrote:
| That's exactly what I was wondering. That cannot be.
| stephencanon wrote:
| Probably more relevant here is that a single CPU core on that
| computer exceeds 1 TFLOP/s on GEMM with plenty of margin
| using a single lib call, and leaves the rest of the CPU cores
| and all of the GPU free to do other work.
| Const-me wrote:
| OP implemented that stuff with WebGPU. At runtime, that
| weird language compiles into a compute shader which runs
| the computation on the GPU.
| saagarjha wrote:
| I believe the point being made was that this could be
| done in the CPU faster than was achieved here.
| Const-me wrote:
| Yeah, but not on a single core.
|
| In my desktop computer, I have a Ryzen 7 8700G CPU, which
| has 8 Zen 4 cores, a 4.2 GHz base frequency, and a 65W TDP.
| Theoretically, when doing FP32 FMA, each CPU core can do
| 32 FLOP/cycle. At the base frequency, this translates
| into 134 GFLOP/s per core. You would need all 8 cores to
| reach a theoretical 1 TFLOP/s.
|
| BTW, the integrated GPU inside the same 8700G processor can
| theoretically do 8.2 TFLOP/s FP32.
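|
| The arithmetic, spelled out:
|
|     \[
|     32\ \tfrac{\text{FLOP}}{\text{cycle}} \times 4.2\ \text{GHz}
|       \approx 134\ \tfrac{\text{GFLOP}}{\text{s}}\ \text{per core},
|     \qquad 8 \times 134\ \tfrac{\text{GFLOP}}{\text{s}}
|       \approx 1.08\ \tfrac{\text{TFLOP}}{\text{s}}.
|     \]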
| adrian_b wrote:
| Nope, no Apple CPU core has such performance.
|
| That single lib call must have used the AMX accelerator,
| which is separate from the cores and shared by a group of
| cores.
|
| So the AMX accelerator's performance may be greater than that
| of all the CPU cores together. AFAIK, some Apple CPUs have one
| AMX accelerator for the big cores and another for the smaller
| cores, but in any case you cannot hope that, having obtained
| 1 TFLOP/s when running the program on 1 core, you will get
| much more when running it on multiple cores, because all
| cores of the same type share the same accelerator.
| Const-me wrote:
| One interesting thing about these newfangled matrix/AI/ML
| accelerators that's very rarely mentioned on the internets:
| they only deliver that many TFLOP/s because they operate at
| very low precision.
|
| NVIDIA tensor cores support int8, a couple of versions of
| FP16 (BF16 and the standard IEEE one), and a 19-bit format
| which they call TensorFloat-32. I think Intel AMX only
| supports int8 and BF16.
|
| None of them supports FP32, let alone FP64, input numbers,
| which makes them completely useless for traditional GEMM
| stuff.
| wtallis wrote:
| https://github.com/corsix/amx indicates that Apple's AMX
| supports up to 64-bit FP, but I don't see any performance
| metrics. They also have the ANE, which is the low-
| precision ML-focused accelerator.
| Const-me wrote:
| I wasn't aware there are two completely different things
| from different companies both called AMX. I assumed it was
| this AMX:
| https://en.wikipedia.org/wiki/Advanced_Matrix_Extensions
|
| Apple's version is indeed interesting. I wonder why Apple
| hasn't exposed it to programmers, or implemented a BLAS
| library on top of that thing?
| wtallis wrote:
| > I wonder why haven't Apple exposed it to programmers,
| or implemented a BLAS library on top of that thing?
|
| Using the Accelerate framework (which includes Apple's
| BLAS) is the _only_ supported way for programmers to
| access the AMX. Reverse engineering the instruction set
| to access it directly is discouraged, because it's not a
| documented stable interface.
| stephencanon wrote:
| Apple's matrix unit supports FP16, 32, and 64 sources and
| accumulators.
| stephencanon wrote:
| Right. So, like I said, using one CPU core, you can
| exceed 1 TFLOP/s, leaving all the other CPU cores and the
| GPU free for other work.
| shihab wrote:
| Yes, that's why I was focusing on percentage of peak hardware
| performance, not actual flops.
| brrrrrm wrote:
| how are you running CUDA on the integrated Apple silicon GPU
| these days?
| KeplerBoy wrote:
| You are not.
| maelito wrote:
| WebGPU will make Web maps even more competitive than they are
| already.
|
| The smoothness of an iPhone map zoom, on any device.
| jsheard wrote:
| > The smoothness of an iPhone map zoom, on any device.
|
| Any device except an iPhone, until Apple finally gets around to
| shipping WebGPU in Safari. Any year now...
| astlouis44 wrote:
| Safari is officially enabling support for WebGPU in iOS 18.2,
| which is rolling out within the first weeks of December.
| jsheard wrote:
| Where'd you hear that? It's not listed here:
|
| https://developer.apple.com/documentation/safari-release-
| not...
| astlouis44 wrote:
| Source is here, from a Unity WebGPU thread. Look at the
| comment from October 27 from Brendan Duncan, a Unity
| employee: https://discussions.unity.com/t/early-access-
| to-the-new-webg...
|
| "I have found that WebGPU is enabled by default now with
| iOS 18.2. Apple has been working in the open on WebGPU.
| The WebKit source code has their latest WebGPU work in
| it. What hasn't been known is their release schedule, but
| now with 18.2 it's looking very promising that it will be
| on by default in that version."
| luketaylor wrote:
| Source?
|
| Edit: I just pressed "Reset All to Defaults" under "WebKit
| Feature Flags" on my device running 18.2 beta, and the
| switch for WebGPU is on!! <3
| coffeeaddict1 wrote:
| You can do slightly better fairly easily I think. See here for
| example https://github.com/AnswerDotAI/gpu.cpp/pull/35
| inglor wrote:
| Can you explain why you did the naive algorithm here and not any
| of the fast matrix multiplication ones that trade multiplications
| for more additions? Just for educational purposes or is there a
| performance benefit in the technique?
| zanussbaum wrote:
| at least on my M2, the compiled kernel ends up using fast math
| anyway, so using WGSL's fma didn't change anything about the
| actual kernel that gets run
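|
| concretely, the comparison was between forms like these two
| (illustrative WGSL only, not the post's kernel); per the
| observation above, the fast-math compilation contracts the
| plain form into a fused multiply-add anyway, so both ended up
| producing the same kernel:
|
|     fn mad_plain(acc : f32, a : f32, b : f32) -> f32 {
|       return acc + a * b;         // separate multiply and add in source
|     }
|
|     fn mad_fma(acc : f32, a : f32, b : f32) -> f32 {
|       return fma(a, b, acc);      // explicit WGSL fma() builtin
|     }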
| hedgehog wrote:
| inglor is probably referring to Strassen or Coppersmith-
| Winograd.
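|
| For reference, the trade-off being pointed at: Strassen
| multiplies a pair of 2x2 blocks with 7 multiplications instead
| of 8, at the cost of extra additions (and extra temporaries),
| giving
|
|     \[
|     T(n) = 7\,T(n/2) + O(n^2)
|       \;\Rightarrow\; O\!\left(n^{\log_2 7}\right) \approx O(n^{2.81}),
|     \quad \text{vs. } O(n^3)\ \text{for the naive algorithm.}
|     \]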
| zanussbaum wrote:
| oh in that case it was because i didn't know about them :)
| something to try next!
| wbl wrote:
| Last I checked, the extra memory really hurts in a lot of
| cases, especially for the more complex ones, but I'm no
| expert.
| saagarjha wrote:
| Because those algorithms are generally not worth implementing
| even though their algorithmic complexity is theoretically
| lower.
| pama wrote:
| To clarify the title: TFLOP/s is the unit the author goes after,
| not TFLOP. People in the thread compare CUDA performance on GPUs
| to WebGPU performance: please recall that the H100 has a
| theoretical performance of about 1000 TFLOP/s for bfloat16, and
| even moderately complicated algorithms in typical modern
| transformer architectures can reach about half of that
| performance.
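|
| A concrete example of the distinction: a single 4096x4096x4096
| multiply (the size floated elsewhere in the thread) is a fixed
| amount of work,
|
|     \[
|     2 \times 4096^3 \approx 1.37 \times 10^{11}\ \text{FLOP}
|       \approx 137\ \text{GFLOP},
|     \]
|
| so a kernel sustaining 1 TFLOP/s finishes it in roughly 0.14
| seconds.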
| saagarjha wrote:
| H100 can do well over 1500 TFLOPS in fp16.
| nulltype wrote:
| Which H100 and how much over 1500 TFLOP/s?
|
| The datasheet for the H100 SXM seems to indicate that it can
| only do ~1000 TFLOP/s peak.
| saagarjha wrote:
| I just went to Nvidia's site and downloaded the data sheet:
| https://resources.nvidia.com/en-us-tensor-core/nvidia-
| tensor.... It says 1600/1900 in half precision?
| wtallis wrote:
| Read the fine print: "With sparsity". They double the
| claimed throughput by assuming that half of the FLOPs can
| be skipped.
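|
| Halving the quoted sparse figures recovers the dense numbers:
|
|     \[
|     1900 / 2 = 950 \quad\text{and}\quad 1600 / 2 = 800\ \text{TFLOP/s},
|     \]
|
| consistent with the ~1000 TFLOP/s dense peak mentioned upthread.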
| saagarjha wrote:
| Oh, that is really annoying. Thanks for catching that!
| pama wrote:
| There are two populations of people reading the NVIDIA
| specs (and now you have switched groups). If NVIDIA ever
| changes their marketing strategy and the asterisk denotes
| something else, there might be a third population, because
| I know a lot of people who, I suspect, will keep dividing
| those starred FLOP/s by two :-)
| menaerus wrote:
| I also recently went through the specs and noticed "with
| sparsity", but I didn't quite understand what it
| specifically refers to. The premise is that a lot of
| weights in matmul operations will be zero or insignificant
| (also known as sparse matrices), and in that case the
| A100/H100 has circuitry that can boost the throughput up
| to 2x, essentially "skipping" half of the FLOPs as you say.
|
| I am not an expert in LLMs, but I don't think you can end
| up having a significant amount of zeroed weights (~50%)
| in a converged network, so I think it is safe to say that
| the theoretical throughput for 99% of cases is really
| ~800 TFLOP/s and not ~1600 TFLOP/s as advertised.
| FL33TW00D wrote:
| I wrote something similar a while back:
| https://github.com/FL33TW00D/wgpu-mm
|
| Also does quantized matmuls.
| brrrrrm wrote:
| would be fun to do a leaderboard of some specific size (e.g.
| 4096x4096x4096) just to get all the code and tricks in one spot
| for folks to learn about things
| mkeeter wrote:
| For a very deep dive into the subject, this is a great writeup:
|
| How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance
| (https://siboehm.com/articles/22/CUDA-MMM)
|
| (It's CUDA-specific, so there may be aspects that can't yet be
| ported to WGPU)
| zanussbaum wrote:
| this was a huge inspiration for the post! i tried to highlight
| it in the blog but it might have gotten buried
|
| there are a few things that i wasn't able to figure out how to
| get access to, or wasn't sure were even possible. for example,
| a lot of Simon's article takes advantage of the warp scheduler
| and warp tiling.
|
| i had a hard time finding information on whether that's even
| possible with my M2/Metal and what the general memory access
| patterns are. it seems like CUDA has better documentation in
| this regard
| almostgotcaught wrote:
| That's a nice tutorial, but just to be clear: that is not a deep
| dive in any sense. It's just the bog-standard tricks. It
| doesn't cover MMA and WMMA, which today are table stakes for
| fast matmul. It also doesn't cover software pipelining. It's
| basically a good summary of the basics.
| saagarjha wrote:
| It's a deep dive as of like 2015 probably. I don't know if
| anyone has done something similar for modern GEMMs. Maybe the
| CUTLASS or Colfax people?
___________________________________________________________________
(page generated 2024-11-12 23:01 UTC)