https://probablydance.com/2024/10/07/initial-cuda-performance-lessons/

Initial CUDA Performance Lessons
by Malte Skarupke

I am somehow very late to learning CUDA. I didn't even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, you'll write suboptimal code, so here are some lessons I had to learn to get things to run fast.

Memory Coalescing

If you have multiple threads operating on an array in C++, you probably want to iterate like this:

    std::vector<T> vec = ...;
    size_t per_thread = vec.size() / num_threads;
    T * my_slice = vec.data() + per_thread * my_thread_i;
    for (size_t i = 0; i < per_thread; ++i) {
        do_something(my_slice[i]);
    }

Meaning each thread iterates over a contiguous chunk of memory. In CUDA this is going to be slow, because you want the threads to load memory together. If thread 0 loads bytes 0 to 15, then you want thread 1 to load bytes 16 to 31, thread 2 to load bytes 32 to 47, etc. So the loop instead has to look like this:

    T * data = ...;
    size_t num_elements = ...;
    for (int i = my_thread_i; i < num_elements; i += num_threads) {
        do_something(data[i]);
    }

This is called "memory coalescing": adjacent threads use adjacent memory. On a loop with a small body (a dot product) this is 3x faster. (A sketch of a complete kernel using this pattern appears after the next section.)

Most of the Performance is now in Specialized Hardware

Many years ago Sean Parent presented a graph that breaks down where the performance is in a modern PC. I'm reproducing it with current numbers here:

[Figure: breakdown of theoretical performance in a PC with a Ryzen 9950X and an RTX 4090]

What we see here is the breakdown of theoretical performance in a PC with a Ryzen 9950X and an RTX 4090. The overall theoretical performance is ~95 TFLOPS. These are theoretical numbers, so for example the single-threaded CPU performance is just "5.7 GHz * 4 instructions per cycle = 22.8 GFLOPS". That's the blue line that you can't see because it's such a tiny fraction. If you use all 32 threads and AVX-512, you can multiply that performance by 32*16 = 512 to fill up the red and yellow parts of the graph. But if you really want performance, you need to use the GPU, which gives you the green part of the graph.

But while these are current numbers, the graph is missing most of the GPU's performance. The GPU now has specialized hardware for machine learning and for raytracing. When you add those to the graph you get the current picture:

[Figure: the same breakdown with the specialized hardware (tensor cores and RT cores) added]

This is the same graph plus specialized hardware. For the tensor cores I chose the TFLOPS when doing BF16 matrix multiplies. That makes it not exactly a fair comparison, because they operate on lower precision (the output is in 32 bits, though), but everyone uses them for matrix multiplies and thinks it's fine. The point is that most of the performance in your PC is now in specialized chips. If you're just writing straightforward CUDA code, you're leaving most of the performance on the table. The graph gets even more lopsided when looking at a deep learning GPU like the H100:

[Figure: the same breakdown for an H100]

Note how the x-axis now goes above 2000 TFLOPS. If you're not using tensor cores, the GPU is sitting >90% idle. This is changing the algorithms that are used in deep learning: if algorithm A can just do bigger matrix multiplications to get higher-quality results, and algorithm B can achieve better quality by cleverly doing lots of little pieces of work, people will choose algorithm A.
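To make the first lesson concrete, here is a minimal sketch of a complete kernel built around the coalesced loop from above. The kernel name, the launch configuration, and the atomicAdd at the end are my own illustrative choices, not something from the post; the warp-shuffle trick described further down is the faster way to combine the per-thread sums.

    __global__ void dot_product(const float * a, const float * b, float * result, int num_elements)
    {
        int my_thread_i = blockIdx.x * blockDim.x + threadIdx.x;
        int num_threads = gridDim.x * blockDim.x;
        float sum = 0.0f;
        // Adjacent threads touch adjacent elements in every iteration,
        // so the loads coalesce into wide memory transactions.
        for (int i = my_thread_i; i < num_elements; i += num_threads)
            sum += a[i] * b[i];
        // Naive combination of the per-thread partial sums. The warp-shuffle
        // trick later in the post is the better way to do this.
        atomicAdd(result, sum);
    }

    // launched e.g. as: dot_product<<<256, 256>>>(a, b, result, num_elements);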
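And to give a flavor of what using the specialized hardware looks like: in practice you mostly reach the tensor cores through libraries like cuBLAS and cuDNN, but CUDA also exposes them directly through the WMMA intrinsics. The following is a hedged sketch, not from the post, of one warp multiplying a single 16x16 tile (FP16 inputs here for simplicity, accumulating into 32-bit floats); it requires compute capability 7.0+ and a launch with at least one full warp.

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void multiply_one_tile(const half * a, const half * b, float * c)
    {
        // Fragments are tiles distributed across the registers of the warp,
        // much like the register trick described later in the post.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, a, 16); // 16 = leading dimension of the tile
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // c_frag += a_frag * b_frag on the tensor cores
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }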
Different Kinds of Memory

Memory is more complicated in CUDA, but with my limited understanding so far I think of CUDA as having three different types of memory:

1. Normal memory
2. Shared memory (faster)
3. Registers (fastest)

Registers are particularly weird. One thread block has 65536 registers, meaning you can store 256 KB of data in registers, which is more than you can store in shared memory. I was trying to understand how some cuDNN kernel could possibly be as fast as it was when I realized that it keeps a particular matrix entirely in registers, with each thread holding a small part of the matrix.

You get some control over how many registers you have. You can have up to 1024 threads per thread block, meaning you get 64 registers per thread by default. But you could launch fewer threads and get proportionally more registers per thread. If you need, say, 150 registers per thread because you want to cache some data, you divide 65536/150, which tells you that you can use at most 436 threads. But you're still just writing C++, which doesn't make it easy to say "keep this data in registers." The best way I found to do this is to keep a fixed-size array on the stack and then use "#pragma unroll" on every single loop that uses that array. The loop needs to be unrolled because every unrolled iteration needs to refer to different registers. (A short sketch of this trick appears further down.)

Shared memory was straightforward in comparison. It allows you to dedicate some cache space for a specific purpose, and the data is shared between threads. So you can use it for two purposes:

1. To communicate between threads
2. To load data more quickly: if you want to load 512 floats and you have 512 threads, every thread can load one float into shared memory, so you don't even have to loop.

Sharing is ~Free Within a Warp

This one was a delight when I saw code doing it for the first time: a warp is 32 threads that share one instruction pointer. They all do the same thing at the same time. So if you e.g. parallelize a dot product, the 32 threads of the warp can sum their results to get the overall result in five steps, using a parallel sum algorithm:

[Figure: parallel sum across a warp]

On a CPU this algorithm is impractical because the overhead of keeping the threads in sync is too high. But on a GPU they just are in sync, so sharing is literally five steps:

    __device__ float add_warp(float x) {
        static constexpr const unsigned all = 0xffffffff;
        x += __shfl_xor_sync(all, x, 1);
        x += __shfl_xor_sync(all, x, 2);
        x += __shfl_xor_sync(all, x, 4);
        x += __shfl_xor_sync(all, x, 8);
        x += __shfl_xor_sync(all, x, 16);
        return x;
    }

I verified that each of these lines compiles down to two instructions: 5 SHFL.BFLY instructions plus 5 FADD instructions for the additions. There are no secret locks or barriers here.

This only works within a warp (32 threads). For a thread block, up to 1024 threads, you can use shared memory, which requires barriers because the threads won't automatically be in sync. If you need more threads than that and want to share data between them, don't. (You'll often want many more threads, you just can't share data between them. You need to write out the result to memory and then launch a new kernel to work on the new data.)
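To extend the warp trick to a whole thread block, you combine it with shared memory and a barrier, as just described. Here is a minimal sketch of what that can look like, reusing the add_warp function from above; the function name and layout are my own choices, and it assumes the block size is a multiple of 32.

    __device__ float add_block(float x)
    {
        __shared__ float warp_sums[32];       // one slot per warp, at most 1024 / 32 = 32 warps
        int lane = threadIdx.x % 32;
        int warp = threadIdx.x / 32;

        x = add_warp(x);                      // free sharing within the warp
        if (lane == 0)
            warp_sums[warp] = x;              // one partial sum per warp goes to shared memory
        __syncthreads();                      // warps are not automatically in sync, so barrier

        // The first warp adds up the per-warp partial sums.
        int num_warps = blockDim.x / 32;
        x = (threadIdx.x < num_warps) ? warp_sums[lane] : 0.0f;
        if (warp == 0)
            x = add_warp(x);
        return x;                             // every thread of warp 0 now holds the block total
    }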
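Circling back to the register trick: here is a small sketch of the fixed-size-array-plus-#pragma-unroll pattern. The kernel and its numbers are made up for illustration, not taken from the post.

    __global__ void scale_rows(const float * in, float * out, int num_rows)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows)
            return;
        float cached[8]; // small fixed-size array that can live entirely in registers
        #pragma unroll
        for (int i = 0; i < 8; ++i) // fully unrolled, so cached[i] is a known register in each iteration
            cached[i] = in[row * 8 + i];
        #pragma unroll
        for (int i = 0; i < 8; ++i)
            out[row * 8 + i] = cached[i] * 2.0f;
    }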
Parallelism First

My intuition for how many threads to use was wrong by a lot. If you're iterating over some data and have to do several non-trivial things to it, it's probably best to launch one thread for each of the things you want to do. It's tempting to say "this thread already loaded all the relevant data, it can just do a bit of extra work," but in CUDA it's better to launch a separate thread for that extra work, even if both threads have to load the same data. It's much cheaper for them to synchronize and share their data than it would be on a CPU. When I ran Nsight Compute on the first couple of versions of my code, the feedback that came back could always be summarized as "you're barely using the GPU, make it more parallel."

This also means that you often want to pull your algorithm apart. If there is one part that can run massively parallel (across tens of thousands of threads) and one part that has limited parallelism (say only a few hundred threads), then it's probably worth launching those as separate kernels, to benefit from the massive parallelism on part of your problem, even if that part is only a small part. So whenever you try to solve a problem, the first question should not be "how can I make this fast?" but "how can I run this in parallel?" After you solve that, worry about making the parallel code fast.

Conclusion

Writing CUDA definitely has a different feeling. It feels more puzzly, because it's so easy to accidentally use only 1% of your GPU. It actually reminds me of TIS-100, especially the trick of distributing data across the registers of multiple threads. But instead of managing a small number of chips, you have to figure out how to generate work for tens of thousands of threads.

My mental model is that you've got a bunch of container ships that can travel at 10% of the speed of light. You're using them to ship goods around the world. They're very fast, so most of the work is in setting up your harbors so that you can load and unload these container ships in fractions of a second, so that they can sail off to do the next thing. It's not easy to feed these beasts, but if you do it right you can do huge chunks of work in almost no time.

Published: October 7, 2024. Filed under: Programming. Tags: C++, CUDA, performance, Programming.