[HN Gopher] Initial CUDA Performance Lessons
       ___________________________________________________________________
        
       Initial CUDA Performance Lessons
        
       Author : ibobev
       Score  : 138 points
       Date   : 2024-10-11 10:01 UTC (12 hours ago)
        
 (HTM) web link (probablydance.com)
 (TXT) w3m dump (probablydance.com)
        
       | elashri wrote:
        | I like this writeup; it summarizes my journey optimizing
        | some CUDA code I wrote for an LHC experiment trigger. But I
        | have a few comments on some of the details.
       | 
        | There are 65536 registers per SM, not per thread block. You
        | can indirectly control that by making your block take up the
        | whole SM, but this presents its own problems.
       | 
        | NVIDIA hardware limits the maximum number of threads to 1024
        | (2048) and shared memory to 48 KB (64 KB) per SM. So if one
        | thread block consumes all of that, or close to it, you are
        | running one thread block per SM. You usually don't want to do
        | that, because it will lower your occupancy. Additionally, if
        | the kernel you're running is not compute-bound and does not
        | need all the
       | registers or shared memory allocated to it, having fewer blocks
       | on the SM could leave some compute resources idle. GPUs are
       | designed to thrive on parallelism, and limiting the number of
       | active blocks could cause underutilization of the SM's cores,
        | leading to poor performance. Finally, if each thread block
       | occupies an entire SM, you limit the scalability of your kernel
       | to the number of SMs on the GPU. For example, if your GPU has 60
       | SMs, and each block uses one SM, you can only run 60 blocks in
       | parallel, even if the problem you're solving could benefit from
       | more parallelism. This can reduce the efficiency of the GPU for
       | very large problem sizes.
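        | 
        | For reference, these per-SM limits can be queried at runtime
        | with cudaGetDeviceProperties. A minimal sketch (mine, not
        | from the article):
        | 
        |     #include <cstdio>
        |     #include <cuda_runtime.h>
        | 
        |     int main() {
        |         cudaDeviceProp p;
        |         cudaGetDeviceProperties(&p, 0);
        |         // Registers and shared memory are per SM, shared by
        |         // all blocks resident on that SM.
        |         printf("regs/SM:          %d\n",
        |                p.regsPerMultiprocessor);
        |         printf("smem/SM:          %zu\n",
        |                p.sharedMemPerMultiprocessor);
        |         printf("max threads/SM:   %d\n",
        |                p.maxThreadsPerMultiProcessor);
        |         printf("max threads/blk:  %d\n",
        |                p.maxThreadsPerBlock);
        |         printf("SM count:         %d\n",
        |                p.multiProcessorCount);
        |         return 0;
        |     }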
        
         | jhj wrote:
          | Aiming for higher occupancy is not always the desired
          | solution; what frequently matters more is avoiding global
          | memory latency by retaining more data in registers and/or
          | shared memory. This was first noted in 2010 and is still
          | true today:
         | 
         | https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...
         | 
          | I would also think in terms of latency hiding rather than
          | just work parallelism (though latency hiding on GPUs is
          | largely a product of parallelism). This is why GPUs have
          | massive register files: unlike modern multi-core CPUs, we
          | omit latency-reducing hardware (e.g., speculative
          | execution, large caches, out-of-order execution and
          | register renaming), and in order to fill the pipelines we
          | need many instructions outstanding, which means the
          | operands for those pending instructions need to stay
          | around for a lot longer, hence the massive register file.
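          | 
          | A toy sketch of what I mean (mine, not from the slides):
          | each thread keeps four independent partial sums in
          | registers, so the FMA pipeline has several operations in
          | flight per thread even at modest occupancy:
          | 
          |     // Grid-stride dot product with 4-way ILP.
          |     __global__ void dot4(const float* a, const float* b,
          |                          float* out, int n) {
          |         float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
          |         int i = blockIdx.x * blockDim.x + threadIdx.x;
          |         int stride = gridDim.x * blockDim.x;
          |         for (; i + 3 * stride < n; i += 4 * stride) {
          |             s0 += a[i]              * b[i];
          |             s1 += a[i + stride]     * b[i + stride];
          |             s2 += a[i + 2 * stride] * b[i + 2 * stride];
          |             s3 += a[i + 3 * stride] * b[i + 3 * stride];
          |         }
          |         for (; i < n; i += stride)  // leftover elements
          |             s0 += a[i] * b[i];
          |         atomicAdd(out, s0 + s1 + s2 + s3);
          |     }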
        
           | elashri wrote:
           | I agree that optimizing for lower occupancy can yield
           | significant performance gains in specific cases, especially
           | when memory latencies are the primary bottleneck. Leveraging
           | ILP and storing more data in registers can indeed help reduce
           | the need for higher occupancy and lead to more efficient
            | kernels. The examples in the GTC 2010 talk highlighted that
           | quite well. However, I would argue that occupancy still plays
           | an important role, especially for scalability and general-
           | purpose optimization. Over-relying on low occupancy and fewer
           | threads, while beneficial in certain contexts, has its
           | limits.
           | 
           | The first thing to consider is the register pressure.
           | Increasing the number of registers per thread to optimize for
           | ILP can lead to register spilling when the register file is
           | exhausted, which drastically reduces performance. This
            | becomes more pronounced as problem sizes scale up (the
            | talk's examples avoid that problem). Many real-world
            | applications,
           | especially compute-bound kernels, need high occupancy to
           | fully utilize the GPU's resources. Focusing too much on
           | minimizing thread counts can lead to underutilization of the
            | SM's parallel execution units. A standard example would
            | be inference engines.
           | 
           | Also, while low-occupancy optimizations can be effective for
            | specific workloads (e.g., memory-bound kernels), designing
           | code that depends on such strategies as a general practice
           | can result in less adaptable and robust solutions for a wide
           | variety of applications.
           | 
            | I believe there is a balance to strike here. Low
            | occupancy can work for specific cases, but higher
            | occupancy often provides better scalability and overall
            | performance for more general use cases. You have to test
            | for that while you are optimizing your code; there is no
            | general rule of thumb to follow here.
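            | 
            | A quick way to check whether register pressure is
            | actually biting (my example, not from the talk): ask
            | ptxas for register and spill counts, and if needed cap
            | usage with __launch_bounds__:
            | 
            |     // nvcc -O3 -Xptxas -v kernel.cu
            |     // ptxas reports something like:
            |     //   Used NN registers, 0 bytes spill stores, ...
            | 
            |     // Hypothetical kernel: ask the compiler to fit at
            |     // least 2 blocks of 256 threads per SM, which
            |     // bounds the registers it may use per thread.
            |     __global__ void __launch_bounds__(256, 2)
            |     saxpy(float a, const float* x, float* y, int n) {
            |         int i = blockIdx.x * blockDim.x + threadIdx.x;
            |         if (i < n) y[i] = a * x[i] + y[i];
            |     }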
        
         | dahart wrote:
         | Good points, though I agree with sibling that higher occupancy
         | is not the goal; higher performance is the goal. Since
         | registers are such a precious resource, you often want to set
         | your block size and occupancy to whatever is best for keeping
          | active state in registers. If you push the occupancy
          | higher, the compiler might be forced to spill registers to
          | VRAM, and that will just slow everything down even though
          | the occupancy goes up.
         | 
         | Another thing to maybe mention, re: "if your GPU has 60 SMs,
         | and each block uses one SM, you can only run 60 blocks in
         | parallel"... CUDA tends to want to have at least 3 or 4 blocks
         | per SM so it can round-robin them as soon as one stalls on a
         | memory load or sync or something else. You might only make
         | forward progress on 60 separate blocks in any given cycle, but
         | it's quite important that you have like, for example, 240
         | blocks running in "parallel", so you can benefit from latency
         | hiding. This is where a lot of additional performance comes
         | from, doing work on one block while another is momentarily
         | stuck.
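          | 
          | The runtime can also tell you how many blocks of a given
          | size end up resident per SM for a particular kernel. A
          | small sketch (my_kernel is just a placeholder):
          | 
          |     int blocksPerSM = 0;
          |     cudaOccupancyMaxActiveBlocksPerMultiprocessor(
          |         &blocksPerSM, my_kernel,
          |         /*blockSize=*/256, /*dynamicSmemBytes=*/0);
          |     // Want this to be ~2+ so the SM can switch to another
          |     // block when one stalls.
          |     printf("resident blocks/SM: %d\n", blocksPerSM);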
        
           | winwang wrote:
           | Is this really true in general? I'd expect it to be true for
           | highly homogenous blocks, but I'd also expect that kernels
           | where the warps are "desynced" in memory operations to do
           | just fine without having 3-4 blocks per SM.
        
             | dahart wrote:
             | Oh I think so, but I'm certainly not the most expert of
             | CUDA users there is. ;) Still, you will often see CUDA try
             | to alloc local and smem space for at least 3 blocks per SM
             | when you configure a kernel. That can't possibly always be
             | true, but is for kernels that are using modest amounts of
              | smem, lmem, and registers. In general I'd say desynced
              | mem ops are harder to make performant than highly
              | homogeneous workloads, since those accesses are more
              | likely to be uncoalesced and to miss in cache. Think
              | about it this way: a kernel can
             | stall for many many reasons (which Nsight Compute can show
             | you), especially memory IO, but even for compute bound
             | work, the math pipes can fill, the instruction cache can
             | miss, some instructions have higher latency than others,
             | etc. etc. Even a cache hit load can take dozens of cycles
             | to actually fill. Because stalls are everywhere, these
             | machines are specifically designed to juggle multiple
             | blocks and always look for ways to make forward progress on
             | _something_ without having to sit idle, that is how to get
             | higher throughput and hide latency.
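              | 
              | (If you want to see the actual stall breakdown, Nsight
              | Compute's CLI will collect it; something like the line
              | below, then look at the warp state / scheduler
              | sections in the report.)
              | 
              |     ncu --set full -o report ./my_app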
        
         | otherjason wrote:
         | For devices with compute capability of 7.0 or greater (anything
         | from the Volta series on), a single thread block can address up
         | to the entire shared memory size of the SM; the 48 kB limit
         | that older hardware had is no more. Most contemporary
         | applications are going to be running on hardware that doesn't
         | have the shared memory limit you mentioned.
         | 
         | The claim at the end of your post, suggesting that >1 block per
         | SM is always better than 1 block per SM, isn't strictly true
         | either. In the example you gave, you're limited to 60 blocks
         | because the thread count of each block is too high. You could,
         | for example, cut the blocks in half to yield 120 blocks. But
         | each block has half as many threads in it, so you don't
         | automatically get any occupancy benefit by doing so.
         | 
         | When planning out the geometry of a CUDA thread grid, there are
         | inherent tradeoffs between SM thread and/or warp scheduler
         | limits, shared memory usage, register usage, and overall SM
         | count, and those tradeoffs can be counterintuitive if you
         | follow (admittedly, NVIDIA's official) guidance that maximizing
         | the thread count leads to optimal performance.
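          | 
          | One detail worth adding: above the default 48 kB you have
          | to use dynamic shared memory and opt in per kernel. A
          | rough sketch (kernel and sizes are made up):
          | 
          |     __global__ void big_smem_kernel(float* out) {
          |         extern __shared__ float smem[];  // dynamic smem
          |         smem[threadIdx.x] = (float)threadIdx.x;
          |         __syncthreads();
          |         if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
          |     }
          | 
          |     void launch(float* d_out) {
          |         int bytes = 96 * 1024;  // > 48 kB needs opt-in
          |         cudaFuncSetAttribute(big_smem_kernel,
          |             cudaFuncAttributeMaxDynamicSharedMemorySize,
          |             bytes);
          |         big_smem_kernel<<<60, 256, bytes>>>(d_out);
          |     }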
        
       | amelius wrote:
       | In the 90s we had segmented memory programming with near and far
       | pointers, and you had to be very careful about when you used what
       | type of pointer and how you'd organize your memory accesses. Then
       | we got processors like the 286 that finally relieved us from this
       | constrained way of programming.
       | 
       | I can't help but feel that with CUDA we're having new constraints
       | (32 threads in a warp, what?), which are begging to be unified at
       | some point.
        
         | dboreham wrote:
         | 386?
        
         | krapht wrote:
         | I'll believe it when autovectorization is actually useful in
         | day to day high performance coding work.
         | 
         | It's just a hard problem. You can code ignorantly with high
         | level libraries but you're leaving 2x to 10x performance on the
         | table.
        
         | dahart wrote:
         | While reading I thought you were going to suggest unified
         | memory between RAM and VRAM, since that's somewhat analogous,
          | though that does exist, with various caveats depending on
          | how it's set up & used.
         | 
         | SIMD/SIMT probably isn't ever going away, and vector computers
         | have been around since before segmented memory; the 32 threads
         | in a CUDA warp is the source of its performance superpower, and
         | the reason we can even fit all the transistors for 20k
         | simultaneous adds & multiplies, among other things, on the die.
          | This is conceptually different from your analogy too: the
         | segmented memory was a constraint designed to get around
         | pointer size limits, but 32 threads/warp isn't getting us
         | around any limits, it's just a design that provides high
         | performance _if_ you can organize your threads to all do the
         | same thing at the same time.
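          | 
          | To be concrete about the unified-memory aside, a minimal
          | managed-allocation sketch (mine; the usual caveat is that
          | page migration costs can surprise you):
          | 
          |     __global__ void scale(float* p, size_t n) {
          |         size_t i = blockIdx.x * blockDim.x + threadIdx.x;
          |         if (i < n) p[i] *= 2.f;
          |     }
          | 
          |     int main() {
          |         size_t n = 1 << 20;
          |         float* data = nullptr;
          |         cudaMallocManaged(&data, n * sizeof(float));
          |         for (size_t i = 0; i < n; ++i) data[i] = 1.f;
          |         scale<<<(n + 255) / 256, 256>>>(data, n);
          |         cudaDeviceSynchronize();  // pages migrate on demand
          |         cudaFree(data);
          |         return 0;
          |     }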
        
         | talldayo wrote:
         | You can blame ARM for the popularity of CUDA. At least x86 had
         | a few passable vector ISA ops like SSE and AVX - the ARM spec
          | only supports the piss-slow NEON in its stead. Since you're
         | not going to unify vectors and mobile hardware anytime soon,
         | the majority of people are overjoyed to pay for CUDA hardware
         | where GPGPU compute is taken seriously.
         | 
          | There were also attempts like OpenCL, which the industry
          | rejected early on because they thought they'd never need a
          | CUDA
         | alternative. Nvidia's success is mostly built on the ignorance
         | of their competition - if Nvidia was allowed to buy ARM then
         | they could guarantee the two specs never overlap.
        
           | oivey wrote:
           | CUDA clobbered x86, not ARM. Maybe if x86's vector ops were
           | better and more usable ARM would have been motivated to do
           | better.
        
             | refulgentis wrote:
             | Whole concept sounds like groping in the dark for a Take to
             | me: GPUs (CUDA) are orthogonal to consumer processors (ARM
             | / X86). Maybe we could assume a platonic ideal merged chip,
              | a CPU that acts like a GPU, but there are more differences
             | between those two things than an instruction set for vector
             | ops.
        
               | oivey wrote:
               | Yeah, that's true. CUDA is in large part for big HPC
               | servers, where ARM historically wasn't a player and still
               | isn't dominant. x86 got clobbered for HPC by CUDA.
        
           | saagarjha wrote:
           | ARM has SVE these days. This comment makes no sense, anyway:
           | people don't do numerical computing on phones.
        
       | miki123211 wrote:
       | What are some actually good resources to learn this stuff?
        
         | markhahn wrote:
         | there are a thousand excellent CUDA programming courses/sites.
         | 
         | none of what was mentioned in this blog post is news if you've
         | ever had more than 2 hours of a CUDA course...
        
         | corysama wrote:
         | I answered that recently here:
         | https://old.reddit.com/r/GraphicsProgramming/comments/1fpi2c...
        
         | saagarjha wrote:
         | https://siboehm.com/articles/22/CUDA-MMM is a good start.
        
       | markhahn wrote:
        | A little annoying to see the one-core-compared-to-whole-GPU
        | comparisons - we're now decades past when this was an
        | innocent mistake.
       | 
       | compare a 500W GPU to all the cores of a 500W CPU, please. I'm
       | not expecting the CPU (say, a 192-core AMD that does fast AVX512)
       | to beat the GPU on all data-parallel workloads, but it won't be
       | the silly sort of graphs shown in this blog.
       | 
       | or compare one SM to one CPU core - that has merit as well.
       | 
        | better yet, we're finally getting some CPUs (well, APUs...)
        | with in-package RAM. That makes the comparison more
        | interesting as
       | well.
        
         | oivey wrote:
         | The first example plot is a 9950X that includes all threads
         | with AVX512 vs a 4090. The 9950X has a 170W TDP, which doesn't
         | include any other components like the RAM or motherboard. The
         | 4090's total max power is ~450W. The chart shows the 4090
         | burying the 9950X by far more than 450/170.
         | 
         | Comparing SMs to CPU cores 1:1 also makes no sense. They don't
         | do the same things.
        
           | adrian_b wrote:
           | It should be kept in mind that a 4090 only buries a 9950X for
           | FP32 computations.
           | 
            | For FP64 computations, the reverse happens: a 9950X buries a
           | 4090, despite the latter having a 3-times higher price and a
           | 2.5-times higher power consumption.
           | 
            | For FP64 operations, the 4090 and 9950X can do a similar
            | number of operations per clock cycle (288 vs. 256), but
            | the 9950X can do them at double the clock frequency, and
            | it is easier to
           | reach a high fraction of the maximum theoretical throughput
           | on a 9950X than on a 4090.
        
             | xfalcox wrote:
             | What about FP8? It is a target that is very popular for LLM
             | inference.
        
               | adrian_b wrote:
                | AMD Zen 5 has the so-called "Vector Neural Network
                | Instructions", which can be used for inference with
                | INT8 quantization, and it also has instructions for
                | inference with BF16 quantization.
               | 
               | FP8 is a more recent quantization format and AFAIK no CPU
               | implements it.
               | 
                | I do not know what the throughput of these
                | instructions is on Zen 5. It must be higher than for
                | older
               | CPUs, but it must be slower than for the Intel Xeon
               | models that support AMX (which are much more expensive,
               | so despite having a higher absolute performance for
               | inference, they might have lower performance per dollar)
               | and obviously it must be slower than for the tensor cores
               | of a big NVIDIA GPU.
               | 
               | Nevertheless, for models that do not fit inside the
               | memory of a GPU, inference on a Zen 5 CPU may become
               | competitive.
        
       | lmeyerov wrote:
       | Nice!
       | 
       | It's interesting from the perspective of maintenance too. You can
       | bet most constants like warp sizes will change, so you get into
       | things like having profiles, autotuners, or not sweating the
       | small stuff.
       | 
       | We went more extreme, and nowadays focus on several layers up: By
       | accepting the (high!) constant overheads of tools like RAPIDS
        | cuDF, we get in exchange the ability to easily crank code with
       | good saturation on the newest GPUs and that any data scientist
       | can edit and extend. Likewise, they just need to understand
       | basics like data movement and columnar analytics data reps to
       | make GPU pipelines. We have ~1 CUDA kernel left and many years of
       | higher-level.
       | 
       | As an example, this is one of the core methods of our new graph
       | query language GFQL (think cypher on pandas/spark, w optional GPU
       | runtime), and it gets Graph500 level performance on cheapo GPUs
       | just by being data parallel with high saturation per step:
       | https://github.com/graphistry/pygraphistry/blob/master/graph... .
       | Despite ping-ponging a ton because cudf doesn't (yet) coalesce
       | GPU kernel calls, V1 competes surprisingly high, and is easy to
       | maintain & extend.
        
         | trentnelson wrote:
         | Had any exposure to r=2 hypergraph implementations on the GPU?
         | Ideally with an efficient way to determine if the graph is
         | acyclic?
         | 
         | (The CPU algos for doing this work great on CPUs but are woeful
         | on GPUs.)
        
           | lmeyerov wrote:
           | Pretty good - r=2 is a regular graph afaict, and basically
           | anything that maps to a frontier-based pattern works well.
           | Ex: level synchronous bfs during topological sort.
           | 
           | For the 'easy' way we do in gfql, which is basically vector
           | ops on bulk wavefronts, we can do massive cypher traversals
           | like you're asking, like 100M edges touched in a result
           | substep, and on a tiny GPU. There are other such bulk
           | patterns we want to add such as Pregel style, which open
           | other algorithms here. In practice we can often just call
           | cudf/cugraph as building blocks so haven't had the pressure
           | to do so yet.
           | 
           | The weak spot I find is more like small OLTP lookups. Ex:
           | Imagine a taxi routing traffic service pinging for one car to
           | do a couple hops out, where you just want a KV store in cheap
           | RAM. But if you are batching those queries, like in a heavy
           | city, and going deeper on them, maybe more interesting.
        
       | bagels wrote:
       | Definitely not an expert, but trying to use AVX instructions
        | explicitly in a C++ program can also produce suboptimal
        | performance vs. just letting the optimizer decide, much like
        | this
       | article points out with not shaping your memory and compute to
       | fit the GPU model.
        
       | Mithriil wrote:
       | In the conclusion, I like the image:
       | 
       | > My mental model [for GPU threads] is that you've got a bunch of
       | container ships that can travel at 10% of the speed of light.
       | You're using them to ship goods around the world. They're very
       | fast so most of the work is in setting up your harbors so that
       | you can load and unload these container-ships in fractions of a
       | second so that it can sail to do the next thing. It's not easy to
       | feed these beasts, but if you do it right you can do huge chunks
       | of work in almost no time.
        
       | Const-me wrote:
       | Note that not all problems are compute bound. Many practical
       | problems bottleneck on memory bandwidth.
       | 
        | For example, LLM AI inference on a desktop (where you don't
        | have a dozen concurrent sessions from multiple users) is
        | guaranteed to be memory bound, fetching gigabytes of the
        | model's tensors for each generated token. For use cases like
        | that, specialized
       | tensor cores deliver about the same performance as well-written
       | compute shaders running on general purpose GPU cores.
       | 
       | However, AVX512 is way slower than GPUs, because modern GPUs have
        | memory with very high bandwidth. In my desktop computer the
        | system memory is dual-channel DDR5, which delivers 75 GB/s,
        | while the VRAM in the discrete GPU delivers 670 GB/s.
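        | 
        | Back-of-the-envelope version of that bound (my numbers,
        | purely illustrative): if every generated token has to stream
        | the full weight set once, bandwidth sets a hard ceiling on
        | tokens per second:
        | 
        |     #include <cstdio>
        | 
        |     int main() {
        |         // Assume an 8B-parameter model in FP16 (~16 GB).
        |         double weights_gb = 8e9 * 2.0 / 1e9;
        |         double ddr5_gbs = 75.0;   // dual-channel DDR5
        |         double vram_gbs = 670.0;  // discrete GPU VRAM
        |         printf("CPU ceiling: %.1f tok/s\n",
        |                ddr5_gbs / weights_gb);
        |         printf("GPU ceiling: %.1f tok/s\n",
        |                vram_gbs / weights_gb);
        |         return 0;
        |     }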
        
       ___________________________________________________________________
       (page generated 2024-10-11 23:01 UTC)