[HN Gopher] Initial CUDA Performance Lessons
___________________________________________________________________
Initial CUDA Performance Lessons
Author : ibobev
Score : 138 points
Date : 2024-10-11 10:01 UTC (12 hours ago)
(HTM) web link (probablydance.com)
(TXT) w3m dump (probablydance.com)
| elashri wrote:
| I like this writeup as it summarizes my journey with optimizing
| some CUDA code I wrote for an LHC experiment trigger. But I have
| a few comments on some of the details.
|
| There are 65536 registers per SM, not per thread block. You can
| indirectly control that by making your block take up the whole
| SM, but this presents its own problems.
|
| NVIDIA hardware limits the maximum number of threads to 1024
| (2048) and shared memory to 48 KB (64 KB) per SM. So if you
| consume all of that, or close to the maximum, in one thread
| block, then you are running one thread block per SM. You usually
| don't want to do that because it lowers your occupancy.
| Additionally, if the kernel you're running is not compute-bound
| and does not need all the registers or shared memory allocated
| to it, having fewer blocks on the SM can leave some compute
| resources idle. GPUs are designed to thrive on parallelism, and
| limiting the number of active blocks can cause underutilization
| of the SM's cores, leading to poor performance. Finally, if each
| thread block occupies an entire SM, you limit the scalability of
| your kernel to the number of SMs on the GPU. For example, if
| your GPU has 60 SMs, and each block uses one SM, you can only
| run 60 blocks in parallel, even if the problem you're solving
| could benefit from more parallelism. This can reduce the
| efficiency of the GPU for very large problem sizes.
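|
| A minimal sketch of how to check this (placeholder kernel, not
| from the original article): the CUDA occupancy API reports how
| many blocks of a given kernel fit on one SM under its register
| and shared memory budget, and multiplying by the SM count gives
| the number of blocks that can actually be resident at once.
|
|     #include <cstdio>
|
|     // Placeholder kernel just to have something to query.
|     __global__ void my_kernel(float* out, const float* in, int n)
|     {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) out[i] = in[i] * 2.0f;
|     }
|
|     int main() {
|         int blocksPerSm = 0;
|         // Resident blocks of 256 threads per SM for this kernel.
|         cudaOccupancyMaxActiveBlocksPerMultiprocessor(
|             &blocksPerSm, my_kernel, /*blockSize=*/256,
|             /*dynamicSmemBytes=*/0);
|         cudaDeviceProp prop;
|         cudaGetDeviceProperties(&prop, 0);
|         printf("%d blocks/SM x %d SMs = %d resident blocks\n",
|                blocksPerSm, prop.multiProcessorCount,
|                blocksPerSm * prop.multiProcessorCount);
|         return 0;
|     }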
| jhj wrote:
| Aiming for higher occupancy is not always a desired solution;
| what frequently matters more is avoiding global memory
| latencies by retaining more data in registers and/or shared
| memory. This was first noted in 2010 and is still true today:
|
| https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...
|
| I would also think in terms of latency hiding rather than just
| work parallelism (though latency hiding on GPUs is largely
| because of parallelism). This is the reason why GPUs have
| massive register files, because unlike modern multi-core CPUs,
| we omit latency reducing hardware (e.g., speculative execution,
| large caches, that out-of-order execution stuff/register
| renaming etc) and in order to fill pipelines we need to have
| many instructions outstanding, which means that the operands
| for those pending instructions need to stay around for a lot
| longer, hence the massive register file.
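|
| In code, the "keep it in registers" idea often shows up as
| simple register blocking: each thread handles several
| elements held in registers, so several independent loads can
| be in flight at once. A hedged sketch, not the slides' exact
| kernels:
|
|     constexpr int ITEMS = 4;  // per-thread elements, illustrative
|
|     __global__ void scale_reg_blocked(float* out, const float* in,
|                                       float alpha, int n)
|     {
|         int base = (blockIdx.x * blockDim.x + threadIdx.x) * ITEMS;
|         float r[ITEMS];                    // lives in registers
|     #pragma unroll
|         for (int k = 0; k < ITEMS; ++k)    // independent loads
|             r[k] = (base + k < n) ? in[base + k] : 0.0f;
|     #pragma unroll
|         for (int k = 0; k < ITEMS; ++k)    // compute on registers
|             if (base + k < n) out[base + k] = alpha * r[k];
|     }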
| elashri wrote:
| I agree that optimizing for lower occupancy can yield
| significant performance gains in specific cases, especially
| when memory latencies are the primary bottleneck. Leveraging
| ILP and storing more data in registers can indeed help reduce
| the need for higher occupancy and lead to more efficient
| kernels. The examples in the GTC2010 talks highlighted that
| quite well. However, I would argue that occupancy still plays
| an important role, especially for scalability and general-
| purpose optimization. Over-relying on low occupancy and fewer
| threads, while beneficial in certain contexts, has its
| limits.
|
| The first thing to consider is the register pressure.
| Increasing the number of registers per thread to optimize for
| ILP can lead to register spilling when the register file is
| exhausted, which drastically reduces performance. This
| becomes more pronounced as problem sizes scale up (the talk's
| examples avoid that problem). Many real-world applications,
| especially compute-bound kernels, need high occupancy to
| fully utilize the GPU's resources. Focusing too much on
| minimizing thread counts can lead to underutilization of the
| SM's parallel execution units. A standard example would be
| inference engines.
|
| Also, while low-occupancy optimizations can be effective for
| specific workloads (e.g., memory-bound kernels), designing
| code that depends on such strategies as a general practice
| can result in less adaptable and robust solutions for a wide
| variety of applications.
|
| I believe there is a balance to strike here: low occupancy
| can work for specific cases, while higher occupancy often
| provides better scalability and overall performance for
| more general use cases. But you have to test for that while
| you are optimizing your code. There is no general rule of
| thumb to follow here.
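|
| One concrete knob for striking that balance is
| __launch_bounds__: it tells the compiler the largest block
| size you will launch with and, optionally, how many blocks
| per SM you would like resident, so it can budget registers
| per thread accordingly. The numbers below are illustrative
| only:
|
|     // 256 threads per block, aim for >= 4 resident blocks/SM.
|     __global__ void __launch_bounds__(256, 4)
|     bounded_kernel(float* out, const float* in, int n)
|     {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) out[i] = in[i] + 1.0f;
|     }
|
|     // Build with "nvcc -Xptxas=-v" to see registers/thread and
|     // any spill loads/stores the register cap introduces.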
| dahart wrote:
| Good points, though I agree with sibling that higher occupancy
| is not the goal; higher performance is the goal. Since
| registers are such a precious resource, you often want to set
| your block size and occupancy to whatever is best for keeping
| active state in registers. If you push the occupancy higher,
| then the compiler might be forced to spill registers to VRAM,
| and that will just slow everything down even though the
| occupancy goes up.
|
| Another thing to maybe mention, re: "if your GPU has 60 SMs,
| and each block uses one SM, you can only run 60 blocks in
| parallel"... CUDA tends to want to have at least 3 or 4 blocks
| per SM so it can round-robin them as soon as one stalls on a
| memory load or sync or something else. You might only make
| forward progress on 60 separate blocks in any given cycle, but
| it's quite important that you have like, for example, 240
| blocks running in "parallel", so you can benefit from latency
| hiding. This is where a lot of additional performance comes
| from, doing work on one block while another is momentarily
| stuck.
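|
| A hedged sketch of that launch-sizing habit (kernel is a
| placeholder): let the runtime suggest a block size, then
| launch enough blocks to cover the data, which for large
| inputs is far more blocks than there are SMs.
|
|     __global__ void some_kernel(float* out, const float* in,
|                                 int n)
|     {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) out[i] = in[i];
|     }
|
|     void launch(float* out, const float* in, int n)
|     {
|         // Block size the runtime expects to maximize occupancy.
|         int minGridSize = 0, blockSize = 0;
|         cudaOccupancyMaxPotentialBlockSize(&minGridSize,
|                                            &blockSize, some_kernel);
|         // Enough blocks to cover n elements; the scheduler can
|         // switch to another resident block whenever one stalls.
|         int gridSize = (n + blockSize - 1) / blockSize;
|         some_kernel<<<gridSize, blockSize>>>(out, in, n);
|     }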
| winwang wrote:
| Is this really true in general? I'd expect it to be true for
| highly homogeneous blocks, but I'd also expect that kernels
| where the warps are "desynced" in memory operations to do
| just fine without having 3-4 blocks per SM.
| dahart wrote:
| Oh I think so, but I'm certainly not the most expert of
| CUDA users there is. ;) Still, you will often see CUDA try
| to alloc local and smem space for at least 3 blocks per SM
| when you configure a kernel. That can't possibly always be
| true, but is for kernels that are using modest amounts of
| smem, lmem, and registers. In general I'd say desynced
| mem ops are harder to make performant than highly
| homogeneous workloads, since those accesses are more
| likely to be uncoalesced and to miss in cache. Think
| about it this way: a kernel can
| stall for many many reasons (which Nsight Compute can show
| you), especially memory IO, but even for compute bound
| work, the math pipes can fill, the instruction cache can
| miss, some instructions have higher latency than others,
| etc. etc. Even a cache hit load can take dozens of cycles
| to actually fill. Because stalls are everywhere, these
| machines are specifically designed to juggle multiple
| blocks and always look for ways to make forward progress on
| _something_ without having to sit idle, that is how to get
| higher throughput and hide latency.
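|
| To make the coalescing point concrete, a hedged toy pair
| of kernels (not from the article): in the first, the 32
| threads of a warp read 32 adjacent floats, which the
| hardware merges into a few transactions; in the second,
| the same warp touches 32 far-apart cache lines.
|
|     __global__ void copy_coalesced(float* dst,
|                                    const float* src, int n)
|     {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) dst[i] = src[i];   // neighbors -> neighbors
|     }
|
|     __global__ void copy_strided(float* dst, const float* src,
|                                  int n, int stride)
|     {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         long j = (long)i * stride;    // neighbors -> far apart
|         if (j < n) dst[j] = src[j];
|     }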
| otherjason wrote:
| For devices with compute capability of 7.0 or greater (anything
| from the Volta series on), a single thread block can address up
| to the entire shared memory size of the SM; the 48 KB limit
| that older hardware had is no more. Most contemporary
| applications are going to be running on hardware that doesn't
| have the shared memory limit you mentioned.
|
| The claim at the end of your post, suggesting that >1 block per
| SM is always better than 1 block per SM, isn't strictly true
| either. In the example you gave, you're limited to 60 blocks
| because the thread count of each block is too high. You could,
| for example, cut the blocks in half to yield 120 blocks. But
| each block has half as many threads in it, so you don't
| automatically get any occupancy benefit by doing so.
|
| When planning out the geometry of a CUDA thread grid, there are
| inherent tradeoffs between SM thread and/or warp scheduler
| limits, shared memory usage, register usage, and overall SM
| count, and those tradeoffs can be counterintuitive if you
| follow (admittedly, NVIDIA's official) guidance that maximizing
| the thread count leads to optimal performance.
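|
| For completeness, a hedged sketch of the per-kernel opt-in
| that the >48 KB dynamic shared memory path uses on those
| newer parts (the 96 KB figure is illustrative; the actual
| per-architecture cap is reported by
| cudaDevAttrMaxSharedMemoryPerBlockOptin):
|
|     extern __shared__ float tile[];   // dynamic shared memory
|
|     __global__ void big_smem_kernel(float* out, const float* in,
|                                     int n)
|     {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
|         __syncthreads();
|         if (i < n) out[i] = tile[threadIdx.x];
|     }
|
|     void launch(float* out, const float* in, int n)
|     {
|         size_t smemBytes = 96 * 1024;   // above the 48 KB default
|         cudaFuncSetAttribute(big_smem_kernel,
|             cudaFuncAttributeMaxDynamicSharedMemorySize,
|             (int)smemBytes);
|         int threads = 256;
|         int blocks = (n + threads - 1) / threads;
|         big_smem_kernel<<<blocks, threads, smemBytes>>>(out, in, n);
|     }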
| amelius wrote:
| In the 90s we had segmented memory programming with near and far
| pointers, and you had to be very careful about when you used what
| type of pointer and how you'd organize your memory accesses. Then
| we got processors like the 286 that finally relieved us from this
| constrained way of programming.
|
| I can't help but feel that with CUDA we're having new constraints
| (32 threads in a warp, what?), which are begging to be unified at
| some point.
| dboreham wrote:
| 386?
| krapht wrote:
| I'll believe it when autovectorization is actually useful in
| day to day high performance coding work.
|
| It's just a hard problem. You can code ignorantly with high
| level libraries but you're leaving 2x to 10x performance on the
| table.
| dahart wrote:
| While reading I thought you were going to suggest unified
| memory between RAM and VRAM, since that's somewhat analogous,
| though that does exist with various caveats depending on how
| it's set up & used.
|
| SIMD/SIMT probably isn't ever going away, and vector computers
| have been around since before segmented memory; the 32 threads
| in a CUDA warp is the source of its performance superpower, and
| the reason we can even fit all the transistors for 20k
| simultaneous adds & multiplies, among other things, on the die.
| This is conceptually different from your analogy too: the
| segmented memory was a constraint designed to get around
| pointer size limits, but 32 threads/warp isn't getting us
| around any limits, it's just a design that provides high
| performance _if_ you can organize your threads to all do the
| same thing at the same time.
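|
| A small hedged example of that in action: because the 32 lanes
| of a warp execute together, they can exchange registers
| directly with warp shuffles, e.g. a per-warp sum that needs no
| shared memory at all.
|
|     __device__ float warp_sum(float v)
|     {
|         // Halve the shuffle distance each step: 16, 8, 4, 2, 1.
|         for (int offset = 16; offset > 0; offset >>= 1)
|             v += __shfl_down_sync(0xffffffffu, v, offset);
|         return v;   // lane 0 ends up holding the warp's total
|     }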
| talldayo wrote:
| You can blame ARM for the popularity of CUDA. At least x86 had
| a few passable vector ISA ops like SSE and AVX - the ARM spec
| only supports the piss-slow NEON in its stead. Since you're
| not going to unify vectors and mobile hardware anytime soon,
| the majority of people are overjoyed to pay for CUDA hardware
| where GPGPU compute is taken seriously.
|
| There were also attempts like OpenCL, which the industry
| rejected early on because they thought they'd never need a CUDA
| alternative. Nvidia's success is mostly built on the ignorance
| of their competition - if Nvidia was allowed to buy ARM then
| they could guarantee the two specs never overlap.
| oivey wrote:
| CUDA clobbered x86, not ARM. Maybe if x86's vector ops were
| better and more usable ARM would have been motivated to do
| better.
| refulgentis wrote:
| Whole concept sounds like groping in the dark for a Take to
| me: GPUs (CUDA) are orthogonal to consumer processors (ARM
| / X86). Maybe we could assume a platonic ideal merged chip,
| a CPU that acts like a GPU, but there's more differences
| between those two things than an instruction set for vector
| ops.
| oivey wrote:
| Yeah, that's true. CUDA is in large part for big HPC
| servers, where ARM historically wasn't a player and still
| isn't dominant. x86 got clobbered for HPC by CUDA.
| saagarjha wrote:
| ARM has SVE these days. This comment makes no sense, anyway:
| people don't do numerical computing on phones.
| miki123211 wrote:
| What are some actually good resources to learn this stuff?
| markhahn wrote:
| there are a thousand excellent CUDA programming courses/sites.
|
| none of what was mentioned in this blog post is news if you've
| ever had more than 2 hours of a CUDA course...
| corysama wrote:
| I answered that recently here:
| https://old.reddit.com/r/GraphicsProgramming/comments/1fpi2c...
| saagarjha wrote:
| https://siboehm.com/articles/22/CUDA-MMM is a good start.
| markhahn wrote:
| little annoying to see the one-core-compared-to-whole-GPU
| comparisons - now decades past the point when this was an
| innocent mistake.
|
| compare a 500W GPU to all the cores of a 500W CPU, please. I'm
| not expecting the CPU (say, a 192-core AMD that does fast AVX512)
| to beat the GPU on all data-parallel workloads, but it won't be
| the silly sort of graphs shown in this blog.
|
| or compare one SM to one CPU core - that has merit as well.
|
| better yet, we're finally getting some CPUs (well, APUs...) with
| in-package RAM. that makes the comparison more interesting as
| well.
| oivey wrote:
| The first example plot is a 9950X that includes all threads
| with AVX512 vs a 4090. The 9950X has a 170W TDP, which doesn't
| include any other components like the RAM or motherboard. The
| 4090's total max power is ~450W. The chart shows the 4090
| burying the 9950X by far more than 450/170.
|
| Comparing SMs to CPU cores 1:1 also makes no sense. They don't
| do the same things.
| adrian_b wrote:
| It should be kept in mind that a 4090 only buries a 9950X for
| FP32 computations.
|
| For FP64 computations, the reverse happens, a 9950X buries a
| 4090, despite the latter having a 3-times higher price and a
| 2.5-times higher power consumption.
|
| For FP64 operations, 4090 and 9950X are able to do a similar
| number of operations per clock cycle (288 vs. 256), but 9950X
| can do them at a double clock frequency and it is easier to
| reach a high fraction of the maximum theoretical throughput
| on a 9950X than on a 4090.
| xfalcox wrote:
| What about FP8? It is a target that is very popular for LLM
| inference.
| adrian_b wrote:
| AMD Zen 5 has the so-called "Vector Neural Network
| Instructions", which can be used for inference with INT8
| quantization, and it also has instructions for computing
| inference with BF16 quantization.
|
| FP8 is a more recent quantization format and AFAIK no CPU
| implements it.
|
| I do not know what the throughput of these
| instructions is for Zen 5. It must be higher than for older
| CPUs, but it must be slower than for the Intel Xeon
| models that support AMX (which are much more expensive,
| so despite having a higher absolute performance for
| inference, they might have lower performance per dollar)
| and obviously it must be slower than for the tensor cores
| of a big NVIDIA GPU.
|
| Nevertheless, for models that do not fit inside the
| memory of a GPU, inference on a Zen 5 CPU may become
| competitive.
| lmeyerov wrote:
| Nice!
|
| It's interesting from the perspective of maintenance too. You can
| bet most constants like warp sizes will change, so you get into
| things like having profiles, autotuners, or not sweating the
| small stuff.
|
| We went more extreme, and nowadays focus several layers up: by
| accepting the (high!) constant overheads of tools like RAPIDS
| cuDF, we get in exchange the ability to easily crank out code
| with good saturation on the newest GPUs that any data scientist
| can edit and extend. Likewise, they just need to understand
| basics like data movement and columnar analytics data reps to
| make GPU pipelines. We have ~1 CUDA kernel left and many years of
| higher-level code.
|
| As an example, this is one of the core methods of our new graph
| query language GFQL (think cypher on pandas/spark, w optional GPU
| runtime), and it gets Graph500 level performance on cheapo GPUs
| just by being data parallel with high saturation per step:
| https://github.com/graphistry/pygraphistry/blob/master/graph... .
| Despite ping-ponging a ton because cudf doesn't (yet) coalesce
| GPU kernel calls, V1 competes surprisingly well, and is easy to
| maintain & extend.
| trentnelson wrote:
| Had any exposure to r=2 hypergraph implementations on the GPU?
| Ideally with an efficient way to determine if the graph is
| acyclic?
|
| (The CPU algos for doing this work great on CPUs but are woeful
| on GPUs.)
| lmeyerov wrote:
| Pretty good - r=2 is a regular graph afaict, and basically
| anything that maps to a frontier-based pattern works well.
| Ex: level synchronous bfs during topological sort.
|
| For the 'easy' way we do in gfql, which is basically vector
| ops on bulk wavefronts, we can do massive cypher traversals
| like you're asking, like 100M edges touched in a result
| substep, and on a tiny GPU. There are other such bulk
| patterns we want to add such as Pregel style, which open
| other algorithms here. In practice we can often just call
| cudf/cugraph as building blocks so haven't had the pressure
| to do so yet.
|
| The weak spot I find is more like small OLTP lookups. Ex:
| Imagine a taxi routing traffic service pinging for one car to
| do a couple hops out, where you just want a KV store in cheap
| RAM. But if you are batching those queries, like in a heavy
| city, and going deeper on them, maybe more interesting.
| bagels wrote:
| Definitely not an expert, but trying to use AVX instructions
| explicitly in a C++ program can also produce suboptimal
| performance vs. just letting the optimizer decide, much like this
| article points out with not shaping your memory and compute to
| fit the GPU model.
| Mithriil wrote:
| In the conclusion, I like the image:
|
| > My mental model [for GPU threads] is that you've got a bunch of
| container ships that can travel at 10% of the speed of light.
| You're using them to ship goods around the world. They're very
| fast so most of the work is in setting up your harbors so that
| you can load and unload these container-ships in fractions of a
| second so that it can sail to do the next thing. It's not easy to
| feed these beasts, but if you do it right you can do huge chunks
| of work in almost no time.
| Const-me wrote:
| Note that not all problems are compute bound. Many practical
| problems bottleneck on memory bandwidth.
|
| For example, LLM AI inference on a desktop (where you don't have
| a dozen concurrent sessions from multiple users) is guaranteed
| to be memory bound, fetching gigabytes of model tensors for each
| generated token. For use cases like that, specialized tensor
| cores deliver about the same performance as well-written compute
| shaders running on general purpose GPU cores.
|
| However, AVX512 is way slower than GPUs, because modern GPUs have
| memory with very high bandwidth. In my desktop computer the
| system memory is dual-channel DDR5, which delivers 75 GB/s; the
| VRAM in the discrete GPU delivers 670 GB/s.
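|
| A hedged back-of-the-envelope with those numbers: a bandwidth-
| bound decoder has to stream the whole model once per token, so
| tokens/s is at most bandwidth divided by model size (14 GB here
| is illustrative, roughly a 7B-parameter model at FP16).
|
|     #include <cstdio>
|     int main() {
|         double model_gb = 14.0;   // illustrative model size
|         printf("system RAM: <= %.1f tok/s\n",  75.0 / model_gb);
|         printf("GPU VRAM:   <= %.1f tok/s\n", 670.0 / model_gb);
|         return 0;
|     }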
___________________________________________________________________
(page generated 2024-10-11 23:01 UTC)