[HN Gopher] The Missing Nvidia GPU Glossary
___________________________________________________________________
The Missing Nvidia GPU Glossary
Author : birdculture
Score : 116 points
Date : 2025-01-12 18:22 UTC (2 days ago)
(HTM) web link (modal.com)
(TXT) w3m dump (modal.com)
| charles_irl wrote:
| Oh hey, I wrote this!
|
| Thanks for sharing it.
| petermcneeley wrote:
| Great work. Nice aesthetic.
|
| "These groups of threads, known as warps , are switched out on
| a per clock cycle basis -- roughly one nanosecond. CPU thread
| context switches, on the other hand, take few hundred to a few
| thousand clock cycles"
|
| I would note that Intel's SMT does do something very similar (2
| hw threads). Others, like the Xeon Phi, would round-robin 4
| threads on a single core.
| zeusk wrote:
| SMT isn't really that, is it?
|
| SMT allows for concurrent execution of both threads (thus
| independent front-end for fetch, decode especially) and
| certain core resources are statically partitioned unlike a
| warp being scheduled on SM.
|
| I'm not a graphics expert but warps seem closer to run-
| time/dynamic VLIW than SMT.
| charles_irl wrote:
| Thanks!
|
| > Intel's SMT does do something very similar (2 hw threads)
|
| Yeah that's a good point. One thing I learned from looking at
| both hardware stacks more closely was that they aren't as
| different as they seem at first -- lots of the same ideas or
| techniques are used, but in different ways.
| byteknight wrote:
| I absolutely love the look. Is it a template or custom?
| charles_irl wrote:
| Custom! Took inspiration from lynx, lotus, and other classic
| terminal programs.
| ks2048 wrote:
| Looks nice. I'm not sure if this is the place for it, but what
| I am always searching for is a very concise table of the
| different GPUs available with approximate compute power and
| costs. Lists such as wikipedia [1] are way too complicated.
|
| [1]
| https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
| charles_irl wrote:
| Yeah, there's a tension between showing enough information to
| be useful for driving decisions and hiding enough
| information.
|
| For example, "compute capability" sounds like it'd be what
| you need, but it's actually more of a software versioning
| index :(
|
| Was thinking of splitting the difference by collecting up the
| quoted arithmetic (FLOP/s) and memory bandwidths from the
| manufacturer datasheets. But there's caveats there too, e.g.
| the dreaded "With sparsity" asterisk on the Tensor Core
| FLOP/s of recent generations.
| shihab wrote:
| I was looking for a simple table recently, outlining, say,
| how the shared memory or total register size/SM varies
| between generations (something like that Wiki table). It
| was surprisingly hard to find that info.
| K0IN wrote:
| Is there a plain text / markdown / html version?
| aithrowawaycomm wrote:
| I would also like to see a PDF that has all the text in one
| place, presented linearly. This looks like a very worthwhile
| read, but waiting a few seconds for two paragraphs to load is a
| very frustrating user experience.
| charles_irl wrote:
| A few seconds is way longer than we intended! When I click
| around, all pages after the first load in milliseconds.
|
| Do you have any script blockers, browser cache settings, or
| extensions that might mess with navigation?
|
| > would also like to see a PDF that has all the text in one
| place, presented linearly
|
| Yeah, good idea! I think a PDF with links so that it's still
| easy to cross-reference terms would get the best of both
| worlds.
| swyx wrote:
| book time book time
| aithrowawaycomm wrote:
| I am using Safari on iOS - I disabled private relay and
| tested again, still seems oddly slow. No extensions; the
| settings to periodically delete cookies and block popups
| are enabled, don't see why those would affect this. Maybe
| it's just HN traffic, thousands of people flipping through
| the first dozen or so pages.
|
| Edit: I just checked again and it didn't load at all... I
| also see this is on the front page again, at 5:30pm Eastern
| US time :) Probably HN hug of death.
| htk wrote:
| I'm with you. The theme is cool for a brief blog post, but
| anything longer and I want out of the AS400 terminal.
| ks2048 wrote:
| I found it much better by clicking "light" at the top to
| change theme.
| mandevil wrote:
| Found it more readable, yeah, but all of the captions on
| the diagrams- identifying block types by color- no longer
| made any sense.
| charles_irl wrote:
| Good callout! I'll work on the captions.
| jms55 wrote:
| The weird part of the programming model is that threadblocks
| don't map 1:1 to warps or SMs. A single threadblock executes on a
| single SM, but each SM has multiple warps, and the threadblock
| could be the size of a single warp, or larger than the combined
| thread count of all warps in the SM.
|
| So, how large do you make your threadblocks to get optimal
| SM/warp scheduling? Well it "depends" based on resource usage,
| divergence, etc. Basically run it, profile, switch the
| threadblock size, profile again, etc. Repeat on every
| GPU/platform (if you're programming for multiple GPU platforms
| and not just CUDA, like games do). It's a huge pain, and very
| sensitive to code changes.
|
| People new to GPU programming ask me "how big do I make the
| threadblock size?" and I tell them go with 64 or 128 to start,
| and then profile and adjust as needed.
|
| Two articles on the AMD side of things:
|
| https://gpuopen.com/learn/occupancy-explained
|
| https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...
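|
| A rough sketch of the occupancy arithmetic behind that advice,
| written in Python so it runs anywhere. The per-SM limits (64
| resident warps, 32 resident blocks, 65536 registers, 100 KiB of
| shared memory) are illustrative assumptions, not figures from
| this thread; real values vary by GPU generation, and the function
| names are made up for the example.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA hardware


def resident_warps(block_size, regs_per_thread, smem_per_block,
                   max_warps_per_sm=64, max_blocks_per_sm=32,
                   regs_per_sm=65536, smem_per_sm=100 * 1024):
    """Estimate how many warps of one kernel can be resident on one SM.

    All limits are hypothetical defaults; allocation granularity is ignored.
    """
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    # Each resource caps the number of blocks that fit on the SM.
    by_warps = max_warps_per_sm // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = (smem_per_sm // smem_per_block) if smem_per_block else max_blocks_per_sm
    blocks = min(max_blocks_per_sm, by_warps, by_regs, by_smem)
    return blocks * warps_per_block


def occupancy(block_size, regs_per_thread, smem_per_block, max_warps_per_sm=64):
    """Fraction of the SM's warp slots that the kernel can fill."""
    warps = resident_warps(block_size, regs_per_thread, smem_per_block,
                           max_warps_per_sm=max_warps_per_sm)
    return warps / max_warps_per_sm


if __name__ == "__main__":
    for size in (64, 96, 128, 256, 1024):
        print(size, occupancy(size, regs_per_thread=32, smem_per_block=0))
```

| Under these assumed limits, doubling the registers per thread
| halves the number of resident warps, which is one reason the
| "right" block size shifts with small code changes. The estimate
| only narrows the search; profiling still decides.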
| EarlKing wrote:
| Sounds like the sort of thing that would lend itself to runtime
| optimization.
| jms55 wrote:
| I'm not too informed on the details, but iirc drivers _do_
| try to optimize shaders in the background and then, when
| ready, swap in a better version. But I doubt they do stuff
| like change threadgroup size; the programmer might assume a
| certain size and their shader would break if it changed.
| Also drivers doing background work means unpredictable
| performance and stuttering, which developers really don't
| like.
|
| Someone correct me if I'm wrong, maybe drivers don't do this
| anymore.
| EarlKing wrote:
| Well, if the user isn't going to be sharing the GPU with
| another task then you could push things back to install-
| time. In other words: At install time you conduct a
| benchmark on the relevant shaders, rewrite as necessary,
| recompile, and save the results accordingly. Now the user
| has a version of your shaders optimized to their particular
| configuration. Since installation times are already
| somewhat lengthy anyway you can be reasonably certain that
| no one is going to miss an extra minute or two needed to
| conduct benchmarks, especially if it results in installing
| optimized code.
| saagarjha wrote:
| This is how autotuning often works yes
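|
| A minimal sketch of that kind of autotuning loop: time each
| candidate configuration a few times, keep the fastest, and
| store the winner for later runs. The names `autotune` and `run`
| are hypothetical, and the workload below is a CPU stand-in for
| launching a kernel with a given block size.

```python
import time


def autotune(candidates, run, repeats=3, timer=time.perf_counter):
    """Return (best_candidate, best_seconds) by timing run(candidate).

    Takes the minimum over `repeats` runs to damp noise from other work.
    """
    best, best_time = None, float("inf")
    for cand in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = timer()
            run(cand)
            elapsed = min(elapsed, timer() - start)
        if elapsed < best_time:
            best, best_time = cand, elapsed
    return best, best_time


if __name__ == "__main__":
    data = list(range(1 << 16))

    def run(chunk):
        # Stand-in workload: sum the data in chunks of the candidate size.
        return sum(sum(data[i:i + chunk]) for i in range(0, len(data), chunk))

    best, seconds = autotune([32, 64, 128, 256], run)
    print(f"fastest chunk size: {best} ({seconds:.6f} s)")
```

| At install time the chosen value could be written to a config
| file keyed by GPU model, so the benchmark only runs once per
| machine.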
| charles_irl wrote:
| Coming from the neural network world, rather than the
| shader world, but: I'd say you're absolutely right!
|
| Right now NNs and their workloads are changing quickly
| enough that people tend to prefer runtime optimization
| (like the dynamic/JIT compilation provided by Torch's
| compiler), but when you're confident you understand the
| workload and have the know-how, you can do static
| compilation (e.g. with ONNX, TensorRT).
|
| I work on a serverless infrastructure product that gets
| used for NN inference on GPUs, so we're very interested
| in ways to amortize as much of that compilation and
| configuration work as possible. Maybe someday we'll even
| have something like what Redshift has in their query
| engine -- pre-compiled binaries cached across users.
| amelius wrote:
| But which programming languages are most amenable to
| automatic runtime optimization?
|
| Should we go back to FORTRAN?
| bassp wrote:
| I was taught that you want, usually, more threads per block
| than each SM can execute, because SMs context switch between
| threads (fancy hardware multi threading!) on memory read stalls
| to achieve super high throughput.
|
| There are, ofc, other concerns like register pressure that
| could affect the calculus, but if an SM is waiting on a memory
| read to proceed and doesn't have any other threads available to
| run, you're probably leaving perf on the table (iirc).
| saagarjha wrote:
| Pretty sure CUDA will limit your thread count to hardware
| constraints? You can't just request a million threads.
| bassp wrote:
| You can request up to 1024-2048 threads per block depending
| on the gpu; each SM can execute between 32 and 128 threads
| at a time! So you can have a lot more threads assigned to
| an SM than the SM can run at once
| buildbot wrote:
| Thread counts per block are limited to 1024 (unless I've
| missed a change and wikipedia is wrong), but total threads
| per kernel is 1024 * (2^32-1) * 65535 * 65535 ~= 2^74
| threads
|
| https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...
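|
| The arithmetic can be checked directly. This sketch multiplies
| the per-block limit by the three grid dimensions, using the
| figures quoted in the comment as defaults. The function name is
| made up, and the limits are parameters because they depend on
| compute capability; as far as I know, NVIDIA's programming guide
| lists 2^31-1 for the grid x-dimension, which the second call uses.

```python
import math


def max_threads_per_kernel(threads_per_block=1024,
                           grid_x=2**32 - 1, grid_y=65535, grid_z=65535):
    """Upper bound on threads in one launch: block size times grid volume."""
    return threads_per_block * grid_x * grid_y * grid_z


quoted = max_threads_per_kernel()                   # figures from the comment
guide = max_threads_per_kernel(grid_x=2**31 - 1)    # assumed guide limit

print(f"quoted limits: about 2^{math.log2(quoted):.2f} threads")
print(f"with grid_x = 2^31-1: about 2^{math.log2(guide):.2f} threads")
```

| Either way the bound is far beyond what any kernel launches in
| practice; the per-block limit is the one that shapes code.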
| einpoklum wrote:
| > I was taught that you want, usually, more threads per block
| > than each SM can execute, because SMs context switch between
| > threads (fancy hardware multi threading!) on memory read
| > stalls to achieve super high throughput.
|
| You were taught wrong...
|
| First, "execution" on an SM is a complex pipelined thing,
| like on a CPU core (except without branching). If you mean
| instruction issues, an SM can issue up to 4 instructions per
| cycle, one for each of 4 warps (on NVIDIA hardware for the
| last 10 years). But - there is no such thing
| as an SM "context switch between threads".
|
| Sometimes, more than 4 * 32 = 128 threads is a good idea.
| Sometimes, it's a bad idea. This depends on things like:
|
| * Amount of shared memory used per warp
|
| * Makeup of the instructions to be executed
|
| * Register pressure, like you mentioned (because once you
| exceed 256 threads per block, the number of registers
| available per thread starts to decrease).
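|
| One way to see where a threshold near 256 threads could come
| from, as a sketch: assume a block can draw on a 65536-entry
| register file and that each thread is capped at 255 registers.
| Both numbers are assumptions for illustration, not taken from
| the thread, and the helper name is hypothetical.

```python
def registers_per_thread(threads_per_block,
                         regs_per_block=65536, max_regs_per_thread=255):
    """Largest per-thread register budget that still lets the block launch.

    Ignores allocation granularity; the limits are assumed defaults.
    """
    return min(max_regs_per_thread, regs_per_block // threads_per_block)


if __name__ == "__main__":
    for threads in (128, 256, 288, 512, 1024):
        print(threads, registers_per_thread(threads))
```

| Up to 256 threads the per-thread cap is the binding limit; past
| that, the shared register file is, so each extra warp shrinks
| the budget for every thread in the block.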
| charles_irl wrote:
| 100% -- there's basically no substitute for benchmarking! I find
| the empiricism kind of comforting, coming from a research
| science background.
|
| IIUC, even cuBLAS basically just uses a bunch of heuristics
| that are mostly derived from benchmarking to decide which
| kernels to use.
| einpoklum wrote:
| > It's a huge pain, and very sensitive to code changes.
|
| Optimization is very often like that. Making things generic,
| uniform and simple typically has a performance penalty - and
| you use your GPU because you care about that stuff.
| EarlKing wrote:
| FINALLY. Nvidia's always been pretty craptacular when it comes to
| their documentation. It's really hard to read unless you already
| know their internal names for, well, just about everything.
| let_me_post_0 wrote:
| Nvidia isn't very big on opensource either. Most CUDA libraries
| are still closed source. I think this might eventually be their
| downfall, because people want to know what they are working
| with. For example with PyTorch, I can profile the library
| against my use case and then decide to modify the official
| library to get some bespoke optimization. With CUDA, if I need
| to do that, I need to start from scratch and guess as to
| whether the library from the api already has such
| optimizations.
| einpoklum wrote:
| NVIDIA does have a bunch of FOSS libraries - like CUB and
| Thrust (now part of CCCL). But - they tend to suffer from
| "not invented here" syndrome [1] ; so they seem to avoid
| collaboration on FOSS they don't manage/control by
| themselves.
|
| I have a bit of a chip on my shoulder here, since I've been
| trying to pitch my Modern C++ API wrappers to them for years,
| and even though I've gotten some appreciative comments from
| individuals, they have shown zero interest.
|
| https://github.com/eyalroz/cuda-api-wrappers/
|
| There is also their driver, which is supposedly "open
| source", but actually none of the logic is exposed to you.
| Their runtime library is closed too, their management utility
| (nvidia-smi), their LLVM-based compiler, their profilers,
| their OpenCL stack :-(
|
| I must say they do have relatively extensive documentation,
| even if it doesn't cover everything.
|
| [1] - https://en.wikipedia.org/wiki/Not_invented_here
| joshdavham wrote:
| Incredible work, thank you so much! This will hopefully break
| down more barriers to entry for newcomers wanting to work with
| GPUs!
| charles_irl wrote:
| Thanks for the kind words! I still feel like one of those
| newcomers myself :)
|
| Now that so many more people are running workloads, including
| critical ones, on GPUs, it feels much more important that a
| base level of knowledge and intuition is broadly disseminated
| -- kinda like how most engineers basically grok database index
| management, even if they couldn't write a high-performance B+
| tree from scratch. Hope this document helps that along!
| weltensturm wrote:
| that's pretty
| charles_irl wrote:
| Thanks! We think that just because something is deeply
| technical doesn't mean it has to be ugly.
| saagarjha wrote:
| It would be nice if this also included terms that are often used
| by Nvidia that apparently come from computer architecture (?) but
| are basically foreign to software engineers, like "scoreboard" or
| "math pipe".
| charles_irl wrote:
| Great idea! We'll add some of those in the next round.
| richwater wrote:
| content is cool; usability and design of the website is awful
| (although charming)
| yshklarov wrote:
| Not at all -- the usability and design are fantastic! (On
| desktop, at least.)
|
| What, specifically, do you find awful here?
| germanjoey wrote:
| This is really incredible, thank you!
| einpoklum wrote:
| This has been submitted, like, five times already over the past 5
| weeks:
|
| https://news.ycombinator.com/from?site=modal.com
___________________________________________________________________
(page generated 2025-01-14 23:00 UTC)