[HN Gopher] The Missing Nvidia GPU Glossary
       ___________________________________________________________________
        
       The Missing Nvidia GPU Glossary
        
       Author : birdculture
       Score  : 116 points
       Date   : 2025-01-12 18:22 UTC (2 days ago)
        
 (HTM) web link (modal.com)
 (TXT) w3m dump (modal.com)
        
       | charles_irl wrote:
       | Oh hey, I wrote this!
       | 
       | Thanks for sharing it.
        
         | petermcneeley wrote:
         | Great work. Nice aesthetic.
         | 
         | "These groups of threads, known as warps , are switched out on
         | a per clock cycle basis -- roughly one nanosecond. CPU thread
         | context switches, on the other hand, take few hundred to a few
         | thousand clock cycles"
         | 
          | I would note that Intel's SMT does do something very similar
          | (2 hw threads). Others, like the Xeon Phi, would round-robin
          | 4 threads on a single core.
        
           | zeusk wrote:
            | SMT isn't really that, is it?
           | 
            | SMT allows for concurrent execution of both threads (thus
            | independent front ends, especially for fetch and decode),
            | and certain core resources are statically partitioned,
            | unlike warps being scheduled on an SM.
           | 
           | I'm not a graphics expert but warps seem closer to run-
           | time/dynamic VLIW than SMT.
        
           | charles_irl wrote:
           | Thanks!
           | 
            | > Intel's SMT does do something very similar (2 hw threads)
           | 
            | Yeah, that's a good point. One thing I learned from looking
            | at both hardware stacks more closely was that they aren't
            | as different as they seem at first -- lots of the same
            | ideas and techniques get used, but in different ways.
        
         | byteknight wrote:
         | I absolutely love the look. Is it a template or custom?
        
           | charles_irl wrote:
           | Custom! Took inspiration from lynx, lotus, and other classic
           | terminal programs.
        
         | ks2048 wrote:
         | Looks nice. I'm not sure if this is the place for it, but what
         | I am always searching for is a very concise table of the
         | different GPUs available with approximate compute power and
          | costs. Lists such as Wikipedia's [1] are way too complicated.
         | 
         | [1]
         | https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
        
           | charles_irl wrote:
            | Yeah, there's a tension between showing enough information
            | to be useful for driving decisions and hiding enough to
            | stay digestible.
           | 
           | For example, "compute capability" sounds like it'd be what
           | you need, but it's actually more of a software versioning
           | index :(
           | 
           | Was thinking of splitting the difference by collecting up the
           | quoted arithmetic (FLOP/s) and memory bandwidths from the
           | manufacturer datasheets. But there's caveats there too, e.g.
           | the dreaded "With sparsity" asterisk on the Tensor Core
           | FLOP/s of recent generations.
        
             | shihab wrote:
              | I was looking for a simple table recently -- outlining,
              | say, how the shared memory or total register size per SM
              | varies between generations (something like that Wiki
              | table). It was surprisingly hard to find that info.
        
       | K0IN wrote:
       | Is there a plain text / markdown / html version?
        
         | aithrowawaycomm wrote:
         | I would also like to see a PDF that has all the text in one
         | place, presented linearly. This looks like a very worthwhile
         | read, but waiting a few seconds for two paragraphs to load is a
         | very frustrating user experience.
        
           | charles_irl wrote:
            | A few seconds is way longer than we intended! When I click
            | around, all pages after the first load in milliseconds.
           | 
           | Do you have any script blockers, browser cache settings, or
           | extensions that might mess with navigation?
           | 
           | > would also like to see a PDF that has all the text in one
           | place, presented linearly
           | 
           | Yeah, good idea! I think a PDF with links so that it's still
           | easy to cross-reference terms would get the best of both
           | worlds.
        
             | swyx wrote:
             | book time book time
        
             | aithrowawaycomm wrote:
             | I am using Safari on iOS - I disabled private relay and
             | tested again, still seems oddly slow. No extensions; the
             | settings to periodically delete cookies and block popups
             | are enabled, don't see why those would affect this. Maybe
             | it's just HN traffic, thousands of people flipping through
             | the first dozen or so pages.
             | 
             | Edit: I just checked again and it didn't load at all... I
             | also see this is on the front page again, at 5:30pm Eastern
             | US time :) Probably HN hug of death.
        
         | htk wrote:
         | I'm with you. The theme is cool for a brief blog post, but
         | anything longer and I want out of the AS400 terminal.
        
           | ks2048 wrote:
           | I found it much better by clicking "light" at the top to
           | change theme.
        
             | mandevil wrote:
              | Found it more readable, yeah, but all of the captions on
              | the diagrams -- identifying block types by color -- no
              | longer made any sense.
        
               | charles_irl wrote:
               | Good callout! I'll work on the captions.
        
       | jms55 wrote:
       | The weird part of the programming model is that threadblocks
       | don't map 1:1 to warps or SMs. A single threadblock executes on a
       | single SM, but each SM has multiple warps, and the threadblock
       | could be the size of a single warp, or larger than the combined
       | thread count of all warps in the SM.
       | 
       | So, how large do you make your threadblocks to get optimal
       | SM/warp scheduling? Well it "depends" based on resource usage,
       | divergence, etc. Basically run it, profile, switch the
       | threadblock size, profile again, etc. Repeat on every
       | GPU/platform (if you're programming for multiple GPU platforms
       | and not just CUDA, like games do). It's a huge pain, and very
       | sensitive to code changes.
       | 
       | People new to GPU programming ask me "how big do I make the
       | threadblock size?" and I tell them go with 64 or 128 to start,
       | and then profile and adjust as needed.
       | 
       | Two articles on the AMD side of things:
       | 
       | https://gpuopen.com/learn/occupancy-explained
       | 
       | https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...
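The "start at 64 or 128, then profile" advice above can be put in code. A minimal sketch of a 1D launch-configuration helper (the function name and defaults are this sketch's own; the only assumption taken from the thread is a 32-thread warp and a power-of-two starting block size):

```python
import math

def launch_config(n_elements, block_size=128):
    """Given a 1D problem size, pick a CUDA-style (grid, block) pair.

    block_size follows the rule of thumb above: start at 64 or 128
    (a multiple of the 32-thread warp size), then profile and adjust.
    """
    assert block_size % 32 == 0, "block size should be a multiple of the warp size"
    grid_size = math.ceil(n_elements / block_size)  # enough blocks to cover n
    return grid_size, block_size

# e.g. one million elements with the suggested starting block size
grid, block = launch_config(1_000_000, block_size=128)
print(grid, block)  # -> 7813 128
```

The per-platform tuning the comment describes is then just re-running this with different `block_size` values under a profiler.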
        
         | EarlKing wrote:
         | Sounds like the sort of thing that would lend itself to runtime
         | optimization.
        
           | jms55 wrote:
            | I'm not too informed on the details, but IIRC drivers _do_
            | try to optimize shaders in the background, and then swap in
            | a better version when it's ready. But I doubt they do stuff
            | like change threadgroup size; the programmer might assume a
            | certain size, and their shader would break if it changed.
           | Also drivers doing background work means unpredictable
           | performance and stuttering, which developers really don't
           | like.
           | 
           | Someone correct me if I'm wrong, maybe drivers don't do this
           | anymore.
        
             | EarlKing wrote:
             | Well, if the user isn't going to be sharing the GPU with
             | another task then you could push things back to install-
             | time. In other words: At install time you conduct a
             | benchmark on the relevant shaders, rewrite as necessary,
             | recompile, and save the results accordingly. Now the user
             | has a version of your shaders optimized to their particular
             | configuration. Since installation times are already
             | somewhat lengthy anyway you can be reasonably certain that
             | no one is going to miss an extra minute or two needed to
             | conduct benchmarks, especially if it results in installing
             | optimized code.
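The install-time benchmarking loop described above is essentially autotuning. A minimal sketch, where `benchmark` is a hypothetical callable standing in for "launch the kernel/shader with this block size and time it" (a real system would launch on the GPU and synchronize before stopping the clock):

```python
import time

def autotune(benchmark, candidate_block_sizes=(32, 64, 128, 256, 512, 1024)):
    """Measure each candidate configuration and keep the fastest one."""
    timings = {}
    for block_size in candidate_block_sizes:
        start = time.perf_counter()
        benchmark(block_size)  # run the workload at this configuration
        timings[block_size] = time.perf_counter() - start
    return min(timings, key=timings.get)  # cache this choice at install time

# stand-in workload: pretend 128 threads/block is the sweet spot
best = autotune(lambda bs: time.sleep(abs(bs - 128) / 100_000))
print(best)  # -> 128
```

Saving `best` alongside the compiled shaders is the "optimized at install time" scheme the comment proposes.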
        
               | saagarjha wrote:
                | This is how autotuning often works, yes.
        
               | charles_irl wrote:
               | Coming from the neural network world, rather than the
               | shader world, but: I'd say you're absolutely right!
               | 
               | Right now NNs and their workloads are changing quickly
               | enough that people tend to prefer runtime optimization
               | (like the dynamic/JIT compilation provided by Torch's
               | compiler), but when you're confident you understand the
               | workload and have the know-how, you can do static
               | compilation (e.g. with ONNX, TensorRT).
               | 
               | I work on a serverless infrastructure product that gets
               | used for NN inference on GPUs, so we're very interested
               | in ways to amortize as much of that compilation and
               | configuration work as possible. Maybe someday we'll even
               | have something like what Redshift has in their query
               | engine -- pre-compiled binaries cached across users.
        
           | amelius wrote:
           | But which programming languages are most amenable to
           | automatic runtime optimization?
           | 
           | Should we go back to FORTRAN?
        
         | bassp wrote:
         | I was taught that you want, usually, more threads per block
         | than each SM can execute, because SMs context switch between
         | threads (fancy hardware multi threading!) on memory read stalls
         | to achieve super high throughput.
         | 
         | There are, ofc, other concerns like register pressure that
         | could affect the calculus, but if an SM is waiting on a memory
         | read to proceed and doesn't have any other threads available to
         | run, you're probably leaving perf on the table (iirc).
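The latency-hiding intuition above can be put in back-of-envelope form, Little's-law style. All numbers here are illustrative assumptions, not datasheet values:

```python
def warps_to_hide_latency(stall_cycles, issue_interval):
    """Rough estimate: if a memory read stalls a warp for `stall_cycles`,
    and each resident warp is only eligible to issue once every
    `issue_interval` cycles, this many resident warps keep the SM's
    issue slots busy for the duration of the stall."""
    return -(-stall_cycles // issue_interval)  # ceiling division

# ~400-cycle global memory stall, a warp eligible to issue every 4 cycles
print(warps_to_hide_latency(400, 4))  # -> 100
```

This is why "more threads per block than the SM can execute at once" can pay off: the surplus warps are what the scheduler switches to during stalls.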
        
           | saagarjha wrote:
           | Pretty sure CUDA will limit your thread count to hardware
           | constraints? You can't just request a million threads.
        
             | bassp wrote:
              | You can request up to 1024 threads per block (with up to
              | 2048 resident threads per SM) depending on the GPU; each
              | SM can issue for between 32 and 128 threads at a time!
              | So you can have many more threads assigned to an SM than
              | the SM can run at once.
        
             | buildbot wrote:
              | Thread counts per block are limited to 1024 (unless I've
              | missed a change and Wikipedia is wrong), but total
              | threads per kernel is 1024 * (2^32-1) * 65535 * 65535
              | ~= 2^74 threads.
             | 
             | https://en.wikipedia.org/wiki/Thread_block_(CUDA_programmin
             | g...
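The arithmetic above checks out. The per-dimension limits here are taken from the parent comment; note that the current CUDA programming guide lists the grid x-dimension limit as 2^31-1 on modern GPUs, so treat the exact values as assumptions:

```python
import math

max_threads_per_block = 1024
grid_x, grid_y, grid_z = 2**32 - 1, 65535, 65535  # per-dimension grid limits

total = max_threads_per_block * grid_x * grid_y * grid_z
print(f"max threads per kernel ~= 2^{math.log2(total):.0f}")  # ~= 2^74
```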
        
           | einpoklum wrote:
            | > I was taught that you want, usually, more threads per
            | > block than each SM can execute, because SMs context
            | > switch between threads (fancy hardware multi threading!)
            | > on memory read stalls to achieve super high throughput.
           | 
           | You were taught wrong...
           | 
            | First, "execution" on an SM is a complex pipelined thing,
            | like on a CPU core (except without branching). If you mean
            | instruction issues, an SM can issue up to 4 instructions
            | per cycle, one for each of 4 warps (on NVIDIA hardware for
            | the last 10 years). But - there is no such thing as an SM
            | "context switch between threads".
           | 
            | Sometimes, more than 4 * 32 = 128 threads is a good idea.
            | Sometimes, it's a bad idea. This depends on things like:
           | 
            | * Amount of shared memory used per warp
           | 
           | * Makeup of the instructions to be executed
           | 
           | * Register pressure, like you mentioned (because once you
           | exceed 256 threads per block, the number of registers
           | available per thread starts to decrease).
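The ceilings listed above can be combined into a back-of-envelope occupancy estimate. A minimal sketch; the per-SM totals in the defaults are assumptions (they vary by architecture), and the point is only that threads, registers, and shared memory each impose a limit and the tightest one wins:

```python
def blocks_per_sm(block_size, regs_per_thread, smem_per_block,
                  sm_regs=65536, sm_smem=48 * 1024, sm_max_threads=2048):
    """How many blocks fit on one SM, given per-block resource usage?"""
    by_threads = sm_max_threads // block_size
    by_regs = sm_regs // (regs_per_thread * block_size)
    # a block using no shared memory isn't limited by it
    by_smem = sm_smem // smem_per_block if smem_per_block else by_threads
    return min(by_threads, by_regs, by_smem)

# 128-thread blocks, 32 registers/thread, 4 KiB shared memory per block:
print(blocks_per_sm(128, 32, 4096))  # -> 12, limited by shared memory
```

Real tools (e.g. the occupancy calculator shipped with CUDA) do this same calculation with the actual per-architecture tables.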
        
         | charles_irl wrote:
          | 100% -- there's basically no substitute for benchmarking! I
          | find the empiricism kind of comforting, coming from a
          | research science background.
         | 
          | IIUC, even cuBLAS basically just uses a bunch of heuristics,
          | mostly derived from benchmarking, to decide which kernels to
          | use.
        
         | einpoklum wrote:
         | > It's a huge pain, and very sensitive to code changes.
         | 
         | Optimization is very often like that. Making things generic,
         | uniform and simple typically has a performance penalty - and
         | you use your GPU because you care about that stuff.
        
       | EarlKing wrote:
       | FINALLY. Nvidia's always been pretty craptacular when it comes to
       | their documentation. It's really hard to read unless you already
       | know their internal names for, well, just about everything.
        
         | let_me_post_0 wrote:
          | Nvidia isn't very big on open source either. Most CUDA
          | libraries are still closed source. I think this might
          | eventually be their
         | downfall, because people want to know what they are working
         | with. For example with PyTorch, I can profile the library
         | against my use case and then decide to modify the official
         | library to get some bespoke optimization. With CUDA, if I need
         | to do that, I need to start from scratch and guess as to
         | whether the library from the api already has such
         | optimizations.
        
           | einpoklum wrote:
           | NVIDIA does have a bunch of FOSS libraries - like CUB and
           | Thrust (now part of CCCL). But - they tend to suffer from
           | "not invented here" syndrome [1] ; so they seem to avoid
           | collaboration on FOSS they don't manage/control by
           | themselves.
           | 
           | I have a bit of a chip on my shoulder here, since I've been
           | trying to pitch my Modern C++ API wrappers to them for years,
           | and even though I've gotten some appreciative comments from
           | individuals, they have shown zero interest.
           | 
           | https://github.com/eyalroz/cuda-api-wrappers/
           | 
           | There is also their driver, which is supposedly "open
           | source", but actually none of the logic is exposed to you.
           | Their runtime library is closed too, their management utility
           | (nvidia-smi), their LLVM-based compiler, their profilers,
           | their OpenCL stack :-(
           | 
           | I must say they do have relatively extensive documentation,
           | even if it doesn't cover everything.
           | 
           | [1] - https://en.wikipedia.org/wiki/Not_invented_here
        
       | joshdavham wrote:
       | Incredible work, thank you so much! This will hopefully break
       | down more barriers to entry for newcomers wanting to work with
       | GPUs!
        
         | charles_irl wrote:
         | Thanks for the kind words! I still feel like one of those
         | newcomers myself :)
         | 
         | Now that so many more people are running workloads, including
         | critical ones, on GPUs, it feels much more important that a
         | base level of knowledge and intuition is broadly disseminated
         | -- kinda like how most engineers basically grok database index
         | management, even if they couldn't write a high-performance B+
         | tree from scratch. Hope this document helps that along!
        
       | weltensturm wrote:
       | that's pretty
        
         | charles_irl wrote:
         | Thanks! We think that just because something is deeply
         | technical doesn't mean it has to be ugly.
        
       | saagarjha wrote:
       | It would be nice if this also included terms that are often used
       | by Nvidia that apparently come from computer architecture (?) but
       | are basically foreign to software engineers, like "scoreboard" or
       | "math pipe".
        
         | charles_irl wrote:
         | Great idea! We'll add some of those in the next round.
        
       | richwater wrote:
       | content is cool; usability and design of the website is awful
       | (although charming)
        
         | yshklarov wrote:
         | Not at all -- the usability and design are fantastic! (On
         | desktop, at least.)
         | 
          | What, specifically, do you find awful here?
        
       | germanjoey wrote:
       | This is really incredible, thank you!
        
       | einpoklum wrote:
       | This has been submitted, like, five times already over the past 5
       | weeks:
       | 
       | https://news.ycombinator.com/from?site=modal.com
        
       ___________________________________________________________________
       (page generated 2025-01-14 23:00 UTC)