[HN Gopher] Gentle introduction to GPUs inner workings
       ___________________________________________________________________
        
       Gentle introduction to GPUs inner workings
        
       Author : ingve
       Score  : 310 points
       Date   : 2021-10-02 12:41 UTC (10 hours ago)
        
 (HTM) web link (vksegfault.github.io)
 (TXT) w3m dump (vksegfault.github.io)
        
       | tppiotrowski wrote:
        | I know there are a lot of JavaScript developers on this forum.
        | If you want to get into GPU programming, I highly recommend the
        | gpu.js [1] library as a jumping-off point. It's amazing how
        | powerful computers are and how we squander most of our cycles.
       | 
       | [1] https://gpu.rocks/#/
       | 
       | Disclaimer: I have one un-merged PR in the gpu.js repo
        
         | tenaciousDaniel wrote:
         | Thanks! I'm a JS dev who happens to be very interested in
         | getting into graphics.
        
           | Jasper_ wrote:
           | Note that GPU.js is not for graphics, it's primarily for
           | doing GPU Compute. I'd recommend looking at a 3D scene
           | library to get your feet wet (and there are plenty of those
           | in JS), or if you're interested in the underlying workings,
           | to look at either WebGL or WebGPU.
        
         | corysama wrote:
          | Whenever I experience a slow, stuttering UI, I am reminded
          | that 11 years ago, an iPhone 3GS could do this:
          | https://m.youtube.com/watch?v=kqz5ehun-o0
        
       | h2odragon wrote:
       | Very nice. This is gentle like movie dinosaurs are "cheeky
       | lizards". I'd hate to see the "Turkish prison BDSM porn" version.
       | 
       | I'm looking at graphics code, again, from a "I know enough C to
       | shoot myself in the foot and want to draw a circle on the screen"
       | perspective. It's hilarious how much "stack" there is in all the
       | ways of doing that; I look at some of this shit and want to go
       | back to Xlib for its simple grace.
        
         | Jasper_ wrote:
         | 2D graphics is very different and mostly doesn't require the
         | GPU's assistance. If you want to plot a circle on an image and
         | then display that to the screen, you don't require any of this
         | stack. If you want a high-level library that will draw a circle
         | for you, you can use something like Skia or Cairo which will
         | wrap this for you into a C API.
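          | 
          | As a rough illustration, here is a minimal sketch in plain
          | Python -- no GPU and no windowing stack, just a pixel buffer
          | written out as a PPM file you can open in any image viewer:
          | 
          |   # plot a filled circle into an RGB pixel buffer
          |   W, H, R = 256, 256, 80
          |   cx, cy = W // 2, H // 2
          |   pixels = bytearray(W * H * 3)   # starts out black
          |   for y in range(H):
          |       for x in range(W):
          |           if (x - cx) ** 2 + (y - cy) ** 2 <= R * R:
          |               i = (y * W + x) * 3
          |               pixels[i:i + 3] = b"\xff\xff\xff"
          |   with open("circle.ppm", "wb") as f:
          |       f.write(b"P6 %d %d 255\n" % (W, H))
          |       f.write(bytes(pixels))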
         | 
         | GPUs solve problems of much larger scale, and so the stack has
         | evolved over time to meet the needs of those applications. All
         | this power and corresponding complexity has been introduced for
         | a reason, I assure you.
        
           | zozbot234 wrote:
           | > 2D graphics is very different and mostly doesn't require
           | the GPU's assistance.
           | 
           | There are interesting tasks in 2D graphics that may easily
           | warrant GPU acceleration, such as smooth animation of
           | composited surfaces (even something as simple as displaying a
           | mouse pointer falls under this, and is accelerated in most
           | systems) or rendering complex shapes including text.
        
       | floatboth wrote:
       | > Mesa 3D - driver library that provides open source driver
       | (mostly copy of AMD-VLK)
       | 
       | Wrong, wrong, very wrong. No copy here. Mesa's RADV was developed
       | completely independently from AMD, in fact it _predates_ the
        | public release of AMDVLK. It's also possibly the best Vulkan
       | implementation out there. Valve invested heavily in the ACO
       | compiler backend, so it compiles shaders both very well and very
       | quickly.
        
         | MayeulC wrote:
          | Not to mention that there is _much_ more to Mesa than RADV:
          | Gallium3D and the state trackers, RadeonSI, and a number of
          | other drivers (Apple M1, Qualcomm Adreno, Broadcom, Mali), to
          | name only a few.
         | 
         | The status of some drivers is tracked here:
         | https://mesamatrix.net/
        
           | Jasper_ wrote:
           | Note that a lot of the Gallium3D architecture is designed for
           | the stateful APIs like OpenGL. Very little of it is used in
           | the Vulkan drivers.
        
         | mhh__ wrote:
         | (Thank you valve!)
        
           | MayeulC wrote:
           | Valve is contributing quite a bit, with their ACO compiler
           | for instance, but the RADV driver did not in fact originate
           | with them. We owe a great deal to David Airlie for that. He
            | wrote a complete Vulkan driver _years_ before AMD released
           | the source code for AMDVLK. Valve started contributing a few
           | years in, IIRC.
           | 
            | To AMD's credit, they did help with documentation and
            | questions, and much of the work was built on top of what was
            | already there for RadeonSI (OpenGL): the shader compiler
            | back-end, etc.
           | 
           | https://www.phoronix.com/scan.php?page=news_item&px=RADV-
           | Rad... -> https://airlied.livejournal.com/81460.html
        
       | pixelpoet wrote:
       | This article says that AMD GPUs are vector in nature, but I think
       | that stopped being the case with GCN; before that they had some
       | weird vector stuff with 5 elements or something.
        
         | dragontamer wrote:
         | Wrong. GPUs are definitely vector (GCN in particular being 64 x
         | 32-bit wide vectors).
         | 
         | What you're confusing is "Terascale" (aka: the 6xxx series from
         | the 00s), which was VLIW _AND_ SIMD. Which was... a little bit
         | too much going on and very difficult to optimize for. GCN made
         | things way easier and more general purpose.
         | 
          | Terascale theoretically offered more GFLOPs than the first GCN
          | processors (after all, VLIW unlocks a good amount of
          | performance), but actually utilizing all those VLIW units per
          | clock tick was a nightmare. It's hard enough to write good SIMD
          | code as it is.
        
           | Jasper_ wrote:
           | Eh, it's sort of a naming conflict.
           | 
            | There are two different senses of "vector" here: whether
            | each thread sees XMM-style vector registers (requiring the
            | compiler to auto-vectorize operations within a thread), and
            | whether the hardware runs many threads side by side in SIMD
            | lanes. The old per-thread vector ISAs had things like dot
            | product instructions. Those things are now all gone, because
            | it turns out they're hard to optimize for.
            | 
            | We in the industry call scalar-for-each-thread ISAs "scalar
            | ISAs" these days. For instance, Mali describes their
            | transition from a vec4-for-each-thread ISA to a scalar-for-
            | each-thread ISA as going from "vector to scalar" [0].
           | 
           | [0] Compare 4:26 and 6:29 in the "Mali GPU Family" video on
           | this page https://developer.arm.com/solutions/graphics-and-
           | gaming/arm-...
        
             | dragontamer wrote:
              | Connection Machine / Thinking Machines Corporation's use
              | of "scalar-like SIMD" was around in the late 1980s and early
             | 1990s. Look up *Lisp, the parallel language that Thinking
             | Machines Corporation used. You can still find *Lisp manuals
             | today. https://en.wikipedia.org/wiki/*Lisp
             | 
             | Intel implemented SIMD using a technique they called SWAR:
             | SIMD within a Register (aka: the 64-bit MMX registers),
             | which eventually evolved into XMM (SSE), YMM (AVX), and ZMM
             | (AVX512).
             | 
             | Today's GPUs are programmed using the old 1980s style /
             | Connection Machine *Lisp, which is clearly the source of
             | inspiration for HLSL / GLSL / OpenCL / CUDA / etc. etc.
             | 
              | Granted, today's GPUs are still SWAR (GCN's vGPR really is
              | just a 64-wide x 32-bit register). We can see, with
              | languages like ispc (Intel's SPMD program compiler), that
              | we can indeed implement a CUDA-like language on top of AVX
              | / XMM registers.
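              | 
              | As a rough illustration of that SPMD-on-SIMD idea, here is
              | a little NumPy sketch (not ispc itself, just the general
              | trick): each array element plays the role of one "thread",
              | and a per-thread branch becomes a mask applied to whole-
              | register operations.
              | 
              |   import numpy as np
              |   
              |   # eight "threads" packed into one SIMD-style register
              |   x = np.arange(8, dtype=np.float32)
              |   
              |   # per-thread source:
              |   #   if (x > 3) y = x * 2; else y = x + 100;
              |   mask = x > 3          # the execution mask
              |   y = np.where(mask, x * 2, x + 100)
              |   # both sides are evaluated and the mask picks the
              |   # result per lane -- which is why divergent
              |   # branches cost you on SIMD hardware
              |   print(y)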
        
               | Jasper_ wrote:
               | I don't think any of it was based on *Lisp. Graphics
               | mostly developed independently. And as such, we use
               | different words and different terminology sometimes, like
               | saying "scalar ISA" when we talk about designing ISAs
               | that don't mandate cross-lane interaction in one thread.
               | Sorry!
               | 
               | As far as I know, the first paper covering using SIMD for
               | graphics was Pixar's "Channel Processor", or Chap [0], in
               | 1984. This later became one of the core implementation
               | details of their REYES algorithm [1]. By 1989, they had
               | their own RenderMan Shading Language [2], an improved
               | version of Chap, and you can see the similarities from
                | just the snippet at the start of the code. This is what
                | Microsoft took major inspiration from when designing
                | HLSL, and what NVIDIA then started to extend with their
                | own Cg compiler. 3dlabs then copy/pasted this for GLSL.
               | 
                | [0] http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/re...
                | [1] https://graphics.pixar.com/library/Reyes/paper.pdf
                | [2] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21...
        
             | [deleted]
        
             | atq2119 wrote:
             | The terms are definitely overloaded and can differ subtly
              | between vendors. Those "scalar-for-each-thread"
              | instructions are prefixed "v_" on AMD's GCN. The "v"
              | doesn't stand for "scalar" ;)
        
               | Jasper_ wrote:
               | Yeah, certainly having "V"GPRs in a "scalar ISA" is a bit
               | off. In graphics parlance, the term really just refers to
               | the lack of things like dot products and matrix multiply
               | instructions where the compiler has to line numbers into
               | "per-thread" vectors for the instructions to work out.
               | Turns out you can save a lot of work by not having that
               | kind of ISA, and just moving to ones where you operate on
               | number registers (which are then "vector"'d to be thread-
               | count wide).
               | 
               | Wouldn't be graphics unless we overloaded a piece of
               | terminology 20 times.
        
       | [deleted]
        
       | jimmyvalmer wrote:
        | > GTX version of Turing architecture (1660, 1650) has no Tensor
        | > cores, instead it has freely available FP16 units!
       | 
       | It's always a bad sign when an author's exclamation has all the
       | surprise factor of a tax code. As a previous poster said,
       | "gentle" is relative.
        
       | sanketsarang wrote:
       | On the same basis, it would also help if you could provide a
       | comparison between GPUs commonly used for ML. Tesla k80, P100,
       | T4, V100 and A100. How has the architecture evolved to make the
        | A100 significantly faster? Is it just the 80GB of RAM, or is
        | there more to it from an architecture standpoint?
        
         | einpoklum wrote:
         | > How has the architecture evolved to make the A100
         | significantly faster?
         | 
         | Oh, very much so. By way more than an order of magnitude. For a
         | deeper read, have a look at the "architecture white papers" for
         | Kepler, Pascal, Volta/Turing, and Ampere:
         | 
         | https://duckduckgo.com/?t=ffab&q=NVIDIA+architecture+white+p...
         | 
         | or check out the archive of NVIDIA's parallel4all blog ... hmm,
         | that's weird, it seems like they've retired it. They used to
         | have really good blog posts explaining what's new in each
         | architecture.
         | 
         | You could also have a look here:
         | 
         | https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
         | 
         | for the table of various numeric sizes and limits which change
         | with different architectures. But that's not a very useful
         | resource in and of itself.
        
         | M277 wrote:
         | You may find this[0] helpful (note -- download link to a .PDF).
         | It's the GA100 whitepaper.
         | 
         | [0]: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-
         | Cent...
        
         | kkielhofner wrote:
          | For a start, the T4 is heavily optimized for low power
          | consumption on inference tasks. IIRC it doesn't even require
          | additional power beyond what the PCIe bus can provide, but it's
          | basically useless for training, unlike the others.
        
       | wpietri wrote:
       | Have folks seen good Linux tools for actually
       | monitoring/profiling the GPU's inner workings? Soon I'll need to
       | scale running ML models. For CPUs, I have a whole bag of tricks
       | for examining and monitoring performance. But for GPUs, I feel
       | like a caveman.
        
         | corysama wrote:
         | https://developer.nvidia.com/nsight-compute
        
         | not-elite wrote:
         | In nvidia land, `nvidia-smi` is like `top` for your gpus. If
         | you're running compiled CUDA, `nvprof` is very useful. But I'm
         | not sure how much work it would take to profile something like
         | a pytorch model.
         | 
         | https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-h...
        
           | arcanus wrote:
           | In AMD land, `rocm-smi` is like `top` for your gpus. Very
           | similar in capabilities and functionality to NVIDIA.
        
         | zetazzed wrote:
          | For NVIDIA GPUs, Nsight Systems is wildly detailed and has
          | both GUI and CLI options:
          | https://developer.nvidia.com/nsight-systems
         | 
         | For DL specifically, this article covers a couple of options
         | that actually plug into the framework:
         | https://developer.nvidia.com/blog/profiling-and-optimizing-d...
         | 
         | nvidia-smi is the core tool most folks use for quick "top"-like
         | output, but there is also an htop equivalent:
         | https://github.com/shunk031/nvhtop
         | 
          | A lot of other tools are built on top of the low-level NVML
         | library (https://developer.nvidia.com/nvidia-management-
         | library-nvml). There are also Python NVML bindings if you need
         | to write your own monitoring tools.
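          | 
          | If you do end up writing your own tooling on top of NVML, a
          | minimal sketch with the pynvml bindings looks roughly like
          | this (double-check the exact field names against the pynvml
          | docs):
          | 
          |   import pynvml
          |   
          |   pynvml.nvmlInit()
          |   handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
          |   util = pynvml.nvmlDeviceGetUtilizationRates(handle)
          |   mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
          |   print("gpu %d%%  mem %d/%d MiB" % (
          |       util.gpu, mem.used // 2**20, mem.total // 2**20))
          |   pynvml.nvmlShutdown()
          | 
          | Wrap that in a loop with a sleep and you have a crude
          | nvidia-smi of your own.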
        
         | danielmorozoff wrote:
          | I personally like gpustat -- it's an nvidia-smi wrapper but it
          | has colors...
          | 
          | They also, I guess, now have a web server plugged into it,
          | which seems pretty cool.
         | 
         | https://github.com/wookayin/gpustat
         | https://github.com/wookayin/gpustat-web
        
       | dragontamer wrote:
       | For another gentle introduction to GPU architecture, I like "The
       | Graphics Codex", specifically the chapter on Parallel
       | Architectures:
       | https://graphicscodex.courses.nvidia.com/app.html?page=_rn_p...
        
         | corysama wrote:
         | Also https://fgiesen.wordpress.com/2011/07/09/a-trip-through-
         | the-...
        
         | M277 wrote:
         | Thanks a lot, always enjoy your posts on here and r/hardware!
         | Do you have a hardcore introduction with even more detail /
         | perhaps even with examples of implementations? :)
         | 
         | I find white papers quite good (although I admit there are many
         | things I don't understand yet and constantly have to look up),
         | but even these sometimes feel a bit general.
        
       | ww520 wrote:
        | Parallelism across the computing units permeates the GPU's
        | entire computing model. For the longest time I didn't get how
        | partial derivative instructions like dFdx/dFdy/ddx/ddy work.
        | None of the docs helped. These instructions take in a number and
        | return its partial derivative -- just a generic number, nothing
        | to do with graphics, geometry, or any explicit function. The
        | number could have been the dollar amount of a mortgage and its
        | partial derivative would still be returned.
        | 
        | It turns out these functions are tied to how the GPU
        | architecture runs the computing units in parallel. The computing
        | units are arranged in a geometric grid matching the input data,
        | and they run the same program in LOCK STEP in parallel (well, at
        | least in lock step upon arriving at the dFdx/dFdy instructions).
        | They also know their neighbors in the grid. When a computing
        | unit encounters a dFdx instruction, it reaches across and grabs
        | the input values its neighbors fed to their own dFdx
        | instructions. All the neighbors arrive at dFdx at the same time
        | with their input values ready. With the neighbors' numbers and
        | its own number as the midpoint, it can compute the partial
        | derivative as a finite-difference slope.
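        | 
        | Here's a toy Python model of that lock-step idea (purely
        | illustrative -- real hardware works on small fixed groups of
        | pixels, but the "grab your neighbor's value" part is the same):
        | 
        |   # every "lane" evaluates the same program on its own x
        |   xs = [0.0, 1.0, 2.0, 3.0]      # one input per lane
        |   f = [x * x for x in xs]        # all lanes, in lock step
        |   
        |   # dFdx: each lane grabs its neighbors' values of f
        |   def dfdx(lane):
        |       left = f[max(lane - 1, 0)]
        |       right = f[min(lane + 1, len(f) - 1)]
        |       return (right - left) / 2.0   # slope around this lane
        |   
        |   print([dfdx(i) for i in range(len(f))])
        |   # f(x) = x*x, so the interior lanes report about 2*x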
        
         | anonymous532 wrote:
         | I don't sign in often, thank you for this amazing reveal.
        
         | bigdict wrote:
         | So the function F is implicitly defined to have value F(x,y) =
         | v, where x and y are coordinates of the core, and v is the
         | input value to the dFdx/dFdy instruction? Then the output of
         | the instruction running on the x,y core (take dFdx for example)
         | is supposedly equal to (F(x+1,y)-F(x-1,y))/2?
        
           | ww520 wrote:
            | Yes. That's the idea. Though I'm not exactly sure how the
            | GPU uses the neighboring values to interpolate the partial
            | derivative. It's probably GPU-dependent.
        
             | Jasper_ wrote:
              | It's defined to work on a 2x2 grid of pixels called a
              | quad. If you're the top-left pixel, then dFdX(v) =
              | v(x+1,y) - v. If you're the top-right pixel, then it's
              | dFdX(v) = v - v(x-1,y).
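              | 
              | In code form it's roughly this (a sketch; whether the two
              | rows share one difference or compute their own is the
              | coarse/fine distinction some APIs expose, and y is assumed
              | to grow downward):
              | 
              |   # one quad of per-pixel values, laid out as
              |   #   v[0] v[1]   (top-left,    top-right)
              |   #   v[2] v[3]   (bottom-left, bottom-right)
              |   def dFdx(v):
              |       top = v[1] - v[0]      # right minus left
              |       bot = v[3] - v[2]
              |       return [top, top, bot, bot]
              |   
              |   def dFdy(v):
              |       lft = v[2] - v[0]      # bottom minus top
              |       rgt = v[3] - v[1]
              |       return [lft, rgt, lft, rgt]
              |   
              |   print(dFdx([0.0, 1.0, 0.0, 1.0]))  # all 1.0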
        
             | corysama wrote:
              | Pretty much all of them work in 2x2 quads. dFdx/dFdy are
              | calculated within a quad, but not across quads. This is an
              | imperfect approximation to the derivative that needs to be
              | accounted for. But it's useful, very cheap, and trivially
              | easy.
        
         | [deleted]
        
         | bla3 wrote:
         | That sounds like a useful mental model. But it can't be quite
         | right, can it? There aren't enough cores to do _all_ pixels in
         | parallel, so how is that handled? Does it render tiles and
         | compute all edges several times for this?
        
           | ww520 wrote:
           | That's correct. There aren't enough physical cores to do all
           | cells in parallel. A 1000x1000 grid would have 1M cells.
           | These are virtual computing units. Conceptually there are 1M
           | virtual computing units, one per cell. The thousands of
           | physical cores take on the cells one batch at a time until
           | all of the cells are run. In fact multiple cores can run the
           | code of one cell in parallel. E.g. for-loop is unrolled and
           | each core takes on a different branch of the for-loop. The
           | scheduling of the cores to virtual computing units is like
           | the OS scheduling CPUs to different processes.
           | 
           | Most instructions have no inter-dependency between cells.
           | These instructions can be executed by a core as fast as it
           | can on one computing unit until an inter-dependent
           | instruction like dFdx is encountered that requires syncing.
           | The computing unit is put in a wait state while the core
           | moves on to another one. When all the computing units are
           | sync'ed up at the dFdx instruction, they're then executed by
           | the cores batch by batch.
        
             | Jasper_ wrote:
             | No, multiple cores can't run the code of the same cell in
             | parallel. The key here is that these cores run in lockstep,
              | so the program counter is the exact same for all cells
              | being run at once. In practice, each core supports 32 (or
              | sometimes 64) cells at a time. There is also no syncing
              | required for dFdX! Since the quads run in lockstep, the
              | results are guaranteed to be computed at the same time. In
              | practice, these are all computed in vector SIMD registers,
              | and a dFdX simply pulls a register from another SIMD lane.
             | 
             | If you want to execute more tasks than this, throw more
             | cores at it.
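              | 
              | A tiny sketch of the "more cells than lanes" part (toy
              | numbers, purely to show the batching):
              | 
              |   NUM_CELLS = 1_000_000
              |   WARP = 32     # cells sharing one program counter
              |   
              |   results = []
              |   for base in range(0, NUM_CELLS, WARP):
              |       cells = range(base, min(base + WARP, NUM_CELLS))
              |       # one batch: every lane runs instruction 1,
              |       # then every lane runs instruction 2, etc.
              |       a = [c * 0.5 for c in cells]
              |       b = [x * x for x in a]
              |       results.extend(b)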
        
       ___________________________________________________________________
       (page generated 2021-10-02 23:00 UTC)