[HN Gopher] Gentle introduction to GPUs inner workings
___________________________________________________________________
Gentle introduction to GPUs inner workings
Author : ingve
Score : 310 points
Date : 2021-10-02 12:41 UTC (10 hours ago)
(HTM) web link (vksegfault.github.io)
(TXT) w3m dump (vksegfault.github.io)
| tppiotrowski wrote:
| I know there's a lot of Javascript developers on this forum. If
| you want to get into GPU programming, I highly recommend gpu.js
| [1] library as a jumping off point. It's amazing how powerful
| computers are and how we squander most of our cycles.
|
| [1] https://gpu.rocks/#/
|
| Disclaimer: I have one un-merged PR in the gpu.js repo
| tenaciousDaniel wrote:
| Thanks! I'm a JS dev who happens to be very interested in
| getting into graphics.
| Jasper_ wrote:
| Note that GPU.js is not for graphics, it's primarily for
| doing GPU Compute. I'd recommend looking at a 3D scene
| library to get your feet wet (and there are plenty of those
| in JS), or if you're interested in the underlying workings,
| to look at either WebGL or WebGPU.
| corysama wrote:
| Whenever I experience a slow, stuttering UI, I am reminded that
| 11 years ago, an iPhone 3GS could do this:
| https://m.youtube.com/watch?v=kqz5ehun-o0
| h2odragon wrote:
| Very nice. This is gentle like movie dinosaurs are "cheeky
| lizards". I'd hate to see the "Turkish prison BDSM porn" version.
|
| I'm looking at graphics code, again, from a "I know enough C to
| shoot myself in the foot and want to draw a circle on the screen"
| perspective. It's hilarious how much "stack" there is in all the
| ways of doing that; I look at some of this shit and want to go
| back to Xlib for its simple grace.
| Jasper_ wrote:
| 2D graphics is very different and mostly doesn't require the
| GPU's assistance. If you want to plot a circle on an image and
| then display that to the screen, you don't require any of this
| stack. If you want a high-level library that will draw a circle
| for you, you can use something like Skia or Cairo which will
| wrap this for you into a C API.
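|
| For example, with Cairo's Python bindings (pycairo; a minimal
| sketch, the sizes and output filename are arbitrary) drawing a
| circle is just a few lines:
|
|     import math
|     import cairo  # pycairo: Python bindings over the Cairo C API
|
|     WIDTH, HEIGHT = 256, 256
|     surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, WIDTH, HEIGHT)
|     ctx = cairo.Context(surface)
|
|     ctx.set_source_rgb(0.2, 0.4, 1.0)                   # fill colour
|     ctx.arc(WIDTH / 2, HEIGHT / 2, 80, 0, 2 * math.pi)  # centre, radius
|     ctx.fill()
|
|     surface.write_to_png("circle.png")                  # no GPU involved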
|
| GPUs solve problems of much larger scale, and so the stack has
| evolved over time to meet the needs of those applications. All
| this power and corresponding complexity has been introduced for
| a reason, I assure you.
| zozbot234 wrote:
| > 2D graphics is very different and mostly doesn't require
| the GPU's assistance.
|
| There are interesting tasks in 2D graphics that may easily
| warrant GPU acceleration, such as smooth animation of
| composited surfaces (even something as simple as displaying a
| mouse pointer falls under this, and is accelerated in most
| systems) or rendering complex shapes including text.
| floatboth wrote:
| > Mesa 3D - driver library that provides open source driver
| (mostly copy of AMD-VLK)
|
| Wrong, wrong, very wrong. No copy here. Mesa's RADV was developed
| completely independently from AMD, in fact it _predates_ the
| public release of AMDVLK. It's also possibly the best Vulkan
| implementation out there. Valve invested heavily in the ACO
| compiler backend, so it compiles shaders both very well and very
| quickly.
| MayeulC wrote:
| Not to mention that there is _much_ more to Mesa than RADV.
| Gallium3D and state trackers, RadeonSI and a few other drivers
| (Apple M1, Qualcomm Adreno, Broadcom, Mali), to name only a few.
|
| The status of some drivers is tracked here:
| https://mesamatrix.net/
| Jasper_ wrote:
| Note that a lot of the Gallium3D architecture is designed for
| the stateful APIs like OpenGL. Very little of it is used in
| the Vulkan drivers.
| mhh__ wrote:
| (Thank you valve!)
| MayeulC wrote:
| Valve is contributing quite a bit, with their ACO compiler
| for instance, but the RADV driver did not in fact originate
| with them. We owe a great deal to David Airlie for that. He
| wrote a complete Vulkan driver _years_ before AMD released
| the source code for AMDVLK. Valve started contributing a few
| years in, IIRC.
|
| To AMD's credit, they did help with documentation, questions,
| and much of the work was built on top of what was there for
| RadeonSI (OpenGL): shader compiler back-end, etc.
|
| https://www.phoronix.com/scan.php?page=news_item&px=RADV-
| Rad... -> https://airlied.livejournal.com/81460.html
| pixelpoet wrote:
| This article says that AMD GPUs are vector in nature, but I think
| that stopped being the case with GCN; before that they had some
| weird vector stuff with 5 elements or something.
| dragontamer wrote:
| Wrong. GPUs are definitely vector (GCN in particular being 64 x
| 32-bit wide vectors).
|
| What you're confusing is "Terascale" (aka: the 6xxx series from
| the 00s), which was VLIW _AND_ SIMD. Which was... a little bit
| too much going on and very difficult to optimize for. GCN made
| things way easier and more general purpose.
|
| Terascale was theoretically more GFLOPs than the first GCN
| processors (after all, VLIW unlocks a good amount of
| performance), but actually utilizing all those VLIW units per
| clock tick was a nightmare. It's hard enough to write good SIMD
| code as it is.
| Jasper_ wrote:
| Eh, it's sort of a naming conflict.
|
| There were two dimensions of "vector" here -- one is whether each
| thread used XMM-style packed registers, requiring the compiler to
| auto-vectorize operations within a thread. The old vector ISAs had
| things like dot product instructions. Those are now all gone,
| because it turns out that's hard to optimize for.
|
| We in the industry call scalar-for-each-thread ISAs "scalar
| ISAs" these days. For instance, Mali describes their
| transition from a vec4-for-each-thread-based ISA to a scalar-
| for-each-thread-based ISA as transitioning from "vector to
| scalar" [0].
|
| [0] Compare 4:26 and 6:29 in the "Mali GPU Family" video on
| this page https://developer.arm.com/solutions/graphics-and-
| gaming/arm-...
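|
| Roughly, the difference looks like this (a numpy sketch, not any
| real ISA; the arrays just stand in for registers):
|
|     import numpy as np
|
|     # "vec4 ISA": each thread owns a packed x/y/z/w register and
|     # the ISA has cross-lane instructions like a dot product.
|     a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
|     b = np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)
|     dot_vec4_style = np.dot(a, b)  # one instruction touches all 4 lanes
|
|     # "scalar ISA": each thread only ever sees plain floats; the same
|     # dot product is four scalar multiply-adds, and the hardware gets
|     # its width by running many threads side by side instead.
|     acc = np.float32(0.0)
|     for ax, bx in zip(a, b):
|         acc += ax * bx
|     assert np.isclose(acc, dot_vec4_style)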
| dragontamer wrote:
| Connection Machine / Thinking Machines Corporation use of
| "scalar-like SIMD" was around in the late 1980s and early
| 1990s. Look up *Lisp, the parallel language that Thinking
| Machines Corporation used. You can still find *Lisp manuals
| today. https://en.wikipedia.org/wiki/*Lisp
|
| Intel implemented SIMD using a technique they called SWAR:
| SIMD within a Register (aka: the 64-bit MMX registers),
| which eventually evolved into XMM (SSE), YMM (AVX), and ZMM
| (AVX512).
|
| Today's GPUs are programmed using the old 1980s style /
| Connection Machine *Lisp, which is clearly the source of
| inspiration for HLSL / GLSL / OpenCL / CUDA / etc. etc.
|
| Granted, today's GPUs are still SWAR (GCN's vGPR really is
| just a 64-wide x 32-bit register). We can see with
| languages like ispc (Intel's SPMD program compiler) that
| we can indeed implement a CUDA-like language on top of AVX
| / XMM registers.
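|
| A toy sketch of that SPMD-on-SIMD idea in numpy (the array
| stands in for one wide register; this is not real ispc or
| CUDA syntax):
|
|     import numpy as np
|
|     LANES = 8                               # pretend 8-wide register
|     x = np.arange(LANES, dtype=np.float32)  # one scalar per thread
|
|     # The program is written "per thread":
|     #   if (x > 3) y = x * 2; else y = x + 10;
|     # On SIMD hardware both sides execute and a lane mask selects.
|     mask = x > 3.0
|     y = np.where(mask, x * 2.0, x + 10.0)
|     print(y)   # [10. 11. 12. 13.  8. 10. 12. 14.]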
| Jasper_ wrote:
| I don't think any of it was based on *Lisp. Graphics
| mostly developed independently. And as such, we use
| different words and different terminology sometimes, like
| saying "scalar ISA" when we talk about designing ISAs
| that don't mandate cross-lane interaction in one thread.
| Sorry!
|
| As far as I know, the first paper covering using SIMD for
| graphics was Pixar's "Channel Processor", or Chap [0], in
| 1984. This later became one of the core implementation
| details of their REYES algorithm [1]. By 1989, they had
| their own RenderMan Shading Language [2], an improved
| version of Chap, and you can see the similarities from
| just the snippet at the start of the code. This is where
| Microsoft took major inspiration from when designing
| HLSL, and which NVIDIA then started to extend with their
| own Cg compiler. 3dlabs then copy/pasted this for GLSL.
|
| [0] http://www.cs.cmu.edu/afs/cs/academic/class/15869-f11
| /www/re... [1]
| https://graphics.pixar.com/library/Reyes/paper.pdf [2] ht
| tps://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.2
| 1...
| [deleted]
| atq2119 wrote:
| The terms are definitely overloaded and can differ subtly
| between vendors. Those "scalar-for-each-thread"
| instructions are prefixed v_ on AMD's GCN. The "v"
| doesn't stand for "scalar" ;)
| Jasper_ wrote:
| Yeah, certainly having "V"GPRs in a "scalar ISA" is a bit
| off. In graphics parlance, the term really just refers to
| the lack of things like dot products and matrix multiply
| instructions where the compiler has to line numbers up into
| "per-thread" vectors for the instructions to work out.
| Turns out you can save a lot of work by not having that
| kind of ISA, and just moving to ones where you operate on
| number registers (which are then "vector"'d to be thread-
| count wide).
|
| Wouldn't be graphics unless we overloaded a piece of
| terminology 20 times.
| [deleted]
| jimmyvalmer wrote:
| > GTX version of Turing architecture (1660, 1650) has no Tensor
| cores, instead it has freely available FP16 units!
|
| It's always a bad sign when an author's exclamation has all the
| surprise factor of a tax code. As a previous poster said,
| "gentle" is relative.
| sanketsarang wrote:
| On the same basis, it would also help if you could provide a
| comparison between GPUs commonly used for ML. Tesla k80, P100,
| T4, V100 and A100. How has the architecture evolved to make the
| A100 significantly faster? Is it just the 80GB RAM, or is there
| more to it from an architecture standpoint?
| einpoklum wrote:
| > How has the architecture evolved to make the A100
| significantly faster?
|
| Oh, very much so. By way more than an order of magnitude. For a
| deeper read, have a look at the "architecture white papers" for
| Kepler, Pascal, Volta/Turing, and Ampere:
|
| https://duckduckgo.com/?t=ffab&q=NVIDIA+architecture+white+p...
|
| or check out the archive of NVIDIA's parallel4all blog ... hmm,
| that's weird, it seems like they've retired it. They used to
| have really good blog posts explaining what's new in each
| architecture.
|
| You could also have a look here:
|
| https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
|
| for the table of various numeric sizes and limits which change
| with different architectures. But that's not a very useful
| resource in and of itself.
| M277 wrote:
| You may find this[0] helpful (note -- download link to a .PDF).
| It's the GA100 whitepaper.
|
| [0]: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-
| Cent...
| kkielhofner wrote:
| As a starter, the T4 is heavily optimized for low power
| consumption on inference tasks. IIRC it doesn't even require
| additional power beyond what the PCIe bus can provide, but it is
| basically useless for training, unlike the others.
| wpietri wrote:
| Have folks seen good Linux tools for actually
| monitoring/profiling the GPU's inner workings? Soon I'll need to
| scale running ML models. For CPUs, I have a whole bag of tricks
| for examining and monitoring performance. But for GPUs, I feel
| like a caveman.
| corysama wrote:
| https://developer.nvidia.com/nsight-compute
| not-elite wrote:
| In nvidia land, `nvidia-smi` is like `top` for your gpus. If
| you're running compiled CUDA, `nvprof` is very useful. But I'm
| not sure how much work it would take to profile something like
| a pytorch model.
|
| https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-h...
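|
| For a pytorch model specifically, the built-in profiler gets you
| pretty far (a minimal sketch assuming a recent PyTorch with
| torch.profiler and a CUDA device; the toy model is arbitrary):
|
|     import torch
|     from torch.profiler import profile, ProfilerActivity
|
|     model = torch.nn.Linear(1024, 1024).cuda()
|     x = torch.randn(64, 1024, device="cuda")
|
|     with profile(activities=[ProfilerActivity.CPU,
|                              ProfilerActivity.CUDA]) as prof:
|         for _ in range(10):
|             y = model(x)
|         torch.cuda.synchronize()  # make sure GPU work is captured
|
|     print(prof.key_averages().table(sort_by="cuda_time_total",
|                                     row_limit=10))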
| arcanus wrote:
| In AMD land, `rocm-smi` is like `top` for your gpus. Very
| similar in capabilities and functionality to NVIDIA.
| zetazzed wrote:
| For NVIDIA GPUs, Nsight Systems is wildly detailed and has both
| GUI and CLI options: https://developer.nvidia.com/nsight-
| systems
|
| For DL specifically, this article covers a couple of options
| that actually plug into the framework:
| https://developer.nvidia.com/blog/profiling-and-optimizing-d...
|
| nvidia-smi is the core tool most folks use for quick "top"-like
| output, but there is also an htop equivalent:
| https://github.com/shunk031/nvhtop
|
| A lot of other tools are built on top of the low-level NVML
| library (https://developer.nvidia.com/nvidia-management-
| library-nvml). There are also Python NVML bindings if you need
| to write your own monitoring tools.
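|
| A tiny example of rolling your own with those bindings (a sketch
| assuming the pynvml package; field names follow the NVML C API):
|
|     import pynvml
|
|     pynvml.nvmlInit()
|     try:
|         for i in range(pynvml.nvmlDeviceGetCount()):
|             handle = pynvml.nvmlDeviceGetHandleByIndex(i)
|             util = pynvml.nvmlDeviceGetUtilizationRates(handle)
|             mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
|             print(f"GPU {i}: {util.gpu}% busy, "
|                   f"{mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB")
|     finally:
|         pynvml.nvmlShutdown()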
| danielmorozoff wrote:
| I personally like gpustat -- it's an nvidia-smi wrapper but it
| has colors...
|
| They also, I guess, now have a web server plugged into it, which
| seems pretty cool:
|
| https://github.com/wookayin/gpustat
| https://github.com/wookayin/gpustat-web
| dragontamer wrote:
| For another gentle introduction to GPU architecture, I like "The
| Graphics Codex", specifically the chapter on Parallel
| Architectures:
| https://graphicscodex.courses.nvidia.com/app.html?page=_rn_p...
| corysama wrote:
| Also https://fgiesen.wordpress.com/2011/07/09/a-trip-through-
| the-...
| M277 wrote:
| Thanks a lot, always enjoy your posts on here and r/hardware!
| Do you have a hardcore introduction with even more detail /
| perhaps even with examples of implementations? :)
|
| I find white papers quite good (although I admit there are many
| things I don't understand yet and constantly have to look up),
| but even these sometimes feel a bit general.
| ww520 wrote:
| Parallelism across the computing units in a GPU permeates the
| entire computing model. For the longest time I didn't get how
| partial derivative instructions like dFdx/dFdy/ddx/ddy work. None
| of the docs helped. These instructions take in a number and
| return its partial derivative: just a generic number, nothing to
| do with graphics, geometry, or any function. The number could
| have been the dollar amount of a mortgage and its partial
| derivative would be returned.
|
| It turns out these functions are tied to how the GPU architecture
| runs the computing units in parallel. The computing units are
| arranged to run in parallel in a geometric grid according to the
| input data model. The computing units run the same program in
| LOCK STEP in parallel (well, at least in lock step upon arriving
| at the dFdx/dFdy instructions). They also know their neighbors in
| the grid. When a computing unit encounters a dFdx instruction, it
| reaches across and grabs the values its neighbors passed to their
| own dFdx instructions. All the neighbors arrive at dFdx at the
| same time with their input values ready. With the neighbors'
| numbers and its own number as the midpoint, it can compute the
| partial derivative from the slope between those values.
| anonymous532 wrote:
| I don't sign in often, thank you for this amazing reveal.
| bigdict wrote:
| So the function F is implicitly defined to have value F(x,y) =
| v, where x and y are coordinates of the core, and v is the
| input value to the dFdx/dFdy instruction? Then the output of
| the instruction running on the x,y core (take dFdx for example)
| is supposedly equal to (F(x+1,y)-F(x-1,y))/2?
| ww520 wrote:
| Yes. That's the idea. Though I'm not exactly sure how the GPU
| uses the neighboring values to interpolate the partial
| derivative. It's probably GPU dependent.
| Jasper_ wrote:
| It's defined to work on a 2x2 grid called a quad. If you're
| the top left pixel, then dFdX(v) = v(x+1,y) - v. If you're
| the top right pixel, then it's dFdX(v) = v(x-1,y) - v
| corysama wrote:
| Pretty much all of them work in 2x2 quads. DyDx is
| calculated within a quad. But, not across them. This is an
| imperfect approximation to the derivative that needs to be
| accounted for. But, it's useful and very cheap and
| trivially easy.
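|
| A numpy sketch of that quad rule (illustrative only, not any
| particular GPU's exact convention; here both pixels in a quad
| row share one coarse difference):
|
|     import numpy as np
|
|     # One shaded value per pixel; a 2x4 block of pixels = two quads.
|     F = np.array([[0.0, 1.0, 4.0, 9.0],
|                   [1.0, 2.0, 5.0, 10.0]], dtype=np.float32)
|
|     left = F[:, 0::2]       # left pixel of each quad
|     right = F[:, 1::2]      # right pixel of each quad
|     coarse = right - left   # one horizontal difference per quad
|
|     dFdx = np.repeat(coarse, 2, axis=1)  # both pixels get the same slope
|     print(dFdx)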
| [deleted]
| bla3 wrote:
| That sounds like a useful mental model. But it can't be quite
| right, can it? There aren't enough cores to do _all_ pixels in
| parallel, so how is that handled? Does it render tiles and
| compute all edges several times for this?
| ww520 wrote:
| That's correct. There aren't enough physical cores to do all
| cells in parallel. A 1000x1000 grid would have 1M cells.
| These are virtual computing units. Conceptually there are 1M
| virtual computing units, one per cell. The thousands of
| physical cores take on the cells one batch at a time until
| all of the cells are run. In fact multiple cores can run the
| code of one cell in parallel. E.g. a for-loop is unrolled and
| each core takes on a different branch of the for-loop. The
| scheduling of the cores to virtual computing units is like
| the OS scheduling CPUs to different processes.
|
| Most instructions have no inter-dependency between cells.
| These instructions can be executed by a core as fast as it
| can on one computing unit until an inter-dependent
| instruction like dFdx is encountered that requires syncing.
| The computing unit is put in a wait state while the core
| moves on to another one. When all the computing units are
| sync'ed up at the dFdx instruction, they're then executed by
| the cores batch by batch.
| Jasper_ wrote:
| No, multiple cores can't run the code of the same cell in
| parallel. The key here is that these cores run in lockstep,
| so the program counter is the exact same for all
| cells being run at once. In practice, each core supports 32
| (or sometimes 64) cells at a time. There is also no syncing
| required for dFdX! Since the quads ran in lockstep, the
| results are guaranteed to be computed at the same time. In
| practice, these are all computed in vector SIMD registers,
| and a dFdX simply pulls a register from another SIMD lane.
|
| If you want to execute more tasks than this, throw more
| cores at it.
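|
| A small numpy sketch of that scheduling (a simplification:
| fixed 32-wide waves, and the edge lane just wraps around
| instead of pairing inside a quad):
|
|     import numpy as np
|
|     WAVE = 32                                       # lanes in lockstep
|     values = np.arange(128, dtype=np.float32) ** 2  # one value per cell
|
|     for start in range(0, values.size, WAVE):
|         lanes = values[start:start + WAVE]  # wide register for this wave
|         neighbour = np.roll(lanes, -1)      # lane i reads lane i+1; no
|                                             # sync needed, the value is
|                                             # already sitting in the lane
|         dfdx = neighbour - lanes            # per-lane forward difference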
___________________________________________________________________
(page generated 2021-10-02 23:00 UTC)