[HN Gopher] How GPU Computing Works [video]
___________________________________________________________________
How GPU Computing Works [video]
Author : quick_brown_fox
Score : 104 points
Date : 2022-07-11 08:08 UTC (1 day ago)
(HTM) web link (www.nvidia.com)
(TXT) w3m dump (www.nvidia.com)
| wrs wrote:
| The Wheel of Reincarnation continues. [0] (Though it's sort of
| turning the other way, this time around?)
|
| [0] http://www.catb.org/jargon/html/W/wheel-of-
| reincarnation.htm...
| dragontamer wrote:
| The opposite.
|
| GPUs became more general purpose. The old vector processors of
| the 1980s served as inspiration. Even in 90s commercials it's
| obvious that the GPU / SIMD-compute similarities were all over
| the 3dfx cards.
|
| In the 00s, GPUs became flexible enough to execute arbitrary
| code for vertex effects and pixel effects. Truly general
| purpose, arbitrary code, albeit in the SIMD methodology.
|
| --------
|
| Today, your Direct2D windowing code in Windows largely runs on
| the GPU, having been migrated away from the CPU. Your video
| decoders (YouTube) and video games (shaders) are all GPU code
| as well.
|
| GPUs have proven themselves to be general purpose processors,
| albeit with a strange SIMD model of compute rather than the
| traditional von Neumann design.
|
| We're in a cycle where more and more code is moving away from
| CPUs onto GPUs, and permanently staying in GPU space. This is
| the opposite of the wheel-of-reincarnation effect: CPUs may
| have gotten faster, but GPUs have become faster at a higher
| rate, and general purpose enough that more and more general
| code can run on them.
|
| Code that has successfully been ported over (ex: TensorFlow)
| may never return to the CPU side. SIMD compute is simply a
| superior underlying model for a large set of applications.
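|
| To make the "arbitrary code, in the SIMD methodology" point
| concrete, here is a minimal CUDA sketch (purely illustrative):
| an ordinary-looking scalar function that the hardware runs
| across thousands of lanes at once.
|
|     // saxpy.cu -- y = a*x + y, one element per GPU thread
|     __global__ void saxpy(int n, float a, const float *x, float *y) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's lane
|         if (i < n)                                      // excess lanes are masked off
|             y[i] = a * x[i] + y[i];
|     }
|
|     // launch: one thread per element, 256 threads per block
|     // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);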
| boberoni wrote:
| _> (Almost) Nobody (really) cares about flops ...because we
| should really be caring about memory bandwidth_
|
| In university, I was shocked to learn in a database class that
| CPU costs are dwarfed by the I/O costs in the memory hierarchy.
| This was after spending a whole year on data structures and
| algorithms, where we obsessed over runtime complexity and # of
| operations.
|
| It seems that the low-hanging fruit of optimization is all gone.
| New innovations for performance will have to happen in
| transporting data.
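|
| A back-of-the-envelope sketch of the same point (the numbers
| are illustrative): a vector add moves ~12 bytes of DRAM traffic
| per single flop, so a GPU with ~1.5 TB/s of memory bandwidth
| tops out around ~125 Gflop/s on it, no matter how many Tflop/s
| the ALUs could theoretically deliver.
|
|     // add.cu -- memory-bound kernel: ~12 bytes moved per flop
|     __global__ void vec_add(int n, const float *a,
|                             const float *b, float *c) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n)
|             c[i] = a[i] + b[i];  // 1 flop, 12 bytes of traffic
|     }
|     // arithmetic intensity = 1 flop / 12 bytes ~ 0.08 flop/byte
|     // attainable rate ~ bandwidth * 0.08 << peak ALU throughput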
| vladf wrote:
| Compression converts I/O bottlenecks to compute ones again.
| Sherl wrote:
| There is an entire field of parallel algorithms that builds on
| sequential algorithms to overcome some of these issues. So no,
| it's not wasted: you would apply that knowledge to build
| parallel algorithms.
|
| In my school we had projects combining CUDA and OpenMP in some
| cases and MPI+OpenMP in a few others. I think the bottleneck is
| always going to be there; it's just a question of how big it is
| and how you deal with it, both in hardware and from the
| software side.
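|
| For the curious, the CUDA+OpenMP combination usually looks
| roughly like this (a sketch only; the kernel "work" and the
| one-GPU-per-host-thread split are assumptions, and details
| vary per project):
|
|     // hybrid.cu -- build with: nvcc -Xcompiler -fopenmp hybrid.cu
|     #include <omp.h>
|     #include <cuda_runtime.h>
|
|     __global__ void work(float *data, int n) { /* per-GPU kernel */ }
|
|     void run(float **chunks, int len, int num_gpus) {
|         #pragma omp parallel for num_threads(num_gpus)
|         for (int g = 0; g < num_gpus; g++) {
|             cudaSetDevice(g);            // one CPU thread per GPU
|             float *d;
|             cudaMalloc(&d, len * sizeof(float));
|             cudaMemcpy(d, chunks[g], len * sizeof(float),
|                        cudaMemcpyHostToDevice);
|             work<<<(len + 255) / 256, 256>>>(d, len);
|             cudaMemcpy(chunks[g], d, len * sizeof(float),
|                        cudaMemcpyDeviceToHost);
|             cudaFree(d);
|         }
|     }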
| malnourish wrote:
| At the risk of being flippant, I hope you learned about space
| complexity, and about how the algorithms and data structures
| you use impact performance via the cache.
| Lichtso wrote:
| If I understand correctly:
|
| CPUs do minimize latency by:
|
| - Register renaming
|
| - Out of order execution
|
| - Branch prediction
|
| - Speculative execution
|
| They should not be oversubscribed, as they have to context
| switch by storing/loading registers, and the cache coherence
| protocols scale badly with more threads.
|
| GPUs on the other hand maximize throughput by:
|
| - A lot more memory bandwidth
|
| - Smaller and slower cores, but more of them
|
| - Ultra-threading (the massively oversubscribed hyper-threading
| the video mentions)
|
| - Context switching between wavefronts (basically the equivalent
| of a CPU thread) just shifts the offset into the huge register
| file (no store/load needed)
|
| The one area in which CPUs are getting closer to GPUs is SIMD /
| SIMT. CPUs used to only be able to apply one instruction to a
| whole vector of elements, without masking (SIMD). With ARM SVE
| and x86 AVX-512 they can now (like GPUs) mask out individual
| lanes (SIMT) for ALU operations and memory operations (gather
| loads / scatter stores).
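|
| On the GPU side, that per-lane masking happens implicitly
| whenever a kernel branches on a per-thread value. A minimal
| CUDA sketch (hypothetical kernel): the warp executes both
| sides of the branch with inactive lanes masked off, much like
| AVX-512's k-registers do explicitly on the CPU.
|
|     __global__ void relu_scale(float *x, int n) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) {
|             if (x[i] > 0.0f)   // per-lane condition
|                 x[i] *= 2.0f;  // taken lanes active here
|             else
|                 x[i] = 0.0f;   // the other lanes active here
|         }
|     }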
| oddity wrote:
| The difference is much more nuanced than this. A modern GPU can
| (and probably does) do most of what you've listed for a CPU.
| Speculative execution and branch prediction are a bit less
| likely to be invested in (because GPUs don't need them as much,
| thanks to oversubscription), but that's increasingly true of
| high-efficiency CPU cores as well. The difference (at a
| category vs category level and not specific microarch) is
| mostly a matter of tuning for particular workloads. I'm
| increasingly souring on SIMD/SIMT being a useful distinction
| now that bleeding-edge CPUs are widening in the microarch and
| bleeding-edge GPUs are getting better at handling thread
| divergence in the microarch. There is a difference, certainly,
| but it's difficult to describe in a few bullet points.
|
| GPUs are more likely to have more exotic features than you'll
| see on a CPU to deal with things like thread coordination and
| cache coherence, but there's nothing fundamentally stopping
| CPUs from adding that (or wanting that) as well.
| Lichtso wrote:
| > GPUs are getting better at handling thread divergence in
| the microarch
|
| That is an interesting point: how does that work (especially
| with the dynamics of ray tracing)? Do they recombine
| under-utilized wavefronts or something?
| dragontamer wrote:
| I'm not aware of anything that improves thread-divergence
| handling. NVidia's most recent GPUs have superscalar
| operations, which is a trick from CPU-land (multiple
| pipelines operating on 2 or more instructions per clock
| tick). NVidia has an integer pipeline and a floating-point
| pipeline, and both can operate simultaneously (ex: in
| for(int i=0; i<100; i++) x *= blah; the "i++" is integer,
| while the "x *= blah" is floating point, so the two can
| issue in parallel).
|
| CPUs have extremely flexible pipelines: Intel's pipelines 0
| and 1 can do basically anything, pipeline 5 can do most
| stuff but is missing division IIRC (and a few other
| things), and loads/stores are done on some other pipelines,
| etc.
|
| Apple's and AMD's CPU pipelines are more symmetrical and
| uniform.
|
| NVidia GPUs are the only superscalar ones I can think of,
| aside from AMD GPU's scalar vs vector split (which isn't
| really the "superscalar" operation I'm trying to describe).
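|
| A kernel-shaped sketch of that loop (hypothetical, just to
| show the pipeline mix): the loop counter and compare go down
| the integer pipeline while the multiply goes down the
| floating-point one.
|
|     __global__ void scale_many(float *x, float blah, int n) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) {
|             float v = x[i];
|             for (int k = 0; k < 100; k++)  // int pipeline: k++, k<100
|                 v *= blah;                 // fp pipeline: multiply
|             x[i] = v;
|         }
|     }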
| TomVDB wrote:
| Starting with Volta, Nvidia GPUs have a forward progress
| guarantee, preventing lockups when there's thread
| divergence.
|
| That doesn't improve the performance of a well-behaved,
| well-written compute shader. But avoiding hard hangs IMO
| deserves the label "improved thread divergence handling."
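|
| The classic illustration is a spinlock contended within a
| warp. A sketch (the lock variable and kernel are made up):
| pre-Volta, the warp-wide execution mask could starve the lane
| that holds the lock; with Volta's per-thread program counters
| and forward-progress guarantee, this pattern can complete.
|
|     __device__ int lock = 0;
|
|     __global__ void contended(int *counter) {
|         while (atomicCAS(&lock, 0, 1) != 0) { }  // spin to acquire
|         *counter += 1;                           // critical section
|         __threadfence();
|         atomicExch(&lock, 0);                    // release
|     }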
| jjoonathan wrote:
| Aren't warps still 32 threads, even though the number of
| threads is skyrocketing, effectively making them
| proportionately finer-grained? Are things different in AMD
| land?
| JonChesterfield wrote:
| Slightly, the older tech is 64 threads/lanes per
| warp/wavefront. Newer ones are 32 by default but 64 if
| desired.
|
| Bigger differences are the per-thread instruction counter
| since Volta on Nvidia (which I think is a terrible
| feature), and that forward progress guarantees are
| stronger on Nvidia (those are _really_ helpful but
| expensive).
| dragontamer wrote:
| > Slightly, the older tech is 64 threads/lanes per
| warp/wavefront. Newer ones are 32 by default but 64 if
| desired.
|
| AMD GCN was 64 threads/wavefront. NVidia has always been 32
| threads/warp.
|
| AMD's newest consumer cards, RDNA and RDNA2, are 32
| threads/wavefront. However, GCN lives on in CDNA (the MI200
| supercomputer chips), with a 64 threads/wavefront
| architecture.
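|
| If you'd rather not hard-code 32 vs 64, the warp/wavefront
| width is queryable at runtime. A minimal CUDA sketch (AMD's
| HIP exposes an equivalent field):
|
|     #include <cstdio>
|     #include <cuda_runtime.h>
|
|     int main() {
|         cudaDeviceProp prop;
|         cudaGetDeviceProperties(&prop, 0);  // device 0
|         printf("warp size: %d\n", prop.warpSize);
|         return 0;
|     }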
| TomVDB wrote:
| Nvidia GPUs were 32 threads per warp right from the
| start of CUDA with the 8800 GTX.
|
| > which I think is a terrible feature <> those are
| _really_ helpful but expensive
|
| Guaranteed forward progress is a direct consequence of
| having an instruction counter per thread???
|
| Or so I thought. How else would an SM be able to know the
| PC of a group of threads that wasn't stuck?
| dragontamer wrote:
| > They should not be over subscribed as they have to context
| switch by storing / loading registers and the cache coherence
| protocols scale badly with more threads.
|
| CPUs can be oversubscribed if so designed.
|
| POWER9 had SMT4 and SMT8 (4 threads per core and 8 threads per
| core, respectively). SMT8 is basically GPU-level throughput /
| threading. SMT4 is probably a better medium between x86
| (2 threads per core) and the craziness that is GPUs.
|
| > - Register renaming
|
| > - Out of order execution
|
| These are one and the same. Tomasulo's Algorithm is the key
| thing to understand here:
| https://en.wikipedia.org/wiki/Tomasulo_algorithm
|
| -------
|
| I'd describe modern CPUs as pipelined (instructions broken into
| stages to help parallelize fetch/decode/execute), superscalar
| (multiple execution pipelines in parallel), out-of-order
| (rename / retirement units with Tomasulo's algorithm), branch-
| predicted, speculative processors with virtual memory and MESI-
| like protocols to enable cache coherence / multicore.
|
| -------
|
| I basically agree with your post. Just clarifying a few points.
| My mental model of CPUs / GPUs seems to match yours.
| zozbot234 wrote:
| SMT8 is not new; the UltraSPARC T2 ("Niagara 2") processor
| also supports it, as do its successors in the UltraSPARC
| T/SPARC T series.
| the_optimist wrote:
| Except that AVX-512 is being removed or disabled on newer
| chips.
| [deleted]
| einpoklum wrote:
| This seems like this year's version of the talk given last year,
| which was just recently posted here on HN as "How CUDA
| Programming works":
|
| https://news.ycombinator.com/item?id=31983460
| oifjsidjf wrote:
| Here is another interesting series of articles, which describes
| in more detail how GPUs draw:
|
| https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-...
| [deleted]
___________________________________________________________________
(page generated 2022-07-12 23:01 UTC)