[HN Gopher] How GPU Computing Works [video]
       ___________________________________________________________________
        
       How GPU Computing Works [video]
        
       Author : quick_brown_fox
       Score  : 104 points
        Date   : 2022-07-11 08:08 UTC (1 day ago)
        
 (HTM) web link (www.nvidia.com)
 (TXT) w3m dump (www.nvidia.com)
        
       | wrs wrote:
       | The Wheel of Reincarnation continues. [0] (Though it's sort of
       | turning the other way, this time around?)
       | 
       | [0] http://www.catb.org/jargon/html/W/wheel-of-
       | reincarnation.htm...
        
         | dragontamer wrote:
         | The opposite.
         | 
          | GPUs became more general purpose. The old vector processors
          | of the 1980s served as inspiration, and even in 90s
          | commercials for 3dfx cards it's obvious that the GPU /
          | SIMD-compute similarities were all over the place.
          | 
          | In the 00s, GPUs became flexible enough to execute arbitrary
          | code for vertex effects and pixel effects (shaders). Truly
          | general purpose, arbitrary code, albeit in the SIMD
          | methodology.
         | 
         | --------
         | 
          | Today, the Direct2D windowing code in Windows largely runs on
          | the GPU, having been migrated away from the CPU. In fact, your
          | video decoders (YouTube) and video games (shaders) are all GPU
          | code.
         | 
         | GPUs have proven themselves to be a general purpose processor,
         | albeit with a strange SIMD-model of compute rather than the
         | traditional Von Neumann design.
         | 
          | We're in a cycle where more and more code is moving away from
          | CPUs onto GPUs, and permanently staying in GPU space. This is
          | the opposite of the wheel of reincarnation: CPUs may have
          | gotten faster, but GPUs have gotten faster at a higher rate,
          | and they have become general purpose enough that ever more
          | general code can run on them.
          | 
          | Code successfully ported over (ex: Tensorflow) may never
          | return to the CPU side. SIMD compute is simply a superior
          | underlying model for a large set of applications.
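          | 
          | To make "arbitrary code in the SIMD methodology" concrete,
          | here is a minimal illustrative CUDA kernel (mine, not from
          | the talk): each logical thread runs ordinary scalar-looking
          | code, and the hardware executes 32 such threads per warp in
          | lockstep.
          | 
          |     // scalar-looking per-element code; warps of 32 threads
          |     // run it in lockstep, masking off divergent lanes
          |     __global__ void clamp_scale(float *x, int n,
          |                                 float lo, float hi, float s) {
          |         int i = blockIdx.x * blockDim.x + threadIdx.x;
          |         if (i >= n) return;  // out-of-range lanes masked off
          |         float v = x[i] * s;
          |         x[i] = v < lo ? lo : (v > hi ? hi : v);
          |     }
          |     // launch: clamp_scale<<<(n + 255) / 256, 256>>>(...);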
        
       | boberoni wrote:
       | _> (Almost) Nobody (really) cares about flops ...because we
       | should really be caring about memory bandwidth_
       | 
       | In university, I was shocked to learn in a database class that
       | CPU costs are dwarfed by the I/O costs in the memory hierarchy.
       | This was after spending a whole year on data structures and
       | algorithms, where we obsessed over runtime complexity and # of
       | operations.
       | 
        | It seems that the low-hanging fruit of optimization is all gone.
        | New performance gains will have to come from moving data around
        | more efficiently.
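        | 
        | A back-of-the-envelope sketch of why (my numbers, purely
        | illustrative): a SAXPY-style kernel does 2 flops per element but
        | moves 12 bytes, so on a card with ~1 TB/s of memory bandwidth it
        | tops out near ~170 GFLOP/s no matter how many teraflops the ALUs
        | advertise.
        | 
        |     // y[i] = a*x[i] + y[i]: 2 flops vs. 12 bytes moved
        |     // (read x, read y, write y) => ~0.17 flop/byte, so at
        |     // ~1 TB/s the kernel is bandwidth-bound at ~170 GFLOP/s.
        |     __global__ void saxpy(int n, float a,
        |                           const float *x, float *y) {
        |         int i = blockIdx.x * blockDim.x + threadIdx.x;
        |         if (i < n) y[i] = a * x[i] + y[i];
        |     }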
        
         | vladf wrote:
         | Compression converts I/O bottlenecks to compute ones again.
        
         | Sherl wrote:
         | There is an entire field of parallel algorithms which makes use
         | of sequential algorithms to overcome some of these issues. So
         | no it's not wasted. You would apply the knowledge to build
         | parallel algorithms.
         | 
         | There are projects in my school where we implemented a
         | combination of CUDA and openMP in some and MPI+OpenMP in a few.
         | I think the bottleneck is always gonna be there, its just how
         | much and how you deal with it in hardware from the software
         | front.
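          | 
          | For a flavour of what the hybrid pattern looks like (a toy
          | sketch, not one of the actual assignments): OpenMP splits the
          | data across host threads and each thread drives its own CUDA
          | stream.
          | 
          |     #include <omp.h>
          |     #include <cuda_runtime.h>
          | 
          |     __global__ void scale(float *x, int n, float a) {
          |         int i = blockIdx.x * blockDim.x + threadIdx.x;
          |         if (i < n) x[i] *= a;
          |     }
          | 
          |     int main() {
          |         const int n = 1 << 24, parts = 4, chunk = n / parts;
          |         float *x;
          |         cudaMallocManaged(&x, n * sizeof(float));
          |         for (int i = 0; i < n; ++i) x[i] = 1.0f;
          | 
          |         // each OpenMP thread drives its own CUDA stream
          |         #pragma omp parallel for num_threads(parts)
          |         for (int p = 0; p < parts; ++p) {
          |             cudaStream_t s;
          |             cudaStreamCreate(&s);
          |             scale<<<(chunk + 255) / 256, 256, 0, s>>>(
          |                 x + p * chunk, chunk, 2.0f);
          |             cudaStreamSynchronize(s);
          |             cudaStreamDestroy(s);
          |         }
          |         cudaFree(x);
          |         return 0;
          |     }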
        
         | malnourish wrote:
          | At the risk of being flippant, I hope you also learned about
          | space complexity, and about how the algorithms and data
          | structures you choose affect performance through the cache.
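          | 
          | The classic illustration (toy code, not from any course):
          | summing a matrix row-by-row vs. column-by-column does exactly
          | the same number of additions, but the column-order walk
          | strides through memory and misses cache on nearly every
          | access.
          | 
          |     // same O(n*n) additions; only the traversal order differs
          |     float sum_rows(const float *m, int n) {  // cache-friendly
          |         float s = 0;
          |         for (int r = 0; r < n; ++r)
          |             for (int c = 0; c < n; ++c)
          |                 s += m[r * n + c];   // sequential accesses
          |         return s;
          |     }
          | 
          |     float sum_cols(const float *m, int n) {  // cache-hostile
          |         float s = 0;
          |         for (int c = 0; c < n; ++c)
          |             for (int r = 0; r < n; ++r)
          |                 s += m[r * n + c];   // strided by n floats
          |         return s;
          |     }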
        
       | Lichtso wrote:
       | If I understand correctly:
       | 
        | CPUs minimize latency by:
       | 
       | - Register renaming
       | 
       | - Out of order execution
       | 
       | - Branch prediction
       | 
       | - Speculative execution
       | 
        | They should not be oversubscribed, as they have to context
        | switch by storing / loading registers, and the cache coherence
        | protocols scale badly with more threads.
       | 
       | GPUs on the other hand maximize throughput by:
       | 
       | - A lot more memory bandwidth
       | 
       | - Smaller and slower cores, but more of them
       | 
        | - Ultra-threading (the massively oversubscribed hyper-threading
        | the video mentions)
       | 
        | - Context switching between wavefronts (basically the equivalent
        | of CPU threads) just shifts the offset into the huge register
        | file (no store and load)
       | 
        | The one area in which CPUs are getting closer to GPUs is SIMD /
        | SIMT. CPUs used to apply one instruction to a whole vector of
        | elements with no masking (SIMD). With ARM SVE and x86 AVX-512
        | they can now (like GPUs) mask out individual lanes (SIMT-style)
        | for ALU operations and for memory operations (gather load /
        | scatter store).
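        | 
        | A rough sketch (mine, untested) of what that per-lane masking
        | looks like with AVX-512 intrinsics: the condition "only where
        | x[i] > 0" becomes a 16-bit lane mask, much like GPU predication.
        | 
        |     #include <immintrin.h>
        | 
        |     // y[i] += a * x[i], but only where x[i] > 0
        |     // (remainder loop for the last n % 16 elements omitted)
        |     void masked_axpy(float *y, const float *x, float a, int n)
        |     {
        |         for (int i = 0; i + 16 <= n; i += 16) {
        |             __m512 vx = _mm512_loadu_ps(x + i);
        |             __m512 vy = _mm512_loadu_ps(y + i);
        |             __mmask16 m = _mm512_cmp_ps_mask(
        |                 vx, _mm512_setzero_ps(), _CMP_GT_OQ);
        |             // fused multiply-add only on the masked-in lanes
        |             __m512 r = _mm512_mask_fmadd_ps(
        |                 vx, m, _mm512_set1_ps(a), vy);
        |             _mm512_mask_storeu_ps(y + i, m, r);
        |         }
        |     }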
        
         | oddity wrote:
         | The difference is much more nuanced than this. A modern GPU can
         | (and probably does) do most of what you've listed for a CPU.
          | Speculative execution and branch prediction are a bit less
          | likely to be invested in (oversubscription means GPUs need
          | them less), but that's increasingly true of high-efficiency
          | CPU cores as well. The difference (at a
         | category vs category level and not specific microarch) is
         | mostly a matter of tuning for particular workloads. I'm
         | increasingly souring on SIMD/SIMT being a useful distinction
         | now that bleeding-edge CPUs are widening in the microarch and
         | bleeding-edge GPUs are getting better at handling thread
         | divergence in the microarch. There is a difference, certainly,
         | but it's difficult to describe in a few bullet points.
         | 
         | GPUs are more likely to have more exotic features than you'll
         | see on a CPU to deal with things like thread coordination and
         | cache coherence, but there's nothing fundamentally stopping
         | CPUs from adding that (or wanting that) as well.
        
           | Lichtso wrote:
           | > GPUs are getting better at handling thread divergence in
           | the microarch
           | 
            | That is an interesting point. How does that work (especially
            | with the dynamics of ray tracing)? Do they recombine under-
            | utilized wavefronts or something?
        
             | dragontamer wrote:
              | I'm not aware of anything that improves thread-divergence
              | handling. NVidia's most recent GPUs have superscalar
              | execution, which is a trick from CPU-land (multiple
              | pipelines executing 2 or more instructions per clock
              | tick). NVidia has an integer pipeline and a floating-point
              | pipeline, and both can operate simultaneously (ex: in
              | for(int i=0; i<100; i++) x *= blah; the "i++" is integer
              | work while the "x *= blah" is floating point, so the two
              | can issue in the same cycle).
              | 
              | CPUs have extremely flexible pipelines: Intel's pipelines
              | 0 and 1 can do basically anything, pipeline 5 can do most
              | things but is missing division IIRC (and a few others).
              | Loads/stores go down other pipelines, etc.
             | 
             | Apple's and AMD's CPU pipelines are more symmetrical and
             | uniform.
             | 
              | NVidia GPUs are the only superscalar GPUs I can think of,
              | aside from AMD GPUs' scalar vs vector split (which isn't
              | really the "superscalar" execution I'm trying to
              | describe).
        
               | TomVDB wrote:
                | Starting with Volta, Nvidia GPUs have a forward-progress
                | guarantee, preventing lockups when there's thread
                | divergence.
                | 
                | That doesn't improve the performance of a well-behaved
                | and well-written compute shader. But avoiding hard hangs
                | IMO deserves the label "improved thread divergence
                | handling."
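                | 
                | A toy example of the hazard (mine, not Nvidia's): pre-
                | Volta, a warp executes divergent branches one at a time,
                | so if the spinning branch gets scheduled first this can
                | hang forever; Volta's per-thread program counters let
                | the other lane run and set the flag.
                | 
                |     __global__ void handoff(volatile int *flag,
                |                             int *out) {
                |         if (threadIdx.x == 0) {
                |             while (*flag == 0) { }  // consumer spins
                |             *out = 42;
                |         } else if (threadIdx.x == 1) {
                |             *flag = 1;  // producer in the same warp
                |         }
                |     }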
        
             | jjoonathan wrote:
              | Aren't warps still 32 threads, even though the total
              | number of threads is skyrocketing, effectively making them
              | proportionately finer-grained? Are things different in AMD
              | land?
        
               | JonChesterfield wrote:
                | Slightly: the older tech is 64 threads/lanes per
                | warp/wavefront. Newer ones are 32 by default but 64 if
                | desired.
                | 
                | Bigger differences are the per-thread instruction
                | counter since Volta on Nvidia (which I think is a
                | terrible feature) and the fact that forward progress
                | guarantees are stronger on Nvidia (those are _really_
                | helpful but expensive).
        
               | dragontamer wrote:
               | > Slightly, the older tech is 64 threads/lanes per
               | warp/wavefront. Newer ones are 32 by default but 64 if
               | desired.
               | 
                | AMD GCN was 64 threads/wavefront. NVidia has always been
                | 32 threads/warp.
                | 
                | AMD's newest consumer cards, RDNA and RDNA2, are 32
                | threads/wavefront. However, GCN lives on in CDNA (the
                | MI200 supercomputer chips), which keeps the 64
                | threads/wavefront architecture.
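                | 
                | (If you want to check what a given card uses rather than
                | memorize product lines, it is queryable at runtime; a
                | minimal CUDA example:)
                | 
                |     #include <cstdio>
                |     #include <cuda_runtime.h>
                | 
                |     int main() {
                |         cudaDeviceProp p;
                |         cudaGetDeviceProperties(&p, 0);
                |         std::printf("warp size: %d\n", p.warpSize);
                |         return 0;
                |     }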
        
               | TomVDB wrote:
                | Nvidia GPUs were 32 threads per warp right from the
                | start of CUDA with the 8800 GTX.
                | 
                | > which I think is a terrible feature
                | 
                | > those are _really_ helpful but expensive
                | 
                | Isn't guaranteed forward progress a direct consequence
                | of having an instruction counter per thread?
                | 
                | Or so I thought. How else would an SM be able to know the
                | PC of a group of threads that wasn't stuck?
        
         | dragontamer wrote:
          | > They should not be oversubscribed, as they have to context
          | switch by storing / loading registers, and the cache coherence
          | protocols scale badly with more threads.
         | 
         | CPUs can be oversubscribed if so designed.
         | 
          | POWER9 had SMT4 and SMT8 (4 threads per core and 8 threads per
          | core, respectively). SMT8 is basically GPU-level throughput /
          | threading. SMT4 is probably a happy medium between x86
          | (2 threads per core) and the craziness that is GPUs.
         | 
         | > - Register renaming
         | 
         | > - Out of order execution
         | 
          | These two go hand in hand. Tomasulo's Algorithm is the key
          | thing to understand here:
         | https://en.wikipedia.org/wiki/Tomasulo_algorithm
         | 
         | -------
         | 
          | I'd describe modern CPUs as pipelined (instructions broken
          | into stages to parallelize fetch / decode / execute),
          | superscalar (multiple execution pipelines in parallel),
          | out-of-order (rename / retirement units via Tomasulo's
          | algorithm), branch-predicting, speculative processors with
          | virtual memory and MESI-like protocols for cache coherence
          | across cores.
         | 
         | -------
         | 
         | I basically agree with your post. Just clarifying a few points.
         | My mental model of CPUs / GPUs seems to match yours.
        
           | zozbot234 wrote:
            | SMT8 is not new; the UltraSPARC T2 ("Niagara 2") processor
            | also supports it, as do its successors in the UltraSPARC T /
            | SPARC T series.
        
         | the_optimist wrote:
          | Except that AVX-512 is being removed or disabled on newer
          | chips.
        
       | [deleted]
        
       | einpoklum wrote:
       | This seems like this year's version of the talk given last year,
       | which was just recently posted here on HN as "How CUDA
       | Programming works":
       | 
       | https://news.ycombinator.com/item?id=31983460
        
       | oifjsidjf wrote:
        | Here is another interesting series of articles which describes
        | how GPUs draw in more detail:
       | 
       | https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-...
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-07-12 23:01 UTC)