[HN Gopher] SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)
       ___________________________________________________________________
        
       SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)
        
       Author : shipp02
       Score  : 79 points
       Date   : 2024-06-10 06:05 UTC (1 day ago)
        
 (HTM) web link (yosefk.com)
 (TXT) w3m dump (yosefk.com)
        
       | narrowbyte wrote:
       | quite interesting framing. A couple things have changed since
       | 2011
       | 
       | - SIMD (at least intel's AVX512) does have usable gather/scatter,
       | so "Single instruction, multiple addresses" is no longer a
       | flexibility win for SIMT vs SIMD
       | 
       | - likewise for pervasive masking support and "Single instruction,
       | multiple flow paths"
       | 
       | In general, I think of SIMD as _more_ flexible than SIMT, not
       | less, in line with this other post
       | https://news.ycombinator.com/item?id=40625579. SIMT requires
       | staying more towards the "embarrassingly" parallel end of the
        | spectrum; SIMD can be applied in cases where understanding the
       | opportunity for parallelism is very non-trivial.
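        | 
        | For a concrete picture, a gather plus a masked op in AVX-512
        | intrinsics looks roughly like this (untested sketch; the
        | function and array names are made up):
        | 
        |   #include <immintrin.h>
        | 
        |   // gather a[idx[i]] for 16 lanes ("multiple addresses"),
        |   // add 1.0f only in lanes holding a negative value
        |   // ("multiple flow paths" via the mask), scatter back
        |   void gather_mask_scatter(float *a, const int *idx) {
        |     __m512i vi = _mm512_loadu_si512(idx);
        |     __m512 v = _mm512_i32gather_ps(vi, a, 4);
        |     __m512 zero = _mm512_setzero_ps();
        |     __mmask16 neg = _mm512_cmplt_ps_mask(v, zero);
        |     v = _mm512_mask_add_ps(v, neg, v, _mm512_set1_ps(1.0f));
        |     _mm512_i32scatter_ps(a, vi, v, 4);
        |   }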
        
         | majke wrote:
          | Last time I looked at Intel scatter/gather I got the
          | impression it only works for a very narrow use case, and
          | getting it to perform wasn't easy. Did I miss something?
        
           | narrowbyte wrote:
           | The post says, about SIMT / GPU programming, "This loss
           | results from the DRAM architecture quite directly, the GPU
           | being unable to do much about it - similarly to any other
           | processor."
           | 
           | I would say that for SIMD the situation is basically the
           | same. gather/scatter don't magically make the memory
           | hierarchy a non-issue, but they're no longer adding any
           | unnecessary pain on top.
        
             | yosefk wrote:
              | Barrel threaded machines like GPUs have an easier time
              | hiding the latency of bank conflict resolution when
              | gathering/scattering against local memory/cache than a
              | machine running a single instruction thread. So I'm
              | pretty sure they have a fundamental advantage when it
              | comes to the throughput of scatter/gather operations,
              | one that gets bigger with a larger number of vector
              | lanes.
        
         | raphlinus wrote:
         | One of the other major things that's changed is that Nvidia now
         | has independent thread scheduling (as of Volta, see [1]). That
         | allows things like individual threads to take locks, which is a
         | pretty big leap. Essentially, it allows you to program each
         | individual thread as if it's running a C++ program, but of
         | course you do have to think about the warp and block structure
         | if you want to optimize performance.
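          | 
          | A minimal sketch of the kind of thing this enables (made-up
          | kernel; on pre-Volta hardware this pattern could deadlock
          | when threads in the same warp diverge on the lock):
          | 
          |   __device__ int lock = 0;   // 0 = free, 1 = held
          | 
          |   __global__ void increment(int *counter) {
          |     // each thread takes the lock individually; with
          |     // independent thread scheduling the winner makes
          |     // progress even while its warp-mates keep spinning
          |     while (atomicCAS(&lock, 0, 1) != 0) { }
          |     *counter += 1;        // critical section
          |     __threadfence();      // publish before releasing
          |     atomicExch(&lock, 0); // release
          |   }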
         | 
         | I disagree that SIMT is only for embarrassingly parallel
         | problems. Both CUDA and compute shaders are now used for fairly
         | sophisticated data structures (including trees) and algorithms
         | (including sorting).
         | 
         | [1]: https://developer.nvidia.com/blog/inside-
         | volta/#independent_...
        
           | yosefk wrote:
            | It's important that GPU threads support locking and
            | control flow divergence, and I don't want to minimize
            | that, but threads within a warp diverging still badly
            | loses throughput, so I don't think the situation is
            | fundamentally different in terms of what the machine is
            | good/bad at. We're just closer to the base architecture's
            | local maximum of capabilities, as one would expect for a
            | more mature architecture; various things it could be made
            | to support, it now actually supports, because there was
            | time to add that support.
        
           | narrowbyte wrote:
           | I intentionally said "more towards embarrassingly parallel"
           | rather than "only embarrassingly parallel". I don't think
           | there's a hard cutoff, but there is a qualitative difference.
           | One example that springs to mind is
           | https://github.com/simdjson/simdjson - afaik there's no
           | similarly mature GPU-based JSON parsing.
        
             | raphlinus wrote:
             | I'm not aware of any similarly mature GPU-based JSON
             | parser, but I believe such a thing is possible. My stack
             | monoid work [1] contains a bunch of ideas that may be
             | helpful for building one. I've thought about pursuing that,
             | but have kept focus on 2D graphics as it's clearer how that
             | will actually be useful.
             | 
             | [1]: https://arxiv.org/abs/2205.11659
        
           | xoranth wrote:
           | > That allows things like individual threads to take locks,
           | which is a pretty big leap.
           | 
           | Does anyone know how those get translated into SIMD
            | instructions? Like, how do you do a CAS loop for each lane
           | where each lane can individually succeed or fail? What
           | happens if the lanes point to the same location?
        
             | raphlinus wrote:
             | There's a bit more information at [1], but I think the
             | details are not public. The hardware _is_ tracking a
             | separate program counter (and call stack) for each thread.
             | So in the CAS example, one thread wins and continues making
             | progress, while the other threads loop.
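              | 
              | The usual shape of such a retry loop (the standard
              | CUDA idiom, sketched from memory; the function name
              | is made up) is something like:
              | 
              |   // each lane retries independently; when several
              |   // lanes hit the same address, one lane's CAS
              |   // succeeds per round and the rest re-read the new
              |   // value and try again
              |   __device__ float atomic_fmax(float *addr, float v) {
              |     int *p = (int *)addr;
              |     int old = *p, assumed;
              |     do {
              |       assumed = old;
              |       float m = fmaxf(v, __int_as_float(assumed));
              |       old = atomicCAS(p, assumed, __float_as_int(m));
              |     } while (old != assumed);
              |     return __int_as_float(old);
              |   }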
             | 
              | There seems to be some more detail in a Bachelor's
              | thesis by Phillip Grote [2], with lots of measurements
              | of different synchronization primitives, but it doesn't
              | go too deep into the hardware.
             | 
             | [1]: https://arxiv.org/abs/2205.11659
             | 
             | [2]: https://www.clemenslutz.com/pdfs/bsc_thesis_phillip_gr
             | ote.pd...
        
               | xoranth wrote:
               | Thanks!
        
       | HALtheWise wrote:
       | For a SIMD architecture that supports scatter/gather and
       | instruction masking (like Arm SVE), could a compiler or language
       | allow you to write "Scalar-style code" that compiles to SIMD
       | instructions? I guess this is just auto-vectorization, but I'd be
       | interested in explicit tagging of code regions, possibly in
       | combination with restrictions on what operations are allowed.
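        | 
        | The closest thing I know of is OpenMP's simd pragma, which
        | tags a scalar-style loop as safe to vectorize and lets the
        | compiler emit gathers and masked ops where the target has
        | them. Rough sketch (made-up example, not tied to any
        | particular compiler):
        | 
        |   // "scalar-style" loop, explicitly tagged for SIMD
        |   void saxpy_indexed(float *y, const float *x,
        |                      const int *idx, float a, int n) {
        |     #pragma omp simd
        |     for (int i = 0; i < n; ++i) {
        |       if (x[idx[i]] > 0.0f)       // divergence -> masking
        |         y[i] += a * x[idx[i]];    // indexed load -> gather
        |     }
        |   }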
        
         | yosefk wrote:
          | Check out ispc (SPMD on SIMD) by Matt Pharr, a very
          | interesting take on this subject.
        
         | doophus wrote:
         | Yes, have a look at ISPC - it's amazing. I especially like that
         | it can generate code for multiple architectures and then select
         | the best implementation at runtime for the CPU it's running on.
        
           | xoranth wrote:
            | Do you know of any good tutorials for ISPC? The
            | documentation is a bit sparse.
        
       | Remnant44 wrote:
        | I think the principal thing that has changed since this
        | article was written is each category taking inspiration from
        | the other.
       | 
        | For example, SIMD instruction sets gained gather/scatter and
        | even per-instruction masking for divergent flow (in AVX-512,
        | which consumers never get to play with). These can really
        | simplify writing explicit SIMD and make it more GPU-like.
       | 
       | Conversely, GPUs gained a much higher emphasis on caching,
       | sustained divergent flow via independent program counters, and
       | subgroup instructions which are essentially explicit SIMD in
       | disguise.
       | 
       | SMT on the other hand... seems like it might be on the way out
       | completely. While still quite effective for some workloads, it
       | seems like quite a lot of complexity for only situational
       | improvements in throughput.
        
         | yosefk wrote:
         | The basic architecture still matters. GPUs still lose
         | throughput upon divergence regardless of their increased
         | ability to run more kinds of divergent flows correctly due to
         | having separate PCs, and SIMD still has more trouble with
         | instruction latency (including due to bank conflict resolution
         | in scatter/gather) than barrel threaded machines, etc. This is
         | not to detract from the importance of the improvements to the
            | base architecture made over time.
        
           | Remnant44 wrote:
           | agreed! The basic categories remain, just blurring a bit at
           | the edges.
        
       | mkoubaa wrote:
       | This type of parallelism is sort of like a flops metric.
       | Optimizing the amount of wall time the GPU is actually doing
        | computation is just as important (if not more so). There are
        | some
       | synchronization and pipelining tools in CUDA and Vulkan but they
       | are scary at first glance.
        
       | Const-me wrote:
       | > How many register sets does a typical SMT processor have? Er,
       | 2, sometimes 4
       | 
       | Way more of them. Pipelines are deep, and different in-flight
       | instructions need different versions of the same registers.
       | 
        | For example, my laptop has an AMD Zen 3 processor. Each core
        | has 192 physical scalar registers, while the ISA defines only
        | ~16 general-purpose scalar registers. That gives 12 register
        | sets, shared by both threads running on the core.
       | 
        | It's similar with SIMD vector registers. Apparently each core
        | has 160 32-byte vector registers. Because the AVX2 ISA defines
        | 16 vector registers, this gives 10 register sets per core,
        | again shared by 2 threads.
        
         | xoranth wrote:
          | I believe the author is referring to how many logical
          | threads/hyperthreads a core can run (for AMD and Intel,
          | two; I believe POWER can do 8, SPARC 4).
         | 
         | The extra physical registers are there for superscalar
         | execution, not for SMT/hyperthreading.
        
         | yosefk wrote:
         | I meant register sets visible to software. The fact that
         | there's even more hardware not visible to software that you
         | need for a thread to run fast just means that the cost of
          | adding another thread is even higher.
        
       | jabl wrote:
       | A couple of related questions:
       | 
        | - It has been claimed that several GPU vendors, under the covers,
       | convert the SIMT programming model (graphics shaders, CUDA,
       | OpenCL, whatever) into something like a SIMD ISA that the
       | underlying hardware supports. Why is that? Why not have something
       | SIMT-like as the underlying HW ISA? Seems the conceptual beauty
       | of SIMT is that you don't need to duplicate the entire scalar ISA
       | for vectors like you need with SIMD, you just need a few thread
       | control instructions (fork, join, etc.) to tell the HW to switch
       | between scalar or SIMT mode. So why haven't vendors gone with
       | this? Is there some hidden complexity that makes SIMT hard to
       | implement efficiently, despite the nice high level programming
       | model?
       | 
       | - How do these higher level HW features like Tensor cores map to
       | the SIMT model? It's sort of easy to see how SIMT handles a
       | vector, each thread handles one element of the vector. But if you
       | have HW support for something like a matrix multiplication, what
       | then? Or does each SIMT thread have access to a 'matmul'
       | instruction, and all the threads in a warp that run concurrently
       | can concurrently run matmuls?
        
         | xoranth wrote:
          | It is for the same reason that, in software, you sometimes
          | batch operations:
         | 
         | When you add two numbers, the GPU needs to do a lot more stuff
         | besides the addition.
         | 
         | If you implemented SIMT by having multiple cores, you would
         | need to do the extra stuff once per core, so you wouldn't save
         | power (and you have a fixed power budget). With SIMD, you get
         | $NUM_LANES additions, but you do the extra stuff only once,
         | saving power.
         | 
         | (See this article by OP, which goes into more details:
         | https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht...
         | )
        
         | ribit wrote:
         | How would you envision that working at the hardware level? GPUs
          | are massively parallel devices; they need to keep the
          | scheduler and ALU logic as simple and compact as possible.
          | SIMD is a natural way to implement this. In the real world,
          | SIMT is just SIMD with some additional capabilities for
          | control flow and a programming model that focuses on SIMD
          | lanes as threads of execution.
         | 
         | What's interesting is that modern SIMT is exposing quite a lot
         | of its SIMD underpinnings, because that allows you to implement
         | things much more efficiently. A hardware-accelerated SIMD sum
         | is way faster than adding values in shared memory.
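          | 
          | For instance, the usual warp-level sum in CUDA is written
          | directly against the SIMD width (sketch, assuming a full
          | 32-lane warp):
          | 
          |   // each step folds the upper half of the warp onto the
          |   // lower half; after 5 steps lane 0 holds the full sum
          |   __device__ float warp_sum(float v) {
          |     for (int offset = 16; offset > 0; offset >>= 1)
          |       v += __shfl_down_sync(0xffffffffu, v, offset);
          |     return v;   // meaningful in lane 0
          |   }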
        
       ___________________________________________________________________
       (page generated 2024-06-11 23:01 UTC)