[HN Gopher] SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)
___________________________________________________________________
SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)
Author : shipp02
Score : 79 points
Date : 2024-06-10 06:05 UTC (1 days ago)
(HTM) web link (yosefk.com)
(TXT) w3m dump (yosefk.com)
| narrowbyte wrote:
| Quite an interesting framing. A couple of things have changed
| since 2011:
|
| - SIMD (at least intel's AVX512) does have usable gather/scatter,
| so "Single instruction, multiple addresses" is no longer a
| flexibility win for SIMT vs SIMD
|
| - likewise for pervasive masking support and "Single instruction,
| multiple flow paths"
|
| In general, I think of SIMD as _more_ flexible than SIMT, not
| less, in line with this other post
| https://news.ycombinator.com/item?id=40625579. SIMT requires
| staying more towards the "embarrassingly" parallel end of the
| spectrum, SIMD can be applied in cases where understanding the
| opportunity for parallelism is very non-trivial.
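For concreteness, here is a scalar sketch of the per-lane semantics of a masked AVX-512 gather/scatter (such as `vpgatherdd`/`vpscatterdd` with a write-mask). The Python only models the semantics; the function names are illustrative, not a real API:

```python
def masked_gather(base, indices, mask, old):
    # One result element per lane: masked-off lanes keep their old value,
    # active lanes load base[indices[i]] -- the per-lane behavior of a
    # masked AVX-512 gather.
    return [base[idx] if m else o
            for idx, m, o in zip(indices, mask, old)]

def masked_scatter(base, indices, mask, values):
    # Active lanes store values[i] to base[indices[i]]. Stores happen in
    # lane order, so if two active lanes share an index, the
    # highest-numbered lane wins (matching AVX-512 scatter ordering).
    for i in range(len(indices)):
        if mask[i]:
            base[indices[i]] = values[i]

table = [10, 20, 30, 40, 50, 60, 70, 80]
print(masked_gather(table, [7, 0, 3, 5],
                    [True, False, True, True], [0, 0, 0, 0]))
# -> [80, 0, 40, 60]: lanes 0, 2, 3 load; lane 1 keeps its old value
```

The point of the comment above is that this per-lane load/store-with-mask capability, once SIMT-only, is now available to plain SIMD code.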
| majke wrote:
| Last time I looked at Intel scatter/gather I got the impression
| it only works for a very narrow use case, and getting it to
| perform well wasn't easy. Did I miss something?
| narrowbyte wrote:
| The post says, about SIMT / GPU programming, "This loss
| results from the DRAM architecture quite directly, the GPU
| being unable to do much about it - similarly to any other
| processor."
|
| I would say that for SIMD the situation is basically the
| same. gather/scatter don't magically make the memory
| hierarchy a non-issue, but they're no longer adding any
| unnecessary pain on top.
| yosefk wrote:
| Barrel-threaded machines like GPUs have an easier time hiding
| the latency of bank conflict resolution when
| gathering/scattering against local memory/cache than a
| machine running a single instruction thread. So I'm pretty
| sure they have a fundamental advantage when it comes to the
| throughput of scatter/gather operations, one that gets bigger
| with a larger number of vector lanes.
| raphlinus wrote:
| One of the other major things that's changed is that Nvidia now
| has independent thread scheduling (as of Volta, see [1]). That
| allows things like individual threads to take locks, which is a
| pretty big leap. Essentially, it allows you to program each
| individual thread as if it's running a C++ program, but of
| course you do have to think about the warp and block structure
| if you want to optimize performance.
|
| I disagree that SIMT is only for embarrassingly parallel
| problems. Both CUDA and compute shaders are now used for fairly
| sophisticated data structures (including trees) and algorithms
| (including sorting).
|
| [1]: https://developer.nvidia.com/blog/inside-volta/#independent_...
| yosefk wrote:
| It's important that GPU threads support locking and control
| flow divergence, and I don't want to minimize that, but
| threads within a warp diverging still badly loses throughput,
| so I don't think the situation is fundamentally different in
| terms of what the machine is good/bad at. We're just closer
| to the base architecture's local maximum of capabilities, as
| one would expect for a more mature architecture; various
| things it could have been made to support, it now actually
| supports, because there was time to add this support.
| narrowbyte wrote:
| I intentionally said "more towards embarrassingly parallel"
| rather than "only embarrassingly parallel". I don't think
| there's a hard cutoff, but there is a qualitative difference.
| One example that springs to mind is
| https://github.com/simdjson/simdjson - afaik there's no
| similarly mature GPU-based JSON parsing.
| raphlinus wrote:
| I'm not aware of any similarly mature GPU-based JSON
| parser, but I believe such a thing is possible. My stack
| monoid work [1] contains a bunch of ideas that may be
| helpful for building one. I've thought about pursuing that,
| but have kept focus on 2D graphics as it's clearer how that
| will actually be useful.
|
| [1]: https://arxiv.org/abs/2205.11659
| xoranth wrote:
| > That allows things like individual threads to take locks,
| which is a pretty big leap.
|
| Does anyone know how those get translated into SIMD
| instructions. Like, how do you do a CAS loop for each lane
| where each lane can individually succeed or fail? What
| happens if the lanes point to the same location?
| raphlinus wrote:
| There's a bit more information at [1], but I think the
| details are not public. The hardware _is_ tracking a
| separate program counter (and call stack) for each thread.
| So in the CAS example, one thread wins and continues making
| progress, while the other threads loop.
|
| There seems to be some more detail in a Bachelor's thesis by
| Phillip Grote [2], with lots of measurements of different
| synchronization primitives, but it doesn't go too deep into
| the hardware.
|
| [1]: https://arxiv.org/abs/2205.11659
|
| [2]: https://www.clemenslutz.com/pdfs/bsc_thesis_phillip_grote.pd...
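The hardware details are not public, but the forward-progress behavior described above can be sketched as a toy model: each "lane" spins on a CAS to the same lock; atomics to one address serialize, so exactly one active lane wins per round while the rest loop. All names here are illustrative:

```python
def cas(mem, addr, expected, new):
    # Scalar compare-and-swap: returns True on success.
    if mem[addr] == expected:
        mem[addr] = new
        return True
    return False

def warp_lock_rounds(num_lanes):
    # Each lane spins on CAS(lock, 0, 1). Per round, atomics to the same
    # address serialize, so exactly one active lane acquires the lock,
    # runs its critical section, releases, and drops out. With separate
    # per-thread program counters the losers can keep looping without
    # deadlocking the warp. Returns how many rounds the warp needed.
    mem = {"lock": 0}
    active = set(range(num_lanes))
    rounds = 0
    while active:
        rounds += 1
        winner = None
        for lane in sorted(active):      # serialized same-address atomics
            if winner is None and cas(mem, "lock", 0, 1):
                winner = lane
        mem["lock"] = 0                  # winner's critical section + unlock
        active.discard(winner)
    return rounds

print(warp_lock_rounds(4))  # -> 4: one winner per round
```

The model also shows why per-lane locking is still expensive: the critical sections execute one lane at a time.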
| xoranth wrote:
| Thanks!
| HALtheWise wrote:
| For a SIMD architecture that supports scatter/gather and
| instruction masking (like Arm SVE), could a compiler or language
| allow you to write "Scalar-style code" that compiles to SIMD
| instructions? I guess this is just auto-vectorization, but I'd be
| interested in explicit tagging of code regions, possibly in
| combination with restrictions on what operations are allowed.
| yosefk wrote:
| Check out ISPC (SPMD on SIMD) by Matt Pharr, a very
| interesting take on this subject.
| doophus wrote:
| Yes, have a look at ISPC - it's amazing. I especially like that
| it can generate code for multiple architectures and then select
| the best implementation at runtime for the CPU it's running on.
| xoranth wrote:
| Do you know any good tutorial for ISPC? Documentation is a
| bit sparse.
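A sketch of what such a compiler does with divergent scalar-style code: if-conversion. The branch body is evaluated for all lanes and the results are blended by a per-lane mask, much as ISPC lowers SPMD code to masked SIMD. This scalar Python only models the semantics (the function and example are illustrative, not compiler output):

```python
def spmd_abs_minus_one(xs):
    # Scalar-style program each "program instance" (lane) runs:
    #   if x < 0: x = -x
    #   return x - 1
    # If-converted form: compute the branch body for every lane, then
    # blend by the per-lane mask -- one instruction stream, many lanes,
    # no actual branching.
    mask = [x < 0 for x in xs]                      # per-lane predicate
    negated = [-x for x in xs]                      # branch body, all lanes
    blended = [n if m else x
               for m, n, x in zip(mask, negated, xs)]
    return [b - 1 for b in blended]

print(spmd_abs_minus_one([-3, 2, -1, 5]))  # -> [2, 1, 0, 4]
```

With hardware masking (AVX-512, SVE) the blend step maps directly onto predicated instructions instead of an explicit select.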
| Remnant44 wrote:
| I think the principal thing that has changed since this
| article was written is each category taking inspiration from
| the other.
|
| For example, SIMD instructions gained gather/scatter and even
| masking of instructions for divergent flow (in AVX-512, which
| consumers never get to play with). These can really simplify
| writing explicit SIMD and make it more GPU-like.
|
| Conversely, GPUs gained a much higher emphasis on caching,
| sustained divergent flow via independent program counters, and
| subgroup instructions which are essentially explicit SIMD in
| disguise.
|
| SMT on the other hand... seems like it might be on the way out
| completely. While still quite effective for some workloads, it
| seems like quite a lot of complexity for only situational
| improvements in throughput.
| yosefk wrote:
| The basic architecture still matters. GPUs still lose
| throughput upon divergence regardless of their increased
| ability to run more kinds of divergent flows correctly due to
| having separate PCs, and SIMD still has more trouble with
| instruction latency (including due to bank conflict resolution
| in scatter/gather) than barrel threaded machines, etc. This is
| not to detract from the importance of the improvements to the
| base architecture made over time.
| Remnant44 wrote:
| agreed! The basic categories remain, just blurring a bit at
| the edges.
| mkoubaa wrote:
| This type of parallelism is sort of like a flops metric.
| Optimizing the amount of wall time the GPU is actually doing
| computation is just as important (if not more). There are some
| synchronization and pipelining tools in CUDA and Vulkan but they
| are scary at first glance.
| Const-me wrote:
| > How many register sets does a typical SMT processor have? Er,
| 2, sometimes 4
|
| Way more of them. Pipelines are deep, and different in-flight
| instructions need different versions of the same registers.
|
| For example, my laptop has AMD Zen3 processor. Each core has 192
| scalar physical registers, while only ~16 general-purpose scalar
| registers defined in the ISA. This gives 12 register sets; they
| are shared by both threads running on the core.
|
| Similar with SIMD vector registers. Apparently each core has
| 160 32-byte vector registers. Because the AVX2 ISA defines 16
| vector registers, this gives 10 register sets per core, again
| shared by 2 threads.
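The arithmetic in the comment above, as a quick worked sketch (Zen 3 numbers are as quoted there, per core, shared by the 2 SMT threads):

```python
def register_sets(physical, architectural):
    # How many complete architectural register sets the physical
    # register file can hold (integer division: partial sets don't count).
    return physical // architectural

print(register_sets(192, 16))  # scalar: 192 physical / 16 ISA -> 12 sets
print(register_sets(160, 16))  # vector: 160 physical / 16 ISA -> 10 sets
```

As the replies note, most of those physical sets serve register renaming for out-of-order execution rather than extra SMT threads.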
| xoranth wrote:
| I believe the author is referring to how many logical
| threads/hyperthreads a core can run (for AMD and Intel, two;
| I believe POWER can do 8, SPARC 4).
|
| The extra physical registers are there for superscalar
| execution, not for SMT/hyperthreading.
| yosefk wrote:
| I meant register sets visible to software. The fact that
| there's even more hardware not visible to software that you
| need for a thread to run fast just means that the cost of
| adding another thread is even higher.
| jabl wrote:
| A couple of related questions:
|
| - It has been claimed that several GPU vendors behind the covers
| convert the SIMT programming model (graphics shaders, CUDA,
| OpenCL, whatever) into something like a SIMD ISA that the
| underlying hardware supports. Why is that? Why not have something
| SIMT-like as the underlying HW ISA? Seems the conceptual beauty
| of SIMT is that you don't need to duplicate the entire scalar ISA
| for vectors like you need with SIMD, you just need a few thread
| control instructions (fork, join, etc.) to tell the HW to switch
| between scalar or SIMT mode. So why haven't vendors gone with
| this? Is there some hidden complexity that makes SIMT hard to
| implement efficiently, despite the nice high level programming
| model?
|
| - How do these higher level HW features like Tensor cores map to
| the SIMT model? It's sort of easy to see how SIMT handles a
| vector, each thread handles one element of the vector. But if you
| have HW support for something like a matrix multiplication, what
| then? Or does each SIMT thread have access to a 'matmul'
| instruction, and all the threads in a warp that run concurrently
| can concurrently run matmuls?
| xoranth wrote:
| It is the same reason in software sometimes you batch
| operations:
|
| When you add two numbers, the GPU needs to do a lot more stuff
| besides the addition.
|
| If you implemented SIMT by having multiple cores, you would
| need to do the extra stuff once per core, so you wouldn't save
| power (and you have a fixed power budget). With SIMD, you get
| $NUM_LANES additions, but you do the extra stuff only once,
| saving power.
|
| (See this article by OP, which goes into more details:
| https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht...
| )
| ribit wrote:
| How would you envision that working at the hardware level? GPUs
| are massively parallel devices; they need to keep the scheduler
| and ALU logic as simple and compact as possible. SIMD is a
| natural way to implement this. In the real world, SIMT is just
| SIMD with some additional capabilities for control flow and a
| programming model that treats SIMD lanes as threads of
| execution.
|
| What's interesting is that modern SIMT is exposing quite a lot
| of its SIMD underpinnings, because that allows you to implement
| things much more efficiently. A hardware-accelerated SIMD sum
| is way faster than adding values in shared memory.
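The kind of hardware-accelerated sum meant here is a butterfly reduction over lanes, in the style of CUDA's `__shfl_xor_sync`: log2(width) steps of lane-to-lane exchange, with no shared-memory traffic. A scalar Python model of the semantics (lane count and function name are illustrative):

```python
def warp_reduce_sum(lanes):
    # Butterfly reduction: at each step, every lane adds the value held
    # by the lane whose index differs in exactly one bit (what a shuffle
    # instruction exchanges directly between registers). After
    # log2(width) steps, every lane holds the full sum.
    width = len(lanes)  # assumed a power of two, like a 32-lane warp
    vals = list(lanes)
    offset = 1
    while offset < width:
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset *= 2
    return vals

print(warp_reduce_sum([1, 2, 3, 4]))  # -> [10, 10, 10, 10]
```

A shared-memory version would need a round-trip through memory plus a barrier per step; the shuffle form stays entirely in registers, which is why it is so much faster.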
___________________________________________________________________
(page generated 2024-06-11 23:01 UTC)