[HN Gopher] Vectorization Virtual Workshop
___________________________________________________________________
Vectorization Virtual Workshop
Author : tjalfi
Score : 29 points
Date : 2021-07-20 23:39 UTC (1 day ago)
(HTM) web link (cvw.cac.cornell.edu)
(TXT) w3m dump (cvw.cac.cornell.edu)
| gen_greyface wrote:
| https://cvw.cac.cornell.edu/topics
|
| this is a really good resource
| dragontamer wrote:
| Hmm. It's a good resource, but I'm pretty bearish on
| autovectorization in this manner.
|
| After trying OpenCL / CUDA / ROCm style programming, it's clear
| that writing explicitly parallel code is in fact easier than
| expected (albeit with a ton of study involved... but I bet anyone
| can learn it if they put their mind to it).
|
| If CPU-SIMD is really needed for some reason, I expect that the
| languages that will be most convenient are explicitly-parallel
| systems like OpenMP or ISPC.
|
| In particular, look at these restrictions:
| https://cvw.cac.cornell.edu/vector/coding_vectorizable
|
| > The loop must be countable at runtime.
|
| > There should be a single control flow within the loop.
|
| > The loop should not contain function calls.
|
| These three restrictions are extremely restrictive! CUDA /
| OpenCL / ROCm allow all of these constructs. Control flow may have
| terrible performance in CUDA / OpenCL, but it's allowed, because
| it's convenient: if the programmer can't think of any way to solve
| a problem aside from a few more if-statements / switch-statements,
| then we should let them, even if it's inefficient.
|
| That's the thing. We know that SIMD-machines and languages
| designed for SIMD-machines can have dynamic loop counts and more
| than one if-statement (albeit with a branch divergence penalty).
| We also find it extremely convenient to decompose our problems
| into functions and sub-functions.
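| A minimal sketch of the contrast (my example, not from the
| workshop): the first loop satisfies all three rules and should
| auto-vectorize under gcc/clang at -O3; the second breaks the
| "countable" rule with a data-dependent early exit.

```c
#include <stddef.h>

/* Vectorizable: the trip count n is fixed at loop entry, the body is
   straight-line code, and there are no function calls. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* NOT vectorizable under these rules: the data-dependent early exit
   means the trip count is unknowable at loop entry. */
size_t find_first_negative(const float *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (x[i] < 0.0f)
            return i;    /* early exit breaks the "countable" rule */
    return n;
}
```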
|
| ---------------------
|
| OpenMP, in contrast, looks like it's learning from OpenCL / CUDA.
| The #pragma omp parallel for simd construct is moving closer and
| closer to CUDA/OpenCL parity, allowing for convenient "if"
| statements and dynamic loop counts.
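| A sketch of that style (my example): the pragma asserts the loop
| is safe to vectorize despite the runtime trip count and the
| branch; compiled without -fopenmp it is simply ignored and the
| loop runs scalar.

```c
#include <stddef.h>

/* "#pragma omp simd" tells the compiler this loop is safe to
   vectorize even though it has a runtime trip count and an
   if-statement in the body. */
float clamped_sum(const float *x, size_t n, float cap) {
    float total = 0.0f;
    #pragma omp simd reduction(+:total)
    for (size_t i = 0; i < n; i++) {
        float v = x[i];
        if (v > cap)      /* the branch compiles to a masked select */
            v = cap;
        total += v;
    }
    return total;
}
```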
|
| CPU-SIMD is here to stay, and I think learning how to use it is
| very important. But autovectorization from the compiler (without
| any programmer assist) looks like a dead end. The compiler gets a
| LOT of help when the programmer states things in terms of
| "threadIdx.x" and other explicitly-SIMD concepts.
|
| Besides, if the programmer is forced to learn all of these
| obscure rules and programming methods (countable loops, only one
| control flow within the loop, etc.), they're pretty much learning
| a sublanguage without any syntax to indicate that they've
| switched languages. Discoverability is really bad.
|
| If I instead say "#pragma omp parallel for simd" before using
| some obscure OpenMP features, any C/C++ programmer these days
| will notice that something is weird and search on those terms
| before reading the rest of the for-loop.
|
| A for-loop written in "autovectorized" style has no such
| indicator, no such "discoverability" to teach non-SIMD
| programmers what the hell is going on.
|
| ---------
|
| Kind of a shame, because this resource is excellently written and
| still worth a read IMO. Even if I think the tech is a bit of a
| dead-end.
| devwastaken wrote:
| SIMD is generally written by hand in assembly. libjpeg-turbo
| and ffmpeg have a lot of handwritten SIMD for various platforms.
|
| Sometimes you have to use compiler intrinsics for parts of the
| vectorization because of platform quirks, like differences
| between chip generations.
|
| Parallelization is a different matter, but you can combine the
| two.
| dragontamer wrote:
| > SIMD is generally written by hand in assembly. libjpeg-turbo
| and ffmpeg have a lot of handwritten SIMD for various
| platforms.
|
| Note that CPU-SIMD written in this manner is a subdiscipline
| called SWAR: SIMD-within-a-register. When we compare/contrast
| SWAR-SIMD (such as SSE / AVX / NEON / Altivec, etc. etc.) vs
| GPU-SIMD (NVidia CUDA, AMD ROCm), we notice that the two
| styles are extremely similar at the assembly language level
| (!!!).
|
| There are a few differences that lead to major performance
| contrasts, but when an AMD Vega GPU performs the
| "V_ADD_CO_U32" assembly instruction, it's very similar to the
| Intel AVX-512 "vpaddd" instruction (except the Vega instruction
| operates over 64x32-bit lanes, while AVX-512's is just over
| 16x32-bit lanes).
|
| So at the assembly language level, I assert that GPU-SIMD and
| CPU-SIMD+SWAR are more similar than most people think.
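| The name is literal, too: the SWAR trick can even be done in a
| plain integer register with no vector unit at all. A minimal
| illustration (my sketch, not from the thread): two independent
| 16-bit lanes packed into one 32-bit integer, added without
| letting a carry cross the lane boundary.

```c
#include <stdint.h>

/* Two independent 16-bit lanes packed in one 32-bit register.
   Mask off each lane's top bit before adding so no carry can cross
   the lane boundary, then patch the top bits back in with XOR. */
uint32_t swar_add16x2(uint32_t a, uint32_t b) {
    const uint32_t HI = 0x80008000u;        /* top bit of each lane */
    uint32_t sum = (a & ~HI) + (b & ~HI);   /* carry-safe partial add */
    return sum ^ ((a ^ b) & HI);            /* restore lane top bits */
}
```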
|
| ------------
|
| The primary difference is in the expectation in what the
| compiler can or can't do.
|
| In particular: CPU-SIMD+SWAR programmers have been
| programming in either assembly language (back in the MMX days) or
| compiler intrinsics because... let's be frank... the compiler
| still sucks at this. Autovectorization fails so often that our
| only choice is to dip down into intrinsics and/or assembly
| language directly.
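| For instance, a minimal intrinsics sketch (hypothetical helper,
| SSE2 on x86-64, with a scalar fallback for other targets): one
| _mm_add_epi32 call compiles to a single paddd instruction, the
| 4-lane ancestor of AVX-512's 16-lane vpaddd.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 is baseline on x86-64 */
#endif

/* Hypothetical helper: add four packed 32-bit ints in one shot. */
void add4_i32(int32_t out[4], const int32_t a[4], const int32_t b[4]) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb)); /* paddd */
#else
    for (int i = 0; i < 4; i++)   /* scalar fallback on non-x86 targets */
        out[i] = a[i] + b[i];
#endif
}
```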
|
| In contrast, GPU-SIMD programmers have created more-and-more
| elaborate compilers and transformations to support a greater
| variety of programming structures. It's possible to program in
| a SIMD manner today using a "high-level language" like CUDA /
| OpenCL.
|
| -------
|
| Based on my experiments in the GPU-programming world, I'm
| convinced that the OpenMP committee "gets" it. They too seem
| to have seen the benefits of the higher-level abstractions that
| CUDA / OpenCL offer, and are beginning to translate those
| abstractions into OpenMP. The #pragma omp simd constructs are
| marching in the right direction, albeit many years behind
| CUDA/OpenCL.
|
| https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-simd.h...
|
| -------
|
| Not all of us are willing to use intrinsics / raw assembly
| language for the rest of our days. :-) Some of us are looking
| for ways to bring forth those performance benefits of SIMD
| into a higher-level language. There will always be a place
| for intrinsics / raw assembly language, but we definitely
| prefer for "most" code to be written in a more easily
| understood manner.
| [deleted]
___________________________________________________________________
(page generated 2021-07-22 23:02 UTC)