[HN Gopher] Vectorization Virtual Workshop
       ___________________________________________________________________
        
       Vectorization Virtual Workshop
        
       Author : tjalfi
       Score  : 29 points
       Date   : 2021-07-20 23:39 UTC (1 day ago)
        
 (HTM) web link (cvw.cac.cornell.edu)
 (TXT) w3m dump (cvw.cac.cornell.edu)
        
       | gen_greyface wrote:
       | https://cvw.cac.cornell.edu/topics
       | 
       | this is a really good resource
        
       | dragontamer wrote:
       | Hmm. It's a good resource, but I'm pretty bearish on
       | autovectorization done in this manner.
       | 
       | After trying OpenCL / CUDA / ROCm style programming, it's
       | clear that writing explicitly parallel code is in fact easier
       | than expected (albeit with a ton of study involved... but I
       | bet anyone can learn it if they put their mind to it).
       | 
       | If CPU-SIMD is really needed for some reason, I expect that the
       | languages that will be most convenient are explicitly-parallel
       | systems like OpenMP or ISPC.
       | 
       | In particular, look at these restrictions:
       | https://cvw.cac.cornell.edu/vector/coding_vectorizable
       | 
       | > The loop must be countable at runtime.
       | 
       | > There should be a single control flow within the loop.
       | 
       | > The loop should not contain function calls.
       | 
       | These three restrictions are severe!! CUDA / OpenCL / ROCm all
       | allow these constructs. Control flow may have terrible
       | performance in CUDA / OpenCL, but it's allowed, because it's
       | convenient: if the programmer can't think of any way to solve
       | a problem aside from a few more if-statements / switch-
       | statements, then we should let them, even if it's inefficient.
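       | 
       | To make those rules concrete, here's a minimal sketch (my own
       | example, not from the workshop): the first loop satisfies all
       | three rules, the second one breaks all three.
       | 
       |     #include <stddef.h>
       |     #include <math.h>
       | 
       |     void scale(float *dst, const float *src, size_t n, float k)
       |     {
       |         /* countable, single control flow, no calls: compilers
       |            will usually autovectorize this */
       |         for (size_t i = 0; i < n; i++)
       |             dst[i] = k * src[i];
       |     }
       | 
       |     float sum_until_negative(const float *src, size_t n)
       |     {
       |         float sum = 0.0f;
       |         for (size_t i = 0; i < n; i++) {
       |             if (src[i] < 0.0f)
       |                 break;           /* trip count unknown at entry */
       |             sum += expf(src[i]); /* function call in the body */
       |         }
       |         return sum;              /* usually left as scalar code */
       |     }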
       | 
       | That's the thing. We know that SIMD-machines and languages
       | designed for SIMD-machines can have dynamic loop counts and more
       | than one if-statement (albeit with a branch divergence penalty).
       | We also find it extremely convenient to decompose our problems
       | into functions and sub-functions.
       | 
       | ---------------------
       | 
       | OpenMP, in contrast, looks like it's learning from OpenCL /
       | CUDA. The "#pragma omp parallel for simd" construct is moving
       | closer and closer to CUDA/OpenCL parity, allowing convenient
       | "if" statements and dynamic loop counts.
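       | 
       | For instance, something like this (my own sketch, not a quote
       | from any spec or from the workshop) is exactly the kind of
       | loop body the autovectorization guidelines warn about, but the
       | pragma makes the intent explicit and searchable:
       | 
       |     #include <stddef.h>
       | 
       |     /* n is only known at runtime and the body branches per
       |        element; the pragma requests threaded + vectorized
       |        execution anyway */
       |     void clamp_scale(float *x, size_t n, float lo, float k)
       |     {
       |         #pragma omp parallel for simd
       |         for (size_t i = 0; i < n; i++) {
       |             if (x[i] < lo)
       |                 x[i] = lo;
       |             else
       |                 x[i] = k * x[i];
       |         }
       |     }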
       | 
       | CPU-SIMD is here to stay, and I think learning how to use it
       | is very important. But autovectorization from the compiler
       | (without any programmer assistance) looks like a dead end. The
       | compiler gets a LOT of help when the programmer states things
       | in terms of "threadIdx.x" and other explicitly-SIMD concepts.
       | 
       | Besides, if the programmer is forced to learn all of these
       | obscure rules / obscure programming methods (countable loops /
       | only one control flow within the loop / etc.), they're pretty
       | much learning a sublanguage without any syntax to indicate
       | that they've switched languages. Discoverability is really bad.
       | 
       | If I instead say "#pragma omp parallel for simd" before using
       | some obscure OpenMP features, any C/C++ programmer these days
       | will notice that something is weird and search on those terms
       | before reading the rest of the for-loop.
       | 
       | A for-loop written in "autovectorized" style has no such
       | indicator, no such "discoverability" to teach non-SIMD
       | programmers what the hell is going on.
       | 
       | ---------
       | 
       | Kind of a shame, because this resource is excellently written
       | and still worth a read IMO, even if I think the tech is a bit
       | of a dead end.
        
         | devwastaken wrote:
         | SIMD is generally written by hand in assembly. libjpeg-turbo
         | and ffmpeg have a lot of handwritten SIMD for various
         | platforms.
         | 
         | Sometimes you have to use compiler intrinsics for parts of
         | the vectorization because of platform quirks, like
         | differences between the various chips.
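         | 
         | The usual pattern (a hedged sketch, not code from either
         | project; the kernel names are made up) is to pick the
         | implementation at runtime:
         | 
         |     /* hypothetical kernels: one hand-written AVX2 path and
         |        a plain-C fallback */
         |     void filter_avx2(float *x, int n);
         |     void filter_scalar(float *x, int n);
         | 
         |     void filter(float *x, int n)
         |     {
         |         /* GCC/Clang builtin for x86 feature detection */
         |         if (__builtin_cpu_supports("avx2"))
         |             filter_avx2(x, n);
         |         else
         |             filter_scalar(x, n);
         |     }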
         | 
         | Parallelization is a different matter, but you can combine the
         | two.
        
           | dragontamer wrote:
           | > SIMD is generally written by hand in assembly.
           | libjpeg-turbo and ffmpeg have a lot of handwritten SIMD for
           | various platforms.
           | 
           | Note that CPU-SIMD written in this manner is a subdiscipline
           | called SWAR: SIMD-within-a-register. When we compare/contrast
           | SWAR-SIMD (such as SSE / AVX / NEON / Altivec, etc. etc.) vs
           | GPU-SIMD (NVidia CUDA, AMD ROCm), we notice that the two
           | styles are extremely similar at the assembly language level
           | (!!!).
           | 
           | There are a few differences that lead to major performance
           | contrasts, but when an AMD Vega GPU performs the
           | "V_ADD_CO_U32" assembly instruction, it's very similar to
           | the Intel AVX-512 "vpaddd" instruction (except the Vega GPU
           | operates over 64x32-bit lanes, while AVX-512 is just
           | 16x32-bit).
           | 
           | So at the assembly language level, I assert that GPU-SIMD and
           | CPU-SIMD+SWAR are more similar than most people think.
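           | 
           | For the CPU side, that "vpaddd" usually shows up through an
           | intrinsic rather than raw assembly. A minimal sketch (my
           | own example, assuming AVX-512 support and n being a
           | multiple of 16):
           | 
           |     #include <stddef.h>
           |     #include <immintrin.h>
           | 
           |     /* 16 lanes of 32-bit adds per instruction: the
           |        SWAR-style counterpart of one wavefront-wide GPU
           |        add, just 16 lanes instead of 64 */
           |     void add_arrays(int *dst, const int *a, const int *b,
           |                     size_t n)
           |     {
           |         for (size_t i = 0; i < n; i += 16) {
           |             __m512i va = _mm512_loadu_si512(a + i);
           |             __m512i vb = _mm512_loadu_si512(b + i);
           |             _mm512_storeu_si512(dst + i,
           |                                 _mm512_add_epi32(va, vb));
           |         }
           |     }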
           | 
           | ------------
           | 
           | The primary difference is in the expectations of what the
           | compiler can or can't do.
           | 
           | In particular: CPU-SIMD+SWAR programmers have been
           | programming in either assembly language (back in the MMX
           | days) or compiler intrinsics because... let's be frank...
           | the compiler still sucks. Autovectorization sucks so much
           | that our only choice is to drop down into intrinsics and/or
           | assembly language directly.
           | 
           | In contrast, GPU-SIMD programmers have created more and more
           | elaborate compilers / transformations to support a greater
           | variety of programming structures. It's possible to program
           | in a SIMD manner today using a "high-level language" like
           | CUDA / OpenCL.
           | 
           | -------
           | 
           | Based on my experiments in the GPU-programming world, I'm
           | convinced that the OpenMP committee "gets" it. They too seem
           | to have seen the benefits of the higher-level abstractions
           | that CUDA / OpenCL offer, and are beginning to translate
           | those abstractions into OpenMP. The #pragma omp simd
           | constructs are marching in the right direction (albeit many
           | years behind CUDA/OpenCL).
           | 
           | https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-
           | simd.h...
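           | 
           | As one concrete example of that direction (my own sketch,
           | not taken from that page): "declare simd" lets a function
           | call live inside a vectorized loop, which is exactly one of
           | the things the autovectorization guidelines above forbid.
           | 
           |     /* the compiler is asked to emit a vector variant of
           |        this function, callable from SIMD loops */
           |     #pragma omp declare simd
           |     float damp(float x) { return x / (1.0f + x * x); }
           | 
           |     void apply(float *x, int n)
           |     {
           |         #pragma omp simd
           |         for (int i = 0; i < n; i++)
           |             x[i] = damp(x[i]);
           |     }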
           | 
           | -------
           | 
           | Not all of us are willing to use intrinsics / raw assembly
           | language for the rest of our days. :-) Some of us are
           | looking for ways to bring the performance benefits of SIMD
           | into a higher-level language. There will always be a place
           | for intrinsics / raw assembly language, but we definitely
           | prefer that "most" code be written in a more easily
           | understood manner.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-07-22 23:02 UTC)