[HN Gopher] SIMD for C++ Developers [pdf]
       ___________________________________________________________________
        
       SIMD for C++ Developers [pdf]
        
       Author : Const-me
       Score  : 101 points
       Date   : 2021-04-27 12:55 UTC (10 hours ago)
        
 (HTM) web link (const.me)
 (TXT) w3m dump (const.me)
        
       | secondcoming wrote:
       | What timing! This is probably better for StackOverflow, but is
       | there a way to AND two AVX operands and also get the ZF flag set
       | if the result is zero?
       | 
       | It seems like there's one intrinsic to do the AND but this
       | doesn't set ZF. [0]
       | 
        | But there's another intrinsic that will set ZF but doesn't
        | actually store the result of the AND operation [1].
       | 
       | [0] vpand ymm, ymm, ymm
       | 
       | [1] vtestpd ymm, ymm
       | 
        | I'm guessing that either a) I'm missing an instruction, or b)
        | having to modify EFLAGS from AVX instructions incurs a large
        | penalty and so it's not advisable?
        
         | Const-me wrote:
         | I don't think you're missing an instruction. A few comments,
         | still.
         | 
         | Bitwise instructions are very cheap, 1 cycle of latency.
         | Skylake can run 3 of them every clock, Zen 2 can run 4 of them
         | per clock. I wouldn't worry about that extra vpand instruction
         | too much.
         | 
          | About vptest, the latency is not great, 6-7 cycles. If you're
          | going to branch on the outcome, and the branch is not
          | predictable (your code takes a random branch every time the
          | instruction at a specific address runs), you're going to
          | waste time. Sometimes that's unavoidable, like when your goal
          | is something similar to the memchr() function (though I'd
          | recommend _mm256_movemask_epi8 instead for that). But other
          | times it's possible to rework into something better: mask
          | lanes with _mm256_blendv_[something], zero out stuff with
          | bitwise AND, that kind of stuff.
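          |
          | If you need both the AND result and the zero test, a minimal
          | sketch (a hypothetical helper, assuming AVX2 for the integer
          | vpand) is to simply issue both instructions; the extra vpand
          | is nearly free:
          |
          |   #include <immintrin.h>
          |
          |   // Returns true when ( a & b ) == 0, and also keeps the
          |   // AND result. _mm256_and_si256 compiles to vpand, and
          |   // _mm256_testz_si256 compiles to vptest, which sets ZF
          |   // from the same AND.
          |   inline bool andIsZero( __m256i a, __m256i b, __m256i& res )
          |   {
          |       res = _mm256_and_si256( a, b );
          |       return _mm256_testz_si256( a, b ) != 0;
          |   }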
        
         | [deleted]
        
       | xpuente wrote:
       | Why not tools like https://github.com/ispc?
       | 
        | This seems really close to the metal: close enough to either
        | incur a non-negligible maintenance cost, or to not fully
        | exploit the hardware in use.
        
         | vgatherps wrote:
          | I've used ISPC before, as well as enoki (sort of like ISPC-
          | in-C++), and found that they have a lot of sharp performance
          | edges.
          | 
          | My experience with both was that the further I moved away
          | from the super classic SIMD cases, the more I ran into crazy
          | compiler cliffs where tiny tweaks would blow up the codegen.
          | In each case I gave up, reimplemented what I wanted directly
          | in C++ (the second time using Agner Fog's wonderful vector
          | class library), and easily got the results I wanted without a
          | ton of finagling the compiler and libraries.
        
         | bjourne wrote:
          | It doesn't always emit optimal SIMD code. Plus, when you get
          | the hang of it, writing your own SIMD library is fairly
          | simple, so you don't need a tool for it. C++ templates and
          | operator overloading really shine here. For example, you can
          | write sqrt(x*y+z) and have the template system select the
          | optimal SIMD intrinsics depending on whether x, y, and z are
          | int, float, int16, float8, double4, etc.
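          |
          | A minimal sketch of that wrapper idea, assuming AVX and a
          | hypothetical float8 type (8 x FP32 in one register):
          |
          |   #include <immintrin.h>
          |
          |   struct float8
          |   {
          |       __m256 v;
          |       explicit float8( __m256 x ) : v( x ) {}
          |   };
          |
          |   inline float8 operator*( float8 a, float8 b )
          |   { return float8( _mm256_mul_ps( a.v, b.v ) ); }
          |
          |   inline float8 operator+( float8 a, float8 b )
          |   { return float8( _mm256_add_ps( a.v, b.v ) ); }
          |
          |   inline float8 sqrt( float8 a )
          |   { return float8( _mm256_sqrt_ps( a.v ) ); }
          |
          |   // sqrt( x * y + z ) now compiles to vmulps / vaddps /
          |   // vsqrtps; overloads for double4, int8, etc. would pick
          |   // other intrinsics through the same expression.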
        
           | janwas wrote:
           | +1 to intrinsics or wrappers giving us more control over
           | performance.
           | 
           | > Plus, when you get the hang of it, writing your own SIMD
           | library is fairly simple
           | 
            | hm.. it's indeed easy to get started, but maintaining
            | https://github.com/google/highway (supports clang/gcc/MSVC,
            | x86/ARM/RISC-V) is quite time-consuming, especially working
            | around compiler bugs.
        
         | Const-me wrote:
          | Harder to use. That's another language, which requires a
          | special compiler from Intel. The intrinsics are already
          | supported in all modern C and C++ compilers, with little to no
          | project setup.
          | 
          | For many practical problems, ISPC's abstraction is not a good
          | fit. It's good for linear algebra with long vectors and large
          | matrices, but SIMD is useful for many other things besides
          | that. A toy problem: count the spaces in a 4 GB buffer in
          | memory. I'm pretty sure manually written SSE2 or AVX2 code
          | (inner loop doing _mm_cmpeq_epi8 and _mm_sub_epi8, outer one
          | doing _mm_sad_epu8 and _mm_add_epi64) is going to be faster
          | than the ISPC-made version.
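          |
          | A sketch of that loop structure (a hypothetical countSpaces
          | function; assumes the pointer is 16-byte aligned and the
          | length is a multiple of 16 bytes):
          |
          |   #include <immintrin.h>
          |   #include <stddef.h>
          |   #include <stdint.h>
          |
          |   uint64_t countSpaces( const uint8_t* p, size_t length )
          |   {
          |       const __m128i spaces = _mm_set1_epi8( ' ' );
          |       const __m128i zero = _mm_setzero_si128();
          |       __m128i total = zero;
          |       size_t i = 0;
          |       while( i < length )
          |       {
          |           // Inner loop: _mm_cmpeq_epi8 makes matching
          |           // bytes -1, so subtracting adds 1 to a per-byte
          |           // counter; stop after at most 255 iterations,
          |           // before the 8-bit counters can overflow
          |           __m128i acc = zero;
          |           size_t end = i + 255 * 16;
          |           if( end > length ) end = length;
          |           for( ; i < end; i += 16 )
          |           {
          |               const __m128i v =
          |                   _mm_load_si128( ( const __m128i* )( p + i ) );
          |               acc = _mm_sub_epi8( acc,
          |                   _mm_cmpeq_epi8( v, spaces ) );
          |           }
          |           // Outer loop: _mm_sad_epu8 horizontally adds the
          |           // bytes into two uint64 lanes, accumulated there
          |           total = _mm_add_epi64( total,
          |               _mm_sad_epu8( acc, zero ) );
          |       }
          |       total = _mm_add_epi64( total,
          |           _mm_unpackhi_epi64( total, total ) );
          |       return (uint64_t)_mm_cvtsi128_si64( total );
          |   }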
        
           | mattpharr wrote:
           | > It's good for linear algebra with long vectors and large
           | matrices, but SIMD is useful for many other things besides
           | that
           | 
           | The main goal in ispc's design was to support SPMD (single
           | program multiple data) programming, which is more general
           | than pure SIMD. Handling the relatively easy cases of (dense)
           | linear algebra that are easily expressed in SIMD wasn't a
           | focus as it's pretty easy to do in other ways.
           | 
           | Rather, ispc is focused on making it easy to write code with
           | divergent control flow over the vector lanes. This is
           | especially painful to do in intrinsics, especially in the
           | presence of nested divergent control flow. If you don't have
           | that, you might as well use explicit SIMD, though perhaps via
           | something like Eigen in order to avoid all of the ugliness of
           | manual use of intrinsics.
           | 
           | > I'm pretty sure manually written SSE2 or AVX2 code (inner
           | loop doing _mm_cmpeq_epi8 and _mm_sub_epi8, outer one doing
           | _mm_sad_epu8 and _mm_add_epi64)
           | 
            | ispc is focused on 32-bit datatypes, so I'm sure that is
            | true. I suspect it would be a more pleasant experience than
            | intrinsics for a reduction operation of that sort over 32-bit
            | datatypes, however.
        
             | Const-me wrote:
             | > This is especially painful to do in intrinsics
             | 
              | Depends on the use case, but yes, it can be complicated
              | due to lack of support in hardware. I've heard AVX512
              | fixed that to an extent, but I don't have experience with
              | that tech.
             | 
             | > perhaps via something like Eigen
             | 
              | I do, but sometimes I can outperform it substantially.
              | It's optimized for large vectors. In some cases,
              | intrinsics can be faster, and in my line of work I
              | encounter a lot of these cases. Very small matrices like
              | 3x3 and 4x4 fit completely in registers. Larger square
              | matrices of size 8 or 24, and tall matrices with a small
              | fixed count of columns, don't fit there, but a complete
              | row does, saving a lot of RAM latency when dealing with
              | them.
             | 
             | > to avoid all of the ugliness of manual use of intrinsics
             | 
             | I don't believe they are ugly; I think they just have a
             | steep learning curve.
             | 
             | > I suspect it would be a more pleasant experience than
             | intrinsics for a reduction operation of that sort over
             | 32-bit datatypes
             | 
              | Here's an example of how to compute an FP32 dot product
              | with intrinsics:
              | https://stackoverflow.com/a/59495197/126995 I doubt
              | ISPC's reduction is going to result in similar code. Even
              | clang's automatic vectorizer (which I have a high opinion
              | of) is not doing that kind of stuff with multiple
              | independent accumulators.
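              |
              | A sketch of that multiple-accumulators pattern (a
              | hypothetical dotProduct; assumes FMA support and a length
              | that's a multiple of 32 floats):
              |
              |   #include <immintrin.h>
              |   #include <stddef.h>
              |
              |   float dotProduct( const float* a, const float* b,
              |       size_t n )
              |   {
              |       // 4 independent accumulators hide the 4-5 cycle
              |       // FMA latency: the 4 chains run in parallel
              |       __m256 a0 = _mm256_setzero_ps();
              |       __m256 a1 = a0, a2 = a0, a3 = a0;
              |       for( size_t i = 0; i < n; i += 32 )
              |       {
              |           a0 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i ),
              |               _mm256_loadu_ps( b + i ), a0 );
              |           a1 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i + 8 ),
              |               _mm256_loadu_ps( b + i + 8 ), a1 );
              |           a2 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i + 16 ),
              |               _mm256_loadu_ps( b + i + 16 ), a2 );
              |           a3 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i + 24 ),
              |               _mm256_loadu_ps( b + i + 24 ), a3 );
              |       }
              |       // Reduce 4 vectors to 1, then 8 lanes to a scalar
              |       __m256 s8 = _mm256_add_ps( _mm256_add_ps( a0, a1 ),
              |           _mm256_add_ps( a2, a3 ) );
              |       __m128 s4 = _mm_add_ps( _mm256_castps256_ps128( s8 ),
              |           _mm256_extractf128_ps( s8, 1 ) );
              |       s4 = _mm_add_ps( s4, _mm_movehl_ps( s4, s4 ) );
              |       s4 = _mm_add_ss( s4, _mm_movehdup_ps( s4 ) );
              |       return _mm_cvtss_f32( s4 );
              |   }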
        
               | creato wrote:
               | > Even clang's automatic vectorizer (which I have a high
               | opinion of) is not doing that kind of stuff with multiple
               | independent accumulators.
               | 
               | I think it does? I see Clang unroll reductions into
               | multiple accumulators quite often.
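                |
                | For example, a plain reduction like this (a minimal
                | hypothetical case; built with -O3 -mavx2 -ffast-math,
                | since reassociating the adds needs fast-math):
                |
                |   #include <stddef.h>
                |
                |   float sum( const float* p, size_t n )
                |   {
                |       float s = 0;
                |       // clang vectorizes this loop and typically
                |       // interleaves several vector accumulators
                |       for( size_t i = 0; i < n; i++ )
                |           s += p[ i ];
                |       return s;
                |   }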
        
               | atom3 wrote:
                | > Here's an example of how to compute an FP32 dot
                | product with intrinsics:
                | https://stackoverflow.com/a/59495197/126995 I doubt
                | ISPC's reduction is going to result in similar code.
                | Even clang's automatic vectorizer (which I have a high
                | opinion of) is not doing that kind of stuff with
                | multiple independent accumulators.
               | 
                | ISPC lets you request that the gang size be larger than
                | the vector size [1] to get 2 accumulators out of the
                | box. If having more accumulators is crucial, you can
                | have them at the cost of not using idiomatic ispc, but
                | I'd argue the resulting code is still more readable.
                | 
                | I'm no expert, so there might be flaws that I don't
                | see, but the generated code looks good to me. The main
                | difference I see is that ISPC does more unrolling
                | (which may be better?).
               | 
               | Here is the reference implementation:
               | https://godbolt.org/z/MxT1Kedf1
               | 
               | Here is the ISPC implementation:
               | https://godbolt.org/z/qcez47GT5
               | 
               | [1] https://ispc.github.io/perfguide.html#choosing-a-
               | target-vect...
        
               | Const-me wrote:
               | > Here is the ISPC implementation
               | 
                | Line 36 computes ymm6 = (ymm6 * mem) + ymm4; the next
                | instruction on line 37 computes ymm6 = (ymm8 * mem) +
                | ymm6.
                | 
                | These two instructions form a dependency chain. The CPU
                | can't start the instruction on line 37 before the one
                | on line 36 has produced a result. That's going to take
                | 5-6 CPU cycles depending on the CPU model. The same
                | happens for the ymm5 vector between the instructions on
                | lines 38 and 41, and in a few other places.
                | 
                | In the reference code all 4 FMA instructions in the
                | body of the loop are independent of each other, so a
                | CPU will run all 4 of them in parallel. The data
                | dependencies are across loop iterations only, so the
                | complete loop is limited to 4-5 cycles/iteration.
                | That's OK because the throughput limit (probably not
                | the FMA throughput though; I think load port throughput
                | is saturated before FMA, especially for unaligned
                | inputs) is smaller than that.
        
               | atom3 wrote:
               | Oh right, I didn't think of looking for that, guess
               | you're right and doing things by hand is still better
        
               | Const-me wrote:
               | It's not terribly bad because CPUs are out-of-order. As
               | far as I can tell, there's no single dependency chain
               | over all instructions in the loop body, some of these
               | FMAs gonna run in parallel in your ISPC version. Still, I
               | would expect manually-vectorized code to be slightly
               | faster.
        
         | Scaevolus wrote:
         | Switching compilers is often too high-risk, but there are
         | header-only libraries that get you most of the same benefits
         | with normal C++ and wrappers around the intrinsics:
         | https://github.com/richgel999/CppSPMD_Fast
        
       | kolbe wrote:
        | I would at least appreciate a disclaimer that the vast majority
        | of these optimizations could be accomplished by encouraging the
        | compilers to vectorize the assembly. You said in a footnote
        | that compilers will only do these optimizations when they're
        | extremely simple and rarely on integers, but I have not found
        | that to be the case. -O3 and -mavx do an amazing job for most
        | use cases. And more to the point, there are other tricks that I
        | think it's better to turn to (like using the __restrict
        | keyword) before you take the fairly steep step of coding the
        | SIMD intrinsics yourself.
        | 
        | It's cool to learn these things. And it's downright important
        | to learn these things once you're experienced enough, because
        | you have to use them at some point if you're in the game of
        | optimization. But also I would feel pretty bad if some kid out
        | there wasted a week on a project at work (and got reprimanded
        | for it) that could have been accomplished with a couple of
        | compiler flags, you know?
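        |
        | A minimal sketch of the __restrict point (a hypothetical
        | addArrays function): promising the compiler that the arrays
        | can't alias lets -O3 -mavx auto-vectorize a plain loop:
        |
        |   #include <stddef.h>
        |
        |   void addArrays( float* __restrict dst,
        |       const float* __restrict a,
        |       const float* __restrict b, size_t n )
        |   {
        |       // Without __restrict the compiler must assume dst may
        |       // overlap a or b, so it emits runtime overlap checks
        |       // or scalar code; with it, this becomes vmovups +
        |       // vaddps over 8 floats per iteration
        |       for( size_t i = 0; i < n; i++ )
        |           dst[ i ] = a[ i ] + b[ i ];
        |   }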
        
         | Const-me wrote:
         | They're getting better over time, especially for
         | floats/doubles, but I still find them limited even for simple
         | use cases.
         | 
          | Here's an example of the auto-vectorizer in clang 12, which I
          | believe represents the state of the art at the moment:
          | https://godbolt.org/z/6Pe33187W It automatically vectorized
          | the loop and even unrolled it; however, I think the code
          | bottlenecks on shuffles, not on memory loads. There are just
          | too many instructions in the loop, and that vpmovzxbq
          | instruction can only run on port 5 on Skylake.
          | 
          | Compare the assembly with the manually vectorized version
          | from an answer on stackoverflow: https://godbolt.org/z/do5e3-
        
         | janwas wrote:
         | +1 for restrict, that certainly helps. Out of curiosity, what's
         | your use case where autovectorization works well?
         | 
         | Personally, I have often been disappointed. Not much progress
         | in 2 years:
         | http://www.0x80.pl/notesen/2021-01-18-autovectorization-gcc-...
        
           | kolbe wrote:
            | I'll admit that there are times when I am stunned at how
            | well the compiler will optimize, and times when I'm stunned
            | at how poorly it does. It would seem you can never make an
            | assumption about what will happen, hence my addiction to
            | godbolt. I don't have my exact code, but I work heavily
            | with math operations, so it may be that I encounter
            | especially easy types of loops to vectorize.
        
       | andyxor wrote:
       | If you liked this post you may also like:
       | 
       | SIMD in Java: https://news.ycombinator.com/item?id=14636802
       | (archived version https://archive.is/C5iZA)
       | 
       | SIMD in Rust: https://news.ycombinator.com/item?id=10111729
       | 
       | SIMD in Python: https://news.ycombinator.com/item?id=10470428
       | 
       | SIMD in Javascript: https://news.ycombinator.com/item?id=8533350
       | 
       | Using SIMD to aggregate billions of values per second:
       | https://news.ycombinator.com/item?id=22803504
       | 
       | Towards fearless SIMD:
       | https://news.ycombinator.com/item?id=18293209
       | 
       | First Impressions of ARM SIMD Programming:
       | https://news.ycombinator.com/item?id=19490542
        
       | dkersten wrote:
       | Perfect timing, I was just about to start looking into SIMD again
       | as my toy game engine is almost at a point where I want to see if
       | I can vectorize some of my processing (much of it is already
       | stored in SOA format, so hopefully it won't be too much trouble).
       | I'm thinking especially about tasks like frustum culling, but
       | also other things. We'll see, after I read this :) I've used SIMD
       | intrinsics before, but I could definitely do with a refresher!
        
         | Const-me wrote:
          | For videogame applications, look here before writing these
          | intrinsics yourself: https://github.com/microsoft/DirectXMath/
          | That library already implements a lot of complicated things,
          | relatively well.
          | 
          | Here's the code for frustum culling:
          | https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di...
          | It's relatively inefficient when you have many boxes to test
          | against the same frustum, but (a) the compiler may inline and
          | optimize it, and (b) failing that, it's easy to copy-paste
          | and optimize manually: compute these 6 planes once and call
          | the BoundingBox::ContainedBy method yourself.
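          |
          | A sketch of that approach with DirectXMath (countVisible is
          | a hypothetical helper):
          |
          |   #include <DirectXCollision.h>
          |   #include <cstddef>
          |   using namespace DirectX;
          |
          |   size_t countVisible( const BoundingFrustum& frustum,
          |       const BoundingBox* boxes, size_t count )
          |   {
          |       // Extract the 6 frustum planes once, outside the loop
          |       XMVECTOR planes[ 6 ];
          |       frustum.GetPlanes( &planes[ 0 ], &planes[ 1 ],
          |           &planes[ 2 ], &planes[ 3 ], &planes[ 4 ],
          |           &planes[ 5 ] );
          |
          |       // Test every box against the same 6 planes
          |       size_t visible = 0;
          |       for( size_t i = 0; i < count; i++ )
          |           if( boxes[ i ].ContainedBy(
          |                   planes[ 0 ], planes[ 1 ], planes[ 2 ],
          |                   planes[ 3 ], planes[ 4 ],
          |                   planes[ 5 ] ) != DISJOINT )
          |               visible++;
          |       return visible;
          |   }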
        
           | dkersten wrote:
            | Thanks, I'll take a look. Although, it's a for-fun engine,
            | so I may try it myself anyway just to learn. I'll see.
            | Either way, thanks for the link, very interesting!
           | 
           | As for frustum culling, that code seems to do one bounding
           | box at a time? Or am I misunderstanding? I was planning to
           | try to do 4 (or however many) checks at a time. I'm ok with
           | checking against bounding spheres too if that makes it easier
           | to vectorize.
        
             | Const-me wrote:
             | > that code seems to do one bounding box at a time?
             | 
             | Yep, most parts of that library were designed for doing one
             | thing at a time.
             | 
              | Generally speaking, the HPC-style SoA approach can be
              | faster, especially if you have AVX. But there's a price
              | for that, most importantly code complexity, but some
              | performance-related things as well: RAM access pattern,
              | uploading to VRAM for rendering.
             | 
             | > I was planning to try to do 4 (or however many) checks at
             | a time
             | 
             | I would start with whatever code is in that library, and
             | only optimize if profiler says so.
             | 
              | They have a sphere-versus-frustum test too; it's a
              | similar one, i.e. it also tests against these 6 planes,
              | and it might be slightly more efficient than boxes.
        
       | janwas wrote:
       | Nice writeup with helpful diagrams, thanks for sharing!
       | 
       | Readers might also find this short intro [1] helpful, including
       | tips on porting. (Disclosure: author)
       | 
       | 1:
       | https://github.com/google/highway/blob/master/g3doc/highway_...
       | 
       | > many available instructions are missing from the wrappers
       | 
       | Highway can interop with platform-specific intrinsics (on
       | x86/ARM, hwy_vec.raw is the native intrinsic type).
       | 
       | > vectorized integer math often treats vectors as having
       | different lanes count on every line of code
       | 
       | Fair point, that's a cost of type safety. We usually write `auto`
       | to avoid spelling it out.
        
       | mettamage wrote:
        | Probably an HN article on its own, but also related [1]. It's
        | about timestamp parsing using SIMD instructions (among other
        | optimizations). When I had a toy HFT project, I noticed that
        | this type of thinking is needed.
       | 
       | [1] https://kholdstare.github.io/technical/2020/05/26/faster-
       | int...
        
       | dragontamer wrote:
       | https://software.intel.com/sites/landingpage/IntrinsicsGuide...
       | 
        | Anyone who is using SSE / AVX / AVX512 intrinsics should
        | probably know about Intel's excellent Intrinsics Guide. The
        | Intel guide is a reference; this PDF is a tutorial. So both
        | resources will be helpful to anyone seriously doing SIMD on
        | the CPU.
        
         | janwas wrote:
         | Nice. For anyone interested in ARM, they also have a guide with
         | diagrams [1] and a searchable reference [2].
         | 
         | 1:
         | https://developer.arm.com/documentation/102159/0400/Permutat...
         | 2: https://developer.arm.com/architectures/instruction-
         | sets/sim...
        
         | mjsir911 wrote:
         | I also use this for nice visualizations while learning:
         | 
         | https://www.officedaytime.com/simd512e/
        
       | Const-me wrote:
       | Updated my article from 2019.
       | 
        | It's not limited to C++; it's equally good for C.
       | 
        | Over time, support is slowly arriving in other languages too,
       | like C#: https://docs.microsoft.com/en-
       | us/dotnet/api/system.runtime.i... https://docs.microsoft.com/en-
       | us/dotnet/api/system.runtime.i...
        
         | pjmlp wrote:
         | And D as well, although the support varies a bit across all
         | three backends.
        
         | bobthedino wrote:
         | SIMD support is also currently in development for Java:
         | https://openjdk.java.net/jeps/338
        
         | smhenderson wrote:
         | It's a really well written article, thank you for the work!
        
         | gameswithgo wrote:
          | Rust as well, and the intrinsics are identically named, so
          | your tutorial is good for Rust too.
        
       ___________________________________________________________________
       (page generated 2021-04-27 23:01 UTC)