[HN Gopher] SIMD for C++ Developers [pdf]
___________________________________________________________________
SIMD for C++ Developers [pdf]
Author : Const-me
Score : 101 points
Date : 2021-04-27 12:55 UTC (10 hours ago)
(HTM) web link (const.me)
(TXT) w3m dump (const.me)
| secondcoming wrote:
| What timing! This is probably better for StackOverflow, but is
| there a way to AND two AVX operands and also get the ZF flag set
| if the result is zero?
|
| It seems like there's one intrinsic to do the AND but this
| doesn't set ZF. [0]
|
| But there's another intrinsic that will set ZF but doesn't
| actually store the result of the AND operation [1].
|
| [0] vpand ymm, ymm, ymm
|
| [1] vtestpd ymm, ymm
|
| I'm guessing that either a) I'm missing an instruction, or b)
| having to modify EFLAGS from AVX instructions incurs a large
| penalty and so it's not advisable?
| Const-me wrote:
| I don't think you're missing an instruction. A few comments,
| still.
|
| Bitwise instructions are very cheap, 1 cycle of latency.
| Skylake can run 3 of them every clock, Zen 2 can run 4 of them
| per clock. I wouldn't worry about that extra vpand instruction
| too much.
|
| About vptest, the latency is not great, 6-7 cycles. If you're
| going to branch on the outcome, and the branch is not
| predictable (your code takes a random branch every time the
| instruction at a specific address runs), you're going to waste
| time. Sometimes that's unavoidable, like when your goal is
| something similar to the memchr() function (however, I'd
| recommend _mm256_movemask_epi8 instead for that). But other
| times it's possible to rework into something better: mask lanes
| with _mm256_blendv_[something], zero out stuff with bitwise
| AND, that kind of thing.
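|
| For example, a minimal sketch (untested; handleZero and use are
| placeholders): compute the AND once to keep the result, then
| branch on _mm256_testz_si256 over the same inputs, which
| compiles to vptest and sets ZF when (a & b) == 0:
|
|     #include <immintrin.h>
|     // AVX2 for the integer AND, AVX for the integer vptest
|     __m256i r = _mm256_and_si256( a, b ); // vpand: keep result
|     if( _mm256_testz_si256( a, b ) )     // ZF = ((a & b) == 0)
|         handleZero();
|     else
|         use( r );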
| [deleted]
| xpuente wrote:
| Why not tools like https://github.com/ispc?
|
| This seems really close to the metal: likely either a non-
| negligible maintenance cost, or not being able to fully exploit
| the hardware in use.
| vgatherps wrote:
| I've used ISPC before, as well as enoki (sort of like ISPC-
| in-c++), and found that they have a lot of sharp performance
| edges.
|
| My experience with both was that the further I moved away from
| the super classic SIMD cases, the more I ran into crazy
| compiler cliffs where tiny tweaks would blow up the codegen. In
| each case I gave up, reimplemented what I wanted directly in
| C++ (the second time using Agner Fog's wonderful vector class
| library), and easily got the results I wanted without a ton of
| finagling with the compiler and libraries.
| bjourne wrote:
| It doesn't always emit optimal SIMD code. Plus, when you get
| the hang of it, writing your own SIMD library is fairly simple
| so you don't need a tool for it. C++ templates and operator
| overloading really shines here. For example, you can write
| sqrt(x*y+z) and have the the template system select the most
| optimal SIMD intrinsics depending on whether x, y, and z are
| int, float, int16, float8, double4, etc.
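|
| A minimal sketch of that idea (a hypothetical wrapper, not a
| real library): a thin struct around __m256 with overloaded
| operators, so generic code like sqrt(x*y+z) maps onto
| intrinsics:
|
|     #include <immintrin.h>
|     // Hypothetical float8 wrapper; a double4 would wrap
|     // __m256d the same way, and overload resolution then
|     // picks the right instructions.
|     struct float8 { __m256 v; float8( __m256 x ) : v( x ) {} };
|     inline float8 operator*( float8 a, float8 b )
|     { return _mm256_mul_ps( a.v, b.v ); }
|     inline float8 operator+( float8 a, float8 b )
|     { return _mm256_add_ps( a.v, b.v ); }
|     inline float8 sqrt( float8 a )
|     { return _mm256_sqrt_ps( a.v ); }
|     // sqrt( x * y + z ) now compiles to vmulps/vaddps/vsqrtps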
| janwas wrote:
| +1 to intrinsics or wrappers giving us more control over
| performance.
|
| > Plus, when you get the hang of it, writing your own SIMD
| library is fairly simple
|
| hm.. it's indeed easy to start, but maintaining
| https://github.com/google/highway (supports clang/gcc/MSVC,
| x86/ARM/RISC-V) is quite time-consuming, especially working
| around compiler bugs.
| Const-me wrote:
| Harder to use. That's another language, which requires a
| special compiler from Intel. The intrinsics are already
| supported in all modern C and C++ compilers, with little to no
| project setup.
|
| For many practical problems, the ISPC's abstraction is not a
| good fit. It's good for linear algebra with long vectors and
| large matrices, but SIMD is useful for many other things
| besides that. A toy problem: count the spaces in a 4 GB-long
| buffer in memory. I'm pretty sure manually written SSE2 or
| AVX2 code (inner loop doing _mm_cmpeq_epi8 and _mm_sub_epi8,
| outer one doing _mm_sad_epu8 and _mm_add_epi64) is going to
| be faster than an ISPC-made version.
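|
| A minimal sketch of that nested-loop trick (untested;
| countSpaces is a hypothetical name, assumes a 64-bit target and
| a length divisible by 16): cmpeq yields 0xFF = -1 per matching
| byte, so subtracting it counts matches into byte lanes; capping
| the inner loop at 255 iterations keeps the byte counters from
| overflowing before _mm_sad_epu8 folds them into the 64-bit
| accumulator:
|
|     #include <emmintrin.h>
|     #include <stdint.h>
|
|     uint64_t countSpaces( const uint8_t* p, size_t n )
|     {
|         const __m128i space = _mm_set1_epi8( ' ' );
|         __m128i acc = _mm_setzero_si128();
|         size_t vecs = n / 16;
|         while( vecs > 0 )
|         {
|             // Byte counters, safe for up to 255 increments
|             __m128i counts = _mm_setzero_si128();
|             size_t inner = vecs < 255 ? vecs : 255;
|             vecs -= inner;
|             for( size_t i = 0; i < inner; i++, p += 16 )
|             {
|                 const __m128i v =
|                     _mm_loadu_si128( ( const __m128i* )p );
|                 counts = _mm_sub_epi8( counts,
|                     _mm_cmpeq_epi8( v, space ) );
|             }
|             // Fold 16 byte counters into two uint64_t lanes
|             acc = _mm_add_epi64( acc,
|                 _mm_sad_epu8( counts, _mm_setzero_si128() ) );
|         }
|         const __m128i high = _mm_unpackhi_epi64( acc, acc );
|         return (uint64_t)_mm_cvtsi128_si64( acc )
|              + (uint64_t)_mm_cvtsi128_si64( high );
|     }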
| mattpharr wrote:
| > It's good for linear algebra with long vectors and large
| matrices, but SIMD is useful for many other things besides
| that
|
| The main goal in ispc's design was to support SPMD (single
| program multiple data) programming, which is more general
| than pure SIMD. Handling the relatively easy cases of (dense)
| linear algebra that are easily expressed in SIMD wasn't a
| focus as it's pretty easy to do in other ways.
|
| Rather, ispc is focused on making it easy to write code with
| divergent control flow over the vector lanes. This is
| especially painful to do in intrinsics, especially in the
| presence of nested divergent control flow. If you don't have
| that, you might as well use explicit SIMD, though perhaps via
| something like Eigen in order to avoid all of the ugliness of
| manual use of intrinsics.
|
| > I'm pretty sure manually written SSE2 or AVX2 code (inner
| loop doing _mm_cmpeq_epi8 and _mm_sub_epi8, outer one doing
| _mm_sad_epu8 and _mm_add_epi64)
|
| ispc is focused on 32-bit datatypes, so I'm sure that is
| true. I suspect it would be a more pleasant experience than
| intrinsics for a reduction operation of that sort over 32-bit
| datatypes, however.
| Const-me wrote:
| > This is especially painful to do in intrinsics
|
| It depends on the use case, but yes, it can be complicated
| due to lack of support in the hardware. I've heard AVX512
| fixed that to an extent, but I don't have experience with
| that tech.
|
| > perhaps via something like Eigen
|
| I do use Eigen, but sometimes I can outperform it
| substantially. It's optimized for large vectors; in some
| cases intrinsics can be faster, and in my line of work I
| encounter a lot of these cases. Very small matrices like
| 3x3 and 4x4 fit completely in registers. Larger square
| matrices of size like 8 or 24, and tall matrices with a
| small fixed count of columns, don't fit there, but a
| complete row does, saving a lot of RAM latency when dealing
| with them.
|
| > to avoid all of the ugliness of manual use of intrinsics
|
| I don't believe they are ugly; I think they just have a
| steep learning curve.
|
| > I suspect it would be a more pleasant experience than
| intrinsics for a reduction operation of that sort over
| 32-bit datatypes
|
| Here's an example of how to compute an FP32 dot product
| with intrinsics: https://stackoverflow.com/a/59495197/126995
| I doubt ISPC's reduction is going to result in similar
| code. Even clang's automatic vectorizer (which I have a
| high opinion of) is not doing that kind of stuff with
| multiple independent accumulators.
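|
| For reference, a minimal sketch of that multiple-
| accumulators pattern (untested, not the exact code from the
| answer; assumes FMA3 and a length divisible by 32): four
| independent FMA chains hide the 4-5 cycle FMA latency that
| a single accumulator would serialize on:
|
|     #include <immintrin.h>
|     float dot( const float* a, const float* b, size_t n )
|     {
|         __m256 a0 = _mm256_setzero_ps(), a1 = a0,
|                a2 = a0, a3 = a0;
|         for( size_t i = 0; i < n; i += 32 )
|         {
|             a0 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i ),
|                 _mm256_loadu_ps( b + i ), a0 );
|             a1 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i + 8 ),
|                 _mm256_loadu_ps( b + i + 8 ), a1 );
|             a2 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i + 16 ),
|                 _mm256_loadu_ps( b + i + 16 ), a2 );
|             a3 = _mm256_fmadd_ps( _mm256_loadu_ps( a + i + 24 ),
|                 _mm256_loadu_ps( b + i + 24 ), a3 );
|         }
|         // Reduce 4 vectors, then 8 lanes, into one float
|         const __m256 s = _mm256_add_ps(
|             _mm256_add_ps( a0, a1 ), _mm256_add_ps( a2, a3 ) );
|         __m128 s4 = _mm_add_ps( _mm256_castps256_ps128( s ),
|             _mm256_extractf128_ps( s, 1 ) );
|         s4 = _mm_add_ps( s4, _mm_movehl_ps( s4, s4 ) );
|         s4 = _mm_add_ss( s4, _mm_movehdup_ps( s4 ) );
|         return _mm_cvtss_f32( s4 );
|     }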
| creato wrote:
| > Even clang's automatic vectorizer (which I have a high
| opinion of) is not doing that kind of stuff with multiple
| independent accumulators.
|
| I think it does? I see Clang unroll reductions into
| multiple accumulators quite often.
| atom3 wrote:
| > Here's an example of how to compute an FP32 dot product
| with intrinsics: https://stackoverflow.com/a/59495197/126995
| I doubt ISPC's reduction is going to result in similar
| code. Even clang's automatic vectorizer (which I have a
| high opinion of) is not doing that kind of stuff with
| multiple independent accumulators.
|
| ISPC lets you request that the gang size be larger than
| the vector size [1] to get 2 accumulators out of the box.
| If having more accumulators is crucial, you can have them
| at the cost of not using idiomatic ispc, but I'd argue the
| resulting code is still more readable.
|
| I'm no expert, so there might be flaws that I don't see,
| but the generated code looks good to me; the main
| difference I see is that ISPC does more unrolling (which
| may be better?).
|
| Here is the reference implementation:
| https://godbolt.org/z/MxT1Kedf1
|
| Here is the ISPC implementation:
| https://godbolt.org/z/qcez47GT5
|
| [1] https://ispc.github.io/perfguide.html#choosing-a-
| target-vect...
| Const-me wrote:
| > Here is the ISPC implementation
|
| Line 36 computes ymm6 = (ymm6 * mem) + ymm4, and the next
| instruction on line 37 computes ymm6 = (ymm8 * mem) +
| ymm6.
|
| These two instructions form a dependency chain. The CPU
| can't start the instruction on line 37 before the one on
| line 36 has produced a result. That's going to take 5-6
| CPU cycles depending on the CPU model. The same happens
| for the ymm5 vector between the instructions on lines 38
| and 41, and in a few other places.
|
| In the reference code, all 4 FMA instructions in the body
| of the loop are independent of each other, so a CPU will
| run all 4 of them in parallel. The data dependencies are
| only across loop iterations, so just the complete loop is
| limited to 4-5 cycles/iteration. That's OK because the
| throughput limit (probably not the FMA throughput though;
| I think load port throughput is saturated before FMA,
| especially for unaligned inputs) is smaller than that.
| atom3 wrote:
| Oh right, I didn't think of looking for that. I guess
| you're right, and doing things by hand is still better.
| Const-me wrote:
| It's not terribly bad because CPUs are out-of-order. As
| far as I can tell, there's no single dependency chain
| over all instructions in the loop body, so some of these
| FMAs are going to run in parallel in your ISPC version.
| Still, I would expect manually-vectorized code to be
| slightly faster.
| Scaevolus wrote:
| Switching compilers is often too high-risk, but there are
| header-only libraries that get you most of the same benefits
| with normal C++ and wrappers around the intrinsics:
| https://github.com/richgel999/CppSPMD_Fast
| kolbe wrote:
| I would at least appreciate a disclaimer that the vast majority
| of these optimizations could be accomplished by encouraging the
| compilers to vectorize the assembly themselves. You said in a
| footnote that compilers will only do these optimizations when
| they're extremely simple and rarely on integers, but I have
| not found that to be the case. -O3 and -mavx do an amazing job
| for most use cases. And more to the point, there are other
| tricks I think it's better to turn to (like using the
| __restrict keyword, as sketched below) before you take the
| fairly steep step of coding the SIMD instructions yourself.
|
| It's cool to learn these things. And it's downright important
| to learn them once you're experienced enough, because you have
| to use them at some point if you're in the game of
| optimization. But also, I would feel pretty bad if some kid out
| there wasted a week on a project at work (and got reprimanded
| for it) that could have been accomplished with a couple of
| compiler flags, you know?
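|
| A small example of the kind of hint I mean (a hypothetical
| function, not from the article): without __restrict the
| compiler must assume dst and src may alias and can refuse to
| vectorize; with it, -O3 -mavx typically emits a clean vector
| loop:
|
|     // Promise the compiler dst and src never overlap
|     void scale( float* __restrict dst,
|                 const float* __restrict src, float k, int n )
|     {
|         for( int i = 0; i < n; i++ )
|             dst[ i ] = src[ i ] * k;
|     }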
| Const-me wrote:
| They're getting better over time, especially for
| floats/doubles, but I still find them limited even for simple
| use cases.
|
| Here's an example of the auto-vectorizer in clang 12, which I
| believe represents the state of the art at the moment:
| https://godbolt.org/z/6Pe33187W It automatically vectorized
| the loop and even unrolled it manually, however I think the
| code bottlenecks on shuffles, not on memory loads. There are
| just too many instructions in the loop, and that vpmovzxbq
| instruction can only run on port 5 on Skylake.
|
| Compare the assembly with manually vectorized version from an
| answer on stackoverflow: https://godbolt.org/z/do5e3-
| janwas wrote:
| +1 for restrict, that certainly helps. Out of curiosity, what's
| your use case where autovectorization works well?
|
| Personally, I have often been disappointed. Not much progress
| in 2 years:
| http://www.0x80.pl/notesen/2021-01-18-autovectorization-gcc-...
| kolbe wrote:
| I'll admit that there are times when I am stunned at how
| well the compiler optimizes, and times when I'm stunned at
| how poorly it does. It would seem you can never assume what
| will happen, hence my addiction to godbolt. I don't have my
| exact code at hand, but I work heavily with math operations,
| so it may be that I encounter especially easy types of loops
| to vectorize.
| andyxor wrote:
| If you liked this post you may also like:
|
| SIMD in Java: https://news.ycombinator.com/item?id=14636802
| (archived version https://archive.is/C5iZA)
|
| SIMD in Rust: https://news.ycombinator.com/item?id=10111729
|
| SIMD in Python: https://news.ycombinator.com/item?id=10470428
|
| SIMD in Javascript: https://news.ycombinator.com/item?id=8533350
|
| Using SIMD to aggregate billions of values per second:
| https://news.ycombinator.com/item?id=22803504
|
| Towards fearless SIMD:
| https://news.ycombinator.com/item?id=18293209
|
| First Impressions of ARM SIMD Programming:
| https://news.ycombinator.com/item?id=19490542
| dkersten wrote:
| Perfect timing, I was just about to start looking into SIMD again
| as my toy game engine is almost at a point where I want to see if
| I can vectorize some of my processing (much of it is already
| stored in SOA format, so hopefully it won't be too much trouble).
| I'm thinking especially about tasks like frustum culling, but
| also other things. We'll see, after I read this :) I've used SIMD
| intrinsics before, but I could definitely do with a refresher!
| Const-me wrote:
| For videogame applications, look there before writing these
| intrinsics yourself: https://github.com/microsoft/DirectXMath/
| That library already implements a lot of complicated things,
| relatively well.
|
| Here's the one for frustum culling:
| https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di...
| It's relatively inefficient when you have many boxes to test
| against the same frustum, but (a) the compiler may inline and
| optimize it, and (b) failing that, it's easy to copy-paste and
| optimize manually: compute these 6 planes and call the
| BoundingBox::ContainedBy method yourself.
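|
| A minimal sketch of the usage (untested; boxes, draw and the
| two matrices are placeholders):
|
|     #include <DirectXCollision.h>
|     using namespace DirectX;
|
|     // Build the frustum once per camera, in view space,
|     // then move it into world space
|     BoundingFrustum frustum( projectionMatrix );
|     frustum.Transform( frustum, viewToWorld );
|
|     for( const BoundingBox& box : boxes )
|         if( frustum.Contains( box ) != DISJOINT )
|             draw( box );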
| dkersten wrote:
| Thanks, I'll take a look. Although, it's a for-fun engine,
| so I may try it myself anyway, just to learn. I'll see.
| Either way, thanks for the link, very interesting!
|
| As for frustum culling, that code seems to do one bounding
| box at a time? Or am I misunderstanding? I was planning to
| try to do 4 (or however many) checks at a time. I'm ok with
| checking against bounding spheres too if that makes it easier
| to vectorize.
| Const-me wrote:
| > that code seems to do one bounding box at a time?
|
| Yep, most parts of that library were designed for doing one
| thing at a time.
|
| Generally speaking, the HPC-style SoA approach can be
| faster, especially if you have AVX. But there's a price for
| that: most importantly code complexity, but some
| performance-related things as well, like the RAM access
| pattern and uploading to VRAM for rendering.
|
| > I was planning to try to do 4 (or however many) checks at
| a time
|
| I would start with whatever code is in that library, and
| only optimize if profiler says so.
|
| They have a sphere-versus-frustum test too; it's similar,
| i.e. it also tests against these 6 planes, and might be
| slightly more efficient than the box version.
| janwas wrote:
| Nice writeup with helpful diagrams, thanks for sharing!
|
| Readers might also find this short intro [1] helpful, including
| tips on porting. (Disclosure: author)
|
| 1:
| https://github.com/google/highway/blob/master/g3doc/highway_...
|
| > many available instructions are missing from the wrappers
|
| Highway can interop with platform-specific intrinsics (on
| x86/ARM, hwy_vec.raw is the native intrinsic type).
|
| > vectorized integer math often treats vectors as having
| different lanes count on every line of code
|
| Fair point, that's a cost of type safety. We usually write `auto`
| to avoid spelling it out.
| mettamage wrote:
| Probably worth an HN article of its own, but also related [1]:
| it's about timestamp parsing using SIMD instructions (among
| other optimizations). When I had a toy HFT project, I noticed
| that this type of thinking is needed.
|
| [1] https://kholdstare.github.io/technical/2020/05/26/faster-
| int...
| dragontamer wrote:
| https://software.intel.com/sites/landingpage/IntrinsicsGuide...
|
| Anyone who is using SSE / AVX / AVX512 intrinsics should
| probably know about Intel's excellent Intrinsics Guide. The
| Intel guide is a reference; this PDF is a tutorial. So both
| resources will be helpful to anyone seriously doing SIMD on
| the CPU.
| janwas wrote:
| Nice. For anyone interested in ARM, they also have a guide with
| diagrams [1] and a searchable reference [2].
|
| 1:
| https://developer.arm.com/documentation/102159/0400/Permutat...
| 2: https://developer.arm.com/architectures/instruction-
| sets/sim...
| mjsir911 wrote:
| I also use this for nice visualizations while learning:
|
| https://www.officedaytime.com/simd512e/
| Const-me wrote:
| Updated my article from 2019.
|
| It's not limited to C++; it's equally good for C.
|
| Over time, support is slowly arriving in other languages too,
| like C#: https://docs.microsoft.com/en-
| us/dotnet/api/system.runtime.i... https://docs.microsoft.com/en-
| us/dotnet/api/system.runtime.i...
| pjmlp wrote:
| And D as well, although the support varies a bit across all
| three backends.
| bobthedino wrote:
| SIMD support is also currently in development for Java:
| https://openjdk.java.net/jeps/338
| smhenderson wrote:
| It's a really well written article, thank you for the work!
| gameswithgo wrote:
| Rust as well, and the intrinsics are identically named, so
| your tutorial is good for Rust too.
___________________________________________________________________
(page generated 2021-04-27 23:01 UTC)