[HN Gopher] Unsigned comparisons in AVX2/SSE: a quick note
___________________________________________________________________
Unsigned comparisons in AVX2/SSE: a quick note
Author : todsacerdoti
Score : 50 points
Date : 2022-08-26 07:50 UTC (15 hours ago)
(HTM) web link (outerproduct.net)
(TXT) w3m dump (outerproduct.net)
| pyler wrote:
| What do gcc/clang do?
| [deleted]
| Const-me wrote:
| > simply add 128 (or however much--depending on the sizes at
| hand) to each of your inputs before comparing
|
| On many CPUs, bitwise XOR is slightly more efficient than
| addition. But you still need the magic number.
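|
| For illustration, a minimal sketch of the bias trick with AVX2
| intrinsics (untested; assumes unsigned 8-bit lanes and a
| hypothetical cmpgt_epu8 helper name):
|
|   #include <immintrin.h>
|
|   // unsigned a > b: flip the sign bit of both inputs, then
|   // use the signed greater-than compare
|   __m256i cmpgt_epu8(__m256i a, __m256i b) {
|       const __m256i bias = _mm256_set1_epi8((char)0x80);
|       return _mm256_cmpgt_epi8(_mm256_xor_si256(a, bias),
|                                _mm256_xor_si256(b, bias));
|   }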
|
| > and you are in a hostile environment, you will have to figure
| out how to load up a vector of 128s, which costs cycles
|
| That particular vector can be generated with 2 instructions
| without RAM access, pcmpeqd to generate a vector with all bits
| set, and psllw/pslld for shifts.
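|
| E.g. for 32-bit lanes, something like this (untested sketch;
| byte lanes need one extra instruction since there is no 8-bit
| shift):
|
|   __m256i v = _mm256_undefined_si256();
|   __m256i ones = _mm256_cmpeq_epi32(v, v);     // all bits set
|   __m256i bias = _mm256_slli_epi32(ones, 31);  // 0x80000000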
|
| Modern compilers support LTCG/LTO which optimizes code across
| translation units. If you have a loop comparing these vectors,
| the magic vector is likely to be created outside, and kept in a
| register.
|
| > To avoid this, use min
|
| Yeah, but if you need to compare for a < b as opposed to a <=
| b, you're going to need a couple more instructions.
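|
| E.g. for unsigned bytes (sketch): min + cmpeq gives a <= b
| directly, while strict a < b costs one extra inversion:
|
|   // a <= b: min(a, b) == a
|   __m256i le = _mm256_cmpeq_epi8(_mm256_min_epu8(a, b), a);
|   // a < b: compute a >= b, then invert with an all-ones xor
|   __m256i ge = _mm256_cmpeq_epi8(_mm256_max_epu8(a, b), a);
|   __m256i lt = _mm256_xor_si256(ge, _mm256_cmpeq_epi8(a, a));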
|
| > if anyone working on superoptimisation has caught these?
|
| Page #17 there: http://const.me/articles/simd/simd.pdf
| vardump wrote:
| > On many CPUs, bitwise XOR is slightly more efficient than
| addition.
|
| That's pretty interesting. Any examples of CPUs (or
| microcontrollers) where this happens?
| Const-me wrote:
| An example is AMD Zen 2. The latency of both vpaddd and vpxor
| is 1 cycle, but the reciprocal throughput differs: 0.33 versus
| 0.25 cycles per instruction.
|
| BTW, Zen 2 cores are used in both the Xbox Series X/S and the
| PS5.
| vardump wrote:
| That's pretty interesting. So AMD has split the traditional
| ALU across different execution units? I mean, usually the same
| ALU(s) that have an adder also have the logical ops.
| Const-me wrote:
| Zen 2 cores have 4 execution units which run arithmetical
| and logical SIMD instructions: https://en.wikichip.org/wi
| ki/amd/microarchitectures/zen_2#Fl...
|
| Apparently 3 out of 4 units can add integers, but all 4
| of them can do bitwise operations. Although the ISA defines
| multiple equivalent bitwise instructions like pand / andps /
| andpd, modern AMD CPUs apparently don't have a bypass delay,
| so these instructions are 100% equivalent on those CPUs. And
| with some luck (when not bottlenecked on instruction fetch,
| decode, or elsewhere in the pipeline) each Zen 2 core can run
| 4 of them every cycle.
| anonymoushn wrote:
| > Yeah, but if you need to compare for a < b as opposed to a
| <= b, you're going to need a couple more instructions.
|
| In general yes. Depending on your use for the mask, it might be
| fine to compute !(a < b) == (b <= a) and use the mask with
| andnot.
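|
| E.g. when the mask is only used to zero out lanes (sketch,
| unsigned bytes):
|
|   // keep x where a < b: compute the complementary a >= b
|   // mask and let andnot do the inversion for free
|   __m256i ge = _mm256_cmpeq_epi8(_mm256_max_epu8(a, b), a);
|   __m256i kept = _mm256_andnot_si256(ge, x);  // x where a < b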
| sylware wrote:
| Yep, some constants are better generated than loaded; see
| Agner Fog's optimization docs.
|
| On a more global scale, all these assembly
| "optimisations"/tricks are hidden deep inside compilers, which
| are reasonably "transparent" only to their own devs, due to
| their abysmal complexity and size.
|
| We would need some sort of online library of those assembly
| tricks (boolean/branchless calculus...).
|
| A job for wikipedia? Maybe linked to the maths/boolean
| calculus?
| Const-me wrote:
| I think the compiler tricks are reasonably transparent to any
| developer who knows both C and assembly, thanks to
| godbolt.org.
|
| A couple of times I've even back-ported these tricks from
| clang-generated assembly into C++ intrinsics.
| janwas wrote:
| github.com/google/highway is arguably such a collection of
| tricks; the x86_256-inl.h header does emulate some missing
| AVX-512 instructions, such as this one or compressstoreu. (In
| this case we currently use a constant.)
| pjungwir wrote:
| If you're interested in this, _Hacker's Delight_ has several
| hundred pages of these tricks. The book picks up from a
| 1972 MIT memo called HAKMEM. For a database/web guy like me,
| reading it is just recreational, but that's probably true for
| most people here. I'm currently about halfway through one
| hundred pages on how to divide. :-)
| Const-me wrote:
| I sometimes write optimized low-level code, but many of
| these books and internet articles are rather old. CPU
| architectures have converged to just two, AMD64 and ARM64.
| They have tons of useful instructions like sign extension
| and integer division.
|
| They also have less trivial instructions equivalent to some
| of these tricks, only faster. A couple of examples:
|
| To turn off the rightmost 1-bit in a word, on AMD64 there's
| the BLSR instruction from the BMI1 set.
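|
| For example (sketch; _blsr_u64 needs BMI1, i.e. -mbmi):
|
|   #include <immintrin.h>
|   #include <stdint.h>
|
|   uint64_t clear_lowest_classic(uint64_t x) { return x & (x - 1); }
|   uint64_t clear_lowest_blsr(uint64_t x) { return _blsr_u64(x); }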
|
| ARM CPUs have a fast instruction to reverse bits. AMD64 does
| not, but it has a fast instruction to flip the order of bytes
| (bswap), which allows reversing bits faster than what's in
| that book.
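|
| A sketch of that byte-swap approach (untested; assumes GCC/
| Clang's __builtin_bswap64):
|
|   #include <stdint.h>
|
|   uint64_t reverse_bits(uint64_t x) {
|       x = __builtin_bswap64(x);  // reverse the byte order
|       // then reverse the bits inside each byte: swap
|       // nibbles, then bit pairs, then single bits
|       x = ((x & 0x0F0F0F0F0F0F0F0Full) << 4)
|         | ((x >> 4) & 0x0F0F0F0F0F0F0F0Full);
|       x = ((x & 0x3333333333333333ull) << 2)
|         | ((x >> 2) & 0x3333333333333333ull);
|       x = ((x & 0x5555555555555555ull) << 1)
|         | ((x >> 1) & 0x5555555555555555ull);
|       return x;
|   }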
|
| These tricks are only very rarely useful now. For instance,
| SIMD instructions can't divide vectors of integers, and some
| older GPUs can't divide FP64 numbers (though most of them can
| multiply FP64, and all of them can divide FP32). For such
| exotic use cases, these tricks are still relevant.
| muziq wrote:
| If you're not dividing numbers greater than 2^23, can't you
| just use the FP divide? I'm sure I've done this in CUDA many
| years back... Worked a treat :) If you are, then you have my
| sympathies...
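|
| Something like this with AVX intrinsics (untested sketch;
| only exact while both values stay exactly representable in
| FP32):
|
|   // per-lane integer division via single-precision divide
|   __m256 fa = _mm256_cvtepi32_ps(a);
|   __m256 fb = _mm256_cvtepi32_ps(b);
|   __m256i q =
|       _mm256_cvttps_epi32(_mm256_div_ps(fa, fb));  // trunc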
| Const-me wrote:
| That's often a good way, but not necessarily the best
| one. On my computer with a Zen 3 CPU, cvtdq2ps and cvttps2dq
| have 3 cycles of latency each and divps up to 10, which
| results in 16 CPU cycles in total.
|
| When the divisor's the same for all lanes and known at
| compile time, there are tricks to reduce division to a
| single integer multiplication, and a few extra cheap
| instructions like shifts and additions. Usually faster
| than 16 cycles. Here's an example, automagically made by
| clang 13: https://godbolt.org/z/T4TKb7oov
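|
| A scalar sketch of the same trick for divisor 7 (roughly what
| compilers emit; the magic constant is ceil(2^32 / 7)):
|
|   #include <stdint.h>
|
|   // x / 7 without a divide: multiply-high by ceil(2^32 / 7),
|   // then a small shift-and-add fixup
|   uint32_t div7(uint32_t x) {
|       uint32_t t =
|           (uint32_t)(((uint64_t)x * 0x24924925u) >> 32);
|       return (t + ((x - t) >> 1)) >> 2;
|   }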
| moonchild wrote:
| > if you need to compare for a < b as opposed to a <= b,
| you're going to need a couple more instructions
|
| Depends on what you are doing--you may not need to spend any
| extra ops at all. For instance, if you are blending then you
| can compute a >= b and commute the blend arguments. If you are
| branching, then you pick the complementary flag to branch on.
| Etc.
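|
| E.g. for a blend (sketch, unsigned bytes): rather than
| computing a < b and inverting it, compute a >= b and swap the
| blend operands:
|
|   // want: r = (a < b) ? x : y
|   __m256i ge = _mm256_cmpeq_epi8(_mm256_max_epu8(a, b), a);
|   __m256i r = _mm256_blendv_epi8(x, y, ge);  // y where a >= b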
| anonymoushn wrote:
| I'm not working on superoptimization, but yes it does seem like
| eq(x,max_unsigned(x,y)) is an "obvious" cmpge_unsigned if one
| just browses all the intrinsics available here:
| https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
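|
| In intrinsics that identity is simply (sketch, for bytes):
|
|   // unsigned x >= y: x equals max(x, y)
|   __m256i ge = _mm256_cmpeq_epi8(_mm256_max_epu8(x, y), x);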
|
| I would like to read more about superoptimizers targeting AVX2,
| since previously I've complained that gcc will not make
| transformations between various equivalent instruction sequences
| for moving bytes around within and between registers, and some of
| these are a lot slower than others.
| dzaima wrote:
| For what it's worth, clang[0] compiles to the vpminu[bwd] +
| vpcmpeq[bwd] version even from manually written intrinsics of the
| addition-based version, and has since clang 7. Though,
| interestingly enough, it fails to cancel out a following xor.
|
| (and, for completeness, a regular C loop[1], with some massaging
| to make it more readable)
|
| [0]: https://godbolt.org/z/e7TE5P73Y
|
| [1]: https://godbolt.org/z/xhj3WTnxv
| moonchild wrote:
| That's unfortunate. The solution with min actually does equal
| work in this case--it just has a lesser startup cost, hence my
| mention of a 'hostile environment'--but with greater span,
| because the high-bit toggling can be done for both inputs in
| parallel. If you switch to a loop, at the very least, it ought
| to use the traditional solution, but it doesn't. Also it
| doesn't use the trick with the saturating subtract.
| https://godbolt.org/z/76qKs49eW
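|
| The saturating-subtract version would be something like this
| (sketch, unsigned bytes): a <= b exactly when the saturating
| difference a - b is zero:
|
|   __m256i diff = _mm256_subs_epu8(a, b);
|   __m256i le =
|       _mm256_cmpeq_epi8(diff, _mm256_setzero_si256());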
___________________________________________________________________
(page generated 2022-08-26 23:02 UTC)