[HN Gopher] Unsigned comparisons in AVX2/SSE: a quick note
       ___________________________________________________________________
        
       Unsigned comparisons in AVX2/SSE: a quick note
        
       Author : todsacerdoti
       Score  : 50 points
       Date   : 2022-08-26 07:50 UTC (15 hours ago)
        
 (HTM) web link (outerproduct.net)
 (TXT) w3m dump (outerproduct.net)
        
       | pyler wrote:
        | What do gcc/clang do?
        
         | [deleted]
        
       | Const-me wrote:
       | > simply add 128 (or however much--depending on the sizes at
       | hand) to each of your inputs before comparing
       | 
       | On many CPUs, bitwise XOR is slightly more efficient than
       | addition. But you still need the magic number.
       | 
       | > and you are in a hostile environment, you will have to figure
       | out how to load up a vector of 128s, which costs cycles
       | 
        | That particular vector can be generated with 2 instructions
        | and no RAM access: pcmpeqd to produce a vector with all bits
        | set, then psllw/pslld to shift it.
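A minimal SSE2 sketch of that two-instruction constant generation, applied to an unsigned 16-bit less-than (the helper name `cmplt_epu16` is a hypothetical illustration, not from the thread): pcmpeqd yields all-ones, psllw turns that into 0x8000 per lane, and XOR-ing both inputs with it reduces the problem to a signed pcmpgtw.

```c
#include <emmintrin.h>  /* SSE2 */

/* Unsigned 16-bit a < b via a signed compare.
   The 0x8000 "magic" vector is generated in two instructions,
   with no memory access: pcmpeqd (all bits set) + psllw. */
static __m128i cmplt_epu16(__m128i a, __m128i b) {
    __m128i ones = _mm_cmpeq_epi32(a, a);     /* all-ones */
    __m128i top  = _mm_slli_epi16(ones, 15);  /* 0x8000 in every lane */
    /* flipping the sign bit maps unsigned order onto signed order */
    return _mm_cmpgt_epi16(_mm_xor_si128(b, top), _mm_xor_si128(a, top));
}
```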
       | 
        | Modern compilers support LTCG/LTO, which optimizes code across
        | translation units. If you have a loop comparing these vectors,
        | the magic vector is likely to be hoisted outside the loop and
        | kept in a register.
       | 
       | > To avoid this, use min
       | 
        | Yeah, but if you need to compare for a < b as opposed to a <= b,
        | you're going to need a couple more instructions.
       | 
       | > if anyone working on superoptimisation has caught these?
       | 
       | Page #17 there: http://const.me/articles/simd/simd.pdf
        
         | vardump wrote:
         | > On many CPUs, bitwise XOR is slightly more efficient than
         | addition.
         | 
         | That's pretty interesting, any examples of CPUs (or
         | microcontrollers) where this happens?
        
           | Const-me wrote:
           | An example is AMD Zen 2. The latency of both vpaddd and vpxor
           | is 1 cycle, but throughput is different, 0.33 versus 0.25
           | cycles.
           | 
           | BTW, Zen 2 CPUs are used in both Xbox S/X, and PS5.
        
             | vardump wrote:
              | That's pretty interesting. So AMD has split the
              | traditional ALU across different execution units? Usually
              | the same ALU(s) that have an adder also handle logical ops.
        
               | Const-me wrote:
               | Zen 2 cores have 4 execution units which run arithmetical
                | and logical SIMD instructions:
                | https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Fl...
               | 
               | Apparently 3 out of 4 units can add integers, but all 4
                | of them can do bitwise operations. Although the ISA
                | defines multiple equivalent bitwise instructions like
                | pand / andps / andpd, modern AMD CPUs apparently have no
                | bypass delay between them, so these instructions are 100%
                | equivalent on these CPUs. And with some luck (when not
                | bottlenecked on instruction fetch, decode, or elsewhere
                | in the pipeline) each Zen 2 core can run 4 of them every
                | cycle.
        
         | anonymoushn wrote:
          | > Yeah, but if you need to compare for a < b as opposed to a
          | <= b, you're going to need a couple more instructions.
         | 
         | In general yes. Depending on your use for the mask, it might be
         | fine to compute !(a < b) == (b <= a) and use the mask with
         | andnot.
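A sketch of folding the negation into andnot, for SSE2 bytes (the helper name is hypothetical): the easy-to-build b <= a mask selects x wherever a < b once it goes through pandn.

```c
#include <emmintrin.h>  /* SSE2 */

/* Keep bytes of x where a < b (unsigned), zero the rest.
   Rather than building an a < b mask, build the complementary
   b <= a mask (pminub + pcmpeqb) and let pandn do the negation. */
static __m128i select_where_lt_epu8(__m128i a, __m128i b, __m128i x) {
    __m128i ble = _mm_cmpeq_epi8(_mm_min_epu8(a, b), b);  /* b <= a */
    return _mm_andnot_si128(ble, x);  /* x & ~(b <= a), i.e. x where a < b */
}
```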
        
         | sylware wrote:
          | Yep, some constants are better generated than loaded; see
          | Agner Fog's optimisation manuals.
          | 
          | More broadly, all of these assembly optimisations/tricks are
          | buried deep inside compilers, and are reasonably "transparent"
          | only to the compiler devs themselves, due to the compilers'
          | abysmal complexity and size.
          | 
          | We would need some sort of online library for those assembly
          | (boolean/branchless calculus...) tricks.
          | 
          | A job for Wikipedia? Maybe linked to the maths/boolean
          | calculus articles?
        
           | Const-me wrote:
           | I think the compiler tricks are reasonably transparent to any
           | developer who knows both C and assembly, thanks to
           | godbolt.org.
           | 
            | A couple of times I've even back-ported these tricks from
            | clang-generated assembly back into C++ intrinsics.
        
           | janwas wrote:
           | github.com/google/highway is arguably such a collection of
           | tricks, the x86_256-inl.h header does emulate some missing
           | AVX-512 instructions such as this one or compressstoreu. (In
           | this case we currently use a constant.)
        
           | pjungwir wrote:
            | If you're interested in this, _Hacker's Delight_ has several
            | hundred pages of these tricks. The book picks up from a
            | 1972 MIT memo called HAKMEM. For a database/web guy like me,
            | reading it is just recreational, but that's probably true for
            | most people here. I'm currently about halfway through one
            | hundred pages on how to divide. :-)
        
             | Const-me wrote:
              | I sometimes write optimized low-level code, but many of
              | these books and internet articles are rather old. CPU
              | architectures have converged on just two, AMD64 and ARM64.
              | They have tons of useful instructions like sign extension
              | and integer division.
              | 
              | They also have less trivial instructions equivalent to some
              | of these tricks, only faster. A couple of examples:
             | 
              | To turn off the rightmost 1-bit in a word, AMD64 has the
              | BLSR instruction from the BMI1 set.
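For reference, BLSR implements the classic x & (x - 1) idiom, which GCC and Clang recognise and compile to the single instruction when targeting BMI1 (the function name below is a hypothetical illustration):

```c
#include <stdint.h>

/* Turn off the rightmost 1-bit. With -mbmi, GCC and Clang compile
   this idiom down to a single BLSR instruction on AMD64. */
static uint64_t clear_rightmost_one(uint64_t x) {
    return x & (x - 1);
}
```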
             | 
              | ARM CPUs have a fast instruction to reverse bits. AMD64
              | does not, but it has a fast instruction to flip the order
              | of bytes, which allows reversing bits faster than what's in
              | that book.
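A sketch of the byte-swap route, assuming the GCC/Clang builtin __builtin_bswap64 (which compiles to a single BSWAP on AMD64): after the bytes are reversed, three mask-and-shift rounds reverse the bits within each byte.

```c
#include <stdint.h>

/* Reverse all 64 bits. ARM has RBIT; on AMD64, one BSWAP handles the
   byte order, then three shift/mask rounds reverse bits inside bytes. */
static uint64_t reverse_bits64(uint64_t x) {
    x = __builtin_bswap64(x);  /* reverse byte order (BSWAP) */
    x = ((x & 0xF0F0F0F0F0F0F0F0ull) >> 4) | ((x & 0x0F0F0F0F0F0F0F0Full) << 4);
    x = ((x & 0xCCCCCCCCCCCCCCCCull) >> 2) | ((x & 0x3333333333333333ull) << 2);
    x = ((x & 0xAAAAAAAAAAAAAAAAull) >> 1) | ((x & 0x5555555555555555ull) << 1);
    return x;
}
```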
             | 
             | These tricks are only useful very rarely. For instance,
             | SIMD instructions can't divide vectors of integers. Some
             | older GPUs can't divide FP64 numbers but most of them can
             | multiply FP64, and all of them can divide FP32. For these
             | exotic use cases, these tricks are still relevant.
        
               | muziq wrote:
                | If you're not dividing numbers greater than 2^23, can't
                | you just use the FP divide? I'm sure I've done this in
                | CUDA many years back.. Worked a treat :) If you are, then
                | you have my sympathies..
        
               | Const-me wrote:
                | That's often a good way, but not necessarily the best
                | one. On my computer with a Zen 3 CPU, cvtdq2ps and
                | cvttps2dq have 3 cycles of latency each, and divps up to
                | 10 cycles, for 16 CPU cycles in total.
               | 
                | When the divisor's the same for all lanes and known at
                | compile time, there are tricks to reduce division to a
                | single integer multiplication plus a few extra cheap
                | instructions like shifts and additions, usually faster
                | than 16 cycles. Here's an example, automagically made by
                | clang 13: https://godbolt.org/z/T4TKb7oov
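A scalar sketch of the same transformation for the constant 3 (the magic value 0xAAAAAAAB is ceil(2^33 / 3); the function name is a hypothetical illustration): one widening multiply plus a shift replaces the division.

```c
#include <stdint.h>

/* x / 3 for any 32-bit unsigned x, as one 64-bit multiply plus a
   shift. 0xAAAAAAAB = ceil(2^33 / 3), so the product scaled down by
   2^33 floors to the exact quotient over the whole 32-bit range. */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABull) >> 33);
}
```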
        
         | moonchild wrote:
          | > if you need to compare for a < b as opposed to a <= b,
          | you're going to need a couple more instructions
         | 
         | Depends on what you are doing--you may not need to spend any
         | extra ops at all. For instance, if you are blending then you
         | can compute a >= b and commute the blend arguments. If you are
         | branching, then you pick the complementary flag to branch on.
         | Etc.
        
       | anonymoushn wrote:
       | I'm not working on superoptimization, but yes it does seem like
       | eq(x,max_unsigned(x,y)) is an "obvious" cmpge_unsigned if one
       | just browses all the intrinsics available here:
       | https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
       | 
       | I would like to read more about superoptimizers targeting AVX2,
       | since previously I've complained that gcc will not make
       | transformations between various equivalent instruction sequences
       | for moving bytes around within and between registers, and some of
       | these are a lot slower than others.
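For bytes, that identity is available even in plain SSE2 (the helper name is hypothetical): a >= b unsigned exactly when max_unsigned(a, b) == a.

```c
#include <emmintrin.h>  /* SSE2 */

/* Unsigned a >= b for bytes: eq(a, max_unsigned(a, b)). */
static __m128i cmpge_epu8(__m128i a, __m128i b) {
    return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
}
```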
        
       | dzaima wrote:
       | For what it's worth, clang[0] compiles to the vpminu[bwd] +
       | vpcmpeq[bwd] version even from manually written intrinsics of the
       | addition-based version, and has since clang 7. Though,
       | interestingly enough, it fails to cancel out a following xor.
       | 
       | (and, for completeness, a regular C loop[1], with some massaging
       | to make it more readable)
       | 
       | [0]: https://godbolt.org/z/e7TE5P73Y
       | 
       | [1]: https://godbolt.org/z/xhj3WTnxv
        
         | moonchild wrote:
         | That's unfortunate. The solution with min actually does equal
         | work in this case--it just has a lesser startup cost, hence my
         | mention of a 'hostile environment'--but with greater span,
         | because the high-bit toggling can be done for both inputs in
         | parallel. If you switch to a loop, at the very least, it ought
         | to use the traditional solution, but it doesn't. Also it
         | doesn't use the trick with the saturating subtract.
         | https://godbolt.org/z/76qKs49eW
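For reference, the saturating-subtract trick mentioned here, sketched for SSE2 bytes (helper name hypothetical): psubusb clamps a - b to zero exactly when a <= b, so a single compare against zero yields the mask.

```c
#include <emmintrin.h>  /* SSE2 */

/* Unsigned a <= b for bytes: saturating a - b is zero iff a <= b. */
static __m128i cmple_epu8(__m128i a, __m128i b) {
    return _mm_cmpeq_epi8(_mm_subs_epu8(a, b), _mm_setzero_si128());
}
```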
        
       ___________________________________________________________________
       (page generated 2022-08-26 23:02 UTC)