https://www.neilhenning.dev/posts/yourownconstantfolder/

 
Neil Henning


Your Own Constant Folder in C/C++

2024-03-21 Neil Henning

I was talking with someone today who really, really wanted the sqrtps
instruction to be used in some code they were writing. Because of a
quirk in clang (still there as of clang 18.1.0), if you use
-ffast-math, clang will butcher the use of the intrinsic. So for the
code:

#include <xmmintrin.h>

__m128 test(const __m128 vec)
{
    return _mm_sqrt_ps(vec);
}

Clang would compile it correctly without fast-math:

test:                                   # @test
        sqrtps  xmm0, xmm0
        ret

And create this monstrosity with -ffast-math:

.LCPI0_0:
        .long   0xbf000000                      # float -0.5
        .long   0xbf000000                      # float -0.5
        .long   0xbf000000                      # float -0.5
        .long   0xbf000000                      # float -0.5
.LCPI0_1:
        .long   0xc0400000                      # float -3
        .long   0xc0400000                      # float -3
        .long   0xc0400000                      # float -3
        .long   0xc0400000                      # float -3
test:
        rsqrtps xmm1, xmm0
        movaps  xmm2, xmm0
        mulps   xmm2, xmm1
        movaps  xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
        mulps   xmm3, xmm2
        mulps   xmm2, xmm1
        addps   xmm2, xmmword ptr [rip + .LCPI0_1]
        mulps   xmm2, xmm3
        xorps   xmm1, xmm1
        cmpneqps        xmm0, xmm1
        andps   xmm0, xmm2
        ret

The optimization flow here in LLVM is:

  * Under fast-math conditions, sqrt(x) == x * rsqrt(x), so it uses
    rsqrtps instead.
  * But rsqrtps is only an approximation, and its results differ
    between Intel and AMD because the instruction has a high ULP
    tolerance.
  * So LLVM adds a Newton-Raphson iteration whenever it uses rsqrtps,
    to tighten the precision across CPU implementations. The listing
    above contains one such step, plus a compare-and-mask so that an
    input of 0 still yields 0.
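
To make the listing easier to follow, here is a rough transcription of
it back into intrinsics. The name approx_sqrt_ps is made up for this
sketch, and it follows my reading of the assembly above rather than
anything taken from LLVM's source:

```c
#include <xmmintrin.h>

/* Hypothetical helper mirroring the -ffast-math sequence: estimate
   1/sqrt(x) with rsqrtps, refine once, fold in the multiply by x, and
   zero the lanes where x == 0. */
static __m128 approx_sqrt_ps(__m128 x)
{
    const __m128 neg_half  = _mm_set1_ps(-0.5f);
    const __m128 neg_three = _mm_set1_ps(-3.0f);

    __m128 e  = _mm_rsqrt_ps(x);                          /* e ~= 1/sqrt(x)        */
    __m128 xe = _mm_mul_ps(x, e);                         /* x*e ~= sqrt(x)        */
    __m128 h  = _mm_mul_ps(neg_half, xe);                 /* -0.5*x*e              */
    __m128 t  = _mm_add_ps(_mm_mul_ps(xe, e), neg_three); /* x*e*e - 3             */
    __m128 r  = _mm_mul_ps(t, h);                         /* 0.5*x*e*(3 - x*e*e)   */

    /* rsqrt(0) is +inf and 0 * inf is NaN, so mask those lanes to 0. */
    __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
    return _mm_and_ps(nonzero, r);
}
```

That is eleven instructions and two constant loads standing in for a
single sqrtps.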

The 'fix' here is just to use inline assembly, which guarantees you
always get the instruction selection you want:

__m128 test(__m128 vec)
{
    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}
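
If the operand syntax is unfamiliar, the same thing can be spelled
with named operands and a separate output variable. This is only a
sketch to show what each constraint means; sqrt_ps_asm is a
hypothetical name and it assumes the default AT&T assembler syntax:

```c
#include <xmmintrin.h>

/* Hypothetical name; same instruction as above, with named operands. */
static __m128 sqrt_ps_asm(__m128 vec)
{
    __m128 result;
    __asm__ ("sqrtps %[in], %[out]"  /* AT&T order: source, then destination */
             : [out] "=x"(result)    /* "=x": write-only output in an SSE register */
             : [in] "x"(vec));       /* "x": input in an SSE register */
    return result;
}
```

The compiler treats the asm statement as an opaque box: it knows which
registers go in and come out, but nothing about what happens in
between.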

But there is one additional thing I'd advocate you do if you need to
use inline assembly - write your own constant folding.

The one downside of the inline assembly above is that if test is
inlined and vec is a constant, the compiler cannot constant fold it,
because it has no idea what the assembly computes. For example:

__attribute__((always_inline)) __m128 test(__m128 vec)
{
    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}

Will produce:

test:
        sqrtps  xmm0, xmm0
        ret
.LCPI1_0:
        .long   0x3f800000                      # float 1
        .long   0x40000000                      # float 2
        .long   0x40400000                      # float 3
        .long   0x40800000                      # float 4
call_test:
        movaps  xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
        sqrtps  xmm0, xmm0
        ret

So even after inlining, when the whole thing could have been constant
folded away, we still execute sqrtps when we don't have to. So what
is the fix?

LLVM has an intrinsic, llvm.is.constant, which is exposed through the
Clang-supported GCC extension __builtin_constant_p. If we extend our
test above to check whether vec is constant, we can call _mm_sqrt_ps
when it is, and benefit from the constant folder doing its thing and
removing the call entirely. So our code becomes:

__attribute__((always_inline)) __m128 test(__m128 vec)
{
    if (__builtin_constant_p(vec))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}

And we get:

call_test:
        movaps  xmm0, xmmword ptr [rip + .LCPI11_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
        sqrtps  xmm0, xmm0
        ret

What the heck?! It hasn't constant folded! Turns out GCC is a bit
picky with this builtin, and it looks like LLVM has inherited that
funky behaviour. You cannot use it with a vector - even though LLVM
happily has the support in the IR for it. But there is a workaround,
an ugly one:

__attribute__((always_inline)) __m128 test(__m128 vec)
{
    if (__builtin_constant_p(vec[0]) &&
        __builtin_constant_p(vec[1]) &&
        __builtin_constant_p(vec[2]) &&
        __builtin_constant_p(vec[3]))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}

Will produce:

.LCPI15_0:
        .long   0x3f800000                      # float 1
        .long   0x3fb504f3                      # float 1.41421354
        .long   0x3fddb3d7                      # float 1.73205078
        .long   0x40000000                      # float 2
call_test:
        movaps  xmm0, xmmword ptr [rip + .LCPI15_0] # xmm0 = [1.0E+0,1.41421354E+0,1.73205078E+0,2.0E+0]
        ret

Nice! We've got the constant folding we want. And also nicely, if we
mark test as noinline instead, the code for test is:

test:
        sqrtps  xmm0, xmm0
        ret

Meaning the branch is folded away. In both cases we now get the
behaviour we want. We've written our own constant folder. Nice! You
can see the full example on godbolt.
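
The same trick should carry over to other vector types. As a sketch,
here is what a double-precision variant might look like, with sqrtpd
and two lanes to check. The names test_pd and call_test_pd are made up
for illustration, and I have not checked the generated assembly for
this one:

```c
#include <emmintrin.h>

/* Hypothetical double-precision variant of test. */
static inline __attribute__((always_inline)) __m128d test_pd(__m128d vec)
{
    /* Check each lane separately, since the builtin does not report a
       whole vector as constant. */
    if (__builtin_constant_p(vec[0]) &&
        __builtin_constant_p(vec[1]))
    {
        return _mm_sqrt_pd(vec);  /* visible to the constant folder */
    }

    __asm__ ("sqrtpd %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128d call_test_pd(void)
{
    return test_pd(_mm_setr_pd(4.0, 9.0));
}
```

Either path returns the same values, so the check only changes which
code is emitted, not the result.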

It'd be nice if we could just use the vector in __builtin_constant_p,
but I think the LLVM folks have purposefully tried to match what GCC
would do. I'd personally advocate for a loosening of the builtin, and
I might file a GitHub issue about just that.
