https://www.neilhenning.dev/posts/yourownconstantfolder/

Your Own Constant Folder in C/C++

2024-03-21 Neil Henning

I was talking with someone today who really, really wanted sqrtps to be used in some code they were writing. Because of a quirk in clang (still there as of clang 18.1.0), if you happen to use -ffast-math, clang will butcher the use of the intrinsic. So for the code:

    __m128 test(const __m128 vec) {
      return _mm_sqrt_ps(vec);
    }

Clang compiles it correctly without fast-math:

    test:                                   # @test
            sqrtps  xmm0, xmm0
            ret

And creates this monstrosity with -ffast-math:

    .LCPI0_0:
            .long   0xbf000000              # float -0.5
            .long   0xbf000000              # float -0.5
            .long   0xbf000000              # float -0.5
            .long   0xbf000000              # float -0.5
    .LCPI0_1:
            .long   0xc0400000              # float -3
            .long   0xc0400000              # float -3
            .long   0xc0400000              # float -3
            .long   0xc0400000              # float -3
    test:
            rsqrtps xmm1, xmm0
            movaps  xmm2, xmm0
            mulps   xmm2, xmm1
            movaps  xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
            mulps   xmm3, xmm2
            mulps   xmm2, xmm1
            addps   xmm2, xmmword ptr [rip + .LCPI0_1]
            mulps   xmm2, xmm3
            xorps   xmm1, xmm1
            cmpneqps        xmm0, xmm1
            andps   xmm0, xmm2
            ret

The optimization flow here in LLVM is:

* Under fast-math conditions, sqrt(x) == x * rsqrt(x), so it uses rsqrtps instead.
* But rsqrtps is only an approximation with a high ULP tolerance, so its results differ between Intel and AMD.
* So LLVM follows rsqrtps with Newton-Raphson refinement whenever it uses it, to correct the precision between the CPU implementations (the listing above contains one such iteration).

The 'fix' here is just to use inline assembly to guarantee you always get the instruction selection you want:

    __m128 test(__m128 vec) {
      // "x" keeps both operands in SSE registers; AT&T order is source, destination.
      __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
      return vec;
    }

But there is one additional thing I'd advocate you do if you need to use inline assembly: write your own constant folding.
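Before getting to that, here is one way to read the fast-math listing above as plain intrinsics. This is only a sketch of the arithmetic in that listing (an rsqrtps estimate, a Newton-Raphson style correction, and a mask for zero inputs), not LLVM's actual implementation, and the name sqrt_via_rsqrt is made up for illustration. It assumes an x86 target with SSE:

```c
#include <xmmintrin.h>

// Hypothetical helper mirroring the -ffast-math listing step by step.
__m128 sqrt_via_rsqrt(__m128 x) {
  __m128 est = _mm_rsqrt_ps(x);                  // rsqrtps: rough 1/sqrt(x)
  __m128 t = _mm_mul_ps(x, est);                 // x * est, roughly sqrt(x)
  __m128 half_t = _mm_mul_ps(_mm_set1_ps(-0.5f), t);
  __m128 corr = _mm_add_ps(_mm_mul_ps(t, est),   // x * est * est ...
                           _mm_set1_ps(-3.0f));  // ... minus 3
  __m128 refined = _mm_mul_ps(corr, half_t);     // 0.5 * t * (3 - t * est)
  // rsqrt(0) is infinity, so 0 * inf gives NaN; mask those lanes to 0.
  __m128 nonzero = _mm_cmpneq_ps(x, _mm_setzero_ps());
  return _mm_and_ps(nonzero, refined);
}
```

The result is close to, but not guaranteed bit-identical with, what sqrtps returns, which is exactly why someone who wants sqrtps might not be happy with the substitution.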
See, the one downside to the inline assembly above is that if test is inlined and vec is a constant, the compiler won't constant fold it. For example:

    __attribute__((always_inline)) __m128 test(__m128 vec) {
      __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
      return vec;
    }

    __m128 call_test() {
      return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
    }

Will produce:

    test:
            sqrtps  xmm0, xmm0
            ret
    .LCPI1_0:
            .long   0x3f800000              # float 1
            .long   0x40000000              # float 2
            .long   0x40400000              # float 3
            .long   0x40800000              # float 4
    call_test:
            movaps  xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
            sqrtps  xmm0, xmm0
            ret

So even under inlining, when we could have constant folded it away entirely, we are still executing sqrtps when we don't have to.

So what is the fix? LLVM has an intrinsic, is_constant, which can be reached via the Clang-supported GCC extension __builtin_constant_p. If we extend our test above to check whether vec is constant, we can call _mm_sqrt_ps when it is, and benefit from the constant folder doing its thing and removing the call entirely. So our code becomes:

    __attribute__((always_inline)) __m128 test(__m128 vec) {
      if (__builtin_constant_p(vec)) {
        return _mm_sqrt_ps(vec);
      }

      __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
      return vec;
    }

    __m128 call_test() {
      return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
    }

And we get:

    call_test:
            movaps  xmm0, xmmword ptr [rip + .LCPI11_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
            sqrtps  xmm0, xmm0
            ret

What the heck?! It hasn't constant folded! Turns out GCC is a bit picky with this builtin, and it looks like LLVM has inherited that funky behaviour. You cannot use it with a vector, even though LLVM happily has the support in the IR for it.
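For contrast, the builtin is usable on a scalar. The following is a minimal sketch of the same pattern for a single float, assuming GCC or Clang on an x86 target with SSE; scalar_sqrt is a hypothetical name and is not part of the example above:

```c
#include <xmmintrin.h>

// Hypothetical scalar analogue of test(): take the intrinsic when the
// argument is a known constant, otherwise force the instruction.
static inline __attribute__((always_inline)) float scalar_sqrt(float x) {
  if (__builtin_constant_p(x)) {
    // The compiler understands this intrinsic, so it is free to fold it.
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
  }
  float r;
  // Opaque to the optimizer: sqrtss is always emitted on this path.
  __asm__ ("sqrtss %1, %0" : "=x"(r) : "x"(x));
  return r;
}
```

With optimizations on, a call such as scalar_sqrt(4.0f) should be able to take the intrinsic path and fold, while a runtime value falls through to the inline assembly. Both paths return the same value, so the choice only affects code generation. A whole __m128 passed to the builtin gets no such treatment.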
But there is a workaround, an ugly one:

    __attribute__((always_inline)) __m128 test(__m128 vec) {
      // Ask about each lane separately; the builtin accepts scalars.
      if (__builtin_constant_p(vec[0]) &&
          __builtin_constant_p(vec[1]) &&
          __builtin_constant_p(vec[2]) &&
          __builtin_constant_p(vec[3])) {
        return _mm_sqrt_ps(vec);
      }

      __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
      return vec;
    }

    __m128 call_test() {
      return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
    }

Will produce:

    .LCPI15_0:
            .long   0x3f800000              # float 1
            .long   0x3fb504f3              # float 1.41421354
            .long   0x3fddb3d7              # float 1.73205078
            .long   0x40000000              # float 2
    call_test:
            movaps  xmm0, xmmword ptr [rip + .LCPI15_0] # xmm0 = [1.0E+0,1.41421354E+0,1.73205078E+0,2.0E+0]
            ret

Nice! We've got the constant folding we want. And also nicely, if we mark test as noinline instead, the code for test is:

    test:
            sqrtps  xmm0, xmm0
            ret

Meaning the branch is folded away. In both cases we now get the behaviour we want. We've written our own constant folder. Nice!

You can see the full example on godbolt.

It'd be nice if we could just use the vector in __builtin_constant_p, but I think the LLVM folks have purposefully tried to match what GCC would do. I'd personally advocate for a loosening of the builtin, and I might file a GitHub issue about just that.
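As a footnote to the per-lane workaround, the four checks can be tucked behind a macro so the wrapper reads more like the version we originally wanted to write. This is only a sketch: ALL_LANES_CONSTANT and forced_sqrtps are hypothetical names, and it assumes GCC or Clang vector subscripting on a four-lane __m128:

```c
#include <xmmintrin.h>

// Hypothetical helper: true only if every lane of a 4 x float vector is a
// compile-time constant at the point of use.
#define ALL_LANES_CONSTANT(v)                                        \
  (__builtin_constant_p((v)[0]) && __builtin_constant_p((v)[1]) &&   \
   __builtin_constant_p((v)[2]) && __builtin_constant_p((v)[3]))

static inline __attribute__((always_inline)) __m128 forced_sqrtps(__m128 vec) {
  if (ALL_LANES_CONSTANT(vec)) {
    return _mm_sqrt_ps(vec);  // visible to the constant folder
  }
  __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));  // always sqrtps
  return vec;
}
```

It has to be a macro rather than a helper function so that the builtin is evaluated on the caller's expression; whether a given call actually folds still depends on the compiler and optimization level.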