Random unimportant stuff

Tuesday, January 16, 2024

std::clamp still generates less efficient assembly than std::min(max, std::max(min, v)) on GCC and Clang

I originally wrote this blog post in 2019 (or maybe 2018 - my timestamps say it was written before 1 May 2019). Recently I decided to revisit my old blog and found that std::clamp still generates less efficient assembly than std::min(max, std::max(min, v)), even on the latest versions of GCC (13.2) and Clang (17.0.1). Here's my old blog post:

Contents:

* Ternary
* Using intermediate values
* Using std::min and std::max
* Using std::clamp

Let's say you want to clamp a value v between two values, min and max: if v is greater than max, return max; if v is smaller than min, return min; otherwise return v.

Ternary

Implementing it directly as per the description:

double clamp(double v, double min, double max){
    return v < min ? min : v > max ? max : v;
}

gcc 8.2:

clamp(double, double, double):
        comisd  xmm1, xmm0
        ja      .L2
        minsd   xmm2, xmm0
        movapd  xmm1, xmm2
.L2:
        movapd  xmm0, xmm1
        ret

One branch instruction.

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        minsd   xmm2, xmm0
        cmpltsd xmm0, xmm1
        movapd  xmm3, xmm0
        andnpd  xmm3, xmm2
        andpd   xmm0, xmm1
        orpd    xmm0, xmm3
        ret

Branchless.

Using intermediate values

From this Stack Overflow answer: https://stackoverflow.com/questions/427477/fastest-way-to-clamp-a-real-fixed-floating-point-value

double clamp(double v, double min, double max){
    double out = v > max ? max : v;
    return out < min ? min : out;
}

gcc 8.2:

clamp(double, double, double):
        minsd   xmm2, xmm0
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        minsd   xmm2, xmm0
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret

Identical output. Much better than before. Can we do better?

Using std::min and std::max

This is the idiomatic way to write clamp in C++ (and most other languages):

#include <algorithm>

double clamp(double v, double min, double max){
    return std::min(max, std::max(min, v));
}

gcc 8.2:

clamp(double, double, double):
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

This seems to generate the best code so far.

Using std::clamp

#include <algorithm>

double clamp(double v, double min, double max){
    return std::clamp(v, min, max);
}

gcc 8.2:

clamp(double, double, double):
        comisd  xmm1, xmm0
        ja      .L2
        minsd   xmm2, xmm0
        movapd  xmm1, xmm2
.L2:
        movapd  xmm0, xmm1
        ret

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        minsd   xmm2, xmm0
        cmpltsd xmm0, xmm1
        movapd  xmm3, xmm0
        andnpd  xmm3, xmm2
        andpd   xmm0, xmm1
        orpd    xmm0, xmm3
        ret

Not very efficient.

EDIT: It's been almost 5 years since I originally wrote this article, so I decided to try again using the latest versions of GCC and Clang:

gcc 13.2:

clamp(double, double, double):
        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
        ret

clang 17.0.1:

clamp(double, double, double):          # @clamp(double, double, double)
        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
        ret

Still not optimal - it uses one more instruction than the std::min(max, std::max(min, v)) implementation.

But how does the fastest implementation work, you ask? Going through the code line by line:

clamp(double, double, double):
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

In the x86-64 calling convention the three double arguments v, min and max arrive in xmm0, xmm1 and xmm2, and the return value is handed back in xmm0. maxsd xmm0, xmm1 puts the max of xmm0 and xmm1 into xmm0, and minsd xmm0, xmm2 puts the min of xmm0 and xmm2 into xmm0. Thus, after the first instruction, xmm0 contains the max of the lower bound and the value, and after the second instruction, xmm0 contains the min of the upper bound and the previous result - which is exactly the clamped value.
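For reference, here is the same two-instruction idea written out by hand with SSE intrinsics - a rough sketch (the clamp_sse name is just for illustration; the compiler produces the sequence above on its own from the std::min/std::max version):

#include <immintrin.h>

// Sketch of the maxsd/minsd sequence using intrinsics. Like the instructions
// themselves, _mm_max_sd(a, b) and _mm_min_sd(a, b) operate on the low double
// and return the second operand when the comparison is unordered (NaN), so
// the argument order matters.
double clamp_sse(double v, double min, double max){
    __m128d x = _mm_set_sd(v);
    x = _mm_max_sd(x, _mm_set_sd(min));  // the maxsd step: max(min, v)
    x = _mm_min_sd(x, _mm_set_sd(max));  // the minsd step: min(max, ...)
    return _mm_cvtsd_f64(x);
}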
But let's step through with gdb to confirm. Source code:

#include <algorithm>

double __attribute__ ((noinline)) clamp(double v, double min, double max){
    return std::min(max, std::max(min, v));
}

int main(){
    volatile double x, min, max;
    x = 1653;
    min = 1776;
    max = 1729;
    return clamp(x, min, max);
}
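The hex values in the v2_double fields of the register dumps below are just the raw bit patterns of these doubles. As a side check, here is a small standalone sketch, separate from the program being debugged (the bits helper is only for illustration):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a double's bits as an integer - the same hex values that
// gdb's v2_double fields show below.
static std::uint64_t bits(double d){
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

int main(){
    double x = 1653, min = 1776, max = 1729;
    double r = std::min(max, std::max(min, x));
    std::printf("result  = %g\n", r);                                   // 1729
    std::printf("bits(x) = 0x%016llx\n", (unsigned long long)bits(x));  // 0x4099d40000000000
    std::printf("bits(r) = 0x%016llx\n", (unsigned long long)bits(r));  // 0x409b040000000000
}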
gdb logs:

Before running maxsd:

  >  0x555555555180 <_Z5clampddd>      maxsd  %xmm1,%xmm0
     0x555555555184 <_Z5clampddd+4>    minsd  %xmm2,%xmm0
     0x555555555188 <_Z5clampddd+8>    ret

xmm0  {v8_bfloat16 = {0x0, 0x0, 0xd400, 0x4099, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xd400, 0x4099, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x4099d400, 0x0, 0x0}, v2_double = {0x4099d40000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xd4, 0x99, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xd400, 0x4099, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x4099d400, 0x0, 0x0}, v2_int64 = {0x4099d40000000000, 0x0}, uint128 = 0x4099d40000000000}
xmm1  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm2  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

After running maxsd but before running minsd:

     0x555555555180 <_Z5clampddd>      maxsd  %xmm1,%xmm0
  >  0x555555555184 <_Z5clampddd+4>    minsd  %xmm2,%xmm0
     0x555555555188 <_Z5clampddd+8>    ret

xmm0  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm1  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm2  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

After running minsd:

     0x555555555180 <_Z5clampddd>      maxsd  %xmm1,%xmm0
     0x555555555184 <_Z5clampddd+4>    minsd  %xmm2,%xmm0
 B+> 0x555555555188 <_Z5clampddd+8>    ret

xmm0  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}
xmm1  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm2  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

And how does the value from xmm0 get placed into eax?
The answer is the cvttsd2si instruction, which truncates the double in xmm0 to a signed integer in eax (clamp returns a double, but main returns an int).

Before running cvttsd2si:

     0x555555555080    call       0x555555555180 <_Z5clampddd>
     0x555555555085    add        $0x20,%rsp
  >  0x555555555089    cvttsd2si  %xmm0,%eax
     0x55555555508d    ret

(gdb) i r eax xmm0
eax   0x55555040    1431654464
xmm0  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

After running cvttsd2si:

     0x555555555080    call       0x555555555180 <_Z5clampddd>
     0x555555555085    add        $0x20,%rsp
     0x555555555089    cvttsd2si  %xmm0,%eax
  >  0x55555555508d    ret

(gdb) i r eax xmm0
eax   0x6c1    1729
xmm0  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

Pretty cool. Anyway, I found it surprising that std::clamp still generates less efficient assembly than std::min(max, std::max(min, v)), even on the latest versions of GCC (13.2) and Clang (17.0.1).

Posted on January 16, 2024
Labels: assembly, C++, clamp, clang, gcc

1 comment:

StevenSun - January 16, 2024 at 8:37 AM
I did some investigation. It turns out `std::clamp` and your `std::min(std::max)` are not the same when NaN is involved. See this link: https://godbolt.org/z/jaE7EdjTo
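A minimal sketch of the difference the comment is pointing at (the exact output depends on the standard library and compiler, so the code comments describe expectations, not guarantees):

#include <algorithm>
#include <cmath>
#include <cstdio>

// How the two formulations can differ for a NaN input:
// - std::max(a, b) and std::min(a, b) return their first argument when the
//   comparison is false, so std::min(max, std::max(min, v)) yields one of the
//   bounds for a NaN v.
// - std::clamp(v, min, max) is specified to return v when neither v < min nor
//   max < v holds, so a NaN v is expected to come back as NaN.
int main(){
    double v = std::nan(""), min = 0.0, max = 1.0;
    std::printf("min/max:    %f\n", std::min(max, std::max(min, v)));
    std::printf("std::clamp: %f\n", std::clamp(v, min, max));
}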