Random unimportant stuff

Tuesday, January 16, 2024

std::clamp still generates less efficient assembly than std::min(max, std::max(min, v)) on GCC and Clang

I originally wrote this blog post in 2019 (or maybe 2018 - my timestamps say it was written before 1 May 2019). Recently I decided to revisit my old blog and found that std::clamp still generates less efficient assembly than std::min(max, std::max(min, v)), even on the latest versions of GCC (13.2) and Clang (17.0.1). Here's my old blog post:

Contents:

* Ternary
* Using intermediate values
* Using std::min and std::max
* Using std::clamp

Let's say you want to clamp a value v between two values, min and max: if v is greater than max, return max; if v is smaller than min, return min; otherwise return v.

Ternary

Implementing it directly as per the description:

double clamp(double v, double min, double max){
    return v < min ? min : v > max ? max : v;
}

gcc 8.2:

clamp(double, double, double):
        comisd  xmm1, xmm0
        ja      .L2
        minsd   xmm2, xmm0
        movapd  xmm1, xmm2
.L2:
        movapd  xmm0, xmm1
        ret

One branch instruction.

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        minsd   xmm2, xmm0
        cmpltsd xmm0, xmm1
        movapd  xmm3, xmm0
        andnpd  xmm3, xmm2
        andpd   xmm0, xmm1
        orpd    xmm0, xmm3
        ret

Branchless.

Using intermediate values

From this Stack Overflow answer: https://stackoverflow.com/questions/427477/fastest-way-to-clamp-a-real-fixed-floating-point-value

double clamp(double v, double min, double max){
    double out = v > max ? max : v;
    return out < min ? min : out;
}

gcc 8.2:

clamp(double, double, double):
        minsd   xmm2, xmm0
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        minsd   xmm2, xmm0
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret

Identical output. Much better than before. Can we do better?

Using std::min and std::max

This is the idiomatic way to write clamp in C++ (and most other languages):

#include <algorithm>

double clamp(double v, double min, double max){
    return std::min(max, std::max(min, v));
}

gcc 8.2:

clamp(double, double, double):
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

This seems to generate the best code so far.

Using std::clamp

#include <algorithm>

double clamp(double v, double min, double max){
    return std::clamp(v, min, max);
}

gcc 8.2:

clamp(double, double, double):
        comisd  xmm1, xmm0
        ja      .L2
        minsd   xmm2, xmm0
        movapd  xmm1, xmm2
.L2:
        movapd  xmm0, xmm1
        ret

clang 7.0:

clamp(double, double, double):          # @clamp(double, double, double)
        minsd   xmm2, xmm0
        cmpltsd xmm0, xmm1
        movapd  xmm3, xmm0
        andnpd  xmm3, xmm2
        andpd   xmm0, xmm1
        orpd    xmm0, xmm3
        ret

Not very efficient.

EDIT: It's been almost 5 years since I originally wrote this article, so I decided to try again using the latest versions of GCC and Clang:

gcc 13.2:

clamp(double, double, double):
        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
        ret

clang 17.0.1:

clamp(double, double, double):          # @clamp(double, double, double)
        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
        ret

Still not optimal - it uses one more instruction than the std::min(max, std::max(min, v)) implementation.

But how does the fastest implementation work, you ask? Going through the code line by line:

clamp(double, double, double):
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

In the x86-64 calling convention the three double arguments v, min and max arrive in xmm0, xmm1 and xmm2, and the return value is handed back in xmm0. maxsd xmm0, xmm1 puts the max of xmm0 and xmm1 into xmm0, and minsd xmm0, xmm2 puts the min of xmm0 and xmm2 into xmm0. Thus, after the first instruction, xmm0 contains the max of the lower bound and the value, and after the second instruction, xmm0 contains the min of the upper bound and the previous result - which is exactly the clamped value.
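For reference, here is the same two-instruction idea written out by hand with SSE intrinsics - a rough sketch (the clamp_sse name is just for illustration; the compiler produces the sequence above on its own from the std::min/std::max version):

#include <immintrin.h>

// Sketch of the maxsd/minsd sequence using intrinsics. Like the instructions
// themselves, _mm_max_sd(a, b) and _mm_min_sd(a, b) operate on the low double
// and return the second operand when the comparison is unordered (NaN), so
// the argument order matters.
double clamp_sse(double v, double min, double max){
    __m128d x = _mm_set_sd(v);
    x = _mm_max_sd(x, _mm_set_sd(min));  // the maxsd step: max(min, v)
    x = _mm_min_sd(x, _mm_set_sd(max));  // the minsd step: min(max, ...)
    return _mm_cvtsd_f64(x);
}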
But let's step through with gdb to confirm. Source code:

#include <algorithm>

double __attribute__ ((noinline)) clamp(double v, double min, double max){
    return std::min(max, std::max(min, v));
}

int main(){
    volatile double x, min, max;
    x = 1653;
    min = 1776;
    max = 1729;
    return clamp(x, min, max);
}
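The hex values in the v2_double fields of the register dumps below are just the raw bit patterns of these doubles. As a side check, here is a small standalone sketch, separate from the program being debugged (the bits helper is only for illustration):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a double's bits as an integer - the same hex values that
// gdb's v2_double fields show below.
static std::uint64_t bits(double d){
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

int main(){
    double x = 1653, min = 1776, max = 1729;
    double r = std::min(max, std::max(min, x));
    std::printf("result  = %g\n", r);                                   // 1729
    std::printf("bits(x) = 0x%016llx\n", (unsigned long long)bits(x));  // 0x4099d40000000000
    std::printf("bits(r) = 0x%016llx\n", (unsigned long long)bits(r));  // 0x409b040000000000
}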
gdb logs:

Before running maxsd:

  >  0x555555555180 <_Z5clampddd>      maxsd  %xmm1,%xmm0
     0x555555555184 <_Z5clampddd+4>    minsd  %xmm2,%xmm0
     0x555555555188 <_Z5clampddd+8>    ret

xmm0  {v8_bfloat16 = {0x0, 0x0, 0xd400, 0x4099, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xd400, 0x4099, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x4099d400, 0x0, 0x0}, v2_double = {0x4099d40000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xd4, 0x99, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xd400, 0x4099, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x4099d400, 0x0, 0x0}, v2_int64 = {0x4099d40000000000, 0x0}, uint128 = 0x4099d40000000000}
xmm1  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm2  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

After running maxsd but before running minsd:

     0x555555555180 <_Z5clampddd>      maxsd  %xmm1,%xmm0
  >  0x555555555184 <_Z5clampddd+4>    minsd  %xmm2,%xmm0
     0x555555555188 <_Z5clampddd+8>    ret

xmm0  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm1  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm2  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

After running minsd:

     0x555555555180 <_Z5clampddd>      maxsd  %xmm1,%xmm0
     0x555555555184 <_Z5clampddd+4>    minsd  %xmm2,%xmm0
 B+> 0x555555555188 <_Z5clampddd+8>    ret

xmm0  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}
xmm1  {v8_bfloat16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409bc000, 0x0, 0x0}, v2_double = {0x409bc00000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0xc0, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0xc000, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409bc000, 0x0, 0x0}, v2_int64 = {0x409bc00000000000, 0x0}, uint128 = 0x409bc00000000000}
xmm2  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

And how does the value from xmm0 get placed into eax?
The answer is the cvttsd2si instruction, which truncates the double in xmm0 to a signed integer in eax (clamp returns a double, but main returns an int).

Before running cvttsd2si:

     0x555555555080    call       0x555555555180 <_Z5clampddd>
     0x555555555085    add        $0x20,%rsp
  >  0x555555555089    cvttsd2si  %xmm0,%eax
     0x55555555508d    ret

(gdb) i r eax xmm0
eax   0x55555040    1431654464
xmm0  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

After running cvttsd2si:

     0x555555555080    call       0x555555555180 <_Z5clampddd>
     0x555555555085    add        $0x20,%rsp
     0x555555555089    cvttsd2si  %xmm0,%eax
  >  0x55555555508d    ret

(gdb) i r eax xmm0
eax   0x6c1    1729
xmm0  {v8_bfloat16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v8_half = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_float = {0x0, 0x409b0400, 0x0, 0x0}, v2_double = {0x409b040000000000, 0x0}, v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x4, 0x9b, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x400, 0x409b, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x409b0400, 0x0, 0x0}, v2_int64 = {0x409b040000000000, 0x0}, uint128 = 0x409b040000000000}

Pretty cool. Anyway, I found it surprising that std::clamp still generates less efficient assembly than std::min(max, std::max(min, v)), even on the latest versions of GCC (13.2) and Clang (17.0.1).

Posted on January 16, 2024
Labels: assembly, C++, clamp, clang, gcc

1 comment:

StevenSun - January 16, 2024 at 8:37 AM
I did some investigation. It turns out `std::clamp` and your `std::min(std::max)` are not the same when NaN is involved. See this link: https://godbolt.org/z/jaE7EdjTo
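A minimal sketch of the difference the comment is pointing at (the exact output depends on the standard library and compiler, so the code comments describe expectations, not guarantees):

#include <algorithm>
#include <cmath>
#include <cstdio>

// How the two formulations can differ for a NaN input:
// - std::max(a, b) and std::min(a, b) return their first argument when the
//   comparison is false, so std::min(max, std::max(min, v)) yields one of the
//   bounds for a NaN v.
// - std::clamp(v, min, max) is specified to return v when neither v < min nor
//   max < v holds, so a NaN v is expected to come back as NaN.
int main(){
    double v = std::nan(""), min = 0.0, max = 1.0;
    std::printf("min/max:    %f\n", std::min(max, std::max(min, v)));
    std::printf("std::clamp: %f\n", std::clamp(v, min, max));
}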