[HN Gopher] Own Constant Folder in C/C++
       ___________________________________________________________________
        
       Own Constant Folder in C/C++
        
       Author : todsacerdoti
       Score  : 149 points
       Date   : 2024-06-22 10:07 UTC (12 hours ago)
        
 (HTM) web link (www.neilhenning.dev)
 (TXT) w3m dump (www.neilhenning.dev)
        
       | fwsgonzo wrote:
       | Very interesting, and that constant folding trick is handy!
        
         | smitty1e wrote:
         | I was going to label this a fine example of yak shaving.
         | Usually when I'm working this hard against the tool, I've
         | missed something, though the author here sounds expert.
        
           | DrBazza wrote:
           | This is the sort of thing we do in HFT to get single
           | instruction calls and save nanos in places that get called
           | many many thousands of times a second.
        
             | smitty1e wrote:
             | I mean, at some point, why not just write a custom
             | compiler?
        
               | trelane wrote:
               | If you care that much, you use intrinsics, asm in the C
               | file, or asm to create an object you call into.
               | 
               | Most of the code doesn't need nearly that level of
               | optimization, so you write it in a higher level language.
               | 
               | Making your own compiler is a _lot_ of work compared to
                | just doing what I've outlined above. So you don't. It's
                | not worth it, and you'd still end up back where you are.
        
               | TillE wrote:
               | Writing assembly really isn't that scary, especially in
               | small doses for performance-critical code. That was a
               | very common practice in the 80s/90s, though it's faded as
               | compiler optimizations have improved.
        
       | feverzsj wrote:
       | clang has __builtin_elementwise_sqrt, which meets all your
       | requirements.
        
         | josephg wrote:
         | I just tried it on godbolt. __builtin_elementwise_sqrt emits
         | the same terrible asm as _mm_sqrt_ps when -ffast-math is
         | enabled:
         | 
         | https://godbolt.org/z/34z81GEsG
        
       | btdmaster wrote:
       | Fortunately gcc does not miscompile like this:
       | https://godbolt.org/z/PaeaohqGz
        
       | noitpmeder wrote:
       | This seems like a great example of why people don't like C/C++,
       | and probably a good example of why some people _do_ like it.
       | 
       | How is a non-expert in the language supposed to learn tricks/...
       | things like this? I'm asking as a C++ developer of 6+ years in
       | high performance settings, most of this article is esoteric to
       | me.
        
         | joosters wrote:
         | Is this really C++ specific though? It seems like the
         | optimisations are happening on a lower level, and so would
         | 'infect' other languages too.
         | 
         | Whatever the language, at some point in performance tweaking
         | you will end up having to look at the assembly produced by your
         | compiler, and discovering all kinds of surprises.
        
           | tialaramex wrote:
           | LLVM isn't perfect, but the problem here is that there's a
           | C++ compiler flag (-ffast-math) which says OK, disregard how
           | arithmetic actually works, we're going to promise _across the
            | entire codebase_ that we don't actually care and that's
           | fine.
           | 
           | This is nonsense, but it's really common, distressingly
           | common, for C and C++ programmers to use this sort of
           | inappropriate global modelling. It's something which cannot
            | scale, it works OK for one-man projects, "Oh, I use the
           | Special Goose Mode to make routine A better, so even though
           | normal Elephants can't Teleport I need to remember that in
           | Special Goose Mode the Elephants in routine B might
           | Teleport". In practice you'll screw this up, but it _feels_
            | like you'll get it right often enough to be valuable.
           | 
           | In a large project where we're doing software engineering
           | this is complete nonsense, now Jenny, the newest member of
           | the team working on routine A, will see that obviously
           | Special Goose Mode is a great idea, and turn it on, whereupon
           | the entirely different team handling routine B find that
           | their fucking Elephants can now Teleport. WTF.
           | 
           | The need to _never do this_ is why I was glad to see Rust
           | stabilize (e.g) u32::unchecked_add fairly recently. This
            | (unsafe obviously) method says no, I don't want checked
           | arithmetic, or wrapping, or saturating, I want you to _assume
           | this cannot overflow_. I am _formally promising_ that this
           | addition is never going to overflow, in order to squeeze out
           | the last drops of performance.
           | 
           | Notice that's not a global flag. I can write let a = unsafe {
           | b.unchecked_add(c) }; in just one place in a 50MLOC system,
           | and _for just that one place_ the compiler can go absolutely
           | wild optimising for the promise that overflows never happen -
           | and yet right next door, even on the next line, I can write
           | let x = y + z; and _that_ still gets the kid gloves, if it
            | overflows nothing catches on fire. That's how granular this
           | needs to be to be useful, unlike C++ -ffast-math.
        
             | gpderetta wrote:
             | You can set fast math (or a subset of it) on a translation
             | unit basis.
        
               | tialaramex wrote:
                | Because the language works by textual inclusion, a
                | "translation unit" isn't really just your code, so this
                | is much more likely to result in nasty surprises, up to
                | and including ODR violations.
        
               | gpderetta wrote:
               | Yes, if you try hard enough I'm sure you can find ways to
               | screw up.
               | 
               | From a practical point of view it is fine.
        
               | cozzyd wrote:
               | Or even on a per function basis, at least with gcc (no
               | clue about clang...)
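For reference, the per-function form looks roughly like this. This is a hedged sketch: the `optimize` attribute is GCC-specific (clang ignores it with a warning), and as noted elsewhere in the thread it has open bugs, so verify the generated asm before relying on it.

```cpp
// GCC-specific: opt this one function into fast-math while the rest of
// the file keeps strict IEEE semantics. Guarded so clang builds cleanly.
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("-ffast-math")))
#endif
float dot3_fast(const float *a, const float *b) {
  // With the attribute active, GCC may reassociate or fuse these ops.
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```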
        
               | gpderetta wrote:
               | In principle yes, you can use Attribute optimize, but I
               | wouldn't rely on it. Too many bugs open against it.
        
               | cozzyd wrote:
               | It's worked for me in the past but maybe I got lucky.
        
               | pdw wrote:
               | That's not really true. If you link a shared library that
               | was compiled with -ffast-math, that will affect the
               | entire program.
                | https://moyix.blogspot.com/2022/09/someones-been-messing-wit...
        
             | nikic wrote:
             | LLVM actually also supports instruction-level granularity
             | for fast-math (using essentially the same mechanism as
             | things like unchecked_add), but Clang doesn't expose that
             | level of control.
        
               | wmobit wrote:
               | clang does have pragma clang fp to enable a subset of
               | fast math flags within a scope
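A small sketch of that pragma (Clang-specific; `reassociate` is one of the fast-math sub-flags it can toggle per scope, guarded here so other compilers still build the code):

```cpp
// Enable floating-point reassociation for this one function body only;
// the rest of the translation unit keeps strict IEEE evaluation order.
float sum4(float a, float b, float c, float d) {
#if defined(__clang__)
#pragma clang fp reassociate(on)
#endif
  return (a + b) + (c + d);  // clang may regroup these additions
}
```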
        
             | nickelpro wrote:
             | If you're using floating point at all you have declared you
             | don't care about determinism or absolute precision across
             | platforms.
             | 
             | Fast math is simply saying "I care even less than IEEE"
             | 
             | This is perfectly appropriate in many settings, but
             | _especially_ video games where such deterministic results
             | are completely irrelevant.
        
               | aw1621107 wrote:
               | > This is perfectly appropriate in many settings, but
               | _especially_ video games where such deterministic results
               | are completely irrelevant.
               | 
               | I'm not sure I'd agree. Off the top of my head, a
               | potentially significant benefit to deterministic
               | floating-point is allowing you to send updates/inputs
               | instead of world state in multiplayer games, which could
               | be a substantial improvement in network traffic demand.
               | It would also allow for smaller/simpler cross-
               | machine/platform replays, though I don't know how much
               | that feature is desired in comparison.
        
               | comex wrote:
               | Indeed, but this isn't hypothetical. A large fraction of
               | multiplayer games operate by sending inputs and requiring
               | the simulation to be deterministic.
               | 
               | (Some of those are single-platform or lack cross-play
               | support, and thus only need consistency between different
               | machines running the same build. That makes compiler
               | optimizations less of an issue. However, some do support
               | cross-play, and thus need consistency between different
               | builds - using different compilers - of the same source
               | code.)
        
               | immibis wrote:
               | Actually, floating-point math is mostly deterministic.
               | There is an exception for rounding errors in
               | transcendental functions.
               | 
               | The perception of nondeterminism came specifically from
               | x87, which had 80-bit native floating-point registers,
               | which were different from every other platform's 64-bit
               | default, and forcing values to 64-bit all the time cost
               | performance, so compilers secretly turned data types to
               | different ones when compiling for x87, therefore giving
               | different results. It would be like if the compiler for
               | ARM secretly changed every use of 'float' into 'double'.
        
         | hifromwork wrote:
         | Everything in this post applies to C too, so it's not C++
         | specific. And the same gotchas apply for every case when you
         | use inline assembly. I wouldn't call it a trick... Just an
         | interesting workaround for a performance bug in Clang.
         | 
         | The post can be boiled down to "Clang doesn't compile this
         | intrinsic nicely, so just use inline asm directly. But remember
         | that you need to have a non-asm special case to optimize
         | constants too, and you can achieve this with
         | __builtin_constant_p".
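That workaround can be sketched roughly like this (illustrative, not the article's exact code; x86 and GCC/Clang inline-asm syntax assumed). Constant inputs take the intrinsic path, which the compiler can fold at compile time; runtime inputs go through inline asm, which pins the instruction choice to sqrtps, out of reach of the -ffast-math rewrite:

```cpp
#include <immintrin.h>

static inline __m128 my_sqrt_ps(__m128 x) {
  if (__builtin_constant_p(x)) {
    return _mm_sqrt_ps(x);  // constant-foldable path
  }
  __m128 r;
  // Force the literal instruction; the optimizer can't rewrite asm.
  __asm__("sqrtps %1, %0" : "=x"(r) : "x"(x));
  return r;
}
```

With optimization enabled, the `__builtin_constant_p` check itself compiles away, leaving only whichever branch applies.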
        
           | GrantMoyer wrote:
           | I wouldn't call this a performance bug in clang. It's an
           | optimization working as intended.
        
             | dooglius wrote:
             | I would challenge you to find a processor on which the
             | rsqrt plus two newton-raphson iterations is not slower than
             | plain sqrt. (We don't know what mtune the author used)
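For context, the pattern in question looks roughly like this (an illustrative sketch, not clang's literal output): approximate sqrt(x) as x * rsqrt(x), with a Newton-Raphson step refining the ~12-bit rsqrt estimate.

```cpp
#include <immintrin.h>

static inline __m128 approx_sqrt_ps(__m128 x) {
  __m128 r = _mm_rsqrt_ps(x);  // rough 1/sqrt(x) estimate
  // Newton-Raphson refinement: r' = 0.5 * r * (3 - x * r * r)
  r = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), r),
                 _mm_sub_ps(_mm_set1_ps(3.0f),
                            _mm_mul_ps(x, _mm_mul_ps(r, r))));
  // sqrt(x) = x * (1/sqrt(x)); note x == 0 gives 0 * inf = NaN here,
  // one of the edge cases -ffast-math licenses the compiler to ignore.
  return _mm_mul_ps(x, r);
}
```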
        
               | pclmulqdq wrote:
               | The author probably didn't use any mtune setting, which
               | is likely the problem. If you look at older cores on
               | Agner's instruction tables, SQRT has been getting
               | steadily faster over time. This implementation is
               | slightly faster on old Intel machines, for example.
        
               | GrantMoyer wrote:
               | According to Intel, any processor before Skylake (section
               | 15.12 from [1]).
               | 
                | [1]: https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...
        
         | parasti wrote:
         | A lesson I learned very early on: don't use -ffast-math unless
         | you know what it does. It has a very appealing name that
         | suggests nice things. You probably won't ever need it.
        
           | SassyBird wrote:
           | And even if you do know what it does, it's very impractical
           | given that it's a global option. Turning it on for selected
           | portions of code would be a completely different game.
        
             | adgjlsfhk1 wrote:
             | Julia has this (via a @fastmath macro) and it's so nice!
        
             | swatcoder wrote:
             | You're mistaken.
             | 
             | You can/would just use it in the translation units where
             | you want it; usually for numerical code where you want
             | certain optimizations or behaviors and know that the
             | tradeoffs are irrelevant.
             | 
              | It's mostly harmless for everyday application math anyway,
             | and so enabling it for your whole application isn't a
             | catastrophe, but it's not what people who know what they're
             | doing would usually do. It's usually used for a specific
             | file or perhaps a specific support library.
        
               | leni536 wrote:
               | -ffast-math affects other translation units too, because
               | it introduces a global initialization that affects some
               | CPU flags. You can't really contain it to a single TU.
        
           | adgjlsfhk1 wrote:
           | I prefer -funsafemath. Who doesn't want their math to be fun
           | and safe?
        
             | glitchc wrote:
             | I don't know how many people got your joke, but that flag
             | actually says unsafe math. The -f prefix signifies a
             | floating point flag.
        
               | rockwotj wrote:
               | I am pretty sure the -f is for feature. Because there are
               | flags like -fno-exceptions and -fcoroutines
        
               | jkrejcha wrote:
               | It's a feature flag, they're not all necessarily floating
               | point related. As an example though, I still
               | intentionally humorously misread -funroll-loops as
               | funroll loops even though it's f(eature) unroll loops
        
         | flohofwoe wrote:
         | This is a Clang or even LLVM code generation issue and entirely
         | unrelated to the C and C++ standards (-ffast-math, intrinsics
         | and inline assembly are all compiler-specific features not
         | covered by the standards).
         | 
         | Most other language ecosystems most likely suffer from similar
          | problems if you look under the hood.
         | 
         | At least compilers like Clang also give you the tools to
          | work around such issues, as demonstrated by the article.
        
         | almostgotcaught wrote:
         | > How is a non-expert in the language supposed to learn
         | tricks/... things like this?
         | 
         | just like everything else in life that's complex: slowly and
         | diligently.
         | 
         | i hate to break it to you - C++ is complex not for fun but
         | because it has to both be modern (support lots of modern
         | syntax/sugar/primitives) _and_ compile to target code for an
          | enormous range of architectures, and modern architectures are
          | horribly complex and varied (x86/arm isn't the only game in
         | town). forgo one of those and it could be a much simpler
         | language but forgo one of those and you're not nearly as
         | compelling.
        
         | nalrqa wrote:
         | > How is a non-expert in the language supposed to learn
         | tricks/... things like this?
         | 
         | By learning C and inline asm. For a C developer, this is
         | nothing out of the ordinary. C++ focuses too much on new
         | abstractions and hiding everything in the stdlib++, where the
         | actual implementations of course use all of this and look more
         | like C, which includes using (OMG!) raw pointers.
        
           | pas wrote:
           | arguably vast-vast-vaaast majority of projects, problems,
           | solutions, developers simply don't need this.
           | 
           | but yes, given the premise of the article is that a friend
           | wants to use a specific CPU instruction, yeah, at least
           | minimum knowledge of one's stack is required (and usually the
           | path leads through Assembly, some C/Rust and an FFI interface
           | - like JNI for Java, cffi/Cython for Python, and so on)
        
         | elvircrn wrote:
         | Take a computer architecture course for starters.
        
           | bryancoxwell wrote:
           | Got any you'd recommend?
        
             | pas wrote:
             | for appetizer I recommend Cliff Click's "A Crash Course in
             | Modern Hardware" talk
             | 
              | https://www.youtube.com/watch?v=5ZOuCuGrw48 (and here's
              | the 2009 version
              | https://www.infoq.com/presentations/click-crash-course-moder...
              | which might be interesting for comparison)
        
               | bryancoxwell wrote:
               | Thanks!
        
         | tonetegeatinst wrote:
         | I'm a security student. My main experience has been python and
         | java....but I have started to learn c to better learn how low
         | level stuff works without so much abstraction.
         | 
         | My understanding is that C is a great language, but I also get
         | that its not for everyone. Its really powerful, and yet you can
         | easily make mistakes.
         | 
         | For me, I'm just learning how to use C, I'm not trying to
         | understand the compiler or make files yet. From what I get, the
         | compiler is how you can achieve even better performance, but
         | you need to understand how it is doing its black
         | magic....otherwise you just might make your code slower or more
         | inefficient.
        
           | vlovich123 wrote:
           | First order optimization is always overall program
           | architecture, then hotspots, then fixing architectural issues
           | in the code (e.g. getting rid of misspeculated branches,
           | reducing instructions etc), and then optimizing the code the
            | compiler is generating. And at no point does it require
            | knowing the internals of the optimization passes or how they
            | generate the code.
           | 
           | As for the compiler's role in C, it's equivalent to javac -
           | it's taking your source and creating machine code, except the
           | machine code isn't an abstract bytecode but the exact machine
           | instructions intended to run on the CPU.
           | 
           | The issues with C and C++ are around memory safety. Practice
           | has repeatedly shown that the defect rate with these
           | languages is high enough that it results in lots of easily
           | exploitable vulnerabilities. That's a bit more serious than a
           | personal preference. That's why there's pushes to shift the
           | professional industry itself to stop using C and C++ in favor
           | of Rust or even Go.
        
           | torusle wrote:
           | Nah, it is not that bad.
           | 
           | Sure you can mess up your performance by picking bad compiler
           | options, but most of the time you are fine with just default
            | optimizations enabled and letting it do its thing. No need to
           | understand the black magic behind it.
           | 
           | This is only really necessary if you want to squeeze the last
           | bit of performance out of a piece of code. And honestly, how
            | often does this occur in day-to-day coding unless you write a
           | video or audio codec?
        
             | vlovich123 wrote:
             | The main flags to look at:
             | 
             | * mtune/march - specifying a value of native optimizes for
             | the current machine, x86-64-v1/v2/v3/v4 for generations or
             | you can specify a specific CPU (ARM has different naming
             | conventions). Recommendation: use the generation if
             | distributing binaries, native if building and running
             | locally unless you can get much much more specific
             | 
             | * -O2 / -O3 - turn on most optimizations for speed.
             | Alternatively Os/Oz for smaller binaries (sometimes faster,
             | particularly on ARM)
             | 
             | * -flto=thin - get most of the benefits of LTO with minimal
             | compile time overhead
             | 
             | * pgo - if you have a representative workload you can use
             | this to replace compiler heuristics with real world
             | measurements. AutoFDO is the next evolution of this to make
             | it easier to connect data from production environments to
             | compile time.
             | 
             | * math: -fno-math-errno and -fno-trapping-math are "safe"
              | subsets of -ffast-math (i.e. they don't alter the numerical
             | accuracy). -fno-signed-zeros can also probably be
             | considered if valuable.
        
               | vbezhenar wrote:
               | Also I learned recently that there's `-Og` which enables
               | optimizations suitable for debug build.
        
               | vlovich123 wrote:
               | In practice I've had limited success with that flag. It
               | still seems to enable optimizations that make debugging
               | difficult.
        
         | fredgrott wrote:
          | if you understand that C/C++'s purpose was at first to write
          | an OS... you are somewhat aware of this... but that would
          | depend upon your CS classroom studies exposure...
         | 
         | In my case it was by accident as I picked up assembly and
         | machine language before I touched C in the late 1980s.
        
         | plq wrote:
         | They have a computation to do, and they want to do it using a
         | particular instruction. That's the only reason they are
         | fiddling with the compiler. You would have to do so too, if you
         | had decided to solve a problem at hand using a similar method.
         | 
         | The author is talking about a way to get a particular (version
         | of) C/C++ compiler to emit the desired instruction. So I'd call
         | this clang-18.1.0-specific but not C/C++-specific since this
         | has nothing to do with the language.
         | 
         | Also such solutions are not portable nor stable since
         | optimization behavior _does_ change between compiler versions.
         | As far as I can tell, they also would have to implement a
         | compiler-level unit test that ensures that the desired machine
         | code is emitted as toolchain versions change.
        
         | kenferry wrote:
         | I mean, you don't have to care about this unless you have an
         | application where you do. And if you do there is enough
         | transparency (ie ability to inspect the assembly and ask
         | questions) that you can solve this one issue without knowing
         | everything under the sun.
         | 
         | If you had an application where this sort of thing made a
          | difference in JavaScript, the problem would likely still be
          | there; you'd just have a lot less visibility on it.
         | 
         | I guess you're still right - at the end of the day you see
         | discussions like this far more often in C, so it impacts the
         | feel of programming in C more.
        
         | Conscat wrote:
         | In my opinion, `__builtin_constant_p()` is not _that_ obscure
         | of a feature. In C, it is used in macros to imitate constant
         | functions, and in C++ it is useful for determining the current
         | alternative that has lifetime in a `union` within a constant
         | function. Granted that `__builtin_is_constant_evaluated()` has
          | obsoleted its primary purpose, but there are enough ways it's
         | still useful that I see it from time to time.
        
         | pandaman wrote:
         | Indeed, C++ is different from most languages (other than C)
          | because "knowing C++" does not mean just knowing the syntax
          | and standard library API; it implies understanding how the
          | source code is turned into bytes inside an executable image. A
          | Python or Java or whatever programmer could be writing books
          | and be praised as a language expert without the slightest idea
          | of how memory is allocated; a C++ programmer who does not know
          | that is probably going to work in some legacy shop supporting
          | an ancient MFC/Qt data-entry app.
        
         | Waterluvian wrote:
         | Not only that but this is so far away from the actual problem
         | being solved that it's just kind of annoying.
         | 
         | I wish there was some sensible way for code that's purely about
         | optimization to live entirely separated from the code that's
         | about solving the problem at hand...
        
       | pxmpxm wrote:
       | Huh, why would reciprocal sqrt and fpmul be faster than regular
       | sqrt?
        
         | pkhuong wrote:
         | It's less accurate.
        
         | GrantMoyer wrote:
         | Intel's explanation (at section 15.12):
         | https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...
         | 
         | > In Intel microarchitectures prior to Skylake, _the SSE divide
         | and square root instructions DIVPS and SQRTPS have a latency of
         | 14 cycles (or the neighborhood) and they are not pipelined.
         | This means that the throughput of these instructions is one in
         | every fourteen cycles._ The 256-bit Intel AVX instructions
         | VDIVPS and VSQRTPS execute with 128-bit data path and have a
         | latency of twenty-eight cycles and they are not pipelined as
         | well.
         | 
         | > In microarchitectures that provide DIVPS/SQRTPS with high
         | latency and low throughput, it is possible to speed up single-
         | precision divide and square root calculations using the
         | (V)RSQRTPS and (V)RCPPS instructions. _For example, with
         | 128-bit RCPPS /RSQRTPS at five-cycle latency and one-cycle
         | throughput_ or with 256-bit implementation of these
         | instructions at seven-cycle latency and two-cycle throughput, a
         | single Newton-Raphson iteration or Taylor approximation can
         | achieve almost the same precision as the (V)DIVPS and (V)SQRTPS
         | instructions
         | 
         | (emphasis mine)
        
           | dooglius wrote:
           | That's a different case where one is trying to calculate
           | rsqrt(x) from 1/sqrt(x) whereas the article involves
           | computing sqrt(x) from x*rsqrt(x) and in this case you don't
           | need the divps instruction for sqrt but do need an additional
           | mul instruction for the rsqrt case.
        
             | GrantMoyer wrote:
             | I think you've misread the section:
             | 
             | > it is possible to speed up single-precision divide and
             | square root calculations using the (V)RSQRTPS and (V)RCPPS
             | instructions
        
         | TinkersW wrote:
          | Only on some older hardware, really. It's a pretty dumb
          | optimization from clang: people writing SIMD probably already
          | know this trick and have a version of sqrt that is exactly
          | this, so going behind their backs and forcing it when they
          | requested sqrt_ps is very uncool.
        
           | cozzyd wrote:
           | Arguably this is an incorrect default of mtune
        
       | eqvinox wrote:
       | Pet peeve: this isn't "your own constant folder in C/C++"... it's
       | "your own enabling constant folding in C/C++"...
       | 
       | With "own constant folder" I expected a GCC/clang plugin that
       | does tree/control flow analysis and some fancy logic in order to
       | determine when, where and how to constant fold...
        
       | GrantMoyer wrote:
       | I think the real problem is "really really wanted the sqrtps to
        | be used in some code they were writing" is at odds with
        | -ffast-math.
       | 
       | Clang transforms sqrtps(x) to x * rsqrtps(x) when -ffast-math is
       | set because it's often faster (See [1] section 15.12). It isn't
       | faster for some architectures, but if you tell clang what
       | architecture you're targeting (with -mtune), it appears to make
       | the right choice for the architecture[2].
       | 
       | [1]:
       | https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...
       | 
       | [2]:
       | https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
        
       | white_beach wrote:
       | somewhat related to this:
       | 
       | https://godbolt.org/z/1543TYszP
       | 
       | (intel c++ compiler)
        
       | kookamamie wrote:
       | It's just so fiddly - can't trust the compiler, not a good sign
       | for a language or its implementation.
        
         | g15jv2dp wrote:
         | Do you think other compilers for other languages don't perform
         | optimizations like this?
        
           | adgjlsfhk1 wrote:
           | Julia does all of this correctly (although there are a few 0
           | cost redundant move instructions presumably for ABI reasons).
           | 
            | julia> f(x) = @fastmath sqrt.(x)
            | 
            | julia> @code_native debuginfo=:none f.(Tuple(rand(Float32, 8)))
            |     vsqrtps ymm0, ymmword ptr [rsi]
            |     mov rbp, rsp
            |     mov rax, rdi
            |     vmovups ymmword ptr [rdi], ymm0
            |     pop rbp
            |     vzeroupper
            |     ret
            | 
            | julia> g() = f((1f0, 2f0, 3f0, 4f0))
            | 
            | julia> @code_native debuginfo=:none g()
            | .LCPI0_0:
            |     .long 0x3f800000 # float 1
            |     .long 0x3fb504f3 # float 1.41421354
            |     .long 0x3fddb3d7 # float 1.73205078
            |     .long 0x40000000 # float 2
            |     .text
            |     .globl julia_g_600
            |     .p2align 4, 0x90
            |     .type julia_g_600,@function
            | julia_g_600: # @julia_g_600
            | # %bb.0: # %top
            |     push rbp
            |     movabs rcx, offset .LCPI0_0
            |     mov rbp, rsp
            |     mov rax, rdi
            |     vmovaps xmm0, xmmword ptr [rcx]
            |     vmovups xmmword ptr [rdi], xmm0
            |     pop rbp
            |     ret
        
             | g15jv2dp wrote:
              | ...so does LLVM, unless you specifically instruct it to do
              | otherwise with -ffast-math. Don't be fooled by the name of
              | this flag: if it were possible to just do math faster
              | without compromising anything, it wouldn't be a flag, it
              | would be the default. It seems that Julia has a flag with
              | the same name but different behavior. Okay? What's your
              | point exactly?
        
               | chongli wrote:
               | But in this case, -ffast-math actually results in slower
               | math, so the flag is badly named. Naming things is
               | important (and hard). The naive expectation of a flag
               | named -ffast-math is that the math should never be
               | slower, sometimes be faster, and potentially less
               | accurate and/or error prone. The fact that it can also be
               | sometimes slower means the flag really should be named
               | -fdifferent-math or (uncharitably) -funreliable-math.
        
               | tcbawo wrote:
               | It's too bad that it hasn't acquired the name of an
               | author to describe this alternative functionality more
               | concisely without being judgy (like Nagle's algorithm). I
               | looked around, and it seems that there isn't any one
               | designer/author that you could give credit to.
        
               | adgjlsfhk1 wrote:
               | @fastmath in Julia is telling LLVM (basically) the same
               | thing. I think the only difference is that Julia doesn't
               | think rsqrt(x)*x with Newton iteration is a good way of
                | computing sqrt.
        
         | dist1ll wrote:
         | I mean, the fiddly-ness comes from wanting to use inline
         | assembly, which is often considered to be niche and second-
         | class in most modern languages. Even writing the asm in the
         | first place is full of footguns [0]. There are ways to reduce
         | the optimization barrier of asm, and implementing constant
         | folding is one such example. But I can see why most compiler
         | writers wouldn't be that interested in it.
         | 
         | [0] https://news.ycombinator.com/item?id=40607845
        
         | ComputerGuru wrote:
         | You can never trust an optimizing compiler. I've seen similar
         | issues in pretty much every language I've written performance-
         | sensitive code in. Point revisions change optimizations all the
         | time, sometimes with horribly disastrous results. One example:
         | every time rust upgrades the version of LLVM underlying the
         | compiler backend, the team prepares itself for a deluge of
         | optimization regression reports (though this has been somewhat
         | "mitigated" by a massive regression in rustc's ability to
         | convey sufficient information to the LLVM backend for it to
         | optimize to begin with, thanks to a change made to improve
         | compile times [0]).
         | 
         | This is almost universally true across all languages. The
         | behaviors you see are usually the result of countless
         | heuristics stacked one atop the other, and a tiny change here
         | can end up with huge changes there.
         | 
         | [0]: https://github.com/rust-lang/rust/pull/91743
        
       | edelsohn wrote:
       | __builtin_constant_p(vec) is not asking whether the contents of
       | vec are constant. The compilers are not being fickle; the
       | statement is simply not asking the question the developer
       | intended.
        
       | dooglius wrote:
       | I have not found __builtin_constant_p to be very reliable when I
       | want to fold multiple times. Is there a way to do this trick
       | better using C++ constexpr, I wonder?
        
       | ComputerGuru wrote:
       | > but I think the LLVM folks have purposefully tried to match
       | what GCC would do.
       | 
       | I never got that impression in my perusal of LLVM bug reports and
       | patches. I wonder if there is an open issue for this specific
       | case.
        
       | ComputerGuru wrote:
       | This reminds me of an issue that I ran into with rust [0] when I
       | was trying to optimize some machine learning code. Rust's uber-
       | strict type-safe math operations have you use matching types to
       | get the Euclidean non-negative remainder of x mod y. When you
       | have a floating-point x value but an integral y value, the
       | operation can be performed _much_ more cheaply than when y is
       | also a floating-point value.
       | 
       | The problem is that you end up promoting y from an integer to an
       | f64 and get a much slower operation. I ended up writing my own
       | `rem_i64(self: &f64, divisor: i64) -> f64` routine that was some
       | ~35x faster (a huge win when crunching massive arrays), but as
       | there are range limitations (since f64::MAX > i64::MAX) you can't
       | naively replace all call sites based on the type signatures.
       | However, with some support from the compiler it would be
       | completely doable anytime the compiler is able to infer an
       | upper/lower bound on the f64 dividend, when the result of the
       | operation is coerced to an integer afterwards, or when the
       | dividend is a constant value that doesn't exceed that range.
       | 
       | So now I copy that function around from ML project to ML project,
       | because what else can I do?
       | 
       | (A "workaround" was to use a slower-but-still-faster
       | `rem_i128(self: &f64, divisor: i128) -> f64` to raise the
       | functional limits of the operation, but you're never going to
       | match the range of a 64-bit floating point value until you use
       | 512-bit integral math!)
       | 
       | [0]: https://github.com/rust-lang/rust/issues/83973
       | 
       | Godbolt link: https://godbolt.org/z/EqrEqExnc
        
         | toast0 wrote:
         | > So now I copy that function around from ML project to ML
         | project, because what else can I do?
         | 
         | Aren't you supposed to make a crate? Which _is_ copying with
         | more steps, but might make your life easier (or harder).
        
           | ComputerGuru wrote:
           | I knew someone would pipe in with that suggestion as I was
           | writing that comment!
           | 
           | Yes, I suppose that would be the canonical way to go. But
           | it's just one function; four lines! I'm getting isEven()
           | vibes!
        
             | pryelluw wrote:
             | I package unrelated classes, functions, utilities like
             | yours into one module/crate/library and just keep adding
             | stuff to it. Sort of like my own standard library.
        
               | mikepurvis wrote:
               | That works for a time but it's ultimately not great for
               | new/external contributors to be faced with your code
               | being full of unfamiliar idioms and utility functions
               | coming from a single kitchen sink package.
        
               | marcosdumay wrote:
               | Really don't.
               | 
               | If that crate becomes successful, it will basically
               | greenlight a lot of functionality that was just there
               | because it happened to be made by the same author. And
                | all that extra functionality will only reduce the
                | chances of the really useful function getting into the
                | spotlight. Both are bad for the ecosystem.
                | 
                | If the GP has _related_ useful little functions, yes,
                | pack them together. Otherwise, I'd say a crate for one
                | small function isn't a problem at all.
        
             | im3w1l wrote:
             | Maybe you could go looking for a home for your function in
             | some existing crate then?
        
               | saghm wrote:
               | Maybe it could fit into the `num` crate in some fashion?
               | https://docs.rs/num/latest/num/
        
       | pclmulqdq wrote:
       | Another option is to use "-march" to set your target architecture
       | to something post-skylake/Zen 2. That should emit the right
       | instruction.
       | 
       | The square root and div instructions used to be a lot slower than
       | they are now.
        
       | SAI_Peregrinus wrote:
       | > if you happened to use -ffast-math
       | 
       | That option is terribly named. It should be
       | -fincorrect_math_that_might_sometimes_be_faster.
        
       | omoikane wrote:
       | Offtopic, but the original title is "Your Own Constant Folder in
       | C/C++". I am guessing Hacker News cut out the "your" for some
       | reason, but at first glance I thought it was some kind of read-
       | only directory implemented in C/C++. It's like a truncated garden
       | path sentence.
       | 
       | https://en.wikipedia.org/wiki/Garden-path_sentence
        
       | mgaunard wrote:
       | The real fix is to not use -ffast-math. It breaks all sorts of
       | things by design.
       | 
       | If you really want to relax the correctness of your code to get
       | some potential speedup, either do the optimizations yourself
       | instead of letting the compiler do them, or locally enable fast-
       | math equivalents.
       | 
       | As for the is-constant-expression GCC extension, that
       | functionality is natively available in standard C++ nowadays.
        
         | hawk_ wrote:
         | How does one locally enable -ffast-math equivalent?
        
           | cozzyd wrote:
           | https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-
           | Option-...
        
       | Lockal wrote:
       | For C++ this goes to the question
       | https://stackoverflow.com/questions/8936549/constexpr-overlo...
       | 
       | Back in the days the answer was `__builtin_constant_p`.
       | 
       | But with C++20 it is possible to use std::is_constant_evaluated,
       | or `if consteval` with C++23.
       | 
       | But this is for scenario when you still want to keep high quality
       | of code (maybe when you write multiprecision math library; not
       | when you hack around compiler flag), which inline assembly
       | violates for many reasons:
       | 
       | 1) major issue: instead of dealing with `-ffast-math` with inline
       | asm, just remove `-ffast-math`
       | 
       | 2) randomly slapped inline asm inside normal fp32 computations
       | breaks autovectorization
       | 
       | 3) randomly slapped inline asm in example uses non-vex encoding,
       | you will likely to forget to call vzeroupper on transition. Or in
       | general, this limits code to x86 (forget about x86-64/arm)
       | 
       | 4) provided example (best_always_inline) does not work in GCC as
       | expected
        
       | hansvm wrote:
       | Forced constant folding is something I use on occasion in Zig. To
       | declare that a computation needs to happen at compile-time, just
       | prefix it with `comptime`.
       | 
       | Compile-time execution emulates the target architecture, which
       | has pros and cons. The most notable con is that inline assembly
       | isn't supported (yet?). The complete solution to this particular
       | problem then additionally requires `return if (isComptime())
       | normal_code() else assembly_magic();` (which has no runtime cost
       | because that branch is constant-folded). Given that inline
       | assembly usually also branches on target architecture, that winds
       | up not adding much complexity -- especially given that you
       | probably had exactly that same code as a fallback for
       | architectures you didn't explicitly handle.
        
       | metadat wrote:
       | Is `_mm_sqrt_ps(..)` an Intel-only thing? Why is the naming so
       | jacked?
       | 
       | https://www.intel.com/content/www/us/en/docs/cpp-compiler/de...
        
       ___________________________________________________________________
       (page generated 2024-06-22 23:01 UTC)