[HN Gopher] std::clamp generates less efficient assembly than st...
       ___________________________________________________________________
        
        std::clamp generates less efficient assembly than
        std::min(max, std::max(min, v))
        
       Author : x1f604
       Score  : 151 points
       Date   : 2024-01-16 10:53 UTC (12 hours ago)
        
 (HTM) web link (1f6042.blogspot.com)
 (TXT) w3m dump (1f6042.blogspot.com)
        
       | fooker wrote:
       | If you benchmark these, you'll likely find the version with the
       | jump edges out the one with the conditional instruction in
       | practice.
        
         | svantana wrote:
         | That must depend on the platform and the surrounding code, no?
        
           | fooker wrote:
            | Yes. On platform: most modern CPUs are happier with
            | predictable branches than exotic instructions.
           | 
           | On surrounding code - for sure.
        
         | pclmulqdq wrote:
          | Compilers often under-generate conditional instructions. They
          | implicitly assume (correctly) that most branches you write are
          | 90/10 (i.e. very predictable), not 50/50. The branches that
          | actually are 50/50 suffer from being treated as 90/10.
        
           | fooker wrote:
           | The branches in this example are not 50/50.
           | 
           | Given a few million calls of clamp, most would be no-ops in
           | practice. Modern CPUs are very good at dynamically observing
           | this.
        
           | IainIreland wrote:
           | It's hard to predict statically which branches will be
           | dynamically unpredictable.
           | 
           | A seasoned hardware architect once told me that Intel went
           | all-in on predication for Itanium, under the assumption that
           | a Sufficiently Smart Compiler could figure it out, and then
           | discovered to their horror that their compiler team's best
           | efforts were not Sufficiently Smart. He implied that this was
           | why Intel pushed to get a profile-guided optimization step
           | added to the SPEC CPU benchmark, since profiling was the only
           | way to get sufficiently accurate data.
           | 
           | I've never gone back to see whether the timeline checks out,
           | but it's a good story.
        
             | fooker wrote:
              | The compiler doesn't do much of the predicting; it's done
              | by the CPU at runtime.
        
               | kyboren wrote:
               | [delayed]
        
         | jeffbee wrote:
          | FYI. https://quick-bench.com/q/sK9t9GoFDRkx9XxloUUbB8Q3ht4
         | 
          | Using this microbenchmark on an Intel Sapphire Rapids CPU,
          | compiled with -march=k8 to get the older form, takes ~980ns,
          | while compiling with -march=native gives ~570ns. It's not at
          | all clear that the imperfection the article describes is
          | really relevant in context, because the compiler transforms
          | this function into something quite different.
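          | 
          | A sketch of that kind of microbenchmark (assumed shape, not
          | the exact quick-bench code; random inputs, so the branch
          | predictor gets no help):
          | 
          |   #include <benchmark/benchmark.h>
          |   #include <algorithm>
          |   #include <random>
          |   #include <vector>
          | 
          |   static void BM_Clamp(benchmark::State& state) {
          |     std::mt19937 rng(42);
          |     std::uniform_real_distribution<double> dist(-2.0, 2.0);
          |     std::vector<double> data(4096);
          |     for (auto& d : data) d = dist(rng);
          |     for (auto _ : state)
          |       for (double v : data) {
          |         double r = std::clamp(v, -1.0, 1.0);
          |         benchmark::DoNotOptimize(r);
          |       }
          |   }
          |   BENCHMARK(BM_Clamp);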
        
           | fooker wrote:
           | With random test cases, branch prediction can't help.
        
       | tambre wrote:
        | Both recent GCC and Clang are able to generate the optimal
        | version for std::clamp() if you add something like
        | -march=znver1, even at -O1 [0]. Interesting!
       | 
       | [0] https://godbolt.org/z/YsMMo7Kjz
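        | 
        | A minimal repro of the two forms, assuming a test function
        | shaped like the article's (not its exact code):
        | 
        |   // g++ -O1 -march=znver1 (or just -mavx, see below)
        |   #include <algorithm>
        | 
        |   double clamp_std(double v, double lo, double hi) {
        |     return std::clamp(v, lo, hi);
        |   }
        | 
        |   double clamp_minmax(double v, double lo, double hi) {
        |     return std::min(hi, std::max(lo, v));
        |   }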
        
         | GrumpySloth wrote:
         | But then it uses AVX instructions. (You can replace
         | -march=znver1 with just -mavx.)
         | 
         | When AVX isn't enabled, the std::min + std::max example still
         | uses fewer instructions. Looks like a random register
         | allocation failure.
        
           | gpderetta wrote:
           | The additional "movapd xmm0, xmm2" is mostly free as it is
           | handled by renaming, but yes, it seems a quirk of the
           | register allocator. It wouldn't be the first time I see GCC
           | trying to move stuff around without obvious reasons.
        
       | jeffbee wrote:
       | Clang generates the shortest of these if you target sandybridge,
       | or x86-64-v3, or later. The real article that's buried in this
       | article is that compilers target k8-generic unless you tell them
        | otherwise, and the features and cost model of the Opteron are
        | obsolete.
       | 
       | Always specify your target.
        
         | josephg wrote:
         | Yep. Adding "-C target-cpu=native" to rustc on my desktop
         | computer consistently gets a ~10-15% performance boost compared
         | to the default target. The default target is extremely
         | conservative. As far as I can tell, it doesn't take advantage
         | of any CPU features added in the last 20 years. (The k8 came
         | out in 2003.)
        
           | jeffbee wrote:
           | Those Gentoo people were onto something.
        
             | alexey-salmin wrote:
             | Funny that it stopped being the case for a while around
             | 2006. AMD64 became widespread while also being very new,
             | closing the gap between "default" and "native".
        
             | skykooler wrote:
              | Of course, Gentoo just started using prebuilt packages a
             | few months ago...
        
           | wongarsu wrote:
           | Red Hat Enterprise Linux has upgraded their default target to
           | x86-64-v2 and is considering switching to x86-64-v3 for RHEL
           | 10 (which should release around 2026?). I'd take that as a
           | sign that those might be reasonable choices for newly
           | released software.
           | 
           | Some linux distros also give you the option to either get a
           | version compatible with ancient hardware or the optimized
           | x86-64-v3 version, which seems like a good compromise.
        
       | svantana wrote:
       | I'm a heavy std::clamp user, but I'm considering replacing it
       | with min+max because of the uncertainty about what will happen
        | when lo > hi. On Windows it triggers an assertion, while other
       | platforms just do a min+max in one or the other order. Of course,
       | this should never happen but can be difficult to guarantee when
       | the limits are derived from user inputs.
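        | 
        | Concretely, what I'd replace it with is something like this
        | sketch (safe_clamp is my name, not a std facility): validate
        | the interval up front so lo > hi never reaches the clamp.
        | 
        |   #include <algorithm>
        |   #include <cassert>
        | 
        |   double safe_clamp(double v, double lo, double hi) {
        |     assert(lo <= hi && "clamp interval must be non-empty");
        |     return std::min(hi, std::max(lo, v));  // hi wins if lo > hi
        |   }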
        
         | lifthrasiir wrote:
         | Pretty sure that their behaviors on NaN arguments will also
         | differ.
        
         | wegfawefgawefg wrote:
          | I hope they fix it. That's quite a basic functional unit for
          | it to be a footgun all on its own.
        
           | camblomquist wrote:
           | Don't get your hopes up, the behavior when lo > hi is
           | explicitly undefined.
        
         | lpapez wrote:
         | > Of course, this should never happen but can be difficult to
         | guarantee when the limits are derived from user inputs.
         | 
         | Sounds to me like you are missing a validation step before
         | calling your logic. When it comes to parsing, trusting user
         | input is a recipe for disaster in the form of buffer overruns
         | and potential exploits.
         | 
         | As they used to say in the Soviet Union: "trust, but verify".
        
           | PaulDavisThe1st wrote:
           | That was what Reagan said _about_ the Soviet Union, not what
           | was said in the Soviet Union.
           | 
           | Correct me if I'm wrong.
        
             | usefulcat wrote:
              | According to Wikipedia you're right (about Reagan), but
              | it's also a Russian proverb.
        
             | plucas wrote:
             | https://en.wikipedia.org/wiki/Trust,_but_verify
             | 
              | > Trust, but verify (Russian: доверяй, но проверяй, tr.
              | doveryay, no proveryay, IPA: [dəvʲɪˈrʲæj no prəvʲɪˈrʲæj])
              | is a Russian proverb, which rhymes in Russian. The
             | phrase became internationally known in English after
             | Suzanne Massie, a scholar of Russian history, taught it to
             | Ronald Reagan, then president of the United States, the
             | latter of whom used it on several occasions in the context
             | of nuclear disarmament discussions with the Soviet Union.
             | 
             | Memorably referenced in "Chernobyl":
             | https://youtu.be/9Ebah_QdBnI?t=79
        
               | PaulDavisThe1st wrote:
               | Thanks for the clarification/explanation!
        
               | fl0ki wrote:
               | Also referenced in the Metro Exodus "Sam's Story" DLC
               | because of the backstory of the two characters speaking,
               | and nuclear weapons once again being part of the
               | scenario.
        
               | dekhn wrote:
               | When I hear the phrase, I rewrite it to "don't trust,
               | verify."
        
             | protomolecule wrote:
             | Russian here. We use that expression from time to time:
             | https://ya.ru/search/?text="doveriai%2C+no+proveriai"
        
           | iknowstuff wrote:
            | The answer is of course
            | 
            |   clamp(v, min(a,b), max(a,b))
            | 
            | classic c++
        
         | dahart wrote:
          | Will min+max help you? What do you expect the answer to be
          | when lo > hi? What certainty should std::clamp have? Using
          | min+max on a number that's between hi and lo when lo > hi
          | will always return either lo or hi, and never your input
          | value.
        
           | svantana wrote:
           | Sure, that was the point - min(max()) forces you to give
           | explicit preference to lo or hi, whereas with clamp it's up
           | to the std library. I trust my users to bend my software to
            | their will, but I don't want different behavior on Mac and
            | Windows (for example).
        
             | dahart wrote:
              | Yeah, seems reasonable. I think the outer call wins, so
              | min(max()) will always return hi for empty intervals,
              | right? I didn't know std::clamp() was undefined for empty
             | intervals. It does seem like a good idea to try to
             | guarantee the interval is valid instead of worrying about
             | clamp... even with a guarantee, the answer might still
             | surprise someone, since technically the problem is
             | mathematically undefined and the guaranteed answer is
             | wrong.
        
       | celegans25 wrote:
       | On gcc 13, the difference in assembly between the min(max())
       | version and std::clamp is eliminated when I add the -ffast-math
       | flag. I suspect that the two implementations handle one of the
       | arguments being NaN a bit differently.
       | 
       | https://gcc.godbolt.org/z/fGaP6roe9
       | 
       | I see the same behavior on clang 17 as well
       | 
       | https://gcc.godbolt.org/z/6jvnoxWhb
        
         | gumby wrote:
         | You (celegans25) probably know this but here is a PSA that
         | -ffast-math is really -finaccurate-math. The knowledgeable
         | developer will know when to use it (almost never) while the
         | naive user will have bugs.
        
           | cogman10 wrote:
           | Ehh, not so much inaccurate, more of a "floating point
           | numbers are tricky, let's act like they aren't".
           | 
           | Compilers are pretty skittish about changing the order of
           | floating point operations (for good reason) and ffast-math is
           | the thing that lets them transform equations to try and
           | generate faster code.
           | 
            | I.e., instead of doing "n / 10", doing "n * 0.1". The
            | issue, of course, being that things like 0.1 can't be
            | perfectly represented with floats, but 100 / 10 can be. So
            | now you've introduced a tiny bit of error where it might
            | not have existed.
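            | 
            | A minimal illustration of that rewrite (hypothetical
            | example, not from the article):
            | 
            |   double scale(double n) {
            |     return n / 10.0;
            |     // -ffast-math may rewrite this as n * 0.1; 0.1 isn't
            |     // exactly representable in binary, so the result can
            |     // differ in the last bit, e.g. for n == 3.0:
            |     //   3.0 / 10.0 == 0.29999999999999998...
            |     //   3.0 * 0.1  == 0.30000000000000004...
            |   }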
        
             | phkahler wrote:
             | I've never understood why generating exceptions is
             | preferable to just using higher precision.
        
               | gumby wrote:
               | Higher precision isn't always available. IEEE 754 is an
               | unusually well-thought-through standard (thanks to some
               | smart people with a lot of painful experience) and is
               | pretty good at justifying its decisions, some of which
               | are surprising (far from obvious) to anyone not steeped
               | in it.
        
               | cogman10 wrote:
               | The main problem with floats in general is they are
               | designed primarily for scientific computing.
               | 
               | We are fortunately starting to see newer (well, not that
               | new now) CPU instructions like FMA that make more
               | accurate decimal representations not take such huge
               | performance hits.
        
               | adgjlsfhk1 wrote:
               | what does fma have to do with decimal numbers?
        
               | cogman10 wrote:
               | Oh shoot, nvm, I thought it was an optimization for
               | integers.
               | 
                | Really it'll be the SIMD style instructions that speed
                | things up.
        
               | dahart wrote:
               | On a GPU, higher precision can cost between 2 and 64
               | times more than single precision, with typical ratios for
               | consumer cards being 16 or 32. Even on the CPU, fp64
               | workloads tend to run at half the speed on real data due
               | to the extra bandwidth needed for higher precision.
        
             | rwmj wrote:
              | It isn't just that. -ffast-math also allows the compiler to
              | ignore infinities. In fact for GCC with -ffast-math, isinf
              | always returns false. Something similar happens for
              | NaNs/isnan.
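              | 
              | A small demo of that folding (my example, GCC with
              | -ffast-math):
              | 
              |   #include <cmath>
              |   #include <cstdio>
              | 
              |   int main() {
              |     volatile double zero = 0.0;  // hide from the folder
              |     double x = 1.0 / zero;       // +inf at runtime
              |     // prints 0 under -ffast-math: isinf() is compiled
              |     // down to a constant false
              |     std::printf("%d\n", (int)std::isinf(x));
              |   }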
        
               | cogman10 wrote:
                | I lump this into "floating points are tricky". NaNs and
                | inf are definitely legitimate floating point values. They
                | are also things that a lot of applications will break on,
                | if they ever encounter them.
        
           | alexey-salmin wrote:
            | If your code ventures into the domain where fast-math matters
            | and you're not a mathematician trying to solve a Lyapunov-
            | unstable problem with very tricky numeric methods, then most
            | likely your code is already broken.
        
           | dahart wrote:
           | Why do you say almost never? Don't let the name scare you;
           | all floating point math is inaccurate. Fast math is only
           | _slightly_ less accurate, I think typically it's a 1 or maybe
           | 2 LSB difference. At least in CUDA it is, and I think many
           | (most?) people  & situations can tolerate 22 bits of mantissa
           | compared to 23, and many (most?) people/situations aren't
           | paying attention to inf/nan/exception issues at all.
           | 
           | I deal with a lot of floating point professionally day to
           | day, and I use fast math all the time, since the tradeoff for
           | higher performance and the relatively small loss of accuracy
           | are acceptable. Maybe the biggest issue I run into is lack of
           | denorms with CUDA fast-math, and it's pretty rare for me to
           | care about numbers smaller than 10^-38. Heck, I'd say I can
           | tolerate 8 or 16 bits of mantissa most of the time, and fast-
           | math floats are way more accurate than that. And we know a
           | lot of neural network training these days can tolerate less
           | than 8 bits of mantissa.
        
             | mort96 wrote:
             | The scary thing IMO is: your code might be fine with unsafe
             | math optimisations, but maybe you're using a library which
             | is written to do operations in a certain order to minimise
             | numerical error, and unsafe math operations changes the
             | code which are mathematically equivalent but which results
             | in many orders of magnitude more numerical error. It's
             | probably fine most of the time, but it's kinda scary.
        
               | dahart wrote:
               | It shouldn't be scary. Any library that is sensitive to
               | order of operations will hopefully have a big fat warning
               | on it. And it can be compiled separately with fast-math
               | disabled. I don't know of any such libraries off the top
               | of my head, and it's quite rare to find situations that
               | result in orders of magnitude more error, though I grant
               | you it can happen, and it can be contrived pretty easily.
        
               | planede wrote:
               | You can't fully disable fast-math per-library, moreover a
               | library compiled with fast-math might also introduce
               | inaccuracies in a seemingly unrelated library or
               | application code in the same executable. The reason is
               | that fast-math enables some dynamic initialization of the
               | library that changes the floating point environment in
               | some ways.
        
               | londons_explore wrote:
                | you're gonna have to give us a concrete real world
                | example to convince most of us...
        
               | dahart wrote:
               | > You can't fully disable fast-math per library
               | 
               | Can you elaborate? What fast-math can sneak into a
               | library that disabled fast-math at compile time?
               | 
               | > fast-math enables some dynamic initialization of the
               | library that changes the floating point environment in
               | some ways.
               | 
               | I wasn't aware of this, I would love to see some
               | documentation discussing exactly what happens, can you
               | send a link?
        
               | mort96 wrote:
               | > Can you elaborate? What fast-math can sneak into a
               | library that disabled fast-math at compile time?
               | 
               | A lot of library code is in headers (especially in C++!).
               | The code in headers is compiled by your compiler using
               | your compile options.
        
               | dahart wrote:
               | Ah, of course, very good point. A header-only library
               | doesn't have separate compile options. This is a great
               | reason for a float-sensitive library to not be header-
               | only, right?
        
               | mort96 wrote:
               | It's not just about being header-only, lots of libraries
               | which aren't header-only still have code in headers. The
               | library may choose to put certain functions in headers
               | for performance reasons (to let compiler inline them),
               | or, in C++, function templates and class templates
               | generally have to be in headers.
               | 
               | But yeah, it's probably a good idea to not put code which
               | breaks under -ffast-math in headers if possible.
        
               | jcranmer wrote:
               | https://github.com/llvm/llvm-project/issues/57589
               | 
                | Turn on fast-math and it flips the FTZ/DAZ bits for the
                | entire application, even if you turned it on for just a
                | shared library!
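                | 
                | If you suspect a library has done this to you, a
                | sketch for x86 (MXCSR: FTZ is bit 15, DAZ is bit 6):
                | 
                |   #include <xmmintrin.h>
                |   #include <cstdio>
                | 
                |   int main() {
                |     unsigned csr = _mm_getcsr();
                |     std::printf("FTZ=%u DAZ=%u\n",
                |                 (csr >> 15) & 1u,  // flush-to-zero
                |                 (csr >> 6) & 1u);  // denormals-are-zero
                |   }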
        
               | mort96 wrote:
               | I don't typically thoroughly read through the
               | documentation for all the dependencies which my
               | dependencies are using.
               | 
               | But you're correct that it's probably usually fine in
               | practice.
        
               | dahart wrote:
               | That's fair. Ideally transitive dependencies should be
               | completely hidden from you. Hopefully the author of the
               | library you include directly has heeded the instructions
               | of libraries they depend on.
               | 
               | Hey I grant and acknowledge that using fast-math carries
               | a little risk of surprises, we don't necessarily need to
               | try to think of corner cases. I'm mostly pushing back a
               | little because using floats at all carries almost as much
               | risk. A lot of people seem to use floats without knowing
               | how inaccurate floats are, and a lot of people aren't
               | doing precision analysis or handling the exceptional
               | cases... and don't really need to.
        
               | ska wrote:
               | > A lot of people seem to use floats without knowing how
               | inaccurate floats are,
               | 
                | Small nit, but floats aren't inaccurate, they have
                | non-uniform precision. Some float _operations_ can be
                | inaccurate, but that's rather path dependent...
               | 
               | One problem with -ffast-math is that a) it sounds
               | appealing and b) people don't understand floats, so lots
               | of people turn it on without understanding what it does,
               | and that can introduce subtle problems in code they
               | didn't write.
               | 
               | Sometimes in computational code it makes sense e.g. to
               | get rid of denorms, but a very small fraction of
               | programmers understand this properly, or ever will.
               | 
               | I wish they had named it something scary sounding.
        
               | dahart wrote:
                | I am talking about float operations, of course. And
                | they're all inaccurate, generally speaking, because they
                | round. Fast math rounding error is not much larger than
                | rounding error without fast math.
        
             | usefulcat wrote:
             | > Fast math is only slightly less accurate
             | 
             | 'slightly'? Last I checked, -Ofast completely breaks
             | std::isnan and std::isinf--they always return false.
        
               | dahart wrote:
               | Hopefully it was clear from the rest of my comment that I
               | was talking about in-range floats there. I wouldn't
               | necessarily call inf & nan handling an accuracy issue,
               | that's more about exceptional cases, but to your point I
               | would have to agree that losing std::isinf is kinda bad
               | since divide by zero is probably near the very top of the
               | list of things most people using floats casually might
               | have to deal with.
               | 
               | Which compiler are you using where std::isinf breaks?
               | Hopefully it was also clear that my experience leans
               | toward CUDA, and I think the inf & nan support works
               | there in the presence of NVCC's fast-math.
        
               | usefulcat wrote:
               | My experience is with gcc and clang on x86. I generally
               | agree with you regarding accuracy, which is why I was
               | quite surprised when I first discovered that -Ofast
               | breaks isnan/isinf.
               | 
               | Even if I don't care about the accuracy differences, I
               | still need a way to check for invalid input data. The
               | upshot is that I had to roll my own isnan and isinf to be
               | able to use -Ofast (because it's actually the underlying
               | __builtin_xxx intrinsics that are broken), which still
               | seems wrong to me.
        
               | xigoi wrote:
               | They are talking about -ffast-math, not -Ofast.
        
               | gumby wrote:
               | From the gcc manual:
               | 
               | -Ofast
               | 
               | Disregard strict standards compliance. -Ofast enables all
               | -O3 optimizations. It also enables optimizations that are
               | not valid for all standard-compliant programs. _It turns
               | on -ffast-math_ , -fallow-store-data-races and the
               | Fortran-specific -fstack-arrays, unless -fmax-stack-var-
               | size is specified, and -fno-protect-parens. It turns off
               | -fsemantic-interposition.
        
             | jcranmer wrote:
             | Here are some of the problems with fast-math:
             | 
              | * It links in an object file that enables denormal flushing
              | globally, so that it affects _all_ libraries linked into
              | your application, even if said library explicitly _doesn't_
              | want fast-math. This is seriously one of the most user-
              | hostile things a compiler can do.
             | 
             | * The results of your program will vary depending on the
             | exact make of your compiler and other random attributes of
             | your compile environment, which can wreak havoc if you have
             | code that absolutely wants bit-identical results. This
             | doesn't matter for everybody, but there are some domains
             | where this can be a non-starter (e.g., multiplayer game
             | code).
             | 
              | * Fast-math precludes you from being able to use NaN or
              | infinities, and often even from being able to defensively
              | _test_ for NaN or infinity. Sure, there are times where this
              | is useful, but the option you might prefer to suggest to an
              | uninformed programmer would be a "floating-point code can't
              | overflow" option rather than "infinity doesn't exist and
              | it's UB if it does exist".
             | 
             | * Fast-math can cause hard range guarantees to fail. Maybe
             | you've got code that you can prove that, even with rounding
             | error, the result will still be >= 0. With fast-math, the
             | code might be adjusted so that the result is instead, say,
             | -1e-10. And if you pass that to a function with a hard
             | domain error at 0 (like sqrt), you now go from the result
             | being 0 to the result being NaN. And see above about what
             | happens when you get NaN.
             | 
              | Fast-math is a tradeoff, and if you're willing to accept
              | the tradeoff it offers, it's a fine option to use. But most
             | programmers don't even know what the tradeoffs are, and the
             | failure mode can be absolutely catastrophic. It's
             | definitely an option that is in the "you must be this
             | knowledgeable to use" camp.
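              | 
              | A contrived sketch of that hard-range failure (mine, not
              | from any particular codebase):
              | 
              |   #include <cmath>
              | 
              |   // With IEEE semantics the argument is provably >= 0
              |   // whenever |d| <= |r|, since rounding is monotonic.
              |   // Contraction to fma(r, r, -(d*d)), which fast-math
              |   // permits, can make it slightly negative when r == d
              |   // and d*d rounds up, so sqrt() returns NaN.
              |   double half_chord(double r, double d) {
              |     return std::sqrt(r * r - d * d);
              |   }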
        
               | dahart wrote:
               | Thank you, great points. I'd have to agree that disabling
               | denorms globally is pretty bad, even if (or maybe
               | especially if) caring about denorms is rare.
               | 
               | > Fast-math can cause hard range guarantees to fail.
               | Maybe you've got code that you can prove that, even with
               | rounding error, the result will still be >= 0.
               | 
               | Floats do this too, it's pretty routine to bump into
               | epsilon out-of-range issues without fast-math. Most
               | people don't prove things about their rounding error, and
               | if they do, it's easy for them to account for 3 ULPs of
               | fast-math error compared to 1/2 ULP for the more accurate
               | operations. Like, nobody who knows what they're doing
               | will call sqrt() on a number that is fresh out of a
               | multiplier and might be anywhere near zero without
               | testing for zero explicitly, right? I'm sure someone has
               | done it, but I've never seen it, and it ranks high on the
               | list of bad ideas even if you steer completely clear of
               | fast-math, no?
               | 
               | I guess I just wanted to resist the unspecific parts of
               | the FUD just a little bit. I like your list a lot because
               | it's specific. Fast-math does carry some additional risks
               | for accuracy sensitive code, and clearly as you and
               | others showed, can infect and impact your whole app, and
               | it can sometimes lead to situations where things break
               | that wouldn't have happened otherwise. But I think in the
               | grand scheme these situations are quite rare compared to
               | how often people mess up regular floating point math. For
               | a very wide swath of people doing casual arithmetic,
               | fast-math is not likely to cause more problems than
               | floats cause, but it's fair to want to be careful and pay
               | attention.
        
               | PaulDavisThe1st wrote:
               | > I'd have to agree that disabling denorms globally is
               | pretty bad,
               | 
               | and yet, for audio processing, this is an option that
               | most DAWs either implement silently, or offer users the
               | choice, because denormals are inevitable in reverb tails
               | and on most Intel processors they slow things by orders
               | of magnitude.
        
               | dahart wrote:
               | I would think for audio, there's no audible difference
               | between a denorm and a flushed zero. Are there cases
               | where denorms are important to audio?
        
               | fl0ki wrote:
               | > The results of your program will vary depending on the
               | exact make of your compiler and other random attributes
               | of your compile environment, which can wreak havoc if you
               | have code that absolutely wants bit-identical results.
               | This doesn't matter for everybody, but there are some
               | domains where this can be a non-starter (e.g.,
               | multiplayer game code).
               | 
               | This already shouldn't be assumed, because even the same
               | code, compiler, and flags can produce different floating
               | point results on different CPU targets. With the world
               | increasingly split over x86_64 and aarch64, with more to
               | come, it would be unwise to assume they produce the same
               | exact numbers.
               | 
               | Often this comes down to acceptable implementation
               | defined behavior, e.g. temporarily using an 80-bit
               | floating register despite the result being coerced to 64
               | bits, or using an FMA instruction that loses less
               | precision than separate multiply and add instructions.
               | 
               | Portable results should come from integers (even if used
               | to simulate rationals and fixed point), not floats. I
               | understand that's not easy with multiplayer games, but
               | doing so with floats is simply impossible because of what
               | is left as implementation-defined in our language
               | standards.
        
               | AshamedCaptain wrote:
               | > Often this comes down to acceptable implementation
               | defined behavior,
               | 
               | I believe this is "always" rather than often when it
               | comes to the actual operations defined by the FP
               | standard. gcc does play it fast and loose (as -ffast-math
               | is not yet enabled by default, and FMA on the other hand
               | is), but this is technically illegal and at least can be
               | easily configured to be in standards-compliant mode.
               | 
               | I think the bigger problem comes from what is _not_
               | documented by the standard. E.g. transcendental
               | functions. A program calling plain old sqrt(x) can find
               | itself behaving differently _even between different
               | stepping of the same core_, not to mention that there are
               | well-known differences between AMD vs Intel. This is all
               | using the same binary.
        
               | jcranmer wrote:
               | This advice is out-of-date.
               | 
               | All CPU hardware nowadays conforms to IEEE 754 semantics
               | for binary32 and binary64. (I think all the GPUs now have
               | non-denormal-flushing modes, but my GPU knowledge is less
               | deep). All compilers will have a floating-point mode that
               | preserves IEEE 754 semantics assuming that FP exceptions
               | are unobservable and rounding mode is the default, and
               | this is usually the default (icc/icx is unusual in making
               | fast-math the default).
               | 
               | Thus, you have portability of floating-point semantics,
               | subject to caveats:
               | 
               | * The math library functions [1] are not the same between
               | different implementations. If you want portability, you
               | need to ensure that you're using the exact same math
               | library on all platforms.
               | 
               | * NaN payloads are not consistent on different platforms,
               | or necessarily within the same platform due to compiler
               | optimizations. Note that not even IEEE 754 attempts to
               | guarantee NaN payload stability.
               | 
               | * Long double is not the same type on different
               | platforms. Don't use it. Seriously, don't.
               | 
               | * 32-bit x86 support for exact IEEE 754 equivalence is
               | essentially a "known-WONTFIX" bug. (This is why the C
               | standard implemented FLT_EVAL_METHOD). The x87 FPU
               | evaluates everything in 80-bit precision, and while you
               | can make this work for binary32 easily (double rounding
               | isn't an issue), though with some performance cost (the
               | solution involves reading/writing from memory after every
               | operation), it's not so easy for binary64. However, the
               | SSE registers do implement IEEE 754 exactly, and are
               | present on every chip old enough to drink, so it's not
               | really a problem anymore. There's a subsidiary issue that
               | the x86-32 ABI requires floats be returned in x87
               | registers, which means you can't properly return an sNaN
               | correctly, but sNaN and floating-point exceptions are
               | firmly in the realm of nonportability anyways.
               | 
               | In short, if you don't need to care about 32-bit x86
               | support (or if you do care but can require SSE2 support),
               | and you don't care about NaNs, and you bring your own
               | libraries along, you can absolutely expect to have
               | floating-point portability.
               | 
               | [1] It's actually not even all math library functions,
               | just those that are like sin, pow, exp, etc., but
               | specifically excluding things like sqrt. I'm still trying
               | to come up with a good term to encompass these.
        
               | zozbot234 wrote:
               | > It's actually not even all math library functions, just
               | those that are like sin, pow, exp, etc., but specifically
               | excluding things like sqrt. I'm still trying to come up
               | with a good term to encompass these.
               | 
               | Transcendental functions. They're called that because
               | computing an exactly rounded result might be unfeasible
                | for some inputs:
                | https://en.wikipedia.org/wiki/Table-maker%27s_dilemma
                | So standards for numerical compute punt on the issue and
                | allow for some error in the last digit.
        
               | jcranmer wrote:
               | Not all of the functions are transcendental--things like
               | cbrt and rsqrt are in the list, and they're both
               | algebraic.
               | 
                | (The main defining factor is if they're an IEEE 754 §5
                | operation or not, but IEEE 754 isn't a freely-available
                | standard.)
        
               | fl0ki wrote:
               | Not sure if this is a spooky coincidence, but I happened
               | to be reading the Rust 1.75.0 release notes today and
               | fell into this 50-tab rabbit hole:
               | https://github.com/rust-lang/rust/pull/113053/
        
               | AshamedCaptain wrote:
               | > this is usually the default
               | 
               | No, it's not. gcc itself still defaults to fp-
               | contract=fast. Or at least does in all versions I have
               | ever tried.
        
             | light_hue_1 wrote:
             | Nah, you don't deal with floats. You do machine learning
             | which just happens to use floats. I do both numerical
             | computing and machine learning. And oh boy are you wrong!
             | 
             | People who deal with actual numerical computing know that
             | the statement "fast math is only slightly less accurate" is
             | absurd. Fast math is unbounded in its inaccuracy! It can
             | reorder your computations so that something that used to
             | sum to 1 now sums to 0, it can cause catastrophic
             | cancellation, etc.
             | 
             | Please stop giving people terrible advice on a topic you're
             | totally unfamiliar with.
        
               | alexey-salmin wrote:
               | > It can reorder your computations so that something that
               | used to sum to 1 now sums to 0, it can cause catastrophic
               | cancellation, etc.
               | 
               | Yes, and it could very well be that the correct answer is
               | actually 0 and not 1.
               | 
               | Unless you write your code to explicitly account for fp
               | associativity effects, in which case you don't need
               | generic forum advice about fast-math.
        
               | thechao wrote:
               | +1. I'm years away from fp-analysis, but do the
               | transcendental expansions even _converge_ in the presence
               | of fast-math? No `sin()`, no `cos()`, no `exp()`, ...
        
               | dahart wrote:
               | Well there are library implementations of fast-math
                | transcendentals that offer bounded error, and a million
               | different fast sine approximation algorithms, so, yes?
               | This is why you shouldn't listen to FUD. The corner cases
               | are indeed frustrating for a few people, but most never
               | hit them, and the world doesn't suddenly break when fast
               | math is enabled. I am paid to do some FP analysis, btw.
        
               | dahart wrote:
               | I only do numeric computation, I don't work in machine
               | learning. Sorry your assumptions are incorrect, maybe
               | it's best not to assume or attack. I didn't exactly
               | advise using fast math either, I asked for reasoning and
               | pointed out that most casual uses of float aren't highly
               | sensitive to precision.
               | 
               | It's easy to have wrong sums and catastrophic
               | cancellation without fast math, and it's relatively rare
               | for fast math to cause those issues when an underlying
               | issue didn't already exist.
               | 
               | I've been working in some code that does a couple of
               | quadratic solves and has high order intermediate terms,
               | and I've tried using Kahan's algorithm repeatedly to
               | improve the precision of the discriminants, but it has
               | never helped at all. On the other hand I've used a few
               | other tricks that improve the precision enough that the
               | fast math version is higher precision than the naive one
               | without fast math. I get to have my cake and eat it too.
               | 
               | Fast math is a tradeoff. Of course it's a good idea to
               | know what it does and what the risks of using it are, but
               | at least in terms of the accuracy of fast math in CUDA,
               | it's not an opinion whether the accuracy is relatively
               | close to slow math, it's reasonably well documented. You
               | can see for yourself that most fast math ops are in the
               | single digit ulps of rounding error.
                | https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
        
           | mort96 wrote:
           | What you really should enable is the fun and safe math
           | optimizations, with -funsafe-math-optimizations.
        
             | aqfamnzc wrote:
             | I know almost nothing about compiler flags but I got a
             | laugh out of this even though I still don't know if you're
             | joking or not. Edit: Just read it again and now I
             | understand the joke. Haha
        
               | kevincox wrote:
                | For others: `-f` is a common prefix for GCC flags. You can
                | think of this as "enable feature". So -funsafe-math-
                | optimizations should be read as (-f)(unsafe-math-
                | optimizations), not (-)(funsafe-math-optimizations).
        
               | arcticbull wrote:
               | I kind of like the idea the flag is sarcastically calling
               | them very fun and very safe.
        
               | dekhn wrote:
               | don't forget libiberty which is linked in using -liberty
               | (and freedom for all)
        
               | cozzyd wrote:
               | strangely I'm not aware of a libibre.
        
               | dekhn wrote:
               | oh, that's called -lmojito
        
             | gumby wrote:
              | The problem is this causes the compiler to correctly solve
              | your recreational math problems, which isn't actually as
              | much fun as solving them yourself!
        
           | planede wrote:
           | Another PSA is that dynamic libraries compiled with fast-math
           | will also introduce inaccuracies in unrelated libraries in
           | the same executable, as they introduce dynamic initialization
           | that globally changes the floating point environment.
        
             | pavlov wrote:
             | This would only affect code that uses the old-school x87
             | floating point instructions, though? The x87 FPU unit
             | indeed has scary global state that can make your doubles
             | behave like floats in secret and silence.
             | 
             | I would think practically all modern FPU code on x86-64
             | would be using the SIMD registers which have explicit
             | widths.
        
               | borodi wrote:
                | It was a bit more pervasive than that: flushing
                | subnormals (values very close to 0) to 0 is controlled
                | by a register flag, so if a library built with the
                | fast-math flags gets loaded, it sets the register,
                | causing the whole process to flush its subnormals. See
                | https://github.com/llvm/llvm-project/issues/57589
        
               | jcranmer wrote:
               | > This would only affect code that uses the old-school
               | x87 floating point instructions, though?
               | 
                | Actually, no, the x87 FPU instructions are the only ones
                | that _won't_ be affected.
               | 
               | It sets the FTZ/DAZ bits, which exist for SSE
               | instructions but not x87 instructions.
        
               | mhh__ wrote:
               | You're mistaking something else for the rounding mode and
               | subnormal handling flags.
        
           | mhh__ wrote:
            | One of the things you can do in D, and as far as I know in
            | Julia, is enable specific optimizations locally, e.g. allow
            | FMAs here and there rather than globally.
           | 
           | fast-math is one of the dumbest things we have as an industry
           | IMO.
        
       | planede wrote:
       | On a somewhat similar note, don't use std::lerp if you don't need
       | its strong guarantees around rounding (monotonicity among other
       | things).
       | 
       | https://godbolt.org/z/hzrG3s6T4
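        | 
        | The cheap alternative is the usual one-liner (a sketch;
        | unlike std::lerp it is not exact at the endpoints, e.g.
        | lerp_fast(a, b, 1.0) need not equal b):
        | 
        |   double lerp_fast(double a, double b, double t) {
        |     return a + t * (b - a);
        |   }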
        
       | camblomquist wrote:
        | I did a double take on this because I wrote a blog post about
        | this topic a few months ago and came to a very different
        | conclusion: the results are effectively identical on clang, and
        | gcc is just weird.
       | 
       | Then I realized that I was writing about compiling for ARM and
       | this post is about x86. Which is extra weird! Why is the compiler
       | better tuned for ARM than x86 in this case?
       | 
       | Never did figure out what gcc's problem was.
       | 
       | https://godbolt.org/z/Y75qnTGdr
        
         | frozenport wrote:
          | Try switching to -Ofast; it produces different ASM.
        
           | klodolph wrote:
           | -Ofast is one of those dangerous flags that you should
           | probably be careful with. It is "contagious" and it can mess
           | up code elsewhere in the program, because it changes
           | processor flags.
           | 
           | I would try a more specific flag like -ffinite-math-only.
        
             | Sharlin wrote:
              | finite-math-only is a footgun as well, as it allows the
              | compiler to assume that NaNs do not exist. Which means all
              | `isnan()` calls are just reduced to `false`, so it's
              | difficult to program defensively. And if a NaN in fact
              | occurs, it's naturally a one-way ticket to UB land.
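              | 
              | The usual workaround is to test the bit pattern
              | yourself, which the optimizer can't fold away (a sketch,
              | assuming IEEE binary64):
              | 
              |   #include <cstdint>
              |   #include <cstring>
              | 
              |   bool isnan_bits(double x) {
              |     std::uint64_t bits;
              |     std::memcpy(&bits, &x, sizeof bits);
              |     // NaN: exponent all ones, mantissa nonzero. After
              |     // masking the sign bit, anything above +inf's
              |     // encoding is a NaN.
              |     return (bits & 0x7fffffffffffffffULL)
              |            > 0x7ff0000000000000ULL;
              |   }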
        
               | klodolph wrote:
               | If that's a foot gun, then -Ofast is an autocannon.
               | 
               | I like to think that the flag should be renamed "-Ofuck-
               | my-shit-up".
        
               | MaulingMonkey wrote:
               | As 1 of [?] examples of UB land, I once had to debug JS
               | objects being misinterpreted as numbers when
               | https://duktape.org/ was miscompiled with a fast-math
               | equivalent (references to objects were encoded as NaNs.)
        
             | WanderPanda wrote:
              | Is that changing of global processor flags an x86 feature
              | or does it hold for ARM as well?
        
             | gpderetta wrote:
             | IIRC the changing global flags "feature" was removed
             | recently from GCC and now you have to separately ask for
             | it.
        
       | nickysielicki wrote:
       | https://bugs.llvm.org/show_bug.cgi?id=47271
       | 
       | This specific test (click the godbolt links) does not reproduce
       | the issue.
        
       | cmovq wrote:
       | Depending on the order of the arguments to min max you'll get an
       | extra move instruction [1]:
       | 
        | std::min(max, std::max(min, v));
        | 
        |     maxsd   xmm0, xmm1
        |     minsd   xmm0, xmm2
        | 
        | std::min(std::max(v, min), max);
        | 
        |     maxsd   xmm1, xmm0
        |     minsd   xmm2, xmm1
        |     movapd  xmm0, xmm2
       | 
       | For min/max on x86 if any operand is NaN the instruction copies
       | the second operand into the first. So the compiler can't reorder
       | the second case to look like the first (to leave the result in
       | xmm0 for the return value).
       | 
       | The reason for this NaN behavior is that minsd is implemented to
       | look like `(a < b) ? a : b`, where if any of a or b is NaN the
       | condition is false, and the expression evaluates to b.
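        | 
        | In C++ terms (a sketch of the instruction's semantics):
        | 
        |   double minsd(double a, double b) { return a < b ? a : b; }
        | 
        |   // comparisons with NaN are false, so:
        |   //   minsd(NaN, x) == x    (falls through to b)
        |   //   minsd(x, NaN) == NaN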
       | 
       | Possibly std::clamp has the comparisons ordered like the second
       | case?
       | 
       | [1]: https://godbolt.org/z/coes8Gdhz
        
         | x1f604 wrote:
          | I think the libstdc++ implementation does indeed have the
          | comparisons ordered in the way that you describe. I stepped
          | into the std::clamp() call in gdb, which landed in
          | /usr/include/c++/12/bits/stl_algo.h:
          | 
          |   /** ...
          |    *  @pre `_Tp` is LessThanComparable and `(__hi < __lo)`
          |    *  is false.
          |    */
          |   template<typename _Tp>
          |     constexpr const _Tp&
          |     clamp(const _Tp& __val, const _Tp& __lo, const _Tp& __hi)
          |     {
          |       __glibcxx_assert(!(__hi < __lo));
          |       return std::min(std::max(__val, __lo), __hi);
          |     }
        
           | cmovq wrote:
            | Thanks for sharing. I don't know if the C++ standard mandates
            | one behavior or another; it really depends on how you want
           | clamp to behave if the value is NaN. std::clamp returns NaN,
           | while the reverse order returns the min value.
        
             | cornstalks wrote:
              | From §25.8.9 Bounded value [alg.clamp]:
             | 
             | > 2 _Preconditions: `bool(comp(proj(hi), proj(lo)))` is
             | false. For the first form, type `T` meets the
             | Cpp17LessThanComparable requirements (Table 26)._
             | 
             | > 3 _Returns: `lo` if `bool(comp(proj(v), proj(lo)))` is
             | true, `hi` if `bool(comp(proj(hi), proj(v)))` is true,
             | otherwise `v`._
             | 
             | > 4 _[Note: If NaN is avoided, `T` can be a floating-point
             | type. -- end note]_
             | 
             | From Table 26:
             | 
             | > _` <` is a strict weak ordering relation (25.8)_
        
               | rahkiin wrote:
               | Does that mean NaN is undefined behavior for clamp?
        
               | cornstalks wrote:
               | My interpretation is that yes, passing NaN is undefined
               | behavior. Strict weak ordering is defined in 25.8 Sorting
               | and related operations [alg.sorting]:
               | 
               | > 4 _The term_ strict _refers to the requirement of an
               | irreflexive relation (`!comp(x, x)` for all `x`), and the
               | term_ weak _to requirements that are not as strong as
               | those for a total ordering, but stronger than those for a
               | partial ordering. If we define `equiv(a, b)` as `!comp(a,
               | b) && !comp(b, a)`, then the requirements are that `comp`
               | and `equiv` both be transitive relations:_
               | 
               | > 4.1 _`comp(a, b) && comp(b, c)` implies `comp(a, c)`_
               | 
               | > 4.2 _`equiv(a, b) && equiv(b, c)` implies `equiv(a,
               | c)`_
               | 
               | NaN breaks these relations, because `equiv(42.0, NaN)`
               | and `equiv(NaN, 3.14)` are both true, which would imply
               | `equiv(42.0, 3.14)` is also true. But clearly that's
               | _not_ true, so floating point numbers do not satisfy the
               | strict weak ordering requirement.
               | 
               | The standard doesn't explicitly say that NaN is undefined
               | behavior. But it does not define the behavior for when
               | NaN is used with `std::clamp()`, which I think by
               | definition means it's undefined behavior.
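                | 
                | Concretely (a quick sketch of those relations):
                | 
                |   #include <cmath>
                |   #include <cstdio>
                | 
                |   // equiv(a, b) := !(a < b) && !(b < a)
                |   bool equiv(double a, double b) {
                |     return !(a < b) && !(b < a);
                |   }
                | 
                |   int main() {
                |     double nan = std::nan("");
                |     std::printf("%d %d %d\n",
                |                 equiv(42.0, nan),    // 1
                |                 equiv(nan, 3.14),    // 1
                |                 equiv(42.0, 3.14));  // 0: transitivity
                |   }                                  //    is violated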
        
         | vitorsr wrote:
          | This seems close to the most likely reason. See also:
         | 
         | https://godbolt.org/z/q7e3MrE66
        
         | miohtama wrote:
          | Sir cmovq, you have earned your username.
        
       | CountHackulus wrote:
       | I see that the assembly instructions are different, but what's
       | the performance difference? Personally, I don't care about the
       | number of instructions used, as long as it's faster. With things
       | like store forwarding and register files, a lot of those movs
        | might be treated as no-ops.
        
       | superjan wrote:
       | The only times I worry about min/max/clamp performance is when I
       | need to do thousands or millions of them. And in that case, I'd
       | suggest intrinsics. You get to choose how NaN is handled, it's
       | branchless, and you can do multiple in parallel.
       | 
       | It feels backwards that you need to order your comparisons so as
       | to generate optimal assembly.
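        | 
        | Something like this sketch, say with SSE2 (the min/max
        | intrinsics return the second operand when either input is
        | NaN, so putting v second propagates NaN):
        | 
        |   #include <emmintrin.h>
        | 
        |   // branchless clamp of two doubles at once
        |   __m128d clamp_pd(__m128d v, __m128d lo, __m128d hi) {
        |     return _mm_min_pd(hi, _mm_max_pd(lo, v));
        |   }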
        
       ___________________________________________________________________
       (page generated 2024-01-16 23:01 UTC)