[HN Gopher] std::clamp generates less efficient assembly than st...
___________________________________________________________________
std::clamp generates less efficient assembly than
std::min(max, std::max(min, v))
Author : x1f604
Score : 151 points
Date : 2024-01-16 10:53 UTC (12 hours ago)
(HTM) web link (1f6042.blogspot.com)
(TXT) w3m dump (1f6042.blogspot.com)
| fooker wrote:
| If you benchmark these, you'll likely find that the version
| with the jump edges out the one with the conditional
| instruction in practice.
| svantana wrote:
| That must depend on the platform and the surrounding code, no?
| fooker wrote:
| Yes. On platform: most modern CPUs are happier with
| predictable branches than exotic instructions.
|
| On surrounding code - for sure.
| pclmulqdq wrote:
| Compilers often under-generate conditional instructions. They
| implicitly assume (correctly) that most branches you write are
| 90/10 (i.e. very predictable), not 50/50. The branches that
| actually are 50/50 suffer from being treated as 90/10.
| fooker wrote:
| The branches in this example are not 50/50.
|
| Given a few million calls of clamp, most would be no-ops in
| practice. Modern CPUs are very good at dynamically observing
| this.
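|
| To be concrete, here's a sketch of the branchy shape I mean
| (not the exact code any particular compiler emits):
|
|     double clamp_branchy(double v, double lo, double hi) {
|         if (v < lo) return lo;   // rarely taken
|         if (v > hi) return hi;   // rarely taken
|         return v;                // the common, well-predicted path
|     }
|
| If almost every call falls through to the last line, both
| branches predict essentially perfectly.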
| IainIreland wrote:
| It's hard to predict statically which branches will be
| dynamically unpredictable.
|
| A seasoned hardware architect once told me that Intel went
| all-in on predication for Itanium, under the assumption that
| a Sufficiently Smart Compiler could figure it out, and then
| discovered to their horror that their compiler team's best
| efforts were not Sufficiently Smart. He implied that this was
| why Intel pushed to get a profile-guided optimization step
| added to the SPEC CPU benchmark, since profiling was the only
| way to get sufficiently accurate data.
|
| I've never gone back to see whether the timeline checks out,
| but it's a good story.
| fooker wrote:
| The compiler doesn't do much of the predicting; it's done
| by the CPU at runtime.
| kyboren wrote:
| [delayed]
| jeffbee wrote:
| FYI. https://quick-bench.com/q/sK9t9GoFDRkx9XxloUUbB8Q3ht4
|
| Using this microbenchmark on an Intel Sapphire Rapids CPU,
| compiling with -march=k8 to get the older form takes ~980ns,
| while compiling with -march=native gives ~570ns. It's not at all
| clear that the imperfection the article describes is really
| relevant in context, because the compiler transforms this
| function into something quite different.
| fooker wrote:
| With random test cases, branch prediction can't help.
| tambre wrote:
| Both recent GCC and Clang are able to generate the optimal
| version for std::clamp() if you add something like -march=znver1,
| even at -O1 [0]. Interesting!
|
| [0] https://godbolt.org/z/YsMMo7Kjz
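|
| For reference, a minimal reproduction of the comparison (the
| exact function in the Godbolt link may differ; the flags in the
| comment are the ones named above):
|
|     #include <algorithm>
|
|     double clamp1(double v, double lo, double hi) {
|         return std::clamp(v, lo, hi);
|     }
|
|     double clamp2(double v, double lo, double hi) {
|         return std::min(hi, std::max(lo, v));
|     }
|
|     // g++ -O1 -march=znver1: both reduce to a max followed by
|     // a min, with no extra moves.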
| GrumpySloth wrote:
| But then it uses AVX instructions. (You can replace
| -march=znver1 with just -mavx.)
|
| When AVX isn't enabled, the std::min + std::max example still
| uses fewer instructions. Looks like a random register
| allocation failure.
| gpderetta wrote:
| The additional "movapd xmm0, xmm2" is mostly free as it is
| handled by renaming, but yes, it seems a quirk of the
| register allocator. It wouldn't be the first time I see GCC
| trying to move stuff around without obvious reasons.
| jeffbee wrote:
| Clang generates the shortest of these if you target sandybridge,
| or x86-64-v3, or later. The real article that's buried in this
| article is that compilers target k8-generic unless you tell them
| otherwise, and the features and cost model of Opteron are
| obsolete.
|
| Always specify your target.
| josephg wrote:
| Yep. Adding "-C target-cpu=native" to rustc on my desktop
| computer consistently gets a ~10-15% performance boost compared
| to the default target. The default target is extremely
| conservative. As far as I can tell, it doesn't take advantage
| of any CPU features added in the last 20 years. (The k8 came
| out in 2003.)
| jeffbee wrote:
| Those Gentoo people were onto something.
| alexey-salmin wrote:
| Funny that it stopped being the case for a while around
| 2006. AMD64 became widespread while also being very new,
| closing the gap between "default" and "native".
| skykooler wrote:
| Of course, Gentoo just started using prebuilt packages a
| few months ago...
| wongarsu wrote:
| Red Hat Enterprise Linux has upgraded their default target to
| x86-64-v2 and is considering switching to x86-64-v3 for RHEL
| 10 (which should release around 2026?). I'd take that as a
| sign that those might be reasonable choices for newly
| released software.
|
| Some linux distros also give you the option to either get a
| version compatible with ancient hardware or the optimized
| x86-64-v3 version, which seems like a good compromise.
| svantana wrote:
| I'm a heavy std::clamp user, but I'm considering replacing it
| with min+max because of the uncertainty about what will happen
| when lo > hi. On Windows it triggers an assertion, while other
| platforms just do a min+max in one or the other order. Of course,
| this should never happen but can be difficult to guarantee when
| the limits are derived from user inputs.
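|
| A sketch of the divergence (lo > hi is formally UB for
| std::clamp, so this just shows what libstdc++'s
| min(max(v, lo), hi) ordering happens to do with assertions off;
| MSVC's debug runtime asserts instead):
|
|     #include <algorithm>
|     #include <cstdio>
|
|     int main() {
|         double v = 5.0, lo = 10.0, hi = 0.0;  // inverted interval
|         std::printf("%g\n", std::clamp(v, lo, hi));          // 0 on libstdc++
|         std::printf("%g\n", std::min(hi, std::max(lo, v)));  // 0: hi wins
|         std::printf("%g\n", std::max(lo, std::min(hi, v)));  // 10: lo wins
|     }
|
| With explicit min+max, the outer call's bound always wins, so
| at least the preference is visible in the source.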
| lifthrasiir wrote:
| Pretty sure that their behaviors on NaN arguments will also
| differ.
| wegfawefgawefg wrote:
| I hope they fix it. That's quite a basic functional unit for it
| to be a footgun all on its own.
| camblomquist wrote:
| Don't get your hopes up, the behavior when lo > hi is
| explicitly undefined.
| lpapez wrote:
| > Of course, this should never happen but can be difficult to
| guarantee when the limits are derived from user inputs.
|
| Sounds to me like you are missing a validation step before
| calling your logic. When it comes to parsing, trusting user
| input is a recipe for disaster in the form of buffer overruns
| and potential exploits.
|
| As they used to say in the Soviet Union: "trust, but verify".
| PaulDavisThe1st wrote:
| That was what Reagan said _about_ the Soviet Union, not what
| was said in the Soviet Union.
|
| Correct me if I'm wrong.
| usefulcat wrote:
| According to Wikipedia you're right (about Reagan), but it's
| also a Russian proverb.
| plucas wrote:
| https://en.wikipedia.org/wiki/Trust,_but_verify
|
| > Trust, but verify (Russian: доверяй, но проверяй, tr.
| doveryay, no proveryay, IPA: [dəvʲɪˈrʲæj no prəvʲɪˈrʲæj])
| is a Russian proverb, which is rhyming in Russian. The
| phrase became internationally known in English after
| Suzanne Massie, a scholar of Russian history, taught it to
| Ronald Reagan, then president of the United States, the
| latter of whom used it on several occasions in the context
| of nuclear disarmament discussions with the Soviet Union.
|
| Memorably referenced in "Chernobyl":
| https://youtu.be/9Ebah_QdBnI?t=79
| PaulDavisThe1st wrote:
| Thanks for the clarification/explanation!
| fl0ki wrote:
| Also referenced in the Metro Exodus "Sam's Story" DLC
| because of the backstory of the two characters speaking,
| and nuclear weapons once again being part of the
| scenario.
| dekhn wrote:
| When I hear the phrase, I rewrite it to "don't trust,
| verify."
| protomolecule wrote:
| Russian here. We use that expression from time to time:
| https://ya.ru/search/?text="doveriai%2C+no+proveriai"
| iknowstuff wrote:
| The answer is of course clamp(min(a,b), max(a,b))
|
| classic c++
| dahart wrote:
| Will min+max help you? What do you expect the answer to be when
| lo > hi? What certainty should std::clamp have? Using min+max
| on a number that's between hi and lo when lo>hi will always return
| either lo or hi, and never your input value.
| svantana wrote:
| Sure, that was the point - min(max()) forces you to give
| explicit preference to lo or hi, whereas with clamp it's up
| to the std library. I trust my users to bend my software to
| their will, but I don't want different behavior on mac and
| windows (for example).
| dahart wrote:
| Yeah, seems reasonable. I think the outer call wins, so
| min(max()) will always return lo for empty intervals,
| right? I didn't know std::clamp() was undefined for empty
| intervals. It does seem like a good idea to try to
| guarantee the interval is valid instead of worrying about
| clamp... even with a guarantee, the answer might still
| surprise someone, since technically the problem is
| mathematically undefined and the guaranteed answer is
| wrong.
| celegans25 wrote:
| On gcc 13, the difference in assembly between the min(max())
| version and std::clamp is eliminated when I add the -ffast-math
| flag. I suspect that the two implementations handle one of the
| arguments being NaN a bit differently.
|
| https://gcc.godbolt.org/z/fGaP6roe9
|
| I see the same behavior on clang 17 as well
|
| https://gcc.godbolt.org/z/6jvnoxWhb
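|
| The NaN asymmetry is easy to see directly (a sketch; this
| matches the two orderings discussed here, under strict IEEE
| semantics, i.e. no -ffast-math):
|
|     #include <algorithm>
|     #include <cmath>
|     #include <cstdio>
|
|     int main() {
|         double v = std::nan("");
|         // libstdc++'s min(max(v, lo), hi): the NaN propagates
|         std::printf("%g\n", std::clamp(v, 0.0, 1.0));          // nan
|         // the article's ordering: a bound comes out instead
|         std::printf("%g\n", std::min(1.0, std::max(0.0, v)));  // 0
|     }
|
| Under -ffast-math the compiler is allowed to pretend NaN can't
| happen, so the distinction (and the extra move) disappears.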
| gumby wrote:
| You (celegans25) probably know this but here is a PSA that
| -ffast-math is really -finaccurate-math. The knowledgeable
| developer will know when to use it (almost never) while the
| naive user will have bugs.
| cogman10 wrote:
| Ehh, not so much inaccurate, more of a "floating point
| numbers are tricky, let's act like they aren't".
|
| Compilers are pretty skittish about changing the order of
| floating point operations (for good reason) and ffast-math is
| the thing that lets them transform equations to try and
| generate faster code.
|
| I.e., instead of doing "n / 10", doing "n * 0.1". The issue, of
| course, being that things like 0.1 can't be perfectly
| represented with floats but 100 / 10 can be. So now you've
| introduced a tiny bit of error where it might not have
| existed.
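|
| A sketch of that exact rewrite (GCC's -freciprocal-math, which
| -ffast-math implies, licenses it; the assembly in the comments
| is indicative, not verbatim):
|
|     double div10(double n) { return n / 10.0; }
|     // -O2:             divsd  (exact division by 10.0)
|     // -O2 -ffast-math: mulsd  (multiply by a rounded ~0.1)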
| phkahler wrote:
| I've never understood why generating exceptions is
| preferable to just using higher precision.
| gumby wrote:
| Higher precision isn't always available. IEEE 754 is an
| unusually well-thought-through standard (thanks to some
| smart people with a lot of painful experience) and is
| pretty good at justifying its decisions, some of which
| are surprising (far from obvious) to anyone not steeped
| in it.
| cogman10 wrote:
| The main problem with floats in general is they are
| designed primarily for scientific computing.
|
| We are fortunately starting to see newer (well, not that
| new now) CPU instructions like FMA that make more
| accurate decimal representations not take such huge
| performance hits.
| adgjlsfhk1 wrote:
| what does fma have to do with decimal numbers?
| cogman10 wrote:
| Oh shoot, nvm, I thought it was an optimization for
| integers.
|
| Really it'll be the SIMD style instructions that speed
| things up.
| dahart wrote:
| On a GPU, higher precision can cost between 2 and 64
| times more than single precision, with typical ratios for
| consumer cards being 16 or 32. Even on the CPU, fp64
| workloads tend to run at half the speed on real data due
| to the extra bandwidth needed for higher precision.
| rwmj wrote:
| It isn't just that. -ffast-math also allows the compiler to
| ignore infinities. In fact for GCC with -ffast-math, isinf
| always returns false. Something similar happens for
| NaNs/isnan.
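|
| A sketch of how far that goes:
|
|     #include <cmath>
|     bool check(double x) { return std::isinf(x); }
|     // g++ -O2 -ffast-math: the body folds to "return false",
|     // because -ffinite-math-only lets the compiler assume
|     // infinities never occur.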
| cogman10 wrote:
| I lump this into "floating points are tricky". NaNs and
| inf are definitely legitimate floating point values. They
| are also things that a lot of applications will break on
| if they ever encounter them.
| alexey-salmin wrote:
| If your code ventures into the domain where fast-math matters
| and you're not a mathematician trying to solve a Lyapunov-
| unstable problem with very tricky numeric methods, then most
| likely your code is already broken.
| dahart wrote:
| Why do you say almost never? Don't let the name scare you;
| all floating point math is inaccurate. Fast math is only
| _slightly_ less accurate, I think typically it's a 1 or maybe
| 2 LSB difference. At least in CUDA it is, and I think many
| (most?) people & situations can tolerate 22 bits of mantissa
| compared to 23, and many (most?) people/situations aren't
| paying attention to inf/nan/exception issues at all.
|
| I deal with a lot of floating point professionally day to
| day, and I use fast math all the time, since the tradeoff for
| higher performance and the relatively small loss of accuracy
| are acceptable. Maybe the biggest issue I run into is lack of
| denorms with CUDA fast-math, and it's pretty rare for me to
| care about numbers smaller than 10^-38. Heck, I'd say I can
| tolerate 8 or 16 bits of mantissa most of the time, and fast-
| math floats are way more accurate than that. And we know a
| lot of neural network training these days can tolerate less
| than 8 bits of mantissa.
| mort96 wrote:
| The scary thing IMO is: your code might be fine with unsafe
| math optimisations, but maybe you're using a library which
| is written to do operations in a certain order to minimise
| numerical error, and unsafe math optimisations rewrite that
| code into something mathematically equivalent but with many
| orders of magnitude more numerical error. It's probably fine
| most of the time, but it's kinda scary.
| dahart wrote:
| It shouldn't be scary. Any library that is sensitive to
| order of operations will hopefully have a big fat warning
| on it. And it can be compiled separately with fast-math
| disabled. I don't know of any such libraries off the top
| of my head, and it's quite rare to find situations that
| result in orders of magnitude more error, though I grant
| you it can happen, and it can be contrived pretty easily.
| planede wrote:
| You can't fully disable fast-math per-library, moreover a
| library compiled with fast-math might also introduce
| inaccuracies in a seemingly unrelated library or
| application code in the same executable. The reason is
| that fast-math enables some dynamic initialization of the
| library that changes the floating point environment in
| some ways.
| londons_explore wrote:
| you're gonna have to give us a concrete real world
| example to convince most of us...
| dahart wrote:
| > You can't fully disable fast-math per library
|
| Can you elaborate? What fast-math can sneak into a
| library that disabled fast-math at compile time?
|
| > fast-math enables some dynamic initialization of the
| library that changes the floating point environment in
| some ways.
|
| I wasn't aware of this, I would love to see some
| documentation discussing exactly what happens, can you
| send a link?
| mort96 wrote:
| > Can you elaborate? What fast-math can sneak into a
| library that disabled fast-math at compile time?
|
| A lot of library code is in headers (especially in C++!).
| The code in headers is compiled by your compiler using
| your compile options.
| dahart wrote:
| Ah, of course, very good point. A header-only library
| doesn't have separate compile options. This is a great
| reason for a float-sensitive library to not be header-
| only, right?
| mort96 wrote:
| It's not just about being header-only, lots of libraries
| which aren't header-only still have code in headers. The
| library may choose to put certain functions in headers
| for performance reasons (to let compiler inline them),
| or, in C++, function templates and class templates
| generally have to be in headers.
|
| But yeah, it's probably a good idea to not put code which
| breaks under -ffast-math in headers if possible.
| jcranmer wrote:
| https://github.com/llvm/llvm-project/issues/57589
|
| Turn on fast-math and it flips the FTZ/DAZ bits for the
| entire application. Even if you turned it on for just a
| shared library!
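|
| You can watch it happen with something like this (a sketch;
| run it once normally, then once with a -ffast-math shared
| library loaded into the same process):
|
|     #include <cstdio>
|
|     int main() {
|         float tiny = 1e-37f;           // a normal float
|         float sub  = tiny / 100.0f;    // ~1e-39: a subnormal result
|         std::printf("%g\n", sub);      // ~1e-39 normally; 0 with FTZ set
|     }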
| mort96 wrote:
| I don't typically thoroughly read through the
| documentation for all the dependencies which my
| dependencies are using.
|
| But you're correct that it's probably usually fine in
| practice.
| dahart wrote:
| That's fair. Ideally transitive dependencies should be
| completely hidden from you. Hopefully the author of the
| library you include directly has heeded the instructions
| of libraries they depend on.
|
| Hey I grant and acknowledge that using fast-math carries
| a little risk of surprises, we don't necessarily need to
| try to think of corner cases. I'm mostly pushing back a
| little because using floats at all carries almost as much
| risk. A lot of people seem to use floats without knowing
| how inaccurate floats are, and a lot of people aren't
| doing precision analysis or handling the exceptional
| cases... and don't really need to.
| ska wrote:
| > A lot of people seem to use floats without knowing how
| inaccurate floats are,
|
| Small nit, but floats aren't inaccurate, they have non-
| uniform precision. Some float _operations_ can be
| inaccurate, but that's rather path dependent...
|
| One problem with -ffast-math is that a) it sounds
| appealing and b) people don't understand floats, so lots
| of people turn it on without understanding what it does,
| and that can introduce subtle problems in code they
| didn't write.
|
| Sometimes in computational code it makes sense e.g. to
| get rid of denorms, but a very small fraction of
| programmers understand this properly, or ever will.
|
| I wish they had named it something scary sounding.
| dahart wrote:
| I am talking about float operations, of course. And
| they're all inaccurate, generally speaking, because they
| round. Fast math rounding error is not much larger than
| rounding error without fast math.
| usefulcat wrote:
| > Fast math is only slightly less accurate
|
| 'slightly'? Last I checked, -Ofast completely breaks
| std::isnan and std::isinf--they always return false.
| dahart wrote:
| Hopefully it was clear from the rest of my comment that I
| was talking about in-range floats there. I wouldn't
| necessarily call inf & nan handling an accuracy issue,
| that's more about exceptional cases, but to your point I
| would have to agree that losing std::isinf is kinda bad
| since divide by zero is probably near the very top of the
| list of things most people using floats casually might
| have to deal with.
|
| Which compiler are you using where std::isinf breaks?
| Hopefully it was also clear that my experience leans
| toward CUDA, and I think the inf & nan support works
| there in the presence of NVCC's fast-math.
| usefulcat wrote:
| My experience is with gcc and clang on x86. I generally
| agree with you regarding accuracy, which is why I was
| quite surprised when I first discovered that -Ofast
| breaks isnan/isinf.
|
| Even if I don't care about the accuracy differences, I
| still need a way to check for invalid input data. The
| upshot is that I had to roll my own isnan and isinf to be
| able to use -Ofast (because it's actually the underlying
| __builtin_xxx intrinsics that are broken), which still
| seems wrong to me.
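|
| For the curious, the usual trick is to test the bit pattern
| instead of relying on comparisons the optimizer may fold away
| (a sketch, for IEEE binary64):
|
|     #include <cstdint>
|     #include <cstring>
|
|     bool is_nan_bits(double x) {
|         std::uint64_t u;
|         std::memcpy(&u, &x, sizeof u);
|         return (u & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL
|             && (u & 0x000fffffffffffffULL) != 0;  // exp all ones, payload != 0
|     }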
| xigoi wrote:
| They are talking about -ffast-math, not -Ofast.
| gumby wrote:
| From the gcc manual:
|
| -Ofast
|
| Disregard strict standards compliance. -Ofast enables all
| -O3 optimizations. It also enables optimizations that are
| not valid for all standard-compliant programs. _It turns
| on -ffast-math_, -fallow-store-data-races and the
| Fortran-specific -fstack-arrays, unless -fmax-stack-var-
| size is specified, and -fno-protect-parens. It turns off
| -fsemantic-interposition.
| jcranmer wrote:
| Here are some of the problems with fast-math:
|
| * It links in an object file that enables denormal flushing
| globally, so that it affects _all_ libraries linked into
| your application, even if said library explicitly _doesn
| 't_ want fast-math. This is seriously one of the most user-
| hostile things a compiler can do.
|
| * The results of your program will vary depending on the
| exact make of your compiler and other random attributes of
| your compile environment, which can wreak havoc if you have
| code that absolutely wants bit-identical results. This
| doesn't matter for everybody, but there are some domains
| where this can be a non-starter (e.g., multiplayer game
| code).
|
| * Fast-math precludes you from being able to use NaN or
| infinities, and often even being able to defensively _test_
| for NaN or infinity. Sure, there are times where this is
| useful, but the option you might prefer to suggest to an
| uninformed programmer would be a "floating-point code can't
| overflow" option rather than "infinity doesn't exist and
| it's UB if it does exist".
|
| * Fast-math can cause hard range guarantees to fail. Maybe
| you've got code that you can prove that, even with rounding
| error, the result will still be >= 0. With fast-math, the
| code might be adjusted so that the result is instead, say,
| -1e-10. And if you pass that to a function with a hard
| domain error at 0 (like sqrt), you now go from the result
| being 0 to the result being NaN. And see above about what
| happens when you get NaN.
|
| Fast-math is a tradeoff, and if you're willing to accept
| the tradeoff it offers, it's a fine option to use. But most
| programmers don't even know what the tradeoffs are, and the
| failure mode can be absolutely catastrophic. It's
| definitely an option that is in the "you must be this
| knowledgeable to use" camp.
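|
| The hard-range failure in miniature (a sketch of the sqrt
| case from the list above):
|
|     #include <cmath>
|
|     double root(double a, double b, double c) {
|         double d = b * b - 4.0 * a * c;  // "proved" >= 0 under IEEE rules
|         return std::sqrt(d);             // NaN once reassociation
|     }                                    // nudges d to, say, -1e-10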
| dahart wrote:
| Thank you, great points. I'd have to agree that disabling
| denorms globally is pretty bad, even if (or maybe
| especially if) caring about denorms is rare.
|
| > Fast-math can cause hard range guarantees to fail.
| Maybe you've got code that you can prove that, even with
| rounding error, the result will still be >= 0.
|
| Floats do this too, it's pretty routine to bump into
| epsilon out-of-range issues without fast-math. Most
| people don't prove things about their rounding error, and
| if they do, it's easy for them to account for 3 ULPs of
| fast-math error compared to 1/2 ULP for the more accurate
| operations. Like, nobody who knows what they're doing
| will call sqrt() on a number that is fresh out of a
| multiplier and might be anywhere near zero without
| testing for zero explicitly, right? I'm sure someone has
| done it, but I've never seen it, and it ranks high on the
| list of bad ideas even if you steer completely clear of
| fast-math, no?
|
| I guess I just wanted to resist the unspecific parts of
| the FUD just a little bit. I like your list a lot because
| it's specific. Fast-math does carry some additional risks
| for accuracy sensitive code, and clearly as you and
| others showed, can infect and impact your whole app, and
| it can sometimes lead to situations where things break
| that wouldn't have happened otherwise. But I think in the
| grand scheme these situations are quite rare compared to
| how often people mess up regular floating point math. For
| a very wide swath of people doing casual arithmetic,
| fast-math is not likely to cause more problems than
| floats cause, but it's fair to want to be careful and pay
| attention.
| PaulDavisThe1st wrote:
| > I'd have to agree that disabling denorms globally is
| pretty bad,
|
| and yet, for audio processing, this is an option that
| most DAWs either implement silently, or offer users the
| choice, because denormals are inevitable in reverb tails
| and on most Intel processors they slow things by orders
| of magnitude.
| dahart wrote:
| I would think for audio, there's no audible difference
| between a denorm and a flushed zero. Are there cases
| where denorms are important to audio?
| fl0ki wrote:
| > The results of your program will vary depending on the
| exact make of your compiler and other random attributes
| of your compile environment, which can wreak havoc if you
| have code that absolutely wants bit-identical results.
| This doesn't matter for everybody, but there are some
| domains where this can be a non-starter (e.g.,
| multiplayer game code).
|
| This already shouldn't be assumed, because even the same
| code, compiler, and flags can produce different floating
| point results on different CPU targets. With the world
| increasingly split over x86_64 and aarch64, with more to
| come, it would be unwise to assume they produce the same
| exact numbers.
|
| Often this comes down to acceptable implementation
| defined behavior, e.g. temporarily using an 80-bit
| floating-point register despite the result being coerced to 64
| bits, or using an FMA instruction that loses less
| precision than separate multiply and add instructions.
|
| Portable results should come from integers (even if used
| to simulate rationals and fixed point), not floats. I
| understand that's not easy with multiplayer games, but
| doing so with floats is simply impossible because of what
| is left as implementation-defined in our language
| standards.
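|
| The FMA case in one line (a sketch; GCC's default
| -ffp-contract=fast permits the fusion):
|
|     double muladd(double a, double b, double c) {
|         return a * b + c;  // may become fma(a, b, c), skipping
|     }                      // the intermediate rounding of a*b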
| AshamedCaptain wrote:
| > Often this comes down to acceptable implementation
| defined behavior,
|
| I believe this is "always" rather than often when it
| comes to the actual operations defined by the FP
| standard. gcc does play it fast and loose (as -ffast-math
| is not yet enabled by default, and FMA on the other hand
| is), but this is technically illegal and at least can be
| easily configured to be in standards-compliant mode.
|
| I think the bigger problem comes from what is _not_
| documented by the standard. E.g. transcendental
| functions. A program calling plain old sqrt(x) can find
| itself behaving differently _even between different
| stepping of the same core_, not to mention that there are
| well-known differences between AMD vs Intel. This is all
| using the same binary.
| jcranmer wrote:
| This advice is out-of-date.
|
| All CPU hardware nowadays conforms to IEEE 754 semantics
| for binary32 and binary64. (I think all the GPUs now have
| non-denormal-flushing modes, but my GPU knowledge is less
| deep). All compilers will have a floating-point mode that
| preserves IEEE 754 semantics assuming that FP exceptions
| are unobservable and rounding mode is the default, and
| this is usually the default (icc/icx is unusual in making
| fast-math the default).
|
| Thus, you have portability of floating-point semantics,
| subject to caveats:
|
| * The math library functions [1] are not the same between
| different implementations. If you want portability, you
| need to ensure that you're using the exact same math
| library on all platforms.
|
| * NaN payloads are not consistent on different platforms,
| or necessarily within the same platform due to compiler
| optimizations. Note that not even IEEE 754 attempts to
| guarantee NaN payload stability.
|
| * Long double is not the same type on different
| platforms. Don't use it. Seriously, don't.
|
| * 32-bit x86 support for exact IEEE 754 equivalence is
| essentially a "known-WONTFIX" bug. (This is why the C
| standard implemented FLT_EVAL_METHOD). The x87 FPU
| evaluates everything in 80-bit precision, and while you
| can make this work for binary32 easily (double rounding
| isn't an issue), though with some performance cost (the
| solution involves reading/writing from memory after every
| operation), it's not so easy for binary64. However, the
| SSE registers do implement IEEE 754 exactly, and are
| present on every chip old enough to drink, so it's not
| really a problem anymore. There's a subsidiary issue that
| the x86-32 ABI requires floats be returned in x87
| registers, which means you can't properly return an sNaN
| correctly, but sNaN and floating-point exceptions are
| firmly in the realm of nonportability anyways.
|
| In short, if you don't need to care about 32-bit x86
| support (or if you do care but can require SSE2 support),
| and you don't care about NaNs, and you bring your own
| libraries along, you can absolutely expect to have
| floating-point portability.
|
| [1] It's actually not even all math library functions,
| just those that are like sin, pow, exp, etc., but
| specifically excluding things like sqrt. I'm still trying
| to come up with a good term to encompass these.
| zozbot234 wrote:
| > It's actually not even all math library functions, just
| those that are like sin, pow, exp, etc., but specifically
| excluding things like sqrt. I'm still trying to come up
| with a good term to encompass these.
|
| Transcendental functions. They're called that because
| computing an exactly rounded result might be unfeasible
| for some inputs.
| https://en.wikipedia.org/wiki/Table-maker%27s_dilemma
| So standards for numerical compute punt on the issue and
| allow for some error in the last digit.
| jcranmer wrote:
| Not all of the functions are transcendental--things like
| cbrt and rsqrt are in the list, and they're both
| algebraic.
|
| (The main defining factor is whether they're an IEEE 754 §5
| operation or not, but IEEE 754 isn't a freely-available
| standard.)
| fl0ki wrote:
| Not sure if this is a spooky coincidence, but I happened
| to be reading the Rust 1.75.0 release notes today and
| fell into this 50-tab rabbit hole:
| https://github.com/rust-lang/rust/pull/113053/
| AshamedCaptain wrote:
| > this is usually the default
|
| No, it's not. gcc itself still defaults to
| -ffp-contract=fast. Or at least it does in all versions I have
| ever tried.
| light_hue_1 wrote:
| Nah, you don't deal with floats. You do machine learning
| which just happens to use floats. I do both numerical
| computing and machine learning. And oh boy are you wrong!
|
| People who deal with actual numerical computing know that
| the statement "fast math is only slightly less accurate" is
| absurd. Fast math is unbounded in its inaccuracy! It can
| reorder your computations so that something that used to
| sum to 1 now sums to 0, it can cause catastrophic
| cancellation, etc.
|
| Please stop giving people terrible advice on a topic you're
| totally unfamiliar with.
| alexey-salmin wrote:
| > It can reorder your computations so that something that
| used to sum to 1 now sums to 0, it can cause catastrophic
| cancellation, etc.
|
| Yes, and it could very well be that the correct answer is
| actually 0 and not 1.
|
| Unless you write your code to explicitly account for fp
| associativity effects, in which case you don't need
| generic forum advice about fast-math.
| thechao wrote:
| +1. I'm years away from fp-analysis, but do the
| transcendental expansions even _converge_ in the presence
| of fast-math? No `sin()`, no `cos()`, no `exp()`, ...
| dahart wrote:
| Well there are library implementations of fast-math
| transcendentals that offer bounded error, and a million
| different fast sine approximation algorithms, so, yes?
| This is why you shouldn't listen to FUD. The corner cases
| are indeed frustrating for a few people, but most never
| hit them, and the world doesn't suddenly break when fast
| math is enabled. I am paid to do some FP analysis, btw.
| dahart wrote:
| I only do numeric computation, I don't work in machine
| learning. Sorry your assumptions are incorrect, maybe
| it's best not to assume or attack. I didn't exactly
| advise using fast math either, I asked for reasoning and
| pointed out that most casual uses of float aren't highly
| sensitive to precision.
|
| It's easy to have wrong sums and catastrophic
| cancellation without fast math, and it's relatively rare
| for fast math to cause those issues when an underlying
| issue didn't already exist.
|
| I've been working in some code that does a couple of
| quadratic solves and has high order intermediate terms,
| and I've tried using Kahan's algorithm repeatedly to
| improve the precision of the discriminants, but it has
| never helped at all. On the other hand I've used a few
| other tricks that improve the precision enough that the
| fast math version is higher precision than the naive one
| without fast math. I get to have my cake and eat it too.
|
| Fast math is a tradeoff. Of course it's a good idea to
| know what it does and what the risks of using it are, but
| at least in terms of the accuracy of fast math in CUDA,
| it's not an opinion whether the accuracy is relatively
| close to slow math, it's reasonably well documented. You
| can see for yourself that most fast math ops are in the
| single digit ulps of rounding error.
| https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
| mort96 wrote:
| What you really should enable is the fun and safe math
| optimizations, with -funsafe-math-optimizations.
| aqfamnzc wrote:
| I know almost nothing about compiler flags but I got a
| laugh out of this even though I still don't know if you're
| joking or not. Edit: Just read it again and now I
| understand the joke. Haha
| kevincox wrote:
| To others: `-f` is a common prefix for GCC flags. You can
| think of it as "enable feature". So -funsafe-math-
| optimizations should be read as (-f)(unsafe-math-
| optimizations), not (-)(funsafe-math-optimizations).
| arcticbull wrote:
| I kind of like the idea that the flag is sarcastically calling
| them very fun and very safe.
| dekhn wrote:
| don't forget libiberty which is linked in using -liberty
| (and freedom for all)
| cozzyd wrote:
| strangely I'm not aware of a libibre.
| dekhn wrote:
| oh, that's called -lmojito
| gumby wrote:
| The problem is this causes the compiler to correctly solve
| your recreational math problems, which isn't actually as
| much fun as solving them yourself!
| planede wrote:
| Another PSA is that dynamic libraries compiled with fast-math
| will also introduce inaccuracies in unrelated libraries in
| the same executable, as they introduce dynamic initialization
| that globally changes the floating point environment.
| pavlov wrote:
| This would only affect code that uses the old-school x87
| floating point instructions, though? The x87 FPU
| indeed has scary global state that can make your doubles
| behave like floats in secret and silence.
|
| I would think practically all modern FPU code on x86-64
| would be using the SIMD registers which have explicit
| widths.
| borodi wrote:
| It was a bit more pervasive than that: flushing subnormals
| (values very close to 0) to 0 is controlled by a register,
| so if a library built with the fast-math flags gets loaded,
| it sets the register, causing the whole process to flush
| its subnormals. See
| https://github.com/llvm/llvm-project/issues/57589
| jcranmer wrote:
| > This would only affect code that uses the old-school
| x87 floating point instructions, though?
|
| Actually, no, the x87 FPU instructions are the only ones
| that _won't_ be affected.
|
| It sets the FTZ/DAZ bits, which exist for SSE
| instructions but not x87 instructions.
| mhh__ wrote:
| You're mistaking something else for the rounding mode and
| subnormal handling flags.
| mhh__ wrote:
| One of the things that you can do with D and as far as I know
| Julia is enable specific optimizations locally e.g. allow
| FMAs here and there, not globally.
|
| fast-math is one of the dumbest things we have as an industry
| IMO.
| planede wrote:
| On a somewhat similar note, don't use std::lerp if you don't need
| its strong guarantees around rounding (monotonicity among other
| things).
|
| https://godbolt.org/z/hzrG3s6T4
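|
| For comparison, the cheap form most people actually want when
| those guarantees don't matter (a sketch):
|
|     double lerp_fast(double a, double b, double t) {
|         return a + t * (b - a);  // one fma-able expression; does
|     }                            // not guarantee lerp(a,b,1) == b
|                                  // exactly, nor monotonicity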
| camblomquist wrote:
| I did a double take on this because I wrote a blog post about
| this topic a few months ago and came to a very different
| conclusion: the results are effectively identical on clang,
| and gcc is just weird.
|
| Then I realized that I was writing about compiling for ARM and
| this post is about x86. Which is extra weird! Why is the compiler
| better tuned for ARM than x86 in this case?
|
| Never did figure out what gcc's problem was.
|
| https://godbolt.org/z/Y75qnTGdr
| frozenport wrote:
| Try switching to -Ofast; it produces different ASM
| klodolph wrote:
| -Ofast is one of those dangerous flags that you should
| probably be careful with. It is "contagious" and it can mess
| up code elsewhere in the program, because it changes
| processor flags.
|
| I would try a more specific flag like -ffinite-math-only.
| Sharlin wrote:
| finite-math-only is a footgun as well, as it allows the
| compiler to assume that NaNs do not exist. Which means all
| `isnan()` calls are just reduced to `false`, so it's
| difficult to program defensively. And if a NaN in fact
| occurs, it's naturally a one-way ticket to UB land.
| klodolph wrote:
| If that's a footgun, then -Ofast is an autocannon.
|
| I like to think that the flag should be renamed
| "-Ofuck-my-shit-up".
| MaulingMonkey wrote:
| As 1 of [?] examples of UB land, I once had to debug JS
| objects being misinterpreted as numbers when
| https://duktape.org/ was miscompiled with a fast-math
| equivalent (references to objects were encoded as NaNs.)
| WanderPanda wrote:
| Is that changing of global processor flags an x86 feature or
| does it hold for ARM as well?
| gpderetta wrote:
| IIRC the changing global flags "feature" was removed
| recently from GCC and now you have to separately ask for
| it.
| nickysielicki wrote:
| https://bugs.llvm.org/show_bug.cgi?id=47271
|
| This specific test (click the godbolt links) does not reproduce
| the issue.
| cmovq wrote:
| Depending on the order of the arguments to min/max you'll get an
| extra move instruction [1]:
|
|       std::min(max, std::max(min, v));
|           maxsd  xmm0, xmm1
|           minsd  xmm0, xmm2
|
|       std::min(std::max(v, min), max);
|           maxsd  xmm1, xmm0
|           minsd  xmm2, xmm1
|           movapd xmm0, xmm2
|
| For min/max on x86 if any operand is NaN the instruction copies
| the second operand into the first. So the compiler can't reorder
| the second case to look like the first (to leave the result in
| xmm0 for the return value).
|
| The reason for this NaN behavior is that minsd is implemented to
| look like `(a < b) ? a : b`, where if any of a or b is NaN the
| condition is false, and the expression evaluates to b.
|
| Possibly std::clamp has the comparisons ordered like the second
| case?
|
| [1]: https://godbolt.org/z/coes8Gdhz
| x1f604 wrote:
| I think the libstdc++ implementation does indeed have the
| comparisons ordered in the way that you describe. I stepped
| into the std::clamp() call in gdb and got this:
|       /usr/include/c++/12/bits/stl_algo.h:
|       3617       * @pre `_Tp` is LessThanComparable and `(__hi < __lo)` is false.
|       3618       */
|       3619      template<typename _Tp>
|       3620        constexpr const _Tp&
|       3621        clamp(const _Tp& __val, const _Tp& __lo, const _Tp& __hi)
|       3622        {
|       3623          __glibcxx_assert(!(__hi < __lo));
|     > 3624          return std::min(std::max(__val, __lo), __hi);
|       3625        }
| cmovq wrote:
| Thanks for sharing. I don't know if the C++ standard mandates
| one behavior or another; it really depends on how you want
| clamp to behave if the value is NaN. std::clamp returns NaN,
| while the reverse order returns the min value.
| cornstalks wrote:
| From §25.8.9 Bounded value [alg.clamp]:
|
| > 2 _Preconditions: `bool(comp(proj(hi), proj(lo)))` is
| false. For the first form, type `T` meets the
| Cpp17LessThanComparable requirements (Table 26)._
|
| > 3 _Returns: `lo` if `bool(comp(proj(v), proj(lo)))` is
| true, `hi` if `bool(comp(proj(hi), proj(v)))` is true,
| otherwise `v`._
|
| > 4 _[Note: If NaN is avoided, `T` can be a floating-point
| type. -- end note]_
|
| From Table 26:
|
| > _`<` is a strict weak ordering relation (25.8)_
| rahkiin wrote:
| Does that mean NaN is undefined behavior for clamp?
| cornstalks wrote:
| My interpretation is that yes, passing NaN is undefined
| behavior. Strict weak ordering is defined in 25.8 Sorting
| and related operations [alg.sorting]:
|
| > 4 _The term_ strict _refers to the requirement of an
| irreflexive relation (`!comp(x, x)` for all `x`), and the
| term_ weak _to requirements that are not as strong as
| those for a total ordering, but stronger than those for a
| partial ordering. If we define `equiv(a, b)` as `!comp(a,
| b) && !comp(b, a)`, then the requirements are that `comp`
| and `equiv` both be transitive relations:_
|
| > 4.1 _`comp(a, b) && comp(b, c)` implies `comp(a, c)`_
|
| > 4.2 _`equiv(a, b) && equiv(b, c)` implies `equiv(a,
| c)`_
|
| NaN breaks these relations, because `equiv(42.0, NaN)`
| and `equiv(NaN, 3.14)` are both true, which would imply
| `equiv(42.0, 3.14)` is also true. But clearly that's
| _not_ true, so floating point numbers do not satisfy the
| strict weak ordering requirement.
|
| The standard doesn't explicitly say that NaN is undefined
| behavior. But it does not define the behavior for when
| NaN is used with `std::clamp()`, which I think by
| definition means it's undefined behavior.
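|
| Spelled out as code (a sketch of the equiv() above):
|
|     #include <cmath>
|
|     bool equiv(double a, double b) { return !(a < b) && !(b < a); }
|     // equiv(42.0, NAN) and equiv(NAN, 3.14) are both true, yet
|     // equiv(42.0, 3.14) is false: transitivity of equiv fails,
|     // so `<` over doubles with NaN is not a strict weak ordering.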
| vitorsr wrote:
| This seems to be the most likely reason. See also:
|
| https://godbolt.org/z/q7e3MrE66
| miohtama wrote:
| Sir cmovq, you have earned your username.
| CountHackulus wrote:
| I see that the assembly instructions are different, but what's
| the performance difference? Personally, I don't care about the
| number of instructions used, as long as it's faster. With things
| like store forwarding and register files, a lot of those movs
| might be treated as no-ops.
| superjan wrote:
| The only times I worry about min/max/clamp performance is when I
| need to do thousands or millions of them. And in that case, I'd
| suggest intrinsics. You get to choose how NaN is handled, it's
| branchless, and you can do multiple in parallel.
|
| It feels backwards that you need to order your comparisons so as
| to generate optimal assembly.
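|
| For example, a branchless scalar clamp via SSE2 intrinsics (a
| sketch; the operand order pins down the NaN policy, here a NaN
| input comes out as lo):
|
|     #include <immintrin.h>
|
|     double clamp_sse(double v, double lo, double hi) {
|         __m128d x = _mm_set_sd(v);
|         x = _mm_max_sd(x, _mm_set_sd(lo));  // NaN? second operand wins
|         x = _mm_min_sd(x, _mm_set_sd(hi));
|         return _mm_cvtsd_f64(x);
|     }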
___________________________________________________________________
(page generated 2024-01-16 23:01 UTC)