[HN Gopher] Beware of fast-math
       ___________________________________________________________________
        
       Beware of fast-math
        
       Author : simonbyrne
       Score  : 95 points
       Date   : 2021-11-12 16:53 UTC (6 hours ago)
        
 (HTM) web link (simonbyrne.github.io)
 (TXT) w3m dump (simonbyrne.github.io)
        
       | jjgreen wrote:
       | One trick that I happened upon was speeding up complex
       | multiplication (like a factor of 5) under gcc with the --enable-
       | cx-fortran switch.
        
         | bruce343434 wrote:
          | -fcx-fortran-rules
          | 
          |     Complex multiplication and division follow Fortran rules.
          |     Range reduction is done as part of complex division, but
          |     there is no checking whether the result of a complex
          |     multiplication or division is NaN + I*NaN, with an attempt
          |     to rescue the situation in that case. The default is
          |     -fno-cx-fortran-rules.
         | 
         | [1] https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Optimize-
         | Option...
        
       | kzrdude wrote:
        | It looks like -fassociative-math is "safe" in the sense that it
        | cannot be used to get UB in working code? That's a good
        | property, and it makes it easier to use in the right context.
        
         | adgjlsfhk1 wrote:
         | Yeah. It's safe in that you won't get UB, but it's bad in that
         | you can get arbitrarily wrong answers.
        
           | wiml wrote:
            | To be fair, if you're using floating point _at all_ you can
            | get arbitrarily wrong answers. The nice thing about IEEE 754
            | conformance is that you can, with a lot of expertise,
            | somewhat reason about the kinds of error you're getting. But
            | for code that wasn't written by someone skilled in numerical
            | techniques (and that's the vast majority of FP code), is
            | fast-math actually worse than the status quo?
        
             | simonbyrne wrote:
             | From personal experience, yes: I've seen multiple cases of
             | scientists finding the ultimate cause of their bugs was
             | some fast-math-related optimization.
             | 
             | The problem isn't necessarily the code they wrote
             | themselves: it is often that they've compiled someone
             | else's code or an open source library with fast-math, which
             | broke some internal piece.
        
         | simonbyrne wrote:
         | Not necessarily: if your cospi(x) function is always returning
         | 1.0 (https://github.com/JuliaLang/julia/issues/30073#issuecomme
         | nt...), but you wrote your code assuming the result was in a
         | different interval, then you could quite easily invoke
         | undefined behavior.
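          | 
          | A minimal C sketch of how that turns into UB (the always-1.0
          | fast_cospi here is hypothetical, standing in for a cospi
          | broken by fast-math):
          | 
          |   #include <stdio.h>
          | 
          |   /* Hypothetical stand-in for a cospi miscompiled under
          |      fast-math so that it always returns 1.0. */
          |   static double fast_cospi(double x) { (void)x; return 1.0; }
          | 
          |   int main(void) {
          |       static const char *bins[10] =
          |           {"b0","b1","b2","b3","b4","b5","b6","b7","b8","b9"};
          |       double c = fast_cospi(0.5);   /* should be 0.0 */
          |       /* Written assuming c is in [-1, 1), i.e. index 0..9.
          |          With c == 1.0 the index is 10: out-of-bounds read,
          |          undefined behavior. */
          |       int i = (int)((c + 1.0) * 5.0);
          |       printf("%s\n", bins[i]);
          |       return 0;
          |   }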
        
         | mbauman wrote:
         | See the one footnote: you can re-associate a list of 2046
         | numbers such that they sum to _any_ floating point number
         | between 0 and 2^970.
         | 
         | https://discourse.julialang.org/t/array-ordering-and-naive-s...
        
           | StefanKarpinski wrote:
           | To be fair though, as noted further down in that thread,
           | naive left-to-right summation is the worst case here since
           | the tree of operations is as skewed as possible. I think that
           | any other tree shape is better and compilers will tend to
           | make the tree more balanced, not less, when they use
           | associativity to enable SIMD optimizations. So while
           | reassociation is theoretically arbitrarily bad, in practice
           | it's probably mostly ok. Probably.
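            | 
            | For intuition, a C sketch of a more balanced (pairwise)
            | reduction tree, whose worst-case rounding error grows roughly
            | with log(n) rather than n (plain doubles assumed, nothing
            | compiler-specific):
            | 
            |   #include <stddef.h>
            | 
            |   /* Pairwise (recursive) summation: a balanced tree. */
            |   double pairwise_sum(const double *x, size_t n) {
            |       if (n <= 8) {            /* small base case: loop */
            |           double s = 0.0;
            |           for (size_t i = 0; i < n; i++) s += x[i];
            |           return s;
            |       }
            |       size_t half = n / 2;
            |       return pairwise_sum(x, half) +
            |              pairwise_sum(x + half, n - half);
            |   }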
        
             | stephencanon wrote:
             | This is exactly right. Also, _all_ of the summation
             | orderings satisfy the usual backwards-stability bounds, and
             | so all of them are perfectly stable in normal computations.
        
       | vchuravy wrote:
        | Especially the fact that loading a library compiled by GCC with
        | fast math on can modify the global state of the program... It's
        | one of the most baffling decisions made in the name of
        | performance.
        | 
        | I would really like for someone to take fast math seriously, and
        | to provide well-scoped, granular options to programmers. The
        | Julia `@fastmath` macro gets close, but it is too broad. I want
        | to control the flags individually.
        | 
        | There's also the question of how that interacts with
        | IPO/inlining...
        
         | mhh__ wrote:
          | D (LDC) lets you control the flags on a per-function basis.
          | 
          | So one can (and we do at work) define an @optmath attribute
          | carrying a specific set of flags (just a value we defined at
          | compile time; the compiler recognizes it as a UDA), as opposed
          | to letting the compiler bulldoze everything.
        
           | physicsguy wrote:
           | You can do that on a per-compiler basis for e.g. with
           | 
           | #pragma GCC optimize("fast-math")
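            | 
            | A sketch of the function-scoped form in C (GCC-specific; as
            | far as I know clang ignores this attribute, so treat it as a
            | GCC-only tool):
            | 
            |   /* Only this function gets fast-math; the rest of the
            |      translation unit keeps strict IEEE semantics. */
            |   __attribute__((optimize("fast-math")))
            |   float dot_fast(const float *a, const float *b, int n) {
            |       float s = 0.0f;
            |       for (int i = 0; i < n; i++)
            |           s += a[i] * b[i];   /* may be reassociated/SIMDed */
            |       return s;
            |   }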
        
       | smitop wrote:
        | LLVM IR is more expressive than clang when it comes to fast-
        | math: it supports applying fast-math flags on a per _operation_
        | basis (https://llvm.org/docs/LangRef.html#fastmath).
        
         | simonbyrne wrote:
         | Do you know what happens when you have ops with different
         | flags? e.g. if you have (a + b) + c, where one + allows reassoc
         | but one doesn't?
        
       | SeanLuke wrote:
       | The other examples he gave trade off significant math
       | deficiencies for small speed gains. But flushing subnormals to
       | zero can produce a MASSIVE speed gain. Like 1000x. And including
       | subnormals isn't necessarily good floating point practice -- they
       | were rather controversial during the development of IEEE 754 as I
       | understand it. The tradeoff here is markedly different than in
       | the other cases.
        
         | StefanKarpinski wrote:
          | That's true, but the danger of flushing subnormals to zero is
          | correspondingly worse because it's _global_ CPU state, and
          | there's commonly used code that relies on not flushing
          | subnormals to zero in order to work correctly, like `libm`.
          | The example
         | linked in the post is of a case where loading a shared library
         | that had been compiled with `-Ofast` (which includes `-ffast-
         | math`) broke a completely unrelated package because of this. Of
         | course, the fact that CPU designers made this a global hardware
         | flag is atrocious, but they did, so here we are.
        
           | Joker_vD wrote:
            | Wait, what is a "local" CPU state/hardware flag? In any case,
            | since the x86-64 ABI doesn't require MXCSR to be in any
            | particular state on function entry, libm should set/clear
            | whatever control flags it needs on its own (and restore them
            | on exit, since MXCSR control bits _are_ defined to be
            | callee-saved).
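            | 
            | A sketch of that save/set/restore pattern with SSE intrinsics
            | (x86-only; this toggles the FTZ bit, the DAZ analogue lives
            | in pmmintrin.h):
            | 
            |   #include <xmmintrin.h>
            | 
            |   double sum_keep_subnormals(const double *x, int n) {
            |       /* Save the caller's MXCSR, force "keep subnormals". */
            |       unsigned int saved = _mm_getcsr();
            |       _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
            | 
            |       double s = 0.0;
            |       for (int i = 0; i < n; i++)
            |           s += x[i];
            | 
            |       _mm_setcsr(saved);   /* restore the caller's state */
            |       return s;
            |   }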
        
             | StefanKarpinski wrote:
              | Local would be not using register flags at all and instead
              | indicating with each operation whether you want flushing or
              | not (and rounding mode, ideally). Some libms may clear and
              | restore the control flags and some may not. Libm is just an
              | example here, and one where you're right that most of the
              | function calls that might need to avoid flushing subnormals
              | to zero are expensive enough that clearing and restoring
              | flags is an acceptable cost. However, that's not always the
              | case--sometimes the operation in question is a few
              | instructions and it may get inlined into some other code.
              | It _might_ be possible to handle this better at the
              | compiler level while still using the MXCSR register, but if
              | it is, LLVM certainly can't currently do that well.
        
             | simonbyrne wrote:
              | In theory, every function should do that to check things
              | like rounding mode etc. But that would be pretty slow,
              | especially for low-latency operations (modifying MXCSR will
              | disrupt pipelining, for example).
        
             | pcwalton wrote:
              | That wouldn't be practical. C math library performance
              | really matters for numerically intensive apps like games.
        
         | adrian_b wrote:
          | Flushing subnormals to zero produces speed gains only on
          | certain CPU models; on others it has almost no effect.
          | 
          | For example, Zen CPUs have negligible penalties for handling
          | denormals, but many Intel models have a penalty between 100 and
          | 200 clock cycles for an operation on denormals.
          | 
          | Even on the CPU models with slow denormal handling, a speedup
          | factor between 100 and 1000 applies only to the denormal
          | operation itself, and only when that operation belongs to a
          | stream of operations running at the maximum CPU SIMD speed,
          | where during the hundred-and-something lost clock cycles the
          | CPU could have done 4 or 8 operations per clock cycle.
          | 
          | No complete computation can have a significant percentage of
          | operations on denormals unless it is written in an extremely
          | bad way.
          | 
          | So for a complete computation, even on the models with bad
          | denormal handling, a speedup of more than a few times would be
          | abnormal.
         | 
         | The only controversy that has ever existed about denormals is
         | that handling them at full speed increases the cost of the FPU,
         | so lazy or greedy companies, i.e. mainly Intel, have preferred
         | to add the flush-to-zero option for gamers, instead of
         | designing the FPU in the right way.
         | 
         | When the correctness of the results is not important, like in
         | many graphic or machine-learning applications, using flush-to-
         | zero is OK, otherwise it is not.
        
           | Someone wrote:
           | > The only controversy that has ever existed about denormals
           | is that handling them at full speed increases the cost of the
           | FPU, so lazy or greedy companies, i.e. mainly Intel, have
           | preferred to add the flush-to-zero option for gamers
           | 
           | You could also say some companies have been kind enough to
           | make hardware for gamers that doesn't have costly features
           | they do not need.
        
       | pavpanchekha wrote:
       | Fun fact: when working on Herbie (http://herbie.uwplse.org), our
       | automated tool for reducing floating-point error by rearranging
       | your mathematical expressions, we found that fast-math often
       | undid Herbie's improvements. In a sense, Herbie and fast-math are
       | opposites: one makes code more accurate (sometimes slower,
       | sometimes faster), while the other makes code faster (sometimes
       | less accurate, sometimes more).
        
         | simonbyrne wrote:
         | Herbie is a great tool, especially for teaching.
        
         | named-user wrote:
          | The docs for fast math literally say this: "This option is not
         | turned on by any -O option besides -Ofast since it can result
         | in incorrect output for programs that depend on an exact
         | implementation of IEEE or ISO rules/specifications for math
         | functions. It may, however, yield faster code for programs that
         | do not require the guarantees of these specifications."
         | 
         | So what did you expect?
         | 
         | I was taught in my undergrad class not to use fast math if you
         | ever wanted accuracy. I don't understand how you missed this?
        
           | pavpanchekha wrote:
            | Hi, I noticed you're a new user, so in that spirit: this is
            | not the kind of comment that is well received on HN. It reads
            | as arrogant and condescending, though I assume that was not
           | your intention. Since I publish regularly in this area, the
           | condescension feels especially off-base.
           | 
           | To answer the question in good faith: while the documentation
           | clearly states that fast-math _can_ reduce accuracy, in
           | isolation the accuracy changes are often small, and moreover
           | in isolation the accuracy changes often look positive. So I
           | was surprised, when evaluating Herbie, that it _frequently_
           | caused _large, negative_ changes in accuracy, specifically by
           | undoing changes made to improve accuracy.
        
         | Quekid5 wrote:
         | That's a real "fun fact" kind of thing. Love it.
         | 
         | Of course, the 'real' solution is actual Numerical Analysis (as
         | I'm sure you know) to keep the error properly bounded, but it's
         | really interesting to have a sort of middle ground which just
         | stabilizes the math locally... which might be good enough.
         | 
         | Other fun fact: Numerical Analysis is a thing that's _really_
         | hard to imagine unless you happen to be introduced to it during
          | an education. It's so obviously a thing once you've heard of
         | it, but REALLY hard to come up with ex nihilo.
        
       | willis936 wrote:
       | I use single precision floating point to save memory and
       | computation in applications where it makes sense. I had a case
       | where I didn't care about the vertical precision of a signal very
       | much. It had a sample rate in the tens of thousands of samples
       | per second. I was generating a sinusoid and transmitting it. On
       | the receiver the signal would become garbled after about a
       | minute. I slapped my head and immediately realized I ran out of
       | precision by using a single precision time value feeding the sin
       | function when t became too large with the small increment.
        |       sin(single(t)) == bad
        |       single(sin(t)) == good
        
         | adgjlsfhk1 wrote:
          | IMO, sinpi(x) = sin(pi*x) is a better function because it does
          | much better here. The regular trig functions are approximately
          | 20% slower in most implementations in order to accurately
          | reduce mod 2pi, while reducing mod 2 is pretty much trivial.
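          | 
          | A naive C sketch of the idea (nowhere near a real sinpi, which
          | would reduce further into a small interval and handle signs
          | and exact cases, but it shows why the reduction is cheap):
          | 
          |   #include <math.h>
          | 
          |   /* Reduce mod 2 (exact), then evaluate. M_PI is POSIX. */
          |   double sinpi_naive(double x) {
          |       double r = fmod(x, 2.0);   /* cheap, exact reduction */
          |       return sin(M_PI * r);
          |   }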
        
           | willis936 wrote:
            | I think the real solution I should have adopted is
            | incrementing t like this:
            | 
            |       t = mod(t + ts, 1 / f)
            | 
            | Since I'm just sending a static frequency, the range of time
            | never needs to go beyond one period. However, using a double
            | here is far from the critical path for performance, and it
            | all runs fast enough anyway.
        
       | dlsa wrote:
        | Never considered fast-math. I get the sense it's useful but can
        | create awkward and/or unexpected surprises. If I were to use it
        | I'd have to have a verification test harness as part of some
        | pipeline to confirm no weirdness: literally a bunch of example
        | canary calculations to determine whether fast-math will kill or
        | harm some real use case.
        | 
        | Is this a sensible approach? What are others' experiences around
        | this? I've never bothered with this kind of optimisation and I
        | now vaguely feel like I'm missing out.
        | 
        | I tend to use calculations for deterministic purposes rather
        | than pure accuracy. 1+1=2.1 where the answer is stable and
        | approximate is still better and more useful than 1+1=2.0 where
        | the answer is unstable, e.g. because one of those is 0.9999999
        | and the precision triggers some edge case.
        
         | simonbyrne wrote:
         | I tried to lay out a reasonable path: incrementally test
         | accuracy and performance, and only enable the necessary
         | optimizations to get the desired performance. Good tests will
         | catch the obvious catastrophic cases, but some will inevitably
         | be weird edge cases.
         | 
          | As always, the devil is in the details: you typically can't
          | check exact equality, as e.g. reassociating arithmetic can give
          | slightly different (but not necessarily worse) results. So the
          | challenge is coming up with an appropriate measure for deciding
          | whether something is wrong.
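          | 
          | For example, a simple mixed absolute/relative tolerance check
          | in C (the thresholds are placeholders you'd tune per problem):
          | 
          |   #include <math.h>
          |   #include <stdbool.h>
          | 
          |   /* Passes if |got - want| <= atol + rtol * |want|: atol
          |      guards values near zero, rtol handles large magnitudes. */
          |   bool approx_equal(double got, double want,
          |                     double rtol, double atol) {
          |       return fabs(got - want) <= atol + rtol * fabs(want);
          |   }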
        
       | zoomablemind wrote:
       | On the subject of the floating-point math in general, I wonder
       | what's the practical way to treat the extreme order values (close
       | to zero ~ 1E-200, or infinity ~ 1E200, but not zero or inf)? This
       | can take place in some iterative methods, expansion series, or
       | around some singularities.
       | 
        | How reliable is it to keep the extreme orders in the expectation
        | that the respective quantities would cancel the orders properly,
        | yielding a meaningful value (rounding-wise)?
       | 
       | For example, calculating some resulting value function, expressed
       | as
       | 
       | v(x)=f(x)/g(x),
       | 
       | where both f(x) and g(x) are oscillating with a number of roots
       | in a given interval of x.
        
         | simonbyrne wrote:
          | The key thing about floating point is that it maintains
          | relative accuracy. In your case, say f(x) and g(x) are both
          | O(1e200) and correct to some small relative tolerance, say
          | 1e-10 (that is, the absolute error is about 1e190). Then the
          | relative error of f(x)/g(x) stays nicely bounded, at about
          | 2e-10.
          | 
          | However, if you compute f(x) - g(x), the absolute error is on
          | the order of 2e190: if f(x) - g(x) is small, then the relative
          | error can be huge (this is known as catastrophic cancellation).
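          | 
          | A tiny C illustration of the difference (values chosen so the
          | effect is visible at double precision, not taken from the
          | post):
          | 
          |   #include <stdio.h>
          | 
          |   int main(void) {
          |       /* Two nearby large values, each good to ~1e-16
          |          relative accuracy. */
          |       double f = 1.0e16 + 2.0;
          |       double g = 1.0e16;
          |       printf("%.17g\n", f / g);  /* ratio: still accurate */
          |       printf("%.17g\n", f - g);  /* difference: tiny result,
          |                                     easily swamped by any
          |                                     error in f or g */
          |       return 0;
          |   }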
        
           | zoomablemind wrote:
            | Both f(x) and g(x) could be computed to proper machine
            | precision (say, GSL double precision ~ 1E-15). Would this
            | imply that, beyond the machine precision, the values are
            | effectively fixed to zero or infinity respectively, instead
            | of carrying around the extreme orders of magnitude?
        
             | simonbyrne wrote:
             | I'm not exactly sure what you're asking here, but the point
             | is that "to machine precision" is relative: if f(x) and
             | g(x) are O(1e200), then the absolute error of each is still
             | O(1e185). However f(x)/g(x) will still be very accurate
             | (with absolute error O(1e-15)).
        
         | gsteinb88 wrote:
          | If you can, working with the logarithms of the intermediate
          | large or small values is one way around the issue.
         | 
         | One example talking about this here:
         | http://aosabook.org/en/500L/a-rejection-sampler.html#the-mul...
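          | 
          | A common C sketch of that trick is log-sum-exp: keep log(f)
          | and log(g) and combine them without ever forming the huge or
          | tiny values themselves:
          | 
          |   #include <math.h>
          | 
          |   /* log(exp(la) + exp(lb)), computed stably by factoring out
          |      the larger exponent so exp() never overflows. */
          |   double logaddexp(double la, double lb) {
          |       if (la < lb) { double t = la; la = lb; lb = t; }
          |       return la + log1p(exp(lb - la));
          |   }
          | 
          |   /* And a ratio f/g of hugely scaled values is just a
          |      subtraction in log space: log(f/g) = lf - lg. */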
        
         | junon wrote:
         | Not very reliable. For precision one usually seeks out the MPFR
         | library or something similar.
        
           | adgjlsfhk1 wrote:
           | ARB https://arblib.org/ is often better than mpfr.
        
           | zoomablemind wrote:
           | If anyone else wonders, MPFR (Multi-Precision Floating-point
           | calc with correct Rounding). The application domain is
           | Interval Math [1].
           | 
           | [1]:http://cs.utep.edu/interval-comp/applications.html
        
       | dmitrygr wrote:
        | Contrarian viewpoint: beware of not-fast-math. Making things like
       | atan2f and sqrtf set errno takes you down a very slow path,
       | costing you significant perf in cases where you likely do not
       | want it. And most math will work fine with fast-math, if you are
       | careful how you write it. (Free online numerical methods classes
       | are available, eg [1]) Without fast-math most compilers cannot
       | even use FMA instructions (costing you up to 2x in cases where
       | they could be used otherwise) since they cannot prove it will
       | produce the same result - FMA will actually likely produce a more
       | accurate result, but your compiler is handicapped by lack of
       | fast-math to offer it to you.
       | 
       | [1]
       | https://ocw.mit.edu/courses/mathematics/18-335j-introduction...
        
         | mbauman wrote:
         | That's precisely the part that makes it so impossible to use!
         | Sometimes it means fewer bits of accuracy than IEEE would
         | otherwise give you; sometimes it means more. Sometimes it
         | results in your code being interpreted in a more algebra-ish
         | way, sometimes it's less.
         | 
         | That's why finer-grained flags are needed -- yes, FMAs and SIMD
         | are essential for _both_ performance and improved accuracy, but
         | `-ffast-math` bundles so many disparate things together it's
         | impossible to understand what your code does.
         | 
         | > And most math will work fine with fast-math, if you are
         | careful how you write it.
         | 
         | The most hair-pulling part about `-ffast-math` is that it will
         | actively _disable_ your "careful code." You can't check for
         | nans. You can't check for residuals. It'll rearrange those
         | things on your behalf because it's faster that way.
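          | 
          | A concrete C sketch of the kind of guard that can evaporate
          | (whether it actually does depends on compiler and version;
          | with -ffinite-math-only the compiler is allowed to assume the
          | branch is dead):
          | 
          |   #include <math.h>
          |   #include <stdio.h>
          | 
          |   double safe_divide(double a, double b) {
          |       double r = a / b;
          |       /* Under -ffast-math (-ffinite-math-only) the compiler
          |          may assume r is never NaN and drop this check. */
          |       if (isnan(r)) {
          |           fprintf(stderr, "bad division: %g / %g\n", a, b);
          |           return 0.0;
          |       }
          |       return r;
          |   }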
        
         | an1sotropy wrote:
         | (in case anyone reading doesn't know: FMA = Fused Multiply and
         | Add, as in a*b+c, an operation on 3 values, which increases
         | precision by incurring rounding error once instead of twice)
         | 
         | I'm not an expert on this, but for my own code I've been
         | meaning to better understand the discussion here [1], which
         | suggests that there ARE ways of getting FMAs, without the
         | sloppiness of fast-math.
         | 
         | [1] https://stackoverflow.com/questions/15933100/how-to-use-
         | fuse...
        
           | simonbyrne wrote:
           | -ffp-contract=fast will enable FMA contraction, i.e.
           | replacing a * b + c with fma(a,b,c). This is generally okay,
           | but there are a few cases where it can cause problems: the
           | canonical example is computing an expression of the form:
           | 
           | a * d - b * c
           | 
           | If a == b and c == d (and all are finite), then this should
           | give 0 (which is true for strict IEEE 754 math), but if you
           | replace it with an fma then you can get either a positive or
           | negative value, depending on the order in which it was
           | contracted. Issues like this pop up in complex
           | multiplication, or applying the quadratic formula.
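            | 
            | For contrast, a C sketch of the usual way to use FMA
            | deliberately here (Kahan's difference-of-products trick),
            | rather than letting the compiler contract one side
            | arbitrarily:
            | 
            |   #include <math.h>
            | 
            |   /* a*b - c*d with one explicit fma capturing the rounding
            |      error of c*d, so the sign of small results is right. */
            |   double diff_of_products(double a, double b,
            |                           double c, double d) {
            |       double cd  = c * d;
            |       double err = fma(-c, d, cd);   /* = cd - c*d, exact */
            |       double dop = fma(a, b, -cd);
            |       return dop + err;
            |   }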
        
           | m4r35n357 wrote:
           | MPFR https://www.mpfr.org/
        
           | dahart wrote:
           | > which suggests that there ARE ways of getting FMAs, without
           | the sloppiness of fast-math.
           | 
           | There are ways, indeed, but they are pretty slow, it's
           | prioritizing accuracy over performance. And they're still
           | pretty tricky too. The most practical alternative for float
           | FMA might be to use doubles, and for double precision FMA
           | might be to bump to a 128 bit representation.
           | 
           | Here's a paper on what it takes to do FMA emulation:
           | https://www.lri.fr/~melquion/doc/08-tc.pdf
        
             | adgjlsfhk1 wrote:
              | That's not what the parent meant. The parent meant that
              | there are ways of generating FMA instructions without using
              | fast-math. Emulating an FMA instruction is almost always a
              | bad idea (I should know; I've written FMA emulation before.
              | It sucks).
        
               | dahart wrote:
               | Oh, my mistake, thanks. Yes you can use FMA instructions
               | without the fast-math compiler flag for sure. Emulation
               | being a bad idea is the impression I got; I'm glad to
               | hear the confirmation from experience.
        
             | an1sotropy wrote:
             | I remember a teacher who said (when I was a student)
             | something like "if you care about precision use double".
             | Now that I'm teaching, I force students to only use single-
             | precision "float"s in their code, with the message that FP
             | precision is a finite resource, and you don't learn how to
             | manage any resource by increasing its supply. I think my
             | students hate me.
        
           | StefanKarpinski wrote:
           | The way Julia handles this is worth noting:
           | 
           | - `fma(a, b, c)` is exact but may be slow: it uses intrinsics
           | if available and falls back to a slow software emulation when
           | they're not
           | 
           | - `muladd(a, b, c)` uses the fastest possibly inexact
           | implementation of `a*b + c` available, which is FMA
           | intrinsics if available or just doing separate `*` and `+`
           | operations if they're not
           | 
           | That gives the user control over what they need--precision or
           | speed. If you're writing code that needs the extra precision,
           | use the `fma` function but if you just want to compute `a*b +
           | c` as fast as possible, with or without extra precision, then
           | use `muladd`.
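            | 
            | A rough C analogue, assuming a C99 libm (fma() is required
            | to be correctly rounded; FP_FAST_FMA is defined when it maps
            | to a fast hardware instruction; the muladd helper name is
            | mine):
            | 
            |   #include <math.h>
            | 
            |   /* fma(a, b, c): exact (correctly rounded), possibly a slow
            |      software fallback. muladd: fastest available, possibly
            |      without the extra precision. */
            |   static inline double muladd(double a, double b, double c) {
            |   #ifdef FP_FAST_FMA
            |       return fma(a, b, c);
            |   #else
            |       return a * b + c;
            |   #endif
            |   }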
        
             | adgjlsfhk1 wrote:
              | Note that this is only true in theory. In practice, there
              | are still some bugs here that will hopefully be fixed by
              | Julia 1.8.
        
         | simonbyrne wrote:
          | My point isn't that fast-math isn't useful: it very much is.
          | The problem is that it is a whole grab bag of optimizations,
          | some of which can be very dangerous. Rather than using a
          | sledgehammer, you should try to be selective and enable only
          | the useful optimizations, e.g. you could just enable
          | -ffp-contract=fast and -fno-math-errno.
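          | 
          | For example, something along these lines (exact flag spellings
          | can differ a little between GCC and Clang versions):
          | 
          |   cc -O2 -ffp-contract=fast -fno-math-errno -c kernel.c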
        
           | djmips wrote:
            | One thing I don't think you pointed out is that tracking down
            | issues with NaNs seems hard with fast-math since, I believe,
            | it also disables any exceptions that might be useful for
            | alerting you to their formation?
        
         | kristofferc wrote:
         | > Free online numerical methods classes are available
         | 
          | How can you use any numerical methods (like error analysis) if
          | you don't have a solid foundation with strict rules to build
          | on?
        
       | headPoet wrote:
       | -funsafe-math-optimizations always makes me laugh. Of course I
       | want fun and safe math optimisations
        
       | markhahn wrote:
        | NaNs should trap, but compilers should not worry about accurate
       | debugging.
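        | 
        | For what it's worth, a glibc-specific C sketch of opting into
        | trapping (feenableexcept is a GNU extension, not portable C):
        | 
        |   #define _GNU_SOURCE
        |   #include <fenv.h>
        |   #include <stdio.h>
        | 
        |   int main(void) {
        |       /* Trap (SIGFPE) on invalid operations such as 0.0/0.0
        |          that would otherwise silently produce a NaN. */
        |       feenableexcept(FE_INVALID);
        |       volatile double zero = 0.0;
        |       double x = zero / zero;   /* traps here */
        |       printf("%g\n", x);        /* not reached */
        |       return 0;
        |   }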
        
       | pfdietz wrote:
       | The link to Kahan Summation was interesting.
       | 
       | https://en.wikipedia.org/wiki/Kahan_summation_algorithm
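        | 
        | For reference, a compact C sketch; notably this is exactly the
        | kind of code -ffast-math can silently break, since the
        | compensation term c is algebraically zero and
        | -fassociative-math may optimize it away:
        | 
        |   /* Kahan (compensated) summation: c accumulates the low-order
        |      bits lost when adding each element into the running sum. */
        |   double kahan_sum(const double *x, int n) {
        |       double sum = 0.0, c = 0.0;
        |       for (int i = 0; i < n; i++) {
        |           double y = x[i] - c;
        |           double t = sum + y;
        |           c = (t - sum) - y;   /* recovered rounding error */
        |           sum = t;
        |       }
        |       return sum;
        |   }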
        
       ___________________________________________________________________
       (page generated 2021-11-12 23:01 UTC)