[HN Gopher] Beware of fast-math
___________________________________________________________________
Beware of fast-math
Author : simonbyrne
Score : 95 points
Date : 2021-11-12 16:53 UTC (6 hours ago)
(HTM) web link (simonbyrne.github.io)
(TXT) w3m dump (simonbyrne.github.io)
| jjgreen wrote:
| One trick that I happened upon was speeding up complex
| multiplication (by roughly a factor of 5) under GCC with the
| -fcx-fortran-rules switch.
| bruce343434 wrote:
| -fcx-fortran-rules
|     Complex multiplication and division follow Fortran rules.
|     Range reduction is done as part of complex division, but
|     there is no checking whether the result of a complex
|     multiplication or division is NaN + I*NaN, with an attempt
|     to rescue the situation in that case.
|     The default is -fno-cx-fortran-rules.
|
| [1] https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Optimize-Option...
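|
| For context, "Fortran rules" here essentially means using the
| textbook formula with no NaN + I*NaN recovery step. A rough C
| sketch of the difference (naive_cmul is an illustrative name,
| not anything GCC defines):
|
|       #include <complex.h>
|
|       /* Fortran-rules multiply: the textbook formula, with no
|          check for a spurious NaN + I*NaN result. */
|       static double complex naive_cmul(double complex x,
|                                        double complex y) {
|           double a = creal(x), b = cimag(x);
|           double c = creal(y), d = cimag(y);
|           return (a*c - b*d) + (a*d + b*c) * I;
|       }
|
|       /* A C99-conforming `x * y` additionally checks whether the
|          result came out as NaN + I*NaN and tries to recover a
|          correct infinity, which is where the extra cost goes. */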
| kzrdude wrote:
| It looks like -fassociative-math is "safe" in the sense that it
| cannot be used to get UB in working code? That's a good property
| that makes it easier to use in the right context.
| adgjlsfhk1 wrote:
| Yeah. It's safe in that you won't get UB, but it's bad in that
| you can get arbitrarily wrong answers.
| wiml wrote:
| To be fair, if you're using floating point _at all_ you can
| get arbitrarily wrong answers. The nice thing about IEEE 754
| conformance is that you can, with a lot of expertise, somewhat
| reason about the kinds of error you're getting. But for code
| that wasn't written by someone skilled in numerical techniques
| -- and that's the vast majority of FP code -- is fast-math
| worse than the status quo?
| simonbyrne wrote:
| From personal experience, yes: I've seen multiple cases where
| scientists found that the ultimate cause of their bugs was
| some fast-math-related optimization.
|
| The problem isn't necessarily the code they wrote
| themselves: it is often that they've compiled someone
| else's code or an open source library with fast-math, which
| broke some internal piece.
| simonbyrne wrote:
| Not necessarily: if your cospi(x) function is always returning
| 1.0 (https://github.com/JuliaLang/julia/issues/30073#issuecomment...),
| but you wrote your code assuming the result was in a different
| interval, then you could quite easily invoke undefined behavior.
| mbauman wrote:
| See the footnote in the post: you can re-associate a list of 2046
| numbers such that they sum to _any_ floating-point number
| between 0 and 2^970.
|
| https://discourse.julialang.org/t/array-ordering-and-naive-s...
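|
| A tiny C illustration of the underlying effect (values chosen
| just to make it obvious):
|
|       #include <stdio.h>
|
|       int main(void) {
|           double a = 1e100, b = -1e100, c = 1.0;
|           printf("%g\n", (a + b) + c);  /* 1: cancellation first */
|           printf("%g\n", a + (b + c));  /* 0: c is lost against b */
|           return 0;
|       }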
| StefanKarpinski wrote:
| To be fair though, as noted further down in that thread,
| naive left-to-right summation is the worst case here since
| the tree of operations is as skewed as possible. I think that
| any other tree shape is better and compilers will tend to
| make the tree more balanced, not less, when they use
| associativity to enable SIMD optimizations. So while
| reassociation is theoretically arbitrarily bad, in practice
| it's probably mostly ok. Probably.
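|
| For reference, a pairwise (balanced-tree) summation looks
| roughly like this in C; its error bound grows like O(log n)
| rather than O(n) for a left-to-right chain, and it is closer in
| shape to what a SIMD reduction does:
|
|       /* Sum x[0..n-1] by recursively splitting in half. */
|       double pairwise_sum(const double *x, int n) {
|           if (n <= 8) {               /* small base case */
|               double s = 0.0;
|               for (int i = 0; i < n; i++) s += x[i];
|               return s;
|           }
|           int h = n / 2;
|           return pairwise_sum(x, h) + pairwise_sum(x + h, n - h);
|       }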
| stephencanon wrote:
| This is exactly right. Also, _all_ of the summation
| orderings satisfy the usual backwards-stability bounds, and
| so all of them are perfectly stable in normal computations.
| vchuravy wrote:
| Especially the fact that loading a library compiled with GCC and
| fast-math enabled can modify the global state of the program...
| It's one of the most baffling decisions made in the name of
| performance.
|
| I would really like for someone to take fast-math seriously and
| to provide well-scoped, granular options to programmers. The
| Julia `@fastmath` macro gets close, but it is too broad. I want
| to control the flags individually.
|
| There's also the question of how that interacts with IPO/inlining...
| mhh__ wrote:
| D (LDC) lets you control the flags on a per-function basis.
|
| So one can (and we do at work) define @optmath, which is a
| specific set of flags we want (just a value we defined at
| compile time; the compiler recognizes it as a UDA), as opposed
| to letting the compiler bulldoze everything.
| physicsguy wrote:
| You can do that on a compiler-specific basis, e.g. with GCC:
|
| #pragma GCC optimize("fast-math")
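|
| GCC also has a per-function attribute form of the same thing,
| roughly like this (the dot product is just an illustration):
|
|       __attribute__((optimize("fast-math")))
|       float dot(const float *x, const float *y, int n) {
|           float s = 0.0f;
|           for (int i = 0; i < n; i++)
|               s += x[i] * y[i];  /* reassociation lets this
|                                     reduction vectorize */
|           return s;
|       }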
| smitop wrote:
| LLVM IR is more expressive than Clang when it comes to fast-math:
| it supports enabling fast-math optimizations on a per-_operation_
| basis (https://llvm.org/docs/LangRef.html#fastmath).
| simonbyrne wrote:
| Do you know what happens when you have ops with different
| flags? e.g. if you have (a + b) + c, where one + allows reassoc
| but one doesn't?
| SeanLuke wrote:
| The other examples he gave trade off significant math
| deficiencies for small speed gains. But flushing subnormals to
| zero can produce a MASSIVE speed gain. Like 1000x. And including
| subnormals isn't necessarily good floating point practice -- they
| were rather controversial during the development of IEEE 754 as I
| understand it. The tradeoff here is markedly different than in
| the other cases.
| StefanKarpinski wrote:
| That's true, but the danger of flushing subnormals to zero is
| correspondingly worse because it's _global_ CPU state, and
| there's commonly used code that relies on subnormals not being
| flushed to zero in order to work correctly, like `libm`. The
| example linked in the post is of a case where loading a shared
| library that had been compiled with `-Ofast` (which includes
| `-ffast-math`) broke a completely unrelated package because of
| this. Of course, the fact that CPU designers made this a global
| hardware flag is atrocious, but they did, so here we are.
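|
| Concretely, the state in question is the FTZ/DAZ bits of the
| MXCSR register, and what GCC's fast-math startup object does is
| roughly equivalent to this sketch:
|
|       #include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
|       #include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */
|
|       /* Runs at load time and flips process-wide MXCSR bits,
|          affecting every float op in the program afterwards. */
|       __attribute__((constructor))
|       static void set_ftz_daz(void) {
|           _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
|           _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
|       }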
| Joker_vD wrote:
| Wait, what is "local" CPU state/hardware flag? In any case,
| since x64 ABI doesn't require MXCSR to be in any particular
| state on function entry, libm should set/clear whatever
| control flags it needs on its own (and restore them on exit
| since MXCSR control bits _are_ defined to be callee-saved).
| StefanKarpinski wrote:
| Local would be not using register flags at all and instead
| indicating with each operation whether you want flushing or
| not (and rounding mode, ideally). Some libms may clear and
| restore the control flags and some may not. Libm is just an
| example here, and one where you're right that most of the
| function calls that might need to avoid flushing subnormals
| to zero are expensive enough that clearing and restoring
| flags is an acceptable cost. However, that's not always the
| case--sometimes the operation in question is a few
| instructions and it may get inlined into some other code.
| It _might_ be possible to handle this better at the
| compiler level while still using the MXCSR register, but if
| it is, LLVM certainly can't currently do that well.
| simonbyrne wrote:
| In theory, every function should do that to check things
| like rounding mode etc. But that would be pretty slow,
| especially for low-latency operations (modifying mxcsr will
| disrupt pipelining for example).
| pcwalton wrote:
| That wouldn't be practical. C math library performance
| really matters for numerical-intensive apps like games.
| adrian_b wrote:
| Flushing subnormals to zero produces speed gains only on
| certain CPU models, while on others it has almost no effect.
|
| For example, Zen CPUs have negligible penalties for handling
| denormals, but many Intel models have a penalty between 100 and
| 200 clock cycles for an operation on denormals.
|
| Even on CPU models with slow denormal processing, a speedup of
| 100x to 1000x applies only to the individual operations on
| denormals, and only when those operations are part of a stream
| running at the maximum CPU SIMD rate, where the hundred-odd
| lost clock cycles could otherwise have executed 4 or 8
| operations per cycle.
|
| No complete computation can have a significant percentage of
| operations on denormals, unless it is written in an extremely
| bad way.
|
| So for a complete computation, even on the models with bad
| denormal handling, a speedup of more than a few times would be
| abnormal.
|
| The only controversy that has ever existed about denormals is
| that handling them at full speed increases the cost of the FPU,
| so lazy or greedy companies, i.e. mainly Intel, have preferred
| to add the flush-to-zero option for gamers, instead of
| designing the FPU in the right way.
|
| When the correctness of the results is not important, like in
| many graphic or machine-learning applications, using flush-to-
| zero is OK, otherwise it is not.
| Someone wrote:
| > The only controversy that has ever existed about denormals
| is that handling them at full speed increases the cost of the
| FPU, so lazy or greedy companies, i.e. mainly Intel, have
| preferred to add the flush-to-zero option for gamers
|
| You could also say some companies have been kind enough to
| make hardware for gamers that doesn't have costly features
| they do not need.
| pavpanchekha wrote:
| Fun fact: when working on Herbie (http://herbie.uwplse.org), our
| automated tool for reducing floating-point error by rearranging
| your mathematical expressions, we found that fast-math often
| undid Herbie's improvements. In a sense, Herbie and fast-math are
| opposites: one makes code more accurate (sometimes slower,
| sometimes faster), while the other makes code faster (sometimes
| less accurate, sometimes more).
| simonbyrne wrote:
| Herbie is a great tool, especially for teaching.
| named-user wrote:
| The docs for fast math literally say this: "This option is not
| turned on by any -O option besides -Ofast since it can result
| in incorrect output for programs that depend on an exact
| implementation of IEEE or ISO rules/specifications for math
| functions. It may, however, yield faster code for programs that
| do not require the guarantees of these specifications."
|
| So what did you expect?
|
| I was taught in my undergrad class not to use fast math if you
| ever wanted accuracy. I don't understand how you missed this?
| pavpanchekha wrote:
| Hi, I noticed you're a new user, so in that spirit this is
| not the kind of comment that is well-received on HN. It reads
| as arrogant and condescending, though I assume that was not
| your intention. Since I publish regularly in this area, the
| condescension feels especially off-base.
|
| To answer the question in good faith: while the documentation
| clearly states that fast-math _can_ reduce accuracy, in
| isolation the accuracy changes are often small, and moreover
| in isolation the accuracy changes often look positive. So I
| was surprised, when evaluating Herbie, that it _frequently_
| caused _large, negative_ changes in accuracy, specifically by
| undoing changes made to improve accuracy.
| Quekid5 wrote:
| That's a real "fun fact" kind of thing. Love it.
|
| Of course, the 'real' solution is actual Numerical Analysis (as
| I'm sure you know) to keep the error properly bounded, but it's
| really interesting to have a sort of middle ground which just
| stabilizes the math locally... which might be good enough.
|
| Other fun fact: Numerical Analysis is a thing that's _really_
| hard to imagine unless you happen to be introduced to it during
| your education. It's so obviously a thing once you've heard of
| it, but REALLY hard to come up with ex nihilo.
| willis936 wrote:
| I use single precision floating point to save memory and
| computation in applications where it makes sense. I had a case
| where I didn't care about the vertical precision of a signal very
| much. It had a sample rate in the tens of thousands of samples
| per second. I was generating a sinusoid and transmitting it. On
| the receiver the signal would become garbled after about a
| minute. I slapped my head and immediately realized I had run out
| of precision: I was feeding a single-precision time value into
| the sin function, and t had become too large relative to the
| small increment.
|           sin(single(t)) == bad
|           single(sin(t)) == good
| adgjlsfhk1 wrote:
| IMO, sinpi(x) = sin(pi*x) is a better function because it does
| much better here. The regular trig functions are approximately
| 20% slower in most implementations because they have to
| accurately reduce mod 2pi, while reducing mod 2 is pretty much
| trivial.
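|
| A naive sketch of why the reduction is easy (not an accurate
| sinpi implementation -- a real one also avoids the rounding of
| pi*r near integers):
|
|       #include <math.h>
|
|       double sinpi_sketch(double x) {
|           double r = fmod(x, 2.0);  /* fmod is exact, so the
|                                        reduction costs nothing
|                                        in accuracy */
|           return sin(M_PI * r);
|       }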
| willis936 wrote:
| I think the real solution I should have adopted is incrementing
| t like this:
|           t = mod(t + ts, 1/f)
|
| Since I'm just sending a static frequency, the range of time
| never needs to go beyond one period. However, this is far from
| the critical path for performance, so using a double here would
| have been fine -- it all runs fast enough anyway.
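|
| In C-ish terms, the failure mode and the wrap-around fix look
| roughly like this (ts = sample period, f = tone frequency; the
| names are illustrative):
|
|       #include <math.h>
|
|       /* Drifts: once t grows, ts is only a few ulps of t, so the
|          phase increments become badly quantized. */
|       float next_sample_drifting(float *t, float ts, float f) {
|           *t += ts;
|           return sinf(2.0f * (float)M_PI * f * *t);
|       }
|
|       /* Wrapped: t never leaves one period, so the increment
|          always has plenty of relative precision. */
|       float next_sample_wrapped(float *t, float ts, float f) {
|           *t = fmodf(*t + ts, 1.0f / f);
|           return sinf(2.0f * (float)M_PI * f * *t);
|       }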
| dlsa wrote:
| Never considered fast-math. I get the sense it's useful but can
| create awkward and/or unexpected surprises. If I were to use it,
| I'd have to have a verification test harness as part of some
| pipeline to confirm no weirdness: literally a bunch of example
| canary calculations to determine whether fast-math will kill or
| harm some real use case.
|
| Is this a sensible approach? What are others experiences around
| this? I've never bothered with this kind of optimisation and I
| now vaguely feel like I'm missing out.
|
| I tend to use calculations for deterministic purposes rather than
| pure accuracy. 1+1=2.1, where the answer is stable and
| approximate, is still better and more useful than 1+1=2.0 where
| the answer is unstable -- e.g. because one of those is 0.9999999
| and the precision triggers some edge case.
| simonbyrne wrote:
| I tried to lay out a reasonable path: incrementally test
| accuracy and performance, and only enable the necessary
| optimizations to get the desired performance. Good tests will
| catch the obvious catastrophic cases, but some will inevitably
| be weird edge cases.
|
| As always, the devil is in the details: you typically can't
| check exact equality, as e.g. reassociating arithmetic can give
| slightly different (but not necessarily worse) results. So the
| challenge is coming up with an appropriate measure for
| determining whether something is wrong.
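|
| For example, a typical relative/absolute tolerance check looks
| something like this (the tolerances themselves are
| application-specific):
|
|       #include <math.h>
|       #include <stdbool.h>
|
|       /* Accept results that agree to a relative tolerance, with
|          an absolute floor for values near zero. */
|       bool approx_equal(double x, double y,
|                         double rtol, double atol) {
|           return fabs(x - y) <=
|                  fmax(atol, rtol * fmax(fabs(x), fabs(y)));
|       }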
| zoomablemind wrote:
| On the subject of floating-point math in general, I wonder what
| the practical way is to treat values of extreme magnitude (close
| to zero, ~1E-200, or to infinity, ~1E200, but not actually zero
| or inf)? This can occur in some iterative methods, expansion
| series, or around singularities.
|
| How reliable is it to keep the extreme magnitudes around, in the
| expectation that the respective quantities will cancel properly
| and yield a meaningful value (rounding-wise)?
|
| For example, calculating some resulting value function, expressed
| as
|
| v(x)=f(x)/g(x),
|
| where both f(x) and g(x) are oscillating with a number of roots
| in a given interval of x.
| simonbyrne wrote:
| The key thing about floating point is that it maintains
| relative accuracy. In your case, say f(x) and g(x) are both
| O(1e200) and are correct to some small relative tolerance, say
| 1e-10 (that is, the absolute error is about 1e190). Then the
| relative error of f(x)/g(x) stays nicely bounded at about
| 2e-10.
|
| However if you do f(x) - g(x), the absolute error is on the
| order of 2e190: if f(x) - g(x) is small, then now the relative
| error can be huge (this is known as catastrophic cancellation).
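|
| A quick C illustration of the ratio staying accurate while the
| difference does not (the literals stand in for computed values
| carrying ~1e-16 relative error):
|
|       #include <stdio.h>
|
|       int main(void) {
|           double f = 1.0000000001e200;
|           double g = 1.0000000000e200;
|           /* The ratio keeps ~15 good digits. */
|           printf("%.15g\n", f / g);
|           /* The difference is ~1e190, but if f and g each carry
|              ~1e184 of absolute error from earlier computation,
|              only about 6 of its digits are trustworthy. */
|           printf("%.15g\n", f - g);
|           return 0;
|       }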
| zoomablemind wrote:
| Both f(x) and g(x) could be calculated to proper machine
| precision (say, GSL double precision, ~1E-15). Would this imply
| that beyond machine precision the values are effectively fixed
| to zero or infinity respectively, instead of carrying around the
| extreme orders of magnitude?
| simonbyrne wrote:
| I'm not exactly sure what you're asking here, but the point
| is that "to machine precision" is relative: if f(x) and
| g(x) are O(1e200), then the absolute error of each is still
| O(1e185). However f(x)/g(x) will still be very accurate
| (with absolute error O(1e-15)).
| gsteinb88 wrote:
| If you can, working with the logarithms of the intermediate
| large or small values is one way around the issue.
|
| One example talking about this here:
| http://aosabook.org/en/500L/a-rejection-sampler.html#the-mul...
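|
| The standard trick in that family is log-sum-exp: factor out the
| largest term so nothing overflows or underflows. A minimal C
| sketch:
|
|       #include <math.h>
|
|       /* log(sum_i exp(a[i])), computed stably. */
|       double logsumexp(const double *a, int n) {
|           double m = a[0];
|           for (int i = 1; i < n; i++)
|               if (a[i] > m) m = a[i];
|           double s = 0.0;
|           for (int i = 0; i < n; i++)
|               s += exp(a[i] - m);   /* every term is <= 1 */
|           return m + log(s);
|       }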
| junon wrote:
| Not very reliable. For precision one usually seeks out the MPFR
| library or something similar.
| adgjlsfhk1 wrote:
| Arb (https://arblib.org/) is often better than MPFR.
| zoomablemind wrote:
| If anyone else wonders, MPFR (Multi-Precision Floating-point
| calc with correct Rounding). The application domain is
| Interval Math [1].
|
| [1]: http://cs.utep.edu/interval-comp/applications.html
| dmitrygr wrote:
| Contrarian viewpoint: beware of not-fast-math. Making things like
| atan2f and sqrtf set errno takes you down a very slow path,
| costing you significant perf in cases where you likely do not
| want it. And most math will work fine with fast-math, if you are
| careful how you write it. (Free online numerical methods classes
| are available, e.g. [1].) Without fast-math, most compilers
| cannot even use FMA instructions (costing you up to 2x in cases
| where they could otherwise be used), since they cannot prove the
| result will be the same -- FMA will actually likely produce a
| more accurate result, but without fast-math your compiler is
| handicapped and cannot offer it to you.
|
| [1]
| https://ocw.mit.edu/courses/mathematics/18-335j-introduction...
| mbauman wrote:
| That's precisely the part that makes it so impossible to use!
| Sometimes it means fewer bits of accuracy than IEEE would
| otherwise give you; sometimes it means more. Sometimes it
| results in your code being interpreted in a more algebra-ish
| way, sometimes it's less.
|
| That's why finer-grained flags are needed -- yes, FMAs and SIMD
| are essential for _both_ performance and improved accuracy, but
| `-ffast-math` bundles so many disparate things together it's
| impossible to understand what your code does.
|
| > And most math will work fine with fast-math, if you are
| careful how you write it.
|
| The most hair-pulling part about `-ffast-math` is that it will
| actively _disable_ your "careful code." You can't check for
| nans. You can't check for residuals. It'll rearrange those
| things on your behalf because it's faster that way.
| an1sotropy wrote:
| (in case anyone reading doesn't know: FMA = Fused Multiply and
| Add, as in a*b+c, an operation on 3 values, which increases
| precision by incurring rounding error once instead of twice)
|
| I'm not an expert on this, but for my own code I've been
| meaning to better understand the discussion here [1], which
| suggests that there ARE ways of getting FMAs, without the
| sloppiness of fast-math.
|
| [1] https://stackoverflow.com/questions/15933100/how-to-use-
| fuse...
| simonbyrne wrote:
| -ffp-contract=fast will enable FMA contraction, i.e.
| replacing a * b + c with fma(a,b,c). This is generally okay,
| but there are a few cases where it can cause problems: the
| canonical example is computing an expression of the form:
|
| a * d - b * c
|
| If a == b and c == d (and all are finite), then this should
| give 0 (which is true for strict IEEE 754 math), but if you
| replace it with an fma then you can get either a positive or
| negative value, depending on the order in which it was
| contracted. Issues like this pop up in complex
| multiplication, or applying the quadratic formula.
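|
| A concrete way to see that failure, forcing one particular
| contraction by hand (the constants are just chosen to expose
| the effect):
|
|       #include <math.h>
|       #include <stdio.h>
|
|       int main(void) {
|           double a = 1.0 + 0x1p-52, b = a;   /* a == b */
|           double c = 1.0 + 0x1p-27, d = c;   /* c == d */
|
|           /* Strict IEEE: both products round identically, so the
|              difference is exactly 0. */
|           double strict = a*d - b*c;
|
|           /* One possible contraction: the rounded b*c no longer
|              cancels the exact a*d inside the fma, leaving a tiny
|              nonzero value whose sign depends on which product
|              was contracted. */
|           double contracted = fma(a, d, -(b*c));
|
|           printf("%g %g\n", strict, contracted);  /* 0 vs 2^-79 */
|           return 0;
|       }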
| m4r35n357 wrote:
| MPFR https://www.mpfr.org/
| dahart wrote:
| > which suggests that there ARE ways of getting FMAs, without
| the sloppiness of fast-math.
|
| There are ways, indeed, but they are pretty slow, it's
| prioritizing accuracy over performance. And they're still
| pretty tricky too. The most practical alternative for float
| FMA might be to use doubles, and for double precision FMA
| might be to bump to a 128 bit representation.
|
| Here's a paper on what it takes to do FMA emulation:
| https://www.lri.fr/~melquion/doc/08-tc.pdf
| adgjlsfhk1 wrote:
| That's not what the parent meant. The parent meant that there
| are ways of generating FMA instructions without using fast-math.
| Emulating an FMA instruction is almost always a bad idea (I
| should know; I've written FMA emulation before. It sucks.)
| dahart wrote:
| Oh, my mistake, thanks. Yes you can use FMA instructions
| without the fast-math compiler flag for sure. Emulation
| being a bad idea is the impression I got; I'm glad to
| hear the confirmation from experience.
| an1sotropy wrote:
| I remember a teacher who said (when I was a student)
| something like "if you care about precision use double".
| Now that I'm teaching, I force students to only use single-
| precision "float"s in their code, with the message that FP
| precision is a finite resource, and you don't learn how to
| manage any resource by increasing its supply. I think my
| students hate me.
| StefanKarpinski wrote:
| The way Julia handles this is worth noting:
|
| - `fma(a, b, c)` is exact but may be slow: it uses intrinsics
| if available and falls back to a slow software emulation when
| they're not
|
| - `muladd(a, b, c)` uses the fastest possibly inexact
| implementation of `a*b + c` available, which is FMA
| intrinsics if available or just doing separate `*` and `+`
| operations if they're not
|
| That gives the user control over what they need--precision or
| speed. If you're writing code that needs the extra precision,
| use the `fma` function but if you just want to compute `a*b +
| c` as fast as possible, with or without extra precision, then
| use `muladd`.
| adgjlsfhk1 wrote:
| Note that this is only true in theory. In practice, there are
| still some bugs here that will hopefully be fixed by Julia 1.8.
| simonbyrne wrote:
| My point isn't that fast-math isn't useful: it very much is.
| The problem is that it is a whole grab bag of optimizations,
| some of which can do very dangerous things. Rather than using a
| sledgehammer, you should try to be selective and enable only the
| useful optimizations, e.g. you could just enable
| -ffp-contract=fast and -fno-math-errno.
| djmips wrote:
| One thing I don't think you pointed out is that tracking down
| issues with NaNs seems hard with fast-math since, I believe, it
| also disables any exceptions that might be useful for being
| alerted to their formation?
| kristofferc wrote:
| > Free online numerical methods classes are available
|
| How can you use any numerical methods (like error analysis) if
| you don't have a solid foundation with strict rules to analyze
| on top of?
| headPoet wrote:
| -funsafe-math-optimizations always makes me laugh. Of course I
| want fun and safe math optimisations
| markhahn wrote:
| NaNs should trap, but compilers should not worry about accurate
| debugging.
| pfdietz wrote:
| The link to Kahan Summation was interesting.
|
| https://en.wikipedia.org/wiki/Kahan_summation_algorithm
___________________________________________________________________
(page generated 2021-11-12 23:01 UTC)