[HN Gopher] High Performance Correctly Rounded Math Libraries fo...
___________________________________________________________________
High Performance Correctly Rounded Math Libraries for 32-Bit
Floating Point
Author : ingve
Score : 64 points
Date : 2021-08-26 18:10 UTC (4 hours ago)
(HTM) web link (blog.sigplan.org)
(TXT) w3m dump (blog.sigplan.org)
| camgunz wrote:
| There are ~30 year old legends in the client/server gamedev
| community about floating point differing across platforms, but I
| assumed they were ironed out by now. _Wild_ to find out they
| haven't been.
| thxg wrote:
| Nowadays, all major platforms follow IEEE754, which specifies
| exactly the results of elementary operations (+, -, x, /). You
| must get bit-for-bit reproducible results across different
| runs, compiles, libraries, OSes and hardware. Any difference is
| a bug.
|
| However:
|
| 1. As the paper emphasizes, math library functions (sin, cos,
| log, etc.) are not covered by that standardization, at least
| not mandatorily and not in practice. This is the problem that
| the paper tackles.
|
| 2. In the bad old days of the x87, the FPU carried out
| computations on eight 80-bit registers. The problem is that
| whenever those registers spilled out to the stack, precision
| was truncated down to whatever type the language actually meant
| (64-bit double or 32-bit float). Because the register-spilling
| decision is mostly taken by the compiler, it was largely
| unpredictable. This resulted in code that was not reproducible,
| even across compiles! Since x87 was deprecated in favor of
| SSE/AVX, the precision of operations is always that of the type
| used in the code. That is lower than 80 bits, but much better in
| practice because of consistency and reproducibility.
|
| 3. The bad old days are making a comeback with FMA instructions
| (fused multiply-add). This is a single operation (a x b + c)
| and so it is correctly rounded to the closest representable
| number. This number can be different from ((a x b) + c) in
| which _two_ roundings occur. If we give the compiler freedom to
| decide whether or not to fuse such expressions, we will again
| see non-reproducibility across compiles. Unfortunately, gcc and
| icc both do that by default at -O3 (see -ffast-math, and more
| specifically -fexcess-precision). Specifying a proper language
| standard (for example -std=c99, -std=c++14, etc.) restores sane
| behavior.
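|
| A minimal sketch of the two-rounding difference (values picked
| purely for illustration; compile with -ffp-contract=off to keep
| the compiler from fusing the plain expression itself):
|
|   #include <stdio.h>
|   #include <math.h>
|
|   int main(void) {
|       double a = 1.0 + 0x1p-27;   /* 1 + 2^-27 */
|       double b = 1.0 - 0x1p-27;   /* 1 - 2^-27 */
|       /* Exact product is 1 - 2^-54, not representable in double. */
|       double two_roundings = a * b - 1.0;     /* product rounds to 1.0, then 0.0 */
|       double one_rounding  = fma(a, b, -1.0); /* single rounding: exactly -2^-54 */
|       printf("a*b - 1     = %a\n", two_roundings);
|       printf("fma(a,b,-1) = %a\n", one_rounding);
|       return 0;
|   }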
| pklausler wrote:
| x87 can store and load full 80-bit register values. The
| truncation you mention is optional.
| adgjlsfhk1 wrote:
| On the one hand, the ambiguity of FMA is kind of annoying,
| but on the other, FMA is a godsend for writing accurate
| arithmetic. One reason for this is that the error of `a*b`
| is `fma(a, b, -a*b)`, which makes compensated algorithms much
| faster to write. Secondly, polynomial evaluations using
| Horner's method converge to 0.5 ULP (units in last place) with
| FMA, but to 1.5 ULP without. This is incredibly useful since
| polynomial approximations are at the heart of pretty much
| everything else.
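|
| A sketch of both tricks in C (the coefficient array in the
| Horner helper is whatever polynomial you happen to be
| evaluating; nothing here is specific to any particular libm):
|
|   #include <math.h>
|
|   /* Exact error of a product: a*b == p + err exactly
|      (barring overflow/underflow). */
|   static void two_prod(double a, double b, double *p, double *err) {
|       *p = a * b;
|       *err = fma(a, b, -*p);   /* recovers the rounding error of the multiply */
|   }
|
|   /* Horner evaluation of c[0] + c[1]*x + ... + c[n-1]*x^(n-1),
|      one fused rounding per step. */
|   static double horner_fma(const double *c, int n, double x) {
|       double acc = c[n - 1];
|       for (int i = n - 2; i >= 0; i--)
|           acc = fma(acc, x, c[i]);
|       return acc;
|   }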
|
| I think Julia takes an interesting approach here of never
| reordering floating point computations unless you explicitly
| give permission using @fastmath (which acts locally instead
| of globally like -ffast-math in C/C++). This makes it much
| easier to use IEEE-specific tricks and be confident that the
| compiler won't mess with you, while still making it easy to
| tell the compiler to do what it wants.
| titzer wrote:
| > I think Julia takes an interesting approach here of never
| reordering floating point computations
|
| WebAssembly as well. IEEE-754 is mandated and there are no
| unsafe compiler optimizations allowed. This is the only
| sane choice. It's just too hard to reason about programs
| otherwise. Nothing about a super-intelligent machine
| completely reorganizing your code in a way that subtly
| breaks it is good, IMHO.
| thxg wrote:
| > FMA is a godsend for writing accurate arithmetic.
|
| True, and the small performance boost (on some platforms)
| is nice too.
|
| > never reordering floating point computations unless you
| explicitly give permission
|
| Yes, requiring explicit code for FMA and arithmetic
| grouping/reordering in general seems like the sane
| approach.
| dleslie wrote:
| Decades ago we were deeply concerned with synchronized
| simulations; the dominant replication method was lock-step
| input replication, or similar, and that made it very important
| that everyone's simulation was identical or bad things would
| happen.
|
| Nowadays that's not an issue, as there's a slew of different
| ways of overcoming that and allowing clients to operate
| asynchronously; sometimes widely asynchronous.
| _dwt wrote:
| Any good resources (or even search term hints) you could
| recommend on the more modern/async approaches? I'd love to
| read about them.
| lodi wrote:
| https://gafferongames.com/post/state_synchronization/
|
| Also search for "game networking" + the following:
| - state replication
| - dead reckoning
| - client side prediction
| - networked physics
| thxg wrote:
| This work essentially allows one to go beyond the IEEE754
| standard, which mandates correctly-rounded elementary operations
| +, -, x and /: they provide implementations of libm functions
| (cos, sin, log, exp, etc.) that are _also_ correctly rounded.
|
| This is really nice, not necessarily because anyone cares about
| the correctness of the last bit in a float, but because, since
| there is only one "correct" answer, requiring correct rounding
| means that those functions are then fully specified and
| deterministic. No more getting slightly different results when
| changing platform, OS, or even just updating libm (some
| transcendental functions are still actively being improved,
| thus changed, in glibc!). This is amazing for reproducibility.
|
| Since the performance is good (even better than existing libm
| implementations, they claim), this is a true win-win.
|
| This is obviously made possible by the fact that one can easily
| enumerate all 32-bit floats. It would be amazing to have
| something similar, even at a big performance cost, for 64-bit
| doubles (MPFR can do it of course but the perf cost there is
| truly massive). Things are much harder with 64 bits,
| unfortunately.
| Someone wrote:
| Apple had a library that did that for 80-bit floating point,
| for both the 6502 (!) and 68000: SANE (https://en.wikipedia.org
| /wiki/Standard_Apple_Numerics_Enviro...)
|
| And yes, performance wasn't good, especially compared to the
| x87 that IBM PCs used. It did claim to round the last bit
| correctly, though, where early x87s could have large errors
| (https://randomascii.wordpress.com/2014/10/09/intel-
| underesti...)
| brandmeyer wrote:
| > Things are much harder with 64 bits, unfortunately.
|
| Solved and published over 15 years ago. The underlying methods
| were published over 20 years ago (V. Lefèvre, J.-M. Muller, and
| A. Tisserand, "Towards correctly rounded transcendentals", and
| Lefèvre's associated PhD thesis).
|
| https://hal-ens-lyon.archives-ouvertes.fr/ensl-01529804/file...
|
| The Correctly Rounded Math Library (CRLIBM) is available for
| several languages.
| adgjlsfhk1 wrote:
| CRLIBM is great, but it is significantly slower than
| alternatives since it requires very high internal precision.
| The cool thing with this paper is that it gives correct
| rounding without as much of a cost for 32-bit floats.
| bsder wrote:
| That's a bit glib. I don't believe that the bounds are fully
| proven for all the transcendentals (and I believe that
| exponentiation is only "proven" over restricted subdomains).
|
| Single precision floating point is a small enough domain that
| it can be fully enumerated--you can prove that everything
| rounds correctly and find the problematic cases. 2^32 inputs
| just isn't that many for modern computers.
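|
| The exhaustive check is simple to sketch: walk every 32-bit
| pattern, reinterpret it as a float, and compare against a
| reference. (Here the double-precision libm stands in for the
| reference, which is not a rigorous oracle, just an illustration
| of the loop; logf is an arbitrary choice.)
|
|   #include <stdint.h>
|   #include <string.h>
|   #include <math.h>
|   #include <stdio.h>
|
|   int main(void) {
|       uint64_t mismatches = 0;
|       for (uint64_t bits = 0; bits <= UINT32_MAX; bits++) {
|           uint32_t b32 = (uint32_t)bits;
|           float x;
|           memcpy(&x, &b32, sizeof x);          /* reinterpret the bit pattern */
|           if (!(x > 0.0f)) continue;           /* stay in the domain of log */
|           float got  = logf(x);
|           float want = (float)log((double)x);  /* stand-in reference, not an oracle */
|           if (memcmp(&got, &want, sizeof got) != 0) mismatches++;
|       }
|       printf("%llu mismatches\n", (unsigned long long)mismatches);
|       return 0;
|   }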
|
| Double precision floating point is still too large to fully
| enumerate. So, there could be an exception lurking in there
| that goes out a long way.
|
| In reality, it seems that rounding requires about 2n + log2 n
| bits and never 3n bits (I don't _think_ we've ever found an
| exception to this). However, if you can _prove_ that, you
| should submit it as it would be a significant advance in the
| world of mathematics.
| adgjlsfhk1 wrote:
| In general, there shouldn't be a rule like 3n bits. For
| random functions, 1 in 8^n values should require more than
| 3n bits. Since n bit inputs give 2^n possible values, this
| implies that for bit length n, there is a 4^-n chance that
| any given function will require more than 3n bits.
| bsder wrote:
| Except that this isn't the case. We know empirically
| because we have enumerated the single precision space
| completely.
|
| So, the problem isn't bounded by simple probabilities.
| There are other properties involved that cause it to
| diverge from the "naive" probabilistic estimates. What
| those properties are we don't know yet.
|
| The "rounding" problem is called the "Table Maker's
| Dilemma" and it descends into some pretty fundamental
| transcendental number theory.
|
| This is not an "easy" problem. The proof that pi is
| transcendental is relatively modern (Lindemann, 1882).
| adgjlsfhk1 wrote:
| This doesn't contradict what I said. I said that the odds
| of a function needing more than 3n bits were 4^-n. For
| single precision, that means you would expect almost all
| functions to be OK with 3n bits (1 in 2^46 wouldn't be).
| wumpus wrote:
| > No more getting slightly different result when changing
| platform, OS, or even just updating libm
|
| Optimizing compilers are usually used with flags that cause
| them to break IEEE754.
|
| Still, it would be nice if results from low-optimization runs
| were bit-exact.
| titzer wrote:
| > Optimizing compilers are usually used with flags that cause
| them to break IEEE754.
|
| For languages that don't mandate IEEE-754. There are plenty
| of languages that do, like Java, JavaScript, C#, and
| WebAssembly.
|
| It's worth noting that usually such optimizations are things
| like enabling algebraic reassociation and commutation (of
| primitives like +, -, /, ==, etc) operations, fused-multiply
| add, and others. It's really not clear to me how much
| disabling all those optimization costs in the real world.
|
| Things get crazy when compilers want to do major
| reorganization of loops, optimizing stencils, polyhedral
| optimizations. I am not an expert here, but I think these
| rely on properties of float arithmetic that don't hold in all
| cases, i.e. they rely on UB.
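|
| A tiny example of why reassociation is not value-preserving
| (made-up numbers, plain IEEE doubles):
|
|   #include <stdio.h>
|
|   int main(void) {
|       double a = 1e16, b = -1e16, c = 1.0;
|       printf("%g\n", (a + b) + c);  /* 1: the huge terms cancel first */
|       printf("%g\n", a + (b + c));  /* 0: b + c rounds back to -1e16 */
|       return 0;
|   }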
| wumpus wrote:
| > It's really not clear to me how much disabling all those
| optimization costs in the real world.
|
| I assure you that SPECcpu scores are a lot lower, if that's
| what you care about. And yes, it's extremely helpful to do
| major reorganization of loops if the program has multi-
| dimensional loops in the wrong order.
| thxg wrote:
| Yes, indeed it is the case with fused multiply-add
| operations, as I wrote below in another thread. By default at
| -O3, unfortunately, gcc and icc both take responsibility for
| deciding whether or not to use FMA instructions when they see
| (a x b + c) expressions. As a result we can even see slightly
| different results across compiles!
|
| However, as far as I understand, specifying a proper language
| standard (for example -std=c99, -std=c++14, etc.) restores
| sane behavior (see -ffast-math, and more specifically
| -fexcess-precision).
|
| Are there other cases where -O3 breaks IEEE754 on modern
| setups?
| jcranmer wrote:
| Fast math flags are actually far more complicated in C/C++
| compilers than just a simple -ffast-math flag, although I
| suppose most users may not be acquainted with the degree of
| control possible. In LLVM, for example, fast math flags are
| actually a set of 7 independently-toggled flags (no-signed-
| zero, no-infinity, no-NaN, allow FMAs, allow reassociation,
| allow approximate reciprocal, and allow approximate
| functions), that at the C/C++ level can be toggled either
| by command-line options or on a per-scope basis. ICC
| actually has two different levels of fast math [1].
|
| The dark underside of IEEE 754 is that full support for it
| also requires features that most languages don't provide
| access to: rounding mode and exception handling (aka
| sticky bits). In C/C++, using these features correctly
| requires STDC FENV_ACCESS support which is only in ICC and
| recent clang (neither GCC nor MSVC support it). And there are
| usually two hardware bits for handling denormals (flush-to-
| zero and denormals-are-zero) that aren't in IEEE 754, which
| tend to be on by default when you're compiling for some
| targets (most notably GPUs).
|
| [1] https://software.intel.com/content/www/us/en/develop/do
| cumen... -- note fp-model=fast=[1|2]
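|
| For anyone who hasn't touched that corner of the language, a
| minimal <fenv.h> sketch (the pragma is honored by ICC and recent
| clang as noted above; GCC compiles it but may not respect it):
|
|   #include <fenv.h>
|   #include <stdio.h>
|
|   #pragma STDC FENV_ACCESS ON
|
|   int main(void) {
|       /* Rounding mode: the same division under two modes. */
|       volatile double one = 1.0, three = 3.0;
|       fesetround(FE_UPWARD);
|       double up = one / three;
|       fesetround(FE_DOWNWARD);
|       double down = one / three;
|       fesetround(FE_TONEAREST);
|       printf("up - down = %a\n", up - down);  /* one ulp apart if honored */
|
|       /* Exception (sticky) flags. */
|       feclearexcept(FE_ALL_EXCEPT);
|       volatile double big = 1e308;
|       volatile double overflowed = big * 10.0; /* overflows */
|       (void)overflowed;
|       if (fetestexcept(FE_OVERFLOW)) printf("overflow flag set\n");
|       return 0;
|   }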
| wumpus wrote:
| > Yes, indeed it is the case with fused multiply-add
| operations, as I wrote below in another thread
|
| This is not always true. Some FMA operations round after
| the multiply, and some do not.
| brandmeyer wrote:
| It's unfortunate that they don't mention the prior work of the
| crlibm project in this domain.
|
| https://gforge.inria.fr/scm/browser.php?group_id=5929&extra=...
|
| Github mirror: https://github.com/taschini/crlibm
|
| Python bindings: https://pypi.org/project/crlibm/
|
| Julia bindings: https://juliapackages.com/p/crlibm
| samth wrote:
| crlibm is discussed in their paper:
| https://people.cs.rutgers.edu/~sn349/papers/rlibm32-pldi-202...
| brandmeyer wrote:
| Based on their press material, you'd think that nobody else
| has done this before at any precision. You can see the
| resulting confusion in this very forum. That kind of behavior
| is disingenuous _at best_.
| GeorgeTirebiter wrote:
| I've read that, if one is going to use floats, then one should
| eschew 32-bit floats, and go with 64-bit floats. 80-bit floats
| seem only useful for very special applications.
|
| I converted a satellite orbital prediction program that used
| doubles (64-bit floats) to use long doubles (80-bit floats) but
| only saw a difference in the 11th decimal place, with lots of
| sin/cos/atan/sqrt involved in the calculations. With 32-bit
| floats it was a joke: the accumulated error made the
| 'predictions' useless.
|
| The basic problem with 32-bit is you have to be quite careful
| with your arithmetic so you don't lose significance. It's almost
| as bad as using Q-notation.
| https://en.wikipedia.org/wiki/Q_(number_format)
| zokier wrote:
| > I've read that, if one is going to use floats, then one
| should eschew 32-bit floats, and go with 64-bit floats. 80-bit
| floats seem only useful for very special applications.
|
| This is what Kahan has to say on the matter:
|
| _For now the 10-byte Extended format is a tolerable compromise
| between the value of extra-precise arithmetic and the price of
| implementing it to run fast; very soon two more bytes of
| precision will become tolerable, and ultimately a 16-byte
| format ... That kind of gradual evolution towards wider
| precision was already in view when IEEE Standard 754 for
| Floating-Point Arithmetic was framed_
|
| (The quote is apparently from a 1994 unpublished manuscript, so
| I can't find the source for it.)
|
| More than 25 years later and we are rarely using even 80-bit
| floats, never mind 128-bit ones.
| Narishma wrote:
| Isn't that because 80-bit floats were deprecated a decade or
| two ago in x86, and were never implemented in other
| architectures to begin with?
| modeless wrote:
| 32-bit floats are sufficient and desired for performance in
| many cases, like graphics or audio. 64-bit floats are not
| enough in some cases, like galaxy-scale coordinates or
| nanosecond precision over long time periods or some currency
| calculations. There's no single best answer, you have to know
| what your requirements are.
| dynm wrote:
| The main use for 80-bit floats is probably to confirm that when
| you switch from 64-bit to 80-bit, nothing changes.
|
| This is a very important use, though!
| dragontamer wrote:
| Errors grow exponentially in most floating-point operations.
|
| The exception is cancellation error, which can suddenly cause
| the error rate to spike dramatically. But if you sort your
| numbers before adding/subtracting, you can often negate this
| overwhelming cancellation error.
|
| The exponential growth of other floating point errors is almost
| unstoppable however. The only real way to stop it is to design
| an algorithm that operates with fewer steps.
|
| ---------
|
| 64-bit has a much smaller initial error (about 1/2^53 relative),
| while 32-bit starts at about 1/2^23, i.e. 2^30 times (roughly
| nine decimal orders of magnitude) more error than double
| precision.
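|
| A small illustration of both points (toy numbers, standard C):
|
|   #include <stdio.h>
|
|   int main(void) {
|       /* Cancellation: the small term is lost before the subtraction. */
|       float big = 1e8f, tiny = 1.0f;
|       printf("%g\n", (big + tiny) - big);  /* 0: 1e8f + 1.0f rounds back to 1e8f */
|       printf("%g\n", (big - big) + tiny);  /* 1: cancel the large terms first */
|
|       /* Initial error: float keeps ~7 decimal digits, double ~16. */
|       printf("%.17g\n", (double)0.1f);     /* 0.10000000149011612 */
|       printf("%.17g\n", 0.1);              /* 0.10000000000000001 */
|       return 0;
|   }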
| vlovich123 wrote:
| Kahan summation[1] may be better than just doing a sorted sum
| & doesn't require that you have all the numbers up-front to
| sort (i.e. a running sum). Although I think even it benefits
| from doing it in sorted order if you do have the full history
| available & the Wikipedia article lists a parallelization
| technique you can leverage in that case. Klein summation might
| be better but I haven't investigated.
|
| EDIT: Apparently there's a nice Rust crate for doing this
| stuff: https://crates.io/crates/accurate
|
| [1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm#A
| lte...
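|
| For reference, the whole trick fits in a few lines (a plain C
| sketch of the classic sequential loop, not the parallel variant):
|
|   #include <stddef.h>
|
|   /* Kahan (compensated) summation: carry the running rounding
|      error alongside the sum. */
|   double kahan_sum(const double *x, size_t n) {
|       double sum = 0.0, c = 0.0;   /* c holds the lost low-order bits */
|       for (size_t i = 0; i < n; i++) {
|           double y = x[i] - c;     /* reintroduce the previous error */
|           double t = sum + y;      /* low bits of y are lost here... */
|           c = (t - sum) - y;       /* ...and recovered here */
|           sum = t;
|       }
|       return sum;
|   }
|
| (This is also exactly the kind of code that -ffast-math
| reassociation silently breaks, which ties back to the rest of
| the thread.)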
| adgjlsfhk1 wrote:
| One thing to consider here is XSum. It's a C library that
| has surprisingly fast exact floating point sums.
| a1369209993 wrote:
| Kahan summation is basically just synthesizing a larger
| floating point type (with 46/106 bits of mantissa for
| 32/64-bit base type) and then using that to do the sum,
| though. You still lose precision if your errors exceed that
| new, larger margin. E.g. try adding 34000 + 56 + .078 - 56 -
| 34000, carrying two digits per float.
| dev_tty01 wrote:
| Depends on what you are doing. 8-bit floats are widely used in
| large neural network applications. (No that is not a typo. I
| mean 8-bit, not 80-bit.)
| adgjlsfhk1 wrote:
| Really? I thought that for 8 bits, fixed point formats were
| much more common.
| alfalfasprout wrote:
| It really depends what you're using the floats for. For ML
| applications? 16 bit floats do the job with some trickery
| (quantization).
|
| For scientific computing? Generally 64 bit floats give you the
| needed precision. But if you're chaining a lot of operations
| that exacerbate floating point error you may need more.
|
| For usage in trading? Again, depends on what computations
| you're doing. There's no "hard and fast" rule.
| jart wrote:
| I'm a libc author and I want to use this. Then I clicked on the
| repository and there wasn't any code implementing these libm
| routines. Instead it just had tools and they told me I need to
| install a 6 GB Docker container to use them.
| zokier wrote:
| > Then I clicked on the repository and there wasn't any code
| implementing these libm routines
|
| What's this then?
|
| https://github.com/rutgers-apl/rlibm-32/tree/main/source/flo...
___________________________________________________________________
(page generated 2021-08-26 23:00 UTC)