[HN Gopher] High Performance Correctly Rounded Math Libraries fo...
       ___________________________________________________________________
        
       High Performance Correctly Rounded Math Libraries for 32-Bit
       Floating Point
        
       Author : ingve
       Score  : 64 points
       Date   : 2021-08-26 18:10 UTC (4 hours ago)
        
 (HTM) web link (blog.sigplan.org)
 (TXT) w3m dump (blog.sigplan.org)
        
       | camgunz wrote:
        | There are ~30-year-old legends in the client/server gamedev
        | community about floating point differing across platforms, but I
        | assumed they were ironed out by now. _Wild_ to find out they
        | haven't been.
        
         | thxg wrote:
         | Nowadays, all major platforms follow IEEE754, which specifies
         | exactly the results of elementary operations (+, -, x, /). You
         | must get bit-for-bit reproducible results across different
         | runs, compiles, libraries, OSes and hardware. Any difference is
         | a bug.
         | 
         | However:
         | 
         | 1. As the paper emphasizes, math library functions (sin, cos,
         | log, etc.) are not covered by that standardization, at least
         | not mandatorily and not in practice. This is the problem that
         | the paper tackles.
         | 
         | 2. In the bad old days of the x87, the FPU carried out
         | computations on eight 80-bit registers. The problem is that
         | whenever those registers spilled out to the stack, precision
         | was truncated down to whatever type the language actually meant
         | (64-bit double or 32-bit float). Because the register-spilling
         | decision is mostly taken by the compiler, it was largely
         | unpredictable. This resulted in code that was not reproducible,
         | even across compiles! Since x87 was deprecated in favor of
         | SSE/AVX, the precision of operations is always that of the type
          | used in the code. Lower than 80 bits, but much better in
          | practice because of consistency and reproducibility.
         | 
         | 3. The bad old days are making a comeback with FMA instructions
         | (fused multiply-add). This is a single operation (a x b + c)
         | and so it is correctly rounded to the closest representable
         | number. This number can be different from ((a x b) + c) in
         | which _two_ roundings occur. If we give the compiler freedom to
         | decide whether or not to fuse such expressions, we will again
         | see non-reproducibility across compiles. Unfortunately, gcc and
         | icc both do that by default at -O3 (see -ffast-math, and more
          | specifically -ffp-contract). Specifying a proper language
         | standard (for example -std=c99, -std=c++14, etc.) restores sane
         | behavior.
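          | 
          | A minimal sketch of the double-rounding difference (mine, not
          | from the article; the constants are chosen so the two results
          | differ). Compile with something like `gcc -std=c99 -O2
          | fma_demo.c -lm` so the compiler does not itself contract
          | a*b + c into an FMA:
          | 
          |     #include <math.h>
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         /* volatile: keep the compiler from folding these */
          |         volatile double a = 1.0 + 0x1p-30;     /* 1 + 2^-30  */
          |         volatile double b = 1.0 + 0x1p-30;
          |         volatile double c = -(1.0 + 0x1p-29);  /* -(fl(a*b)) */
          | 
          |         double two = a * b + c;     /* a*b rounded, then add */
          |         double one = fma(a, b, c);  /* rounded once, at the end */
          | 
          |         printf("a*b + c      = %.17g\n", two); /* 0 */
          |         printf("fma(a, b, c) = %.17g\n", one); /* ~8.7e-19 */
          |         return 0;
          |     }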
        
           | pklausler wrote:
           | x87 can store and load full 80-bit register values. The
           | truncation you mention is optional.
        
           | adgjlsfhk1 wrote:
            | On the one hand, the ambiguity of FMA is kind of annoying,
            | but on the other, FMA is a godsend for writing accurate
            | arithmetic. One reason is that the error of `a*b` is
            | `fma(a, b, -a*b)`, which makes compensated algorithms much
            | faster to write. Secondly, polynomial evaluations using
            | Horner's method converge to 0.5 ULP (units in the last
            | place) with FMA, but only to 1.5 ULP without. This is
            | incredibly useful since polynomial approximations are at the
            | heart of pretty much everything else.
            | 
            | I think Julia takes an interesting approach here of never
            | reordering floating point computations unless you explicitly
            | give permission using @fastmath (which acts locally instead
            | of globally like -ffast-math in C/C++). This makes it much
            | easier to use IEEE-specific tricks and be confident that the
            | compiler won't mess with you, while still making it easy to
            | tell the compiler to do what it wants.
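            | 
            | For concreteness, a sketch of both tricks (these are standard
            | techniques, not code from the paper; `two_prod` and `horner`
            | are just illustrative names):
            | 
            |     #include <math.h>
            | 
            |     /* Exact product splitting: p + e == a*b exactly
            |        (barring overflow), where p is the rounded product
            |        and e its rounding error. */
            |     static void two_prod(double a, double b,
            |                          double *p, double *e) {
            |         *p = a * b;
            |         *e = fma(a, b, -*p);
            |     }
            | 
            |     /* Horner evaluation c[0] + x*(c[1] + x*(...)) with a
            |        single FMA (one rounding) per step. */
            |     static double horner(const double *c, int deg, double x) {
            |         double r = c[deg];
            |         for (int i = deg - 1; i >= 0; i--)
            |             r = fma(r, x, c[i]);
            |         return r;
            |     }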
        
             | titzer wrote:
             | > I think Julia takes an interesting approach here of never
             | reordering floating point computations
             | 
             | WebAssembly as well. IEEE-754 is mandated and there are no
             | unsafe compiler optimizations allowed. This is the only
             | sane choice. It's just too hard to reason about programs
             | otherwise. Nothing about a super-intelligent machine
             | completely reorganizing your code in a way that subtly
             | breaks it is good, IMHO.
        
             | thxg wrote:
             | > FMA is a godsend for writing accurate arithmatic.
             | 
             | True, and the small performance boost (on some platforms)
              | is nice too.
             | 
              | > never reordering floating point computations unless you
              | explicitly give permission
             | 
             | Yes, requiring explicit code for FMA and arithmetic
             | grouping/reordering in general seems like the sane
             | approach.
        
         | dleslie wrote:
          | Decades ago we were deeply concerned with synchronized
         | simulations; the dominant replication method was lock-step
         | input replication, or similar, and that made it very important
         | that everyone's simulation was identical or bad things would
         | happen.
         | 
          | Nowadays that's not an issue, as there's a slew of different
          | ways of overcoming it and allowing clients to operate
          | asynchronously, sometimes very much so.
        
           | _dwt wrote:
           | Any good resources (or even search term hints) you could
           | recommend on the more modern/async approaches? I'd love to
           | read about them.
        
             | lodi wrote:
             | https://gafferongames.com/post/state_synchronization/
             | 
             | Also search for "game networking" + the following:
              | - state replication
              | - dead reckoning
              | - client side prediction
              | - networked physics
        
       | thxg wrote:
        | This work essentially allows one to go beyond the IEEE754
        | standard, which mandates correctly-rounded elementary operations
        | +, -, x and /: they provide implementations of libm functions
        | (cos, sin, log, exp, etc.) that are _also_ correctly rounded.
        | 
        | This is really nice, not so much because anyone cares about the
        | correctness of the last bit in a float, but because, since there
        | is only one "correct" answer, requiring correct rounding means
        | that those functions are fully specified and deterministic. No
        | more getting slightly different results when changing platform,
        | OS, or even just updating libm (some transcendental functions
        | are still actively being improved, thus changed, in glibc!).
        | This is amazing for reproducibility.
       | 
       | Since the performance is good (even better than existing libm
       | implementations, they claim), this is a true win-win.
       | 
       | This is obviously made possible by the fact that one can easily
       | enumerate all 32-bit floats. It would be amazing to have
       | something similar, even at a big performance cost, for 64-bit
       | doubles (MPFR can do it of course but the perf cost there is
       | truly massive). Things are much harder with 64 bits,
       | unfortunately.
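        | 
        | A rough sketch of why 32 bits is tractable: every float bit
        | pattern fits in one loop (on the order of minutes of CPU time).
        | The double-precision sin() here is only an approximate stand-in;
        | a real check would compare against a true correctly rounded
        | reference such as MPFR:
        | 
        |     #include <math.h>
        |     #include <stdint.h>
        |     #include <stdio.h>
        |     #include <string.h>
        | 
        |     int main(void) {
        |         uint64_t mismatches = 0;
        |         for (uint64_t i = 0; i <= UINT32_MAX; i++) {
        |             uint32_t bits = (uint32_t)i;
        |             float x;
        |             memcpy(&x, &bits, sizeof x);
        |             if (!isfinite(x)) continue;  /* skip NaN/inf inputs */
        |             float got  = sinf(x);
        |             float want = (float)sin((double)x);  /* stand-in */
        |             if (memcmp(&got, &want, sizeof got) != 0)
        |                 mismatches++;
        |         }
        |         printf("sinf mismatches vs reference: %llu\n",
        |                (unsigned long long)mismatches);
        |         return 0;
        |     }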
        
         | Someone wrote:
         | Apple had a library that did that for 80-bit floating point,
         | for both the 6502 (!) and 68000: SANE (https://en.wikipedia.org
         | /wiki/Standard_Apple_Numerics_Enviro...)
         | 
         | And yes, performance wasn't good, especially compared to the
         | x87 that IBM PCs used. It did claim to round the last bit
         | correctly, though, where early x87s could have large errors
         | (https://randomascii.wordpress.com/2014/10/09/intel-
         | underesti...)
        
         | brandmeyer wrote:
         | > Things are much harder with 64 bits, unfortunately.
         | 
          | Solved and published over 15 years ago. The underlying methods
          | were published over 20 years ago (V. Lefèvre, J.M. Muller, and
          | A. Tisserand, "Towards correctly rounded transcendentals", and
          | Lefèvre's associated PhD thesis).
         | 
         | https://hal-ens-lyon.archives-ouvertes.fr/ensl-01529804/file...
         | 
         | The Correctly Rounded Math Library (CRLIBM) is available for
         | several languages.
        
           | adgjlsfhk1 wrote:
           | CRLIBM is great, but it is significantly slower than
           | alternatives since it requires very high internal precision.
            | The cool thing with this paper is that it gives correct
            | rounding without as much of a cost for 32-bit floats.
        
           | bsder wrote:
           | That's a bit glib. I don't believe that the bounds are fully
           | proven for all the transcendentals (and I believe that
           | exponentiation is only "proven" over restricted subdomains).
           | 
           | Single precision floating point is a small enough domain that
           | it can be fully enumerated--you can prove that everything
            | rounds correctly and find the problematic cases. 24 bits just
            | isn't that large for modern computers.
           | 
           | Double precision floating point is still too large to fully
           | enumerate. So, there could be an exception lurking in there
           | that goes out a long way.
           | 
            | In reality, it seems that rounding requires about 2n + log2 n
            | bits and never as many as 3n bits (I don't _think_ we've ever
            | found an exception to this). However, if you can _prove_
            | that, you should submit it, as it would be a significant
            | advance in the world of mathematics.
        
             | adgjlsfhk1 wrote:
             | In general, there shouldn't be a rule like 3n bits. For
             | random functions, 1 in 8^n values should require more than
              | 3n bits. Since n-bit inputs give 2^n possible values, this
             | implies that for bit length n, there is a 4^-n chance that
             | any given function will require more than 3n bits.
        
               | bsder wrote:
               | Except that this isn't the case. We know empirically
               | because we have enumerated the single precision space
               | completely.
               | 
               | So, the problem isn't bounded by simple probabilities.
               | There are other properties involved that cause it to
               | diverge from the "naive" probabilistic estimates. What
               | those properties are we don't know yet.
               | 
               | The "rounding" problem is called the "Table Maker's
               | Dilemma" and it descends into some pretty fundamental
               | transcendental number theory.
               | 
                | This is not an "easy" problem. The proof that pi is
                | transcendental is relatively modern (Lindemann, 1882).
        
               | adgjlsfhk1 wrote:
                | This doesn't contradict what I said. I said that the odds
                | of a function needing more than 3n bits are 4^-n. For
               | single precision, that means you would expect almost all
               | functions to be OK with 3n bits (1 in 2^46 wouldn't be).
        
         | wumpus wrote:
         | > No more getting slightly different result when changing
         | platform, OS, or even just updating libm
         | 
         | Optimizing compilers are usually used with flags that cause
         | them to break IEEE754.
         | 
          | Still, it would be nice if results from low-optimization runs
          | could be bit-exact.
        
           | titzer wrote:
           | > Optimizing compilers are usually used with flags that cause
           | them to break IEEE754.
           | 
           | For languages that don't mandate IEEE-754. There are plenty
           | of languages that do, like Java, JavaScript, C#, and
           | WebAssembly.
           | 
            | It's worth noting that usually such optimizations are things
            | like algebraic reassociation and commutation of primitives
            | (+, -, /, ==, etc.), fused multiply-add, and others. It's
            | really not clear to me how much disabling all those
            | optimizations costs in the real world.
           | 
            | Things get crazy when compilers want to do major
            | reorganization of loops, stencil optimizations, polyhedral
           | optimizations. I am not an expert here, but I think these
           | rely on properties of float arithmetic that don't hold in all
           | cases, i.e. they rely on UB.
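            | 
            | A tiny illustration of why reassociation has to be opt-in (my
            | own example, nothing language-specific):
            | 
            |     #include <stdio.h>
            | 
            |     int main(void) {
            |         /* FP addition is not associative, so a compiler may
            |            only regroup this under something like
            |            -ffast-math / -fassociative-math. */
            |         volatile double big = 1e100;
            |         printf("%g\n", (1.0 + big) - big);   /* prints 0 */
            |         printf("%g\n", 1.0 + (big - big));   /* prints 1 */
            |         return 0;
            |     }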
        
             | wumpus wrote:
             | > It's really not clear to me how much disabling all those
             | optimization costs in the real world.
             | 
             | I assure you that SPECcpu scores are a lot lower, if that's
             | what you care about. And yes, it's extremely helpful to do
             | major reorganization of loops if the program has multi-
             | dimensional loops in the wrong order.
        
           | thxg wrote:
            | Yes, indeed it is the case with fused multiply-add
            | operations, as I wrote in another thread. By default at -O3,
            | unfortunately, gcc and icc both take it upon themselves to
            | decide whether or not to use FMA instructions when they see
            | (a x b + c) expressions. As a result we can even see slightly
            | different results across compiles!
            | 
            | However, as far as I understand, specifying a proper language
            | standard (for example -std=c99, -std=c++14, etc.) restores
            | sane behavior (see -ffast-math, and more specifically
            | -ffp-contract).
           | 
           | Are there other cases where -O3 breaks IEEE754 on modern
           | setups?
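            | 
            | To make the flag behavior concrete (this reflects my
            | understanding of gcc's documented defaults; the file and
            | function names are made up). Compile with -S and look for a
            | vfmadd instruction in the output:
            | 
            |     /* contract.c -- is a*b + c fused into one FMA? */
            |     double mac(double a, double b, double c) {
            |         return a * b + c;   /* candidate for contraction */
            |     }
            | 
            |     /* gcc -O3 -mfma -S contract.c
            |          GNU C mode: -ffp-contract=fast, gcc may fuse it.
            |        gcc -O3 -mfma -std=c99 -S contract.c
            |          ISO C mode: -ffp-contract=off, no fusing.
            |        gcc -O3 -mfma -ffp-contract=off -S contract.c
            |          The same, stated explicitly. */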
        
             | jcranmer wrote:
             | Fast math flags are actually far more complicated in C/C++
             | compilers than just a simple -ffast-math flag, although I
             | suppose most users may not be acquainted with the degree of
             | control possible. In LLVM, for example, fast math flags are
             | actually a set of 7 independently-toggled flags (no-signed-
             | zero, no-infinity, no-NaN, allow FMAs, allow reassociation,
             | allow approximate reciprocal, and allow approximate
             | functions), that at the C/C++ level can be toggled either
             | by command-line options or on a per-scope basis. ICC
             | actually has two different levels of fast math [1].
             | 
             | The dark underside of IEEE 754 is that full support for it
             | also requires features that most languages don't provide
             | access for: rounding mode and exception handling (aka
             | sticky bits). In C/C++, using these features correctly
             | requires STDC FENV_ACCESS support which is only in ICC and
              | recent clang (neither GCC nor MSVC supports it). And there
              | are usually two hardware bits for handling denormals (flush-to-
             | zero and denormals-are-zero) that aren't in IEEE 754, which
             | tend to be on by default when you're compiling for some
             | targets (most notably GPUs).
             | 
             | [1] https://software.intel.com/content/www/us/en/develop/do
             | cumen... -- note fp-model=fast=[1|2]
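              | 
              | For reference, the rounding-mode part of that (C99 fenv.h;
              | just a sketch, and the pragma is only honored where noted
              | above):
              | 
              |     #include <fenv.h>
              |     #include <stdio.h>
              | 
              |     #pragma STDC FENV_ACCESS ON  /* we touch the FP
              |                                     environment */
              | 
              |     int main(void) {
              |         volatile double x = 1.0, y = 3.0;
              |         fesetround(FE_DOWNWARD);
              |         double lo = x / y;         /* rounded toward -inf */
              |         fesetround(FE_UPWARD);
              |         double hi = x / y;         /* rounded toward +inf */
              |         fesetround(FE_TONEAREST);  /* restore the default */
              |         printf("1/3 lies in [%.17g, %.17g]\n", lo, hi);
              |         return 0;
              |     }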
        
             | wumpus wrote:
             | > Yes, indeed it is the case with fused multiply-add
             | operations, as I wrote below in another thread
             | 
             | This is not always true. Some FMA operations round after
             | the multiply, and some do not.
        
       | brandmeyer wrote:
        | It's unfortunate that they don't mention the prior work of the
       | crlibm project in this domain.
       | 
       | https://gforge.inria.fr/scm/browser.php?group_id=5929&extra=...
       | 
       | Github mirror: https://github.com/taschini/crlibm
       | 
       | Python bindings: https://pypi.org/project/crlibm/
       | 
       | Julia bindings: https://juliapackages.com/p/crlibm
        
         | samth wrote:
         | crlibm is discussed in their paper:
         | https://people.cs.rutgers.edu/~sn349/papers/rlibm32-pldi-202...
        
           | brandmeyer wrote:
           | Based on their press material, you'd think that nobody else
           | has done this before at any precision. You can see the
           | resulting confusion in this very forum. That kind of behavior
           | is disingenuous _at best_.
        
       | GeorgeTirebiter wrote:
       | I've read that, if one is going to use floats, then one should
       | eschew 32-bit floats, and go with 64-bit floats. 80-bit floats
       | seem only useful for very special applications.
       | 
       | I converted a satellite orbital prediction program that used
       | doubles (64-bit floats) to use long doubles (80-bit floats) but
       | only saw a difference in the 11th decimal place, with lots of
       | sin/cos/atan/sqrt involved in the calculations. With 32-bit
       | floats it was a joke: the accumulated error made the
       | 'predictions' useless.
       | 
        | The basic problem with 32-bit is that you have to be quite
        | careful with your arithmetic so you don't lose significance. It's
        | almost as bad as using Q-notation.
       | https://en.wikipedia.org/wiki/Q_(number_format)
        
         | zokier wrote:
         | > I've read that, if one is going to use floats, then one
         | should eschew 32-bit floats, and go with 64-bit floats. 80-bit
         | floats seem only useful for very special applications.
         | 
         | This is what Kahan has to say on the matter:
         | 
         |  _For now the 10-byte Extended format is a tolerable compromise
         | between the value of extra-precise arithmetic and the price of
         | implementing it to run fast; very soon two more bytes of
         | precision will become tolerable, and ultimately a 16-byte
         | format ... That kind of gradual evolution towards wider
         | precision was already in view when IEEE Standard 754 for
         | Floating-Point Arithmetic was framed_
         | 
          | (The quote is apparently from a 1994 unpublished manuscript, so
          | I can't find the source for it.)
         | 
          | More than 25 years later, and we are rarely using even 80-bit
          | floats, never mind 128-bit ones.
        
           | Narishma wrote:
            | Isn't that because 80-bit floats were deprecated a decade or
            | two ago in x86, and were never implemented in other
            | architectures to begin with?
        
         | modeless wrote:
         | 32-bit floats are sufficient and desired for performance in
         | many cases, like graphics or audio. 64-bit floats are not
         | enough in some cases, like galaxy-scale coordinates or
         | nanosecond precision over long time periods or some currency
         | calculations. There's no single best answer, you have to know
         | what your requirements are.
        
         | dynm wrote:
         | The main use for 80-bit floats is probably to confirm that when
         | you switch from 64-bit to 80-bit, nothing changes.
         | 
         | This is a very important use, though!
        
         | dragontamer wrote:
         | Errors grow exponentially in most floating-point operations.
         | 
          | The exception is cancellation error, which can suddenly cause
          | the error to spike dramatically. But if you sort your numbers
          | before adding/subtracting, you can often mitigate this
          | overwhelming cancellation error.
         | 
         | The exponential growth of other floating point errors is almost
         | unstoppable however. The only real way to stop it is to design
         | an algorithm that operates with fewer steps.
         | 
         | ---------
         | 
          | 64-bit has a much smaller initial relative error (2^-53), while
          | 32-bit starts at 2^-23... about 2^30 (roughly a billion) times
          | more error than double precision.
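          | 
          | A small illustration of the ordering effect (my own numbers, in
          | single precision so it shows up quickly):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         volatile float big = 1.0e8f, tiny = 0.5f;
          |         float small_first = 0.0f, big_first = big;
          |         for (int i = 0; i < 1000; i++)
          |             small_first += tiny;
          |         small_first += big;        /* ~1.000005e8 */
          |         for (int i = 0; i < 1000; i++)
          |             big_first += tiny;     /* each 0.5 is absorbed */
          |         printf("small terms first: %.1f\n", small_first);
          |         printf("big term first:    %.1f\n", big_first); /* 1e8 */
          |         return 0;
          |     }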
        
           | vlovich123 wrote:
            | Kahan summation[1] may be better than just doing a sorted sum
            | and doesn't require you to have all the numbers up front to
            | sort (i.e. it works as a running sum). That said, I think
            | even it benefits from sorted order if you do have the full
            | history available, and the Wikipedia article lists a
            | parallelization technique you can leverage in that case. A
            | Klein sum might be better, but I haven't investigated.
           | 
           | EDIT: Apparently there's a nice Rust crate for doing this
           | stuff: https://crates.io/crates/accurate
           | 
           | [1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm#A
           | lte...
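              | 
              | For reference, the classic loop is tiny (a sketch; note that
              | -ffast-math would optimize the compensation away):
              | 
              |     #include <stddef.h>
              | 
              |     /* Kahan (compensated) summation: the running rounding
              |        error is kept in c and fed back into the next term. */
              |     double kahan_sum(const double *x, size_t n) {
              |         double sum = 0.0, c = 0.0;
              |         for (size_t i = 0; i < n; i++) {
              |             double y = x[i] - c;  /* corrected next term    */
              |             double t = sum + y;   /* low bits of y are lost */
              |             c = (t - sum) - y;    /* ...and recovered in c  */
              |             sum = t;
              |         }
              |         return sum;
              |     }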
        
             | adgjlsfhk1 wrote:
             | One thing to consider here is XSum. It's a C library that
             | has surprisingly fast exact floating point sums.
        
             | a1369209993 wrote:
             | Kahan summation is basically just synthesizing a larger
             | floating point type (with 46/106 bits of mantissa for
             | 32/64-bit base type) and then using that to do the sum,
             | though. You still lose precision if your errors exceed that
            | new, larger margin. E.g. try adding 34000 + 56 + .078 - 56 -
            | 34000, carrying two digits per float.
        
         | dev_tty01 wrote:
         | Depends on what you are doing. 8-bit floats are widely used in
         | large neural network applications. (No that is not a typo. I
         | mean 8-bit, not 80-bit.)
        
           | adgjlsfhk1 wrote:
           | Really? I thought that for 8 bits, fixed point formats were
           | much more common.
        
         | alfalfasprout wrote:
          | It really depends on what you're using the floats for. For ML
          | applications? 16-bit floats do the job with some trickery
          | (quantization).
          | 
          | For scientific computing? Generally 64-bit floats give you the
          | needed precision. But if you're chaining a lot of operations
          | that exacerbate floating point error, you may need more.
         | 
         | For usage in trading? Again, depends on what computations
         | you're doing. There's no "hard and fast" rule.
        
       | jart wrote:
       | I'm a libc author and I want to use this. Then I clicked on the
       | repository and there wasn't any code implementing these libm
        | routines. Instead it just had tools, and they told me I need to
        | install a 6 GB Docker container to use them.
        
         | zokier wrote:
         | > Then I clicked on the repository and there wasn't any code
         | implementing these libm routines
         | 
          | What's this then?
         | 
         | https://github.com/rutgers-apl/rlibm-32/tree/main/source/flo...
        
       ___________________________________________________________________
       (page generated 2021-08-26 23:00 UTC)