[HN Gopher] Reciprocal Approximation with 1 Subtraction
       ___________________________________________________________________
        
       Reciprocal Approximation with 1 Subtraction
        
        Today's find: You can get a floating-point approximation of 1/x
        that's accurate to 3 bits with a single integer subtraction
        instruction.
         
            float fast_reciprocal(float x)
            {
                unsigned i = *(unsigned *) &x;
                i = 0x7effffffU - i;
                return *(float *) &i;
            }
         
        The magic number 0x7effffff accomplishes two things:
         
        1) The exponent is calculated as 253-e, which effectively negates
           the exponent and subtracts 1.
         
        2) The mantissa is approximated as a 1st order polynomial in the
           interval [1, 2).
         
        Interesting, but perhaps not very useful (as most CPUs have more
        accurate reciprocal approximations these days).
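         
        A quick worked example of those two effects (a sketch added here,
        not from the original post; it assumes the fast_reciprocal above):
         
            #include <stdio.h>
            #include <string.h>
            
            int main(void)
            {
                float x = 2.0f, y = fast_reciprocal(x);
                unsigned ix, iy;
                memcpy(&ix, &x, sizeof ix);  /* 0x40000000: e = 128           */
                memcpy(&iy, &y, sizeof iy);  /* 0x3effffff: e = 125 = 253-128 */
                printf("bits %08x -> %08x, 1/%g ~ %g\n", ix, iy, x, y);
                return 0;
            }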
        
       Author : mbitsnbites
       Score  : 83 points
        Date   : 2024-12-30 09:26 UTC (1 day ago)
        
       | magicalhippo wrote:
       | Similar trick to that used in the fast inverse square root
       | routine[1] popularized by Quake 3.
       | 
       | [1]: https://en.wikipedia.org/wiki/Fast_inverse_square_root
        
         | mbitsnbites wrote:
         | Yep. Along the same lines. This one is even simpler, though, as
         | it requires only a single integer CPU instruction (and the
         | simplest of all instructions too).
         | 
         | If you want full precision, you need to do three Newton-Raphson
         | iterations after the initial approximation. One iteration is:
         | y = y * (2.0F - x * y);
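          | 
          | Spelled out, the refined version would look something like
          | this (a quick sketch, not rigorously tested):
          | 
          |     float precise_reciprocal(float x)
          |     {
          |         /* ~3-bit initial guess from the bit trick */
          |         unsigned i = *(unsigned *) &x;
          |         i = 0x7effffffU - i;
          |         float y = *(float *) &i;
          |         /* each Newton-Raphson step roughly doubles the
          |            number of correct bits: 3 -> 6 -> 12 -> 24 */
          |         y = y * (2.0F - x * y);
          |         y = y * (2.0F - x * y);
          |         y = y * (2.0F - x * y);
          |         return y;
          |     }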
        
           | magicalhippo wrote:
            | It's a neat trick, and could be very useful on
            | microcontrollers that don't have hardware division but do
            | have hardware multiplication.
        
             | actionfromafar wrote:
             | Hm, like 68000?
        
         | fph wrote:
         | For some more explanation: the main idea behind both tricks is
         | that the IEEE floating point formats are designed so that the
         | most significant bits represent its exponent, that is,
         | floor(log2(x)). Hence reinterpret-casting a float x to an
         | unsigned integer uint(x) approximates a multiple of log2(x). So
         | these kinds of approximations work like logarithm arithmetic:
         | float(C + a*uint(x)) approximates x^a, for a suitable constant
         | C. Quake's invsqrt is a=-1/2, this post is a=-1.
         | 
         | More detail on
         | https://en.wikipedia.org/wiki/Fast_inverse_square_root#Alias...
         | .
         | 
          | IEEE 754 floats are a very well-designed binary format, and
          | the fact that these approximations are possible is part of
          | this design; indeed, the first known instance of this trick is
          | for a=1/2 by W. Kahan, the main designer of IEEE 754.
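          | 
          | In code, the general pattern looks roughly like this (a sketch
          | for normal, positive x; the bias constant K is a tuning
          | choice, and the post's 0x7effffff corresponds to K ~
          | 0x3F800000):
          | 
          |     #include <stdint.h>
          |     #include <string.h>
          |     
          |     /* logarithm arithmetic on the bit pattern:
          |        i_y ~ a*i_x + (1-a)*K  approximates  y = x^a        */
          |     static float approx_pow(float x, float a)
          |     {
          |         const float K = 1065353216.0f;   /* 0x3F800000     */
          |         uint32_t ix, iy;
          |         float y;
          |         memcpy(&ix, &x, sizeof ix);
          |         iy = (uint32_t) (a * (float) ix + (1.0f - a) * K);
          |         memcpy(&y, &iy, sizeof y);
          |         return y;
          |     }
          |     /* approx_pow(x, -1.0f) ~ 1/x       (this post)
          |        approx_pow(x, -0.5f) ~ 1/sqrt(x) (Quake, unrefined) */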
        
       | torusle wrote:
       | There are couple of tricks you can do if you fiddle with the bits
       | of a floating point value using integer arithmetic and binary
       | logic.
       | 
       | That was a thing back in the 90th..
       | 
       | I wonder how hard the performance hit from moving values between
       | integer and float pipeline is nowadays.
       | 
       | Last time I looked into that was the Cortex-A8 (first I-Phone
       | area). Doing that kind of trick costed around 26 cycles (back and
       | forth) due to pipeline stalls back then.
        
         | stephencanon wrote:
         | There are basic integer operations in the FP/SIMD units on most
          | CPUs, so there's generally no need to "move back and forth"
         | unless you need to branch on the result of a comparison, use a
         | value as an address, or do some more specialized arithmetic.
        
           | stephencanon wrote:
           | (For that matter, though, most modern FP/SIMD units have a
           | direct approximate-reciprocal instruction that is single-
           | cycle throughput or better and much more accurate--generally
           | around 10-12 bits, so there's no need for this sort of thing.
           | See, e.g. FRECPE on ARM NEON and [V]RCPP[S/D] on x86.)
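            | 
            | On x86 that is exposed through SSE intrinsics, e.g. (a small
            | sketch; RCPSS gives roughly 12 bits):
            | 
            |     #include <xmmintrin.h>
            |     
            |     /* hardware approximate reciprocal via RCPSS */
            |     static float hw_reciprocal(float x)
            |     {
            |         return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
            |     }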
        
         | ack_complete wrote:
         | These kinds of tricks are still used today. They're not so
         | useful if you need a reciprocal or square root, since CPUs now
         | have dedicated hardware for that, but it's different if you
         | need a _cube_ root or x^(1/2.4).
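          | 
          | For instance, a cube-root seed can be built the same way (a
          | sketch; the additive constant here is simply (2/3) *
          | 0x3F800000, and published variants tune it slightly):
          | 
          |     float fast_cbrt(float x)            /* x > 0 */
          |     {
          |         unsigned i = *(unsigned *) &x;
          |         i = i / 3 + 710235477u;         /* ~ (2/3) * 0x3F800000 */
          |         return *(float *) &i;           /* refine with Newton   */
          |     }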
        
           | mitthrowaway2 wrote:
           | I wonder to what extent the dedicated hardware is essentially
           | implementing the same steps but at the transistor level.
        
       | Y_Y wrote:
       | Quick comparison with exact and one Newton-Raphson:
        | Value |  True  |  Fast  | Fast+Newton
        | ----------------------------------------
        |   0.1 | 10.000 | 11.200 |  9.8560
        |   0.5 | 2.0000 | 2.0000 |  2.0000
        |   1.0 | 1.0000 | 1.0000 |  1.0000
        |   2.0 | 0.5000 | 0.5000 |  0.5000
        |   5.0 | 0.2000 | 0.2187 |  0.1982
        |  10.0 | 0.1000 | 0.1094 |  0.0991
        | 
        | (where the extra correction was done with: y *= (2.0f - x*y); )
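        | 
        | A sketch that reproduces a table like this, assuming the post's
        | fast_reciprocal:
        | 
        |     #include <stdio.h>
        |     
        |     int main(void)
        |     {
        |         float xs[] = { 0.1f, 0.5f, 1.0f, 2.0f, 5.0f, 10.0f };
        |         for (int k = 0; k < 6; k++) {
        |             float x = xs[k];
        |             float y = fast_reciprocal(x);   /* fast            */
        |             float z = y * (2.0f - x * y);   /* plus one Newton */
        |             printf("%6.1f | %.4f | %.4f | %.4f\n", x, 1.0f / x, y, z);
        |         }
        |         return 0;
        |     }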
        
       | pkhuong wrote:
       | You can do a little better if you tweak the low order bits
       | (https://pvk.ca/Blog/LowLevel/software-reciprocal.html for the
       | double float version)
        
       | jcmeyrignac wrote:
       | A better value could be 0x7EEEEBB3, as in:
       | https://github.com/parallella/pal/blob/bd9389f14fe16db4d9630...
        
       | dahart wrote:
       | This code is technically UB in C++, right? [1] Has anyone run
       | into a case where it actually didn't work? Just curious. I've
       | often assumed that if C++ compilers didn't compile this code, all
       | hell would break loose.
       | 
       | It might be nice to start sharing modern/safe versions of this
       | snippet & the Quake thing. Is using memcpy the only option that
       | is safe in both C and C++? That always felt really awkward to me.
       | 
       | [1]
       | https://tttapa.github.io/Pages/Programming/Cpp/Practices/typ...
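        | 
        | For reference, a memcpy-based version that is valid in both C
        | and C++ would look something like this (a sketch; compilers are
        | normally expected to optimise the copies away):
        | 
        |     #include <string.h>
        |     
        |     float fast_reciprocal(float x)
        |     {
        |         unsigned i;
        |         memcpy(&i, &x, sizeof i);   /* well-defined type punning */
        |         i = 0x7effffffU - i;
        |         memcpy(&x, &i, sizeof x);
        |         return x;
        |     }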
        
         | tekknolagi wrote:
         | We used memcpy everywhere in our runtime and after the 10th or
         | so time doing it, it becomes less awkward.
        
           | dahart wrote:
           | And it's always reliably optimized out in release builds, I
           | assume?
        
             | LegionMammal978 wrote:
              | On platforms that require aligned loads and stores (not x86
             | nor ARM), a direct pointer cast sometimes uses an aligned
             | load/store where a memcpy uses multiple byte loads/stores,
             | even on a good compiler, since memcpy() doesn't require
             | that the pointers are aligned. This can be mitigated by
             | going through a local variable, but it gets pretty verbose.
        
               | stouset wrote:
               | Sounds like a good place for a macro?
        
               | ack_complete wrote:
               | Some ARM CPUs do require aligned loads and stores, such
               | as the Cortex-M0+ in a Raspberry Pi Pico.
        
             | Asooka wrote:
             | For MSVC you have to add "/Oi", otherwise it is always a
             | function call at lower optimisation levels. Clang and GCC
             | treat it as an intrinsic always, even in debug builds.
        
             | tekknolagi wrote:
             | I haven't manually checked every case but it's normally
             | folded into the load or shift or whatever and completely
             | erased
        
         | LegionMammal978 wrote:
         | Yes, most casts via pointers are formally UB in both C and C++,
         | and you can induce weird behavior by compilers if you transmit
         | casted pointers through a function boundary (see [0] for a
         | standard example: notice how the write to *pf vanishes into
         | thin air). Since people like doing it in practice, GCC and
         | Clang have an -fno-strict-aliasing flag to disable this
         | optimization, and the MSVC compiler doesn't use it in the first
         | place (except for restrict pointers in C). They don't go too
         | far with it regardless, since lots of code using the POSIX
         | socket API casts around struct pointers as a matter of course.
         | 
         | Apart from memcpy(), the 'allowed' methods include unions in C
         | (writing to one member and reading from another), and
         | bit_cast<T>() and std::start_lifetime_as<T>() in C++.
         | 
         | [0] https://godbolt.org/z/dxMMfazoq
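          | 
          | The union route in C looks like this (a sketch; in C++ the
          | analogous well-defined route is std::bit_cast<unsigned>(x)
          | from <bit>, C++20):
          | 
          |     /* writing one union member and reading another is
          |        allowed in C */
          |     static unsigned float_bits(float x)
          |     {
          |         union { float f; unsigned u; } v;
          |         v.f = x;
          |         return v.u;
          |     }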
        
           | ckjellqv wrote:
           | It should be well-defined with reinterpret_cast<uint32_t&>
           | though.
        
             | LegionMammal978 wrote:
             | No, reinterpret_cast doesn't change the type of the
             | underlying object. The rule in [basic.lval]/11 (against
             | accessing casted values) applies to all glvalues, whether
             | they come from pointers or references.
        
           | rjeli wrote:
           | For C++, `bit_cast<uint32_t>(0.f)` should be Well Defined,
           | right? I'm curious, in C, is union-casting float->uint32_t
           | also Perfectly Legal And Well Defined?
           | 
           | (I am not a C or C++ expert.)
        
         | Asooka wrote:
         | It would be better if compilers were fixed. Current treatment
         | of UB is insane. The default should be no UB and a pragma that
          | lets you turn it on for specific parts. That used to be the
          | case, and some of the world's most performant software was
          | written for compilers that didn't assume strict aliasing or
          | the absence of integer overflow (case in point: Quake).
         | 
         | The big problem, apart from "memcpy is a bit tedious to write",
         | is that it is impossible to guarantee absence of UB in a large
         | program. You will get inlined functions from random headers and
         | suddenly someone somewhere is "dereferencing" a pointer by
         | storing it as a reference, and then your null pointer checks
         | are deleted. Or an integer overflows and your "i>=0" array
         | bounds assert is deleted. I have seen that happen several times
         | and each time the various bits of code all look innocent enough
         | until you sit down and thoroughly read each function and the
         | functions it calls and the functions they call. So we compile
         | with the worst UB turned off (null is a valid address, integers
         | can overflow, pointers can alias, enums can store any integer)
         | and honestly, we cannot see a measurable difference in
         | performance.
         | 
         | UB as it is currently implemented is simply an enormous
          | footgun. It would be a lot more useful if there were a
          | profiler that could find the parts of the code which would
          | benefit from UB optimisations. Then we could focus on making
          | those parts UB-
         | safe, add asserts for the debug/testing builds, and turn on
         | more optimisations for them. I am quite certain nobody can
         | write a fully UB-free program. For example, did you know that
         | multiplying two "unsigned short" variables is UB? Can you
         | guarantee that across all the template instantiations for
         | custom vector types you've written you never multiply unsigned
         | shorts? I'll leave it as an exercise to the reader as to why
         | that is.
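          | 
          | (The unsigned short case, sketched below: with 16-bit shorts
          | and 32-bit ints, both operands are promoted to signed int
          | before the multiplication, and the product can overflow it.)
          | 
          |     unsigned mul_ushort(unsigned short a, unsigned short b)
          |     {
          |         /* UB when a = b = 65535: 65535 * 65535 > INT_MAX,
          |            so the promoted (signed int) multiply overflows */
          |         return a * b;
          |     }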
        
           | imtringued wrote:
           | There is no point in talking with the people that worship UB
           | for the sake of UB.
           | 
           | They don't understand that one of the biggest barriers to
           | developers writing and adopting more C in their projects is
           | the random jankiness that you get from the compilers. Instead
           | they make C this elite thing for the few people who have read
           | every single line of C code they themselves compiled and ran
           | on their Gentoo installation. Stuff like having no bounds
           | checks is almost entirely irrelevant outside of compute
           | kernels. It doesn't get you much performance, because the
           | branches are perfectly predictable. It merely reduces code
           | size.
           | 
           | There is also the problem that the C development culture is
            | utterly backwards compared even to the semiconductor
           | industry. If you want to have these ultra optimized release
           | builds, then your development builds must scream when they
           | encounter UB and it also means that no C program or library
           | without an extensive test suite should be allowed onto any
           | Linux distribution's package repository. Suddenly the cost of
           | C programming appears to be unaffordably high.
        
         | Conscat wrote:
          | A few years ago with GCC 10, I had some strange UD2
          | instructions (I think) inserted around some inline assembly;
          | that was fixed by replacing my `reinterpret_cast`s with
          | `bit_cast`.
        
         | Earw0rm wrote:
         | Yes, if implemented as shown.
         | 
         | You could use intrinsics for the bit casting & so on and it
         | would be well-defined to the extent that those intrinsics are.
         | 
         | (I understand some people consider SIMD intrinsics in general
         | to be UB at language level, in part because they let you switch
         | between floats & ints in a hardware-specific way.)
        
       | AlotOfReading wrote:
        | A better value is 0x7eb504f3. You can follow that up with
        | Newton-Raphson to refine the approximation, or approximate the
        | Newton-Raphson too by multiplying 1.385356f*x.
       | 
       | I did this a few days ago in the approximate division thread:
       | https://news.ycombinator.com/item?id=42481612#42489596
       | 
       | In some cases, it can be faster than hardware division.
        
       | quanto wrote:
        | A quick intuition: the magic number 7e ffffff negates both the
        | mantissa and the exponent by two's complement.
        | 
        | 1. 7e: the leading bit has to be zero to preserve the sign of
        | the overall number.
        | 
        | 2. ffffff: by the Taylor series 1/(1 + X) = 1 - X + ...,
        | negating the mantissa approximates the multiplicative inverse.
       | 
        | An IEEE float has a sign bit (thus 2's complement does not work
        | on the float itself), but the mantissa and the exponent
        | individually work with 2's complement, for different reasons.
        | The exponent has a biased range due to being chopped in half,
        | whereas the mantissa has a biased range due to its definition
        | and constant offset. The exponent's 1 offset (7e instead of 7f)
        | is a bit more difficult to see at first -- the devil is in the
        | details, but 2's complement is the basic intuition.
        
       | somat wrote:
        | Whenever I see these tricks (see also: the Quake 3 fast inverse
        | sqrt) that involve not converting but directly using integers
        | as floats and floats as integers, I wonder if there is a way to
        | do it without the jank.
        | 
        | Because what do you really want? Some sort of exponent math,
        | right? Some variant of the log function should work. Is the
        | problem that all the log functions are gated behind the
        | function call interface, whereas the subtract function is less
        | heavy, being behind the operator interface? Or are they trying
        | to use a floating-point-accelerated log, aka floating point
        | subtract?
        
         | brudgers wrote:
          | Engineering mathematical calculations always look like
          | "jank." Take a look at Plauger's _The C Standard Library_. It
          | is full of "magic" numbers.
         | 
         | But then again, representing irrational numbers in binary is
         | inherently janky.
        
       | kkkqkqkqkqlqlql wrote:
       | Where is the obligatory "WHAT THE FUCK" comment?
        
       ___________________________________________________________________
       (page generated 2024-12-31 23:01 UTC)