[HN Gopher] Reciprocal Approximation with 1 Subtraction
___________________________________________________________________
Reciprocal Approximation with 1 Subtraction
Today's find: You can get a floating-point approximation of 1/x
that's accurate to 3 bits with a single integer subtraction
instruction.

    float fast_reciprocal(float x) {
        unsigned i = *(unsigned *) &x;
        i = 0x7effffffU - i;
        return *(float *) &i;
    }

The magic number 0x7effffff accomplishes two things: 1) The
exponent is calculated as 253-e, which effectively negates the
exponent and subtracts 1. 2) The mantissa is approximated as a
1st order polynomial in the interval [1, 2). Interesting, but
perhaps not very useful (as most CPUs have more accurate
reciprocal approximations these days).
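
A minimal sketch of the same routine written with memcpy instead of
pointer casts (the aliasing question comes up further down in the
thread), with the constant's fields spelled out:

    #include <string.h>

    /* 0x7effffff = sign 0 | exponent 0xFD (253) | mantissa all ones.
     * Subtracting the input's bits gives an exponent of 253 - e
     * (negate and subtract 1) and a mantissa that is a 1st order
     * approximation of the reciprocal on [1, 2). Assumes 32-bit
     * unsigned and IEEE-754 binary32, like the original. */
    float fast_reciprocal_memcpy(float x) {
        unsigned i;
        memcpy(&i, &x, sizeof i);    /* reinterpret the float's bits */
        i = 0x7effffffU - i;         /* the single integer subtraction */
        memcpy(&x, &i, sizeof x);    /* back to float */
        return x;
    }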
Author : mbitsnbites
Score : 83 points
Date : 2024-12-30 09:26 UTC (1 days ago)
| magicalhippo wrote:
| Similar trick to that used in the fast inverse square root
| routine[1] popularized by Quake 3.
|
| [1]: https://en.wikipedia.org/wiki/Fast_inverse_square_root
| mbitsnbites wrote:
| Yep. Along the same lines. This one is even simpler, though, as
| it requires only a single integer CPU instruction (and the
| simplest of all instructions too).
|
| If you want full precision, you need to do three Newton-Raphson
| iterations after the initial approximation. One iteration is:
| y = y * (2.0F - x * y);
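|
| A minimal sketch of how that refinement could be wired up in C
| (fast_reciprocal is the routine from the post; three iterations
| is the count quoted above, each roughly doubling the number of
| correct bits):
|
|     float precise_reciprocal(float x) {
|         float y = fast_reciprocal(x);   /* ~3-bit initial guess */
|         y = y * (2.0F - x * y);         /* ~6 bits  */
|         y = y * (2.0F - x * y);         /* ~12 bits */
|         y = y * (2.0F - x * y);         /* ~24 bits */
|         return y;
|     }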
| magicalhippo wrote:
| It's a neat trick, and could be very useful on
| microcontrollers which don't have hardware division but
| do have hardware multiplication.
| actionfromafar wrote:
| Hm, like 68000?
| fph wrote:
| For some more explanation: the main idea behind both tricks is
| that the IEEE floating point formats are designed so that the
| most significant bits represent its exponent, that is,
| floor(log2(x)). Hence reinterpret-casting a float x to an
| unsigned integer uint(x) approximates a multiple of log2(x). So
| these kinds of approximations work like logarithm arithmetic:
| float(C + a*uint(x)) approximates x^a, for a suitable constant
| C. Quake's invsqrt is a=-1/2, this post is a=-1.
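|
| As a rough sketch of that pattern in C (the two constants are the
| one from this post and the well-known Quake one; memcpy stands in
| for the reinterpret-cast):
|
|     #include <string.h>
|
|     /* float(C + a*uint(x)) ~ x^a; below, a = -1 and a = -1/2 */
|     float approx_recip(float x) {            /* ~ x^-1   */
|         unsigned i; memcpy(&i, &x, sizeof i);
|         i = 0x7effffffU - i;                 /* C - uint(x)   */
|         memcpy(&x, &i, sizeof x); return x;
|     }
|     float approx_rsqrt(float x) {            /* ~ x^-1/2 */
|         unsigned i; memcpy(&i, &x, sizeof i);
|         i = 0x5f3759dfU - (i >> 1);          /* C - uint(x)/2 */
|         memcpy(&x, &i, sizeof x); return x;
|     }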
|
| More detail on
| https://en.wikipedia.org/wiki/Fast_inverse_square_root#Alias...
| .
|
| IEEE754 floats are a very well-designed binary format, and the
| fact that these approximations are possible is part of this
| design; indeed, the first known instance of this trick is for
| a=1/2 by W. Kahan, the main designer of IEEE754.
| torusle wrote:
| There are a couple of tricks you can do if you fiddle with the bits
| of a floating point value using integer arithmetic and binary
| logic.
|
| That was a thing back in the '90s...
|
| I wonder how hard the performance hit from moving values between
| integer and float pipeline is nowadays.
|
| Last time I looked into that was the Cortex-A8 (first iPhone
| era). Doing that kind of trick cost around 26 cycles (back and
| forth) due to pipeline stalls back then.
| stephencanon wrote:
| There are basic integer operations in the FP/SIMD units on most
| CPUs, so there's generally no need to "move back and forth"
| unless you need to branch on the result of a comparison, use a
| value as an address, or do some more specialized arithmetic.
| stephencanon wrote:
| (For that matter, though, most modern FP/SIMD units have a
| direct approximate-reciprocal instruction that is single-
| cycle throughput or better and much more accurate--generally
| around 10-12 bits, so there's no need for this sort of thing.
| See, e.g. FRECPE on ARM NEON and [V]RCPP[S/D] on x86.)
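|
| For comparison, a small sketch of what leaning on that hardware
| looks like with the SSE intrinsics on x86 (one Newton-Raphson
| step takes the ~12-bit RCPSS estimate close to full single
| precision):
|
|     #include <xmmintrin.h>
|
|     float hw_reciprocal(float x) {
|         __m128 vx = _mm_set_ss(x);
|         __m128 vy = _mm_rcp_ss(vx);          /* RCPSS: ~12-bit 1/x */
|         /* one Newton-Raphson step: y = y * (2 - x*y) */
|         vy = _mm_mul_ss(vy, _mm_sub_ss(_mm_set_ss(2.0f),
|                                        _mm_mul_ss(vx, vy)));
|         return _mm_cvtss_f32(vy);
|     }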
| ack_complete wrote:
| These kinds of tricks are still used today. They're not so
| useful if you need a reciprocal or square root, since CPUs now
| have dedicated hardware for that, but it's different if you
| need a _cube_ root or x^(1/2.4).
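|
| The generalization is the same exponent trick with a scale on the
| bits; a rough, untuned sketch for x^(1/2.4) (the sRGB gamma
| exponent) might look like this -- the constant here is just
| (1 - p) * 0x3F800000 for p = 1/2.4, not optimized for accuracy:
|
|     #include <string.h>
|
|     /* bits(x^p) ~ p*bits(x) + (1-p)*0x3F800000, for x > 0 */
|     float fast_pow_gamma(float x) {          /* ~ x^(1/2.4) */
|         unsigned i; memcpy(&i, &x, sizeof i);
|         i = (unsigned)(i * (1.0 / 2.4)) + 0x250AAAABu;
|         memcpy(&x, &i, sizeof x);
|         return x;
|     }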
| mitthrowaway2 wrote:
| I wonder to what extent the dedicated hardware is essentially
| implementing the same steps but at the transistor level.
| Y_Y wrote:
| Quick comparison with exact and one Newton-Raphson:
|       Value |  True   |  Fast   | Fast+Newton
|       ----------------------------------------
|        0.1  | 10.000  | 11.200  |   9.8560
|        0.5  |  2.0000 |  2.0000 |   2.0000
|        1.0  |  1.0000 |  1.0000 |   1.0000
|        2.0  |  0.5000 |  0.5000 |   0.5000
|        5.0  |  0.2000 |  0.2187 |   0.1982
|       10.0  |  0.1000 |  0.1094 |   0.0991
|
| (where the extra correction was done with: y *=
| (2.0f - x*y); )
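|
| A small harness along these lines (using a memcpy variant of the
| routine from the post) would be something like:
|
|     #include <stdio.h>
|     #include <string.h>
|
|     static float fast_reciprocal(float x) {
|         unsigned i;
|         memcpy(&i, &x, sizeof i);
|         i = 0x7effffffU - i;
|         memcpy(&x, &i, sizeof x);
|         return x;
|     }
|
|     int main(void) {
|         const float vals[] = {0.1f, 0.5f, 1.0f, 2.0f, 5.0f, 10.0f};
|         for (unsigned k = 0; k < 6; k++) {
|             float x = vals[k], y = fast_reciprocal(x);
|             printf("%6.1f | %8.4f | %8.4f | %8.4f\n",
|                    x, 1.0f / x, y, y * (2.0f - x * y));
|         }
|         return 0;
|     }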
| pkhuong wrote:
| You can do a little better if you tweak the low order bits
| (https://pvk.ca/Blog/LowLevel/software-reciprocal.html for the
| double-precision version)
| jcmeyrignac wrote:
| A better value could be 0x7EEEEBB3, as in:
| https://github.com/parallella/pal/blob/bd9389f14fe16db4d9630...
| dahart wrote:
| This code is technically UB in C++, right? [1] Has anyone run
| into a case where it actually didn't work? Just curious. I've
| often assumed that if C++ compilers didn't compile this code, all
| hell would break loose.
|
| It might be nice to start sharing modern/safe versions of this
| snippet & the Quake thing. Is using memcpy the only option that
| is safe in both C and C++? That always felt really awkward to me.
|
| [1]
| https://tttapa.github.io/Pages/Programming/Cpp/Practices/typ...
| tekknolagi wrote:
| We used memcpy everywhere in our runtime and after the 10th or
| so time doing it, it becomes less awkward.
| dahart wrote:
| And it's always reliably optimized out in release builds, I
| assume?
| LegionMammal978 wrote:
| On platforms that require aligned loads and stores (not x86
| nor ARM), a direct pointer cast sometimes uses an aligned
| load/store where a memcpy uses multiple byte loads/stores,
| even on a good compiler, since memcpy() doesn't require
| that the pointers are aligned. This can be mitigated by
| going through a local variable, but it gets pretty verbose.
| stouset wrote:
| Sounds like a good place for a macro?
| ack_complete wrote:
| Some ARM CPUs do require aligned loads and stores, such
| as the Cortex-M0+ in a Raspberry Pi Pico.
| Asooka wrote:
| For MSVC you have to add "/Oi", otherwise it is always a
| function call at lower optimisation levels. Clang and GCC
| treat it as an intrinsic always, even in debug builds.
| tekknolagi wrote:
| I haven't manually checked every case but it's normally
| folded into the load or shift or whatever and completely
| erased
| LegionMammal978 wrote:
| Yes, most casts via pointers are formally UB in both C and C++,
| and you can induce weird behavior by compilers if you transmit
| casted pointers through a function boundary (see [0] for a
| standard example: notice how the write to *pf vanishes into
| thin air). Since people like doing it in practice, GCC and
| Clang have an -fno-strict-aliasing flag to disable this
| optimization, and the MSVC compiler doesn't use it in the first
| place (except for restrict pointers in C). They don't go too
| far with it regardless, since lots of code using the POSIX
| socket API casts around struct pointers as a matter of course.
|
| Apart from memcpy(), the 'allowed' methods include unions in C
| (writing to one member and reading from another), and
| std::bit_cast<T>() and std::start_lifetime_as<T>() in C++.
|
| [0] https://godbolt.org/z/dxMMfazoq
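|
| For the C side, a minimal sketch of the union route mentioned
| above (reading the other member is explicitly allowed in C):
|
|     float fast_reciprocal_union(float x) {
|         union { float f; unsigned u; } v = { .f = x };
|         v.u = 0x7effffffU - v.u;   /* type-pun through the union */
|         return v.f;                /* assumes 32-bit unsigned and
|                                       IEEE float, as the original does */
|     }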
| ckjellqv wrote:
| It should be well-defined with reinterpret_cast<uint32_t&>
| though.
| LegionMammal978 wrote:
| No, reinterpret_cast doesn't change the type of the
| underlying object. The rule in [basic.lval]/11 (against
| accessing casted values) applies to all glvalues, whether
| they come from pointers or references.
| rjeli wrote:
| For C++, `bit_cast<uint32_t>(0.f)` should be Well Defined,
| right? I'm curious, in C, is union-casting float->uint32_t
| also Perfectly Legal And Well Defined?
|
| (I am not a C or C++ expert.)
| Asooka wrote:
| It would be better if compilers were fixed. Current treatment
| of UB is insane. The default should be no UB and a pragma that
| lets you turn it on for specific parts. That used to be the
| case and some of the world's most performant software was
| written for compilers that didn't assume strict aliasing or the
| absence of integer overflow (case in point: Quake).
|
| The big problem, apart from "memcpy is a bit tedious to write",
| is that it is impossible to guarantee absence of UB in a large
| program. You will get inlined functions from random headers and
| suddenly someone somewhere is "dereferencing" a pointer by
| storing it as a reference, and then your null pointer checks
| are deleted. Or an integer overflows and your "i>=0" array
| bounds assert is deleted. I have seen that happen several times
| and each time the various bits of code all look innocent enough
| until you sit down and thoroughly read each function and the
| functions it calls and the functions they call. So we compile
| with the worst UB turned off (null is a valid address, integers
| can overflow, pointers can alias, enums can store any integer)
| and honestly, we cannot see a measurable difference in
| performance.
|
| UB as it is currently implemented is simply an enormous
| footgun. It would be a lot more useful if there were a profiler
| that could find parts of the code, which would benefit from UB
| optimisations. Then we could focus on making those parts UB-
| safe, add asserts for the debug/testing builds, and turn on
| more optimisations for them. I am quite certain nobody can
| write a fully UB-free program. For example, did you know that
| multiplying two "unsigned short" variables is UB? Can you
| guarantee that across all the template instantiations for
| custom vector types you've written you never multiply unsigned
| shorts? I'll leave it as an exercise to the reader as to why
| that is.
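|
| For the curious, the "unsigned short" example comes down to
| integer promotion; a minimal sketch of the hazard, assuming the
| common 16-bit short / 32-bit int ABI:
|
|     unsigned mul_bad(unsigned short a, unsigned short b) {
|         return a * b;             /* a and b promote to (signed) int;
|                                      65535 * 65535 overflows a 32-bit
|                                      int, which is UB */
|     }
|     unsigned mul_ok(unsigned short a, unsigned short b) {
|         return (unsigned)a * b;   /* force the multiply into unsigned
|                                      arithmetic, which wraps instead */
|     }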
| imtringued wrote:
| There is no point in talking with the people that worship UB
| for the sake of UB.
|
| They don't understand that one of the biggest barriers to
| developers writing and adopting more C in their projects is
| the random jankiness that you get from the compilers. Instead
| they make C this elite thing for the few people who have read
| every single line of C code they themselves compiled and ran
| on their Gentoo installation. Stuff like having no bounds
| checks is almost entirely irrelevant outside of compute
| kernels. It doesn't get you much performance, because the
| branches are perfectly predictable. It merely reduces code
| size.
|
| There is also the problem that the C development culture is
| utterly backwards compared even to the semiconductor
| industry. If you want to have these ultra optimized release
| builds, then your development builds must scream when they
| encounter UB and it also means that no C program or library
| without an extensive test suite should be allowed onto any
| Linux distribution's package repository. Suddenly the cost of
| C programming appears to be unaffordably high.
| Conscat wrote:
| A few years ago with GCC 10, I had some strange UD2
| instructions (I think) inserted around some inline assembly
| that was fixed by replacing my `reinterpret_cast`s with
| `bit_cast`.
| Earw0rm wrote:
| Yes, if implemented as shown.
|
| You could use intrinsics for the bit casting & so on and it
| would be well-defined to the extent that those intrinsics are.
|
| (I understand some people consider SIMD intrinsics in general
| to be UB at language level, in part because they let you switch
| between floats & ints in a hardware-specific way.)
| AlotOfReading wrote:
| A better value is 0x7eb504f3. You can follow that up with Newton-
| Raphson to refine the approximation, or approximate the Newton-
| Raphson too by multiplying 1.385356f*x.
|
| I did this a few days ago in the approximate division thread:
| https://news.ycombinator.com/item?id=42481612#42489596
|
| In some cases, it can be faster than hardware division.
| quanto wrote:
| A quick intuition: the magic number 7e ffffff negates both the
| mantissa and the exponent by two's complement.
|
| 1. 7e: the sign bit has to be zero to preserve the sign of
| the overall number.
|
| 2. ffffff: due to Taylor series 1/(1 + X) = 1 - X ..., negating
| gives the multiplicative inverse.
|
| An IEEE float has a sign bit (so 2's complement does not work on
| the float as a whole), but the mantissa and the exponent
| individually work with 2's complement, for different reasons. The
| exponent has a biased range due to being chopped in half, whereas
| the mantissa has a biased range due to its definition and constant
| offset. The exponent's 1 offset (7e instead of 7f) is a bit more
| difficult to see at first -- the devil is in the details, but 2's
| complement is the basic intuition.
| somat wrote:
| Whenever I see these tricks (see also: the Quake 3 fast inverse
| sqrt) involving using, not converting, but reinterpreting
| integers as floats directly and floats as integers, I wonder if
| there is a way to do it without the jank.
|
| Because what do you really want? Some sort of exponent math,
| right? Some variant of the log function should work. Is the
| problem that all the log functions are gated behind the function
| call interface, whereas the subtract function is less heavy,
| being behind the operator interface? Or are they trying to use a
| floating-point-accelerated log, a.k.a. a floating point subtract?
| brudgers wrote:
| Engineering mathematical calculations always look like "jank."
| Take a look at Plauger's _The C Standard Library_. It is full
| of "magic" numbers.
|
| But then again, representing irrational numbers in binary is
| inherently janky.
| kkkqkqkqkqlqlql wrote:
| Where is the obligatory "WHAT THE FUCK" comment?
___________________________________________________________________
(page generated 2024-12-31 23:01 UTC)