[HN Gopher] Emulating the FMAdd Instruction, Part 1: 32-bit Floats
       ___________________________________________________________________
        
       Emulating the FMAdd Instruction, Part 1: 32-bit Floats
        
       Author : TazeTSchnitzel
       Score  : 32 points
       Date   : 2025-01-02 08:37 UTC (14 hours ago)
        
 (HTM) web link (drilian.com)
 (TXT) w3m dump (drilian.com)
        
       | msk-lywenn wrote:
       | I think the link should be:
       | https://drilian.com/posts/2024.12.31-emulating-the-fmadd-ins...
        
       | RossBencina wrote:
       | I would love to know how the author tested/verified that the code
       | is correct.
        
         | eapriv wrote:
         | It doesn't even compile (the last code snippet uses some
         | undefined variables).
        
         | stephencanon wrote:
         | The _algorithm_ is certainly correct (this is a relatively
         | straightforward and well-known theorem of floating-point
         | arithmetic, see e.g. "Handbook of Floating-Point Arithmetic" by
         | JM Muller et al).
         | 
         | Whether or not the code presented is a faithful translation of
         | that algorithm is more subtle and somewhat dependent on the
         | environment; the AddWithError operation requires that no excess
         | precision is used (FLT_EVAL_METHOD == 0), which is the norm for
         | compilers today but not quite universal, especially for
         | compilers that target 32b x86. The `LowestBitOfMantissa`,
         | `AddOneBitToMantissa` and `SubtractOneBitFromMantissa`
         | operations are not implemented in the post--those aren't hard
         | to implement, but doing it without invoking undefined behavior
         | can be subtle (and what's actually permitted is a critical
         | distinction between C and C++).
         | 
         | It also may not handle some non-finite cases correctly--when
         | the result is infinite or NaN, the error term as computed will
         | be NaN, which will pass `errorTerm != 0.0 &&
         | LowestBitOfMantissa(value) == 0`, which will cause a bit to be
         | subtracted from the result, rounding infinity back to a finite
         | value. That will overflow back to infinity on the final
         | conversion in default rounding, but in directed rounding modes
         | will give the wrong answer. It also will get the FP flags wrong
         | in some cases, which doesn't matter much for most use, but
         | means you couldn't use it as a C standard library
         | implementation.
         | 
         | A fully correct implementation that handles all these small
         | details isn't really any more complicated, and can be written
         | in ~20 lines of C without any code golfing (I wrote Apple's
         | libm fma[f] for targets that do not have hardware support). The
         | "hard" part is writing the thousands and thousands of tests
         | that allow you to have some confidence in the implementation
         | details--this can largely be automated, but it's definitely the
         | harder side of the task.
        
       ___________________________________________________________________
       (page generated 2025-01-02 23:02 UTC)