[HN Gopher] Myths About Floating-Point Numbers (2021)
       ___________________________________________________________________
        
       Myths About Floating-Point Numbers (2021)
        
       Author : Bogdanp
       Score  : 43 points
       Date   : 2025-08-09 20:22 UTC (4 days ago)
        
 (HTM) web link (www.asawicki.info)
 (TXT) w3m dump (www.asawicki.info)
        
       | BugsJustFindMe wrote:
       | I've never heard anyone say any of these supposed myths, except
       | for the first one, sort of, but nobody means what the first one
       | pretends it means, so this whole post feels like a big strawman
       | to me.
        
         | rollcat wrote:
          | People do say a lot of nonsense about floats, my younger self
          | trying to bash JavaScript included. For example, floats do fine
          | as integers - up to 2**53 - and a JIT can optimise this to use
          | actual integer math.
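The 2**53 cutoff is easy to check directly (a quick sketch in Python, whose floats are IEEE doubles):

```python
# Doubles have a 53-bit significand, so every integer up to 2**53 is exact.
assert 2.0**53 == 9007199254740992.0
assert 2.0**53 - 1.0 == 9007199254740991.0   # still exact
# At 2**53 the spacing between adjacent doubles becomes 2, so adding 1 is lost:
assert 2.0**53 + 1.0 == 2.0**53
assert 2.0**53 + 2.0 != 2.0**53
```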
        
         | Cieric wrote:
          | Just to add some context: Adam works at AMD as a dev tech, so
          | he is constantly working with game studio developers directly.
         | While I can't say I've heard the same things since I'm
         | somewhere else, I have seen some of the assumptions made in
         | some shader code and they do line up with the kind of things
         | he's saying.
        
         | jbjbjbjb wrote:
         | I was putting together some technical interview questions for
         | our candidates and I wanted to see what ChatGPT would put for
         | an answer. It told me floating point numbers were non-
         | deterministic and wouldn't back down from that answer so it's
         | getting it from somewhere.
        
         | AshamedCaptain wrote:
          | I concur, it seems low effort, and the only real common "myth"
          | (the 1st one) is not really disproven. In fact the very example
          | he gives proves the point, as it is going to become an infinite
          | loop given large enough N....
          | 
          | Also compiler optimizations should not affect the result.
          | Especially without "fast-math/O3" (which arguably many people
          | stupidly use nowadays, then complain).
        
           | zokier wrote:
            | > Also compiler optimizations should not affect the result.
            | Especially without "fast-math/O3" (which arguably many people
            | stupidly use nowadays, then complain).
           | 
            | -ffp-contract is an annoying exception to that principle.
        
         | jandrese wrote:
          | The first rule is true, but relying on it is dangerous unless
          | you are well versed in which values floats can and cannot
          | represent exactly. It's best to pretend it isn't true in most
          | cases to avoid footguns.
        
         | janalsncm wrote:
         | Same, but I still learned a fair amount anyways. I say no harm,
         | no foul.
        
       | rollcat wrote:
       | Related: What Every Computer Scientist Should Know About
       | Floating-Point Arithmetic
       | <https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...>
        
         | hans_castorp wrote:
         | Also related:
         | 
         | https://floating-point-gui.de/
        
       | bee_rider wrote:
       | I like these. They are push back against the sort of... first
       | correction that people make when encountering floating point
       | weirdness. That is, the first mistake we make is to treat floats
       | as reals, and then we observe some odd rounding behavior. The
       | second mistake we make is to treat the rounding events as random.
       | A nice thing about IEEE floats is that the rounding behavior is
       | well defined.
       | 
       | Often it doesn't matter, like you ask for a gemm and you get
       | whatever order of operations blas, AVX-whatever, and OpenMP
       | conspire to give you, so it is more-or-less random.
       | 
       | But if it does matter, the ability to define it is there.
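That determinism is easy to observe (a quick Python check; the result of 0.1 + 0.2 is pinned down exactly by IEEE-754 round-to-nearest-even, so it is the same on every conforming machine and every run):

```python
x = 0.1 + 0.2
# The rounding is "wrong" relative to the reals...
assert x != 0.3
# ...but it is not random: IEEE-754 specifies the exact result.
assert x == 0.30000000000000004
assert (0.1 + 0.2) == (0.1 + 0.2)  # reproducible, run after run
```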
        
         | jandrese wrote:
         | > A nice thing about IEEE floats is that the rounding behavior
         | is well defined.
         | 
         | Until it isn't. I used to play the CodeWeavers port of Kohan
         | and while the game would allow you to do crossplay between
         | Windows and Linux, differences in how the two OSes rounded
         | floats would cause the game to desynchronize after 15-20
         | minutes of play or so. Some unit's pathfinding algorithm would
         | zig on Windows and zag on Linux, causing the state to diverge
         | and end up kicking off one of the players.
        
           | bee_rider wrote:
           | Is it possible that your different operating systems just had
           | different mxcsr values?
           | 
           | Or, since it was a port, maybe they were compiled with
           | different optimizations.
           | 
           | There are a lot of things happening under the hood but most
           | of them should be deterministic.
        
             | toolslive wrote:
             | until someone compiles with --ffast-math enabled, stating
             | "I don't care about accuracy, as long as it's fast".
        
               | bee_rider wrote:
               | It is good to enable that flag because it also enables
               | the "fun safe math optimizations" flag, and it is
               | important to remind people that math is a safe way to
               | have fun.
        
               | toolslive wrote:
               | "Friends don't let friends use fast-math"
               | 
               | https://simonbyrne.github.io/notes/fastmath/
        
           | jcranmer wrote:
           | The differences are almost certainly not in how the two OSes
           | rounded floats--the IEEE rounding modes are standard, and
           | almost no one actually bothers to even change the rounding
           | mode from the default.
           | 
           | For cross-OS issues, the most likely culprit is that Windows
           | and Linux are using different libm implementations, which
           | means that the results of functions like sin or atan2 are
           | going to be slightly different.
        
             | zokier wrote:
             | Problem with rounding modes and other fpenv flags is that
             | any library anywhere might flip some flag and suddenly the
             | whole program changes behavior.
        
       | exDM69 wrote:
        | My favorite thing about floating point numbers: you _can_ divide
        | by zero. The result of x/0.0 is +/- inf (or NaN if x is zero).
       | There's a helpful table in "weird floats" [0] that covers all the
       | cases for division and a bunch of other arithmetic instructions.
       | 
       | This is especially useful when writing branchless or SIMD code.
       | Adding a branch for checking against zero can have bad
       | performance implications and it isn't even necessary in many
       | cases.
       | 
       | Especially in graphics code I often see a zero check before a
       | division and a fallback for the zero case. This often practically
       | means "wait until numerical precision artifacts arise and then do
       | something else". Often you could just choose the better of the
       | two options you have instead of checking for zero.
       | 
       | Case in point: choosing the axes of your shadow map projection
       | matrix. You have two options (world x axis or z axis), choose the
       | better one (larger angle with viewing direction). Don't wait
       | until the division goes to inf and then fall back to the other.
       | 
       | [0]
       | https://www.cs.uaf.edu/2011/fall/cs301/lecture/11_09_weird_f...
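A few rows of that table can be spot-checked in plain Python (CPython raises on a literal x/0.0, so inf is built directly here):

```python
import math

inf = math.inf
# Division involving infinities behaves like the corresponding limits:
assert 1.0 / inf == 0.0
assert inf + 1.0 == inf
assert inf * 2.0 == inf
# Indeterminate forms come out as NaN:
assert math.isnan(inf - inf)
assert math.isnan(inf / inf)
assert math.isnan(0.0 * inf)
```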
        
         | thwarted wrote:
         | You still can't divide by zero, it just doesn't result in an
         | error state that stops execution. The inf and NaN values are
         | sentinel values that you still have to check for _after the
         | calculation_ to know if it went awry.
        
           | sixo wrote:
            | In the space of floats, you _are_ dividing by zero. To map
            | back to the space of numbers you have to check. It's nice,
            | though; inf and NaN sentinels give you the behavior of a
            | monadic `Result | Error` pipeline without having to wrap your
            | numbers in another abstraction.
        
           | TehShrike wrote:
           | You can divide by zero, but you mayn't.
        
           | epcoa wrote:
           | If dividing by zero has a well defined result that doesn't
           | abort execution what exactly does "can't" even mean?
           | 
           | Operations on those sentinel values are also defined. This
           | can affect when checking needs to be done in optimized code.
        
             | bee_rider wrote:
             | I believe divide-by-zero produces an exception. The machine
             | can either be configured to mask that exception, or not.
             | 
             | Personally, I am lazy, so I don't check the mxcsr register
             | before I start running my programs. Maybe gcc does
             | something by default, I don't know. IMO legitimate division
             | by zero is rare but not impossible, so if you do it, the
             | onus is on you to make sure the flags are set up right.
        
               | epcoa wrote:
                | Correct, divide by zero is one of the original five
                | defined IEEE754-1985 exceptions. But the default behavior
                | then and now is to produce that defined result mentioned
                | and continue execution with a flag set ("default non-
                | stop"). Further conforming implementations also allow
                | "raiseNoFlag".
                | 
                | It's well-defined, which is all that really matters AFAIC.
        
           | mike_ivanov wrote:
           | This is the Result monad in practice. It allows you to
           | postpone error handling until the computation is done.
        
           | wpollock wrote:
           | It has always amused me that Integer division by 0 results in
           | "floating point exception", but floating point division by
           | 0.0 doesn't!
        
         | kccqzy wrote:
         | This is also my favorite thing about floating point numbers.
         | Unfortunately languages like Python try to be smart and prevent
          | me from doing it. Compare:
          | 
          |     >>> 1.0/0.0
          |     ZeroDivisionError
          |     >>> np.float64(1)/np.float64(0)
          |     inf
         | 
         | I'm so used to writing such zero division in other languages
         | like C/C++ that this Python quirk still trips me up.
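One way around the quirk without pulling in NumPy is to catch the exception and hand back the IEEE result yourself (a hedged sketch; `ieee_div` is a made-up helper name, not a standard function):

```python
import math

def ieee_div(x, y):
    """Divide with IEEE-754 semantics: x/0.0 -> signed inf, 0.0/0.0 -> NaN."""
    try:
        return x / y
    except ZeroDivisionError:
        if x == 0.0 or math.isnan(x):
            return math.nan
        # The result's sign follows the signs of both operands,
        # including the sign of a negative-zero divisor.
        return math.copysign(math.inf, x) * math.copysign(1.0, y)

assert ieee_div(1.0, 0.0) == math.inf
assert ieee_div(-1.0, 0.0) == -math.inf
assert ieee_div(1.0, -0.0) == -math.inf
assert math.isnan(ieee_div(0.0, 0.0))
```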
        
           | pklausler wrote:
           | Division by zero is an error and it should be treated as
           | such. "Infinity" is an error indication from overflow and
           | division by zero, nothing more.
        
             | ForceBru wrote:
             | The article's point 3 says that this is a myth. Indeed, the
             | _limit_ of `1/x` as `x` approaches zero from the right is
             | positive infinity. What's more, division by _negative zero_
             | (which, perhaps surprisingly, is a thing) yields negative
             | infinity, which is also the value of the corresponding
             | limit. If you divide a finite float by infinity, you get
             | zero, because `lim_{x\to\infty} c/x=0`. In many cases you
             | can treat division by zero or infinity as the appropriate
             | limit.
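The limit behavior, including the negative-zero case, can be demonstrated in a few lines of Python:

```python
import math

# 0.0 and -0.0 compare equal, but the sign bit is real and observable:
assert 0.0 == -0.0
assert math.copysign(1.0, -0.0) == -1.0
# Dividing a finite number by infinity gives (signed) zero, matching the limit:
assert 1.0 / math.inf == 0.0
assert math.copysign(1.0, 1.0 / -math.inf) == -1.0
```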
        
               | pklausler wrote:
               | I am allowed to disagree with the article.
        
               | ForceBru wrote:
                | Sure, but it makes sense, doesn't it? Even `inf-inf ==
                | NaN` and `inf/inf == NaN`, which is true in calculus:
                | limits like these are undefined, unless you use
                | l'Hopital's rule or something. (I know NaN isn't equal to
                | itself; it's just for illustration purposes.) But then
                | again, you usually don't want these popping up in your
                | code.
        
               | pklausler wrote:
               | In practice, though, I can't recall any HPC codes that
               | want to use IEEE-754 infinities as valid data.
        
             | sfpotter wrote:
             | This is totally false mathematically. Please look up the
             | extended real number system for an example. Many branches
             | of mathematics affix infinity to some existing number
             | system, extending its operations consistently, and do all
             | kinds of useful things with this setup. Being able to work
             | with infinity in exactly the same way in IEEE754 is crucial
             | for being able to cleanly map algorithms from these domains
             | onto a computer. If dividing by zero were an error in
             | floating point arithmetic, I would be unable to do my job
             | developing numerical methods.
        
       | BlackFly wrote:
       | You can exploit the exactness of (specific) floating point
       | operations in test data by using sums of powers of 2. Polynomials
       | with such coefficients produce exact results so long as the
       | overall powers are within ~53 powers of 2 (don't quote me exactly
       | on that, I generally don't push the range very high!). You can
       | find exact polynomial solutions to linear PDEs with such powers
       | using high enough order finite difference methods for example.
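The exactness of power-of-two arithmetic is easy to verify (a small Python sketch; doubles have a 53-bit significand):

```python
# Sums and products of modest powers of two are exact in binary floating point:
assert 0.5 + 0.25 == 0.75
assert 0.125 * 8.0 == 1.0
x = 2.0**-3 + 2.0**-5                 # a coefficient built from powers of two
assert 4.0 * x == 2.0**-1 + 2.0**-3   # scaling by a power of two stays exact
# Contrast with a decimal coefficient, which is already inexact on entry:
assert 0.1 + 0.2 != 0.3
```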
       | 
       | However, the story about non-determinism is no myth. The intel
       | processors have a separate math coprocessor that supports 80bit
       | floats (https://en.wikipedia.org/wiki/Extended_precision#x86_exte
       | nde...). Moving a float from a register in this coprocessor to
       | memory truncates the float. Repeated math can be done inside this
       | coprocessor to achieve higher precision so hot loops generally
       | don't move floats outside of these registers. Non-determinism
       | occurs in programs running on intel with floats when threads are
       | interrupted and the math coprocessor flushed. The non-determinism
       | isn't intrinsic to the floating point arithmetic but to the non-
       | determinism of when this truncation may occur. This is more
       | relevant for fields where chaotic dynamics occur. So the same
       | program with the same inputs can produce different results.
       | 
       | NaN is an error. If you take the square root of a negative number
       | you get a NaN. This is just a type error, use complex numbers to
       | overcome this one. But then you get 0. / 0. and that's a NaN or
       | Inf - Inf and a whole slew of other things that produce out of
       | bounds results. Whether it is expected or not is another story,
       | but it does mean that you are unable to represent the value with
       | a float and that is a type error.
        
         | AshamedCaptain wrote:
         | > Non-determinism occurs in programs running on intel with
         | floats when threads are interrupted and the math coprocessor
         | flushed
         | 
          | That's ridiculous. No OS in its right mind would flush FPU regs
          | to 64 bits only, because that would break many things, most
          | obviously "real" 80 bit FP, which is still a thing and the only
          | reason x87 instructions still work. It would even break plain
          | equality comparisons, making all FP useless.
         | 
         | For 64 bit FP most compilers prefer SSE rather than x87
         | instructions these days.
        
         | bobmcnamara wrote:
         | > Non-determinism occurs in programs running on intel
         | 
         | FTFY. They even changed some of the more obscure handling
         | between 8087,80287,80387. So much hoop jumping if you cared
         | about binary reproducibility.
         | 
         | Seems to be largely fixed with targeting SSE even for scalar
         | code now.
        
         | jcranmer wrote:
         | Wow, you're crossing a few wires in your zeal to provide
         | information to the point that you're repeating myths.
         | 
         | > The intel processors have a separate math coprocessor that
         | supports 80bit floats
         | 
          | x86 processors have two FPU units, the x87 unit (that you're
          | describing) and the SSE unit. Anyone compiling for x86-64 uses
          | the SSE unit by default, and most x86-32 compilers still
          | default to SSE anyways.
         | 
         | > Moving a float from a register in this coprocessor to memory
         | truncates the float.
         | 
         | No it doesn't. The x87 unit has load and store instructions for
         | 32-bit, 64-bit, and 80-bit floats. If you want to spill 80-bit
         | values as 80-bit values, you can do so.
         | 
         | > Repeated math can be done inside this coprocessor to achieve
         | higher precision so hot loops generally don't move floats
         | outside of these registers.
         | 
         | Hot loops these days use the SSE stuff because they're so much
         | faster than x87. Friends don't let friends use long double
         | without good reason!
         | 
         | > Non-determinism occurs in programs running on intel with
         | floats when threads are interrupted and the math coprocessor
         | flushed.
         | 
         | Lol, nope. You'll spill the x87 register stack on thread
         | context switch with FSAVE or FXSAVE or XSAVE, all of which will
         | store the registers as 80-bit values without loss of precision.
         | 
         | That said, there was a problem with programs that use the x87
         | unit, but it has absolutely nothing to do with what you're
         | describing. The x87 unit doesn't have arithmetic for 32-bit and
         | 64-bit values, only 80-bit values. Many compilers, though, just
         | pretended that the x87 unit supported arithmetic on 32-bit and
         | 64-bit values, so that FADD would simultaneously be a 32-bit
          | addition, a 64-bit addition, and an 80-bit addition. If the
         | compiler needed to spill a floating-point register, they would
         | spill the value as a 32-bit value (if float) or 64-bit value
         | (if double), and register spills are pretty unpredictable for
         | user code. That's the nondeterminism you're referring to, and
         | it's considered a bug in every compiler I'm aware of. (See
         | https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37...
         | for a more thorough description of the problem).
        
         | nyeah wrote:
         | NaN is not necessarily an error. It might be fine. It depends
         | on what you're doing with it.
         | 
         | If NaN is invalid input for the next step, then sure why not
         | treat it as an error? But that's a design decision not an
         | imperative that everybody must follow. (I picture Mel Brooks'
         | 15 commandments, #11-15, that fell and broke. This is not like
         | that.)
        
       | quietbritishjim wrote:
       | When writing code, it's useful to follow simpler rules than
       | strictly needed to make it easier to read for other coders
       | (including you in the future). For example, you might add some
       | unnecessary brackets to expressions to avoid remembering all the
       | operator precedence rules.
       | 
       | The article is right that sometimes you can can use simple
       | equality on floating point values. But if you have a (almost)
       | blanket rule to use fuzzy comparison, or (even better) just avoid
       | any sort if equality-like comparison altogether, then your code
       | might be simpler to understand and you might be less likely to
       | make a mistake about when you really can safely do that.
       | 
       | It's still sensible to understand that it's possible though. Just
       | that you should try to avoid writing tricky code that depends on
       | it.
        
         | zokier wrote:
         | That sort of argument can be problematic though, because the
         | code can become misleading. If I see a fuzzy comparison then
         | I'd assume it is there because it is needed, which might make
         | the rest of the code more difficult to understand/modify
         | because I then have to assume that everywhere else the values
         | might be fuzzy.
        
         | kccqzy wrote:
         | Every time I use exact floating point equality I simply write a
         | comment explaining why I did this. That's what comments are
         | for.
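A couple of cases where exact equality is legitimately safe, in Python (the comments are exactly the kind of justification worth writing down):

```python
sentinel = -1.0   # assigned, never computed
x = sentinel
# Exact equality is safe: x was copied, not recomputed, so the bits match.
assert x == sentinel

half = 0.5 * 10.0
# Multiplying by a power of two only changes the exponent; no rounding occurs.
assert half == 5.0
```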
        
       | glkindlmann wrote:
        | Note: the improved loop in the "1. They are not exact" section
        | can easily hang. If count > 2^24, then the ulp of f is 2, and
        | adding 1.0f leaves f unchanged. What's wild is that a few lines
        | later he notes how above 2^24 numbers start "jumping" every 2.
        | Ok, but FP are always "jumping", by their ulp, regardless of how
        | big or small they are.
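The float32 cutoff can be reproduced from Python by round-tripping through a 4-byte float (a sketch using `struct`; CPython's own floats are doubles, so for them the same wall sits at 2**53):

```python
import struct

def f32(x):
    """Round a Python double to the nearest 32-bit float and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

n = f32(2.0**24)              # 16777216.0, exactly representable
# Above 2**24 the ulp of a float32 is 2, so adding 1.0f is rounded away:
assert f32(n + 1.0) == n
assert f32(n + 2.0) == n + 2.0
```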
        
       | pklausler wrote:
        | People need to understand rounding better, especially the topic
        | of _when_ rounding can happen and when it can't for the basic
        | operations.
       | 
       | Updating with a concrete example: The Fortran standard defines
       | the real MOD and MODULO intrinsic functions as being equivalent
       | to the straightforward sequence of a division, conversion to
       | integer, multiplication, and subtraction. This formula can round,
       | obviously. But MOD can be (and thus should be) implemented
       | exactly by other means, and most Fortran compilers do so instead.
       | This leaves Fortran implementors in a bit of a pickle -- conform
       | to the standard, or produce good results?
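The gap between the standard's formula and an exact MOD can be shown numerically (a Python sketch of the same sequence of rounded operations; `naive_mod` is an illustrative name):

```python
import math

def naive_mod(a, p):
    # The Fortran-standard definition: a - INT(a/p)*p,
    # with each intermediate step rounded to double precision.
    return a - float(int(a / p)) * p

# math.fmod computes the remainder exactly:
assert math.fmod(1e17, 3.0) == 1.0
# The naive formula loses the answer entirely to intermediate rounding:
assert naive_mod(1e17, 3.0) == 0.0
```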
        
         | camgunz wrote:
         | Definitely. People think they'll get out of knowing how
         | rounding works by using arbitrary precision arithmetic, but
         | arguably it's even more important there (you run out of
         | precision/memory at some point; what do you think happens
         | then?). You can use floats for money if you do the rounding
         | right.
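One hedged sketch of "doing the rounding right": keep amounts as whole cents, which doubles represent exactly up to 2**53 (the `to_cents` helper and its input format are assumptions for illustration; it handles only simple positive "D.CC" strings):

```python
def to_cents(amount_str):
    # Parse the decimal string directly instead of multiplying a
    # rounded double like 19.99 * 100 (which is not exactly 1999).
    dollars, _, cents = amount_str.partition('.')
    return float(int(dollars) * 100 + int(cents.ljust(2, '0')[:2]))

# Integer cents add exactly -- no drift, no fuzzy comparisons needed:
assert to_cents("19.99") == 1999.0
assert to_cents("19.99") + to_cents("0.01") == 2000.0
```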
        
       | AtlasBarfed wrote:
       | The core thing to know about floating point numbers in comp langs
       | is they aren't floating point numbers.
       | 
        | They are approximations of floating point numbers, and even if
        | your current approximation of a floating point number to
        | represent a value seems to be accurate as an integer value...
        | 
        | There is no guarantee that if you take two of those floating
        | point number approximations that appear completely accurate, the
        | resulting operation between them will also be completely
        | accurate.
        
         | kccqzy wrote:
         | That's not a useful way to think. Floating point numbers are
         | just floating point numbers. You aren't approximating floating
         | point numbers. You are approximating real numbers.
         | 
          | Floating point numbers are named in contrast to fixed point
          | numbers, where the point (between the integer part and the
          | fractional part) is fixed. That is why they are called
          | floating. The nature of the real numbers is such that they can
         | in general only be approximated, regardless of whether you use
         | fixed point numbers or floating point numbers or a fancier
         | computable number representation.
        
       | ivankra wrote:
          | My favorite trick: NaN boxing. NaNs aren't just for errors;
          | they can also smuggle other data inside. For a double, you have
          | a whopping 53 bits of payload, enough to cram in a pointer and
          | maybe a type tag, and many JavaScript engines do exactly that
          | (since JS numbers are doubles, after all).
       | 
       | https://wingolog.org/archives/2011/05/18/value-representatio...
       | 
       | https://piotrduperas.com/posts/nan-boxing
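The bit-level trick can be sketched in Python with `struct` (names like `box`/`unbox` are made up for illustration; this stashes a small integer in the low mantissa bits of a quiet NaN):

```python
import math
import struct

QNAN = 0x7ff8000000000000  # exponent all ones + quiet bit: a quiet NaN

def box(payload):
    # OR the payload into the low mantissa bits of the quiet-NaN pattern.
    return struct.unpack('<d', struct.pack('<Q', QNAN | payload))[0]

def unbox(x):
    # Recover the payload by masking off the NaN framing bits.
    return struct.unpack('<Q', struct.pack('<d', x))[0] & ((1 << 51) - 1)

v = box(12345)
assert math.isnan(v)        # still a perfectly ordinary NaN to the FPU
assert unbox(v) == 12345    # but the payload survives
```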
        
         | pklausler wrote:
         | 52 bits of payload, and at least one bit must be set.
        
           | ivankra wrote:
           | You can put stuff into the sign bit too, that makes 53. Yeah,
           | the lower 52 bits can't all be zero - that'd be +-INF, but
           | the other 2^53-2 values are all yours to use.
        
             | pklausler wrote:
             | It's possible for the sign bit of a NaN to be changed by a
             | "non-arithmetic" operation that doesn't trap on the NaN, so
             | don't put anything precious in there.
        
         | Arnavion wrote:
         | It is also how RISC-V floating point registers are required to
         | store floats of smaller widths. Eg if your CPU supports 64-bit
         | floats (D extension), its FPU registers will be 64-bit wide. If
         | you use an instruction to load a 16-bit float (Zfh extension)
         | into such a register, it will be boxed into a negative quiet
         | NaN with all bits above the lower 16 bits set to 1.
        
       | calibas wrote:
       | > They are not exact
       | 
        | It's not exactly a myth; as the article mentions, they're only
        | exact for certain ranges of values.
       | 
       | > NaN and INF are indication of an error
       | 
       | This is somewhat semantic, but dividing by zero typically does
       | create a hardware exception. However, it's all handled behind the
       | scenes, and you get "Inf" as the result.
       | 
       | You can make it so dividing by zero is explicitly an error, see
       | the "ftrapping-math" flag.
        
       | AndriyKunitsyn wrote:
       | Here's one that's not a myth: IEEE-754 floats are the only
       | "primitive types" that allow "a == a" to not be true.
       | 
       | I.e., two floats that are _identical_ to each other (even when
       | it's _the same_ variable, on the same memory address) can be not
       | _equal_ to each other, specifically if it's NaN. This is dictated
       | by IEEE-754, and this is true for all programming languages I
       | know, and to this day, this makes zero sense to me, but
       | apparently this is useful for some reason.
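The `a == a` oddity in a few lines of Python (this behavior is also why `math.isnan` exists, and why some code uses `x != x` as a NaN test):

```python
import math

nan = float('nan')
assert nan != nan            # the only float value not equal to itself
assert not (nan == nan)
assert math.isnan(nan)       # the idiomatic check
```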
        
       ___________________________________________________________________
       (page generated 2025-08-13 23:01 UTC)