[HN Gopher] Myths About Floating-Point Numbers (2021)
___________________________________________________________________
Myths About Floating-Point Numbers (2021)
Author : Bogdanp
Score : 43 points
Date : 2025-08-09 20:22 UTC (4 days ago)
(HTM) web link (www.asawicki.info)
(TXT) w3m dump (www.asawicki.info)
| BugsJustFindMe wrote:
| I've never heard anyone say any of these supposed myths, except
| for the first one, sort of, but nobody means what the first one
| pretends it means, so this whole post feels like a big strawman
| to me.
| rollcat wrote:
| People do say a lot of nonsense about floats, my younger self
| trying to bash JavaScript included. For example, they do fine as
| integers up to 2**53; this could be optimised by a JIT to use
| actual integer math.
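|
| A quick C++ check of that boundary (my snippet; a JS number is
| the same 64-bit double):
|
|     #include <cstdio>
|
|     int main() {
|         double f = 9007199254740992.0;  // 2^53
|         printf("%.1f\n", f - 1.0);      // 9007199254740991.0: exact
|         printf("%.1f\n", f + 1.0);      // 9007199254740992.0: 2^53 + 1
|                                         // is not representable and
|                                         // rounds back down
|     }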
| Cieric wrote:
| Just to add some context: Adam works at AMD as a dev tech, so he
| is constantly working with game studio developers directly.
| While I can't say I've heard the same things since I'm
| somewhere else, I have seen some of the assumptions made in
| some shader code and they do line up with the kind of things
| he's saying.
| jbjbjbjb wrote:
| I was putting together some technical interview questions for
| our candidates and I wanted to see what ChatGPT would put for
| an answer. It told me floating point numbers were non-
| deterministic and wouldn't back down from that answer, so it's
| getting it from somewhere.
| AshamedCaptain wrote:
| I concur, it seems low effort, and the only real common "myth"
| (the 1st one) is not really disproven. In fact the very example
| he gives goes to prove it, as it is going to become an infinite
| loop given a large enough N....
|
| Also compiler optimizations should not affect the result.
| Especially without "fast-math/O3" (which arguably many people
| stupidly use nowadays, then complain).
| zokier wrote:
| > Also compiler optimizations should not affect the result.
| Especially without "fast-math/O3" (which arguably many people
| stupidly use nowadays, then complain).
|
| -ffp-contract is an annoying exception to that principle.
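|
| A small C++ illustration (constants are mine): with
| -ffp-contract=fast the compiler may compile a * b + c into a
| single fused multiply-add, which rounds once instead of twice.
|
|     #include <cmath>
|     #include <cstdio>
|
|     int main() {
|         // 0x1p-27 is a hex float literal for 2^-27 (C++17)
|         double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
|         double plain = a * b + c;          // 0.0 if not contracted:
|                                            // a*b rounds to 1.0 first
|         double fused = std::fma(a, b, c);  // -2^-54: rounds only once
|         printf("%g %g\n", plain, fused);   // under contraction, "plain"
|                                            // may silently match "fused"
|     }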
| jandrese wrote:
| The first rule is true, but relying on it is dangerous unless
| you are well versed in which values can and cannot be
| represented exactly as floats. It's best to pretend it isn't
| true in most
| cases to avoid footguns.
| janalsncm wrote:
| Same, but I still learned a fair amount anyways. I say no harm,
| no foul.
| rollcat wrote:
| Related: What Every Computer Scientist Should Know About
| Floating-Point Arithmetic
| <https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...>
| hans_castorp wrote:
| Also related:
|
| https://floating-point-gui.de/
| bee_rider wrote:
| I like these. They push back against the sort of... first
| correction that people make when encountering floating point
| weirdness. That is, the first mistake we make is to treat floats
| as reals, and then we observe some odd rounding behavior. The
| second mistake we make is to treat the rounding events as random.
| A nice thing about IEEE floats is that the rounding behavior is
| well defined.
|
| Often it doesn't matter, like you ask for a gemm and you get
| whatever order of operations blas, AVX-whatever, and OpenMP
| conspire to give you, so it is more-or-less random.
|
| But if it does matter, the ability to define it is there.
| jandrese wrote:
| > A nice thing about IEEE floats is that the rounding behavior
| is well defined.
|
| Until it isn't. I used to play the CodeWeavers port of Kohan
| and while the game would allow you to do crossplay between
| Windows and Linux, differences in how the two OSes rounded
| floats would cause the game to desynchronize after 15-20
| minutes of play or so. Some unit's pathfinding algorithm would
| zig on Windows and zag on Linux, causing the state to diverge
| and eventually kick one of the players out of the game.
| bee_rider wrote:
| Is it possible that your different operating systems just had
| different mxcsr values?
|
| Or, since it was a port, maybe they were compiled with
| different optimizations.
|
| There are a lot of things happening under the hood but most
| of them should be deterministic.
| toolslive wrote:
| until someone compiles with -ffast-math enabled, stating
| "I don't care about accuracy, as long as it's fast".
| bee_rider wrote:
| It is good to enable that flag because it also enables
| the "fun safe math optimizations" flag, and it is
| important to remind people that math is a safe way to
| have fun.
| toolslive wrote:
| "Friends don't let friends use fast-math"
|
| https://simonbyrne.github.io/notes/fastmath/
| jcranmer wrote:
| The differences are almost certainly not in how the two OSes
| rounded floats--the IEEE rounding modes are standard, and
| almost no one actually bothers to even change the rounding
| mode from the default.
|
| For cross-OS issues, the most likely culprit is that Windows
| and Linux are using different libm implementations, which
| means that the results of functions like sin or atan2 are
| going to be slightly different.
| zokier wrote:
| The problem with rounding modes and other fpenv flags is that
| any library anywhere might flip some flag and suddenly the
| whole program changes behavior.
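|
| A C++ sketch of how that bites (my example): flip the rounding
| mode and the same expression changes in the last bit.
|
|     #include <cfenv>
|     #include <cstdio>
|     // may need -frounding-math (GCC) or #pragma STDC FENV_ACCESS ON
|
|     int main() {
|         volatile double x = 1.0, y = 3.0;  // volatile blocks folding
|         std::fesetround(FE_DOWNWARD);
|         double down = x / y;
|         std::fesetround(FE_UPWARD);
|         double up = x / y;
|         std::fesetround(FE_TONEAREST);       // restore the default!
|         printf("%.17g\n%.17g\n", down, up);  // differ in the last place
|     }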
| exDM69 wrote:
| My favorite thing about floating point numbers: you _can_ divide
| by zero. The result of x / 0.0 is +/- inf (or NaN if x is zero).
| There's a helpful table in "weird floats" [0] that covers all the
| cases for division and a bunch of other arithmetic instructions.
|
| This is especially useful when writing branchless or SIMD code.
| Adding a branch for checking against zero can have bad
| performance implications and it isn't even necessary in many
| cases.
|
| Especially in graphics code I often see a zero check before a
| division and a fallback for the zero case. In practice this often
| means "wait until numerical precision artifacts arise and then do
| something else". Often you could just choose the better of the
| two options you have instead of checking for zero.
|
| Case in point: choosing the axes of your shadow map projection
| matrix. You have two options (world x axis or z axis), choose the
| better one (larger angle with viewing direction). Don't wait
| until the division goes to inf and then fall back to the other.
|
| [0]
| https://www.cs.uaf.edu/2011/fall/cs301/lecture/11_09_weird_f...
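|
| A quick C++ run of the division rows from that table (volatile
| keeps the compiler from folding the constants):
|
|     #include <cstdio>
|
|     int main() {
|         volatile double zero = 0.0;
|         printf("%f\n", 1.0 / zero);          // inf
|         printf("%f\n", -1.0 / zero);         // -inf
|         printf("%f\n", zero / zero);         // nan
|         printf("%f\n", 1.0 / (1.0 / zero));  // 0: dividing by inf
|     }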
| thwarted wrote:
| You still can't divide by zero, it just doesn't result in an
| error state that stops execution. The inf and NaN values are
| sentinel values that you still have to check for _after the
| calculation_ to know if it went awry.
| sixo wrote:
| In the space of floats, you _are_ dividing by zero. To map
| back to the space of numbers you have to check. It's nice,
| though; inf and NaN sentinels give you the behavior of a
| monadic `Result | Error` pipeline without having to wrap your
| numbers in another abstraction.
| TehShrike wrote:
| You can divide by zero, but you mayn't.
| epcoa wrote:
| If dividing by zero has a well defined result that doesn't
| abort execution what exactly does "can't" even mean?
|
| Operations on those sentinel values are also defined. This
| can affect when checking needs to be done in optimized code.
| bee_rider wrote:
| I believe divide-by-zero produces an exception. The machine
| can either be configured to mask that exception, or not.
|
| Personally, I am lazy, so I don't check the mxcsr register
| before I start running my programs. Maybe gcc does
| something by default, I don't know. IMO legitimate division
| by zero is rare but not impossible, so if you do it, the
| onus is on you to make sure the flags are set up right.
| epcoa wrote:
| Correct, divide by zero is one of the original five exceptions
| defined in IEEE 754-1985. But the default behavior then and now
| is to produce that defined result mentioned and continue
| execution with a flag set ("default non-stop"). Further,
| conforming implementations also allow "raiseNoFlag".
|
| That it's well-defined is all that really matters AFAIC.
| mike_ivanov wrote:
| This is the Result monad in practice. It allows you to
| postpone error handling until the computation is done.
| wpollock wrote:
| It has always amused me that integer division by 0 results in a
| "floating point exception", but floating point division by
| 0.0 doesn't!
| kccqzy wrote:
| This is also my favorite thing about floating point numbers.
| Unfortunately languages like Python try to be smart and prevent
| me from doing it. Compare:
|
|     >>> 1.0 / 0.0
|     ZeroDivisionError
|     >>> np.float64(1) / np.float64(0)
|     inf
|
| I'm so used to writing such zero division in other languages
| like C/C++ that this Python quirk still trips me up.
| pklausler wrote:
| Division by zero is an error and it should be treated as
| such. "Infinity" is an error indication from overflow and
| division by zero, nothing more.
| ForceBru wrote:
| The article's point 3 says that this is a myth. Indeed, the
| _limit_ of `1/x` as `x` approaches zero from the right is
| positive infinity. What's more, division by _negative zero_
| (which, perhaps surprisingly, is a thing) yields negative
| infinity, which is also the value of the corresponding
| limit. If you divide a finite float by infinity, you get
| zero, because `lim_{x\to\infty} c/x=0`. In many cases you
| can treat division by zero or infinity as the appropriate
| limit.
| pklausler wrote:
| I am allowed to disagree with the article.
| ForceBru wrote:
| Sure, but it makes sense, doesn't it? Even `inf-inf ==
| NaN` and `inf/inf == NaN`, which is true in calculus:
| limits like these are undefined, unless you use
| l'Hopital's rule or something. (I know NaN isn't equal to
| itself, it's just for illustration purposes) But then
| again, you usually don't want these popping up in your
| code.
| pklausler wrote:
| In practice, though, I can't recall any HPC codes that
| want to use IEEE-754 infinities as valid data.
| sfpotter wrote:
| This is totally false mathematically. Please look up the
| extended real number system for an example. Many branches
| of mathematics affix infinity to some existing number
| system, extending its operations consistently, and do all
| kinds of useful things with this setup. Being able to work
| with infinity in exactly the same way in IEEE754 is crucial
| for being able to cleanly map algorithms from these domains
| onto a computer. If dividing by zero were an error in
| floating point arithmetic, I would be unable to do my job
| developing numerical methods.
| BlackFly wrote:
| You can exploit the exactness of (specific) floating point
| operations in test data by using sums of powers of 2. Polynomials
| with such coefficients produce exact results so long as the
| overall powers are within ~53 powers of 2 (don't quote me exactly
| on that, I generally don't push the range very high!). You can
| find exact polynomial solutions to linear PDEs with such
| coefficients using high enough order finite difference methods,
| for example.
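|
| A tiny C++ demonstration of that kind of test datum (values are
| mine): every intermediate below is a sum of powers of 2 well
| inside the 53-bit significand, so plain == is safe.
|
|     #include <cstdio>
|
|     int main() {
|         double x = 1.5;                           // 2^0 + 2^-1
|         double p = 0.5 * x * x + 0.25 * x + 2.0;  // every step exact
|         printf("%s\n", p == 3.5 ? "exact" : "rounded");  // "exact"
|     }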
|
| However, the story about non-determinism is no myth. Intel
| processors have a separate math coprocessor that supports 80-bit
| floats (https://en.wikipedia.org/wiki/Extended_precision#x86_exte
| nde...). Moving a float from a register in this coprocessor to
| memory truncates the float. Repeated math can be done inside this
| coprocessor to achieve higher precision so hot loops generally
| don't move floats outside of these registers. Non-determinism
| occurs in programs running on intel with floats when threads are
| interrupted and the math coprocessor flushed. The non-determinism
| isn't intrinsic to the floating point arithmetic but to the non-
| determinism of when this truncation may occur. This is more
| relevant for fields where chaotic dynamics occur. So the same
| program with the same inputs can produce different results.
|
| NaN is an error. If you take the square root of a negative number
| you get a NaN. This is just a type error; use complex numbers to
| overcome this one. But then you get 0. / 0. and that's a NaN, or
| Inf - Inf, and a whole slew of other things that produce out of
| bounds results. Whether it is expected or not is another story,
| but it does mean that you are unable to represent the value with
| a float and that is a type error.
| AshamedCaptain wrote:
| > Non-determinism occurs in programs running on intel with
| floats when threads are interrupted and the math coprocessor
| flushed
|
| That's ridiculous. No OS in its right mind would flush FPU regs
| to 64 bits only, because that would break many things, most
| obviously "real" 80-bit FP, which is still a thing and the only
| reason x87 instructions still work. It would even break plain
| equality comparisons, making all FP useless.
|
| For 64-bit FP, most compilers prefer SSE rather than x87
| instructions these days.
| bobmcnamara wrote:
| > Non-determinism occurs in programs running on intel
|
| FTFY. They even changed some of the more obscure handling
| between the 8087, 80287, and 80387. So much hoop jumping if you
| cared about binary reproducibility.
|
| Seems to be largely fixed with targeting SSE even for scalar
| code now.
| jcranmer wrote:
| Wow, you're crossing a few wires in your zeal to provide
| information to the point that you're repeating myths.
|
| > The intel processors have a separate math coprocessor that
| supports 80bit floats
|
| x86 processors have two FPU units, the x87 unit (that you're
| describing) and the SSE unit. Anyone compiling for x86-64 uses
| the SSE unit by default, and most x86-32 compilers still
| default to SSE anyways.
|
| > Moving a float from a register in this coprocessor to memory
| truncates the float.
|
| No it doesn't. The x87 unit has load and store instructions for
| 32-bit, 64-bit, and 80-bit floats. If you want to spill 80-bit
| values as 80-bit values, you can do so.
|
| > Repeated math can be done inside this coprocessor to achieve
| higher precision so hot loops generally don't move floats
| outside of these registers.
|
| Hot loops these days use the SSE stuff because they're so much
| faster than x87. Friends don't let friends use long double
| without good reason!
|
| > Non-determinism occurs in programs running on intel with
| floats when threads are interrupted and the math coprocessor
| flushed.
|
| Lol, nope. You'll spill the x87 register stack on thread
| context switch with FSAVE or FXSAVE or XSAVE, all of which will
| store the registers as 80-bit values without loss of precision.
|
| That said, there was a problem with programs that use the x87
| unit, but it has absolutely nothing to do with what you're
| describing. The x87 unit doesn't have arithmetic for 32-bit and
| 64-bit values, only 80-bit values. Many compilers, though, just
| pretended that the x87 unit supported arithmetic on 32-bit and
| 64-bit values, so that FADD would simultaneously be a 32-bit
| addition, a 64-bit addition, and an 80-bit addition. If the
| compiler needed to spill a floating-point register, they would
| spill the value as a 32-bit value (if float) or 64-bit value
| (if double), and register spills are pretty unpredictable for
| user code. That's the nondeterminism you're referring to, and
| it's considered a bug in every compiler I'm aware of. (See
| https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p37...
| for a more thorough description of the problem).
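|
| A hedged sketch of how that compiler bug used to surface (needs
| an x87 build, e.g. gcc -m32 -mfpmath=387, to have a chance of
| reproducing):
|
|     #include <cstdio>
|
|     int main() {
|         volatile double x = 1.0, y = 3.0;
|         double q = x / y;    // may be spilled to memory as 64 bits...
|         if (q == x / y)      // ...while this compare sees 80 bits
|             printf("equal\n");
|         else
|             printf("not equal\n");  // possible on old x87 codegen
|     }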
| nyeah wrote:
| NaN is not necessarily an error. It might be fine. It depends
| on what you're doing with it.
|
| If NaN is invalid input for the next step, then sure why not
| treat it as an error? But that's a design decision not an
| imperative that everybody must follow. (I picture Mel Brooks'
| 15 commandments, #11-15, that fell and broke. This is not like
| that.)
| quietbritishjim wrote:
| When writing code, it's useful to follow simpler rules than
| strictly needed to make it easier to read for other coders
| (including you in the future). For example, you might add some
| unnecessary brackets to expressions to avoid remembering all the
| operator precedence rules.
|
| The article is right that sometimes you can use simple
| equality on floating point values. But if you have an (almost)
| blanket rule to use fuzzy comparison, or (even better) just avoid
| any sort of equality-like comparison altogether, then your code
| might be simpler to understand and you might be less likely to
| make a mistake about when you really can safely do that.
|
| It's still sensible to understand that it's possible though. Just
| that you should try to avoid writing tricky code that depends on
| it.
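|
| For reference, a minimal sketch of the kind of fuzzy comparison
| I mean (the tolerance is mine and very much
| application-dependent):
|
|     #include <algorithm>
|     #include <cmath>
|
|     // relative tolerance, scaled by the larger magnitude
|     bool nearly_equal(double a, double b, double rel = 1e-9) {
|         return std::fabs(a - b) <=
|                rel * std::max(std::fabs(a), std::fabs(b));
|     }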
| zokier wrote:
| That sort of argument can be problematic though, because the
| code can become misleading. If I see a fuzzy comparison then
| I'd assume it is there because it is needed, which might make
| the rest of the code more difficult to understand/modify
| because I then have to assume that everywhere else the values
| might be fuzzy.
| kccqzy wrote:
| Every time I use exact floating point equality I simply write a
| comment explaining why I did this. That's what comments are
| for.
| glkindlmann wrote:
| Note: the improved loop in the "1. They are not exact" section
| can easily hang. If count > 2^24, then the ulp of f is 2, and
| adding 1.0f leaves f unchanged. What's wild is that a few lines
| later he notes how above 2^24 numbers start "jumping" every 2.
| OK, but FP numbers are always "jumping", by their ulp,
| regardless of how big or small they are.
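|
| Easy to check in C++ (my snippet):
|
|     #include <cstdio>
|
|     int main() {
|         float f = 16777216.0f;  // 2^24: the ulp of f is now 2.0f
|         printf("%s\n", f + 1.0f == f ? "stuck" : "advances");  // "stuck"
|     }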
| pklausler wrote:
| People need to understand rounding better, especially the topic
| of _when_ rounding can happen and when it can't for the basic
| operations.
|
| Updating with a concrete example: The Fortran standard defines
| the real MOD and MODULO intrinsic functions as being equivalent
| to the straightforward sequence of a division, conversion to
| integer, multiplication, and subtraction. This formula can round,
| obviously. But MOD can be (and thus should be) implemented
| exactly by other means, and most Fortran compilers do so instead.
| This leaves Fortran implementors in a bit of a pickle -- conform
| to the standard, or produce good results?
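|
| The same pickle is visible from C++ (my example; fmod is
| specified to be exact, the textbook formula is not):
|
|     #include <cmath>
|     #include <cstdio>
|
|     int main() {
|         double a = 1e17, p = 3.0;
|         double formula = a - std::trunc(a / p) * p;  // rounds twice
|         double exact = std::fmod(a, p);              // exact remainder
|         printf("formula=%g fmod=%g\n", formula, exact);  // 0 vs 1
|     }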
| camgunz wrote:
| Definitely. People think they'll get out of knowing how
| rounding works by using arbitrary precision arithmetic, but
| arguably it's even more important there (you run out of
| precision/memory at some point; what do you think happens
| then?). You can use floats for money if you do the rounding
| right.
| AtlasBarfed wrote:
| The core thing to know about floating point numbers in comp langs
| is that they aren't floating point numbers.
|
| They are approximations of floating point numbers, and even if
| your current approximation of a floating point number to
| represent a value seems to be accurate as an integer value...
|
| There is no guarantee that if you take two of those floating
| point number approximations that appear completely accurate, the
| resulting operation between them will also be completely
| accurate.
| kccqzy wrote:
| That's not a useful way to think. Floating point numbers are
| just floating point numbers. You aren't approximating floating
| point numbers. You are approximating real numbers.
|
| Floating point numbers are named by contrast with fixed point
| numbers, where the point (separating the integer part from the
| fractional part) is fixed. That is why they are called
| floating. The nature of the real numbers is such that they can
| in general only be approximated, regardless of whether you use
| fixed point numbers or floating point numbers or a fancier
| computable number representation.
| ivankra wrote:
| My favorite trick: NaN boxing. NaNs aren't just for errors, but
| also for smuggling other data inside. For a double, you have a
| whopping 53 bits of payload, enough to cram in a pointer and
| maybe a type tag, and many JavaScript engines do just that
| (since JS numbers are doubles after all).
|
| https://wingolog.org/archives/2011/05/18/value-representatio...
|
| https://piotrduperas.com/posts/nan-boxing
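|
| A bare-bones C++ sketch of the trick (the tag layout is made up,
| not any particular engine's):
|
|     #include <cstdint>
|     #include <cstdio>
|     #include <cstring>
|
|     const uint64_t QNAN = 0x7ff8000000000000ull;  // quiet-NaN bits
|
|     double box_ptr(void* p) {  // stash a <= 48-bit pointer in a NaN
|         uint64_t bits = QNAN | (uint64_t)(uintptr_t)p;
|         double d;
|         std::memcpy(&d, &bits, sizeof d);  // bit-cast without UB
|         return d;
|     }
|
|     void* unbox_ptr(double d) {
|         uint64_t bits;
|         std::memcpy(&bits, &d, sizeof bits);
|         return (void*)(uintptr_t)(bits & 0xffffffffffffull);  // low 48
|     }
|
|     int main() {
|         int x = 42;
|         double boxed = box_ptr(&x);  // boxed != boxed: it is a NaN
|         printf("%d\n", *(int*)unbox_ptr(boxed));  // 42
|     }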
| pklausler wrote:
| 52 bits of payload, and at least one bit must be set.
| ivankra wrote:
| You can put stuff into the sign bit too, that makes 53. Yeah,
| the lower 52 bits can't all be zero - that'd be +-INF, but
| the other 2^53-2 values are all yours to use.
| pklausler wrote:
| It's possible for the sign bit of a NaN to be changed by a
| "non-arithmetic" operation that doesn't trap on the NaN, so
| don't put anything precious in there.
| Arnavion wrote:
| It is also how RISC-V floating point registers are required to
| store floats of smaller widths. Eg if your CPU supports 64-bit
| floats (D extension), its FPU registers will be 64-bit wide. If
| you use an instruction to load a 16-bit float (Zfh extension)
| into such a register, it will be boxed into a negative quiet
| NaN with all bits above the lower 16 bits set to 1.
| calibas wrote:
| > They are not exact
|
| It's not exactly a myth; as the article mentions, they're only
| exact for certain ranges of values.
|
| > NaN and INF are indication of an error
|
| This is somewhat semantic, but dividing by zero typically does
| create a hardware exception. However, it's all handled behind the
| scenes, and you get "Inf" as the result.
|
| You can make it so dividing by zero is explicitly an error, see
| the "ftrapping-math" flag.
| AndriyKunitsyn wrote:
| Here's one that's not a myth: IEEE-754 floats are the only
| "primitive types" that allow "a == a" to not be true.
|
| I.e., two floats that are _identical_ to each other (even when
| it's _the same_ variable, on the same memory address) can be not
| _equal_ to each other, specifically if it's NaN. This is dictated
| by IEEE-754, and this is true for all programming languages I
| know, and to this day, this makes zero sense to me, but
| apparently this is useful for some reason.
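|
| A one-liner shows it, along with the sanctioned way around it
| (my snippet):
|
|     #include <cmath>
|     #include <cstdio>
|
|     int main() {
|         double a = std::nan("");                           // or 0.0/0.0
|         printf("%s\n", a == a ? "equal" : "not equal");    // "not equal"
|         printf("%s\n", std::isnan(a) ? "nan" : "number");  // use isnan
|     }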
___________________________________________________________________
(page generated 2025-08-13 23:01 UTC)