[HN Gopher] Why Does Integer Addition Approximate Float Multipli...
       ___________________________________________________________________
        
       Why Does Integer Addition Approximate Float Multiplication?
        
       Author : ibobev
       Score  : 152 points
       Date   : 2025-02-09 18:36 UTC (4 days ago)
        
 (HTM) web link (probablydance.com)
 (TXT) w3m dump (probablydance.com)
        
       | Animats wrote:
       | Neat. Of course it works for the exponent, but that it's not that
       | far off for the mantissa is unexpected. It helps that the
       | mantissa is normalized.
        
         | guyomes wrote:
          | For the mantissa, the insight is probably that:
         | 
         | (1+e1)*(1+e2) = 1+e1+e2+(e1*e2)
         | 
         | If e1 and e2 are small, then e1*e2 is negligible.
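          | 
          | Quick numbers: (1+0.25)*(1+0.5) = 1.875, while 1+0.25+0.5 =
          | 1.75; the dropped term e1*e2 = 0.125 is exactly the gap.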
        
           | thequux wrote:
           | Huh, and in the case that e1 and e2 are large, the mantissa
           | overflows into the exponent, multiplying the result by 2.
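            | 
            | A worked example: with e1 = e2 = 0.75 the fraction fields
            | sum to 1.5, so the carry bumps the exponent and leaves 0.5,
            | giving 1.5 * 2 = 3.0 against the true 1.75 * 1.75 = 3.0625,
            | i.e. only about 2% low.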
        
             | amelius wrote:
             | I guess we're lucky that the IEEE 754 committee put the
             | exponent in front of the mantissa. Or maybe they were aware
             | of this trick all along ...
        
               | tialaramex wrote:
                | This arrangement makes ordering easier anyway, which is
               | probably why they chose it.
               | 
               | If we have some positive 32-bit integers A and B then if
               | A < B, f32::from_bits(A) < f32::from_bits(B)
               | 
               | Edited to add anecdote:
               | 
                | I actually had a bug in realistic (my Rust crate to
                | implement Hans Boehm's "Towards an API for the Real
                | Numbers") which I only fixed recently. When converting my
                | Real number into a floating point type I might end up
                | with too many mantissa bits, but sometimes,
                | coincidentally, the bottom bit of the exponent is zero,
                | so instead of increasing the exponent by one I actually
                | overwrite it with a one, which gives the same result.
                | 
                | I finally caught that when it occurred to me to do
                | "thorough testing". That is: take all possible 32-bit
                | floats, turn them into a Real, then turn that Real back
                | into a 32-bit float. This should be a roundtrip; if it's
                | not, we've found a bug. There are "only" a few billion of
                | these values to test, and a machine can do that.
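                | 
                | Roughly, the test looks like this (a sketch with
                | hypothetical Real::from_f32 / to_f32 names; the
                | crate's actual API differs):
                | 
                |     // Check f32 -> Real -> f32 roundtrips for
                |     // every finite bit pattern.
                |     fn roundtrip_all_f32s() {
                |         for bits in 0..=u32::MAX {
                |             let x = f32::from_bits(bits);
                |             if !x.is_finite() {
                |                 continue; // Real has no NaN/inf
                |             }
                |             let back = Real::from_f32(x).to_f32();
                |             // value equality; a stricter check
                |             // could compare bit patterns too
                |             assert_eq!(back, x, "bits {bits:#010x}");
                |         }
                |     }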
               | 
               | I fixed the 64-bit routines for the same bugs, but I
               | don't have enough years to run all the 64-bit tests the
               | same way and discover any that aren't paralleled.
               | 
                | [Of course there are presumably still bugs in the
                | conversion algorithms which may either be symmetric, and
                | thus not impact a round trip, or not affect the binary
                | fractions used exclusively by floats. Real has proper
                | rationals and various other things like the square root
                | of integers, Pi, etc., which we can convert _to_ a float
                | and lose precision, but we never get these from
                | converting a float because floats are always binary
                | fractions.]
        
               | xigoi wrote:
               | > If we have some positive 32-bit integers A and B then
               | if A < B, f32::from_bits(A) < f32::from_bits(B)
               | 
               | That is, if both are valid floats that represent a finite
               | non-zero number.
        
               | ChadNauseam wrote:
               | > realistic (my Rust crate to implement Hans Boehm's
               | "Towards an API for the Real Numbers")
               | 
               | So cool, I didn't know anyone was working on this!
        
               | Animats wrote:
               | > I fixed the 64-bit routines for the same bugs, but I
               | don't have enough years to run all the 64-bit tests the
               | same way and discover any that aren't paralleled.
               | 
               | Minor thought for today: there are more than 2^64 bits of
               | RAM in the world.
        
       | nabla9 wrote:
        | It's math and not just for integers. That's also how a slide
        | rule works.
        | 
        | a×b = 10^(log(a×b))
        | 
        | log(a×b) = log(a)+log(b)
        | 
        | thus a×b = 10^(log(a)+log(b))
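        | 
        | The same identity as a throwaway sketch (nothing bit-level
        | about it, just library log and pow):
        | 
        |     fn mul_via_logs(a: f64, b: f64) -> f64 {
        |         // a*b = 10^(log10(a) + log10(b)), for positive a, b
        |         10f64.powf(a.log10() + b.log10())
        |     }
        | 
        |     // mul_via_logs(3.0, 7.0) gives 21 up to the rounding
        |     // error of the logs.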
        
         | Sharlin wrote:
         | Yes, but what is nontrivial here is how good the approximation
         | is, despite the mantissa part being linear (much better than
         | the upper bound of being within 2x of the real value).
        
           | shiandow wrote:
           | Nontrivial yes, but it's not _that_ complicated to figure out
           | that floats are a piecewise linear approximation of the base
           | 2 logarithm.
           | 
            | Edit: or base 2^k, depends a bit on how you view it.
        
         | advisedwang wrote:
         | The very first paragraph of the article says this and then goes
         | on to say the trick gets the mantissa roughly right too:
         | 
         | > You can multiply two numbers by adding their exponents. So
         | just with the exponent-addition you will be within a factor of
         | 2 of the right result. But this will actually be much better
         | and get within 7.5% of the right answer. Why?
        
       | sethaurus wrote:
       | This is the core trick in the Fast Inverse Square Root[0], as
       | made famous by the Quake III source code.
       | 
       | It uses a shift (equivalent to dividing) and a subtraction in
       | integer-land to estimate x^(-0.5) in float-land.
       | 
       | [0]: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root
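        | 
        | For reference, the integer-land part is tiny. A sketch of the
        | textbook constant-and-shift version (not the literal Quake III
        | source):
        | 
        |     fn fast_inv_sqrt(x: f32) -> f32 {
        |         // the shift halves the "logarithm" hidden in the bit
        |         // pattern; the subtraction negates it and restores
        |         // the exponent bias
        |         let y = f32::from_bits(0x5f3759df - (x.to_bits() >> 1));
        |         // one Newton-Raphson step sharpens the estimate
        |         y * (1.5 - 0.5 * x * y * y)
        |     }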
        
         | Timwi wrote:
         | I've always hated that it's called an "inverse". It's not an
         | inverse. The inverse of square root is square. If they had
         | called it "reciprocal", it would have been clear to me what it
         | does, but "inverse" confused the hell out of me when I first
         | saw it.
        
           | Someone wrote:
           | It's confusing for non-mathematicians, but (and you may know
           | that) it is not incorrect.
           | https://en.wikipedia.org/wiki/Inverse_element:
           | 
           |  _"In mathematics, the concept of an inverse element
           | generalises the concepts of opposite (-x) and reciprocal (1
           | /x) of numbers."_
        
             | disconcision wrote:
             | i think this is just typical natural language ambiguity:
             | 
             | "Inverse Square root" is used to mean "(Multiplicative)
             | Inverse (of the) Square root" as opposed to "(Square root)
             | Inverse"
        
             | bmacho wrote:
             | No, it's confusing for mathematicians because it is
             | directly opposing math terminology. (Which is okay, but
             | confusing.)
        
               | Grimblewald wrote:
               | It doesnt though, inverse strictly speaking means the
               | multiplicative inverse. The idea was extended in inverse
               | functions eventually which for sake of brevity we also
               | call the inverse. Now f^{-1}(x) and f(x)^{-1} are clearly
               | different beasts the first would describe the square (for
               | square root) while the second describes what quake was
               | doing. Wether you call something an inverse or its full
               | name (multiplicative/functional - inverse) depends on if
               | its unclear from context which was meant.
        
               | throwawaymaths wrote:
               | the function inverse _is_ the multiplicative inverse in
               | the group of automorphisms over sets (when the
               | multiplication operation is functional composition).
        
               | cluckindan wrote:
               | Found the Haskell guy.
        
             | seeknotfind wrote:
              | Right, the price of the bang per syllable is that it's
              | ambiguous, not that it's incorrect. Inverse square root
              | could also mean
             | -(x^0.5) if it meant the additive inverse, and it could
             | mean x^2 if it meant the functional inverse, as said here.
        
           | manojlds wrote:
           | The inverse is about the negative power right? Square root is
           | the 0.5
        
           | vrighter wrote:
           | "inverse" has a very specific meaning: inverse(x) * x = 1
           | 
           | x^2 * x != 1 for any x other than 1. So no, x^2 is not the
           | inverse of sqrt(x)
        
             | tczMUFlmoNk wrote:
              | But _sqrt_ * _square_ = 1, where " _sqrt_ : R+ -> R+" is
              | the square root operation, " _square_ : R+ -> R+" is the
              | squaring operation, "1: R+ -> R+" is the identity
              | operation, and (*) is function composition: i.e., working
              | in the monoid of functions R+ -> R+. So _x_ ^2 and
              | sqrt( _x_ ), as elements of R, are not inverses, but
              | _sqrt_ and _square_ , as elements of R+ -> R+, are.
             | 
             | It depends on how you parse the phrase "fast inverse square
             | root".
        
               | xigoi wrote:
               | This is true, but interpreting it one way is clearly
               | absurd (nobody would call a square the "inverse square
               | root"), which implicates that the other meaning was
               | intended.
        
             | quietbritishjim wrote:
             | No. "inverse" has a very specific meaning: the operation
             | that undoes whatever operation you're talking about.
             | 
             | https://en.m.wikipedia.org/wiki/Inverse_function
             | 
             | The inverse of multiplying by 3 is dividing by 3, and vice-
             | versa.
             | 
              | The inverse of squaring is the square root, and vice-versa.
             | 
             | The inverse _of a number_ or other element (rather than an
             | operation  / function) means whatever number would form the
             | inverse if you use it with whatever binary operation you're
             | talking about. So the additive inverse of 3 is -3, the
             | inverse of "rotate clockwise 90 degrees" in the group of
             | geometric operations is "rotate anticlockwise 90 degrees",
             | and the multiplicative inverse of 3 is 1/3.
             | 
             | Saying "the inverse of 3" to mean 1/3, i.e. its
             | multiplicative inverse, is sloppy but acceptable so long as
             | everyone knows what you mean (and it's fairly common).
             | Saying "the inverse square root" to mean 1/square root is
             | just wrong.
        
               | sunk1st wrote:
               | Is it acceptable to use the imperative? That is, to say:
               | "Now, invert the element". Or should it always be, "take
               | the inverse of ..."?
        
           | bee_rider wrote:
           | Interestingly Intel got it right, thus the rsqrt family of
           | instructions.
        
           | empath75 wrote:
           | Since calling the square an inverse square root makes zero
           | sense in any practical context, the other common meaning
           | being applied is obvious. In this case, it's not the inverse
           | of a _function_, it's the inverse of a _number_, which is
           | usually considered to be the same as the reciprocal.
        
           | throwawaymaths wrote:
            | as a mathematician: it's fine. the word "multiplicative" is
           | just silent in the phrase.
        
       | HPsquared wrote:
       | The most significant bits of a floating point representation are
       | basically a logarithm of the number. Logarithms have this
       | relation between multiplication and addition.
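        | 
        | You can see that directly from the bit pattern (a quick
        | sketch; 23 is the f32 mantissa width, 127 the exponent bias):
        | 
        |     fn bits_as_log2(x: f32) -> f64 {
        |         // reinterpret the bits as an integer, rescale by
        |         // 2^23, and drop the exponent bias
        |         x.to_bits() as f64 / (1u64 << 23) as f64 - 127.0
        |     }
        | 
        |     // bits_as_log2(8.0)  = 3.0    vs  log2(8.0)  = 3.0
        |     // bits_as_log2(10.0) = 3.25   vs  log2(10.0) ~ 3.32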
        
         | vanderZwan wrote:
         | To give a "yes and" side-track to your comment: saying _"
         | logarithms have this relation between multiplication and
         | addition"_ is even underselling what logarithms are, because
         | reducing multiplication to an additive operation was the whole
         | motivation for John Napier[0] to discover/invent logarithms:
         | 
         | > _"...nothing is more tedious, fellow mathematicians, in the
         | practice of the mathematical arts, than the great delays
         | suffered in the tedium of lengthy multiplications and
         | divisions, the finding of ratios, and in the extraction of
         | square and cube roots... [with] the many slippery errors that
         | can arise...I have found an amazing way of shortening the
         | proceedings [in which]... all the numbers associated with the
         | multiplications, and divisions of numbers, and with the long
         | arduous tasks of extracting square and cube roots are
         | themselves rejected from the work, and in their place other
         | numbers are substituted, which perform the tasks of these
         | rejected by means of addition, subtraction, and division by two
         | or three only."_ [1]
         | 
         | Logarithms were honestly an _enormous_ breakthrough in
          | optimization, computers wouldn't be remotely as useful without
         | them, even if most of us don't "see" the logarithms being used.
         | 
         | In fact I'd argue that they are the second-biggest
         | computational optimization in use today, with only positional
         | notation being a bigger deal. Which, funny enough, works kind
         | of similarly: imagine you only could count by tallying (so,
         | unary). _Adding two number M and N would take M+N operations_ ,
         | e.g. 1234 + 5678 would require counting all 6912 individual
         | digits. Unary math scales O(n) in both data and computation.
         | Systems like Roman numerals almost work, but as soon as we
         | reach values larger than the largest symbol (M for 1000) it's
         | O(n) again, just with a better constant factor.
         | 
         | With positional notation numbers require only log(n) symbols to
         | write down, and log(n) operations for addition, e.g. 1234 +
         | 5678 requires one or two additions for each digit pair in a
         | given position - one addition if there's no carry from the
          | previous addition, two if there is. So addition takes at most
          | 2 × ceil( max( log(M), log(N) ) ) operations, i.e. O(log(n)).
         | 
         | Logarithms take that idea and "recursively" apply it to the
         | notation, making the same optimization work for multiplication.
         | Without it, the naive algorithm for the multiplication of two
         | numbers requires iterating over each digit, e.g. 1234 x 5678
         | requires multiplying each of the four digits of the first
          | number with each of the digits of the second number, and then
          | adding all the resulting numbers. It scales as O(di×dj), where
          | di and dj are the number of digits of each number. If they're
          | the same we can simplify that to O(d^2). When the numbers are
          | represented as
         | two logarithms the operation is reduced to adding two numbers
         | again, so O(log(d) + [whatever the log/inverse log conversion
         | cost is]). Of course d is a different value here and the number
         | of digits used affects the precision.
         | 
         | I think the craziest thing of all this is that we're so used to
         | positional notation that _nobody ever seems to consider it a
         | data compression technique_. Even though _almost no other data
         | compression method would work without it as a building block_
         | (run-length encoding, Lempel-Ziv, Arithmetic coding? Useless
          | without positional notation's O(log(n)) scaling factor). The
         | only exceptions are data compression methods that are based on
         | inventing their own numerical notation[2].
         | 
         | We do this every day ever since we first learned addition and
          | subtraction as kids. Or as David Bessis[3] puts it in his book
         | "Mathematica": ask almost any adult what one billion minus one
         | is and they know the answer instantaneously, so most adults
         | would appear to have mental superpowers in the eyes of pretty
         | much all mathematicians before positional notation was invented
         | (well, everyone except Archimedes maybe[4]). Positional
         | notation is magical, we're all math wizards, and it's so
         | normalized that we don't even realize it.
         | 
         | But to get back to your original point: yes, you are entirely
         | correct. IEEE floats are a form of lossy compression of
         | fractions, and the basis of that lossy compression is
         | logarithmic notation (but with a fixed number of binary digits
         | and some curious rules for encoding other values like NaN and
         | infinity).
         | 
         | [0] https://en.wikipedia.org/wiki/John_Napier
         | 
         | [1]
         | https://en.wikipedia.org/wiki/Mirifici_Logarithmorum_Canonis...
         | 
         | [2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems
         | 
         | [3] https://www.quantamagazine.org/mathematical-thinking-isnt-
         | wh...
         | 
         | [4] https://en.wikipedia.org/wiki/The_Sand_Reckoner
        
           | esafak wrote:
           | And if I may take this yet a step further, this is the point
           | of mathematical transformations: to find a domain in which
            | desirable operations are easier, such that the cost of the
            | round trip to and from the domain is offset.
           | 
           | In the past, transformations, like logarithms, Fourier
           | transforms, wavelets, had to be proposed. Enabled by advances
           | in computers, machine learning automated all that away by
           | composing differentiable building blocks into universal
           | estimators. The parameters of these building blocks are
           | estimated through gradient descent in conjunction with a
           | user-chosen loss function, which guides the optimization of
           | the transform. Good representations can be manipulated
           | through basic algebra (like added, averaged, and compared for
           | similarity, depending on the task) in a way that corresponds
           | to semantic operations when their raw, untransformed
           | representations can not.
        
             | HPsquared wrote:
              | I'm still amazed at word embeddings. They allow you to
              | find related words, and even do addition and subtraction.
             | 
             | The way you can supposedly take "king", subtract "man" and
             | add "woman" and you get "queen", and this kind of thing is
             | the mathematical basis of LLMs.
        
               | otabdeveloper4 wrote:
                | That's an urban legend and not how embeddings work.
        
               | CamperBob2 wrote:
               | It's my understanding of how embeddings work, as well.
               | The dot products (cosine similarity) of man . woman end
               | up being very similar to king . queen. This similarity
               | could be considered subtraction if you stretch the
               | metaphor enough.
               | 
               | If this is not a valid way to think about embedding
               | vectors, would you care to elaborate?
        
           | taejo wrote:
           | Excellent comment!
           | 
           | As a historical tidbit I'll add that Romans did develop two
           | ways to write larger numbers.
           | 
           | 1. Writing a line (vinculum) over a numeral to multiply its
           | value by 1,000. This was in fact extended to writing a line
           | to the left and above a numeral to multiply its value by
           | 1,000,000, and could in principle be extended to lines below
           | and to the right to multiply by 10^9 and 10^12, and even
           | nested boxes for larger powers.
           | 
            | 2. The use of |), |)), |))), ... for 500, 5,000, 50,000, ...
            | and (|), ((|)), (((|))), ... for 1,000, 10,000, 100,000, ...
           | These can be continued indefinitely.
           | 
           | https://en.wikipedia.org/wiki/Roman_numerals#Large_numbers
           | 
           | Both require an ever increasing number of marks just to write
           | the increasing powers, as well as an ever increasing number
           | of powers being summed, but both increase only
            | logarithmically, so we end up using O((log n)^2) marks to
           | write n. This is quadratically worse than positional
           | notation, but exponentially better than just writing M over
           | and over.
        
             | vanderZwan wrote:
             | Wow, that's amazing! I had somehow never heard of this
             | before despite being into numeral systems. Thank you for
             | sharing!
        
               | lanstin wrote:
               | If you like numeral systems, you might be interested in
               | packages that let one do arithmetic using continued
               | fractions: https://ieeexplore.ieee.org/document/6158099
        
               | thaumasiotes wrote:
               | Note that our use of D and M for the "Roman numerals" 500
               | and 1000 is a misinterpretation of their |) and (|).
        
           | hathawsh wrote:
           | I remember the glee I felt when I first used logarithms to
           | represent large numbers in Excel. Suddenly, I could easily
           | approximate large factorials, and they were fast! I wondered
           | how many other people have discovered this trick. I couldn't
           | find many references on the Internet. Over time, I realized
           | this method of handling large numbers is used so often that
           | most people who know about it don't bother to talk about it.
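            | 
            | (For anyone curious, the trick is to keep log10(n!) around
            | instead of n! itself; a minimal sketch:)
            | 
            |     fn log10_factorial(n: u64) -> f64 {
            |         // log10(n!) = log10(1) + log10(2) + ... + log10(n)
            |         (1..=n).map(|k| (k as f64).log10()).sum()
            |     }
            | 
            |     // floor(log10_factorial(1000)) + 1 = 2568 digits, and
            |     // 10^frac(...) ~ 4.02 gives the leading digits, even
            |     // though 1000! itself overflows every float type.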
        
             | groby_b wrote:
             | This _should_ have been part of the HS curriculum.
             | 
             | Logarithms pop up (IIRC) in 10th grade, and ln(x*y) = ln(x)
             | + ln(y) is usually explained as part of that. What many
             | teachers completely fail to teach is that it's not just a
             | formula to memorize, but that it has profound implications
             | on how you can do math. As you discovered by yourself.
             | (Kudos!)
             | 
             | It is, with the right teacher, a super-exciting story with
             | lots of colorful history & characters. You can even guide
              | your students to come to that same crucial insight. Alas,
             | few teachers have the necessary amount of passion to make
             | that come alive.
             | 
             | But that's probably why few people write about it - for
             | most, either their math interest got murdered in HS so they
             | never get to look at it, or they did learn it in 10th grade
             | and so consider it not worth mentioning.
             | 
             | (Corollary - we all should write more about the cool
              | realizations we had, even if they're in retrospect "well
             | known")
        
           | prerok wrote:
            | I must admit I wanted to write that all of that is true but
            | calculating logarithms is difficult. But then I looked it up
            | and found that the slide rule was invented shortly
            | thereafter:
           | 
           | https://en.m.wikipedia.org/wiki/Slide_rule
        
           | xyzzyz wrote:
           | _When the numbers are represented as two logarithms the
           | operation is reduced to adding two numbers again, so O(log(d)
           | + [whatever the log /inverse log conversion cost is])._
           | 
            | This is pretty confused. First, if d is the number of
            | digits of a number, then you will need O(d) memory to
            | store its logarithm, otherwise you will lose a lot of
            | precision. Thus, the addition of logarithms is still O(d),
           | not O(log(d)).
           | 
           | Second, calculating log(x) and its inverse, exp(x) is much
           | more expensive than multiplication. For example, if x is
           | between 0 and 1, you can approximate exp(x) pretty well as 1
           | + x + x^2/2 + x^3/6 + x^4/24. This is 3 multiplications and 3
           | divisions just to convert.
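            | 
            | (For reference, that degree-4 polynomial in Horner form,
            | with the coefficient divisions folded into constants; the
            | cost argument stands either way:)
            | 
            |     fn exp_taylor4(x: f64) -> f64 {
            |         // 1 + x + x^2/2 + x^3/6 + x^4/24
            |         1.0 + x*(1.0 + x*(0.5 + x*(1.0/6.0 + x*(1.0/24.0))))
            |     }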
        
             | vanderZwan wrote:
             | > _Thus, the addition of logarithms is still O(d), not
             | O(log(d))._
             | 
             | Oh dear, yes that's wrong, thank you for catching that.
             | 
              | Still, O(d) is a lot better than O(d^2). And perhaps more
              | importantly: division (which is even more painful to
              | compute) is a simple matter of subtracting the logarithmic
              | transformations and then taking the antilogarithm.
             | 
             | > _Second, calculating log(x) and its inverse, exp(x) is
             | much more expensive than multiplication._
             | 
             | You're not wrong, but you're talking about the costs of a
             | computer doing the calculation. I was talking about a human
             | doing calculations by hand. What I forgot to mention is
             | that _Mirifici Logarithmorum Canonis Descriptio_ contained
             | 90 pages of precomputed tables. Multiplying two numbers is
             | then a matter of two look-ups, one addition, and another
              | look-up. For large enough numbers that's worth it (and
             | "large enough" is reached pretty quickly in the case of
             | calculations done by hand).
             | 
             | And shortly after the introduction of logarithms William
             | Oughtred invented the slide rule, using two log scales to
             | create an analog computer, removing the need for a table as
             | well (depending on the precision required)
             | 
             | https://en.m.wikipedia.org/wiki/Slide_rule
        
           | fragmede wrote:
           | > In fact I'd argue that they are the second-biggest
           | computational optimization in use today, with only positional
           | notation being a bigger deal
           | 
           | For the sake of argument, I think representing numbers in
           | binary is the most fundamental optimization in use today.
        
             | vanderZwan wrote:
             | You're arguing that positional notation using a specific
             | base is more fundamental than the concept of positional
             | notation itself?
             | 
             | Although I agree that base 2 is a great pick, of course,
             | for many reasons. So perhaps we should put binary in second
             | place, and logarithms in third.
        
               | fragmede wrote:
               | yeah. because positional notation is a general
               | mathematical concept, and it's only in practical
               | application that its power is realized. binary is the
               | most efficient base for computation with the technology
               | we have at hand. positional notation exists for any/every
                | base, but binary is the foundational optimization in
               | practice.
               | 
               | if I were to argue it, that is. In reality they're all
               | really important and it's not like there's a twisted math
               | genie that's forcing us to choose one or the other. I
                | felt like pointing out that choosing base two is also a
                | very important, fundamental optimization decision because
                | it's led us to the world we have today.
        
           | ChuckMcM wrote:
           | Wish I could up vote this more than once :-). It also is why
           | people had books of 'logs' back in the day that were just the
           | log10 value of a given number.
        
         | mlyle wrote:
         | Yes, and overflowing the mantissa moves the exponent in the
         | right direction by the right amount; just now the least
         | significant bits are not quite in the right place anymore. But
         | that's by definition a small error.
        
         | advisedwang wrote:
          | As the very first paragraph of the article says:
         | 
         | > You can multiply two numbers by adding their exponents. So
         | just with the exponent-addition you will be within a factor of
         | 2 of the right result. But this will actually be much better
         | and get within 7.5% of the right answer. Why?
        
           | zahlman wrote:
           | You can multiply two numbers by _multiplying their mantissas_
           | and adding their exponents.
           | 
           | Since the numbers are stored in base 2, the mantissa values
           | range from 1 to 2 (subnormals notwithstanding) - but they're
           | effectively stored as a 0..1 value. The effective range
            | produced by adding vs. multiplying them has considerable
           | overlap, and on average the difference won't be much. Because
           | of how the mantissa and exponent fields are arranged (a
           | deliberate choice by the standard, AFAIK), the mantissa
           | effectively adds additional bits to the approximation of the
           | logarithm given by the exponent.
           | 
           | Edit: this other comment chain
           | https://news.ycombinator.com/item?id=43034230 perhaps puts it
           | better.
        
         | emmelaich wrote:
         | Anyone who did high school before calculators (late 70s) would
         | have carried around a book of logarithmic tables so they could
         | do the multiplication necessary for many problems.
        
       | fatuna wrote:
       | Would subtraction also approximate division?
        
         | Jaxan wrote:
         | Yes
        
           | Timwi wrote:
           | Would negation also approximate the reciprocal?
        
             | vanderZwan wrote:
             | Yes. Logarithms are wonderful like that :)
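              | 
              | The bit-trick versions work too; a sketch of the division
              | one (positive normal f32s; 0x3f80_0000 is the exponent
              | bias shifted into place):
              | 
              |     fn approx_div(a: f32, b: f32) -> f32 {
              |         let bits = a.to_bits() as i64
              |             - b.to_bits() as i64 + 0x3f80_0000;
              |         f32::from_bits(bits as u32)
              |     }
              | 
              |     // approx_div(10.0, 4.0) happens to be exactly 2.5;
              |     // in general the result comes out a little high,
              |     // at worst around 12.5%.
              |     // (and 0x7f00_0000 - bits gives an approximate
              |     // reciprocal)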
        
       | rowanG077 wrote:
       | This can be very worth it in circuit design for custom
        | accelerators. Floating point operations are often multicycle.
        | If you can get close enough with addition it will save a ton of
       | resources and probably also simplify the design.
        
       | SushiHippie wrote:
       | Previous discussion of the paper:
       | 
       |  _Addition is all you need for energy-efficient language models_
       | - https://news.ycombinator.com/item?id=41784591 - _Oct 2024_ -
       | 126 comments
        
       | WhitneyLand wrote:
       | I wonder if there was any inspiration from Quake III.
       | 
       | Didn't the inverse square root trick rely on bit-level floating
       | point and subtraction of a bias?
        
       | akomtu wrote:
        | float32 (1+m)×2^e as int32 = e<<23 | m
        | 
        | (1+m1)×2^e1 × (1+m2)×2^e2 = (1+m1+m2+m1×m2)×2^(e1+e2)
        | 
        | If m1×m2 is small, that's approximately float32(m1+m2, e1+e2).
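        | 
        | In code the whole trick is one add and one bias correction (a
        | sketch for positive normal f32s; subtracting 0x3f80_0000 keeps
        | the exponent bias from being counted twice):
        | 
        |     fn approx_mul(a: f32, b: f32) -> f32 {
        |         let bits = a.to_bits() as i64
        |             + b.to_bits() as i64 - 0x3f80_0000;
        |         f32::from_bits(bits as u32)
        |     }
        | 
        |     // approx_mul(1.5, 2.0) == 3.0 (exact whenever a factor is
        |     // a power of two); otherwise the result is a little low,
        |     // at worst about 11% with this plain constant; a small
        |     // tweak to the constant can center the error band.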
        
       ___________________________________________________________________
       (page generated 2025-02-13 23:00 UTC)