[HN Gopher] Why Does Integer Addition Approximate Float Multipli...
___________________________________________________________________
Why Does Integer Addition Approximate Float Multiplication?
Author : ibobev
Score : 152 points
Date : 2025-02-09 18:36 UTC (4 days ago)
(HTM) web link (probablydance.com)
(TXT) w3m dump (probablydance.com)
| Animats wrote:
| Neat. Of course it works for the exponent, but that it's not that
| far off for the mantissa is unexpected. It helps that the
| mantissa is normalized.
| guyomes wrote:
| For the mantissa, the insight is probably that:
|
| (1+e1)*(1+e2) = 1+e1+e2+(e1*e2)
|
| If e1 and e2 are small, then e1*e2 is negligible.
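|
| A quick numeric sketch of that insight (my example, just
| illustrative):
|
|     // exact: (1+e1)*(1+e2); the approximation drops the e1*e2 term
|     let (e1, e2) = (0.1_f64, 0.1_f64);
|     let exact = (1.0 + e1) * (1.0 + e2); // 1.21
|     let approx = 1.0 + e1 + e2;          // 1.20
|     assert!((exact - approx - e1 * e2).abs() < 1e-15); // error is e1*e2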
| thequux wrote:
| Huh, and in the case that e1 and e2 are large, the mantissa
| overflows into the exponent, multiplying the result by 2.
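|
| A small Rust sketch of that carry (my example; 0x3f80_0000
| is the f32 exponent bias shifted into place):
|
|     // 1.5 * 1.5: both fraction fields are 0x400000, and adding
|     // them overflows into the exponent, bumping it by one.
|     let bits = 1.5_f32.to_bits() + 1.5_f32.to_bits() - 0x3f80_0000;
|     assert_eq!(f32::from_bits(bits), 2.0); // exact answer is 2.25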
| amelius wrote:
| I guess we're lucky that the IEEE 754 committee put the
| exponent in front of the mantissa. Or maybe they were aware
| of this trick all along ...
| tialaramex wrote:
| This arrangement makes ordering easier anyway which is
| probably why they chose it.
|
| If we have some positive 32-bit integers A and B then if
| A < B, f32::from_bits(A) < f32::from_bits(B)
|
| Edited to add anecdote:
|
| I actually had a bug in realistic (my Rust crate to
| implement Hans Boehm's "Towards an API for the Real
| Numbers") which I only fixed recently. When converting my
| Real number into a floating point type, I might end up
| with too many mantissa bits, and the fix is to increase
| the exponent by one; instead I was overwriting its bottom
| bit with a one, which coincidentally gives the same
| result whenever that bit happens to be zero.
|
| I finally caught that when it occurred to me to do
| "thorough testing". That is, take all possible 32-bit
| floats, turn them into a Real, then, turn that Real back
| into a 32-bit float, this should be a roundtrip, if it's
| not we've found a bug. There are "only" a few billion of
| these values to test, a machine can do that.
|
| I fixed the 64-bit routines for the same bugs, but I
| don't have enough years to run all the 64-bit tests the
| same way and discover any that aren't paralleled.
|
| [Of course there are presumably still bugs in the
| conversion algorithms which may either be symmetric, and
| thus the bug doesn't impact a round trip, or they don't
| affect the binary fractions used exclusively by floats;
| Real has proper rationals and various other things like
| the square root of integers, Pi, etc. which we can
| convert _to_ a float and lose precision but we never get
| these from converting a float because floats are always
| binary fractions.]
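|
| (The exhaustive test is easy to sketch in Rust; Real,
| Real::from and to_f32 below are hypothetical stand-ins
| for the crate's actual API:)
|
|     // round-trip every finite f32 through Real and back
|     for bits in 0..=u32::MAX {
|         let f = f32::from_bits(bits);
|         if !f.is_finite() { continue; } // skip NaNs and infinities
|         let r = Real::from(f);     // hypothetical conversion
|         let back = r.to_f32();     // hypothetical conversion
|         assert_eq!(f.to_bits(), back.to_bits()); // must round-trip exactly
|     }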
| xigoi wrote:
| > If we have some positive 32-bit integers A and B then
| if A < B, f32::from_bits(A) < f32::from_bits(B)
|
| That is, if both are valid floats that represent a finite
| non-zero number.
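|
| A quick check of the property, with that caveat (my
| example):
|
|     let (a, b) = (0x3f80_0000_u32, 0x4000_0000_u32); // bits of 1.0 and 2.0
|     assert!(a < b);
|     assert!(f32::from_bits(a) < f32::from_bits(b));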
| ChadNauseam wrote:
| > realistic (my Rust crate to implement Hans Boehm's
| "Towards an API for the Real Numbers")
|
| So cool, I didn't know anyone was working on this!
| Animats wrote:
| > I fixed the 64-bit routines for the same bugs, but I
| don't have enough years to run all the 64-bit tests the
| same way and discover any that aren't paralleled.
|
| Minor thought for today: there are more than 2^64 bits of
| RAM in the world.
| nabla9 wrote:
| It's math, and not just for integers. That's also how a
| slide rule works.
|
| a*b = 10^(log(a*b))
|
| log(a*b) = log(a) + log(b)
|
| thus a*b = 10^(log(a) + log(b))
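|
| The same identity as runnable Rust (illustrative):
|
|     let (a, b) = (3.0_f64, 7.0_f64);
|     let product = 10f64.powf(a.log10() + b.log10());
|     assert!((product - 21.0).abs() < 1e-12);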
| Sharlin wrote:
| Yes, but what is nontrivial here is how good the approximation
| is, despite the mantissa part being linear (much better than
| the upper bound of being within 2x of the real value).
| shiandow wrote:
| Nontrivial yes, but it's not _that_ complicated to figure out
| that floats are a piecewise linear approximation of the base
| 2 logarithm.
|
| Edit: or base 2^k, depends a bit on how you view it.
| advisedwang wrote:
| The very first paragraph of the article says this and then goes
| on to say the trick gets the mantissa roughly right too:
|
| > You can multiply two numbers by adding their exponents. So
| just with the exponent-addition you will be within a factor of
| 2 of the right result. But this will actually be much better
| and get within 7.5% of the right answer. Why?
| sethaurus wrote:
| This is the core trick in the Fast Inverse Square Root[0], as
| made famous by the Quake III source code.
|
| It uses a shift (equivalent to dividing) and a subtraction in
| integer-land to estimate x^(-0.5) in float-land.
|
| [0]: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root
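|
| In Rust it looks roughly like this (a sketch, not
| production code; the magic constant is the one from the
| Quake III source):
|
|     fn fast_inv_sqrt(x: f32) -> f32 {
|         let i = 0x5f3759df_u32 - (x.to_bits() >> 1); // shift + subtract in integer-land
|         let y = f32::from_bits(i);                   // back to float-land
|         y * (1.5 - 0.5 * x * y * y)                  // one Newton-Raphson refinement step
|     }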
| Timwi wrote:
| I've always hated that it's called an "inverse". It's not an
| inverse. The inverse of square root is square. If they had
| called it "reciprocal", it would have been clear to me what it
| does, but "inverse" confused the hell out of me when I first
| saw it.
| Someone wrote:
| It's confusing for non-mathematicians, but (and you may know
| that) it is not incorrect.
| https://en.wikipedia.org/wiki/Inverse_element:
|
| _"In mathematics, the concept of an inverse element
| generalises the concepts of opposite (-x) and reciprocal (1
| /x) of numbers."_
| disconcision wrote:
| I think this is just typical natural language ambiguity:
|
| "Inverse Square root" is used to mean "(Multiplicative)
| Inverse (of the) Square root" as opposed to "(Square root)
| Inverse"
| bmacho wrote:
| No, it's confusing for mathematicians because it is
| directly opposing math terminology. (Which is okay, but
| confusing.)
| Grimblewald wrote:
| It doesn't, though: strictly speaking, "inverse" means
| the multiplicative inverse. The idea was eventually
| extended to inverse functions, which for brevity we also
| call "the inverse". Now f^{-1}(x) and f(x)^{-1} are
| clearly different beasts: the first would describe the
| square (for square root), while the second describes what
| Quake was doing. Whether you call something an inverse or
| its full name (multiplicative/functional inverse) depends
| on whether it's unclear from context which was meant.
| throwawaymaths wrote:
| The function inverse _is_ the multiplicative inverse in
| the group of automorphisms over sets (when the
| multiplication operation is functional composition).
| cluckindan wrote:
| Found the Haskell guy.
| seeknotfind wrote:
| Right, the bang-per-syllable loss is that it's ambiguous,
| not that it's incorrect. "Inverse square root" could also
| mean -(x^0.5) if it meant the additive inverse, and it
| could mean x^2 if it meant the functional inverse, as
| said here.
| manojlds wrote:
| The inverse is about the negative power, right? The
| square root is the 0.5.
| vrighter wrote:
| "inverse" has a very specific meaning: inverse(x) * x = 1
|
| x^2 * x != 1 for any x other than 1. So no, x^2 is not the
| inverse of sqrt(x)
| tczMUFlmoNk wrote:
| But _sqrt_ * _square_ = 1, where "_sqrt_: R+ -> R+" is
| the square root operation, "_square_: R+ -> R+" is the
| squaring operation, "1: R+ -> R+" is the identity
| operation, and (*) is function composition: i.e., working
| in the monoid of functions R+ -> R+. So x^2 and sqrt(x),
| as elements of R, are not inverses, but _sqrt_ and
| _square_, as elements of R+ -> R+, are.
|
| It depends on how you parse the phrase "fast inverse
| square root".
| xigoi wrote:
| This is true, but interpreting it one way is clearly
| absurd (nobody would call a square the "inverse square
| root"), which implicates that the other meaning was
| intended.
| quietbritishjim wrote:
| No. "inverse" has a very specific meaning: the operation
| that undoes whatever operation you're talking about.
|
| https://en.m.wikipedia.org/wiki/Inverse_function
|
| The inverse of multiplying by 3 is dividing by 3, and vice-
| versa.
|
| The inverse of squaring is the square root, and vice-versa.
|
| The inverse _of a number_ or other element (rather than an
| operation / function) means whatever number would form the
| inverse if you use it with whatever binary operation you're
| talking about. So the additive inverse of 3 is -3, the
| inverse of "rotate clockwise 90 degrees" in the group of
| geometric operations is "rotate anticlockwise 90 degrees",
| and the multiplicative inverse of 3 is 1/3.
|
| Saying "the inverse of 3" to mean 1/3, i.e. its
| multiplicative inverse, is sloppy but acceptable so long as
| everyone knows what you mean (and it's fairly common).
| Saying "the inverse square root" to mean 1/square root is
| just wrong.
| sunk1st wrote:
| Is it acceptable to use the imperative? That is, to say:
| "Now, invert the element". Or should it always be, "take
| the inverse of ..."?
| bee_rider wrote:
| Interestingly Intel got it right, thus the rsqrt family of
| instructions.
| empath75 wrote:
| Since calling the square an inverse square root makes zero
| sense in any practical context, the other common meaning
| being applied is obvious. In this case, it's not the inverse
| of a _function_, it's the inverse of a _number_, which is
| usually considered to be the same as the reciprocal.
| throwawaymaths wrote:
| As a mathematician: it's fine. The word "multiplicative"
| is just silent in the phrase.
| HPsquared wrote:
| The most significant bits of a floating point representation are
| basically a logarithm of the number. Logarithms have this
| relation between multiplication and addition.
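|
| A concrete Rust sketch of that view (my example; 23 is
| the f32 fraction width, 127 the exponent bias):
|
|     let x = 6.0_f32;
|     let as_log2 = x.to_bits() as f32 / (1u32 << 23) as f32 - 127.0;
|     // as_log2 is 2.5, a piecewise-linear take on log2(6) ~ 2.585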
| vanderZwan wrote:
| To give a "yes and" side-track to your comment: saying _"
| logarithms have this relation between multiplication and
| addition"_ is even underselling what logarithms are, because
| reducing multiplication to an additive operation was the whole
| motivation for John Napier[0] to discover/invent logarithms:
|
| > _"...nothing is more tedious, fellow mathematicians, in the
| practice of the mathematical arts, than the great delays
| suffered in the tedium of lengthy multiplications and
| divisions, the finding of ratios, and in the extraction of
| square and cube roots... [with] the many slippery errors that
| can arise...I have found an amazing way of shortening the
| proceedings [in which]... all the numbers associated with the
| multiplications, and divisions of numbers, and with the long
| arduous tasks of extracting square and cube roots are
| themselves rejected from the work, and in their place other
| numbers are substituted, which perform the tasks of these
| rejected by means of addition, subtraction, and division by two
| or three only."_ [1]
|
| Logarithms were honestly an _enormous_ breakthrough in
| optimization; computers wouldn't be remotely as useful
| without them, even if most of us don't "see" the
| logarithms being used.
|
| In fact I'd argue that they are the second-biggest
| computational optimization in use today, with only positional
| notation being a bigger deal. Which, funny enough, works kind
| of similarly: imagine you could only count by tallying (so,
| unary). _Adding two numbers M and N would take M+N operations_,
| e.g. 1234 + 5678 would require counting all 6912 individual
| digits. Unary math scales O(n) in both data and computation.
| Systems like Roman numerals almost work, but as soon as we
| reach values larger than the largest symbol (M for 1000) it's
| O(n) again, just with a better constant factor.
|
| With positional notation numbers require only log(n) symbols to
| write down, and log(n) operations for addition, e.g. 1234 +
| 5678 requires one or two additions for each digit pair in a
| given position - one addition if there's no carry from the
| previous addition, two if there is. So addition takes at
| most 2 * ceil( max( log(M), log(N) ) ) operations, i.e.
| O(log n).
|
| Logarithms take that idea and "recursively" apply it to the
| notation, making the same optimization work for multiplication.
| Without it, the naive algorithm for the multiplication of two
| numbers requires iterating over each digit, e.g. 1234 x 5678
| requires multiplying each of the four digits of the first
| number with each of the digit of the second number, and then
| adding all the resulting numbers. It scales O(di * dj),
| where di and dj are the number of digits of each number.
| If they're the same we can simplify that to O(d^2). When
| the numbers are represented as
| two logarithms the operation is reduced to adding two numbers
| again, so O(log(d) + [whatever the log/inverse log conversion
| cost is]). Of course d is a different value here and the number
| of digits used affects the precision.
|
| I think the craziest thing of all this is that we're so used to
| positional notation that _nobody ever seems to consider it a
| data compression technique_. Even though _almost no other data
| compression method would work without it as a building block_
| (run-length encoding, Lempel-Ziv, arithmetic coding? Useless
| without positional notation's O(log(n)) scaling factor). The
| only exceptions are data compression methods that are based on
| inventing their own numerical notation[2].
|
| We do this every day ever since we first learned addition and
| subtraction as kids. Or as David Bessis[3] puts it in his book
| "Mathematica": ask almost any adult what one billion minus one
| is and they know the answer instantaneously, so most adults
| would appear to have mental superpowers in the eyes of pretty
| much all mathematicians before positional notation was invented
| (well, everyone except Archimedes maybe[4]). Positional
| notation is magical, we're all math wizards, and it's so
| normalized that we don't even realize it.
|
| But to get back to your original point: yes, you are entirely
| correct. IEEE floats are a form of lossy compression of
| fractions, and the basis of that lossy compression is
| logarithmic notation (but with a fixed number of binary digits
| and some curious rules for encoding other values like NaN and
| infinity).
|
| [0] https://en.wikipedia.org/wiki/John_Napier
|
| [1]
| https://en.wikipedia.org/wiki/Mirifici_Logarithmorum_Canonis...
|
| [2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems
|
| [3] https://www.quantamagazine.org/mathematical-thinking-isnt-
| wh...
|
| [4] https://en.wikipedia.org/wiki/The_Sand_Reckoner
| esafak wrote:
| And if I may take this yet a step further: this is the point
| of mathematical transformations, to find a domain in which
| desirable operations are easier, such that the cost of the
| round trip to and from that domain is offset.
|
| In the past, transformations, like logarithms, Fourier
| transforms, wavelets, had to be proposed. Enabled by advances
| in computers, machine learning automated all that away by
| composing differentiable building blocks into universal
| estimators. The parameters of these building blocks are
| estimated through gradient descent in conjunction with a
| user-chosen loss function, which guides the optimization of
| the transform. Good representations can be manipulated
| through basic algebra (added, averaged, or compared for
| similarity, depending on the task) in a way that
| corresponds to semantic operations where their raw,
| untransformed representations cannot.
| HPsquared wrote:
| I'm still amazed at word embeddings. They let you find
| related words, and even do addition and subtraction.
|
| The way you can supposedly take "king", subtract "man" and
| add "woman" and you get "queen", and this kind of thing is
| the mathematical basis of LLMs.
| otabdeveloper4 wrote:
| That's an urban legend and not how embeddings work.
| CamperBob2 wrote:
| It's my understanding of how embeddings work, as well.
| The dot products (cosine similarity) of man . woman end
| up being very similar to king . queen. This similarity
| could be considered subtraction if you stretch the
| metaphor enough.
|
| If this is not a valid way to think about embedding
| vectors, would you care to elaborate?
| taejo wrote:
| Excellent comment!
|
| As a historical tidbit I'll add that Romans did develop two
| ways to write larger numbers.
|
| 1. Writing a line (vinculum) over a numeral to multiply its
| value by 1,000. This was in fact extended to writing a line
| to the left and above a numeral to multiply its value by
| 1,000,000, and could in principle be extended to lines below
| and to the right to multiply by 10^9 and 10^12, and even
| nested boxes for larger powers.
|
| 2. The use |), |)), |))), ... for 500, 5,000, 50,000, ... and
| (|), ((|)), (((|))), ... for 1,000, 10,000, 100,000, ...
| These can be continued indefinitely.
|
| https://en.wikipedia.org/wiki/Roman_numerals#Large_numbers
|
| Both require an ever increasing number of marks just to write
| the increasing powers, as well as an ever increasing number
| of powers being summed, but both increase only
| logarithmically, so we end up using O((log n)^2) marks to
| write n. This is quadratically worse than positional
| notation, but exponentially better than just writing M over
| and over.
| vanderZwan wrote:
| Wow, that's amazing! I had somehow never heard of this
| before despite being into numeral systems. Thank you for
| sharing!
| lanstin wrote:
| If you like numeral systems, you might be interested in
| packages that let one do arithmetic using continued
| fractions: https://ieeexplore.ieee.org/document/6158099
| thaumasiotes wrote:
| Note that our use of D and M for the "Roman numerals" 500
| and 1000 is a misinterpretation of their |) and (|).
| hathawsh wrote:
| I remember the glee I felt when I first used logarithms to
| represent large numbers in Excel. Suddenly, I could easily
| approximate large factorials, and they were fast! I wondered
| how many other people have discovered this trick. I couldn't
| find many references on the Internet. Over time, I realized
| this method of handling large numbers is used so often that
| most people who know about it don't bother to talk about it.
| groby_b wrote:
| This _should_ have been part of the HS curriculum.
|
| Logarithms pop up (IIRC) in 10th grade, and ln(x*y) = ln(x)
| + ln(y) is usually explained as part of that. What many
| teachers completely fail to teach is that it's not just a
| formula to memorize, but that it has profound implications
| on how you can do math. As you discovered by yourself.
| (Kudos!)
|
| It is, with the right teacher, a super-exciting story with
| lots of colorful history & characters. You can even guide
| your students to come to that crucial insight themselves. Alas,
| few teachers have the necessary amount of passion to make
| that come alive.
|
| But that's probably why few people write about it - for
| most, either their math interest got murdered in HS so they
| never get to look at it, or they did learn it in 10th grade
| and so consider it not worth mentioning.
|
| (Corollary - we all should write more about the cool
| realizations we had, even if in retrospect they're "well
| known")
| prerok wrote:
| I must admit I wanted to write that all of that is true,
| but that calculating logarithms is difficult. Then I
| looked it up and found that the slide rule was invented
| shortly thereafter:
|
| https://en.m.wikipedia.org/wiki/Slide_rule
| xyzzyz wrote:
| _When the numbers are represented as two logarithms the
| operation is reduced to adding two numbers again, so O(log(d)
| + [whatever the log/inverse log conversion cost is])._
|
| This is pretty confused. First, if d is the number of
| digits of a number, then you will need O(d) memory to
| store its logarithm, otherwise you will lose a lot of
| precision. Thus, the addition of logarithms is still O(d),
| not O(log(d)).
|
| Second, calculating log(x) and its inverse, exp(x) is much
| more expensive than multiplication. For example, if x is
| between 0 and 1, you can approximate exp(x) pretty well as 1
| + x + x^2/2 + x^3/6 + x^4/24. This is 3 multiplications and 3
| divisions just to convert.
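|
| (Tangent: with Horner's scheme and precomputed
| coefficients, that polynomial needs 4 multiplications and
| no runtime divisions; a sketch:)
|
|     // exp(x) ~ 1 + x + x^2/2 + x^3/6 + x^4/24, in nested form
|     fn exp_approx(x: f64) -> f64 {
|         1.0 + x * (1.0 + x * (0.5 + x * (1.0 / 6.0 + x * (1.0 / 24.0))))
|     }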
| vanderZwan wrote:
| > _Thus, the addition of logarithms is still O(d), not
| O(log(d))._
|
| Oh dear, yes that's wrong, thank you for catching that.
|
| Still, O(d) is a lot better than O(d^2). And perhaps more
| importantly: division (which is even more painful to
| compute) is a simple matter of subtracting the logarithmic
| transformations and then taking the exponent.
|
| > _Second, calculating log(x) and its inverse, exp(x) is
| much more expensive than multiplication._
|
| You're not wrong, but you're talking about the costs of a
| computer doing the calculation. I was talking about a human
| doing calculations by hand. What I forgot to mention is
| that _Mirifici Logarithmorum Canonis Descriptio_ contained
| 90 pages of precomputed tables. Multiplying two numbers is
| then a matter of two look-ups, one addition, and another
| look-up. For large enough numbers that's worth it (and
| "large enough" is reached pretty quickly in the case of
| calculations done by hand).
|
| And shortly after the introduction of logarithms William
| Oughtred invented the slide rule, using two log scales to
| create an analog computer, removing the need for a table as
| well (depending on the precision required).
|
| https://en.m.wikipedia.org/wiki/Slide_rule
| fragmede wrote:
| > In fact I'd argue that they are the second-biggest
| computational optimization in use today, with only positional
| notation being a bigger deal
|
| For the sake of argument, I think representing numbers in
| binary is the most fundamental optimization in use today.
| vanderZwan wrote:
| You're arguing that positional notation using a specific
| base is more fundamental than the concept of positional
| notation itself?
|
| Although I agree that base 2 is a great pick, of course,
| for many reasons. So perhaps we should put binary in second
| place, and logarithms in third.
| fragmede wrote:
| Yeah, because positional notation is a general
| mathematical concept, and it's only in practical
| application that its power is realized. Binary is the
| most efficient base for computation with the technology
| we have at hand. Positional notation exists for any/every
| base, but binary is the foundational optimization in
| practice.
|
| If I were to argue it, that is. In reality they're all
| really important, and it's not like there's a twisted math
| genie forcing us to choose one or the other. I
| felt like pointing out that choosing base two is also a
| very important, fundamental optimization decision, because
| it's led us to the world we have today.
| ChuckMcM wrote:
| Wish I could upvote this more than once :-). It's also why
| people had books of 'logs' back in the day that were just
| the log10 values of given numbers.
| mlyle wrote:
| Yes, and overflowing the mantissa moves the exponent in the
| right direction by the right amount; just now the least
| significant bits are not quite in the right place anymore. But
| that's by definition a small error.
| advisedwang wrote:
| As the very first article says:
|
| > You can multiply two numbers by adding their exponents. So
| just with the exponent-addition you will be within a factor of
| 2 of the right result. But this will actually be much better
| and get within 7.5% of the right answer. Why?
| zahlman wrote:
| You can multiply two numbers by _multiplying their mantissas_
| and adding their exponents.
|
| Since the numbers are stored in base 2, the mantissa values
| range from 1 to 2 (subnormals notwithstanding) - but they're
| effectively stored as a 0..1 value. The effective ranges
| produced by adding vs. multiplying them have considerable
| overlap, and on average the difference won't be much. Because
| of how the mantissa and exponent fields are arranged (a
| deliberate choice by the standard, AFAIK), the mantissa
| effectively adds additional bits to the approximation of the
| logarithm given by the exponent.
|
| Edit: this other comment chain
| https://news.ycombinator.com/item?id=43034230 perhaps puts it
| better.
| emmelaich wrote:
| Anyone who did high school before calculators (late 70s) would
| have carried around a book of logarithmic tables so they could
| do the multiplication necessary for many problems.
| fatuna wrote:
| Would subtraction also approximate division?
| Jaxan wrote:
| Yes
| Timwi wrote:
| Would negation also approximate the reciprocal?
| vanderZwan wrote:
| Yes. Logarithms are wonderful like that :)
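|
| And in bit-land, too (a Rust sketch mirroring the
| article's addition trick, with the bias added back
| instead of subtracted):
|
|     // approximate 7.0 / 2.0 by subtracting bit patterns
|     let bits = 7.0_f32.to_bits() - 2.0_f32.to_bits() + 0x3f80_0000;
|     assert_eq!(f32::from_bits(bits), 3.5); // exact here, since 2.0's fraction is zero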
| rowanG077 wrote:
| This can be very worth it in circuit design for custom
| accelerators. Floating point operations are often multicycle. If
| you can get close enough with addition it will save a ton of
| resources and probably also simplify the design.
| SushiHippie wrote:
| Previous discussion of the paper:
|
| _Addition is all you need for energy-efficient language models_
| - https://news.ycombinator.com/item?id=41784591 - _Oct 2024_ -
| 126 comments
| WhitneyLand wrote:
| I wonder if there was any inspiration from Quake III.
|
| Didn't the inverse square root trick rely on bit-level floating
| point and subtraction of a bias?
| akomtu wrote:
| A float32 with value (1+m) * 2^e is stored as the int32
| bit pattern (e << 23) | m (sign bit and exponent bias
| ignored).
|
| (1+m1)*2^e1 * (1+m2)*2^e2 = (1+m1+m2+m1*m2) * 2^(e1+e2)
|
| If m1*m2 is small, that's approximately float32(m1+m2, e1+e2).
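|
| Or, as a runnable Rust sketch of the article's trick
| (sign bit ignored; 0x3f80_0000 cancels the doubled
| exponent bias):
|
|     fn approx_mul(a: f32, b: f32) -> f32 {
|         // integer addition adds the exponents and the fraction fields
|         f32::from_bits(a.to_bits() + b.to_bits() - 0x3f80_0000)
|     }
|     // approx_mul(1.1, 1.2) returns ~1.3 (exact answer: 1.32)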
___________________________________________________________________
(page generated 2025-02-13 23:00 UTC)