[HN Gopher] Did any processor implement an integer square root i...
___________________________________________________________________
Did any processor implement an integer square root instruction?
Author : rwallace
Score : 185 points
Date : 2024-04-07 10:55 UTC (12 hours ago)
(HTM) web link (retrocomputing.stackexchange.com)
(TXT) w3m dump (retrocomputing.stackexchange.com)
| fjfaase wrote:
| Is it possible in a single clock cycle? Yes, with a very large
| lookup table. It is probably possible to reduce the size
| depending on how many serial logic gates can be evaluated within
| the clock cycle. Note that the binary square root of 10000 is
| rather similar to that of 100, just with a different number of
| zeros.
| WithinReason wrote:
| Sounds like it's possible to run any algorithm in the world in
| 1 clock cycle.
| robinduckett wrote:
| I like the Quantum BogoSort as a proof of this /s
| ajb wrote:
| It isn't, because eventually the size of your logic or table
| becomes larger than the distance a signal can propagate in
| one clock tick. Before that, it likely presents practical
| issues (eg, is it worth dedicating that much silicon)
| willcipriano wrote:
| Have slower ticks. A planet size CPU that runs at .5 hz but
| can work on impossibly large numbers.
| gameshot911 wrote:
| That's actually a really fascinating science fiction
| idea!
| willcipriano wrote:
| It's sort of the plot of the Douglas Adams books.
|
| https://hitchhikers.fandom.com/wiki/Earth
| com2kid wrote:
| https://en.wikipedia.org/wiki/Matrioshka_brain
|
| :-D
|
| It's a topic that has been explored quite a bit in
| science fiction literature.
| karmakaze wrote:
| Related to the fallacy of comparing what's slow in our
| world with what's computable in a simulation of it--
| there's no requirement for time to tick similarly.
| volemo wrote:
| And 27.7% [1] of the planet's crust is silicon already!
|
| [1] Britannica: Silicon
| rcxdude wrote:
| With a sufficiently large chip and a sufficiently slow clock,
| sure.
| wheybags wrote:
| "Give me a LUT large enough and an infinite period of time
| to execute, and I shall simulate the world"
| dartos wrote:
| "Every cycle"
| retrac wrote:
| Yes. In theory, any pure function can be turned into a lookup
| table. And any lookup table that isn't just random numbers
| can be turned into a more compact algorithm that spends
| compute to save space.
|
| Such tables may be infeasible, though. While an int8 -> int8
| table only needs 256 bytes, an int32 -> int32 table needs 16
| gigabytes.
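A minimal Python sketch of the small end of that trade-off (names here are illustrative): the entire uint8 -> uint8 integer square root fits in a 256-byte table.

```python
import math

# Full uint8 -> uint8 integer square root as a 256-byte lookup table.
ISQRT8 = bytes(math.isqrt(x) for x in range(256))

def isqrt8(x: int) -> int:
    """Integer square root of an 8-bit value by table lookup."""
    return ISQRT8[x]
```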
| DaiPlusPlus wrote:
| Fractal functions are pure, but don't lend themselves well
| to memoization nor lookup tables.
| bonzini wrote:
| You can compute the integer square root in n/2 iterations where
| n is the number of bits in the source using just shifts and
| adds. For each step, check if a new bit has to be set in the
| result n_old by computing
|
| n2_new = (n_old + (1 << bit))^2 = n2_old + (n_old << (bit + 1))
| + (1 << (bit*2))
|
| Then compare it with the source operand; if the source is
| greater or equal: 1) set the bit in the result 2) update n2_old
| to n2_new
|
| It can be done in n/2 or perhaps n clock cycles with a suitable
| microcode instruction set and ALU. With some effort it can be
| optimized to reduce n to the index of the leftmost set bit in
| the operand.
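A Python sketch of the shift-and-add loop described above, assuming a 32-bit source (one trial bit per step, n/2 steps):

```python
def isqrt_shift_add(src: int, n: int = 32) -> int:
    """Integer square root using only shifts, adds and compares.
    Tracks n2_old = root**2 incrementally via
    (root + 2**bit)**2 = n2_old + (root << (bit+1)) + (1 << 2*bit)."""
    root = 0   # n_old
    sq = 0     # n2_old
    for bit in range(n // 2 - 1, -1, -1):
        trial = sq + (root << (bit + 1)) + (1 << (2 * bit))  # n2_new
        if trial <= src:       # the new bit belongs in the result
            root |= 1 << bit
            sq = trial
    return root
```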
| masswerk wrote:
| Compare the integer square root algorithm used in "Spacewar!"
| [1]. So, even by 1960 it should have been possible to
| implement a square root step instructions for each bit, much
| like division or multiplication shifts, and progress from
| this to full-fledged automatic operations by the use of a sub
| timing network. (I guess, it really depends on the economics
| of the individual use case, whether the effort does pay off
| or not, as you would amass a few additional hardware modules
| to accomplish this.)
|
| [1] https://www.masswerk.at/spacewar/inside/insidespacewar-
| pt6-g...
| Findecanor wrote:
| _Floating_ _point_ _reciprocal_ _square_ _root_ _estimate_
| (`frsqrte`) instructions are typically implemented as just such
| a table lookup, indexed by a few bits of the fraction and the
| LSB of the exponent. The precision is typically limited to
| something similar to bf16 (ARM, RISC-V) or fp16 (x86), so
| programs are expected to do a few Newton-Raphson iterations
| afterwards if they want more.
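A sketch (in Python, with illustrative starting values) of the Newton-Raphson refinement step that programs run after such an estimate instruction; each iteration roughly doubles the number of correct bits of y ≈ 1/sqrt(x).

```python
def rsqrt_refine(x: float, y: float) -> float:
    # One Newton-Raphson step for f(y) = 1/y**2 - x,
    # which pulls the estimate y toward 1/sqrt(x).
    return y * (1.5 - 0.5 * x * y * y)

# Start from a deliberately crude estimate of 1/sqrt(2), refine twice.
y = 0.7
for _ in range(2):
    y = rsqrt_refine(2.0, y)
# y is now accurate to roughly single precision
```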
| benlivengood wrote:
| It's not as bad for integer square root; you only need to store
| about N^0.5 entries (the square of each possible answer) and
| find the result by comparison. Feasible for 16-bit integers,
| maybe for 32-bit, not for 64-bit.
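A Python sketch of that comparison table (names illustrative): store the square of every possible answer, then binary-search it. 256 entries cover all 16-bit inputs, since isqrt(0xFFFF) == 255.

```python
import bisect

# Squares of every possible 8-bit answer.
SQUARES = [k * k for k in range(256)]

def isqrt16(x: int) -> int:
    """Largest k with k*k <= x, found by binary search
    over the table of squares."""
    return bisect.bisect_right(SQUARES, x) - 1
```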
| m463 wrote:
| so, dumb question.
|
| do lookups in large tables ever (practically, not
| theoretically) take one clock cycle?
|
| If there's a large lookup table, it would have to come from
| memory, which might mean cache and memory hierarchy delays,
| right?
| treprinum wrote:
| Can't you use the sequence 1 + 3 + 5 + ... + (2k - 1) to get
| the integer square root of any integer? Since the sum of the
| first k odd numbers is k^2, the answer is the largest k whose
| partial sum still fits under your number.
| HPsquared wrote:
| And this is one of those "embarrassingly parallel" tasks.
| sublinear wrote:
| Yes. On a desert island we can have the whole village
| construct this table for newton-raphson guesses.
|
| Combined with a cutting tool attached to a worm drive we will
| precisely count our turns (big radius crank for extra
| precision!) and begin manufacture of slide rules. Can never
| have too many scales and this is just one we shall etch into
| them!
| maxcoder4 wrote:
| Can you explain your idea? Your algorithm is correct by
| definition, but doing this naively would be very slow (even for
| 32bit number). At this point it would be much faster to just
| binsearch it.
| __s wrote:
| For an example of binsearch algo, I recently dipped into this
| while switching some code from floats to fixed point
| arithmetic (reducing overall wasm blob size)
|
| https://github.com/serprex/openEtG/blob/2011007dec2616d1a24d.
| ..
|
| Tho I could probably save binary size more by importing
| Math.sqrt from JS
| smcameron wrote:
| Recently, I had reason to dig around for a fixed point
| square root algorithm and found this: https://groups.google
| .com/g/comp.lang.c/c/IpwKbw0MAxw/m/N1xh...
| atq2119 wrote:
| And you would iterate through that sequence? That's exponential
| time in the bit length of the input...
| sublinear wrote:
| It's sqrt(n) - 1 additions for the n you're trying to get the
| integer square root of. Memoization would make it constant
| time for any lesser n than the greatest n you've done this
| for. For greater n it's sqrt(new_big_n - prev_big_n) - 1 more
| additions to memoize.
|
| You're right this isn't practical, but fun to think about.
| Good refresher for those out of school for a while.
| klyrs wrote:
| Subtle detail of complexity analysis: the input is of
| length log(n), so taking n^k steps for any positive number
| k is exponential time.
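For completeness, the odd-number method from this subthread as a Python sketch: O(sqrt(n)) additions, so exponential in the bit length as noted above, but correct.

```python
def isqrt_odds(n: int) -> int:
    """Count how many successive odd numbers fit into n:
    1 + 3 + ... + (2k - 1) = k**2, so the count is floor(sqrt(n))."""
    k, odd = 0, 1
    while odd <= n:
        n -= odd
        odd += 2
        k += 1
    return k
```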
| tomatocracy wrote:
| Better might be to use the expansion (x+y)^2=x^2+2xy+y^2 along
| with the observation that in any base, the square root of a
| 2n-digit number is at most n digits, as in the common method
| for calculating a square root "by hand" with pen and paper. If
| you did this 8 bits at a time, then you would only need a
| lookup table for roots of 8-bit numbers.
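The base-2 specialisation of that pen-and-paper method, as a sketch: each iteration brings down one more bit pair of the operand (a byte-at-a-time variant would instead seed from an 8-bit table as suggested).

```python
def isqrt_bitpair(x: int) -> int:
    """Digit-by-digit square root in base 2, using the expansion
    (r + b)**2 = r**2 + 2*r*b + b**2 via a running remainder.
    Valid for 32-bit inputs (largest trial bit is 4**15)."""
    res = 0
    bit = 1 << 30            # highest power of four for 32-bit inputs
    while bit > x:
        bit >>= 2
    while bit:
        if x >= res + bit:
            x -= res + bit
            res = (res >> 1) + bit
        else:
            res >>= 1
        bit >>= 2
    return res
```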
| moomin wrote:
| You need to read down a bit, but the answer "ENIAC" is hilarious.
| xyst wrote:
| hardware engineer humor lol
| corsix wrote:
| AArch64 NEON has the URSQRTE instruction, which gets closer to
| the OP's question than you might think; view a 32-bit value as
| a fixed-point number with 32 fractional bits (so the
| representable range is evenly spaced 0 through 1-e, where
| e=2^-32), then URSQRTE computes the approximate inverse square
| root, halves it, then clamps it to the range 0 through 1-e.
| Fixed-point numbers aren't quite integers, and approximate
| inverse square root isn't quite square root, but it might get
| you somewhere close.
|
| The related FRSQRTE instruction is much more conventional,
| operating on 32-bit floats, again giving approximate inverse
| square root.
| voidbert wrote:
| What task benefits from using such a complex instruction, one
| so easily divisible into simpler ones, for it to be present in
| aarch64?
| nanidin wrote:
| Neon is SIMD so I would presume these instructions let you
| vectorize those calculations and do them in parallel on a lot
| of data more efficiently than if you broke it down into
| simpler operations and did them one by one.
| voidbert wrote:
| Yes, but the part that got me was the halving of the result
| followed by the clamping. SIMD generally makes sense, but
| for something like this to exist usually there's something
| very specific (like a certain video codec, for example)
| that greatly benefits from such a complex instruction.
| creato wrote:
| It's probably not about avoiding extra
| instructions/performance, but making the range of the
| result more useful and avoiding overflow. Or in other
| words, the entire instruction may be useless if you don't
| do these things.
| epcoa wrote:
| The halving and clamping is nothing particularly
| remarkable in the context of usefully using fixed point
| numbers (scaled integers) to avoid overflow. Reciprocal
| square root itself is a fundamental operation for DSP
| algorithms and of course computer graphics. This is a
| fairly generic instruction really, though FRSQRTE likely
| gets more real world use.
| colechristensen wrote:
| Inverse square root is for normalizing vectors particularly
| in computer graphics calculations, it needs to be run a whole
| lot very fast.
|
| https://en.m.wikipedia.org/wiki/Fast_inverse_square_root#Mot.
| ..
| hinkley wrote:
| Famously the magic constant in the Quake engine that nobody
| remembers inventing.
|
| That article does say there's an SSE instruction rsqrtss
| that is better.
| dahart wrote:
| For an approximate (very rough) answer, as opposed to one
| accurate to the nearest integer, a right shift by half the
| position of the leading 1 will do, and of course nearly every
| processor has a shift instruction. I'm not sure how often
| processors have lacked something like a FLO (Find Leading One)
| or FFS (Find First Set) instruction; those seem ubiquitous as
| well.
|
| The super rough approximation for some uses can be approximately
| as good as an accurate answer. When you just need a decent
| starting place for some further Newton-Raphson iteration, for
| example. (Of course the right-shift trick is a nice way to seed a
| more accurate square root calculation. :P)
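A sketch of the right-shift seed plus Newton refinement described above, using Python's bit_length() in place of an FLO instruction:

```python
def rough_sqrt(x: int) -> int:
    # Right shift by half the leading-one position: the position is
    # ~log2(x), and halving the exponent approximates the square root.
    return x >> ((x.bit_length() - 1) // 2) if x else 0

def isqrt_newton(x: int) -> int:
    """Refine the rough seed with integer Newton-Raphson steps,
    then fix up any off-by-one from integer truncation."""
    if x == 0:
        return 0
    g = rough_sqrt(x)
    while True:
        ng = (g + x // g) // 2
        if ng >= g:              # stopped improving
            break
        g = ng
    while g * g > x:             # final exactness fix-up
        g -= 1
    while (g + 1) * (g + 1) <= x:
        g += 1
    return g
```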
| lordnacho wrote:
| Is this where the DOOM reference comes in? Somewhat famous
| Internet story by now featuring Carmack and a magic 32 bit
| number.
| dahart wrote:
| Not really, that's a very clever trick used on floating point
| numbers, and does the approximate reciprocal square root.
|
| This right-shift thing is far simpler, not very clever,
| doesn't involve magic numbers, and is much more well known
| than the "Quake trick". There are easy ways to see it. One
| would be that multiplying a number by itself approximately
| doubles the number of digits. Therefore halving the number of
| digits is approximately the square root. You can get more
| technical and precise by noting that FFS(n) =~ log2(n), and
| if you remember logs, you know that exp(log(n)/2) = n^(1/2),
| so shifting right by FFS(n)/2 is just mathematically
| approximately a square root.
| creato wrote:
| They are more closely related than you suggest. Both
| methods are using an approximation of log2 to get an
| initial guess. One gets it from "FFS(n)", the other gets it
| from the floating point representation of n, where you can
| roughly find the log2(n) by looking at the exponent of n in
| the float representation. You can also use the mantissa to
| refine the guess further.
| dahart wrote:
| They are related a little, around the log2 (exponent), I
| totally agree. I guess I figured the magic number
| subtraction that turns it into n^(-1/2) is so wild that
| it puts the Quake trick in a bit of a different class.
| This right shift thing is a lot older and there's
| probably a lineage of ideas from the right shift on
| integers to the Quake trick. I discovered another fun one
| on my own that is probably already well known in bit-math
| circles, but simply inverting the bits of a floating
| point exponent is a very rough approximation to the
| reciprocal, good for a Newton initial guess.
| epcoa wrote:
| You mean _Quake 3_ and fast inverse square root? No. And it
| wasn't Carmack. https://www.beyond3d.com/content/articles/15/
|
| https://en.wikipedia.org/wiki/Fast_inverse_square_root
| dboreham wrote:
| I wondered about the name of that function: surely it isn't
| _inverse_ square root? That would be "squared". It's "1
| over square root" or something. So down the rabbit hole to
| see if it was always called that. Yup, in Wikipedia
| articles at least, but the first paper seems to be by Jim
| Blinn in 1997, without the term in the title. So let's read
| the paper... Frustratingly, although I _am_ an IEEE member,
| and I did subscribe to Computer Graphics and Applications
| in 1997, it won't let me read the paper without paying
| again. So curious to hear from folks knowledgeable in the
| space if this was a mis-naming that stuck or I'm confused
| about the meaning of "inverse". In my universe we used
| inverse in the context of functions and their inverses, and
| "invert" in the context of binary values (the 1s complement of
| x is also called x inverted). Never to describe a reciprocal.
| dahart wrote:
| I've heard inverse used to mean reciprocal often enough.
| And it's technically accurate - a reciprocal is a
| multiplicative inverse. The problem is mainly that
| "inverse" is ambiguous, especially in this particular
| case (an inverse square root is a square!), whereas
| "reciprocal" is clear and unambiguous. Online you can
| find lots of uses of inverse, and questions about inverse
| vs reciprocal. So yes reciprocal is the better, more
| clear term to use here, but "inverse" to mean reciprocal
| does happen.
| epcoa wrote:
| It was called "inverse square root" as it is just another
| practical "layer" on top of "inverse" which is just meant
| to mean multiplicative inverse.
|
| This is the original Blinn 97 BTW. https://web.archive.or
| g/web/20130309073539/http://rufus.hack...
|
| This seems to have been common usage. I never really
| thought about it as it was just so normal to refer to
| reciprocal as "inverse" in this context.
|
| > In my universe we used inverse in the context of
| functions and their inverses
|
| Yes but, the other type of inverse that is so fundamental
| to CS in general, and especially geometry is a matrix
| inverse, which is again a multiplicative inverse, so it's
| not too surprising how this usage became assumed in many
| contexts.
| winwang wrote:
| Fun fact, FFS (and its generalization, FNS) is in CUDA:
| https://docs.nvidia.com/cuda/cuda-math-api/index.html#group_...
|
| Another nice CUDA hardware intrinsic I like is log2.
| bryanlarsen wrote:
| IIRC, most (all?) fixed-point DSPs have a square root
| instruction and/or helper instructions.
| msarnoff wrote:
| If you wanted to expand the definition of "processor" to
| electromechanical contraptions, the Friden SRQ could perform
| square roots using just additions and shifts, with not a single
| electronic component other than a motor. And since you had to
| position the decimal points manually, it would _technically_ be
| an integer operation...
|
| Video: https://youtu.be/o44a1ao5h8w
| Dwedit wrote:
| 2 ^ (1/2 * Log2(X)) = sqrt(X)
|
| You can get a really rough approximation if you replace
| Log2(x) with 'count leading zeroes'. With a better
| approximation of Log2(x), you can get closer to the answer.
| pajko wrote:
| ARM VFP has VSQRT
|
| https://developer.arm.com/documentation/dui0473/m/vfp-instru...
___________________________________________________________________
(page generated 2024-04-07 23:00 UTC)