[HN Gopher] Did any processor implement an integer square root i...
       ___________________________________________________________________
        
       Did any processor implement an integer square root instruction?
        
       Author : rwallace
       Score  : 185 points
       Date   : 2024-04-07 10:55 UTC (12 hours ago)
        
 (HTM) web link (retrocomputing.stackexchange.com)
 (TXT) w3m dump (retrocomputing.stackexchange.com)
        
       | fjfaase wrote:
       | Is it possible in a single clock-cycle. Yes, with a very large
       | lookup table. It is probably possible to reduce the size
       | depending on how many serial logical gates can be executed within
       | the clock-cycle. Think about that the binary root of 10000 is
       | rather similar to that of 100 only with respect to different
       | number of zero's.
        
         | WithinReason wrote:
         | Sounds like it's possible to run any algorithm in the world in
         | 1 clock cycle.
        
           | robinduckett wrote:
           | I like the Quantum BogoSort as a proof of this /s
        
           | ajb wrote:
           | It isn't, because eventually the size of your logic or table
           | becomes larger than the distance a signal can propagate in
           | one clock tick. Before that, it likely presents practical
           | issues (eg, is it worth dedicating that much silicon)
        
             | willcipriano wrote:
             | Have slower ticks. A planet size CPU that runs at .5 hz but
             | can work on impossibly large numbers.
        
               | gameshot911 wrote:
               | That's actually a really fascinating science fiction
               | idea!
        
               | willcipriano wrote:
               | It's sort of the plot of the Douglas Adam's books.
               | 
               | https://hitchhikers.fandom.com/wiki/Earth
        
               | com2kid wrote:
               | https://en.wikipedia.org/wiki/Matrioshka_brain
               | 
               | :-D
               | 
               | It's a topic that has been explored quite a bit in
               | science fiction literature.
        
               | karmakaze wrote:
               | Related to the fallacy of comparing what's slow in our
               | world with what's computable in a simulation of it--
               | there's no requirement for time to tick similarly.
        
               | volemo wrote:
               | And 27.7% [1] of the planet's crust is silicon already!
               | 
               | [1] Britannica: Silicon
        
           | rcxdude wrote:
           | With a sufficiently large chip and a sufficiently slow clock,
           | sure.
        
             | wheybags wrote:
             | "Give me a lut large enough and an infinite period of time
             | to execute, and I shall simulate the world"
        
               | dartos wrote:
               | "Every cycle"
        
           | retrac wrote:
           | Yes. In theory, any pure function can be turned into a lookup
           | table. And any lookup table that isn't just random numbers
           | can be turned into a more compact algorithm that spends
           | compute to save space.
           | 
           | Such tables may be infeasible, though. While a int8 -> int8
           | table only needs 256 bytes, an int32 -> int32 needs 16
           | gigabytes.
        
             | DaiPlusPlus wrote:
             | Fractal functions are pure, but don't lend themselves well
             | to memoization nor lookup tables.
        
         | bonzini wrote:
         | You can compute the integer square root in n/2 iterations where
         | n is the number of bits in the source using just shifts and
         | adds. For each step, check if a new bit has to be set in the
         | result n_old by computing
         | 
         | n2_new = (n_old + (1 << bit))^2 = n2_old + (n_old << (bit + 1))
         | + (1 << (bit*2))
         | 
         | Then compare it with the source operand, and if it's greater or
         | equal: 1) set the bit in the result 2) update n2_old with
         | n2_new
         | 
         | It can be done in n/2 or perhaps n clock cycles with a suitable
         | microcode instruction set and ALU. With some effort it can be
         | optimized to reduce n to the index of the leftmost set bit in
         | the operand.
        
           | masswerk wrote:
           | Compare the integer square root algorithm used in "Spacewar!"
           | [1]. So, even by 1960 it should have been possible to
           | implement a square root step instructions for each bit, much
           | like division or multiplication shifts, and progress from
           | this to full-fledged automatic operations by the use of a sub
           | timing network. (I guess, it really depends on the economics
           | of the individual use case, whether the effort does pay off
           | or not, as you would amass a few additional hardware modules
           | to accomplish this.)
           | 
           | [1] https://www.masswerk.at/spacewar/inside/insidespacewar-
           | pt6-g...
        
         | Findecanor wrote:
         | _Floating_ _point_ _reciprocal_ _square_ _root_ _estimate_
         | (`frsqrte`) instructions are typically implemented as just such
         | a table lookup, indexed by a few bits of the fraction and the
         | LSB of the exponent. The precision is typically limited to
         | similar to bf16 (ARM, RISC-V) or fp16 (x86), so programs are
         | expected to do a few Newton-Raphson iterations afterwards if
         | they want more.
        
         | benlivengood wrote:
         | It's not as bad for integer square root; you only need to store
         | N^0.5 entries in a greater/lesser-than lookup table: N^2 for
         | all the answers N. Feasible for 16-bit integers, maybe for
         | 32-bit, not for 64-bit.
        
         | m463 wrote:
         | so, dumb question.
         | 
         | do lookups in large tables ever (practically, not
         | theoretically) take one clock cycle?
         | 
         | If there's a large lookup table, it would have to come from
         | memory, which might mean cache and memory hierarchy delays,
         | right?
        
       | treprinum wrote:
       | Can't you use the sequence 1 + 3 + 5 + ... + 2k + 1 to get the
       | integer square root of any integer number? It's basically the k
       | of the nearest lower number to your number in this sequence.
        
         | HPsquared wrote:
         | And this is one of those "embarrassingly parallel" tasks.
        
           | sublinear wrote:
           | Yes. On a desert island we can have the whole village
           | construct this table for newton-raphson guesses.
           | 
           | Combined with a cutting tool attached to a worm drive we will
           | precisely count our turns (big radius crank for extra
           | precision!) and begin manufacture of slide rules. Can never
           | have too many scales and this is just one we shall etch into
           | them!
        
         | maxcoder4 wrote:
         | Can you explain your idea? Your algorithm is correct by
         | definition, but doing this naively would be very slow (even for
         | 32bit number). At this point it would be much faster to just
         | binsearch it.
        
           | __s wrote:
           | For an example of binsearch algo, I recently dipped into this
           | while switching some code from floats to fixed point
           | arithmetic (reducing overall wasm blob size)
           | 
           | https://github.com/serprex/openEtG/blob/2011007dec2616d1a24d.
           | ..
           | 
           | Tho I could probably save binary size more by importing
           | Math.sqrt from JS
        
             | smcameron wrote:
             | Recently, I had reason to dig around for a fixed point
             | square root algorithm and found this: https://groups.google
             | .com/g/comp.lang.c/c/IpwKbw0MAxw/m/N1xh...
        
         | atq2119 wrote:
         | And you would iterate through that sequence? That's exponential
         | time in the bit length of the input...
        
           | sublinear wrote:
           | It's sqrt(n) - 1 additions for the n you're trying to get the
           | integer square root of. Memoization would make it constant
           | time for any lesser n than the greatest n you've done this
           | for. For greater n it's sqrt(new_big_n - prev_big_n) - 1 more
           | additions to memoize.
           | 
           | You're right this isn't practical, but fun to think about.
           | Good refresher for those out of school for a while.
        
             | klyrs wrote:
             | Subtle detail of complexity analysis: the input is of
             | length log(n), so taking n^k steps for any positive number
             | k is exponential time.
        
         | tomatocracy wrote:
         | Better might be to use the expansion (x+y)^2=x^2+2xy+y^2 along
         | with the observation that in any base, the square root of a
         | 2n-digit number is at most n digits, as in the common method
         | for calculating a square root "by hand" with pen and paper. If
         | you did this 8 bits at a time then you only need a lookup table
         | for roots of 8bit numbers.
        
       | moomin wrote:
       | You need to read down a bit, but the answer "ENIAC" is hilarious.
        
         | xyst wrote:
         | hardware engineer humor lol
        
       | corsix wrote:
       | AArch64 NEON has the URSQRTE instruction, which gets closer to
       | the OP's question than you might think; view a 32-bit value as a
       | fixed-precision integer with 32 fractional bits (so the
       | representable range is evenly spaced 0 through 1-e, where
       | e=2^-32), then URSQRTE computes the approximate inverse square
       | root, halves it, then clamps it to the range 0 through 1-e.
       | Fixed-precision integers aren't quite integers, and approximate
       | inverse square root isn't quite square root, but it might get you
       | somewhere close.
       | 
       | The related FRSQRTE instruction is much more conventional,
       | operating on 32-bit floats, again giving approximate inverse
       | square root.
        
         | voidbert wrote:
         | What task benefits from using such a complex instruction so
         | easily dividable in simpler ones for it to be present in
         | aarch64?
        
           | nanidin wrote:
           | Neon is SIMD so I would presume these instructions let you
           | vectorize those calculations and do them in parallel on a lot
           | of data more efficiently than if you broke it down into
           | simpler operations and did them one by one.
        
             | voidbert wrote:
             | Yes, but the part that got me was the halving of the result
             | followed by the clamping. SIMD generally makes sense, but
             | for something like this to exist usually there's something
             | very specific (like a certain video codec, for example)
             | that greatly benefits from such a complex instruction.
        
               | creato wrote:
               | It's probably not about avoiding extra
               | instructions/performance, but making the range of the
               | result more useful and avoiding overflow. Or in other
               | words, the entire instruction may be useless if you don't
               | do these things.
        
               | epcoa wrote:
               | The halving and clamping is nothing particularly
               | remarkable in the context of usefully using fixed point
               | numbers (scaled integers) to avoid overflow. Reciprocal
               | square root itself is a fundamental operation for DSP
               | algorithms and of course computer graphics. This is a
               | fairly generic instruction really, though FRSQRTE likely
               | gets more real world use.
        
           | colechristensen wrote:
           | Inverse square root is for normalizing vectors particularly
           | in computer graphics calculations, it needs to be run a whole
           | lot very fast.
           | 
           | https://en.m.wikipedia.org/wiki/Fast_inverse_square_root#Mot.
           | ..
        
             | hinkley wrote:
             | Famously the magic constant in the Quake engine that nobody
             | remembers inventing.
             | 
             | That article does say there's an SSE instruction rsqrtss
             | that is better.
        
       | dahart wrote:
       | For an approximate (very rough) answer, as opposed to one
       | accurate to the nearest integer, a right shift by half the number
       | of bits of the leading 1's position will do, and of course nearly
       | every processor has a shift instruction. I'm not sure how often
       | processors haven't had something like FLO (Find Leading One) or
       | FFS (Find First Set) instruction, those seem ubiquitous as well.
       | 
       | The super rough approximation for some uses can be approximately
       | as good as an accurate answer. When you just need a decent
       | starting place for some further Newton-Raphson iteration, for
       | example. (Of course the right-shift trick is a nice way to seed a
       | more accurate square root calculation. :P)
        
         | lordnacho wrote:
         | Is this where the DOOM reference comes in? Somewhat famous
         | Internet story by now featuring Carmack and a magic 32 bit
         | number.
        
           | dahart wrote:
           | Not really, that's a very clever trick used on floating point
           | numbers, and does the approximate reciprocal square root.
           | 
           | This right-shift thing is far simpler, not very clever,
           | doesn't involve magic numbers, and is much more well known
           | than the "Quake trick". There are easy ways to see it. One
           | would be that multiplying a number by itself approximately
           | doubles the number of digits. Therefore halving the number of
           | digits is approximately the square root. You can get more
           | technical and precise by noting that FFS(n) =~ log2(n), and
           | if you remember logs, you know that exp(log(n)/2) = n^(1/2),
           | so shifting right by FFS(n)/2 is just mathematically
           | approximately a square root.
        
             | creato wrote:
             | They are more closely related than you suggest. Both
             | methods are using an approximation of log2 to get an
             | initial guess. One gets it from "FPS(n)", the other gets it
             | from the floating point representation of n, where you can
             | roughly find the log2(n) by looking at the exponent of n in
             | the float representation. You can also use the mantissa to
             | refine the guess further.
        
               | dahart wrote:
               | They are related a little, around the log2 (exponent), I
               | totally agree. I guess I figured the magic number
               | subtraction that turns it into n^(-1/2) is so wild that
               | it puts the Quake trick in a bit of a different class.
               | This right shift thing is a lot older and there's
               | probably a lineage of ideas from the right shift on
               | integers to the Quake trick. I discovered another fun one
               | on my own that is probably already well known in bit-math
               | circles, but simply inverting the bits of a floating
               | point exponent is a very rough approximation to the
               | reciprocal, good for a Newton initial guess.
        
           | epcoa wrote:
           | You mean _Quake 3_ and fast inverse square root? No. And it
           | wasn't Carmack. https://www.beyond3d.com/content/articles/15/
           | 
           | https://en.wikipedia.org/wiki/Fast_inverse_square_root
        
             | dboreham wrote:
             | I wondered about the name of that function: surely it isn't
             | _inverse_ square root? That would be  "squared". It's "1
             | over square root" or something. So down the rabbit hole to
             | see if it was always called that. Yup, in Wikipedia
             | articles at least, but the first paper seems to be by Jim
             | Blinn in 1997, without the term in the title. So let's read
             | the paper... Frustratingly, although I _am_ an IEEE member,
             | and I did subscribe to Computer Graphics and Applications
             | in 1997, it won 't let me read the paper without paying
             | again. So curious to hear from folks knowledgeable in the
             | space if this was a mis-naming that stuck or I'm confused
             | about the meaning of "inverse". In my universe we used
             | inverse in the context of functions and their inverses, and
             | "invert" in the context of binary values (1s compliment of
             | x is also called x inverted). Never to describe reciprocal.
        
               | dahart wrote:
               | I've heard inverse used to mean reciprocal often enough.
               | And it's technically accurate - a reciprocal is a
               | multiplicative inverse. The problem is mainly that
               | "inverse" is ambiguous, especially in this particular
               | case (an inverse square root is a square!), whereas
               | "reciprocal" is clear and unambiguous. Online you can
               | find lots of uses of inverse, and questions about inverse
               | vs reciprocal. So yes reciprocal is the better, more
               | clear term to use here, but "inverse" to mean reciprocal
               | does happen.
        
               | epcoa wrote:
               | It was called "inverse square root" as it is just another
               | practical "layer" on top of "inverse" which is just meant
               | to mean multiplicative inverse.
               | 
               | This is the original Blinn 97 BTW. https://web.archive.or
               | g/web/20130309073539/http://rufus.hack...
               | 
               | This seems to have been common usage. I never really
               | thought about it as it was just so normal to refer to
               | reciprocal as "inverse" in this context.
               | 
               | > In my universe we used inverse in the context of
               | functions and their inverses
               | 
               | Yes but, the other type of inverse that is so fundamental
               | to CS in general, and especially geometry is a matrix
               | inverse, which is again a multiplicative inverse, so it's
               | not too surprising how this usage became assumed in many
               | contexts.
        
         | winwang wrote:
         | Fun fact, FFS (and its generalization, FNS) is in CUDA:
         | https://docs.nvidia.com/cuda/cuda-math-api/index.html#group_...
         | 
         | Another nice CUDA hardware intrinsic I like is log2.
        
       | bryanlarsen wrote:
       | IIRC, most (all?) fixed point DSP's have a square root
       | instruction and/or helper instructions.
        
       | msarnoff wrote:
       | If you wanted to expand the definition of "processor" to
       | electromechanical contraptions, the Friden SRQ could perform
       | square roots using just additions and shifts, with not a single
       | electronic component other than a motor. And since you had to
       | position the decimal points manually, it would _technically_ be
       | an integer operation...
       | 
       | Video: https://youtu.be/o44a1ao5h8w
        
       | Dwedit wrote:
       | 2 ^ (1/2 * Log2(X)) = sqrt(X)
       | 
       | You can get a really really rough approximation if you replace
       | Log2(x) with 'count leading zeroes'. With a better approximation
       | of Log(2), you can get closer to the answer.
        
       | pajko wrote:
       | ARM VFP has VSQRT
       | 
       | https://developer.arm.com/documentation/dui0473/m/vfp-instru...
        
       ___________________________________________________________________
       (page generated 2024-04-07 23:00 UTC)