[HN Gopher] Speeding up atan2f
___________________________________________________________________
Speeding up atan2f
Author : rostayob
Score : 196 points
Date : 2021-08-17 12:26 UTC (10 hours ago)
(HTM) web link (mazzo.li)
(TXT) w3m dump (mazzo.li)
| cogman10 wrote:
| I'm actually a bit surprised that the x86 SIMD instructions don't
| support trig functions.
| drfuchs wrote:
| Wouldn't CORDIC have done the trick faster? There's no mention
| that they even considered it, even though it's been around for
| half a century or so.
| adrian_b wrote:
| CORDIC was great for devices that were too simple to include
| hardware multipliers.
|
| CORDIC, in its basic form, produces just 1 bit of result per
| clock cycle. Only for very low precisions (i.e. with few bits
| per result) would you have the chance to overlap enough
| parallel atan2 computations to achieve a throughput comparable
| to what you can do with polynomial approximations on modern
| CPUs, with pipelined multipliers.
|
| CORDIC remains useful only for ASIC/FPGA implementations, in
| cases where area and power consumption are much more
| important than speed.
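|
| For illustration, a minimal scalar sketch of the vectoring-mode
| iteration (not from the article; quadrant handling is omitted,
| and a real implementation would use a precomputed table and
| fixed-point shifts rather than atanf/ldexpf):
|
|     #include <math.h>
|
|     /* Rotate (x, y) toward the x axis, accumulating the rotation
|        angle; each iteration resolves roughly one more bit of the
|        result. Assumes x > 0. */
|     static float cordic_atan2(float y, float x, int iters) {
|         float angle = 0.0f;
|         for (int i = 0; i < iters; i++) {
|             float step = atanf(ldexpf(1.0f, -i)); /* table entry */
|             float xi = x, yi = y;
|             if (yi > 0.0f) {
|                 x = xi + ldexpf(yi, -i);
|                 y = yi - ldexpf(xi, -i);
|                 angle += step;
|             } else {
|                 x = xi - ldexpf(yi, -i);
|                 y = yi + ldexpf(xi, -i);
|                 angle -= step;
|             }
|         }
|         return angle;
|     }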
| mzs wrote:
| CORDIC is unlikely to be faster than this article's approach:
| "the approximation produces a result every 2 clock cycles"
| jvz01 wrote:
| I have developed very fast, accurate, and vectorizable atan() and
| atan2() implementations, leveraging AVX/SSE capabilities. You can
| find them here [warning: self-signed SSL-Cert].
|
| https://fox-toolkit.org/wordpress/?p=219
| ghusbands wrote:
| Your result is significantly slower than the versions presented
| in the article, though yours has more terms and so may be more
| accurate.
| jvz01 wrote:
| Yes, the aim was to be accurate down to 1 LSB while being
| significantly faster. Feel free to drop terms from the
| polynomial if you can live with less accurate results!
|
| The coefficients were generated by a package called Sollya;
| I've used it a few times to develop accurate Chebyshev
| approximations for functions.
|
| Abramowitz & Stegun is another good reference.
| touisteur wrote:
| Please, would you mind one of these days updating your blog
| post with the instructions you gave to Sollya? I'm trying
| something stupid with log1p and can't get Sollya to help,
| mostly because I'm not putting in enough time to read all the
| docs...
| jvz01 wrote:
| Little side-note: the algorithm as given is scalar; however,
| it's branch-free and defined entirely in the header file. So
| compilers will typically be able to vectorize it, and thus
| achieve a speedup directly proportional to the vector size. I
| see a potential [but architecture-dependent] optimization using
| the Estrin scheme for evaluating the polynomial.
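|
| For illustration, an Estrin-style evaluation of a degree-7
| polynomial could look like this (just a sketch; the c[]
| coefficients are placeholders, not the ones from my post):
|
|     #include <math.h>
|
|     /* Estrin's scheme for c[0] + c[1]x + ... + c[7]x^7: the four
|        low-level FMAs are mutually independent, so they can overlap
|        in the pipeline, unlike Horner's strictly serial chain. */
|     static float poly7_estrin(float x, const float c[8]) {
|         float x2 = x * x;
|         float x4 = x2 * x2;
|         float p01 = fmaf(c[1], x, c[0]);
|         float p23 = fmaf(c[3], x, c[2]);
|         float p45 = fmaf(c[5], x, c[4]);
|         float p67 = fmaf(c[7], x, c[6]);
|         float p0123 = fmaf(p23, x2, p01);
|         float p4567 = fmaf(p67, x2, p45);
|         return fmaf(p4567, x4, p0123);
|     }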
| aconz2 wrote:
| Nice writeup and interesting results. I hadn't seen
| perf_event_open(2) used directly in code before, which looks cool.
|
| The baseline is at a huge disadvantage here because the call to
| atan2 in the loop never gets inlined and the loop doesn't seem to
| get unrolled (which is surprising actually). Manually unrolling
| by 8 gives me an 8x speedup. Maybe I'm missing something with the
| `-static` link but unless they're using musl I didn't think -lm
| could get statically linked.
| xxpor wrote:
| Could it be that calls to glibc never get inlined, since
| inlining is essentially static linking by another name? Or
| since they're in separate compilation units, without LTO you'd
| never perform the analysis to figure out if it's worth it.
| Would LTO inspect across a dynamic loading boundary? Just
| speculating, I really have no idea. Everything I work with is
| statically linked, so I've never really had to think about it.
| aconz2 wrote:
| You've made a silly mistake and just did 1/8th the work, so of
| course it was an 8x speedup. I am still wondering how the numbers
| would look if you could inline atan2f from libc. Too bad that
| isn't easier.
| [deleted]
| h0mie wrote:
| Love posts like this!
| sorenjan wrote:
| How do you handle arrays of values where the array lengths are
| not a multiple of 8 in this kind of vectorized code? Do you zero
| pad them before handing them to the vectorized function, or do
| you run a second loop element by element on the remaining
| elements after the main one? What happens if you try to do
| `_mm256_load_ps(&ys[i])` with < 8 elements remaining?
| TonyTrapp wrote:
| Typically you have the vectorized loop followed by a non-
| vectorized loop handling the remaining items, or even just
| explicit if-then-else statements for each possible number of
| remaining items.
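|
| A minimal sketch of that pattern, assuming hypothetical
| atan2f_avx / atan2f_scalar kernels (names made up, not the
| article's code):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|
|     extern __m256 atan2f_avx(__m256 y, __m256 x);  /* placeholder */
|     extern float atan2f_scalar(float y, float x);  /* placeholder */
|
|     void atan2_batch(float *out, const float *ys, const float *xs,
|                      size_t n) {
|         size_t i = 0;
|         /* main loop over full blocks of 8 */
|         for (; i + 8 <= n; i += 8) {
|             __m256 y = _mm256_loadu_ps(&ys[i]);
|             __m256 x = _mm256_loadu_ps(&xs[i]);
|             _mm256_storeu_ps(&out[i], atan2f_avx(y, x));
|         }
|         /* scalar tail for the at most 7 leftover elements */
|         for (; i < n; i++)
|             out[i] = atan2f_scalar(ys[i], xs[i]);
|     }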
| vlovich123 wrote:
| > Do you zero pad them before handling them to the vectorized
| function, or do you run a second loop element by element on the
| remaining elements after the main one?
|
| Typically the latter. That's why I find SVE so interesting.
| It should improve code density.
| zbjornson wrote:
| You could pad the input as you said, which avoids a "tail
| loop," but otherwise you usually do a partial load (load <8
| elements into a vector) and store. Some instruction set
| extensions provide "masked load/store" instructions for this,
| but there are ways to do it without those too.
|
| To your last question specifically, if you
| _mm256_load_ps(&ys[i]) and you're at the edge of a page, you'll
| get a segfault. Otherwise you'll get undefined values (which
| can be okay, you could ignore them instead of dealing with a
| partial load).
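|
| For example, with AVX the tail mask can be built like this (a
| sketch; the integer compare needs AVX2, and there are other ways
| to construct the mask):
|
|     #include <immintrin.h>
|
|     /* Lanes 0..rem-1 get an all-ones mask, the rest get zero. */
|     static __m256i tail_mask(int rem) {
|         const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
|         return _mm256_cmpgt_epi32(_mm256_set1_epi32(rem), lane);
|     }
|
|     /* For the last partial block starting at i, with n - i < 8:
|            __m256i m = tail_mask((int)(n - i));
|            __m256  y = _mm256_maskload_ps(&ys[i], m);
|            ...compute...
|            _mm256_maskstore_ps(&out[i], m, result);
|        masked-off lanes read as 0.0f and are never written back. */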
| twic wrote:
| RISC-V has a solution to this which seems quite elegant - the
| vector length is set by the program, up to some maximum, and
| for the final block, it is simply whatever is left:
|
| https://gms.tf/riscv-vector.html
| Const-me wrote:
| I wonder how it compares with Microsoft's implementation
| here:
| https://github.com/microsoft/DirectXMath/blob/jan2021/Inc/Di...
|
| Based on the code, your version is probably much faster. It
| would still be interesting to compare precision; MS uses a
| 17-degree polynomial there.
| pclmulqdq wrote:
| The "small error" in this article isn't so small when people
| want exact results. Unfortunately, a high-degree polynomial is
| necessary if you want 24 bit precision across the input range
| for functions like this.
|
| That said, I wonder if a rational approximation might not be so
| bad on modern machines...
| azalemeth wrote:
| Indeed. I love Pade approximants -- they're very underused.
| Another really fun trick for approximating expensive
| functions is turning them into Chebyshev functions, as in
| Nick Trefethen's chebfun package. You can trivially
| differentiate, integrate and compute chebfun approximations
| of really horrible special functions (e.g. DE solutions) with
| fantastic accuracy. Criminally underused.
| jacobolus wrote:
| If you like Pade approximants, have you been following
| Trefethen & al.'s recent work on the "AAA" algorithm etc.?
| https://arxiv.org/pdf/1612.00337.pdf
| https://arxiv.org/pdf/1705.10132.pdf
|
| See also the lecture
| https://www.youtube.com/watch?v=S1upJPMIFfg
| azalemeth wrote:
| No, I haven't -- thank you very much for sharing. I've
| only ever used them, e.g. for spectral decomposition in
| magnetic resonance in a medical setting (1) with the fast
| Pade transformation, which has a wonderfully deep
| relationship with the DEs describing magnetic resonance
| (2).
|
| Thanks also for the lecture - I'll enjoy. Nick Trefethen
| is such a good speaker. He taught me in a graduate course
| in numerical methods -- honestly, I'd watch those
| lectures on a loop again.
|
| (1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5179227/
|
| (2) https://iopscience.iop.org/article/10.1088/0031-9155/
| 51/5/00...
| Const-me wrote:
| > I wonder if a rational approximation might not be so bad on
| modern machines
|
| Uops.info says the throughput of 32-byte FP32 vector division
| is 5 cycles on Intel, 3.5 cycles on AMD.
|
| The throughput of 32-byte vector multiplication is 0.5
| cycles. The same goes for FMA, which is a single instruction
| computing a*b+c for 3 float vectors.
|
| If the approximation formula only has 1 division where both
| numerator and denominator are polynomials, and the degree of
| both polynomials is much lower than 17, the rational version
| might not be too bad compared to the 17-degree polynomial
| version.
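|
| As a sketch, such a rational approximation is just two short FMA
| chains plus a single division (placeholder coefficients, not a
| fitted atan):
|
|     #include <math.h>
|
|     /* r(x) = P(x)/Q(x) with degree-3 P and degree-2 Q */
|     static float eval_rational(float x, const float p[4],
|                                const float q[3]) {
|         float num = fmaf(fmaf(fmaf(p[3], x, p[2]), x, p[1]), x, p[0]);
|         float den = fmaf(fmaf(q[2], x, q[1]), x, q[0]);
|         return num / den;
|     }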
| pclmulqdq wrote:
| A Pade approximation is generally like that. Especially
| when you are dealing with singularities, it should be a
| much smaller polynomial than 17 terms in the numerator and
| denominator. I think I'm going to have to start a blog, and
| this will be on the list.
| janwas wrote:
| Indeed, degree-4 rational polynomials are generally
| enough for JPEG XL. Code for evaluating them using
| Horner's scheme (just FMAs): https://github.com/libjxl/li
| bjxl/blob/bdde644b94c125a15e532b...
|
| I initially used Chebfun to approximate them, then a
| teammate implemented the same Caratheodory-Fejer method
| in C. We subsequently convert from Chebyshev basis to
| monomials.
|
| Blog sounds good :)
| petschge wrote:
| 24 bits of precision would be possible with the Hastings
| approximation with 15 terms instead of 6. A naive
| estimate would be that this requires 2.5 times as much work
| and could run at 6GB/s and take a little less than 5 cycles
| per element.
| Someone wrote:
| I think that remains to be seen. Exact calculations will
| give you 24 bits, but in computers, floating point
| calculations typically aren't exact.
|
| If, as this article does, you compute all your intermediate
| values as 32-bit IEEE floats, it would hugely surprise me
| if you still got 24 bits of precision after 1 division (to
| compute y/x or x/y), 15 multiplications and 14 additions
| (to evaluate the polynomial) and, sometimes, a subtraction
| from pi/2.
|
| (That's a weakness in this article. They point out how fast
| their code is, but do not show any test that it actually
| computes approximations to atan2, let alone how accurate
| their results are.)
|
| Getting all the bits right in floating point is expensive.
| adgjlsfhk1 wrote:
| The good news is that the high-order terms of the
| polynomial don't notably affect the error. With FMA,
| Horner's method converges to .5 ULP accuracy. The
| subtraction from pi/2 can also be done without adding
| inaccuracy by using compensated summation.
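|
| A sketch of one common way to do that (the split-constant
| variant: pi/2 stored as a head float plus a tail float, so the
| head's rounding error is added back):
|
|     /* head: nearest float to pi/2; tail: pi/2 - head */
|     #define PIO2_HI  1.57079637050628662109375f
|     #define PIO2_LO -4.37113900018624283e-8f
|
|     /* pi/2 - t, compensating for the rounding of pi/2 to float */
|     static inline float pio2_minus(float t) {
|         return (PIO2_HI - t) + PIO2_LO;
|     }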
| gumby wrote:
| Because of the disregard for the literature common in CS I loved
| this part:
|
| > This is achieved through ... _and some cool documents from the
| 50s_.
|
| A bit of anecdote: back when I was a research scientist
| (corporate lab) 30+ years ago I would in fact go downstairs to
| the library and read -- I was still a kid with a lot to learn
| (and still am). When I came across (by chance or by someone's
| suggestion) something useful to my work I'd photocopy the article
| and try it out. I'd put a comment in the code with the
| reference.
|
| My colleagues in my group would give me (undeserved) credit for
| my supposed brilliance even though I said in the code where the
| idea came from and would determinedly point to the very paper on
| my desk. This attitude seemed bizarre as the group itself was
| producing conference papers and even books.
|
| (Obviously this was not a universal phenomenon as there were
| other people in the lab overall, and friends on the net,
| suggesting papers to read. But I've seen it a lot, from back then
| up to today)
| lmilcin wrote:
| It is very unfortunate that in the world of ubiquitous
| information, the ability to just look up a reasonable solution
| to a well known and well studied problem that was solved long
| ago is practically a superpower.
|
| As a developer, I regularly find myself solving some large,
| long-standing problem with significant impact for the company,
| mostly with a couple of Google searches.
|
| Once I solved a batch task that took 11 hours to complete and
| reduced it to about 5s of work by looking up a paper on
| bitemporal indexes. The reactions ranged from "You are
| awesome!" to "You read a... paper????" to "No, we can't have
| this, this is super advanced code and it surely must have bugs.
| Can't you find an open source library that implements this?"
| jacobolus wrote:
| One of the most valuable (but often under-appreciated) kinds
| of work people can do is to take hard-won knowledge that
| exists somewhere obscure and make it more accessible, e.g. by
| writing expository survey papers, adding material to
| Wikipedia, writing documentation, ...
| zokier wrote:
| > It is very unfortunate that in the world of ubiquitous
| information, ability to just look up a reasonable solution to
| a well known and well studied problem that has already been
| solved long ago is practically a superpower
|
| It's exactly because information is so ubiquitous and
| plentiful that finding a reasonable solution is like looking
| for a needle in a haystack.
| lmilcin wrote:
| In most cases in my experience, the people never even faced
| the challenge of actually grepping the internet for the
| solution.
|
| More likely they just went and rolled their own solution
| for any number of other reasons:
|
| -- For the fun of intellectual challenge (I am guilty of
| this),
|
| -- For having the hubris to think they are the first to meet
| that challenge (I have been guilty of this in the past, I no
| longer make that mistake),
|
| -- For having the hubris to think they have the uniquely best
| solution to the problem, making it highly unlikely anything
| else is even worth verifying,
|
| -- For feeling superior to everybody else and hence seeing no
| real reason to dignify other, lesser people by taking an
| interest in their work,
|
| -- For feeling there is just too much pressure and no time
| to do it right or even to take a peek at what is right and
| how much time it would take,
|
| -- For not having enough experience to suspect there might
| be a solution to this problem,
|
| -- Books? Only old people read books,
|
| -- Papers? They are for research scientists. I didn't finish
| a CS degree just to read papers at work.
|
| and so on.
|
| There have really been very few concrete problems that I
| could not find some solution to on the Internet.
|
| It is more difficult to find solutions to less concrete
| problems, but even then, if you have a habit of regularly
| seeking knowledge, reading various sources, etc., you will
| probably amass enough experience and knowledge to deal with
| most problems, even complex ones, on your own.
| djmips wrote:
| I regularly hear coders say they cannot understand other
| people's code so they _had_ to write it themselves. Most
| people can't read papers either it seems.
| lmilcin wrote:
| Well, that's another thing.
|
| People have lost the ability to read anything longer than a
| tweet with focus and comprehension. Especially new developers.
|
| I know this because whatever I put in the third and later
| paragraphs of my tickets is not being read or understood.
|
| I sometimes quiz guys at review meetings, and I even
| sometimes put the most important information somewhere in
| the middle of the ticket. Always with the same result --
| they are surprised.
|
| Now, to read a paper is to try to imagine being the
| person who wrote it and to appreciate not just the written
| text but also a lot of things that are not necessarily
| spelled out.
|
| The same with code -- when I read somebody's code I want
| to recognize their style, how they approach various
| problems. Once you know a little bit about their
| vocabulary you can really speed up reading the rest
| because you can anticipate what is going to happen next.
| nuclearnice1 wrote:
| People may have a preference for certainty.
|
| If they work on the problem, they will make progress and
| likely solve it.
|
| If they search the literature, they will spend time on
| research and quite possibly lose days before emerging
| with nothing to show for it.
| [deleted]
| refset wrote:
| > bitemporal indexes [...] Can't you find an open source
| library that implements this?
|
| Undoubtedly too late for you, but
| https://github.com/juxt/crux/blob/master/crux-
| core/src/crux/... :)
| kbenson wrote:
| > 30+ years ago ... I was still a kid with a lot to learn (and
| still am).
|
| We should all strive to stay young at heart like this. ;)
| dleslie wrote:
| I find that disregard for education to be tasteless, but
| perhaps warranted due to the modern prevalence of "bootcamps"
| and similar fast-but-shallow programmes.
|
| Personally, I love when I get a chance to apply something from
| a published paper; I will leave a citation as a comment in the
| code and a brief explanation of how it works and why it was
| chosen. I have no regrets about having earned a computing
| degree, so many years ago.
| touisteur wrote:
| But: knowing where to look, being able to suss out details
| from papers, having an almost adequate level of conversation
| to try and contact authors (if still alive...), the ability to
| put undecipherable formulas into code, and bearing Stack
| Overflow's requirements for a 'good' question -- not a set
| of skills that amounts to nothing. It's rare nowadays that I
| see any copy of a reference textbook on numerical recipes,
| although we do tons of signal processing and HPC stuff.
|
| Recently I rediscovered the ability to compute logs with 64-bit
| integers with better precision than double precision. Man,
| jumping back into those depths after so many years... Not as
| easy as reading other people's code.
| teddyh wrote:
| I've seen an odd use of language, mostly from U.S. Americans,
| where "intelligent" = "smart" = "_knowledgeable_". And
| conversely, assuming that knowing many things is what makes you
| smart. I suspect this is what happened there; they thought you
| "brilliant" because you knew many things - in their minds,
| there might not be a difference between the two concepts.
| whatshisface wrote:
| Not brilliant? Use of the written word to pass ideas down
| between generations is one of the most brilliant ideas in human
| history, and since nobody around you was doing it, the only
| explanation is that you independently invented the ethos of the
| scholar yourself.
| gumby wrote:
| Heh heh, Ockham's razor provides an explanation!
| specialist wrote:
| Nice.
|
| I truly miss foraging the stacks. Such delight.
|
| Browsing scholar.google.com, resulting in 100s of tabs open,
| just isn't the same.
| jjgreen wrote:
| I always found that the smell of old books filled me with the
| urge to defecate, to the extent that I had to deliberately go
| for a crap before diving into the stacks. Something of a
| disadvantage as a pre-internet researcher. I thought I was
| the only person in the world with this condition, but have
| subsequently found a couple of other similarly afflicted
| souls.
| _moof wrote:
| No one likes to do lit review, which is a shame. I get it -
| it's way less fun to find out someone already bested your cool
| idea than it is to write code. But it's definitely more
| productive to read the literature!
| prionassembly wrote:
| I wonder whether Pade approximants are well known by this kind of
| researcher. E.g. http://www-
| labs.iro.umontreal.ca/~mignotte/IFT2425/Documents...
| nimish wrote:
| Numerical analysts are definitely familiar. Library authors may
| not be unless they took a course in function approximation.
|
| Personally I prefer Chebyshev approximants (as in chebfun), but
| Pade approximants are really good, at the cost of a division.
| stephencanon wrote:
| A math library implementor will generally be familiar with at
| least minimax and Chebyshev approximations, and generally
| Pade and Caratheodory-Fejer approximations as well. (Source:
| I implemented math libraries for a decade. We used all of
| these frequently.)
| zokier wrote:
| I would have liked to see the error analysis section expanded a
| bit, or maybe some sort of test to validate the max error. In
| particular, if the mathematical approximation function arctan*
| has a max error of 1/10000 degrees, then I'd naively expect the
| float-based implementation to have worse error. Furthermore,
| it's not obvious whether additional error could be introduced
| by the division
|
|     float atan_input = (swap ? x : y) / (swap ? y : x);
| spaetzleesser wrote:
| I am envious of people who can deal with such problems. The
| problem is defined clearly and can be measured easily.
|
| This is so much more fun than figuring out why some SaaS service
| is misbehaving.
| jacobolus wrote:
| If someone wants a fast version of x -> tan(pi*x/2), let me
| recommend the approximation:
|
|     tanpi_2 = function tanpi_2(x) {
|       var y = (1 - x*x);
|       return x * (((-0.000221184 * y + 0.0024971104) * y
|                    - 0.02301937096) * y
|                   + 0.3182994604 + 1.2732402998 / y);
|     }
|
| (valid for -1 <= x <= 1)
|
| https://observablehq.com/@jrus/fasttan with error:
| https://www.desmos.com/calculator/hmncdd6fuj
|
| But even better is to avoid trigonometry and angle measures as
| much as possible. Almost everything can be done better (faster,
| with fewer numerical problems) with vector methods; if you want a
| 1-float representation of an angle, use the stereographic
| projection:
|
|     stereo = (x, y) => y/(x + Math.hypot(x, y));
|     stereo_to_xy = (s) => {
|       var q = 1/(1 + s*s);
|       return !q ? [-1, 0] : [(1 - s*s)*q, 2*s*q];
|     }
| leowbattle wrote:
| > Almost everything can be done better ... use the
| stereographic projection
|
| I was just learning about stereographic projection earlier.
| Isn't it odd how, once you know about something, you start
| noticing it appearing everywhere?
|
| Can you give an example of an operation that could be performed
| better using stereographic projection rather than angles?
| jacobolus wrote:
| Generally you can just stick to storing 2-coordinate vectors
| and using vector operations.
|
| The places where you might want to convert to a 1-number
| representation are when you have a lot of numbers you want to
| store or transmit somewhere. Using the stereographic
| projection (half-angle tangent) instead of angle measure
| works better with the floating point number format and uses
| only rational arithmetic instead of transcendental functions.
| hdersch wrote:
| A comparison with one of the many SIMD math libraries would have
| been fairer than with plain libm. A long time ago I wrote such a
| dual-platform library for the PS3 (Cell processor) and x86
| architecture (outdated, but still available here [1]). Depending
| on how the standard libm implements atan2f, a speedup of 3x to
| 15x is achieved, without sacrificing accuracy.
|
| 1. https://webuser.hs-furtwangen.de/~dersch/libsimdmath.pdf
| drej wrote:
| Nice. Reminds me of an optimisation trick from a while ago: I
| remember being bottlenecked by one of these trigonometric
| functions years ago when working with a probabilistic data
| structure... then I figured the input domain was pretty small (a
| couple dozen values), so I precomputed those and used an array
| lookup instead. A huge win in terms of perf, obviously only
| applicable in these extreme cases.
| ThePadawan wrote:
| One of the things I recently learned that sounded the most
| "that can't possibly work well enough" is an optimization for
| sin(x):
|
| If abs(x) < 0.1, "sin(x)" is approximated really well by "x".
|
| That's it. For small x, just return x.
|
| (Obviously, there is some error involved, but for the speedup
| gained, it's a very good compromise)
| sharikone wrote:
| I think that you will find that for subnormal numbers any
| math library will use the identity function for sin(x) and 1
| for cos(x)
| ThePadawan wrote:
| Right, but the largest subnormal number in single-precision
| floats is ~ 10^-38.
|
| That the sin(x) approximation still works well for 10^-1
| (with an error of ~0.01%) is the really cool thing!
| chriswarbo wrote:
| This is a very common assumption in Physics, e.g. https://en.
| wikipedia.org/wiki/Pendulum_(mathematics)#Small-a...
|
| Whether it's appropriate in a numerical calculation obviously
| depends on the possible inputs and the acceptable error bars
| :)
| tzs wrote:
| To put some numbers on it, using N terms of the Taylor series
| for sin(x) [1] with |x| <= 0.1, the maximum error percentage
| cannot exceed [2]:
|
|     N   Error Limit
|     1   0.167%         (1/6%)
|     2   8.35 x 10^-5%  (1/11980%)
|     3   1.99 x 10^-8%  (1/50316042%)
|     4   2.76 x 10^-12% (1/362275502328%)
|
| Even for |x| as large as 1 the sin(x) = x approximation is
| within 20%, which is not too bad when you are just doing a
| rough ballpark calculation, and a two term approximation gets
| you under 1%. Three terms is under 0.024% (1/43%).
|
| For |x| up to Pi/2 (90deg) the one term approximation falls
| apart. The guarantee from the Taylor series is within 65% (in
| reality it does better, but is still off by 36%). Two terms
| is guaranteed to be within 8%, three within 0.5%, and four
| within 0.02%.
|
| Here's a quick and dirty little Python program to compute a
| table of error bounds for a given x [3].
|
| [1] x - x^3/3! + x^5/5! - x^7/7! + ...
|
| [2] In reality the error will usually be quite a bit below
| this upper limit. The Taylor series for a given x is a
| convergent alternating series. That is, it is of the form A0
| - A1 + A2 - A3 + ... where all the A's have the same sign,
| |Ak| is a decreasing sequence past some particular k, and
| |Ak| has a limit of 0 as k goes to infinity. Such a series
| converges to some value, and if you approximate that by just
| taking the first N terms the error cannot exceed the first
| omitted term as long as N is large enough to take you to the
| point where the sequence from there on is decreasing. This is
| the upper bound I'm using above.
|
| [3] https://pastebin.com/thN8B7Gf
| quietbritishjim wrote:
| That is precisely the technique discussed in the article:
| it's the first term of the Taylor expansion. Except that the
| article used more terms of the expansion, and also used very
| slightly "wrong" coefficients to improve the overall accuracy
| within the small region.
| Bostonian wrote:
| Why wouldn't you at least include the x^3 term in the Taylor
| series for abs(x) < 0.1?
| ThePadawan wrote:
| That's what I assumed would have been a reasonable
| optimization!
|
| What I really found amazing was that rather than reducing
| the number of multiplications to 2 (to calculate x^3), you
| can reduce it to 0, and it would still do _surprisingly_
| well.
| jvz01 wrote:
| Another good hint is the classical half-angle formulas. You
| can often avoid calling sin() and cos() altogether!
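|
| For example, one flavor of them (a sketch of the tangent
| half-angle identities): given t = tan(theta/2), both sin and cos
| follow from rational arithmetic alone:
|
|     /* sin(theta) = 2t/(1+t^2), cos(theta) = (1-t^2)/(1+t^2) */
|     static void sincos_from_half_tan(float t, float *s, float *c) {
|         float q = 1.0f / (1.0f + t * t);
|         *s = 2.0f * t * q;
|         *c = (1.0f - t * t) * q;
|     }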
| MauranKilom wrote:
| Some trivia, partly stolen from Bruce Dawson[0]:
|
| The sin(x) = x approximation is actually exact (in terms of
| doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens
| to cover 48.6% of all doubles.
|
| Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).
|
| Finally, sin(double(pi)) tells you the error in double(pi) -
| that is, how far the double representation of pi is away from
| the "real", mathematical pi [3].
|
| [0]: https://randomascii.wordpress.com/2014/10/09/intel-
| underesti...
|
| [1]: https://github.com/lattera/glibc/blob/master/sysdeps/iee
| e754...
|
| [2]: https://github.com/lattera/glibc/blob/master/sysdeps/iee
| e754...
|
| [3]: https://randomascii.wordpress.com/2012/02/25/comparing-
| float...
| RicoElectrico wrote:
| Come on, at this point I've seen this "engineering
| approximation" memed so many times, even on Instagram ;)
|
| What is more interesting to me is that this can be one of the
| rationales behind using radians.
|
| And that tan(x)~x also holds for small angles, greatly easing
| estimations of distance to objects of known size (cf. mil-dot
| reticle).
| lapetitejort wrote:
| tan(x)~x because cos(x)~1. It's approximations all the way
| down.
| mooman219 wrote:
| This would be a decent lookup for the atan2 function:
|
| https://gist.github.com/mooman219/19b18ff07bb9d609a103ef0cd0...
| tantalor wrote:
| https://en.wikipedia.org/wiki/Memoization
| bluedino wrote:
| _Technically_ it's a _lookup table_ if you pre-compute them.
| Memoization would just be caching them as you do them.
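|
| A toy sketch of the distinction (quantized sin over N buckets;
| names invented):
|
|     #include <math.h>
|     #define N 1024
|
|     /* Lookup table: every entry computed up front. */
|     static float lut[N];
|     static void lut_init(void) {
|         for (int i = 0; i < N; i++)
|             lut[i] = sinf(6.2831853f * i / N);
|     }
|
|     /* Memoization: same storage, but each entry is computed lazily
|        the first time it is requested. */
|     static float memo[N];
|     static unsigned char memo_set[N];
|     static float sin_memo(int i) {
|         if (!memo_set[i]) {
|             memo[i] = sinf(6.2831853f * i / N);
|             memo_set[i] = 1;
|         }
|         return memo[i];
|     }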
| tantalor wrote:
| Not necessarily, you could "cache" them in a compilation
| step and then use the table at runtime.
| bruce343434 wrote:
| Tangential at best, but why was the 'r' dropped from that
| term? Or why not call it caching? Why the weird "memo-
| ization"? It makes me think of a mass extinction event where
| everything is turned into a memo.
| franciscop wrote:
| It's explained right in the linked Wikipedia page:
|
| > The term "memoization" was coined by Donald Michie in
| 1968[3] and is derived from the Latin word "memorandum"
| ("to be remembered"), usually truncated as "memo" in
| American English, and thus carries the meaning of "turning
| [the results of] a function into something to be
| remembered". While "memoization" might be confused with
| "memorization" (because they are etymological cognates),
| "memoization" has a specialized meaning in computing.
| wongarsu wrote:
| The term memoization likely precedes the word caching (as
| related to computing, obviously weapon caches are far
| older). Memoization was coined in 1968. CPU caches only
| came about in the 80s as registers became significantly
| faster than main memory.
|
| As Wikipedia outlines, the r was dropped because of the
| memo. It's derived from the Latin word memorandum, which does
| contain the r, just like memory, but apparently it was meant
| more as an analogy to written memos.
| unemphysbro wrote:
| coolest blog post I've seen here in a while. :)
| aj7 wrote:
| Around 1988, I added phase shift to my optical thin film design
| program written in Excel 4.0 for the Mac. At the time, this was
| utterly unique: each spreadsheet row represented a layer and the
| matrices describing each layer could be calculated right in that
| row by squashing them down horizontally. The S- and
| P-polarization matrices could be recorded this way, and the
| running matrix products similarly maintained. Finally, using a
| simple one input table, the reflectance of a typically 25-31
| layer laser mirror could be calculated. And in less than a second
| on a 20 MHz 68020 (?) Mac II for about 50 wavelengths. The best
| part was the graphics, which were instantaneous, beautiful,
| publishable, and pasteable into customer quotations. Semi-
| technical people could be trained to use the whole thing.
|
| Now about the phase shift. In 1988, atan2 didn't exist. Anywhere.
| Not in FORTRAN, Basic, Excel, or a C library. I'm sure phase
| shift calculators implemented it, each working alone. For us, it
| was critical. You see we actually cared not about the phase
| shift, but its second derivative, the group delay dispersion.
| This was the beginning of the femtosecond laser era, and people
| needed to check whether these broadband laser pulses would be
| inadvertently stretched by reflection off or transmission through
| the mirror coating. So atan2, the QUADRANT-PRESERVING arc
| tangent, is required for a continuous, differentiable phase
| function. An Excel function macro did this, with IF statements
| correcting the quadrant. And the irony of all this?
|
| I CALLED it atan2.
| dzdt wrote:
| Fortran did have it much earlier. Per wikipedia atan2 appeared
| in Fortran IV from IBM in 1961.
| https://en.wikipedia.org/wiki/Atan2
| pklausler wrote:
| And it's been in every ANSI & ISO Fortran standard from F'66
| through F'2018.
| Narishma wrote:
| > Around 1988, I added phase shift to my optical thin film
| design program written in Excel 4.0 for the Mac.
|
| Excel 4 was released in 1992. In 1988 the only version of Excel
| that would have been available on the Mac would be the very
| first one.
| floxy wrote:
| Table 8-3, page 8-7 (pdf page 97):
|
| http://www.bitsavers.org/www.computer.museum.uq.edu.au/pdf/D...
| raphlinus wrote:
| 1965: https://www.ecma-international.org/publications-and-
| standard... pdf page 36
| floxy wrote:
| Nice find. Anyone have a way to search for old versions of
| "math.h"? The 1989 version of ANSI C had atan2:
|
| https://web.archive.org/web/20161223125339/http://flash-
| gord...
|
| ...but I wonder when it first hit the scene.
| kimixa wrote:
| According to https://unix.superglobalmegacorp.com/ it was
| in 3BSD libm - https://unix.superglobalmegacorp.com/cgi-
| bin/cvsweb.cgi/3BSD...
|
| 3BSD was released at the end of 1979 according to
| wikipedia, and that's just the oldest BSD source I could
| find, so it may have been in even older versions.
|
| Before ANSI/ISO C I don't think there was really a
| "math.h" to look for - as it may be different on
| different systems, but as so many things derive from BSD
| I wouldn't be surprised if it was "de-facto" standardised
| to what that supported.
| shoo wrote:
| Related -- there's a 2011 post from Paul Mineiro with fast
| approximations for logarithm, exponential, power, and inverse
| root. http://www.machinedlearnings.com/2011/06/fast-approximate-lo...
|
| Mineiro's faster approximate log2 has < 1.4% relative error for
| x in [1/100, 10]. Here's the simple non-SSE version:
|
|     static inline float fasterlog2 (float x) {
|       union { float f; uint32_t i; } vx = { x };
|       float y = vx.i;
|       y *= 1.1920928955078125e-7f;
|       return y - 126.94269504f;
|     }
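|
| (Roughly why this works, if I have it right: for positive
| x = 2^e * (1+m), the bit pattern read as an integer is
| i = 2^23 * (e + 127 + m), so i * 2^-23 - 127 = e + m, which is
| close to e + log2(1+m) = log2(x); the constant 126.94269504
| folds the 127 bias together with a fitted offset that reduces
| the error of the m ~= log2(1+m) approximation.)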
|
| The fastapprox library also includes fast approximations of some
| other functions that show up in statistical / probabilistic
| calculations -- gamma, digamma, the Lambert W function. It is
| BSD licensed and originally lived on Google Code; copies of the
| library survive on GitHub, e.g. https://github.com/etheory/fastapprox
|
| It's also interesting to read through libm. E.g. compare Sun's
| ~1993 atan2 & atan:
|
| https://github.com/JuliaMath/openlibm/blob/master/src/e_atan...
|
| https://github.com/JuliaMath/openlibm/blob/master/src/s_atan...
| nice2meetu wrote:
| I did something similar for tanh once, though I found I could get
| to 1 ulp.
|
| Part of the motivation was that I could get 10x faster than libc.
| However, I then tried on my FreeBSD and could only get 4x faster.
| After a lot of head scratching and puzzling it turned out there
| was a bug in the version of libc on my linux box that slowed
| things down. It kind of took the wind out of the achievement, but
| it was still a great learning experience.
| azhenley wrote:
| This is pretty similar to my quest to make my own cos() when my
| friend didn't have access to libc. It was fun! Though I don't
| have the math or low-level knowledge that this author does.
|
| https://web.eecs.utk.edu/~azh/blog/cosine.html
| pklausler wrote:
| (Undoubtedly) stupid question: would it be any faster to project
| (x, y) to the unit circle (x', y'), then compute acos(x') or
| asin(y'), and then correct the result based on the signs of x &
| y? When converting Cartesian coordinates to polar, the value of
| r=HYPOT(x, y) is needed anyway, so the projection to the unit
| circle would be a single division by r.
| stephencanon wrote:
| > if we're working with batches of points and willing to live
| with tiny errors, we can produce an atan2 approximation which is
| 50 times faster than the standard version provided by libc.
|
| Which libc, though? I assume glibc, but it's frustrating when
| people talk about libc as though there were a single
| implementation. Each vendor supplies their own implementation,
| libc is just a common interface defined by the C standard. There
| is no "standard version" provided by libc.
|
| In particular, glibc's math functions are not especially fast--
| Intel's and Apple's math libraries are 4-5x faster for some
| functions[1], for example, and often more accurate as well (and
| both vendors provide vectorized implementations). Even within
| glibc versions, there have been enormous improvements over the
| last decade or so, and for some functions there are big
| performance differences depending on whether or not -fno-math-
| errno is specified. (I would also note that atan2 has a lot of
| edge cases, and more than half the work in a standards-compliant
| libc is in getting those edge cases with zeros and infinities
| right, which this implementation punts on. There's nothing wrong
| with that, but that's a bigger tradeoff for most users than the
| small loss of accuracy and important to note.)
|
| So what are we actually comparing against here? Comparing against
| a clown-shoes baseline makes for eye-popping numbers, but it's
| not very meaningful.
|
| None of this should really take away from the work presented, by
| the way--the techniques described here are very useful for people
| interested in this stuff.
|
| [1] I don't know the current state of atan2f in glibc
| specifically; it's possible that it's been improved since last I
| looked at its performance. But the blog post cites "105.98 cycles
| / element", which would be glacially slow on any semi-recent
| hardware, which makes me think something is up here.
| rostayob wrote:
| (I'm the author)
|
| You're right, I should have specified -- it is glibc 2.32-48.
| This is the source specifying how glibc is built:
| https://github.com/NixOS/nixpkgs/blob/97c5d0cbe76901da0135b0...
|
| I've amended the article so that it says `glibc` rather than
| `libc`, and added a sidenote specifying the version.
|
| I link to it statically as indicated in the gist, although I
| believe that shouldn't matter. Also see
| https://gist.github.com/bitonic/d0f5a0a44e37d4f0be03d34d47ac...
| .
|
| Note that the hardware is not particularly recent (Q3 2017),
| but we tend to rent servers which are not exactly on the
| bleeding edge, so that was my platform.
| stephencanon wrote:
| Thanks for clarifying!
|
| Without benchmarking, I would expect atan2f to be around
| 20-30 cycles per element or less with either Intel's or
| Apple's scalar math library, and proportionally faster for
| their vector libs.
| rostayob wrote:
| Thanks for the info.
|
| By the way, your writing on floating point arithmetic is
| very informative -- I even cite a message of yours on FMA
| in the post itself!
| rostayob wrote:
| Ah, regarding edge cases: the only edge cases I do not handle
| are infinities and the origin. In my case these are both
| uninteresting: we deal with points from panoramic images, so
| those both make no sense.
|
| I did want to deal with 0s in either coordinates though, and to
| preserve NaNs.
|
| You're right that atan2 has a lot of edge cases, which is why
| it made for an interesting subject. I thought it was neat that
| the edge cases I did care about fell out fairly naturally from
| the efficient bit-twiddling.
|
| That said, the versions presented in the post are _not_
| conforming, as I remark repeatedly. The article is more about
| getting people curious about some topics that I think are neat,
| more than a guide on how to implement those standard functions
| correctly.
___________________________________________________________________
(page generated 2021-08-17 23:01 UTC)