[HN Gopher] Implementing Cosine in C from Scratch (2020)
___________________________________________________________________
Implementing Cosine in C from Scratch (2020)
Author : alraj
Score : 187 points
Date : 2023-06-05 11:12 UTC (11 hours ago)
(HTM) web link (austinhenley.com)
(TXT) w3m dump (austinhenley.com)
| JKCalhoun wrote:
| The author's Taylor series looked salvageable to me. The range
| between -pi/2 and pi/2 looked fine, and that usable range can be
| reused to cover all other portions of the circle.
|
| Add a conditional that negates the result for some angles and
| you're golden.
| londons_explore wrote:
| OP's lookup table benchmarking is bad. The "modd(x, CONST_2PI);"
| at the top will dominate, by far, the runtime.
|
| Anyone who wants performance measures angles using n bit fixed
| point math, mapping 0 to 2*pi as 0 to (2^n)-1.
| omgmajk wrote:
| >Maybe don't stare at that for too long...
|
| I sure did stare at that for way too long.
| pacaro wrote:
| I went through the same exercise implementing trig functions for
| scheme in webassembly...
|
| It was a rabbit hole for sure
|
| https://github.com/PollRobots/scheme/blob/main/scheme.wasm/s...
|
| [Edit: above is just the range from 0-pi/4; the branch cuts for
| cos are at
| https://github.com/PollRobots/scheme/blob/main/scheme.wasm/s... ]
| dang wrote:
| Url changed from
| https://web.archive.org/web/20210513043002/http://web.eecs.u...,
| which points to this.
| jpfr wrote:
| The Taylor expansion locally fits a polynomial based on the first
| n derivatives. If you want to find the "best" nth-degree
| polynomial to approximate the sine function, functional analysis
| gives the tools for solving that optimization in closed form.
|
| By selecting an appropriate norm (in function space) you can
| minimize either the maximum error or the error integral over some
| range (e.g. the 0-pi range).
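|
| In symbols, with P_n the polynomials of degree at most n, the two
| choices of norm correspond roughly to:

```latex
\min_{p \in \mathcal{P}_n} \; \max_{x \in [0,\pi]} \lvert f(x) - p(x) \rvert
\quad \text{(minimax, } L^\infty \text{)}
\qquad \text{vs.} \qquad
\min_{p \in \mathcal{P}_n} \int_0^{\pi} \lvert f(x) - p(x) \rvert \, dx
\quad \text{(error integral, } L^1 \text{)}
```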
|
| Here's a video on the subject. You might want to watch earlier
| ones also for more context.
|
| https://www.youtube.com/watch?v=tMlKZZf2Kac&list=PLdkTDauaUn...
|
| Full disclosure, this is my university lecture on optimization
| that was recorded during Covid.
| stephencanon wrote:
| It's worth noting that for fixed-precision math libraries, we
| usually want the best approximation in the (possibly weighted)
| L-inf sense, which doesn't generally have a closed-form
| solution, so we use iterative methods to generate
| approximations instead.
| bigdict wrote:
| What's a good textbook intro to these methods?
| remcob wrote:
| I wrote this intro for a friend and since then more people
| have found it useful:
|
| https://xn--2-umb.com/22/approximation
|
| If you want a proper textbook, I recommend L.N. Trefethen
| (2019). "Approximation Theory and Approximation Practice."
| Which is available online here:
|
| https://epubs.siam.org/doi/book/10.1137/1.9781611975949
| munificent wrote:
| I think this is the first time I've seen a punycode URL
| in the wild. I love it!
| stephencanon wrote:
| That's a really nice overview of Remez, thanks for
| sharing!
| nsajko wrote:
| Seems also worth noting that although the usual algorithm for
| finding the minimax polynomial is the very old Remez
| algorithm, I'm playing with a minimax finder that relies on
| what I think is a novel algorithm. My algorithm seems better
| than Remez, and certainly beats it (as in, finds the solution
| when Remez can not) in many cases, even though I don't have
| an analysis of the algorithm.
|
| The main idea is to use linear programming with some chosen
| points from the interval to look for the polynomial that
| approximates the chosen function over that interval. Unlike
| Remez, this enables control over which individual points from the
| interval are chosen as representatives, making it possible to
| avoid ill-behaved points. An example of where this leads
| to improvements over Remez is when optimizing relative error:
| Remez would trip over zeros of the function, because they
| cause singularities in the relative error, however my
| algorithm works (by avoiding ill-behaved points) as long as
| the discontinuities are removable.
|
| My algorithm is also a lot more flexible than Remez', for
| example it allows optimizing over multiple intervals instead
| of a single one.
|
| The Git repo of the (still in-progress) project is here:
| https://gitlab.com/nsajko/FindMinimaxPolynomial.jl
|
| The Julia package is already registered (installable from the
| official Julia registry): https://juliahub.com/ui/Packages/Fi
| ndMinimaxPolynomial/kNIo8...
| adgjlsfhk1 wrote:
| This is really cool and I think it might be a re-discovery
| of https://dl.acm.org/doi/abs/10.1145/3434310.
| rjmunro wrote:
| Can you get some better accuracy by noting that cos(x) =
| sin(pi/2 - x) and using e.g. the Taylor expansion for sin when
| pi/4 < x < 3pi/4?
| Technotroll wrote:
| Looks to me like you could compute the top half and then produce
| the rest as a mirrored copy, repeated with some set translation.
| Am I wrong here?
| dh2022 wrote:
| I thought the same thing. When I saw the Taylor sum diverging at
| the "edges" I thought: a truncated Taylor sum is accurate only on
| a compact interval around the expansion point. An easy solution to
| this limitation is to use the periodicity of the cos/sin
| function...
| sojuz151 wrote:
| I would say the benchmarks with lookup tables are bad. In the
| benchmark the CPU will keep the table in cache, but in a real
| program that cache would have to be shared with the rest of the
| code/data. This would kill the performance of either the cosine or
| the rest of the app.
| dang wrote:
| Related:
|
| _Implementing Cosine in C from Scratch (2020)_ -
| https://news.ycombinator.com/item?id=30844872 - March 2022 (134
| comments)
|
| _Implementing cosine in C from scratch_ -
| https://news.ycombinator.com/item?id=23893325 - July 2020 (20
| comments)
| eschneider wrote:
| Trig functions are where accuracy and performance go to die.
| Accumulated error is a thing, so when optimizing always consider
| exactly how you're going to be using those functions so you make
| the 'right' tradeoffs for your application. One size definitely
| doesn't fit all here, so test and experiment.
| amadio wrote:
| Taylor expansion is usually not so good for this; you'll fare
| better with either Legendre or Chebyshev polynomials. Robin Green
| has some excellent material on the subject, which I am linking
| below.
|
| Faster Math Functions:
| https://basesandframes.files.wordpress.com/2016/05/fast-math...
| https://basesandframes.files.wordpress.com/2016/05/fast-math...
|
| Even faster math functions GDC 2020:
| https://gdcvault.com/play/1026734/Math-in-Game-Development-S...
| aidenn0 wrote:
| Marz's Taylor series implementation as-is is significantly faster
| (~60% the runtime) than the glibc implementation at 6 terms on My
| Machine(tm), rather than about the same as in TFA. I haven't
| compared to LERP lookup table yet though.
| femto wrote:
| If doing signal processing, an optimisation is to use an N-bit
| integer to represent the range 0 to 2*pi. Some example points in
| the mapping: 0 -> 0, 2^(N-2) -> pi/2, 2^(N-1) -> pi, 2^N (wraps
| to 0) -> 2*pi (wraps to 0).
|
| If your lookup table has an M-bit index, the index into the
| lookup table is calculated with: index = (unsigned)theta >> (N-M),
| where theta is the N-bit integer representing the angle.
|
| The fractional part, which can be used for interpolation, is:
| theta & ((1<<(N-M))-1).
|
| If you choose M = N = word size (16 bits is often nice), the
| angle can be used directly as an index. With 16 bits, accuracy is
| typically good enough without interpolation, and the entire trig
| operation reduces to an array indexing operation.
|
| This minimises the overhead of converting an angle to a table
| index.
| polalavik wrote:
| This method has 2 error terms - phase error and amplitude
| error. You can only represent sinusoids with a resolution of
| 2pi/2^N rad/samp (same phase error if you don't do phase
| truncation). Maybe that's enough for most applications, but
| it's probably significantly more error than math libraries that
| calculate cos. Totally depends on your application.
| rjmunro wrote:
| Why just signal processing? Surely this is good for storing
| e.g. rotations of characters in games, longitude, in fact
| almost anywhere where you want an angle where a full rotation
| is the same as no rotation at all, and more precision as you
| approach 0deg is not useful.
| vikingerik wrote:
| Accumulated error can matter, which is where you'd want that
| precision near 0deg. If something is making tiny turns each
| frame for thousands of frames, floating-point inaccuracy can
| add up. (A smarter mathematical approach might remedy this,
| or might not; think of some game component that is supposed
| to jitter randomly each frame, maybe some particle system.)
| It's not common, but it's within hypothetical feasibility
| that the precision would be useful.
| amitport wrote:
| He is describing deterministic uniform round-to-nearest
| quantization. The most common smarter mathematical approach would
| be randomized rounding, which solves this issue.
|
| (The issue being that deterministic rounding is a biased estimator
| of the sin function.)
| londons_explore wrote:
| > randomized rounding
|
| There are a bunch of places where randomized rounding
| works better. But is there any computer hardware that
| implements this? Obviously the internal precision (of the
| random element) can't be infinite, but a couple of extra
| bits would seem worth the complexity.
| mvcalder wrote:
| If you like this sort of thing there's a great book: Methods and
| Programs for Mathematical Functions by Stephen Moshier. It covers
| the computation of all sorts of special functions. The code is
| available in the cephes library but the book may be out of print.
| jonsen wrote:
| Not even a single hit on bookfinder.com!
| xorvoid wrote:
| How good do you need it? Lol.
|
| This is the approximation that I used for the animated sine-wave
| example for SectorC:
|
| y ~= 100 + (x*(157 - x)) >> 7
|
| https://github.com/xorvoid/sectorc/blob/main/examples/sinwav...
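|
| For anyone curious, that formula (parenthesised here on the
| assumption that the shift applies to the product only) traces half
| a sine-ish arc as x runs from 0 to 157:

```c
/* Sketch of the integer sine-wave above: a parabola standing in for
 * half a sine period. The constants come from the linked SectorC
 * example (157 ~ the half-period in pixels, 100 ~ the vertical
 * centre of the screen); amplitude works out to ~48. */
int sinwave_y(int x)
{
    return 100 + ((x * (157 - x)) >> 7);
}
```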
| azhenley wrote:
| No need for archive.org, my website moved a few years ago:
| https://austinhenley.com/blog/cosine.html
| butterlesstoast wrote:
| Thanks for taking the time to make such an incredible article.
| Made my day reading it this morning.
| dang wrote:
| Fixed. Thanks!
| capitainenemo wrote:
| BTW, WRT projects using a lookup table...
|
| https://hg.hedgewars.org/hedgewars/file/tip/hedgewars/uSinTa...
|
| I think there's even a github mirror someone made since that's
| where you were looking. Maybe a little out of date, but this is
| old code.
| capitainenemo wrote:
| oh... also, the sine table isn't just a performance thing.
| Hedgewars does custom fixed-point math for deterministic lockstep,
| so it keeps the math simple and predictable across various
| compilers/hardware.
| [deleted]
| charlieyu1 wrote:
| Related:
|
| https://news.ycombinator.com/item?id=35381968 Cosine
| Implementation in C
|
| Much better approximation with only 7 terms
| londons_explore wrote:
| Note that you can combine the lookup table with the taylor series
| expansion...
|
| And you can even use the same lookup table for both. That means
| with two table lookups in a 32-entry table, a single multiply and
| add (and a few bit shifts), you can get ~9 bits of precision in
| your result, which is fine for most uses. It also probably makes
| a sin operation take ~1 clock cycle on most superscalar
| architectures, as long as your workload has sufficient
| instruction parallelism.
|
| Note that a smaller table typically works out faster because 32
| entries fit in the cache, whereas repeated random access into a
| 1024-entry table would probably evict a bunch of other stuff from
| the cache that you wanted.
| throwawaaarrgh wrote:
| Strange. This article was easy to read, with simple formatting,
| not driven by trends or weird internet culture. There weren't any
| bombastic claims, aggrandizing statements or dramatic opinions.
| Just interesting information presented clearly without being
| verbose. Almost like it was written by an adult. How did this end
| up on HN?
| smlacy wrote:
| Yeah, very much unlike your comment. Strange, that.
| [deleted]
| Helenarttr wrote:
| [dead]
| Aardwolf wrote:
| " In all my time of coding, there has only been one situation in
| which I used cosine: games, games, games.
| function move(obj) { obj.x += obj.speed *
| Math.sin(obj.rotation); obj.y += obj.speed *
| Math.cos(obj.rotation); }
|
| "
|
| Why not store the velocity as a 2D vector instead? You still have
| to use cos/sin to compute this vector, but at least you don't need
| them every frame. And often you don't need cos/sin to compute the
| vector at all, since forces that act on the velocity can
| themselves have x and y components that you add to it directly.
| messe wrote:
| Sometimes it's just easier to work and think in polar
| coordinates. If that's the case, then (speed, rotation) is just
| as valid a choice as (vx, vy). For smaller games, the cost of
| computing sin and cos every frame is so astronomically small that
| it's not even worth concerning yourself over.
| Aardwolf wrote:
| Hmm, I usually find polar coordinates harder to reason with
| due to the gimbal lock :)
|
| In 2D there's no gimbal lock of course, but there's still a
| number that behaves differently than the others, the angle
| wrapping around in range 0..2pi!
|
| And of course the article is about fast computation of
| sin/cos so the option of not computing it should be
| considered
| midjji wrote:
| The equivalent of polar coordinates in 3D is quaternions, which
| do not have gimbal lock; the problem only occurs for Euler angles
| (which should never ever be used for anything, ever).
| messe wrote:
| > The equivalent of polar coordinates in 3d are
| quaternions
|
| I'd disagree. Quaternions are the 3D equivalent (yes, I'm
| aware of the subtleties here, but speaking in a rough
| sense) to using complex numbers to represent rotations.
|
| Euler angles are much closer in spirit to polar
| coordinates, but unfortunately have gimbal lock.
| thehappypm wrote:
| Can you explain gimbal lock? I can't wrap my head around it
| Aardwolf wrote:
| The main thing is that when you represent a 3D look orientation
| as multiple angles (pitch = up/down, yaw = compass direction), you
| lose control over the yaw axis when looking straight up.
| [deleted]
| pacaro wrote:
| Of course gimbal lock is a thing, and there are ways to
| avoid it, even without quaternions, but I'm curious as to
| when it should be avoided. I think that gamers are now so
| used to it, that removing it might cause more problems
| than it solves
| [deleted]
| Aardwolf wrote:
| I guess that's more for airplane simulators then. But I'd also
| think any physics engine where things can rotate arbitrarily needs
| to work correctly here.
|
| I also think, despite various sources on the internet
| saying "quaternions are the solution to gimbal lock",
| that classical 3D vectors and 3D matrices without
| quaternions also perfectly solve it.
| pacaro wrote:
| Agreed absolutely. I've used quaternions when someone
| else made the decision that was what the physics engine
| was going to do! But when I'm the decision maker it's 3d
| vectors and matrices all the way
| [deleted]
| basementcat wrote:
| A physical analogy is an azimuth-elevation mounted antenna
| tracking an object that flies over the zenith. When the object is
| near the horizon, the angular velocity of the azimuth actuator may
| be near zero and that of the elevation actuator may be small. Near
| zenith, the angular velocity of the elevation actuator may still
| be small, while that of the azimuth actuator may approach
| infinity.
| mistercow wrote:
| This is a fine introduction to how you even approach the problem
| of computing transcendental functions.
|
| It would have been nice to have some discussion of accuracy
| requirements rather than just eyeballing it and saying "more
| accuracy than I need". This is a spot where inexperienced devs
| can easily get tangled up. An attitude like "ten digits? That's
| so many! I'm only making a game, after all" is exactly the sort
| of thing that gets you into trouble if you start accumulating
| errors over time, and this is particularly easy to do with trig
| functions.
| 082349872349872 wrote:
| Niklaus Wirth also has some cosine code (probably meant more for
| software floating point, or fp FPGA blocks); I don't know how
| they compare with these approximations but his seem to be within
| 1e-6 of python's math.cos ...
|
| https://people.inf.ethz.ch/wirth/ProjectOberon/Sources/Math....
| dm319 wrote:
| The answer to Cos(1.57079632) (in radians) will give you a clue
| as to how your calculator does it.
|
| See here [0]
|
| [0]
| https://www.reddit.com/r/calculators/comments/126st95/cos157...
| aidenn0 wrote:
| 6.7948900e-9 which has only the first 5 digits correct (if you
| assume it would have to have been 6.79490 to get 6 digits
| correct since that is what the correct value rounds to)
| the_third_wave wrote:
| Quite well, when you realise this comes down to _cos(pi/2)_, to
| which the thing answers correctly: _cos(pi/2) = 0_.
|
| Using HiPER Calc on Android
| midjji wrote:
| A good idea is not to compute the values of cos over 0-2pi, but
| to further reduce the range, using cos(a) = cos(-a), cos(2a) =
| 2cos^2(a) - 1, or cos(a+pi/4) =...
|
| So we really only ever need to be able to compute cos in the
| range 0-pi/4.
|
| Then for further accuracy we can do the Taylor expansion around
| pi/8 (or other approximations).
|
| Finally, the number of terms needed for a fixed accuracy varies
| with the distance from pi/8.
| londons_explore wrote:
| I think all real implementations use a lookup table... Small
| lookup tables (e.g. 8 elements) work out far smaller in silicon
| area than even one multiplier...
| xipix wrote:
| Nice. I'd love to see how this changes when you have SIMD and
| multiple cosines to compute in parallel. Also when you have to
| compute sine and cosine simultaneously which is often the case,
| and then you may be more interested in polar error than cartesian
| error.
| p0nce wrote:
| You can do this with: https://github.com/RJVB/sse_mathfun Those
| functions do not have the same safety as libc's, but even for
| scalar sin/cos they can be faster.
| amiga386 wrote:
| One of the things I loved most about reading kernel and libc (or
| rather, libm) sources was the floating-point and FP emulation
| code. I thought it was immensely cool to see that what others just
| expected an FPU instruction to do, someone had not only written
| out in C, but also documented with comments explaining the
| mathematics (rendered in ASCII), often with numerical analysis of
| error propagation, accuracy, etc.
|
| Some examples:
|
| https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=new...
|
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
|
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
| sampo wrote:
| > cosine repeats every 2pi
|
| > We can do better since the cosine values are equivalent every
| multiple of pi, except that the sign flips.
|
| There is one more step to take: each pi/2-long segment has an
| identical shape; they are just pointing up or down (like you
| already noticed), and left or right. So you can reduce your basic
| domain down to not just 0...pi, but to 0...pi/2.
| amelius wrote:
| But that takes extra clock cycles, since now you have to fiddle
| with the sign.
| Someone wrote:
| I think it's likely that you more than get that back by doing
| (at least) one iteration fewer.
|
| Those who know more about this: educate me.
| adgjlsfhk1 wrote:
| Pretty much every implementation does reduction mod pi/2.
| Switching polynomials is basically free (and halving the
| domain significantly reduces the number of terms needed).
| IIAOPSW wrote:
| And you can keep going with this logic, building cos from 0-x
| by recursively calculating cos from 0-x/2 then mirroring over
| some line if needed. Eventually you hit machine precision limit
| aka the base case cos(0) = 1.
| ajross wrote:
| And still further: the function can be piecewise decomposed
| into separate polynomials, which is what the pervasive SunPro
| code in everyone's libc does. There's one "mostly parabolic"
| function to handle the curvy top, and another "mostly linear"
| one to handle the zero crossing region.
| gmiller123456 wrote:
| You can actually reduce it to a quarter period, as the range from
| 0 to pi/2 is the same shape as the remaining 3 segments, just
| mirrored along different axes.
___________________________________________________________________
(page generated 2023-06-05 23:02 UTC)