[HN Gopher] Unit Testing Numerical Routines
___________________________________________________________________
Unit Testing Numerical Routines
Author : boscillator
Score : 65 points
Date : 2024-11-12 13:03 UTC (5 days ago)
(HTM) web link (buchanan.one)
(TXT) w3m dump (buchanan.one)
| amelius wrote:
| If you have a numerical routine that converts from one
| representation to another, you can test it by calling the routine
| implementing the inverse conversion.
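|
| A minimal sketch of such a round-trip test in Python (the
| degree/radian pair here is just a stand-in for whatever
| conversion pair is actually under test):
|
|     import math
|
|     def forward(deg):              # stand-in: degrees -> radians
|         return deg * math.pi / 180.0
|
|     def inverse(rad):              # stand-in: radians -> degrees
|         return rad * 180.0 / math.pi
|
|     def test_round_trip():
|         for x in [0.0, 1.0, -1.0, 1e-9, 90.0, 1e6]:
|             # Each conversion rounds, so compare with a relative
|             # tolerance, not exact equality.
|             assert math.isclose(inverse(forward(x)), x, rel_tol=1e-12)
|
|     test_round_trip()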
| pclmulqdq wrote:
| Assuming you don't have an identical bug in the inverse
| routine.
| physicsguy wrote:
| All good stuff, I'd add though that for many numerical routines
| you end up testing on simple well known systems that have well
| defined analytic answers.
|
| So for e.g. if you're writing solvers for ODE systems, then you
| tend to test them on things like simple harmonic oscillators.
|
| If you're very lucky, some algorithms have bounded errors. For
| e.g. the fast multipole method for evaluating long range forces
| does, so you can check this for very simple systems.
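|
| For the simple harmonic oscillator case, a minimal Python sketch
| (the hand-rolled RK4 step stands in for the solver under test;
| the analytic solution of x'' = -x with x(0)=1, v(0)=0 is cos(t)):
|
|     import math
|
|     def rk4_step(f, t, y, h):
|         # One classic RK4 step for the system y' = f(t, y).
|         k1 = f(t, y)
|         k2 = f(t + h/2, [yi + h/2*ki for yi, ki in zip(y, k1)])
|         k3 = f(t + h/2, [yi + h/2*ki for yi, ki in zip(y, k2)])
|         k4 = f(t + h,   [yi + h*ki   for yi, ki in zip(y, k3)])
|         return [yi + h/6*(a + 2*b + 2*c + d)
|                 for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
|
|     def test_simple_harmonic_oscillator():
|         f = lambda t, y: [y[1], -y[0]]   # x'' = -x as a first-order system
|         y, t, h = [1.0, 0.0], 0.0, 1e-3
|         while t < 1.0:
|             y = rk4_step(f, t, y, h)
|             t += h
|         assert abs(y[0] - math.cos(t)) < 1e-8   # loose bound, RK4 is ~1e-12 here
|
|     test_simple_harmonic_oscillator()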
| LeanderK wrote:
| Yeah, that's how I do it. You start simple (known solutions for
| trivial points), then easy cases with analytic solutions, and
| then just some random stuff where you test that a solution is
| reached without errors and makes sense (correct sign etc.),
| here called property-based tests.
| phdelightful wrote:
| I have worked on Kokkos Kernels, a set of performance-portable
| math libraries for the Kokkos programming system. We implement
| BLAS, sparse matrix operations, a variety of solvers, some ODE
| stuff, and various utilities for a variety of serial and parallel
| execution modes on x86, ARM, and the three major GPU vendors.
|
| The simplest and highest-impact tests are all the edge cases - if
| an input matrix/vector/scalar is 0/1/-1/NaN, that usually tells
| you a lot about what the outputs should be.
|
| It can be difficult to determine sensible numerical limits for
| error in the algorithms. The simplest example is a dot product -
| summing floats is not associative, so doing it in parallel is not
| bitwise the same as serial. For dot in particular it's relatively
| easy to come up with an error bound, but for anything more
| complicated it takes a particular expertise that is not always
| available. This has been a work in progress, and sometimes
| (usually) we just picked a magic tolerance out of thin air that
| seems to work.
|
| Solvers are tested using analytical solutions and by inverting
| them, e.g. if we're solving Ax = y for x, then Ax should come
| out "close" to the original y (see error tolerance discussion
| above).
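|
| A rough sketch of that residual check in Python (numpy's dense
| solver stands in for the solver under test; the tolerance is the
| kind of loose, scale-aware bound discussed above):
|
|     import numpy as np
|
|     def test_solve_residual():
|         rng = np.random.default_rng(0)
|         A = rng.standard_normal((50, 50))
|         y = rng.standard_normal(50)
|         x = np.linalg.solve(A, y)            # solver under test
|         residual = np.linalg.norm(A @ x - y)
|         # Bound the residual relative to the problem's own magnitude.
|         assert residual <= 1e-10 * np.linalg.norm(A) * np.linalg.norm(x)
|
|     test_solve_residual()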
|
| One of the most surprising things to me is that the suite has
| identified many bugs in vendor math libraries (OpenBLAS, MKL,
| cuSparse, rocSparse, etc.) - a major component of what we do is
| wrap up these vendor libraries in a common interface so our users
| don't have to do any work when they switch supercomputers, so in
| practice we test them all pretty thoroughly as well. Maybe I can
| let OpenBLAS off the hook due to the wide variety of systems they
| support, but I expected the other vendors would do a better job
| since they're better-resourced.
|
| For this reason we find regression tests to be useful as well.
| essexedwards wrote:
| I've also been surprised many times by issues in numerical
| libraries. In addition to matrices with simple entries, I've
| found plenty of bugs just testing small matrices, with
| dimensions in {0,1,2,3,4}. Many libraries/routines fall over
| when the matrix is small, especially when one dimension is 0 or
| 1.
|
| Presently, I am working on cuSPARSE and I'm very keen to
| improve its testing and correctness. I would appreciate
| anything more you can share about bugs you've seen in cuSPARSE.
| Feel free to email me, eedwards at nvidia.com
| AlotOfReading wrote:
| This is one of the reasons I argue that it's almost always
| better to prioritize speed and stability over accuracy
| specifically. No one actually knows what their thresholds are
| (including library authors), but the sky isn't falling
| despite that. Instabilities and nondeterminism will blow up a
| test suite pretty quickly though.
| phdelightful wrote:
| Yeah, we really just try to come up with very loose bounds
| since the analysis is hard. Even so, it does occasionally
| stop us from getting things way way wrong.
| essexedwards wrote:
| > No one actually knows what their thresholds are
| (including library authors)
|
| If low-level numerical libraries provided documentation for
| their accuracy guarantees, it would make it easier to
| develop software on top of those libraries. I think
| numerical libraries should be doing this, when possible.
| It's already common for special-function (e.g. sin, cos,
| sqrt) libraries to specify their accuracy in ULPs. It's
| less common for linear algebra libraries to specify their
| accuracy, but it's still quite doable for BLAS-like
| operations.
| AlotOfReading wrote:
| What I'm trying to convey is that the required accuracies
| for the application are what's unclear. To give an
| example of a case where accuracy matters, I regularly
| catch computational geometry folks writing code that
| branches differently on positive, negative, and 0
| results. That application implies 0.5 ulp, which
| obviously doesn't match the actual implementation
| accuracy even if it's properly specified, so there's
| usually a follow-up conversation trying to understand
| what they really need and helping them achieve it.
| RhysU wrote:
| Nx0, 0xN, and 0x0 matrices are great edge cases.
|
| 0-length vectors, too.
|
| Then, do the same with Nx1, 1xN, and 1x1 matrices.
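|
| A sketch of those degenerate-shape checks in Python (numpy's
| matmul stands in for the routine under test):
|
|     import numpy as np
|
|     def test_degenerate_shapes():
|         # Empty and near-empty shapes hit edge cases that
|         # full-size inputs never exercise.
|         for m, k, n in [(0, 0, 0), (3, 0, 2), (0, 4, 0), (1, 1, 1), (1, 5, 1)]:
|             A = np.ones((m, k))
|             B = np.ones((k, n))
|             C = A @ B                        # routine under test
|             assert C.shape == (m, n)
|             # With k == 0 the product is all zeros; otherwise every
|             # entry of ones(m,k) @ ones(k,n) equals k.
|             assert np.all(C == (0.0 if k == 0 else float(k)))
|
|     test_degenerate_shapes()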
| dahart wrote:
| > sometimes (usually) we just picked a magic tolerance out of
| thin air that seems to work.
|
| Probably worth mentioning that in general the tolerance should
| be relative error, not absolute, for floating point math.
| Absolute error tolerance should only be used when there's a
| maximum limit on the magnitude of inputs, or the problem has
| been analyzed and understood.
|
| I know that doesn't stop people from just throwing in 1e-6 all
| over the place, just like the article did. (Hey I do it too!)
| But if the problem hasn't been analyzed then an absolute error
| tolerance is just a bug waiting to happen. It might seem to
| work at first, but then catch and confuse someone as soon as
| the tests use bigger numbers. Or maybe worse, fail to catch a
| bug when they start using smaller numbers.
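|
| A small Python illustration of the difference, using math.isclose
| (which is relative by default; abs_tol only matters near zero):
|
|     import math
|
|     # Large inputs: an absolute 1e-6 bound is impossibly strict...
|     assert     math.isclose(1e9 + 1.0, 1e9, rel_tol=1e-6)
|     assert not math.isclose(1e9 + 1.0, 1e9, rel_tol=0.0, abs_tol=1e-6)
|     # ...while for small inputs the same absolute bound hides a 100% error.
|     assert not math.isclose(1e-9, 2e-9, rel_tol=1e-6)
|     assert     math.isclose(1e-9, 2e-9, rel_tol=0.0, abs_tol=1e-6)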
| tomsmeding wrote:
| But then relative error is also not a panacea. If I compute 1
| + 1e9, then producing 1e9 - 1 instead would fall within a
| relative error bound of 1e-6 easily. More generally, relative
| error works only if your computation scales
| "multiplicatively" from zero; if there's any additive
| component, it's suspect.
|
| Of course, as you say, absolute error is also crap in
| general: it's overly restrictive for large inputs and overly
| permissive for small ones.
|
| I'm not a numerics person, but I do end up needing to decide
| on something sensible for error bounds on computations
| sometimes. How does one do this properly? Interval arithmetic
| or something?
| dahart wrote:
| > More generally, relative error works only if your
| computation scales "multiplicatively" from zero; if there's
| any additive component, it's suspect.
|
| IEEE floating point inherently scales from zero: the
| absolute error in _any_ computation is proportional to the
| magnitude of the input numbers, whether you're adding or
| multiplying or doing something else. It's the reason that
| subtracting two large numbers has a higher relative error
| than subtracting two small numbers, c.f. catastrophic
| cancellation.
|
| > How does one do this properly?
|
| There's a little bit of an art to it, but you can start by
| noting the actual result of any accurate operation has a
| maximum error of 0.5 LSB (least significant bit) simply as
| a byproduct of having to store the result in 32 or 64 bits;
| essentially just think about every single math instruction
| being required to round the result so it can fit into a
| register. Now write an expression for your operation in
| terms of perfect math. If I'm adding two numbers it will
| look like a[+]b = (a + b)*(1 + e), where e is your 0.5 LSB
| epsilon value. For 32 bit float, e == +/- 2^-24. In this
| case I differentiate between digital addition with a finite
| precision result, and perfect addition, using [+] for
| digital addition and + for perfect addition.
|
| This gets hairy and you need more tricks for anything
| complicated, but multiplying each and every operation by
| (1+e) is the first step. It quickly becomes apparent that
| the maximum error is bounded by |e| * (|a|+|b|) for
| addition or |e| * (|a| * |b|) for multiply... substitute
| whatever your operation is.
|
| When doing more complicated order-dependent error analysis,
| it's helpful to use bounds and to allow error estimates to
| grow slightly in order to simplify expressions. This way
| you can prove the error is less than a certain expression,
| but the expression might be conservative.
|
| A 3d dot product is a good example to work though using
| (1+e). Typically it's reasonable to drop e^2 terms, even
| though it will technically compromise your error bound
| proof by some minuscule amount.
|
|     a[*]x [+] b[*]y [+] c[*]z
|       = ((ax(1+e) + by(1+e))(1+e) + cz(1+e))(1+e)
|       = (ax + by)(1+e)^3 + cz(1+e)^2
|       = ax + by + cz + 3e(ax+by) + 2e(cz) + higher order terms in e
|
| Now drop all the higher order terms of e:
|
|       = ax + by + cz + 3e(ax+by) + 2e(cz)
|
| Now also notice that 2e|cz| <= 3e|cz|, so the total error is
| bounded by: |error| <= 3e( |a||x| + |b||y| + |c||z| )
|
| And despite the intermediate mess, this suddenly looks very
| conceptually simple and doesn't depend on the order of
| operations. If the input values are all positive, then we
| can say the error is at most 3e times the magnitude
| of the dot product. And it's logical too because we stacked
| 3 math operations, one multiply for each element of the sum
| and two adds.
|
| Sorry if that was way too much detail... I got carried
| away. :P I glossed over some topics, and there could be
| mistakes but that's the gist. I've had to do this for my
| work on a much more complicated example, and it took a few
| tries. There is a good linear algebra book about this, I
| think called Accuracy and Stability of Numerical Algorithms
| (Nicholas Higham). The famous PBR graphics book by Pharr
| et al. also talks about error estimation techniques.
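|
| As a quick empirical check of that 3e bound, a small numpy sketch
| (float64 serves as the "perfect" reference, and the inputs are
| generated directly in float32 so only the dot product's own
| rounding is measured):
|
|     import numpy as np
|
|     def test_dot3_error_bound():
|         rng = np.random.default_rng(1)
|         e = 2.0 ** -24                       # 0.5 ulp for float32
|         for _ in range(1000):
|             a = rng.uniform(-1e3, 1e3, 3).astype(np.float32)
|             x = rng.uniform(-1e3, 1e3, 3).astype(np.float32)
|             # Left-to-right dot product carried out entirely in float32.
|             computed = np.float32(0.0)
|             for ai, xi in zip(a, x):
|                 computed = computed + ai * xi
|             a64, x64 = a.astype(np.float64), x.astype(np.float64)
|             exact = float(a64 @ x64)
|             bound = 3.0 * e * float(np.abs(a64) @ np.abs(x64))
|             assert abs(float(computed) - exact) <= bound
|
|     test_dot3_error_bound()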
| boulos wrote:
| This is what ULPs are for
| (https://en.wikipedia.org/wiki/Unit_in_the_last_place).
|
| It's easier for most building blocks (like transcendental
| functions) to be discussed in terms of worst case ULP error
| (e.g., <= 1 everywhere, <= 3, etc.). For example, SPIR-V /
| OpenCL has this section on the requirements to meet the
| OpenCL 3.0 spec
| (https://registry.khronos.org/OpenCL/specs/3.0-unified/html/O...).
| NVIDIA includes per-PTX-revision ULP information in their docs
| (e.g., https://docs.nvidia.com/cuda/parallel-thread-execution/#floa...
| for Divide and
| https://docs.nvidia.com/cuda/parallel-thread-execution/#floa...
| more broadly).
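|
| For what it's worth, a self-contained Python sketch of the usual
| ULP-distance helper that this kind of comparison is built on
| (reinterpret the bits as integers on a number line, then subtract):
|
|     import struct
|
|     def ulp_distance(a: float, b: float) -> int:
|         # Number of representable doubles between a and b.
|         def ordered(x: float) -> int:
|             i = struct.unpack("<q", struct.pack("<d", x))[0]
|             return i if i >= 0 else -(i & 0x7FFFFFFFFFFFFFFF)
|         return abs(ordered(a) - ordered(b))
|
|     assert ulp_distance(1.0, 1.0) == 0
|     assert ulp_distance(1.0, 1.0000000000000002) == 1   # adjacent doubles
|     assert ulp_distance(-0.0, 0.0) == 0                  # +0 and -0 coincide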
| boscillator wrote:
| 100% agree that picking numeric tolerances can be tricky. It is
| especially tricky when writing a generic math library like you
| are. If you're doing something more applied it can help to take
| limits from your domain. For the example in the blog, if you're
| using GPS to determine your position on earth, you probably
| know how precise physics allows that answer to be, and you only
| need to test to that tolerance (or an order of magnitude more
| strict, to give some wiggle room).
| alexpotato wrote:
| Mentioning this b/c it was something that surprised me when I
| first heard it:
|
| Many Python numerical libraries change how they internally
| represent certain data types and/or change internal algorithms
| from version to version. This means that running the exact same
| code on two different versions may give you slightly different
| results.
|
| In your side project this may not be a big deal, but if you are
| working at, say, a hedge fund or a company that runs long and
| complicated data processing pipelines, then version upgrades with
| no unit testing can be "very bad".
| atoav wrote:
| If you are working in a hedge fund you probably (hopefully?)
| know it isn't a good idea to represent monetary values using
| floats. And your numerics library should be checked for
| correctness on each commit, just in case.
| wesselbindt wrote:
| Of course, but regardless, version bumps do sometimes affect
| the outcomes of pipelines. Unit tests are a good thing to
| have.
| samatman wrote:
| This is a useful bit of conventional wisdom, but it helps to
| know when it is and isn't applicable.
|
| There is a huge amount of data analysis which uses numbers
| representing money as inputs where it's fine to use floating
| point, and financial entities routinely do so. For
| transactions, interest payments, and that sort of thing, then
| yes, one needs a specific sort of fixed-point representation,
| and operations which round in whatever the legally-correct
| fashion is for the use case.
|
| Reductively, if you've turned money into numbers, and will
| turn the resulting numbers back into money, don't use
| floating point. If you're running a big calculation with lots
| of inputs to decide what stock to buy next, that's probably
| going to have floating-point monetary values in it, and
| there's nothing wrong with that. Hedge funds do a lot of the
| latter.
| owlbite wrote:
| Unless the underlying code has been designed specifically for
| it (at the cost of some lost performance), it is a somewhat
| unreasonable expectation for high performance numerical code to
| always return the same result across version changes. Even
| running the same code on different machines, or the same
| machine with differently aligned data is likely to return
| different (but equally mathematically valid) results.
| tomsmeding wrote:
| This happens not only in math libraries but even in compilers.
| An x64 processor has two ways (at least) to perform simple
| 64-bit floating-point operations: with SSE registers (using
| only one of the slots in the vector), or with the older x87
| instructions. The former works with 64-bit IEEE intermediates
| (`double`), the latter with 80-bit extended precision
| intermediates (`long double`).
|
| In practice, these days compilers always use the former because
| they're widely available and are predictable. But if you tell a
| C compiler to not assume that SSE is available, it will use the
| x87 instructions.
|
| If you use the x87 instructions, whether an intermediate float
| value is stored in memory or in a register _changes the
| outcome_ , because it's stored in memory truncated to 64 bits,
| but in the (x87) register it's the full 80 bits.
|
| If you rely on exact, reproducible floating-point results, you
| have more to worry about than just software versions.
| Arech wrote:
| Honestly, I was expecting to see suggestions for testing for
| numerical stability. This might be a super annoying issue if the
| data is just right to hit certain peculiarities of the floating
| point representation.
| qayxc wrote:
| Unfortunately numerical stability is a very complex topic and
| depends entirely on the use-case.
|
| Sure, there's your usual suspects like subtractive
| cancellation, but iterative methods in particular can vary
| quite a lot in terms of stability.
|
| Sometimes it's due to step sizes, sometimes problems only occur
| within certain value ranges, other times it's peculiarities
| with specific hardware or even compilers. There's really no
| good general way of finding/avoiding these pitfalls that would
| apply to most numerical problems.
| RhysU wrote:
| Start simpler.
|
| Check expected behavior of...
|
| NaN, +0, -0, +1, -1, +Inf, -Inf, +Eps, -Eps, +1+Eps, +1-Eps,
| -1+Eps, -1-Eps
|
| ...on each coordinate when supplying a sensible constant for the
| other coordinates.
|
| Check the outer product of the above list taken across all
| coordinates.
|
| Check that you get the same answer when you call the function
| twice in each of these outer product cases.
|
| This approach works for any floating point function. You will be
| surprised how many mistakes these tests will catch, including
| during later maintenance of the code under test.
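|
| A sketch of that recipe in Python for a two-argument function
| (math.hypot stands in for the function under test; math.nextafter
| needs Python 3.9+):
|
|     import math, itertools
|
|     SPECIAL = [
|         math.nan, 0.0, -0.0, 1.0, -1.0, math.inf, -math.inf,
|         math.nextafter(0.0, 1.0), math.nextafter(0.0, -1.0),
|         math.nextafter(1.0, 2.0), math.nextafter(1.0, 0.0),
|         math.nextafter(-1.0, 0.0), math.nextafter(-1.0, -2.0),
|     ]
|
|     def f(x, y):
|         return math.hypot(x, y)      # placeholder for the function under test
|
|     def test_special_value_grid():
|         for x, y in itertools.product(SPECIAL, repeat=2):   # outer product
|             r1, r2 = f(x, y), f(x, y)
|             # Determinism: two calls must agree (NaN != NaN, so treat
|             # two NaN results as agreeing).
|             assert (math.isnan(r1) and math.isnan(r2)) or r1 == r2
|             # Domain-specific expectations go here, e.g. hypot >= 0.
|             if not math.isnan(r1):
|                 assert r1 >= 0.0
|
|     test_special_value_grid()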
| eesmith wrote:
| To be more specific for those wanting to follow your advice,
| use nextafter to compute +Eps, +1+Eps, etc. rather than assume
| it's a fixed value like DBL_EPSILON.
|
|     >>> from math import nextafter
|     >>> nextafter(0, 1)
|     5e-324
|     >>> nextafter(0, -1)
|     -5e-324
|     >>> nextafter(-1, -10)
|     -1.0000000000000002
|
| My numeric code became much more robust once I started doing
| this, including using it to find test cases, for example, by
| computing all distinct rationals (up to a given denominator) in
| my range of interest, then using nextafter() to generate all
| values within 5 or so steps from that rational.
|
| For example, one algorithm I had failed with values like
| 0.21000000000000002 and 0.39999999999999997, which are one
| step away from 0.21 and 0.4, respectively.
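|
| A rough sketch of that test-point generator in Python (the range,
| denominator limit and step count are arbitrary choices):
|
|     from fractions import Fraction
|     from math import nextafter, inf
|
|     def neighbourhood(x, steps=5):
|         # x plus every float within `steps` representable values of it.
|         values, lo, hi = {x}, x, x
|         for _ in range(steps):
|             lo, hi = nextafter(lo, -inf), nextafter(hi, inf)
|             values.update((lo, hi))
|         return sorted(values)
|
|     def rational_test_points(max_den=100):
|         # Distinct rationals p/q in [0, 1] with q <= max_den, each
|         # expanded to its floating point neighbourhood.
|         rationals = {Fraction(p, q) for q in range(1, max_den + 1)
|                      for p in range(0, q + 1)}
|         points = []
|         for r in sorted(rationals):
|             points.extend(neighbourhood(float(r)))
|         return points
|
|     pts = rational_test_points()
|     assert 0.21000000000000002 in pts    # one step above 0.21
|     assert 0.39999999999999997 in pts    # one step below 0.4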
| RcouF1uZ4gsC wrote:
| Another trick is to use 32 bit float and just test every single
| number.
|
| 4 billion is something computers can handle, and it will give you
| pretty much all the edge cases of floating point representation.
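|
| A sketch of that exhaustive sweep in Python/NumPy (squaring stands
| in for the routine under test, with the float64 result rounded
| back to float32 as the reference; assumes the CPU is not flushing
| subnormals to zero):
|
|     import numpy as np
|
|     def test_all_float32(batch=1 << 20):
|         # Visit every float32 bit pattern in batches to bound memory.
|         for start in range(0, 1 << 32, batch):
|             bits = np.arange(start, start + batch, dtype=np.uint64).astype(np.uint32)
|             x = bits.view(np.float32)
|             got = np.square(x)                                # routine under test
|             ref = (x.astype(np.float64) ** 2).astype(np.float32)
|             ok = (got == ref) | (np.isnan(got) & np.isnan(ref))
|             assert ok.all(), bits[~ok][:10]
|
|     test_all_float32()   # slow: 2**32 evaluations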
| owlbite wrote:
| That quickly becomes infeasible for things that are not single
| input (and still requires being able to identify a correct
| result).
| eesmith wrote:
| ecef_to_lla requires two values, so even if one is willing to
| take the reduced geospatial resolution of float32, that's still
| 64-bits of parameter space.
|
| If you do need 1 mm accuracy, including in the Pacific Ocean
| around 180 E/W, then neither float32 nor scaled 32-bit ints are
| enough.
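|
| The spacing of adjacent float32 values near the Earth's radius
| makes the point quickly:
|
|     import numpy as np
|
|     print(np.spacing(np.float32(6.378e6)))   # 0.5 -> half a metre per ulp
|     print(np.spacing(6.378e6))               # ~9.3e-10 m in float64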
| boscillator wrote:
| But test against what? This only makes sense for the property
| based testing, but then random sampling should give the same
| result up to a statistical margin of error (unless there's some
| specific pathological case). I suppose this is good for
| extremely critical routines, kind of like the modified
| condition/decision coverage that safety-critical flight software has
| to undergo.
| spenczar5 wrote:
| If you have an inverse function, use that to test all round
| trips?
| henshao wrote:
| For coordinate transforms specifically, I also like to run
| through the whole chain of available, implemented transforms and
| back again - asserting I have the same coordinate at the end. One
| method might be wrong, but they "can't" all be wrong.
| boscillator wrote:
| Yah, I wrote a `f(f^-1(x)) == x` test but ended up cutting it for
| brevity. Also, running through every possible transform seems
| like a pain to debug if the error is somewhere in the middle.
| Perhaps it would be better to test every pair using the
| `GENERATE` clause in catch. That way you could narrow it down
| to two different algorithms.
| sevensor wrote:
| This works great with property based tests especially. You can
| identify the parts of the domain where things blow up.
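|
| For example, with the Hypothesis library in Python (the sqrt
| round-trip property here is just a placeholder):
|
|     import math
|     from hypothesis import given, strategies as st
|
|     # Hypothesis searches the whole non-negative range, including
|     # huge, tiny and subnormal inputs, for counterexamples.
|     @given(st.floats(min_value=0.0, allow_nan=False, allow_infinity=False))
|     def test_sqrt_round_trip(x):
|         assert math.isclose(math.sqrt(x) ** 2, x,
|                             rel_tol=1e-15, abs_tol=1e-300)
|
|     test_sqrt_round_trip()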
| quinnirill wrote:
| Makes me wonder if there should be an open source test suite for
| numerical routines, i.e. just a big table (in JSON or something)
| of inputs, transformations and expected outputs, so new projects
| could get easily started with a fairly robust (if a bit
| generic) test suite they could filter to suit their needs as
| they grow the feature set.
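|
| A sketch of what such a table-driven harness could look like (the
| JSON schema here is made up purely for illustration):
|
|     import json, math
|
|     CASES = json.loads("""
|     [
|       {"routine": "hypot", "input": [3.0, 4.0], "expected": 5.0, "rel_tol": 1e-15},
|       {"routine": "atan2", "input": [0.0, -1.0], "expected": 3.141592653589793, "rel_tol": 1e-15}
|     ]
|     """)
|
|     ROUTINES = {"hypot": math.hypot, "atan2": math.atan2}
|
|     for case in CASES:
|         got = ROUTINES[case["routine"]](*case["input"])
|         assert math.isclose(got, case["expected"], rel_tol=case["rel_tol"])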
___________________________________________________________________
(page generated 2024-11-17 23:01 UTC)