[HN Gopher] Unit Testing Numerical Routines
       ___________________________________________________________________
        
       Unit Testing Numerical Routines
        
       Author : boscillator
       Score  : 65 points
       Date   : 2024-11-12 13:03 UTC (5 days ago)
        
 (HTM) web link (buchanan.one)
 (TXT) w3m dump (buchanan.one)
        
       | amelius wrote:
       | If you have a numerical routine that converts from one
       | representation to another, you can test it by calling the routine
       | implementing the inverse conversion.
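        | 
        | A minimal sketch of that idea in Python (the conversion
        | functions here are placeholders, not the article's actual API):
        | 
        |     import math
        | 
        |     def test_round_trip(lla_to_ecef, ecef_to_lla):
        |         # Converting to ECEF and back should reproduce the
        |         # input within a tolerance chosen for the domain.
        |         lat, lon, alt = 40.0, -105.0, 1600.0
        |         x, y, z = lla_to_ecef(lat, lon, alt)
        |         lat2, lon2, alt2 = ecef_to_lla(x, y, z)
        |         assert math.isclose(lat, lat2, abs_tol=1e-9)
        |         assert math.isclose(lon, lon2, abs_tol=1e-9)
        |         assert math.isclose(alt, alt2, abs_tol=1e-6)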
        
         | pclmulqdq wrote:
         | Assuming you don't have an identical bug in the inverse
         | routine.
        
       | physicsguy wrote:
       | All good stuff, I'd add though that for many numerical routines
       | you end up testing on simple well known systems that have well
       | defined analytic answers.
       | 
       | So for e.g. if you're writing solvers for ODE systems, then you
       | tend to test it on things like simple harmonic oscillators.
       | 
       | If you're very lucky, some algorithms have bounded errors. For
       | e.g. the fast multipole method for evaluating long range forces
       | does, so you can check this for very simple systems.
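        | 
        | A sketch of that kind of test (illustrative only, with a
        | hand-rolled RK4 step rather than any particular library's
        | solver): the oscillator y'' = -y with y(0) = 1, y'(0) = 0 has
        | the analytic solution y(t) = cos(t), so after one full period
        | the state should return to where it started.
        | 
        |     import numpy as np
        | 
        |     def rk4_step(f, y, t, h):
        |         # One classical Runge-Kutta step for y' = f(t, y).
        |         k1 = f(t, y)
        |         k2 = f(t + h/2, y + h/2 * k1)
        |         k3 = f(t + h/2, y + h/2 * k2)
        |         k4 = f(t + h, y + h * k3)
        |         return y + h/6 * (k1 + 2*k2 + 2*k3 + k4)
        | 
        |     def test_simple_harmonic_oscillator():
        |         # y'' = -y as a first-order system (position, velocity).
        |         f = lambda t, y: np.array([y[1], -y[0]])
        |         n = 10_000
        |         h = 2 * np.pi / n
        |         y = np.array([1.0, 0.0])
        |         for i in range(n):
        |             y = rk4_step(f, y, i * h, h)
        |         assert abs(y[0] - 1.0) < 1e-6
        |         assert abs(y[1] - 0.0) < 1e-6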
        
         | LeanderK wrote:
         | yeah that's how I do it. You start simple (known solutions for
         | trivial points), then easy cases with analytic solutions and
          | then just some random stuff where you test that a solution is
          | reached without errors and makes sense (correct sign etc.),
          | here called property-based tests.
        
       | phdelightful wrote:
        | I have worked on Kokkos Kernels, which is a performance-portable
        | math library for the Kokkos programming system. We implement
       | BLAS, sparse matrix operations, a variety of solvers, some ODE
       | stuff, and various utilities for a variety of serial and parallel
       | execution modes on x86, ARM, and the three major GPU vendors.
       | 
       | The simplest and highest-impact tests are all the edge cases - if
       | an input matrix/vector/scalar is 0/1/-1/NaN, that usually tells
       | you a lot about what the outputs should be.
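        | 
        | As a rough sketch of what such edge-case checks can look like
        | for a dot product (plain NumPy here, not the Kokkos Kernels
        | test code):
        | 
        |     import numpy as np
        | 
        |     def test_dot_edge_cases():
        |         x = np.array([1.0, -2.0, 3.0])
        |         # A zero operand pins the result exactly.
        |         assert np.dot(x, np.zeros_like(x)) == 0.0
        |         # Negating one operand negates the result exactly.
        |         assert np.dot(-x, x) == -np.dot(x, x)
        |         # A NaN anywhere in the input must propagate.
        |         y = x.copy()
        |         y[1] = np.nan
        |         assert np.isnan(np.dot(x, y))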
       | 
        | It can be difficult to determine a sensible numerical limit for
       | error in the algorithms. The simplest example is a dot product -
       | summing floats is not associative, so doing it in parallel is not
       | bitwise the same as serial. For dot in particular it's relatively
       | easy to come up with an error bound, but for anything more
       | complicated it takes a particular expertise that is not always
       | available. This has been a work in progress, and sometimes
       | (usually) we just picked a magic tolerance out of thin air that
       | seems to work.
       | 
       | Solvers are tested using analytical solutions and by inverting
        | them, e.g. if we're solving Ax = y for x, then Ax should come
       | out "close" to the original y (see error tolerance discussion
       | above).
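        | 
        | In Python terms that residual check is roughly (a sketch, not
        | the actual test code):
        | 
        |     import numpy as np
        | 
        |     def check_solve(A, y, x, rtol=1e-10):
        |         # "Invert" the solve: A @ x should reproduce the
        |         # right-hand side, relative to its own magnitude.
        |         residual = np.linalg.norm(A @ x - y)
        |         assert residual <= rtol * np.linalg.norm(y)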
       | 
       | One of the most surprising things to me is that the suite has
       | identified many bugs in vendor math libraries (OpenBLAS, MKL,
       | cuSparse, rocSparse, etc.) - a major component of what we do is
       | wrap up these vendor libraries in a common interface so our users
       | don't have to do any work when they switch supercomputers, so in
       | practice we test them all pretty thoroughly as well. Maybe I can
       | let OpenBLAS off the hook due to the wide variety of systems they
       | support, but I expected the other vendors would do a better job
       | since they're better-resourced.
       | 
       | For this reason we find regression tests to be useful as well.
        
         | essexedwards wrote:
         | I've also been surprised many times by issues in numerical
         | libraries. In addition to matrices with simple entries, I've
         | found plenty of bugs just testing small matrices, with
         | dimensions in {0,1,2,3,4}. Many libraries/routines fall over
         | when the matrix is small, especially when one dimension is 0 or
         | 1.
         | 
         | Presently, I am working on cuSPARSE and I'm very keen to
         | improve its testing and correctness. I would appreciate
         | anything more you can share about bugs you've seen in cuSPARSE.
         | Feel free to email me, eedwards at nvidia.com
        
           | AlotOfReading wrote:
           | This is one of the reasons I argue that it's almost always
            | better to prioritize speed and stability over accuracy
           | specifically. No one actually knows what their thresholds are
           | (including library authors), but the sky isn't falling
           | despite that. Instabilities and nondeterminism will blow up a
           | test suite pretty quickly though.
        
             | phdelightful wrote:
             | Yeah, we really just try to come up with very loose bounds
             | since the analysis is hard. Even so, it does occasionally
             | stop us from getting things way way wrong.
        
             | essexedwards wrote:
             | > No one actually knows what their thresholds are
             | (including library authors)
             | 
             | If low-level numerical libraries provided documentation for
             | their accuracy guarantees, it would make it easier to
             | develop software on top of those libraries. I think
             | numerical libraries should be doing this, when possible.
             | It's already common for special-function (e.g. sin, cos,
             | sqrt) libraries to specify their accuracy in ULPs. It's
             | less common for linear algebra libraries to specify their
             | accuracy, but it's still quite doable for BLAS-like
             | operations.
        
               | AlotOfReading wrote:
               | What I'm trying to convey is that the required accuracies
               | for the application are what's unclear. To give an
               | example of a case where accuracy matters, I regularly
               | catch computational geometry folks writing code that
               | branches differently on positive, negative, and 0
               | results. That application implies 0.5 ulp, which
               | obviously doesn't match the actual implementation
               | accuracy even if it's properly specified, so there's
               | usually a follow-up conversation trying to understand
               | what they really need and helping them achieve it.
        
           | RhysU wrote:
           | Nx0, 0xN, and 0x0 matrices are great edge cases.
           | 
           | 0-length vectors, too.
           | 
           | Then, do the same with Nx1, 1xN, and 1x1 matrices.
        
         | dahart wrote:
         | > sometimes (usually) we just picked a magic tolerance out of
         | thin air that seems to work.
         | 
         | Probably worth mentioning that in general the tolerance should
         | be relative error, not absolute, for floating point math.
         | Absolute error tolerance should only be used when there's a
         | maximum limit on the magnitude of inputs, or the problem has
         | been analyzed and understood.
         | 
         | I know that doesn't stop people from just throwing in 1e-6 all
         | over the place, just like the article did. (Hey I do it too!)
         | But if the problem hasn't been analyzed then an absolute error
         | tolerance is just a bug waiting to happen. It might seem to
         | work at first, but then catch and confuse someone as soon as
         | the tests use bigger numbers. Or maybe worse, fail to catch a
         | bug when they start using smaller numbers.
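          | 
          | A small illustration of the difference (Python's math.isclose
          | already supports both kinds of tolerance):
          | 
          |     import math
          | 
          |     a, b = 1e9, 1e9 + 70.0   # agree to ~7 significant digits
          | 
          |     # Absolute tolerance: fails for large magnitudes even
          |     # though the values agree to 7 digits.
          |     assert not math.isclose(a, b, rel_tol=0.0, abs_tol=1e-6)
          | 
          |     # Relative tolerance: scales with the magnitude of the
          |     # inputs, which is what floating-point error does.
          |     assert math.isclose(a, b, rel_tol=1e-6)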
        
           | tomsmeding wrote:
           | But then relative error is also not a panacea. If I compute 1
           | + 1e9, then producing 1e9 - 1 instead would fall within a
           | relative error bound of 1e-6 easily. More generally, relative
           | error works only if your computation scales
           | "multiplicatively" from zero; if there's any additive
           | component, it's suspect.
           | 
           | Of course, as you say, absolute error is also crap in
           | general: it's overly restrictive for large inputs and overly
           | permissive for small ones.
           | 
           | I'm not a numerics person, but I do end up needing to decide
           | on something sensible for error bounds on computations
           | sometimes. How does one do this properly? Interval arithmetic
           | or something?
        
             | dahart wrote:
             | > More generally, relative error works only if your
             | computation scales "multiplicatively" from zero; if there's
             | any additive component, it's suspect.
             | 
              | IEEE floating point inherently scales from zero: the
             | absolute error in _any_ computation is proportional to the
             | magnitude of the input numbers, whether you're adding or
             | multiplying or doing something else. It's the reason that
             | subtracting two large numbers has a higher relative error
             | than subtracting two small numbers, c.f. catastrophic
             | cancellation.
             | 
             | > How does one do this properly?
             | 
             | There's a little bit of an art to it, but you can start by
             | noting the actual result of any accurate operation has a
             | maximum error of 0.5 LSB (least significant bit) simply as
             | a byproduct of having to store the result in 32 or 64 bits;
             | essentially just think about every single math instruction
             | being required to round the result so it can fit into a
             | register. Now write an expression for your operation in
             | terms of perfect math. If I'm adding two numbers it will
             | look like a[+]b = (a + b)*(1 + e), where e is your 0.5 LSB
              | epsilon value. For 32 bit float, e == +/- 2^-24. In this
             | case I differentiate between digital addition with a finite
             | precision result, and perfect addition, using [+] for
             | digital addition and + for perfect addition.
             | 
             | This gets hairy and you need more tricks for anything
             | complicated, but multiplying each and every operation by
             | (1+e) is the first step. It quickly becomes apparent that
             | the maximum error is bounded by |e| * (|a|+|b|) for
             | addition or |e| * (|a| * |b|) for multiply... substitute
             | whatever your operation is.
             | 
             | When doing more complicated order-dependent error analysis,
             | it's helpful to use bounds and to allow error estimates to
             | grow slightly in order to simplify expressions. This way
             | you can prove the error is less than a certain expression,
             | but the expression might be conservative.
             | 
              | A 3d dot product is a good example to work through using
             | (1+e). Typically it's reasonable to drop e^2 terms, even
             | though it will technically compromise your error bound
              | proof by some minuscule amount.
              | 
              |     a[*]x [+] b[*]y [+] c[*]z
              |       = ((ax(1+e) + by(1+e))(1+e) + cz(1+e))(1+e)
              |       = ((ax+axe + by+bye)(1+e) + cz+cze)(1+e)
              |       = (ax+by+(ax+by)e + axe+bye+(ax+by)ee + cz+cze)(1+e)
              |       = (ax + by + 2e(ax+by) + e^2(ax+by) + cz+cze)(1+e)
              |       = ax + by + 2e(ax+by) + e^2(ax+by) + cz + cze
              |         + axe + bye + 2e^2(ax+by) + e^3(ax+by) + cze + cze^2
              |       = ax+by+cz + 3e(ax+by) + 2e(cz)
              |         + 3e^2(ax+by) + e^2(cz) + e^3(ax+by)
              | 
              | Now drop all the higher order terms of e:
              | 
              |       = ax+by+cz + 3e(ax+by) + 2e(cz)
              | 
              | Now also notice that 2e|cz| <= 3e|cz|, so we can say the
              | total error bound is:
              | 
              |       <= (ax + by + cz) + 3e( |a||x| + |b||y| + |c||z| )
             | 
             | And despite the intermediate mess, this suddenly looks very
             | conceptually simple and doesn't depend on the order of
             | operations. If the input values are all positive, then we
             | can say the error is proportional to 3 times the magnitude
             | of the dot product. And it's logical too because we stacked
             | 3 math operations, one multiply for each element of the sum
             | and two adds.
             | 
             | Sorry if that was way too much detail... I got carried
             | away. :P I glossed over some topics, and there could be
             | mistakes but that's the gist. I've had to do this for my
             | work on a much more complicated example, and it took a few
             | tries. There is a good linear algebra book about this, I
             | think called Accuracy and Stability of Numerical Algorithms
             | (Nicholas Higham). The famous PBR graphics book by Pharr
              | et al. also talks about error estimation techniques.
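              | 
              | For what it's worth, a bound like this is easy to sanity
              | check empirically by comparing a float32 dot product
              | against a float64 reference (a quick sketch, not a proof):
              | 
              |     import numpy as np
              | 
              |     e = 2.0 ** -24   # 0.5 ulp at 1.0 for float32
              |     rng = np.random.default_rng(0)
              |     for _ in range(10_000):
              |         a = rng.uniform(-1e3, 1e3, 3).astype(np.float32)
              |         x = rng.uniform(-1e3, 1e3, 3).astype(np.float32)
              |         # Dot product evaluated entirely in float32.
              |         approx = np.float32(0.0)
              |         for ai, xi in zip(a, x):
              |             approx += ai * xi
              |         a64 = a.astype(np.float64)
              |         x64 = x.astype(np.float64)
              |         reference = np.dot(a64, x64)
              |         bound = 3 * e * np.dot(np.abs(a64), np.abs(x64))
              |         # Small slack for the dropped e^2, e^3 terms.
              |         assert abs(approx - reference) <= 1.01 * bound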
        
             | boulos wrote:
             | This is what ULPs are for
             | (https://en.wikipedia.org/wiki/Unit_in_the_last_place).
             | 
             | It's easier for most building blocks (like transcendental
             | functions) to be discussed in terms of worst case ULP error
             | (e.g., <= 1 everywhere, <= 3, etc.). For example, SPIR-V /
             | OpenCL has this section on the requirements to meet the
             | OpenCL 3.0 spec (https://registry.khronos.org/OpenCL/specs/
             | 3.0-unified/html/O...). NVIDIA includes per-PTX-revision
             | ULP information in their docs (e.g.,
             | https://docs.nvidia.com/cuda/parallel-thread-
             | execution/#floa... for Divide and
             | https://docs.nvidia.com/cuda/parallel-thread-
             | execution/#floa... more broadly).
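              | 
              | Measuring error in ULPs is also easy to do directly in a
              | test, e.g. by mapping doubles onto a monotonically
              | ordered integer scale (a sketch; no special handling for
              | NaN):
              | 
              |     import struct
              | 
              |     def ulp_distance(a, b):
              |         # Reinterpret the double's bits as a signed int64,
              |         # then fold negatives so the mapping is monotonic.
              |         def ordered(x):
              |             bits = struct.pack('<d', x)
              |             i = struct.unpack('<q', bits)[0]
              |             if i < 0:
              |                 i = -(i & 0x7FFFFFFFFFFFFFFF)
              |             return i
              |         return abs(ordered(a) - ordered(b))
              | 
              |     assert ulp_distance(1.0, 1.0 + 2**-52) == 1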
        
         | boscillator wrote:
         | 100% agree that picking numeric tolerances can be tricky. It is
         | especially tricky when writing a generic math library like you
          | are. If you're doing something more applied it can help to take
         | limits from your domain. For the example in the blog, if you're
          | using GPS to determine your position on earth, you probably
         | know how precise physics allows that answer to be, and you only
         | need to test to that tolerance (or an order of magnitude more
         | strict, to give some wiggle room.)
        
       | alexpotato wrote:
       | Mentioning this b/c it was something that surprised me when I
       | first heard it:
       | 
       | Many python numerical libraries change how they internally
       | represent certain data types and/or change internal algorithms
       | from version to version. This means that running the exact same
       | code on two different versions may give you slightly different
       | results.
       | 
       | In your side project this may not be a big deal but if you are
       | working at, say, a hedge fund or a company that does long and
        | complicated data processing pipelines, then version upgrades with
       | no unit testing can be "very bad".
        
         | atoav wrote:
         | If you are working in a hedge fund you probably (hopefully?)
         | know it isn't a good idea to represent monetary values using
         | floats. And your numerics library should be checked for
         | correctness on each commit, just in case.
        
           | wesselbindt wrote:
           | Of course, but regardless, version bumps do sometimes affect
           | the outcomes of pipelines. Unit tests are a good thing to
           | have.
        
           | samatman wrote:
           | This is a useful bit of conventional wisdom, but it helps to
           | know when it is and isn't applicable.
           | 
           | There is a huge amount of data analysis which uses numbers
           | representing money as inputs where it's fine to use floating
           | point, and financial entities routinely do so. For a
           | transaction, interest payments, and that sort of thing, then
           | yes, one needs a specific sort of fixed-point representation,
           | and operations which round in whatever the legally-correct
           | fashion is for the use case.
           | 
           | Reductively, if you've turned money into numbers, and will
           | turn the resulting numbers back into money, don't use
           | floating point. If you're running a big calculation with lots
           | of inputs to decide what stock to buy next, that's probably
           | going to have floating-point monetary values in it, and
           | there's nothing wrong with that. Hedge funds do a lot of the
           | latter.
        
         | owlbite wrote:
         | Unless the underlying code has been designed specifically for
         | it (at the cost of some lost performance), it is a somewhat
         | unreasonable expectation for high performance numerical code to
         | always return the same result across version changes. Even
         | running the same code on different machines, or the same
         | machine with differently aligned data is likely to return
         | different (but equally mathematically valid) results.
        
         | tomsmeding wrote:
         | This happens not only in math libraries but even in compilers.
         | An x64 processor has two ways (at least) to perform simple
         | 64-bit floating-point operations: with SSE registers (using
         | only one of the slots in the vecror), or with the older x87
         | instructions. The former work with 64-bit IEEE intermediates
         | (`double`), the latter works with 80-bit extended precision
         | intermediates (`long double`).
         | 
         | In practice, these days compilers always use the former because
         | they're widely available and are predictable. But if you tell a
         | C compiler to not assume that SSE is available, it will use the
         | x87 instructions.
         | 
         | If you use the x87 instructions, whether an intermediate float
         | value is stored in memory or in a register _changes the
         | outcome_ , because it's stored in memory truncated to 64 bits,
         | but in the (x87) register it's the full 80 bits.
         | 
         | If you rely on exact, reproducible floating-point results, you
         | have more to worry about than just software versions.
        
       | Arech wrote:
       | Honestly, I was expecting to see suggestions for testing for
       | numerical stability. This might be a super annoying issue if the
       | data is just right to hit certain peculiarities of floating point
       | numbers representation.
        
         | qayxc wrote:
         | Unfortunately numerical stability is a very complex topic and
         | depends entirely on the use-case.
         | 
         | Sure, there's your usual suspects like subtractive
         | cancellation, but iterative methods in particular can vary
         | quite a lot in terms of stability.
         | 
         | Sometimes it's due to step sizes, sometimes problems only occur
         | within certain value ranges, other times it's peculiarities
         | with specific hardware or even compilers. There's really no
         | good general way of finding/avoiding these pitfalls that would
         | apply to most numerical problems.
        
       | RhysU wrote:
       | Start simpler.
       | 
       | Check expected behavior of...
       | 
       | Nan, +0, -0, +1, -1, +Inf, -Inf, +Eps, -Eps, +1+Eps, +1-Eps,
       | -1+Eps, -1-Eps
       | 
       | ...on each coordinate when supplying a sensible constant for the
       | other coordinates.
       | 
       | Check the outer product of the above list taken across all
       | coordinates.
       | 
       | Check that you get the same answer when you call the function
       | twice in each of these outer product cases.
       | 
       | This approach works for any floating point function. You will be
       | surprised how many mistakes these tests will catch, inclusive of
       | later maintenance on the code under test.
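        | 
        | A sketch of the outer-product and call-twice checks in Python
        | (the function under test, `f`, is a placeholder; the "expected
        | behavior" part is necessarily specific to each routine):
        | 
        |     import math
        |     import sys
        |     from itertools import product
        | 
        |     EPS = sys.float_info.epsilon
        |     SPECIALS = [math.nan, 0.0, -0.0, 1.0, -1.0,
        |                 math.inf, -math.inf, EPS, -EPS,
        |                 1.0 + EPS, 1.0 - EPS, -1.0 + EPS, -1.0 - EPS]
        | 
        |     def same(a, b):
        |         # NaN != NaN, so treat two NaNs as "the same answer".
        |         return (math.isnan(a) and math.isnan(b)) or a == b
        | 
        |     def check_determinism(f, n_args=2):
        |         # Outer product of the special values across every
        |         # coordinate; calling f twice must agree each time.
        |         for args in product(SPECIALS, repeat=n_args):
        |             assert same(f(*args), f(*args))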
        
         | eesmith wrote:
         | To be more specific for those wanting to follow your advice,
         | use nextafter to compute +Eps, +1+Eps, etc. rather than assume
          | it's a fixed value like DBL_EPSILON.
          | 
          |     >>> from math import nextafter
          |     >>> nextafter(0, 1)
          |     5e-324
          |     >>> nextafter(0, -1)
          |     -5e-324
          |     >>> nextafter(-1, -10)
          |     -1.0000000000000002
         | 
         | My numeric code became much more robust once I started doing
         | this, including using it to find test cases, for example, by
         | computing all distinct rationals (up to a given denominator) in
         | my range of interest, then using nextafter() to generate all
         | values within 5 or so steps from that rational.
         | 
         | For example, one algorithm I had failed with values like
         | 0.21000000000000002 and 0.39999999999999997 , which are one
         | step away from 0.21 and 0.4, respectively.
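          | 
          | A sketch of that neighbor generation could look like:
          | 
          |     from math import inf, nextafter
          | 
          |     def neighbors(x, steps=5):
          |         # x plus the `steps` nearest representable doubles
          |         # on either side, via repeated nextafter().
          |         out, lo, hi = [x], x, x
          |         for _ in range(steps):
          |             lo = nextafter(lo, -inf)
          |             hi = nextafter(hi, inf)
          |             out += [lo, hi]
          |         return sorted(out)
          | 
          |     # e.g. neighbors(0.21) includes 0.21000000000000002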
        
       | RcouF1uZ4gsC wrote:
        | Another trick is to use 32-bit floats and just test every single
       | number.
       | 
       | 4 billion is something computers can handle, and it will give you
       | pretty much all the edge cases of floating point representation.
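        | 
        | A sketch of the exhaustive sweep (in Python/NumPy for clarity;
        | in practice you'd want a compiled language, and it only works
        | for single-argument functions; `f` and `reference` are
        | placeholders, and whether bit-exact agreement is the right
        | criterion depends on the routine):
        | 
        |     import numpy as np
        | 
        |     def test_all_float32(f, reference):
        |         # Every float32 bit pattern, in 256 chunks of 2^24.
        |         for start in range(0, 2**32, 2**24):
        |             offsets = np.arange(2**24, dtype=np.uint32)
        |             bits = np.uint32(start) + offsets
        |             x = bits.view(np.float32)
        |             got = f(x)
        |             want = reference(x.astype(np.float64))
        |             want = want.astype(np.float32)
        |             ok = (got == want) | (np.isnan(got) & np.isnan(want))
        |             assert ok.all()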
        
         | owlbite wrote:
         | That quickly becomes unfeasible for things that are not single
         | input (and still requires being able to identify a correct
         | result).
        
         | eesmith wrote:
         | ecef_to_lla requires two values, so even if one is willing to
         | take the reduced geospatial resolution of float32, that's still
          | 64 bits of parameter space.
         | 
          | If you do need 1 mm accuracy, including in the Pacific Ocean
         | around 180 E/W then neither float32 nor scaled 32-bit ints are
         | enough.
        
         | boscillator wrote:
         | But test against what? This only makes sense for the property
         | based testing, but then random sampling should give the same
         | result up to a statistical margin of error (unless there's some
         | specific pathological case). I suppose this is good for
         | extremely critical routines, kind of like the modified
          | condition/decision coverage safety-critical flight software has
         | to undergo.
        
           | spenczar5 wrote:
           | If you have an inverse function, use that to test all round
            | trips?
        
       | henshao wrote:
       | For coordinate transforms specifically, I also like to run
       | through the whole chain of available, implemented transforms and
       | back again - asserting I have the same coordinate at the end. One
       | method might be wrong, but they "can't" all be wrong.
        
         | boscillator wrote:
          | Yah, I wrote a `f(f^-1(x)) == x` test but ended up cutting it for
         | brevity. Also, running through every possible transform seems
         | like a pain to debug if the error is somewhere in the middle.
         | Perhaps it would be better to test every pair using the
         | `GENERATE` clause in catch. That way you could narrow it down
         | to two different algorithms.
        
         | sevensor wrote:
         | This works great with property based tests especially. You can
         | identify the parts of the domain where things blow up.
        
       | quinnirill wrote:
       | Makes me wonder if there should be an open source test suite for
       | numerical routines, i.e. just a big table (in JSON or something)
       | of inputs, transformations and expected outputs, so new projects
       | could get easily started with a fairly robust (if not a bit
       | generic) test suite they could filter to suite their needs as
       | they grow the feature set.
        
       ___________________________________________________________________
       (page generated 2024-11-17 23:01 UTC)