[HN Gopher] How to Correctly Sum Up Numbers
___________________________________________________________________
How to Correctly Sum Up Numbers
Author : sebg
Score : 28 points
Date : 2024-10-31 09:07 UTC (3 days ago)
(HTM) web link (cedardb.com)
(TXT) w3m dump (cedardb.com)
| eesmith wrote:
| This essay only covers underflow and overflow, and gives an
| example of "correct" summation for Rust:
|
|     fn sum_up_numbers_correctly(numbers: &[i32]) -> Option<i32> {
|         let mut sum: i32 = 0;
|         for i in numbers {
|             sum = sum.checked_add(*i)?;
|         }
|         Some(sum)
|     }
|
| I expected it to also mention the Kahan summation algorithm
| (https://en.wikipedia.org/wiki/Kahan_summation_algorithm), which
| gives more accurate results than this checked_add solution, but
| it did not. I wonder if that will be in one of the subsequent
| blog posts.
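|
| For reference, a minimal sketch of the algorithm in Rust (my
| own code, not from the article):
|
|     /// Kahan (compensated) summation: track the low-order bits
|     /// lost by each addition in a compensation term `c`, keeping
|     /// the total rounding error O(eps) regardless of length.
|     fn kahan_sum(xs: &[f32]) -> f32 {
|         let mut sum = 0.0f32;
|         let mut c = 0.0f32; // running compensation
|         for &x in xs {
|             let y = x - c;     // remove the previously lost error
|             let t = sum + y;   // big + small: low bits of y are lost
|             c = (t - sum) - y; // recover the lost bits algebraically
|             sum = t;
|         }
|         sum
|     }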
| lifthrasiir wrote:
| To be clear, the checked_add solution is for integers while the
| Kahan summation is for IEEE 754 numbers. (There is no
| `f32::checked_add` or similar.)
| eesmith wrote:
| You're right! I am not a Rust programmer and misread the
| "i32" as "float32".
|
| The text does imply the correct float solution has a similar
| structure to the int32 case, with "So, you will have to
| resort to writing machine code manually if you want to
| efficiently check overflows of float numbers."
| jepler wrote:
| You don't need machine code these days to find overflow in
| IEEE floating point arithmetic.
|
| If any intermediate sum overflows, you get an infinity.
| Further additions can never return you to the realm of
| finite numbers (though you can also get into the world of
| "not a number" NaNs by calculating -inf + inf, for
| instance), so you can use a standard function like (C99
| <math.h>) isfinite() to check whether any intermediate step
| of summing FP numbers overflowed the representation.
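|
| A minimal Rust sketch of the same check (a naive sum followed
| by a single finiteness test; assumes the inputs themselves
| are finite):
|
|     /// Overflow to infinity (or NaN) sticks under addition, so
|     /// one is_finite() check after the loop detects whether any
|     /// intermediate sum overflowed.
|     fn sum_f32_checked(numbers: &[f32]) -> Option<f32> {
|         let sum: f32 = numbers.iter().sum();
|         sum.is_finite().then_some(sum)
|     }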
| tialaramex wrote:
| Rust's 32-bit floating point type is, unsurprisingly, named
| f32 (stable Rust also has f64, and unstable builds offer
| f16 and possibly f128, which are both newer IEEE formats).
|
| If you're sure your answer is a 32-bit IEEE floating point
| number, I think you could store the intermediate sums as a
| fully expanded fixed-point number with enough bits to span
| from the smallest possible f32 to the largest, so you're
| just doing (big integer) integer summation. 36 bytes ought
| to be enough. When you've finished summing, you find the
| top-most set bit; its position decides the exponent, and
| the subsequent 23 bits are your mantissa.
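|
| A rough Rust sketch of that idea (my own code, restricted to
| non-negative finite inputs for brevity, and using five u64
| limbs, i.e. 40 bytes, instead of squeezing into 36):
|
|     /// Exact sum of non-negative finite f32s via a 320-bit
|     /// fixed-point accumulator: every finite f32 is an integer
|     /// multiple of 2^-149, so the running sum is kept in units
|     /// of 2^-149 in five little-endian u64 limbs. The ~43 bits
|     /// of headroom allow roughly 2^42 addends.
|     fn exact_sum(values: &[f32]) -> f32 {
|         let mut acc = [0u64; 5];
|         for &x in values {
|             assert!(x.is_finite() && x >= 0.0); // sketch simplification
|             let bits = x.to_bits();
|             let exp = (bits >> 23) & 0xff;
|             let frac = (bits & 0x007f_ffff) as u128;
|             // value = mant * 2^(shift - 149)
|             let (mant, shift) = if exp == 0 {
|                 (frac, 0)                   // subnormal
|             } else {
|                 (frac | (1 << 23), exp - 1) // normal: implicit leading 1
|             };
|             // Add mant << shift into the accumulator, with carries.
|             let mut carry = mant << (shift % 64);
|             let mut i = (shift / 64) as usize;
|             while carry != 0 {
|                 let t = acc[i] as u128 + (carry & u64::MAX as u128);
|                 acc[i] = t as u64;
|                 carry = (carry >> 64) + (t >> 64);
|                 i += 1;
|             }
|         }
|         // Fold the limbs back into a float: Horner in f64, then
|         // scale by 2^-149. (The f64 round-trip can double-round
|         // by one ulp in rare cases; exact rounding needs the
|         // bit-level conversion described above.)
|         let mut v = 0.0f64;
|         for &limb in acc.iter().rev() {
|             v = v * 18_446_744_073_709_551_616.0 + limb as f64; // 2^64
|         }
|         (v * 2f64.powi(-149)) as f32
|     }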
| msichert wrote:
| Hey, I'm the author of the blog post!
|
| I had planned to write about the Kahan summation, but I didn't
| want to overload the blog post with too many different
| concepts. I'm definitely considering writing about it in a
| follow-up post.
| Koshkin wrote:
| The order in which you sum a large collection of floating-point
| numbers is also important, if you want the result to be stable.
| 77pt77 wrote:
| Sum smaller absolute value numbers first.
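|
| In Rust that could look like this (naive sort-then-sum;
| total_cmp gives a total order even if a NaN sneaks in):
|
|     fn sorted_sum(xs: &[f32]) -> f32 {
|         let mut v = xs.to_vec();
|         // Sum small-magnitude values first so they aren't
|         // swamped by an already-large partial sum.
|         v.sort_by(|a, b| a.abs().total_cmp(&b.abs()));
|         v.iter().sum()
|     }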
| stabbles wrote:
| It will still accumulate rounding errors of O(n eps). If you
| care, you can make it O(lg n eps) with pairwise summation or
| O(eps) with Kahan summation.
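|
| Pairwise summation is only a few lines; a sketch (the block
| size is an arbitrary choice of mine):
|
|     /// Pairwise (cascade) summation: recurse on halves, with a
|     /// plain loop below a small block size. Error grows as
|     /// O(lg n eps) instead of the naive loop's O(n eps).
|     fn pairwise_sum(xs: &[f32]) -> f32 {
|         const BLOCK: usize = 32;
|         if xs.len() <= BLOCK {
|             xs.iter().sum()
|         } else {
|             let (lo, hi) = xs.split_at(xs.len() / 2);
|             pairwise_sum(lo) + pairwise_sum(hi)
|         }
|     }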
| Pesthuf wrote:
| I just wonder, in situations where precision is so
| important that these small rounding errors actually matter,
| aren't floats the wrong tool anyway?
| owlbite wrote:
| Sometimes it's more about numerical stability than
| absolute accuracy (numerical stability being defined as
| not having the property that small changes in input cause
| large changes in the output).
| stabbles wrote:
| In many numerical algorithms, rounding errors add up over
| iterations / time. Maybe you lose one significant digit in
| one iteration of your algorithm; let it run longer and
| rounding errors start to dominate -- the result is garbage.
|
| Since performance is important, not using floats is not
| an option, and you rarely see Kahan-summation-style
| tricks because the heavy lifting is usually done by BLAS
| libraries (i.e. vectorized linear algebra operations).
|
| So, in numerical algorithms the tricks typically involve
| stable matrix decompositions like the QR decomposition
| A = QR where Q'Q = I (repeated multiplication with Q
| leaves the quantities roughly the same size, whereas with
| A things blow up).
|
| And there are other tricks. For example, if you need to
| orthogonalize a vector x against this Q, in exact
| arithmetic you compute y = x - Q * (Q' * x), so that
| Q' * y = 0. The Q' * x computation is a reduction, so it
| can accumulate O(n eps) errors in finite precision.
| Instead of using a more stable reduction (e.g. Kahan
| summation), you can simply repeat the computation:
| z = y - Q * (Q' * y). The idea is that the second
| orthogonalization step kills the rounding error noise
| introduced by the first step (see the sketch below).
|
| So in short: you can still get accuracy and decent
| performance with floats, you just have to know the
| tricks.
|
| [Another somewhat interesting point is that on GPUs
| reductions are by necessity implemented with O(ln n eps)
| error, and the naive but vectorized CPU implementation
| has an error O(n/k eps) where k is the vector width]
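|
| A dependency-free Rust sketch of that "twice is enough"
| trick (names and structure are mine, not from any library):
|
|     /// Plain dot product: the reduction that accumulates the
|     /// O(n eps) error discussed above.
|     fn dot(a: &[f64], b: &[f64]) -> f64 {
|         a.iter().zip(b).map(|(x, y)| x * y).sum()
|     }
|
|     /// One classical Gram-Schmidt pass: y = x - Q * (Q' * x),
|     /// with the orthonormal columns of Q given as slices.
|     fn orthogonalize(q: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
|         let mut y = x.to_vec();
|         for col in q {
|             let c = dot(col, x); // one entry of Q' * x
|             for (yi, qi) in y.iter_mut().zip(col) {
|                 *yi -= c * qi;
|             }
|         }
|         y
|     }
|
|     /// "Twice is enough": the second pass removes the rounding
|     /// noise the first pass leaves behind.
|     fn orthogonalize_twice(q: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
|         let y = orthogonalize(q, x);
|         orthogonalize(q, &y)
|     }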
| TeMPOraL wrote:
| They are, but that didn't stop people from using them.
| They're convenient, they're _bounded_, and as such, they
| ended up having decades of hardware optimization behind
| them. When precision turns out to be really important,
| it's often cheaper to get some smart people to
| restructure the computation so as to mitigate the errors
| than it is to switch to arbitrary-precision math and deal
| with the consequences. There's a whole field of CS that's
| spent the last half century or more developing ways to
| deal with floating point math problems - extending the
| scope of their use, and making alternatives _even less
| economical_.
|
| There's an interesting case of it today: LLMs. In theory,
| they should be fully deterministic with the "temperature"
| parameter set to 0, but _they aren't_ - all that matrix
| multiplication is massively parallelized, so with
| floating point math being order-dependent, partial
| results quickly accumulate non-deterministic errors.
| stabbles wrote:
| This is more relevant than overflow.
| GMoromisato wrote:
| One of the things I love about software engineering is that there
| are no absolute answers, only trade-offs.
|
| You can't memorize your way into software engineering. You can't
| say, "always wrap-around when an operation overflows" or "always
| tell the user (panic) when an operation overflows".
|
| This article is a great example. C wraps unsigned integer
| addition on overflow, but Python upscales to
| arbitrary-precision integers. The trade-off, of course, is
| performance. If you're writing a real-time engine (for a
| rocket or a video game) and you can guarantee integer sizes,
| then you need to use the fastest algorithm possible. But if
| you're writing a general-purpose calculator, you might prefer
| upscaling, or at least catching overflows and informing the
| user.
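|
| Rust makes this trade-off explicit per call site; a small
| illustration (these are real standard-library methods):
|
|     fn main() {
|         let a = u32::MAX;
|         assert_eq!(a.wrapping_add(1), 0);   // C-style wrap-around
|         assert_eq!(a.checked_add(1), None); // detect and report
|         // Python instead promotes to an arbitrary-precision int,
|         // trading speed for a sum that is always exact.
|     }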
|
| In almost any other discipline there would be a "correct" way
| to do it, or at least an accepted "best practice." But for
| software engineers, it's always, "it depends."
| cratermoon wrote:
| > You can't say, "always wrap-around when an operation
| overflows" or "always tell the user (panic) when an operation
| overflows".
|
| If you were designing the flight software for the Ariane 5,
| which choice would you make?
| GMoromisato wrote:
| I understood that reference.
|
| But it just reinforces my point that there are no absolute
| rules. You have to think.
| TeMPOraL wrote:
| The important choice was made elsewhere: the flight
| termination system successfully destroyed the rocket once
| it was unlikely to recover from its problem.
|
| The Wiki quotes an interesting sentence from the official
| report - "An underlying theme in the development of Ariane 5
| is the bias towards the mitigation of random failure." It's
| fortunate that that bias didn't extend to the flight
| termination system. At the point flight control software
| intersects with real life, it's better to destroy an
| expensive rocket than have it rain fiery hell on some random
| people somewhere else. That principle holds to this day.
| wang_li wrote:
| > If you were designing the flight software for the Ariane 5,
| which choice would you make?
|
| I would take note of the underlying hardware and what range
| of inputs is allowed, and code appropriately. Stop pretending
| that your software doesn't have a specific purpose and won't
| run on an actual computer. If you're doing finance you should
| be using fixed-point integer arithmetic with two or three
| decimal places. If you're doing physics simulations you need
| to be aware of how many significant digits exist. If you
| ignore your problem space and the kind of system your code is
| going to run on, you're living in spherical cow world and it
| doesn't matter if you over- or underflow because your
| software is garbage anyway.
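|
| A tiny sketch of the fixed-point idea for money (the
| two-decimal scale is the conventional choice for cents):
|
|     fn main() {
|         // Amounts as integer cents: addition is exact, and
|         // overflow is detectable, unlike with f64 dollars.
|         let price: i64 = 19_99; // $19.99
|         let tax: i64 = 1_65;    // $1.65
|         let total = price.checked_add(tax).expect("overflow");
|         assert_eq!(total, 21_64);
|         println!("${}.{:02}", total / 100, total % 100);
|     }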
| Zamiel_Snawley wrote:
| Saturation arithmetic doesn't get enough love, either.
|
| I've heard Zig has convenient facilities for it.
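|
| Zig spells saturating addition as the `+|` operator, if I
| recall correctly; Rust exposes the same idea as methods:
|
|     fn main() {
|         // Saturating arithmetic clamps at the type's bounds
|         // instead of wrapping or panicking.
|         assert_eq!(i32::MAX.saturating_add(1), i32::MAX);
|         assert_eq!(i32::MIN.saturating_sub(1), i32::MIN);
|         assert_eq!(200u8.saturating_mul(2), u8::MAX);
|     }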
| IshKebab wrote:
| There are plenty of absolutes. People just don't talk about
| them much because there's not much to say:
|
| * binary is the best way to encode information and numbers
| * two's complement is the best general purpose way to
|   encode negative integers
| * static typing and stronger typing lead to fewer bugs
| * length prefixes are better than using sentinel values
|   and escaping
|
| > In almost any other discipline there would be a "correct" way
| to do it
|
| Tell me you know nothing about anything except computers
| without...
|
| You think there's a "correct" way to build a bridge?
| wongarsu wrote:
| I'd argue this is about _safely_ summing up numbers, not
| correctly.
|
| For example, this is the proposed Rust solution (imho the most
| readable):
|
|     fn sum_up_numbers_correctly(numbers: &[i32]) -> Option<i32> {
|         let mut sum: i32 = 0;
|         for i in numbers {
|             sum = sum.checked_add(*i)?;
|         }
|         Some(sum)
|     }
|
| If you try to sum [i32::MAX, 2, i32::MIN], this will return
| None (the intermediate sum i32::MAX + 2 overflows), while the
| naive wrapping addition would have returned 1. I'd call None
| a safe result -- the algorithm bailing out when it's unsure --
| and 1 the correct result.
|
| The upside of wrapping addition is that you can overflow as
| often as you want: as long as the final result is within the
| range of the type, the result (of addition and subtraction)
| is correct. The downside is that you don't know whether the
| result is within range. So you trade between correctly
| handling a larger range of inputs with wrapping, or being
| more certain of the result with checked overflow.
|
| My preferred solution would have been just upcasting. If we
| make sum an i64, we know the intermediate sum can't overflow
| as long as we sum fewer than 2^32 i32 numbers, and on a
| 64-bit machine you lose basically nothing by doing that. Keep
| the checked_add if you are concerned about people summing
| extremely long lists of numbers:
|
|     fn sum_up_numbers_correctly(numbers: &[i32]) -> Option<i32> {
|         let mut sum: i64 = 0;
|         for i in numbers {
|             sum = sum.checked_add(*i as i64)?;
|         }
|         sum.try_into().ok()
|     }
|
| (the try_into().ok() turns values that overflow i32 into None).
|
| When summing i64 values you can do the same trick by summing
| into an i128.
| mmphosis wrote:
|     alias tally='awk -F '\''[^-.0-9]'\'' '\''{ for (i = 1; i <= NF; i++) s += $i } END { print s }'\'''
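|
| It sums every number it finds on stdin, e.g.:
|
|     $ printf '1 2.5 -3\n' | tally
|     0.5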
___________________________________________________________________
(page generated 2024-11-03 23:01 UTC)