[HN Gopher] Optimizing a bignum library for fun
___________________________________________________________________
Optimizing a bignum library for fun
Author : azhenley
Score : 50 points
Date : 2024-07-16 18:08 UTC (4 hours ago)
(HTM) web link (austinhenley.com)
(TXT) w3m dump (austinhenley.com)
| lifthrasiir wrote:
| Modulo is surprisingly expensive even when you combine it with a
| quotient. It is almost always better to use binary "limbs", in
| this case 31 or 32 bits wide, because decimal parsing and
| printing should be much rarer than individual operations in
| general.
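  |
  | For illustration, a binary-limb add needs no division or
  | modulo at all; a minimal sketch (32-bit limbs, little-endian
  | limb order, names purely illustrative):
  |
  |       #include <stddef.h>
  |       #include <stdint.h>
  |
  |       /* r = a + b, each array holding n 32-bit limbs, least
  |          significant first; returns the final carry. The
  |          carry is just the top bits of a 64-bit sum. */
  |       uint32_t bn_add32(uint32_t *r, const uint32_t *a,
  |                         const uint32_t *b, size_t n) {
  |           uint64_t carry = 0;
  |           for (size_t i = 0; i < n; i++) {
  |               uint64_t s = (uint64_t)a[i] + b[i] + carry;
  |               r[i] = (uint32_t)s;   /* low half */
  |               carry = s >> 32;      /* 0 or 1 */
  |           }
  |           return (uint32_t)carry;
  |       }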
| colonwqbang wrote:
| This is a bit misleading. Quotient and modulo with constant
| right-hand side is readily optimised into faster operations,
| like multiplication and bit shifts. "All" optimising compilers
| do this and don't emit a true divide instruction in such cases.
| See for instance: https://godbolt.org/z/vxnTPvY6M
|
  | Quotient and modulo with a variable RHS is expensive, but
  | that isn't needed for the algorithm shown in the article.
  | That said, I agree that binary wins over decimal anyway; we
  | have been using binary computers for the last 50 years or so,
  | after all.
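  |
  | For reference, the pattern in question looks roughly like
  | this with the article's decimal base (a sketch; BASE and the
  | function name are illustrative):
  |
  |       #include <stdint.h>
  |
  |       #define BASE 1000000000u   /* constant divisor */
  |
  |       /* Split a 64-bit intermediate into digit and carry.
  |          With BASE constant, optimising compilers turn both
  |          the / and the % into a multiply-by-reciprocal plus
  |          shifts; no divide instruction is emitted. */
  |       void split(uint64_t x, uint32_t *digit,
  |                  uint64_t *carry) {
  |           *carry = x / BASE;
  |           *digit = (uint32_t)(x % BASE);
  |       }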
| Someone wrote:
| > Quotient and modulo with a variable RHS is expensive
|
| For dividing 'big' bignums by 'small' divisors, I would
| use/tweak _libdivide_ (https://libdivide.com/), and spend the
| time to optimize the division.
|
| For dividing 'big' bignums by 'big' divisors, that may not be
| worth it.
|
  | (Determining the correct thresholds for 'big' and 'small' is
  | left as an exercise for the reader; they will depend on the
  | hardware at hand.)
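  |
  | Roughly what the 'big bignum / small divisor' case might
  | look like with libdivide's C API (just a sketch; the helper
  | name is made up):
  |
  |       #include <stddef.h>
  |       #include <stdint.h>
  |       #include "libdivide.h"
  |
  |       /* Divide a little-endian array of 32-bit limbs by a
  |          small (< 2^32) divisor in place, returning the
  |          remainder. libdivide precomputes the reciprocal
  |          once, so each per-limb divide becomes a multiply
  |          and shift. */
  |       uint32_t bn_div_small(uint32_t *limbs, size_t n,
  |                             uint32_t d) {
  |           struct libdivide_u64_t fast = libdivide_u64_gen(d);
  |           uint64_t rem = 0;
  |           for (size_t i = n; i-- > 0; ) {  /* MS limb first */
  |               uint64_t cur = (rem << 32) | limbs[i];
  |               uint64_t q = libdivide_u64_do(cur, &fast);
  |               rem = cur - q * d;
  |               limbs[i] = (uint32_t)q;  /* fits: rem < d */
  |           }
  |           return (uint32_t)rem;
  |       }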
| worstspotgain wrote:
  | The reason a division by 1000000 can be efficiently turned
  | into a multiplication is that you're targeting a system that
  | has a 32x32->64 multiplication instruction, which has plenty
  | of room to spare. But if you have that, you would be better
  | off using e.g. 2^32 as your modulo.
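  |
  | An illustrative sketch of the difference for one multiply-
  | accumulate step (32-bit limbs, 64-bit product; names made
  | up):
  |
  |       #include <stdint.h>
  |
  |       /* One step of schoolbook multiplication. With a base
  |          of 2^32 the digit/carry split is a cast and a
  |          shift; a base of 10^9 would need p % 1000000000 and
  |          p / 1000000000 here instead. */
  |       void mul_step(uint32_t a, uint32_t b, uint64_t carry_in,
  |                     uint32_t *digit, uint64_t *carry_out) {
  |           uint64_t p = (uint64_t)a * b + carry_in;
  |           *digit = (uint32_t)p;      /* low 32 bits */
  |           *carry_out = p >> 32;      /* high 32 bits */
  |       }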
| styczen wrote:
| not + -
| minimize wrote:
| For maximum efficiency, you should work in binary instead of base
| 10. Handling carries becomes more straightforward with the right
| primitives, for example __builtin_addc with GCC:
| https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
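  |
  | A minimal sketch of a limb-wise add with that builtin (Clang
  | also provides __builtin_addc; older GCC can fall back to
  | __builtin_add_overflow; names here are illustrative):
  |
  |       /* r = a + b over n unsigned limbs, least significant
  |          first; returns the final carry (0 or 1). */
  |       unsigned bn_addc(unsigned *r, const unsigned *a,
  |                        const unsigned *b, unsigned n) {
  |           unsigned carry = 0;
  |           for (unsigned i = 0; i < n; i++)
  |               r[i] = __builtin_addc(a[i], b[i], carry, &carry);
  |           return carry;
  |       }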
|
| You can also implement it in C if you want a more portable
| solution:
| https://github.com/983/bigint/blob/ee0834c65a27d18fa628e6c52...
|
| If you scroll around, you can also find my implementations for
| multiplication and such.
| rurban wrote:
  | For maximum efficiency you'd avoid addc like hell, because it
  | blocks internal precomputation, and instead rely on
  | guarantees like he did to avoid overflow in the first place.
  | Better to use int128 types, though.
  |
  | I'd also just use stack objects for small, typical sizes. And
  | especially do formal proofs rather than the random-test joke.
| minimize wrote:
| > For maximum efficiency you'd avoid addc like hell, because
| it blocks internal precomputation, and use guarantees like he
| did, avoiding overflow better.
|
| Whether that is the case depends on the CPU. I hit memory
| bandwidth limits before adc becomes a problem. But I concede
| that you have a valid point and that there are definitely
| cases where leaving sufficient space for carries is the
| better strategy.
|
| Anyway, we can probably both agree that "% 1000000000" is not
| the best way to do it.
| koverstreet wrote:
| addc makes all your adds serialize on the flags register,
| which is really painful.
|
  | A more modern approach is to reserve some bits out of every
  | word for carries, but that drastically complicates things.
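  |
  | A rough sketch of that representation (64-bit words holding
  | 62-bit limbs, so two spare bits per word; purely
  | illustrative):
  |
  |       #include <stddef.h>
  |       #include <stdint.h>
  |
  |       #define LIMB_BITS 62
  |       #define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)
  |
  |       /* acc += x limb-wise with NO carry chain: every add
  |          is independent, so nothing serializes on flags.
  |          Safe while at most four normalized values are
  |          summed into each limb before normalizing. */
  |       void add_lazy(uint64_t *acc, const uint64_t *x,
  |                     size_t n) {
  |           for (size_t i = 0; i < n; i++)
  |               acc[i] += x[i];
  |       }
  |
  |       /* One pass pushes the deferred carries into the next
  |          limb, restoring every limb to < 2^62 (carry out of
  |          the top limb omitted for brevity). */
  |       void normalize(uint64_t *acc, size_t n) {
  |           uint64_t carry = 0;
  |           for (size_t i = 0; i < n; i++) {
  |               uint64_t v = acc[i] + carry;
  |               acc[i] = v & LIMB_MASK;
  |               carry = v >> LIMB_BITS;
  |           }
  |       }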
| JonChesterfield wrote:
  | I remembered and ultimately found a source for a workaround
  | for the serialising-on-flags problem: an Intel paper at
  | https://web.archive.org/web/20150131061304/http://www.intel....
  | It amounts to new instructions with better behaviour for ILP.
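  |
  | (Those are presumably the ADX instructions, adcx/adox, which
  | keep two independent carry chains in CF and OF. A hedged
  | sketch via the intrinsic; whether a compiler interleaves two
  | chains from C is another matter, so hot loops are often
  | hand-written asm:)
  |
  |       #include <immintrin.h>  /* _addcarryx_u64; may need
  |                                  -madx on GCC/Clang */
  |       #include <stddef.h>
  |
  |       /* r = a + b over n 64-bit limbs, least significant
  |          first; returns the outgoing carry. */
  |       unsigned char add_n(unsigned long long *r,
  |                           const unsigned long long *a,
  |                           const unsigned long long *b,
  |                           size_t n) {
  |           unsigned char c = 0;
  |           for (size_t i = 0; i < n; i++)
  |               c = _addcarryx_u64(c, a[i], b[i], &r[i]);
  |           return c;
  |       }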
| paldepind2 wrote:
| Speaking of bignum libraries, I recently watched a talk with Rob
| Pike where he mentioned that one thing he regretted about Go was
| not making the default integer implementation arbitrary
| precision. Supposedly the performance overhead for normal numbers
| is very small, and you avoid the weirdness and complicated
| semantics of fixed precision integers. I found that to be quite
| fascinating, especially coming from a "low-level guy" like Rob
| Pike. Ever since I've been wanting a language with that feature
| and to understand how bignum implementations work.
| MaxBarraclough wrote:
| For what it's worth, Haskell's _Integer_ type is that way.
|
| _edit_ To be clear, in Haskell you 'default' to using its
| arbitrary-precision type, unlike say Java where they're
| available in the standard library but not in the core primitive
| types.
| vient wrote:
| > I've been wanting a language with that feature
|
  | Python, whose CPython implementation is mentioned in the
  | article, has arbitrary-precision integers.
| omoikane wrote:
| > I've been wanting a language with that feature
|
  | GNU MP's C++ interface makes this mostly transparent:
  | expressions like "x = y * z" just work thanks to operator
  | overloading. The only time you need to know that the
  | underlying class is "mpz_class" instead of "int" is at
  | input/output.
| pfdietz wrote:
| Common Lisp has been like this forever. It makes testing the
| language much easier, as you don't have to check for overflows
| when constructing test code.
|
| A dynamic language like CL has the added requirement that the
| integers fit with everything else under a single top type,
| which in practice limits fixnums to a few bits less than the
| word size.
| nsguy wrote:
| Rexx (EDIT: https://en.wikipedia.org/wiki/Rexx )
| mpweiher wrote:
| Smalltalk is another language that has had this feature since
| forever.
|
| Tagged pointers / SmallInteger encodings make the memory
| overhead zero to negligible, and with the CPU / memory-
| bandwidth gap being what it is, the few extra CPU operations
| rarely matter.
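  |
  | A toy version of that encoding in C (one tag bit; just a
  | sketch, not Smalltalk's actual layout):
  |
  |       #include <stdint.h>
  |
  |       typedef uintptr_t value;   /* one tagged machine word */
  |
  |       /* Low bit set: a small integer in the remaining bits,
  |          no allocation needed. Low bit clear: a pointer to a
  |          heap bignum object. */
  |       #define IS_SMALL(v)   ((v) & 1)
  |       #define FROM_SMALL(v) ((intptr_t)(v) >> 1)
  |       #define TO_SMALL(i)   (((uintptr_t)(i) << 1) | 1)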
|
| Daniel Bernstein made the argument that integers should default
| to these semantics for security reasons:
|
| https://css.csail.mit.edu/6.858/2018/readings/qmail.pdf
| epidemian wrote:
| It's a very common feature of high-level languages. Python,
| Ruby, Lisp(s), Haskell, etc, all use arbitrary-precision
  | integers by default. Even JavaScript has arbitrary-precision
  | integers now (BigInt, since 2020), not as the default number
  | type, but with the `n` suffix, like `42n`.
| kazinator wrote:
  | I'm a "low-level guy" like Pike, yet I wouldn't have integers
  | any other way in an application language.
  |
  | Nobody uses Go for Ethernet card drivers. Dealing with machine
| integers is just mental overhead for Go programmers, and a
| source of bugs.
| aag wrote:
| Do you remember where you saw that talk? I'd like to watch it.
|
| Thanks.
| worstspotgain wrote:
| I've had to implement BigNum on an embedded system that had very
| little RAM and where the initial _optimized_ version of ModExp
| took many minutes to complete. After much hair-pulling, the final
| version took 40 seconds.
|
| First, you should work with unsigned numbers, and use a power of
| 2 as your word size. The fastest choice for a word size is very
| operation- and CPU-dependent.
|
  | A key trick is to lay out your bignums in the same order as
  | the endianness of each word in memory, e.g. least-significant
  | word first on little-endian systems. This allows you to choose
| your word size dynamically for each operation: in memory, a
| number with M words of N bits each is identical to a number with
| M / 2 words of N * 2 bits each.
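  |
  | Concretely (a sketch, assuming a little-endian target):
  |
  |       #include <stdint.h>
  |       #include <string.h>   /* memcpy */
  |
  |       /* M 32-bit limbs stored least-significant first are
  |          bit-for-bit the same number as M/2 64-bit limbs on
  |          a little-endian machine, so a routine can pick its
  |          word size per operation. memcpy rather than a
  |          pointer cast sidesteps strict-aliasing issues. */
  |       uint64_t read_wide(const uint32_t *limbs32, size_t i) {
  |           uint64_t w;
  |           memcpy(&w, &limbs32[2 * i], sizeof w);
  |           return w;   /* 64-bit limb i of the same number */
  |       }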
|
| For multiplication, identify the CPU instruction with the widest
| result, then use half that size as your word size. Each step
| through the arrays generates a word result in the low half and a
| word carry in the top half. The carry gets added to the result of
| the next step, possibly overflowing.
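  |
  | In code, one pass of that scheme might look like this
  | (32-bit limbs, 64-bit products; a sketch of multiplying a
  | bignum by a single word):
  |
  |       #include <stddef.h>
  |       #include <stdint.h>
  |
  |       /* r = a * w, with a holding n 32-bit limbs, least
  |          significant first; r needs room for n + 1 limbs.
  |          Each 64-bit product yields a low-half digit and a
  |          high-half carry that feeds the next step. */
  |       void bn_mul_word(uint32_t *r, const uint32_t *a,
  |                        size_t n, uint32_t w) {
  |           uint64_t carry = 0;
  |           for (size_t i = 0; i < n; i++) {
  |               uint64_t p = (uint64_t)a[i] * w + carry;
  |               r[i] = (uint32_t)p;   /* low half */
  |               carry = p >> 32;      /* high half */
  |           }
  |           r[n] = (uint32_t)carry;
  |       }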
|
| For addition, use the widest result as your word size. This can
| also overflow.
|
| How you deal with overflows is very CPU-dependent. You can use
| adc/addc as someone else mentioned, which will be faster on
| embedded and _may_ be faster on fatter chips. Alternatively, you
| can halve the word size and use the top half as the carry.
|
| If addc is not available, you can test for overflows as follows:
  |         uint32_t a = ..., b = ...;
  |         uint32_t res = a + b;
  |         uint32_t carry = res < a;
|
| On overflow, res must necessarily be less than both a and b, so
| no need to check b.
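  |
  | Extended to a whole array, that test gives a portable
  | carry-propagating add (again, just a sketch):
  |
  |       #include <stddef.h>
  |       #include <stdint.h>
  |
  |       /* r = a + b over n word-sized limbs, least
  |          significant first; returns the outgoing carry. */
  |       uint32_t bn_add_words(uint32_t *r, const uint32_t *a,
  |                             const uint32_t *b, size_t n) {
  |           uint32_t carry = 0;
  |           for (size_t i = 0; i < n; i++) {
  |               uint32_t s = a[i] + carry;
  |               uint32_t c1 = s < carry;     /* overflowed? */
  |               r[i] = s + b[i];
  |               carry = c1 | (r[i] < b[i]);  /* res < b test */
  |           }
  |           return carry;
  |       }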
|
| If SIMD instructions are available, that will almost always be
| the fastest choice by far. While it doesn't change the above
| guidelines in principle, there are often e.g. optimized overflow
| mechanisms.
| nj5rq wrote:
  | Fascinating article. I have always wondered how these big
  | number libraries work.
  |
  | As a side question, does anyone know the program the author
  | used to make the "addition" and "multiplication" performance
  | graph? Thanks.
___________________________________________________________________
(page generated 2024-07-16 23:00 UTC)