[HN Gopher] Optimizing a bignum library for fun
       ___________________________________________________________________
        
       Optimizing a bignum library for fun
        
       Author : azhenley
       Score  : 50 points
       Date   : 2024-07-16 18:08 UTC (4 hours ago)
        
 (HTM) web link (austinhenley.com)
 (TXT) w3m dump (austinhenley.com)
        
       | lifthrasiir wrote:
       | Modulo is surprisingly expensive even when you combine it with a
       | quotient. It is almost always better to use binary "limbs", in
       | this case 31 or 32 bits wide, because decimal parsing and
       | printing should be much rarer than individual operations in
       | general.
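        | 
        | To illustrate (my sketch, not the article's code): with base-10^9
        | limbs the carry falls out of a divide and a modulo, while with
        | base-2^32 limbs it is just a truncation and a shift.
        | 
        |     /* decimal limbs, each < 10^9 */
        |     uint32_t x = 999999999u, y = 999999999u;
        |     uint64_t t = (uint64_t)x + y;
        |     uint32_t sum_dec   = (uint32_t)(t % 1000000000u);
        |     uint32_t carry_dec = (uint32_t)(t / 1000000000u);
        | 
        |     /* binary limbs, each < 2^32 */
        |     uint32_t a = 0xFFFFFFFFu, b = 1u;
        |     uint64_t u = (uint64_t)a + b;
        |     uint32_t sum_bin   = (uint32_t)u;          /* low 32 bits */
        |     uint32_t carry_bin = (uint32_t)(u >> 32);  /* high bits   */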
        
         | colonwqbang wrote:
         | This is a bit misleading. Quotient and modulo with constant
         | right-hand side is readily optimised into faster operations,
         | like multiplication and bit shifts. "All" optimising compilers
         | do this and don't emit a true divide instruction in such cases.
         | See for instance: https://godbolt.org/z/vxnTPvY6M
         | 
          | Quotient and modulo with a variable RHS is expensive, but that
          | isn't needed for the algorithm shown in the article. Although I
          | agree that binary wins over decimal anyway; we have been using
          | binary computers for the last 50 years or so, after all.
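          | 
          | A sketch of the pattern in question (not the exact godbolt
          | snippet): with a constant divisor, the compiler typically
          | replaces both the quotient and the remainder with a widening
          | multiply by a precomputed reciprocal plus shifts, so no divide
          | instruction is emitted.
          | 
          |     #include <stdint.h>
          | 
          |     /* split an intermediate sum into digit and carry, base 10^9;
          |        both operations below compile to multiply-and-shift */
          |     uint32_t split(uint64_t t, uint32_t *carry) {
          |         *carry = (uint32_t)(t / 1000000000u);  /* quotient  */
          |         return  (uint32_t)(t % 1000000000u);   /* remainder */
          |     }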
        
           | Someone wrote:
           | > Quotient and modulo with a variable RHS is expensive
           | 
           | For dividing 'big' bignums by 'small' divisors, I would
           | use/tweak _libdivide_ (https://libdivide.com/), and spend the
           | time to optimize the division.
           | 
           | For dividing 'big' bignums by 'big' divisors, that may not be
           | worth it.
           | 
            | (Determining the correct values of the various 'big' and
            | 'small' thresholds is left as an exercise for the reader.
            | They will depend on the hardware at hand.)
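            | 
            | Roughly like this, if I remember libdivide's C API correctly
            | (treat the names as approximate): you generate the reciprocal
            | structure once and then reuse it for every division by the
            | same small divisor.
            | 
            |     #include <stdint.h>
            |     #include <stddef.h>
            |     #include "libdivide.h"
            | 
            |     void div_each(uint64_t *xs, size_t n, uint64_t d) {
            |         /* precompute once, reuse across the whole array */
            |         struct libdivide_u64_t fast = libdivide_u64_gen(d);
            |         for (size_t i = 0; i < n; i++)
            |             xs[i] = libdivide_u64_do(xs[i], &fast);
            |     }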
        
           | worstspotgain wrote:
            | The reason a division by 1000000 can be efficiently turned
            | into a multiplication is that you're targeting a system that
            | has a 32x32->64 multiplication instruction, which has plenty
            | of room to spare. But if you have that, you would be better
            | off using e.g. 2^32 as your modulo.
        
       | styczen wrote:
       | not + -
        
       | minimize wrote:
       | For maximum efficiency, you should work in binary instead of base
       | 10. Handling carries becomes more straightforward with the right
       | primitives, for example __builtin_addc with GCC:
       | https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
       | 
       | You can also implement it in C if you want a more portable
       | solution:
       | https://github.com/983/bigint/blob/ee0834c65a27d18fa628e6c52...
       | 
       | If you scroll around, you can also find my implementations for
       | multiplication and such.
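        | 
        | For example, a limb-array add on top of that builtin might look
        | like this (a sketch; __builtin_addc is available in Clang and
        | newer GCC, older GCC only offers __builtin_add_overflow):
        | 
        |     #include <stddef.h>
        | 
        |     /* r = a + b over n word-sized limbs; returns the final carry */
        |     unsigned bignum_add(unsigned *r, const unsigned *a,
        |                         const unsigned *b, size_t n) {
        |         unsigned carry = 0;
        |         for (size_t i = 0; i < n; i++)
        |             r[i] = __builtin_addc(a[i], b[i], carry, &carry);
        |         return carry;
        |     }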
        
         | rurban wrote:
         | For maximum efficiency you'd avoid addc like hell, because it
         | blocks internal precomputation, and use guarantees like he did,
         | avoiding overflow better. Better use int128 types though.
         | 
          | I'd just use stack objects for small, typical sizes. And esp.
          | do formal proofs rather than the random test joke.
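          | 
          | If I read the int128 suggestion right, something like this
          | (sketch; unsigned __int128 is a GCC/Clang extension), where
          | the carry lives in an ordinary register rather than the flags:
          | 
          |     #include <stdint.h>
          |     #include <stddef.h>
          | 
          |     uint64_t add_limbs(uint64_t *r, const uint64_t *a,
          |                        const uint64_t *b, size_t n) {
          |         uint64_t carry = 0;
          |         for (size_t i = 0; i < n; i++) {
          |             unsigned __int128 t =
          |                 (unsigned __int128)a[i] + b[i] + carry;
          |             r[i]  = (uint64_t)t;
          |             carry = (uint64_t)(t >> 64);
          |         }
          |         return carry;
          |     }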
        
           | minimize wrote:
           | > For maximum efficiency you'd avoid addc like hell, because
           | it blocks internal precomputation, and use guarantees like he
           | did, avoiding overflow better.
           | 
           | Whether that is the case depends on the CPU. I hit memory
           | bandwidth limits before adc becomes a problem. But I concede
           | that you have a valid point and that there are definitely
           | cases where leaving sufficient space for carries is the
           | better strategy.
           | 
           | Anyway, we can probably both agree that "% 1000000000" is not
           | the best way to do it.
        
           | koverstreet wrote:
           | addc makes all your adds serialize on the flags register,
           | which is really painful.
           | 
            | a more modern approach is to reserve some bits out of every
            | word for carries, but that drastically complicates things.
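            | 
            | A rough sketch of that reserved-bits scheme (limb width and
            | limits made up for illustration): with 60-bit limbs in 64-bit
            | words you can do up to 16 plain, independent adds and only
            | propagate the deferred carries at the end.
            | 
            |     #include <stdint.h>
            |     #include <stddef.h>
            | 
            |     #define LIMB_BITS 60
            |     #define LIMB_MASK ((1ULL << LIMB_BITS) - 1)
            | 
            |     /* independent adds, no carry chain */
            |     void acc_add(uint64_t *acc, const uint64_t *x, size_t n) {
            |         for (size_t i = 0; i < n; i++)
            |             acc[i] += x[i];
            |     }
            | 
            |     /* propagate the deferred carries once
            |        (carry out of the top limb ignored here for brevity) */
            |     void acc_normalize(uint64_t *acc, size_t n) {
            |         uint64_t carry = 0;
            |         for (size_t i = 0; i < n; i++) {
            |             uint64_t t = acc[i] + carry;
            |             acc[i] = t & LIMB_MASK;
            |             carry  = t >> LIMB_BITS;
            |         }
            |     }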
        
             | JonChesterfield wrote:
              | I remembered, and ultimately found a source for, a
              | workaround for the serialising-on-flags problem: an Intel
              | paper at
              | https://web.archive.org/web/20150131061304/http://www.intel....
              | It amounts to new instructions with better behaviour for
              | ILP.
        
       | paldepind2 wrote:
       | Speaking of bignum libraries, I recently watched a talk with Rob
       | Pike where he mentioned that one thing he regretted about Go was
       | not making the default integer implementation arbitrary
       | precision. Supposedly the performance overhead for normal numbers
       | is very small, and you avoid the weirdness and complicated
       | semantics of fixed precision integers. I found that to be quite
       | fascinating, especially coming from a "low-level guy" like Rob
       | Pike. Ever since I've been wanting a language with that feature
       | and to understand how bignum implementations work.
        
         | MaxBarraclough wrote:
         | For what it's worth, Haskell's _Integer_ type is that way.
         | 
         |  _edit_ To be clear, in Haskell you  'default' to using its
         | arbitrary-precision type, unlike say Java where they're
         | available in the standard library but not in the core primitive
         | types.
        
         | vient wrote:
         | > I've been wanting a language with that feature
         | 
          | Python, whose reference implementation CPython is mentioned in
          | the article, has arbitrary-precision integers.
        
         | omoikane wrote:
         | > I've been wanting a language with that feature
         | 
          | GNU MP's C++ interface makes this mostly transparent:
          | expressions like "x = y * z" just work due to operator
          | overloading. The only time you need to know that the underlying
          | class is "mpz_class" instead of "int" is at input/output.
        
         | pfdietz wrote:
         | Common Lisp has been like this forever. It makes testing the
         | language much easier, as you don't have to check for overflows
         | when constructing test code.
         | 
         | A dynamic language like CL has the added requirement that the
         | integers fit with everything else under a single top type,
         | which in practice limits fixnums to a few bits less than the
         | word size.
        
         | nsguy wrote:
         | Rexx (EDIT: https://en.wikipedia.org/wiki/Rexx )
        
         | mpweiher wrote:
         | Smalltalk is another language that has had this feature since
         | forever.
         | 
         | Tagged pointers / SmallInteger encodings make the memory
         | overhead zero to negligible, and with the CPU / memory-
         | bandwidth gap being what it is, the few extra CPU operations
         | rarely matter.
         | 
         | Daniel Bernstein made the argument that integers should default
         | to these semantics for security reasons:
         | 
         | https://css.csail.mit.edu/6.858/2018/readings/qmail.pdf
        
         | epidemian wrote:
         | It's a very common feature of high-level languages. Python,
         | Ruby, Lisp(s), Haskell, etc, all use arbitrary-precision
          | integers by default. Even JavaScript has arbitrary-precision
          | integers now (since 2020), not as the default number type, but
          | with the `n` suffix, like `42n`.
        
         | kazinator wrote:
         | I'm "low-level guy, like Pike", yet I wouldn't have integers
         | any other way in an application language.
         | 
         | Nobody uses Go for ethernet card drivers. Dealing with machine
         | integers is just mental overhead for Go programmers, and a
         | source of bugs.
        
         | aag wrote:
         | Do you remember where you saw that talk? I'd like to watch it.
         | 
         | Thanks.
        
       | worstspotgain wrote:
       | I've had to implement BigNum on an embedded system that had very
       | little RAM and where the initial _optimized_ version of ModExp
       | took many minutes to complete. After much hair-pulling, the final
       | version took 40 seconds.
       | 
       | First, you should work with unsigned numbers, and use a power of
       | 2 as your word size. The fastest choice for a word size is very
       | operation- and CPU-dependent.
       | 
        | A key trick is to lay out your bignum's words in the same order
        | as the machine's endianness, e.g. least-significant word first
        | on little-endian systems. This allows you to choose your word
        | size dynamically for each operation: in memory, a number with M
        | words of N bits each is identical to a number with M / 2 words
        | of N * 2 bits each.
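        | 
        | Concretely (my example, little-endian assumed), the same bytes
        | can be re-read at twice the word width without any shuffling:
        | 
        |     #include <stdint.h>
        |     #include <string.h>
        | 
        |     void widen_view(void) {
        |         /* least-significant word first, matching the byte order */
        |         uint32_t n32[4] = { 0x89ABCDEFu, 0x01234567u, 1u, 0u };
        |         uint64_t n64[2];
        |         memcpy(n64, n32, sizeof n32);
        |         /* n64[0] == 0x0123456789ABCDEF, n64[1] == 1 */
        |     }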
       | 
       | For multiplication, identify the CPU instruction with the widest
       | result, then use half that size as your word size. Each step
       | through the arrays generates a word result in the low half and a
       | word carry in the top half. The carry gets added to the result of
       | the next step, possibly overflowing.
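        | 
        | A sketch of one such pass (my code, 32-bit words, 32x32->64
        | multiply):
        | 
        |     #include <stdint.h>
        |     #include <stddef.h>
        | 
        |     /* r += a * b_word over n 32-bit words; returns the carry out
        |        of the top word. The low half of each product is the result
        |        word, the high half is the carry into the next step. */
        |     uint32_t mul_add_word(uint32_t *r, const uint32_t *a,
        |                           size_t n, uint32_t b_word) {
        |         uint64_t carry = 0;
        |         for (size_t i = 0; i < n; i++) {
        |             uint64_t t = (uint64_t)a[i] * b_word + r[i] + carry;
        |             r[i]  = (uint32_t)t;    /* low half  */
        |             carry = t >> 32;        /* high half */
        |         }
        |         return (uint32_t)carry;
        |     }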
       | 
       | For addition, use the widest result as your word size. This can
       | also overflow.
       | 
       | How you deal with overflows is very CPU-dependent. You can use
       | adc/addc as someone else mentioned, which will be faster on
       | embedded and _may_ be faster on fatter chips. Alternatively, you
       | can halve the word size and use the top half as the carry.
       | 
        | If addc is not available, you can test for overflow as follows:
        | 
        |     uint32_t a = ..., b = ...;
        |     uint32_t res = a + b;
        |     uint32_t carry = res < a;
       | 
       | On overflow, res must necessarily be less than both a and b, so
       | no need to check b.
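        | 
        | Folded into a full add loop (my sketch), with the carry-in
        | handled the same way:
        | 
        |     #include <stdint.h>
        |     #include <stddef.h>
        | 
        |     /* r = a + b over n 32-bit words, no addc needed */
        |     uint32_t add_words(uint32_t *r, const uint32_t *a,
        |                        const uint32_t *b, size_t n) {
        |         uint32_t carry = 0;
        |         for (size_t i = 0; i < n; i++) {
        |             uint32_t t = a[i] + carry;
        |             uint32_t c = t < carry;   /* carry-in wrapped around */
        |             uint32_t s = t + b[i];
        |             carry = c | (s < t);      /* at most one can be set  */
        |             r[i]  = s;
        |         }
        |         return carry;
        |     }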
       | 
       | If SIMD instructions are available, that will almost always be
       | the fastest choice by far. While it doesn't change the above
       | guidelines in principle, there are often e.g. optimized overflow
       | mechanisms.
        
       | nj5rq wrote:
        | Fascinating article; I have always wondered how these big number
        | libraries work.
       | 
        | As a side question, does anyone know what program the author
        | used to make that "addition" and "multiplication" performance
        | graph? Thanks.
        
       ___________________________________________________________________
       (page generated 2024-07-16 23:00 UTC)