[HN Gopher] The radix 2^51 trick (2017)
       ___________________________________________________________________
        
       The radix 2^51 trick (2017)
        
       Author : blobcode
       Score  : 377 points
       Date   : 2025-05-30 03:55 UTC (19 hours ago)
        
 (HTM) web link (www.chosenplaintext.ca)
 (TXT) w3m dump (www.chosenplaintext.ca)
        
       | addaon wrote:
       | > Aside: Why 13 bits instead of 12? For our purposes, we're going
       | to ignore the carries in the most significant limb, allowing
       | numbers to wrap when they overflow past 2256 - 1 (just like how
       | unsigned addition works in C with normal size integer types). As
       | a result, we can assign 52 bits to the most significant limb and
       | ignore the fact that it will run out of room for carries before
       | the other limbs do.
       | 
       | Why not give the top limb 64 bits and the other four limbs 48
       | bits each, then? You can accumulate more additions before
       | normalization, you can take advantage of word alignment during
       | splitting and normalization if your instruction set has anything
       | useful there, and your overflow properties are identical, no?
        
         | bboreham wrote:
         | Then you would need 6 words to hold a 256-bit value instead of
         | 5 in the OP, and consequently more instructions to add them.
        
           | addaon wrote:
           | 64 + 48 * 4 == 256... still just five 64-bit words.
        
             | bboreham wrote:
             | Now you can't detect overflow?
        
         | Sukera wrote:
         | Because adding the top limbs of two encoded numbers would
         | overflow too soon. If you set both to 2^63 for example, they
         | overflow immediately. Might be fine for wraparound arithmetic,
         | but not in general.
        
           | volemo wrote:
           | Setting both to 2^63 means your original 256-bit numbers were
           | 2^255, thus the addition would overflow no matter what
           | intermediate encoding you're using.
        
             | vitus wrote:
             | Sure, then set one to 2^62 and the other to -2^62 (namely:
             | 0b1100..00). It's overflow as far as unsigned arithmetic is
             | concerned, but not in the case of signed arithmetic.
             | 
             | That said, when you're dealing with 256-bit integers,
             | you're almost assuredly not working with signed arithmetic.
        
               | immibis wrote:
               | ...so? They don't care about top limb overflow, at all.
               | That's the point.
        
         | phkahler wrote:
         | >> Why not give the top limb 64 bits and the other four limbs
         | 48 bits each, then?
         | 
         | I think one goal is to use 5 64 bit registers to do 256 bit
         | math. That means using 256/5 = 51.2 bits of each word. That's
         | probably some kind of ideal if you want 256bit math, but not
         | optimal if you're writing a generic big-int library. In the old
         | days you'd want to use exactly one byte for the carry(s)
         | because we didn't have barrel shifters to do arbitrary bit
         | shifts efficiently. In that case I'd use 56 bits of the 64 to
         | get nice byte alignment.
         | 
         | This is all quite relevant for RISC-V since the ISA does not
         | have flags.
        
           | Thorrez wrote:
           | >That means using 256/5 = 51.2 bits of each word.
           | 
           | Why must each word have the same amount? Why not 64 bits on
           | the top word, and 48 bits on the other 4 words?
        
             | LegionMammal978 wrote:
             | Evenly distributing the number of bits per word lets you
             | chain more additions/subtractions before having to
             | normalize.
        
               | xigoi wrote:
               | Sure, but the point is that for the most significant
               | limb, there is no point in having redundant bits because
               | whatever you put in them will be discarded after
               | normalization.
        
               | LegionMammal978 wrote:
               | Ah, in that case, you're right, it would make sense to
               | use all 64 bits for the top limb. Still, making them all
               | equal-sized can have benefits if you use SIMD or similar
               | techniques to operate on them uniformly. One project of
               | mine has been trying to work with large integers in CUDA
               | by distributing their limbs across a warp.
        
           | andrewla wrote:
           | Even with this explanation a 64 + 48*4 is clearly superior.
           | You can go longer without overflow (since you have 16 bits of
           | carry space per pseudo-digit), and the amount of carry space
           | is aligned even more nicely.
        
       | russdill wrote:
       | I'm seriously doubtful that adc is inherently slower than add on
       | a modern CPU other then the data hazard introduced by the carry
       | bit. I realize the point of the article is the data hazard so
       | this is a really minor nit.
        
         | john-h-k wrote:
         | uops.info has latency for both (Alder Lake) at 1 cycle but
         | throughput (lower is better)
         | 
         | * for add is 0.20 (ie 5 per cycle)
         | 
         | * for adc is 0.50 (ie 2 per cycle)
         | 
         | so it does seem correct.
         | 
         | This seems to be a consequence of `add` being available on
         | ports 0, 1, 5, 6, & B, whereas `adc` is only available on ports
         | 0 & 6
         | 
         | So yes as an individual instruction it's no worse, but even
         | non-dependent instructions will be worse for OoO execution
         | (which is more realistic than viewing it as a single
         | instruction)
        
           | phkahler wrote:
           | Intel is also supposed to introduce the new APX instructions
           | which include a bunch of instructions that duplicate existing
           | ones but don't set any flags. The only plausible reason to
           | add these is for performance reasons.
        
             | john-h-k wrote:
             | This isn't just due to the actual dependencies of flag
             | instructions at hardware level (although likely be a
             | factor), it also majorly affects code layout. On Arm64 for
             | example, you can make a comparison, do other operations,
             | and then consume the result of that comparison afterwards,
             | which is excellent for the pipeline and OoO engine.
             | However, because most instructions on x86_64 write flags,
             | you can't do this, and so you are forced to cram
             | `jcc`/`setcc` instructions _right_ after the comparison,
             | which is less friendly to compilers and the OoO engine
        
               | dzaima wrote:
               | OoO should actually be the care where that doesn't matter
               | I'd think - the CPU can, well, execute the instructions
               | not in the order they're in the binary; it's in-order
               | implementations are where that matters more.
               | 
               | And with compare & jump being adjacent they can be fused
               | together into one uop, which Intel, AMD, and Apple
               | Silicon all do.
        
           | john-h-k wrote:
           | note: since learnt that B port is just port 11 in all the
           | intel docs, uops.info just hexifies them to keep ports
           | single-char
        
         | superjan wrote:
         | You are right: it can be done with the same ALU, for sure. But
         | the data dependency on the carry flag makes it a really
         | different instruction from the point of view of the CPU: three
         | data dependencies in stead of two. For the CPU it is beneficial
         | to treat the instructions differently.
        
         | animal531 wrote:
         | CPU's are really funny and interesting things. Us programmers
         | work with them daily and make so many assumptions about them,
         | as well as the whole code chain from the compiler, runtimes,
         | how code works when it comes to loops, methods etc., you name
         | it.
         | 
         | I've been working on my own Entity Component System in C# and
         | basically had to start from the ground up and test every
         | assumption possible. There have only really been a few
         | instances where my gut was correct, more often than not there
         | are so many surprising gotchas hidden everywhere.
        
           | yusina wrote:
           | It's because they are providing abstractions which we/the
           | compilers use, but just doing that would be too slow, so they
           | implement optimizations, but those are based on certain
           | assumptions, so then the users adjust what they do to match
           | those assumptions well, so the optimizations have now leaked
           | into the API, and after many rounds of doing this for
           | decades, you end up with this terrible mess we are in.
        
       | foota wrote:
       | Would it be legal for a C(++?) compiler to implement this
       | optimization?
        
         | nine_k wrote:
         | Does C++ have native support for uint256?
        
           | Arnavion wrote:
           | With C, it is _BitInt(256) if the compiler supports it. The
           | upper limit of _BitInt is implementation-defined though, so
           | 256 is not guaranteed to be supported. Eg clang on RV64 only
           | supports upto 128, but does support 256 on x64_64. gcc seems
           | to not support _BitInt on RV64 at all, but does support 256
           | on x86_64.
           | 
           | With C++ the existence of such "extended integer" types is
           | implementation defined. clang at least supports the same
           | _BitInt construct for C++ too. gcc seems to not support it.
           | 
           | So, for the 256 case on x86_64, both clang and gcc seem to
           | only generate the simple adc ripple version:
           | https://gcc.godbolt.org/z/nxoEda3q5
           | https://gcc.godbolt.org/z/bYf4bor3f
        
         | addaon wrote:
         | Yes, it complies with the as-if rule; there's no observable
         | difference in behavior. This would apply as well for supporting
         | 64 bit additions within a loop on 32- or 16-bit architectures,
         | for example.
        
         | rollcat wrote:
         | An unexpected optimisation can introduce a side channel (most
         | commonly timing). This one would be safe, but "how do you tell
         | a compiler which ones [not] to use" is a whole topic by itself.
        
           | Denvercoder9 wrote:
           | The C++ standard doesn't forbid introducing side channels, so
           | the answer to the question is yes.
        
             | rollcat wrote:
             | With all the UB, I wonder how did we manage to write any
             | secure or safety-critical code at all.
        
               | wat10000 wrote:
               | In C++? We pretty much did not.
        
       | brucehoult wrote:
       | Someone working entirely on x86_64 very nicely demonstrates that
       | RISC-V is not wrong to omit the carry flag.
        
         | brucehoult wrote:
         | Also, there is another way to do this while keeping 64 bit
         | limbs. All variables uint64_t.                   s0 += a0;
         | s1 += a1;         s2 += a2;         s3 += a3;
         | c0 = s0 < a0; // RISC-V `sltu`         c1 = s1 < a1;         c2
         | = s2 < a2;                  if (s1 == -1) goto propagate0; //
         | executes 1 time in 18,446,744,073,709,551,616         check_s2:
         | if (s2 == -1) goto propagate1; // ditto
         | add_carries:         s1 += c0;         s2 += c1;         s3 +=
         | c2;         goto done;                  propagate0: c1 = c0;
         | goto check_s2;                  propagate1: c2 = c1; goto
         | add_carries;                  done:
         | 
         | The key insight here is that unless the sum at a particular
         | limb position is all 1s the carry out from that position DOES
         | NOT DEPEND on the carry in to that limb position, but only on
         | whether the original add in that position produces a carry. If
         | the sum is all 1s the the carry out is the same as the carry
         | in.
         | 
         | If you express this with a conditional branch which is
         | overwhelmingly predicted as not taken then the code should
         | execute each block of instructions entirely in parallel,
         | provided that multiple conditional branches can be predicted as
         | not-taken in the same clock cycle.
         | 
         | One time in 2^64 it will execute very slowly.
         | 
         | With 4 limb numbers on a 4-wide machine this doesn't offer an
         | advantage over `adc` as there are also 4 code blocks. But on,
         | say, an 8-wide machine with 8 limb numbers you're really
         | starting to gain.
         | 
         | It's probably not going to help on current x86_64, but might
         | well do on Apple's M* series, where even the M1 is 8-wide,
         | though it might be tricky to work around the Arm ISA.
         | 
         | When the 8-wide RISC-V Ascalon processor from Tenstorrent hits
         | hopefully late this year or early 2026 we will really see. And
         | others such as Ventana, Rivos, XiangShan.
         | 
         | This will work even better in a wide SIMD, if you have a fast
         | 1-lane shift (Called slideup on RISC-V).
        
           | phkahler wrote:
           | I think you want to write:                 if (s1 == -1)
           | c1 = c0;       if (s2 == -1)          c2 = c1;
           | 
           | These can become conditional moves on x86. I've often thought
           | RISC-V should have implemented an IF instruction instead of
           | compare and branch. IF would cause the next instruction to be
           | executed conditionally while not needing a flag register at
           | the ISA level. They could have required only branch and jump
           | to be conditional, but it turns out conditional mov, load,
           | and store are all very useful in real code.
        
             | brucehoult wrote:
             | The problem is that, as far as I know, a conditional move
             | is going to introduce a data dependency from c0 to c1 to c2
             | that is the exact thing we are trying to get rid of. The
             | cmov is a constant time instruction, not a speculated
             | instruction like a conditional branch.
             | 
             | The entire point of what I did is that the two conditional
             | branches will be predicted not taken, so the CPU will
             | 99.9999999999999999946% of the time not even see the `c1 =
             | c0` and `c2 = c1` instructions that introduce the
             | sequential dependencies.
        
             | IshKebab wrote:
             | That sounds like it would be quite a pain to implement and
             | program. E.g. what happens if there's an interrupt between
             | the IF and the following instruction? You need to add a CSR
             | to read/write the conditional state, similar to the vector
             | control CSRs (vstart etc.). Hard to see how that extra
             | complexity would be worth it.
             | 
             | Modern branch predictors are very good and most branches
             | are very predictable.
        
           | less_less wrote:
           | Neat, but if you're using this in cryptographic code (one of
           | the main consumers of bignums), keep in mind that secret data
           | reaching branches is usually a side-channel risk. Sure, it's
           | only 1 time in 2^64 on _random_ data, but if you 're
           | depending on that, then you have to consider whether an
           | attacker can choose data that will make it happen more often.
           | 
           | If you can substitute a cmov without control flow then it's
           | probably safer, e.g. c1 |= c0 & seq(s1,-1) or so, so long as
           | you can make sure the compiler won't turn it into a branch.
           | 
           | It does add a data dependency though ...
        
         | NooneAtAll3 wrote:
         | ha, I'm not the only one to think "so what's all the risc5 gmp
         | fuss was about, if carry flag is slow anyway?"
        
           | brucehoult wrote:
           | Right.
           | 
           | Even at that time in 2021 I argued that serialising through a
           | carry flag is limiting on wide machines, but there was very
           | little RISC-V hardware available at the time and also GMP was
           | not yet ported to RISC-V.
           | 
           | That has changed a bit now, and almost two months ago I tried
           | the GMP project's own gmpbench on a few RISC-V boards.
           | 
           | I found that when comparing similar uarch at similar clock
           | speed, in dual-issue in-order SiFive's U74 is very comparable
           | to Arm's A53, and in small 3-wide OoO SiFive's P550 is
           | significantly better than Arm's A72.
           | 
           | And that's not even using the kind of technique discussed in
           | this post, but the full multi-instruction carry flag
           | emulation criticised by Granlund.
           | 
           | https://www.reddit.com/r/RISCV/comments/1jsnbdr/gnu_mp_bignu.
           | ..
           | 
           | It's going to be very interesting when the 8-wide OoO RISC-V
           | cores come out, probably starting with Tenstorrent's Ascalon
           | core which they expect to tape out in Q3 and they have said
           | they want to get into as many hands as possible to accelerate
           | RISC-V development, including in laptops, not only in servers
           | or the like.
        
         | pjc50 wrote:
         | This is all downstream of C omitting the carry flag, which
         | means in practice it's very rarely used for the purpose of a
         | carry.
        
           | immibis wrote:
           | C does, however, now have _BitInt
        
             | phkahler wrote:
             | Ugh, what a terrible thing to add to C.
        
         | adrian_b wrote:
         | There remain many frequently-encountered cases when carry-save
         | addition is worse than addition using add-with-carry.
         | 
         | Neither of the 2 multi-word addition algorithms can replace the
         | other, both have their use cases, so ADC/SBB instructions are
         | included in any decent ISA, because the cost of adding them is
         | negligible. A dedicated flag register is not necessary, some
         | ISAs store the carry/borrow flags in general-purpose registers,
         | when used.
         | 
         | Not having carry is by far not the worst feature of RISC-V.
         | Much worse is not having an integer overflow flag, because the
         | software workaround for detecting integer overflow, which is
         | mandatory for any program that claims to be written in a safe
         | way, lowers the attainable performance much more than the
         | workarounds for not having carry.
        
           | phkahler wrote:
           | >> because the software workaround for detecting integer
           | overflow, which is mandatory for any program that claims to
           | be written in a safe way, lowers the attainable performance
           | much more than the workarounds for not having carry
           | 
           | That's absurd. A better way is to ensure that your algorithms
           | don't overflow. Detecting an overflow just means your code
           | has to STOP which is usually not safe. It'd be insane to have
           | conditionally executed code trying to figure out how to
           | handle an overflow anywhere in code. Another problem is that
           | flags are not even accessible from any language higher level
           | then ASM. From a C perspective there are no flags.
        
             | dzaima wrote:
             | While there is no direct access to flags in standard C, you
             | can nevertheless on gcc and clang compile with -ftrapv and
             | get your signed integer arithmetic be overflow-checked. Or
             | you can use __builtin_add_overflow & co and get access to
             | the overflow flags that way. Rust debug builds trap on
             | signed and unsigned integer overflow, and you can make
             | release builds do so too.
             | 
             | While it'd be nice to have a formal proof that every single
             | `a+b`, `a-b`, `a*b` in every codebase doesn't overflow, I'm
             | sure you understand that that is rather impractical. (and
             | really, it'd be nice! I've thought about having some
             | compile-time-bounded-size integers where each addition
             | increases the size, but multiplication is much less
             | suitable for that, and it also means you can't have a loop
             | adding to an accumulator. It's a rather non-trivial problem
             | really - you might think that it'd be fine to have a loop
             | over a list of objects and sum their sizes, but that can
             | relatively easily overflow if the list references the same
             | massive object many times, so can't even really abstract
             | that)
        
       | nine_k wrote:
       | The main takeaway: doing more operations may be faster if they
       | are largely independent, and thus can execute in parallel. Doing
       | fewer operations may be slower if they are forced to execute
       | serially due to data dependency.
       | 
       | This idea has wider applicability than operations on long
       | integers.
        
         | repelsteeltje wrote:
         | Yes. Another approach would be to use regular 64 bit chunks and
         | speculatively execute each add _with_ and _without_ carry in
         | parallel. Then select the correct variant based on carry result
         | of less significant addition.
         | 
         | With double the amount of additions this allows for log(bits)
         | propagation time (versus linear)
        
           | dgoldstein0 wrote:
           | There's not just "result with carry" and "result without
           | carry" but rather one variant of that per word of the input
           | 
           | ... Which likely isn't that bad to code up.
        
           | brucehoult wrote:
           | Or see https://news.ycombinator.com/item?id=44133169
        
           | volemo wrote:
           | Wouldn't that produce 2^n possible results to choose from,
           | where n is the number of chunks? That seems like _a lot_ of
           | additional (he-he) instructions executed.
        
             | repelsteeltje wrote:
             | Nope. Just 2n: each chunk pair is added once without carry,
             | and once won't carry=1.
             | 
             | For as long as radix=2, you either have a carry or you
             | don't.
        
               | mananaysiempre wrote:
               | For a single addition, the radix is irrelevant, the carry
               | is always zero or one: (r-1) + (r-1) = 2r - 2 < 2r.
        
         | rollcat wrote:
         | This rule scales up all the way to multi-node supercomputers /
         | cloud. The overhead is negligible when you can employ 10.000
         | cores.
        
           | hinkley wrote:
           | Amdahl says no.
        
           | noduerme wrote:
           | Abstractly, when any parallel system scales up large enough
           | without cross checking or waiting between "threads", the cost
           | of de-duplicating and merging the output will probably
           | outweigh the advantage of producing new results in tandem. I
           | think. That's just a hypothesis, but feels correct. With
           | stuff like a-life distributed over lots of servers all
           | converging on evolutionary answers to a problem, it's the
           | collation and analysis layer that's most expensive and slow.
           | Sharing more frequently / allowing more reliance on central
           | historical truth slows each one down but avoids duplication
           | and redundancy. I guess where that point is depends on what
           | problem you're trying to solve.
        
           | credit_guy wrote:
           | Actually the overhead crushes you when you employ 10000
           | cores. If the overhead of a process is 10% and the parallel
           | part is 90%, then 2 cores will result in a run time of 55% =
           | 10% + 90%/2 of the original time. And 10 cores will get you
           | to 19%. And 100 cores to 10.9%. If you then buy 9900 more
           | cores to bring it to a total of 10000, you just reduced the
           | runtime from 10.9% to 10.009%. In other words, you increased
           | your bill by a factor of 100 to reduce your run time by
           | almost nothing.
        
             | volemo wrote:
             | You two are talking about different kinds of overhead
             | though.
        
         | CamperBob2 wrote:
         | Yep. Company called NVidia has been looking into that general
         | idea. They seem to be getting some promising results so far, in
         | a couple of different areas.
        
         | zahlman wrote:
         | What I didn't get about this: the technique shown seems to be
         | about making sure that the ripple carry only happens once
         | instead of N-1 times while adding N values. The carry operation
         | is more complex, but this allows the actual addition to be
         | parallelized.
         | 
         | But - you still have to split the input numbers into sets of 5
         | registers in the first place, right? So doesn't _that_ need to
         | be parallelizable somehow as well in order for this to be a net
         | win?
        
           | adgjlsfhk1 wrote:
           | That is paralelizable. Each of the 5 registers has no depence
           | on the value of the others.
        
       | dang wrote:
       | Related. Others?
       | 
       |  _The radix 2^51 trick_ -
       | https://news.ycombinator.com/item?id=33706153 - Nov 2022 (6
       | comments)
       | 
       |  _The radix 2^51 trick (2017)_ -
       | https://news.ycombinator.com/item?id=23351007 - May 2020 (83
       | comments)
        
       | eru wrote:
       | The 'radix trick' also works for data structures.
       | 
       | Okasaki's book 'Purely Functional Data Structures' has some nice
       | examples.
        
       | alwahi wrote:
       | how to do this for large multiplications instead?
        
         | nickcw wrote:
         | You can do large multiplications with a convolution and do the
         | carry afterwards.
         | 
         | A convolution can be done with FFT, pointwise multiply, inverse
         | FFT which is O(n log n) rather that O(n^2) for traditional
         | multiplication.
         | 
         | The bits in each limb can be quite small though as there are
         | lots of carries and it depends on how many digits you have and
         | how accurate your floating point is.
         | 
         | Some kind of FFT is how all large multiplications are done.
         | 
         | I had a lot of fun learning about this in relation to GIMPS
         | (the Great Internet Mersenne Prime Search) where you use a
         | variant FFT called a DWT over an irrational base which gives
         | you a free mod 2^n-1 which is what you want for primality
         | testing Mersenne prime candidates using the Lucas test.
        
           | phkahler wrote:
           | GIMPS is also interesting since it doesn't need to do 2
           | operand multiplication. It only needs squaring.
        
       | e4m2 wrote:
       | On modern enough x86 CPUs (Intel Broadwell, AMD Ryzen) you could
       | also use ADX [1] which may be faster nowadays in situations where
       | radix 2^51 representation traditionally had an edge (e.g.
       | Curve25519).
       | 
       | [1] https://en.wikipedia.org/wiki/Intel_ADX
        
       | ashdnazg wrote:
       | With AVX512 (and to a lesser extent with AVX2) one can implement
       | 256 bit addition pretty efficiently with the additional benefit
       | of fitting more numbers in registers.
       | 
       | It looks more or less like this:                 __m256i s =
       | _mm256_add_epi64(a, b);       const __m256i all_ones =
       | _mm256_set1_epi64x(~0);       int g = _mm256_cmpgt_epu64_mask(a,
       | s);       int p = _mm256_cmpeq_epu64_mask(s, all_ones);       int
       | carries = ((g << 1) + p) ^ p;            __m256i ret =
       | _mm256_mask_sub_epi64(s, carries, s, all_ones);
       | 
       | The throughput even seems to be better:
       | https://godbolt.org/z/e7zETe8xY
       | 
       | It's trivial to change this to do 512 bit addition where the
       | improvement will be even more significant.
        
         | amitprasad wrote:
         | Note that, especially on certain Intel architectures, using
         | AVX512 instructions _at all_ can result in the whole processor
         | downclocking, and thus ending up resulting in inconsistent /
         | slower overall performance.
         | 
         | https://stackoverflow.com/questions/56852812/simd-instructio...
        
           | adgjlsfhk1 wrote:
           | > using AVX512 instructions _at all_
           | 
           | This isn't correct. AVX512 provides both a bunch of extra
           | instructions, zmm (512 bit) registers, and an extra 16 (for a
           | total of 32) vector registers. The donwnclocking only happens
           | if you use 512 bit registers (not just avx512 instructions).
           | The difference here matters a bunch since there are a bunch
           | of really useful instructions (e.g. 64 bit integer multiply)
           | that are added by avx512 that are pure upside.
           | 
           | Also none of this is an issue on Zen4 or Zen5 since they use
           | much more sensible downlclocking where it will only downclock
           | if you've used enough instructions in a row for it to start
           | spiking power/temp.
        
             | amitprasad wrote:
             | Ah yes, you're completely correct :)
             | 
             | General idea was just to highlight some of the dangers of
             | vector registers. I believe the same is true of ymm (256)
             | to a lesser extent.
        
       | t0010q wrote:
       | It's funny that carries don't just make addition difficult to
       | parallelize. Binary addition without carry is XOR. XOR subset sum
       | - find a subset whose XOR gives the desired target - is in P, but
       | proper subset sum with carry is NP-complete.
        
       | hdjrudni wrote:
       | I wish I came across this article a couple months ago.
       | 
       | I was trying to encode and decode some buffers into an arbitrary
       | base, and I eventually came to the conclusion (after far too
       | long) that a carry could ripple all the way down the buffer,
       | which dramatically slows down the algorithm.
       | 
       | Actually, the eventual solution I came up might have some stuff
       | in common with this trick too. I did eventually chunk up the
       | buffer leaving some unused headroom to 'handle carries'. Not
       | exactly though, I just have some wasted bits which uses a tiny
       | bit more storage or network bandwidth but saves on compute. I
       | wonder if I could instead pool up the carries like this and
       | 'resolve' it in a later step. Have my cake and eat it too?
       | Wishful thinking.
        
       ___________________________________________________________________
       (page generated 2025-05-30 23:00 UTC)