[HN Gopher] The radix 2^51 trick (2017)
___________________________________________________________________
The radix 2^51 trick (2017)
Author : blobcode
Score : 377 points
Date : 2025-05-30 03:55 UTC (19 hours ago)
(HTM) web link (www.chosenplaintext.ca)
(TXT) w3m dump (www.chosenplaintext.ca)
| addaon wrote:
| > Aside: Why 13 bits instead of 12? For our purposes, we're going
| to ignore the carries in the most significant limb, allowing
| numbers to wrap when they overflow past 2^256 - 1 (just like how
| unsigned addition works in C with normal size integer types). As
| a result, we can assign 52 bits to the most significant limb and
| ignore the fact that it will run out of room for carries before
| the other limbs do.
|
| Why not give the top limb 64 bits and the other four limbs 48
| bits each, then? You can accumulate more additions before
| normalization, you can take advantage of word alignment during
| splitting and normalization if your instruction set has anything
| useful there, and your overflow properties are identical, no?
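|
| For concreteness, a rough C sketch of the proposal (illustrative
| names, not the article's code): limb-wise addition is five
| independent adds either way; only the normalization masks differ.
|
|     #include <stdint.h>
|
|     typedef struct { uint64_t limb[5]; } u256;   /* low to high */
|
|     /* Five independent adds -- no carry chain between limbs. */
|     static void add_limbwise(u256 *s, const u256 *a) {
|         for (int i = 0; i < 5; i++)
|             s->limb[i] += a->limb[i];
|     }
|
|     /* Normalization for the proposed 48-bit variant: ripple the
|        accumulated carries up once; the 64-bit top limb wraps. */
|     static void normalize48(u256 *s) {
|         const uint64_t M48 = (1ULL << 48) - 1;
|         for (int i = 0; i < 4; i++) {
|             s->limb[i + 1] += s->limb[i] >> 48;
|             s->limb[i] &= M48;
|         }
|     }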
| bboreham wrote:
| Then you would need 6 words to hold a 256-bit value instead of
| 5 in the OP, and consequently more instructions to add them.
| addaon wrote:
| 64 + 48 * 4 == 256... still just five 64-bit words.
| bboreham wrote:
| Now you can't detect overflow?
| Sukera wrote:
| Because adding the top limbs of two encoded numbers would
| overflow too soon. If you set both to 2^63 for example, they
| overflow immediately. Might be fine for wraparound arithmetic,
| but not in general.
| volemo wrote:
| Setting both to 2^63 means your original 256-bit numbers were
| 2^255, thus the addition would overflow no matter what
| intermediate encoding you're using.
| vitus wrote:
| Sure, then set one to 2^62 and the other to -2^62 (namely:
| 0b1100..00). It's overflow as far as unsigned arithmetic is
| concerned, but not in the case of signed arithmetic.
|
| That said, when you're dealing with 256-bit integers,
| you're almost assuredly not working with signed arithmetic.
| immibis wrote:
| ...so? They don't care about top limb overflow, at all.
| That's the point.
| phkahler wrote:
| >> Why not give the top limb 64 bits and the other four limbs
| 48 bits each, then?
|
| I think one goal is to use five 64-bit registers to do 256-bit
| math. That means using 256/5 = 51.2 bits of each word. That's
| probably some kind of ideal if you want 256-bit math, but not
| optimal if you're writing a generic big-int library. In the old
| days you'd want to use exactly one byte for the carry(s)
| because we didn't have barrel shifters to do arbitrary bit
| shifts efficiently. In that case I'd use 56 bits of the 64 to
| get nice byte alignment.
|
| This is all quite relevant for RISC-V since the ISA does not
| have flags.
| Thorrez wrote:
| >That means using 256/5 = 51.2 bits of each word.
|
| Why must each word have the same amount? Why not 64 bits on
| the top word, and 48 bits on the other 4 words?
| LegionMammal978 wrote:
| Evenly distributing the number of bits per word lets you
| chain more additions/subtractions before having to
| normalize.
| xigoi wrote:
| Sure, but the point is that for the most significant
| limb, there is no point in having redundant bits because
| whatever you put in them will be discarded after
| normalization.
| LegionMammal978 wrote:
| Ah, in that case, you're right, it would make sense to
| use all 64 bits for the top limb. Still, making them all
| equal-sized can have benefits if you use SIMD or similar
| techniques to operate on them uniformly. One project of
| mine has been trying to work with large integers in CUDA
| by distributing their limbs across a warp.
| andrewla wrote:
| Even with this explanation, a 64 + 48*4 split is clearly
| superior. You can go longer without overflow (since you have
| 16 bits of carry space per pseudo-digit), and the amount of
| carry space is aligned even more nicely.
| russdill wrote:
| I'm seriously doubtful that adc is inherently slower than add on
| a modern CPU, other than the data hazard introduced by the carry
| bit. I realize the point of the article is the data hazard, so
| this is a really minor nit.
| john-h-k wrote:
| uops.info has latency for both (Alder Lake) at 1 cycle but
| throughput (lower is better)
|
| * for add is 0.20 (ie 5 per cycle)
|
| * for adc is 0.50 (ie 2 per cycle)
|
| so it does seem correct.
|
| This seems to be a consequence of `add` being available on
| ports 0, 1, 5, 6, & B, whereas `adc` is only available on ports
| 0 & 6
|
| So yes as an individual instruction it's no worse, but even
| non-dependent instructions will be worse for OoO execution
| (which is more realistic than viewing it as a single
| instruction)
| phkahler wrote:
| Intel is also supposed to introduce the new APX instructions,
| which include a bunch of instructions that duplicate existing
| ones but don't set any flags. The only plausible reason to
| add these is performance.
| john-h-k wrote:
| This isn't just due to the actual dependencies of flag
| instructions at the hardware level (although that's likely a
| factor); it also majorly affects code layout. On Arm64, for
| example, you can make a comparison, do other operations,
| and then consume the result of that comparison afterwards,
| which is excellent for the pipeline and OoO engine.
| However, because most instructions on x86_64 write flags,
| you can't do this, and so you are forced to cram
| `jcc`/`setcc` instructions _right_ after the comparison,
| which is less friendly to compilers and the OoO engine.
| dzaima wrote:
| OoO should actually be the case where that doesn't matter,
| I'd think - the CPU can, well, execute the instructions
| not in the order they're in the binary; it's in-order
| implementations where that matters more.
|
| And with compare & jump being adjacent they can be fused
| together into one uop, which Intel, AMD, and Apple
| Silicon all do.
| john-h-k wrote:
| note: I've since learnt that port B is just port 11 in all the
| Intel docs; uops.info just hexifies the port numbers to keep
| them single-char
| superjan wrote:
| You are right: it can be done with the same ALU, for sure. But
| the data dependency on the carry flag makes it a really
| different instruction from the point of view of the CPU: three
| data dependencies instead of two. For the CPU it is beneficial
| to treat the instructions differently.
| animal531 wrote:
| CPUs are really funny and interesting things. We programmers
| work with them daily and make so many assumptions about them,
| as well as about the whole code chain: the compiler, runtimes,
| how code behaves when it comes to loops, methods, etc., you
| name it.
|
| I've been working on my own Entity Component System in C# and
| basically had to start from the ground up and test every
| assumption possible. There have only really been a few
| instances where my gut was correct, more often than not there
| are so many surprising gotchas hidden everywhere.
| yusina wrote:
| It's because they are providing abstractions which we/the
| compilers use, but just doing that would be too slow, so they
| implement optimizations, but those are based on certain
| assumptions, so then the users adjust what they do to match
| those assumptions well, so the optimizations have now leaked
| into the API, and after many rounds of doing this for
| decades, you end up with this terrible mess we are in.
| foota wrote:
| Would it be legal for a C(++?) compiler to implement this
| optimization?
| nine_k wrote:
| Does C++ have native support for uint256?
| Arnavion wrote:
| With C, it is _BitInt(256) if the compiler supports it. The
| upper limit of _BitInt is implementation-defined though, so
| 256 is not guaranteed to be supported. E.g. clang on RV64 only
| supports up to 128, but does support 256 on x86_64. gcc seems
| to not support _BitInt on RV64 at all, but does support 256
| on x86_64.
|
| With C++ the existence of such "extended integer" types is
| implementation-defined. clang at least supports the same
| _BitInt construct for C++ too. gcc seems to not support it.
|
| So, for the 256 case on x86_64, both clang and gcc seem to
| only generate the simple adc ripple version:
| https://gcc.godbolt.org/z/nxoEda3q5
| https://gcc.godbolt.org/z/bYf4bor3f
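|
| A minimal example of the construct mentioned above (assumes a
| compiler and target where _BitInt(256) is available):
|
|     typedef unsigned _BitInt(256) u256;
|
|     u256 add256(u256 a, u256 b) { return a + b; }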
| addaon wrote:
| Yes, it complies with the as-if rule; there's no observable
| difference in behavior. This would apply as well for supporting
| 64 bit additions within a loop on 32- or 16-bit architectures,
| for example.
| rollcat wrote:
| An unexpected optimisation can introduce a side channel (most
| commonly timing). This one would be safe, but "how do you tell
| a compiler which ones [not] to use" is a whole topic by itself.
| Denvercoder9 wrote:
| The C++ standard doesn't forbid introducing side channels, so
| the answer to the question is yes.
| rollcat wrote:
| With all the UB, I wonder how we ever managed to write any
| secure or safety-critical code at all.
| wat10000 wrote:
| In C++? We pretty much did not.
| brucehoult wrote:
| Someone working entirely on x86_64 very nicely demonstrates that
| RISC-V is not wrong to omit the carry flag.
| brucehoult wrote:
| Also, there is another way to do this while keeping 64 bit
| limbs. All variables uint64_t.
|
|     s0 += a0; s1 += a1; s2 += a2; s3 += a3;
|     c0 = s0 < a0;   // RISC-V `sltu`
|     c1 = s1 < a1;
|     c2 = s2 < a2;
|     if (s1 == -1) goto propagate0; // executes 1 time in 18,446,744,073,709,551,616
|   check_s2:
|     if (s2 == -1) goto propagate1; // ditto
|   add_carries:
|     s1 += c0; s2 += c1; s3 += c2;
|     goto done;
|   propagate0:
|     c1 = c0;
|     goto check_s2;
|   propagate1:
|     c2 = c1;
|     goto add_carries;
|   done:
|
| The key insight here is that unless the sum at a particular
| limb position is all 1s the carry out from that position DOES
| NOT DEPEND on the carry in to that limb position, but only on
| whether the original add in that position produces a carry. If
| the sum is all 1s then the carry out is the same as the carry
| in.
|
| If you express this with a conditional branch which is
| overwhelmingly predicted as not taken then the code should
| execute each block of instructions entirely in parallel,
| provided that multiple conditional branches can be predicted as
| not-taken in the same clock cycle.
|
| One time in 2^64 it will execute very slowly.
|
| With 4 limb numbers on a 4-wide machine this doesn't offer an
| advantage over `adc` as there are also 4 code blocks. But on,
| say, an 8-wide machine with 8 limb numbers you're really
| starting to gain.
|
| It's probably not going to help on current x86_64, but might
| well do on Apple's M* series, where even the M1 is 8-wide,
| though it might be tricky to work around the Arm ISA.
|
| When the 8-wide RISC-V Ascalon processor from Tenstorrent hits
| hopefully late this year or early 2026 we will really see. And
| others such as Ventana, Rivos, XiangShan.
|
| This will work even better in a wide SIMD, if you have a fast
| 1-lane shift (called slideup on RISC-V).
| phkahler wrote:
| I think you want to write:
|
|     if (s1 == -1) c1 = c0;
|     if (s2 == -1) c2 = c1;
|
| These can become conditional moves on x86. I've often thought
| RISC-V should have implemented an IF instruction instead of
| compare and branch. IF would cause the next instruction to be
| executed conditionally while not needing a flag register at
| the ISA level. They could have required only branch and jump
| to be conditional, but it turns out conditional mov, load,
| and store are all very useful in real code.
| brucehoult wrote:
| The problem is that, as far as I know, a conditional move
| is going to introduce a data dependency from c0 to c1 to c2
| that is the exact thing we are trying to get rid of. The
| cmov is a constant time instruction, not a speculated
| instruction like a conditional branch.
|
| The entire point of what I did is that the two conditional
| branches will be predicted not taken, so the CPU will
| 99.9999999999999999946% of the time not even see the `c1 =
| c0` and `c2 = c1` instructions that introduce the
| sequential dependencies.
| IshKebab wrote:
| That sounds like it would be quite a pain to implement and
| program. E.g. what happens if there's an interrupt between
| the IF and the following instruction? You need to add a CSR
| to read/write the conditional state, similar to the vector
| control CSRs (vstart etc.). Hard to see how that extra
| complexity would be worth it.
|
| Modern branch predictors are very good and most branches
| are very predictable.
| less_less wrote:
| Neat, but if you're using this in cryptographic code (one of
| the main consumers of bignums), keep in mind that secret data
| reaching branches is usually a side-channel risk. Sure, it's
| only 1 time in 2^64 on _random_ data, but if you're
| depending on that, then you have to consider whether an
| attacker can choose data that will make it happen more often.
|
| If you can substitute a cmov without control flow then it's
| probably safer, e.g. c1 |= c0 & seq(s1,-1) or so, so long as
| you can make sure the compiler won't turn it into a branch.
|
| It does add a data dependency though ...
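|
| A branch-free sketch of that, where ct_eq is a hypothetical
| constant-time equality test (all-ones if equal, else zero):
|
|     #include <stdint.h>
|
|     static uint64_t ct_eq(uint64_t x, uint64_t y) {
|         uint64_t d = x ^ y;
|         /* (d | -d) >> 63 is 1 iff d != 0; subtracting 1
|            gives all-ones iff x == y. */
|         return ((d | (0 - d)) >> 63) - 1;
|     }
|
|     /* c1 |= c0 & ct_eq(s1, ~0ULL);  -- no branch on secret data */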
| NooneAtAll3 wrote:
| ha, I'm not the only one to think "so what was all the risc5 gmp
| fuss about, if the carry flag is slow anyway?"
| brucehoult wrote:
| Right.
|
| Even at that time in 2021 I argued that serialising through a
| carry flag is limiting on wide machines, but there was very
| little RISC-V hardware available at the time and also GMP was
| not yet ported to RISC-V.
|
| That has changed a bit now, and almost two months ago I tried
| the GMP project's own gmpbench on a few RISC-V boards.
|
| I found that, when comparing similar uarches at similar clock
| speeds, the dual-issue in-order SiFive U74 is very comparable
| to Arm's A53, and the small 3-wide OoO SiFive P550 is
| significantly better than Arm's A72.
|
| And that's not even using the kind of technique discussed in
| this post, but the full multi-instruction carry flag
| emulation criticised by Granlund.
|
| https://www.reddit.com/r/RISCV/comments/1jsnbdr/gnu_mp_bignu.
| ..
|
| It's going to be very interesting when the 8-wide OoO RISC-V
| cores come out, probably starting with Tenstorrent's Ascalon
| core which they expect to tape out in Q3 and they have said
| they want to get into as many hands as possible to accelerate
| RISC-V development, including in laptops, not only in servers
| or the like.
| pjc50 wrote:
| This is all downstream of C omitting the carry flag, which
| means in practice it's very rarely used for the purpose of a
| carry.
| immibis wrote:
| C does, however, now have _BitInt
| phkahler wrote:
| Ugh, what a terrible thing to add to C.
| adrian_b wrote:
| There remain many frequently-encountered cases when carry-save
| addition is worse than addition using add-with-carry.
|
| Neither of the 2 multi-word addition algorithms can replace the
| other, both have their use cases, so ADC/SBB instructions are
| included in any decent ISA, because the cost of adding them is
| negligible. A dedicated flag register is not necessary, some
| ISAs store the carry/borrow flags in general-purpose registers,
| when used.
|
| Not having carry is by far not the worst feature of RISC-V.
| Much worse is not having an integer overflow flag, because the
| software workaround for detecting integer overflow, which is
| mandatory for any program that claims to be written in a safe
| way, lowers the attainable performance much more than the
| workarounds for not having carry.
| phkahler wrote:
| >> because the software workaround for detecting integer
| overflow, which is mandatory for any program that claims to
| be written in a safe way, lowers the attainable performance
| much more than the workarounds for not having carry
|
| That's absurd. A better way is to ensure that your algorithms
| don't overflow. Detecting an overflow just means your code
| has to STOP which is usually not safe. It'd be insane to have
| conditionally executed code trying to figure out how to
| handle an overflow anywhere in code. Another problem is that
| flags are not even accessible from any language higher level
| than ASM. From a C perspective there are no flags.
| dzaima wrote:
| While there is no direct access to flags in standard C, you
| can nevertheless on gcc and clang compile with -ftrapv and
| get your signed integer arithmetic overflow-checked. Or
| you can use __builtin_add_overflow & co and get access to
| the overflow flags that way. Rust debug builds trap on
| signed and unsigned integer overflow, and you can make
| release builds do so too.
|
| While it'd be nice to have a formal proof that every single
| `a+b`, `a-b`, `a*b` in every codebase doesn't overflow, I'm
| sure you understand that that is rather impractical. (and
| really, it'd be nice! I've thought about having some
| compile-time-bounded-size integers where each addition
| increases the size, but multiplication is much less
| suitable for that, and it also means you can't have a loop
| adding to an accumulator. It's a rather non-trivial problem
| really - you might think that it'd be fine to have a loop
| over a list of objects and sum their sizes, but that can
| relatively easily overflow if the list references the same
| massive object many times, so you can't even really abstract
| that)
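|
| For reference, a small example of the builtin mentioned above
| (gcc/clang):
|
|     #include <stdint.h>
|
|     /* 1 on success, 0 if the addition overflowed. */
|     static int checked_add(int64_t a, int64_t b, int64_t *out) {
|         return !__builtin_add_overflow(a, b, out);
|     }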
| nine_k wrote:
| The main takeaway: doing more operations may be faster if they
| are largely independent, and thus can execute in parallel. Doing
| fewer operations may be slower if they are forced to execute
| serially due to data dependency.
|
| This idea has wider applicability than operations on long
| integers.
| repelsteeltje wrote:
| Yes. Another approach would be to use regular 64 bit chunks and
| speculatively execute each add _with_ and _without_ carry in
| parallel. Then select the correct variant based on the carry
| result of the less significant addition.
|
| With double the amount of additions this allows for log(bits)
| propagation time (versus linear)
| dgoldstein0 wrote:
| There's not just "result with carry" and "result without
| carry" but rather one variant of that per word of the input
|
| ... Which likely isn't that bad to code up.
| brucehoult wrote:
| Or see https://news.ycombinator.com/item?id=44133169
| volemo wrote:
| Wouldn't that produce 2^n possible results to choose from,
| where n is the number of chunks? That seems like _a lot_ of
| additional (he-he) instructions executed.
| repelsteeltje wrote:
| Nope. Just 2n: each chunk pair is added once without carry,
| and once with carry=1.
|
| For as long as radix=2, you either have a carry or you
| don't.
| mananaysiempre wrote:
| For a single addition, the radix is irrelevant, the carry
| is always zero or one: (r-1) + (r-1) = 2r - 2 < 2r.
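|
| A scalar sketch of the carry-select idea upthread: per chunk,
| both candidate sums and both candidate carry-outs are independent
| of every other chunk; only the final selects form a short chain.
| Illustrative only.
|
|     #include <stdint.h>
|
|     typedef struct { uint64_t w[4]; } u256;   /* low to high */
|
|     static u256 add_select(u256 a, u256 b) {
|         u256 r;
|         uint64_t carry = 0;
|         for (int i = 0; i < 4; i++) {
|             uint64_t s0 = a.w[i] + b.w[i];  /* sum, carry-in = 0 */
|             uint64_t s1 = s0 + 1;           /* sum, carry-in = 1 */
|             uint64_t c0 = s0 < a.w[i];      /* carry-out if in = 0 */
|             uint64_t c1 = c0 | (s1 == 0);   /* carry-out if in = 1 */
|             r.w[i] = carry ? s1 : s0;
|             carry  = carry ? c1 : c0;
|         }
|         return r;
|     }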
| rollcat wrote:
| This rule scales up all the way to multi-node supercomputers /
| cloud. The overhead is negligible when you can employ 10,000
| cores.
| hinkley wrote:
| Amdahl says no.
| noduerme wrote:
| Abstractly, when any parallel system scales up large enough
| without cross checking or waiting between "threads", the cost
| of de-duplicating and merging the output will probably
| outweigh the advantage of producing new results in tandem. I
| think. That's just a hypothesis, but feels correct. With
| stuff like a-life distributed over lots of servers all
| converging on evolutionary answers to a problem, it's the
| collation and analysis layer that's most expensive and slow.
| Sharing more frequently / allowing more reliance on central
| historical truth slows each one down but avoids duplication
| and redundancy. I guess where that point is depends on what
| problem you're trying to solve.
| credit_guy wrote:
| Actually the overhead crushes you when you employ 10000
| cores. If the overhead of a process is 10% and the parallel
| part is 90%, then 2 cores will result in a run time of 55% =
| 10% + 90%/2 of the original time. And 10 cores will get you
| to 19%. And 100 cores to 10.9%. If you then buy 9900 more
| cores to bring it to a total of 10000, you just reduced the
| runtime from 10.9% to 10.009%. In other words, you increased
| your bill by a factor of 100 to reduce your run time by
| almost nothing.
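|
| The numbers above are just Amdahl's law with a 10% serial
| fraction; a quick check:
|
|     #include <stdio.h>
|
|     int main(void) {
|         double s = 0.10;                /* serial fraction */
|         int n[] = {2, 10, 100, 10000};
|         for (int i = 0; i < 4; i++)
|             printf("%5d cores: %.3f%% of original runtime\n",
|                    n[i], 100.0 * (s + (1.0 - s) / n[i]));
|         return 0;
|     }
|     /* 2: 55.000%, 10: 19.000%, 100: 10.900%, 10000: 10.009% */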
| volemo wrote:
| You two are talking about different kinds of overhead
| though.
| CamperBob2 wrote:
| Yep. Company called NVidia has been looking into that general
| idea. They seem to be getting some promising results so far, in
| a couple of different areas.
| zahlman wrote:
| What I didn't get about this: the technique shown seems to be
| about making sure that the ripple carry only happens once
| instead of N-1 times while adding N values. The carry operation
| is more complex, but this allows the actual addition to be
| parallelized.
|
| But - you still have to split the input numbers into sets of 5
| registers in the first place, right? So doesn't _that_ need to
| be parallelizable somehow as well in order for this to be a net
| win?
| adgjlsfhk1 wrote:
| That is parallelizable. Each of the 5 registers has no
| dependence on the value of the others.
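|
| Concretely (a sketch, assuming the 256-bit input arrives as
| four 64-bit words): each limb is a shift/mask of at most two
| input words, so all five can be computed independently.
|
|     #include <stdint.h>
|
|     static void split_r51(const uint64_t w[4], uint64_t limb[5]) {
|         const uint64_t M = (1ULL << 51) - 1;
|         limb[0] =  w[0]                      & M;  /* bits   0..50  */
|         limb[1] = (w[0] >> 51 | w[1] << 13)  & M;  /* bits  51..101 */
|         limb[2] = (w[1] >> 38 | w[2] << 26)  & M;  /* bits 102..152 */
|         limb[3] = (w[2] >> 25 | w[3] << 39)  & M;  /* bits 153..203 */
|         limb[4] =  w[3] >> 12;                     /* bits 204..255 */
|     }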
| dang wrote:
| Related. Others?
|
| _The radix 2^51 trick_ -
| https://news.ycombinator.com/item?id=33706153 - Nov 2022 (6
| comments)
|
| _The radix 2^51 trick (2017)_ -
| https://news.ycombinator.com/item?id=23351007 - May 2020 (83
| comments)
| eru wrote:
| The 'radix trick' also works for data structures.
|
| Okasaki's book 'Purely Functional Data Structures' has some nice
| examples.
| alwahi wrote:
| How would you do this for large multiplications instead?
| nickcw wrote:
| You can do large multiplications with a convolution and do the
| carry afterwards.
|
| A convolution can be done with FFT, pointwise multiply, inverse
| FFT, which is O(n log n) rather than O(n^2) for traditional
| multiplication.
|
| The bits in each limb have to be quite small though, as there
| are lots of carries to absorb; how small depends on how many
| digits you have and how accurate your floating point is.
|
| Some kind of FFT is how all large multiplications are done.
|
| I had a lot of fun learning about this in relation to GIMPS
| (the Great Internet Mersenne Prime Search) where you use a
| variant FFT called a DWT over an irrational base, which gives
| you a free mod 2^n - 1, which is what you want for primality
| testing Mersenne prime candidates using the Lucas-Lehmer test.
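|
| A minimal non-FFT sketch of "convolve, then carry": with small
| enough limbs (radix 2^16 here) the 64-bit column sums cannot
| overflow, so all partial products are independent and a single
| carry pass finishes the job. An FFT computes the same column
| sums in O(n log n).
|
|     #include <stdint.h>
|
|     static void mul_conv(const uint16_t a[4], const uint16_t b[4],
|                          uint16_t r[8]) {
|         uint64_t col[8] = {0};
|         for (int i = 0; i < 4; i++)
|             for (int j = 0; j < 4; j++)
|                 col[i + j] += (uint64_t)a[i] * b[j];
|         uint64_t carry = 0;
|         for (int k = 0; k < 8; k++) {   /* deferred carry pass */
|             uint64_t t = col[k] + carry;
|             r[k] = (uint16_t)t;
|             carry = t >> 16;
|         }
|     }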
| phkahler wrote:
| GIMPS is also interesting since it doesn't need to do 2
| operand multiplication. It only needs squaring.
| e4m2 wrote:
| On modern enough x86 CPUs (Intel Broadwell, AMD Ryzen) you could
| also use ADX [1] which may be faster nowadays in situations where
| radix 2^51 representation traditionally had an edge (e.g.
| Curve25519).
|
| [1] https://en.wikipedia.org/wiki/Intel_ADX
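|
| A sketch using the ADX intrinsic (requires <immintrin.h> and a
| Broadwell+/Ryzen target; whether the compiler actually emits
| adcx/adox, e.g. to interleave two carry chains in a multiply,
| is up to it):
|
|     #include <immintrin.h>
|     #include <stdint.h>
|
|     static void add256_adx(uint64_t r[4], const uint64_t a[4],
|                            const uint64_t b[4]) {
|         unsigned char c = 0;
|         for (int i = 0; i < 4; i++)
|             c = _addcarryx_u64(c, a[i], b[i],
|                                (unsigned long long *)&r[i]);
|     }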
| ashdnazg wrote:
| With AVX512 (and to a lesser extent with AVX2) one can implement
| 256 bit addition pretty efficiently with the additional benefit
| of fitting more numbers in registers.
|
| It looks more or less like this:
|
|     __m256i s = _mm256_add_epi64(a, b);              // 4 lane-wise adds
|     const __m256i all_ones = _mm256_set1_epi64x(~0);
|     int g = _mm256_cmpgt_epu64_mask(a, s);           // lanes generating a carry
|     int p = _mm256_cmpeq_epu64_mask(s, all_ones);    // lanes propagating one
|     int carries = ((g << 1) + p) ^ p;                // resolve the chain in scalar
|     __m256i ret = _mm256_mask_sub_epi64(s, carries, s, all_ones);  // +1 where carried
|
| The throughput even seems to be better:
| https://godbolt.org/z/e7zETe8xY
|
| It's trivial to change this to do 512 bit addition where the
| improvement will be even more significant.
| amitprasad wrote:
| Note that, especially on certain Intel architectures, using
| AVX512 instructions _at all_ can result in the whole processor
| downclocking, and thus resulting in inconsistent /
| slower overall performance.
|
| https://stackoverflow.com/questions/56852812/simd-instructio...
| adgjlsfhk1 wrote:
| > using AVX512 instructions _at all_
|
| This isn't correct. AVX512 provides a bunch of extra
| instructions, zmm (512 bit) registers, and an extra 16 (for a
| total of 32) vector registers. The downclocking only happens
| if you use 512 bit registers (not just avx512 instructions).
| The difference here matters a bunch since there are really
| useful instructions (e.g. 64 bit integer multiply) added by
| avx512 that are pure upside.
|
| Also none of this is an issue on Zen4 or Zen5, since they use
| much more sensible downclocking where they will only downclock
| if you've used enough instructions in a row to start spiking
| power/temp.
| amitprasad wrote:
| Ah yes, you're completely correct :)
|
| General idea was just to highlight some of the dangers of
| vector registers. I believe the same is true of ymm (256)
| to a lesser extent.
| t0010q wrote:
| It's funny that carries don't just make addition difficult to
| parallelize. Binary addition without carry is XOR. XOR subset sum
| - find a subset whose XOR gives the desired target - is in P, but
| proper subset sum with carry is NP-complete.
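|
| The P part is just linear algebra over GF(2); a sketch with
| 64-bit words (build a basis, then try to reduce the target):
|
|     #include <stdint.h>
|
|     static int xor_subset_exists(const uint64_t *v, int n,
|                                  uint64_t target) {
|         uint64_t basis[64] = {0};
|         for (int i = 0; i < n; i++) {     /* Gaussian elimination */
|             uint64_t x = v[i];
|             for (int b = 63; b >= 0 && x; b--) {
|                 if (!((x >> b) & 1)) continue;
|                 if (!basis[b]) { basis[b] = x; break; }
|                 x ^= basis[b];
|             }
|         }
|         for (int b = 63; b >= 0; b--)     /* reduce the target */
|             if ((target >> b) & 1) {
|                 if (!basis[b]) return 0;
|                 target ^= basis[b];
|             }
|         return 1;
|     }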
| hdjrudni wrote:
| I wish I came across this article a couple months ago.
|
| I was trying to encode and decode some buffers into an arbitrary
| base, and I eventually came to the conclusion (after far too
| long) that a carry could ripple all the way down the buffer,
| which dramatically slows down the algorithm.
|
| Actually, the eventual solution I came up with might have some
| stuff in common with this trick too. I did eventually chunk up the
| buffer leaving some unused headroom to 'handle carries'. Not
| exactly though, I just have some wasted bits which uses a tiny
| bit more storage or network bandwidth but saves on compute. I
| wonder if I could instead pool up the carries like this and
| 'resolve' it in a later step. Have my cake and eat it too?
| Wishful thinking.
___________________________________________________________________
(page generated 2025-05-30 23:00 UTC)