[HN Gopher] The Weird Concept of Branchless Programming
___________________________________________________________________
The Weird Concept of Branchless Programming
Author : judicious
Score : 106 points
Date : 2025-09-28 16:40 UTC (6 hours ago)
(HTM) web link (sanixdk.xyz)
(TXT) w3m dump (sanixdk.xyz)
| sfilmeyer wrote:
| I enjoyed reading the article, but I'm pretty thrown by the
| benchmarks and conclusion. All of the times are reported to a
| single digit of precision, but then the summary is claiming that
| one function shows an improvement while the other two are
| described as negligible. When all the numbers presented are
| "~5ms" or "~6ms", it doesn't leave me confident that small
| changes to the benchmarking might have substantially changed that
| conclusion.
| Joel_Mckay wrote:
| In general, modern compilers will often unroll or inline
| functions without people even noticing. This often helps with
| cache level state localization and parallelism.
|
| Most code should focus on readability, then profile for busy
| areas under use, and finally refactor the busy areas through
| hand optimization or register hints as required.
|
| If one creates something that looks suspect (inline Assembly
| macro), a peer or llvm build will come along and ruin it later
| for sure. Have a great day =3
| hinkley wrote:
| Doesn't it also help with branch prediction since the
| unrolled loop can use different statistics with each copy?
| Joel_Mckay wrote:
| Non-overlapping sub-problems may be safely parallelized,
| and executed out-of-order.
|
| In some architectures, both of the branch code motions are
| executed in parallel, and one is simply tossed after
| dependent operations finish. We can't be sure exactly how
| branch predictors and prefetchers are implemented, as the
| details fall under manufacturer NDA. =3
| gizmo686 wrote:
| Yeah. When your timing results are a single digit multiple of
| your timing precision, that is a good indication you either
| need a longer test, or a more precise clock.
|
| At a 5ms baseline with millisecond precision, the smallest
| improvement you can measure is 20%. And you cannot distinguish
| a 20% speedup from a 20% slowdown that happened to get lucky
| with clock ticks.
|
| For what it is worth, I ran the provided test code on my
| machine with a 100x increase in iterations and got the
| following:
|     == Benchmarking ABS ==
|     ABS (branch):     0.260 sec
|     ABS (branchless): 0.264 sec
|     == Benchmarking CLAMP ==
|     CLAMP (branch):     0.332 sec
|     CLAMP (branchless): 0.538 sec
|     == Benchmarking PARTITION ==
|     PARTITION (branch):     0.043 sec
|     PARTITION (branchless): 0.091 sec
|
| Which is not exactly encouraging (gcc 13.3.0, -ffast-math
| -march=native). I did not use the -fomit-this-entire-function
| flag, which my compiler does not understand.
|
| I had to drop down to -O0 to see branchless be faster in any
| case:
|     == Benchmarking ABS ==
|     ABS (branch):     0.743 sec
|     ABS (branchless): 0.948 sec
|     == Benchmarking CLAMP ==
|     CLAMP (branch):     4.275 sec
|     CLAMP (branchless): 1.429 sec
|     == Benchmarking PARTITION ==
|     PARTITION (branch):     0.156 sec
|     PARTITION (branchless): 0.164 sec
| Roxxik wrote:
| I also tried myself, on different array sizes, with more
| iterations. The branchy version is not strictly worse.
|
| https://gist.github.com/Stefan-JLU/3925c6a73836ce841860b55c8...
| MontyCarloHall wrote:
| This great video [0] demonstrates how CPU performance has only
| increased 1.5-2x over the past 15(!) years when executing
| extremely branchy code. Really shows you just how deep modern CPU
| pipelines have become.
|
| The video also showcases a much more impressive branchless
| speedup: computing CRC checksums. Doing this naively with an if-
| statement for each bit is ~10x slower than doing it branchless
| with a single bitwise operation for each bit. The author of the
| article should consider showcasing this too, since it's a lot
| more impressive than the measly 1.2x speedup highlighted in the
| article. I assume the minimal/nonexistent speedups in the article
| are due to modern CPU branch prediction being quite good. But
| branch predictors inherently fail miserably at CRC because the
| conditional is on whether the input bit is 1 or 0, which is
| essentially random.
|
| [0] https://www.youtube.com/watch?v=m7PVZixO35c
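For reference, the branchless trick for bit-at-a-time CRC replaces the per-bit if with a mask derived from the low bit. A sketch of the standard reflected CRC-32 (polynomial 0xEDB88320) under that scheme; this illustrates the general technique, not necessarily the video's exact code:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 (reflected form). The per-bit conditional
 * "if (crc & 1) crc = (crc >> 1) ^ POLY; else crc >>= 1;" becomes a
 * mask: (0u - (crc & 1u)) is all-ones or all-zeros, so the XOR with
 * the polynomial is applied unconditionally every iteration. */
uint32_t crc32_branchless(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

On the standard check input "123456789" this yields 0xCBF43926. The instruction trace is identical regardless of the input bits, which is exactly what defeats the branch predictor's weakness here.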
| SynasterBeiter wrote:
| The linked video doesn't take into account power consumption of
| these CPUs. He seems to be comparing laptop CPUs, NUC CPUs and
| desktop CPUs. If you compare a 100W CPU and a 30W CPU that's a
| couple of years newer, you shouldn't be surprised there isn't
| much of a difference in performance.
| MontyCarloHall wrote:
| Even if you exclude the three mobile CPUs in the charts (the
| 2012 i5, the 2015 i7, and the 2023 i9 NUC), the results still
| hold.
|
| >If you compare a 100W CPU and a 30W CPU that's a couple of
| years newer, you shouldn't be surprised there isn't much of a
| difference in performance
|
| Sure, but this is over a lot more than a couple years. I'd
| expect a 2023 mobile i9 to be considerably more than twice as
| fast as a 2007 desktop Core 2 Duo.
| hinkley wrote:
| Branchless is also useful for cryptographic transforms as it
| frustrates timing attacks. And that's a situation where it only
| needs to be _relatively fast_ compared to the branching
| alternative because we are trying to improve security while
| limiting the overhead of doing so.
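A typical constant-time building block looks like this sketch: a comparison is folded into an all-ones/all-zeros mask, and the mask selects between two values with no data-dependent branch (the function names here are illustrative, not from any particular library):

```c
#include <stdint.h>

/* All-ones if x == 0, else all-zeros, with no data-dependent branch.
 * (~x & (x - 1)) has its top bit set only when x == 0. */
static uint32_t ct_mask_is_zero(uint32_t x) {
    return (uint32_t)0 - ((~x & (x - 1)) >> 31);
}

/* Select a when x == y, else b -- same instruction trace either way. */
uint32_t ct_select_eq(uint32_t x, uint32_t y, uint32_t a, uint32_t b) {
    uint32_t mask = ct_mask_is_zero(x ^ y);
    return (mask & a) | (~mask & b);
}
```

A caveat: optimizing compilers are free to reintroduce branches, so serious crypto code audits the emitted assembly rather than trusting the C source.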
| rmnclmnt wrote:
| Great article, triggers some memories.
|
| When you get to think about branchless programming, especially
| for SIMD optimizations in the real world, you always learn a lot
| and it's as if you get a +1 level on your algorithmic skills. The
| hardest part then is making sure the tricks are clearly laid
| out so that someone else can pick them up next time.
| Legend2440 wrote:
| Article doesn't mention this, but I'd consider neural networks a
| form of branchless programming. It's all a bunch of multiply and
| threshold operations.
| abyesilyurt wrote:
| Thresholding requires branching no?
| WJW wrote:
| No, you can do that without branching. See
| https://en.algorithmica.org/hpc/pipelining/branchless/#predi...
| for example.
| Legend2440 wrote:
| No, it's just a math operation and can be applied to any
| amount of data in parallel.
| bqmjjx0kac wrote:
| As demonstrated in the article, you can compute clamp(x, min,
| max) with straight-line code.
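A straight-line clamp can be built from branchless min/max, as in this sketch (note that `a < b` typically compiles to a flag-setting instruction such as setcc, not a branch):

```c
/* Branchless min/max: -(a < b) is all-ones when a < b, else zero,
 * so the XOR-mask either swaps in the other operand or does nothing. */
static int bmin(int a, int b) { return b ^ ((a ^ b) & -(a < b)); }
static int bmax(int a, int b) { return a ^ ((a ^ b) & -(a < b)); }

/* clamp(x, lo, hi) as pure straight-line code. */
int clamp_branchless(int x, int lo, int hi) {
    return bmin(bmax(x, lo), hi);
}
```

For example, `clamp_branchless(-3, 0, 10)` gives 0 and `clamp_branchless(42, 0, 10)` gives 10, with the same instruction sequence executed in every case.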
| david-gpu wrote:
| _> It's all a bunch of multiply and threshold operations._
|
| Real-world high-performance matrix multiplication functions do
| contain branches internally, even on GPUs. If you are ever
| curious about what that looks like, NVidia maintains an open-
| source library called CUTLASS.
| mshockwave wrote:
| The article is easy to follow, but I think the author missed the
| point: branchless programming (a subset of the better-known
| constant-time programming) is almost exclusively used in
| cryptography nowadays. As shown by the benchmarks in the
| article, modern branch predictors have easily achieved over 95%
| if not 99% accuracy for a decade now.
| styfle wrote:
| This same blog posted yesterday and got flagged with "How I
| accidently created the fastest CSV parser ever made"
|
| https://news.ycombinator.com/item?id=45400009
| rao-v wrote:
| I'm amused to see a for loop in a function that is purportedly
| branchless
|
| (not a critique)
| adrianmonk wrote:
| I've always wondered if any CPUs have tried to reduce the branch
| penalty by speculatively executing both ways at once in parallel.
| You'd have two of everything (two pipelines, two ALUs, two sets
| of registers, etc.) and when you hit a conditional branch,
| instead of guessing which way to go, you'd essentially fork.
|
| Obviously that requires a lot of extra transistors and you are
| doing computation that will be thrown away, so it's not free in
| terms of space or power/heat/energy. But perhaps it could handle
| cases that other approaches can't.
|
| Even more of a wild idea is to pair up two cores and have them
| work together this way. When you have a core that would have been
| idle anyway, it can shadow an active core and be its doppelganger
| that takes the other branch. You'd need to have very fast
| communication between cores so the shadow core can spring into
| action instantly when you hit a branch.
|
| My gut instinct is it's not worth it overall, but I'm curious
| whether it's been tried in the real world.
| hawk_ wrote:
| What has worked out very well in practice is hyper-threading.
| So you take instructions from two threads and if one of them is
| waiting on a branch the units of the CPU don't go unused.
| terryf wrote:
| Yes, this has been done for a while now, speculative execution
| + register renaming is how this happens.
| https://en.wikipedia.org/wiki/Register_renaming
| o11c wrote:
| Doesn't that only work on one side of the branch though?
| umanwizard wrote:
| No, what's been done for a while is speculatively executing
| one predicted path, not both paths in parallel.
| thegreatwhale8 wrote:
| It's what happens and it gave us a really big issue a few years
| ago
| https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...
| umanwizard wrote:
| No, that is because of speculatively executing one path, not
| both paths in parallel.
| mshockwave wrote:
| yes, it has been done for at least a decade if not more
|
| > Even more of a wild idea is to pair up two cores and have
| them work together this way
|
| I don't think that'll be profitable, because...
|
| > When you have a core that would have been idle anyway
|
| ...you'll just schedule in another process. A modern OS rarely
| runs short of tasks to run.
| jasonwatkinspdx wrote:
| Yes, it's been looked at. If you wanna skim the research use
| "Eager Execution" and "Disjoint Eager Execution" as jumping off
| points.
|
| It doesn't require duplicating everything. You just need to add
| some additional bookkeeping of dependencies and what to retire
| vs kill at the end of the pipeline.
|
| In practice branch predictors are so good that speculating off
| the "spine" of most likely path just isn't worth it.
|
| In fact there were a lot of interesting microarchitectural
| ideas from the late 90s to early 00s that just ended up being
| moot because the combination of OoO speculation, branch
| predictors, and big caches proved so effective.
| gtirloni wrote:
| I think I'm getting too sensitive to AI-assisted articles. So
| many whimsical terms to make something "fun".
| FooBarBizBazz wrote:
| I used to do stuff like this (ok, not half as smart), but stopped
| around 2013 or so, as the distinction between "implementation
| defined" behavior (ok) and "undefined" behavior (not ok) started
| to matter and bite.
|
| After thinking through this carefully, though, I do not see UB
| (except for signed overflow in a corner case):
|
| Step 1, bit shift.
|
| I understand that, until C++20, _left_ shift of a signed int was
| UB. But this _right_ shift appears to be _implementation_
| defined, which is ok if documented in the code.
|
| Step 2: Add.
|
| Then, (x + mask) is defined behavior (a) for positive x, since
| then mask=0, and (b) for most negative x, since then mask=-1.
| However, if x is numeric_limits::lowest, then you get signed
| integer overflow, which is UB.
|
| Step 3, XOR.
|
| Then the final XOR doesn't have UB AFAICT. It wouldn't be UB as
| of C++20, when signed integers became two's complement
| officially. It might have been implementation defined before
| then, which would be almost as good for something that
| ubiquitous, but I'm not sure.
|
| Ok, so I think this does not involve UB except for an input of
| numeric_limits::lowest.
|
| Sound about right?
|
| To fix this, perhaps you would need to make that + an unsigned
| one?
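That fix can be written by doing the add (and the final XOR) in unsigned arithmetic, where wraparound is defined. A sketch in C; the right shift of a negative value is still implementation-defined in older standards (arithmetic shift on every mainstream compiler, and guaranteed as of C++20):

```c
#include <limits.h>

/* Branchless absolute value with no signed overflow: the add and the
 * XOR happen on unsigned values, where wraparound is well-defined.
 * Assumption: x >> 31-style shift is an arithmetic shift (true on all
 * mainstream compilers; implementation-defined pre-C++20).
 * Returning unsigned also makes abs(INT_MIN) representable. */
unsigned int abs_no_ub(int x) {
    unsigned int mask = (unsigned int)(x >> (sizeof(int) * CHAR_BIT - 1));
    return ((unsigned int)x + mask) ^ mask;
}
```

For x = INT_MIN this returns 2^31 as an unsigned value instead of overflowing, which is the one input the all-signed version cannot handle.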
|
| It bothers me how hard you need to think to do this language
| lawyering. C++ is a system of rules. Computers interpret rules.
| The language lawyer should be a piece of software. I get it, not
| everything can be done statically, so, fine, do it dynamically?
| UBSan comes closest in practice, but doesn't detect everything. I
| understand formally modeled versions of C and C++ have been
| developed commercially, but these are not open source so they
| effectively do not exist. It's a strange situation.
|
| Just the other day I considered writing something branchless and
| said, "nah", because of uncertainty around UB. How much
| performance is being left on the table around the world because
| of similar thinking, driven by the existence of UB?
|
| Maybe I was supposed to write OCaml or Pascal or Rust.
| vdupras wrote:
| In the part about "abs", there's an assembly breakdown:
|
| mov eax, edi
|
| sar eax, 31
|
| mov ecx, eax
|
| add edi, ecx
|
| xor eax, edi
|
| Has this been generated by a C compiler? If yes, it's a bit
| puzzling, because can't you remove "mov ecx, eax", replace "add
| edi, ecx" by "add edi, eax" and have the exact same result?
| userbinator wrote:
| If you look at compiler output, you will always see plenty of
| small stupidities.
| vdupras wrote:
| So why does conventional wisdom say that compilers will, in
| the vast majority of the time, outperform programmers doing
| assembly by hand? It seems contradictory to me.
| HarHarVeryFunny wrote:
| The 32-bit ARM instruction set made nearly all instructions
| conditional to avoid the need for branching, but AArch64 moved
| away from that since branch prediction is so good nowadays.
| zackmorris wrote:
| I need something like this for a switch() command (technically a
| list of arbitrary functions). Sort of like up to N branches in
| one step.
|
| The idea is to take some number of inputs A, B, C, ... and
| conceptually perform all of the possible functions
| simultaneously, then keep the one that's desired and throw the
| rest away. For any arbitrary logic. Ideally using fewer
| operations than all of that, but that's optional. Driven by one
| variable, it would look like:
|     // branch version
|     switch (var1) {
|         case 1: var4 = var2 + var3; break;
|         case 2: var4 = var2 - var3; break;
|         case 3: var4 = var2 * var3; break;
|         case 4: var4 = var2 / var3; break;
|         // ...
|         default: var4 = 0; // optional and arbitrary
|             break;
|     }
|     // branchless version
|     var4 = magic(var1, var2, var3);
|
| I don't know how to do this outside of programmable hardware like
| an FPGA. The problem is that it's extremely challenging to
| write/solve the formulas that map ordinary arithmetic and bitwise
| functions to the larger set of functions.
|
| So for now, I may just use a switch() command in a shader and let
| it figure out any reusable logic internally. I don't know if
| shaders have come far enough to allow that performantly, or if
| they end up just calculating everything and throwing away the
| paths not taken. Which would suggest that the max speed would be
| O(1/N) for the number of cases.
|
| Does anyone know? Maybe truth tables? Or a SAT solver? I'm also
| not sure how this would work for floating point, but that's
| optional too.
|
| Edit: I updated the cases to show how var1 controls the math
| performed on var2, var3 and var4.
| judicious wrote:
| I'm by no means an expert on this topic, but what about using
| predefined hash map of value to function call/closure unless
| that also yields branched programming underneath? Additionally,
| maybe you want to use macros of some kind to generate the
| "magic" function you're looking for, where it constructs a
| single branchless statement up to N terms, whether it's
| procedural macros in Rust or C macros.
| AnotherGoodName wrote:
| Think of a way you can make the number of each case 0 or 1 then
| you have a single sum of all possibilities where that 0 or 1
| multiplies the portion of the switch statement it represents
| (the 0 will mean not active the 1 includes it).
|
| Eg. If we had just 1 or 2 as options for the case statements
| note that test1 = xor(var1 & 0x1, var1>>1 & 0x1) & var1 will,
| with integer rounding, equal '1' if var1 is one or '0' if var1
| is 2,3 or 4. (If it goes to 5 you need another xor at the next
| highest bit).
|
| Likewise test2 = xor(var1 & 0x1, var1>>1 & 0x1) & var1<<1
| equals '0' if var1 is 1, 3 or 4 but it equals '1' if it's two.
|
| So (test1 * (value on var1 being 1) + (test2 * (value on var1
| being 2)) will get you a branchless version for the simple
| either 1 or 2 case. This will pretty much always be much slower
| than the branching version but some types of systems out there
| don't really do branching and are linear. This type of zeroing
| trick is really common on matrix multipliers.
|
| Essentially you're creating binary logic gates that give 0 or
| 1 for the exact case you want and blindly multiplying that by
| the value that case would have produced.
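That zeroing idea can be sketched more directly with equality masks instead of hand-built XOR gates. In this sketch `==` typically lowers to a flag-setting instruction rather than a branch; the divide case is deliberately left out, since executing a division unconditionally can trap when var3 is zero:

```c
/* All-ones when a == b, else zero: (a == b) is 0 or 1, and negating
 * 1 gives the all-ones pattern -1. */
static int eq_mask(int a, int b) { return -(a == b); }

/* Branchless stand-in for the switch: every case is computed
 * unconditionally, then masks keep exactly one result (and yield 0
 * for any selector with no matching case, like the default branch). */
int magic(int var1, int var2, int var3) {
    int sum = var2 + var3;
    int dif = var2 - var3;
    int prd = var2 * var3;
    return (eq_mask(var1, 1) & sum)
         | (eq_mask(var1, 2) & dif)
         | (eq_mask(var1, 3) & prd);
}
```

So magic(1, 6, 3) gives 9, magic(2, 6, 3) gives 3, magic(3, 6, 3) gives 18, and any other selector gives 0. As noted above, this computes every case and is usually slower than a branch on a CPU, but it maps cleanly onto SIMD lanes and matrix hardware.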
| secondcoming wrote:
| Compilers would turn that into a jump table:
|
| https://godbolt.org/z/14h5djYe8
|
| Although this looks branchless if you don't care about divide-
| by-zero. The ALU might not be too happy:
|
| https://godbolt.org/z/9nqG3xMPY
| fwip wrote:
| Maybe I'm missing something obvious, but how about:
|     vars[1] = var2 + var3;
|     vars[2] = var2 - var3;
|     ...
|     vars[0] = 0;
|     var4 = vars[var1];
| Joker_vD wrote:
| Just so you know, the "return x < 0 ? -x : x;" compiles into
|     abs_branch:
|         mov   eax, edi
|         neg   eax
|         cmovs eax, edi
|         ret
|
| on x64, and into
|     abs_branch:
|         srai a5, a0, 31
|         xor  a0, a5, a0
|         sub  a0, a0, a5
|         ret
|
| on RISC-V if you use a C compiler with a half-decent codegen. And
| "branchy" clamp() translates into
|     clamp:
|         cmp    edi, edx
|         mov    eax, esi
|         cmovle edx, edi
|         cmp    edi, esi
|         cmovge eax, edx
|         ret
|
| Seriously, the automatic transformation between ?: and if-then-
| else (in both directions) is quite well studied by now. And if
| you try to benchmark difference between branching and branchless
| implementations, please make sure that the branches you expect
| are actually there in the compiler's output.
| Animats wrote:
| This is just cutesy on CPUs, but is a big part of GPU
| programming.
| Scene_Cast2 wrote:
| Another interesting take is how GPUs handle branching (latest
| NVidia GPUs in particular - see the register interconnect and how
| they swap threads without saving registers).
| nixpulvis wrote:
| Branchless programming tactics also avoid security issues around
| timing side channels leaking predicate bits, which is somewhat
| interesting.
| mwkaufma wrote:
| Is cmov branchless, or just branching by another name?
___________________________________________________________________
(page generated 2025-09-28 23:00 UTC)