[HN Gopher] The Weird Concept of Branchless Programming
       ___________________________________________________________________
        
       The Weird Concept of Branchless Programming
        
       Author : judicious
       Score  : 106 points
       Date   : 2025-09-28 16:40 UTC (6 hours ago)
        
 (HTM) web link (sanixdk.xyz)
 (TXT) w3m dump (sanixdk.xyz)
        
       | sfilmeyer wrote:
       | I enjoyed reading the article, but I'm pretty thrown by the
       | benchmarks and conclusion. All of the times are reported to a
       | single digit of precision, but then the summary is claiming that
       | one function shows an improvement while the other two are
       | described as negligible. When all the numbers presented are
       | "~5ms" or "~6ms", it doesn't leave me confident that small
       | changes to the benchmarking might have substantially changed that
       | conclusion.
        
         | Joel_Mckay wrote:
         | In general, modern compilers will often unroll or inline
         | functions without people even noticing. This often helps with
         | cache level state localization and parallelism.
         | 
         | Most code should focus on readability, then profile for busy
          | areas under use, and finally refactor the busy areas through
          | hand optimization or register hints as required.
         | 
         | If one creates something that looks suspect (inline Assembly
         | macro), a peer or llvm build will come along and ruin it later
         | for sure. Have a great day =3
        
           | hinkley wrote:
           | Doesn't it also help with branch prediction since the
           | unrolled loop can use different statistics with each copy?
        
             | Joel_Mckay wrote:
             | Non-overlapping sub-problems may be safely parallelized,
             | and executed out-of-order.
             | 
             | In some architectures, both of the branch code motions are
             | executed in parallel, and one is simply tossed after
             | dependent operations finish. We can't be sure exactly how
             | branch predictors and pre-fetch is implemented as it falls
             | under manufacturer NDA. =3
        
         | gizmo686 wrote:
         | Yeah. When your timing results are a single digit multiple of
         | your timing precision, that is a good indication you either
         | need a longer test, or a more precise clock.
         | 
         | At a 5ms baseline with millisecond precision, the smallest
          | improvement you can measure is 20%. And you cannot distinguish
          | a 20% speedup from a 20% slowdown that happened to get lucky
          | with clock ticks.
         | 
         | For what it is worth, I ran the provided test code on my
         | machine with a 100x increase in iterations and got the
          | following:
          | 
          |     == Benchmarking ABS ==
          |     ABS (branch):     0.260 sec
          |     ABS (branchless): 0.264 sec
          |     == Benchmarking CLAMP ==
          |     CLAMP (branch):     0.332 sec
          |     CLAMP (branchless): 0.538 sec
          |     == Benchmarking PARTITION ==
          |     PARTITION (branch):     0.043 sec
          |     PARTITION (branchless): 0.091 sec
         | 
         | Which is not exactly encouraging (gcc 13.3.0, -ffast-math
         | -march=native. I did not use the -fomit-this-entire-function
         | flag, which my compiler does not understand).
         | 
         | I had to drop down to O0 to see branchless be faster in any
          | case:
          | 
          |     == Benchmarking ABS ==
          |     ABS (branch):     0.743 sec
          |     ABS (branchless): 0.948 sec
          |     == Benchmarking CLAMP ==
          |     CLAMP (branch):     4.275 sec
          |     CLAMP (branchless): 1.429 sec
          |     == Benchmarking PARTITION ==
          |     PARTITION (branch):     0.156 sec
          |     PARTITION (branchless): 0.164 sec
        
           | Roxxik wrote:
           | I also tried myself, on different array sizes, with more
           | iterations. The branchy version is not strictly worse.
           | 
            | https://gist.github.com/Stefan-JLU/3925c6a73836ce841860b55c8...
        
       | MontyCarloHall wrote:
       | This great video [0] demonstrates how CPU performance has only
       | increased 1.5-2x over the past 15(!) years when executing
       | extremely branchy code. Really shows you just how deep modern CPU
       | pipelines have become.
       | 
       | The video also showcases a much more impressive branchless
       | speedup: computing CRC checksums. Doing this naively with an if-
       | statement for each bit is ~10x slower than doing it branchless
       | with a single bitwise operation for each bit. The author of the
       | article should consider showcasing this too, since it's a lot
       | more impressive than the measly 1.2x speedup highlighted in the
       | article. I assume the minimal/nonexistent speedups in the article
       | are due to modern CPU branch prediction being quite good. But
       | branch predictors inherently fail miserably at CRC because the
       | conditional is on whether the input bit is 1 or 0, which is
       | essentially random.
       | 
       | [0] https://www.youtube.com/watch?v=m7PVZixO35c
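For the curious, the CRC trick the video covers can be sketched bit-at-a-time in C (reflected CRC-32, polynomial 0xEDB88320). The branchy inner loop tests the low bit, which is essentially random; the branchless one turns that bit into an all-ones/all-zeros mask instead:

```c
#include <stdint.h>
#include <stddef.h>

/* Branchy: the predictor has to guess a coin flip per bit. */
uint32_t crc32_branchy(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            if (crc & 1) crc = (crc >> 1) ^ 0xEDB88320u;
            else         crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Branchless: the low bit becomes a mask that zeroes the XOR term. */
uint32_t crc32_branchless(const uint8_t *p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            uint32_t mask = (uint32_t)0 - (crc & 1);  /* all-ones iff low bit set */
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Both should agree with the standard CRC-32 check value: 0xCBF43926 for the input "123456789".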
        
         | SynasterBeiter wrote:
         | The linked video doesn't take into account power consumption of
         | these CPUs. He seems to be comparing laptop CPUs, NUC CPUs and
         | desktop CPUs. If you compare a 100W CPU and a 30W CPU that's a
         | couple of years newer, you shouldn't be surprised there isn't
         | much of a difference in performance.
        
           | MontyCarloHall wrote:
           | Even if you exclude the three mobile CPUs in the charts (the
           | 2012 i5, the 2015 i7, and the 2023 i9 NUC), the results still
           | hold.
           | 
           | >If you compare a 100W CPU and a 30W CPU that's a couple of
           | years newer, you shouldn't be surprised there isn't much of a
           | difference in performance
           | 
           | Sure, but this is over a lot more than a couple years. I'd
           | expect a 2023 mobile i9 to be considerably more than twice as
           | fast as a 2007 desktop Core 2 Duo.
        
         | hinkley wrote:
         | Branchless is also useful for cryptographic transforms as it
         | frustrates timing attacks. And that's a situation where it only
         | needs to be _relatively fast_ compared to the branching
         | alternative because we are trying to improve security while
         | limiting the overhead of doing so.
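The canonical example of this is constant-time comparison. A minimal sketch in C (an illustration of the idea, not a vetted crypto primitive):

```c
#include <stdint.h>
#include <stddef.h>

/* Constant-time equality: always touches every byte, no early exit,
   so the running time leaks nothing about where the inputs differ. */
int ct_memeq(const uint8_t *a, const uint8_t *b, size_t n) {
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];     /* accumulates any mismatch */
    return diff == 0;            /* 1 iff all bytes matched */
}
```

A naive memcmp-style loop that returns at the first mismatch would let an attacker binary-search a secret byte by byte from response timings.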
        
       | rmnclmnt wrote:
       | Great article, triggers some memories.
       | 
       | When you get to think about branchless programming, especially
       | for SIMD optimizations in the real world, you always learn a lot
       | and it's as if you get a +1 level on your algorithmic skills. The
        | hardest part then is making sure the tricks are clearly laid out
        | so that someone else can take it from here next time.
        
       | Legend2440 wrote:
       | Article doesn't mention this, but I'd consider neural networks a
       | form of branchless programming. It's all a bunch of multiply and
       | threshold operations.
        
         | abyesilyurt wrote:
         | Thresholding requires branching no?
        
           | WJW wrote:
           | No, you can do that without branching. See https://en.algorit
           | hmica.org/hpc/pipelining/branchless/#predi... for example.
        
           | Legend2440 wrote:
           | No, it's just a math operation and can be applied to any
           | amount of data in parallel.
        
           | bqmjjx0kac wrote:
           | As demonstrated in the article, you can compute clamp(x, min,
           | max) with straight-line code.
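For reference, one way to sketch that straight-line clamp is with the classic min/max bit tricks, assuming two's complement and the usual -(cond) all-ones mask idiom:

```c
/* Branchless clamp: each comparison yields 0 or 1, negation turns it
   into an all-zeros or all-ones mask, and XOR-selection picks a value. */
int clamp_branchless(int x, int lo, int hi) {
    x ^= (x ^ lo) & -(x < lo);   /* x = max(x, lo) */
    x ^= (x ^ hi) & -(x > hi);   /* x = min(x, hi) */
    return x;
}
```

In practice a plain `x < lo ? lo : (x > hi ? hi : x)` usually compiles to the same cmov-based straight-line code anyway.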
        
         | david-gpu wrote:
         | _> It's all a bunch of multiply and threshold operations._
         | 
         | Real-world high-performance matrix multiplication functions do
         | contain branches internally, even on GPUs. If you are ever
         | curious about what that looks like, NVidia maintains an open-
         | source library called CUTLASS.
        
       | mshockwave wrote:
        | The article is easy to follow but I think the author missed the
        | point: branchless programming (a subset of the better-known
        | constant-time programming) is almost exclusively used in
        | cryptography nowadays. As shown by the benchmarks in the
        | article, modern branch predictors have easily achieved over 95%
        | if not 99% accuracy for about a decade now.
        
       | styfle wrote:
        | This same blog posted yesterday and got flagged with "How I
        | accidentally created the fastest CSV parser ever made"
       | 
       | https://news.ycombinator.com/item?id=45400009
        
       | rao-v wrote:
       | I'm amused to see a for loop in a function that is purportedly
       | branchless
       | 
       | (not a critique)
        
       | adrianmonk wrote:
       | I've always wondered if any CPUs have tried to reduce the branch
       | penalty by speculatively executing both ways at once in parallel.
       | You'd have two of everything (two pipelines, two ALUs, two sets
       | of registers, etc.) and when you hit a conditional branch,
       | instead of guessing which way to go, you'd essentially fork.
       | 
       | Obviously that requires a lot of extra transistors and you are
       | doing computation that will be thrown away, so it's not free in
       | terms of space or power/heat/energy. But perhaps it could handle
       | cases that other approaches can't.
       | 
       | Even more of a wild idea is to pair up two cores and have them
       | work together this way. When you have a core that would have been
       | idle anyway, it can shadow an active core and be its doppelganger
       | that takes the other branch. You'd need to have very fast
       | communication between cores so the shadow core can spring into
       | action instantly when you hit a branch.
       | 
       | My gut instinct is it's not worth it overall, but I'm curious
       | whether it's been tried in the real world.
        
         | hawk_ wrote:
         | What has worked out very well in practice is hyper-threading.
         | So you take instructions from two threads and if one of them is
         | waiting on a branch the units of the CPU don't go unused.
        
         | terryf wrote:
         | Yes, this has been done for a while now, speculative execution
         | + register renaming is how this happens.
         | https://en.wikipedia.org/wiki/Register_renaming
        
           | o11c wrote:
           | Doesn't that only work on one side of the branch though?
        
           | umanwizard wrote:
           | No, what's been done for a while is speculatively executing
           | one predicted path, not both paths in parallel.
        
         | thegreatwhale8 wrote:
         | It's what happens and it gave us a really big issue a few years
         | ago
         | https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...
        
           | umanwizard wrote:
           | No, that is because of speculatively executing one path, not
           | both paths in parallel.
        
         | mshockwave wrote:
         | yes, it has been done for at least a decade if not more
         | 
         | > Even more of a wild idea is to pair up two cores and have
         | them work together this way
         | 
         | I don't think that'll be profitable, because...
         | 
         | > When you have a core that would have been idle anyway
         | 
          | ...you'll just schedule in another process. A modern OS rarely
          | runs short on available tasks to run.
        
         | jasonwatkinspdx wrote:
         | Yes, it's been looked at. If you wanna skim the research use
         | "Eager Execution" and "Disjoint Eager Execution" as jumping off
         | points.
         | 
         | It doesn't require duplicating everything. You just need to add
         | some additional bookkeeping of dependencies and what to retire
         | vs kill at the end of the pipeline.
         | 
         | In practice branch predictors are so good that speculating off
         | the "spine" of most likely path just isn't worth it.
         | 
         | In fact there were a lot of interesting microarchitectural
         | ideas from the late 90s to early 00s that just ended up being
         | moot because the combination of OoO speculation, branch
         | predictors, and big caches proved so effective.
        
       | gtirloni wrote:
       | I think I'm getting too sensitive to AI-assisted articles. So
       | many whimsical terms to make something "fun".
        
       | FooBarBizBazz wrote:
       | I used to do stuff like this (ok, not half as smart), but stopped
       | around 2013 or so, as the distinction between "implementation
       | defined" behavior (ok) and "undefined" behavior (not ok) started
       | to matter and bite.
       | 
       | After thinking through this carefully, though, I do not see UB
       | (except for signed overflow in a corner case):
       | 
       | Step 1, bit shift.
       | 
       | I understand that, until C++20, _left_ shift of a signed int was
       | UB. But this _right_ shift appears to be _implementation_
       | defined, which is ok if documented in the code.
       | 
       | Step 2: Add.
       | 
       | Then, (x + mask) is defined behavior (a) for positive x, since
       | then mask=0, and (b) for most negative x, since then mask=-1.
       | However, if x is numeric_limits::lowest, then you get signed
       | integer overflow, which is UB.
       | 
       | Step 3, XOR.
       | 
       | Then the final XOR doesn't have UB AFAICT. It wouldn't be UB as
       | of C++20, when signed integers became two's complement
       | officially. It might have been implementation defined before
       | then, which would be almost as good for something that
       | ubiquitous, but I'm not sure.
       | 
        | Ok, so I think this does not involve UB except for an input of
        | numeric_limits::lowest.
       | 
       | Sound about right?
       | 
       | To fix this, perhaps you would need to make that + an unsigned
       | one?
       | 
       | It bothers me how hard you need to think to do this language
       | lawyering. C++ is a system of rules. Computers interpret rules.
       | The language lawyer should be a piece of software. I get it, not
       | everything can be done statically, so, fine, do it dynamically?
       | UBSan comes closest in practice, but doesn't detect everything. I
       | understand formally modeled versions of C and C++ have been
       | developed commercially, but these are not open source so they
       | effectively do not exist. It's a strange situation.
       | 
       | Just the other day I considered writing something branchless and
       | said, "nah", because of uncertainty around UB. How much
       | performance is being left on the table around the world because
       | of similar thinking, driven by the existence of UB?
       | 
       | Maybe I was supposed to write OCaml or Pascal or Rust.
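For what it's worth, the overflow corner case can be dodged by doing the addition in unsigned arithmetic, as in this sketch (the right shift of a negative int remains implementation-defined; arithmetic shift is assumed, as on every mainstream compiler):

```c
#include <limits.h>

/* Branchless |x| with the addition done in unsigned arithmetic, so
   even INT_MIN avoids signed overflow; the result type is unsigned
   because |INT_MIN| does not fit in int. */
unsigned branchless_uabs(int x) {
    int mask = x >> (sizeof(int) * CHAR_BIT - 1);  /* 0 or -1 */
    return ((unsigned)x + (unsigned)mask) ^ (unsigned)mask;
}
```

For INT_MIN this yields (unsigned)INT_MIN, i.e. 2^31 on a 32-bit int, which is the mathematically correct magnitude.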
        
       | vdupras wrote:
       | In the part about "abs", there's an assembly breakdown:
       | 
        |     mov eax, edi
        |     sar eax, 31
        |     mov ecx, eax
        |     add edi, ecx
        |     xor eax, edi
       | 
       | Has this been generated by a C compiler? If yes, it's a bit
       | puzzling, because can't you remove "mov ecx, eax", replace "add
       | edi, ecx" by "add edi, eax" and have the exact same result?
        
         | userbinator wrote:
         | If you look at compiler output, you will always see plenty of
         | small stupidities.
        
           | vdupras wrote:
           | So why does conventional wisdom say that compilers will, in
           | the vast majority of the time, outperform programmers doing
           | assembly by hand? It seems contradictory to me.
        
       | HarHarVeryFunny wrote:
        | The classic ARM instruction set made nearly every instruction
        | conditional (predicated) to avoid the need for branching, but
        | AArch64 moved away from that since branch prediction is so good
        | nowadays.
        
       | zackmorris wrote:
       | I need something like this for a switch() command (technically a
       | list of arbitrary functions). Sort of like up to N branches in
       | one step.
       | 
       | The idea is to take some number of inputs A, B, C, ... and
       | conceptually perform all of the possible functions
       | simultaneously, then keep the one that's desired and throw the
       | rest away. For any arbitrary logic. Ideally using fewer
       | operations than all of that, but that's optional. Driven by one
        | variable, it would look like:
        | 
        |     // branch version
        |     switch (var1)
        |     {
        |       case 1: var4 = var2 + var3; break;
        |       case 2: var4 = var2 - var3; break;
        |       case 3: var4 = var2 * var3; break;
        |       case 4: var4 = var2 / var3; break;
        |       // ...
        |       default:
        |         var4 = 0; // optional and arbitrary
        |         break;
        |     }
        | 
        |     // branchless version
        |     var4 = magic(var1, var2, var3);
       | 
       | I don't know how to do this outside of programmable hardware like
       | an FPGA. The problem is that it's extremely challenging to
       | write/solve the formulas that map ordinary arithmetic and bitwise
       | functions to the larger set of functions.
       | 
       | So for now, I may just use a switch() command in a shader and let
       | it figure out any reusable logic internally. I don't know if
       | shaders have come far enough to allow that performantly, or if
       | they end up just calculating everything and throwing away the
       | paths not taken. Which would suggest that the max speed would be
       | O(1/N) for the number of cases.
       | 
       | Does anyone know? Maybe truth tables? Or a SAT solver? I'm also
       | not sure how this would work for floating point, but that's
       | optional too.
       | 
       | Edit: I updated the cases to show how var1 controls the math
       | performed on var2, var3 and var4.
        
         | judicious wrote:
         | I'm by no means an expert on this topic, but what about using
         | predefined hash map of value to function call/closure unless
         | that also yields branched programming underneath? Additionally,
         | maybe you want to use macros of some kind to generate the
         | "magic" function you're looking for, where it constructs a
         | single branchless statement up to N terms, whether it's
         | procedural macros in Rust or C macros.
        
         | AnotherGoodName wrote:
         | Think of a way you can make the number of each case 0 or 1 then
         | you have a single sum of all possibilities where that 0 or 1
         | multiplies the portion of the switch statement it represents
         | (the 0 will mean not active the 1 includes it).
         | 
         | Eg. If we had just 1 or 2 as options for the case statements
         | note that test1 = xor(var1 & 0x1, var1>>1 & 0x1) & var1 will,
         | with integer rounding, equal '1' if var1 is one or '0' if var1
         | is 2,3 or 4. (If it goes to 5 you need another xor at the next
         | highest bit).
         | 
         | Likewise test2 = xor(var1 & 0x1, var1>>1 & 0x1) & var1<<1
         | equals '0' if var1 is 1, 3 or 4 but it equals '1' if it's two.
         | 
         | So (test1 * (value on var1 being 1) + (test2 * (value on var1
         | being 2)) will get you a branchless version for the simple
         | either 1 or 2 case. This will pretty much always be much slower
         | than the branching version but some types of systems out there
         | don't really do branching and are linear. This type of zeroing
         | trick is really common on matrix multipliers.
         | 
          | Essentially you're creating binary logic gates that give 0 or 1
         | for the exact value you want and blindly multiplying that by
         | the value it would have produced if 1.
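The zeroing trick described above can be sketched as a branchless three-way dispatch in C (divide omitted to sidestep divide-by-zero; the function name follows the parent comments' hypothetical magic()):

```c
/* Branchless dispatch via masks: each comparison yields an all-ones
   or all-zeros mask, every arm is computed unconditionally, and the
   masks select exactly one result (0 for any unmatched op). */
int magic_masked(int op, int a, int b) {
    int m1 = -(op == 1);         /* all-ones iff op == 1 */
    int m2 = -(op == 2);
    int m3 = -(op == 3);
    return ((a + b) & m1) | ((a - b) & m2) | ((a * b) & m3);
}
```

As the parent says, this is typically slower than a branch or jump table on a CPU, but it maps directly onto SIMD/GPU-style select operations.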
        
         | secondcoming wrote:
         | Compilers would turn that into a jump table:
         | 
         | https://godbolt.org/z/14h5djYe8
         | 
         | Although this looks branchless if you don't care about divide-
         | by-zero. The ALU might not be too happy:
         | 
         | https://godbolt.org/z/9nqG3xMPY
        
         | fwip wrote:
          | Maybe I'm missing something obvious, but how about:
          | 
          |     vars[1] = var2 + var3;
          |     vars[2] = var2 - var3;
          |     ...
          |     vars[0] = 0;
          |     var4 = vars[var1];
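That table idea, fleshed out into a compilable sketch (names follow the grandparent comment's hypothetical magic()): every arm is evaluated unconditionally and var1 just indexes the result.

```c
/* Compute-everything-then-select dispatch. Note the ternary guarding
   the divide is itself a residual branch: the divide cannot safely be
   executed unconditionally when var3 might be zero. */
int magic(int var1, int var2, int var3) {
    int vars[5];
    vars[0] = 0;                       /* default case */
    vars[1] = var2 + var3;
    vars[2] = var2 - var3;
    vars[3] = var2 * var3;
    vars[4] = var3 ? var2 / var3 : 0;  /* guard divide-by-zero */
    return vars[var1];                 /* var1 assumed in 0..4 */
}
```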
        
       | Joker_vD wrote:
        | Just so you know, the "return x < 0 ? -x : x;" compiles into
        | 
        |     abs_branch:
        |         mov     eax, edi
        |         neg     eax
        |         cmovs   eax, edi
        |         ret
        | 
        | on x64, and into
        | 
        |     abs_branch:
        |         srai    a5,a0,31
        |         xor     a0,a5,a0
        |         sub     a0,a0,a5
        |         ret
        | 
        | on RISC-V if you use a C compiler with a half-decent codegen. And
        | "branchy" clamp() translates into
        | 
        |     clamp:
        |         cmp     edi, edx
        |         mov     eax, esi
        |         cmovle  edx, edi
        |         cmp     edi, esi
        |         cmovge  eax, edx
        |         ret
       | 
       | Seriously, the automatic transformation between ?: and if-then-
       | else (in both directions) is quite well studied by now. And if
       | you try to benchmark difference between branching and branchless
       | implementations, please make sure that the branches you expect
       | are actually there in the compiler's output.
        
       | Animats wrote:
       | This is just cutesy on CPUs, but is a big part of GPU
       | programming.
        
       | Scene_Cast2 wrote:
       | Another interesting take is how GPUs handle branching (latest
       | NVidia GPUs in particular - see the register interconnect and how
       | they swap threads without saving registers).
        
       | nixpulvis wrote:
       | Branchless programming tactics also avoid security issues around
       | timing side channels leaking predicate bits, which is somewhat
       | interesting.
        
       | mwkaufma wrote:
       | Is cmov branchless, or just branching by another name?
        
       ___________________________________________________________________
       (page generated 2025-09-28 23:00 UTC)