[HN Gopher] Bypassing the Branch Predictor
       ___________________________________________________________________
        
       Bypassing the Branch Predictor
        
       Author : signa11
       Score  : 63 points
       Date   : 2025-11-16 06:51 UTC (16 hours ago)
        
 (HTM) web link (nicula.xyz)
 (TXT) w3m dump (nicula.xyz)
        
       | tux3 wrote:
       | My first instinct for a poorly predicted branch would be to use a
       | conditional move.
       | 
        | This isn't always a win: you prevent the CPU from speculating
        | down the wrong path, but you also prevent it from speculating
        | down the correct one.
       | 
       | If you really don't care about the failure path and really don't
       | mind unmaintainable low-level hacks, I can think of a few ways to
       | get creative.
       | 
        | First, there's the whole array of anti-uarch-speculation-exploit
        | tricks in the kernel that you can use as inspiration to control
        | what the CPU is allowed to speculate. These little bits of
        | assembly were reviewed by engineers from Intel and AMD, so they
        | can't stop working without also breaking the kernel.
       | 
       | Another idea is to take inspiration from anti-reverse engineering
       | tricks. Make the failure path an actual exception. I don't mean
       | software stack unwinding, I mean divide by your boolean and then
       | call your send function unconditionally. If the boolean is true,
       | it costs nothing because the result of the division is unused and
       | we just speculate past it. If the boolean is false, the CPU will
       | raise a divide by 0 exception, and this invisible branch will
       | never be predicted by the CPU. Then your exception handler
       | recovers and calls the cold path.
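        | 
        | A minimal sketch of that last idea, assuming POSIX signals and
        | x86-style trapping on integer division by zero; Transaction,
        | should_send, send and abandon are the article's helpers
        | (declared here only so the fragment stands alone), and
        | install_handler/on_sigfpe/cold_path are names made up for this
        | sketch. The sigsetjmp/siglongjmp recovery is illustrative, not
        | free:
        | 
        |     #include <setjmp.h>
        |     #include <signal.h>
        |     #include <stdbool.h>
        | 
        |     typedef struct Transaction Transaction;  /* from the article */
        |     bool should_send(Transaction *t);
        |     void send(Transaction *t);
        |     void abandon(Transaction *t);
        | 
        |     static sigjmp_buf cold_path;
        | 
        |     static void on_sigfpe(int sig) {
        |         (void)sig;
        |         siglongjmp(cold_path, 1);    /* escape to the recovery point */
        |     }
        | 
        |     void install_handler(void) {     /* call once at startup */
        |         signal(SIGFPE, on_sigfpe);
        |     }
        | 
        |     void resolve(Transaction *t) {
        |         static volatile int sink;
        |         if (sigsetjmp(cold_path, 1)) {   /* re-entered from the handler */
        |             abandon(t);
        |             return;
        |         }
        |         int ok = (int)should_send(t);
        |         sink = 1 / ok;    /* ok == 0 raises SIGFPE; no visible branch */
        |         send(t);          /* called unconditionally on the fast path */
        |     }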
        
          | sparkie wrote:
          | We could potentially use a conditional move and an
          | unconditional jump to make the branch target predictor do the
          | work instead - and flood it with a bunch of targets which are
          | intended to mispredict. E.g., we could give 255 different
          | paths for abandon and select one randomly:
          | 
          |     #define BYTE_VALUES \
          |       X(0x00) X(0x01) X(0x02) X(0x03) X(0x04) X(0x05) X(0x06) X(0x07) X(0x08) X(0x09) X(0x0A) X(0x0B) X(0x0C) X(0x0D) X(0x0E) X(0x0F) \
          |       X(0x10) X(0x11) X(0x12) X(0x13) X(0x14) X(0x15) X(0x16) X(0x17) X(0x18) X(0x19) X(0x1A) X(0x1B) X(0x1C) X(0x1D) X(0x1E) X(0x1F) \
          |       X(0x20) X(0x21) X(0x22) X(0x23) X(0x24) X(0x25) X(0x26) X(0x27) X(0x28) X(0x29) X(0x2A) X(0x2B) X(0x2C) X(0x2D) X(0x2E) X(0x2F) \
          |       X(0x30) X(0x31) X(0x32) X(0x33) X(0x34) X(0x35) X(0x36) X(0x37) X(0x38) X(0x39) X(0x3A) X(0x3B) X(0x3C) X(0x3D) X(0x3E) X(0x3F) \
          |       X(0x40) X(0x41) X(0x42) X(0x43) X(0x44) X(0x45) X(0x46) X(0x47) X(0x48) X(0x49) X(0x4A) X(0x4B) X(0x4C) X(0x4D) X(0x4E) X(0x4F) \
          |       X(0x50) X(0x51) X(0x52) X(0x53) X(0x54) X(0x55) X(0x56) X(0x57) X(0x58) X(0x59) X(0x5A) X(0x5B) X(0x5C) X(0x5D) X(0x5E) X(0x5F) \
          |       X(0x60) X(0x61) X(0x62) X(0x63) X(0x64) X(0x65) X(0x66) X(0x67) X(0x68) X(0x69) X(0x6A) X(0x6B) X(0x6C) X(0x6D) X(0x6E) X(0x6F) \
          |       X(0x70) X(0x71) X(0x72) X(0x73) X(0x74) X(0x75) X(0x76) X(0x77) X(0x78) X(0x79) X(0x7A) X(0x7B) X(0x7C) X(0x7D) X(0x7E) X(0x7F) \
          |       X(0x80) X(0x81) X(0x82) X(0x83) X(0x84) X(0x85) X(0x86) X(0x87) X(0x88) X(0x89) X(0x8A) X(0x8B) X(0x8C) X(0x8D) X(0x8E) X(0x8F) \
          |       X(0x90) X(0x91) X(0x92) X(0x93) X(0x94) X(0x95) X(0x96) X(0x97) X(0x98) X(0x99) X(0x9A) X(0x9B) X(0x9C) X(0x9D) X(0x9E) X(0x9F) \
          |       X(0xA0) X(0xA1) X(0xA2) X(0xA3) X(0xA4) X(0xA5) X(0xA6) X(0xA7) X(0xA8) X(0xA9) X(0xAA) X(0xAB) X(0xAC) X(0xAD) X(0xAE) X(0xAF) \
          |       X(0xB0) X(0xB1) X(0xB2) X(0xB3) X(0xB4) X(0xB5) X(0xB6) X(0xB7) X(0xB8) X(0xB9) X(0xBA) X(0xBB) X(0xBC) X(0xBD) X(0xBE) X(0xBF) \
          |       X(0xC0) X(0xC1) X(0xC2) X(0xC3) X(0xC4) X(0xC5) X(0xC6) X(0xC7) X(0xC8) X(0xC9) X(0xCA) X(0xCB) X(0xCC) X(0xCD) X(0xCE) X(0xCF) \
          |       X(0xD0) X(0xD1) X(0xD2) X(0xD3) X(0xD4) X(0xD5) X(0xD6) X(0xD7) X(0xD8) X(0xD9) X(0xDA) X(0xDB) X(0xDC) X(0xDD) X(0xDE) X(0xDF) \
          |       X(0xE0) X(0xE1) X(0xE2) X(0xE3) X(0xE4) X(0xE5) X(0xE6) X(0xE7) X(0xE8) X(0xE9) X(0xEA) X(0xEB) X(0xEC) X(0xED) X(0xEE) X(0xEF) \
          |       X(0xF0) X(0xF1) X(0xF2) X(0xF3) X(0xF4) X(0xF5) X(0xF6) X(0xF7) X(0xF8) X(0xF9) X(0xFA) X(0xFB) X(0xFC) X(0xFD) X(0xFE) X(0xFF)
          | 
          |     void resolve(Transaction *t) {
          |         void *jt[] = {
          |             #define X(n) [n] = &&abandon##n,
          |             BYTE_VALUES
          |             #undef X
          |             [0] = &&send,   // overrides the 0x00 abandon entry
          |         };
          |         static uint8_t branch = 0;
          |         bool csend = should_send(t);
          |         branch = csend ? 0 : branch;   // cmov or SETcc
          |         goto *jt[branch];
          | 
          |         #define X(n) \
          |         abandon##n: \
          |             abandon(t); \
          |             branch = random() % 256; \
          |             return;
          |         BYTE_VALUES
          |         #undef X
          | 
          |     send:
          |         if (csend) send(t);
          |         else abandon(t);
          |         return;
          |     }
         | 
          | Assuming no inherent bias in the low byte produced by `random`,
          | there's only a ~1/255 chance that an abandon target will be
          | correctly predicted, and the same is true for the send target.
          | The conditional branch inside send, though, should only
          | mispredict 1/256 of the time (when random returns 0).
         | 
         | If we're sending significantly more often than 1/256 calls to
         | resolve, it may be possible to train the BTP to prefer the send
         | branch, as it will correctly predict this branch more often
         | than the others which are chosen randomly - though this would
         | depend on how the branch target predictor is implemented in the
         | processor.
        
           | Dwedit wrote:
            | rand() doesn't do what you think it does. It's a multiply,
            | then an add, then it returns particular bits of the current
            | seed.
           | See "Linear congruential generator" for more information.
           | 
           | On GCC, it's multiply by 1103515245 then add 12345, and
           | return low 30 bits. On MSVC, it's multiply by 214013 and add
           | 2531011 then return bits 16...30.
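            | 
            | For illustration, a minimal sketch of that MSVC-style
            | recurrence (constants as quoted above; lcg_state and
            | lcg_rand are just names for this sketch, not any libc's
            | actual internals):
            | 
            |     #include <stdint.h>
            | 
            |     static uint32_t lcg_state = 1;   /* the "current seed" */
            | 
            |     static int lcg_rand(void) {
            |         lcg_state = lcg_state * 214013u + 2531011u;  /* multiply, then add */
            |         return (int)((lcg_state >> 16) & 0x7fff);    /* bits 16..30 */
            |     }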
        
             | sparkie wrote:
             | I don't specifically mean stdlib `rand` (hence comment
             | about assuming no inherent bias) - but just some arbitrary
              | PRNG. Updated the code above to say `random` so as not to
              | confuse.
        
       | IshKebab wrote:
       | Interesting problem! Not a very satisfying solution but I can't
       | think of anything better. Even if there were hints that were
       | respected, you'd still have the problem of the code not being in
       | icache, unless you actually execute it occasionally.
        
       | nneonneo wrote:
       | What a fun problem to think about.
       | 
       | My first instinct, knowing less about this domain than maybe I
       | should, would be to abuse the return address predictor. I believe
       | CPUs will generally predict the target of a "ret" instruction
       | using an internal stack of return addresses; some ARM flavours
       | even make this explicit (https://developer.arm.com/documentation/
       | den0042/0100/Unified...).
       | 
       | The way to abuse this would be to put send() on the normal return
       | path and call abandon() by rewriting the return address. In code:
        | 
        |     void resolve(Transaction *t) {
        |         predict(t);
        |         send(t);
        |     }
        | 
        |     void predict(Transaction *t) {
        |         if (!should_send(t))
        |             *(void **)__builtin_return_address(0) = &abandon;
        |     }
       | 
       | This isn't exactly correct because it ignores control flow
       | integrity (which you'd have to bypass), doesn't work like this on
       | every architecture, and abandon() would need to be written partly
       | in assembly to deal with the fact that the stack is in a weird
       | state post-return, but hopefully it conveys the idea anyway.
       | 
       | The if in predict() is implementable as a branchless conditional
       | move. The return address predictor should guess that predict()
       | will return to send(), but in most cases you'll smash the return
       | address to point at abandon() instead.
        
       | jcul wrote:
        | Why not just make all the abandoned transactions into fake
        | transactions and discard them at the send later, e.g. by
        | poisoning the frame checksum or setting something invalid on
        | them, so they get discarded?
       | 
       | Seems you'd be doing this anyway with the dummy transactions.
       | 
        | Then you have no branch, though you may want to add dummy
        | transactions anyway to keep the code in cache.
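        | 
        | A rough sketch of the branchless poisoning, assuming a made-up
        | Frame layout with a trailing checksum (finalize_frame and the
        | field names are hypothetical; the point is only that the
        | corruption happens without a branch):
        | 
        |     #include <stdbool.h>
        |     #include <stdint.h>
        | 
        |     typedef struct {
        |         uint8_t  payload[64];
        |         uint32_t checksum;   /* hypothetical frame checksum */
        |     } Frame;
        | 
        |     static inline void finalize_frame(Frame *f, bool keep) {
        |         /* keep == true  -> mask = 0x00000000, checksum untouched
        |            keep == false -> mask = 0xFFFFFFFF, checksum invalid  */
        |         uint32_t mask = (uint32_t)keep - 1u;
        |         f->checksum ^= mask;   /* no branch to mispredict */
        |     }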
        
         | andrepd wrote:
          | This is literally what it says in TFA lol
        
       | kklisura wrote:
       | > I asked Claude if there is such a way to basically hard-code
       | branch prediction rules into the machine code, and the answer was
       | that there's no way to do this on x86, but there is a way on ARM:
       | the BEQP (predict branch taken) and BEQNP (predict branch not
       | taken) instructions.
       | 
       | > Those ARM instructions are just hallucinated, and the reality
       | is actually the other way around: ARM doesn't have a way of hard-
       | coding 'predictions', but x86 does.
       | 
       | This made me chuckle. Thanks.
        
         | nandomrumber wrote:
         | If a human wrote that here (on HN) someone would note the error
         | and the poster would reply:
         | 
         | Yes, sorry, you're correct. I've usually had 97 more double
         | ristrettos by this time in the morning.
         | 
          | Some schools of thought suggest this has already happened.
        
           | zavec wrote:
           | LLMs are just what happens when we hit the singularity of
           | caffeine consumption?
        
         | api wrote:
         | I asked Claude once if there was a way to open a tun/tap on
         | Windows without a driver. It hallucinated an entire supposedly
         | undocumented NT kernel API that as far as I can tell has never
         | existed, complete with parts of this hallucinated API being
          | deprecated in favor of newer non-existent parts.
         | 
         | It was so detailed it makes me wonder if maybe it was in the
         | training data somewhere. Maybe it ingested an internal MS doc
         | for a proposed API or something. The case of the missing ARM
         | instructions makes me wonder the same. Maybe someone on a forum
         | proposed them and they were ingested.
         | 
         | I did actually verify on the OS that the calls do not exist in
         | the kernel or any driver or DLL on the system.
        
           | Polizeiposaune wrote:
            | LLMs generating text referencing nonexistent APIs or
            | machine instructions seems to me to be exactly like LLMs
            | generating text referencing nonexistent legal cases or
            | generating fictional text referencing nonexistent people.
           | 
           | They've been fed enough real text in each of these various
           | styles that they can parrot the shallow surface style without
           | reference to the internals.
        
         | monocasa wrote:
         | To be fair, on the x86 side those branch prediction hint
         | prefixes have been functionally ignored by pretty much all
         | cores for about two decades.
        
       | ViolentTurkey wrote:
       | Make sure your global branch history is the same when
       | "mistraining" and predicting with your BTB. You may end up in the
       | wrong BTB entry and still mess up your prediction :).
        
        | bluGill wrote:
        | Do CPUs really track that much about branches? I know a JIT
        | does, but where does a CPU find the memory needed to store
        | those counters - and then how does reading them not cause a
        | different miss, given that the CPU can't speculate until it has
        | made the prediction?
        | 
        | The last time I checked a CPU's documentation, it had a simple
        | rule that branches are always predicted taken, which would be
        | easy for the compiler to exploit by ordering the code
        | accordingly. I don't recall which CPU that was, though. Still,
        | this whole thing feels like it needs a citation; I'm suspicious
        | it's false. CPU designers know this matters, and sometimes
        | compilers have information they don't that users care about, so
        | they document how it works. (This is about the CPU, not the
        | instruction set.)
        
         | nkurz wrote:
         | Maybe I'm misinterpreting you, but I think you are vastly
         | underestimating both the accuracy and benefit of branch
          | prediction. Yes, CPUs track that much about branches. Your
         | suspicion that this is false is misguided. Here's Dan Luu's
         | 2017 overview: https://danluu.com/branch-prediction/.
        
         | umanwizard wrote:
         | > last time I checked a cpu documentation they had a simple
         | rule that branches are always taken, that would be easy for the
         | compiler to code order first.
         | 
         | Was that in the 80s? Modern performant CPUs all use dynamic
         | branch prediction.
         | 
         | I'm not really sure I understand the "where does it get the
         | memory" point. Yes, this requires some amount of memory per
         | tracked branch. This memory is hardwired into the CPU, just
         | like all the other memory a CPU needs to do its job (registers,
         | L1 cache, TLB, various in-flight instruction state, etc.)
        
           | bluGill wrote:
            | Over the 15 million lines of code I maintain there are a
            | lot of branches. The CPU can track the most common ones, but
            | as soon as the code spills out of cache, where does that
            | memory come from?
        
             | umanwizard wrote:
              | It doesn't track all of them; it's a cache. The ones that
             | were hit long enough ago get evicted, and then if you ever
             | hit them again the processor will indeed have to fall back
             | to a naive static prediction strategy with a high miss
             | probability.
        
               | bluGill wrote:
               | Thanks, that makes sense now
        
         | monocasa wrote:
          | Modern cores can have more SRAM for branch prediction than
          | they do for L1I$.
        
         | kronks wrote:
          | All of the side-channel attacks on CPUs have come from
          | in-depth use of the branch prediction techniques that make up
          | a significant part of every modern processor's performance.
          | It's one of the main things we can do aside from just clocking
          | them higher.
        
       | amelius wrote:
       | > On modern x86 processors, those instruction prefixes are simply
       | ignored
       | 
       | This sucks.
        
         | sparkie wrote:
          | The branch-taken hint (3EH) was re-added in Redwood Cove
          | (2023), but it's only for static prediction, where the branch
          | predictor has not yet encountered the branch - i.e., it's
          | useful for branches you'd only hit once or twice but would
          | likely take. Once the branch predictor has some information,
          | the static hint is ignored, so it's best to omit it for
          | anything that will eventually get dynamic branch prediction
          | (i.e., runs many times), because you'd just be consuming
          | i-cache bytes that serve no purpose.
        
           | tehjoker wrote:
            | If the branch is only taken once, how can you realize a
            | significant performance benefit of more than a few ns?
        
             | ack_complete wrote:
              | A cold branch comes to mind -- something like an
              | interrupt handler that is run often enough, but not in
              | high enough bursts.
        
       | lordnacho wrote:
       | Sounds like you want to react to the market when you see a
       | certain price in the incoming stream.
       | 
        | The trigger doesn't happen often, but when it does, the naive
        | implementation incurs a misprediction penalty.
       | 
       | I wonder if the solution here might be to avoid the branch
       | altogether?
       | 
       | Write some boolean logic monster that isn't an if condition. It
       | unconditionally creates the reaction (new/cancel order), but if
       | the trigger was not present, the payload is invalid and
       | downstream processing will not let it out of your system onto the
       | network.
        
       | Arcuru wrote:
       | I love the `[[likely]]` and `[[unlikely]]` tags since they nicely
       | encapsulate the Modern C++ philosophy.
       | 
       | 1. They don't work that well.
       | 
       | 2. The intended use case is that you'd label an unlikely branch
       | that you want to speed up as `[[likely]]`, which is confusing.
       | 
        | They are certainly motivated by good intentions (like the HFT
        | use-case mentioned in TFA - I remember that talk too; I think I
        | was a volunteer at CppCon that year), but even with both of my
        | points above known, _they were still added to the standard_.
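        | 
        | For anyone who hasn't seen point 2 in the wild, the trick looks
        | roughly like this (written here with GCC's __builtin_expect,
        | the older spelling of the same layout hint; [[likely]] is the
        | C++20 attribute form, and the names follow the other snippets
        | in this thread):
        | 
        |     void resolve(Transaction *t) {
        |         /* should_send() is almost always false, but the rare true
        |            case is the latency-critical one -- so we claim it is
        |            "expected" to get it laid out as the fall-through path. */
        |         if (__builtin_expect(should_send(t), 1)) {
        |             send(t);      /* rare, but we want it fast */
        |         } else {
        |             abandon(t);   /* common, but we don't care */
        |         }
        |     }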
        
       | m4nu3l wrote:
       | I remember reading about this or a similar problem before. I
       | think it was posted on HN, but the solution was much different.
       | They entirely removed the branch and instead marked the requests
       | in such a way that the packets would be discarded by the network
       | adapter. I can't remember the details. That only works for that
        | kind of network transaction, though.
        
       ___________________________________________________________________
       (page generated 2025-11-16 23:01 UTC)