[HN Gopher] Bypassing the Branch Predictor
___________________________________________________________________
Bypassing the Branch Predictor
Author : signa11
Score : 63 points
Date : 2025-11-16 06:51 UTC (16 hours ago)
(HTM) web link (nicula.xyz)
(TXT) w3m dump (nicula.xyz)
| tux3 wrote:
| My first instinct for a poorly predicted branch would be to use a
| conditional move.
|
| This isn't always a win: you prevent the CPU from speculating
| down the wrong path, but you also prevent it from speculating
| down the correct path.
|
| If you really don't care about the failure path and really don't
| mind unmaintainable low-level hacks, I can think of a few ways to
| get creative.
|
| First, there's the whole array of anti-uarch-speculation-exploit
| tricks in the kernel that you can use as inspiration to control
| what the CPU is allowed to speculate. These little bits of
| assembly were reviewed by engineers from Intel and AMD, so these
| tricks can't stop working without also breaking the kernel.
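|
| (As a rough illustration of that kind of speculation control, and
| purely my own sketch rather than the kernel's actual code: an
| LFENCE acts as a barrier, so younger instructions don't start
| executing until the older ones have resolved.)
|
|       /* x86-64, GCC/Clang; a speculation barrier in the spirit of
|        * the kernel's mitigations, not a copy of them. */
|       static inline void speculation_barrier(void)
|       {
|           __asm__ __volatile__("lfence" ::: "memory");
|       }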
|
| Another idea is to take inspiration from anti-reverse engineering
| tricks. Make the failure path an actual exception. I don't mean
| software stack unwinding, I mean divide by your boolean and then
| call your send function unconditionally. If the boolean is true,
| it costs nothing because the result of the division is unused and
| we just speculate past it. If the boolean is false, the CPU will
| raise a divide by 0 exception, and this invisible branch will
| never be predicted by the CPU. Then your exception handler
| recovers and calls the cold path.
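|
| Roughly (a toy sketch, not production code: Transaction,
| should_send, send and abandon are the article's names, declared
| here only so the sketch stands alone, and recovering from SIGFPE
| with siglongjmp is the simplest thing that demonstrates the idea):
|
|       #include <setjmp.h>
|       #include <signal.h>
|
|       typedef struct Transaction Transaction;   /* assumed to exist */
|       int  should_send(Transaction *t);
|       void send(Transaction *t);
|       void abandon(Transaction *t);
|
|       static sigjmp_buf recover;
|
|       static void fpe_handler(int sig)
|       {
|           (void)sig;
|           siglongjmp(recover, 1);   /* jump out; restores the signal mask */
|       }
|
|       void install_handler(void)    /* call once at startup */
|       {
|           struct sigaction sa = {0};
|           sa.sa_handler = fpe_handler;
|           sigemptyset(&sa.sa_mask);
|           sigaction(SIGFPE, &sa, NULL);
|       }
|
|       void resolve(Transaction *t)
|       {
|           if (sigsetjmp(recover, 1)) {   /* reached only after the #DE trap */
|               abandon(t);                /* cold path */
|               return;
|           }
|           volatile int ok = should_send(t);
|           volatile int poison = 1 / ok;  /* ok == 0: hardware fault, no branch */
|           (void)poison;
|           send(t);                       /* hot path, falls straight through */
|       }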
| sparkie wrote:
| We could potentially use a conditional move and an
| unconditional jump to make the branch target predictor do the
| work instead - and flood it with a bunch of targets which are
| intended to mispredict. Eg, we could give 255 different paths
| for abandon and select one randomly:
|
|       /* GNU C: uses computed goto (&&label / goto *expr). */
|       #include <stdbool.h>
|       #include <stdint.h>
|       #include <stdlib.h>
|
|       #define BYTE_VALUES \
|         X(0x00) X(0x01) X(0x02) X(0x03) X(0x04) X(0x05) X(0x06) X(0x07) X(0x08) X(0x09) X(0x0A) X(0x0B) X(0x0C) X(0x0D) X(0x0E) X(0x0F) \
|         X(0x10) X(0x11) X(0x12) X(0x13) X(0x14) X(0x15) X(0x16) X(0x17) X(0x18) X(0x19) X(0x1A) X(0x1B) X(0x1C) X(0x1D) X(0x1E) X(0x1F) \
|         X(0x20) X(0x21) X(0x22) X(0x23) X(0x24) X(0x25) X(0x26) X(0x27) X(0x28) X(0x29) X(0x2A) X(0x2B) X(0x2C) X(0x2D) X(0x2E) X(0x2F) \
|         X(0x30) X(0x31) X(0x32) X(0x33) X(0x34) X(0x35) X(0x36) X(0x37) X(0x38) X(0x39) X(0x3A) X(0x3B) X(0x3C) X(0x3D) X(0x3E) X(0x3F) \
|         X(0x40) X(0x41) X(0x42) X(0x43) X(0x44) X(0x45) X(0x46) X(0x47) X(0x48) X(0x49) X(0x4A) X(0x4B) X(0x4C) X(0x4D) X(0x4E) X(0x4F) \
|         X(0x50) X(0x51) X(0x52) X(0x53) X(0x54) X(0x55) X(0x56) X(0x57) X(0x58) X(0x59) X(0x5A) X(0x5B) X(0x5C) X(0x5D) X(0x5E) X(0x5F) \
|         X(0x60) X(0x61) X(0x62) X(0x63) X(0x64) X(0x65) X(0x66) X(0x67) X(0x68) X(0x69) X(0x6A) X(0x6B) X(0x6C) X(0x6D) X(0x6E) X(0x6F) \
|         X(0x70) X(0x71) X(0x72) X(0x73) X(0x74) X(0x75) X(0x76) X(0x77) X(0x78) X(0x79) X(0x7A) X(0x7B) X(0x7C) X(0x7D) X(0x7E) X(0x7F) \
|         X(0x80) X(0x81) X(0x82) X(0x83) X(0x84) X(0x85) X(0x86) X(0x87) X(0x88) X(0x89) X(0x8A) X(0x8B) X(0x8C) X(0x8D) X(0x8E) X(0x8F) \
|         X(0x90) X(0x91) X(0x92) X(0x93) X(0x94) X(0x95) X(0x96) X(0x97) X(0x98) X(0x99) X(0x9A) X(0x9B) X(0x9C) X(0x9D) X(0x9E) X(0x9F) \
|         X(0xA0) X(0xA1) X(0xA2) X(0xA3) X(0xA4) X(0xA5) X(0xA6) X(0xA7) X(0xA8) X(0xA9) X(0xAA) X(0xAB) X(0xAC) X(0xAD) X(0xAE) X(0xAF) \
|         X(0xB0) X(0xB1) X(0xB2) X(0xB3) X(0xB4) X(0xB5) X(0xB6) X(0xB7) X(0xB8) X(0xB9) X(0xBA) X(0xBB) X(0xBC) X(0xBD) X(0xBE) X(0xBF) \
|         X(0xC0) X(0xC1) X(0xC2) X(0xC3) X(0xC4) X(0xC5) X(0xC6) X(0xC7) X(0xC8) X(0xC9) X(0xCA) X(0xCB) X(0xCC) X(0xCD) X(0xCE) X(0xCF) \
|         X(0xD0) X(0xD1) X(0xD2) X(0xD3) X(0xD4) X(0xD5) X(0xD6) X(0xD7) X(0xD8) X(0xD9) X(0xDA) X(0xDB) X(0xDC) X(0xDD) X(0xDE) X(0xDF) \
|         X(0xE0) X(0xE1) X(0xE2) X(0xE3) X(0xE4) X(0xE5) X(0xE6) X(0xE7) X(0xE8) X(0xE9) X(0xEA) X(0xEB) X(0xEC) X(0xED) X(0xEE) X(0xEF) \
|         X(0xF0) X(0xF1) X(0xF2) X(0xF3) X(0xF4) X(0xF5) X(0xF6) X(0xF7) X(0xF8) X(0xF9) X(0xFA) X(0xFB) X(0xFC) X(0xFD) X(0xFE) X(0xFF)
|
|       void resolve(Transaction *t)
|       {
|           void *jt[] = {
|       #define X(n) [n] = &&abandon##n,
|               BYTE_VALUES
|       #undef X
|               [0] = &&send,              /* slot 0 overridden with the send path */
|           };
|           static uint8_t branch = 0;
|           bool csend = should_send(t);
|           branch = csend ? 0 : branch;   /* cmov or SETcc */
|           goto *jt[branch];
|
|       #define X(n)                       \
|           abandon##n:                    \
|               abandon(t);                \
|               branch = random() % 256;   \
|               return;
|           BYTE_VALUES
|       #undef X
|
|       send:
|           if (csend) send(t);
|           else       abandon(t);
|           return;
|       }
|
| Assuming no inherent bias in the low byte produced by `random`,
| there's only a ~1/255 chance that an abandon branch will be
| predicted correctly, though this is also true for the send
| branch. The conditional branch in the send block, though, should
| only mispredict 1/256 of the time (when random returns 0).
|
| If we're sending significantly more often than 1/256 calls to
| resolve, it may be possible to train the BTP to prefer the send
| branch, as it will correctly predict this branch more often
| than the others which are chosen randomly - though this would
| depend on how the branch target predictor is implemented in the
| processor.
| Dwedit wrote:
| rand() doesn't do what you think it does. It's a multiply,
| then an add, then it returns particular bits of the current
| seed. See "Linear congruential generator" for more information.
|
| On GCC, it's multiply by 1103515245 then add 12345, and
| return low 30 bits. On MSVC, it's multiply by 214013 and add
| 2531011 then return bits 16...30.
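|
| (For concreteness, the MSVC-style generator described above boils
| down to something like this; the constants are the ones quoted,
| and the wrapper name is just illustrative.)
|
|       #include <stdint.h>
|
|       static uint32_t lcg_seed = 1;
|
|       /* Linear congruential generator: seed = seed*a + c, then hand
|        * back a slice of the seed's bits. */
|       static int lcg_rand(void)
|       {
|           lcg_seed = lcg_seed * 214013u + 2531011u;
|           return (int)((lcg_seed >> 16) & 0x7FFF);   /* bits 16..30 */
|       }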
| sparkie wrote:
| I don't specifically mean stdlib `rand` (hence the comment
| about assuming no inherent bias) - just some arbitrary
| PRNG. Updated the code above to say `random` so as not to
| confuse.
| IshKebab wrote:
| Interesting problem! Not a very satisfying solution but I can't
| think of anything better. Even if there were hints that were
| respected, you'd still have the problem of the code not being in
| icache, unless you actually execute it occasionally.
| nneonneo wrote:
| What a fun problem to think about.
|
| My first instinct, knowing less about this domain than maybe I
| should, would be to abuse the return address predictor. I believe
| CPUs will generally predict the target of a "ret" instruction
| using an internal stack of return addresses; some ARM flavours
| even make this explicit (https://developer.arm.com/documentation/
| den0042/0100/Unified...).
|
| The way to abuse this would be to put send() on the normal return
| path and call abandon() by rewriting the return address. In code:
|       void resolve(Transaction *t) {
|           predict(t);
|           send(t);
|       }
|       void predict(Transaction *t) {
|           if (!should_send(t))
|               *(void **)__builtin_return_address(0) = &abandon;
|       }
|
| This isn't exactly correct because it ignores control flow
| integrity (which you'd have to bypass), doesn't work like this on
| every architecture, and abandon() would need to be written partly
| in assembly to deal with the fact that the stack is in a weird
| state post-return, but hopefully it conveys the idea anyway.
|
| The if in predict() is implementable as a branchless conditional
| move. The return address predictor should guess that predict()
| will return to send(), but in most cases you'll smash the return
| address to point at abandon() instead.
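|
| (And to make that last point concrete, one branchless way to write
| predict(), with the same caveats as above plus an extra assumption:
| x86-64 with frame pointers kept, so the return-address slot sits
| just above the saved frame pointer.)
|
|       #include <stdint.h>
|
|       void predict(Transaction *t)
|       {
|           void **ret_slot = (void **)__builtin_frame_address(0) + 1;
|           uintptr_t keep = (uintptr_t)*ret_slot;   /* normal return into send() */
|           uintptr_t cold = (uintptr_t)&abandon;
|           uintptr_t mask = (uintptr_t)0 - (uintptr_t)!should_send(t); /* 0 or ~0 */
|           *ret_slot = (void *)((keep & ~mask) | (cold & mask));
|       }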
| jcul wrote:
| Why not just turn all the abandoned transactions into fake,
| discarded transactions and drop them later, at send time? E.g.
| by poisoning the frame checksum or setting something invalid on
| them, so they get discarded.
|
| Seems you'd be doing this anyway with the dummy transactions.
|
| Then you have no branch, though you may want to add dummy
| transactions anyway to keep the code in cache.
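|
| Something along those lines, as a rough sketch (the frame layout
| and names here are made up, not from the article):
|
|       #include <stdbool.h>
|       #include <stdint.h>
|
|       struct frame {
|           uint16_t csum;   /* some checksum the downstream path verifies */
|           /* ... rest of the frame ... */
|       };
|
|       /* Always build and hand off the frame, but branchlessly wreck
|        * its checksum when it must not really go out, so a later stage
|        * (or the receiver) drops it. */
|       static void finalize(struct frame *f, bool really_send)
|       {
|           uint16_t mask = (uint16_t)(0 - (unsigned)!really_send); /* 0x0000 or 0xFFFF */
|           f->csum ^= (uint16_t)(mask & 0xA5A5);  /* corrupted only when abandoning */
|       }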
| andrepd wrote:
| This is literally what it says in TFA lol
| kklisura wrote:
| > I asked Claude if there is such a way to basically hard-code
| branch prediction rules into the machine code, and the answer was
| that there's no way to do this on x86, but there is a way on ARM:
| the BEQP (predict branch taken) and BEQNP (predict branch not
| taken) instructions.
|
| > Those ARM instructions are just hallucinated, and the reality
| is actually the other way around: ARM doesn't have a way of hard-
| coding 'predictions', but x86 does.
|
| This made me chuckle. Thanks.
| nandomrumber wrote:
| If a human wrote that here (on HN) someone would note the error
| and the poster would reply:
|
| Yes, sorry, you're correct. I've usually had 97 more double
| ristrettos by this time in the morning.
|
| Some schools of thought suggest this has already happened.
| zavec wrote:
| LLMs are just what happens when we hit the singularity of
| caffeine consumption?
| api wrote:
| I asked Claude once if there was a way to open a tun/tap on
| Windows without a driver. It hallucinated an entire supposedly
| undocumented NT kernel API that as far as I can tell has never
| existed, complete with parts of this hallucinated API being
| deprecated in favor of newer non-existent parts.
|
| It was so detailed it makes me wonder if maybe it was in the
| training data somewhere. Maybe it ingested an internal MS doc
| for a proposed API or something. The case of the missing ARM
| instructions makes me wonder the same. Maybe someone on a forum
| proposed them and they were ingested.
|
| I did actually verify on the OS that the calls do not exist in
| the kernel or any driver or DLL on the system.
| Polizeiposaune wrote:
| LLMs generating text that references nonexistent APIs or
| machine instructions seems to me to be exactly like LLMs
| generating text that references nonexistent legal cases or
| fictional text that references nonexistent people.
|
| They've been fed enough real text in each of these various
| styles that they can parrot the shallow surface style without
| reference to the internals.
| monocasa wrote:
| To be fair, on the x86 side those branch prediction hint
| prefixes have been functionally ignored by pretty much all
| cores for about two decades.
| ViolentTurkey wrote:
| Make sure your global branch history is the same when
| "mistraining" and predicting with your BTB. You may end up in the
| wrong BTB entry and still mess up your prediction :).
| bluGill wrote:
| Do CPUs really track that much about branches? I know JITs do,
| but where does a CPU find the memory needed to store those
| counters - and then how does reading them not cause a
| different miss, since the CPU can't speculate until it has done
| the prediction for the if?
|
| Last time I checked a CPU's documentation, it had a simple rule
| that branches are always predicted taken, which would be easy
| for the compiler to handle by ordering the code. However, I
| don't recall which CPU that was. Still, this whole thing feels
| like it needs a citation, and I'm suspicious it's false. CPU
| designers know this matters, and sometimes compilers have
| information they don't that users care about, so they document
| how it works. (This is about the CPU, not the instruction set.)
| nkurz wrote:
| Maybe I'm misinterpreting you, but I think you are vastly
| underestimating both the accuracy and benefit of branch
| prediction. Yes, CPUs track that much about branches. Your
| suspicion that this is false is misguided. Here's Dan Luu's
| 2017 overview: https://danluu.com/branch-prediction/.
| umanwizard wrote:
| > last time I checked a cpu documentation they had a simple
| rule that branches are always taken, that would be easy for the
| compiler to code order first.
|
| Was that in the 80s? Modern performant CPUs all use dynamic
| branch prediction.
|
| I'm not really sure I understand the "where does it get the
| memory" point. Yes, this requires some amount of memory per
| tracked branch. This memory is hardwired into the CPU, just
| like all the other memory a CPU needs to do its job (registers,
| L1 cache, TLB, various in-flight instruction state, etc.)
| bluGill wrote:
| Over the 15 million lines of code I maintain there are a lot
| of branches. The CPU can track the most common ones, but as
| soon as the code spills out of cache, where does that memory
| come from?
| umanwizard wrote:
| It doesn't track all of them, it's a cache. The ones that
| were hit long enough ago get evicted, and then if you ever
| hit them again the processor will indeed have to fall back
| to a naive static prediction strategy with a high miss
| probability.
| bluGill wrote:
| Thanks, that makes sense now
| monocasa wrote:
| Modern cores can have more SRAM for branch prediction than they
| do for L1I$.
| kronks wrote:
| All of the side-channel attacks on CPUs have come from in-depth
| use of the branch prediction techniques that make up a
| significant part of every modern processor's performance. It's
| one of the main things we can do aside from just clocking them
| higher.
| amelius wrote:
| > On modern x86 processors, those instruction prefixes are simply
| ignored
|
| This sucks.
| sparkie wrote:
| The branch-taken hint (3EH) was re-added in Redwood Cove
| (2023), but it's only used for static prediction, where the
| branch predictor has not yet encountered the branch - i.e.,
| it's useful for branches you'll only hit once or twice but will
| likely take. Once the branch predictor has some information,
| the static prediction hint is ignored, so it's best to omit it
| for anything that will eventually get dynamic branch prediction
| (i.e., run many times), because you'd be consuming bytes in the
| i-cache that serve no purpose.
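|
| For illustration, with GCC/Clang on x86-64 you can stick the
| prefix on a jump yourself with asm goto (just a sketch of mine,
| not something from the article):
|
|       #include <stdbool.h>
|
|       /* Emit 3EH in front of the Jcc so a core that honours the hint
|        * statically predicts the branch as taken the first time. */
|       static bool equal_hint_taken(long a, long b)
|       {
|           asm goto ("cmp %1, %0\n\t"
|                     ".byte 0x3e\n\t"   /* branch-taken hint prefix */
|                     "je %l[taken]"
|                     : /* no outputs */
|                     : "r"(a), "r"(b)
|                     : "cc"
|                     : taken);
|           return false;                /* fall through: not equal */
|       taken:
|           return true;                 /* the hinted path */
|       }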
| tehjoker wrote:
| If the branch is only taken once, how can you realize a
| significant performance benefit of more than a few ns?
| ack_complete wrote:
| Cold branch comes to mind -- something like an interrupt
| handler that is run often enough but not in high enough
| bursts.
| lordnacho wrote:
| Sounds like you want to react to the market when you see a
| certain price in the incoming stream.
|
| The trigger doesn't happen often, but when it does, the naive
| implementation incurs a misprediction penalty.
|
| I wonder if the solution here might be to avoid the branch
| altogether?
|
| Write some boolean logic monster that isn't an if condition. It
| unconditionally creates the reaction (new/cancel order), but if
| the trigger was not present, the payload is invalid and
| downstream processing will not let it out of your system onto the
| network.
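|
| As a toy sketch of that idea (the struct and the "quantity zero
| gets dropped downstream" rule are invented for illustration):
|
|       #include <stdbool.h>
|       #include <stdint.h>
|
|       struct order {
|           uint32_t price;
|           uint32_t quantity;   /* downstream refuses to emit quantity == 0 */
|       };
|
|       /* Always construct the reaction; when the trigger wasn't present,
|        * the multiply zeroes the quantity and the order dies downstream.
|        * No conditional branch anywhere. */
|       static void react(struct order *out, uint32_t price, bool triggered)
|       {
|           out->price    = price;
|           out->quantity = 100u * (uint32_t)triggered;
|       }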
| Arcuru wrote:
| I love the `[[likely]]` and `[[unlikely]]` tags since they nicely
| encapsulate the Modern C++ philosophy.
|
| 1. They don't work that well.
|
| 2. The intended use case is that you'd label an unlikely branch
| that you want to speed up as `[[likely]]`, which is confusing.
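|
| (Point 2 in code, using __builtin_expect, which is roughly the
| effect [[likely]] is meant to have; the names mirror the article's
| example and are assumed to exist.)
|
|       #include <stdbool.h>
|
|       typedef struct Transaction Transaction;
|       bool should_send(Transaction *t);
|       void send(Transaction *t);
|       void abandon(Transaction *t);
|
|       void resolve(Transaction *t)
|       {
|           /* should_send() is almost always false, but the true path is
|            * the one whose latency matters, so it gets hinted "likely". */
|           if (__builtin_expect(should_send(t), 1))
|               send(t);
|           else
|               abandon(t);
|       }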
|
| They are certainly motivated by good intentions (like the HFT
| use case mentioned in TFA; I remember that talk too, I think I
| was a volunteer at CppCon that year), but even with both of the
| points above known, _they were still added to the standard_.
| m4nu3l wrote:
| I remember reading about this or a similar problem before. I
| think it was posted on HN, but the solution was much different.
| They entirely removed the branch and instead marked the requests
| in such a way that the packets would be discarded by the network
| adapter. I can't remember the details. That only works for that
| kind of network transaction, though.
___________________________________________________________________
(page generated 2025-11-16 23:01 UTC)