[HN Gopher] Zen 5's 2-Ahead Branch Predictor: How a 30 Year Old ...
___________________________________________________________________
Zen 5's 2-Ahead Branch Predictor: How a 30 Year Old Idea Allows for
New Tricks
Author : matt_d
Score : 86 points
Date : 2024-07-26 18:32 UTC (4 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| emn13 wrote:
| As a novice in this area, it's not clear to me after reading this
| what exactly the 2-ahead branch predictor is.
| deadmutex wrote:
| You can check out the seminal paper linked in the article. Or
| start by summarizing the paper with Gemini, Claude, ChatGPT,
| etc. to get a high level overview (and then confirm the answer
| by reading the paper).
| coherentpony wrote:
| https://link.springer.com/article/10.1007/s10676-024-09775-5
| cpldcpu wrote:
| My understanding is that they do not predict the target of the
| next branch but of the one after the next (2-ahead). This is
| probably much harder than next-branch prediction, but it allows
| code fetch to be initiated much earlier to feed even deeper
| pipelines.
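cpldcpu's description can be sketched in code. This is a toy illustration of the indexing idea only (my own simplification, not AMD's actual design): alongside a conventional "this branch -> its target" table, a 2-ahead table maps the current branch to the target of the branch *after* the next one, so the front end can start fetching that block earlier.

```c
#include <assert.h>

#define TABLE_SIZE 16

/* 1-ahead: for each branch PC, the predicted target of that branch. */
static unsigned next_target[TABLE_SIZE];
/* 2-ahead: for each branch PC, the predicted target of the branch
 * AFTER the next one, so its fetch can be started earlier. */
static unsigned ahead2_target[TABLE_SIZE];

static unsigned idx(unsigned pc) { return pc % TABLE_SIZE; }

/* Train both tables from a stream of taken-branch addresses. */
void train(const unsigned *trace, int n) {
    for (int i = 0; i + 2 < n; i++) {
        next_target[idx(trace[i])]   = trace[i + 1];
        ahead2_target[idx(trace[i])] = trace[i + 2];
    }
}

unsigned predict_next(unsigned pc)   { return next_target[idx(pc)]; }
unsigned predict_2ahead(unsigned pc) { return ahead2_target[idx(pc)]; }
```

With a repeating branch stream 1 -> 2 -> 3 -> 1 -> ..., looking up branch 1 yields target 2 from the 1-ahead table and target 3 from the 2-ahead table.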
| emn13 wrote:
| Ah, that makes sense in the context of the article - thanks!
| layer8 wrote:
| Surely you must also predict the next branch to predict the
| one after. Otherwise you wouldn't know which is the one
| after.
|
| Given that, I still don't understand how predicting the next
| two branches is different from predicting the next branch and
| then the next after that, i.e. two times the same thing.
| Szpadel wrote:
| that's probably a bad idea, but I would like to learn why:
|
| when we have a conditional branch, why can't we just fetch and
| prepare instructions for both possible branches and then discard
| the incorrect one?
|
| Is this that much harder, or are there other reasons that make it
| not worth it?
| dymk wrote:
| Transistor count; now you have to duplicate all the decode and
| speculative execution circuitry for both possible branches
| immibis wrote:
| No, the same circuits would execute them interleaved just
| like they execute multiple hardware threads now
| nomel wrote:
| With the SMT thread count having to be one less.
| wmf wrote:
| It's a huge waste of energy and in some cases it would even be
| slower because you'd execute more instructions overall. If the
| branch mispredict rate is around 1% it's simply not worth
| paying a penalty 99% of the time to get a gain 1% of the time.
| Maybe it would be worth doing on low-confidence branches.
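wmf's trade-off can be put in rough numbers. The units here are illustrative assumptions: fetching one path costs 1 unit of front-end work, and a mispredict costs an assumed ~15 units.

```c
/* Expected front-end work per branch when predicting a single path:
 * 1 unit of fetch, plus the penalty on the rare mispredict. */
double predict_cost(double mispredict_rate, double penalty) {
    return 1.0 + mispredict_rate * penalty;
}

/* Eagerly fetching both paths always does 2x the fetch work, but
 * never pays a mispredict penalty. */
double eager_cost(void) { return 2.0; }
```

At a 1% mispredict rate, predicting costs about 1.15 units versus 2.0 for eager fetch; at a low-confidence 20% rate the comparison flips (4.0 vs 2.0), which is why eager fetch might only pay off on hard-to-predict branches.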
| sapiogram wrote:
| Disclaimer: I don't work in this field, just an enthusiast.
|
| As far as I can tell, branch predictors have always been too
| good for it to be worth it. Modern CPUs have instruction
| reorder buffers that are hundreds of instructions deep, so even
| if only 8 of those instructions are conditional jumps, there are
| 256 different paths your program could take. If your branch
| predictor predicts all 8 correctly >50% of the time (it does),
| doing 256x the work to cover your ass is not worth it.
| swatcoder wrote:
| Because it's rare for a branch result to be random. The
| compiler/runtime/cpu/etc can often guess which result is more
| likely, and correctly not do the extra work in the first place,
| and so that's usually the better strategy than spending silicon
| and heat on the wrong answer just in case.
|
| I think a lot of people don't have an intuition about how
| accurate branch prediction can be, but if you look at your own
| code, you'll quickly realize "well, yeah, control flow is
| almost always going to go this way and we just have this branch
| so we can handle the exceptional case" -- compilers can often
| deduce this pretty well themselves now, and cpus/jits/runtimes
| can develop some pretty impressive heuristics as well, and when
| all those fail you can often add explicit hints in your code
| that tell your compiler/etc what _you_ expect if they can't
| guess.
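One concrete form of such a hint, on GCC and Clang, is the `__builtin_expect` builtin (commonly wrapped in an `unlikely()` macro; C++20 has `[[likely]]`/`[[unlikely]]` attributes for the same purpose). A sketch:

```c
#include <stddef.h>

/* unlikely() tells the compiler this condition is almost never true,
 * so it can lay out the hot path as straight-line fall-through code. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sum an array, bailing out on a (rare) negative element. */
long sum_nonnegative(const int *v, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))  /* the exceptional case */
            return -1;
        total += v[i];
    }
    return total;
}
```

The hint changes code layout, not correctness; the function behaves identically with or without it.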
| branko_d wrote:
| > it's rare for a branch result to be random
|
| How rare, though?
|
| QuickSort has fundamentally unpredictable branches, and it's
| a pretty widely used algorithm. Binary search, B-trees also
| come to mind.
| kllrnohj wrote:
| Binary searching is quite slow and should be used sparingly,
| not necessarily because of branch misprediction but because of
| memory stalls - you're almost always guaranteed to have a cache
| miss during the search. Similarly for B-trees, it's the memory
| stalls that you're probably more focused on addressing, not
| branch mispredicts.
| emn13 wrote:
| This probably depends on the size of the area to be
| searched, and just how hot that region is. After all, if
| it's fairly small, there won't be any cache misses, and
| the data structure does use less memory than a typical
| hash table, which is itself an advantage.
| orf wrote:
| If the size of the data is small, a linear search through
| a contiguous array is going to be far faster than
| anything more complex.
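A sketch of the two options being compared. For small n, the linear scan's single, highly predictable branch and sequential access pattern tend to win; for large n, binary search's O(log n) comparisons win despite each probe jumping around memory with a ~50/50 direction branch.

```c
/* Linear scan: one very predictable loop branch, perfectly
 * sequential memory access, no sortedness required. */
int linear_search(const int *v, int n, int key) {
    for (int i = 0; i < n; i++)
        if (v[i] == key) return i;
    return -1;
}

/* Binary search over a sorted array: O(log n) probes, but each probe
 * is a likely cache miss and the direction branch is hard to predict. */
int binary_search(const int *v, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (v[mid] == key) return mid;
        if (v[mid] < key)  lo = mid + 1;
        else               hi = mid - 1;
    }
    return -1;
}
```

Where the crossover sits depends on element size, cache state, and how hot the data is, so it's worth measuring rather than assuming.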
| 0x000xca0xfe wrote:
| - Side effects; how do you handle two different writes to
| memory?
|
| - Double the execution units; very expensive for wide vector
| units
|
| - Massive waste of energy as half the resources will always be
| wasted no matter what
|
| - Bad scaling, i.e. four branches ahead would require 16x the
| resources
| isotypic wrote:
| Handling two different writes to memory is not really a
| concern - existing speculative/out of order processors
| already solve this issue by completing (perform architectural
| side effects) instructions in-order. So even if two writes
| are made, one in each branch, by the time the write is meant
| to be completed, the prior branch is resolved and we know
| which write is actually meant to be made and the bad one can
| be discarded.
|
| Doubling the execution units also isn't strictly needed - you
| can use the existing out-of-order core to send two sets of
| instructions through the same functional units. There will be
| more contention for the resources, possibly causing stalls,
| but you don't need to fully double everything.
|
| Things similar to this idea are already done in processors -
| simultaneous multithreading, early branch resolution,
| conditional instructions, are all ideas that have similar
| implementation difficulties. So the reason this specific idea
| is not done is more in line with your last two points rather
| than the first two.
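The in-order-completion point about stores can be sketched as a toy store buffer (my own illustration; real store queues are far more involved): speculative stores are tagged with the path that issued them and sit in a buffer, and when the branch resolves, only the stores from the surviving path are written back, in program order.

```c
#define BUF_SIZE 8

struct pending_store {
    unsigned addr, val;
    int path;  /* which side of the unresolved branch issued this store */
};

static struct pending_store buf[BUF_SIZE];
static int n_pending;
unsigned memory[16];

/* Speculative store: buffered, not yet visible in memory. */
void spec_store(unsigned addr, unsigned val, int path) {
    buf[n_pending++] = (struct pending_store){ addr, val, path };
}

/* Branch resolves: commit stores from the surviving path in program
 * order, silently drop the rest. */
void resolve_branch(int taken_path) {
    for (int i = 0; i < n_pending; i++)
        if (buf[i].path == taken_path)
            memory[buf[i].addr] = buf[i].val;
    n_pending = 0;
}
```

If both paths store to the same address, only the value from the resolved path ever reaches memory, which is exactly why dual-path writes aren't the hard part.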
| eigenform wrote:
| Most branches are biased one way or the other. "Fetching down
| both paths" means not exploiting any information you might have
| gathered about a branch being biased - I think that would be
| equivalent to randomly predicting the branch (except it would
| cost _more_ than a random predictor because you're actually
| fetching both ways instead of just one).
| bsder wrote:
| > why when we have a conditional branch we cannot just fetch
| and prepare instructions for both possible branches and then
| discard the incorrect one?
|
| We actually do that. It's called a GPU. And it sucks for
| general code.
| MBCook wrote:
| We reached 90% accuracy decades ago. Depending on workload
| modern chips can do way better.
|
| So basically it's just nowhere near worth it. Much better to
| use those chip resources for another thread or core.
| IvanAchlaqullah wrote:
| It's always interesting to see decades-old papers, sometimes
| published with little to no fanfare, suddenly become "state of
| the art" because hardware has become powerful enough.
|
| For example, Z-buffers[1]. They're used by 3D video games. When
| the idea was first published, it wasn't even the main topic of
| the paper, just a side note, because it required an expensive
| amount of memory to run.
|
| Turns out megabytes are quite cheap a few decades later, and
| every realtime 3D renderer ended up using it.
|
| [1] https://en.wikipedia.org/wiki/Z-buffering
| bee_rider wrote:
| I sometimes wonder if there's an academic career hidden in
| there for an engineer: go to the library and read what the CS
| folks were publishing on physical papers, maybe there are some
| ideas that can actually be implemented now that weren't
| practical back then.
| findthewords wrote:
| Yes, "read 10 year old papers as a source of ideas ripe for
| commercialization" IS common advice in universities.
| chrisbrandow wrote:
| A post-doc in my chemistry lab had the saying, "two weeks
| in the lab will save you a day in the library"
| pornel wrote:
| Image and video compression are like that. Ideas for
| mainframes in the '80s are realtime algorithms now.
| treprinum wrote:
| Elon is doing precisely that.
| gary_0 wrote:
| Here's a great explanation of branch prediction, starting from
| the earliest implementations:
| https://danluu.com/branch-prediction/
___________________________________________________________________
(page generated 2024-07-26 23:00 UTC)