[HN Gopher] Zen 5's 2-Ahead Branch Predictor: How a 30 Year Old ...
       ___________________________________________________________________
        
       Zen 5's 2-Ahead Branch Predictor: How a 30 Year Old Idea Allows for
       New Tricks
        
       Author : matt_d
       Score  : 86 points
       Date   : 2024-07-26 18:32 UTC (4 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | emn13 wrote:
       | As a novice in this area, it's not clear to me after reading this
       | what exactly the 2-ahead branch predictor is.
        
         | deadmutex wrote:
         | You can check out the seminal paper linked in the article. Or
         | start by summarizing the paper with Gemini, Claude, ChatGPT,
         | etc. to get a high level overview (and then confirm the answer
         | by reading the paper).
        
           | coherentpony wrote:
           | https://link.springer.com/article/10.1007/s10676-024-09775-5
        
         | cpldcpu wrote:
         | My understanding is that they do not predict the target of the
         | next branch but of the one after the next (2-ahead). This is
          | probably much harder than next-branch prediction, but it
          | allows code fetch to be initiated much earlier to feed even
          | deeper pipelines.
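A minimal sketch of that idea in C (illustration only, not AMD's actual design; the direct-mapped table and all names are made up): a table indexed by a branch's address stores the predicted target of the branch *after next*, updated once the next two branches in the stream have been seen.

```c
#include <stddef.h>

#define TBL 256

/* Toy 2-ahead predictor: table_2ahead[pc] holds the predicted target
 * of the SECOND branch downstream of the branch at pc, so fetch for
 * that block can begin two branch-resolutions early. */
static unsigned table_2ahead[TBL];
static unsigned hist[2];   /* addresses of the last two branches seen */
static int seen;

unsigned predict_2ahead(unsigned pc) {
    return table_2ahead[pc % TBL];
}

void train(unsigned pc, unsigned target) {
    if (seen >= 2)                       /* pc is "2 ahead" of hist[0] */
        table_2ahead[hist[0] % TBL] = target;
    hist[0] = hist[1];
    hist[1] = pc;
    if (seen < 2) seen++;
}
```

Feeding it a repeating trace of three branches (at addresses 10, 20, 30 with targets 12, 22, 32) trains each entry to point two branches downstream, e.g. the entry for the branch at 10 learns target 32.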
        
           | emn13 wrote:
           | Ah, that makes sense in the context of the article - thanks!
        
           | layer8 wrote:
           | Surely you must also predict the next branch to predict the
           | one after. Otherwise you wouldn't know which is the one
           | after.
           | 
           | Given that, I still don't understand how predicting the next
           | two branches is different from predicting the next branch and
           | then the next after that, i.e. two times the same thing.
        
       | Szpadel wrote:
        | that's probably a bad idea, but I would like to learn why:
       | 
       | why when we have a conditional branch we cannot just fetch and
       | prepare instructions for both possible branches and then discard
       | the incorrect one?
       | 
        | is this that much harder, or are there other reasons that make
        | this not worth it?
        
         | dymk wrote:
         | Transistor count; now you have to duplicate all the decode and
         | speculative execution circuitry for both possible branches
        
           | immibis wrote:
           | No, the same circuits would execute them interleaved just
           | like they execute multiple hardware threads now
        
             | nomel wrote:
             | With the SMT core count having to be one less.
        
         | wmf wrote:
         | It's a huge waste of energy and in some cases it would even be
         | slower because you'd execute more instructions overall. If the
         | branch mispredict rate is around 1% it's simply not worth
         | paying a penalty 99% of the time to get a gain 1% of the time.
         | Maybe it would be worth doing on low-confidence branches.
        
         | sapiogram wrote:
         | Disclaimer: I don't work in this field, just an enthusiast.
         | 
         | As far as I can tell, branch predictors have always been too
          | good for it to be worth it. Modern CPUs have instruction
         | reorder buffers that are hundreds of instructions deep, so even
         | if only 8 of those instructions are conditional jumps, there's
         | 256 different paths your program could take. If your branch
         | predictor predicts all 8 correctly >50% of the time (It does),
         | doing 256x the work to cover your ass is not worth it.
        
         | swatcoder wrote:
         | Because it's rare for a branch result to be random. The
         | compiler/runtime/cpu/etc can often guess which result is more
         | likely, and correctly not do the extra work in the first place,
          | and so that's usually a better strategy than spending silicon
         | and heat on the wrong answer just in case.
         | 
         | I think a lot of people don't have an intuition about how
         | accurate branch prediction can be, but if you look at your own
         | code, you'll quickly realize "well, yeah, control flow is
         | almost always going to go this way and we just have this branch
         | so we can handle the exceptional case" -- compilers can often
         | deduce this pretty well themselves now, and cpus/jits/runtimes
         | can develop some pretty impressive heuristics as well, and when
         | all those fail you can often add explicit hints in your code
          | that tell your compiler/etc what _you_ expect if they can't
         | guess.
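One such explicit hint is GCC/Clang's `__builtin_expect` (C++20 offers `[[likely]]`/`[[unlikely]]` attributes instead); the `likely`/`unlikely` macro names below are a common convention, not part of the C standard:

```c
#include <stddef.h>

/* GCC/Clang-specific hint: tells the compiler which way the branch
 * usually goes, so it can lay out the hot path as the fall-through. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Typical use: annotate the exceptional case as rare. */
int sum_positive(const int *v, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))   /* the "handle the exceptional case" branch */
            continue;
        s += v[i];
    }
    return s;
}
```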
        
           | branko_d wrote:
           | > it's rare for a branch result to be random
           | 
           | How rare, though?
           | 
           | QuickSort has fundamentally unpredictable branches, and it's
           | a pretty widely used algorithm. Binary search, B-trees also
           | come to mind.
        
             | kllrnohj wrote:
              | Binary searching is quite slow and should be used
              | sparingly, not necessarily because of branch
              | misprediction but because of memory stalls - you're
              | almost always guaranteed to have a cache miss during the
              | search. Similarly for B-trees, it's memory stalls that
              | you're probably more focused on addressing, not branch
              | mispredicts.
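For what it's worth, the mispredict half of that cost can be traded away with a branchless lower-bound search (a well-known technique; the ternary below typically compiles to a conditional move), while the dependent loads, and hence the cache misses described above, remain:

```c
#include <stddef.h>

/* Branchless lower_bound sketch: the comparison steers the base
 * pointer arithmetically instead of via a hard-to-predict branch.
 * Returns the index of the first element >= key. */
size_t lower_bound(const int *a, size_t n, int key) {
    const int *base = a;
    if (n == 0)
        return 0;
    while (n > 1) {
        size_t half = n / 2;
        base += (base[half - 1] < key) ? half : 0;  /* usually a cmov */
        n -= half;
    }
    return (size_t)(base - a) + (base[0] < key);
}
```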
        
               | emn13 wrote:
               | This probably depends on the size of the area to be
               | searched, and just how hot that region is. After all, if
               | it's fairly small, there won't be any cache misses, and
               | the data structure does use less memory than a typical
               | hash table, which is itself an advantage.
        
               | orf wrote:
               | If the size of the data is small, a linear search through
               | a contiguous array is going to be far faster than
               | anything more complex.
        
         | 0x000xca0xfe wrote:
         | - Side effects; how do you handle two different writes to
         | memory?
         | 
         | - Double the execution units; very expensive for wide vector
         | units
         | 
         | - Massive waste of energy as half the resources will always be
         | wasted no matter what
         | 
         | - Bad scaling, i.e. four branches ahead would require 16x the
         | resources
        
           | isotypic wrote:
           | Handling two different writes to memory is not really a
           | concern - existing speculative/out of order processors
            | already solve this issue by completing (performing architectural
           | side effects) instructions in-order. So even if two writes
           | are made, one in each branch, by the time the write is meant
           | to be completed, the prior branch is resolved and we know
           | which write is actually meant to be made and the bad one can
           | be discarded.
           | 
           | Doubling the execution units also isn't strictly needed - you
           | can use the existing out-of-order core to send two sets of
           | instructions through the same functional units. There will be
           | more contention for the resources, possibly causing stalls,
           | but you don't need to fully double everything.
           | 
           | Things similar to this idea are already done in processors -
           | simultaneous multithreading, early branch resolution,
           | conditional instructions, are all ideas that have similar
           | implementation difficulties. So the reason this specific idea
           | is not done is more in line with your last two points rather
           | than the first two.
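A toy model of that commit/squash mechanism (illustration only; buffer sizes and names are made up): speculative stores sit in a buffer in program order and only reach memory at commit, so a resolved mispredict simply drops the younger, wrong-path entries before they become visible.

```c
#include <stddef.h>

#define MEM 16
#define BUF 8

static int mem[MEM];                         /* architectural memory   */
static struct { unsigned addr; int val; } sbuf[BUF];
static size_t pending;                       /* stores not yet committed */

/* Buffered speculative store: not visible in mem yet. */
void spec_store(unsigned addr, int val) {
    sbuf[pending].addr = addr;
    sbuf[pending].val  = val;
    pending++;
}

/* A mispredicted branch resolved: drop all stores younger than idx. */
void squash_from(size_t idx) {
    pending = idx;
}

/* Drain the buffer to memory in program order. */
void commit_all(void) {
    for (size_t i = 0; i < pending; i++)
        mem[sbuf[i].addr % MEM] = sbuf[i].val;
    pending = 0;
}
```

Two stores from rival paths can coexist in the buffer; once the branch resolves, the bad one is squashed and only the good one commits.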
        
         | eigenform wrote:
         | Most branches are biased one way or the other. "Fetching down
         | both paths" means not exploiting any information you might have
         | gathered about a branch being biased - I think that would be
          | equivalent to randomly predicting the branch (except that it
          | would cost _more_ than a random predictor, because you're
          | actually fetching both ways instead of just one).
        
         | bsder wrote:
         | > why when we have a conditional branch we cannot just fetch
         | and prepare instructions for both possible branches and then
         | discard the incorrect one?
         | 
         | We actually do that. It's called a GPU. And it sucks for
         | general code.
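What a GPU does is roughly if-conversion/predication: every lane steps through both sides of the branch and a mask selects each lane's result, so a divergent branch costs the sum of both paths. A scalar sketch of that execution model (the then/else work here is made up for illustration):

```c
#include <stddef.h>

/* SIMT-style predication sketch: both sides are computed for every
 * "lane"; the condition acts as a mask selecting which result to keep.
 * Divergence therefore costs both paths -- fine for uniform data-
 * parallel code, painful for branchy general-purpose code. */
void both_paths(const int *cond, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int then_val = out[i] * 2;               /* "taken" side, all lanes */
        int else_val = out[i] + 1;               /* "not taken" side, all lanes */
        out[i] = cond[i] ? then_val : else_val;  /* select by mask */
    }
}
```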
        
         | MBCook wrote:
         | We reached 90% accuracy decades ago. Depending on workload
         | modern chips can do way better.
         | 
         | So basically it's just nowhere near worth it. Much better to
         | use those chip resources for another thread or core.
        
       | IvanAchlaqullah wrote:
        | It's always interesting to see decades-old papers, sometimes
        | published with little to no fanfare, suddenly become "state of
        | the art" because hardware has become powerful enough.
       | 
        | For example, Z-buffers[1], used by 3d video games. When the idea
        | was first published, it wasn't even the main topic of the
        | paper, just a side note, because it required an expensive
        | amount of memory.
       | 
        | Turns out megabytes became quite cheap a few decades later, and
        | every realtime 3d renderer ended up using it.
       | 
       | [1] https://en.wikipedia.org/wiki/Z-buffering
        
         | bee_rider wrote:
         | I sometimes wonder if there's an academic career hidden in
         | there for an engineer: go to the library and read what the CS
         | folks were publishing on physical papers, maybe there are some
         | ideas that can actually be implemented now that weren't
         | practical back then.
        
           | findthewords wrote:
           | Yes, "read 10 year old papers as a source of ideas ripe for
           | commercialization" IS common advice in universities.
        
             | chrisbrandow wrote:
             | A post-doc in my chemistry lab had the saying, "two weeks
             | in the lab will save you a day in the library"
        
           | pornel wrote:
           | Image and video compression are like that. Ideas for
           | mainframes in the '80s are realtime algorithms now.
        
           | treprinum wrote:
           | Elon is doing precisely that.
        
           | Terr_ wrote:
           | [delayed]
        
       | gary_0 wrote:
       | Here's a great explanation of branch prediction, starting from
        | the earliest implementations:
        | https://danluu.com/branch-prediction/
        
       ___________________________________________________________________
       (page generated 2024-07-26 23:00 UTC)