[HN Gopher] Branch predictor: How many "if"s are too many?
___________________________________________________________________
Branch predictor: How many "if"s are too many?
Author : majke
Score : 215 points
Date : 2021-05-06 13:09 UTC (9 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| jfrunyon wrote:
| > This is visible with block size 64 breaking at the 3072 mark:
| 3072 * 64 = 196K, and for block 32 at 6144: 6144 * 32 = 196K.
|
| You mean 192K.
| omginternets wrote:
| Are there any rules-of-thumb for avoiding branch mispredictions,
| other than reducing the number of conditional branches?
|
| For example (not that I expect this to be true), something of the
| same sort as "your if block should contain the rare condition".
| mids_hn wrote:
| My rule of thumb is to partition my data sets, so that the
| conditional yields the same result for as long a run as possible.
| berkut wrote:
| You can use the __builtin_expect builtin on GCC and clang.
|
| It does make a difference (HPC code uses it quite a bit because
| of this), but as it's done manually, results obviously vary
| depending on processor, and there's a limit to how far you can
| take it. Trying not to branch (not always possible) is best...
|
| Basically, it's like manually doing profile-guided optimisation
| - you look at perf or vtune, and add the above to branches
| which have a high penalty and which are rarely or often taken,
| and see if it makes a difference.
| Veedrac wrote:
| Note that this is not targeted to branch prediction, but
| compiler optimizations. It should be done carefully, because
| it tends to have a larger downside than its upside, and the
| changes the compiler makes based on it can be unpredictable,
| and unstable.
|
| > Trying not to branch (not always possible) is best...
|
| Branches are often faster than branch-free methods, since
| they can speculate certain data dependencies away.
| berkut wrote:
| > Branches are often faster than branch-free methods, since
| they can speculate certain data dependencies away.
|
| And often they're slower, since predicting some branches is
| often impossible due to the data :) So the processor ends
| up mispredicting most of the time, thinking it knows what
| happened last time.
|
| It all depends on what you're doing.
| initplus wrote:
| If you have perf critical branches in your code, you should try
| and make them as predictable as possible. Sometimes that means
| using a different approach that is algorithmically worse, but
| has much more predictable branches.
|
| Classic example: linear search is faster than binary search
| for "small" lists. The item == myItem branch is only taken
| once, at the end. Meanwhile binary search will take the
| branches of its comparison (item < myItem, item > myItem) in
| equal proportion to each other, so the branch predictor is
| stuck at a 50% guess for that branch. There is a great talk
| on this but I can't remember what it's called...
| sischoel wrote:
| It is possible to create a branchless binary search, though.
| Here is one article:
| https://schani.wordpress.com/2010/04/30/linear-vs-binary-
| sea..., and there are probably other ways to do it.
|
| Another interesting question where one could sink a lot of
| time is how to do binary search on GPUs using multiple
| threads. There, branching in multiple threads at the same
| time is also bad, but for slightly different reasons.
| knorker wrote:
| You can hint the compiler:
| https://en.cppreference.com/w/cpp/language/attributes/likely
|
| Or you can use profile directed optimization, as others
| mentioned.
|
| In my test recently it reduced branches as a whole by 30%,
| according to perf stat ./a.out
| Veedrac wrote:
| Branch prediction works predictably well when the branch is
| heavily biased towards one outcome, typically follows a
| repeating pattern of a short length (eg. taken once every three
| iterations), or is strongly correlated with another preceding
| (not necessarily local) branch. Conversely, branch prediction
| works badly when the branch depends on irregular data.
| bob1029 wrote:
| I might be off base here, but you would probably do best to not
| overthink this and let the compiler do its job. Attempting to
| pre-optimize if statements relative to a branch predictor (a
| very hardware-specific thing) is probably a massive fool's
| errand.
| astrange wrote:
| The compiler doesn't know how the CPU branch predictor works
| either. It's not especially good at optimizing this, or
| guessing if a branch is predictable or not, so you might as
| well do it yourself.
| jugg1es wrote:
| This is a pretty amazing analysis that answers questions we
| probably all had as newbie programmers before realizing that code
| readability mattered more than efficiency 99.9% of the time.
| lmilcin wrote:
| In most applications you will not be able to get to the level
| of performance when any of this is going to matter.
|
| For example, if you program Python then there are so many
| more conditionals in the Python runtime itself that you will
| not see any difference from a couple of your own conditionals
| being predicted better or worse. You will also be doing a lot
| of other stuff (like spending time in framework code, calling
| libraries, performing I/O) that will make any gains from this
| pretty insignificant.
|
| Where this comes into the picture is if you are really bent
| on improving the performance of your code on a grand scale
| (say, algorithmic trading or an operating system), when you
| have a very dense, busy inner loop (for example video
| encoding), or when you develop compilers.
| andrepd wrote:
| I think that's more of an indictment of how bad the concept
| of "production python" is.
| hinkley wrote:
| The goalposts are constantly changing, but there are
| refactorings that achieve both.
|
| As a student, real world performance analysis was one of the
| ways I kept myself engaged in otherwise sometimes tedious
| curricula. After college I landed a job far away from home only
| to discover their software was so slow I was embarrassed to be
| associated with it, so I got a quick self-directed study in
| practical optimization.
|
| On any given day, you're much more likely to be amused by your
| own cleverness than anyone else is, so you either do it for
| your own reasons, stop doing it entirely, or find palatable
| forms of expression.
|
| My most concrete example is probably the bizarre love triangle
| between locality of reference, lazy evaluation, and common
| subexpression elimination. There are several islands of sanity
| where reading comprehension is improved by refactoring in ways
| that happen to reduce or avoid calculations.
|
| You clean up the code and it gets 5% faster. You use the same
| pattern in three places and you're comfortably into double
| digits.
|
| As more evidence piled up for me about the dangers of confusing
| code, I spent more time thinking about the legibility and very
| little time thinking about the performance, but the fact is
| that I still do it intuitively. 25yo me would still be pretty
| happy with the performance of my code, even though <mumble>yo
| me gets a little concerned about that level of enthusiasm.
| parhaml wrote:
| I appreciate this comment. We can develop these little
| ingrained patterns that keep things 'optimal' and at the same
| time think about future us having to read the code again.
|
| I know that even 1 week future me appreciates readable code
| at the expense of a _little_ bit of performance.
| simias wrote:
| I agree of course but I'd argue that an array lookup would be
| more readable and more maintainable than this `if` soup in the
| `getCountry` example code. One added benefit for instance is
| that the language will never allow you to declare two values
| for the same index, whereas you could very easily mess up a
| copy/paste and forget to update the condition.
|
| More generally I don't think we should consider that
| readability and efficiency are at opposing ends of the same
| axis. With a bit of care and good tools it's often possible to
| get the best of both worlds, or close enough.
| lttlrck wrote:
| I feel like I'm missing something in the conclusion.
|
| The initial question is:
|
|     if (debug) {
|         log("...");
|     }
|
| > Is it ok to have if clauses that will basically never be run?
|
| And the summary is:
|
| > If it's never-taken, it's probably ok. I found no evidence that
| such branches incur any extra cost. But do avoid always-taken
| branches and function calls.
|
| However, if debug is false in the initial example, the branch
| is _always_ taken, it's jumping over the call to log. Such
| code incurs a penalty - correct?
|
| https://www.godbolt.org/z/o1Yd7oWYa
| Scaevolus wrote:
| Given appropriate likeliness annotations or profile-guided
| optimization, it's possible for the compiler to _invert_ the
| conditional and move the log statement to a "cold" section of
| code outside of the main flow, making it a never-taken branch.
| lttlrck wrote:
| Yes. I guess that's covered by "it's probably ok"... I'm
| going to play in godbolt now.
| Scaevolus wrote:
| -O2 turns it into a "jump if nonzero" without any extra
| annotations: https://www.godbolt.org/z/j9dfrbosv
| jonny_eh wrote:
| Why don't processors fetch and decode both branches?
| brigade wrote:
| That would get exponential real quick. But Intel's Tremont does
| have an instruction decoder that's 6-wide only when decoding a
| taken branch, since it also starts decoding the branch target
| in the same cycle.
| solarexplorer wrote:
| Because doing so would require twice the cache and execution
| bandwidth. It is usually better to take the gamble and fetch
| just one branch. If you don't want to gamble you could stop
| the current thread and continue with the execution of another
| thread (if you have simultaneous multithreading/hyperthreading).
|
| In some cases the compiler may decide to evaluate both paths
| and then "undo" the path that was wrong. This is very common on
| VLIW architectures like Itanium which have hardware support for
| this (predicates), but it can also be done with other
| instruction sets.
| dusted wrote:
| I'm wondering how you guys would optimize that code ?
|
| My naive approach would be something like this:
|
|     const int numCountryIndicies = however many there are..;
|
|     const char* countries = "A1\0A2\0...";
|
|     return (cc < numCountryIndicies) ? countries[cc*2] : "UNKNOWN";
| fallingknife wrote:
| Does a ternary actually eliminate a logic branch? I always
| assumed it was just high level shorthand, and would be the same
| in assembly as writing out the if block.
| throwaway744678 wrote:
| It does not eliminate the conditional jump, it's just
| syntactic sugar.
| PixelOfDeath wrote:
| Before C++14, the ternary was the only way to express "if
| then" in constexpr functions. So it has slightly more use
| from a C++ point of view.
| gpderetta wrote:
| In a funny turn of events, in C++17 there is no ?:
| shorthand for 'constexpr if' .
| MaxBarraclough wrote:
| Following that line, it's never exactly been syntactic
| sugar. When initializing a const local, you can use the
| ternary operator on the right hand side of the
| assignment, but you can't get the same effect using _if_
| (short of writing a new function). Similarly, C++ lets
| you use the ternary operator for the left hand side of an
| assignment, _(a ? b : c) = 42;_ although you could use an
| _if_ for this pretty easily.
| andromeduck wrote:
| Wow I had no idea you could use it on the LHS. TIL.
| MaxBarraclough wrote:
| A TIL of my own: it seems not to be allowed in C.
| TonyTrapp wrote:
| And to make it more complicated, both "regular" ifs and
| ternaries may not generate a branch at all, e.g. by using
| the CMOV (conditional move) instruction on x86. This avoids
| a branch but can obviously not eliminate the data
| dependency between the condition and a later consumer of
| the result.
| messe wrote:
| You have an off-by-one error. The strings are 3 bytes long.
|
| I'd write something similar, more or less; probably the
| following, not for optimization, but more as a matter of style:
|
|     const char countries[][3] = {
|         "A1",
|         "A0",
|         // [...]
|     };
|     // [...]
|     int total_countries = (int) (sizeof(countries) / sizeof(countries[0]));
|     return cc < total_countries ? countries[cc] : "UNKNOWN";
|
| No need to hard code the length, and the cast is guaranteed to
| be within the bounds of an int on all platforms as long as you
| don't go over 2^16-1 countries.
| simias wrote:
| I think that's the best solution (the only potential
| optimization I can think of would be to add a padding byte
| after every country to force 32bit alignment which would make
| the offset computation faster on _some_ architectures,
| although probably not on x86).
|
| Stylistically, I think the best solution would be to write
| the array like:
|
|     const char countries[][3] = {
|         [0] = "A1",
|         [1] = "A0",
|     };
|
| This way the codes are explicit when you read the code and it
| makes editing the array a little easier. Unfortunately gcc
| only warns if you set the same index twice with -Wextra, it
| remains silent with -Wall.
| messe wrote:
| > This way the codes are explicit when you read the code
|
| Good call. It might have caught the off-by-one mistake I
| made (the codes start from 1 and not 0 in the blog post).
| Maybe even switch to using an enum as our index instead of
| an int.
| dusted wrote:
| Well caught! I have two off-by-one errors :D It seems they
| start from 1 instead of 0, and I didn't account for the null
| byte :D Your solution is nicer and easier to read :)
| Cerealkiller050 wrote:
| This jives with the overall point of the article: keep it
| simple and readable. The semi-fancy indexing caused 2 errors,
| and the upside is negligible after the compiler gets done
| with it.
| majke wrote:
| The latest gcc 11 can optimize the ifs nicely:
| https://godbolt.org/z/771foExcG
|
|     getCountry:
|         mov     eax, OFFSET FLAT:.LC0
|         cmp     edi, 258
|         ja      .L1
|         mov     edi, edi
|         mov     rax, QWORD PTR CSWTCH.1[0+rdi*8]
|     .L1:
|         ret
| qayxc wrote:
| clang does the same - I tested it back to clang 7.0.
|
| So clang has done this for a long time, it seems.
| Denvercoder9 wrote:
| Anyone know what the purpose of the mov edi, edi instruction
| there is?
|
| Edited to add: I understand that it's a NOP, but why would
| the compiler emit one here?
| Const-me wrote:
| That instructions clears upper 4 bytes of the rdi register.
| Note the next instruction uses rdi in the address.
|
| edi register is the lower 4 bytes of rdi. Instructions
| which write these smaller pieces zero out the unused higher
| bytes of the destination registers. This helps with
| performance because eliminates data dependencies on the old
| values in these higher bytes.
| JoeAltmaier wrote:
| It does nothing, so it's a no-op of sorts. I wonder if it's a
| branch-delay tactic of some kind? Surely the algorithm would
| be unchanged if it were removed.
| colejohnson66 wrote:
| It's just a two-byte NOP. IIRC, it's Intel's recommended
| form for one of that length. Windows uses it for hot
| patching [0], but I can't imagine that's the reason here.
|
| [0]: https://devblogs.microsoft.com/oldnewthing/20110921-00
| /?p=95...
| [deleted]
| teachingassist wrote:
| I immediately wanted to rewrite it as an array, yep, since the
| countries are already indexed.
|
| And to minimize the risk of errors such as the "Anguila" error
| above.
| rantwasp wrote:
| it's probably a switch on the index (or in this case you could
| actually have an array and use the index).
|
| I don't know my compilers that well but if I had to guess I
| would say there is a good chance this will be optimized away by
| the compiler.
| lmilcin wrote:
| > I'm wondering how you guys would optimize that code ?
|
| Run it on an infinite multiverse, and construct a mechanism
| to destroy the universe every time the branch predictor does
| not predict every single conditional correctly.
| dylan604 wrote:
| won't we need quantum computing for that first?
| lmilcin wrote:
| You don't need that.
|
| The branch predictor can select a random branch, so it
| doesn't even need to actually predict anything. It should
| use quantum noise so that it has a chance of selecting
| different branches in different copies of the universe.
| There is nothing to compute, quantum or otherwise.
|
| The bomb to destroy the universe will take care of all
| universes where it made wrong predictions, so that is where
| we should concentrate our development resources.
|
| It has more utility than just branch prediction. Imagine
| the bomb going off automatically whenever a war starts.
|
| Any copy of the universe that starts a war is automatically
| eliminated, and this guarantees that you, as an observer,
| will never observe any wars.
| Narishma wrote:
| It reaches out it reaches out it reaches out it reaches out--
| One hundred and thirteen times a second.
| jfrunyon wrote:
| > This code is in a performance critical loop and it looks
| like a waste - we never run with the "debug" flag enabled [1].
|
| > 1. One historical solution to this specific 'if debug'
| problem is called "runtime nop'ing".
|
| You know what another historical solution is called? Use a
| compile-time constant and let the damn optimizer optimize.
| (Virtually) no modern, widely used compiler would fail to
| optimize out `if (0) { ... }`.
|
| Also, why does M1 performance matter to Cloudflare's software? I
| kinda doubt their servers are on Macbooks.
| majke wrote:
| We do run aarch64. We are very interested in whether the M1
| is faster than the ARMs we have.
|
| https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-...
|
| https://blog.cloudflare.com/porting-our-software-to-arm64/
| throwaway29303 wrote:
| The chart shows funny alignment issues. It's unclear what they
| are caused by.
|
| I think you've found its heartbeat. :)
| OskarS wrote:
| Great article, just to add one thing: conditional branches can
| seriously mess with the compiler's ability to optimize,
| especially the ability to auto-vectorize. So even if a never-
| taken branch is more or less free when the CPU executes it, just
| having it there might have resulted in much less efficient
| codegen. Just another thing to keep in mind with this issue.
| solarexplorer wrote:
| If you care about performance and the number of if's you should
| look into profile guided optimization. This can help the compiler
| to arrange the code such that branches like the "if (debug)" in
| the example, are never taken and so don't occupy precious
| resources. This will also make instruction fetch faster and if
| the body of the if is placed on another cache line it can also
| improve instruction cache performance (and even TLB performance
| and IO performance if the unused code is placed into another
| page).
|
| https://llvm.org/docs/HowToBuildWithPGO.html
| throwaway744678 wrote:
| Unrelated to the actual topic, but I believe there is a copy-
| paste bug in the first sample:
|
|     const char *getCountry(int cc) {
|         if(cc == 1) return "A1";
|         if(cc == 2) return "A2";
|         if(cc == 3) return "O1";
|         if(cc == 4) return "AD";
|         if(cc == 5) return "AE";
|         if(cc == 6) return "AF";
|         if(cc == 7) return "AG";
|         if(cc == 1) return "AI";
|         ...
|         if(cc == 252) return "YT";
|         if(cc == 253) return "ZA";
|         if(cc == 254) return "ZM";
|         if(cc == 255) return "ZW";
|         if(cc == 256) return "XK";
|         if(cc == 257) return "T1";
|         return "UNKNOWN";
|     }
|
| This will never return "AI" (Anguila, as it seems!)
| klohto wrote:
| Full example doesn't have this mistake, so I would assume it's
| not in production :) https://godbolt.org/z/KWYEW3d9s
| tuesdayrain wrote:
| If that was javascript you could make it return "AI" with the
| right `valueOf` call - https://developer.mozilla.org/en-
| US/docs/Web/JavaScript/Refe...
| crazypython wrote:
| Wouldn't the compiler optimize this to an array lookup?
|
|     static const char* countries[257];
|
| Or even:
|
|     static const char countrycodes[514];
|     return countrycodes[i * 2];
| jfrunyon wrote:
|     static const char countrycodes[514];
|     return countrycodes[i * 2];
|
| Err... each of the strings is 3 bytes long.
| dahfizz wrote:
| Doesn't look like it. Each case is a compare and jump:
|
| https://godbolt.org/z/KWYEW3d9s
| ChrisSD wrote:
| Turn the gcc version up to 11.
| nitwit005 wrote:
| Playing around a bit, it does seem that clang will do so:
| https://godbolt.org/z/1czW91bMM
|
| A presentation on algorithm improvements popped up when
| looking into it (2015):
| https://llvm.org/devmtg/2015-10/slides/Wennborg-
| SwitchLoweri...
| [deleted]
| superjan wrote:
| The "if (debug) log" example in the article is likely compiled to
| an always taken conditional jump past the logging call. And
| therefore not quite free. Nitpicking of course, I love such in-
| depth articles.
| xenadu02 wrote:
| You can easily verify this in your specific code base but I
| believe the compiler prefers not-taken jumps for paths it
| predicts to be less frequently taken.
|
| You can provide hints that the path is a cold one to ensure the
| compiler uses the most efficient conditional branch layout. On
| some architectures the compiler will also provide this hint to
| the CPU (with a prefix, the instruction it selects, or a flag
| bit in the instruction) which avoids the perf hit on the first
| iteration.
|
| Both Clang and GCC recognize __builtin_expect, which can be
| wrapped in a macro easily:
|
|     #define unlikely(x) __builtin_expect(!!(x), 0)
| the_duke wrote:
| C++ 20 introduces standard likely/unlikely attributes:
| https://en.cppreference.com/w/cpp/language/attributes/likely
| astrange wrote:
| clang also has `__builtin_unpredictable` which is at least
| as useful as likely/unlikely.
| pmarreck wrote:
| Couple of thoughts here:
|
| > if (debug)
|
| The language Elixir, during its compilation phase, actually
| automatically removes these in the production environment. That
| is to say, it is a macro which behaves as expected in every
| environment but "prod", in which case it removes itself.
|
| > conditionals
|
| A number of years ago I used Ruby to experiment with writing a
| version of fizzbuzz that avoided conditionals entirely, and was
| purely functional:
|
| https://github.com/pmarreck/ruby-snippets/blob/master/functi...
|
| (I actually regret using currying here because it hurts the
| portability of the algorithm)
|
| While it may be difficult (if possible at all) to convert
| all conditional logic to functional logic in a given
| algorithm, perhaps a compiler could do the work, if certain
| patterns emerged.
|
| I'm not a C guy, but I have to wonder if such an algorithm would
| run faster on Intel or ARM chips by avoiding all branching and
| branch prediction. Can someone chime in on this?
| flakiness wrote:
| In the old days people used #ifdef to compile things out so
| that the "production code" didn't have any unnecessary
| branches. I was shocked when I first saw these living if()s
| in server-side C++ code, but then realized they were vital
| for debugging in production.
|
| People also used to prefer table-based jumps over long
| if-else-if chains for anecdotal performance reasons. That has
| gradually changed over the evolution of CPUs with various
| optimizations like branch prediction. Now some JIT
| implementations, like tracing JITs, replace virtual function
| calls with if-elses.
|
| By the way, I'm glad that the article is so thoughtfully
| written that it doesn't include any ambiguous "best
| practices" at the end. It could have created yet another
| myth. The three top tips in the article are more about
| stating facts. I admire the author for taking that stance.
___________________________________________________________________
(page generated 2021-05-06 23:01 UTC)