[HN Gopher] Branch predictor: How many "if"s are too many?
       ___________________________________________________________________
        
       Branch predictor: How many "if"s are too many?
        
       Author : majke
       Score  : 215 points
       Date   : 2021-05-06 13:09 UTC (9 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | kissgyorgy wrote:
       | 2
        
       | jfrunyon wrote:
        | > This is visible with block size 64 breaking at 3072 mark:
        | 3072*64=196K, and for block 32 at 6144: 6144*32=196K.
       | 
       | You mean 192K.
        
       | omginternets wrote:
       | Are there any rules-of-thumb for avoiding branch mispredictions,
       | other than reducing the number of conditional branches?
       | 
       | For example (not that I expect this to be true), something of the
       | same sort as "your if block should contain the rare condition".
        
         | mids_hn wrote:
         | My rule of thumb is to partition my data sets, to make sure the
         | conditional has the same result in as long a row as possible.
        
         | berkut wrote:
          | You can use __builtin_expect on GCC and clang.
         | 
         | It does make a difference (HPC code uses it quite a bit because
         | of this), but as it's done manually, results obviously vary
         | depending on processor, and there's a limit to how far you can
         | take it. Trying not to branch (not always possible) is best...
         | 
         | Basically, it's like manually doing profile-guided optimisation
         | - you look at perf or vtune, and add the above to branches
         | which have a high penalty and which are rarely or often taken,
         | and see if it makes a difference.
        
           | Veedrac wrote:
           | Note that this is not targeted to branch prediction, but
           | compiler optimizations. It should be done carefully, because
           | it tends to have a larger downside than its upside, and the
           | changes the compiler makes based on it can be unpredictable,
           | and unstable.
           | 
           | > Trying not to branch (not always possible) is best...
           | 
           | Branches are often faster than branch-free methods, since
           | they can speculate certain data dependencies away.
        
             | berkut wrote:
             | > Branches are often faster than branch-free methods, since
             | they can speculate certain data dependencies away.
             | 
              | And often they're slower, since predicting some
              | branches is often impossible due to the data :) So the
              | processor ends up mispredicting most of the time,
              | thinking it knows what happened last time.
             | 
             | It all depends on what you're doing.
        
         | initplus wrote:
         | If you have perf critical branches in your code, you should try
         | and make them as predictable as possible. Sometimes that means
         | using a different approach that is algorithmically worse, but
         | has much more predictable branches.
         | 
          | A classic example is that linear search is faster than
          | binary search for "small" lists. The item == myItem branch
          | is only taken once, at the end. Meanwhile binary search
          | will take the branches of its comparison (item < myItem,
          | item > myItem) in equal proportion to each other, so the
          | branch predictor is stuck at a
         | 50% guess for that branch. There is a great talk on this but I
         | can't remember what it's called...
        
           | sischoel wrote:
           | Although it is possible to create branchless binary search.
           | Here is one article:
           | https://schani.wordpress.com/2010/04/30/linear-vs-binary-
           | sea..., there are probably other ways to do that.
           | 
           | Another interesting question where one could sink a lot of
           | time is how do to binary search on GPU's using multiple
           | threads. There branching in multiple threads at the same time
           | is also bad, but for slightly different reasons.
        
         | knorker wrote:
         | You can hint the compiler:
         | https://en.cppreference.com/w/cpp/language/attributes/likely
         | 
         | Or you can use profile directed optimization, as others
         | mentioned.
         | 
         | In my test recently it reduced branches as a whole by 30%,
         | according to perf stat ./a.out
        
         | Veedrac wrote:
         | Branch prediction works predictably well when the branch is
         | heavily biased towards one outcome, typically follows a
         | repeating pattern of a short length (eg. taken once every three
         | iterations), or is strongly correlated with another preceding
          | (not necessarily local) branch. Conversely, branch
          | prediction works badly when the branch depends on
          | irregular data.
        
         | bob1029 wrote:
         | I might be off base here, but you would probably do best to not
         | overthink this and let the compiler do its job. Attempting to
         | pre-optimize if statements relative to a branch predictor (a
         | very hardware-specific thing) is probably a massive fool's
         | errand.
        
           | astrange wrote:
           | The compiler doesn't know how the CPU branch predictor works
           | either. It's not especially good at optimizing this, or
           | guessing if a branch is predictable or not, so you might as
           | well do it yourself.
        
       | jugg1es wrote:
       | This is a pretty amazing analysis that answers questions we
       | probably all had as newbie programmers before realizing that code
       | readability mattered more than efficiency 99.9% of the time.
        
         | lmilcin wrote:
          | In most applications you will not be able to get to the
          | level of performance where any of this is going to matter.
          | 
          | For example, if you program Python, there are so many more
          | conditionals in the Python runtime itself that you will
          | not see any difference from a couple of your conditionals
          | being predicted better or worse. You will also be doing a
          | lot of other stuff (like spending time in framework code,
          | calling libraries, performing I/O) that will make any
          | gains from this pretty insignificant.
         | 
          | Where this comes into the picture is when you are really
          | bent on improving the performance of your code on a grand
          | scale (when you develop algorithmic trading systems or
          | maybe an operating system), when you have a very dense,
          | busy inner loop (for example video encoding), or when you
          | develop compilers.
        
           | andrepd wrote:
           | I think that's more of an indictment of how bad the concept
           | of "production python" is.
        
         | hinkley wrote:
         | The goalposts are constantly changing, but there are
         | refactorings that achieve both.
         | 
         | As a student, real world performance analysis was one of the
         | ways I kept myself engaged in otherwise sometimes tedious
         | curricula. After college I landed a job far away from home only
         | to discover their software was so slow I was embarrassed to be
         | associated with it, so I got a quick self-directed study in
         | practical optimization.
         | 
         | On any given day, you're much more likely to be amused by your
         | own cleverness than anyone else is, so you either do it for
         | your own reasons, stop doing it entirely, or find palatable
         | forms of expression.
         | 
         | My most concrete example is probably the bizarre love triangle
         | between locality of reference, lazy evaluation, and common
         | subexpression elimination. There are several islands of sanity
         | where reading comprehension is improved by refactoring in ways
         | that happen to reduce or avoid calculations.
         | 
         | You clean up the code and it gets 5% faster. You use the same
         | pattern in three places and you're comfortably into double
         | digits.
         | 
         | As more evidence piled up for me about the dangers of confusing
         | code, I spent more time thinking about the legibility and very
         | little time thinking about the performance, but the fact is
         | that I still do it intuitively. 25yo me would still be pretty
         | happy with the performance of my code, even though <mumble>yo
         | me gets a little concerned about that level of enthusiasm.
        
           | parhaml wrote:
           | I appreciate this comment. We can develop these little
           | ingrained patterns that keep things 'optimal' and at the same
           | time think about future us having to read the code again.
           | 
           | I know that even 1 week future me appreciates readable code
           | at the expense of a _little_ bit of performance.
        
         | simias wrote:
         | I agree of course but I'd argue that an array lookup would be
         | more readable and more maintainable than this `if` soup in the
         | `getCountry` example code. One added benefit for instance is
         | that the language will never allow you to declare two values
         | for the same index, whereas you could very easily mess up a
         | copy/paste and forget to update the condition.
         | 
         | More generally I don't think we should consider that
         | readability and efficiency are at opposing ends of the same
         | axis. With a bit of care and good tools it's often possible to
         | get the best of both worlds, or close enough.
        
       | lttlrck wrote:
       | I feel like I'm missing something in the conclusion.
       | 
        | The initial question is:
        | 
        |     if (debug) {
        |         log("...");
        |     }
       | 
       | > Is it ok to have if clauses that will basically never be run?
       | 
       | And the summary is:
       | 
       | > If it's never-taken, it's probably ok. I found no evidence that
       | such branches incur any extra cost. But do avoid always-taken
       | branches and function calls.
       | 
        | However, if debug is false in the initial example, the
        | branch is _always_ taken, it's jumping over the call to log.
        | Such code incurs a penalty - correct?
       | 
       | https://www.godbolt.org/z/o1Yd7oWYa
        
         | Scaevolus wrote:
         | Given appropriate likeliness annotations or profile-guided
         | optimization, it's possible for the compiler to _invert_ the
         | conditional and move the log statement to a "cold" section of
         | code outside of the main flow, making it a never-taken branch.
        
           | lttlrck wrote:
           | Yes. I guess that's covered by "it's probably ok"... I'm
           | going to play in godbolt now.
        
             | Scaevolus wrote:
             | -O2 turns it into a "jump if nonzero" without any extra
             | annotations: https://www.godbolt.org/z/j9dfrbosv
        
       | jonny_eh wrote:
       | Why don't processors fetch and decode both branches?
        
         | brigade wrote:
         | That would get exponential real quick. But Intel's Tremont does
         | have an instruction decoder that's 6-wide only when decoding a
         | taken branch, since it also starts decoding the branch target
         | in the same cycle.
        
         | solarexplorer wrote:
         | Because doing so would require twice the cache and execution
         | bandwidth. It is usually better to take the gamble and fetch
         | just one branch. If you don't want to gamble you could stop the
         | current thread and continue with the execution of another
          | thread (if you have simultaneous multithreading/hyper-
          | threading).
         | 
         | In some cases the compiler may decide to evaluate both paths
         | and then "undo" the path that was wrong. This is very common on
         | VLIW architectures like Itanium which have hardware support for
         | this (predicates), but it can also be done with other
         | instruction sets.
        
       | dusted wrote:
       | I'm wondering how you guys would optimize that code ?
       | 
       | My naive approach would be something like this:
       | 
       | const int numCountryIndicies = however many there are..;
       | 
       | const char* countries = "A1\0A2\0...";
       | 
       | return (cc < numCountryIndicies)?countries[cc*2]:"UNKNOWN";
        
         | fallingknife wrote:
         | Does a ternary actually eliminate a logic branch? I always
         | assumed it was just high level shorthand, and would be the same
         | in assembly as writing out the if block.
        
           | throwaway744678 wrote:
            | It does not eliminate the conditional jump, it's just
            | syntactic sugar.
        
             | PixelOfDeath wrote:
              | Until C++14, the ternary was the only way to express
              | "if then" in constexpr functions. So it has slightly
              | more use from a C++ point of view.
        
               | gpderetta wrote:
               | In a funny turn of events, in C++17 there is no ?:
               | shorthand for 'constexpr if' .
        
               | MaxBarraclough wrote:
               | Following that line, it's never exactly been syntactic
               | sugar. When initializing a const local, you can use the
               | ternary operator on the right hand side of the
               | assignment, but you can't get the same effect using _if_
               | (short of writing a new function). Similarly, C++ lets
               | you use the ternary operator for the left hand side of an
               | assignment, _(a ? b : c) = 42;_ although you could use an
               | _if_ for this pretty easily.
        
               | andromeduck wrote:
               | Wow I had no idea you could use it on the LHS. TIL.
        
               | MaxBarraclough wrote:
               | A TIL of my own: it seems not to be allowed in C.
        
             | TonyTrapp wrote:
             | And to make it more complicated, both "regular" ifs and
             | ternaries may not generate a branch at all, e.g. by using
             | the CMOV (conditional move) instruction on x86. This avoids
             | a branch but can obviously not eliminate the data
             | dependency between the condition and a later consumer of
             | the result.
        
         | messe wrote:
         | You have an off-by-one error. The strings are 3 bytes long.
         | 
          | I'd write something similar, more or less; Probably the
          | following, not for optimization, but more as a matter of
          | style:
          | 
          |     const char countries[][3] = {
          |         "A1",
          |         "A0",
          |         // [...]
          |     };
          | 
          |     // [...]
          |     int total_countries =
          |         (int) (sizeof(countries) / sizeof(countries[0]));
          |     return cc < total_countries ?
          |         countries[cc] : "UNKNOWN";
         | 
         | No need to hard code the length, and the cast is guaranteed to
         | be within the bounds of an int on all platforms as long as you
         | don't go over 2^16-1 countries.
        
           | simias wrote:
           | I think that's the best solution (the only potential
           | optimization I can think of would be to add a padding byte
           | after every country to force 32bit alignment which would make
           | the offset computation faster on _some_ architectures,
           | although probably not on x86).
           | 
            | Stylistically I think the best solution would be to
            | write the array like:
            | 
            |     const char countries[][3] = {
            |         [0] = "A1",
            |         [1] = "A0",
            |     };
           | 
           | This way the codes are explicit when you read the code and it
           | makes editing the array a little easier. Unfortunately gcc
           | only warns if you set the same index twice with -Wextra, it
           | remains silent with -Wall.
        
             | messe wrote:
             | > This way the codes are explicit when you read the code
             | 
             | Good call. It might have caught the off-by-one mistake I
             | made (the codes start from 1 and not 0 in the blog post).
             | Maybe even switch to using an enum as our index instead of
             | an int.
        
           | dusted wrote:
           | Well caught! I have two off by one errors :D seems they start
           | from 1 instead of 0, and I didn't account for the null byte
           | :D your solution is nicer and easier to read :)
        
             | Cerealkiller050 wrote:
              | This jibes with the overall point of the article: keep
              | it simple and readable. The semi-fancy indexing caused
              | two errors and the upside is negligible after the
              | compiler gets done with it.
        
         | majke wrote:
          | Modern freshest gcc 11 can optimize the if's nicely:
          | https://godbolt.org/z/771foExcG
          | 
          |     getCountry:
          |             mov     eax, OFFSET FLAT:.LC0
          |             cmp     edi, 258
          |             ja      .L1
          |             mov     edi, edi
          |             mov     rax, QWORD PTR CSWTCH.1[0+rdi*8]
          |     .L1:
          |             ret
        
           | qayxc wrote:
           | clang does the same - I tested it back to clang 7.0.
           | 
           | So clang did this for a long time it seems.
        
           | Denvercoder9 wrote:
           | Anyone know what the purpose of the mov edi, edi instruction
           | there is?
           | 
           | Edited to add: I understand that it's a NOP, but why would
           | the compiler emit one here?
        
             | Const-me wrote:
              | That instruction clears the upper 4 bytes of the rdi
              | register. Note the next instruction uses rdi in the
              | address.
              | 
              | The edi register is the lower 4 bytes of rdi.
              | Instructions which write these smaller pieces zero out
              | the unused higher bytes of the destination register.
              | This helps with performance because it eliminates data
              | dependencies on the old values in those higher bytes.
        
             | JoeAltmaier wrote:
              | It does nothing, so it's a no-op of sorts. I wonder if
              | it's a branch-delay tactic of some kind? Surely the
              | algorithm is unchanged if it were removed.
        
             | colejohnson66 wrote:
             | It's just a two byte NOP. IIRC, it's Intel's recommended
             | form for one of that length. Windows uses it for hot
             | patching,[0], but I can't imagine that's the reason here.
             | 
             | [0]: https://devblogs.microsoft.com/oldnewthing/20110921-00
             | /?p=95...
        
           | [deleted]
        
         | teachingassist wrote:
         | I immediately wanted to rewrite it as an array, yep, since the
         | countries are already indexed.
         | 
         | And to minimize the risk of errors such as the "Anguila" error
         | above.
        
         | rantwasp wrote:
         | it's probably a switch on the index (or in this case you could
         | actually have an array and use the index).
         | 
         | I don't know my compilers that well but if I had to guess I
         | would say there is a good chance this will be optimized away by
         | the compiler.
        
         | lmilcin wrote:
         | > I'm wondering how you guys would optimize that code ?
         | 
         | Run it on infinite multiverse, construct a mechanism to destroy
         | the Universe every time branch predictor did not predict every
         | single conditional correctly.
        
           | dylan604 wrote:
           | won't we need quantum computing for that first?
        
             | lmilcin wrote:
             | You don't need that.
             | 
             | The branch predictor can select a random branch, so it
             | doesn't even need to actually predict anything. It should
             | use quantum noise so that it has chance of selecting
             | different branches in different copies of the universe.
              | There is nothing to compute, quantum or otherwise.
             | 
             | The bomb to destroy the universe will take care of all
             | universes where it made wrong predictions, so that is where
             | we should concentrate our development resources.
             | 
              | It has more utility than just branch prediction.
              | Imagine the bomb going off automatically whenever a
              | war starts.
             | 
             | Any copy of the universe that starts a war is automatically
             | eliminated and this guarantees that you, as an observer,
             | will never observe any wars.
        
           | Narishma wrote:
           | It reaches out it reaches out it reaches out it reaches out--
           | One hundred and thirteen times a second.
        
       | jfrunyon wrote:
        | > This code is in a performance critical loop and it looks
        | like a waste - we never run with the "debug" flag
        | enabled[1].
        | 
        | > 1. One historical solution to this specific 'if debug'
        | problem is called "runtime nop'ing".
       | 
       | You know what another historical solution is called? Use a
       | compile-time constant and let the damn optimizer optimize.
       | (Virtually) no modern, widely used compiler would fail to
       | optimize out `if (0) { ... }`.
       | 
       | Also, why does M1 performance matter to Cloudflare's software? I
       | kinda doubt their servers are on Macbooks.
        
         | majke wrote:
         | We do run aarch64. We are very interested if M1 is faster than
         | the ARM's we have.
         | 
         | https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-...
         | 
         | https://blog.cloudflare.com/porting-our-software-to-arm64/
        
       | throwaway29303 wrote:
        | > The chart shows funny alignment issues. It's unclear what
        | they are caused by.
       | 
       | I think you've found its heartbeat. :)
        
       | OskarS wrote:
       | Great article, just to add one thing: conditional branches can
       | seriously mess with the compiler's ability to optimize,
       | especially the ability to auto-vectorize. So even if a never-
       | taken branch is more or less free when the CPU executes it, just
       | having it there might have resulted in much less efficient
       | codegen. Just another thing to keep in mind with this issue.
        
       | solarexplorer wrote:
       | If you care about performance and the number of if's you should
       | look into profile guided optimization. This can help the compiler
        | to arrange the code such that branches like the
        | "if (debug)" in the example are never taken and so don't
        | occupy precious
       | resources. This will also make instruction fetch faster and if
       | the body of the if is placed on another cache line it can also
       | improve instruction cache performance (and even TLB performance
       | and IO performance if the unused code is placed into another
       | page).
       | 
       | https://llvm.org/docs/HowToBuildWithPGO.html
        
       | throwaway744678 wrote:
        | Unrelated to the actual topic, but I believe there is a
        | copy-paste bug in the first sample:
        | 
        |     const char *getCountry(int cc) {
        |         if(cc == 1) return "A1";
        |         if(cc == 2) return "A2";
        |         if(cc == 3) return "O1";
        |         if(cc == 4) return "AD";
        |         if(cc == 5) return "AE";
        |         if(cc == 6) return "AF";
        |         if(cc == 7) return "AG";
        |         if(cc == 1) return "AI";
        |         ...
        |         if(cc == 252) return "YT";
        |         if(cc == 253) return "ZA";
        |         if(cc == 254) return "ZM";
        |         if(cc == 255) return "ZW";
        |         if(cc == 256) return "XK";
        |         if(cc == 257) return "T1";
        |         return "UNKNOWN";
        |     }
       | 
       | This will never return "AI" (Anguila, as it seems!)
        
         | klohto wrote:
         | Full example doesn't have this mistake, so I would assume it's
         | not in production :) https://godbolt.org/z/KWYEW3d9s
        
         | tuesdayrain wrote:
         | If that was javascript you could make it return "AI" with the
         | right `valueOf` call - https://developer.mozilla.org/en-
         | US/docs/Web/JavaScript/Refe...
        
         | crazypython wrote:
          | Wouldn't the compiler optimize this to an array lookup?
          | 
          |     static const char* countries[257];
          | 
          | Or even:
          | 
          |     static const char countrycodes[514];
          |     return countrycodes[i * 2];
        
           | jfrunyon wrote:
            |     static const char countrycodes[514];
            |     return countrycodes[i * 2];
            | 
            | Err... each of the strings is 3 bytes long.
        
           | dahfizz wrote:
           | Doesn't look like it. Each case is a compare and jump:
           | 
           | https://godbolt.org/z/KWYEW3d9s
        
             | ChrisSD wrote:
             | Turn the gcc version up to 11.
        
             | nitwit005 wrote:
             | Playing around a bit, it does seem that clang will do so:
             | https://godbolt.org/z/1czW91bMM
             | 
             | A presentation on algorithm improvements popped up when
             | looking into it (2015):
             | https://llvm.org/devmtg/2015-10/slides/Wennborg-
             | SwitchLoweri...
        
             | [deleted]
        
       | superjan wrote:
       | The "if (debug) log" example in the article is likely compiled to
       | an always taken conditional jump past the logging call. And
       | therefore not quite free. Nitpicking of course, I love such in-
       | depth articles.
        
         | xenadu02 wrote:
         | You can easily verify this in your specific code base but I
         | believe the compiler prefers not-taken jumps for paths it
         | predicts to be less frequently taken.
         | 
         | You can provide hints that the path is a cold one to ensure the
         | compiler uses the most efficient conditional branch layout. On
         | some architectures the compiler will also provide this hint to
         | the CPU (with a prefix, the instruction it selects, or a flag
         | bit in the instruction) which avoids the perf hit on the first
         | iteration.
         | 
         | Both Clang and GCC recognize __builtin_expect which can be
         | wrapped in a macro easily:
         | 
         | #define unlikely(x) __builtin_expect(!!(x), 0)
        
           | the_duke wrote:
           | C++ 20 introduces standard likely/unlikely attributes:
           | https://en.cppreference.com/w/cpp/language/attributes/likely
        
             | astrange wrote:
             | clang also has `__builtin_unpredictable` which is at least
             | as useful as likely/unlikely.
        
       | pmarreck wrote:
       | Couple of thoughts here:
       | 
       | > if (debug)
       | 
       | The language Elixir, during its compilation phase, actually
       | automatically removes these in the production environment. That
       | is to say, it is a macro which behaves as expected in every
       | environment but "prod", in which case it removes itself.
       | 
       | > conditionals
       | 
       | A number of years ago I used Ruby to experiment with writing a
       | version of fizzbuzz that avoided conditionals entirely, and was
       | purely functional:
       | 
       | https://github.com/pmarreck/ruby-snippets/blob/master/functi...
       | 
       | (I actually regret using currying here because it hurts the
       | portability of the algorithm)
       | 
       | While it may be difficult (if possible) to convert all
       | conditional logic to functional logic in a given algorithm,
       | perhaps a compiler could do the work, if certain patterns
       | emerged.
       | 
       | I'm not a C guy, but I have to wonder if such an algorithm would
       | run faster on Intel or ARM chips by avoiding all branching and
       | branch prediction. Can someone chime in on this?
        
       | flakiness wrote:
       | In old days people used #ifdef to compile things out so that the
       | "production code" doesn't have any unnecessary branches. I was
       | shocked when I first saw these living if()s in the server-side
       | C++ code but then realized it was vital for debugging in
       | production.
       | 
       | People also used to prefer table based jump over long if-else-if
       | chain for anecdotal performance reasons. That has gradually
       | changed over the evolution of CPUs with various optimizations
       | like the branch prediction. Now some JIT impls like tracing JIT
       | replace virtual function calls with if-elses.
       | 
       | By the way, I'm glad that the article is so thoughtfully written
       | that it doesn't include any ambiguous "best practices" at the
        | end. It could have created yet another myth. The three top
        | tips in the article are more about stating facts. I admire
        | the author taking that stance.
        
       ___________________________________________________________________
       (page generated 2021-05-06 23:01 UTC)