[HN Gopher] Fastest branchless binary search
___________________________________________________________________
Fastest branchless binary search
Author : jorangreef
Score : 294 points
Date : 2023-08-11 09:38 UTC (13 hours ago)
(HTM) web link (mhdm.dev)
(TXT) w3m dump (mhdm.dev)
| foldr wrote:
| >gcc has __builtin_expect_with_probability(cond, 0, 0.5) but it
| does nothing (tested v10). -
|
| I wonder what could possibly be the use of this builtin. Branch
| prediction varies enough between different processors that it
| seems unlikely that anything useful could be done with a fine-
| grained probability estimate.
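|
| For reference, the builtin is spelled like this (GCC 9+); a minimal
| sketch, and per the article's quote it had no observable effect in
| the tested GCC 10:
|
|     int mostly_positive(int x) {
|         if (__builtin_expect_with_probability(x > 0, 1, 0.9)) {
|             return x;    // hinted as the 90%-likely path
|         }
|         return -x;
|     }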
| bjourne wrote:
| There are ARM processors where bits in the branch instructions
| encode whether they should be predicted as taken or not. So on
| such architectures such hints can be useful. On x86-64,
| however, they are afaik completely useless (from an
| optimization perspective).
| Pannoniae wrote:
| Read https://en.algorithmica.org/hpc/pipelining/branching/ and
| the following chapter.
|
| TL;DR: you can generate different, and more efficient, code if
| you know how often it is going to be mispredicted.
| foldr wrote:
| Nothing in that link shows how a compiler could make use of a
| fine-grained probability estimate in a practical way to guide
| optimizations. I'm perfectly aware of the general concept of
| branch prediction and the annotations that certain
| architectures have in their instruction sets.
| gpderetta wrote:
| The built-in doesn't provide misprediction information to the
| compiler.
| twoodfin wrote:
| Presumably it's not about matching the dynamic branch
| prediction estimate but rather exploiting static (or profile-
| provided) information to assist other optimizations.
|
| Inlining seems like a natural application: Inlining is perhaps
| the most critical optimization to get right, and getting it
| right is a complex balance of frequency of use vs. added code
| bloat and associated further optimization cost. Having more
| than one bit of branch-likelihood as input to these decisions
| could easily be helpful.
| foldr wrote:
| Good point. That's the only plausible explanation in this
| thread, I think. However, if you're at the level of
| performance tuning where you want to optimize inlining beyond
| the compiler defaults, I would have thought that it would
| make more sense to just experiment with forcing certain
| functions to be inlined or not.
| gpderetta wrote:
| I think it is used for code layout: typically you want to make
| the most often taken branch as fallthrough as it can be
| slightly faster even when perfectly predicted.
|
| I also tried (and failed) to use expect with probability to
| generate a cmov, but in retrospect it is obvious: the parameter
| is not the prediction probability, but the taken/not-taken
| probability, and even a branch that goes either way with 50%
| probability can still be predicted perfectly (for example a
| branch that switches every other iteration), so the parameter
| can't be used to decide between predication and prediction.
|
| edit: what we really need is an [[unpredictable]] attribute.
| foldr wrote:
| >typically you want to make the most often taken branch as
| fallthrough as it can be slightly faster even when perfectly
| predicted.
|
| Sure, but that application doesn't require a fine-grained
| probability estimate.
| twoodfin wrote:
| Indeed: A branch taken 50% of the time can be readily
| predicted 100% of the time if it follows a pattern the CPU
| can infer.
|
| This is why you have to be careful choosing realistic input
| data if you are microbenchmarking at this level for a real
| purpose.
| pixelesque wrote:
| clang has one:
|
| https://clang.llvm.org/docs/LanguageExtensions.html#builtin-.
| ..
| ot wrote:
| Which also does not affect cmov conversion. Every time this
| comes up I bring up this long-standing LLVM bug:
| https://bugs.llvm.org/show_bug.cgi?id=40027
| dzaima wrote:
| __builtin_unpredictable is gonna be fixed in LLVM/clang
| 17: https://github.com/llvm/llvm-
| project/commit/09515f2c20111628...
|
| (also, bugs.llvm.org is old; the more up-to-date (but
| still open) issue is https://github.com/llvm/llvm-
| project/issues/39374)
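|
| A minimal sketch of how the clang builtin is used: the condition
| itself gets wrapped, hinting that it follows no pattern:
|
|     int pick_smaller(int x, int y) {
|         if (__builtin_unpredictable(x < y)) {
|             return x;    // with the fix, clang can prefer a cmov here
|         }
|         return y;
|     }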
| wongarsu wrote:
| At least when I last read it a couple years ago, Intel's
| optimization manual recommends certain code layouts. For
| example in the absence of better information a conditional jump
| backwards would be predicted as taken (because that's what loops
| look like) while a jump forward is seen as less likely to
| happen (error conditions and else branches look like that).
|
| Not sure how consistent this is across architectures, or how
| useful it still is, but at least it used to be a thing.
| kragen wrote:
| pretty common but not, as i thought, universal since stretch
| https://en.wikipedia.org/wiki/Branch_predictor#Static_branch.
| ..
|
| but the only simpler approach is 'predict not taken' which in
| a sense is what a cortex-m0 does
| LoganDark wrote:
| Does anyone know where the "BUT RUST" link was _supposed_ to
| lead? It seems to be already out of date due to being
| unversioned, I can't tell whether it's supposed to lead to the
| middle of the `starts_with` doc comment or not.
| quietbritishjim wrote:
| Looking at the archive.org captures just before [1] and just
| after [2] the article was published, it looks like it was meant
| to be this line of code, now on line 2779 [3]:
| let mid = left + size / 2;
|
| [1]
| https://web.archive.org/web/20230602210213/https://doc.rust-...
|
| [2]
| https://web.archive.org/web/20230709221353/https://doc.rust-...
|
| [3] https://doc.rust-lang.org/src/core/slice/mod.rs.html#2779
| LoganDark wrote:
| Alright, and as a versioned link that'll be:
| https://github.com/rust-
| lang/rust/blob/7d8386f05cedf33c7475c...
|
| Thank you!
| veber-alex wrote:
| You can get versioned link directly into rustdoc:
|
| https://doc.rust-
| lang.org/1.71.1/src/core/slice/mod.rs.html#...
| LoganDark wrote:
| I couldn't find any direct references to this anywhere,
| believe me I spent a long 30 minutes looking... >.<
|
| But thank you anyway <3
| epilys wrote:
| So this is basically a *safe* optimization, since the index
| will always be valid and there's no need for the compiler to
| emit a bounds check that could panic.
|
| This is how unsafe {} should be used. Sometimes some things
| are true but the compiler can't know that. And here the
| unsafe {} means that it dereferences a raw pointer (the index
| that we know is valid). If the unsafe {} safety condition is
| valid, unsafe {} is, well, safe.
| someplaceguy wrote:
| Furthermore, it's an optional optimization (you could just
| copy the code and replace the unsafe access with a safe
| one, if you're paranoid) and it's not like if you write it
| in C++ it will be any safer than Rust's unsafe?!
| mhdm wrote:
| It was meant to link to rust's binary search implementation.
| Updated to https://doc.rust-
| lang.org/1.71.1/src/core/slice/mod.rs.html#...
| WinLychee wrote:
| Other fast binary searches https://github.com/sslotin/amh-
| code/tree/main/binsearch
| [deleted]
| hajrice wrote:
| I wish all blog posts started the way his does: "You're a busy
| person so I'll first jump right to it. Here it is, the fastest
| general (and simple) binary search C++ implementation:"
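|
| (The function in question, pieced together here from the signature
| and loop body quoted elsewhere in this thread, so a sketch rather
| than a verbatim copy from the article:)
|
|     template <class ForwardIt, class T, class Compare>
|     constexpr ForwardIt sb_lower_bound(
|             ForwardIt first, ForwardIt last, const T& value, Compare comp) {
|         auto length = last - first;
|         while (length > 0) {
|             auto rem = length % 2;      // in case length is odd
|             length /= 2;
|             if (comp(first[length], value))
|                 first += length + rem;  // the if that becomes a cmov
|         }
|         return first;
|     }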
| ape4 wrote:
| Except I am too busy to read "You're a busy person"
| ant6n wrote:
| But not too busy to read the comments!
| hinkley wrote:
| The comments have a higher density of opinions and
| experience.
| eesmith wrote:
| But does it have a higher density of _good_ opinions and
| experience? ;)
| bjourne wrote:
| On my Cascade Lake processor "-mllvm -x86-cmov-converter=false"
| almost halves the performance of the binary search:
|       | Benchmark | gcc  | clang | clang -cmov |
|       |-----------|------|-------|-------------|
|       | slow u32  | 23.4 | 46.7  | 45.8        |
|       | fast u32  | 18.1 | 19.8  | 31.4        |
|
| The numbers are nanoseconds/bsearch on a 100mb uint32 array. Seems
| to me that clang (15.0.7) is just much worse at optimizing this
| particular piece of code than gcc (13.2.1). You can see the
| assembly here https://godbolt.org/z/cbx5Kdjs6. The gcc assembly
| looks way cleaner to me.
| mhdm wrote:
| Then you'll want to look at
| https://mhdm.dev/posts/sb_lower_bound/#prefetching
|
| 100mb is large enough that the branchy version turns out to
| have a slight advantage, more due to quirks of x86 (speculative
| execution) than to being inherently better.
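|
| Roughly the idea (a sketch with a made-up name, not the exact code
| from the post): start loading both possible next midpoints while
| the current comparison is still resolving.
|
|     const int* sb_lower_bound_prefetch(const int* first,
|                                        const int* last, int value) {
|         auto length = last - first;
|         while (length > 0) {
|             auto rem = length % 2;
|             length /= 2;
|             __builtin_prefetch(&first[length / 2]);                 // if we stay
|             __builtin_prefetch(&first[length + rem + length / 2]);  // if we advance
|             if (first[length] < value)
|                 first += length + rem;
|         }
|         return first;
|     }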
| shmageggy wrote:
| > _If only there was a clean fast bare-metal language to write
| all this in.._
|
| The author includes footnotes for "BUT RUST.." and "BUT ZIG..",
| but how about Nim? Looks like there is a native library
| implementation of `lowerBound` https://github.com/nim-
| lang/Nim/blob/version-2-0/lib/pure/al... Yes, it's not a "bare-
| metal" language, but it compiles to one (or two), so it would be
| interesting to see what it compiles to here.
|
| Also, what's wrong with C?
| TillE wrote:
| An equivalent generic sort function in C would, at best,
| require a bunch of additional distracting cruft. This is
| exactly what C++ templates are for.
| matklad wrote:
| > BUT ZIG.. There's no binary search implementation in Zig that
| I could find, rather it calls to C++.
|
| Zig's binary search is here, it's an unoptimized textbook
| version:
| https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...
|
| At TigerBeetle, we have our own branchless implementation here:
|
| https://github.com/tigerbeetle/tigerbeetle/blob/e996abcf7154...
| josephg wrote:
| Rust's binary search is here. Also a pretty textbook version
| of the function: https://github.com/rust-
| lang/rust/blob/4d7a80d48697171ed151c...
|
| How much faster is the branchless version in practice? I make
| heavy use of binary search in some code I maintain, and I
| wonder if it would make much of a difference to switch to a
| more efficient version of the function.
| WinLychee wrote:
| Rust's binary search used to be branchless but due to some
| issues down in LLVM land, the codegen doesn't issue a
| `cmov` where it should. https://github.com/rust-
| lang/rust/issues/53823
|
| old branchless version: https://github.com/rust-
| lang/rust/pull/45333/files
| abhishekjha wrote:
| I don't know much about Rust data types or Rust in general
| but does it not have any integer overflow?
|
| I see (left+right)/2. Is it like python with unbounded
| precision?
| cornstalks wrote:
| It's bounded precision, but Rust limits the max size of
| an object/array to isize's max[1], not usize's max. So
| adding two isize::MAX values using usize will never
| overflow.
|
| [1]: https://doc.rust-
| lang.org/stable/reference/types/numeric.htm...
| LegionMammal978 wrote:
| Such an overflow could still be problematic for slices of
| zero-sized types, which can contain up to usize::MAX
| elements, since the total size in bytes is always 0. But
| (left + right) / 2 doesn't actually occur anywhere in the
| code, only left + size / 2, which clearly can't overflow
| as long as size is sane.
| [deleted]
| josephg wrote:
| Integer overflows cause panics in debug mode. And its
| undefined behaviour in release mode.
|
| Where do you see that in the code? I can't see
| (left+right)/2 anywhere in the code I linked?
| [deleted]
| cornstalks wrote:
| > _And its undefined behaviour in release mode._
|
| No, it uses 2's complement and is well defined in release
| mode. From [1]:
|
| > _When you're compiling in release mode with the
| --release flag, Rust does not include checks for integer
| overflow that cause panics. Instead, if overflow occurs,
| Rust performs two's complement wrapping._
|
| [1]: https://doc.rust-lang.org/book/ch03-02-data-
| types.html#integ...
| marcianx wrote:
| I'm pretty sure it's not undefined behavior in rust in
| release mode if it does overflow. It's fully specified
| behavior, I believe.
| Phemist wrote:
| Left and right are both usizes, which are pointer-sized
| unsigned integers (64-bit here). You would need to work on an
| array of 2^63 elements before you have to worry about integer
| overflow issues.
|
| This array would not fit in any kind of memory for the
| foreseeable future :)
| creata wrote:
| Not that it makes much of a difference, but Rust does
| compile to 32-bit platforms.
|
| https://doc.rust-lang.org/nightly/rustc/platform-
| support.htm...
| Phemist wrote:
| Ah yes, fair point. It makes it a bit more subtle, in
| that the cases where you have to worry about integer
| overflows on the pointer addition, are cases where you
| have an array of 2^((64 or 32) - 1) bools... which seems
| rather silly to do a binary search on?
| quietbritishjim wrote:
| > I see (left+right)/2.
|
| There is no (left+right)/2 on that page.
| let mid = left + size / 2;
|
| It's only size that is being divided by 2, which is the
| size of the segment still under consideration (which is
| initially the whole array). left is the starting index of the
| segment still under consideration (which is initially 0).
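|
| A small illustration of the difference (C++, hypothetical 32-bit
| indices; the function names are made up):
|
|     #include <cstdint>
|
|     uint32_t mid_classic(uint32_t left, uint32_t right) {
|         return (left + right) / 2;  // left + right may wrap
|     }
|
|     uint32_t mid_safe(uint32_t left, uint32_t size) {
|         return left + size / 2;     // size is the remaining segment
|     }                               // length, so no wrap here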
| vulcan01 wrote:
| In debug mode, integer overflow panics. In release mode,
| overflow silently wraps; that's why checked_add and friends
| exist.
| eru wrote:
| > Also, what's wrong with C?
|
| Oodles and oodles of undefined behaviour, for example.
|
| C ain't clean.
| User23 wrote:
| Nah, thanks to CompCert[1] C actually has one of the highest
| quality and most predictable compilers.
|
| [1] https://compcert.org/
| Ygg2 wrote:
| How much of available C libraries are CompCert compatible?
| insanitybit wrote:
| What is the practicality of actually using CompCert? And
| what does it actually defend against?
|
| All I know is that it tries to ensure that the behavior of
| the program is the same as the input, meaning that the
| compiler itself does not have bugs. But how does that
| actually interact with undefined behavior in the language?
| What happens to an array out of bounds?
|
| And what are the limitations imposed, since it only works
| on a subset of the C language?
|
| It seems unlikely that I could just switch compilers and
| suddenly have a safe program written in C.
| LegionMammal978 wrote:
| It looks like CompCert explicitly does not attempt to
| prevent UB from resulting in unpredictable behavior; all
| it guarantees is that "the observable behavior of [the
| compiled code] _C_ improves on one of the allowed
| observable behaviors of [the source program] _S_ ", where
| "improving on" allows for arbitrary behavior on UB [0]:
|
| > Third, the compiler is allowed to improve the behavior
| of the source program. Here, _to improve_ means to
| convert a run-time error (such as crashing on an integer
| division by zero) into a more defined behavior. This can
| happen if the run-time error (e.g. division by zero) was
| optimized away (e.g. removed because the result of the
| division is unused). However, if the source program is
| known to be free of run-time errors, perhaps because it
| was verified using static analyzers or deductive program
| provers, improvement as described above never takes
| place, and the generated code behaves exactly as one of
| the allowed behaviors of the source program.
|
| So you still have to prove your program's correctness by
| some other means.
|
| [0] https://compcert.org/man/manual001.html#sec3
| User23 wrote:
| You have to scroll down a bit, but see here[1]. Also
| searching [site:regehr.org compcert] will find more.
| CompCert C makes it impossible to reproduce at least some
| UB bugs: For CompCert, the high-level
| correctness theorems I proved are all of the first kind
| above: if the source program goes wrong, nothing is
| guaranteed about the behavior of the compiled code. It
| happens, however, that the stepwise simulation lemmas I
| build on actually prove property 2 above: the compiled
| code cannot crash "earlier" than the source code, and
| will produce the same pre-crash observables, then
| possibly more. So maybe I should strengthen my high-
| level correctness theorems...
|
| That was over a decade ago, so perhaps it's been done. I
| confess I haven't closely followed CompCert in some time,
| due to having priorities elsewhere. As you can see
| elsewhere in that reference CompCert C will fail to
| generate the example UB code.
|
| But yes, mechanically proving additional properties with
| some kind of SAT solver is also an open area of research.
| Dafny is the example I'm familiar with. Its conditions
| and invariants allow things like automatically proved
| binary search[2]. Converting CompCert C statements to
| SMTLibv2 assertions (which is, to an extremely rough
| first approximation what Dafny does) would certainly be
| considerably easier than it would be for most other
| languages. It would still be a serious effort of course.
|
| [1] https://blog.regehr.org/archives/232
|
| [2] https://github.com/dafny-
| lang/dafny/blob/master/Test/dafny4/...
| LegionMammal978 wrote:
| Apart from compiler bugs, this issue with "time-traveling
| UB" shouldn't really occur in practice with any side
| effects other than volatile accesses. As noted by Martin
| Uecker [0],
|
| > In portable C code I/O is performed using function
| calls of the standard library. A compiler can generally
| not assume that a function returns to the caller, because
| the function could call 'exit', 'abort', 'longjmp', or
| enter an infinite loop. For this reason, it is never
| allowed for a compiler to use any information derived
| from an operation with potential UB that comes after a
| function call in a way that could affect observable
| behavior before the function call. Consequently,
| observable behavior that is accessed via function calls (
| _i.e._ all except volatile accesses) can not be affected
| by time-traveling optimizations even when using the
| abstract interpretation. In principle, a compiler could
| use special knowledge about C library functions and use
| the fact those functions always return to the caller, but
| we are not aware of any compiler which does this (and it
| hardly seems worthwhile).
|
| I believe LLVM has a "willreturn" attribute expressing
| this, but as noted, no standard C library messes with any
| such attributes for functions that cause observable
| behavior.
|
| So time-traveling UB really only affects programs which
| use volatile accesses for MMIO, and it might eventually
| be banned entirely from the C standard if the linked
| proposal N3128 goes through. I'd say that even "some C
| bugs" is a overstatement outside of embedded programming,
| given the rarity of MMIO in a hosted environment.
|
| [0] https://www.open-
| std.org/jtc1/sc22/wg14/www/docs/n3128.pdf
| w0mbat wrote:
| Is that still lower_bound? Maybe I am misreading the code but it
| looks like this returns any match, not the earliest match (when
| there are dupes).
|
| It's common to have multiple matches even in a unique list if the
| comparison function is say looking for a certain string prefix to
| do autocomplete, but we want the earliest in the list.
| klysm wrote:
| I guess it's good to have the option of not caring if you want
| even more speed
| shultays wrote:
| My lower_bound is a bit better:
|
|     template <class ForwardIt, class T, class Compare>
|     constexpr ForwardIt super_optimized_lower_bound(
|         ForwardIt first, ForwardIt last, const T& value, Compare comp) {
|       return first;
|     }
|
| It doesn't really work like std::lower_bound for some cases
| but it is super fast
| [deleted]
| commandlinefan wrote:
| Or, less contentiously - if you know you don't care/don't
| need to disambiguate dups, look how much efficiency you're
| losing on a case that isn't important.
|
| The most common argument against optimization is "just use
| the existing code", but the "existing code" _always_ attempts
| to handle special edge cases that probably don't apply in
| your case. We programmers waste a _lot_ of end-user time
| saving a bit of programmer time.
| jfk13 wrote:
| On the other hand, assuming it's OK to ignore the special
| edge cases (or not even thinking of them) can come back to
| bite you (or your users) when eventually one of them _does_
| show up in a situation you didn't anticipate.
| plagiarist wrote:
| Anecdotally this happens to me every time I omit an edge
| case. Now I try to always write a unit test for the
| special edge case being deliberately omitted. I just wish
| I could catch everything that way.
| sltkr wrote:
| As far as I can tell it does return the earliest match. Why do
| you think it doesn't?
| w0mbat wrote:
| I am possibly confused, getting more so the more I read the
| code. As it is ignoring the sign of the result of compare I
| don't even see how it would work at all. The sign is what
| tells you which half to throw away as you search.
| badtension wrote:
| You halve the remaining length each time there is a match and
| only exit the loop when length is 0 so it should be the first
| one.
| ectospheno wrote:
| Every post like this makes me update my junior engineer
| watchlist. Then I have to go patiently explain why mucking about
| in the code to save one instruction because you read a blog post
| is a horrible idea. Easily half of all silly junior code is
| pointless optimization.
|
| Having said that, I did enjoy the post.
| sph wrote:
| What kind of junior engineers are you hiring that prematurely
| rewrite a hot loop to use branchless assembly logic?
|
| The junior employees I have had to deal with tended to build
| towering abstractions that no one really needed, rather than
| optimising hot loops.
| ectospheno wrote:
| Yes, I would say that is the other half.
| anonymoushn wrote:
| The Zig stdlib does not call out to C++ for binary search. The
| binary search is currently here:
| https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...
|
| (edit: fixed link to not decay, thanks)
| LoganDark wrote:
| Versioned link:
| https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...
| mhdm wrote:
| Thanks, updated the footnote.
| JonChesterfield wrote:
| Branches on a comparison function? Doesn't count, as simple
| functions end up as a conditional move. Branches in the loop back
| edge? Doesn't count, as there's a branch predictor.
|
| I was hoping for a binary search without branches in it. Not
| really what we got.
| andyg_blog wrote:
| This is not a valid drop-in replacement for lower_bound.
| Accessing iterators as front[index] only works for random access
| iterators, like vector's. The author might have realized this if
| they had benchmarked on other containers. A forward iterator must
| be advanced and then dereferenced.
| commonlisp94 wrote:
| Good point. I don't think anyone actually uses it for
| containers that only support forward iteration. But the author
| did declare the signature with ForwardIterator.
| fn-mote wrote:
| Can you do binary search without random access? I would not
| expect so. Am I missing a point that someone could explain?
| andyg_blog wrote:
| The C++ standard measures the time complexity of lower_bound
| and binary_search by the number of comparisons performed. So
| yes, it's possible on e.g., a linked list like std::list (by
| this definition)
| kccqzy wrote:
| Observe that you can restrict your code to only call
| std::advance on the iterator. So definitely you could do
| binary search without random access, and you would actually
| iterate over all elements, but you just skip the comparison
| function for the majority of elements.
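|
| A sketch of that idea (made-up name): O(n) iterator steps in
| total, but only O(log n) calls to the comparison:
|
|     #include <iterator>
|
|     template <class ForwardIt, class T>
|     ForwardIt lower_bound_fwd(ForwardIt first, ForwardIt last,
|                               const T& value) {
|         auto length = std::distance(first, last);  // O(n) walk
|         while (length > 0) {
|             auto half = length / 2;
|             ForwardIt mid = first;
|             std::advance(mid, half);       // O(half) steps
|             if (*mid < value) {            // one comparison per level
|                 first = ++mid;
|                 length -= half + 1;
|             } else {
|                 length = half;
|             }
|         }
|         return first;
|     }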
| mhdm wrote:
| For a proper/sane forward iterator, front[index] is the same
| thing as *(front + index) and front + index achieves the same
| as std::advance(front, index) (just not in place).
|
| Or are you pointing out how first[length] and first += length +
| rem would technically result in advancing a forward iterator
| through to first + length twice? (Trivially fixed with a middle
| = first + length temporary variable btw).
| notorandit wrote:
| Branchless?
|
| You need at least 1 branch to exit the loop.
| ant6n wrote:
| Do you tho? You could unroll the loop ceil(log2(MAX_SIZE))
| times.
| tgv wrote:
| If the compiler doesn't optimize the if (cmp(...)) first +=
| length + rem; perhaps a sign operation could be used when cmp
| returns a number. Something along the lines of:
|
|     first += (length + rem) * ((sign(cmp(...)) + 1) / 2)
|
| Has anyone tried that? DDG isn't very helpful.
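|
| (For the article's bool-returning comp no sign fix-up is needed;
| the update can be a plain multiply, which is the form the switch
| version quoted further down uses. A sketch, with a made-up helper
| name:)
|
|     #include <cstddef>
|
|     template <class It, class T, class Compare>
|     void branchless_step(It& first, std::ptrdiff_t& length,
|                          const T& value, Compare comp) {
|         auto rem = length % 2;
|         length /= 2;
|         first += comp(first[length], value) * (length + rem);
|     }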
| funkychicken wrote:
| It looks like the unpredictable attribute now influences the cmov
| conversion pass (as of June 1st, so probably in clang 17/18)
|
| https://reviews.llvm.org/D118118
| 1letterunixname wrote:
| It's not branchless if there are any Jcc other than JMP
| because that's an opportunity for a branch prediction miss and a
| pipeline stall.
| jezze wrote:
| Wouldn't it be faster to do length & 1 instead of length % 2 and
| also length >> 1 instead of length / 2? But maybe the compiler
| does this behind the scenes?
| ok123456 wrote:
| The compiler will do these strength-reduction optimizations
| when they're applicable.
| planede wrote:
| That's an optimization you can rely on, yes.
| Chabsff wrote:
| Yeah, these are trivial optimizations.
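|
| For example, with unsigned operands both spellings compile to the
| same single instruction (sketch of typical x86 output in comments):
|
|     unsigned mod2(unsigned length) { return length % 2; }  // and eax, 1
|     unsigned div2(unsigned length) { return length / 2; }  // shr eax
|
| For signed types the same strength reduction happens, just with a
| couple of extra instructions to round toward zero.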
| commonlisp94 wrote:
| lower_bound and upper_bound are typically implemented in terms of
| partition_point, which is much more general than this version of
| lower_bound taking an element.
| Aardwolf wrote:
| Even more reason to use this if you don't need the more general
| case and its associated performance cost
| commonlisp94 wrote:
| Partition point is even simpler and can be optimized the same
| way.
|
| > if you don't need the more general case and its associated
| performance cost
|
| In C++ the "more general case" typically doesn't have a
| performance cost.
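|
| For instance, lower_bound written via partition_point (a sketch;
| the lambda inlines, so the extra generality is typically free):
|
|     #include <algorithm>
|
|     template <class ForwardIt, class T>
|     ForwardIt lb_via_partition_point(ForwardIt first, ForwardIt last,
|                                      const T& value) {
|         return std::partition_point(first, last,
|             [&](const auto& element) { return element < value; });
|     }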
| victor106 wrote:
| Anyone knows if Java already does this optimization?
| djoldman wrote:
| Interesting that the results don't hold up with a more
| complicated comp comparison function:
|
| > For somewhat realistic scenarios of binary searching with a
| slower comp() function I've thought of searching through ids,
| phone numbers, accounts and keywords. I've thus settled on
| testing searching 8-byte strings.
|
| > ...
|
| > In this case std::lower_bound is very slightly but consistently
| faster than sb_lower_bound. To always get the best performance it
| is possible for libraries to use sb_lower_bound whenever directly
| working on primitive types and std::lower_bound otherwise.
|
| I would like to see the analysis here.
| pclmulqdq wrote:
| I believe this happens because branch prediction lets you
| pipeline multiple simultaneous comparisons, and rewind whenever
| the branch predictor is wrong (about half the time for truly
| random data and inputs). The CMOV approach blocks after each
| comparison function due to the data dependency. On average,
| you're doing two comparisons at a time with branches, and one
| with CMOV, so when comparison time is greater than branch
| prediction penalty, you would expect a crossover to occur.
| gpderetta wrote:
| Switching to an N-way search (for N>2) could help extract the
| lost parallelism. At the cost of touching more cachelines
| though, so it is not necessarily a win.
| pclmulqdq wrote:
| You're also wasting a lot of work doing that. If you go two
| levels deep, you now do 2 comparisons per iteration
| (caching the result from the previous level), but you are
| guaranteed to waste one of those. The CPU with branch
| predictor is doing about 1.5 comparisons per iteration
| (worst case) thanks to the 50% mispredict rate, although
| this depends a lot on how long the comparison takes: a very
| long comparison function will converge to 1 comparison per
| iteration. Only if you replicate the speculation for
| yourself - which will necessarily have to be a lot worse
| than a CPU branch predictor to be fast and branchless - do
| you have any hope of an efficiency gain.
| gpderetta wrote:
| You do additional comparisons at each level, but you get a
| shallower recursion depth in exchange. I think you break even
| already at 4-way search.
| twoodfin wrote:
| I can't find any assertion in the post about the content of any
| of the input data sets or search keys, other than that they're
| "unpredictable".
|
| I assume purely random, but if these 8-byte strings were
| anything but pure information, modern branch predictors can
| sniff their way to better performance than cmov pretty easily.
| edg-l wrote:
| Check out https://en.algorithmica.org/hpc/data-structures/binary-
| searc...
| 3cats-in-a-coat wrote:
| Every time I see people trying to eliminate branches, I wonder,
| do we realize that having long pipelines where a branch
| misprediction stalls the pipeline is not actually a necessary
| part of architecture?
|
| The pipeline is long because we do lots of analysis and
| translation on the fly, just in time, which could easily be done
| in most cases ahead of time, as it's not a very stateful
| algorithm.
|
| This is how Transmeta Crusoe CPUs worked. Imagine NOT caring that
| you have a branch.
|
| After all, if you think about it, all operations are branches.
| Down to bitwise operations and summing two numbers. It's
| branching in the ALU, carry bits and what not. You can't compute
| absolutely anything without looking at the state of one or more
| bits, and making a decision that changes the outcome. But the
| reason those branches don't hurt performance is that they're not
| branches on the main pipeline. These local branches have a very
| short (or non-existent) "pipeline". And the main pipeline is
| therefore not affected, even though the actual state of the
| system is.
| Const-me wrote:
| > as it's not a very stateful algorithm
|
| It might be stateless, but it depends on many things unknown at
| compile time.
|
| One of them is the input data being processed. Binary search is
| exactly that: the compiler doesn't know at which position the result
| will be found.
|
| Another one is micro-architecture, most notably cache
| hierarchy, and composition of EUs inside cores. If you switch
| to ISA with instructions resembling the micro-ops of the
| current CPUs, you'll have to re-compile for every micro-
| architecture. However, this one is technically solvable with
| JIT compiler in OSes, like current GPUs where programs ship in
| byte code formats (DXBC, SPIR-V, NVPTX) then user-mode half of
| GPU driver recompiles into actual hardware instructions.
|
| Another huge one: other CPU threads are running unknown code.
| Even if you drop hyper-threading making cores independent,
| there're still resources shared across the complete chip: L3
| cache, off-chip memory, I/O bandwidth, and electricity i.e.
| thermals.
| scottlamb wrote:
| > > > as it's not a very stateful algorithm
|
| > It might be stateless, but it depends on many things
| unknown at compile time.
|
| > One of them is the input data being processed. Binary
| search is exactly that, compiler don't know at which position
| the result will be found.
|
| Are you and 3cats-in-a-coat talking about the same "it"?
| I think they're talking about the work that requires such a
| long pipeline [1], as they've stated that doing more ahead of
| time would reduce the length of the pipeline and thus
| importance of accurate branch prediction. You are talking
| about...the branch predictor? the application's algorithm
| being executed (binary search in this case)?
|
| [1] my CPU microarchitectural knowledge isn't that good; I
| have only the vaguest idea of what that is. Register renaming
| / data dependency analysis / instruction scheduling, maybe?
| translation to micro-ops? other things? don't know!
| robertlagrant wrote:
| > After all, if you think about it, all operations are branches
|
| Isn't this definition the crux of it? If you redefine
| everything as a Branch(tm), even things that are not branches,
| then you can definitely precompute some Branch(tm)es.
|
| But the sort of non-Branch(tm) branch that elimination is talking
| about is an actual branching of computation routes in code due to
| an if/else statement or similar, is it not?
|
| Branch elimination would still be useful in your world of
| Branch(tm)es, just only for the Branch(tm)es that try to compute
| the results of more than one future at the same time (i.e.
| branches).
| [deleted]
| [deleted]
| mhh__ wrote:
| People have spent probably a billion dollars trying to get this
| to work. It's a noble goal but when will it actually stick?
| dragontamer wrote:
| > The pipeline is long because we do lots of analysis and
| translation on the fly, just in time, which could easily be
| done in most cases ahead of time, as it's not a very stateful
| algorithm.
|
| You're going down the wrong path, again, as Intel did with
| Itanium.
|
| We have pipelines because CPUs are performing Tomasulo's
| algorithm at runtime: there are 10 pipelines in practice (for
| Intel systems), and all can be run in parallel. Multiply can be
| issued on pipeline0, pipeline1, or pipeline5.
|
| If there are 3 multiply instructions, all three pipelines (p0,
| p1, and p5) all get issued a multiply on _THIS_ clock tick.
|
| ----------
|
| The analysis must be dynamic because instructions such as "mov"
| can take a variable amount of time: loading data from L1 cache
| takes 4 clock ticks, L2 cache 40 clock ticks, L3 cache ~150
| clock ticks, and RAM 400+ clock ticks.
|
| That means a "lazy" strategy, where you analyze the state of
| the pipeline "as late as possible" before making a decision
| wins out. If you do pre-analysis (ie: what Intel's compiler was
| supposed to do with Itanium), you pretty much lose out on all
| variable-time instructions (ex: cache vs RAM).
|
| If you know for certain that you have no memory-operations,
| then you probably should use a GPU instead of a CPU.
|
| ---------
|
|     theLoop:
|         mov ebx, eax[ecx]   ; y = array[x]
|         add ebx, edx        ; y += z
|         add ecx, 4
|         cmp ecx, max
|         jnz theLoop
|
| How long should you schedule the "add" instruction in the above
| assembly code? If you got a dynamic system just analyzing your
| pipelines and "lazily" issuing the instructions, you "perform
| it when its ready" (ie: after the mov instruction is complete,
| as you need ebx / variable-y to have been finished reading from
| RAM before progressing).
|
| "add ecx, 4" can actually be done right now in parallel. You
| split ecx into two registers (ecx-old and ecx-new). You issue
| "mov ebx, eax[ecx-old]", and then you can execute future
| instructions with ecx-new. If the next loop is branch predicted
| (as would be the case in binary search, since you almost always
| loop), the pipeline/branch prediction stuff works together and
| you can execute mov ebx, eax[ecx0], mov ebx, eax[ecx1], mov
| ebx, eax[ecx2], mov ebx, eax[ecx3]... all in parallel (!!!!),
| as ecx3 gets branch predicted and multiple loops start running
| in parallel..
|
| It's not hard, it's not very difficult analysis, and it's low power.
| Everyone does this parallel computation now. If you go static /
| compiled ahead-of-time, you can't do this kind of thing
| anymore. So you lose out in speed compared to regular CPUs that
| have OoO analysis going on.
|
| ----------
|
| Furthermore, most code with branches can seemingly be replaced
| with branchless code (ex: cmov instructions, min instruction,
| max instruction, etc. etc.)
|
| So instead of magic compilers trying to solve an unsolvable
| problem (how do you schedule memory load/stores to keep the
| processor busy even though you don't know how long any memory
| operation takes...)... you write a magic compiler that solves a
| very easy to solve problem. (IE: convert the branchy-if
| statement into a branchless max or branchless cmov instruction
| instead).
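|
| For example, a branchy clamp that compilers already turn into a
| conditional move (a sketch):
|
|     int clamp_non_negative(int x) {
|         // typically: xor eax,eax; test edi,edi; cmovns eax,edi
|         return x < 0 ? 0 : x;
|     }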
|
| --------
|
| EDIT: I added "theLoop:" to the assembly above, because I
| realized that "branches" can be beneficial in the face of OoO /
| Tomasulo's algorithm. "Executing future loops" before the 1st
| loop is done is very beneficial.
| jimkoen wrote:
| I have little experience in hardware related performance
| optimizations, so excuse me if some of the things you wrote
| went over my head. But this sentence:
|
| > Everyone does this parallel computation now. If you go
| static / compiled ahead-of-time, you can't do this kind of
| thing anymore. So you lose out in speed compared to regular
| CPUs that have OoO analysis going on.
|
| Are you implying that in JIT compiled languages I can care
| less about making my code branchless, since both CPU and JIT
| compiler will optimize to a branchless version on the fly
| wherever possible, whereas in AoT compiled languages such as
| C++ I lose this edge?
|
| Are JIT compiled languages just better when it comes to
| utilizing the branch predictor?
| dragontamer wrote:
| JIT is a completely separate issue. I'm not talking
| about C++ vs Java here. I'm talking about the designs of
| different assembly languages.
|
| I'm talking about Intel Itanium (ia64) vs AMD x86-64. Or
| VLIW (compiler-driven parallelism) vs Out-of-Order
| pipelined processors.
|
| Intel Itanium had "premade bundles". Without getting into
| too far into the weeds, *the compiler* was responsible for
| discovering parallelism. The compiler then bundled
| instructions together
|
| So think of this code:
|
|     theLoop:
|         mov ebx, eax[ecx]   ; y = array[x]
|         add ebx, edx        ; y += z
|         add ecx, 4
|         cmp ecx, max
|         jnz theLoop
|
| The Itanium Assembly language would allow the compiler to emit:
|
|     mov ebx, eax[ecx] ;
|     add ebx, edx : add ecx, 4 ;   // This line executed in parallel
|     cmp ecx ; jnz theLoop
|
| The compiler discovers that "add ebx, edx : add ecx, 4" are
| both independent, and therefore "bundles" them together.
| Intel Itanium then ran without as much need for a decoder to
| discover this information at runtime, since the compiler had
| done it ahead of time.
|
| But look at how few optimizations are available to Itanium
| or its compiler!! The amount of parallelism in practice
| (for classic x86 code) was much bigger, especially when you
| consider Tomasulo's Algorithm.
| mhh__ wrote:
| JITs can do a lot of things but in practice its a better
| mental model to just assume that the equivalent AOT code is
| faster and the JIT is merely recouping performance lost to
| the inefficiencies of interpretation or weird semantics
| (javascript being the implication)
| phire wrote:
| To add to this, the amount of speculation modern CPUs do is
| insane.
|
| In the "theLoop" example, say the first load misses L1 and L2
| cache and needs 40 cycles to load.
|
| AMD's Zen 4 has enough resources to issue about 64 iterations
| of "theLoop:" all before that first load needs to complete.
| With the micro op cache, it can probably issue those 64 loop
| iterations in about 35 cycles.
|
| Zen 4 has enough resources to start all 64 loads early, so
| they run in parallel.
|
| Who cares if there is a branch mispredict and you actually
| needed to do only 16 iterations of "theLoop"? In this
| example, the loop length isn't dependent on any of the loads
| so the CPU can actually finish executing the "add ecx, 4; cmp
| max; jnz theLoop;" instructions of the first 16 iterations of
| the "theLoop". And then detect the mispredict, flush the
| pipeline and load the correct branch location, all before the
| 40 cycles it takes for that first load to complete.
|
| ------
|
| It's just absolutely insane. No short pipeline without a
| branch predictor could possibly compete.
| dragontamer wrote:
| Yeah, it's not so much a game of "don't use the branch
| predictor". It's more a game of "definitely use the
| branch predictor, but eliminate unnecessary branches so
| that the branch-predictor can be used on more important
| parts of your code".
|
| I don't think people realize how much branch predictor +
| Out-of-Order + superscalar builds and combines with each
| other to build modern CPUs. It's done this way for a reason.
| ChuckMcM wrote:
| Is that you Dave? :-) There was paper[1] comparing superscalar
| CISC approaches to uniscalar RISC in terms of _work done over
| time_ versus _instructions per clock_. I recall telling srk at
| the time that it was a good treatise on how choosing the metric
| (IPC vs throughput) can influence ones thinking about what is
| "good" and what is "bad."
|
| The IPC crowd reasons that if they can get a higher IPC value,
| then the process guys can crank the clock and everybody wins.
| The throughput crowd takes the more pragmatic approach that
| Moore's law is dead and making silicon clock faster melts it so
| being clever in the ISA wins the day. Both camps have had
| successes and setbacks over the last 20 years.
|
| I am really enjoying the RISC-V stuff which is getting back
| into this level of question with regard to CPU architectures.
| Also a good place to look if you want to catch up on what
| "modern" superscalar ideas are being added _with_ support of
| instruction set flexibility. My guess is that wins the game in
| the long run.
|
| [1] This was in the early oughts so could have been HotChips,
| Microprocessor Forum, or HPC workshops.
| jasonwatkinspdx wrote:
| I'm sorry but this is completely mistaken.
|
| Transmeta's translation did not somehow eliminate branch costs.
|
| In fact I distinctly remember a comp.arch thread and post from
| Linus himself while he worked at Transmeta that paraphrased was
| "the cpu's job is to generate cache misses as fast as
| possible."
|
| Compulsory misses are a thing. No form of JIT is capable of
| eliminating them. Capacity misses are inevitable in the real
| world, even with the monster caches we have now.
|
| Itanium thought it could eliminate branch costs with static
| analysis. How did that work out?
|
| I really wish programmers would actually read a book about
| computer architecture before they so confidently conclude that
| it's simple to build things better than state of the art
| processors. You are underestimating the scale of smart that has
| gone into current processors by at least 7 orders of magnitude
| in my opinion.
| rtomanek wrote:
| Can you give some book recommendations, please?
| jasonwatkinspdx wrote:
| Hennessy and Patterson's Computer Architecture: A
| Quantitative Approach is the definitive textbook. They also
| have a 2nd textbook Computer Organization and Design. The
| former is more focused on people who might go after a
| Computer Engineering degree while the latter is more an
| exploration of microprocessor architecture relevant to a
| wide audience of programmers.
|
| Previous editions are available free online and just as
| good for learning the big picture imo. Actually the older
| editions might be better for learning microarchitecture
| specifically because the more recent editions have cut some
| material on that in order to cover mobile and cloud
| computing.
| taeric wrote:
| You could rephrase that the pipeline is long because there is a
| lot of independent work that you can do in a processor.
| Consider: every independent operation you can identify is work
| you can potentially do at the same time.
|
| And I don't just mean decode, fetch, and execute. If your
| computer has independent ALU and Shifting units, you can
| potentially be shifting something as you are also adding. Have
| a dedicated Adder and a Multiplication unit? No reason you
| can't try and do both at the same time.
|
| This directly leads to wanting multiple instructions in flight
| at the same time, which means you have to be able to fetch and
| decode instructions faster than you are processing them. Also
| naturally leads to situations where you want to reorder so that
| the N Add instructions don't keep you from seeing an
| independent Shift that you could do in the same time.
|
| You can think that things are more complicated than they need
| to be. And, you may not even be wrong. But a ton of engineering
| is going into making what we have. If you think you can make
| something a ton faster by not doing it this way, probably a
| good thing to dig into how accurate that claim is.
| MrYellowP wrote:
| I am confused.
|
| > Same function interface as std::lower_bound, but 2x faster, and
| shorter. "branchless" because the if compiles down to a
| conditional move instruction rather than a branch/conditional
| jump.
|
| Assembly programmers did "fastest branchless binary searches"
| using cmov decades ago and we didn't need to think about the
| compiler at all. I know, because I was one of them. Optimizing
| branches away was and still is a nice pastime. This is not new,
| or original, and I don't understand what's noteworthy about it.
|
| "Elitist", you say? Well ... yeah! At least I know what I'm
| doing, unlike people who solely rely on and blindly trust the
| compiler to do their work for them.
| alain94040 wrote:
| I don't get it. The problem with binary search and branches is
| not the branches themselves, it's the fact that until you have
| done the comparison, you don't know which memory location in the
| array to fetch next. It doesn't matter if you use branches or
| anything else, the question is what do you want the processor to
| do?
|
| There is a data dependency: until I read the middle index, I
| can't tell if I want to search the data in the upper section or
| the lower section. I can speculate and issue reads to both. That
| would fix the dependency, but create more memory traffic. Is this
| the right trade-off?
|
| What do _you_ want? Removing branches is not it.
| mhdm wrote:
| Prefetching is the right tradeoff for large arrays. Addressed
| at the end of the article:
| https://mhdm.dev/posts/sb_lower_bound/#prefetching
| returningfory2 wrote:
| If the array is fully in L1 cache, isn't the cost of the branch
| mis-predict much greater than the memory fetches?
| alain94040 wrote:
| Full array in L1 is not a typical scenario for binary search.
| Binary search is usually for large data sets, that are in
| DRAM.
| thefifthsetpin wrote:
| Binary search reduces the search space exponentially as it
| proceeds, so actually quite a lot of the total comparisons
| can hit L1d cache. (Maybe half of them for a ~250GB
| dataset.)
|
| Of course, you could keep a cacheable partial index of your
| huge dataset to accelerate the early part of your search as
| well.
| sixthDot wrote:
| https://probablydance.com/2023/04/27/beautiful-branchless-bi...
| sesuximo wrote:
| I'm too lazy to try it, but putting __builtin_unpredictable
| around the implementation of comp might make std::lower_bound
| less branchy
| postalrat wrote:
| How can you call it branchless if it has "while (length > 0) {"
| dehrmann wrote:
| I took a stab at the same problem a while ago. Since the upper
| bound of iterations is based on the input length, if you write
| your search in a way that extra iterations don't change the
| result, you can use a switch fallthrough to "unroll" the loop
| and not have to branch.
|
| https://github.com/ehrmann/branchless-binary-search/blob/mas...
| sltkr wrote:
| This is literally addressed in the article:
|       More branchless, more better? Short answer: no. This section
|       can be skipped but here's the long answer: For n elements
|       where 2^k <= n < 2^(k+1), sb_lower_bound will always do k+1
|       comparisons. On a 64-bit machine that means at most 64
|       iterations of the while (length > 0) loop. So it's possible
|       to write a "fully branchless" version that doesn't have the
|       length check by using a switch with intentional fall-through.
|
|           size_t length = last - first;
|           size_t rem;
|           switch (std::bit_width(length)) {
|             case 64:
|               rem = length % 2; length /= 2;
|               first += comp(first[length], value) * (length + rem);
|             case 63:
|               rem = length % 2; length /= 2;
|               first += comp(first[length], value) * (length + rem);
|             // ...
|             case 1:
|               rem = length % 2; length /= 2;
|               first += comp(first[length], value) * (length + rem);
|           }
|           return first;
|
|       If you're not familiar with switch, think of it as a jump into
|       code. In our case to the exact place from which there are
|       exactly the right number of comparisons left to do. "Is it any
|       faster?" No. Modern x86 processors handle loop conditions well
|       as they're predictable; we're very likely to remain in the
|       loop. And that's good especially because it saves us from
|       writing templates or macros or copy-paste-edit the 64 cases.
|
| (Maybe people will RTFA if I paste it into the comments.)
| amethyst wrote:
| TBH I had to read that part two times before realizing that
| it was describing an unrolled loop.
| einpoklum wrote:
| Because that's a branch that's "always" taken, except for once
| - which, speed-wise, is very close to always taken, i.e. not a
| branch. The author counts the branching inside the loop, which
| is often taken and often not taken.
| postalrat wrote:
| It's not impossible to remove that branch. Until you do,
| this is a reduced-branch implementation.
| fsckboy wrote:
| the article covers that right at the top: _"branchless" because
| the if compiles down to a conditional move instruction rather
| than a branch /conditional jump._
| Const-me wrote:
| That particular branch is well-predicted for long arrays. You
| probably only gonna have a single misprediction for the
| complete binary search function, the performance consequence of
| that branch is negligible.
|
| People (and compilers, too) don't write branchless code just
| for the sake of it, they do because it helps with performance.
| kragen wrote:
| sometimes branchless code also stops timing side channel
| attacks, but it's not enough by itself
| amluto wrote:
| Oh, C++, the language where templates are extremely powerful but
| their usability is so poor that code like this is kind of
| idiomatic and even compiles:
|
|     template <class ForwardIt, class T, class Compare>
|     constexpr ForwardIt sb_lower_bound(
|         ForwardIt first, ForwardIt last, const T& value, Compare comp) {
|       auto length = last - first;
|
| This looks generic, it compiles by itself, it contains a fancy
| word "ForwardIt", and it works if you try it with a vector or an
| array. But you can't actually subtract forward iterators, and
| binary searching a range of forward iterators is quite silly.
|
| Rust would have required using the correct trait. Maybe C++ will
| improve the situation in the long run.
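|
| C++20 can already express the constraint (a sketch, assuming a
| random-access requirement is what's wanted here):
|
|     #include <iterator>
|
|     template <std::random_access_iterator It, class T, class Compare>
|     constexpr It sb_lower_bound(It first, It last, const T& value,
|                                 Compare comp) {
|         auto length = last - first;  // now checked by the concept
|         while (length > 0) {
|             auto rem = length % 2;
|             length /= 2;
|             if (comp(first[length], value))
|                 first += length + rem;
|         }
|         return first;
|     }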
___________________________________________________________________
(page generated 2023-08-11 23:00 UTC)