[HN Gopher] Fastest branchless binary search
       ___________________________________________________________________
        
       Fastest branchless binary search
        
       Author : jorangreef
       Score  : 294 points
       Date   : 2023-08-11 09:38 UTC (13 hours ago)
        
 (HTM) web link (mhdm.dev)
 (TXT) w3m dump (mhdm.dev)
        
       | foldr wrote:
       | >gcc has __builtin_expect_with_probability(cond, 0, 0.5) but it
        | does nothing (tested v10).
       | 
       | I wonder what could possibly be the use of this builtin. Branch
       | prediction varies enough between different processors that it
       | seems unlikely that anything useful could be done with a fine-
       | grained probability estimate.
        
         | bjourne wrote:
         | There are ARM processors where bits in the branch instructions
         | encode whether they should be predicted as taken or not. So on
         | such architectures such hints can be useful. On x86-64,
         | however, they are afaik completely useless (from an
         | optimization perspective).
        
         | Pannoniae wrote:
         | Read https://en.algorithmica.org/hpc/pipelining/branching/ and
         | the following chapter.
         | 
          | TL;DR: you can generate different, more efficient code if
          | you know how often it is going to be mispredicted.
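          | 
          | A minimal sketch of the builtin's interface (GCC; the
          | condition and the 0.1 probability are made up for
          | illustration):
          | 
          |     long f(long x, long threshold) {
          |         // Hint: the condition holds with ~10% probability.
          |         if (__builtin_expect_with_probability(x > threshold, 1, 0.1))
          |             return -x;   // colder path
          |         return x + 1;    // hotter path
          |     }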
        
           | foldr wrote:
           | Nothing in that link shows how a compiler could make use of a
           | fine-grained probability estimate in a practical way to guide
           | optimizations. I'm perfectly aware of the general concept of
           | branch prediction and the annotations that certain
           | architectures have in their instruction sets.
        
           | gpderetta wrote:
           | The built-in doesn't provide misprediction information to the
           | compiler.
        
         | twoodfin wrote:
         | Presumably it's not about matching the dynamic branch
         | prediction estimate but rather exploiting static (or profile-
         | provided) information to assist other optimizations.
         | 
         | Inlining seems like a natural application: Inlining is perhaps
         | the most critical optimization to get right, and getting it
         | right is a complex balance of frequency of use vs. added code
         | bloat and associated further optimization cost. Having more
         | than one bit of branch-likelihood as input to these decisions
         | could easily be helpful.
        
           | foldr wrote:
           | Good point. That's the only plausible explanation in this
           | thread, I think. However, if you're at the level of
           | performance tuning where you want to optimize inlining beyond
           | the compiler defaults, I would have thought that it would
           | make more sense to just experiment with forcing certain
           | functions to be inlined or not.
        
         | gpderetta wrote:
         | I think it is used for code layout: typically you want to make
         | the most often taken branch as fallthrough as it can be
         | slightly faster even when perfectly predicted.
         | 
         | I also tried (and failed) to use expect with probability to
         | generate a cmov, but in retrospect it is obvious: the parameter
         | is not the prediction probability, but the taken/not-taken
         | probability, and even a branch that goes either way with 50%
         | probability can still be predicted perfectly (for example a
         | branch that switches every other iteration), so the parameter
         | can't be used to decide between predication and prediction.
         | 
         | edit: what we really need is an [[unpredictable]] attribute.
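          | 
          | A sketch of the intent using clang's existing
          | __builtin_unpredictable (which, per the replies below,
          | historically didn't actually influence cmov conversion):
          | 
          |     int pick(int a, int b, bool c) {
          |         // Mark the condition as effectively random,
          |         // nudging the compiler toward cmov over a branch.
          |         if (__builtin_unpredictable(c))
          |             return a;
          |         return b;
          |     }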
        
           | foldr wrote:
           | >typically you want to make the most often taken branch as
           | fallthrough as it can be slightly faster even when perfectly
           | predicted.
           | 
           | Sure, but that application doesn't require a fine-grained
           | probability estimate.
        
           | twoodfin wrote:
           | Indeed: A branch taken 50% of the time can be readily
           | predicted 100% of the time if it follows a pattern the CPU
           | can infer.
           | 
            | That's why you have to be careful choosing realistic
            | input data if you are microbenchmarking at this level for
            | a real purpose.
        
           | pixelesque wrote:
           | clang has one:
           | 
           | https://clang.llvm.org/docs/LanguageExtensions.html#builtin-.
           | ..
        
             | ot wrote:
             | Which also does not affect cmov conversion. Every time this
             | comes up I bring up this long-standing LLVM bug:
             | https://bugs.llvm.org/show_bug.cgi?id=40027
        
               | dzaima wrote:
               | __builtin_unpredictable is gonna be fixed in LLVM/clang
               | 17: https://github.com/llvm/llvm-
               | project/commit/09515f2c20111628...
               | 
               | (also, bugs.llvm.org is old; the more up-to-date (but
               | still open) issue is https://github.com/llvm/llvm-
               | project/issues/39374)
        
         | wongarsu wrote:
          | At least when I last read it a couple of years ago, Intel's
          | optimization manual recommended certain code layouts. For
         | example in the absence of better information a conditional jump
         | backwards would be predicted as true (because that's what loops
         | look like) while a jump forward is seen as less likely to
         | happen (error conditions and else branches look like that).
         | 
          | Not sure how consistent this is across architectures, or
          | how useful it still is. But at least it used to be a thing.
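          | 
          | The layout heuristic in code form (an illustrative sketch,
          | not from the manual):
          | 
          |     // Hot path falls through; the error check is a forward
          |     // jump (statically predicted not taken) and the loop's
          |     // backward jump is statically predicted taken.
          |     long sum(const int* a, long n) {
          |         long s = 0;
          |         for (long i = 0; i < n; ++i) {
          |             if (a[i] < 0)
          |                 return -1;  // cold, forward jump
          |             s += a[i];
          |         }
          |         return s;
          |     }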
        
           | kragen wrote:
           | pretty common but not, as i thought, universal since stretch 
           | https://en.wikipedia.org/wiki/Branch_predictor#Static_branch.
           | ..
           | 
           | but the only simpler approach is 'predict not taken' which in
           | a sense is what a cortex-m0 does
        
       | LoganDark wrote:
       | Does anyone know where the "BUT RUST" link was _supposed_ to
       | lead? It seems to be already out of date due to being
        | unversioned, I can't tell whether it's supposed to lead to the
       | middle of the `starts_with` doc comment or not.
        
         | quietbritishjim wrote:
         | Looking at the archive.org captures just before [1] and just
         | after [2] the article was published, it looks like it was meant
         | to be this line of code, now on line 2779 [3]:
         | let mid = left + size / 2;
         | 
         | [1]
         | https://web.archive.org/web/20230602210213/https://doc.rust-...
         | 
         | [2]
         | https://web.archive.org/web/20230709221353/https://doc.rust-...
         | 
         | [3] https://doc.rust-lang.org/src/core/slice/mod.rs.html#2779
        
           | LoganDark wrote:
           | Alright, and as a versioned link that'll be:
           | https://github.com/rust-
           | lang/rust/blob/7d8386f05cedf33c7475c...
           | 
           | Thank you!
        
             | veber-alex wrote:
             | You can get versioned link directly into rustdoc:
             | 
             | https://doc.rust-
             | lang.org/1.71.1/src/core/slice/mod.rs.html#...
        
               | LoganDark wrote:
               | I couldn't find any direct references to this anywhere,
               | believe me I spent a long 30 minutes looking... >.<
               | 
               | But thank you anyway <3
        
           | epilys wrote:
            | So this is basically a *safe* optimization, since the
            | index will always be valid and there's no need for the
            | compiler to emit a bounds check that could panic.
           | 
           | This is how unsafe {} should be used. Sometimes some things
           | are true but the compiler can't know that. And here the
           | unsafe {} means that it dereferences a raw pointer (the index
           | that we know is valid). If the unsafe {} safety condition is
           | valid, unsafe {} is, well, safe.
        
             | someplaceguy wrote:
             | Furthermore, it's an optional optimization (you could just
             | copy the code and replace the unsafe access with a safe
             | one, if you're paranoid) and it's not like if you write it
             | in C++ it will be any safer than Rust's unsafe?!
        
         | mhdm wrote:
         | It was meant to link to rust's binary search implementation.
         | Updated to https://doc.rust-
         | lang.org/1.71.1/src/core/slice/mod.rs.html#...
        
       | WinLychee wrote:
       | Other fast binary searches https://github.com/sslotin/amh-
       | code/tree/main/binsearch
        
       | [deleted]
        
       | hajrice wrote:
       | I wish all blog posts started the way his does: "You're a busy
       | person so I'll first jump right to it. Here it is, the fastest
       | general (and simple) binary search C++ implementation:"
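        | 
        | For reference, that implementation, reconstructed here from
        | the fragments quoted elsewhere in this thread (see the
        | article for the authoritative version):
        | 
        |     template <class ForwardIt, class T, class Compare>
        |     constexpr ForwardIt sb_lower_bound(ForwardIt first,
        |             ForwardIt last, const T& value, Compare comp) {
        |         auto length = last - first;
        |         while (length > 0) {
        |             auto rem = length % 2;
        |             length /= 2;
        |             // The point of the article: this compiles to a
        |             // conditional move rather than a branch.
        |             if (comp(first[length], value))
        |                 first += length + rem;
        |         }
        |         return first;
        |     }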
        
         | ape4 wrote:
         | Except I am too busy to read "You're a busy person"
        
           | ant6n wrote:
           | But not too busy to read the comments!
        
             | hinkley wrote:
             | The comments have a higher density of opinions and
             | experience.
        
               | eesmith wrote:
               | But does it have a higher density of _good_ opinions and
               | experience? ;)
        
       | bjourne wrote:
       | On my Cascade Lake processor "-mllvm -x86-cmov-converter=false"
       | almost halves the performance of the binary search:
        | 
        |     | Benchmark | gcc  | clang | clang -cmov |
        |     |-----------|------|-------|-------------|
        |     | slow u32  | 23.4 | 46.7  | 45.8        |
        |     | fast u32  | 18.1 | 19.8  | 31.4        |
       | 
        | The numbers are nanoseconds/bsearch on a 100mb uint32 array. Seems
       | to me that clang (15.0.7) is just much worse at optimizing this
       | particular piece of code than gcc (13.2.1). You can see the
       | assembly here https://godbolt.org/z/cbx5Kdjs6. The gcc assembly
       | looks way cleaner to me.
        
         | mhdm wrote:
         | Then you'll want to look at
         | https://mhdm.dev/posts/sb_lower_bound/#prefetching
         | 
          | 100mb is large enough that the branchy version turns out to
          | have a slight advantage, more due to quirks of x86
          | (speculative execution) than to being inherently better.
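          | 
          | The idea there, sketched (not the article's exact code; the
          | loop shape follows the sb_lower_bound reconstruction quoted
          | earlier in the thread):
          | 
          |     // Issue loads for both candidate next probes so the
          |     // memory system runs ahead of the data-dependent
          |     // choice between the two halves.
          |     template <class T, class Compare>
          |     const T* pf_lower_bound(const T* first, const T* last,
          |                             const T& value, Compare comp) {
          |         auto length = last - first;
          |         while (length > 0) {
          |             auto rem = length % 2;
          |             length /= 2;
          |             __builtin_prefetch(&first[length / 2]);
          |             __builtin_prefetch(&first[length + rem + length / 2]);
          |             if (comp(first[length], value))
          |                 first += length + rem;
          |         }
          |         return first;
          |     }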
        
       | shmageggy wrote:
       | > _If only there was a clean fast bare-metal language to write
       | all this in.._
       | 
        | The author includes footnotes for "BUT RUST.." and "BUT ZIG..",
       | but how about Nim? Looks like there is a native library
       | implementation of `lowerBound` https://github.com/nim-
       | lang/Nim/blob/version-2-0/lib/pure/al... Yes, it's not a "bare-
       | metal" language, but it compiles to one (or two), so it would be
       | interesting to see what it compiles to here.
       | 
       | Also, what's wrong with C?
        
         | TillE wrote:
         | An equivalent generic sort function in C would, at best,
         | require a bunch of additional distracting cruft. This is
         | exactly what C++ templates are for.
        
         | matklad wrote:
         | > BUT ZIG.. There's no binary search implementation in Zig that
         | I could find, rather it calls to C++.
         | 
         | Zig's binary search is here, it's an unoptimized textbook
         | version:
         | https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...
         | 
         | At TigerBeetle, we have our own branchless implementation here:
         | 
         | https://github.com/tigerbeetle/tigerbeetle/blob/e996abcf7154...
        
           | josephg wrote:
           | Rust's binary search is here. Also a pretty textbook version
           | of the function: https://github.com/rust-
           | lang/rust/blob/4d7a80d48697171ed151c...
           | 
           | How much faster is the branchless version in practice? I make
           | heavy use of binary search in some code I maintain, and I
           | wonder if it would make much of a difference to switch to a
           | more efficient version of the function.
        
             | WinLychee wrote:
             | Rust's binary search used to be branchless but due to some
             | issues down in LLVM land, the codegen doesn't issue a
             | `cmov` where it should. https://github.com/rust-
             | lang/rust/issues/53823
             | 
             | old branchless version: https://github.com/rust-
             | lang/rust/pull/45333/files
        
             | abhishekjha wrote:
              | I don't know much about Rust data types or Rust in
              | general, but doesn't it have integer overflow?
              | 
              | I see (left+right)/2. Is it like Python with unbounded
              | precision?
        
               | cornstalks wrote:
               | It's bounded precision, but Rust limits the max size of
               | an object/array to isize's max[1], not usize's max. So
               | adding two isize::MAX values using usize will never
               | overflow.
               | 
               | [1]: https://doc.rust-
               | lang.org/stable/reference/types/numeric.htm...
        
               | LegionMammal978 wrote:
               | Such an overflow could still be problematic for slices of
               | zero-sized types, which can contain up to usize::MAX
               | elements, since the total size in bytes is always 0. But
               | (left + right) / 2 doesn't actually occur anywhere in the
               | code, only left + size / 2, which clearly can't overflow
               | as long as size is sane.
        
               | [deleted]
        
               | josephg wrote:
               | Integer overflows cause panics in debug mode. And its
               | undefined behaviour in release mode.
               | 
               | Where do you see that in the code? I can't see
               | (left+right)/2 anywhere in the code I linked?
        
               | [deleted]
        
               | cornstalks wrote:
               | > _And its undefined behaviour in release mode._
               | 
               | No, it uses 2's complement and is well defined in release
               | mode. From [1]:
               | 
               | > _When you're compiling in release mode with the
               | --release flag, Rust does not include checks for integer
               | overflow that cause panics. Instead, if overflow occurs,
               | Rust performs two's complement wrapping._
               | 
               | [1]: https://doc.rust-lang.org/book/ch03-02-data-
               | types.html#integ...
        
               | marcianx wrote:
               | I'm pretty sure it's not undefined behavior in rust in
               | release mode if it does overflow. It's fully specified
               | behavior, I believe.
        
               | Phemist wrote:
                | Left and right are both usizes, which are 64-bit
                | pointer-sized integers. You would need to work on an
                | array of 2^63 elements before you have to worry about
                | integer overflow issues.
               | 
               | This array would not fit in any kind of memory for the
               | foreseeable future :)
        
               | creata wrote:
               | Not that it makes much of a difference, but Rust does
               | compile to 32-bit platforms.
               | 
               | https://doc.rust-lang.org/nightly/rustc/platform-
               | support.htm...
        
               | Phemist wrote:
                | Ah yes, fair point. It makes it a bit more subtle: the
                | cases where you have to worry about integer overflow
                | on the pointer addition are cases where you have an
                | array of 2^((64 or 32) - 1) bools... which seems
                | rather silly to do a binary search on?
        
               | quietbritishjim wrote:
               | > I see (left+right)/2.
               | 
               | There is no (left+right)/2 on that page.
               | let mid = left + size / 2;
               | 
               | It's only size that is being divided by 2, which is the
               | size of the segment still under consideration (which is
               | initially the whole array). left is the starting index of
               | segment still under consideration (which is initially 0).
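                | 
                | The same consideration written out in C++ for
                | illustration (the function name is mine, not from the
                | Rust source):
                | 
                |     #include <cstddef>
                |     size_t midpoint(size_t left, size_t size) {
                |         // left + size / 2 cannot overflow, since
                |         // left + size is at most the array length;
                |         // (left + right) / 2 could.
                |         return left + size / 2;
                |     }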
        
               | vulcan01 wrote:
                | In debug mode, integer overflow panics. In release
                | mode, overflow wraps around; that's why checked_add and
                | friends exist.
        
         | eru wrote:
         | > Also, what's wrong with C?
         | 
         | Oodles and oodles of undefined behaviour, for example.
         | 
         | C ain't clean.
        
           | User23 wrote:
           | Nah, thanks to CompCert[1] C actually has one of the highest
           | quality and most predictable compilers.
           | 
           | [1] https://compcert.org/
        
             | Ygg2 wrote:
              | How many of the available C libraries are CompCert compatible?
        
             | insanitybit wrote:
             | What is the practicality of actually using CompCert? And
             | what does it actually defend against?
             | 
             | All I know is that it tries to ensure that the behavior of
             | the program is the same as the input, meaning that the
             | compiler itself does not have bugs. But how does that
             | actually interact with undefined behavior in the language?
             | What happens to an array out of bounds?
             | 
             | And what are the limitations imposed, since it only works
             | on a subset of the C language?
             | 
             | It seems unlikely that I could just switch compilers and
             | suddenly have a safe program written in C.
        
               | LegionMammal978 wrote:
               | It looks like CompCert explicitly does not attempt to
               | prevent UB from resulting in unpredictable behavior; all
               | it guarantees is that "the observable behavior of [the
               | compiled code] _C_ improves on one of the allowed
                | observable behaviors of [the source program] _S_", where
               | "improving on" allows for arbitrary behavior on UB [0]:
               | 
               | > Third, the compiler is allowed to improve the behavior
               | of the source program. Here, _to improve_ means to
               | convert a run-time error (such as crashing on an integer
               | division by zero) into a more defined behavior. This can
               | happen if the run-time error (e.g. division by zero) was
               | optimized away (e.g. removed because the result of the
               | division is unused). However, if the source program is
               | known to be free of run-time errors, perhaps because it
               | was verified using static analyzers or deductive program
               | provers, improvement as described above never takes
               | place, and the generated code behaves exactly as one of
               | the allowed behaviors of the source program.
               | 
               | So you still have to prove your program's correctness by
               | some other means.
               | 
               | [0] https://compcert.org/man/manual001.html#sec3
        
               | User23 wrote:
               | You have to scroll down a bit, but see here[1]. Also
               | searching [site:regehr.org compcert] will find more.
               | CompCert C makes it impossible to reproduce at least some
               | UB bugs:                 For CompCert, the high-level
               | correctness theorems I proved are all of the first kind
               | above: if the source program goes wrong, nothing is
               | guaranteed about the behavior of the compiled code.  It
               | happens, however, that the stepwise simulation lemmas I
               | build on actually prove property 2 above: the compiled
               | code cannot crash "earlier" than the source code, and
               | will produce the same pre-crash observables, then
               | possibly more.  So maybe I should strengthen my high-
               | level correctness theorems...
               | 
               | That was over a decade ago, so perhaps it's been done. I
               | confess I haven't closely followed CompCert in some time,
               | due to having priorities elsewhere. As you can see
               | elsewhere in that reference CompCert C will fail to
               | generate the example UB code.
               | 
               | But yes, mechanically proving additional properties with
               | some kind of SAT solver is also an open area of research.
               | Dafny is the example I'm familiar with. Its conditions
               | and invariants allow things like automatically proved
               | binary search[2]. Converting CompCert C statements to
               | SMTLibv2 assertions (which is, to an extremely rough
               | first approximation what Dafny does) would certainly be
               | considerably easier than it would be for most other
               | languages. It would still be a serious effort of course.
               | 
               | [1] https://blog.regehr.org/archives/232
               | 
               | [2] https://github.com/dafny-
               | lang/dafny/blob/master/Test/dafny4/...
        
               | LegionMammal978 wrote:
               | Apart from compiler bugs, this issue with "time-traveling
               | UB" shouldn't really occur in practice with any side
               | effects other than volatile accesses. As noted by Martin
               | Uecker [0],
               | 
               | > In portable C code I/O is performed using function
               | calls of the standard library. A compiler can generally
               | not assume that a function returns to the caller, because
               | the function could call 'exit', 'abort', 'longjmp', or
               | enter an infinite loop. For this reason, it is never
               | allowed for a compiler to use any information derived
               | from an operation with potential UB that comes after a
               | function call in a way that could affect observable
               | behavior before the function call. Consequently,
                | observable behavior that is accessed via function calls
                | (_i.e._ all except volatile accesses) can not be affected
               | by time-traveling optimizations even when using the
               | abstract interpretation. In principle, a compiler could
               | use special knowledge about C library functions and use
               | the fact those functions always return to the caller, but
               | we are not aware of any compiler which does this (and it
               | hardly seems worthwhile).
               | 
               | I believe LLVM has a "mustreturn" attribute expressing
               | this, but as noted, no standard C library messes with any
               | such attributes for functions that cause observable
               | behavior.
               | 
               | So time-traveling UB really only affects programs which
               | use volatile accesses for MMIO, and it might eventually
               | be banned entirely from the C standard if the linked
                | proposal N3128 goes through. I'd say that even "some C
                | bugs" is an overstatement outside of embedded
                | programming, given the rarity of MMIO in a hosted
                | environment.
               | 
               | [0] https://www.open-
               | std.org/jtc1/sc22/wg14/www/docs/n3128.pdf
        
       | w0mbat wrote:
       | Is that still lower_bound? Maybe I am misreading the code but it
       | looks like this returns any match, not the earliest match (when
       | there are dupes).
       | 
       | It's common to have multiple matches even in a unique list if the
       | comparison function is say looking for a certain string prefix to
       | do autocomplete, but we want the earliest in the list.
        
         | klysm wrote:
         | I guess it's good to have the option of not caring if you want
         | even more speed
        
           | shultays wrote:
            | My lower_bound is a bit better:
            | 
            |     template <class ForwardIt, class T, class Compare>
            |     constexpr ForwardIt super_optimized_lower_bound(
            |             ForwardIt first, ForwardIt last,
            |             const T& value, Compare comp) {
            |         return first;
            |     }
            | 
            | It doesn't really work like std::lower_bound for some
            | cases but it is super fast
        
           | [deleted]
        
           | commandlinefan wrote:
            | Or, less contentiously: if you know you don't care about
            | or don't need to disambiguate dups, look at how much
            | efficiency you're losing on a case that isn't important.
           | 
           | The most common argument against optimization is "just use
           | the existing code", but the "existing code" _always_ attempts
           | to handle special edge cases that probably don't apply in
           | your case. We programmers waste a _lot_ of end-user time
           | saving a bit of programmer time.
        
             | jfk13 wrote:
             | On the other hand, assuming it's OK to ignore the special
             | edge cases (or not even thinking of them) can come back to
             | bite you (or your users) when eventually one of them _does_
                | show up in a situation you didn't anticipate.
        
               | plagiarist wrote:
               | Anecdotally this happens to me every time I omit an edge
               | case. Now I try to always write a unit test for the
               | special edge case being deliberately omitted. I just wish
               | I could catch everything that way.
        
         | sltkr wrote:
         | As far as I can tell it does return the earliest match. Why do
         | you think it doesn't?
        
           | w0mbat wrote:
            | I am possibly confused, and getting more so the more I
            | read the code. As it is ignoring the sign of the result of
            | compare, I don't even see how it would work at all. The
            | sign is what tells you which half to throw away as you
            | search.
        
         | badtension wrote:
          | You halve the remaining length each time there is a match
          | and only exit the loop when length is 0, so it should be the
          | first one.
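          | 
          | A quick check with duplicates (using the sb_lower_bound
          | reconstruction quoted earlier in the thread):
          | 
          |     int a[] = {1, 3, 3, 3, 7};
          |     // Yields a pointer to a[1], the first 3, matching
          |     // std::lower_bound semantics.
          |     auto it = sb_lower_bound(a, a + 5, 3,
          |             [](int x, int v) { return x < v; });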
        
       | ectospheno wrote:
       | Every post like this makes me update my junior engineer
       | watchlist. Then I have to go patiently explain why mucking about
       | in the code to save one instruction because you read a blog post
       | is a horrible idea. Easily half of all silly junior code is
       | pointless optimization.
       | 
       | Having said that, I did enjoy the post.
        
         | sph wrote:
         | What kind of junior engineers are you hiring that prematurely
         | rewrite a hot loop to use branchless assembly logic?
         | 
         | The junior employees I have had to deal with tended to build
         | towering abstractions that no one really needed, rather than
         | optimising hot loops.
        
           | ectospheno wrote:
           | Yes, I would say that is the other half.
        
       | anonymoushn wrote:
       | The Zig stdlib does not call out to C++ for binary search. The
       | binary search is currently here:
       | https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...
       | 
       | (edit: fixed link to not decay, thanks)
        
         | LoganDark wrote:
         | Versioned link:
         | https://github.com/ziglang/zig/blob/b835fd90cef1447904d3b009...
        
         | mhdm wrote:
         | Thanks, updated the footnote.
        
       | JonChesterfield wrote:
        | Branches in the comparison function? Doesn't count, as simple
        | functions end up as a conditional move. Branches on the loop
        | back edge? Doesn't count, as there's a branch predictor.
       | 
       | I was hoping for a binary search without branches in it. Not
       | really what we got.
        
       | andyg_blog wrote:
        | This is not a valid drop-in replacement for lower_bound.
        | Accessing iterators as front[index] is only for random access
        | iterators like vector has. The author might have realized this
        | if they had benchmarked on other containers. A forward
        | iterator must be advanced and then dereferenced.
        
         | commonlisp94 wrote:
         | Good point. I don't think anyone actually uses it for
         | containers that only support forward iteration. But the author
         | did declare the signature with ForwardIterator.
        
         | fn-mote wrote:
         | Can you do binary search without random access? I would not
         | expect so. Am I missing a point that someone could explain?
        
           | andyg_blog wrote:
           | The C++ standard measures the time complexity of lower_bound
           | and binary_search by the number of comparisons performed. So
           | yes, it's possible on e.g., a linked list like std::list (by
           | this definition)
        
           | kccqzy wrote:
            | Observe that you can restrict your code to only call
            | std::advance on the iterator. So you could definitely do
            | binary search without random access: you would still step
            | through elements linearly, but you skip the comparison
            | function for the majority of elements.
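            | 
            | A sketch of that, essentially the textbook lower_bound
            | over forward iterators: O(n) iterator steps but only
            | O(log n) comparisons.
            | 
            |     #include <iterator>
            | 
            |     template <class FwdIt, class T, class Compare>
            |     FwdIt fwd_lower_bound(FwdIt first, FwdIt last,
            |             const T& value, Compare comp) {
            |         auto length = std::distance(first, last);
            |         while (length > 0) {
            |             auto half = length / 2;
            |             FwdIt mid = first;
            |             std::advance(mid, half);  // element by element
            |             if (comp(*mid, value)) {
            |                 first = ++mid;
            |                 length -= half + 1;
            |             } else {
            |                 length = half;
            |             }
            |         }
            |         return first;
            |     }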
        
         | mhdm wrote:
         | For a proper/sane forward iterator, front[index] is the same
         | thing as *(front + index) and front + index achieves the same
         | as std::advance(front, index) (just not in place).
         | 
         | Or are you pointing out how first[length] and first += length +
         | rem would technically result in advancing a forward iterator
         | through to first + length twice? (Trivially fixed with a middle
         | = first + length temporary variable btw).
        
       | notorandit wrote:
       | Branchless?
       | 
       | You need at least 1 branch to exit the loop.
        
         | ant6n wrote:
         | Do you tho? You could unroll the loop ceil(log2(MAX_SIZE))
         | times.
        
       | tgv wrote:
        | If the compiler doesn't optimize the if(cmp(...)) first +=
        | length + rem; perhaps a sign operation could be used when cmp
        | returns a number. Something along the lines of:
        | 
        |     first += (length + rem) * ((sign(cmp(...)) + 1) / 2)
        | 
        | Has anyone tried that? DDG isn't very helpful.
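        | 
        | For a bool-returning comp, the multiply form is simpler (an
        | illustrative fragment using the article's loop variables):
        | 
        |     // The 0/1 result of comp gates the advance, with no
        |     // conditional at all.
        |     first += (length + rem)
        |            * static_cast<ptrdiff_t>(comp(first[length], value));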
        
       | funkychicken wrote:
       | It looks like the unpredictable attribute now influences the cmov
       | conversion pass (as of June 1st, so probably in clang 17/18)
       | 
       | https://reviews.llvm.org/D118118
        
       | 1letterunixname wrote:
        | It's not branchless if there are any Jcc other than JMP
       | because that's an opportunity for a branch prediction miss and a
       | pipeline stall.
        
       | jezze wrote:
        | Wouldn't it be faster to do length & 1 instead of length % 2 and
       | also length >> 1 instead of length / 2? But maybe the compiler
       | does this behind the scenes?
        
         | ok123456 wrote:
          | The compiler will do these strength reduction optimizations
          | if they're applicable.
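          | 
          | For unsigned operands the pairs are exactly equivalent and
          | compile to the same shift/mask instructions (sketch; for
          | signed values the rounding rules differ):
          | 
          |     size_t rem = length % 2;  // emitted as length & 1
          |     length /= 2;              // emitted as length >> 1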
        
         | planede wrote:
         | That's an optimization you can rely on, yes.
        
         | Chabsff wrote:
         | Yeah, these are trivial optimizations.
        
       | commonlisp94 wrote:
       | lower_bound and upper_bound are typically implemented in terms of
       | partition_point, which is much more general than this version of
       | lower_bound taking an element.
        
         | Aardwolf wrote:
         | Even more reason to use this if you don't need the more general
         | case and its associated performance cost
        
           | commonlisp94 wrote:
           | Partition point is even simpler and can be optimized the same
           | way.
           | 
           | > if you don't need the more general case and its associated
           | performance cost
           | 
           | In C++ the "more general case" typically doesn't have a
           | performance cost.
        
       | victor106 wrote:
        | Does anyone know if Java already does this optimization?
        
       | djoldman wrote:
       | Interesting that the results don't hold up with a more
       | complicated comp comparison function:
       | 
       | > For somewhat realistic scenarios of binary searching with a
       | slower comp() function I've thought of searching through ids,
       | phone numbers, accounts and keywords. I've thus settled on
       | testing searching 8-byte strings.
       | 
       | > ...
       | 
       | > In this case std::lower_bound is very slightly but consistently
       | faster than sb_lower_bound. To always get the best performance it
       | is possible for libraries to use sb_lower_bound whenever directly
       | working on primitive types and std::lower_bound otherwise.
       | 
       | I would like to see the analysis here.
        
         | pclmulqdq wrote:
         | I believe this happens because branch prediction lets you
         | pipeline multiple simultaneous comparisons, and rewind whenever
         | the branch predictor is wrong (about half the time for truly
         | random data and inputs). The CMOV approach blocks after each
         | comparison function due to the data dependency. On average,
         | you're doing two comparisons at a time with branches, and one
         | with CMOV, so when comparison time is greater than branch
         | prediction penalty, you would expect a crossover to occur.
        
           | gpderetta wrote:
           | Switching to an N-way search (for N>2) could help extract the
           | lost parallelism. At the cost of touching more cachelines
           | though, so it is not necessarily a win.
        
             | pclmulqdq wrote:
             | You're also wasting a lot of work doing that. If you go two
             | levels deep, you now do 2 comparisons per iteration
             | (caching the result from the previous level), but you are
             | guaranteed to waste one of those. The CPU with branch
             | predictor is doing about 1.5 comparisons per iteration
             | (worst case) thanks to the 50% mispredict rate, although
             | this depends a lot on how long the comparison takes: a very
             | long comparison function will converge to 1 comparison per
             | iteration. Only if you replicate the speculation for
             | yourself - which will necessarily have to be a lot worse
             | than a CPU branch predictor to be fast and branchless - do
             | you have any hope of an efficiency gain.
        
               | gpderetta wrote:
                | You do additional comparisons at each recursion
                | depth, but you get a shallower recursion depth in
                | exchange. I think you break even already at 4-way
                | search.
        
         | twoodfin wrote:
         | I can't find any assertion in the post about the content of any
         | of the input data sets or search keys, other than that they're
         | "unpredictable".
         | 
         | I assume purely random, but if these 8-byte strings were
         | anything but pure information, modern branch predictors can
         | sniff their way to better performance than cmov pretty easily.
        
       | edg-l wrote:
       | Check out https://en.algorithmica.org/hpc/data-structures/binary-
       | searc...
        
       | 3cats-in-a-coat wrote:
       | Every time I see people trying to eliminate branches, I wonder,
       | do we realize that having long pipelines where a branch
       | misprediction stalls the pipeline is not actually a necessary
       | part of architecture?
       | 
       | The pipeline is long because we do lots of analysis and
       | translation on the fly, just in time, which could easily be done
       | in most cases ahead of time, as it's not a very stateful
       | algorithm.
       | 
       | This is how Transmeta Crusoe CPUs worked. Imagine NOT caring that
       | you have a branch.
       | 
       | After all, if you think about it, all operations are branches.
       | Down to bitwise operations and summing two numbers. It's
       | branching in the ALU, carry bits and what not. You can't compute
       | absolutely anything without looking at the state of one or more
       | bits, and making a decision that changes the outcome. But the
       | reason those branches don't hurt performance is that they're not
       | branches on the main pipeline. These local branches have a very
        | short (or non-existent) "pipeline". And the main pipeline is
        | therefore not affected, even though the actual state of the
        | system is.
        
         | Const-me wrote:
         | > as it's not a very stateful algorithm
         | 
         | It might be stateless, but it depends on many things unknown at
         | compile time.
         | 
          | One of them is the input data being processed. Binary search
          | is exactly that: the compiler doesn't know at which position
          | the result will be found.
         | 
          | Another one is micro-architecture, most notably cache
          | hierarchy, and composition of EUs inside cores. If you switch
         | to ISA with instructions resembling the micro-ops of the
         | current CPUs, you'll have to re-compile for every micro-
         | architecture. However, this one is technically solvable with
         | JIT compiler in OSes, like current GPUs where programs ship in
         | byte code formats (DXBC, SPIR-V, NVPTX) then user-mode half of
         | GPU driver recompiles into actual hardware instructions.
         | 
          | Another huge one: other CPU threads are running unknown code.
         | Even if you drop hyper-threading making cores independent,
         | there're still resources shared across the complete chip: L3
         | cache, off-chip memory, I/O bandwidth, and electricity i.e.
         | thermals.
        
           | scottlamb wrote:
           | > > > as it's not a very stateful algorithm
           | 
           | > It might be stateless, but it depends on many things
           | unknown at compile time.
           | 
           | > One of them is the input data being processed. Binary
           | search is exactly that, compiler don't know at which position
           | the result will be found.
           | 
            | Are you and 3cats-in-a-coat talking about the same "it"?
           | I think they're talking about the work that requires such a
           | long pipeline [1], as they've stated that doing more ahead of
           | time would reduce the length of the pipeline and thus
           | importance of accurate branch prediction. You are talking
           | about...the branch predictor? the application's algorithm
           | being executed (binary search in this case)?
           | 
           | [1] my CPU microarchitectural knowledge isn't that good; I
           | have only the vaguest idea of what that is. Register renaming
           | / data dependency analysis / instruction scheduling, maybe?
           | translation to micro-ops? other things? don't know!
        
         | robertlagrant wrote:
         | > After all, if you think about it, all operations are branches
         | 
         | Isn't this definition the crux of it? If you redefine
         | everything as a Branch(tm), even things that are not branches,
         | then you can definitely precompute some Branch(tm)es.
         | 
          | But the sort of non-Branch(tm) branch elimination being
          | talked about is actual branching of computation routes in
          | code due to an if/else statement or similar, is it not?
         | 
         | That would still be useful to do in your world of Branch(tm)es,
         | just only the Branch(tm)es that try and compute the results of
         | more than one future at the same time (i.e. branches).
        
           | [deleted]
        
         | [deleted]
        
         | mhh__ wrote:
         | People have spent probably a billion dollars trying to get this
         | to work. It's a noble goal but when will it actually stick?
        
         | dragontamer wrote:
         | > The pipeline is long because we do lots of analysis and
         | translation on the fly, just in time, which could easily be
         | done in most cases ahead of time, as it's not a very stateful
         | algorithm.
         | 
         | You're going down the wrong path, again, as Intel did with
         | Itanium.
         | 
         | We have pipelines because CPUs are performing Tomasulo's
         | algorithm at runtime, because there are 10 pipelines in
         | practice (for Intel systems), and all can be run in parallel.
         | Multiply can be issued on pipeline0, pipeline1, or pipeline5.
         | 
         | If there are 3 multiply instructions, all three pipelines (p0,
         | p1, and p5) all get issued a multiply on _THIS_ clock tick.
         | 
         | ----------
         | 
         | The analysis must be dynamic because instructions such as "mov"
         | can take a variable amount of time: loading data from L1 cache
         | is 4-clock ticks, L2 cache is 40 clock ticks, L3 cache is ~150
         | clock ticks, and RAM is 400+ clock ticks.
         | 
          | That means a "lazy" strategy, where you analyze the state of
          | the pipeline "as late as possible" before making a decision,
          | wins out. If you do pre-analysis (ie: what Intel's compiler was
         | supposed to do with Itanium), you pretty much lose out on all
         | variable-time instructions (ex: cache vs RAM).
         | 
         | If you know for certain that you have no memory-operations,
         | then you probably should use a GPU instead of a CPU.
         | 
          | ---------
          | 
          |     theLoop:
          |         mov ebx, eax[ecx]  ; y = array[x]
          |         add ebx, edx       ; y += z
          |         add ecx, 4
          |         cmp ecx, max
          |         jnz theLoop
         | 
          | How long should you schedule the "add" instruction in the
          | above assembly code? If you've got a dynamic system just
          | analyzing your pipelines and "lazily" issuing the
          | instructions, you "perform it when it's ready" (ie: after
          | the mov instruction is complete, as you need ebx /
          | variable-y to have finished reading from RAM before
          | progressing).
         | 
         | "add ecx, 4" can actually be done right now in parallel. You
         | split ecx into two registers (ecx-old and ecx-new). You issue
         | "mov ebx, eax[ecx-old]", and then you can execute future
         | instructions with ecx-new. If the next loop is branch predicted
         | (as would be the case in binary search, since you almost always
         | loop), the pipeline/branch prediction stuff works together and
         | you can execute mov ebx, eax[ecx0], mov ebx, eax[ecx1], mov
         | ebx, eax[ecx2], mov ebx, eax[ecx3]... all in parallel (!!!!),
         | as ecx3 gets branch predicted and multiple loops start running
          | in parallel.
         | 
          | It's not hard, it's not very difficult analysis, it's low power.
         | Everyone does this parallel computation now. If you go static /
         | compiled ahead-of-time, you can't do this kind of thing
         | anymore. So you lose out in speed compared to regular CPUs that
         | have OoO analysis going on.
         | 
         | ----------
         | 
         | Furthermore, most code with branches can seemingly be replaced
         | with branchless code (ex: cmov instructions, min instruction,
         | max instruction, etc. etc.)
         | 
         | So instead of magic compilers trying to solve an unsolvable
         | problem (how do you schedule memory load/stores to keep the
         | processor busy even though you don't know how long any memory
         | operation takes...)... you write a magic compiler that solves a
         | very easy to solve problem. (IE: convert the branchy-if
         | statement into a branchless max or branchless cmov instruction
         | instead).
         | 
         | --------
         | 
         | EDIT: I added "theLoop:" to the assembly above, because I
         | realized that "branches" can be beneficial in the face of OoO /
         | Tomasulo's algorithm. "Executing future loops" before the 1st
         | loop is done is very beneficial.
        
           | jimkoen wrote:
           | I have little experience in hardware related performance
           | optimizations, so excuse me if some of the things you wrote
           | went over my head. But this sentence:
           | 
           | > Everyone does this parallel computation now. If you go
           | static / compiled ahead-of-time, you can't do this kind of
           | thing anymore. So you lose out in speed compared to regular
           | CPUs that have OoO analysis going on.
           | 
            | Are you implying that in JIT compiled languages I need to
            | care less about making my code branchless, since both CPU
            | and JIT compiler will optimize to a branchless version on
            | the fly wherever possible, whereas in AoT compiled
            | languages such as C++ I lose this edge?
           | 
           | Are JIT compiled languages just better when it comes to
           | utilizing the branch predictor?
        
             | dragontamer wrote:
             | JIT is completely separate issue entirely. I'm not talking
             | about C++ vs Java here. I'm talking about the designs of
             | different assembly languages.
             | 
             | I'm talking about Intel Itanium (ia64) vs AMD x86-64. Or
             | VLIW (compiler-driven parallelism) vs Out-of-Order
             | pipelined processors.
             | 
              | Intel Itanium had "premade bundles". Without getting too
              | far into the weeds, *the compiler* was responsible for
              | discovering parallelism. The compiler then bundled
              | instructions together.
             | 
              | So think of this code:
              | 
              |     theLoop:
              |         mov ebx, eax[ecx]  ; y = array[x]
              |         add ebx, edx       ; y += z
              |         add ecx, 4
              |         cmp ecx, max
              |         jnz theLoop
              | 
              | The Itanium assembly language would allow the compiler
              | to emit:
              | 
              |     mov ebx, eax[ecx]
              |     add ebx, edx : add ecx, 4  ; executed in parallel
              |     cmp ecx, max
              |     jnz theLoop
             | 
             | The compiler discovers that "add ebx, edx : add ecx, 4" are
             | both independent, and therefore "bundles" them together.
             | Intel Itanium then ran faster and without as much need of a
             | decoder to discover this information ahead of time.
             | 
             | But look at how few optimizations are available to Itanium
             | or its compiler!! The amount of parallelism in practice
             | (for classic x86 code) was much bigger, especially when you
             | consider Tomasulo's Algorithm.
        
             | mhh__ wrote:
              | JITs can do a lot of things, but in practice it's a
              | better mental model to just assume that the equivalent
              | AOT code is faster and the JIT is merely recouping
              | performance lost to the inefficiencies of interpretation
              | or weird semantics (javascript being the implication)
        
           | phire wrote:
           | To add to this, the amount of speculation modern CPUs do is
           | insane.
           | 
            | In the "theLoop" example, say the first load misses L1 and
            | L2 cache and needs 40 cycles to complete.
           | 
           | AMD's Zen 4 has enough resources to issue about 64 iterations
           | of "theLoop:" all before that first load needs to complete.
           | With the micro op cache, it can probably issue those 64 loop
           | iterations in about 35 cycles.
           | 
           | Zen 4 has enough resources to start all 64 loads early, so
           | they run in parallel.
           | 
            | Who cares if there is a branch mispredict and you actually
            | needed to do only 16 iterations of "theLoop"? In this
            | example, the loop length isn't dependent on any of the
            | loads, so the CPU can actually finish executing the "add ecx, 4; cmp
           | max; jnz theLoop;" instructions of the first 16 iterations of
           | the "theLoop". And then detect the mispredict, flush the
           | pipeline and load the correct branch location, all before the
           | 40 cycles it takes for that first load to complete.
           | 
           | ------
           | 
           | It's just absolutely insane. No short pipeline without a
           | branch predictor could possibly compete.
        
             | dragontamer wrote:
              | Yeah, it's not so much a game of "don't use the branch
              | predictor". It's more a game of "definitely use the
              | branch predictor, but eliminate unnecessary branches so
              | that the branch predictor can be used on more important
              | parts of your code".
              | 
              | I don't think people realize how much branch predictor +
              | out-of-order + superscalar build on and combine with each
              | other in modern CPUs. It's done this way for a reason.
        
         | ChuckMcM wrote:
         | Is that you Dave? :-) There was paper[1] comparing superscalar
         | CISC approaches to uniscalar RISC in terms of _work done over
         | time_ versus _instructions per clock_. I recall telling srk at
         | the time that it was a good treatise on how choosing the metric
         | (IPC vs throughput) can influence ones thinking about what is
         | "good" and what is "bad."
         | 
         | The IPC crowd reasons that if they can get a higher IPC value,
         | then the process guys can crank the clock and everybody wins.
          | The throughput crowd takes the more pragmatic approach that
         | Moore's law is dead and making silicon clock faster melts it so
         | being clever in the ISA wins the day. Both camps have had
         | successes and setbacks over the last 20 years.
         | 
         | I am really enjoying the RISC-V stuff which is getting back
         | into this level of question with regard to CPU architectures.
         | Also a good place to look if you want to catch up on what
         | "modern" superscalar ideas are being added _with_ support of
         | instruction set flexibility. My guess is that wins the game in
         | the long run.
         | 
         | [1] This was in the early oughts so could have been HotChips,
         | Microprocessor Forum, or HPC workshops.
        
         | jasonwatkinspdx wrote:
         | I'm sorry but this is completely mistaken.
         | 
         | Transmeta's translation did not somehow eliminate branch costs.
         | 
         | In fact I distinctly remember a comp.arch thread and post from
         | Linus himself while he worked at Transmeta that paraphrased was
         | "the cpu's job is to generate cache misses as fast as
         | possible."
         | 
         | Compulsory misses are a thing. No form of JIT is capable of
         | eliminating them. Capacity misses are inevitable in the real
         | world, even with the monster caches we have now.
         | 
         | Itanium thought it could eliminate branch costs with static
         | analysis. How did that work out?
         | 
         | I really wish programmers would actually read a book about
         | computer architecture before they so confidently conclude that
         | it's simple to build things better than state of the art
         | processors. You are underestimating the scale of smart that has
         | gone into current processors by at least 7 orders of magnitude
         | in my opinion.
        
           | rtomanek wrote:
           | Can you give some book recommendations, please?
        
             | jasonwatkinspdx wrote:
             | Hennessy and Patterson's Computer Architecture: A
             | Quantitative Approach is the definitive textbook. They also
             | have a 2nd textbook Computer Organization and Design. The
             | former is more focused on people who might go after a
             | Computer Engineering degree while the latter is more an
             | exploration of microprocessor architecture relevant to a
             | wide audience of programmers.
             | 
             | Previous editions are available free online and just as
             | good for learning the big picture imo. Actually the older
             | editions might be better for learning microarchitecture
             | specifically because the more recent editions have cut some
             | material on that in order to cover mobile and cloud
             | computing.
        
         | taeric wrote:
          | You could rephrase that: the pipeline is long because there
          | is a lot of independent work that you can do in a processor.
          | Consider: every independent operation is something you can
          | potentially do at the same time.
         | 
         | And I don't just mean decode, fetch, and execute. If your
         | computer has independent ALU and Shifting units, you can
         | potentially be shifting something as you are also adding. Have
          | a dedicated Adder and a Multiplication unit? No reason you
          | can't try and do both at the same time.
         | 
         | This directly leads to wanting multiple instructions in flight
         | at the same time, which means you have to be able to fetch and
         | decode instructions faster than you are processing them. Also
          | naturally leads to situations where you want to reorder so
          | that the N Add instructions don't keep you from seeing an
          | independent Shift that you could do at the same time.
         | 
         | You can think that things are more complicated than they need
         | to be. And, you may not even be wrong. But a ton of engineering
         | is going into making what we have. If you think you can make
          | something a ton faster by not doing it this way, it's
          | probably a good idea to dig into how accurate that claim is.
        
       | MrYellowP wrote:
       | I am confused.
       | 
       | > Same function interface as std::lower_bound, but 2x faster, and
       | shorter. "branchless" because the if compiles down to a
       | conditional move instruction rather than a branch/conditional
       | jump.
       | 
       | Assembly programmers did "fastest branchless binary searches"
       | using cmov decades ago and we didn't need to think about the
       | compiler at all. I know, because I was one of them. Optimizing
       | branches away was and still is a nice pastime. This is not new
       | or original, and I don't understand what's noteworthy about it.
       | 
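       | For anyone who never wrote it: the idiom is just a compare plus
       | a conditional select with no jump. A minimal C++ sketch (mine;
       | compilers typically lower a simple ternary like this to cmov on
       | x86-64):
       | 
       |     // Branchless select: nothing for the branch predictor to guess.
       |     int min_branchless(int a, int b) { return a < b ? a : b; }
       | 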
       | "Elitist", you say? Well ... yeah! At least I know what I'm
       | doing, unlike people who solely rely on and blindly trust the
       | compiler to do their work for them.
        
       | alain94040 wrote:
       | I don't get it. The problem with binary search and branches is
       | not the branches themselves, it's the fact that until you have
       | done the comparison, you don't know which memory location in the
       | array to fetch next. It doesn't matter whether you use branches
       | or anything else; the question is what you want the processor
       | to do.
       | 
       | There is a data dependency: until I have read the middle
       | element, I can't tell whether I want to search the upper half or
       | the lower half next. I can speculate and issue reads to both. That
       | would fix the dependency, but create more memory traffic. Is this
       | the right trade-off?
       | 
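       | Concretely, "issue reads to both" could look something like this
       | (my sketch, not the article's code, using gcc/clang's
       | __builtin_prefetch on the article's loop shape):
       | 
       |     while (length > 0) {
       |        auto rem = length % 2;
       |        length /= 2;
       |        // Warm both candidate midpoints of the next iteration
       |        // before we know which half we'll take: extra memory
       |        // traffic, but it hides the dependency on the compare.
       |        __builtin_prefetch(&first[length / 2]);
       |        __builtin_prefetch(&first[length + rem + length / 2]);
       |        if (comp(first[length], value)) first += length + rem;
       |     }
       | 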
       | What do _you_ want? Removing branches is not it.
        
         | mhdm wrote:
         | Prefetching is the right tradeoff for large arrays. Addressed
         | at the end of the article:
         | https://mhdm.dev/posts/sb_lower_bound/#prefetching
        
         | returningfory2 wrote:
         | If the array is fully in L1 cache, isn't the cost of a branch
         | mispredict much greater than that of the memory fetches?
        
           | alain94040 wrote:
           | A full array in L1 is not a typical scenario for binary
           | search. Binary search is usually for large data sets that
           | live in DRAM.
        
             | thefifthsetpin wrote:
             | Binary search reduces the search space exponentially as it
             | proceeds, so actually quite a lot of the total comparisons
             | can hit L1d cache. (Maybe half of them for a ~250GB
             | dataset.)
             | 
             | Of course, you could keep a cacheable partial index of your
             | huge dataset to accelerate the early part of your search as
             | well.
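             | 
             | Rough arithmetic behind that guess, assuming 8-byte keys
             | and a 32 KB L1d: 250 GB / 8 B is about 2^35 elements, so
             | ~35 probes per lookup. The first k probes always land in
             | the same set of 2^k - 1 midpoints, which stay hot, and
             | 32 KB holds ~4096 keys, so roughly the top 12 levels hit
             | L1d; the last ~3 probes also fall within a single 64-byte
             | line. That is ~15 of 35 probes from L1d alone, with L2/L3
             | covering several more levels.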
        
       | sixthDot wrote:
       | https://probablydance.com/2023/04/27/beautiful-branchless-bi...
        
       | sesuximo wrote:
       | I'm too lazy to try it, but putting __builtin_unpredictable
       | around the implementation of comp might make std::lower_bound
       | less branchy
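       | 
       | Something like this is what I have in mind, untested
       | (__builtin_unpredictable is a Clang-only builtin, and the lambda
       | is just an illustrative comparator):
       | 
       |     // Tell Clang the comparison result has no predictable
       |     // pattern, nudging codegen toward cmov instead of a branch.
       |     auto comp = [](const auto& a, const auto& b) -> bool {
       |        return __builtin_unpredictable(a < b);
       |     };
       |     auto it = std::lower_bound(v.begin(), v.end(), x, comp);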
        
       | postalrat wrote:
       | How can you call it branchless if it has "while (length > 0) {"
        
         | dehrmann wrote:
         | I took a stab at the same problem a while ago. Since the upper
         | bound of iterations is based on the input length, if you write
         | your search in a way that extra iterations don't change the
         | result, you can use a switch fallthrough to "unroll" the loop
         | and not have to branch.
         | 
         | https://github.com/ehrmann/branchless-binary-search/blob/mas...
        
         | sltkr wrote:
         | This is literally addressed in the article:
         | 
         |   More branchless, more better?
         | 
         |   Short answer: no. This section can be skipped but here's the
         |   long answer: For n elements where 2^k <= n < 2^(k+1),
         |   sb_lower_bound will always do k+1 comparisons. On a 64-bit
         |   machine that means at most 64 iterations of the
         |   while (length > 0) loop. So it's possible to write a "fully
         |   branchless" version that doesn't have the length check by
         |   using a switch with intentional fall-through.
         | 
         |       size_t length = last - first;
         |       size_t rem;
         |       switch (std::bit_width(length)) {
         |          case 64:
         |             rem = length % 2;
         |             length /= 2;
         |             first += comp(first[length], value) * (length + rem);
         |          case 63:
         |             rem = length % 2;
         |             length /= 2;
         |             first += comp(first[length], value) * (length + rem);
         |          // ...
         |          case 1:
         |             rem = length % 2;
         |             length /= 2;
         |             first += comp(first[length], value) * (length + rem);
         |       }
         |       return first;
         | 
         |   If you're not familiar with switch, think of it as a jump
         |   into code. In our case, to the exact place from which there
         |   are exactly the right number of comparisons left to do.
         | 
         |   "Is it any faster?" No. Modern x86 processors handle loop
         |   conditions well as they're predictable; we're very likely to
         |   remain in the loop. And that's good, especially because it
         |   saves us from writing templates or macros or
         |   copy-paste-editing the 64 cases.
         | 
         | (Maybe people will RTFA if I paste it into the comments.)
        
           | amethyst wrote:
           | TBH I had to read that part two times before realizing that
           | it was describing an unrolled loop.
        
         | einpoklum wrote:
         | Because that's a branch that's "always" taken, except once at
         | the very end; speed-wise that's very close to always taken,
         | i.e. effectively not a branch. The author is counting the
         | branch inside the loop, which is often taken and often not
         | taken.
        
           | postalrat wrote:
           | It's not impossible to remove that branch. Until you do,
           | this is a reduced-branch implementation.
        
         | fsckboy wrote:
         | the article covers that right at the top: _"branchless" because
         | the if compiles down to a conditional move instruction rather
         | than a branch /conditional jump._
        
         | Const-me wrote:
         | That particular branch is well predicted for long arrays.
         | You're probably only gonna have a single misprediction for the
         | complete binary search function, so the performance
         | consequence of that branch is negligible.
         | 
         | People (and compilers, too) don't write branchless code just
         | for the sake of it; they do it because it helps with
         | performance.
        
           | kragen wrote:
           | sometimes branchless code also stops timing side channel
           | attacks, but it's not enough by itself
        
       | amluto wrote:
       | Oh, C++, the language where templates are extremely powerful but
       | their usability is so poor that code like this is kind of
       | idiomatic and even compiles:
       | 
       |     template <class ForwardIt, class T, class Compare>
       |     constexpr ForwardIt sb_lower_bound(ForwardIt first, ForwardIt last,
       |                                        const T& value, Compare comp) {
       |        auto length = last - first;
       | 
       | This looks generic, it compiles by itself, it contains a fancy
       | word "ForwardIt", and it works if you try it with a vector or an
       | array. But you can't actually subtract forward iterators, and
       | binary searching a range of forward iterators is quite silly.
       | 
       | Rust would have required using the correct trait. Maybe C++ will
       | improve the situation in the long run.
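       | 
       | For what it's worth, C++20 can already express the constraint.
       | A sketch (mine, reusing the loop quoted elsewhere in the thread)
       | that rejects non-random-access iterators at compile time:
       | 
       |     #include <iterator>
       | 
       |     template <std::random_access_iterator It, class T, class Compare>
       |     constexpr It sb_lower_bound(It first, It last, const T& value,
       |                                 Compare comp) {
       |        // last - first is now guaranteed to be valid and O(1)
       |        auto length = last - first;
       |        while (length > 0) {
       |           auto rem = length % 2;
       |           length /= 2;
       |           if (comp(first[length], value)) first += length + rem;
       |        }
       |        return first;
       |     }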
        
       ___________________________________________________________________
       (page generated 2023-08-11 23:00 UTC)