[HN Gopher] Do low-level optimizations matter? Faster quicksort ...
___________________________________________________________________
Do low-level optimizations matter? Faster quicksort with cmov
(2020)
Author : fanf2
Score : 149 points
Date : 2024-08-21 20:42 UTC (20 hours ago)
(HTM) web link (cantrip.org)
(TXT) w3m dump (cantrip.org)
| mbroncano wrote:
| (2020)
| tialaramex wrote:
| Which of course makes it ironic that it says "We ought to be
| able to sort pretty fast, by now, using a Standard Library
| sort."
|
| The C++ standard promised (since 1998) that the provided sorts
| are O(n log n).
|
| But in fact popular C++ implementations ignored this and
| shipped quicksort - after all, it's "pretty fast" - but
| unfortunately this means that if bad guys can choose your
| input (or you get unlucky) you get O(n squared).
|
| Gradually they fixed this; libc++ updated to introsort (last
| century's best attempt to do this) in 2021, in response to a
| ticket opened by Orson Peters.
|
| These days you're also starting to see a trend towards the
| Rust-style "fast sort which implements a wide contract" design
| even in C++ implementations. Rust's sort doesn't actually
| _promise_ this wide contract arrangement, but it does promise
| safety and the wide contract turns out to be the simplest way
| to do that.
|
| What I mean there by "wide contract" is that these sorts don't
| blow up if your ordering rule is nonsense. They can't "sort"
| successfully, because your ordering rule is nonsense: if A > B
| && B > A then too bad, we can't sort A and B. But they _can_
| and do gracefully give up, while many older C++ stdlibs just
| cause uncontrolled mayhem in this case.
| gpderetta wrote:
| I'm pretty sure that introsort was already implemented in the
| original STL last century.
|
| In fact the standard didn't require O(n log n) until C++11;
| by then all implementations were already using introsort, so
| allowing the previous more permissive O(n^2) bound was no
| longer necessary.
| tialaramex wrote:
| I don't know which is the "original STL" in this context.
| Alexander Stepanov's proposal of a standard template
| library for C++ pre-dates Musser's invention of
| introspective sorting (introsort). Stepanov is a smart guy
| but he's also somewhat famous so I'd guess if he invented
| it first we'd know that.
|
| As I said, Orson filed a bug in LLVM bugzilla for libc++
| going quadratic for adversarial input _in 2014_, and it
| was eventually fixed (by using introsort) in 2021.
|
| There's a fun sideshow in that ticket. Remember, Orson
| actually wrote what was probably the best general-purpose
| unstable sort at the time - but all he's done is open a
| ticket saying libc++ doesn't meet the minimum requirements.
| After a while, with the team having dragged their feet and
| still not even landed the patch to just use introsort,
| somebody pipes up to report that they have benchmarked
| sorts and actually libc++ is faster than other popular C++
| sorts (in their benchmark). So, Orson politely asks them to
| try PDQsort. That's the first time he explicitly mentions
| his Pattern-Defeating Quicksort in the ticket, and maybe a
| harried LLVM bug handler wouldn't notice who filed the bug
| until then. Just another name.
|
| But it's OK, the LLVM devs are unperturbed and ignore him,
| focusing instead on the minutiae of the performance numbers
| for their _known defective_ unstable sort. Years later
| somebody eventually adds introsort to libc++ and closes the
| defect ticket. Nobody answers Orson's question; the answer
| of course is "Oh, yours is much faster".
| gpderetta wrote:
| Musser and Stepanov were working together on the STL. I
| think they were both at SGI at that time, and introsort
| was integrated into the SGI STL (around the same time the
| STL was standardized, I guess introsort was too new to
| require it in C++98); from there introsort made it into
| most other standard library implementations which are
| directly derived from SGI STL.
|
| libc++ is notable for being one of the few from-scratch
| implementations.
| karmakaze wrote:
| I would expect eliminating branches in a busy inner loop to
| matter.
|
| The interesting part is how that was done:
|
| > A New Primitive, swap_if
|
| > How can we use this method in our sort? First, let us make a
| swap_if:
|
|     inline bool swap_if(bool c, int& a, int& b) {
|         int ta = a, mask = -c;  // false -> 0, true -> 111..111
|         a = (b & mask) | (ta & ~mask);
|         b = (ta & mask) | (b & ~mask);
|         return c;
|     }
|
| > In our partition function, then, we can transform
|
|     if (*right <= pivot) {
|         int tmp = *left;
|         *left = *right, *right = tmp;
|         ++left;
|     }
|
| > into just
|
|     left += swap_if(*right <= pivot, *left, *right);
| magicalhippo wrote:
| Will we see x86 or similar CPUs replace hot instruction blocks
| with current-processor-specific optimized code during runtime,
| similar to how certain JIT VMs do?
|
| What I mean is, say you have a similar simple "x = (cond) ? a :
| b;" which the compiler has not translated to a CMOV.
|
| If this is in a hot loop then the CPU could, in theory, notice
| that "it's just doing a conditional move, I can do the CMOV
| faster" and then translate those code bytes at that memory
| location to a CMOV instead (ie during decoding or something
| like that).
|
| Not worth the complexity? I imagine it would slow down the
| decoder or wherever they insert this replacement logic. Or am I
| hopelessly out of date and they're already doing this?
| akira2501 wrote:
| I think it could only do that in spans of code where
| interrupts are disabled and purely between values already in
| registers. CPU state feels like it's in your control, but
| it's not at all in your control, and even if it was, multiple
| processes might exist and memory is never in your control.
| saagarjha wrote:
| Whether to use a branch or a conditional move isn't really
| dependent on what the source is doing but how likely the
| branch is and its dependencies. Simplifying a bit, an
| explicit cmov is basically telling the computer that the
| condition is not easy to predict and to not bother. Modern
| processors will typically do the same analysis themselves and
| perform similarly on a branch with a slight overhead.
| jeffbee wrote:
| > Will we see x86 or similar CPUs replace hot instruction
| blocks with current-processor-specific optimized code during
| runtime, similar to how certain JIT VMs do?
|
| Wasn't that one of Transmeta's things?
| gpderetta wrote:
| Look for "idiom detection". Macro uop fusion is a limited
| form of this.
|
| As a relevant example, POWER CPUs can convert some non-
| diverging branches into conditional execution if they are not
| well predictable.
|
| Many RISC CPUs can convert well known LL/SC code patterns
| into locked atomic operations to guarantee forward progress.
| kevinventullo wrote:
| There's a well-known in-place implementation of swap of the
| form: a ^= b b ^= a a ^= b
|
| (Here ^ denotes bitwise XOR)
|
| Allowing for the mask, one could do an in-place version of
| swap_if via:
|
|     a ^= (b & mask)
|     b ^= (a & mask)
|     a ^= (b & mask)
|
| The in-place version of swap is generally discouraged because
| compilers are smart
| (https://stackoverflow.com/questions/36906/what-is-the-
| fastes...) but I do wonder whether the masking in swap_if
| obscures intent to the compiler enough to close the gap.
|
| Assuming the mask is passed in, Godbolt puts OP's swap_if at 26
| instructions, versus the above swap_if at 17 instructions:
| https://godbolt.org/z/Wedco5hPv
| ack_complete wrote:
| You have the optimizer disabled and the functions are no-ops.
| Additionally, masking that way makes the XOR no longer a no-
| op, so only one masking operation is necessary -- but it
| seems that GCC already knows this trick:
|
| https://godbolt.org/z/qxMvcbrc8
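|
| For reference, the single-mask variant looks like this (a
| minimal sketch, assuming mask is 0 or all ones; not necessarily
| the exact code behind the godbolt link):
|
|     inline bool swap_if(bool c, int& a, int& b) {
|         int mask = -int(c);  // false -> 0, true -> all ones
|         a ^= b;
|         b ^= a & mask;       // b picks up old a only when c
|         a ^= b;              // undoes step 1, or completes the swap
|         return c;
|     }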
| kevinventullo wrote:
| Oof, thank you for the correction.
| Someone wrote:
| > The in-place version of swap is generally discouraged
| because compilers are smart
|
| Isn't that more because CPUs slow down when there are
| dependencies between instructions?
|
| Compilers could (and may even do) fairly easily detect that
| pattern (like they do with some versions of _popcnt_. See for
| example
| https://langdev.stackexchange.com/questions/3942/what-are-
| th..., discussed here in
| https://news.ycombinator.com/item?id=40987123) and compile it
| to whatever is fastest on the target CPU/in the given context
| (in some contexts, it can be compiled to nothing, just
| changing the mapping between local variables and the
| registers they're stored in)
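|
| For example, a classic bit-counting loop (a sketch; whether a
| given compiler recognizes it depends on version and flags):
|
|     unsigned popcount_loop(unsigned x) {
|         unsigned n = 0;
|         while (x) {
|             x &= x - 1;  // clear the lowest set bit
|             ++n;
|         }
|         return n;
|     }
|
| Recent GCC and Clang can spot this idiom and emit a single
| popcnt instruction when the target supports it.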
| gpderetta wrote:
| > Isn't that more because CPUs slow down when there are
| dependencies between instructions?
|
| Indeed, while reg-reg moves can be resolved at the rename
| stage and are effectively free latency-wise.
| Almondsetat wrote:
| CPUs have shortcuts in the pipeline to provide the results
| to the dependent operation in time
| a_e_k wrote:
| This sort of trick - turning conditionals into bitmasks and
| then shuffling masked values around - was fairly common back
| in the day when writing code with SSE intrinsics.
|
| You wouldn't want to have to check each lane individually since
| that would spoil the point of using SSE. So you'd write masking
| ops like this and keep things parallel, then judiciously use
| any() and all() type intrinsics to check whether the lanes
| could agree to skip certain blocks conditionally.
|
| (This was before ISPC and before autovectorizers could handle
| it for you.)
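|
| A minimal sketch of the pattern with SSE2 intrinsics (the
| function name here is illustrative):
|
|     #include <emmintrin.h>
|
|     // per-lane: a < b ? a : b, across four int32 lanes
|     __m128i select_min(__m128i a, __m128i b) {
|         __m128i mask = _mm_cmplt_epi32(a, b);  // all ones where a < b
|         return _mm_or_si128(_mm_and_si128(mask, a),
|                             _mm_andnot_si128(mask, b));
|     }
|
| The any()/all() style checks were typically built on
| _mm_movemask_epi8 over a comparison mask.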
| djmips wrote:
| Does ISPC + auto vectorization truly remove the need to
| attend to these details in the most optimized code?
| a_e_k wrote:
| They can, yes.
|
| For example, here's a simple Compiler Explorer link with a
| simple loop conditionally copying from one of two values:
|
| https://godbolt.org/z/TbeK99sx1.
|
| If you look at the assembly code produced, you'll see in
| lines 9-11, it's doing a similar thing as the swap_if()
| function: ANDing one value with the mask, ANDing the other
| with the negation of the mask, and then ORing them
| together. And it produced that automatically from the
| ternary in the loop.
|
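| The loop in question is along these lines (a sketch; the exact
| code behind the link may differ):
|
|     void select_loop(float* dst, const float* a, const float* b,
|                      const int* cond, int n) {
|         for (int i = 0; i < n; ++i)
|             dst[i] = cond[i] ? a[i] : b[i];
|     }
|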
| They can also do things like automatically generate
| epilogues with a serial loop to handle the leftover items
| if the loop count isn't an even multiple of the SIMD width.
| That's something else that I remember writing by hand
| before autovectorizers.
|
| That said, coaxing an autovectorizer to actually vectorize
| a loop can be an art in itself. They're notoriously
| finicky. My favorite used to be the one in the old
| ("classic") ICC compiler where you use compiler flags to
| get a nice optimization report which would explain exactly
| why it chose what it did for each loop. You could also set
| #pragmas which would noisily warn or error out if the
| associated loop failed to vectorize for some reason.
|
| (And of course, a lot of this stuff you might just use GPU
| compute for now.)
| mgaunard wrote:
| Normally there is an instruction for this called blend
| (since SSE4.1).
|
| There shouldn't be a need to do any bitwise magic.
| jcla1 wrote:
| You can get gcc to generate the blend instruction in the
| example given - you just have to 'force' it to use SIMD
| more aggressively by, for example, using -march=native.
| mgaunard wrote:
| That's not forcing anything, if you're not getting blend
| you're just targeting an old ISA without that
| instruction.
|
| It's up to you to choose your target adequately based on
| your deployment requirements.
| memset wrote:
| Fascinating! How does one learn to write C code that will make
| use of specific asm instructions?
|
| simdjson is another example that comes to mind. The conceit of C
| is that you don't have control over the underlying machine
| instructions without inlining them yourself. So how do people
| write C code with CPU optimizations like this?
| anonymoushn wrote:
| Use inline assembly or intrinsics. Read the code of stuff that
| does these things.
|
| Resources sufficient for implementing simdjson or similar
| vectorized parsers:
| https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
| https://lemire.me/blog/ http://0x80.pl/articles/
| addaon wrote:
| There are three approaches:
|
| 1) Use intrinsics, if your platform provides intrinsics for the
| instructions you want to use.
|
| 2) Use inline assembly (which I suppose doesn't technically
| count as writing C code, but is very much part of the story).
|
| 3) Write C code carefully against a chosen compiler with chosen
| optimization flags, and inspect the assembly. Especially for
| small chunks of code, a tool like godbolt.org is indispensable.
| Basically, you're writing the assembly you want the compiler to
| generate (either explicitly, or just in your head), then
| writing C code that seems likely to generate that sequence of
| instructions, then tweaking the code until the output you want
| is generated. If this is a super important optimization, it's
| also reasonable to add a build step that inspects the generated
| assembly and fails the build if it doesn't match the desired
| pattern; but in that case, writing inline assembly is usually
| easier.
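|
| A small example of approach 3 (illustrative; the compiler is
| free to choose differently, so inspect the output):
|
|     // Written so the obvious lowering is a cmov; GCC and Clang
|     // at -O2 on x86-64 usually oblige for min/max shapes.
|     int clamp_max(int x, int hi) {
|         return x > hi ? hi : x;
|     }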
| vlovich123 wrote:
| Option 4 is sometimes compiler annotations like
| __builtin_expect_with_probability, which you could use to
| assign a branch a probability of 50%, which should coerce the
| compiler to pick cmov (same risk as 3, though, in that you
| need to inspect the compiler output).
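|
| Something like this (a sketch; the builtin exists in GCC 9+ and
| recent Clang):
|
|     int pick(int cond, int a, int b) {
|         // Declare the branch a coin flip, nudging the compiler
|         // toward cmov; it's a hint, not a guarantee.
|         if (__builtin_expect_with_probability(cond != 0, 1, 0.5))
|             return a;
|         return b;
|     }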
| addaon wrote:
| Yep, there's an entire book to be written on techniques
| that you can use to express your understanding of code to
| the compiler so that the compiler's chosen implementation
| of it matches yours. Plenty of __builtins and
| __attributes__, and even just the ordering of if/else
| statements, grouping statements into local scopes with {},
| using or not using a temporary variable...
| adelpozo wrote:
| I would say it is a moving target if the goal is to write C
| code that compiles to a seemingly magical CPU instruction. As
| others have pointed out, learning assembly and Compiler
| Explorer are useful things.
|
| As a way to show the wonderful complexities of compilers check
| the TOC of https://shop.elsevier.com/books/optimizing-
| compilers-for-mod...
| kimixa wrote:
| > Back in 2000, AMD included cmov in its 64-bit x86 ISA
| extensions. Then, Intel had to adopt them when Itanium flopped.
|
| Wasn't "cmov" one of the things added for the pentium pro? So it
| wasn't instruction compatible - hence the "i686" prefix to a lot
| of compiler triples?
| pbsd wrote:
| Yes, that is correct.
| tedunangst wrote:
| Intel was so embarrassed by the failure of itanium they
| invented a Time Machine and went back and added the instruction
| to a 1995 CPU. Deceptive and anti-competitive!
| basementcat wrote:
| Intel failed to predict the timeline branch in which i686
| arch had cmov so they had to roll back and replay it.
| winternewt wrote:
| Also known as Speculative CPU manufacturing
| dgl wrote:
| The most important bit of this is in the conclusion:
|
|     Before we conclude anything, we should remind ourselves of
|     its limitations. The tests run were on completely random
|     data. Truly random data seldom occurs in real life.
|
| Linus famously ranted about CMOV in
| https://yarchive.net/comp/linux/cmov.html (2007, so potentially
| more modern architectures are better at some of this) and he
| says:
|
|     if you KNOW the branch is totally unpredictable, cmov is
|     often good for performance. But a compiler almost never
|     knows that.
|
| As usual with optimizations like this you have to benchmark,
| and even then, if your sample isn't representative, it might
| not mean much.
| tialaramex wrote:
| You can see how dramatically the actual data changes sort
| performance in e.g. this (summary of the current unstable sort
| in Rust, ipnsort)
|
| https://github.com/Voultapher/sort-research-rs/blob/main/wri...
|
| Notice how random_s95 is _worse_ (not by much, but it's there)
| than fully random. random_s95 is 95% sorted data, but 5%
| unsorted, simulating a common "sort, do stuff, append, repeat"
| pattern we see in a lot of software.
|
| In contrast the sorted cases are almost instant, and random_d20
| (only 20 distinct values, chosen at random, but as a result the
| sorted output needs far fewer comparisons) is very fast.
| jnordwick wrote:
| When Linus made that comment cmov had something like a 6-cycle
| latency. For the last decade it has been 1 cycle, and I don't
| think there is any scenario where cmov is now slower than a
| branch.
| haberman wrote:
| The problem is not the 1 cycle latency, but the data
| dependency on both values. A correctly-predicted branch cuts
| the dependency on the value that is not used.
|
| I've definitely measured scenarios where cmov/branchless is
| slower than a branch for a given algorithm. Especially if the
| branchless version is doing a bit more work to avoid the
| branch.
| unnah wrote:
| Good point. It makes me wonder if modern out-of-order
| processors can skip performing unused computations
| altogether, if their result registers are overwritten by
| other data later (in program order).
| ants_a wrote:
| No, and it feels unlikely that they will either.
| gpderetta wrote:
| CPUs could in theory speculate CMOV, reintroducing
| prediction when predictable. But after Spectre, IIRC
| Intel now guarantees that CMOV is never speculated.
| gpderetta wrote:
| It is both though. At 6+ cycles there are only a few places
| where CMOV is a win. At 1 cycle you can be more liberal
| with its use and the lack of dependency breaking is a
| tradeoff.
| clausecker wrote:
| Are you sure? I recall cmov always having single cycle
| latency.
| BoardsOfCanada wrote:
| I know that it was decoded into 2 micro-ops and thus had to
| be decoded by the wide decoder, so perhaps 2 cycles?
| clausecker wrote:
| That could be. Agner's tables seem to confirm that.
| mgaunard wrote:
| A cmov-based approach is necessarily slower than a branch-
| based approach that was correctly predicted, since cmov
| requires computing both branches then selecting the result at
| the end.
| gpderetta wrote:
| Linus was specifically ranting about compilers inserting CMOV.
| These days it is actually a pain to get GCC or clang to
| generate a CMOV when you specifically want it.
|
| Also, CMOV did indeed get significantly better.
| pcwalton wrote:
| Note that random data is not a common case for sorting
| algorithms. It'd be interesting to see how the numbers change on
| partially-, mostly-, and fully-sorted data.
| kragen wrote:
| the article does mention this:
|
| > _What can we conclude from this discovery? Before we conclude
| anything, we should remind ourselves of its limitations. The
| tests run were on completely random data. Truly random data
| seldom occurs in real life._
|
| it's true that it's not a common case, but it's probably the
| simplest objectively justifiable case; any particular case of
| the others requires more elaborate justification for
| privileging it. why 90% sorted instead of 80%, say? or why do
| we give 20% weighting to the randomized-data case and 80%
| weighting to the 90%-sorted case, rather than some other
| weighting?
|
| and it does at least guarantee exploration of a good part of
| the algorithm's behavior space. i mean, bogosort is optimal on
| fully-sorted data, right? as long as you do the check for
| sortedness before the random shuffle instead of after it
|
| it's important that your sort not _explode_ on mostly-sorted or
| mostly-reverse-sorted data (myers's 'bog-standard quicksort'
| uses the last element as the pivot and consequently goes
| quadratic), but beyond that, the particular weighting of
| relative importance to assign to random input vs. 99% sorted
| vs. 90% sorted vs. 90% reverse sorted--that requires some
| application-specific justification
|
| aside from the ones you mentioned, another interesting case is
| a large array of records that all have equal keys. median-of-
| three quicksort or random-pivot quicksort does fine on mostly-
| or fully-sorted data, but still sucks when everything is equal!
| xiaodai wrote:
| The radix sort implementation is not optimal. Instead of sorting
| on one digit at a time, it should sort on 11 bits at a time to
| saturate the cache lines.
|
| So the baseline radix sort can be even faster.
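|
| Concretely, a 32-bit key splits into three 11-bit digits
| (11+11+10), giving 2048 counting buckets per pass (a sketch):
|
|     // digit for LSD radix pass 0, 1 or 2
|     unsigned digit(unsigned key, int pass) {
|         return (key >> (11 * pass)) & 0x7FF;
|     }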
| djmips wrote:
| From the article - "This radix sort would be easy to make even
| faster. Instead, let us make one that is slower"
|
| The radix sort was used to provide an estimate of how much
| room for improvement the quicksort had.
| orlp wrote:
| Rather than std::swap_if, which is rather niche, I would prefer
| to see std::select(cond, if_true, if_false), with the guarantee
| that unlike the ternary operator it eagerly evaluates both
| arguments and selects between them branchlessly if possible.
|
| Something similar is coming to Rust as well: https://doc.rust-
| lang.org/nightly/core/intrinsics/fn.select_...
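|
| In the spirit of the proposal, a minimal sketch (hypothetical
| API, not in any standard; the branchless lowering is best
| effort):
|
|     template <class T>
|     T select(bool cond, T if_true, T if_false) {
|         // Both arguments are evaluated before the call; a
|         // ternary over ready values is the usual cmov shape.
|         return cond ? if_true : if_false;
|     }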
| account42 wrote:
| Wouldn't the sensible lower-level primitive be Clang's
| __builtin_unpredictable? If you standardize that then a library
| can easily build your select function on top of it. I guess
| GCC's __builtin_expect_with_probability with probability 0.5
| should have the same meaning.
| gpderetta wrote:
| > I guess GCC's __builtin_expect_with_probability with
| probability 0.5 should have the same meaning.
|
| But it doesn't! The sequence 1010101010... takes either value
| with probability 50%, yet is completely predictable.
| tialaramex wrote:
| Also the compiler vendors have seen this movie before and
| know how it ends. Too often a "performance hint" is garbage
| and ignoring it will improve performance.
|
| Which is unfortunate because these are genuinely sometimes
| useful, but it's a "We can't have nice things" situation -
| for every expert who is carefully applying a hint to a hot
| path after careful reading of the manual for the specific
| CPU model they use, there are dozens of idiots who will
| copy-paste that hint believing that it's magic speed-up
| juice, never having measured and with no idea what it's
| really doing or why.
|
| The intrinsics have the advantage that at least you're
| expressing what you meant, and it is obvious whose fault it
| is when that's slower because you screwed up - hints make
| it too easy for programmers to justify a tantrum when their
| "hint" isn't magic speed-up juice after all.
| adrian_b wrote:
| A correction to the history mentioned in the article: CMOV was
| not added by AMD around 2000.
|
| CMOV was added by Intel in 1995, to the Pentium Pro. It was the
| most important addition to the x86 ISA made by the Pentium Pro.
| So CMOV was supported by the Pentium Pro and its successors,
| like the Pentium II and Pentium III, but not by the Pentium,
| Pentium MMX or AMD K6 CPUs. AMD added CMOV starting with the
| Athlon, in 1999.
|
| The Pentium Pro was the first Intel CPU with out-of-order
| execution and it did much more speculative execution than its
| predecessors. Therefore it needed CMOV to limit the performance
| loss caused by branch mispredictions.
| lynx23 wrote:
| Heh, nice! This article reminded me of a great talk I recently
| noticed:
|
| Rethinking Binary Search: Improving on a Classic with AI
| Assistance: https://www.youtube.com/watch?v=FAGf5Xr8HZU
|
| The gist is rather simple: assuming that a substantial amount of
| your searches results in _no result_, you can bias the binary
| search to jump farther than just half, improving runtime
| noticeably.
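|
| A sketch of the mechanism (illustrative only, not the talk's
| exact scheme):
|
|     // Probe past the midpoint; when most lookups miss, absent
|     // keys can fall out of the range in fewer steps.
|     int biased_search(const int* a, int n, int key) {
|         int lo = 0, hi = n;
|         while (lo < hi) {
|             int mid = lo + (hi - lo) * 3 / 4;  // bias toward 3/4
|             if (a[mid] < key)      lo = mid + 1;
|             else if (a[mid] > key) hi = mid;
|             else                   return mid;
|         }
|         return -1;  // not found
|     }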
| austin-cheney wrote:
| Matter to whom?
|
| If your primary audience is other developers then it absolutely
| does not matter. In fact it's a way to expose deep anger. All
| that matters is convenience and measurements just piss people
| off, because measuring anything is just too hard.
|
| If your primary audience is business, especially transactional
| finance, then absolutely yes. Squeeze absolutely every
| millisecond out, measure absolutely everything in terms of
| transaction quantity, focus on concurrent operations.
| jart wrote:
| You have to work on computing frontiers like LLMs where people
| are feeling a performance crunch. Measuring things is what
| separates software engineering from computer programming.
| Software engineering turns into social science when your
| hardware is 10x faster than the thing you're building, since
| then you have no need to measure things. Without extreme self-
| imposed discipline, your work in that kind of environment will
| devolve into bloated decadence, unless you move on to the next
| computing frontier. Convenience is the kind of thing a man
| values when he isn't pushing against any fundamental limits
| imposed by the universe.
| pronoiac wrote:
| Here's the article linked as https; I'll email the mods.
|
| https://cantrip.org/sortfast.html
| MatthiasPortzel wrote:
| It's weird to admit that this is an arbitrary dataset that
| doesn't mimic real-world use cases, then admit that radix sort is
| far faster for this situation, then conclude that low-level
| optimizations are important and a cmov-wrapper should be added to
| the standard library to make programs faster. It's a great post;
| I just disagree with the conclusion.
| snek_case wrote:
| True. You would think it's maybe not that hard to find a real
| dataset to sort somewhere.
| kazinator wrote:
| > I call this bog-standard, but the quicksorts most easily found
| online are always oddly pessimized, both more complicated and
| slower. This one uses the "Lomuto partition", which is simpler
| than Hoare's.
|
| The Lomuto partitioning is a naive vandalism of Hoare's original,
| which causes quicksort to have quadratic behavior on sorted
| input.
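|
| For reference, the textbook Lomuto partition (pivot = last
| element; on sorted input every split is maximally lopsided,
| hence the quadratic behavior):
|
|     int lomuto_partition(int* a, int lo, int hi) {
|         int pivot = a[hi], i = lo;
|         for (int j = lo; j < hi; ++j) {
|             if (a[j] <= pivot) {
|                 int t = a[i]; a[i] = a[j]; a[j] = t;
|                 ++i;
|             }
|         }
|         int t = a[i]; a[i] = a[hi]; a[hi] = t;
|         return i;  // pivot's final position
|     }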
___________________________________________________________________
(page generated 2024-08-22 17:01 UTC)