http://cantrip.org/sortfast.html
Do Low-level Optimizations Matter?
by Nathan Myers, ncm at cantrip dot org, 2020-01-09
Collectively, we have been thinking about sorting for longer than we
have had computers. There is still an active literature ^1,^2,^3. We
are taught that how the counts of comparisons and swaps vary with
problem size is what matters: as problems get bigger, order
dominates, and anything else is buried in the noise. We learn that
the best in-place sorts run in time around kN lg2(N), and a better
sort algorithm has a smaller k, or is better-behaved for chosen
input.
We ought to be able to sort pretty fast, by now, using a Standard
Library sort. Sorting is real work: if you need it to go faster, you
probably need to spend more on hardware. You can't cheat on that.
Or can you? The classic sorting research was conducted on machines
from many generations ago. While today's are made to seem similar
from the outside, inside they are very, very different. Do we still
understand what affects sorting speed today? Maybe the rules have
changed.
Laboratory
Trying out ideas with Standard Library sort implementations can be
tricky; as they have been tuned, they have become complicated.
Results can be confusing. We need a laboratory: simple sorts, and
simple data^4. So, let us begin with a file of a hundred-million
totally random integers:
$ dd if=/dev/urandom count=100 bs=4000000 of=1e8ints
This is objectively the worst case, containing the greatest possible
entropy; but also the best case, hiding no surprises. We can map this
file into our process, and the OS will copy it from the file buffer
cache, the work showing up as sys time. All the rest of the run time
is spent on nothing but sorting.
Next, we need a baseline, using std::sort, for reference:
#include <algorithm>
#include <fcntl.h>
#include <sys/mman.h>
static const int size = 100'000'000;
int main(int, char**) {
    int fd = ::open("1e8ints", O_RDONLY);
    int perms = PROT_READ|PROT_WRITE;
    int flags = MAP_PRIVATE|MAP_POPULATE|MAP_NORESERVE;
    auto* a = (int*) ::mmap(
        nullptr, size * sizeof(int), perms, flags, fd, 0);
    std::sort(a, a + size);
    return a[0] == a[size - 1];
}
The return statement keeps the compiler from optimizing away the
whole program. Trying it^5:
$ g++ -O3 -march=native sort_base.cc && time ./a.out
real 0m7.365s
user 0m7.257s
sys 0m0.108s
We see about 73ns per element. lg2(1e8) is about 27, making k a bit
less than 3ns. Each time std::sort looks at an element, it spends, on
average, 3ns, presumably on shuffling it from memory to cache to
register to ALU, and maybe back to cache and to memory. This happens
27 times before the element ends up where it belongs. (Nobody said,
in school, that k has a unit, but for our purposes, it's
nanoseconds.)
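As a sanity check on that arithmetic, here is a throwaway calculation of k
from the measured time (a minimal sketch, using only the numbers above):
#include <cmath>
#include <cstdio>
int main() {
    const double N = 1e8;          // elements sorted
    const double seconds = 7.3;    // measured run time, from above
    // model: time = k * N * lg2(N)
    double k = seconds / (N * std::log2(N));
    std::printf("k = %.2f ns\n", k * 1e9);  // prints about 2.7 ns
}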
Reality Check
Just for a reality check, let's plug in a radix sort. It is an unfair
comparison, on several axes, but it suggests a hard upper bound on
the kind of improvement we can reasonably hope for. This function
distributes elements to one of 256 buckets, and then copies them
back, in stable bucket order, and again for the next byte.
#include <cstring>
#include <vector>
void sort_radix256(unsigned* begin, unsigned* end) {
    std::vector<unsigned> buckets[256];
    for (auto& v : buckets) v.reserve((end - begin)/128);
    for (int byte = 0; byte != sizeof(unsigned); ++byte) {
        // distribute by the current low byte, rotating it to the top
        for (unsigned* p = begin; p != end; ++p) {
            buckets[*p & 0xff].push_back((*p >> 8) | (*p << 24));
        }
        // copy back, in stable bucket order
        unsigned* p = begin;
        for (auto& v : buckets) {
            std::memcpy(p, v.data(), v.size() * sizeof(unsigned));
            p += v.size();
            v.clear();
    } } }
- auto* a = (int*) ::mmap(
+ auto* a = (unsigned*) ::mmap(
- std::sort(a, a + size);
+ sort_radix256(a, a + size);
real 0m1.193s
user 0m0.985s
sys 0m0.208s
We see the OS spending extra time initializing memory (two thirds of
it to zeroes), because radix sorting is not done in place. The rest
of the time is spent copying the input to buckets and back, four
passes in all, 2.5ns per element per pass. When you can use it, and
for big enough N, radix sort always wins, because it's O(N): in this
case, 4N. Notably, the sequence of instructions is almost the same,
no matter what the values of the elements are. All that varies is
which bucket each element goes into.
This radix sort would be easy to make even faster. Instead, let us
make one that is slower, but that shares important qualities with
in-place sorts. Instead of 256 buckets, this will use just two. The
first pass picks a bucket according to the low bit, the next pass
uses the next bit, and on around.
void sort_radix2(unsigned* begin, unsigned* end) {
    std::vector<unsigned> a_buckets[2];
    for (auto& v : a_buckets) v.reserve((end - begin)/3);
    // prime the buckets with the input, split arbitrarily in half
    int i = 0;
    for (; i < (end - begin) / 2; ++i) a_buckets[0].push_back(begin[i]);
    for (; i < (end - begin); ++i) a_buckets[1].push_back(begin[i]);
    std::vector<unsigned> b_buckets[2];
    for (auto& v : b_buckets) v.reserve((end - begin)/3);
    for (int bit = 0; bit != 8 * sizeof(unsigned); ++bit) {
        // distribute by the current low bit, rotating it to the top
        for (int j = 0; j != 2; ++j) {
            for (unsigned v : a_buckets[j]) {
                b_buckets[v & 0x1].push_back((v >> 1) | (v << 31));
            }
        }
        std::swap(a_buckets[0], b_buckets[0]), b_buckets[0].clear();
        std::swap(a_buckets[1], b_buckets[1]), b_buckets[1].clear();
    }
    for (auto& bucket : a_buckets)
        for (unsigned v : bucket) *begin++ = v;
}
- sort_radix256(a, a + size);
+ sort_radix2(a, a + size);
real 0m4.759s
user 0m4.543s
sys 0m0.216s
It's slower, because it examines each element 32 times. It's not N
lg2(N), but 32N. This 4.5s is remarkably close to our std::sort
baseline, 7.3s. Maybe we can think of the difference between 4.5s and
our baseline as what we pay for the generality of std::sort? 32N is
not far from 27N. Should we take 4.5s * 27/32 = 3.8s as an ultimate
stretch goal?
Starting Point: quicksort
Let's see how well we can do with a regular sorting algorithm. Start
by pasting in a bog-standard quicksort^6:
int* partition(int* begin, int* end) {
    int pivot = end[-1];
    int* left = begin;
    for (int* right = begin; right < end - 1; ++right) {
        if (*right <= pivot) {
            int tmp = *left; *left = *right, *right = tmp;
            ++left;
        }
    }
    int tmp = *left; *left = end[-1], end[-1] = tmp;
    return left;
}
void quicksort(int* begin, int* end) {
    while (end - begin > 1) {
        int* mid = partition(begin, end);
        quicksort(begin, mid);
        begin = mid + 1;
    } }
- auto* a = (unsigned*) ::mmap(
+ auto* a = (int*) ::mmap(
- sort_radix2(a, a + size);
+ quicksort(a, a + size);
real 0m8.309s
user 0m8.193s
sys 0m0.116s
This is really not bad. The std::sort in the standard library, at
7.3s, is a lot more code than we see here, but it has to perform well
on tricky cases we won't be testing.
What Is New?
What is different in modern CPUs from the machines the algorithms we
use were originally tuned for? Well, lots. Today they have up to a
half-billion transistors per core, not just the thousands originally
used to execute a notably similar instruction set. (Imagine, in each
generation, being asked to find a way to use another million, ten
million, hundred million! more transistors, to get better benchmark
times.)
We have many more caches, registers, instruction decoders, and
functional units: barrel shifters, multipliers, ALUs^7. There's even a
little just-in-time compiler, in hardware, to translate
complex-instruction sequences into simpler ones, with its own
peephole optimizer, and a cache of its output. A small-enough loop
can execute entirely from that cache.
One thing wholly new is the branch predictor^8. Technically, it's
another cache, one that accumulates a history of which way was chosen
on each of the last M conditional branch points encountered during
execution. (The number M is secret.) Branch prediction matters
because of something else that's wholly new: speculative execution^9.
When regular execution needs to stop and wait on a result,
speculative execution can run on ahead, stuffing pipelines with work
from upcoming loop iterations. Because the values needed to decide
which way a conditional branch will go haven't been computed yet, it
has to guess, based on the pattern of what has happened before at
that instruction address. When the branch predictor guesses wrong, a
lot of work may need to be discarded, but whenever it guesses right,
that work has already been done when regular execution gets there.
These days, often, most of the work gets done speculatively, so it is
vitally important to guess right. Branch predictors have become
astonishingly good at discovering patterns in our code, and guessing
right. Nowadays, they use a neural net to identify such patterns; the
branch prediction cache holds sets of coefficients for the neural
net.
Branching on random data, though, is the worst case for any branch
predictor, no matter how smart, because it can never find more
regularity than the data has. Here, it guesses wrong half the time,
and work is done and then discarded. The pipelines never fill, and
functional units sit idle.
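This effect is easy to see outside of sorting. The toy below (a
hypothetical micro-benchmark, not part of this article's test code)
counts elements under a threshold, once over shuffled values and once
over the same values sorted; the source code is identical, but with the
branch left in place the shuffled pass runs several times slower,
because the predictor has nothing to go on. (A sufficiently aggressive
optimizer may itself remove the branch and flatten the difference,
which is rather the point of what follows.)
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Count elements below a threshold; the 'if' is a branch that depends
// directly on the data, just like the comparison in partition().
static long count_below(const std::vector<int>& v, int limit) {
    long n = 0;
    for (int x : v)
        if (x < limit) ++n;
    return n;
}

static double millis(const std::vector<int>& v) {
    auto t0 = std::chrono::steady_clock::now();
    volatile long n = count_below(v, 500);   // volatile: keep the result live
    (void) n;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> v(10'000'000);
    std::mt19937 gen(1);
    std::uniform_int_distribution<int> dist(0, 999);
    for (int& x : v) x = dist(gen);
    std::printf("random: %6.1f ms\n", millis(v));  // predictor guesses blind
    std::sort(v.begin(), v.end());
    std::printf("sorted: %6.1f ms\n", millis(v));  // branch becomes predictable
}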
Adaptation
If we want to better adapt our algorithm to the way a modern CPU
works, we need to protect speculative execution against randomness in
our data. Our only available course is to eliminate conditional
branches that depend on that data.
So, let's see what our conditional branches are doing. In this
quicksort, there are only three conditional branches.
The first check, at the top of quicksort itself:
while (end - begin > 1) {
detects when recursion has bottomed out. It evaluates to true half
the time, on a complicated schedule, but it is evaluated only about N
times, and hardly depends on the input.
The second:
for (int* right = begin; right < end - 1; ++right) {
controls the loop in partition that walks through a subrange of the
input. It runs about N lg2(N) times, but usually it's taken. It, too,
doesn't depend much on input values. It gets hard to predict only
when begin is very close to end, a small multiple of N times.
The third conditional branch:
if (*right <= pivot) {
happens N lg2(N) times, too, but it depends directly on the values of
elements in the input, every time. To prevent branch mispredictions, we
must find a way to avoid that branch. We need exactly the same sequence
of instructions executed in the true case and in the false case, with
nothing actually changing in the false case.
The portable way to do this is to turn the result of the comparison
into a regular value, and then do ordinary arithmetic with it.
Sometimes, we can use the boolean value itself, as is. Perhaps the
simplest possible example is:
if (a < b) ++i;
which can always be transformed, via conversion to int, to:
i += (a < b);
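Inside a loop, that one-line transformation is already enough to remove
a data-dependent branch altogether. A branch-free counterpart of the
counting loop sketched earlier might look like this (illustrative only):
// Branch-free count: the comparison result is added as 0 or 1, so the
// same instructions execute for every element, whatever its value.
long count_below(const int* p, const int* end, int limit) {
    long n = 0;
    for (; p != end; ++p)
        n += (*p < limit);
    return n;
}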
A more general method is to convert the boolean value into a mask
value, of all zeroes or all ones. Then we can use bitwise operations
to produce different results for the all-zeroes and the all-ones
cases. For a slightly more complicated example, here we add up two
seconds/nanoseconds pairs:
secs = secs1 + secs2, nsecs = nsecs1 + nsecs2;
if (nsecs >= 1'000'000'000) {
++secs, nsecs -= 1'000'000'000;
}
This transforms to:
secs = secs1 + secs2, nsecs = nsecs1 + nsecs2;
int carry = (nsecs >= 1'000'000'000);
secs += carry, nsecs -= ((-carry) & 1'000'000'000);
With a carry, (-carry) is 111...111, and it subtracts a billion; with
no carry, (-carry) is zero, and it subtracts zero.
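Packaged as a function, with an invented struct standing in for
whatever time type is actually in use, the whole transformation reads:
#include <cstdint>
struct Stamp { int64_t secs, nsecs; };   // invented type, for illustration
Stamp add(Stamp x, Stamp y) {
    int64_t secs = x.secs + y.secs, nsecs = x.nsecs + y.nsecs;
    int64_t carry = (nsecs >= 1'000'000'000);
    secs += carry, nsecs -= ((-carry) & 1'000'000'000);
    return {secs, nsecs};
}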
A New Primitive, swap_if
How can we use this method in our sort? First, let us make a swap_if:
inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c; // false -> 0, true -> 111..111
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}
In our partition function, then, we can transform
if (*right <= pivot) {
int tmp = *left; *left = *right, *right = tmp;
++left;
}
into just
left += swap_if(*right <= pivot, *left, *right);
This expands to more code, and all of it runs on every iteration, not
just half of them, as before. But now, there is no branch to predict,
or to mis-predict. It is just straight-line code.
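Put back into context, the whole partition function now reads, with
only that one substitution:
int* partition(int* begin, int* end) {
    int pivot = end[-1];
    int* left = begin;
    for (int* right = begin; right < end - 1; ++right) {
        // no data-dependent branch: the same instructions execute
        // whether or not the two elements actually trade places
        left += swap_if(*right <= pivot, *left, *right);
    }
    int tmp = *left; *left = end[-1], end[-1] = tmp;
    return left;
}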
Trying it:
$ g++ -O3 -march=native sort_swap_if.cc && time ./a.out
real 0m5.792s
user 0m5.679s
sys 0m0.112s
This is excellent! 5.7s is 1.44x better than before, and 1.29x as
fast as std::sort.
Another way to make a value control what happens is indexing. (We
have already seen an example of this, in the second radix sort
presented above.) The boolean value is used as an array index:
int v[2] = { a, b };
b = v[1-c], a = v[c];
real 0m4.576s
user 0m4.459s
sys 0m0.116s
Even better! This is 1.84x as fast as the naive quicksort, even
though the values have to take a detour through L1 cache so they can
have addresses. This version is easily generalized to types that are
not just a machine word:
template <typename T>
bool swap_if(bool c, T& a, T& b) {
    T v[2] = { std::move(a), std::move(b) };
    b = std::move(v[1-c]), a = std::move(v[c]);
    return c;
}
When T is int, compilers generate identical code for the template
version and for the plain-int version above.
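A quick check that the move-based template behaves with a non-trivial
element type (a throwaway test, nothing more):
#include <cassert>
#include <string>
#include <utility>
template <typename T>
bool swap_if(bool c, T& a, T& b) {
    T v[2] = { std::move(a), std::move(b) };
    b = std::move(v[1-c]), a = std::move(v[c]);
    return c;
}
int main() {
    std::string a = "pear", b = "apple";
    swap_if(b < a, a, b);    // condition true: the strings trade places
    assert(a == "apple" && b == "pear");
    swap_if(b < a, a, b);    // condition false: nothing changes
    assert(a == "apple" && b == "pear");
}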
For completeness, there is also:
uint64_t both = (uint64_t(uint32_t(a)) << 32) | uint32_t(b);
int shift = c * 32;
both = (both << shift) | (both >> ((64 - shift) & 63)); // well-defined when shift is 0
a = int(both >> 32), b = int(uint32_t(both & 0xffffffff));
The line with the shifts turns into a single "rotate" instruction.
But this is not faster than the indexed version: it runs in 4.8s.
cmov Considered Disturbing
How does the first, "and-and-or" version do with Clang?
$ clang++ -O3 -march=native sort_swap_if.cc && time ./a.out
real 0m3.551s
user 0m3.430s
sys 0m0.120s
HOLY CRAP! 3.4s? What just happened?
Weren't we just guessing that 3.8s was as fast as we should ever hope
to get? This is 2.4x as fast as quicksort, and more than 2x as fast
as std::sort! This raises so many questions.
First, why the big difference between G++ and Clang? A quick detour
through Godbolt^10 reveals what is going on. G++ actually generated
the four "and", and two "or" instructions seen in swap_if. But
Clang's optimizer recognized what we were trying to do with the
masks, and replaced it all with a pair of simple cmov,
conditional-move, instructions. (On top of that, it unrolled the loop
in partition.)
What is the cmov instruction? ARM has had conditional moves forever. On
x86, cmov arrived with the Pentium Pro back in 1995, and in 2000 AMD
made it a guaranteed baseline feature of its 64-bit x86 ISA extensions,
which Intel then had to adopt when Itanium flopped.
cmov just copies a value from one place to another (register to
register, or memory to register), but only if a condition-code flag has
been set, such as by a recent comparison instruction. Since cmov
replaces a conditional branch, the result of the comparison doesn't
need to be predicted. The execution unit doesn't know which value it
will end up with, but it knows where it will be, and can schedule
copying that out to cache, and eventually memory, and run on ahead,
without discarding anything.
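For a feel of what cmov buys, the canonical case is a simple minimum.
Compilers are generally happy to lower the ternary below to a compare
followed by a conditional move, with no branch at all (easy to confirm
on Godbolt); this is unrelated to the swap_if difficulty discussed next:
// Typically compiles to: cmp, then cmovl; nothing for the predictor to miss.
int min_int(int a, int b) {
    return a < b ? a : b;
}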
Why doesn't G++ generate cmov instructions? Older releases did, in
fact, often enough to generate bug reports^11. It turned out that,
any time a subsequent loop iteration depends on the result, and the
branch would have been predicted correctly, cmov may be slower than a
branch, sometimes much slower. cmov can stall speculation all by
itself.
The latest designs from Intel and AMD are said to often avoid the cmov
pipeline stall, but GCC has not caught up with them yet.
For more on cmov and pessimization in G++, follow links in ^12.
In this case, though, nothing later depends on the result; it just gets
stored, and the loop moves on. So this is a poster-child case for using
cmov.
Clang, it turns out, can turn a simpler definition of swap_if into
cmov instructions:
int ta = a, tb = b;
a = c ? tb : ta;
b = c ? ta : tb;
But for this, G++ just generates a badly-predicted branch. I have not
discovered a way to persuade G++ to produce cmov instructions (not
even in old releases, and not even with the new
__builtin_expect_with_probability intrinsic, probability 0.5). Even
"profile-guided optimization" doesn't help. (In the Linux kernel,
wherever a cmov is needed, a macro expands directly to assembly
code.) Looking into G++, it appears that it refuses to emit a cmov if
anything else follows it in the "basic block", even if what follows
ought to be another cmov^13.
Faster Than (Pessimized) Radix Sort?
Another big question: how could the Clanged version be even faster
than our target time? This might come back to cmov, again. Radix sort
always performs the same number of copy operations. So, also, do our
first three versions of swap_if, as compiled by G++. But with cmov,
only half of the swap operations end up needing to copy the pair of
words out to L1 cache^14,^15, and thereby, perhaps, delay reading the
next pair from it. There may be other reasons: radix sort sweeps
majestically, sequentially through the whole sequence, relying on the
prefetcher to keep ahead of it, while quicksort spends much of its
time noodling around with values in L1 cache.
In the original quicksort, k was almost 3ns, but now it is just
1.3ns. More than half of the time we thought was being spent on
honest computation was wasted, sitting stalled on branch
mis-predictions! Who knows how much is still wasted? (Actually, the
CPU knows: it keeps a count of how many of various cache misses
happened, and you can read those out, given some care, with perf^16.)
Next
What can we do with what we've discovered? Sure, we can make our own
programs faster, particularly if we are using Clang. But it would be
better if we could make all programs faster.
We could propose a new library function, std::swap_if. Then,
implementers could ensure it uses cmov for machine-word types
(including pointers and floating-point values, and even small objects
like std::string_view), and use it in their sorting and partitioning
code. We could use it in our own programs, too. But for it to do much
good, we would need to get it into a Standard, and then persuade many
people to change a lot of code.
The experience with Clang's optimizer hints at an alternative: Why
can't compilers recognize the sequence we did, and perform their own
transformation?
This would be way better; just recompile, and all code gets faster,
some a lot faster, with no need to rewrite any of it. The std::sort
implementations in libstdc++ and libc++ are quite a bit more
complicated than our toy quicksort, but maybe a compiler could spot
places to transform that look unpromising to us.
Getting this optimization into compilers depends on those few who can
add a new optimization to G++'s or Clang's code generator taking time
away from other optimization work. Doesn't a 2x improvement over
current std::sort automatically deserve that kind of attention? But
it might not be so easy: too often, cmov is slower. The optimizer
doesn't just need to recognize when it could substitute cmov for a
branch, it needs to decide whether it should.
While the optimizer knows whether anything depends on the result, it
doesn't know whether the input would have patterns that the branch
predictor could tease out. Such an optimization would need a great
deal of testing to ensure it doesn't pessimize (too much) code.
Still, Clang, at least, seems to be all-in on plugging in cmov where
it seems to make sense. It just needs to learn to recognize the
conditionally-swapping and the conditionally-incrementing cases. It
should be able to compose those into the transformation used here,
and thence to cmov.
Conclusion
What can we conclude from this discovery? Before we conclude
anything, we should remind ourselves of its limitations. The tests
run were on completely random data. Truly random data seldom occurs
in real life. If there is a pattern in the data, the branch predictor
might be able to find and exploit it. Shouldn't it get that chance,
sometimes?
Furthermore, the idea was tested on only the simplest possible
elements, where both comparing and swapping were as cheap as they
ever can be. While sorting word-sized objects is still an important
use case, real sorting problems often have very different time
budgets. We never changed the number or kind of comparisons
performed, but the first big improvements doubled the number of
copies; then the final improvement halved them again. A copy is a
quarter or a third of a swap, but a dramatic change in their number
had comparatively little effect on run time.
For many object types, the self-moves seen in the simplest,
conditional-expression swap_if version are not permitted. For a
production library, we might need to specialize on whether self-moves
are allowed.
Our deep takeaway might be that counting comparisons and swaps
ignores what are, today, the actually expensive operations: cache
misses, of all kinds. Despite the millions of transistors devoted to
the task, caches miss, sometimes systematically. Any time the number
of misses, of all kinds, is not directly proportional to the number
of operations counted, an analysis may produce the wrong answer, and
lead us to bad choices.
But a lesson for the working programmer might be that, sometimes, you
know the problem better than the compiler or cache systems, and
measuring can lead you to simple code changes^17 that avoid costly
misses, with sometimes dramatic results.
Recommendations
Do these results suggest improvements to our Standard Library?
In C++, a factor of two in performance matters. We need two versions
of each of the Standard sort, partition, binary search, merge, and
heap algorithms that constitute nearly half of <algorithm>^18: one
set that shields the branch predictor, and a second set that exposes
the branch predictor to any patterns it can tease out. But this does
not mean we need that many new named library functions! Instead, we
can bind the choice to an optional property of the comparison
predicate. The default should be to shield the branch predictor, at
least for small types, because the consequence of failing to protect
it can be so severe.
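One possible shape for that hook, purely as a sketch (every name here
is invented for illustration; none of this is an existing or proposed
API):
#include <type_traits>
// A comparator opts in to exposing any pattern in the data to the
// branch predictor by deriving from a tag type; the default is to shield it.
struct expose_patterns_t {};
template <typename Compare>
inline constexpr bool shields_predictor_v =
    !std::is_base_of_v<expose_patterns_t, Compare>;
// A sort implementation could then select a branch-free partition
// whenever shields_predictor_v<Compare> is true.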
A branchless std::swap_if would be good to have in the Standard
Library^19, even after optimizers learn to generate it themselves,
because optimizers are sometimes too cautious. We should consider
carefully, also, whether std::count_if merits attention.
More to Come
Is that everything? Just the one weird trick?
No! This was just one example. Modern CPU chips are packed to the
rafters with dodgy gimcracks. They have pre-fetchers, micro-op caches
with micro-op fusion, register renaming, a shadow return-address
stack, pipelines everywhere, atomic interlocks, translation lookaside
buffers, hugepages, vector units. Caches collude with one another
over private buses.
It all makes many programs go faster than they have any business
going, but not necessarily your program. A better algorithm is no
longer enough to get top performance; your program needs to join in
the dance of the million-transistor accelerators. Anybody who insists
C is close to the machine is, at best, deluded.
Can the dance of the gimcracks drive k south of one nanosecond? Stay
tuned^20.
Finally: If you control a budget, and depend on the performance of
software, it might be a good use of that budget to help improve
compilers and standard libraries in the ways suggested.
The author thanks Andrei Alexandrescu for a most thorough and helpful
review of an early, and very different, draft of this article.
---------------------------------------------------------------------
1. https://arxiv.org/abs/1811.01259
2. https://arxiv.org/abs/1810.12047
3. https://arxiv.org/abs/1604.06697
4. https://gitlab.com/ncmncm/sortfast/
5. Programs were run on an i7-7820X (Skylake-X).
6. I call this bog-standard, but the quicksorts most easily found
   online are always oddly pessimized, both more complicated and
   slower. This one uses the "Lomuto partition", which is simpler
   than Hoare's.
7. https://www.agner.org/optimize/microarchitecture.pdf
8. https://danluu.com/branch-prediction/
9. https://en.wikipedia.org/wiki/Speculative_execution
10. https://godbolt.org/z/k4iZAB
11. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309
12. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85559
13. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165
14. https://gist.github.com/jboner/2841832
15. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.957
16. http://www.brendangregg.com/perf.html
17. https://www.agner.org/optimize/optimizing_cpp.pdf
18. https://en.cppreference.com/w/cpp/algorithm
19. https://cantrip.org/swap_if.pdf
20. Little optimization pun there.