http://0x80.pl/notesen/2025-01-18-simd-heap.html
SIMD binary heap operations
Author: Wojciech Mula
Added on: 2025-01-18
Contents
* Introduction
* is_heap
+ Experiments
o Ryzen 7
o Skylake-X
o Ice Lake
* push_heap
* See also
* Source code
Introduction
Binary heap is a binary tree data structure having some interesting
properties. One of them is an array-friendly memory layout, achieved
by building (almost) complete binary tree.
A binary heap keeps at index 0 the maximum value, or the minimum one
depending on convention -- let's stick to maximum heaps. There is
exactly one invariant: a child node, if exist, keep a value less than
the parent node. For comparison, in the case of binary search trees,
we have more strict rules: the left child keeps a smaller value than
the parent's value, and the right child keeps a bigger value.
A non-root node stored at index i have the parent node at index floor
[(i - 1)/2], and children nodes at indices 2 [?] i + 1 and 2 [?] i + 2.
In this text we cover two procedures related to heaps:
* is_heap -- checks if an array is proper heap,
* push_heap -- adds a new element to the heap,
The procedure is_heap is vectorizable and using SIMD instructions
brings profit. We also show that it is possible to define this
function using forward iterators rather random iterators, as the C++
standard imposes.
The procedure push_heap can be expressed with gather and scatter, but
performance is terrible. For the sake of completeness, we show the
AVX-512 implementation.
There is also one more crucial method for heaps: removing the maximum
value, pop_heap. However, it is difficult to vectorize, and benefits
from vectorization would likely be worse than in the case of
push_heap.
is_heap
The easiest approach to check if the given array is a proper heap is
scanning the array from beginning, calculate the left and right
children indices and then compare their values according to the
invariant. We stop scanning, when both children indices are out of
the array bounds.
However, this algorithm is not suitable for vectorization. What we
need is sequential scan over array. We need two pointers: one
pointing parents and another pointing children nodes. The "parent"
pointer start at index 0, and is incremented by 1. The "children"
pointer nodes start at index 1, and is incremented by 2.
Below is C++ template using the modified algorithm. What the typename
argument suggests, this is suitable for forward iterators, while the
standard C++ functions from expect random iterators.
template >
bool is_heap_fwd(ForwardIterator start, ForwardIterator end, Compare cmp) {
if (start == end)
return true;
auto parent = start;
auto current = std::next(start);
while (current != end) {
if (cmp(*parent, *current))
return false;
current = std::next(current);
if (current == end)
break;
if (cmp(*parent, *current))
return false;
parent = std::next(parent);
current = std::next(current);
}
return true;
}
Below is a memory layout of a heap. We need to check if parents are
greater than their children, that is: a > b and a > c, then b > d and
b > e, then e > h, e > i, etc.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
| a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
^ ^ ^ ^ ^ ^ ^ ^
| | | | | | | |
| +---+ +-----------+ +---------------------------+ +----
| level 1 level 2 level 3
|
tree level 0
We can observe that a single parent node is used twice, thus our
vectorized approach would work in the following manner:
1. Load vector of parent nodes; let's assume vectors hold 8
elements.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
p = | a | b | c | d | e | f | g | h |
+---+---+---+---+---+---+---+---+
2. Load two vectors of child nodes.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
c0 = | b | c | d | e | f | g | h | i |
+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+
c1 = | j | k | l | m | n | o | p | r |
+---+---+---+---+---+---+---+---+
3. Double the parent nodes: each item appears them twice.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
p0 = | a | a | b | b | c | c | d | d |
+---+---+---+---+---+---+---+---+
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
p1 = | e | e | f | f | g | g | h | h |
+---+---+---+---+---+---+---+---+
4. Now both pairs p0/c0 and p1/c1 contain corresponding parent-child
pairs, thus a plain vector comparison yield whether all pairs
hold the relation parent-child.
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
p0 = | a | a | b | b | c | c | d | d | p1 = | e | e | f | f | g | g | h | h |
+---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
c0 = | b | c | d | e | f | g | h | i | c1 = | j | k | l | m | n | o | p | r |
+---+---+---+---+---+---+---+---+ +---+---+---+---+---+---+---+---+
The vectorized AVX2 code follows this algorithm.
bool is_heap_avx2_epi32(const int32_t* begin, const int32_t* end) {
const ssize_t k = 32/4; // words in a vector
if (end - begin < 2 * k) {
return std::is_heap(begin, end);
}
const int32_t* parent = begin;
const int32_t* current = begin + 1;
const __m256i lo = _mm256_setr_epi32(0, 0, 1, 1, 2, 2, 3, 3);
const __m256i hi = _mm256_setr_epi32(4, 4, 5, 5, 6, 6, 7, 7);
while (end - current >= 2 * k) {
// 1. load parents
const __m256i tmp = _mm256_loadu_si256((const __m256i*)parent);
const __m256i p0 = _mm256_permutevar8x32_epi32(tmp, lo);
const __m256i p1 = _mm256_permutevar8x32_epi32(tmp, hi);
// 2. load children
const __m256i children0 = _mm256_loadu_si256((const __m256i*)(current + 0));
const __m256i children1 = _mm256_loadu_si256((const __m256i*)(current + k));
// 3. compare parents with their children
const __m256i lt0 = _mm256_cmpgt_epi32(children0, p0);
const __m256i lt1 = _mm256_cmpgt_epi32(children1, p1);
const __m256i t0 = _mm256_or_si256(lt0, lt1);
if (_mm256_movemask_epi8(t0))
return false;
if (current + 2*k > end)
break;
parent += k;
current += 2*k;
}
for (ssize_t i = current - begin; i < end - begin; i++) {
if (begin[(i - 1) / 2] < begin[i]) {
return false;
}
}
return true;
}
The only downside of AVX2 is lack of comparison of unsigned 32-bit
integers. Although it is possible to express relation less-or-equals
with the min and equality: a <= b = min(a, b) = a.
Experiments
The benchmark utility was compiled with -O3 -march=native and invoked
with parameters ./benchmarks 1024 4096 8192, where the parameters are
input sizes.
Benchmarked procedures
+-------------------------------------------------------------------+
|Procedure | Comments |
|----------+--------------------------------------------------------|
|std |standatd C++ std::is_heap with random access iterators |
|----------+--------------------------------------------------------|
|fwd scalar|scalar implementation of the approach using forward |
| |iterators |
|----------+--------------------------------------------------------|
|rnd scalar|reimplementation of std::is_heap with random access |
| |iterators |
|----------+--------------------------------------------------------|
|fwd SSE |SSE implementation of the approach using forward |
| |iterators |
|----------+--------------------------------------------------------|
|fwd AVX2 |AVX2 implementation of the approach using forward |
| |iterators |
|----------+--------------------------------------------------------|
|fwd |AVX-512 implementation of the approach using forward |
|AVX-512 |iterators |
+-------------------------------------------------------------------+
Ryzen 7
* Compiler: gcc (Debian 14.1.0-5) 14.1.0
* CPU: AMD Ryzen 7 7730U with Radeon Graphics
+--------------------------------------------------------------------+
| | time in | |
|procedure| cycles per | speed-up |
| | byte | |
|---------+-------------+--------------------------------------------|
| |average|best | | |
|--------------------------------------------------------------------|
|Input size 1024 |
|--------------------------------------------------------------------|
|std |1.024 |1.016|1.0|###### |
|---------+-------+-----+---+----------------------------------------|
|fwd |0.454 |0.449|2.3|############# |
|scalar | | | | |
|---------+-------+-----+---+----------------------------------------|
|rnd |0.646 |0.645|1.6|######### |
|scalar | | | | |
|---------+-------+-----+---+----------------------------------------|
|fwd SSE |0.229 |0.215|4.7|########################## |
|---------+-------+-----+---+----------------------------------------|
|fwd AVX2 |0.142 |0.137|7.4|########################################|
|--------------------------------------------------------------------|
|Input size 4096 |
|--------------------------------------------------------------------|
|std |0.886 |0.879|1.0|##### |
|---------+-------+-----+---+----------------------------------------|
|fwd |0.451 |0.439|2.0|######### |
|scalar | | | | |
|---------+-------+-----+---+----------------------------------------|
|rnd |0.443 |0.439|2.0|######### |
|scalar | | | | |
|---------+-------+-----+---+----------------------------------------|
|fwd SSE |0.169 |0.166|5.3|######################## |
|---------+-------+-----+---+----------------------------------------|
|fwd AVX2 |0.103 |0.098|9.0|########################################|
|--------------------------------------------------------------------|
|Input size 8192 |
|--------------------------------------------------------------------|
|std |0.892 |0.881|1.0|##### |
|---------+-------+-----+---+----------------------------------------|
|fwd |0.452 |0.439|2.0|######### |
|scalar | | | | |
|---------+-------+-----+---+----------------------------------------|
|rnd |0.446 |0.444|2.0|######### |
|scalar | | | | |
|---------+-------+-----+---+----------------------------------------|
|fwd SSE |0.168 |0.166|5.3|######################## |
|---------+-------+-----+---+----------------------------------------|
|fwd AVX2 |0.101 |0.098|9.0|########################################|
+--------------------------------------------------------------------+
Skylake-X
* Compiler: gcc (GCC) 11.2.0
* CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
+---------------------------------------------------------------------+
| | time in | |
|procedure| cycles per | speed-up |
| | byte | |
|---------+-------------+---------------------------------------------|
| |average|best | | |
|---------------------------------------------------------------------|
|Input size 1024 |
|---------------------------------------------------------------------|
|std |2.529 |2.509|1.0 |##### |
|---------+-------+-----+----+----------------------------------------|
|fwd |2.021 |2.013|1.2 |###### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|rnd |1.033 |1.026|2.4 |########### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|fwd SSE |0.584 |0.574|4.4 |#################### |
|---------+-------+-----+----+----------------------------------------|
|fwd AVX2 |0.350 |0.328|7.6 |################################## |
|---------+-------+-----+----+----------------------------------------|
|fwd |0.286 |0.276|9.1 |########################################|
|AVX-512 | | | | |
|---------------------------------------------------------------------|
|Input size 4096 |
|---------------------------------------------------------------------|
|std |2.567 |2.498|1.0 |### |
|---------+-------+-----+----+----------------------------------------|
|fwd |2.008 |2.000|1.2 |#### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|rnd |1.021 |1.005|2.5 |######## |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|fwd SSE |0.547 |0.541|4.6 |############## |
|---------+-------+-----+----+----------------------------------------|
|fwd AVX2 |0.289 |0.280|8.9 |########################### |
|---------+-------+-----+----+----------------------------------------|
|fwd |0.190 |0.184|13.6|########################################|
|AVX-512 | | | | |
|---------------------------------------------------------------------|
|Input size 8192 |
|---------------------------------------------------------------------|
|std |2.566 |2.497|1.0 |### |
|---------+-------+-----+----+----------------------------------------|
|fwd |2.006 |1.998|1.2 |#### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|rnd |1.011 |1.001|2.5 |####### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|fwd SSE |0.539 |0.536|4.7 |############# |
|---------+-------+-----+----+----------------------------------------|
|fwd AVX2 |0.289 |0.280|8.9 |######################## |
|---------+-------+-----+----+----------------------------------------|
|fwd |0.175 |0.164|15.2|########################################|
|AVX-512 | | | | |
+---------------------------------------------------------------------+
Ice Lake
* Compiler: gcc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2)
* CPU: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
+---------------------------------------------------------------------+
| | time in | |
|procedure| cycles per | speed-up |
| | byte | |
|---------+-------------+---------------------------------------------|
| |average|best | | |
|---------------------------------------------------------------------|
|Input size 1024 |
|---------------------------------------------------------------------|
|std |4.499 |2.584|1.0 |#### |
|---------+-------+-----+----+----------------------------------------|
|fwd |1.576 |1.096|2.4 |######### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|rnd |1.528 |0.875|3.0 |########### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|fwd SSE |0.976 |0.543|4.8 |################## |
|---------+-------+-----+----+----------------------------------------|
|fwd AVX2 |0.551 |0.297|8.7 |################################ |
|---------+-------+-----+----+----------------------------------------|
|fwd |0.356 |0.236|10.9|########################################|
|AVX-512 | | | | |
|---------------------------------------------------------------------|
|Input size 4096 |
|---------------------------------------------------------------------|
|std |4.497 |2.725|1.0 |## |
|---------+-------+-----+----+----------------------------------------|
|fwd |1.630 |1.196|2.3 |##### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|rnd |1.531 |1.123|2.4 |##### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|fwd SSE |0.954 |0.563|4.8 |########### |
|---------+-------+-----+----+----------------------------------------|
|fwd AVX2 |0.496 |0.286|9.5 |#################### |
|---------+-------+-----+----+----------------------------------------|
|fwd |0.228 |0.143|19.1|########################################|
|AVX-512 | | | | |
|---------------------------------------------------------------------|
|Input size 8192 |
|---------------------------------------------------------------------|
|std |4.516 |2.900|1.0 |### |
|---------+-------+-----+----+----------------------------------------|
|fwd |1.555 |1.211|2.4 |####### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|rnd |1.507 |1.083|2.7 |####### |
|scalar | | | | |
|---------+-------+-----+----+----------------------------------------|
|fwd SSE |0.965 |0.620|4.7 |############ |
|---------+-------+-----+----+----------------------------------------|
|fwd AVX2 |0.475 |0.355|8.2 |##################### |
|---------+-------+-----+----+----------------------------------------|
|fwd |0.220 |0.187|15.5|########################################|
|AVX-512 | | | | |
+---------------------------------------------------------------------+
push_heap
The procedure push_heap adds a new element to heap. We append the new
value to the and of the array. We start from this index, and check
whether the parent is greater. If it is, the heap invariant is hold,
so no more work is needed. Otherwise, we swap the new item with the
parent and recheck the swapped element with its new parent.
In each iteration we go up, until we reach the root.
The vectorized approach is based on the following observation: when
we build a list of values from the new node one up to the root, then
-- for a proper heap -- this list is sorted. Otherwise, we need to find
position where to put this new value. Taking this form another angle,
push_heap performs insertion sort on that list.
For example, when we're adding 11-th element of value X to the heap
show below, then its parents are at indices: 5, 2, and 0. Thus, we're
considering a list a, b, c, X and we already know that a > b, b > c.
+---+
| a | <---------- level 0
+---+
0
+---+---+
| b | c | <-------- level 1
+---+---+
1 2
+---+---+---+---+
| d | e | f | g | <---- level 2
+---+---+---+---+
3 4 5 6
+---+---+---+---+---+
| h | i | j | k | X | <---- level 3
+---+---+---+---+---+
7 8 9 10 11
Vectorized approach consists the following steps.
1. Prepare list of indices. We start at index n:
idx0 = n;
idx1 = (idx0 - 1) >> 1;
idx2 = (idx1 - 1) >> 1;
idx3 = (idx2 - 1) >> 1;
...
idx7 = (idx6 - 1) >> 1;
2. Perform gather using this list.
idx0 idx1 idx2 idx3 idx4 idx5 idx6 idx7
+----+----+----+----+----+----+----+----+
values = | 15 | 2 | 3 | 7 | 10 | 13 | 20 | 30 |
+----+----+----+----+----+----+----+----+
^
|
+-- new value
3. Locate position where to insert the new value.
+----+----+----+----+----+----+----+----+
new = | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
+----+----+----+----+----+----+----+----+
+----+----+----+----+----+----+----+----+
values = | 15 | 2 | 3 | 7 | 10 | 13 | 20 | 30 |
+----+----+----+----+----+----+----+----+
mask = new > values
+----+----+----+----+----+----+----+----+
mask = | 00 | ff | ff | ff | ff | ff | 00 | 00 |
+----+----+----+----+----+----+----+----+
0 7
4. Vector mask is converted into a bitmask (0b0011_1110) and the
number of leading zeroes uniquely identifies the shuffle pattern.
5. We load a precomputed shuffle pattern and perform shuffling.
+----+----+----+----+----+----+----+----+
values' = | 2 | 3 | 7 | 10 | 13 | 15 | 20 | 30 |
+----+----+----+----+----+----+----+----+
^
|
new value
6. Finally we scatter the vector to initial positions in the heap.
7. Repeat when the new item was placed on the last position after
shuffle.
First of all, this algorithm is not suitable for AVX2, as AVX2 does
not support scatters. The AVX-512 implementation is shown below.
void push_heap_avx2(int32_t* start, size_t size) {
if (size <= 255) {
std::push_heap(start, start + size);
return;
}
size_t index = size - 1;
while (index >= 256) {
// 1. Construct indices from the current element to parent nodes 7 levels up
ssize_t parent = 0;
uint32_t tmp[8];
tmp[0] = index;
for (int i=1; i < 8; i++) {
parent = (index - 1)/2;
tmp[i] = parent;
index = parent;
}
const __m256i indices = _mm256_load_si256((const __m256i*)tmp);
// 2. Load values from the selected path
const __m256i values = _mm256_i32gather_epi32((const int*)start, indices, sizeof(uint32_t));
// 3. Broadcast 0th element from the vector
const __m256i last_value = _mm256_permutevar8x32_epi32(values, _mm256_setzero_si256());
// 3. Check if the heap property is violated.
const __m256i mask = _mm256_cmpgt_epi32(values, last_value);
const uint8_t any_parent_less = _mm256_movemask_ps(_mm256_castsi256_ps(mask));
if (any_parent_less == 0) {
const __m256i sorted = _mm256_permutevar8x32_epi32(values, avx2_sort_values[7]);
_mm256_i32scatter_epi32(start, indices, sorted, sizeof(uint32_t));
index = tmp[7];
continue;
}
// 4. Elements on the path be should sorted, we need to locate where to insert a new value.
const int new_index = __builtin_ctz(any_parent_less) - 1;
if (new_index > 0) {
const __m256i sorted = _mm256_permutevar8x32_epi32(values, avx2_sort_values[new_index]);
_mm256_i32scatter_epi32(start, indices, sorted, sizeof(uint32_t));
}
}
std::push_heap(start, start + index + 1);
}
This cannot be faster than a scalar implementation. In the worst case
a scalar code execute log[2]n memory loads, comparisons and swaps. In
best case only two memory loads and one comparison is needed.
The vectorized code needs to construct vector of indices, than at
least one gather and scatter which are slow. My experiments on quite
decent Ice Lake machine showed 10x slowdowns, which makes this
approach unpractical (although it is nice).
See also
* Is sorted using SIMD instructions
Source code
Sample implementation is available at GitHub.