[HN Gopher] A performance analysis of Intel x86-SIMD-sort (AVX-512)
___________________________________________________________________
A performance analysis of Intel x86-SIMD-sort (AVX-512)
Author : Twirrim
Score : 73 points
Date : 2023-06-10 18:42 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gavinray wrote:
| In the event the author shows up: What is the mental process like
| when you work on/figure out these sorts of things?
|
| Do you have a sort of intuitive understanding/feeling for what
| you ought to do, or a mental visualization, or what goes on
| inside of your head?
|
| Some of this code is mind-bending and I'm struggling to see how
| someone arrives at stuff like this:
|
| [1]: https://github.com/Voultapher/sort-research-
| rs/blob/42bf339d...
|
| [2]: https://github.com/Voultapher/sort-research-
| rs/blob/42bf339d...
| tesdinger wrote:
| You cleverly wrote "sort" two times without referring to actual
| sorting in a thread about sorting.
|
| The first algorithm that you linked is easy to explain. As soon
| as a value has been stored in another location its original
| location can be overwritten. To get started we move something
| to an additional location called temp A > Temp. Then we can
| overwrite it with the intended value Z > A. Then we can
| overwrite Z and so on and so forth.
| Voultapher wrote:
| Ok so both the things you linked are not my ideas. The first
| one is AFAIK originally from the BlockQuicksort paper
| https://arxiv.org/pdf/1604.06697.pdf section 3.2. Which Orson
| Peters used in his pdqsort, which I know base ipnsort on. The
| second one is from Igor van den Hoven in his work on quadsort
| as I mention in the function description. Really a lot of this
| stuff is building on top of other peoples work and refining it.
| Seemingly I'm the first to write neat little ASCII graphics for
| them in the code making them easier to approach. There are some
| novel ideas by me in there too though, for example
| https://github.com/Voultapher/sort-research-rs/blob/42bf339d...
| the way I marry a limited number of sorting-networks, insertion
| sort and the bi-directional merge into one, very fast but
| binary size and compile time efficient package is novel. Or
| that with LLVM you can do 2 instead of 4 writes per loop
| iteration in the bi-directional merge, which Igor has now
| ported back to his stuff as well. Generally I'd say my work has
| been more about figuring out novel ways to combine existing
| ideas into one good cohesive package fit for a standard library
| sort implementation. As well as taking ideas that work in the C
| and C++ world and doing the legwork to adapt them for use in
| Rust, which requires a lot more considerations and a lot of
| code can't be ported straight up. I wrote a little more about
| that in another comment here.
| gavinray wrote:
| Thanks for the reply!
|
| Do you feel like you have a sort of "intuition" for which
| things might work, and you experiment starting from that, or
| what is your general thought process like?
|
| Or maybe you start by drawing things out on paper? Really
| curious what the process of reasoning and thinking about
| something like this is.
| slashdev wrote:
| So Rust's ipnsort is actually one of the better sorting
| implementations across a variety of input sizes. Faster than
| avx512 sort on small arrays, slower on mid sized arrays, and
| almost the same on large ones.
|
| The avx512 sort seems like a questionable choice given it
| requires specialized hardware and is not the best option under
| many real-world conditions.
| PartiallyTyped wrote:
| It's possible that AVX512 sort does better on newer
| architectures, no? Author's machine is a 3 generations old cpu.
| slashdev wrote:
| Yeah, that would probably change the calculus. There were
| major frequency (heat) issues on older Intel cpus when
| running avx512.
| PartiallyTyped wrote:
| There's also the complication that AVX512 was removed in
| 12xxx and down, and iirc has a softlock in 11xxx.
| pstuart wrote:
| IIRC, there were issues with it causing frequency throttling
| on Intel cpus, whereas AMD's avoid that by "double pumping".
| It would be very interesting to compare and contrast there.
|
| Seems like there's be value in it for all the new ML hotness
| that's come about. The AMD 7950x seems like it hits the sweet
| spot for that.
| threadl0cal wrote:
| Only one way to find out -- benchmarks, innit?
| Voultapher wrote:
| I'd like to emphasize that ipnsort is primarily designed as the
| new Rust std library unstable sort, which means it optimizes
| for all these factors simultaneously:
|
| - input type (including stuff that is not an integer or float)
|
| - input size
|
| - input pattern
|
| - prediction state
|
| - binary size
|
| - compile times
|
| - varying ISAs and Hardware
|
| Especially binary size and compile times mean I'm not chasing
| the performance crown for integers in some HPC setting. I can
| trivially pull out another 7% perf by ignoring binary size,
| compile times and effects on types that are not integers
| https://github.com/Voultapher/sort-research-
| rs/commit/d908fe....
| mlochbaum wrote:
| This is a point that I'm kind of fuzzy on: is there a
| specific requirement to not change the algorithm too much
| based on type/comparison? Like if the user calls it on a
| 1-byte type with default comparison, the best way to do this
| in a vacuum is a counting sort. Smaller generated code and
| everything. Both C/C++ and Rust developers seem unwilling to
| do this, but not having asked I don't know why exactly.
|
| On the other hand Julia just switched to radix sort much of
| the time in their latest version, so other languages can have
| different approaches.
| Conscat wrote:
| I wonder if AVX-512 CPU throttling is a big problem for sorting
| algorithms based on those instructions.
| mafik wrote:
| I've read the C++ code used in the benchmark and noticed several
| issues that would affect its performance (wrapping comparison
| function behind multiple layers of indirection, throwing
| exceptions (!?) from comparator, passing arguments as references
| rather than values, not using the C++ sort "Compare" template
| argument). Given the amount od indirection and boilerplate I'm
| pretty skeptical of the results od this benchmark.
| mlochbaum wrote:
| Steps to build a fast, highly adaptive AVX-512 sorting algorithm
| in C:
|
| - Clone fluxsort (https://github.com/scandum/fluxsort)
|
| - Replace the partitioning code in flux_default_partition and
| flux_reverse_partition with the obvious AVX-512 version using a
| compare and two compress instructions
|
| - If you're feeling ambitious, swap out the small array sorting,
| or incorporate crumsort's fulcrum partition for larger arrays.
|
| I know why I haven't done this: my computer doesn't have AVX-512,
| and hardly anyone else's that I know seems to. Maybe a couple Zen
| 4 owners. I'm less clear on why the tech giants are reinventing
| the wheel to make these sorting alrogithms that don't even handle
| pre-sorted data rather than working with some of the very high-
| quality open source stuff out there. Is adaptivity really
| considered that worthless?
|
| Fluxsort makes this particularly simple because it gets great
| performance out of a stable out-of-place partition. It's a bit
| newer; maybe the authors weren't aware of this work or started
| before it was published. But these algorithms both use (fairly
| difficult) in-place partitioning code; why not slot that into the
| well-known pdqsort?
| gavinray wrote:
| If you look at the history of the author's own sorting
| implementation (ipnsort), it started as an attempt to port
| fluxsort to Rust:
|
| https://github.com/Voultapher/Presentations/blob/master/rust...
|
| https://www.youtube.com/watch?v=3JZAQ4Gsl-g
|
| It starts there, eventually the author abandons the linked PR
| because of a better approach found with ipnsort:
|
| https://github.com/rust-lang/rust/pull/100856#issuecomment-1...
| mlochbaum wrote:
| I'm not talking about ipnsort, which I think is great. It
| makes good use of existing work and doesn't use AVX at all.
| So extending it with AVX-512 partition code would also be a
| good project!
| Voultapher wrote:
| You are right the initial kick-off point was an attempt to
| port fluxsort for a stable sort in Rust, but that quickly
| turned out to be unfeasible because Rust implementations have
| way higher requirements in terms of safety. The user can
| modify values during a comparison operation, leave the logic
| at any comparison point via an panic (exception) and the
| comparison function may not be a total order. Together these
| effects make most of the code in fluxsort useless for Rust. I
| had been working on ipn_stable and ipn_unstable to completely
| different implementations. ipnsort, previously ipn_unstable
| started off with the pdqsort port that is the current Rust
| `slice::sort_unstable` and from there on I iterated and tried
| out different ideas.
| mlochbaum wrote:
| Well, I know what you mean but "completely different" is
| potentially misleading here. The current ipnsort is using
| bidirectional merges developed for quadsort (the merging
| part of fluxsort) and the fulcrum partition from crumsort,
| also by Scandum (all credited in comments; check the source
| if you want to see more influences!). To balance things
| out, the strategy for using the namesake sorting networks
| is new to me: pick a few fixed sizes and handle the rest by
| rounding down then a few steps of insertion sorting.
| Voultapher wrote:
| I should clarify, I meant that ipn_stable and
| ipn_unstable were separate projects, albeit with some
| cross-over ideas.
|
| I can't rule out that someone else had that idea for
| limiting the use of sorting-networks, all I can say is
| that I came up with that myself. At the same time a lot
| of the good ideas in ipnsort are ideas from other people,
| and I try my best to accredit them. As I commented
| before, my goal is not peak perf for integers at any
| cost. Here is a commit
| https://github.com/Voultapher/sort-research-
| rs/commit/d908fe... that improves perf by 7% across sizes
| for the random pattern. But that comes with a non
| negligible binary size and compile time impact even for
| integers. And while I use heuristics to only use sorting-
| networks where sensible, they can guess wrong. This is a
| generic sort implementation with the goal of being a good
| all-round implementation fit for a standard library and
| those various uses.
| moonchild wrote:
| Avx512 is also good for merging and pattern detection. Merging
| in the case of a large input (rather than just opportunistic
| for presorted subsequences) is arguably the most interesting
| part--a big queueing and data movement problem.
| nerpderp82 wrote:
| The author posted here a couple hours ago,
| https://news.ycombinator.com/item?id=36268877
| Twirrim wrote:
| Argh. Apologies to the author. I didn't see the original when I
| looked before submitting, and I expected HN to pop up a link to
| the existing post, like it usually does. Wasn't my intention to
| post a dupe :(
| gavinray wrote:
| Ah that's sort of a bummer -- same title and everything
| (initially).
|
| Hopefully they will notice this thread and comment though, or
| the threads can be merged.
| Cold_Miserable wrote:
| AVX512 is scarcely faster than radix at 1 million elements while
| being vastly more complicated.
| fooblaster wrote:
| From what I understand, these sorts are only for direct lists of
| primitives. For most of the sorts I see, I'm trying to sort an
| array of structs with an embedded key. I know this organization
| destroys the memory access patterns that make SIMD effective. Is
| the best alternative to move the key out of the struct, and sort
| two arrays: one of primitive keys and another of original
| indices? This sort would still be very amenable to SIMD. It just
| requires a final pass where you rearrange the original array with
| the indices. Alternately, you could avoid sorting an array of
| indices and try sorting an struct of arrays directly. I don't see
| a lot of people handling this very frequent situation in these
| benchmarks. Any pointers?
| mlochbaum wrote:
| It depends on the size of the structs. For struct _pointers_
| you 're likely better off sorting keys and pointers
| simultaneously. It doesn't matter much until you get to large
| sizes (millions), but sorting indices and then selecting with
| them is random access. If the original ordering is messy,
| selecting can be slower than the sorting step. For structs a
| few words long, the unit you're moving is a larger portion of a
| cache line, and I'd expect the data movement to drown out any
| SIMD advantage. A radix sort might be all right because it
| moves less, but I'd probably go with sorting indices as the
| first thing to try unless I knew the arrays were huge. For very
| large structs there's an interesting effort called mountain
| sort[0], "probably the best sorting algorithm if you need to
| sort actual mountains by height". Given that it minimizes
| number of moves it's ignoring cache entirely. I haven't
| benchmarked so I can't say much about how practical it is.
|
| [0] https://github.com/Morwenn/mountain-sort
| BobbyJo wrote:
| In situations where I need to sort a list of large objects, I
| normally create an separate "OrderedView" or "Ordering" which
| holds either two lists [comparable_value] and [index] like you
| describe, or a single list of [comparable_value, index]
| (depending on the types and languages the second option will be
| faster). Then it just provides either an iterator. If you only
| need the ordering once this saves you from going back and re-
| ordering the original array.
|
| If you need to run through the list multiple times, it probably
| makes sense to actually go back and re-order the original so
| that the accesses are linear memory. That is, unless the values
| are pointers in which the indirection will likely kill the
| benefit.
| blt wrote:
| I agree this is a super common case. I don't have a benchmark,
| but I bet when actually moving structs the C qsort starts
| looking a lot more reasonable.
| inciampati wrote:
| The data stops before it gets to useful sizes (10e7 and up). How
| are people implementing sorting algorithms not routinely working
| at 10e9-10e12 where the workload is actually a bottleneck?
| henrydark wrote:
| I run ml algorithms like boosted trees (i.e xgboost) on data
| sets with 30k-1m rows and 200-2k columns. Sorting is the
| bottleneck, it's what the algorithm does. I doubt I'm special,
| and I'm sure these size are common
___________________________________________________________________
(page generated 2023-06-10 23:00 UTC)