[HN Gopher] A performance analysis of Intel x86-SIMD-sort (AVX-512)
       ___________________________________________________________________
        
       A performance analysis of Intel x86-SIMD-sort (AVX-512)
        
       Author : Twirrim
       Score  : 73 points
       Date   : 2023-06-10 18:42 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gavinray wrote:
       | In the event the author shows up: What is the mental process like
       | when you work on/figure out these sorts of things?
       | 
       | Do you have a sort of intuitive understanding/feeling for what
       | you ought to do, or a mental visualization, or what goes on
       | inside of your head?
       | 
       | Some of this code is mind-bending and I'm struggling to see how
       | someone arrives at stuff like this:
       | 
       | [1]: https://github.com/Voultapher/sort-research-
       | rs/blob/42bf339d...
       | 
       | [2]: https://github.com/Voultapher/sort-research-
       | rs/blob/42bf339d...
        
         | tesdinger wrote:
         | You cleverly wrote "sort" two times without referring to actual
         | sorting in a thread about sorting.
         | 
         | The first algorithm that you linked is easy to explain. As soon
         | as a value has been stored in another location its original
         | location can be overwritten. To get started we move something
         | to an additional location called temp A > Temp. Then we can
         | overwrite it with the intended value Z > A. Then we can
         | overwrite Z and so on and so forth.
        
         | Voultapher wrote:
         | Ok so both the things you linked are not my ideas. The first
         | one is AFAIK originally from the BlockQuicksort paper
         | https://arxiv.org/pdf/1604.06697.pdf section 3.2. Which Orson
         | Peters used in his pdqsort, which I know base ipnsort on. The
         | second one is from Igor van den Hoven in his work on quadsort
         | as I mention in the function description. Really a lot of this
         | stuff is building on top of other peoples work and refining it.
         | Seemingly I'm the first to write neat little ASCII graphics for
         | them in the code making them easier to approach. There are some
         | novel ideas by me in there too though, for example
         | https://github.com/Voultapher/sort-research-rs/blob/42bf339d...
         | the way I marry a limited number of sorting-networks, insertion
         | sort and the bi-directional merge into one, very fast but
         | binary size and compile time efficient package is novel. Or
         | that with LLVM you can do 2 instead of 4 writes per loop
         | iteration in the bi-directional merge, which Igor has now
         | ported back to his stuff as well. Generally I'd say my work has
         | been more about figuring out novel ways to combine existing
         | ideas into one good cohesive package fit for a standard library
         | sort implementation. As well as taking ideas that work in the C
         | and C++ world and doing the legwork to adapt them for use in
         | Rust, which requires a lot more considerations and a lot of
         | code can't be ported straight up. I wrote a little more about
         | that in another comment here.
        
           | gavinray wrote:
           | Thanks for the reply!
           | 
           | Do you feel like you have a sort of "intuition" for which
           | things might work, and you experiment starting from that, or
           | what is your general thought process like?
           | 
           | Or maybe you start by drawing things out on paper? Really
           | curious what the process of reasoning and thinking about
           | something like this is.
        
       | slashdev wrote:
       | So Rust's ipnsort is actually one of the better sorting
       | implementations across a variety of input sizes. Faster than
       | avx512 sort on small arrays, slower on mid sized arrays, and
       | almost the same on large ones.
       | 
       | The avx512 sort seems like a questionable choice given it
       | requires specialized hardware and is not the best option under
       | many real-world conditions.
        
         | PartiallyTyped wrote:
         | It's possible that AVX512 sort does better on newer
         | architectures, no? Author's machine is a 3 generations old cpu.
        
           | slashdev wrote:
           | Yeah, that would probably change the calculus. There were
           | major frequency (heat) issues on older Intel cpus when
           | running avx512.
        
             | PartiallyTyped wrote:
             | There's also the complication that AVX512 was removed in
             | 12xxx and down, and iirc has a softlock in 11xxx.
        
           | pstuart wrote:
           | IIRC, there were issues with it causing frequency throttling
           | on Intel cpus, whereas AMD's avoid that by "double pumping".
           | It would be very interesting to compare and contrast there.
           | 
           | Seems like there's be value in it for all the new ML hotness
           | that's come about. The AMD 7950x seems like it hits the sweet
           | spot for that.
        
           | threadl0cal wrote:
           | Only one way to find out -- benchmarks, innit?
        
         | Voultapher wrote:
         | I'd like to emphasize that ipnsort is primarily designed as the
         | new Rust std library unstable sort, which means it optimizes
         | for all these factors simultaneously:
         | 
         | - input type (including stuff that is not an integer or float)
         | 
         | - input size
         | 
         | - input pattern
         | 
         | - prediction state
         | 
         | - binary size
         | 
         | - compile times
         | 
         | - varying ISAs and Hardware
         | 
         | Especially binary size and compile times mean I'm not chasing
         | the performance crown for integers in some HPC setting. I can
         | trivially pull out another 7% perf by ignoring binary size,
         | compile times and effects on types that are not integers
         | https://github.com/Voultapher/sort-research-
         | rs/commit/d908fe....
        
           | mlochbaum wrote:
           | This is a point that I'm kind of fuzzy on: is there a
           | specific requirement to not change the algorithm too much
           | based on type/comparison? Like if the user calls it on a
           | 1-byte type with default comparison, the best way to do this
           | in a vacuum is a counting sort. Smaller generated code and
           | everything. Both C/C++ and Rust developers seem unwilling to
           | do this, but not having asked I don't know why exactly.
           | 
           | On the other hand Julia just switched to radix sort much of
           | the time in their latest version, so other languages can have
           | different approaches.
        
       | Conscat wrote:
       | I wonder if AVX-512 CPU throttling is a big problem for sorting
       | algorithms based on those instructions.
        
       | mafik wrote:
       | I've read the C++ code used in the benchmark and noticed several
       | issues that would affect its performance (wrapping comparison
       | function behind multiple layers of indirection, throwing
       | exceptions (!?) from comparator, passing arguments as references
       | rather than values, not using the C++ sort "Compare" template
       | argument). Given the amount od indirection and boilerplate I'm
       | pretty skeptical of the results od this benchmark.
        
       | mlochbaum wrote:
       | Steps to build a fast, highly adaptive AVX-512 sorting algorithm
       | in C:
       | 
       | - Clone fluxsort (https://github.com/scandum/fluxsort)
       | 
       | - Replace the partitioning code in flux_default_partition and
       | flux_reverse_partition with the obvious AVX-512 version using a
       | compare and two compress instructions
       | 
       | - If you're feeling ambitious, swap out the small array sorting,
       | or incorporate crumsort's fulcrum partition for larger arrays.
       | 
       | I know why I haven't done this: my computer doesn't have AVX-512,
       | and hardly anyone else's that I know seems to. Maybe a couple Zen
       | 4 owners. I'm less clear on why the tech giants are reinventing
       | the wheel to make these sorting alrogithms that don't even handle
       | pre-sorted data rather than working with some of the very high-
       | quality open source stuff out there. Is adaptivity really
       | considered that worthless?
       | 
       | Fluxsort makes this particularly simple because it gets great
       | performance out of a stable out-of-place partition. It's a bit
       | newer; maybe the authors weren't aware of this work or started
       | before it was published. But these algorithms both use (fairly
       | difficult) in-place partitioning code; why not slot that into the
       | well-known pdqsort?
        
         | gavinray wrote:
         | If you look at the history of the author's own sorting
         | implementation (ipnsort), it started as an attempt to port
         | fluxsort to Rust:
         | 
         | https://github.com/Voultapher/Presentations/blob/master/rust...
         | 
         | https://www.youtube.com/watch?v=3JZAQ4Gsl-g
         | 
         | It starts there, eventually the author abandons the linked PR
         | because of a better approach found with ipnsort:
         | 
         | https://github.com/rust-lang/rust/pull/100856#issuecomment-1...
        
           | mlochbaum wrote:
           | I'm not talking about ipnsort, which I think is great. It
           | makes good use of existing work and doesn't use AVX at all.
           | So extending it with AVX-512 partition code would also be a
           | good project!
        
           | Voultapher wrote:
           | You are right the initial kick-off point was an attempt to
           | port fluxsort for a stable sort in Rust, but that quickly
           | turned out to be unfeasible because Rust implementations have
           | way higher requirements in terms of safety. The user can
           | modify values during a comparison operation, leave the logic
           | at any comparison point via an panic (exception) and the
           | comparison function may not be a total order. Together these
           | effects make most of the code in fluxsort useless for Rust. I
           | had been working on ipn_stable and ipn_unstable to completely
           | different implementations. ipnsort, previously ipn_unstable
           | started off with the pdqsort port that is the current Rust
           | `slice::sort_unstable` and from there on I iterated and tried
           | out different ideas.
        
             | mlochbaum wrote:
             | Well, I know what you mean but "completely different" is
             | potentially misleading here. The current ipnsort is using
             | bidirectional merges developed for quadsort (the merging
             | part of fluxsort) and the fulcrum partition from crumsort,
             | also by Scandum (all credited in comments; check the source
             | if you want to see more influences!). To balance things
             | out, the strategy for using the namesake sorting networks
             | is new to me: pick a few fixed sizes and handle the rest by
             | rounding down then a few steps of insertion sorting.
        
               | Voultapher wrote:
               | I should clarify, I meant that ipn_stable and
               | ipn_unstable were separate projects, albeit with some
               | cross-over ideas.
               | 
               | I can't rule out that someone else had that idea for
               | limiting the use of sorting-networks, all I can say is
               | that I came up with that myself. At the same time a lot
               | of the good ideas in ipnsort are ideas from other people,
               | and I try my best to accredit them. As I commented
               | before, my goal is not peak perf for integers at any
               | cost. Here is a commit
               | https://github.com/Voultapher/sort-research-
               | rs/commit/d908fe... that improves perf by 7% across sizes
               | for the random pattern. But that comes with a non
               | negligible binary size and compile time impact even for
               | integers. And while I use heuristics to only use sorting-
               | networks where sensible, they can guess wrong. This is a
               | generic sort implementation with the goal of being a good
               | all-round implementation fit for a standard library and
               | those various uses.
        
         | moonchild wrote:
         | Avx512 is also good for merging and pattern detection. Merging
         | in the case of a large input (rather than just opportunistic
         | for presorted subsequences) is arguably the most interesting
         | part--a big queueing and data movement problem.
        
       | nerpderp82 wrote:
       | The author posted here a couple hours ago,
       | https://news.ycombinator.com/item?id=36268877
        
         | Twirrim wrote:
         | Argh. Apologies to the author. I didn't see the original when I
         | looked before submitting, and I expected HN to pop up a link to
         | the existing post, like it usually does. Wasn't my intention to
         | post a dupe :(
        
         | gavinray wrote:
         | Ah that's sort of a bummer -- same title and everything
         | (initially).
         | 
         | Hopefully they will notice this thread and comment though, or
         | the threads can be merged.
        
       | Cold_Miserable wrote:
       | AVX512 is scarcely faster than radix at 1 million elements while
       | being vastly more complicated.
        
       | fooblaster wrote:
       | From what I understand, these sorts are only for direct lists of
       | primitives. For most of the sorts I see, I'm trying to sort an
       | array of structs with an embedded key. I know this organization
       | destroys the memory access patterns that make SIMD effective. Is
       | the best alternative to move the key out of the struct, and sort
       | two arrays: one of primitive keys and another of original
       | indices? This sort would still be very amenable to SIMD. It just
       | requires a final pass where you rearrange the original array with
       | the indices. Alternately, you could avoid sorting an array of
       | indices and try sorting an struct of arrays directly. I don't see
       | a lot of people handling this very frequent situation in these
       | benchmarks. Any pointers?
        
         | mlochbaum wrote:
         | It depends on the size of the structs. For struct _pointers_
         | you 're likely better off sorting keys and pointers
         | simultaneously. It doesn't matter much until you get to large
         | sizes (millions), but sorting indices and then selecting with
         | them is random access. If the original ordering is messy,
         | selecting can be slower than the sorting step. For structs a
         | few words long, the unit you're moving is a larger portion of a
         | cache line, and I'd expect the data movement to drown out any
         | SIMD advantage. A radix sort might be all right because it
         | moves less, but I'd probably go with sorting indices as the
         | first thing to try unless I knew the arrays were huge. For very
         | large structs there's an interesting effort called mountain
         | sort[0], "probably the best sorting algorithm if you need to
         | sort actual mountains by height". Given that it minimizes
         | number of moves it's ignoring cache entirely. I haven't
         | benchmarked so I can't say much about how practical it is.
         | 
         | [0] https://github.com/Morwenn/mountain-sort
        
         | BobbyJo wrote:
         | In situations where I need to sort a list of large objects, I
         | normally create an separate "OrderedView" or "Ordering" which
         | holds either two lists [comparable_value] and [index] like you
         | describe, or a single list of [comparable_value, index]
         | (depending on the types and languages the second option will be
         | faster). Then it just provides either an iterator. If you only
         | need the ordering once this saves you from going back and re-
         | ordering the original array.
         | 
         | If you need to run through the list multiple times, it probably
         | makes sense to actually go back and re-order the original so
         | that the accesses are linear memory. That is, unless the values
         | are pointers in which the indirection will likely kill the
         | benefit.
        
         | blt wrote:
         | I agree this is a super common case. I don't have a benchmark,
         | but I bet when actually moving structs the C qsort starts
         | looking a lot more reasonable.
        
       | inciampati wrote:
       | The data stops before it gets to useful sizes (10e7 and up). How
       | are people implementing sorting algorithms not routinely working
       | at 10e9-10e12 where the workload is actually a bottleneck?
        
         | henrydark wrote:
         | I run ml algorithms like boosted trees (i.e xgboost) on data
         | sets with 30k-1m rows and 200-2k columns. Sorting is the
         | bottleneck, it's what the algorithm does. I doubt I'm special,
         | and I'm sure these size are common
        
       ___________________________________________________________________
       (page generated 2023-06-10 23:00 UTC)