[HN Gopher] Sorting with SIMD
___________________________________________________________________
Sorting with SIMD
Author : lukastyrychtr
Score : 117 points
Date : 2022-12-17 16:59 UTC (6 hours ago)
(HTM) web link (tweedegolf.nl)
(TXT) w3m dump (tweedegolf.nl)
| anovikov wrote:
| SIMD is a fantastic tool. My first try with AVX2 (256-bit),
| building an "overlay with transparency" for two YUVA420P frames,
| yielded speeds better than 1 pixel per CPU cycle! And I didn't
| even try optimising it all that much.
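|
| For anyone curious what that kind of kernel looks like, here is a
| minimal sketch of a constant-alpha blend over a single 8-bit
| plane with AVX2 intrinsics in Rust (my own illustration, not the
| code described above; dividing by 256 instead of 255 is an
| approximation, and an x86-64 target is assumed):
|
|     use std::arch::x86_64::*;
|
|     /// dst[i] = (src[i]*alpha + dst[i]*(255 - alpha)) / 256,
|     /// 16 pixels per iteration.
|     #[target_feature(enable = "avx2")]
|     unsafe fn blend_row(src: &[u8], dst: &mut [u8], alpha: u8) {
|         let a = _mm256_set1_epi16(alpha as i16);
|         let inv_a = _mm256_set1_epi16(255 - alpha as i16);
|         let n = src.len().min(dst.len()) / 16 * 16;
|         for i in (0..n).step_by(16) {
|             // widen 16 u8 pixels to 16-bit lanes
|             let s = _mm256_cvtepu8_epi16(
|                 _mm_loadu_si128(src.as_ptr().add(i) as *const __m128i));
|             let d = _mm256_cvtepu8_epi16(
|                 _mm_loadu_si128(dst.as_ptr().add(i) as *const __m128i));
|             // mix = (s*a + d*(255 - a)) >> 8
|             let mix = _mm256_add_epi16(_mm256_mullo_epi16(s, a),
|                                        _mm256_mullo_epi16(d, inv_a));
|             let mix = _mm256_srli_epi16::<8>(mix);
|             // narrow back to 16 u8 pixels and store
|             let lo = _mm256_castsi256_si128(mix);
|             let hi = _mm256_extracti128_si256::<1>(mix);
|             _mm_storeu_si128(dst.as_mut_ptr().add(i) as *mut __m128i,
|                              _mm_packus_epi16(lo, hi));
|         }
|     }
|
|     fn main() {
|         if is_x86_feature_detected!("avx2") {
|             let src = [200u8; 32];
|             let mut dst = [50u8; 32];
|             unsafe { blend_row(&src, &mut dst, 128) };
|             println!("{:?}", &dst[..4]); // roughly midway: [124, ...]
|         }
|     }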
| the__alchemist wrote:
| When would you use SIMD versus a GPU (e.g. a Vulkan compute
| shader)? Is it that the CPU is easier to write for, and you bring
| in a GPU shader only when doing massively parallel ops rather
| than just a few? I've skimmed a Rust lib that uses CPU SIMD
| (GLAM), and it used a different syntax from normal Rust, so I'm
| not positive it would be easier even if you're familiar with the
| GPU process.
| fathyb wrote:
| It depends on how much data there is, and on where and when it
| needs to be sent. Most discrete GPUs cannot access CPU memory
| directly, so you need to copy it over the PCIe bus, which can be
| slow.
|
| If you have small chunks of data (mining), or your destination
| is the screen (video games), it might make sense to use the GPU.
| If you need high throughput, low latency, or your destination is
| something like a sound card (a DAW), then SIMD might be a better
| choice.
|
| Fast integrated GPUs like Apple's can access main memory
| directly without a copy, making the GPU more viable for
| general-purpose work.
| anonymoushn wrote:
| This is a good explanation. Writing the masks in big-endian
| somewhat obscures things.
|
| As an aside, while recent developments in quicksort are quite
| good, it seems like MSB radix sort performs comparably to (or
| slightly better than!) these algorithms on random inputs.
| bee_rider wrote:
| Random inputs for which you don't know the bounds?
| anonymoushn wrote:
| Yes. If the input is bounded in a narrow range compared to
| the length of the input, it seems like it should actually be
| a bit slower than normal, since the first few passes may do
| nothing and the later passes will need to run over more of
| the input than normal. We're using it for arrays of doubles
| with fairly small exponents, though; that gives us lots of
| empty buckets in the first pass, and it performs well enough.
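|
| To make the "small exponents means empty buckets" point concrete
| (a throwaway illustration, ignoring the sign-handling transform a
| real float radix sort needs): the most significant byte of a
| double is the sign bit plus 7 exponent bits, so everyday
| magnitudes all land in a couple of first-pass buckets.
|
|     fn main() {
|         for x in [0.001f64, 0.5, 1.0, 3.25, 1000.0] {
|             // top byte of the IEEE 754 bit pattern
|             println!("{x:>8}: {:#04x}", (x.to_bits() >> 56) as u8);
|         }
|         // prints 0x3f or 0x40 for every value above, so most of
|         // the 256 first-pass buckets stay empty
|     }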
| bee_rider wrote:
| Cool! I should probably not have commented, I haven't
| really kept up with advances in sorting, haha. I assumed it
| was basically done to death 20 years ago. This will spur
| some reading I think! Thanks.
| anonymoushn wrote:
| Please do comment in situations like this.
|
| It's tough to find a good overview of this topic that
| includes recent work or that is aimed at empirical
| performance. That's part of why TFA is great!
|
| A version of MSB radix sort for sorting ints, doubles, or
| floats, or sorting structs by a single key of those types
| is here[0] and a version for sorting strings is here[1].
| These are based on the MSB radix sort from this
| repository[2] which is associated with a technical report
| about parallel string sorts. I was only interested in
| sequential sorts, but this repository turned out to be a
| great resource anyway.
|
| [0]: https://github.com/alichraghi/zort/blob/main/src/radix.zig
|
| [1]: https://github.com/dendibakh/perf-challenge6/blob/Solution_R...
|
| [2]: https://github.com/bingmann/parallel-string-sorting
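|
| For reference, here is what the core of a sequential MSB radix
| sort can look like for u32 keys. This is only a sketch of the
| idea, not code from the repositories above (those also handle
| floats, doubles, and strings and have tuned fallbacks):
|
|     fn msb_radix_sort(v: &mut [u32]) {
|         msb_radix_level(v, 24);
|     }
|
|     fn msb_radix_level(v: &mut [u32], shift: u32) {
|         if v.len() <= 32 {
|             v.sort_unstable(); // small buckets: comparison sort
|             return;
|         }
|         // histogram of the current byte
|         let mut counts = [0usize; 256];
|         for &x in v.iter() {
|             counts[((x >> shift) & 0xff) as usize] += 1;
|         }
|         // prefix sums give each bucket's start offset
|         let mut starts = [0usize; 256];
|         let mut sum = 0;
|         for b in 0..256 {
|             starts[b] = sum;
|             sum += counts[b];
|         }
|         // in-place permutation ("American flag sort" style)
|         let mut next = starts;
|         for b in 0..256 {
|             while next[b] < starts[b] + counts[b] {
|                 let dest = ((v[next[b]] >> shift) & 0xff) as usize;
|                 if dest == b {
|                     next[b] += 1;
|                 } else {
|                     v.swap(next[b], next[dest]);
|                     next[dest] += 1;
|                 }
|             }
|         }
|         // recurse into each bucket on the next byte
|         if shift > 0 {
|             for b in 0..256 {
|                 let (s, e) = (starts[b], starts[b] + counts[b]);
|                 msb_radix_level(&mut v[s..e], shift - 8);
|             }
|         }
|     }
|
|     fn main() {
|         let mut v: Vec<u32> = (0..1000).rev().collect();
|         msb_radix_sort(&mut v);
|         assert!(v.windows(2).all(|w| w[0] <= w[1]));
|     }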
| rikschennink wrote:
| This is why, with redact.photo, I randomly shuffle the pixels
| before blurring: it looks like you could reverse engineer it,
| but you'd just get scrambled pixels.
| boberoni wrote:
| I thought that blurring a photo would set each pixel to the
| average color value of all its neighbors. Since information is
| lost, how could you reverse engineer a blurred photo?
| snovv_crash wrote:
| If there is a limited number of input permutations, you can test
| all of them to see which is plausible.
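|
| A toy sketch of that idea (names and data entirely made up, and
| the "pixelation" is just an average): pixelate each known
| candidate block yourself and keep whichever best matches the
| value you can see in the redacted image.
|
|     // "Pixelate" a block the way a mosaic filter does: average it.
|     fn pixelate(block: &[u8]) -> u8 {
|         (block.iter().map(|&p| p as u32).sum::<u32>()
|             / block.len() as u32) as u8
|     }
|
|     // Pick the candidate whose pixelated value best matches the
|     // observed one.
|     fn best_candidate(observed: u8, candidates: &[&[u8]]) -> usize {
|         candidates
|             .iter()
|             .enumerate()
|             .min_by_key(|&(_, c)| (pixelate(c) as i32 - observed as i32).abs())
|             .map(|(i, _)| i)
|             .unwrap()
|     }
|
|     fn main() {
|         let glyph_a: &[u8] = &[0, 255, 255, 0];   // averages to 127
|         let glyph_b: &[u8] = &[255, 255, 255, 0]; // averages to 191
|         let observed = 190;
|         assert_eq!(best_candidate(observed, &[glyph_a, glyph_b]), 1);
|     }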
| lucb1e wrote:
| The threads are right next to each other, might you have meant
| to reply here? https://news.ycombinator.com/item?id=34031568
| "Unredacter: Never use pixelation [for] redaction"
| fearthetelomere wrote:
| >You should probably not use inline assembly in production code
|
| What are the alternatives here? Write the assembly in a separate
| file and call it through an FFI?
| corysama wrote:
| Popular compilers support popular SIMD architectures through
| "intrinsic" functions. They look and act like regular functions,
| but they are built into the compiler and usually compile down to
| a single specific assembly instruction. _mm_set_epi32, used in
| the article, is such an intrinsic: the compiler lowers it
| directly to SSE instructions rather than emitting a function
| call.
|
| This is in sharp contrast to inline assembly, into which the
| compiler has practically zero visibility. Inline assembly can't
| be scheduled together with surrounding work by the compiler, and
| the compiler has to fall back to the super-conservative
| assumption that the inline assembly might have done
| god-knows-what behind its back.
|
| AFAICT, the last holdouts for hand-written assembly are people
| working on media codecs. Even AAA game engines use intrinsic
| functions rarely and assembly nearly never.
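|
| For example, the min/max compare-exchange at the heart of a SIMD
| sorting network can be written with intrinsics like this (Rust
| std::arch, x86-64 assumed; an illustration, not the article's
| exact code):
|
|     use std::arch::x86_64::*;
|
|     // One compare-exchange step of a sorting network on four i32
|     // lanes. The compiler can inline this and schedule it freely
|     // alongside the surrounding code.
|     #[target_feature(enable = "sse4.1")]
|     unsafe fn compare_exchange(a: __m128i, b: __m128i) -> (__m128i, __m128i) {
|         (_mm_min_epi32(a, b), _mm_max_epi32(a, b)) // pminsd / pmaxsd
|     }
|
|     fn main() {
|         if is_x86_feature_detected!("sse4.1") {
|             unsafe {
|                 let a = _mm_set_epi32(3, 7, 1, 9);
|                 let b = _mm_set_epi32(4, 2, 8, 5);
|                 let (lo, hi) = compare_exchange(a, b);
|                 let mut out = [0i32; 4];
|                 _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, lo);
|                 println!("lane minima: {out:?}"); // [5, 1, 2, 3]
|                 _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, hi);
|                 println!("lane maxima: {out:?}"); // [9, 8, 7, 4]
|             }
|         }
|     }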
| girvo wrote:
| Isn't the reason they had to use inline assembly there
| because the compiler they're using doesn't have that
| particular instruction bound as an intrinsic?
|
| What do you do in that case? I'm genuinely curious, as it's
| something I've run up against: there are no intrinsics for the
| vector extensions of the LX7 processor in the ESP32-S3.
| jvanderbot wrote:
| There are SIMD abstraction libraries floating around, and many
| so-called "math" libraries will use SIMD instructions to speed
| things up, I believe. So the work is to cast the problem into the
| language of the library (or libraries) and do some profiling.
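|
| For instance, using the glam crate mentioned upthread as the
| abstraction (a sketch that assumes glam as a dependency; as far
| as I know its f32 vector types are SSE2-backed on x86, so the
| component-wise min/max below becomes minps/maxps):
|
|     use glam::Vec4;
|
|     // One compare-exchange step expressed through a math
|     // library's component-wise min/max instead of raw intrinsics.
|     fn compare_exchange(a: Vec4, b: Vec4) -> (Vec4, Vec4) {
|         (a.min(b), a.max(b))
|     }
|
|     fn main() {
|         let a = Vec4::new(9.0, 1.0, 7.0, 3.0);
|         let b = Vec4::new(5.0, 8.0, 2.0, 4.0);
|         let (lo, hi) = compare_exchange(a, b);
|         assert_eq!(lo, Vec4::new(5.0, 1.0, 2.0, 3.0));
|         assert_eq!(hi, Vec4::new(9.0, 8.0, 7.0, 4.0));
|     }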
| Dwedit wrote:
| If you are exclusively moving numbers around in an array, SIMD
| sorting sounds great.
|
| But when you want to actually sort things, it's different. You
| have objects in an array, and each object has one member, a key,
| that you want to sort the objects by.
|
| In order to create a sorted permutation of the list, you either
| rearrange the list of object pointers (fast), rearrange the
| memory of the objects themselves (slow), or simply output a list
| of array indexes that represents what the sort order should be
| (fast).
|
| Code that doesn't physically move the object memory around
| creates indirection. Indirection makes any SIMD sort depend on
| scatter/gather to get data in, and scatter/gather causes random
| memory accesses, which don't cache as well as sequential access.
| ww520 wrote:
| It might be easier for SIMD to work with if you extract the key
| value from each object into its own array.
|
| Form an array with a fixed element size: [ {index0 | key0},
| {index1 | key1}, ...], where each index is the element's index
| into the original array of objects and the keys are fixed size.
| The fat element {index | key} is moved as a whole unit during
| sorting, so each swap carries the original index along. SIMD
| deals with fixed-size units very well.
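|
| A sketch of that for u32 keys: pack the key into the high half of
| a u64 and the original index into the low half, so any sort over
| the packed values (SIMD or not) orders by key and carries the
| index along for free. This assumes fewer than 2^32 elements.
|
|     fn sort_order_by_packed_key(keys: &[u32]) -> Vec<u32> {
|         // {key | index} packed into one fixed-size unit,
|         // key in the high bits so it dominates the comparison
|         let mut packed: Vec<u64> = keys
|             .iter()
|             .enumerate()
|             .map(|(i, &k)| ((k as u64) << 32) | i as u64)
|             .collect();
|         packed.sort_unstable();
|         // the low halves now spell out the sorted order
|         packed.iter().map(|&p| p as u32).collect()
|     }
|
|     fn main() {
|         let keys = [42u32, 7, 19];
|         assert_eq!(sort_order_by_packed_key(&keys), vec![1, 2, 0]);
|     }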
| dan-robertson wrote:
| Yeah, I find the thing you generally want, but which many
| languages don't support well, is a struct-of-arrays or data-frame
| layout rather than an array of objects. Then you can do one of:
|
| Sort one array and rearrange other arrays the same way
|
| Compute an array of indices into another array such that
| array[index[i]] < array[index[i+1]], i.e. the sorted version of
| the array is then [array[i] for i in index]. If you have a
| vectorised index operation (e.g. APL, R, ...), getting the sorted
| array is one line of code (sketched below).
|
| I think with SIMD you'd want to sort your array and
| simultaneously permute either a second array or the array of
| indices 1, 2, ..., n, so you can use that to access the
| corresponding elements in your other associated columns.
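|
| A plain (non-SIMD) Rust sketch of the struct-of-arrays version,
| just to make the index trick concrete:
|
|     fn main() {
|         // struct-of-arrays: a column of keys, a column of values
|         let keys = vec![30u32, 10, 20];
|         let names = vec!["c", "a", "b"];
|
|         // index such that keys[index[i]] <= keys[index[i + 1]]
|         let mut index: Vec<usize> = (0..keys.len()).collect();
|         index.sort_by_key(|&i| keys[i]);
|
|         // apply the same permutation to every column
|         let sorted_keys: Vec<u32> = index.iter().map(|&i| keys[i]).collect();
|         let sorted_names: Vec<&str> = index.iter().map(|&i| names[i]).collect();
|
|         assert_eq!(sorted_keys, vec![10, 20, 30]);
|         assert_eq!(sorted_names, vec!["a", "b", "c"]);
|     }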
| xphos wrote:
| This is a well-written SIMD example. I find these are the hardest
| topics to explain clearly because of how much is happening.
| Anyone got other nice links? I've been doing a lot of vector
| processing, so I need to learn!
| gww wrote:
| Someone previously shared this YouTube playlist on HN, and I
| have found it to be very helpful [1].
|
| [1]
| https://www.youtube.com/playlist?list=PLYCMvilhmuPEM8DUvY6Wg...
| wolf550e wrote:
| 128-bit SIMD on x86 is called SSE, not AVX. AVX is 256-bit. SSE
| (first introduced with the Pentium III) had many extensions that
| added instructions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and some
| others.
| cpgxiii wrote:
| That's not entirely true. AVX has contained 128-bit instructions
| from the beginning (i.e. since Sandy Bridge, well before AVX2
| arrived with Haswell), and those are what the article uses (e.g.
| vpermilps).
| adrian_b wrote:
| For most 128-bit SSE instructions there are equivalent 128-bit
| AVX instructions, which differ by allowing 3 register operands
| instead of 2 and by not causing execution stalls when mixed with
| 256-bit AVX instructions.
|
| For most AVX instructions, you may choose between 128-bit and
| 256-bit instructions, which is especially useful on those CPUs
| where 256-bit instructions cause down-clocking.
|
| There are also 128-bit AVX-512 instructions and 256-bit AVX-512
| instructions.
|
| So the same SIMD algorithm may be implemented for Intel CPUs
| using 128-bit SSE, 128-bit AVX, 128-bit AVX-512, 256-bit AVX,
| 256-bit AVX-512, or 512-bit AVX-512 instructions.
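|
| As a concrete illustration in Rust (the exact instruction
| selection is up to the compiler, but this is the usual outcome):
| the same 128-bit intrinsic is emitted either as the legacy SSE
| encoding or as the 3-operand VEX (AVX) encoding, depending on
| which target features are enabled.
|
|     use std::arch::x86_64::*;
|
|     // Typically compiles to `addps xmm0, xmm1` (SSE encoding,
|     // 2 operands, destination overwrites a source).
|     #[target_feature(enable = "sse2")]
|     unsafe fn add_sse(a: __m128, b: __m128) -> __m128 {
|         _mm_add_ps(a, b)
|     }
|
|     // With AVX enabled, the very same intrinsic is emitted as
|     // `vaddps xmm0, xmm0, xmm1` (VEX encoding, 3 operands,
|     // non-destructive).
|     #[target_feature(enable = "avx")]
|     unsafe fn add_avx(a: __m128, b: __m128) -> __m128 {
|         _mm_add_ps(a, b)
|     }
|
|     fn main() {
|         unsafe {
|             let a = _mm_set_ps(1.0, 2.0, 3.0, 4.0);
|             let b = _mm_set_ps(10.0, 20.0, 30.0, 40.0);
|             let mut out = [0.0f32; 4];
|             // SSE2 is baseline on x86-64, so this is always fine
|             _mm_storeu_ps(out.as_mut_ptr(), add_sse(a, b));
|             if is_x86_feature_detected!("avx") {
|                 _mm_storeu_ps(out.as_mut_ptr(), add_avx(a, b));
|             }
|             println!("{out:?}"); // [44.0, 33.0, 22.0, 11.0]
|         }
|     }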
| [deleted]
___________________________________________________________________
(page generated 2022-12-17 23:00 UTC)