[HN Gopher] In-Place Parallel Super Scalar Samplesort (IPS4o)
___________________________________________________________________
In-Place Parallel Super Scalar Samplesort (IPS4o)
Author : scott_s
Score : 40 points
Date : 2021-08-18 13:23 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| scott_s wrote:
| Academic paper: "Engineering In-place (Shared-memory) Sorting
| Algorithms", Michael Axtmann, Sascha Witt, Daniel Ferizovic,
| Peter Sanders, https://arxiv.org/abs/2009.13569
|
| To start discussion, I think it's interesting to note how I
| landed on this repo and paper. I got curious about Tim Sort, and
| some searching on HN brought up a 2018 story, "On the Worst-Case
| Complexity of TimSort",
| https://news.ycombinator.com/item?id=17883461. In the comments,
| poster lorenzhs has an excellent comment
| (https://news.ycombinator.com/item?id=17884612) which points out
| IPS4o as a sorting algorithm that takes modern hardware into
| account.
| h2odragon wrote:
| > The parallel version of IPS4o requires 16-byte atomic compare-
| and-exchange instructions to run the fastest. Most CPUs and
| compilers support 16-byte compare-and-exchange instructions
| nowadays.
|
| I was so thrilled when I got _4-byte_ CAS on Intel, SPARC, and
| Alpha (all I had were uniproc Alphas, but it made the code _feel_
| cleaner). Then I wandered into other things and haven't been
| keeping up with this stuff. _damn_ it's got fancy.
|
| Does memory latency still suck rocks with lots of handwave about
| "but we have so much _width_ now"?
| chmod775 wrote:
| > "but we have so much width now"?
|
| We also have 16MB of L3 and smarter prefetching!
| dragontamer wrote:
| > Does memory latency still suck rocks with lots of handwave
| about "but we have so much width now"?
|
| Basically.
|
| Intel processors can sustain something like 2 or 3 loads/stores
| per clock cycle to L1 cache at full rate, no questions asked.
|
| When you use AVX512 and do 512-bit (64-byte) loads/stores,
| that's still only 2 or 3 per clock cycle. And if you use typical
| 64-bit registers, you're likewise limited to 2 or 3 loads/stores
| per cycle. It's not even memory latency here; it's literally the
| execution port that's the holdup.
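|
| A rough sketch of what that means in code (assuming an AVX-512
| capable CPU and GCC or Clang with -mavx512f; the function names
| here are just for illustration): both loops below hit the same
| loads-per-cycle issue limit, but the vector loop moves 8x the
| data per load.
|
|   #include <immintrin.h>
|   #include <cstddef>
|   #include <cstdint>
|
|   // One 8-byte load per iteration.
|   uint64_t sum_scalar(const uint64_t* a, size_t n) {
|       uint64_t s = 0;
|       for (size_t i = 0; i < n; ++i)
|           s += a[i];
|       return s;
|   }
|
|   // One 64-byte load per iteration: same load-issue limit,
|   // eight times the bytes per load.
|   uint64_t sum_avx512(const uint64_t* a, size_t n) {
|       __m512i acc = _mm512_setzero_si512();
|       size_t i = 0;
|       for (; i + 8 <= n; i += 8)
|           acc = _mm512_add_epi64(acc,
|                                  _mm512_loadu_si512(a + i));
|       uint64_t lanes[8];
|       _mm512_storeu_si512(lanes, acc);
|       uint64_t s = 0;
|       for (uint64_t v : lanes)
|           s += v;
|       for (; i < n; ++i)    // scalar tail
|           s += a[i];
|       return s;
|   }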
|
| -------
|
| Now I've never looked at AVX512 / AVX2 (256-bit) from the
| perspective of multithreaded / atomic access / memory order.
| But from a performance perspective, that load/store is the full
| 512-bit or 256-bit width and probably atomic.
|
| ------
|
| I'm assuming the 16-byte atomic compare-and-exchange
| instructions are NEON on ARM (128-bit width), and probably some
| AVX instruction I'm not aware of on Intel/AMD (also 128-bit).
|
| EDIT: It seems like the 128-bit instruction on x86 is
| cmpxchg16b. The cache itself operates on 64-byte (512-bit) cache
| lines, however. Intel has a TSX-NI (transactional memory)
| extension that operates over 64 bytes (making it possible to
| implement a cmpxchg64b, but over multiple instructions). It's
| not actually AVX512 or AVX2 related at all.
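|
| A minimal sketch of that 16-byte CAS from C++ (assuming x86-64
| with GCC or Clang and -mcx16; the Ptr128/slot/bump names are
| just for illustration). A 16-byte trivially copyable struct in
| std::atomic can compile down to cmpxchg16b, though whether it is
| actually lock-free depends on the compiler version and flags:
|
|   #include <atomic>
|   #include <cstdint>
|
|   // 16-byte value, e.g. a pointer plus an ABA-style tag.
|   struct Ptr128 {
|       uint64_t ptr;
|       uint64_t tag;
|   };
|
|   std::atomic<Ptr128> slot{Ptr128{0, 0}};
|
|   bool bump(uint64_t new_ptr) {
|       Ptr128 expected = slot.load();
|       Ptr128 desired{new_ptr, expected.tag + 1};
|       // 128-bit compare-and-exchange in one shot: cmpxchg16b
|       // on x86-64, CASP on ARMv8.1+ with LSE.
|       return slot.compare_exchange_strong(expected, desired);
|   }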
___________________________________________________________________
(page generated 2021-08-18 23:01 UTC)