[HN Gopher] In-Place Parallel Super Scalar Samplesort (IPS4o)
       ___________________________________________________________________
        
       In-Place Parallel Super Scalar Samplesort (IPS4o)
        
       Author : scott_s
       Score  : 40 points
       Date   : 2021-08-18 13:23 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | scott_s wrote:
       | Academic paper: "Engineering In-place (Shared-memory) Sorting
       | Algorithms", Michael Axtmann, Sascha Witt, Daniel Ferizovic,
       | Peter Sanders, https://arxiv.org/abs/2009.13569
       | 
       | To start discussion, I think it's interesting to note how I
        | landed on this repo and paper. I got curious about Timsort, and
       | some searching on HN brought up a 2018 story, "On the Worst-Case
       | Complexity of TimSort",
       | https://news.ycombinator.com/item?id=17883461. In the comments,
       | poster lorenzhs has an excellent comment
       | (https://news.ycombinator.com/item?id=17884612) which points out
       | IPS4o as a sorting algorithm that takes modern hardware into
       | account.
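        | 
        | For anyone who just wants to try it: going by the repo's README,
        | IPS4o is a header-only C++ library used as a drop-in replacement
        | for std::sort. A minimal usage sketch (header name and
        | namespaces as in the repo, but treat the details as
        | approximate):
        | 
        |   #include <vector>
        |   #include "ips4o.hpp"  // entry-point header from the IPS4o repo
        | 
        |   int main() {
        |       std::vector<int> v = {5, 3, 1, 4, 2};
        |       ips4o::sort(v.begin(), v.end());            // sequential
        |       ips4o::parallel::sort(v.begin(), v.end());  // parallel
        |   }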
        
       | h2odragon wrote:
       | > The parallel version of IPS4o requires 16-byte atomic compare-
       | and-exchange instructions to run the fastest. Most CPUs and
       | compilers support 16-byte compare-and-exchange instructions
       | nowadays.
       | 
        | I was so thrilled when I got _4 byte_ CAS on Intel, Sparc, and
        | Alpha (all I had were uniproc Alphas, but it made the code
        | _feel_ cleaner). Then I wandered into other things and haven't
        | been keeping up with this stuff. _damn_ it's got fancy.
       | 
        | Does memory latency still suck rocks, with lots of handwaving
        | about "but we have so much _width_ now"?
        
         | chmod775 wrote:
         | > "but we have so much width now"?
         | 
         | We also have 16MB of L3 and smarter prefetching!
        
         | dragontamer wrote:
         | > Does memory latency still suck rocks with lots of handwave
         | about "but we have so much width now"?
         | 
         | Basically.
         | 
          | Intel processors can sustain something like 2 or 3
          | loads/stores per clock cycle to L1 cache at full rate, no
          | questions asked.
          | 
          | When you use AVX512 and do 512-bit (64-byte) loads/stores,
          | that's still only 2 or 3 per clock cycle. And if you use
          | typical 64-bit registers, you're _also_ limited to 2 or 3
          | loads/stores per cycle. So it's not even memory latency here;
          | it's literally the execution ports that are the holdup, and
          | wider accesses are the only way to move more bytes per cycle.
         | 
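          | To make that concrete, a minimal sketch (hypothetical
          | microbenchmark; the function names are just for illustration;
          | assumes AVX2, compile with -mavx2). Both loops are capped at
          | the same few loads per cycle, but the vector loop moves 4x the
          | bytes per load:
          | 
          |   #include <immintrin.h>
          |   #include <cstdint>
          |   #include <cstddef>
          | 
          |   // One 8-byte load per iteration.
          |   uint64_t sum_scalar(const uint64_t* a, size_t n) {
          |       uint64_t s = 0;
          |       for (size_t i = 0; i < n; ++i) s += a[i];
          |       return s;
          |   }
          | 
          |   // One 32-byte load per iteration
          |   // (requires n % 4 == 0 and a 32-byte-aligned pointer).
          |   uint64_t sum_avx2(const uint64_t* a, size_t n) {
          |       __m256i acc = _mm256_setzero_si256();
          |       for (size_t i = 0; i < n; i += 4)
          |           acc = _mm256_add_epi64(
          |               acc, _mm256_load_si256((const __m256i*)(a + i)));
          |       alignas(32) uint64_t lane[4];
          |       _mm256_store_si256((__m256i*)lane, acc);
          |       return lane[0] + lane[1] + lane[2] + lane[3];
          |   }
          | 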
         | -------
         | 
          | Now, I've never looked at AVX512 / AVX2 (256-bit) from the
          | perspective of multithreading, atomic access, or memory
          | ordering. But from a performance perspective, that load/store
          | is the full 512-bit or 256-bit width, and probably atomic in
          | practice (though the ISA doesn't guarantee atomicity at those
          | widths).
         | 
         | ------
         | 
         | I'm assuming the 16-byte atomic compare-and-exchange
         | instructions are NEON on ARM (128-bit width), and probably some
         | AVX instruction I'm not aware of on Intel/AMD (256-bit).
         | 
          | EDIT: It seems like the 128-bit instruction on x86 is
          | cmpxchg16b. The cache itself operates in units of 64-byte
          | (512-bit) cache lines, however. Intel has a TSX-NI
          | (transactional memory) extension that operates over whole
          | cache lines, making it possible to implement a hypothetical
          | cmpxchg64b, albeit over multiple instructions. It's not
          | actually AVX512- or AVX2-related at all.
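          | 
          | For reference, a minimal sketch of how you'd typically reach
          | cmpxchg16b from C++: a 16-byte std::atomic struct, compiled
          | with -mcx16 on GCC/Clang so the compare-exchange can be
          | lock-free. The TaggedPtr/publish names are just for
          | illustration. (On AArch64, by the way, the 16-byte CAS is an
          | LDXP/STXP pair or ARMv8.1's CASP, not NEON.)
          | 
          |   #include <atomic>
          |   #include <cstdint>
          | 
          |   // Hypothetical 16-byte payload: pointer plus ABA counter.
          |   struct alignas(16) TaggedPtr {
          |       uint64_t ptr;
          |       uint64_t tag;
          |   };
          | 
          |   std::atomic<TaggedPtr> slot{};
          | 
          |   bool publish(uint64_t new_ptr) {
          |       TaggedPtr expected = slot.load();
          |       TaggedPtr desired{new_ptr, expected.tag + 1};
          |       // With -mcx16 this typically lowers to lock cmpxchg16b.
          |       return slot.compare_exchange_strong(expected, desired);
          |   }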
        
       ___________________________________________________________________
       (page generated 2021-08-18 23:01 UTC)