[HN Gopher] Intel Releases x86-SIMD-sort 2.0 With Faster AVX-512...
       ___________________________________________________________________
        
       Intel Releases x86-SIMD-sort 2.0 With Faster AVX-512 Sorting, New
       Algorithms
        
       Author : ashvardanian
       Score  : 78 points
        Date   : 2023-06-23 21:48 UTC (1 day ago)
        
 (HTM) web link (www.phoronix.com)
 (TXT) w3m dump (www.phoronix.com)
        
       | tecleandor wrote:
        | What kinds of processes are heavily (or at least moderately)
        | impacted by the performance improvements of sorting algorithms?
        | 
        | Due to my lack of knowledge (and the heat of today :0 ) I can't
        | guess which algorithms or processes do lots of sorting behind
        | the scenes.
        
         | eyegor wrote:
         | Sorting is pretty common in the numerics world because a lot of
         | algorithms or techniques can be optimized heavily with sorted
         | inputs. You either get to skip steps or bisect the dataset.
         | Sort of like how most fast fft implementations will run 10-20%
         | faster if you pad vectors to reach a power of 2 length. A
         | typical "preprocess pipeline" would involve splitting vectors
         | into power of 2 sizes or padding them to maximize cache lines,
         | normalizing (and often mapping to an integer domain), and
         | sorting.
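          | 
          | As a rough sketch of the padding step (a minimal C++20
          | example; the function name is made up):
          | 
          |     #include <bit>      // std::bit_ceil (C++20)
          |     #include <vector>
          |     
          |     // Zero-pad a vector to the next power-of-2 length so an
          |     // FFT can take its fast radix-2 path.
          |     std::vector<float> pad_pow2(std::vector<float> v) {
          |         v.resize(std::bit_ceil(v.size()), 0.0f);
          |         return v;
          |     }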
        
         | tredre3 wrote:
          | I worked in embedded development, and sorting large lists of
          | files was a surprisingly big bottleneck for many of our
          | projects because of the very slow microcontrollers. Even
          | worse, we couldn't cache the results because there was no
          | memory to spare.
          | 
          | So I was tasked with improving that, and I had to write inline
          | assembly abusing specific CPU instructions that could
          | effectively do many more char comparisons per clock cycle. We
          | ended up not using it for the usual reasons (not portable,
          | hard to maintain), and our customers had to live with the
          | 20-30s delay when entering a large directory, versus the 5-6s
          | my code achieved.
         | 
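          | The general idea (a portable sketch, not the actual inline
          | assembly; it assumes names sit in buffers padded to a
          | multiple of 8 bytes) was word-at-a-time comparison:
          | 
          |     #include <cstdint>
          |     #include <cstring>
          |     
          |     // Compare the first 8 chars of two names in one shot:
          |     // XOR the 64-bit words and locate the first differing
          |     // byte (GCC/Clang builtin; little-endian layout).
          |     int cmp8(const char* a, const char* b) {
          |         std::uint64_t x, y;
          |         std::memcpy(&x, a, 8);
          |         std::memcpy(&y, b, 8);
          |         if (x == y) return 0;  // equal so far; continue in 8s
          |         unsigned i = __builtin_ctzll(x ^ y) / 8;
          |         return (unsigned char)a[i] - (unsigned char)b[i];
          |     }
          | 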
          | (Sorry, this has little to do with the topic at hand; I don't
          | really know when sorting becomes a problem on a multicore
          | 5 GHz CPU.)
        
         | djxfade wrote:
        | Databases would be my suggestion.
        
       | cientifico wrote:
        | An interesting read related to this is Linus's thoughts on
        | AVX-512 back in 2020: https://www.phoronix.com/news/Linus-
        | Torvalds-On-AVX-512
        | 
        | My conclusion (feel free to enlighten me if I am wrong) is that
        | a system will profit more from having extra cores instead of
        | AVX-512 at the same power consumption.
        
         | CyberRage wrote:
         | heavily depends on the workload
         | 
         | Some workloads can be accelerated via AVX-512 as shown here by
         | Anandtech:
         | 
         | https://www.anandtech.com/show/17601/intel-core-i9-13900k-an...
         | 
          | See how AMD CPUs with AVX-512 enabled saw a massive boost even
          | with similar or fewer cores.
          | 
          | I would agree that most typical workloads don't benefit much
          | from AVX-512; it requires software support and a good use case
          | (wide, parallel SIMD).
        
         | formerly_proven wrote:
         | "HPC doesn't exist and also FP performance is irrelevant" seems
         | a wildly out of touch take from Linus.
        
           | CyberRage wrote:
            | Honestly, HPC moved to GPUs for most of the heavy FP
            | compute.
            | 
            | For CPUs, INT perf is king, even in HPC/enterprise.
        
             | _hypx wrote:
             | That would be ironic because Linus also predicted the death
             | of discrete GPUs.
        
             | eyegor wrote:
              | Unfortunately this is not true in numerics. Lots of
              | stupidly heavy CFD/FEA-type workloads parallelize well
              | but aren't GPU accelerated. The reasons aren't clear to
              | me, but a lot of the popular solvers are CPU only and
              | involve mostly FP calcs. There are a few solvers that use
              | GPUs, but they tend to be less accurate in exchange.
        
               | uguuo_o wrote:
                | Reasons: there is a significant amount of work needed
                | to get codes to work in a distributed hybrid or GPU-
                | only fashion. It's a completely different coding
                | paradigm that needs significant study before commercial
                | entities adopt GPU use at scale. All-GPU solvers are
                | starting to be developed, such as FUN3D GPU [0], but
                | features are very limited. GPU development is starting
                | to catch up in the community, so it won't be long
                | before a significant portion can operate
                | heterogeneously or in GPU-only mode.
               | 
               | [0] https://fun3d.larc.nasa.gov/GPU_March_2021.pdf
        
         | janwas wrote:
         | (Frequent AVX-512 user, opinions are my own.)
         | 
         | It is time to stop quoting this rant, which is rather far
         | removed from actual practice and reality.
         | 
         | Specifically for sorting, Intel's sort and ours can be about
         | 10x as fast as scalar.
         | 
         | AVX-512 has high power? Great! Power is work per time. Let's
         | have more of that. It is _useful_ work per _energy spent_ that
         | we want to maximize, and there scalar code falls flat. Consider
         | that OoO scheduling/control overhead is the real culprit;
         | executing one instruction costs far more energy than the actual
         | operation. SIMD divides that fixed cost over 16 elements,
         | leading to 5-10x higher power efficiency.
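          | 
          | To make the 16-element point concrete, here is a minimal
          | AVX-512F sketch (illustrative only, not vqsort itself): one
          | instruction computes 16 int32 minima at once.
          | 
          |     #include <immintrin.h>
          |     
          |     // One AVX-512 instruction covers 16 int32 lanes, so the
          |     // fixed fetch/decode/schedule energy is paid once per 16
          |     // elements instead of once per element.
          |     void min16(const int* a, const int* b, int* out) {
          |         __m512i va = _mm512_loadu_si512(a);
          |         __m512i vb = _mm512_loadu_si512(b);
          |         _mm512_storeu_si512(out, _mm512_min_epi32(va, vb));
          |     }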
         | 
          | Top frequency reduction? Not since the first implementation on
          | Skylake, and even there a non-issue, see
          | https://github.com/google/highway/blob/master/hwy/contrib/so....
         | 
          | More cores? The shared-memory fabric is already bursting at
          | the seams (see the on-chip bottlenecks of Genoa); we are
          | already cramming in too many cores for the current memory
          | interface.
         | 
          | What would actually be necessary: more bandwidth! A64FX CPUs
          | actually beat GPUs in 2019 for supercomputing applications
          | thanks to their HBM. Intel's Sapphire Rapids Max also has
          | this. Let's build more of those, at a decent price. And start
          | using the wide vector units; they are there precisely because
          | lots of highly-clocked cores in one big NUMA domain is not
          | always the best way forward.
        
           | mlochbaum wrote:
           | Please, you are overstating your case in a way that distracts
           | from the issues at hand. I do agree that Linus's opinions are
           | irrelevant to a lot of computing, and that sorting must make
           | some use of SIMD for the best performance, and that there's
           | good work in vqsort. But the 10x factor is highly misleading.
           | Sure, it's technically correct that you "can be about 10x as
           | fast" as a scalar sort. In the same sense that a scalar sort
           | can be 10x as fast as an AVX-512 one.
           | 
           | The 10x number is from comparing against std::sort on random
           | data, yes? Both common std::sort implementations are
           | branching quicksorts from the era when introsort was still
           | considered exciting new technology, and are much slower than
           | pdqsort and other branchless quicksorts. The measurements at
            | [0] show, by my estimate, about a 3x gain against pdqsort
            | and 2x against ipnsort. That's for vqsort only; Intel's
            | barely beats ipnsort. I also pointed out to you at [1]
            | benchmarks showing 1450MB/s for 32-bit radix sort at a
            | million elements on M1, which is comparable to your Xeon
            | measurements and nearly 3x faster than the 499MB/s for M1
            | in the vqsort paper.
           | 
           | If you're going to quote a std::sort number, at least give
           | the name instead of implying it's a good representative for
           | scalar sorts! I would strongly recommend, however, that you
           | benchmark against and study more modern scalar algorithms. I
           | know SIMD and have even vectorized parts of pdqsort in the
           | past; I've chosen to work with scalar algorithms for the past
           | year or two because I think these methods still have a lot to
           | offer.
           | 
           | [0] https://github.com/Voultapher/sort-research-
           | rs/blob/main/wri...
           | 
            | [1] https://news.ycombinator.com/item?id=36309049
        
         | qayxc wrote:
         | I won't say you're wrong, but the answer might depend heavily
         | on the application.
        
         | jackmott42 wrote:
         | A lot of the downclocking issues he was talking about then are
         | less severe now on newer Intel cpus and AMD cpus, which changes
         | the calculus a lot.
         | 
         | You could probably find a workload where your conclusion is
         | correct but I think the vast majority of workloads would be
         | faster with AVX-512 if you have the time to leverage it.
        
         | gmokki wrote:
          | AMD enabled AVX-512 without increasing their power
          | consumption. To do that, their AVX-512 runs at only half the
          | max speed compared to Intel when using 512-bit registers (aka
          | max flops is the same as with AVX2 256-bit registers). Still,
          | because the new instructions in AVX-512 are beneficial in
          | many scenarios, the actual speed-up is often 10-20% in code
          | that benefits from them.
          | 
          | Then, 4 years later, AMD has the option to bump to full-speed
          | AVX-512 if it is really needed. This is the same thing they
          | did with AVX2 initially.
        
           | adrian_b wrote:
           | No, this (AMD running at half speed) is a myth. For most
           | instructions AMD runs the 512-bit instructions at exactly the
           | same speed as Intel. Both Intel and AMD execute either two
           | 512-bit instructions per cycle (most simpler instructions) or
           | one 512-bit instruction per cycle (a few of the more complex
           | instructions). The reason is that AMD executes four 256-bit
           | instructions per cycle (and Intel only three, with a fourth
           | 256-bit section that can be used only by the AVX-512
           | instructions). Both Intel and AMD pair 256-bit sections to
           | execute 512-bit operations at half rate, but equal throughput
           | (the effective throughput actually increases due to lower
           | overhead).
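            | 
            | In rough numbers, that means per-cycle 512-bit throughput
            | comes out equal (a sketch of the arithmetic from the above):
            | 
            |     Zen 4: 4 x 256-bit/cycle, paired = 2 x 512-bit/cycle
            |     Intel: 3 x 256-bit/cycle + a 4th 256-bit section
            |            reserved for AVX-512, paired = 2 x 512-bit/cycle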
           | 
            | The main exception is that some Intel Sapphire Rapids
            | models (only the most expensive of them, i.e. Xeon Platinum
            | and most of the Xeon Gold models) have a second FMA unit,
            | so only those models are able to execute two 512-bit FMA
            | instructions per cycle, while Zen 4 can execute only one
            | 512-bit FMA instruction + one 512-bit FADD instruction per
            | cycle.
           | 
           | The second advantage of Intel is that it has double bandwidth
           | towards the L1 cache, which is absolutely necessary for
           | Intel, because otherwise it would have been impossible to
           | provide the operands for two FMA instructions per cycle.
           | 
            | While the more expensive Intel models have this FMA
            | throughput advantage, there are also many AVX-512
            | instructions where AMD has lower latencies than Intel. AMD
            | also has higher clock frequencies when all the cores are
            | busy, and higher memory throughput, so the result is that
            | for many AVX-512 workloads a Zen 4 easily beats a Sapphire
            | Rapids.
           | 
           | The main applications where Intel wins are those that depend
           | mainly on FMA, i.e. linear algebra operations on dense
           | arrays, e.g. DGEMM or LINPACK.
        
       | hedora wrote:
        | Are these significantly faster than the 256-bit versions? They
        | compare to non-SIMD, but that's less interesting.
       | 
       | Also, it's unfortunate that this stuff won't run on most
       | developer machines.
        
         | jackmott42 wrote:
         | It can often be more than you would expect from the width
         | increase, as lots of masking tools and instructions allow you
         | to vectorize things more efficiently than could be done before.
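          | 
          | For example, a masked compress store does a whole quicksort
          | partition step in one instruction (a hedged sketch; the
          | function name is made up):
          | 
          |     #include <immintrin.h>
          |     
          |     // Pack all elements < pivot contiguously into out, in a
          |     // single AVX-512 instruction; returns how many were kept.
          |     int partition16(const int* in, int pivot, int* out) {
          |         __m512i v = _mm512_loadu_si512(in);
          |         __mmask16 m =
          |             _mm512_cmplt_epi32_mask(v, _mm512_set1_epi32(pivot));
          |         _mm512_mask_compressstoreu_epi32(out, m, v);
          |         return _mm_popcnt_u32(m);
          |     }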
        
         | mgaunard wrote:
         | You don't need avx512 to implement a bitonic sort. I worked on
         | several implementations with plain avx.
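          | 
          | The core compare-exchange step only needs AVX min/max (a
          | minimal sketch; a full bitonic sort adds shuffles between
          | stages):
          | 
          |     #include <immintrin.h>
          |     
          |     // One bitonic compare-exchange: vertical min/max of two
          |     // 8-lane float vectors orders 8 pairs at once.
          |     static inline void cmp_exchange(__m256& a, __m256& b) {
          |         __m256 lo = _mm256_min_ps(a, b);
          |         __m256 hi = _mm256_max_ps(a, b);
          |         a = lo;
          |         b = hi;
          |     }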
        
           | vardump wrote:
           | Or SSE2+.
        
         | janwas wrote:
         | Yes, can be about 1.6x speedup:
         | https://github.com/Voultapher/sort-research-rs/blob/main/wri...
         | (cpp_intel_avx512).
         | 
         | Disclosure: I am the main author of vqsort, which is also a
         | vectorized Quicksort but portable.
        
       ___________________________________________________________________
       (page generated 2023-06-24 23:00 UTC)