[HN Gopher] Intel Releases x86-SIMD-sort 2.0 With Faster AVX-512...
___________________________________________________________________
Intel Releases x86-SIMD-sort 2.0 With Faster AVX-512 Sorting, New
Algorithms
Author : ashvardanian
Score : 78 points
Date : 2023-06-23 21:48 UTC (1 day ago)
(HTM) web link (www.phoronix.com)
(TXT) w3m dump (www.phoronix.com)
| tecleandor wrote:
| What kind of processes are heavily (or, at least, moderately)
| impacted by the performance improvement of sorting algorithms?
|
| Given my lack of knowledge (and the heat of today :0 ) I can't
| guess which algorithms or processes do lots of sorting behind
| the scenes.
| eyegor wrote:
| Sorting is pretty common in the numerics world because a lot of
| algorithms or techniques can be optimized heavily with sorted
| inputs. You either get to skip steps or bisect the dataset.
| Sort of like how most fast FFT implementations run 10-20%
| faster if you pad vectors to a power-of-2 length. A typical
| "preprocess pipeline" would split vectors into power-of-2
| sizes or pad them to fill cache lines, normalize (often
| mapping to an integer domain), and sort.
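|
| (For reference, that padding step is tiny in modern C++; a
| minimal sketch, assuming C++20 for std::bit_ceil:)
|
|     #include <bit>     // std::bit_ceil (C++20)
|     #include <vector>
|
|     // Pad a vector to the next power-of-2 length, e.g. before
|     // an FFT; new elements take the value `fill`.
|     void pad_pow2(std::vector<float>& v, float fill = 0.0f) {
|         v.resize(std::bit_ceil(v.size()), fill);
|     }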
| tredre3 wrote:
| I worked in embedded development, and sorting large lists of
| files was a surprisingly big bottleneck in many of our
| projects because of the very slow microcontrollers. Even
| worse, we couldn't cache the results because there was no
| memory to spare.
|
| So I was tasked with improving that, and I had to write inline
| assembly to abuse specific CPU instructions that could do many
| more character comparisons per clock cycle. We ended up not
| using it for the usual reasons (not portable, hard to
| maintain), and our customers had to live with the 20-30s delay
| when entering a large directory, versus the 5-6s my code
| achieved.
|
| (Sorry this has little to do with the topic at hand, I don't
| really know when sorting becomes a problem on a multicore 5ghz
| cpu)
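|
| (For anyone curious what that trick looks like with portable
| intrinsics instead of inline assembly, here is a minimal SSE2
| sketch. It is my own illustration, not the code from that
| project: compare 16 bytes per instruction, then scan the
| resulting mask.)
|
|     #include <emmintrin.h>  // SSE2 intrinsics
|     #include <cstring>      // std::memcmp, std::size_t
|
|     // memcmp-style comparison, 16 bytes per iteration. A real
|     // implementation must also mind page-boundary safety.
|     int cmp16(const char* a, const char* b, std::size_t n) {
|         std::size_t i = 0;
|         for (; i + 16 <= n; i += 16) {
|             __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
|             __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
|             // Bit k of `mask` is set iff the bytes at offset k match.
|             int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
|             if (mask != 0xFFFF) {
|                 int j = __builtin_ctz(~mask);  // first mismatch (GCC/Clang)
|                 return (unsigned char)a[i + j] - (unsigned char)b[i + j];
|             }
|         }
|         return std::memcmp(a + i, b + i, n - i);  // scalar tail
|     }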
| djxfade wrote:
| Databases would be my suggestion.
| cientifico wrote:
| An interesting read related to this is Linus's take on AVX-512
| back in 2020:
| https://www.phoronix.com/news/Linus-Torvalds-On-AVX-512
|
| My conclusion (feel free to enlighten me if I am wrong) is
| that a system profits more from extra cores than from AVX-512
| at the same power consumption.
| CyberRage wrote:
| Heavily depends on the workload.
|
| Some workloads can be accelerated via AVX-512 as shown here by
| Anandtech:
|
| https://www.anandtech.com/show/17601/intel-core-i9-13900k-an...
|
| See how the AMD CPUs with AVX-512 enabled get a massive boost
| even with similar or fewer cores.
|
| I would agree that most typical workloads don't benefit much
| from AVX-512; it requires software support and a good use case
| (wide, parallel SIMD).
| formerly_proven wrote:
| "HPC doesn't exist and also FP performance is irrelevant" seems
| a wildly out of touch take from Linus.
| CyberRage wrote:
| Honestly, HPC has moved to GPUs for most of the heavy FP
| compute.
|
| For CPUs, INT perf is king, even in HPC/enterprise.
| _hypx wrote:
| That would be ironic because Linus also predicted the death
| of discrete GPUs.
| eyegor wrote:
| Unfortunately this is not true in numerics. Lots of stupidly
| heavy CFD/FEA-type workloads parallelize well but aren't GPU
| accelerated. The reasons aren't clear to me, but a lot of the
| popular solvers are CPU-only and involve mostly FP calcs.
| There are a few solvers that use GPUs, but they tend to be
| less accurate in exchange.
| uguuo_o wrote:
| Reasons: there is a significant amount of work needed to get
| codes working in a distributed-hybrid or GPU-only fashion.
| It's a completely different coding paradigm that needs
| significant study before commercial entities adopt GPU use at
| scale. All-GPU solvers are starting to be developed, such as
| FUN3D GPU [0], but features are very limited. GPU development
| is catching up in the community, so it won't be long before a
| significant portion can operate heterogeneously or in GPU-only
| mode.
|
| [0] https://fun3d.larc.nasa.gov/GPU_March_2021.pdf
| janwas wrote:
| (Frequent AVX-512 user, opinions are my own.)
|
| It is time to stop quoting this rant, which is rather far
| removed from actual practice and reality.
|
| Specifically for sorting, Intel's sort and ours can be about
| 10x as fast as scalar.
|
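| For anyone who wants to try: both libraries expose a one-call
| API. A minimal sketch from memory (treat the exact header path
| and signature as assumptions; check the repo's README):
|
|     #include <random>
|     #include <vector>
|     #include "hwy/contrib/sort/vqsort.h"  // Highway's vqsort
|
|     int main() {
|         std::vector<float> keys(1 << 20);
|         std::mt19937 rng(42);
|         std::uniform_real_distribution<float> dist(0.f, 1.f);
|         for (float& k : keys) k = dist(rng);
|         // Sorts in place, dispatching to the best SIMD level.
|         hwy::VQSort(keys.data(), keys.size(),
|                     hwy::SortAscending());
|     }
|
| (x86-simd-sort is similarly one call, avx512_qsort<T>(ptr, n),
| if I remember right.)
|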
| AVX-512 has high power? Great! Power is work per time. Let's
| have more of that. It is _useful_ work per _energy spent_ that
| we want to maximize, and there scalar code falls flat. Consider
| that OoO scheduling/control overhead is the real culprit;
| executing one instruction costs far more energy than the actual
| operation. SIMD divides that fixed cost over 16 elements,
| leading to 5-10x higher power efficiency.
|
| Top frequency reduction? Not since the first implementation on
| Skylake, and even there a non-issue, see
| https://github.com/google/highway/blob/master/hwy/contrib/so...
|
| More cores? The shared-memory fabric is already bursting at
| the seams (see the on-chip bottlenecks of Genoa); we are
| already cramming in too many cores for the current memory
| interface.
|
| What we actually need: more bandwidth! A64FX CPUs actually
| beat GPUs in 2019 for supercomputing applications thanks to
| their HBM. Intel's Sapphire Rapids Max also has this. Let's
| build more of those, at a decent price. And start using the
| wide vector units; they are there precisely because lots of
| highly clocked cores in one big NUMA domain are not always
| the best way forward.
| mlochbaum wrote:
| Please, you are overstating your case in a way that distracts
| from the issues at hand. I do agree that Linus's opinions are
| irrelevant to a lot of computing, and that sorting must make
| some use of SIMD for the best performance, and that there's
| good work in vqsort. But the 10x factor is highly misleading.
| Sure, it's technically correct that you "can be about 10x as
| fast" as a scalar sort. In the same sense that a scalar sort
| can be 10x as fast as an AVX-512 one.
|
| The 10x number is from comparing against std::sort on random
| data, yes? Both common std::sort implementations are
| branching quicksorts from the era when introsort was still
| considered exciting new technology, and are much slower than
| pdqsort and other branchless quicksorts. The measurements at
| [0] show, by my estimate, about a 3x gain over pdqsort and 2x
| over ipnsort, and that is for vqsort only; Intel's sort barely
| beats ipnsort. I also pointed out to you at [1] benchmarks
| showing
| 1450MB/s for 32-bit radix sort at a million elements on M1,
| which is comparable to your Xeon measurements and nearly 3x
| faster than the 499MB/s for M1 in the vqsort paper.
|
| If you're going to quote a std::sort number, at least name the
| implementation instead of implying it's a good representative
| of scalar sorts! I would strongly recommend, however, that you
| benchmark against and study more modern scalar algorithms. I
| know SIMD and have even vectorized parts of pdqsort in the
| past; I've chosen to work with scalar algorithms for the past
| year or two because I think these methods still have a lot to
| offer.
|
| [0] https://github.com/Voultapher/sort-research-rs/blob/main/wri...
|
| [1] https://news.ycombinator.com/item?id=36309049
| HuhWhatMeansYou wrote:
| [dead]
| qayxc wrote:
| I won't say you're wrong, but the answer might depend heavily
| on the application.
| jackmott42 wrote:
| A lot of the downclocking issues he was talking about then are
| less severe on newer Intel and AMD CPUs, which changes the
| calculus a lot.
|
| You could probably find a workload where your conclusion is
| correct but I think the vast majority of workloads would be
| faster with AVX-512 if you have the time to leverage it.
| gmokki wrote:
| AMD enabled AVX-512 without increasing their power
| consumption. To do that, their AVX-512 runs at only half the
| rate of Intel's when using 512-bit registers (i.e., max FLOPS
| is the same as with AVX2 256-bit registers). Still, because
| the new instructions in AVX-512 are beneficial in many
| scenarios, the actual speedup is often 10-20% in code that
| uses them.
|
| Then, 4 years later, AMD has the option to bump to full-speed
| AVX-512 if it is really needed. This is the same thing they
| did with AVX2 initially.
| adrian_b wrote:
| No, this (AMD running at half speed) is a myth. For most
| instructions AMD runs the 512-bit instructions at exactly the
| same speed as Intel. Both Intel and AMD execute either two
| 512-bit instructions per cycle (most simpler instructions) or
| one 512-bit instruction per cycle (a few of the more complex
| instructions). The reason is that AMD executes four 256-bit
| instructions per cycle (and Intel only three, with a fourth
| 256-bit section that can be used only by the AVX-512
| instructions). Both Intel and AMD pair 256-bit sections to
| execute 512-bit operations at half the instruction rate but
| equal data throughput (the effective throughput actually
| increases due to lower per-instruction overhead).
|
| The main exception is that some of the models of Intel
| Sapphire Rapids, only the most expensive of them, i.e. Xeon
| Platinum and most of the Xeon Gold models, have a second FMA
| unit, so only these models are able to execute two 512-bit
| FMA instructions per cycle, while Zen 4 can execute only one
| 512-bit FMA instruction + one 512-bit FADD instruction per
| cycle.
|
| The second advantage of Intel is that it has double bandwidth
| towards the L1 cache, which is absolutely necessary for
| Intel, because otherwise it would have been impossible to
| provide the operands for two FMA instructions per cycle.
|
| While the more expensive Intel models have this FMA
| throughput advantage, there are also many AVX-512
| instructions where AMD has lower latencies than Intel. AMD
| also has higher clock frequencies when all the cores are
| busy, and higher memory throughput, so the result is that
| for many AVX-512 workloads a Zen 4 easily beats a Sapphire
| Rapids.
|
| The main applications where Intel wins are those that depend
| mainly on FMA, i.e. linear algebra operations on dense
| arrays, e.g. DGEMM or LINPACK.
| hedora wrote:
| Are these significantly faster than the 256-bit versions? They
| compare to non-SIMD, but that's less interesting.
|
| Also, it's unfortunate that this stuff won't run on most
| developer machines.
| jackmott42 wrote:
| It can often be more than you would expect from the width
| increase alone, as the many masking tools and instructions let
| you vectorize things more efficiently than was possible
| before.
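|
| As a concrete example, compress-store makes a quicksort
| partition step branchless. A minimal sketch assuming AVX-512F
| (my own illustration, not Intel's actual code):
|
|     #include <immintrin.h>
|
|     // Partition one 16-lane block around `pivot`: lanes below
|     // the pivot are packed to the left cursor, the rest to the
|     // right cursor (which grows downward). A full sort also
|     // handles tails and works in place.
|     void partition_block(const int* in, int pivot,
|                          int*& left, int*& right) {
|         __m512i v = _mm512_loadu_si512(in);
|         __mmask16 lt =
|             _mm512_cmplt_epi32_mask(v, _mm512_set1_epi32(pivot));
|         int n_lt = _mm_popcnt_u32(lt);
|         _mm512_mask_compressstoreu_epi32(left, lt, v);
|         left += n_lt;
|         right -= 16 - n_lt;
|         _mm512_mask_compressstoreu_epi32(right, _mm512_knot(lt), v);
|     }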
| [deleted]
| mgaunard wrote:
| You don't need AVX-512 to implement a bitonic sort. I worked
| on several implementations with plain AVX.
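|
| The building block is just vertical min/max, e.g. this
| five-comparator network that sorts four values per lane across
| four registers (a sketch of mine, not from any particular
| implementation; it needs nothing newer than SSE):
|
|     #include <xmmintrin.h>  // SSE: _mm_min_ps / _mm_max_ps
|
|     // Lane-wise sorting network: afterwards, for every lane i,
|     // r0[i] <= r1[i] <= r2[i] <= r3[i].
|     static inline void sort4_columns(__m128& r0, __m128& r1,
|                                      __m128& r2, __m128& r3) {
|         __m128 t;
|         t = _mm_min_ps(r0, r1); r1 = _mm_max_ps(r0, r1); r0 = t;
|         t = _mm_min_ps(r2, r3); r3 = _mm_max_ps(r2, r3); r2 = t;
|         t = _mm_min_ps(r0, r2); r2 = _mm_max_ps(r0, r2); r0 = t;
|         t = _mm_min_ps(r1, r3); r3 = _mm_max_ps(r1, r3); r1 = t;
|         t = _mm_min_ps(r1, r2); r2 = _mm_max_ps(r1, r2); r1 = t;
|     }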
| vardump wrote:
| Or SSE2+.
| janwas wrote:
| Yes, it can be about a 1.6x speedup:
| https://github.com/Voultapher/sort-research-rs/blob/main/wri...
| (cpp_intel_avx512).
|
| Disclosure: I am the main author of vqsort, which is also a
| vectorized Quicksort but portable.
___________________________________________________________________
(page generated 2023-06-24 23:00 UTC)