[HN Gopher] AVX512/VBMI2: A Programmer's Perspective
___________________________________________________________________
AVX512/VBMI2: A Programmer's Perspective
Author : ingve
Score : 87 points
Date : 2021-08-14 08:42 UTC (14 hours ago)
(HTM) web link (www.singlestore.com)
(TXT) w3m dump (www.singlestore.com)
| 37ef_ced3 wrote:
| I work with GPUs professionally, but I've written a lot of
| AVX-512 code (e.g., https://NN-512.com) and AVX-512 is a big step
| forward from AVX2.
|
| The regularity and simplicity and completeness of the instruction
| set is a big win.
|
| The lane predication (masking) of every instruction is useful
| when the data doesn't fit the vectors (and in loop tails) but it
| has numerous other uses, too. For example, it makes blending
| (vector splicing) and partial operations easy.
|
| The flexible permute instructions (two data vectors in, one
| control vector in, and one data vector out) are fast and
| enormously useful. Anyone who has puzzled with AVX2 will breathe
| a sigh of relief.
|
| The register file is big enough (32 vectors, each 16 floats wide)
| that register starvation typically isn't an issue. For example,
| you don't worry about having registers for your permutation
| control vectors. And the predicate masks are in another set of
| registers, which again are plentiful (eight!).
|
| The easing of alignment requirements (unaligned load/store
| instructions operating on aligned addresses are as fast as
| aligned loads/stores) is also a big win.
|
| AVX-512 is a real pleasure to use.
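The masking and two-source permute semantics described above can be modeled in scalar code. A rough Python sketch (the function names and 16-lane width are illustrative, not real intrinsics):

```python
# Scalar model of AVX-512 semantics: a two-source permute (in the
# style of vpermt2ps: two data vectors in, one control vector in,
# one data vector out) and per-lane predicate masking with zeroing.
# This is an illustration, not real intrinsics.

def permute2(a, b, idx):
    """Select lanes from the 2n-lane concatenation of a and b."""
    n = len(a)  # 16 for 512-bit float vectors
    combined = a + b
    return [combined[i % (2 * n)] for i in idx]

def mask_zero(mask, vec):
    """Zero out the lanes whose predicate-mask bit is clear."""
    return [v if (mask >> i) & 1 else 0.0 for i, v in enumerate(vec)]

a = [float(i) for i in range(16)]        # lanes 0..15
b = [float(100 + i) for i in range(16)]  # lanes 16..31
print(permute2(a, b, [31, 0, 17, 2]))    # [115.0, 0.0, 101.0, 2.0]
print(mask_zero(0b0101, [1.0, 2.0, 3.0, 4.0]))  # [1.0, 0.0, 3.0, 0.0]
```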
| wyldfire wrote:
| I work on some tooling for Hexagon DSPs and I wonder how the
| HVX instructions compare against AVX-512 (for integers at
| least). Different targets / use cases but I'm curious how it
| stacks up.
| dragontamer wrote:
| Avx512 probably needs bpermute for parity with GPU shared
| memory.
|
| Arbitrary swizzles in both the forward permute direction
| (currently in AVX512) and the backward direction (GPU only) are
| as necessary and proper as gather + scatter. Any program that
| uses one is highly likely to use the other.
|
| --------
|
| Butterfly permutes should be especially accelerated, as that
| pattern continues to show up. It seems like arbitrary permutes
| / bpermutes are expensive to implement in hardware, but
| butterfly permutes are the fundamental building block.
|
| Butterfly permutes are needed everywhere from FFT to
| scan/prefix-sum operations. They're also fundamental (and
| simpler at the hardware level than an arbitrary
| bpermute/permute).
|
| AMD implements the butterfly permute in DPP instructions.
| NVidia provides a simple shfl.bfly PTX instruction.
|
| Butterfly networks (and inverse butterfly) are how pdep and
| pext are implemented under the hood.
| http://palms.ee.princeton.edu/PALMSopen/hilewitz06FastBitCom...
|
| -----------
|
| IIRC, the butterfly / inverse butterfly network can implement
| any arbitrary permute in just log2(n) steps.
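A minimal sketch of one butterfly-exchange stage (modeled after NVidia's shfl.bfly, where lane i reads lane i XOR mask; the function name is made up):

```python
# One butterfly-exchange stage: every lane i reads from lane
# i XOR mask (the semantics of NVidia's shfl.bfly). Here the three
# stages for n=8 compose into XOR-by-7, i.e. a full reversal; with
# per-lane control at each stage, such networks realize general
# permutations.

def butterfly_stage(vec, mask):
    return [vec[i ^ mask] for i in range(len(vec))]

v = list(range(8))
print(butterfly_stage(v, 1))  # [1, 0, 3, 2, 5, 4, 7, 6]

for m in (1, 2, 4):           # log2(8) = 3 stages
    v = butterfly_stage(v, m)
print(v)                      # [7, 6, 5, 4, 3, 2, 1, 0]
```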
| Const-me wrote:
| > A Programmer's Perspective
|
| I'm a programmer too but I don't share that perspective.
|
| In my line of work, AVX512 is useless. This won't change until at
| least 25% market penetration on clients, which may or may not
| happen.
|
| Not all programmers are writing server code. Also, very small
| count of them are Intel partners with early access to the chips.
|
| > In AVX terminology, intra-lane movement is called a `shuffle`
| while inter-lane movement is called a `permute`
|
| I don't believe that's true. _mm_shuffle_ps( x, x, imm ) and
| _mm_permute_ps( x, imm ) are 100% equivalent. Also,
| _mm256_permute_ps permutes within lanes, _mm256_permutevar8x32_ps
| permutes across lanes.
| AdamProut wrote:
| If you're building software as a service, that changes this
| calculus a fair bit. You only need to wait for
| AWS/Azure/GCP/CSP of choice to support the hardware (you don't
| need to wait for mass market use of it). For example, many
| (all?) database as a service offerings nowadays run on very
| precisely configured hardware.
| Const-me wrote:
| > many (all?) database as a service offerings nowadays run on
| very precisely configured hardware
|
| I'm not sure AVX512 is a win even in that case. AMD Epyc
| peaks at 64 cores/socket, Intel Xeon at 40 cores/socket. It's
| not immediately obvious that AMD's performance advantage over
| Intel is smaller than the Intel-only win from AVX512.
| dragontamer wrote:
| Are sockets the best measurement? AMD has a pretty complex
| network inside that socket: 8 dies + a switch.
|
| The Xeon 40 core is just one die, which means L3 cache and
| memory is more unified.
|
| L3 cache is probably king (data may not fit, but maybe some
| indexes?), but it's not obvious whether EPYC's separate L3
| caches are comparable to Xeon's (or POWER's).
| Const-me wrote:
| AMD is much faster overall. This page has a benchmark of
| a MySQL database, the slide is called MariaDB:
| https://www.servethehome.com/amd-epyc-7763-review-top-
| for-th... The only benchmark where AMD didn't have
| substantial advantage is chess, I think the reason is
| AMD's slow pdep/pext instructions.
| dragontamer wrote:
| The 6258R is just 28 cores though, 56 total after dual
| socket.
|
| AMD Zen3 has single cycle pext / pdep now and the chess
| benchmarks are better as a result.
| janwas wrote:
| > In my line of work, AVX512 is useless. This won't change
| until at least 25% market penetration on clients, which may or
| may not happen.
|
| Why not use it where available? github.com/google/highway lets
| you write code once using 'platform-independent intrinsics',
| and generate AVX2/AVX-512/NEON/SVE into a single binary;
| dynamic dispatch runs the best available codepath for the
| current CPU.
|
| Disclosure: I am the main author. Happy to discuss.
| Const-me wrote:
| > Why not use it where available?
|
| AFAIK things like that (highway, OpenMP 4.0+, automatic
| vectorizers) are only good for vertical operations.
|
| When the only thing you're doing is billions of vertical
| operations, computing that thing on CPU is often a poor
| choice because GPUs are way faster at that, and way more
| power efficient too.
|
| Therefore, when I optimize CPU-running things, that code is
| usually not a good fit for GPUs. Sometimes the data size is
| too small (PCIe latency gonna eat all the profit from GPU),
| sometimes too many branches, but most typical reason is
| horizontal operations.
|
| An example is multiplying 6x24 by 24x1 matrices. A good
| approach is a loop with 12 iterations, in the loop body 2x
| _mm256_broadcast_sd, then _mm256_blend_pd (these 3
| instructions make 3 vectors [v1,v1,v1,v1],
| [v1,v1,v2,v2], [v2,v2,v2,v2]), then 3x _mm256_fmadd_pd with
| memory source operands. Then after the loop add 12 values
| into 6 with shuffles and vector adds.
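For reference, a plain scalar version of the 6x24 by 24x1 product described above, just to pin down what the broadcast/blend/FMA loop computes (names are illustrative):

```python
# Plain scalar reference for the 6x24 * 24x1 matrix-vector product
# described above. The AVX version broadcasts/blends pairs of input
# elements and uses FMAs, but it computes this same result.

def matvec_6x24(m, v):
    assert len(m) == 6 and len(v) == 24
    return [sum(row[k] * v[k] for k in range(24)) for row in m]

m = [[1.0 if k == r else 0.0 for k in range(24)] for r in range(6)]
v = [float(k) for k in range(24)]
print(matvec_6x24(m, v))  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```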
| NohatCoder wrote:
| The question is, how many cores did we have to sacrifice to get
| AVX512? Intel moved from shipping 10-core CPUs to 8-core CPUs
| with AVX512. Clock speed may not go down much, but the price of
| that is very high power consumption on AVX512 workloads.
|
| As best I can tell, it would be reasonable to expect around 12
| AVX2 cores operating on the same power budget as the current 8
| AVX512 cores. Does the average user have enough AVX512 workloads
| to make that tradeoff worth it?
| brigade wrote:
| Sunny/Cypress Cove had lots of area growth other than just
| AVX-512 [1]. The better comparison is Skylake vs Skylake-SP,
| from which AVX512 is estimated to cost about 5% of the tile
| area [2]. So that's the area cost of about 2 cores in the 28
| core chip.
|
| [1]
| https://en.wikichip.org/wiki/intel/microarchitectures/sunny_...
|
| [2]
| https://www.realworldtech.com/forum/?threadid=193291&curpost...
| wtallis wrote:
| > The better comparison is Skylake vs Skylake-SP, from which
| AVX512 is estimated to cost about 5% of the tile area
|
| That analysis might be slightly underestimating the area
| penalty of AVX512, because the consumer Skylake cores that
| didn't have AVX512 execution units still reserved space for
| the AVX512 register file. (And given that fact, it's all the
| more surprising that while Intel was repeatedly refreshing
| 14nm Skylake for the consumer market, they never added the
| rest of the AVX512 bits or redid the layout of the consumer
| cores to reclaim the blank space of the register file.)
| brigade wrote:
| Unlike the ALUs, it's trivial to locate and determine the
| area of a 512 bit x 168 register file. No way would Kanter
| have missed that in his analysis.
| zigzag312 wrote:
| For SIMD workloads, the performance increase of AVX512 over
| AVX2 is 2x. How many cores would you need to add to double the
| performance of an 8-core CPU?
|
| Any media processing can generally gain significant performance
| with SIMD instructions. Even web browsing is a workload that is
| affected, as (among other things) libjpeg-turbo uses SIMD
| instructions to accelerate decoding. It doesn't yet use AVX512,
| but it does use AVX2. If there are gains with AVX2, why
| wouldn't there be with AVX512?
|
| Even if AVX512 had remained 256 bits, it would be an
| improvement over AVX2, as the new instructions enable
| acceleration of workloads that weren't possible before, plus
| there is an increase in productivity/ease of use.
|
| However, AVX512 won't be used much for consumer applications
| until there is big enough market penetration of CPUs that
| support these instructions, as it has been the case with any
| new SIMD instructions when they were first introduced.
| brigade wrote:
| Amdahl's law. JPEG decoding especially - the majority of the
| decode time these days is entropy decoding that cannot be
| parallelized by either SIMD or threads.
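That ceiling follows directly from Amdahl's law; a quick sketch with made-up fractions:

```python
# Amdahl's law: if only a fraction p of JPEG decode time is
# vectorizable and the serial entropy decode dominates the rest,
# the overall speedup is capped at 1/(1-p) no matter how fast the
# SIMD path is. The fractions here are made up for illustration.

def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

print(round(amdahl(0.4, 8), 2))    # 1.54: 40% vectorizable, 8x SIMD
print(round(amdahl(0.4, 1e9), 2))  # 1.67: the 1/(1-p) ceiling
```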
| Tostino wrote:
| I'm confused by your argument that this is similar to prior
| rollouts of new instruction sets. In the past when Intel
| pushed into instruction set, it generally went to their whole
| line of chips very quickly. That's not what I've seen with
| AVX at all.
| zigzag312 wrote:
| I was referring to software adoption. Software adoption
| cycle is similar, as wider software adoption happens only
| when there is enough market penetration of CPUs that
| support new instructions.
|
| There was AMD's 3DNow! that saw limited software adoption,
| because Intel didn't support it. Newer instruction sets are
| getting adopted progressively more slowly as consumers replace
| computers less often, AMD is slower at adopting each new AVX
| instruction set, and Intel is getting more aggressive with
| market segmentation. Because market
| penetration of new instruction sets is getting slower, SW
| adoption is also much slower.
| janwas wrote:
| Totally agree software uptake is the pain point.
|
| Not even all gamer CPUs have SSE4
| (https://store.steampowered.com/hwsurvey), so it seems
| that runtime dispatch is unavoidable.
|
| Given that, if we can afford to generate code for new
| instruction sets and bundle it all into one slightly
| larger binary/library, the problem is solved, right?
|
| Highway makes it much easier to do that - no need to
| rewrite code for each new instruction set. As to Intel's
| market segmentation, we target 'clusters' of features,
| e.g. Haswell-like (AVX2, BMI2); Skylake (AVX-512
| F/BW/DQ/VL), and Icelake (VNNI, VBMI2, VAES etc) instead
| of all the possible combinations.
| adwn wrote:
| > _For SIMD workload performance increase of AVX512 over AVX2
| is 2x_
|
| It's less than 2x, because the core downclocking for AVX-512
| on older CPUs is higher than for AVX2: 60% vs 85% on Skylake,
| so only a ~1.4x speedup. Newer CPU architectures do not
| downclock, though.
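The arithmetic behind that ~1.4x figure:

```python
# "Only a ~1.4x speedup": AVX-512 doubles the vector width, but on
# Skylake runs at ~60% of nominal clock versus ~85% for AVX2.

avx512_clock, avx2_clock = 0.60, 0.85
speedup = 2 * avx512_clock / avx2_clock
print(round(speedup, 2))  # 1.41
```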
| zigzag312 wrote:
| To be fair, multicore workloads also cause the CPU to operate
| at a lower frequency than single-core workloads do. In the
| end, multi-threading and SIMD are both types of parallel
| processing and each has pros and cons.
| dragontamer wrote:
| > Any media processing can generally gain significant
| performance with SIMD instructions
|
| To a limit. JPEG (and many video codecs based on JPEG) have
| 8x8 macroblocks, which means the "easiest" SIMD parallelism is
| 64-way. And AVX512, taken 8 bits at a time, is in fact 64-way
| SIMD.
|
| To get further parallel processing after that, you'll
| probably have to change the format. GPUs go up to 1024-way
| NVidia blocks (or AMD Thread groups), which are basically
| SIMD-units ganged together so that thread-barrier
| instructions can keep them in sync better. 1024 work items
| correspond to a 32x32 pixel working area.
|
| But that's no longer the format of JPEG. It'd have to be some
| future codec. Maybe modern codecs are seeing the writing on
| the wall and are increasing macroblock size for better
| parallel processing 10 years into the future (they are a
| surprisingly forward looking group in general).
| janwas wrote:
| > Maybe modern codecs are seeing the writing on the wall
| and are increasing macroblock size for better parallel
| processing 10 years into the future
|
| We did indeed do this for JPEG XL - the future is now :)
| 256x256 pixel groups are independently decodable (multi-
| core), each with >= 64-item (float) SIMD.
| cma wrote:
| AVX-512 lines up with 64-byte cache lines, it seems like it
| would be a huge change to go bigger.
| dragontamer wrote:
| NVidia GPUs are 32 wide warps, AMD CDNA are 64 wide.
| That's 1024 bit and 2048 bit respectively.
|
| Cache lines are probably 64 bytes wide for the purpose of
| burst length 8 (a 64-bit bus at burst length 8 is 64 bytes /
| 512 bits).
| acomjean wrote:
| I often wonder how much performance is being left behind from
| these extensions not being used explicitly (are compilers using
| these extensions automatically?). SSE, AVX... Making them
| simpler to use would seem to be a huge win, since using them
| clearly improves performance significantly.
|
| It seems most new CPUs support a fairly large subset of the
| older ones.
|
| I haven't done low-level work for a while, but I remember doing
| some analysis of signal processing code going from PA-RISC to
| Intel. The math compiler libraries for PA-RISC made the initial
| code transition so much slower on Intel. Using the Intel
| compiler and tweaking things started making things work so much
| better.
| jeffbee wrote:
| It's all application-specific. It's not even true that
| compiling with -march=${cpu} gives you the best performance on
| that CPU. As an example it was once discovered that for a
| certain large program -march=haswell was counterproductive on
| Haswell, and choosing a subset of those features was faster in
| practice. The same turned out to be true for Skylake. And BMI2
| did not work right on AMD processors until very recently, so
| the fact that the PDEP and PEXT existed was not helpful. Your
| program would run, but the instructions had high and data-
| dependent latency.
|
| In short, you need to measure the combination of compiler
| options that get you the best performance on your real
| platform. Most people can probably pick up a quick 10-20% win
| by recompiling their MySQL or whatever for arch=sandybridge,
| instead of k8-generic, but beyond that it gets trickier.
| dataflow wrote:
| > It's not even true that compiling with -march=${cpu} gives
| you the best performance on that CPU
|
| Are you thinking of -mtune? I'd never heard of people using
| -march for performance, I thought it was just specifying the
| ISA.
| jeffbee wrote:
| -march tells the compiler which instruction set extensions
| it is allowed to use. E.g., the difference between
| arch=haswell and arch=sandybridge is AVX2, MOVBE, FMA, BMI,
| and BMI2. If you build with haswell, or any given CPU, the
| compiler might emit instructions that won't run on older
| hardware. That's why most Linux distributions for PC are
| compiled for k8-generic, an x86-64 platform with few
| extensions (MMX, SSE, and SSE2, if I recall correctly).
|
| In the particular case as I recall it was AVX2 that was
| counterproductive on haswell. Disabling it made the program
| faster, even though AVX2 was supposed to be a headline
| feature of that generation.
| NohatCoder wrote:
| In general, almost all of it. If you haven't laid out your data
| to make it accessible to SIMD instructions, at best the program
| will get a modest speedup, with most of the instructions spent
| rearranging data to be SIMD-compatible.
|
| By far the simplest way of making the correct data-layout is to
| use the intrinsics manually, so that you discover any issues
| that arise.
|
| A step further is to change the program logic to better take
| advantage of SIMD-instructions, for instance Fabian Giesen has
| written a great deal on making compression code that splices
| the jobs in creative ways in order to utilize SIMD-
| instructions.
| colejohnson66 wrote:
| The big problem with Intel's compiler is that their dispatch
| function requires the CPU identify itself as "GenuineIntel"
| (for the vectorization advantage), which AMD processors don't
| do. https://www.agner.org/optimize/blog/read.php?i=49
| acomjean wrote:
| Interesting. and kinda sad. and based on the comments an
| ongoing problem.
|
| One hopes that ARM's ascendance would make "team x86-64" work
| in cooperative competition for better performance through
| compilers (if they can't through silicon).
| arthur2e5 wrote:
| AMD and recently Intel both moving to LLVM is a good step
| in that direction IMO. LLVM also already comes with a
| dispatcher, so hopefully they aren't going to... do it
| again. Personally I'm more excited about being able to port
| __attribute__ rich code more easily.
|
| (IIRC Intel has tuned down the Cripple AMD thing in MKL
| around late 2020 by providing a specialized Zen code path.
| It was slower than what you get with the detection
| backdoor, but only slightly.)
| [deleted]
| arthur2e5 wrote:
| Compilers are using them,^1 but aren't necessarily using them
| enough. They (generally) can't change alignments of a struct
| willy-nilly, nor can they just, say, split a big FP reduction
| into many parallel ones without violating 754. In most loop-
| related cases however, I would say a nudge in the form of
| #pragma omp simd goes a long way.^2
|
| And then there's the genius stuff like SIMD UTF8 decode and
| whatnot. Compilers don't magically figure those out. Heck, they
| sometimes have trouble with shuffles, which is why the author
| mentions that the new features will help.
|
| ^1 Oh, not GCC on -O2 IIRC. There were some talks about turning
| autovec on that level on like clang does, but I'm not sure it
| went anywhere... I also think MSVC doesn't do that.
|
| ^2 This pragma also turns on autovec even if the compiler is
| not told to do that with the flags. Although my original point
| was about the extra information you can provide with it.
| beached_whale wrote:
| I was recently wondering how much CPU time would be regained if
| we got an instruction to do x * 10 ^ y where x is an int64/y an
| int (or something like that) and do so without error. This is a
| really really common operation and needs a bigint library to do
| properly now for many cases. It's also slow; I think in
| aggregate it's around 500MB/s to 1GB/s, but the slow path is
| < 100MB/s.
|
| But pretty much every JSON/XML/Number.from_string is using an
| operation like this, so the scale is quite large.
| stephencanon wrote:
| Converting a number from string can be done without any
| bignum operations at all; this has been well understood for a
| while now. This is mostly just widely-used libraries not
| having caught up with the state of the art.
|
| There is some opportunity for ISA support to speed it up, but
| multiplication by powers of ten is not the bottleneck; that's
| just a table lookup and full-width product in high-performance
| implementations (i.e. a load and a mul instruction).
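A hedged sketch of that load-and-multiply step (the table entries below are made-up illustrative values, not a real parser's table):

```python
# Sketch of the "load and mul" step: the parsed digit value times a
# precomputed 64-bit power-of-ten significand, keeping the full
# 128-bit product. POW10_SIG holds two illustrative entries; Python
# ints stand in for a 64x64 -> 128-bit widening multiply.

POW10_SIG = {
    1: 0xA000000000000000,   # 10 * 2^60 (exact)
    2: 0xC800000000000000,   # 100 * 2^57 (exact)
}

def mul_hi_lo(a, b):
    p = a * b                # full 128-bit product
    return p >> 64, p & (2**64 - 1)

hi, lo = mul_hi_lo(12345, POW10_SIG[1])
print(hex(hi))  # 0x1e23; the high half feeds rounding/normalization
```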
| beached_whale wrote:
| Got links? Even Daniel Lemire's Fast Double parser will
| dump to strtod when the numbers are not handleable because
| of this.
|
| You cannot represent 10^x, or even all 5^x beyond a certain
| point, in IEEE 754 doubles, but need to do the full y * 10^x
| operation without loss.
|
| Doing it lossily is easy (0 to 1 or 2 ulps off optimal).
| stephencanon wrote:
| You don't need to do it without loss, you need to do it
| with loss smaller than the closest that the digit
| sequence could be to an exact halfway point between two
| representable double-precision values.
|
| This means you may need arbitrary-precision for
| arbitrary-precision decimal strings, but in practice
| these libraries are "always" converting strings that came
| from formatting exact doubles (eg round-tripping through
| JSON), and so have bounded precision. Thus you can
| tightly bound the precision required.
|
| This precisely mirrors how formatting fp numbers used to
| require bignum arithmetic, but all the recent algorithms
| (Ryu, Dragonbox, SwiftDtoa) do it with fixed int128
| arithmetic and deliver always correct results. We'll see
| analogous algorithms for the other direction adopted in
| the next few years--the only reason this direction came
| later is that the other was a bigger performance problem
| originally.
| beached_whale wrote:
| I use dragonbox and it makes the serialization task quite
| easy and fast. It would be great to have a mirror library
| that did the same for deserialization of arbitrary decimal
| numbers. The current libraries are not fast, and it isn't
| the parsing, it's the math they do with those characters.
| gpderetta wrote:
| Some compilers these days offer 128-bit floats (which are
| different from 754 quad) with basically twice the mantissa of
| a standard double. IIRC multiplications and additions are
| relatively cheap.
| jeffbee wrote:
| Adding hardware to speed up decimal number handling
| strikes me as an old-school approach, the type of thing
| you might expect from 1960s mainframes. Isn't the more
| significant opportunity at the application architecture
| level instead? Why exchange numbers between computers in
| decimal? Even for people who are just devoted to the idea
| of human readability, which is an idea in need of
| scrutiny in my opinion, you can still exchange real
| numbers in base 16 format, 0xh.hPd.
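Python's float.hex shows what such an exchange format looks like in practice:

```python
# Hex-float interchange: every binary double has an exact, compact
# base-16 representation, so round-trips are lossless without any
# decimal conversion.

x = 0.1
s = x.hex()
print(s)                       # 0x1.999999999999ap-4
assert float.fromhex(s) == x   # exact round-trip
```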
| beached_whale wrote:
| hex floats are great, but text formats using decimal
| floats are ubiquitous and never going away.
| Someone wrote:
| I wouldn't expect that from 1960s mainframes. In cases
| where ingestion and outputting of decimal numbers could
| be a bottleneck, they would be programmed in COBOL, with
| a compiler using BCD
| (https://en.wikipedia.org/wiki/Binary-coded_decimal) to
| store numbers.
|
| Also, x87 has/had the FBLD and FBSTP instructions that
| can be used to convert between floating point and packed
| decimal (https://en.wikipedia.org/wiki/Intel_BCD_opcode)
|
| If these are still supported, I doubt Intel or AMD spend
| lots of effort making/keeping them fast, though.
___________________________________________________________________
(page generated 2021-08-14 23:01 UTC)