[HN Gopher] AVX512/VBMI2: A Programmer's Perspective
       ___________________________________________________________________
        
       AVX512/VBMI2: A Programmer's Perspective
        
       Author : ingve
       Score  : 87 points
       Date   : 2021-08-14 08:42 UTC (14 hours ago)
        
 (HTM) web link (www.singlestore.com)
 (TXT) w3m dump (www.singlestore.com)
        
       | 37ef_ced3 wrote:
       | I work with GPUs professionally, but I've written a lot of
       | AVX-512 code (e.g., https://NN-512.com) and AVX-512 is a big step
       | forward from AVX2.
       | 
        | The regularity, simplicity, and completeness of the instruction
        | set are a big win.
       | 
       | The lane predication (masking) of every instruction is useful
       | when the data doesn't fit the vectors (and in loop tails) but it
       | has numerous other uses, too. For example, it makes blending
       | (vector splicing) and partial operations easy.
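        | 
        | A minimal sketch of a masked loop tail (hypothetical function;
        | assuming dst and src don't overlap):
        | 
        |   #include <immintrin.h>
        |   #include <stddef.h>
        | 
        |   void scale(float *dst, const float *src, size_t n, float s) {
        |     __m512 vs = _mm512_set1_ps(s);
        |     for (size_t i = 0; i < n; i += 16) {
        |       // All 16 lanes in full iterations, n-i lanes in the tail.
        |       __mmask16 m = (n - i >= 16)
        |           ? (__mmask16)0xFFFF
        |           : (__mmask16)((1u << (n - i)) - 1);
        |       __m512 v = _mm512_maskz_loadu_ps(m, src + i); // off lanes = 0
        |       _mm512_mask_storeu_ps(dst + i, m, _mm512_mul_ps(v, vs));
        |     }
        |   }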
       | 
       | The flexible permute instructions (two data vectors in, one
       | control vector in, and one data vector out) are fast and
       | enormously useful. Anyone who has puzzled with AVX2 will breathe
       | a sigh of relief.
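        | 
        | For example (a sketch; indices 0-15 select lanes of the first
        | input, 16-31 lanes of the second):
        | 
        |   #include <immintrin.h>
        | 
        |   // Interleave the low 8 lanes of a and b in one instruction.
        |   __m512 interleave_lo(__m512 a, __m512 b) {
        |     const __m512i idx =
        |         _mm512_set_epi32(23, 7, 22, 6, 21, 5, 20, 4,
        |                          19, 3, 18, 2, 17, 1, 16, 0);
        |     return _mm512_permutex2var_ps(a, idx, b);
        |   }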
       | 
       | The register file is big enough (32 vectors, each 16 floats wide)
       | that register starvation typically isn't an issue. For example,
       | you don't worry about having registers for your permutation
       | control vectors. And the predicate masks are in another set of
       | registers, which again are plentiful (eight!).
       | 
       | The easing of alignment requirements (unaligned load/store
       | instructions operating on aligned addresses are as fast as
       | aligned loads/stores) is also a big win.
       | 
       | AVX-512 is a real pleasure to use.
        
         | wyldfire wrote:
         | I work on some tooling for Hexagon DSPs and I wonder how the
         | HVX instructions compare against AVX-512 (for integers at
         | least). Different targets / use cases but I'm curious how it
         | stacks up.
        
         | dragontamer wrote:
         | Avx512 probably needs bpermute for parity with GPU shared
         | memory.
         | 
          | Arbitrary swizzles in both the forward direction (permute,
          | currently in AVX512) and the backward direction (bpermute,
          | GPU only) are as necessary and proper as gather + scatter.
          | Any program that uses one is highly likely to use the other.
         | 
         | --------
         | 
         | Butterfly permutes should be especially accelerated, as that
         | pattern continues to show up. It seems like arbitrary permutes
         | / bpermutes are expensive to implement in hardware, but
         | butterfly permutes are the fundamental building block.
         | 
          | Butterfly permutes are needed for everything from FFTs to
          | scan/prefix-sum operations. They're also fundamental (and
          | simpler at the hardware level than an arbitrary
          | bpermute/permute).
         | 
         | AMD implements the butterfly permute in DPP instructions.
         | NVidia provides a simple shfl.bfly PTX instruction.
         | 
         | Butterfly networks (and inverse butterfly) are how pdep and
         | pext are implemented under the hood.
         | http://palms.ee.princeton.edu/PALMSopen/hilewitz06FastBitCom...
         | 
         | -----------
         | 
          | IIRC, the butterfly / inverse butterfly network can implement
          | any arbitrary permute in just log2(n) steps.
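          | 
          | A sketch of one butterfly stage, emulated with AVX-512's
          | arbitrary permute (names are mine; a full network applies
          | stages with stride 1, 2, 4, 8 for 16 lanes):
          | 
          |   #include <immintrin.h>
          | 
          |   // Lane i exchanges data with lane i ^ stride.
          |   __m512 butterfly(__m512 v, int stride) {
          |     const __m512i iota =
          |         _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
          |                          7, 6, 5, 4, 3, 2, 1, 0);
          |     __m512i idx = _mm512_xor_si512(iota,
          |                                    _mm512_set1_epi32(stride));
          |     return _mm512_permutexvar_ps(idx, v);
          |   }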
        
       | Const-me wrote:
       | > A Programmer's Perspective
       | 
       | I'm a programmer too but I don't share that perspective.
       | 
        | In my line of work, AVX512 is useless. This won't change until
        | there's at least 25% market penetration on clients, which may
        | or may not happen.
       | 
        | Not all programmers are writing server code. Also, very few of
        | them are Intel partners with early access to the chips.
       | 
       | > In AVX terminology, intra-lane movement is called a `shuffle`
       | while inter-lane movement is called a `permute`
       | 
       | I don't believe that's true. _mm_shuffle_ps( x, x, imm ) and
       | _mm_permute_ps( x, imm ) are 100% equivalent. Also,
       | _mm256_permute_ps permutes within lanes, _mm256_permutevar8x32_ps
       | permutes across lanes.
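        | 
        | Quick check (both compile to the same lane reversal):
        | 
        |   #include <immintrin.h>
        | 
        |   __m128 rev_shuffle(__m128 x) {
        |     return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 1, 2, 3));
        |   }
        |   __m128 rev_permute(__m128 x) {
        |     return _mm_permute_ps(x, _MM_SHUFFLE(0, 1, 2, 3));
        |   }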
        
         | AdamProut wrote:
          | If you're building software as a service, that changes this
          | calculus a fair bit. You only need to wait for
          | AWS/Azure/GCP/your CSP of choice to support the hardware (you
          | don't need to wait for mass-market adoption). For example,
          | many (all?) database-as-a-service offerings nowadays run on
          | very precisely configured hardware.
        
           | Const-me wrote:
            | > many (all?) database-as-a-service offerings nowadays run
            | on very precisely configured hardware
           | 
            | I'm not sure AVX512 is a win even in that case. AMD Epyc
            | peaks at 64 cores/socket, Intel Xeon at 40 cores/socket.
            | It's not immediately obvious that AMD's performance
            | advantage over Intel is smaller than the Intel-only win
            | from AVX512.
        
             | dragontamer wrote:
              | Are sockets the best measurement? AMD has a pretty complex
              | network inside that socket: 8 dies + a switch.
              | 
              | The 40-core Xeon is just one die, which means L3 cache and
              | memory are more unified.
              | 
              | L3 cache is probably king (the data may not fit, but maybe
              | some indexes?), but it's not obvious whether Epyc's
              | separate L3 caches are comparable to Xeon's (or Power's).
        
               | Const-me wrote:
                | AMD is much faster overall. This page has a benchmark of
                | a MySQL database; the slide is called MariaDB:
                | https://www.servethehome.com/amd-epyc-7763-review-top-
                | for-th... The only benchmark where AMD didn't have a
                | substantial advantage is chess; I think the reason is
                | AMD's slow pdep/pext instructions.
        
               | dragontamer wrote:
               | The 6258R is just 28 cores though, 56 total after dual
               | socket.
               | 
               | AMD Zen3 has single cycle pext / pdep now and the chess
               | benchmarks are better as a result.
        
         | janwas wrote:
         | > In my line of work, AVX512 is useless. This won't change
         | until at least 25% market penetration on clients, which may or
         | may not happen.
         | 
         | Why not use it where available? github.com/google/highway lets
         | you write code once using 'platform-independent intrinsics',
         | and generate AVX2/AVX-512/NEON/SVE into a single binary;
         | dynamic dispatch runs the best available codepath for the
         | current CPU.
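          | 
          | A minimal single-target sketch (the real dynamic-dispatch
          | setup uses Highway's foreach_target.h; here n is assumed to
          | be a multiple of the vector width):
          | 
          |   #include <hwy/highway.h>
          |   namespace hn = hwy::HWY_NAMESPACE;
          | 
          |   // Multiply two float arrays with whatever vector width
          |   // the target offers (16 floats on AVX-512, 8 on AVX2).
          |   void MulArrays(const float* a, const float* b, float* out,
          |                  size_t n) {
          |     const hn::ScalableTag<float> d;
          |     for (size_t i = 0; i < n; i += hn::Lanes(d)) {
          |       const auto va = hn::LoadU(d, a + i);
          |       const auto vb = hn::LoadU(d, b + i);
          |       hn::StoreU(hn::Mul(va, vb), d, out + i);
          |     }
          |   }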
         | 
         | Disclosure: I am the main author. Happy to discuss.
        
           | Const-me wrote:
           | > Why not use it where available?
           | 
            | AFAIK things like that (Highway, OpenMP 4.0+, automatic
            | vectorizers) are only good for vertical operations.
            | 
            | When the only thing you're doing is billions of vertical
            | operations, computing that thing on a CPU is often a poor
            | choice, because GPUs are way faster at that, and way more
            | power efficient too.
            | 
            | Therefore, when I optimize CPU-running things, that code is
            | usually not a good fit for GPUs. Sometimes the data size is
            | too small (PCIe latency eats all the profit from the GPU),
            | sometimes there are too many branches, but the most typical
            | reason is horizontal operations.
           | 
            | An example is multiplying a 6x24 matrix by a 24x1 vector. A
            | good approach is a loop with 12 iterations; in the loop
            | body, 2x _mm256_broadcast_sd, then _mm256_blend_pd (these 3
            | instructions make the 3 vectors [v1,v1,v1,v1], [v1,v1,v2,v2]
            | and [v2,v2,v2,v2]), then 3x _mm256_fmadd_pd with memory
            | source operands. After the loop, add the 12 values down to
            | 6 with shuffles and vector adds.
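            | 
            | A sketch of that scheme (assuming column-major storage; the
            | function and variable names are mine):
            | 
            |   #include <immintrin.h>
            | 
            |   // y = A*x, with A a 6x24 column-major matrix of doubles.
            |   void matvec_6x24(const double *A, const double *x,
            |                    double *y) {
            |     __m256d acc0 = _mm256_setzero_pd(); // result rows 0..3
            |     __m256d acc1 = _mm256_setzero_pd(); // rows 4,5,0,1
            |     __m256d acc2 = _mm256_setzero_pd(); // rows 2..5
            |     for (int i = 0; i < 12; i++) { // 2 columns/iteration
            |       __m256d v1 = _mm256_broadcast_sd(x + 2 * i);
            |       __m256d v2 = _mm256_broadcast_sd(x + 2 * i + 1);
            |       // [v1 v1 v2 v2]
            |       __m256d mid = _mm256_blend_pd(v1, v2, 0xC);
            |       const double *c = A + 12 * i; // two 6-row columns
            |       acc0 = _mm256_fmadd_pd(v1, _mm256_loadu_pd(c), acc0);
            |       acc1 = _mm256_fmadd_pd(mid, _mm256_loadu_pd(c + 4),
            |                              acc1);
            |       acc2 = _mm256_fmadd_pd(v2, _mm256_loadu_pd(c + 8),
            |                              acc2);
            |     }
            |     // Fold the 12 partial sums down to 6 results.
            |     __m256d r0123 = _mm256_add_pd(acc0,
            |         _mm256_permute2f128_pd(acc1, acc2, 0x21));
            |     __m128d r45 = _mm_add_pd(
            |         _mm256_castpd256_pd128(acc1),
            |         _mm256_extractf128_pd(acc2, 1));
            |     _mm256_storeu_pd(y, r0123);
            |     _mm_storeu_pd(y + 4, r45);
            |   }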
        
       | NohatCoder wrote:
        | The question is, how many cores did we have to sacrifice to get
        | AVX512? Intel moved from shipping 10-core CPUs to 8-core CPUs
        | with AVX512. Clock speed may not go down much, but the price of
        | that is very high power consumption on AVX512 workloads.
        | 
        | As best I can tell, it would be reasonable to expect around 12
        | AVX2 cores in the same power budget as the current 8 AVX512
        | cores. Does the average user have enough AVX512 workloads to
        | make that tradeoff worth it?
        
         | brigade wrote:
          | Sunny/Cypress Cove had lots of area growth besides AVX-512
          | [1]. The better comparison is Skylake vs Skylake-SP, from
          | which AVX512 is estimated to cost about 5% of the tile area
          | [2]. So that's the area cost of about 2 cores in the 28-core
          | chip.
         | 
         | [1]
         | https://en.wikichip.org/wiki/intel/microarchitectures/sunny_...
         | 
         | [2]
         | https://www.realworldtech.com/forum/?threadid=193291&curpost...
        
           | wtallis wrote:
           | > The better comparison is Skylake vs Skylake-SP, from which
           | AVX512 is estimated to cost about 5% of the tile area
           | 
           | That analysis might be slightly underestimating the area
           | penalty of AVX512, because the consumer Skylake cores that
           | didn't have AVX512 execution units still reserved space for
           | the AVX512 register file. (And given that fact, it's all the
           | more surprising that while Intel was repeatedly refreshing
           | 14nm Skylake for the consumer market, they never added the
           | rest of the AVX512 bits or redid the layout of the consumer
           | cores to reclaim the blank space of the register file.)
        
             | brigade wrote:
              | Unlike the ALUs, it's trivial to locate and determine the
              | area of a 512-bit x 168-entry register file. There's no
              | way Kanter would have missed that in his analysis.
        
         | zigzag312 wrote:
          | For SIMD workloads, the performance increase of AVX512 over
          | AVX2 is 2x. How many cores would you need to add to double
          | the performance of an 8-core CPU?
          | 
          | Any media processing can generally gain significant
          | performance from SIMD instructions. Even web browsing is an
          | affected workload, as (among other things) libjpeg-turbo uses
          | SIMD instructions to accelerate decoding. It doesn't use
          | AVX512 yet, but it does use AVX2. If there are gains with
          | AVX2, why wouldn't there be with AVX512?
         | 
          | Even if AVX512 were only 256 bits wide, it would be an
          | improvement over AVX2, as the new instructions enable
          | acceleration of workloads that were not possible before, plus
          | there is an increase in productivity/ease of use.
         | 
          | However, AVX512 won't be used much in consumer applications
          | until there is big enough market penetration of CPUs that
          | support these instructions, as has been the case with every
          | new SIMD instruction set when it was first introduced.
        
           | brigade wrote:
            | Amdahl's law. For JPEG decoding especially: the majority of
            | decode time these days is entropy decoding, which cannot be
            | parallelized by either SIMD or threads. If, say, 60% of
            | decode time is serial entropy decoding, even an infinitely
            | fast SIMD path caps the overall speedup at 1/0.6, about
            | 1.7x.
        
           | Tostino wrote:
            | I'm confused by your argument that this is similar to prior
            | rollouts of new instruction sets. In the past, when Intel
            | pushed a new instruction set, it generally went into their
            | whole line of chips very quickly. That's not what I've seen
            | with AVX512 at all.
        
             | zigzag312 wrote:
              | I was referring to software adoption. The software
              | adoption cycle is similar: wider adoption happens only
              | when there is enough market penetration of CPUs that
              | support the new instructions.
              | 
              | There was AMD's 3DNow!, which saw limited software
              | adoption because Intel didn't support it. Newer
              | instruction sets are being adopted progressively more
              | slowly: consumers are replacing computers less often, AMD
              | is slower at adopting each new AVX instruction set, and
              | Intel is getting more aggressive with market
              | segmentation. Because market penetration of new
              | instruction sets is slower, SW adoption is also much
              | slower.
        
               | janwas wrote:
               | Totally agree software uptake is the pain point.
               | 
               | Not even all gamer CPUs have SSE4
               | (https://store.steampowered.com/hwsurvey), so it seems
               | that runtime dispatch is unavoidable.
               | 
               | Given that, if we can afford to generate code for new
               | instruction sets and bundle it all into one slightly
               | larger binary/library, the problem is solved, right?
               | 
               | Highway makes it much easier to do that - no need to
               | rewrite code for each new instruction set. As to Intel's
               | market segmentation, we target 'clusters' of features,
               | e.g. Haswell-like (AVX2, BMI2); Skylake (AVX-512
               | F/BW/DQ/VL), and Icelake (VNNI, VBMI2, VAES etc) instead
               | of all the possible combinations.
        
           | adwn wrote:
            | > _For SIMD workloads, the performance increase of AVX512
            | over AVX2 is 2x_
            | 
            | It's less than 2x, because the downclocking for AVX-512 on
            | older CPUs is more severe than for AVX2: 60% vs 85% of the
            | nominal clock on Skylake, so only a ~1.4x speedup. Newer
            | CPU architectures do not downclock, though.
        
             | zigzag312 wrote:
              | To be fair, a multicore workload also causes the CPU to
              | operate at a lower frequency than a single-core workload
              | does. In the end, multi-threading and SIMD are both forms
              | of parallel processing, and each has pros and cons.
        
           | dragontamer wrote:
           | > Any media processing can generally gain significant
           | performance with SIMD instructions
           | 
            | To a limit. JPEG (and many video codecs based on it) uses
            | 8x8 macroblocks, which means the "easiest" SIMD parallelism
            | is 64-way. And AVX512 taken 8 bits at a time is, in fact,
            | 64-way SIMD.
            | 
            | To get further parallel processing after that, you'll
            | probably have to change the format. GPUs go up to 1024-wide
            | NVidia blocks (or AMD thread groups), which are basically
            | SIMD units ganged together so that thread-barrier
            | instructions can keep them in sync better. 1024 work items
            | correspond to a 32x32 pixel working area.
           | 
           | But that's no longer the format of JPEG. It'd have to be some
           | future codec. Maybe modern codecs are seeing the writing on
           | the wall and are increasing macroblock size for better
           | parallel processing 10 years into the future (they are a
           | surprisingly forward looking group in general).
        
             | janwas wrote:
             | > Maybe modern codecs are seeing the writing on the wall
             | and are increasing macroblock size for better parallel
             | processing 10 years into the future
             | 
             | We did indeed do this for JPEG XL - the future is now :)
             | 256x256 pixel groups are independently decodable (multi-
             | core), each with >= 64-item (float) SIMD.
        
             | cma wrote:
             | AVX-512 lines up with 64-byte cache lines, it seems like it
             | would be a huge change to go bigger.
        
               | dragontamer wrote:
               | NVidia GPUs are 32 wide warps, AMD CDNA are 64 wide.
               | That's 1024 bit and 2048 bit respectively.
               | 
               | Cache lines are probably 64 wide for the purpose of burst
               | length 8 (64 bit burst length 8 is 64 bytes / 512 bits).
        
       | acomjean wrote:
        | I often wonder how much performance is being left behind when
        | these extensions (SSE, AVX, ...) aren't used explicitly (do
        | compilers use them automatically?). Making them simpler to use
        | would seem to be a huge win, since using them clearly improves
        | performance significantly.
        | 
        | It seems most new CPUs support a fairly large subset of the
        | older ones.
        | 
        | I haven't done low-level work for a while, but I remember doing
        | some analysis of signal-processing code moving from PA-RISC to
        | Intel. The math compiler libraries for PA-RISC made the initial
        | code transition much slower on Intel. Using the Intel compiler
        | and tweaking things made it work much better.
        
         | jeffbee wrote:
          | It's all application-specific. It's not even true that
          | compiling with -march=${cpu} gives you the best performance
          | on that CPU. As an example, it was once discovered that for a
          | certain large program, -march=haswell was counterproductive
          | on Haswell, and choosing a subset of those features was
          | faster in practice. The same turned out to be true for
          | Skylake. And BMI2 did not work right on AMD processors until
          | very recently, so the mere fact that PDEP and PEXT existed
          | was not helpful: your program would run, but the instructions
          | had high and data-dependent latency.
         | 
         | In short, you need to measure the combination of compiler
         | options that get you the best performance on your real
         | platform. Most people can probably pick up a quick 10-20% win
         | by recompiling their MySQL or whatever for arch=sandybridge,
         | instead of k8-generic, but beyond that it gets trickier.
        
           | dataflow wrote:
           | > It's not even true that compiling with -march=${cpu} gives
           | you the best performance on that CPU
           | 
           | Are you thinking of -mtune? I'd never heard of people using
           | -march for performance, I thought it was just specifying the
           | ISA.
        
             | jeffbee wrote:
             | -march tells the compiler which instruction set extensions
             | it is allowed to use. For e.g. the difference between
             | arch=haswell and arch=sandybridge is AVX2, MOVBE, FMA, BMI,
              | and BMI2. If you build for haswell, or any given CPU, the
              | compiler might emit instructions that won't run on older
              | hardware. That's why most Linux distributions for PC are
              | compiled for k8-generic, an x86-64 baseline with few
              | extensions (MMX, SSE, and SSE2, if I recall correctly).
             | 
             | In the particular case as I recall it was AVX2 that was
             | counterproductive on haswell. Disabling it made the program
             | faster, even though AVX2 was supposed to be a headline
             | feature of that generation.
        
         | NohatCoder wrote:
          | In general, almost all of it. If you haven't laid out your
          | data to make it accessible to SIMD instructions, at best the
          | program will get a modest speedup, with most of the
          | instructions spent rearranging data into a SIMD-compatible
          | layout.
          | 
          | By far the simplest way to get the data layout right is to
          | use the intrinsics manually, so that you discover any issues
          | as they arise.
          | 
          | A step further is to change the program logic to better take
          | advantage of SIMD instructions; for instance, Fabian Giesen
          | has written a great deal about making compression code that
          | splits the work in creative ways in order to utilize SIMD
          | instructions.
        
         | colejohnson66 wrote:
          | The big problem with Intel's compiler is that its dispatch
          | function requires the CPU to identify itself as
          | "GenuineIntel" (to get the vectorized code paths), which AMD
          | processors don't do.
          | https://www.agner.org/optimize/blog/read.php?i=49
        
           | acomjean wrote:
            | Interesting, and kinda sad, and based on the comments an
            | ongoing problem.
            | 
            | One hopes that ARM's ascendance would make "team x86-64"
            | work in cooperative competition for better performance
            | through compilers (if they can't through silicon).
        
             | arthur2e5 wrote:
             | AMD and recently Intel both moving to LLVM is a good step
             | in that direction IMO. LLVM also already comes with a
             | dispatcher, so hopefully they aren't going to... do it
             | again. Personally I'm more excited about being able to port
             | __attribute__ rich code more easily.
             | 
              | (IIRC Intel toned down the "cripple AMD" thing in MKL
              | around late 2020 by providing a specialized Zen code
              | path. It was slower than what you get with the detection
              | backdoor, but only slightly.)
        
           | [deleted]
        
         | arthur2e5 wrote:
          | Compilers are using them,^1 but aren't necessarily using them
          | enough. They (generally) can't change the alignment of a
          | struct willy-nilly, nor can they just, say, split a big FP
          | reduction into many parallel ones without violating IEEE 754.
          | In most loop-related cases, however, I would say a nudge in
          | the form of #pragma omp simd goes a long way.^2
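          | 
          | For instance (a minimal sketch; the reduction clause licenses
          | the compiler to reassociate the FP sum and vectorize):
          | 
          |   float sum(const float *a, int n) {
          |     float s = 0.0f;
          |     #pragma omp simd reduction(+:s)
          |     for (int i = 0; i < n; i++) s += a[i];
          |     return s;
          |   }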
         | 
         | And then there's the genius stuff like SIMD UTF8 decode and
         | whatnot. Compilers don't magically figure those out. Heck, they
         | sometimes have trouble with shuffles, which is why the author
         | mentions that the new features will help.
         | 
          | ^1 Oh, not GCC at -O2, IIRC. There was some talk about
          | turning autovec on at that level like clang does, but I'm not
          | sure it went anywhere... I also think MSVC doesn't do that.
         | 
          | ^2 This pragma also turns on autovectorization even if the
          | compiler is not told to do so with flags, although my
          | original point was about the extra information you can
          | provide with it.
        
         | beached_whale wrote:
          | I was recently wondering how much CPU time would be regained
          | if we got an instruction to do x * 10 ^ y, where x is an
          | int64 and y an int (or something like that), and do so
          | without error. This is a really, really common operation, and
          | for many cases it currently needs a bigint library to do
          | properly. It's also slow: I think in aggregate it's around
          | 500MB/s to 1GB/s, but the slow path is < 100MB/s.
          | 
          | But pretty much every JSON/XML/Number.from_string uses an
          | operation like this, so the scale is quite large.
        
           | stephencanon wrote:
            | Converting a number from a string can be done without any
            | bignum operations at all; this has been well understood for
            | a while now. It's mostly just that widely used libraries
            | haven't caught up with the state of the art.
           | 
           | There is some opportunity for ISA support to speed it up, but
           | multiplication by powers of ten is not the bottleneck, that's
           | just a table lookup and full-width product in high-
           | performance implementations (i.e. a load and a mul
           | instruction).
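            | 
            | Roughly (an illustrative sketch, not any particular
            | library's code; kPow10Sig is an assumed precomputed table
            | of 64-bit significands of powers of ten):
            | 
            |   #include <stdint.h>
            | 
            |   extern const uint64_t kPow10Sig[]; // hypothetical table
            | 
            |   // Scale x by 10^e: one load, one full-width multiply,
            |   // keep the high 64 bits (uses the compiler's 128-bit
            |   // integer extension).
            |   static inline uint64_t mul_pow10_hi(uint64_t x, int e) {
            |     unsigned __int128 p =
            |         (unsigned __int128)x * kPow10Sig[e];
            |     return (uint64_t)(p >> 64);
            |   }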
        
             | beached_whale wrote:
              | Got links? Even Daniel Lemire's fast_double_parser will
              | fall back to strtod when the numbers are not handleable
              | because of this.
              | 
              | You cannot represent 10^x, or even all 5^x beyond a
              | certain point, in IEEE754 doubles; you need to do the
              | full y * 10^x operation without loss.
              | 
              | Doing it lossily is easy (0 to 1 or 2 ulp off optimal).
        
               | stephencanon wrote:
                | You don't need to do it without loss; you need to do it
                | with loss smaller than the closest that the digit
                | sequence could be to an exact halfway point between two
                | representable double-precision values.
                | 
                | This means you may need arbitrary precision for
                | arbitrary decimal strings, but in practice these
                | libraries are "always" converting strings that came
                | from formatting exact doubles (e.g. round-tripping
                | through JSON), and so have bounded precision. Thus you
                | can tightly bound the precision required.
                | 
                | This precisely mirrors how formatting fp numbers used
                | to require bignum arithmetic, but all the recent
                | algorithms (Ryu, Dragonbox, SwiftDtoa) do it with fixed
                | int128 arithmetic and deliver always-correct results.
                | We'll see analogous algorithms for the other direction
                | adopted in the next few years--the only reason this
                | direction came later is that the other was a bigger
                | performance problem originally.
        
               | beached_whale wrote:
                | I use Dragonbox and it makes the serialization task
                | quite easy and fast. It would be great to have a mirror
                | library that did the same for deserialization of
                | arbitrary decimal numbers. The current libraries are
                | not fast, and it isn't the parsing, it's the math they
                | do with those characters.
        
               | gpderetta wrote:
                | Some compilers these days offer 128-bit floats (which
                | are different from 754 quad) with basically twice the
                | mantissa of a standard double. IIRC multiplications
                | and additions are relatively cheap.
        
               | jeffbee wrote:
               | Adding hardware to speed up decimal number handling
               | strikes me as an old-school approach, the type of thing
               | you might expect from 1960s mainframes. Isn't the more
               | significant opportunity at the application architecture
               | level instead? Why exchange numbers between computers in
               | decimal? Even for people who are just devoted to the idea
               | of human readability, which is an idea in need of
               | scrutiny in my opinion, you can still exchange real
               | numbers in base 16 format, 0xh.hPd.
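                | 
                | For instance, C's printf already emits that form with
                | "%a", and scanf reads it back:
                | 
                |   #include <stdio.h>
                | 
                |   int main(void) {
                |     // prints 0x1.5555555555555p-2
                |     printf("%a\n", 1.0 / 3.0);
                |     double x;
                |     // exact round trip, no decimal rounding involved
                |     sscanf("0x1.5555555555555p-2", "%la", &x);
                |     return 0;
                |   }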
        
               | beached_whale wrote:
               | hex floats are great, but text formats using decimal
               | floats are ubiquitous and never going away.
        
               | Someone wrote:
               | I wouldn't expect that from 1960's mainframes. In cases
               | where ingestion and outputting of decimal numbers could
               | be a bottleneck, they would be programmed in COBOL, with
               | a compiler using BCD
               | (https://en.wikipedia.org/wiki/Binary-coded_decimal) to
               | store numbers.
               | 
               | Also, x87 has/had the FBLD and FBSTP instructions that
               | can be used to convert between floating point and packed
               | decimal (https://en.wikipedia.org/wiki/Intel_BCD_opcode)
               | 
                | If these are still supported, I doubt Intel or AMD
                | spend much effort making/keeping them fast, though.
        
       ___________________________________________________________________
       (page generated 2021-08-14 23:01 UTC)