[HN Gopher] Benchmarking division and libdivide on Apple M1 and ...
___________________________________________________________________
Benchmarking division and libdivide on Apple M1 and Intel AVX512
Author : ridiculous_fish
Score : 144 points
Date : 2021-05-12 18:52 UTC (4 hours ago)
(HTM) web link (ridiculousfish.com)
(TXT) w3m dump (ridiculousfish.com)
| CoastalCoder wrote:
| I'm curious why the author showed the C++ source code, but not
| the (per-architecture) disassembly.
|
| I would think that's a much better starting point for trying to
| understand the micro-architectural behavior.
| pkw792 wrote:
| I have just started to use a Mac M1 Mini, and am disappointed.
| It's incredibly slow to download or install anything. Hangs all
| the time, takes like 5 hours to install Xcode (it's done but the
| UI hangs leaving you to believe it has more to do). Hangs when
| cloning a git repo. Gets stuck anywhere and everywhere. Have to
| force kill everything and restart to knock some sense into it. I
| was always respectful of Mac users because Windows has had its
| problems in the past, but after using a Mac for the first time, I
| hate it more than ever.
| pkw792 wrote:
| It's very difficult to follow and engage in technical posts
| like these benchmarking micro-instructions and so on, where on
| the face of it the product is simply falling on its face in the
| most basic use-cases.
| rangewookie wrote:
| uhhh... your mac might be broken. I have one and my friend has
| one. We both engage in cpu/gpu intensive workloads and this
| just doesn't happen. Still within the return window? Would be
| interesting to find out your fan is DOA or something like
| that...
| pkw792 wrote:
| Yeah it's pretty much brand new and for sure within the
| return window. Maybe there's something wrong with it. I was
| expecting cool fireworks for sure, but it's been nothing but
| a PITA thus far.
| pbsd wrote:
| On Skylake-SP's AVX-512, instructions that previously were
| dispatched to port 0 or 1 get instead dispatched to ports 0 _and_
| 1. So instructions like vpsrlq get zero net speedup from
| switching to AVX-512 from AVX2. Instructions that previously ran
| on ports 0,1,5 will now run on ports 0 and 5, for a speedup of at
| best 1.33.
|
| Multiplication will depend on whether the chip has one or two FMA
| units. If so, you can run vpmuludq on ports 0 and 5, which is a
| 2x speedup compared to AVX2's ports 0 and 1. This 8275CL Xeon
| does have 2 FMA units.
|
| Looking at the two inner loops, we have:
|
|   up: vmovdqa64 zmm0,ZMMWORD PTR [rdi+rax*4]  # p23
|       add       rax,0x10                      # p0156
|       vpmuludq  zmm1,zmm0,zmm4                # p05 or only p0
|       vpsrlq    zmm2,zmm1,0x20                # p0
|       vpsrlq    zmm1,zmm0,0x20                # p0
|       vpmuludq  zmm1,zmm1,zmm4                # p05 or only p0
|       vpandd    zmm1,zmm6,zmm1                # p05
|       vpord     zmm1,zmm1,zmm2                # p05
|       vpsubd    zmm0,zmm0,zmm1                # p05
|       vpsrld    zmm0,zmm0,0x1                 # p0
|       vpaddd    zmm0,zmm0,zmm1                # p05
|       vpsrld    zmm0,zmm0,xmm5                # p0+p5
|       vpaddd    zmm3,zmm0,zmm3                # p05
|       cmp       rax,rdx
|       jb        up
|
|   up: vmovdqa   ymm0,YMMWORD PTR [rdi+rax*4]  # p23
|       add       rax,0x8                       # p0156
|       vpmuludq  ymm1,ymm0,ymm4                # p01
|       vpsrlq    ymm2,ymm1,0x20                # p01
|       vpsrlq    ymm1,ymm0,0x20                # p01
|       vpmuludq  ymm1,ymm1,ymm4                # p01
|       vpand     ymm1,ymm1,ymm6                # p015
|       vpor      ymm1,ymm1,ymm2                # p015
|       vpsubd    ymm0,ymm0,ymm1                # p015
|       vpsrld    ymm0,ymm0,0x1                 # p01
|       vpaddd    ymm0,ymm0,ymm1                # p015
|       vpsrld    ymm0,ymm0,xmm5                # p01+p5
|       vpaddd    ymm3,ymm0,ymm3                # p01
|       cmp       rax,rdx
|       jb        up
|
| All other things being equal, we have on average, and counting
| only the differing instructions, a throughput of ~2.27
| instructions per cycle on the AVX2 loop, whereas it is somewhere
| around ~1.45-1.60 for AVX-512, depending on whether you have 1 or 2
| FMA units to run multiplications on port 5.
|
| So based on this approximation, the AVX-512 code should probably
| run around 2*(1.5/2.27) ~ 1.33 times faster. Add to this that
| vpmuludq is actually one of the most thermally intensive
| instructions around and will reduce your core's frequency by
| 100-200 MHz, and the small speedup you see is more or less
| explainable. (I actually do see some more noticeable speedup here
| when switching to AVX-512; 0.25 vs 0.21).
|
| PS: The Intel Icelake and later chips also manage to achieve a
| throughput of 1/2 divisions per cycle for 32-bit divisors, and
| 1/3 divisions per cycle for 64-bit divisors.
| celrod wrote:
| FWIW, llvm-mca estimates 448 clock cycles per 100 iterations of
| the AVX2 loop vs 528 cycles for the AVX512 loop with
| `-mcpu=cascadelake`. That suggests the AVX512 loop should be
| about 2*(448/528) ~ 1.70 times faster.
| pbsd wrote:
| llvm-mca is highly unreliable when it comes to AVX-512. It
| thinks 3 512-bit vpaddd, vpsubd can be run per cycle.
| Adjusting for that you get 622 cycles instead of 528.
| brigade wrote:
| > Speculatively, AVX512 processes multiplies serially, one
| 256-bit lane at a time, losing half its parallelism.
|
| Sort of, in Skylake AVX512 fuses the 256-bit p0 and p1 together
| for one 512-bit uop, and p5 becomes 512-bit wide. So
| theoretically you get 2x 512-bit pipelines versus AVX2's 3x
| 256-bit pipelines (two of which can do multiplies.)
|
| Unfortunately, p5 doesn't support integer multiplies, even in
| SKUs where p5 _does_ support 512-bit floating-point multiplies.
| So AVX512 has no additional throughput for integer multiplies on
| current implementations.
| celrod wrote:
| p5 can do 512 bit operations, but not 256 bit, e.g. look at
| Skylake-AVX512 and Cascadelake (Xeon benched in the blog post
| was Cascadelake) ports for vaddpd:
|
| https://uops.info/html-instr/VADDPD_YMM_YMM_YMM.html
|
| Here is 256 bit VPMULUDQ:
| https://uops.info/html-instr/VPMULUDQ_YMM_YMM_YMM.html
|
| Here is 512 bit VPMULUDQ:
| https://uops.info/html-instr/VPMULUDQ_ZMM_ZMM_ZMM.html
|
| The 256 bit and 512 bit versions both have a reciprocal
| throughput of 0.5 cycles/op, using p01 for 256 bit and p05 for
| 512 bit (where, as you note p0 for 512 bit really means both 0
| and 1).
|
| So, given the same clock speed, this multiplication should have
| twice the throughput with 512 bit vectors as with 256 bit. This
| isn't true for those CPUs without p5, like icelake-client,
| tigerlake, and rocketlake. But should be true for the Xeon
| ridiculousfish benchmarked on.
| twoodfin wrote:
| I bet there's more than one integer division unit per core.
|
| When I've been micro-optimizing performance-critical code,
| integer division shows up as a hot spot regularly. I assume most
| developers don't think about the performance implications of
| coding up a / or % between two runtime values, preventing the
| compiler from doing any strength reduction. Apple must have seen
| this in their surely voluminous profiling of real-world
| applications.
| buildbot wrote:
| I think you got the point most people miss - apple had a unique
| ability to profile every app on the Mac and iOS App Store,
| possibly in an automated way, as part of the app submission
| pipeline. Intel and AMD could go out and profile real work
| applications, and I'm sure they do, but to get to the same
| level of breadth is probably not possible.
| Traster wrote:
| I'm a little skeptical of this; it's the same as saying
| Tesla is going to have self-driving because they can record
| all the decisions current Teslas make. The truth is that it's
| very path dependent. For Tesla this means you can't optimize
| getting into the scenario in the first place, and for Apple it
| means you can't actually know which code path will
| be regularly used.
| pvg wrote:
| Is that really such a unique advantage for Apple? Intel and
| AMD can work with Microsoft to achieve something similar, for
| instance.
| bch wrote:
| I'm only speculating, but I'd think they wouldn't even
| "have to work with" anybody - couldn't they just instrument
| whatever they want?
| mhh__ wrote:
| The fact that Intel and AMD apparently don't prioritize integer
| division could suggest that their profiling suggests it's not
| worth it, but with Apple's transistor budget at the moment they
| can afford it.
|
| Also keep in mind that this Xeon might not be really made for
| number crunching (not really sure)?
| rodgerd wrote:
| Apple's chip designers have the advantage, I assume, of being
| able to wander down a hallway and ask what the telemetry from
| iOS and MacOS devices are telling them about real-world use.
| mhh__ wrote:
| Most of Intel's volume is probably shipped to customers who
| either don't care or buy _a lot_ of CPUs in one go, so the
| advantage of this probably isn't quite as apparent as
| you'd imagine.
|
| What can definitely play a role (I don't think it's
| as much of a problem these days, but it definitely has been
| in the past) is the standard "benchmark" suites that
| chipmakers can beat each other over the head with, e.g. I
| think it was Itanium that had a bunch of integer functional
| units mainly for the purpose of getting better SPEC numbers
| rather than working on the things that actually make
| programs fast (MEMORY) - I was maybe 1 or 2 when this chip
| came out, so this is nth-hand gossip, however.
| masklinn wrote:
| > The fact that Intel and AMD apparently don't prioritize
| integer division could suggest that their profiling suggests
| it's not worth it, but with Apple's transistor budget at the
| moment they can afford it.
|
| An other possibility is that Apple has a very different
| profiling base e.g. iOS applications, whereas Intel and AMD
| would have more artificial workloads, or be bound by
| workloads / profiles from scientific computing or the like
| (video games)?
| pbsd wrote:
| Intel greatly improved their divider implementation between
| Skylake and Icelake. The measurements in the OP are on
| Skylake-SP, prior to these improvements.
| criddell wrote:
| Would it be fair to characterize the M1 as being made for
| number crunching?
| mhh__ wrote:
| Refining "number crunching" to mean single-threaded
| performance, I would say yes, or at least definitely more so
| than the Intel chip.
| gameswithgo wrote:
| Not any more than an intel/amd/etc cpu is. Like that XEON
| cpu is gonna crunch more numbers, just due to more cores.
| mhh__ wrote:
| If it was intended to be used in the cloud for example
| it's going to be doing more _work_ but probably designed
| around a memory-bound load rather than integer
| throughput.
| seumars wrote:
| I'm having a hard time concentrating on the article with that
| background
| codezero wrote:
| Funny, what size screen are you on? My wife said the same and
| tbh I didn't even notice it was a paper towel (funny gag) on my
| desktop system. I may have just not paid attention.
|
| Go into reader mode, the article is great.
| gigatexal wrote:
| Geez. I wonder if the M2 will just be higher clocked and more
| cores or if they'll improve the arch even more?
| sroussey wrote:
| Yes
| lmilcin wrote:
| As somebody who worked for Intel, I am deeply ashamed of this
| result.
|
| I mean, seriously: all that tradition and experience, and you have
| a phone company running circles around you on your own field.
| mhh__ wrote:
| Some notes:
|
| * What's the variance of the measurements?
|
| * Per core, the two processors actually (keep in mind based on
| Intel's TDP figure) have a roughly similar power budget i.e.
| 205/26 vs. 39/(4 or 8 depending on if you count the bigs, littles
| or both), so taking into account that the Apple processor is on a
| process that is something like 4 or 5 times denser, it's not that
| surprising to me that it's faster.
| phkahler wrote:
| Phoronix recently did some benchmarks with AVX512 and while it
| was (modestly if I recall) faster, it was horribly worse in terms
| of performance per watt.
|
| I really hope AMD doesn't adopt AVX512 and if they do I hope it's
| just the minimum for software compatibility.
|
| On a related note, my Ryzen 2400G does not benefit from
| recompiling code with -march=x86-64-v3; in fact it seems a tiny
| bit slower. I assume Zen2 and 3 will actually run faster with
| that option.
| johnklos wrote:
| There was a time when division was expensive enough to look for
| alternatives, but nowadays with the M1 it seems that adding even
| one or two adds or shifts may end up being more expensive than
| division. My goodness, how times have changed!
| rock_artist wrote:
| I didn't read through the entire thing but...
|
| * Would be wise to compare x86_64 under Rosetta as it'll support
| some AVX translation if I remember correctly.
|
| * I didn't see use of Apple's Accelerate framework. Beyond
| standard ARM64, Apple's additional custom magic sits in private
| extensions/ops, which you reach through higher-level frameworks
| such as Accelerate.
| hajile wrote:
| No AVX support with Rosetta2.
|
| Rosetta2 supports up through SSE2. That's the latest
| instruction set to no longer be patented as of around 2020.
| They can use x86_64 only because AMD released x86_64 spec in
| 1999 (even though actual chips came much later).
| gsnedders wrote:
| It certainly claims to support many things later than SSE2,
| including everything up to SSE4.2, on this MacBook Air (M1):
|
| % arch -x86_64 sysctl -a | grep machdep.cpu.features
|
| machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC
| SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE
| SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST
| TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 AES SEGLIM64
| floatboth wrote:
| Custom Apple instructions do... matrix stuff IIRC. Not
| something applicable to simple number division.
| mhh__ wrote:
| Has someone actually found out what instructions do what? I
| assume we won't get to play with them ourselves unless they
| get the Microsoft treatment over the hidden APIs
| GeekyBear wrote:
| Some information from Anandtech's deep dive into Apple's "big"
| Firestorm core.
|
| >On the Integer side, whose in-flight instructions and renaming
| physical register file capacity we estimate at around 354
| entries, we find at least 7 execution ports for actual arithmetic
| operations. These include 4 simple ALUs capable of ADD
| instructions, 2 complex units which feature also MUL (multiply)
| capabilities, and what appears to be a dedicated integer division
| unit. The core is able to handle 2 branches per cycle, which I
| think is enabled by also one or two dedicated branch forwarding
| ports, but I wasn't able to 100% confirm the layout of the design
| here.
|
| On the floating point and vector execution side of things, the
| new Firestorm cores are actually more impressive, as they feature a 33%
| increase in capabilities, enabled by Apple's addition of a fourth
| execution pipeline. The FP rename registers here seem to land at
| 384 entries, which is again comparatively massive. The four
| 128-bit NEON pipelines thus on paper match the current throughput
| capabilities of desktop cores from AMD and Intel, albeit with
| smaller vectors. Floating-point operations throughput here is 1:1
| with the pipeline count, meaning Firestorm can do 4 FADDs and 4
| FMULs per cycle with respectively 3 and 4 cycles latency. That's
| quadruple the per-cycle throughput of Intel CPUs and previous AMD
| CPUs, and still double that of the recent Zen3, of course, still
| running at lower frequency. This might be one reason why Apple
| does so well in browser benchmarks (JavaScript numbers are
| floating-point doubles).
|
| Vector abilities of the 4 pipelines seem to be identical, with
| the only instructions that see lower throughput being FP
| divisions, reciprocals and square-root operations that only have
| a throughput of 1, on one of the four pipes.
|
| https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
| gsnedders wrote:
| > This might be one reason why Apples does so well in browser
| benchmarks (JavaScript numbers are floating-point doubles).
|
| Reminder that browsers try to avoid using doubles for the
| Number type, preferring integers with overflow checks. Much of
| layout uses fixed point for subpixels, too. Using doubles all
| the time would be a notable perf regression.
| amelius wrote:
| What's the fastest way to implement integer division in hardware?
| pcwalton wrote:
| I always assumed that CPUs used Newton's method, though that
| could be wrong.
| https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%8...
|
| Edit: Yeah, that's only used for floating point. Looks like
| integer division is usually an algorithm called SRT:
| https://en.wikipedia.org/wiki/Division_algorithm#SRT_divisio...
| petermcneeley wrote:
| " Is the hardware divider pipelined, is there more than one per
| core?"
|
| Chaining the divisions (as a series of dependencies) would enable
| one to see the full latency of a single divide. You could use
| this data to estimate the number of divide units on the core.
| torstenvl wrote:
| Great and interesting work! The slowness of division operations
| is overlooked too often IMHO and is key to (my approach to)
| avoiding things like integer overflows (there may be a better way
| than dividing TYPE_MAX by one of the operands but I don't know an
| alternate technique). Pretty impressive if the M1 really can
| achieve two-clock-cycle division on a consistent basis.
|
| May I offer a nitpicking correction? 1.058ns compared to 6.998ns
| is an 85% savings, not 88%. The listing you have suggests that
| going down to 1.058ns is a bigger speed-up than going down to
| 0.891ns.
|
| (PS - Verizon's sale of Yahoo has been in the news lately so I
| thought of you and the other regulars of the Programming chat
| room the other day. Hope all is well.)
| ridiculous_fish wrote:
| Fixed the percentage, thank you. Hope you are doing well too!
| david2ndaccount wrote:
| Depending on what you're doing, you can usually just use the
| compiler intrinsics and check for overflow aka,
| `__builtin_mul_overflow` and similar instead of guarding
| against it.
| torstenvl wrote:
| Useful to know, but if we don't care about portability we can
| just write a function in assembler that checks the carry or
| overflow flag or whatever the architecture's equivalent is.
| mhh__ wrote:
| https://gcc.gnu.org/wiki/DontUseInlineAsm
| [deleted]
___________________________________________________________________
(page generated 2021-05-12 23:00 UTC)