[HN Gopher] How to get 1.5 TFlops of FP32 performance on a singl...
___________________________________________________________________
How to get 1.5 TFlops of FP32 performance on a single M1 CPU core
Author : signa11
Score : 281 points
Date : 2023-01-05 13:22 UTC (9 hours ago)
(HTM) web link (jott.live)
(TXT) w3m dump (jott.live)
| Tepix wrote:
| So, this is about Apple's undocumented AMX instructions, issued
| from CPU, executed on a special accelerator execution unit.
|
| Is there one such unit per CPU core?
| MuffinFlavored wrote:
| > So, this is about Apple's undocumented AMX instructions,
| issued from CPU, executed on a special accelerator execution
| unit.
|
| CPU instruction -> AMX instruction -> AMX result -> CPU?
|
| How are these kinds of things usually kept in sync/in a
| manageable state? Like does the CPU block until the AMX
| returns?
| my123 wrote:
| M1 has one AMX unit per cluster AFAIK. This however can and
| does change between different chips.
| danieldk wrote:
| Yes, there is one per core cluster. The title is a bit
| misleading, because it suggests that going to two or three
| cores would scale linearly, whereas it won't be much faster.
| See here for sgemm benchmarks for everything from the M1 to
| M1 Ultra and 1 to 16 threads:
|
| https://github.com/danieldk/gemm-benchmark#1-to-16-threads
| adrian_b wrote:
| No.
|
| So the title is misleading, even if it is true that you get
| this performance with a program that uses a single CPU core.
| mochomocha wrote:
| I think the author downplays the significance of his work
| because it only applies to "small neural networks". There are a
| lot of use-cases that can benefit from this type of optimization.
| Discovering how to use an undocumented fast accelerator available
| on millions of devices is very valuable.
| MuffinFlavored wrote:
| Not up to date on a lot of "AI"/"ML" things, why isn't this
| significant for medium/large neural networks as well?
| lostmsu wrote:
| RTX 3090 theoretical matmul is 142 TFlops, i.e. about 100x of
| this.
| bee_rider wrote:
| The 1.5 here is for a single core, though. So if we assume
| that the performance core on an M1 is around 7.5 watts (I'm
| not actually sure, seems like a reasonable upper bound
| though if a whole M1 mini is around 39 watts), we'd be
| looking at around 750 watts to match. Which seems like a
| surprisingly non-crazy amount of power given these are 32
| bit flops, unlike the 16 in the RTX 3090, and they come
| from a CPU.
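|
| (Back of the envelope: 142 / 1.5 is about 95 such cores, and
| 95 x 7.5 W is roughly 710 W, hence the ~750 W ballpark.)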
| lostmsu wrote:
| This code runs on the AMX co-processor. From the article:
|
| > An important distinction is that the AMX:CPU ratio is
| not 1:1; not every core has its own AMX co-processor.
|
| My understanding is there's only 1 of those per regular
| M1 CPU, maybe 4 on the largest one (Ultra).
| johndough wrote:
| The RTX 3090 has 35.58 TFlops FP32 performance, or 285.48
| FP16 according to
| https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
|
| EDIT: I fell for NVIDIA's marketing. The dense FP16
| performance is only half of 285.48, which is about 142. Thanks
| to adgjlsfhk1 for the correction.
| adgjlsfhk1 wrote:
| That 285 is listed as (2:1 sparse) which means it's only
| valid for matrices where 2 out of every 4 numbers are
| zero. For dense matrices it's half that.
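|
| (Concretely: in each group of four values the hardware expects
| two to be zero, e.g. [a, 0, 0, b], and it stores only the two
| non-zeros plus 2-bit position metadata, which is where the 2x
| throughput claim comes from.)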
| bee_rider wrote:
| Are 2:1 sparse matrices a common thing? It seems weird,
| like clearly that's not sparse enough to want to use,
| like, sparse matrix "CSR" style storage or something,
| haha. I would just treat it as dense I guess.
| adgjlsfhk1 wrote:
| They aren't. As far as I can tell, Nvidia does this to be
| able to double the number of TFlops they put on their
| website. (this might be a little unfair, the real reason
| is that in ML it _might_ be possible to train a NN such
| that your matrices have this structure, but I haven't seen
| anyone other than Nvidia use it)
| bee_rider wrote:
| I'm trying to think of cases where it might accidentally
| come up, and all I can think of is something like "oops I
| used complex but my values are actually real."
| [deleted]
| dotnet00 wrote:
| There has been some work in that direction but it hasn't
| really caught on as fast as NVIDIA may have expected it
| to.
| lostmsu wrote:
| Yeah, still waiting for this feature to be available in
| PyTorch natively.
| my123 wrote:
| Apple did prefer to expose it through their own
| Accelerator.framework API however...
| capableweb wrote:
| Of course they do, Apple likes to remain as much in control as
| possible. If it suddenly becomes more efficient/faster to run
| ML/AI stuff on Asahi Linux on Mac hardware than with macOS,
| I'm sure they'd be embarrassed enough to take some sort of
| action. And I'm pretty sure that action would be towards the
| side of "closing things down" rather than "opening stuff up",
| as is tradition.
| my123 wrote:
| Wrong answer.
|
| AMX is an unstable ISA that changes between product
| generations. That's why it's not publicly documented.
|
| Arm SME is the standardisation of the concept, but is not on
| the market yet.
|
| https://community.arm.com/arm-community-blogs/b/architecture...
| svantana wrote:
| Has it been verified that they actually use these
| instructions in Accelerate.framework? I just benchmarked this
| on my 2019 intel i9 mbp, and got the following speeds for
| 128x128 matrices, 32 repeats:
|
|     cblas_sgemm: 36 GFLOP/s
|     vDSP_mmul:   41 GFLOP/s
|
| That's a pretty big deal if these functions are >30x faster
| on the M1...!
|
| edit: that seems to be verified in the tlkh.dev blog post
| above. Interestingly, I ran the same code on my bargain-
| basement 2020 iphone SE, and got 259GFLOP/s! These apple
| devices are pretty mindblowing.
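|
| For anyone who wants to poke at this, a minimal sketch of such
| a benchmark (my own rough version, assuming macOS with the
| Accelerate framework; GFLOP/s here is 2*N^3*repeats divided by
| the elapsed time):
|
|     // bench.c -- rough timing sketch (macOS + Accelerate
|     // assumed; compile: clang -O2 bench.c -framework Accelerate)
|     #include <Accelerate/Accelerate.h>
|     #include <mach/mach_time.h>
|     #include <stdint.h>
|     #include <stdio.h>
|     #include <stdlib.h>
|
|     static double secs(uint64_t t0, uint64_t t1) {
|         mach_timebase_info_data_t tb;
|         mach_timebase_info(&tb);
|         return (double)(t1 - t0) * tb.numer / tb.denom / 1e9;
|     }
|
|     int main(void) {
|         enum { N = 128, REPEATS = 32 };
|         float *A = calloc(N * N, sizeof(float));
|         float *B = calloc(N * N, sizeof(float));
|         float *C = calloc(N * N, sizeof(float));
|         for (int i = 0; i < N * N; i++) {
|             A[i] = 1.0f;
|             B[i] = 2.0f;
|         }
|         double flops = 2.0 * N * N * N * REPEATS;
|
|         uint64_t t0 = mach_absolute_time();
|         for (int r = 0; r < REPEATS; r++)
|             cblas_sgemm(CblasRowMajor, CblasNoTrans,
|                         CblasNoTrans, N, N, N, 1.0f,
|                         A, N, B, N, 0.0f, C, N);
|         uint64_t t1 = mach_absolute_time();
|         printf("cblas_sgemm: %.1f GFLOP/s\n",
|                flops / secs(t0, t1) / 1e9);
|
|         t0 = mach_absolute_time();
|         for (int r = 0; r < REPEATS; r++)
|             vDSP_mmul(A, 1, B, 1, C, 1, N, N, N);
|         t1 = mach_absolute_time();
|         printf("vDSP_mmul:   %.1f GFLOP/s\n",
|                flops / secs(t0, t1) / 1e9);
|         return 0;
|     }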
| danieldk wrote:
| _Has it been verified that they actually use these
| instructions in Accelerate.framework?_
|
| Yes. Aside from benchmarks, you can easily verify this by
| profiling an application with Instruments and then
| inspecting the disassembly.
|
| However, it should be said that AMX does not scale linearly
| with the number of cores, but with the number of core
| clusters. So, on the M1 if you use Accelerate in two
| threads (rather than one), performance will barely improve,
| because the first thread can keep the AMX unit busy enough.
|
| However, e.g. the M1 Pro and M1 Max have two performance
| core clusters with AMX units in them, so matrix multiplication
| performance roughly doubles compared to the M1. Similarly,
| the M1 Ultra has four performance core clusters, so matrix
| multiplication performance is roughly twice that of the M1
| Pro/Max and four times that of the M1.
|
| Benchmarks:
|
| https://github.com/danieldk/gemm-benchmark#1-to-16-threads
| mark_l_watson wrote:
| Apple has done a wonderful job making CoreML smoothly
| integrated with iOS, watchOS, iPadOS, and macOS development.
| ur-whale wrote:
| > Apple has done a wonderful job making CoreML
|
| Apple has done a wonderful job of further locking their user
| into the golden cage they call a platform.
| bee_rider wrote:
| The worst thing is, their users don't even seem to be
| totally happy with the state of affairs! It's like they
| don't even realize their preferences are wrong. :(
| bee_rider wrote:
| This was intended to be obvious sarcasm, but I somehow
| accidentally added "don't" which... really just makes it
| confusing. Oops, haha.
| gcr wrote:
| I think you're right and you're wrong, it's a bit more
| complicated.
|
| ML is one of the few applications that benefit from
| platform-specific optimizations, so if you need every ounce
| of performance, you have your choice of which walled garden
| you want to tether your application to. The "lock-in" comes
| from the specific capabilities of your special-purpose
| hardware, and for serious applications, you're already
| thinking hard about whether to design your entire
| implementation around Apple, NVidia, Google/TPU, or even
| Android devices. For big models, platform-specific needs
| influence every aspect of model design, including
| data/model sharding, quantization, training loops...
|
| For non-scientific applications, it's usual practice to
| train your model in platform-agnostic ways using PyTorch or
| Tensorflow or whatever and _then_ deploy it to devices in
| platform-specific ways, whether that's XLA, CoreML, Edge
| TPU, Android NNAPI, TensorflowJS, or hell, custom-written
| GLSL shaders or whatever.
|
| We're just starting to see cross-platform frameworks that
| abstract model inference: TFLite, PyTorch Mobile, ONNX. To
| their credit, CoreML can act as a backend for any of these,
| so you don't even need to worry about your platform.
| gjsman-1000 wrote:
| Every platform is a golden cage in some respect. Ask any
| business that is stuck on ancient Win32 and even DOS
| applications, source code long gone. (Looking at you, my
| local McDonalds, Menards, Tractor Supply)...
| brookst wrote:
| I get the value of the common APIs, but as a developer how do
| you deal with the wide range of performance in different form
| factors and product generations? Is there some way to
| gracefully adapt the same models to a specific device's
| capabilities?
| londons_explore wrote:
| There are a bunch of easy ways to scale neural nets.
| Quantization and distillation being the main approaches (or
| some combination of the two). Both typically require more
| training time, but not much more human effort.
|
| You can normally expect to get way more than half the
| 'outcome' from a neural net with half the
| ram/compute/time/power budget. So neural nets scale 'down'
| pretty well.
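|
| As a concrete illustration of the quantization half, a minimal
| sketch of symmetric per-tensor int8 quantization (a hypothetical
| helper, not from any particular framework; real toolchains
| usually quantize per-channel and fold the scale into the matmul):
|
|     #include <math.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     // Quantize FP32 weights to int8 with one shared scale.
|     // Returns the scale; dequantize as q[i] * scale.
|     static float quantize_int8(const float *w, int8_t *q,
|                                size_t n) {
|         float max_abs = 0.0f;
|         for (size_t i = 0; i < n; i++) {
|             float a = fabsf(w[i]);
|             if (a > max_abs) max_abs = a;
|         }
|         float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
|         for (size_t i = 0; i < n; i++) {
|             float r = w[i] / scale;
|             r = fmaxf(-127.0f, fminf(127.0f, r));
|             q[i] = (int8_t)lrintf(r);
|         }
|         return scale;
|     }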
| londons_explore wrote:
| For comparison...
|
| A single Google TPUv4 'pod' (entire row of datacenter racks)
| gives 1,126,400 TFlops.
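|
| (For scale, that's roughly 750,000x the article's single-core
| 1.5 TFlops, though at lower precision.)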
|
| That's why your pet ML projects will always be behind those
| done at big companies.
| mhuffman wrote:
| I have always been under the impression that there will
| eventually be a way to distribute ML projects across many
| personal computers (like Folding@home or SETI@home) that
| could give even Google a run for their money! A few hundred
| million personal computers is a lot of processing!
| Thaxll wrote:
| Why would Apple hide such optimization from public APIs?
| esskay wrote:
| It's only really used for their internal applications and the
| OS-level stuff, so I assume they want to prevent performance
| issues from it having to deal with 3rd party stuff.
| frogblast wrote:
| It is available via public APIs, but the hardware instructions
| themselves are not documented. This lets the instructions
| change in future CPUs, vs having the ISA be baked in stone
| forever.
|
| Example: AMX predated the standard ARM matrix multiply
| instructions. Perhaps Apple will add the ARM versions someday
| and then can remove AMX without breaking compatibility. Or maybe
| there will be a non-additive AMXv2.
| bob1029 wrote:
| This is a fairly ridiculous amount of performance, all things
| considered.
|
| It always seemed to me like SIMD/AVX/etc would eventually come
| for the GPU's lunch money... How many more product generations of
| "SIMD on steroids" before this is practically true?
|
| The latency factor is the biggest thing for me. The GPU is a
| turtle compared to CPU-bound techniques. I can see emerging
| applications for this in real-time/streaming where every
| millisecond counts.
| zozbot234 wrote:
| The GPU is more like a slow U-Haul truck, whereas the CPU is a
| super fast race car. Both have merit in their own domain. And
| GPU training is pretty solidly in the "slow and steady" camp.
| pletnes wrote:
| Training in production, yes. Developing locally is still a
| thing for many reasons. More importantly, inference is more
| <<sports car>> - you want the app to stay interactive!
| fnordpiglet wrote:
| A typical goal is 60 Hz, which is about 17k microseconds per
| frame. My cursory research says that as of 10 years ago a
| typical write/receive latency for an nvidia card, i7, pcie2.0
| was 20 microseconds.
| That gives you a large budget despite the fact SIMD on chip is
| measured in cycles not microseconds. Inside the GPU you have a
| huge amount of space and resources to do highly specialized
| operations in vast concurrency, i.e., bandwidth for compute is
| huge and specialized. I don't see how CPUs or SoCs will solve
| this without vastly increasing die sizes and heat and power
| consumption to be close to that of a GPU with all its cooling
| requirements and heavy power needs.
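|
| (For reference: one 60 Hz frame is 1 s / 60, about 16,700
| microseconds, so a 20 microsecond hop over PCIe is well under
| 1% of the frame budget.)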
|
| That said I think the "good enough" metric is already there and
| unless you're doing hardware ray tracing or extreme details at
| high resolutions you won't need or care about a GPU any more.
|
| Latency though isn't the issue. The times involved for human
| perception are long and not getting shorter.
| fragmede wrote:
| Things have been "good enough" since 2012. But then VSCode
| and bigger webpages came along and suddenly a Core2Duo just
| doesn't cut it anymore. ML models need somewhere to run,
| locally, and both Apple and Google have dedicated hardware on
| smartphones for that. Support for bigger and bigger models
| (read GPU performance) in smaller and smaller packages is
| just the latest iteration of progress.
| fnordpiglet wrote:
| Yes I agree. Except I think real time ray tracing really is
| that much better and shifts the goal posts again.
| jasonwatkinspdx wrote:
| One interesting data point here is the Fugaku supercomputer is
| based around ARM's scalable vector stuff (basically Cray style
| variable length vectors vs short vector SIMD like AVX) and no
| gpu. Using HBM is a key enabler here.
|
| I'm not sure GPUs will be displaced, looking at the
| difficulties Larrabee had on the driver side, but I do think
| we'll see more flexible alternatives becoming popular.
| kllrnohj wrote:
| You'd need a fairly drastic shift in the memory architecture of
| CPUs for that. Not something unheard of, such as Intel's new
| Xeon Max beast with HBM 2e on the CPU module. But it's
| definitely not an issue of just throwing some big SIMD blocks
| onto the die & calling it a day. That is, after all, basically
| what AVX-512 is. And while it has its place, it's also not
| eating anyone's lunch money.
|
| And also, as weird as it is, 1.5TFlops isn't actually that
| ridiculous. We had that performance 14 years ago at 150w with
| desktop GPUs. 14 years to reduce from 150w to what, 5w?, is
| cool but also honestly pretty par for the course is it not?
| Especially for a fixed-function block?
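|
| (150 W down to ~5 W would be a ~30x efficiency gain over 14
| years, i.e. roughly a doubling every 2-3 years.)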
| sliken wrote:
| "You'd need a fairly drastic shift in the memory architecture
| of CPU". You mean like selling laptops (at 400GB/sec) and
| desktops (at 800GB/sec) with much improved memory systems.
|
| I don't want to give up SO-DIMMs for a few mm thinner laptop,
| but going from the intel/amd standard 70GB/sec to 400GB/sec
| is a pretty big incentive.
| r00fus wrote:
| Aside from Apple's processors, is 1.5 TFlops in 5 W possible
| with other archs?
| roxgib wrote:
| Apple Silicon chips share memory between the CPU and GPU,
| would that play into any calculation of the relative
| benefits? Presumably the GPU isn't getting the full benefits
| of a GPU-optimised memory setup, so the difference would be
| smaller?
| touisteur wrote:
| The GPU people are also reaching for simd and fixed matmul hw
| to increase perf. Tensor Cores (int, fp16, tf32 and even fp64
| on A100) and the new DPX instructions. RT cores are a different
| kind of horse but still specialized for BVH traversal and ray-
| triangle intersection.
| theLiminator wrote:
| We're reaching a point where CPUs are getting increasingly
| specialized, and GPUs are becoming increasingly generalized.
| Going at improvements from both sides of the sandwich.
| thechao wrote:
| That's how we felt when we were writing the software rasterizer
| for Larrabee! The issue is that that 1.5TFLOP is probably way
| more power than the M1 GPU's ~2.5TFLOP. The second issue is
| that a SW rasterizer is going to spend ~50% of its budget
| emulating fixed function. So now you're using way more power,
| for 1/4 the perf (best case). Also, you can't run any other apps,
| and you're probably going to have bandwidth issues to the
| display controller.
|
| GPUs are an optimization to try to use the excess Moore's law
| we have to get to the ghost of Dennard's law.
| bob1029 wrote:
| I think a power/latency/perf tradeoff could be agreeable for
| certain applications. GPUs in the cloud are not exactly
| cheap. Many gaming experiences do not require nanite-level
| graphics.
|
| Building something that can reliably output reasonable-
| quality 3d graphics without relying on specific GPU
| technologies will give you a much broader realm to operate
| with.
|
| I believe something along this path is the solution for
| streaming gaming. I perceive the failure of Stadia et al.
| as being a consequence of trying to bolt streaming onto
| existing GPU-based, local gaming solutions. Build something
| from scratch with streaming/latency as a #1 priority, and you
| can dramatically expand the operational radius of each
| datacenter (e.g. ~100km per millisecond saved).
| dotnet00 wrote:
| I feel like that's a somewhat out-of-touch interpretation,
| as Stadia failed largely because of Google's terrible
| reputation and the (completely valid) concerns from gamers
| about companies intending to turn even single player games
| into fragmented streaming platforms where the content is
| entirely dependent on the whims of the company (a fitting
| example being Google doing its thing and killing Stadia).
| They had no shortage of GPUs.
|
| NVIDIA's streaming service is doing relatively fine in
| comparison. They simply share a GPU between several users
| for anything that isn't demanding enough. They also get
| around some of the concerns about gaming being turned into
| another streaming style fragmented mess by not actually
| selling the games. You simply log into your account on
| Steam/GOG/whatever and play the games you already own as
| you might on a local PC.
|
| Additionally, "building something that can reliably output
| reasonable-quality 3d graphics without relying on specific
| GPU technologies" doesn't make much sense to me. If it's an
| accelerator designed to handle relatively modern 3d
| graphics, due to the programmability of a modern graphics
| pipeline it's effectively just a GPU. There aren't any
| underlying technologies that are required to be used as
| long as they can produce a similar output (mobile GPUs tend
| to have a different approach to how they implement the
| graphics pipeline compared to desktop GPUs for instance).
| jvanderbot wrote:
| Light is 300km/ms in a vacuum. Is it that much slower
| through switched fiber?
| Someone wrote:
| Signal speed in fiber is about 2/3 of that in vacuum
| (https://en.wikipedia.org/wiki/Optical_fiber#Refractive_index),
| but fiber won't be straight-line between sender and receiver,
| light doesn't move in a straight line inside the fiber, and
| the _switched_ adds delays.
|
| https://www.pingdom.com/blog/theoretical-vs-real-world-speed...:
| _"you should probably double the "ideal" response times shown
| above for a more realistic target to aim at"_
|
| So yes, 1/3 of light speed in vacuum seems a decent
| heuristic.
| rkangel wrote:
| Speed of light in glass is about 2/3 of the speed of
| light in a vacuum (refractive index of glass is around
| 1.5).
| MobiusHorizons wrote:
| Round trip latency matters here, which would get you down
| to 150km without any slowdown through fiber.
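|
| (Rough numbers: 300 km/ms in vacuum gives 150 km per ms of
| round trip; at ~2/3 c in fiber that drops to ~100 km per ms,
| which lines up with the ~100 km/ms figure upthread.)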
| thechao wrote:
| Well; except for the fact that the 1.5TFLOP quoted in the
| article is because of the AMX part. The _actually useful_
| throughput of the big-core is probably more like 35GFLOP
| _peak_. This compares to the 1-2TFLOP throughput of the
| GPU. The CPU is easily going to be 50-100x slower than the
| GPU.
|
| If you're talking full-screen Angry Birds with, say, a 2x
| average compositing, you're going to be fine on the CPU;
| but, energy- and jitter-wise you'll still be happier with
| the GPU, overall.
| superkuh wrote:
| It's a fast console, there's no doubt of that. Kind of like the
| Playstation 3 when it came out. Fast, not much software support
| without lots of special considerations, non-upgradable hardware,
| limited peripheral support. All in all, a fast CPU embedded in a
| marginal console-like "computer". People out there who were
| tricked into buying the M1 8GB ram version can confirm.
| 2fast4you wrote:
| Tricked how? I've got an M2 8GB and loving it
| smoldesu wrote:
| Even with swap, 8gb is pretty paltry on a memory-hungry
| system like MacOS, let alone a system that shares GPU and
| system memory. 16gb is the minimum for me even though I
| really only edit/compile code, and even then it can be pretty
| easy to max out your system memory after a couple docker
| containers...
|
| It might not be a 'trick' per se, but anyone who intends to
| use a Mac for work should consider upgraded memory (IMO).
| 2fast4you wrote:
| I will agree with your last point. If I had bought this
| machine for doing serious development I would've gone for
| 16gb. Saying that, I've been pleasantly surprised with its
| power. I've been playing with Metal, throwing together a
| native version of ShaderToy, and it hasn't felt underpowered
| once. Even when running the iPad emulator.
|
| I did feel a little duped when I learned that some M1/M2
| machines can only support one external monitor. Now I have
| to replace my two monitors with a widescreen.
| smoldesu wrote:
| IMO, the 'problem' is that MacOS will use 4-5gb OOB, and
| using an Electron app with a browser open will easily
| push that into swap space. For most daily drivers, even
| light users, they'll be happy to have upgraded memory.
| 2fast4you wrote:
| Right now with just safari and a few background things
| I'm hovering at 6gb in use, so you're not wrong about how
| much memory is being used. Regardless I don't think it's
| a problem for light users. A light user imo would be just
| browsing and email. 8GB will give you plenty of headroom
| in that case.
|
| I'm going to keep an eye on ram usage for the next few
| days. I'm curious what it will look like on a more full
| workload because if things have been swapping out, I
| haven't noticed.
| robertoandred wrote:
| I love how Apple hater rhetoric hasn't changed in 30 years.
| kolbusa wrote:
| Nitpick... This paragraph is somewhat confusing. I think it is
| worded incorrectly:
|
| _> Let's simplify the problem and implicitly transpose the
| matrix multiplication. Both A and B (our inputs) will have K (our
| reduction dimension) as the leading dimension. This doesn't
| really matter much in practice, but it simplifies our code a
| lot._
|
| The code is
|
|     C[n * 16 + m] += A[k * 16 + m] * B[k * 16 + n];
|
| Which means that actually *m* is the leading dimension of A with
| stride 16, and for B it is *n* with stride 16.
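|
| A plain-C reference version of that loop (just a sketch of what
| the snippet seems to describe: a 16x16 tile of C, with K as the
| outer loop so each k step is an outer product of a column of A
| and a column of B):
|
|     void tile_matmul(const float *A, const float *B,
|                      float *C, int K) {
|         // A[k*16 + m] and B[k*16 + n]: both stored K-major.
|         for (int k = 0; k < K; k++)
|             for (int n = 0; n < 16; n++)
|                 for (int m = 0; m < 16; m++)
|                     C[n * 16 + m] +=
|                         A[k * 16 + m] * B[k * 16 + n];
|     }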
| nimish wrote:
| How is FP64? Nvidia crippled the 4090 to just 1.3T FP64 flops,
| so if a Mac mini with an M1 could match that it'd be a solid win.
| bsdnoob wrote:
| You know you can see your upvoted stories right?
| [deleted]
| varunkmohan wrote:
| Posts like these are always awesome to look at how much we can
| push consumer hardware.
|
| It's hard not to really appreciate some of the devices we have
| today. For instance, an RTX 4090 is capable of 660 TFlops of FP8
| (MSRP $1600). Would not be surprised if we soon have laptops
| that can do petaflops of computation!
| kristianp wrote:
| Does anyone have a comparison with Intel's Deep Learning Boost
| or VNNI, which is available on AVX-512 processors such as
| https://ark.intel.com/content/www/us/en/ark/products/213805/...
| gcr wrote:
| It's amazing to me that there are four separate pieces of
| hardware in M1 devices that can do matrix multiplies.
|
| In addition to running on the CPU, M1 Max devices have three
| separate kinds of hardware-accelerated `gemm`: the GPU, the ANE
| (Apple Neural Engine), and this special matrix coprocessor.
| Here's a fairly detailed post that benchmarks each:
|
| https://tlkh.dev/benchmarking-the-apple-m1-max
|
| And here's a great post about the justification for having so
| much special-purpose hardware:
|
| https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...
|
| As for the matrix coprocessor, Apple's built-in BLAS
| implementation (Accelerate.framework) uses this chip. You can
| link Numpy against this to benefit in your Python programs, for
| example. Here are some old instructions:
| https://gist.github.com/MarkDana/a9481b8134cf38a556cf23e1e81...
|
| All this represents yet another cycle on the Wheel of
| Reincarnation...
| (http://catb.org/jargon/html/W/wheel-of-reincarnation.html)
| amelius wrote:
| Isn't this wheel of reincarnation simply a result of a shifting
| bottleneck? A computation can be CPU-bound or memory-bound, and
| this can change over hardware generations.
| gcr wrote:
| Makes sense... We're also seeing energy efficiency and model
| size and latency becoming significant constraints these days,
| and the more unique constraints an application has, perhaps
| the more beneficial it is to have many different
| implementations with different tradeoffs.
| ElectricalUnion wrote:
| > energy efficiency (...) many different implementations
|
| Yep, thermal throttling is a thing, and sometimes what you
| need is either useless silicon padding or some specialized
| (and most of the time dark) silicon, both to make the chip
| feasible to cool and to prevent it from melting.
| roxgib wrote:
| I suspect Apple was more worried about battery use in
| this case.
| throw10920 wrote:
| It is, but the fact that the bottleneck has shifted multiple
| times (as opposed to just this one recent time) is nonobvious
| (to someone unfamiliar with computing history) and worthy of
| pointing out.
| Dylan16807 wrote:
| > All this represents yet another cycle on the Wheel of
| Reincarnation...
|
| Isn't this adding new cores directly onto the main chip? That
| doesn't sound like it fits to me.
|
| And at this point GPUs have been straddling both sides of the
| divide for decades, depending on the particular device form
| factor and the necessary power.
|
| The only thing I would actually say has gone through a _cycle_
| lately is the crypto accelerator for mac SSDs.
| TimTheTinker wrote:
| > Isn't this adding new cores directly onto the main chip?
| That doesn't sound like it fits to me.
|
| These are _coprocessors_ , which are a very different thing
| from just another CPU core. For one, they use a different
| architecture (instruction set, registers/memory, etc.).
|
| The "wheel of reincarnation" refers to features/capabilities
| on coprocessors eventually being folded into the main CPU.
| While CPUs have adopted insights from GPU implementations,
| GPU functionality has never been fully folded into CPUs
| (software rasterizers don't count).
| lalaithion wrote:
| There's also the media encoder hardware accelerator, which
| isn't quite `gemm`, but certainly contains hardware that
| performs `mm`s.
| MrBuddyCasino wrote:
| Since there is no summary, these are the benchmark findings:
|
|     AMX co-processor  2   TFLOPS FP32
|     GPU               8   TFLOPS FP32
|     Neural Engine     5.5 TFLOPS FP16
| Firadeoclus wrote:
| Note that AMX can achieve roughly double the FLOPS with FP16,
| and 8 TFLOPS for the GPU is only about 77% of peak. You can
| do better than that; especially using FP16, 90+% of peak is
| possible (which is >9.4 TFLOPS).
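|
| (The M1 Max GPU's nominal FP32 peak is about 10.4 TFLOPS, so
| 8 / 10.4 is ~77%.)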
| londons_explore wrote:
| So why would you choose to use the Neural Engine rather than
| the GPU?
|
| Just power efficiency?
| potatolicious wrote:
| That and if you want to use the GPU at the same time.
| londons_explore wrote:
| Is there any easy way to use all of these at the same time?
| Ie. some library you can ask to do a big matrix multiply and
| it will loadbalance between the bits of hardware?
|
| Or do you have to manually split the computation between
| them?
| thewebcount wrote:
| I'm by no means an expert in any of this. I mainly work on
| video processing using the GPU. That said, I would think if
| any library would do load balancing between them, it would
| likely be the Accelerate.framework that ships with the
| system.
|
| However, I do have some experience with having the same
| code run on the GPU and the CPU. In my work, we have tried
| breaking images (usually frames of video) into various
| sized chunks and processing on both the CPU and GPU at the
| same time. Our conclusion is that the overhead of using
| both outweighs any benefit you'd get. The GPU is so much
| faster than the CPU, there's no point in involving the CPU
| at all. These experiments were done several years ago, so
| perhaps the landscape has changed since then, but that was
| what we found.
| jasonwatkinspdx wrote:
| You might find David Wright's presentations about Unreal
| 5 interesting:
|
| https://highperformancegraphics.org/slides22/Journey_to_Nani...
|
| https://advances.realtimerendering.com/s2022/index.html#Lume...
|
| They're great presentations with a lot of depth in the
| notes. I think videos are around somewhere if you prefer
| that.
|
| Two specifics I'd mention:
|
| It seems a lot of games now use feedback between frames
| as a way to tolerate the latency of moving data between
| CPU and GPU. E.g. the CPU will use GPU-crunched data from
| the previous frame as a source for CPU crunching that
| optimizes what data gets passed to the GPU next.
|
| The other is that fixed functionality is moving into
| shaders. Unreal 5 uses a mix of hardware rasterization
| and software rasterization in a shader (and path tracing
| now as well). There the tradeoff between the two is
| triangle size in pixels.
| thewebcount wrote:
| Oh wow! Thanks! That looks really cool.
| jasonwatkinspdx wrote:
| They're great. I dunno if you find 3d interesting vs
| video, but the section in that nanite presentation where
| he goes through how he arrived at the LoD clustering
| design is some of the smartest stuff I've ever seen any
| developer say, ever. Like John Carmack probably saw this
| and went "dang, wish I'd thought of that" levels of
| smart.
| muricula wrote:
| Some folks may be interested in the Armv9 Scalable Matrix
| Extensions which appear to do something very very similar.
| https://community.arm.com/arm-community-blogs/b/architecture...
| FL33TW00D wrote:
| I love all the posts by Bram. Please keep writing them!
| NelsonMinar wrote:
| Does Apple use the AMX in their own code? Is anything like the
| AMX present in their mobile CPUs?
| danieldk wrote:
| The AMX units are really nice, especially because you can use
| them simply with standard _sgemm_ through the Accelerate
| framework. However, in most applications where latency is not
| an issue, you'll probably want to use Metal Performance
| Shaders instead; not only are they much faster for most
| applications, they can also be more energy efficient.
|
| For instance, we did benchmarks of spaCy (natural language
| processing) transformer models across various Apple Silicon SoCs
| and MPS was 1.9x (M1) to 5.5x faster (M1 Ultra) while providing
| far more performance per Watt. E.g. using MPS on an M2 MacBook
| Air used 4W less energy while being 2.7x faster than AMX.
|
| Full benchmarks are at the end of this post:
|
| https://explosion.ai/blog/metal-performance-shaders
| atonse wrote:
| I remember driving to college nearly 20 years ago and one of the
| headlines on the radio (NPR morning show) was that the DOE had
| unveiled the world's fastest supercomputer at the White House
| that day. It was going to do a whopping 6 Teraflops, and they
| explained what that meant. And I remember thinking about all the
| possibilities with that kind of compute.
|
| I understand that this 1.5 TFlops may not be an exact comparison
| (or maybe it's the same), but if it's even within an order of
| magnitude, it is beyond mind-blowing, and we've just crossed
| over into exaflops at the supercomputer level.
___________________________________________________________________
(page generated 2023-01-05 23:00 UTC)