[HN Gopher] Ask HN: Resources for general purpose GPU developmen...
___________________________________________________________________
Ask HN: Resources for general purpose GPU development on Apple's M*
chips?
While Apple M* chips seem to have incredible unified memory
access, the available learning resources seem to be quite
restricted and often convoluted. Has anyone been able to get past
this barrier? I have some familiarity with general purpose software
development with CUDA and C++. I want to figure out how to work
with/use Apple's developer resources for general purpose
programming.
Author : thinking_banana
Score : 83 points
Date : 2024-12-25 16:58 UTC (6 hours ago)
| barkingcat wrote:
| There is no general purpose GPU development on Apple M series.
|
| There is Metal development. You want to learn Apple M-series GPU
| and GPGPU development? Learn Metal!
|
| https://developer.apple.com/metal/
| rowanG077 wrote:
| If you are open to running Linux, you can use standard OpenCL
| and Vulkan.
| feznyng wrote:
| Besides the official docs, you can check out llama.cpp as an
| example that uses Metal for accelerated inference on Apple
| Silicon.
| dylanowen wrote:
| People have already mentioned Metal, but if you want cross
| platform, https://github.com/gfx-rs/wgpu has a vulkan-like API
| and cross compiles to all the various GPU frameworks. I believe
| it uses https://github.com/KhronosGroup/MoltenVK to run on Macs.
| You can also see the metal shader transpilation results for
| debugging.
| grovesNL wrote:
| wgpu has its own Metal backend that most people use by default
| (not MoltenVK).
|
| There is also a Vulkan backend if you want to run Vulkan
| through MoltenVK though.
| dylanowen wrote:
| Oh good to know! It's been a while since I've looked at the
| osx implementation
| tehsauce wrote:
| the metal backend does currently generate quite a lot of
| unnecessary command buffers, but in general performance seems
| solid.
| rudedogg wrote:
| Given what the OP asked for, I don't think wgpu is the right
| choice. They want to push the limits of Apple Silicon, or do
| Apple platform-specific work, so an abstraction layer like wgpu
| is going in the opposite direction, in my opinion.
|
| Metal, and Apple's docs are the place to start.
| morphle wrote:
| You can help with the reverse engineering of Apple Silicon being
| done by a dozen people worldwide; that is how we find out the
| GPU and NPU instructions [1-4]. There are over 43 trillion float
| operations per second to unlock, at 8 terabits per second of
| 'unified' memory bandwidth and 270 gigabits per second of
| networking (less on the smaller chips)....
|
| [1] https://github.com/AsahiLinux/gpu
|
| [2] https://github.com/dougallj/applegpu
|
| [3] https://github.com/antgroup-skyward/ANETools/tree/main/ANEDi...
|
| [4] https://github.com/hollance/neural-engine
|
| You can use high-level APIs like MLX, Metal or CoreML to
| compute other things on the GPU and NPU.
|
| Shadama [5] is an example programming language that translates
| (with Ometa) matrix calculations into WebGPU or WebGL APIs (I
| forget which). You can do exactly the same with the MLX, Metal or
| CoreML APIs and only pay around 3% overhead going through the
| translation stages.
|
| [5] https://github.com/yoshikiohshima/Shadama
|
| I estimate it will cost around $22K at my hourly rate to
| completely reverse engineer the latest A16 and M4 CPU (ARMv9),
| GPU and NPU instruction sets. I think I am halfway through the
| reverse engineering; the debugging part is the hardest problem.
| You would however not be able to sell software built with it on
| the App Store, as Apple forbids undocumented APIs and bare
| metal instructions.
| KeplerBoy wrote:
| Where does the 270 Gbit/s networking figure come from? Is it
| the aggregate bandwidth of the PCIe slots on the Mac Pro,
| which could support NICs at those speeds (and above, according
| to my quick maths#)? There is not really any driver support
| for modern Intel or Mellanox/Nvidia NICs as far as I can tell.
|
| My use case would be hooking up a device which spews out sensor
| data at 100 Gbit/s over QSFP28 Ethernet as directly to a GPU as
| possible. The new Mac mini has the GPU power, but there's no
| way to get the data into it.
|
| # 2x Gen4x16 + 4x Gen3x8 = 2 * 31.508 GB/s + 4 * 7.877 GB/s
| ≈ 90 GB/s = 720 Gbit/s
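The footnoted arithmetic can be reproduced from per-lane PCIe rates (a sketch; the per-lane figures are the usual post-128b/130b-encoding numbers, ~1.969 GB/s for Gen4 and ~0.985 GB/s for Gen3):

```python
# Back-of-the-envelope PCIe aggregate bandwidth for 2x Gen4x16 + 4x Gen3x8.
GEN4_LANE_GB_S = 1.969  # GB/s per PCIe Gen4 lane after encoding
GEN3_LANE_GB_S = 0.985  # GB/s per PCIe Gen3 lane after encoding

total_gb_s = 2 * 16 * GEN4_LANE_GB_S + 4 * 8 * GEN3_LANE_GB_S
total_gbit_s = total_gb_s * 8

print(f"{total_gb_s:.1f} GB/s = {total_gbit_s:.0f} Gbit/s")
```

This comes out at ~94.5 GB/s (~756 Gbit/s), slightly above the rounded ~90 GB/s / 720 Gbit/s figure quoted above.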
| morphle wrote:
| > Where does the 270 gbit/s networking figure come from? Is
| it the aggregate bandwidth from the pcie slots on the Mac pro
|
| We both should restate and specify the calculation for each
| different Apple Silicon chip and the PCB/machine model it is
| wired onto.
|
| The $599 M4 Mac mini base model networking (aggregated WiFi,
| USB-C, 10G Ethernet, Thunderbolt PCIe) is almost 270 Gbps.
| Your 720 Gbps is for a >$8000 Mac Pro M2 Ultra, but the number
| is too high because the 2x Gen4x16 is shared/oversubscribed
| with the other PCIe lanes for the x8 PCIe slots, SSD and
| Thunderbolt. You need to measure/benchmark it, not read the
| marketing PR.
|
| I estimate the $1400 M4 Pro Mac mini networking bandwidth by
| adding the external WiFi, 10 Gbps Ethernet, two USB-C ports
| (2 x 10 Gbps) and three Thunderbolt 5 ports (3 x 80/120 Gbps),
| subtracting the 64 Gbps PCIe limit and not counting the
| internal SSD. Two $599 M4 Mac mini base models are faster and
| cheaper than one M4 Pro Mac mini.
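Taking the figures as stated in the comment above (not measured numbers; WiFi and the PCIe subtraction left out for simplicity), the wired-port sum works out like this:

```python
# Reproducing the comment's port-aggregation estimate for the M4 Pro
# Mac mini, using the per-port figures as stated (not benchmarks).
ethernet_gbps = 10        # 10 Gbps Ethernet
usb_c_gbps = 2 * 10       # two USB-C ports at 10 Gbps each
thunderbolt_gbps = 3 * 80 # three Thunderbolt ports at 80 Gbps each

total_gbps = ethernet_gbps + usb_c_gbps + thunderbolt_gbps
print(total_gbps)  # 270
```

As the comment itself notes, actual aggregate throughput would have to be benchmarked, not summed from port specs.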
|
| The point of the precise actual measurements I did of the
| trillions of operations per second and the billions of bits
| per second of networking/interconnect of the M4 Mac mini
| against all the other Apple Silicon machines is to find which
| package (chip plus PCB plus case) has the best
| price/performance/watt, balanced against them networked
| together. In January 2025 you can build the cheapest, fastest
| supercomputer in the world from just off-the-shelf M4 16GB Mac
| mini base models with 10G Ethernet, MikroTik 100G switches and
| a few FPGAs. It would outperform all Nvidia, Cerebras,
| Tenstorrent and datacenter clusters I know of, mainly because
| of the low-power Apple Silicon.
|
| Note that the M4 has only 1.2 Tbps unified memory bandwidth
| and the M4 Pro has double that. The 8 Tbps unified memory
| bandwidth is on the M1 and M2 Studio Ultra with 64/128/192GB
| DRAM. Without it you can't reach 50 trillion operations per
| second. A Mac Studio has only around 190 Gbps external
| networking bandwidth and does not reach 43 TOPS; neither does
| the 720 Gbps of your Mac Pro estimate. By reverse engineering
| the instruction set you could squeeze a few percent extra
| performance out of this M4 cluster.
|
| The 43 TOPS of the M4 itself is an estimate. The ANE does
| 34 TOPS, the CPU less than 5 TOPS depending on float type,
| and we have no reliable benchmarks for the GPU floating
| point.
| KeplerBoy wrote:
| The PCIe configuration was taken from the Mac Pro and its
| M2 Ultra. https://www.apple.com/mac-pro/
|
| I'd assume the Mac mini has a less extensive PCIe/TB
| subsystem.
|
| No idea what people are doing with all those PCIe slots
| except for NVMe cards. I wonder how hard it would be to
| talk to a PCIe FPGA.
| morphle wrote:
| You use SerDes high-speed serial links (up to 224 Gbps in
| 2025) to communicate between chips. A PCIe lane is just a
| SerDes with a 30% packet protocol overhead that uses DMA
| to copy bytes between two SRAM or DRAM buffers.
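As a rough sketch of where that packet overhead comes from: each PCIe transaction-layer packet carries a fixed number of header/framing bytes alongside its payload, so link efficiency depends on payload size. The 24-byte overhead below is an assumed illustrative figure, not a measured one:

```python
# Illustrative TLP efficiency: fixed per-packet overhead vs. payload size.
OVERHEAD_BYTES = 24  # assumed: TLP header + sequence/CRC + framing

def tlp_efficiency(payload_bytes: int) -> float:
    """Fraction of link bytes that carry actual payload."""
    return payload_bytes / (payload_bytes + OVERHEAD_BYTES)

for payload in (64, 128, 256):
    print(payload, round(tlp_efficiency(payload), 2))
```

With 64-byte payloads this lands near the ~30% overhead mentioned above; larger payloads amortize the fixed cost.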
|
| You aggregate PCIe lanes (x16, x8, x4/Thunderbolt, x1).
| You could also build mesh networks from SerDes, but then
| instead of PCIe switches you would need SerDes switches
| or routers (Ethernet, NVLink, InfiniBand).
|
| You need those high-speed links between chips for much
| more than SSD/NVMe cards: other NAS, processors,
| Ethernet/internet, cameras, WiFi, optics, DRAM, SRAM,
| power, etc. For inter-core communication (between
| processors or between chiplets), between networked PCBs,
| between DRAM chips (DDR5 is just another SerDes
| protocol), flash chips, camera chips, etc. Any other chip
| at faster than 250 Mbps speeds.
|
| I aggregate all the M4 Mac mini ports into an M4 cluster
| by mesh networking all its SerDes/PCIe with FPGAs into a
| very cheap, low-power supercomputer with exaflop
| performance. Cheaper than Nvidia. I'm sure Apple does the
| same in their data centers.
|
| My talk [1] on wafer-scale integration and free-space
| optics goes deeper into how and why SerDes and PCIe will
| be replaced by fiber optics and free-space optics for
| power reasons. I'm sure several parallel 2 GHz optical
| lambdas per fiber (but no SerDes!) will be the next step
| in Apple Silicon as well: the M4 power budget already is
| mostly in the off-chip SerDes/Thunderbolt networking
| links.
|
| [1] https://vimeo.com/731037615
| KeplerBoy wrote:
| > I aggregate all the M4 Mac mini ports into an M4 cluster
| by mesh networking all its SerDes/PCIe with FPGAs into a
| very cheap, low-power supercomputer with exaflop
| performance. Cheaper than Nvidia. I'm sure Apple does the
| same in their data centers.
|
| That sounds super interesting, do you happen to have some
| further information on that? Is it just a bunch of FPGAs
| issuing DMA TLPs?
| morphle wrote:
| This is not the first time people have built supercomputers
| from off-the-shelf Apple machines [1].
|
| M4 supercomputers are cheaper, and it will also mean lower
| Capex and Opex than most datacenter hardware.
|
| >do you happen to have some further information on that?
|
| Yes, the information is in my highly detailed custom
| documentation for the programmers and buyers of 'my'
| Apple Silicon supercomputer, Squeak and Ometa DSL
| programming languages and adaptive compiler. You can
| contact me for this highly technical report and several
| scientific papers (email in my profile).
|
| Do you know of people who might buy a supercomputer
| based on better specifications? Or even just buyers who
| will go for 'the lowest Capex and the lowest Opex
| supercomputer in 2025-2027'?
|
| Because the problem with HPC is that almost all funders
| and managers buy supercomputers with a safe brand name
| (Nvidia, AMD, Intel) at triple the cost, and seldom from
| a supercomputer researcher such as myself. But some do,
| if they understand why. I have been designing, selling,
| programming and operating supercomputers since 1984 (I
| was 20 years old then); this M4 Apple Silicon cluster
| will be my ninth supercomputer. I prefer to build them
| from the ground up with our own chip and wafer-scale
| integration designs, but when an off-the-shelf chip is
| good enough I'll sell that instead.
| Price/performance/watt is what counts; ease of
| programming is a secondary consideration relative to the
| performance you achieve. Alan Kay argues you should
| rewrite your software from scratch [2] and do your own
| hardware [3], so that is what I've done since I learned
| from him.
|
| >Is it just a bunch of FPGAs issuing DMA TLPs?
|
| No. The FPGAs are optional, for when you want to flatten
| the inter-core (= inter-SRAM-cache) networking with
| switches or routers to a shorter-hop topology for the
| message passing, like a Slim Fly diameter-two topology
| [4].
|
| DMA (Direct Memory Access) TLPs (Transaction Layer
| Packets) are one of the worst ways of doing inter-core
| and inter-SRAM communication, and on PCIe it has a huge
| 30% protocol overhead at triple the cost. Intel (and most
| other chip companies like Nvidia, Altera, AMD/Xilinx)
| can't design proper chips because they don't want to
| learn about software [2]. Apple Silicon is marginally
| better.
|
| You should use pure message passing between processes,
| preferably in a programming language and a VM that use
| pure message passing at the lowest level (Squeak,
| Erlang). Even better if you then map those software
| messages directly to message-passing hardware as in my
| custom chips [3].
|
| The reason to reverse engineer Apple Silicon instructions
| for the CPU, GPU and ANE is to be able to adapt my
| adaptive compiler to M4 chips, but also to repurpose PCIe
| for low-level message passing with much better
| performance and latency than DMA TLPs.
|
| To conclude, if you want the lowest-Capex, lowest-Opex M4
| Mac mini supercomputer, you need to rewrite your
| supercomputing software in a high-level language and
| message-passing system like the parallel Squeak Smalltalk
| VM [3] with adaptive load-balancing compilation. C, C++,
| Swift, MPI or CUDA would result in sub-optimal software
| performance and orders of magnitude more lines of code
| when optimal performance of parallel software is the
| goal.
|
| [1]
| https://en.wikipedia.org/wiki/System_X_(supercomputer)
|
| [2] https://www.youtube.com/watch?v=ubaX1Smg6pY
|
| [3] https://vimeo.com/731037615
|
| [4] https://www.youtube.com/watch?v=rLjMrIWHsxs
| ricktdotorg wrote:
| sounds (at least at a high level) similar to EXO[1]
|
| [1] https://github.com/exo-explore/exo
| morphle wrote:
| Here's a video of testing Exo to run huge LLMs on a cluster
| of M4 Macs [1] more cheaply than with a cluster of Nvidia
| RTX 4090s.
|
| [1] https://www.youtube.com/watch?v=GBR6pHZ68Ho
| _zoltan_ wrote:
| It's very weird to add together all kinds of very different
| networking solutions (WiFi, wired ethernet, TB) and talk
| about their aggregate potential bandwidth as a single
| number.
| dgfitz wrote:
| It's too bad Apple doesn't make this easier on developers.
| Is there a reason I don't see?
| morphle wrote:
| There certainly is a reason, and indeed you don't see it
| because Apple downplays these things in their PR.
|
| It might be the same reason that is behind Nvidia's CUDA
| moat. CUDA lock-in prevented competitors like AMD and
| Intel from convincing programmers and their customers to
| switch away from CUDA. So there was no software ported to
| their competing GPUs. So you get antitrust lawsuits [1].
|
| I think you should put yourself in Apple's management
| mindset and then reason. I suspect they think they will
| not sell more iPhones or Macs if they let third-party
| developers access the low-level APIs and write faster
| software.
|
| They might reason that if no one knows the instruction
| sets, hackers will write less code to break security.
| Security by obscurity.
|
| They certainly think that blocking competitors from reverse
| engineering the low power Apple Silicon and blocking them
| from using TSMC manufacturing capacity will keep them the
| most profitable company for another decade.
|
| [1] https://news.ycombinator.com/item?id=40593576
| dylan604 wrote:
| At this point, Apple is absolutely not afraid of an
| antitrust lawsuit. To them, it is part of the cost of
| doing business.
| morphle wrote:
| I concur, they are virtually untouchable in this respect.
| No one else will throw a trillion or more into developing
| lower power faster silicon.
| _zoltan_ wrote:
| CUDA didn't prevent anything, at least not in the way you
| believe.
|
| Intel and AMD had no competitive offer, period. They still
| don't.
|
| NVIDIA is simply offering an ecosystem that is battle-
| tested and ready out of the box. Look at the recent
| SemiAnalysis test to see how not-ready AMD is, and they
| would be the only company with a real shot at this. Their
| HW on paper is better or equal, yet their software
| ecosystem is nowhere near ready.
| AnthonyMouse wrote:
| > Look at the recent semianalysis test to see how not
| ready AMD is, who would be the only company to have a
| real shot at this. Their HW on paper is better or equal,
| yet their software ecosystem is nowhere ready.
|
| Reading that was kind of odd. It seems like their
| conclusion was that on paper AMD should be significantly
| less expensive and significantly faster, whereas in
| practice they're significantly less expensive and
| slightly slower because of unoptimized software, which
| actually seems like it'd still be a pretty good deal.
| Especially if the problem _is_ the software, because then
| the hardware could get better with a software update
| after you buy it.
|
| They also spend a lot of time complaining about how much
| trouble it is to install the experimental releases with
| some improvements that aren't in the stable branch yet,
| but then the performance difference was only big in a few
| cases and in general the experimental version was only a
| couple of percent faster, which either way should end up
| in the stable release in the near future.
|
| And they do a lot of benchmarks on interconnect bandwidth
| which, fair enough, Nvidia currently has some hardware
| advantage. But that also mainly matters to the small
| handful of companies doing training for huge frontier
| models and not to the far larger number of people doing
| inference or training smaller models.
| twoodfin wrote:
| Apple wants total freedom to rework lower levels of the stack
| down to the hardware, without worrying about application
| compatibility, hence their answer will continue to be Metal.
| morphle wrote:
| I agree that it allows Apple to redefine Apple Silicon
| instruction sets without having to explain them to
| third-party software developers, but it is certainly not
| the main reason they hide the technical documentation of
| the chips.
| MBCook wrote:
| Why not?
|
| Metal is the answer. Everything else is just
| implementation detail as GP said.
|
| Apple doesn't provide developer support to other OSes.
| The only OS they do anything for* is macOS. So to them
| there's no point.
|
| All they'd get is people relying on implementation
| details they shouldn't, other companies stealing what
| they consider their trade secrets, or more surface area
| for patent trolls to scan.
|
| * Someone on the Asahi team, I think Hector Martin, has
| commented before that Apple is doing things that clearly
| seem designed to allow others to make and securely boot
| other OSes on their Apple Silicon hardware. They clearly
| could be clamping it down far more but are choosing not
| to. However, that's exactly as far as the support appears
| to go.
| amelius wrote:
| > but it is certainly not the main reason they hide the
| technical documentation of the chips
|
| What _is_ the main reason?
| morphle wrote:
| >What _is_ the main reason?
|
| I can't guess what the main reason is. There might not
| even be a main reason, as many groups of people at Apple
| and its shareholders decided this over the years.
|
| (Also see my speculations below in this thread).
|
| So not in any order of importance to Apple:
|
| 1) Create the same moat as NVIDIA has with CUDA.
|
| 2) Ability to re-define the microcode instruction set of
| all the dozens of different Apple Silicon chips now and
| in the future without having to worry about backwards
| compatibility. Each Apple Silicon chip simply recompiles
| code at runtime (similar to my adaptive compiler).
|
| 3) Zero hardware documentation needed, much cheaper PR
| and faster time to market, also making it harder to
| reverse engineer or repair.
|
| 4) Security. Security by obscurity
|
| 5) Keeping the walled garden up longer.
|
| 6) Frustrating reverse engineering of Apple software
|
| 7) Frustrating reverse engineering of Apple hardware
|
| 8) It won't make Apple more sales if 3rd party developers
| can write faster and more energy efficient GPU and NPU
| software.
|
| 9) Legal and patent infringements considerations
|
| 10) Future compiler improvements
|
| 11) Trade secrets
| amelius wrote:
| 9) hiding known and/or unknown patent infringements
| MuffinFlavored wrote:
| This would remove the need for Metal to be the black box
| and enable things like an "nvptx CUDA" equivalent /
| https://libc.llvm.org/gpu/ right?
|
| Very interesting. A steal for $22k but I guess very niche for
| now...
| morphle wrote:
| Yes, knowing the exact CPU, GPU and ANE assembly
| instructions (or the underlying microcode!!) allows
| general software to adaptively compile processes onto all
| the core types, not just the CPU ones. It won't always be
| faster: you get more cache misses (some cores don't have
| caches), different DMA and thread scheduling, some
| registers can't fit the floats or large integers, etc.
| But yes, it will be possible to use all 140 cores of the
| M2 Ultra or the 36 cores of the M4. There will be an M6
| Extreme some day, maybe 500 cores?
|
| Actually, the GPU and ANE cores themselves are built from
| teams of smaller cores, maybe a few hundred or thousand
| in all, same as in most Nvidia chips.
|
| >A steal for $22k but I guess very niche for now...
|
| A single iPhone or Mac app (a game, an LLM, pattern
| recognition, a security app, a VPN, de/encryption, a
| video en/decoder) that can be sped up by 80%-200% can
| afford my faster assembly-level API.
|
| But a whole series of hardware-level zero-day exploits
| for iPhone and Mac, now that is not very niche at all. It
| is worth millions to reverse engineer Apple Silicon
| instruction sets.
| JackYoustra wrote:
Any place you have your current progress written up? Any
methodology I could help contribute to? I've read each of
the four links you've given over the years, and it seems
vague how far people have currently gotten and what the
exact issues are.
| mkagenius wrote:
| Check out MLX [1]. It's a bit like PyTorch/TensorFlow with
| the added benefit of Apple Silicon support.
|
| 1. https://ml-explore.github.io/mlx/build/html/index.html
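A minimal sketch of what MLX code looks like (assumes the `mlx` package on Apple Silicon; arrays follow its NumPy-like API, and computation is lazy until evaluated):

```python
# Minimal MLX sketch: build a lazy compute graph, then evaluate on device.
import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))
c = a @ b + 1.0  # nothing is computed yet; this records the graph
mx.eval(c)       # materializes the result (on the GPU on Apple Silicon)
print(c.shape)
```

The lazy-evaluation model is the main difference from eager NumPy-style code; otherwise the API maps over quite directly.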
| rgovostes wrote:
| It's hard to answer without knowing exactly what your aim
| is, your experience level with CUDA, how easily the
| concepts you know will map to Metal, and what you find
| "restricted and convoluted" about the documentation.
|
| <Insert your favorite LLM> helped me write some simple Metal-
| accelerated code by scaffolding the compute pipeline, which took
| most of the nuisance out of learning the API and let me focus on
| writing the kernel code.
|
| Here's the code if it's helpful at all.
| https://github.com/rgov/thps-crack
| nixpulvis wrote:
| 2024 and still finding cheat codes in Tony Hawk Pro Skater 2.
| Wild!
| selimthegrim wrote:
| If Jamie Kennedy is reading this, we still haven't found the
| cheat code to make you funny.
| thetwentyone wrote:
| I've had a good time dabbling with Metal.jl:
| https://github.com/JuliaGPU/Metal.jl
| amelius wrote:
| Apple is known to actively discourage general purpose computing.
| Better try a different vendor.
| codr7 wrote:
| Preferably one that sells computers, not fashion statements.
| likeabbas wrote:
| It's not a fashion statement, it's a fucking deathwish
| aleinin wrote:
| If you're looking for a high-level introduction to GPU
| development on Apple Silicon, I would recommend learning
| Metal. It's Apple's GPU programming framework, similar to
| CUDA for Nvidia hardware. I ported a set of puzzles for
| CUDA called GPU-Puzzles (a collection of exercises
| designed to teach GPU programming fundamentals) [1] to
| Metal [2]. I think it's a very accessible introduction to
| Metal and writing GPU kernels.
|
| [1] https://github.com/srush/GPU-Puzzles
|
| [2] https://github.com/abeleinin/Metal-Puzzles
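For a flavor of the puzzle style, here is a pure-Python illustration of the idea (a hypothetical harness, not the actual puzzle API): you write the body for a single thread, and a launcher invokes it once per thread index.

```python
# One "thread" of a map kernel: each invocation handles one element.
def map_kernel(out, a, thread_idx):
    out[thread_idx] = a[thread_idx] + 10

# Toy launcher: the GPU would run these invocations in parallel.
def launch(kernel, out, a, n_threads):
    for i in range(n_threads):
        kernel(out, a, i)

a = [0, 1, 2, 3]
out = [0] * 4
launch(map_kernel, out, a, 4)
print(out)  # [10, 11, 12, 13]
```

The real puzzles run the same per-thread logic as an actual GPU kernel, which is what makes them a gentle bridge to Metal's programming model.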
| dylan604 wrote:
| After a quick scan through the [2] link, I have added this to
| the list of things to look into in 2025
| desideratum wrote:
| I'd recommend checking out the CUDA MODE Discord server! They
| also have a channel for Metal: https://discord.gg/ZqckTYcv
| TriangleEdge wrote:
| Why not OpenCL or OpenGL? You won't be constrained by the
| flavor of GPU.
| billti wrote:
| If you know CUDA, then I assume you already know a bit about
| GPUs and the major concepts. There are just minor differences
| and different terminology for things like "warps" etc.
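A rough terminology map between the two (common correspondences; both execute in 32-wide SIMD groups, which CUDA calls a warp and Metal a SIMD-group):

```python
# CUDA -> Metal terminology, roughly.
cuda_to_metal = {
    "grid":          "grid",
    "thread block":  "threadgroup",
    "warp":          "SIMD-group",
    "thread":        "thread",
    "shared memory": "threadgroup memory",
}

for cuda_term, metal_term in cuda_to_metal.items():
    print(f"{cuda_term:14} -> {metal_term}")
```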
|
| With that base, I've found their docs decent enough, especially
| coupled with the Metal Shader Language pdf they provide
| (https://developer.apple.com/metal/Metal-Shading-Language-
| Spe...), and quite a few code samples you can download from the
| docs site (e.g.
| https://developer.apple.com/documentation/metal/performing_c...).
|
| I'd note a lot of their stuff was still written in Objective-C,
| which I'm not that familiar with. But most of that is boilerplate
| and the rest is largely C/C++ based (including the Metal shader
| language).
|
| I just ported some CPU/SIMD number crunching (complex matrices)
| to Metal, and the speed up has been staggering. What used to take
| days now takes minutes. It is the hottest my M3 MacBook has ever
| been though! (See
| https://x.com/billticehurst/status/1871375773413876089 :-)
___________________________________________________________________
(page generated 2024-12-25 23:00 UTC)