[HN Gopher] Ask HN: Resources for general purpose GPU developmen...
       ___________________________________________________________________
        
       Ask HN: Resources for general purpose GPU development on Apple's M*
       chips?
        
        While Apple M* chips seem to have incredible unified memory
        access, the available learning resources are quite restricted
        and often convoluted. Has anyone been able to get past this
        barrier? I have some familiarity with general purpose software
        development with CUDA and C++. I want to figure out how to
        work with/use Apple's developer resources for general purpose
        programming.
        
       Author : thinking_banana
       Score  : 83 points
       Date   : 2024-12-25 16:58 UTC (6 hours ago)
        
       | barkingcat wrote:
       | There is no general purpose GPU development on Apple M series.
       | 
        | There is Metal development. You want to learn Apple M-series
        | GPU and GPGPU development? Learn Metal!
       | 
       | https://developer.apple.com/metal/
        
       | rowanG077 wrote:
        | If you are open to running Linux, you can use standard
        | OpenCL and Vulkan.
        
       | feznyng wrote:
        | Besides the official docs, you can check out llama.cpp as an
        | example that uses Metal for accelerated inference on Apple
        | Silicon.
        
       | dylanowen wrote:
        | People have already mentioned Metal, but if you want cross-
        | platform, https://github.com/gfx-rs/wgpu has a Vulkan-like
        | API and cross-compiles to all the various GPU frameworks. I
        | believe it uses https://github.com/KhronosGroup/MoltenVK to
        | run on Macs. You can also see the Metal shader transpilation
        | results for debugging.
        
         | grovesNL wrote:
         | wgpu has its own Metal backend that most people use by default
         | (not MoltenVK).
         | 
         | There is also a Vulkan backend if you want to run Vulkan
         | through MoltenVK though.
        
           | dylanowen wrote:
           | Oh good to know! It's been a while since I've looked at the
           | osx implementation
        
           | tehsauce wrote:
            | The Metal backend currently generates quite a lot of
            | unnecessary command buffers, but in general performance
            | seems solid.
        
         | rudedogg wrote:
          | Given what the OP asked for, I don't think wgpu is the
          | right choice. They want to push the limits of Apple
          | Silicon, or do Apple-platform-specific work, so an
          | abstraction layer like wgpu goes in the opposite
          | direction, in my opinion.
          | 
          | Metal and Apple's docs are the place to start.
        
       | morphle wrote:
        | You can help with the reverse engineering of Apple Silicon
        | done by a dozen people worldwide; that is how we find out
        | the GPU and NPU instructions [1-4]. There are over 43
        | trillion floating point operations per second to unlock,
        | with 8 terabits per second of 'unified' memory bandwidth and
        | 270 gigabits per second of networking (less on the smaller
        | chips)....
       | 
       | [1] https://github.com/AsahiLinux/gpu
       | 
       | [2] https://github.com/dougallj/applegpu
       | 
       | [3] https://github.com/antgroup-
       | skyward/ANETools/tree/main/ANEDi...
       | 
       | [4] https://github.com/hollance/neural-engine
       | 
        | You can use high-level APIs like MLX, Metal or CoreML to
        | compute other things on the GPU and NPU.
       | 
       | Shadama [5] is an example programming language that translates
       | (with Ometa) matrix calculations into WebGPU or WebGL APIs (I
       | forget which). You can do exactly the same with the MLX, Metal or
       | CoreML APIs and only pay around 3% overhead going through the
       | translation stages.
       | 
       | [5] https://github.com/yoshikiohshima/Shadama
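        | 
        | As a rough illustration of the high-level-API route (my own
        | minimal sketch, not from any of the links above): a single
        | GPU matrix multiply via Metal Performance Shaders from
        | Swift, with arbitrary sizes and fill values, needs no
        | knowledge of the GPU instruction set at all:
        | 
        |     import Metal
        |     import MetalPerformanceShaders
        | 
        |     let n = 512
        |     let rowBytes = n * MemoryLayout<Float>.stride
        |     let device = MTLCreateSystemDefaultDevice()!
        |     let queue = device.makeCommandQueue()!
        | 
        |     // n x n float32 matrices in shared (unified) memory.
        |     let ones = [Float](repeating: 1, count: n * n)
        |     let a = device.makeBuffer(bytes: ones, length: n * rowBytes,
        |                               options: .storageModeShared)!
        |     let b = device.makeBuffer(bytes: ones, length: n * rowBytes,
        |                               options: .storageModeShared)!
        |     let c = device.makeBuffer(length: n * rowBytes,
        |                               options: .storageModeShared)!
        | 
        |     let desc = MPSMatrixDescriptor(rows: n, columns: n,
        |                                    rowBytes: rowBytes,
        |                                    dataType: .float32)
        | 
        |     // C = A * B, encoded onto a command buffer and run on
        |     // the GPU.
        |     let mul = MPSMatrixMultiplication(device: device,
        |         transposeLeft: false, transposeRight: false,
        |         resultRows: n, resultColumns: n, interiorColumns: n,
        |         alpha: 1.0, beta: 0.0)
        |     let cmd = queue.makeCommandBuffer()!
        |     mul.encode(commandBuffer: cmd,
        |         leftMatrix: MPSMatrix(buffer: a, descriptor: desc),
        |         rightMatrix: MPSMatrix(buffer: b, descriptor: desc),
        |         resultMatrix: MPSMatrix(buffer: c, descriptor: desc))
        |     cmd.commit()
        |     cmd.waitUntilCompleted()
        |     // Every element of C should now equal Float(n).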
       | 
        | I estimate it will cost around $22K at my hourly rate to
        | completely reverse engineer the latest A16 and M4 CPU
        | (ARMv9), GPU and NPU instruction sets. I think I am halfway
        | through the reverse engineering; the debugging part is the
        | hardest problem. You would, however, not be able to sell
        | software built on it in the App Store, as Apple forbids
        | undocumented APIs and bare-metal instructions.
        
         | KeplerBoy wrote:
          | Where does the 270 Gbit/s networking figure come from? Is
          | it the aggregate bandwidth from the PCIe slots on the Mac
          | Pro, which could support NICs at those speeds (and above,
          | according to my quick maths#)? But there is not really any
          | driver support for modern Intel or Mellanox/Nvidia NICs as
          | far as I can tell.
          | 
          | My use case would be hooking up a device which spews out
          | sensor data at 100 Gbit/s over QSFP28 Ethernet as directly
          | to a GPU as possible. The new Mac mini has the GPU power,
          | but there's no way to get the data into it.
          | 
          | # 2x Gen4x16 + 4x Gen3x8 = 2 * 31.508 GB/s + 4 * 7.877
          | GB/s ≈ 90 GB/s = 720 Gbit/s
        
           | morphle wrote:
            | > Where does the 270 Gbit/s networking figure come from?
            | Is it the aggregate bandwidth from the PCIe slots on the
            | Mac Pro
           | 
           | We both should restate and specify the calculation for each
           | different Apple Silicon chip and the PCB/machine model it is
           | wired onto.
           | 
            | The $599 M4 Mac mini base model networking (aggregated
            | WiFi, USB-C, 10G Ethernet, Thunderbolt PCIe) is almost
            | 270 Gbps. Your 720 Gbps is for a >$8000 Mac Pro M2
            | Ultra, but the number is too high because the 2x Gen4x16
            | is shared/oversubscribed with the other PCIe lanes for
            | the x8 PCIe slots, SSD and Thunderbolt. You need to
            | measure/benchmark it, not read the marketing PR.
            | 
            | I estimate the $1400 M4 Pro Mac mini networking
            | bandwidth by adding the external WiFi, 10 Gbps Ethernet,
            | two USB-C ports (2 x 10 Gbps) and three Thunderbolt 4
            | ports (3 x 80/120 Gbps), but subtracting the PCIe 64
            | Gbps limit and not counting the internal SSD. Two $599
            | M4 Mac mini base models are faster and cheaper than one
            | M4 Pro Mac mini.
           | 
            | The point of the precise measurements I did of the
            | trillions of operations per second and the billions of
            | bits per second of networking/interconnect of the M4 Mac
            | mini against all the other Apple Silicon machines is to
            | find which package (chip plus PCB plus case) has the
            | best price/performance/watt when networked together. In
            | January 2025 you can build the cheapest, fastest
            | supercomputer in the world from just off-the-shelf M4
            | 16GB Mac mini base models with 10G Ethernet, MikroTik
            | 100G switches and a few FPGAs. It would outperform all
            | Nvidia, Cerebras, Tenstorrent and datacenter clusters I
            | know of, mainly because of the low-power Apple Silicon.
           | 
            | Note that the M4 has only 1.2 Tbps unified memory
            | bandwidth and the M4 Pro has double that. The 8 Tbps
            | unified memory bandwidth is on the M1 and M2 Studio
            | Ultra with 64/128/192GB DRAM. Without it you can't reach
            | 50 trillion operations per second. A Mac Studio has only
            | around 190 Gbps of external networking bandwidth but
            | does not reach 43 TOPS, and neither does the Mac Pro
            | from your 720 Gbps estimate. By reverse engineering the
            | instruction set you could squeeze a few percent extra
            | performance out of this M4 cluster.
            | 
            | The 43 TOPS of the M4 itself is an estimate. The ANE
            | does 34 TOPS, the CPU less than 5 TOPS depending on
            | float type, and we have no reliable benchmarks for the
            | CPU floating point.
        
             | KeplerBoy wrote:
              | The PCIe configuration was taken from the Mac Pro and
              | its M2 Ultra. https://www.apple.com/mac-pro/
              | 
              | I'd assume the Mac mini has a less extensive PCIe/TB
              | subsystem.
              | 
              | No idea what people are doing with all those PCIe
              | slots except for NVMe cards. I wonder how hard it
              | would be to talk to a PCIe FPGA.
        
               | morphle wrote:
                | You use SerDes high-speed serial links (up to 224
                | Gbps in 2025) to communicate between chips. A PCIe
                | lane is just a SerDes with a 30% packet protocol
                | overhead that uses DMA to copy bytes between two
                | SRAM or DRAM buffers.
                | 
                | You aggregate PCIe lanes (x16, x8, x4/Thunderbolt,
                | x1). You could also build mesh networks from SerDes,
                | but then instead of PCIe switches you would need
                | SerDes switches or routers (Ethernet, NVLink,
                | InfiniBand).
                | 
                | You need those high-speed links between chips for
                | much more than SSD/NVMe cards: NAS, processors,
                | Ethernet/internet, cameras, WiFi, optics, DRAM,
                | SRAM, power, etc. They carry inter-core
                | communication (between processors or between
                | chiplets), traffic between networked PCBs, between
                | DRAM chips (DDR5 is just another SerDes protocol),
                | flash chips, camera chips, and so on; any other chip
                | running at speeds faster than 250 Mbps.
               | 
                | I aggregate all the M4 Mac mini ports into an M4
                | cluster by mesh-networking all its SerDes/PCIe with
                | FPGAs into a very cheap, low-power supercomputer
                | with exaflop performance. Cheaper than NVIDIA. I'm
                | sure Apple does the same in their data centers.
                | 
                | My talk [1] on wafer-scale integration and free-
                | space optics goes deeper into how and why SerDes and
                | PCIe will be replaced by fiber optics and free-space
                | optics for power reasons. I'm sure several parallel
                | 2 GHz optical lambdas per fiber (but no SerDes!)
                | will be the next step in Apple Silicon as well: the
                | M4 power budget already is mostly in the off-chip
                | SerDes/Thunderbolt networking links.
               | 
               | [1] https://vimeo.com/731037615
        
               | KeplerBoy wrote:
                | > I aggregate all the M4 Mac mini ports into an M4
                | cluster by mesh-networking all its SerDes/PCIe with
                | FPGAs into a very cheap, low-power supercomputer
                | with exaflop performance. Cheaper than NVIDIA. I'm
                | sure Apple does the same in their data centers.
               | 
               | That sounds super interesting, do you happen to have some
               | further information on that? Is it just a bunch of FPGAs
               | issuing DMA TLPs?
        
               | morphle wrote:
                | It is not the first time people have built
                | supercomputers from off-the-shelf Apple machines
                | [1].
                | 
                | M4 supercomputers are cheaper, and they will also
                | mean lower Capex and Opex than most datacenter
                | hardware.
               | 
               | >do you happen to have some further information on that?
               | 
                | Yes, the information is in my highly detailed custom
                | documentation for the programmers and buyers of 'my'
                | Apple Silicon supercomputer, Squeak and Ometa DSL
                | programming languages and adaptive compiler. You can
                | contact me for this highly technical report and
                | several scientific papers (email in my profile).
               | 
                | Do you know of people who might buy a supercomputer
                | based on better specifications? Or even just buyers
                | who will go for 'the lowest Capex and the lowest
                | Opex supercomputer in 2025-2027'?
                | 
                | Because the problem with HPC is that almost all
                | funders and managers buy supercomputers with a safe
                | brand name (Nvidia, AMD, Intel) at triple the cost,
                | and seldom from a supercomputer researcher such as
                | myself. But some do, if they understand why. I have
                | been designing, selling, programming and operating
                | supercomputers since 1984 (I was 20 years old then);
                | this M4 Apple Silicon cluster will be my ninth
                | supercomputer. I prefer to build them from the
                | ground up with our own chip and wafer-scale
                | integration designs, but when an off-the-shelf chip
                | is good enough I'll sell that instead.
                | Price/performance/watt is what counts; ease of
                | programming is a secondary consideration compared to
                | the performance you achieve. Alan Kay argues you
                | should rewrite your software from scratch [2] and do
                | your own hardware [3], so that is what I've done
                | since I learned from him.
               | 
               | >Is it just a bunch of FPGAs issuing DMA TLPs?
               | 
                | No. The FPGAs are optional, for when you want to
                | flatten the inter-core (= inter-SRAM-cache)
                | networking with switches or routers to a shorter-hop
                | topology for the message passing, like a Slim Fly
                | diameter-two topology [4].
                | 
                | DMA (Direct Memory Access) TLPs (Transaction Layer
                | Packets) are one of the worst ways of doing
                | inter-core and inter-SRAM communication, and on PCIe
                | they carry a huge 30% protocol overhead at triple
                | the cost. Intel (and most other chip companies like
                | NVIDIA, Altera, AMD/Xilinx) can't design proper
                | chips because they don't want to learn about
                | software [2]. Apple Silicon is marginally better.
               | 
                | You should use pure message passing between
                | processes, preferably in a programming language and
                | a VM that use pure message passing at the lowest
                | level (Squeak, Erlang). Even better if you then map
                | those software messages directly to message-passing
                | hardware, as in my custom chips [3].
               | 
                | The reasons to reverse engineer the Apple Silicon
                | instructions for CPU, GPU and ANE are to be able to
                | adapt my adaptive compiler to M4 chips, but also to
                | repurpose PCIe for low-level message passing with
                | much better performance and latency than DMA TLPs.
               | 
                | To conclude, if you want to get the cheapest Capex
                | and Opex M4 Mac mini supercomputer, you need to
                | rewrite your supercomputing software in a high-level
                | language and message-passing system like the
                | parallel Squeak Smalltalk VM [3] with adaptive
                | load-balancing compilation. C, C++, Swift, MPI or
                | CUDA would result in sub-optimal software
                | performance and orders of magnitude more lines of
                | code when optimal performance of parallel software
                | is the goal.
               | 
               | [1]
               | https://en.wikipedia.org/wiki/System_X_(supercomputer)
               | 
               | [2] https://www.youtube.com/watch?v=ubaX1Smg6pY
               | 
               | [3] https://vimeo.com/731037615
               | 
               | [4] https://www.youtube.com/watch?v=rLjMrIWHsxs
        
               | ricktdotorg wrote:
               | sounds (at least at a high level) similar to EXO[1]
               | 
               | [1] https://github.com/exo-explore/exo
        
               | morphle wrote:
                | Here's a video of testing Exo to run huge LLMs on a
                | cluster of M4 Macs [1] more cheaply than with a
                | cluster of NVIDIA RTX 4090s.
               | 
               | [1] https://www.youtube.com/watch?v=GBR6pHZ68Ho
        
             | _zoltan_ wrote:
             | It's very weird to add together all kinds of very different
             | networking solutions (WiFi, wired ethernet, TB) and talk
             | about their aggregate potential bandwidth as a single
             | number.
        
         | dgfitz wrote:
          | It's too bad Apple doesn't make this easier on developers.
          | Is there a reason I don't see?
        
           | morphle wrote:
            | There certainly is a reason, and indeed you don't see it
            | because Apple downplays these things in their PR.
            | 
            | It might be the same reason that is behind NVIDIA's CUDA
            | moat. CUDA lock-in prevented competitors like AMD and
            | Intel from convincing programmers and their customers to
            | switch away from CUDA. So no software was ported to
            | their competing GPUs. So you get antitrust lawsuits [1].
           | 
            | I think you should put yourself in Apple's management
            | mindset and then reason. I suspect they think they will
            | not sell more iPhones or Macs if they let third-party
            | developers access the low-level APIs and write faster
            | software.
            | 
            | They might reason that if no one knows the instruction
            | sets, hackers will write less code to break security.
            | Security by obscurity.
            | 
            | They certainly think that blocking competitors from
            | reverse engineering the low-power Apple Silicon and
            | blocking them from using TSMC manufacturing capacity
            | will keep them the most profitable company for another
            | decade.
           | 
           | [1] https://news.ycombinator.com/item?id=40593576
        
             | dylan604 wrote:
              | At this point, Apple is absolutely not afraid of an
              | antitrust lawsuit. To them, it is part of the cost of
              | doing business.
        
               | morphle wrote:
                | I concur, they are virtually untouchable in this
                | respect. No one else will throw a trillion or more
                | into developing lower-power, faster silicon.
        
             | _zoltan_ wrote:
              | CUDA didn't prevent anything, at least not in the way
              | you believe.
             | 
             | Intel and AMD had no competitive offer, period. They still
             | don't.
             | 
              | NVIDIA is simply offering an ecosystem that is battle-
              | tested and ready out of the box. Look at the recent
              | SemiAnalysis test to see how unready AMD is, and they
              | would be the only company with a real shot at this.
              | Their HW on paper is better or equal, yet their
              | software ecosystem is nowhere near ready.
        
               | AnthonyMouse wrote:
                | > Look at the recent SemiAnalysis test to see how
                | unready AMD is, and they would be the only company
                | with a real shot at this. Their HW on paper is
                | better or equal, yet their software ecosystem is
                | nowhere near ready.
               | 
               | Reading that was kind of odd. It seems like their
               | conclusion was that on paper AMD should be significantly
               | less expensive and significantly faster, whereas in
               | practice they're significantly less expensive and
               | slightly slower because of unoptimized software, which
               | actually seems like it'd still be a pretty good deal.
               | Especially if the problem _is_ the software, because then
               | the hardware could get better with a software update
               | after you buy it.
               | 
               | They also spend a lot of time complaining about how much
               | trouble it is to install the experimental releases with
               | some improvements that aren't in the stable branch yet,
               | but then the performance difference was only big in a few
               | cases and in general the experimental version was only a
               | couple of percent faster, which either way should end up
               | in the stable release in the near future.
               | 
               | And they do a lot of benchmarks on interconnect bandwidth
               | which, fair enough, Nvidia currently has some hardware
               | advantage. But that also mainly matters to the small
               | handful of companies doing training for huge frontier
               | models and not to the far larger number of people doing
               | inference or training smaller models.
        
           | twoodfin wrote:
           | Apple wants total freedom to rework lower levels of the stack
           | down to the hardware, without worrying about application
           | compatibility, hence their answer will continue to be Metal.
        
             | morphle wrote:
              | I agree that it allows Apple to redefine Apple Silicon
              | instruction sets without having to explain it to
              | third-party software developers, but it is certainly
              | not the main reason they hide the technical
              | documentation of the chips.
        
               | MBCook wrote:
               | Why not?
               | 
               | Metal is the answer. Everything else is just
               | implementation detail as GP said.
               | 
               | Apple doesn't provide developer support to other OSes.
               | The only OS they do anything for* is macOS. So to them
               | there's no point.
               | 
               | All they'd get is people relying on implementation
               | details they shouldn't, other companies stealing what
               | they consider their trade secrets, or more surface area
               | for patent trolls to scan.
               | 
                | * Someone on the Asahi team, I think Hector Martin,
                | has commented before that Apple is doing things that
                | clearly seem designed to allow others to make and
                | securely boot other OSes on their Apple Silicon
                | hardware. They clearly could be clamping it down far
                | more but are choosing not to. However, that's
                | exactly as far as the support appears to go.
        
               | amelius wrote:
               | > but it is certainly not the main reason they hide the
               | technical documentation of the chips
               | 
               | What _is_ the main reason?
        
               | morphle wrote:
               | >What _is_ the main reason?
               | 
                | I can't guess what the main reason is. There might
                | not even be a main reason, as many groups of people
                | at Apple and its shareholders decided this over the
                | years.
               | 
               | (Also see my speculations below in this thread).
               | 
               | So not in any order of importance to Apple:
               | 
               | 1) Create the same moat as NVIDIA has with CUDA.
               | 
               | 2) Ability to re-define the microcode instruction set of
               | all the dozens of different Apple Silicon chips now and
               | in the future without having to worry about backwards
               | compatibility. Each Apple Silicon chip simply recompiles
               | code at runtime (similar to my adaptive compiler).
               | 
               | 3) Zero hardware documentation needed, much cheaper PR
               | and faster time to market, also making it harder to
               | reverse engineer or repair.
               | 
               | 4) Security. Security by obscurity
               | 
               | 5) Keeping the walled garden up longer.
               | 
               | 6) Frustrating reverse engineering of Apple software
               | 
               | 7) Frustrating reverse engineering of Apple hardware
               | 
                | 8) It won't bring Apple more sales if third-party
                | developers can write faster and more energy-
                | efficient GPU and NPU software.
               | 
               | 9) Legal and patent infringements considerations
               | 
               | 10) Future compiler improvements
               | 
                | 11) Trade secrets
        
               | amelius wrote:
               | 9) hiding known and/or unknown patent infringements
        
         | MuffinFlavored wrote:
          | This would get rid of needing Metal as the black box and
          | enable things like an "nvptx CUDA" equivalent /
          | https://libc.llvm.org/gpu/, right?
         | 
         | Very interesting. A steal for $22k but I guess very niche for
         | now...
        
           | morphle wrote:
            | Yes, knowing the exact CPU and ANE assembly instructions
            | (or the underlying microcode!!) allows general software
            | to adaptively compile processes onto all the core types,
            | not just the CPU ones. It won't always be faster: you
            | get more cache misses (some cores don't have a cache),
            | different DMA and thread scheduling, some registers
            | can't fit the floats or large integers, etc. But yes, it
            | will be possible to use all 140 cores of the M2 Ultra or
            | the 36 cores of the M4. There will be an M6 Extreme some
            | day, maybe 500 cores?
           | 
            | Actually, the GPU and ANE cores themselves are built
            | from teams of smaller cores, maybe a few hundred or a
            | few thousand in all, same as in most NVIDIA chips.
           | 
           | >A steal for $22k but I guess very niche for now...
           | 
            | A single iPhone or Mac app (a game, an LLM, pattern
            | recognition, security app, VPN, de/encryption, video
            | en/decoder) that can be sped up by 80%-200% can afford
            | my faster assembly-level API.
            | 
            | But a whole series of hardware-level zero-day exploits
            | for iPhone and Mac, now that is not very niche at all.
            | It is worth millions to reverse engineer the Apple
            | Silicon instruction sets.
        
         | JackYoustra wrote:
          | Is there any place where you have your current progress
          | written up? Any methodology I could help contribute to?
          | I've read each of the four links you've given over the
          | years, and it remains vague how far people have currently
          | gotten and what the exact issues are.
        
       | mkagenius wrote:
        | Check out MLX [1]. It's a bit like PyTorch/TensorFlow with
        | the added benefit of running natively on Apple Silicon.
       | 
       | 1. https://ml-explore.github.io/mlx/build/html/index.html
        
       | rgovostes wrote:
       | It's hard to answer not knowing exactly what your aim is, or your
       | experience level with CUDA and how easily the concepts you know
       | will map to Metal, and what you find "restricted and convoluted"
       | about the documentation.
       | 
       | <Insert your favorite LLM> helped me write some simple Metal-
       | accelerated code by scaffolding the compute pipeline, which took
       | most of the nuisance out of learning the API and let me focus on
       | writing the kernel code.
       | 
       | Here's the code if it's helpful at all.
       | https://github.com/rgov/thps-crack
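        | 
        | To give a sense of how much of this is boilerplate: a
        | stripped-down sketch of the scaffolding (not the code from
        | that repo; the kernel here just doubles an array, and all
        | names and sizes are illustrative) looks roughly like this in
        | Swift, with the kernel itself in Metal Shading Language:
        | 
        |     import Metal
        | 
        |     // The one-line MSL kernel; everything below it is the
        |     // boilerplate the LLM can scaffold for you.
        |     let src = """
        |     #include <metal_stdlib>
        |     using namespace metal;
        |     kernel void double_values(
        |         device float *data [[buffer(0)]],
        |         uint gid [[thread_position_in_grid]])
        |     {
        |         data[gid] *= 2.0;
        |     }
        |     """
        | 
        |     let device = MTLCreateSystemDefaultDevice()!
        |     let library = try! device.makeLibrary(source: src,
        |                                           options: nil)
        |     let fn = library.makeFunction(name: "double_values")!
        |     let pipeline =
        |         try! device.makeComputePipelineState(function: fn)
        | 
        |     var input: [Float] = [1, 2, 3, 4]
        |     let byteCount = input.count * MemoryLayout<Float>.stride
        |     let buf = device.makeBuffer(bytes: &input,
        |                                 length: byteCount,
        |                                 options: .storageModeShared)!
        | 
        |     let queue = device.makeCommandQueue()!
        |     let cmd = queue.makeCommandBuffer()!
        |     let enc = cmd.makeComputeCommandEncoder()!
        |     enc.setComputePipelineState(pipeline)
        |     enc.setBuffer(buf, offset: 0, index: 0)
        |     // One thread per element; dispatchThreads handles
        |     // non-uniform grids on Apple Silicon.
        |     let grid = MTLSize(width: input.count, height: 1, depth: 1)
        |     enc.dispatchThreads(grid, threadsPerThreadgroup: grid)
        |     enc.endEncoding()
        |     cmd.commit()
        |     cmd.waitUntilCompleted()
        | 
        |     let out = buf.contents().bindMemory(to: Float.self,
        |                                         capacity: input.count)
        |     // Prints [2.0, 4.0, 6.0, 8.0]
        |     print((0..<input.count).map { out[$0] })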
        
         | nixpulvis wrote:
         | 2024 and still finding cheat codes in Tony Hawk Pro Skater 2.
         | Wild!
        
           | selimthegrim wrote:
           | If Jamie Kennedy is reading this, we still haven't found the
           | cheat code to make you funny.
        
       | thetwentyone wrote:
       | I've had a good time dabbling with Metal.jl:
       | https://github.com/JuliaGPU/Metal.jl
        
       | amelius wrote:
       | Apple is known to actively discourage general purpose computing.
       | Better try a different vendor.
        
         | codr7 wrote:
         | Preferably one that sells computers, not fashion statements.
        
           | likeabbas wrote:
           | It's not a fashion statement, it's a fucking deathwish
        
       | aleinin wrote:
       | If you're looking for a high level introduction to GPU
       | development on Apple silicon I would recommend learning Metal.
       | It's Apple's GPU acceleration language similar to CUDA for Nvidia
       | hardware. I ported a set of puzzles for CUDA called GPU-Puzzles
       | (a collection of exercises designed to teach GPU programming
       | fundamentals)[1] to Metal [2]. I think it's a very accessible
       | introduction to Metal and writing GPU kernels.
       | 
       | [1] https://github.com/srush/GPU-Puzzles
       | 
       | [2] https://github.com/abeleinin/Metal-Puzzles
        
         | dylan604 wrote:
         | After a quick scan through the [2] link, I have added this to
         | the list of things to look into in 2025
        
       | desideratum wrote:
        | I'd recommend checking out the CUDA MODE Discord server!
        | They also have a channel for Metal: https://discord.gg/ZqckTYcv
        
       | TriangleEdge wrote:
        | Why not OpenCL or OpenGL? You won't be constrained by the
        | flavor of GPU.
        
       | billti wrote:
        | If you know CUDA, then I assume you already know a bit about
        | GPUs and the major concepts. There are just minor
        | differences and different terminology for things like
        | "warps" etc.
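        | 
        | For example, a CUDA-style saxpy kernel carries over to the
        | Metal Shading Language almost mechanically (my own rough
        | sketch, not from Apple's samples); what CUDA calls a block
        | is a "threadgroup" here, and a warp is a "simdgroup":
        | 
        |     #include <metal_stdlib>
        |     using namespace metal;
        | 
        |     // gid ~ blockIdx.x * blockDim.x + threadIdx.x;
        |     // a threadgroup ~ a CUDA block, and a simdgroup
        |     // (32 lanes on Apple GPUs) ~ a warp.
        |     kernel void saxpy(
        |         device const float *x     [[buffer(0)]],
        |         device float       *y     [[buffer(1)]],
        |         constant float     &alpha [[buffer(2)]],
        |         uint gid [[thread_position_in_grid]])
        |     {
        |         y[gid] = alpha * x[gid] + y[gid];
        |     }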
       | 
       | With that base, I've found their docs decent enough, especially
       | coupled with the Metal Shader Language pdf they provide
       | (https://developer.apple.com/metal/Metal-Shading-Language-
       | Spe...), and quite a few code samples you can download from the
       | docs site (e.g.
       | https://developer.apple.com/documentation/metal/performing_c...).
       | 
       | I'd note a lot of their stuff was still written in Objective-C,
       | which I'm not that familiar with. But most of that is boilerplate
       | and the rest is largely C/C++ based (including the Metal shader
       | language).
       | 
       | I just ported some CPU/SIMD number crunching (complex matrices)
       | to Metal, and the speed up has been staggering. What used to take
       | days now takes minutes. It is the hottest my M3 MacBook has ever
       | been though! (See
       | https://x.com/billticehurst/status/1871375773413876089 :-)
        
       ___________________________________________________________________
       (page generated 2024-12-25 23:00 UTC)