[HN Gopher] Apple's MLX adding CUDA support
___________________________________________________________________
Apple's MLX adding CUDA support
Author : nsagent
Score : 528 points
Date : 2025-07-14 21:40 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gsibble wrote:
| Awesome
| nxobject wrote:
| If you're going "wait, no Apple platform has first-party CUDA
| support!", note that this set of patches also adds support for
| "Linux [platforms] with CUDA 12 and SM 7.0 (Volta) and up".
|
| https://ml-explore.github.io/mlx/build/html/install.html
| teaearlgraycold wrote:
| I wonder if Jensen is scared. If this opens up the door to other
| implementations this could be a real threat to Nvidia. CUDA on
| AMD, CUDA on Intel, etc. Might we see actual competition?
| jsight wrote:
| I think this is the other way around. It won't be cuda on
| anything except for nvidia.
|
| However, this might make mlx into a much stronger competitor
| for Pytorch.
| teaearlgraycold wrote:
| Oh bummer. Almost got excited.
| baby_souffle wrote:
| If you implement compatible apis, are you prohibited from
| calling it cuda?
| moralestapia wrote:
| I'm sure I saw this lawsuit somewhere ...
|
| The gist is that the API specification is itself copyrighted,
| so reimplementing it would be copyright infringement.
| wyldfire wrote:
| Too subtle - was this the Oracle vs. Java one? Remind me:
| did Java win or lose that one?
| mandevil wrote:
| Oracle sued Google, and Google won, 6-2 (RBG was dead,
| Barrett had not yet been confirmed when the case was
| heard).
|
| Supreme Court ruled that by applying the Four Factors of
| Fair Use, Google stayed within Fair Use.
|
| An API specification ends up being a system of organizing
| things, like the Dewey Decimal System (and thus not
| really something that can be copyrighted), which in the
| end marks the first factor for Google. Because Google
| limited the Android version of the API to just things
| that were useful for smart phones it won on the second
| factor too. Because only 0.4% of the code was reused, and
| mostly was rewritten, Google won on the third factor. And
| on the market factor, if they held for Oracle, it would
| harm the public because then "Oracle alone would hold the
| key. The result could well prove highly profitable to
| Oracle (or other firms holding a copyright in computer
| interfaces) ... [but] the lock would interfere with, not
| further, copyright's basic creativity objectives." So
| therefore the fourth factor was also pointing in Google's
| favor.
|
| Whether "java" won or lost is a question of what is
| "java"? Android can continue to use the Java API- so it
| is going to see much more activity. But Oracle didn't get
| to demand license fees, so they are sad.
| moralestapia wrote:
| Oh man, thanks for this.
|
| I always thought it was resolved as infringement and they
| had to license the Java APIs or something ...
|
| Wow.
| mandevil wrote:
| The district court ruled for Google on both patents and
| copyright, finding that the API was not copyrightable at
| all. The Court of Appeals then reversed and ordered a
| second trial on whether Google's use of Oracle's
| now-legitimate copyright was fair use, which the district
| court again decided for Google. The Court of Appeals then
| reversed that second ruling too, holding for Oracle that it
| was not fair use, and Google appealed that to the Supreme
| Court ... and won in April 2021, putting an end to a case
| that was filed in August 2010. But the appeals court
| rulings in between meant that for a long while in the
| middle, Oracle was the winner.
|
| This is part of why patents and copyrights can't be the
| moat for your company. 11 years, with lots of uncertainty
| and back-and-forth, to get a final decision.
| tough wrote:
| Yeah, this case made me think using LLMs to clean-room
| reverse engineer any API-exposing SaaS or private
| codebase would be fair game.
| 15155 wrote:
| Considering 100% of the low-level CUDA API headers have the
| word "CUDA" in them, this would be interesting to know.
| mayli wrote:
| Yeah, nice to have MLX-opencl or MLX-amd-whatever
| almostgotcaught wrote:
| > CUDA backend
|
| _backend_
| tekacs wrote:
| This instance is the other way around, but that's what this is
| - CUDA on AMD (or other platforms): https://docs.scale-
| lang.com/stable/
| pjmlp wrote:
| Why, everyone keeps trying to copy CUDA while failing to
| understand why many of us love it.
| int_19h wrote:
| Abstraction layers for GPU compute already exist; this is yet
| another one, so it doesn't change anything substantially. Most
| of the time code written using such layers ends up running on
| NVIDIA hardware in prod anyway, so if anything that is a net
| positive for the company - it means that more people can now
| develop for its hardware on their devices.
| zdw wrote:
| How does this work when one of the key features of MLX is using a
| unified memory architecture? (see bullets on repo readme:
| https://github.com/ml-explore/mlx )
|
| I would think that bringing that to all UMA APUs (of any vendor)
| would be interesting, but discrete GPUs would definitely need a
| different approach?
|
| edit: reading the PR comments, it appears that CUDA supports a
| UMA API directly, and will transparently copy as needed.
| freeone3000 wrote:
| Eh, yes, but in my experience its lack of prefetch leads to
| significant memory stalls waiting for the copy. It might be
| suitable if your entire dataset fits in VRAM after doing a
| "manual prefetch", but it killed performance for my application
| (ML training) so hard that we actually got the time to move to
| streaming loads.
| nerdsniper wrote:
| Edit: I had the details of the Google v Oracle case wrong. SCOTUS
| found that re-implementing an API does not infringe copyright. I
| was remembering the first and second appellate rulings.
|
| Also apparently this is not a re-implementation of CUDA.
| skyde wrote:
| this is CUDA backend to MLX not MLX backend for CUDA!
| liuliu wrote:
| You misunderstood and this is not re-implementing CUDA API.
|
| MLX is a PyTorch-like framework.
| Uehreka wrote:
| This is exactly the kind of thing I wouldn't opine on until
| like, an actual lawyer weighs in after thoroughly researching
| it. There are just too many shades of meaning in this kind of
| case law for laymen to draw actionable conclusions directly
| from the opinions.
|
| Though I imagine that if Apple is doing this themselves, they
| likely know what they're doing, whatever it is.
| MuffinFlavored wrote:
| Is this for Macs with NVIDIA cards in them or Apple Metal/Apple
| Silicon speaking CUDA?... I can't really tell.
|
| Edit: looks like it's "write once, use everywhere". Write MLX,
| run it on Linux CUDA, and Apple Silicon/Metal.
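|
| For illustration, a rough sketch of what that "write once" idea
| looks like in MLX's Python API (hedged: the array calls below
| are standard mlx.core, but the assumption that a CUDA build
| exposes its GPU through the same mx.gpu device is mine):
|
|     import mlx.core as mx
|
|     # The same script targets Apple Silicon (Metal) or, with the
|     # CUDA backend, a Linux box with an NVIDIA GPU; mx.gpu is
|     # whatever GPU backend the installed build provides.
|     mx.set_default_device(mx.gpu)
|
|     a = mx.random.normal((1024, 1024))
|     b = mx.random.normal((1024, 1024))
|     c = a @ b      # recorded lazily
|     mx.eval(c)     # executed on the selected device
|     print(c.shape, c.dtype)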
| cowsandmilk wrote:
| Neither, it is for Linux computers with NVIDIA cards
| MBCook wrote:
| Seems you already found the answer.
|
| I'll note Apple hasn't shipped an Nvidia card in a very very
| long time. Even on the Mac pros before Apple Silicon they only
| ever sold AMD cards.
|
| My understanding from rumors is that they had a falling out
| over the problems with the dual GPU MacBook Pros and the
| quality of drivers.
|
| I have no idea if sticking one in on the PCI bus lets you use it
| for AI stuff though.
| kmeisthax wrote:
| On Apple Silicon, writing to memory on a PCIe / Thunderbolt
| device will generate an exception. ARM spec says you're
| allowed to write to devices as if they were memory but Apple
| enforces that all writes to external devices go through a
| device memory mapping[0]. This makes using an external GPU on
| Apple Silicon[1] way more of a pain in the ass, if not
| impossible. AFAIK nobody's managed to write an eGPU driver
| for Apple Silicon, even with Asahi.
|
| [0] https://developer.arm.com/documentation/102376/0200/Devic
| e-m...
|
| [1] Raspberry Pi 4's PCIe has the same problem AFAIK
| bobmcnamara wrote:
| Ewww, that kills out-of-order CPU performance. If it's like
| ARMv7, it effectively turns each same-page access into its
| own ordering barrier.
| saagarjha wrote:
| Writing to device memory does not generate an exception.
| xuki wrote:
| That particular MBP model had a high rate of GPU failure
| because it ran too hot.
|
| I imagined the convo between Steve Jobs and Jensen Huang went
| like this:
|
| S: your GPU is shit
|
| J: your thermal design is shit
|
| S: f u
|
| J: f u too
|
| Apple is the kind of company that holds a grudge for a very
| long time; their relationships with suppliers are very
| one-way: their way or the highway.
| bobmcnamara wrote:
| S: omg so thin!!1!1!!l!
| rcruzeiro wrote:
| I think the ones that failed were the AMD ones,
| specifically the old 17-inch MacBook Pro.
| roboror wrote:
| D700s dying in the trash can Mac Pros cost me (and many
| others) a lot of time and money.
| MBCook wrote:
| I had 15" MBP, maybe a 2010, that was dual GPU with an
| Nvidia that was definitely a problem.
| sciencesama wrote:
| And the same happened with Nvidia too
| narism wrote:
| The MBPs didn't run too hot; the Nvidia GPUs used an
| underfill that stopped providing structural support at a
| relatively normal temperature for GPUs (60-80 degrees C).
|
| GPU failures due to this also happened on Dell/HP/Sony
| laptops, some desktop models, as well as early models of
| the PS3.
|
| Some reading:
| https://www.badcaps.net/forum/troubleshooting-hardware-
| devic...
| sciencesama wrote:
| Are you watching the bear?
| VladVladikoff wrote:
| Won't work. No driver support.
| dkga wrote:
| This is the only strategy humble me can see working for CUDA in
| MLX
| whatever1 wrote:
| This is the right answer. Local models will be accelerated by
| Apple private cloud.
| hbcondo714 wrote:
| > "write once, use everywhere"
|
| So my MLX workloads can soon be offloaded to the cloud!?
| Keyframe wrote:
| Now do linux support / drivers for Mac hardware!
| lvl155 wrote:
| Seriously. Those Apple guys became delusional especially after
| Jobs passed away. These guys just sat on their successes and
| did nothing for a decade plus. M1 was nice but that was all
| Jobs' doing and planning. I don't like this Apple. They forgot
| how to innovate.
|
| But I guess we have a VR device nobody wants.
| jjtheblunt wrote:
| It would be funny if you were typing out your response on an
| iPhone that has been running for 36 hours without recharging.
| macinjosh wrote:
| if only their batteries would last that long.
| can16358p wrote:
| Unless one constantly browses Instagram or TikTok, they
| do.
| marcellus23 wrote:
| > M1 was nice but that was all Jobs doing and planning
|
| M1 was launched 9 years after Jobs died. You're saying they
| had everything ready to go back then and just sat on their
| asses for a decade?
| lvl155 wrote:
| Who bought PA Semi? Jobs knew they had to make their own. M1
| is just a product of their iPhone chips, hence all the
| efficiency.
| saagarjha wrote:
| Ok, but did you ever think about PA Semi being the Alpha
| guys? Maybe the DEC leadership deserves credit for M1
| marcellus23 wrote:
| Jobs knew they had to make their own chips, and in your
| mind that constitutes "all the doing and planning"?
| lvl155 wrote:
| I said "[Jobs'] doing and planning" whereas you make it
| sound like Semi and M1 have nothing to do with Jobs.
| Apple has M1 because Jobs had a vision. Tell me one thing
| Apple did since Jobs' passing that shows such a vision.
| Maybe the Watch? Hardly groundbreaking. VR? 'nuff said.
| pxc wrote:
| > Seriously. Those Apple guys became delusional especially
| after Jobs passed away.
|
| Didn't Jobs himself essentially die of delusion?
| bigyabai wrote:
| I think we're seeing the twilight of those efforts. Asahi Linux
| was an absolute _powerhouse_ of reverse-engineering prowess,
| and it took years to get decent Vulkan coverage and half of the
| modern lineup's GPUs supported. Meanwhile AMD and even _Intel_
| are shipping Vulkan 1.3 drivers day-one on new hardware. It's
| a cool enthusiast effort to extend the longevity of the
| hardware, but it bears repeating; nobody is disrupting Nvidia's
| bottom-line here. Apple doesn't sell hardware competitive with
| Nvidia's datacenter hardware, and even if they did it's not
| supported by the community. It's doubtful that Apple would make
| any attempt to help them.
|
| There seems to be a pervading assumption that Apple is still
| making a VolksComputer in 2025, blithely supporting a freer
| status-quo for computing. They laid out their priorities
| completely with Apple Silicon, you're either on Apple's side or
| falling behind. Just the way they want it.
| albertzeyer wrote:
| This is exciting. So this is using unified memory of CUDA? I
| wonder how well that works. Is the behavior of the unified memory
| in CUDA actually the same as for Apple silicon? For Apple
| silicon, as I understand, the memory is anyway shared between GPU
| and CPU. But for CUDA, this is not the case. So when you have
| some tensor on CPU, how will it end up on GPU then? This needs a
| copy somehow. Or is this all hidden by CUDA?
| MBCook wrote:
| This is my guess, but does the higher-end hardware they sell,
| like the server rack stuff for AI, perhaps have unified memory?
|
| I know standard GPUs don't.
|
| The patch suggested one of the reasons for it was to make it
| easy to develop on a Mac and run on a supercomputer. So the
| hardware with the unified memory might be in that class.
| Y_Y wrote:
| The servers don't, but the Jetsons do
| ajuhasz wrote:
| The Jetsons[1] have unified memory[2].
|
| [1] https://www.nvidia.com/en-us/autonomous-
| machines/embedded-sy... [2] https://www.nvidia.com/en-us/on-
| demand/session/gtcspring22-s...
| tonyarkles wrote:
| They sure do and it's pretty amazing. One iteration of a
| vision system I worked on got frames from a camera over a
| Mellanox NIC that supports RDMA (Rivermax), preprocessed
| the images using CUDA, did inference on them with TensorRT,
| and the first time a single byte of the inference pipeline
| hit the CPU itself was when we were consuming the output.
| patrickkrusiec wrote:
| The physical memory is not unified, but on modern rack
| scale Nvidia systems, like Grace Hopper or NVL72, the CPU and
| the GPU(s) share the same virtual address space and have non-
| uniform memory access to each other's memory.
| freeone3000 wrote:
| Standard GPUs absolutely do. Since CUDA 11, all CUDA cards
| expose the same _featureset_ at differing speeds (based on
| backing capability). You can absolutely (try to) run CUDA UMA
| on your 2060, and it will complete the computation.
| zcbenz wrote:
| In the absence of hardware unified memory, CUDA will
| automatically copy data between CPU/GPU when there are page
| faults.
| fenced_load wrote:
| There is also NVLink c2c support between Nvidia's CPUs and
| GPUs that doesn't require any copy; CPUs and GPUs directly
| access each other's memory over a coherent bus. IIRC, they
| have 4 CPU + 4 GPU servers already available.
| benreesman wrote:
| Yeah NCCL is a whole world and it's not even the only thing
| involved, but IIRC that's the difference between 8xH100 PCI
| and 8xH100 SXM2.
| nickysielicki wrote:
| See also: https://www.kernel.org/doc/html/v5.0/vm/hmm.html
| saagarjha wrote:
| This seems like it would be slow...
| freeone3000 wrote:
| Matches my experience. It's memory stalls all over the
| place, aggravated by the fact that (on 12.3 at least) there
| wasn't even a prefetcher.
| ethan_smith wrote:
| CUDA's Unified Memory uses page migration with on-demand
| faulting to create the illusion of shared memory, whereas Apple
| Silicon has true shared physical memory, resulting in different
| performance characteristics despite the similar programming
| model.
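|
| A small sketch of that programming model from Python, using
| Numba's CUDA bindings as a stand-in (hedged: cuda.managed_array
| is the managed/unified allocation; the kernel and sizes are
| just illustrative):
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit
|     def scale(x, factor):
|         i = cuda.grid(1)
|         if i < x.size:
|             x[i] *= factor
|
|     n = 1 << 20
|     x = cuda.managed_array(n, dtype=np.float32)  # cudaMallocManaged
|     x[:] = 1.0                      # first touched on the CPU
|
|     threads = 256
|     blocks = (n + threads - 1) // threads
|     scale[blocks, threads](x, 2.0)  # pages migrate to the GPU on fault
|     cuda.synchronize()
|     print(x[:4])                    # and migrate back when the CPU reads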
| paulirish wrote:
| It's coming from zcbenz who created Electron among others
| https://zcbenz.com/ Nice.
| benreesman wrote:
| I wonder how much this is a result of Strix Halo. I had a fairly
| standard stipend for a work computer that I didn't end up using
| for a while so I recently cashed it in on the EVO-X2 and fuck me
| sideways: that thing is easily competitive with the mid-range
| znver5 EPYC machines I run substitors on. It mops the floor with
| any mere-mortal EC2 or GCE instance, like maybe some
| r1337.xxxxlarge.metal.metal or something has an edge, but the
| z1d.metal and the c6.2xlarge or whatever type stuff (fast cores,
| good NIC, table stakes), blows them away. And those things are
| 3-10K a month with heavy provisioned IOPS. This thing has real
| NVMe and it cost $1,800.
|
| I haven't done much local inference on it, but various YouTubers
| are starting to call the DGX Spark overkill / overpriced next to
| Strix Halo. The catch of course is ROCm isn't there yet (they're
| seeming serious now though, matter of time).
|
| Flawless CUDA on Apple gear would make it really tempting in a
| way that isn't true with Strix so cheap and good.
| jitl wrote:
| It's pretty explicitly targeting cloud cluster training in the
| PR description.
| ivape wrote:
| If we believe that there's not enough hardware to meet
| demand, then one could argue this helps Apple meet demand,
| even if it's just by a few percentage points.
| nl wrote:
| > The catch of course is ROCm isn't there yet (they're seeming
| serious now though, matter of time).
|
| Competitive AMD GPU neural compute has been any day now for at
| least 10 years.
| bigyabai wrote:
| The inference side is fine, nowadays. llama.cpp has had a
| GPU-agnostic Vulkan backend for a while, it's the training
| side that tends to be a sticking point for consumer GPUs.
| hamandcheese wrote:
| For the uninitiated, Strix Halo is the same as the AMD Ryzen AI
| Max+ 395 which will be in the Framework Desktop and is starting
| to show up in some mini PCs as well.
|
| The memory bandwidth on that thing is 200GB/s. That's great
| compared to most other consumer-level x86 platforms, but quite
| far off of an Nvidia GPU (a 5090 has 1792GB/s, dunno about the
| pro level cards) or even Apple's best (M3 Ultra has 800GB/s).
|
| It certainly seems like a great value. But for memory bandwidth
| intensive applications like LLMs, it is just barely entering
| the realm of "good enough".
| yieldcrv wrote:
| Apple is just being stupid, handicapping their own hardware
| so they can sell the fixed one next year or the year after
|
| This time-tested Apple strategy is now undermining their
| AI strategy and potential competitiveness.
|
| tl;dr they could have done 1600GB/s
| saagarjha wrote:
| They could have shipped a B200 too. Obviously there are
| reasons they don't do that.
| Nevermark wrote:
| So their products are so much better, in customer-demand
| terms, that they don't need to rush tech out the door?
|
| Whatever story you want to create, if customers are happy
| year after year then Apple is serving them well.
|
| Maybe not with the same feature-dimension balance you want, or
| other artificial/wishful balances you might make up for
| them.
|
| (When Apple drops the ball it is usually painful, painfully
| obvious and most often a result of a deliberate and
| transparent priority tradeoff. No secret switcheroos or
| sneaky downgrading. See: Mac Pro for years...)
| yieldcrv wrote:
| Apple is absolutely fumbling on their AI strategy despite
| their vertical hardware integration; there is no strategy.
| It's a known problem inside Apple, not a 4-D chess thing to
| wow everyone with a refined version in 2030.
| Rohansi wrote:
| You're comparing theoretical maximum memory bandwidth. It's
| not enough to only look at memory bandwidth because you're a
| lot more likely to be compute limited when you have a lot of
| memory bandwidth available. For example, the M1 had so much
| bandwidth available that it couldn't make use of it all even
| when fully loaded.
| zargon wrote:
| GPUs have both the bandwidth and the compute. During token
| generation, no compute is needed. But both Apple silicon
| and Strix Halo fall on their face during prompt ingestion,
| due to lack of compute.
| supermatt wrote:
| Compute (and lots of it) is absolutely needed for
| generation - 10s of billions of FLOPs per token on the
| smaller models (7B) alone - with computations of the
| larger models scaling proportionally.
|
| Each token requires a forward pass through all
| transformer layers, involving large matrix
| multiplications at every step, followed by a final
| projection to the vocabulary.
| zargon wrote:
| Obviously I don't mean literally zero compute. The amount
| of compute needed scales with the number of parameters,
| but I have yet to use a model that has so many parameters
| that token generation becomes compute bound. (Up to 104B
| for dense models.) During token generation most of the
| time is spent idle waiting for weights to transfer from
| memory. The processor is bored out of its mind waiting
| for more data. Memory bandwidth is the bottleneck.
| supermatt wrote:
| It sounds like you aren't batching efficiently if you are
| being bound by memory bandwidth.
| zargon wrote:
| That's right, in the context of Apple silicon and Halo
| Strix, these use cases don't involve much batching.
| hamandcheese wrote:
| Memory bandwidth puts an upper limit on LLM tokens per
| second.
|
| At 200GB/s, that upper limit is not very high at all. So it
| doesn't really matter if the compute is there or not.
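|
| Rough napkin math (assuming generation streams every weight
| once per token and ignoring KV-cache traffic): an 8 GB set of
| weights (roughly an 8B model at 8-bit) over 200 GB/s tops out
| around 25 tokens/s, and a ~35 GB model (70B at 4-bit) around
| 5-6 tokens/s, before any compute or other overhead.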
| Rohansi wrote:
| The M1 Max's GPU can only make use of about 90GB/s out of
| the 400GB/s they advertise/support. If the AMD chip can
| make better use of its 200GB/s then, as you say, it will
| manage to have better LLM tokens per second. You can't
| just look at what has the wider/faster memory bus.
|
| https://www.anandtech.com/show/17024/apple-m1-max-
| performanc...
| attentive wrote:
| how is it vs m4 mac mini?
| drcongo wrote:
| This was nice to read. I ordered an EVO-X2 a week ago, though
| I'm still waiting for them to actually ship it - I was waiting
| on a DGX Spark but ended up deciding that was never actually
| going to ship. Got any good resources for getting the thing up
| and running with LLMs, diffusion models etc.?
| benreesman wrote:
| However excited you are, it's merited. Mine took forever too,
| and it's just completely worth it. It's like a flagship halo
| product, they won't make another one like this for a while I
| don't think. You won't be short on compute relative to a trip
| to Best Buy for many years.
| adultSwim wrote:
| Do you need to copy to load a model from CPU memory into GPU
| memory?
| orliesaurus wrote:
| Why is this a big deal, can anyone explain if they are familiar
| with the space?
| elpakal wrote:
| > NVIDIA hardware is widely used for academic and massive
| computations. Being able to write/test code locally on a Mac
| and then deploy to super computers would make a good developer
| experience.
|
| That one stands out to me as a mac user.
| radicaldreamer wrote:
| MacBooks used to use Nvidia GPUs, then Apple had a falling
| out with Nvidia and the beef stands to this day (Apple didn't
| use Nvidia hardware when training its own LLMs for Apple
| Intelligence).
|
| I wouldn't be surprised if within the next few years we see a
| return of Nvidia hardware to the Mac, probably starting with
| low volume products like the MacPro, strictly for
| professional/high-end use cases.
| fooker wrote:
| > Apple didn't use Nvidia hardware when training its own
| LLMs for Apple Intelligence
|
| Do you have some links for this?
| almostgotcaught wrote:
| People on hn make up more BS than your local bar
|
| https://www.investors.com/news/technology/apple-stock-
| apple-...
| tgma wrote:
| What did the poster make up? There's one line where they
| speculated about the future and a comment about the beef
| existing to this day, which is subjective, but the rest of
| it was 100% factual: Apple relied on Google for training
| their LLM for various reasons, and they did have a beef
| with NVIDIA re: MacBooks a long time ago, after which they
| switched the entire line to AMD graphics.
| dialup_sounds wrote:
| https://arxiv.org/abs/2407.21075
|
| tl;dr - they used Google TPUs
| numpad0 wrote:
| > This PR is an ongoing effort to add a CUDA backend to MLX
|
| looks like it allows MLX _code_ to compile and run on x86 +
| GeForce hardware, not the other way around.
| sciencesama wrote:
| Apple is planning to build data centers with M-series chips for
| app development, testing, and hosting external services!
| mr_toad wrote:
| If they were doing that, they wouldn't need CUDA support. More
| likely they have internal developers who want to do development
| on Apple hardware and deploy to Nvidia hardware in production.
| lukev wrote:
| So to make sure I understand, this would mean:
|
| 1. Programs built against MLX -> Can take advantage of CUDA-
| enabled chips
|
| but not:
|
| 2. CUDA programs -> Can now run on Apple Silicon.
|
| Because the #2 would be a copyright violation (specifically with
| respect to NVidia's famous moat).
|
| Is this correct?
| saagarjha wrote:
| No, it's because doing 2 would be substantially harder.
| lukev wrote:
| There's a massive financial incentive (billions) to allow
| existing CUDA code to run on non-NVidia hardware. Not saying
| it's easy, but is implementation difficulty really the
| blocker?
| saagarjha wrote:
| Yes. See: AMD
| lukev wrote:
| AMD has never implemented the CUDA API. And not for
| technical reasons.
| gpm wrote:
| They did, or at least they paid someone else to.
|
| https://www.techpowerup.com/319016/amd-develops-rocm-
| based-s...
| Imustaskforhelp wrote:
| But I think there was then some lawsuit, and even though the
| ROCm guy/team had gotten really far ahead, AMD dropped it,
| either out of fear of a lawsuit or because of an actual one.
|
| Now they have had to stop working on some part of the source
| code and rewrite a lot of things again, and they are still
| not as far along as they were before the AMD lawyer
| shenanigans.
| lmm wrote:
| I think it's ultimately a project management problem, like
| all hard problems. Yes it's a task that needs skilled
| programmers, but if an entity was willing to pay what
| programmers of that caliber cost and give them the
| conditions to let them succeed they could get it done.
| fooker wrote:
| Existing high-performance CUDA code is almost all first-
| party libraries, written by NVIDIA, and uses weird internal
| flags and inline PTX.
|
| You can get 90% of the way there with a small team of
| compiler devs. The remaining 10% would take hundreds of
| people working for ten years. The cost of this is
| suspiciously close to the billions in financial incentive
| you mentioned; funny how efficient markets work.
| lcnielsen wrote:
| > funny how efficient markets work.
|
| Can one really speak of efficient markets when there are
| multiple near-monopolies at various steps in the
| production chain with massive integration, and infinite
| amounts of state spending in the process?
| bigyabai wrote:
| Sure they can. CUDA used to have a competitor, sponsored
| by Apple. Its name is OpenCL.
| dannyw wrote:
| And after Apple dropped NVIDIA, they stopped caring about
| OpenCL performance on their GPUs.
| fooker wrote:
| Yes, free markets and monopolies are not incompatible.
|
| When a monopoly uses its status in an attempt to gain
| another monopoly, that's a problem and governments
| eventually strike this behavior down.
|
| Sometimes it takes time, because you'd rather not go on an
| ideology power trip and break something that's useful to
| the country/world.
| Perseids wrote:
| > > Can one really speak of efficient markets
|
| > Yes, free markets and monopolies are not incompatible.
|
| How did you get from "efficient markets" to "free
| markets"? The first could be accepted as inherently
| value, while the latter is clearly not, if this kind of
| freedom degrades to: "Sure you can start your business,
| it's a free country. For certain, you will fail, though,
| because there are monopolies already in place who have
| all the power in the market."
|
| Also, monopolies are regularly used to squeeze exorbitant
| shares of the added value from the other market
| participants; see e.g. Apple's App Store cut. Accepting
| that as "efficient" would be a really unusual usage of
| the term in regard to markets.
| privatelypublic wrote:
| You scuttled your argument by using the Apple App Store as
| an example.
| ameliaquining wrote:
| The term "efficient markets" tends to confuse and mislead
| people. It refers to a particular narrow form of
| "efficiency", which is definitely not the same thing as
| "socially optimal". It's more like "inexploitability";
| the idea is that in a big enough world, any limited
| opportunities to easily extract value will be taken (up
| to the opportunity cost of the labor of the people who
| can take them), so you shouldn't expect to find any
| unless you have an edge. The standard metaphor is, if I
| told you that there's a $20 bill on the sidewalk in Times
| Square and it's been there all week, you shouldn't
| believe me, because if it were there, someone would have
| picked it up.
|
| (The terminology is especially unfortunate because people
| tend to view it as praise for free markets, and since
| that's an ideological claim people respond with opposing
| ideological claims, and now the conversation is about
| ideology instead of about understanding a specific
| phenomenon in economics.)
|
| This is fully compatible with Apple's App Store revenue
| share existing and not creating value (i.e., being rent).
| What the efficient markets principle tells us is that, if
| it were possible for someone else to start their own app
| store with a smaller revenue share and steal Apple's
| customers that way, then their revenue share would
| already be much lower, to account for that. Since this
| isn't the case, we can conclude that there's some reason
| why starting your own competing app store wouldn't work.
| Of course, we already separately know what that reason
| is: an app store needs to be on people's existing devices
| to succeed, and your competing one wouldn't be.
|
| Similarly, if it were possible to spend $10 million to
| create an API-compatible clone of CUDA, and then save
| more than $10 million by not having to pay huge margins
| to Nvidia, then someone would have already done it. So we
| can conclude that either it can't be done for $10
| million, or it wouldn't create $10 million of value. In
| this case, the first seems more likely, and the comment
| above hypothesizes why: because an incomplete clone
| wouldn't produce $10 million of value, and a complete one
| would cost much more than $10 million. Alternatively, if
| Nvidia could enforce intellectual property rights against
| someone creating such a clone, that would also explain
| it.
|
| (Technically it's possible that this could instead be
| explained by a free-rider problem; i.e., such a clone
| would create more value than it would cost, but no
| company wants to sponsor it because they're all waiting
| for some other company to do it and then save the $10
| million it would cost to do it themselves. But this seems
| unlikely; big tech companies often spend more than $10
| million on open source projects of strategic
| significance, which a CUDA clone would have.)
| pjmlp wrote:
| And the tooling; people keep forgetting about CUDA
| tooling.
| int_19h wrote:
| From the market perspective, it's down to whether the
| amount of money needed to get there _and stay there_
| (keeping in mind that this would have to be an ongoing
| effort given that CUDA is not a static target) is more or
| less than the amount of money needed to just buy NVIDIA
| GPUs.
| ivell wrote:
| Modular is trying with Mojo + Max offering. It has taken
| quite a bit of effort to target NVidia and get parity. They
| are now focusing on other hardware.
| hangonhn wrote:
| Is CUDA tied very closely to the Nvidia hardware and
| architecture so that all the abstraction would not make sense
| on other platforms? I know very little about hardware and low
| level software.
|
| Thanks
| saagarjha wrote:
| Yes, also it's a moving target where people don't just want
| compatibility but also good performance.
| dagmx wrote:
| CUDA isn't really that hyper-specific to NVIDIA hardware as
| an API.
|
| But a lot of the most useful libraries are closed source
| and available on NVIDIA hardware only.
|
| You could probably get most open source CUDA to run on
| other vendors' hardware without crazy work. But you'd spend
| a ton more effort getting to parity on the ecosystem, and on
| lawyer fees, when NVIDIA comes at you.
| lcnielsen wrote:
| The kind of CUDA you or I would write is not very hardware
| specific (a few constants here and there) but the kind of
| CUDA behind cuBLAS with a million magic flags, inline PTX
| ("GPU assembly") and exploitation of driver/firmware hacks
| is. It's like the difference between numerics code in C and
| numerics code in C with tons of in-line assembly code
| for each one of a number of specific processors.
|
| You can see similar things if you buy datacenter-grade CPUs
| from AMD or Intel and compare their per-model optimized
| BLAS builds and compilers to using OpenBLAS or swapping
| them around. The difference is not world ending but you can
| see maybe 50% in some cases.
| pjmlp wrote:
| CUDA is an ecosystem, many keep failing to understand that,
| trying to copy only the C++ compiler.
| ls612 wrote:
| #2 would be Google v. Oracle wouldn't it?
| quitit wrote:
| It's 1.
|
| It means that a developer can use their relatively low-powered
| Apple device (with UMA) to develop for deployment on nvidia's
| relatively high-powered systems.
|
| That's nice to have for a range of reasons.
| _zoltan_ wrote:
| "relatively high powered"? there's nothing faster out there.
| chvid wrote:
| Relative to what you can get in the cloud or on a desktop
| machine.
| MangoToupe wrote:
| Is this true per watt?
| spookie wrote:
| It doesn't matter for a lot of applications. But fair,
| for a big part of them it is either essential or a nice
| to have. But completely off the point if we are waging
| fastest compute no matter what.
| johnboiles wrote:
| ...fastest compute no matter watt
| sgt101 wrote:
| I wonder what Apple would have to do to make Metal + its
| processors run faster than Nvidia? I guess that it's all
| about the interconnects really.
| summarity wrote:
| Right now, for LLMs, the only limiting factor on Apple
| Silicon is memory bandwidth. There hasn't been progress
| on this since the original M1 Ultra. And since abandoning
| UltraFusion, we won't see progress here anytime soon
| either.
| glhaynes wrote:
| Have they abandoned UltraFusion? Last I'd heard, they'd
| just said something like "not all generations will get an
| Ultra chip" around the time the M4 showed up (the first M
| chip lacking an Ultra variation), which makes me think
| the M5 or M6 is fairly likely to get an Ultra.
| librasteve wrote:
| this is like saying the only limiting factor on computers
| is the von neumann bottleneck
| quitit wrote:
| Relative to the apple hardware, the nvidia is high powered.
|
| I appreciate that English is your second language after
| your Hungarian mother-tongue. My comment reflects upon the
| low and high powered compute of the apple vs. nvidia
| hardware.
| chvid wrote:
| If Apple cannot do their own implementation of CUDA due to
| copyright second best is this; getting developers to build
| for MLX (which is on their laptops) and still get NVIDIA
| hardware support.
|
| Apple should do a similar thing for AMD.
| xd1936 wrote:
| I thought that the US Supreme Court decision in Google v.
| Oracle and the Java reimplementation provided enough case
| precedent to allow companies to re-implement something like
| CUDA APIs?
|
| https://www.theverge.com/2021/4/5/22367851/google-oracle-
| sup...
|
| https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,
| _....
| timhigins wrote:
| Exactly, and see also ROCm/HIP, which is AMD's
| reimplementation of CUDA for their GPUs.
| randomNumber7 wrote:
| What is the performance penalty compared to a program in
| native CUDA?
| karmakaze wrote:
| It would be great for Apple if enough developers took this
| path and Apple could later release datacenter GPUs that
| support MLX without CUDA.
| nightski wrote:
| It's the other way around. If Apple released data center
| GPUs then developers might take that path. Apple has shown
| time and again they don't care for developers, so it's on
| them.
| dagmx wrote:
| 2 also further cements CUDA as the de facto API to target, and
| nobody would write MLX-targeted code instead.
|
| This way, you're more incentivized to write MLX and have it run
| everywhere. It's a situation where everyone wins, especially
| Apple, because they can optimize it further for their platforms.
| tekawade wrote:
| I want #3: being able to connect an NVIDIA GPU to Apple Silicon
| and run CUDA. Take advantage of Apple Silicon + unified memory +
| GPU + CUDA with PyTorch, JAX, or TensorFlow.
|
| Haven't really explored MLX so can't speak about it.
| tho234j2344 wrote:
| I don't think #2 is really true - AMD's HIP is doing this exact
| thing after giving up on OpenCL way back in ~'17/'18.
| NekkoDroid wrote:
| I haven't looked into it, but doesn't HIP need everything to
| be recompiled against it? To my understanding it was mostly a
| source code translation, effectively.
| pjmlp wrote:
| For CUDA C++, not CUDA the ecosystem.
| sitkack wrote:
| #2 is not a copyright violation. You can reimplement APIs.
| 7734128 wrote:
| The famous Android Java fight is probably the most important
| case of that discussion.
| hnfong wrote:
| Indeed.
|
| Unfortunately when that case went to the Supreme Court they
| basically just said "yeah for this case it's fair use, but
| we're not going to comment on whether APIs in general are
| copyrightable"...
| adastra22 wrote:
| CUDA is not an API, it is a set of libraries written by
| NVIDIA. You'd have to reimplement those libraries, and for
| people to care at all you'd have to reimplement the
| optimizations in those libraries. That does get into various
| IP issues.
| Imustaskforhelp wrote:
| Even if it's not as optimized, it would still be nice to see
| a CUDA alternative, really.
|
| Also, I do wonder what the difference between an API and a set
| of libraries is. Couldn't an API be exposed from that set of
| libraries which could be used? It's a little confusing I
| guess.
| adastra22 wrote:
| > couldn't an API be exposed from that set of libraries
| which could be used
|
| And now you've entered that copyright violation
| territory.
| Someone wrote:
| IP infringement, not copyright violation.
|
| A clean room reimplementation of cuda would avoid any
| copyright claims, but would not necessary avoid patents
| infringement.
|
| https://en.wikipedia.org/wiki/Clean-room_design:
|
| _"Clean-room design is useful as a defense against
| copyright infringement because it relies on independent
| creation. However, because independent invention is not a
| defense against patents, clean-room designs typically
| cannot be used to circumvent patent restrictions."_
| dragonwriter wrote:
| > A clean room reimplementation of cuda would avoid any
| copyright claims,
|
| Assuming APIs are either not copyrightable or that API
| reimplementation is always fair use of the API, _neither_
| of which there is sufficient precedent to justify as a
| conclusion; Oracle v. Google ended with "well, it would
| be fair use in the exact factual circumstances in this
| case so we don't have to reach the thornier general
| questions".
| pjmlp wrote:
| CUDA is neither an API, nor a set of libraries, people get
| this wrong all the time.
|
| CUDA is an ecosystem of programming languages, libraries
| and developer tools.
|
| It is composed of compilers for C, C++, Fortran, and Python
| JIT DSLs provided by NVIDIA, plus several others targeting
| either PTX or NVVM IR.
|
| The libraries, which you correctly point out.
|
| And then the IDE integrations, the GPU debugger that is on
| par with Visual Studio-style debugging, the profiler, ...
|
| Hence why everyone that focuses on copying only CUDA C, or
| CUDA C++, without everything else that makes CUDA relevant,
| keeps failing.
| CamperBob2 wrote:
| Only the runtime components matter, though. Nobody cares
| about the dev tools beyond the core compiler. What people
| want is to be able to recompile and run on competitive
| hardware, and I don't understand why that's such an
| intractable problem.
| outworlder wrote:
| It is not.
|
| However, companies may still be hoping to get their own
| solutions in place instead of CUDA. If they do implement
| CUDA, that cements its position forever. That ship has
| probably already sailed, of course.
| StillBored wrote:
| Because literally the entire rest of the ecosystem is
| immature demoware. Rather than each vendor buying into
| opencl+SPIRV and building a robust stack around it, they
| are all doing their own half baked tech demos hoping to
| lock up some portion of the market to duplicate nvidia's
| success, or at least carve out a niche. While nvidia
| continues to extend and mature their ecosystem. Intel has
| oneAPI, AMD has ROCM, Arm has ACL/Kleidi/etc, and a pile
| of other stacks like MLX, Windows ML, whatever. Combined
| with a confusing mix of pure software plays like pytorch
| and windows ML.
|
| A lot of people talk about 'tooling' quality and no one
| hears them. I just spent a couple weeks porting a fairly
| small library to some fairly common personal hardware and
| hit all the same problems you see everywhere. Bugs aren't
| handled gracefully. Instead of returning "you messed up
| here", the hardware locks up, and power cycling is the
| only solution. Not a problem when you're writing hello
| world, but trawling through tens of thousands of lines of
| GPU kernel code to find the error is going to burn
| engineer time without anything to show for it. Then when
| it's running, spending weeks in an open feedback loop
| trying to figure out why the GPU utilization metrics are
| reporting 50% utilization (if you're lucky enough to even
| have them) and the kernel is running at 1/4 the expected
| performance is again going to burn weeks. All because
| there isn't a functional profiler.
|
| And the vendors can't even get this stuff working. People
| rant about the ROCm support list not supporting, well, the
| hardware people actually have. And it is such a mess that
| in some cases it actually works but AMD says it doesn't.
| And of course, the only reason you hear people complaining
| about AMD is because they are literally the only company
| that has a hardware ecosystem that in theory spans the same
| breadth of devices, from small embedded systems to giant
| datacenter-grade products, that NVIDIA does. Everyone else
| wants a slice of the market, but take Apple here: they have
| nothing in the embedded/edge space that isn't a
| fixed-function device (e.g. a watch, or Apple TV), and
| their GPUs, while interesting, are nowhere near the level
| of the datacenter-grade stuff, much less even top-of-the-
| line AIC boards for gamers.
|
| And it's all gotten to be such an industry-wide pile of
| trash that people can't even keep track of basic feature
| capabilities. Like, a huge pile of hardware actually
| 'supports' OpenCL, but it's buried to the point where
| actual engineers working on, say, ROCm are unaware it's
| actually part of the ROCm stack (imagine my surprise!).
| And it's been the same for Nvidia: they have at times
| supported OpenCL, but the support is like a .dll they
| install with the GPU driver stack and don't even bother
| to document that it's there. Or TensorFlow, which seems to
| have succumbed to the immense gravitational black hole it
| had become, where just building it on something that
| wasn't the blessed platform could take days.
| int_19h wrote:
| It's the same essential problem as with e.g. Wine - if
| you're trying to reimplement someone else's constantly
| evolving API with a closed-source implementation, it
| takes a lot of effort just to barely keep up.
|
| As far as portability, people who care about that already
| have the option of using higher-level APIs that have CUDA
| backend among several others. The main reason why you'd
| want to do CUDA directly is to squeeze that last bit of
| performance out of the hardware, but that is also
| precisely the area where deviation in small details
| starts to matter a lot.
| vFunct wrote:
| So if people aren't aware, you can have AI reimplement CUDA
| libraries for any hardware, as well as develop new ones.
|
| You wouldn't believe me if you didn't try it and see for
| yourself, so try it.
|
| NVidia's CUDA moat is no more.
| pxc wrote:
| Copyright can't prohibit compatible implementations that are
| developed independently through reverse engineering, if the
| implementers are very careful about the way they work.
|
| (Software patents can, though.)
| natas wrote:
| That means the next Apple computer is going to use Nvidia GPU(s).
| meepmorp wrote:
| But it's not an Apple-submitted PR.
| natas wrote:
| they can't make it that obvious
| meepmorp wrote:
| o
| MBCook wrote:
| There's no evidence of that. The post clearly identifies a
| _far_ more probable reason: letting things be developed on
| Macs and then deployed on Nvidia supercomputers.
| dnchdnd wrote:
| Random aside: A lot of the people working on MLX don't seem to be
| officially affiliated with Apple at least in a superficial
| review. See for example: https://x.com/prince_canuma
|
| Idly wondering: is Apple bankrolling this but wants to keep it
| on the DL? There were also rumours the team was looking to move
| at one point?
| jpcompartir wrote:
| It seems more like Open Source devs who are looking to build
| clout/rep with MLX?
|
| Pretty sure Claude Sonnet is actually doing most of the work.
| Abishek_Muthian wrote:
| I've been very impressed with MLX models; I can open up local
| models to everyone in the house, something I wouldn't dare with
| my Nvidia computer for the risk of burning down the house.
|
| I've been hoping Apple Silicon becomes a serious contender for
| Nvidia chips; I wonder if the CUDA support is just Embrace,
| extend, and extinguish (EEE).
| m3kw9 wrote:
| I thought you either use MLX for Apple Silicon or you compile it
| for CUDA.
| neurostimulant wrote:
| > Being able to write/test code locally on a Mac and then deploy
| to super computers would make a good developer experience.
|
| Does this means you can use MLX on linux now?
|
| Edit:
|
| Just tested it and it's working but only python 3.12 version is
| available on pypi right now: https://pypi.org/project/mlx-
| cuda/#files
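|
| For anyone else trying it, a minimal smoke test (hedged: this
| assumes the mlx-cuda wheel is a drop-in build of the regular
| mlx package, which is what the PyPI page suggests):
|
|     # pip install mlx-cuda   (Linux, CUDA 12+, SM 7.0+, Python 3.12)
|     import mlx.core as mx
|
|     print(mx.default_device())   # expect the gpu device on a CUDA box
|     a = mx.ones((4, 4))
|     mx.eval(a @ a)               # should run on the CUDA backend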
| qwertox wrote:
| If Apple supported Nvidia cards, it would be the #1 solution
| for developers.
| Nevermark wrote:
| If Apple doubled the specs of their Ultra M processor every
| year, in numbers of cores, RAM cells, internal and external
| bandwidth, until both the Ultra processor and its RAM plane
| took up full wafers, .... but still fit in a Mac Studio case,
| with a new white reverse-power heat extraction USB-C+ cable
| designed to be terminated at a port on a small wireless heat
| exchanger dish, which instantly beamed all the waste heat into
| space, at such high efficiency that the Studio internals could
| operate at -100 Celsius, and all those cores overclocked, oh
| they over clocked, ...
|
| Yes we can dream!
|
| It would be great if Apple continues pushing M processors to
| the next level, in part to go vertical into the cloud.
|
| Or if they start supporting nVidia.
|
| The latter seems less Apple-y. But they must be considering the
| value of a cloud level Apple-friendly AI computing solution, so
| something is likely (?) to happen.
| adultSwim wrote:
| This is great to see. I had wrongly assumed MLX was Apple-only.
| neuroelectron wrote:
| Just remember to name your fp8 kernels "cutlass" for +50%
| performance.
| mattfrommars wrote:
| It's the year 2025 and CUDA has yet to have the kind of impact
| Java had with the idea of "write once, run anywhere".
|
| Academia and companies continue to write proprietary code. It's
| as if we were still writing code for Adobe Flash or Microsoft
| Silverlight in the year 2025.
|
| Honestly, I don't mind as Nvidia shareholder.
| raincole wrote:
| In the end Java didn't achieve "write once, run anywhere"
| either.
|
| I guess there might be a way to develop apps for iOS or even
| PlayStation in Java, but my knees hurt just thinking about how
| many hoops one needs to jump through.
| bigyabai wrote:
| I'll never get over the way Apple treated OpenCL. They saw the
| train coming down the tracks, spent so long hedging their bet
| against CUDA, and threw in the towel _the moment_ actual demand
| started cropping up. CUDA very nearly had a serious, corporate-
| funded and write-once-run-anywhere competitor.
|
| Normally I write something snide about not seeing where the
| puck was headed. But Apple _did_ skate to the puck here; they
| just did nothing with it.
| int_19h wrote:
| Back in the day, the reason why people kept targeting Flash is
| because all the other alternatives were worse. If you recall,
| the only thing that made a difference was mobile, where Flash
| ended up being a liability due to performance and battery
| lifetime issues. And even then it took a company like Apple,
| which could rely on its cult status to draw the hard line on
| mobile Flash, ship iPhone without it (unlike Android which had
| it, warts and all), and steadfastly refuse to even consider
| adding it, forcing everybody else to choose between using Flash
| and supporting the lucrative iPhone ecosystem.
|
| I'm not even sure what the equivalent would be for CUDA tbh.
| bigyabai wrote:
| Apple could just, talk to Khronos again. In any protracted
| discussion of "how can the industry kill Nvidia", we always
| circle back around to the lack of communication. There _was_
| an era where Apple, AMD and even Nvidia all worked on Open
| Source non-CUDA acceleration primitives. There were working
| drivers, (a handful of) users, and bindings in multiple
| native languages. All they needed was industry applications,
| which would arrive with the crypto mining boom that Nvidia
| profited off of hand-over-fist. And by then, Apple refused to
| cooperate with their industry partners, and refused to
| support OpenCL on iPhone GPUs or Apple Silicon.
| Metaphorically, this would be like Apple refusing to
| implement HTML because they wanted to promote their own Flash
| alternative.
|
| Nvidia won because they don't deal with this level of asinine
| infighting. If Khronos could bring back a level of mutual
| respect to their consortium, they could present a serious
| threat. Apple is the only business still on their high horse;
| AMD, Intel and Qualcomm all recognize that they need to
| cooperate.
| m463 wrote:
| > "write once, run it anywhere"
|
| I think you mean:
|
| "write once, test everywhere"
___________________________________________________________________
(page generated 2025-07-15 23:01 UTC)