[HN Gopher] Apple's MLX adding CUDA support
       ___________________________________________________________________
        
       Apple's MLX adding CUDA support
        
       Author : nsagent
       Score  : 528 points
       Date   : 2025-07-14 21:40 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gsibble wrote:
       | Awesome
        
       | nxobject wrote:
       | If you're going "wait, no Apple platform has first-party CUDA
       | support!", note that this set of patches also adds support for
       | "Linux [platforms] with CUDA 12 and SM 7.0 (Volta) and up".
       | 
       | https://ml-explore.github.io/mlx/build/html/install.html
        
       | teaearlgraycold wrote:
       | I wonder if Jensen is scared. If this opens up the door to other
       | implementations this could be a real threat to Nvidia. CUDA on
       | AMD, CUDA on Intel, etc. Might we see actual competition?
        
         | jsight wrote:
          | I think this is the other way around. It won't be CUDA on
          | anything except for Nvidia.
          | 
          | However, this might make MLX into a much stronger competitor
          | for PyTorch.
        
           | teaearlgraycold wrote:
           | Oh bummer. Almost got excited.
        
           | baby_souffle wrote:
            | If you implement compatible APIs, are you prohibited from
            | calling it CUDA?
        
             | moralestapia wrote:
             | I'm sure I saw this lawsuit somewhere ...
             | 
              | The gist is that the API specification is itself
              | copyrighted, so reimplementing it would be copyright
              | infringement.
        
               | wyldfire wrote:
                | Too subtle - was this the Oracle vs. Java one? Remind
                | me: did Java win or lose that one?
        
               | mandevil wrote:
               | Oracle sued Google, and Google won, 6-2 (RBG was dead,
               | Barrett had not yet been confirmed when the case was
               | heard).
               | 
               | Supreme Court ruled that by applying the Four Factors of
               | Fair Use, Google stayed within Fair Use.
               | 
                | An API specification ends up being a system of
                | organizing things, like the Dewey Decimal System (and
                | thus not really something that can be copyrighted),
                | which tipped the first factor in Google's favor.
                | Because Google limited the Android version of the API
                | to just the things that were useful for smartphones,
                | it won on the second factor too. Because only 0.4% of
                | the code was reused, and most of that was rewritten,
                | Google won on the third factor. And on the market
                | factor, if the Court had held for Oracle, it would
                | have harmed the public, because then "Oracle alone
                | would hold the key. The result could well prove highly
                | profitable to Oracle (or other firms holding a
                | copyright in computer interfaces) ... [but] the lock
                | would interfere with, not further, copyright's basic
                | creativity objectives." So the fourth factor also
                | pointed in Google's favor.
               | 
               | Whether "java" won or lost is a question of what is
               | "java"? Android can continue to use the Java API- so it
               | is going to see much more activity. But Oracle didn't get
               | to demand license fees, so they are sad.
        
               | moralestapia wrote:
               | Oh man, thanks for this.
               | 
               | I always thought it was resolved as infringement and they
               | had to license the Java APIs or something ...
               | 
               | Wow.
        
               | mandevil wrote:
                | The district court ruled for Google on both patents
                | and copyright, holding that the API was not
                | copyrightable at all. The Court of Appeals then
                | reversed and ordered a second trial on whether Google
                | had made fair use of Oracle's legitimate copyright;
                | the district court again held for Google. The Court
                | of Appeals then reversed that second ruling as well
                | and held for Oracle that it was not fair use, and
                | Google appealed that to the Supreme Court ... and won
                | in April 2021, putting an end to a case that was
                | filed in August 2010. But the appeals court sitting
                | between the district court and the Supreme Court
                | meant that for a long while in the middle, Oracle was
                | the winner.
               | 
               | This is part of why patents and copyrights can't be the
               | moat for your company. 11 years, with lots of uncertainty
               | and back-and-forth, to get a final decision.
        
               | tough wrote:
                | Yeah, this case made me think that using LLMs to
                | clean-room reverse engineer any API-exposing SaaS or
                | private codebase would be fair game.
        
             | 15155 wrote:
             | Considering 100% of the low-level CUDA API headers have the
             | word "CUDA" in them, this would be interesting to know.
        
           | mayli wrote:
           | Yeah, nice to have MLX-opencl or MLX-amd-whatever
        
         | almostgotcaught wrote:
         | > CUDA backend
         | 
         |  _backend_
        
         | tekacs wrote:
         | This instance is the other way around, but that's what this is
         | - CUDA on AMD (or other platforms): https://docs.scale-
         | lang.com/stable/
        
         | pjmlp wrote:
         | Why, everyone keeps trying to copy CUDA while failing to
         | understand why many of us love it.
        
         | int_19h wrote:
         | Abstraction layers for GPU compute already exist; this is yet
         | another one, so it doesn't change anything substantially. Most
         | of the time code written using such layers ends up running on
         | NVIDIA hardware in prod anyway, so if anything that is a net
         | positive for the company - it means that more people can now
         | develop for its hardware on their devices.
        
       | zdw wrote:
       | How does this work when one of the key features of MLX is using a
       | unified memory architecture? (see bullets on repo readme:
       | https://github.com/ml-explore/mlx )
       | 
        | I would think that bringing that to all UMA APUs (of any vendor)
        | would be interesting, but discrete GPUs would definitely need a
        | different approach?
       | 
       | edit: reading the PR comments, it appears that CUDA supports a
       | UMA API directly, and will transparently copy as needed.
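        | 
        | For the curious, the CUDA-side mechanism is the managed
        | ("unified") memory API. A minimal sketch (generic CUDA, not
        | code from the PR): one allocation is visible to both CPU and
        | GPU, and the driver migrates pages on demand instead of
        | requiring explicit cudaMemcpy calls.
        | 
        |     #include <cuda_runtime.h>
        |     #include <cstdio>
        | 
        |     __global__ void scale(float* x, int n, float s) {
        |       int i = blockIdx.x * blockDim.x + threadIdx.x;
        |       if (i < n) x[i] *= s;
        |     }
        | 
        |     int main() {
        |       const int n = 1 << 20;
        |       float* x = nullptr;
        |       // Single pointer usable from both host and device.
        |       cudaMallocManaged(&x, n * sizeof(float));
        |       // Touch on the CPU: pages start in host memory.
        |       for (int i = 0; i < n; i++) x[i] = 1.0f;
        |       // Kernel access faults the pages over to the GPU.
        |       scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
        |       cudaDeviceSynchronize();
        |       // Reading on the CPU migrates the pages back.
        |       printf("%f\n", x[0]);
        |       cudaFree(x);
        |     }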
        
         | freeone3000 wrote:
          | Eh, yes, but in my experience its lack of prefetching leads
          | to significant memory stalls waiting for the copy. It might
          | be suitable if your entire dataset fits in VRAM after doing a
          | "manual prefetch", but it killed performance for my
          | application (ML training) so badly that we actually took the
          | time to move to streaming loads.
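          | 
          | For reference, this kind of manual prefetch is basically
          | cudaMemPrefetchAsync on the managed buffer. A rough sketch
          | (standard CUDA unified-memory calls, nothing specific to my
          | application):
          | 
          |     #include <cuda_runtime.h>
          | 
          |     __global__ void scale(float* x, int n) {
          |       int i = blockIdx.x * blockDim.x + threadIdx.x;
          |       if (i < n) x[i] *= 2.0f;
          |     }
          | 
          |     int main() {
          |       const int n = 1 << 20;
          |       const size_t bytes = n * sizeof(float);
          |       int dev = 0;
          |       cudaGetDevice(&dev);
          | 
          |       float* x = nullptr;
          |       cudaMallocManaged(&x, bytes);
          |       for (int i = 0; i < n; i++) x[i] = 1.0f;
          | 
          |       // Migrate up front so the kernel doesn't pay for
          |       // demand faults one page group at a time.
          |       cudaMemPrefetchAsync(x, bytes, dev, 0);
          |       scale<<<(n + 255) / 256, 256>>>(x, n);
          | 
          |       // Bring it back before the CPU touches it again.
          |       cudaMemPrefetchAsync(x, bytes, cudaCpuDeviceId, 0);
          |       cudaDeviceSynchronize();
          |       cudaFree(x);
          |     }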
        
       | nerdsniper wrote:
       | Edit: I had the details of the Google v Oracle case wrong. SCOTUS
       | found that re-implementing an API does not infringe copyright. I
       | was remembering the first and second appellate rulings.
       | 
       | Also apparently this is not a re-implementation of CUDA.
        
         | skyde wrote:
          | this is a CUDA backend for MLX, not an MLX backend for CUDA!
        
         | liuliu wrote:
          | You misunderstood; this is not re-implementing the CUDA API.
         | 
         | MLX is a PyTorch-like framework.
        
         | Uehreka wrote:
         | This is exactly the kind of thing I wouldn't opine on until
         | like, an actual lawyer weighs in after thoroughly researching
         | it. There are just too many shades of meaning in this kind of
         | case law for laymen to draw actionable conclusions directly
         | from the opinions.
         | 
         | Though I imagine that if Apple is doing this themselves, they
         | likely know what they're doing, whatever it is.
        
       | MuffinFlavored wrote:
        | Is this for Macs with NVIDIA cards in them, or Apple
        | Metal/Apple Silicon speaking CUDA? ... I can't really tell.
        | 
        | Edit: looks like it's "write once, use everywhere". Write MLX,
        | run it on Linux with CUDA, or on Apple Silicon/Metal.
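        | 
        | Roughly, the same source should build against either backend.
        | A sketch using MLX's C++ API (names assumed from the repo's
        | C++ tutorial: mlx/mlx.h, namespace mlx::core; the Python API
        | is analogous):
        | 
        |     #include "mlx/mlx.h"
        | 
        |     using namespace mlx::core;
        | 
        |     int main() {
        |       // Build a small lazy graph; nothing here is
        |       // backend-specific.
        |       auto a = ones({512, 512});
        |       auto b = ones({512, 512});
        |       auto c = matmul(a, b);
        |       // Evaluation dispatches to whichever backend MLX was
        |       // built with: Metal on Apple Silicon, or the new CUDA
        |       // backend on a Linux + NVIDIA box.
        |       eval(c);
        |     }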
        
         | cowsandmilk wrote:
         | Neither, it is for Linux computers with NVIDIA cards
        
         | MBCook wrote:
         | Seems you already found the answer.
         | 
          | I'll note Apple hasn't shipped an Nvidia card in a very, very
          | long time. Even in the Mac Pros before Apple Silicon they only
          | ever sold AMD cards.
         | 
         | My understanding from rumors is that they had a falling out
         | over the problems with the dual GPU MacBook Pros and the
         | quality of drivers.
         | 
          | I have no idea if sticking one in on the PCI bus lets you use
          | it for AI stuff though.
        
           | kmeisthax wrote:
           | On Apple Silicon, writing to memory on a PCIe / Thunderbolt
           | device will generate an exception. ARM spec says you're
           | allowed to write to devices as if they were memory but Apple
           | enforces that all writes to external devices go through a
           | device memory mapping[0]. This makes using an external GPU on
           | Apple Silicon[1] way more of a pain in the ass, if not
           | impossible. AFAIK nobody's managed to write an eGPU driver
           | for Apple Silicon, even with Asahi.
           | 
           | [0] https://developer.arm.com/documentation/102376/0200/Devic
           | e-m...
           | 
           | [1] Raspberry Pi 4's PCIe has the same problem AFAIK
        
             | bobmcnamara wrote:
              | Ewww, that kills out-of-order CPU performance. If it's
              | like ARMv7, it effectively turns each same-page access
              | into its own ordering barrier.
        
             | saagarjha wrote:
             | Writing to device memory does not generate an exception.
        
           | xuki wrote:
           | That particular MBP model had a high rate of GPU failure
           | because it ran too hot.
           | 
           | I imagined the convo between Steve Jobs and Jensen Huang went
           | like this:
           | 
           | S: your GPU is shit
           | 
           | J: your thermal design is shit
           | 
           | S: f u
           | 
           | J: f u too
           | 
            | Apple is the kind of company that holds a grudge for a very
            | long time; their relationships with suppliers are very
            | one-way: their way or the highway.
        
             | bobmcnamara wrote:
             | S: omg so thin!!1!1!!l!
        
             | rcruzeiro wrote:
              | I think the ones that failed were the AMD ones,
              | specifically the old 17-inch MacBook Pro.
        
               | roboror wrote:
               | D700s dying in the trash can Mac Pros cost me (and many
               | others) a lot of time and money.
        
               | MBCook wrote:
               | I had 15" MBP, maybe a 2010, that was dual GPU with an
               | Nvidia that was definitely a problem.
        
             | sciencesama wrote:
              | And it's the same with Nvidia too.
        
             | narism wrote:
              | The MBPs didn't run too hot; the Nvidia GPUs used an
              | underfill that stopped providing structural support at a
              | relatively normal temperature for GPUs (60-80 degrees C).
              | 
              | GPU failures due to this also happened on Dell/HP/Sony
              | laptops, some desktop models, and early models of the
              | PS3.
             | 
             | Some reading:
             | https://www.badcaps.net/forum/troubleshooting-hardware-
             | devic...
        
             | sciencesama wrote:
              | Are you watching The Bear?
        
           | VladVladikoff wrote:
           | Won't work. No driver support.
        
         | dkga wrote:
         | This is the only strategy humble me can see working for CUDA in
         | MLX
        
           | whatever1 wrote:
           | This is the right answer. Local models will be accelerated by
           | Apple private cloud.
        
         | hbcondo714 wrote:
         | > "write once, use everywhere"
         | 
         | So my MLX workloads can soon be offloaded to the cloud!?
        
       | Keyframe wrote:
       | Now do linux support / drivers for Mac hardware!
        
         | lvl155 wrote:
         | Seriously. Those Apple guys became delusional especially after
         | Jobs passed away. These guys just sat on their successes and
          | did nothing for a decade plus. M1 was nice, but that was all
          | Jobs' doing and planning. I don't like this Apple. They forgot
         | how to innovate.
         | 
         | But I guess we have a VR device nobody wants.
        
           | jjtheblunt wrote:
           | It would be funny if you were typing out your response on an
           | iPhone that has been running for 36 hours without recharging.
        
             | macinjosh wrote:
             | if only their batteries would last that long.
        
               | can16358p wrote:
               | Unless one constantly browses Instagram or TikTok, they
               | do.
        
           | marcellus23 wrote:
           | > M1 was nice but that was all Jobs doing and planning
           | 
           | M1 was launched 9 years after Jobs died. You're saying they
           | had everything ready to go back then and just sat on their
           | asses for a decade?
        
             | lvl155 wrote:
             | Who bought Semi? Jobs knew they had to make their own. M1
             | is just a product of their iPhone chips hence all the
             | efficiency.
        
               | saagarjha wrote:
               | Ok, but did you ever think about PA Semi being the Alpha
               | guys? Maybe the DEC leadership deserves credit for M1
        
               | marcellus23 wrote:
               | Jobs knew they had to make their own chips, and in your
               | mind that constitutes "all the doing and planning"?
        
               | lvl155 wrote:
               | I said "[Jobs'] doing and planning" whereas you make it
               | sound like Semi and M1 have nothing to do with Jobs.
               | Apple has M1 because Jobs had a vision. Tell me one thing
                | Apple did since Jobs' passing that shows such a vision.
                | Maybe Watch? Hardly groundbreaking. VR? 'nuff said.
        
           | pxc wrote:
           | > Seriously. Those Apple guys became delusional especially
           | after Jobs passed away.
           | 
           | Didn't Jobs himself essentially die of delusion?
        
         | bigyabai wrote:
          | I think we're seeing the twilight of those efforts. Asahi Linux
          | was an absolute _powerhouse_ of reverse-engineering prowess,
          | and it took years to get decent Vulkan coverage and half of the
          | modern lineup's GPUs supported. Meanwhile AMD and even _Intel_
          | are shipping Vulkan 1.3 drivers day one on new hardware. It's
          | a cool enthusiast effort to extend the longevity of the
          | hardware, but it bears repeating: nobody is disrupting Nvidia's
          | bottom line here. Apple doesn't sell hardware competitive with
          | Nvidia's datacenter hardware, and even if they did, it's not
          | supported by the community. It's doubtful that Apple would make
          | any attempt to help them.
          | 
          | There seems to be a pervading assumption that Apple is still
          | making a VolksComputer in 2025, blithely supporting a freer
          | status quo for computing. They laid out their priorities
          | completely with Apple Silicon: you're either on Apple's side or
          | falling behind. Just the way they want it.
        
       | albertzeyer wrote:
       | This is exciting. So this is using unified memory of CUDA? I
       | wonder how well that works. Is the behavior of the unified memory
       | in CUDA actually the same as for Apple silicon? For Apple
        | silicon, as I understand it, the memory is shared between GPU
        | and CPU anyway. But for CUDA, this is not the case. So when you
        | have some tensor on the CPU, how will it end up on the GPU?
        | This needs a copy somehow. Or is this all hidden by CUDA?
        
         | MBCook wrote:
          | This is my guess, but does the higher-end hardware they sell,
          | like the server rack stuff for AI, perhaps have the unified
          | memory?
          | 
          | I know standard GPUs don't.
          | 
          | The patch suggested one of the reasons for it was to make it
          | easy to develop on a Mac and run on a supercomputer. So the
          | hardware with the unified memory might be in that class.
        
           | Y_Y wrote:
           | The servers don't, but the Jetsons do
        
           | ajuhasz wrote:
           | The Jetsons[1] have unified memory[2].
           | 
           | [1] https://www.nvidia.com/en-us/autonomous-
           | machines/embedded-sy... [2] https://www.nvidia.com/en-us/on-
           | demand/session/gtcspring22-s...
        
             | tonyarkles wrote:
             | They sure do and it's pretty amazing. One iteration of a
             | vision system I worked on got frames from a camera over a
             | Mellanox NIC that supports RDMA (Rivermax), preprocessed
             | the images using CUDA, did inference on them with TensorRT,
             | and the first time a single byte of the inference pipeline
             | hit the CPU itself was when we were consuming the output.
        
           | patrickkrusiec wrote:
            | The physical memory is not unified, but on modern rack-scale
            | Nvidia systems, like Grace Hopper or NVL72, the CPU and the
            | GPU(s) share the same virtual address space and have non-
            | uniform memory access to each other's memory.
        
           | freeone3000 wrote:
           | Standard GPUs absolutely do. Since CUDA 11, all CUDA cards
           | expose the same _featureset_ at differing speeds (based on
           | backing capability). You can absolutely (try to) run CUDA UMA
           | on your 2060, and it will complete the computation.
        
         | zcbenz wrote:
         | In the absence of hardware unified memory, CUDA will
         | automatically copy data between CPU/GPU when there are page
         | faults.
        
           | fenced_load wrote:
            | There is also NVLink-C2C support between Nvidia's CPUs and
            | GPUs that doesn't require any copy; CPUs and GPUs directly
            | access each other's memory over a coherent bus. IIRC, they
            | have 4 CPU + 4 GPU servers already available.
        
             | benreesman wrote:
             | Yeah NCCL is a whole world and it's not even the only thing
             | involved, but IIRC that's the difference between 8xH100 PCI
             | and 8xH100 SXM2.
        
           | nickysielicki wrote:
           | See also: https://www.kernel.org/doc/html/v5.0/vm/hmm.html
        
           | saagarjha wrote:
           | This seems like it would be slow...
        
             | freeone3000 wrote:
              | Matches my experience. It's memory stalls all over the
              | place, aggravated by the fact that (on 12.3 at least)
              | there wasn't even a prefetcher.
        
         | ethan_smith wrote:
         | CUDA's Unified Memory uses page migration with on-demand
         | faulting to create the illusion of shared memory, whereas Apple
         | Silicon has true shared physical memory, resulting in different
         | performance characteristics despite the similar programming
         | model.
        
       | paulirish wrote:
        | It's coming from zcbenz, who created Electron among other
        | things: https://zcbenz.com/ Nice.
        
       | benreesman wrote:
       | I wonder how much this is a result of Strix Halo. I had a fairly
       | standard stipend for a work computer that I didn't end up using
       | for a while so I recently cashed it in on the EVO-X2 and fuck me
       | sideways: that thing is easily competitive with the mid-range
       | znver5 EPYC machines I run substitors on. It mops the floor with
        | any mere-mortal EC2 or GCE instance. Maybe some
        | r1337.xxxxlarge.metal.metal or something has an edge, but it
        | blows away the z1d.metal and the c6.2xlarge or whatever type of
        | stuff (fast cores, good NIC, table stakes). And those things are
        | $3-10K a month with heavy provisioned IOPS. This thing has real
        | NVMe and it cost $1800.
       | 
       | I haven't done much local inference on it, but various YouTubers
       | are starting to call the DGX Spark overkill / overpriced next to
       | Strix Halo. The catch of course is ROCm isn't there yet (they're
       | seeming serious now though, matter of time).
       | 
       | Flawless CUDA on Apple gear would make it really tempting in a
       | way that isn't true with Strix so cheap and good.
        
         | jitl wrote:
         | It's pretty explicitly targeting cloud cluster training in the
         | PR description.
        
           | ivape wrote:
           | If we believe that there's not enough hardware to meet
           | demand, then one could argue this helps Apple meet demand,
           | even if it's just by a few percentage points.
        
         | nl wrote:
         | > The catch of course is ROCm isn't there yet (they're seeming
         | serious now though, matter of time).
         | 
         | Competitive AMD GPU neural compute has been any day now for at
         | least 10 years.
        
           | bigyabai wrote:
           | The inference side is fine, nowadays. llama.cpp has had a
           | GPU-agnostic Vulkan backend for a while, it's the training
           | side that tends to be a sticking point for consumer GPUs.
        
         | hamandcheese wrote:
         | For the uninitiated, Strix Halo is the same as the AMD Ryzen AI
         | Max+ 395 which will be in the Framework Desktop and is starting
         | to show up in some mini PCs as well.
         | 
         | The memory bandwidth on that thing is 200GB/s. That's great
         | compared to most other consumer-level x86 platforms, but quite
         | far off of an Nvidia GPU (a 5090 has 1792GB/s, dunno about the
         | pro level cards) or even Apple's best (M3 Ultra has 800GB/s).
         | 
         | It certainly seems like a great value. But for memory bandwidth
         | intensive applications like LLMs, it is just barely entering
         | the realm of "good enough".
        
           | yieldcrv wrote:
           | Apple is just being stupid, handicapping their own hardware
           | so they can sell the fixed one next year or the year after
           | 
            | This time-tested Apple strategy is now undermining their
            | AI strategy and potential competitiveness.
           | 
           | tl;dr they could have done 1600GB/s
        
             | saagarjha wrote:
             | They could have shipped a B200 too. Obviously there are
             | reasons they don't do that.
        
             | Nevermark wrote:
              | So their products are so much better, in customer-demand
              | terms, that they don't need to rush tech out the door?
             | 
             | Whatever story you want to create, if customers are happy
             | year after year then Apple is serving them well.
             | 
              | Maybe not with the same feature-dimension balance you
              | want, or other artificial/wishful balances you might make
              | up for them.
             | 
              | (When Apple drops the ball it is usually painful,
              | painfully obvious, and most often a result of a deliberate
              | and transparent priority tradeoff. No secret switcheroos
              | or sneaky downgrading. See: Mac Pro for years...)
        
               | yieldcrv wrote:
                | Apple is absolutely fumbling on their AI strategy;
                | despite their vertical hardware integration, there is
                | no strategy. It's a known problem inside Apple, not a
                | 4-D chess thing to wow everyone with a refined version
                | in 2030.
        
           | Rohansi wrote:
           | You're comparing theoretical maximum memory bandwidth. It's
           | not enough to only look at memory bandwidth because you're a
           | lot more likely to be compute limited when you have a lot of
            | memory bandwidth available. For example, the M1 had so much
            | bandwidth available that it couldn't make use of it all even
            | when fully loaded.
        
             | zargon wrote:
             | GPUs have both the bandwidth and the compute. During token
             | generation, no compute is needed. But both Apple silicon
             | and Strix Halo fall on their face during prompt ingestion,
             | due to lack of compute.
        
               | supermatt wrote:
               | Compute (and lots of it) is absolutely needed for
               | generation - 10s of billions of FLOPs per token on the
               | smaller models (7B) alone - with computations of the
               | larger models scaling proportionally.
               | 
               | Each token requires a forward pass through all
               | transformer layers, involving large matrix
               | multiplications at every step, followed by a final
               | projection to the vocabulary.
        
               | zargon wrote:
               | Obviously I don't mean literally zero compute. The amount
               | of compute needed scales with the number of parameters,
               | but I have yet to use a model that has so many parameters
               | that token generation becomes compute bound. (Up to 104B
               | for dense models.) During token generation most of the
               | time is spent idle waiting for weights to transfer from
               | memory. The processor is bored out of its mind waiting
               | for more data. Memory bandwidth is the bottleneck.
        
               | supermatt wrote:
               | It sounds like you aren't batching efficiently if you are
               | being bound by memory bandwidth.
        
               | zargon wrote:
                | That's right; in the context of Apple silicon and Strix
                | Halo, these use cases don't involve much batching.
        
             | hamandcheese wrote:
             | Memory bandwidth puts an upper limit on LLM tokens per
             | second.
             | 
             | At 200GB/s, that upper limit is not very high at all. So it
             | doesn't really matter if the compute is there or not.
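              | 
              | Back-of-envelope: at batch size 1, every generated token
              | has to stream essentially all of the weights from memory
              | once, so tokens per second is capped at roughly bandwidth
              | divided by model size. At 200GB/s that's about 200/8 ~= 25
              | tok/s for an 8GB model (an 8B model at 8-bit), and about
              | 200/40 ~= 5 tok/s for a ~40GB model (70B at 4-bit), before
              | any other overhead.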
        
               | Rohansi wrote:
               | The M1 Max's GPU can only make use of about 90GB/s out of
               | the 400GB/s they advertise/support. If the AMD chip can
               | make better use of its 200GB/s then, as you say, it will
               | manage to have better LLM tokens per second. You can't
               | just look at what has the wider/faster memory bus.
               | 
               | https://www.anandtech.com/show/17024/apple-m1-max-
               | performanc...
        
         | attentive wrote:
         | how is it vs m4 mac mini?
        
         | drcongo wrote:
          | This was nice to read. I ordered an EVO-X2 a week ago, though
          | I'm still waiting for them to actually ship it - I was waiting
          | on a DGX Spark but ended up deciding that was never actually
          | going to ship. Got any good resources for getting the thing up
          | and running with LLMs, diffusion models, etc.?
        
           | benreesman wrote:
            | However excited you are, it's merited. Mine took forever too,
            | and it's just completely worth it. It's like a flagship halo
            | product; they won't make another one like this for a while, I
            | don't think. You won't be short on compute relative to a trip
            | to Best Buy for many years.
        
         | adultSwim wrote:
         | Do you need to copy to load a model from CPU memory into GPU
         | memory?
        
       | orliesaurus wrote:
       | Why is this a big deal, can anyone explain if they are familiar
       | with the space?
        
         | elpakal wrote:
         | > NVIDIA hardware is widely used for academic and massive
         | computations. Being able to write/test code locally on a Mac
         | and then deploy to super computers would make a good developer
         | experience.
         | 
         | That one stands out to me as a mac user.
        
           | radicaldreamer wrote:
            | MacBooks used to use Nvidia GPUs; then Apple had a falling
            | out with Nvidia, and the beef stands to this day (Apple
            | didn't use Nvidia hardware when training its own LLMs for
            | Apple Intelligence).
            | 
            | I wouldn't be surprised if within the next few years we see a
            | return of Nvidia hardware to the Mac, probably starting with
            | low-volume products like the Mac Pro, strictly for
            | professional/high-end use cases.
        
             | fooker wrote:
             | > Apple didn't use Nvidia hardware when training it's own
             | LLMs for Apple Intelligence
             | 
             | Do you have some links for this?
        
               | almostgotcaught wrote:
               | People on hn make up more BS than your local bar
               | 
               | https://www.investors.com/news/technology/apple-stock-
               | apple-...
        
               | tgma wrote:
                | What did the poster make up? There's one line where
                | they speculated about the future, and a comment about
                | the beef existing to this day, which is subjective, but
                | the rest of it was 100% factual: Apple relied on Google
                | for training their LLM for various reasons, and they
                | did have a beef with NVIDIA re: MacBooks a long time
                | ago, after which they switched the entire line to AMD
                | graphics.
        
               | dialup_sounds wrote:
               | https://arxiv.org/abs/2407.21075
               | 
               | tl;dr - they used Google TPUs
        
       | numpad0 wrote:
       | > This PR is an ongoing effort to add a CUDA backend to MLX
       | 
       | looks like it allows MLX _code_ to compile and run on x86 +
       | GeForce hardware, not the other way around.
        
       | sciencesama wrote:
        | Apple is planning to build data centers with M-series chips for
        | app development, testing, and hosting external services!
        
         | mr_toad wrote:
         | If they were doing that, they wouldn't need CUDA support. More
         | likely they have internal developers who want to do development
         | on Apple hardware and deploy to Nvidia hardware in production.
        
       | lukev wrote:
       | So to make sure I understand, this would mean:
       | 
       | 1. Programs built against MLX -> Can take advantage of CUDA-
       | enabled chips
       | 
       | but not:
       | 
       | 2. CUDA programs -> Can now run on Apple Silicon.
       | 
        | Because #2 would be a copyright violation (specifically with
        | respect to Nvidia's famous moat).
       | 
       | Is this correct?
        
         | saagarjha wrote:
         | No, it's because doing 2 would be substantially harder.
        
           | lukev wrote:
           | There's a massive financial incentive (billions) to allow
           | existing CUDA code to run on non-NVidia hardware. Not saying
           | it's easy, but is implementation difficulty really the
           | blocker?
        
             | saagarjha wrote:
             | Yes. See: AMD
        
               | lukev wrote:
               | AMD has never implemented the CUDA API. And not for
               | technical reasons.
        
               | gpm wrote:
               | They did, or at least they paid someone else to.
               | 
               | https://www.techpowerup.com/319016/amd-develops-rocm-
               | based-s...
        
               | Imustaskforhelp wrote:
                | But I think then there was some lawsuit: the ROCm
                | guy/team had gotten really far ahead, but AMD dropped
                | it out of either fear of a lawsuit or an actual
                | lawsuit.
                | 
                | Now they have had to stop working on some part of the
                | source code and rewrite a lot of things again; they
                | are still not as close as they were before the AMD
                | lawyer shenanigans.
        
             | lmm wrote:
             | I think it's ultimately a project management problem, like
             | all hard problems. Yes it's a task that needs skilled
             | programmers, but if an entity was willing to pay what
             | programmers of that caliber cost and give them the
             | conditions to let them succeed they could get it done.
        
             | fooker wrote:
              | Existing high-performance CUDA code is almost all
              | first-party libraries, written by NVIDIA, that use weird
              | internal flags and inline PTX.
              | 
              | You can get 90% of the way there with a small team of
              | compiler devs. The remaining 10% would take hundreds of
              | people working for ten years. The cost of this is
              | suspiciously close to the billions in financial incentive
              | you mentioned; funny how efficient markets work.
        
               | lcnielsen wrote:
               | > funny how efficient markets work.
               | 
                | Can one really speak of efficient markets when there
                | are multiple near-monopolies at various steps in the
                | production chain, with massive integration and
                | infinite amounts of state spending in the process?
        
               | bigyabai wrote:
                | Sure they can. CUDA used to have a competitor, sponsored
                | by Apple. Its name is OpenCL.
        
               | dannyw wrote:
                | And after Apple dropped NVIDIA, they stopped caring
                | about OpenCL performance on their GPUs.
        
               | fooker wrote:
               | Yes, free markets and monopolies are not incompatible.
               | 
                | When a monopoly uses its status in an attempt to gain
                | another monopoly, that's a problem, and governments
                | eventually strike this behavior down.
               | 
                | Sometimes it takes time, because you'd rather not go on
                | an ideology power trip and break something that's useful
                | to the country/world.
        
               | Perseids wrote:
               | > > Can one really speak of efficient markets
               | 
               | > Yes, free markets and monopolies are not incompatible.
               | 
               | How did you get from "efficient markets" to "free
               | markets"? The first could be accepted as inherently
               | value, while the latter is clearly not, if this kind of
               | freedom degrades to: "Sure you can start your business,
               | it's a free country. For certain, you will fail, though,
               | because there are monopolies already in place who have
               | all the power in the market."
               | 
                | Also, monopolies are regularly used to squeeze
                | exorbitant shares of the added value from the other
                | market participants; see e.g. Apple's App Store cut.
                | Accepting
               | that as "efficient" would be a really unusual usage of
               | the term in regard to markets.
        
               | privatelypublic wrote:
                | You scuttled your argument by using the Apple App Store
                | as an example.
        
               | ameliaquining wrote:
               | The term "efficient markets" tends to confuse and mislead
               | people. It refers to a particular narrow form of
               | "efficiency", which is definitely not the same thing as
               | "socially optimal". It's more like "inexploitability";
               | the idea is that in a big enough world, any limited
               | opportunities to easily extract value will be taken (up
               | to the opportunity cost of the labor of the people who
               | can take them), so you shouldn't expect to find any
               | unless you have an edge. The standard metaphor is, if I
               | told you that there's a $20 bill on the sidewalk in Times
               | Square and it's been there all week, you shouldn't
               | believe me, because if it were there, someone would have
               | picked it up.
               | 
               | (The terminology is especially unfortunate because people
               | tend to view it as praise for free markets, and since
               | that's an ideological claim people respond with opposing
               | ideological claims, and now the conversation is about
               | ideology instead of about understanding a specific
               | phenomenon in economics.)
               | 
               | This is fully compatible with Apple's App Store revenue
               | share existing and not creating value (i.e., being rent).
               | What the efficient markets principle tells us is that, if
               | it were possible for someone else to start their own app
               | store with a smaller revenue share and steal Apple's
               | customers that way, then their revenue share would
               | already be much lower, to account for that. Since this
               | isn't the case, we can conclude that there's some reason
               | why starting your own competing app store wouldn't work.
               | Of course, we already separately know what that reason
               | is: an app store needs to be on people's existing devices
               | to succeed, and your competing one wouldn't be.
               | 
               | Similarly, if it were possible to spend $10 million to
               | create an API-compatible clone of CUDA, and then save
               | more than $10 million by not having to pay huge margins
               | to Nvidia, then someone would have already done it. So we
               | can conclude that either it can't be done for $10
               | million, or it wouldn't create $10 million of value. In
               | this case, the first seems more likely, and the comment
               | above hypothesizes why: because an incomplete clone
               | wouldn't produce $10 million of value, and a complete one
               | would cost much more than $10 million. Alternatively, if
               | Nvidia could enforce intellectual property rights against
               | someone creating such a clone, that would also explain
               | it.
               | 
               | (Technically it's possible that this could instead be
               | explained by a free-rider problem; i.e., such a clone
               | would create more value than it would cost, but no
               | company wants to sponsor it because they're all waiting
               | for some other company to do it and then save the $10
               | million it would cost to do it themselves. But this seems
               | unlikely; big tech companies often spend more than $10
               | million on open source projects of strategic
               | significance, which a CUDA clone would have.)
        
               | pjmlp wrote:
              | And the tooling; people keep forgetting about CUDA
              | tooling.
        
             | int_19h wrote:
             | From the market perspective, it's down to whether the
             | amount of money needed to get there _and stay there_
             | (keeping in mind that this would have to be an ongoing
             | effort given that CUDA is not a static target) is more or
             | less than the amount of money needed to just buy NVIDIA
             | GPUs.
        
             | ivell wrote:
             | Modular is trying with Mojo + Max offering. It has taken
             | quite a bit of effort to target NVidia and get parity. They
             | are now focusing on other hardware.
        
           | hangonhn wrote:
           | Is CUDA tied very closely to the Nvidia hardware and
           | architecture so that all the abstraction would not make sense
           | on other platforms? I know very little about hardware and low
           | level software.
           | 
           | Thanks
        
             | saagarjha wrote:
             | Yes, also it's a moving target where people don't just want
             | compatibility but also good performance.
        
             | dagmx wrote:
              | CUDA isn't really that hyper-specific to NVIDIA hardware
              | as an API.
             | 
             | But a lot of the most useful libraries are closed source
             | and available on NVIDIA hardware only.
             | 
              | You could probably get most open source CUDA to run on
              | other vendors' hardware without crazy work. But you'd
              | spend a ton more effort getting to parity on the
              | ecosystem, plus lawyer fees when NVIDIA comes at you.
        
             | lcnielsen wrote:
              | The kind of CUDA you or I would write is not very
              | hardware-specific (a few constants here and there), but
              | the kind of CUDA behind cuBLAS, with a million magic
              | flags, inline PTX ("GPU assembly") and exploitation of
              | driver/firmware hacks, is. It's like the difference
              | between numerics code in C and numerics code in C with
              | tons of inline assembly for each one of a number of
              | specific processors.
              | 
              | You can see similar things if you buy datacenter-grade
              | CPUs from AMD or Intel and compare their per-model
              | optimized BLAS builds and compilers to using OpenBLAS or
              | swapping them around. The difference is not world-ending,
              | but you can see maybe 50% in some cases.
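              | 
              | To make "inline PTX" concrete, this is the flavor of
              | thing that shows up all over tuned kernels (a generic
              | textbook example, not something from cuBLAS itself):
              | 
              |     // Read a per-thread special register via PTX.
              |     __device__ unsigned lane_id() {
              |       unsigned lane;
              |       asm volatile("mov.u32 %0, %%laneid;"
              |                    : "=r"(lane));
              |       return lane;
              |     }
              | 
              | The cuBLAS-class kernels go much further than this, which
              | is part of why they're so hard to match.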
        
             | pjmlp wrote:
              | CUDA is an ecosystem; many keep failing to understand
              | that, trying to copy only the C++ compiler.
        
         | ls612 wrote:
         | #2 would be Google v. Oracle wouldn't it?
        
         | quitit wrote:
         | It's 1.
         | 
         | It means that a developer can use their relatively low-powered
         | Apple device (with UMA) to develop for deployment on nvidia's
         | relatively high-powered systems.
         | 
         | That's nice to have for a range of reasons.
        
           | _zoltan_ wrote:
           | "relatively high powered"? there's nothing faster out there.
        
             | chvid wrote:
             | Relative to what you can get in the cloud or on a desktop
             | machine.
        
             | MangoToupe wrote:
             | Is this true per watt?
        
               | spookie wrote:
                | It doesn't matter for a lot of applications. But fair,
                | for a big part of them it is either essential or a
                | nice-to-have. But it's completely off the point if we
                | are chasing the fastest compute no matter what.
        
               | johnboiles wrote:
               | ...fastest compute no matter watt
        
             | sgt101 wrote:
              | I wonder what Apple would have to do to make Metal + its
              | processors run faster than Nvidia? I guess that it's all
              | about the interconnects really.
        
               | summarity wrote:
               | Right now, for LLMs, the only limiting factor on Apple
               | Silicon is memory bandwidth. There hasn't been progress
               | on this since the original M1 Ultra. And since abandoning
               | UltraFusion, we won't see progress here anytime soon
               | either.
        
               | glhaynes wrote:
               | Have they abandoned UltraFusion? Last I'd heard, they'd
               | just said something like "not all generations will get an
               | Ultra chip" around the time the M4 showed up (the first M
               | chip lacking an Ultra variation), which makes me think
               | the M5 or M6 is fairly likely to get an Ultra.
        
               | librasteve wrote:
               | this is like saying the only limiting factor on computers
               | is the von neumann bottleneck
        
             | quitit wrote:
              | Relative to the Apple hardware, the Nvidia hardware is
              | high-powered.
              | 
              | I appreciate that English is your second language after
              | your Hungarian mother tongue. My comment reflects upon the
              | low- and high-powered compute of the Apple vs. Nvidia
              | hardware.
        
           | chvid wrote:
            | If Apple cannot do their own implementation of CUDA due to
            | copyright, the second-best thing is this: getting developers
            | to build for MLX (which is on their laptops) and still get
            | NVIDIA hardware support.
           | 
           | Apple should do a similar thing for AMD.
        
             | xd1936 wrote:
             | I thought that the US Supreme Court decision in Google v.
             | Oracle and the Java reimplementation provided enough case
             | precedent to allow companies to re-implement something like
             | CUDA APIs?
             | 
             | https://www.theverge.com/2021/4/5/22367851/google-oracle-
             | sup...
             | 
             | https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,
             | _....
        
               | timhigins wrote:
                | Exactly, and see also ROCm/HIP, which is AMD's
                | reimplementation of CUDA for their GPUs.
        
           | randomNumber7 wrote:
           | What is the performance penalty compared to a program in
           | native CUDA?
        
           | karmakaze wrote:
           | It would be great for Apple if enough developers took this
           | path and Apple could later release datacenter GPUs that
           | support MLX without CUDA.
        
             | nightski wrote:
             | It's the other way around. If Apple released data center
             | GPUs then developers might take that path. Apple has shown
             | time and again they don't care for developers, so it's on
             | them.
        
         | dagmx wrote:
          | 2 also further cements CUDA as the de facto API to target, and
          | nobody would write MLX-targeted code instead.
          | 
          | This way, you're more incentivized to write MLX and have it run
          | everywhere. It's a situation where everyone wins, especially
          | Apple, because they can optimize it further for their platforms.
        
         | tekawade wrote:
          | I want #3: being able to connect an NVIDIA GPU to Apple Silicon
          | and run CUDA. Take advantage of Apple silicon + unified memory
          | + GPU + CUDA with PyTorch, JAX, or TensorFlow.
         | 
         | Haven't really explored MLX so can't speak about it.
        
         | tho234j2344 wrote:
          | I don't think #2 is really true - AMD's HIP is doing this exact
          | thing after giving up on OpenCL way back in ~'17/'18.
        
           | NekkoDroid wrote:
            | I haven't looked into it, but doesn't HIP need everything to
            | be recompiled against it? To my understanding it is
            | effectively mostly a source-code translation.
        
             | pjmlp wrote:
             | For CUDA C++, not CUDA the ecosystem.
        
         | sitkack wrote:
         | #2 is not a copyright violation. You can reimplement APIs.
        
           | 7734128 wrote:
           | The famous Android Java fight is probably the most important
           | case of that discussion.
        
             | hnfong wrote:
             | Indeed.
             | 
             | Unfortunately when that case went to the Supreme Court they
             | basically just said "yeah for this case it's fair use, but
             | we're not going to comment on whether APIs in general are
             | copyrightable"...
        
           | adastra22 wrote:
            | CUDA is not an API; it is a set of libraries written by
            | NVIDIA. You'd have to reimplement those libraries, and for
            | people to care at all you'd have to reimplement the
            | optimizations in those libraries. That does get into various
            | IP issues.
        
             | Imustaskforhelp wrote:
              | Even if it's not as optimized, it would still be nice to
              | see a CUDA alternative, really.
              | 
              | Also, I do wonder what the difference between an API and
              | a set of libraries is; couldn't an API be exposed from
              | that set of libraries which could then be used? It's a
              | little confusing I guess.
        
               | adastra22 wrote:
               | > couldn't an API be exposed from that set of libraries
               | which could be used
               | 
               | And now you've entered that copyright violation
               | territory.
        
               | Someone wrote:
               | IP infringement, not copyright violation.
               | 
                | A clean-room reimplementation of CUDA would avoid any
                | copyright claims, but would not necessarily avoid
                | patent infringement.
               | 
               | https://en.wikipedia.org/wiki/Clean-room_design:
               | 
               |  _"Clean-room design is useful as a defense against
               | copyright infringement because it relies on independent
               | creation. However, because independent invention is not a
               | defense against patents, clean-room designs typically
               | cannot be used to circumvent patent restrictions."_
        
               | dragonwriter wrote:
                | > A clean-room reimplementation of CUDA would avoid
                | any copyright claims,
                | 
                | Assuming APIs are either not copyrightable or that API
                | reimplementation is always fair use of the API, _neither_
                | of which there is sufficient precedent to justify as a
                | conclusion; Oracle v. Google ended with "well, it would
                | be fair use in the exact factual circumstances in this
                | case so we don't have to reach the thornier general
                | questions".
        
             | pjmlp wrote:
              | CUDA is neither an API nor a set of libraries; people get
              | this wrong all the time.
              | 
              | CUDA is an ecosystem of programming languages, libraries,
              | and developer tools.
              | 
              | It is composed of compilers for C, C++, Fortran, and
              | Python JIT DSLs provided by NVidia, plus several others
              | that target either PTX or NVVM IR.
              | 
              | Plus the libraries, which you correctly point out.
              | 
              | And then the IDE integrations, the GPU debugger that is on
              | par with Visual Studio-style debugging, the profiler, ...
              | 
              | Hence why everyone that focuses on copying only CUDA C or
              | CUDA C++, without everything else that makes CUDA
              | relevant, keeps failing.
        
               | CamperBob2 wrote:
               | Only the runtime components matter, though. Nobody cares
               | about the dev tools beyond the core compiler. What people
               | want is to be able to recompile and run on competitive
               | hardware, and I don't understand why that's such an
               | intractable problem.
        
               | outworlder wrote:
               | It is not.
               | 
               | However, companies may still be hoping to get their own
               | solutions in place instead of CUDA. If they do implement
               | CUDA, that cements its position forever. That ship has
               | probably already sailed, of course.
        
               | StillBored wrote:
                | Because literally the entire rest of the ecosystem is
                | immature demoware. Rather than each vendor buying into
                | OpenCL+SPIR-V and building a robust stack around it,
                | they are all doing their own half-baked tech demos
                | hoping to lock up some portion of the market to
                | duplicate Nvidia's success, or at least carve out a
                | niche, while Nvidia continues to extend and mature its
                | ecosystem. Intel has oneAPI, AMD has ROCm, Arm has
                | ACL/Kleidi/etc, and there's a pile of other stacks like
                | MLX, Windows ML, whatever. Combined with a confusing
                | mix of pure software plays like PyTorch and Windows ML.
               | 
               | A lot of people talk about 'tooling' quality and no one
               | hears them. I just spent a couple weeks porting a fairly
               | small library to some fairly common personal hardware and
               | hit all the same problems you see everywhere. Bugs aren't
               | handled gracefully. Instead of returning "you messed up
               | here", the hardware locks up, and power cycling is the
                | only solution. Not a problem when you're writing hello
                | world, but trolling through tens of thousands of lines
                | of GPU kernel code to find the error is going to burn
                | engineer time without anything to show for it. Then
                | when it's running, spending weeks in an open feedback
                | loop trying to figure out why the GPU utilization
                | metrics are reporting 50% utilization (if you're lucky
                | enough to even have them) and the kernel is running at
                | 1/4 the expected performance is again going to burn
                | weeks. All because there isn't a functional profiler.
               | 
                | And the vendors can't even get this stuff working.
                | People rant about the ROCm support list not covering the
                | hardware people actually have. And it is such a mess
                | that in some cases it actually works but AMD says it
                | doesn't. And of course, the only reason you hear people
                | complaining about AMD is because they are literally the
                | only company that has a hardware ecosystem that in
                | theory spans the same breadth of devices, from small
                | embedded systems to giant data-center-grade products,
                | that NVIDIA does. Everyone else wants a slice of the
                | market, but take Apple here: they have nothing in the
                | embedded/edge space that isn't a fixed-function device
                | (e.g. a watch or Apple TV), and their GPUs, while
                | interesting, are nowhere near the level of the
                | datacenter-grade stuff, much less even top-of-the-line
                | AIC boards for gamers.
               | 
                | And it's all gotten to be such an industry-wide pile of
                | trash that people can't even keep track of basic
                | feature capabilities. Like, a huge pile of hardware
                | actually 'supports' OpenCL, but it's buried to the
                | point where actual engineers working on, say, ROCm are
                | unaware it's actually part of the ROCm stack (imagine
                | my surprise!). And it's been the same for Nvidia: they
                | have at times supported OpenCL, but the support is like
                | a .dll they install with the GPU driver stack and don't
                | even bother to document that it's there. Or TensorFlow,
                | which seems to have succumbed to the immense
                | gravitational black hole it had become, where just
                | building it on something that wasn't the blessed
                | platform could take days.
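                | 
                | A quick way to check the "buried OpenCL support" point
                | on any given box is to just enumerate what the installed
                | driver stack actually exposes. A minimal sketch,
                | assuming the third-party pyopencl package is installed:
                | 
                |     # Lists every OpenCL platform/device the ICD loader
                |     # can see; no output (or an error) means the support
                |     # isn't actually wired up, whatever the marketing
                |     # says.
                |     import pyopencl as cl
                | 
                |     for platform in cl.get_platforms():
                |         print(platform.name, platform.version)
                |         for device in platform.get_devices():
                |             print("   ", device.name)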
        
               | int_19h wrote:
               | It's the same essential problem as with e.g. Wine - if
               | you're trying to reimplement someone else's constantly
               | evolving API with a closed-source implementation, it
               | takes a lot of effort just to barely keep up.
               | 
               | As far as portability, people who care about that already
               | have the option of using higher-level APIs that have CUDA
               | backend among several others. The main reason why you'd
               | want to do CUDA directly is to squeeze that last bit of
               | performance out of the hardware, but that is also
               | precisely the area where deviation in small details
               | starts to matter a lot.
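                | 
                | To make the "higher-level API" point concrete, here is a
                | minimal sketch (PyTorch is just one example of such an
                | API): the same code runs on whichever backend happens to
                | be present, CUDA, Apple's MPS, or plain CPU.
                | 
                |     import torch
                | 
                |     # Pick the best available backend at runtime.
                |     if torch.cuda.is_available():
                |         device = "cuda"
                |     elif torch.backends.mps.is_available():
                |         device = "mps"
                |     else:
                |         device = "cpu"
                | 
                |     x = torch.randn(1024, 1024, device=device)
                |     print(device, (x @ x.T).shape)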
        
             | vFunct wrote:
             | So if people aren't aware, you can have AI reimplement CUDA
             | libraries for any hardware, as well as develop new ones.
             | 
             | You wouldn't believe me if you didn't try it and see for
             | yourself, so try it.
             | 
             | NVidia's CUDA moat is no more.
        
         | pxc wrote:
         | Copyright can't prohibit compatible implementations that are
         | developed independently through reverse engineering, if the
         | implementers are very careful about the way they work.
         | 
         | (Software patents can, though.)
        
       | natas wrote:
       | that means the next apple computer is going to use nvidia gpu(s).
        
         | meepmorp wrote:
         | but it's not an apple-submitted pr
        
           | natas wrote:
           | they can't make it that obvious
        
             | meepmorp wrote:
             | o
        
         | MBCook wrote:
          | There's no evidence of that. The post clearly identifies a
          | _far_ more probable reason: letting things be developed on
          | Macs and then deployed on Nvidia supercomputers.
        
       | dnchdnd wrote:
        | Random aside: A lot of the people working on MLX don't seem to
        | be officially affiliated with Apple, at least on a superficial
        | review. See for example: https://x.com/prince_canuma
        | 
        | Idly wondering: is Apple bankrolling this but wants to keep it
        | on the DL? There were also rumours the team was looking to
        | move at one point?
        
         | jpcompartir wrote:
         | It seems more like Open Source devs who are looking to build
         | clout/rep with MLX?
         | 
         | Pretty sure Claude Sonnet is actually doing most of the work.
        
       | Abishek_Muthian wrote:
       | I've been very impressed with MLX models; I can open up local
        | models to everyone in the house, something I wouldn't dare do
        | with my Nvidia computer for fear of burning down the house.
       | 
       | I've been hoping Apple Silicon becomes a serious contender for
       | Nvidia chips; I wonder if the CUDA support is just Embrace,
       | extend, and extinguish (EEE).
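        | 
        | For anyone who wants to try the same thing, the usual route is
        | the mlx-lm package; roughly (the model name below is just a
        | placeholder):
        | 
        |     from mlx_lm import load, generate
        | 
        |     # Download/load an MLX-format model plus its tokenizer.
        |     model, tokenizer = load("mlx-community/SomeModel-4bit")
        |     generate(model, tokenizer, prompt="Hello", verbose=True)
        | 
        | I believe mlx-lm also ships an OpenAI-compatible mlx_lm.server
        | for pointing the rest of the house at.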
        
       | m3kw9 wrote:
        | I thought you either use MLX for Apple silicon or you compile
        | it for CUDA.
        
       | neurostimulant wrote:
       | > Being able to write/test code locally on a Mac and then deploy
       | to super computers would make a good developer experience.
       | 
        | Does this mean you can use MLX on Linux now?
        | 
        | Edit:
        | 
        | Just tested it and it's working, but only the Python 3.12
        | version is available on PyPI right now:
        | https://pypi.org/project/mlx-cuda/#files
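        | 
        | In case it saves someone a few minutes, a minimal smoke test
        | looks roughly like this (wheel name taken from the link above;
        | assumes a Linux box with a supported NVIDIA GPU):
        | 
        |     # pip install mlx-cuda   (Python 3.12 wheel only, for now)
        |     import mlx.core as mx
        | 
        |     a = mx.random.normal((1024, 1024))
        |     b = a @ a.T
        |     mx.eval(b)   # MLX is lazy; force the computation to run
        |     print(b.shape, mx.default_device())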
        
       | qwertox wrote:
        | If Apple supported Nvidia cards, it would be the #1 solution
        | for developers.
        
         | Nevermark wrote:
         | If Apple doubled the specs of their Ultra M processor every
         | year, in numbers of cores, RAM cells, internal and external
         | bandwidth, until both the Ultra processor and its RAM plane
         | took up full wafers, .... but still fit in a Mac Studio case,
         | with a new white reverse-power heat extraction USB-C+ cable
         | designed to be terminated at a port on a small wireless heat
         | exchanger dish, which instantly beamed all the waste heat into
         | space, at such high efficiency that the Studio internals could
         | operate at -100 Celsius, and all those cores overclocked, oh
         | they over clocked, ...
         | 
         | Yes we can dream!
         | 
          | It would be great if Apple continues pushing M processors to
          | the next level, in part to go vertical into the cloud.
         | 
         | Or if they start supporting nVidia.
         | 
          | The latter seems less Apple-y. But they must be considering
          | the value of a cloud-level, Apple-friendly AI computing
          | solution, so something is likely (?) to happen.
        
       | adultSwim wrote:
       | This is great to see. I had wrongly assumed MLX was Apple-only.
        
       | neuroelectron wrote:
        | Just remember to name your fp8 kernels "cutlass" for +50%
        | performance.
        
       | mattfrommars wrote:
        | It's the year 2025 and we have yet to see CUDA deliver the
        | kind of impact Java had with the idea of "write once, run it
        | anywhere".
        | 
        | Academia and companies continue to write proprietary code.
        | It's as if we were still writing code for Adobe Flash or
        | Microsoft Silverlight in 2025.
        | 
        | Honestly, as an Nvidia shareholder, I don't mind.
        
         | raincole wrote:
         | In the end Java doesn't achieve "write once, run it anywhere"
         | either.
         | 
         | I guess there might be a way to develop apps for iOS or even
         | PlayStation in Java, but my knees hurt just thinking about how
         | many hoops one needs to jump through.
        
         | bigyabai wrote:
         | I'll never get over the way Apple treated OpenCL. They saw the
         | train coming down the tracks, spent so long hedging their bet
         | against CUDA, and threw in the towel _the moment_ actual demand
         | started cropping up. CUDA very nearly had a serious, corporate-
         | funded and write-once-run-anywhere competitor.
         | 
          | Normally I'd write something snide about not seeing where
          | the puck was headed. But Apple _did_ skate to the puck here;
          | they just did nothing with it.
        
         | int_19h wrote:
         | Back in the day, the reason why people kept targeting Flash is
         | because all the other alternatives were worse. If you recall,
         | the only thing that made a difference was mobile, where Flash
         | ended up being a liability due to performance and battery
         | lifetime issues. And even then it took a company like Apple,
         | which could rely on its cult status to draw the hard line on
         | mobile Flash, ship iPhone without it (unlike Android which had
         | it, warts and all), and steadfastly refuse to even consider
         | adding it, forcing everybody else to choose between using Flash
         | and supporting the lucrative iPhone ecosystem.
         | 
         | I'm not even sure what the equivalent would be for CUDA tbh.
        
           | bigyabai wrote:
            | Apple could just talk to Khronos again. In any protracted
           | discussion of "how can the industry kill Nvidia", we always
           | circle back around to the lack of communication. There _was_
           | an era where Apple, AMD and even Nvidia all worked on Open
           | Source non-CUDA acceleration primitives. There were working
           | drivers, (a handful of) users, and bindings in multiple
           | native languages. All they needed was industry applications,
           | which would arrive with the crypto mining boom that Nvidia
           | profited off of hand-over-fist. And by then, Apple refused to
           | cooperate with their industry partners, and refused to
           | support OpenCL on iPhone GPUs or Apple Silicon.
           | Metaphorically, this would be like Apple refusing to
           | implement HTML because they wanted to promote their own Flash
           | alternative.
           | 
           | Nvidia won because they don't deal with this level of asinine
           | infighting. If Khronos could bring back a level of mutual
           | respect to their consortium, they could present a serious
           | threat. Apple is the only business still on their high horse;
           | AMD, Intel and Qualcomm all recognize that they need to
           | cooperate.
        
         | m463 wrote:
         | > "write once, run it anywhere"
         | 
         | I think you mean:
         | 
         | "write once, test everywhere"
        
       ___________________________________________________________________
       (page generated 2025-07-15 23:01 UTC)