[HN Gopher] Run LLMs on Apple Neural Engine (ANE)
       ___________________________________________________________________
        
       Run LLMs on Apple Neural Engine (ANE)
        
       Author : behnamoh
       Score  : 194 points
       Date   : 2025-05-03 15:29 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | htk wrote:
        | I always felt that the neural engine was wasted silicon; they
        | could add more GPU cores in that die space and redirect the
        | neural processing API to the GPU as needed. But I'm no expert,
        | so if anyone here has a different opinion I'd love to learn
        | from it.
        
         | bigyabai wrote:
          | If you did that, you'd stumble into the Apple GPU's lack of
          | tensor acceleration hardware. For an Nvidia-like experience
          | you'd have to re-architect the GPU to subsume the NPU's
          | role, and if _that_ was easy then everyone would have done it
          | by now.
        
           | sroussey wrote:
           | M1/M2 shared a GPU design, same with M3/M4. So maybe M5 will
           | have a new design that includes tensor cores in the GPU.
        
         | lucasoshiro wrote:
          | I'm not an ML guy, but when I needed to train a NN I thought
          | that my Mac's ANE would help. But actually, despite it being
          | way easier to set up TensorFlow + Metal + M1 on a Mac than to
          | set up TensorFlow + CUDA + Nvidia on Linux, the neural engine
          | cores are not used. Not even for classification, which is
          | their main purpose. I wouldn't say they are wasted silicon,
          | but they are way less useful than you'd expect.
        
           | mr_toad wrote:
           | Does Apple care about third party use of the ANE? There are
           | many iOS/iPadOS features that use it.
        
             | sroussey wrote:
              | Not really. Apple software uses the neural engine all
              | over the place, but third parties rarely do. Maybe this
              | will change. [1]
             | 
             | There was a guy using it for live video transformations and
             | it almost caused the phones to "melt". [2]
             | 
             | [1] https://machinelearning.apple.com/research/neural-
             | engine-tra...
             | 
             | [2] https://x.com/mattmireles/status/1916874296460456089
        
         | ks2048 wrote:
          | At least one link/benchmark I saw said the ANE can be 7x
          | faster than the GPU (Metal / MPS):
         | 
         | https://discuss.pytorch.org/t/apple-neural-engine-ane-instea...
         | 
         | It seems intuitive that if they design hardware very
         | specifically for these applications (beyond just fast matmuls
         | on a GPU), they could squeeze out more performance.
        
           | astrange wrote:
           | Performance doesn't matter. Nothing is ever about
           | performance.
           | 
           | It's about performance/power ratios.
        
         | 1W6MIC49CYX9GAP wrote:
          | You're completely right: if you already have a GPU in a
          | system, adding tensor cores to it gives you better
          | performance per area.
          | 
          | GPU + dedicated AI HW is virtually always the wrong approach
          | compared to GPU + tensor cores.
        
         | xiphias2 wrote:
          | I guess it's a hard choice, as it's 5x more energy efficient
          | than the GPU because it uses a systolic array.
          | 
          | For laptops, 2x the GPU cores would make more sense; for
          | phones/tablets, energy efficiency is everything.
        
         | brigade wrote:
          | Eyeballing 3rd party annotated die shots [1], it's about the
          | size of two GPU cores, but achieves 15.8 tflops, which is
          | more than the reported 14.7 tflops of the 32-core GPU in the
          | binned M4 Max.
         | 
         | [1] https://vengineer.hatenablog.com/entry/2024/10/13/080000
        
           | Archit3ch wrote:
            | Not really. That's 15.8 tflops of fp16 compared to 14.7
            | tflops of fp32 (which is actually useful outside AI). It
            | would be interesting to see if you can configure the ANE to
            | recover fp32 precision at lower throughput [1].
           | 
           | [1] https://arxiv.org/abs/2203.03341
        
             | brigade wrote:
              | Apple GPUs run fp16 at the same rate as fp32 except on
              | phones, so it is comparable for ML. No one runs inference
              | from fp32 weights.
              | 
              | But the point was about area efficiency.
        
         | rz2k wrote:
          | I was trying to figure the same thing out a couple of months
          | ago, and didn't find much information.
          | 
          | It looked like even ANEMLL provides only limited low-level
          | access for directing processing specifically toward the Apple
          | Neural Engine, because Core ML still acts as the orchestrator.
          | Instead, flags during conversion of a PyTorch or TensorFlow
          | model can specify ANE-optimized operations, quantization, and
          | parameters hinting at compute targets or optimization
          | strategies. For example, setting
          | `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` when
          | loading the model would disfavor the GPU cores.
         | 
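          | On the conversion side, coremltools exposes a similar knob. A
          | minimal sketch of the kind of conversion I mean (toy model,
          | untested):
          | 
          |     import torch
          |     import coremltools as ct
          | 
          |     # Toy stand-in for a small ANE-friendly model.
          |     model = torch.nn.Sequential(
          |         torch.nn.Linear(512, 512), torch.nn.ReLU())
          |     example = torch.randn(1, 512)
          |     traced = torch.jit.trace(model.eval(), example)
          | 
          |     # Ask Core ML to prefer CPU + Neural Engine and skip
          |     # the GPU entirely.
          |     mlmodel = ct.convert(
          |         traced,
          |         inputs=[ct.TensorType(shape=example.shape)],
          |         compute_precision=ct.precision.FLOAT16,
          |         compute_units=ct.ComputeUnit.CPU_AND_NE,
          |         minimum_deployment_target=ct.target.macOS13,
          |     )
          |     mlmodel.save("small.mlpackage")
          | 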
         | Anyway, I didn't actually experiment with this, but at the time
         | I thought maybe there could be a strategy of creating a
         | speculative execution framework, with a small ANE-compatible
         | model to act as the draft model paired with a larger target
         | model running on GPU cores. The idea being that the ANE's low
         | latency and high efficiency could accelerate results.
         | 
         | However, I would be interested to hear the perspective of
         | people who actually know something about the subject.
        
       | simonw wrote:
       | I'm trying to figure out what the secret sauce for this is. It
       | depends on https://github.com/apple/coremltools - is that the key
       | trick or are there other important techniques going on here?
        
         | smpanaro wrote:
         | coremltools is the only way to run on ANE, so less of a trick
         | and more of a requirement.
         | 
         | The tricks are more around optimizing for the hardware
         | capabilities/constraints. For instance:
         | 
         | - conv2d is faster than linear (see Apple's post [0]) so you
         | rewrite the model for that (example from the repo [1])
         | 
         | - inputs/outputs are static shapes, so KV cache requires some
         | creativity (I wrote about that here [2])
         | 
         | - compute is float16 (not bfloat16) so occasionally you have to
         | avoid activation overflows
         | 
         | [0]: https://machinelearning.apple.com/research/neural-engine-
         | tra...
         | 
         | [1]:
         | https://github.com/Anemll/Anemll/blob/4bfa0b08183a437e759798...
         | 
         | [2]: https://stephenpanaro.com/blog/kv-cache-for-neural-engine
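          | 
          | To make the first trick concrete, here's a rough sketch of the
          | linear-to-1x1-conv rewrite (generic PyTorch, not the repo's
          | actual code; shapes are made up):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     # ANE prefers (batch, channels, 1, seq) tensors and 1x1
          |     # convs over nn.Linear on (batch, seq, hidden).
          |     linear = nn.Linear(512, 2048)
          |     conv = nn.Conv2d(512, 2048, kernel_size=1)
          |     conv.weight.data = linear.weight.data.view(2048, 512, 1, 1)
          |     conv.bias.data = linear.bias.data
          | 
          |     x = torch.randn(1, 64, 512)             # (B, S, C)
          |     y_ref = linear(x)                       # (1, 64, 2048)
          | 
          |     x_ane = x.transpose(1, 2).unsqueeze(2)  # (B, C, 1, S)
          |     y = conv(x_ane).squeeze(2).transpose(1, 2)
          | 
          |     # Same math, different memory layout.
          |     assert torch.allclose(y_ref, y, atol=1e-5)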
        
       | kamranjon wrote:
       | I am curious if anyone knows if the neural cores in apple silicon
       | based machines are at all useful in training? I've been using the
       | MLX framework but haven't seen them mentioned anywhere so I'm
       | just wondering if they are only useful for inference? I know
       | whisper.cpp takes advantage of them in the inference context.
       | 
       | Edit: I changed llama.cpp to whisper.cpp - I didn't realize that
       | llama.cpp doesn't have a coreml option like whisper.cpp does.
        
         | lucasoshiro wrote:
          | Well, the TensorFlow port to Metal was written by Apple, and
          | it doesn't use the ANE. If even they chose to use only the
          | GPU, the ANE probably wouldn't help in training. I also heard
          | that the ANE is way less powerful than Apple Silicon's GPU,
          | but I don't have numbers.
        
         | 486sx33 wrote:
            | Maybe a quick side shift - what the heck are Apple's neural
            | cores good for? Used for? Use cases?
        
           | kamranjon wrote:
            | They are definitely good for local inference, as evidenced
            | by the pretty amazing performance increase on Apple Silicon
            | when used with whisper.cpp, and maybe other frameworks that
            | utilize Core ML? I think they're sorta purpose-built for
            | doing matrix math.
        
           | giantrobot wrote:
           | They're used pretty extensively by stuff like the Photos app
           | for various ML tasks. Not all AI is LLMs.
        
           | freehorse wrote:
           | In my understanding, they are intended to be utilized by
           | various apps (and Apple intelligence) for performing
           | different machine learning tasks (inference), in a manner
           | that is unobtrusive to the user and the rest of the system.
           | While a GPU would theoretically be more performant, it could
           | potentially lead to excessive energy consumption, temperature
           | rise, fan noise, and other factors that may not be desirable
           | when performing basic tasks like OCR when viewing an image
           | within an application.
           | 
            | Currently, it is used, for example, through the "Vision
            | Framework", e.g. for OCR tasks (for instance, when
            | previewing an image in macOS, it performs OCR in the
            | background using the ANE). Additionally, they are utilized
            | by certain Apple Intelligence features that are executed
            | locally (e.g. when I asked Writing Tools to rewrite this
            | comment, I saw a spike in ANE usage).
            | 
            | They can also be used for diffusion image models (through
            | Core ML; diffusers has a nice frontend for that), but my
            | understanding is that they are primarily for "light" ML
            | tasks within an application rather than for running larger
            | models (though that's also possible, they are just probably
            | going to run slower than on the GPU).
        
       | behnamoh wrote:
       | I wonder if Apple ever followed up with this:
       | https://github.com/apple/ml-ane-transformers
       | 
       | They claim their ANE-optimized models achieve "up to 10 times
       | faster and 14 times lower peak memory consumption compared to
       | baseline implementations."
       | 
       | AFAIK, neither MLX nor llama.cpp support ANE. Though llama.cpp
       | started exploring this idea [0].
       | 
        | What's weird is that MLX is made _by_ Apple and yet they can't
        | support the ANE, given its closed-source API! [1]
       | 
       | [0]: https://github.com/ggml-org/llama.cpp/issues/10453
       | 
       | [1]: https://github.com/ml-
       | explore/mlx/issues/18#issuecomment-184...
        
         | kamranjon wrote:
          | Whisper.cpp has a Core ML option which gives a 3x speed-up
          | over CPU-only, according to the docs: https://github.com/ggml-
          | org/whisper.cpp?tab=readme-ov-file#c...
        
           | zozbot234 wrote:
           | Some outdated information about bare-metal use of the ANE is
           | available from the Whisper.cpp pull req:
           | https://github.com/ggml-org/whisper.cpp/pull/1021 Even more
           | outdated information at: https://github.com/eiln/ane/tree/33a
           | 61249d773f8f50c02ab0b9fe... In short, the early (M1/M2)
           | versions of ANE are unlikely to be useful for modern LLM
           | inference due to their seemingly exclusive focus on
           | statically scheduled FP16 and INT8 MADDs.
           | 
           | More extensive information at https://github.com/tinygrad/tin
           | ygrad/tree/master/extra/accel... (from the Tinygrad folks,
           | note that this is also similarly outdated) seems to basically
           | confirm the above.
           | 
           | (The jury is still out for M3/M4 which currently have no
           | Asahi support - thus, no current prospects for driving the
           | ANE bare-metal. Note however that the M3/Pro/Max ANE reported
           | performance numbers are quite close to the M2 version, so
           | there may not be a real improvement there either. M3 Ultra
           | and especially the M4 series may be a different story.)
        
             | kamranjon wrote:
              | I wouldn't say that they aren't useful for inference
              | (there are pretty clear performance improvements even from
              | the Asahi effort you linked) - it's just that you have to
              | convert the model ahead of time to be compatible with the
              | ANE, which is explained in the readme docs for whisper.cpp
              | that I linked above.
              | 
              | I would say, though, that this likely excludes them from
              | being useful for training purposes.
        
               | zozbot234 wrote:
                | Note that I was only commenting on modern quantized
                | LLMs, which basically avoid formats like FP16 or INT8,
                | preferring lower precision wherever feasible. When in-
                | memory model values must be padded to FP16/INT8, this
                | slashes your effective use of memory bandwidth, which is
                | what determines token generation speed. So the only
                | feasible benefits are really in the prompt pre-
                | processing phase, and even then only in lower power use
                | compared to the GPU, not really in higher speed.
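                | 
                | Back-of-the-envelope illustration (made-up numbers,
                | counting only weight reads):
                | 
                |     # 8B params, ~120 GB/s effective bandwidth,
                |     # every weight read once per token.
                |     params, bw = 8e9, 120e9
                |     for name, b in [("4-bit", 0.5), ("fp16", 2.0)]:
                |         print(name, bw / (params * b), "tok/s")
                |     # 4-bit: 30 tok/s vs fp16-padded: 7.5 tok/s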
        
               | kamranjon wrote:
               | That's really interesting! I didn't know that about the
               | padding behavior here. I am interested to know which
               | models this would include? I know Gemma 3 raw is bf16 -
               | are you just talking about the quantized versions of
               | these? Or are models being released purely as quantized
               | versions these days? I know Google just released a QAT
               | (Quantization Aware Training) model of Gemma 3 27b - but
               | that base model was already released.
        
               | zozbot234 wrote:
               | Models may be released as unquantized (and even then they
               | are gradually shifting towards lower precisions over
               | time), but most people are going to be running them in a
               | quantized version simply because that gives you the best
               | bang for your buck (you can fit more interesting models
               | on the same hardware). Of course this is strictly about
               | local LLM inference, though one may reasonably assume
               | that the big players are also doing something similar.
        
               | conradev wrote:
               | My understanding is that model throughput is
               | fundamentally limited at some point by the fact that the
               | ANE is less wide than the GPU.
               | 
               | At that point, the ANE loses because you have to split
               | the model into chunks and only one fits at a time.
        
               | smpanaro wrote:
               | What do you mean by less wide? The main bottleneck for
               | transformers is memory bandwidth. ANE has a much lower
               | ceiling than CPU/GPU (yes, despite unified memory).
               | 
               | Chunking is actually beneficial as long as all the chunks
               | can fit into the ANE's cache. It speeds up compilation
               | for large network graphs and cached loads are negligible
               | cost. On M1 the cache limit is 3-4GB, but it is higher on
               | M2+.
        
           | jorvi wrote:
            | ... who is running LLMs on the CPU instead of the GPU or
            | TPU/NPU?
        
             | thot_experiment wrote:
             | Every apple fanboy with a $10k mac studio.
        
               | kamranjon wrote:
               | Pretty sure they're using the 80 GPU cores available in
               | that case.
        
               | echelon wrote:
               | And that still performs worse than entry-level Nvidia
               | gaming cards.
               | 
               | Apple isn't serious about AI and needs to figure their AI
               | story out. Every other big tech company is doing
               | something about it.
        
               | voidspark wrote:
               | Not for inferencing. M3 Ultra runs big LLMs twice as fast
               | as RTX 5090.
               | 
               | https://creativestrategies.com/mac-studio-m3-ultra-ai-
               | workst...
               | 
               | RTX 5090 only has 32GB RAM. M3 Ultra has up to 512 GB
               | with 819 GB/sec bandwidth. It can run models that will
               | not fit on an RTX card.
               | 
               | EDIT: Benchmark may not be properly utilizing the 5090.
               | But the M3 Ultra is way more capable than an entry level
               | RTX card at LLM inferencing.
        
               | Spooky23 wrote:
               | My little $599 Mac Mini does inference about 15-20%
               | slower than a 5070 in my kids' gaming rig. They cost
               | about the same, and I got a free computer.
               | 
                | Nvidia makes an incredible product, but Apple's
                | different market segmentation strategy might make it a
                | real player in the long run.
        
               | kamranjon wrote:
                | They're basically second place behind NVIDIA for model
                | inference performance, and often the only game in town
                | for the average person if you're trying to run larger
                | models that won't fit in the 16 or 24 GB of memory
                | available in top-shelf NVIDIA offerings.
               | 
               | I wouldn't say Apple isn't serious about AI, they had the
               | forethought to build the shared memory architecture with
               | the insane memory bandwidth needed for these types of
               | tasks, while at the same time designing neural cores
               | specifically for small on-device models needed for future
               | apps.
               | 
                | I'd say Apple is currently ahead of NVIDIA in just
                | sheer memory available - which, for doing training and
                | inference on large models, is kinda crucial, at least
                | right now. NVIDIA seems to be purposefully limiting the
                | memory available in their consumer cards, which is
                | pretty short-sighted I think.
        
               | voidspark wrote:
               | M3 Ultra has a big GPU with 819 GB/sec bandwidth.
               | 
               | LLM performance is twice as fast as RTX 5090
               | 
               | https://creativestrategies.com/mac-studio-m3-ultra-ai-
               | workst...
        
               | behnamoh wrote:
               | > LLM performance is twice as fast as RTX 5090
               | 
                | Your tests are wrong. You used MLX for the Mac Studio
                | (optimized for Apple Silicon) but you didn't use vLLM
                | for the 5090. There's no way a machine with half the
                | bandwidth of the 5090 delivers twice the tok/s.
        
               | seanmcdirmid wrote:
                | Unless it's a large model that doesn't fit in the 5090,
                | but that's no longer a $4k Mac Studio I think.
        
               | behnamoh wrote:
               | that's orthogonal to the speed discussion.
               | 
               | also, the GP was mostly testing models that fit in both
               | 5090 and Mac Studio.
        
               | voidspark wrote:
               | $4k will get you a 96 GB Mac Studio with M3 Ultra (819
               | GB/sec).
               | 
               | That's 3x the RAM of the 5090.
        
               | voidspark wrote:
               | Yeah that's probably wrong. But the M3 Ultra is good
               | enough for local inferencing, in any case.
        
               | briandear wrote:
               | Can we stop with the derisive "fanboy" nonsense? Most
               | people don't say "FOSS" fanboy or Linux "fanboy" -- but
               | plenty of people here are exactly that. It's a bit
               | insulting to people that like and appreciate Mac
               | hardware; just because you might not like it doesn't mean
                | you have to be so dismissive. And that Mac Studio is a
                | very impressive computer -- but it's usually the ones
                | that have never used one that seem to have the most
                | opinions about them.
        
             | kamranjon wrote:
              | Actually that's a really good question - I hadn't
              | considered that the comparison here is just CPU vs. using
              | Metal (CPU+GPU).
             | 
             | To answer the question though - I think this would be used
             | for cases where you are building an app that wants to
             | utilize a small AI model while at the same time having the
             | GPU free to do graphics related things, which I'm guessing
             | is why Apple stuck these into their hardware in the first
             | place.
             | 
             | Here is an interesting comparison between the two from a
             | whisper.cpp thread - ignoring startup times - the CPU+ANE
             | seems about on par with CPU+GPU: https://github.com/ggml-
             | org/whisper.cpp/pull/566#issuecommen...
        
               | conradev wrote:
               | It essentially never makes sense to run on the CPU and
               | you will only ever see enthusiasts doing it.
               | 
               | Yes, hammering the GPU too hard can affect the display
               | server, but no, switching to the CPU is not a good
               | alternative
        
               | kamranjon wrote:
               | Not switching to the CPU - switching to the ANE (Neural
               | Cores) - if you read the research papers Apple has
               | released - the example I gave is pretty much how it's
               | being used - small image classification models running on
               | the ANE, alongside a graphics app that needs the GPU to
               | be free.
        
             | yjftsjthsd-h wrote:
             | Not all of us own GPUs worth using. Now, among people using
             | macs... Maybe if you had a hardware failure?
        
           | echelon wrote:
           | > coreml option which gives 3x speed up over cpu
           | 
           | Which is still painfully slow. CoreML is not a real ML
           | platform.
        
         | ks2048 wrote:
          | One update from Apple Research since that one was "Deploying
          | Attention-Based Vision Transformers to Apple Neural Engine"
          | (though the relationship isn't clear; it doesn't build on
          | ane_transformers, but maybe it's a sister project for
          | vision?).
         | 
         | blog: https://machinelearning.apple.com/research/vision-
         | transforme...
         | 
         | github: https://github.com/apple/ml-vision-transformers-ane
        
         | 1W6MIC49CYX9GAP wrote:
          | Based on the graphs, "up to 10 times faster" compares
          | before/after flash attention.
        
         | ljosifov wrote:
          | This more than anything feels emblematic to me: that Apple
          | executives are brain dead when it comes to software, with AI
          | seemingly being a step too far (in the s/w direction). While
          | they could at some level grok classical s/w, NNs are a Terra
          | Incognita where Apple executives cannot possibly follow. It's
          | just too strange and mysterious a world for them to be able
          | to effectively decide or execute on anything. I worked (20
          | yrs ago) in an industrial R&D lab of a mostly-h/w
          | manufacturer for 3 years. It looked to me that the worlds of
          | h/w and s/w, the mindsets, diverged pretty quickly on all
          | important considerations of "what should happen next".
        
           | anon373839 wrote:
           | Huh? Apple ships heaps of deep learning-based
           | products/features, and they've done so for years. But I'll
           | agree with you, they're currently behind in generative AI.
        
         | smpanaro wrote:
          | Not a public follow-up, but the iOS 17 speech-to-text model
          | has a clever approach to KV caching that works within the
          | ANE's constraints (fixed-size inputs).
          | 
          | I wrote about it here [0], but the gist is that you can have
          | a fixed-size cache and slide it in chunks with each
          | inference. Not as efficient as a cache that grows by one each
          | time, of course.
         | 
         | [0]: https://stephenpanaro.com/blog/inside-
         | apples-2023-transforme...
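          | 
          | A toy sketch of the general idea (not the model's actual
          | shapes or code):
          | 
          |     import torch
          | 
          |     class SlidingKVCache:
          |         """Fixed-capacity cache that slides in chunks."""
          |         def __init__(self, cap=8, chunk=4, dim=64):
          |             self.cap, self.chunk = cap, chunk
          |             self.kv = torch.zeros(cap, dim)
          |             self.used = 0
          | 
          |         def append(self, new_kv):  # new_kv: (chunk, dim)
          |             if self.used + self.chunk > self.cap:
          |                 # Drop the oldest chunk, shift the rest.
          |                 self.kv = torch.roll(self.kv, -self.chunk, 0)
          |                 self.used -= self.chunk
          |             self.kv[self.used:self.used+self.chunk] = new_kv
          |             self.used += self.chunk
          | 
          |     cache = SlidingKVCache()
          |     for _ in range(3):  # the third call triggers a slide
          |         cache.append(torch.randn(4, 64))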
        
       | cowmix wrote:
       | This sorta reminds me of the lie that was pushed when the
       | Snapdragon X laptops were being released last year. Qualcomm
       | implied the NPU would be used for LLMs -- and I bought into the
       | BS without looking into it. I still use a Snapdragon laptop as my
       | daily driver (it's fine) but for running models locally, it's
        | still a joke. Despite Qualcomm's claims about running 13B-
        | parameter models, software like LM Studio only runs on the CPU,
        | with NPU support merely "planned for future updates" (per XDA).
        | The NPU isn't even faster than the CPU for LLMs -- it's just
        | more power-efficient for small models, not the big ones people
        | actually want to run. Their GPUs aren't much better for this
        | purpose either.
       | The only hope for LLMs is the Vulkan support on the Snapdragon X
       | -- which still is half-baked.
        
         | nullpoint420 wrote:
         | If you don't mind me asking, what OS do you use on it?
        
           | cowmix wrote:
            | I use Windows 11. Podman/WSL2 works way better than I
            | thought it would. And when Git Bash was finally ported
            | (officially), that filled other gaps I was missing in my
            | workflow. Windows ARM Python is still lacking all sorts of
            | stuff, but overall I'm pretty productive on it.
            | 
            | I pre-ordered the Snapdragon X Dev Kit from Qualcomm - but
            | they ended up delivering a few units -- only to cancel the
            | whole program. The whole thing turned out to be a hot-mess
            | express saga. THAT computer was going to be my Debian rig.
        
         | captainregex wrote:
          | AnythingLLM uses the NPU.
        
           | mikaraento wrote:
           | Could you provide a pointer to docs for this? It wasn't
           | obvious from an initial read of their docs.
        
             | tough wrote:
              | You can find something about it in the 1.7.2 changelog
              | here: https://docs.anythingllm.com/changelog/v1.7.2
        
         | wmf wrote:
         | AFAIK Windows 11 does use the NPU to run Phi Silica language
         | models and this is available to any app through some API. The
         | models are quite small as you said though.
        
       | daedrdev wrote:
        | Apple is a competitive choice simply because their unified
        | memory allows you to get enough RAM to run larger models that
        | would otherwise require multiple GPUs to fit.
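        | 
        | Rough weight-only arithmetic (assuming ~4-bit quantization and
        | ignoring KV cache and activations):
        | 
        |     GB = 1e9
        |     for params in (8e9, 32e9, 70e9):
        |         print(f"{params/GB:.0f}B -> ~{params*0.5/GB:.0f} GB")
        |     # 8B -> ~4 GB, 32B -> ~16 GB, 70B -> ~35 GB; the last
        |     # already exceeds a single 24/32 GB consumer GPU.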
        
         | fmajid wrote:
         | Yes, but their refusal to open up the ANE to third-party models
         | negates that. You can get (or will be able to very soon) a
         | Strix Halo Ryzen AI Max+ 395 able to access 96GB of unified RAM
         | (on a 128GB system) for well under half what you'd pay for an
         | equivalent M4 system from Apple.
        
           | rz2k wrote:
           | Doesn't the M3 Ultra have 3-4x the RAM bandwidth though?
        
             | wmf wrote:
             | At 3-4x the price.
        
           | wmf wrote:
           | Hobbyists don't use ANE on Apple Silicon and they won't use
           | XDNA on Strix Halo either. In both cases the GPU is faster
           | and can access most of system memory.
           | 
           | $2,000 vs. $3,500 isn't well under half either.
        
       | antirez wrote:
       | The README lacks the most important thing: how many more
       | tokens/sec at the same quantization, compared to llama.cpp / MLX?
        | It is only worth switching default platforms if there is a
        | major improvement.
        
       | cwoolfe wrote:
       | Is there a performance benefit for inference speed on M-series
       | MacBooks, or is the primary task here simply to get inference
       | working on other platforms (like iOS)? If there is a performance
       | benefit, it would be great to see tokens/s of this vs. Ollama.
        
       | neuroelectron wrote:
       | btw, don't bother trying to buy a bunch of Mac boxes to run LLMs
       | in parallel because it won't be any faster than a single box.
        
         | ionwake wrote:
          | Is everyone just waiting for the DGX Spark? Are they really
          | going to ban local inference?
        
           | neuroelectron wrote:
           | What do you mean ban? The bandwidth between macs isn't enough
           | to do inference effectively.
        
             | zozbot234 wrote:
             | I'm pretty sure you can network Macs together via the
             | latest Thunderbolt standards and get pretty decent
             | performance overall. Sure, it will be a bottleneck to some
             | extent but it's still useful for many purposes.
        
       | m3kw9 wrote:
        | Even a 1 GB model is prohibitively big for phones if you want
        | mass adoption.
        
         | ukuina wrote:
         | We'll just have to await next week's oai-
         | agi1-0.4b-a0.1b-iq1_xs.gguf
        
       | xyst wrote:
       | Getting anything to work on Apple proprietary junk is such a
       | chore.
        
       | jvstokes wrote:
        | But Core ML utilizes the ANE, right? Is there some bottleneck
        | in Core ML that requires lower-level access?
        
       | gitroom wrote:
        | Man, Apple's tight grip on the ANE is kinda nuts - would love
        | to see the day they let folks get real hands-on. Do you ever
        | think companies hold stuff back just to keep control, or is
        | there actually some big tech reason for it?
        
         | zozbot234 wrote:
         | People keep saying this but I'm not seeing the big difference
         | with other NPU varieties. Either way we're still talking about
         | very experimental stuff that also tends to be hardwired towards
         | some pre-determined use case. So I'm not surprised that people
         | are running into problems while trying to make these more
         | broadly useful.
        
           | wtallis wrote:
            | True; _everybody's_ NPU hardware is afflicted by awkward
           | hardware and software constraints that don't come close to
           | keeping pace with the rapidly-shifting interests of ML
           | researchers.
           | 
           | To some degree, that's an unavoidable consequence of how long
           | it takes to design and ship specialized hardware with a
           | supporting software stack. By contrast, ML research is moving
           | way faster because they hardly ever ship anything product-
           | like; it's a good day when the installation instructions for
            | some ML thing only include three steps that amount to
           | "download more Python packages".
           | 
           | And the lack of cross-vendor standardization for APIs and
           | model formats is also at least partly a consequence of
           | various NPUs evolving from very different starting points and
           | original use cases. For example, Intel's NPUs are derived
           | from Movidius, so they were originally designed for computer
           | vision, and it's not at all a surprise that making them do
           | LLMs might be an uphill battle. AMD's NPU comes from Xilinx
           | IP, so their software mess is entirely expected. Apple and
           | Qualcomm NPUs presumably are still designed primarily to
           | serve smartphone use cases, which didn't include LLMs until
           | after today's chips were designed.
           | 
           | It'll be _very_ interesting to see how this space matures
           | over the next several years, and whether the niche of
            | specialized low-power NPUs survives in PCs or if NVIDIA's
           | approach of only using the GPU wins out. A lot of that
           | depends on whether anybody comes up with a true killer app
           | for local on-device AI.
        
             | zozbot234 wrote:
             | > It'll be very interesting to see how this space matures
             | over the next several years, and whether the niche of
             | specialized low-power NPUs survives in PCs or if NVIDIA's
             | approach of only using the GPU wins out.
             | 
              | GPUs are gaining their own kinds of specialized blocks
              | such as matrix/tensor compute units, or BVH acceleration
              | for ray-tracing (that may or may not turn out to be useful
              | for other stuff). So I'm not sure that there's any real
              | distinction from that POV - a specialized low-power unit
              | in an iGPU is going to be practically indistinguishable
              | from an NPU, except that it will probably be easier to
              | target from existing GPU APIs.
        
               | wtallis wrote:
                | > a specialized low-power unit in an iGPU is going to
                | be practically indistinguishable from an NPU, except
                | that it will probably be easier to target from existing
                | GPU APIs.
               | 
               | Possibly, depending on how low the power actually is. We
               | can't really tell from NVIDIA's tensor cores, because
               | waking up an NVIDIA discrete GPU at all has a higher
               | power cost than running an NPU. Intel's iGPUs have matrix
               | units, but I'm not sure if they can match their NPU on
               | power or performance.
        
       ___________________________________________________________________
       (page generated 2025-05-03 23:00 UTC)