[HN Gopher] Run LLMs on Apple Neural Engine (ANE)
___________________________________________________________________
Run LLMs on Apple Neural Engine (ANE)
Author : behnamoh
Score : 194 points
Date : 2025-05-03 15:29 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| htk wrote:
| I always felt that the neural engine was wasted silicon; they
| could add more GPU cores in that die space and redirect the
| neural processing API to the GPU as needed. But I'm no expert, so
| if anyone here has a different opinion I'd love to learn from it.
| bigyabai wrote:
| If you did that, you'd stumble into the Apple GPU's lack of
| tensor acceleration hardware. For an Nvidia-like experience
| you'd have to re-architect the GPU to subsume the NPU's
| role, and if _that_ were easy then everyone would have done it
| by now.
| sroussey wrote:
| M1/M2 shared a GPU design, same with M3/M4. So maybe M5 will
| have a new design that includes tensor cores in the GPU.
| lucasoshiro wrote:
| I'm not an ML guy, but when I needed to train a NN I thought
| that my Mac's ANE would help. But actually, despite it
| being way easier to set up tensorflow + metal + M1 on a Mac than
| to set up tensorflow + cuda + nvidia on Linux, the neural engine
| cores are not used. Not even for classification, which is
| their main purpose. I wouldn't say they are wasted silicon, but
| they are way less useful than you'd expect.
| mr_toad wrote:
| Does Apple care about third party use of the ANE? There are
| many iOS/iPadOS features that use it.
| sroussey wrote:
| Not really. Apple software uses the neural engine all over
| the place, but rarely do others. Maybe this will change [1]
|
| There was a guy using it for live video transformations and
| it almost caused the phones to "melt". [2]
|
| [1] https://machinelearning.apple.com/research/neural-
| engine-tra...
|
| [2] https://x.com/mattmireles/status/1916874296460456089
| ks2048 wrote:
| At least one link/benchmark I saw said the ANE can be 7x faster
| than the GPU (Metal / MPS):
|
| https://discuss.pytorch.org/t/apple-neural-engine-ane-instea...
|
| It seems intuitive that if they design hardware very
| specifically for these applications (beyond just fast matmuls
| on a GPU), they could squeeze out more performance.
| astrange wrote:
| Performance doesn't matter. Nothing is ever about
| performance.
|
| It's about performance/power ratios.
| 1W6MIC49CYX9GAP wrote:
| You're completely right: if you already have a GPU in a system,
| adding tensor cores to it gives you better performance per
| area.
|
| GPU + dedicated AI HW is virtually always the wrong approach
| compared to GPU + tensor cores.
| xiphias2 wrote:
| I guess it's a hard choice, as it's 5x more energy efficient
| than the GPU because it uses a systolic array.
|
| For laptops, 2x the GPU cores would make more sense; for
| phones/tablets, energy efficiency is everything.
| brigade wrote:
| Eyeballing 3rd party annotated die shots [1], it's about the
| size of two GPU cores but achieves 15.8 tflops, which is more
| than the reported 14.7 tflops of the 32-core GPU in the binned
| M4 Max.
|
| [1] https://vengineer.hatenablog.com/entry/2024/10/13/080000
| Archit3ch wrote:
| Not really. That's 15.8 tflops of fp16 compared to 14.7 tflops
| of fp32 (which is actually useful outside AI). It would be
| interesting to see if you can configure the ANE to recover
| fp32 precision at lower throughput [1].
|
| [1] https://arxiv.org/abs/2203.03341
| brigade wrote:
| Apple GPUs run fp16 at the same rate as fp32 except on
| phones, so it is comparable for ML. No one runs inference
| from fp32 weights.
|
| But the point was about area efficiency.
| rz2k wrote:
| I was trying to figure the same thing out a couple months ago,
| and didn't find much information.
|
| It looked like even ANEMLL provides only limited low-level
| access for directing processing specifically toward the Apple
| Neural Engine, because Core ML still acts as the orchestrator.
| Instead, flags set during conversion of a PyTorch or TensorFlow
| model can specify ANE-optimized operations, quantization, and
| parameters hinting at compute targets or optimization
| strategies. For example, requesting `.cpuAndNeuralEngine`
| compute units (via `MLModelConfiguration` at load time, or the
| equivalent coremltools flag at conversion) disfavors the GPU
| cores.
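|
| For the curious, a minimal coremltools sketch of that kind of
| conversion might look like this (the tiny model and shapes are
| placeholders; `compute_units` and `ComputeUnit.CPU_AND_NE` are
| the actual coremltools knobs):
|
|     import torch
|     import coremltools as ct
|
|     # Placeholder module standing in for a real (small) model.
|     model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
|     model.eval()
|
|     example = torch.randn(1, 512)
|     traced = torch.jit.trace(model, example)
|
|     # Fixed input shapes and float16 suit the ANE; CPU_AND_NE asks
|     # Core ML not to schedule this model on the GPU.
|     mlmodel = ct.convert(
|         traced,
|         inputs=[ct.TensorType(name="x", shape=example.shape)],
|         compute_precision=ct.precision.FLOAT16,
|         compute_units=ct.ComputeUnit.CPU_AND_NE,
|         minimum_deployment_target=ct.target.macOS13,
|     )
|     mlmodel.save("draft.mlpackage")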
|
| Anyway, I didn't actually experiment with this, but at the time
| I thought maybe there could be a strategy of creating a
| speculative execution framework, with a small ANE-compatible
| model acting as the draft model, paired with a larger target
| model running on the GPU cores. The idea was that the ANE's low
| latency and high efficiency could accelerate results (roughly
| as sketched below).
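|
| A toy sketch of that loop with greedy verification (the two
| callables are stand-ins for an ANE-backed Core ML model and a
| GPU-backed one; none of this is an existing API):
|
|     from typing import Callable, List
|
|     def speculative_step(draft_next: Callable[[List[int]], int],
|                          target_next_batch: Callable[[List[int], List[int]], List[int]],
|                          tokens: List[int], k: int = 4) -> List[int]:
|         # 1. Draft k tokens cheaply on the small (ANE) model.
|         drafted: List[int] = []
|         ctx = list(tokens)
|         for _ in range(k):
|             t = draft_next(ctx)
|             drafted.append(t)
|             ctx.append(t)
|
|         # 2. Verify all k drafts with one batched pass of the big
|         #    (GPU) model; it returns its greedy pick at each of the
|         #    k+1 positions after `tokens`.
|         verified = target_next_batch(tokens, drafted)
|
|         # 3. Keep the longest agreeing prefix; at the first mismatch
|         #    take the target's token instead. If all k agree, the
|         #    extra verified token comes along for free.
|         out = list(tokens)
|         for guess, truth in zip(drafted, verified):
|             out.append(truth)
|             if truth != guess:
|                 break
|         else:
|             out.append(verified[k])
|         return out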
|
| However, I would be interested to hear the perspective of
| people who actually know something about the subject.
| simonw wrote:
| I'm trying to figure out what the secret sauce for this is. It
| depends on https://github.com/apple/coremltools - is that the key
| trick or are there other important techniques going on here?
| smpanaro wrote:
| coremltools is the only way to run on ANE, so less of a trick
| and more of a requirement.
|
| The tricks are more around optimizing for the hardware
| capabilities/constraints. For instance:
|
| - conv2d is faster than linear (see Apple's post [0]), so you
| rewrite the model for that (example from the repo [1]; a minimal
| sketch of the idea follows after the links)
|
| - inputs/outputs are static shapes, so KV cache requires some
| creativity (I wrote about that here [2])
|
| - compute is float16 (not bfloat16) so occasionally you have to
| avoid activation overflows
|
| [0]: https://machinelearning.apple.com/research/neural-engine-
| tra...
|
| [1]:
| https://github.com/Anemll/Anemll/blob/4bfa0b08183a437e759798...
|
| [2]: https://stephenpanaro.com/blog/kv-cache-for-neural-engine
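|
| A minimal sketch of that linear-to-conv2d rewrite (not the
| repo's actual code, just the general idea from Apple's post:
| keep activations as (B, C, 1, S) and swap nn.Linear for a 1x1
| nn.Conv2d):
|
|     import torch
|     import torch.nn as nn
|
|     # Standard projection: (B, S, C_in) -> (B, S, C_out)
|     linear = nn.Linear(512, 2048)
|
|     # ANE-friendly equivalent: a 1x1 conv over (B, C, 1, S) tensors,
|     # reusing the same weights.
|     conv = nn.Conv2d(512, 2048, kernel_size=1)
|     conv.weight.data = linear.weight.data.view(2048, 512, 1, 1)
|     conv.bias.data = linear.bias.data
|
|     x = torch.randn(1, 16, 512)                      # (B, S, C)
|     y_lin = linear(x)                                # (B, S, C_out)
|     y_conv = conv(x.transpose(1, 2).unsqueeze(2))    # (B, C_out, 1, S)
|     assert torch.allclose(y_lin.transpose(1, 2).unsqueeze(2), y_conv, atol=1e-5)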
| kamranjon wrote:
| I am curious if anyone knows if the neural cores in apple silicon
| based machines are at all useful in training? I've been using the
| MLX framework but haven't seen them mentioned anywhere so I'm
| just wondering if they are only useful for inference? I know
| whisper.cpp takes advantage of them in the inference context.
|
| Edit: I changed llama.cpp to whisper.cpp - I didn't realize that
| llama.cpp doesn't have a coreml option like whisper.cpp does.
| lucasoshiro wrote:
| Well, the TensorFlow port to Metal was written by Apple, and it
| doesn't use the ANE. If even they chose to use only the GPU,
| the ANE probably wouldn't help in training. I've also heard that
| the ANE is way less powerful than Apple Silicon's GPU, but I
| don't have numbers.
| 486sx33 wrote:
| Maybe a quick side shift: what the heck are Apple's neural
| cores good for? Used for? Use cases?
| kamranjon wrote:
| They are definitely good for local inference, as evidenced by
| the pretty amazing performance increase on Apple silicon when
| used with whisper.cpp, and maybe other frameworks that utilize
| Core ML? I think they're sorta purpose-built for doing matrix
| math.
| giantrobot wrote:
| They're used pretty extensively by stuff like the Photos app
| for various ML tasks. Not all AI is LLMs.
| freehorse wrote:
| In my understanding, they are intended to be utilized by
| various apps (and Apple intelligence) for performing
| different machine learning tasks (inference), in a manner
| that is unobtrusive to the user and the rest of the system.
| While a GPU would theoretically be more performant, it could
| potentially lead to excessive energy consumption, temperature
| rise, fan noise, and other factors that may not be desirable
| when performing basic tasks like OCR when viewing an image
| within an application.
|
| Currently, it is for example used through the "Vision
| Framework", e.g. for OCR tasks (for instance, when previewing an
| image in macOS, it performs OCR in the background using the
| ANE). Additionally, they are utilized by certain Apple
| Intelligence features that are executed locally (e.g. when I
| asked Writing Tools to rewrite this comment, I saw a spike in
| ANE usage).
|
| They can also be used for diffusion image models (through
| Core ML; diffusers has a nice frontend for that), but my
| understanding is that they are primarily for "light" ML tasks
| within an application rather than running larger models
| (though that's also possible, they're just probably going to
| run slower than on the GPU).
| behnamoh wrote:
| I wonder if Apple ever followed up with this:
| https://github.com/apple/ml-ane-transformers
|
| They claim their ANE-optimized models achieve "up to 10 times
| faster and 14 times lower peak memory consumption compared to
| baseline implementations."
|
| AFAIK, neither MLX nor llama.cpp supports the ANE, though
| llama.cpp has started exploring the idea [0].
|
| What's weird is that MLX is made _by_ Apple and yet it can't
| support the ANE, given its closed-source API! [1]
|
| [0]: https://github.com/ggml-org/llama.cpp/issues/10453
|
| [1]: https://github.com/ml-
| explore/mlx/issues/18#issuecomment-184...
| kamranjon wrote:
| Whisper.cpp has a coreml option which gives 3x speed up over
| cpu only according to the docs: https://github.com/ggml-
| org/whisper.cpp?tab=readme-ov-file#c...
| zozbot234 wrote:
| Some outdated information about bare-metal use of the ANE is
| available from the Whisper.cpp pull req:
| https://github.com/ggml-org/whisper.cpp/pull/1021 Even more
| outdated information at: https://github.com/eiln/ane/tree/33a
| 61249d773f8f50c02ab0b9fe... In short, the early (M1/M2)
| versions of ANE are unlikely to be useful for modern LLM
| inference due to their seemingly exclusive focus on
| statically scheduled FP16 and INT8 MADDs.
|
| More extensive information at https://github.com/tinygrad/tin
| ygrad/tree/master/extra/accel... (from the Tinygrad folks,
| note that this is also similarly outdated) seems to basically
| confirm the above.
|
| (The jury is still out for M3/M4 which currently have no
| Asahi support - thus, no current prospects for driving the
| ANE bare-metal. Note however that the M3/Pro/Max ANE reported
| performance numbers are quite close to the M2 version, so
| there may not be a real improvement there either. M3 Ultra
| and especially the M4 series may be a different story.)
| kamranjon wrote:
| I wouldn't say that they aren't useful for inference (there
| are pretty clear performance improvements even from the
| Asahi effort you linked) - it's just that you have to
| convert the model ahead of time to be compatible with the
| ANE, which is explained in the readme docs for whisper.cpp
| that I linked above.
|
| I would say though that this likely excludes them from
| being useful for training purposes.
| zozbot234 wrote:
| Note that I was only commenting on modern quantized LLMs
| that basically avoid formats like FP16 or INT8, preferring
| lower precision wherever feasible. When in-memory model
| values must be padded to FP16/INT8, this slashes your
| effective use of memory bandwidth, which is what determines
| token generation speed. So the only feasible benefits are
| really in the prompt pre-processing phase, and even then
| only in lower power use compared to the GPU, not really in
| higher speed.
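|
| Rough back-of-envelope for why bandwidth dominates, using the
| usual "every weight is read once per token" approximation (the
| 120 GB/s figure below is illustrative, not a measured ANE
| number):
|
|     def tokens_per_sec(params_b, bytes_per_weight, bw_gb_s):
|         # Decode speed is roughly bandwidth / bytes touched per token.
|         return bw_gb_s * 1e9 / (params_b * 1e9 * bytes_per_weight)
|
|     print(tokens_per_sec(8, 0.5, 120))   # 8B model at 4-bit: ~30 tok/s
|     print(tokens_per_sec(8, 2.0, 120))   # same model padded to fp16: ~7.5 tok/s
|
| So padding 4-bit weights out to FP16 costs roughly 4x in
| generation speed, all else equal.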
| kamranjon wrote:
| That's really interesting! I didn't know that about the
| padding behavior here. I am interested to know which
| models this would include? I know Gemma 3 raw is bf16 -
| are you just talking about the quantized versions of
| these? Or are models being released purely as quantized
| versions these days? I know Google just released a QAT
| (Quantization Aware Training) model of Gemma 3 27b - but
| that base model was already released.
| zozbot234 wrote:
| Models may be released as unquantized (and even then they
| are gradually shifting towards lower precisions over
| time), but most people are going to be running them in a
| quantized version simply because that gives you the best
| bang for your buck (you can fit more interesting models
| on the same hardware). Of course this is strictly about
| local LLM inference, though one may reasonably assume
| that the big players are also doing something similar.
| conradev wrote:
| My understanding is that model throughput is
| fundamentally limited at some point by the fact that the
| ANE is less wide than the GPU.
|
| At that point, the ANE loses because you have to split
| the model into chunks and only one fits at a time.
| smpanaro wrote:
| What do you mean by less wide? The main bottleneck for
| transformers is memory bandwidth. ANE has a much lower
| ceiling than CPU/GPU (yes, despite unified memory).
|
| Chunking is actually beneficial as long as all the chunks
| can fit into the ANE's cache. It speeds up compilation
| for large network graphs and cached loads are negligible
| cost. On M1 the cache limit is 3-4GB, but it is higher on
| M2+.
| jorvi wrote:
| ...who is running LLMs on the CPU instead of a GPU or TPU/NPU?
| thot_experiment wrote:
| Every apple fanboy with a $10k mac studio.
| kamranjon wrote:
| Pretty sure they're using the 80 GPU cores available in
| that case.
| echelon wrote:
| And that still performs worse than entry-level Nvidia
| gaming cards.
|
| Apple isn't serious about AI and needs to figure their AI
| story out. Every other big tech company is doing
| something about it.
| voidspark wrote:
| Not for inferencing. M3 Ultra runs big LLMs twice as fast
| as RTX 5090.
|
| https://creativestrategies.com/mac-studio-m3-ultra-ai-
| workst...
|
| RTX 5090 only has 32GB RAM. M3 Ultra has up to 512 GB
| with 819 GB/sec bandwidth. It can run models that will
| not fit on an RTX card.
|
| EDIT: Benchmark may not be properly utilizing the 5090.
| But the M3 Ultra is way more capable than an entry level
| RTX card at LLM inferencing.
| Spooky23 wrote:
| My little $599 Mac Mini does inference about 15-20%
| slower than a 5070 in my kids' gaming rig. They cost
| about the same, and I got a free computer.
|
| Nvidia makes an incredible product, but Apple's different
| market segmentation strategy might make it a real player
| in the long run.
| kamranjon wrote:
| They're basically second place behind NVIDIA for model
| inference performance and often the only game in town for
| the average person if you're trying to run larger models
| that won't fit in the 16 or 24 GB of memory available in
| top-shelf NVIDIA offerings.
|
| I wouldn't say Apple isn't serious about AI, they had the
| forethought to build the shared memory architecture with
| the insane memory bandwidth needed for these types of
| tasks, while at the same time designing neural cores
| specifically for small on-device models needed for future
| apps.
|
| I'd say Apple is currently ahead of NVIDIA in just sheer
| memory available, which, for doing training and inference
| on large models, is kinda crucial, at least right now.
| NVIDIA seems to be purposefully limiting the memory
| available in their consumer cards, which is pretty short-
| sighted I think.
| voidspark wrote:
| M3 Ultra has a big GPU with 819 GB/sec bandwidth.
|
| LLM performance is twice as fast as RTX 5090
|
| https://creativestrategies.com/mac-studio-m3-ultra-ai-
| workst...
| behnamoh wrote:
| > LLM performance is twice as fast as RTX 5090
|
| Your tests are wrong. You used MLX for the Mac Studio
| (optimized for Apple Silicon) but you didn't use vLLM for
| the 5090. There's no way a machine with half the bandwidth
| of a 5090 delivers twice the tok/s.
| seanmcdirmid wrote:
| Unless it's a large model that doesn't fit in the 5090,
| but that's no longer a $4k Mac Studio I think.
| behnamoh wrote:
| that's orthogonal to the speed discussion.
|
| also, the GP was mostly testing models that fit in both
| 5090 and Mac Studio.
| voidspark wrote:
| $4k will get you a 96 GB Mac Studio with M3 Ultra (819
| GB/sec).
|
| That's 3x the RAM of the 5090.
| voidspark wrote:
| Yeah that's probably wrong. But the M3 Ultra is good
| enough for local inferencing, in any case.
| briandear wrote:
| Can we stop with the derisive "fanboy" nonsense? Most
| people don't say "FOSS" fanboy or Linux "fanboy" -- but
| plenty of people here are exactly that. It's a bit
| insulting to people that like and appreciate Mac
| hardware; just because you might not like it doesn't mean
| you have to be so dismissive. And that Mac Studio is a
| very impressive computer -- but it's usually the ones
| that have never used one that seem to have the most
| opinions about them.
| kamranjon wrote:
| Actually, that's a really good question; I hadn't considered
| that the comparison here is just CPU vs. using Metal
| (CPU+GPU).
|
| To answer the question though - I think this would be used
| for cases where you are building an app that wants to
| utilize a small AI model while at the same time having the
| GPU free to do graphics related things, which I'm guessing
| is why Apple stuck these into their hardware in the first
| place.
|
| Here is an interesting comparison between the two from a
| whisper.cpp thread - ignoring startup times - the CPU+ANE
| seems about on par with CPU+GPU: https://github.com/ggml-
| org/whisper.cpp/pull/566#issuecommen...
| conradev wrote:
| It essentially never makes sense to run on the CPU and
| you will only ever see enthusiasts doing it.
|
| Yes, hammering the GPU too hard can affect the display
| server, but no, switching to the CPU is not a good
| alternative
| kamranjon wrote:
| Not switching to the CPU - switching to the ANE (neural
| cores). If you read the research papers Apple has
| released, the example I gave is pretty much how it's
| being used: small image classification models running on
| the ANE, alongside a graphics app that needs the GPU to
| be free.
| yjftsjthsd-h wrote:
| Not all of us own GPUs worth using. Now, among people using
| macs... Maybe if you had a hardware failure?
| echelon wrote:
| > coreml option which gives 3x speed up over cpu
|
| Which is still painfully slow. CoreML is not a real ML
| platform.
| ks2048 wrote:
| One update from Apple Research since that one was "Deploying
| Attention-Based Vision Transformers to Apple Neural Engine"
| (but the relationship isn't clear; it doesn't build on
| ane_transformers, maybe it's a sister project for vision?)
|
| blog: https://machinelearning.apple.com/research/vision-
| transforme...
|
| github: https://github.com/apple/ml-vision-transformers-ane
| 1W6MIC49CYX9GAP wrote:
| Based on the graphs, "up to 10 times faster" compares
| before/after flash attention.
| ljosifov wrote:
| This more than anything feels emblematic to me: Apple
| executives are brain-dead when it comes to software, with AI
| seemingly being a step too far in that direction. While
| they could at some level grok classical s/w, NNs are that
| terra incognita where Apple executives cannot possibly
| follow. It's just too strange and mysterious a world for
| them to be able to effectively decide or execute on
| anything. I worked (20 yrs ago) in an industrial R&D lab of
| a mostly h/w manufacturer for 3 years. It looked to me that
| the worlds of h/w and s/w, the mindsets, diverged pretty
| quickly on all important considerations of "what should
| happen next".
| anon373839 wrote:
| Huh? Apple ships heaps of deep learning-based
| products/features, and they've done so for years. But I'll
| agree with you, they're currently behind in generative AI.
| smpanaro wrote:
| Not a public follow-up, but the iOS 17 speech-to-text model
| has a clever approach to KV caching that works within the
| ANE's constraints (fixed-size inputs).
|
| I wrote about it here [0], but the gist is you can have a
| fixed-size cache and slide it in chunks with each inference.
| Not as efficient as a cache that grows by one each time, of
| course.
|
| [0]: https://stephenpanaro.com/blog/inside-
| apples-2023-transforme...
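|
| A toy version of that sliding trick, just to show the shape
| bookkeeping (this is not Apple's code; numpy stands in for the
| real Core ML tensors, and the sizes are made up):
|
|     import numpy as np
|
|     CACHE_LEN, CHUNK, D = 448, 64, 512   # fixed sizes baked into the model
|
|     def slide_cache(cache, new_kv):
|         # cache:  (CACHE_LEN, D) fixed-shape buffer the ANE model expects
|         # new_kv: (CHUNK, D) keys/values from the latest inference
|         # Shift left by one chunk and append, keeping the shape constant.
|         return np.concatenate([cache[CHUNK:], new_kv], axis=0)
|
|     cache = np.zeros((CACHE_LEN, D), dtype=np.float16)
|     new_kv = np.random.randn(CHUNK, D).astype(np.float16)
|     cache = slide_cache(cache, new_kv)
|     assert cache.shape == (CACHE_LEN, D)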
| cowmix wrote:
| This sorta reminds me of the lie that was pushed when the
| Snapdragon X laptops were being released last year. Qualcomm
| implied the NPU would be used for LLMs -- and I bought into the
| BS without looking into it. I still use a Snapdragon laptop as my
| daily driver (it's fine) but for running models locally, it's
| still a joke. Despite Qualcomm's claims about running 13B-
| parameter models, software like LM Studio only runs on the CPU,
| with NPU support merely "planned for future updates" (per XDA).
| The NPU isn't even faster than the CPU for LLMs -- it's just
| more power-efficient for small models, not the big ones people
| actually want to run. Their GPUs aren't much better for this
| purpose either. The only hope for LLMs is the Vulkan support on
| the Snapdragon X -- which is still half-baked.
| nullpoint420 wrote:
| If you don't mind me asking, what OS do you use on it?
| cowmix wrote:
| I use Windows 11. Podman/WSL2 works way better than I thought
| it would. And when Git Bash was finally ported (officially),
| that filled other gaps I was missing in my workflow. Windows
| ARM Python is still lacking all sorts of stuff, but overall
| I'm pretty productive on it.
|
| I pre-ordered the Snapdragon X Dev Kit from Qualcomm, but
| they ended up delivering only a few units, only to cancel the
| whole program. The whole thing turned out to be a hot-mess
| express saga. THAT computer was going to be my Debian rig.
| captainregex wrote:
| AnythingLLM uses NPU
| mikaraento wrote:
| Could you provide a pointer to docs for this? It wasn't
| obvious from an initial read of their docs.
| tough wrote:
| could find some about it on the 1.7.2 changelog here
| https://docs.anythingllm.com/changelog/v1.7.2
| wmf wrote:
| AFAIK Windows 11 does use the NPU to run Phi Silica language
| models and this is available to any app through some API. The
| models are quite small as you said though.
| daedrdev wrote:
| Apple is a competitive choice simply because their unified memory
| lets you get enough RAM to run larger models that would otherwise
| require multiple GPUs to fit.
| fmajid wrote:
| Yes, but their refusal to open up the ANE to third-party models
| negates that. You can get (or will be able to very soon) a
| Strix Halo Ryzen AI Max+ 395 able to access 96GB of unified RAM
| (on a 128GB system) for well under half what you'd pay for an
| equivalent M4 system from Apple.
| rz2k wrote:
| Doesn't the M3 Ultra have 3-4x the RAM bandwidth though?
| wmf wrote:
| At 3-4x the price.
| wmf wrote:
| Hobbyists don't use ANE on Apple Silicon and they won't use
| XDNA on Strix Halo either. In both cases the GPU is faster
| and can access most of system memory.
|
| $2,000 vs. $3,500 isn't well under half either.
| antirez wrote:
| The README lacks the most important thing: how many more
| tokens/sec at the same quantization, compared to llama.cpp / MLX?
| It is only worth switching default platforms if there is a major
| improvement.
| cwoolfe wrote:
| Is there a performance benefit for inference speed on M-series
| MacBooks, or is the primary task here simply to get inference
| working on other platforms (like iOS)? If there is a performance
| benefit, it would be great to see tokens/s of this vs. Ollama.
| neuroelectron wrote:
| btw, don't bother trying to buy a bunch of Mac boxes to run LLMs
| in parallel because it won't be any faster than a single box.
| ionwake wrote:
| is everyone just waiting for the DGX Spark? Are they really
| going to ban local inference?
| neuroelectron wrote:
| What do you mean ban? The bandwidth between macs isn't enough
| to do inference effectively.
| zozbot234 wrote:
| I'm pretty sure you can network Macs together via the
| latest Thunderbolt standards and get pretty decent
| performance overall. Sure, it will be a bottleneck to some
| extent but it's still useful for many purposes.
| m3kw9 wrote:
| Even a 1 GB model is prohibitively big for phones if you want
| mass adoption.
| ukuina wrote:
| We'll just have to await next week's oai-
| agi1-0.4b-a0.1b-iq1_xs.gguf
| xyst wrote:
| Getting anything to work on Apple proprietary junk is such a
| chore.
| jvstokes wrote:
| But coreml utilizes ANE, right? Is there some bottleneck in
| coreml that requires lower level access?
| gitroom wrote:
| Man, Apple's tight grip on the ANE is kinda nuts - would love to
| see the day they let folks get real hands-on. Do you ever think
| companies hold stuff back just to keep control, or is there
| actually some big technical reason for it?
| zozbot234 wrote:
| People keep saying this but I'm not seeing the big difference
| with other NPU varieties. Either way we're still talking about
| very experimental stuff that also tends to be hardwired towards
| some pre-determined use case. So I'm not surprised that people
| are running into problems while trying to make these more
| broadly useful.
| wtallis wrote:
| True; _everybody's_ NPU hardware is afflicted by awkward
| hardware and software constraints that don't come close to
| keeping pace with the rapidly-shifting interests of ML
| researchers.
|
| To some degree, that's an unavoidable consequence of how long
| it takes to design and ship specialized hardware with a
| supporting software stack. By contrast, ML research is moving
| way faster because they hardly ever ship anything product-
| like; it's a good day when the installation instructions for
| some ML thing only includes three steps that amount to
| "download more Python packages".
|
| And the lack of cross-vendor standardization for APIs and
| model formats is also at least partly a consequence of
| various NPUs evolving from very different starting points and
| original use cases. For example, Intel's NPUs are derived
| from Movidius, so they were originally designed for computer
| vision, and it's not at all a surprise that making them do
| LLMs might be an uphill battle. AMD's NPU comes from Xilinx
| IP, so their software mess is entirely expected. Apple and
| Qualcomm NPUs presumably are still designed primarily to
| serve smartphone use cases, which didn't include LLMs until
| after today's chips were designed.
|
| It'll be _very_ interesting to see how this space matures
| over the next several years, and whether the niche of
| specialized low-power NPUs survives in PCs or if NVIDIA's
| approach of only using the GPU wins out. A lot of that
| depends on whether anybody comes up with a true killer app
| for local on-device AI.
| zozbot234 wrote:
| > It'll be very interesting to see how this space matures
| over the next several years, and whether the niche of
| specialized low-power NPUs survives in PCs or if NVIDIA's
| approach of only using the GPU wins out.
|
| GPUs are gaining their own kinds of specialized blocks
| such as matrix/tensor compute units, or BVH acceleration
| for ray-tracing (which may or may not turn out to be useful
| for other stuff). So I'm not sure that there's any real
| distinction from that POV - a specialized low-power unit in
| an iGPU is going to be practically indistinguishable from an
| NPU, except that it will probably be easier to target from
| existing GPU APIs.
| wtallis wrote:
| > a specialized low-power unit in an iGPU is going to be
| practically indistinguishable from a NPU, except that it
| will probably be easier to target from existing GPU
| API's.
|
| Possibly, depending on how low the power actually is. We
| can't really tell from NVIDIA's tensor cores, because
| waking up an NVIDIA discrete GPU at all has a higher
| power cost than running an NPU. Intel's iGPUs have matrix
| units, but I'm not sure if they can match their NPU on
| power or performance.
___________________________________________________________________
(page generated 2025-05-03 23:00 UTC)