[HN Gopher] AI PCs Aren't Good at AI: The CPU Beats the NPU
___________________________________________________________________
AI PCs Aren't Good at AI: The CPU Beats the NPU
Author : dbreunig
Score : 167 points
Date : 2024-10-16 19:44 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| fancyfredbot wrote:
| The write-up on the GitHub repo is much more informative than the
| blog.
|
| When running an int8 matmul through ONNX, performance is ~0.6 TOPS.
|
| https://github.com/usefulsensors/qc_npu_benchmark
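|
| For anyone who wants to reproduce the ballpark figure, here is a
| minimal sketch (matrix shape, iteration counts, and provider
| choice are my assumptions, not necessarily what the repo uses):
|
|     # Time an int8 matmul through onnxruntime, report ops/s.
|     import time
|     import numpy as np
|     import onnxruntime as ort
|     from onnx import TensorProto, helper
|
|     N = 1024  # assumed square matrix size
|     graph = helper.make_graph(
|         [helper.make_node("MatMulInteger", ["A", "B"], ["Y"])],
|         "int8_matmul",
|         [helper.make_tensor_value_info("A", TensorProto.INT8, [N, N]),
|          helper.make_tensor_value_info("B", TensorProto.INT8, [N, N])],
|         [helper.make_tensor_value_info("Y", TensorProto.INT32, [N, N])])
|     sess = ort.InferenceSession(
|         helper.make_model(graph).SerializeToString(),
|         providers=["CPUExecutionProvider"])
|
|     rng = np.random.default_rng(0)
|     feeds = {"A": rng.integers(-128, 128, (N, N), dtype=np.int8),
|              "B": rng.integers(-128, 128, (N, N), dtype=np.int8)}
|     for _ in range(10):  # warm-up
|         sess.run(None, feeds)
|     t0, runs = time.perf_counter(), 100
|     for _ in range(runs):
|         sess.run(None, feeds)
|     dt = (time.perf_counter() - t0) / runs
|     print(f"{2 * N**3 / dt / 1e12:.2f} TOPS")  # 2 ops per MAC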
| dang wrote:
| Thanks--we changed the URL to that from
| https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-
| at-.... Readers may want to look at both, of course!
| dmitrygr wrote:
| In general MAC unit utilization tends to be low for transformers,
| but 1.3% seems pretty bad. I wonder if they fucked up the memory
| interface for the NPU. All the MACs in the world are useless if
| you cannot feed them.
| moffkalast wrote:
| I recall looking over the Ryzen AI architecture and the NPU is
| just plugged into PCIe and thus gets completely crap memory
| bandwidth. I would expect it might be similar here.
| PaulHoule wrote:
| I spent a lot of time with a business partner and an expert
| looking at the design space for accelerators, and it was made
| very clear to me that the memory interface puts a hard limit
| on what you can do, and that it is difficult to make the most
| of. Particularly if a half-baked product is being rushed out
| because of FOMO, you'd practically expect them to ship
| something that gives a few percent of the claimed performance
| because the memory interface doesn't really work. It happens
| to the best of them:
|
| https://en.wikipedia.org/wiki/Cell_(processor)
| wtallis wrote:
| It's unlikely to be literally connected over PCIe when it's
| on the same chip. It just _looks_ like it's connected over
| PCIe because that's how you make peripherals discoverable to
| the OS. The integrated GPU also appears to be connected over
| PCIe, but obviously has access to far more memory bandwidth.
| Hizonner wrote:
| It's a tablet. It probably has like one DDR channel. It's not
| so much that they "fucked it up" as that they knowingly built a
| grossly unbalanced system so they could report a pointless
| number.
| dmitrygr wrote:
| Well, no. If the CPU can hit better numbers on the same model,
| then the bandwidth from the DDR _IS_ there. Probably the NPU
| does not attach to the proper cache level, or just has a very
| thin pipe to it.
| Hizonner wrote:
| The CPU is only about twice as good as the NPU, though
| (four times as good on one test). The NPU is being
| advertised as capable of 45 trillion operations per second,
| and he's getting 1.3 percent of that.
|
| So, OK, yeah, I concede that the NPU may have even worse
| access to memory than the CPU, but the bottom line is that
| neither one of them has anything close to what it needs to
| actually deliver anything like the marketing headline
| performance number on any realistic workload.
|
| I bet a lot of people have bought those things after seeing
| "45 TOPS", thinking that they'd be able to usefully run
| transformers the size of main memory, and that's not
| happening on CPU _or_ NPU.
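|
| A back-of-the-envelope roofline makes that concrete (the
| bandwidth figure below is an assumption for illustration, not
| a measured spec for this machine):
|
|     # All numbers are illustrative assumptions.
|     peak_tops = 45e12  # marketing number, int8 ops/s
|     mem_bw = 135e9     # assumed LPDDR5X bandwidth, bytes/s
|
|     # Ops per byte moved needed to be compute-bound:
|     print(f"{peak_tops / mem_bw:.0f} ops/byte")  # ~333
|
|     # Transformer decode does roughly 2 int8 ops per weight byte
|     # streamed, so bandwidth caps throughput near:
|     print(f"{2 * mem_bw / 1e12:.2f} TOPS")  # ~0.27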
| dmitrygr wrote:
| Yup, sad all round. We are in agreement.
| pram wrote:
| I laughed when I saw that the Qualcomm "AI PC" is described as
| this in the ComfyUI docs:
|
| "Avoid", "Nothing works", "Worthless for any AI use"
| jsheard wrote:
| These NPUs are tying up a substantial amount of silicon area so
| it would be a real shame if they end up not being used for much.
| I can't find a die analysis of the Snapdragon X which isolates
| the NPU specifically, but AMD's equivalent with the same ~50 TOPS
| performance target can be seen here, and takes up about as much
| area as three high performance CPU cores:
|
| https://www.techpowerup.com/325035/amd-strix-point-silicon-p...
| Kon-Peki wrote:
| Modern chips have to dedicate a certain percentage of the die
| to dark silicon [1] (or else they melt/throttle to
| uselessness), and these kinds of components count towards that
| amount. So the point of these components is to be used, but not
| to be used too much.
|
| Instead of an NPU, they could have used those transistors and
| die space for any number of things. But they wouldn't have put
| additional high performance CPU cores there - that would
| increase the power density too much and cause thermal issues
| that can only be solved with permanent throttling.
|
| [1] https://en.wikipedia.org/wiki/Dark_silicon
| IshKebab wrote:
| If they aren't being used it would be better to dedicate the
| space to more SRAM.
| a2l3aQ wrote:
| The point is that parts of the CPU have to be off or throttled
| down when other components are under load to stay within TDP;
| adding cache that would almost certainly be in use defeats the
| point of that.
| jsheard wrote:
| Doesn't SRAM have much lower power density than logic per
| area though? Hence why AMD can get away with physically
| stacking cache on top of more cache in their X3D parts,
| without the bottom layer melting.
| ezst wrote:
| I can't wait for the LLM fad to be over so we get some sanity
| (and efficiency) back. I personally have no use for this extra
| hardware ("GenAI" doesn't help me in any way nor supports any
| work-related tasks). Worse, most people have no use for that
| (and recent surveys even show predominant hostility towards AI
| creep). We shouldn't be paying extra for that, it should be
| opt-in, and then it would become clear (by looking at the sales
| and how few are willing to pay a premium for "AI") how
| overblown and unnecessary this is.
| DrillShopper wrote:
| Corporatized gains in the market from hype; socialized losses
| in increased carbon emissions, upheaval from job loss, and
| higher prices on hardware.
|
| The more they say the future will be better, the more it
| looks like the status quo.
| renewiltord wrote:
| I was telling someone this and they gave me a link to a laptop
| with higher battery life and better performance than my own,
| but I kept explaining to them that the feature I cared most
| about was die size. They couldn't understand it so I just had
| to leave them alone. Non-technical people don't get it. Die
| size is what I care about. It's a critical feature and so
| many mainstream companies are missing out on _my money_
| because they won 't optimize die size. Disgusting.
| nl wrote:
| Is this a parody?
|
| Why would anyone care about die size? And if you do why not
| get one of the many low power laptops with Atoms etc that
| do have small die size?
| throwaway48476 wrote:
| Maybe through a game of telephone they confused die size
| and node size?
| thfuran wrote:
| Yes, they're making fun of the comment they replied to.
| tedunangst wrote:
| No, no, no, you just don't get it. The only thing Dell
| will sell me is a laptop 324mm wide, which is totally
| appalling, but if they offered me a laptop that's 320mm
| wide, I'd immediately buy it. In my line of work, which
| is totally serious business, every millimeter counts.
| _zoltan_ wrote:
| News flash: you're in the niche of the niche. People don't
| care about die size.
|
| I'd be willing to bet that the amount of money they are
| missing out on is minuscule, and is more than offset by
| people's money who care about other stuff. Like you know,
| performance and battery life, just to stick to your
| examples.
| JohnFen wrote:
| > These NPUs are tying up a substantial amount of silicon area
| so it would be a real shame if they end up not being used for
| much.
|
| This has been my thinking. Today you have to go out of your way
| to buy a system with an NPU, so I don't have any. But tomorrow,
| will they just be included by default? That seems like a waste
| for those of us who aren't going to be running models. I wonder
| what other uses they could be put to?
| jsheard wrote:
| > But tomorrow, will they just be included by default?
|
| That's already the way things are going due to Microsoft
| decreeing that Copilot+ is the future of Windows, so AMD and
| Intel are both putting NPUs which meet the Copilot+
| performance standard into every consumer part they make going
| forwards to secure OEM sales.
| AlexAndScripts wrote:
| It almost makes me want to find some use for them on my
| Linux box (not that it has an NPU), but I truly can't think
| of anything. Too small to run a meaningful LLM (and I'd
| want that in bursts anyway), I hate voice controls (at least
| with the current tech), and Recall sounds thoroughly
| useless. Could you do mediocre machine translation on it,
| perhaps? A local GitHub Copilot? An LLM that is purely used
| to build an abstract index of my notes in the background?
|
| Actually, could they be used to make better AI in games?
| That'd be neat. A shooter character with some kind of
| organic tactics, or a Civilisation/Stellaris AI that
| doesn't suck.
| jonas21 wrote:
| NPUs are already included by default in the Apple ecosystem.
| Nobody seems to mind.
| JohnFen wrote:
| It's not really a question of minding whether it's there,
| unless its presence increases cost. It just seems a waste
| to let it go idle, so my mind wanders to what other uses I
| could put that circuitry to.
| tromp wrote:
| > the 45 trillion operations per second that's listed in the
| specs
|
| Such a spec should ideally be accompanied by code
| demonstrating or approximating the claimed performance. I can't
| imagine a sports car advertising a 0-100km/h spec of 2.0 seconds
| where a user is unable to get below 5 seconds.
| dmitrygr wrote:
| Most likely multiplying the same 128x128 matrix from cache to
| cache. That gets you perfect MAC utilization with no need to
| hit memory, and a big number that is not directly a lie - that
| perf _IS_ attainable, on a useless synthetic benchmark.
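|
| The arithmetic behind such a headline number is simple enough
| to sketch (figures below are hypothetical, not Qualcomm's
| actual design):
|
|     # Peak TOPS = MACs per cycle x 2 ops per MAC x clock.
|     macs_per_cycle = 16 * 1024  # assumed MAC array size
|     clock_hz = 1.4e9            # assumed NPU clock
|     print(f"{macs_per_cycle * 2 * clock_hz / 1e12:.0f}")  # ~46 TOPS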
| kmeisthax wrote:
| Sounds great for RNNs! /s
| tedunangst wrote:
| I have some bad news for you regarding how car acceleration is
| measured.
| isusmelj wrote:
| I think the results show that, in general, the compute is not
| being used well. That the CPU took 8.4 ms and the GPU 3.2 ms
| shows a very small gap; I'd expect more like a 10x-20x
| difference here.
| I'd assume that the onnxruntime might be the issue. I think some
| hardware vendors just release the compute units without shipping
| proper support yet. Let's see how fast that will change.
|
| Also, people often assume the point of an NPU is "speed".
| That's not correct. The whole point of the NPU is to focus on
| low power consumption. To focus on speed you'd need to get rid
| of the memory bottleneck; then you end up designing your own
| ASIC with its own memory. The NPUs we see in most devices are
| part of the SoC around the CPU, there to offload AI
| computations. It would be interesting to run this benchmark in
| an infinite loop on the three devices (CPU, NPU, GPU) and
| measure power consumption. I'd expect the NPU to be lowest,
| and also the best in terms of "ops/watt".
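|
| Something like this sketch, reading wall power off an external
| meter while each loop runs (provider names are onnxruntime's,
| availability depends on the build, and the model path is a
| placeholder):
|
|     import time
|     import numpy as np
|     import onnxruntime as ort
|
|     for provider in ["CPUExecutionProvider",
|                      "DmlExecutionProvider",   # GPU via DirectML
|                      "QNNExecutionProvider"]:  # Qualcomm NPU
|         if provider not in ort.get_available_providers():
|             continue
|         sess = ort.InferenceSession("model.onnx",
|                                     providers=[provider])
|         feeds = {i.name: np.zeros(  # assumes float32 inputs
|             [d if isinstance(d, int) else 1 for d in i.shape],
|             dtype=np.float32) for i in sess.get_inputs()}
|         t0, n = time.perf_counter(), 0
|         while time.perf_counter() - t0 < 60:  # a minute per device
|             sess.run(None, feeds)
|             n += 1
|         print(provider, n / 60.0, "inferences/s")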
| AlexandrB wrote:
| > Also, people often assume the point of an NPU is "speed".
| That's not correct. The whole point of the NPU is to focus
| on low power consumption.
|
| I have a sneaking suspicion that the real real reason for an
| NPU is marketing. "Oh look, NVDA is worth $3.3T - let's make
| sure we stick some AI stuff in our products too."
| kmeisthax wrote:
| You forget "Because Apple is doing it", too.
| rjsw wrote:
| I think other ARM SoC vendors like Rockchip added NPUs
| before Apple, or at least around the same time.
| itishappy wrote:
| I assume you're both right. I'm sure NPUs exist to fill a
| very real niche, but I'm also sure they're being shoehorned
| in everywhere regardless of product fit because "AI big right
| now."
| wtallis wrote:
| Looking at it slightly differently: putting low-power NPUs
| into laptop and phone SoCs is how to get on the AI
| bandwagon in a way that NVIDIA cannot easily disrupt. There
| are plenty of systems where a NVIDIA discrete GPU cannot
| fit into the budget (of $ or Watts). So even if NPUs are
| still somewhat of a solution in search of a problem (aka a
| killer app or two), they're not necessarily a sign that
| these manufacturers are acting entirely without strategy.
| kmeisthax wrote:
| > I think some hardware vendors just release the compute units
| without shipping proper support yet
|
| This is Nvidia's moat. Everything has optimized kernels for
| CUDA, and _maybe_ Apple Accelerate (which is the only way to
| touch the CPU matrix unit before M4, and the NPU at all). If
| you want to use anything else, either prepare to upstream
| patches in your ML framework of choice or prepare to write your
| own training and inference code.
| jamesy0ung wrote:
| What exactly does Windows do with an NPU? I don't own an 'AI PC'
| but it seems like the NPUs are slow and can't run much.
|
| I know Apple's Neural Engine is used to power Face ID and the
| facial recognition stuff in Photos, among other things.
| DrillShopper wrote:
| It supports Microsoft's Recall (now required) spyware
| Janicc wrote:
| Please remind me again how Recall sends data to Microsoft. I
| must've missed that part. Or are you against the print screen
| button too? I heard that takes images too. Very scary.
| cmeacham98 wrote:
| While calling it spyware like GP did is an exaggeration to a
| ridiculous level, comparing Recall to Print Screen is also
| inaccurate:
|
| Print Screen takes images on demand, Recall does so
| effectively at random. This means Recall could
| inadvertently screenshot and store information you didn't
| intend to keep a record of (To give an extreme example:
| Imagine an abuser uses Recall to discover their spouse
| browsing online domestic violence resources).
| Terr_ wrote:
| > Please remind me again how Recall sends data to
| Microsoft. I must've missed that part.
|
| Sure, just post the source code and I'll point it out; I
| somehow misplaced my copy. /s
|
| The core problem here is trust, and over the last several
| years Microsoft has burned a hell of a lot of theirs with
| power-users of Windows. Even their most strident public
| promises of Recall being "opt-in" and "on-device only" will
| --paradoxically--only be kept as long as people remain
| suspicious.
|
| Once there's a critical mass, MS will go back to their old
| games, pushing a mandatory "security update" which resets or
| entirely removes your privacy settings and adds new
| "telemetry" streams which you cannot inspect.
| bloated5048 wrote:
| It's always safe to assume it does if it's closed source. I'd
| rather be suspicious of big corporations seeking to profit at
| every step than naive.
|
| Also, it's a security risk which has already been exploited.
| Sure, MS fixed it, but can you be certain it won't be
| exploited some time in the future again?
| dagaci wrote:
| It's used for improving video calls: special effects, image
| editing effects, noise cancelling, Teams stuff.
| eightysixfour wrote:
| I thought the purpose of these things was not to be fast, but to
| be able to run small models with very little power usage? I have
| a newer AMD laptop with an NPU, and my power usage doesn't change
| using the video effects that supposedly run on it, but goes up
| when using the nvidia studio effects.
|
| It seems like the NPUs are for very optimized models that do
| small tasks, like eye contact, background blur, autocorrect
| models, transcription, and OCR. In particular, on Windows, I
| assumed they were running the full-screen OCR (and maybe
| embeddings for search) for the Recall feature.
| conradev wrote:
| That is my understanding as well: low power and low latency.
|
| You can see this in action when evaluating a CoreML model on a
| macOS machine. The ANE takes half as long as the GPU, which
| takes half as long as the CPU (the actual factors being model
| dependent).
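|
| For reference, a minimal sketch of that comparison with
| coremltools (the model path and input name are hypothetical):
|
|     import time
|     import numpy as np
|     import coremltools as ct
|
|     feeds = {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}
|     for unit in [ct.ComputeUnit.CPU_ONLY,
|                  ct.ComputeUnit.CPU_AND_GPU,
|                  ct.ComputeUnit.CPU_AND_NE]:  # NE = Neural Engine
|         model = ct.models.MLModel("model.mlpackage",
|                                   compute_units=unit)
|         model.predict(feeds)  # warm-up
|         t0 = time.perf_counter()
|         for _ in range(100):
|             model.predict(feeds)
|         print(unit, f"{(time.perf_counter() - t0) * 10:.1f} ms")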
| nickpsecurity wrote:
| To take half as long, doesn't it have to perform twice as
| fast? Or am I misreading your comment?
| eightysixfour wrote:
| No, you can have latency that is independent of compute
| performance. The CPU/GPU may have other tasks and the work
| has to wait for the existing threads to finish, or for them
| to clock up, or have slower memory paths, etc.
|
| If you and I have the same calculator but I'm working on a
| set of problems and you're not, and we're both asked to do
| some math, it may take me longer to return it, even though
| the instantaneous performance of the math is the same.
| refulgentis wrote:
| In isolation, makes sense.
|
| Wouldn't it be odd for OP to present examples that are
| the _opposite_ of their claim, just to get us thinking
| about "well the CPU is busy?"
|
| Curious for their input.
| boomskats wrote:
| That's especially true because yours is a Xilinx FPGA. The one
| they just attached to the latest-gen mobile Ryzens is 5x more
| capable, too.
|
| AMD are doing some fantastic work at the moment; they just
| don't seem to be shouting about it. This one is particularly
| interesting:
| https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1...
|
| edit: not an FPGA. TIL. :'(
| errantspark wrote:
| Wait sorry back up a bit here. I can buy a laptop that has a
| daughter FPGA in it? Does it have GPIO??? Are we seriously
| building hardware worth buying again in 2024? Do you have a
| link?
| eightysixfour wrote:
| It isn't as fun as you think - they are set up for specific
| use cases and quite small. Here's a link to the software
| page: https://ryzenai.docs.amd.com/en/latest/index.html
|
| The teeny-tiny "NPU," which is actually an FPGA, is 10
| TOPS.
|
| Edit: I've been corrected, not an FPGA, just an IP block
| from Xilinx.
| wtallis wrote:
| It's not an FPGA. It's an NPU IP block from the Xilinx
| side of the company. It was presumably originally
| developed to be run on a Xilinx FPGA, but that doesn't
| mean AMD did the stupid thing and actually fabbed an FPGA
| fabric instead of properly synthesizing the design for
| their laptop ASIC. Xilinx involvement does not
| automatically mean it's an FPGA.
| eightysixfour wrote:
| Thanks for the correction, edited.
| boomskats wrote:
| Do you have any more reading on this? How come the XDNA
| drivers depend on Xilinx' XRT runtime?
| almostgotcaught wrote:
| because XRT has a plugin architecture: XRT<-shim
| plugin<-kernel driver. The shims register themselves with
| XRT. The XDNA driver repo houses both the shim and the
| kernel driver.
| boomskats wrote:
| Thanks, that makes sense.
| wtallis wrote:
| It would be surprising and strange if AMD _didn't_ reuse
| the software framework they've already built for doing AI
| when that IP block is instantiated on an FPGA fabric
| rather than hardened in an ASIC.
| boomskats wrote:
| Well, I'm irrationally disappointed, but thanks.
| Appreciate the correction.
| boomskats wrote:
| Yes, the one on the ryzen 7000 chips like the 7840u isn't
| massive, but that's the last gen model. The one they've
| just released with the HX370 chip is estimated at 50
| TOPS, which is better than Qualcomm's ARM flagship that
| this post is about. It's a fivefold improvement in a
| single generation; pretty exciting.
|
| And it's an FPGA.
| almostgotcaught wrote:
| > And it's an FPGA.
|
| nope it's not.
| dekhn wrote:
| If you want GPIOs, you don't need (or want) an FPGA.
|
| I don't know the details of your use case, but I work with
| low level hardware driven by GPIOs and after a bit of
| investigation, concluded that having direct GPIO access in
| a modern PC was not necessary or desirable compared to the
| alternatives.
| beeflet wrote:
| It would be cool if most PCs had a general purpose FPGA that
| could be repurposed by the operating system. For example you
| could use it as a security processor like a TPM or as a
| bootrom, or you could repurpose it for DSP or something.
|
| It just seems like this would be better in terms of
| firmware/security/bootloading because you would be more able
| to fix it if an exploit gets discovered, and it would be
| leaner because different operating systems can implement
| their own stuff (for example Linux might not want Pluton in-
| chip security, Windows might not want coreboot or Linux-based
| boot, and bare-metal applications can have much simpler boot).
| walterbell wrote:
| Xilinx Artix 7-series PicoEVB fits in an M.2 wifi slot and has
| an OSS toolchain, http://www.enjoy-digital.fr/
| pclmulqdq wrote:
| It's not an FPGA. It's a VLIW DSP that Xilinx built to go
| into an FPGA-SoC to help run ML models.
| almostgotcaught wrote:
| this is the correct answer. one of the compilers for this
| DSP is https://github.com/Xilinx/llvm-aie.
| numpad0 wrote:
| Sorry for an OT comment but what is going on with that ascii
| art!? The content fits within 80 columns just fine[1], is it
| GPT generated?
|
| 1: https://pastebin.com/raw/R9BrqETR
| refulgentis wrote:
| You're absolutely right IMO, given what I heard when launching
| on-device speech recognition on Pixel and, after leaving
| Google, what I see from e.g. Apple Neural Engine vs. CPU when
| running ONNX stuff.
|
| I'm a bit suspicious of the article's specific conclusion,
| because it is Qualcomm's ONNX, and it may be out of date. Also,
| Android loved talking shit about Qualcomm software engineering.
|
| That being said, it's directionally correct, insomuch as
| consumer hardware AI acceleration claims are near-universally
| BS unless you're A) writing 1P software or B) someone in the
| 1P really wants you to take advantage.
| kristianp wrote:
| 1P?
| refulgentis wrote:
| First party, i.e. Google/Apple/Microsoft
| wmf wrote:
| This headline is seriously misleading because the author did
| not test AMD or Intel NPUs. If Qualcomm's NPU is slow, don't
| say all AI PCs are no good.
| protastus wrote:
| Deploying a model on an NPU requires significant profile-based
| optimization. Picking up a model that works fine on the CPU but
| hasn't been optimized for an NPU usually leads to disappointing
| results.
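|
| A sketch of the kind of offline step that usually means:
| statically quantizing the float model to a QDQ int8 model
| before handing it to the NPU's execution provider (paths,
| input name, and shapes below are placeholders):
|
|     import numpy as np
|     from onnxruntime.quantization import (CalibrationDataReader,
|                                           QuantType, quantize_static)
|
|     class RandomReader(CalibrationDataReader):
|         # Real calibration feeds representative data, not noise.
|         def __init__(self, n=8):
|             self.batches = iter(
|                 [{"input": np.random.rand(1, 3, 224, 224)
|                            .astype(np.float32)} for _ in range(n)])
|         def get_next(self):
|             return next(self.batches, None)
|
|     quantize_static("model_fp32.onnx", "model_int8_qdq.onnx",
|                     RandomReader(),
|                     activation_type=QuantType.QUInt8,
|                     weight_type=QuantType.QInt8)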
| catgary wrote:
| Yeah, whenever I've spoken to people who work on stuff like
| IREE or OpenXLA, they've given me the impression that
| understanding how to use those compilers/runtimes is an
| entire job.
| CAP_NET_ADMIN wrote:
| Beauty of CPUs - they'll chew through whatever bs code you
| throw at them at a reasonable speed.
| lostmsu wrote:
| The author's benchmark sucks if he could only get 2 TOPS from a
| laptop 4080; the thing should be doing somewhere around 80 TOPS.
|
| Given that, you should take his NPU results with a truckload of
| salt.
| downrightmike wrote:
| They should have just made a PCIe card and not tried to push
| whole new machines on us. We are all good with the machines we
| already have. If you want to sell a new feature, then it needs
| to be an add-on.
| Mistletoe wrote:
| >The second conclusion is that the measured performance of 573
| billion operations per second is only 1.3% of the 45 trillion
| ops/s that the marketing material promises.
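|
| The arithmetic checks out:
|
|     print(f"{573e9 / 45e12:.1%}")  # 1.3%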
|
| It just gets so hard to take this industry seriously.
___________________________________________________________________
(page generated 2024-10-16 23:00 UTC)