[HN Gopher] AI PCs Aren't Good at AI: The CPU Beats the NPU
       ___________________________________________________________________
        
       AI PCs Aren't Good at AI: The CPU Beats the NPU
        
       Author : dbreunig
       Score  : 167 points
       Date   : 2024-10-16 19:44 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | fancyfredbot wrote:
       | The write up on the GitHub repo is much more informative than the
       | blog.
       | 
        | When running an int8 matmul through ONNX, performance is ~0.6
        | TOPS (trillion ops/s).
       | 
       | https://github.com/usefulsensors/qc_npu_benchmark
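        | 
        | A rough sketch of that kind of measurement (my own
        | reconstruction with onnxruntime's Python API, not the repo's
        | actual harness; matrix size and run count are arbitrary):
        | 
        |   import time
        |   import numpy as np
        |   from onnx import helper, TensorProto
        |   import onnxruntime as ort
        | 
        |   M = N = K = 1024
        | 
        |   # Single MatMulInteger node: int8 x int8 -> int32
        |   a = helper.make_tensor_value_info("A", TensorProto.INT8, [M, K])
        |   b = helper.make_tensor_value_info("B", TensorProto.INT8, [K, N])
        |   y = helper.make_tensor_value_info("Y", TensorProto.INT32, [M, N])
        |   node = helper.make_node("MatMulInteger", ["A", "B"], ["Y"])
        |   model = helper.make_model(
        |       helper.make_graph([node], "int8_matmul", [a, b], [y]))
        | 
        |   sess = ort.InferenceSession(model.SerializeToString(),
        |                               providers=["CPUExecutionProvider"])
        |   A = np.random.randint(-128, 128, (M, K), dtype=np.int8)
        |   B = np.random.randint(-128, 128, (K, N), dtype=np.int8)
        | 
        |   sess.run(None, {"A": A, "B": B})        # warm-up
        |   runs = 50
        |   t0 = time.perf_counter()
        |   for _ in range(runs):
        |       sess.run(None, {"A": A, "B": B})
        |   dt = time.perf_counter() - t0
        |   ops = 2 * M * N * K * runs              # 1 mul + 1 add per MAC
        |   print(f"{ops / dt / 1e12:.3f} TOPS")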
        
         | dang wrote:
         | Thanks--we changed the URL to that from
         | https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-
          | at-.... Readers may want to look at both, of course!
        
       | dmitrygr wrote:
       | In general MAC unit utilization tends to be low for transformers,
       | but 1.3% seems pretty bad. I wonder if they fucked up the memory
       | interface for the NPU. All the MACs in the world are useless if
       | you cannot feed them.
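        | 
        | Whether they can be fed comes down to arithmetic intensity. A
        | back-of-envelope sketch (my own illustrative numbers, assuming
        | ~100 GB/s of LPDDR bandwidth):
        | 
        |   # Roofline-style estimate, not measured figures.
        |   peak_ops = 45e12        # advertised int8 ops/s
        |   bw = 100e9              # assumed DRAM bandwidth, bytes/s
        |   # MxK @ KxN int8 matmul: ops = 2*M*N*K,
        |   # bytes ~ M*K + K*N + 4*M*N (int8 in, int32 out)
        |   M = N = K = 1024
        |   ops = 2 * M * N * K
        |   bytes_moved = M * K + K * N + 4 * M * N
        |   intensity = ops / bytes_moved           # ~341 ops/byte
        |   print(min(peak_ops, bw * intensity) / 1e12, "TOPS ceiling")
        | 
        | Even a large matmul streamed once from DRAM caps out around
        | ~34 TOPS under those assumptions, and real layers with less
        | reuse sit far lower, so 1.3% needs something else to be wrong
        | as well.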
        
         | moffkalast wrote:
         | I recall looking over the Ryzen AI architecture and the NPU is
         | just plugged into PCIe and thus gets completely crap memory
         | bandwidth. I would expect it might be similar here.
        
           | PaulHoule wrote:
           | I spent a lot of time with a business partner and an expert
           | looking at the design space for accelerators and it was made
           | very clear to me that the memory interface puts a hard limit
           | on what you can do and that it is difficult to make the most
           | of. Particularly if a half-baked product is being rushed out
           | because of FOMO you'd practically expect them to ship
           | something that gives a few percent of the performance because
            | the memory interface doesn't really work; it happens to the
            | best of them:
           | 
           | https://en.wikipedia.org/wiki/Cell_(processor)
        
           | wtallis wrote:
           | It's unlikely to be literally connected over PCIe when it's
            | on the same chip. It just _looks_ like it's connected over
           | PCIe because that's how you make peripherals discoverable to
           | the OS. The integrated GPU also appears to be connected over
           | PCIe, but obviously has access to far more memory bandwidth.
        
         | Hizonner wrote:
         | It's a tablet. It probably has like one DDR channel. It's not
         | so much that they "fucked it up" as that they knowingly built a
         | grossly unbalanced system so they could report a pointless
         | number.
        
           | dmitrygr wrote:
            | Well, no. If the CPU can hit better numbers on the same model,
            | then the bandwidth from the DDR _IS_ there. Probably the NPU
            | does not attach to the proper cache level, or just has a very
            | thin pipe to it.
        
             | Hizonner wrote:
             | The CPU is only about twice as good as the NPU, though
             | (four times as good on one test). The NPU is being
             | advertised as capable of 45 trillion operations per second,
             | and he's getting 1.3 percent of that.
             | 
             | So, OK, yeah, I concede that the NPU may have even worse
              | access to memory than the CPU, but the bottom line is that
              | neither one of them has anything close to what it needs to
              | actually deliver anything like the marketing headline
              | performance number on any realistic workload.
             | 
             | I bet a lot of people have bought those things after seeing
             | "45 TOPS", thinking that they'd be able to usefully run
             | transformers the size of main memory, and that's not
             | happening on CPU _or_ NPU.
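              | 
              | The back-of-envelope for that case is stark: decoding
              | streams essentially all the weights once per token, so
              | tokens/s is roughly bandwidth / model size no matter how
              | many TOPS you have. A quick sketch with assumed numbers:
              | 
              |   bandwidth_gb_s = 100  # assumed LPDDR bandwidth
              |   model_gb = 16         # model sized to fill main memory
              |   print(bandwidth_gb_s / model_gb, "tokens/s ceiling")
              | 
              | That's ~6 tokens/s as a hard ceiling, NPU or not.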
        
               | dmitrygr wrote:
               | Yup, sad all round. We are in agreement.
        
       | pram wrote:
       | I laughed when I saw that the Qualcomm "AI PC" is described as
       | this in the ComfyUI docs:
       | 
       | "Avoid", "Nothing works", "Worthless for any AI use"
        
       | jsheard wrote:
       | These NPUs are tying up a substantial amount of silicon area so
       | it would be a real shame if they end up not being used for much.
       | I can't find a die analysis of the Snapdragon X which isolates
        | the NPU specifically, but AMD's equivalent with the same ~50 TOPS
       | performance target can be seen here, and takes up about as much
       | area as three high performance CPU cores:
       | 
       | https://www.techpowerup.com/325035/amd-strix-point-silicon-p...
        
         | Kon-Peki wrote:
         | Modern chips have to dedicate a certain percentage of the die
         | to dark silicon [1] (or else they melt/throttle to
         | uselessness), and these kinds of components count towards that
         | amount. So the point of these components is to be used, but not
         | to be used too much.
         | 
         | Instead of an NPU, they could have used those transistors and
         | die space for any number of things. But they wouldn't have put
         | additional high performance CPU cores there - that would
         | increase the power density too much and cause thermal issues
         | that can only be solved with permanent throttling.
         | 
         | [1] https://en.wikipedia.org/wiki/Dark_silicon
        
           | IshKebab wrote:
           | If they aren't being used it would be better to dedicate the
           | space to more SRAM.
        
             | a2l3aQ wrote:
             | The point is parts of the CPU have to be off or throttled
             | down when other components are under load to maintain TDP,
              | adding cache that would almost certainly be in constant use
              | defeats the point of that.
        
               | jsheard wrote:
               | Doesn't SRAM have much lower power density than logic per
               | area though? Hence why AMD can get away with physically
               | stacking cache on top of more cache in their X3D parts,
               | without the bottom layer melting.
        
         | ezst wrote:
         | I can't wait for the LLM fad to be over so we get some sanity
         | (and efficiency) back. I personally have no use for this extra
         | hardware ("GenAI" doesn't help me in any way nor supports any
         | work-related tasks). Worse, most people have no use for that
         | (and recent surveys even show predominant hostility towards AI
         | creep). We shouldn't be paying extra for that, it should be
         | opt-in, and then it would become clear (by looking at the sales
         | and how few are willing to pay a premium for "AI") how
         | overblown and unnecessary this is.
        
           | DrillShopper wrote:
            | Corporatized gains in the market from hype. Socialized losses
            | in increased carbon emissions, upheaval from job loss, and
            | higher prices on hardware.
           | 
           | The more they say the future will be better the more that it
           | looks like the status quo.
        
           | renewiltord wrote:
            | I was telling someone this and they gave me a link to a laptop
           | with higher battery life and better performance than my own,
           | but I kept explaining to them that the feature I cared most
           | about was die size. They couldn't understand it so I just had
           | to leave them alone. Non-technical people don't get it. Die
           | size is what I care about. It's a critical feature and so
           | many mainstream companies are missing out on _my money_
            | because they won't optimize die size. Disgusting.
        
             | nl wrote:
             | Is this a parody?
             | 
             | Why would anyone care about die size? And if you do why not
             | get one of the many low power laptops with Atoms etc that
             | do have small die size?
        
               | throwaway48476 wrote:
               | Maybe through a game of telephone they confused die size
               | and node size?
        
               | thfuran wrote:
               | Yes, they're making fun of the comment they replied to.
        
               | tedunangst wrote:
               | No, no, no, you just don't get it. The only thing Dell
               | will sell me is a laptop 324mm wide, which is totally
               | appalling, but if they offered me a laptop that's 320mm
               | wide, I'd immediately buy it. In my line of work, which
               | is totally serious business, every millimeter counts.
        
             | _zoltan_ wrote:
             | News flash: you're in the niche of the niche. People don't
             | care about die size.
             | 
             | I'd be willing to bet that the amount of money they are
              | missing out on is minuscule and is by far offset by
             | people's money who care about other stuff. Like you know,
             | performance and battery life, just to stick to your
             | examples.
        
         | JohnFen wrote:
         | > These NPUs are tying up a substantial amount of silicon area
         | so it would be a real shame if they end up not being used for
         | much.
         | 
         | This has been my thinking. Today you have to go out of your way
         | to buy a system with an NPU, so I don't have any. But tomorrow,
         | will they just be included by default? That seems like a waste
         | for those of us who aren't going to be running models. I wonder
         | what other uses they could be put to?
        
           | jsheard wrote:
           | > But tomorrow, will they just be included by default?
           | 
           | That's already the way things are going due to Microsoft
           | decreeing that Copilot+ is the future of Windows, so AMD and
           | Intel are both putting NPUs which meet the Copilot+
           | performance standard into every consumer part they make going
           | forwards to secure OEM sales.
        
             | AlexAndScripts wrote:
             | It almost makes me want to find some use for them on my
              | Linux box (not that it has an NPU), but I truly can't think
             | of anything. Too small to run a meaningful LLM, and I'd
             | want that in bursts anyway, I hate voice controls (at least
             | with the current tech), and Recall sounds thoroughly
             | useless. Could you do mediocre machine translation on it,
             | perhaps? Local github copilot? An LLM that is purely used
             | to build an abstract index of my notes in the background?
             | 
             | Actually, could they be used to make better AI in games?
             | That'd be neat. A shooter character with some kind of
             | organic tactics, or a Civilisation/Stellaris AI that
             | doesn't suck.
        
           | jonas21 wrote:
           | NPUs are already included by default in the Apple ecosystem.
           | Nobody seems to mind.
        
             | JohnFen wrote:
             | It's not really a question of minding if it's there, unless
             | its presence increases cost, anyway. It just seems a waste
             | to let it go idle, so my mind wanders to what other use I
             | could put that circuitry to.
        
       | tromp wrote:
       | > the 45 trillion operations per second that's listed in the
       | specs
       | 
        | Such a spec should ideally be accompanied by code
       | demonstrating or approximating the claimed performance. I can't
       | imagine a sports car advertising a 0-100km/h spec of 2.0 seconds
       | where a user is unable to get below 5 seconds.
        
         | dmitrygr wrote:
          | Most likely multiplying the same 128x128 matrix from cache to
          | cache. That gets you perfect MAC utilization with no need to
          | hit memory, and a big number that is not directly a lie -
          | that perf _IS_ attainable, on a useless synthetic benchmark.
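          | 
          | To make that concrete (my arithmetic, with an assumed matrix
          | size of 128x128):
          | 
          |   n = 128
          |   ops = 2 * n**3         # muls + adds, as marketing counts
          |   data = 3 * n * n       # A, B, C on-chip: ~48 KiB total
          |   print(ops / data, "ops/byte")   # ~85: compute-bound
          |   print(45e12 / ops, "matmuls/s to sustain 45 TOPS")  # ~10.7M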
        
           | kmeisthax wrote:
           | Sounds great for RNNs! /s
        
         | tedunangst wrote:
         | I have some bad news for you regarding how car acceleration is
         | measured.
        
       | isusmelj wrote:
        | I think the results show that, in general, the compute is not
        | being used well. That the CPU took 8.4ms and the GPU 3.2ms shows a
       | very small gap. I'd expect more like 10x - 20x difference here.
       | I'd assume that the onnxruntime might be the issue. I think some
       | hardware vendors just release the compute units without shipping
       | proper support yet. Let's see how fast that will change.
       | 
        | Also, people often mistake the reason for an NPU as "speed".
        | That's not correct. The whole point of the NPU is rather to focus
        | on low power consumption. To focus on speed you'd need to get rid
        | of the memory bottleneck. Then you end up designing your own ASIC
        | with its own memory. The NPUs we see in most devices are part of
        | the SoC around the CPU to offload AI computations. It would be
        | interesting to run this benchmark in an infinite loop on the
        | three devices (CPU, NPU, GPU) and measure power consumption. I'd
        | expect the NPU to be lowest and also best in terms of "ops/watt".
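        | 
        | A sketch of what that loop could look like with onnxruntime's
        | execution providers (standard provider names on Windows; power
        | draw would still need an external meter or the OS energy
        | counters, and "model.onnx" is a placeholder):
        | 
        |   import time
        |   import numpy as np
        |   import onnxruntime as ort
        | 
        |   providers = {
        |       "cpu": ["CPUExecutionProvider"],
        |       "gpu": ["DmlExecutionProvider"],   # DirectML
        |       "npu": ["QNNExecutionProvider"],   # Qualcomm NPU
        |   }
        |   for name, prov in providers.items():
        |       sess = ort.InferenceSession("model.onnx", providers=prov)
        |       # Zero-filled inputs; assumes float32, dynamic dims -> 1
        |       feeds = {
        |           i.name: np.zeros(
        |               [d if isinstance(d, int) else 1 for d in i.shape],
        |               dtype=np.float32)
        |           for i in sess.get_inputs()}
        |       t0, n = time.perf_counter(), 0
        |       while time.perf_counter() - t0 < 60:  # a minute per device
        |           sess.run(None, feeds)
        |           n += 1
        |       print(name, n / 60.0, "inferences/s")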
        
         | AlexandrB wrote:
          | > Also, people often mistake the reason for an NPU as "speed".
         | That's not correct. The whole point of the NPU is rather to
         | focus on low power consumption.
         | 
         | I have a sneaking suspicion that the real real reason for an
         | NPU is marketing. "Oh look, NVDA is worth $3.3T - let's make
         | sure we stick some AI stuff in our products too."
        
           | kmeisthax wrote:
           | You forget "Because Apple is doing it", too.
        
             | rjsw wrote:
             | I think other ARM SoC vendors like Rockchip added NPUs
             | before Apple, or at least around the same time.
        
           | itishappy wrote:
           | I assume you're both right. I'm sure NPUs exist to fill a
           | very real niche, but I'm also sure they're being shoehorned
           | in everywhere regardless of product fit because "AI big right
           | now."
        
             | wtallis wrote:
             | Looking at it slightly differently: putting low-power NPUs
             | into laptop and phone SoCs is how to get on the AI
             | bandwagon in a way that NVIDIA cannot easily disrupt. There
             | are plenty of systems where a NVIDIA discrete GPU cannot
             | fit into the budget (of $ or Watts). So even if NPUs are
             | still somewhat of a solution in search of a problem (aka a
             | killer app or two), they're not necessarily a sign that
             | these manufacturers are acting entirely without strategy.
        
         | kmeisthax wrote:
         | > I think some hardware vendors just release the compute units
         | without shipping proper support yet
         | 
         | This is Nvidia's moat. Everything has optimized kernels for
         | CUDA, and _maybe_ Apple Accelerate (which is the only way to
         | touch the CPU matrix unit before M4, and the NPU at all). If
         | you want to use anything else, either prepare to upstream
         | patches in your ML framework of choice or prepare to write your
         | own training and inference code.
        
       | jamesy0ung wrote:
        | What exactly does Windows do with an NPU? I don't own an 'AI PC'
       | but it seems like the NPUs are slow and can't run much.
       | 
       | I know Apple's Neural Engine is used to power Face ID and the
       | facial recognition stuff in Photos, among other things.
        
         | DrillShopper wrote:
         | It supports Microsoft's Recall (now required) spyware
        
           | Janicc wrote:
           | Please remind me again how Recall sends data to Microsoft. I
           | must've missed that part. Or are you against the print screen
           | button too? I heard that takes images too. Very scary.
        
             | cmeacham98 wrote:
             | While calling it spyware like GP is over-exaggeration to a
             | ridiculous level, comparing Recall to Print Screen is also
             | inaccurate:
             | 
             | Print Screen takes images on demand, Recall does so
             | effectively at random. This means Recall could
             | inadvertently screenshot and store information you didn't
             | intend to keep a record of (To give an extreme example:
             | Imagine an abuser uses Recall to discover their spouse
             | browsing online domestic violence resources).
        
             | Terr_ wrote:
             | > Please remind me again how Recall sends data to
             | Microsoft. I must've missed that part.
             | 
              | Sure, just post the source code and I'll point it out; I
              | somehow misplaced my copy. /s
             | 
             | The core problem here is trust, and over the last several
             | years Microsoft has burned a hell of a lot of theirs with
             | power-users of Windows. Even their most strident public
             | promises of Recall being "opt-in" and "on-device only" will
             | --paradoxically--only be kept as long as people remain
             | suspicious.
             | 
              | Once there's a critical mass, MS will go back to their old
              | games, pushing a mandatory "security update" which resets or
              | entirely removes your privacy settings and adds new
              | "telemetry" streams which you cannot inspect.
        
             | bloated5048 wrote:
              | It's always safe to assume it does if it's closed source. I'd
              | rather be suspicious of big corporations seeking to profit
              | at every step than naive.
              | 
              | Also, it's a security risk which has already been exploited.
             | Sure, MS fixed it, but can you be certain it won't be
             | exploited some time in the future again?
        
         | dagaci wrote:
          | It's used for improving video calls, special effects, image
          | editing/effects, noise cancelling, and Teams stuff.
        
       | eightysixfour wrote:
       | I thought the purpose of these things was not to be fast, but to
       | be able to run small models with very little power usage? I have
       | a newer AMD laptop with an NPU, and my power usage doesn't change
       | using the video effects that supposedly run on it, but goes up
       | when using the nvidia studio effects.
       | 
       | It seems like the NPUs are for very optimized models that do
       | small tasks, like eye contact, background blur, autocorrect
       | models, transcription, and OCR. In particular, on Windows, I
       | assumed they were running the full screen OCR (and maybe
       | embeddings for search) for the rewind feature.
        
         | conradev wrote:
         | That is my understanding as well: low power and low latency.
         | 
         | You can see this in action when evaluating a CoreML model on a
          | macOS machine. The ANE takes half as long as the GPU, which
          | takes half as long as the CPU (actual factors being model
          | dependent).
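          | 
          | On macOS this is easy to reproduce with coremltools, which can
          | pin a model to a compute unit (the model path and input
          | name/shape below are placeholders):
          | 
          |   import time
          |   import numpy as np
          |   import coremltools as ct
          | 
          |   sample = {"input": np.zeros((1, 3, 224, 224), np.float32)}
          |   units = {
          |       "cpu": ct.ComputeUnit.CPU_ONLY,
          |       "gpu": ct.ComputeUnit.CPU_AND_GPU,
          |       "ane": ct.ComputeUnit.CPU_AND_NE,  # Neural Engine
          |   }
          |   for name, unit in units.items():
          |       m = ct.models.MLModel("model.mlpackage",
          |                             compute_units=unit)
          |       m.predict(sample)                  # warm-up
          |       t0 = time.perf_counter()
          |       for _ in range(100):
          |           m.predict(sample)
          |       dt = (time.perf_counter() - t0) / 100
          |       print(name, f"{dt * 1e3:.2f} ms/inference")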
        
           | nickpsecurity wrote:
           | To take half as long, doesn't it have to perform twice as
           | fast? Or am I misreading your comment?
        
             | eightysixfour wrote:
             | No, you can have latency that is independent of compute
             | performance. The CPU/GPU may have other tasks and the work
             | has to wait for the existing threads to finish, or for them
             | to clock up, or have slower memory paths, etc.
             | 
             | If you and I have the same calculator but I'm working on a
             | set of problems and you're not, and we're both asked to do
             | some math, it may take me longer to return it, even though
             | the instantaneous performance of the math is the same.
        
               | refulgentis wrote:
               | In isolation, makes sense.
               | 
               | Wouldn't it be odd for OP to present examples that are
               | the _opposite_ of their claim, just to get us thinking
               | about  "well the CPU is busy?"
               | 
               | Curious for their input.
        
         | boomskats wrote:
         | That's especially true because yours is a Xilinx FPGA. The one
         | that they just attached to the latest gen mobile ryzens is 5x
         | more capable too.
         | 
         | AMD are doing some fantastic work at the moment, they just
         | don't seem to be shouting about it. This one is particularly
         | interesting
         | https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1...
         | 
         | edit: not an FPGA. TIL. :'(
        
           | errantspark wrote:
           | Wait sorry back up a bit here. I can buy a laptop that has a
           | daughter FPGA in it? Does it have GPIO??? Are we seriously
           | building hardware worth buying again in 2024? Do you have a
           | link?
        
             | eightysixfour wrote:
             | It isn't as fun as you think - they are setup for specific
             | use cases and quite small. Here's a link to the software
             | page: https://ryzenai.docs.amd.com/en/latest/index.html
             | 
             | The teeny-tiny "NPU," which is actually an FPGA, is 10
             | TOPS.
             | 
             | Edit: I've been corrected, not an FPGA, just an IP block
             | from Xilinx.
        
               | wtallis wrote:
                | It's not an FPGA. It's an NPU IP block from the Xilinx
               | side of the company. It was presumably originally
               | developed to be run on a Xilinx FPGA, but that doesn't
               | mean AMD did the stupid thing and actually fabbed a FPGA
               | fabric instead of properly synthesizing the design for
               | their laptop ASIC. Xilinx involvement does not
               | automatically mean it's an FPGA.
        
               | eightysixfour wrote:
               | Thanks for the correction, edited.
        
               | boomskats wrote:
               | Do you have any more reading on this? How come the XDNA
               | drivers depend on Xilinx' XRT runtime?
        
               | almostgotcaught wrote:
               | because XRT has a plugin architecture: XRT<-shim
               | plugin<-kernel driver. The shims register themselves with
               | XRT. The XDNA driver repo houses both the shim and the
               | kernel driver.
        
               | boomskats wrote:
               | Thanks, that makes sense.
        
               | wtallis wrote:
                | It would be surprising and strange if AMD _didn't_ reuse
               | the software framework they've already built for doing AI
               | when that IP block is instantiated on an FPGA fabric
               | rather than hardened in an ASIC.
        
               | boomskats wrote:
               | Well, I'm irrationally disappointed, but thanks.
               | Appreciate the correction.
        
               | boomskats wrote:
               | Yes, the one on the ryzen 7000 chips like the 7840u isn't
               | massive, but that's the last gen model. The one they've
               | just released with the HX370 chip is estimated at 50
               | TOPS, which is better than Qualcomm's ARM flagship that
               | this post is about. It's a fivefold improvement in a
                | single generation; it's pretty exciting.
               | 
               | And it's an FPGA.
        
               | almostgotcaught wrote:
               | > And it's an FPGA.
               | 
               | nope it's not.
        
             | dekhn wrote:
             | If you want GPIOs, you don't need (or want) an FPGA.
             | 
             | I don't know the details of your use case, but I work with
             | low level hardware driven by GPIOs and after a bit of
              | investigation, concluded that having direct GPIO access in
             | a modern PC was not necessary or desirable compared to the
             | alternatives.
        
           | beeflet wrote:
           | It would be cool if most PCs had a general purpose FPGA that
           | could be repurposed by the operating system. For example you
           | could use it as a security processor like a TPM or as a
           | bootrom, or you could repurpose it for DSP or something.
           | 
           | It just seems like this would be better in terms of
           | firmware/security/bootloading because you would be more able
           | to fix it if an exploit gets discovered, and it would be
           | leaner because different operating systems can implement
           | their own stuff (for example linux might not want pluton in-
           | chip security, windows might not want coreboot or linux-based
           | boot, bare metal applications can have much simpler boot).
        
             | walterbell wrote:
             | Xilinx Artix 7-series PicoEVB fits in M.2 wifi slot and has
             | an OSS toolchain, http://www.enjoy-digital.fr/
        
           | pclmulqdq wrote:
           | It's not an FPGA. It's a VLIW DSP that Xilinx built to go
           | into an FPGA-SoC to help run ML models.
        
             | almostgotcaught wrote:
             | this is the correct answer. one of the compilers for this
             | DSP is https://github.com/Xilinx/llvm-aie.
        
           | numpad0 wrote:
           | Sorry for an OT comment but what is going on with that ascii
            | art!? The content fits within 80 columns just fine[1]. Is it
            | GPT generated?
           | 
           | 1: https://pastebin.com/raw/R9BrqETR
        
         | refulgentis wrote:
         | You're absolutely right IMO, given what I heard when launching
         | on-device speech recognition on Pixel, and after leaving
         | Google, what I see from ex. Apple Neural Engine vs. CPU when
         | running ONNX stuff.
         | 
         | I'm a bit suspicious of the article's specific conclusion,
          | because it is Qualcomm's ONNX, and it may be out of date. Also,
         | Android loved talking shit about Qualcomm software engineering.
         | 
          | That being said, it's directionally correct, insomuch as
          | consumer hardware AI acceleration claims are near-universally
          | BS unless you're A) writing 1P software or B) someone in the 1P
          | really wants you to take advantage.
        
           | kristianp wrote:
           | 1P?
        
             | refulgentis wrote:
             | First party, i.e. Google/Apple/Microsoft
        
       | wmf wrote:
       | This headline is seriously misleading because the author did not
        | test AMD or Intel NPUs. If Qualcomm is slow, don't say all AI PCs
        | are not good.
        
       | protastus wrote:
        | Deploying a model on an NPU requires significant profile-based
       | optimization. Picking up a model that works fine on the CPU but
       | hasn't been optimized for an NPU usually leads to disappointing
       | results.
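        | 
        | For the Qualcomm path specifically, onnxruntime's QNN execution
        | provider exposes knobs for exactly this kind of tuning; a sketch
        | (option names per the QNN EP docs, model path is a placeholder):
        | 
        |   import onnxruntime as ort
        | 
        |   # Run on the Hexagon NPU with profiling on, to see which ops
        |   # actually land on the NPU and which fall back to the CPU.
        |   sess = ort.InferenceSession(
        |       "model.onnx",
        |       providers=["QNNExecutionProvider"],
        |       provider_options=[{
        |           "backend_path": "QnnHtp.dll",     # HTP = Hexagon NPU
        |           "profiling_level": "detailed",
        |           "htp_performance_mode": "burst",
        |       }],
        |   )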
        
         | catgary wrote:
         | Yeah whenever I've spoken to people who work on stuff like IREE
         | or OpenXLA they gave me the impression that understanding how
         | to use those compilers/runtimes is an entire job.
        
         | CAP_NET_ADMIN wrote:
         | Beauty of CPUs - they'll chew through whatever bs code you
         | throw at them at a reasonable speed.
        
       | lostmsu wrote:
        | The author's benchmark sucks if he could only get 2 TOPS from a
        | laptop 4080. The thing should be doing somewhere around 80 TOPS.
        | 
        | Given that, you should take his NPU results with a truckload of
       | salt.
        
       | downrightmike wrote:
        | They should have just made a PCI card and not tried to push whole
       | new machines on us. We are all good with the machines we already
       | have. If you want to sell a new feature, then it needs to be an
       | add-on
        
       | Mistletoe wrote:
       | >The second conclusion is that the measured performance of 573
       | billion operations per second is only 1.3% of the 45 trillion
       | ops/s that the marketing material promises.
       | 
       | It just gets so hard to take this industry seriously.
        
       ___________________________________________________________________
       (page generated 2024-10-16 23:00 UTC)