[HN Gopher] AMD's Strix Point: Zen 5 Hits Mobile
       ___________________________________________________________________
        
       AMD's Strix Point: Zen 5 Hits Mobile
        
       Author : klelatti
       Score  : 150 points
       Date   : 2024-08-10 21:18 UTC (1 day ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | sm_1024 wrote:
        | IMO, the most interesting thing about this line is the battery
        | life---within an hour of the M3 MBP and within 2 hours of
        | Asus's Qualcomm machine, making it comparable to ARM designs.
       | 
       | Which is a little surprising because ARM is commonly believed to
       | be much more power efficient than x86.
       | 
       | [1] https://youtu.be/Z8WKR0VHfJw?si=A7zbFY2lsDa8iVQN&t=277
        
         | jiggawatts wrote:
         | > because ARM is commonly believed to be much more power
         | efficient than x86.
         | 
         | Because most ARM processors were designed for mobile phones and
         | optimised to death for power efficiency.
         | 
         | The total power usage of the front end decoders is a single
         | digit percentage of the total power draw. Even if ARM magically
         | needed 0 watts for this, it couldn't save more power than that.
         | The rest of the processor design elements are essentially
         | identical.
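          | 
          | Back-of-envelope version of that bound (toy numbers; the 5%
          | share stands in for "single digit percentage" and is an
          | assumption, not a measurement):
          | 
          |   package_w = 30.0     # assumed total package power
          |   decode_share = 0.05  # assumed decoder share of power
          |   # even a zero-watt decoder saves at most:
          |   print(package_w * decode_share)  # 1.5 W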
        
         | halJordan wrote:
          | Yeah, if you make a worse core and then downclock it, you
          | will increase power efficiency. AMD thankfully only
          | downclocks the Zen 5c, but Intel is shipping ivy lake
          | equivalents in their flagship products just to get power
          | efficiency up.
        
         | arnaudsm wrote:
          | ARM got a lot of hype since the release of the M1, but most
          | users only compared it to the terrible Intel MBPs. Ryzen
          | mobile has been consistently close to Apple silicon
          | perf/watt for 5 years, but has gotten little press coverage.
          | 
          | Hype can be really disconnected from real-world performance.
        
           | sm_1024 wrote:
            | I have heard that part of the reason for the scant
            | coverage of Ryzen mobile CPUs is their limited
            | availability, as AMD was focusing its fab capacity on
            | server chips.
        
           | jsheard wrote:
            | Any efficiency comparison involving Apple's chips also has
            | to factor in that Tim Cook keeps showing up at TSMC's door
            | with a freight container full of cash to buy out exclusive
            | access to their bleeding-edge silicon processes. ARM may
            | be a factor, but don't underestimate the power of having
            | more money than God.
            | 
            | Case in point, Strix Point is built on TSMC 4nm while
            | Apple is already using TSMC's second-generation 3nm
            | process.
        
             | acdha wrote:
              | Process helps, but have you seen benchmarks showing
              | equivalent performance on the same process node? I think
              | it's less that ARM is amazing than that the Apple
              | Silicon team is very good and paired with aggressive
              | optimization throughout the stack; everything I've seen
              | suggests they are simply building better chips at their
              | target levels (not server, high power, etc.).
        
               | cubefox wrote:
               | > Our benchmark database shows the Dimensity 9300 scores
               | 2,207 and 7,408 in Geekbench 6.2's single and multi-core
               | tests. A 30% performance improvement implies the
                | Dimensity 9400 would score around 2,869 and 9,630.
               | Its single-core performance is close to that of the
               | Snapdragon 8 Gen 4 (2,884/8,840) and it understandably
               | takes the lead in multi-core. Both are within spitting
               | distance from the Apple A17 Pro, which scores 2,915 and
               | 7,222 points in the benchmark. Then again, all three
               | chips are said to be manufactured on TSMC's N3 class
               | node, effectively leveling the playing field.
               | 
               | https://www.notebookcheck.net/MediaTek-
               | Dimensity-9400-rumour...
        
               | sroussey wrote:
                | I guess getting close to the same single-thread score
                | is nice. Unfortunately, since only Apple is shipping,
                | it is hard to tell whether the others burn the battery
                | to get there.
                | 
                | I suspect the other two, like Apple with the A18
                | shipping next month, will be using the second-gen N3.
                | Apple is expected to be around 3,500 on that node.
               | 
               | Needless to say, what will be very interesting is to see
               | the perf/watt of all three on the same node and shipping
               | in actual products where the benchmarks can be put to
               | more useful tests.
        
               | cubefox wrote:
               | Yeah, and GPU tests, since the benchmarks above were only
               | for the CPU.
        
               | acdha wrote:
               | That appears to be an unconfirmed rumor and it's exciting
               | if true (and there aren't major caveats on power), but
               | did you notice how they mentioned extra work by ARM? The
               | argument isn't that Apple is unique, it's that the
               | performance gaps they've shown are more than simply
               | buying premium fab capacity.
               | 
               | That doesn't mean other designers can't also do that
               | work, but simply that it's more than just the process -
               | for example, the M2 shipped on TSMC's N5P first as an
               | exclusive but when Zen 5 shipped later on the same
               | process it didn't close the single core performance or
                | perf/watt gap. Some of that is x86 vs. ARM, but there
                | isn't a single, simple factor which can explain this -
                | e.g. Apple carefully tuning the hardware, firmware,
                | OS, compilers, and libraries undoubtedly helps a lot,
                | and it's been a perennial problem for non-Intel
                | vendors on the PC side since so many developers have
                | tuned for Intel first/only for decades.
        
               | cubefox wrote:
               | Since it's unclear whether Apple has a significant
               | architectural advantage over Qualcomm and MediaTek, I
               | would rather attribute this to relatively poor AMD
               | architectures. Provisionally. At least their GPUs have
               | been behind Nvidia for years. (AMD holding its own
               | against Intel is not surprising given Intel's chip fab
               | problems.)
        
               | acdha wrote:
               | Yes, to be clear I'd be very happy if MediaTek jumps in
               | with a strong contender since consumers win. It doesn't
               | look like the Qualcomm chips are performing as well as
               | hoped but I'd wait a bit to see how much tuning helps
               | since Windows ARM was not a major target until now.
        
               | AnthonyMouse wrote:
               | > for example, the M2 shipped on TSMC's N5P first as an
               | exclusive but when Zen 5 shipped later on the same
               | process it didn't close the single core performance or
               | perf/watt gap.
               | 
               | That was Zen 4, but it did close the gap:
               | 
               | https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073
               | _14...
               | 
               | Single thread performance is higher (so is MT), TDP is
               | slightly lower, Cinebench MT "points per watt" is 5%
               | higher.
               | 
               | We'll get to see it again when the 3nm version of Zen5 is
               | released (the initial ones are 4nm, which is a node Apple
               | didn't use).
        
             | hajile wrote:
             | Let's do the math on M1 Pro (10-core, N5, 2021) vs HX370
             | (12-core, N4P, 2024).
             | 
             | Firestorm without L3 is 2.281mm2. Icestorm is 0.59mm2. M1
             | Pro has 8P+2E for a total of 19.428mm2 of cores included.
             | 
             | Zen4 without L3 is 3.84mm2. Zen4c reduces that down to
             | 2.48mm2. Zen5 CCD is pretty much the same size as Zen4
             | (though with 27% more transistors), so core size should be
             | similar. AMD has also stated that Zen5c has a similar
             | shrink percent to Zen4c. We'll use their numbers. HX370 has
             | 4P+8C for a total area of 35.2mm2. If being twice the size
             | despite being on N4P instead of N5 like M1 seems like
             | foreshadowing, it is.
             | 
             | We'll use notebookcheck's Cinebench 2024 multithread power
             | and performance numbers to calculate perf / power / area
             | then multiply that by 100 to eliminate some decimals.
             | 
             | M1 Pro scores 824 (10-core) and while they don't have a
             | power value listed, they do list 33.6w package power
             | running the prime95 power virus, so cinebench's power
             | should be lower than that.
             | 
             | HX370 scored 1213 (12-core) and averaged 119w (maxing at a
             | massive 121.7w and that's without running a power virus).
             | 
             | This gives the following perf/power/area*100 scores:
             | 
             | M1 Pro -- 126 PPA
             | 
              | HX 370 -- 29 PPA
             | 
             | M1 is more than 4.3x better while being an entire node
             | behind and being released years before.
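              | 
              | A minimal Python sketch of that arithmetic, using the
              | figures above (note the 33.6w M1 Pro figure is prime95
              | package power, so M1's real Cinebench PPA is, if
              | anything, higher):
              | 
              |   # perf / power / area, scaled by 100
              |   def ppa(score, watts, area_mm2):
              |       return score / watts / area_mm2 * 100
              | 
              |   print(ppa(824, 33.6, 19.428))  # M1 Pro -> ~126
              |   print(ppa(1213, 119, 35.2))    # HX 370 -> ~29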
        
               | cyp0633 wrote:
                | Power efficiency is a curve, and Apple may have its
                | own reasons not to make the M1 Pro run at 110W as
                | well.
        
               | sm_1024 wrote:
                | I think the OC might have misread the power numbers;
                | 110 W is well into desktop CPU power range. Here is an
                | excerpt from AnandTech:
               | 
               | > In our peak power test, the Ryzen AI 9 HX 370 ramped up
               | and peaked at 33 W.
               | 
               | https://www.anandtech.com/show/21485/the-amd-ryzen-ai-
               | hx-370...
        
               | hajile wrote:
               | You can read the notebookcheck review for yourself.
               | 
               | https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-
               | anal...
        
               | ac29 wrote:
               | Those 100W+ numbers are total system power. And that
               | system has the CPU TDP set to 80W (far above AMD's
               | official max of 54W). It also has a discrete 4070 GPU
               | that can use over 100W on its own.
        
               | paulmd wrote:
               | if x86 laptops have 90w of platform power, that's a thing
               | that's concerning in itself, not a reasonable defense.
               | 
               | Remember, apple laptops have screens too, etc, and that
               | shows up in the average system power measurements the
               | same way. What's the difference in an x86 laptop?
               | 
                | I really doubt it's actually platform power; the
                | problem is that x86 is boosting up to 35W average/60W
                | peak _per thread_. 120W package power isn't
                | _unexpected_, if you're boosting 3-4 cores to maximum!
                | 
                | And _that's_ the problem. x86 is far, far worse at
                | race-to-sleep. It's not just "macOS has better
                | scheduling"... you can see from the 1T power
                | measurements that x86 is simply drawing 2-3x the power
                | while it's racing-to-sleep, for performance that's
                | roughly equivalent to ARM.
               | 
               | Whatever the cause, whether it's just bad design from AMD
               | and Intel, or legacy x86 cruft (I don't get how this
               | applies to actual computational load though, as opposed
               | to situations like idle power), or what... there is no
               | getting around the fact that M2 tops out at 10W per core
                | and an 8840HS or HX370 or Intel Meteor Lake are boosting
               | to 30-35W at 1T loads.
        
               | hajile wrote:
               | I stacked the deck in AMD's favor using a 3-year-old chip
               | on an older node.
               | 
               | Why is AMD using 3.6x more power than M1 to get just 32%
               | higher performance while having 17% more cores? Why are
               | AMD's cores nearly 2x the size despite being on a better
               | node and having 3 more years to work on them?
               | 
               | Why are Apple's scores the same on battery while AMD's
               | scores drop dramatically?
               | 
               | Apple does have a reason not to run at 120w -- it doesn't
               | need to.
               | 
               | Meanwhile, if AMD used the same 33w, nobody would buy
               | their chips because performance would be so incredibly
               | bad.
        
               | pickledish wrote:
               | You should try not to talk so confidently about things
               | you don't know about -- this statement
               | 
               | > if AMD used the same 33w, nobody would buy their chips
               | because performance would be so incredibly bad
               | 
                | Is completely incorrect, as another commenter (and I
                | think the notebookcheck article?) points out -- 30w is
                | about the sweet spot for these processors, and the
                | reason that 110w laptop seems so inefficient is that
                | it's giving the APU 80w of TDP, which is a bit silly
                | since it only performs marginally better than if you
                | gave it e.g. 30 watts. It's not a good idea to take
                | that example as a benchmark for the APU's efficiency;
                | it varies depending on how much TDP you give the
                | processor, and 80w is not a good TDP for these chips.
        
               | hajile wrote:
               | Halo products with high scores sell chips. This isn't a
               | new idea.
               | 
               | So you lower the wattage down. Now you're at M1 Pro
               | levels of performance with 17% more cores and nearly
               | double the die area and barely competing with a chip 3
               | years older while on a newer, more expensive node too.
               | 
               | That's not selling me on your product (and that's without
               | mentioning the worst core latency I've seen in years when
               | going between P and C cores).
        
               | AnthonyMouse wrote:
               | > I stacked the deck in AMD's favor using a 3-year-old
               | chip on an older node.
               | 
               | You could just compare the ones that are actually on the
               | same process node:
               | 
               | https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073
               | _14...
               | 
               | But then you would see an AMD CPU with a lower TDP
               | getting higher benchmark results.
               | 
               | > Why is AMD using 3.6x more power than M1 to get just
               | 32% higher performance while having 17% more cores?
               | 
               | Getting 32% higher performance from 17% more cores
               | implies higher performance per core.
               | 
               | The power measurements that site uses are from the plug,
               | which is highly variable to the point of uselessness
               | because it takes into account every other component the
               | OEM puts into the machine and random other factors like
               | screen brightness, thermal solution and temperature
               | targets (which affects fan speed which affects fan power
               | consumption) etc. If you measure the wall power of a
               | system with a discrete GPU that by itself has a TDP >100W
               | and the system is drawing >100W, this tells you nothing
               | about the efficiency of the CPU.
               | 
               | AMD's CPUs have internal power monitors and configurable
                | power targets. At full load there is very little
                | daylight between the configured TDP and what they
                | actually use.
               | This is basically required because the CPU has to be able
               | to operate in a system that can't dissipate more heat
               | than that, or one that can't supply more power.
               | 
               | > Meanwhile, if AMD used the same 33w, nobody would buy
               | their chips because performance would be so incredibly
               | bad.
               | 
               | 33W is approximately what their mobile CPUs actually use.
               | Also, even lower-configured TDP models exist and they're
               | not that much slower, e.g. the 7840U has a base TDP of
               | 15W vs. 35W for the 7840HS and the difference is a base
               | clock of 3.3GHz instead of 3.8GHz.
        
               | hajile wrote:
               | > Getting 32% higher performance from 17% more cores
               | implies higher performance per core.
               | 
               | I don't disagree that it is higher perf/core. It is
               | simply MUCH worse perf/watt because they are forced to
               | clock so high to achieve those results.
               | 
               | > The power measurements that site uses are from the
               | plug, which is highly variable to the point of
               | uselessness
               | 
               | They measure the HX370 using 119w with the screen off
               | (using an external monitor). What on that motherboard
               | would be using the remaining 85+W of power?
               | 
               | TDP is a suggestion, not a hard limit. Before thermal
               | throttling, they will often exceed the TDP by a factor of
               | 2x or more.
               | 
               | As to these specific benchmarks, the R9 7945HX3D you
               | linked to used 187w while the M2 Max used 78w for CB R15.
               | As to perf/watt, Cinebench before 2024 wasn't using NEON
               | properly on ARM, but was using Intel's hyper-optimized
               | libraries for x86. You should be looking at benchmarks
               | without such a massive bias.
        
               | paulmd wrote:
               | lmao he's citing cinebench R15? Which isn't just ancient
               | but actually emulated on arm, of course.
               | 
               | Really digging through the vaults for that one.
               | 
               | Geekbench 6 is perfectly fine for that stuff. But that
                | still shows Apple tying in MT and beating the pants off
               | x86 in 1T efficiency.
               | 
               | x86 1T boosts being silly is where the real problem comes
               | from. But if they don't throw 30-35w at a single thread
               | they lose horribly.
        
               | Const-me wrote:
               | > if AMD used the same 33w, nobody would buy their chips
               | because performance would be so incredibly bad
               | 
               | I'm writing this comment on HP ProBook 445 G8 laptop. I
               | believe I bought it in early 2022, so it's a relatively
               | old model. The laptop has a Ryzen 5 5600U processor which
               | uses <= 25W. I'm quite happy with both the performance
               | and battery life.
        
               | atq2119 wrote:
               | It's well known that performance doesn't scale linearly
               | with power.
               | 
                | Benchmarking incentives on PC have long pushed x86
               | vendors to drive their CPUs at points of the
               | power/performance curve that make their chips look less
               | efficient than they really are. Laptop benchmarking has
               | inherited that culture from desktop PC benchmarking to
               | some extent. This is slowly changing, but Apple has never
               | been subject to the same benchmarking pressures in the
               | first place.
               | 
               | You'll see in reviews that Zen5 can be very efficient
               | when operated in the right power range.
        
               | hajile wrote:
               | Zen5 can be more efficient at lower clockspeeds, but then
               | it loses badly to Apple's chips in raw performance.
        
               | sm_1024 wrote:
                | 119W for the HX370 looks extremely sus; it seems more
                | like system-level power consumption, not CPU-only.
               | 
               | According to phoronix [1,2], in their blender CPU test,
               | they measured a peak of 33W.
               | 
                | Here are max power numbers from some other tests that
                | I know are multi-threaded:
               | 
               | --
               | 
               | Linux 6.8 Compilation: 33.13 W
               | 
               | LLVM Compilation: 33.25 W
               | 
               | --
               | 
                | If I plug 33W into your equation, that would give the
                | HX 370 a score of 104 PPA.
               | 
               | This supports the HX 370 being pretty power efficient,
               | although still not as power efficient as M3.
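                | 
                | (Reusing the PPA formula from upthread, that's just
                | 
                |   print(1213 / 33 / 35.2 * 100)  # ~104
                | 
                | assuming the 33W figure holds for Cinebench too.)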
               | 
               | [1] https://www.phoronix.com/review/amd-ryzen-
               | ai-9-hx-370/3
               | 
               | [2] https://www.phoronix.com/review/amd-ryzen-
               | ai-9-hx-370/4
        
               | hajile wrote:
               | https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-
               | anal...
               | 
               | They got those kinds of numbers across multiple systems.
               | You can take it up with them I guess.
               | 
               | I didn't even mention one of these systems was peaking at
               | 59w on single-core workloads.
        
               | sm_1024 wrote:
                | I see what's going on, they have two HX370 laptops:
                | 
                |   Laptop   MC score   Avg Power
                |   P16      1213       113 W
                |   S16       921        29 W
                |   M3 Pro   1059       (30 W?)
               | 
                | They don't have M3 Pro power numbers, but I assume it
                | is somewhere around 30W; it seems the S16's HX 370 at
                | ~30 W has power efficiency similar to the M3 Pro.
               | 
                | Any more power and the CPU is much less power
                | efficient: a 300% increase in power for a 30% increase
                | in performance.
        
               | sudosysgen wrote:
               | This is true for every CPU. Past a certain point power
               | consumption scales quadratically with performance.
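                | 
                | Roughly, from the usual CMOS dynamic-power
                | approximation P ~ C * V^2 * f, where voltage has to
                | rise along with clock speed; a toy sketch with made-up
                | constants:
                | 
                |   def dyn_power(volts, freq_ghz, c=1.0):
                |       return c * volts**2 * freq_ghz
                | 
                |   print(dyn_power(1.0, 3.0))  # 3.0 "W" baseline
                |   print(dyn_power(1.3, 4.5))  # ~7.6 "W": 50% more
                |                               # clock, ~2.5x the power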
        
               | moonfern wrote:
                | About Cinebench vs. Geekbench vs. SPEC: https://old.re
                | ddit.com/r/hardware/comments/pitid6/eli5_why_d...
                | That's about Cinebench 20; here's an overview of
                | Cinebench 2024 CPU & GPU (!) scores:
               | https://www.cgdirector.com/cinebench-2024-scores/
        
               | jeswin wrote:
                | Even with the M3 the difference is marginal in multi-
                | threaded benchmarks, from the Cinebench link [1]
                | someone posted earlier in the thread:
                | 
                |   Apple M3 Pro 11-Core   - 394 Points per Watt
                |   AMD Ryzen AI 9 HX 370  - 354 Points per Watt
                |   Apple M3 Max 16-Core   - 306 Points per Watt
                | 
                | And the Ryzen is on TSMC 4nm while the M3 is on 3nm.
                | As the parent is saying, a lot of the Apple Silicon
                | hype was due to the massive upgrade it was over the
                | Intel CPUs Apple was using previously.
               | 
               | [1]: https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-
               | CPU-anal...
        
               | janwas wrote:
                | Cinebench might not be the most relevant benchmark; it
                | uses lots of scalar instructions with fairly high
                | branch mispredictions and low IPC:
               | https://chipsandcheese.com/2021/02/22/analyzing-
               | zen-2s-cineb....
        
           | carstenhag wrote:
           | But how is this the case? I never saw a single article
           | mentioning that a non-Mac laptop was better.
           | 
           | (Random article saying M3 pro is better than a Dell laptop
           | https://www.tomsguide.com/news/macbook-pro-m3-and-m3-max-
           | bat... )
        
             | moonfern wrote:
              | You're right, but... the idea comes from the desktop
              | world. AMD's Zen 4 desktop CPUs, especially the gaming
              | variants like the Ryzen 7 7800X3D, almost match the
              | performance per watt of Apple's M3.
              | 
              | Their laptop CPUs (some companies released the same
              | model with either CPU) were less efficient than Intel's.
              | 
              | But the Asus ProArt P16 (used in the article) did manage
              | an extreme endurance score of 21 hours in the Big Buck
              | Bunny H.264 1080p video test, which runs at 150 cd/m2.
              | With its higher resolution, OLED panel and 10% less
              | battery capacity, that's 40 minutes better than the
              | MacBook Pro 16 M3 Max. In the Wi-Fi test, also run at
              | 150 cd/m2, the M3 ran for 16 hours, the Asus 8. (
              | https://www.notebookcheck.net/Asus-ProArt-P16-laptop-
              | review-... )
              | 
              | For me noise matters: that Asus has a whisper mode which
              | produces 42 dB, as much as an M3 Max under full load.
              | Please be aware that if you're susceptible to PWM, that
              | Asus laptop has issues.
        
           | Filligree wrote:
            | Ryzen mobile is consistently close, yeah. But with the
            | sole exception of the Steam Deck, I've yet to see a laptop
            | carrying Ryzen mobile, Windows included, which comes close
            | to the overall performance of the MacBook.
        
             | talldayo wrote:
             | > But with the sole exception of the Steam deck
             | 
             | Uuh wut? The Steam Deck is like 3-generation-old hardware
             | in mobile Ryzen terms. In a lot of ways it's similar to a
             | pared-back 4800u with fewer (and older) cores, and a
             | slightly bumped up GPU.
             | 
             | To me it's kinda the opposite. Excluding the Steam Deck, I
             | think most of AMD's Ultrabook APUs have been very close to
             | the products Apple's made on the equivalent nodes. Even on
             | 7nm the 4800u put up a competitive fight against M1, and
             | the gap has gotten thinner with each passing year.
             | According to the OpenCL benchmarks, the Radeon 680m on 6nm
             | scores higher than the M1 on 5nm:
             | https://browser.geekbench.com/opencl-benchmarks
             | 
             | Even back when Ryzen Mobile only shipped with Vega, it was
             | pretty clear that Apple and AMD were a pretty close match
             | in onboard GPU power.
        
               | amlib wrote:
                | The Steam Deck might be behind in terms of hardware,
                | but in terms of software it's way beyond your typical
                | x86 Linux system in power efficiency, and dare I say
                | it's doing better than Windows machines with their
                | typically shoddy BIOSes and drivers, especially when
                | you consider all the extraneous services constantly
                | sapping varying amounts of CPU time. All that helps
                | the SD punch well above its weight.
        
               | sudosysgen wrote:
                | My Alienware M15 Ryzen Edition gets 7-8W power
                | consumption just by running "sudo powertop --autotune".
                | Basically all of the power efficiency stuff in the
                | Steam Deck applies to other Ryzen systems and is in
                | the mainline kernel.
        
             | makeitdouble wrote:
             | "overall performance" does a lot of work here. On sheer
             | benchmarks it's really comparable, with AMD being slightly
             | better depending on what you look at. e.g. the M1 vs the
             | 5700U (a similar class widely available mobile CPU):
             | 
             | https://www.cpubenchmark.net/cpu.php?cpu=AMD%20Ryzen%207%20
             | 5...
             | 
             | https://www.cpubenchmark.net/cpu.php?cpu=Apple+M1+8+Core+32
             | 0...
             | 
             | They're not profiled the same, and don't belong in the same
             | ecosystem though, which makes a lot more difference than
             | the CPU themselves. In particular the AMD doesn't get a
             | dedicated compiler optimizing every applications of the
             | system to its strength and weaknesses (the other side of it
             | being the compatibility with the two vastest ecosystem we
             | have now)
        
             | sofixa wrote:
             | Depends on what you mean by "overall performance", but my
              | Asus ROG Zephyrus G14 2023 is full AMD, and outperforms
              | my work-issued, top-of-the-line M1 MacBook Pro from a few
              | months earlier in every task I've done across the two
             | (gaming, compiling, heavy browsing). Battery life is lower
             | under heavy load and high performance on the Zephyrus, but
             | in power saving mode it's roughly comparable, albeit still
             | worse.
        
               | izacus wrote:
                | Same here, my G14 and the M1 MBP are pretty much
                | interchangeable for most workloads. The only time the
                | G14 spins up its fans is when the 4070 turns on... and
                | that's not an option on the M1 at all.
        
           | sandywaffles wrote:
            | I think that's because all the press talks about actual
            | battery life per laptop, and the Apple Silicon laptops
            | ship with literally double the battery capacity of any
            | AMD-based laptop without a discrete GPU. So while the
            | efficiency may be close, the actually perceived battery
            | life of the Mac will be more than double when you also
            | consider the priority Apple puts on power control,
            | combined with a larger overall battery.
        
         | wtallis wrote:
         | The CPU core's instruction set has no influence on how well the
         | chip as a whole manages power when not executing instructions.
        
           | sm_1024 wrote:
            | That is fair. I was taught that decoders for x86 are less
            | efficient and more power hungry than those for RISC ISAs
            | because of x86's variable-length instructions.
           | 
           | I remember being told (and it might be wrong) that ARM can
           | decode multiple instructions in parallel because the CPU
           | knows where the next instruction starts, but for x86, you'd
           | have to decode the instructions in order.
        
             | pohuing wrote:
              | That seems to not matter much nowadays. There's another
              | great (according to my untrained eye) writeup on Chips
              | and Cheese about how little the ISA matters.
             | 
             | https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-
             | doesnt-...
        
               | dzaima wrote:
               | The various mentioned power consumption amounts are 4-10%
               | per-core, or 0.5-6% of package (with the caveat of
               | running with micro-op cache off) for Zen 2, and 3-10% for
               | Haswell. That's not massive, but is still far from what
               | I'd consider insignificant; it could give leeway for an
               | extra core or some improved ALUs; or, even, depending on
               | the benchmark, is the difference between Zen 4 and Zen 5
               | (making the false assumption of a linear relation between
               | power and performance, at least), which'd essentially be
               | a "free" generational improvement. Of course the reality
               | is gonna be more modest than that, but it's not nothing.
        
               | Panzer04 wrote:
               | You missed the part where they mention ARM ends up
               | implementing the same thing to go fast.
               | 
               | The point is processors are either slow and efficient, or
               | fast and inefficient. It's just a tradeoff along the
               | curve.
        
               | dzaima wrote:
               | ARM doesn't need the variable-length instruction decoding
               | though, which on x86 essentially means that the decoder
               | has to attempt to decode at every single byte offset for
               | the start of the pipeline, wasting computation.
               | 
               | Indeed pretty much any architecture can benefit from some
               | form of op cache, but less of a need for it means its
               | size can be reduced (and savings spent in more useful
               | ways), and you'll still need actual decoding at some
               | point anyway (and, depending on the code footprint, may
               | need it a lot).
               | 
               | More generally, throwing silicon at a problem is, quite
               | obviously, a more expensive solution than not having the
               | problem in the first place.
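                | 
                | A toy sketch of the difference (the length_at callback
                | is hypothetical, standing in for real x86 length
                | decoding):
                | 
                |   def starts_fixed(code, width=4):
                |       # ARM64-style: every boundary is known up front,
                |       # so instructions can be decoded in parallel.
                |       return list(range(0, len(code), width))
                | 
                |   def starts_variable(code, length_at):
                |       # x86-style: each start depends on the previous
                |       # instruction's length, so naive decode is
                |       # serial; hardware instead decodes speculatively
                |       # at many offsets or caches boundaries/uops.
                |       out, i = [], 0
                |       while i < len(code):
                |           out.append(i)
                |           i += length_at(code, i)
                |       return out
                | 
                |   print(starts_fixed(bytes(16)))  # [0, 4, 8, 12]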
        
               | XMPPwocky wrote:
               | But bigger fixed-length instructions mean more I$
               | pressure, right?
        
               | dzaima wrote:
               | RISC doesn't imply wasted instruction space; RISC-V has a
               | particularly interesting thing for this - with the
               | compressed ('c') extension you get 16-bit instructions
               | (which you can determine by just checking two bits), but
               | without it you can still save 6% of icache silicon via
               | only storing 30 bits per instruction, the remaining two
               | being always-1 for non-compressed instructions.
               | 
               | Also, x86 isn't even that efficient in its variable-
               | length instructions - some half of them contain the byte
               | 0x0F, representing an "oh no, we're low on single-byte
               | instructions, prefix new things with 0F". On top of that,
               | general-purpose instructions on 64-bit registers have a
               | prefix byte with 4 fixed bits. The VEX prefix (all AVX1/2
               | instructions) has 7 fixed bits. EVEX (all AVX-512
               | instructions) is a full fixed byte.
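                | 
                | The two-bit length check really is that small (a
                | sketch of the base RVC rule, ignoring the reserved
                | longer encodings):
                | 
                |   def rv_insn_len(first_byte):
                |       # low two bits != 0b11 -> 16-bit compressed insn
                |       # low two bits == 0b11 -> standard 32-bit insn
                |       return 2 if (first_byte & 0b11) != 0b11 else 4
                | 
                |   print(rv_insn_len(0x01))  # 2 (a compressed insn)
                |   print(rv_insn_len(0x13))  # 4 (e.g. addi)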
        
               | hajile wrote:
               | https://oscarlab.github.io/papers/instrpop-systor19.pdf
               | 
               | ARM64 instructions are 4 bytes. x86 instructions in real-
               | world code average 4.25 bytes. ARM64 gets closer to x86
               | code size as it adds new instructions to replace common
               | instruction sequences.
               | 
                | RISC-V has 2-byte and 4-byte instructions and averages
                | very close to 3 bytes. Despite this, the original
                | compressed code was only around 15% more dense than x86.
               | The addition of the B (bitwise) extensions and Zcb have
               | increased that advantage by quite a lot. As other
               | extensions get added, I'd expect to see this lead
               | increase over time.
        
               | paulmd wrote:
                | x86-64 wastes enough of its encoding space that arm64
                | binaries are typically smaller in practice. The RISC-V
                | folks pointed this out a decade ago - geomean across
                | their SPEC suite, x86 binaries are 7.3% larger than
                | arm64.
               | 
               | https://people.eecs.berkeley.edu/%7Ekrste/papers/EECS-201
               | 6-1...
               | 
               | So there's another small factor leaning against x86 -
               | inferior code density means they get less out of their
               | icache than ARM64 due to their ISA design (legacy cruft).
               | And ARM64 often has larger icaches anyway - M1 is 6x the
               | icache of zen4 iirc, _and_ they get more out of it with
               | better code density.
               | 
               | <uno-reverse-card.png>
        
               | imtringued wrote:
                | x86 processors simply run an instruction length
                | predictor, the same way they do branch prediction.
                | That turns the problem into something that can be
                | tuned. Instead of having to decode the instruction at
                | every byte offset, you can simply decide to optimize
                | for the 99% case with a slow path for rare
                | combinations.
        
               | dzaima wrote:
               | That's still silicon spent on a problem that can be
               | architecturally avoided.
        
               | hajile wrote:
               | That stuff is WAY out-of-date and was flatly wrong when
               | it was published.
               | 
               | A715 cut decoder size a whopping 75% by dropping the more
               | CISC 32-bit stuff and completely eliminated the uop cache
               | too. Losing all that decode, cache, and cache controllers
               | means a big reduction in power consumption (decoders are
               | basically always on). All of ARM's latest CPU designs
               | have eliminated uop cache for this same reason.
               | 
               | At the time of publication, we already knew that M1
               | (already out for nearly a year) was the highest IPC chip
               | ever made and did not use a uop cache.
        
               | hajile wrote:
               | Clam makes some serious technical mistakes in that
               | article and some info is outdated.
               | 
               | 1. His claim that "ARM decoder is complex too" was wrong
               | at the time (M1 being an obvious example) and has been
               | proven more wrong since publication. ARM dropped the uop
               | cache as soon as they dropped support for their very
               | CISC-y 32-bit catastrophe. They bragged that this
               | coincided with a whopping 75% reduction in decoder size
               | for their A715 (while INCREASING from 4 decoders to 5)
               | and this was almost single-handedly responsible for the
               | reduced power consumption of that chip (as all the other
               | changes were comparatively minor). NONE of the current-
               | gen cores from ARM, Apple, or Qualcomm use uop cache
               | eliminating these power-hungry cache and cache
               | controllers.
               | 
               | 2. The paper[0] he quotes has a stupid conclusion. They
               | show integer workloads using a massive 22% of total core
               | power on the decoder and even their fake float workload
               | showed 8% of total core power. Realize that a study[1] of
               | the entire Ubuntu package repo showed that just 12
               | int/ALU instructions made up 89% of all code with
               | float/SIMD being in the very low single digits of use.
               | 
               | 3. x86 decoder situation has gotten worse. Because adding
               | extra decoders is exponentially complex, they decided to
               | spend massive amounts of transistors on multiple decoder
               | blocks working on various speculated branches. Setting
               | aside that this penalizes unrolled code (where they may
               | have just 3-4 decoders while modern ARM will have 10+
               | decoders), the setup for this is incredibly complex and
               | man-year intensive.
               | 
               | 4. "ARM decodes into uops too" is a false equivalency.
               | The uops used by ARM are extremely close to the original
               | instructions as shown by them being able to easily
               | eliminate the uop cache. x86 has a much harder job here
               | mapping a small set of instructions onto a large set.
               | 
               | 5. "ARM is bloated too". ARM redid their entire ISA to
               | eliminate bloat. If ISA didn't actually matter, why would
               | they do this?
               | 
               | 6. "RISC-V will become bloated too" is an appeal to
               | ignorance. x86 has SEVENTEEN major SIMD extensions
               | excluding the dozen or so AVX-512 extensions all with
               | various incompatibilities and issues. This is because
               | nobody knew what SIMD should look like. We know now and
               | RISC-V won't be making that mistake. x86 has useless
               | stuff like BCD instructions using up precious small
               | instruction space because they didn't know. RISC-V won't
               | do this either. With 50+ years of figuring the basics
               | out, RISC-V won't be making any major mistakes on the
               | most important stuff.
               | 
               | 7. Omitting complexity. A bloated, ancient codebase takes
               | forever to do anything with. A bloated, ancient ISA takes
               | forever to do anything with. If ARM and Intel both put X
               | dollars into a new CPU design, Intel is going to spend
               | 20-30% or maybe even more of their budget on devs
                | spending time chasing edge cases and testers to test
                | all those edge cases. Meanwhile, ARM is going to spend
               | 20-30% of their budget on increasing performance. All
               | other things equal, the ARM chip will be better at any
               | given design price point.
               | 
               | 8. Compilers matter. Spitting out fast x86 code is
               | incredibly hard because there are so many variations on
               | how to do things each with their own tradeoffs (that
               | conflate in weird ways with the tradeoffs of nearby
               | instructions). We do peephole heuristic optimizations
               | because provably fast would take centuries. RISC-V and
               | ARM both make it far easier for compiler writers because
               | there's usually just one option rather than many options
               | and that one option is going to be fast.
               | 
               | [0] https://www.usenix.org/system/files/conference/cooldc
               | 16/cool...
               | 
               | [1] https://oscarlab.github.io/papers/instrpop-
               | systor19.pdf
        
               | dzaima wrote:
               | Some notes:
               | 
               | 3: I don't think more decoders should be exponentially
               | more complex, or even polynomial; I think O(n log n)
               | should suffice. It just has a hilarious constant factor
               | due to the lookup tables and logic needed, and that log
               | factor also impacts the critical path length, i.e.
               | pipeline length, i.e. mispredict penalty. Of note is that
               | x86's variable-length instructions aren't even
               | particularly good at code size.
               | 
               | Golden Cove (~1y after M1) has 6-wide decode, which is
               | probably reasonably near M1's 8-wide given x86's complex
               | instructions (mainly free single-use loads). [EDIT:
               | actually, no, chipsandcheese's diagram shows it only
               | moving 6 micro-ops per cycle to reorder buffer, even out
               | of the micro-op cache. Despite having 8/cycle retire.
               | Weird.]
               | 
               | 6: The count of extensions is a very bad way to measure
               | things; RISC-V will beat everything in that in no time,
               | if not already. The main things that matter are <=SSE4.2
               | (uses same instruction encoding as scalar code); AVX1/2
               | (VEX prefix); and AVX-512 (EVEX). The actual instruction
               | opcodes are shared across those. But three encoding modes
               | (plus the three different lengths of the legacy encoding)
               | is still bad (and APX adds another two onto this) and the
               | SSE-to-AVX transition thing is sad.
               | 
               | RISC-V already has two completely separate solutions for
               | SIMD - v (aka RVV, i.e. the interesting scalable one) and
               | p (a simpler thing that works in GPRs; largely not being
               | worked on but there's still some activity). And if one
               | wants to count extensions, there are already a dozen for
               | RVV (never mind its embedded subsets) - Zvfh, Zvfhmin,
               | Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned,
               | Zvknhb, Zvknha, Zvksed, Zvksh; though, granted, those
               | work better together than, say, SSE and AVX (but on x86
               | there's no reason to mix them anyway).
               | 
               | And RVV might get multiple instruction encoding forms too
               | - the current 32-bit one is forced into allowing using
               | only one register for masking due to lack of encoding
               | space, and a potential 48-bit and/or 64-bit instruction
               | encoding extension has been discussed quite a bit.
               | 
               | 8: RISC-V RVV can be pretty problematic for some things
               | if compiling without a specific target architecture, as
               | the scalability means that different implementations can
               | have good reason to have wildly different relative
               | instruction performance (perhaps most significant being
               | in-register gather (aka shuffle) vs arithmetic vs indexed
               | load from memory).
        
               | hajile wrote:
               | 3. You can look up the papers released in the late 90s on
               | the topic. If it was O(n log n), going bigger than 4 full
               | decoders would be pretty easy.
               | 
               | 6. Not all of those SIMD sets are compatible with each
               | other. Some (eg, SSE4a) wound up casualties of the Intel
               | v AMD war. It's so bad that the Intel AVX10 proposal is
               | mostly about trying to unify their latest stuff into
               | something more cohesive. If you try to code this stuff by
               | hand, it's an absolute mess.
               | 
               | The P proposal is basically DOA. It could happen, but
               | nobody's interested at this point. Just like the B
               | proposal subsumed a bunch of ridiculously small
               | extensions, I expect a new V proposal to simply unify
               | these. As you point out, there isn't really any conflict
               | between these tiny instruction releases.
               | 
               | There is discussion around the 48-bit format (the bits
               | have been reserved for years now), but there are a couple
               | different proposals (personally, I think 64-bit only with
               | the ability to put multiple instructions inside is
               | better, but that's another topic). Most likely, a 48-bit
               | format does NOT do multiple encoding, but instead does a
               | superset of encodings (just like how every 16-bit
               | instruction expands into a 32-bit instruction). They
               | need/want 48-bits to allow 4-address instructions too, so
               | I'd imagine it's coming sooner or later.
               | 
               | Either way, the length encoding is easy to work with
               | compared to x86 where you must check half the bits in
               | half the bytes before you can be sure about how long your
               | instruction really is.
               | 
               | 8. There could be some variance, but x86 has this issue
               | too and SO many more besides.
        
               | dzaima wrote:
               | I know the E-cores (gracemont, crestmont, skymont) have
               | the multi-decoder setup; the first couple search results
               | don't show Golden Cove being the same. Do you have some
               | reference for that?
               | 
               | 6. Ah yeah the funky SSE4a thing. RISC-V has its own
               | similar but worse thing with RVV0.7.1 / xtheadvector
               | already though, and it can be basically guaranteed that
               | there will be tons of one-off vendor extensions,
               | including vector ones, given that anyone can make such.
               | 
               | 8. RVV's vrgather is extremely bad at this, but is very
               | important for a bunch of non-trivial things; existing
               | RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes
               | 256 cycles for LMUL=8[1]. But some hypothetical future
               | hardware could do it at O(LMUL) for non-worst-case
               | indices, thus massively changing tradeoffs. So far the
               | compiler approaches are to just not do high LMUL when
               | vrgather is needed (potentially leaving free perf on the
               | table), or using indexed loads (potentially significantly
               | worse).
               | 
               | Whereas x86 and ARM SIMD perf variance is very tiny;
               | basically everything is pretty proportional everywhere,
               | with maybe the exception of very old atom cores. There'll
               | be some differences of 2x up or down of throughput of
               | instruction classes, but it's generally not so bad as to
               | make way for alternative approaches to be better.
               | 
               | [1]: https://camel-cdr.github.io/rvv-bench-
               | results/bpi_f3/index.h...
        
               | hajile wrote:
               | I think you may be correct about gracemont v golden cove.
               | Rumors/insiders say that Intel has supposedly decided to
               | kill off either the P or E-core team, so I'd guess that
                | the P-core team is getting laid off because the E-core
               | IPC is basically the same, but the E-core is massively
               | more efficient. Even if the P-core wins, I'd expect them
               | to adopt the 3x3 decoder just as AMD adopted a 2x4
               | decoder for zen5.
               | 
               | Using a non-frozen spec is at your own risk. There's
               | nothing comparable to stuff like SSE4a or FMA4. The
               | custom extension issue is vastly overstated. Anybody can
               | make extensions, but nobody will use unratified
               | extensions unless you are in a very niche industry. The P
               | extension is a good example here. The current proposal is
               | a copy/paste of a proprietary extension a company is
               | using. There may be people in their niche using their
               | extension, but I don't see people jumping to add support
               | anywhere (outside their own engineers).
               | 
               | There's a LOT to unpack about RVV. Packed SIMD doesn't
               | even have LMUL>1, so the comparison here is that you are
               | usually the same as Packed SIMD, but can sometimes be
               | better which isn't a terrible place to be.
               | 
               | Differing performance across different performance levels
               | is to be expected when RVV must scale from tiny DSPs up
               | to supercomputers. As you point out, old atom cores
               | (about the same as the Spacemit CPU) would have a
               | different performance profile from a larger core. Even
               | larger AMD cores have different performance
                | characteristics with their tendency to double-pump
                | AVX2/512 instructions (but not all of them --
               | just some).
               | 
               | In any case, it's a matter of the wrong configuration
               | unlike x86 where it is a matter of the wrong instruction
               | (and the wrong configuration at times). It seems obvious
               | to me that the compiler will ultimately need to generate
               | a handful of different code variants (shouldn't be a code
               | bloat issue because only a tiny fraction of all code is
                | SIMD) then dynamically choose the best variant for the
               | processor at runtime.
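                | 
                | That dispatch could look something like this (all
                | names hypothetical; a sketch of the idea, not an
                | existing compiler feature):
                | 
                |   def kernel_gather_heavy(xs):  # assumes fast vrgather
                |       return sorted(xs)         # stand-in body
                | 
                |   def kernel_gather_light(xs):  # avoids vrgather
                |       return sorted(xs)         # stand-in body
                | 
                |   def select_kernel(fast_gather):
                |       # probed once at startup, e.g. by timing a tiny
                |       # vrgather microbenchmark on the running CPU
                |       return (kernel_gather_heavy if fast_gather
                |               else kernel_gather_light)
                | 
                |   kernel = select_kernel(fast_gather=False)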
        
               | dzaima wrote:
               | > Packed SIMD doesn't even have LMUL>1, so the comparison
               | here is that you are usually the same as Packed SIMD, but
               | can sometimes be better which isn't a terrible place to
               | be.
               | 
               | Packed SIMD not having LMUL means that hardware can't
               | rely on it being used for high performance; whereas some
               | of the theadvector hardware (which could equally apply to
               | rvv1.0) already had VLEN=128 with 256-bit ALUs, thus
               | having LMUL=2 have twice the throughput of LMUL=1. And
               | even above LMUL=2 various benchmarks have shown
               | improvements.
               | 
               | Having a compiler output multiple versions is an
               | interesting idea. Pretty sure it won't happen though;
               | it'd be a rather difficult political mess of more and
               | more "please add special-casing of my hardware", and
               | would have the problem of it ceasing to reasonably
               | function on hardware released after being compiled
               | (unless like glibc or something gets some standard set of
               | hardware performance properties that can be updated
               | independently of precompiled software, which'd be extra
               | hard to get through). Also P-cores vs E-cores would add
               | an extra layer of mess. There might be some simpler
               | version of just going by VLEN, which is always constant,
               | but I don't see much use in that really.
        
               | janwas wrote:
               | > it's a matter of the wrong configuration unlike x86
               | where it is a matter of the wrong instruction
               | 
               | +1 to dzaima's mention of vrgather. The lack of fixed-
               | pattern shuffle instructions in RVV is absolutely a
               | wrong-instruction issue.
               | 
               | I agree with your point that multiple code variants +
               | runtime dispatch are helpful. We do this with Highway in
               | particular for x86. Users only write code once with
               | portable intrinsics, and the mess of instruction
               | selection is taken care of.
        
               | camel-cdr wrote:
               | > +1 to dzaima's mention of vrgather. The lack of fixed-
               | pattern shuffle instructions in RVV is absolutely a
               | wrong-instruction issue.
               | 
               | What others would you want? Something like vzip1/2 would
                | make sense, but that isn't much of a permutation, since
                | the input elements are exactly next to the output
                | elements.
        
               | janwas wrote:
               | Going through Highway's set of shuffle ops:
               | 
               | 64-bit OddEven/Reverse2/ConcatOdd/ConcatEven,
               | OddEvenBlocks, SwapAdjacentBlocks, 8-bit Reverse,
               | CombineShiftRightBytes, TableLookupBytesOr0 (=PSHUFB) and
               | Broadcast especially for 8-bit, TwoTablesLookupLanes,
               | InsertBlock, InterleaveLower/InterleaveUpper (=vzip1/2).
               | 
               | All of these are considerably more expensive on RVV. SVE
               | has a nice set, despite also being VL-agnostic.
        
               | dzaima wrote:
               | More RVV questionable optimization cases:
               | 
                | - broadcasting a loaded value: a stride-0 load can be
                | used for this, and could be faster than going through a
                | GPR load & vmv.v.x, but could also be much slower (both
                | options are sketched at the end of this comment).
               | 
               | - reversing: could use vrgather (could do high LMUL
               | everywhere and split into multiple LMUL=1 vrgathers),
               | could use a stride -1 load or store.
               | 
               | - early-exit loops: It's feasible to vectorize such, even
               | with loads via fault-only-first. But if vl=vlmax is used
               | for it, it might end up doing a ton of unnecessary
               | computation, esp. on high-VLEN hardware. Though there's
               | the "fun" solution of hardware intentionally lowering vl
               | on fault-onlt-first to what it considers reasonable as
               | there aren't strict requirements for it.
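                | 
                | The two broadcast options from the first bullet, as v1.0 C
                | intrinsics (a sketch; which one wins is implementation-
                | dependent):
                | 
                |   #include <riscv_vector.h>
                |   #include <stddef.h>
                |   #include <stdint.h>
                | 
                |   vint32m1_t bcast_stride0(const int32_t *p, size_t vl) {
                |       /* stride-0 vector load: every lane reads *p */
                |       return __riscv_vlse32_v_i32m1(p, 0, vl);
                |   }
                |   vint32m1_t bcast_gpr(const int32_t *p, size_t vl) {
                |       /* scalar load, then splat via vmv.v.x */
                |       return __riscv_vmv_v_x_i32m1(*p, vl);
                |   }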
        
               | atq2119 wrote:
               | The trend seems to be going towards multiple decoder
               | complexes. Recent designs from AMD and Intel do this.
               | 
               | It makes sense to me: if the distance between branches is
               | small, a 10-wide decode may be wasted anyway. Better to
                | decode multiple basic blocks in parallel.
        
               | dzaima wrote:
                | Expanding on 3: I think it ends up at O(n^2 * log n)
                | transistors, O(log n) critical path (not sure about
                | routing or what fan-out issues there might be).
               | 
               | Basically: determine end of instruction at each byte
               | (trivial but expensive). Determine end of two
               | instructions at each byte via end2[i]=end[end[i]]. Then
               | end4[i]=end2[end2[i]], etc, log times.
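                | 
                | In scalar C the scheme looks like this (a sketch; hardware
                | would do each round as one layer of muxes, not a loop):
                | 
                |   enum { N = 32 }; /* bytes fetched per cycle */
                |   /* trivial-but-expensive per-byte length decode */
                |   extern int inst_len(const unsigned char *p);
                | 
                |   void boundaries(const unsigned char *buf,
                |                   unsigned char end4[N + 1]) {
                |       unsigned char end1[N + 1], end2[N + 1];
                |       end1[N] = N; /* saturating sentinel at block edge */
                |       for (int i = 0; i < N; i++) {
                |           int e = i + inst_len(buf + i);
                |           end1[i] = e > N ? N : e;
                |       }
                |       /* endK[i] = offset just past K instructions from i */
                |       for (int i = 0; i <= N; i++) end2[i] = end1[end1[i]];
                |       for (int i = 0; i <= N; i++) end4[i] = end2[end2[i]];
                |       /* ...log2(N) rounds total: 8, 16, ... at a time */
                |   }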
               | 
               | That's essentially log(n) shuffles. With 32-byte/cycle
                | decode that's roughly five 'vpermb ymm's, which is rather
               | expensive (though various forms of shortcuts should exist
               | - for the larger layers direct chasing is probably
               | feasible, and for the smaller ones some special-casing of
               | single-byte instructions could work).
               | 
                | And, actually, given the mention of O(log n)-transistor
                | shuffles at
                | http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...,
                | it might even just be O(n * log^2(n)) transistors.
               | 
                | Importantly, x86 itself plays no part in the non-trivial
                | part. It applies equally to the RISC-V compressed
                | extension, just with a smaller constant.
        
               | hajile wrote:
                | Determining the end of a RISC-V instruction requires
                | checking two bits, plus the knowledge that no instruction
                | exceeds 4 bytes or uses fewer than 2 bytes.
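                | 
                | The whole RISC-V length check fits in a line of C
                | (assuming only the 16- and 32-bit encodings, which is all
                | that's ratified today):
                | 
                |   /* bits [1:0] == 11 marks a 32-bit instruction;
                |      anything else is a 16-bit compressed one */
                |   static int rv_inst_len(const unsigned char *p) {
                |       return (p[0] & 0x03) == 0x03 ? 4 : 2;
                |   }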
               | 
                | x86 requires checking for a REX, REX2, VEX, EVEX, etc.
                | prefix. Then you must check for 1 or 2 opcode bytes. Then
                | you must check for the existence of a ModRM byte, how
                | many immediate bytes there are, and whether a scaled
                | index (SIB) byte is used. Then, if a ModRM byte exists,
                | you must check it for any displacement bytes to get your
                | final instruction length.
               | 
               | RISC-V starts with a small complexity then multiplies it
               | by a small amount. x86 starts with a high complexity then
               | multiplies it by a big amount. The real world difference
               | here is large.
               | 
               | As I pointed out elsewhere ARM's A715 dropped support for
               | aarch32 (which is still far easier to decode than x86)
               | and cut decoder size by 75% while increasing raw decoder
               | count by 20%. The decoder penalties of bad ISA design
               | extend beyond finding instruction boundaries.
        
               | dzaima wrote:
               | I don't disagree that the real-world difference is
               | massive; that much is pretty clear. I'm just pointing out
               | that, as far as I can tell, it's all just a question of a
               | constant factor, it's just massive. I've written half of
               | a basic x86 decoder in regular imperative code, handling
               | just the baseline general-purpose legacy encoding
               | instructions (determines length correctly, and determines
               | opcode & operand values to some extent), and that was
                | already a lot of work.
        
               | clamchowder wrote:
                | Some notes: 1. Consider that M1's 8-wide decoder hasn't
                | hit the 5+ GHz clock speeds that Intel Golden Cove's
                | decoder can. More complex logic with more delays is
                | harder to clock up. Of course M1 may be held back by
                | another critical path, but it's interesting that no one
                | has managed to get an 8-wide Arm decoder running at the
                | clock speeds that Zen 3/4 and Golden Cove can.
               | 
               | A715's slides say the L1 icache gains uop cache features
               | including caching fusion cases. Likely it's a predecode
               | scheme much like AMD K10, just more aggressive with
               | what's in the predecode stage. Arm has been doing
               | predecode (moving some stages to the L1i fill path rather
               | than the hotter L1i hit path) to mitigate decode costs
               | for a long time. Mitigating decode costs _again_ with a
               | uop cache never made much sense especially considering
               | their low clock speeds. Picking one solution or the other
                | is a good move, as Intel/AMD have done. Arm picked
               | predecode for A715.
               | 
               | 2. The paper does not say 22% of core power is in the
               | decoders. It does say core power is ~22% of package
               | power. Wrong figure? Also, can you determine if the
               | decoder power situation is different on Arm cores? I
               | haven't seen any studies on that.
               | 
                | 3. Multiple decoder blocks don't carry a penalty once
                | the load balancing is done right, which Gracemont did.
                | And you have to _massively_ unroll a loop to screw up
                | Tremont anyway. Conversely, decode blocks may
               | lose less throughput with branchy code. Consider that
               | decode slots after a taken branch are wasted, and
               | clustered decode gets around that. Intel stated they
               | preferred 3x3 over 2x4 for that reason.
               | 
               | 4. "uops used by ARM are extremely close to the original
               | instructions" It's the same on x86, micro-op count is
               | nearly equal to instruction count. It's helpful to gather
               | data to substantiate your conclusions. For example, on
               | Zen 4 and libx264 video encoding, there's ~4.7% more
               | micro-ops than instructions. Neoverse V2 retires ~19.3%
               | more micro-ops than instructions in the same workload.
               | Ofc it varies by workload. It's even possible to get
               | negative micro-op expansion on both architectures if you
               | hit branch fusion cases enough.
               | 
               | 8. You also have to tell your ARM compiler which of the
               | dozen or so ISA extension levels you want to target (see 
                | https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#inde...).
                | It's not one option by any means. Not sure what
               | you mean by "peephole heuristic optimizations", but
               | people certainly micro-optimize for both arm and x86. For
               | arm, see
               | https://github.com/dotnet/runtime/pull/106191/files as an
               | example. Of course optimizations will vary for different
               | ISAs and microarchitectures. x86 is more widely used in
               | performance critical applications and so there's been
               | more research on optimizing for x86 architectures, but
               | that doesn't mean Arm's cores won't benefit from similar
               | optimization attention should they be pressed into a
               | performance critical role.
        
               | hajile wrote:
               | 1. Why would you WANT to hit 5+GHz when the downsides of
               | exponential power take over? High clocks aren't a feature
               | -- they are a cope.
               | 
               | AMD/Intel maintain I-cache and maintain a uop cache kept
               | in sync. Using a tiny part to pre-decode is different
               | from a massive uop cache working as far in advance as
               | possible in the hopes that your loops will keep you busy
               | enough that your tiny 4-wide decoder doesn't become
               | overwhelmed.
               | 
               | 2. The float workload was always BS because you can't run
               | nothing but floats. The integer workload had 22.1w total
               | core power and 4.8w power for the decoder. 4.8/22.1 is
               | 21.7%. Even the 1.8w float case is 8% of total core
               | power. The only other argument would be that the study is
               | wrong and 4.8w isn't actually just decoder power.
               | 
               | 3. We're talking about worst cases here. Nothing stops
               | ARM cores from creating a "work pool" of upcoming
               | branches in priority order for them to decode if they run
               | out of stuff on the main branch. This is the best of both
               | worlds where you can be faster on the main branch AND
               | still do the same branchy code trick too.
               | 
               | 4. This is the tail wagging the dog (and something else
               | if your numbers are correct). Complex x86 instructions
               | have garbage performance, so they are avoided by the
               | compiler. The problem is that you can't GUARANTEE those
               | instructions will NEVER be used, so the mere specter of
               | them forces complex algorithms all over the place where
               | ARM can do more simple things.
               | 
               | In any case, your numbers raise a VERY interesting
               | question about x86 being RISC under the hood.
               | 
               | Consider this. Say that we have 1024 bytes of ARM code
               | (256 instructions). x86 is around 15% smaller (871.25
               | bytes) and with the longer 4.25 byte instruction average,
               | x86 should have around 205 instructions. If ARM is
               | generating 19.3% more uops than instructions, we have
                | about 305 uops. x86 with just 4.7% more has 215 uops
                | (the difference is way outside any margin of error).
               | 
               | If both are doing the same work, x86 uops must be in the
               | range of 30% more complex. Given the limits of what an
               | ALU can accomplish, we can say with certainty that x86
               | uops are doing SOMETHING that isn't the RISC they claim
               | to be doing. Perhaps one could claim that x86 is doing
               | some more sophisticated instructions in hardware, but
               | that's a claim that would need to be substantiated (I
               | don't know what ISA instructions you have that give a 15%
               | advantage being done in hardware, but aren't already in
               | the ARM ISA and I don't see ARM refusing to add circuitry
               | for current instructions to the ALU if it could reduce
               | uops by 15% either).
               | 
               | 8. https://en.wikipedia.org/wiki/Peephole_optimization
               | 
                | The final optimization stage is basically heuristic find
                | & replace. There could in theory be a mathematically
                | provable "best instruction selection", but finding it
                | would require trying every possible combination, which
                | isn't feasible as long as P != NP holds true.
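                | 
                | A toy flavor of that find & replace (these three are
                | classic x86 rules; real passes match on compiler IR, not
                | strings):
                | 
                |   typedef struct { const char *match, *replace; } PeepRule;
                | 
                |   static const PeepRule rules[] = {
                |       { "mov r, 0",  "xor r, r"  }, /* shorter, breaks dep */
                |       { "imul r, 2", "add r, r"  }, /* strength reduction */
                |       { "cmp r, 0",  "test r, r" }, /* smaller encoding */
                |   };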
               | 
               | My favorite absurdity of x86 (though hardly the only one)
               | is padding. You want to align function calls at cacheline
               | boundaries, but that means padding the previous cache
               | line with NOPs. Those NOPs translate into uops though.
               | Instead, you take your basic, short instruction and pad
               | it with useless bytes. Add a couple useless bytes to a
               | bunch of instructions and you now have the right length
               | to push the function over to the cache boundary without
               | adding any NOPs.
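                | 
                | For illustration, the same mov at three lengths via
                | redundant segment prefixes (architecturally ignored
                | here); this is what assemblers emit for alignment padding
                | instead of standalone NOPs:
                | 
                |   /* 89 D8         mov eax, ebx  (2 bytes)          */
                |   /* 3E 89 D8      ds-prefixed, same operation      */
                |   /* 3E 3E 89 D8   two redundant prefixes, 4 bytes  */
                |   static const unsigned char mov2[] = { 0x89, 0xD8 };
                |   static const unsigned char mov3[] = { 0x3E, 0x89, 0xD8 };
                |   static const unsigned char mov4[] = { 0x3E, 0x3E, 0x89, 0xD8 };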
               | 
               | But the issues go deeper. When do you use a REX prefix?
               | You may want it so you can use 16 registers, but it also
                | increases code size. APX is going to increase this issue
                | further: you must juggle when to use 8, 16, or 32
                | registers and when to prefer the longer REX2 and extended
                | EVEX encodings, the latter adding 3-register forms. All
                | kinds of
               | weird tradeoffs exist throughout the system. Because the
               | compilers optimize for the CPU and the CPU optimizes for
               | the compiler, you can wind up in very weird places.
               | 
               | In an ISA like ARM, there isn't any code density
               | weirdness to consider. In fact, there's very little
               | weirdness at all. Write it the intuitive way and you're
               | pretty much guaranteed to get good performance. Total
               | time to work on the compiler is a zero-sum game given the
               | limited number of experts. If you have to deal with these
               | kinds of heuristic headaches, there's something else you
               | can't be working on.
        
               | clamchowder wrote:
               | 1. Performance. Also Arm implemented instruction cache
               | coherency too.
               | 
               | Predecode/uop cache are both means to the same end,
               | mitigating decode power. AMD and Intel have used both
               | (though not on the same core). Arm has used both,
               | including both on the same core for quite a few
               | generations.
               | 
               | And a uop cache is just a cache. It's also big enough on
               | current generations to cache more than just loops, to the
               | point where it covers a majority of the instruction
               | stream. Not sure where the misunderstanding of the uop
               | cache "working as far in advance is possible" comes from.
               | Unless you're talking about the BPU running ahead and
               | prefetching into it? Which it does for L1i, and L2 as
               | well?
               | 
               | 2. "you can't run nothing but floats" they didn't do that
               | in the paper, they did D += A[j] + B[j] * C[j]. Something
               | like matrix multiplication comes to mind, and that's not
               | exactly a rare workload considering some ML stuff these
               | days.
               | 
               | But also, has a study been done on Arm cores? For all we
               | know they could spend similar power budgets on decode, or
               | more. I could say an Arm core uses 99% of its power
               | budget on decode, and be just as right as you are (they
               | probably don't, my point is you don't have concrete data
               | on both Arm and x86 decode power, which would be
                | necessary for a productive discussion on the subject).
               | 
               | 3. You're describing letting the BPU run ahead, which
               | everyone has been doing for the past 15 years or so.
               | Losing fetch bandwidth past a taken branch is a different
               | thing.
               | 
               | 4. Not sure where you're going. You started by suggesting
               | Arm has less micro-op expansion than x86, and I provided
               | a counterexample. Now you're talking about avoiding
               | complex instructions, which a) compilers do on both
               | architectures, they'll avoid stuff like division, and b)
               | humans don't in cases where complex instructions are
                | beneficial, see Linux kernel using rep movsb
                | (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...),
                | and Arm introducing similar complex instructions
                | (https://community.arm.com/arm-community-blogs/b/architecture...)
               | 
               | Also "complex" x86 instructions aren't avoided in the
               | video encoding workload. On x86 it takes ~16.5T
               | instructions to finish the workload, and ~19.9T on Arm
               | (and ~23.8T micro-ops on Neoverse V2). If "complex" means
               | more work per instruction, then x86 used more complex
               | instructions, right?
               | 
               | 8. You can use a variable length NOP on x86, or multiple
               | NOPs on Arm to align function calls to cacheline
               | boundaries. What's the difference? Isn't the latter worse
               | if you need to move by more than 4 bytes, since you have
               | multiple NOPs (and thus multiple uops, which you think is
               | the case but isn't always true, as some x86 and some Arm
                | CPUs can fuse NOP pairs)?
               | 
               | But seriously, do try gathering some data to see if
               | cacheline alignment matters. A lot of x86/Arm cores that
               | do micro-op caching don't seem to care if a function (or
               | branch target) is aligned to the start of a cacheline.
               | Golden Cove's return predictor does appear to track
               | targets at cacheline granularity, but that's a special
               | case. Earlier Intel and pretty much all AMD cores don't
               | seem to care, nor do the Arm ones I've tested.
               | 
               | Anyway, you're making a lot of unsubstantiated guesses on
               | "weirdness" without anything to suggest it has any
               | effect. I don't think this is the right approach. Instead
               | of "tail wagging the dog" or whatever, I suggest a data-
               | based approach where you conduct experiments on some
               | x86/Arm CPUs, and analyze some x86/Arm programs. I guess
               | the analogy is, tell the dog to do something and see how
               | it behaves? Then draw conclusions off that?
        
               | hajile wrote:
               | 1. The biggest chip market is laptops and getting 15%
               | better performance for 80% more power (like we saw with X
               | Elite recently) isn't worth doing outside the marketing
               | win of a halo product (a big reason why almost everyone
               | is using slower X Elite variants). The most profitable
               | (per-chip) market is servers. They also prefer lower
               | clocks and better perf/watt because even with the high
               | chip costs, the energy will wind up costing them more
               | over the chip's lifespan. There's also a real cost to
               | adding extra pipeline stages. Tejas/Jayhawk cores are
               | Intel's cancelled examples of this.
               | 
               | L1 cache is "free" in that you can fill it with simple
               | data moves. uop cache requires actual work to decode and
               | store elements for use in addition to moving the data. As
               | to working ahead, you already covered this yourself. If
               | you have a nearly 1-to-1 instruction-to-uop ratio, having
                | just 4 decoders (eg, zen4) is a problem because you can
                | execute a lot more than just 4 instructions on the
                | backend. Zen4's 6-wide dispatch means you can consume 50%
                | more instructions than you decode per clock. You make up
                | for this in loops,
               | but that means while you're executing your current loop,
               | you must be maxing out the decoders to speculatively fill
               | the rest of the uop cache before the loop finishes. If
               | the loop finishes and you don't have the next bunch of
               | instructions decoded, you have a multi-cycle delay coming
               | down the pipeline.
               | 
               | 2. I'd LOVE to see a similar study of current ARM chips,
               | but I think the answer here is pretty simple to deduce.
               | ARM's slide says "4x smaller decoders vs A710" despite
               | adding a 5th decoder. They claim 20% reduction in power
               | at the same performance and the biggest change is the
               | decoder. As x86 decode is absolutely more complex than
               | aarch32, we can only deduce that switching from x86 to
               | aarch64 would be an even more massive reduction. If we
                | assume an identical 75% reduction in decoder power, we'd
                | move the Haswell decoder from 4.8w down to 1.2w, reducing
                | total core power from 22.1w to 18.5w, a ~16% overall
                | reduction. This isn't too far from the power numbers
                | claimed by ARM.
               | 
               | 4. This was a tangent. I was talking about uops rather
               | than the ISA. Intel claims to be simple RISC internally
               | just like ARM, but if Intel is using nearly 30% fewer
               | uops to do the same work, their "RISC" backend is way
               | more complex than they're admitting.
               | 
               | 8. I believe aligning functions to cacheline boundaries
               | is a default flag at higher optimization levels. I'm
               | pretty sure that they did the analysis before enabling
               | this by default. x86 NOP flexibility is superior to ARM
               | (as is its ability to avoid them entirely), but the cause
               | is the weirdness of the x86 ISA and I think it's an
               | overall net negative.
               | 
               | Loads of x86 instructions are microcode only. Use one and
               | it'll be thousands of cycles. They remain in microcode
               | because nobody uses them, so why even try to optimize and
               | they aren't used because they are dog slow. How would you
               | collect data about this? Nothing will ever change unless
               | someone pours in millions of dollars in man-hours into
               | attempting to speed it up, but why would anyone want to
               | do that?
               | 
                | Optimizing for a local maximum rather than a global maximum
               | happens all over technology and it happens exactly
               | because of the data-driven approach you are talking
               | about. Look for the hot code and optimize it without
               | regard that there may be a better architecture you could
               | be using instead. Many successes relied on an intuitive
               | hunch.
               | 
               | ISA history has a ton of examples. iAPX432 super-CISC,
               | the RISC movement, branch delay slots, register windows,
               | EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All
               | of these were attempts to find new maxima with greater or
               | lesser degrees of success. When you look into these,
               | pretty much NONE of them had any real data to drive them
               | because there wasn't any data until they'd actually
               | started work.
        
               | clamchowder wrote:
               | 1. Yeah I agree, both X Elite and many Intel/AMD chips
               | clock well past their efficiency sweet spot at stock.
               | There is a cost to extra pipeline stages, but no one is
               | designing anything like Tejas/Jayhawk, or even earlier P4
               | variants these days. Also P4 had worse problems (like not
               | being able to cancel bogus ops until retirement) than
               | just a long pipeline.
               | 
               | Arm's predecoded L1i cache is not "free" and can't be
               | filled with simple data moves. You need predecode logic
               | to translate raw instruction bytes into an intermediate
               | format. If Arm expanded predecode to handle fusion cases
               | in A715, that predecode logic is likely more complex than
                | in prior generations.
               | 
               | 2. Size/area is different from power consumption. Also
               | the decoder is far from the only change. The BTBs were
               | changed from 2 to 3 level, and that can help efficiency
               | (could make a smaller L2 BTB with similar latency, while
               | a slower third level keeps capacity up). TLBs are bigger,
               | probably reducing page walks. Remember page walks are
               | memory accesses and the paper earlier showed data
               | transfers count for a large percentage of dynamic power.
               | 
               | 4. IMO no one is really RISC or CISC these days
               | 
               | 8. Sure you can align the function or not. I don't think
               | it matters except in rare corner cases on very old cores.
               | Not sure why you think it's an overall net negative.
               | "feeling weird" does not make for solid analysis.
               | 
               | Most x86 instructions are not microcode only. Again,
               | check your data with performance counters. Microcoded
               | instructions are in the extreme minority. Maybe
               | microcoded instructions were more common in 1978 with the
               | 8086, but a few things have changed between then and now.
               | Also microcoded instructions do not cost thousands of
                | cycles, have you checked? E.g., a gather is ~22 micro-ops
                | on Haswell per https://uops.info/table.html; Golden Cove
                | does it in 5-7 uops.
               | 
               | ISA history has a lot of failed examples where people
               | tried to lean on the ISA to simplify the core
               | architecture. EPIC/VLIW, branch delay slots, and register
               | windows have all died off. Mill is a dumb idea and never
               | went anywhere. Everyone has converged on big OoO machines
               | for a reason, even though doing OoO execution is really
               | complex.
               | 
               | If you're interested in cases where ISA does matter, look
               | at GPUs. VLIW had some success there (AMD Terascale, the
               | HD 2xxx to 6xxx generations). Static instruction
               | scheduling is used in Nvidia GPUs since Kepler. In CPUs
               | ISA really doesn't matter unless you do something that
               | actively makes an OoO implementation harder, like
               | register windows or predication.
        
               | dzaima wrote:
               | > My favorite absurdity of x86 (though hardly the only
               | one) is padding. You want to align function calls at
               | cacheline boundaries, but that means padding the previous
               | cache line with NOPs. Those NOPs translate into uops
               | though.
               | 
               | I'd call that more neat than absurd.
               | 
               | > You may want it so you can use 16 registers, but it
               | also increases code size.
               | 
               | RISC-V has the exact same issue, some compressed
               | instructions having only 3 bits for operand registers.
               | And on x86 for 64-bit-operand instructions you need the
               | REX prefix always anyways. And it's not that hard to
               | pretty reasonably solve - just assign registers by their
               | use count.
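                | 
                | A sketch of that heuristic: hand the prefix-free
                | registers (rax..rdi on x86, or x8..x15 for RVC's 3-bit
                | fields) to the most-used values.
                | 
                |   #include <stdlib.h>
                | 
                |   typedef struct { int vreg, uses; } Cand;
                | 
                |   static int by_uses_desc(const void *a, const void *b) {
                |       return ((const Cand *)b)->uses - ((const Cand *)a)->uses;
                |   }
                | 
                |   /* slots 0..7 need no REX prefix / compress cleanly */
                |   void assign(Cand *c, int n, int *reg_of) {
                |       qsort(c, n, sizeof *c, by_uses_desc);
                |       for (int i = 0; i < n; i++)
                |           reg_of[c[i].vreg] = i;
                |   }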
               | 
               | Peephole optimizations specifically here are basically
               | irrelevant. Much of the complexity for x86 comes from
               | just register allocation around destructive operations
               | (though, that said, that does have rather wide-ranging
               | implications). Other than that, there's really not much
               | difference; all have the same general problems of moving
               | instructions together for fusing, reordering to reduce
               | register pressure vs putting parallelizable instructions
               | nearer, rotating loops to reduce branches, branches vs
               | branchless.
        
               | hajile wrote:
               | RISC-V has a different version of this issue that is
               | pretty straight-forward. Preferring 2-register operations
               | is already done to save register space. The only real
                | extra is preferring the 8 registers the C extension can
                | encode for its math ops. After this, it's all just
                | compression.
               | 
               | x86 has a multitude of other factors than just
               | compression. This is especially true with standard vs REX
               | instructions because most of the original 8 instructions
               | have specific purposes and instructions that depend on
               | them for these (eg, Accumulator instructions with A
               | register, Mul/div using A+D, shift uses C, etc). It's a
               | problem a lot harder than simple compression.
               | 
               | Just as cracking an alphanumeric password is
               | exponentially harder than a same-length password with
               | numbers only, solving for all the x86 complications and
               | exceptions is also exponentially harder.
        
               | dzaima wrote:
               | If anything, I'd say x86's fixed operands make register
               | allocation easier! Don't have to register-allocate that
               | which you can't. (ok, it might end up worse if you need
               | some additional 'mov's. And in my experience more 'mov's
               | is exactly what compilers often do.)
               | 
               | And, right, RISC-V even has the problem of being two-
               | operand for some compressed instructions. So the same
               | register allocation code that's gone towards x86 can
                | still help RISC-V (and vice versa)! On RISC-V, failure
                | means going from 2 to 4 bytes on a compressed
                | instruction, and on x86 it means +3 bytes of a 'mov'.
                | (granted, the additional REX prefix cost is separate on
                | x86, while included in decompression on RISC-V)
        
               | hajile wrote:
               | With 16 registers, you can't just avoid a register
               | because it has a special use. Instead, you must work to
               | efficiently schedule around that special use.
               | 
               | Lack of special GPRs means you can rename with impunity
               | (this will change slightly with the load/store pair
                | extension). Having 31 true GPRs rather than 8 GPRs + 8
                | special-purpose GPRs also gives a lot of freedom to
                | compilers.
        
               | dzaima wrote:
                | Function arguments and return values already are
                | effectively special-use, and should be on par with, if
                | not much more frequent than, the couple of x86
                | instructions with fixed registers.
               | 
               | Both clang and gcc support calls having differing used
               | calling conventions within one function, which ends up
               | effectively exactly identical to fixed-register
               | instructions (i.e. an x86 'imul r64' can be done via a
               | pseudo-function where the return values are in rdx & rax,
               | an input is in rax, and everything else is non-volatile;
               | and the dynamically-choosable input can be allocated
               | separately). And '__asm__()' can do mixed fixed and non-
               | fixed registers anyway.
        
               | hajile wrote:
               | Unlike x86, none of this is strictly necessary. As long
               | as you put things back as expected, you may use all the
               | registers however you like.
        
               | dzaima wrote:
               | The option of not needing any fixed register usage would
               | apply to, what, optimizing compilers without support for
               | function calls (at least via passing arguments/results
               | via registers)? That's a very tiny niche to use as an
               | argument for having simplified compiler behavior.
               | 
               | And good register allocation is still pretty important on
               | RISC-V - using more registers, besides leading to less
               | compressed instruction usage, means more non-volatile
               | register spilling/restoring in function
               | prologue/epilogue, which on current compilers (esp.
               | clang) happens at the start & end of functions, even in
               | paths that don't need the registers.
               | 
               | That said, yes, RISC-V still indeed has much saner
               | baseline behavior here and allows for simpler basic
               | register allocation, but for non-trivial compilers the
               | actual set of useful optimizations isn't that different.
        
               | hajile wrote:
               | Not just simpler basic allocation. There are fewer
               | hazards to account for as well. The process on RISC-V
               | should be shorter, faster, and with less risk that the
               | chosen heuristics are bad in an edge case.
        
               | neonsunset wrote:
               | > Not sure what you mean by "peephole heuristic
               | optimizations"
               | 
               | Post-emit or within-emit stage optimization where a
               | sequence of instructions is replaced with a more
               | efficient shorter variant.
               | 
               | Think replacing pairs of ldr and str with ldp and stp,
               | changing ldr and increment with ldr with post-index
               | addressing mode, replacing address calculation before
               | atomic load with atomic load with addressing mode (I
               | think it was in ARMv8.3-a?).
               | 
               | The "heuristic" here might be possibly related to
               | additional analysis when doing such optimizations.
               | 
               | For example, previously mentioned ldr, ldr -> ldp (or
               | stp) optimization is not always a win. During work on
               | .NET 9, there was a change[0] that improved load and
               | store reordering to make it more likely that simple
               | consecutive loads and stores are merged on ARM64.
               | However, this change caused regressions in various hot
                | paths because, for example, a previously matched ldr w0,
                | [addr], ldr w1, [addr+4] -> modify w0 -> str w0, [addr]
                | pair got replaced with ldp w0, w1, [addr] -> modify w0
                | -> str w0, [addr].
               | 
               | Turns out this kind of merging defeated store forwarding
               | on Firestorm (and newer) as well as other ARM cores. The
               | regression was subsequently fixed[1], but I think the
               | parent comment author may have had scenarios like these
               | in mind.
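                | 
                | The rough C shape of that pattern (a sketch): in a loop,
                | the merged 8-byte ldp overlaps the previous iteration's
                | 4-byte str, which many cores cannot store-forward.
                | 
                |   #include <stdint.h>
                | 
                |   void hot_loop(uint32_t *p, int iters) {
                |       for (int i = 0; i < iters; i++) {
                |           uint32_t a = p[0]; /* ldr w0, [p]    \ merged into   */
                |           uint32_t b = p[1]; /* ldr w1, [p, 4] / ldp w0,w1,[p] */
                |           p[0] = a + b;      /* str w0, [p] */
                |       }
                |   }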
               | 
               | [0]: https://github.com/dotnet/runtime/pull/92768
               | 
               | [1]: https://github.com/dotnet/runtime/pull/105695
        
               | NobodyNada wrote:
               | One more: there's more to an ISA than just the
               | instructions; there's semantic differences as well. x86
               | dates to a time before out-of-order execution, caches,
               | and multi-core systems, so it has an extremely strict
               | memory model that does not reflect modern hardware -- the
               | only memory-reordering optimization permitted by the ISA
               | is store buffering.
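                | 
                | Concretely (a sketch), a seq_cst store is where the model
                | bites: compilers emit xchg (or mov + mfence) on x86 so
                | the store can't be reordered past later loads, while on
                | AArch64 it's a single stlr the core can overlap more
                | freely.
                | 
                |   #include <stdatomic.h>
                | 
                |   void publish(atomic_int *flag) {
                |       /* xchg on x86-64, stlr on arm64 */
                |       atomic_store_explicit(flag, 1, memory_order_seq_cst);
                |   }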
               | 
               | Modern x86 processors will actually perform speculative
               | weak memory accesses in order to try to work around this
               | memory model, flushing the pipeline if it turns out a
               | memory-ordering guarantee was violated in a way that
               | became visible to another core -- but this has complexity
               | and performance impacts, especially when applications
               | make heavy use of atomic operations and/or communication
               | between threads.
               | 
               | Simple atomic operations can be an order of magnitude
               | faster on ARMv8 vs x86: https://web.archive.org/web/20220
               | 129144454/https://twitter.c...
        
               | sroussey wrote:
               | Yes, and Apple added this memory model to their ARM
               | implementation so Rosetta2 would work well.
        
               | janwas wrote:
               | > With 50+ years of figuring the basics out, RISC-V won't
               | be making any major mistakes on the most important stuff.
               | 
               | RVV does have significant departures from prior work, and
               | some of them are difficult to understand:
               | 
                | - the whole concept of avl, which adds complexity in many
                | areas including reg renaming. From where I sit, we could
                | just use masks instead (a standard avl-driven loop is
                | sketched at the end of this comment for context).
               | 
               | - mask bits reside in the lower bits of a vector, so we
               | either require tons of lane-crossing wires or some kind
               | of caching.
               | 
               | - global state LMUL/SEW makes things hard for compilers
               | and OoO.
               | 
               | - LMUL is cool but I imagine it's not fun to implement
                | reductions and vrgather.
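                | 
                | For context, avl in a standard strip-mined loop, written
                | with the v1.0 C intrinsics (a sketch): vsetvl turns the
                | remaining count into this iteration's vl, so no scalar
                | tail loop is needed.
                | 
                |   #include <riscv_vector.h>
                |   #include <stddef.h>
                |   #include <stdint.h>
                | 
                |   void add_i32(size_t n, const int32_t *a,
                |                const int32_t *b, int32_t *out) {
                |       for (size_t i = 0; i < n;) {
                |           size_t vl = __riscv_vsetvl_e32m1(n - i);
                |           vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
                |           vint32m1_t vb = __riscv_vle32_v_i32m1(b + i, vl);
                |           __riscv_vse32_v_i32m1(
                |               out + i, __riscv_vadd_vv_i32m1(va, vb, vl), vl);
                |           i += vl;
                |       }
                |   }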
        
               | dzaima wrote:
               | How does avl affect register renaming? (there's the edge-
               | case of vl=0 that is horrifically stupid (which is by
               | itself a mistake for which I have seen no justification
               | but whatever) but that's probably not what you're
               | thinking of?) Agnostic mode makes it pretty simple for
               | hardware to do whatever it wants.
               | 
               | Over masks it has the benefit of allowing simple hardware
               | short-circuiting, though I'd imagine it'd be cheap enough
               | to 'or' together mask bit groups to short-circuit on (and
               | would also have the benefit of better masked throughput)
               | 
               | Cray-1 (1976) had VL, though, granted, that's a pretty
               | long span of no-VL until RVV.
        
               | janwas wrote:
               | Was thinking of a shorter avl producing partial results
               | merged into another reg. Something like a += b; a[0] +=
               | c[0]. Without avl we'd just have a write-after-write, but
               | with it, we now have an additional input, and whether
               | this happens depends on global state (VL).
               | 
               | Espasa discusses this around 6:45 of
               | https://www.youtube.com/watch?v=WzID6kk8RNs.
               | 
               | Agree agnostic would help, but the machine also has to
               | handle SW asking for mask/tail unchanged, right?
        
               | camel-cdr wrote:
               | > Agree agnostic would help, but the machine also has to
               | handle SW asking for mask/tail unchanged, right?
               | 
               | Yes, but it should rarely do so.
               | 
               | The problem is that because of the vl=0 case you always
                | have a dependency on avl. I think the motivation for the
               | vl=0 case was that any serious ooo implementation will
               | need to predict vl/vtype anyways, so there might as well
               | be this nice to have feature.
               | 
                | IMO they should've only supported ta,mu. I think the
                | only use case for ma is when you need to avoid
                | exceptions. And while tu is useful, e.g. summing an
                | array, it could be handled differently. E.g. once
                | vl<vlmax you write the sum to a different vector and do
                | two reductions (or rather two different vectors given
                | the avl to vl rules).
        
               | dzaima wrote:
               | What's the "nice to have feature" of vl=0 not modifying
               | registers? I can't see any benefit from it. If anything,
               | it's worse, due to the problems on reduce and vmv.s.x.
        
               | camel-cdr wrote:
               | "nice to hace" because it removes the need for a branch
               | for the n=0 case, for regular loops you probably still
               | want it, but there are siturations were not needing to
               | worry about vl=0 corrupting your data is somewhat nice.
        
               | dzaima wrote:
               | Huh, in what situation would vl=0 clobbering registers be
               | undesirable while on vl>=1 it's fine?
               | 
               | If hardware will be predicting vl, I'd imagine that would
               | break down anyway. Potentially catastrophically so if
               | hardware always chooses to predict vl=0 doesn't happen.
        
               | dzaima wrote:
               | > Agree agnostic would help, but the machine also has to
               | handle SW asking for mask/tail unchanged, right?
               | 
               | The agnosticness flags can be forwarded at decode-time
               | (at the cost of the non-immediate-vtype vsetvl being very
               | slow), so for most purposes it could be as fast as if it
               | were a bit inside the vector instruction itself. Doesn't
               | help vl=0 though.
        
             | anvuong wrote:
             | That was true when ARM was first released, but over the
             | years the decoder for ARM has gotten more and more
             | complicated. Who would have guessed adding more specialized
             | instructions would result in more complicated decoders? ARM
             | now uses multi-stage decoders, just the same as x86.
        
           | IshKebab wrote:
           | Sure, but it's not idle power consumption that's the
           | difference between these.
        
             | wmf wrote:
             | When a laptop gets 12 hours or more of battery life that's
             | because it's 90% idle.
        
               | jeffbee wrote:
               | And while it's important to design a chip that can enter
               | a deep idle state, the thing that differentiates one
               | Windows laptop from the next is how many mistakes the
               | BIOS writers made and whether the platform drivers work
               | correctly. This is also why you cannot really judge the
               | expected battery life under Linux by reading reviews of
               | laptops running Windows.
        
         | dagmx wrote:
          | Battery tests are important, but so is how it fares on battery
          | (what performance drop-off is needed to maintain that battery
          | life), what its peak performance is, and how long it runs
          | before throttling when pushed.
         | 
         | The M series processors have succeeded in all four: battery
         | life, performance parity between battery and plugged in, high
         | performance and performance sustainability.
         | 
         | So far, very few benchmarks have been comparing the latter
         | three as part of the full package assessment.
        
         | GrumpyYoungMan wrote:
         | The display, RAM, and other peripherals are consuming power
         | too. Short of running continuous high CPU loads, which most
         | people don't do on laptops, changes in CPU efficiency have less
         | apparent effect on battery life because it's only a fraction of
         | overall power draw.
        
         | Panzer04 wrote:
          | Battery life beyond ~5hr in laptops is mostly a function of how
          | well idle is managed, I think. The less work you do while
          | running the user's core program, the better. I'm not sure how
          | much impact CPU efficiency really has in that case.
         | 
         | If you are running a remotely demanding program (say, a game) ,
         | your battery life will be bad no matter what (ie. <4hrs) unless
         | you choose a very low TDP that performs badly always.
         | 
          | A laptop at idle should be able to manage ~5w power consumption
          | regardless of AMD/Intel/Apple processor, but it's largely on
          | the OS to achieve that.
        
         | 999900000999 wrote:
         | I have a 365 AMD laptop.
         | 
          | The battery is great if you're doing very light stuff; Call of
          | Duty takes its battery down to 3 hours.
         | 
         | Macs don't really support higher end games, so I can't directly
         | compare to my M1 Air.
        
           | sedatk wrote:
           | How does "great" translate to hours?
        
             | 999900000999 wrote:
             | This is really tricky.
             | 
              | The OEMs will use every trick possible and do something like
             | open GMAIL to claim 10 hours, but given my typical use I
             | average 5 to 6. I make music using a software called
             | Maschine.
             | 
             | It's a massive step up over my old( still working just very
             | heavy) Lenovo Legion 2020, which would last about 2 hours
             | given the same usage.
             | 
              | This is all subjective at the end of the day. If none of
              | your applications actually work since you're on ARM
              | Windows, of course you'll have higher battery life.
        
         | cubefox wrote:
         | Unlike AMD and Qualcomm, Apple uses an expensive TSMC 3nm
         | process, so you would expect better battery life from the
         | "MBP3". I assume they used the process improvements to increase
         | performance instead.
        
           | hajile wrote:
           | Perf per watt is higher for M1 on N5 vs Zen5 on N4P, so the
           | problems go deeper than just process.
           | 
           | X Elite also beats AMD/Intel in perf/watt while being on the
           | same N4P node as HX370.
           | 
           | https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-
           | anal...
        
         | double0jimb0 wrote:
         | I didn't watch this link, but my Zenbook S 16 only gets
         | remotely close to my M2 MBA battery life if the zenbook is in
          | whatever Windows 11's 'efficiency' mode is, and then it
         | benchmarks at 50% of the M2.
         | 
         | I don't think the two are remotely comparable in perf/watt.
        
       | nullc wrote:
        | Speaking of Zen 5, are there any rumors on when 128-core Turin-X
       | will ship?
        
       | apatheticonion wrote:
       | Are there any mini PCs with Zen 5?
        
         | wmf wrote:
         | You could probably put a 9700X in a MS-A1. Besides that it will
         | take a few months.
        
         | hajile wrote:
         | I think there's a lot of hope for a Strix Halo mini-PC.
         | 
         | We currently have 7840HS+6650M with two massive heatsinks just
         | barely coming in at what you would consider a large mini-PC
         | rather than a SFFPC.
         | 
         | Just one chip cuts the heatsink demand in half. The cores
         | should be faster and it moves from 28CU up to 40CU and from
         | RDNA2 to RDNA3.5. As long as the cross-core-complex latency
         | isn't as bad as the HX370, I think it could be a real winner
         | for a long time as it's basically an upgraded PS5 that runs
         | Linux/Windows.
        
           | moffkalast wrote:
           | Hell I'd settle for a Z1 SBC or even mini PC. They can't seem
           | to keep up with demand for these new chips to put them into
           | any sort of products that don't have complete mass appeal.
           | It's impossible to find one that's not in a handheld gaming
           | console. I doubt they'll even make enough of the Halo to
           | cover laptops.
        
             | wmf wrote:
             | Z1 is basically the same as 8840U which you can find.
        
         | sliken wrote:
         | "AMD Strix Point expected to debut in October, claims AOOSTAR"
          | was a news headline I've seen. Seems about right that laptops
         | (higher margin parts) land a few months before the SFFs.
         | 
         | I'm excited about the strix halo that has more cores, double
          | the GPU, and double the memory bandwidth (to keep the GPU fed).
        
         | JonChesterfield wrote:
         | There seem to be some on hawk point but not many. I'd like to
         | replace a PN50 (4800u system) from years ago but am attached to
         | it being small enough to fit in the cable tidy under the desk -
         | the 4" form factor seems to have grown a little over time.
        
       | CyberDildonics wrote:
       | What does "strix point" mean?
        
         | icegreentea2 wrote:
         | It's AMD's code/product name for their mobile CPUs with the
         | Zen5 microarchitecture.
         | 
          | The last AMD laptop code name was "Hawk Point" - Strix is a
         | mythological bird. Who knows if AMD will keep with this naming
         | scheme.
        
       | jml7c5 wrote:
       | I sit firm in my belief that the best thing Microsoft could do
       | for their laptop ecosystem is to add support for a "max fan
       | speed" slider somewhere prominent in the Windows UI.
       | 
       | People want the option to make their laptop silent or nearly
       | silent. And when users do need the power, they generally prefer a
       | slightly slower laptop at a reasonable volume rather than the
       | roar of a jet engine.
       | 
       | Laptop manufacturers want their devices to score high on
       | benchmarks. The best way to do that is to add a fan that can
       | become very loud.
       | 
       | The incentives are not aligned.
       | 
       | All laptops should be designed to operate passively 100% of the
       | time, if the owner so chooses. I doubt manufacturers will go that
       | route unless Microsoft nudges them towards it. It would have
       | downstream effects on how review sites benchmark laptops (i.e.,
       | at various power draws/noise levels producing a curve rather than
       | a single number), which would have downstream effects on what CPU
       | designers optimize for. It'd be great for consumers.
        
         | latchkey wrote:
         | I don't know anything about Windows, but at least on Mac, I've
         | been using TGPro for years [0]. I'd assume there is something
         | similar in the Windows world.
         | 
         | In normal conditions my M1 mac can control its fans just fine,
         | but when I travel to hot places like Vietnam... I just keep the
         | fans on more often and my machine doesn't get nearly as hot. I
         | end up having to open it up after a few months and clean out
         | the fans, but that's fine.
         | 
         | [0] https://www.tunabellysoftware.com/tgpro/
        
         | kevin_thibedeau wrote:
         | Stop demanding paper thin laptops. My work Dell rarely turns on
         | its fan unless an AV scan is in progress and even then it's
            | rather tolerable. It isn't a fashionable thickness, so it has
         | plenty of internal volume for heat distribution.
        
           | rafaelmn wrote:
           | MacBook Air is thin and fanless, so it can be done.
        
             | cmeacham98 wrote:
             | The cheapest MacBook Air is $1000, and it's more like
             | $1500+ if you want a reasonable amount of RAM and storage.
             | There are similarly expensive Windows laptops available
             | that are fanless.
        
               | rafaelmn wrote:
                | Mind linking some? (genuinely curious, would like to
                | check out a potential Linux machine for the next upgrade)
        
               | nov21b wrote:
               | Count me in
        
               | jacooper wrote:
               | Surface laptop.
        
               | tedunangst wrote:
               | Not even the arm version of surface laptop is fanless.
        
               | aurareturn wrote:
               | >There are similarly expensive Windows laptops available
               | that are fanless.
               | 
               | Such as?
        
               | phonon wrote:
               | Wait until Lunar Lake comes out.
        
               | hajile wrote:
               | I spent $1700 or so on my M1 Air not too long after they
               | were released. A ThinkPad X1 Carbon would have cost me
               | more money for massively worse performance. Quality costs
               | more.
               | 
                | The difference is that a 4800U would look pretty bad
                | vs. an HX370, while the M1 still looks decent 4 years
                | later (especially when that HX370 is unplugged).
        
               | paulmd wrote:
                | For $1200 you can easily pick up a decent refurb MBP -
                | these are Apple refurbs, for example. OOS, but an
                | example of what you can find if you look around a bit.
               | 
               | https://sellout.woot.com/offers/apple-14-macbook-pro-
               | with-10...
               | 
               | There's very little reason to chase the exact latest
               | model when even a 2020/2021 M1 family is still great.
        
             | fulafel wrote:
              | Many things can be done if you don't have to coexist
              | with AV.
              | 
              | (Though with thermal management in modern CPUs, use of
              | the fan is always a configuration choice.)
        
             | AnthonyMouse wrote:
             | "Thin and fanless" aren't that hard, just use any low power
             | CPU.
             | 
             | But then people also want fast.
             | 
             | Apple does this by buying out TSMC's capacity for the
             | latest process nodes and then taking the
             | performance/efficiency trade off in favor of efficiency, so
             | they get something with similar performance and lower power
             | consumption. But then they charge you $400 for $50 worth of
             | RAM and solder it so you can't upgrade it yourself.
             | 
              | The thing to realize is that fans are not required to
              | spin, and the difference between the faster and slower
              | processors is the clock speed rather than the
              | transistors. So what you want is exactly what the OP
              | requested: a laptop with fans in it that you can turn
              | off. Then the CPU gets capped at a 15W TDP, has
              | basically the same single-thread performance but is
              | slower on threaded workloads, and it's no longer
              | possible for Chrome to make your laptop loud.
             | 
             | But if you want to open up Blender you still have the
             | option to move up the fan slider and make it go faster.
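              | 
              | A rough sketch of what such a cap looks like on Linux,
              | via the powercap (RAPL) sysfs interface - assuming a
              | kernel that exposes RAPL via powercap (the path, and
              | whether your machine honors it, varies), root access,
              | and illustrative wattages:
              | 
              |   RAPL = "/sys/class/powercap/intel-rapl:0"
              | 
              |   def set_power_limit(watts):
              |       # package power limit is written in microwatts
              |       path = RAPL + "/constraint_0_power_limit_uw"
              |       with open(path, "w") as f:
              |           f.write(str(int(watts * 1_000_000)))
              | 
              |   set_power_limit(15)   # quiet profile
              |   set_power_limit(54)   # performance profile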
        
               | thecompilr wrote:
                | The M1 was made on 5nm, which has long been available
                | to AMD and other competitors in volume.
        
               | AnthonyMouse wrote:
                | Fast is relative. The Ryzen HX 370 has a TDP
                | configurable down to 15W; at that power level it could
                | run fanless and would still be faster than the M1, and
                | it's faster still if you give it 54W and raise the
                | clock speed.
        
               | aurareturn wrote:
                | I'm going to need a source on that.
                | 
                | What does the HX 370 score at 10w?
        
               | schmidtleonard wrote:
               | Is that the chip AMD just released? Isn't the M1 about 4
               | years old?
        
               | ThatMedicIsASpy wrote:
                | Don't run Windows and you don't need fast.
                | Unfortunately, Linux on notebooks is always a dice
                | roll of random features (cam, fingerprint, ...) not
                | working.
               | 
               | There is a lot of older hardware running like crap
               | because Windows just bloats up.
        
               | MobiusHorizons wrote:
                | I know everyone on this site loves to hate on soldered
                | RAM, but my impression is most people don't understand
                | that soldered RAM is not the same thing as regular RAM
                | modules. They are literally different memory chips
                | (LPDDR vs DDR). When building for a specific chip, my
                | understanding is you can design for tighter timings
                | and higher bandwidth, which is important for the GPU.
                | The M1 shipped with very fast LPDDR4X running at
                | 4266MT/s, which was fast even by XMP desktop standards
                | at the time (2020). There are real engineering
                | advantages to soldered RAM, especially if the memory
                | controller has been designed to take advantage of it.
                | I guess it is similar to how GPU memory configurations
                | are specialized and not modular.
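                | 
                | Back-of-the-envelope, peak bandwidth is transfer rate
                | times bus width. Assuming the M1's 128-bit LPDDR4X
                | bus:
                | 
                |   def bw_gb_s(mt_per_s, bus_bits):
                |       # transfers/sec * bytes per transfer
                |       return mt_per_s * 1e6 * (bus_bits / 8) / 1e9
                | 
                |   print(bw_gb_s(4266, 128))  # ~68.3 GB/s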
        
           | tuna74 wrote:
           | I want a thin and fanless laptop. You might want something
           | else.
        
         | shmerl wrote:
         | You can use Linux and hwmon.
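          | 
          | A minimal sketch - hwmon numbering varies per machine, so
          | the hwmon2/pwm1 paths below are placeholders, and pwm
          | writes need root:
          | 
          |   import glob
          | 
          |   # list fan tachometer readings from all hwmon devices
          |   for p in glob.glob("/sys/class/hwmon/hwmon*/fan*_input"):
          |       print(p, open(p).read().strip(), "RPM")
          | 
          |   # take manual control and set ~50% duty (range 0-255)
          |   open("/sys/class/hwmon/hwmon2/pwm1_enable", "w").write("1")
          |   open("/sys/class/hwmon/hwmon2/pwm1", "w").write("128")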
        
         | mqus wrote:
         | > add support for a "max fan speed" slider somewhere prominent
         | in the Windows UI.
         | 
         | Isn't that what the "power settings" do? It's a slider at the
         | bottom right, hidden in a tray icon. Sure, it only has three
          | positions and also influences battery consumption, but it
          | pretty much does what you want. (Not sure if Windows 11 kept
          | this, though.)
        
         | nightski wrote:
         | I don't mind fans at all, in fact I find fan noises a little
         | soothing (a childhood thing, we didn't have AC). Everyone has
         | different priorities, personally I'd prefer to not have
         | throttled performance.
        
         | sandreas wrote:
          | Have you heard of NBFC (Notebook Fan Control)? It's old,
          | but it still works on some devices, and you can create
          | custom profiles via XML.
         | 
         | https://github.com/hirschmann/nbfc
        
         | jwells89 wrote:
         | It might also be good to mandate 10+ hours of battery life when
         | the laptop is in power saving mode. A number of laptops that'd
          | otherwise have decent battery life are hampered by things
          | like half-baked power management of discrete GPUs that
          | doesn't completely cut power to those components.
          | Manufacturers should test more heavily under this mode.
        
           | devbent wrote:
            | A few misbehaving CSS filters can make my discrete GPU
            | turn on, and at that point my battery life is a goner. Not
            | sure who to blame in that scenario.
           | 
           | There was an old bug in FF around 2018 where a tab using a
           | GPU would prevent a Windows laptop from ever sleeping. That
           | ended up destroying that laptop's battery after it got thrown
           | in my backpack and overheated a couple times.
        
             | jwells89 wrote:
              | Seems like this could be fixed by a system setting that
              | disables automatic graphics switching, controllable by
              | power profiles. That way the user can set the machine to
              | use the iGPU only when on battery, regardless of what
              | programs want.
        
         | automatic6131 wrote:
         | >I sit firm in my belief that the best thing Microsoft could do
         | for their laptop ecosystem is to add support for a "max fan
         | speed" slider somewhere prominent in the Windows UI
         | 
         | Until then, there's
         | https://github.com/Rem0o/FanControl.Releases
        
           | mkhalil wrote:
           | A closed source application for controlling one's fan...umm
           | no thank you.
           | 
            | I will never understand why people are so afraid of
            | releasing their source code. Looks like a
           | weekend project; does he expect to make a living out of a
           | weekend project?
        
             | Rinzler89 wrote:
             | _> Looks like a weekend project;_
             | 
             | Is that the best developer insult in your repertoire?
             | 
             |  _> does he expect to make a living out of a weekend
             | project?_
             | 
             | Trying to make money from writing SW is not illegal. The
             | free market will decide.
             | 
              |  _> A closed source application for controlling one's
              | fan...umm no thank you._
             | 
             | Well since you think it's only a weekend project, why don't
             | you put your money where your mouth is and spend a weekend
             | developing a FOSS fan control app if you need one?
        
             | wiseowise wrote:
             | > Looks like a weekend project; does he expect to make a
             | living out of a weekend project?
             | 
             | Why won't you spend a weekend and do the world a service?
        
         | cpv wrote:
          | Some recent Asus ROG/TUF (2022+) models have fine tuning
          | available via "armoury crate" or "g-helper" (non-proprietary,
          | fan/community supported, code on GitHub).
          | 
          | Disabling the dGPU and reducing power can yield some
          | impressive results for battery life. It also allows defining
          | fan speed curves depending on temperature.
        
       | aurareturn wrote:
        | One of these has to be true (or both):
       | 
       | 1. ARM is inherently more efficient than x86 CPUs in most tasks
       | 
       | 2. Nuvia and Apple are better CPU designers than AMD and Intel
       | 
       | Here are results from Notebookcheck:
       | 
       | Cinebench R24 ST perf/watt
       | 
       | * M3: 12.7 points/watt
       | 
       | * X Elite: 9.3 points/watt
       | 
       | * AMD HX 370: 3.74 points/watt
       | 
       | * AMD 8845HS: 3.1 points/watt
       | 
       | * Intel 155H: 3.1 points/watt
       | 
       | In ST, Apple is 3.4x more efficient than Zen5. X Elite is 2.4x
       | more efficient than Zen5.
       | 
       | Cinebench R24 MT perf/watt
       | 
       | * M3: 28.3 points/watt
       | 
       | * X Elite: 22.6 points/watt
       | 
       | * AMD HX 370: 19.7 points/watt
       | 
       | * AMD 8845HS: 14.8 points/watt
       | 
       | * Intel 155H: 14.5 points/watt
       | 
       | In MT, Apple is 1.9x more efficient than Zen4 and 1.4x more
       | efficient than HX 370. I expect M3 Pro/Max to increase the gap
       | because generally, more cores means more efficiency for Cinebench
       | MT. X Elite is also more efficient but the gap is closer.
        | However, we should note that in a laptop, ST matters more for
        | efficiency because usage is bursty. It's easier to gain MT
        | efficiency as long as you have many cores and run them at
        | lower wattage. In this case, AMD's 12-core, 24-thread Zen5
        | setup works well in Cinebench. Cinebench loves more threads.
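        | 
        | (Those multipliers are just ratios of the points/watt figures
        | above, e.g.:
        | 
        |   st = {"M3": 12.7, "X Elite": 9.3, "HX 370": 3.74}
        |   mt = {"M3": 28.3, "HX 370": 19.7, "8845HS": 14.8}
        |   print(st["M3"] / st["HX 370"])       # 3.40x
        |   print(st["X Elite"] / st["HX 370"])  # 2.49x
        |   print(mt["M3"] / mt["8845HS"])       # 1.91x
        |   print(mt["M3"] / mt["HX 370"])       # 1.44x
        | )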
       | 
        | One intriguing thing is that the X Elite does not have little
        | cores, which hurts its MT efficiency. It's likely a remnant of
        | Nuvia designing a server CPU, which does not need big.LITTLE,
        | but Qualcomm used the design in a laptop SoC first.
       | 
       | Sources: https://www.youtube.com/watch?v=ZN2tC8DfJnc
       | 
       | https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...
        
         | trynumber9 wrote:
          | The Snapdragon X Elite is on the same node, and when
          | actually doing a lot of work (i.e. cores loaded) it is
          | close enough to the HX 370 while delivering similar
          | throughput.
         | 
         | Why wouldn't the inherent inefficiency of x64 be as noticeable
         | in MT when all the inefficient cores are working? Because it is
         | running at lower clocks? Then what allows it to match the SDXE
         | in throughput? Does that need to lower its clock even more? I'm
         | not seeing what makes it inherent.
        
         | rafaelmn wrote:
          | From your link - Intel is topping the performance charts
          | (alongside AMD in SC) - they probably tune power usage
          | aggressively to achieve these results.
         | 
         | I would guess it's more to do with coming from desktop CPU
         | design to mobile vs. phones to laptops.
        
           | aurareturn wrote:
            | >From your link - Intel is topping the performance charts
            | (alongside AMD in SC) - they probably tune power usage
            | aggressively to achieve these results.
           | 
           | Cinebench 2024 ST:
           | 
           | * M3: 142 points
           | 
           | * X Elite: 123 points
           | 
           | * AMD HX 370: 116 points
           | 
           | * AMD 8845HS: 102 points
           | 
           | * Intel 155H: 108 points
           | 
            | Amongst each company's best laptop SoCs, no - Intel and
            | AMD are far behind in both ST scores and perf/watt.
           | 
           | If you're referring to desktop speeds, then yes, Intel's
           | 14900k does top the charts in ST Cinebench but it likely uses
           | well over 100w.
           | 
           | I mostly care about laptop SoCs. In the case of the M3, it
           | doesn't even have a fan.
        
             | rafaelmn wrote:
              | That's what I'm thinking - they make trade-offs to reach
              | peak performance in desktop designs that don't translate
              | optimally to laptops, and when you start from mobile
              | designs you've probably made the opposite trade-offs.
              | That would be my guess for the discrepancy.
        
               | aurareturn wrote:
                | I'm pretty sure the M3 Max closely matches the 14900k
                | in ST speeds while using something like 5-10% of the
                | power.
        
               | rafaelmn wrote:
                | Not sure - they had plenty of power/thermal envelope
                | in the desktop parts and no difference in performance
                | AFAIK.
        
               | hajile wrote:
               | Laptops vastly outsell desktops, so this tradeoff means
               | hurting the majority of your customers to please a small
               | minority. Servers also care about perf/watt a LOT and
               | they are the highest profit margin segment.
               | 
               | Why would AMD choose a target that hurts the majority of
               | their market unless there wasn't another good option
               | available?
        
               | rafaelmn wrote:
               | The architecture started in desktop space and data
               | center/mobile was an afterthought up until Intel shitting
               | the bed repeatedly. If they redesigned from ground up
               | they could probably get better instructions/watt but that
               | would look terrible if it wasn't accompanied by a perf
               | boost over previous generation. Just like Apple doesn't
               | seem to scale well with more power.
        
               | aurareturn wrote:
               | > Just like Apple doesn't seem to scale well with more
               | power.
               | 
               | How do you know this? When has Apple ever given 400w to
               | the M3 Max like the 14900KS can get up to?
               | 
                | PS. An M4 running at 7w is faster in ST than a 14900KS
                | running at 250w+.
        
         | AnthonyMouse wrote:
         | > 1. ARM is inherently more efficient than x86 CPUs in most
         | tasks
         | 
         | > 2. Nuvia and Apple are better CPU designers than AMD and
         | Intel
         | 
         | The third possibility is that they just pick a different point
         | on the efficiency curve. You can double power consumption in
         | exchange for a few percent higher performance, double it again
         | for an even smaller increase.
         | 
          | The max turbo on the i9 14900KS is 253 W. The power
          | efficiency is _bad_. But it generally outperforms the M3,
          | despite being on a significantly worse process node, because
          | that's the trade off.
         | 
         | AMD is only on a slightly worse process node and doesn't have
         | to do anything so aggressive, but they'll also sell you
          | whatever you want. The 8845HS and 8840U are basically the
          | same chip, but the former has around double the TDP. In
          | exchange for
         | that you get ~2% more single thread performance and ~15% more
         | multi-thread performance. Whereas the performance per watt for
         | the 8840U is nearly that of the M3, and the remaining
         | difference is basically the process node.
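          | 
          | Roughly, dynamic power scales with frequency times voltage
          | squared, and voltage has to rise to sustain higher clocks,
          | so power grows nearly cubically while performance grows at
          | best linearly. A toy model (the V/f numbers here are made
          | up, not a real table):
          | 
          |   for f in [3.0, 3.5, 4.0, 4.5, 5.0]:     # GHz
          |       v = 0.8 + 0.1 * (f - 3.0)           # hypothetical V/f
          |       print(f, round(f * v * v, 2))       # relative power
          | 
          | The last 10-20% of clock costs far more watts than it
          | returns in performance, which is the whole game in these
          | comparisons.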
        
           | aurareturn wrote:
           | >The third possibility is that they just pick a different
           | point on the efficiency curve. You can double power
           | consumption in exchange for a few percent higher performance,
           | double it again for an even smaller increase.
           | 
           | This only makes sense if the Zen5 is actually faster in ST
           | than the M3. In this case, the M3 is 1.24x faster and 3.4x
           | more efficient in ST than Zen5.
           | 
           | AMD's Zen5 chip is just straight up slower in any curve.
           | 
           | >The max turbo on the i9 14900KS is 253 W. The power
           | efficiency is bad. But it generally outperforms the M3,
           | despite being on a significantly worse process node, because
           | that's the trade off.
           | 
           | It's not a trade off that Intel wants. The 14900KS runs at
           | 253w (sometimes 400w+) because that's the only way Intel is
           | able to stay remotely competitive at the very high end. An M3
           | Max will often match a 14900KS in performance using 5-10% of
           | the power.
        
             | AnthonyMouse wrote:
             | > This only makes sense if the Zen5 is actually faster in
             | ST than the M3. In this case, the M3 is 1.24x faster and
             | 3.4x more efficient in ST than Zen5.
             | 
             | It makes sense if Zen5 is faster in MT, since that's when
             | the CPUs will be power limited, and it is. For ST the
             | performance generally isn't power-limited for either of
             | them and then the M3 is on a newer process node.
             | 
              | It also depends on the benchmark. For example, Zen5 is
              | faster in ST on Cinebench _R23_. It's not obvious what's
              | going on with R24, but it's a difference in the code
              | rather than the hardware.
             | 
             | The power numbers in that link also don't inspire a lot of
             | confidence. They have two systems with the same CPU but one
             | of them uses 119.3W and the other one uses 46.7W? Plausibly
             | the OEMs could have configured them differently but that
             | kind of throws out the entire premise of using the
             | comparison to measure the efficiency of the CPUs. The
             | number doesn't mean anything if the power consumption is
             | being set as a configuration parameter by the OEM and the
             | number of watts going to the display or a discrete GPU are
             | an uncontrolled hidden variable.
             | 
             | > It's not a trade off that Intel wants.
             | 
             | It's the one they've always taken, even when they were
             | unquestionably in the lead. They were selling up to 150W
             | desktop processors in 2008, because people buy them,
             | because they're faster.
             | 
             | Now they have to do it just to be competitive because their
             | process isn't as good, but the process is a different thing
             | than the ISA or the design of the CPU.
        
               | aurareturn wrote:
               | >It also depends on the benchmark. For example, Zen5 is
               | faster in ST on Cinebench R23. It's not obvious what's
               | going on with R24, but it's a difference in the code
               | rather than the hardware.
               | 
                | Cinebench R23 uses Intel Embree underneath. It's
                | hand-optimized for the AVX instruction set and poorly
                | translated to NEON. It's not even clear if it has any
                | NEON optimization.
        
             | Wytwwww wrote:
             | > An M3 Max will often match a 14900KS in performance using
             | 5-10% of the power.
             | 
              | On Cinebench and PassMark CPU benchmarks the 14900K is
              | 50-60% faster, so I'm not sure that's true.
        
         | jeswin wrote:
          | In the same article, if you pick their R23 benchmarks the
          | advantage vanishes in MT; the 3nm M3 Max is actually behind
          | the 4nm Strix Point in efficiency.
        
           | aurareturn wrote:
            | Cinebench R23 is hand-optimized for x86. It uses the
            | Intel Embree engine underneath.
            | 
            | That's why Cinebench R23 heavily favors x86.
        
         | kllrnohj wrote:
          | Performance per watt is a bad metric. What you want instead
          | is performance for a given power budget (e.g., how much
          | performance can I get at 15w? 30w? etc.)
         | 
         | Otherwise you can trivially win 100% of performance/watt
         | comparisons by just setting clocks to the limit of the lowest
         | usable voltage level.
         | 
         | For example compare the 7950X to the 7950X@65w using the
         | officially supported eco mode option:
         | https://www.anandtech.com/show/17585/amd-zen-4-ryzen-9-7950x...
         | 
         | Cinebench R23 MT:
         | 
         | 7950X stock: 225 points/watt
         | 
          | 7950X @ 65w eco mode: _482_ points/watt
          | 
          | Over 2x perf/watt improvement on _the exact same chip_, and
          | a power efficiency that tops the charts, beating every
          | laptop chip in that notebookcheck test by a _large_ amount
          | as well. And yet how many 7950X owners are using the 65w eco
          | mode option? Probably none. Because perf/watt isn't actually
          | meaningful. What matters is how much performance I can get
          | for a given power budget.
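          | 
          | A sketch of the metric I mean, given measured
          | (watts, score) samples per chip - numbers here are made up
          | for illustration:
          | 
          |   samples = {
          |       "chip_a": [(15, 900), (30, 1400), (65, 1800)],
          |       "chip_b": [(15, 1100), (30, 1500), (65, 1600)],
          |   }
          | 
          |   def perf_at_budget(chip, budget_w):
          |       # best score achievable within the power budget
          |       return max(s for w, s in samples[chip]
          |                  if w <= budget_w)
          | 
          |   print(perf_at_budget("chip_a", 30))  # 1400
          |   print(perf_at_budget("chip_b", 30))  # 1500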
        
           | aurareturn wrote:
           | >Performance per watt is a bad metric.
           | 
           | It isn't. It needs context.
           | 
           | Sure, you can get an Intel Celeron to have more perf/watt
           | than an M1 if you give the Celeron low enough wattage.
           | 
           | The key here is the absolute performance.
           | 
           | In this case, both the M3 and the X Elite are not only
           | significantly more efficient than both Zen4 and Zen5 in ST,
           | they are also straight up faster while being more efficient.
        
         | kllrnohj wrote:
         | > 1. ARM is inherently more efficient than x86 CPUs in most
         | tasks
         | 
          | I'm not sure how you're reaching the conclusion of "most
          | tasks" when Cinebench R24 is the only test you used: R23,
          | which doesn't agree, was rejected for hand-wavey, nebulous
          | reasons, and nothing else was tested.
         | 
         | R24 is hardly a representative workload of "most tasks" nor is
         | it claiming/trying to be.
        
           | hajile wrote:
           | Anandtech shows[0] that M3 is massively ahead in integer
           | performance, but slightly behind in float performance on Spec
           | 2017.
           | 
           | Integer workloads are by far the most common, but they tend
           | to not scale to multiple cores very well. Most workloads that
           | scale well across cores also benefit from big FP/SIMD units
           | too.
           | 
           | Put another way, the real issue with R24 is that it makes
           | HX370 look better than it would look in more normal consumer
           | workloads.
           | 
           | [0] https://www.anandtech.com/show/21485/the-amd-ryzen-ai-
           | hx-370...
        
             | kllrnohj wrote:
             | > that M3 is massively ahead in integer performance
             | 
              | The M3 is certainly an impressive chip, but note that
              | it's only massively ahead in _some_ of the int tests.
              | It's not a consistent gap.
              | 
              | > Integer workloads are by far the most common, but they
              | tend to not scale to multiple cores very well.
              | 
              | The HX370 does better than the M3 in specint MT though.
             | 
             | But regardless the anandtech results paint a _much_ closer
             | picture than the single R24 results that GP used as the
             | basis of the efficiency thesis.
        
               | aurareturn wrote:
               | The HX370 should win in SPECINT MT. It has 12 cores to
               | the M3's 8 cores and it runs at significantly higher
               | power.
               | 
               | Compare HX370 SPECINT MT To an M3 Pro and let's see the
               | results.
        
               | kllrnohj wrote:
               | > [HX370] runs at significantly higher power.
               | 
               | It used 33w. Meanwhile the M3 result came from a 2023
               | MacBook Pro 14-Inch, which certainly has the potential
               | for a TDP of around that. If you can find SPECINT MT
                | numbers w/ power data for an M3 Pro, let's see it. Or
                | even
               | just power data for an M3 non-pro in the 14" MBP. A quick
               | search isn't turning up any.
        
           | aurareturn wrote:
           | >I'm not sure how you're reaching the conclusion of "most
           | tasks" when Cinebench R24 is the only test you used because
           | R23, which doesn't agree, was rejected for hand-wavey
           | nebulous reasons, and nothing else was tested.
           | 
           | There are no hand-wavey nebulous reasons.
           | 
           | Cinebench R23 uses Intel Embree engine, which is hand
           | optimized for x86 CPUs. That's why x86 CPUs look far better
           | than ARM CPUs in it.
           | 
           | If there is an application that is purely hand optimized for
           | ARM, and then compiled for x86, do you think it's fair to use
           | it to compare the two architectures?
           | 
           | SPEC & GB6 mostly agrees with Cinebench 2024.
        
         | sudosysgen wrote:
         | Why are you comparing a chip optimized for performance with a
         | chip optimized for efficiency? Take an Ultrabook Zen5, not the
         | HX370.
        
       | imtringued wrote:
       | >Read bandwidth from a single cluster caps out at just under 62
       | GB/s. The memory controller has a bit more bandwidth on tap, but
       | you'll need to load cores from both clusters to get it.
       | 
        | Except with DDR5-7500 it isn't just "a bit more"; it's
        | actually double, at 120GB/s. This might pose a challenge for
        | LLM inference, which absolutely needs the full 120GB/s.
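        | 
        | For reference, that 120GB/s is just the two-channel math:
        | 
        |   # 7500 MT/s x 8 bytes per channel x 2 channels
        |   print(7500e6 * 8 * 2 / 1e9)  # 120.0 GB/s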
        
       | mshockwave wrote:
        | Switching from individual schedulers to a unified one for
        | integer execution makes sense to me, but I still don't quite
        | understand why the FP execution units go the opposite way.
        | Could somebody explain why?
        
       ___________________________________________________________________
       (page generated 2024-08-11 23:01 UTC)