[HN Gopher] AMD's Strix Point: Zen 5 Hits Mobile
___________________________________________________________________
AMD's Strix Point: Zen 5 Hits Mobile
Author : klelatti
Score : 150 points
Date : 2024-08-10 21:18 UTC (1 day ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| sm_1024 wrote:
| IMO, the most interesting thing about this line is the battery
| life: within an hour of the M3 MacBook Pro and within 2 hours
| of Asus's Qualcomm machine [1], making it comparable to ARM
| designs.
|
| Which is a little surprising because ARM is commonly believed to
| be much more power efficient than x86.
|
| [1] https://youtu.be/Z8WKR0VHfJw?si=A7zbFY2lsDa8iVQN&t=277
| jiggawatts wrote:
| > because ARM is commonly believed to be much more power
| efficient than x86.
|
| Because most ARM processors were designed for mobile phones and
| optimised to death for power efficiency.
|
| The total power usage of the front end decoders is a single
| digit percentage of the total power draw. Even if ARM magically
| needed 0 watts for this, it couldn't save more power than that.
| The rest of the processor design elements are essentially
| identical.
| halJordan wrote:
| Yeah, if you make a worse core and then downclock it, you will
| increase power efficiency. AMD thankfully only downclocks the
| Zen 5c cores, but Intel is shipping Ivy Bridge-era equivalents
| in their flagship products just to get power efficiency up.
| arnaudsm wrote:
| ARM has gotten a lot of hype since the release of the M1, but
| most users only compared it to the terrible Intel MBPs. Ryzen
| mobile has been consistently close to Apple Silicon perf/watt
| for 5 years, but it has gotten little press coverage.
|
| Hype can be really disconnected from real-world performance.
| sm_1024 wrote:
| I have heard that part of the reason for the limited coverage
| of Ryzen mobile CPUs is their limited availability, as AMD was
| focusing its fab capacity on server chips.
| jsheard wrote:
| Any efficiency comparison involving Apple's chips also has to
| factor in that Tim Cook keeps showing up at TSMC's door with a
| freight container full of cash to buy out exclusive access to
| their bleeding-edge silicon processes. ARM may be a factor
| but don't underestimate the power of having more money than
| God.
|
| Case in point, Strix Point is built on TSMC 4nm while Apple
| is already using TSMC's second-generation 3nm process.
| acdha wrote:
| Process helps, but have you seen benchmarks showing
| equivalent performance on the same process node? I think
| it's less that ARM is amazing than that the Apple Silicon
| team is very good and paired with aggressive optimization
| throughout the stack; everything I've seen suggests they
| are simply building better chips at their target levels
| (not server, high power, etc.).
| cubefox wrote:
| > Our benchmark database shows the Dimensity 9300 scores
| 2,207 and 7,408 in Geekbench 6.2's single and multi-core
| tests. A 30% performance improvement implies the
| Dimensity 9400 would score around 2,869 and 9,630.
| Its single-core performance is close to that of the
| Snapdragon 8 Gen 4 (2,884/8,840) and it understandably
| takes the lead in multi-core. Both are within spitting
| distance from the Apple A17 Pro, which scores 2,915 and
| 7,222 points in the benchmark. Then again, all three
| chips are said to be manufactured on TSMC's N3 class
| node, effectively leveling the playing field.
|
| https://www.notebookcheck.net/MediaTek-
| Dimensity-9400-rumour...
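|
| (The 30% scaling in that quote, as a quick Python sanity
| check; the base scores are the Dimensity 9300 figures
| quoted above:)
|
|     st, mt = 2207, 7408                      # Geekbench 6.2 1T / nT
|     print(round(st * 1.3), round(mt * 1.3))  # 2869 9630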
| sroussey wrote:
| I guess getting close to the same single-thread score is
| nice. Unfortunately, since only Apple is shipping, it is
| hard to compare whether the others burn the battery to get
| there.
|
| I suspect the other two, like Apple with the A18
| shipping next month, will be using the second-gen N3.
| Apple is expected to be around 3500 on that node.
|
| Needless to say, what will be very interesting is to see
| the perf/watt of all three on the same node and shipping
| in actual products where the benchmarks can be put to
| more useful tests.
| cubefox wrote:
| Yeah, and GPU tests, since the benchmarks above were only
| for the CPU.
| acdha wrote:
| That appears to be an unconfirmed rumor and it's exciting
| if true (and there aren't major caveats on power), but
| did you notice how they mentioned extra work by ARM? The
| argument isn't that Apple is unique, it's that the
| performance gaps they've shown are more than simply
| buying premium fab capacity.
|
| That doesn't mean other designers can't also do that
| work, but simply that it's more than just the process -
| for example, the M2 shipped on TSMC's N5P first as an
| exclusive but when Zen 5 shipped later on the same
| process it didn't close the single core performance or
| perf/watt gap. Some of that is x86 vs. ARM, but there
| isn't a single, simple factor which can explain this.
| Apple carefully tuning the hardware, firmware, OS,
| compilers, and libraries undoubtedly helps a lot, and
| it's been a perennial problem for non-Intel vendors on
| the PC side since so many developers have tuned for Intel
| first/only for decades.
| cubefox wrote:
| Since it's unclear whether Apple has a significant
| architectural advantage over Qualcomm and MediaTek, I
| would rather attribute this to relatively poor AMD
| architectures. Provisionally. At least their GPUs have
| been behind Nvidia for years. (AMD holding its own
| against Intel is not surprising given Intel's chip fab
| problems.)
| acdha wrote:
| Yes, to be clear I'd be very happy if MediaTek jumps in
| with a strong contender since consumers win. It doesn't
| look like the Qualcomm chips are performing as well as
| hoped but I'd wait a bit to see how much tuning helps
| since Windows ARM was not a major target until now.
| AnthonyMouse wrote:
| > for example, the M2 shipped on TSMC's N5P first as an
| exclusive but when Zen 5 shipped later on the same
| process it didn't close the single core performance or
| perf/watt gap.
|
| That was Zen 4, but it did close the gap:
|
| https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073
| _14...
|
| Single thread performance is higher (so is MT), TDP is
| slightly lower, Cinebench MT "points per watt" is 5%
| higher.
|
| We'll get to see it again when the 3nm version of Zen5 is
| released (the initial ones are 4nm, which is a node Apple
| didn't use).
| hajile wrote:
| Let's do the math on M1 Pro (10-core, N5, 2021) vs HX370
| (12-core, N4P, 2024).
|
| Firestorm without L3 is 2.281mm2. Icestorm is 0.59mm2. M1
| Pro has 8P+2E for a total of 19.428mm2 of cores included.
|
| Zen4 without L3 is 3.84mm2. Zen4c reduces that down to
| 2.48mm2. Zen5 CCD is pretty much the same size as Zen4
| (though with 27% more transistors), so core size should be
| similar. AMD has also stated that Zen5c has a similar
| shrink percent to Zen4c. We'll use their numbers. HX370 has
| 4P+8C for a total area of 35.2mm2. If being twice the size
| despite being on N4P instead of N5 like M1 seems like
| foreshadowing, it is.
|
| We'll use notebookcheck's Cinebench 2024 multithread power
| and performance numbers to calculate perf / power / area,
| then multiply that by 100 to eliminate some decimals.
|
| M1 Pro scores 824 (10-core) and while they don't have a
| power value listed, they do list 33.6w package power
| running the prime95 power virus, so cinebench's power
| should be lower than that.
|
| HX370 scored 1213 (12-core) and averaged 119w (maxing at a
| massive 121.7w and that's without running a power virus).
|
| This gives the following perf/power/area*100 scores:
|
| M1 Pro -- 126 PPA
|
| HX 370 -- 29 PPA
|
| M1 is more than 4.3x better while being an entire node
| behind and being released years before.
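|
| (A quick Python sketch of the perf/power/area arithmetic
| above; the scores, watts, and die areas are the figures
| quoted in this comment, with M1 Pro's 33.6w prime95 package
| power used as an upper bound:)
|
|     def ppa(score, watts, area_mm2):
|         # perf / power / area, scaled by 100 as above
|         return score / watts / area_mm2 * 100
|
|     print(ppa(824, 33.6, 19.428))   # M1 Pro -> ~126
|     print(ppa(1213, 119, 35.2))     # HX 370 -> ~29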
| cyp0633 wrote:
| Power efficiency is a curve, and Apple may have its own
| reason not to make M1 Pro run at 110W as well
| sm_1024 wrote:
| I think the OC might have misread the power numbers; 110 W
| is well into desktop CPU power range. Here is an excerpt
| from AnandTech:
|
| > In our peak power test, the Ryzen AI 9 HX 370 ramped up
| and peaked at 33 W.
|
| https://www.anandtech.com/show/21485/the-amd-ryzen-ai-
| hx-370...
| hajile wrote:
| You can read the notebookcheck review for yourself.
|
| https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-
| anal...
| ac29 wrote:
| Those 100W+ numbers are total system power. And that
| system has the CPU TDP set to 80W (far above AMD's
| official max of 54W). It also has a discrete 4070 GPU
| that can use over 100W on its own.
| paulmd wrote:
| if x86 laptops have 90w of platform power, that's a thing
| that's concerning in itself, not a reasonable defense.
|
| Remember, apple laptops have screens too, etc, and that
| shows up in the average system power measurements the
| same way. What's the difference in an x86 laptop?
|
| I really doubt it's actually platform power; the problem
| is that x86 is boosting up to 35W average/60W peak _per
| thread_. 120W package power isn't _unexpected_, if
| you're boosting 3-4 cores to maximum!
|
| And _that's_ the problem. x86 is far far worse at race-
| to-sleep. It's not just "macos has better scheduling"...
| you can see from the 1T power measurements that x86 is
| simply drawing 2-3x the power while it's racing-to-sleep,
| for performance that's roughly equivalent to ARM.
|
| Whatever the cause, whether it's just bad design from AMD
| and Intel, or legacy x86 cruft (I don't get how this
| applies to actual computational load though, as opposed
| to situations like idle power), or what... there is no
| getting around the fact that M2 tops out at 10W per core
| while an 8840HS or HX370 or Intel Meteor Lake is boosting
| to 30-35W at 1T loads.
| hajile wrote:
| I stacked the deck in AMD's favor using a 3-year-old chip
| on an older node.
|
| Why is AMD using 3.6x more power than M1 to get just 32%
| higher performance while having 17% more cores? Why are
| AMD's cores nearly 2x the size despite being on a better
| node and having 3 more years to work on them?
|
| Why are Apple's scores the same on battery while AMD's
| scores drop dramatically?
|
| Apple does have a reason not to run at 120w -- it doesn't
| need to.
|
| Meanwhile, if AMD used the same 33w, nobody would buy
| their chips because performance would be so incredibly
| bad.
| pickledish wrote:
| You should try not to talk so confidently about things
| you don't know about -- this statement
|
| > if AMD used the same 33w, nobody would buy their chips
| because performance would be so incredibly bad
|
| Is completely incorrect, as another commenter (and I
| think the notebookcheck article?) points out -- 30w is
| about the sweet spot for these processors, and the reason
| that 110w laptop seems so inefficient is that it's
| giving the APU 80w of TDP, which is a bit silly since it
| only performs marginally better than if you gave it e.g.
| 30 watts. It's not a good idea to take that example as a
| benchmark for the APU's efficiency; it varies depending
| on how much TDP you give the processor, and 80w is not a
| good TDP for these chips.
| hajile wrote:
| Halo products with high scores sell chips. This isn't a
| new idea.
|
| So you lower the wattage down. Now you're at M1 Pro
| levels of performance with 17% more cores and nearly
| double the die area and barely competing with a chip 3
| years older while on a newer, more expensive node too.
|
| That's not selling me on your product (and that's without
| mentioning the worst core latency I've seen in years when
| going between P and C cores).
| AnthonyMouse wrote:
| > I stacked the deck in AMD's favor using a 3-year-old
| chip on an older node.
|
| You could just compare the ones that are actually on the
| same process node:
|
| https://www.notebookcheck.net/R9-7945HX3D-vs-M2-Max_15073
| _14...
|
| But then you would see an AMD CPU with a lower TDP
| getting higher benchmark results.
|
| > Why is AMD using 3.6x more power than M1 to get just
| 32% higher performance while having 17% more cores?
|
| Getting 32% higher performance from 17% more cores
| implies higher performance per core.
|
| The power measurements that site uses are from the plug,
| which is highly variable to the point of uselessness
| because it takes into account every other component the
| OEM puts into the machine and random other factors like
| screen brightness, thermal solution and temperature
| targets (which affects fan speed which affects fan power
| consumption) etc. If you measure the wall power of a
| system with a discrete GPU that by itself has a TDP >100W
| and the system is drawing >100W, this tells you nothing
| about the efficiency of the CPU.
|
| AMD's CPUs have internal power monitors and configurable
| power targets. At full load there is very little daylight
| between the configured TDP and what they actually use.
| This is basically required because the CPU has to be able
| to operate in a system that can't dissipate more heat
| than that, or one that can't supply more power.
|
| > Meanwhile, if AMD used the same 33w, nobody would buy
| their chips because performance would be so incredibly
| bad.
|
| 33W is approximately what their mobile CPUs actually use.
| Also, even lower-configured TDP models exist and they're
| not that much slower, e.g. the 7840U has a base TDP of
| 15W vs. 35W for the 7840HS and the difference is a base
| clock of 3.3GHz instead of 3.8GHz.
| hajile wrote:
| > Getting 32% higher performance from 17% more cores
| implies higher performance per core.
|
| I don't disagree that it is higher perf/core. It is
| simply MUCH worse perf/watt because they are forced to
| clock so high to achieve those results.
|
| > The power measurements that site uses are from the
| plug, which is highly variable to the point of
| uselessness
|
| They measure the HX370 using 119w with the screen off
| (using an external monitor). What on that motherboard
| would be using the remaining 85+W of power?
|
| TDP is a suggestion, not a hard limit. Before thermal
| throttling, they will often exceed the TDP by a factor of
| 2x or more.
|
| As to these specific benchmarks, the R9 7945HX3D you
| linked to used 187w while the M2 Max used 78w for CB R15.
| As to perf/watt, Cinebench before 2024 wasn't using NEON
| properly on ARM, but was using Intel's hyper-optimized
| libraries for x86. You should be looking at benchmarks
| without such a massive bias.
| paulmd wrote:
| lmao he's citing cinebench R15? Which isn't just ancient
| but actually emulated on arm, of course.
|
| Really digging through the vaults for that one.
|
| Geekbench 6 is perfectly fine for that stuff. But that
| still shows Apple tying in MT and beating the pants off
| x86 in 1T efficiency.
|
| x86 1T boosts being silly is where the real problem comes
| from. But if they don't throw 30-35w at a single thread
| they lose horribly.
| Const-me wrote:
| > if AMD used the same 33w, nobody would buy their chips
| because performance would be so incredibly bad
|
| I'm writing this comment on HP ProBook 445 G8 laptop. I
| believe I bought it in early 2022, so it's a relatively
| old model. The laptop has a Ryzen 5 5600U processor which
| uses <= 25W. I'm quite happy with both the performance
| and battery life.
| atq2119 wrote:
| It's well known that performance doesn't scale linearly
| with power.
|
| Benchmarking incentives on PC have long pushed x86
| vendors to drive their CPUs at points of the
| power/performance curve that make their chips look less
| efficient than they really are. Laptop benchmarking has
| inherited that culture from desktop PC benchmarking to
| some extent. This is slowly changing, but Apple has never
| been subject to the same benchmarking pressures in the
| first place.
|
| You'll see in reviews that Zen5 can be very efficient
| when operated in the right power range.
| hajile wrote:
| Zen5 can be more efficient at lower clockspeeds, but then
| it loses badly to Apple's chips in raw performance.
| sm_1024 wrote:
| 119W for the HX370 looks extremely sus; it seems more like
| system-level power consumption and not CPU-only.
|
| According to phoronix [1,2], in their blender CPU test,
| they measured a peak of 33W.
|
| Here are max power numbers from some other tests that I know
| are multi-threaded:
|
| --
|
| Linux 6.8 Compilation: 33.13 W
|
| LLVM Compilation: 33.25 W
|
| --
|
| If I plug 33W into your equation, that would give the
| HX 370 a score of 104 PPA.
|
| This supports the HX 370 being pretty power efficient,
| although still not as power efficient as M3.
|
| [1] https://www.phoronix.com/review/amd-ryzen-
| ai-9-hx-370/3
|
| [2] https://www.phoronix.com/review/amd-ryzen-
| ai-9-hx-370/4
| hajile wrote:
| https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-
| anal...
|
| They got those kinds of numbers across multiple systems.
| You can take it up with them I guess.
|
| I didn't even mention one of these systems was peaking at
| 59w on single-core workloads.
| sm_1024 wrote:
| I see what's going on, they have two HX370 laptops:
|
|     Laptop   MC score   Avg Power
|     P16      1213       113 W
|     S16      921        29 W
|     M3 Pro   1059       (30 W?)
|
| They don't have M3 Pro power numbers, but I assume it is
| somewhere around 30W; it seems like the S16's HX 370 at
| 30 W has similar power efficiency to the M3 Pro.
|
| Any more power and the CPU is much less power efficient:
| a 300% increase in power for a 30% increase in performance.
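|
| (Points per watt from that table, as a quick Python check;
| the M3 Pro's 30 W is the assumed figure above:)
|
|     for name, score, watts in [("P16", 1213, 113),
|                                ("S16", 921, 29),
|                                ("M3 Pro", 1059, 30)]:
|         print(name, round(score / watts, 1))
|     # P16 10.7, S16 31.8, M3 Pro 35.3 points per watt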
| sudosysgen wrote:
| This is true for every CPU. Past a certain point power
| consumption scales quadratically with performance.
| moonfern wrote:
| About Cinebench vs. Geekbench vs. SPEC: https://old.reddit.com/r/
| hardware/comments/pitid6/eli5_why_d... That's about
| Cinebench 20; here's an overview of Cinebench 2024 CPU & GPU(!)
| scores: https://www.cgdirector.com/cinebench-2024-scores/
| jeswin wrote:
| Even with the M3 the difference is marginal in multi-
| threaded benchmarks, from the Cinebench link [1] someone
| posted earlier in the thread:
|
|     Apple M3 Pro 11-Core  - 394 Points per Watt
|     AMD Ryzen AI 9 HX 370 - 354 Points per Watt
|     Apple M3 Max 16-Core  - 306 Points per Watt
|
| And the Ryzen is on TSMC 4nm while the M3 is on 3nm. As
| parent is saying, a lot of the Apple Silicon hype was due
| to the massive upgrade it was over the Intel CPUs Apple
| was using previously.
|
| [1]: https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-
| CPU-anal...
| janwas wrote:
| Cinebench might not be the most relevant benchmark; it
| uses lots of scalar instructions with fairly high branch
| mispredictions and low IPC:
| https://chipsandcheese.com/2021/02/22/analyzing-
| zen-2s-cineb....
| carstenhag wrote:
| But how is this the case? I never saw a single article
| mentioning that a non-Mac laptop was better.
|
| (Random article saying M3 pro is better than a Dell laptop
| https://www.tomsguide.com/news/macbook-pro-m3-and-m3-max-
| bat... )
| moonfern wrote:
| You're right, but... The idea comes from the desktop world.
| AMD's Zen 4 desktop CPUs, especially the gaming variants
| like the Ryzen 7 7800X3D, almost match the performance per
| watt of Apple's M3.
|
| Their laptop CPUs (some companies released the same model
| with different CPUs) were less efficient than Intel's.
|
| But the Asus ProArt P16 (used in the article) did manage an
| extreme endurance score of 21 hours in the Big Buck Bunny
| H.264 1080p video test, which runs at 150 cd/m2. With its
| higher-resolution OLED screen and 10% less battery
| capacity, that's 40 minutes better than the MacBook Pro 16
| M3 Max. In the Wi-Fi test, also run at 150 cd/m2, the M3
| ran for 16 hours, the Asus for 8. (
| https://www.notebookcheck.net/Asus-ProArt-P16-laptop-
| review-... )
|
| For me noise matters: that Asus has a whisper mode which
| produces 42 dB, as much as an M3 Max under full load. Be
| aware that if you're susceptible to PWM, that Asus laptop
| has issues.
| Filligree wrote:
| Ryzen mobile is consistently close, yeah. But with the sole
| exception of the Steam Deck, I've yet to see a Ryzen-mobile-
| bearing laptop, Windows included, which is close to the
| overall performance of the MacBook.
| talldayo wrote:
| > But with the sole exception of the Steam deck
|
| Uuh wut? The Steam Deck is like 3-generations-old hardware
| in mobile Ryzen terms. In a lot of ways it's similar to a
| pared-back 4800u with fewer (and older) cores, and a
| slightly bumped up GPU.
|
| To me it's kinda the opposite. Excluding the Steam Deck, I
| think most of AMD's Ultrabook APUs have been very close to
| the products Apple's made on the equivalent nodes. Even on
| 7nm the 4800u put up a competitive fight against M1, and
| the gap has gotten thinner with each passing year.
| According to the OpenCL benchmarks, the Radeon 680m on 6nm
| scores higher than the M1 on 5nm:
| https://browser.geekbench.com/opencl-benchmarks
|
| Even back when Ryzen Mobile only shipped with Vega, it was
| pretty clear that Apple and AMD were a pretty close match
| in onboard GPU power.
| amlib wrote:
| The Steam Deck might be behind in terms of hardware, but in
| terms of software it's way beyond your typical x86 Linux
| system's power efficiency, and dare I say it's doing better
| than Windows machines with their typical shoddy BIOSes and
| drivers, especially when you consider all the extraneous
| services constantly sapping varying amounts of CPU time.
| All that contributes to making the Steam Deck punch well
| above its weight.
| sudosysgen wrote:
| My Alienware M15 Ryzen edition gets 7-8W power
| consumption by just running "sudo powertop --autotune".
| Basically all of the power efficiency stuff in the Steam
| Deck applies to other Ryzen systems and is in the mainline
| kernel.
| makeitdouble wrote:
| "overall performance" does a lot of work here. On sheer
| benchmarks it's really comparable, with AMD being slightly
| better depending on what you look at. e.g. the M1 vs the
| 5700U (a similar-class, widely available mobile CPU):
|
| https://www.cpubenchmark.net/cpu.php?cpu=AMD%20Ryzen%207%20
| 5...
|
| https://www.cpubenchmark.net/cpu.php?cpu=Apple+M1+8+Core+32
| 0...
|
| They're not profiled the same, and don't belong in the same
| ecosystem though, which makes a lot more difference than
| the CPUs themselves. In particular, the AMD chip doesn't get
| a dedicated compiler optimizing every application in the
| system for its strengths and weaknesses (the other side of
| that being compatibility with the two largest ecosystems we
| have now).
| sofixa wrote:
| Depends on what you mean by "overall performance", but my
| Asus ROG Zephyrus G14 2023 is full AMD, and outperforms my
| work issued top of the line M1 MacBook Pro from a few
| months earlier in every task I've done across the two
| (gaming, compiling, heavy browsing). Battery life is lower
| under heavy load and high performance on the Zephyrus, but
| in power saving mode it's roughly comparable, albeit still
| worse.
| izacus wrote:
| Same here, my G14 and the M1 MBP are pretty much
| interchangeable for most workloads. The only time the
| G14 starts its fans is when the 4070 turns on... and that's
| not an option on the M1 at all.
| sandywaffles wrote:
| I think that's because all the press talks about actual
| battery life per laptop, and the Apple Silicon laptops ship
| with literally double the battery size of any AMD-based
| laptop without a discrete GPU. So while the efficiency may be
| close, the actual perceived battery life of the Mac will be
| more than double when you also consider the priority Apple
| puts into their power control combined with a larger overall
| battery.
| wtallis wrote:
| The CPU core's instruction set has no influence on how well the
| chip as a whole manages power when not executing instructions.
| sm_1024 wrote:
| That is fair. I was taught that decoders for x86 are less
| efficient and more power hungry than those for RISC ISAs
| because of x86's variable-length instructions.
|
| I remember being told (and it might be wrong) that ARM can
| decode multiple instructions in parallel because the CPU
| knows where the next instruction starts, but for x86, you'd
| have to decode the instructions in order.
| pohuing wrote:
| That seems to not matter much nowadays. There's another
| great (according to my untrained eye) writeup on Chips and
| Cheese about why it doesn't matter much:
|
| https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-
| doesnt-...
| dzaima wrote:
| The various mentioned power consumption amounts are 4-10%
| per-core, or 0.5-6% of package (with the caveat of
| running with micro-op cache off) for Zen 2, and 3-10% for
| Haswell. That's not massive, but is still far from what
| I'd consider insignificant; it could give leeway for an
| extra core or some improved ALUs; or, even, depending on
| the benchmark, is the difference between Zen 4 and Zen 5
| (making the false assumption of a linear relation between
| power and performance, at least), which'd essentially be
| a "free" generational improvement. Of course the reality
| is gonna be more modest than that, but it's not nothing.
| Panzer04 wrote:
| You missed the part where they mention ARM ends up
| implementing the same thing to go fast.
|
| The point is processors are either slow and efficient, or
| fast and inefficient. It's just a tradeoff along the
| curve.
| dzaima wrote:
| ARM doesn't need the variable-length instruction decoding
| though, which on x86 essentially means that the decoder
| has to attempt to decode at every single byte offset for
| the start of the pipeline, wasting computation.
|
| Indeed pretty much any architecture can benefit from some
| form of op cache, but less of a need for it means its
| size can be reduced (and savings spent in more useful
| ways), and you'll still need actual decoding at some
| point anyway (and, depending on the code footprint, may
| need it a lot).
|
| More generally, throwing silicon at a problem is, quite
| obviously, a more expensive solution than not having the
| problem in the first place.
| XMPPwocky wrote:
| But bigger fixed-length instructions mean more I$
| pressure, right?
| dzaima wrote:
| RISC doesn't imply wasted instruction space; RISC-V has a
| particularly interesting thing for this - with the
| compressed ('c') extension you get 16-bit instructions
| (which you can determine by just checking two bits), but
| without it you can still save 6% of icache silicon via
| only storing 30 bits per instruction, the remaining two
| being always-1 for non-compressed instructions.
|
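| (A minimal Python sketch of that two-bit length check,
| covering only the 16- and 32-bit encodings discussed here:)
|
|     def rv_insn_len(first_parcel: int) -> int:
|         # Low two bits of the first 16-bit parcel mark the
|         # length: anything but 0b11 is a 16-bit compressed
|         # instruction, 0b11 is a full 32-bit instruction.
|         return 4 if (first_parcel & 0b11) == 0b11 else 2
|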
| Also, x86 isn't even that efficient in its variable-
| length instructions - some half of them contain the byte
| 0x0F, representing an "oh no, we're low on single-byte
| instructions, prefix new things with 0F". On top of that,
| general-purpose instructions on 64-bit registers have a
| prefix byte with 4 fixed bits. The VEX prefix (all AVX1/2
| instructions) has 7 fixed bits. EVEX (all AVX-512
| instructions) is a full fixed byte.
| hajile wrote:
| https://oscarlab.github.io/papers/instrpop-systor19.pdf
|
| ARM64 instructions are 4 bytes. x86 instructions in real-
| world code average 4.25 bytes. ARM64 gets closer to x86
| code size as it adds new instructions to replace common
| instruction sequences.
|
| RISC-V has 2-byte and 4-byte instructions and averages
| very close to 3 bytes. Despite this, the original
| compressed code was only around 15% more dense than x86.
| The addition of the B (bitwise) extensions and Zcb have
| increased that advantage by quite a lot. As other
| extensions get added, I'd expect to see this lead
| increase over time.
| paulmd wrote:
| x86-64 wastes enough of its encoding space that arm64 is
| typically smaller in practice. The RISC-V folks pointed
| this out a decade ago - geomean across their SPEC suite,
| x86 is 7.3% larger binary size than arm64.
|
| https://people.eecs.berkeley.edu/%7Ekrste/papers/EECS-201
| 6-1...
|
| So there's another small factor leaning against x86 -
| inferior code density means they get less out of their
| icache than ARM64 due to their ISA design (legacy cruft).
| And ARM64 often has larger icaches anyway - M1 is 6x the
| icache of zen4 iirc, _and_ they get more out of it with
| better code density.
|
| <uno-reverse-card.png>
| imtringued wrote:
| x86 processors simply run an instruction-length predictor
| the same way they do it for branch prediction. That turns
| the problem into something that can be tuned. Instead of
| having to decode the instruction at every byte offset,
| you can simply decide to optimize for the 99% case with a
| slow path for rare combinations.
| dzaima wrote:
| That's still silicon spent on a problem that can be
| architecturally avoided.
| hajile wrote:
| That stuff is WAY out-of-date and was flatly wrong when
| it was published.
|
| A715 cut decoder size a whopping 75% by dropping the more
| CISC 32-bit stuff and completely eliminated the uop cache
| too. Losing all that decode, cache, and cache controllers
| means a big reduction in power consumption (decoders are
| basically always on). All of ARM's latest CPU designs
| have eliminated uop cache for this same reason.
|
| At the time of publication, we already knew that M1
| (already out for nearly a year) was the highest IPC chip
| ever made and did not use a uop cache.
| hajile wrote:
| Clam makes some serious technical mistakes in that
| article and some info is outdated.
|
| 1. His claim that "ARM decoder is complex too" was wrong
| at the time (M1 being an obvious example) and has been
| proven more wrong since publication. ARM dropped the uop
| cache as soon as they dropped support for their very
| CISC-y 32-bit catastrophe. They bragged that this
| coincided with a whopping 75% reduction in decoder size
| for their A715 (while INCREASING from 4 decoders to 5)
| and this was almost single-handedly responsible for the
| reduced power consumption of that chip (as all the other
| changes were comparatively minor). NONE of the current-
| gen cores from ARM, Apple, or Qualcomm use uop cache
| eliminating these power-hungry cache and cache
| controllers.
|
| 2. The paper[0] he quotes has a stupid conclusion. They
| show integer workloads using a massive 22% of total core
| power on the decoder and even their fake float workload
| showed 8% of total core power. Realize that a study[1] of
| the entire Ubuntu package repo showed that just 12
| int/ALU instructions made up 89% of all code with
| float/SIMD being in the very low single digits of use.
|
| 3. x86 decoder situation has gotten worse. Because adding
| extra decoders is exponentially complex, they decided to
| spend massive amounts of transistors on multiple decoder
| blocks working on various speculated branches. Setting
| aside that this penalizes unrolled code (where they may
| have just 3-4 decoders while modern ARM will have 10+
| decoders), the setup for this is incredibly complex and
| man-year intensive.
|
| 4. "ARM decodes into uops too" is a false equivalency.
| The uops used by ARM are extremely close to the original
| instructions as shown by them being able to easily
| eliminate the uop cache. x86 has a much harder job here
| mapping a small set of instructions onto a large set.
|
| 5. "ARM is bloated too". ARM redid their entire ISA to
| eliminate bloat. If ISA didn't actually matter, why would
| they do this?
|
| 6. "RISC-V will become bloated too" is an appeal to
| ignorance. x86 has SEVENTEEN major SIMD extensions
| excluding the dozen or so AVX-512 extensions all with
| various incompatibilities and issues. This is because
| nobody knew what SIMD should look like. We know now and
| RISC-V won't be making that mistake. x86 has useless
| stuff like BCD instructions using up precious small
| instruction space because they didn't know. RISC-V won't
| do this either. With 50+ years of figuring the basics
| out, RISC-V won't be making any major mistakes on the
| most important stuff.
|
| 7. Omitting complexity. A bloated, ancient codebase takes
| forever to do anything with. A bloated, ancient ISA takes
| forever to do anything with. If ARM and Intel both put X
| dollars into a new CPU design, Intel is going to spend
| 20-30% or maybe even more of their budget on devs
| spending time chasing edge cases and testers to test all
| those edge cases. Meanwhile, ARM is going to spend that
| 20-30% of their budget on increasing performance. All
| other things equal, the ARM chip will be better at any
| given design price point.
|
| 8. Compilers matter. Spitting out fast x86 code is
| incredibly hard because there are so many variations on
| how to do things each with their own tradeoffs (that
| conflate in weird ways with the tradeoffs of nearby
| instructions). We do peephole heuristic optimizations
| because provably fast would take centuries. RISC-V and
| ARM both make it far easier for compiler writers because
| there's usually just one option rather than many options
| and that one option is going to be fast.
|
| [0] https://www.usenix.org/system/files/conference/cooldc
| 16/cool...
|
| [1] https://oscarlab.github.io/papers/instrpop-
| systor19.pdf
| dzaima wrote:
| Some notes:
|
| 3: I don't think more decoders should be exponentially
| more complex, or even polynomial; I think O(n log n)
| should suffice. It just has a hilarious constant factor
| due to the lookup tables and logic needed, and that log
| factor also impacts the critical path length, i.e.
| pipeline length, i.e. mispredict penalty. Of note is that
| x86's variable-length instructions aren't even
| particularly good at code size.
|
| Golden Cove (~1y after M1) has 6-wide decode, which is
| probably reasonably near M1's 8-wide given x86's complex
| instructions (mainly free single-use loads). [EDIT:
| actually, no, chipsandcheese's diagram shows it only
| moving 6 micro-ops per cycle to reorder buffer, even out
| of the micro-op cache. Despite having 8/cycle retire.
| Weird.]
|
| 6: The count of extensions is a very bad way to measure
| things; RISC-V will beat everything in that in no time,
| if not already. The main things that matter are <=SSE4.2
| (uses same instruction encoding as scalar code); AVX1/2
| (VEX prefix); and AVX-512 (EVEX). The actual instruction
| opcodes are shared across those. But three encoding modes
| (plus the three different lengths of the legacy encoding)
| is still bad (and APX adds another two onto this) and the
| SSE-to-AVX transition thing is sad.
|
| RISC-V already has two completely separate solutions for
| SIMD - v (aka RVV, i.e. the interesting scalable one) and
| p (a simpler thing that works in GPRs; largely not being
| worked on but there's still some activity). And if one
| wants to count extensions, there are already a dozen for
| RVV (never mind its embedded subsets) - Zvfh, Zvfhmin,
| Zvfbfwma, Zvfbfmin, Zvbb, Zvkb, Zvbc, Zvkg, Zvkned,
| Zvknhb, Zvknha, Zvksed, Zvksh; though, granted, those
| work better together than, say, SSE and AVX (but on x86
| there's no reason to mix them anyway).
|
| And RVV might get multiple instruction encoding forms too
| - the current 32-bit one is forced into allowing using
| only one register for masking due to lack of encoding
| space, and a potential 48-bit and/or 64-bit instruction
| encoding extension has been discussed quite a bit.
|
| 8: RISC-V RVV can be pretty problematic for some things
| if compiling without a specific target architecture, as
| the scalability means that different implementations can
| have good reason to have wildly different relative
| instruction performance (perhaps most significant being
| in-register gather (aka shuffle) vs arithmetic vs indexed
| load from memory).
| hajile wrote:
| 3. You can look up the papers released in the late 90s on
| the topic. If it was O(n log n), going bigger than 4 full
| decoders would be pretty easy.
|
| 6. Not all of those SIMD sets are compatible with each
| other. Some (eg, SSE4a) wound up casualties of the Intel
| v AMD war. It's so bad that the Intel AVX10 proposal is
| mostly about trying to unify their latest stuff into
| something more cohesive. If you try to code this stuff by
| hand, it's an absolute mess.
|
| The P proposal is basically DOA. It could happen, but
| nobody's interested at this point. Just like the B
| proposal subsumed a bunch of ridiculously small
| extensions, I expect a new V proposal to simply unify
| these. As you point out, there isn't really any conflict
| between these tiny instruction releases.
|
| There is discussion around the 48-bit format (the bits
| have been reserved for years now), but there are a couple
| different proposals (personally, I think 64-bit only with
| the ability to put multiple instructions inside is
| better, but that's another topic). Most likely, a 48-bit
| format does NOT do multiple encoding, but instead does a
| superset of encodings (just like how every 16-bit
| instruction expands into a 32-bit instruction). They
| need/want 48-bits to allow 4-address instructions too, so
| I'd imagine it's coming sooner or later.
|
| Either way, the length encoding is easy to work with
| compared to x86 where you must check half the bits in
| half the bytes before you can be sure about how long your
| instruction really is.
|
| 8. There could be some variance, but x86 has this issue
| too and SO many more besides.
| dzaima wrote:
| I know the E-cores (gracemont, crestmont, skymont) have
| the multi-decoder setup; the first couple search results
| don't show Golden Cove being the same. Do you have some
| reference for that?
|
| 6. Ah yeah the funky SSE4a thing. RISC-V has its own
| similar but worse thing with RVV0.7.1 / xtheadvector
| already though, and it can be basically guaranteed that
| there will be tons of one-off vendor extensions,
| including vector ones, given that anyone can make such.
|
| 8. RVV's vrgather is extremely bad at this, but is very
| important for a bunch of non-trivial things; existing
| RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes
| 256 cycles for LMUL=8[1]. But some hypothetical future
| hardware could do it at O(LMUL) for non-worst-case
| indices, thus massively changing tradeoffs. So far the
| compiler approaches are to just not do high LMUL when
| vrgather is needed (potentially leaving free perf on the
| table), or using indexed loads (potentially significantly
| worse).
|
| Whereas x86 and ARM SIMD perf variance is very tiny;
| basically everything is pretty proportional everywhere,
| with maybe the exception of very old atom cores. There'll
| be some differences of 2x up or down of throughput of
| instruction classes, but it's generally not so bad as to
| make way for alternative approaches to be better.
|
| [1]: https://camel-cdr.github.io/rvv-bench-
| results/bpi_f3/index.h...
| hajile wrote:
| I think you may be correct about gracemont v golden cove.
| Rumors/insiders say that Intel has supposedly decided to
| kill off either the P or E-core team, so I'd guess that
| the P-core team is getting laid off because the E-core
| IPC is basically the same, but the E-core is massively
| more efficient. Even if the P-core wins, I'd expect them
| to adopt the 3x3 decoder just as AMD adopted a 2x4
| decoder for zen5.
|
| Using a non-frozen spec is at your own risk. There's
| nothing comparable to stuff like SSE4a or FMA4. The
| custom extension issue is vastly overstated. Anybody can
| make extensions, but nobody will use unratified
| extensions unless you are in a very niche industry. The P
| extension is a good example here. The current proposal is
| a copy/paste of a proprietary extension a company is
| using. There may be people in their niche using their
| extension, but I don't see people jumping to add support
| anywhere (outside their own engineers).
|
| There's a LOT to unpack about RVV. Packed SIMD doesn't
| even have LMUL>1, so the comparison here is that you are
| usually the same as Packed SIMD, but can sometimes be
| better which isn't a terrible place to be.
|
| Differing performance across different performance levels
| is to be expected when RVV must scale from tiny DSPs up
| to supercomputers. As you point out, old atom cores
| (about the same as the Spacemit CPU) would have a
| different performance profile from a larger core. Even
| larger AMD cores have different performance
| characteristics with their tendency to like double-
| pumping AVX2/512 instructions (but not all of them --
| just some).
|
| In any case, it's a matter of the wrong configuration
| unlike x86 where it is a matter of the wrong instruction
| (and the wrong configuration at times). It seems obvious
| to me that the compiler will ultimately need to generate
| a handful of different code variants (shouldn't be a code
| bloat issue because only a tiny fraction of all code is
| SIMD) then dynamically choose the best variant for the
| processor at runtime.
| dzaima wrote:
| > Packed SIMD doesn't even have LMUL>1, so the comparison
| here is that you are usually the same as Packed SIMD, but
| can sometimes be better which isn't a terrible place to
| be.
|
| Packed SIMD not having LMUL means that hardware can't
| rely on it being used for high performance; whereas some
| of the theadvector hardware (which could equally apply to
| rvv1.0) already had VLEN=128 with 256-bit ALUs, thus
| having LMUL=2 have twice the throughput of LMUL=1. And
| even above LMUL=2 various benchmarks have shown
| improvements.
|
| Having a compiler output multiple versions is an
| interesting idea. Pretty sure it won't happen though;
| it'd be a rather difficult political mess of more and
| more "please add special-casing of my hardware", and
| would have the problem of it ceasing to reasonably
| function on hardware released after being compiled
| (unless like glibc or something gets some standard set of
| hardware performance properties that can be updated
| independently of precompiled software, which'd be extra
| hard to get through). Also P-cores vs E-cores would add
| an extra layer of mess. There might be some simpler
| version of just going by VLEN, which is always constant,
| but I don't see much use in that really.
| janwas wrote:
| > it's a matter of the wrong configuration unlike x86
| where it is a matter of the wrong instruction
|
| +1 to dzaima's mention of vrgather. The lack of fixed-
| pattern shuffle instructions in RVV is absolutely a
| wrong-instruction issue.
|
| I agree with your point that multiple code variants +
| runtime dispatch are helpful. We do this with Highway in
| particular for x86. Users only write code once with
| portable intrinsics, and the mess of instruction
| selection is taken care of.
| camel-cdr wrote:
| > +1 to dzaima's mention of vrgather. The lack of fixed-
| pattern shuffle instructions in RVV is absolutely a
| wrong-instruction issue.
|
| What others would you want? Something like vzip1/2 would
| make sense, but that isn't much of a permutation, since
| the input elements are exactly next to the output
| elements.
| janwas wrote:
| Going through Highway's set of shuffle ops:
|
| 64-bit OddEven/Reverse2/ConcatOdd/ConcatEven,
| OddEvenBlocks, SwapAdjacentBlocks, 8-bit Reverse,
| CombineShiftRightBytes, TableLookupBytesOr0 (=PSHUFB) and
| Broadcast especially for 8-bit, TwoTablesLookupLanes,
| InsertBlock, InterleaveLower/InterleaveUpper (=vzip1/2).
|
| All of these are considerably more expensive on RVV. SVE
| has a nice set, despite also being VL-agnostic.
| dzaima wrote:
| More RVV questionable optimization cases:
|
| - broadcasting a loaded value: a stride-0 load can be
| used for this, and could be faster than going through a
| GPR load & vmv.v.x, but could also be much slower.
|
| - reversing: could use vrgather (could do high LMUL
| everywhere and split into multiple LMUL=1 vrgathers),
| could use a stride -1 load or store.
|
| - early-exit loops: It's feasible to vectorize such, even
| with loads via fault-only-first. But if vl=vlmax is used
| for it, it might end up doing a ton of unnecessary
| computation, esp. on high-VLEN hardware. Though there's
| the "fun" solution of hardware intentionally lowering vl
| on fault-only-first to what it considers reasonable as
| there aren't strict requirements for it.
| atq2119 wrote:
| The trend seems to be going towards multiple decoder
| complexes. Recent designs from AMD and Intel do this.
|
| It makes sense to me: if the distance between branches is
| small, a 10-wide decode may be wasted anyway. Better to
| decode multiple basic blocks in parallel.
| dzaima wrote:
| Expanding on 3: I think it ends up at O(n^2 * log n)
| transistors, O(log n) critical path (not sure on routing
| or what fan-out issues might there be).
|
| Basically: determine end of instruction at each byte
| (trivial but expensive). Determine end of two
| instructions at each byte via end2[i]=end[end[i]]. Then
| end4[i]=end2[end2[i]], etc, log times.
|
| That's essentially log(n) shuffles. With 32-byte/cycle
| decode that's roughly five 'vpermb ymm's, which is rather
| expensive (though various forms of shortcuts should exist
| - for the larger layers direct chasing is probably
| feasible, and for the smaller ones some special-casing of
| single-byte instructions could work).
|
| And, actually, given the mention of O(log n)-transistor
| shuffles at http://www.numberworld.org/blogs/2024_8_7_zen
| 5_avx512_teardo..., it might even just be O(n * log^2(n))
| transistors.
|
| Importantly, x86 itself plays no part in the non-trivial
| part. It applies equally to the RISC-V compressed
| extension, just with a smaller constant.
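|
| (A small Python sketch of that doubling scheme, assuming
| end[i] has already been computed and clamped to the fetch
| window:)
|
|     def doubled_ends(end, rounds):
|         # end[i] = offset just past the instruction that
|         # starts at byte i. Each round composes the table
|         # with itself: end2[i] = end[end[i]], and so on.
|         n = len(end)
|         e = list(end)
|         for _ in range(rounds):
|             e = [e[x] if x < n else x for x in e]
|         return e  # e[i] skips 2**rounds instructions
|
|     # Five rounds cover up to 32 instruction starts in a
|     # 32-byte fetch window.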
| hajile wrote:
| Determining the end of a RISC-V instruction requires
| checking two bits and you have the knowledge that no
| instruction exceeds 4 bytes or uses less than 2 bytes.
|
| x86 requires checking for a REX, REX2, VEX, EVEX, etc
| prefix. Then you must check for either 1 or 2 instruction
| bytes. Then you must check for the existence of a
| register byte, how many immediate byte(s), and if you use
| a scaled index byte. Then if a register byte exists, you
| must check it for any displacement bytes to get your
| final instruction length total.
|
| RISC-V starts with a small complexity then multiplies it
| by a small amount. x86 starts with a high complexity then
| multiplies it by a big amount. The real world difference
| here is large.
|
| As I pointed out elsewhere ARM's A715 dropped support for
| aarch32 (which is still far easier to decode than x86)
| and cut decoder size by 75% while increasing raw decoder
| count by 20%. The decoder penalties of bad ISA design
| extend beyond finding instruction boundaries.
| dzaima wrote:
| I don't disagree that the real-world difference is
| massive; that much is pretty clear. I'm just pointing out
| that, as far as I can tell, it's all just a question of a
| constant factor, it's just massive. I've written half of
| a basic x86 decoder in regular imperative code, handling
| just the baseline general-purpose legacy encoding
| instructions (determines length correctly, and determines
| opcode & operand values to some extent), and that was
| already a lot of work.
| clamchowder wrote:
| Some notes: 1. Consider whether M1's 8-wide decoder could
| hit the 5+ GHz clock speeds that Intel Golden Cove's
| decoder can. More complex logic with more delays is harder
| to clock up. Of course M1 may be held back by another
| critical path, but it's interesting that no one has managed
| to get an 8-wide Arm decoder running at the clock speeds
| that Zen 3/4 and Golden Cove can.
|
| A715's slides say the L1 icache gains uop cache features
| including caching fusion cases. Likely it's a predecode
| scheme much like AMD K10, just more aggressive with
| what's in the predecode stage. Arm has been doing
| predecode (moving some stages to the L1i fill path rather
| than the hotter L1i hit path) to mitigate decode costs
| for a long time. Mitigating decode costs _again_ with a
| uop cache never made much sense especially considering
| their low clock speeds. Picking one solution or the other
| is a good move, as Intel/AMD have done. Arm picked
| predecode for A715.
|
| 2. The paper does not say 22% of core power is in the
| decoders. It does say core power is ~22% of package
| power. Wrong figure? Also, can you determine if the
| decoder power situation is different on Arm cores? I
| haven't seen any studies on that.
|
| 3. Multiple decoder blocks doesn't penalize decoder
| blocks once the load balancing is done right, which
| Gracemont did. And you have to _massively_ unroll a loop
| to screw up Tremont anyway. Conversely, decode blocks may
| lose less throughput with branchy code. Consider that
| decode slots after a taken branch are wasted, and
| clustered decode gets around that. Intel stated they
| preferred 3x3 over 2x4 for that reason.
|
| 4. "uops used by ARM are extremely close to the original
| instructions" It's the same on x86, micro-op count is
| nearly equal to instruction count. It's helpful to gather
| data to substantiate your conclusions. For example, on
| Zen 4 and libx264 video encoding, there's ~4.7% more
| micro-ops than instructions. Neoverse V2 retires ~19.3%
| more micro-ops than instructions in the same workload.
| Ofc it varies by workload. It's even possible to get
| negative micro-op expansion on both architectures if you
| hit branch fusion cases enough.
|
| 8. You also have to tell your ARM compiler which of the
| dozen or so ISA extension levels you want to target (see
| https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#i
| nde...). It's not one option by any means. Not sure what
| you mean by "peephole heuristic optimizations", but
| people certainly micro-optimize for both arm and x86. For
| arm, see
| https://github.com/dotnet/runtime/pull/106191/files as an
| example. Of course optimizations will vary for different
| ISAs and microarchitectures. x86 is more widely used in
| performance critical applications and so there's been
| more research on optimizing for x86 architectures, but
| that doesn't mean Arm's cores won't benefit from similar
| optimization attention should they be pressed into a
| performance critical role.
| hajile wrote:
| 1. Why would you WANT to hit 5+GHz when the downsides of
| exponential power take over? High clocks aren't a feature
| -- they are a cope.
|
| AMD/Intel maintain I-cache and maintain a uop cache kept
| in sync. Using a tiny part to pre-decode is different
| from a massive uop cache working as far in advance as
| possible in the hopes that your loops will keep you busy
| enough that your tiny 4-wide decoder doesn't become
| overwhelmed.
|
| 2. The float workload was always BS because you can't run
| nothing but floats. The integer workload had 22.1w total
| core power and 4.8w power for the decoder. 4.8/22.1 is
| 21.7%. Even the 1.8w float case is 8% of total core
| power. The only other argument would be that the study is
| wrong and 4.8w isn't actually just decoder power.
|
| 3. We're talking about worst cases here. Nothing stops
| ARM cores from creating a "work pool" of upcoming
| branches in priority order for them to decode if they run
| out of stuff on the main branch. This is the best of both
| worlds where you can be faster on the main branch AND
| still do the same branchy code trick too.
|
| 4. This is the tail wagging the dog (and something else
| if your numbers are correct). Complex x86 instructions
| have garbage performance, so they are avoided by the
| compiler. The problem is that you can't GUARANTEE those
| instructions will NEVER be used, so the mere specter of
| them forces complex algorithms all over the place where
| ARM can do more simple things.
|
| In any case, your numbers raise a VERY interesting
| question about x86 being RISC under the hood.
|
| Consider this. Say that we have 1024 bytes of ARM code
| (256 instructions). x86 is around 15% smaller (871.25
| bytes) and with the longer 4.25 byte instruction average,
| x86 should have around 205 instructions. If ARM is
| generating 19.3% more uops than instructions, we have
| about 305 uops. x86 with just 4.7% more has 215 uops (the
| difference is way outside any margin of error).
|
| If both are doing the same work, x86 uops must be in the
| range of 30% more complex. Given the limits of what an
| ALU can accomplish, we can say with certainty that x86
| uops are doing SOMETHING that isn't the RISC they claim
| to be doing. Perhaps one could claim that x86 is doing
| some more sophisticated instructions in hardware, but
| that's a claim that would need to be substantiated (I
| don't know what ISA instructions you have that give a 15%
| advantage being done in hardware, but aren't already in
| the ARM ISA and I don't see ARM refusing to add circuitry
| for current instructions to the ALU if it could reduce
| uops by 15% either).
|
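| (That uop arithmetic as a back-of-the-envelope Python
| sketch; the byte counts and expansion ratios are the ones
| quoted in this thread:)
|
|     arm_bytes = 1024
|     arm_insns = arm_bytes / 4      # fixed 4-byte insns -> 256
|     x86_bytes = arm_bytes * 0.85   # "around 15% smaller"
|     x86_insns = x86_bytes / 4.25   # 4.25-byte average -> ~205
|     arm_uops  = arm_insns * 1.193  # +19.3% expansion -> ~305
|     x86_uops  = x86_insns * 1.047  # +4.7% expansion -> ~215
|     print(arm_uops / x86_uops)     # ~1.42: ~30% fewer x86 uops
|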
| 8. https://en.wikipedia.org/wiki/Peephole_optimization
|
| The final optimization stage is basically heuristic find
| & replace. There could in theory be a mathematically
| provable "best instruction selection", but finding it
| would require trying every possible combination, which
| isn't feasible (assuming P != NP).
|
| My favorite absurdity of x86 (though hardly the only one)
| is padding. You want to align function calls at cacheline
| boundaries, but that means padding the previous cache
| line with NOPs. Those NOPs translate into uops though.
| Instead, you take your basic, short instruction and pad
| it with useless bytes. Add a couple useless bytes to a
| bunch of instructions and you now have the right length
| to push the function over to the cache boundary without
| adding any NOPs.
|
| But the issues go deeper. When do you use a REX prefix?
| You may want it so you can use 16 registers, but it also
| increases code size. REX2 with APX is going to increase
| this issue further where you must juggle when to use 8,
| 16, or 32 registers and when you should prefer the long
| REX2 because it has 3-register instructions. All kinds of
| weird tradeoffs exist throughout the system. Because the
| compilers optimize for the CPU and the CPU optimizes for
| the compiler, you can wind up in very weird places.
|
| In an ISA like ARM, there isn't any code density
| weirdness to consider. In fact, there's very little
| weirdness at all. Write it the intuitive way and you're
| pretty much guaranteed to get good performance. Total
| time to work on the compiler is a zero-sum game given the
| limited number of experts. If you have to deal with these
| kinds of heuristic headaches, there's something else you
| can't be working on.
| clamchowder wrote:
| 1. Performance. Also Arm implemented instruction cache
| coherency too.
|
| Predecode/uop cache are both means to the same end,
| mitigating decode power. AMD and Intel have used both
| (though not on the same core). Arm has used both,
| including both on the same core for quite a few
| generations.
|
| And a uop cache is just a cache. It's also big enough on
| current generations to cache more than just loops, to the
| point where it covers a majority of the instruction
| stream. Not sure where the misunderstanding of the uop
| cache "working as far in advance is possible" comes from.
| Unless you're talking about the BPU running ahead and
| prefetching into it? Which it does for L1i, and L2 as
| well?
|
| 2. "you can't run nothing but floats" they didn't do that
| in the paper, they did D += A[j] + B[j] * C[j]. Something
| like matrix multiplication comes to mind, and that's not
| exactly a rare workload considering some ML stuff these
| days.
|
| But also, has a study been done on Arm cores? For all we
| know they could spend similar power budgets on decode, or
| more. I could say an Arm core uses 99% of its power
| budget on decode, and be just as right as you are (they
| probably don't, my point is you don't have concrete data
| on both Arm and x86 decode power, which would be
| necessary for a productive discussion on the subject)
|
| 3. You're describing letting the BPU run ahead, which
| everyone has been doing for the past 15 years or so.
| Losing fetch bandwidth past a taken branch is a different
| thing.
|
| 4. Not sure where you're going. You started by suggesting
| Arm has less micro-op expansion than x86, and I provided
| a counterexample. Now you're talking about avoiding
| complex instructions, which a) compilers do on both
| architectures, they'll avoid stuff like division, and b)
| humans don't in cases where complex instructions are
| beneficial, see Linux kernel using rep movsb (https://git
| hub.com/torvalds/linux/blob/5189dafa4cf950e675f02...),
| and Arm introducing similar complex instructions
| (https://community.arm.com/arm-community-
| blogs/b/architecture...)
|
| Also "complex" x86 instructions aren't avoided in the
| video encoding workload. On x86 it takes ~16.5T
| instructions to finish the workload, and ~19.9T on Arm
| (and ~23.8T micro-ops on Neoverse V2). If "complex" means
| more work per instruction, then x86 used more complex
| instructions, right?
|
| 8. You can use one variable-length NOP on x86, or
| multiple NOPs on Arm, to align function calls to
| cacheline boundaries. What's the difference? If anything
| the latter is worse when you need to move by more than 4
| bytes, since you need multiple NOPs (and thus multiple
| uops; you assume that's the case, but it isn't always
| true, as some x86 and some Arm CPUs can fuse NOP pairs).
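|
| Concretely (a sketch; the exact NOP bytes are the
| assembler's choice):
|
|     // built with e.g. gcc -O2 -falign-functions=64, the
|     // assembler pads the gap before f():
|     void f(void) {}
|     // x86: a single long NOP covers 1-15 bytes, e.g.
|     //      0F 1F 84 00 00 00 00 00 (one 8-byte NOP)
|     // aarch64: NOP is a fixed 4 bytes (d503201f), so an
|     //      8-byte gap takes two of them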
|
| But seriously, do try gathering some data to see if
| cacheline alignment matters. A lot of x86/Arm cores that
| do micro-op caching don't seem to care if a function (or
| branch target) is aligned to the start of a cacheline.
| Golden Cove's return predictor does appear to track
| targets at cacheline granularity, but that's a special
| case. Earlier Intel and pretty much all AMD cores don't
| seem to care, nor do the Arm ones I've tested.
|
| Anyway, you're making a lot of unsubstantiated guesses on
| "weirdness" without anything to suggest it has any
| effect. I don't think this is the right approach. Instead
| of "tail wagging the dog" or whatever, I suggest a data-
| based approach where you conduct experiments on some
| x86/Arm CPUs, and analyze some x86/Arm programs. I guess
| the analogy is, tell the dog to do something and see how
| it behaves? Then draw conclusions off that?
| hajile wrote:
| 1. The biggest chip market is laptops and getting 15%
| better performance for 80% more power (like we saw with X
| Elite recently) isn't worth doing outside the marketing
| win of a halo product (a big reason why almost everyone
| is using slower X Elite variants). The most profitable
| (per-chip) market is servers. They also prefer lower
| clocks and better perf/watt because even with the high
| chip costs, the energy will wind up costing them more
| over the chip's lifespan. There's also a real cost to
| adding extra pipeline stages. Tejas/Jayhawk cores are
| Intel's cancelled examples of this.
|
| L1 cache is "free" in that you can fill it with simple
| data moves. uop cache requires actual work to decode and
| store elements for use in addition to moving the data. As
| to working ahead, you already covered this yourself. If
| you have a nearly 1-to-1 instruction-to-uop ratio, having
| just 4 decoders (eg, zen4) is a problem because you can
| execute a lot more than just 4 instructions on the
| backend. Zen4's 6-wide backend can consume 50% more
| instructions per clock than its 4 decoders supply. You
| make up for this in loops,
| but that means while you're executing your current loop,
| you must be maxing out the decoders to speculatively fill
| the rest of the uop cache before the loop finishes. If
| the loop finishes and you don't have the next bunch of
| instructions decoded, you have a multi-cycle delay coming
| down the pipeline.
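|
| As a toy model of that frontend math (widths and hit
| rates are illustrative, not measured Zen4 figures):
|
|     // sustainable uops/cycle when uop-cache misses fall
|     // back to the decoders
|     double frontend_width(double hit, double uop_w,
|                           double dec_w) {
|         return hit * uop_w + (1.0 - hit) * dec_w;
|     }
|     // frontend_width(0.9, 8, 4) = 7.6 uops/cycle
|     // frontend_width(0.5, 8, 4) = 6.0, the bare minimum
|     // to keep a 6-wide backend fed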
|
| 2. I'd LOVE to see a similar study of current ARM chips,
| but I think the answer here is pretty simple to deduce.
| ARM's slide says "4x smaller decoders vs A710" despite
| adding a 5th decoder. They claim 20% reduction in power
| at the same performance and the biggest change is the
| decoder. As x86 decode is absolutely more complex than
| aarch32, we can only deduce that switching from x86 to
| aarch64 would be an even more massive reduction. If we
| assume an identical 75% reduction in decoder power, we'd
| move the decoder from 4.8w on Haswell down to 1.2w,
| reducing total core power from 22.1w to 18.5w, a ~16%
| overall reduction. That isn't too far from the power
| numbers claimed by ARM.
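|
| Spelling out that arithmetic (4.8w and 22.1w are the
| Haswell figures from the paper cited earlier; the 75% cut
| is an assumption borrowed from ARM's claim):
|
|     #include <stdio.h>
|     int main(void) {
|         double dec = 4.8, core = 22.1;
|         double dec_after = dec * 0.25;            // 1.2w
|         double core_after = core - (dec - dec_after);
|         printf("%.1fw core, %.0f%% saved\n", core_after,
|                100.0 * (core - core_after) / core);
|         return 0;              // 18.5w core, 16% saved
|     }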
|
| 4. This was a tangent. I was talking about uops rather
| than the ISA. Intel claims to be simple RISC internally
| just like ARM, but if Intel is using nearly 30% fewer
| uops to do the same work, their "RISC" backend is way
| more complex than they're admitting.
|
| 8. I believe aligning functions to cacheline boundaries
| is a default flag at higher optimization levels. I'm
| pretty sure that they did the analysis before enabling
| this by default. x86 NOP flexibility is superior to ARM
| (as is its ability to avoid them entirely), but the cause
| is the weirdness of the x86 ISA and I think it's an
| overall net negative.
|
| Loads of x86 instructions are microcode only. Use one and
| it'll be thousands of cycles. They remain in microcode
| because nobody uses them, so why even try to optimize and
| they aren't used because they are dog slow. How would you
| collect data about this? Nothing will ever change unless
| someone pours millions of dollars of man-hours into
| speeding them up, but why would anyone want to
| do that?
|
| Optimizing for a local maximum rather than the global
| maximum happens all over technology, and it happens
| exactly
| because of the data-driven approach you are talking
| about. Look for the hot code and optimize it without
| regard for the possibility that a better architecture
| exists. Many successes relied on an intuitive
| hunch.
|
| ISA history has a ton of examples. iAPX432 super-CISC,
| the RISC movement, branch delay slots, register windows,
| EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All
| of these were attempts to find new maxima with greater or
| lesser degrees of success. When you look into these,
| pretty much NONE of them had any real data to drive them
| because there wasn't any data until they'd actually
| started work.
| clamchowder wrote:
| 1. Yeah I agree, both X Elite and many Intel/AMD chips
| clock well past their efficiency sweet spot at stock.
| There is a cost to extra pipeline stages, but no one is
| designing anything like Tejas/Jayhawk, or even earlier P4
| variants these days. Also P4 had worse problems (like not
| being able to cancel bogus ops until retirement) than
| just a long pipeline.
|
| Arm's predecoded L1i cache is not "free" and can't be
| filled with simple data moves. You need predecode logic
| to translate raw instruction bytes into an intermediate
| format. If Arm expanded predecode to handle fusion cases
| in A715, that predecode logic is likely more complex than
| in prior generations.
|
| 2. Size/area is different from power consumption. Also
| the decoder is far from the only change. The BTBs were
| changed from 2 to 3 level, and that can help efficiency
| (could make a smaller L2 BTB with similar latency, while
| a slower third level keeps capacity up). TLBs are bigger,
| probably reducing page walks. Remember page walks are
| memory accesses and the paper earlier showed data
| transfers count for a large percentage of dynamic power.
|
| 4. IMO no one is really RISC or CISC these days
|
| 8. Sure you can align the function or not. I don't think
| it matters except in rare corner cases on very old cores.
| Not sure why you think it's an overall net negative.
| "feeling weird" does not make for solid analysis.
|
| Most x86 instructions are not microcode only. Again,
| check your data with performance counters. Microcoded
| instructions are in the extreme minority. Maybe
| microcoded instructions were more common in 1978 with the
| 8086, but a few things have changed between then and now.
| Also microcoded instructions do not cost thousands of
| cycles; have you checked? E.g., a gather is ~22 micro-ops
| on Haswell (per https://uops.info/table.html), while
| Golden Cove does it in 5-7 uops.
|
| ISA history has a lot of failed examples where people
| tried to lean on the ISA to simplify the core
| architecture. EPIC/VLIW, branch delay slots, and register
| windows have all died off. Mill is a dumb idea and never
| went anywhere. Everyone has converged on big OoO machines
| for a reason, even though doing OoO execution is really
| complex.
|
| If you're interested in cases where ISA does matter, look
| at GPUs. VLIW had some success there (AMD Terascale, the
| HD 2xxx to 6xxx generations). Static instruction
| scheduling is used in Nvidia GPUs since Kepler. In CPUs
| ISA really doesn't matter unless you do something that
| actively makes an OoO implementation harder, like
| register windows or predication.
| dzaima wrote:
| > My favorite absurdity of x86 (though hardly the only
| one) is padding. You want to align function calls at
| cacheline boundaries, but that means padding the previous
| cache line with NOPs. Those NOPs translate into uops
| though.
|
| I'd call that more neat than absurd.
|
| > You may want it so you can use 16 registers, but it
| also increases code size.
|
| RISC-V has the exact same issue, some compressed
| instructions having only 3 bits for operand registers.
| And on x86 for 64-bit-operand instructions you need the
| REX prefix always anyways. And it's not that hard to
| pretty reasonably solve - just assign registers by their
| use count.
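|
| A sketch of that use-count heuristic (names are made up;
| a real allocator folds this into its cost model rather
| than running a standalone pass):
|
|     #include <stdlib.h>
|
|     typedef struct { int vreg, uses, assigned; } VReg;
|
|     // hottest virtual registers first
|     static int by_uses(const void *a, const void *b) {
|         return ((const VReg *)b)->uses
|              - ((const VReg *)a)->uses;
|     }
|
|     // cheap[]: short-encoding registers (RVC's x8-x15,
|     // or x86's REX-free set); costly[]: everything else
|     void assign_by_use_count(VReg *v, int n,
|                              const int *cheap, int nc,
|                              const int *costly, int nk) {
|         qsort(v, n, sizeof *v, by_uses);
|         for (int i = 0; i < n && i < nc + nk; i++)
|             v[i].assigned = i < nc ? cheap[i]
|                                    : costly[i - nc];
|     }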
|
| Peephole optimizations specifically here are basically
| irrelevant. Much of the complexity for x86 comes from
| just register allocation around destructive operations
| (though, that said, that does have rather wide-ranging
| implications). Other than that, there's really not much
| difference; all have the same general problems of moving
| instructions together for fusing, reordering to reduce
| register pressure vs putting parallelizable instructions
| nearer, rotating loops to reduce branches, branches vs
| branchless.
| hajile wrote:
| RISC-V has a different version of this issue that is
| pretty straightforward. Preferring 2-register operations
| is already done to save code space. The only real extra
| is preferring the 8 registers the compressed (C)
| extension can encode.
| After this, it's all just compression.
|
| x86 has a multitude of other factors than just
| compression. This is especially true with standard vs REX
| instructions because most of the original 8 registers
| have specific purposes and instructions that depend on
| them (eg, accumulator instructions with the A register,
| mul/div using A+D, shifts using C, etc). It's a problem a
| lot harder than simple compression.
|
| Just as cracking an alphanumeric password is
| exponentially harder than a same-length password with
| numbers only, solving for all the x86 complications and
| exceptions is also exponentially harder.
| dzaima wrote:
| If anything, I'd say x86's fixed operands make register
| allocation easier! Don't have to register-allocate that
| which you can't. (ok, it might end up worse if you need
| some additional 'mov's. And in my experience more 'mov's
| is exactly what compilers often do.)
|
| And, right, RISC-V even has the problem of being two-
| operand for some compressed instructions. So the same
| register allocation code that's gone towards x86 can
| still help RISC-V (and vice versa)! On RISC-V, failure
| means a compressed instruction growing from 2 to 4 bytes,
| and on x86 it means +3 bytes for a 'mov'. (granted, the
| additional REX prefix cost is separate on x86, while it's
| folded into the compressed encoding on RISC-V)
| hajile wrote:
| With 16 registers, you can't just avoid a register
| because it has a special use. Instead, you must work to
| efficiently schedule around that special use.
|
| Lack of special GPRs means you can rename with impunity
| (this will change slightly with the load/store pair
| extension). Having 31 truly general-purpose registers
| rather than 8 GPRs plus 8 special-purpose ones also gives
| a lot of freedom to compilers.
| dzaima wrote:
| Function arguments and return values are already
| effectively special-use, and should be on par with, if
| not much more frequent than, the couple of x86
| instructions with fixed registers.
|
| Both clang and gcc support calls using differing calling
| conventions within one function, which ends up
| effectively exactly identical to fixed-register
| instructions (i.e. an x86 'imul r64' can be done via a
| pseudo-function where the return values are in rdx & rax,
| an input is in rax, and everything else is non-volatile;
| and the dynamically-choosable input can be allocated
| separately). And '__asm__()' can do mixed fixed and non-
| fixed registers anyway.
| hajile wrote:
| Unlike x86, none of this is strictly necessary. As long
| as you put things back as expected, you may use all the
| registers however you like.
| dzaima wrote:
| The option of not needing any fixed register usage would
| apply to, what, optimizing compilers without support for
| function calls (at least via passing arguments/results
| via registers)? That's a very tiny niche to use as an
| argument for having simplified compiler behavior.
|
| And good register allocation is still pretty important on
| RISC-V - using more registers, besides leading to less
| compressed instruction usage, means more non-volatile
| register spilling/restoring in function
| prologue/epilogue, which on current compilers (esp.
| clang) happens at the start & end of functions, even in
| paths that don't need the registers.
|
| That said, yes, RISC-V still indeed has much saner
| baseline behavior here and allows for simpler basic
| register allocation, but for non-trivial compilers the
| actual set of useful optimizations isn't that different.
| hajile wrote:
| Not just simpler basic allocation. There are fewer
| hazards to account for as well. The process on RISC-V
| should be shorter, faster, and with less risk that the
| chosen heuristics are bad in an edge case.
| neonsunset wrote:
| > Not sure what you mean by "peephole heuristic
| optimizations"
|
| Post-emit or within-emit stage optimization where a
| sequence of instructions is replaced with a more
| efficient shorter variant.
|
| Think replacing pairs of ldr and str with ldp and stp,
| replacing an ldr plus a separate increment with a
| post-index ldr, or folding an address calculation before
| an atomic load into the atomic load's addressing mode (I
| think that arrived in ARMv8.3-a?).
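|
| The ldp case as a toy matcher (a sketch over a made-up
| IR, not the actual RyuJIT pass):
|
|     // merge "ldr rt, [rn, #o]" + "ldr rt2, [rn, #o+4]"
|     // into "ldp rt, rt2, [rn, #o]"
|     typedef struct { int op, rt, rt2, rn, off; } Insn;
|     enum { LDR, LDP };
|
|     int try_merge_ldp(const Insn *a, const Insn *b,
|                       Insn *out) {
|         if (a->op == LDR && b->op == LDR &&
|             a->rn == b->rn && b->off == a->off + 4) {
|             *out = (Insn){ LDP, a->rt, b->rt,
|                            a->rn, a->off };
|             return 1;  // caller replaces both insns
|         }
|         return 0;
|     }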
|
| The "heuristic" here might be possibly related to
| additional analysis when doing such optimizations.
|
| For example, previously mentioned ldr, ldr -> ldp (or
| stp) optimization is not always a win. During work on
| .NET 9, there was a change[0] that improved load and
| store reordering to make it more likely that simple
| consecutive loads and stores are merged on ARM64.
| However, this change caused regressions in various hot
| paths because, for example, a previously matched sequence
| "ldr w0, [addr]; ldr w1, [addr+4]; modify w0; str w0,
| [addr]" got replaced with "ldp w0, w1, [addr]; modify w0;
| str w0, [addr]".
|
| Turns out this kind of merging defeated store forwarding
| on Firestorm (and newer) as well as other ARM cores. The
| regression was subsequently fixed[1], but I think the
| parent comment author may have had scenarios like these
| in mind.
|
| [0]: https://github.com/dotnet/runtime/pull/92768
|
| [1]: https://github.com/dotnet/runtime/pull/105695
| NobodyNada wrote:
| One more: there's more to an ISA than just the
| instructions; there's semantic differences as well. x86
| dates to a time before out-of-order execution, caches,
| and multi-core systems, so it has an extremely strict
| memory model that does not reflect modern hardware -- the
| only memory-reordering optimization permitted by the ISA
| is store buffering.
|
| Modern x86 processors will actually perform speculative
| weak memory accesses in order to try to work around this
| memory model, flushing the pipeline if it turns out a
| memory-ordering guarantee was violated in a way that
| became visible to another core -- but this has complexity
| and performance impacts, especially when applications
| make heavy use of atomic operations and/or communication
| between threads.
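|
| In C11 terms (a sketch; exact codegen varies by compiler
| and target):
|
|     #include <stdatomic.h>
|
|     atomic_int flag;
|
|     void publish(void) {
|         // a seq_cst store (the default) lowers roughly to:
|         //   x86-64: xchg  (implicitly locked full barrier)
|         //   ARMv8:  stlr  (a plain store-release)
|         atomic_store(&flag, 1);
|     }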
|
| Simple atomic operations can be an order of magnitude
| faster on ARMv8 vs x86:
| https://web.archive.org/web/20220129144454/https://twitter.c...
| sroussey wrote:
| Yes, and Apple added this memory model to their ARM
| implementation so Rosetta2 would work well.
| janwas wrote:
| > With 50+ years of figuring the basics out, RISC-V won't
| be making any major mistakes on the most important stuff.
|
| RVV does have significant departures from prior work, and
| some of them are difficult to understand:
|
| - the whole concept of avl, which adds complexity in many
| areas including reg renaming. From where I sit, we could
| just use masks instead.
|
| - mask bits reside in the lower bits of a vector, so we
| either require tons of lane-crossing wires or some kind
| of caching.
|
| - global state LMUL/SEW makes things hard for compilers
| and OoO.
|
| - LMUL is cool but I imagine it's not fun to implement
| reductions, and vrgather.
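|
| To ground the avl point, a strip-mined loop (a sketch
| assuming the v1.0 RVV C intrinsics):
|
|     #include <stddef.h>
|     #include <stdint.h>
|     #include <riscv_vector.h>
|
|     // a[i] += b[i]; vsetvl takes the remaining count
|     // (the avl) and returns the lanes processed (the vl)
|     void vadd(int32_t *a, const int32_t *b, size_t n) {
|         for (size_t vl; n > 0; n -= vl, a += vl, b += vl) {
|             vl = __riscv_vsetvl_e32m1(n);
|             vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);
|             vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
|             __riscv_vse32_v_i32m1(a,
|                 __riscv_vadd_vv_i32m1(va, vb, vl), vl);
|         }
|     }
|
| The mask-only alternative would run vlmax lanes every
| iteration and disable the tail with a mask instead.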
| dzaima wrote:
| How does avl affect register renaming? (there's the edge-
| case of vl=0 that is horrifically stupid (which is by
| itself a mistake for which I have seen no justification
| but whatever) but that's probably not what you're
| thinking of?) Agnostic mode makes it pretty simple for
| hardware to do whatever it wants.
|
| Over masks it has the benefit of allowing simple hardware
| short-circuiting, though I'd imagine it'd be cheap enough
| to 'or' together mask bit groups to short-circuit on (and
| would also have the benefit of better masked throughput)
|
| Cray-1 (1976) had VL, though, granted, that's a pretty
| long span of no-VL until RVV.
| janwas wrote:
| Was thinking of a shorter avl producing partial results
| merged into another reg. Something like a += b; a[0] +=
| c[0]. Without avl we'd just have a write-after-write, but
| with it, we now have an additional input, and whether
| this happens depends on global state (VL).
|
| Espasa discusses this around 6:45 of
| https://www.youtube.com/watch?v=WzID6kk8RNs.
|
| Agree agnostic would help, but the machine also has to
| handle SW asking for mask/tail unchanged, right?
| camel-cdr wrote:
| > Agree agnostic would help, but the machine also has to
| handle SW asking for mask/tail unchanged, right?
|
| Yes, but it should rarely do so.
|
| The problem is that because of the vl=0 case you always
| have a dependency on avl. I think the motivation for the
| vl=0 case was that any serious ooo implementation will
| need to predict vl/vtype anyways, so there might as well
| be this nice-to-have feature.
|
| IMO they should've only supported ta,mu. I think the only
| use case for ma is when you need to avoid exceptions. And
| while tu is useful, e.g. summing an array, it could be
| handled differently. E.g. once vl<vlmax you write the
| sum to a different vector and do two reductions (or
| rather two different vectors, given the avl-to-vl rules).
| dzaima wrote:
| What's the "nice to have feature" of vl=0 not modifying
| registers? I can't see any benefit from it. If anything,
| it's worse, due to the problems on reduce and vmv.s.x.
| camel-cdr wrote:
| "nice to hace" because it removes the need for a branch
| for the n=0 case, for regular loops you probably still
| want it, but there are siturations were not needing to
| worry about vl=0 corrupting your data is somewhat nice.
| dzaima wrote:
| Huh, in what situation would vl=0 clobbering registers be
| undesirable while on vl>=1 it's fine?
|
| If hardware will be predicting vl, I'd imagine that would
| break down anyway. Potentially catastrophically so if
| hardware always chooses to predict vl=0 doesn't happen.
| dzaima wrote:
| > Agree agnostic would help, but the machine also has to
| handle SW asking for mask/tail unchanged, right?
|
| The agnosticness flags can be forwarded at decode-time
| (at the cost of the non-immediate-vtype vsetvl being very
| slow), so for most purposes it could be as fast as if it
| were a bit inside the vector instruction itself. Doesn't
| help vl=0 though.
| anvuong wrote:
| That was true when ARM was first released, but over the
| years the decoder for ARM has gotten more and more
| complicated. Who would have guessed adding more specialized
| instructions would result in more complicated decoders? ARM
| now uses multi-stage decoders, just the same as x86.
| IshKebab wrote:
| Sure, but it's not idle power consumption that's the
| difference between these.
| wmf wrote:
| When a laptop gets 12 hours or more of battery life that's
| because it's 90% idle.
| jeffbee wrote:
| And while it's important to design a chip that can enter
| a deep idle state, the thing that differentiates one
| Windows laptop from the next is how many mistakes the
| BIOS writers made and whether the platform drivers work
| correctly. This is also why you cannot really judge the
| expected battery life under Linux by reading reviews of
| laptops running Windows.
| dagmx wrote:
| Battery tests are important, but so is how a machine
| fares on battery (what performance drop-off it takes to
| maintain that battery life), what its performance is at
| its peak, and how long it runs before throttling when
| pushed.
|
| The M series processors have succeeded in all four: battery
| life, performance parity between battery and plugged in, high
| performance and performance sustainability.
|
| So far, very few benchmarks have been comparing the latter
| three as part of the full package assessment.
| GrumpyYoungMan wrote:
| The display, RAM, and other peripherals are consuming power
| too. Short of running continuous high CPU loads, which most
| people don't do on laptops, changes in CPU efficiency have less
| apparent effect on battery life because the CPU is only a
| fraction of overall power draw.
| Panzer04 wrote:
| Battery life beyond ~5hr in laptops is mostly a function
| of how well idle is managed, I think. The less work you
| do while running the user's core program, the better. I'm
| not sure how much impact CPU efficiency really has in
| that case.
|
| If you are running a remotely demanding program (say, a
| game), your battery life will be bad no matter what (ie.
| <4hrs) unless you choose a very low TDP that always
| performs badly.
|
| A laptop at idle should be able to manage ~5w power
| consumption regardless of AMD/Intel/Apple processor, but
| it's largely on the OS to achieve that.
| 999900000999 wrote:
| I have a 365 AMD laptop.
|
| The battery is great if you're doing very light stuff;
| Call of Duty takes its battery down to 3 hours.
|
| Macs don't really support higher end games, so I can't directly
| compare to my M1 Air.
| sedatk wrote:
| How does "great" translate to hours?
| 999900000999 wrote:
| This is really tricky.
|
| The OEMs will use every trick possible and do something
| like opening Gmail to claim 10 hours, but given my
| typical use I average 5 to 6. I make music using software
| called Maschine.
|
| It's a massive step up over my old (still working, just
| very heavy) 2020 Lenovo Legion, which would last about 2
| hours given the same usage.
|
| This is all subjective at the end of the day. If none of
| your applications actually work since you're on ARM
| Windows, of course you'll have higher battery life.
| cubefox wrote:
| Unlike AMD and Qualcomm, Apple uses an expensive TSMC 3nm
| process, so you would expect better battery life from the
| "MBP3". I assume they used the process improvements to increase
| performance instead.
| hajile wrote:
| Perf per watt is higher for M1 on N5 vs Zen5 on N4P, so the
| problems go deeper than just process.
|
| X Elite also beats AMD/Intel in perf/watt while being on the
| same N4P node as HX370.
|
| https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...
| double0jimb0 wrote:
| I didn't watch this link, but my Zenbook S 16 only gets
| remotely close to my M2 MBA's battery life if the Zenbook
| is in whatever Windows 11 calls 'efficiency' mode, and
| then it benchmarks at 50% of the M2.
|
| I don't think the two are remotely comparable in perf/watt.
| nullc wrote:
| Speaking of Zen 5, are there any rumors on when 128 core turin-x
| will ship?
| apatheticonion wrote:
| Are there any mini PCs with Zen 5?
| wmf wrote:
| You could probably put a 9700X in a MS-A1. Besides that it will
| take a few months.
| hajile wrote:
| I think there's a lot of hope for a Strix Halo mini-PC.
|
| We currently have 7840HS+6650M with two massive heatsinks just
| barely coming in at what you would consider a large mini-PC
| rather than a SFFPC.
|
| Just one chip cuts the heatsink demand in half. The cores
| should be faster and it moves from 28CU up to 40CU and from
| RDNA2 to RDNA3.5. As long as the cross-core-complex latency
| isn't as bad as the HX370, I think it could be a real winner
| for a long time as it's basically an upgraded PS5 that runs
| Linux/Windows.
| moffkalast wrote:
| Hell I'd settle for a Z1 SBC or even mini PC. They can't seem
| to keep up with demand for these new chips to put them into
| any sort of products that don't have complete mass appeal.
| It's impossible to find one that's not in a handheld gaming
| console. I doubt they'll even make enough of the Halo to
| cover laptops.
| wmf wrote:
| Z1 is basically the same as 8840U which you can find.
| sliken wrote:
| "AMD Strix Point expected to debut in October, claims AOOSTAR"
| is a recent headline I've seen. Seems about right that
| laptops (higher margin parts) land a few months before
| the SFFs.
|
| I'm excited about the Strix Halo, which has more cores,
| double the GPU, and double the memory bandwidth (to keep
| the GPU fed).
| JonChesterfield wrote:
| There seem to be some on hawk point but not many. I'd like to
| replace a PN50 (4800u system) from years ago but am attached to
| it being small enough to fit in the cable tidy under the desk -
| the 4" form factor seems to have grown a little over time.
| CyberDildonics wrote:
| What does "strix point" mean?
| icegreentea2 wrote:
| It's AMD's code/product name for their mobile CPUs with the
| Zen5 microarchitecture.
|
| The last AMD laptop code name was "Hawk Point" - strix is a
| mythological bird. Who knows if AMD will keep with this naming
| scheme.
| jml7c5 wrote:
| I sit firm in my belief that the best thing Microsoft could do
| for their laptop ecosystem is to add support for a "max fan
| speed" slider somewhere prominent in the Windows UI.
|
| People want the option to make their laptop silent or nearly
| silent. And when users do need the power, they generally prefer a
| slightly slower laptop at a reasonable volume rather than the
| roar of a jet engine.
|
| Laptop manufacturers want their devices to score high on
| benchmarks. The best way to do that is to add a fan that can
| become very loud.
|
| The incentives are not aligned.
|
| All laptops should be designed to operate passively 100% of the
| time, if the owner so chooses. I doubt manufacturers will go that
| route unless Microsoft nudges them towards it. It would have
| downstream effects on how review sites benchmark laptops (i.e.,
| at various power draws/noise levels producing a curve rather than
| a single number), which would have downstream effects on what CPU
| designers optimize for. It'd be great for consumers.
| latchkey wrote:
| I don't know anything about Windows, but at least on Mac, I've
| been using TGPro for years [0]. I'd assume there is something
| similar in the Windows world.
|
| In normal conditions my M1 mac can control its fans just fine,
| but when I travel to hot places like Vietnam... I just keep the
| fans on more often and my machine doesn't get nearly as hot. I
| end up having to open it up after a few months and clean out
| the fans, but that's fine.
|
| [0] https://www.tunabellysoftware.com/tgpro/
| kevin_thibedeau wrote:
| Stop demanding paper thin laptops. My work Dell rarely turns on
| its fan unless an AV scan is in progress and even then it's
| rather tolerable. It isn't a fashionable thickness, so it
| has plenty of internal volume for heat distribution.
| rafaelmn wrote:
| MacBook Air is thin and fanless, so it can be done.
| cmeacham98 wrote:
| The cheapest MacBook Air is $1000, and it's more like
| $1500+ if you want a reasonable amount of RAM and storage.
| There are similarly expensive Windows laptops available
| that are fanless.
| rafaelmn wrote:
| Mind linking some (genuinely curious, would like to
| checkout potential Linux machine for the next upgrade)
| nov21b wrote:
| Count me in
| jacooper wrote:
| Surface laptop.
| tedunangst wrote:
| Not even the arm version of surface laptop is fanless.
| aurareturn wrote:
| >There are similarly expensive Windows laptops available
| that are fanless.
|
| Such as?
| phonon wrote:
| Wait until Lunar Lake comes out.
| hajile wrote:
| I spent $1700 or so on my M1 Air not too long after they
| were released. A ThinkPad X1 Carbon would have cost me
| more money for massively worse performance. Quality costs
| more.
|
| The difference is that a 4800U would be looking pretty
| bad vs a HX370 while the M1 still looks decent 4 years
| later (especially when that HX370 is unplugged).
| paulmd wrote:
| For $1200 you can easily pick up a decent refurb MBP -
| these are apple refurbs for example. OOS but an example
| of what you can find if you look around a bit.
|
| https://sellout.woot.com/offers/apple-14-macbook-pro-with-10...
|
| There's very little reason to chase the exact latest
| model when even a 2020/2021 M1 family is still great.
| fulafel wrote:
| Many things can be done if you don't have to coexist with AV.
|
| (Though the use of the fan is always a configuration choice
| with thermal management in CPUs these days).
| AnthonyMouse wrote:
| "Thin and fanless" aren't that hard, just use any low power
| CPU.
|
| But then people also want fast.
|
| Apple does this by buying out TSMC's capacity for the
| latest process nodes and then taking the
| performance/efficiency trade off in favor of efficiency, so
| they get something with similar performance and lower power
| consumption. But then they charge you $400 for $50 worth of
| RAM and solder it so you can't upgrade it yourself.
|
| The thing to realize is that fans are not required to spin,
| and the difference between the faster and slower processors
| is the clock speed rather than the transistors. So what
| you want is exactly what the OP requested: A laptop with
| fans in it, but you can turn them off. Then the CPU gets
| capped at a 15W TDP, has basically the same single-thread
| performance but is slower on threaded workloads, and it's
| no longer possible for Chrome to make your laptop loud.
|
| But if you want to open up Blender you still have the
| option to move up the fan slider and make it go faster.
| thecompilr wrote:
| The M1 was made on 5nm, which has long been available to
| AMD and other competitors in volume.
| AnthonyMouse wrote:
| Fast is relative. The Ryzen HX 370 has a TDP configurable
| down to 15W and at that power level it could be run
| fanless and would be faster than the M1, but it's still
| faster yet if you give it 54W and raise the clock speed.
| aurareturn wrote:
| I'm going to need source on that.
|
| What does HX 370 score at 10w?
| schmidtleonard wrote:
| Is that the chip AMD just released? Isn't the M1 about 4
| years old?
| ThatMedicIsASpy wrote:
| Don't run Windows and you don't need fast. Unfortunately
| Linux on notebooks is always a dice roll of random
| features (cam, fingerprint, ..) not working.
|
| There is a lot of older hardware running like crap
| because Windows just bloats up.
| MobiusHorizons wrote:
| I know everyone on this site loves to hate on soldered
| ram, but my impression is most people don't understand
| that soldered ram is not the same thing as regular ram
| modules. They are literally different memory chips (LPDDR
| vs DDR). When built for a specific chip, my understanding
| is you can design for tighter timings and higher
| bandwidth which is important for the gpu. The M1 shipped
| with very fast LPDDR4X running at 4266MT/s which was even
| pretty fast by XMP desktop speeds at the time (2020).
| There are real engineering advantages to soldered ram
| especially if the memory controller has been designed to
| take advantage of it. I guess it is similar to
| how gpu memory configurations are specialized and not
| modular.
| tuna74 wrote:
| I want a thin and fanless laptop. You might want something
| else.
| shmerl wrote:
| You can use Linux and hwmon.
| mqus wrote:
| > add support for a "max fan speed" slider somewhere prominent
| in the Windows UI.
|
| Isn't that what the "power settings" do? It's a slider at the
| bottom right, hidden in a tray icon. Sure, it only has three
| positions and also influences battery consumption but it pretty
| much does what you want. (Not sure if windows 11 kept this
| though)
| nightski wrote:
| I don't mind fans at all, in fact I find fan noises a little
| soothing (a childhood thing, we didn't have AC). Everyone has
| different priorities, personally I'd prefer to not have
| throttled performance.
| sandreas wrote:
| Have you heard of NBFC (Notebook Fan Control)? It's old,
| but it still works on some devices and you can create
| custom profiles via XML.
|
| https://github.com/hirschmann/nbfc
| jwells89 wrote:
| It might also be good to mandate 10+ hours of battery life when
| the laptop is in power saving mode. A number of laptops that'd
| otherwise have decent battery life are hampered by things like
| half-baked power management of discrete GPUs that doesn't
| completely cut power supply to those components. Manufacturers
| should be more heavily testing under this mode.
| devbent wrote:
| A few misbehaving CSS filters can make my discrete GPU turn
| on and at that point my battery life is a goner. Not sure who
| to blame in that scenario.
|
| There was an old bug in FF around 2018 where a tab using a
| GPU would prevent a Windows laptop from ever sleeping. That
| ended up destroying that laptop's battery after it got thrown
| in my backpack and overheated a couple times.
| jwells89 wrote:
| Seems like this could be fixed by a system setting that
| disables automatic graphics switching, which can be
| controlled by power profiles. That way the user can set the
| machine to use iGPU only when on battery, regardless of
| what programs want.
| automatic6131 wrote:
| >I sit firm in my belief that the best thing Microsoft could do
| for their laptop ecosystem is to add support for a "max fan
| speed" slider somewhere prominent in the Windows UI
|
| Until then, there's
| https://github.com/Rem0o/FanControl.Releases
| mkhalil wrote:
| A closed source application for controlling one's fan...umm
| no thank you.
|
| I never will understand the reasoning behind why people are
| so afraid of releasing their source code. Looks like a
| weekend project; does he expect to make a living out of a
| weekend project?
| Rinzler89 wrote:
| _> Looks like a weekend project;_
|
| Is that the best developer insult in your repertoire?
|
| _> does he expect to make a living out of a weekend
| project?_
|
| Trying to make money from writing SW is not illegal. The
| free market will decide.
|
| _> A closed source application for controlling one's
| fan...umm no thank you. _
|
| Well since you think it's only a weekend project, why don't
| you put your money where your mouth is and spend a weekend
| developing a FOSS fan control app if you need one?
| wiseowise wrote:
| > Looks like a weekend project; does he expect to make a
| living out of a weekend project?
|
| Why won't you spend a weekend and do the world a service?
| cpv wrote:
| Some recent asus ROG/TUF 2022+ models have fine tuning
| available via "armoury crate" or "g-helper" (non-proprietary,
| fan/community supported, code on github).
|
| Disabling the dGPU and reducing power can yield some
| impressive results for battery life. It also allows
| defining fan speed curves depending on temperature.
| aurareturn wrote:
| One of these has to be true (or both true):
|
| 1. ARM is inherently more efficient than x86 CPUs in most tasks
|
| 2. Nuvia and Apple are better CPU designers than AMD and Intel
|
| Here are results from Notebookcheck:
|
| Cinebench R24 ST perf/watt
|
| * M3: 12.7 points/watt
|
| * X Elite: 9.3 points/watt
|
| * AMD HX 370: 3.74 points/watt
|
| * AMD 8845HS: 3.1 points/watt
|
| * Intel 155H: 3.1 points/watt
|
| In ST, Apple is 3.4x more efficient than Zen5. X Elite is 2.4x
| more efficient than Zen5.
|
| Cinebench R24 MT perf/watt
|
| * M3: 28.3 points/watt
|
| * X Elite: 22.6 points/watt
|
| * AMD HX 370: 19.7 points/watt
|
| * AMD 8845HS: 14.8 points/watt
|
| * Intel 155H: 14.5 points/watt
|
| In MT, Apple is 1.9x more efficient than Zen4 and 1.4x more
| efficient than HX 370. I expect M3 Pro/Max to increase the gap
| because generally, more cores means more efficiency for Cinebench
| MT. X Elite is also more efficient but the gap is closer.
| However, we should note that in a laptop, ST matters more for
| efficiency because of the burst behavior of usage. It's easier to
| gain in MT efficiency as long as you have many cores and run them
| at lower wattage. In this case, AMD's Zen5 12 core setup and 24
| threads works well in Cinebench. Cinebench loves more threads.
|
| One intriguing thing is that X Elite does not have little
| cores, which hurts its MT efficiency. It's likely a
| remnant of Nuvia designing a server CPU, which does not
| need big.LITTLE, yet Qualcomm used it in a laptop SoC
| first.
|
| Sources: https://www.youtube.com/watch?v=ZN2tC8DfJnc
|
| https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-CPU-anal...
| trynumber9 wrote:
| The Snapdragon X Elite is on the same node and when actually
| doing a lot of work (i.e. cores loaded) it is close enough to
| HX 370 while delivering similar throughput.
|
| Why wouldn't the inherent inefficiency of x64 be as noticeable
| in MT when all the inefficient cores are working? Because it is
| running at lower clocks? Then what allows it to match the SDXE
| in throughput? Does that need to lower its clock even more? I'm
| not seeing what makes it inherent.
| rafaelmn wrote:
| From your link, Intel is topping the performance charts
| (alongside AMD in SC); they probably tune power usage
| aggressively to achieve these results.
|
| I would guess it's more to do with coming from desktop CPU
| design to mobile vs. phones to laptops.
| aurareturn wrote:
| >From you link - Intel is topping the performance charts
| (alongside AMD in SC) - they probably tune power usage
| agressively to achieve these results.
|
| Cinebench 2024 ST:
|
| * M3: 142 points
|
| * X Elite: 123 points
|
| * AMD HX 370: 116 points
|
| * AMD 8845HS: 102 points
|
| * Intel 155H: 108 points
|
| Amongst each company's best laptop ST SoCs, no, Intel and AMD
| are far behind in both ST scores and perf/watt.
|
| If you're referring to desktop speeds, then yes, Intel's
| 14900k does top the charts in ST Cinebench but it likely uses
| well over 100w.
|
| I mostly care about laptop SoCs. In the case of the M3, it
| doesn't even have a fan.
| rafaelmn wrote:
| That's what I'm thinking: they make trade-offs to reach
| peak performance in desktop designs that don't translate
| optimally to laptops, and when you start from mobile
| designs you probably made the opposite trade-offs. That
| would be my guess for the discrepancy.
| aurareturn wrote:
| I'm pretty sure that M3 Max closely matches the 14900k in
| ST speeds but using something like 5 - 10% of the power.
| rafaelmn wrote:
| Not sure; they had more power/thermal envelope in desktop
| parts and no difference in performance, AFAIK.
| hajile wrote:
| Laptops vastly outsell desktops, so this tradeoff means
| hurting the majority of your customers to please a small
| minority. Servers also care about perf/watt a LOT and
| they are the highest profit margin segment.
|
| Why would AMD choose a target that hurts the majority of
| their market unless there wasn't another good option
| available?
| rafaelmn wrote:
| The architecture started in desktop space and data
| center/mobile was an afterthought up until Intel shitting
| the bed repeatedly. If they redesigned from the ground up
| they could probably get better instructions/watt, but
| that would look terrible if it wasn't accompanied by a
| perf boost over the previous generation. Just like Apple doesn't
| seem to scale well with more power.
| aurareturn wrote:
| > Just like Apple doesn't seem to scale well with more
| power.
|
| How do you know this? When has Apple ever given 400w to
| the M3 Max like the 14900KS can get up to?
|
| PS. An M4 running at 7w is faster in ST than a 14900KS
| running at 250w+.
| AnthonyMouse wrote:
| > 1. ARM is inherently more efficient than x86 CPUs in most
| tasks
|
| > 2. Nuvia and Apple are better CPU designers than AMD and
| Intel
|
| The third possibility is that they just pick a different point
| on the efficiency curve. You can double power consumption in
| exchange for a few percent higher performance, double it again
| for an even smaller increase.
|
| The max turbo on the i9 14900KS is 253 W. The power efficiency
| is _bad_. But it generally outperforms the M3, despite being on
| a significantly worse process node, because that's the
| trade-off.
|
| AMD is only on a slightly worse process node and doesn't have
| to do anything so aggressive, but they'll also sell you
| whatever you want. The 8845HS and 8840U are basically the same
| chip, but the former has around double the TDP. In exchange for
| that you get ~2% more single thread performance and ~15% more
| multi-thread performance. Whereas the performance per watt for
| the 8840U is nearly that of the M3, and the remaining
| difference is basically the process node.
| aurareturn wrote:
| >The third possibility is that they just pick a different
| point on the efficiency curve. You can double power
| consumption in exchange for a few percent higher performance,
| double it again for an even smaller increase.
|
| This only makes sense if the Zen5 is actually faster in ST
| than the M3. In this case, the M3 is 1.24x faster and 3.4x
| more efficient in ST than Zen5.
|
| AMD's Zen5 chip is just straight up slower in any curve.
|
| >The max turbo on the i9 14900KS is 253 W. The power
| efficiency is bad. But it generally outperforms the M3,
| despite being on a significantly worse process node, because
| that's the trade off.
|
| It's not a trade off that Intel wants. The 14900KS runs at
| 253w (sometimes 400w+) because that's the only way Intel is
| able to stay remotely competitive at the very high end. An M3
| Max will often match a 14900KS in performance using 5-10% of
| the power.
| AnthonyMouse wrote:
| > This only makes sense if the Zen5 is actually faster in
| ST than the M3. In this case, the M3 is 1.24x faster and
| 3.4x more efficient in ST than Zen5.
|
| It makes sense if Zen5 is faster in MT, since that's when
| the CPUs will be power limited, and it is. For ST the
| performance generally isn't power-limited for either of
| them and then the M3 is on a newer process node.
|
| It also depends on the benchmark. For example, Zen5 is
| faster in ST on Cinebench _R23_. It's not obvious what's
| going on with R24, but it's a difference in the code rather
| than the hardware.
|
| The power numbers in that link also don't inspire a lot of
| confidence. They have two systems with the same CPU but one
| of them uses 119.3W and the other one uses 46.7W? Plausibly
| the OEMs could have configured them differently but that
| kind of throws out the entire premise of using the
| comparison to measure the efficiency of the CPUs. The
| number doesn't mean anything if the power consumption is
| being set as a configuration parameter by the OEM and the
| number of watts going to the display or a discrete GPU are
| an uncontrolled hidden variable.
|
| > It's not a trade off that Intel wants.
|
| It's the one they've always taken, even when they were
| unquestionably in the lead. They were selling up to 150W
| desktop processors in 2008, because people buy them,
| because they're faster.
|
| Now they have to do it just to be competitive because their
| process isn't as good, but the process is a different thing
| than the ISA or the design of the CPU.
| aurareturn wrote:
| >It also depends on the benchmark. For example, Zen5 is
| faster in ST on Cinebench R23. It's not obvious what's
| going on with R24, but it's a difference in the code
| rather than the hardware.
|
| Cinebench R23 uses Intel Embree underneath. It's hand-
| optimized for the AVX instruction set and poorly translated
| to NEON. It's not even clear if it has any NEON
| optimization.
| Wytwwww wrote:
| > An M3 Max will often match a 14900KS in performance using
| 5-10% of the power.
|
| On Cinebench and Passmark CPU the 14900K is 50-60% faster
| so I'm not sure that's true.
| jeswin wrote:
| In the same article, if you picked their R23 benchmarks the
| advantage vanishes in MT benchmarks; the 3nm M3 Max is actually
| behind 4nm Strix Point in efficiency.
| aurareturn wrote:
| Cinebench R23 is hand-optimized for x86. It uses the
| Intel Embree engine underneath.
|
| That's why Cinebench R23 heavily favors x86.
| kllrnohj wrote:
| Performance per watt is a bad metric. You want instead
| performance for a given power budget (eg, how much performance
| can I get at 15w? 30w? etc...)
|
| Otherwise you can trivially win 100% of performance/watt
| comparisons by just setting clocks to the limit of the lowest
| usable voltage level.
|
| For example compare the 7950X to the 7950X@65w using the
| officially supported eco mode option:
| https://www.anandtech.com/show/17585/amd-zen-4-ryzen-9-7950x...
|
| Cinebench R23 MT:
|
| 7950X stock: 225 points/watt
|
| 7950X @ 65w eco mode: _482_ points/watt
|
| Over 2x perf/watt improvement on _the exact same chip_,
| and a power efficiency that tops the charts, beating
| every laptop chip in that notebookcheck test by a _large_
| amount as well. And yet how many 7950X owners are using
| the 65w eco mode option? Probably none. Because perf/watt
| isn't actually meaningful on its own. Rather, it's how
| much performance can I get for a given power budget.
| aurareturn wrote:
| >Performance per watt is a bad metric.
|
| It isn't. It needs context.
|
| Sure, you can get an Intel Celeron to have more perf/watt
| than an M1 if you give the Celeron low enough wattage.
|
| The key here is the absolute performance.
|
| In this case, both the M3 and the X Elite are not only
| significantly more efficient than both Zen4 and Zen5 in ST,
| they are also straight up faster while being more efficient.
| kllrnohj wrote:
| > 1. ARM is inherently more efficient than x86 CPUs in most
| tasks
|
| I'm not sure how you're reaching the conclusion of "most tasks"
| when Cinebench R24 is the only test you used because R23, which
| doesn't agree, was rejected for hand-wavey nebulous reasons,
| and nothing else was tested.
|
| R24 is hardly a representative workload of "most tasks" nor is
| it claiming/trying to be.
| hajile wrote:
| Anandtech shows[0] that M3 is massively ahead in integer
| performance, but slightly behind in float performance on Spec
| 2017.
|
| Integer workloads are by far the most common, but they tend
| to not scale to multiple cores very well. Most workloads that
| scale well across cores also benefit from big FP/SIMD units
| too.
|
| Put another way, the real issue with R24 is that it makes
| HX370 look better than it would look in more normal consumer
| workloads.
|
| [0] https://www.anandtech.com/show/21485/the-amd-ryzen-ai-hx-370...
| kllrnohj wrote:
| > that M3 is massively ahead in integer performance
|
| The M3 is certainly an impressive chip, but note that it's
| only massively ahead in _some_ of the int tests. It's not
| a consistent gap.
|
| > Integer workloads are by far the most common, but they
| tend to not scale to multiple cores very well.
|
| The HX370 does better than the M3 in specint MT though.
|
| But regardless the anandtech results paint a _much_ closer
| picture than the single R24 results that GP used as the
| basis of the efficiency thesis.
| aurareturn wrote:
| The HX370 should win in SPECINT MT. It has 12 cores to
| the M3's 8 cores and it runs at significantly higher
| power.
|
| Compare HX370 SPECINT MT To an M3 Pro and let's see the
| results.
| kllrnohj wrote:
| > [HX370] runs at significantly higher power.
|
| It used 33w. Meanwhile the M3 result came from a 2023
| MacBook Pro 14-Inch, which certainly has the potential
| for a TDP of around that. If you can find SPECINT MT
| numbers w/ power data for an M3 Pro, let's see it. Or even
| just power data for an M3 non-pro in the 14" MBP. A quick
| search isn't turning up any.
| aurareturn wrote:
| >I'm not sure how you're reaching the conclusion of "most
| tasks" when Cinebench R24 is the only test you used because
| R23, which doesn't agree, was rejected for hand-wavey
| nebulous reasons, and nothing else was tested.
|
| There are no hand-wavey nebulous reasons.
|
| Cinebench R23 uses Intel Embree engine, which is hand
| optimized for x86 CPUs. That's why x86 CPUs look far better
| than ARM CPUs in it.
|
| If there is an application that is purely hand optimized for
| ARM, and then compiled for x86, do you think it's fair to use
| it to compare the two architectures?
|
| SPEC & GB6 mostly agrees with Cinebench 2024.
| sudosysgen wrote:
| Why are you comparing a chip optimized for performance with a
| chip optimized for efficiency? Take an Ultrabook Zen5, not the
| HX370.
| imtringued wrote:
| >Read bandwidth from a single cluster caps out at just under 62
| GB/s. The memory controller has a bit more bandwidth on tap, but
| you'll need to load cores from both clusters to get it.
|
| Except that for DDR5-7500 it isn't just "a bit more"; it
| is actually double, at 120GB/s. This might pose a
| challenge for LLM inference, which absolutely needs the
| full 120GB/s.
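|
| Rough roofline arithmetic for why that matters (the model
| size here is illustrative):
|
|     // token generation streams the whole weight set once
|     // per token, so tokens/s ~= bandwidth / model bytes
|     double tokens_per_sec(double bw_gbs, double weights_gb) {
|         return bw_gbs / weights_gb;
|     }
|     // tokens_per_sec(62.0, 8.0)  ~=  7.8 tok/s (one cluster)
|     // tokens_per_sec(120.0, 8.0) ~= 15.0 tok/s (both)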
| mshockwave wrote:
| Switching from individual schedulers to a unified one for
| integer execution makes sense to me, but I still don't
| quite understand why the FP execution units do the
| opposite. Could somebody explain why?
___________________________________________________________________
(page generated 2024-08-11 23:01 UTC)