[HN Gopher] Arm Announces New Mobile Armv9 CPU Microarchitectures
___________________________________________________________________
Arm Announces New Mobile Armv9 CPU Microarchitectures
Author : seik
Score : 204 points
Date : 2021-05-25 13:45 UTC (9 hours ago)
(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)
| rektide wrote:
| So very happy to have a real bump to Cortex A53. Cortex A510,
| woohoo!
|
| Cortex-A55 ended up being such a faint improvement over
| 2012's Cortex-A53. As another point of reference, the A53
| initially shipped on 28nm.
| hajile wrote:
| I think everyone is blowing past the real story here.
|
| A72 cores are 10% slower than A73 per clock.
|
| A510 is also supposed to be around 10% slower than A73 -- about
| the same performance as A72.
|
| The big difference is that A72 is out-of-order while A510 is in-
| order. This means that A510 won't be vulnerable to spectre,
| meltdown, or any of the dozens of related vulnerabilities that
| keep popping up.
|
| For the first time, we can run untrusted code at A72 speeds (fast
| enough for most things) without worrying that it's stealing data.
| phire wrote:
    | In-order CPUs aren't automatically immune to speculative
    | execution exploits. ARM's A8 core is an example of an
    | in-order core that was vulnerable.
|
    | It's speculative execution that causes problems, and
    | in-order CPUs still do speculative execution during branch
    | prediction. It's just that they typically get through
    | fewer instructions while speculating.
|
    | All you need for Spectre is a branch misprediction that
    | lasts long enough for a dependent load.
|
    | However, ARM has known about Meltdown and Spectre for over
    | three years now. There is a good chance the A510 was
    | explicitly designed to be resistant to those exploits.
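    |
    | To make that concrete, the classic Spectre v1 gadget is
    | just a bounds check followed by two dependent loads.
    | Illustrative C (my sketch, not from the article):
    |
    |   #include <stddef.h>
    |   #include <stdint.h>
    |
    |   uint8_t array1[16];
    |   size_t array1_size = 16;
    |   uint8_t array2[256 * 512];
    |   volatile uint8_t sink;
    |
    |   void victim(size_t x) {
    |       /* trained to be taken, then mispredicted
    |          with an out-of-bounds x */
    |       if (x < array1_size) {
    |           /* speculative out-of-bounds load */
    |           uint8_t secret = array1[x];
    |           /* dependent load leaves a cache footprint */
    |           sink = array2[secret * 512];
    |       }
    |   }
    |
    | The attacker then times accesses to array2 to see which
    | cache line the dependent load touched. A core is only
    | exposed if it can issue both loads before the branch
    | resolves.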
| mhh__ wrote:
      | Spectre exists because computers are memory bound:
      | branch prediction that can't touch memory is useless,
      | and touching memory is very hard to undo. It's not
      | because of out-of-order execution per se.
| pertymcpert wrote:
        | In-order cores can't use branch prediction?
| amatecha wrote:
          | Interestingly, I think they do (I don't know much
          | about them). ARM's own infographic[0] says "In-order
          | design with 'big-core' inspired prefetch &
          | prediction", not sure if that's the same branch
          | prediction we're talking about with regard to
          | Spectre, etc.
|
| [0] https://community.arm.com/cfs-file/__key/communityserver-
| blo... from https://community.arm.com/developer/ip-
| products/processors/b...
| Symmetry wrote:
| They do but almost no in order design can manage to load a
| value from memory and then launch another load from the
| location based on that value before the branch resolves and
| the speculation is aborted. IIRC the Cortex A8 and Power 6
| are the only two in order processors vulnerable to Specter.
| amatecha wrote:
| Interesting, and great point. Curious to see the first SBCs
| using these :) (FWIW Cortex-A710 is still out-of-order
| pipeline, if anyone was wondering)
|
| Spec pages:
|
| A510: https://developer.arm.com/ip-
| products/processors/cortex-a/co...
|
| A710: https://developer.arm.com/ip-
| products/processors/cortex-a/co...
| user-the-name wrote:
| Why did ARM have to abandon their nice, clean numbering system of
| basically just increasing the number in each family by one, and
| go for just completely random meaningless numbers that jump by an
| order of magnitude for no reason... Such a waste of a clean and
| understandable lineup.
| AlexAltea wrote:
| This is explained in the article:
|
| > The new CPU family marks one of the largest architectural
| jumps we've had in years, as the company is now baselining all
| three new CPU IPs on Armv9.0.
|
| I believe the deprecation of AArch32 is quite an important
| change and by itself already warrants issuing a new major
| version (then there's all mandatory extensions from v8.2+ and
| SVE2).
|
  | They have not claimed that they'll bump the major version
  | on a regular basis from now on, as you seem to suggest.
| icegreentea2 wrote:
    | That's a reasonable perspective to take. But the A710
    | still supports AArch32. And they were really close to a
    | traditional re-vamp point anyways =P. This release (of
    | the middle core) would have then been the A79. The next
    | release (the one that would actually deprecate AArch32)
    | would have been the A80... and doing weird shit when you
    | reach "10" would just follow in the grand tradition of
    | tech. I think really only Intel somehow just... kept
    | counting up in a more or less sane way.
| bogwog wrote:
  | It'd be interesting to know the thought process behind
  | their terrible naming convention. From the outside, it
  | looks like they went out of their way to make it as
  | confusing as possible.
| ksec wrote:
| This is actually worse than I thought compared to the N2 used
| on servers. The N2 was supposed to have a 40% IPC increase,
| comparatively speaking.
|
| > In terms of the performance and power curve, the new X2 core
| extends itself ahead of the X1 curve in both metrics. The +16%
| performance figure is in terms of the peak performance points,
| though it does come at a cost of higher power consumption.
|
| The 16% figure was obtained with an 8MB cache, compared to a
| 4MB cache on the X1. I.e. in the _absolute_ (also unrealistic)
| best case scenario, with an extra 10% clock-speed increase,
| you are looking at the single-core performance of the X2,
| released with flagship phones in 2022, being roughly
| equivalent to the A13 used in the iPhone 11, released in 2019.
|
| The most interesting part is actually the LITTLE core, the
| A510. I am wondering what it will bring to low cost computing.
| Unfortunately no die size estimates were given.
| daniel_iversen wrote:
| Could anyone explain the main differences between this new Armv9
| CPU and the Apple M1 ARMs? What are the strengths and weaknesses
| of the two or is one lightyears ahead of the other?
| hajile wrote:
| ARMv8 is a big target [0]. You have 6 major ISA extensions of
| which recent ARM designs support 2 and Apple's support 5.
| There's also SVE which was a kind of side-option. In addition,
| there's the 32-bit version, the 64-bit version, the R and A
| versions all with different capabilities and support levels.
|
| ARMv9 tries to unify that a bit [1]. They add a new security
| model and SVE moves from a side option to a requirement. A
| bunch of the various v8 instructions also get bundled into the
| core requirements.
|
  | A14 vs X2 is a different question. A14/M1 _really_ dropped
  | the ball by not supporting SVE. I suspect they will support
  | it in their next generation of processors, but the problem
  | is adoption. Everyone is jumping on the M1 units and won't
  | be upgrading their laptops for another few years. As such,
  | they'll be held back for the foreseeable future.
|
  | Performance is not in question: the A14 will retain its
  | lead for the next 2-3 years at least. X1 chips are already
  | significantly slower than A14 chips. X2 only increases
  | performance by 10-16% (as pointed out in the article, the
  | 16% is a bit disingenuous as it compares the X1 with 4MB of
  | cache to the X2 with 8MB, while the X1 itself gets a decent
  | speedup from the 8MB version). Furthermore, by the time the
  | X2 is actually in devices, Apple will already be launching
  | their own next generation of processors (though I suspect
  | we're about to see them move to small performance increases
  | due to diminishing returns).
|
| [0] https://en.wikipedia.org/wiki/AArch64
|
| [1] https://www.anandtech.com/show/16584/arm-announces-
| armv9-arc...
| brokencode wrote:
| Where does the A14/M1 suffer due to the lack of SVE? The
| performance of M1 is well known to be terrific, so it's hard
| to characterize that as dropping the ball in my book. More
| like they prioritized other features instead, and they ended
| up creating a great processor.
| hajile wrote:
| Let's say M2 has SVE. Do you:
|
| * only use NEON to save developer time and lose performance
| and forward compatibility
|
| * only use SVE to save developer time and lose backward
| compatibility
|
| * Pay to support both and deal with the cost/headaches
|
| Experience shows that AVX took years to adopt (and still
| isn't used for everything) because SSE2 was "good enough"
| and the extra costs and loss of backward compatibility
| weren't worth it.
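      |
      | Concretely, "pay to support both" means shipping two
      | kernels for every hot loop. A sketch of the smallest
      | possible example (the intrinsics are standard ACLE; the
      | function names are mine):
      |
      |   #include <arm_neon.h>
      |
      |   /* NEON: fixed 128-bit vectors, scalar tail. */
      |   void add_neon(float *d, const float *a,
      |                 const float *b, int n) {
      |       int i = 0;
      |       for (; i + 4 <= n; i += 4)
      |           vst1q_f32(d + i,
      |                     vaddq_f32(vld1q_f32(a + i),
      |                               vld1q_f32(b + i)));
      |       for (; i < n; i++)   /* leftover elements */
      |           d[i] = a[i] + b[i];
      |   }
      |
      |   #if defined(__ARM_FEATURE_SVE)
      |   #include <arm_sve.h>
      |
      |   /* SVE: vector-length agnostic, predicated tail. */
      |   void add_sve(float *d, const float *a,
      |                const float *b, int n) {
      |       for (int i = 0; i < n; i += svcntw()) {
      |           svbool_t p = svwhilelt_b32(i, n);
      |           svst1_f32(p, d + i,
      |               svadd_f32_x(p, svld1_f32(p, a + i),
      |                              svld1_f32(p, b + i)));
      |       }
      |   }
      |   #endif
      |
      | Same loop, written twice and tested twice; that's the
      | tax.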
|
| If SVE were supported out of the gate, then the problem
| would simply never exist. Don't forget (as stated before)
| that there's been a huge wave of M1 buyers. People upgraded
| early either to get the nice features or not be left behind
| as Apple drops support.
|
| Let's say you have 100M Mac users and an average of 20M are
| buying new machines any given year (a new machine every 5
| years). The 1.5-2-year M1 wave gets 60-70M upgraders in the
| surge. Now sales are going to decline for a while as the
| remaining people stick to their update schedule (or hold on
| to their x86 machines until they die). Now the M2 with SVE
| only gets 5-10M upgraders. Does it make sense to target
| such a small group or wait a few years? I suspect there
| will be a lot of waiting.
| brokencode wrote:
| My point was that it probably doesn't matter. M1 is
| already very fast, even without SVE. At some point you
| just have to decide to ship a product, even if it doesn't
| have every possible feature.
|
| Like other posters mentioned, for vector operations like
| this, you could be dynamically linking to a library that
| handles this for you in the best way for the hardware.
| Then when new instructions become available, you don't
| have to change any of your code to take advantage.
| tylerhou wrote:
          | I don't think any code performance-sensitive enough
          | to warrant SVE instructions will want to tolerate a
          | jump in a very tight loop.
| zsmi wrote:
| It's even harder than that.
|
| SVE vs NEON performance will also hugely depend on the
| vector length the given algorithm requires, and the
| stress that the instructions put on the memory subsystem.
|
| Memory hierarchy varies by product and will likely
| continue to do so regardless of what M2 does.
|
          | In the end, I echo eyesee's comment: for best
          | performance one should really use
          | Accelerate.framework on Apple's hardware.
| eyesee wrote:
| I suppose they "dropped the ball" in the sense that those
| instructions cannot be assumed to be available, thus will
| not be encoded by the compiler by default. Any future
| processors which include the instructions may not benefit
| until developers recompile for the new instructions and go
| through the extra work required to conditionally execute
| when available.
|
| That said, to get the best performance on vector math it
| has long been recommended to use Apple's own
| Accelerate.framework, which has the benefit of enabling use
| of their proprietary matrix math coprocessor. One can
| expect the framework to always take maximum advantage of
| the hardware wherever it runs with no extra development
| effort required.
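    |
    | The "conditionally execute when available" dance usually
    | amounts to a one-time dispatch. On Linux/AArch64 it would
    | look something like this (Apple would expose the query
    | via sysctl instead; add_neon/add_sve are hypothetical
    | stand-ins for the two kernel variants):
    |
    |   #include <sys/auxv.h>    /* getauxval */
    |   #include <asm/hwcap.h>   /* HWCAP_SVE */
    |
    |   void add_neon(float *, const float *,
    |                 const float *, int);
    |   void add_sve(float *, const float *,
    |                const float *, int);
    |
    |   typedef void (*AddFn)(float *, const float *,
    |                         const float *, int);
    |
    |   /* Resolve once at startup; every later call pays an
    |      indirect call instead of a recompile. */
    |   AddFn pick_add(void) {
    |       return (getauxval(AT_HWCAP) & HWCAP_SVE)
    |                  ? add_sve : add_neon;
    |   }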
| JamesDeepDown wrote:
| "I suspect we're about to see [Apple Silicon] move to small
| performance increases due to diminishing returns"
|
    | Is there any discussion of this online? I'm curious to
    | read more.
| hajile wrote:
      | This has already happened.
      |
      | The A14 runs at 3.1GHz while the A13 ran at 2.66GHz. The
      | A14 is around 20% faster than the A13 overall, but it is
      | also clocked roughly 17% higher, which leaves only a few
      | percent of IPC gain. Power consumption all but
      | guarantees that they won't be doing that too often.
| zibzab wrote:
| M1 is ARMv8.4
|
| v9 has a bunch of neat features that improve security and
| performance (+ less baggage from ARMv7 which also improves area
| and power)
| gsnedders wrote:
| > v9 has a bunch of neat features that improve security and
| performance
|
| Many of them are already optional features in later ARMv8
| revisions, however.
| zibzab wrote:
      | A lot of v8.x features were added to fix mistakes in v8.
      |
      | v9 brings a bunch of completely new stuff, especially in
      | the security department.
| wmf wrote:
  | The Arm X2 is still going to be behind Apple Firestorm, but
  | at least Arm won't be trailing by 50% any more. At least
  | for now Arm doesn't have an appetite for creating monster
  | 8-issue cores.
|
| See X1 vs. Firestorm:
| https://www.anandtech.com/show/16463/snapdragon-888-vs-exyno...
| zibzab wrote:
    | The X1 is _not_ ARMv9 and there is not even a comparison
    | with the M1 on that page.
    |
    | The X2 is not even out yet; you have no idea how it may
    | perform.
    |
    | Why does every post on HN drift into blindly praising an
    | unrelated product?
| wmf wrote:
| ARMv8 vs ARMv9 doesn't matter; performance matters. This
| article is about the X2 so of course we're going to discuss
| what little information we have.
| klelatti wrote:
        | > In terms of the performance and power curve, the
        | new X2 core extends itself ahead of the X1 curve in
        | both metrics. The +16% performance figure is in terms
        | of the peak performance points, though it does come at
        | a cost of higher power consumption.
|
| Some idea how the X2 will perform.
| GeekyBear wrote:
| >Some idea how the X2 will perform.
|
          | If you read on a bit, there is some question whether
          | those performance metrics will be seen in the real
          | world, due to existing thermal issues.
|
| > I am admittedly pessimistic in regards to power
| improvements in whichever node the next flagship SoCs
| come in (be it 5LPP or 4LPP). It could well be plausible
| that we wouldn't see the full +16% improvement in actual
| SoCs next year.
| zibzab wrote:
            | In general, for moving from v8 to v9, I think 16%
            | is extremely pessimistic.
            |
            | Removal of AArch32 has huge implications. And this
            | is not limited to the CPU but also touches the
            | MMU, caches and more, which in v8 had to provide
            | AArch32-compatible interfaces. This led to really
            | inefficient designs (for example, the MMU walk,
            | which is extremely critical to performance, is
            | more than twice as complicated in ARMv8 compared
            | to v7 and v9).
            |
            | The space saved by removing these can be used for
            | bigger caches, wider units, better branch
            | prediction and other fun stuff.
|
| Finally, note also that the baseline X1 numbers come from
| Samsung who we all know are worst in class right now. And
| they are using an inferior process. Let's see what qcomm,
| ampere and amazon can do with v9.
| Aaargh20318 wrote:
| Does Apple's M1 still support AArch32, given that it uses an
| ARMv8.4-A instruction set? I'm assuming no 32-bit code is ever
| executed on macOS on ARM; how much die space would removing
| the 32-bit support save?
|
| With WWDC '21 being only weeks away, I wonder if we're going
| to see an ARMv9 M2.
| spijdar wrote:
  | AFAIK none of the (current) Apple-produced processors do
  | 32-bit, even the Apple Watch. Which is interesting, as it
  | is very memory constrained, the usual reason for using
  | 32-bit code.
  |
  | Judging by some LLVM patches by an Apple engineer, they
  | solved this by creating an ILP32 mode for AArch64, where
  | they use the full 64-bit mode but with 32-bit pointers.
| djxfade wrote:
  | I'm pretty sure the M1 does not have AArch32 hardware on
  | its die. It wouldn't make any sense.
| my123 wrote:
  | AArch32 hasn't been implemented on Apple CPUs for quite a
  | long time now.
| Aaargh20318 wrote:
| Is AArch32 support optional in ARMv8.4-A ?
| TNorthover wrote:
| It has been all along.
| macintux wrote:
| Doesn't seem that long ago that Apple implementing 64-bit ARM
| CPUs was derided as a publicity stunt.
| marcan_42 wrote:
| No. There is no AArch32 on the M1. Even things like the A32
| relevant bits in the Apple-proprietary performance counters are
| reserved and unimplemented.
| glhaynes wrote:
| iOS 11 was the first Apple OS to not run 32-bit apps so I'm
| guessing the iPhone X that was released later that year
| (2017) with the A11 Bionic chip was the first to drop 32-bit
| support.
| mumblemumble wrote:
| Yep, that was the one.
|
| Perhaps more relevant to the M1, though, is that OS X
| dropped support for 32-bit apps about 18 months back, with
| Catalina. The timing there seems just too coincidental to
| _not_ have been done in order to pave the way for the M1.
| _ph_ wrote:
| Indeed. I think throwing out 32 bit support in Catalina
| was in preparation for the M1 one year later. Not so much
| because the M1 is 64 bit only, as 32 bit x86 applications
| wouldn't have run on it either way. Perhaps to make
| Rosetta simpler, as it only had to support 64 bit code.
| But also with Catalina, all old and unmaintained
| applications did no longer run. Those applications, which
| run on Catalina, do run on M1. Which made the switch from
| x86 to ARM cpus much smoother for the users. The hard cut
| had been done with Catalina already - and this is the
| reason my private Mac still runs Mojave :p. When I get a
| new Mac, switching to ARM doesn't create any less
| backwards compatibility than switching to Catalina
| would...
| my123 wrote:
          | Rosetta supports translating 32-bit x86
          | instructions. However, there are no native 32-bit
          | OS libs anymore.
          |
          | This means that you have to do the thunking back
          | and forth yourself to call OS libraries. The only
          | x86 program using this facility so far is
          | Wine/CrossOver on Apple Silicon Macs, which can
          | thus run 32-bit x86 apps...
| hajile wrote:
| ARMv9 requires SVE. As M1 doesn't have SVE, I'd assume ARMv8.x.
| Wikipedia claims A14 is ARMv8.5-A, so that would be my best
| guess.
| _joel wrote:
  | The M2's already in fabrication; as to whether it's ARMv9,
  | not sure.
| https://asia.nikkei.com/Business/Tech/Semiconductors/Apple-s...
| api wrote:
| Everything I've heard says it's an M1 with more performance
| cores and possibly more cache and GPU cores. It seems awfully
| soon for it to be V9.
| _joel wrote:
| Ah OK, that makes sense
| rejectedandsad wrote:
    | I want to emphasize that it's unknown whether this is a
    | major or minor successor to the M1 - but it's more likely
    | that this is a production ramp-up for the new Pro
    | machines to be announced during WWDC, which will likely
    | ship later this summer.
    |
    | The successor to the M1, the lowest-end Apple Silicon
    | running on Macs, is likely expected next year in a
    | rumored redesign of the MacBook Air (whatever the CPU
    | itself is called).
| billiam wrote:
| This is what I am banking on. I want to get another year
| out of my old Intel Macs before getting an MX-based Macbook
| that should melt my eyeballs with its speed while lasting a
| day and a half on battery.
| GeekyBear wrote:
| >We've seen that Samsung's 5LPE node used by Qualcomm and S.LSI
| in the Snapdragon 888 and Exynos 2100 has under-delivered in
| terms of performance and power efficiency, and I generally
| consider both big cores' power consumption to be at a higher
| bound limit when it comes to thermals.
|
| I expect Qualcomm to stick with Samsung foundry in the next
| generation, so I am admittedly pessimistic in regards to power
| improvements in whichever node the next flagship SoCs come in (be
| it 5LPP or 4LPP). It could well be plausible that we wouldn't see
| the full +16% improvement in actual SoCs next year.
|
| https://www.anandtech.com/show/16693/arm-announces-mobile-ar...
|
| It sounds like the thermal issues of the current generation
| flagship Android chips are expected to remain in place.
| smitty1110 wrote:
| Honestly, they're probably stuck with Samsung for the mid term
| (5 years). You simply can't get any TSMC capacity on their top
| nodes when Apple and AMD get first and second bid on
| everything. Maybe GloFo will figure out their issues, and maybe
| Intel will sell their latest nodes, but until then companies
| are stuck with their current partners.
| mardifoufs wrote:
    | Didn't GloFo officially announce they will stick to >14nm
    | a few years ago? Have they started developing a new node?
    | IIRC they just didn't have the capital to keep spending
    | tens of billions on R&D/equipment every few years, so
    | they shifted their focus to maximizing yield on older
    | nodes.
| paulpan wrote:
| TLDR seems to be that next year's Cortex X2 and A710 are very
| incremental upgrades (+10-15%) over existing designs, at least
| compared to what ARM has delivered in recent years. The A510
| seems promising for the little core. Wait for the 2023 next
| gen Sophia-based designs if you can.
|
| Given the industry's reliance on ARM's CPU designs, I wonder
| if it makes sense for prosumers / enthusiasts to keep track of
| these ARM microarchitectures instead of, say, the Qualcomm
| Snapdragon 888 and Samsung Exynos 2100. Because ultimately the
| ARM architecture is the cornerstone of performance metrics in
| any non-Apple smartphone.
|
| Conversely, it'll be interesting to see how smartphone OEMs
| market their 2022 flagship devices when it's clearly insinuated
| that the CPU performance on the next-gen ARM cores will not be
| much of an uplift compared to current gen. Probably more emphasis
| on camera, display and design.
| amelius wrote:
| > when it's clearly insinuated that the CPU performance on the
| next-gen ARM cores will not be much of an uplift compared to
| current gen
|
| Perhaps going to smaller technology nodes will make them
| faster, still? Or is that already part of the prediction?
| wmf wrote:
| X1/A78 is on "5 nm" and "3 nm" isn't available yet so the new
| X2/A710 will still be on a similar process.
| pantalaimon wrote:
  | Isn't Qualcomm going to release custom designs again with
  | the acquisition of Nuvia?
| TradingPlaces wrote:
| This is the most under-appreciated thing in the ARM world rn.
| Early 2023 probably for the first Snapdragons with Nuvia
| cores
| zsmi wrote:
| https://www.anandtech.com/show/16553/qualcomm-completes-
| acqu...
|
| "The immediate goals for the NUVIA team will be implementing
| custom CPU cores into laptop-class Snapdragon SoCs running
| Windows" - Qualcomm's Keith Kressin, SVP and GM, Edge Cloud
| and Computing
| paulpan wrote:
      | I thought Nuvia's designs were for higher-TDP products
      | like laptops and servers, rather than mobile SoCs. But
      | it'd make sense for both segments, e.g. scaling core
      | count.
| zokier wrote:
| Not directly related to these new fancy cores, but I was
| looking at the product range and noticed the Cortex-A34 core
| in there, which has snuck by quietly. I couldn't find any
| hardware - chips or devices - using that core. It has the
| interesting property that it's AArch64-only, like these new
| ones. Has anyone seen or heard about it in the wild?
| nickcw wrote:
| I will shed a small tear for the passing of AArch32 as it was the
| first processor architecture I really enjoyed programming.
|
| I wrote hundreds of thousands of lines of assembler for it,
| and what a fun architecture it was to program for. Enough
| registers that you rarely needed extra storage in a function,
| conditional instructions to lose the branches, and 3 register
| arguments to each instruction meant that there was lots of
| opportunity for optimisation. Plus the (not very RISC but
| extremely useful) load and store multiple instructions made it
| a joy to work with.
|
| AArch64 is quite nice too, but they took some of the fun out
| of the instruction set - conditional execution and load and
| store multiple. They did that for good reason though, in order
| to make faster superscalar processors, so I don't blame them!
| josteink wrote:
| > I will shed a small tear for the passing of AArch32 as it was
| the first processor architecture I really enjoyed programming.
|
| I'll say the same for the Motorola 68k series CPUs for the
| exact same reason.
|
| With Intel's era of total domination nearing an end, I guess
| we're seeing some things go full circle.
| txdv wrote:
| any recommendations on learning assembly?
| kevin_thibedeau wrote:
| Cortex M isn't going anywhere. I can't imagine 32-bit ARM will
| ever die.
| gsnedders wrote:
| Genuine question: what stops most Cortex M users from
| adopting A64 with ILP32? Anything except existing codebases?
| monocasa wrote:
| At least in the M0 range, A64 won't be competitive on a
| gate count basis. M0s can be ridiculously tiny, down to
| ~12k gates, which is how they were able to take a huge
| chunk out of the market that had previously been 8/16 bit
| processors like 8051s and the like.
|
| M33s and up might make sense as A64 cores though.
| ChuckNorris89 wrote:
| Many reasons. Cost being the biggest (64bit dies would cost
| way more than the 32bit ones) and power consumption (having
| a complex pipeline with 64bit registers isn't great when
| you have to run on a coin cell for a year).
|
| Similar reasons to why 8 and 16 bit micros stuck around for
| so long in low cost and low power devices before Cortex M0+
| became cheap and frugal enough.
| topspin wrote:
| Genuine answer: power.
|
| Power consumption is a hard constraint. The benefits of A64
| have to outweigh the cost in power consumption, and the
| benefits just aren't there. Coin cells, small solar
| collectors and supercaps are among the extremely limited
| power sources for many Cortex M applications.
| ant6n wrote:
  | Also fun bits for optimization: the barrel shifter that can
  | be used on the second argument of any instruction, and the
  | possibility to make any instruction conditional on the
  | flags.
  |
  | ...At some point I started an x86-on-Arm emulator, and
  | managed to run x86 instructions in like 5-7 Arm
  | instructions, without JIT, including reading the next
  | instruction and jumping to it - all thanks to the powerful
  | Arm instruction set.
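  |
  | The no-JIT core of such an emulator is a table-driven
  | fetch/dispatch loop. A toy sketch in C (the struct and
  | handlers are made up, though 0x90/0xF4 really are x86
  | NOP/HLT):
  |
  |   #include <stdint.h>
  |
  |   typedef struct {
  |       uint32_t regs[8];
  |       const uint8_t *pc;
  |       int halted;
  |   } Cpu;
  |
  |   typedef void (*Handler)(Cpu *);
  |
  |   static void op_nop(Cpu *c)  { c->pc += 1; }
  |   static void op_halt(Cpu *c) { c->halted = 1; }
  |   /* ...one handler per opcode, each advancing pc by
  |      its own instruction length... */
  |
  |   static Handler table[256] = {
  |       [0x90] = op_nop,
  |       [0xF4] = op_halt,
  |       /* every other opcode needs a handler before
  |          this is safe to run */
  |   };
  |
  |   void run(Cpu *c) {
  |       while (!c->halted)
  |           table[*c->pc](c);  /* fetch byte, jump */
  |   }
  |
  | On 32-bit Arm the fetch, table load and branch fold into a
  | handful of instructions, which is presumably where the 5-7
  | instruction figure comes from.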
| nickcw wrote:
| Ah yes, I forgot the barrel shifter. Shift anything in no
| extra cycles :-)
| zibzab wrote:
| I find the clustering approach in A510 very intriguing.
|
| Why run the low power cores in pairs? Is this because Android
| _really_ struggles on single core due to the way it was
| designed? So you basically need two cores even when mostly
| idle?
| jng wrote:
| The two cores only share FP and vector execution units. A ton
| of code doesn't use those at all, and so they are effectively
| two separate cores most of the time. It provides full
| compatibility on both cores, full performance very often, and
| saves a lot of die area (FP and vector are going to be very
| transistor-heavy). It's just a tradeoff.
| zibzab wrote:
| That makes more sense.
|
| I assume if you are doing floating point or SIMD on more than
| one core then it's time to switch to big cores anyway.
| klelatti wrote:
| Interesting that mobile platforms / CPUs are losing 32-bit
| support whilst the dominant desktop platform retains it.
|
| Not sure what the cost of continuing 32-bit support on x86 is
| for Intel/AMD, but does there come a point when Intel/AMD/MS
| decide that it should be supported via (Rosetta-like)
| emulation rather than in silicon?
| dehrmann wrote:
| > the cost of continuing 32 bit support on x86 is for Intel/AMD
|
| Pretty sure they still have 16-bit support.
|
| It's easier to migrate for a CPU mostly used for Android
| because it's already had a mix of architectures, there are
| fewer legacy business applications, and distribution through an
| app store hides potential confusion from users.
| jeroenhd wrote:
| Another advantage the Android platform has specifically is
| that its applications mostly consist of bytecode, limiting
| the impact phasing out 32 bit instructions has on the
| platform. Most high performance chips are in Android devices
| like phones and tablets where native libraries are more the
| exception than the rule.
|
    | Desktop relies on a lot of legacy support because even
    | today developers make (unjustified) assumptions that a
    | certain system quirk will keep working for the next few
    | years. There's no good reason to drop 32-bit x86 support
    | and there's good reason to keep it. The PC world can't
    | afford to pull an Apple because there's less of a fan
    | cult around most PC products that helps shift the blame
    | onto developers when their favourite games stop working.
| dehrmann wrote:
      | Run-anywhere was (and halfway still is) a _huge_
      | selling point for Java. It was always clunky for
      | desktop apps because of slow JVM warmup and nothing
      | feeling native, but Android shifted enough of that into
      | the OS so it's not a problem.
| klelatti wrote:
    | Good point on 16-bit!
    |
    | I suppose I'm idly wondering at what point the legacy
    | business applications become so old that the speed
    | penalty of emulation becomes something that almost
    | everyone can live with.
| undersuit wrote:
| The cost of continuing support is increased decoder complexity.
| The simplicity/orthogonality of the ARM ISA allows simpler
| instruction decoders. Simpler, faster, and easier to scale.
| Intel and AMD instruction decoders are impressive beasts that
| consume considerable manpower, chip space, and physical power.
| klelatti wrote:
| I'd always assumed that decoding x86 and x64 had a lot in
| common so not much to be saved there? Happy to be told
| otherwise.
|
| Agreed that Arm (esp AArch64) must be a lot simpler.
| usefulcat wrote:
| As I understand it, the fact that x86 has variable-length
| instructions makes a significant difference. On ARM if you
| want to look ahead to see what the next instructions are,
| you can just add a fixed offset to the address of the
| current instruction. If you want to do that on x86, you
| have to do at least enough work to figure out what each
| instruction is, so that you know how large it is, and only
| then will you know where the next instruction begins.
| Obviously this is not very amenable to any kind of
| concurrent processing.
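      |
      | In code form (a toy sketch; x86_length() is a
      | hypothetical stand-in for a real length decoder):
      |
      |   #include <stddef.h>
      |   #include <stdint.h>
      |
      |   /* hypothetical full prefix/opcode/modrm walk */
      |   size_t x86_length(const uint8_t *p);
      |
      |   /* AArch64: the i-th start is pure arithmetic, so
      |      all k of them can be computed in parallel. */
      |   void starts_a64(uintptr_t pc, uintptr_t *out, int k)
      |   {
      |       for (int i = 0; i < k; i++)
      |           out[i] = pc + 4u * (uintptr_t)i;
      |   }
      |
      |   /* x86: each start depends on the decoded length of
      |      every earlier instruction. */
      |   void starts_x86(const uint8_t *pc,
      |                   const uint8_t **out, int k)
      |   {
      |       for (int i = 0; i < k; i++) {
      |           out[i] = pc;
      |           pc += x86_length(pc);  /* serial chain */
      |       }
      |   }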
| Someone wrote:
| You can do quite a bit concurrently, but at the expense
| of hardware. You 'just' speculatively assume an
| instruction starts at every byte offset and start
| decoding.
|
| Then, once you figure out that the instruction at offset
| 0 is N bytes, ignore the speculative decodings starting
| at offsets 1 through N-1, and tell the speculative
| decoding at offset N that it is good to go. That means
| that it in turn can inform its successors whether they
| are doing needless work or are good to go, etc.
|
| That effectively means you need more decoders, and, I
| guess, have to accept a (?slightly?) longer delay in
| decoding instructions that are 'later' in this cascade.
| gchadwick wrote:
| > The simplicity/orthogonality of the ARM ISA allows simpler
| instruction decoders. Simpler, faster, and easier to scale.
| Intel and AMD instruction decoders are impressive beasts that
| consume considerable manpower, chip space, and physical power
|
| These claims are often made and indeed make some sense but is
| there any actual evidence for them? To really know you'd need
| access to the RTL of both cutting edge x86 and arm designs to
| do the analysis to work out what the decoders are actually
| costing in terms of power and area and whether they tend to
    | produce critical timing paths. You'd also need access to
    | the companies' project planning/timesheets to get an
    | estimate of engineering effort for both (and chances are
    | the data isn't really tracked at that level of
    | granularity; you'd also need a deep dive into their bug
    | tracking to determine what is decoder-related, for
    | instance, and estimate how much time has been spent
    | dealing with decoder issues). I suspect Intel/AMD/arm
    | have no interest in making the relevant information
    | publicly available.
|
| You could attempt this analysis without access to RTL but
| isolating the power cost of the decoder with the silicon
| alone sounds hard and potentially infeasible.
| hajile wrote:
| x86 die breakdowns put the area of the decoder as bigger
| than the integer ALUs. While unused ALUs can power gate,
| there's almost never a time when the decoders are not in
| use.
|
| Likewise, parsing is a well-studied field and parallel
| parsing has been a huge focus for decades now. If you look
| around, you can find papers and patents around decoding
| highly serialized instruction sets (aka x86). The speedups
| over a naive implementation are huge, but come at the cost
| of many transistors while still not being as efficient or
| scalable as parallel parsing of fixed-length instructions.
| The insistence that parsing compressed, serial streams can
| be done for free mystifies me.
|
| I believe you can still find where some AMD exec said that
| they weren't going wider than 4 decoders because the
| power/performance ratio became much too bad. If decoders
| weren't a significant cost to their designs (both in
| transistors and power), you'd expect widening to be a non-
| issue.
|
| EDIT: here's a link to a die breakdown from AMD
|
| https://forums.anandtech.com/threads/annotated-hi-res-
| core-d...
| gchadwick wrote:
| I guess I'm making the wrong argument. I'd agree it's
| clear an x86 decoder will be bigger, more power hungry
| etc than an arm decoder. The real question is how much of
| a factor that is for the rest of the micro-architecture?
| Is the decoder dragging everything down or just a pain
| you can deal with at some extra bit of power/area cost
| that doesn't really matter? That's what I was aiming to
        | get at in the call for evidence. Is x86 inherently
        | inferior to arm, unable to scale as well for large
        | superscalar CPUs because the decoder drags you down,
        | or is Apple just better at micro-architecture design
        | (perhaps AMD's team would also fail to scale well
        | beyond 4-wide with an arm design, perhaps Apple's
        | team could have built an x86 M1)?
| klelatti wrote:
| I'm sure that Arm must have done a lot of analysis around
| this issue when AArch64 was being developed.
|
| After all the relatively simple Thumb extension had been
| part of the Arm ISA for a long time (and was arguably one
| of the reasons for its success) and they still decided to
| go for fixed width.
| gchadwick wrote:
        | Also, out of interest, do you have a link to an x86
        | die breakdown that includes decoder and ALU area?
        | Looking at WikiChip, for example, they've got a
        | breakdown for Ice Lake: https://en.wikichip.org/wiki/
        | intel/microarchitectures/ice_la... but it doesn't get
        | into that level of detail; a vague 'Execution units'
        | block that isn't even given bounds is the best you
        | get, and is likely an educated guess rather than
        | definite knowledge of what that bit of die contains.
        | Reverse engineering from die shots can do impressive
        | things, but certainly what you see in public never
        | seems to get into that level of detail, and would
        | likely be significant effort without assistance from
        | Intel.
| hajile wrote:
          | Here you go. This one combines a high-res die shot
          | with AMD's die breakdown of Zen 2.
          |
          | You'll notice that I understated things
          | significantly. Not only is the decoder bigger than
          | the integer ALUs, but it's more than 2x as big if
          | you don't include the uop cache, and around 3x as
          | big if you do! It dwarfs almost every other part of
          | the die except the caches and the beast that is
          | load/store.
|
| https://forums.anandtech.com/threads/annotated-hi-res-
| core-d...
|
| Original slides.
|
| https://forums.anandtech.com/threads/amds-efforts-
| involved-i...
| gchadwick wrote:
            | Thanks, my mistake was searching exclusively for
            | Intel micro-architectures. It'd be interesting to
            | see if there are further similar die breakdowns
            | for other micro-architectures around; trawling
            | through conference proceedings is likely to yield
            | better results than a quick Google. Just skimming
            | through Hot Chips now, as it has a publicly
            | accessible archive (those AMD die breakdowns come
            | from ISSCC, which isn't so easily accessible).
            |
            | The decoder is certainly a reasonable fraction of
            | the core area, though as a fraction of total chip
            | area it's still not too much, as other units in
            | the core are of similar or larger size (floating
            | point/SIMD, branch prediction, load/store, L2),
            | plus all of the uncore stuff (L3 in particular).
            | Really we need a similar die shot of a big arm
            | core to compare its decoder size to. Hot Chips
            | 2019 has a highlighted die plot of an N1 with
            | different blocks coloured, but sadly it doesn't
            | provide a key as to which block is what colour.
| mastax wrote:
| I remember reading (in 2012 maybe) that ISA doesn't really
| matter and decoders don't really matter for performance. The
| decoder is such a small part of the power and area compared
| to the rest of the chip. I had a hypothesis that as we reach
| the tail end of microarchitectural performance the ISA will
| start to matter a lot more, since there are fewer and fewer
| places left to optimize.
|
      | Well, now in 2021, cores are getting wider and wider to
      | get any throughput improvement, and power is a bigger
      | concern than ever. Apple M1 has an 8-wide decoder
      | feeding the rest of the
| extremely wide core. For comparison, Zen 3 has a 4-wide
| decoder and Ice Lake has 5-wide. We'll see in 3 years if
| Intel and AMD were just being economical or unimaginative or
| if they really can't go wider due to the x86 decode
| complexity. I suppose we'll never really know if they cover
| for a power hungry decoder with secret sauce elsewhere in the
| design.
| dragontamer wrote:
        | The state machine that determines the
        | start-of-instruction positions can be run in
        | parallel. In fact, any regex / state machine can be
        | run in parallel on every byte with Kogge-Stone /
        | parallel prefix, because stepping the state machine
        | is an associative (but not commutative) operation.
        |
        | As such, I believe that decoders (even complicated
        | ones like x86) scale at O(n) total work (aka power
        | used) and O(log(n)) for depth (aka clock cycles of
        | latency).
|
| -------
|
| Obviously, a simpler ISA would allow for simpler decoding.
| But I don't think that decoders would scale poorly, even if
| you had to build a complicated parallel-execution engine to
| discover the start-of-instruction information.
| Someone wrote:
| x86 instructions can straddle MMU pages and even cache
| lines.
|
| I guess that will affect the constant factor in that O(n)
| (assuming that's true. I wouldn't even dare say I believe
| that or its converse)
| dragontamer wrote:
| > x86 instructions can straddle MMU pages and even cache
| lines.
|
| That doesn't change the size or power-requirements of the
| decoder however. The die-size is related to the total-
| work done (aka: O(n) die area). And O(n) also describes
| the power requirements.
|
| If the core is stalled on MMU pages / cache line stalls,
| then the core idles and uses less power. The die-area
| used doesn't change (because once fabricated in
| lithography, the hardware can't change)
|
| > (assuming that's true. I wouldn't even dare say I
| believe that or its converse)
|
| Kogge-stone
| (https://en.wikipedia.org/wiki/Kogge%E2%80%93Stone_adder)
| can take *any* associative operation and parallelize it
| into O(n) work / O(log(n)) depth.
|
| The original application was the Kogge-stone carry-
| lookahead adders. How do you calculate the "carry bit" in
| O(log(n)) time, where n is the number of bits? It is
| clear that a carry bit depends on all 32-bits + 32-bits
| (!!!), so it almost defies logic to think that you can
| figure it out in O(log(n)) depth.
|
| You need to think about it a bit, but it definitely
| works, and is a thing taught in computer architecture
| classes for computer engineers.
|
| --------
|
            | Anyway, if you understand how Kogge-Stone carry
            | lookahead works, then you understand that any
            | associative operation can be parallelized the
            | same way (the carry-bit calculation is just one
            | such associative operation). The next step is
            | realizing that "stepping a state machine" is also
            | associative.
            |
            | This is trickier to prove, so I'll just defer to
            | the 1986 article "Data Parallel Algorithms"
            | (http://uenics.evansville.edu/~mr56/ece757/
            | DataParallelAlgori...), page 6 in the PDF / page
            | 1175 in the lower corner.
            |
            | There, Hillis / Steele prove that FSM / regex
            | parsing is an associative operation, and
            | therefore can be implemented with the Kogge-Stone
            | structure (called a "prefix-sum" or "scan"
            | operation in that paper).
| atq2119 wrote:
| You're right about the implications of Kogge-Stone, but
| constant factors matter _a lot_.
|
| In fixed width ISAs, you have N decoders and all their
| work is useful.
|
| In byte granularity dynamic width ISAs, you have one
| (partial) decoder per byte of your instruction window.
| All but N of their decode results are effectively thrown
| away.
|
| That's very wasteful.
|
| The only way out I can see is if determining instruction
| length is extremely cheap. That doesn't really describe
| x86 though.
|
| The other aspect is that if you want to keep things
| cheap, all this logic is bound to add at least one
| pipeline stage (because you keep it cheap by splitting
| instruction decode into a first partial decode that
| determines instruction length followed by a full decode
| in the next stage). Making your pipeline ~10% longer is a
| big drag on performance.
| dragontamer wrote:
                | > In fixed width ISAs, you have N decoders
                | and all their work is useful.
                |
                | ARM isn't fixed width anymore. AESE + AESMC
                | macro-op fuse into a single op, for example.
                |
                | IIRC, high performance ARM cores are macro-op
                | fusing and/or splitting up instructions into
                | micro-ops. It's not necessarily a 1-to-1
                | translation from instructions to ops anymore
                | for ARM.
                |
                | But yes, a simpler ISA will have a simpler
                | decoder. But a lot of people seem to think
                | it's a huge, possibly asymptotically huge,
                | advantage.
|
| -----------------
|
                | > The only way out I can see is if
                | determining instruction length is extremely
                | cheap. That doesn't really describe x86
                | though.
                |
                | I just described an O(n) work / O(log(n))
                | latency way of determining instruction
                | length.
                |
                | Proposal: 1. Have an "instruction length
                | decoder" on every byte coming in. Because
                | this ONLY determines instruction length and
                | is decidedly not a full decoder, it's much
                | cheaper than a real decoder.
                |
                | 2. Once the "instruction length decoders"
                | determine the start-of-instructions, have 4
                | to 8 complete decoders read instructions
                | starting "magically" in the right spots.
                |
                | That's the key to my proposal. Sure, the
                | Kogge-Stone part is a bit more costly, but
                | focus purely on instruction length and you're
                | pretty much set to have a cheap and simple
                | "full decoder" down the line.
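                |
                | A software analogy of that two-stage split
                | (toy code; length_only() and full_decode()
                | are hypothetical stand-ins for the cheap
                | and expensive halves):
                |
                |   #include <stddef.h>
                |   #include <stdint.h>
                |
                |   size_t length_only(const uint8_t *p);
                |   void full_decode(const uint8_t *p,
                |                    uint8_t *uop);
                |
                |   /* Stage 1: mark instruction starts.
                |      Shown serially; this is the part the
                |      Kogge-Stone scan parallelizes. */
                |   int find_starts(const uint8_t *b,
                |                   size_t n,
                |                   size_t *starts, int max) {
                |       int c = 0;
                |       size_t off = 0;
                |       while (off < n && c < max) {
                |           starts[c++] = off;
                |           off += length_only(b + off);
                |       }
                |       return c;
                |   }
                |
                |   /* Stage 2: the full decoders are now
                |      independent and can run side by side,
                |      4 to 8 at a time. 64B per decoded-op
                |      slot is arbitrary. */
                |   void decode_all(const uint8_t *b,
                |                   const size_t *starts,
                |                   int c, uint8_t *uops) {
                |       for (int i = 0; i < c; i++)
                |           full_decode(b + starts[i],
                |                       uops + 64 * i);
                |   }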
| klelatti wrote:
| Maybe but O(n) on a complex decoder could be a material
| penalty vs a much simpler design.
| dragontamer wrote:
                    | Sure, but only within a constant factor.
                    |
                    | My point is that ARM's ISA vs the x86 ISA
                    | is not any kind of asymptotic difference
                    | in efficiency that grossly prevents
                    | scaling.
                    |
                    | Even the simplest decoder on an exactly
                    | 32-bit ISA (like POWER9) with no frills
                    | would need O(n) scaling. If you do 16
                    | instructions per clock tick, you need 16
                    | individual decoders, one on every 4
                    | bytes.
                    |
                    | Sure, that delivers the results in O(1)
                    | instead of O(log(n)) like a Kogge-Stone
                    | FSM would, but ARM64 isn't exactly cake
                    | to decode either. There's micro-op /
                    | macro-op fusion going on in ARM64 (ex:
                    | AESE + AESMC are fused on ARM64 and
                    | executed as one uop).
| mhh__ wrote:
          | > because stepping the state machine is an
          | associative (but not commutative) operation.
          |
          | Proof?
|
| > But I don't think that decoders would scale poorly,
| even if you had to build a complicated parallel-execution
| engine to discover the start-of-instruction information.
|
| Surely we have empirical proof of that in that Apple's
| first iteration of a PC chip is noticeably wider than
| AMD's top of the line offering. On top of this we know
          | that to make x86 run fast you have to include an entirely
| new layer of cache just to help the decoders out.
|
| We've had this exact discussion on this site before, so I
| won't start it up again but even if you are right that
| you can decode in this manner I think empirically we
| _know_ that the coefficient is not pretty.
| dragontamer wrote:
| I'm not a chip designer. I don't know the best decoding
| algorithms in the field.
|
            | What I can say is, I've come up with a decoding
            | algorithm that's clearly O(n) total work and
            | O(log(n)) depth. From there, additional work
            | would be done to discover faster methodologies.
            |
            | The importance of proving O(n) total work is not
            | in saying "X should be done in this manner". It's
            | in showing "X has, at worst, this asymptotic
            | complexity".
|
| I presume that the actual chip designers making decoders
| are working on something better than what I came up with.
|
| > Proof?
|
            | It's not an easy proof. But I think I've given
            | enough information in sibling posts that you can
            | figure it out yourself over the next hour if you
            | really care.
|
| The Hillis / Steele paper is considered to be a good
| survey of parallel computing methodologies in the 1980s,
| and is a good read anyway.
|
            | One key idea that helped me figure it out is that
            | you can run FSMs "backwards". Consider an FSM
            | where "abcd" is being matched. (No match == state
            | 1. State 2 == a was matched. State 3 == a and b
            | were matched. State 4 == abc was matched. State 5
            | == abcd was matched.)
            |
            | You can see rather easily that FSM(a)(b)(c)(d),
            | applied from left to right, is the normal way
            | FSMs work.
            |
            | Reverse-FSM(d) == 4 if-and-only-if the 4th
            | character is d. Reverse-FSM(other-characters) ==
            | initial state in all other cases.
            | Reverse-FSM(d)(c) == 3.
|
| In contrast, we can also run the FSM in the standard
| forward approach. FSM(a)(b) == 3
|
| Because FSM(a)(b) == 3, and Reverse-FSM(d)(c) == 3, the
| two states match up and we know that the final state was
| 5.
|
| As such, we can run the FSM-backwards. By carefully
| crafting "reverse-FSM" and "forward-FSM" operations, we
| can run the finite-state-machine in any order.
|
| Does that help understand the concept? The "later" FSM
| operations being applied on the later-characters are
| "backwards" operations (figuring out the transition from
| transitioning from state4 to state 3). While "earlier"
| FSM operations on the first characters are the
| traditional forward-stepping of the FSM.
|
| When the FSM operations meet in the middle (aka:
| FSM(a)(b) meets with reverse-FSM(c)), all you do is
| compare states: reverse-FSM(c) declares a match if-and-
| only-if the state was 3. FSM(a)(b) clearly is in state 3,
| so you can combine the two values together and say that
| the final state was in fact 4.
|
| -----
|
            | Last step: now that you think of forward and
            | reverse FSMs, the "middle" FSMs are also similar.
            | Consider FSM2(b), which processes the 2nd
            | character: the output is basically a 2->3 link
            | if-and-only-if the 2nd character is b.
            |
            | FSM(a) == 2 and ReverseFSM(d)(c) == 3, and we
            | have the middle FSM2 returning (2->3). So we know
            | that the two sides match up.
|
| So we can see that all bytes: the first bytes, the last
| bytes, and even the middle bytes, can be processed in
| parallel.
|
            | For example, let's take FSM2(b)(c), and process the two
| middle bytes before we process the beginning or end. We
| see that the 2nd byte is (b), which means that FSM2(b)(c)
| creates a 2->4 link (if the state entering FSM2(b)(c) is
| 2, then the last state is 4).
|
| Now we do FSM(a), and see that we are in state 2. FSM(a)
| puts us in state 2, and since FSM2(b)(c) has been
| preprocessed to be 2->4, that puts us at state 4.
|
            | So we can really process the FSM in any order.
            | Kogge-Stone gives us an O(n) work + O(log(n))
            | depth parallel methodology. Done.
| mhh__ wrote:
| Is parsing the instructions in parallel strictly a
| _finite_ state machine? Lookahead? etc.
| dragontamer wrote:
                | It's pretty obvious to me that the x86
                | prefix extensions to the ISA form a Chomsky
                | Type-3 regular grammar. Which means that a
                | (simple) regex can describe the x86 prefixes,
                | which can be converted into a
                | nondeterministic finite automaton, which can
                | be converted into a deterministic finite
                | state machine.
                |
                | Or in more colloquial terms: you "parse" a
                | potential x86 instruction by "starting with
                | the left-most byte, reading one byte at a
                | time until you get a complete instruction".
|
                | Any grammar that you can parse one byte at a
                | time from left to right is a Chomsky Type-3
                | grammar. (In contrast: 5 + 3 * 2 cannot be
                | parsed from left to right: 3*2 needs to be
                | evaluated before the 5+ part.)
| Symmetry wrote:
| You're right in a sense, in that even really hairy decode
| problems like x86 only add 15% or so to a core's overall
| power usage.
|
| But on the other hand verification is a big NRE cost for
| chips and proving that that second ISA works correctly in
| all cases is a pretty substantial engineering cost even if
| the resulting chip is much the same.
| ants_a wrote:
            | It likely wouldn't be 15% with an 8-wide x86
            | decoder, if such a thing were even possible
            | within any reasonable clock budget. So in that
            | sense a fixed width ISA does buy
| something. Also, given that chips today are mostly power
| limited, 15% power usage is 15% that could be used to
| increase performance in other ways.
| toast0 wrote:
| My understanding is AArch64 is more or less fresh vs AArch32,
| whereas the ia16/ia32/amd64 transitions are all relatively
| simple extensions. Almost all the instructions work in all
| three modes, just the default registers are a bit different and
| the registers available are different and addressing modes
| might be a smidge different. You would make some gains in
| decoder logic by dropping the older stuff, but not a whole lot;
| amd64 still has the same variable width instructions and strong
| memory consistency model that make it hard.
| astrange wrote:
| If you dropped 16-bit and i386 then you might be able to
| reuse some of the shorter instruction codes like 0x62 (BOUND)
| that aren't supported in x86-64.
|
| Decoders aren't the problem with variable length instructions
| anyway (and they can be good for performance because they're
| cache efficient.) The main problem is security because you
| can obfuscate programs by jumping into the middle of
| instructions.
| colejohnson66 wrote:
| Nitpick: 0x62 can't be reused as it's _already_ been
| repurposed for the EVEX prefix ;)
| jefurii wrote:
| I read this as "Arm Announces New Mobile Army".
| DCKing wrote:
| The A510 small core is really the big news here for me. ARM
| updates their small core architecture very infrequently and the
| small cores hold back the features of their big cores. Because of
| the Cortex A55, the Cortex X1 was stuck on ARMv8.2. The OS needs
| to be able to schedule a thread on both small and big cores, you
| see. And that meant ARM's own IP missed out on security features
| such as pointer authentication and memory tagging, which ARM
| customer Apple has been shipping for years at this point.
|
| The Cortex A55 (announced 2017) was also a very minor upgrade
| over the Cortex A53 (announced 2012). These small cores are truly
| the baseline performance of application processors in global
| society, powering cheap smartphones, $30 TV boxes, most IoT
| applications doing non-trivial general computation and
| educational systems like the Raspberry Pi 3. If ARM's numbers are
| near the truth (real-world implementations do like to disappoint
| us in practice), we're going to see a nice security and
| performance uplift there. Things have been stale for a while.
|
| If anything this announcement means that in a few years time, the
| vast majority of application processors sold will have robust
| mitigations against many important classes of memory corruption
| exploits [1].
|
| [1]: https://googleprojectzero.blogspot.com/2019/02/examining-
| poi...
| my123 wrote:
  | Note that Apple has only shipped pointer authentication so
  | far, not memory tagging.
  |
  | In practice, memory tagging will only ship on ARMv9
  | products, across the board.
| pjmlp wrote:
    | I also find it a pity that so far only Solaris SPARC has
    | had proper mitigations in place for C code.
    |
    | The irony of these kinds of mitigations is that computers
    | are evolving into C Machines.
| oblio wrote:
| I think especially the cheap smartphone space will benefit.
  | Cheap Android phones are frequently so slow that they could
  | use some cheap, higher-performance CPUs.
___________________________________________________________________
(page generated 2021-05-25 23:00 UTC)