[HN Gopher] Arm Announces New Mobile Armv9 CPU Microarchitectures
       ___________________________________________________________________
        
       Arm Announces New Mobile Armv9 CPU Microarchitectures
        
       Author : seik
       Score  : 204 points
       Date   : 2021-05-25 13:45 UTC (9 hours ago)
        
 (HTM) web link (www.anandtech.com)
 (TXT) w3m dump (www.anandtech.com)
        
       | rektide wrote:
       | So very happy to have a real bump to Cortex A53. Cortex A510,
       | woohoo!
       | 
       | Cortex-A55 ended up being such a modest, faint improvement over
        | 2012's Cortex-A53. As another point of reference, the A53
        | initially shipped on 28nm.
        
       | hajile wrote:
       | I think everyone is blowing past the real story here.
       | 
       | A72 cores are 10% slower than A73 per clock.
       | 
       | A510 is also supposed to be around 10% slower than A73 -- about
       | the same performance as A72.
       | 
       | The big difference is that A72 is out-of-order while A510 is in-
       | order. This means that A510 won't be vulnerable to spectre,
       | meltdown, or any of the dozens of related vulnerabilities that
       | keep popping up.
       | 
       | For the first time, we can run untrusted code at A72 speeds (fast
       | enough for most things) without worrying that it's stealing data.
        
         | phire wrote:
         | In-order CPUs aren't automatically immune to speculative
         | execution exploits. ARM's A8 core is an example of an in-order
          | core that was vulnerable.
         | 
         | It's speculative execution that causes problems, and in-order
         | CPUs still do speculative execution during branch prediction.
          | It's just that they typically get through fewer instructions
         | while speculating.
         | 
          | All you need for Spectre is a branch misprediction that lasts
         | long enough for a dependent load.
         | 
          | However, ARM has known about Meltdown and Spectre for over
          | three years now. There is a good chance the A510 is
         | explicitly designed to be resistant to those exploits.
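          | 
          | For reference, a minimal C sketch of the classic Spectre v1
          | gadget (the bounds-check-bypass pattern from the original
          | paper; the array names and sizes are just illustrative):
          | 
          |   #include <stddef.h>
          |   #include <stdint.h>
          | 
          |   uint8_t array1[16];
          |   uint8_t array2[256 * 512];
          |   uint8_t temp;
          | 
          |   void victim(size_t x, size_t array1_size) {
          |       if (x < array1_size) {              // mispredicted branch
          |           uint8_t secret = array1[x];     // speculative OOB load
          |           temp &= array2[secret * 512];   // dependent load leaves
          |                                           // a cache footprint
          |       }
          |   }
          | 
          | The architectural result is thrown away once the misprediction
          | is detected, but the dependent load's cache side effect survives
          | and can be measured - exactly the window described above.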
        
         | mhh__ wrote:
          | Spectre exists because computers are memory bound, so branch
          | prediction without being able to touch memory is useless, and
          | touching memory is very hard to undo - not because of out-of-
          | order execution per se.
        
         | pertymcpert wrote:
         | In order cores can't use branch prediction?
        
           | amatecha wrote:
            | Interestingly, I think they do (I don't know much about them).
           | ARM's own infographic[0] says "In-order design with 'big-
           | core' inspired prefetch & prediction", not sure if that's the
           | same branch prediction we're talking about in regards to
           | Spectre, etc.
           | 
           | [0] https://community.arm.com/cfs-file/__key/communityserver-
           | blo... from https://community.arm.com/developer/ip-
           | products/processors/b...
        
           | Symmetry wrote:
           | They do but almost no in order design can manage to load a
           | value from memory and then launch another load from the
           | location based on that value before the branch resolves and
           | the speculation is aborted. IIRC the Cortex A8 and Power 6
            | are the only two in-order processors vulnerable to Spectre.
        
         | amatecha wrote:
         | Interesting, and great point. Curious to see the first SBCs
         | using these :) (FWIW Cortex-A710 is still out-of-order
         | pipeline, if anyone was wondering)
         | 
         | Spec pages:
         | 
         | A510: https://developer.arm.com/ip-
         | products/processors/cortex-a/co...
         | 
         | A710: https://developer.arm.com/ip-
         | products/processors/cortex-a/co...
        
       | user-the-name wrote:
       | Why did ARM have to abandon their nice, clean numbering system of
       | basically just increasing the number in each family by one, and
       | go for just completely random meaningless numbers that jump by an
       | order of magnitude for no reason... Such a waste of a clean and
       | understandable lineup.
        
         | AlexAltea wrote:
         | This is explained in the article:
         | 
         | > The new CPU family marks one of the largest architectural
         | jumps we've had in years, as the company is now baselining all
         | three new CPU IPs on Armv9.0.
         | 
         | I believe the deprecation of AArch32 is quite an important
         | change and by itself already warrants issuing a new major
          | version (then there are all the mandatory extensions from v8.2+ and
         | SVE2).
         | 
          | They have not claimed that they will bump the major version on
          | a regular basis from now on, as you seem to suggest.
        
           | icegreentea2 wrote:
           | That's a reasonable perspective to take. But the A710 still
           | supports AArch32. And they were really close to a traditional
           | re-vamp point anyways =P. This release (of the middle core)
            | would have been the A79. The next release (the one that
            | would actually deprecate AArch32) would have been the A80...
            | and doing weird shit when you reach "10" would just follow in
            | the grand tradition of tech. I think really only Intel somehow
           | just... kept counting up in a more or less sane way.
        
         | bogwog wrote:
         | It'd be interesting to know the thought process behind their
         | terrible naming convention. From the outside, it looks like
          | they went out of their way to make it as confusing as
         | possible.
        
       | ksec wrote:
       | This is actually worse than I thought compared to N2 used on
        | Server. N2 was supposed to have a 40% IPC increase, comparatively
       | speaking.
       | 
       | > In terms of the performance and power curve, the new X2 core
       | extends itself ahead of the X1 curve in both metrics. The +16%
       | performance figure in terms of the peak performance points,
       | though it does come at a cost of higher power consumption.
       | 
        | The 16% figure was measured with an 8MB cache compared to a 4MB
        | cache on the X1. I.e. in the _absolute_ (also unrealistic) best-
        | case scenario, with an extra 10% clock-speed increase, you are
        | looking at the single-core performance of the X2, released with
        | flagship phones in 2022, being roughly equivalent to the A13 used
        | in the iPhone 11, released in 2019.
       | 
       | The most interesting part is actually the LITTLE Core A510. I am
       | wondering what it would bring to low cost computing.
        | Unfortunately no die size estimates were given.
        
       | daniel_iversen wrote:
       | Could anyone explain the main differences between this new Armv9
       | CPU and the Apple M1 ARMs? What are the strengths and weaknesses
       | of the two or is one lightyears ahead of the other?
        
         | hajile wrote:
         | ARMv8 is a big target [0]. You have 6 major ISA extensions of
          | which recent ARM designs support 2 and Apple's designs support 5.
         | There's also SVE which was a kind of side-option. In addition,
         | there's the 32-bit version, the 64-bit version, the R and A
         | versions all with different capabilities and support levels.
         | 
         | ARMv9 tries to unify that a bit [1]. They add a new security
         | model and SVE moves from a side option to a requirement. A
         | bunch of the various v8 instructions also get bundled into the
         | core requirements.
         | 
         | A14 vs X2 is a different question. A14/M1 _really_ dropped the
         | ball by not supporting SVE. I suspect they will support it in
         | their next generation of processors, but the problem is
          | adoption. Everyone is jumping on the M1 units and won't be
         | upgrading their laptops for another few years. As such, they'll
         | be held back for the foreseeable future.
         | 
          | On performance there is no question: A14 will retain its lead
          | for the next 2-3 years at least. X1 chips are already
          | significantly slower than A14 chips. X2 only increases
          | performance by 10-16% (as pointed out in the article, the 16%
          | is a bit disingenuous as it compares the X1 with 4MB of cache
          | to the X2 with 8MB of cache, while the X1 gets a decent speedup
          | with the 8MB version). Furthermore, by the time X2 is actually in
         | devices, Apple will already be launching their own next
         | generation of processors (though I suspect we're about to see
         | them move to small performance increases due to diminishing
         | returns).
         | 
         | [0] https://en.wikipedia.org/wiki/AArch64
         | 
         | [1] https://www.anandtech.com/show/16584/arm-announces-
         | armv9-arc...
        
           | brokencode wrote:
           | Where does the A14/M1 suffer due to the lack of SVE? The
           | performance of M1 is well known to be terrific, so it's hard
           | to characterize that as dropping the ball in my book. More
           | like they prioritized other features instead, and they ended
           | up creating a great processor.
        
             | hajile wrote:
             | Let's say M2 has SVE. Do you:
             | 
             | * only use NEON to save developer time and lose performance
             | and forward compatibility
             | 
             | * only use SVE to save developer time and lose backward
             | compatibility
             | 
             | * Pay to support both and deal with the cost/headaches
             | 
             | Experience shows that AVX took years to adopt (and still
             | isn't used for everything) because SSE2 was "good enough"
             | and the extra costs and loss of backward compatibility
             | weren't worth it.
             | 
             | If SVE were supported out of the gate, then the problem
             | would simply never exist. Don't forget (as stated before)
             | that there's been a huge wave of M1 buyers. People upgraded
             | early either to get the nice features or not be left behind
             | as Apple drops support.
             | 
             | Let's say you have 100M Mac users and an average of 20M are
             | buying new machines any given year (a new machine every 5
             | years). The 1.5-2-year M1 wave gets 60-70M upgraders in the
             | surge. Now sales are going to decline for a while as the
             | remaining people stick to their update schedule (or hold on
             | to their x86 machines until they die). Now the M2 with SVE
             | only gets 5-10M upgraders. Does it make sense to target
             | such a small group or wait a few years? I suspect there
             | will be a lot of waiting.
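              | 
              | For a feel of what "pay to support both" means, here is a
              | minimal sketch of the same loop written twice, once with SVE
              | intrinsics and once with NEON (the function name and the
              | compile-time switch are just illustrative):
              | 
              |   #include <stddef.h>
              |   #if defined(__ARM_FEATURE_SVE)
              |   #include <arm_sve.h>
              |   // SVE: vector-length agnostic, predicated tail handling.
              |   void add_f32(float *dst, const float *a, const float *b,
              |                size_t n) {
              |       for (size_t i = 0; i < n; i += svcntw()) {
              |           svbool_t pg = svwhilelt_b32_u64(i, n);
              |           svfloat32_t va = svld1_f32(pg, a + i);
              |           svfloat32_t vb = svld1_f32(pg, b + i);
              |           svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
              |       }
              |   }
              |   #else
              |   #include <arm_neon.h>
              |   // NEON: fixed 128-bit vectors, explicit scalar tail.
              |   void add_f32(float *dst, const float *a, const float *b,
              |                size_t n) {
              |       size_t i = 0;
              |       for (; i + 4 <= n; i += 4) {
              |           float32x4_t va = vld1q_f32(a + i);
              |           float32x4_t vb = vld1q_f32(b + i);
              |           vst1q_f32(dst + i, vaddq_f32(va, vb));
              |       }
              |       for (; i < n; i++) dst[i] = a[i] + b[i];
              |   }
              |   #endif
              | 
              | Even in this trivial case the two paths share no code, and
              | the compile-time switch means you either ship two builds or
              | add runtime dispatch on top.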
        
               | brokencode wrote:
               | My point was that it probably doesn't matter. M1 is
               | already very fast, even without SVE. At some point you
               | just have to decide to ship a product, even if it doesn't
               | have every possible feature.
               | 
               | Like other posters mentioned, for vector operations like
               | this, you could be dynamically linking to a library that
               | handles this for you in the best way for the hardware.
               | Then when new instructions become available, you don't
               | have to change any of your code to take advantage.
        
               | tylerhou wrote:
                | I don't think any code performance-sensitive enough to
               | warrant SVE instructions will want to tolerate a jump in
               | a very tight loop.
        
               | zsmi wrote:
               | It's even harder than that.
               | 
               | SVE vs NEON performance will also hugely depend on the
               | vector length the given algorithm requires, and the
               | stress that the instructions put on the memory subsystem.
               | 
               | Memory hierarchy varies by product and will likely
               | continue to do so regardless of what M2 does.
               | 
                | In the end, I echo eyesee's comment: for best performance
               | one should really use Accelerate.framework on Apple's
               | hardware.
        
             | eyesee wrote:
             | I suppose they "dropped the ball" in the sense that those
             | instructions cannot be assumed to be available, thus will
             | not be encoded by the compiler by default. Any future
             | processors which include the instructions may not benefit
             | until developers recompile for the new instructions and go
             | through the extra work required to conditionally execute
             | when available.
             | 
             | That said, to get the best performance on vector math it
             | has long been recommended to use Apple's own
             | Accelerate.framework, which has the benefit of enabling use
             | of their proprietary matrix math coprocessor. One can
             | expect the framework to always take maximum advantage of
             | the hardware wherever it runs with no extra development
             | effort required.
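              | 
              | For the simple vector cases, the vDSP part of Accelerate is
              | enough (a minimal sketch, not a complete program):
              | 
              |   #include <Accelerate/Accelerate.h>
              | 
              |   // dst[i] = a[i] + b[i]; vDSP uses whatever SIMD path is
              |   // best on the machine it actually runs on.
              |   void add_f32(float *dst, const float *a, const float *b,
              |                size_t n) {
              |       vDSP_vadd(a, 1, b, 1, dst, 1, n);
              |   }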
        
           | JamesDeepDown wrote:
           | "I suspect we're about to see [Apple Silicon] move to small
           | performance increases due to diminishing returns"
           | 
           | Is there any discussion of this online, I'm curious to read
           | more.
        
             | hajile wrote:
             | This has already happened.
             | 
             | A14 is 3.1GHz while A13 is 2.66GHz. A14 is around 20%
             | faster than A13 overall, but it is also clocked 14% higher
             | which gives only around 6% more IPC. Power consumption all
             | but guarantees that they won't be doing that too often.
        
         | zibzab wrote:
         | M1 is ARMv8.4
         | 
         | v9 has a bunch of neat features that improve security and
         | performance (+ less baggage from ARMv7 which also improves area
         | and power)
        
           | gsnedders wrote:
           | > v9 has a bunch of neat features that improve security and
           | performance
           | 
           | Many of them are already optional features in later ARMv8
           | revisions, however.
        
             | zibzab wrote:
             | A lot of v8.x features were added to fix mistakes in v8.
             | 
              | v9 brings a bunch of completely new stuff, especially in the
             | security department.
        
         | wmf wrote:
         | The Arm X2 is still going to be behind Apple Firestorm but at
         | least Arm won't be trailing by 50% any more. At least for now
          | Arm doesn't have an appetite for creating monster 8-issue cores.
         | 
         | See X1 vs. Firestorm:
         | https://www.anandtech.com/show/16463/snapdragon-888-vs-exyno...
        
           | zibzab wrote:
           | X1 is _not_ ARMv9 and there is not even a comparison with M1
           | on that page.
           | 
           | X2 is not even out yet, you have no idea how it may perform.
           | 
            | Why does every post on HN drift into blindly praising an
           | unrelated product?
        
             | wmf wrote:
             | ARMv8 vs ARMv9 doesn't matter; performance matters. This
             | article is about the X2 so of course we're going to discuss
             | what little information we have.
        
             | klelatti wrote:
             | > In terms of the performance and power curve, the new X2
             | core extends itself ahead of the X1 curve in both metrics.
             | The +16% performance figure in terms of the peak
             | performance points, though it does come at a cost of higher
             | power consumption.
             | 
             | Some idea how the X2 will perform.
        
               | GeekyBear wrote:
               | >Some idea how the X2 will perform.
               | 
               | If you read on a bit, there is some question that those
               | performance metrics will be seen in the real world, due
               | to existing thermal issues.
               | 
               | > I am admittedly pessimistic in regards to power
               | improvements in whichever node the next flagship SoCs
               | come in (be it 5LPP or 4LPP). It could well be plausible
               | that we wouldn't see the full +16% improvement in actual
               | SoCs next year.
        
               | zibzab wrote:
               | In general for moving from v8 to v9, I think 16% is
               | extremely pessimistic.
               | 
               | Removal of aarch32 has huge implications. And this is not
               | limited to CPU but also touches MMU, caches and more that
               | in v8 had to provide aarch32 compatible interfaces. This
                | led to really inefficient designs (for example, the MMU walk,
                | which is extremely critical to performance, is more than
               | twice as complicated in ARMv8 compared to v7 and v9).
               | 
               | The space saved by removing these can be used for bigger
               | caches, wider units, better branch prediction and other
               | fun stuff.
               | 
               | Finally, note also that the baseline X1 numbers come from
               | Samsung who we all know are worst in class right now. And
               | they are using an inferior process. Let's see what qcomm,
               | ampere and amazon can do with v9.
        
       | Aaargh20318 wrote:
        | Does Apple's M1 still support AArch32, given that it uses an
        | ARMv8.4-A instruction set? I'm assuming no 32-bit code is ever
        | executed on macOS on ARM; how much die space would removing the
        | 32-bit support save?
       | 
       | With WWDC '21 being only weeks away, I wonder if we're going to
       | see an ARMv9 M2.
        
         | spijdar wrote:
          | AFAIK none of the (current) Apple-produced processors do 32
          | bit, even the Apple Watch. Which is interesting, as it is very
          | memory constrained, the usual reason for using 32-bit code.
         | 
          | Judging by some LLVM patches by an Apple engineer, they solved
          | this by creating an ILP32 mode for AArch64 where they use the
          | full 64-bit mode but with 32-bit pointers.
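          | 
          | (Illustration only: under that ILP32 / arm64_32 model the code
          | still executes in AArch64 state, but the C data model shrinks,
          | so a program like
          | 
          |   #include <stdio.h>
          | 
          |   int main(void) {
          |       // 4 under ILP32 (arm64_32), 8 under the usual LP64 arm64
          |       printf("%zu %zu\n", sizeof(void *), sizeof(long));
          |       return 0;
          |   }
          | 
          | prints "4 4" instead of "8 8".)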
        
         | djxfade wrote:
          | I'm pretty sure the M1 does not have AArch32 hardware on its
         | die. It wouldn't make any sense.
        
           | my123 wrote:
            | AArch32 has not been implemented on Apple CPUs for quite a
            | long time now.
        
           | Aaargh20318 wrote:
            | Is AArch32 support optional in ARMv8.4-A?
        
             | TNorthover wrote:
             | It has been all along.
        
           | macintux wrote:
           | Doesn't seem that long ago that Apple implementing 64-bit ARM
           | CPUs was derided as a publicity stunt.
        
         | marcan_42 wrote:
         | No. There is no AArch32 on the M1. Even things like the A32
         | relevant bits in the Apple-proprietary performance counters are
         | reserved and unimplemented.
        
           | glhaynes wrote:
           | iOS 11 was the first Apple OS to not run 32-bit apps so I'm
           | guessing the iPhone X that was released later that year
           | (2017) with the A11 Bionic chip was the first to drop 32-bit
           | support.
        
             | mumblemumble wrote:
             | Yep, that was the one.
             | 
             | Perhaps more relevant to the M1, though, is that OS X
             | dropped support for 32-bit apps about 18 months back, with
             | Catalina. The timing there seems just too coincidental to
             | _not_ have been done in order to pave the way for the M1.
        
               | _ph_ wrote:
               | Indeed. I think throwing out 32 bit support in Catalina
               | was in preparation for the M1 one year later. Not so much
               | because the M1 is 64 bit only, as 32 bit x86 applications
               | wouldn't have run on it either way. Perhaps to make
               | Rosetta simpler, as it only had to support 64 bit code.
                | But also, with Catalina, all old and unmaintained
                | applications no longer ran. The applications that run on
                | Catalina do run on the M1, which made the switch from
                | x86 to ARM CPUs much smoother for the users. The hard cut
                | had been done with Catalina already - and this is the
                | reason my private Mac still runs Mojave :p. When I get a
                | new Mac, switching to ARM doesn't lose any more
                | backwards compatibility than switching to Catalina
                | would...
        
               | my123 wrote:
               | Rosetta supports translating 32-bit x86 instructions.
               | However, there are no native 32-bit OS libs anymore.
               | 
               | This means that you have to do the thunking back and
               | forth yourself to call OS libraries. The only x86 program
               | using this facility so far is Wine/CrossOver on Apple
               | Silicon Macs, which as such runs 32-bit x86 apps...
        
         | hajile wrote:
         | ARMv9 requires SVE. As M1 doesn't have SVE, I'd assume ARMv8.x.
         | Wikipedia claims A14 is ARMv8.5-A, so that would be my best
         | guess.
        
         | _joel wrote:
          | M2's already in fabrication; as to whether it's ARMv9, not sure.
         | https://asia.nikkei.com/Business/Tech/Semiconductors/Apple-s...
        
           | api wrote:
           | Everything I've heard says it's an M1 with more performance
           | cores and possibly more cache and GPU cores. It seems awfully
           | soon for it to be V9.
        
             | _joel wrote:
             | Ah OK, that makes sense
        
           | rejectedandsad wrote:
           | I want to emphasize it's unknown whether this is the
           | successor to the M1 by major or minor version - but it's more
           | likely that this is a production ramp-up for the new Pro
           | machines to be announced during WWDC and will likely ship
           | later this summer.
           | 
           | The successor to the M1, the lowest-end Apple Silicon running
           | on Macs, is likely expected next year in a rumored redesign
           | of the Macbook Air (whatever the CPU itself is called)
        
             | billiam wrote:
             | This is what I am banking on. I want to get another year
             | out of my old Intel Macs before getting an MX-based Macbook
             | that should melt my eyeballs with its speed while lasting a
             | day and a half on battery.
        
       | GeekyBear wrote:
       | >We've seen that Samsung's 5LPE node used by Qualcomm and S.LSI
       | in the Snapdragon 888 and Exynos 2100 has under-delivered in
       | terms of performance and power efficiency, and I generally
       | consider both big cores' power consumption to be at a higher
       | bound limit when it comes to thermals.
       | 
       | I expect Qualcomm to stick with Samsung foundry in the next
       | generation, so I am admittedly pessimistic in regards to power
       | improvements in whichever node the next flagship SoCs come in (be
       | it 5LPP or 4LPP). It could well be plausible that we wouldn't see
       | the full +16% improvement in actual SoCs next year.
       | 
       | https://www.anandtech.com/show/16693/arm-announces-mobile-ar...
       | 
       | It sounds like the thermal issues of the current generation
       | flagship Android chips are expected to remain in place.
        
         | smitty1110 wrote:
         | Honestly, they're probably stuck with Samsung for the mid term
         | (5 years). You simply can't get any TSMC capacity on their top
         | nodes when Apple and AMD get first and second bid on
         | everything. Maybe GloFo will figure out their issues, and maybe
         | Intel will sell their latest nodes, but until then companies
         | are stuck with their current partners.
        
           | mardifoufs wrote:
            | Didn't GloFo officially announce they will stick to >14nm a
            | few years ago? Have they started developing a new node? IIRC
            | they just didn't have the capital to keep spending a few dozen
            | billion on R&D/equipment every few years, so they shifted
            | their focus to maximizing yield on older nodes.
        
       | paulpan wrote:
       | TLDR seems to be that next year's Cortex X2 and A710 are very
        | incremental upgrades (+10-15%) over existing designs, at least
        | compared to what ARM has delivered in recent years. The A510 seems
       | promising for the little core. Wait for the 2023 next gen Sophia-
       | based designs if you can.
       | 
       | Given the industry's reliance on ARM's CPU designs, I wonder if
       | it makes sense for prosumers / enthusiasts to keep track of these
       | ARM microarchitectures instead of, say, Qualcomm Snapdragon 888
        | and Samsung Exynos 2100. Because ultimately the ARM architecture
       | is the cornerstone of performance metrics in any non-Apple
       | smartphone.
       | 
       | Conversely, it'll be interesting to see how smartphone OEMs
       | market their 2022 flagship devices when it's clearly insinuated
       | that the CPU performance on the next-gen ARM cores will not be
       | much of an uplift compared to current gen. Probably more emphasis
       | on camera, display and design.
        
         | amelius wrote:
         | > when it's clearly insinuated that the CPU performance on the
         | next-gen ARM cores will not be much of an uplift compared to
         | current gen
         | 
         | Perhaps going to smaller technology nodes will make them
         | faster, still? Or is that already part of the prediction?
        
           | wmf wrote:
           | X1/A78 is on "5 nm" and "3 nm" isn't available yet so the new
           | X2/A710 will still be on a similar process.
        
         | pantalaimon wrote:
         | Isn't Qualcomm going to release original designs again with the
         | acquisition of Nuvia?
        
           | TradingPlaces wrote:
           | This is the most under-appreciated thing in the ARM world rn.
           | Early 2023 probably for the first Snapdragons with Nuvia
           | cores
        
           | zsmi wrote:
           | https://www.anandtech.com/show/16553/qualcomm-completes-
           | acqu...
           | 
           | "The immediate goals for the NUVIA team will be implementing
           | custom CPU cores into laptop-class Snapdragon SoCs running
           | Windows" - Qualcomm's Keith Kressin, SVP and GM, Edge Cloud
           | and Computing
        
           | paulpan wrote:
           | I thought Nuvia's designs are for higher TDP products like
           | laptops and servers, rather than mobile SOCs. But it'd make
           | sense for both segments, e.g. scaling core count.
        
       | zokier wrote:
       | Not directly related to these new fancy cores, but I was looking
       | at the product range and noticed Cortex-A34 core in there that
       | has snuck by quietly. I couldn't find any hardware, chips or
       | devices, that are using that core. It has the interesting
        | property that it's AArch64-only, like these new ones. Has anyone
       | seen or heard about it in the wild?
        
       | nickcw wrote:
       | I will shed a small tear for the passing of AArch32 as it was the
       | first processor architecture I really enjoyed programming.
       | 
       | I wrote hundreds of thousands of lines of assembler for it and
        | what a fun architecture it was to program for. Enough registers
        | that you rarely needed extra storage in a function, conditional
        | instructions to lose those branches, and 3-register arguments to
        | each instruction meant that there was lots of opportunity for
        | optimisation. Plus the (not very RISC but extremely useful) load
        | and store multiple instructions made it a joy to work with.
       | 
        | AArch64 is quite nice too but they took some of the fun out of
       | the instruction set - conditional execution and load and store
       | multiple. They did that for good reason though in order to make
       | faster superscalar processors so I don't blame them!
        
         | josteink wrote:
         | > I will shed a small tear for the passing of AArch32 as it was
         | the first processor architecture I really enjoyed programming.
         | 
         | I'll say the same for the Motorola 68k series CPUs for the
         | exact same reason.
         | 
         | With Intel's era of total domination nearing an end, I guess
         | we're seeing some things go full circle.
        
         | txdv wrote:
         | any recommendations on learning assembly?
        
         | kevin_thibedeau wrote:
         | Cortex M isn't going anywhere. I can't imagine 32-bit ARM will
         | ever die.
        
           | gsnedders wrote:
           | Genuine question: what stops most Cortex M users from
           | adopting A64 with ILP32? Anything except existing codebases?
        
             | monocasa wrote:
             | At least in the M0 range, A64 won't be competitive on a
             | gate count basis. M0s can be ridiculously tiny, down to
             | ~12k gates, which is how they were able to take a huge
             | chunk out of the market that had previously been 8/16 bit
             | processors like 8051s and the like.
             | 
             | M33s and up might make sense as A64 cores though.
        
             | ChuckNorris89 wrote:
             | Many reasons. Cost being the biggest (64bit dies would cost
             | way more than the 32bit ones) and power consumption (having
             | a complex pipeline with 64bit registers isn't great when
             | you have to run on a coin cell for a year).
             | 
             | Similar reasons to why 8 and 16 bit micros stuck around for
             | so long in low cost and low power devices before Cortex M0+
             | became cheap and frugal enough.
        
             | topspin wrote:
             | Genuine answer: power.
             | 
             | Power consumption is a hard constraint. The benefits of A64
             | have to outweigh the cost in power consumption, and the
             | benefits just aren't there. Coin cells, small solar
             | collectors and supercaps are among the extremely limited
             | power sources for many Cortex M applications.
        
         | ant6n wrote:
         | Also fun bits for optimization: the barrel shifter that can be
         | used on the second argument in any instruction, and the
         | possibility to make any instruction conditional on the flags.
         | 
         | ...At some point I started an x86 on Arm emulator, and managed
         | to run x86 instructions in like 5-7 Arm instructions, without
         | JIT, including reading next instruction and jumping to it - all
         | thanks to the powerful Arm instruction set.
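          | 
          | (Not that emulator, just a generic sketch of the dispatch style
          | described: fetch the next guest opcode and jump straight to its
          | handler with a computed goto, a GCC/Clang extension, no JIT
          | involved. The opcode table here is hypothetical and only fills
          | in two entries.)
          | 
          |   #include <stdint.h>
          | 
          |   void run(const uint8_t *guest_pc) {
          |       // One handler label per guest opcode; only NOP (0x90)
          |       // and HLT (0xF4) are wired up in this sketch.
          |       static void *handlers[256] = {
          |           [0x90] = &&op_nop, [0xF4] = &&op_hlt
          |       };
          | 
          |       goto *handlers[*guest_pc];   // read next insn, jump to it
          | 
          |   op_nop:
          |       guest_pc += 1;               // skip the 1-byte instruction
          |       goto *handlers[*guest_pc];   // dispatch the next one
          | 
          |   op_hlt:
          |       return;
          |   }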
        
           | nickcw wrote:
           | Ah yes, I forgot the barrel shifter. Shift anything in no
           | extra cycles :-)
        
       | zibzab wrote:
       | I find the clustering approach in A510 very intriguing.
       | 
        | Why run the low power cores in pairs? Is this because Android
       | _really_ struggles on single core due to the way it was designed?
       | So you basically need two cores even when mostly idle?
        
         | jng wrote:
         | The two cores only share FP and vector execution units. A ton
         | of code doesn't use those at all, and so they are effectively
         | two separate cores most of the time. It provides full
         | compatibility on both cores, full performance very often, and
         | saves a lot of die area (FP and vector are going to be very
         | transistor-heavy). It's just a tradeoff.
        
           | zibzab wrote:
           | That makes more sense.
           | 
           | I assume if you are doing floating point or SIMD on more than
           | one core then it's time to switch to big cores anyway.
        
       | klelatti wrote:
        | Interesting that mobile platforms / CPUs are losing 32-bit support
       | whilst the dominant desktop platform retains it.
       | 
       | Not sure what the cost of continuing 32 bit support on x86 is for
       | Intel/AMD but does there come a point when Intel/AMD/MS decide
       | that it should be supported via (Rosetta like) emulation rather
       | than in silicon?
        
         | dehrmann wrote:
         | > the cost of continuing 32 bit support on x86 is for Intel/AMD
         | 
         | Pretty sure they still have 16-bit support.
         | 
         | It's easier to migrate for a CPU mostly used for Android
         | because it's already had a mix of architectures, there are
         | fewer legacy business applications, and distribution through an
         | app store hides potential confusion from users.
        
           | jeroenhd wrote:
           | Another advantage the Android platform has specifically is
           | that its applications mostly consist of bytecode, limiting
           | the impact phasing out 32 bit instructions has on the
           | platform. Most high performance chips are in Android devices
           | like phones and tablets where native libraries are more the
           | exception than the rule.
           | 
           | Desktop relies on a lot of legacy support because even today
           | developers make (unjust) assumptions that a certain system
           | quirk will work for the next few years. There's no good
           | reason to drop x86 support and there's good reason to keep
           | it. The PC world can't afford to pull an Apple because
           | there's less of a fan cult around most PC products that helps
           | shift the blame on developers when their favourite games stop
           | working.
        
             | dehrmann wrote:
             | Run-anywhere was (and halfway still is) a _huge_ selling
             | point for Java. It was always clunky for desktop apps
             | because of slow JVM warmup and nothing feeling native, but
              | Android shifted enough of that into the OS so it's not a
             | problem.
        
           | klelatti wrote:
           | Good point on 16 bit!
           | 
            | I suppose I'm idly wondering at what point the legacy business
           | applications become so old that the speed penalty of
           | emulation becomes something that almost everyone can live
           | with.
        
         | undersuit wrote:
         | The cost of continuing support is increased decoder complexity.
         | The simplicity/orthogonality of the ARM ISA allows simpler
         | instruction decoders. Simpler, faster, and easier to scale.
         | Intel and AMD instruction decoders are impressive beasts that
         | consume considerable manpower, chip space, and physical power.
        
           | klelatti wrote:
           | I'd always assumed that decoding x86 and x64 had a lot in
           | common so not much to be saved there? Happy to be told
           | otherwise.
           | 
           | Agreed that Arm (esp AArch64) must be a lot simpler.
        
             | usefulcat wrote:
             | As I understand it, the fact that x86 has variable-length
             | instructions makes a significant difference. On ARM if you
             | want to look ahead to see what the next instructions are,
             | you can just add a fixed offset to the address of the
             | current instruction. If you want to do that on x86, you
             | have to do at least enough work to figure out what each
             | instruction is, so that you know how large it is, and only
             | then will you know where the next instruction begins.
             | Obviously this is not very amenable to any kind of
             | concurrent processing.
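              | 
              | A toy way to see the difference (insn_length() here is a
              | made-up stand-in for a partial decode, not a real x86 length
              | decoder):
              | 
              |   #include <stddef.h>
              |   #include <stdint.h>
              | 
              |   // Fixed 4-byte instructions: every start is known up
              |   // front, so N decoders can all begin in parallel.
              |   const uint8_t *fixed_start(const uint8_t *pc, size_t k) {
              |       return pc + 4 * k;
              |   }
              | 
              |   // Toy stand-in: a real x86 length decode has to look at
              |   // prefixes, the opcode, modrm/sib, immediates, etc.
              |   size_t insn_length(const uint8_t *p) {
              |       return (*p & 0x80) ? 2 : 1;
              |   }
              | 
              |   // Variable length: finding instruction k is a serial
              |   // walk, since each step depends on the previous length.
              |   const uint8_t *var_start(const uint8_t *pc, size_t k) {
              |       for (size_t i = 0; i < k; i++)
              |           pc += insn_length(pc);
              |       return pc;
              |   }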
        
               | Someone wrote:
               | You can do quite a bit concurrently, but at the expense
               | of hardware. You 'just' speculatively assume an
               | instruction starts at every byte offset and start
               | decoding.
               | 
               | Then, once you figure out that the instruction at offset
               | 0 is N bytes, ignore the speculative decodings starting
               | at offsets 1 through N-1, and tell the speculative
               | decoding at offset N that it is good to go. That means
               | that it in turn can inform its successors whether they
               | are doing needless work or are good to go, etc.
               | 
               | That effectively means you need more decoders, and, I
               | guess, have to accept a (?slightly?) longer delay in
               | decoding instructions that are 'later' in this cascade.
        
           | gchadwick wrote:
           | > The simplicity/orthogonality of the ARM ISA allows simpler
           | instruction decoders. Simpler, faster, and easier to scale.
           | Intel and AMD instruction decoders are impressive beasts that
           | consume considerable manpower, chip space, and physical power
           | 
           | These claims are often made and indeed make some sense but is
           | there any actual evidence for them? To really know you'd need
           | access to the RTL of both cutting edge x86 and arm designs to
           | do the analysis to work out what the decoders are actually
           | costing in terms of power and area and whether they tend to
           | produce critical timing paths. You'd also need access to the
            | companies' project planning/timesheets to get an estimate of
            | engineering effort for both (and chances are data isn't
            | really tracked at that level of granularity; you'll also need
           | a deep dive of their bug tracking to determine what is
           | decoder related for instance and estimate how much time has
           | been spent on dealing with decoder issues). I suspect
           | Intel/AMD/arm have no interest in making the relevant
           | information publicly available.
           | 
           | You could attempt this analysis without access to RTL but
           | isolating the power cost of the decoder with the silicon
           | alone sounds hard and potentially infeasible.
        
             | hajile wrote:
             | x86 die breakdowns put the area of the decoder as bigger
             | than the integer ALUs. While unused ALUs can power gate,
             | there's almost never a time when the decoders are not in
             | use.
             | 
             | Likewise, parsing is a well-studied field and parallel
             | parsing has been a huge focus for decades now. If you look
             | around, you can find papers and patents around decoding
             | highly serialized instruction sets (aka x86). The speedups
             | over a naive implementation are huge, but come at the cost
             | of many transistors while still not being as efficient or
             | scalable as parallel parsing of fixed-length instructions.
             | The insistence that parsing compressed, serial streams can
             | be done for free mystifies me.
             | 
             | I believe you can still find where some AMD exec said that
             | they weren't going wider than 4 decoders because the
             | power/performance ratio became much too bad. If decoders
             | weren't a significant cost to their designs (both in
             | transistors and power), you'd expect widening to be a non-
             | issue.
             | 
             | EDIT: here's a link to a die breakdown from AMD
             | 
             | https://forums.anandtech.com/threads/annotated-hi-res-
             | core-d...
        
               | gchadwick wrote:
               | I guess I'm making the wrong argument. I'd agree it's
               | clear an x86 decoder will be bigger, more power hungry
               | etc than an arm decoder. The real question is how much of
               | a factor that is for the rest of the micro-architecture?
               | Is the decoder dragging everything down or just a pain
               | you can deal with at some extra bit of power/area cost
               | that doesn't really matter? That's what I was aiming to
               | get at in the call for evidence. Is x86 inherently
               | inferior to arm and cannot scale as well for large super
               | scalar CPUs because the decoder drags you down or is
               | Apple just better at micro-architecture design (perhaps
               | AMD's team would also fail to scale well beyond 4 wide
               | with an arm design, perhaps Apple's team could have built
               | an x86 M1).
        
               | klelatti wrote:
               | I'm sure that Arm must have done a lot of analysis around
               | this issue when AArch64 was being developed.
               | 
               | After all the relatively simple Thumb extension had been
               | part of the Arm ISA for a long time (and was arguably one
               | of the reasons for its success) and they still decided to
               | go for fixed width.
        
               | gchadwick wrote:
               | Also out of interest do you have a link to an x86 die
               | breakdown that includes decoder and ALU area? Looking at
               | wikichip for example they've got a breakdown for Ice
               | Lake: https://en.wikichip.org/wiki/intel/microarchitectur
               | es/ice_la... but it doesn't get into that level of
               | detail, a vague 'Execution units' that isn't even given
               | bounds is the best you get and is likely an educated
               | guess rather than definite knowledge of what that bit of
               | die contains. Reverse engineering from die shots can do
               | impressive things but certainly what you see in public
               | never seems to get into that level of detail and would
               | likely be significant effort without assistance from
               | Intel.
        
               | hajile wrote:
               | Here you go. This one combines a high-res die shot with
               | AMD's die breakdown of Zen 2.
               | 
               | You'll notice that I understated things significantly.
                | Not only is the decoder bigger than the integer ALUs, but
               | it's more than 2x as big if you don't include the uop
               | cache and around 3x as big if you do! It dwarfs almost
               | every other part of the die except caches and the beast
                | that is load/store.
               | 
               | https://forums.anandtech.com/threads/annotated-hi-res-
               | core-d...
               | 
               | Original slides.
               | 
               | https://forums.anandtech.com/threads/amds-efforts-
               | involved-i...
        
               | gchadwick wrote:
               | Thanks, my mistake was searching exclusively for Intel
               | micro-architecture. It'd be interesting to see if there
               | are further similar die breakdowns for other micro-
               | architectures around, trawling through conference
                | proceedings is likely to yield better results than a quick
               | Google. Just skimming through hot chips as it has a
               | publicly accessible archive (those AMD die breakdowns
               | come from ISSCC which isn't so easily accessible).
               | 
               | The decoder is certainly a reasonable fraction of the
               | core area, though as a fraction of total chip area it's
                | still not too much, as other units in the core are of
                | similar or larger size (floating point/SIMD, branch
               | prediction, load/store, L2) plus all of the uncore stuff
               | (L3 in particular). Really we need a similar die shot of
               | a big arm core to compare its decoder size too. Hotchips
               | 2019 has a highlighted die plot of an N1 with different
               | blocks coloured but sadly it doesn't provide a key as to
               | which block is what colour.
        
           | mastax wrote:
           | I remember reading (in 2012 maybe) that ISA doesn't really
           | matter and decoders don't really matter for performance. The
           | decoder is such a small part of the power and area compared
           | to the rest of the chip. I had a hypothesis that as we reach
           | the tail end of microarchitectural performance the ISA will
           | start to matter a lot more, since there are fewer and fewer
           | places left to optimize.
           | 
           | Well now in 2021 cores are getting wider and wider to have
           | any throughput improvement and power is a bigger concern than
           | ever. Apple M1 has an 8-wide decoder feeding the rest of the
           | extremely wide core. For comparison, Zen 3 has a 4-wide
           | decoder and Ice Lake has 5-wide. We'll see in 3 years if
           | Intel and AMD were just being economical or unimaginative or
           | if they really can't go wider due to the x86 decode
           | complexity. I suppose we'll never really know if they cover
           | for a power hungry decoder with secret sauce elsewhere in the
           | design.
        
             | dragontamer wrote:
             | The state-machine that can determine the start-of-
             | instructions can be run in parallel. In fact, any RegEx /
              | state machine can be run in parallel on every byte with
              | Kogge-stone / parallel prefix, because the state-machine
              | itself is an associative (but not commutative) operation.
             | 
             | As such, I believe that decoders (even complicated ones
             | like x86) scale at O(n) total work (aka power used) and
             | O(log(n)) for depth (aka: clock cycles of latency).
             | 
             | -------
             | 
             | Obviously, a simpler ISA would allow for simpler decoding.
             | But I don't think that decoders would scale poorly, even if
             | you had to build a complicated parallel-execution engine to
             | discover the start-of-instruction information.
        
               | Someone wrote:
               | x86 instructions can straddle MMU pages and even cache
               | lines.
               | 
               | I guess that will affect the constant factor in that O(n)
               | (assuming that's true. I wouldn't even dare say I believe
               | that or its converse)
        
               | dragontamer wrote:
               | > x86 instructions can straddle MMU pages and even cache
               | lines.
               | 
               | That doesn't change the size or power-requirements of the
               | decoder however. The die-size is related to the total-
               | work done (aka: O(n) die area). And O(n) also describes
               | the power requirements.
               | 
               | If the core is stalled on MMU pages / cache line stalls,
               | then the core idles and uses less power. The die-area
               | used doesn't change (because once fabricated in
               | lithography, the hardware can't change)
               | 
               | > (assuming that's true. I wouldn't even dare say I
               | believe that or its converse)
               | 
               | Kogge-stone
               | (https://en.wikipedia.org/wiki/Kogge%E2%80%93Stone_adder)
               | can take *any* associative operation and parallelize it
               | into O(n) work / O(log(n)) depth.
               | 
               | The original application was the Kogge-stone carry-
               | lookahead adders. How do you calculate the "carry bit" in
               | O(log(n)) time, where n is the number of bits? It is
               | clear that a carry bit depends on all 32-bits + 32-bits
               | (!!!), so it almost defies logic to think that you can
               | figure it out in O(log(n)) depth.
               | 
               | You need to think about it a bit, but it definitely
               | works, and is a thing taught in computer architecture
               | classes for computer engineers.
               | 
               | --------
               | 
               | Anyway, if you understand how Kogge-Stone carry lookahead
                | works, then you understand that any associative operation
                | can be parallelized the same way (the "carry-bit"
                | calculation being one such associative operation). The
                | next step is realizing that "stepping a state machine" is
                | also associative.
               | 
               | This is trickier to prove, so I'll just defer to the 1986
               | article "Data Parallel Algorithms" (http://uenics.evansvi
               | lle.edu/~mr56/ece757/DataParallelAlgori...), page 6 in
               | the PDF / page 1175 in the lower corner.
               | 
                | There, Hillis / Steele prove that FSM / regex parsing is
                | an associative operation, and can therefore be
                | implemented with the Kogge-stone approach (called a "prefix-
               | sum" or "Scan" operation in that paper).
        
               | atq2119 wrote:
               | You're right about the implications of Kogge-Stone, but
               | constant factors matter _a lot_.
               | 
               | In fixed width ISAs, you have N decoders and all their
               | work is useful.
               | 
               | In byte granularity dynamic width ISAs, you have one
               | (partial) decoder per byte of your instruction window.
               | All but N of their decode results are effectively thrown
               | away.
               | 
               | That's very wasteful.
               | 
               | The only way out I can see is if determining instruction
               | length is extremely cheap. That doesn't really describe
               | x86 though.
               | 
               | The other aspect is that if you want to keep things
               | cheap, all this logic is bound to add at least one
               | pipeline stage (because you keep it cheap by splitting
               | instruction decode into a first partial decode that
               | determines instruction length followed by a full decode
               | in the next stage). Making your pipeline ~10% longer is a
               | big drag on performance.
        
               | dragontamer wrote:
               | > In fixed width ISAs, you have N decoders and all their
               | work is useful.
               | 
               | ARM isn't fixed width anymore. AESE + AESMC macro-op fuse
               | into a singular opcode, for example.
               | 
               | IIRC, high performance ARM cores are macro-op fusing
                | and/or splitting up opcodes into micro-ops. It's not
               | necessarily a 1-to-1 translation from instructions to
               | opcodes anymore for ARM.
               | 
               | But yes, a simpler ISA will have a simpler decoder. But a
                | lot of people seem to think it's a huge, possibly
                | asymptotically huge, advantage.
               | 
               | -----------------
               | 
               | > The only way out I can see is if determining
               | instruction length is extremely cheap. That doesn't
               | really describe x86 though.
               | 
               | I just described a O(n) work / O(log(n)) latency way of
               | determining instruction length.
               | 
                | Proposal: 1. have an "instruction length decoder" on every
                | byte coming in. Because this ONLY determines instruction
                | length and is decidedly not a full decoder, it's much
                | cheaper than a real decoder.
               | 
               | 2. Once the "instruction length decoders" determine the
               | start-of-instructions, have 4 to 8 complete decoders read
               | instructions starting "magically" in the right spots.
               | 
               | That's the key for my proposal. Sure, the Kogge-stone
               | part is a bit more costly, but focus purely on
               | instruction length and you're pretty much set to have a
               | cheap and simple "full decoder" down the line.
        
               | klelatti wrote:
               | Maybe but O(n) on a complex decoder could be a material
               | penalty vs a much simpler design.
        
               | dragontamer wrote:
               | Sure, but only within a constant factor.
               | 
               | My point is that the ARM's ISA vs x86 ISA is not any kind
               | of asymptotic difference in efficiency that grossly
               | prevents scaling.
               | 
                | Even the simplest decoder on an exactly-32-bit ISA (like
                | POWER9) with no frills would need O(n) scaling. If you do
                | 16 instructions per clock tick, you need 16 individual
                | decoders, one on every 4 bytes.
               | 
               | Sure, that delivers the results in O(1) instead of
               | O(log(n)) like a Kogge-stone FSM would do, but ARM64
               | isn't exactly cake to decode either. There's microop /
               | macro-op fusion going on in ARM64 (ex: AESE + AESMC are
               | fused on ARM64 and executed as one uop).
        
               | mhh__ wrote:
               | >because the state-machine itself is an associative (but
                | not commutative) operation.
               | 
               | Proof?
               | 
               | > But I don't think that decoders would scale poorly,
               | even if you had to build a complicated parallel-execution
               | engine to discover the start-of-instruction information.
               | 
               | Surely we have empirical proof of that in that Apple's
               | first iteration of a PC chip is noticeably wider than
               | AMD's top of the line offering. On top of this we know
               | that to make X86 run fast you have to include an entirely
               | new layer of cache just to help the decoders out.
               | 
               | We've had this exact discussion on this site before, so I
               | won't start it up again but even if you are right that
               | you can decode in this manner I think empirically we
               | _know_ that the coefficient is not pretty.
        
               | dragontamer wrote:
               | I'm not a chip designer. I don't know the best decoding
               | algorithms in the field.
               | 
               | What I can say is, I've come up with a decoding algorithm
               | that's clearly O(n) total work and O(log(n)) depth. From
               | there, additional work would be done to discover faster
               | methodologies.
               | 
               | The importance of proving O(n) total work is not that "X
               | should be done in this manner". It's that "X has, at
               | worst, this asymptotic complexity".
               | 
               | I presume that the actual chip designers making decoders
               | are working on something better than what I came up with.
               | 
               | > Proof?
               | 
               | It's not an easy proof, but I think I've given enough
               | information in sibling posts that you could figure it
               | out yourself over the next hour if you really cared.
               | 
               | The Hillis / Steele paper is considered to be a good
               | survey of parallel computing methodologies in the 1980s,
               | and is a good read anyway.
               | 
               | One key idea that helped me figure it out is that you
               | can run FSMs "backwards". Consider an FSM where "abcd"
               | is being matched. (No match == state 1. State 2 == a was
               | matched. State 3 == ab was matched. State 4 == abc was
               | matched. State 5 == abcd was matched.)
               | 
               | You can see rather easily that FSM(a)(b)(c)(d), applied
               | from left to right, is the normal way FSMs work.
               | 
               | Reverse-FSM(d) == 4 if-and-only-if the 4th character is
               | d. Reverse-FSM(any other character) == initial state in
               | all other cases. Reverse-FSM(d)(c) == 3.
               | 
               | In contrast, we can also run the FSM in the standard
               | forward approach. FSM(a)(b) == 3
               | 
               | Because FSM(a)(b) == 3, and Reverse-FSM(d)(c) == 3, the
               | two states match up and we know that the final state was
               | 5.
               | 
               | As such, we can run the FSM-backwards. By carefully
               | crafting "reverse-FSM" and "forward-FSM" operations, we
               | can run the finite-state-machine in any order.
               | 
               | Does that help you understand the concept? The "later"
               | FSM operations applied to the later characters are
               | "backwards" operations (figuring out, e.g., that
               | reaching state 4 requires having come from state 3),
               | while the "earlier" FSM operations on the first
               | characters are the traditional forward-stepping of the
               | FSM.
               | 
               | When the FSM operations meet in the middle (aka:
               | FSM(a)(b) meets with reverse-FSM(c)), all you do is
               | compare states: reverse-FSM(c) advances to state 4 if-
               | and-only-if the incoming state was 3. FSM(a)(b) is
               | clearly in state 3, so you can combine the two values
               | together and say that the state after "abc" was in fact
               | 4.
               | 
               | -----
               | 
               | Last step: now that you think of forward and reverse
               | FSMs, the "middle" FSMs are also similar. Consider
               | FSM2(b), which processes the 2nd character: the output
               | is basically 2->3 if-and-only-if the 2nd character is b.
               | 
               | FSM(a) == 2 and Reverse-FSM(d)(c) == 3, and we have the
               | middle FSM2 returning (2->3). So we know that the two
               | sides match up.
               | 
               | So we can see that all bytes: the first bytes, the last
               | bytes, and even the middle bytes, can be processed in
               | parallel.
               | 
               | For example, let's take FSM2(b)(c) and process the two
               | middle bytes before we process the beginning or end. We
               | see that the 2nd byte is (b), which means that FSM2(b)(c)
               | creates a 2->4 link (if the state entering FSM2(b)(c) is
               | 2, then the state afterwards is 4).
               | 
               | Now we do FSM(a), and see that we are in state 2. FSM(a)
               | puts us in state 2, and since FSM2(b)(c) has been
               | preprocessed to be 2->4, that puts us at state 4.
               | 
               | So we can really process the FSM in any order. Kogge-
               | Stone gives us an O(n)-work, O(log(n))-depth parallel
               | methodology. Done.
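               | 
               | If it helps to see it run, here's that "abcd" matcher
               | as a few lines of Python (my own toy code, not anything
               | from a real decoder). Each character becomes a
               | state->state table, and composing tables is
               | associative, so a strict left-to-right pass and a tree-
               | shaped (Kogge-Stone style) grouping give the same
               | answer:
               | 
               |   from functools import reduce
               | 
               |   # states: 1 = no match, 2 = "a", 3 = "ab",
               |   # 4 = "abc", 5 = "abcd" matched
               |   PATTERN = "abcd"
               |   STATES = range(1, 6)
               | 
               |   def step(state, ch):
               |       # ordinary one-character-at-a-time FSM step
               |       if state < 5 and ch == PATTERN[state - 1]:
               |           return state + 1
               |       return 2 if ch == "a" else 1
               | 
               |   def as_table(ch):
               |       # one character's effect as a state->state table
               |       return tuple(step(s, ch) for s in STATES)
               | 
               |   def compose(f, g):
               |       # "apply f, then g" -- and compose is associative
               |       return tuple(g[f[s - 1] - 1] for s in STATES)
               | 
               |   fs = [as_table(c) for c in "xxabcd"]
               |   left = reduce(compose, fs)           # left to right
               |   tree = compose(compose(fs[0], fs[1]),
               |                  compose(compose(fs[2], fs[3]),
               |                          compose(fs[4], fs[5])))
               |   assert left == tree
               |   print(left[0])  # 5: from state 1, "xxabcd" matches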
        
               | mhh__ wrote:
               | Is parsing the instructions in parallel strictly a
               | _finite_ state machine? Lookahead? etc.
        
               | dragontamer wrote:
               | It's pretty obvious to me that the x86 prefix extensions
               | to the ISA are a Chomsky Type-3 regular grammar. Which
               | means that a (simple) RegEx can describe the x86
               | prefixes, which can be converted into a nondeterministic
               | finite automaton, which can be converted into a
               | deterministic finite state machine.
               | 
               | Or in more colloquial terms: you "parse" a potential
               | x86 instruction by "starting with the left-most byte,
               | reading one byte at a time until you get a complete
               | instruction".
               | 
               | Any grammar that you can parse one byte at a time from
               | left to right is a Chomsky Type-3 grammar. (In contrast,
               | 5 + 3 * 2 cannot be parsed from left to right: the 3 * 2
               | needs to be evaluated before the 5 + part.)
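               | 
               | As a toy illustration (deliberately simplified by me:
               | no VEX/EVEX, escape bytes, ModRM/SIB or immediates),
               | the legacy-prefix / REX front end of a 64-bit
               | instruction can be written as a byte-level regex:
               | 
               |   import re
               | 
               |   LEGACY = rb"[\xF0\xF2\xF3\x2E\x36\x3E\x26\x64\x65\x66\x67]"
               |   REX    = rb"[\x40-\x4F]"
               |   OPCODE = rb"[\x00-\xFF]"  # placeholder opcode byte
               | 
               |   FRONT = re.compile(LEGACY + rb"*" + REX + rb"?" + OPCODE)
               | 
               |   m = FRONT.match(bytes([0x66, 0x48, 0x89]))
               |   print(m.end())  # 3: prefix 66, REX.W, then the opcode
               | 
               | Regular means an NFA/DFA can recognize it -- and that
               | DFA is exactly the kind of FSM the Kogge-Stone trick
               | above runs out of order.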
        
             | Symmetry wrote:
             | You're right in a sense, in that even really hairy decode
             | problems like x86 only add 15% or so to a core's overall
             | power usage.
             | 
             | But on the other hand, verification is a big NRE cost for
             | chips, and proving that that second ISA works correctly in
             | all cases is a pretty substantial engineering cost even if
             | the resulting chip is much the same.
        
               | ants_a wrote:
               | It likely wouldn't be 15% with an 8-wide x86 decoder, if
               | such a thing were even possible within any reasonable
               | clock budget. So in that sense a fixed-width ISA does
               | buy something. Also, given that chips today are mostly
               | power limited, that 15% of the power budget is 15% that
               | could be used to increase performance in other ways.
        
         | toast0 wrote:
         | My understanding is that AArch64 is more or less a fresh
         | start vs AArch32, whereas the ia16/ia32/amd64 transitions are
         | all relatively simple extensions. Almost all the instructions
         | work in all three modes; just the default registers are a bit
         | different, the registers available are different, and
         | addressing modes might be a smidge different. You would make
         | some gains in decoder logic by dropping the older stuff, but
         | not a whole lot; amd64 still has the same variable-width
         | instructions and strong memory consistency model that make it
         | hard.
        
           | astrange wrote:
           | If you dropped 16-bit and i386 support, then you might be
           | able to reuse some of the shorter opcodes, like 0x62
           | (BOUND), that aren't supported in x86-64.
           | 
           | Decoders aren't the problem with variable-length
           | instructions anyway (and they can be good for performance
           | because they're cache-efficient). The main problem is
           | security, because you can obfuscate programs by jumping
           | into the middle of instructions.
        
             | colejohnson66 wrote:
             | Nitpick: 0x62 can't be reused as it's _already_ been
             | repurposed for the EVEX prefix ;)
        
       | jefurii wrote:
       | I read this as "Arm Announces New Mobile Army".
        
       | DCKing wrote:
       | The A510 small core is really the big news here for me. ARM
       | updates their small core architecture very infrequently and the
       | small cores hold back the features of their big cores. Because of
       | the Cortex A55, the Cortex X1 was stuck on ARMv8.2. The OS needs
       | to be able to schedule a thread on both small and big cores, you
       | see. And that meant ARM's own IP missed out on security features
       | such as pointer authentication and memory tagging, which ARM
       | customer Apple has been shipping for years at this point.
       | 
       | The Cortex A55 (announced 2017) was also a very minor upgrade
       | over the Cortex A53 (announced 2012). These small cores are
       | truly the performance baseline of application processors in
       | global society, powering cheap smartphones, $30 TV boxes, most
       | IoT applications doing non-trivial general computation, and
       | educational systems like the Raspberry Pi 3. If ARM's numbers
       | are near the truth (real-world implementations do like to
       | disappoint us in practice), we're going to see a nice security
       | and performance uplift there. Things have been stale there for
       | a while.
       | 
       | If anything, this announcement means that in a few years'
       | time, the vast majority of application processors sold will
       | have robust mitigations against many important classes of
       | memory corruption exploits [1].
       | 
       | [1]: https://googleprojectzero.blogspot.com/2019/02/examining-
       | poi...
        
         | my123 wrote:
         | Note that Apple has only shipped pointer authentication so
         | far, not memory tagging.
         | 
         | In practice, memory tagging will only ship on ARMv9 products,
         | across the board.
        
         | pjmlp wrote:
         | I also find it a pity that so far only Solaris SPARC has had
         | proper mitigations in place for C code.
         | 
         | The irony of these kinds of mitigations is that computers are
         | evolving into C Machines.
        
         | oblio wrote:
         | I think the cheap smartphone space especially will benefit.
         | Cheap Android phones are frequently so slow that they could
         | use some cheap, higher-performance CPUs.
        
       ___________________________________________________________________
       (page generated 2021-05-25 23:00 UTC)