[HN Gopher] ARM or x86? ISA Doesn't Matter (2021)
___________________________________________________________________
ARM or x86? ISA Doesn't Matter (2021)
Author : NavinF
Score : 73 points
Date : 2023-05-14 20:38 UTC (2 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| TheLoafOfBread wrote:
| What does matter is standardization, for example the boot
| process. When I have an x86 image of Windows/Linux, I can boot
| it on any x86 processor. When I have an ARM image, I can boot
| it on the SoC it was built for, and even that is a big maybe:
| if outside peripherals are different (e.g. a different LCD
| driver) or live on different pins of the SoC, then I am screwed
| and will at best have a partially working system.
|
| Standardization is something that will carry x86 very far into
| the future despite its inefficiency on low-power devices.
| dehrmann wrote:
| > When I have x86 image of Windows/Linux, I can boot it on any
| x86 processor.
|
| Where this gets absurd is modern Debian supports i686 and up.
| You should be able to get a 27-year-old Pentium Pro to boot the
| same image as a Raptor Lake CPU.
| nubinetwork wrote:
| > When I have ARM image, well, then I can boot it on a SoC it
| is built for ... or lives on different pins of SoC, then I am
| screwed and will have at best partially working system.
|
| That's not even the half of it either... what firmware does the
| board run? U-boot is nice, but sometimes you aren't lucky and
| you're stuck with something proprietary. Although if you're
| extremely lucky, you'll have firmware that supports EFI
| kernels.
| dtx1 wrote:
| Yeah, no. Tooling support, driver support, and general
| optimization matter. So does platform maturity.
|
| You don't want your phone to run x86 (and it won't for a
| while), and though possible, it's a pain to deal with an ARM
| server at the moment because some random library you use just
| won't be compatible. And if single-threaded performance
| matters, ARM is behind by a decade.
| dehrmann wrote:
| It sounds like Atom could have found a home in phones.
| circuit10 wrote:
| Having an Oracle Cloud Free Tier ARM VPS, it's surprising how
| much just works. I think the only thing I couldn't run was
| Chrome Remote Desktop (yes, I want to remote into my VPS
| sometimes; for example, it's the easiest way to leave a GUI
| program running in the background without leaving my PC on),
| and only a few other things needed extra steps. But it's
| probably a lot different on desktop or if you're running
| different types of programs.
| wolf550e wrote:
| Some libraries have x86 SIMD code but no ARM SIMD code, so
| when benchmarking real-world use cases you end up comparing
| SIMD against scalar code, and x86 is much faster. Server-side
| libraries for ARM are in a less mature state than on x86.
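The scalar-vs-SIMD gap described above can be sketched with a toy model (entirely illustrative, not tied to any real library): a scalar loop retires one element per step, while a 4-wide SIMD path retires four, so the same work takes a quarter of the steps.

```python
# Toy model of SIMD vs. scalar throughput (illustrative only):
# one "vector" instruction covers `lanes` elements per step, so
# a 4-wide path needs ~1/4 of the steps a scalar loop does.
# Real libraries get this by shipping hand-written kernels per
# ISA (SSE/AVX on x86, NEON/SVE on ARM).
def count_steps(n_elements, lanes):
    steps, done = 0, 0
    while done < n_elements:
        done += lanes   # one instruction handles `lanes` elements
        steps += 1
    return steps

assert count_steps(1024, 1) == 1024  # scalar fallback
assert count_steps(1024, 4) == 256   # 4-wide SIMD path
```

A library that only has the scalar path on ARM pays the full step count, which is the mismatch that shows up in cross-ISA benchmarks.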
| rektide wrote:
| "We don't know" is the only good answer. We haven't done much
| trying in the past 10 years.
|
| Intel's Lakefield was doing quite well in the tablet/MID
| space. It also had the disadvantage of comparatively ancient
| Atom-esque cores (far worse than Intel's new E cores) and one
| massive Skylake core.
|
| ARM is no longer behind by all that much on single-threaded
| performance. On Geekbench, an M2 can do 1916 points, a 7950
| 2300 points. Slightly bigger gap on Cinebench, 1580 vs. 2050.
| A big part of the gap here is almost certainly the very
| different clock speeds.
|
| We just don't know. There are old beliefs we have held, but we
| had so little evidence for those biases then. x86 rarely tried
| to be really tiny, and had much more to learn if it was to
| succeed there. ARM rarely tried to be big, and has been
| learning. There's scant evidence of real limiting factors for
| either.
| jsheard wrote:
| > You don't want your phone to run x86 (and it won't for a
| while)
|
| It's easily forgotten but there were Android phones which used
| Intel x86 processors, such as the early Asus Zenfones. They
| didn't stick though.
| hedora wrote:
| These benchmarks suggest arm has been at single threaded
| performance parity on server since 2020:
|
| https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...
|
| (Apple Silicon blows them away on laptops, of course.)
| tyingq wrote:
| I imagine some of the remaining gap might be in places where
| inline ASM or things like SIMD, AVX, etc. exist, where there
| have been more years and a larger set of people optimizing
| that ASM for x86-64 servers.
| kaelinl wrote:
| This isn't the point of the article.
|
| The article is commenting on CPU design: area efficiency, power
| efficiency, design cost, etc. They're proposing that the reason
| x86 CPUs have historically beat ARM CPUs in performance, and
| the reason ARM CPUs have historically beat x86 CPUs in power
| efficiency, has nothing to do with the design of the ISA
| itself. You could build an ARM CPU to beat an x86 CPU in high
| performance computing, or vice versa. They're saying that the
| format of the instructions and the particular way the
| operations are structured isn't the driving factor. Instead,
| it's just a historical artifact of how the ISAs were used.
|
| In other words, yes, there are plenty of ecosystem reasons that
| these two (and potentially, more) families of chips are better
| for some things vs. others, but if the two companies swapped
| their ISAs 30 years ago we might see exactly the same ecosystem
| just with different instruction formats.
| isidor3 wrote:
| It is interesting to me how both instruction sets have converged
| on splitting operations into simpler micro ops. The author
| briefly mentions RISC-V as having "better" core instructions, but
| it makes me wonder if having the best possible instructions would
| even help that much.
|
| If you made a CPU that directly ran off of some convergent
| microcode, would you then lose because of bandwidth of getting
| those instructions to the chip? Or is compressing instruction
| streams already a pretty-well-solved problem if you're able to do
| it from a clean slate, instead of being tied to what instruction
| representations a chip happened to be using many years ago?
| circuit10 wrote:
| > If you made a CPU that directly ran off of some convergent
| microcode
|
| I think that's the original idea behind RISC
| mafribe wrote:
| Microcode is often attributed to [1] from 1952.
|
| [1] M. V. Wilkes, J. B. Stringer, _Micro-programming and the
| design of the control circuits in an electronic digital
| computer._
| kwhitefoot wrote:
| It's arguable that Babbage's design for the Analytical
| Engine included microcode.
|
| See, i.a., https://www.fourmilab.ch/babbage/glossary.html
| isidor3 wrote:
| Yes, and obviously ARM didn't choose the instructions in its
| reduced set optimally, if the best implementations require
| those instructions to be split into smaller ones. But that
| doesn't really speak to whether that's because it's just
| _better_ to pack instructions that way, or because these
| implementations of ARM and x86 just need to do it to be
| performant in spite of deficiencies in their instruction
| sets.
| api wrote:
| Decoder complexity matters. ARM with its single instruction width
| allows arbitrarily parallel decoders with only linear growth in
| transistor count. x86, with its many widths and formats,
| requires decoders that grow exponentially in complexity with
| parallelism, consuming more silicon and power to achieve
| higher levels of instruction-level parallelism. It requires a
| degree of brute force, with many possible size branches being
| explored at once, among other expensive tricks.
|
| This is one of the major areas where the instruction sets are not
| equal. ARM has a distinct efficiency advantage.
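The serial dependence behind this argument can be sketched in a few lines (a toy model, not actual decoder hardware): with a fixed instruction width every start offset is known up front, while with variable widths each start depends on all previous lengths.

```python
# Toy model (not real hardware): finding where each instruction
# starts in a fetch window. Fixed-width starts are independent
# and trivially parallel; variable-width starts form a serial
# chain, which a wide decoder must break by speculating at many
# possible byte offsets at once.
def fixed_width_starts(count, width=4):
    # e.g. AArch64: every instruction is 4 bytes
    return [i * width for i in range(count)]

def variable_width_starts(lengths):
    # e.g. x86: lengths range from 1 to 15 bytes
    starts, pos = [], 0
    for length in lengths:
        starts.append(pos)   # depends on every earlier length
        pos += length
    return starts

assert fixed_width_starts(4) == [0, 4, 8, 12]
assert variable_width_starts([1, 3, 2, 6]) == [0, 1, 4, 6]
```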
| userbinator wrote:
| How is it exponential? It's only a multiplicative increase in
| decode positions.
| thechao wrote:
| It's not exponential; it's not even quadratic (it is
| superlinear), if you put any thought into the design. I worked
| on an x86 part with 15 decoders/fetch unit. The area was
| annoying, but unimportant. (We didn't commit 15 ops/cycle; just
| pc-boundary determination.)
|
| I've also worked on ARM/custom GPU ISAs. The limiting rate is
| the total complexity of the ISA, not the encoding density.
|
| In fact, from an I$ point-of-view, the tighter x86 encodings
| are a pretty good win -- at least a few % on very long fetch
| sequences.
| jeffbee wrote:
| Does anyone actually care about this? The x86 decoders are not
| large on modern implementations, and putting more transistors
| on dice is a well-solved problem.
| api wrote:
| It uses more power. The decoder is like another ALU that is
| always screaming at 100%. It means you can easily keep up
| with ARM in speed but not power efficiency.
| jeffbee wrote:
| It doesn't seem to matter _in practice_. Current generation
| Intel CPUs and the Apple M2 have very similar performance
| at the same power levels.
| rowanG077 wrote:
| How did you arrive at that conclusion? Comparing the M2 Max
| vs. the 13650HX, it's very obvious the M2 Max uses a LOT less
| power. It's not even close; it's less than HALF the power. The
| M2 Max has slightly worse performance, but it manages to beat
| the Intel in some benchmarks.
| jeffbee wrote:
| You don't have to let the Intel chips scale up the power
| like that. You can lock them to whatever power level
| suits you. An i7-1370P configured at 20W has broadly
| similar performance to an M2.
| rowanG077 wrote:
| Mind linking me some power measurements at the same wattage?
| I didn't even know you could set a power target on Intel or
| Apple Mx. Well, you can disable turbo boost on Intel, but even
| then Intel blows past their own marketed TDP by a lot.
| jeffbee wrote:
| Intel introduced the "running average power limit" over
| ten years ago.
| https://lkml.indiana.edu/hypermail/linux/kernel/1304.0/01322...
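For reference, on Linux the RAPL package-energy counter is exposed through the powercap sysfs interface (on supported Intel CPUs, typically /sys/class/powercap/intel-rapl:0/energy_uj) as a monotonically increasing microjoule count; average power is just the energy delta over the sampling interval. A minimal sketch of that arithmetic, with made-up readings:

```python
# Sketch: average package power from two RAPL energy readings.
# The sysfs counter reports cumulative microjoules, so power is
# (delta energy) / (delta time). The values below are made up.
def avg_watts(energy_uj_start, energy_uj_end, seconds):
    return (energy_uj_end - energy_uj_start) / 1e6 / seconds

# 30 J consumed over 2 s -> 15 W average
assert avg_watts(1_000_000, 31_000_000, 2.0) == 15.0
```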
| rowanG077 wrote:
| RAPL doesn't allow setting a limit which is always obeyed.
| You can set PL1 and PL2 limits, but Intel CPUs will gladly go
| over those limits in the short term, for example when running
| a benchmark. That's why I asked for specific benchmarks which
| include power measurements.
|
| For example:
| https://www.notebookcheck.net/i7-1360P-vs-M2_14731_14521.247...
|
| This shows the M2 has slightly worse performance than the
| 1360P, but the 1360P requires 2.5x the power to achieve that.
| Panzer04 wrote:
| Apple's chips are on better process nodes, which confuses
| the issue. That being said, you really have to test chips
| at the same power level to get an idea of performance per
| watt in a comparison.
|
| You can easily double CPU power for only a few hundred
| MHz or 10-20% extra performance.
|
| See
| https://www.pcworld.com/article/1359352/cool-down-a-deep-div...,
| which benchmarks chips at different power limits for an
| example.
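The point about the steep top of the voltage/frequency curve works out in simple arithmetic (numbers purely illustrative): doubling power for ~15% more performance roughly halves performance per watt.

```python
# Illustrative numbers only: a chip at 50 W scoring 100 points
# vs. the same chip boosted to 100 W scoring 115 points.
def perf_per_watt(score, watts):
    return score / watts

base = perf_per_watt(100, 50)      # 2.0 points/W at low power
boosted = perf_per_watt(115, 100)  # 1.15 points/W when boosted
assert base == 2.0
assert boosted == 1.15
assert boosted < 0.6 * base  # over 40% of the efficiency lost
```

This is why benchmarking both chips at the same fixed power budget matters: letting one chip boost freely mostly measures where each vendor put its default limits.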
| rowanG077 wrote:
| Yes I agree with you. That doesn't mean this is easy to
| achieve. With the exception of AMD chips it's
| unfortunately very hard to simply "benchmark with a fixed
| power budget".
| arp242 wrote:
| How much power does it (roughly) use? Are we talking about
| 1% of the overall usage? 10%? 50%?
| tester756 wrote:
| This article states otherwise, so which is it?
| mafribe wrote:
| One could argue that one of the reasons why SIMD
| instructions, and indeed GPUs, are popular, is because they
| amortise the (transistor and power) cost of decoding over
| more compute units, in the case of GPUs over many more.
|
| There are also other considerations, like rolling back state
| in OOO machines, or precise exceptions. All this becomes more
| complex with an x86-style instruction set.
| tux3 wrote:
| The x86 decoders consume a reasonable amount of power, but the
| trouble is making them wider without affecting that.
|
| I have an AMD CPU. Zen CPUs come with a fairly wide backend. But
| the frontend is what it is (especially early Zen), and without
| SMT it's essentially impossible to keep all those execution units
| fed. It's not that 8 x86 decoders wouldn't be a benefit; it's
| just that more decoders aren't cheap in x86 cores, and each
| extra decoder is a serious cost.
|
| If you compare with the big ARM cores, having a wide frontend is
| not a complex research problem or an impractical cost. 8 wide ARM
| decode is completely practical. You even have open source
| superscalar RISC-V cores just publicly available on Github
| running on FPGAs with 8 wide decode. Large frontends are
| (relatively) cheap and easy, if you're not x86.
|
| So when we notice that the narrower x86 CPU's decode doesn't
| consume that much (a "drop in the ocean"), that's because it was
| designed narrower to keep the PPA reasonable! The reason I can't
| feed my Zen backend isn't because having a wide frontend is
| useless and I should just enable SMT anyways, it's because x86
| makes wide decodes much less practical than competing
| architectures.
| tester756 wrote:
| >The x86 decoders consume a reasonable amount of power
|
| This article states otherwise, so which is it?
| ip26 wrote:
| It's a trade-off between problems. Variable length instructions
| are not as trivial to decode wide, so you need more cleverness
| here. However, fixed length instructions decrease code density,
| which asks more of the instruction cache. Note Zen4 has a 32 KB
| L1 instruction cache while the M1 has a 192 KB L1 instruction
| cache, requiring extra cleverness here instead to handle the
| higher latency and area. Meanwhile, micro-op caches hide both
| problems.
|
| There are ripple effects to consider as well. The large L1 caches
| of M1 (320 KB total) put capacity pressure on L2, towards
| larger sizes and/or away from inclusive policy. See the 12MB
| shared L2. Meanwhile, the narrower decode of Zen4 puts pressure
| on things like branch prediction accuracy & mispredict
| correction latency - if you predicted the wrong codepath, you
| can't catch up as quickly. See the large branch predictors on
| Zen4.
| pclmulqdq wrote:
| ARM used to find itself on the wrong side of this tradeoff in
| the era of 4-wide x86 decode units and 4-6 wide ARM decoders.
| They lost too much perf to cache size for the decoder width
| to make up for it.
|
| It's unclear to me if they will pull ahead on the perf/area
| game with the era of 8-wide x86 decoders coming.
| codedokode wrote:
| x86 also uses 2-address instructions which means that you
| often need to use moves between registers (additional
| instructions), example: [1]. ARM uses 3-address instructions.
|
| Also, x86 code is compact, but not as compact as in era of
| 8080 [2] - here addition and multiplication require 3 bytes
| each, 6 bytes total. To my surprise, ARM has an add-multiply
| instruction and it uses just 4 bytes (instead of 8) [3].
|
| And RISC-V uses 6 bytes because of a shortened (compressed)
| instruction for addition [4].
|
| Of course, this simple function cannot be a replacement for
| proper analysis, but it seems that x86 code is not
| significantly denser.
|
| Also, to my great disappointment, none of those CPUs has
| checked overflow for arithmetic operations.
|
| [1] https://godbolt.org/z/jsoccE5jv
|
| [2] https://godbolt.org/z/jTMs1MEzh
|
| [3] https://godbolt.org/z/nGb8qKcxe
|
| [4] https://godbolt.org/z/x9c115crY
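The 2-address cost mentioned above can be shown with a toy example (pseudo-assembly, not any real encoding): computing a product into a new register while keeping an input live.

```python
# Toy illustration of 2-address vs. 3-address forms for
# d = a * b with `a` still needed afterwards. The mnemonics are
# pseudo-assembly for illustration, not a real encoding.
three_address = [
    "mul d, a, b",   # one instruction; sources are untouched
]
two_address = [
    "mov d, a",      # extra copy, since mul clobbers its dest
    "mul d, b",      # d = d * b
]
assert len(two_address) - len(three_address) == 1  # the extra mov
```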
| cesarb wrote:
| > Note Zen4 has a 32 KB L1 instruction cache while the M1 has
| a 192 KB L1 instruction cache, requiring extra cleverness
| here instead to handle the higher latency and area.
|
| There's another factor here: to have a low latency, the L1
| cache has to be indexed by the bits which don't change when
| translating from virtual addresses to physical addresses.
| That makes it harder to have a larger low-latency L1 cache
| when the native page size is 4KiB (AMD/Intel) instead of
| 16KiB (Apple M1/M2).
|
| That is, most of the "cleverness" allowing for a larger L1
| instruction cache is simply a larger page size.
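The arithmetic behind that point: a low-latency virtually-indexed, physically-tagged (VIPT) L1 can hold at most page_size x associativity bytes, since the index bits must lie within the page offset. The way counts below are illustrative assumptions, chosen only to reproduce the quoted capacities.

```python
# VIPT capacity limit: the index bits must come from the
# untranslated page offset, so capacity <= page_size * ways.
# The way counts here (8 and 12) are assumed for illustration.
def max_vipt_l1_bytes(page_size, ways):
    return page_size * ways

assert max_vipt_l1_bytes(4 * 1024, 8) == 32 * 1024     # 4 KiB pages
assert max_vipt_l1_bytes(16 * 1024, 12) == 192 * 1024  # 16 KiB pages
```

Growing the cache beyond that bound means either more ways (slower and hotter) or extra aliasing tricks, which is why a bigger native page size buys Apple a bigger L1 "for free."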
| mjevans wrote:
| If the instructions are on average 3-4x less dense (i.e. take
| about that much more space), then the trade-off in
| associativity granularity and the corresponding increase in
| cache size is logical. The management logic would be around
| the same size, though the number of memory cells and the
| corresponding costs in silicon, power/thermal, and signal
| propagation / layout remain.
| phkahler wrote:
| >> I have an AMD CPU. Zen CPUs come with a fairly wide backend.
| But the frontend is what it is...
|
| Zen 5 is widening the front end. My guess is that, with
| scaling coming to an end, one more nice tweak in Zen 6 should
| be darn near the end of the performance road for a bit. Not
| saying the actual end, but it should be one of those sweet
| spots where you build a PC and it's really good for years to
| come.
|
| I'm still running Raven Ridge and have no need to upgrade, but
| I will when I can get double the cores or more at double the
| IPC or more, and maybe at lower power ;-)
| dehrmann wrote:
| This would explain part of why Apple hasn't been pushing M2 for
| the data center. Its chips are a better fit for bursty human
| workloads, not server workloads.
| rowanG077 wrote:
| Apple doesn't see value in going after the server market.
| KerrAvon wrote:
| Apple Silicon chips aren't for sale outside Apple, and Apple
| hasn't made any products relevant to data centers since they
| terminated the Xserve line as part of the PowerPC -> Intel
| transition.
| dehrmann wrote:
| They have a CPU that's been labeled some version of fastest
| or most efficient, they're hungry for more revenue, but
| somehow have no interest in the data center market? There
| must be a reason.
___________________________________________________________________
(page generated 2023-05-14 23:00 UTC)