[HN Gopher] Nvidia Unveils Grace: A High-Performance Arm CPU for...
___________________________________________________________________
Nvidia Unveils Grace: A High-Performance Arm CPU for Use in Big AI
Systems
Author : haakon
Score : 249 points
Date : 2021-04-12 16:32 UTC (6 hours ago)
(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)
| crb002 wrote:
| +1 ECC RAM
| legulere wrote:
| Big Data, Big AI, what's next? Big Bullshit?
| jhgb wrote:
| Nah, that's already been here for quite a while.
| rexreed wrote:
| Honestly the bottom down-voted comment has it right. What AI
| application is actually driving demand here? What can't be
| accomplished now (or with reasonable expenditures) that can be
| accomplished by this one CPU that will be released in 2 yrs? What
| AI applications will need this 2 yrs from now that don't need it
| now?
|
| I understand the here-and-now AI applications. But this is
| smelling more like Big AI Hype than Big AI need.
| cracker_jacks wrote:
| "640K ought to be enough for anybody."
| cma wrote:
| Real business-class features we want to know about:
|
| Will they auto-detect workloads and cripple performance (like the
| mining stuff recently)? Only work through special drivers with
| extra licensing fees depending on the name of the building it is
| in (data center vs office)?
| rubatuga wrote:
| Market segmentation is practiced by every chip company that you
| use. Intel: ECC. AMD: ROCm. Qualcomm: cost as percentage of the
| phone price.
| cma wrote:
| I still think Nvidia takes it further.
| volta83 wrote:
| Every company does market segmentation: it makes sense to
| have customers that want a feature pay more for it.
|
| Still, every company does it differently.
|
| For example, both NVIDIA and AMD compute GPUs are
| necessarily more expensive than gamer GPUs because of
| hardware costs (e.g. HBM).
|
| However, NVIDIA gamer GPUs can do CUDA, while AMD gamer
| GPUs can't do ROCm.
|
| The reason is that NVIDIA has 1 architecture for gaming and
| compute (Ampere), while AMD has two different architectures
| (RDNA and CDNA).
| cma wrote:
| It's common, but only possible in a very dominant
| position or with competitors that are borderline
| colluding.
| volta83 wrote:
| You must be the only gamer in the world that wants an
| HBM2e GPU for gaming that's 10x more expensive while only
| delivering a negligible improvement in FPS.
| cma wrote:
| I'm only talking about driver/license locks, not
| different ram types.
| Aissen wrote:
| GPU-to-CPU interface >900GB/sec NVLink 4. What kind of
| interconnect is that? Is that even physically realistic?
| freeone3000 wrote:
| Depends on how big you want to make it. If they're willing to
| go four inches, that'd do it with existing per-pin speeds from
| NVLink 3.
| rincebrain wrote:
| Well, according to [1], NVIDIA lists NVLink 3.0 as being 50
| Gb/s per lane per direction, and lists the total maximum
| bandwidth of NVSwitch for Ampere (using NVLink 3.0) as 900 GB/s
| each direction, so it doesn't seem completely out of reach.
|
| [1] - https://en.wikipedia.org/wiki/NVLink
| Aissen wrote:
| With 50Gb/s per lane, that would be 144 lanes to reach
| 900GB/s. Quite impressive.
| [deleted]
| rincebrain wrote:
| Fascinatingly, NVIDIA's own docs [1] claim GPU<->GPU
| bandwidth on that device of 600 GB/s (though they claim
| total aggregate bandwidth of 9.6 TB/s). Which would be
| what, 96 and 1536 lanes, respectively? That's quite the
| pinout.
|
| [1] - https://www.nvidia.com/en-us/data-center/nvlink/
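| A quick back-of-the-envelope check on those lane counts, as a
| rough sketch (it assumes the 50 Gb/s per lane, per direction
| figure from the Wikipedia numbers above; purely illustrative,
| not an official NVIDIA spec):
|
|     LANE_GBPS = 50  # Gb/s per lane, per direction (assumed)
|
|     def lanes_needed(gbytes_per_sec):
|         # convert GB/s to Gb/s, then divide by the per-lane rate
|         return gbytes_per_sec * 8 / LANE_GBPS
|
|     for target in (900, 600, 9600):  # GB/s figures quoted above
|         print(target, "GB/s ->", int(lanes_needed(target)), "lanes")
|     # 900 -> 144 lanes, 600 -> 96 lanes, 9600 -> 1536 lanes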
| robomartin wrote:
| Well, PCIe 6 x16 will do 128 GB/s. Of course, the real question
| is how many transactions per second you get. PCIe 6 runs at
| about 64 GT/s per lane.
|
| Speaking in general terms, data rate and transaction rate don't
| necessarily match because a transaction might require the
| transmitter to wait for the receiver to check packet integrity
| and then issue acknowledgement to the transmitter before a new
| packet can be sent.
|
| Yet another case, again speaking in general terms, is having
| to insert wait states to deal with memory
| access or other processor architecture issues.
|
| Simple example: on an STM32 processor you cannot toggle I/O in
| software at anywhere close to the CPU clock rate due to
| architectural constraints (including the instruction set). On
| a processor running at 48 MHz you can only do a max toggle rate
| of about 3 MHz (toggle rate = number of state transitions per
| second).
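| To make the data-rate-vs-transaction-rate point concrete, here
| is a toy Python model with made-up numbers (not measured PCIe
| figures): if each packet has to be acknowledged before the next
| one is sent, the round-trip dead time dominates small transfers.
|
|     def effective_gbps(payload_bytes, link_gbps, ack_ns):
|         # Gb/s is numerically the same as bits per nanosecond
|         bits = payload_bytes * 8
|         transfer_ns = bits / link_gbps
|         return bits / (transfer_ns + ack_ns)
|
|     for payload in (64, 256, 4096):  # bytes per transaction
|         print(payload, round(effective_gbps(payload, 64, 100), 1))
|     # small transactions land far below the raw 64 Gb/s line rate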
| alexhutcheson wrote:
| The fact that they are using a Neoverse core licensed from ARM
| seems to imply that there won't be another generation for
| NVidia's Denver/Carmel microarchitectures. Somewhat of a shame,
| because those microarchitectures were unorthodox in some ways,
| and it would have been interesting to see where that line of
| evolution would have led.
|
| I believe this leaves Apple, ARM, Fujitsu, and Marvell as the
| only companies currently designing and selling cores that
| implement the ARM instruction set. That may drop to 3 in the next
| generation, since it's not obvious that Marvell's ThunderX3 cores
| are really seeing enough traction to be worth the non-
| recurring engineering costs of a custom core. Are there any
| others?
| klelatti wrote:
| Designing but not yet selling Qualcomm / Nuvia?
| alexhutcheson wrote:
| Yeah will be interesting to see if and when they bring a
| design to market.
| Bluestein wrote:
| The whole combination of AI and the name gives "watched over by
| machines of loving grace" a whole new twist, eh?
| TheMagicHorsey wrote:
| Is anyone but Apple making big investments in ARM for the
| desktop? This is another ARM for the datacenter design.
|
| If other companies don't make genuine investments in ARM for the
| desktop there's a real chance that Apple will get a huge and
| difficult-to-assail application performance advantage as
| application developers begin to focus on making Mac apps first,
| and port to x86 as an afterthought.
|
| Something similar happened back in the day when Intel was the de
| facto king, and everything on other platforms was a handicapped
| afterthought.
|
| I wouldn't want to have my desktops be 15 to 30% slower than Macs
| running the same software, simply because of emulation or lack of
| local optimizations.
|
| So I'm really looking forward to ARM competition on the desktop.
| callesgg wrote:
| Super-parallel ARM chips: could that not be a future thing for
| Nvidia or another chip manufacturer? A normal CPU die with
| thousands of independent cores.
| modeless wrote:
| I hope they make workstations. I want to see some competition for
| the eventual Apple Silicon Mac Pro.
| macksd wrote:
| You probably mean less powerful than this, but they do:
| https://www.nvidia.com/en-us/deep-learning-
| ai/solutions/work....
| modeless wrote:
| Yes they make workstations, but they don't make ARM
| workstations. Yet. They already have ARM chips they could use
| for it, but they went with x86 instead despite the fact that
| they have to purchase the x86 chips from their direct
| competitor. Also, yes, less than $100k starting price would
| be nice.
| dhruvdh wrote:
| They are licensing ARM cores, which as of now cannot compete
| with Apple silicon.
|
| While they are using some future ARM core, and I've read
| rumors that future designs might try to emulate what has made
| Apple cores successful, we cannot say whether Apple designs
| will stagnate or continue to improve at the current rate.
|
| There is potential for competition from Qualcomm after their
| Nuvia acquisition though.
| adgjlsfhk1 wrote:
| It seems weird to me to say that arm cores can't compete with
| apple silicon given that apple doesn't own fabs. They are
| using arm cores on TSMC silicon (exactly the same as this).
| seabrookmx wrote:
| > They are using arm cores on TSMC silicon (exactly the
| same as this)
|
| No the Apple Silicon chips use the arm _instruction set_
| but they do not use their core design. Apple designs their
| core in house, much like Qualcomm does with snapdragon.
| Both of these companies have an architectural license which
| allows them to do this.
| tibbydudeza wrote:
| Qualcomm no longer makes their own cores - they just use
| ARM reference IP designs since the Kryo.
|
| That will probably change with their Nuvia acquisition.
| ac29 wrote:
| Maybe not in single threaded performance, but Apple has no
| server grade parts. Ampere, for example, is shipping an 80
| core ARM N1 processor that puts out some truly impressive
| multithreaded performance. An M1 Mac is an entirely different
| market - making a fast 4+4 core laptop processor doesn't
| necessarily translate into making a fast 64+ core server
| processor.
| devmor wrote:
| What do you mean ARM cores can't compete with Apple silicon?
| "Apple silicon" are ARM cores.
| dharmab wrote:
| Apple Silicon is compatible with the ARM instruction set
| but they are not "just ARM cores" in their internal design.
| mlyle wrote:
| He means cores made by ARM, not cores implementing the ARM
| ISA. Currently, the cores designed by ARM cannot touch the
| Apple M1.
| [deleted]
| titzer wrote:
| I think Apple did Arm an unbelievable favor by absolutely
| trouncing all CPU competitors with the M1. By being so fast,
| Apple's chip attracts many new languages and compiler backends
| to Arm that want a piece of that sweet performance pie. Which
| means that other vendors will want to have arm offerings, and
| not, e.g., RISC-V.
|
| I have no idea what Apple's plans for the M1 chip are, but if
| they had manufacturing capacity, they could put oodles of these
| chips into datacenters and workstations the world over and
| basically eat the x86 high-performance market. The fact that
| the chip uses so little power (15W) means they can absolutely
| cram them into servers where CPUs can easily consume 180W. That
| means 10x the number of chips for the same power, and not all
| concentrated in one spot. A lot of very interesting server
| designs are now possible.
| klelatti wrote:
| It's hard to imagine that until a few months ago it was very
| difficult to get a decent Arm desktop / laptop. I imagine
| lots of developers working now to fix outstanding Arm bugs /
| issues.
| giantrobot wrote:
| While I'm sure lots of projects have actual ARM-related
| bugs, there was a whole class of "we didn't expect this
| platform/arch combination" compilation bugs that have seen
| fixes lately. It's not that the code has bugs on ARM, a lot
| of OSS has been compiling on ARM for a decade (or more)
| thanks to Raspberry Pis, Chromebooks, and Android, but build
| scripts didn't understand "darwin/arm64". Back in December
| installing stuff on an M1 Mac via Homebrew was a pain but
| it's gotten significantly easier over the past few months.
|
| But a million (est) new general purpose ARM computers
| hitting the population certainly affects the prioritizing
| of ARM issues in a bug tracker.
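| A minimal sketch of the kind of platform table many build
| scripts hard-coded (hypothetical Python, not any particular
| project's code); the (Darwin, arm64) row is exactly the entry
| that was missing before the M1 shipped:
|
|     import platform
|
|     KNOWN_TARGETS = {
|         ("Linux", "x86_64"): "linux/amd64",
|         ("Linux", "aarch64"): "linux/arm64",
|         ("Darwin", "x86_64"): "darwin/amd64",
|         ("Darwin", "arm64"): "darwin/arm64",  # the missing row
|     }
|
|     def target():
|         key = (platform.system(), platform.machine())
|         try:
|             return KNOWN_TARGETS[key]
|         except KeyError:
|             raise SystemExit(f"unsupported platform: {key}")
|
|     print(target())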
| mhh__ wrote:
| > compiler backends to Arm that want a piece of that sweet
| performance pie
|
| How many compilers didn't support ARM?
| GrumpyNl wrote:
| I need a new video card and there are no Nvidia cards to buy;
| they're all bought by miners. Will it go the same way with
| this one?
| redtriumph wrote:
| Currently, there are no plans for consumer-grade CPUs. Even
| this new CPU class is shipping in 2023.
| remexre wrote:
| > Today at GTC 2021 NVIDIA announces its first CPU
|
| Wait, Nvidia's been making ARM CPUs for years now; most memorably
| Project Denver.
| 015a wrote:
| Arguably most memorably Tegra, the CPU/GPU which powers the
| Nintendo Switch.
| Jasper_ wrote:
| That uses a licensed ARM Cortex design under the hood.
| jdsully wrote:
| NVIDIA called it their first "data center CPU". Our helpful
| reporter simplified it to the point of being flat out wrong.
| Not uncommon.
| justin66 wrote:
| I expected more from a site called VideoCardz.
| titzer wrote:
| Given that there are essentially no architectural details here
| other than bandwidth estimates, and the release timeline is in
| 2023, how exactly does this count as "unveiling"? Headline should
| read: "NVidia working on new arm chip due in two years", or
| something else much more bland.
| mrlento234 wrote:
| Not quite. The CSCS supercomputing center in Switzerland has
| already started receiving the hardware
| (https://www.cscs.ch/science/computer-science-
| hpc/2021/cscs-d...). Perhaps we may see some benchmarks. To
| wider HPC users it will only be available in 2023, as the
| article mentioned.
| IanCutress wrote:
| I suspect that's more racks of storage, not racks of compute.
| Nothing to suggest it's compute.
| seniorivn wrote:
| As I understand it, it's compute, just not CPU compute; those
| CPUs are designed to be good enough for CUDA servers.
| DetroitThrow wrote:
| Hey Ian, I love reading your posts on Anandtech, you're a
| fantastic technical communicator.
| titzer wrote:
| Hopefully some architectural details are forthcoming then!
| But that is not what is in this article.
| allie1 wrote:
| As AMD has shown us, a lot can happen in 3 years.
| valine wrote:
| I like the sound of a non-Apple arm chip for workstations. Given
| my positive experience with the M1 I'd be perfectly happy never
| using x86 again after this market niche is filled.
| webaholic wrote:
| I don't think this will be anywhere near as good as the M1,
| since they are using the ARM Neoverse cores.
| ac29 wrote:
| Apple throws a lot of transistors at their 4 performance
| cores in the M1 to get the performance they do - it's not
| clear that approach would realistically scale to a
| workstation CPU with 16, 32, or more cores (at least not with
| current fab capabilities).
| awill wrote:
| Me too. But my decades-old Steam collection isn't looking
| forward to it. That's one advantage of cloud gaming. It won't
| matter what your desktop runs on.
| nabla9 wrote:
| Finally news from Nvidia that really moved markets.
| Nvidia +4.68%, Intel -4.65%, AMD -4.47%
| 01100011 wrote:
| I wonder how permanent this is. As a Nvidian who sells his
| shares as soon as they vest and who owns some Intel for
| diversification, I wonder if I should load up on Intel? You
| really can't compete with their fab availability. Having a
| great design means nothing unless you can get TSMC to grant you
| production capacity.
| nabla9 wrote:
| TSMC takes orders years ahead and builds capacity to match,
| working together with big customers. Those who pay more
| (price per unit and large volume) get first shot. That's why
| Apple is always first, followed by Nvidia and AMD, then
| Qualcomm.
|
| There is pent-up demand because Intel's failure to deliver
| was not fully anticipated by anyone.
| gchadwick wrote:
| It'd be interesting to know if NVidia are going for an ARMv9
| core, in particular if they'll have a core with an SVE2
| implementation.
|
| It may be they don't want to detract from focus on the GPUs for
| vector computation so prefer a CPU without much vector muscle.
|
| Also interesting that they're picking up an arm core rather than
| continuing with their own design. Something to do with the
| potential takeover (the merged company would only want to support
| so many micro-architectural lines)?
| adrian_b wrote:
| They have said clearly that the core is licensed from ARM and
| is one of the future Neoverse models.
|
| There was no information on whether it will have any good SVE2
| implementation. On the contrary, they emphasized only the
| integer performance and the high-speed memory interface.
| dragontamer wrote:
| Neoverse V1 has SVE; the Neoverse E and N cores do not.
|
| "E" is efficiency, N is standard, V is high-speed. IIRC, N is
| the overall winner in performance/watt. Efficiency cores have
| the lowest clock speed (overall use the least amount of
| watts/power). V purposefully goes beyond the performance/watt
| curve for higher per-core compute capabilities
| Teongot wrote:
| Neoverse-N2 will have SVE2 (source https://github.com/gcc-
| mirror/gcc/blob/master/gcc/config/aar... )
| gchadwick wrote:
| Here's Anandtech's article on the previous Neoverse V1/N2
| announcement: https://www.anandtech.com/show/16073/arm-
| announces-neoverse-... Arm weren't saying anything official,
| but Anandtech did a little digging and reckons V1 is Armv8
| with SVE while N2 could be Armv9 with SVE2.
|
| I'd suspect NVidia would be using the V1 here as it's the
| higher-performing core, but there's no way to be certain.
| klelatti wrote:
| This has got me wondering whether an Nvidia-owned Arm could
| limit SVE2 implementations so as not to compete with Nvidia's
| GPUs. That would certainly be the case for Arm-designed cores -
| not a desirable outcome.
| MikeCapone wrote:
| I doubt it, it's not like the market for acceleration is
| stagnant and saturated and they need to steal some
| marketshare points from one side to help the other.
|
| It's all greenfield and growing so far, they'll win more by
| having the very best products they can make on both sides.
| mlyle wrote:
| You'd think. But it wouldn't be the first time a new
| product was hampered so as not to even theoretically
| cannibalize an existing product family, however slightly.
| theonlyklas wrote:
| I think they will use SVE2 because I assume they'll need to
| perform vector reads/writes to NVLink connected peripherals to
| reach that 900GB/s GPU-to-CPU bandwidth metric they described.
| api wrote:
| Tangent: Apple should bring back the Xserve with their M1 line,
| or alternately license the M1 core IP to another company to
| produce a differently-branded server-oriented chip. The
| performance of that thing is mind blowing and I don't see how
| this would compete with or harm their desktop and mobile
| business.
| bombcar wrote:
| How much of that performance is on-chip memory and how
| usable/scalable is that? An Xserve that is limited to one CPU
| and can't have more RAM would be pretty mediocre.
| AnthonyMouse wrote:
| The cheapest available Epyc (7313P) has 16 cores and dual
| socket systems have up to 128 cores and 256 threads. Server
| workloads are massively parallel, so a 4+4 core M1 would be
| embarrassed and Apple wouldn't want to subject themselves to
| that comparison.
|
| But another reason they won't do it is that TSMC has a finite
| amount of 5nm fab capacity. They can't make more of the chips
| than they already do.
| api wrote:
| I'm thinking of a 64-core M1. It would not be the laptop
| chip.
| ac29 wrote:
| A 4+4 core M1 is 16 billion transistors. Some of that is
| the little cores, GPU, etc., but it's not clear to me it's
| practical to get, say, 8x larger. That would be 128 billion
| transistors. As a point of comparison, NVIDIA's RTX 3090 is
| 28B transistors, and that's a huge, expensive chip.
| [deleted]
| [deleted]
| rektide wrote:
| There are a lot of interconnects (CCIX, CXL, OpenCAPI, NVLink,
| GenZ) brewing. Nvidia going big is, hopefully, a move that will
| prompt some uptake from the other chip makers. 900GBps link, more
| than main memory: big numbers there. Side note, I miss AMD being
| actively involved with interconnects. InfinityFabric seems core
| to everything they are doing, but back in the HyperTransport days
| it was something known, that folks could build products for,
| interoperate with. Not many did, but it's still frustrating
| seeing AMD keeping cards so much closer to the chest.
| filereaper wrote:
| Looks like NVidia broke up with IBM's POWER and made their own
| chip.
|
| They have interconnects from Mellanox, GPUs and their own CPUs
| now.
|
| I suspect the supercomputing lists will be dominated by NVidia
| now.
| arcanus wrote:
| That is certainly the trend. AMD is bringing Frontier online
| later this year, which might be the only counter to this.
| DonHopkins wrote:
| I love the name "Grace", after Grace Hopper.
| paulmd wrote:
| There's a tendency to use first names to refer to women in
| professional settings or positions of political power that is
| somewhat infantilizing and demeaning.
|
| I doubt anyone really deliberately sets out to be like "haha
| yessss today I shall elide this woman's credentials", but this
| is one of those unconscious gender-bias things that is
| commonplace in our society and is probably best to try and make
| a point of avoiding.
|
| https://news.cornell.edu/stories/2018/07/when-last-comes-fir...
|
| https://metro.co.uk/2018/03/04/referring-to-women-by-their-f...
|
| (etc etc)
|
| I'd prefer they used "Hopper" instead, in the same way they
| have chosen to refer to previous architectures by the last
| names of their namesakes (Maxwell, Pascal, Ampere, Volta,
| Kepler, Fermi, etc). I'd see that as being more professionally
| respectful for her contributions.
|
| But yes I very much like the idea of naming it after Hopper.
| bloak wrote:
| Perhaps you're being downvoted because it's a tangent. It's a
| real phenomenon, though, and an interesting one. Of course
| there are many things that influence which parts of someone's
| full name get used, and if the tendency is a problem it's a
| trivial one compared to all the other problems that women
| face, but, yes, in general it would probably be a good idea
| to be more consistent in this respect.
|
| Vaguely related: J. K. Rowling's "real" full name is Joanne
| Rowling. The publisher "thought a book by an obviously female
| author might not appeal to the target audience of young
| boys".
|
| There's another famous (in the UK at least) computer
| scientist called Hopper: Andy Hopper. So "G.B.M. Hopper",
| perhaps? That would have more gravitas than "Andy"!
| hderms wrote:
| I feel like there's a non-zero chance they named it Grace
| instead of Hopper so their new architecture doesn't sound
| like a bug or a frog or something. You could be right, though
| trynumber9 wrote:
| Hopper was already reserved for an Nvidia GPU:
| https://en.wikipedia.org/wiki/Hopper_(microarchitecture)
| paulmd wrote:
| Yeah, I dunno what is going on with that, I assumed that
| had changed if they were going to use the name "grace" for
| another product.
|
| I guess I'm not sure if "Hopper" refers to the product as a
| whole (like Tegra) and early leakers misunderstood that, or
| whether Hopper is the name of the microarchitecture and
| "Grace" is the product, or if it's changed from Hopper to
| Grace because they didn't like the name, or what.
|
| Otherwise it's a little awkward to have products named both
| "grace" and "hopper"...
| lprd wrote:
| So is ARM the future at this point? After seeing how well Apple's
| M1 performed against a traditional AMD/Intel CPU, it has me
| wondering. I used to think that ARM was really only suited for
| smaller devices.
| hilios wrote:
| Depends. Performance-wise it should be able to compete with or
| even outperform x86 in many areas. A big problem until now was
| cross compatibility regarding peripherals, which complicates
| running a common OS on ARM chips from different vendors. There
| is currently a standardization effort (Arm SystemReady SR) that
| might help with that issue though.
| Hamuko wrote:
| Based on initial testing, AWS EC2 instances with ARM chips
| performed as well as, if not better than, the Intel instances,
| while costing 20% less. The only drawback that I've really
| encountered thus far was that it complicates the build process.
| moistbar wrote:
| Does ARM have a uniquely complex build process, or is it the
| mix of architectures that makes it more difficult?
| sumtechguy wrote:
| The ARM ecosystem is all over the place. x86 has the benefit
| that most companies made it 'IBM compatible'. There are one-off
| x86 platforms but they are mostly forgotten at this point.
| The ARM CPU family itself is fairly consistent (mostly), but
| the included hardware is a very mixed bag. x86, on the other
| hand, has the history of 'build it to work like the IBM PC':
| all the way from how things boot up, to memory space
| addresses, to must-have I/O, etc. An ARM system may or may not
| have any of that, depending on which platform you target or
| are creating. Things like the Raspberry Pi have changed some
| of that, as many boards mimic the Broadcom platform,
| specifically the Raspberry Pi one. The x86 arch has also
| picked up some interesting baggage along the way because of
| what it is. We can mostly ignore it but it is there. For
| example, you would not build an ARM board these days with an
| IDE interface, but some of those bits still exist in the x86
| world.
|
| ARM is more of a toolkit for building different purpose-built
| computers (you even see them show up in USB sticks), while x86
| is a particular platform with a long history behind it. So you
| may see something like 'Amazon builds its own ARM computers'.
| That means they spun their own boards, built their own
| toolchains (more likely recompiled existing ones), and
| probably have their own OS distro to match. Each one of those
| is a fairly large endeavor. When you see something like
| 'Amazon builds its own x86 boards', they have shaved off the
| other two parts of that and are focusing on hardware. That
| they are building their own means they see the value in
| owning the whole stack. Also, having your own distro usually
| means you have to 'own' building the whole thing. I can go
| grab an x86 gcc stack from my repo provider; they will need
| to act as the repo owner, build it themselves, and keep up
| with the patches. Depending on what has been added, that can
| be quite the task all by itself.
| Hamuko wrote:
| Mix of architectures and the fact that our normal CI server
| is still x86-based and really didn't want to do ARM builds.
| ksec wrote:
| Based on a future ARM Neoverse core, so basically nothing much
| to see here from a CPU perspective. What really stands out are
| those ridiculous numbers from its memory system and
| interconnect.
|
| CPU: LPDDR5X with ECC memory at 500+ GB/s memory bandwidth.
| (Something Apple may dip into. R.I.P. for Macs with upgradable
| memory.)
|
| GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a typo.
|
| NVLink: 500GB/s
|
| This will surely further solidify CUDA's dominance. Not
| entirely sure how Intel's Xe with oneAPI and AMD's ROCm are
| going to compete.
| Dylan16807 wrote:
| > GPU: HBM2e at 2000 GB/s. Yes, three zeros, this is not a
| typo.
|
| It's a good step forward but your average consumer GPU is
| already around a quarter to a third of that and a Radeon VII
| had 1000 GB/s two years ago.
| jabl wrote:
| The Nvidia A100 80GB already provides 2 TB/s mem BW today.
| Also using HBM2e.
| m_mueller wrote:
| I think what you're missing here is the NVLink part. The fact
| that you can get a small cluster of these linked up like that
| for 400k, all wrapped in a box, makes HPC quite a bit more
| accessible. Even 5 years ago, if you wanted to run a regional
| sized weather model at reasonable resolution, you needed to
| have some serious funding (say, nation states or oil /
| insurance companies). Nowadays you could do it with some
| angel investment and get one of these Nvidia boxes and just
| program them like they're one GPU.
| kllrnohj wrote:
| Critically it's CPU to GPU NVLink here, not the "boring"
| GPU to GPU NVLink that's common on Quadros. 500GB/s
| bandwidth between CPU & GPU massively changes when & how
| you can GPU-accelerate things; that's a 10X difference over
| the status quo.
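| As a rough illustration of why that link speed matters (a toy
| Python model with assumed numbers, not benchmark data): the
| copy over the CPU<->GPU link is what decides whether shipping
| work to the GPU is worth it at all.
|
|     def offload_seconds(bytes_moved, link_gb_s, gpu_seconds):
|         return bytes_moved / (link_gb_s * 1e9) + gpu_seconds
|
|     data = 8e9        # 8 GB working set (assumed)
|     gpu_work = 0.05   # seconds of GPU compute (assumed)
|     for link in (16, 64, 500):  # GB/s: PCIe-ish up to Grace-class
|         print(link, round(offload_seconds(data, link, gpu_work), 3))
|     # at 500 GB/s the copy nearly disappears next to the compute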
| kimixa wrote:
| Also "cpu->cpu" NVLink is interesting. Though it was my
| understanding that NVLink is point-to-point, and would
| require some massive switching system to be able to
| access any node in the cluster anywhere near that rate
| without some locality bias (IE nodes on the "first"
| downstream switch are faster to access and less
| contention)
| de6u99er wrote:
| Don't know if it's just me but this product looks like a beta-
| product for early adopters.
| rektide wrote:
| It's initially for two huge HPC systems. It'll be interesting
| to see what kind of availability it ever has to the rest of the
| world.
| lprd wrote:
| So is ARM the future at this point? After seeing how well Apple's
| M1 performed against a traditional AMD/Intel CPU, it has me
| wondering. I used to think that ARM was really only suited for
| smaller devices.
| fulafel wrote:
| The instruction set doesn't make a significant difference
| technically; the main things about them are monopolies
| (patents) tied to ISAs, and software compatibility.
| rvanlaar wrote:
| I'm interested in your thoughts on why this doesn't make a
| significant difference. From what I've read, the M1 has a lot
| of tricks up its sleeve that are next to impossible on X86.
| For example ARM instructions can be decoded in parallel.
| kllrnohj wrote:
| It will come down entirely to who can sustain a good CPU core.
|
| Currently Apple is the only company making performance-
| competitive ARM cores that can make a reasonable justification
| for an architecture switch.
|
| Otherwise AMD's CPUs are still ahead of everyone else,
| including all other ARM CPU cores not made by Apple. And even
| Intel is still faster in places where performance matters more
| than power efficiency (e.g., desktop & PC gaming).
| aeyes wrote:
| Amazon's ARM chips are performance-competitive as well; for
| many workloads you can expect at least similar performance
| per core at the same clock speed.
| floatboth wrote:
| Arm's Neoverse cores are doing pretty well in the datacenter
| space -- on AWS, the Graviton2 instances are currently the
| best ones for lots of use cases. It's clear that core designs
| by Arm are really good. The problem currently is the lag
| between the design being done and various vendors' chips
| incorporating it.
|
| upd: oh also in the HPC world, Fujitsu with the A64FX seems
| to be like the best thing ever now
| rubatuga wrote:
| Fujitsu flying under the radar while having the fastest cpu
| ever made haha
| kllrnohj wrote:
| Graviton2 is competitive sometimes with Epyc, but also
| falls far behind in some tests (e.g., Java performance is a
| bloodbath). Overall, across the majority of tests, Neoverse
| consistently comes up short of Milan even when Neoverse is
| given a core-count advantage. And critically the per-core
| performance of Graviton2 / Neoverse is worse, and per-core
| performance is what matters to consumer space.
|
| But it can't just be competitive; it needs to be
| significantly better in order for the consumer space to
| care. Nobody is going to run Windows on ARM just to get
| equivalent performance to Windows on X86, especially not
| when that means most apps will be worse. That's what's
| really impressive about the M1, and so far is very unique
| to Apple's ARM cpus.
|
| > oh also in the HPC world, Fujitsu with the A64FX seems to
| be like the best thing ever now
|
| A64FX doesn't appear to be a particularly good CPU core;
| rather, it's a SIMD powerhouse. It's the AVX-512 problem -
| when you can use it, it can be great. But you mostly can't,
| so it's mostly dead weight. Obviously in the HPC space this is
| a different scenario entirely, but that's not going to
| translate to consumer space at all (and it's not an ARM
| advantage, either - 512-bit SIMD hit consumer space via x86
| first with Intel's Rocket Lake).
| klelatti wrote:
| Not sure why you're placing so much weight on Epyc
| outperforming Graviton but discounting designs / use
| cases where Arm is clearly now better. Plus it's clear
| that we are just at the beginning of a period where some
| firms with very deep pockets are starting to invest
| seriously in Arm on the server and the desktop.
|
| If the x64 ISA had major advantages over Arm then that would
| be significant, but I've not heard anyone make that case;
| instead it's a debate about how big the Arm advantage is.
|
| Can x64 remain competitive in some segments? Probably, and
| inertia will work in its favour. I do think it's
| inevitable that we will see a major shift to Arm though.
| huac wrote:
| So then we think about what makes Apple's M1 so good. One
| hard-to-replicate factor is that they designed their hardware
| and software together: the ops which macOS uses often are
| heavily optimized on chip.
|
| But one factor that you can replicate is colocating memory,
| CPU, and GPU - the system-on-chip architecture. That's what
| Nvidia looks to be going after with Grace, and I'm sure
| they've learned lessons from their integrated designs, e.g.
| Jetson. Very excited to see how this plays out!
| kllrnohj wrote:
| > one hard-to-replicate factor is that they designed their
| hardware and software together, the ops which MacOS uses
| often are heavily optimized on chip.
|
| Not really, they are still just using the same ARM ISA as
| everyone else. The only hardware/software integration magic
| of the M1 so far seems to be the x86 memory model emulation
| mode, which others could definitely replicate.
|
| > but one factor that you can replicate is colocating
| memory, CPU, and GPU, the system-on-chip architecture.
|
| AMD introduced that in the x86 world back in 2013 with
| their Kaveri APU ( https://www.zdnet.com/article/a-closer-
| look-at-amds-heteroge... ), and it's been fairly typical
| since then for on-die integrated GPUs on all ISAs.
| dkjaudyeqooe wrote:
| ARM is the present, RISC-V is the future and Intel is the past.
|
| The magic of Apple's M1 comes from the engineers who worked on
| the CPU implementation and the TSMC process.
|
| The architecture has some impact on performance but I think it
| is simplicity and ease of implementation that factor most
| into how well it can perform (as per the RISC idea). In that
| sense Intel lags for small, fast and efficient processors
| because their legacy architecture pays a penalty for decoding
| and translation (into simpler ops) overhead. Eventually designs
| will abandon ARM for RISC-V for similar reasons as well as
| financial ones.
|
| Really, today it's a question of who has the best
| implementation of any given architecture.
| mhh__ wrote:
| The next decade is ARM's for the taking, _but_ if Intel and AMD
| can make good cores then it's not anywhere close to a slam dunk.
|
| One of the reasons why the M1 is good is, pure and simple, that
| it has a pretty enormous transistor budget, not solely because
| it's ARM.
| api wrote:
| Being ARM has something to do with it. The x86 instruction
| decoder may be only about ~5% of the die, but it's 5% of the
| die that has to run _all the time_. Think about how warm your
| CPU gets when you run e.g. heavy FPU loads and then imagine
| that 's happening all the time. You can see the power
| difference right there.
|
| It's also very hard to achieve more than 4X parallelism
| (though I think Ice Lake got 6X at some additional cost) in
| decode, making instruction level parallelism harder. X86's
| hack to get around this is SMT/hyperthreading to keep the
| core fed with 2X instruction streams, but that adds a lot
| more complexity and is a security minefield.
|
| Last but not least: ARM's looser default memory model allows
| for more read/write reordering and a simpler cache.
|
| ARM has a distinct simplicity and low-overhead advantage over
| X86/X64.
| NortySpock wrote:
| > x86 instruction decoder may be only about ~5% of the die
|
| What percent of the die is an ARM instruction decoder?
| duskwuff wrote:
| Much less. x86 instruction decoding is complicated by the
| fact that instructions are variable-width and are byte-
| aligned (i.e. any instruction can begin at any address).
| This makes decoding more than one instruction per clock
| cycle complicated -- I believe the silicon has to try
| decoding instructions at every possible offset within the
| decode buffer, then mask out the instructions which are
| actually inside another instruction.
|
| ARM A32/A64 instruction decoding is dramatically simpler
| -- all instructions are 32 bits wide and word-aligned, so
| decoding them in parallel is trivial. T32 ("Thumb") is a
| bit more complex, but still easier than x86.
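| For contrast, a fixed-width ISA makes the boundary-finding step
| disappear entirely; a minimal Python sketch, illustrative only:
|
|     def fixed_width_decode(buf, width=4):
|         # every aligned 4-byte word is its own instruction, so
|         # N decoders can take N words with no dependencies
|         return [buf[i:i + width] for i in range(0, len(buf), width)]
|
|     print(fixed_width_decode(bytes(range(16))))  # 4 instructions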
| monocasa wrote:
| I totally agree with the core of your argument (aarch64
| decoding is inherently simpler and more power efficient
| than x86), but I'll throw out there that it's not quite
| as bad as you say on x86, as there are some non-obvious
| efficiencies (I've been writing a parallel x86 decoder).
|
| What nearly everyone uses is a 16 byte buffer aligned to
| the program counter being fed into the first stage
| decode. This first stage, yes, has to look at each byte
| offset as if it could be a new instruction, but doesn't
| have to do full decode. It only finds instruction length
| information. From there you feed this length information
| in and do full decode on the byte offsets that represent
| actual instruction boundaries. That's how you end up with
| x86 cores with '4 wide decode' despite needing to
| initially look at each byte.
|
| Now for the efficiencies. Each length decoder for each
| byte offset isn't symmetric. Only the length decoder at
| offset 0 in the buffer has to handle everything, and the
| other length decoders can simply flag "I can't handle
| this", and the buffer won't be shifted down past where
| they were on the next cycle and the byte 0 decoder can
| fix up any goofiness. Because of this, they can
|
| * be stripped out of instructions that aren't really used
| much anymore if that helps them
|
| * can be stripped of weird cases like handling crazy
| usages of prefix bytes
|
| * don't have to handle instructions bigger than their
| portion of the decode buffer. For instance a length
| decoder starting at byte 12 can't handle more than a
| 4-byte instruction anyway, so that can simplify its logic
| considerably. That means that the simpler length decoders
| end up feeding into the full-decoder selection higher up
| the stack, so some of the overhead cancels out in a nice
| way.
|
| On top of that, I think that 5% includes pieces like the
| microcode ROMs. Modern ARM cores almost certainly have
| (albeit much smaller) microcode ROMs as well to handle
| the more complex state transitions.
|
| Once again, totally agreed with your main point, but it's
| closer than what the general public consensus says.
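| A toy version of that two-stage scheme in Python. The "ISA"
| here is invented (a 0x66 prefix byte, a 0xB8 opcode with a
| 4-byte immediate, everything else one byte), so this is only a
| sketch of the idea, not real x86 length decoding:
|
|     def insn_length(buf, i):
|         # stage 1: guess a length at a single byte offset
|         n = 0
|         if buf[i] == 0x66 and i + 1 < len(buf):  # toy prefix
|             n, i = 1, i + 1
|         if buf[i] == 0xB8:       # toy opcode + 4-byte immediate
|             return n + 5
|         return n + 1
|
|     def decode_window(buf):
|         # run a length decoder at every offset of the window...
|         lengths = [insn_length(buf, i) for i in range(len(buf))]
|         # ...then walk only the offsets that are real boundaries
|         out, i = [], 0
|         while i < len(buf):
|             out.append(i)
|             i += lengths[i]
|         return out
|
|     window = bytes([0x90, 0xB8, 1, 2, 3, 4, 0x66, 0x90] + [0x90] * 8)
|     print(decode_window(window))  # boundaries: 0, 1, 6, 8, 9, ...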
| ant6n wrote:
| I wonder whether a modern byte-sized instruction encoding
| would sort of look like Unicode, where every byte is self
| synchronizing... I guess it can be even weaker than that,
| probably only every second or fourth byte needs to
| synchronize.
| pbsd wrote:
| The x86 decoder is not running all the time; the uops cache
| and the LSD exist precisely to avoid this. With
| instructions fed from the decoders you can only sustain 4
| instructions per cycle, while to get to 5 or 6 your
| instructions need to be coming from either the uops cache
| or the LSD. In the case of the Zen 3, the cache can deliver
| 8 uops per cycle to the pipeline (but the overall throughput
| is limited elsewhere at 6)!
|
| Furthermore, the high-performance ARM designs, starting
| with the Cortex-A77, started using the same trick---the
| 6-wide execution happens only when instructions are being
| fed from the decoded macro-op cache.
| ant6n wrote:
| How can you run 8 instructions at the same time if you
| only have 16 general purpose registers? You'd have to
| either be doing float ops or constantly spilling. So in
| integer code, how many of those instructions are just
| moving data between memory and registers (push/pop)?
|
| I'd say ARM has a big advantage for instruction level
| parallelism with 32 registers.
| mhh__ wrote:
| Register renaming for a start, and this is about decoding
| not execution
| ant6n wrote:
| Okay, fair. But the bigger subject is the inherent performance
| advantage of the architecture. You don't just want to
| decode many instructions per cycle, you also want to
| issue them. So decoding width and issuing width are
| related.
|
| And it seems to me that ARM has an advantage here. If you
| want to execute 8 instructions in parallel, you gotta
| actually have 8 independent things that need to get
| executed. I guess you could have a giant out of order
| buffer, and include stack locations in your register
| renaming scheme, but it seems much easier to find
| parallelism if a bunch of adjacent instructions are
| explicitly independent. Which is much easier if you have
| more registers - the compiler can then help the CPU
| keep all those instruction units fed.
| mhh__ wrote:
| The decoder might not be running strictly all the time,
| but I would wager that for some applications at least it
| doesn't make much of a difference. For HPC or DSP or
| whatever where you spend a lot of time in relatively
| dense loops the uop cache is probably big enough to ease
| the strain on the decoder, but for sparser code
| (compilers come to mind: lots of function calls and
| memory-bound work) I wouldn't be surprised if the cache
| made much less of a difference.
|
| I have vTune installed so I guess I could investigate
| this if I dig out the right PMCs
| pbsd wrote:
| I agree; compiler-type code will miss the cache most of
| the time. A simple test with clang++ compiling some
| nontrivial piece of C++:
|
|                 0  lsd_uops
|     1,092,318,746  idq_dsb_uops    ( +- 0.49% )
|     4,045,959,682  idq_mite_uops   ( +- 0.06% )
|
| The LSD is disabled in this chip (Skylake) due to errata,
| but we can see only 1/5th of the uops come from the uops
| cache. However, the more relevant experiment in terms of
| power is how many cycles the cache is active instead of
| the decoders:
|
|                 0  lsd_cycles_active
|       378,993,057  idq_dsb_cycles    ( +- 0.18% )
|     1,616,999,501  idq_mite_cycles   ( +- 0.07% )
|
| The ratio is similar: the regular decoders are inactive
| only around 1/5th of the time.
|
| In comparison, gzipping a 20M file looks a lot better:
|
|                 0  lsd_cycles_active
|     2,900,847,992  idq_dsb_cycles    ( +- 0.07% )
|       407,705,985  idq_mite_cycles   ( +- 0.33% )
| mhh__ wrote:
| This is why I said it's ARM's for the taking.
|
| I'm not familiar with how ARM's memory model affects the
| cache design - source?
| jayd16 wrote:
| Another reason is something like 150% of the memory bandwidth,
| and I'm sure there are other simple wins along those lines.
|
| The M1 isn't necessarily a win for Arm in general. Other
| manufacturers weren't competing before and it's yet to be seen
| if they will.
| mhh__ wrote:
| It's the memory stupid!
| to11mtm wrote:
| Specifically, the memory -latency-.
|
| By going on-package there's almost certainly latency
| advantages in addition to the much-vaunted bandwidth
| gains.
|
| That's going to pan out to better perf, and likely better
| power usage as well.
| NathanielK wrote:
| 150% compared to what?
| jayd16 wrote:
| The latest i9 and the latest Ryzen 9, i.e. the competition.
| NathanielK wrote:
| Intel Tiger Lake and AMD Renoir both support 128-bit
| LPDDR4X at 4266 MT/s. Maybe you're confusing the desktop
| chips that use conventional DDR4? The M1 isn't
| competitive with them.
| jayd16 wrote:
| Oh those are pretty new and I haven't seen any benchmarks
| with LPDDR in an equivalent laptop chip. Do you have a
| link to any?
| ravi-delia wrote:
| I've seen things like this a lot, and it's a bit confusing.
| If parts of the M1's performance are due to throwing compute
| at the problem, why hasn't Intel been doing that for years?
| What about ARM, or the M1, allowed this to happen?
| NathanielK wrote:
| Intel has. Many M1 design choices are fairly typical for
| desktop x86 chips, but unheard of with ARM.
|
| For example, the M1 has 128-bit wide memory. This has been
| standard for decades on the desktop (dual channel), but
| unheard of in cellphones. The M1 also has similar amounts
| of cache to the new AMD and Intel chips, but that's several
| times more than the latest Snapdragon. Qualcomm also
| doesn't just design for the latest node. Most of their
| volume is on cheaper, less dense nodes.
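| The arithmetic behind the memory-width point, as a quick sketch
| (peak theoretical numbers only; real sustained bandwidth is
| lower):
|
|     def peak_gb_s(bus_bits, mega_transfers_per_s):
|         # bytes per transfer times transfers per second
|         return bus_bits / 8 * mega_transfers_per_s * 1e6 / 1e9
|
|     print(peak_gb_s(128, 4266))  # ~68.3 GB/s: 128-bit LPDDR4X-4266
|     print(peak_gb_s(64, 4266))   # ~34.1 GB/s: one 64-bit channel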
| dpatterbee wrote:
| Buying the majority of TSMC's 5nm process output helped.
| It's a combination of good engineering, the most advanced
| process, and Intel shitting themselves, I would say.
| tambourine_man wrote:
| >...is pure and simple that it has a pretty enormous
| transistor budget
|
| There's a lot of brute force, yes, but it's not the only
| reason. There are lots of smart design decisions as well.
| amelius wrote:
| Yes, but those decisions optimize for the single user
| laptop case, not for e.g. servers.
| mhh__ wrote:
| "One of the reasons" I did say.
| tambourine_man wrote:
| True, I misread it.
| phendrenad2 wrote:
| It really comes down to how well they can emulate X86. People
| aren't going to give up access to 3 decades of Windows
| software.
| pjerem wrote:
| I'm sure ARM has already taken over x86 if you use a wider
| definition of personal computers. And a lot of people
| already gave up access to 3 decades of Windows software by
| using their phone or tablet as their main device.
|
| Plus, most of the last decade's software runs on some
| sort of VM or another (be it the JVM, the CLR, a
| JavaScript engine or even LLVM).
|
| Soon (in years), x86 will only be needed by professionals
| that are tied to really old software. And those particular
| needs will probably be satisfied by decent emulation.
| kllrnohj wrote:
| > Soon (in years), x86 will only be needed by
| professionals that are tied to really old software.
|
| There are also the PC & console gaming markets, which are
| not small and have not made any movements of any kind
| towards ARM so far.
| bitwize wrote:
| > So is ARM the future at this point?
|
| The near future. A few years out, RISC-V is gonna change
| everything.
| CalChris wrote:
| Apple isn't entering the cloud market. Moreover, the M1 isn't a
| cloud CPU. The M1 SoC emphasizes low latency and performance
| per watt over throughput.
| 1MachineElf wrote:
| I wonder what percentage of its supported toolchain components
| will be proprietary.
| CalChris wrote:
| _Grace, in contrast, is a much safer project for NVIDIA; they're
| merely licensing Arm cores rather than building their own ..._
|
| NVIDIA is buying ARM.
| klelatti wrote:
| Trying to buy Arm.
|
| Multiple competition investigations permitting.
___________________________________________________________________
(page generated 2021-04-12 23:00 UTC)