[HN Gopher] The AMD "Aldebaran" GPU that won exascale
___________________________________________________________________
The AMD "Aldebaran" GPU that won exascale
Author : jonbaer
Score : 58 points
Date : 2021-11-15 19:29 UTC (3 hours ago)
(HTM) web link (www.nextplatform.com)
(TXT) w3m dump (www.nextplatform.com)
| visionscaper wrote:
 | What software will they use to, for instance, train large deep
 | learning models? Nvidia has CUDA; what does AMD have? Are they
 | writing new software from scratch? Maybe they have a lot of
 | frameworks to solve problems in the "traditional" HPC space
 | (e.g. weather forecasts), but in the ML space I have only heard
 | of ROCm, which seems to be poorly supported.
|
| AMD seems such an odd choice for "AI supercomputers".
| genpfault wrote:
| > If you want to know how and why AMD motors have been chosen for
| so many of the pre-exascale and exascale HPC and AI systems...
|
| AMD...motors?
| amelius wrote:
| engines
|
| Probably bad translation.
| phkahler wrote:
| Around 1980 my family got their first computer. I've followed
| this business ever since, and I was amazed that a CRAY could do
 | MFLOPS. My MS BASIC interpreter could do hundreds or even
 | thousands of FLOPs on its 8080A. I watched as the high end went
 | to hundreds of MFLOPS, then GIGAFLOPS, which seemed insane. There
| were national efforts to reach TFLOPS, and reading about the
| challenges (IIRC at the time interconnect was a huge deal) made
| it seem like the end was near. Moore's law was always in danger.
| Then came PETAFLOPS consuming megawatts of power.
|
 | And now I play VR on a battery-powered gizmo doing about 1 TFLOP
| strapped to my head, and EXAFLOPS are basically here. This is all
| with at least TSMC 5nm, 3nm, 2nm, and multi-layer left on the
| table. After watching this relentless advance for 4 decades I'm
| pretty sure it will go beyond even that, but we just don't know
| what it will look like yet.
|
| It's become everyday tech to me, but if I look back the
| progression is absolutely astounding.
| jacquesm wrote:
| It is astounding. What is more astounding to me is that we burn
| so many of these cycles on eye candy, and that we waste so many
 | of them on bloat. If not for that, your battery-powered gizmo
| would run for many days on one charge instead of having to be
| connected to its umbilical for 8 hours every night.
| interstice wrote:
| If that's astounding think about what most of us spend our
| time doing with the insane amount of processing power
| available between our ears.
| jacquesm wrote:
 | Sure, but we didn't engineer that with performance in mind,
 | whereas with computers a new generation was enough to declare
 | the previous one obsolete. Productivity for computer-based
 | applications was actually pretty good on the first generation
 | of those machines, where pretty much every cycle counted. Some
 | people used them to play games, and there was a recreational
 | element to programming in its own right, but it wasn't as
 | though anybody would burn cycles to give a UI the texture of
 | the real thing; it was a computer that worked and produced
 | results, and that was all that mattered.
| dragontamer wrote:
| https://www.amd.com/system/files/documents/amd-cdna2-white-p...
|
| The article points out this CDNA2 whitepaper, which has the juicy
| technical details.
|
| CDNA1 is here: https://www.amd.com/system/files/documents/amd-
| cdna-whitepap...
|
| -----
|
 | CDNA2 / MI200 is a chiplet strategy with two "GCDs", each
 | functioning as a classic GPU. These two GCDs can access each
 | other's memory, but only at a lower 400 GBps speed (page 8 of
 | the whitepaper).
 |
 | The actual HBM RAM is designed for 1600 GBps (per the article),
 | x2 since two GCDs exist. AMD says it's like 3200 GBps, but in
 | actuality any one block/workgroup can only get 2000 GBps (1600
 | GBps from local RAM + 400 GBps over the Infinity Fabric link to
 | the partner GCD). So it's really a bit complicated and will
 | likely be very workload-specific.
 |
 | If your data can be cloned / split efficiently, then the RAM
 | probably will look like 3200 GBps. But if you have to touch
 | both halves of RAM to see all the data, you'll see a clear
 | slowdown.
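 |
 | A minimal HIP sketch (mine, not from the whitepaper) of how
 | that topology looks to software, assuming each GCD enumerates
 | as its own device, as on MI250X; the buffer size and device
 | IDs here are illustrative:
 |
 |     // Each GCD of an MI200-class part is a separate HIP device.
 |     // Cross-GCD traffic rides the Infinity Fabric link (~400
 |     // GBps); local HBM is ~1600 GBps, so keep hot data local.
 |     #include <hip/hip_runtime.h>
 |     #include <cstdio>
 |
 |     #define CHECK(call) do {                           \
 |         hipError_t e = (call);                         \
 |         if (e != hipSuccess) {                         \
 |             printf("HIP: %s\n", hipGetErrorString(e)); \
 |             return 1;                                  \
 |         }                                              \
 |     } while (0)
 |
 |     int main() {
 |         int n = 0;
 |         CHECK(hipGetDeviceCount(&n));
 |         printf("%d HIP devices visible\n", n); // MI250X: 2
 |         if (n < 2) return 0;
 |
 |         const size_t bytes = 256u << 20;  // 256 MiB, illustrative
 |         float *buf0 = nullptr, *buf1 = nullptr;
 |         CHECK(hipSetDevice(0));
 |         CHECK(hipMalloc(&buf0, bytes));   // GCD 0's local HBM
 |         CHECK(hipSetDevice(1));
 |         CHECK(hipMalloc(&buf1, bytes));   // GCD 1's local HBM
 |
 |         int peer = 0;
 |         CHECK(hipDeviceCanAccessPeer(&peer, 1, 0));
 |         if (peer) {
 |             // Kernels on GCD 1 may now read buf0 directly,
 |             // but over the slower cross-GCD link.
 |             CHECK(hipDeviceEnablePeerAccess(0, 0));
 |         }
 |
 |         // An explicit cross-GCD copy also rides that link.
 |         CHECK(hipMemcpyPeer(buf1, 1, buf0, 0, bytes));
 |
 |         CHECK(hipFree(buf1));
 |         CHECK(hipSetDevice(0));
 |         CHECK(hipFree(buf0));
 |         return 0;
 |     }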
| fxtentacle wrote:
 | I would sincerely hope for a competitive AMD GPU for deep
 | learning. But as long as it's a week-long journey with an
 | unknown ending to try to recompile TensorFlow to support ROCm,
 | everyone I know in AI will firmly stick with NVIDIA and their
 | production-proven drivers and CUDA APIs.
|
 | I wish AMD would offer something like NVIDIA's Inception program
 | to gift some accelerators and GPUs to suitable C++ coders (like
 | me) so that there are at least a few tutorials on the internet
 | on how other people managed to successfully use AMD + ROCm for
 | deep learning.
|
| EDIT: And it seems ROCm doesn't even support any of those new
| RDNA2 accelerators or gaming GPUs:
| https://github.com/RadeonOpenCompute/ROCm/issues/1344
|
| So this is great hardware, but absolutely useless unless you are
| big enough to write your own GPU drivers from scratch ~_~
| rektide wrote:
 | AMD's not nowhere.
 | https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learni...
 | shows what should be a followable happy path to getting
 | TensorFlow going (the 2-year-old TF 1.15 and a 2.2 beta). I'm
 | curious what is prickly or hard about it.
|
 | IMO the deep learning folk need to be working more actively
 | towards the future. The CUDA free ride is amazing, but AMD's
 | HIP already does a good job of being CUDA compliant in a
 | general sense. But CUDA also sort of encompasses the massive
 | collection of libraries that Nvidia has written to accelerate a
 | huge number of use cases. Trying to keep pace with that free
 | ride is hard.
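 |
 | To make "CUDA compliant in a general sense" concrete, here is
 | a minimal sketch of my own (not from AMD's docs): a HIP saxpy
 | that is a near line-for-line translation of its CUDA
 | counterpart, which is roughly the mapping the hipify tools
 | automate.
 |
 |     #include <hip/hip_runtime.h>
 |     #include <cstdio>
 |     #include <vector>
 |
 |     // Same __global__ qualifier and thread-index built-ins
 |     // as CUDA.
 |     __global__ void saxpy(int n, float a, const float* x,
 |                           float* y) {
 |         int i = blockIdx.x * blockDim.x + threadIdx.x;
 |         if (i < n) y[i] = a * x[i] + y[i];
 |     }
 |
 |     int main() {
 |         const int n = 1 << 20;
 |         const size_t bytes = n * sizeof(float);
 |         std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
 |
 |         float *dx = nullptr, *dy = nullptr;
 |         hipMalloc(&dx, bytes);            // cf. cudaMalloc
 |         hipMalloc(&dy, bytes);
 |         hipMemcpy(dx, hx.data(), bytes, hipMemcpyHostToDevice);
 |         hipMemcpy(dy, hy.data(), bytes, hipMemcpyHostToDevice);
 |
 |         // cf. saxpy<<<grid, block>>>(n, 2.0f, dx, dy) in CUDA
 |         hipLaunchKernelGGL(saxpy, dim3((n + 255) / 256),
 |                            dim3(256), 0, 0, n, 2.0f, dx, dy);
 |         hipMemcpy(hy.data(), dy, bytes, hipMemcpyDeviceToHost);
 |
 |         printf("y[0] = %.1f (expect 4.0)\n", hy[0]);
 |         hipFree(dx);
 |         hipFree(dy);
 |         return 0;
 |     }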
|
 | My hope is that eventually we start to invest in Vulkan
 | Compute. Vulkan is way, way harder than CUDA, but it's the only
 | right way I can see to do things. Getting TensorFlow & other
 | libraries ported to run atop Vulkan is a herculean feat, but
 | once there's a start, I tend to believe most ML practitioners
 | won't have to think about the particulars, and I think the deep
 | engineering talent will come, optimize, and improve the Vulkan
 | engines very quickly, rapidly improving whatever it is we start
 | with.
|
| It's a huge task, but it just seems like it's got to happen. I
| don't see what alternative there is, long term, to starting to
| get good with Vulkan.
| spijdar wrote:
| > I'm curious what is prickly or hard about it.
|
 | I don't want to presume, but "what should be" makes it sound
 | like you haven't actually tried using ROCm yourself.
|
 | My experience with it was an absolute nightmare; I've never
 | gotten ROCm working. Just as well, since it turns out my
 | systems never would have supported it anyway for various
 | reasons (lacking PCIe atomics, for one). But I never actually
 | got far enough to run into the driver problem: I never got the
 | whole custom LLVM fork / ROCm software stack to work.
|
 | Caveat: I'm not professionally involved in deep learning or
 | HPC, and as others have mentioned, the framework was only
 | intended for a few specific cards running on very specific
 | hardware for HPC use cases.
|
 | But pretending this is even a fraction as useful for the
 | "average" person trying to experiment, or even work at a
 | low-to-medium level in machine learning, feels off to me.
|
| I don't think people will be swayed by platitudes about
| creating a competitive open-systems ecosystem to use plainly
| inferior software. Companies aren't going to spend oodles of
| money (and individuals won't volunteer tons of time) to
| suffer porting frameworks to target bare-bones APIs for the
| sake of being good sports.
|
| Until either nvidia screws over everyone so much that using
| AMD cards becomes the path of least resistance, or AMD/Intel
| offers products at significantly lower prices than nvidia, I
| don't see the status quo changing much.
| Certhas wrote:
 | ROCm support for gaming cards has been poor (and not
 | advertised), but it's part of the tech stack they are selling
 | with these accelerators:
|
| https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontie...
|
 | It's clearly a real problem that AMD's ML software stack isn't
 | quite there and lacks support for the non-specialized cards,
 | but that's not really an issue for these HPC use cases...
| esistgut wrote:
 | A Blender beta version with HIP support was released today. It
 | works on RDNA hardware (RDNA2 officially supported, RDNA1
 | enabled but not supported). I guess ROCm support for these
 | cards is approaching after all.
| my123 wrote:
 | ROCm 4.5 is also the last release to support their own Vega
 | 10-based accelerator (the Radeon Instinct MI25).
|
| https://github.com/RadeonOpenCompute/ROCm#amd-instinct-mi25-...
|
 | aka AMD doesn't care... they just want the supercomputer
 | contracts, where the customers are savvy enough to build their
 | own very specific software stack.
| generalizations wrote:
 | Sounds like AMD might still be using the 'Tesla Roadster'
 | strategy of selling fewer, more lucrative contracts for the
 | time being. It's probably not that they don't care, just that
 | for now they have to focus.
| [deleted]
| jacquesm wrote:
| Is there even a single machine in the supercomputer top 10
| that uses AMD GPUs?
|
| I see NVIDIA all over the place there but I'm not aware of
| any of them using AMD GPUs, though a couple do use AMD CPUs.
| my123 wrote:
 | None today.
 |
 | Those HPC machines will be the first ones.
| esistgut wrote:
| ROCm doesn't support RDNA at all
| https://github.com/RadeonOpenCompute/ROCm/issues/887
| Certhas wrote:
 | It's promised for ROCm 5.0 now...
 |
 | https://github.com/RadeonOpenCompute/ROCm/issues/1180#issuec...
| meragrin_ wrote:
| "might expect good news with 5.0" is not a promise.
| volta83 wrote:
| From the same source two weeks ago:
| https://www.nextplatform.com/2021/10/26/china-has-already-re...
|
| Do they read their own news?
|
| China won Exascale. Twice. Before anybody else.
|
 | Are there any MI200 systems in the Top500 yet?
|
| Supercomputing 2021 is running and the updated November 2021
| Top500 list was announced.
|
 | There is only one new system in the top 10, and that's an
 | NVIDIA A100 system from Microsoft.
|
| The only 2 systems with > 100 PFLOPS are Summit and Fugaku.
| [deleted]
| rektide wrote:
 | I've been playing a little of the small-arena-survival game
 | Warhammer 40k: Dawn of War 2 (2009), and when it starts, the
 | AMD "The Future is Fusion" logo shows full screen. For the
 | longest time this was kind of a sad memento of something lost,
 | a future that never happened: Fusion was a ~2009 campaign for
 | their APUs, their GPU+CPU chips, and other possible
 | shared-memory systems.
 |
 | Well, it's happening. Sort of. AMD is finally getting into the
 | post-PCIe game for reals this time. Only at the very high end
 | of the market though. (Perhaps upcoming consumer GPUs might
 | have such capabilities, but AMD seems to be shipping only
 | literally dozens/hundreds of high-end GPUs a month atm.)
 | Fusion is happening... for the very big. Oh, and also for
 | Apple, whose 200 / 400 GBps M1 Pro/Max chips are performing
 | true wonders via fused/unified memory. The Steam Deck, with
 | ~66 GBps and an integrated AMD APU/GPU, will be the next test.
 | I'm not sure how consoles are doing these days, which is
 | another strongly AMD corner.
|
 | In some ways, the Infinity Fabric 3 news makes me a bit sad. In
 | its past life, Infinity Fabric was known as HyperTransport
 | (HTX), an open standard backed by a HyperTransport Consortium,
 | with roots supposedly going all the way back to DEC, Jim Keller
 | (from whom Apple got some experience too, via the 2008 PA Semi
 | acquisition), and other engineers. FPGAs, network cards, and
 | storage could all work closely with the CPU via HTX. In this
 | new regime, Infinity Fabric is finally catching up with the
 | similarly closed/proprietary GPU-plus-CPU coupling that Nvidia
 | offers (only available on NV+Power architecture computers
 | AFAIK). But outside players aren't really able to experiment
 | with these new, faster, closer-to-memory architectures, unlike
 | with HTX. For that, folks need to use some of the various other
 | more-open fabrics/interconnects, which are often lower latency
 | than PCIe but usually not faster: CXL, OpenCAPI, Gen-Z, and
 | others.
| dragontamer wrote:
| OAM is an open standard that Intel + AMD seem to be supporting
| though.
|
 | And with AMD acquiring Xilinx, there's a good chance that this
 | technology will span Xilinx FPGAs + AMD GPUs + AMD CPUs.
| ggm wrote:
 | If AMD wanted to sell more in this space, wouldn't it pay to
 | support the code which runs in this space? Intel and Nvidia are
 | masters of funding compilers, toolchains, and even application
 | stacks to work on their dies. Reading this article and the
 | comments here, I get the impression AMD hasn't entirely
 | mastered how you need to sell the gateway drugs as well as the
 | hard stuff, to sell more of the hard stuff in the end.
___________________________________________________________________
(page generated 2021-11-15 23:00 UTC)