[HN Gopher] The AMD "Aldebaran" GPU that won exascale
       ___________________________________________________________________
        
       The AMD "Aldebaran" GPU that won exascale
        
       Author : jonbaer
       Score  : 58 points
       Date   : 2021-11-15 19:29 UTC (3 hours ago)
        
 (HTM) web link (www.nextplatform.com)
 (TXT) w3m dump (www.nextplatform.com)
        
       | visionscaper wrote:
        | What software will they use to, for instance, train large deep
        | learning models? Nvidia has CUDA; what does AMD have? Are they
        | writing new software from scratch? Maybe they have plenty of
        | frameworks for problems in the "traditional" HPC space (e.g.
        | weather forecasting), but in the ML space I've only heard of
        | ROCm, which seems to be poorly supported.
       | 
       | AMD seems such an odd choice for "AI supercomputers".
        
       | genpfault wrote:
       | > If you want to know how and why AMD motors have been chosen for
       | so many of the pre-exascale and exascale HPC and AI systems...
       | 
       | AMD...motors?
        
         | amelius wrote:
         | engines
         | 
          | Probably a bad translation.
        
       | phkahler wrote:
       | Around 1980 my family got their first computer. I've followed
       | this business ever since, and I was amazed that a CRAY could do
        | MFLOPS. My MS BASIC interpreter could do hundreds or even
       | thousands of FLOPs on its 8080A. I watched as the high end went
       | to hundreds of MFLOPS, then GIGAFLOPS which seemed insane. There
       | were national efforts to reach TFLOPS, and reading about the
       | challenges (IIRC at the time interconnect was a huge deal) made
       | it seem like the end was near. Moore's law was always in danger.
       | Then came PETAFLOPS consuming megawatts of power.
       | 
        | And now I play VR on a battery-powered gizmo doing about 1 TFLOP
       | strapped to my head, and EXAFLOPS are basically here. This is all
       | with at least TSMC 5nm, 3nm, 2nm, and multi-layer left on the
       | table. After watching this relentless advance for 4 decades I'm
       | pretty sure it will go beyond even that, but we just don't know
       | what it will look like yet.
       | 
        | It's become everyday tech to me, but if I look back, the
        | progression is absolutely astounding.
        
         | jacquesm wrote:
         | It is astounding. What is more astounding to me is that we burn
         | so many of these cycles on eye candy, and that we waste so many
          | of them on bloat. If not for that, your battery-powered
          | gizmo would run for many days on one charge instead of
          | having to be connected to its umbilical for 8 hours every
          | night.
        
           | interstice wrote:
            | If that's astounding, think about what most of us spend our
           | time doing with the insane amount of processing power
           | available between our ears.
        
             | jacquesm wrote:
              | Sure, but we didn't engineer that with performance in
              | mind, whereas with computers a faster generation was
              | enough to declare the previous one obsolete.
              | Productivity for computer-based applications was
              | actually pretty good even on the first generation of
              | those machines; pretty much every cycle counted. Some
              | people used them to play games, and there was a
              | recreational element to programming in its own right,
              | but it wasn't as though anybody would burn cycles to
              | give a UI the texture of the real thing. It was a
              | computer, it worked, and it produced results, which was
              | all that mattered.
        
       | dragontamer wrote:
       | https://www.amd.com/system/files/documents/amd-cdna2-white-p...
       | 
       | The article points out this CDNA2 whitepaper, which has the juicy
       | technical details.
       | 
        | CDNA1 is here:
        | https://www.amd.com/system/files/documents/amd-cdna-whitepap...
       | 
       | -----
       | 
        | CDNA2 / MI200 is a chiplet strategy with two "GCDs", each
        | functioning as a classic GPU. The two GCDs can access each
        | other's memory, but only at a lower 400 GBps speed (page 8 of
        | the whitepaper).
        | 
        | The actual HBM RAM is designed for 1600 GBps (per the
        | article), x2 since there are two GCDs. AMD advertises it as
        | 3200 GBps, but in actuality any one block/workgroup can only
        | get 2000 GBps (1600 GBps from local RAM + 400 GBps over
        | Infinity Fabric from the partner GCD). So it's really a bit
        | complicated and will likely be very workload specific.
       | 
        | If your data can be cloned / split efficiently, then the RAM
        | probably will look like 3200 GBps. But if you have to read
        | both GCDs' RAM to see all the data, you'll see a clear
        | slowdown.
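        | 
        | A minimal sketch of that bottleneck (my own toy model, not
        | anything from the whitepaper), assuming a kernel that pulls a
        | fraction f of its bytes from the partner GCD's memory:
        | 
        |     # Toy per-GCD bandwidth model for an MI200-style part,
        |     # using the figures above: 1600 GB/s of local HBM per
        |     # GCD, 400 GB/s over the inter-GCD link.
        |     LOCAL_GBPS, REMOTE_GBPS = 1600.0, 400.0
        | 
        |     def effective_gbps(f: float) -> float:
        |         """Peak streaming rate for one GCD when a fraction
        |         f of its bytes comes from the partner GCD's HBM."""
        |         local = LOCAL_GBPS / (1 - f) if f < 1 else float("inf")
        |         remote = REMOTE_GBPS / f if f > 0 else float("inf")
        |         return min(local, remote)  # whichever saturates first
        | 
        |     print(effective_gbps(0.0))  # 1600.0: all-local; x2 GCDs = 3200
        |     print(effective_gbps(0.2))  # 2000.0: the per-GCD best case
        |     print(effective_gbps(0.5))  # 800.0: link-bound slowdown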
        
       | fxtentacle wrote:
        | I sincerely hope for a competitive AMD GPU for deep learning.
        | But as long as it's a week-long journey with an unknown ending
        | to try to recompile TensorFlow to support ROCm, everyone I know
        | in AI will firmly stick with NVIDIA and their production-proven
        | drivers and CUDA APIs.
       | 
        | I wish AMD would offer something like NVIDIA's Inception
        | program to gift some accelerators and GPUs to suitable C++
        | coders (like me), so that there are at least a few tutorials on
        | the internet on how other people managed to successfully use
        | AMD + ROCm for deep learning.
       | 
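        | For what it's worth, here is the smoke test I'd start from,
        | assuming AMD's prebuilt tensorflow-rocm wheel installs cleanly
        | against your ROCm version (which is exactly the part that
        | tends to go wrong):
        | 
        |     # pip install tensorflow-rocm  (AMD's prebuilt TF wheel)
        |     import tensorflow as tf
        | 
        |     # On a working ROCm stack this lists the GPU; on a broken
        |     # one it returns [] and TF silently falls back to the CPU.
        |     print(tf.config.list_physical_devices("GPU"))
        |     print(tf.test.is_built_with_rocm())
        | 
        |     # Tiny matmul pinned to the accelerator as a sanity check.
        |     with tf.device("/GPU:0"):
        |         x = tf.random.normal((1024, 1024))
        |         print(tf.reduce_sum(tf.matmul(x, x)))
        | 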
       | EDIT: And it seems ROCm doesn't even support any of those new
       | RDNA2 accelerators or gaming GPUs:
       | https://github.com/RadeonOpenCompute/ROCm/issues/1344
       | 
       | So this is great hardware, but absolutely useless unless you are
       | big enough to write your own GPU drivers from scratch ~_~
        
         | rektide wrote:
         | AMD's not nowhere.
         | https://rocmdocs.amd.com/en/latest/Deep_learning/Deep-learni...
          | shows what should be a followable happy path to getting
          | TensorFlow going (the 2-year-old TF 1.15, and a 2.2 beta).
          | I'm curious what is prickly or hard about it.
         | 
          | IMO the deep learning folk need to be working more actively
          | towards the future. The CUDA free ride is amazing, and AMD's
          | HIP already does a good job of being CUDA compatible in a
          | general sense. But "CUDA" also sort of encompasses the
          | massive collection of libraries that Nvidia has written to
          | accelerate a huge range of use cases, and keeping pace with
          | that free ride is hard.
         | 
          | My hope is that eventually we start to invest in Vulkan
          | Compute. Vulkan is way, way harder than CUDA, but it's the
          | only right way forward I can see. Getting TensorFlow and
          | other libraries ported to run atop Vulkan is a herculean
          | feat, but once there's a start, I tend to believe most ML
          | practitioners won't have to think about the particulars, and
          | the deep engineering talent will be able to come in,
          | optimize the Vulkan engines quickly, and rapidly improve
          | whatever it is we start with.
          | 
          | It's a huge task, but it just seems like it's got to happen.
          | I don't see what alternative there is, long term, to
          | starting to get good with Vulkan.
        
           | spijdar wrote:
           | > I'm curious what is prickly or hard about it.
           | 
            | I don't want to presume, but the "what should be" makes
            | it sound like you haven't actually tried using ROCm.
            | 
            | My experience with it was an absolute nightmare; I've
            | never gotten ROCm working. Just as well, since it turns
            | out my systems never would have supported it anyway, for
            | various reasons (lacking PCIe atomics, for one). But I
            | never even got far enough to hit the driver problem: I
            | never got the whole custom LLVM fork / ROCm software
            | stack to build at all.
           | 
            | Caveat: I'm not professionally involved in deep learning
            | or HPC, and as others have mentioned, the stack was only
            | intended for a few specific cards running on very
            | specific hardware for HPC use cases.
            | 
            | But pretending that it's even a fraction as useful for
            | the "average" person trying to experiment, or even to
            | work at a low-to-medium level in machine learning, feels
            | off to me.
           | 
           | I don't think people will be swayed by platitudes about
           | creating a competitive open-systems ecosystem to use plainly
           | inferior software. Companies aren't going to spend oodles of
           | money (and individuals won't volunteer tons of time) to
           | suffer porting frameworks to target bare-bones APIs for the
           | sake of being good sports.
           | 
           | Until either nvidia screws over everyone so much that using
           | AMD cards becomes the path of least resistance, or AMD/Intel
           | offers products at significantly lower prices than nvidia, I
           | don't see the status quo changing much.
        
         | Certhas wrote:
          | ROCm support for gaming cards has been poor (and not
          | advertised), but it's part of the tech stack they are
          | selling with these accelerators:
         | 
         | https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontie...
         | 
          | It's clearly a real problem that AMD's ML software stack
          | isn't quite there and lacks support for the non-specialized
          | cards, but that's not really an issue for these HPC use
          | cases...
        
         | esistgut wrote:
          | A Blender beta with HIP support was released today. It
          | works on RDNA hardware (RDNA2 officially supported, RDNA1
          | enabled but unsupported). I guess a ROCm release that
          | supports RDNA is approaching after all.
        
         | my123 wrote:
          | ROCm 4.5 is also the last release to support their own Vega
          | 10-based accelerator (the Radeon Instinct MI25).
          | 
          | https://github.com/RadeonOpenCompute/ROCm#amd-instinct-mi25-...
          | 
          | In other words, AMD doesn't care... they just want the
          | supercomputer contracts, where the customers are savvy
          | enough to build their own very specific SW stack.
        
           | generalizations wrote:
            | Sounds like AMD might still be using the "Tesla Roadster"
            | strategy: selling fewer, more lucrative contracts for the
            | time being. It's probably not that they don't care, just
            | that, for now, they have to focus.
        
           | [deleted]
        
           | jacquesm wrote:
           | Is there even a single machine in the supercomputer top 10
           | that uses AMD GPUs?
           | 
           | I see NVIDIA all over the place there but I'm not aware of
           | any of them using AMD GPUs, though a couple do use AMD CPUs.
        
             | my123 wrote:
             | None today.
             | 
              | Those HPC machines will be the first ones.
        
         | esistgut wrote:
         | ROCm doesn't support RDNA at all
         | https://github.com/RadeonOpenCompute/ROCm/issues/887
        
           | Certhas wrote:
           | It's promised for ROCm 5.0 now...
           | 
            | https://github.com/RadeonOpenCompute/ROCm/issues/1180#issuec...
        
             | meragrin_ wrote:
             | "might expect good news with 5.0" is not a promise.
        
       | volta83 wrote:
       | From the same source two weeks ago:
       | https://www.nextplatform.com/2021/10/26/china-has-already-re...
       | 
       | Do they read their own news?
       | 
       | China won Exascale. Twice. Before anybody else.
       | 
        | Are there any MI200 systems in the Top500 yet?
        | 
        | Supercomputing 2021 is running, and the updated November 2021
        | Top500 list has been announced.
        | 
        | There is only one new system in the top 10, and that's an
        | NVIDIA A100 system from Microsoft.
       | 
       | The only 2 systems with > 100 PFLOPS are Summit and Fugaku.
        
       | [deleted]
        
       | rektide wrote:
        | I've been playing a little of the small-arena survival game
        | Warhammer 40k: Dawn of War 2 (2009), and when it starts, the
        | AMD "The Future is Fusion" logo is shown full screen. For the
        | longest time this was kind of a sad memento of something lost,
        | a future that never happened: Fusion was a ~2009 campaign for
        | their APUs, their combined GPU+CPU chips, and other possible
        | shared-memory systems.
        | 
        | Well, it's happening. Sort of. AMD is finally getting into the
        | post-PCIe game for real this time, though only at the very
        | high end of the market. (Perhaps upcoming consumer GPUs might
        | have such capabilities, but AMD seems to be shipping literally
        | only dozens/hundreds of high-end GPUs a month atm.) Fusion is
        | happening... for the very big. Oh, and also for Apple, whose
        | 200/400 GBps M1 Pro/Max chips are performing true wonders via
        | fused/unified memory. The Steam Deck, with ~66 GBps and an
        | integrated AMD APU/GPU, will be the next test. I'm not sure
        | how consoles are doing these days, which is another strongly
        | AMD corner.
       | 
        | In some ways, the Infinity Fabric 3 news makes me a bit sad.
        | In its past life, Infinity Fabric was known as HyperTransport
        | (HTX), an open standard backed by a HyperTransport Consortium,
        | with roots supposedly going all the way back to DEC and Jim
        | Keller (from whom Apple got some experience too, via the 2008
        | PA Semi acquisition) and other engineers. FPGAs, network
        | cards, and storage could all work closely with the CPU via
        | HTX. In this new regime, Infinity Fabric is finally catching
        | up with the equally closed/proprietary GPU-plus-CPU coupling
        | that Nvidia offers (only available on NV+Power architecture
        | computers AFAIK). But outside players aren't really able to
        | experiment with these new, faster, closer-to-memory
        | architectures, unlike with HTX. For that, folks need to use
        | one of the various other, more open fabrics/interconnects,
        | which are often lower latency than PCIe but usually not
        | faster: CXL, OpenCAPI, Gen-Z, and others.
        
         | dragontamer wrote:
          | OAM is an open standard that Intel + AMD seem to be
          | supporting, though.
         | 
          | And with AMD now the owner of Xilinx, there's a good chance
          | that this technology will span Xilinx FPGAs + AMD GPUs +
          | AMD CPUs.
        
       | ggm wrote:
        | If AMD wanted to sell more in this space, wouldn't it pay to
        | support the code which runs in this space? Intel and Nvidia
        | are masters of funding compilers, toolchains, and even
        | application stacks to work on their dies. Reading this article
        | and the comments here, I get the impression AMD hasn't
        | entirely mastered how you need to sell the gateway drugs as
        | well as the hard stuff, to sell more of the hard stuff in the
        | end.
        
       ___________________________________________________________________
       (page generated 2021-11-15 23:00 UTC)