[HN Gopher] Porting HPC Applications to AMD Instinct MI300A Using Unified Memory and OpenMP
       ___________________________________________________________________
        
       Porting HPC Applications to AMD Instinct MI300A Using Unified
       Memory and OpenMP
        
       Author : arcanus
       Score  : 53 points
       Date   : 2024-05-04 16:47 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | bee_rider wrote:
        | APUs for HPC are going to be a wild ride. Accelerated computing
        | in shared memory. CPU-focused folks will actually get access to
        | some high-throughput compute on the sort of timescales that we
        | can actually reason about (the GPU is so far away).
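        | 
        | To make that concrete, here's a minimal sketch of the unified-
        | memory OpenMP style the paper's title refers to (the daxpy
        | kernel and the compile flags are my illustration, not taken
        | from the paper):
        | 
        |     // Build (illustrative): amdclang++ -fopenmp
        |     //   --offload-arch=gfx942 daxpy.cpp
        |     #include <cstddef>
        | 
        |     // Assert unified memory: the GPU dereferences ordinary
        |     // host pointers, so no map() clauses are needed.
        |     #pragma omp requires unified_shared_memory
        | 
        |     void daxpy(double a, const double* x, double* y,
        |                std::size_t n) {
        |       #pragma omp target teams distribute parallel for
        |       for (std::size_t i = 0; i < n; ++i)
        |         y[i] += a * x[i];
        |     }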
        
         | djmips wrote:
          | All video game consoles use APUs, and that does make memory-
          | related operations potentially faster, but at least for video
          | games memory isn't the bottleneck. I suppose for HPC it might
          | have more significance.
        
           | bayindirh wrote:
           | If you're doing simulations, or poking big matrices
           | continuously on CPUs, you can saturate the memory controller
           | pretty easily. If you know what you're doing, your FPU or
           | vector units are saturated at the same time, so "whole
           | system" becomes the bottleneck while it tries to keep itself
           | cool.
           | 
            | Games move that kind of data at the beginning and don't
            | stream much new data after the initial textures and models
            | are loaded. If you are working on HPC with GPUs, you may
            | need to constantly stream new data into the GPU while
            | streaming results out. This is why datacenter/compute GPUs
            | have multiple independent DMA engines.
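            | 
            | A sketch of the double-buffered streaming pattern that
            | makes those DMA engines pay off (kernel and names are
            | hypothetical; error checks omitted):
            | 
            |     #include <hip/hip_runtime.h>
            | 
            |     // Placeholder kernel standing in for real work.
            |     __global__ void process(double* d, size_t n) {
            |       size_t i = blockIdx.x * blockDim.x
            |                  + threadIdx.x;
            |       if (i < n) d[i] *= 2.0;
            |     }
            | 
            |     // Two streams: one chunk copies in/out while the
            |     // previous chunk computes on the other stream.
            |     // (Pin the host buffer for fully async copies.)
            |     void stream_all(double* host, size_t chunks,
            |                     size_t n) {
            |       hipStream_t s[2];
            |       double* dev[2];
            |       for (int b = 0; b < 2; ++b) {
            |         hipStreamCreate(&s[b]);
            |         hipMalloc(&dev[b], n * sizeof(double));
            |       }
            |       for (size_t c = 0; c < chunks; ++c) {
            |         int b = c % 2;  // alternate buffers/streams
            |         hipMemcpyAsync(dev[b], host + c * n,
            |             n * sizeof(double),
            |             hipMemcpyHostToDevice, s[b]);
            |         process<<<(n + 255) / 256, 256, 0, s[b]>>>(
            |             dev[b], n);
            |         hipMemcpyAsync(host + c * n, dev[b],
            |             n * sizeof(double),
            |             hipMemcpyDeviceToHost, s[b]);
            |       }
            |       hipDeviceSynchronize();
            |     }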
        
           | crest wrote:
            | Afaik those unified memory architectures are mostly neither
            | cache coherent nor able to handle virtual addresses
            | efficiently (you have to trap into privileged code to
            | pin/unpin the mappings), which means the relative cost is
            | lower than for a dedicated GPU in a PCIe slot, but still
            | too high. Only the "boring" old Bobcat-based AMD APUs
            | supported accessing unpinned virtual memory from the L3
            | (aka system level) cache, and nobody bothered porting code
            | to them.
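            | 
            | For reference, the pin/unpin round trip being described
            | looks like this in HIP (a sketch; each call is a
            | transition into the driver):
            | 
            |     #include <hip/hip_runtime.h>
            | 
            |     void with_pinning(double* buf, size_t n) {
            |       size_t bytes = n * sizeof(double);
            |       // Page-lock existing host memory so the DMA
            |       // engines may target it directly.
            |       hipHostRegister(buf, bytes,
            |                       hipHostRegisterDefault);
            |       // ... async copies against buf are now
            |       // safe and fast ...
            |       hipHostUnregister(buf);  // second trap
            |     }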
        
         | JonChesterfield wrote:
          | APUs are very cool for GPU programming in general. Explicitly
          | copying data to/from GPUs is a definite nuisance. I'm hopeful
          | that the MI300A will have a positive knock-on effect on the
          | low-power APUs in laptops and similar.
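          | 
          | For contrast with the unified-memory sketch upthread, this
          | is the explicit-copy style being called a nuisance (same
          | made-up daxpy, now spelling out its own traffic):
          | 
          |     #include <cstddef>
          | 
          |     void daxpy_mapped(double a, const double* x,
          |                       double* y, std::size_t n) {
          |       // Every launch declares its host<->device copies.
          |       #pragma omp target teams distribute parallel for \
          |           map(to: x[0:n]) map(tofrom: y[0:n])
          |       for (std::size_t i = 0; i < n; ++i)
          |         y[i] += a * x[i];
          |     }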
        
           | imtringued wrote:
           | >Explicitly copying data to/from GPUs is a definite nuisance.
           | 
            | CXL allows fine-grained shared memory, but people look at
            | the shiny high-bandwidth NVLink and talk about how much
            | better it is for... AI.
        
       | curt15 wrote:
       | I was talking with a friend in HPC lately who said that AMD is
       | actually quite competitive in the HPC space these days. For
       | example, Frontier
       | (https://docs.olcf.ornl.gov/systems/frontier_user_guide.html) is
       | an all-AMD installation. Do scientists actually use ROCm in their
       | code or does AMD have another programming framework for their
       | Instinct chips?
        
         | kkielhofner wrote:
         | I currently have a project with ORNL OLCF (on Frontier). The
         | short answer is yes. Happy to answer any questions I can.
        
           | ysleepy wrote:
            | ROCm or HIP? Does it start with porting a lot from CUDA
            | etc., or start fresh on top of the AMD APIs?
            | 
            | How much of the project time is spent on that compute-API
            | stuff compared to "payload" work?
        
         | almostgotcaught wrote:
         | National labs sign "cost-effective" deals. NVIDIA isn't cost-
         | effective. Aurora (at Argonne) is all Intel GPU. Aurora is also
         | a clusterfuck so that just tells you these decisions aren't
         | made by the most competent people.
        
           | jfkfif wrote:
           | nvidia absolutely gives deals to national labs and
           | universities. See Crossroads @ LANL, Isambard in the UK,
           | Perlmutter @ LBL. While AMD is being deployed at LLNL and
           | ORNL, Nvidia isn't done with their HPC game. Maybe not at the
           | leadership level, but we'll see how Oak Ridge and LANL decide
           | their next round of procurements
        
           | wmf wrote:
           | Both Frontier and Aurora bet on unproven future chips.
           | Sometimes it pays off and sometimes it doesn't.
        
           | Dalewyn wrote:
           | They are competent people, just not in the fields techies
           | want.
           | 
           | When you're a national laboratory and your wallet is taxes
           | from fellow Americans, it is very important that you find a
           | balance between bang and buck. Lest you get your budget
           | slashed or worse.
        
       | mathiasgredal wrote:
        | Having looked briefly at the code, I still think C++17 parallel
        | algorithms are more ergonomic than OpenMP:
        | https://rocm.blogs.amd.com/software-tools-optimization/hipst...
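        | 
        | For anyone who hasn't seen it, this is the style the linked
        | post offloads via hipstdpar: a standard algorithm plus an
        | execution policy, no pragmas (the daxpy example is mine, not
        | from the post):
        | 
        |     #include <algorithm>
        |     #include <execution>
        |     #include <vector>
        | 
        |     void daxpy(double a, const std::vector<double>& x,
        |                std::vector<double>& y) {
        |       std::transform(std::execution::par_unseq,
        |                      x.begin(), x.end(), y.begin(),
        |                      y.begin(),
        |                      [a](double xi, double yi) {
        |                        return yi + a * xi;
        |                      });
        |     }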
        
         | mgaunard wrote:
          | Funny how we only get LoC counts for the different versions,
          | but not the performance...
          | 
          | Of course the parallel algorithms are shorter; it's a more
          | high-level interface. But being explicit gives you more
          | control and potentially more performance.
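          | 
          | As an example of the control you give up: OpenMP lets you
          | pick launch geometry and scheduling per loop, which the
          | execution-policy overloads don't expose (values below are
          | illustrative, not tuned):
          | 
          |     #include <cstddef>
          | 
          |     void daxpy_tuned(double a, const double* x,
          |                      double* y, std::size_t n) {
          |       #pragma omp target teams distribute parallel for \
          |           num_teams(228) thread_limit(256) \
          |           schedule(static, 1)
          |       for (std::size_t i = 0; i < n; ++i)
          |         y[i] += a * x[i];
          |     }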
        
         | bee_rider wrote:
         | Is language support why people like OpenMP?
         | 
         | I think it is nice because it supports both C and Fortran, and
         | they use the same runtime, so you can do things like pin
         | threads to cores or avoid oversubscription. Stuff like calling
         | a Fortran library that uses OpenMP, from a C code that also
         | uses OpenMP, doesn't require anything clever.
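          | 
          | And because it's one runtime, one set of affinity controls
          | (e.g. OMP_PLACES=cores OMP_PROC_BIND=close) governs both
          | languages. A quick sketch of how to check what the runtime
          | actually did with placement:
          | 
          |     #include <omp.h>
          |     #include <cstdio>
          | 
          |     int main() {
          |       #pragma omp parallel
          |       {
          |         // Serialize output so lines don't interleave.
          |         #pragma omp critical
          |         std::printf("thread %d/%d on place %d\n",
          |                     omp_get_thread_num(),
          |                     omp_get_num_threads(),
          |                     omp_get_place_num());
          |       }
          |     }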
        
           | jltsiren wrote:
           | OpenMP has been around for a long time. People know how to
           | use it, and it has gained many features that are useful for
           | scientific computing.
           | 
           | The consortium behind OpenMP consists mostly of hardware
           | companies and organizations doing scientific computing.
           | Software companies are largely missing. That may contribute
           | to the popularity of OpenMP, as the interests of scientific
           | computing and software development are often different.
        
       ___________________________________________________________________
       (page generated 2024-05-04 23:00 UTC)