[HN Gopher] AMD's CDNA 3 Compute Architecture
       ___________________________________________________________________
        
       AMD's CDNA 3 Compute Architecture
        
       Author : ksec
       Score  : 118 points
       Date   : 2023-12-17 18:51 UTC (4 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | jdewerd wrote:
       | > AMD diverged their GPU architecture development into separate
       | CDNA and RDNA lines specialized for compute and graphics
       | respectively.
       | 
       | Ooooh, is that why the consumer cards don't do compute? Yikes, I
       | was hoping that was just a bit of misguided segmentation but this
       | sounds like a high level architecture problem, like an interstate
       | without an on ramp. Oof.
        
         | jacoblambda wrote:
          | They added support for high-end Radeon cards 2 months ago.
          | ROCm is "eventually" coming to RDNA in general, but it's a
          | slow process, which is more or less in line with how AMD has
          | approached ROCm from the start (targeting a very small subset
          | of compute and slowly expanding it with each major version).
         | 
         | https://www.tomshardware.com/news/amd-enables-rocm-and-pytor...
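          | 
          | As a rough sketch of what that support looks like in practice
          | (assuming a ROCm build of PyTorch is installed; ROCm devices
          | are exposed through the torch.cuda namespace):
          | 
          |     import torch
          | 
          |     # torch.version.hip is set on ROCm builds, None otherwise.
          |     print("HIP runtime:", torch.version.hip)
          |     print("GPU available:", torch.cuda.is_available())
          | 
          |     if torch.cuda.is_available():
          |         # "cuda" is the device string even on Radeon hardware.
          |         x = torch.randn(1024, 1024, device="cuda")
          |         print((x @ x).sum().item())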
        
           | mnau wrote:
           | That's a good step forward. This separation is a massive
           | barrier to entry. Making CUDA work reliably on every product
           | was a masterstroke by NVIDIA.
        
           | treprinum wrote:
            | The issue is that RDNA and CDNA are different, so when an
            | enthusiast writes fast RDNA code it doesn't mean it would
            | work well on CDNA, and vice versa. Not sure why AMD had to
            | go this route; only high-end pros will write software for
            | CDNA, and they won't get any mindshare.
        
             | JonChesterfield wrote:
             | What do you have in mind for code that works noticeably
             | better on one than the other?
        
         | DarkmSparks wrote:
          | I was quite surprised to find that at least one of the top500
          | systems is based on AMD GPUs. It seems to me that not bringing
          | that to the consumer market was a conscious choice made some
          | time ago.
        
         | cogman10 wrote:
         | Looks like the split was around 2016? Which could make sense
         | given the state of crypto at the time. One problem that hit
         | nvidia more than AMD is consumer cards getting vacuumed up by
         | crypto farms. AMD making a conscious split effectively isolated
         | their compute cards from their gamer cards.
         | 
          | That being said, I can't imagine this is something that's been
          | good for adoption of AMD cards for compute tasks. The
          | wonderful thing about CUDA is that you don't need a special
          | accelerator card to develop CUDA code.
        
           | wmf wrote:
           | You can't really isolate crypto mining. If a gaming card is
           | profitable to mine with, miners will buy it and use the
           | gaming drivers.
        
           | Zardoz84 wrote:
            | If I remember correctly, AMD GPU cards were selling very
            | well to crypto miners. The split wasn't because of that.
            | They did it because gaming and compute have different
            | hardware optimization requirements, and splitting allowed
            | them to put better, more competitive GPUs and compute cards
            | on the market.
        
           | dotnet00 wrote:
            | The split had more to do with GCN being very compute-heavy
            | and AMD's software completely failing to take advantage of
            | that. GCN got carried forward as CDNA, and got heavily
            | reworked for graphics as RDNA. This is why, while RDNA is
            | nowhere near as good at compute-oriented features, it tends
            | to outperform NVIDIA in pure rasterization.
            | 
            | In contrast, current NVIDIA GPUs are very compute-heavy
            | architectures (to the point that they often artificially
            | limit certain aspects of their consumer cards beyond fp64 so
            | as not to interfere with the workstation cards), and they
            | take great advantage of it all with their various features
            | combining software and hardware capabilities.
        
         | ColonelPhantom wrote:
         | > Ooooh, is that why the consumer cards don't do compute?
         | 
         | No, not really. Sure, the consumer cards are based on a less
         | suitable architecture, so they won't be as good, but it's not
         | like it's not a general purpose GPU architecture anymore.
         | 
          | The real problem is that ROCm made the very weird technical
          | decision of shipping GPU machine code rather than bytecode.
          | Nvidia has their own bytecode, PTX, and Intel just uses SPIR-V
          | for oneAPI.
          | 
          | This means that ROCm support for a GPU architecture requires
          | shipping binaries of all the libraries (such as rocBLAS,
          | MIOpen, rocSPARSE, rocFFT) for every supported architecture.
         | 
         | As the cherry on top, on paper one GPU family consists of
         | different architectures! The RX 6700 XT identifies as a
         | different 'architecture' (gfx1031) than the RX 6800 XT
         | (gfx1030). These are 'close enough' that you can tell ROCm to
         | use gfx1030 binaries on gfx1031 and it'll usually just work,
         | but still.
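          | 
          | A rough sketch of that override (assuming a ROCm build of
          | PyTorch on a gfx1031 card; HSA_OVERRIDE_GFX_VERSION is the
          | environment variable the ROCm runtime consults, and "10.3.0"
          | corresponds to gfx1030):
          | 
          |     import os
          | 
          |     # Pretend to be gfx1030 so prebuilt gfx1030 code objects
          |     # load. Only sensible because the two ISAs are nearly
          |     # identical; must be set before the ROCm runtime loads,
          |     # i.e. before importing torch.
          |     os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")
          | 
          |     import torch
          |     print(torch.cuda.is_available())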
         | 
         | I feel like this is a large part of the reason AMD is so bad
         | about supporting older GPUs in ROCm. It'd cost a bit of effort,
         | sure, but I feel like the enormous packages you'd get are a
         | bigger issue. (This is another factor that kills hobbyist ROCm,
         | because anyone with a Pascal+ GPU can do CUDA, but ROCm support
         | is usually limited to newer and higher end cards.)
        
           | slavik81 wrote:
           | > As the cherry on top, on paper one GPU family consists of
           | different architectures! The RX 6700 XT identifies as a
           | different 'architecture' (gfx1031) than the RX 6800 XT
           | (gfx1030). These are 'close enough' that you can tell ROCm to
           | use gfx1030 binaries on gfx1031 and it'll usually just work,
           | but still.
           | 
           | As far as I can tell, the gfx1030 and gfx1031 ISAs are
           | identical in all but name. LLVM treats them exactly the same
           | and all tests for the ROCm math libraries will pass when you
           | run gfx1030 code on gfx1031 GPUs using the override.
           | 
           | The Debian package for HIP has a patch so that if the user
           | has a gfx1031 GPU and there are no gfx1031 code objects
           | available, it will fall back to using gfx1030 code objects
           | instead. When I proposed that patch upstream, it was rejected
           | in favour of adding a new ISA that explicitly supports all
           | the GPUs in a family. That solution is more complex and is
           | taking longer to implement, but will more thoroughly solve
           | that problem (as the solution Debian is using only works well
           | when there's one ISA in the family that is a subset of all
           | the others).
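            | 
            | The idea behind that fallback, paraphrased as a hypothetical
            | sketch (the helper below is illustrative only, not the
            | actual patch):
            | 
            |     # Use a compatible ISA when no exact code object exists.
            |     FALLBACKS = {"gfx1031": "gfx1030"}  # near-identical ISAs
            | 
            |     def pick_code_object(gpu_isa, available):
            |         if gpu_isa in available:
            |             return gpu_isa
            |         alt = FALLBACKS.get(gpu_isa)
            |         if alt in available:
            |             return alt
            |         raise RuntimeError("no code object for " + gpu_isa)
            | 
            |     # prints "gfx1030"
            |     print(pick_code_object("gfx1031", {"gfx1030", "gfx900"}))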
        
         | beebeepka wrote:
         | I thought trolling wasn't welcome on HN
        
         | vegabook wrote:
         | AMD has always been a terrible steward of ATI.
         | 
          | They're fundamentally a hardware company (as per Lisa Su's CV)
          | which didn't cotton on to the fact that CUDA was the killer
          | feature. I remember how @Bridgman kept fighting a rearguard
          | action on Phoronix, his job being to keep devs on board.
          | Losing battle.
          | 
          | I kinda understand it because 80s/90s-era hardware people
          | intrinsically think hardware is the top dog in the stack, and
          | that's where all AMD management comes from, including Su.
         | 
          | Koduri understood that Nvidia was killing AMD because the
          | consumer cards could run CUDA. So he pushed through, against
          | Lisa Su, the Radeon VII, which until very recently, and for
          | many years, was the only consumer card that ROCm supported. He
          | was basically fired shortly thereafter, and the RVII, which
          | was a fantastic card, was shut down sharpish. Then they
          | brought in Wang, who crystallised the consumer/pro
          | segmentation.
         | 
          | Now they're furiously trying to backpedal, and it's too late.
          | There are multiple would-be competitors, but basically the
          | only one worth talking about is Apple with Metal.
         | 
         | AMD lost the window.
        
           | simfree wrote:
            | ATI on its own did not have a solid future; they could have
            | easily ended up like their fellow Canadian companies Nortel
            | or BlackBerry.
        
             | vegabook wrote:
             | "Dead or Canadian" yes?
        
             | hylaride wrote:
             | I live in Toronto and knew people who worked for ATI back
             | in the 1990s-2000s. The management structure was _fucked_.
             | In those days it was generally regarded that ATI could and
             | did produce better hardware than NVIDIA, but ATI's drivers
              | were horrible and only evened out near the end of the
              | hardware's lifetime. By then it was far too late, and all
              | the benchmarks would have long since been done on the
              | buggy drivers, driving people to NVIDIA.
              | 
              | I kid you not: ATI's software-side reviews and incentives
              | literally meant QA got rewarded for finding bugs and the
              | software engineers were then rewarded for fixing them. You
              | can imagine what happened next. The bitch of it is that it
              | was overall just a symptom of the problem that the company
              | as a whole didn't prioritize software. NVIDIA knew that
              | the drivers and the software efficiency of using the
              | hardware were just as important as the hardware, if not
              | more so (the stories from the 1990s of how John Carmack
              | used various hacks with hardware are a good example of
              | that).
        
         | sva_ wrote:
          | Well, the APU does; there are compute intrinsics and even a
          | small guide by AMD on how to use them:
         | 
         | https://gpuopen.com/learn/wmma_on_rdna3/
        
         | Const-me wrote:
         | > the consumer cards don't do compute?
         | 
         | Typically, software developers only support a single GPGPU API,
         | and that API is nVidia CUDA.
         | 
          | Technically, consumer AMD cards are awesome at compute. For
          | example, UE5 renders triangle meshes with compute instead of
          | the graphics pipeline:
          | https://www.youtube.com/watch?v=TMorJX3Nj6U
          | Moreover, AMD cards often outperform nVidia equivalents
          | because nVidia prioritized ray tracing and DLSS over compute
          | power and memory bandwidth.
          | 
          | The issue is, no tech company is interested in adding a D3D or
          | Vulkan backend to AI libraries like PyTorch. nVidia is not
          | doing that because they are happy with the status quo. Intel
          | and AMD are not doing that because both hope to replace CUDA
          | with their own vendor-specific equivalents, instead of an open
          | GPU API.
        
           | boppo1 wrote:
           | >Vulkan backend to AI libraries like PyTorch
           | 
            | How many man-hours is a project like this?
        
             | Const-me wrote:
              | A couple of times in the past I wanted to port open source
              | ML models from CUDA/Python to a better technology stack. I
              | have ported Whisper https://github.com/Const-me/Whisper/
              | and Mistral https://github.com/Const-me/Cgml/ to D3D11. I
              | don't remember how much time I spent, but given both were
              | unpaid part-time hobby projects, probably under 160 hours
              | each.
              | 
              | These software projects were great for validating the
              | technology choices, but note I only did the bare minimum
              | to implement specific ML models. Implementing a complete
              | PyTorch backend is gonna involve dramatically more work. I
              | can't even estimate how much more because I'm not an
              | expert in Python or these Python-based ML libraries.
        
           | GeekyBear wrote:
           | > The issue is, no tech company is interested in adding D3D
           | or Vulkan backend to AI libraries like PyTorch.
           | 
           | AMD is interested in making PyTorch take advantage of their
           | chips, and has done so.
           | 
           | > we are delighted that the PyTorch 2.0 stable release
           | includes support for AMD Instinct(tm) and Radeon(tm) GPUs
           | that are supported by the ROCm(tm) software platform.
           | 
           | When NVIDIA is charging 1000% profit margins for a GPU aimed
           | at use as an AI accelerator, you can expect competitors to be
           | willing to do what it takes to move into that market.
           | 
            | https://www.tomshardware.com/news/nvidia-makes-1000-profit-o...
        
           | zozbot234 wrote:
            | Vulkan Compute backends for numerical compute (as typified
            | by both OpenCL and SYCL) are challenging; you can look at
            | the clspv project https://github.com/google/clspv for the
            | nitty-gritty details. The lowest-effort path so far is most
            | likely via some combination of ROCm/HIP (for hardware that
            | AMD bothers to support themselves) and the Mesa project's
            | Rusticl backend (for everything else).
        
             | Const-me wrote:
             | > Vulkan Compute backends for numerical compute (as
             | typified by both OpenCL and SYCL) are challenging
             | 
              | Microsoft has an offline dxc.exe compiler which compiles
              | HLSL to SPIR-V. Also, DXVK has a JIT compiler which
              | recompiles DXBC bytecode to SPIR-V. Both technologies are
              | old, stable and reliable; for example, DXVK's JIT compiler
              | is a critical software component of the Steam Deck
              | console.
             | 
             | > The lowest-effort path so far is most likely
             | 
             | I agree that's most likely to happen, but the outcome is
             | horrible from consumer PoV.
             | 
              | Mesa is Linux-only, Rust is too hard to use for the vast
              | majority of developers (myself included), AMD will never
              | support older cards with ROCm, and we now have a third
              | discrete GPU vendor, Intel.
        
               | my123 wrote:
               | SPIR-V for OpenCL and for Vulkan are substantially
               | different, with the translation between the two being
               | quite non-trivial.
               | 
                | (note that rusticl + zink does deal with it _partially_
                | nowadays)
               | 
               | + Vulkan memory management doesn't expose unified address
               | space primitives
        
               | Const-me wrote:
                | Why would you want OpenCL? Pretty sure D3D11 compute
                | shaders are gonna be adequate for a Torch backend, and
                | they even work on Linux with Wine:
                | https://github.com/Const-me/Whisper/issues/42
                | Native Vulkan compute shaders would be even better.
                | 
                | Why would you want a unified address space? At least in
                | my experience, it's often too slow to be useful. DMA
                | transfers (CopyResource in D3D11, the copy command queue
                | in D3D12, the transfer queue in VK) are implemented by
                | dedicated hardware inside GPUs, and are way more
                | efficient.
        
           | einpoklum wrote:
           | > Typically, software developers only support a single GPGPU
           | API, and that API is nVidia CUDA.
           | 
            | There are very few NVIDIA CUDA cards in use outside of
            | x86_64 and maybe OpenPOWER architectures, so I disagree.
            | Also, OpenCL, despite being kind of a "betrayed standard",
            | enjoys quite a lot of popularity even on x86_64 (sometimes
            | even with NVIDIA hardware) - even if it is not as popular
            | there.
           | 
           | > AMD cards often outperform nVidia equivalents
           | 
           | Can you link to benchmarks or other analysis supporting this
           | claim? This has not been my impression in recent years,
           | though I don't routinely look at high-end AMD hardware.
           | 
           | > because nVidia prioritized ray tracing and DLSS over
           | compute power and memory bandwidth.
           | 
           | Has it really?
        
             | Const-me wrote:
             | > Can you link to benchmarks or other analysis supporting
             | this claim?
             | 
             | Current generation nVidia:
             | https://en.wikipedia.org/wiki/GeForce_40_series#Desktop
             | 
             | Current generation AMD:
             | https://en.wikipedia.org/wiki/Template:AMD_Radeon_RX_7000
             | 
              | The key performance characteristics are processing power
              | (TFlops) and memory bandwidth (GB/s).
              | 
              | The nVidia 4080, which costs $1200: 43 TFlops FP16, 43
              | TFlops FP32, 0.672 TFlops FP64, 717 GB/s memory bandwidth.
              | 
              | The AMD 7900 XTX, which costs $1000: 93 TFlops FP16, 47
              | TFlops FP32, 1.5 TFlops FP64, 960 GB/s memory bandwidth.
             | 
             | Note that for applications which bottleneck on FP16 compute
             | (many ML workloads) or FP64 compute (many traditional HPC
             | workloads: numerical solvers, fluid dynamics, etc), the
             | 7900 XTX even outperforms the 4090 which costs $1600.
        
       | kaycebasques wrote:
       | I wasn't familiar with VLIW. Sounds cool!
       | 
       | > Very long instruction word (VLIW) refers to instruction set
       | architectures designed to exploit instruction level parallelism
       | (ILP). Whereas conventional central processing units (CPU,
       | processor) mostly allow programs to specify instructions to
       | execute in sequence only, a VLIW processor allows programs to
       | explicitly specify instructions to execute in parallel. This
       | design is intended to allow higher performance without the
       | complexity inherent in some other designs.
       | 
       | > The traditional means to improve performance in processors
       | include dividing instructions into substeps so the instructions
       | can be executed partly at the same time (termed pipelining),
       | dispatching individual instructions to be executed independently,
       | in different parts of the processor (superscalar architectures),
       | and even executing instructions in an order different from the
       | program (out-of-order execution).[1] These methods all complicate
       | hardware (larger circuits, higher cost and energy use) because
       | the processor must make all of the decisions internally for these
       | methods to work.
       | 
       | https://en.wikipedia.org/wiki/Very_long_instruction_word
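        | 
        | A tiny illustration of the instruction-level parallelism the
        | quote describes (plain Python, used only to show the data
        | dependences; the comments describe what a VLIW compiler could
        | do at compile time, not anything Python itself does):
        | 
        |     def fma_pair(x, y, z, w):
        |         a = x * y  # independent of b: a VLIW compiler could
        |         b = z * w  # bundle both multiplies into one wide word
        |         c = a + b  # depends on a and b, so it must come later
        |         return c
        | 
        |     print(fma_pair(1.0, 2.0, 3.0, 4.0))  # 14.0
        | 
        | The schedule is decided statically by the compiler, which is
        | exactly the work an out-of-order CPU would otherwise do in
        | hardware.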
        
         | mpweiher wrote:
         | The most prominent example of VLIW processors was the Itanic,
         | er, Itanium.
         | 
         | It, er, didn't work out well.
         | 
         | Hence Itanic.
         | 
          | Their premise was that the compiler could figure out
          | dependencies statically well enough to pack multiple
          | sequential operations, and even some divergent execution
          | paths, into the same instruction word. It turned out the
          | compilers couldn't actually do this, so processors instead
          | figure out dependencies and parallelizable instructions
          | dynamically from a sequential instruction stream.
         | 
         | Which is a lot of work, a lot of chip resources and a lot of
         | energy. And only works up to a point, after which you hit
         | diminishing returns. Which is where we appear to be these days.
        
           | kimixa wrote:
            | VLIW is also still heavily used in DSPs and similar. And
            | arguably the dual-issue instructions in both AMD's and
            | Nvidia's latest GPUs are a step back towards that.
            | 
            | It feels like a lot of these things are cyclical - the core
            | ideas aren't new and phase in and out of favor as other
            | parts of the architecture design change around them,
            | increasing or decreasing the benefits accordingly.
        
             | gpderetta wrote:
              | It is not so much that it is cyclical as that VLIW is good
              | for numerical/DSP stuff with more predictable memory
              | access and terrible for general computation. Itanium was
              | OK at floating-point code, but terrible at typical
              | pointer-chasing loads.
        
         | yarg wrote:
         | Read up on SIMD in general.
         | 
         | (The means of processing as opposed to the language used to
         | dispatch commands.)
         | 
          | (And worth bearing in mind is the fact that terms such as
          | VLIW4 and VLIW5 refer to specific implementations.)
         | 
         | https://en.wikipedia.org/wiki/Single_instruction,_multiple_d...
        
           | gpderetta wrote:
           | simd != vliw though.
        
       | einpoklum wrote:
       | > Compute has been outpacing memory for decades. Like CPUs, GPUs
       | have countered this with increasingly sophisticated caching
       | strategies.
       | 
       | I'd say it's rather the contrary. Unlike CPUs, GPUs don't attempt
       | to directly counter this. By accepting higher latencies, they
       | have let themselves parallelize much more widely (or wildly)
       | relative to CPUs - and the high number of parallel pseudo-threads
       | provides a "latency hiding" effect.
       | 
        | This effect is illustrated, for example, in this presentation
        | on optimizing GPU code:
        | 
        | https://www.olcf.ornl.gov/wp-content/uploads/2019/12/03-CUDA...
        | 
        | See the (lame) animation on slide 11 and onwards.
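        | 
        | A back-of-the-envelope version of that latency-hiding argument
        | (the latency and issue-rate numbers are illustrative
        | assumptions, not figures from the linked slides):
        | 
        |     # Little's law: concurrency needed = latency x throughput.
        |     mem_latency_cycles = 400  # assumed DRAM round-trip latency
        |     issue_per_cycle = 4       # assumed issue rate per SM
        | 
        |     # Independent in-flight instructions needed so the SM never
        |     # stalls waiting on memory:
        |     print(mem_latency_cycles * issue_per_cycle)  # 1600
        | 
        | Hence the thousands of resident threads per GPU, rather than
        | the deep out-of-order machinery a CPU spends die area on.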
        
       | feanaro wrote:
       | A bit off topic, but when did "compute" become a noun? It is
       | extremely grating to my ears.
        
         | synergy20 wrote:
          | it's pretty common these days due to AI and GPU-like chips.
        
       ___________________________________________________________________
       (page generated 2023-12-17 23:00 UTC)