[HN Gopher] AMD's CDNA 3 Compute Architecture
___________________________________________________________________
AMD's CDNA 3 Compute Architecture
Author : ksec
Score : 118 points
Date : 2023-12-17 18:51 UTC (4 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| jdewerd wrote:
| > AMD diverged their GPU architecture development into separate
| CDNA and RDNA lines specialized for compute and graphics
| respectively.
|
| Ooooh, is that why the consumer cards don't do compute? Yikes, I
| was hoping that was just a bit of misguided segmentation but this
| sounds like a high level architecture problem, like an interstate
| without an on ramp. Oof.
| jacoblambda wrote:
| They added support for high-end Radeon cards 2 months ago. ROCm
| is "eventually" coming to RDNA in general, but it's a slow
| process, which is more or less in line with how AMD has
| approached ROCm from the start (targeting a very small subset
| of compute and slowly expanding it with each major version).
|
| https://www.tomshardware.com/news/amd-enables-rocm-and-pytor...
| mnau wrote:
| That's a good step forward. This separation is a massive
| barrier to entry. Making CUDA work reliably on every product
| was a masterstroke by NVIDIA.
| treprinum wrote:
| The issue is that RDNA and CDNA are different, so when an
| enthusiast writes fast RDNA code it doesn't mean it will work
| well on CDNA, and vice versa. Not sure why AMD had to go
| this route; only high-end pros will write software for CDNA,
| and they won't get any mindshare.
| JonChesterfield wrote:
| What do you have in mind for code that works noticeably
| better on one than the other?
| DarkmSparks wrote:
| I was quite surprised to find at least one of the Top500 is
| based on AMD GPUs. It seems to me that not bringing that to the
| consumer market was a conscious choice made some time ago.
| cogman10 wrote:
| Looks like the split was around 2016? Which could make sense
| given the state of crypto at the time. One problem that hit
| NVIDIA more than AMD was consumer cards getting vacuumed up by
| crypto farms. AMD making a conscious split effectively isolated
| their compute cards from their gamer cards.
|
| That being said, I can't imagine this is something that's been
| good for adoption of AMD cards for compute tasks. The wonderful
| thing about CUDA is that you don't need a special accelerator
| card to develop CUDA code.
| wmf wrote:
| You can't really isolate crypto mining. If a gaming card is
| profitable to mine with, miners will buy it and use the
| gaming drivers.
| Zardoz84 wrote:
| If I remember correctly, AMD GPU cards were selling very well
| to crypto miners. The split wasn't because of that. They did it
| because gaming and compute have different hardware optimization
| requirements, and that allowed them to put better and more
| competitive GPUs and compute cards on the market.
| dotnet00 wrote:
| The split had more to do with GCN being very compute-heavy
| and AMD's software completely failing to take advantage of
| that. GCN got transitioned into CDNA, and got heavily
| reworked for graphics as RDNA. This is why, while RDNA is
| nowhere near as good at compute-oriented features, it tends
| to outperform NVIDIA in pure rasterization.
|
| In contrast, current NVIDIA GPUs are very compute-heavy
| architectures (to the point that they often artificially
| limit certain aspects of their consumer cards besides FP64
| so as not to interfere with the workstation cards), and they
| take great advantage of it all with their various features
| combining software and hardware capabilities.
| ColonelPhantom wrote:
| > Ooooh, is that why the consumer cards don't do compute?
|
| No, not really. Sure, the consumer cards are based on a less
| suitable architecture, so they won't be as good, but it's not
| like it's not a general purpose GPU architecture anymore.
|
| The real problem is that ROCm has made the very weird technical
| decision of shipping GPU machine code rather than bytecode.
| Nvidia has its own bytecode, PTX, and Intel just uses SPIR-V
| for oneAPI.
|
| This means that supporting a GPU architecture in ROCm requires
| shipping binaries of all the libraries (such as rocBLAS,
| MIOpen, rocSPARSE, rocFFT) for every supported architecture.
|
| As the cherry on top, on paper one GPU family consists of
| different architectures! The RX 6700 XT identifies as a
| different 'architecture' (gfx1031) than the RX 6800 XT
| (gfx1030). These are 'close enough' that you can tell ROCm to
| use gfx1030 binaries on gfx1031 and it'll usually just work,
| but still.
|
| I feel like this is a large part of the reason AMD is so bad
| about supporting older GPUs in ROCm. It'd cost a bit of effort,
| sure, but I feel like the enormous packages you'd get are a
| bigger issue. (This is another factor that kills hobbyist ROCm,
| because anyone with a Pascal+ GPU can do CUDA, but ROCm support
| is usually limited to newer and higher end cards.)
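|
| For the curious, the override mentioned above is just an
| environment variable, HSA_OVERRIDE_GFX_VERSION. A rough
| sketch, assuming a ROCm build of PyTorch and a gfx1031 card
| such as the RX 6700 XT:
|
|     import os
|     # Must be set before the ROCm runtime initialises, i.e.
|     # before importing torch. "10.3.0" means "pretend to be
|     # gfx1030".
|     os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"
|
|     import torch
|     print(torch.cuda.is_available())      # ROCm GPUs appear
|     print(torch.cuda.get_device_name(0))  # via the CUDA API
|
| In practice most people just export the variable in their
| shell before launching whatever uses ROCm; the point is only
| that a one-line setting is enough to get gfx1030 binaries
| loaded on a gfx1031 card.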
| slavik81 wrote:
| > As the cherry on top, on paper one GPU family consists of
| different architectures! The RX 6700 XT identifies as a
| different 'architecture' (gfx1031) than the RX 6800 XT
| (gfx1030). These are 'close enough' that you can tell ROCm to
| use gfx1030 binaries on gfx1031 and it'll usually just work,
| but still.
|
| As far as I can tell, the gfx1030 and gfx1031 ISAs are
| identical in all but name. LLVM treats them exactly the same
| and all tests for the ROCm math libraries will pass when you
| run gfx1030 code on gfx1031 GPUs using the override.
|
| The Debian package for HIP has a patch so that if the user
| has a gfx1031 GPU and there are no gfx1031 code objects
| available, it will fall back to using gfx1030 code objects
| instead. When I proposed that patch upstream, it was rejected
| in favour of adding a new ISA that explicitly supports all
| the GPUs in a family. That solution is more complex and is
| taking longer to implement, but will more thoroughly solve
| that problem (as the solution Debian is using only works well
| when there's one ISA in the family that is a subset of all
| the others).
| beebeepka wrote:
| I thought trolling wasn't welcome on HN
| vegabook wrote:
| AMD has always been a terrible steward of ATI.
|
| They're fundamentally a hardware company (as per Lisa Su's CV)
| which didn't cotton on to the fact that CUDA was the killer. I
| remember how @Bridgman kept fighting a rearguard action on
| Phoronix, his job being to keep devs on board. A losing battle.
|
| I kinda understand it, because '80s/'90s-era hardware people
| intrinsically think hardware is the top dog in the stack, and
| that's where all of AMD's management comes from, including Su.
|
| Koduri understood that Nvidia was killing AMD because the
| consumer cards could run CUDA. So he pushed through, against
| Lisa Su, the Radeon VII, which until very recently, and for
| many years, was the only consumer card that ROCm supported. He
| was basically fired shortly thereafter, and the RVII, which was
| a fantastic card, was shut down sharpish. Then they brought in
| Wang, who crystallised the consumer/pro segmentation.
|
| Now they're furiously trying to backpedal, and it's too late.
| There are multiple would-be competitors, but basically the only
| one worth talking about is AAPL and Metal.
|
| AMD lost the window.
| simfree wrote:
| ATI on its own did not have a solid future; they could have
| easily ended up like their fellow Canadian companies Nortel
| or BlackBerry.
| vegabook wrote:
| "Dead or Canadian" yes?
| hylaride wrote:
| I live in Toronto and knew people who worked for ATI back
| in the 1990s-2000s. The management structure was _fucked_.
| In those days it was generally accepted that ATI could and
| did produce better hardware than NVIDIA, but ATI's drivers
| were horrible and only evened out near the end of the
| hardware's lifetime. By then it was far too late: all the
| benchmarks had long since been done on the buggy drivers,
| driving people to NVIDIA.
|
| I kid you not, but on the software side, ATI's reviews and
| incentives literally worked like this: QA got rewarded for
| finding bugs, and the software engineers were then rewarded
| for fixing them. You can imagine what happened next. The
| bitch of it is that it was just a symptom of the larger
| problem that the company as a whole didn't prioritize
| software. NVIDIA knew that the drivers and how efficiently
| software could use the hardware were just as important as,
| if not more important than, the hardware itself (the stories
| from the 1990s of how John Carmack used various hardware
| hacks are a good example of that).
| sva_ wrote:
| Well, the APU does; there are compute intrinsics and even a
| small guide by AMD on how to use them:
|
| https://gpuopen.com/learn/wmma_on_rdna3/
| Const-me wrote:
| > the consumer cards don't do compute?
|
| Typically, software developers only support a single GPGPU API,
| and that API is nVidia CUDA.
|
| Technically, consumer AMD cards are awesome at compute. For
| example, UE5 renders triangle meshes with compute instead of
| graphics https://www.youtube.com/watch?v=TMorJX3Nj6U Moreover,
| AMD cards often outperform nVidia equivalents because nVidia
| prioritized ray tracing and DLSS over compute power and memory
| bandwidth.
|
| The issue is, no tech company is interested in adding a D3D or
| Vulkan backend to AI libraries like PyTorch. nVidia is not
| doing it because they are happy with the status quo. Intel and
| AMD are not doing it because both hope to replace CUDA with
| their own proprietary equivalents rather than an open GPU API.
| boppo1 wrote:
| >Vulkan backend to AI libraries like PyTorch
|
| How many man hours is a project like this?
| Const-me wrote:
| A couple of times in the past I wanted to port open source ML
| models from CUDA/Python to a better technology stack. I have
| ported Whisper https://github.com/Const-me/Whisper/ and Mistral
| https://github.com/Const-me/Cgml/ to D3D11. I don't remember
| how much time I spent, but given that both were unpaid
| part-time hobby projects, probably under 160 hours each.
|
| These projects were great for validating the technology
| choices, but note that I only did the bare minimum to implement
| those specific ML models. Implementing a complete PyTorch
| backend is going to involve dramatically more work. I can't
| even estimate how much more, because I'm not an expert in
| Python or these Python-based ML libraries.
| GeekyBear wrote:
| > The issue is, no tech company is interested in adding D3D
| or Vulkan backend to AI libraries like PyTorch.
|
| AMD is interested in making PyTorch take advantage of their
| chips, and has done so.
|
| > we are delighted that the PyTorch 2.0 stable release
| includes support for AMD Instinct(tm) and Radeon(tm) GPUs
| that are supported by the ROCm(tm) software platform.
|
| When NVIDIA is charging 1000% profit margins for a GPU aimed
| at use as an AI accelerator, you can expect competitors to be
| willing to do what it takes to move into that market.
|
| https://www.tomshardware.com/news/nvidia-makes-1000-profit-o...
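|
| For anyone wondering what that looks like in practice: on a
| ROCm build, PyTorch exposes AMD GPUs through the same "cuda"
| device API, so most existing code runs unchanged. A minimal
| sketch, assuming a ROCm build of PyTorch 2.0+ and a supported
| AMD GPU:
|
|     import torch
|
|     print(torch.version.hip)          # non-None on ROCm builds
|     print(torch.cuda.is_available())  # True with a supported GPU
|
|     # "cuda" maps to the ROCm device on these builds.
|     x = torch.randn(1024, 1024, device="cuda")
|     y = x @ x
|     print(y.device)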
| zozbot234 wrote:
| Vulkan Compute backends for numerical compute (as typified by
| both OpenCL and SYCL) are challenging; you can look at the
| clspv project https://github.com/google/clspv for the
| nitty-gritty details. The lowest-effort path so far is most
| likely via some combination of ROCm/HIP (for hardware that AMD
| bothers to support themselves) and the Mesa project's RustiCL
| backend (for everything else).
| Const-me wrote:
| > Vulkan Compute backends for numerical compute (as
| typified by both OpenCL and SYCL) are challenging
|
| Microsoft has an offline compiler, dxc.exe, which compiles
| HLSL to SPIR-V. Also, DXVK has a JIT compiler which recompiles
| DXBC bytecode to SPIR-V. Both technologies are old, stable and
| reliable; for example, DXVK's JIT compiler is a critical
| software component of the Steam Deck console.
|
| > The lowest-effort path so far is most likely
|
| I agree that's most likely to happen, but the outcome is
| horrible from a consumer PoV.
|
| Mesa is Linux-only, Rust is too hard to use for the vast
| majority of developers (myself included), AMD will never
| support older cards with ROCm, and we now have a third
| discrete GPU vendor, Intel.
| my123 wrote:
| SPIR-V for OpenCL and SPIR-V for Vulkan are substantially
| different, with the translation between the two being
| quite non-trivial.
|
| (note that rusticl + zink does deal with it _partially_
| nowadays)
|
| Plus, Vulkan memory management doesn't expose unified address
| space primitives.
| Const-me wrote:
| Why would you want OpenCL? Pretty sure D3D11 compute shaders
| are going to be adequate for a Torch backend, and they even
| work on Linux with Wine:
| https://github.com/Const-me/Whisper/issues/42 Native Vulkan
| compute shaders would be even better.
|
| Why would you want a unified address space? At least in my
| experience, it's often too slow to be useful. DMA transfers
| (CopyResource in D3D11, the copy command queue in D3D12, the
| transfer queue in Vulkan) are implemented by dedicated
| hardware inside GPUs, and are way more efficient.
| einpoklum wrote:
| > Typically, software developers only support a single GPGPU
| API, and that API is nVidia CUDA.
|
| There are very few NVIDIA/CUDA cards off of x86_64 and maybe
| OpenPOWER architectures. So I disagree. Also, OpenCL, despite
| being kind of a "betrayed standard", enjoys quite a lot of
| popularity even on x86_64 (sometimes even with NVIDIA
| hardware) - even if it is not as popular there.
|
| > AMD cards often outperform nVidia equivalents
|
| Can you link to benchmarks or other analysis supporting this
| claim? This has not been my impression in recent years,
| though I don't routinely look at high-end AMD hardware.
|
| > because nVidia prioritized ray tracing and DLSS over
| compute power and memory bandwidth.
|
| Has it really?
| Const-me wrote:
| > Can you link to benchmarks or other analysis supporting
| this claim?
|
| Current generation nVidia:
| https://en.wikipedia.org/wiki/GeForce_40_series#Desktop
|
| Current generation AMD:
| https://en.wikipedia.org/wiki/Template:AMD_Radeon_RX_7000
|
| The key performance characteristics are processing power
| (TFLOPS) and memory bandwidth (GB/s).
|
| nVidia 4080, which costs $1200: 43 TFLOPS FP16, 43 TFLOPS
| FP32, 0.672 TFLOPS FP64, 717 GB/s memory bandwidth.
|
| AMD 7900 XTX, which costs $1000: 93 TFLOPS FP16, 47 TFLOPS
| FP32, 1.5 TFLOPS FP64, 960 GB/s memory bandwidth.
|
| Note that for applications which bottleneck on FP16 compute
| (many ML workloads) or FP64 compute (many traditional HPC
| workloads: numerical solvers, fluid dynamics, etc), the
| 7900 XTX even outperforms the 4090 which costs $1600.
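|
| A quick back-of-the-envelope way to see why the bandwidth
| figures matter, using just the spec numbers quoted above (pure
| arithmetic, not a benchmark):
|
|     # FLOP-per-byte "balance point": below this arithmetic
|     # intensity a kernel is memory-bound, above it compute-bound.
|     cards = {
|         # name: (FP16 TFLOPS, FP64 TFLOPS, memory GB/s)
|         "RTX 4080":    (43.0, 0.672, 717.0),
|         "RX 7900 XTX": (93.0, 1.5,   960.0),
|     }
|     for name, (fp16, fp64, bw) in cards.items():
|         fp16_bp = fp16 * 1e12 / (bw * 1e9)  # FLOP/byte, FP16
|         fp64_bp = fp64 * 1e12 / (bw * 1e9)  # FLOP/byte, FP64
|         print(f"{name}: {fp16_bp:.0f} (FP16), {fp64_bp:.2f} (FP64)")
|
| Workloads whose arithmetic intensity falls below those balance
| points (a lot of memory-bound ML inference, for instance) are
| limited by the 717 vs 960 GB/s figures rather than by the
| TFLOPS.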
| kaycebasques wrote:
| I wasn't familiar with VLIW. Sounds cool!
|
| > Very long instruction word (VLIW) refers to instruction set
| architectures designed to exploit instruction level parallelism
| (ILP). Whereas conventional central processing units (CPU,
| processor) mostly allow programs to specify instructions to
| execute in sequence only, a VLIW processor allows programs to
| explicitly specify instructions to execute in parallel. This
| design is intended to allow higher performance without the
| complexity inherent in some other designs.
|
| > The traditional means to improve performance in processors
| include dividing instructions into substeps so the instructions
| can be executed partly at the same time (termed pipelining),
| dispatching individual instructions to be executed independently,
| in different parts of the processor (superscalar architectures),
| and even executing instructions in an order different from the
| program (out-of-order execution).[1] These methods all complicate
| hardware (larger circuits, higher cost and energy use) because
| the processor must make all of the decisions internally for these
| methods to work.
|
| https://en.wikipedia.org/wiki/Very_long_instruction_word
| mpweiher wrote:
| The most prominent example of VLIW processors was the Itanic,
| er, Itanium.
|
| It, er, didn't work out well.
|
| Hence Itanic.
|
| Their premise was that the compiler could figure out
| dependencies statically well enough that it could pack
| multiple sequential (and some divergent) execution paths into
| the same instruction. It turned out the compilers couldn't
| actually do this, so processors instead figure out
| dependencies and parallelizable instructions dynamically from
| a sequential instruction stream.
|
| Which is a lot of work, a lot of chip resources and a lot of
| energy. And only works up to a point, after which you hit
| diminishing returns. Which is where we appear to be these days.
| kimixa wrote:
| VLIW is also still heavily used in DSPs and similar. And
| arguably the dual-issue instructions in both AMD's and
| Nvidia's latest GPUs are a step back towards that.
|
| It feels like a lot of these things are cyclical - the core
| ideas aren't new and phase in and out of favor as other parts
| of the architecture design change around them and increase or
| decrease the benefits accordingly.
| gpderetta wrote:
| It is not so much that it is cyclical as that VLIW is good
| for numerical/DSP stuff with more predictable memory access
| and terrible for general computation. Itanium was OK at
| floating point code, but terrible at typical pointer-chasing
| loads.
| yarg wrote:
| Read up on SIMD in general.
|
| (The means of processing as opposed to the language used to
| dispatch commands.)
|
| (And worth bearing in mind is the fact that terms such as VLIW4
| and VLIW5 refer to specific implementations.)
|
| https://en.wikipedia.org/wiki/Single_instruction,_multiple_d...
| gpderetta wrote:
| SIMD != VLIW though.
| einpoklum wrote:
| > Compute has been outpacing memory for decades. Like CPUs, GPUs
| have countered this with increasingly sophisticated caching
| strategies.
|
| I'd say it's rather the contrary. Unlike CPUs, GPUs don't attempt
| to directly counter this. By accepting higher latencies, they
| have let themselves parallelize much more widely (or wildly)
| relative to CPUs - and the high number of parallel pseudo-threads
| provides a "latency hiding" effect.
|
| This effect is illustrated, for example, in this presentation
| on optimizing GPU code:
|
| https://www.olcf.ornl.gov/wp-content/uploads/2019/12/03-CUDA...
|
| (lame) animation on slide 11 and onwards.
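|
| The arithmetic behind latency hiding is basically Little's
| law: you need roughly (latency x issue rate) independent
| requests in flight to keep the memory pipeline busy. A toy
| sketch with made-up, purely illustrative numbers (not figures
| for any particular GPU):
|
|     # Assumed, illustrative values only.
|     mem_latency_cycles = 400  # global memory latency
|     requests_per_cycle = 4    # requests a core can issue per cycle
|
|     in_flight = mem_latency_cycles * requests_per_cycle
|     print(in_flight, "independent requests needed in flight")
|
|     # If each warp/wavefront keeps one request outstanding, that
|     # is also roughly how many resident warps the scheduler must
|     # switch between to hide the latency.
|     print(in_flight, "resident warps to hide the latency")
|
| CPUs instead spend area and energy on caches and out-of-order
| machinery to shorten or overlap that latency for a handful of
| threads; GPUs just keep thousands of threads resident and
| switch between them.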
| feanaro wrote:
| A bit off topic, but when did "compute" become a noun? It is
| extremely grating to my ears.
| synergy20 wrote:
| It's pretty common these days due to AI and GPU-like chips.
___________________________________________________________________
(page generated 2023-12-17 23:00 UTC)