[HN Gopher] Matrix Core Programming on AMD GPUs
       ___________________________________________________________________
        
       Matrix Core Programming on AMD GPUs
        
       Author : skidrow
       Score  : 106 points
       Date   : 2025-10-04 21:22 UTC (1 day ago)
        
 (HTM) web link (salykova.github.io)
 (TXT) w3m dump (salykova.github.io)
        
       | gleenn wrote:
       | Glad to see more articles out there using AMD hardware
       | acceleration, especially for matrix math. More diversity in this
       | space is welcome.
        
         | latchkey wrote:
         | Many people have been asking them for this sort of content, and
         | it is happening. Couldn't be more excited. Also note that it is
         | AMD, but not officially AMD: it is being published in the open
         | on an individual's GitHub.
        
       | imtringued wrote:
       | Whenever I see code like this, I start to think that GPUs are
       | uniquely unsuited for matrix multiplication.
       | 
       | You're pretending that each streaming multiprocessor runs
       | independent threads, when in reality you're feeding a unit that
       | exists only once or twice per SM. It's like independently
       | controlling one out of 32 cars on a 32-lane highway where the
       | cars aren't allowed to switch lanes, with the controls of one car
       | replicated to all the others, when in reality everyone is sitting
       | in the same bus.
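        
         To make the "same bus" point concrete: a tensor/matrix core op is
         warp-collective. In CUDA's wmma API, all 32 threads of a warp
         jointly issue one matrix-multiply-accumulate; no single thread
         owns the operation. A minimal sketch (the kernel name and the
         16x16x16 tile shape are just illustrative):
        
             #include <cuda_fp16.h>
             #include <mma.h>
             using namespace nvcuda;
        
             // Each warp computes one 16x16 tile: C += A * B.
             __global__ void tile_mma(const half *a, const half *b,
                                      float *c) {
                 // Fragments are spread across the warp's registers; no
                 // single thread holds a whole tile.
                 wmma::fragment<wmma::matrix_a, 16, 16, 16, half,
                                wmma::row_major> fa;
                 wmma::fragment<wmma::matrix_b, 16, 16, 16, half,
                                wmma::col_major> fb;
                 wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        
                 wmma::fill_fragment(fc, 0.0f);
                 wmma::load_matrix_sync(fa, a, 16); // warp-wide load
                 wmma::load_matrix_sync(fb, b, 16);
                 wmma::mma_sync(fc, fa, fb, fc);    // one op, 32 lanes
                 wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
             }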
        
         | MaxBarraclough wrote:
         | I'm not sure I follow. Matrix multiplication isn't inherently
         | 'branchy' in a way that we would expect to cause inefficient
         | execution on SIMT (i.e. branch divergence).
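        
           For reference, branch divergence is what happens when lanes of
           a warp take different paths: the hardware runs both sides
           serially with inactive lanes masked off. A minimal CUDA sketch
           (the kernel name is illustrative):
        
               __global__ void divergent(float *x) {
                   int i = threadIdx.x;
                   if (i % 2 == 0)
                       x[i] *= 2.0f;  // even lanes active, odd masked
                   else
                       x[i] += 1.0f;  // then odd active, even masked
               }
        
           A dense matrix multiply has no such data-dependent branching:
           every lane follows the same path, so SIMT runs at full width.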
        
           | touisteur wrote:
           | I think the remark is more that Tensor Cores (or Matrix Cores
           | in AMD lingo) are distributed per SM (rather than sitting
           | apart on an interconnect, individually programmable), so on
           | the same SM you have your classical warps (CUDA cores) AND
           | the tensor units, and switching between one and the other
           | might be confusing.
           | 
           | My vision of SMs has always been "assume AVX512 is the
           | default ISA", with "tensor cores as another layer alongside
           | this" (kind of like AMX), so you have this heterogeneous
           | "thing" to program. Don't know if it helps. The CUDA
           | programming model hides a lot, and looking at PTX code in
           | nsight-compute is most enlightening.
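        
             One way to see what the model hides, without Nsight Compute,
             is to ask nvcc for PTX directly (file name and arch are
             placeholders):
        
                 nvcc -arch=sm_80 -ptx kernel.cu -o kernel.ptx
        
             The wmma:: calls from the sketch above lower to warp-wide PTX
             instructions (wmma.load.*, wmma.mma.sync.*, wmma.store.*):
             one instruction issued by the whole warp, not 32 independent
             threads.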
        
       ___________________________________________________________________
       (page generated 2025-10-05 23:01 UTC)