[HN Gopher] Writing high-performance matrix multiplication kernels
       for Blackwell
       ___________________________________________________________________
        
       Writing high-performance matrix multiplication kernels for
       Blackwell
        
       Author : lairv
       Score  : 40 points
       Date   : 2025-10-02 15:43 UTC (4 days ago)
        
 (HTM) web link (docs.jax.dev)
 (TXT) w3m dump (docs.jax.dev)
        
       | arjvik wrote:
       | The interesting part is this is done in Pallas!
       | 
       | Seems like the Pallas of old has been completely upgraded
        
         | reasonableklout wrote:
         | Pallas has a couple of backends; this is the new-ish Mosaic
         | GPU one. AIUI it provides a bunch of low-level APIs for
         | interacting directly with NVIDIA-specific and new Blackwell
         | features like SMEM, TMEM, collective MMA, etc.
         | 
         | What's interesting is that the MGPU team has achieved SOTA
         | Blackwell GEMM performance before Triton (which IIUC is
         | trying to bring up Gluon to reach the same level). All the
         | big players are now shipping their own block-based, low-
         | level-ish DSLs for CUDA: OpenAI, NVIDIA, and now Google.
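         | 
         | For readers unfamiliar with the programming model the
         | comments describe: Pallas kernels are Python functions that
         | read and write Ref arguments, wrapped by pl.pallas_call.
         | A minimal sketch (a generic elementwise kernel run in
         | interpret mode so it works without a GPU; the Blackwell-
         | specific SMEM/TMEM/MMA APIs in the linked article are not
         | shown here):

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_one_kernel(x_ref, o_ref):
    # Refs are read/written with array-style indexing inside the kernel.
    o_ref[...] = x_ref[...] + 1.0

def add_one(x):
    # interpret=True runs the kernel in a pure-Python emulation,
    # so this sketch is portable; real Mosaic GPU kernels compile
    # for the target hardware instead.
    return pl.pallas_call(
        add_one_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=True,
    )(x)

x = jnp.arange(8, dtype=jnp.float32)
print(add_one(x))  # elementwise x + 1
```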
        
       ___________________________________________________________________
       (page generated 2025-10-06 23:00 UTC)