[HN Gopher] Writing high-performance matrix multiplication kernels for Blackwell
___________________________________________________________________
Writing high-performance matrix multiplication kernels for
Blackwell
Author : lairv
Score : 40 points
Date : 2025-10-02 15:43 UTC (4 days ago)
(HTM) web link (docs.jax.dev)
(TXT) w3m dump (docs.jax.dev)
| arjvik wrote:
| The interesting part is this is done in Pallas!
|
 | Seems like the Pallas of old has been completely overhauled
| reasonableklout wrote:
 | Pallas has a couple of backends; this is the new-ish Mosaic GPU
 | one. AIUI it provides a bunch of low-level APIs for interacting
 | directly with NVIDIA-specific and new Blackwell features like
 | SMEM, TMEM, collective MMA, etc.
|
| What's interesting is that the MGPU team has achieved SOTA
| Blackwell GEMM performance before Triton (which IIUC is trying
| to bring up Gluon to reach the same level). All the big players
| are coming up with their own block-based low-level-ish DSLs for
| CUDA: OpenAI, NVIDIA, and now Google.
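
For readers unfamiliar with Pallas: kernels are written as Python functions over memory references, then launched with `pl.pallas_call`. A minimal single-block matmul sketch follows; it is not the Blackwell kernel from the linked article, and `interpret=True` is used so it runs on CPU rather than lowering through the Mosaic GPU backend:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

# A minimal Pallas matmul kernel: read the full operand tiles from
# the input refs and write their product to the output ref.
def matmul_kernel(x_ref, y_ref, o_ref):
    o_ref[...] = jnp.dot(x_ref[...], y_ref[...])

def matmul(x, y):
    # interpret=True evaluates the kernel on CPU for portability; on
    # real hardware Pallas would lower it via a backend such as
    # Mosaic GPU (NVIDIA) or Mosaic (TPU).
    return pl.pallas_call(
        matmul_kernel,
        out_shape=jax.ShapeDtypeStruct((x.shape[0], y.shape[1]), x.dtype),
        interpret=True,
    )(x, y)

x = jnp.ones((64, 32), dtype=jnp.float32)
y = jnp.ones((32, 16), dtype=jnp.float32)
out = matmul(x, y)
```

Real high-performance kernels additionally tile the problem across a grid and use `BlockSpec`s to stage blocks through fast on-chip memory, which is where the Blackwell-specific features discussed above come in.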
___________________________________________________________________
(page generated 2025-10-06 23:00 UTC)