[HN Gopher] AI's compute fragmentation: what matrix multiplicati...
___________________________________________________________________
AI's compute fragmentation: what matrix multiplication teaches us
Author : tzhenghao
Score : 46 points
Date : 2023-03-23 18:34 UTC (4 hours ago)
(HTM) web link (www.modular.com)
(TXT) w3m dump (www.modular.com)
| BenoitP wrote:
| There's hope in intermediate representations, in OpenXLA:
|
| https://opensource.googleblog.com/2023/03/openxla-is-ready-t...
|
| > OpenXLA is an open source ML compiler ecosystem co-developed by
| AI/ML industry leaders including Alibaba, Amazon Web Services,
| AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face,
| Intel, Meta, and NVIDIA. It enables developers to compile and
| optimize models from all leading ML frameworks for efficient
| training and serving on a wide variety of hardware
| junrushao1994 wrote:
| One thing I really love about XLA is GSPMD, which effectively
| enables scalable distributed training in practice. I was quite
| curious how it relates to matrix multiplication, though, given
| that XLA focuses more on graph-level optimization and basically
| offloads matmuls to other libraries like Triton and cuBLAS.
| photochemsyn wrote:
| > "Think about it: how can a small number of specialized experts,
| who hand write and tune assembly code, possibly scale their work
| to all the different configurations while also incorporating
| their work into all the AI frameworks?! It's simply an impossible
| task."
|
| Naively, I wonder if this is the kind of problem that AI itself
| can solve, which is a rather singularity-approaching concept.
| Maybe there's too much logic involved and not enough training
| data on different configurations for that to work? The thought of
| self-bootstrapping AI is a bit spooky, though.
| dimatura wrote:
| There has been work on using AI for this at various levels - at
| the neural architecture level (finding neural architectures with
| high throughput / low latency for a given hardware target), at
| the algorithm level (finding faster matrix multiplication
| routines), and at the hardware level (iirc Google stated that the
| latest version of its TPUs was partially designed with AI).
| bigbillheck wrote:
| This is the kind of problem AI's been solving for 25 years and
| more: https://www.fftw.org
| bigbillheck wrote:
| Surely one solution is for each AI framework to understand the
| operating environment itself and choose the best implementation
| at run-time, much like they currently do.
| kickingvegas wrote:
| Off topic, but related.
| https://mastodon.social/@mcc/110024854706734967
| adamnemecek wrote:
| No, they compute spectra.
| junrushao1994 wrote:
| My take: optimizing matrix multiplication is not hard on modern
| architectures if you have the right abstraction. The code itself
| may be fragmented across different programming models, true, but
| the underlying techniques are not hard for a 2nd/3rd-year
| undergrad to understand. There are only a few important ones on
| GPU: loop tiling, pipelining, shared memory swizzle, memory
| coalescing. A properly designed compiler can allow developers to
| optimize matmuls within 100 lines of code.
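|
| For illustration, a minimal sketch of what loop tiling, shared
| memory and coalesced loads look like in plain fp32 CUDA (the
| textbook version, assuming square matrices whose size is a
| multiple of the tile; no tensor cores, no swizzling, no
| pipelining):
|
|   #define TILE 32
|
|   // C = A * B for N x N matrices, N assumed to be a multiple of
|   // TILE. Each block computes one TILE x TILE tile of C, staging
|   // tiles of A and B through shared memory (loop tiling) with
|   // consecutive threads reading consecutive addresses (memory
|   // coalescing).
|   __global__ void matmul_tiled(const float* A, const float* B,
|                                float* C, int N) {
|       __shared__ float As[TILE][TILE];
|       __shared__ float Bs[TILE][TILE];
|
|       int row = blockIdx.y * TILE + threadIdx.y;
|       int col = blockIdx.x * TILE + threadIdx.x;
|       float acc = 0.0f;
|
|       for (int k0 = 0; k0 < N; k0 += TILE) {
|           As[threadIdx.y][threadIdx.x] = A[row * N + k0 + threadIdx.x];
|           Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
|           __syncthreads();
|           for (int k = 0; k < TILE; ++k)
|               acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
|           __syncthreads();
|       }
|       C[row * N + col] = acc;
|   }
|
|   // launch: dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
|   // matmul_tiled<<<grid, block>>>(dA, dB, dC, N);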
| touisteur wrote:
| Looking at the effort plunked into things like cutlass, which
| still doesn't reach cuBLAS perf (which very few can beat - in the
| places where cuBLAS shines! which are... not that many...), and
| even cuDNN still eking out single-digit improvements regularly,
| I'd say this is probably harder than that. At least if you're
| reaching for >50% use of the 37 TFLOPS of an A40. If you're fine
| throwing more GPUs at the problem, sure.
|
| Edit: I mean, when you still see papers every year with large
| improvements in perf, and things like 'we used tensor cores and
| managed to get back fp32 accuracy with 3 rounds of the things' -
| what? - I can attest it doesn't take 2 weeks to get that kind of
| result. And that's just getting started on tensor cores! And when
| someone on the NVIDIA forums says 'nah, probably no improvement
| from using tensor cores for FFT' and you get a link to a paper
| with a significant improvement in perf using tensor cores, I'd
| say we're just starting.
| junrushao1994 wrote:
| This is definitely a great point! In the context of AI workloads,
| where the critical matmuls basically have large, regular shapes,
| are there many cases where cutlass/Triton are worse than cuBLAS,
| to the point that we need to throw more GPUs at it?
| touisteur wrote:
| cuBLAS is very often too heavy (too much overhead, memory
| movement to fit the API, not optimized for small batches of small
| matrices), and you can get huge improvements by chaining
| cuDNN/cutlass/autotuned kernels. Especially if you're still on
| GDDR6, every data movement is a killer, so if you can put it all
| together and never go back to global memory, you get amazing
| improvements. And this is without tensor cores. Programming those
| by hand is a pain, so here enters cutlass...
| junrushao1994 wrote:
| Yeah, cuBLAS is definitely not perfect in many cases :-((
|
| Speaking of the GEMM fusion you mentioned, flash attention is
| basically GEMM fusion with online softmax, right? This is
| something I believe is really cool and can be made really easy
| with a proper abstraction. Say, you could move a chunk of
| computation under a certain loop and instruct the compiler to
| optimize the data movement or cache intermediate tiles somewhere
| on chip.
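|
| The online-softmax part itself is tiny, for what it's worth. A
| rough single-row sketch of the streaming recurrence that lets you
| fuse softmax(Q K^T) V without ever materializing the full
| attention row (plain host-side C++ just to show the math; the
| real kernels do this per tile in registers/shared memory):
|
|   #include <algorithm>
|   #include <cmath>
|   #include <vector>
|
|   // One attention row, streamed: out = softmax(s) * V, where
|   // s[j] is the score for key j and V is n_keys x d. We never
|   // store softmax(s); we keep a running max m, a running
|   // denominator l, and an output accumulator that gets rescaled
|   // whenever the running max changes.
|   std::vector<float> attention_row_online(
|       const std::vector<float>& s,
|       const std::vector<std::vector<float>>& V, int d) {
|       float m = -INFINITY;  // running max of scores seen so far
|       float l = 0.0f;       // running sum of exp(s[j] - m)
|       std::vector<float> acc(d, 0.0f);
|
|       for (size_t j = 0; j < s.size(); ++j) {
|           float m_new = std::max(m, s[j]);
|           float rescale = std::exp(m - m_new);  // fix old state
|           float p = std::exp(s[j] - m_new);     // new key weight
|           l = l * rescale + p;
|           for (int k = 0; k < d; ++k)
|               acc[k] = acc[k] * rescale + p * V[j][k];
|           m = m_new;
|       }
|       for (int k = 0; k < d; ++k)
|           acc[k] /= l;  // final normalization
|       return acc;
|   }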
| touisteur wrote:
| There's something of this in cutlass with prologues and
| epilogues, and in the 'backend mode' of cuDNN, but overall,
| breaking out of the 'cuBLAS takes your whole device and tries to
| saturate it for this one matmul' model is going to require a
| whole lot of abstraction work.
|
| Cutlass is supposed to be the first step, and to anyone who
| struggles to understand WTF you're doing when using it: you are
| not alone. I've seen literally amazing, room-silencing stuff with
| it, but heavy template stuff is really not my thing.
| junrushao1994 wrote:
| > we used tensor cores and managed to get back fp32 accuracy
| with 3 rounds of the things
|
| Hey, are you referring to 3xTF32
| (https://github.com/NVIDIA/cutlass/tree/master/examples/28_am...)?
| IMO this is a perfect example where a proper abstraction could
| save engineers a non-trivial amount of time - imagine a compiler
| stack that allows 3xTF32 as a normal dtype, with the subsequent
| analyses compatible with this special dtype :-)
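|
| For anyone wondering, the trick itself fits in a few lines: round
| each fp32 operand to a TF32-sized 'big' part plus a residual, and
| three TF32 products recover most of the fp32 product. A rough
| host-side emulation of the idea (the bit mask stands in for the
| hardware's TF32 rounding; the real thing is the cutlass example
| above, running on tensor cores):
|
|   #include <cstdint>
|   #include <cstring>
|   #include <cstdio>
|
|   // Emulate rounding an fp32 value to TF32 precision (10
|   // explicit mantissa bits) by zeroing the 13 low mantissa bits.
|   // The GPU rounds to nearest; truncation is close enough here.
|   static float to_tf32(float x) {
|       uint32_t u;
|       std::memcpy(&u, &x, sizeof(u));
|       u &= 0xFFFFE000u;
|       std::memcpy(&x, &u, sizeof(x));
|       return x;
|   }
|
|   // "1xTF32": one product of the rounded values.
|   static float mul_1xtf32(float a, float b) {
|       return to_tf32(a) * to_tf32(b);
|   }
|
|   // "3xTF32": split a = a_big + a_small, b = b_big + b_small and
|   // accumulate the three large cross terms (a_small * b_small is
|   // dropped), recovering most of the fp32 accuracy from
|   // TF32-precision multiplies.
|   static float mul_3xtf32(float a, float b) {
|       float a_big = to_tf32(a), a_small = to_tf32(a - a_big);
|       float b_big = to_tf32(b), b_small = to_tf32(b - b_big);
|       return a_big * b_big + a_big * b_small + a_small * b_big;
|   }
|
|   int main() {
|       float a = 1.2345678f, b = 7.6543210f;
|       std::printf("fp32  : %.9g\n", a * b);
|       std::printf("1xTF32: %.9g\n", mul_1xtf32(a, b));
|       std::printf("3xTF32: %.9g\n", mul_3xtf32(a, b));
|   }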
| mathisfun123 wrote:
| > A properly designed compiler can allow developers to optimize
| matmuls within 100 lines of code.
|
| man this is such a funny closing comment - what exactly do you
| think is involved in designing a compiler that enables devs to
| optimize matmuls, if not thousands of person-hours/years/etc of
| very "fine-grained" perf research? what the "abstraction" people
| don't understand (because they only deal in abstractions) is that
| achieving performance involves literally the antithesis of
| abstraction - you need to understand your hardware down to the
| gate level (sometimes).
|
| > loop tiling, pipelining, shared memory swizzle, memory
| coalescing
|
| have you ever applied any of these? the only way you could apply
| them as a generic algo (without consideration of your particular
| hardware) is with a tuner; this is of course the route widely
| taken, but that's not an "understanding" of anything except guess
| and check.
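|
| and concretely, the guess-and-check route is about this deep: a
| brute-force tuner that sweeps one knob (here just the block
| shape) over a naive kernel and keeps whatever happened to be
| fastest on this particular GPU. real tuners sweep tile sizes,
| stage counts, vector widths, etc., but it's the same idea:
|
|   #include <cstdio>
|   #include <cuda_runtime.h>
|
|   __global__ void matmul_naive(const float* A, const float* B,
|                                float* C, int N) {
|       int row = blockIdx.y * blockDim.y + threadIdx.y;
|       int col = blockIdx.x * blockDim.x + threadIdx.x;
|       if (row < N && col < N) {
|           float acc = 0.0f;
|           for (int k = 0; k < N; ++k)
|               acc += A[row * N + k] * B[k * N + col];
|           C[row * N + col] = acc;
|       }
|   }
|
|   int main() {
|       const int N = 1024;
|       const size_t bytes = size_t(N) * N * sizeof(float);
|       float *A, *B, *C;
|       cudaMalloc(&A, bytes); cudaMemset(A, 0, bytes);
|       cudaMalloc(&B, bytes); cudaMemset(B, 0, bytes);
|       cudaMalloc(&C, bytes);
|
|       // "guess and check": time each candidate, keep the winner
|       int cand[][2] = {{32, 1}, {32, 8}, {16, 16}, {32, 32}};
|       float best_ms = 1e30f; int best = 0;
|       for (int i = 0; i < 4; ++i) {
|           dim3 block(cand[i][0], cand[i][1]);
|           dim3 grid((N + block.x - 1) / block.x,
|                     (N + block.y - 1) / block.y);
|           cudaEvent_t t0, t1;
|           cudaEventCreate(&t0); cudaEventCreate(&t1);
|           matmul_naive<<<grid, block>>>(A, B, C, N);  // warm-up
|           cudaEventRecord(t0);
|           matmul_naive<<<grid, block>>>(A, B, C, N);
|           cudaEventRecord(t1);
|           cudaEventSynchronize(t1);
|           float ms = 0.0f;
|           cudaEventElapsedTime(&ms, t0, t1);
|           std::printf("block %ux%u: %.3f ms\n", block.x, block.y, ms);
|           if (ms < best_ms) { best_ms = ms; best = i; }
|           cudaEventDestroy(t0); cudaEventDestroy(t1);
|       }
|       std::printf("picked %dx%d\n", cand[best][0], cand[best][1]);
|       cudaFree(A); cudaFree(B); cudaFree(C);
|   }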
| brucethemoose2 wrote:
| Yeah well tell all that to Nvidia, who very much likes the
| fragmentation and wants to keep things that way.
| dekhn wrote:
| They are the one vendor that had the insight ~20 years ago to
| invest long-term in GPUs, and they have continuously made
| impressive products while supporting a cross-platform developer
| base. For this, I reward them with my $$$ (both work and home).
| misnome wrote:
| And they developed this fragmentation by... building good
| tools, good documentation, and comprehensively supporting them
| for 15 years in a way that makes people feel safe building on
| top of them.
|
| It's not fragmentation, they built a moat.
| touisteur wrote:
| And with their actual understanding of the hardware limitations
| of GPUs (memory bandwidth), the parallel work on things like
| cutlass (if there was ever an unportable thing :-), the coming
| *Dx libraries (the explosion of cuBLAS/Solver/FFT to allow kernel
| fusion and new in-kernel linear algebra shenanigans), and the
| slow but steady introduction of sparsity everywhere, I can't see
| how anyone can do anything but play catch-up.
| spookie wrote:
| It's not like other vendors have put meaningful effort into
| alternatives. AMD still hasn't released RDNA3 support for ROCm,
| their open compute platform. Hell, I don't even think RDNA2 has
| proper support as of now.
|
| There's also the issue of poor documentation and learning
| material in the wild.
| turmeric_root wrote:
| yeah, when getting DL up and running on AMD requires a datacentre
| card, it's no wonder CUDA is more popular. AMD is enabling ROCm
| for consumer GPUs now, but it's still a pain to get up and
| running, on top of the inertia that CUDA has.
| version_five wrote:
| A cool mission
___________________________________________________________________
(page generated 2023-03-23 23:00 UTC)