[HN Gopher] Show HN: Flash Attention in ~100 lines of CUDA
___________________________________________________________________
Show HN: Flash Attention in ~100 lines of CUDA
Author : tspeterkim
Score : 146 points
Date : 2024-03-16 15:31 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| treesciencebot wrote:
| Pretty neat implementation. In general, for these sorts of
| exercises (and even if the intention is to go to prod with custom
| kernels) I lean towards Triton to write the kernels themselves.
| It is much easier to integrate into the toolchain, and it allows
| a level of abstraction that doesn't affect performance even a
| little bit while providing useful constructs.
| ixaxaar wrote:
| You mean triton the inference server or triton the DSL for
| cuda?
| whimsicalism wrote:
| they mean the dsl (not necessarily just for cuda)
| p1esk wrote:
| The DSL: https://openai.com/research/triton
| treesciencebot wrote:
| triton the DSL.
| whimsicalism wrote:
| yeah even the official flashattention is moving many
| implementations from cutlass to triton except for the main mha
| backward/forward pass
| jart wrote:
| It was written with cutlass? No wonder Peter Kim found it
| valuable and worthwhile to de-obfuscate. Adopting a new
| programming language invented by OpenAI doesn't sound like a
| much better alternative. I'd be shocked if either of them
| were able to build code for AMD GPUs, where it's easy to
| adapt CUDA code, but not if it's buried in tens of thousands
| of lines of frameworks. I like open source code to have
| clarity so I can optimize it for my own production
| environment myself. When people distribute code they've
| productionized for themselves, it squeezes out all the alpha
| and informational value. Just because something's open source
| doesn't mean it's open source. I think people mostly do it to
| lick the cookie without giving much away.
| fpgamlirfanboy wrote:
| > allows a level of abstraction that doesn't affect performance
| even a little bit
|
| The second part of this sentence is true because the first part
| is false.
| treesciencebot wrote:
| zero-cost abstractions exist. that doesn't mean all abstractions
| are zero-cost, or that being zero-cost somehow invalidates their
| abstractness/genericness. but maybe we differ on the definition
| of abstractions.
| araes wrote:
| For those who have no idea what's being discussed, quick
| background.
|
| Discussing: Transformer [1] memory issues and approximate
| attention [2] in machine learning training.
|
| Specifically: FlashAttention: Fast and Memory-Efficient Exact
| Attention with IO-Awareness. [3]
|
| As a side comment, this entire industry is sorely in need of
| intros, at the very least. The entire space has moved so fast in
| the last year that I need an entirely new dictionary and
| thesaurus for all the terms they've created. Notably, because of
| this, I found out Google has a glossary of machine learning
| terms. It's actually somewhat handy.
|
| [1] Google Machine Learning Glossary (Transformer):
| https://developers.google.com/machine-learning/glossary/#tra...
|
| [2] Same (Attention): https://developers.google.com/machine-
| learning/glossary/#att...
|
| [3] arXiv: https://arxiv.org/abs/2205.14135
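|
| For a concrete feel for what [3] does, here is a deliberately
| naive sketch (one thread per query row, no shared-memory
| tiling, so none of FlashAttention's actual speed) of the
| online-softmax trick that keeps attention exact without ever
| materializing the N x N score matrix. Names and layout are my
| own, not the paper's or the repo's:
|
|     __global__ void streaming_attn(const float* Q, const float* K,
|                                    const float* V, float* O,
|                                    int N, int d) {
|         // One query row per thread.
|         int q = blockIdx.x * blockDim.x + threadIdx.x;
|         if (q >= N) return;
|         float m = -1e30f;  // running max of scores
|         float l = 0.0f;    // running softmax normalizer
|         for (int j = 0; j < d; ++j) O[q * d + j] = 0.0f;
|         for (int k = 0; k < N; ++k) {
|             float s = 0.0f;  // score = (q . k) / sqrt(d)
|             for (int j = 0; j < d; ++j)
|                 s += Q[q * d + j] * K[k * d + j];
|             s *= rsqrtf((float)d);
|             float m_new = fmaxf(m, s);
|             float alpha = expf(m - m_new);  // rescale old sums
|             float p = expf(s - m_new);
|             l = l * alpha + p;
|             for (int j = 0; j < d; ++j)
|                 O[q * d + j] = O[q * d + j] * alpha
|                              + p * V[k * d + j];
|             m = m_new;
|         }
|         for (int j = 0; j < d; ++j) O[q * d + j] /= l;
|     }
|
| FlashAttention runs the same recurrence over tiles of K and V
| staged in shared memory, which is where the IO-awareness and
| the speedup come from.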
| robrenaud wrote:
| Regarding your comment about how fast the research and industry
| is moving: would HN readers be interested in one- or two-
| paragraph summaries that are basically "explain it like I'm a
| machine learning engineer from 2020" who also knows the power of
| these models from the perspective of using ChatGPT or MS
| Copilot? That is, assume a fair amount of technical knowledge
| about the fundamentals, but don't assume that the reader has
| been paying enough attention to have whitebox knowledge of the
| current state of the art.
| jprete wrote:
| I personally have been looking for "explain it like I'm a CS
| PhD with lots of experience and the ability to look stuff
| up". But I suspect your summary would be pretty handy as
| well.
| jhanoncomm wrote:
| I reckon you need tacit knowledge. No way around it. Build
| a GPT using Python and Pytorch. For a good course: Andrej
| Karpathy is your keyword. At $1000 it is great value. But
| actually it is free ;-)
| whimsicalism wrote:
| frankly i don't really feel like all that much has changed
| since 2020 except for scale
| araes wrote:
| That sounds at least somewhat helpful. Honestly, a gradient
| for some of this stuff would be nice. Explain to me like I'm:
| "five", "a high schooler", "a college grad (not CS/ML/Eng)",
| "a CS/Eng not ML".
|
| Although in a couple of years, kids in restaurants will probably
| be telling me how they're leveling up attention on their
| neuro-pet. The singularity is steep.
| imjonse wrote:
| The singularity implies AI intelligence increasing
| exponentially, not human intelligence. Kids will not be talking
| about neural nets any time soon.
| godelski wrote:
| Zero shot is wrong, but that definition is commonly used.
|
| Zero shot means testing out of distribution, not just on "a
| task" the model wasn't trained on. The latter is ill defined.
|
| The original definition comes from a few papers. But the
| classic example is a classifier recognizing zebras despite
| having never been trained on zebras (though it may have been
| trained on horses). Those are out of distribution. But
| importantly, out of the implicit distribution, not the target
| distribution.
|
| The common improper usage usually confuses these two. A simple
| example might be training on 256x256 images and testing on
| 1024x1024. That's still in the implicit distribution (as long
| as the classes are identical). A very common example is
| training on a large dataset like LAION and then testing on
| COCO or ImageNet-1k. This is not zero shot, because the
| classes in ImageNet are in LAION (and in COCO). Basically,
| this is a useless definition, because then any validation or
| test set would be zero shot: those samples were never seen in
| the training data and are thus out of the training
| distribution. But remember that datasets are proxies for
| larger distributions.
|
| Where it can sometimes get tricky is tasks (emergence has
| entered the chat). For example, you may not intend to train a
| generative model to do classification, but you probably did
| (it's very clear -- in the math -- if you're training density
| models (KLD, score, etc.)). This can get hairy because it's
| very easy to train a model to do things you don't realize you
| are training it to do and only find out later. Some people get
| upset about this, but it's the nature of frameworks that have
| low interpretability. There's still a lot of mathematics we
| need to learn, and it tends not to be an explicit focus in ML,
| but there are plenty in the community focused on it.
| saiojd wrote:
| What does __syncthreads() do here, exactly? I'm new to CUDA; I
| could get the overall idea of the FlashAttention paper but not
| the details.
| cavisne wrote:
| It makes every thread in the block wait until all of them have
| reached this point. Worth reading a CUDA primer for more details
| on blocks/warps.
|
| Since the threads rely on each other to fill the SRAM (shared
| memory) with all the needed data, some values would be missing
| if you didn't wait.
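|
| A minimal sketch of the pattern (not the actual kernel from the
| repo, just an illustration of why the barrier is needed):
|
|     __global__ void tile_sum(const float* x, float* out, int n) {
|         // Each block stages one tile of x into shared memory
|         // ("SRAM"), then every thread reads the whole tile.
|         extern __shared__ float tile[];
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         tile[threadIdx.x] = (i < n) ? x[i] : 0.0f;
|         __syncthreads();  // wait until the whole tile is loaded
|         float s = 0.0f;
|         for (int j = 0; j < blockDim.x; ++j)
|             s += tile[j];  // safe: every slot was written above
|         if (threadIdx.x == 0) out[blockIdx.x] = s;
|     }
|
| Launched as tile_sum<<<blocks, threads, threads * sizeof(float)>>>
| (x, out, n). Without the __syncthreads(), a thread could read
| tile[j] before the thread responsible for slot j had written it.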
| xrd wrote:
| Any CUDA primer you recommend in particular? I had this same
| question.
| winwang wrote:
| Here's an article on syncing in CUDA via cooperative
| groups: https://developer.nvidia.com/blog/cooperative-
| groups/
|
| There's also explicit warp synchronization, i.e.
| __syncwarp(). More on warp primitives here:
| https://developer.nvidia.com/blog/using-cuda-warp-level-
| prim...
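|
| For example, a warp-level reduction with shuffle intrinsics (an
| illustrative sketch, not code from the repo):
|
|     // Sums a value across the 32 lanes of a warp; lane 0 ends
|     // up with the total. The full mask (0xffffffff) says that
|     // all lanes participate.
|     __device__ float warp_sum(float v) {
|         for (int offset = 16; offset > 0; offset >>= 1)
|             v += __shfl_down_sync(0xffffffff, v, offset);
|         return v;
|     }
|
| The _sync suffix is the synchronization: the lanes named in the
| mask converge before exchanging values, which is also what
| __syncwarp() does on its own when no data needs to be exchanged.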
| zer0zzz wrote:
| This is fantastic. I am just starting in the ML space (compile
| from compilers) and I love short kernels that I can use to
| understand things better.
| lagrange77 wrote:
| > compile from compilers
|
| What does that mean?
| zer0zzz wrote:
| Typo, meant to write "coming from compilers"
| einpoklum wrote:
| My GPU work is not in ML (deep or otherwise); but ...
|
| 1. "100 lines of CUDA" + PyTorch; maybe this is useful and maybe
| it isn't, but counting lines of code on top of a huge codebase is
| not very meaningful.
|
| 2. Launching separate kernels, synchronously, on the default
| stream, for various operations, is typically not the right way to
| utilize a GPU.
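|
| A sketch of the alternative (illustrative only, not taken from
| the repo): independent kernels get their own streams so they can
| overlap instead of serializing on the default stream.
|
|     #include <cuda_runtime.h>
|
|     __global__ void scale(float* x, float a, int n) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) x[i] *= a;
|     }
|
|     int main() {
|         const int n = 1 << 20;
|         float *a_d, *b_d;
|         cudaMalloc(&a_d, n * sizeof(float));
|         cudaMalloc(&b_d, n * sizeof(float));
|         cudaStream_t s1, s2;
|         cudaStreamCreate(&s1);
|         cudaStreamCreate(&s2);
|         // Independent work on separate streams may overlap.
|         scale<<<(n + 255) / 256, 256, 0, s1>>>(a_d, 2.0f, n);
|         scale<<<(n + 255) / 256, 256, 0, s2>>>(b_d, 3.0f, n);
|         cudaStreamSynchronize(s1);
|         cudaStreamSynchronize(s2);
|         cudaStreamDestroy(s1);
|         cudaStreamDestroy(s2);
|         cudaFree(a_d);
|         cudaFree(b_d);
|         return 0;
|     }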
___________________________________________________________________
(page generated 2024-03-16 23:00 UTC)