[HN Gopher] Show HN: Flash Attention in ~100 lines of CUDA
       ___________________________________________________________________
        
       Show HN: Flash Attention in ~100 lines of CUDA
        
       Author : tspeterkim
       Score  : 146 points
       Date   : 2024-03-16 15:31 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | treesciencebot wrote:
        | Pretty neat implementation. In general, for this sort of
        | exercise (and even if the intention is to go to prod with
        | custom kernels) I lean towards Triton to write the kernels
        | themselves. It is much easier to integrate into the
        | toolchain, and it allows a level of abstraction that doesn't
        | affect performance even a little bit while providing useful
        | constructs.
        
         | ixaxaar wrote:
         | You mean triton the inference server or triton the DSL for
         | cuda?
        
           | whimsicalism wrote:
            | they mean the dsl (not necessarily just for cuda)
        
           | p1esk wrote:
           | The DSL: https://openai.com/research/triton
        
           | treesciencebot wrote:
           | triton the DSL.
        
         | whimsicalism wrote:
         | yeah even the official flashattention is moving many
         | implementations from cutlass to triton except for the main mha
         | backward/forward pass
        
           | jart wrote:
           | It was written with cutlass? No wonder Peter Kim found it
           | valuable and worthwhile to de-obfuscate. Adopting a new
           | programming language invented by OpenAI doesn't sound like a
           | much better alternative. I'd be shocked if either of them
           | were able to build code for AMD GPUs, where it's easy to
           | adapt CUDA code, but not if it's buried in tens of thousands
           | of lines of frameworks. I like open source code to have
           | clarity so I can optimize it for my own production
           | environment myself. When people distribute code they've
           | productionized for themselves, it squeezes out all the alpha
           | and informational value. Just because something's open source
            | doesn't mean it's actually open. I think people mostly do
            | it to lick the cookie without giving much away.
        
         | fpgamlirfanboy wrote:
         | > allows a level of abstraction that doesn't affect performance
         | even a little bit
         | 
         | The second part of this sentence is true because the first part
         | is false.
        
           | treesciencebot wrote:
            | zero cost abstractions exist. that doesn't mean all
            | abstractions are zero-cost, or that being zero-cost
            | somehow invalidates their abstractness/genericness. but
            | maybe we differ on the definition of abstractions.
        
       | araes wrote:
       | For those who have no idea what's being discussed, quick
       | background.
       | 
       | Discussing: Transformer [1] memory issues and approximate
       | attention [2] in machine learning training.
       | 
       | Specifically: FlashAttention: Fast and Memory-Efficient Exact
       | Attention with IO-Awareness. [3]
       | 
        | As a side comment, this entire industry is sorely in need of
        | at least intros. The entire space has moved so fast in the
        | last year that I need an entirely new dictionary and
        | thesaurus for all the terms they've created. Notably, because
        | of this, I found out Google has a glossary of machine
        | learning terms. It's actually somewhat handy.
       | 
       | [1] Google Machine Learning Glossary (Transformer):
       | https://developers.google.com/machine-learning/glossary/#tra...
       | 
       | [2] Same (Attention): https://developers.google.com/machine-
       | learning/glossary/#att...
       | 
        | [3] arXiv: https://arxiv.org/abs/2205.14135
        
         | robrenaud wrote:
          | Regarding your comment about how fast the research and
          | industry are moving, would HN readers be interested in
          | relevant one or two paragraph summaries that are basically
          | "explain it like I am a machine learning engineer from
          | 2020" who also knows the power of these models from the
          | perspective of using ChatGPT or MS Copilot? That is, assume
          | a fair amount of technical knowledge about the
          | fundamentals, but don't assume that the reader has been
          | paying enough attention to have whitebox knowledge of the
          | current state of the art.
        
           | jprete wrote:
           | I personally have been looking for "explain it like I'm a CS
           | PhD with lots of experience and the ability to look stuff
           | up". But I suspect your summary would be pretty handy as
           | well.
        
             | jhanoncomm wrote:
             | I reckon you need tacit knowledge. No way around it. Build
             | a GPT using Python and Pytorch. For a good course: Andrej
             | Karpathy is your keyword. At $1000 it is great value. But
             | actually it is free ;-)
        
           | whimsicalism wrote:
           | frankly i don't really feel like all that much has changed
           | since 2020 except for scale
        
           | araes wrote:
           | That sounds at least somewhat helpful. Honestly, a gradient
           | for some of this stuff would be nice. Explain to me like I'm:
           | "five", "a high schooler", "a college grad (not CS/ML/Eng)",
           | "a CS/Eng not ML".
           | 
            | Although in a couple years, kids in restaurants will
            | probably be telling me how they're leveling up attention
            | on their neuro-pet. The singularity is steep.
        
             | imjonse wrote:
             | singularity implies AI increases exponentially, not human
             | intelligence. Kids will not talk about neural nets any time
             | soon.
        
         | godelski wrote:
          | That definition of zero shot is wrong, but it is commonly
          | used.
          | 
          | Zero shot means testing out of distribution, not just on "a
          | task" the model wasn't trained on. The latter is ill-
          | defined.
          | 
          | The original definition comes from a few papers. But the
          | classic example is a classifier recognizing zebras despite
          | having never been trained on zebras (though it may have
          | been trained on horses). Zebras are out of distribution.
          | But importantly, out of the implicit distribution, not the
          | target distribution.
          | 
          | The common improper usage usually confuses these two. A
          | simple example might be training on 256x256 images and
          | testing on 1024x1024. That's still in the implicit
          | distribution (as long as the classes are identical). A very
          | common example is training on a large dataset like LAION
          | and then testing on COCO or ImageNet-1k. This is not zero
          | shot, because the classes in ImageNet are in LAION (and in
          | COCO). Basically, this is a useless definition, because
          | then any validation or test set is zero shot: those samples
          | were never seen in the training data and are thus out of
          | the training distribution. But remember that datasets are
          | proxies for larger distributions.
          | 
          | Where it can sometimes get tricky is tasks (emergence has
          | entered the chat). For example, you may not intend to train
          | a generative model to do classification, but you probably
          | did (it's very clear -- in the math -- if you're training
          | density models (KLD, score, etc)). This can get hairy
          | because it's very easy to train a model to do things you
          | don't realize you're training it to do, and to only find
          | out later. Some people can get upset about this, but it's
          | the nature of frameworks with low interpretability. There's
          | still a lot of mathematics we need to learn, and it tends
          | not to be an explicit focus in ML, but there are plenty in
          | the community focused on this.
        
       | saiojd wrote:
        | What does __syncthreads() do here exactly? I'm new to CUDA; I
        | could get the overall idea of the FlashAttention paper but
        | not the details.
        
         | cavisne wrote:
          | It causes every thread in the block to wait until all of
          | them have reached this point. Worth reading a CUDA primer
          | for more details on blocks/warps.
          | 
          | Since the threads rely on each other to fill the SRAM with
          | all the needed data, values would be missing if you didn't
          | wait.
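          | 
          | As a minimal sketch of that pattern (not code from the
          | linked repo; it assumes a 256-thread block): the threads
          | cooperatively fill a shared-memory tile, and the barrier
          | guarantees the tile is complete before any thread reads
          | values that other threads wrote.
          | 
          |     __global__ void tile_sum(const float* in, float* out, int n) {
          |         __shared__ float tile[256];  // per-block SRAM tile
          |         int i = blockIdx.x * blockDim.x + threadIdx.x;
          |         // cooperative fill: each thread writes one element
          |         tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
          |         __syncthreads();  // wait until the tile is fully populated
          |         if (threadIdx.x == 0) {  // now safe to read all elements
          |             float s = 0.0f;
          |             for (int j = 0; j < blockDim.x; ++j) s += tile[j];
          |             out[blockIdx.x] = s;
          |         }
          |     }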
        
           | xrd wrote:
           | Any CUDA primer you recommend in particular? I had this same
           | question.
        
             | winwang wrote:
             | Here's an article on syncing in CUDA via cooperative
             | groups: https://developer.nvidia.com/blog/cooperative-
             | groups/
             | 
             | There's also explicit warp synchronization, i.e.
             | __syncwarp(). More on warp primitives here:
             | https://developer.nvidia.com/blog/using-cuda-warp-level-
             | prim...
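              | 
              | As a hedged sketch (not from the repo, and assuming a
              | single-warp block): a warp-level sum using
              | __shfl_down_sync() exchanges values through registers,
              | so it needs neither shared memory nor __syncthreads();
              | __syncwarp() is only needed when lanes communicate
              | through memory instead.
              | 
              |     __global__ void warp_sum(const float* in, float* out) {
              |         // assumes blockDim.x == 32, i.e. exactly one warp
              |         float v = in[threadIdx.x];
              |         // shift-based reduction within the warp
              |         for (int offset = 16; offset > 0; offset >>= 1)
              |             v += __shfl_down_sync(0xffffffffu, v, offset);
              |         if (threadIdx.x == 0) *out = v;  // lane 0 has the total
              |     }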
        
       | zer0zzz wrote:
       | This is fantastic. I am just starting in the ML space (compile
        | from compilers) and I love short kernels that I can use to
        | understand things better.
        
         | lagrange77 wrote:
         | > compile from compilers
         | 
         | What does that mean?
        
           | zer0zzz wrote:
           | Typo, meant to write "coming from compilers"
        
       | einpoklum wrote:
       | My GPU work is not in ML (deep or otherwise); but ...
       | 
       | 1. "100 lines of CUDA" + PyTorch; maybe this is useful and maybe
       | it isn't, but counting lines of code on top of a huge codebase is
       | not very meaningful.
       | 
       | 2. Launching separate kernels, synchronously, on the default
       | stream, for various operations, is typically not the right way to
       | utilize a GPU.
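        | 
        | For point 2, a hypothetical sketch of the alternative (the
        | kernel and buffer names are placeholders, not from the repo):
        | give independent pieces of work their own streams so the
        | launches can overlap instead of serializing on the default
        | stream.
        | 
        |     #include <cuda_runtime.h>
        | 
        |     // placeholder kernel standing in for one head's work
        |     __global__ void head_kernel(const float* q, const float* k,
        |                                 float* o, int n) {
        |         int i = blockIdx.x * blockDim.x + threadIdx.x;
        |         if (i < n) o[i] = q[i] * k[i];
        |     }
        | 
        |     void launch_heads(float** Q, float** K, float** O,
        |                       int n, int num_heads) {
        |         cudaStream_t streams[8];  // assume num_heads <= 8
        |         for (int h = 0; h < num_heads; ++h)
        |             cudaStreamCreate(&streams[h]);
        |         // independent launches on separate streams can overlap
        |         for (int h = 0; h < num_heads; ++h)
        |             head_kernel<<<(n + 255) / 256, 256, 0, streams[h]>>>(
        |                 Q[h], K[h], O[h], n);
        |         for (int h = 0; h < num_heads; ++h) {
        |             cudaStreamSynchronize(streams[h]);
        |             cudaStreamDestroy(streams[h]);
        |         }
        |     }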
        
       ___________________________________________________________________
       (page generated 2024-03-16 23:00 UTC)