[HN Gopher] FlashAttention-2, 2x faster than FlashAttention
       ___________________________________________________________________
        
       FlashAttention-2, 2x faster than FlashAttention
        
       Author : machdiamonds
       Score  : 58 points
       Date   : 2023-07-17 18:21 UTC (4 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | whimsicalism wrote:
       | Does anyone have resources for a good way to get started with
       | this sort of modern GPU systems work?
        
         | luckyt wrote:
          | I found it helpful to start with CUDA on numba, since it lets
          | you write GPU kernels in Python. Assuming you're like most ML
          | engineers and more familiar with Python than C++, this lets
          | you learn CUDA concepts without having to learn C++ at the
          | same time. There's also a set of GPU puzzles [1] for beginners
          | to get started with numba CUDA.
         | 
         | [1] https://github.com/srush/GPU-Puzzles
        
           | whimsicalism wrote:
           | Thanks for the link! Sasha is actually my former professor -
           | if this is anything like his past pytorch puzzles I'm sure
           | I'll find it enjoyable.
        
         | jahewson wrote:
          | If you'd like a practical goal, you probably want to learn
          | PyTorch and pick up some background knowledge of the memory
          | architecture of GPUs. If you want to go deep, learn CUDA:
         | https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
        
           | whimsicalism wrote:
           | Yes, I know pytorch well at this point and have basic memory
           | architecture understanding. In the process of learning CUDA,
           | but would love pointers for depth/intermediate things to
           | explore.
        
             | jahewson wrote:
              | I found this talk helpful: https://on-demand.gputechconf.com/gtc/2017/presentation/s712...
             | 
             | Have you tried the Visual Profiler yet?
        
         | brrrrrm wrote:
         | I'd start with the example of implementing the fastest
         | reduction you possibly can. Pretty much all complexity in every
         | kernel used in ML extends from this concept (reductions with
         | addition).
         | 
         | https://developer.download.nvidia.com/assets/cuda/files/redu...
        
           | whimsicalism wrote:
           | thank you for the suggestion - will take a look!
        
       | ternaus wrote:
        | I would be very grateful to see how one can leverage it not for
        | LLMs but for Stable Diffusion models.
        
         | m00x wrote:
         | Why couldn't it be applied to SD?
        
           | m00x wrote:
            | It looks like it's already a thing:
            | https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob...
        
       | lucidrains wrote:
       | huge! thank you Tri!
        
         | bufo wrote:
         | Tri Dao and Tim Dettmers ftw
        
           | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-07-17 23:01 UTC)