hngopher.com

       [HN Gopher] GPU Puzzles
       ___________________________________________________________________
        
       GPU Puzzles
        
       Author : cgadski
       Score  : 310 points
       Date   : 2024-09-18 13:08 UTC (5 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | srush wrote:
       | I made these a couple of years ago as a teaching exercise for
       | https://minitorch.github.io/. At the time the resources for doing
       | anything on GPUs were pretty sparse and the NVIDIA docs were
       | quite challenging.
       | 
       | These days there are great resources for going deep on this
       | topic. The CUDA-mode org is particularly great, both their video
       | series and PMPP reading groups.
        
         | bytepoet wrote:
         | Thanks a lot, Sasha, for creating these. I found your LLM
         | training puzzles to be excellent as well.
        
           | srush wrote:
           | Awesome! Here are all of them if anyone else is looking.
           | 
           | https://github.com/srush/Triton-puzzles
           | https://github.com/srush/tensor-puzzles
           | https://github.com/srush/autodiff-puzzles
           | https://github.com/srush/transformer-puzzles
           | https://github.com/srush/GPTworld
           | https://github.com/srush/LLM-Training-Puzzles
        
         | nextos wrote:
         | Slightly offtopic, but any chance you could update or re-upload
         | code for your https://github.com/harvardnlp/DeepLatentNLP
         | tutorial? I found the NLP latent variable models discussed
         | there really interesting, and notebooks were excellent.
         | However, these seem to be gone and the only thing left is the slides?
         | 
         | Alternatively, any other places that discuss the same topics,
         | including some code? I could only find equivalent discussions
         | with code in Pyro docs and Kevin Murphy's book, volume 2. But
         | these are more sparse as they also cover many other topics.
        
           | srush wrote:
           | I'll take a look. Yeah, Pyro is the best thing to do here. But
           | it would be nice to revisit some of these implementations.
        
             | nextos wrote:
             | Thank you so much!
        
         | lins1909 wrote:
         | Thanks Sasha - this looks like a great resource. Just to be
         | clear, would you recommend going through other newer resources
         | than this instead?
         | 
         | Not sure if your comment is to discourage someone from going
         | through this.
        
           | srush wrote:
           | These still hold up, and I think they're a great first step.
           | But they no longer get you to the goal line. Think about it
           | more as conceptual practice, before you enter the jungle.
        
             | lins1909 wrote:
             | Got it, thank you.
        
       | ismailmaj wrote:
       | It would be nice if the puzzles natively supported C++ CUDA.
        
         | srush wrote:
         | Here is a port without the visualizer:
         | 
         | https://twitter.com/srush_nlp/status/1719376959572980094
         | 
         | Here is an amazing in-browser implementation in WebGPU:
         | 
         | https://www.answer.ai/posts/2024-09-12-gpupuzzles.html
        
       | fifilura wrote:
       | I think this course is also relevant for some deeper context.
       | 
       | https://gfxcourses.stanford.edu/cs149/fall23/lecture/datapar...
        
         | geekodour wrote:
         | all videos should be available on YT by end of month
        
       | throwaway314155 wrote:
       | Either puzzle 4 has a bug in it or I'm losing my mind. (Possible
       | solution below, so don't read if you want to go in fresh.)
       | 
       |     # FILL ME IN (roughly 2 lines)
       |     if local_i < size and local_j < size:
       |         out[local_i][local_j] = a[local_i][local_j] + 10
       | 
       | Results in a failed assertion:
       | 
       |     AssertionError: Wrong number of indices
       | 
       | But the test cell beneath it will still pass?
        
         | imjonse wrote:
         | maybe try out[local_i, local_j] ?
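
A minimal sketch of why the two indexing forms can behave differently, assuming the puzzle's array wrapper checks that every access supplies one index per dimension. CheckedArray and map_2d_thread are hypothetical stand-ins written for illustration, not the puzzle library's actual classes, and the threads are simulated with a plain Python loop rather than run on a GPU.

```python
import numpy as np


class CheckedArray:
    """Hypothetical stand-in for a wrapper that validates index counts."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def _check(self, index):
        # Normalize a[i] to a[(i,)] so a single index counts as one.
        if not isinstance(index, tuple):
            index = (index,)
        assert len(index) == self.data.ndim, "Wrong number of indices"
        return index

    def __getitem__(self, index):
        return self.data[self._check(index)]

    def __setitem__(self, index, value):
        self.data[self._check(index)] = value


def map_2d_thread(out, a, size, local_i, local_j):
    """Body of one simulated thread: add 10 to a single element."""
    if local_i < size and local_j < size:
        # One access with two indices, rather than out[local_i][local_j].
        out[local_i, local_j] = a[local_i, local_j] + 10


size = 2
a = CheckedArray(np.arange(size * size).reshape(size, size))
out = CheckedArray(np.zeros((size, size), dtype=int))

# Launch more simulated threads than elements to exercise the guard.
for i in range(3):
    for j in range(3):
        map_2d_thread(out, a, size, i, j)

try:
    a[0][0]  # chained form: the first access passes only one index
    chained_ok = True
except AssertionError:
    chained_ok = False
```

Under this assumption, a[0][0] fails on the first bracket because it hands a 2D array a single index, while a[0, 0] passes both indices in one call. Whether the real library checks indices this way is not confirmed by the thread.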
        
       | aleinin wrote:
       | I recently ported this to Metal for Apple Silicon computers. If
       | you're interested in learning GPU programming on an M series Mac,
       | I think this is a very accessible option. Thanks to Sasha for
       | making this!
       | 
       | https://github.com/abeleinin/Metal-Puzzles
        
       | 867-5309 wrote:
       | seems like an opportune moment to gift a plug for bitcoin
       | puzzles, namely BTC32 / 1000 BTC Challenge[1]
       | 
       | pools are in dire need of cuda developers
       | 
       | [1]https://bitcointalk.org/index.php?topic=1306983.0
        
         | talldayo wrote:
         | > pools are in dire need of cuda developers
         | 
         | Pools have money; if they need CUDA engineers, they are fully
         | capable of hiring them at the industry rate.
        
           | 867-5309 wrote:
           | most are community-based, plus, the prize can far exceed such
           | a rate
        
             | talldayo wrote:
             | > the prize can far exceed such a rate
             | 
             | For all the good it's done them.
        
         | jamilton wrote:
         | Why? Wouldn't existing tools be about as good as they could be?
        
       | wmil wrote:
       | So I'm used to working with lists and maps, which doesn't really
       | track well with tackling problems on thousands of cores.
       | 
       | Is the usual strategy to worry less about repeating calculations
       | and just use brute force to tackle the problem?
       | 
       | Is there a good resource to read about how to tackle problems in
       | an extremely parallel way?
        
         | srush wrote:
         | I would recommend first learning Numpy or a similar vectorized
         | library. If you have a good sense of those data structures
         | (array broadcasting) it is a good starting point for what you
         | can do in a GPU world.
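
To make the broadcasting suggestion concrete, here is a small NumPy sketch. The data and the pairwise-distance task are illustrative assumptions, not taken from the puzzles. It contrasts a list-style nested loop with a single whole-array expression in which every output cell depends only on its own (i, j) pair, which is roughly the shape of work that maps onto one GPU thread per element.

```python
import numpy as np

# Hypothetical data: 4 points on a line.
x = np.array([0.0, 1.0, 3.0, 6.0])

# Loop version: how one might write it with lists.
loop_result = [[abs(xi - xj) for xj in x] for xi in x]

# Broadcast version: shape (4, 1) against shape (1, 4) expands to (4, 4).
# Every output cell depends only on its own (i, j) pair.
pairwise = np.abs(x[:, None] - x[None, :])

# Reductions are whole-array operations too: mask the zero diagonal,
# then take the smallest remaining distance in each row.
nearest_gap = np.where(pairwise == 0, np.inf, pairwise).min(axis=1)
```

The broadcast version computes each distance twice, once for (i, j) and once for (j, i). That touches on the question above: in this style, some redundant arithmetic is often accepted in exchange for independent, uniform work per element, though whether that trade pays off depends on the problem.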
        
       | saagarjha wrote:
       | When working on GPU code there are really two parts to it, I feel.
       | One is "how do I even write code for the GPU" which this tutorial
       | seems to cover but there's a second part which is "how do I write
       | _good_ code for the GPU" which seems like it would need another
       | resource or expansion to this one.
        
         | derefr wrote:
         | I've always felt like the best interactive educational model
         | for forming a good intuition on how to maximize throughput and
         | minimize worst-case latency in a pipelined parallel dataflow
         | system (e.g. DSPs, FPGAs, GPUs, or even distributed message-
         | passing systems) would be some variant of the game _Factorio_.
         | Specifically, one with:
         | 
         | 1. instead of buildings, IP cores doing processing steps;
         | 
         | 2. instead of belts, wires -- which take up far less than one
         | tile, so many can run together along one tile and many can
         | connect to a single IP core; where each wire can move its
         | contents at arbitrary speed (including "stopped") -- but where
         | this will have a power-use cost proportional to the wire's
         | speed;
         | 
         | 3. an overall goal of optimizing for rocket launches per second
         | _per power-usage watt_. (Which should overall require
         | minimizing the amount of stuff moving around across the whole
         | base, avoiding pipeline stalls; doing as much parallel batching
         | as possible; etc.)
         | 
         | (Yes, I know Shenzhen I/O exists. It's great for what it does
         | -- modelling signals and signal transformations -- but it
         | doesn't model individual packets of data as moving along wires
         | with propagation delay, and with the potential for e.g.
         | parallel-line interference given a bad encoding scheme, quantum
         | tunnelling, overclocking or undervolting components, etc. I
         | think a Factorio-variant would actually be much more flexible
         | to implement these aspects.)
        
       | czhu12 wrote:
       | I loved the tensor puzzles you made. I spent the morning
       | revisiting and liking all the videos on youtube you've made. Hope
       | for many more in the future!
        
         | srush wrote:
         | Thanks so much!
        
       | az226 wrote:
       | Can I hire you to make Flash Attention a reality for V100?
        
         | srush wrote:
         | Nope! Too hard for me. But it would be great practice for
         | someone who wants to get started in this space. There is a
         | Triton implementation that might be a good starting place.
        
       ___________________________________________________________________
       (page generated 2024-09-23 23:00 UTC)