[HN Gopher] GPU Puzzles
___________________________________________________________________
GPU Puzzles
Author : cgadski
Score : 310 points
Date : 2024-09-18 13:08 UTC (5 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| srush wrote:
| I made these a couple of years ago as a teaching exercise for
| https://minitorch.github.io/. At the time the resources for doing
| anything on GPUs were pretty sparse and the NVidia docs were
| quite challenging.
|
| These days there are great resources for going deep on this
| topic. The CUDA-mode org is particularly great, both their video
| series and PMPP reading groups.
| bytepoet wrote:
| Thanks a lot, Sasha, for creating these. I found your LLM
| training puzzles to be excellent as well.
| srush wrote:
| Awesome! Here are all of them if anyone else is looking.
|
| https://github.com/srush/Triton-puzzles
| https://github.com/srush/tensor-puzzles
| https://github.com/srush/autodiff-puzzles
| https://github.com/srush/transformer-puzzles
| https://github.com/srush/GPTworld
| https://github.com/srush/LLM-Training-Puzzles
| nextos wrote:
| Slightly offtopic, but any chance you could update or re-upload
| code for your https://github.com/harvardnlp/DeepLatentNLP
| tutorial? I found the NLP latent variable models discussed
| there really interesting, and notebooks were excellent.
| However, these seem to be gone and the only thing left is the slides?
|
| Alternatively, any other places that discuss the same topics,
| including some code? I could only find equivalent discussions
| with code in Pyro docs and Kevin Murphy's book, volume 2. But
| these are more sparse as they also cover many other topics.
| srush wrote:
| I'll take a look. Yeah Pyro is the best thing to do here. But
| it would be nice to revisit some of these implementations.
| nextos wrote:
| Thank you so much!
| lins1909 wrote:
| Thanks Sasha - this looks like a great resource. Just to be
| clear, would you recommend going through other, newer resources
| instead of this one?
|
| Not sure if your comment is to discourage someone from going
| through this.
| srush wrote:
| These still hold up, and I think they're a great first step.
| But they no longer get you to the goal line. Think about it
| more as conceptual practice, before you enter the jungle.
| lins1909 wrote:
| Got it, thank you.
| ismailmaj wrote:
| It would be nice if the puzzles natively supported C++ CUDA.
| srush wrote:
| Here is a port without the visualizer:
|
| https://twitter.com/srush_nlp/status/1719376959572980094
|
| Here is an amazing in-browser implementation in WebGPU:
|
| https://www.answer.ai/posts/2024-09-12-gpupuzzles.html
| fifilura wrote:
| I think this course is also relevant for some deeper context.
|
| https://gfxcourses.stanford.edu/cs149/fall23/lecture/datapar...
| geekodour wrote:
| all videos should be available on YT by end of month
| throwaway314155 wrote:
| Either puzzle 4 has a bug in it or I'm losing my mind. (Possible
| solution below, so don't read if you want to go in fresh.)
|
|     # FILL ME IN (roughly 2 lines)
|     if local_i < size and local_j < size:
|         out[local_i][local_j] = a[local_i][local_j] + 10
|
| Results in a failed assertion:
|
|     AssertionError: Wrong number of indices
|
| But the test cell beneath it will still pass?
| imjonse wrote:
| maybe try out[local_i, local_j] ?
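|
| A minimal sketch of why the comma form might matter here, assuming
| the puzzle arrays are wrapped in an object that requires one index
| per dimension. That is a guess based on the error message, not
| something confirmed from the puzzle source. CheckedGrid and add_ten
| are hypothetical names and are not part of the puzzle library.

```python
class CheckedGrid:
    """Hypothetical 2D wrapper that insists on exactly one index per dimension."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.ndim = 2

    def _check(self, index):
        # A bare integer (as in grid[i]) arrives as a single index, not a tuple.
        if not isinstance(index, tuple):
            index = (index,)
        if len(index) != self.ndim:
            raise AssertionError("Wrong number of indices")
        return index

    def __getitem__(self, index):
        i, j = self._check(index)
        return self.rows[i][j]

    def __setitem__(self, index, value):
        i, j = self._check(index)
        self.rows[i][j] = value


def add_ten(out, a, size):
    # Sequential stand-in for the per-thread body: one (i, j) pair per "thread".
    for i in range(size):
        for j in range(size):
            out[i, j] = a[i, j] + 10


a = CheckedGrid([[0, 1], [2, 3]])
out = CheckedGrid([[0, 0], [0, 0]])
add_ten(out, a, 2)

# Chained indexing calls __getitem__ with a single index first, so it is rejected.
try:
    a[0][1]
    chained_ok = True
except AssertionError as err:
    chained_ok = False
    chained_error = str(err)
```

| With a wrapper like this, a[i][j] fails on the first a[i] lookup,
| while a[i, j] passes both indices in a single call.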
| aleinin wrote:
| I recently ported this to Metal for Apple Silicon computers. If
| you're interested in learning GPU programming on an M series Mac,
| I think this is a very accessible option. Thanks to Sasha for
| making this!
|
| https://github.com/abeleinin/Metal-Puzzles
| 867-5309 wrote:
| seems like an opportune moment to gift a plug for bitcoin
| puzzles, namely BTC32 / 1000 BTC Challenge[1]
|
| pools are in dire need of cuda developers
|
| [1]https://bitcointalk.org/index.php?topic=1306983.0
| talldayo wrote:
| > pools are in dire need of cuda developers
|
| Pools have money; if they need CUDA engineers, they are fully
| capable of hiring them at the industry rate.
| 867-5309 wrote:
| most are community-based, plus, the prize can far exceed such
| a rate
| talldayo wrote:
| > the prize can far exceed such a rate
|
| For all the good it's done them.
| jamilton wrote:
| Why? Wouldn't existing tools be about as good as they could be?
| wmil wrote:
| So I'm used to working with lists and maps, which doesn't really
| track well with tackling problems on thousands of cores.
|
| Is the usual strategy to worry less about repeating calculations
| and just use brute force to tackle the problem?
|
| Is there a good resource to read about how to tackle problems in
| an extremely parallel way?
| srush wrote:
| I would recommend first learning Numpy or a similar vectorized
| library. If you have a good sense of those data structures
| (array broadcasting) it is a good starting point for what you
| can do in a GPU world.
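|
| As a rough illustration of the broadcasting idea, here is a sketch
| that computes every pairwise sum of two vectors, first with explicit
| loops and then with NumPy broadcasting. The function names are made
| up for this example.

```python
import numpy as np


def pairwise_sum_loop(a, b):
    # One output cell at a time, the way you would with lists.
    out = np.zeros((len(a), len(b)), dtype=a.dtype)
    for i in range(len(a)):
        for j in range(len(b)):
            out[i, j] = a[i] + b[j]
    return out


def pairwise_sum_broadcast(a, b):
    # a becomes a column of shape (n, 1), b a row of shape (1, m);
    # NumPy stretches both to (n, m) and adds elementwise.
    return a[:, None] + b[None, :]


a = np.arange(3)
b = np.arange(4) * 10
loop_result = pairwise_sum_loop(a, b)
broadcast_result = pairwise_sum_broadcast(a, b)
```

| Each output cell depends only on its own (i, j) position, which is
| roughly the shape of work that maps onto one GPU thread per cell.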
| saagarjha wrote:
| When working on GPU code there are really two parts to it, I feel.
| One is "how do I even write code for the GPU", which this tutorial
| seems to cover. But there's a second part, "how do I write
| _good_ code for the GPU", which seems like it would need another
| resource or an expansion of this one.
| derefr wrote:
| I've always felt like the best interactive educational model
| for forming a good intuition on how to maximize throughput and
| minimize worst-case latency in a pipelined parallel dataflow
| system (e.g. DSPs, FPGAs, GPUs, or even distributed message-
| passing systems) would be some variant of the game _Factorio_.
| Specifically, one with:
|
| 1. instead of buildings, IP cores doing processing steps;
|
| 2. instead of belts, wires -- which take up far less than one
| tile, so many can run together along one tile and many can
| connect to a single IP core; where each wire can move its
| contents at arbitrary speed (including "stopped") -- but where
| this will have a power-use cost proportional to the wire's
| speed;
|
| 3. an overall goal of optimizing for rocket launches per second
| _per power-usage watt_. (Which should overall require
| minimizing the amount of stuff moving around across the whole
| base, avoiding pipeline stalls; doing as much parallel batching
| as possible; etc.)
|
| (Yes, I know Shenzhen I/O exists. It's great for what it does
| -- modelling signals and signal transformations -- but it
| doesn't model individual packets of data as moving along wires
| with propagation delay, and with the potential for e.g.
| parallel-line interference given a bad encoding scheme, quantum
| tunnelling, overclocking or undervolting components, etc. I
| think a Factorio-variant would actually be much more flexible
| to implement these aspects.)
| czhu12 wrote:
| I loved the tensor puzzles you made. I spent the morning
| revisiting and liking all the videos on youtube you've made. Hope
| for many more in the future!
| srush wrote:
| Thanks so much!
| az226 wrote:
| Can I hire you to make Flash Attention a reality for V100?
| srush wrote:
| Nope! Too hard for me. But it would be great practice for
| someone who wants to get started in this space. There is a
| Triton implementation that might be a good starting place.
___________________________________________________________________
(page generated 2024-09-23 23:00 UTC)