[HN Gopher] Ask HN: How to learn CUDA to professional level
       ___________________________________________________________________
        
       Ask HN: How to learn CUDA to professional level
        
       Hi all, I was wondering what books/courses/projects one might do to
        learn CUDA programming.  (To be frank, the main reason is that a
        lot of companies I wish to work for require CUDA experience --
        this hopefully shouldn't change your answers, just wanted to
        provide some context.)
        
       Author : upmind
       Score  : 178 points
       Date   : 2025-06-08 10:52 UTC (12 hours ago)
        
       | dist-epoch wrote:
       | As they typically say: Just Do It (tm).
       | 
        | Start writing a CUDA kernel to sort an array or find the maximum
        | element.
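        | 
        | For instance, a minimal max-element program (a sketch; the
        | atomicMax approach is simple rather than fast, but it is enough
        | to exercise the whole toolchain):
        | 
        |   // max_elem.cu -- build with: nvcc max_elem.cu -o max_elem
        |   #include <climits>
        |   #include <cstdio>
        |   #include <cuda_runtime.h>
        | 
        |   __global__ void maxKernel(const int* a, int n, int* out) {
        |       int i = blockIdx.x * blockDim.x + threadIdx.x;
        |       if (i < n) atomicMax(out, a[i]);  // one thread per element
        |   }
        | 
        |   int main() {
        |       const int n = 1 << 20;
        |       int* h = new int[n];
        |       for (int i = 0; i < n; i++) h[i] = (i * 37) % 100000;
        |       int *d_data, *d_max, h_max = INT_MIN;
        |       cudaMalloc(&d_data, n * sizeof(int));
        |       cudaMalloc(&d_max, sizeof(int));
        |       cudaMemcpy(d_data, h, n * sizeof(int),
        |                  cudaMemcpyHostToDevice);
        |       cudaMemcpy(d_max, &h_max, sizeof(int),
        |                  cudaMemcpyHostToDevice);
        |       maxKernel<<<(n + 255) / 256, 256>>>(d_data, n, d_max);
        |       cudaMemcpy(&h_max, d_max, sizeof(int),
        |                  cudaMemcpyDeviceToHost);
        |       printf("max = %d\n", h_max);  // expect 99999
        |       return 0;
        |   }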
        
         | amelius wrote:
         | I'd rather learn to use a library that works on any brand of
         | GPU.
         | 
         | If that is not an option, I'll wait!
        
           | pjmlp wrote:
           | If only Khronos and the competition cared about the developer
           | experience....
        
             | the__alchemist wrote:
             | This is continuously a point of frustration! Vulkan compute
             | is... suboptimal. I use Cuda because it feels like the only
             | practical option. I want Vulkan or something else to
             | compete seriously, but until that happens, I will use Cuda.
        
               | pjmlp wrote:
                | It took until Vulkanised 2025 to acknowledge that Vulkan
                | had become the same mess as OpenGL, and to put an action
                | plan in place to try to correct this.
               | 
                | Had it not been for Apple's initial OpenCL contribution
                | (regardless of how it went from there), AMD's Mantle as
                | the starting point for Vulkan, and NVidia's Vulkan-Hpp
                | and Slang, the ecosystem of Khronos standards would be
                | much worse.
               | 
                | Also, Vulkan tooling isn't as bad as OpenGL's, because
                | LunarG exists, and someone pays them for the whole Vulkan
                | SDK.
               | 
                | The attitude of "we publish paper standards" while the
                | community is expected to step in with the implementations
                | and tooling hardly matches the productivity of the
                | tooling around private APIs.
               | 
                | All GPU vendors, including Intel and AMD, would also
                | rather push their own compute APIs, even if based on top
                | of Khronos ones.
        
               | corysama wrote:
               | Is https://github.com/KomputeProject/kompute +
               | https://shader-slang.org/ getting there?
               | 
                | Runs on anything + auto-differentiation.
        
           | Cloudef wrote:
            | Both zig and rust are aiming to compile to GPUs natively.
            | What CUDA and HIP provide is a heterogeneous computing
            | runtime, i.e. hiding the boilerplate of executing code on
            | the CPU and GPU seamlessly.
        
           | latchkey wrote:
           | Then learn PyTorch.
           | 
           | The hardware between brands is fundamentally different. There
           | isn't a standard like x86 for CPUs.
           | 
           | So, while you may use something like HIPIFY to translate your
           | code between APIs, at least with GPU programming, it makes
           | sense to learn how they differ from each other or just pick
           | one of them and work with it knowing that the others will
           | just be some variation of the same idea.
        
             | horsellama wrote:
              | the jobs requiring cuda experience exist, most of the
              | time, because torch is not good enough
        
           | uecker wrote:
           | GCC / clang also have support for offloading.
        
           | moralestapia wrote:
           | K, bud.
           | 
           | Perhaps you haven't noticed, but you're in a thread that
           | asked about CUDA, explicitly.
        
         | the__alchemist wrote:
          | I concur with this. Then supplement with resources as
          | required. Ideally, find some tasks in your programs that are
          | parallelizable (learning to spot these is important too!), and
          | switch them to Cuda. If you don't have any, make a toy case,
          | e.g. an n-body simulation, as sketched below.
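          | 
          | A sketch of the core of such a toy n-body step (one thread per
          | body, naive all-pairs gravity, unit masses assumed):
          | 
          |   __global__
          |   void accel(const float3* pos, float3* acc, int n) {
          |       int i = blockIdx.x * blockDim.x + threadIdx.x;
          |       if (i >= n) return;
          |       float3 a = make_float3(0.f, 0.f, 0.f);
          |       for (int j = 0; j < n; j++) {  // every body pulls on i
          |           float dx = pos[j].x - pos[i].x;
          |           float dy = pos[j].y - pos[i].y;
          |           float dz = pos[j].z - pos[i].z;
          |           // softening term avoids divide-by-zero at j == i
          |           float r2 = dx*dx + dy*dy + dz*dz + 1e-9f;
          |           float inv_r = rsqrtf(r2);
          |           float s = inv_r * inv_r * inv_r;  // 1/r^3
          |           a.x += dx * s; a.y += dy * s; a.z += dz * s;
          |       }
          |       acc[i] = a;
          |   }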
        
       | lokimedes wrote:
       | There's a couple of "concerns" you may separate to make this a
       | bit more tractable:
       | 
       | 1. Learning CUDA - the framework, libraries and high-layer
       | wrappers. This is something that changes with times and trends.
       | 
       | 2. Learning high-performance computing approaches. While a GPU
       | and the Nvlink interfaces are Nvidia specific, working in a
       | massively-parallel distributed computing environment is a general
       | branch of knowledge that is translatable across HPC
       | architectures.
       | 
       | 3. Application specifics. If your thing is Transformers, you may
       | just as well start from Torch, Tensorflow, etc. and rely on the
       | current high-level abstractions, to inspire your learning down to
       | the fundamentals.
       | 
       | I'm no longer active in any of the above, so I can't be more
       | specific, but if you want to master CUDA, I would say learning
       | how massive-parallel programming works, is the foundation that
       | may translate into transferable skills.
        
         | rramadass wrote:
          | This is the right approach. Without (2), trying to learn (1)
          | will just lead to "confusion worse confounded". I also have a
          | book recommendation here -
         | https://news.ycombinator.com/item?id=44216478
        
           | lokimedes wrote:
           | This one was my go-to for HPC, but it may be a bit dated by
           | now: https://www.amazon.com/Introduction-Performance-
           | Computing-Sc...
        
             | rramadass wrote:
              | That's a good book too (I have it), but more general than
              | the Ridgway Scott book, which uses examples from Numerical
             | Computation domains. Here is an overview of the chapters;
             | example domains start from chapter 10 onwards -
             | https://www.jstor.org/stable/j.ctv1ddcxfs
             | 
             | These sort of books are only "dated" when it comes to
             | specific languages/frameworks/libraries. The
             | methods/techniques are evergreen and often conceptually
             | better explained in these older books.
             | 
              | For recent, up-to-date works on HPC, the free multi-volume
              | _The Art of High Performance Computing_ by Victor Eijkhout
              | can't be beat -
             | https://news.ycombinator.com/item?id=38815334
        
           | jonas21 wrote:
           | I think it depends on your learning style. For me, learning
           | something with a concrete implementation and code that you
           | can play around with is a lot easier than trying to study the
           | abstract general concepts first. Once you have some
           | experience with the code, you start asking why things are
           | done a certain way, and that naturally leads to the more
           | general concepts.
        
       | throwaway81523 wrote:
       | I looked at the CUDA code for Leela Chess Zero and found it
       | pretty understandable, though that was back when Leela used a
       | DCNN instead of transformers. DCNN's are fairly simple and are
       | explained in fast.ai videos that I watched a few years ago, so
       | navigating the Leela code wasn't too difficult. Transformers are
       | more complicated and I want to bone up on them, but I haven't
       | managed to spend any time understanding them.
       | 
       | CUDA itself is just a minor departure from C++, so the language
       | itself is no big deal if you've used C++ before. But, if you're
       | trying to get hired programming CUDA, what that really means is
       | they want you implementing AI stuff (unless it's game dev). AI
       | programming is a much wider and deeper subject than CUDA itself,
       | so be ready to spend a bunch of time studying and hacking to come
       | up to speed in that. But if you do, you will be in high demand.
       | As mentioned, the fast.ai videos are a great introduction.
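        | 
        | To give a feel for how small the departure is, here is SAXPY as
        | a sketch: ordinary C++ plus a function qualifier, built-in
        | thread indices, and a launch syntax:
        | 
        |   __global__
        |   void saxpy(int n, float a, const float* x, float* y) {
        |       int i = blockIdx.x * blockDim.x + threadIdx.x;
        |       if (i < n) y[i] = a * x[i] + y[i];
        |   }
        |   // called from otherwise ordinary host C++:
        |   //   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);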
       | 
       | In the case of games, that means 3D graphics which these days is
       | another rabbit hole. I knew a bit about this back in the day, but
       | it is fantastically more sophisticated now and I don't have any
       | idea where to even start.
        
         | upmind wrote:
          | This is a great idea! This is the code, right?
         | https://github.com/leela-zero/leela-zero
         | 
          | I have two beginner (and probably very dumb) questions. First,
          | why do they have heavy c++/cuda usage rather than using only
          | pytorch/tensorflow? Are those too slow for training Leela?
         | Second, why is there tensorflow code?
        
           | throwaway81523 wrote:
           | As I remember, the CUDA code was about 3x faster than the
           | tensorflow code. The tensorflow stuff is there for non-Nvidia
           | GPU's. This was in the era of the GTX 1080 or 2080. No idea
           | about now.
        
             | upmind wrote:
             | Ah I see, thanks a lot!
        
           | henrikf wrote:
           | That's Leela Zero (plays Go instead of Chess). It was good
           | for its time (~2018) but it's quite outdated now. It also
           | uses OpenCL instead of Cuda. I wrote a lot of that code
           | including Winograd convolution routines.
           | 
           | Leela Chess Zero (https://github.com/LeelaChessZero/lc0) has
           | much more optimized Cuda backend targeting modern GPU
           | architectures and it's written by much more knowledgeable
           | people than me. That would be a much better source to learn.
        
         | robotnikman wrote:
         | >But if you do, you will be in high demand
         | 
         | So I'm guessing trying to find a job as a CUDA programmer is
          | nowhere near as big of a headache as with other software
         | engineering jobs right now? I'm thinking maybe learning CUDA
          | and more about AI might be a good pivot from my current
         | position as a Java middleware developer.
        
       | Onavo wrote:
       | Assuming you are asking this because of the deep learning/ChatGPT
       | hype, the first question you should ask yourself is, do you
       | really need to? The skills needed for CUDA are completely
       | unrelated to building machine learning models. It's like learning
       | to make a TLS library so you can get a full stack web development
       | job. The skills are completely orthogonal. CUDA belongs to the
       | domain of game developers, graphics people, high performance
       | computing and computer engineers (hardware). From the point of
       | view of machine learning development and research, it's nothing
       | more than an implementation detail.
       | 
       | Make sure you are very clear on what you want. Most HR
       | departments cast a wide net (it's like how every junior role
       | requires "3-5 years of experience" when in reality they don't
       | _really_ care). Similarly when hiring, most companies pray for
       | the unicorn developer who can understand the entire stack from
       | the GPU to the end user product domain when the day to day is
       | mostly in Python.
        
       | ForgotIdAgain wrote:
        | I have not tried it yet, but it seems nice: https://leetgpu.com/
        
       | elashri wrote:
        | I will share my personal experience learning CUDA, which might
        | be helpful.
       | 
        | Disclaimer: I don't claim that this is a systematic way to learn
        | it, and it is geared more toward academic work.
       | 
        | I got assigned to a project that required learning CUDA as part
        | of my PhD. There was no one in my research group who had any
        | experience with or knew CUDA. I started with the standard NVIDIA
        | courses (Getting Started with Accelerated Computing with CUDA
        | C/C++; there is a python version too).
       | 
        | This gave me a good introduction to the concepts and basic
        | ideas, but I think after that I did most of my learning by trial
        | and error. I tried a couple of online tutorials for specific
        | things, and some books, but there was always a deprecated
        | function here or there, or a change of API that made things
        | obsolete. Or things had simply changed for your GPU, and you
        | have to be careful because the GPU you develop on might not be
        | compatible with the one used in production, and you need things
        | to work on both.
       | 
        | I think learning CUDA for me is an endeavor of pain and going
        | through "compute-sanitizer" and Nsight, because you will find
        | that most of your time will go into debugging why things are
        | running slower than you expect.
       | 
        | Take things slowly. Take a simple project that you know how to
        | do without CUDA, then port it to CUDA, benchmark it against the
        | CPU, and try to optimize different aspects of it.
       | 
        | One piece of advice that can be helpful: don't think about
        | optimization at the beginning. Start with correct, then
        | optimize. A working slow kernel beats a fast kernel that
        | corrupts memory.
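        | 
        | In that spirit, a minimal sketch of the usual correctness
        | scaffolding: wrap every API call in an error check, and run the
        | binary under "compute-sanitizer ./your_app" to catch bad memory
        | accesses in kernels:
        | 
        |   #include <cstdio>
        |   #include <cstdlib>
        |   #include <cuda_runtime.h>
        | 
        |   // Fail fast on API errors instead of computing garbage.
        |   #define CUDA_CHECK(call)                                  \
        |       do {                                                  \
        |           cudaError_t e_ = (call);                          \
        |           if (e_ != cudaSuccess) {                          \
        |               fprintf(stderr, "%s:%d: %s\n", __FILE__,      \
        |                       __LINE__, cudaGetErrorString(e_));    \
        |               exit(EXIT_FAILURE);                           \
        |           }                                                 \
        |       } while (0)
        | 
        |   // Usage:
        |   //   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
        |   //   kernel<<<grid, block>>>(args);
        |   //   CUDA_CHECK(cudaGetLastError());       // launch errors
        |   //   CUDA_CHECK(cudaDeviceSynchronize());  // async errors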
        
         | kevmo314 wrote:
          | > I think learning CUDA for me is an endeavor of pain and
          | going through "compute-sanitizer" and Nsight, because you will
          | find that most of your time will go into debugging why things
          | are running slower than you expect.
         | 
         | This is so true it hurts.
        
         | korbip wrote:
          | I can share a similar PhD story (the result being visible here:
          | https://github.com/NX-AI/flashrnn). Back then I didn't find any
          | tutorials that cover anything beyond the basics (which are
          | still important). Once you have understood the principal
          | working mode and architecture of a GPU, I would recommend the
          | following workflow:
          | 
          | 1. First create an environment so that you can actually test
          | your kernels against baselines written in a higher-level
          | language.
          | 
          | 2. If you don't have an urgent project already, try to
          | improve/re-implement existing problems (MatMul being the first
          | example). Don't get caught up in wanting to implement all size
          | cases. Take an example just to learn a certain functionality,
          | rather than solving the whole problem, if it's just about
          | learning.
          | 
          | 3. Write the functionality you want to have in increasing
          | complexity. Write loops first, then parallelize these loops
          | over the grid (see the sketch below). Use global memory first,
          | then put things into shared memory and registers. Use plain
          | matrix multiplication first, then use mma (TensorCore)
          | primitives to speed things up.
          | 
          | 4. Iterate over the CUDA C Programming Guide. It covers all
          | (most) of the functionality that you want to learn - but it
          | can't just be read and memorized. When you apply it, you learn
          | it.
          | 
          | 5. Might depend on your use-case, but also consider using
          | higher-level abstractions like CUTLASS or ThunderKittens.
          | Also, if your environment is jax/torch, use triton first
          | before going to CUDA level.
          | 
          | Overall, it will be some pain for sure. And to master it,
          | including PTX etc., will take a lot of time.
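          | 
          | A sketch of the step-3 progression, from a plain loop to the
          | same loop parallelized over the grid (a grid-stride loop):
          | 
          |   // loop version (single thread):
          |   //   for (int i = 0; i < n; i++) out[i] = in[i] * in[i];
          | 
          |   // grid version: each thread starts at its global index and
          |   // strides by the total thread count, so any grid size
          |   // covers any n
          |   __global__
          |   void square(const float* in, float* out, int n) {
          |       for (int i = blockIdx.x * blockDim.x + threadIdx.x;
          |            i < n; i += gridDim.x * blockDim.x)
          |           out[i] = in[i] * in[i];
          |   }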
        
       | imjonse wrote:
       | These should keep you busy for months:
       | 
        | - https://www.gpumode.com/ - resources and discord community
        | 
        | - Book: Programming Massively Parallel Processors
        | 
        | - the nvidia cuda docs are very comprehensive too
        | 
        | - https://github.com/srush/GPU-Puzzles
        
         | amelius wrote:
         | This follows a "winner takes all" scenario. I see the
         | differences between the submissions are not so large, often
         | smaller than 1%. Kind of pointless to work on this, if you ask
         | me.
        
         | mdaniel wrote:
         | Wowzers, the line noise
         | 
         | https://github.com/HazyResearch/ThunderKittens#:~:text=here%...
        
       | weinzierl wrote:
       | Nvidia itself has a paid course series. It is a bit older but I
        | believe still relevant. I have bought it, but not started it
        | yet. I intend to do so during the summer holidays.
        
       | rramadass wrote:
       | CUDA GPGPU programming was invented to solve certain classes of
       | parallel problems. So studying these problems will give you
       | greater insight into CUDA based parallel programming. I suggest
       | reading the following old book along with your CUDA resources.
       | 
        |  _Scientific Parallel Computing_ by L. Ridgway Scott et al. -
       | https://press.princeton.edu/books/hardcover/9780691119359/sc...
        
       | izharkhan wrote:
        | How to do hacking?
        
       | epirogov wrote:
        | I bought a P106-90 for $20 and started porting my data apps to
        | parallel processing with it.
        
       | majke wrote:
        | I had a bit of limited exposure to cuda. It was before the AI
       | boom, during Covid.
       | 
       | I found it easy to start. Then there was a pretty nice learning
       | curve to get to warps, SM's and basic concepts. Then I was able
       | to dig deeper into the integer opcodes, which was super cool. I
        | was able to optimize the compute part pretty well, without many
        | roadblocks.
       | 
       | However, getting memory loads perfect and then getting closer to
       | hw (warp groups, divergence, the L2 cache split thing,
       | scheduling), was pretty hard.
       | 
       | I'd say CUDA is pretty nice/fun to start with, and it's possible
        | to get quite far as a novice programmer. However, getting deeper
        | and achieving a real advantage over the CPU is hard.
       | 
       | Additionally there is a problem with Nvidia segmenting the market
       | - some opcodes are present in _old_ gpu's (CUDA arch is _not_
        | forwards compatible). Some opcodes are reserved for "AI" chips
       | (like H100). So, to get code that is fast on both H100 and
       | RTX5090 is super hard. Add to that a fact that each card has
       | different SM count and memory capacity and bandwidth... and you
       | end up with an impossible compatibility matrix.
       | 
        | TLDR: Beginnings are nice and fun. You can get quite far on
        | optimizing the compute part. But getting compatibility across
        | different chips, and getting memory access right, is hard. When
        | you start, choose a specific problem, a specific chip, a
        | specific instruction set.
        
       | tkuraku wrote:
       | I think you just pick a problem you want to solve with gpu
        | programming and go for it, learning what you need along the way.
        | Nvidia blog posts are great for picking things up as you go,
        | such as https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-
       | kern...
        
       | sputknick wrote:
       | I used this to teach high school students. Probably not
       | sufficient to get what you want, but it should get you off the
       | ground and you can run from there.
       | https://youtu.be/86FAWCzIe_4?si=buqdqREWASNPbMQy
        
       | indianmouse wrote:
        | As a very early CUDA programmer, I participated in the CUDA
        | contest from NVidia during 2008 with, I believe, one of the only
        | entries submitted from India (though I'm not claiming that for
        | certain), and got a consolation and participation prize of a
        | BlackEdition card. I can vouch for the method which I followed.
       | 
       | - Look up the CUDA Programming Guide from NVidia
       | 
       | - CUDA Programming books from NVidia from
       | developer.nvidia.com/cuda-books-archive link
       | 
        | - Start creating small programs based on the existing
        | implementations (strong knowledge of C is required, so brush up
        | if needed)
       | 
       | - Install the required Toolchains, compilers, and I am assuming
       | you have the necessary hardware to play around
       | 
        | - Github links with CUDA projects. Read the code; you can now
        | use an LLM to explain the code in whatever way you need
       | 
       | - Start creating smaller, yet parallel programs etc., etc.,
       | 
       | And in about a month or two, you should have enough to start
       | writing CUDA programs.
       | 
       | I'm not aware of the skill / experience levels you have, but
        | whatever it might be, there are plenty more sources and
        | resources available now than there were in 2007/08.
       | 
       | Create a 6-8 weeks of study plan and you should be flying soon!
       | 
       | Hope it helps.
       | 
       | Feel free to comment and I can share whatever I could to guide.
        
         | hiq wrote:
         | > I am assuming you have the necessary hardware to play around
         | 
          | Can you expand on that? Is it enough to have an nvidia
          | graphics card that's like 5 years old, or do you need
          | something more specific?
        
           | rahimnathwani wrote:
           | I'm not a CUDA programmer, but AIUI:
           | 
           | - you will want to install the latest version of CUDA Toolkit
           | (12.9.1)
           | 
           | - each version of CUDA Toolkit requires the card driver to be
           | above a certain version (e.g. toolkit depends on driver
           | version 576 or above)
           | 
           | - older cards often have recent drivers, e.g. the current
           | version of CUDA Toolkit will work with a GTX 1080, as it has
           | a recent (576.x) driver
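            | 
            | A quick way to check the pairing from code (a sketch; both
            | calls are part of the CUDA runtime API):
            | 
            |   #include <cstdio>
            |   #include <cuda_runtime.h>
            | 
            |   int main() {
            |       int drv = 0, rt = 0;
            |       // highest CUDA version the installed driver supports
            |       cudaDriverGetVersion(&drv);
            |       // CUDA version of the runtime this binary linked
            |       cudaRuntimeGetVersion(&rt);
            |       printf("driver: CUDA %d.%d, runtime: CUDA %d.%d\n",
            |              drv / 1000, (drv % 100) / 10,
            |              rt / 1000, (rt % 100) / 10);
            |       return 0;
            |   }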
        
           | slt2021 wrote:
           | each nVidia GPU has a certain Compute Capability
           | (https://developer.nvidia.com/cuda-gpus).
           | 
           | Depending on the model and age of your GPU, it will have a
           | certain capability that will be the hard ceiling for what you
           | can program using CUDA
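            | 
            | You can query the compute capability at runtime (a sketch
            | using the standard device-properties call):
            | 
            |   #include <cstdio>
            |   #include <cuda_runtime.h>
            | 
            |   int main() {
            |       int count = 0;
            |       cudaGetDeviceCount(&count);
            |       for (int i = 0; i < count; i++) {
            |           cudaDeviceProp prop;
            |           cudaGetDeviceProperties(&prop, i);
            |           printf("GPU %d: %s (compute capability %d.%d)\n",
            |                  i, prop.name, prop.major, prop.minor);
            |       }
            |       return 0;
            |   }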
        
             | dpe82 wrote:
              | When you're just getting started and learning, that won't
             | matter though. Any Nvidia card from the last 10 years
             | should be fine.
        
             | sanderjd wrote:
             | Recognizing that this won't result in any useful
             | benchmarks, is there a way to emulate an nvidia gpu? In a
             | docker container, for instance?
        
         | edge17 wrote:
         | What environment do you use? Is it still the case that Windows
         | is the main development environment for cuda?
        
       | mekpro wrote:
       | To professionals in the field, I have a question: what jobs,
       | positions, and companies are in need of CUDA engineers? My
       | current understanding is that while many companies use CUDA's by-
       | products (like PyTorch), direct CUDA development seems less
       | prevalent. I'm therefore seeking to identify more companies and
       | roles that heavily rely on CUDA.
        
         | kloop wrote:
         | My team uses it for geospatial data. We rasterize slippy map
         | tiles and then do a raster summary on the gpu.
         | 
         | It's a weird case, but the pixels can be processed
         | independently for most of it, so it works pretty well. Then the
         | rows can be summarized in parallel and rolled up at the end.
         | The copy onto the gpu is our current bottleneck however.
        
       | SoftTalker wrote:
       | It's 2025. Get with the times, ask Claude to do it, and then ask
       | it to explain it to you as if you're an engineer who needs to
       | convince a hiring manager that you understand it.
        
         | rakel_rakel wrote:
         | Might work in 2025, 2026 will demand more.
        
       | math_dandy wrote:
       | Are there any GPU emulators you can use to run simple CUDA
        | programs on a commodity laptop, just to get comfortable with the
       | mechanics, the toolchain, etc.?
        
         | gkbrk wrote:
         | Commodity laptops can just use regular non-emulated CUDA if
         | they have an Nvidia GPU. It's not just for datacenter GPUs, a
         | ton of regular consumer GPUs are also supported.
        
           | bee_rider wrote:
            | A commodity laptop doesn't have a discrete GPU these days;
            | iGPUs are good enough for basic tasks.
        
         | corysama wrote:
         | https://leetgpu.com/ emulates running simple CUDA programs in a
         | web page with zero setup. It's a good way to get your toes wet.
        
         | throwaway81523 wrote:
          | You can get a VPS with GPUs these days, not super cheap, but
         | affordable for those in the industry.
        
       | sremani wrote:
       | The book - PMPP - Programming Massively Parallel Processors
       | 
        | The YouTube channel - CUDA_MODE - is based on PMPP. I could not
        | find the channel, but here is the playlist:
       | https://www.youtube.com/watch?v=LuhJEEJQgUM&list=PLVEjdmwEDk...
       | 
        | Once done, you would be on a solid foundation.
        
       | fifilura wrote:
       | I am not a CUDA programmer but when looking at this, I think I
       | can see the parallels to Spark and SQL
       | 
       | https://gfxcourses.stanford.edu/cs149/fall24/lecture/datapar...
       | 
       | So - start getting used to programming without using for loops,
       | would be my tip.
        
       | gdubs wrote:
       | I like to learn through projects, and as a graphics guy I love
       | the GPU Gems series. Things like:
       | 
       | https://developer.nvidia.com/gpugems/gpugems3/part-v-physics...
       | 
       | As an Apple platforms developer I actually worked through those
       | books to figure out how to convert the CUDA stuff to Metal, which
       | helped the material click even more.
       | 
       | Part of why I did it was - and this was some years back - I
       | wanted to sharpen my thinking around parallel approaches to
       | problem solving, given how central those algorithms and ways of
       | thinking are to things like ML and not just game development,
       | etc.
        
       | canyp wrote:
        | My 2 cents: "Learning CUDA" is not the interesting bit. Rather,
        | you want to learn two things: 1) GPU hardware architecture, 2)
       | parallelizing algorithms. For CUDA specifically, there is the
       | book CUDA Programming Guide from Nvidia, which will teach you the
       | basics of the language. But what these jobs typically require is
        | that you know how to parallelize an algorithm and squeeze the
        | most out of the hardware.
        
       | matt3210 wrote:
       | Just make cool stuff. Find people to code review. I learn way
       | more during code reviews than anything else.
        
       | alecco wrote:
       | Ignore everybody else. Start with CUDA Thrust. Study carefully
       | their examples. See how other projects use Thrust. After a year
       | or two, go deeper to cub.
       | 
        | Do not implement algorithms by hand. On recent architectures it
        | is extremely hard to reach decent occupancy and such. Thrust and
        | cub
       | solve 80% of the cases with reasonable trade-offs and they do
       | most of the work for you.
       | 
       | https://developer.nvidia.com/thrust
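        | 
        | For a flavor of the style, a sketch using standard Thrust
        | primitives (fill, sort, reduce, all running on the GPU):
        | 
        |   #include <thrust/device_vector.h>
        |   #include <thrust/functional.h>
        |   #include <thrust/reduce.h>
        |   #include <thrust/sequence.h>
        |   #include <thrust/sort.h>
        |   #include <cstdio>
        | 
        |   int main() {
        |       thrust::device_vector<int> v(1 << 20);
        |       thrust::sequence(v.begin(), v.end());  // 0, 1, 2, ...
        |       // sort descending; 0LL makes the reduction 64-bit
        |       thrust::sort(v.begin(), v.end(), thrust::greater<int>());
        |       long long sum = thrust::reduce(v.begin(), v.end(), 0LL);
        |       printf("first = %d, sum = %lld\n", (int)v[0], sum);
        |       return 0;
        |   }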
        
         | bee_rider wrote:
         | It looks quite nice just from skimming the link.
         | 
         | But, I don't understand the comparison to TBB. Do they have a
         | version of TBB that runs on the GPU natively? If the TBB
         | implementation is on the CPU... that's just comparing two
         | different pieces of hardware. Which would be confusing,
         | bordering on dishonest.
        
       | brudgers wrote:
       | For better or worse, direct professional experience in a
       | professional setting is the only way to learn anything to a
       | professional level.
       | 
        | That doesn't mean one-eyed-king knowledge is never enough to
        | solve that chicken-and-egg problem. You only have to be good
        | enough to get the job.
       | 
       | But if you haven't done it on the job, you don't have work
       | experience and you are either lying to others or lying to
       | yourself...and any sophisticated organization won't fall for
       | it...
       | 
       | ...except of course, knowingly. And the best way to get someone
       | to knowingly ignore obvious dunning-kruger and/or horseshit is to
       | know that someone personally or professionally.
       | 
       | Which is to say that the best way to get a good job is to have a
       | good relationship with someone who can hire you for a good job
       | (nepotism trumps technical ability, always). And the best way to
       | find a good job is to know a lot of people who want to work with
       | you.
       | 
       | To put it another way, looking for a job is the only way to find
       | a job and looking for a job is also much much harder than
        | all the activities (like studying CUDA) that avoid looking for
        | a job while pretending to be preparation...because again,
        | studying CUDA won't ever give you professional experience.
       | 
       | Don't get me wrong, there's nothing wrong with learning CUDA all
       | on your own. But it is not professional experience and it is not
       | looking for a job doing CUDA.
       | 
       | Finally, if you want to learn CUDA just learn it for its own sake
       | without worrying about a job. Learning things for their own sake
       | is the nature of learning once you get out of school.
       | 
       | Good luck.
        
       ___________________________________________________________________
       (page generated 2025-06-08 23:00 UTC)