[HN Gopher] Ask HN: How to learn CUDA to professional level
___________________________________________________________________
Ask HN: How to learn CUDA to professional level
Hi all, I was wondering what books/courses/projects one might do to
learn CUDA programming. (To be frank, the main reason is that a lot of
companies I wish to work for require CUDA experience -- this
shouldn't change your answers, hopefully; just wanted to provide
some context.)
Author : upmind
Score : 178 points
Date : 2025-06-08 10:52 UTC (12 hours ago)
| dist-epoch wrote:
| As they typically say: Just Do It (tm).
|
| Start writing some CUDA code to sort an array or find the maximum
| element.
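|
| As a rough illustration (my own sketch, not from the thread), the
| "find the maximum element" exercise could look roughly like this:
| each block reduces its chunk in shared memory and the CPU finishes
| the job. It assumes an NVIDIA GPU plus the CUDA Toolkit, and the
| names and sizes are arbitrary.
|
|     #include <cstdio>
|     #include <cmath>
|     #include <cfloat>
|     #include <cuda_runtime.h>
|
|     // Each block reduces its chunk of the input to one partial maximum.
|     __global__ void maxKernel(const float* in, float* blockMax, int n) {
|         extern __shared__ float smem[];
|         int tid = threadIdx.x;
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         smem[tid] = (i < n) ? in[i] : -FLT_MAX;
|         __syncthreads();
|         // Tree reduction in shared memory.
|         for (int s = blockDim.x / 2; s > 0; s >>= 1) {
|             if (tid < s) smem[tid] = fmaxf(smem[tid], smem[tid + s]);
|             __syncthreads();
|         }
|         if (tid == 0) blockMax[blockIdx.x] = smem[0];
|     }
|
|     int main() {
|         const int n = 1 << 20, threads = 256;
|         const int blocks = (n + threads - 1) / threads;
|         float *in, *blockMax;
|         cudaMallocManaged(&in, n * sizeof(float));
|         cudaMallocManaged(&blockMax, blocks * sizeof(float));
|         for (int i = 0; i < n; i++) in[i] = (float)(i % 9973);
|
|         maxKernel<<<blocks, threads, threads * sizeof(float)>>>(in, blockMax, n);
|         cudaDeviceSynchronize();
|
|         float m = blockMax[0];              // finish the last step on the CPU
|         for (int b = 1; b < blocks; b++) m = fmaxf(m, blockMax[b]);
|         printf("max = %f\n", m);
|
|         cudaFree(in); cudaFree(blockMax);
|         return 0;
|     }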
| amelius wrote:
| I'd rather learn to use a library that works on any brand of
| GPU.
|
| If that is not an option, I'll wait!
| pjmlp wrote:
| If only Khronos and the competition cared about the developer
| experience....
| the__alchemist wrote:
| This is continuously a point of frustration! Vulkan compute
| is... suboptimal. I use Cuda because it feels like the only
| practical option. I want Vulkan or something else to
| compete seriously, but until that happens, I will use Cuda.
| pjmlp wrote:
| It took until Vulkanised 2025 for Khronos to acknowledge that
| Vulkan had become the same mess as OpenGL, and to put an action
| plan in place to try to correct this.
|
| Had it not been for Apple's initial OpenCL contribution
| (regardless of how it went from there), AMD's Mantle as the
| starting point for Vulkan, and NVidia's Vulkan-Hpp and Slang,
| the ecosystem of Khronos standards would be much worse.
|
| Also, Vulkan tooling isn't as bad as OpenGL tooling, because
| LunarG exists and someone pays them to maintain the whole
| Vulkan SDK.
|
| The attitude "we put paper standards" and the community
| should step in for the implementations and tooling,
| hardly comes to the productivity from private APIs
| tooling.
|
| All GPU vendors, including Intel and AMD, would also rather
| push their own compute APIs, even if they are based on top of
| Khronos ones.
| corysama wrote:
| Is https://github.com/KomputeProject/kompute +
| https://shader-slang.org/ getting there?
|
| Runs on anything + auto-differentiation.
| Cloudef wrote:
| Both Zig and Rust are aiming to compile to GPUs natively.
| What CUDA and HIP provide is a heterogeneous computing runtime,
| i.e. hiding the boilerplate of executing code on the CPU and GPU
| seamlessly.
| latchkey wrote:
| Then learn PyTorch.
|
| The hardware between brands is fundamentally different. There
| isn't a standard like x86 for CPUs.
|
| So, while you may use something like HIPIFY to translate your
| code between APIs, with GPU programming it makes sense to learn
| how they differ from each other, or just pick one of them and
| work with it, knowing that the others will just be some variation
| of the same idea.
| horsellama wrote:
| the jobs requiring cuda experience exist, most of the time,
| because torch is not good enough
| uecker wrote:
| GCC / clang also have support for offloading.
| moralestapia wrote:
| K, bud.
|
| Perhaps you haven't noticed, but you're in a thread that
| asked about CUDA, explicitly.
| the__alchemist wrote:
| I concur with this. Then supplement with resources A/R.
| Ideally, find some tasks in your programs that are parallelizable
| (learning to spot these is important too!), and switch them to
| Cuda. If you don't have any, make a toy case, e.g. an n-body
| simulation.
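|
| For the n-body toy case, the core of it might be a brute-force
| kernel along these lines (a hedged sketch rather than a tuned
| implementation: one thread per body, global memory only, with a
| small softening term):
|
|     // Hypothetical O(N^2) n-body step: thread i accumulates the
|     // acceleration on body i from every body, then does a simple
|     // Euler velocity update.
|     __global__ void nbodyStep(const float3* pos, float3* vel, int n, float dt) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i >= n) return;
|         float3 a = make_float3(0.0f, 0.0f, 0.0f);
|         for (int j = 0; j < n; j++) {
|             float dx = pos[j].x - pos[i].x;
|             float dy = pos[j].y - pos[i].y;
|             float dz = pos[j].z - pos[i].z;
|             float distSqr = dx * dx + dy * dy + dz * dz + 1e-6f;  // softening
|             float invDist = rsqrtf(distSqr);
|             float invDist3 = invDist * invDist * invDist;
|             a.x += dx * invDist3;
|             a.y += dy * invDist3;
|             a.z += dz * invDist3;
|         }
|         vel[i].x += a.x * dt;
|         vel[i].y += a.y * dt;
|         vel[i].z += a.z * dt;
|     }
|
| Tiling the positions through shared memory is the classic next
| optimization step; GPU Gems 3 (mentioned elsewhere in this thread)
| has a chapter on exactly this problem.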
| lokimedes wrote:
| There's a couple of "concerns" you may separate to make this a
| bit more tractable:
|
| 1. Learning CUDA - the framework, libraries and high-layer
| wrappers. This is something that changes with times and trends.
|
| 2. Learning high-performance computing approaches. While a GPU
| and the Nvlink interfaces are Nvidia specific, working in a
| massively-parallel distributed computing environment is a general
| branch of knowledge that is translatable across HPC
| architectures.
|
| 3. Application specifics. If your thing is Transformers, you may
| just as well start from Torch, TensorFlow, etc. and rely on the
| current high-level abstractions to inspire your learning down to
| the fundamentals.
|
| I'm no longer active in any of the above, so I can't be more
| specific, but if you want to master CUDA, I would say that learning
| how massively-parallel programming works is the foundation that
| translates into transferable skills.
| rramadass wrote:
| This is the right approach. Without (2) trying to learn (1)
| will just lead to "confusion worse confounded". I also suggest
| a book recommendation here -
| https://news.ycombinator.com/item?id=44216478
| lokimedes wrote:
| This one was my go-to for HPC, but it may be a bit dated by
| now: https://www.amazon.com/Introduction-Performance-
| Computing-Sc...
| rramadass wrote:
| That's a good book too (i have it) but more general than
| the Ridgway Scott book which uses examples from Numerical
| Computation domains. Here is an overview of the chapters;
| example domains start from chapter 10 onwards -
| https://www.jstor.org/stable/j.ctv1ddcxfs
|
| These sort of books are only "dated" when it comes to
| specific languages/frameworks/libraries. The
| methods/techniques are evergreen and often conceptually
| better explained in these older books.
|
| For recent, up-to-date works on HPC, the free multi-volume
| _The Art of High Performance Computing_ by Victor Eijkhout
| can't be beat -
| https://news.ycombinator.com/item?id=38815334
| jonas21 wrote:
| I think it depends on your learning style. For me, learning
| something with a concrete implementation and code that you
| can play around with is a lot easier than trying to study the
| abstract general concepts first. Once you have some
| experience with the code, you start asking why things are
| done a certain way, and that naturally leads to the more
| general concepts.
| throwaway81523 wrote:
| I looked at the CUDA code for Leela Chess Zero and found it
| pretty understandable, though that was back when Leela used a
| DCNN instead of transformers. DCNN's are fairly simple and are
| explained in fast.ai videos that I watched a few years ago, so
| navigating the Leela code wasn't too difficult. Transformers are
| more complicated and I want to bone up on them, but I haven't
| managed to spend any time understanding them.
|
| CUDA itself is just a minor departure from C++, so the language
| itself is no big deal if you've used C++ before. But, if you're
| trying to get hired programming CUDA, what that really means is
| they want you implementing AI stuff (unless it's game dev). AI
| programming is a much wider and deeper subject than CUDA itself,
| so be ready to spend a bunch of time studying and hacking to come
| up to speed in that. But if you do, you will be in high demand.
| As mentioned, the fast.ai videos are a great introduction.
|
| In the case of games, that means 3D graphics which these days is
| another rabbit hole. I knew a bit about this back in the day, but
| it is fantastically more sophisticated now and I don't have any
| idea where to even start.
| upmind wrote:
| This is a great idea! This is the code, right?
| https://github.com/leela-zero/leela-zero
|
| I have two beginner (and probably very dumb) questions: why do
| they have heavy c++/cuda usage rather than using only
| pytorch/tensorflow? Are they too slow for training Leela?
| Second, why is there tensorflow code?
| throwaway81523 wrote:
| As I remember, the CUDA code was about 3x faster than the
| tensorflow code. The tensorflow stuff is there for non-Nvidia
| GPU's. This was in the era of the GTX 1080 or 2080. No idea
| about now.
| upmind wrote:
| Ah I see, thanks a lot!
| henrikf wrote:
| That's Leela Zero (plays Go instead of Chess). It was good
| for its time (~2018) but it's quite outdated now. It also
| uses OpenCL instead of Cuda. I wrote a lot of that code
| including Winograd convolution routines.
|
| Leela Chess Zero (https://github.com/LeelaChessZero/lc0) has a
| much more optimized Cuda backend targeting modern GPU
| architectures, and it's written by much more knowledgeable
| people than me. That would be a much better source to learn from.
| robotnikman wrote:
| >But if you do, you will be in high demand
|
| So I'm guessing trying to find a job as a CUDA programmer is
| nowhere near as big of a headache as other software
| engineering jobs right now? I'm thinking maybe learning CUDA
| and more about AI might be a good pivot from my current
| position as a Java middleware developer.
| Onavo wrote:
| Assuming you are asking this because of the deep learning/ChatGPT
| hype, the first question you should ask yourself is, do you
| really need to? The skills needed for CUDA are completely
| unrelated to building machine learning models. It's like learning
| to make a TLS library so you can get a full stack web development
| job. The skills are completely orthogonal. CUDA belongs to the
| domain of game developers, graphics people, high performance
| computing and computer engineers (hardware). From the point of
| view of machine learning development and research, it's nothing
| more than an implementation detail.
|
| Make sure you are very clear on what you want. Most HR
| departments cast a wide net (it's like how every junior role
| requires "3-5 years of experience" when in reality they don't
| _really_ care). Similarly when hiring, most companies pray for
| the unicorn developer who can understand the entire stack from
| the GPU to the end user product domain when the day to day is
| mostly in Python.
| ForgotIdAgain wrote:
| I have not tried it yet, but it seems nice: https://leetgpu.com/
| elashri wrote:
| I will give you personal experience learning CUDA that might be
| helpful.
|
| Disclaimer: I don't claim that this is actually a systematic way
| to learn it, and it is more geared toward academic work.
|
| I got assigned to a project that required learning CUDA as part of
| my PhD. There was no one in my research group who had any
| experience with, or knew, CUDA. I started with the standard NVIDIA
| courses (Getting Started with Accelerated Computing with CUDA
| C/C++; there is a Python version too).
|
| This gave me a good introduction to the concepts and basic ideas,
| but I think after that I did most of my learning by trial and
| error. I tried a couple of online tutorials for specific things and
| some books, but there was always a deprecated function here or
| there, or an API change that made things obsolete. Or things simply
| changed for your GPU, and now you have to be careful because you
| might be using a GPU generation not compatible with what you
| develop for in production, and you need things to work for both.
|
| I think learning CUDA, for me, is an endeavor of pain and going
| through "compute-sanitizer" and Nsight, because you will find that
| most of your time will go into debugging why things are running
| slower than you expect.
|
| Take things slowly. Take a simple project that you know how to do
| without CUDA, then port it to CUDA and benchmark it against the
| CPU, and try to optimize different aspects of it.
|
| One piece of advice that can be helpful is not to think about
| optimization at the beginning. Start with correct, then optimize.
| A working slow kernel beats a fast kernel that corrupts memory.
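|
| To make that "correct first, then benchmark" loop concrete, here is
| a minimal sketch of my own (arbitrary names and sizes) that times a
| trivial kernel with CUDA events and checks it against a CPU
| reference before any tuning:
|
|     #include <cstdio>
|     #include <cstdlib>
|     #include <cmath>
|     #include <cuda_runtime.h>
|
|     __global__ void saxpy(int n, float a, const float* x, float* y) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) y[i] = a * x[i] + y[i];
|     }
|
|     int main() {
|         const int n = 1 << 24;
|         float *x, *y;
|         cudaMallocManaged(&x, n * sizeof(float));
|         cudaMallocManaged(&y, n * sizeof(float));
|         float* ref = (float*)malloc(n * sizeof(float));
|         for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; ref[i] = 2.0f; }
|
|         // CPU reference first: correctness before performance.
|         for (int i = 0; i < n; i++) ref[i] = 3.0f * x[i] + ref[i];
|
|         cudaEvent_t start, stop;
|         cudaEventCreate(&start);
|         cudaEventCreate(&stop);
|         cudaEventRecord(start);
|         saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
|         cudaEventRecord(stop);
|         cudaEventSynchronize(stop);
|         cudaDeviceSynchronize();
|         float ms = 0.0f;
|         cudaEventElapsedTime(&ms, start, stop);
|
|         // Verify against the CPU result before trusting any timings.
|         double maxErr = 0.0;
|         for (int i = 0; i < n; i++)
|             maxErr = fmax(maxErr, fabs((double)y[i] - ref[i]));
|         printf("kernel: %.3f ms, max error vs CPU: %g\n", ms, maxErr);
|
|         free(ref); cudaFree(x); cudaFree(y);
|         return 0;
|     }
|
| Running the same binary under compute-sanitizer, and then under
| Nsight once it is correct, fits the workflow described above.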
| kevmo314 wrote:
| > I think learning CUDA, for me, is an endeavor of pain and going
| through "compute-sanitizer" and Nsight, because you will find
| that most of your time will go into debugging why things are
| running slower than you expect.
|
| This is so true it hurts.
| korbip wrote:
| I can share a similar PhD story (the result being visible here:
| https://github.com/NX-AI/flashrnn). Back then I didn't find any
| tutorials that cover anything beyond the basics (which are
| still important). Once you have understood the principle
| working mode and architecture of a GPU, I would recommend the
| following workflow: 1. First create an environment so that you
| can actually test your kernels against baselines written in a
| higher-level language. 2. If you don't have an urgent project
| already, try to improve/re-implement existing problems (MatMul
| being the first example). Don't get caught by wanting to
| implement all size cases. Take an example just to learn a
| certain functionality, rather than solving the whole problem if
| it's just about learning. 3. Write the functionality you want
| to have in increasing complexity. Write loops first, then
| parallelize these loops over the grid. Use global memory first,
| then put things into shared memory and registers. Use plain
| matrix multiplication first, then use mma (TensorCore)
| primitives to speed things up. 4. Iterate over the CUDA C
| Programming Guide. It covers all (most) of the functionality
| that you want to learn - but can't be just read an memorized.
| When you apply it you learn it. 5. Might depend on you use-case
| but also consider using higher-level abstractions like CUTLASS
| or ThunderKitten. Also, if your environment is jax/torch, use
| triton first before going to CUDA level.
|
| Overall, it will be some pain for sure. And to master it
| including PTX etc. will take a lot of time.
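|
| To illustrate step 3 with a hedged, trivial example of my own
| (scaling an array; not taken from the flashrnn code):
|
|     // "Write loops first": a plain serial loop, run by a single CUDA
|     // thread. Trivially correct, trivially slow -- but a baseline you
|     // can test against.
|     __global__ void scaleSerial(float* x, int n, float a) {
|         for (int i = 0; i < n; i++) x[i] *= a;
|     }
|     // launch: scaleSerial<<<1, 1>>>(x, n, a);
|
|     // "Then parallelize these loops over the grid": one element per
|     // thread, with a bounds check because the grid is rounded up to
|     // whole blocks.
|     __global__ void scaleParallel(float* x, int n, float a) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) x[i] *= a;
|     }
|     // launch: scaleParallel<<<(n + 255) / 256, 256>>>(x, n, a);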
| imjonse wrote:
| These should keep you busy for months:
|
| https://www.gpumode.com/ - resources and Discord community
|
| Book: Programming Massively Parallel Processors
|
| The NVIDIA CUDA docs are very comprehensive too
|
| https://github.com/srush/GPU-Puzzles
| amelius wrote:
| This follows a "winner takes all" scenario. I see the
| differences between the submissions are not so large, often
| smaller than 1%. Kind of pointless to work on this, if you ask
| me.
| mdaniel wrote:
| Wowzers, the line noise
|
| https://github.com/HazyResearch/ThunderKittens#:~:text=here%...
| weinzierl wrote:
| Nvidia itself has a paid course series. It is a bit older but I
| believe still relevant. I have bought it but not started it yet.
| I intend to do so during the summer holidays.
| rramadass wrote:
| CUDA GPGPU programming was invented to solve certain classes of
| parallel problems. So studying these problems will give you
| greater insight into CUDA based parallel programming. I suggest
| reading the following old book along with your CUDA resources.
|
| _Scientific Parallel Computing_ by L. Ridgway Scott et al. -
| https://press.princeton.edu/books/hardcover/9780691119359/sc...
| izharkhan wrote:
| How does one do hacking?
| epirogov wrote:
| I bought a P106-90 for $20 and started porting my date apps to
| parallel processing with it.
| majke wrote:
| I had a bit of limited exposure to cuda. It was before the AI
| boom, during Covid.
|
| I found it easy to start. Then there was a pretty nice learning
| curve to get to warps, SMs and the basic concepts. Then I was able
| to dig deeper into the integer opcodes, which was super cool. I
| was able to optimize the compute part pretty well, without many
| roadblocks.
|
| However, getting memory loads perfect and then getting closer to
| hw (warp groups, divergence, the L2 cache split thing,
| scheduling), was pretty hard.
|
| I'd say CUDA is pretty nice/fun to start with, and it's possible
| to get quite far for a novice programmer. However getting deeper
| and achieving real advantage over CPU is hard.
|
| Additionally there is a problem with Nvidia segmenting the market
| - some opcodes are present only in _old_ gpus (the CUDA arch is
| _not_ forwards compatible). Some opcodes are reserved for "AI"
| chips (like the H100). So getting code that is fast on both an
| H100 and an RTX 5090 is super hard. Add to that the fact that each
| card has a different SM count, memory capacity and bandwidth...
| and you end up with an impossible compatibility matrix.
|
| TLDR: Beginnings are nice and fun. You can get quite far on the
| compute-optimization part. But getting compatibility across
| different chips, and getting memory access right, is hard. When
| you start, choose a specific problem, a specific chip, and a
| specific instruction set.
| tkuraku wrote:
| I think you just pick a problem you want to solve with GPU
| programming and go for it, learning what you need along the way.
| Nvidia blog posts are great for that, such as
| https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kern...
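|
| If memory serves, that pro-tip is the one about grid-stride loops;
| the pattern it advocates looks roughly like this (a sketch, not
| copied from the post), so one kernel works for any problem size
| and any launch configuration:
|
|     // Grid-stride loop: each thread strides through the array instead
|     // of owning exactly one index, so the kernel is correct for any
|     // <<<blocks, threads>>> configuration.
|     __global__ void saxpyGridStride(int n, float a, const float* x, float* y) {
|         for (int i = blockIdx.x * blockDim.x + threadIdx.x;
|              i < n;
|              i += blockDim.x * gridDim.x) {
|             y[i] = a * x[i] + y[i];
|         }
|     }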
| sputknick wrote:
| I used this to teach high school students. Probably not
| sufficient to get what you want, but it should get you off the
| ground and you can run from there.
| https://youtu.be/86FAWCzIe_4?si=buqdqREWASNPbMQy
| indianmouse wrote:
| As a very early CUDA programmer who participated in the CUDA
| contest from NVidia during 2008 -- with, I believe, one of the only
| entries submitted from India (I'm not claiming that, though), which
| got a consolation and participation prize of a BlackEdition card --
| I can vouch for the method which I followed.
|
| - Look up the CUDA Programming Guide from NVidia
|
| - CUDA Programming books from NVidia from
| developer.nvidia.com/cuda-books-archive link
|
| - Start creating small programs based on the existing
| implementations (a strong knowledge of C is required, so brush up
| if needed).
|
| - Install the required Toolchains, compilers, and I am assuming
| you have the necessary hardware to play around
|
| - GitHub links with CUDA projects. Read the code, and now you
| could use an LLM to explain the code in the way you would need.
|
| - Start creating smaller, yet parallel programs etc., etc.,
|
| And in about a month or two, you should have enough to start
| writing CUDA programs.
|
| I'm not aware of the skill / experience levels you have, but
| whatever they might be, there are plenty more sources and
| resources available now than there were in 2007/08.
|
| Create a 6-8 weeks of study plan and you should be flying soon!
|
| Hope it helps.
|
| Feel free to comment and I will share whatever I can to guide you.
| hiq wrote:
| > I am assuming you have the necessary hardware to play around
|
| Can you expand on that? Is it enough to have an nvidia graphics
| card that's, like, 5 years old, or do you need something more
| specific?
| rahimnathwani wrote:
| I'm not a CUDA programmer, but AIUI:
|
| - you will want to install the latest version of CUDA Toolkit
| (12.9.1)
|
| - each version of CUDA Toolkit requires the card driver to be
| above a certain version (e.g. toolkit depends on driver
| version 576 or above)
|
| - older cards often have recent drivers, e.g. the current
| version of CUDA Toolkit will work with a GTX 1080, as it has
| a recent (576.x) driver
| slt2021 wrote:
| each nVidia GPU has a certain Compute Capability
| (https://developer.nvidia.com/cuda-gpus).
|
| Depending on the model and age of your GPU, it will have a
| certain capability that will be the hard ceiling for what you
| can program using CUDA
| dpe82 wrote:
| When you're just getting started and learning, that won't
| matter though. Any Nvidia card from the last 10 years
| should be fine.
| sanderjd wrote:
| Recognizing that this won't result in any useful
| benchmarks, is there a way to emulate an nvidia gpu? In a
| docker container, for instance?
| edge17 wrote:
| What environment do you use? Is it still the case that Windows
| is the main development environment for cuda?
| mekpro wrote:
| To professionals in the field, I have a question: what jobs,
| positions, and companies are in need of CUDA engineers? My
| current understanding is that while many companies use CUDA's by-
| products (like PyTorch), direct CUDA development seems less
| prevalent. I'm therefore seeking to identify more companies and
| roles that heavily rely on CUDA.
| kloop wrote:
| My team uses it for geospatial data. We rasterize slippy map
| tiles and then do a raster summary on the gpu.
|
| It's a weird case, but the pixels can be processed
| independently for most of it, so it works pretty well. Then the
| rows can be summarized in parallel and rolled up at the end.
| The copy onto the gpu is our current bottleneck however.
| SoftTalker wrote:
| It's 2025. Get with the times, ask Claude to do it, and then ask
| it to explain it to you as if you're an engineer who needs to
| convince a hiring manager that you understand it.
| rakel_rakel wrote:
| Might work in 2025, 2026 will demand more.
| math_dandy wrote:
| Are there any GPU emulators you can use to run simple CUDA
| programs on a commodity laptop, just to get comfortable with the
| mechanics, the toolchain, etc.?
| gkbrk wrote:
| Commodity laptops can just use regular non-emulated CUDA if
| they have an Nvidia GPU. It's not just for datacenter GPUs, a
| ton of regular consumer GPUs are also supported.
| bee_rider wrote:
| A commodity laptop doesn't have a discrete GPU these days; iGPUs
| are good enough for basic tasks.
| corysama wrote:
| https://leetgpu.com/ emulates running simple CUDA programs in a
| web page with zero setup. It's a good way to get your toes wet.
| throwaway81523 wrote:
| You can get a VPS with GPUs these days; not super cheap, but
| affordable for those in the industry.
| sremani wrote:
| The book - PMPP - Programming Massively Parallel Processors
|
| The YouTube channel - CUDA_MODE - is based on PMPP. I could not
| find the channel, but here is the playlist:
| https://www.youtube.com/watch?v=LuhJEEJQgUM&list=PLVEjdmwEDk...
|
| Once done, you would be on solid foundation.
| fifilura wrote:
| I am not a CUDA programmer but when looking at this, I think I
| can see the parallels to Spark and SQL
|
| https://gfxcourses.stanford.edu/cs149/fall24/lecture/datapar...
|
| So my tip would be: start getting used to programming without
| using for loops.
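|
| One way to practice that mindset on the CUDA side is Thrust, where
| the whole computation is a single data-parallel expression rather
| than a loop -- roughly the moral equivalent of SELECT SUM(x*x)
| FROM t. A small sketch, assuming the Thrust headers that ship with
| the CUDA Toolkit:
|
|     #include <cstdio>
|     #include <thrust/device_vector.h>
|     #include <thrust/transform_reduce.h>
|     #include <thrust/functional.h>
|
|     // Square each element on the device; no explicit loop anywhere in
|     // user code.
|     struct square {
|         __host__ __device__ float operator()(float x) const { return x * x; }
|     };
|
|     int main() {
|         thrust::device_vector<float> v(1 << 20, 2.0f);
|         float sumSq = thrust::transform_reduce(v.begin(), v.end(),
|                                                square(), 0.0f,
|                                                thrust::plus<float>());
|         printf("sum of squares = %f\n", sumSq);
|         return 0;
|     }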
| gdubs wrote:
| I like to learn through projects, and as a graphics guy I love
| the GPU Gems series. Things like:
|
| https://developer.nvidia.com/gpugems/gpugems3/part-v-physics...
|
| As an Apple platforms developer I actually worked through those
| books to figure out how to convert the CUDA stuff to Metal, which
| helped the material click even more.
|
| Part of why I did it was - and this was some years back - I
| wanted to sharpen my thinking around parallel approaches to
| problem solving, given how central those algorithms and ways of
| thinking are to things like ML and not just game development,
| etc.
| canyp wrote:
| My 2 cents: "Learning CUDA" is not the interest bit. Rather, you
| want to learn two things: 1) GPU hardware architecture, 2)
| parallelizing algorithms. For CUDA specifically, there is the
| book CUDA Programming Guide from Nvidia, which will teach you the
| basics of the language. But what these jobs typically require is
| that you know how to parallelize an algorithm and squeeze the
| most of the hardware.
| matt3210 wrote:
| Just make cool stuff. Find people to code review. I learn way
| more during code reviews than anything else.
| alecco wrote:
| Ignore everybody else. Start with CUDA Thrust. Study carefully
| their examples. See how other projects use Thrust. After a year
| or two, go deeper to cub.
|
| Do not implement algorithms by hand. On recent architectures it is
| extremely hard to reach decent occupancy and such. Thrust and cub
| solve 80% of the cases with reasonable trade-offs, and they do
| most of the work for you.
|
| https://developer.nvidia.com/thrust
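|
| For a flavour of what that looks like in practice, a minimal
| Thrust sketch (sort plus reduce on arbitrary data); Thrust picks
| the kernels and launch configuration for you:
|
|     #include <cstdio>
|     #include <cstdlib>
|     #include <thrust/host_vector.h>
|     #include <thrust/device_vector.h>
|     #include <thrust/sort.h>
|     #include <thrust/reduce.h>
|
|     int main() {
|         // Fill a host vector, copy it to the device, then let Thrust do
|         // the work: no hand-written kernels, no occupancy tuning.
|         thrust::host_vector<int> h(1 << 22);
|         for (size_t i = 0; i < h.size(); i++) h[i] = rand() % 1000;
|
|         thrust::device_vector<int> d = h;        // host-to-device copy
|         thrust::sort(d.begin(), d.end());        // device-side sort
|         long long sum = thrust::reduce(d.begin(), d.end(), 0LL);
|
|         printf("min=%d max=%d sum=%lld\n", (int)d.front(), (int)d.back(), sum);
|         return 0;
|     }
|
| The same building blocks are exposed at a lower level in cub, which
| is the natural next stop the comment mentions.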
| bee_rider wrote:
| It looks quite nice just from skimming the link.
|
| But, I don't understand the comparison to TBB. Do they have a
| version of TBB that runs on the GPU natively? If the TBB
| implementation is on the CPU... that's just comparing two
| different pieces of hardware. Which would be confusing,
| bordering on dishonest.
| brudgers wrote:
| For better or worse, direct professional experience in a
| professional setting is the only way to learn anything to a
| professional level.
|
| That doesn't mean one-eyed-king knowledge is never enough to
| solve that chicken-and-egg. You only have to be good enough to
| get the job.
|
| But if you haven't done it on the job, you don't have work
| experience and you are either lying to others or lying to
| yourself...and any sophisticated organization won't fall for
| it...
|
| ...except of course, knowingly. And the best way to get someone
| to knowingly ignore obvious dunning-kruger and/or horseshit is to
| know that someone personally or professionally.
|
| Which is to say that the best way to get a good job is to have a
| good relationship with someone who can hire you for a good job
| (nepotism trumps technical ability, always). And the best way to
| find a good job is to know a lot of people who want to work with
| you.
|
| To put it another way, looking for a job is the only way to find
| a job, and looking for a job is also much, much harder than
| everything that avoids looking for a job (like studying CUDA)
| while pretending to be preparation... because, again, studying
| CUDA won't ever give you professional experience.
|
| Don't get me wrong, there's nothing wrong with learning CUDA all
| on your own. But it is not professional experience and it is not
| looking for a job doing CUDA.
|
| Finally, if you want to learn CUDA just learn it for its own sake
| without worrying about a job. Learning things for their own sake
| is the nature of learning once you get out of school.
|
| Good luck.
___________________________________________________________________
(page generated 2025-06-08 23:00 UTC)