[HN Gopher] Nvidia adds native Python support to CUDA
___________________________________________________________________
Nvidia adds native Python support to CUDA
Author : apples2apples
Score : 368 points
Date : 2025-04-04 12:54 UTC (10 hours ago)
(HTM) web link (thenewstack.io)
(TXT) w3m dump (thenewstack.io)
| jbs789 wrote:
| Source?
| macksd wrote:
| It seems to pretty much be about these packages:
|
| https://nvidia.github.io/cuda-python/cuda-core/latest/
| https://developer.nvidia.com/nvmath-python
| dachworker wrote:
| More informative than the article:
| https://xcancel.com/blelbach/status/1902113767066103949 cuTile
| seems to be the NVIDIA answer to OpenAI Triton.
| pjmlp wrote:
| The plethora of packages, including DSLs for compute and MLIR.
|
| https://developer.nvidia.com/how-to-cuda-python
|
| https://cupy.dev/
|
| And
|
| "Zero to Hero: Programming Nvidia Hopper Tensor Core with
| MLIR's NVGPU Dialect" from 2024 EuroLLVM.
|
| https://www.youtube.com/watch?v=V3Q9IjsgXvA
| diggan wrote:
| I'm no GPU programmer, but seems easy to use even for someone
| like me. I pulled together a quick demo of using the GPU vs the
| CPU, based on what I could find
| (https://gist.github.com/victorb/452a55dbcf59b3cbf84efd8c3097...)
| which gave these results (after downloading 2.6GB of dependencies
| of course):
|
|     Creating 100 random matrices of size 5000x5000 on CPU...
|     Adding matrices using CPU...
|     CPU matrix addition completed in 0.6541 seconds
|     CPU result matrix shape: (5000, 5000)
|     Creating 100 random matrices of size 5000x5000 on GPU...
|     Adding matrices using GPU...
|     GPU matrix addition completed in 0.1480 seconds
|     GPU result matrix shape: (5000, 5000)
|
| Definitely worth digging into more, as the API is really simple
| to use, at least for basic things like these. CUDA programming
| seems like a big chore without something higher level like this.
| rahimnathwani wrote:
| Thank you. I scrolled up and down the article hoping they
| included a code sample.
| diggan wrote:
| Yeah, I figured I wasn't alone in doing just that :)
| rahimnathwani wrote:
| EDIT: Just realized the code doesn't seem to be using the GPU
| for the addition.
| wiredfool wrote:
| Curious what the timing would be if it included the memory
| transfer time, e.g. matricies = [np.random(...)
| for _ in range] time_start = time.time()
| cp_matricies = [cp.array(m) for m in matrices]
| add_(cp_matricies) sync time_end = time.time()
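|
| (A minimal runnable version of that measurement with CuPy, for
| reference -- sizes reduced so it fits comfortably in GPU memory;
| the explicit stream synchronize stands in for the `sync` step:)
|
|     import time
|     import numpy as np
|     import cupy as cp
|
|     matrices = [np.random.rand(2000, 2000).astype(np.float32)
|                 for _ in range(10)]
|
|     start = time.time()
|     gpu_matrices = [cp.asarray(m) for m in matrices]  # host-to-device copies
|     total = gpu_matrices[0]
|     for m in gpu_matrices[1:]:
|         total = total + m                       # additions queued on the GPU
|     cp.cuda.get_current_stream().synchronize()  # wait for copies + adds
|     print(f"transfer + add: {time.time() - start:.4f} s")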
| hnuser123456 wrote:
| I think it does? (the comment is in the original source):
|
|     print("Adding matrices using GPU...")
|     start_time = time.time()
|     gpu_result = add_matrices(gpu_matrices)
|     cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does
|     elapsed_time = time.time() - start_time
|
| I was going to ask, any CUDA professionals who want to give a
| crash course on what us python guys will need to know?
| apbytes wrote:
| When you call a CUDA method, it is launched asynchronously.
| That is, the function queues it up for execution on the GPU and
| returns.
|
| So if you need to wait for an op to finish, you need to
| `synchronize` as shown above.
|
| `get_current_stream` because the queue mentioned above is
| actually called stream in cuda.
|
| If you want to run many independent ops concurrently, you
| can use several streams.
|
| Benchmarking is one use case for synchronize. Another would
| be if you, say, run two independent ops in different streams
| and need to combine their results.
|
| Btw, if you work with PyTorch, when ops are run on the GPU,
| they are launched in the background. If you want to bench torch
| models on the GPU, it also provides a sync API.
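|
| A minimal CuPy sketch of both points above (an explicit
| synchronize before using results, and two independent streams),
| assuming cupy is installed:
|
|     import cupy as cp
|
|     a = cp.random.rand(4096, 4096, dtype=cp.float32)
|     b = cp.random.rand(4096, 4096, dtype=cp.float32)
|
|     # Two independent ops on separate streams; each call returns
|     # immediately after queueing work.
|     s1, s2 = cp.cuda.Stream(), cp.cuda.Stream()
|     with s1:
|         x = a @ a
|     with s2:
|         y = b @ b
|
|     # Wait for both streams before combining their results.
|     s1.synchronize()
|     s2.synchronize()
|     z = x + y
|     cp.cuda.get_current_stream().synchronize()
|     print(float(z[0, 0]))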
| hnuser123456 wrote:
| Thank you kindly!
| claytonjy wrote:
| I've always thought it was weird that GPU stuff in Python
| doesn't use asyncio, and mostly assumed it was because
| python-on-GPU predates asyncio. I was hoping a new lib like
| this might right that wrong, but it doesn't. Maybe for interop
| reasons?
|
| Do other languages surface the asynchronous nature of
| GPUs in language-level async, avoiding silly stuff like
| synchronize?
| apbytes wrote:
| Might have to look at specific lib implementations, but
| I'd guess that mostly gpu calls from python are actually
| happening in c++ land. And internally a lib might be
| using synchronize calls where needed.
| ImprobableTruth wrote:
| The reason is that the usage is completely different from
| coroutine based async. With GPUs you want to queue _as
| many async operations as possible_ and only then
| synchronize. That is, you would have a program like this
| (pseudocode):
|
|     b = foo(a)
|     c = bar(b)
|     d = baz(c)
|     synchronize()
|
| With coroutines/async await, something like this
|
|     b = await foo(a)
|     c = await bar(b)
|     d = await baz(c)
|
| would synchronize after every step, being much more
| inefficient.
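|
| Concretely, with CuPy the first pattern looks something like
| this (a sketch; foo/bar/baz stand in for arbitrary GPU ops):
|
|     import cupy as cp
|
|     def foo(x): return x @ x      # placeholder GPU ops; each call
|     def bar(x): return cp.sin(x)  # only enqueues kernels on the
|     def baz(x): return x.sum()    # current stream and returns
|
|     a = cp.random.rand(2048, 2048)
|
|     b = foo(a)      # returns immediately, work is queued
|     c = bar(b)      # queued behind foo
|     d = baz(c)      # queued behind bar
|     cp.cuda.get_current_stream().synchronize()  # one wait at the end
|     print(float(d))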
| hackernudes wrote:
| Pretty sure you want it to do it the first way in all
| cases (not just with GPUs)!
| halter73 wrote:
| It really depends on if you're dealing with an async
| stream or a single async result as the input to the next
| function. If a is an access token needed to access
| resource b, you cannot access a and b at the same time.
| You have to serialize your operations.
| nickysielicki wrote:
| I don't mean to call you or your pseudocode out specifically,
| but I see this sort of thing all the time, and I just want to
| put it out there:
|
| PSA: if you ever see code trying to measure timing and it's
| not using the CUDA event APIs, it's _fundamentally wrong_ and
| is lying to you. The simplest way to be sure you're not
| measuring noise is to just ban the usage of any other timing
| source. Definitely don't add unnecessary syncs just so that
| you can add a timing tap.
|
| https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_...
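|
| For example, an event-based version of the earlier timing, using
| CuPy's wrappers around the CUDA event API (a minimal sketch on
| the default stream):
|
|     import cupy as cp
|
|     a = cp.random.rand(4096, 4096, dtype=cp.float32)
|
|     start = cp.cuda.Event()
|     stop = cp.cuda.Event()
|
|     start.record()        # recorded on the current stream
|     b = a @ a             # the work being measured
|     stop.record()
|     stop.synchronize()    # wait only for the recorded event
|
|     ms = cp.cuda.get_elapsed_time(start, stop)  # milliseconds
|     print(f"{ms:.3f} ms")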
| bee_rider wrote:
| If I have a mostly CPU code and I want to time the
| scenario: "I have just a couple subroutines that I am
| willing to offload to the GPU," what's wrong with
| sprinkling my code with normal old python timing calls?
|
| If I don't care what part of the CUDA ecosystem is taking
| time (from my point of view it is a black-box that does
| GEMMs) so why not measure "time until my normal code is
| running again?"
| nickysielicki wrote:
| If you care enough to time it, you should care enough to
| time it correctly.
| bee_rider wrote:
| I described the correct way to time it when using the
| card as a black-box accelerator.
| nickysielicki wrote:
| You can create metrics for whatever you want! Go ahead!
|
| But cuda is not a black box math accelerator. You can
| stupidly treat it as such, but that doesn't make it that.
| It's an entire ecosystem with drivers and contexts and
| lifecycles. If everything you're doing is synchronous
| and/or you don't mind if your metrics include totally
| unrelated costs, then time.time() is fine, sure. But if
| that's the case, you've got bigger problems.
| doctorpangloss wrote:
| You're arguing with people who have no idea what they're
| talking about on a forum that is a circular "increase in
| acceleration" of a personality trait that gets co-opted
| into arguing incorrectly about everything - a trait that
| everyone else knows is defective.
| gavinray wrote:
| One of the wisest things I've read all week.
|
| I authored one of the primary tools for GraphQL server
| benchmarks.
|
| I learned about the Coordinated Omission problem and
| formats like HDR Histograms during the implementation.
|
| My takeaway from that project is that not only is
| benchmarking anything correctly difficult, but they all
| ought to come with disclaimers of:
|
| _" These are the results obtained on X machine, running
| at Y time, with Z resources."_
| ashvardanian wrote:
| CuPy has been available for years and has always worked great.
| The article is about the next wave of Python-oriented JIT
| toolchains that will allow writing actual GPU kernels in a
| Pythonic style, instead of calling an existing precompiled GEMM
| implementation in CuPy (like in that snippet) or JIT-ing CUDA
| C++ kernels from a Python source, which has also been available
| for years:
| https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...
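|
| For reference, that RawKernel flavor looks roughly like this
| (a minimal sketch adapted from the CuPy docs linked above):
|
|     import cupy as cp
|
|     # CUDA C++ source, JIT-compiled by NVRTC on first launch.
|     add_kernel = cp.RawKernel(r'''
|     extern "C" __global__
|     void my_add(const float* x1, const float* x2, float* y) {
|         int tid = blockDim.x * blockIdx.x + threadIdx.x;
|         y[tid] = x1[tid] + x2[tid];
|     }
|     ''', 'my_add')
|
|     x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
|     x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
|     y = cp.zeros((5, 5), dtype=cp.float32)
|
|     add_kernel((5,), (5,), (x1, x2, y))  # grid, block, arguments
|     print(y)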
| almostgotcaught wrote:
| it's funny - people around here really do not have a clue
| about the GPU ecosystem even though everyone is _always_
| talking about AI:
|
| > The article is about the next wave of Python-oriented JIT
| toolchains
|
| the article is content marketing (for whatever) but the
| actual product literally has nothing to do with kernels
| or jitting or anything
|
| https://github.com/NVIDIA/cuda-python
|
| literally just cython bindings to CUDA runtime and CUB.
|
| for once CUDA is aping ROCm:
|
| https://github.com/ROCm/hip-python
| dragonwriter wrote:
| The mistake you seem to be making is confusing the existing
| product (which has been available for many years) with the
| upcoming new features for that product just announced at
| GTC, which are not addressed _at all_ on the page for the
| existing product, but are addressed in the article about
| the GTC announcement.
| almostgotcaught wrote:
| > The mistake you seem to be making is confusing the
| existing product
|
| i'm not making any such mistake - i'm just able to
| actually read and comprehend what i'm reading rather than
| perform hype:
|
| > Over the last year, NVIDIA made CUDA Core, which Jones
| said is a "Pythonic reimagining of the CUDA runtime to be
| naturally and natively Python."
|
| so the article is about cuda-core, not whatever you think
| it's about - so i'm responding directly to what the
| article is about.
|
| > CUDA Core has the execution flow of Python, which is
| fully in process and leans heavily into JIT compilation.
|
| this is bullshit/hype about Python's new JIT which womp
| womp womp isn't all that great (yet). this has absolutely
| nothing to do with any other JIT e.g., the cutile _kernel
| driver JIT_ (which also has absolutely nothing to do with
| what you think it does).
| dragonwriter wrote:
| > i'm just able to actually read and comprehend what i'm
| reading rather than perform hype:
|
| The evidence of that is lacking.
|
| > so the article is about cuda-core, not whatever you
| think it's about
|
| cuda.core (a relatively new, rapidly developing, library
| whose entire API is experimental) is one of several
| things (NVMath is another) mentioned in the article, but
| the newer and as yet unreleased piece mentioned in the
| article and the GTC announcement, and a key part of the
| "Native Python" in the headline, is the CuTile model [0]:
|
| "The new programming model, called CuTile interface, is
| being developed first for Pythonic CUDA with an extension
| for C++ CUDA coming later."
|
| > this is bullshit/hype about Python's new JIT
|
| No, as is fairly explicit in the next line after the
| one you quote, it is about the Nvidia CUDA Python
| toolchain using in-process compilation rather than
| relying on shelling out to out-of-process command-line
| compilers for CUDA code.
|
| [0] The article only has fairly vague qualitative
| description of what CuTile is, but (without having to
| watch the whole talk from GTC), one could look at this
| tweet for a preview of what the Python code using the
| model is expected to look like when it is released:
| https://x.com/blelbach/status/1902113767066103949?t=uihk0M8V...
| almostgotcaught wrote:
| > No, as is is fairly explicit in the next line after the
| one you quote, it is about the Nvidia CUDA Python
| toolchain using in-process compilation rather than
| relying on shelling out to out-of-process command-line
| compilers for CUDA code.
|
| my guy what i am able to read, which you are not, is the
| source and release notes. i do not need to read tweets
| and press releases because i know what these things
| actually are. here are the release notes
|
| > Support Python 3.13
|
| > Add bindings for nvJitLink (requires nvJitLink from
| CUDA 12.3 or above)
|
| > Add optional dependencies on CUDA NVRTC and nvJitLink
| wheels
|
| https://nvidia.github.io/cuda-python/latest/release/12.8.0-n...
|
| do you understand what "bindings" and "optional
| dependencies on..." means? it means there's nothing
| happening in this library and these are... just bindings
| to existing libraries. specifically that means you cannot
| jit python using this thing ( _except via the python 3.13
| jit interpreter_ ) and can only do what you've always
| already been able to do with eg cupy (compile and run
| C/C++ CUDA code).
|
| EDIT: y'all realize that
|
| 1. calling a compiler for your entire source file
|
| 2. loading and running that compiled code
|
| is not at all a JIT? y'all understand that right?
| squeaky-clean wrote:
| Isn't the main announcement of the article CuTile? Which
| has not been released yet.
|
| Also the cuda-core JIT stuff has nothing to do with
| Python's new JIT, it's referring to integrating nvJitLink
| with python, which you can see an example of in
| cuda_core/examples/jit_lto_fractal.py
| yieldcrv wrote:
| I just want to see benchmarks. is this new one faster than
| CuPy or not
| ashvardanian wrote:
| In case someone is looking for some performance examples &
| testimonials: even on an RTX 3090 vs a 64-core AMD
| Epyc/Threadripper, a couple of years ago, CuPy was a blast. I
| have a couple of recorded sessions with roughly identical
| slides/numbers:
|
| - San Francisco Python meetup in 2023:
|   https://youtu.be/L9ELuU3GeNc?si=TOp8lARr7rP4cYaw
| - Yerevan PyData meetup in 2022:
|   https://youtu.be/OxAKSVuW2Yk?si=5s_G0hm7FvFHXx0u
|
| Of the more remarkable results:
|
| - 1000x sorting speedup switching from NumPy to CuPy.
| - 50x performance improvement switching from Pandas to CuDF on
|   the New York Taxi Rides queries.
| - 20x GEMM speedup switching from NumPy to CuPy.
|
| CuGraph is also definitely worth checking out. At that
| time, Intel wasn't in as bad of a position as they are now
| and was trying to push Modin, but the difference in
| performance and quality of implementation was mind-
| boggling.
| ladberg wrote:
| The main release highlighted by the article is cuTile which
| _is_ certainly about jitting kernels from Python code
| almostgotcaught wrote:
| > main release
|
| there is no release of cutile (yet). so the only
| substantive thing that the article can be describing is
| cuda-core - which it does describe and is a recent/new
| addition to the ecosystem.
|
| man i can't fathom glazing a random blog this hard just
| because it's tangentially related to some other thing (NV
| GPUs) that clearly people only vaguely understand.
| moffkalast wrote:
| Only a 4x speedup seems rather low for GPU acceleration. Does
| numpy already use AVX2 or anything SIMD?
|
| For comparison, doing something similar with torch on CPU and
| torch on GPU will get you like 100x speed difference.
| diggan wrote:
| It's a microbenchmark (if even that), take it with a grain of
| salt. You'd probably see a bigger difference with bigger, more
| numerous, or more complicated tasks.
| aixpert wrote:
| Thank God PyTorch gained so much momentum before this came out.
| Now we have a true platform-independent semi-standard for
| parallel computations, and we are not stuck with NVIDIA
| specifics.
|
| It's great that the parts of PyTorch which concern the NVIDIA
| backend can now be implemented in Python directly. The important
| part is that it doesn't really matter, or shouldn't matter, for
| end users / developers.
|
| That being said, maybe this new platform will extend the whole
| concept of on-GPU computation via Python to even more domains,
| like games.
|
| Imagine running Rust (the game) performantly, mainly on the GPU,
| via Python.
| disgruntledphd2 wrote:
| This just makes it much, much easier for people to build
| numeric stuff on GPU, which is great.
|
| I'm totally with you that it's better that this took so long,
| so we have things like PyTorch abstracting most of this away,
| but I'm looking forward to (in my non-existent free time :/ )
| playing with this.
| wafngar wrote:
| Why not use torch.compile()?
| the__alchemist wrote:
| Rust support next? RN I am manually [de]serializing my data
| structures as byte arrays to/from the kernels. It would be nice
| to have truly shared data structures like CUDA gives you in C++!
| Micoloth wrote:
| What do you think of the Burn framework? (Honest question, I
| have no clue what I'm talking about)
| airstrike wrote:
| I used it to train my own mini-GPT and I liked it quite a
| bit. I tend to favor a different style of Rust with fewer
| generics but maybe that just can't be avoided given the goals
| of that project.
|
| The crate seems to have a lot of momentum, with many new
| features, releases, active communities on GH and Discord. I
| expect it to continue to get better.
| the__alchemist wrote:
| Have not heard of it. Looked it up. Seems orthogonal?
|
| I am using Cudarc.
| chasely wrote:
| The Rust-CUDA project just recently started up again [0]. I've
| started digging into it a little bit and am hoping to
| contribute to it since the summers are a little slower for me.
|
| [0] https://github.com/rust-gpu/rust-cuda
| the__alchemist wrote:
| Still broken though! Has been for years. In a recent GH issue
| regarding desires for the reboot, I asked: "Try it on a few
| different machines (OS, GPUs, CUDA versions etc), make it
| work on modern RustC and CUDA versions without errors." The
| response was "That will be quite some work." Meanwhile,
| Cudarc works...
| chasely wrote:
| Totally, it's going to take a minute to get it all working.
| On a positive note, they recently got some sponsorship from
| Modal [0], who is supplying GPUs for CI/CD so they should
| be able to expand their hardware coverage.
| LegNeato wrote:
| https://github.com/rust-gpu/rust-cuda
| the__alchemist wrote:
| Not functional.
| taminka wrote:
| even putting aside how rust ownership semantics map poorly onto
| gpu programming, ml researchers will never learn rust, this
| will never ever happen...
| the__alchemist wrote:
| GPGPU programming != ML.
| pjmlp wrote:
| While I agree in principle, CUDA is more than only AI, as
| people keep forgetting.
| taminka wrote:
| everyone else who uses cuda isn't going to learn rust
| either
| pjmlp wrote:
| First Rust needs to have tier 1 support for CUDA, in a
| way that doesn't feel like yak shaving when coding for
| CUDA.
| malcolmgreaves wrote:
| ML reachers don't write code, they ask ChatGPT to make a
| horribly inefficient, non-portable notebook that has to be
| rewritten from scratch :)
| staunton wrote:
| It's made easier by that notebook only having to work just
| once, to produce some plots for the paper/press
| release/demo.
| int_19h wrote:
| The ML researchers are relying on libraries written by
| someone else. Today, those libraries are mostly C++, and they
| would benefit from Rust same as most other C++ codebases.
| KeplerBoy wrote:
| Isn't Rust still very seldom used in the areas where CUDA
| shines (e.g. number crunching of any kind, be it simulations
| or linear algebra)? Imo C++ or even Fortran are perfectly fine
| choices for those things, since the memory allocation patterns
| aren't that complicated.
| pjmlp wrote:
| Yes, and the new kid in town, Slang, has more chances of
| adoption.
| KeplerBoy wrote:
| sorry, could you link to the project? Seems there are quite
| a few languages called slang.
| _0ffh wrote:
| I guess he might mean this one https://shader-slang.org/
| though at first glance at least it looks more graphics
| than GPGPU oriented.
|
| Edit: Hmm, this part of the same project looks general
| purpose-y and apparently integrates with PyTorch
| https://slangpy.shader-slang.org/en/latest/
| IshKebab wrote:
| Mainly because number crunching code tends to be _very_ long-
| lived (hence why FORTRAN is still in use).
| gymbeaux wrote:
| This is huge. Anyone who was considering AMD + ROCm as an
| alternative to NVIDIA in the AI space isn't anymore.
|
| I'm one of those people who can't (won't) learn C++ to the extent
| required to effectively write code for GPU execution.... But to
| have a direct pipeline to the GPU via Python. Wow.
|
| The efficiency implications are huge, not just for Python
| libraries like PyTorch, but also anything we write that runs on
| an NVIDIA GPU.
|
| I love seeing anything that improves efficiency because we are
| constantly hearing about how many nuclear power plants OpenAI and
| Google are going to need to power all their GPUs.
| ErrorNoBrain wrote:
| They are, if they can't find an Nvidia card.
| pjmlp wrote:
| NVidia cards are everywhere; the biggest difference to AMD is
| that even my lousy laptop GeForce cards can be used for CUDA.
|
| No need for an RTX for learning and getting into CUDA
| programming.
| ferguess_k wrote:
| Just curious why can't AMD do the same thing?
| bigyabai wrote:
| It can be argued that they already did. AMD and Apple worked
| with Khronos to build OpenCL as a general competitor. The
| industry didn't come together to support it though, and
| eventually major stakeholders abandoned it altogether. Those
| ~10 wasted years were spent on Nvidia's side refining their
| software offerings and redesigning their GPU architecture to
| prioritize AI performance over raster optimization. Meanwhile
| Apple and AMD were pulling the rope in the opposite
| direction, trying to optimize raster performance at all
| costs.
|
| This means that Nvidia is selling a relatively unique
| architecture with a fully-developed SDK, industry buy-in and
| relevant market demand. Getting AMD up to the same spot would
| force them to reevaluate their priorities and demand a
| clean-slate architecture to boot.
| pjmlp wrote:
| Maybe because Apple got pissed at how Khronos took over
| OpenCL, and AMD and Intel never offered tooling on par with
| CUDA in terms of IDE integration, graphical debuggers and
| library ecosystem.
|
| Khronos also never saw the need to support a polyglot
| ecosystem with C++, Fortran and anything else that the
| industry could feel like using on a GPU.
|
| When Khronos finally remembered to at least add C++ support
| and SPIR, again Intel and AMD failed to deliver, and OpenCL
| 3.0 is basically OpenCL 1.0 rebranded.
|
| That was followed by the SYCL efforts, which only Intel seems
| to care about, with their own extensions on top via DPC++,
| nowadays oneAPI. And only after acquiring Codeplay, which was
| actually the first company to deliver on SYCL tooling.
|
| However contrary to AMD, at least Intel does get that
| unless everyone gets to play with their software stack, no
| one will bother to actually learn it.
| bigyabai wrote:
| Well, Apple has done nothing to replace the common
| standard they abandoned. They failed to develop their
| proprietary alternatives into a competitive position and
| now can't even use their own TSMC dies (imported at great
| expense) for training:
| https://www.eteknix.com/apple-set-to-invest-1-billion-in-nvi...
|
| However you want to paint the picture today, you can't
| say the industry didn't try to resist CUDA. The
| stakeholders shot each other in a 4-way Mexican standoff,
| and Nvidia whistled showtunes all the way to the bank. If
| OpenCL was treated with the same importance Vulkan was,
| we might see a very different market today.
| pjmlp wrote:
| Yes they did, it is called Metal Compute, and everyone
| using Apple devices has to use it.
|
| Vulkan you say?
|
| It is only relevant on GNU/Linux and Android, because
| Google is pushing it, and even there most folks still keep
| using OpenGL ES. No one else cares about it, and it has
| already turned into the same spaghetti mess as OpenGL, to the
| point that there was a roadmap talk at Vulkanised 2025 on
| how to sort things out.
|
| NVidia and AMD keep designing their cards with Microsoft
| for DirectX first, and Vulkan, eventually.
| bigyabai wrote:
| > it is called Metal Compute, and everyone using Apple
| devices has to use it.
|
| Sounds like a submarket absolutely teeming with
| competition. Like, you have Metal Compute, and Apple
| Accelerate Framework and MLX all sitting there in the
| same spot! Apple is really outdoing themselves, albeit in
| a fairly literal sense.
|
| > It is only relevant on GNU/Linux and Android
|
| Hmm... someone ought to remind me of the first stage of
| grief, I've forgotten it suddenly.
| dismalaf wrote:
| > But to have a direct pipeline to the GPU via Python
|
| Have you ever used a GPU API (CUDA, OpenCL, OpenGL, Vulkan,
| etc...) with a scripting language?
|
| It's cool that Nvidia made a bit of an ecosystem around it but
| it won't replace C++ or Fortran and you can't simply drop in
| "normal" Python code and have it run on the GPU. CUDA is still
| fundamentally it's own thing.
|
| There's also been CUDA bindings to scripting languages for at
| least 15 years... Most people will probably still use Torch or
| higher level things built on top of it.
|
| Also, here's Nvidia's own advertisement and some instructions
| for Python on their GPUs:
|
| - https://developer.nvidia.com/cuda-python
|
| - https://developer.nvidia.com/how-to-cuda-python
|
| Reality is kind of boring, and the article posted here is just
| clickbait.
| freeone3000 wrote:
| OpenCL and OpenGL are basically already scripting languages
| that you happen to type into a C compiler. The CUDA advantage
| was actually having meaningful types and compilation errors,
| without the intense boilerplate of Vulkan. But this is 100% a
| python-for-CUDA-C replacement on the GPU, for people who
| prefer a slightly different bracketing syntax.
| dismalaf wrote:
| > But this is 100% a python-for-CUDA-C replacement on the
| GPU
|
| Ish. It's a Python maths library made by Nvidia, an eDSL
| and a collection of curated libraries. It's not
| significantly different than stuff like Numpy, Triton,
| etc..., apart from being made by Nvidia and bundled with
| their tools.
| dragonwriter wrote:
| > It's cool that Nvidia made a bit of an ecosystem around it
| but it won't replace C++ or Fortran and you can't simply drop
| in "normal" Python code and have it run on the GPU.
|
| While it's not _exactly_ normal Python code, there are Python
| libraries that allow writing GPU kernels in internal DSLs
| that are normal-ish Python (e.g., Numba for CUDA specifically
| via the @cuda.jit decorator; or Taichi which has multiple
| backends supporting the same application code--Vulkan, Metal,
| CUDA, OpenGL, OpenGL ES, and CPU.)
|
| Apparently, nVidia is now doing this first party in CUDA
| Python, including adding a new paradigm for CUDA code
| (CuTile) that is going to be in Python before C++; possibly
| trying to get ahead of things like Taichi (which, because it
| is cross-platform, commoditizes the underlying GPU).
|
| > Also, here's Nvidia's own advertisement for Python on their
| GPUs
|
| That (and the documentation linked there) does not address
| the new _upcoming_ native functionality announced at GTC;
| existing CUDA Python has kernels written in C++ in inline
| strings.
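|
| As a concrete example of that internal-DSL style, a kernel via
| Numba's @cuda.jit decorator looks roughly like this (a minimal
| sketch, assuming numba and a CUDA-capable GPU):
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit
|     def add_kernel(x, y, out):
|         i = cuda.grid(1)          # absolute thread index
|         if i < out.size:
|             out[i] = x[i] + y[i]
|
|     n = 1_000_000
|     x = np.arange(n, dtype=np.float32)
|     y = 2 * x
|     out = np.zeros_like(x)
|
|     threads = 256
|     blocks = (n + threads - 1) // threads
|     add_kernel[blocks, threads](x, y, out)  # arrays copied to/from GPU
|     print(out[:5])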
| pjmlp wrote:
| Yes, shading languages which are more productive without the
| gotchas from those languages, as they were designed from the
| ground up for compute devices.
|
| The polyglot nature of CUDA is one of the plus points versus
| the original "we do only C99 dialect around here" from
| OpenCL, until it was too late.
| DeathArrow wrote:
| >In 2024, Python became the most popular programming language in
| the world -- overtaking JavaScript -- according to GitHub's 2024
| open source survey.
|
| I wonder why Python took over the world. Of course, it's easy to
| learn, and it might be easy to read and understand. But it also
| has a few downsides: low performance, single-threaded execution,
| and a lack of static typing.
| nhumrich wrote:
| Perhaps performance, multi threading, and static typing are not
| the #1 things that make a language great.
|
| My guess: it's the community.
| chupasaurus wrote:
| All 3 are achieved in Python with a simple _import ctypes_
| /sarcasm
| timschmidt wrote:
| Universities seem to have settled on it for CSE 101 courses in
| the post-Java academic programming era.
| PeterStuer wrote:
| It's the ecosystem, specifically the _huge_ amount of packages
| available for everything under the sun.
| diggan wrote:
| > I wonder why Python take over the world?
|
| Not sure what "most popular programming language in the world"
| even means, in terms of existing projects? In terms of
| developers who consider it their main language? In terms of
| existing actually active projects? According to new projects
| created on GitHub that are also public?
|
| My guess is that it's the last one, which probably isn't what
| one would expect when hearing "the most popular language in the
| world", so worth keeping in mind.
|
| But considering that AI/ML is the hype today, and everyone wants
| to get their piece of the pie, it makes sense that there are
| more public Python projects created on GitHub today compared to
| other languages, as most AI/ML is Python.
| owebmaster wrote:
| In this case, it's because JS ecosystem is now divided between
| JavaScript and TypeScript
| timschmidt wrote:
| As soon as WASM has native bindings for the DOM, I think
| you're going to see a lot of the energy in the JS ecosystem
| drain back into other languages.
| lenerdenator wrote:
| I do backend web server development using FastAPI/Starlette and
| Django.
|
| If I were a Ruby developer, I'd be using Rails, and I'd also be
| describing 90% of Ruby development.
|
| However, I do Python. What I'm describing is a tiny fraction of
| Python development.
|
| If you want to do something with computer code - data analysis,
| ML, web development, duct-taping together parts of a #NIX
| system, even some game development - you can do it _reasonably_
| well, if not better, in Python. The paths that you can take are
| limitless, and that gets people interested.
| EnergyAmy wrote:
| There's the pithy saying that "Python is the second-best
| language for anything", and that's kind of its superpower.
| TechDebtDevin wrote:
| I don't know. It absolutely annoys me. Go is more readable,
| easier to learn, more efficient, more fun to write but doesn't
| have all the math/ml packages people want. I'd like to get
| involved in catching Go up to Python in the ML space but Go is
| so behind.
| leosanchez wrote:
| > more fun to write
|
| Go is definitely not fun to write. The rest I agree.
| georgeecollins wrote:
| Less typing-- I mean keystrokes.
|
| All the things that are not great about it make it easier to
| learn. No static typing, no control of memory, no threads.
|
| When I started there was a language like BASIC or Visual BASIC
| that was easy to learn (or also quick to use) and C or C++ that
| was performant. If the world now is Python and Rust or Go, I
| think that it is just a better world for programmers. I say that
| as someone comfortable with C / C++ / Java. They had their time
| and will still be with us, but the improvement is real.
| hbn wrote:
| It's "easy to learn" and you get all the downsides that come
| with that.
|
| At work right now we're integrating with scoring models hosted
| in Amazon SageMaker written by a "modelling team" and as far as
| I can tell they follow absolutely no basic coding practices.
| They give us the API and are asking us to send English strings
| of text for names of things instead of any real keys, and
| they're just comparing against plain strings and magic numbers
| everywhere so if they're asked to make any change like renaming
| something it's a herculean task that breaks a bunch of other
| things. Something will break when a field is null, and then
| they'll tell us that if we have no data we should send -9999999
| instead of null. One time something broke and it turned out to be
| because we sent them "MB" (Manitoba) as someone's province, and
| whoever wrote it was just plain-text checking against a list of
| province codes as strings and didn't remember to include
| Manitoba.
|
| I know this is still mainly a business/management issue that
| they're allowing people who don't know how to code to write
| code, but I'm sure this is happening at other companies, and I
| think Python's level of accessibility at the beginner level has
| been a real blight to software quality.
| lenkite wrote:
| > I wonder why Python take over the world
|
| Because data science/ML/LLMs have taken over the world now and
| no other language offers best-in-breed libraries and
| frameworks.
|
| Other languages need to get off their ass and start offering
| options soon or be relegated to niche domains.
| CapsAdmin wrote:
| Slightly related, I had a go at doing llama 3 inference in luajit
| using cuda as one compute backend for just doing matrix
| multiplication
|
| https://github.com/CapsAdmin/luajit-llama3/blob/main/compute...
|
| While obviously not complete, it took less code than I thought
| would be needed.
|
| It was a bit annoying trying to figure out which version of the
| function (_v2 suffix) I have to use for which driver I was
| running.
|
| Also sometimes a bit annoying is the stateful nature of the API,
| very similar to OpenGL. It's hard to debug at times why
| something refuses to compile.
| andrewmcwatters wrote:
| Neat, thanks for sharing!
| btown wrote:
| The GTC 2025 announcement session that's mentioned in this
| article has video here:
| https://www.nvidia.com/en-us/on-demand/session/gtc25-s72383/
|
| It's a holistic approach to all levels of the stack, from high-
| level frameworks to low-level bindings, some of which highlight
| existing libraries, and some of which are completely new
| announcements.
|
| One of the big things seems to be a brand new Tile IR, at the
| level of PTX and supported with a driver level JIT compiler, and
| designed for Python-first semantics via a new cuTile library.
|
| https://x.com/JokerEph/status/1902758983116657112 (without login:
| https://xcancel.com/JokerEph/status/1902758983116657112 )
|
| Example of proposed syntax:
| https://pbs.twimg.com/media/GmWqYiXa8AAdrl3?format=jpg&name=...
|
| Really exciting stuff, though with the new IR it further widens
| the gap that projects like https://github.com/vosen/ZLUDA and
| AMD's own tooling are trying to bridge. But vendor lock-in isn't
| something we can complain about when it arises from the vendor
| continuing to push the boundaries of developer experience.
| skavi wrote:
| i'm curious what advantage is derived from this existing
| independently of the PTX stack? i.e. why doesn't cuTile produce
| PTX via a bundled compiler like Triton or (iirc) Warp?
|
| Even if there is some impedance mismatch, could PTX itself not
| have been updated?
| lunarboy wrote:
| Doesn't this mean AMD could make the same python API that targets
| their hardware, and now Nvidia GPUs aren't as sticky?
| KeplerBoy wrote:
| Nothing really changed. AMD already has a c++ (HIP) dialect
| very similar to CUDA, even with some automatic porting efforts
| (hipify).
|
| AMD is held back by the combination of a lot of things. They
| have a counterpart to almost everything that exists on the
| other side. The things on the AMD side are just less mature
| with worse documentation and not as easily testable on consumer
| hardware.
| pjmlp wrote:
| They could, AMD's problem is that they keep failing on
| delivery.
| jmward01 wrote:
| This will probably lead to what, I think, Python has led to in
| general: a lot more things tried quicker, with the targeted
| things staying in a faster language. All in all this is a great
| move. I am looking forward to playing with it for sure.
| chrisrodrigue wrote:
| Python is really shaping up to be the lingua franca of
| programming languages. Its adoption is soaring in this FOSS
| renaissance and I think it's the closest thing to a golden hammer
| that we've ever had.
|
| The PEP model is a good vehicle for self-improvement and
| standardization. Packaging and deployment will soon be solved
| problems thanks to projects such as uv and BeeWare, and I'm
| confident that we're going to see continued performance
| improvements year over year.
| silisili wrote:
| > Packaging and deployment will soon be solved problems
|
| I really hope you're right. I love Python as a language, but
| for any sufficiently large project, those items become an
| absolute nightmare without something like Docker. And even
| with, there seems to be multiple ways people solve it. I wish
| they'd put something in at the language level or bless an
| 'official' one. Go has spoiled me there.
| horsawlarway wrote:
| Honestly, I'm still incredibly shocked at just how bad Python
| is on this front.
|
| I'm plenty familiar with packaging solutions that are painful
| to work with, but the state of python was _shocking_ when I
| hopped back in because of the available ML tooling.
|
| UV seems to be at least somewhat better, but damn - watching
| pip literally download 20+ 800MB torch wheels over and over
| trying to resolve deps only to waste 25GB of bandwidth and
| finally completely fail after taking nearly an hour was
| absolutely staggering.
| SJC_Hacker wrote:
| Python was not taken seriously as something you actually
| shipped to non-devs. The solution was normally "install the
| correct version of Python on the host system". In the Linux
| world, this could be handled through Docker, pyenv. For
| Windows users, this meant installing a several GB distro
| and hoping it didn't conflict with what was already on the
| system.
| whycome wrote:
| Would you say Python is a good language to learn as a beginner?
| dpkirchner wrote:
| Not the person you replied to but I'd say definitely not.
| It'd be easy to pick up bad habits from python (untyped
| variables) and try to carry them over to other languages.
| It's also the king of runtime errors, which will frustrate
| newbies.
|
| I think a compiled language is a better choice for people
| just getting started. Java is good, IMO, because it is
| verbose. Eventually the beginner may get tired of the
| verbosity and move on to something else, but at least they'll
| understand the value of explicit types and compile-time
| errors.
| SJC_Hacker wrote:
| Huh? Python variables definitely have a type. It's just
| determined at runtime.
|
| The only untyped language I know of, at least among modern
| ones, is assembler.
|
| Well and C, if you make everything void*.
| chrisrodrigue wrote:
| Yeah, definitely. It's basically executable pseudocode and
| it's really simple for a beginner to pick up and hit the
| ground running for a variety of use cases.
|
| Some people will tell you to start with C or C++ to get a
| better intuition for what's actually happening under the hood
| in Python, but that's not really necessary for most use cases
| unless you're doing something niche. Some of the most popular
| use cases for Python are webapps, data analysis, or general
| automation. For the 1% of use cases that Python isn't the
| right fit for, you can still use it to prototype or glue
| things together.
|
| There are a lot of great resources out there for learning
| Python, but they won't necessarily teach you how to make
| great software. You can't go wrong with the official
| tutorial. https://learn.scientific-python.org/development/ is
| pretty terse and incorporates a lot of best practices.
| somethingsome wrote:
| I was teaching python long ago to very beginners in
| programming.
|
| Honestly, the language became kinda harsh for newcomers. What
| we see as developers is "it's like pseudocode that runs".
|
| But a beginner is often left behind by the billions of methods
| in each class. He is not used to documentation, and spends
| quite a huge amount of time learning by heart stupid things
| like: in this case it's 'len()', here it's '.len()', there
| it's '.length', etc., for many, many methods that all have
| their idiosyncrasies.
|
| At least in C/(easy) C++, you need to build most of it
| yourself, which helps the understanding.
|
| I'm not completely against Python as a first language, but it
| needs to be taught well, and that could include working with a
| very minimal set of functions on every object. Then you can
| expand and incorporate more and more methods that make life
| easier.
| airstrike wrote:
| As someone who spent nearly a decade with Python, I'd say 90%
| of people will answer "yes", so I'd like to offer a different
| perspective.
|
| IMHO if you want to pick it up for a couple toy projects just
| to get a feel of what coding is like, then by all means try
| it out. But eventually you'll benefit tremendously from
| exploring other languages.
|
| Python will teach you a lot of bad habits. You will feel like
| you know what you're doing, but only because you don't know
| all of the ways in which it is handwaving a lot of complexity
| that is inherent to writing code which you should be very
| much aware of.
|
| Knowing what I know now, I wish Rust existed when I started
| out so that it could have been my first language. I'm never
| giving up the borrow checker and the type system that come
| with it.
|
| But you don't have to do Rust. It's fine to work on a couple
| of projects in Python, then maybe something small in C
| (though the tooling can feel arcane and frustrating), then
| maybe switch it up and go with some more functional
| programming (FP) flavored like Lisp or F#.
|
| I know Rust has a lot of zealots and a lot of haters, but I'm
| not pushing an agenda. I just think it strikes that perfect
| balance between being extremely expressive, clear to read
| (after maybe a month of writing it daily), strong type
| system, lots of FP elements, no OOP clutter but super
| powerful traits, the borrow checker which you'll invariably
| learn to love, and more...
|
| This will give you a strong foundation upon which you'll be
| able to continuously build knowledge. And even if you start
| with Rust, you should definitely explore Python, C, Lisp and
| F# later (or maybe Haskell instead of F#)
| system2 wrote:
| With the help of GPT, I think the bad habit part is non-
| existent anymore. Learning it from GPT really helps people
| nowadays. Ask ChatGPT 4.0 some questions, and you will be
| shocked by how well it describes the code.
|
| Just don't ask to fix indentations because it will do it
| line by line for hours. But it finds mistakes quickly and
| points you in the right direction.
|
| And of course, it comes up with random non-existent modules
| once in a while which is cute to me.
| airstrike wrote:
| The bad habits I was thinking about were more along the lines
| of not understanding how memory is being used (even
| something as simple as stack vs. heap allocation), not
| having a type system that forces you to think about the
| types of data structures you have in your system, and
| overall just not being forced to _design_ before _coding_.
| silisili wrote:
| I go back and forth on this. A lot of people make good
| points.
|
| In the end, my final answer is - yes. I say that because I
| believe it's the easiest programming language to get
| something working in. And getting something working is what
| motivates people to keep going.
|
| If you sit them down and say 'well before you learn python
| you need to learn how a computer really works, here's an ASM
| x86 book', they're gonna probably read 10 pages, say this is
| boring, then go do something else. I think that because I
| went through that as a kid - I started reading a C++ book
| with no knowledge and gave up. It wasn't until I found qbasic
| and VB, by all marks a terrible language, that I really got
| motivated to learn and keep going because progress was so
| easy.
|
| Python will teach you the basics - control flow, loops,
| variables, functions, libraries, etc. Those apply to almost
| every language. Then when you move to a different language,
| you at least know the basics and can focus on what's
| different or added that you didn't have or know before.
| SJC_Hacker wrote:
| Yeah, Python or Javascript should be first languages for
| most people.
|
| People like flashy things, and Python and Javascript are
| just 10x easier to get that working. Console I/O doesn't
| really cut it anymore.
|
| Later on you can deal with memory allocation, bit-
| twiddling, 2's complement arithmetic, lower level OS
| details etc.
| IshKebab wrote:
| I would personally recommend Javascript/Typescript over
| Python, but Python is a reasonable option. Especially now
| that we have uv so you don't have to crawl through the bush
| of thorns that Python's terrible tooling (pip, venv etc)
| surrounds you with.
|
| I would just encourage you to move on from Python fairly
| quickly. It's like... a balance bike. Easy to learn and teach
| you how to balance but you don't want to actually use it to
| get around.
| ergonaught wrote:
| > Packaging and deployment will soon be solved problems ...
|
| I hope so. Every time I engage in a "Why I began using Go aeons
| ago" conversation, half of the motivation was this. The reason
| I stopped engaging in them is because most of the participants
| apparently cannot see that this is even a problem. Performance
| was always the second problem (with Python); this was always
| the first.
| airstrike wrote:
| Python is too high level, slow and duck-typed to even be
| considered for a huge number of projects.
|
| There is no one-size-fits-all programming language.
| pjmlp wrote:
| It's the new BASIC, Pascal and Lisp.
|
| Now if only CPython also got a world class JIT, V8 style.
| int_19h wrote:
| AI-generated code is going to be a major influence going
| forward. Regardless of how you feel about its quality (I'm a
| pessimist myself), it's happening anyway, and it's going to
| cement the dominant position of those languages which LLMs
| understand / can write the best. Which correlates strongly to
| their amount in the training set, which means that Python and
| JavaScript in particular are here to stay now, and will likely
| be increasingly shoved into more and more niches - even those
| they aren't well-suited to - solely because LLMs can write
| them.
| screye wrote:
| Never heard of Beeware, but Astral's products have transformed
| my python workflow (uv, ruff).
|
| Is BeeWare that transformational? What does BeeWare do, and
| what is its maturity level?
| ryao wrote:
| CUDA was born from C and C++
|
| It would be nice if they actually implemented a C variant of CUDA
| instead of extending C++ and calling it CUDA C.
| swyx wrote:
| why is that important to you? just trying to understand the
| problem you couldn't solve without a C-like
| kevmo314 wrote:
| A strict C variant would indeed be quite nice. I've wanted to
| write CUDA kernels in Go apps before so the Go app can handle
| the concurrency on the CPU side. Right now, I have to write a
| C wrapper and more often than not, I end up writing more code
| in C++ instead.
|
| But then I end up finding myself juggling mutexes and wishing
| I had some newer language features.
| ryao wrote:
| I want to write C code, not C++ code. Even if I try to write
| C style C++, it is more verbose and less readable, because of
| various C++isms. For example, having to specify extern "C" to
| get sane ABI names for the Nvidia CUDA driver API:
|
| https://docs.nvidia.com/cuda/cuda-driver-api/index.html
|
| Not to mention that C++ does not support neat features like
| variable sized arrays on the stack.
| pjmlp wrote:
| First of all they extended C; with CUDA 3.0, initial support
| was added for C++, and afterwards they bought PGI and added
| Fortran into the mix.
|
| Along for the ride, they fostered an ecosystem of
| compiled-language backends targeting CUDA.
|
| Additionally modern CUDA supports standard C++ as well, with
| frameworks that hide the original extensions.
|
| Most critics don't really get the CUDA ecosystem.
| ryao wrote:
| They replaced C with C++. For example, try passing a function
| pointer as a void pointer argument without a cast. C says
| this should work. C++ says it should not. There are plenty of
| other differences that make it C++ and not C, if you know to
| look for them. The fact that C++ symbol names are used for
| one, which means you need to specify extern "C" if you want
| to reference them from the CUDA driver API. Then there is the
| fact that it will happily compile C++ classes where a pure C
| compiler will not. There is no stdc option for the compiler.
| thegabriele wrote:
| Could pandas benefit from this integration?
| binarymax wrote:
| Pandas uses numpy which uses C. So if numpy used CUDA then it
| would benefit.
| ashvardanian wrote:
| Check out CuDF. I've mentioned them in another comment on this
| thread.
| system2 wrote:
| What do you have in mind where Pandas would benefit from CUDA
| cores?
| math_dandy wrote:
| Is NVIDIA's JIT-based approach here similar to JAX's, except
| targeting CUDA directly rather than XLA? I would like to know
| how these different JIT compilers relate to one another.
| WhereIsTheTruth wrote:
| python is the winner, turning pseudo code into interesting stuff
|
| it's only the beginning, there is no need to create new
| programming languages anymore
| system2 wrote:
| I heard the same thing about Ruby, Go, TypeScript, and Rust.
| (Even JS at one point when NodeJS was super popular a few years
| ago).
|
| There will be new shiny things, but of course, my choice is
| Python too.
| wiseowise wrote:
| Python - new (34 years old), shiny thing.
| steelbrain wrote:
| See also: https://tinygrad.org/
|
| A reverse-engineered Python-only GPU API; it works not only with
| CUDA but also with AMD's ROCm.
|
| Other runtimes: https://docs.tinygrad.org/runtime/#runtimes
| odo1242 wrote:
| Technically speaking, all of this exists (including the existing
| library integration and whatnot) through CuPy and Numba already,
| but the fact that it's getting official support is cool.
| matt3210 wrote:
| We should find a new word, since "GPU" is from back when it was
| used for graphics.
| martinsnow wrote:
| Disagree. It's the name of the component and everyone working
| with it knows what capabilities it has. Perhaps you got old.
| SJC_Hacker wrote:
| General Processing Unit
|
| Greater Processing Unit
|
| Giant Processing Unit
| chrisrodrigue wrote:
| Linear Algebra Unit?
| ashvardanian wrote:
| CuTile, in many ways, feels like a successor to OpenAI's
| Triton... And not only are we getting tile/block-level primitives
| and TileIR, but also a proper SIMT programming model in CuPy,
| which I don't think enough people noticed even at this year's
| GTC. Very cool stuff!
|
| That said, there were almost no announcements or talks related to
| CPUs, despite the Grace CPUs being announced quite some time ago.
| It doesn't feel like we're going to see generalizable
| abstractions that work seamlessly across Nvidia CPUs and GPUs
| anytime soon. For someone working on parallel algorithms daily,
| this is an issue: debugging with NSight and CUDA-GDB still isn't
| the same as raw GDB, and it's much easier to design algorithms on
| CPUs first and then port them to GPUs.
|
| Of all the teams in the compiler space, Modular seems to be among
| the few that aren't entirely consumed by the LLM craze, actively
| building abstractions and languages spanning multiple platforms.
| Given the landscape, that's increasingly valuable. I'd love to
| see more people experimenting with Mojo -- perhaps it can finally
| bridge the CPU-GPU gap that many of us face daily!
| crazygringo wrote:
| Very curious how this compares to JAX [1].
|
| JAX lets you write Python code that executes on Nvidia, but also
| GPUs of other brands (support varies). It similarly has drop-in
| replacements for NumPy functions.
|
| This only supports Nvidia. But can it do things JAX can't? Is it
| easier to use? Is it less fixed-size-array-oriented? Is it worth
| locking yourself into one brand of GPU?
|
| [1] https://github.com/jax-ml/jax
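|
| For comparison, the JAX flavor of this -- NumPy-like code jitted
| through XLA -- looks something like this (a minimal sketch):
|
|     import jax
|     import jax.numpy as jnp
|
|     @jax.jit                 # traced once, compiled by XLA for
|     def step(a, b):          # whatever backend is available
|         return (a @ b + jnp.sin(a)).sum()
|
|     key = jax.random.PRNGKey(0)
|     a = jax.random.normal(key, (2048, 2048))
|     b = jax.random.normal(key, (2048, 2048))
|
|     result = step(a, b)
|     result.block_until_ready()   # JAX dispatch is asynchronous too
|     print(float(result))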
| odo1242 wrote:
| Well, the idea is that you'd be writing low level CUDA kernels
| that implement operations not already implemented by JAX/CUDA
| and integrate them into existing projects. Numba[1] is probably
| the closest thing I can think of that currently exists. (In
| fact, looking at it right now, it seems this effort from Nvidia
| is actually based on Numba)
|
| [1]: https://numba.readthedocs.io/en/stable/cuda/overview.html
| rahimnathwani wrote:
| Here is the repo: https://github.com/NVIDIA/cuda-python
| melodyogonna wrote:
| Will be interesting to compare the API to what Modular has [1]
| with Mojo.
|
| 1. https://docs.modular.com/mojo/stdlib/gpu/
| hingusdingus wrote:
| Hmm wonder what vulnerabilities will now be available with this
| addition.
| soderfoo wrote:
| Kind of late to the game, but can anyone recommend a good primer
| on GPU programming?
___________________________________________________________________
(page generated 2025-04-04 23:00 UTC)