[HN Gopher] Nvidia adds native Python support to CUDA
       ___________________________________________________________________
        
       Nvidia adds native Python support to CUDA
        
       Author : apples2apples
       Score  : 368 points
       Date   : 2025-04-04 12:54 UTC (10 hours ago)
        
 (HTM) web link (thenewstack.io)
 (TXT) w3m dump (thenewstack.io)
        
       | jbs789 wrote:
       | Source?
        
         | macksd wrote:
         | It seems to pretty much be about these packages:
         | 
         | https://nvidia.github.io/cuda-python/cuda-core/latest/
         | https://developer.nvidia.com/nvmath-python
        
         | dachworker wrote:
         | More informative than the article:
         | https://xcancel.com/blelbach/status/1902113767066103949 cuTile
         | seems to be the NVIDIA answer to OpenAI Triton.
        
         | pjmlp wrote:
         | The plethora of packages, including DSLs for compute and MLIR.
         | 
         | https://developer.nvidia.com/how-to-cuda-python
         | 
         | https://cupy.dev/
         | 
         | And
         | 
         | "Zero to Hero: Programming Nvidia Hopper Tensor Core with
         | MLIR's NVGPU Dialect" from 2024 EuroLLVM.
         | 
         | https://www.youtube.com/watch?v=V3Q9IjsgXvA
        
       | diggan wrote:
        | I'm no GPU programmer, but it seems easy to use even for
        | someone like me. I pulled together a quick demo of using the
        | GPU vs the CPU, based on what I could find
        | (https://gist.github.com/victorb/452a55dbcf59b3cbf84efd8c3097...)
        | which gave these results (after downloading 2.6GB of
        | dependencies, of course):
        | 
        |       Creating 100 random matrices of size 5000x5000 on CPU...
        |       Adding matrices using CPU...
        |       CPU matrix addition completed in 0.6541 seconds
        |       CPU result matrix shape: (5000, 5000)
        | 
        |       Creating 100 random matrices of size 5000x5000 on GPU...
        |       Adding matrices using GPU...
        |       GPU matrix addition completed in 0.1480 seconds
        |       GPU result matrix shape: (5000, 5000)
       | 
       | Definitely worth digging into more, as the API is really simple
       | to use, at least for basic things like these. CUDA programming
       | seems like a big chore without something higher level like this.
        
         | rahimnathwani wrote:
         | Thank you. I scrolled up and down the article hoping they
         | included a code sample.
        
           | diggan wrote:
           | Yeah, I figured I wasn't alone in doing just that :)
        
           | rahimnathwani wrote:
           | EDIT: Just realized the code doesn't seem to be using the GPU
           | for the addition.
        
         | wiredfool wrote:
          | Curious what the timing would be if it included the memory
          | transfer time, e.g.
          | 
          |       matrices = [np.random(...) for _ in range]
          |       time_start = time.time()
          |       cp_matrices = [cp.array(m) for m in matrices]
          |       add_(cp_matrices)
          |       sync
          |       time_end = time.time()
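          | 
          | Something like this (a rough, runnable CuPy sketch of the same
          | idea, not the gist's exact code; sizes are arbitrary) should
          | capture the transfer plus the compute in one measurement:
          | 
          |       import time
          |       import numpy as np
          |       import cupy as cp
          | 
          |       matrices = [np.random.rand(1000, 1000) for _ in range(100)]
          | 
          |       start = time.time()
          |       # host -> device copies
          |       gpu_matrices = [cp.asarray(m) for m in matrices]
          |       total = gpu_matrices[0]
          |       for m in gpu_matrices[1:]:
          |           total = total + m  # queued on the GPU stream
          |       # wait for the queued work before reading the clock
          |       cp.cuda.get_current_stream().synchronize()
          |       print(f"transfer + add: {time.time() - start:.4f} s")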
        
           | hnuser123456 wrote:
            | I think it does? (the comment is in the original source):
            | 
            |       print("Adding matrices using GPU...")
            |       start_time = time.time()
            |       gpu_result = add_matrices(gpu_matrices)
            |       # Not 100% sure what this does
            |       cp.cuda.get_current_stream().synchronize()
            |       elapsed_time = time.time() - start_time
           | 
           | I was going to ask, any CUDA professionals who want to give a
           | crash course on what us python guys will need to know?
        
             | apbytes wrote:
              | When you call a CUDA method, it is launched asynchronously.
              | That is, the function queues it up for execution on the GPU
              | and returns.
             | 
             | So if you need to wait for an op to finish, you need to
             | `synchronize` as shown above.
             | 
              | `get_current_stream`, because the queue mentioned above is
              | actually called a stream in CUDA.
             | 
             | If you want to run many independent ops concurrently, you
             | can use several streams.
             | 
              | Benchmarking is one use case for synchronize. Another would
              | be if you, say, run two independent ops in different
              | streams and need to combine their results.
             | 
              | Btw, if you work with PyTorch, ops run on the GPU are also
              | launched in the background. If you want to bench torch
              | models on the GPU, it likewise provides a sync API.
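              | 
              | A minimal CuPy sketch of that pattern (assuming CuPy is
              | installed; the shapes and ops are arbitrary):
              | 
              |       import cupy as cp
              | 
              |       s1, s2 = cp.cuda.Stream(), cp.cuda.Stream()
              | 
              |       with s1:
              |           a = cp.random.random((4000, 4000))
              |           a = a @ a   # queued on s1, returns immediately
              |       with s2:
              |           b = cp.random.random((4000, 4000))
              |           b = b @ b   # may overlap with the work on s1
              | 
              |       # Block the host until each queue has drained,
              |       # then it is safe to combine the results.
              |       s1.synchronize()
              |       s2.synchronize()
              |       c = a + b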
        
               | hnuser123456 wrote:
               | Thank you kindly!
        
               | claytonjy wrote:
                | I've always thought it was weird that GPU stuff in Python
                | doesn't use asyncio, and mostly assumed it was because
                | python-on-GPU predates asyncio. I was hoping a new lib
                | like this might right that wrong, but it doesn't. Maybe
                | for interop reasons?
               | 
               | Do other languages surface the asynchronous nature of
               | GPUs in language-level async, avoiding silly stuff like
               | synchronize?
        
               | apbytes wrote:
               | Might have to look at specific lib implementations, but
               | I'd guess that mostly gpu calls from python are actually
               | happening in c++ land. And internally a lib might be
               | using synchronize calls where needed.
        
               | ImprobableTruth wrote:
                | The reason is that the usage is completely different from
                | coroutine-based async. With GPUs you want to queue _as
                | many async operations as possible_ and only then
                | synchronize. That is, you would have a program like this
                | (pseudocode):
                | 
                |       b = foo(a)
                |       c = bar(b)
                |       d = baz(c)
                |       synchronize()
                | 
                | With coroutines/async await, something like this
                | 
                |       b = await foo(a)
                |       c = await bar(b)
                |       d = await baz(c)
                | 
                | would synchronize after every step, being much more
                | inefficient.
        
               | hackernudes wrote:
               | Pretty sure you want it to do it the first way in all
               | cases (not just with GPUs)!
        
               | halter73 wrote:
                | It really depends on whether you're dealing with an async
                | stream or a single async result as the input to the next
               | function. If a is an access token needed to access
               | resource b, you cannot access a and b at the same time.
               | You have to serialize your operations.
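                | 
                | E.g. (hypothetical names, just to illustrate the
                | dependency):
                | 
                |       token = await fetch_access_token()    # a
                |       data = await fetch_resource(token)    # b needs a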
        
           | nickysielicki wrote:
           | I don't mean to call you or your pseudocode out specifically,
           | but I see this sort of thing all the time, and I just want to
           | put it out there:
           | 
           | PSA: if you ever see code trying to measure timing and it's
           | not using the CUDA event APIs, it's _fundamentally wrong_ and
           | is lying to you. The simplest way to be sure you're not
           | measuring noise is to just ban the usage of any other timing
           | source. Definitely don't add unnecessary syncs just so that
           | you can add a timing tap.
           | 
           | https://docs.nvidia.com/cuda/cuda-runtime-
           | api/group__CUDART_...
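            | 
            | Since the thread is using CuPy, its Event wrappers around
            | those runtime calls would look roughly like this (a sketch,
            | not the code from the gist above):
            | 
            |       import cupy as cp
            | 
            |       a = cp.random.random((5000, 5000))
            |       b = cp.random.random((5000, 5000))
            | 
            |       start, end = cp.cuda.Event(), cp.cuda.Event()
            |       start.record()      # timestamped on the stream
            |       c = a + b           # async kernel launch
            |       end.record()
            |       end.synchronize()   # wait for the event, not the world
            |       ms = cp.cuda.get_elapsed_time(start, end)
            |       print(f"{ms:.3f} ms")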
        
             | bee_rider wrote:
             | If I have a mostly CPU code and I want to time the
             | scenario: "I have just a couple subroutines that I am
             | willing to offload to the GPU," what's wrong with
             | sprinkling my code with normal old python timing calls?
             | 
              | If I don't care what part of the CUDA ecosystem is taking
              | time (from my point of view it is a black-box that does
              | GEMMs), why not measure "time until my normal code is
              | running again?"
        
               | nickysielicki wrote:
               | If you care enough to time it, you should care enough to
               | time it correctly.
        
               | bee_rider wrote:
               | I described the correct way to time it when using the
               | card as a black-box accelerator.
        
               | nickysielicki wrote:
               | You can create metrics for whatever you want! Go ahead!
               | 
               | But cuda is not a black box math accelerator. You can
               | stupidly treat it as such, but that doesn't make it that.
               | It's an entire ecosystem with drivers and contexts and
               | lifecycles. If everything you're doing is synchronous
               | and/or you don't mind if your metrics include totally
               | unrelated costs, then time.time() is fine, sure. But if
               | that's the case, you've got bigger problems.
        
               | doctorpangloss wrote:
               | You're arguing with people who have no idea what they're
               | talking about on a forum that is a circular "increase in
               | acceleration" of a personality trait that gets co-opted
               | into arguing incorrectly about everything - a trait that
               | everyone else knows is defective.
        
               | gavinray wrote:
               | One of the wisest things I've read all week.
               | 
               | I authored one of the primary tools for GraphQL server
               | benchmarks.
               | 
               | I learned about the Coordinated Omission problem and
               | formats like HDR Histograms during the implementation.
               | 
               | My takeaway from that project is that not only is
               | benchmarking anything correctly difficult, but they all
               | ought to come with disclaimers of:
               | 
               |  _" These are the results obtained on X machine, running
               | at Y time, with Z resources."_
        
         | ashvardanian wrote:
         | CuPy has been available for years and has always worked great.
          | The article is about the next wave of Python-oriented JIT
          | toolchains that will allow writing actual GPU kernels in a
          | Pythonic style, instead of calling an existing precompiled GEMM
          | implementation in CuPy (like in that snippet) or JIT-ing CUDA
          | C++ kernels from Python source strings, which has also been
          | available for years:
         | https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...
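          | 
          | That existing raw-kernel path looks roughly like the example in
          | the linked CuPy docs: the kernel body is still CUDA C++ in a
          | string, compiled at runtime via NVRTC (a sketch):
          | 
          |       import cupy as cp
          | 
          |       add_kernel = cp.RawKernel(r'''
          |       extern "C" __global__
          |       void my_add(const float* x1, const float* x2, float* y) {
          |           int tid = blockDim.x * blockIdx.x + threadIdx.x;
          |           y[tid] = x1[tid] + x2[tid];
          |       }
          |       ''', 'my_add')
          | 
          |       x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
          |       x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
          |       y = cp.zeros((5, 5), dtype=cp.float32)
          |       add_kernel((5,), (5,), (x1, x2, y))  # grid, block, args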
        
           | almostgotcaught wrote:
           | it's funny - people around here really do not have a clue
           | about the GPU ecosystem even though everyone is _always_
           | talking about AI:
           | 
           | > The article is about the next wave of Python-oriented JIT
           | toolchains
           | 
            | the article is content marketing (for whatever) but the
            | actual product literally has nothing to do with kernels
            | or jitting or anything
           | 
           | https://github.com/NVIDIA/cuda-python
           | 
           | literally just cython bindings to CUDA runtime and CUB.
           | 
           | for once CUDA is aping ROCm:
           | 
           | https://github.com/ROCm/hip-python
        
             | dragonwriter wrote:
             | The mistake you seem to be making is confusing the existing
             | product (which has been available for many years) with the
             | upcoming new features for that product just announced at
             | GTC, which are not addressed _at all_ on the page for the
             | existing product, but are addressed in the article about
             | the GTC announcement.
        
               | almostgotcaught wrote:
               | > The mistake you seem to be making is confusing the
               | existing product
               | 
               | i'm not making any such mistake - i'm just able to
               | actually read and comprehend what i'm reading rather than
               | perform hype:
               | 
               | > Over the last year, NVIDIA made CUDA Core, which Jones
               | said is a "Pythonic reimagining of the CUDA runtime to be
               | naturally and natively Python."
               | 
               | so the article is about cuda-core, not whatever you think
               | it's about - so i'm responding directly to what the
               | article is about.
               | 
               | > CUDA Core has the execution flow of Python, which is
               | fully in process and leans heavily into JIT compilation.
               | 
               | this is bullshit/hype about Python's new JIT which womp
               | womp womp isn't all that great (yet). this has absolutely
               | nothing to do with any other JIT e.g., the cutile _kernel
               | driver JIT_ (which also has absolutely nothing to do with
               | what you think it does).
        
               | dragonwriter wrote:
               | > i'm just able to actually read and comprehend what i'm
               | reading rather than perform hype:
               | 
               | The evidence of that is lacking.
               | 
               | > so the article is about cuda-core, not whatever you
               | think it's about
               | 
               | cuda.core (a relatively new, rapidly developing, library
               | whose entire API is experimental) is one of several
               | things (NVMath is another) mentioned in the article, but
               | the newer and as yet unreleased piece mentioned in the
               | article and the GTC announcement, and a key part of the
               | "Native Python" in the headline, is the CuTile model [0]:
               | 
               | "The new programming model, called CuTile interface, is
               | being developed first for Pythonic CUDA with an extension
               | for C++ CUDA coming later."
               | 
               | > this is bullshit/hype about Python's new JIT
               | 
                | No, as is fairly explicit in the next line after the
               | one you quote, it is about the Nvidia CUDA Python
               | toolchain using in-process compilation rather than
               | relying on shelling out to out-of-process command-line
               | compilers for CUDA code.
               | 
               | [0] The article only has fairly vague qualitative
               | description of what CuTile is, but (without having to
               | watch the whole talk from GTC), one could look at this
               | tweet for a preview of what the Python code using the
                | model is expected to look like when it is released:
                | https://x.com/blelbach/status/1902113767066103949?t=uihk0M8V...
        
               | almostgotcaught wrote:
               | > No, as is is fairly explicit in the next line after the
               | one you quote, it is about the Nvidia CUDA Python
               | toolchain using in-process compilation rather than
               | relying on shelling out to out-of-process command-line
               | compilers for CUDA code.
               | 
               | my guy what i am able to read, which you are not, is the
               | source and release notes. i do not need to read tweets
               | and press releases because i know what these things
               | actually are. here are the release notes
               | 
               | > Support Python 3.13
               | 
               | > Add bindings for nvJitLink (requires nvJitLink from
               | CUDA 12.3 or above)
               | 
               | > Add optional dependencies on CUDA NVRTC and nvJitLink
               | wheels
               | 
               | https://nvidia.github.io/cuda-
               | python/latest/release/12.8.0-n...
               | 
               | do you understand what "bindings" and "optional
               | dependencies on..." means? it means there's nothing
               | happening in this library and these are... just bindings
               | to existing libraries. specifically that means you cannot
               | jit python using this thing ( _except via the python 3.13
               | jit interpreter_ ) and can only do what you've always
               | already been able to do with eg cupy (compile and run
               | C/C++ CUDA code).
               | 
               | EDIT: y'all realize that
               | 
               | 1. calling a compiler for your entire source file
               | 
               | 2. loading and running that compiled code
               | 
               | is not at all a JIT? y'all understand that right?
        
               | squeaky-clean wrote:
               | Isn't the main announcement of the article CuTile? Which
               | has not been released yet.
               | 
               | Also the cuda-core JIT stuff has nothing to do with
               | Python's new JIT, it's referring to integrating nvJitLink
               | with python, which you can see an example of in
               | cuda_core/examples/jit_lto_fractal.py
        
             | yieldcrv wrote:
             | I just want to see benchmarks. is this new one faster than
             | CuPy or not
        
             | ashvardanian wrote:
              | In case someone is looking for performance examples &
              | testimonials: even on an RTX 3090 vs a 64-core AMD
              | Epyc/Threadripper, a couple of years ago, CuPy was a
              | blast. I have a couple of recorded sessions with roughly
              | identical slides/numbers:
              | 
              |   - San Francisco Python meetup in 2023:
              |     https://youtu.be/L9ELuU3GeNc?si=TOp8lARr7rP4cYaw
              |   - Yerevan PyData meetup in 2022:
              |     https://youtu.be/OxAKSVuW2Yk?si=5s_G0hm7FvFHXx0u
              | 
              | Of the more remarkable results:
              | 
              |   - 1000x sorting speedup switching from NumPy to CuPy.
              |   - 50x performance improvements switching from Pandas to
              |     CuDF on the New York Taxi Rides queries.
              |   - 20x GEMM speedup switching from NumPy to CuPy.
             | 
             | CuGraph is also definitely worth checking out. At that
             | time, Intel wasn't in as bad of a position as they are now
             | and was trying to push Modin, but the difference in
             | performance and quality of implementation was mind-
             | boggling.
        
             | ladberg wrote:
             | The main release highlighted by the article is cuTile which
             | _is_ certainly about jitting kernels from Python code
        
               | almostgotcaught wrote:
               | > main release
               | 
               | there is no release of cutile (yet). so the only
               | substantive thing that the article can be describing is
               | cuda-core - which it does describe and is a recent/new
               | addition to the ecosystem.
               | 
               | man i can't fathom glazing a random blog this hard just
               | because it's tangentially related to some other thing (NV
               | GPUs) that clearly people only vaguely understand.
        
         | moffkalast wrote:
          | Only a 4x speedup seems rather low for GPU acceleration; does
          | numpy already use AVX2 or any SIMD?
         | 
         | For comparison, doing something similar with torch on CPU and
         | torch on GPU will get you like 100x speed difference.
        
           | diggan wrote:
            | It's a microbenchmark (if even that), so take it with a grain
            | of salt. You'd probably see a bigger difference with bigger
            | or more complicated tasks.
        
       | aixpert wrote:
        | Thank God PyTorch gained so much momentum before this came out.
        | Now we have a true platform-independent semi-standard for
        | parallel computations. We are not stuck with NVIDIA specifics.
        | 
        | It's great that the parts of PyTorch which concern the NVIDIA
        | backend can now be implemented in Python directly. The important
        | part is that it doesn't really matter, or shouldn't matter, for
        | end users / developers.
        | 
        | That being said, maybe this new platform will extend the whole
        | concept of on-GPU computation via Python to even more domains,
        | like maybe games.
        | 
        | Imagine running Rust (the game) performantly, mainly on the GPU,
        | via Python.
        
         | disgruntledphd2 wrote:
         | This just makes it much, much easier for people to build
         | numeric stuff on GPU, which is great.
         | 
         | I'm totally with you that it's better that this took so long,
         | so we have things like PyTorch abstracting most of this away,
         | but I'm looking forward to (in my non-existent free time :/ )
         | playing with this.
        
           | wafngar wrote:
           | Why not use torch.compile()?
        
       | the__alchemist wrote:
       | Rust support next? RN I am manually [de]serializing my data
       | structures as byte arrays to/from the kernels. It would be nice
       | to have truly shared data structures like CUDA gives you in C++!
        
         | Micoloth wrote:
         | What do you think of the Burn framework? (Honest question, I
         | have no clue what I'm talking about)
        
           | airstrike wrote:
           | I used it to train my own mini-GPT and I liked it quite a
           | bit. I tend to favor a different style of Rust with fewer
           | generics but maybe that just can't be avoided given the goals
           | of that project.
           | 
           | The crate seems to have a lot of momentum, with many new
           | features, releases, active communities on GH and Discord. I
           | expect it to continue to get better.
        
           | the__alchemist wrote:
           | Have not heard of it. Looked it up. Seems orthogonal?
           | 
           | I am using Cudarc.
        
         | chasely wrote:
          | The Rust-CUDA project just recently started up again [0]. I've
          | started digging into it a little bit and am hoping to
          | contribute to it since the summers are a little slower for me.
         | 
         | [0] https://github.com/rust-gpu/rust-cuda
        
           | the__alchemist wrote:
           | Still broken though! Has been for years. In a recent GH issue
           | regarding desires for the reboot, I asked: "Try it on a few
           | different machines (OS, GPUs, CUDA versions etc), make it
           | work on modern RustC and CUDA versions without errors." The
           | response was "That will be quite some work." Meanwhile,
           | Cudarc works...
        
             | chasely wrote:
             | Totally, it's going to take a minute to get it all working.
             | On a positive note, they recently got some sponsorship from
             | Modal [0], who is supplying GPUs for CI/CD so they should
             | be able to expand their hardware coverage.
        
         | LegNeato wrote:
         | https://github.com/rust-gpu/rust-cuda
        
           | the__alchemist wrote:
           | Not functional.
        
         | taminka wrote:
         | even putting aside how rust ownership semantics map poorly onto
         | gpu programming, ml researchers will never learn rust, this
         | will never ever happen...
        
           | the__alchemist wrote:
           | GPGPU programming != ML.
        
           | pjmlp wrote:
           | While I agree in principle, CUDA is more than only AI, as
           | people keep forgetting.
        
             | taminka wrote:
             | everyone else who uses cuda isn't going to learn rust
             | either
        
               | pjmlp wrote:
               | First Rust needs to have tier 1 support for CUDA, in a
               | way that doesn't feel like yak shaving when coding for
               | CUDA.
        
           | malcolmgreaves wrote:
            | ML researchers don't write code, they ask ChatGPT to make a
           | horribly inefficient, non-portable notebook that has to be
           | rewritten from scratch :)
        
             | staunton wrote:
             | It's made easier by that notebook only having to work just
             | once, to produce some plots for the paper/press
             | release/demo.
        
           | int_19h wrote:
           | The ML researchers are relying on libraries written by
           | someone else. Today, those libraries are mostly C++, and they
           | would benefit from Rust same as most other C++ codebases.
        
         | KeplerBoy wrote:
          | Isn't Rust still very seldom used in the areas where CUDA
          | shines (e.g. number crunching of any kind, be it
          | simulations or linear algebra)? Imo C++ or even Fortran are
          | perfectly fine choices for those things, since the memory
          | allocation patterns aren't that complicated.
        
           | pjmlp wrote:
            | Yes, and the new kid in town, Slang, has a better chance of
            | adoption.
        
             | KeplerBoy wrote:
             | sorry, could you link to the project? Seems there are quite
             | a few languages called slang.
        
               | _0ffh wrote:
               | I guess he might mean this one https://shader-slang.org/
               | though at first glance at least it looks more graphics
               | than GPGPU oriented.
               | 
               | Edit: Hmm, this part of the same project looks general
               | purpose-y and apparently integrates with PyTorch
               | https://slangpy.shader-slang.org/en/latest/
        
           | IshKebab wrote:
           | Mainly because number crunching code tends to be _very_ long-
           | lived (hence why FORTRAN is still in use).
        
       | gymbeaux wrote:
       | This is huge. Anyone who was considering AMD + ROCm as an
       | alternative to NVIDIA in the AI space isn't anymore.
       | 
       | I'm one of those people who can't (won't) learn C++ to the extent
       | required to effectively write code for GPU execution.... But to
       | have a direct pipeline to the GPU via Python. Wow.
       | 
       | The efficiency implications are huge, not just for Python
       | libraries like PyTorch, but also anything we write that runs on
       | an NVIDIA GPU.
       | 
       | I love seeing anything that improves efficiency because we are
       | constantly hearing about how many nuclear power plants OpenAI and
       | Google are going to need to power all their GPUs.
        
         | ErrorNoBrain wrote:
            | They are, if they can't find an Nvidia card.
        
           | pjmlp wrote:
           | NVidia cards are everywhere, the biggest difference to AMD is
           | that even my lousy laptop GeForce cards can be used for CUDA.
           | 
            | No need for an RTX for learning and getting into CUDA
           | programming.
        
         | ferguess_k wrote:
            | Just curious, why can't AMD do the same thing?
        
           | bigyabai wrote:
           | It can be argued that they already did. AMD and Apple worked
           | with Khronos to build OpenCL as a general competitor. The
           | industry didn't come together to support it though, and
           | eventually major stakeholders abandoned it altogether. Those
           | ~10 wasted years were spent on Nvidia's side refining their
           | software offerings and redesigning their GPU architecture to
           | prioritize AI performance over raster optimization. Meanwhile
           | Apple and AMD were pulling the rope in the opposite
           | direction, trying to optimize raster performance at all
           | costs.
           | 
           | This means that Nvidia is selling a relatively unique
           | architecture with a fully-developed SDK, industry buy-in and
           | relevant market demand. Getting AMD up to the same spot would
           | force them to reevaluate their priorities and demand a clean-
           | slate architecture to-boot.
        
             | pjmlp wrote:
              | Maybe because Apple got pissed at how Khronos took over
              | OpenCL, and AMD and Intel never offered tooling on par with
              | CUDA in terms of IDE integration, graphical debuggers and
              | library ecosystem.
              | 
              | Khronos also never saw the need to support a polyglot
              | ecosystem with C++, Fortran and anything else that the
              | industry could feel like using on a GPU.
              | 
              | When Khronos finally remembered to at least add C++ support
              | and SPIR, again Intel and AMD failed to deliver, and OpenCL
              | 3.0 is basically OpenCL 1.0 rebranded.
              | 
              | That was followed by SYCL efforts, which only Intel seems
              | to care about, with their own extensions on top via DPC++,
              | nowadays oneAPI. And only after acquiring Codeplay, which
              | was actually the first company to deliver on SYCL tooling.
              | 
              | However, contrary to AMD, at least Intel does get that
              | unless everyone gets to play with their software stack, no
              | one will bother to actually learn it.
        
               | bigyabai wrote:
               | Well, Apple has done nothing to replace the common
               | standard they abandoned. They failed to develop their
               | proprietary alternatives into a competitive position and
               | now can't even use their own TSMC dies (imported at great
               | expense) for training: https://www.eteknix.com/apple-set-
               | to-invest-1-billion-in-nvi...
               | 
               | However you want to paint the picture today, you can't
               | say the industry didn't try to resist CUDA. The
               | stakeholders shot each other in a 4-way Mexican standoff,
               | and Nvidia whistled showtunes all the way to the bank. If
               | OpenCL was treated with the same importance Vulkan was,
               | we might see a very different market today.
        
               | pjmlp wrote:
               | Yes they did, it is called Metal Compute, and everyone
               | using Apple devices has to use it.
               | 
               | Vulkan you say?
               | 
                | It is only relevant on GNU/Linux and Android, because
                | Google is pushing it, and even there most folks still
                | keep using OpenGL ES. No one else cares about it, and it
                | has already turned into the same spaghetti mess as
                | OpenGL, to the point that there was a roadmap talk at
                | Vulkanised 2025 on how to sort things out.
               | 
               | NVidia and AMD keep designing their cards with Microsoft
               | for DirectX first, and Vulkan, eventually.
        
               | bigyabai wrote:
               | > it is called Metal Compute, and everyone using Apple
               | devices has to use it.
               | 
               | Sounds like a submarket absolutely teeming with
               | competition. Like, you have Metal Compute, and Apple
               | Accelerate Framework and MLX all sitting there in the
               | same spot! Apple is really outdoing themselves, albeit in
               | a fairly literal sense.
               | 
               | > It is only relevant on GNU/Linux and Android
               | 
               | Hmm... someone ought to remind me of the first stage of
               | grief, I've forgotten it suddenly.
        
         | dismalaf wrote:
         | > But to have a direct pipeline to the GPU via Python
         | 
         | Have you ever used a GPU API (CUDA, OpenCL, OpenGL, Vulkan,
         | etc...) with a scripting language?
         | 
         | It's cool that Nvidia made a bit of an ecosystem around it but
         | it won't replace C++ or Fortran and you can't simply drop in
         | "normal" Python code and have it run on the GPU. CUDA is still
          | fundamentally its own thing.
         | 
         | There's also been CUDA bindings to scripting languages for at
         | least 15 years... Most people will probably still use Torch or
         | higher level things built on top of it.
         | 
         | Also, here's Nvidia's own advertisement and some instructions
         | for Python on their GPUs:
         | 
         | - https://developer.nvidia.com/cuda-python
         | 
         | - https://developer.nvidia.com/how-to-cuda-python
         | 
         | Reality is kind of boring, and the article posted here is just
         | clickbait.
        
           | freeone3000 wrote:
           | OpenCL and OpenGL are basically already scripting languages
           | that you happen to type into a C compiler. The CUDA advantage
           | was actually having meaningful types and compilation errors,
           | without the intense boilerplate of Vulkan. But this is 100% a
           | python-for-CUDA-C replacement on the GPU, for people who
           | prefer a slightly different bracketing syntax.
        
             | dismalaf wrote:
             | > But this is 100% a python-for-CUDA-C replacement on the
             | GPU
             | 
             | Ish. It's a Python maths library made by Nvidia, an eDSL
             | and a collection of curated libraries. It's not
             | significantly different than stuff like Numpy, Triton,
             | etc..., apart from being made by Nvidia and bundled with
             | their tools.
        
           | dragonwriter wrote:
           | > It's cool that Nvidia made a bit of an ecosystem around it
           | but it won't replace C++ or Fortran and you can't simply drop
           | in "normal" Python code and have it run on the GPU.
           | 
            | While it's not _exactly_ normal Python code, there are Python
           | libraries that allow writing GPU kernels in internal DSLs
           | that are normal-ish Python (e.g., Numba for CUDA specifically
           | via the @cuda.jit decorator; or Taichi which has multiple
           | backends supporting the same application code--Vulkan, Metal,
           | CUDA, OpenGL, OpenGL ES, and CPU.)
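            | 
            | To make that concrete, the Numba flavour of normal-ish Python
            | looks something like this (a minimal sketch, not the new
            | NVIDIA toolchain announced at GTC):
            | 
            |       from numba import cuda
            |       import numpy as np
            | 
            |       @cuda.jit
            |       def scale(x, out):
            |           i = cuda.grid(1)        # global thread index
            |           if i < x.size:
            |               out[i] = 2.0 * x[i]
            | 
            |       x = np.arange(1_000_000, dtype=np.float32)
            |       out = np.zeros_like(x)
            |       threads = 256
            |       blocks = (x.size + threads - 1) // threads
            |       # Numba handles the host<->device copies here
            |       scale[blocks, threads](x, out)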
           | 
           | Apparently, nVidia is now doing this first party in CUDA
           | Python, including adding a new paradigm for CUDA code
           | (CuTile) that is going to be in Python before C++; possibly
           | trying to get ahead of things like Taichi (which, because it
           | is cross-platform, commoditizes the underlying GPU).
           | 
           | > Also, here's Nvidia's own advertisement for Python on their
           | GPUs
           | 
           | That (and the documentation linked there) does not address
           | the new _upcoming_ native functionality announced at GTC;
           | existing CUDA Python has kernels written in C++ in inline
           | strings.
        
           | pjmlp wrote:
            | Yes, shading languages, which are more productive and free of
            | the gotchas of those languages, as they were designed from
            | the ground up for compute devices.
           | 
           | The polyglot nature of CUDA is one of the plus points versus
           | the original "we do only C99 dialect around here" from
           | OpenCL, until it was too late.
        
       | DeathArrow wrote:
       | >In 2024, Python became the most popular programming language in
       | the world -- overtaking JavaScript -- according to GitHub's 2024
       | open source survey.
       | 
        | I wonder why Python took over the world. Of course, it's easy to
        | learn, and it might be easy to read and understand. But it also
        | has a few downsides: low performance, single-threaded execution,
        | lack of static typing.
        
         | nhumrich wrote:
          | Perhaps performance, multithreading, and static typing are not
          | the #1 things that make a language great.
         | 
         | My guess: it's the community.
        
           | chupasaurus wrote:
           | All 3 are achieved in Python with a simple _import ctypes_
           | /sarcasm
        
         | timschmidt wrote:
         | Universities seem to have settled on it for CSE 101 courses in
         | the post-Java academic programming era.
        
         | PeterStuer wrote:
         | It's the ecosystem, specifically the _huge_ amount of packages
         | available for everything under the sun.
        
         | diggan wrote:
         | > I wonder why Python take over the world?
         | 
         | Not sure what "most popular programming language in the world"
         | even means, in terms of existing projects? In terms of
         | developers who consider it their main language? In terms of
         | existing actually active projects? According to new projects
         | created on GitHub that are also public?
         | 
         | My guess is that it's the last one, which probably isn't what
         | one would expect when hearing "the most popular language in the
         | world", so worth keeping in mind.
         | 
          | But considering that AI/ML is the hype today, and everyone
          | wants their piece of the pie, it makes sense that there are
          | more public Python projects created on GitHub today compared
          | to other languages, as most AI/ML is Python.
        
         | owebmaster wrote:
         | In this case, it's because JS ecosystem is now divided between
         | JavaScript and TypeScript
        
           | timschmidt wrote:
           | As soon as WASM has native bindings for the DOM, I think
           | you're going to see a lot of the energy in the JS ecosystem
           | drain back into other languages.
        
         | lenerdenator wrote:
         | I do backend web server development using FastAPI/Starlette and
         | Django.
         | 
         | If I were a Ruby developer, I'd be using Rails, and I'd also be
         | describing 90% of Ruby development.
         | 
         | However, I do Python. What I'm describing is a tiny fraction of
         | Python development.
         | 
         | If you want to do something with computer code - data analysis,
          | ML, web development, duct-taping together parts of a *NIX
         | system, even some game development - you can do it _reasonably_
         | well, if not better, in Python. The paths that you can take are
         | limitless, and that gets people interested.
        
           | EnergyAmy wrote:
           | There's the pithy saying that "Python is the second-best
           | language for anything", and that's kind of its superpower.
        
         | TechDebtDevin wrote:
         | I don't know. It absolutely annoys me. Go is more readable,
         | easier to learn, more efficient, more fun to write but doesn't
         | have all the math/ml packages people want. I'd like to get
         | involved in catching Go up to Python in the ML space but Go is
         | so behind.
        
           | leosanchez wrote:
           | > more fun to write
           | 
           | Go is definitely not fun to write. The rest I agree.
        
         | georgeecollins wrote:
         | Less typing-- I mean keystrokes.
         | 
         | All the things that are not great about it make it easier to
         | learn. No static typing, no control of memory, no threads.
         | 
         | When I started there was a language like BASIC or Visual BASIC
         | that was easy to learn (or also quick to use) and C or C++ that
          | was performant. If the world now is Python and Rust or Go, I
          | think that it is just a better world for programmers. I say
          | that as someone comfortable with C / C++ / Java. They had
          | their time and will still be with us, but the improvement is
          | real.
        
         | hbn wrote:
         | It's "easy to learn" and you get all the downsides that come
         | with that.
         | 
         | At work right now we're integrating with scoring models hosted
         | in Amazon SageMaker written by a "modelling team" and as far as
         | I can tell they follow absolutely no basic coding practices.
         | They give us the API and are asking us to send English strings
         | of text for names of things instead of any real keys, and
         | they're just comparing against plain strings and magic numbers
         | everywhere so if they're asked to make any change like renaming
         | something it's a herculean task that breaks a bunch of other
          | things. Something will break when a field is null, and then
          | they'll tell us that, instead of sending null when we have no
          | data, we should send -9999999. One time something broke and it
          | turned out to be
         | because we sent them "MB" (Manitoba) as someone's province, and
         | whoever wrote it was just plain-text checking against a list of
         | province codes as strings and didn't remember to include
         | Manitoba.
         | 
         | I know this is still mainly a business/management issue that
         | they're allowing people who don't know how to code to write
         | code, but I'm sure this is happening at other companies, and I
         | think Python's level of accessibility at the beginner level has
         | been a real blight to software quality.
        
         | lenkite wrote:
         | > I wonder why Python take over the world
         | 
         | Because data-science/ML/LLM's have taken over the world now and
         | no other language offers best-in-breed libraries and
         | frameworks.
         | 
         | Other languages need to get off their ass and start offering
         | options soon or be relegated to niche domains.
        
       | CapsAdmin wrote:
       | Slightly related, I had a go at doing llama 3 inference in luajit
       | using cuda as one compute backend for just doing matrix
       | multiplication
       | 
       | https://github.com/CapsAdmin/luajit-llama3/blob/main/compute...
       | 
       | While obviously not complete, it was less than I thought was
       | needed.
       | 
       | It was a bit annoying trying to figure out which version of the
       | function (_v2 suffix) I have to use for which driver I was
       | running.
       | 
        | Also sometimes a bit annoying is the stateful nature of the API.
        | Very similar to OpenGL. Hard to debug at times as to why
        | something refuses to compile.
        
         | andrewmcwatters wrote:
         | Neat, thanks for sharing!
        
       | btown wrote:
       | The GTC 2025 announcement session that's mentioned in this
       | article has video here: https://www.nvidia.com/en-us/on-
       | demand/session/gtc25-s72383/
       | 
       | It's a holistic approach to all levels of the stack, from high-
       | level frameworks to low-level bindings, some of which is
       | highlighting existing libraries, and some of which are completely
       | newly announced.
       | 
       | One of the big things seems to be a brand new Tile IR, at the
       | level of PTX and supported with a driver level JIT compiler, and
       | designed for Python-first semantics via a new cuTile library.
       | 
       | https://x.com/JokerEph/status/1902758983116657112 (without login:
       | https://xcancel.com/JokerEph/status/1902758983116657112 )
       | 
       | Example of proposed syntax:
       | https://pbs.twimg.com/media/GmWqYiXa8AAdrl3?format=jpg&name=...
       | 
       | Really exciting stuff, though with the new IR it further widens
       | the gap that projects like https://github.com/vosen/ZLUDA and
       | AMD's own tooling are trying to bridge. But vendor lock-in isn't
       | something we can complain about when it arises from the vendor
       | continuing to push the boundaries of developer experience.
        
         | skavi wrote:
         | i'm curious what advantage is derived from this existing
         | independently of the PTX stack? i.e. why doesn't cuTile produce
         | PTX via a bundled compiler like Triton or (iirc) Warp?
         | 
         | Even if there is some impedance mismatch, could PTX itself not
         | have been updated?
        
       | lunarboy wrote:
       | Doesn't this mean AMD could make the same python API that targets
       | their hardware, and now Nvidia GPUs aren't as sticky?
        
         | KeplerBoy wrote:
          | Nothing really changed. AMD already has a C++ (HIP) dialect
         | very similar to CUDA, even with some automatic porting efforts
         | (hipify).
         | 
         | AMD is held back by the combination of a lot of things. They
         | have a counterpart to almost everything that exists on the
         | other side. The things on the AMD side are just less mature
         | with worse documentation and not as easily testable on consumer
         | hardware.
        
         | pjmlp wrote:
         | They could, AMD's problem is that they keep failing on
         | delivery.
        
       | jmward01 wrote:
        | This will probably lead to what, I think, Python has led to in
        | general: a lot more things tried quickly, with the targeted
        | things staying in a faster language. All in all this is a great
        | move. I am looking forward to playing with it for sure.
        
       | chrisrodrigue wrote:
       | Python is really shaping up to be the lingua franca of
       | programming languages. Its adoption is soaring in this FOSS
       | renaissance and I think it's the closest thing to a golden hammer
       | that we've ever had.
       | 
       | The PEP model is a good vehicle for self-improvement and
       | standardization. Packaging and deployment will soon be solved
       | problems thanks to projects such as uv and BeeWare, and I'm
       | confident that we're going to see continued performance
       | improvements year over year.
        
         | silisili wrote:
         | > Packaging and deployment will soon be solved problems
         | 
         | I really hope you're right. I love Python as a language, but
         | for any sufficiently large project, those items become an
         | absolute nightmare without something like Docker. And even
         | with, there seems to be multiple ways people solve it. I wish
         | they'd put something in at the language level or bless an
         | 'official' one. Go has spoiled me there.
        
           | horsawlarway wrote:
           | Honestly, I'm still incredibly shocked at just how bad Python
           | is on this front.
           | 
           | I'm plenty familiar with packaging solutions that are painful
           | to work with, but the state of python was _shocking_ when I
           | hopped back in because of the available ML tooling.
           | 
           | UV seems to be at least somewhat better, but damn - watching
           | pip literally download 20+ 800MB torch wheels over and over
           | trying to resolve deps only to waste 25GB of bandwidth and
           | finally completely fail after taking nearly an hour was
           | absolutely staggering.
        
             | SJC_Hacker wrote:
             | Python was not taken seriously as something you actually
             | shipped to non-devs. The solution was normally "install the
             | correct version of Python on the host system". In the Linux
             | world, this could be handled through Docker, pyenv. For
             | Windows users, this meant installing a several GB distro
             | and hoping it didn't conflict with what was already on the
             | system.
        
         | whycome wrote:
         | Would you say Python is a good language to learn as a beginner?
        
           | dpkirchner wrote:
           | Not the person you replied to but I'd say definitely not.
           | It'd be easy to pick up bad habits from python (untyped
           | variables) and try to carry them over to other languages.
           | It's also the king of runtime errors, which will frustrate
           | newbies.
           | 
           | I think a compiled language is a better choice for people
           | just getting started. Java is good, IMO, because it is
           | verbose. Eventually the beginner may get tired of the
           | verbosity and move on to something else, but at least they'll
           | understand the value of explicit types and compile-time
           | errors.
        
             | SJC_Hacker wrote:
              | Huh? Python variables definitely have a type. It's just
              | determined at runtime.
              | 
              | The only untyped language I know of, at least among modern
              | ones, is assembler.
             | 
             | Well and C, if you make everything void*.
        
           | chrisrodrigue wrote:
           | Yeah, definitely. It's basically executable pseudocode and
           | it's really simple for a beginner to pick up and hit the
           | ground running for a variety of use cases.
           | 
           | Some people will tell you to start with C or C++ to get a
           | better intuition for what's actually happening under the hood
           | in Python, but that's not really necessary for most use cases
           | unless you're doing something niche. Some of the most popular
           | use cases for Python are webapps, data analysis, or general
           | automation. For the 1% of use cases that Python isn't the
           | right fit for, you can still use it to prototype or glue
           | things together.
           | 
           | There are a lot of great resources out there for learning
           | Python, but they won't necessarily teach you how to make
           | great software. You can't go wrong with the official
           | tutorial. https://learn.scientific-python.org/development/ is
           | pretty terse and incorporates a lot of best practices.
        
             | somethingsome wrote:
              | I was teaching Python long ago to complete beginners in
              | programming.
              | 
              | Honestly, the language became kinda harsh for newcomers.
              | What we see as developers is "it's like pseudocode that
              | runs".
              | 
              | But a beginner is often left behind by the billions of
              | methods in each class. They are not used to documentation,
              | and spend quite a huge amount of time learning by heart
              | stupid things like: here it's 'len()', in this case it's
              | '.len()', there it's '.length', etc., for many, many
              | methods that all have their idiosyncrasies.
              | 
              | At least in C/(easy)C++, you need to build most of it
              | yourself, which helps the understanding.
              | 
              | I'm not completely against Python as a first language, but
              | it needs to be taught well, and that could include working
              | with a very minimal set of functions on every object. Then
              | you can expand and incorporate more and more methods that
              | make life easier.
        
           | airstrike wrote:
           | As someone who spent nearly a decade with Python, I'd say 90%
           | of people will answer "yes", so I'd like to offer a different
           | perspective.
           | 
           | IMHO if you want to pick it up for a couple toy projects just
           | to get a feel of what coding is like, then by all means try
           | it out. But eventually you'll benefit tremendously from
           | exploring other languages.
           | 
           | Python will teach you a lot of bad habits. You will feel like
           | you know what you're doing, but only because you don't know
           | all of the ways in which it is handwaving a lot of complexity
           | that is inherent to writing code which you should be very
           | much aware of.
           | 
           | Knowing what I know now, I wish Rust existed when I started
           | out so that it could have been my first language. I'm never
           | giving up the borrow checker and the type system that come
           | with it.
           | 
           | But you don't have to do Rust. It's fine to work on a couple
           | of projects in Python, then maybe something small in C
           | (though the tooling can feel arcane and frustrating), then
           | maybe switch it up and go with some more functional
           | programming (FP) flavored like Lisp or F#.
           | 
           | I know Rust has a lot of zealots and a lot of haters, but I'm
           | not pushing an agenda. I just think it strikes that perfect
           | balance between being extremely expressive, clear to read
           | (after maybe a month of writing it daily), strong type
           | system, lots of FP elements, no OOP clutter but super
           | powerful traits, the borrow checker which you'll invariably
           | learn to love, and more...
           | 
           | This will give you a strong foundation upon which you'll be
           | able to continuously build knowledge. And even if you start
           | with Rust, you should definitely explore Python, C, Lisp and
           | F# later (or maybe Haskell instead of F#)
        
             | system2 wrote:
             | With the help of GPT, I think the bad habit part is non-
             | existent anymore. Learning it from GPT really helps people
             | nowadays. Ask ChatGPT 4.0 some questions, and you will be
             | shocked by how well it describes the code.
             | 
             | Just don't ask to fix indentations because it will do it
             | line by line for hours. But it finds mistakes quickly and
             | points you in the right direction.
             | 
             | And of course, it comes up with random non-existent modules
             | once in a while which is cute to me.
        
               | airstrike wrote:
               | The bad habits I was thinking about were more in the line
               | of not understanding how memory is being used (even
               | something as simple as stack vs. heap allocation), not
               | having a type system that forces you to think about the
               | types of data structure you have in your system, and
               | overall just being forced to _design_ before _coding_
        
           | silisili wrote:
           | I go back and forth on this. A lot of people make good
           | points.
           | 
           | In the end, my final answer is - yes. I say that because I
           | believe it's the easiest programming language to get
           | something working in. And getting something working is what
           | motivates people to keep going.
           | 
            | If you sit them down and say 'well, before you learn Python
            | you need to learn how a computer really works, here's an x86
            | ASM book', they're probably gonna read 10 pages, say this is
            | boring, then go do something else. I think that because I
            | went through it as a kid - I started reading a C++ book with
            | no knowledge and gave up. It wasn't until I found QBasic and
            | VB, by all marks terrible languages, that I really got
            | motivated to learn and keep going, because progress was so
            | easy.
           | 
           | Python will teach you the basics - control flow, loops,
           | variables, functions, libraries, etc. Those apply to almost
           | every language. Then when you move to a different language,
           | you at least know the basics and can focus on what's
           | different or added that you didn't have or know before.
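            | 
            | Even a toy snippet exercises most of those basics at once
            | (just an illustrative example):
            | 
            |     from statistics import mean      # libraries
            | 
            |     def describe(xs):                # functions
            |         if not xs:                   # control flow
            |             return "empty"
            |         total = 0                    # variables
            |         for x in xs:                 # loops
            |             total += x
            |         return f"sum={total}, mean={mean(xs)}"
            | 
            |     print(describe([1, 2, 3, 4]))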
        
             | SJC_Hacker wrote:
              | Yeah, Python or JavaScript should be the first language for
              | most people.
              | 
              | People like flashy things, and Python and JavaScript make
              | it roughly 10x easier to get something flashy working.
              | Console I/O doesn't really cut it anymore.
              | 
              | Later on you can deal with memory allocation, bit-
              | twiddling, 2's complement arithmetic, lower-level OS
              | details, etc.
        
           | IshKebab wrote:
            | I would personally recommend JavaScript/TypeScript over
            | Python, but Python is a reasonable option. Especially now
            | that we have uv, so you don't have to crawl through the bush
            | of thorns that Python's terrible tooling (pip, venv, etc.)
            | surrounds you with.
            | 
            | I would just encourage you to move on from Python fairly
            | quickly. It's like... a balance bike. Easy to learn, and it
            | teaches you how to balance, but you don't want to actually
            | use it to get around.
        
         | ergonaught wrote:
         | > Packaging and deployment will soon be solved problems ...
         | 
         | I hope so. Every time I engage in a "Why I began using Go aeons
         | ago" conversation, half of the motivation was this. The reason
         | I stopped engaging in them is because most of the participants
         | apparently cannot see that this is even a problem. Performance
         | was always the second problem (with Python); this was always
         | the first.
        
         | airstrike wrote:
         | Python is too high level, slow and duck-typed to even be
         | considered for a huge number of projects.
         | 
         | There is no one-size-fits-all programming language.
        
         | pjmlp wrote:
          | It's the new BASIC, Pascal, and Lisp.
         | 
         | Now if only CPython also got a world class JIT, V8 style.
        
         | int_19h wrote:
          | AI-generated code is going to be a major influence going
          | forward. Regardless of how you feel about its quality (I'm a
          | pessimist myself), it's happening anyway, and it's going to
          | cement the dominant position of the languages LLMs understand
          | and write best. That correlates strongly with their prevalence
          | in the training set, which means Python and JavaScript in
          | particular are here to stay now, and will likely be
          | increasingly shoved into more and more niches - even those
          | they aren't well suited to - solely because LLMs can write
          | them.
        
         | screye wrote:
          | Never heard of BeeWare, but Astral's products have transformed
          | my Python workflow (uv, ruff).
          | 
          | Is BeeWare that transformational? What does BeeWare do, and
          | what is its maturity level?
        
       | ryao wrote:
       | CUDA was born from C and C++
       | 
       | It would be nice if they actually implemented a C variant of CUDA
       | instead of extending C++ and calling it CUDA C.
        
         | swyx wrote:
          | Why is that important to you? Just trying to understand the
          | problem you couldn't solve without a C-like variant.
        
           | kevmo314 wrote:
           | A strict C variant would indeed be quite nice. I've wanted to
           | write CUDA kernels in Go apps before so the Go app can handle
           | the concurrency on the CPU side. Right now, I have to write a
           | C wrapper and more often than not, I end up writing more code
           | in C++ instead.
           | 
           | But then I end up finding myself juggling mutexes and wishing
           | I had some newer language features.
        
           | ryao wrote:
           | I want to write C code, not C++ code. Even if I try to write
           | C style C++, it is more verbose and less readable, because of
           | various C++isms. For example, having to specify extern "C" to
           | get sane ABI names for the Nvidia CUDA driver API:
           | 
           | https://docs.nvidia.com/cuda/cuda-driver-api/index.html
           | 
            | Not to mention that C++ does not support neat features like
            | variable-length arrays on the stack.
        
         | pjmlp wrote:
          | First of all, they extended C; with CUDA 3.0, initial support
          | was added for C++; afterwards they bought PGI and added
          | Fortran into the mix.
         | 
         | Alongside for the ride, they fostered an ecosystem from
         | compiled language backends targeting CUDA.
         | 
         | Additionally modern CUDA supports standard C++ as well, with
         | frameworks that hide the original extensions.
         | 
         | Most critics don't really get the CUDA ecosystem.
        
           | ryao wrote:
            | They replaced C with C++. For example, try passing a function
            | pointer as a void pointer argument without a cast. C says
            | this should work; C++ says it should not. There are plenty of
            | other differences that make it C++ and not C, if you know to
            | look for them. For one, C++ symbol name mangling is used,
            | which means you need to specify extern "C" if you want to
            | reference symbols from the CUDA driver API. Then there is the
            | fact that it will happily compile C++ classes where a pure C
            | compiler will not. There is no stdc option for the compiler.
        
       | thegabriele wrote:
       | Could pandas benefit from this integration?
        
         | binarymax wrote:
          | Pandas uses NumPy, which uses C. So if NumPy used CUDA, then
          | it would benefit.
        
         | ashvardanian wrote:
         | Check out CuDF. I've mentioned them in another comment on this
         | thread.
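          | 
          | Roughly, cuDF mirrors the pandas API while keeping the data
          | in GPU memory. An untested sketch (assuming the RAPIDS cudf
          | package and a supported GPU are installed):
          | 
          |     import cudf
          | 
          |     # DataFrame lives in GPU memory; the API mirrors pandas
          |     df = cudf.DataFrame({"key": ["a", "b", "a", "b"],
          |                          "value": [1.0, 2.0, 3.0, 4.0]})
          |     means = df.groupby("key")["value"].mean()
          | 
          |     # Copy the result back to a regular pandas object on CPU
          |     print(means.to_pandas())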
        
         | system2 wrote:
          | What do you have in mind where Pandas would benefit from CUDA
          | cores?
        
       | math_dandy wrote:
        | Is NVIDIA's JIT-based approach here similar to JAX's, except
        | targeting CUDA directly rather than XLA? I'd like to know how
        | these different JIT compilers relate to one another.
        
       | WhereIsTheTruth wrote:
        | Python is the winner, turning pseudocode into interesting stuff.
        | 
        | It's only the beginning; there is no need to create new
        | programming languages anymore.
        
         | system2 wrote:
          | I heard the same thing about Ruby, Go, TypeScript, and Rust
          | (even JS at one point, when Node.js was super popular a few
          | years ago).
         | 
         | There will be new shiny things, but of course, my choice is
         | Python too.
        
           | wiseowise wrote:
           | Python - new (34 years old), shiny thing.
        
       | steelbrain wrote:
       | See also: https://tinygrad.org/
       | 
        | A reverse-engineered, Python-only GPU API that works not only
        | with CUDA but also with AMD's ROCm.
       | 
       | Other runtimes: https://docs.tinygrad.org/runtime/#runtimes
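        | 
        | A minimal, untested sketch of what using it looks like
        | (assuming tinygrad is installed; the backend -- CUDA, ROCm,
        | CPU, etc. -- is normally selected via environment variables):
        | 
        |     from tinygrad import Tensor
        | 
        |     a = Tensor.rand(512, 512)
        |     b = Tensor.rand(512, 512)
        |     c = (a @ b).relu()      # lazily builds a kernel graph
        |     print(c.numpy().shape)  # realizes it and copies to host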
        
       | odo1242 wrote:
       | Technically speaking, all of this exists (including the existing
       | library integration and whatnot) through CuPy and Numba already,
       | but the fact that it's getting official support is cool.
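        | 
        | For example, the CuPy route is already close to a drop-in
        | NumPy replacement on the GPU. A rough sketch (assuming a cupy
        | build matching your CUDA version is installed):
        | 
        |     import cupy as cp
        | 
        |     x = cp.random.rand(4096, 4096)       # allocated on the GPU
        |     y = cp.fft.fft2(x)                   # backed by cuFFT
        |     total = cp.asnumpy(cp.abs(y).sum())  # copy result to host
        |     print(total)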
        
       | matt3210 wrote:
        | We should find a new word, since "GPU" dates from back when it
        | was used for graphics.
        
         | martinsnow wrote:
         | Disagree. It's the name of the component and everyone working
         | with it knows what capabilities it has. Perhaps you got old.
        
         | SJC_Hacker wrote:
         | General Processing Unit
         | 
         | Greater Processing Unit
         | 
         | Giant Processing Unit
        
         | chrisrodrigue wrote:
         | Linear Algebra Unit?
        
       | ashvardanian wrote:
       | CuTile, in many ways, feels like a successor to OpenAI's
       | Triton... And not only are we getting tile/block-level primitives
       | and TileIR, but also a proper SIMT programming model in CuPy,
       | which I don't think enough people noticed even at this year's
       | GTC. Very cool stuff!
       | 
       | That said, there were almost no announcements or talks related to
       | CPUs, despite the Grace CPUs being announced quite some time ago.
       | It doesn't feel like we're going to see generalizable
       | abstractions that work seamlessly across Nvidia CPUs and GPUs
       | anytime soon. For someone working on parallel algorithms daily,
       | this is an issue: debugging with NSight and CUDA-GDB still isn't
       | the same as raw GDB, and it's much easier to design algorithms on
       | CPUs first and then port them to GPUs.
       | 
       | Of all the teams in the compiler space, Modular seems to be among
       | the few that aren't entirely consumed by the LLM craze, actively
       | building abstractions and languages spanning multiple platforms.
       | Given the landscape, that's increasingly valuable. I'd love to
       | see more people experimenting with Mojo -- perhaps it can finally
       | bridge the CPU-GPU gap that many of us face daily!
        
       | crazygringo wrote:
       | Very curious how this compares to JAX [1].
       | 
       | JAX lets you write Python code that executes on Nvidia, but also
       | GPUs of other brands (support varies). It similarly has drop-in
       | replacements for NumPy functions.
       | 
        | This only supports Nvidia. But can it do things JAX can't? Is it
        | easier to use? Is it less fixed-size-array-oriented? Is it worth
       | locking yourself into one brand of GPU?
       | 
       | [1] https://github.com/jax-ml/jax
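        | 
        | For comparison, the JAX flavor of the same idea -- an untested
        | sketch, assuming jax plus a CUDA-enabled jaxlib are installed:
        | 
        |     import jax
        |     import jax.numpy as jnp
        | 
        |     @jax.jit                   # traced once, compiled via XLA
        |     def step(x):
        |         return jnp.tanh(x @ x.T).sum()
        | 
        |     x = jax.random.normal(jax.random.PRNGKey(0), (2048, 2048))
        |     print(step(x))             # runs on the GPU if one is visible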
        
         | odo1242 wrote:
          | Well, the idea is that you'd be writing low-level CUDA kernels
          | that implement operations not already provided by JAX/CUDA and
          | integrating them into existing projects. Numba[1] is probably
          | the closest thing I can think of that currently exists. (In
          | fact, looking at it right now, it seems this effort from
          | Nvidia is actually based on Numba.)
         | 
         | [1]: https://numba.readthedocs.io/en/stable/cuda/overview.html
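          | 
          | To make that concrete, a kernel written against today's
          | Numba CUDA target might look like this (untested sketch,
          | assuming numba and a CUDA toolkit are installed):
          | 
          |     import numpy as np
          |     from numba import cuda
          | 
          |     @cuda.jit
          |     def axpy(a, x, y, out):
          |         i = cuda.grid(1)            # global thread index
          |         if i < out.size:
          |             out[i] = a * x[i] + y[i]
          | 
          |     n = 1 << 20
          |     x = np.random.rand(n).astype(np.float32)
          |     y = np.random.rand(n).astype(np.float32)
          |     out = np.zeros_like(x)
          |     threads = 256
          |     blocks = (n + threads - 1) // threads
          |     # host arrays are copied to the device implicitly
          |     axpy[blocks, threads](np.float32(2.0), x, y, out)
          |     print(out[:4])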
        
       | rahimnathwani wrote:
       | Here is the repo: https://github.com/NVIDIA/cuda-python
        
       | melodyogonna wrote:
       | Will be interesting to compare the API to what Modular has [1]
       | with Mojo.
       | 
       | 1. https://docs.modular.com/mojo/stdlib/gpu/
        
       | hingusdingus wrote:
        | Hmm, I wonder what vulnerabilities this addition will open up.
        
       | soderfoo wrote:
       | Kind of late to the game, but can anyone recommend a good primer
       | on GPU programming?
        
       ___________________________________________________________________
       (page generated 2025-04-04 23:00 UTC)