[HN Gopher] Unifying the CUDA Python Ecosystem
___________________________________________________________________
Unifying the CUDA Python Ecosystem
Author : pjmlp
Score : 108 points
Date : 2021-04-16 14:44 UTC (7 hours ago)
(HTM) web link (developer.nvidia.com)
(TXT) w3m dump (developer.nvidia.com)
| andrew_v4 wrote:
| Just for contrast, it's interesting to look at an example of
| writing a similar kernel in Julia:
|
| https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/
|
| I don't think it's possible to achieve something like this in
| Python because of how it's interpreted (but it sounds a bit like
| what another comment mentioned, where the Python was compiled to
| C).
| rrss wrote:
| I think the contrast is probably less about the language and
| more about the scope and objective of the projects. The blog is
| describing low-level interfaces in Python - probably more
| comparable is the old CUDAdrv.jl package (now merged into
| CUDA.jl):
| https://github.com/JuliaGPU/CUDAdrv.jl/blob/master/examples/...
|
| Here is a similar kernel written in Python with numba:
| https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%2...
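|
| (A minimal sketch of that style, not the exact code from the
| linked notebook; the names here are just illustrative:)
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit
|     def vadd(x, y, out):
|         i = cuda.grid(1)              # global thread index
|         if i < out.shape[0]:
|             out[i] = x[i] + y[i]
|
|     n = 1 << 20
|     x = np.ones(n, dtype=np.float32)
|     y = 2 * np.ones(n, dtype=np.float32)
|     out = np.zeros_like(x)
|
|     threads = 256
|     blocks = (n + threads - 1) // threads
|     # numba moves the host arrays to/from the GPU for you here
|     vadd[blocks, threads](x, y, out)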
| jjoonathan wrote:
| I gave numba CUDA a spin in late 2018 and was severely
| disappointed. It didn't work out of the box, I had to tweak
| the source to remove a reference to an API that had been
| removed from CUDA more than a year prior (and deprecated long
| ago). Then I ran into a bug when converting a float array to
| a double array -- I had to declare the types three different
| times and it still did a naive byte-copy rather than a
| conversion. Thanks to a background in numerics, the symptoms
| were obvious, but yikes. The problem that finally did us in
| was an inability to get buffers to correctly pass between
| kernels without a CPU copy, which was absolutely critical for
| our perf. I think this was supported in theory but just
| didn't work.
|
| In any case, we did a complete rewrite in CUDA proper in less
| time than we spent banging our heads against that last numba-
| CUDA issue.
|
| Under every language bridge there are trolls and numba-CUDA
| had some mean ones. Hopefully things have gotten better but
| I'm definitely still inside the "once bitten twice shy"
| period.
| machineko wrote:
| Every time there is a topic about Python, there is this one
| Julia guy who spams a Julia "alternative" to the Python
| solution. Can you guys just stop? It kinda feels like watching
| a cult.
| anon_tor_12345 wrote:
| I mentioned this in my response to the other comment, but
| straight compilation is exactly what numba does for CUDA
| support: just like Julia, numba uses LLVM as a middle end
| (and LLVM has a PTX backend).
| albertzeyer wrote:
| JAX and TensorFlow functions both convert Python code to
| equivalent XLA code or a TF graph.
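|
| For example (a minimal sketch using jax.jit; tf.function is
| analogous):
|
|     import jax
|     import jax.numpy as jnp
|
|     # traced once, lowered to XLA, runs on GPU if available
|     @jax.jit
|     def saxpy(a, x, y):
|         return a * x + y
|
|     x = jnp.arange(1024, dtype=jnp.float32)
|     print(saxpy(2.0, x, x)[:4])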
| jjoonathan wrote:
| > Julia has first-class support for GPU programming
|
| "First-class" is a steep claim. Does it support the nvidia perf
| tools? Those are very important for taking a kernel from (in my
| experience) ~20% theoretical perf to ~90% theoretical perf.
| maleadt wrote:
| Yeah, see this section of the documentation:
| https://juliagpu.gitlab.io/CUDA.jl/development/profiling/.
| CUDA.jl also supports NVTX, wraps CUPTI, etc. The full extent
| of the APIs and tools is available.
|
| Source line association when using PC sampling is currently
| broken due to a bug in the NVIDIA drivers though (segfaulting
| when parsing the PTX debug info emitted by LLVM), but I'm
| told that may be fixed in the next driver.
| jjoonathan wrote:
| Nice! I set a reminder to check back in a month.
| klmadfejno wrote:
| https://developer.nvidia.com/blog/gpu-computing-julia-
| progra...
| jjoonathan wrote:
| > CUDAnative.jl also [...] generates the necessary line
| number information for the NVIDIA Visual Profiler to work
| as expected
|
| That sounds very promising, but these tools are usually
| magnificent screenshot fodder, yet they are conspicuously
| absent from the screenshots, so I still have suspicions.
| Maybe I'll give it a try tonight and report back.
| maleadt wrote:
| Here's a screenshot:
| https://julialang.org/assets/blog/nvvp.png. Or a recent
| PR where you can see NVTX ranges from Julia:
| https://github.com/JuliaGPU/CUDA.jl/pull/760
| jjoonathan wrote:
| Thanks! Now I believe! :)
| SloopJon wrote:
| I thought for sure that someone would have posted a link to that
| xkcd comic by now. I only dabble with higher-level APIs, so I
| can't judge this on the merits. If NVIDIA really continues to
| back this, and follows through on wrapping other libraries like
| cuDNN, it could be a whole new level of vendor lock-in as people
| start writing code that targets CUDA Python. I think the real
| test will be whether one of the big projects like PyTorch or
| TensorFlow gets on board.
| Ivoah wrote:
| https://xkcd.com/927/
|
| https://xkcd.com/1987/
| michelpp wrote:
| About 8 years ago an NVIDIA developer released a tool called
| Copperhead that let you write CUDA kernels in straight Python,
| which were then compiled to C, with no "C-in-a-string" like what's shown
| here. I always thought it was so elegant and had great potential,
| and I introduced a lot of people in my circle to it, but then it
| seems NVIDIA buried it.
|
| This blog post is great, and we need these kinds of tools for
| sure, but we also need high-level expressibility that doesn't
| require writing kernels in C. I know there are other projects
| that have taken up that cause, but it would be great to see
| NVIDIA double down on something like Copperhead.
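|
| (As an aside, numba's @vectorize gives a taste of that style
| today; a rough sketch, not Copperhead's actual API:)
|
|     import numpy as np
|     from numba import vectorize
|
|     # Element-wise function in plain Python; numba builds the
|     # GPU kernel and the launch code behind the scenes.
|     @vectorize(['float32(float32, float32)'], target='cuda')
|     def rel_diff(a, b):
|         return (a - b) / (a + b)
|
|     a = np.random.rand(1_000_000).astype(np.float32)
|     b = np.random.rand(1_000_000).astype(np.float32)
|     print(rel_diff(a, b)[:4])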
| ZeroCool2u wrote:
| Totally agree, Copperhead looks much easier to use. Perhaps one
| of the reasons they went and rebuilt from scratch is that
| Copperhead relies on Thrust and a couple of other dependencies?
| anon_tor_12345 wrote:
| That project might be abandoned, but this strategy is used in
| NVIDIA and NVIDIA-adjacent projects (through LLVM):
|
| https://github.com/rapidsai/cudf/blob/branch-0.20/python/cud...
|
| https://github.com/gmarkall/numba/blob/master/numba/cuda/com...
|
| > but we also need high-level expressibility that doesn't
| require writing kernels in C
|
| The above are possible because C is actually just a frontend to
| PTX:
|
| https://docs.nvidia.com/cuda/parallel-thread-execution/index...
|
| Fundamentally, you are never going to be able to write CUDA
| kernels without thinking about the CUDA architecture, any more
| than you'll ever be able to write async code without thinking
| about concurrency.
| albertzeyer wrote:
| Oh that sounds interesting. Do you know what happened to it?
|
| I think I found it here:
| https://github.com/bryancatanzaro/copperhead
|
| But I'm not sure what the state is. Looks dead (last commit 8
| years ago). Probably just a proof of concept. But why hasn't
| this been continued?
|
| Blog post and example:
| https://developer.nvidia.com/blog/copperhead-data-parallel-p...
| https://github.com/bryancatanzaro/copperhead/blob/master/sam...
|
| Btw, for compiling on-the-fly from a string, I made something
| similar for our RETURNN project. Example for LSTM:
| https://github.com/rwth-i6/returnn/blob/a5eaa4ab1bfd5f157628...
|
| It is written so that it compiles automatically into an op for
| Theano or TensorFlow (PyTorch could easily be added as well),
| for both CPU and CUDA/GPU.
| dwrodri wrote:
| I don't know specifics about Copperhead in particular, but
| Bryan Catanzaro (creator of Copperhead) is now the VP of
| Applied Deep Learning Research at Nvidia. He gave a talk at
| GTC this year, which is how I heard about all of this in the
| first place.
|
| Source: https://www.linkedin.com/in/bryancatanzaro/
| BiteCode_dev wrote:
| In the IP world, there are hidden gems that disappear without
| a trace one day.
|
| I worked for a client that had this wonderful Python DSL that
| compiled to Verilog and VHDL. It was much easier to use than
| writing the stuff the old way. Much more composable too, not to
| mention the tooling.
|
| They created it by forking an open source project dating back
| to Python 2.5 that I could never find again.
|
| Imagine if that stuff were still alive today. You could have a
| market for paid pypi.org instances providing you with
| pip-installable IP components that you can compose and
| customize easily.
|
| But in this market, sharing is not really a virtue.
| eslaught wrote:
| As it turns out, NVIDIA just open-sourced a product called
| Legate which targets not just GPUs but distributed execution as
| well. Right now it supports NumPy and Pandas, but perhaps
| they'll add others in the future. Just thought this might be up
| your alley, since it works at a higher level than the glorified
| CUDA in the article.
|
| https://github.com/nv-legate/legate.numpy
|
| Disclaimer: I work on the project they used to do the
| distributed execution, but otherwise have no connection with
| Legate.
|
| Edit: And this library was developed by a team managed by one
| of the original Copperhead developers, in case you're
| wondering.
| nuisance-bear wrote:
| Tools to make GPU development easier are sorely needed.
|
| I foolishly built an options pricing engine on top of PyTorch,
| thinking "oooh, it's a fast array library that supports CUDA
| transparently", only to find out that array indexing is 100x
| slower than in numpy.
| eslaught wrote:
| You might be interested in Legate [1]. It supports the NumPy
| interface as a drop-in replacement, and targets GPUs as well as
| distributed machines. You can see their performance results for
| yourself; they're not far off from hand-tuned MPI.
|
| [1]: https://github.com/nv-legate/legate.numpy
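|
| If the drop-in claim holds, the switch is meant to be roughly a
| one-line change (a sketch, assuming the module name from the
| README and that the operations used are in the supported
| subset):
|
|     # import numpy as np
|     import legate.numpy as np   # same API, GPU/distributed-backed
|
|     x = np.random.rand(1000, 1000)
|     print((x + x).sum())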
|
| Disclaimer: I work on the library Legate uses for distributed
| computing, but otherwise have no connection.
| [deleted]
| TuringNYC wrote:
| >>> built an options pricing engine on top of PyTorch
|
| I'd love to hear more about this! Do you have any posts or
| write-ups on this?
| sideshowb wrote:
| Interesting find about the indexing. I just had the opposite
| experience, swapped from numpy to torch in a project and got
| 2000x speedup on some indexing and basic maths wrapped in
| autodiff. And I haven't moved it onto cuda yet.
| nuisance-bear wrote:
| Here's an example that illustrates the phenomenon. If memory
| serves me right, index latency is superlinear in dimension
| count.
|
|     import time, torch
|     from itertools import product
|
|     N = 100
|     ten = torch.randn(N, N, N)
|     arr = ten.numpy()
|
|     def indexTimer(val):
|         # time N^3 single-element reads via val[i, j, k]
|         start = time.time()
|         for i, j, k in product(range(N), range(N), range(N)):
|             x = val[i, j, k]
|         end = time.time()
|         print('{:.2f}'.format(end - start))
|
|     indexTimer(ten)   # torch tensor
|     indexTimer(arr)   # numpy array
| rubatuga wrote:
| Somewhat related, I've tried running compute shaders using wgpu-
| py:
|
| https://github.com/pygfx/wgpu-py
|
| You can define any compute shader you like in Python, annotate
| it with the data types, and it compiles to SPIR-V and runs
| under macOS, Linux, and Windows.
| The_rationalist wrote:
| Note that you can write CUDA in many languages such as Java,
| Kotlin, Python, Ruby, JS, R with https://github.com/NVIDIA/grcuda
| zcw100 wrote:
| There's a lot of mays, shoulds, and coulds in there.
| nevi-me wrote:
| I have an RTX 2070 that's under-utilised, partly because I'm
| finding it surprisingly hard to understand C and C++, and CUDA
| by extension.
|
| I'm self-taught, and had been using web languages and some
| Python before learning Rust. I hope that NVIDIA can dedicate
| some resources to creating high-quality Rust bindings to the C
| API, even if only in the next 1-2 years.
|
| Perhaps being able to use a systems language that's been easy
| for me coming from TypeScript and Kotlin could inspire me to
| take baby steps with CUDA without worrying about understanding
| C.
|
| I like the CUDA.jl package, and once I make time to learn Julia,
| I would love to try that out. From this article about the Python
| library, I'm still left knowing very little about "how can I
| parallelise this function".
| jkelleyrtp wrote:
| +1 Would love to see official support for CUDA for Rust.
| sdajk3n423 wrote:
| If you are looking to maximize use of that card, you can make
| about $5 a day mining crypto with the 2070.
| nevi-me wrote:
| No, the high electricity cost in my country + the noise
| pollution in the house + how much I generally earn from the
| machine + my views on burning the world speculatively,
| discourage me from mining crypto.
|
| Perhaps my position might change in future, but for now, I'd
| probably rather make the GPU accessible to those open-source
| distributed grids that train chess engines or compute deep-
| space related thingies :)
| sdajk3n423 wrote:
| I am not convinced that training AI to win at chess is any
| more moral than mining crypto. And the blockchain is about as
| open-source as you can get.
| pjmlp wrote:
| A nice thing about systems programming languages in the proper
| ALGOL lineage (of which C carries only a basic influence) is
| that you can write nice high-level code and only deal with
| pointers and raw-pointer stuff when actually needed; think Ada,
| Modula-2, Object Pascal kinds of languages.
|
| So something like CUDA Rust would be nice to have.
|
| By the way, D already supports CUDA:
|
| https://dlang.org/blog/2017/07/17/dcompute-gpgpu-with-native...
| touisteur wrote:
| CUDA Ada would be so, so nice. Especially with non-aliasing
| guarantees from SPARK...
| Tomte wrote:
| > I have a RTX 2070 that's under-utilised
|
| I've found that there are really good and beginner-friendly
| Blender tutorials. Both free and paid ones.
| andi999 wrote:
| Actually, I like PyCUDA:
| https://documen.tician.de/pycuda/tutorial.html
|
| You can write all the boilerplate in Python and just the kernel
| in C (which you pass as a string and which is compiled
| automatically in your Python script). So far the workflow is
| much smoother than with nvcc (and creating some DLL bindings
| for the C program).
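|
| The basic pattern from that tutorial looks roughly like this
| (a minimal sketch):
|
|     import numpy as np
|     import pycuda.autoinit            # sets up a CUDA context
|     import pycuda.driver as drv
|     from pycuda.compiler import SourceModule
|
|     # The kernel is plain CUDA C in a string; PyCUDA invokes
|     # nvcc for you at runtime.
|     mod = SourceModule("""
|     __global__ void double_it(float *a)
|     {
|         int i = threadIdx.x + blockIdx.x * blockDim.x;
|         a[i] *= 2.0f;
|     }
|     """)
|
|     double_it = mod.get_function("double_it")
|     a = np.random.randn(256).astype(np.float32)
|     double_it(drv.InOut(a), block=(256, 1, 1), grid=(1, 1))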
| kolbe wrote:
| As someone who has dabbled in CUDA with some success, I'm going
| to be a little contrarian here. To me, the difficulty with GPU
| programming isn't the fact that CUDA uses C-syntax versus
| something more readable like Python. GPU programming is
| fundamentally difficult, and the minor gains from using a
| familiar language syntax are dwarfed by the need to understand
| blocks, memory alignment, thread hierarchy, etc. And I don't just
| say this. I live it. Even though I primarily program in C#, I
| don't use Hybridizer when I need GPU acceleration. I go straight
| to CUDA and marshal everything to/from C#.
|
| That's not to say that CUDA Python isn't kinda cool, but it's not
| a magic bullet to finally understanding GPU programming if you've
| been struggling.
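|
| To make that concrete: even with Python syntax you still spell
| out blocks, shared memory, and synchronization yourself. A
| sketch with numba.cuda (one possible illustration, not from the
| article):
|
|     import numpy as np
|     from numba import cuda, float32
|
|     TPB = 128  # threads per block
|
|     @cuda.jit
|     def block_sum(x, partial):
|         tmp = cuda.shared.array(TPB, float32)  # per-block shared mem
|         tid = cuda.threadIdx.x
|         i = cuda.blockIdx.x * cuda.blockDim.x + tid
|         tmp[tid] = x[i] if i < x.shape[0] else 0.0
|         cuda.syncthreads()
|         stride = TPB // 2
|         while stride > 0:              # tree reduction in the block
|             if tid < stride:
|                 tmp[tid] += tmp[tid + stride]
|             cuda.syncthreads()
|             stride //= 2
|         if tid == 0:
|             partial[cuda.blockIdx.x] = tmp[0]
|
|     x = np.ones(1 << 20, dtype=np.float32)
|     blocks = (x.size + TPB - 1) // TPB
|     partial = np.zeros(blocks, dtype=np.float32)
|     block_sum[blocks, TPB](x, partial)
|     print(partial.sum())               # ~1048576.0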
___________________________________________________________________
(page generated 2021-04-16 22:00 UTC)