[HN Gopher] CuPy: NumPy and SciPy for GPU
___________________________________________________________________
CuPy: NumPy and SciPy for GPU
Author : tanelpoder
Score : 254 points
Date : 2024-09-20 13:18 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| __mharrison__ wrote:
| I taught my numpy class to a client who wanted to use GPUs.
| Installation (at that time) was a chore but afterwards it was
| really smooth using this library. Big gains with minimal to no
| code changes.
| gjstein wrote:
| The idea that this is a drop in replacement for numpy (e.g.,
| `import cupy as np`) is quite nice, though I've gotten similar
| benefit out of using `pytorch` for this purpose. It's a very
| popular and well-supported library with a syntax that's similar
| to numpy.
|
| However, the AMD-GPU compatibility for CuPy is quite an
| attractive feature.
| KeplerBoy wrote:
| One could also "import jax.numpy as jnp". All those libraries
| have more or less complete implementations of numpy and scipy
| functionality (I believe CuPy has the most functions, especially
| when it comes to scipy).
|
| Also: You can just mix and match all those functions and tensors
| thanks to the __cuda_array_interface__.
| yobbo wrote:
| Jax variables are immutable.
|
| Code written for CuPy looks similar to numpy but very
| different from Jax.
| bbminner wrote:
| Ah, well, that's interesting! Does anyone know how cupy
| manages tensor mutability?
| kmaehashi wrote:
| CuPy tensors (or `ndarray`) provide the same semantics as
| NumPy. In-place operations are permitted.
| KeplerBoy wrote:
| Ah yes, stumbled over that recently, but the error message
| is very helpful and it's a quick change.
| kmaehashi wrote:
| For those interested in the NumPy/SciPy API coverage in CuPy,
| here is the comparison table:
|
| https://docs.cupy.dev/en/latest/reference/comparison.html
| ogrisel wrote:
| Note that NumPy, CuPy and PyTorch are all involved in the
| definition of a shared subset of their API:
|
| https://data-apis.org/array-api/
|
| So it's possible to write array API code that consumes arrays
| from any of those libraries and delegate computation to them
| without having to explicitly import any of them in your source
| code.
|
| The only limitation for now is that PyTorch's (and to a lesser
| extent CuPy's) array API compliance is still incomplete,
| and in practice one needs to go through this compatibility
| layer (hopefully temporarily):
|
| https://data-apis.org/array-api-compat/
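|
| A minimal sketch of the array-API style described above, assuming
| NumPy is installed. The helper names `get_namespace` and
| `standardize` are hypothetical, and the fallback to NumPy for
| arrays without `__array_namespace__` is an assumption of this
| sketch, not part of the standard:

```python
import numpy as np


def get_namespace(x):
    """Return the array namespace that owns `x`."""
    # Arrays that conform to the array API standard expose this hook.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    # Assumption: treat anything else as NumPy-compatible.
    return np


def standardize(x):
    """Shift to zero mean and scale to unit variance, backend-agnostic."""
    xp = get_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)


z = standardize(np.asarray([1.0, 2.0, 3.0, 4.0]))
print(z)
```

| `standardize` never names a concrete library, so it should work
| unchanged for any array type whose namespace provides `mean` and
| `std`. With array-api-compat installed, its `array_namespace`
| function would take the place of `get_namespace`.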
| ethbr1 wrote:
| It's interesting to see hardware/software/API co-development
| in practice again.
|
| The last time I think this happened at market scale was early
| 3D accelerator APIs? Glide/OpenGL/DirectX. Which has been a
| minute! (To a lesser extent CPU vectorization extensions)
|
| Curious how much of Nvidia's successful strategy was driven
| by people who were there during that period.
|
| Powerful first mover flywheel: build high performing hardware
| that allows you to define an API -> people write useful
| software that targets your API, because you have the highest
| performance -> GOTO 10 (because now more software is
| standardized on your API, so you can build even more
| performant hardware to optimize its operations)
| kmaehashi wrote:
| An excellent example of Array API usage can be found in
| scikit-learn. Estimators written in NumPy are now operable on
| various backends courtesy of Array API compatible libraries
| such as CuPy and PyTorch.
|
| https://scikit-learn.org/stable/modules/array_api.html
|
| Disclosure: I'm a CuPy maintainer.
| kccqzy wrote:
| And of course the native Python solution is memoryview. If
| you need to inter-operate with libraries like numpy but you
| cannot import numpy, use memoryview. It is specifically for
| fast low-level access which is why it has more C
| documentation than Python documentation:
| https://docs.python.org/3/c-api/memoryview.html
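|
| A small standard-library sketch of what memoryview offers: a
| zero-copy window onto another object's buffer. The `array` module
| stands in for a NumPy array here, and the variable names are
| illustrative only:

```python
from array import array

# A buffer of C doubles owned by the standard-library array type.
samples = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# memoryview exposes the same memory without copying it.
view = memoryview(samples)

# Reinterpret the flat buffer as a 2x3 matrix. cast() only converts
# to or from a byte format, so go through "B" first.
matrix = view.cast("B").cast("d", shape=[2, 3])

# Writes through the view are visible in the original container.
view[0] = 10.0

print(view.format, view.itemsize, view.nbytes)
print(matrix.tolist())
print(samples[0])
```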
| amarcheschi wrote:
| I'm supposed to end my undergraduate degree with an internship
| at the Italian national research center, and I'll have to use
| pytorch to write ML models from paper to code. I've tried
| looking at the tutorial but I feel like there's a lot going on
| to grasp. Until now I've only used numpy (and pandas in combo
| with numpy). I'm quite excited but I'm a bit on edge
| because I can't know whether I'll be up to the task or not.
| KeplerBoy wrote:
| Go for it! There's nothing to lose.
|
| You could check out some of EuroCC's courses. That should get
| you up to speed.
| https://www.eurocc-access.eu/services/training/
| hedgehog wrote:
| It's kind of unfortunate that EagerPy didn't get more traction
| to make that kind of switching even easier.
| Narhem wrote:
| As nice as it is to have a drop in replacement, most of the
| cost of GPU computing is moving memory around. Wouldn't be
| surprised if this catches unsuspecting programmers in a few
| performance traps.
| WCSTombs wrote:
| > However, the AMD-GPU compatibility for CuPy is quite an
| attractive feature.
|
| Last I checked (a couple months ago) it wasn't quite there, but
| I totally agree in principle. I've not gotten it to work on my
| Radeons yet.
| paperplatter wrote:
| Hm. Tempted to try pytorch on my Mac for this. I have an AS
| chip rather than an Nvidia GPU.
| sspiff wrote:
| It only supports AMD cards supported by ROCm, which is quite a
| limited set.
|
| I know you can enable ROCm for other hardware as well, but it's
| not supported and quite hit or miss. I've had limited success
| with running stuff against ROCm on unsupported cards, mainly
| having issues with memory management IIRC.
| sitkack wrote:
| Fingers crossed that all future AMD parts ship with full ROCm
| support.
| sdenton4 wrote:
| Why not Jax?
| bee_rider wrote:
| Real answer: CuPy has a name that is very similar to SciPy. I
| don't know GPU, that's why I'm using this sort of library,
| haha. The branding for CuPy makes it obvious. Is Jax the same
| thing, but implemented better somehow?
| whimsicalism wrote:
| yes
| sdenton4 wrote:
| Yeah, Jax provides a one-to-one reimplementation of the Numpy
| interface, and a decent chunk of the scipy interface. Random
| number handling is a bit different, but Numpy random number
| handling seeeeems to be trending in the Jax direction
| (explicitly passed RNG objects).
|
| Jax also provides back-propagation wherever possible, so you
| can optimize.
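|
| To illustrate the "explicitly passed RNG objects" style on the
| NumPy side, here is a minimal sketch using
| `numpy.random.default_rng`. The function `noisy_measurement` is
| hypothetical, and JAX's own key-based API is not shown:

```python
import numpy as np


def noisy_measurement(rng, n):
    """Draw n samples; the caller owns and passes the random state."""
    return 1.0 + 0.1 * rng.standard_normal(n)


# Two generators with the same seed produce identical streams.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = noisy_measurement(rng_a, 5)
second = noisy_measurement(rng_b, 5)
third = noisy_measurement(rng_a, 5)  # rng_a has advanced, so this differs

print(np.array_equal(first, second), np.array_equal(first, third))
```

| Because no hidden global state is involved, reproducibility
| depends only on which generator object is handed to each call.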
| palmy wrote:
| cupy came out a long time before Jax; remember using it in a
| project for my BSc around 2015-2016.
|
| Cool to see that it's still kicking!
| johndough wrote:
| > Why not Jax?
|
| - JAX Windows support is lacking
|
| - CuPy is much closer to CUDA than JAX, so you can get better
| performance
|
| - CuPy is generally more mature than JAX (fewer bugs)
|
| - CuPy is more flexible thanks to cp.RawKernel
|
| - (For those familiar with NumPy) CuPy is closer to NumPy than
| jax.numpy
|
| But CuPy does not support automatic gradient computation, so if
| you do deep learning, use JAX instead. Or PyTorch, if you do
| not trust Google to maintain a project for a prolonged period
| of time https://killedbygoogle.com/
| gnulinux wrote:
| What about CPU-only loads? What if one wants to write code
| that'll eventually run on both CPU and GPU but in the
| short-to-mid term will only be used on CPU? Since JAX natively
| supports CPU (with numpy backend), but CuPy doesn't, this seems
| like a potential problem for some.
| nextaccountic wrote:
| Isn't there a way to dynamically select between numpy and
| cupy, depending on whether you want cpu or gpu code?
| gnulinux wrote:
| There is, but then you're using two separate libraries,
| which seems like a fragile point of failure compared to
| just using jax. But regardless, since jax will use
| different backends anyway, it's arguably not any worse
| (but it ends up being your responsibility to ensure
| correctness as opposed to the jax team's).
| kmaehashi wrote:
| NumPy has a mechanism to dispatch execution to CuPy:
| https://numpy.org/neps/nep-0018-array-function-protocol.html
|
| Just prepare the input in NumPy or CuPy, and then you can
| feed it to NumPy APIs. NumPy functions will handle the
| computation themselves if the input is a NumPy ndarray, or
| dispatch execution to CuPy if the input is a CuPy ndarray.
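|
| The dispatch mechanism from NEP 18 can be sketched without a GPU.
| The toy class `LoggedArray` below is hypothetical and only stands
| in for a library-provided array type; it shows how NumPy hands a
| call over to any argument that defines `__array_function__`:

```python
import numpy as np


class LoggedArray:
    """Toy wrapper that intercepts NumPy API calls made on it."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.calls = []

    def __array_function__(self, func, types, args, kwargs):
        # NumPy invokes this hook instead of its own implementation
        # whenever an argument defines __array_function__.
        self.calls.append(func.__name__)
        unwrapped = [a.data if isinstance(a, LoggedArray) else a for a in args]
        # A GPU library would run its own kernel here; this sketch
        # simply forwards to NumPy on the wrapped data.
        return func(*unwrapped, **kwargs)


x = LoggedArray([1.0, 2.0, 3.0])
total = np.sum(x)
print(total, x.calls)
```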
| johndough wrote:
| > Isn't there a way to dynamically select between numpy
| and cupy, depending on whether you want cpu or gpu code?
|
| CuPy is an (almost) drop-in replacement for NumPy, so the
| following works surprisingly often:
|
|     if use_cpu:
|         import numpy as np
|     else:
|         import cupy as np
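|
| A slightly more defensive variant of the same idea, as a sketch:
| fall back to NumPy when CuPy is missing or no device is usable.
| The function name `pick_backend` and the alias `xp` are
| assumptions of this example:

```python
import numpy as np


def pick_backend(use_gpu=True):
    """Return cupy if it is importable and a device is usable, else numpy."""
    if use_gpu:
        try:
            import cupy as cp

            # Importing can succeed on machines without a working GPU,
            # so probe for a device before committing to CuPy.
            if cp.cuda.runtime.getDeviceCount() > 0:
                return cp
        except Exception:
            pass
    return np


xp = pick_backend()
x = xp.linspace(0.0, 1.0, 5)
norm = float(xp.sqrt(xp.sum(x * x)))
print(xp.__name__, norm)
```

| Writing the numerical code against `xp` keeps the choice of
| library in one place instead of spreading it across the module.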
| insane_dreamer wrote:
| > CuPy does not support automatic gradient computation, so if
| you do deep learning, use JAX instead
|
| DL is a major use case; is CuPy planning on adding automatic
| gradient computation?
| bee_rider wrote:
| Good a place as any to ask I guess. Do any of these GPU libraries
| have a BiCGStab (or similar) that handles multiple right hand
| sides? CuPy seems to have GMRES, which would be fine, but as far
| as I can tell it just does one right hand side.
| trostaft wrote:
| IIRC jax's `scipy.sparse.linalg.bicgstab` does support multiple
| right hand sides.
|
| EDIT: Or rather, all the solvers under jax's
| `scipy.sparse.linalg` all support multiple right hand sides.
| bee_rider wrote:
| Oh dang, that's pretty awesome, thanks.
|
| "array or tree of arrays" sounds very general, probably even
| better than an old fashioned 2D array.
| trostaft wrote:
| 'tree of arrays'
|
| Ahh, that's just Jax's concept of pytrees. It was something
| that they invented to make it easier (this is how I view
| it, not complete) to pass complex objects to functions but
| still be able to easily consider them as a concatenated
| vector for AD etc. E.g. a common pattern is to pass
| parameters `p` to a function and then internally break them
| into their physical interpretations, e.g. `mass = p[0]`,
| `velocity = p[1]`. Pytrees let you just use something like
| a dictionary `p = {'mass': 1.0, 'velocity': 1.0}`, which
| is a stylistically more natural structure to pass around,
| and then jax is structured to understand later when AD'ing
| or otherwise that you're doing so with respect to the
| 'leaves' of the tree, or the values of the mass and
| velocity.
|
| Hopefully someone corrects me if I'm not right about this.
| I'm hardly 100% on Jax's vision on PyTrees.
|
| As an aside, just a list of right hand sides `[b1, b2, ...,
| bm]` is valid.
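|
| The flatten-to-leaves idea can be sketched in plain Python. This
| is not JAX's implementation (its own entry points live in
| `jax.tree_util`, for example `tree_flatten`); the helpers
| `flatten` and `unflatten` below are hypothetical and only handle
| dicts, lists, and tuples:

```python
def flatten(tree):
    """Return (leaves, treedef) for nested dicts/lists/tuples of numbers."""
    if isinstance(tree, dict):
        keys = sorted(tree)  # fixed key order makes the leaf order stable
        parts = [flatten(tree[k]) for k in keys]
        leaves = [leaf for sub, _ in parts for leaf in sub]
        return leaves, ("dict", keys, [d for _, d in parts])
    if isinstance(tree, (list, tuple)):
        parts = [flatten(item) for item in tree]
        leaves = [leaf for sub, _ in parts for leaf in sub]
        return leaves, (type(tree).__name__, None, [d for _, d in parts])
    return [tree], ("leaf", None, None)


def unflatten(treedef, leaves):
    """Rebuild a container from treedef, consuming leaves in order."""
    it = iter(leaves)

    def build(node):
        kind, keys, children = node
        if kind == "leaf":
            return next(it)
        built = [build(child) for child in children]
        if kind == "dict":
            return dict(zip(keys, built))
        return tuple(built) if kind == "tuple" else built

    return build(treedef)


p = {"mass": 1.0, "velocity": [2.0, 3.0]}
leaves, treedef = flatten(p)
doubled = unflatten(treedef, [2 * v for v in leaves])
print(leaves, doubled)
```

| An algorithm can then work on the flat list of leaves as if it
| were a single vector and restore the original structure at the
| end.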
| johndough wrote:
| If you have many right hand sides, you could also compute an LU
| factorization and then solve the right hand sides via back-
| substitution.
|
| https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc...
|
| or
| https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc...
| if your linear system is sparse.
|
| But whether that works well depends on the problem you are
| trying to solve.
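|
| As a CPU-only, dense stand-in for the idea of factorizing once
| and reusing the result for many right-hand sides, here is a
| sketch with `numpy.linalg.solve`, which accepts a matrix whose
| columns are the right-hand sides. The sizes and the test matrix
| are arbitrary choices for this example, and the CuPy functions
| linked above are not called here:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 8

# A heavily weighted diagonal keeps the system well conditioned.
A = rng.standard_normal((n, n)) + n * np.eye(n)
# Each column of B is one right-hand side.
B = rng.standard_normal((n, k))

# One call solves A @ X = B for all k columns.
X = np.linalg.solve(A, B)

residual = np.max(np.abs(A @ X - B))
print(X.shape, residual)
```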
| bee_rider wrote:
| My systems are sparse, but might not fit on the GPU when
| factorized. Actually, usually I do CPU stuff with lots of
| ram, and Pardiso, so it isn't an issue.
|
| But I was hoping to try out something like ILU+bicgstab on
| the GPU and the python-verse seems like it has the lowest
| barrier-to-entry for just playing around.
| johndough wrote:
| For my tasks, I had some success with algebraic multigrid
| solvers as preconditioner, for example from AMGCL or PyAMG.
| They are also reasonably easy to get started with.
|
| https://github.com/pyamg/pyamg
|
| https://github.com/ddemidov/amgcl
|
| But I only have to deal with positive definite systems, so
| YMMV.
|
| I am not sure whether those libraries can deal with
| multiple right-hand sides, but most complexity is in the
| preconditioners anyway.
| adancalderon wrote:
| If it ran in the background it could be CuPyd
| whimsicalism wrote:
| I was just thinking we didn't have enough CUDA-accelerated numpy
| libraries.
|
| Jax, pytorch, vanilla TF, triton. They just don't cut it
| meisel wrote:
| When building something that I want to run on both CPU and GPU,
| depending, I've found it much easier to use PyTorch than some
| combination of NumPy and CuPy. I don't have to fiddle around with
| some global replacing of numpy.* with cupy.*, and PyTorch has
| very nearly all the functions that those libraries have.
| setopt wrote:
| Interesting. Any links to examples or docs on how to use
| PyTorch as a general linear algebra library for this purpose?
| Like a "SciPy to PyTorch" transition guide if I want to do the
| same?
| meisel wrote:
| It's typically just importing torch and s/np/torch, not too
| different from NumPy -> CuPy. Try it in your own code and
| see!
| ttyprintk wrote:
| Mentioned above:
|
| https://data-apis.org/array-api-compat/
| lmeyerov wrote:
| We are fans! We mostly use cudf/cuml/cugraph (GPU dataframes etc)
| in the pygraphistry ecosystem, and when things get a bit tricky,
| cupy is one of the main escape hatches
| SubiculumCode wrote:
| As an aside, since I was trying to install CuPy the other day and
| was having issues.
|
| Open projects on github often (at least superficially) require
| specific versions of CUDA Toolkit (and all the specialty nvidia
| packages e.g. cuDNN), Tensorflow, etc, and changing the default
| versions of these for each little project, or step in a
| processing chain, is ridiculous.
|
| pyenv et al have really made local, project specific versions of
| python packages much easier to manage. But I haven't seen a
| similar type solution for cuda toolkit and associated packages,
| and the solutions I've encountered seem terribly hacky. I'm sure
| this is a common issue though, so what do people do?
| coeneedell wrote:
| Ugh... docker containers. I also wish there was a simpler way
| but I don't think there is.
| SubiculumCode wrote:
| this is not what I wanted to hear. NOT AT ALL. Please whisper
| sweet lies into my ears.
| coeneedell wrote:
| At the moment I'm working on a system to quickly replicate
| academic deep learning repos (papers) at scale. At least
| Amazon has a catalogue of prebuilt containers with
| cuda/pytorch combos. I still occasionally have an issue
| where the container works on my 3090 test bench but not on
| the T4 cloud node...
| m_d_ wrote:
| conda provides cudatoolkit and associated packages. Does this
| solve the situation?
| nyrikki wrote:
| The conda 200-employee threshold licence change is
| problematic for some.
| boldlybold wrote:
| As long as you stay out of the "defaults" and "anaconda"
| repos, you're not subject to that license. For my needs
| conda-forge and bioconda have everything. I'm not sure
| about the nvidia repo but I assume it's similar.
| kmaehashi wrote:
| Actually all CUDA Toolkit libs are already available
| through the conda-forge channel:
| https://anaconda.org/conda-forge/cuda-cudart,
| https://anaconda.org/conda-forge/libcublas, etc.
| SubiculumCode wrote:
| Actually yes it does....except I seem to remember that it
| doesn't go back that far in cuda versions. I can't seem to
| find it again right now.
| whimsicalism wrote:
| in real life everyone just uses containers, might not be the
| answer you want to hear though
| kmaehashi wrote:
| As a maintainer of CuPy and also as a user of several GPU-powered
| Python libraries, I empathize with the frustrations and
| difficulties here. Indeed, one thing CuPy values is making the
| installation step as easy and universal as possible. We strive
| to keep the binary package footprint small (currently less than
| 100 MiB), keep dependencies to a minimum, support a wide variety
| of platforms including Windows and aarch64, and avoid requiring
| a specific CUDA Toolkit version.
|
| If anyone reading this message has encountered a roadblock
| while installing CuPy, please reach out. I'd be glad to help
| you.
| mardifoufs wrote:
| One way to do it is to explicitly add the link to say, the
| pytorch+CUDA wheel from the pytorch repos in your
| requirements.txt instead of using the normal pypi package.
| Which also sucks because you then have to do some other tweaks
| to make your requirements.txt portable across different
| platforms...
|
| (and you can't just add another index for pip to look for if
| you want to use python build so it has to be explicitly linked
| to the right wheel, which absolutely sucks especially since you
| cannot get the CUDA version from pypi)
| ttyprintk wrote:
| You can jam the argument cudatoolkit=1.2.3 when creating conda
| environments.
|
| NB I'm using Miniforge.
| hamilyon2 wrote:
| There is a somewhat similar project which supports Intel GPU
| offloading: https://github.com/intel/scikit-learn-intelex
| johndough wrote:
| CuPy is probably the easiest way to interface with custom CUDA
| kernels:
| https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...
|
| And I recently learned that CuPy has a JIT compiler now if you
| prefer Python syntax over C++.
| https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-k...
| einpoklum wrote:
| > probably the easiest way to interface with custom CUDA
| kernels
|
| In Python? Perhaps. Generally? No, it isn't. Try:
| https://github.com/eyalroz/cuda-api-wrappers/
|
| Full power of the CUDA APIs including all runtime compilation
| options etc.
|
| (Yes, I wrote that...)
| johndough wrote:
| Personally, I prefer CuPy over your library. For example,
| your vectorAdd.cu implementation at
| https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...
| is much longer than a similar CuPy implementation:
|
|     import cupy as cp
|
|     vector_add = cp.RawKernel("""
|     extern "C" __global__
|     void vector_add(const float *A, const float *B, float *C,
|                     int num_elements) {
|         int i = blockDim.x * blockIdx.x + threadIdx.x;
|         if (i < num_elements) {
|             C[i] = A[i] + B[i];
|         }
|     }
|     """, "vector_add")
|
|     num_elements = 50_000
|     block_size = 256
|     # round up to next multiple of block_size
|     grid_size = (num_elements + block_size - 1) // block_size
|
|     a = cp.random.rand(num_elements, dtype=cp.float32)
|     b = cp.random.rand(num_elements, dtype=cp.float32)
|     c = cp.zeros(num_elements, dtype=cp.float32)
|     args = (a, b, c, num_elements)
|
|     print(f"[Vector addition of {num_elements} elements]")
|     print(f"CUDA kernel launch with {grid_size} blocks of {block_size} threads each")
|     vector_add((grid_size,), (block_size,), args)
|
|     incorrect = cp.abs(a + b - c) > 1e-5
|     if cp.any(incorrect):
|         print("Result verification failed at element", cp.argmax(incorrect))
|
|     print("Test PASSED")
|     print("SUCCESS")
|
| It could be made even shorter with a cp.ElementwiseKernel
| https://docs.cupy.dev/en/stable/user_guide/kernel.html#basic...
|
| Although I have to concede that the automatic grid size
| computation in cuda-api-wrappers is nice.
|
| A few marketing tips for your README:
|
| * Put a code example directly at the top. You want to present
| the selling points of your library to the reader as fast as
| possible. For reference, look at the CuPy README
| https://github.com/cupy/cupy?tab=readme-ov-file#cupy--numpy-...
| which immediately shows the reader what it is good for.
| Your README starts with lots of text, but nobody reads text
| anymore these days. A link to examples is almost at the end,
| and then the examples are deeply nested.
|
| * The first links in the README should link to your own
| library, for example to documentation or examples. You do not
| want to lead the reader away from your GitHub page.
|
| * Add syntax highlighting with "cpp" after triple backticks:
| ```cpp <code here> ```
| kunalgupta022 wrote:
| Is anyone aware of a pandas-like library that is based on
| something like CuPy instead of NumPy? It would be great to have
| the ease of use of pandas with the parallelism unlocked by GPUs.
| lokimedes wrote:
| Not specifically GPU, but that's also highly dependent on the
| data access pattern: https://www.dask.org/
| Scene_Cast2 wrote:
| I'd go digging into this - https://pola.rs/posts/polars-on-gpu/
| kmaehashi wrote:
| cuDF is a CuPy-based library providing a drop-in replacement for
| Pandas: https://rapids.ai/
| curvilinear_m wrote:
| I'm surprised to see pytorch and Jax mentioned as alternatives
| but not numba : https://github.com/numba/numba
|
| I've recently had to implement a few kernels to lower the memory
| footprint and runtime of some pytorch function : it's been really
| nice because numba kernels have type hints support (as opposed to
| raw cupy kernels).
| killingtime74 wrote:
| Numba doesn't support GPU though
| catshamando wrote:
| Yes it does:
| https://numba.readthedocs.io/en/stable/cuda/kernels.html
| the_svd_doctor wrote:
| It does (for NVIDIA at least)
| setopt wrote:
| I've been using CuPy a bit and found it to be excellent.
|
| It's very easy to replace some slow NumPy/SciPy calls with
| appropriate CuPy calls, with sometimes literally a 1000x
| performance boost from like 10min work. It's also easy to write
| "hybrid code" where you can switch between NumPy and CuPy
| depending on what's available.
| glial wrote:
| Are you able to share what functions or situations result in
| speedups? In my experience, vectorized numpy is already fast,
| so I'm very curious.
| KeplerBoy wrote:
| Not OP, but think about stuff like FFTs or Matmuls. It's not
| even a competition, GPUs win when the algorithm is somewhat
| suitable and you're dealing with FP32 or lower precision.
| setopt wrote:
| The largest speedup I have seen was for a quantum mechanics
| simulation where I needed to repeatedly calculate all
| eigenvalues of Hermitian matrices (but not necessarily their
| eigenvectors).
|
| This was basically the code needed:
|
|     import scipy.linalg as la
|
|     if cuda:
|         import cupy as cp
|         import cupy.linalg as cla
|
|         e = cp.asnumpy(cla.eigvalsh(cp.asarray(H)))
|     else:
|         e = la.eigvalsh(H)
|
| I was using IntelPython which already has fast (parallelized)
| methods for this using MKL, but CuPy blew it out of the
| water.
| aterrel-nvidia wrote:
| If you like cupy, definitely check out the multi-node, multi-GPU
| version, cuNumeric: https://github.com/nv-legate/cunumeric
|
| Would love to get any feedback from the community.
___________________________________________________________________
(page generated 2024-09-20 23:00 UTC)