[HN Gopher] CuPy: NumPy and SciPy for GPU
       ___________________________________________________________________
        
       CuPy: NumPy and SciPy for GPU
        
       Author : tanelpoder
       Score  : 254 points
       Date   : 2024-09-20 13:18 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | __mharrison__ wrote:
       | I taught my numpy class to a client who wanted to use GPUs.
       | Installation (at that time) was a chore but afterwards it was
       | really smooth using this library. Big gains with minimal to no
       | code changes.
        
       | gjstein wrote:
        | The idea that this is a drop-in replacement for numpy (e.g.,
       | `import cupy as np`) is quite nice, though I've gotten similar
       | benefit out of using `pytorch` for this purpose. It's a very
       | popular and well-supported library with a syntax that's similar
       | to numpy.
       | 
       | However, the AMD-GPU compatibility for CuPy is quite an
       | attractive feature.
        
         | KeplerBoy wrote:
          | One could also "import jax.numpy as jnp". All those libraries
          | have more or less complete implementations of numpy and scipy
          | functionality (I believe CuPy has the most functions,
          | especially when it comes to scipy).
          | 
          | Also: you can just mix and match all those functions and
          | tensors thanks to the __cuda_array_interface__, as in the
          | sketch below.
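          | 
          | A minimal sketch of that mixing (assuming CuPy and PyTorch
          | are installed with CUDA support; both directions are
          | zero-copy via __cuda_array_interface__):
          | 
          |     import cupy as cp
          |     import torch
          | 
          |     x_cp = cp.arange(6, dtype=cp.float32).reshape(2, 3)
          |     # PyTorch consumes the CuPy array without a copy
          |     x_pt = torch.as_tensor(x_cp, device="cuda")
          |     # and CuPy can wrap a CUDA torch tensor the same way
          |     y_cp = cp.asarray(x_pt * 2)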
        
           | yobbo wrote:
           | Jax variables are immutable.
           | 
           | Code written for CuPy looks similar to numpy but very
           | different from Jax.
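            | 
            | For instance (a small illustration):
            | 
            |     import jax.numpy as jnp
            | 
            |     x = jnp.zeros(3)
            |     # x[0] = 1.0 would raise a TypeError: JAX arrays are
            |     # immutable. Instead you get back a new array:
            |     x = x.at[0].set(1.0)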
        
             | bbminner wrote:
             | Ah, well, that's interesting! Does anyone know how cupy
             | manages tensor mutability?
        
               | kmaehashi wrote:
               | CuPy tensors (or `ndarray`) provide the same semantics as
               | NumPy. In-place operations are permitted.
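                | 
                | For example, this works just as it does in NumPy:
                | 
                |     import cupy as cp
                | 
                |     a = cp.zeros(3)
                |     a[0] = 1.0   # in-place assignment is fine
                |     a += 1       # so are in-place operators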
        
             | KeplerBoy wrote:
             | Ah yes, stumbled over that recently, but the error message
             | is very helpful and it's a quick change.
        
           | kmaehashi wrote:
           | For those interested in the NumPy/SciPy API coverage in CuPy,
           | here is the comparison table:
           | 
           | https://docs.cupy.dev/en/latest/reference/comparison.html
        
         | ogrisel wrote:
         | Note that NumPy, CuPy and PyTorch are all involved in the
         | definition of a shared subset of their API:
         | 
         | https://data-apis.org/array-api/
         | 
         | So it's possible to write array API code that consumes arrays
         | from any of those libraries and delegate computation to them
         | without having to explicitly import any of them in your source
         | code.
         | 
         | The only limitation for now is that PyTorch (and to some lower
         | extent cupy as well) array API compliance is still incomplete
         | and in practice one needs to go through this compatibility
         | layer (hopefully temporarily):
         | 
         | https://data-apis.org/array-api-compat/
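          | 
          | For illustration, a function written against the array API
          | via array-api-compat's `array_namespace` helper:
          | 
          |     from array_api_compat import array_namespace
          | 
          |     def standardize(x):
          |         # works for NumPy, CuPy and PyTorch inputs alike
          |         xp = array_namespace(x)
          |         return (x - xp.mean(x)) / xp.std(x)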
        
           | ethbr1 wrote:
           | It's interesting to see hardware/software/API co-development
           | in practice again.
           | 
            | The last time I think this happened at market scale was the
            | early 3d accelerator APIs? Glide/opengl/directx. Which has
            | been a minute! (To a lesser extent, CPU vectorization
            | extensions.)
           | 
           | Curious how much of Nvidia's successful strategy was driven
           | by people who were there during that period.
           | 
           | Powerful first mover flywheel: build high performing hardware
           | that allows you to define an API -> people write useful
           | software that targets your API, because you have the highest
           | performance -> GOTO 10 (because now more software is
           | standardized on your API, so you can build even more
           | performant hardware to optimize its operations)
        
           | kmaehashi wrote:
           | An excellent example of Array API usage can be found in
           | scikit-learn. Estimators written in NumPy are now operable on
           | various backends courtesy of Array API compatible libraries
           | such as CuPy and PyTorch.
           | 
           | https://scikit-learn.org/stable/modules/array_api.html
           | 
           | Disclosure: I'm a CuPy maintainer.
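            | 
            | A small sketch of what that looks like in practice,
            | following the scikit-learn docs (assumes CuPy and a GPU):
            | 
            |     import cupy as cp
            |     import sklearn
            |     from sklearn.datasets import make_classification
            |     from sklearn.discriminant_analysis import (
            |         LinearDiscriminantAnalysis)
            | 
            |     sklearn.set_config(array_api_dispatch=True)
            |     X, y = make_classification(random_state=0)
            |     # fit on CuPy arrays; the estimator runs on the GPU
            |     lda = LinearDiscriminantAnalysis()
            |     lda.fit(cp.asarray(X), cp.asarray(y))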
        
           | kccqzy wrote:
           | And of course the native Python solution is memoryview. If
           | you need to inter-operate with libraries like numpy but you
           | cannot import numpy, use memoryview. It is specifically for
           | fast low-level access which is why it has more C
           | documentation than Python documentation:
           | https://docs.python.org/3/c-api/memoryview.html
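            | 
            | A tiny example of the zero-copy behaviour:
            | 
            |     data = bytearray(b"hello world")
            |     view = memoryview(data)
            |     sub = view[0:5]    # slicing a memoryview copies nothing
            |     sub[0] = ord("H")  # writes through to the original
            |     print(data)        # bytearray(b'Hello world')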
        
         | amarcheschi wrote:
          | I'm supposed to end my undergraduate degree with an
          | internship at the Italian national research center, where
          | I'll have to use pytorch to turn ML models from papers into
          | code. I've tried looking at the tutorial, but I feel like
          | there's a lot going on to grasp. Until now I've only used
          | numpy (and pandas in combo with numpy). I'm quite excited,
          | but I'm a bit on edge because I can't know whether I'll be
          | up to the task or not.
        
           | KeplerBoy wrote:
           | Go for it! There's nothing to lose.
           | 
            | You could check out some of EuroCC's courses. That should
            | get you up to speed:
            | https://www.eurocc-access.eu/services/training/
        
         | hedgehog wrote:
         | It's kind of unfortunate that EagerPy didn't get more traction
         | to make that kind of switching even easier.
        
         | Narhem wrote:
          | As nice as it is to have a drop-in replacement, most of the
         | cost of GPU computing is moving memory around. Wouldn't be
         | surprised if this catches unsuspecting programmers in a few
         | performance traps.
        
         | WCSTombs wrote:
         | > However, the AMD-GPU compatibility for CuPy is quite an
         | attractive feature.
         | 
         | Last I checked (a couple months ago) it wasn't quite there, but
         | I totally agree in principle. I've not gotten it to work on my
         | Radeons yet.
        
         | paperplatter wrote:
          | Hm. Tempted to try pytorch on my Mac for this. I have an
          | Apple Silicon chip rather than an Nvidia GPU.
        
         | sspiff wrote:
         | It only supports AMD cards supported by ROCm, which is quite a
         | limited set.
         | 
         | I know you can enable ROCm for other hardware as well, but it's
         | not supported and quite hit or miss. I've had limited success
         | with running stuff against ROCm on unsupported cards, mainly
         | having issues with memory management IIRC.
        
           | sitkack wrote:
           | Fingers crossed that all future AMD parts ship with full ROCm
           | support.
        
       | sdenton4 wrote:
       | Why not Jax?
        
         | bee_rider wrote:
         | Real answer: CuPy has a name that is very similar to SciPy. I
         | don't know GPU, that's why I'm using this sort of library,
         | haha. The branding for CuPy makes it obvious. Is Jax the same
         | thing, but implemented better somehow?
        
           | whimsicalism wrote:
           | yes
        
           | sdenton4 wrote:
           | Yeah, Jax provides a one-to-one reimplementation of the Numpy
           | interface, and a decent chunk of the scipy interface. Random
           | number handling is a bit different, but Numpy random number
           | handling seeeeems to be trending in the Jax direction
           | (explicitly passed RNG objects).
           | 
           | Jax also provides back-propagation wherever possible, so you
           | can optimize.
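            | 
            | Side by side (illustrative only):
            | 
            |     import numpy as np
            |     import jax
            | 
            |     # NumPy: explicitly passed Generator object
            |     rng = np.random.default_rng(seed=0)
            |     a = rng.normal(size=3)
            | 
            |     # JAX: explicitly passed (and split) PRNG keys
            |     key = jax.random.PRNGKey(0)
            |     key, subkey = jax.random.split(key)
            |     b = jax.random.normal(subkey, shape=(3,))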
        
         | palmy wrote:
          | cupy came out a long time before Jax; I remember using it in
          | a project for my BSc around 2015-2016.
         | 
         | Cool to see that it's still kicking!
        
         | johndough wrote:
         | > Why not Jax?
         | 
         | - JAX Windows support is lacking
         | 
         | - CuPy is much closer to CUDA than JAX, so you can get better
         | performance
         | 
         | - CuPy is generally more mature than JAX (fewer bugs)
         | 
         | - CuPy is more flexible thanks to cp.RawKernel
         | 
         | - (For those familiar with NumPy) CuPy is closer to NumPy than
         | jax.numpy
         | 
         | But CuPy does not support automatic gradient computation, so if
         | you do deep learning, use JAX instead. Or PyTorch, if you do
         | not trust Google to maintain a project for a prolonged period
         | of time https://killedbygoogle.com/
        
           | gnulinux wrote:
            | What about CPU-only loads? What if one wants to write code
            | that'll eventually run on both CPU and GPU but in the
            | short-to-mid term will only be used on CPU? Since JAX
            | natively supports CPU (via its CPU backend), but CuPy
            | doesn't, this seems like a potential problem for some.
        
             | nextaccountic wrote:
             | Isn't there a way to dynamically select between numpy and
             | cupy, depending on whether you want cpu or gpu code?
        
               | gnulinux wrote:
                | There is, but then you're using two separate
                | libraries, which seems like a fragile point of failure
                | compared to just using jax. But regardless, since jax
                | will use different backends anyway, it's arguably not
                | any worse (except it ends up being your responsibility
                | to ensure correctness, as opposed to the jax team's).
        
               | kmaehashi wrote:
                | NumPy has a mechanism to dispatch execution to CuPy:
                | https://numpy.org/neps/nep-0018-array-function-protocol.html
                | 
                | Just prepare the input as a NumPy or CuPy array, then
                | feed it to NumPy APIs. NumPy functions will handle the
                | computation themselves if the input is a NumPy
                | ndarray, or dispatch execution to CuPy if the input is
                | a CuPy ndarray.
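                | 
                | A minimal sketch of that dispatch (assuming CuPy and a
                | CUDA GPU are available):
                | 
                |     import numpy as np
                |     import cupy as cp
                | 
                |     x_gpu = cp.arange(10)
                |     # np.sum sees a CuPy ndarray and dispatches to
                |     # cupy.sum, so the result stays on the GPU
                |     s = np.sum(x_gpu)
                |     print(type(s))  # <class 'cupy.ndarray'>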
        
               | johndough wrote:
               | > Isn't there a way to dynamically select between numpy
               | and cupy, depending on whether you want cpu or gpu code?
               | 
                | CuPy is an (almost) drop-in replacement for NumPy, so
                | the following works surprisingly often:
                | 
                |     if use_cpu:
                |         import numpy as np
                |     else:
                |         import cupy as np
        
           | insane_dreamer wrote:
           | > CuPy does not support automatic gradient computation, so if
           | you do deep learning, use JAX instead
           | 
            | DL is a major use case; is CuPy planning on adding
            | automatic gradient computation?
        
       | bee_rider wrote:
       | Good a place as any to ask I guess. Do any of these GPU libraries
       | have a BiCGStab (or similar) that handles multiple right hand
       | sides? CuPy seems to have GMRES, which would be fine, but as far
       | as I can tell it just does one right hand side.
        
         | trostaft wrote:
         | IIRC jax's `scipy.sparse.linalg.bicgstab` does support multiple
         | right hand sides.
         | 
          | EDIT: Or rather, all the solvers under jax's
          | `scipy.sparse.linalg` support multiple right hand sides.
        
           | bee_rider wrote:
           | Oh dang, that's pretty awesome, thanks.
           | 
           | "array or tree of arrays" sounds very general, probably even
           | better than an old fashioned 2D array.
        
             | trostaft wrote:
             | 'tree of arrays'
             | 
              | Ahh, that's just Jax's concept of pytrees. It's something
              | they invented to make it easier (this is how I view it,
              | not the complete picture) to pass complex objects to a
              | function while still being able to easily consider them
              | as a concatenated vector for AD etc. E.g. a common
              | pattern is to pass parameters `p` to a function and then
              | internally break them into their physical
              | interpretations, e.g. `mass = p[0]`, `velocity = p[1]`.
              | Pytrees let you just use something like a dictionary
              | `p = {'mass': 1.0, 'velocity': 1.0}`, which is a
              | stylistically more natural structure to pass around, and
              | jax is structured to understand later, when AD'ing or
              | otherwise, that you're doing so with respect to the
              | 'leaves' of the tree, i.e. the values of the mass and
              | velocity.
             | 
             | Hopefully someone corrects me if I'm not right about this.
             | I'm hardly 100% on Jax's vision on PyTrees.
             | 
             | As an aside, just a list of right hand sides `[b1, b2, ...,
             | bm]` is valid.
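              | 
              | A tiny illustration of the pytree idea (nothing beyond
              | jax.tree_util here):
              | 
              |     import jax
              | 
              |     p = {'mass': 1.0, 'velocity': 1.0}
              |     # apply a function to every leaf of the tree at once
              |     scaled = jax.tree_util.tree_map(lambda x: 2 * x, p)
              |     # -> {'mass': 2.0, 'velocity': 2.0}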
        
         | johndough wrote:
         | If you have many right hand sides, you could also compute an LU
         | factorization and then solve the right hand sides via back-
         | substitution.
         | 
         | https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc...
         | 
         | or
         | https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc...
         | if your linear system is sparse.
         | 
         | But whether that works well depends on the problem you are
         | trying to solve.
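          | 
          | A rough sketch of the dense route (cupyx.scipy.linalg
          | provides lu_factor/lu_solve; for sparse systems,
          | cupyx.scipy.sparse.linalg.splu plays the same role):
          | 
          |     import cupy as cp
          |     from cupyx.scipy.linalg import lu_factor, lu_solve
          | 
          |     A = cp.random.rand(100, 100)
          |     B = cp.random.rand(100, 8)    # 8 right hand sides
          | 
          |     lu, piv = lu_factor(A)        # factorize once
          |     X = lu_solve((lu, piv), B)    # solve all RHS at once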
        
           | bee_rider wrote:
           | My systems are sparse, but might not fit on the GPU when
           | factorized. Actually, usually I do CPU stuff with lots of
           | ram, and Pardiso, so it isn't an issue.
           | 
           | But I was hoping to try out something like ILU+bicgstab on
           | the GPU and the python-verse seems like it has the lowest
           | barrier-to-entry for just playing around.
        
             | johndough wrote:
             | For my tasks, I had some success with algebraic multigrid
             | solvers as preconditioner, for example from AMGCL or PyAMG.
             | They are also reasonably easy to get started with.
             | 
             | https://github.com/pyamg/pyamg
             | 
             | https://github.com/ddemidov/amgcl
             | 
             | But I only have to deal with positive definite systems, so
             | YMMV.
             | 
             | I am not sure whether those libraries can deal with
             | multiple right-hand sides, but most complexity is in the
             | preconditioners anyway.
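              | 
              | For reference, the PyAMG pattern is roughly this (a
              | sketch; A is a sparse SPD matrix, b a right hand side):
              | 
              |     import pyamg
              |     import scipy.sparse.linalg as spla
              | 
              |     # build a multigrid hierarchy once...
              |     ml = pyamg.smoothed_aggregation_solver(A)
              |     # ...and use it as a preconditioner for BiCGStab
              |     M = ml.aspreconditioner()
              |     x, info = spla.bicgstab(A, b, M=M)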
        
       | adancalderon wrote:
       | If it ran in the background it could be CuPyd
        
       | whimsicalism wrote:
       | I was just thinking we didn't have enough CUDA-accelerated numpy
       | libraries.
       | 
       | Jax, pytorch, vanilla TF, triton. They just don't cut it
        
       | meisel wrote:
        | When building something that I want to run on either CPU or
        | GPU, depending on the situation, I've found it much easier to
        | use PyTorch than some combination of NumPy and CuPy. I don't
        | have to fiddle around with some global replacing of numpy.*
        | with cupy.*, and PyTorch has very nearly all the functions that
        | those libraries have.
        
         | setopt wrote:
         | Interesting. Any links to examples or docs on how to use
         | PyTorch as a general linear algebra library for this purpose?
         | Like a "SciPy to PyTorch" transition guide if I want to do the
         | same?
        
           | meisel wrote:
           | It's typically just importing torch and s/np/torch, not too
           | different from NumPy -> CuPy. Try it in your own code and
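            | 
            | E.g. something like this (illustrative; explicit device
            | handling is the main addition over NumPy):
            | 
            |     import torch
            | 
            |     device = "cuda" if torch.cuda.is_available() else "cpu"
            |     x = torch.linspace(0, 1, 1000, device=device)
            |     y = torch.sin(x) @ torch.ones(1000, device=device)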
           | see!
        
           | ttyprintk wrote:
           | Mentioned above:
           | 
           | https://data-apis.org/array-api-compat/
        
       | lmeyerov wrote:
       | We are fans! We mostly use cudf/cuml/cugraph (GPU dataframes etc)
       | in the pygraphistry ecosystem, and when things get a bit tricky,
       | cupy is one of the main escape hatches
        
       | SubiculumCode wrote:
        | An aside, since I was trying to install CuPy the other day and
        | was having issues:
        | 
        | Open projects on github often (at least superficially) require
        | specific versions of the Cuda Toolkit (and all the specialty
        | nvidia packages, e.g. cuDNN), Tensorflow, etc., and changing
        | the default versions of these for each little project, or step
        | in a processing chain, is ridiculous.
       | 
        | pyenv et al. have really made local, project-specific versions
        | of python packages much easier to manage. But I haven't seen a
        | similar type of solution for the cuda toolkit and associated
        | packages, and the solutions I've encountered seem terribly
        | hacky. I'm sure this is a common issue though, so what do
        | people do?
        
         | coeneedell wrote:
         | Ugh... docker containers. I also wish there was a simpler way
         | but I don't think there is.
        
           | SubiculumCode wrote:
            | This is not what I wanted to hear. NOT AT ALL. Please whisper
           | sweet lies into my ears.
        
             | coeneedell wrote:
             | At the moment I'm working on a system to quickly replicate
             | academic deep learning repos (papers) at scale. At least
             | Amazon has a catalogue of prebuilt containers with
             | cuda/pytorch combos. I still occasionally have an issue
             | where the container works on my 3090 test bench but not on
             | the T4 cloud node...
        
         | m_d_ wrote:
         | conda provides cudatoolkit and associated packages. Does this
         | solve the situation?
        
           | nyrikki wrote:
            | The conda 200-employee threshold licence change is
            | problematic for some.
        
             | boldlybold wrote:
             | As long as you stay out of the "defaults" and "anaconda"
             | repos, you're not subject to that license. For my needs
             | conda-forge and bioconda have everything. I'm not sure
             | about the nvidia repo but I assume it's similar.
        
               | kmaehashi wrote:
               | Actually all CUDA Toolkit libs are already available
               | through the conda-forge channel:
               | https://anaconda.org/conda-forge/cuda-cudart,
               | https://anaconda.org/conda-forge/libcublas, etc.
        
           | SubiculumCode wrote:
            | Actually yes, it does... except I seem to remember that it
            | doesn't go back that far in cuda versions. I can't seem to
            | find it again right now.
        
         | whimsicalism wrote:
         | in real life everyone just uses containers, might not be the
         | answer you want to hear though
        
         | kmaehashi wrote:
         | As a maintainer of CuPy and also as a user of several GPU-
         | powered Python libraries, I empathize with the frustrations and
         | difficulties here. Indeed, one thing CuPy values is to make the
         | installation step as easy and universal as possible. We strive
          | to keep the binary package footprint small (currently less
          | than 100 MiB), keep dependencies to a minimum, support a wide
          | variety of platforms including Windows and aarch64, and not
          | require a specific CUDA Toolkit version.
         | 
         | If anyone reading this message has encountered a roadblock
         | while installing CuPy, please reach out. I'd be glad to help
         | you.
        
         | mardifoufs wrote:
          | One way to do it is to explicitly add the link to, say, the
          | pytorch+CUDA wheel from the pytorch repos in your
          | requirements.txt instead of using the normal pypi package.
          | Which also sucks, because you then have to do some other
          | tweaks to make your requirements.txt portable across
          | different platforms...
          | 
          | (And you can't just add another index for pip to look in if
          | you want to use python build, so it has to be explicitly
          | linked to the right wheel, which absolutely sucks, especially
          | since you cannot get the CUDA version from pypi.)
        
         | ttyprintk wrote:
         | You can jam the argument cudatoolkit=1.2.3 when creating conda
         | environments.
         | 
         | NB I'm using Miniforge.
        
       | hamilyon2 wrote:
        | There is a somewhat similar project which supports Intel GPU
       | offloading: https://github.com/intel/scikit-learn-intelex
        
       | johndough wrote:
       | CuPy is probably the easiest way to interface with custom CUDA
       | kernels:
       | https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...
       | 
       | And I recently learned that CuPy has a JIT compiler now if you
       | prefer Python syntax over C++.
       | https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-k...
        
         | einpoklum wrote:
         | > probably the easiest way to interface with custom CUDA
         | kernels
         | 
         | In Python? Perhaps. Generally? No, it isn't. Try:
         | https://github.com/eyalroz/cuda-api-wrappers/
         | 
         | Full power of the CUDA APIs including all runtime compilation
         | options etc.
         | 
         | (Yes, I wrote that...)
        
           | johndough wrote:
            | Personally, I prefer CuPy over your library. For example,
            | your vectorAdd.cu implementation at
            | https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa...
            | is much longer than a similar CuPy implementation:
            | 
            |     import cupy as cp
            | 
            |     vector_add = cp.RawKernel("""
            |     extern "C" __global__
            |     void vector_add(const float *A, const float *B,
            |                     float *C, int num_elements) {
            |         int i = blockDim.x * blockIdx.x + threadIdx.x;
            |         if (i < num_elements) { C[i] = A[i] + B[i]; }
            |     }
            |     """, "vector_add")
            | 
            |     num_elements = 50_000
            |     block_size = 256
            |     # round up to next multiple of block_size
            |     grid_size = (num_elements + block_size - 1) // block_size
            | 
            |     a = cp.random.rand(num_elements, dtype=cp.float32)
            |     b = cp.random.rand(num_elements, dtype=cp.float32)
            |     c = cp.zeros(num_elements, dtype=cp.float32)
            |     args = (a, b, c, num_elements)
            | 
            |     print(f"[Vector addition of {num_elements} elements]")
            |     print(f"CUDA kernel launch with {grid_size} blocks "
            |           f"of {block_size} threads each")
            |     vector_add((grid_size,), (block_size,), args)
            | 
            |     incorrect = cp.abs(a + b - c) > 1e-5
            |     if cp.any(incorrect):
            |         print("Result verification failed at element",
            |               cp.argmax(incorrect))
            |     else:
            |         print("Test PASSED")
            |         print("SUCCESS")
           | 
            | It could be made even shorter with a cp.ElementwiseKernel:
            | https://docs.cupy.dev/en/stable/user_guide/kernel.html#basic...
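            | 
            | Roughly like this (cp.ElementwiseKernel handles the
            | indexing and launch configuration for you):
            | 
            |     import cupy as cp
            | 
            |     vector_add = cp.ElementwiseKernel(
            |         'float32 a, float32 b',  # inputs
            |         'float32 c',             # output
            |         'c = a + b',             # element-wise body
            |         'vector_add')
            | 
            |     a = cp.random.rand(50_000, dtype=cp.float32)
            |     b = cp.random.rand(50_000, dtype=cp.float32)
            |     c = vector_add(a, b)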
           | 
           | Although I have to concede that the automatic grid size
           | computation in cuda-api-wrappers is nice.
           | 
           | A few marketing tips for your README:
           | 
            | * Put a code example directly at the top. You want to
            | present the selling points of your library to the reader as
            | fast as possible. For reference, look at the CuPy README
            | https://github.com/cupy/cupy?tab=readme-ov-file#cupy--numpy-...
            | which immediately shows the reader what it is good for.
            | Your README starts with lots of text, but nobody reads text
            | anymore these days. A link to examples is almost at the
            | end, and then the examples are deeply nested.
           | 
           | * The first links in the README should link to your own
           | library, for example to documentation or examples. You do not
           | want to lead the reader away from your GitHub page.
           | 
            | * Add syntax highlighting with "cpp" after triple
            | backticks:
            | 
            |     ```cpp
            |     <code here>
            |     ```
        
       | kunalgupta022 wrote:
        | Is anyone aware of a pandas-like library that is based on
        | something like CuPy instead of Numpy? It would be great to have
        | the ease of use of pandas with the parallelism unlocked by the
        | gpu.
        
         | lokimedes wrote:
         | Not specifically GPU, but that's also highly dependent on the
         | data access pattern: https://www.dask.org/
        
         | Scene_Cast2 wrote:
         | I'd go digging into this - https://pola.rs/posts/polars-on-gpu/
        
         | kmaehashi wrote:
          | cuDF is a CuPy-based library providing a drop-in replacement
          | for Pandas: https://rapids.ai/
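          | 
          | Usage mirrors pandas closely, e.g. (a small sketch, assuming
          | cuDF is installed):
          | 
          |     import cudf
          | 
          |     df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
          |     print(df.groupby("a").sum())  # computed on the GPU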
        
       | curvilinear_m wrote:
       | I'm surprised to see pytorch and Jax mentioned as alternatives
        | but not numba: https://github.com/numba/numba
       | 
        | I've recently had to implement a few kernels to lower the
        | memory footprint and runtime of some pytorch functions: it's
        | been really nice because numba kernels have type hint support
        | (as opposed to raw cupy kernels), as sketched below.
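        | 
        | For flavor, a numba CUDA kernel looks roughly like this
        | (illustrative sketch):
        | 
        |     import numpy as np
        |     from numba import cuda
        | 
        |     @cuda.jit
        |     def add_kernel(a, b, c):
        |         i = cuda.grid(1)
        |         if i < c.size:
        |             c[i] = a[i] + b[i]
        | 
        |     n = 50_000
        |     a = np.random.rand(n).astype(np.float32)
        |     b = np.random.rand(n).astype(np.float32)
        |     c = np.zeros_like(a)
        |     # forall picks a launch configuration for n threads
        |     add_kernel.forall(n)(a, b, c)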
        
         | killingtime74 wrote:
         | Numba doesn't support GPU though
        
           | catshamando wrote:
           | Yes it does:
           | https://numba.readthedocs.io/en/stable/cuda/kernels.html
        
           | the_svd_doctor wrote:
           | It does (for NVIDIA at least)
        
       | setopt wrote:
       | I've been using CuPy a bit and found it to be excellent.
       | 
       | It's very easy to replace some slow NumPy/SciPy calls with
       | appropriate CuPy calls, with sometimes literally a 1000x
       | performance boost from like 10min work. It's also easy to write
       | "hybrid code" where you can switch between NumPy and CuPy
       | depending on what's available.
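        | 
        | For the hybrid part, the usual trick is cupy's
        | get_array_module (a sketch):
        | 
        |     import numpy as np
        |     try:
        |         import cupy as cp
        |     except ImportError:
        |         cp = None
        | 
        |     def softmax(x):
        |         # returns numpy for NumPy inputs, cupy for CuPy inputs
        |         xp = cp.get_array_module(x) if cp else np
        |         e = xp.exp(x - x.max())
        |         return e / e.sum()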
        
         | glial wrote:
         | Are you able to share what functions or situations result in
         | speedups? In my experience, vectorized numpy is already fast,
         | so I'm very curious.
        
           | KeplerBoy wrote:
           | Not OP, but think about stuff like FFTs or Matmuls. It's not
           | even a competition, GPUs win when the algorithm is somewhat
           | suitable and you're dealing with FP32 or lower precision.
        
           | setopt wrote:
           | The largest speedup I have seen was for a quantum mechanics
           | simulation where I needed to repeatedly calculate all
           | eigenvalues of Hermitian matrices (but not necessarily their
           | eigenvectors).
           | 
            | This was basically the code needed:
            | 
            |     import scipy.linalg as la
            | 
            |     if cuda:
            |         import cupy as cp
            |         import cupy.linalg as cla
            |         e = cp.asnumpy(cla.eigvalsh(cp.asarray(H)))
            |     else:
            |         e = la.eigvalsh(H)
           | 
           | I was using IntelPython which already has fast (parallelized)
           | methods for this using MKL, but CuPy blew it out of the
           | water.
        
       | aterrel-nvidia wrote:
        | If you like cupy, definitely check out the multi-node,
        | multi-GPU version, cuNumeric: https://github.com/nv-legate/cunumeric
       | 
       | Would love to get any feedback from the community.
        
       ___________________________________________________________________
       (page generated 2024-09-20 23:00 UTC)