[HN Gopher] VkFFT: Vulkan/CUDA/Hip/OpenCL/Level Zero/Metal Fast ...
___________________________________________________________________
VkFFT: Vulkan/CUDA/Hip/OpenCL/Level Zero/Metal Fast Fourier
Transform Library
Author : thunderbong
Score : 153 points
Date : 2023-08-02 08:06 UTC (14 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| wiz21c wrote:
| Rust bindings!!! pleeeeze !
|
| Very impressive performance. I'd be happy to see a comparison
| with regular CPU performance... If you add up the GPU time plus
| the GPU upload and download, is it faster than the CPU overall?
| gary_0 wrote:
| Rust bindings are at the bottom of the readme:
| https://github.com/semio-ai/vkfft-rs
|
| Also Python: https://github.com/vincefn/pyvkfft
| randomNumber7 wrote:
| > If you put together the GPU time + GPU upload & download, is
| it faster than CPU overall?
|
| That always depends on sample size and the hardware you use.
|
| And, as with all problems of this kind, it also depends on how
| parallelizable the computation you are doing is.
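The "it depends" above can be made concrete with a back-of-envelope model. Every number below (PCIe bandwidth, GPU speedup factor, CPU timings) is an illustrative assumption, not a VkFFT benchmark:

```python
def gpu_total_seconds(n_bytes: float, gpu_speedup: float, cpu_seconds: float,
                      pcie_gbps: float = 16.0) -> float:
    """Total GPU wall time = upload + compute + download.

    Illustrative model only: pcie_gbps and gpu_speedup are assumed
    numbers, not measurements of any real device or library.
    """
    transfer = 2 * n_bytes / (pcie_gbps * 1e9)  # bytes up + bytes down
    return transfer + cpu_seconds / gpu_speedup

# A 1 MB FFT the CPU finishes in 50 us: the two PCIe copies alone cost
# ~125 us, so the GPU loses despite a 10x compute advantage.
print(gpu_total_seconds(1e6, 10.0, 50e-6) > 50e-6)   # True

# If the data already lives on the GPU, transfer drops out and the
# GPU's compute advantage wins.
print(50e-6 / 10.0 < 50e-6)                          # True
```

The model suggests why the answer hinges on whether the data is already resident on the GPU, which later comments in this thread also point out.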
| DTolm wrote:
| Hello, I am the author of VkFFT, Tolmachev Dmitrii.
|
| I remember VkFFT got a lot of initial traction thanks to Hacker
| News three years ago. Back then, VkFFT was a simple collection
| of pre-made shaders for power-of-two FFTs.
|
| Nowadays it is built on a runtime code-generation and
| optimization platform that supports all the mentioned backends,
| has a wide range of implemented algorithms (some of which are not
| even present in other codes) to cover all system sizes and can do
| things no other GPU FFT library can so far (like real to real
| transforms, arbitrary dimensional transforms, zero-padding,
| convolutions and more).
|
| If you have some questions about the library, design choices,
| functionality or anything else - I will be happy to answer them!
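The fused convolution support mentioned above rests on the convolution theorem: pointwise multiplication in the frequency domain equals circular convolution in the signal domain. A toy O(n^2) DFT in plain Python (nothing to do with VkFFT's actual generated kernels, purely to demonstrate the identity):

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT in the same naive form."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]

def circular_convolve(a, b):
    """Convolution theorem: conv(a, b) = IDFT(DFT(a) * DFT(b))."""
    return [c.real for c in idft([u * v for u, v in zip(dft(a), dft(b))])]

# Convolving with a one-sample delay rotates the signal by one position.
out = circular_convolve([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 0.0])
print([round(v, 6) for v in out])  # [4.0, 1.0, 2.0, 3.0]
```

A library that fuses this multiply into the FFT kernels avoids a full round trip through GPU memory between the forward and inverse transforms.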
| latchkey wrote:
| I'd like to see this tested on a 210/250 with ROCm 5.6. There are
| improvements in the latest release that might affect the
| benchmarks in a positive way.
| dsego wrote:
| How does this compare to something like fftw3 or pffft?
| pid-1 wrote:
| I think FFTW does not run on GPUs.
| dsego wrote:
| Right, I know, but what's the advantage of GPU vs. CPU for
| FFT, considering that CPUs support some vectorization and you
| need to format the data and send it to the GPU and back?
| geokon wrote:
| As far as I understand, that's not a very meaningful
| question, because it depends on which CPU and which GPU. So
| it's a bit apples to oranges and depends on the user's
| configuration. There is a benchmark at the very bottom:
| https://openbenchmarking.org/test/pts/vkfft
|
| Also, maybe a bit obvious, but even if there is no huge
| benefit, sending compute to the GPU frees up your
| CPU/application to do other things... like keeping your
| application responsive :)
| the_svd_doctor wrote:
| FFT is memory bound (it's N log N flops on N bytes, so very
| little arithmetic per byte moved). GPU HBM is much faster
| than CPU DRAM, so it's generally much faster on a GPU.
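That claim can be sanity-checked with the standard flop-per-byte estimate. The 5 N log2(N) flop count is the usual textbook figure for a radix-2 FFT; the byte count below assumes one read and one write of complex doubles and is illustrative:

```python
from math import log2

def fft_arithmetic_intensity(n: int) -> float:
    """Rough arithmetic intensity (flops per byte) of a complex FFT.

    Assumptions (textbook estimates, not tied to any library):
      - ~5 * n * log2(n) flops for a radix-2 style FFT
      - 2 * 16 * n bytes moved (read + write of complex doubles)
    """
    flops = 5 * n * log2(n)
    bytes_moved = 2 * 16 * n
    return flops / bytes_moved

# Intensity grows only logarithmically with n, so the FFT stays
# bandwidth-bound even at large sizes: faster memory wins.
for n in (2**10, 2**20, 2**27):
    print(n, round(fft_arithmetic_intensity(n), 3))
```

Even at n = 2^27 the intensity is only a few flops per byte, far below what modern compute units can sustain, which is why memory bandwidth dominates.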
| johnbcoughlin wrote:
| The FFT is rarely the only thing you're doing, so at the
| very least you get to keep the data local if it was already
| on gpu.
| KeplerBoy wrote:
| Nvidia has something they call cuFFTW. It's basically a
| drop-in replacement for FFTW.
|
| That's the kind of stuff Nvidia has offered for the last
| decade while AMD did god knows what.
|
| https://docs.nvidia.com/cuda/cufft/index.html
| serialx wrote:
| Now we just need VkDNN
| raphlinus wrote:
| To a first approximation, Kompute[1] is that. It doesn't seem
| to be catching on, though; I'm seeing more buzz around WebGPU
| solutions, including wonnx[2] and more hand-rolled approaches,
| as well as IREE[3], the latter of which has a Vulkan back-end.
|
| [1]: https://kompute.cc/
|
| [2]: https://github.com/webonnx/wonnx
|
| [3]: https://github.com/openxla/iree
| figomore wrote:
| Another option is tinygrad [1], which has a WebGPU backend and
| works very well for my case (3D U-Net).
|
| [1] - https://tinygrad.org/
| rcme wrote:
| You need VkBLAS first.
| Gimpei wrote:
| I'd love to see this in torch. What are the odds?
| Y_Y wrote:
| Implementing a custom layer isn't hard[0]. That said, if it
| were me, I'd rather add it in the runtime, e.g. via a TensorRT
| plugin.
|
| [0] e.g.
| https://jamesmccaffrey.wordpress.com/2021/09/02/example-of-a...
| mathisfun123 wrote:
| That tutorial only shows you how to remix existing operators,
| which obviously won't work here. You need to do this:
|
| https://pytorch.org/tutorials/advanced/dispatcher.html
|
| More complex but still not that hard.
| uoaei wrote:
| What?
|
| https://pytorch.org/docs/stable/generated/torch.fft.fftn.htm...
| earthnail wrote:
| But not backed by VkFFT. The implication of the comment is
| that FFTs on the various backends would be easier if they were
| implemented on top of VkFFT in the first place. I'm not sure
| that's true, though, as I don't know how much code the various
| backends share.
|
| Edit: as an example that I experienced first-hand,
| coremltools, which converts PyTorch to CoreML models, only
| gained FFT support very recently. It's also not really a
| PyTorch backend but a PyTorch code converter, though, so
| wouldn't benefit at all from PyTorch's FFT being backed by
| VkFFT. Still, good example that one shouldn't take FFTs for
| granted.
| Gimpei wrote:
| Of course, but cuFFT is half the speed of VkFFT.
| taminka wrote:
| * world if everyone just used vulkan *
|
| * futuristic buildings and flying cars *
|
| that's pretty cool, the vk prefix for a non-vulkan-only library
| is kind of confusing tho
| andrewmcwatters wrote:
| Vulkan is a terrible standard. And, by relation, DirectX now
| is too, considering they're identical. The amount of
| absolutely worthless boilerplate is through the roof.
|
| Here's a compact(!) implementation of "Hello, Triangle!"
| https://github.com/Planimeter/game-engine-3d/blob/97298715b2...
|
| 921 lines of doing nothing.
| ajb wrote:
| It was originally Vulkan-only; maybe they should have renamed
| it.
| Archit3ch wrote:
| I propose Fastest Fourier Transform in the South by Southwest
| (FFTSXSW).
| hoosieree wrote:
| The way it works now with run-time codegen it should be
| Fast Fourier Transform Fixed That For You (FFTFTFY) which
| also somewhat evokes a butterfly diagram.
| neverrroot wrote:
| Make AMD a first-class citizen.
| throwaway073123 wrote:
| AMD needs to make itself a first-class citizen.
| ineedtocall wrote:
| Did NVIDIA's ascent to a 1T company result primarily from
| their substantial software investments, or is there another
| element that AMD needs to focus on to achieve similar
| recognition and adoption in the realm of GPU compute?
| imtringued wrote:
| It's mostly software. Nobody gives a damn about your AI
| chip even if it is better than what AMD has and AMD's
| hardware is no slouch.
|
| The other factor is that AMD's data center hardware is not
| available at any cloud provider so nobody even has access
| to the supposedly supported hardware.
| tinpotpotata wrote:
| NVIDIA let you run CUDA on consumer GPUs and AMD didn't let
| you run ROCm on consumer GPUs. Big mistake.
___________________________________________________________________
(page generated 2023-08-02 23:01 UTC)