[HN Gopher] VkFFT: Vulkan/CUDA/Hip/OpenCL/Level Zero/Metal Fast Fourier Transform Library
       ___________________________________________________________________
        
       VkFFT: Vulkan/CUDA/Hip/OpenCL/Level Zero/Metal Fast Fourier
       Transform Library
        
       Author : thunderbong
       Score  : 153 points
       Date   : 2023-08-02 08:06 UTC (14 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | wiz21c wrote:
       | Rust bindings!!! pleeeeze !
       | 
        | Very impressive performance. I'd be happy to see a comparison
        | with regular CPU performance... If you add up the GPU time
        | plus the upload and download, is it faster than the CPU
        | overall?
        
         | gary_0 wrote:
         | Rust bindings are at the bottom of the readme:
         | https://github.com/semio-ai/vkfft-rs
         | 
         | Also Python: https://github.com/vincefn/pyvkfft
        
         | randomNumber7 wrote:
         | > If you put together the GPU time + GPU upload & download, is
         | it faster than CPU overall?
         | 
         | That always depends on sample size and the hardware you use.
         | 
          | And, as with all problems of this kind, it also depends on
          | how parallelizable the computation you are doing is.
        
       | DTolm wrote:
       | Hello, I am the author of VkFFT, Tolmachev Dmitrii.
       | 
       | I remember VkFFT got a lot of initial traction thanks to Hacker
       | News three years ago. Back then VkFFT was a simple collection of
       | pre-made shaders for powers of two FFTs.
       | 
        | Nowadays it is based on a runtime code generation and
        | optimization platform that supports all the mentioned
        | backends, has a wide range of implemented algorithms (some of
        | which are not even present in other codes) to cover all
        | system sizes, and can do things no other GPU FFT library can
        | so far (like real-to-real transforms, arbitrary-dimensional
        | transforms, zero-padding, convolutions and more).
       | 
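        | For the curious, the user-facing flow is plan-based, much
        | like other FFT libraries. A minimal, simplified sketch of a
        | forward 1D transform with the CUDA backend (buffer setup and
        | most error handling omitted; see the repository documentation
        | for the full configuration):
        | 
        |     // compile with -DVKFFT_BACKEND=1 for the CUDA backend
        |     #include "vkFFT.h"
        | 
        |     VkFFTResult run_fft(CUdevice* device, void** gpuBuffer,
        |                         uint64_t n) {
        |         VkFFTConfiguration configuration = {};
        |         VkFFTApplication app = {};
        |         configuration.FFTdim = 1;      // 1D transform
        |         configuration.size[0] = n;     // any size, not just 2^k
        |         configuration.device = device; // backend-specific handle
        |         // runtime code generation happens at plan creation
        |         VkFFTResult res = initializeVkFFT(&app, configuration);
        |         if (res != VKFFT_SUCCESS) return res;
        |         VkFFTLaunchParams launchParams = {};
        |         launchParams.buffer = gpuBuffer; // data already on GPU
        |         // -1 = forward transform, 1 = inverse
        |         res = VkFFTAppend(&app, -1, &launchParams);
        |         deleteVkFFT(&app);
        |         return res;
        |     }
        | 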
       | If you have some questions about the library, design choices,
       | functionality or anything else - I will be happy to answer them!
        
       | latchkey wrote:
        | I'd like to see this tested on an MI210/MI250 with ROCm 5.6.
        | There are improvements in the latest release that might
        | affect the benchmarks positively.
        
       | dsego wrote:
       | How does this compare to something like fftw3 or pffft?
        
         | pid-1 wrote:
         | I think FFTW does not run on GPUs.
        
           | dsego wrote:
            | Right, I know, but what's the advantage of GPU vs CPU for
            | FFT, considering CPUs support some vectorization and you
            | need to format the data and send it to the GPU and back?
        
             | geokon wrote:
             | As far as I understand that's not a very meaningful
             | question b/c it depends on what CPU and what GPU. So it's a
             | bit apples to oranges and depends on the user's
             | configuration. There is a benchmark at the very bottom:
             | https://openbenchmarking.org/test/pts/vkfft
             | 
              | Also, maybe a bit obvious, but even if there is no huge
              | benefit, sending compute to the GPU frees up your
              | CPU/application to do other things... like keeping your
              | application responsive :)
        
             | the_svd_doctor wrote:
             | FFT is memory bound (it's N log N flops for N bytes, so
             | little arithmetic). GPU HBM is much faster than DRAM, so
             | generally it's much faster on GPU.
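              | 
              | A rough back-of-envelope (assuming the classic
              | ~5 N log2 N flop count and, best case, a single read
              | and write of the data):
              | 
              |     #include <cmath>
              |     #include <cstdio>
              | 
              |     int main() {
              |         double n = double(1 << 24); // 16M-point FFT
              |         double flops = 5.0 * n * std::log2(n);
              |         // one read + one write of 8-byte complex floats
              |         double bytes = 2.0 * 8.0 * n;
              |         // prints ~7.5 flop/byte: far below what a modern
              |         // GPU needs to be compute bound, so bandwidth
              |         // dominates
              |         std::printf("%.1f flop/byte\n", flops / bytes);
              |         return 0;
              |     }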
        
             | johnbcoughlin wrote:
             | The FFT is rarely the only thing you're doing, so at the
             | very least you get to keep the data local if it was already
             | on gpu.
        
           | KeplerBoy wrote:
            | Nvidia has something they call cuFFTW. It's basically a
            | drop-in replacement for FFTW.
           | 
           | That's the kind of stuff Nvidia has offered for the last
           | decade while AMD did god knows what.
           | 
           | https://docs.nvidia.com/cuda/cufft/index.html
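            | 
            | Porting is mostly a header swap plus a link-flag change.
            | A hedged, untested sketch against the FFTW-style
            | interface cuFFTW exposes:
            | 
            |     // link with -lcufftw instead of -lfftw3f
            |     #include <cufftw.h> // instead of <fftw3.h>
            | 
            |     int main() {
            |         const int n = 1024;
            |         fftwf_complex* in = (fftwf_complex*)
            |             fftwf_malloc(sizeof(fftwf_complex) * n);
            |         fftwf_complex* out = (fftwf_complex*)
            |             fftwf_malloc(sizeof(fftwf_complex) * n);
            |         // ... fill `in` with samples ...
            |         fftwf_plan p = fftwf_plan_dft_1d(
            |             n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
            |         fftwf_execute(p); // executes on the GPU
            |         fftwf_destroy_plan(p);
            |         fftwf_free(in);
            |         fftwf_free(out);
            |         return 0;
            |     }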
        
       | serialx wrote:
       | Now we just need VkDNN
        
         | raphlinus wrote:
          | To a first approximation, Kompute[1] is that. It doesn't
          | seem to be catching on; I'm seeing more buzz around WebGPU
          | solutions, including wonnx[2] and more hand-rolled
          | approaches, and IREE[3], the last of which has a Vulkan
          | back-end.
         | 
         | [1]: https://kompute.cc/
         | 
         | [2]: https://github.com/webonnx/wonnx
         | 
         | [3]: https://github.com/openxla/iree
        
           | figomore wrote:
            | Another option is Tinygrad [1], which has a WebGPU
            | backend and works very well for my case (UNet 3D).
           | 
           | [1] - https://tinygrad.org/
        
         | rcme wrote:
         | You need VkBLAS first.
        
       | Gimpei wrote:
       | I'd love to see this in torch. What are the odds?
        
         | Y_Y wrote:
          | Implementing a custom layer isn't hard[0]. That said, if it
          | were me, I'd rather add it in the runtime, e.g. via a
          | TensorRT plugin.
         | 
         | [0] e.g.
         | https://jamesmccaffrey.wordpress.com/2021/09/02/example-of-a...
        
           | mathisfun123 wrote:
            | That tutorial only shows you how to remix existing
            | operators, which obviously won't work here (it's not
            | what's required). You need to do this:
           | 
           | https://pytorch.org/tutorials/advanced/dispatcher.html
           | 
           | More complex but still not that hard.
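            | 
            | Roughly this shape (a hedged sketch of the registration
            | pattern that tutorial describes; the vkfft_forward_*
            | functions are hypothetical and you would write them
            | yourself):
            | 
            |     #include <torch/library.h>
            | 
            |     // hypothetical backend-specific implementations
            |     at::Tensor vkfft_forward_cpu(const at::Tensor& x);
            |     at::Tensor vkfft_forward_cuda(const at::Tensor& x);
            | 
            |     // declare the operator schema once
            |     TORCH_LIBRARY(myops, m) {
            |         m.def("vkfft_forward(Tensor x) -> Tensor");
            |     }
            | 
            |     // register one kernel per dispatch key
            |     TORCH_LIBRARY_IMPL(myops, CPU, m) {
            |         m.impl("vkfft_forward", &vkfft_forward_cpu);
            |     }
            |     TORCH_LIBRARY_IMPL(myops, CUDA, m) {
            |         m.impl("vkfft_forward", &vkfft_forward_cuda);
            |     }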
        
         | uoaei wrote:
         | What?
         | 
         | https://pytorch.org/docs/stable/generated/torch.fft.fftn.htm...
        
           | earthnail wrote:
            | But not backed by VkFFT. The implication of the comment
            | is that FFTs on various backends would be easier if they
            | were implemented on top of VkFFT in the first place. Not
            | sure that's true, though, as I don't know how much code
            | the various backends share.
           | 
           | Edit: as an example that I experienced first-hand,
           | coremltools, which converts PyTorch to CoreML models, only
           | gained FFT support very recently. It's also not really a
           | PyTorch backend but a PyTorch code converter, though, so
           | wouldn't benefit at all from PyTorch's FFT being backed by
           | VkFFT. Still, good example that one shouldn't take FFTs for
           | granted.
        
           | Gimpei wrote:
            | Of course, but cuFFT is half the speed of VkFFT.
        
       | taminka wrote:
       | * world if everyone just used vulkan *
       | 
       | * futuristic buildings and flying cars *
       | 
        | that's pretty cool, though the Vk prefix for a library that
        | is no longer Vulkan-only is kind of confusing
        
         | andrewmcwatters wrote:
          | Vulkan is a terrible standard. And by relation, DirectX now
          | is too, considering they're identical. The amount of
          | absolutely worthless boilerplate is through the roof.
         | 
         | Here's a compact(!) implementation of "Hello, Triangle!"
         | https://github.com/Planimeter/game-engine-3d/blob/97298715b2...
         | 
         | 921 lines of doing nothing.
        
         | ajb wrote:
          | It was originally Vulkan-only; maybe they should have
          | renamed it.
        
           | Archit3ch wrote:
           | I propose Fastest Fourier Transform in the South by Southwest
           | (FFTSXSW).
        
             | hoosieree wrote:
              | The way it works now, with run-time codegen, it should
              | be Fast Fourier Transform Fixed That For You (FFTFTFY),
              | which also somewhat evokes a butterfly diagram.
        
       | neverrroot wrote:
        | Make AMD a first-class citizen.
        
         | throwaway073123 wrote:
          | AMD needs to make itself a first-class citizen.
        
           | ineedtocall wrote:
           | Did NVIDIA's ascent to a 1T company result primarily from
           | their substantial software investments, or is there another
           | element that AMD needs to focus on to achieve similar
           | recognition and adoption in the realm of GPU compute?
        
             | imtringued wrote:
              | It's mostly software. Nobody gives a damn about your AI
              | chip even if it is better than what Nvidia has, and
              | AMD's hardware is no slouch.
             | 
             | The other factor is that AMD's data center hardware is not
             | available at any cloud provider so nobody even has access
             | to the supposedly supported hardware.
        
             | tinpotpotata wrote:
              | NVIDIA let you run CUDA on consumer GPUs and AMD didn't
              | let you run ROCm on consumer GPUs. Big mistake.
        
       ___________________________________________________________________
       (page generated 2023-08-02 23:01 UTC)