[HN Gopher] Generic GPU Kernels
       ___________________________________________________________________
        
       Generic GPU Kernels
        
       Author : spekcular
       Score  : 44 points
       Date   : 2021-12-06 18:08 UTC (4 hours ago)
        
 (HTM) web link (mikeinnes.github.io)
 (TXT) w3m dump (mikeinnes.github.io)
        
       | futharkshill wrote:
        | I cannot overstate the importance of using a programming language
        | that targets GPUs directly, like Futhark (https://github.com/diku-
        | dk/futhark). In this case, it is a functional, declarative
        | language where you can focus on the what, not the how. Just as
        | with CPUs, GPUs are incredibly complex, so higher-level
        | abstractions are very important.
       | 
        | If you were a pro GPU programmer and had 10 years to spend,
        | Futhark would maybe be 10x slower than what you could write by
        | hand. But just as we do not program in assembly even when making
        | performance-critical software, most non-trivial things are easier
        | to write in a language like this.
        
         | joe_the_user wrote:
         | Interesting.
         | 
          | How fast can Futhark be compared to a standard CUDA loop with a
          | few arithmetic, load, and store operations? Basically, suppose
          | you're doing simple gathering and scattering?
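          | 
          | For concreteness, the kind of baseline kernel in question might
          | look roughly like this (a sketch; the index arrays and the
          | arithmetic are made up):
          | 
          |     __global__ void gather_scale_scatter(
          |         const float* in, float* out,
          |         const int* gather_idx, const int* scatter_idx, int n)
          |     {
          |         int i = blockIdx.x * blockDim.x + threadIdx.x;
          |         if (i < n) {
          |             // load via a gather index, do a little arithmetic,
          |             // store via a scatter index
          |             float x = in[gather_idx[i]];
          |             out[scatter_idx[i]] = 2.0f * x + 1.0f;
          |         }
          |     }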
        
       | mandelken wrote:
       | Should have (2017) in the title.
       | 
        | Indeed cool to program Julia directly on the GPU; Julia on GPUs
        | has evolved further since then, see https://juliagpu.org/
        
       | vchuravy wrote:
        | One important note is that the blog post is quite old.
        | CUDAnative.jl and CUDAdrv.jl have since been folded into
        | https://github.com/JuliaGPU/CUDA.jl
        
       | chmod775 wrote:
       | Cool stuff. Compared to the choices and tools we have for
       | programming for CPUs, programming for GPUs is a really lackluster
       | experience.
        
       | dragontamer wrote:
        | The main problem with GPUs is that most of the performance tricks
        | that make them fast are extremely detailed.
       | 
        | Even 10 years ago, back in the OpenCL 1.0 days or just as CUDA
        | was getting started, you'd need to use __shared__ memory to
        | really benefit from GPU coding.
       | 
       | Over the past 10 years, GPU programming has become even more
       | detailed: permute / bpermute, warp/wavefront-level voting (vote /
       | ballot operations), warp/wavefront-level coordination
       | (__activemask()), and of course most recently: BFloat16 matrix
       | multiplication / sparse matrix multiplication.
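        | 
        | As a rough sketch of what that looks like in practice, here's a
        | warp-level partial sum and nonzero count using shuffles,
        | __ballot_sync and __activemask() (illustrative only; it assumes
        | the block size is a multiple of 32):
        | 
        |     __global__ void warp_stats(const float* x, float* sum,
        |                                int* nonzeros, int n) {
        |         int i = blockIdx.x * blockDim.x + threadIdx.x;
        |         float v = (i < n) ? x[i] : 0.0f;
        |         // shuffle-based reduction: lane 0 ends up holding the
        |         // sum over its warp (full-warp mask assumed)
        |         for (int offset = 16; offset > 0; offset /= 2)
        |             v += __shfl_down_sync(0xffffffff, v, offset);
        |         // ballot: one bit per lane whose element is nonzero
        |         unsigned mask = __ballot_sync(__activemask(),
        |                                       i < n && x[i] != 0.0f);
        |         if ((threadIdx.x % 32) == 0) {
        |             atomicAdd(sum, v);
        |             atomicAdd(nonzeros, __popc(mask));
        |         }
        |     }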
       | 
       | ---------
       | 
       | There's some degree of abstraction building going on inside the
       | CUDA-world. You can use cooperative groups to abstract a lot of
       | the wavefront-level stuff away, but not entirely yet. As such,
       | even if you use cooperative groups, you end up needing a mental
       | model of why these operations are efficient before you really
       | benefit.
       | 
       | CUDA cub abstracts a lot of operations away, removing the tedious
       | exercise of making your own prefix sum / scan operation whenever
       | you dip down to this level... but CUDA cub still requires the
       | programmer to learn these tricks before cub is useful.
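        | 
        | For instance, a block-wide prefix sum with cub is roughly one
        | call (a sketch), but you still need to understand why the shared
        | temp storage and the block shape matter:
        | 
        |     #include <cub/cub.cuh>
        | 
        |     template <int BLOCK_THREADS>
        |     __global__ void prefix_sum(const int* in, int* out, int n) {
        |         using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
        |         __shared__ typename BlockScan::TempStorage temp;
        | 
        |         int i = blockIdx.x * BLOCK_THREADS + threadIdx.x;
        |         int v = (i < n) ? in[i] : 0;
        |         // one call hides the warp-level machinery, but shared
        |         // memory and block size are still your problem
        |         BlockScan(temp).ExclusiveSum(v, v);
        |         if (i < n) out[i] = v;
        |     }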
       | 
       | ---------
       | 
        | What we really want is a programming interface where "CPU"
        | programmers can add custom code to GPU kernels that adds
        | flexibility and loses very little speed... but that doesn't
        | require CPU programmers to spend lots of time learning GPU-
        | programming tricks on their own.
        | 
        | And such an interface doesn't exist yet in any form, outside of
        | something like Tensorflow (except that it only works for neural
        | net programming, not for general purpose programming).
       | 
       | -------
       | 
        | The speed thing is very important. C++ AMP, by Microsoft, was a
        | pretty decent interface by 2012 standards (competitive with CUDA
        | and OpenCL at the time), but it was criticized as being much
        | slower than OpenCL/CUDA, so the programming community stopped
        | using it.
        | 
        | If you're writing GPU code that's slower than OpenCL / CUDA, then
        | people will simply ask: why aren't you using OpenCL/CUDA ???
        | 
        | It's hard to sell "programmer convenience" and "simpler thinking
        | process" to your boss.
        
         | corysama wrote:
          | Unfortunately, I don't see a "just a bit of magic here without
          | learning much of anything new" interface coming, because it's
          | all about strategizing the movement of data. This is not unique
          | to GPUs; it's a universal problem across computing hardware.
          | It's just made explicit in OpenCL/CUDA, as compared to most
          | languages, where you indirectly try to steer things the right
          | way and the CPU does its best with whatever mess it gets.
         | 
          | The closest I know of is https://halide-lang.org/, and that is
          | specialized around images.
        
           | dragontamer wrote:
            | Both C++ AMP and Tensorflow abstracted away the movement of
            | data behind relatively easy-to-use interfaces.
           | 
           | C++ AMP in particular was a generic interface behind a
           | template<> class: Array<> and ArrayView<>.
           | 
            | Array<> always existed on GPU-side (be it GPU#0 or GPU#1),
            | while ArrayView<> was your abstraction to read/write into
            | that Array<>.
            | 
            |     Array<SomeClass> foo = createSomeArray();
            |     ArrayView<SomeClass> bar = foo;
           | 
            | This could be CPU-side or GPU-side code (foo could be on
            | GPU#0, while bar could be on GPU#1). While you can use C++
            | AMP code to manually manage when the copies happen, the idea
            | is to simplify the logic down so that CPU programmers didn't
            | have to think as hard.
           | 
           | In this case, "bar" is bound to foo, and bar will remain in
           | sync with foo through async programming behind the scenes.
           | bar[25] = SomeClass(baz); will change bar in this context,
           | but the Array<> foo will only update at the designated
           | synchronization points (similar to CUDA Streams).
           | 
           | ------
           | 
            | So it's a lot easier than reading/writing cudaMemcpy() in the
            | right spots, in the right order, in the right CUDA Streams.
            | But probably less efficient, and therefore going to be less
            | popular than using CUDA directly.
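            | 
            | For comparison, the manual CUDA choreography being abstracted
            | away is roughly this (a sketch; the kernel is a placeholder
            | and error checking is omitted):
            | 
            |     #include <cuda_runtime.h>
            | 
            |     // placeholder kernel standing in for the real work
            |     __global__ void some_kernel(const float* in, float* out,
            |                                 int n) {
            |         int i = blockIdx.x * blockDim.x + threadIdx.x;
            |         if (i < n) out[i] = in[i] * 2.0f;
            |     }
            | 
            |     void run_on_gpu(const float* h_in, float* h_out,
            |                     int n) {
            |         size_t bytes = n * sizeof(float);
            |         float *d_in, *d_out;
            |         cudaMalloc((void**)&d_in, bytes);
            |         cudaMalloc((void**)&d_out, bytes);
            | 
            |         cudaStream_t s;
            |         cudaStreamCreate(&s);
            |         // copies and the launch are ordered on one stream,
            |         // asynchronously with respect to the host
            |         cudaMemcpyAsync(d_in, h_in, bytes,
            |                         cudaMemcpyHostToDevice, s);
            |         some_kernel<<<(n + 255) / 256, 256, 0, s>>>(
            |             d_in, d_out, n);
            |         cudaMemcpyAsync(h_out, d_out, bytes,
            |                         cudaMemcpyDeviceToHost, s);
            |         // the synchronization point a view abstraction
            |         // would pick for you
            |         cudaStreamSynchronize(s);
            | 
            |         cudaStreamDestroy(s);
            |         cudaFree(d_in);
            |         cudaFree(d_out);
            |     }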
        
             | my123 wrote:
             | > and therefore going to be less popular than using CUDA
             | directly.
             | 
              | The current ambition of NVIDIA is to make it as painless
              | as possible, by relying on shared virtual memory much more
              | than before.
             | 
             | https://www.olcf.ornl.gov/wp-
             | content/uploads/2021/06/OLCF_Us...
             | 
              | This comes with OpenMP (also supported in AMD's ROCm AOMP)
              | and C++17 standard parallelism support. Legate with
              | cuNumeric goes in another direction by implementing a
              | NumPy-compatible interface, but GPU-accelerated.
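              | 
              | A minimal sketch of what that shared-virtual-memory route
              | looks like with plain CUDA managed memory (assuming the
              | usual cudaMallocManaged path; illustrative only):
              | 
              |     #include <cuda_runtime.h>
              | 
              |     __global__ void scale(float* x, int n) {
              |         int i = blockIdx.x * blockDim.x + threadIdx.x;
              |         if (i < n) x[i] *= 2.0f;
              |     }
              | 
              |     int main() {
              |         int n = 1 << 20;
              |         float* x;
              |         // one pointer visible to both CPU and GPU; the
              |         // driver migrates pages on demand
              |         cudaMallocManaged(&x, n * sizeof(float));
              |         for (int i = 0; i < n; i++) x[i] = 1.0f;
              |         scale<<<(n + 255) / 256, 256>>>(x, n);
              |         cudaDeviceSynchronize();
              |         cudaFree(x);
              |     }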
             | 
              | They are also moving away from the limits of a SIMD
              | abstraction, with one instruction pointer per thread for
              | the sake of guaranteed forward progress (which changes the
              | programming model needed for the code to be functional).
             | 
              | The underlying machine is still implemented as SIMT, so
              | people should be aware of the associated performance
              | pitfalls. However, this allows for better compatibility and
              | more code sharing than was possible before.
             | 
              | Microsoft's problem with C++ AMP was that FXIL, as present
              | in D3D11, was just too limited and had too many performance
              | pitfalls, I'm afraid. If only it had gotten to see the
              | light of day with D3D12 + DXIL later...
        
           | convolvatron wrote:
            | There were the '*' languages on the Connection Machine. In
            | C*, for example, 'shapes', which were cartesian arrays of
            | values, were part of the type system.
           | 
            | Communication operations were explicit through the use of
            | indirection on these indices. I doubt it's the case on a
            | modern GPU, but aside from the router, performance on the CM
            | was deterministic, so you really could understand what you
            | were going to get from the source.
           | 
            | Completely agree with you: general parallel programming
            | languages are a thing, and really pretty valuable for trying
            | to exploit all that machinery. Chapel is the only recent work
            | I'm familiar with.
        
       ___________________________________________________________________
       (page generated 2021-12-06 23:01 UTC)