[HN Gopher] Generic GPU Kernels
___________________________________________________________________
Generic GPU Kernels
Author : spekcular
Score : 44 points
Date : 2021-12-06 18:08 UTC (4 hours ago)
(HTM) web link (mikeinnes.github.io)
(TXT) w3m dump (mikeinnes.github.io)
| futharkshill wrote:
| I cannot overstate the importance of using a programming language
| targeting GPUs directly, like Futhark
| (https://github.com/diku-dk/futhark). It is a functional,
| declarative language where you focus on the what, not the how.
| GPUs, like CPUs, are incredibly complex, so higher-level
| abstractions are very important.
|
| If you were a pro GPU programmer with ten years to spend, your
| hand-tuned code might be maybe 10x faster than Futhark. But just
| as we don't write critically fast software in assembly anymore,
| most non-trivial things are easier to write in a language like
| this.
| joe_the_user wrote:
| Interesting.
|
| How fast is Futhark compared to a standard CUDA loop with a few
| arithmetic, load, and store operations? Basically, suppose
| you're doing simple gathering and scattering?
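|
| For reference, the baseline I have in mind is something like
| this minimal sketch (hypothetical names; a grid-stride loop that
| gathers, does a little arithmetic, and scatters):
|
|   // Gather (indexed load), multiply, scatter (indexed store).
|   __global__ void gather_scale_scatter(float *out, const float *in,
|                                        const int *src, const int *dst,
|                                        float a, int n) {
|       for (int i = blockIdx.x * blockDim.x + threadIdx.x;
|            i < n; i += blockDim.x * gridDim.x) {
|           out[dst[i]] = a * in[src[i]];
|       }
|   }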
| mandelken wrote:
| Should have (2017) in the title.
|
| It is indeed cool to program the GPU directly from Julia, and
| the Julia GPU ecosystem has evolved further since then; see
| https://juliagpu.org/
| vchuravy wrote:
| One important note is that the blog post is quite old.
| CUDAnative.jl and CUDAdrv.jl got folded into
| https://github.com/JuliaGPU/CUDA.jl
| chmod775 wrote:
| Cool stuff. Compared to the choices and tools we have for
| programming CPUs, programming GPUs is a really lackluster
| experience.
| dragontamer wrote:
| The main problem with GPUs is that most of the performance
| tricks that make them fast are extremely low-level and detailed.
|
| Even 10 years ago, back in OpenCL 1.0 days or just as CUDA was
| getting started, you'd need to use __shared__ memory to really
| benefit from GPU coding.
|
| Over the past 10 years, GPU programming has become even more
| detailed: permute / bpermute, warp/wavefront-level voting (vote /
| ballot operations), warp/wavefront-level coordination
| (__activemask()), and of course most recently: BFloat16 matrix
| multiplication / sparse matrix multiplication.
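|
| To make "detailed" concrete, a warp-level sum with the shuffle
| intrinsics looks roughly like this (a sketch assuming a full,
| converged warp; with divergence you'd reach for __activemask()
| or __ballot_sync(), carefully):
|
|   // Each step halves the number of participating lanes;
|   // after 5 steps, lane 0 holds the warp's total.
|   __device__ int warp_sum(int v) {
|       for (int offset = 16; offset > 0; offset >>= 1)
|           v += __shfl_down_sync(0xffffffffu, v, offset);
|       return v;
|   }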
|
| ---------
|
| There's some degree of abstraction building going on inside the
| CUDA-world. You can use cooperative groups to abstract a lot of
| the wavefront-level stuff away, but not entirely yet. As such,
| even if you use cooperative groups, you end up needing a mental
| model of why these operations are efficient before you really
| benefit.
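|
| For instance, the same warp-level reduction through cooperative
| groups looks something like this sketch; the tile hides the raw
| intrinsics, but you still need to know why a 32-wide tile is the
| fast path:
|
|   #include <cooperative_groups.h>
|   namespace cg = cooperative_groups;
|
|   // Reduce within a 32-thread tile of the current block.
|   __device__ int tile_sum(int v) {
|       cg::thread_block block = cg::this_thread_block();
|       cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
|       for (int offset = tile.size() / 2; offset > 0; offset /= 2)
|           v += tile.shfl_down(v, offset);
|       return v;
|   }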
|
| CUDA cub abstracts a lot of operations away, removing the tedious
| exercise of making your own prefix sum / scan operation whenever
| you dip down to this level... but CUDA cub still requires the
| programmer to learn these tricks before cub is useful.
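|
| A block-wide prefix sum via cub is roughly this sketch (you
| still have to understand the shared-memory temp storage and the
| one-item-per-thread layout to use it well):
|
|   #include <cub/cub.cuh>
|
|   template <int BLOCK_THREADS>
|   __global__ void block_scan(int *d_data) {
|       using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
|       // cub asks the caller to provide its shared scratch space.
|       __shared__ typename BlockScan::TempStorage temp;
|       int idx = blockIdx.x * BLOCK_THREADS + threadIdx.x;
|       int item = d_data[idx], result;
|       BlockScan(temp).ExclusiveSum(item, result);
|       d_data[idx] = result;
|   }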
|
| ---------
|
| What we really want is a programming interface where "CPU"
| programmers can add custom code to GPU kernels, gaining
| flexibility while losing very little speed... but without
| requiring them to spend lots of time learning GPU-programming
| tricks on their own.
|
| And such an interface doesn't exist yet in any form, outside of,
| like... Tensorflow (except that only works for neural net
| programming, and not for general purpose programming).
|
| -------
|
| The speed thing is very important. C++ AMP, by Microsoft, was a
| pretty decent interface by 2012 standards (competitive with CUDA
| and OpenCL at the time), but it was criticized as much slower
| than OpenCL/CUDA, so the programming community stopped using it.
|
| If you're writing GPU code that's slower than OpenCL / CUDA,
| people will simply ask: why aren't you using OpenCL/CUDA?
|
| It's hard to sell "programmer convenience" and "a simpler
| thinking process" to your boss.
| corysama wrote:
| Unfortunately, I don't see a "just a bit of magic here, without
| learning much of anything new" interface coming, because it's
| all about strategizing the movement of data. This is not unique
| to GPUs; it's a universal problem across computing hardware.
| OpenCL/CUDA just let you be explicit about it, as compared to
| most languages, where you indirectly try to steer things the
| right way and the CPU does its best with whatever mess it gets.
|
| The closest thing I know of is https://halide-lang.org/ and that
| is specialized around images.
| dragontamer wrote:
| Both C++ AMP and Tensorflow abstracted away the movement of
| data behind relatively easy-to-use interfaces.
|
| C++ AMP in particular was a generic interface built on two
| template classes: Array<> and ArrayView<>.
|
| Array<> always lived on the GPU side (be it GPU#0 or GPU#1),
| while ArrayView<> was your abstraction for reading and writing
| into that Array<>:
|
|   Array<SomeClass> foo = createSomeArray();
|   ArrayView<SomeClass> bar = foo;
|
| This could be CPU-side or GPU-side code (foo could be on GPU#0,
| while bar could be on GPU#1). While you can use C++ AMP to
| manually manage when the copies happen, the idea is to simplify
| the logic so that CPU programmers don't have to think as hard.
|
| In this case, "bar" is bound to foo, and bar remains in sync
| with foo through async programming behind the scenes. Writing
|
|   bar[25] = SomeClass(baz);
|
| changes bar in this context, but the Array<> foo only updates at
| the designated synchronization points (similar to CUDA Streams).
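|
| (The shipped API spells these concurrency::array<> and
| array_view<>; a minimal end-to-end sketch of the pattern:)
|
|   #include <amp.h>
|   #include <vector>
|   using namespace concurrency;
|
|   void scale(std::vector<float>& v, float a) {
|       array_view<float, 1> av((int)v.size(), v); // wraps host data
|       parallel_for_each(av.extent,
|           [=](index<1> i) restrict(amp) { av[i] *= a; });
|       av.synchronize(); // designated sync point: copy back now
|   }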
|
| ------
|
| So it's a lot easier than placing cudaMemcpy() calls in the
| right spots, in the right order, in the right CUDA Streams. But
| it's probably less efficient, and therefore going to be less
| popular than using CUDA directly.
| my123 wrote:
| > and therefore going to be less popular than using CUDA
| directly.
|
| NVIDIA's current ambition is to make it as painless as possible,
| by relying on shared virtual memory much more than before.
|
| https://www.olcf.ornl.gov/wp-
| content/uploads/2021/06/OLCF_Us...
|
| This comes with OpenMP (also supported in AMD's ROCm AOMP) and
| C++17 standard parallelism support. Legate with cuNumeric goes
| in another direction, implementing a numpy-compatible interface
| that is GPU-accelerated.
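|
| For example, with nvc++ -stdpar, plain C++17 like this sketch is
| offloaded to the GPU (the lambda becomes a kernel and the
| vectors live in unified memory), while the same code still
| builds for CPUs:
|
|   #include <algorithm>
|   #include <execution>
|   #include <vector>
|
|   int main() {
|       std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
|       std::transform(std::execution::par_unseq,
|                      x.begin(), x.end(), y.begin(), y.begin(),
|                      [](float a, float b) { return 2.0f * a + b; });
|   }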
|
| NV has also moved away from the limits of a pure SIMD
| abstraction by giving each thread its own instruction pointer,
| for the sake of guaranteed forward progress (which changes the
| programming model needed for the code to be functional).
|
| The underlying machine is still implemented as SIMT, so people
| should still be aware of the associated performance pitfalls.
| However, this allows for better compatibility and more code
| sharing than before.
|
| Microsoft's problem with C++ AMP was that FXIL, as present in
| D3D11, was just too limited and had too many performance
| pitfalls, I'm afraid. If only it had gotten to see the light of
| day later, with D3D12 + DXIL...
| convolvatron wrote:
| There were the '*' languages on the Connection Machine. In C*,
| for example, 'shapes', which were Cartesian arrays of values,
| were part of the type system.
|
| Communications operations were explicit through the use of
| indirection on these indices. I doubt it's the case on a modern
| GPU, but aside from the router, performance on the CM was
| deterministic, so you really could understand what you were
| going to get from the source.
|
| I completely agree with you: general parallel programming
| languages are a thing, and really pretty valuable for trying to
| exploit all that machinery. Chapel is the only recent work I'm
| familiar with.
___________________________________________________________________
(page generated 2021-12-06 23:01 UTC)