[HN Gopher] Rust-CUDA: Extremely fast GPU code fully in Rust
___________________________________________________________________
Rust-CUDA: Extremely fast GPU code fully in Rust
Author : amadeusine
Score : 93 points
Date : 2021-11-22 21:22 UTC (1 hour ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dragontamer wrote:
| https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...
|
| * Missing Atomics -- Gamebreaker IMO. Atomics are absolutely
| essential when you are dealing with 10,000+ threads on a regular
| basis. You'll inevitably come across a shared data-structure that
| requires write-access from each thread, and some coordination
| mechanism is needed for that. Atomics are one important tool for
| that (see the sketch after this list).
|
| Ironically, a few days ago I argued for the use of fork-join
| parallelism in most cases (aka: kernel launch / synchronized
| kernel exits). I find myself arguing the opposite now that we
| have a topic here with missing atomics. Like... atomics need to
| be used very, very rarely, but those rare uses are incredibly
| important.
|
| * Warp Vote / Match / Reduce / Shuffle missing (Very useful tools
| for highly-optimized code, but you can write slower code that
| does the same thing through __shared__ memory just fine)
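|
| To make the atomics point concrete, here is roughly the kind of
| kernel I mean, in plain CUDA C++ (an illustrative sketch, not
| this project's API; assumes bins[] covers every input value):
|
|     // Thousands of threads may increment the same bin at once.
|     // Without an atomic read-modify-write, this is a data race.
|     __global__ void histogram(const int *input, int *bins, int n) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n)
|             atomicAdd(&bins[input[i]], 1);
|     }
|
| There's no reasonable way to write that kernel without hardware
| atomics.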
|
| ------
|
| Wait, does this support __shared__ memory at all? Raw access to
| memory is not really amenable to Rust's programming style, but
| it's absolutely necessary for high-performance GPU programming.
|
| If this is missing __shared__ memory concepts, then the entire
| idea of "efficient GPU code" is dead IMO. GPU threads can only
| communicate quickly over __shared__ memory within an OpenCL
| Workgroup / CUDA Block (a Workgroup or Block is roughly a
| grouping of up to 1024 "threads" or SIMD-lanes).
|
| All other forms of communication are incredibly slow. Atomics are
| maybe the next-fastest form of communication, but only across
| __shared__ memory. Relaxed atomics to global memory are
| reasonably performant, but once you need either seq-cst or
| acquire/release semantics (aka: the right memory barriers in the
| right places), things slow down dramatically in GPU-land.
|
| The big issue is that __shared__ memory is only 64kB in size, and
| it's locked down to workgroups / blocks. On NVidia GPUs,
| __shared__ memory also "eats" into your L1 cache (in fact,
| __shared__ memory can be thought of as programmer-managed cache:
| the caching heuristics just aren't good enough for high-
| performance GPU programmers, so they manage that high-speed
| memory manually for maximum performance).
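|
| The canonical example of manually managing that memory is a
| block-level reduction staged through __shared__ memory. A rough
| sketch in plain CUDA C++ (assuming 256-thread, power-of-two-sized
| blocks):
|
|     __global__ void block_sum(const float *in, float *out, int n) {
|         __shared__ float tile[256];  // on-core scratchpad, per block
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
|         __syncthreads();             // whole block sees the stores
|         for (int s = blockDim.x / 2; s > 0; s >>= 1) {
|             if (threadIdx.x < s)
|                 tile[threadIdx.x] += tile[threadIdx.x + s];
|             __syncthreads();
|         }
|         if (threadIdx.x == 0)
|             out[blockIdx.x] = tile[0];
|     }
|
| Every load after the first hits the on-core tile rather than
| global memory; that's the entire trick.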
| nynx wrote:
| Looks like they just haven't gotten around to it. Rust has fine
| language support for atomics and shared memory.
| dragontamer wrote:
| You're lucky I've had this discussion before with other Rust
| programmers. But I forgot about this issue...
|
| CUDA __shared__ memory is a 64kB block of RAM that is located
| __INSIDE THE CORE__ and is incredibly fast, but has very
| peculiar semantics. (Since the memory never "leaves" the
| core, it's "stuck" serving only a small set of 1024 "threads".
| That's difficult to manage when you write 30,000+ thread
| programs, but the speed benefits are insane and well worth the
| trouble.)
|
| Rust "shared" memory is thread-to-thread communications that
| simply doesn't exist in the GPU-world.
|
| -------
|
| Maybe it'd be more appropriate if I used OpenCL terminology
| instead of CUDA terminology here, because Rust programmers
| have an unfortunate name conflict. In OpenCL, this 64kB on-
| core buffer is called __local memory. Does that help?
|
| Whenever I said "__shared__", replace that in your mind with
| __local instead. Because CUDA __shared__ is NOTHING like
| Rust-Shared.
| nynx wrote:
| That sounds a little complicated to deal with, but I see no
| reason why either the Rust atomic types or a new type
| supplied by the rust-cuda library couldn't handle that just
| fine.
|
| I just want to make sure that you realize that Rust does
| have regular atomics (and that's how other shared memory
| abstractions are generally implemented underneath).
| dragontamer wrote:
| EDIT: Okay, I'm switching to OpenCL terminology. I think
| this __shared__ thing is getting confusing for ya.
|
| ----------
|
| The semantics of __local memory variables are very
| peculiar, and are prone to memory ordering issues unless
| coordinated with OpenCL's barrier() function calls.
|
| That means the compiler needs to be careful about
| optimizations and orderings. The compiler's understanding
| of allowable variable orders must match the programmer's
| understanding.
|
| __local variables can be of any type. In CUDA, you can
| make new structs or even C++ classes live in __shared__
| memory. In OpenCL, you can make *arbitrary* structs be
| __local. It's entirely freeform use of memory, albeit tiny
| at 64kB in size.
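|
| In CUDA terms, the pattern looks roughly like this (a sketch
| only, assuming a 256-thread block; the struct and names are
| made up for illustration, and bounds checks are omitted):
|
|     struct TileHeader { int count; float scale; };
|
|     __global__ void demo(const float *in, float *out) {
|         __shared__ TileHeader hdr;     // arbitrary struct, on-core
|         __shared__ float scratch[256];
|         if (threadIdx.x == 0) {
|             hdr.count = 256;
|             hdr.scale = 0.5f;
|         }
|         __syncthreads();   // without this barrier, other threads
|                            // may read a stale or half-written hdr
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         scratch[threadIdx.x] = in[i] * hdr.scale;
|         __syncthreads();
|         out[i] = scratch[(threadIdx.x + 1) % hdr.count];
|     }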
|
| ---------
|
| The proper implementation of __local semantics will
| require compiler support. This isn't something you can
| just tack on with a type system. The implications of
| __local reverberate not only through the type system, but
| also through optimization and the fundamental
| understanding of how code gets compiled (synchronization
| and memory orderings).
|
| --------
|
| > I just want to make sure that you realize that Rust
| does have regular atomics (and that's how other shared
| memory abstractions are generally implemented
| underneath).
|
| And __local memory is like nothing you've ever seen
| before, unless maybe you're a realtime programmer working
| on one of those weird CPUs with manually-managed L1.
|
| Because that's what it is: __local memory is a manually
| managed high-speed RAM. While typical CPU code relies
| upon the hardware to manage the L1 cache for you, GPU
| programmers regularly manage that region of memory
| *manually*.
|
| It's not easy, but it's the only way to reach the highest
| levels of performance.
| rdambrosio wrote:
| As I mentioned, it is an early project; just making the
| simplest kernel compile was very difficult. Atomics and shared
| memory are great, but both are very difficult. Atomics need
| "proper" atomics (i.e. special instructions on sm_70+ and
| emulation on <sm_70), and shared mem needs some weird codegen
| support. I will get to both of them. Nevertheless, noalias
| _does_ cause significant speedups in memory-bound kernels,
| see this blogpost:
| https://developer.nvidia.com/blog/cuda-pro-tip-optimize-poin...
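|
| For anyone curious, this is roughly the CUDA C++ equivalent of
| what rustc's noalias gives you for free from &/&mut references
| (a sketch; the blog post above has actual numbers):
|
|     __global__ void smooth(int n, const float * __restrict__ in,
|                            float * __restrict__ out) {
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i > 0 && i < n - 1)
|             // With __restrict__ the compiler knows stores to
|             // out[] cannot change in[], so the three loads can
|             // stay in registers and be reordered freely.
|             out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
|     }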
|
| So please do not be surprised that an early project does not
| contain every single feature of CUDA, something that's been
| around for well over a decade.
| dragontamer wrote:
| No problem. I understand it's a work in progress.
|
| I'd push most strongly for CUDA __shared__ support first,
| along with thread-barriers (CUDA's __syncthreads()), followed
| by __shared__ atomics, and finally global atomics + the
| associated memory-barrier stuff (ex: seq-cst and
| acquire/release atomics would work but may be a bit difficult;
| it might be easier to support the older-style memory barrier
| instead?).
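|
| For what it's worth, that ordering is exactly what the classic
| privatized histogram needs, all in one kernel (a rough CUDA C++
| sketch, assuming 256 bins and 256-thread blocks):
|
|     __global__ void histogram256(const unsigned char *in,
|                                  int *bins, int n) {
|         __shared__ int local_bins[256];  // needs __shared__ support
|         local_bins[threadIdx.x] = 0;
|         __syncthreads();                 // needs thread-barriers
|         for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
|              i += gridDim.x * blockDim.x)
|             atomicAdd(&local_bins[in[i]], 1); // __shared__ atomic (fast)
|         __syncthreads();
|         atomicAdd(&bins[threadIdx.x],         // global atomic (slower)
|                   local_bins[threadIdx.x]);
|     }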
| zozbot234 wrote:
| How would this compare with Accel (which is also built on the
| CUDA ecosystem)?
| rdambrosio wrote:
| Accel uses the LLVM PTX backend, which is unusable for serious
| projects and doesn't work on Windows; I have more about it here:
| https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...
| sxp wrote:
| > "Extremely fast"
|
| When people make claims like this, it would be good if they put
| the benchmarks on the first page. E.g., how does it compare with
| https://github.com/gfx-rs/wgpu which lets you target Vulkan,
| Metal, DX, GL or WASM+WebGPU with Rust?
| smoldesu wrote:
| I hope this doesn't come off as handwaving, but you're kinda
| comparing apples to oranges here. CUDA has always been in a
| class of its own when it comes to GPU compute, for better and
| for worse, so I think the people out there who want to use this
| will pretty quickly get an idea of who it's for. Benchmarks
| would be nice, but I don't really think they'd be germane when
| comparing a proprietary compute system with a generic cross-
| platform GPU binding.
| outworlder wrote:
| Would it have made any difference had the parent mentioned
| OpenCL?
| joe_guy wrote:
| If it's not relative to anything, then the word "fast"
| doesn't have much meaning.
| seeekr wrote:
| "Fast", to me, from a software development perspective can
| still be meaningful, in the sense of knowing what
| techniques, patterns, paths, ... enable performant
| execution, and providing easy and straightforward paths for
| the user along those. Which, ultimately, leads to high
| performance in most ways the user will apply the provided
| framework (in a more general sense of the word). Hope that
| makes sense.
|
| And it must be OK to claim "fast" as a goal from the early
| stages of a project, even before it is possible to create
| any meaningful benchmarks, as long as the project's precise
| stage of development or maturity is discernible to the
| intended audience. Which, I believe, the project in question
| is communicating just fine ("still in early development").
| nynx wrote:
| wgpu is a library for running WGSL on GPUs, not Rust.
| zozbot234 wrote:
| https://github.com/embarkstudios/rust-gpu would be the
| closest equivalent, AFAICT.
| [deleted]
| gaze wrote:
| The writing is better than I might have produced as a first-year
| college student, but this needs copy editing. I might suggest
| that the word "extremely" be removed not only from this
| repository but from all of the user's repositories. "Fast" might
| be removed as well; nobody is trying to generate slow GPU code.
| The salient feature is that one can write the same code as one
| might write in CUDA with the advantages of Rust's type system
| (which is indeed useful!). However, there's no speed to be
| gained by using this relative to CUDA.
| jacquesm wrote:
| But there is relative to plain Rust. Besides, even if nobody is
| trying to write slow GPU code, it's a very easy thing to get
| subtly wrong, resulting in tens of percent of speed loss.
| imron wrote:
| > I might suggest...
|
| Best way to suggest is with a pull request.
| ajkjk wrote:
| Not if you're critiquing the way someone writes English.
| imron wrote:
| Depends how it's worded.
|
| If someone with copy-editing experience made constructive
| suggestions for the readme and other documentation, it might
| well be appreciated.
|
| A PR also gives the maintainers a way to discuss wording if
| there are points of disagreement - far more so than a post
| on HN that they may or may not even see.
| russdpale wrote:
| very cool!
___________________________________________________________________
(page generated 2021-11-22 23:00 UTC)