[HN Gopher] Rust-CUDA: Extremely fast GPU code fully in Rust
       ___________________________________________________________________
        
       Rust-CUDA: Extremely fast GPU code fully in Rust
        
       Author : amadeusine
       Score  : 93 points
       Date   : 2021-11-22 21:22 UTC (1 hour ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | dragontamer wrote:
       | https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...
       | 
        | * Missing Atomics -- Game-breaker IMO. Atomics are
        | absolutely essential when you are dealing with 10,000+
        | threads on a regular basis. You'll inevitably come across
        | a shared data structure that requires write access from
        | every thread, and some coordination mechanism is needed
        | for that. Atomics are one important tool for the job (see
        | the sketch below).
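        | 
        | A minimal sketch of the kind of coordination I mean
        | (hypothetical CUDA kernel, not from this repo): stream
        | compaction, where atomicAdd hands each thread a unique
        | output slot with no other coordination:
        | 
        |       __global__ void compact(const float* in, float* out,
        |                               int* count, int n)
        |       {
        |           int i = blockIdx.x * blockDim.x + threadIdx.x;
        |           if (i < n && in[i] > 0.0f) {
        |               // reserve a unique output slot
        |               int slot = atomicAdd(count, 1);
        |               out[slot] = in[i];
        |           }
        |       }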
       | 
        | Ironically, a few days ago I argued for the use of fork-
        | join parallelism in most cases (aka: kernel launch /
        | synchronized kernel exits). Now I find myself arguing the
        | opposite, since the topic here is missing atomics.
        | Atomics should be used very, very rarely, but those rare
        | uses are incredibly important.
       | 
        | * Warp Vote / Match / Reduce / Shuffle missing (very
        | useful tools for highly optimized code, but you can write
        | slower code that does the same thing through __shared__
        | memory just fine; a warp-sum sketch follows below)
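        | 
        | For instance, the fast version of a warp-level sum uses
        | the CUDA 9+ __shfl_down_sync intrinsic and needs no
        | __shared__ staging at all. Rough sketch:
        | 
        |       __device__ float warp_sum(float v)
        |       {
        |           // tree reduction across the 32 lanes of a warp
        |           for (int offset = 16; offset > 0; offset >>= 1)
        |               v += __shfl_down_sync(0xffffffff, v, offset);
        |           return v;  // lane 0 holds the warp's total
        |       }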
       | 
       | ------
       | 
        | Wait, does this support __shared__ memory at all? Raw
        | access to memory is not really amenable to Rust's
        | programming style, but it's absolutely necessary for
        | high-performance GPU programming.
        | 
        | If this is missing __shared__ memory concepts, then the
        | entire idea of "efficient GPU code" is dead IMO. GPU
        | threads can only communicate quickly over __shared__
        | memory within an OpenCL workgroup / CUDA block (a
        | workgroup or block is a grouping of up to 1024 "threads"
        | or SIMD lanes).
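        | 
        | The fast path looks like this rough CUDA sketch: a per-
        | block sum staged through on-chip __shared__ memory
        | (assumes 256-thread blocks):
        | 
        |       __global__ void block_sum(const float* in, float* out)
        |       {
        |           __shared__ float buf[256];  // on-chip, per-block
        |           buf[threadIdx.x] = in[blockIdx.x * 256 + threadIdx.x];
        |           __syncthreads();  // make all the writes visible
        |           for (int s = 128; s > 0; s >>= 1) {
        |               if (threadIdx.x < s)
        |                   buf[threadIdx.x] += buf[threadIdx.x + s];
        |               __syncthreads();
        |           }
        |           if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
        |       }
        | 
        | Every add after the initial loads hits on-chip RAM, never
        | global memory.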
       | 
        | All other forms of communication are incredibly slow.
        | Atomics are maybe the next-fastest form of communication,
        | but only when performed on __shared__ memory. Relaxed
        | atomics to global memory are reasonably performant, but
        | once you need seq-cst or acquire/release semantics (aka:
        | the right memory barriers in the right places), things
        | slow down dramatically in GPU-land.
       | 
        | The big issue is that __shared__ memory is only 64kB in
        | size, and it's locked down to workgroups / blocks. On
        | NVIDIA GPUs, __shared__ memory also "eats" into your L1
        | cache (in fact, __shared__ memory can be thought of as
        | programmer-managed cache: the caching heuristics just
        | aren't good enough for high-performance GPU programmers,
        | who want to manage that high-speed memory manually for
        | maximum performance).
        
         | nynx wrote:
         | Looks like they just haven't gotten around to it. Rust has fine
         | language support for atomics and shared memory.
        
           | dragontamer wrote:
           | You're lucky I've had this discussion before with other Rust
           | programmers. But I forgot about this issue...
           | 
            | CUDA __shared__ memory is a 64kB block of RAM that is
            | located __INSIDE THE CORE__ and is incredibly fast,
            | but has very peculiar semantics. (Since the memory
            | never "leaves" the core, it's "stuck" with only a
            | small set of 1024 "threads". That's difficult to
            | manage when you write 30,000+ thread programs, but
            | the speed benefits are insane and well worth the
            | trouble.)
           | 
           | Rust "shared" memory is thread-to-thread communications that
           | simply doesn't exist in the GPU-world.
           | 
           | -------
           | 
           | Maybe it'd be more appropriate if I used OpenCL terminology
           | instead of CUDA terminology here, because Rust programmers
           | have an unfortunate name conflict. In OpenCL, this 64kB on-
           | core buffer is called __local memory. Does that help?
           | 
            | Whenever I said "__shared__", replace that in your
            | mind with __local instead, because CUDA __shared__ is
            | NOTHING like Rust's notion of shared memory.
        
             | nynx wrote:
             | That sounds a little complicated to deal with, but I see no
             | reason why either the Rust atomic types or a new type
             | supplied by the rust-cuda library couldn't handle that just
             | fine.
             | 
             | I just want to make sure that you realize that Rust does
             | have regular atomics (and that's how other shared memory
             | abstractions are generally implemented underneath).
        
               | dragontamer wrote:
               | EDIT: Okay, I'm switching to OpenCL terminology. I think
               | this __shared__ thing is getting confusing for ya.
               | 
               | ----------
               | 
               | The semantics of __local memory variables are very
               | peculiar, and are prone to memory ordering issues unless
               | coordinated with OpenCL's barrier() function calls.
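                | 
                | A tiny sketch of the hazard (CUDA syntax here;
                | __syncthreads() is the moral equivalent of
                | OpenCL's barrier()):
                | 
                |       __global__ void broadcast(const int* in,
                |                                 int* out)
                |       {
                |           __shared__ int tile;
                |           if (threadIdx.x == 0)
                |               tile = in[blockIdx.x];
                |           // without this barrier, other threads
                |           // may read tile before thread 0 has
                |           // written it
                |           __syncthreads();
                |           out[blockIdx.x * blockDim.x
                |               + threadIdx.x] = tile;
                |       }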
               | 
                | That means the compiler needs to be careful about
                | optimizations and reorderings. The compiler's
                | understanding of allowable memory orderings must
                | match the programmer's understanding.
               | 
                | __local variables can be of any type. In CUDA,
                | you can put new structs or even C++ classes in
                | __shared__ memory. In OpenCL, you can make
                | *arbitrary* structs __local. It's entirely
                | freeform use of memory, albeit tiny at 64kB.
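                | 
                | Concretely (hypothetical CUDA struct, 128-thread
                | blocks; anything that fits the budget works):
                | 
                |       struct Particle { float px, py, pz, mass; };
                | 
                |       __global__ void step(const Particle* in,
                |                            float* mass_out)
                |       {
                |           // a struct array living entirely
                |           // in on-chip memory
                |           __shared__ Particle cache[128];
                |           int i = blockIdx.x * 128 + threadIdx.x;
                |           cache[threadIdx.x] = in[i];
                |           __syncthreads();
                |           mass_out[i] = cache[threadIdx.x].mass;
                |       }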
               | 
               | ---------
               | 
                | The proper implementation of __local semantics
                | will require compiler support. This isn't
                | something you can just tack on with a type
                | system. The implications of __local reverberate
                | not only through the type system, but also
                | through optimization passes and the fundamental
                | model of how code gets compiled (synchronization
                | and memory orderings).
               | 
               | --------
               | 
               | > I just want to make sure that you realize that Rust
               | does have regular atomics (and that's how other shared
               | memory abstractions are generally implemented
               | underneath).
               | 
                | And __local memory is unlike anything you've
                | seen before, unless maybe you're a realtime
                | programmer working on one of those unusual CPUs
                | with manually managed L1 caches.
               | 
                | Because that's what it is: __local memory is a
                | manually managed high-speed RAM. While typical
                | CPU code relies on the hardware to manage the L1
                | cache for you, GPU programmers regularly manage
                | that region of memory *manually*.
                | 
                | It's not easy, but it's the only way to reach
                | the highest levels of performance.
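                | 
                | In code, that manual management looks something
                | like this rough sketch (hypothetical kernel,
                | 256-thread blocks, n a multiple of 256): each
                | block stages a slice of a vector that it is
                | about to reuse many times:
                | 
                |       __global__ void dot_rows(const float* v,
                |                                const float* m,
                |                                float* out, int n)
                |       {
                |           // manually cached slice of v
                |           __shared__ float tile[256];
                |           int row = blockIdx.x * blockDim.x
                |                   + threadIdx.x;
                |           float acc = 0.0f;
                |           for (int b = 0; b < n; b += 256) {
                |               tile[threadIdx.x] = v[b + threadIdx.x];
                |               __syncthreads();
                |               // 256 reuses per one global load
                |               for (int j = 0; j < 256; ++j)
                |                   acc += m[row * n + b + j] * tile[j];
                |               __syncthreads();
                |           }
                |           out[row] = acc;
                |       }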
        
         | rdambrosio wrote:
          | As I mentioned, it is an early project; just making the
          | simplest kernel compile was very difficult. Atomics and
          | shared memory are great, but both are very difficult.
          | Atomics need "proper" atomics (i.e. special
          | instructions on sm_70+, emulated on <sm_70), and shared
          | mem needs some weird codegen support. I will get to
          | both of them. Nevertheless, noalias _does_ yield
          | significant speedups in memory-bound kernels; see this
          | blog post:
          | https://developer.nvidia.com/blog/cuda-pro-tip-optimize-poin...
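          | 
          | Sketched in CUDA terms, that's the aliasing win the
          | post describes (Rust references carry the same noalias
          | guarantee implicitly):
          | 
          |       __global__ void smooth(const float* __restrict__ in,
          |                              float* __restrict__ out,
          |                              int n)
          |       {
          |           int i = blockIdx.x * blockDim.x + threadIdx.x;
          |           // without __restrict__, the store to out[]
          |           // could alias in[], forcing the compiler to
          |           // re-load in[] elements after every store
          |           if (i > 0 && i < n - 1)
          |               out[i] = 0.25f * (in[i-1] + 2.0f*in[i]
          |                                 + in[i+1]);
          |       }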
         | 
          | So please do not be surprised that an early project
          | does not contain every single feature of CUDA,
          | something that's been around for decades.
        
           | dragontamer wrote:
            | No problem. I understand it's a work in progress.
           | 
            | I'd push most strongly for CUDA __shared__ support
            | first, along with thread barriers (CUDA's
            | __syncthreads()), followed by __shared__ atomics,
            | and finally global atomics plus the associated
            | memory-barrier stuff (e.g. seq-cst and acq-rel
            | atomics would work but might be a bit difficult; it
            | may be easier to support the older-style memory
            | barrier instead? Sketch below.)
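            | 
            | By "older-style" I mean the classic fence-plus-flag
            | idiom (a sketch, one producer thread publishing a
            | value device-wide):
            | 
            |       __device__ void publish(int* data,
            |                               int* flag, int v)
            |       {
            |           *data = v;
            |           // make the write visible device-wide,
            |           // then raise the flag
            |           __threadfence();
            |           atomicExch(flag, 1);
            |       }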
        
       | zozbot234 wrote:
        | How would this compare with Accel (which is also built on
        | the CUDA ecosystem)?
        
         | rdambrosio wrote:
          | Accel uses the LLVM PTX backend, which is unusable for
          | serious projects and doesn't work on Windows; I have
          | more about it here:
          | https://github.com/RDambrosio016/Rust-CUDA/blob/master/guide...
        
       | sxp wrote:
       | > "Extremely fast"
       | 
        | When people make claims like this, it would be good if
        | they put the benchmarks on the first page. E.g., how does
        | it compare with https://github.com/gfx-rs/wgpu, which
        | lets you target Vulkan, Metal, DX, GL, or WASM+WebGPU
        | with Rust?
        
         | smoldesu wrote:
          | I hope this doesn't come off as handwaving, but you're
          | kinda comparing apples to oranges here. CUDA has always
          | been in a class of its own when it comes to GPU
          | compute, for better and worse, so I think the people
          | out there who want to use this will pretty quickly get
          | an idea of who it's for. Benchmarks would be nice, but
          | I don't really think they'd be germane when comparing a
          | proprietary compute system with a generic cross-
          | platform GPU binding.
        
           | outworlder wrote:
            | Would it have made any difference had the parent
            | mentioned OpenCL?
        
           | joe_guy wrote:
              | If it's not relative to anything, then the word
              | "fast" doesn't have much meaning.
        
             | seeekr wrote:
             | "Fast", to me, from a software development perspective can
             | still be meaningful, in the sense of knowing what
             | techniques, patterns, paths, ... enable performant
             | execution, and providing easy and straightforward paths for
             | the user along those. Which, ultimately, leads to high
             | performance in most ways the user will apply the provided
             | framework (in a more general sense of the word). Hope that
             | makes sense.
             | 
             | And it must be OK to claim "fast" as a goal, from the early
             | stages of a project, even before it may be possible to
             | create any meaningful benchmarks. As long as it's
             | discernable for the intended audience the precise stage of
             | development or maturity the project is currently at. Which,
             | I believe, the project in question is communicating just
             | fine ("still in early development").
        
         | nynx wrote:
          | wgpu is a library for running WGSL shaders on GPUs,
          | not Rust.
        
           | zozbot234 wrote:
           | https://github.com/embarkstudios/rust-gpu would be the
           | closest equivalent, AFAICT.
        
           | [deleted]
        
       | gaze wrote:
        | The writing is better than I might have produced as a
        | first-year college student, but this needs copy editing.
        | I might suggest that the word "extremely" be removed not
        | only from this repository but from all of the user's
        | repositories. "Fast" might be removed as well: nobody is
        | trying to generate slow GPU code. The salient feature is
        | that one can write the same code as one might write in
        | CUDA, with the advantages of Rust's type system (which is
        | indeed useful!). However, there's no speed to be gained
        | by using this relative to CUDA.
        
         | jacquesm wrote:
          | But there is relative to plain Rust. Besides, even if
          | nobody is trying to write slow GPU code, it's a very
          | easy thing to get subtly wrong, resulting in tens of
          | percent of speed loss.
        
         | imron wrote:
         | > I might suggest...
         | 
         | Best way to suggest is with a pull request.
        
           | ajkjk wrote:
            | Not if you're critiquing the way someone writes
            | English.
        
             | imron wrote:
             | Depends how it's worded.
             | 
             | If someone with copy editing experience made constructive
             | suggestions for the readme and other documentation, it may
             | well be appreciated.
             | 
             | A PR also gives the maintainers a way to discuss wording if
             | there are points of disagreement - far more so than a post
             | on HN that they may or may not even see.
        
       | russdpale wrote:
       | very cool!
        
       ___________________________________________________________________
       (page generated 2021-11-22 23:00 UTC)