[HN Gopher] Using SIMD for Parallel Processing in Rust
       ___________________________________________________________________
        
       Using SIMD for Parallel Processing in Rust
        
       Author : nbrempel
       Score  : 75 points
       Date   : 2024-07-01 16:29 UTC (6 hours ago)
        
 (HTM) web link (nrempel.com)
 (TXT) w3m dump (nrempel.com)
        
       | eachro wrote:
        | It's cool that SIMD primitives exist in the std lib of Rust.
        | I've wanted to mess around a bit more with SIMD in Python but
        | I don't think that native support exists. Or you have to go
        | down to C/C++ bindings to actually mess around with it (last I
        | checked at least, please correct me if I'm wrong).
        
         | mroche wrote:
         | Quick search turned up this:
         | 
         |  _SIMD in Pure Python_
         | 
         | https://www.da.vidbuchanan.co.uk/blog/python-swar.html
         | 
          | Don't let the "SIMD in Python" section fool you; it's a
          | short stop on NumPy before putting it aside.
        
         | bardak wrote:
          | I feel like most languages could use SIMD in the standard
          | library. We have all this power in the vector units of our
          | CPUs that compilers struggle to use, yet we also don't make
          | it easy to use manually.
        
         | Calavar wrote:
         | What would native SIMD support entail in a language without
         | first party JIT or AOT compilation?
        
           | runevault wrote:
           | At some point bytecode still turns into CPU instructions, so
           | if you added syntax or special functions that went to parts
           | of the interpreter that are SIMD you could certainly add it
           | to a purely interpreted language.
        
             | Calavar wrote:
             | If we're talking low level SIMD, like opcode level, I'm
             | really struggling to see the use case for interpreted
             | bytecode. The cost of type checking operands to dynamically
             | dispatch down a SIMD path would almost certainly outweigh
             | the savings of the SIMD path itself.
             | 
             | JIT is different because in function-level JIT, you can
             | check types just once at the opening of the function, then
             | you stay on the SIMD happy path for the rest of the
              | function. And in AOT, you may be able to elide the checks
             | entirely.
             | 
             | There is certainly a space for higher level SIMD
             | functionality, but that would probably look more or less
             | like numpy. And at least for Python, that already exists.
        
       | thomashabets2 wrote:
       | The portable SIMD is quite nice. We can't really trust a
       | "sufficiently smart compiler" to make the best SIMD decisions,
       | since it may not see through what you're actually doing.
       | 
       | https://blog.habets.se/2024/04/Rust-is-faster-than-C.html and
       | code at
       | https://github.com/ThomasHabets/zipbrute/blob/master/rust/sr...
       | showed me getting 3x faster using portable SIMD, on my first
       | attempt.
        
       | nbrempel wrote:
       | Thanks for reading everyone. I've gotten some feedback over on
       | Reddit as well that the example is not effectively showing the
       | benefits of SIMD. I plan on revising this.
       | 
       | One of my goals of writing these articles is to learn so feedback
       | is more than welcome!
        
         | KineticLensman wrote:
         | Great read!
         | 
         | > One of my goals of writing these articles is to learn so
         | feedback is more than welcome!
         | 
         | When I went into the Rust playground to see the assembly output
         | for the Cumulative Sum example, I could only get it to show the
         | compiler warnings, not the actual assembly. I'm probably doing
         | something wrong, but for me this was a barrier that detracted
         | from the article. I'd suggest incorporating the assembly
         | directly into the article, although keeping the playground link
         | for people who are more dedicated / competent than I am.
        
           | the8472 wrote:
            | The function has to be made pub so it doesn't get optimized
            | out as an unused private function.
           | 
           | Godbolt is a better choice for looking at asm anyway.
           | https://rust.godbolt.org/z/3Y9ovsoz9
        
             | KineticLensman wrote:
             | Ah, that worked, thanks!
             | 
             | Although I can now see why he didn't include the output
             | directly.
        
       | oconnor663 wrote:
       | There are a lot of factors that go into how fast a hash function
       | is, but the case we're showing in the big red chart at
       | https://github.com/BLAKE3-team/BLAKE3 is almost entirely driven
       | by SIMD. It's a huge deal.
        
       | anonymousDan wrote:
       | The interesting question for me is whether Rust makes it easier
       | for the compiler to extract SIMD parallelism automatically given
       | the restrictions imposed by its type system.
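
One place where that can show up is plain slice iteration: the borrows are immutable, so LLVM does not have to worry about writes invalidating loads. A small sketch (auto-vectorization is not guaranteed; it depends on opt-level and target features):

```rust
// Stable Rust, no explicit SIMD: the slices are borrowed immutably, so at
// opt-level 2+ LLVM's auto-vectorizer can often turn this loop into SIMD code.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = vec![0.5f32; 1024];
    let b = vec![2.0f32; 1024];
    println!("{}", dot(&a, &b)); // 1024
}
```

Checking the actual codegen (e.g. on Godbolt with -O) is the only way to know whether a given loop vectorized.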
        
       | neonsunset wrote:
       | If you like SIMD and would like to dabble in it, I can strongly
       | recommend trying it out in C# via its platform-agnostic SIMD
       | abstraction. It is _very_ accessible especially if you already
       | know a little bit of C or C++, and compiles to very competent
        | codegen for AdvSimd, SSE2/4.2/AVX1/2/AVX512, WASM's Packed SIMD
       | and, in .NET 9, SVE1/2:
       | 
       | https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
       | 
       | Here's an example of "checked" sum over a span of integers that
       | uses platform-specific vector width:
       | 
       | https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
       | 
       | Other examples:
       | 
       | CRC64
       | https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
       | 
       | Hamming distance
       | https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
       | 
       | Default syntax is a bit ugly in my opinion, but it can be
       | significantly improved with helper methods like here where the
       | code is a port of simdutf's UTF-8 code point counting:
       | https://github.com/U8String/U8String/blob/main/Sources/U8Str...
       | 
       | There are more advanced scenarios. Bepuphysics2 engine heavily
       | leverages SIMD to perform as fast as PhysX's CPU back-end:
       | https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...
       | 
       | Note that practically none of these need to reach out to
       | platform-specific intrinsics (except for replacing movemask
       | emulation with efficient ARM64 alternative) and use the same path
       | for all platforms, varied by vector width rather than specific
       | ISA.
        
         | runevault wrote:
         | Funny you mention c#, I started to look at this and I made the
         | mistake of wanting to do string comparison via SIMD, except you
         | can't do it externally because it relies on private internals
         | (note, the built in comparison for c# already does SIMD, you
         | just can't easily reimplement it against the built in string
         | type).
        
           | neonsunset wrote:
           | What kind of private internals do you have in mind? You
           | absolutely can hand-roll your own comparison routine, just
           | hard to beat existing implementation esp. once you start
           | considering culture-sensitive comparison (which may defer to
           | e.g. ICU).
           | 
            | There are no private SIMD APIs save for the sequence
            | comparison intrinsic for unrolling against known lengths,
            | which JIT/ILC does for spans and strings.
        
             | runevault wrote:
             | IIRC (Been a month or so since I looked into it) I couldn't
             | access the underlying array in a way SIMD liked I think? If
             | you look at how they did it inside the actual string class
             | it uses those private properties of the string that are
             | only available internally to guarantee you don't change the
             | string data if memory serves.
        
               | neonsunset wrote:
               | String can provide you a `ReadOnlySpan<char>`, out of
               | which you can either take `ref readonly char` "byref"
               | pointer, which all vectors work with, or you can use the
               | unsafe variant and make this byref mutable (just don't
               | write to it) with `Unsafe.AsRef`.
               | 
               | Because pretty much every type that has linear memory can
               | be represented as span, it means that every span is
               | amenable to pointer (byref) arithmetics which you then
                | use to write a SIMD routine. e.g.:
                | 
                |     var text = "Hello, World! Hello, World!";
                |     var span = MemoryMarshal.Cast<char, ushort>(text);
                |     ref readonly var ptr = ref span[0];
                | 
                |     var chunk = Vector128.LoadUnsafe(in ptr);
                |     var needle = Vector128.Create((ushort)',');
                |     var comparison = Vector128.Equals(chunk, needle);
                |     var offset = uint.TrailingZeroCount(
                |         comparison.ExtractMostSignificantBits());
                | 
                |     Console.WriteLine(text[..(int)offset]);
               | 
               | If you have doubts regarding codegen quality, take a look
               | at: https://godbolt.org/z/b97zjfTP7 The above vector API
               | calls are lowered to lines 17-22.
        
               | runevault wrote:
               | Oh interesting, I'll have to give that a try then. My
               | concern was avoiding a reallocation by doing it another
               | way, but if the readonly span works I can see how it
               | would get you there. I need to see if I still have that
               | project to test it out, appreciate the heads up. SIMD is
               | something I really want to get better with.
        
               | neonsunset wrote:
               | If you go through the guide at the first link, it will
               | pretty much set you up with the basics to work on
               | vectorization, and once done, you can look at what
               | CoreLib does as a reference (just keep in mind it tries
               | to squeeze all the performance for short lengths too, so
               | the tail/head scalar handlers and dispatch can be high-
               | effort, more so than you may care about). The point
               | behind the way .NET does it is to have the same API
               | exposed to external consumers as the one CoreLib uses
               | itself, which is why I was surprised by your initial
               | statement.
               | 
               | No offense taken, just clarifying, SIMD can seem daunting
               | especially if you look at intrinsics in C/C++, and I hope
               | the approach in C# will popularize it. Good luck with
               | your experiments!
        
               | runevault wrote:
               | I appreciate you taking the time to talk me through this,
               | SIMD has been an interest of mine for a while. I ran into
               | issues and then when I went and looked at how the actual
               | string class did it I stopped since they were doing
               | tricks that required said access to the internal data.
               | But this gives me a path to explore. I was already
               | planning on looking at the links you supplied.
               | 
               | Thank you again.
        
       | IshKebab wrote:
       | Minor nit: RISC-V Vector isn't SIMD. It's actually like ARM's
       | Scalable Vector Extension. Unlike traditional SIMD the code is
       | agnostic to the register width and different hardware can run the
       | same code with different widths.
       | 
       | There is also a traditional SIMD extension (P I think?) but it
       | isn't finished. Most focus has been on the vector extension.
       | 
       | I am wondering how and if Rust will support these vector
       | processing extensions.
        
         | Findecanor wrote:
         | RISC-V's vector extension will have at least 128 bits in
         | application processors, so I think you could set VLEN=128 and
         | just use SIMD algorithms.
         | 
         | The P extension is intended more for embedded microcontrollers
         | for which the V extension would be too expensive. It reuses the
         | GPRs at whatever width they are at (32 or 64 bits).
        
           | camel-cdr wrote:
           | That or you can detect the vector length and specialize for
           | it, just like it's already done on x86 with VLEN 128, 256,
           | and 512 for sse, avx, and avx512.
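
On x86 that kind of specialization can be done with runtime feature detection. A sketch in Rust (the AVX2 branch here is a placeholder that falls through to the scalar routine, not a real specialized implementation):

```rust
// Runtime dispatch sketch: detect CPU features once at the call site and
// route to a matching implementation, falling back to scalar code.
fn sum_scalar(xs: &[u32]) -> u32 {
    xs.iter().copied().fold(0, u32::wrapping_add)
}

fn sum(xs: &[u32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // A real build would call an AVX2-specialized routine here.
            return sum_scalar(xs);
        }
    }
    sum_scalar(xs)
}

fn main() {
    println!("{}", sum(&[1, 2, 3, 4])); // 10
}
```

The detection is cheap relative to any non-trivial workload, and libraries typically cache the result rather than re-querying per call.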
        
         | camel-cdr wrote:
         | > RISC-V Vector isn't SIMD
         | 
         | Isn't SIMD a subset of vector processors?
         | 
         | To that matter, can anybody here provide a proper and useful
         | distinction between the two, that is SIMD and vector ISAs?
         | 
         | You imply it's because it's vector length agnostic, but you
         | could take e.g. the SSE encoding, and apart from a few
         | instructions, make it operate on SIMD registers of any length.
         | Wouldn't that also be vector length agnostic, as long as
         | software can query the vector length? I think most people
         | wouldn't call this a vector ISA, and how is this substantially
         | different from dispatching to different implementations for SSE
         | AVX and AVX512?
         | 
         | I've also seen people say it's about the predication, which
         | would make AVX512 a vector isa.
         | 
         | I've seen others say it's about resource usage and vector
         | chaining, but that is just an implementation detail and can be
          | used or not used on traditional SIMD ISAs to the same extent as
         | on vector ISAs.
        
       | ww520 wrote:
        | Zig actually has a very nice abstraction for SIMD in the form
        | of vector programming. Vector sizes are independent of the
        | underlying CPU architecture; the compiler or LLVM will generate
        | code using 128-, 256-, or 512-bit SIMD registers, and you are
        | just programming straight vectors.
        
       ___________________________________________________________________
       (page generated 2024-07-01 23:00 UTC)