[HN Gopher] Using SIMD for Parallel Processing in Rust
___________________________________________________________________
Using SIMD for Parallel Processing in Rust
Author : nbrempel
Score : 75 points
Date : 2024-07-01 16:29 UTC (6 hours ago)
(HTM) web link (nrempel.com)
(TXT) w3m dump (nrempel.com)
| eachro wrote:
 | It's cool that SIMD primitives exist in the std lib of Rust.
 | I've wanted to mess around a bit more with SIMD in Python,
 | but I don't think native support exists. Or you have to go
 | down to C/C++ bindings to actually mess around with it (last I
 | checked at least, please correct me if I'm wrong).
| mroche wrote:
| Quick search turned up this:
|
| _SIMD in Pure Python_
|
| https://www.da.vidbuchanan.co.uk/blog/python-swar.html
|
 | Don't let the "SIMD in Python" section fool you; it's a short
 | stop at NumPy before setting it aside.
| bardak wrote:
 | I feel like most languages could use SIMD in the standard
 | library. We have all this power in the vector units of our
 | CPUs that compilers struggle to use, yet we also don't make
 | it easy to use manually.
| Calavar wrote:
| What would native SIMD support entail in a language without
| first party JIT or AOT compilation?
| runevault wrote:
 | At some point bytecode still turns into CPU instructions, so
 | if you added syntax or special functions that route to SIMD
 | paths in the interpreter, you could certainly add it to a
 | purely interpreted language.
| Calavar wrote:
| If we're talking low level SIMD, like opcode level, I'm
| really struggling to see the use case for interpreted
| bytecode. The cost of type checking operands to dynamically
| dispatch down a SIMD path would almost certainly outweigh
| the savings of the SIMD path itself.
|
| JIT is different because in function-level JIT, you can
| check types just once at the opening of the function, then
| you stay on the SIMD happy path for the rest of the
 | function. And in AOT, you may be able to elide the checks
 | entirely.
|
| There is certainly a space for higher level SIMD
| functionality, but that would probably look more or less
| like numpy. And at least for Python, that already exists.
| thomashabets2 wrote:
| The portable SIMD is quite nice. We can't really trust a
| "sufficiently smart compiler" to make the best SIMD decisions,
| since it may not see through what you're actually doing.
|
| https://blog.habets.se/2024/04/Rust-is-faster-than-C.html and
| code at
| https://github.com/ThomasHabets/zipbrute/blob/master/rust/sr...
| showed me getting 3x faster using portable SIMD, on my first
| attempt.
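 | The lane-parallel shape that portable SIMD makes explicit can
 | be approximated on stable Rust (std::simd is still nightly-
 | only) with independent accumulators that LLVM can map onto a
 | vector register. A minimal sketch; `sum_4lane` is a
 | hypothetical name, not code from the linked repo:

```rust
// Four independent accumulators -- the same shape portable SIMD
// spells out explicitly -- give the autovectorizer a clear path
// to 4-lane vector adds.
pub fn sum_4lane(xs: &[u32]) -> u32 {
    let mut lanes = [0u32; 4];
    let chunks = xs.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        for i in 0..4 {
            // Each lane only ever depends on itself, so the four
            // adds are independent and vectorizable.
            lanes[i] = lanes[i].wrapping_add(chunk[i]);
        }
    }
    // Horizontal reduction of the lanes, then the scalar tail.
    let mut total = lanes.iter().fold(0u32, |a, &b| a.wrapping_add(b));
    for &x in tail {
        total = total.wrapping_add(x);
    }
    total
}

fn main() {
    let data: Vec<u32> = (1..=10).collect();
    assert_eq!(sum_4lane(&data), 55);
    println!("ok");
}
```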
| nbrempel wrote:
 | Thanks for reading, everyone. I've gotten some feedback over on
| Reddit as well that the example is not effectively showing the
| benefits of SIMD. I plan on revising this.
|
| One of my goals of writing these articles is to learn so feedback
| is more than welcome!
| KineticLensman wrote:
| Great read!
|
| > One of my goals of writing these articles is to learn so
| feedback is more than welcome!
|
| When I went into the Rust playground to see the assembly output
| for the Cumulative Sum example, I could only get it to show the
| compiler warnings, not the actual assembly. I'm probably doing
| something wrong, but for me this was a barrier that detracted
| from the article. I'd suggest incorporating the assembly
| directly into the article, although keeping the playground link
| for people who are more dedicated / competent than I am.
| the8472 wrote:
 | The function has to be made pub so it doesn't get optimized
 | out as an unused private function.
|
| Godbolt is a better choice for looking at asm anyway.
| https://rust.godbolt.org/z/3Y9ovsoz9
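 | For instance, a `pub` cumulative-sum function along these
 | lines (a hypothetical sketch, not the article's exact code)
 | survives optimization, so Godbolt shows its assembly when
 | built with -O:

```rust
// Marking the function `pub` keeps it (and its assembly) in the
// compiler output even when nothing else in the crate calls it.
pub fn cumulative_sum(xs: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    xs.iter()
        .map(|&x| {
            // Running total carried across elements.
            acc = acc.wrapping_add(x);
            acc
        })
        .collect()
}

fn main() {
    assert_eq!(cumulative_sum(&[1, 2, 3, 4]), vec![1, 3, 6, 10]);
    println!("ok");
}
```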
| KineticLensman wrote:
| Ah, that worked, thanks!
|
| Although I can now see why he didn't include the output
| directly.
| oconnor663 wrote:
| There are a lot of factors that go into how fast a hash function
| is, but the case we're showing in the big red chart at
| https://github.com/BLAKE3-team/BLAKE3 is almost entirely driven
| by SIMD. It's a huge deal.
| anonymousDan wrote:
| The interesting question for me is whether Rust makes it easier
| for the compiler to extract SIMD parallelism automatically given
| the restrictions imposed by its type system.
| neonsunset wrote:
| If you like SIMD and would like to dabble in it, I can strongly
| recommend trying it out in C# via its platform-agnostic SIMD
| abstraction. It is _very_ accessible especially if you already
| know a little bit of C or C++, and compiles to very competent
 | codegen for AdvSimd, SSE2/4.2/AVX1/2/AVX512, WASM's Packed SIMD
| and, in .NET 9, SVE1/2:
|
| https://github.com/dotnet/runtime/blob/main/docs/coding-guid...
|
| Here's an example of "checked" sum over a span of integers that
| uses platform-specific vector width:
|
| https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
|
| Other examples:
|
| CRC64
| https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
|
| Hamming distance
| https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
|
| Default syntax is a bit ugly in my opinion, but it can be
| significantly improved with helper methods like here where the
| code is a port of simdutf's UTF-8 code point counting:
| https://github.com/U8String/U8String/blob/main/Sources/U8Str...
|
| There are more advanced scenarios. Bepuphysics2 engine heavily
| leverages SIMD to perform as fast as PhysX's CPU back-end:
| https://github.com/bepu/bepuphysics2/blob/master/BepuPhysics...
|
| Note that practically none of these need to reach out to
| platform-specific intrinsics (except for replacing movemask
| emulation with efficient ARM64 alternative) and use the same path
| for all platforms, varied by vector width rather than specific
| ISA.
| runevault wrote:
 | Funny you mention C#. I started to look at this, and I made
 | the mistake of wanting to do string comparison via SIMD,
 | except you can't do it externally because it relies on private
 | internals. (Note: the built-in comparison for C# already does
 | SIMD; you just can't easily reimplement it against the
 | built-in string type.)
| neonsunset wrote:
| What kind of private internals do you have in mind? You
| absolutely can hand-roll your own comparison routine, just
| hard to beat existing implementation esp. once you start
| considering culture-sensitive comparison (which may defer to
| e.g. ICU).
|
 | There are no private SIMD APIs save for the sequence-comparison
 | intrinsic for unrolling against known lengths, which JIT/ILC
 | does for spans and strings.
| runevault wrote:
 | IIRC (it's been a month or so since I looked into it), I
 | couldn't access the underlying array in a way SIMD liked. If
 | you look at how they did it inside the actual string class, it
 | uses private properties of the string that are only available
 | internally, to guarantee you don't change the string data, if
 | memory serves.
| neonsunset wrote:
| String can provide you a `ReadOnlySpan<char>`, out of
| which you can either take `ref readonly char` "byref"
| pointer, which all vectors work with, or you can use the
| unsafe variant and make this byref mutable (just don't
| write to it) with `Unsafe.AsRef`.
|
 | Because pretty much every type that has linear memory can be
 | represented as a span, every span is amenable to pointer
 | (byref) arithmetic, which you then use to write a SIMD
 | routine. E.g.:
 |
 |     var text = "Hello, World! Hello, World!";
 |     var span = MemoryMarshal.Cast<char, ushort>(text);
 |     ref readonly var ptr = ref span[0];
 |     var chunk = Vector128.LoadUnsafe(in ptr);
 |     var needle = Vector128.Create((ushort)',');
 |     var comparison = Vector128.Equals(chunk, needle);
 |     var offset = uint.TrailingZeroCount(
 |         comparison.ExtractMostSignificantBits());
 |     Console.WriteLine(text[..(int)offset]);
|
 | If you have doubts regarding codegen quality, take a look at
 | https://godbolt.org/z/b97zjfTP7 (the above vector API calls
 | are lowered to lines 17-22).
| runevault wrote:
| Oh interesting, I'll have to give that a try then. My
| concern was avoiding a reallocation by doing it another
| way, but if the readonly span works I can see how it
| would get you there. I need to see if I still have that
| project to test it out, appreciate the heads up. SIMD is
| something I really want to get better with.
| neonsunset wrote:
| If you go through the guide at the first link, it will
| pretty much set you up with the basics to work on
| vectorization, and once done, you can look at what
| CoreLib does as a reference (just keep in mind it tries
| to squeeze all the performance for short lengths too, so
| the tail/head scalar handlers and dispatch can be high-
| effort, more so than you may care about). The point
| behind the way .NET does it is to have the same API
| exposed to external consumers as the one CoreLib uses
| itself, which is why I was surprised by your initial
| statement.
|
| No offense taken, just clarifying, SIMD can seem daunting
| especially if you look at intrinsics in C/C++, and I hope
| the approach in C# will popularize it. Good luck with
| your experiments!
| runevault wrote:
| I appreciate you taking the time to talk me through this,
| SIMD has been an interest of mine for a while. I ran into
| issues and then when I went and looked at how the actual
| string class did it I stopped since they were doing
| tricks that required said access to the internal data.
| But this gives me a path to explore. I was already
| planning on looking at the links you supplied.
|
| Thank you again.
| IshKebab wrote:
 | Minor nit: RISC-V Vector isn't SIMD; it's actually more like
 | ARM's Scalable Vector Extension. Unlike traditional SIMD, the
 | code is agnostic to the register width, and different hardware
 | can run the same code with different widths.
|
| There is also a traditional SIMD extension (P I think?) but it
| isn't finished. Most focus has been on the vector extension.
|
| I am wondering how and if Rust will support these vector
| processing extensions.
| Findecanor wrote:
| RISC-V's vector extension will have at least 128 bits in
| application processors, so I think you could set VLEN=128 and
| just use SIMD algorithms.
|
| The P extension is intended more for embedded microcontrollers
| for which the V extension would be too expensive. It reuses the
| GPRs at whatever width they are at (32 or 64 bits).
| camel-cdr wrote:
 | That, or you can detect the vector length and specialize for
 | it, just like it's already done on x86 with VLEN 128, 256,
 | and 512 for SSE, AVX, and AVX-512.
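 | In Rust, that kind of width/ISA dispatch can be sketched with
 | runtime feature detection. The AVX2 branch below is only a
 | placeholder marking where a wider-vector path would go; the
 | function name and shape are assumptions for illustration, not
 | code from the thread:

```rust
// Dispatch at runtime on x86_64 via `is_x86_feature_detected!`;
// other architectures compile out the check and take the scalar
// path, so the sketch builds everywhere.
fn sum(xs: &[u64]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // A real implementation would use 256-bit-wide code here
            // (e.g. core::arch::x86_64 intrinsics); this placeholder
            // only marks the dispatch point.
            return xs.iter().copied().sum();
        }
    }
    // Scalar fallback (and the only path on non-x86_64 targets).
    xs.iter().copied().sum()
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
    println!("ok");
}
```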
| camel-cdr wrote:
| > RISC-V Vector isn't SIMD
|
| Isn't SIMD a subset of vector processors?
|
 | For that matter, can anybody here provide a proper and useful
 | distinction between the two, that is, SIMD and vector ISAs?
|
| You imply it's because it's vector length agnostic, but you
| could take e.g. the SSE encoding, and apart from a few
| instructions, make it operate on SIMD registers of any length.
| Wouldn't that also be vector length agnostic, as long as
| software can query the vector length? I think most people
| wouldn't call this a vector ISA, and how is this substantially
| different from dispatching to different implementations for SSE
| AVX and AVX512?
|
| I've also seen people say it's about the predication, which
| would make AVX512 a vector isa.
|
 | I've seen others say it's about resource usage and vector
 | chaining, but that is just an implementation detail and can be
 | used or not used on traditional SIMD ISAs to the same extent
 | as on vector ISAs.
| ww520 wrote:
 | Zig actually has a very nice abstraction for SIMD in the form
 | of vector programming. The size of the vector is agnostic to
 | the underlying CPU architecture; the compiler (or LLVM) will
 | generate code using 128-, 256-, or 512-bit SIMD registers,
 | and you are just programming straight vectors.
___________________________________________________________________
(page generated 2024-07-01 23:00 UTC)