[HN Gopher] Building a compile-time SIMD optimized smoothing filter
       ___________________________________________________________________
        
       Building a compile-time SIMD optimized smoothing filter
        
       Author : PaulHoule
       Score  : 69 points
       Date   : 2024-09-28 12:12 UTC (10 hours ago)
        
 (HTM) web link (scientificcomputing.rs)
 (TXT) w3m dump (scientificcomputing.rs)
        
       | keldaris wrote:
       | This looks like a nice case study for when you're already using
       | Rust for other reasons and just want to make a bit of numerical
       | code go fast. However, as someone mostly writing C++ and Julia,
       | this does not look promising at all - it's clear that the Julia
       | implementation is both more elegant and faster, and it seems much
       | easier to reproduce that result in C++ (which has no issues with
        | compile-time float constants, SIMD, GPU support, etc.) than in Rust.
       | 
       | I've written very little Rust myself, but when I've tried, I've
       | always come away with a similar impression that it's just not a
       | good fit for performant numerical computing, with seemingly basic
       | things (like proper SIMD support, const generics without weird
       | restrictions, etc.) considered afterthoughts. For those more up
        | to speed on Rust development, is this impression accurate, or
        | have I missed something and should I reconsider my view?
        
         | gauge_field wrote:
         | In terms of speed, Rust is up there with C/C++. See e.g.
         | https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
          | I've also ported several algorithms from C, and they matched
          | the original performance.
         | 
          | Regarding SIMD support, the only things missing are stable
          | support for avx512 and some of the more exotic feature
          | extensions for deep learning, e.g. avx_vnni. Those are
          | implemented and waiting to be included in upcoming stable
          | versions.
         | 
          | GPU support: this is still an issue because not enough people
          | are working on it, but there are projects trying to improve
          | the situation: see https://github.com/tracel-ai/cubecl .
         | 
          | Const generics: yeah, there are a few annoying issues. They
          | are limited to a small set of types; for instance, you can't
          | use an enum as a const generic parameter. Also, you can't use
          | generic parameters in const expressions on stable Rust: see
          | the unstable feature generic_const_exprs.
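          | 
          | To illustrate that restriction, a minimal (hypothetical)
          | sketch: the const generic length itself is fine on stable, but
          | arithmetic on it in a type position needs the unstable feature.
          | 
          | ```
          | // Fine on stable Rust: a const generic array length.
          | struct Window<const N: usize> {
          |     taps: [f64; N],
          | }
          | 
          | // Not allowed on stable: arithmetic on the parameter in a type
          | // position requires the unstable `generic_const_exprs` feature.
          | //
          | // impl<const N: usize> Window<N> {
          | //     fn padded(&self) -> [f64; N + 1] { /* ... */ }
          | // }
          | 
          | fn main() {
          |     let w = Window::<5> { taps: [0.0; 5] };
          |     println!("{}", w.taps.len());
          | }
          | ```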
         | 
          | My main reasons for using Rust in numerical computing:
          | 
          | - The type system. Some find it weird; I find it explicit and
          | easier to understand.
          | 
          | - Cargo (nicer cross-platform defaults, since I tend to
          | develop from both Windows and Linux).
          | 
          | - Unconditional code generation, with #[target_feature(enable
          | = "feature_list")]. This means I don't have to set a different
          | set of flags for each compilation unit when building; it is
          | enough to put that attribute on top of the function making use
          | of SIMD (see the sketch after this list).
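          | 
          | Roughly like this (untested sketch, assuming an x86_64 target):
          | the attributed function gets compiled with AVX2/FMA enabled
          | regardless of the crate-wide flags, and a safe wrapper gates it
          | behind runtime feature detection.
          | 
          | ```
          | // Compiled with AVX2+FMA regardless of crate-wide target flags.
          | // Calling it is unsafe unless the CPU really has the features,
          | // hence the runtime check in the safe wrapper below.
          | #[target_feature(enable = "avx2,fma")]
          | unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
          |     a.iter().zip(b).map(|(x, y)| x * y).sum()
          | }
          | 
          | fn dot(a: &[f32], b: &[f32]) -> f32 {
          |     if is_x86_feature_detected!("avx2")
          |         && is_x86_feature_detected!("fma")
          |     {
          |         // Safe: we just verified the required CPU features.
          |         return unsafe { dot_avx2(a, b) };
          |     }
          |     a.iter().zip(b).map(|(x, y)| x * y).sum()
          | }
          | 
          | fn main() {
          |     let (a, b) = (vec![1.0f32; 1024], vec![2.0f32; 1024]);
          |     println!("{}", dot(&a, &b));
          | }
          | ```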
         | 
          | I agree that if you want fast, exploratory algorithm
          | development and can sacrifice a little bit of performance,
          | Julia is a better choice.
        
           | bobmcnamara wrote:
            | TBH, Intel ISAs have never been very stable; the mixed AVX
            | flavors are just the latest example.
        
         | adgjlsfhk1 wrote:
         | This sort of thing is where Julia really shines.
         | https://github.com/miguelraz/StagedFilters.jl/blob/master/sr...
          | is the Julia code; it's only 65 lines and uses some fairly
          | clean generated code to get optimal performance for all
          | floating-point types.
        
       | jonathrg wrote:
        | Tight loops of SIMD operations seem like something that might be
        | more convenient to just implement directly in assembly? That way
        | you don't need to babysit the compiler like this.
        
         | adgjlsfhk1 wrote:
          | Counterpoint: that makes you rewrite it for every architecture
          | and every datatype. A high-level language makes it a lot
          | easier to get something more readable, runnable everywhere,
          | and datatype generic.
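          | 
          | A minimal sketch of the datatype-generic point: one
          | implementation covers f32 and f64, and the compiler can then
          | specialize and vectorize it per element type.
          | 
          | ```
          | use std::ops::{Add, Mul};
          | 
          | // One generic kernel instead of one hand-written asm routine
          | // per architecture and per element type.
          | fn dot<T>(a: &[T], b: &[T]) -> T
          | where
          |     T: Copy + Default + Add<Output = T> + Mul<Output = T>,
          | {
          |     a.iter().zip(b).fold(T::default(), |acc, (&x, &y)| acc + x * y)
          | }
          | 
          | fn main() {
          |     println!("{}", dot(&[1.0f32, 2.0, 3.0], &[4.0f32, 5.0, 6.0]));
          |     println!("{}", dot(&[1.0f64, 2.0, 3.0], &[4.0f64, 5.0, 6.0]));
          | }
          | ```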
        
           | brrrrrm wrote:
           | You almost always have to for good perf on non-trivial
           | operations
        
           | menaerus wrote:
            | If you're ready to settle for much lower IPC, then yes.
        
             | adgjlsfhk1 wrote:
             | The Julia version of this
             | (https://github.com/miguelraz/StagedFilters.jl) is very
             | clearly generating pretty great code. For
             | `smooth!(SavitzkyGolayFilter{4,3}, rand(Float64, 1000),
             | rand(Float64, 1000))`, 36 out of 74 of the inner loop
             | instructions are vectorized fmadds (of one kind or
             | another). There are a few register spills that seem
             | plausibly unnecessary, and some dependency chains that I
              | think are an inherent part of the algorithm, but I'm pretty
              | sure there isn't an additional 2x of speed to be had here.
        
               | menaerus wrote:
               | I can write SIMD code that's 2x the speed of the non-
               | vectorized code but I can also rewrite that same SIMD
               | code so that 2x becomes 6x.
               | 
                | Point being, you can get not just another 2x on top of
                | the initial 2x SIMD implementation but usually much more.
                | 
                | Whether or not you see SIMD in the codegen is not a
                | testament to how good the implementation really is.
               | 
               | IPC is the relevant measure here.
        
               | adgjlsfhk1 wrote:
                | IPC looks like 3.5 instructions per clock (and for bigger
                | inputs the speed will be memory bound rather than compute
                | bound).
        
           | jonathrg wrote:
           | For libraries, yes that is a concern. In practice you're
           | often dealing with one or a handful of instances of the
           | problem at hand.
        
         | anonymoushn wrote:
         | I'm happy to use compilers for register allocation, even though
         | they aren't particularly good at it
        
         | menaerus wrote:
          | Definitely. And I also found the title quite misleading. It's
          | auto-vectorization that this presentation is trying to cover.
          | Compile-time SIMD, OTOH, would mean something totally
          | different, e.g. computation at compile time using constexpr
          | SIMD intrinsics.
          | 
          | I'd also add that it's not only about babysitting the
          | compiler; you're also leaving a lot of performance on the
          | table. Auto-vectorized code, generally speaking, unfortunately
          | cannot beat manually vectorized code (either through
          | intrinsics or asm).
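          | 
          | To make the gap concrete, an untested sketch (assuming an
          | x86_64 target with AVX2/FMA): the first version leaves
          | everything to the auto-vectorizer, the second spells out the
          | 8-wide FMA with intrinsics.
          | 
          | ```
          | use std::arch::x86_64::*;
          | 
          | // Left to the auto-vectorizer: may or may not become packed FMAs.
          | fn axpy_auto(y: &mut [f32], a: f32, x: &[f32]) {
          |     for (yi, xi) in y.iter_mut().zip(x) {
          |         *yi += a * *xi;
          |     }
          | }
          | 
          | // Manual vectorization with intrinsics: 8 lanes per iteration.
          | // Caller must ensure the CPU supports AVX2 and FMA.
          | #[target_feature(enable = "avx2,fma")]
          | unsafe fn axpy_manual(y: &mut [f32], a: f32, x: &[f32]) {
          |     let va = _mm256_set1_ps(a);
          |     let n = y.len().min(x.len());
          |     for i in 0..n / 8 {
          |         let py = y.as_mut_ptr().add(i * 8);
          |         let vx = _mm256_loadu_ps(x.as_ptr().add(i * 8));
          |         let vy = _mm256_loadu_ps(py);
          |         _mm256_storeu_ps(py, _mm256_fmadd_ps(va, vx, vy));
          |     }
          |     for i in (n / 8) * 8..n {
          |         y[i] += a * x[i]; // scalar tail
          |     }
          | }
          | 
          | fn main() {
          |     let (mut y, x) = (vec![1.0f32; 16], vec![2.0f32; 16]);
          |     axpy_auto(&mut y, 0.5, &x);
          |     if is_x86_feature_detected!("avx2")
          |         && is_x86_feature_detected!("fma")
          |     {
          |         unsafe { axpy_manual(&mut y, 0.5, &x) };
          |     }
          |     println!("{:?}", &y[..4]);
          | }
          | ```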
        
       | feverzsj wrote:
       | I think most rust projects still depend on clang's vector
       | extension when SIMD is required.
        
       | jtrueb wrote:
       | I think scientific computing in Rust is getting too much
       | attention without contribution lately. We don't have many
        | essential language features stabilized: SIMD, generic const
        | exprs, intrinsics, function optimization overrides, and
        | reasonable floating-point overrides related to fast math are all
        | a long way from being stabilized. To get better perf, the code
        | ends up full of informal compiler hints to guide it towards an
        | optimization like autovectorization or branch elision. The
        | semantics around strict floating-point standards are stifling,
        | and intrinsics have become less accessible than they used to be.
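        | 
        | For example, a minimal sketch of the kind of informal hint meant
        | here: an up-front length assert plus fixed-size chunks don't
        | change the semantics, but they make it easier for the compiler
        | to drop bounds checks and vectorize the inner loop.
        | 
        | ```
        | // Hints only, no intrinsics: the assert and the fixed chunk size
        | // exist purely to coax the optimizer into eliding bounds checks
        | // and emitting packed instructions for the 8-element inner loop.
        | pub fn saxpy(y: &mut [f32], a: f32, x: &[f32]) {
        |     assert_eq!(y.len(), x.len());
        |     for (yc, xc) in y.chunks_exact_mut(8).zip(x.chunks_exact(8)) {
        |         for i in 0..8 {
        |             yc[i] += a * xc[i];
        |         }
        |     }
        |     // Scalar tail for lengths that aren't a multiple of 8.
        |     for i in (y.len() / 8) * 8..y.len() {
        |         y[i] += a * x[i];
        |     }
        | }
        | 
        | fn main() {
        |     let (mut y, x) = (vec![1.0f32; 20], vec![2.0f32; 20]);
        |     saxpy(&mut y, 0.5, &x);
        |     println!("{:?}", &y[..4]);
        | }
        | ```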
       | 
       | Separately, is Julia hitting a different LA backend? Rust's
       | ndarray with a blas-src on Accelerate is pretty fast, but the
        | Rust implementation is a little slower on my MacBook. This is a
       | benchmark of a dot product.
       | 
        | ```
        | use divan::Bencher;
        | use ndarray::Array;
        | 
        | const M10: usize = 10_000_000;
        | 
        | #[divan::bench]
        | fn ndarray_dot32(b: Bencher) {
        |     b.with_inputs(|| {
        |         (Array::from_vec(vec![0f32; M10]),
        |          Array::from_vec(vec![0f32; M10]))
        |     })
        |     .bench_values(|(a, b)| a.dot(&b));
        | }
        | 
        | #[divan::bench]
        | fn chunks_dot32(b: Bencher) {
        |     b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
        |         .bench_values(|(a, b)| {
        |             a.chunks_exact(32)
        |                 .zip(b.chunks_exact(32))
        |                 .map(|(a, b)| {
        |                     a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>()
        |                 })
        |                 .sum::<f32>()
        |         });
        | }
        | 
        | #[divan::bench]
        | fn iter_dot32(b: Bencher) {
        |     b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
        |         .bench_values(|(a, b)| {
        |             a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>()
        |         });
        | }
        | 
        | ---- Rust ----
        | Timer precision: 41 ns (100 samples)
        | flops            fast      | slow      | median    | mean
        | +- chunks_dot32  3.903 ms  | 9.96 ms   | 4.366 ms  | 4.411 ms
        | +- chunks_dot64  4.697 ms  | 16.29 ms  | 5.472 ms  | 5.516 ms
        | +- iter_dot32    10.37 ms  | 11.36 ms  | 10.93 ms  | 10.86 ms
        | +- iter_dot64    11.68 ms  | 13.07 ms  | 12.43 ms  | 12.4 ms
        | +- ndarray_dot32 1.984 ms  | 2.91 ms   | 2.44 ms   | 2.381 ms
        | +- ndarray_dot64 4.021 ms  | 5.718 ms  | 5.141 ms  | 4.965 ms
        | 
        | ---- Julia ----
        | native_dot32:  Median: 1.623 ms, Mean: 1.633 ms +- 341.705 ms,
        |                Range: 1.275 ms - 12.242 ms
        | native_dot64:  Median: 5.286 ms, Mean: 5.179 ms +- 230.997 ms,
        |                Range: 4.736 ms - 5.617 ms
        | simd_dot32:    Median: 1.818 ms, Mean: 1.830 ms +- 142.826 ms,
        |                Range: 1.558 ms - 2.169 ms
        | simd_dot64:    Median: 3.564 ms, Mean: 3.567 ms +- 586.002 ms,
        |                Range: 3.123 ms - 22.887 ms
        | iter_dot32:    Median: 9.566 ms, Mean: 9.549 ms +- 144.503 ms,
        |                Range: 9.302 ms - 10.941 ms
        | iter_dot64:    Median: 9.666 ms, Mean: 9.640 ms +- 84.481 ms,
        |                Range: 9.310 ms - 9.867 ms
        | 
        | All: 0 bytes, 0 allocs
        | ```
       | 
       | https://github.com/trueb2/flops-bench
        
       | TinkersW wrote:
        | I only clicked through the slides and didn't watch the video...
        | but ugh, all I see is scalar SIMD in the assembly output (the ss
        | suffix means scalar; it would be ps if it were packed/vector).
       | 
       | And they are apparently relying on the compiler to generate
       | it...just no.
       | 
        | Use intrinsics, it ain't that hard.
        
       ___________________________________________________________________
       (page generated 2024-09-28 23:00 UTC)