[HN Gopher] Building a compile-time SIMD optimized smoothing filter
___________________________________________________________________
Building a compile-time SIMD optimized smoothing filter
Author : PaulHoule
Score : 69 points
Date : 2024-09-28 12:12 UTC (10 hours ago)
(HTM) web link (scientificcomputing.rs)
(TXT) w3m dump (scientificcomputing.rs)
| keldaris wrote:
| This looks like a nice case study for when you're already using
| Rust for other reasons and just want to make a bit of numerical
| code go fast. However, as someone mostly writing C++ and Julia,
| this does not look promising at all - it's clear that the Julia
| implementation is both more elegant and faster, and it seems much
| easier to reproduce that result in C++ (which has no issues with
| compile time float constants, SIMD, GPU support, etc.) than Rust.
|
| I've written very little Rust myself, but when I've tried, I've
| always come away with a similar impression that it's just not a
| good fit for performant numerical computing, with seemingly basic
| things (like proper SIMD support, const generics without weird
| restrictions, etc.) considered afterthoughts. For those more up
| to speed on Rust development, is this impression accurate, or
| have I missed something and should reconsider my view?
| gauge_field wrote:
| In terms of speed, Rust is up there with C/C++. See e.g.
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
| I also ported several algorithms from C, and they matched the
| original performance.
|
| Regarding SIMD support, the only things missing are stable
| support for avx512 and some of the more exotic feature
| extensions for deep learning, e.g. avx_vnni. Those are
| implemented and waiting to be included in upcoming stable
| versions.
|
| GPU support: this is still an issue because not enough people
| are working on it, but there are projects trying to improve
| the situation: see https://github.com/tracel-ai/cubecl .
|
| Const generics: yeah, there are a few annoying issues: they are
| limited to a small set of types. For instance, you can't use an
| enum as a const generic. Also, you can't use generic parameters
| in const expressions on stable Rust; see the unstable feature
| generic_const_exprs. A sketch of the limitation follows below.
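|
| To make that concrete, here is a minimal sketch (my
| illustration, not from the talk) of what stable Rust rejects
| and a common workaround:
|
| ```rust
| struct Window<const N: usize> {
|     coeffs: [f32; N],
| }
|
| impl<const N: usize> Window<N> {
|     // Does NOT compile on stable: using a generic parameter in
|     // a const expression needs the unstable generic_const_exprs
|     // feature on nightly.
|     // fn padded(&self) -> [f32; N + 2] { ... }
|
|     // Stable workaround: take the derived length as a second
|     // const parameter and check the relationship at runtime.
|     fn padded<const M: usize>(&self) -> [f32; M] {
|         assert!(M == N + 2);
|         let mut out = [0.0; M];
|         out[1..1 + N].copy_from_slice(&self.coeffs);
|         out
|     }
| }
| ```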
|
| My main reasons for using Rust in numerical computing:
|
| - The type system. Some find it weird; I find it explicit and
| easier to understand.
|
| - Cargo (nicer cross-platform defaults, since I tend to develop
| on both Windows and Linux).
|
| - Unconditional code generation with #[target_feature(enable =
| "feature_list")]. This means I don't have to set a different
| set of flags for each compilation unit when building; it is
| enough to put that attribute on the function making use of
| SIMD. A sketch of the pattern follows below.
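|
| A minimal sketch of that pattern (my illustration, x86_64 only;
| sum_avx2 and sum are made-up names):
|
| ```rust
| use std::arch::x86_64::*;
|
| // The attribute compiles this one function with AVX2 enabled,
| // independent of crate-wide -C target-feature flags. The caller
| // must guarantee CPU support, hence the runtime check below.
| #[target_feature(enable = "avx2")]
| unsafe fn sum_avx2(xs: &[f32]) -> f32 {
|     let chunks = xs.chunks_exact(8);
|     let tail = chunks.remainder();
|     let mut acc = _mm256_setzero_ps();
|     for c in chunks {
|         acc = _mm256_add_ps(acc, _mm256_loadu_ps(c.as_ptr()));
|     }
|     // Horizontal reduction of the 8 lanes, plus the scalar tail.
|     let mut lanes = [0f32; 8];
|     _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
|     lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
| }
|
| fn sum(xs: &[f32]) -> f32 {
|     if is_x86_feature_detected!("avx2") {
|         unsafe { sum_avx2(xs) } // sound: feature verified at runtime
|     } else {
|         xs.iter().sum()
|     }
| }
| ```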
|
| I agree that if you want to be fast and exploratory when
| developing algorithms and can sacrifice a little bit of
| performance, Julia is a better choice.
| bobmcnamara wrote:
| TBH, Intel ISAs have never been very stable; the mixed AVX
| flavors are just the latest example.
| adgjlsfhk1 wrote:
| This sort of thing is where Julia really shines.
| https://github.com/miguelraz/StagedFilters.jl/blob/master/sr...
| is the Julia code: it's only 65 lines and uses some fairly
| clean generated code to get optimal performance for all
| floating point types.
| jonathrg wrote:
| Tight loops of SIMD operations seem like something that might
| be more convenient to just implement directly in assembly, so
| you don't need to babysit the compiler like this.
| adgjlsfhk1 wrote:
| Counterpoint: that makes you rewrite it for every architecture
| and every datatype. A high-level language makes it a lot easier
| to get something more readable, runnable everywhere, and
| datatype generic (see the sketch below).
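|
| A minimal sketch of the datatype-generic point (my
| illustration, not from the thread): one kernel, which the
| compiler instantiates per type and can auto-vectorize for each.
|
| ```rust
| use std::ops::{Add, Mul};
|
| // One generic kernel; the compiler monomorphizes it for each
| // datatype and is free to auto-vectorize every instantiation.
| fn weighted_sum<T>(xs: &[T], ws: &[T]) -> T
| where
|     T: Copy + Default + Add<Output = T> + Mul<Output = T>,
| {
|     xs.iter()
|         .zip(ws)
|         .fold(T::default(), |acc, (&x, &w)| acc + x * w)
| }
|
| fn main() {
|     let s32 = weighted_sum(&[1.0f32, 2.0, 3.0], &[0.25f32, 0.5, 0.25]);
|     let s64 = weighted_sum(&[1.0f64, 2.0, 3.0], &[0.25f64, 0.5, 0.25]);
|     println!("{s32} {s64}");
| }
| ```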
| brrrrrm wrote:
| You almost always have to for good perf on non-trivial
| operations
| menaerus wrote:
| If you're ready to settle for much lower IPC, then yes.
| adgjlsfhk1 wrote:
| The Julia version of this
| (https://github.com/miguelraz/StagedFilters.jl) is very
| clearly generating pretty great code. For
| `smooth!(SavitzkyGolayFilter{4,3}, rand(Float64, 1000),
| rand(Float64, 1000))`, 36 out of 74 of the inner loop
| instructions are vectorized fmadds (of one kind or
| another). There are a few register spills that seem
| plausibly unnecessary, and some dependency chains that I
| think are an inherent part of the algorithm, but I'm pretty
| sure there isn't an additional 2x of speed to be had here.
| menaerus wrote:
| I can write SIMD code that's 2x the speed of the non-
| vectorized code, but I can also rewrite that same SIMD
| code so that the 2x becomes 6x.
|
| The point being: you can usually get not just another 2x
| on top of an initial 2x SIMD implementation, but much
| more.
|
| Whether or not you see SIMD in the codegen is not a
| testament to how good the implementation really is.
|
| IPC is the relevant measure here.
| adgjlsfhk1 wrote:
| IPC looks like 3.5 instructions per clock (and for bigger
| inputs the speed will be memory bound rather than compute
| bound anyway).
| jonathrg wrote:
| For libraries, yes, that is a concern. In practice, you're
| often dealing with just one or a handful of instances of
| the problem at hand.
| anonymoushn wrote:
| I'm happy to use compilers for register allocation, even though
| they aren't particularly good at it
| menaerus wrote:
| Definitely. And I also found the title quite misleading: it's
| auto-vectorization that this presentation is trying to
| cover. Compile-time SIMD, OTOH, would mean something totally
| different, e.g. computation during compile time using
| constexpr SIMD intrinsics.
|
| I'd also add that it's not only about babysitting the compiler;
| you're also leaving a lot of performance on the table.
| Auto-vectorized code, generally speaking, unfortunately cannot
| beat manually vectorized code (either through intrinsics or
| asm; see the sketch below).
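|
| For illustration (mine, not from the talk), this is the kind
| of hand-tuning an auto-vectorizer rarely does for you:
| multiple independent accumulators to hide FMA latency.
| x86_64-only sketch; dot_fma is a made-up name.
|
| ```rust
| use std::arch::x86_64::*;
|
| #[target_feature(enable = "avx2,fma")]
| unsafe fn dot_fma(a: &[f32], b: &[f32]) -> f32 {
|     assert_eq!(a.len(), b.len());
|     // Two independent chains keep the FMA units busy instead of
|     // serializing every iteration on a single accumulator.
|     let (mut s0, mut s1) = (_mm256_setzero_ps(), _mm256_setzero_ps());
|     let mut i = 0;
|     while i + 16 <= a.len() {
|         s0 = _mm256_fmadd_ps(
|             _mm256_loadu_ps(a.as_ptr().add(i)),
|             _mm256_loadu_ps(b.as_ptr().add(i)),
|             s0,
|         );
|         s1 = _mm256_fmadd_ps(
|             _mm256_loadu_ps(a.as_ptr().add(i + 8)),
|             _mm256_loadu_ps(b.as_ptr().add(i + 8)),
|             s1,
|         );
|         i += 16;
|     }
|     // Horizontal reduction, then the scalar tail.
|     let mut lanes = [0f32; 8];
|     _mm256_storeu_ps(lanes.as_mut_ptr(), _mm256_add_ps(s0, s1));
|     let mut sum: f32 = lanes.iter().sum();
|     while i < a.len() {
|         sum += a[i] * b[i];
|         i += 1;
|     }
|     sum
| }
| ```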
| feverzsj wrote:
| I think most Rust projects still depend on clang's vector
| extensions when SIMD is required.
| jtrueb wrote:
| I think scientific computing in Rust is getting too much
| attention without contribution lately. Many essential language
| features are not stabilized: SIMD, generic const exprs,
| intrinsics, function optimization overrides, and reasonable
| floating point overrides related to fast math are all a long
| way off. To get better perf, the code fills up with informal
| compiler hints to nudge it towards an optimization like
| autovectorization or branch elision (see the sketch below). The
| semantics around strict floating point standards are stifling,
| and intrinsics have become less accessible than they used to
| be.
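|
| A minimal sketch of that hint style (my illustration; scale_add
| is a made-up name): an up-front length assertion lets the
| optimizer drop per-element bounds checks, which is often the
| difference between a scalar loop and an autovectorized one.
|
| ```rust
| fn scale_add(out: &mut [f32], a: &[f32], b: &[f32], k: f32) {
|     let n = out.len();
|     // Informal hint: proves to the optimizer that all indexing
|     // below is in range, so the bounds checks can be elided.
|     assert!(a.len() >= n && b.len() >= n);
|     for i in 0..n {
|         out[i] = a[i].mul_add(k, b[i]); // fused multiply-add
|     }
| }
| ```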
|
| Separately, is Julia hitting a different LA backend? Rust's
| ndarray with blas-src on Accelerate is pretty fast, but the
| Rust implementation is a little slower on my MacBook. This is
| a benchmark of a dot product.
|
| ```
| use divan::Bencher; // imports added for completeness
| use ndarray::Array;
|
| const M10: usize = 10_000_000;
|
| #[divan::bench]
| fn ndarray_dot32(b: Bencher) {
|     b.with_inputs(|| {
|         (
|             Array::from_vec(vec![0f32; M10]),
|             Array::from_vec(vec![0f32; M10]),
|         )
|     })
|     .bench_values(|(a, b)| a.dot(&b));
| }
|
| #[divan::bench]
| fn chunks_dot32(b: Bencher) {
|     b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
|         .bench_values(|(a, b)| {
|             a.chunks_exact(32)
|                 .zip(b.chunks_exact(32))
|                 .map(|(a, b)| a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>())
|                 .sum::<f32>()
|         });
| }
|
| #[divan::bench]
| fn iter_dot32(b: Bencher) {
|     // was `M100` as posted, but only M10 is defined; M10 matches
|     // the results below. The *_dot64 variants (not shown) are
|     // analogous with f64.
|     b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
|         .bench_values(|(a, b)| a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>());
| }
| ```
|
| ```
| ---- Rust ----
| Timer precision: 41 ns (100 samples)
|
| flops           fast     | slow     | median   | mean
| chunks_dot32    3.903 ms | 9.96 ms  | 4.366 ms | 4.411 ms
| chunks_dot64    4.697 ms | 16.29 ms | 5.472 ms | 5.516 ms
| iter_dot32      10.37 ms | 11.36 ms | 10.93 ms | 10.86 ms
| iter_dot64      11.68 ms | 13.07 ms | 12.43 ms | 12.4 ms
| ndarray_dot32   1.984 ms | 2.91 ms  | 2.44 ms  | 2.381 ms
| ndarray_dot64   4.021 ms | 5.718 ms | 5.141 ms | 4.965 ms
|
| ---- Julia ----
| native_dot32: Median: 1.623 ms, Mean: 1.633 ms +- 341.705 ms, Range: 1.275 ms - 12.242 ms
| native_dot64: Median: 5.286 ms, Mean: 5.179 ms +- 230.997 ms, Range: 4.736 ms - 5.617 ms
| simd_dot32:   Median: 1.818 ms, Mean: 1.830 ms +- 142.826 ms, Range: 1.558 ms - 2.169 ms
| simd_dot64:   Median: 3.564 ms, Mean: 3.567 ms +- 586.002 ms, Range: 3.123 ms - 22.887 ms
| iter_dot32:   Median: 9.566 ms, Mean: 9.549 ms +- 144.503 ms, Range: 9.302 ms - 10.941 ms
| iter_dot64:   Median: 9.666 ms, Mean: 9.640 ms +- 84.481 ms,  Range: 9.310 ms - 9.867 ms
| All: 0 bytes, 0 allocs
| ```
|
| https://github.com/trueb2/flops-bench
| TinkersW wrote:
| I only clicked through the slides and didn't watch the
| video... but ugh, all I see is scalar SIMD in the assembly
| output (the ss ending means scalar; it would be ps if it were
| vector).
|
| And they are apparently relying on the compiler to generate
| it... just no.
|
| Use intrinsics, it ain't that hard.
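|
| To illustrate the ss/ps distinction (my sketch, not from the
| slides; scalar_vs_packed is a made-up name):
|
| ```rust
| use std::arch::x86_64::*;
|
| #[target_feature(enable = "avx")]
| unsafe fn scalar_vs_packed(a: &[f32; 8], b: &[f32; 8]) -> (f32, [f32; 8]) {
|     // _mm_mul_ss lowers to mulss: multiplies only the low lane.
|     let s = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(a[0]), _mm_set_ss(b[0])));
|
|     // _mm256_mul_ps lowers to vmulps: multiplies all 8 lanes at once.
|     let v = _mm256_mul_ps(_mm256_loadu_ps(a.as_ptr()), _mm256_loadu_ps(b.as_ptr()));
|     let mut out = [0f32; 8];
|     _mm256_storeu_ps(out.as_mut_ptr(), v);
|     (s, out)
| }
| ```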
___________________________________________________________________
(page generated 2024-09-28 23:00 UTC)