[HN Gopher] Verifying Vectorized Rust Revisited
___________________________________________________________________
Verifying Vectorized Rust Revisited
Author : ingve
Score : 82 points
Date : 2021-05-16 07:51 UTC (15 hours ago)
(HTM) web link (project-oak.github.io)
(TXT) w3m dump (project-oak.github.io)
| ZeroGravitas wrote:
| What's the general process for proving to yourself that your
| hand-coded vectorization produces the same output and is faster
| than a more straightforward implementation?
|
| Seems like if you have such an implementation then it can be
| used for multiple tasks: running on weird platforms, checking
| whether autovectorization is good enough for your current use
| case, automated testing of the optimized routines against a
| slower, simpler version; and this verification process is just
| one more thing that could take advantage of it. Yet it would
| seem that this generally doesn't exist, or at least there's no
| standard way to do it, since they propose adding a standard
| compiler flag to disable the optimization but currently need
| to work around the lack of one creatively.
|
| Do people generally assume it's going to be faster and just jump
| into the vectorized stuff and treat that as the gold standard
| with no plan for reversing or revisiting that decision later?
| Seems like a good bet in the short term, but maybe a bad bet in
| the long term.
|
| Or is this all secretly a plot by chip manufacturers to create
| lock-in by writing fast but platform specific libraries?
| burntsushi wrote:
| > What's the general process for proving to yourself that your
| hand-coded vectorization produces the same output and is faster
| than a more straightforward implementation?
|
| I spent the first few months of this year doing exactly that
| for a memmem implementation:
| https://users.rust-lang.org/t/memchr-2-4-now-has-an-implemen...
|
| In short, I:
|
| * Ensure I have a non-vector implementation.
|
| * Define a large suite of benchmarks, and ensure that I test
| the non-vector implementation:
| https://github.com/BurntSushi/memchr/tree/master/bench
|
| * Have a thorough testing strategy, including hand written
| tests, exhaustive tests, property based tests and fuzzing:
| https://github.com/BurntSushi/memchr#testing-strategy
|
| > Yet it would seem that this generally doesn't exist or at
| least there's no standard way to do it
|
| In the context of hand writing explicit vectorized code, it's
| not that hard to make sure there's a way to disable the
| vectorized path. The issue in the OP is that there's no
| _standard_ way to do that for everything everywhere. While
| having such a standard way to do it would be useful for testing
| and reduce some friction in the process, it's not a Huge Deal.
|
| > Do people generally assume it's going to be faster and just
| jump into the vectorized stuff and treat that as the gold
| standard with no plan for reversing or revisiting that decision
| later? Seems like a good bet in the short term, but maybe a bad
| bet in the long term.
|
| No? Vectorization usually requires some kind of platform or ISA
| extension specific stuff. So if you have it, it's usually
| conditional and there's usually a "slow" path.
|
| In many cases, the perf benefits of vectorization are really
| really really clear.
|
| > Or is this all secretly a plot by chip manufacturers to
| create lock-in by writing fast but platform specific libraries?
|
| No? There do exist various SIMD abstractions that present a
| common interface to vector instructions across many targets. In
| cases where I can use that, once std::simd is a thing in Rust,
| I will definitely do so. But the SIMD abstractions generally
| won't expose everything---just the things in common across
| targets.
| jart wrote:
| SSE SIMD instructions should be unencumbered by patents at this
| point. So it's worthwhile to "bite the bullet" rather than do
| something like -mgeneral-regs-only. SIMD opcodes frequently give
| me 10x or even 54x speedups for the functions I write.
|
| I know this isn't Rust, but if anyone needs a more readable
| explanation of what the SIMD opcodes actually mean (i.e. more
| readable than intel manual) then check out the cosmopolitan
| codebase. I implemented all of them in ANSI C and called it
| ANSITRINSICS just so I didn't have to use Intel's ugly intrinsics
| API and wouldn't have to feel guilty about muh portability. It's
| well-tested and fuzzed against the real thing, so I hope it can
| serve as excellent documentation to anyone who wants to
| understand the great mystery of vector instructions:
| https://github.com/jart/cosmopolitan/tree/master/libc/intrin
|
| A good example is PUNPCKLBW which is such an important
| instruction but a name like that is very intimidating to someone
| who hasn't rote memorized its meaning. I think ANSI C does a much
| better job explaining what these instructions do than the
| pictures in the Intel manual do.
| https://github.com/jart/cosmopolitan/blob/48645651983ca46a31...
| Particularly when it comes to ones like PMADDUBSW which is
| without a doubt the greatest of the SSE instructions.
| https://github.com/jart/cosmopolitan/blob/48645651983ca46a31...
|
| In many cases if you compile these ansitrinsics with
| -ftree-vectorize you'll notice that the compiler recognizes
| the ANSI C
| code boilerplate and compiles it down into the underlying simd
| instruction! I think GCC and Clang should be hard coded to
| recognize all of them. That way we can write truly portable code
| that's fast too.
| nwmcsween wrote:
| Trying to understand what SIMD intrinsics offer in a 'libc'
| that SWAR couldn't do. For absolutely massive inputs to
| strlen, etc., sure, but IIRC the average strlen argument is
| something like 10 bytes or less.
| jart wrote:
| In the case of strlen, SSE goes a few cycles slower for n<8,
| gains a slight advantage over SWAR for n>8, and starts going
| ~twice as fast for n>255. Benchmarks:
| https://justine.lol/swarsimd.txt The difference is probably
| more pronounced in functions that do block copying like
| https://github.com/jart/cosmopolitan/blob/master/libc/str/st...
| (which actually uses the ANSITRINSICS library). For more
| complicated functions like this:
| https://github.com/jart/cosmopolitan/blob/b3838173ecbb67fae5...
| SIMD has the advantage
| that figuring out how to do the SWAR arithmetic correctly
| might be more difficult than figuring out how to use the
| hardware API. It goes without saying that the most compelling
| benefits of SSE are the instructions like PMADDUBSW which can
| be used to great effect for signal decimation
| https://github.com/jart/cosmopolitan/blob/b3838173ecbb67fae5...
| assuming you
| use The Perfect Kernel {-1, -3, 3, 17, 17, 3, -3, -1}
| http://www.johncostella.com/magic/ or PADDSW which can mix
| audio samples real fast doing saturating arithmetic. You
| can't SWAR that. Another great example is carryless
| multiplication instructions which literally make CRC32
| computation go 54x faster.
| jdonaldson wrote:
| Thank you for this, you're doing the Lord's work.
| Klasiaster wrote:
| The RVT project does a really fascinating mix of ground work and
| plumbing to make it useful. I like the idea of being able to turn
| randomized testing into verification. Having different backends
| for cargo-verify and the program annotations is also a nice way
| of unifying the various projects and still allowing for
| innovation to happen in a modular manner.
| potiuper wrote:
| Seems like it would have been more effective to write a verifier
| for x87 rather than an emulator.
___________________________________________________________________
(page generated 2021-05-16 23:01 UTC)