[HN Gopher] Verifying Vectorized Rust Revisited
       ___________________________________________________________________
        
       Verifying Vectorized Rust Revisited
        
       Author : ingve
       Score  : 82 points
       Date   : 2021-05-16 07:51 UTC (15 hours ago)
        
 (HTM) web link (project-oak.github.io)
 (TXT) w3m dump (project-oak.github.io)
        
       | ZeroGravitas wrote:
       | What's the general process for proving to yourself that your
       | hand-coded vectorization produces the same output and is faster
       | than a more straightforward implementation?
       | 
        | Seems like if you have such an implementation then it can be
        | used for multiple tasks: running on weird platforms, checking
        | whether autovectorization is good enough for your current use
        | case, and automated testing of the optimized routines against
        | a slower, simpler version. This verification process is just
        | one more thing that could take advantage of it. Yet it seems
        | that this generally doesn't exist, or at least that there's
        | no standard way to do it, since they propose adding a
        | standard compiler flag to disable vectorization but currently
        | have to work around the lack of one creatively.
       | 
       | Do people generally assume it's going to be faster and just jump
       | into the vectorized stuff and treat that as the gold standard
       | with no plan for reversing or revisiting that decision later?
       | Seems like a good bet in the short term, but maybe a bad bet in
       | the long term.
       | 
       | Or is this all secretly a plot by chip manufacturers to create
       | lock-in by writing fast but platform specific libraries?
        
         | burntsushi wrote:
         | > What's the general process for proving to yourself that your
         | hand-coded vectorization produces the same output and is faster
         | than a more straightforward implementation?
         | 
         | I spent the first few months of this year doing exactly that
         | for a memmem implementation: https://users.rust-
         | lang.org/t/memchr-2-4-now-has-an-implemen...
         | 
         | In short, I:
         | 
         | * Ensure I have a non-vector implementation.
         | 
          | * Define a large suite of benchmarks, and make sure the
          | non-vector implementation is measured in them too:
         | https://github.com/BurntSushi/memchr/tree/master/bench
         | 
          | * Have a thorough testing strategy, including hand-written
          | tests, exhaustive tests, property-based tests and fuzzing
          | (the differential idea is sketched below):
          | https://github.com/BurntSushi/memchr#testing-strategy
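          | 
          | A minimal sketch of the differential part (not the actual
          | memchr harness; assumes the memchr crate as a dependency):
          | compare the optimized path against an obviously-correct
          | scalar loop, exhaustively on small inputs:
          | 
          |     /// Obviously-correct scalar reference.
          |     fn naive(needle: u8, haystack: &[u8]) -> Option<usize> {
          |         haystack.iter().position(|&b| b == needle)
          |     }
          | 
          |     #[test]
          |     fn matches_naive_on_small_inputs() {
          |         // Exhaustive over all lengths up to 64 and all
          |         // needle bytes; real suites layer property tests
          |         // and fuzzing on top of this.
          |         for len in 0..64 {
          |             let haystack: Vec<u8> =
          |                 (0..len).map(|i| (i * 7 % 256) as u8).collect();
          |             for needle in 0u8..=255 {
          |                 assert_eq!(
          |                     memchr::memchr(needle, &haystack),
          |                     naive(needle, &haystack),
          |                 );
          |             }
          |         }
          |     }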
         | 
         | > Yet it would seem that this generally doesn't exist or at
         | least there's no standard way to do it
         | 
          | In the context of hand-writing explicit vectorized code,
          | it's not that hard to make sure there's a way to disable
          | the vectorized path. The issue in the OP is that there's no
          | _standard_ way to do that for everything everywhere. While
          | having such a standard way to do it would be useful for
          | testing and reduce some friction in the process, it's not a
          | Huge Deal. One ad hoc shape of such a switch is sketched
          | below.
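          | 
          | For instance (the "force-fallback" feature name is made up;
          | real crates spell this knob differently):
          | 
          |     /// Gate the fast path behind a cargo feature so tests
          |     /// and benchmarks can force the fallback.
          |     pub fn find(haystack: &[u8], needle: u8) -> Option<usize> {
          |         if cfg!(feature = "force-fallback") {
          |             return haystack.iter().position(|&b| b == needle);
          |         }
          |         find_vectorized(haystack, needle)
          |     }
          | 
          |     // Stub standing in for the real hand-written SIMD path.
          |     fn find_vectorized(haystack: &[u8], needle: u8) -> Option<usize> {
          |         haystack.iter().position(|&b| b == needle)
          |     }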
         | 
         | > Do people generally assume it's going to be faster and just
         | jump into the vectorized stuff and treat that as the gold
         | standard with no plan for reversing or revisiting that decision
         | later? Seems like a good bet in the short term, but maybe a bad
         | bet in the long term.
         | 
          | No? Vectorization usually requires some kind of platform-
          | or ISA-extension-specific stuff. So if you have it, it's
          | usually conditional and there's usually a "slow" path.
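          | 
          | The usual shape of that dispatch, e.g. on x86_64 (a sketch
          | with a made-up function; the detection macro and the
          | intrinsics are std):
          | 
          |     #[cfg(target_arch = "x86_64")]
          |     pub fn count_zeros(haystack: &[u8]) -> usize {
          |         if is_x86_feature_detected!("avx2") {
          |             // SAFETY: AVX2 support was just verified at runtime.
          |             unsafe { count_zeros_avx2(haystack) }
          |         } else {
          |             count_zeros_scalar(haystack) // the "slow" path
          |         }
          |     }
          | 
          |     #[cfg(target_arch = "x86_64")]
          |     #[target_feature(enable = "avx2")]
          |     unsafe fn count_zeros_avx2(haystack: &[u8]) -> usize {
          |         use std::arch::x86_64::*;
          |         let mut chunks = haystack.chunks_exact(32);
          |         let mut count = 0usize;
          |         for chunk in &mut chunks {
          |             // Compare 32 bytes against zero at once; each
          |             // equal byte sets one bit in the movemask.
          |             let ptr = chunk.as_ptr() as *const __m256i;
          |             let v = _mm256_loadu_si256(ptr);
          |             let eq = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
          |             let mask = _mm256_movemask_epi8(eq) as u32;
          |             count += mask.count_ones() as usize;
          |         }
          |         count + count_zeros_scalar(chunks.remainder())
          |     }
          | 
          |     fn count_zeros_scalar(haystack: &[u8]) -> usize {
          |         haystack.iter().filter(|&&b| b == 0).count()
          |     }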
         | 
         | In many cases, the perf benefits of vectorization are really
         | really really clear.
         | 
         | > Or is this all secretly a plot by chip manufacturers to
         | create lock-in by writing fast but platform specific libraries?
         | 
         | No? There do exist various SIMD abstractions that present a
         | common interface to vector instructions across many targets. In
         | cases where I can use that, once std::simd is a thing in Rust,
         | I will definitely do so. But the SIMD abstractions generally
         | won't expose everything---just the things in common across
         | targets.
        
       | jart wrote:
       | SSE SIMD instructions should be unencumbered by patents at this
       | point. So it's worthwhile to "bite the bullet" rather than do
       | something like -mgeneral-regs-only. SIMD opcodes frequently give
       | me 10x or even 54x speedups for the functions I write.
       | 
       | I know this isn't Rust, but if anyone needs a more readable
        | explanation of what the SIMD opcodes actually mean (i.e. more
        | readable than the Intel manual) then check out the cosmopolitan
       | codebase. I implemented all of them in ANSI C and called it
       | ANSITRINSICS just so I didn't have to use Intel's ugly intrinsics
       | API and wouldn't have to feel guilty about muh portability. It's
       | well-tested and fuzzed against the real thing, so I hope it can
       | serve as excellent documentation to anyone who wants to
       | understand the great mystery of vector instructions:
       | https://github.com/jart/cosmopolitan/tree/master/libc/intrin
       | 
        | A good example is PUNPCKLBW, which is such an important
        | instruction, but a name like that is very intimidating to
        | someone who hasn't rote-memorized its meaning. I think ANSI C
        | does a much better job explaining what these instructions do
        | than the pictures in the Intel manual do.
       | https://github.com/jart/cosmopolitan/blob/48645651983ca46a31...
        | Particularly when it comes to ones like PMADDUBSW, which is
        | without a doubt the greatest of the SSE instructions.
       | https://github.com/jart/cosmopolitan/blob/48645651983ca46a31...
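        | 
        | Since the thread is about Rust: the same explain-with-plain-
        | code trick carries over. My reading of those two instructions
        | as scalar Rust (sketches, not code from the repo):
        | 
        |     // PUNPCKLBW: interleave the low 8 bytes of two 16-byte
        |     // vectors into [a0, b0, a1, b1, ..., a7, b7].
        |     fn punpcklbw(a: [u8; 16], b: [u8; 16]) -> [u8; 16] {
        |         let mut out = [0u8; 16];
        |         for i in 0..8 {
        |             out[2 * i] = a[i];
        |             out[2 * i + 1] = b[i];
        |         }
        |         out
        |     }
        | 
        |     // PMADDUBSW: multiply unsigned bytes of `a` by signed
        |     // bytes of `b`, then add adjacent products into
        |     // signed-saturated 16-bit lanes.
        |     fn pmaddubsw(a: [u8; 16], b: [i8; 16]) -> [i16; 8] {
        |         let mut out = [0i16; 8];
        |         for i in 0..8 {
        |             let lo = a[2 * i] as i32 * b[2 * i] as i32;
        |             let hi = a[2 * i + 1] as i32 * b[2 * i + 1] as i32;
        |             // Saturate to the i16 range instead of wrapping.
        |             out[i] = (lo + hi).max(-32768).min(32767) as i16;
        |         }
        |         out
        |     }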
       | 
        | In many cases if you compile these ansitrinsics with
        | -ftree-vectorize you'll notice that the compiler recognizes
        | the ANSI C boilerplate and compiles it down into the
        | underlying SIMD instruction! I think GCC and Clang should be
        | hardcoded to recognize all of them. That way we can write
        | truly portable code that's fast too.
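        | 
        | The same pattern-matching happens from Rust, since it shares
        | LLVM's autovectorizer. A loop like this typically compiles
        | down to PADDUSB-style packed adds at opt-level=3 (worth
        | checking the emitted asm rather than assuming):
        | 
        |     // Saturating-add two byte slices; simple enough that
        |     // LLVM usually vectorizes the loop body.
        |     pub fn add_saturating(a: &[u8], b: &[u8], out: &mut [u8]) {
        |         for ((&x, &y), o) in a.iter().zip(b).zip(out.iter_mut()) {
        |             *o = x.saturating_add(y);
        |         }
        |     }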
        
         | nwmcsween wrote:
          | Trying to understand: what do SIMD intrinsics offer in a
          | 'libc' that SWAR couldn't do? For absolutely massive inputs
          | to strlen, etc., sure, but IIRC the average strlen is
          | something like 10 or less.
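          | 
          | (By SWAR I mean word-at-a-time tricks like the classic
          | zero-byte test; a Rust sketch on a u64 word:)
          | 
          |     // A byte of v is zero iff its high bit survives in
          |     // (v - LO) & !v & HI.
          |     const LO: u64 = 0x0101_0101_0101_0101;
          |     const HI: u64 = 0x8080_8080_8080_8080;
          | 
          |     fn has_zero_byte(v: u64) -> bool {
          |         (v.wrapping_sub(LO) & !v & HI) != 0
          |     }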
        
            | jart wrote:
            | In the case of strlen, SSE goes a few cycles slower for
            | n<8, gains a slight advantage over SWAR for n>8, and
            | starts going ~twice as fast for n>255. Benchmarks:
            | https://justine.lol/swarsimd.txt
            | 
            | The difference is probably more pronounced in functions
            | that do block copying like
            | https://github.com/jart/cosmopolitan/blob/master/libc/str/st...
            | (which actually uses the ANSITRINSICS library). For more
            | complicated functions like this:
            | https://github.com/jart/cosmopolitan/blob/b3838173ecbb67fae5...
            | SIMD has the advantage that figuring out how to do the
            | SWAR arithmetic correctly might be more difficult than
            | figuring out how to use the hardware API.
            | 
            | It goes without saying that the most compelling benefits
            | of SSE are instructions like PMADDUBSW, which can be used
            | to great effect for signal decimation
            | https://github.com/jart/cosmopolitan/blob/b3838173ecbb67fae5...
            | assuming you use The Perfect Kernel
            | {-1, -3, 3, 17, 17, 3, -3, -1}
            | http://www.johncostella.com/magic/ or PADDSW, which can
            | mix audio samples real fast doing saturating arithmetic.
            | You can't SWAR that. Another great example is carryless
            | multiplication instructions, which literally make CRC32
            | computation go 54x faster.
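            | 
            | (Scalar picture of what PADDSW buys you, assuming i16
            | samples; the instruction does eight of these at once:)
            | 
            |     // Mix two audio buffers with signed saturation,
            |     // clamping at i16::MIN / i16::MAX instead of wrapping.
            |     fn mix(a: &[i16], b: &[i16], out: &mut [i16]) {
            |         for ((&x, &y), o) in a.iter().zip(b).zip(out.iter_mut()) {
            |             *o = x.saturating_add(y);
            |         }
            |     }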
        
         | jdonaldson wrote:
         | Thank you for this, you're doing the Lord's work.
        
       | Klasiaster wrote:
        | The RVT project does a really fascinating mix of groundwork and
       | plumbing to make it useful. I like the idea of being able to turn
       | randomized testing into verification. Having different backends
       | for cargo-verify and the program annotations is also a nice way
       | of unifying the various projects and still allowing for
       | innovation to happen in a modular manner.
        
       | potiuper wrote:
       | Seems like it would have been more effective to write a verifier
       | for x87 rather than an emulator.
        
       ___________________________________________________________________
       (page generated 2021-05-16 23:01 UTC)