[HN Gopher] SIMD Everywhere Optimization from ARM Neon to RISC-V...
___________________________________________________________________
SIMD Everywhere Optimization from ARM Neon to RISC-V Vector
Extensions
Author : camel-cdr
Score : 68 points
Date : 2023-09-29 15:54 UTC (7 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| almatabata wrote:
| Very neat. I hope this will get easier to do in the future once
| languages start including these SIMD semantics in the language
| itself, like Rust tries to do:
|
| https://doc.rust-lang.org/std/simd/struct.Simd.html
|
| Libraries implemented in languages without these semantics will
| greatly benefit from this.
| geertj wrote:
| And C++:
|
| https://en.cppreference.com/w/cpp/experimental/simd
|
| This proposal has been around for a while, but it recently got
| some new momentum and seems to be on track for C++26. GCC ships
| a version today for those wanting to try it.
| dzaima wrote:
| A problem is that most such things (the rust thing, C++'s
| experimental/simd, Zig's SIMD types) have the vector size as a
| compile-time property, while ARM's SVE and RISC-V's RVV are
| designed such that it's possible to write portable code that
| can work for a range of implementation widths. Thus such a
| fixed-width SIMD library would be forced to target the minimum
| (128-bit) even if the hardware supports 256-bit, 512-bit, or
| more. (SVE supports up to 1024-bit, RVV - up to 65536-bit)
|
| There is Highway (https://github.com/google/highway) however,
| that does support dynamically-sized SIMD.
| almatabata wrote:
| If you compile the code by specifying the target as native,
| you could get around that limitation, no?
| cozzyd wrote:
| Yes, but then, if you are distributing binaries, you need a
| different binary for each SIMD width.
| almatabata wrote:
| Ah, makes sense. If you have complete control over your
| hardware it could work, but for open source projects and
| businesses with a wide customer base it might not.
|
| Compiled languages like Rust, C++ and Zig cannot detect
| the hardware because they have no runtime, right? Could a
| language like Go add the SIMD semantics and detect the
| supported vector size?
| dzaima wrote:
| The problem isn't detecting the width (that's trivially
| possible at runtime with a single instruction, though
| both SVE and RVV have a way to write loops such that you
| don't even need to).
|
| The problem is that a "Simd<i32, 4>" will always have 4
| elements, but you'd need a "Simd<i32, whatever the
| hardware has>" type, which has significant impact on what
| is possible to do with such a type.
| almatabata wrote:
| Ah, thank you for clarifying. So you would have to create
| an abstraction layer on top of the current SIMD
| implementation, for example simd_vector(type, size).
| That abstraction would have to dynamically detect the
| hardware and dispatch to it, like the project
| you shared (https://github.com/google/highway).
|
| So technically it sounds feasible, but all of the
| languages like Zig, C++ and Rust picked a simpler
| approach. Is it simply a first step to a more abstract
| approach?
| dzaima wrote:
| Not really - you don't need to dispatch anything, the
| idea is that the same code (and thus the same
| assembly/machine code) can operate on different sizes by
| itself. e.g. with RVV "vsetvli x0,x0,e32,m1,ta,ma;
| vadd.vv v0, v1, v2" on a CPU with 128-bit vectors will do
| 4 element additions, but on a CPU with 1024-bit vectors
| it'll do 32 additions.
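|
| To make the "same code, different widths" point concrete, here is
| a scalar C++ model of such a loop (a sketch, not real RVV code).
| set_vl, add_i32 and vlen_bits are hypothetical names; set_vl
| mimics what vsetvli reports for e32/m1 in the simplest case, and
| real hardware is allowed to split the tail differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of vsetvli for 32-bit elements at LMUL=1: returns how many
// elements the next "vector instruction" will process.
// vlen_bits is a hypothetical stand-in for the CPU's vector register width.
std::size_t set_vl(std::size_t remaining, std::size_t vlen_bits) {
    const std::size_t vlmax = vlen_bits / 32;  // elements per register
    return std::min(remaining, vlmax);
}

// c[i] = a[i] + b[i]. Returns the number of loop iterations, i.e. how many
// modelled vadd.vv instructions were issued.
std::size_t add_i32(const std::vector<std::int32_t>& a,
                    const std::vector<std::int32_t>& b,
                    std::vector<std::int32_t>& c,
                    std::size_t vlen_bits) {
    assert(a.size() == b.size());
    assert(vlen_bits >= 32);
    c.resize(a.size());
    std::size_t steps = 0;
    for (std::size_t i = 0; i < a.size();) {
        const std::size_t vl = set_vl(a.size() - i, vlen_bits);
        // This inner loop models a single vadd.vv over vl elements.
        for (std::size_t j = 0; j < vl; ++j) {
            c[i + j] = a[i + j] + b[i + j];
        }
        i += vl;
        ++steps;
    }
    return steps;
}
```

| The loop body never mentions a fixed element count. With 100
| elements it takes 25 steps when vlen_bits is 128 and 4 steps when
| it is 1024, and the results are identical.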
|
| And some things you just can't really "generalize" to
| scalable vectors. e.g. you can store Simd<i32,4> in a
| struct or global variables, or initialize with, say,
| [3,2,1,0], but none of those things are possible with
| scalable vectors (globals/struct fields need a known
| size, and initializing with a hard-coded list of elements
| doesn't make much sense if you don't even know how many
| elements you'll need).
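|
| A sketch of what the fixed-width case allows, using a
| hypothetical FixedSimd<T, N> as a stand-in for types like
| Simd<i32, 4> (it is not any library's actual type; Particle and
| kIota are made-up names too):

```cpp
#include <cstddef>
#include <cstdint>

// The lane count N is a template parameter, so it is fixed at compile time.
template <typename T, std::size_t N>
struct FixedSimd {
    T lanes[N];

    // Lane-wise addition; the trip count is a compile-time constant.
    constexpr FixedSimd operator+(const FixedSimd& rhs) const {
        FixedSimd out{};
        for (std::size_t i = 0; i < N; ++i) {
            out.lanes[i] = lanes[i] + rhs.lanes[i];
        }
        return out;
    }
};

// 1. The size is known at compile time.
static_assert(sizeof(FixedSimd<std::int32_t, 4>) == 16, "4 lanes x 4 bytes");

// 2. So the type can be a struct field.
struct Particle {
    FixedSimd<std::int32_t, 4> position;
};

// 3. And it can be a global initialized from a hard-coded element list.
constexpr FixedSimd<std::int32_t, 4> kIota{{3, 2, 1, 0}};
```

| A scalable vector has no N to put in the template, so none of
| the three numbered declarations has an obvious equivalent.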
| Conscat wrote:
| C++ comes with a runtime which, among many other things,
| allows you to detect the microarchitecture and featureset
| of the environment you're running on using
| `__builtin_cpu_init()` which calls a dynamically linked
| function `__cpu_indicator_init()`. Then using the
| `cpu_dispatch`, `target`, or `target_clones` attributes
| you can compile multiple variations of an algorithm in
| your program and dynamically select the one to execute.
| This is referred to as a "fat binary", and the feature is
| "multifunctions" or "multiversioned functions".
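|
| A hand-rolled version of that dispatch might look like the
| sketch below, assuming GCC or Clang on x86. The builtins and the
| target attribute are compiler extensions rather than standard
| C++, and scale, scale_baseline, scale_avx2 and select_scale are
| made-up names. On other compilers or targets the sketch simply
| falls back to the baseline loop.

```cpp
#include <cassert>
#include <cstddef>

// Baseline variant: compiled for the default target.
static void scale_baseline(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= k;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
// Same source loop, but the compiler is allowed to use AVX2 instructions
// when it vectorizes this particular function.
__attribute__((target("avx2")))
static void scale_avx2(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= k;
}
#endif

using ScaleFn = void (*)(float*, std::size_t, float);

// Pick a variant based on what the running CPU reports.
static ScaleFn select_scale() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return scale_avx2;
#endif
    return scale_baseline;
}

// Public entry point. The selection runs once (on first call); afterwards
// each call costs one indirect call around the whole loop.
void scale(float* data, std::size_t n, float k) {
    assert(data != nullptr || n == 0);
    static const ScaleFn impl = select_scale();
    impl(data, n, k);
}
```

| The target_clones attribute asks the compiler to generate the
| variants and the selection logic itself, so the two copies of
| the loop above would collapse into one annotated function.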
|
| Zig intends to support a similar feature but doesn't yet,
| at least not built into the language (you could certainly
| express this if you tried hard enough). I don't know
| about Rust, but I would be very surprised if it can't do
| this.
|
| edit: I think I replied to the wrong comment >.<
| vkazanov wrote:
| You can actually autogenerate all reasonable variants of
| the code if necessary; there aren't that many
| architectures these days. SIMD instructions are usually
| very local, so this shouldn't blow up the binary.
|
| The point is to not have to write repetitive source code
| many times.
| Pet_Ant wrote:
| What is the real cost to just have those few methods be
| compiled in and then a branch? You don't need to ship a
| separate binary for each target, you can have dead code
| in it. I mean fat binaries take this idea to the extreme
| to support multiple architectures.
|
| https://en.wikipedia.org/wiki/Fat_binary
| elabajaba wrote:
| Not being able to inline and having to branch on every
| call to a simd function can sometimes make it slower than
| the basic scalar version.
| Pet_Ant wrote:
| Just a thought, but would it be possible to hot patch at
| the time of loading the binary? I realise it might
| require updates to the binary format, but it might be
| very well justified.
| dzaima wrote:
| You should branch at the level where inlining doesn't make
| sense, which would usually be some function wrapping the
| big loop, so the branch should be nearly free. That is the
| same situation as on x86-64 if you want to target
| pre-AVX2/post-AVX2/AVX-512.
| dzaima wrote:
| That's 5 copies for SVE (had an error in first message -
| SVE allows up to 2048-bit vectors, not 1024), and 10
| copies for RVV if you wanted to target all widths (though
| you'd probably be fine for a decade or a couple by
| targeting just 128 & 256-bit, and maybe 512-bit). Plus
| one more for a scalar fallback.
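|
| Those counts are just the number of power-of-two widths in each
| range, which a few lines of C++ can check. count_widths is a
| hypothetical helper, and the sketch assumes only power-of-two
| widths are of interest:

```cpp
#include <cstddef>

// Number of power-of-two vector widths in [min_bits, max_bits], inclusive.
constexpr std::size_t count_widths(std::size_t min_bits, std::size_t max_bits) {
    std::size_t count = 0;
    for (std::size_t w = min_bits; w <= max_bits; w *= 2) ++count;
    return count;
}

// SVE: 128, 256, 512, 1024, 2048.
constexpr std::size_t kSveCopies = count_widths(128, 2048);   // 5
// RVV: 128 up to 65536.
constexpr std::size_t kRvvCopies = count_widths(128, 65536);  // 10
```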
|
| And yes, it's not particularly large of a cost, other
| than it being an extremely pointless waste of space given
| that it is possible to have just one variant that covers
| them all.
|
| Though, it would become significantly more problematic if
| you wanted to target different extension groups too
| (which you would quite likely want to some extent) as
| those'd multiply with all the length targets - SVE vs
| SVE2 vs more future extensions, and on RVV there's just a
| lot (Zvfh & Zvfhmin for FP16, Zvbb for extra bitmanip
| stuff, many more here[1]; and potentially at some point
| there could be an extension that uses a wider encoding
| scheme to inline vsetvl fields & allow masking by
| registers other than v0, which could benefit everything)
|
| [1]: https://github.com/riscv/riscv-crypto/blob/c8ddeb7e64a3444dd...
| snvzz wrote:
| RISC-V is rapidly building the strongest ecosystem.
| adgjlsfhk1 wrote:
| and hug of death
| kierank wrote:
| The paper suggests FFmpeg uses intrinsics, which is not correct.
|
| There have been many SIMD abstraction layers created in the past
| but none of them will beat the raw speed of handwritten assembly.
| Try and implement something like vpternlogd in one of these
| abstraction layers.
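|
| For readers who haven't met it: vpternlogd computes an arbitrary
| three-input bitwise function selected by an 8-bit truth table. A
| scalar C++ reference for one 32-bit lane might look like the
| sketch below. ternlog32 is a made-up name, and this only shows
| what the instruction computes, not how an abstraction layer
| would map to it.

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of vpternlogd.
// For each bit position, the three input bits form the index
// (a << 2) | (b << 1) | c into the 8-bit truth table imm8.
constexpr std::uint32_t ternlog32(std::uint32_t a, std::uint32_t b,
                                  std::uint32_t c, std::uint8_t imm8) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const unsigned index = (((a >> bit) & 1u) << 2) |
                               (((b >> bit) & 1u) << 1) |
                               ((c >> bit) & 1u);
        result |= static_cast<std::uint32_t>((imm8 >> index) & 1u) << bit;
    }
    return result;
}
```

| For example, imm8 = 0x96 gives a ^ b ^ c and imm8 = 0xCA gives a
| bitwise select (b where a is set, c elsewhere), each as a single
| instruction on hardware that has it.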
| dist1ll wrote:
| The main abstraction of intrinsics is register allocation,
| right? Is there anything else that can be gained by handwritten
| asm?
| camel-cdr wrote:
| For rvv specifically there are a few things that aren't
| possible using the intrinsics abstraction.
|
| E.g. in asm you can run the same instruction sequence with
| different vtype (element width and LMUL).
| Danidada wrote:
| Neat project!
|
| However, I'm pretty sure OpenCV has their "universal intrinsics"
| and RISC-V with scalable vector registers is supported in the
| latest OpenCV version
|
| Universal intrinsics (docs not updated):
| https://docs.opencv.org/4.x/d6/dd1/tutorial_univ_intrin.html
| Scalable RVV support: https://github.com/opencv/opencv/pull/22179
| KingLancelot wrote:
| We need to do better than ISA specific intrinsics.
|
| There should be a simd.h header in the C standard library that
| contains typedefs for vector types and various functions to
| operate on them, as well as operators for them.
|
| Like my _Operator <symbol> <function name>; proposal, which
| requires no mangling.
| camel-cdr wrote:
| This doesn't seem to be upstreamed yet.
|
| I hope they have real hardware performance numbers for the rv
| summit talk.
| biocrusoe wrote:
| SIMDe maintainer here, I would welcome a PR; yes!
| atdt wrote:
| Highway (https://github.com/google/highway), Google's SIMD
| library, lets you write length-agnostic SIMD code. It has
| excellent support for a wide range of targets, including both
| RISC-V and Arm vector extensions.
| mgaunard wrote:
| There are so many SIMD libraries nowadays.
|
| I myself implemented one in the SSE4/Altivec days (later extended
| to AVX, AVX512 and NEON). There were only a few options then, but
| now everyone seems to be doing it.
| biocrusoe wrote:
| Archived copy:
| https://web.archive.org/web/20230929161438/https://arxiv.org...
|
| Direct link to PDF: https://arxiv.org/pdf/2309.16509.pdf
___________________________________________________________________
(page generated 2023-09-29 23:00 UTC)