[HN Gopher] Fundamental flaws of SIMD ISAs (2021)
       ___________________________________________________________________
        
       Fundamental flaws of SIMD ISAs (2021)
        
       Author : fanf2
       Score  : 142 points
       Date   : 2025-04-24 14:42 UTC (1 days ago)
        
 (HTM) web link (www.bitsnbites.eu)
 (TXT) w3m dump (www.bitsnbites.eu)
        
       | pkhuong wrote:
       | There's more to SIMD than BLAS.
       | https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u... .
        
         | camel-cdr wrote:
         | BLAS, specifically gemm, is one of the rare things where you
         | naturally need to specialize on vector register width.
         | 
          | Most problems don't require this, e.g. your basic
          | parallelizable math stuff, unicode conversion, base64
          | de/encode, json parsing, set intersection, quicksort*, bigint,
          | run length encoding, chacha20, ...
         | 
          | And if you run into a problem that benefits from knowing the
          | SIMD width, then just specialize on it. You can totally use
          | variable-length SIMD ISAs in a fixed-length way when required.
          | But most of the time it isn't required, and you have code that
          | easily scales between vector lengths.
         | 
          | *quicksort: most time is spent partitioning, which is vector
          | length agnostic; you can handle the leaves in a vector length
          | agnostic way, but you'll get more efficient code if you
          | specialize (idk how big the impact is; in-register bitonic sort
          | is quite efficient).
        
       | convolvatron wrote:
       | i would certainly add lack of reductions ('horizontal'
       | operations) and a more generalized model of communication to the
       | list.
        
         | adgjlsfhk1 wrote:
         | The tricky part with reductions is that they are somewhat
         | inherently slow since they often need to be done pairwise and a
         | pairwise reduction over 16 elements will naturally have pretty
         | limited parallelism.
        
           | convolvatron wrote:
           | kinda? this is sort of a direct result of the 'vectors are
           | just sliced registers' model. if i do a pairwise operation
           | and divide my domain by 2 at each step, is the resulting
            | vector sparse or dense? if it's dense then I only really top
           | out when i'm in the last log2slice steps.
        
             | sweetjuly wrote:
             | Yes, but this is not cheap for hardware. CPU designers love
             | SIMD because it lets them just slap down ALUs in parallel
             | and get 32x performance boosts. Reductions, however, are
             | not entirely parallel and instead have a relatively huge
             | gate depth.
             | 
             | For example, suppose an add operation has a latency of one
              | unit in some silicon process. To add-reduce a 32-element
              | vector, you'll have a five-deep tree, which means your
             | operation has a latency of five units. You can pipeline
             | this, but you can't solve the fact that this operation has
             | a 5x higher latency than the non-reduce operations.
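              | 
              | A minimal sketch of that dependency chain in C intrinsics
              | (hypothetical helper, AVX for brevity): a log2-depth
              | horizontal sum of an 8-lane float vector takes three
              | dependent adds; 32 lanes would take five.
              | 
              |     #include <immintrin.h>
              | 
              |     /* horizontal sum of 8 floats via a pairwise tree */
              |     static float hsum256_ps(__m256 v) {
              |         __m128 lo = _mm256_castps256_ps128(v);
              |         __m128 hi = _mm256_extractf128_ps(v, 1);
              |         __m128 s = _mm_add_ps(lo, hi);      /* depth 1 */
              |         s = _mm_add_ps(s, _mm_movehl_ps(s, s)); /* 2 */
              |         s = _mm_add_ss(s,
              |             _mm_shuffle_ps(s, s, 0x55));    /* depth 3 */
              |         return _mm_cvtss_f32(s);
              |     }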
        
               | adgjlsfhk1 wrote:
               | for some, you can cheat a little (e.g. partial sum
               | reduction), but then you don't get to share IP with the
               | rest of your ALUs. I do really want to see what an
               | optimal 32 wide reduction and circuit looks like. For
               | integers, you pretty clearly can do much better than
               | pairwise reduction. float sounds tricky
        
         | dzaima wrote:
         | ARM NEON does have sum, min, and max reductions (and/or
         | reductions can just be min/max if all bits in elements are the
          | same), along with pairwise ops. RVV has
          | sum/min/max/and/or/xor reductions. x86 has psadbw, which sums
          | windows of eight 8-bit ints, and various instructions for
          | some pairwise horizontal stuff.
         | 
         | But in general building code around reductions isn't really a
         | thing you'd ideally do; they're necessarily gonna have higher
         | latency / lower throughput / take more silicon compared to
          | avoiding them where possible. Best to leave reducing to a
         | single element to the loop tail.
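          | 
          | For instance (a sketch in AArch64 NEON intrinsics; the helper
          | name is made up), a sum loop can keep a vector accumulator and
          | pay for only one reduction in the tail:
          | 
          |     #include <arm_neon.h>
          |     #include <stddef.h>
          | 
          |     float sum_f32(const float *x, size_t n) {
          |         float32x4_t acc = vdupq_n_f32(0.0f);
          |         size_t i = 0;
          |         for (; i + 4 <= n; i += 4)       /* vertical adds */
          |             acc = vaddq_f32(acc, vld1q_f32(x + i));
          |         float s = vaddvq_f32(acc);       /* one ADDV here */
          |         for (; i < n; i++) s += x[i];    /* scalar remainder */
          |         return s;
          |     }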
        
       | TinkersW wrote:
       | I write a lot of SIMD and I don't really agree with this..
       | 
       |  _Flaw1:fixed width_
       | 
        | I prefer fixed width as it makes the code simpler to write; the
        | size is known at compile time, so we know the size of our
        | structures. Swizzle algorithms are also customized based on the
        | size.
       | 
       |  _Flaw2:pipelining_
       | 
        | no CPU I care about is in-order, so this is mostly irrelevant,
        | and even scalar instructions are pipelined
       | 
       |  _Flaw3: tail handling_
       | 
        | I code with SIMD as the target, and have special containers that
        | pad memory to SIMD width, so there's no need to mask or run a
        | scalar loop. I copy the last valid value into the remaining
        | slots so it doesn't cause any branch divergence.
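        | 
        | A rough sketch of that padding idea (names and the 8-lane width
        | are just for illustration): round the allocation up to the SIMD
        | width and replicate the last valid element into the pad, so the
        | final full-width iteration reads well-defined data:
        | 
        |     #include <stdlib.h>
        |     #include <string.h>
        | 
        |     #define LANES 8   /* e.g. 8 floats per 256-bit register */
        | 
        |     float *alloc_padded(const float *src, size_t n) {
        |         size_t padded = (n + LANES - 1) / LANES * LANES;
        |         float *p = aligned_alloc(32, padded * sizeof(float));
        |         if (!p) return NULL;
        |         memcpy(p, src, n * sizeof(float));
        |         for (size_t i = n; i < padded; i++)   /* fill the pad */
        |             p[i] = n ? src[n - 1] : 0.0f;
        |         return p;
        |     }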
        
         | cjbgkagh wrote:
         | I have similar thoughts,
         | 
          | I don't understand the push for variable width SIMD. Possibly
          | due to ignorance, but I think of it as an abstraction that can
          | be specialized for different hardware, so tradeoffs similar to
          | those between low level and high level languages apply. Since
          | I already have to be aware of hardware level concepts such as
          | 256-bit shuffles not working across 128-bit lanes, and
          | different instructions having very different performance
          | characteristics on different CPUs, I'm already knee deep in
          | hardware specifics. While in general I like abstractions, I've
          | largely given up waiting for a 'sufficiently advanced compiler'
          | that would properly auto-vectorize my code. I think AGI is
          | more likely to happen sooner. At a guess, SIMD code could work
          | on GPUs, but GPU code has different memory access costs, so
          | the code there would also be completely different.
         | 
         | So my view is either create a much better higher level SIMD
         | abstraction model with a sufficiently advanced compiler that
         | knows all the tricks or let me work closely at the hardware
         | level.
         | 
          | As an outsider who doesn't really know what is going on, it
          | does worry me a bit that WASM appears to be pushing for
          | variable width SIMD instead of supporting the ISAs generally
          | supported by CPUs. I guess it's a portability vs performance
          | tradeoff - I worry that it may be difficult to make variable as
         | performant as fixed width and would prefer to deal with
          | portability by having alternative branches at code level.
          | 
          | >> Finally, any software that wants to use the new instruction
         | set needs to be rewritten (or at least recompiled). What is
         | worse, software developers often have to target several SIMD
         | generations, and add mechanisms to their programs that
         | dynamically select the optimal code paths depending on which
         | SIMD generation is supported.
         | 
          | Why not marry the two and have variable width SIMD as one of
          | the ISA options? If in the future variable width SIMD becomes
          | more performant, then it would just be another branch to
          | dynamically select.
        
           | kevingadd wrote:
           | Part of the motive behind variable width SIMD in WASM is that
           | there's intentionally-ish no mechanism to do feature
           | detection at runtime in WASM. The whole module has to be
           | valid on your target, you can't include a handful of invalid
           | functions and conditionally execute them if the target
           | supports 256-wide or 512-wide SIMD. If you want to adapt you
           | have to ship entire modules for each set of supported feature
           | flags and select the correct module at startup after probing
           | what the target supports.
           | 
           | So variable width SIMD solves this by making any module using
           | it valid regardless of whether the target supports 512-bit
           | vectors, and the VM 'just' has to solve the problem of
           | generating good code.
           | 
           | Personally I think this is a terrible way to do things and
           | there should have just been a feature detection system, but
           | the horse fled the barn on that one like a decade ago.
        
             | __abadams__ wrote:
             | It would be very easy to support 512-bit vectors
             | everywhere, and just emulate them on most systems with a
             | small number of smaller vectors. It's easy for a compiler
             | to generate good code for this. Clang does it well if you
             | use its built-in vector types (which can be any length).
             | Variable-length vectors, on the other hand, are a very
             | challenging problem for compiler devs. You tend to get
             | worse code out than if you just statically picked a size,
             | even if it's not the native size.
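              | 
              | A small sketch of what that looks like with the Clang/GCC
              | vector extensions (the type name is arbitrary): declare a
              | 512-bit logical vector and let codegen split it into
              | whatever the target actually has, e.g. two 256-bit or four
              | 128-bit operations:
              | 
              |     typedef float f32x16
              |         __attribute__((vector_size(64)));
              | 
              |     f32x16 axpy(f32x16 a, f32x16 x, f32x16 y) {
              |         return a * x + y;  /* element-wise; split by the
              |                               compiler as needed */
              |     }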
        
               | jandrewrogers wrote:
               | The risk of 512-bit vectors everywhere is that many
               | algorithms will spill the registers pretty badly if
               | implemented in e.g. 128-bit vectors under the hood. In
               | such cases you may be better off with a completely
               | different algorithm implementation.
        
               | codedokode wrote:
               | Variable length vectors seem to be made for closed-source
               | manually-written assembly (nobody wants to unroll the
                | loop manually and nobody will rewrite it for a new
                | register width).
        
               | Someone wrote:
               | > It would be very easy to support 512-bit vectors
               | everywhere, and just emulate them on most systems with a
               | small number of smaller vectors. It's easy for a compiler
               | to generate good code for this
               | 
               | Wouldn't that be suboptimal if/when CPUs that support
               | 1024-bit vectors come along?
               | 
               | > Variable-length vectors, on the other hand, are a very
               | challenging problem for compiler devs. You tend to get
               | worse code out than if you just statically picked a size,
               | even if it's not the native size.
               | 
               | Why would it be challenging? You could statically pick a
               | size on a system with variable-length vectors, too. How
               | would that be worse code?
        
               | kevingadd wrote:
               | Optimal performance in a vector algorithm typically
               | requires optimizing around things like the number of
               | available registers, whether the registers in use are
               | volatile (mandating stack spills when calling other
               | functions like a comparer), and sizes of sequences.
               | 
               | If you know you're engineering for 16-byte vectors you
               | can 'just' align all your data to 16 bytes. And if you
               | know you have 8 vector registers where 4 of them are non-
               | volatile you can design around that too. But without
               | information like that you have to be defensive, like
               | aligning all your data to 128 bytes instead Just In Case
               | (heaven forbid native vectors get bigger than that),
               | minimizing the number of registers you use to try and
               | avoid stack spills, etc. (I mention this because WASM
               | also doesn't expose any of this information.)
               | 
               | It's true that you could just design for a static size on
               | a system with variable-length vectors. I suspect you'd
               | see a lot of people do that, and potentially under-
               | utilize the hardware's capabilities. Better than nothing,
               | at least!
        
               | dataflow wrote:
               | > Wouldn't that be suboptimal if/when CPUs that support
               | 1024-bit vectors come along?
               | 
               | Is that likely or on anyone's roadmap? It makes a little
               | less sense than 512 bits, at least for Intel, since their
               | cache lines are 64 bytes i.e. 512 bits. Any more than
               | that and they'd have to mess with multiple cache lines
                | _all_ the time, not just on unaligned accesses. And
                | they'd have to support crossing more than 2 cache lines
                | on unaligned accesses. They could increase the cache line
                | size too, but that seems terrible for compatibility since
                | a lot of programs assume it's a compile time constant
                | (and it'd have performance overhead to make it a run-time
                | value). Somehow it feels like this isn't the way to go,
               | but hey, I'm not a CPU architect.
        
         | camel-cdr wrote:
         | > I prefer fixed width
         | 
         | Do you have examples for problems that are easier to solve in
         | fixed-width SIMD?
         | 
         | I maintain that most problems can be solved in a vector-length-
         | agnostic manner. Even if it's slightly more tricky, it's
         | certainly easier than restructuring all of your memory
         | allocations to add padding and implementing three versions for
         | all the differently sized SIMD extensions your target may
         | support. And you can always fall back to using a variable-width
         | SIMD ISA in a fixed-width way, when necessary.
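          | 
          | For a flavor of what the fixed-vs-agnostic difference looks
          | like in practice, here is a vector-length-agnostic loop
          | sketched with the RVV C intrinsics (assuming the v1.0
          | intrinsics; untested):
          | 
          |     #include <riscv_vector.h>
          |     #include <stdint.h>
          | 
          |     void addv(int32_t *d, const int32_t *a,
          |               const int32_t *b, size_t n) {
          |         for (size_t vl; n > 0;
          |              n -= vl, a += vl, b += vl, d += vl) {
          |             vl = __riscv_vsetvl_e32m8(n);
          |             vint32m8_t va = __riscv_vle32_v_i32m8(a, vl);
          |             vint32m8_t vb = __riscv_vle32_v_i32m8(b, vl);
          |             __riscv_vse32_v_i32m8(
          |                 d, __riscv_vadd_vv_i32m8(va, vb, vl), vl);
          |         }
          |     }
          | 
          | The same binary runs unchanged whatever VLEN the hardware has;
          | each iteration just gets handed back a different vl.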
        
           | jcranmer wrote:
           | There's a category of autovectorization known as Superword-
           | Level Parallelism (SLP) which effectively scavenges an entire
           | basic block for individual instruction sequences that might
           | be squeezed together into a SIMD instruction. This kind of
           | vectorization doesn't work well with vector-length-agnostic
           | ISAs, because you generally can't scavenge more than a few
           | elements anyways, and inducing any sort of dynamic vector
           | length is more likely to slow your code down as a result
           | (since you can't do constant folding).
           | 
           | There's other kinds of interesting things you can do with
           | vectors that aren't improved by dynamic-length vectors.
           | Something like abseil's hash table, which uses vector code to
           | efficiently manage the occupancy bitmap. Dynamic vector
           | length doesn't help that much in that case, particularly
           | because the vector length you can parallelize over is itself
           | intrinsically low (if you're scanning dozens of elements to
            | find an empty slot, something is wrong). Vector swizzling is
            | harder to do dynamically and, at high vector factors,
            | difficult to do generically in hardware, which means that as
            | you go to larger vectors (even before considering dynamic
            | sizes), vectorization is trickier if you have to do a lot of
            | swizzling.
           | 
           | In general, vector-length-agnostic is really only good for
            | SIMT-like codes, where you can express the vector body as
           | more or less independent f(index) for some knowable-before-
           | you-execute-the-loop range of indices. Stuff like DAXPY or
           | BLAS in general. Move away from this model, and that
           | agnosticism becomes overhead that doesn't pay for itself.
           | (Now granted, this kind of model is a _large fraction_ of
            | parallelizable code, but it's far from all of it.)
        
             | camel-cdr wrote:
             | The SLP vectorizer is a good point, but I think it's, in
             | comparison with x86, more a problem of the float and vector
             | register files not being shared (in SVE and RVV). You don't
             | need to reconfigure the vector length; just use it at the
             | full width.
             | 
             | > Something like abseil's hash table
             | 
             | If I remember this correctly, the abseil lookup does scale
             | with vector length, as long as you use the native data path
             | width. (albeit with small gains) There is a problem with
             | vector length agnostic handling of abseil, which is the
             | iterator API. With a different API, or compilers that could
             | eliminate redundant predicated load/stores, this would be
             | easier.
             | 
             | > good for SIMT-like codes
             | 
             | Certainly, but I've also seen/written a lot of vector
             | length agnostic code using shuffles, which don't fit into
             | the SIMT paradigm, which means that the scope is larger
             | than just SIMT.
             | 
             | ---
             | 
             | As a general comparison, take AVX10/128, AVX10/256 and
             | AVX10/512, overlap their instruction encodings, remove the
             | few instructions that don't make sense anymore, and add a
             | cheap instruction to query the vector length. (probably
             | also instructions like vid and viota, for easier shuffle
             | synthesization) Now you have a variable-length SIMD ISA
             | that feels familiar.
             | 
             | The above is basically what SVE is.
        
             | jonstewart wrote:
             | A number of the cool string processing SIMD techniques
             | depend a _lot_ on register widths and instruction
             | performance characteristics. There's a fair argument to be
             | made that x64 could be made more consistent/legible for
             | these use cases, but this isn't matmul--whether you have
             | 128, 256, or 512 bits matters hugely and you may want
             | entirely different algorithms that are contingent on this.
        
           | jandrewrogers wrote:
           | I also prefer fixed width. At least in C++, all of the
           | padding, alignment, etc is automagically codegen-ed for the
           | register type in my use cases, so the overhead is
           | approximately zero. All the complexity and cost is in
           | specializing for the capabilities of the underlying SIMD ISA,
           | not the width.
           | 
           | The benefit of fixed width is that optimal data structure and
           | algorithm design on various microarchitectures is dependent
            | on explicitly knowing the register width. SIMD widths aren't
            | perfectly substitutable in practice; there is more at
           | play than stride size. You can also do things like explicitly
           | combine separate logic streams in a single SIMD instruction
           | based on knowing the word layout. Compilers don't do this
           | work in 2025.
           | 
           | The argument for vector width agnostic code seems predicated
           | on the proverbial "sufficiently advanced compiler". I will
           | likely retire from the industry before such a compiler
           | actually exists. Like fusion power, it has been ten years
           | away my entire life.
        
             | camel-cdr wrote:
              | > The argument for vector width agnostic code seems
             | predicated on the proverbial "sufficiently advanced
             | compiler".
             | 
             | A SIMD ISA having a fixed size or not is orthogonal to
             | autovectorization. E.g. I've seen a bunch of cases where
             | things get autovectorized for RVV but not for AVX512. The
             | reason isn't fixed vs variable, but rather the supported
             | instructions themselves.
             | 
             | There are two things I'd like from a "sufficiently advanced
             | compiler", which are sizeless struct support and redundant
             | predicated load/store elimination. Those don't
             | fundamentally add new capabilities, but makes working
             | with/integrating into existing APIs easier.
             | 
             | > All the complexity and cost is in specializing for the
             | capabilities of the underlying SIMD ISA, not the width.
             | 
             | Wow, it almost sounds like you could take basically the
             | same code and run it with different vector lengths.
             | 
             | > The benefit of fixed width is that optimal data structure
             | and algorithm design on various microarchitectures is
             | dependent on explicitly knowing the register width
             | 
             | Optimal to what degree? Like sure, fixed-width SIMD can
             | always turn your pointer increments from a register add to
             | an immediate add, so it's always more "optimal", but that
             | sort of thing doesn't matter.
             | 
             | The only difference you usually encounter when writing
             | variable instead of fixed size code is that you have to
             | synthesize your shuffles outside the loop. This usually
             | just takes a few instructions, but loading a constant is
             | certainly easier.
        
               | jandrewrogers wrote:
               | The interplay of SIMD width and microarchitecture is more
               | important for performance engineering than you seem to be
                | assuming. Those codegen decisions are made at a layer above
               | anything being talked about here and they operate on
               | explicit awareness of things like register size.
               | 
               | It isn't "same instruction but wider or narrower" or
               | anything that can be trivially autovectorized, it is
               | "different algorithm design". Compilers are not yet
               | rewriting data structures and algorithms based on
               | microarchitecture.
               | 
               | I write a lot of SIMD code, mostly for database engines,
               | little of which is trivial "processing a vector of data
               | types" style code. AVX512 in particular is strong enough
               | of an ISA that it is used in all kinds of contexts that
                | we traditionally wouldn't think of as a good fit for SIMD.
               | You can build all kinds of neat quasi-scalar idioms with
               | it and people do.
        
               | synthos wrote:
               | > Compilers are not yet rewriting data structures and
               | algorithms based on microarchitecture.
               | 
               | Afaik mojo allows for this with the autotuning capability
               | and metaprogramming
        
           | pkhuong wrote:
           | > Do you have examples for problems that are easier to solve
           | in fixed-width SIMD?
           | 
           | Regular expression matching and encryption come to mind.
        
             | camel-cdr wrote:
             | > Regular expression matching
             | 
             | That's probably true. Last time I looked at it, it seemed
             | like parts of vectorscan could be vectorized VLA, but from
             | my, very limited, understanding of the main matching
             | algorithm, it does seem to require specialization on vector
             | length.
             | 
             | It should be possible to do VLA in some capacity, but it
             | would probably be slower and it's too much work to test.
             | 
             | > encryption
             | 
             | From the things I've looked at, it's mixed.
             | 
             | E.g. chacha20 and poly1305 vectorize well in a VLA scheme:
             | https://camel-cdr.github.io/rvv-bench-
             | results/bpi_f3/chacha2..., https://camel-cdr.github.io/rvv-
             | bench-results/bpi_f3/poly130...
             | 
             | Keccak on the other hand was optimized for fast execution
             | on scalar ISAs with 32 GPRs. This is hard to vectorize in
             | general, because GPR "moves" are free and liberally
             | applied.
             | 
             | Another example where it's probably worth specializing is
             | quicksort, specifically the leaf part.
             | 
             | I've written a VLA version, which uses bitonic sort to sort
             | within vector registers. I wasn't able to meaningfully
             | compare it against a fixed size implementation, because
             | vqsort was super slow when I tried to compile it for RVV.
        
         | aengelke wrote:
         | I agree; and the article seems to have also quite a few
         | technical flaws:
         | 
         | - Register width: we somewhat maxed out at 512 bits, with Intel
         | going back to 256 bits for non-server CPUs. I don't see larger
         | widths on the horizon (even if SVE theoretically supports up to
         | 2048 bits, I don't know any implementation with ~~>256~~ >512
         | bits). Larger bit widths are not beneficial for most
         | applications and the few applications that are (e.g., some HPC
         | codes) are nowadays served by GPUs.
         | 
         | - The post mentions available opcode space: while opcode space
         | is limited, a reasonably well-designed ISA (e.g., AArch64) has
         | enough holes for extensions. Adding new instructions doesn't
         | require ABI changes, and while adding new registers requires
         | some kernel changes, this is well understood at this point.
         | 
         | - "What is worse, software developers often have to target
         | several SIMD generations" -- no way around this, though, unless
         | auto-vectorization becomes substantially better. Adjusting the
         | register width is not the big problem when porting code, making
         | better use of available instructions is.
         | 
         | - "The packed SIMD paradigm is that there is a 1:1 mapping
         | between the register width and the execution unit width" -- no.
         | E.g., AMD's Zen 4 does double pumping, and AVX was IIRC
         | originally designed to support this as well (although Intel
         | went directly for 256-bit units).
         | 
         | - "At the same time many SIMD operations are pipelined and
         | require several clock cycles to complete" -- well, they _are_
         | pipelined, but many SIMD instructions have the same latency as
         | their scalar counterpart.
         | 
         | - "Consequently, loops have to be unrolled in order to avoid
          | stalls and keep the pipeline busy." -- loop unrolling has
          | several benefits, mostly to reduce the overhead of the loop and
          | to avoid data dependencies between loop iterations. Larger basic
         | blocks are better for hardware as every branch, even if
         | predicted correctly, has a small penalty. "Loop unrolling also
         | increases register pressure" -- it does, but code that really
         | requires >32 registers is extremely rare, so a good instruction
         | scheduler in the compiler can avoid spilling.
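          | 
          | (To illustrate the dependency point with a toy example: an
          | unrolled reduction only pays off if the unrolled iterations
          | use independent accumulators, which is where the extra
          | register pressure comes from.)
          | 
          |     #include <stddef.h>
          | 
          |     /* sketch: two chains instead of one loop-carried chain */
          |     float sum2(const float *a, size_t n) {
          |         float s0 = 0.0f, s1 = 0.0f;
          |         size_t i = 0;
          |         for (; i + 2 <= n; i += 2) {
          |             s0 += a[i];      /* chain 0 */
          |             s1 += a[i + 1];  /* chain 1, independent */
          |         }
          |         if (i < n) s0 += a[i];
          |         return s0 + s1;
          |     }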
         | 
         | In my experience, dynamic vector sizes make code slower,
         | because they inhibit optimizations. E.g., spilling a
         | dynamically sized vector is like a dynamic stack allocation
         | with a dynamic offset. I don't think SVE delivered any large
          | benefits, either in terms of performance (there's not much
          | hardware with SVE to begin with...) or compiler support.
          | RISC-V pushes further in this direction; we'll see how this
         | turns out.
        
           | cherryteastain wrote:
            | The Fujitsu A64FX used in the Fugaku supercomputer implements
            | SVE with a 512-bit vector width
        
             | aengelke wrote:
             | Thanks, I misremembered. However, the microarchitecture is
             | a bit "weird" (really HPC-targeted), with very long
             | latencies (e.g., ADD (vector) 4 cycles, FADD (vector) 9
             | cycles). I remember that it was much slower than older x86
             | CPUs for non-SIMD code, and even for SIMD code, it took
             | quite a bit of effort to get reasonable performance through
             | instruction-level parallelism due to the long latencies and
             | the very limited out-of-order capacities (in particular the
             | just 2x20 reservation station entries for FP).
        
           | camel-cdr wrote:
           | > we somewhat maxed out at 512 bits
           | 
           | Which still means you have to write your code at least
           | thrice, which is two times more than with a variable length
           | SIMD ISA.
           | 
            | Also there are processors with larger vector lengths, e.g.
            | 1024-bit: Andes AX45MPV, SiFive X380; 2048-bit: Akeana 1200;
            | 16384-bit: NEC SX-Aurora, Ara, EPI
           | 
           | > no way around this
           | 
           | You rarely need to rewrite SIMD code to take advantage of new
           | extensions, unless somebody decides to create a new one with
           | a larger SIMD width. This mostly happens when very
           | specialized instructions are added.
           | 
           | > In my experience, dynamic vector sizes make code slower,
           | because they inhibit optimizations.
           | 
           | Do you have more examples of this?
           | 
           | I don't see spilling as much of a problem, because you want
           | to avoid it regardless, and codegen for dynamic vector sizes
           | is pretty good in my experience.
           | 
           | > I don't think SVE delivered any large benefits
           | 
            | Well, all Arm CPUs except for the A64FX were built to execute
            | NEON as fast as possible. x86 CPUs aren't built to execute
            | MMX or SSE, or even the latest AVX, as fast as possible.
           | 
           | Anyway, I know of one comparison between NEON and SVE:
           | https://solidpixel.github.io/astcenc_meets_sve
           | 
           | > Performance was a lot better than I expected, giving
           | between 14 and 63% uplift. Larger block sizes benefitted the
           | most, as we get higher utilization of the wider vectors and
           | fewer idle lanes.
           | 
           | > I found the scale of the uplift somewhat surprising as
           | Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so
           | in terms of data-width the two should work out very similar.
        
             | vardump wrote:
             | > Which still means you have to write your code at least
             | thrice, which is two times more than with a variable length
             | SIMD ISA.
             | 
             | 256 and 512 bits are the only reasonable widths. 256 bit
             | AVX2 is what, 13 or 14 years old now.
        
               | adgjlsfhk1 wrote:
               | no. Because Intel is full of absolute idiots, Intel atom
               | didn't support AVX 1 until Gracemont. Tremont is missing
               | AVX1, AVX2, FMA, and basically the rest of X86v3, and
               | shipped in CPUs as recently as 2021 (Jasper Lake).
        
               | vardump wrote:
               | Oh damn. I've dropped SSE ages ago and no one complained.
               | I guess the customer base didn't use those chips...
        
               | ack_complete wrote:
               | Intel also shipped a bunch of Pentium-branded CPUs that
               | have AVX disabled, leading to oddities like a Kaby Lake
               | based CPU that doesn't have AVX, and even worse, also
               | shipped a few CPUs that have AVX2 but not BMI2:
               | 
               | https://sourceware.org/bugzilla/show_bug.cgi?id=29611
               | 
               | https://developercommunity.visualstudio.com/t/Crash-in-
               | Windo...
        
             | aengelke wrote:
             | > Also there are processors with larger vector length
             | 
             | How do these fare in terms of absolute performance? The NEC
             | TSUBASA is not a CPU.
             | 
             | > Do you have more examples of this?
             | 
             | I ported some numeric simulation kernel to the A64Fx some
             | time ago, fixing the vector width gave a 2x improvement.
             | Compilers probably/hopefully have gotten better in the mean
             | time and I haven't redone the experiments since then, but
             | I'd be surprised if this changed drastically. Spilling is
             | sometimes unavoidable, e.g. due to function calls.
             | 
             | > Anyway, I know of one comparison between NEON and SVE:
             | https://solidpixel.github.io/astcenc_meets_sve
             | 
             | I was specifically referring to dynamic vector sizes. This
             | experiment uses sizes fixed at compile-time, from the
             | article:
             | 
             | > For the astcenc implementation of SVE I decided to
             | implement a fixed-width 256-bit implementation, where the
             | vector length is known at compile time.
        
               | camel-cdr wrote:
               | > How do these fare in terms of absolute performance? The
               | NEC TSUBASA is not a CPU.
               | 
               | The NEC is an attached accelerator, but IIRC it can run
               | an OS in host mode. It's hard to tell how the others
               | perform, because most don't have hardware available yet
               | or only they and partner companies have access. It's also
               | hard to compare, because they don't target the desktop
               | market.
               | 
               | > I ported some numeric simulation kernel to the A64Fx
               | some time ago, fixing the vector width gave a 2x
               | improvement.
               | 
               | Oh, wow. Was this autovectorized or handwritten
               | intrinsics/assembly?
               | 
               | Any chance it's of a small enough scope that I could try
               | to recreate it?
               | 
               | > I was specifically referring to dynamic vector sizes.
               | 
               | Ah, sorry, yes you are correct. It still shows that
               | supporting VLA mechanisms in an ISA doesn't mean it's
               | slower for fixed-size usage.
               | 
               | I'm not aware of any proper VLA vs VLS comparisons. I
               | benchmarked a VLA vs VLS mandelbrot implementation once
                | where there was no performance difference, but that's too
                | simple an example.
        
             | codedokode wrote:
             | > Which still means you have to write your code at least
             | thrice, which is two times more than with a variable length
             | SIMD ISA.
             | 
              | This is a wrong approach. You should be writing your code
              | in a high-level language like this:
              | 
              |     x = sum i for 1..n: a[i] * b[i]
             | 
              | And let the compiler write the assembly for every existing
              | architecture (including a multi-threaded version of the
              | loop).
              | 
              | I don't understand the advantage of writing SIMD code
              | manually. At least have an LLM write it if you don't like
              | my imaginary high-level vector language.
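              | 
              | (The pseudocode above corresponds to a plain C loop that
              | compilers can already autovectorize, given flags that
              | allow float reassociation, e.g. something like clang -O3
              | -ffast-math; a sketch:)
              | 
              |     float dot(const float *a, const float *b, int n) {
              |         float x = 0.0f;
              |         for (int i = 0; i < n; i++)
              |             x += a[i] * b[i];
              |         return x;
              |     }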
        
               | otherjason wrote:
               | This is the common argument from proponents of compiler
               | autovectorization. An example like what you have is very
               | simple, so modern compilers would turn it into SIMD code
               | without a problem.
               | 
               | In practice, though, the cases that compilers can
               | successfully autovectorize are very limited relative to
               | the total problem space that SIMD is solving. Plus, if I
               | rely on that, it leaves me vulnerable to regressions in
               | the compiler vectorizer.
               | 
               | Ultimately for me, I would rather write the
               | implementation myself and know what is being generated
               | versus trying to write high-level code in just the right
               | way to make the compiler generate what I want.
        
           | deaddodo wrote:
           | > - Register width: we somewhat maxed out at 512 bits, with
           | Intel going back to 256 bits for non-server CPUs. I don't see
           | larger widths on the horizon (even if SVE theoretically
           | supports up to 2048 bits, I don't know any implementation
           | with >256 bits). Larger bit widths are not beneficial for
           | most applications and the few applications that are (e.g.,
           | some HPC codes) are nowadays served by GPUs.
           | 
           | Just to address this, it's pretty evident why scalar values
           | have stabilized at 64-bit and vectors at ~512 (though there
           | are larger implementations). Tell someone they only have 256
            | values to work with and they immediately see the limit; it's
           | why old 8-bit code wasted so much time shuffling carries to
           | compute larger values. Tell them you have 65536 values and it
           | alleviates a _large_ set of that problem, but you 're still
           | going to hit limits frequently. Now you have up to 4294967296
           | values and the limits are realistically only going to be hit
           | in computational realms, so bump it up to
           | 18446744073709551615. Now even most commodity computational
           | limits are alleviated and the compiler will handle the data
           | shuffling for larger ones.
           | 
           | There was naturally going to be a point where there was
           | enough static computational power on integers that it didn't
           | make sense to continue widening them (at least, not at the
           | previous rate). The same goes for vectorization, but in even
           | more niche and specific fields.
        
           | bjourne wrote:
           | > "Loop unrolling also increases register pressure" -- it
           | does, but code that really requires >32 registers is
           | extremely rare, so a good instruction scheduler in the
           | compiler can avoid spilling.
           | 
           | No, it actually is super common in hpc code. If you unroll a
           | loop N times you need N times as many registers. For normal
           | memory-bound code I agree with you, but most hpc kernels will
           | exploit as much of the register file as they can for
           | blocking/tiling.
        
           | xphos wrote:
            | I think the variable length stuff does solve encoding issues,
            | and RISC-V takes some big strides with the ideas around
            | chaining and vl/lmul/vtype registers.
            | 
            | I think they would benefit from having 4 vtype registers,
            | though. It's wasted scalar space, but how often do you
            | actually rotate between 4 different vector types in main loop
            | bodies? The answer is pretty rarely. And you'd greatly reduce
            | the swapping between vtypes. I think they needed to find 1
            | more bit, but it's tough; the encoding space isn't that large
            | for RVV, which is a perk for sure.
            | 
            | Can't wait to see more implementations of RVV to actually
            | test some of its ideas
        
             | dzaima wrote:
             | If you had two extra bits in the instruction encoding, I
             | think it'd make much more sense to encode element width
             | directly in instructions, leaving LMUL multiplier &
             | agnosticness settings in vsetvl; only things that'd suffer
             | then would be if you need tail-undisturbed for one instr
             | (don't think that's particularly common) and fancy things
             | that reinterpret the vector between different element
             | widths (very uncommon).
             | 
             | Will be interesting to see if longer encodings for RVV with
             | encoded vtype or whatever ever materialize.
        
         | tonetegeatinst wrote:
          | AFAIK just about every modern CPU uses an out-of-order von
          | Neumann architecture. The only people who don't are the
          | handful of researchers and people who work on government
          | research into non-von Neumann designs.
        
           | luyu_wu wrote:
           | Low power RISC cores (both ARM and RISC-V) are typically in-
           | order actually!
           | 
           | But any core I can think of as 'high-performance' is OOO.
        
             | whaleofatw2022 wrote:
              | MIPS as well as Alpha, AFAIR. And technically Itanium.
              | OTOH it seems to me a bit like a niche for any performance
              | advantages...
        
               | mattst88 wrote:
               | Alpha 21264 is out-of-order.
        
               | pezezin wrote:
                | I would not call MIPS, Alpha, or Itanium "high-
                | performance" in 2025...
        
               | p_l wrote:
                | Alpha was out of order starting with EV6 (21264), but
                | most importantly the entire architecture was designed
                | with an eye for both pipeline hazards and out of order
                | execution, unlike the VAX it replaced, which made that
                | pretty much impossible
        
           | IshKebab wrote:
           | Microcontrollers are often in-order.
        
         | PaulHoule wrote:
          | In AVX-512 we have a platform that rewards the assembly
          | language programmer like few platforms have since the 6502. I
          | see people doing really clever things that are specific to the
          | system, and on one level it is _really cool_, but on another
          | level it means SIMD is the domain of the specialist. Intel puts
          | out press releases about the really great features they have
          | for the national labs and for Facebook, whereas the rest of us
          | are 5-10 years behind the curve for SIMD adoption because the
          | juice isn't worth the squeeze.
         | 
         | Just before libraries for training neural nets on GPUs became
         | available I worked on a product that had a SIMD based neural
         | network trainer that was written in hand-coded assembly. We
         | were a generation behind in our AVX instructions so we gave up
         | half of the performance we could have got, but that was the
         | least of the challenges we had to overcome to get the product
         | in front of customers. [1]
         | 
         | My software-centric view of Intel's problems is that they've
          | been spending their customers' and shareholders' money to put
         | features in chips that are fused off or might as well be fused
         | off because they aren't widely supported in the industry. And
         | that they didn't see this as a problem and neither did their
         | enablers in the computing media and software industry. Just for
          | example, Apple used to ship the MKL libraries, which were like
          | a turbocharger for matrix math, back when they were using Intel
         | chips. For whatever reason, Microsoft did not do this with
         | Windows and neither did most Linux distributions so "the rest
         | of us" are stuck with a fraction of the performance that we
         | paid for.
         | 
         | AMD did the right thing in introducing double pumped AVX-512
         | because at least assembly language wizards have some place
         | where their code runs and the industry gets closer to the place
         | where we can count on using an instruction set defined _12
         | years ago._
         | 
          | [1] If I'd been tasked with updating it to the next generation
          | I would have written a compiler (if I take that many derivatives
         | by hand I'll get one wrong.) My boss would have ordered me not
         | to, I would have done it anyway and not checked it in.
        
           | bee_rider wrote:
           | It is kind of a bummer that MKL isn't open sourced, as that
           | would make inclusion in Linux easier. It is already free-as-
           | in-beer, but of course that doesn't solve everything.
           | 
           | Baffling that MS didn't use it. They have a pretty close
           | relationship...
           | 
           | Agree that they are sort of going after hard-to-use niche
           | features nowadays. But I think it is just that the real thing
           | we want--single threaded performance for branchy code--is,
           | like, incredibly difficult to improve nowadays.
        
             | PaulHoule wrote:
             | At the very least you can decode UTF-8 really quickly with
             | AVX-512
             | 
             | https://lemire.me/blog/2023/08/12/transcoding-
             | utf-8-strings-...
             | 
             | and web browsers at the very least spent a lot of cycles on
             | decoding HTML and Javascript which is UTF-8 encoded. It
             | turns out AVX-512 is good at a lot of things you wouldn't
             | think SIMD would be good at. Intel's got the problem that
             | people don't want to buy new computers because they don't
             | see much benefit from buying a new computer, but a new
             | computer doesn't have the benefit it could have because of
             | lagging software support, and the software support lags
             | because there aren't enough new computers to justify the
             | work to do the software support. Intel deserves blame for a
             | few things, one of which is that they have dragged their
             | feet at getting really innovative features into their
             | products while turning people off with various empty
             | slogans.
             | 
             | They really do have a new instruction set that targets
             | plain ordinary single threaded branchy code
             | 
             | https://www.intel.com/content/www/us/en/developer/articles/
             | t...
             | 
             | they'll probably be out of business before you can use it.
        
               | gatane wrote:
                | In the end, it doesn't even matter; JavaScript
                | frameworks are already big enough to slow down your PC.
                | 
                | Unless said optimization on parsing runs at the very
                | core of JS.
        
               | saagarjha wrote:
               | It'll speed up first load times.
        
               | immibis wrote:
               | If you pay attention this isn't a UTF-8 decoder. It might
               | be some other encoding, or a complete misunderstanding of
               | how UTF-8 works, or an AI hallucination. It also doesn't
               | talk about how to handle the variable number of output
               | bytes or the possibility of a continuation sequence split
               | between input chunks.
        
               | kjs3 wrote:
               | I paid attention and I don't see where Daniel claimed
                | that this is a complete UTF-8 decoder. He's illustrating a
               | programming technique using a simplified use case, not
               | solving the worlds problems. And I don't think Daniel
               | Lemire lacks an understanding of the concept or needs an
               | AI to code it.
        
           | ack_complete wrote:
           | AVX-512 also has a lot of wonderful facilities for
           | autovectorization, but I suspect its initial downclocking
           | effects plus getting yanked out of Alder Lake killed a lot of
           | the momentum in improving compiler and library usage of it.
           | 
           | Even the Steam Hardware Survey, which is skewed toward upper
           | end hardware, only shows 16% availability of baseline
           | AVX-512, compared to 94% for AVX2.
        
             | adgjlsfhk1 wrote:
              | It will be interesting to see what happens now that AMD is
             | shipping good AVX-512. It really just makes Intel seem
             | incompetent (especially since they're theoretically
             | bringing AVX-512 back in next year anyway)
        
               | ack_complete wrote:
               | No proof, but I suspect that AMD's AVX-512 support played
               | a part in Intel dumping AVX10/256 and changing plans back
               | to shipping a full 512-bit consumer implementation again
               | (we'll see when they actually ship it).
               | 
               | The downside is that AMD also increased the latency of
               | all formerly cheap integer vector ops. This removes one
               | of the main advantages against NEON, which historically
               | has had richer operations but worse latencies. That's one
               | thing I hope Intel doesn't follow.
               | 
               | Also interesting is that Intel's E-core architecture is
               | improving dramatically compared to the P-core, even
               | surpassing it in some cases. For instance, Skymont
               | finally has no penalty for denormals, a long standing
               | Intel weakness. Would not be surprising to see the E-core
               | architecture take over at some point.
        
               | adgjlsfhk1 wrote:
               | > For instance, Skymont finally has no penalty for
               | denormals, a long standing Intel weakness.
               | 
               | yeah, that's crazy to me. Intel has been so completely
                | dysfunctional for the last 15 years. I feel like you
               | couldn't have a clearer sign of "we have 2 completely
               | separate teams that are competing with each other and
                | aren't allowed to/don't want to talk to each other". It's
               | just such a clear sign that the chicken is running around
               | headless
        
               | whizzter wrote:
               | Not really, to me it more seems like Pentium-4 vs
               | Pentium-M/Core again.
               | 
                | The downfall of the Pentium 4 was that they had been
                | stuffing things into longer and longer pipes to keep up
                | in the frequency race (with horrible branch latencies as
                | a result). They scaled it all back by "resetting" to the
                | P3/P-M/Core architecture and scaling up from that again.
                | 
                | Pipes today are even _longer_, and if E-cores have
                | shorter pipes at a similar frequency then "regular" JS,
                | Java, etc. code will be far more performant, even if you
                | lose a bit of perf for "performance" cases where people
                | vectorize (did the HPC crowd influence Intel into a ditch
                | AGAIN? wouldn't be surprising!).
        
               | ack_complete wrote:
               | Thankfully, the P-cores are nowhere near as bad as the
               | Pentium 4 was. The Pentium 4 had such a skewed
               | architecture that it was annoyingly frustrating to
               | optimize for. Not only was the branch misprediction
               | penalty long, but all common methods of doing branchless
               | logic like conditional moves were also slow. It also had
               | a slow shifter such that small left shifts were actually
               | faster as sequences of adds, which I hadn't needed to do
               | since the 68000 and 8086. And an annoying L1 cache that
               | had 64K aliasing penalties (guess which popular OS
               | allocates all virtual memory, particularly thread stacks,
               | at 64K alignment.....)
               | 
               | The P-cores have their warts, but are still much more
               | well-rounded than the P4 was.
        
             | ezekiel68 wrote:
             | You mentioned "initial downclocking effects", yet (for
             | posterity) I want to emphasize that in 2020 Ice Lake (Sunny
             | Cove core) and later Intel processors, the downclocking is
             | really a nothingburger. The fusing off debacle in desktop
             | CPU families like Alder Lake you mentioned definitely
             | killed the momentum though.
             | 
             | I'm not sure why OS kernels couldn't have become partners
             | in CPU capability queries (where a program starting
             | execution could request a CPU core with 'X' such as
             | AVX-512F, for example) -- but without that the whole
             | P-core/E-core hybrid concept was DOA for capabilities which
             | were not least-common denominators. If I had to guess,
             | marketing got ahead of engineering and testing on that one.
        
               | ack_complete wrote:
               | Sure, but any core-wide downclocking effect at all is
               | annoying for autovectorization, since a small local win
               | easily turns into a global loss. Which is why compilers
               | have "prefer vector width" tuning parameters so autovec
               | can be tuned down to avoid 512-bit or even 256-bit ops.
               | 
               | This is also the same reason that having AVX-512 only on
               | the P-cores wouldn't have worked, even with thread
               | director support. It would only take one small routine in
               | a common location to push most threads off the P-cores.
               | 
               | I'm of the opinion that Intel's hybrid P/E-arch has been
               | mostly useless anyway and only good for winning
               | benchmarks. My current CPU has a 6P4E configuration and
               | the scheduler hardly uses the E-cores at all unless
               | forced, plus performance was better and more stable with
               | the E-cores disabled.
        
           | the__alchemist wrote:
           | Noob question! What about AVX-512 makes it unique to assembly
           | programmers? I'm just dipping my toes in, and have been doing
           | some chemistry computations using f32x8, Vec3x8 etc
           | (AVX-256). I have good workflows set up, but have only been
           | getting 2x speedup over non-SIMD. (Was hoping for closer to
           | 8). I figured AVX-512 would allow f32x16 etc, which would be
           | mostly a drop-in. (I have macros to set up the types, and you
           | input num lanes).
        
             | dzaima wrote:
             | SIMD only helps you where you're arithmetic-limited; you
             | may be limited by memory bandwidth, or perhaps float
             | division if applicable; and if your scalar comparison got
             | autovectorized you'd have roughly no benefit.
             | 
             | AVX-512 should be just fine via intrinsics/high-level
             | vector types, not different from AVX2 in this regard.
        
             | ack_complete wrote:
             | AVX-512 has a lot of instructions that just extend
             | vectorization to 512-bit and make it nicer with features
             | like masking. Thus, a very valid use of it is just to
             | double vectorization width.
             | 
             | But it also has a bunch of specialized instructions that
             | can boost performance beyond just the 2x width. One of them
             | is VPCOMPRESSB, which accelerates compact encoding of
             | sparse data. Another is GF2P8AFFINEQB, which is targeted at
             | specific encryption algorithms but can also be abused for
             | general bit shuffling. Algorithms like computing a
             | histogram can benefit significantly, but it requires
             | reshaping the algorithm around very particular and peculiar
             | intermediate data layouts that are beyond the
             | transformations a compiler can do. This doesn't _literally_
             | require assembly language, though, it can often be done
             | with intrinsics.
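
         To make the VPCOMPRESSB point concrete, here is a minimal sketch
         of the compress idiom via intrinsics (assuming AVX-512BW plus
         AVX-512VBMI2; the function name filter_bytes, the greater-than
         predicate, and the requirement of 64 bytes of slack in the output
         buffer are all assumptions of the example):

             #include <immintrin.h>
             #include <stddef.h>

             /* Copy only the bytes greater than `threshold` from in[0..n)
                to out, returning the number of bytes written. n is assumed
                to be a multiple of 64, and out needs 64 bytes of slack
                because each chunk stores a full vector. */
             size_t filter_bytes(unsigned char *out, const unsigned char *in,
                                 size_t n, signed char threshold)
             {
                 unsigned char *dst = out;
                 for (size_t i = 0; i < n; i += 64) {
                     __m512i v = _mm512_loadu_si512(in + i);
                     __mmask64 keep =
                         _mm512_cmpgt_epi8_mask(v, _mm512_set1_epi8(threshold));
                     /* VPCOMPRESSB: pack the selected bytes to the front */
                     _mm512_storeu_si512(dst, _mm512_maskz_compress_epi8(keep, v));
                     dst += (size_t)_mm_popcnt_u64(keep);  /* advance by kept count */
                 }
                 return (size_t)(dst - out);
             }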
        
         | derf_ wrote:
         | _> I code with SIMD as the target, and have special containers
         | that pad memory to SIMD width..._
         | 
         | I think this may be domain-specific. I help maintain several
         | open-source audio libraries, and wind up being the one to
         | review the patches when people contribute SIMD for some
         | specific ISA, and I think without exception they always get the
         | tail handling wrong. Due to other interactions it cannot always
         | be avoided by padding. It can roughly double the complexity of
         | the code [0], and requires a disproportionate amount of
         | thinking time vs. the time the code spends running, but if you
         | don't spend that thinking time you can get OOB reads or writes,
         | and thus CVEs. Masked loads/stores are an improvement, but not
         | universally available. I don't have a lot of concrete
         | suggestions.
         | 
         | I also work with a lot of image/video SIMD, and this is just
         | not a problem, because most operations happen on fixed block
         | sizes, and padding buffers is easy and routine.
         | 
         | I agree I would have picked other things for the other two in
         | my own top-3 list.
         | 
         | [0] Here is a fun one, which actually performs worst when len
         | is a multiple of 8 (which it almost always is), and has 59
         | lines of code for tail handling vs. 33 lines for the main loop:
         | https://gitlab.xiph.org/xiph/opus/-/blob/main/celt/arm/celt_...
        
           | jandrewrogers wrote:
           | > Masked loads/stores are an improvement, but not universally
           | available.
           | 
           | Traditionally we've worked around this with pretty idiomatic
           | hacks that efficiently implement "masked load" functionality
           | in SIMD ISAs that don't have them. We could probably be
           | better about not making people write this themselves every
           | time.
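
         One common idiom, sketched here assuming plain SSE2 and an
         idempotent per-byte operation (ASCII uppercasing, so redoing
         overlapped bytes is harmless): process full vectors, then finish
         with one unaligned vector that ends exactly at the end of the
         buffer, instead of a masked or scalar tail. The function names
         are made up for the example.

             #include <emmintrin.h>  /* SSE2 */
             #include <stddef.h>

             static __m128i upper16(__m128i v)
             {
                 /* lanes with 'a' <= byte <= 'z' get bit 0x20 flipped off */
                 __m128i ge_a  = _mm_cmpgt_epi8(v, _mm_set1_epi8('a' - 1));
                 __m128i le_z  = _mm_cmplt_epi8(v, _mm_set1_epi8('z' + 1));
                 __m128i lower = _mm_and_si128(ge_a, le_z);
                 return _mm_xor_si128(v, _mm_and_si128(lower, _mm_set1_epi8(0x20)));
             }

             void ascii_upper(char *p, size_t n)  /* uppercase in place */
             {
                 if (n >= 16) {
                     size_t i = 0;
                     for (; i + 16 <= n; i += 16)
                         _mm_storeu_si128((__m128i *)(p + i),
                             upper16(_mm_loadu_si128((const __m128i *)(p + i))));
                     /* "masked load" substitute: one overlapping vector ending
                        at p + n; up to 15 bytes are converted twice, harmlessly */
                     _mm_storeu_si128((__m128i *)(p + n - 16),
                         upper16(_mm_loadu_si128((const __m128i *)(p + n - 16))));
                 } else {
                     for (size_t i = 0; i < n; i++)
                         if (p[i] >= 'a' && p[i] <= 'z') p[i] = (char)(p[i] - 0x20);
                 }
             }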
        
           | ack_complete wrote:
           | It depends on how integrated your SIMD strategy is into the
           | overall technical design. Tail handling is much easier if you
           | can afford SIMD-friendly padding so a full vector load/store
           | is possible even if you have to manually mask. That avoids a
           | lot of the hassle of breaking down memory accesses just to
           | avoid a page fault or setting off the memory checker.
           | 
           | Beyond that -- unit testing. I don't see enough of it for
           | vectorized routines. SIMD widths are small enough that you
           | can usually just test all possible offsets right up against a
           | guard page and brute force verify that no overruns occur.
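
         A minimal sketch of that kind of harness on POSIX, using an
         mmap'd guard page; `kernel` stands in for whatever vectorized
         routine is under test and is an assumption of the example:

             #include <assert.h>
             #include <string.h>
             #include <sys/mman.h>
             #include <unistd.h>

             /* Place the buffer so it ends exactly at a PROT_NONE page and
                try every small length; any out-of-bounds access faults. */
             void check_no_overrun(void (*kernel)(char *, size_t))
             {
                 long page = sysconf(_SC_PAGESIZE);
                 char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                 assert(base != MAP_FAILED);
                 assert(mprotect(base + page, page, PROT_NONE) == 0);

                 for (size_t len = 0; len <= 64; len++) {
                     char *buf = base + page - len;  /* ends at the guard page */
                     memset(buf, 'a', len);
                     kernel(buf, len);               /* must not touch base+page */
                 }
                 munmap(base, 2 * page);
             }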
        
           | codedokode wrote:
           | I think that SIMD code should not be written by hand but
           | rather in a high-level language and so dealing with tail
            | becomes the compiler's and not the programmer's problem. Or
            | do people still prefer to write assembly by hand? It seems
            | so, judging by the code you link.
           | 
              | What I wanted is to write code in a more high-level language
              | like this. For example, to compute a scalar product of a and
              | b you write:
              | 
              |     1..n | a[$1] * b[$1] | sum
              | 
              | Or maybe this:
              | 
              |     x = sum for i in 1..n: a[i] * b[i]
           | 
           | And the code gets automatically compiled into SIMD
           | instructions for every existing architecture (and for large
           | arrays, into a multi-thread computation).
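
         For what it's worth, plain C plus an auto-vectorizer already gets
         fairly close to that for simple reductions. A sketch (the omp simd
         pragma just permits reassociating the floating-point sum; flags
         like -O3 -march=... and -fopenmp-simd are assumed):

             #include <stddef.h>

             float dot(const float *a, const float *b, size_t n)
             {
                 float sum = 0.0f;
                 #pragma omp simd reduction(+:sum)   /* allow a SIMD reduction */
                 for (size_t i = 0; i < n; i++)
                     sum += a[i] * b[i];
                 return sum;
             }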
        
             | Zambyte wrote:
             | Zig exposes a Vector type to use for SIMD instructions,
             | which has been my first introduction to SIMD directly.
             | Reading through this thread I was immediately mapping what
             | people were saying to Vector operations in Zig. It seems to
             | me like SIMD can reasonably be exposed in high level
             | languages for programmers to reach to in contexts where it
             | matters.
             | 
             | Of course, the compiler vectorizing code when it can as a
             | general optimization is still useful, but when it's
             | critical that some operations _must_ be vectorized,
             | explicit SIMD structures seem nice to have.
        
         | Const-me wrote:
         | > I prefer fixed width
         | 
          | Another reason to prefer fixed width: compilers may pass
          | vectors to functions in SIMD registers. When the register
          | size is unknown at compile time, they have to pass the data
          | in memory. For complicated SIMD algorithms the performance
          | overhead is going to be huge.
        
           | kjs3 wrote:
           | Back in the day, you had Cray style vector registers, and you
           | had CDC style[1] 'vector pipes' (I think I remember that's
           | what they called them) that you fed from main memory. So you
           | would ( _vastly_ oversimplifying) build your vectors in
           | consecutive memory locations (up to 64k as I recall), point
           | to a result destination in memory and execute a vector
            | instruction. This works fine if there's a close match
           | between cpu speed and memory access speed. The compilers were
           | quite good, and took care of handling variable sized vectors,
           | but I have no idea what was going on under the hood except
           | for some hi-level undergrad compiler lectures. As memory
            | speed vs cpu speed divergence became more and more pronounced,
           | it quickly became obvious that vector registers were the
           | right performance answer, basically everyone jumped that way,
           | and I don't think anyone has adopted a memory-memory vector
           | architecture since the '80s.
           | 
            | [1] from CDC STAR-100 and follow-ons like the CDC Cyber
           | 180/990, Cyber 200 series & ETA-10.
        
       | freeone3000 wrote:
       | x86 SIMD suffers from register aliasing. xmm0 is actually the
       | low-half of ymm0, so you need to explicitly tell the processor
       | what your input type is to properly handle overflow and signing.
       | Actual vectorized instructions don't have this problem but you
       | also can't change it now.
        
       | pornel wrote:
       | There are alternative universes where these wouldn't be a
       | problem.
       | 
       | For example, if we didn't settle on executing compiled machine
        | code exactly as-is, and had an instruction-updating pass (less
       | involved than a full VM byte code compilation), then we could
       | adjust SIMD width for existing binaries instead of waiting
       | decades for a new baseline or multiversioning faff.
       | 
       | Another interesting alternative is SIMT. Instead of having a
       | handful of special-case instructions combined with heavyweight
       | software-switched threads, we could have had every instruction
       | SIMDified. It requires structuring programs differently, but
       | getting max performance out of current CPUs already requires SIMD
       | + multicore + predictable branching, so we're doing it anyway,
       | just in a roundabout way.
        
         | LegionMammal978 wrote:
         | > Another interesting alternative is SIMT. Instead of having a
         | handful of special-case instructions combined with heavyweight
         | software-switched threads, we could have had every instruction
         | SIMDified. It requires structuring programs differently, but
         | getting max performance out of current CPUs already requires
         | SIMD + multicore + predictable branching, so we're doing it
         | anyway, just in a roundabout way.
         | 
         | Is that not where we're already going with the GPGPU trend? The
         | big catch with GPU programming is that many useful routines are
         | irreducibly very branchy (or at least, to an extent that
         | removing branches slows them down unacceptably), and every
         | divergent branch throws out a huge chunk of the GPU's
         | performance. So you retain a traditional CPU to run all your
         | branchy code, but you run into memory-bandwidth woes between
         | the CPU and GPU.
         | 
         | It's generally the exception instead of the rule when you have
         | a big block of data elements upfront that can all be handled
         | uniformly with no branching. These usually have to do with
         | graphics, physical simulation, etc., which is why the SIMT
         | model was popularized by GPUs.
        
           | winwang wrote:
           | Fun fact which I'm 50%(?) sure of: a single branch divergence
           | for integer instructions on current nvidia GPUs won't hurt
           | perf, because there are only 16 int32 lanes anyway.
        
           | pornel wrote:
           | CPUs are not good at branchy code either. Branch
           | mispredictions cause costly pipeline stalls, so you have to
           | make branches either predictable or use conditional moves.
           | Trivially predictable branches are fast -- but so are non-
           | diverging warps on GPUs. Conditional moves and masked SIMD
           | work pretty much exactly like on a GPU.
           | 
           | Even if you have a branchy divide-and-conquer problem ideal
           | for diverging threads, you'll get hit by a relatively high
           | overhead of distributing work across threads, false sharing,
           | and stalls from cache misses.
           | 
           | My hot take is that GPUs will get more features to work
           | better on traditionally-CPU-problems (e.g. AMD Shader Call
           | proposal that helps processing unbalanced tree-structured
           | data), and CPUs will be downgraded to being just a
           | coprocessor for bootstrapping the GPU drivers.
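
         As an illustration of the predication point, a sketch of a
         per-element select done with an AVX compare + blend instead of a
         branch (the computation and function name are made up for the
         example; the scalar tail is left branchy for brevity):

             #include <immintrin.h>  /* AVX */
             #include <stddef.h>

             /* out[i] = (a[i] > b[i]) ? a[i] * 2.0f : b[i], branch-free */
             void select_blend(float *out, const float *a, const float *b, size_t n)
             {
                 size_t i = 0;
                 for (; i + 8 <= n; i += 8) {
                     __m256 va    = _mm256_loadu_ps(a + i);
                     __m256 vb    = _mm256_loadu_ps(b + i);
                     __m256 mask  = _mm256_cmp_ps(va, vb, _CMP_GT_OQ);
                     __m256 taken = _mm256_mul_ps(va, _mm256_set1_ps(2.0f));
                     /* both sides are computed; the mask picks per lane */
                     _mm256_storeu_ps(out + i, _mm256_blendv_ps(vb, taken, mask));
                 }
                 for (; i < n; i++)
                     out[i] = (a[i] > b[i]) ? a[i] * 2.0f : b[i];
             }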
        
         | aengelke wrote:
         | > if we didn't settle on executing compiled machine code
          | exactly as-is, and had an instruction-updating pass (less
         | involved than a full VM byte code compilation)
         | 
         | Apple tried something like this: they collected the LLVM
         | bitcode of apps so that they could recompile and even port to a
         | different architecture. To my knowledge, this was done exactly
         | once (watchOS armv7->AArch64) and deprecated afterwards.
         | Retargeting at this level is inherently difficult (different
         | ABIs, target-specific instructions, intrinsics, etc.). For the
         | same target with a larger feature set, the problems are
         | smaller, but so are the gains -- better SIMD usage would only
         | come from the auto-vectorizer and a better instruction selector
         | that uses different instructions. The expectable gains,
         | however, are low for typical applications and for math-heavy
         | programs, using optimized libraries or simply recompiling is
         | easier.
         | 
         | WebAssembly is a higher-level, more portable bytecode, but
         | performance levels are quite a bit behind natively compiled
         | code.
        
         | almostgotcaught wrote:
         | > There are alternative universes where these wouldn't be a
         | problem
         | 
         | Do people that say these things have literally any experience
         | of merit?
         | 
         | > For example, if we didn't settle on executing compiled
          | machine code exactly as-is, and had an instruction-updating pass
         | 
         | You do understand that at the end of the day, hardware is hard
          | (fixed) and software is soft (malleable), right? There will
         | always be friction at some boundary - it doesn't matter where
         | you hide the rigidity of a literal rock, you eventually reach a
         | point where you cannot reconfigure something that you would
         | like to. And also the parts of that rock that are useful are
         | extremely expensive (so no one is adding instruction-updating
         | pass silicon just because it would be convenient). That's just
         | physics - the rock is very small but fully baked.
         | 
         | > we could have had every instruction SIMDified
         | 
         | Tell me you don't program GPUs without telling me. Not only is
         | SIMT a literal lie today (cf warp level primitives), there is
         | absolutely no reason to SIMDify all instructions (and you
         | better be a wise user of your scalar registers and scalar
         | instructions if you want fast GPU code).
         | 
         | I wish people would just realize there's no grand paradigm
         | shift that's coming that will save them from the difficult work
         | of actually learning how the device works in order to be able
         | to use it efficiently.
        
           | pornel wrote:
           | The point of updating the instructions isn't to have optimal
           | behavior in all cases, or to reconfigure programs for wildly
           | different hardware, but to be able to easily target
           | contemporary hardware, without having to wait for the oldest
           | hardware to die out first to be able to target a less
           | outdated baseline without conditional dispatch.
           | 
           | Users are much more forgiving about software that runs a bit
           | slower than software that doesn't run at all. ~95% of x86_64
           | CPUs have AVX2 support, but compiling binaries to
           | unconditionally rely on it makes the remaining users
           | complain. If it was merely slower on potato hardware, it'd be
           | an easier tradeoff to make.
           | 
           | This is the norm on GPUs thanks to shader recompilation
           | (they're far from optimal for all hardware, but at least get
           | to use the instruction set of the HW they're running on,
           | instead of being limited to the lowest common denominator).
            | On CPUs it's happening in limited cases: Zen 4 added AVX-512
           | by executing two 256-bit operations serially, and plenty of
           | less critical instructions are emulated in microcode, but
           | it's done by the hardware, because our software isn't set up
           | for that.
           | 
           | Compilers already need to make assumptions about pipeline
           | widths and instruction latencies, so the code is tuned for
           | specific CPU vendors/generations anyway, and that doesn't get
           | updated. Less explicitly, optimized code also makes
           | assumptions about cache sizes and compute vs memory trade-
           | offs. Code may need L1 cache of certain size to work best,
           | but it still runs on CPUs with a too-small L1 cache, just
           | slower. Imagine how annoying it would be if your code
           | couldn't take advantage of a larger L1 cache without crashing
           | on older CPUs. That's where CPUs are with SIMD.
        
             | almostgotcaught wrote:
             | i have no idea what you're saying - i'm well aware that
             | compilers do lots of things but this sentence in your
             | original comment
             | 
              | > compiled machine code exactly as-is, and had an
             | instruction-updating pass
             | 
             | implies there should be _silicon_ that implements the
             | instruction-updating - what else would be  "executing"
             | compiled machine code other than the machine
             | itself...........
        
               | pornel wrote:
               | I was talking about a software pass. Currently, the
               | machine code stored in executables (such as ELF or PE) is
               | only slightly patched by the dynamic linker, and then
               | expected to be directly executable by the CPU. The code
               | in the file has to be already compatible with the target
               | CPU, otherwise you hit illegal instructions. This is a
               | simplistic approach, dating back to when running
               | executables was just a matter of loading them into RAM
               | and jumping to their start (old a.out or DOS COM).
               | 
               | What I'm suggesting is adding a translation/fixup step
               | after loading a binary, before the code is executed, to
               | make it more tolerant to hardware changes. It doesn't
               | have to be full abstract portable bytecode compilation,
               | and not even as involved as PTX to SASS, but more like a
               | peephole optimizer for the same OS on the same general
               | CPU architecture. For example, on a pre-AVX2 x86_64 CPU,
               | the OS could scan for AVX2 instructions and patch them to
               | do equivalent work using SSE or scalar instructions.
               | There are implementation and compatibility issues that
               | make it tricky, but fundamentally it should be possible.
               | Wilder things like x86_64 to aarch64 translation have
               | been done, so let's do it for x86_64-v4 to x86_64-v1 too.
        
               | almostgotcaught wrote:
               | that's certainly more reasonable so i'm sorry for being
               | so flippant. but even this idea i wager the juice is not
                | worth the squeeze outside of stuff like Rosetta, as you
                | alluded to, where the value was extremely high (retaining
               | x86 customers).
        
       | gitroom wrote:
       | Oh man, totally get the pain with compilers and SIMD tricks - the
       | struggle's so real. Ever feel like keeping low level control is
       | the only way stuff actually runs as smooth as you want, or am I
       | just too stubborn to give abstractions a real shot?
        
       | sweetjuly wrote:
       | Loop unrolling isn't really done because of pipelining but rather
       | to amortize the cost of looping. Any modern out-of-order core
       | will (on the happy path) schedule the operations identically
       | whether you did one copy per loop or four. The only difference is
       | the number of branches.
        
         | Remnant44 wrote:
         | These days, I strongly believe that loop unrolling is a
         | pessimization, especially with SIMD code.
         | 
         | Scalar code should be unrolled by the compiler to the SIMD word
         | width to expose potential parallelism. But other than that,
         | correctly predicted branches are free, and so is loop
         | instruction overhead on modern wide-dispatch processors. For
         | example, even running a maximally efficient AVX512 kernel on a
         | zen5 machine that dispatches 4 EUs and some load/stores and
         | calculates 2048 bits in the vector units every cycle, you still
         | have a ton of dispatch capacity to handle the loop overhead in
         | the scalar units.
         | 
         | The cost of unrolling is decreased code density and reduced
         | effectiveness of the instruction / uOp cache. I wish Clang in
         | particular would stop unrolling the dang vector loops.
        
           | adgjlsfhk1 wrote:
           | The part that's really weird is that on modern CPUs predicted
           | branches are free iff they're sufficiently rare (<1 out of 8
              | instructions or so). But if you have too many, you will be
           | bottlenecked on the branch since you aren't allowed to
           | speculate past a 2nd (3rd on zen5 without hyperthreading?)
           | branch.
        
             | dzaima wrote:
             | The limiting thing isn't necessarily speculating, but more
             | just the number of branches per cycle, i.e. number of non-
             | contiguous locations the processor has to query from L1 /
             | uop cache (and which the branch predictor has to determine
             | the location of). You get that limit with unconditional
             | branches too.
        
               | gpderetta wrote:
               | Indeed, the limit is on _taken_ branches, hence why
               | making the most likely case fall through is often an
               | optimization.
        
           | dzaima wrote:
           | Intel still shares ports between vector and scalar on
           | P-cores; a scalar multiply in the loop will definitely fight
           | with a vector port, and the bits of pointer bumps and branch
           | and whatnot can fill up the 1 or 2 scalar-only ports. And
           | maybe there are some minor power savings from wasting
           | resources on the scalar overhead. Still, clang does unroll
           | way too much.
        
             | Remnant44 wrote:
             | My understanding is that they've changed this for Lion Cove
             | and all future P cores, moving to much more of a Zen-like
                | setup with separate schedulers and ports for vector and
             | scalar ops.
        
               | dzaima wrote:
               | Oh, true, mistook it for an E-core while clicking through
                | diagrams due to the port spam... Still, that's a 2024
                | microarchitecture; it'll be like a decade before it's
               | reasonable to ignore older ones.
        
           | bobmcnamara wrote:
           | > The cost of unrolling is decreased code density and reduced
           | effectiveness of the instruction / uOp cache.
           | 
           | There are some cases where useful code density goes up.
           | 
            | Ex: unroll the Goertzel algorithm by an even number, and
           | suddenly the entire delay line overhead evaporates.
        
           | Const-me wrote:
           | > schedule the operations identically whether you did one
           | copy per loop or four
           | 
           | They don't always do that well when you need a reduction in
           | that loop, e.g. you are searching for something in memory, or
           | computing dot product of long vectors.
           | 
           | Reductions in the loop form a continuous data dependency
            | chain between loop iterations, which prevents processors from
           | being able to submit instructions for multiple iterations of
           | the loop. Fixable with careful manual unrolling.
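
         For example, a dot product with a single accumulator serializes
         on one long add chain; splitting the sum across independent
         accumulators lets the core overlap the adds. A sketch in scalar
         form for clarity (in practice each accumulator would be a full
         vector register, and the reassociation slightly changes
         floating-point rounding):

             #include <stddef.h>

             float dot_unrolled(const float *a, const float *b, size_t n)
             {
                 float s0 = 0, s1 = 0, s2 = 0, s3 = 0;  /* independent chains */
                 size_t i = 0;
                 for (; i + 4 <= n; i += 4) {
                     s0 += a[i + 0] * b[i + 0];
                     s1 += a[i + 1] * b[i + 1];
                     s2 += a[i + 2] * b[i + 2];
                     s3 += a[i + 3] * b[i + 3];
                 }
                 for (; i < n; i++)          /* scalar tail */
                     s0 += a[i] * b[i];
                 return (s0 + s1) + (s2 + s3);
             }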
        
         | codedokode wrote:
         | > Any modern out-of-order core will (on the happy path)
         | schedule the operations identically whether you did one copy
         | per loop or four.
         | 
          | I cannot agree, because in an unrolled loop you have fewer
          | counter-increment instructions.
        
         | gpderetta wrote:
         | The looping overhead is trivial, especially on simd code where
         | the loop overhead will use the scalar hardware.
         | 
         | Unrolling is definitely needed for properly scheduling and
         | pipelining SIMD code even on OoO cores. Remember that an OoO
         | core cannot reorder dependent instructions, so the dependencies
         | need to be manually broken, for example by adding additional
          | accumulators, which in turn requires additional unrolling. This
          | is especially important for SIMD code, which is typically non-
          | branchy with long dependency chains.
        
           | Remnant44 wrote:
           | That's a good point about increased dependency chain length
           | in simd due to the branchless programming style. Unrolling to
           | break a loop-carried dependency is one of the strongest
           | reasons to unroll especially simd code.
           | 
           | Unrolling trivial loops to remove loop counter overhead
            | hasn't been productive for quite a while now, but
           | unfortunately it's still the default for many compilers.
        
         | imtringued wrote:
         | Ok, but the compiler can't do that without unrolling.
        
       | bob1029 wrote:
       | > Since the register size is fixed there is no way to scale the
       | ISA to new levels of hardware parallelism without adding new
       | instructions and registers.
       | 
       | I look at SIMD as the same idea as any other aspect of the x86
       | instruction set. If you are directly interacting with it, you
       | should probably have a good reason to be.
       | 
       | I primarily interact with these primitives via types like
       | Vector<T> in .NET's System.Numerics namespace. With the
       | appropriate level of abstraction, you no longer have to worry
       | about how wide the underlying architecture is, or if it even
       | supports SIMD at all.
       | 
        | I'd prefer to let someone who is paid a very fat salary by an
        | F100 spend their full-time job worrying about how to emit SIMD
       | instructions for my program source.
        
       | timewizard wrote:
       | > Another problem is that each new SIMD generation requires new
       | instruction opcodes and encodings.
       | 
       | It requires new opcodes. It does not strictly require new
        | encodings. Several new encodings are legacy-compatible and can
        | encode previous generations' vector instructions.
       | 
       | > so the architecture must provide enough SIMD registers to avoid
       | register spilling.
       | 
       | Or the architecture allows memory operands. The great joy of
       | basic x86 encoding is that you don't actually need to put things
       | in registers to operate on them.
       | 
       | > Usually you also need extra control logic before the loop. For
       | instance if the array length is less than the SIMD register
       | width, the main SIMD loop should be skipped.
       | 
       | What do you want? No control overhead or the speed enabled by
       | SIMD? This isn't a flaw. This is a necessary price to achieve the
       | efficiency you do in the main loop.
        
         | camel-cdr wrote:
         | > The great joy of basic x86 encoding is that you don't
         | actually need to put things in registers to operate on them.
         | 
         | That's just spilling with fewer steps. The executed uops should
         | be the same.
        
           | timewizard wrote:
           | > That's just spilling with fewer steps.
           | 
           | Another way to say this is it's "more efficient."
           | 
           | > The executed uops should be the same.
           | 
           | And "more densely coded."
        
             | camel-cdr wrote:
             | hm, I was wondering how the density compares with x86
             | having more complex encodings in general.
             | 
             | vaddps zmm1,zmm0,ZMMWORD PTR [r14]
             | 
             | takes six bytes to encode:
             | 
             | 62 d1 7c 48 58 0e
             | 
             | In SVE and RVV a load+add takes 8 bytes to encode.
        
         | dzaima wrote:
         | > The great joy of basic x86 encoding is that you don't
         | actually need to put things in registers to operate on them.
         | 
         | That's... 1 register saved, out of 16 (or 32 on AVX-512).
         | Perhaps useful sometimes, but far from a particularly
         | significant aspect spill-wise.
         | 
         | And doing that means you lose the ability to move the load
         | earlier (perhaps not too significant on OoO hardware, but still
         | a consideration; while reorder windows are multiple hundreds of
         | instructions, the actual OoO limit is scheduling queues, which
         | are frequently under a hundred entries, i.e. a couple dozen
         | cycles worth of instructions, at which point the >=4 cycle
         | latency of a load is not actually insignificant. And putting
         | the load directly in the arith op is the worst-case scenario
         | for this)
        
       | lauriewired wrote:
       | The three "flaws" that this post lists are exactly what the
       | industry has been moving away from for the last decade.
       | 
        | Arm's SVE and RISC-V's vector extension are both vector-length-
       | agnostic. RISC-V's implementation is particularly nice, you only
       | have to compile for one code path (unlike avx with the need for
       | fat-binary else/if trees).
        
       | dragontamer wrote:
        | 1. Not a problem for GPUs. Nvidia and AMD are both 32-wide or
        | 1024-bit wide hard coded. AMD can swap to 64-wide mode for
        | backwards compatibility to GCN. 1024-bit or 2048-bit seem to be
        | the right values. Too wide and you get branch divergence issues,
       | so it doesn't seem to make sense to go bigger.
       | 
       | In contrast, the systems that have flexible widths have never
       | taken off. It's seemingly much harder to design a programming
        | language for flexible-width SIMD.
       | 
       | 2. Not a problem for GPUs. It should be noted that kernels
       | allocate custom amounts of registers: one kernel may use 56
       | registers, while another kernel might use 200 registers. All GPUs
       | will run these two kernels simultaneously (256+ registers per CU
       | or SM is commonly supported, so both 200+56 registers kernels can
       | run together).
       | 
       | 3. Not a problem for GPUs or really any SIMD in most cases. Tail
        | handling is an O(1) problem in general and not a significant
       | contributor to code length, size, or benchmarks.
       | 
       | Overall utilization issues are certainly a concern. But in my
       | experience this is caused by branching most often. (Branching in
       | GPUs is very inefficient and forces very low utilization).
        
         | dzaima wrote:
         | Tail handling is not significant for loops with tons of
         | iterations, but there are a ton of real-world situations where
         | you might have a loop take only like 5 iterations or something
         | (even at like 100 iterations, with a loop processing 8 elements
         | at a time (i.e. 256-bit vectors, 32-bit elements), that's 12
         | vectorized iterations plus up to 7 scalar ones, which is still
         | quite significant. At 1000 iterations you could still have the
         | scalar tail be a couple percent; and still doubling the L1/uop-
         | cache space the loop takes).
         | 
         | It's absolutely a significant contributor to code size (..in
         | scenarios where vectorized code in general is a significant
         | contributor to code size, which admittedly is only very-
         | specialized software).
        
           | dragontamer wrote:
            | Note that AVX512 has per-lane execution masks, so I'm not
           | fully convinced that tail handling should even be a thing
           | anymore.
           | 
           | If(my lane is beyond the buffer) then (exec flag off, do not
           | store my lane).
           | 
           | Which in practice should be a simple vcompress instruction
           | (AVX512 register) and maybe a move afterwards??? I admit that
           | I'm not an AVX512 expert but it doesn't seem all that
           | difficult with vcompress instructions + execmask.
        
             | dzaima wrote:
             | It takes like 4 instrs to compute the mask from an
             | arbitrary length (AVX-512 doesn't have any instruction for
             | this so you need to do `bzhi(-1, min(left,vl))` and move
             | that to a mask register) so you still would likely want to
             | avoid it in the hot loop.
             | 
             | Doing the tail separately but with masking SIMD is an
             | improvement over a scalar loop perf-wise (..perhaps outside
             | of the case of 1 or 2 elements, which is a realistic
             | situation for a bunch of loops too), but it'll still add a
             | double-digit percentage to code size over just a plain SIMD
             | loop without tail handling.
             | 
             | And this doesn't help pre-AVX-512, and AVX-512 isn't
             | particularly widespread (AVX2 does have masked load/store
             | with 32-/64-bit granularity, but not 8-/16-bit, and the
             | instrs that do exist are rather slow on AMD (e.g.
             | unconditional 12 cycles/instr throughput for masked-storing
             | 8 32-bit elements); SSE has none, and ARM NEON doesn't have
             | any either (and ARM SVE isn't widespread either, incl. not
             | supported on apple silicon))
             | 
             | (don't need vcompress, plain masked load/store instrs do
             | exist in AVX-512 and are sufficient)
        
               | dragontamer wrote:
               | > It takes like 2 instrs to compute the mask from a
               | length (AVX-512 doesn't have any instruction for this so
               | you need to do a bzhi in GPR and move that to a mask
               | register) so you still would likely want to avoid it in
               | the hot loop.
               | 
               | Keep a register with the values IdxAdjustment = [0, 1, 2,
               | 3, 4, 5, 6, 7].
               | 
               | ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) <
               | Length
               | 
               | Keep looping while(any(vector) < Length), which is as
               | simple as "while(exec_mask != 0)".
               | 
               | I'm not seeing this take up any "extra" instructions at
               | all. You needed the while() loop after all. It costs +1
               | Vector Register (IdxAdjustment) and a kMask by my count.
               | 
               | > And this doesn't help pre-AVX-512, and AVX-512 isn't
               | particularly widespread
               | 
               | AVX512 is over 10 years old now. And the premier SIMD
               | execution instruction set is CUDA / NVidia, not AVX512.
               | 
               | AVX512 is now available on all AMD CPUs and has been for
               | the last two generations. It is also available on a
               | select number of Intel CPUs. There is also AMD RDNA,
               | Intel Xe ISAs that could be targeted.
               | 
               | > instrs that do exist are rather slow on AMD (e.g.
               | unconditional 12 cycles/instr throughput for masked-
               | storing 8 32-bit elements);
               | 
               | Okay, I can see that possibly being an issue then.
               | 
                | EDIT: The AMD Zen5 optimization manual states latency 1
                | and throughput 2 per clock tick, while Intel's Skylake
                | documentation at
                | https://www.intel.com/content/www/us/en/docs/intrinsics-
                | guid... states latency 5, throughput 1 per clock tick.
                | 
                | AMD Zen5 seems to include support for vmovdqu8 (it's in
                | the optimization guide .xlsx sheet with
                | latencies/throughputs, also as 1-latency / 4-throughput).
                | 
                | I'm not sure if the "mask" register changes the
                | instruction. I'll do some research to see if I can verify
                | your claim (I don't have my Zen5 computer built yet...
                | but it's soon).
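
         For concreteness, a sketch of such a fully masked loop with
         AVX-512 intrinsics (the operation, adding 1.0f to every element,
         is made up for the example, and n is assumed to fit in a signed
         32-bit index):

             #include <immintrin.h>
             #include <stddef.h>

             void add_one_masked(float *dst, size_t n)
             {
                 const __m512i idx_adjust =
                     _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, 12, 13, 14, 15);
                 const __m512 ones = _mm512_set1_ps(1.0f);

                 for (size_t i = 0; i < n; i += 16) {
                     /* lanes whose global index is still inside the buffer */
                     __m512i lanes = _mm512_add_epi32(_mm512_set1_epi32((int)i),
                                                      idx_adjust);
                     __mmask16 active =
                         _mm512_cmplt_epi32_mask(lanes, _mm512_set1_epi32((int)n));

                     __m512 v = _mm512_maskz_loadu_ps(active, dst + i);
                     v = _mm512_add_ps(v, ones);
                     /* inactive lanes are left untouched by the store */
                     _mm512_mask_storeu_ps(dst + i, active, v);
                 }
             }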
        
               | dzaima wrote:
               | That's two instrs - bumping the indices, and doing the
               | comparison. You still want scalar pointer/index bumping
               | for contiguous loads/stores (using gathers/scatters for
               | those would be stupid and slow), and that gets you the
               | end check for free via fused cmp+jcc.
               | 
               | And those two instrs are vector instrs, i.e. competing
               | with execution units for the actual thing you want to
               | compute, whereas scalar instrs have at least some
                | independent units, so they don't push you towards infinite
               | unrolling.
               | 
               | And if your loop is processing 32-bit (or, worse,
               | smaller) elements, those indices, if done as 64-bit, as
               | most code will do, will cost even more.
               | 
               | AVX-512 might be 10 years old, but Intel's latest (!)
               | cores still don't support it on hardware with E-cores, so
               | still a decade away from being able to just assume it
               | exists. Another thread on this post mentioned that Intel
               | has shipped hardware without AVX/AVX2/FMA as late as 2021
               | even.
               | 
               | > Okay, I can see that possibly being an issue then.
               | 
               | To be clear, that's only the AVX2 instrs; AVX-512 masked
               | loads/stores are fast (..yes, even on Zen 4 where the
               | AVX-512 masked loads/stores are fast, the AVX2 ones that
               | do an equivalent amount of work (albeit taking the mask
                | in a different register class) are slow). uops.info:
                | https://uops.info/table.html?search=maskmovd%20m256&cb_lat=o...
               | 
               | Intel also has AVX-512 masked 512-bit 8-bit-elt stores at
               | half the throughput of unmasked for some reason (not
               | 256-bit or >=16-bit-elt though; presumably culprit being
                | the mask having 64 elts):
                | https://uops.info/table.html?search=movdqu8%20m512&cb_lat=on...
               | 
               | And masked loads use some execution ports on both Intel
               | and AMD, eating away from throughput of the main
               | operation. All in all just not implemented for being able
               | to needlessly use masked loads/stores in hot loops.
        
               | dragontamer wrote:
               | Gotcha. Makes sense. Thanks for the discussion!
               | 
               | Overall, I agree that AVX and Neon have their warts and
               | performance issues. But they're like 15+ years old now
               | and designed well before GPU Compute was possible.
               | 
               | > using gathers/scatters for those would be stupid and
               | slow
               | 
               | This is where CPUs are really bad. GPUs will coalesce
               | gather/scatters thanks to __shared__ memory (with human
               | assistance of course).
               | 
                | But also the simplest of load/store patterns are auto-
               | detected and coalesced. So a GPU programmer doesn't have
               | to worry about SIMD lane load/store (called vgather in
               | AVX512) being slower. It's all optimized to hell and
               | back.
               | 
               | Having a full lane-to-lane crossbar and supporting high
               | performance memory access patterns needs to be a priority
               | moving forward.
        
               | dzaima wrote:
               | Thanks for the info on how things look on the GPU side!
               | 
               | A messy thing with memory performance on CPUs is that
               | either you share the same cache hardware between scalar
               | and vector, thereby significantly limiting how much
               | latency you can trade for throughput, or you have to add
               | special vector L1 cache, which is a ton of mess and
               | silicon area; never mind uses of SIMD that are latency-
               | sensitive, e.g. SIMD hashmap probing, or small loops.
               | 
               | I guess you don't necessarily need that for just
               | detecting patterns in gather indices, but nothing's gonna
               | get a gather of consecutive 8-bit elts via 64-bit indices
               | to not perform much slower than a single contiguous load,
               | and 8-bit elts are quite important on CPUs for strings &
               | co.
        
       | dang wrote:
       | Related:
       | 
       |  _Three Fundamental Flaws of SIMD_ -
       | https://news.ycombinator.com/item?id=28114934 - Aug 2021 (20
       | comments)
        
       | xphos wrote:
       | Personally, I think load-and-increment-address-register in a
       | single instruction is extremely valuable here. It's not quite
       | the RISC model, but I think it is actually pretty significant
       | in avoiding a von Neumann bottleneck with SIMD (the irony in
       | this statement).
       | 
       | I found that for a lot of the custom SIMD cores I've written,
       | RISC-V simply cannot issue instructions fast enough. Or when it
       | does, it's in quick bursts, and then the increments and loop
       | controls leave the engine idling for more than you'd like.
       | 
       | Better dual issue helps, but when you have a separate vector
       | queue you are sending things to, it's not that much to add
       | increments into vloads and vstores.
        
         | codedokode wrote:
         | > load and increment address register in a single instruction
         | is extremely valuable here
         | 
         | I think this is not important anymore, because modern
         | architectures allow adding an offset to a register value, so
         | you can write something like this (using RISC-V-style syntax,
         | where the offset is added to the base register):
         | 
         |     ld r2, 0(r1)
         |     ld r3, 4(r1)
         |     ld r4, 8(r1)
         | 
         | These operations can be executed in parallel, while with auto-
         | incrementing you cannot do that.
        
       | codedokode wrote:
       | I think that packed SIMD is better in almost every aspect and
       | Vector SIMD is worse.
       | 
       | With vector SIMD you don't know the register size beforehand and
       | therefore have to maintain and increment counters, adding extra
       | unnecessary instructions, reducing total performance. With packed
       | SIMD you can issue several loads immediately without
       | dependencies, and if you look at code examples, you can see that
       | the x86 code is more dense and uses a sequence of unrolled SIMD
       | instructions without any extra instructions, which is more
       | efficient, while RISC-V has 4 SIMD instructions and 4
       | instructions dealing with counters per loop iteration, i.e. it
       | wastes 50% of instruction issue bandwidth and you cannot load
       | the next block until you increment the counter.
       | 
       | The article mentions that you have to recompile packed SIMD code
       | when a new architecture comes out. Is that really a problem? Open
       | source software is recompiled every week anyway. You should just
       | describe your operations in a high level language that gets
       | compiled to assembly for all supported architectures.
       | 
       | So as a conclusion, it seems that Vector SIMD is optimized for
       | manually-written assembly and closed-source software while Packed
       | SIMD is made for open-source software and compilers and is more
       | efficient. Why the RISC-V community prefers the vector
       | architecture, I don't understand.
        
         | LoganDark wrote:
         | This comment sort of reminds me of how Transmeta CPUs relied on
         | the compiler to precompute everything like pipelining. It
         | wasn't done by the hardware.
        
           | codedokode wrote:
            | Makes sense - writing or updating software is easier than
           | designing or updating hardware. To illustrate: anyone can
           | write software but not everyone has access to chip
           | manufacturing fabs.
        
             | LoganDark wrote:
             | Atomic Semi may be looking to change that (...eventually)
        
         | IshKebab wrote:
         | Those 4 counter instructions have no dependencies though so
         | they'll likely all be issued and executed in parallel in 1
         | cycle, surely? Probably the branch as well in fact.
        
           | codedokode wrote:
           | The load instruction has a dependency on counter increment.
           | While with packed SIMD one can issue several loads without
           | waiting. Also, extra counter instructions still waste
           | resources of a CPU (unless there is some dedicated hardware
           | for this purpose).
        
         | dzaima wrote:
         | > Open source software is recompiled every week anyway.
         | 
         | Despite being potentially compiled recently, anything from most
         | Linux package managers, and whatever precompiled downloadable
         | executables, even if from open-source code, still targets the
         | 20-year-old SSE2 baseline, wasting the majority of SIMD
         | resources available on modern (..or just not-extremely-ancient)
         | CPUs (unless you're looking at the 0.001% of software that
         | bothers with dynamic dispatch; but for that approach just
         | recompiling isn't enough, you also need to extend the
         | dispatched target set).
         | 
         | RISC-V RVV's LMUL means that you get unrolling for free, as
         | each instruction can operate over up to 8 registers per
         | operand, i.e. essentially "hardware 8x unrolling", thereby
         | making scalar overhead insignificant. (probably a minor
         | nightmare from the silicon POV, but perhaps not in a
         | particularly limiting way - double-pumping has been done by x86
         | many times so LMUL=2 is simple enough, and at LMUL=4 and LMUL=8
          | you can afford to decode/split into uops at 1 instr/cycle)
         | 
         | ARM SVE can encode adding a multiple of VL in load/store
         | instructions, allowing manual unrolling without having to
         | actually compute the intermediate sizes. (hardware-wise that's
         | an extremely tiny amount of overhead, as it's trivially
         | mappable to an immediate offset at decode time). And there's an
         | instruction to bump a variable by a multiple of VL.
         | 
         | And you need to bump pointers in any SIMD regardless; the only
         | difference is whether the bump size is an immediate, or a
         | dynamic value, and if you control the ISA you can balance
         | between the two as necessary. The packed SIMD approach isn't
         | "free" either - you need hardware support for immediate offsets
         | in SIMD load/store instrs.
         | 
         | Even in a hypothetical non-existent bad vector SIMD ISA without
         | any applicable free offsetting in loads/stores and a need for
         | unrolling, you can avoid having a dependency between unrolled
         | iterations by precomputing "vlen*2", "vlen*3", "vlen*4", ...
         | outside of the loop and adding those as necessary, instead of
         | having a strict dependency chain.
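
         For reference, a vector-length-agnostic RVV loop with the C
         intrinsics looks roughly like this (a sketch assuming the RVV 1.0
         intrinsics such as __riscv_vsetvl_e32m8; LMUL=8 is what provides
         the "hardware unrolling" mentioned above):

             #include <riscv_vector.h>
             #include <stddef.h>

             /* a[i] += b[i], without knowing the hardware vector length */
             void add_rvv(float *a, const float *b, size_t n)
             {
                 while (n > 0) {
                     size_t vl = __riscv_vsetvl_e32m8(n);  /* elements this pass */
                     vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
                     vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
                     __riscv_vse32_v_f32m8(a, __riscv_vfadd_vv_f32m8(va, vb, vl), vl);
                     a += vl; b += vl; n -= vl;
                 }
             }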
        
       | stephencanon wrote:
       | The basic problem with almost every "SIMD is flawed, we should
       | have vector ISAs" article or post (including the granddaddy,
       | "SIMD Instructions Considered Harmful"), is that they invariably
       | use SAXPY or something else trivial where everything stays neatly
       | in lane as their demonstration case. Of course vector ISAs look
       | good when you show them off using a pure vector task. This is
       | fundamentally unserious.
       | 
       | There is an enormous quantity of SIMD code in the world that
       | isn't SAXPY, and doesn't stay neatly in lane. Instead it's things
       | like "base64 encode this data" or "unpack and deinterleave this
       | 4:2:2 pixel data, apply a colorspace conversion as a 3x3 sparse
       | matrix and gamma adjustment in 16Q12 fixed-point format, resize
        | and rotate by 15° with three shear operations represented as a
       | linear convolution with a sinc kernel per row," or "extract these
       | fields from this JSON data". All of which _can totally be done_
       | with a well-designed vector ISA, but the comparison doesn't paint
       | nearly as rosy of a picture. The reality is that you really want
       | a mixture of ideas that come from fixed-width SIMD and ideas that
       | come from the vector world (which is roughly what people actually
       | shipping hardware have been steadily building over the last two
       | decades, implementing more support for unaligned access,
       | predication, etc, while the vector ISA crowd writes purist think
       | pieces)
        
       | Someone wrote:
       | > Since the register size is fixed there is no way to scale the
       | ISA to new levels of hardware parallelism without adding new
       | instructions and registers.
       | 
       | I think there is a way: vary register size per CPU, but also add
       | an instruction to retrieve register size. Then, code using the
       | vector unit will sometimes have to dynamically allocate a buffer
       | for intermediate values, but it would allow for software to run
       | across CPUs with different vector lengths. Does anybody know
       | whether any architecture does this?
        
       ___________________________________________________________________
       (page generated 2025-04-25 23:02 UTC)