[HN Gopher] Three Fundamental Flaws of SIMD ISAs (2023)
       ___________________________________________________________________
        
       Three Fundamental Flaws of SIMD ISAs (2023)
        
       Author : fanf2
       Score  : 94 points
       Date   : 2025-04-24 14:42 UTC (8 hours ago)
        
 (HTM) web link (www.bitsnbites.eu)
 (TXT) w3m dump (www.bitsnbites.eu)
        
       | pkhuong wrote:
       | There's more to SIMD than BLAS.
       | https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u... .
        
         | camel-cdr wrote:
         | BLAS, specifically gemm, is one of the rare things where you
         | naturally need to specialize on vector register width.
         | 
         | Most problems don't require this: E.g. your basic parallelizable
         | math stuff, unicode conversion, base64 de/encode, json parsing,
         | set intersection, quicksort*, bigint, run length encoding,
         | chacha20, ...
         | 
         | And if you run into a problem that benefits from knowing the
         | SIMD width, then just specialize on it. You can totally use
         | variable-length SIMD ISAs in a fixed-length way when required.
         | But most of the time it isn't required, and you have code that
         | easily scales between vector lengths.
         | 
         | *quicksort: most time is spent partitioning, which is vector
         | length agnostic; you can handle the leaves in a vector length
         | agnostic way, but you'll get more efficient code if you
         | specialize (idk how big the impact is; an in-register bitonic
         | sort is quite efficient).
        
       | convolvatron wrote:
       | i would certainly add lack of reductions ('horizontal'
       | operations) and a more generalized model of communication to the
       | list.
        
         | adgjlsfhk1 wrote:
         | The tricky part with reductions is that they are somewhat
         | inherently slow since they often need to be done pairwise and a
         | pairwise reduction over 16 elements will naturally have pretty
         | limited parallelism.
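         |
         | A minimal sketch of that chain, assuming AVX2 intrinsics:
         | summing the 8 floats in a ymm register pairwise takes log2(8) =
         | 3 dependent adds, no matter how many vector units the core has.
         |
         | #include <immintrin.h>
         |
         | /* Pairwise horizontal sum of 8 floats: each step halves the
         |    active width, giving a 3-deep chain of dependent adds. */
         | static float hsum8(__m256 v) {
         |     __m128 lo = _mm256_castps256_ps128(v);      /* lanes 0..3 */
         |     __m128 hi = _mm256_extractf128_ps(v, 1);    /* lanes 4..7 */
         |     __m128 s  = _mm_add_ps(lo, hi);             /* 8 -> 4 */
         |     s = _mm_add_ps(s, _mm_movehl_ps(s, s));     /* 4 -> 2 */
         |     s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); /* 2 -> 1 */
         |     return _mm_cvtss_f32(s);
         | }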
        
           | convolvatron wrote:
           | kinda? this is sort of a direct result of the 'vectors are
           | just sliced registers' model. if i do a pairwise operation
           | and divide my domain by 2 at each step, is the resulting
           | vector sparse or dense? if it's dense then I only really top
           | out when i'm in the last log2slice steps.
        
       | TinkersW wrote:
       | I write a lot of SIMD and I don't really agree with this..
       | 
       |  _Flaw1:fixed width_
       | 
       | I prefer fixed width as it makes the code simpler to write, size
       | is known at compile time so we know the size of our structures.
       | Swizzle algorithms are also customized based on the size.
       | 
       |  _Flaw2:pipelining_
       | 
       | no CPU I care about is in-order, so this is mostly irrelevant;
       | even scalar instructions are pipelined.
       | 
       |  _Flaw3: tail handling_
       | 
       | I code with SIMD as the target, and have special containers that
       | pad memory to SIMD width, no need to mask or run a scalar loop. I
       | copy the last valid value into the remaining slots so it doesn't
       | cause any branch divergence.
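       |
       | Roughly, the padding idea looks like this (a plain-C sketch,
       | assuming a 4-float SIMD width; alloc_padded is a hypothetical
       | helper, not from any particular library):
       |
       | #include <stdlib.h>
       |
       | #define SIMD_WIDTH 4  /* floats per register, e.g. SSE */
       |
       | /* Round n up to the next multiple of the SIMD width. */
       | static size_t pad_len(size_t n) {
       |     return (n + SIMD_WIDTH - 1) & ~(size_t)(SIMD_WIDTH - 1);
       | }
       |
       | /* Copy src into an over-allocated buffer and replicate the last
       |    valid value into the padding, so the SIMD loop never needs a
       |    mask or a scalar tail. Assumes n > 0. */
       | static float *alloc_padded(const float *src, size_t n) {
       |     size_t padded = pad_len(n);
       |     float *dst = aligned_alloc(16, padded * sizeof *dst);
       |     if (!dst) return NULL;
       |     for (size_t i = 0; i < n; i++)      dst[i] = src[i];
       |     for (size_t i = n; i < padded; i++) dst[i] = src[n - 1];
       |     return dst;
       | }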
        
         | cjbgkagh wrote:
         | I have similar thoughts,
         | 
         | I don't understand the push for variable width SIMD. Possibly
         | due to ignorance, but I think it's an abstraction that can be
         | specialized for different hardware, so tradeoffs similar to
         | those between low level and high level languages apply. Since I
         | already have to be aware of hardware level concepts such as
         | 256-bit shuffles not working across 128-bit lanes, and different
         | instructions having very different performance characteristics
         | on different CPUs, I'm already knee-deep in hardware specifics.
         | While in general I like abstractions, I've largely given up
         | waiting for a 'sufficiently advanced compiler' that would
         | properly auto-vectorize my code; I think AGI is more likely to
         | happen sooner. At a guess, SIMD code could work on GPUs, but GPU
         | code has different memory access costs, so the code there would
         | also be completely different.
         | 
         | So my view is either create a much better higher level SIMD
         | abstraction model with a sufficiently advanced compiler that
         | knows all the tricks or let me work closely at the hardware
         | level.
         | 
         | As an outsider who doesn't really know what is going on, it
         | does worry me a bit that WASM appears to be pushing for variable
         | width SIMD instead of supporting the ISAs generally supported by
         | CPUs. I guess it's a portability vs performance tradeoff - I
         | worry that it may be difficult to make variable width as
         | performant as fixed width, and would prefer to deal with
         | portability by having alternative branches at the code level.
         |
         | >> Finally, any software that wants to use the new instruction
         | set needs to be rewritten (or at least recompiled). What is
         | worse, software developers often have to target several SIMD
         | generations, and add mechanisms to their programs that
         | dynamically select the optimal code paths depending on which
         | SIMD generation is supported.
         | 
         | Why not marry the two and have variable width SIMD as one of
         | the ISA options? If in the future variable width SIMD becomes
         | more performant, then it would just be another branch to
         | dynamically select.
        
           | kevingadd wrote:
           | Part of the motive behind variable width SIMD in WASM is that
           | there's intentionally-ish no mechanism to do feature
           | detection at runtime in WASM. The whole module has to be
           | valid on your target, you can't include a handful of invalid
           | functions and conditionally execute them if the target
           | supports 256-wide or 512-wide SIMD. If you want to adapt you
           | have to ship entire modules for each set of supported feature
           | flags and select the correct module at startup after probing
           | what the target supports.
           | 
           | So variable width SIMD solves this by making any module using
           | it valid regardless of whether the target supports 512-bit
           | vectors, and the VM 'just' has to solve the problem of
           | generating good code.
           | 
           | Personally I think this is a terrible way to do things and
           | there should have just been a feature detection system, but
           | the horse fled the barn on that one like a decade ago.
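           |
           | For contrast, a native build typically does the check once at
           | startup, e.g. with GCC/Clang's __builtin_cpu_supports. A
           | sketch (sum_avx2/sum_sse2 are hypothetical kernels compiled
           | with different target flags):
           |
           | #include <stddef.h>
           |
           | void sum_avx2(const float *a, size_t n, float *out);
           | void sum_sse2(const float *a, size_t n, float *out);
           |
           | typedef void (*sum_fn)(const float *, size_t, float *);
           |
           | /* Pick an implementation based on what the CPU reports;
           |    WASM offers no equivalent probe inside a module. */
           | sum_fn select_sum(void) {
           |     if (__builtin_cpu_supports("avx2"))
           |         return sum_avx2;
           |     return sum_sse2;
           | }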
        
         | camel-cdr wrote:
         | > I prefer fixed width
         | 
         | Do you have examples for problems that are easier to solve in
         | fixed-width SIMD?
         | 
         | I maintain that most problems can be solved in a vector-length-
         | agnostic manner. Even if it's slightly more tricky, it's
         | certainly easier than restructuring all of your memory
         | allocations to add padding and implementing three versions for
         | all the differently sized SIMD extensions your target may
         | support. And you can always fall back to using a variable-width
         | SIMD ISA in a fixed-width way, when necessary.
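         |
         | For instance, a vector-length-agnostic loop with SVE intrinsics
         | might look like this sketch (assuming <arm_sve.h>); the same
         | binary runs on 128-, 256- or 512-bit hardware, and the
         | predicate also covers the tail:
         |
         | #include <arm_sve.h>
         | #include <stddef.h>
         |
         | /* c[i] = a[i] + b[i] without knowing the vector width at
         |    compile time: svcntw() reports the number of 32-bit lanes,
         |    whilelt builds the predicate, so there is no scalar tail. */
         | void add_f32(float *c, const float *a, const float *b, size_t n)
         | {
         |     for (size_t i = 0; i < n; i += svcntw()) {
         |         svbool_t pg = svwhilelt_b32_u64(i, n);
         |         svfloat32_t va = svld1_f32(pg, a + i);
         |         svfloat32_t vb = svld1_f32(pg, b + i);
         |         svst1_f32(pg, c + i, svadd_f32_x(pg, va, vb));
         |     }
         | }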
        
           | jcranmer wrote:
           | There's a category of autovectorization known as Superword-
           | Level Parallelism (SLP) which effectively scavenges an entire
           | basic block for individual instruction sequences that might
           | be squeezed together into a SIMD instruction. This kind of
           | vectorization doesn't work well with vector-length-agnostic
           | ISAs, because you generally can't scavenge more than a few
           | elements anyways, and inducing any sort of dynamic vector
           | length is more likely to slow your code down as a result
           | (since you can't do constant folding).
           | 
           | There are other kinds of interesting things you can do with
           | vectors that aren't improved by dynamic-length vectors.
           | Something like abseil's hash table, which uses vector code to
           | efficiently manage the occupancy bitmap. Dynamic vector
           | length doesn't help that much in that case, particularly
           | because the vector length you can parallelize over is itself
           | intrinsically low (if you're scanning dozens of elements to
           | find an empty slot, something is wrong). Vector swizzling is
           | harder to do dynamically and, at high vector factors,
           | difficult to do generically in hardware, which means that
           | with larger vectors (even before considering dynamic sizes),
           | vectorization is trickier if you have to do a lot of
           | swizzling.
           | 
           | In general, vector-length-agnostic is really only good for
           | SIMT-like code, where you can express the vector body as a
           | more or less independent f(index) for some knowable-before-
           | you-execute-the-loop range of indices. Stuff like DAXPY or
           | BLAS in general. Move away from this model, and that
           | agnosticism becomes overhead that doesn't pay for itself.
           | (Now granted, this kind of model is a _large fraction_ of
           | parallelizable code, but it's far from all of it.)
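           |
           | A small example of what SLP looks for: the four adjacent
           | scalar adds below can be packed into one 128-bit add
           | precisely because the element count is a small compile-time
           | constant, not a runtime vector length.
           |
           | /* Four independent adds on adjacent fields: an SLP pass can
           |    merge them into a single 4-wide vector add. */
           | struct vec4 { float x, y, z, w; };
           |
           | struct vec4 add4(struct vec4 a, struct vec4 b) {
           |     struct vec4 r;
           |     r.x = a.x + b.x;
           |     r.y = a.y + b.y;
           |     r.z = a.z + b.z;
           |     r.w = a.w + b.w;
           |     return r;
           | }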
        
             | camel-cdr wrote:
             | The SLP vectorizer is a good point, but I think it's, in
             | comparison with x86, more a problem of the float and vector
             | register files not being shared (in SVE and RVV). You don't
             | need to reconfigure the vector length; just use it at the
             | full width.
             | 
             | > Something like abseil's hash table
             | 
             | If I remember this correctly, the abseil lookup does scale
             | with vector length, as long as you use the native data path
             | width (albeit with small gains). There is a problem with
             | vector length agnostic handling of abseil, which is the
             | iterator API. With a different API, or compilers that could
             | eliminate redundant predicated loads/stores, this would be
             | easier.
             | 
             | > good for SIMT-like codes
             | 
             | Certainly, but I've also seen/written a lot of vector
             | length agnostic code using shuffles, which don't fit into
             | the SIMT paradigm, which means that the scope is larger
             | than just SIMT.
             | 
             | ---
             | 
             | As a general comparison, take AVX10/128, AVX10/256 and
             | AVX10/512, overlap their instruction encodings, remove the
             | few instructions that don't make sense anymore, and add a
             | cheap instruction to query the vector length (probably
             | also instructions like vid and viota, for easier shuffle
             | synthesis). Now you have a variable-length SIMD ISA
             | that feels familiar.
             | 
             | The above is basically what SVE is.
        
             | jonstewart wrote:
             | A number of the cool string processing SIMD techniques
             | depend a _lot_ on register widths and instruction
             | performance characteristics. There's a fair argument to be
             | made that x64 could be made more consistent/legible for
             | these use cases, but this isn't matmul--whether you have
             | 128, 256, or 512 bits matters hugely and you may want
             | entirely different algorithms that are contingent on this.
        
           | jandrewrogers wrote:
           | I also prefer fixed width. At least in C++, all of the
           | padding, alignment, etc is automagically codegen-ed for the
           | register type in my use cases, so the overhead is
           | approximately zero. All the complexity and cost is in
           | specializing for the capabilities of the underlying SIMD ISA,
           | not the width.
           | 
           | The benefit of fixed width is that optimal data structure and
           | algorithm design on various microarchitectures is dependent
           | on explicitly knowing the register width. SIMD widths aren't
           | perfectly substitutable in practice; there is more at
           | play than stride size. You can also do things like explicitly
           | combine separate logic streams in a single SIMD instruction
           | based on knowing the word layout. Compilers don't do this
           | work in 2025.
           | 
           | The argument for vector width agnostic code seems predicated
           | on the proverbial "sufficiently advanced compiler". I will
           | likely retire from the industry before such a compiler
           | actually exists. Like fusion power, it has been ten years
           | away my entire life.
        
             | camel-cdr wrote:
             | > The argument for vector width agnostic code seems
             | predicated on the proverbial "sufficiently advanced
             | compiler".
             | 
             | A SIMD ISA having a fixed size or not is orthogonal to
             | autovectorization. E.g. I've seen a bunch of cases where
             | things get autovectorized for RVV but not for AVX512. The
             | reason isn't fixed vs variable, but rather the supported
             | instructions themselves.
             | 
             | There are two things I'd like from a "sufficiently advanced
             | compiler", which are sizeless struct support and redundant
             | predicated load/store elimination. Those don't
             | fundamentally add new capabilities, but make working
             | with/integrating into existing APIs easier.
             | 
             | > All the complexity and cost is in specializing for the
             | capabilities of the underlying SIMD ISA, not the width.
             | 
             | Wow, it almost sounds like you could take basically the
             | same code and run it with different vector lengths.
             | 
             | > The benefit of fixed width is that optimal data structure
             | and algorithm design on various microarchitectures is
             | dependent on explicitly knowing the register width
             | 
             | Optimal to what degree? Like sure, fixed-width SIMD can
             | always turn your pointer increments from a register add to
             | an immediate add, so it's always more "optimal", but that
             | sort of thing doesn't matter.
             | 
             | The only difference you usually encounter when writing
             | variable instead of fixed size code is that you have to
             | synthesize your shuffles outside the loop. This usually
             | just takes a few instructions, but loading a constant is
             | certainly easier.
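             |
             | As a sketch of what synthesizing a shuffle means, assuming
             | the standard RVV C intrinsics (<riscv_vector.h>): a lane
             | reversal built from vid + vrsub, where a fixed-width ISA
             | would just load a shuffle-mask constant.
             |
             | #include <riscv_vector.h>
             |
             | /* Reverse the lanes of one vector register: the index
             |    vector [vl-1, ..., 1, 0] is synthesized at runtime. */
             | vuint32m1_t reverse_lanes(vuint32m1_t v, size_t vl) {
             |     vuint32m1_t idx = __riscv_vid_v_u32m1(vl);
             |     idx = __riscv_vrsub_vx_u32m1(idx, (uint32_t)(vl - 1), vl);
             |     return __riscv_vrgather_vv_u32m1(v, idx, vl);
             | }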
        
           | pkhuong wrote:
           | > Do you have examples for problems that are easier to solve
           | in fixed-width SIMD?
           | 
           | Regular expression matching and encryption come to mind.
        
             | camel-cdr wrote:
             | > Regular expression matching
             | 
             | That's probably true. Last time I looked at it, it seemed
             | like parts of vectorscan could be vectorized VLA, but from
             | my, very limited, understanding of the main matching
             | algorithm, it does seem to require specialization on vector
             | length.
             | 
             | It should be possible to do VLA in some capacity, but it
             | would probably be slower and it's too much work to test.
             | 
             | > encryption
             | 
             | From the things I've looked at, it's mixed.
             | 
             | E.g. chacha20 and poly1305 vectorize well in a VLA scheme:
             | https://camel-cdr.github.io/rvv-bench-
             | results/bpi_f3/chacha2..., https://camel-cdr.github.io/rvv-
             | bench-results/bpi_f3/poly130...
             | 
             | Keccak on the other hand was optimized for fast execution
             | on scalar ISAs with 32 GPRs. This is hard to vectorize in
             | general, because GPR "moves" are free and liberally
             | applied.
             | 
             | Another example where it's probably worth specializing is
             | quicksort, specifically the leaf part.
             | 
             | I've written a VLA version, which uses bitonic sort to sort
             | within vector registers. I wasn't able to meaningfully
             | compare it against a fixed size implementation, because
             | vqsort was super slow when I tried to compile it for RVV.
        
         | aengelke wrote:
         | I agree; and the article seems to have also quite a few
         | technical flaws:
         | 
         | - Register width: we somewhat maxed out at 512 bits, with Intel
         | going back to 256 bits for non-server CPUs. I don't see larger
         | widths on the horizon (even if SVE theoretically supports up to
         | 2048 bits, I don't know any implementation with ~~>256~~ >512
         | bits). Larger bit widths are not beneficial for most
         | applications and the few applications that are (e.g., some HPC
         | codes) are nowadays served by GPUs.
         | 
         | - The post mentions available opcode space: while opcode space
         | is limited, a reasonably well-designed ISA (e.g., AArch64) has
         | enough holes for extensions. Adding new instructions doesn't
         | require ABI changes, and while adding new registers requires
         | some kernel changes, this is well understood at this point.
         | 
         | - "What is worse, software developers often have to target
         | several SIMD generations" -- no way around this, though, unless
         | auto-vectorization becomes substantially better. Adjusting the
         | register width is not the big problem when porting code, making
         | better use of available instructions is.
         | 
         | - "The packed SIMD paradigm is that there is a 1:1 mapping
         | between the register width and the execution unit width" -- no.
         | E.g., AMD's Zen 4 does double pumping, and AVX was IIRC
         | originally designed to support this as well (although Intel
         | went directly for 256-bit units).
         | 
         | - "At the same time many SIMD operations are pipelined and
         | require several clock cycles to complete" -- well, they _are_
         | pipelined, but many SIMD instructions have the same latency as
         | their scalar counterpart.
         | 
         | - "Consequently, loops have to be unrolled in order to avoid
         | stalls and keep the pipeline busy." -- loop unrolling has several
         | benefits, mostly to reduce the overhead of the loop and to
         | avoid data dependencies between loop iterations. Larger basic
         | blocks are better for hardware as every branch, even if
         | predicted correctly, has a small penalty. "Loop unrolling also
         | increases register pressure" -- it does, but code that really
         | requires >32 registers is extremely rare, so a good instruction
         | scheduler in the compiler can avoid spilling.
         | 
         | In my experience, dynamic vector sizes make code slower,
         | because they inhibit optimizations. E.g., spilling a
         | dynamically sized vector is like a dynamic stack allocation
         | with a dynamic offset. I don't think SVE delivered any large
         | benefits, both in terms of performance (there's not much
         | hardware with SVE to begin with...) and compiler support.
         | RISC-V pushes further in this direction; we'll see how it
         | turns out.
        
           | cherryteastain wrote:
           | The Fujitsu A64FX used in the Fugaku supercomputer uses SVE
           | with a 512-bit width.
        
             | aengelke wrote:
             | Thanks, I misremembered. However, the microarchitecture is
             | a bit "weird" (really HPC-targeted), with very long
             | latencies (e.g., ADD (vector) 4 cycles, FADD (vector) 9
             | cycles). I remember that it was much slower than older x86
             | CPUs for non-SIMD code, and even for SIMD code, it took
             | quite a bit of effort to get reasonable performance through
             | instruction-level parallelism due to the long latencies and
             | the very limited out-of-order capacity (in particular just
             | 2x20 reservation station entries for FP).
        
           | camel-cdr wrote:
           | > we somewhat maxed out at 512 bits
           | 
           | Which still means you have to write your code at least
           | thrice, which is two times more than with a variable length
           | SIMD ISA.
           | 
           | Also there are processors with larger vector length, e.g.
           | 1024-bit: Andes AX45MPV, SiFive X380, 2048-bit: Akeana 1200,
           | 16384-bit: NEC SX-Aurora, Ara, EPI
           | 
           | > no way around this
           | 
           | You rarely need to rewrite SIMD code to take advantage of new
           | extensions, unless somebody decides to create a new one with
           | a larger SIMD width. This mostly happens when very
           | specialized instructions are added.
           | 
           | > In my experience, dynamic vector sizes make code slower,
           | because they inhibit optimizations.
           | 
           | Do you have more examples of this?
           | 
           | I don't see spilling as much of a problem, because you want
           | to avoid it regardless, and codegen for dynamic vector sizes
           | is pretty good in my experience.
           | 
           | > I don't think SVE delivered any large benefits
           | 
           | Well, all Arm CPUs except for the A64FX were built to execute
           | NEON as fast as possible. x86 CPUs aren't built to execute
           | MMX or SSE, or even the latest AVX, as fast as possible.
           | 
           | Anyway, I know of one comparison between NEON and SVE:
           | https://solidpixel.github.io/astcenc_meets_sve
           | 
           | > Performance was a lot better than I expected, giving
           | between 14 and 63% uplift. Larger block sizes benefitted the
           | most, as we get higher utilization of the wider vectors and
           | fewer idle lanes.
           | 
           | > I found the scale of the uplift somewhat surprising as
           | Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so
           | in terms of data-width the two should work out very similar.
        
             | vardump wrote:
             | > Which still means you have to write your code at least
             | thrice, which is two times more than with a variable length
             | SIMD ISA.
             | 
             | 256 and 512 bits are the only reasonable widths. 256 bit
             | AVX2 is what, 13 or 14 years old now.
        
               | adgjlsfhk1 wrote:
               | No. Because Intel is full of absolute idiots, Intel Atom
               | didn't support AVX1 until Gracemont. Tremont is missing
               | AVX1, AVX2, FMA, and basically the rest of x86-64-v3, and
               | shipped in CPUs as recently as 2021 (Jasper Lake).
        
               | vardump wrote:
               | Oh damn. I dropped SSE ages ago and no one complained.
               | I guess the customer base didn't use those chips...
        
             | aengelke wrote:
             | > Also there are processors with larger vector length
             | 
             | How do these fare in terms of absolute performance? The NEC
             | TSUBASA is not a CPU.
             | 
             | > Do you have more examples of this?
             | 
             | I ported some numeric simulation kernel to the A64FX some
             | time ago; fixing the vector width gave a 2x improvement.
             | Compilers probably/hopefully have gotten better in the
             | meantime and I haven't redone the experiments since then, but
             | I'd be surprised if this changed drastically. Spilling is
             | sometimes unavoidable, e.g. due to function calls.
             | 
             | > Anyway, I know of one comparison between NEON and SVE:
             | https://solidpixel.github.io/astcenc_meets_sve
             | 
             | I was specifically referring to dynamic vector sizes. This
             | experiment uses sizes fixed at compile-time, from the
             | article:
             | 
             | > For the astcenc implementation of SVE I decided to
             | implement a fixed-width 256-bit implementation, where the
             | vector length is known at compile time.
        
               | camel-cdr wrote:
               | > How do these fare in terms of absolute performance? The
               | NEC TSUBASA is not a CPU.
               | 
               | The NEC is an attached accelerator, but IIRC it can run
               | an OS in host mode. It's hard to tell how the others
               | perform, because most don't have hardware available yet
               | or only they and partner companies have access. It's also
               | hard to compare, because they don't target the desktop
               | market.
               | 
               | > I ported some numeric simulation kernel to the A64Fx
               | some time ago, fixing the vector width gave a 2x
               | improvement.
               | 
               | Oh, wow. Was this autovectorized or handwritten
               | intrinsics/assembly?
               | 
               | Any chance it's of a small enough scope that I could try
               | to recreate it?
               | 
               | > I was specifically referring to dynamic vector sizes.
               | 
               | Ah, sorry, yes you are correct. It still shows that
               | supporting VLA mechanisms in an ISA doesn't mean it's
               | slower for fixed-size usage.
               | 
               | I'm not aware of any proper VLA vs VLS comparisons. I
               | benchmarked a VLA vs VLS mandelbrot implementation once
               | where there was no performance difference, but that's too
               | simple an example.
        
           | deaddodo wrote:
           | > - Register width: we somewhat maxed out at 512 bits, with
           | Intel going back to 256 bits for non-server CPUs. I don't see
           | larger widths on the horizon (even if SVE theoretically
           | supports up to 2048 bits, I don't know any implementation
           | with >256 bits). Larger bit widths are not beneficial for
           | most applications and the few applications that are (e.g.,
           | some HPC codes) are nowadays served by GPUs.
           | 
           | Just to address this, it's pretty evident why scalar values
           | have stabilized at 64-bit and vectors at ~512 (though there
           | are larger implementations). Tell someone they only have 256
           | values to work with and they immediately see the limit; it's
           | why old 8-bit code wasted so much time shuffling carries to
           | compute larger values. Tell them you have 65536 values and it
           | alleviates a _large_ set of that problem, but you're still
           | going to hit limits frequently. Now you have up to 4294967296
           | values and the limits are realistically only going to be hit
           | in computational realms, so bump it up to
           | 18446744073709551615. Now even most commodity computational
           | limits are alleviated and the compiler will handle the data
           | shuffling for larger ones.
           | 
           | There was naturally going to be a point where there was
           | enough static computational power on integers that it didn't
           | make sense to continue widening them (at least, not at the
           | previous rate). The same goes for vectorization, but in even
           | more niche and specific fields.
        
         | tonetegeatinst wrote:
         | AFAIK about every modern CPU uses an out-of-order von Neumann
         | architecture. The only people who don't are the handful of
         | researchers and people who work on government research into
         | non-von Neumann designs.
        
           | luyu_wu wrote:
           | Low power RISC cores (both ARM and RISC-V) are typically in-
           | order actually!
           | 
           | But any core I can think of as 'high-performance' is OOO.
        
             | whaleofatw2022 wrote:
             | MIPS as well as Alpha, AFAIR. And technically Itanium. OTOH
             | it seems to me a bit like a niche for any performance
             | advantages...
        
         | PaulHoule wrote:
         | In AVX-512 we have a platform that rewards the assembly
         | language programmer like few platforms have since the 6502. I
         | see people doing really clever things that are specific to the
         | system, and on one level it is _really cool_, but on another
         | level it means SIMD is the domain of the specialist. Intel puts
         | out press releases about the really great features they have
         | for the national labs and for Facebook, whereas the rest of us
         | are 5-10 years behind the curve for SIMD adoption because the
         | juice isn't worth the squeeze.
         | 
         | Just before libraries for training neural nets on GPUs became
         | available I worked on a product that had a SIMD based neural
         | network trainer that was written in hand-coded assembly. We
         | were a generation behind in our AVX instructions so we gave up
         | half of the performance we could have got, but that was the
         | least of the challenges we had to overcome to get the product
         | in front of customers. [1]
         | 
         | My software-centric view of Intel's problems is that they've
         | been spending their customers' and shareholders' money to put
         | features in chips that are fused off or might as well be fused
         | off because they aren't widely supported in the industry. And
         | that they didn't see this as a problem and neither did their
         | enablers in the computing media and software industry. Just for
         | example, Apple used to ship the MKL libraries, which are like a
         | turbocharger for matrix math, back when they were using Intel
         | chips. For whatever reason, Microsoft did not do this with
         | Windows and neither did most Linux distributions so "the rest
         | of us" are stuck with a fraction of the performance that we
         | paid for.
         | 
         | AMD did the right thing in introducing double pumped AVX-512
         | because at least assembly language wizards have some place
         | where their code runs and the industry gets closer to the place
         | where we can count on using an instruction set defined _12
         | years ago._
         | 
         | [1] If I'd been tasked with updating it to the next generation I
         | would have written a compiler (if I take that many derivatives
         | by hand I'll get one wrong.) My boss would have ordered me not
         | to, I would have done it anyway and not checked it in.
        
           | bee_rider wrote:
           | It is kind of a bummer that MKL isn't open sourced, as that
           | would make inclusion in Linux easier. It is already free-as-
           | in-beer, but of course that doesn't solve everything.
           | 
           | Baffling that MS didn't use it. They have a pretty close
           | relationship...
           | 
           | Agree that they are sort of going after hard-to-use niche
           | features nowadays. But I think it is just that the real thing
           | we want--single threaded performance for branchy code--is,
           | like, incredibly difficult to improve nowadays.
        
             | PaulHoule wrote:
             | At the very least you can decode UTF-8 really quickly with
             | AVX-512
             | 
             | https://lemire.me/blog/2023/08/12/transcoding-
             | utf-8-strings-...
             | 
             | and web browsers at the very least spend a lot of cycles on
             | decoding HTML and JavaScript, which is UTF-8 encoded. It
             | turns out AVX-512 is good at a lot of things you wouldn't
             | think SIMD would be good at. Intel's got the problem that
             | people don't want to buy new computers because they don't
             | see much benefit from buying a new computer, but a new
             | computer doesn't have the benefit it could have because of
             | lagging software support, and the software support lags
             | because there aren't enough new computers to justify the
             | work to do the software support. Intel deserves blame for a
             | few things, one of which is that they have dragged their
             | feet at getting really innovative features into their
             | products while turning people off with various empty
             | slogans.
             | 
             | They really do have a new instruction set that targets
             | plain ordinary single threaded branchy code
             | 
             | https://www.intel.com/content/www/us/en/developer/articles/
             | t...
             | 
             | they'll probably be out of business before you can use it.
        
         | derf_ wrote:
         | _> I code with SIMD as the target, and have special containers
         | that pad memory to SIMD width..._
         | 
         | I think this may be domain-specific. I help maintain several
         | open-source audio libraries, and wind up being the one to
         | review the patches when people contribute SIMD for some
         | specific ISA, and I think without exception they always get the
         | tail handling wrong. Due to other interactions it cannot always
         | be avoided by padding. It can roughly double the complexity of
         | the code [0], and requires a disproportionate amount of
         | thinking time vs. the time the code spends running, but if you
         | don't spend that thinking time you can get OOB reads or writes,
         | and thus CVEs. Masked loads/stores are an improvement, but not
         | universally available. I don't have a lot of concrete
         | suggestions.
         | 
         | I also work with a lot of image/video SIMD, and this is just
         | not a problem, because most operations happen on fixed block
         | sizes, and padding buffers is easy and routine.
         | 
         | I agree I would have picked other things for the other two in
         | my own top-3 list.
         | 
         | [0] Here is a fun one, which actually performs worst when len
         | is a multiple of 8 (which it almost always is), and has 59
         | lines of code for tail handling vs. 33 lines for the main loop:
         | https://gitlab.xiph.org/xiph/opus/-/blob/main/celt/arm/celt_...
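         |
         | Where masked loads/stores do exist, the tail can share the main
         | loop's code path. A minimal sketch with AVX-512 intrinsics (an
         | illustration, not the pattern from the linked Opus code):
         |
         | #include <immintrin.h>
         | #include <stddef.h>
         |
         | /* out[i] = 2 * in[i]. The final iteration's mask covers only
         |    the remaining lanes, so loads and stores never touch memory
         |    past the end of the buffers. */
         | void scale2(float *out, const float *in, size_t n) {
         |     for (size_t i = 0; i < n; i += 16) {
         |         size_t rem = n - i;
         |         __mmask16 m = (rem >= 16)
         |             ? (__mmask16)0xFFFF
         |             : (__mmask16)((1u << rem) - 1);
         |         __m512 v = _mm512_maskz_loadu_ps(m, in + i);
         |         _mm512_mask_storeu_ps(out + i, m, _mm512_add_ps(v, v));
         |     }
         | }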
        
       | freeone3000 wrote:
       | x86 SIMD suffers from register aliasing. xmm0 is actually the
       | low-half of ymm0, so you need to explicitly tell the processor
       | what your input type is to properly handle overflow and signing.
       | Actual vectorized instructions don't have this problem but you
       | also can't change it now.
        
       | pornel wrote:
       | There are alternative universes where these wouldn't be a
       | problem.
       | 
       | For example, if we didn't settle on executing compiled machine
       | code exactly as-is, and had an instruction-updating pass (less
       | involved than a full VM byte code compilation), then we could
       | adjust SIMD width for existing binaries instead of waiting
       | decades for a new baseline or multiversioning faff.
       | 
       | Another interesting alternative is SIMT. Instead of having a
       | handful of special-case instructions combined with heavyweight
       | software-switched threads, we could have had every instruction
       | SIMDified. It requires structuring programs differently, but
       | getting max performance out of current CPUs already requires SIMD
       | + multicore + predictable branching, so we're doing it anyway,
       | just in a roundabout way.
        
         | LegionMammal978 wrote:
         | > Another interesting alternative is SIMT. Instead of having a
         | handful of special-case instructions combined with heavyweight
         | software-switched threads, we could have had every instruction
         | SIMDified. It requires structuring programs differently, but
         | getting max performance out of current CPUs already requires
         | SIMD + multicore + predictable branching, so we're doing it
         | anyway, just in a roundabout way.
         | 
         | Is that not where we're already going with the GPGPU trend? The
         | big catch with GPU programming is that many useful routines are
         | irreducibly very branchy (or at least, to an extent that
         | removing branches slows them down unacceptably), and every
         | divergent branch throws out a huge chunk of the GPU's
         | performance. So you retain a traditional CPU to run all your
         | branchy code, but you run into memory-bandwidth woes between
         | the CPU and GPU.
         | 
         | It's generally the exception instead of the rule when you have
         | a big block of data elements upfront that can all be handled
         | uniformly with no branching. These usually have to do with
         | graphics, physical simulation, etc., which is why the SIMT
         | model was popularized by GPUs.
        
           | winwang wrote:
           | Fun fact which I'm 50%(?) sure of: a single branch divergence
           | for integer instructions on current nvidia GPUs won't hurt
           | perf, because there are only 16 int32 lanes anyway.
        
         | aengelke wrote:
         | > if we didn't settle on executing compiled machine code
         | exactly as-is, and had a instruction-updating pass (less
         | involved than a full VM byte code compilation)
         | 
         | Apple tried something like this: they collected the LLVM
         | bitcode of apps so that they could recompile and even port to a
         | different architecture. To my knowledge, this was done exactly
         | once (watchOS armv7->AArch64) and deprecated afterwards.
         | Retargeting at this level is inherently difficult (different
         | ABIs, target-specific instructions, intrinsics, etc.). For the
         | same target with a larger feature set, the problems are
         | smaller, but so are the gains -- better SIMD usage would only
         | come from the auto-vectorizer and a better instruction selector
         | that uses different instructions. The expected gains,
         | however, are low for typical applications, and for math-heavy
         | programs, using optimized libraries or simply recompiling is
         | easier.
         | 
         | WebAssembly is a higher-level, more portable bytecode, but
         | performance levels are quite a bit behind natively compiled
         | code.
        
       | gitroom wrote:
       | Oh man, totally get the pain with compilers and SIMD tricks - the
       | struggle's so real. Ever feel like keeping low level control is
       | the only way stuff actually runs as smooth as you want, or am I
       | just too stubborn to give abstractions a real shot?
        
       | sweetjuly wrote:
       | Loop unrolling isn't really done because of pipelining but rather
       | to amortize the cost of looping. Any modern out-of-order core
       | will (on the happy path) schedule the operations identically
       | whether you did one copy per loop or four. The only difference is
       | the number of branches.
        
         | Remnant44 wrote:
         | These days, I strongly believe that loop unrolling is a
         | pessimization, especially with SIMD code.
         | 
         | Scalar code should be unrolled by the compiler to the SIMD word
         | width to expose potential parallelism. But other than that,
         | correctly predicted branches are free, and so is loop
         | instruction overhead on modern wide-dispatch processors. For
         | example, even running a maximally efficient AVX512 kernel on a
         | zen5 machine that dispatches 4 EUs and some load/stores and
         | calculates 2048 bits in the vector units every cycle, you still
         | have a ton of dispatch capacity to handle the loop overhead in
         | the scalar units.
         | 
         | The cost of unrolling is decreased code density and reduced
         | effectiveness of the instruction / uOp cache. I wish Clang in
         | particular would stop unrolling the dang vector loops.
        
           | adgjlsfhk1 wrote:
           | The part that's really weird is that on modern CPUs predicted
           | branches are free iff they're sufficiently rare (<1 out of 8
           | instructions or so). But if you have too many, you will be
           | bottlenecked on the branch since you aren't allowed to
           | speculate past a 2nd (3rd on zen5 without hyperthreading?)
           | branch.
        
             | dzaima wrote:
             | The limiting thing isn't necessarily speculating, but more
             | just the number of branches per cycle, i.e. number of non-
             | contiguous locations the processor has to query from L1 /
             | uop cache (and which the branch predictor has to determine
             | the location of). You get that limit with unconditional
             | branches too.
        
           | dzaima wrote:
           | Intel still shares ports between vector and scalar on
           | P-cores; a scalar multiply in the loop will definitely fight
           | with a vector port, and the bits of pointer bumps and branch
           | and whatnot can fill up the 1 or 2 scalar-only ports. And
           | maybe there are some minor power savings from wasting
           | resources on the scalar overhead. Still, clang does unroll
           | way too much.
        
             | Remnant44 wrote:
             | My understanding is that they've changed this for Lion Cove
             | and all future P cores, moving to much more of a Zen-like
             | setup with separate schedulers and ports for vector and
             | scalar ops.
        
       | bob1029 wrote:
       | > Since the register size is fixed there is no way to scale the
       | ISA to new levels of hardware parallelism without adding new
       | instructions and registers.
       | 
       | I look at SIMD as the same idea as any other aspect of the x86
       | instruction set. If you are directly interacting with it, you
       | should probably have a good reason to be.
       | 
       | I primarily interact with these primitives via types like
       | Vector<T> in .NET's System.Numerics namespace. With the
       | appropriate level of abstraction, you no longer have to worry
       | about how wide the underlying architecture is, or if it even
       | supports SIMD at all.
       | 
       | I'd prefer to let someone who is paid a very fat salary by an F100
       | spend their full time job worrying about how to emit SIMD
       | instructions for my program source.
        
       | timewizard wrote:
       | > Another problem is that each new SIMD generation requires new
       | instruction opcodes and encodings.
       | 
       | It requires new opcodes. It does not strictly require new
       | encodings. Several new encodings are legacy compatible and can
       | encode previous generations' vector instructions.
       | 
       | > so the architecture must provide enough SIMD registers to avoid
       | register spilling.
       | 
       | Or the architecture allows memory operands. The great joy of
       | basic x86 encoding is that you don't actually need to put things
       | in registers to operate on them.
       | 
       | > Usually you also need extra control logic before the loop. For
       | instance if the array length is less than the SIMD register
       | width, the main SIMD loop should be skipped.
       | 
       | What do you want? No control overhead or the speed enabled by
       | SIMD? This isn't a flaw. This is a necessary price to achieve the
       | efficiency you do in the main loop.
        
         | camel-cdr wrote:
         | > The great joy of basic x86 encoding is that you don't
         | actually need to put things in registers to operate on them.
         | 
         | That's just spilling with fewer steps. The executed uops should
         | be the same.
        
           | timewizard wrote:
           | > That's just spilling with fewer steps.
           | 
           | Another way to say this is it's "more efficient."
           | 
           | > The executed uops should be the same.
           | 
           | And "more densely coded."
        
             | camel-cdr wrote:
             | hm, I was wondering how the density compares with x86
             | having more complex encodings in general.
             | 
             | vaddps zmm1,zmm0,ZMMWORD PTR [r14]
             | 
             | takes six bytes to encode:
             | 
             | 62 d1 7c 48 58 0e
             | 
             | In SVE and RVV a load+add takes 8 bytes to encode.
        
       | lauriewired wrote:
       | The three "flaws" that this post lists are exactly what the
       | industry has been moving away from for the last decade.
       | 
       | Arm's SVE and RISC-V's vector extension are both vector-length-
       | agnostic. RISC-V's implementation is particularly nice: you only
       | have to compile for one code path (unlike AVX with the need for
       | fat-binary else/if trees).
        
       | dragontamer wrote:
       | 1. Not a problem for GPUs. Nvidia and AMD are both hard coded at
       | 32 wide (1024 bits). AMD can swap to 64-wide mode for
       | backwards compatibility with GCN. 1024-bit or 2048-bit seem to be
       | the right values. Too wide and you get branch divergence issues,
       | so it doesn't seem to make sense to go bigger.
       | 
       | In contrast, the systems that have flexible widths have never
       | taken off. It's seemingly much harder to design a programming
       | language for a flexible width SIMD.
       | 
       | 2. Not a problem for GPUs. It should be noted that kernels
       | allocate custom amounts of registers: one kernel may use 56
       | registers, while another kernel might use 200 registers. All GPUs
       | will run these two kernels simultaneously (256+ registers per CU
       | or SM is commonly supported, so both 200+56 registers kernels can
       | run together).
       | 
       | 3. Not a problem for GPUs, or really any SIMD in most cases. Tail
       | handling is an O(1) problem in general and not a significant
       | contributor to code length, size, or benchmarks.
       | 
       | Overall utilization issues are certainly a concern. But in my
       | experience this is caused by branching most often. (Branching in
       | GPUs is very inefficient and forces very low utilization).
        
         | dzaima wrote:
         | Tail handling is not significant for loops with tons of
         | iterations, but there are a ton of real-world situations where
         | you might have a loop take only like 5 iterations or something
         | (even at like 100 iterations, with a loop processing 8 elements
         | at a time (i.e. 256-bit vectors, 32-bit elements), that's 12
         | vectorized iterations plus up to 7 scalar ones, which is still
         | quite significant. At 1000 iterations you could still have the
         | scalar tail be a couple percent; and still doubling the L1/uop-
         | cache space the loop takes).
         | 
         | It's absolutely a significant contributor to code size (..in
         | scenarios where vectorized code in general is a significant
         | contributor to code size, which admittedly is only very-
         | specialized software).
        
       ___________________________________________________________________
       (page generated 2025-04-24 23:00 UTC)