[HN Gopher] Three Fundamental Flaws of SIMD ISAs (2023)
___________________________________________________________________
Three Fundamental Flaws of SIMD ISAs (2023)
Author : fanf2
Score : 94 points
Date : 2025-04-24 14:42 UTC (8 hours ago)
(HTM) web link (www.bitsnbites.eu)
(TXT) w3m dump (www.bitsnbites.eu)
| pkhuong wrote:
| There's more to SIMD than BLAS.
| https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u... .
| camel-cdr wrote:
| BLAS, specifically gemm, is one of the rare things where you
| naturally need to specialize on vector register width.
|
| Most problems don't require this: e.g. your basic parallelizable
| math stuff, unicode conversion, base64 de/encode, json parsing,
| set intersection, quicksort*, bigint, run length encoding,
| chacha20, ...
|
| And if you run into a problem that benefits from knowing the
| SIMD width, then just specialize on it. You can totally use
| variable-length SIMD ISAs in a fixed-length way when required.
| But most of the time it isn't required, and you have code that
| easily scales between vector lengths.
|
| *quicksort: most time is spent partitioning, which is vector
| length agnostic; you can handle the leaves in a vector length
| agnostic way too, but you'll get more efficient code if you
| specialize (idk how big the impact is, in-register bitonic sort
| is quite efficient).
| convolvatron wrote:
| i would certainly add lack of reductions ('horizontal'
| operations) and a more generalized model of communication to the
| list.
| adgjlsfhk1 wrote:
| The tricky part with reductions is that they are somewhat
| inherently slow since they often need to be done pairwise and a
| pairwise reduction over 16 elements will naturally have pretty
| limited parallelism.
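|
| For illustration, a pairwise (tree) reduction over 16 floats
| looks roughly like the following plain C++ sketch: there are
| only log2(16) = 4 dependent levels, and the available
| parallelism halves at each level (8, 4, 2, 1 adds):
|
|     // Tree reduction: 4 sequential levels for 16 elements,
|     // with shrinking parallelism at each level.
|     float reduce16(const float v[16]) {
|         float t[16];
|         for (int i = 0; i < 16; i++) t[i] = v[i];
|         for (int width = 8; width >= 1; width /= 2)   // 4 levels
|             for (int i = 0; i < width; i++)           // 8, 4, 2, 1 adds
|                 t[i] = t[i] + t[i + width];
|         return t[0];
|     }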
| convolvatron wrote:
| kinda? this is sort of a direct result of the 'vectors are
| just sliced registers' model. if i do a pairwise operation
| and divide my domain by 2 at each step, is the resulting
| vector sparse or dense? if it's dense then I only really top
| out when i'm in the last log2(slice) steps.
| TinkersW wrote:
| I write a lot of SIMD and I don't really agree with this..
|
| _Flaw1:fixed width_
|
| I prefer fixed width as it makes the code simpler to write: the
| size is known at compile time, so we know the size of our
| structures. Swizzle algorithms are also customized based on the
| size.
|
| _Flaw2:pipelining_
|
| No CPU I care about is in-order, so this is mostly irrelevant;
| even scalar instructions are pipelined.
|
| _Flaw3: tail handling_
|
| I code with SIMD as the target, and have special containers that
| pad memory to SIMD width, no need to mask or run a scalar loop. I
| copy the last valid value into the remaining slots so it doesn't
| cause any branch divergence.
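|
| A minimal sketch of that padding idea (an illustration, not the
| actual containers described above): pad the allocation up to a
| multiple of the SIMD width and replicate the last valid element
| into the padding, so vector loops can run over the padded length
| with no masks and no scalar tail:
|
|     #include <cstddef>
|     #include <vector>
|
|     constexpr std::size_t kSimdWidth = 8;  // e.g. 8 floats per 256-bit vector
|
|     struct PaddedFloats {
|         std::vector<float> data;
|         std::size_t        valid = 0;      // number of "real" elements
|
|         void assign(const float* src, std::size_t n) {
|             valid = n;
|             std::size_t padded =
|                 (n + kSimdWidth - 1) / kSimdWidth * kSimdWidth;
|             data.assign(src, src + n);
|             // Replicate the last valid value into the padding slots.
|             data.resize(padded, n ? src[n - 1] : 0.0f);
|         }
|     };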
| cjbgkagh wrote:
| I have similar thoughts,
|
| I don't understand the push for variable width SIMD. Possibly
| due to ignorance, but I see it as an abstraction that can be
| specialized for different hardware, so tradeoffs similar to
| those between low level and high level languages apply. Since I
| already have to be aware of hardware level concepts, such as
| 256-bit shuffles not working across 128-bit lanes and different
| instructions having very different performance characteristics
| on different CPUs, I'm already knee deep in hardware specifics.
| While in general I like abstractions, I've largely given up
| waiting for a 'sufficiently advanced compiler' that would
| properly auto-vectorize my code; I think AGI is more likely to
| happen sooner. At a guess, SIMD code could work on GPUs, but GPU
| code has different memory access costs, so the code there would
| also be completely different.
|
| So my view is: either create a much better higher level SIMD
| abstraction model with a sufficiently advanced compiler that
| knows all the tricks, or let me work closely at the hardware
| level.
|
| As an outsider who doesn't really know what is going on, it does
| worry me a bit that WASM appears to be pushing for variable
| width SIMD instead of supporting the ISAs generally supported by
| CPUs. I guess it's a portability vs performance tradeoff - I
| worry that it may be difficult to make variable width as
| performant as fixed width, and I would prefer to deal with
| portability by having alternative branches at the code level.
| >> Finally, any software that wants to use the new instruction
| set needs to be rewritten (or at least recompiled). What is
| worse, software developers often have to target several SIMD
| generations, and add mechanisms to their programs that
| dynamically select the optimal code paths depending on which
| SIMD generation is supported.
|
| Why not marry the two and have variable width SIMD as one of
| the ISA options? If in the future variable width SIMD becomes
| more performant, then it would just be another branch to
| dynamically select.
| kevingadd wrote:
| Part of the motive behind variable width SIMD in WASM is that
| there's intentionally-ish no mechanism to do feature
| detection at runtime in WASM. The whole module has to be
| valid on your target, you can't include a handful of invalid
| functions and conditionally execute them if the target
| supports 256-wide or 512-wide SIMD. If you want to adapt you
| have to ship entire modules for each set of supported feature
| flags and select the correct module at startup after probing
| what the target supports.
|
| So variable width SIMD solves this by making any module using
| it valid regardless of whether the target supports 512-bit
| vectors, and the VM 'just' has to solve the problem of
| generating good code.
|
| Personally I think this is a terrible way to do things and
| there should have just been a feature detection system, but
| the horse fled the barn on that one like a decade ago.
| camel-cdr wrote:
| > I prefer fixed width
|
| Do you have examples for problems that are easier to solve in
| fixed-width SIMD?
|
| I maintain that most problems can be solved in a vector-length-
| agnostic manner. Even if it's slightly more tricky, it's
| certainly easier than restructuring all of your memory
| allocations to add padding and implementing three versions for
| all the differently sized SIMD extensions your target may
| support. And you can always fall back to using a variable-width
| SIMD ISA in a fixed-width way, when necessary.
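|
| As a rough sketch of using a variable-width ISA in a fixed-width
| way (SVE here; kernel_256 and kernel_vla are hypothetical
| placeholders): query the runtime vector length once and dispatch
| to a width-specialized path when it matches.
|
|     #include <arm_sve.h>
|     #include <cstddef>
|
|     void kernel_vla(float* x, std::size_t n);   // length-agnostic version
|     void kernel_256(float* x, std::size_t n);   // tuned for 256-bit vectors
|
|     void kernel(float* x, std::size_t n) {
|         if (svcntb() == 32)      // vector length is 256 bits on this core
|             kernel_256(x, n);
|         else
|             kernel_vla(x, n);
|     }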
| jcranmer wrote:
| There's a category of autovectorization known as Superword-
| Level Parallelism (SLP) which effectively scavenges an entire
| basic block for individual instruction sequences that might
| be squeezed together into a SIMD instruction. This kind of
| vectorization doesn't work well with vector-length-agnostic
| ISAs, because you generally can't scavenge more than a few
| elements anyways, and inducing any sort of dynamic vector
| length is more likely to slow your code down as a result
| (since you can't do constant folding).
|
| There are other kinds of interesting things you can do with
| vectors that aren't improved by dynamic-length vectors.
| Something like abseil's hash table, which uses vector code to
| efficiently manage the occupancy bitmap. Dynamic vector
| length doesn't help that much in that case, particularly
| because the vector length you can parallelize over is itself
| intrinsically low (if you're scanning dozens of elements to
| find an empty slot, something is wrong). Vector swizzling is
| harder to do dynamically, and at high vector factors it is
| difficult to do generically in hardware, which means that as
| you go to larger vectors (even before considering dynamic
| sizes), vectorization gets trickier if you have to do a lot of
| swizzling.
|
| In general, vector-length-agnostic is really only good for
| SIMT-like codes, where you can express the vector body as
| more or less independent f(index) for some knowable-before-
| you-execute-the-loop range of indices. Stuff like DAXPY or
| BLAS in general. Move away from this model, and that
| agnosticism becomes overhead that doesn't pay for itself.
| (Now granted, this kind of model is a _large fraction_ of
| parallelizable code, but it's far from all of it.)
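|
| To make that SIMT-like shape concrete, here is a hedged sketch
| of a vector-length-agnostic DAXPY using SVE ACLE intrinsics
| (assuming an SVE toolchain); the same binary runs on any vector
| width:
|
|     #include <arm_sve.h>
|
|     // y[i] += a * x[i]: svcntd() lanes per iteration,
|     // whilelt predication covers the tail.
|     void daxpy(double a, const double* x, double* y, long n) {
|         for (long i = 0; i < n; i += (long)svcntd()) {
|             svbool_t    pg = svwhilelt_b64_s64(i, n);
|             svfloat64_t vx = svld1_f64(pg, &x[i]);
|             svfloat64_t vy = svld1_f64(pg, &y[i]);
|             vy = svmla_n_f64_x(pg, vy, vx, a);   // vy + vx*a
|             svst1_f64(pg, &y[i], vy);
|         }
|     }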
| camel-cdr wrote:
| The SLP vectorizer is a good point, but compared with x86 I
| think it's more a problem of the float and vector register
| files not being shared (in SVE and RVV). You don't need to
| reconfigure the vector length; just use it at the full width.
|
| > Something like abseil's hash table
|
| If I remember this correctly, the abseil lookup does scale
| with vector length, as long as you use the native data path
| width (albeit with small gains). There is a problem with
| vector length agnostic handling of abseil, which is the
| iterator API. With a different API, or compilers that could
| eliminate redundant predicated load/stores, this would be
| easier.
|
| > good for SIMT-like codes
|
| Certainly, but I've also seen/written a lot of vector
| length agnostic code using shuffles, which don't fit into
| the SIMT paradigm, which means that the scope is larger
| than just SIMT.
|
| ---
|
| As a general comparison, take AVX10/128, AVX10/256 and
| AVX10/512, overlap their instruction encodings, remove the
| few instructions that don't make sense anymore, and add a
| cheap instruction to query the vector length (probably
| also instructions like vid and viota, for easier shuffle
| synthesis). Now you have a variable-length SIMD ISA
| that feels familiar.
|
| The above is basically what SVE is.
| jonstewart wrote:
| A number of the cool string processing SIMD techniques
| depend a _lot_ on register widths and instruction
| performance characteristics. There's a fair argument to be
| made that x64 could be made more consistent/legible for
| these use cases, but this isn't matmul--whether you have
| 128, 256, or 512 bits matters hugely and you may want
| entirely different algorithms that are contingent on this.
| jandrewrogers wrote:
| I also prefer fixed width. At least in C++, all of the
| padding, alignment, etc is automagically codegen-ed for the
| register type in my use cases, so the overhead is
| approximately zero. All the complexity and cost is in
| specializing for the capabilities of the underlying SIMD ISA,
| not the width.
|
| The benefit of fixed width is that optimal data structure and
| algorithm design on various microarchitectures is dependent
| on explicitly knowing the register width. SIMD widths aren't
| perfectly substitutable in practice; there is more at play
| than stride size. You can also do things like explicitly
| combine separate logic streams in a single SIMD instruction
| based on knowing the word layout. Compilers don't do this
| work in 2025.
|
| The argument for vector width agnostic code seems predicated
| on the proverbial "sufficiently advanced compiler". I will
| likely retire from the industry before such a compiler
| actually exists. Like fusion power, it has been ten years
| away my entire life.
| camel-cdr wrote:
| > The argument for vector width agnostic code seems
| predicated on the proverbial "sufficiently advanced
| compiler".
|
| A SIMD ISA having a fixed size or not is orthogonal to
| autovectorization. E.g. I've seen a bunch of cases where
| things get autovectorized for RVV but not for AVX512. The
| reason isn't fixed vs variable, but rather the supported
| instructions themselves.
|
| There are two things I'd like from a "sufficiently advanced
| compiler", which are sizeless struct support and redundant
| predicated load/store elimination. Those don't
| fundamentally add new capabilities, but make working
| with/integrating into existing APIs easier.
|
| > All the complexity and cost is in specializing for the
| capabilities of the underlying SIMD ISA, not the width.
|
| Wow, it almost sounds like you could take basically the
| same code and run it with different vector lengths.
|
| > The benefit of fixed width is that optimal data structure
| and algorithm design on various microarchitectures is
| dependent on explicitly knowing the register width
|
| Optimal to what degree? Like sure, fixed-width SIMD can
| always turn your pointer increments from a register add to
| an immediate add, so it's always more "optimal", but that
| sort of thing doesn't matter.
|
| The only difference you usually encounter when writing
| variable instead of fixed size code is that you have to
| synthesize your shuffles outside the loop. This usually
| just takes a few instructions, but loading a constant is
| certainly easier.
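|
| For instance, synthesizing a "reverse all lanes" index vector
| outside the loop, where a fixed-width ISA would just load a
| constant mask. This is a sketch with SVE intrinsics (SVE
| actually has svrev for this exact case, so it's purely
| illustrative):
|
|     #include <arm_sve.h>
|     #include <cstdint>
|
|     // Build {VL-1, VL-2, ..., 1, 0} once, outside the loop.
|     static inline svuint32_t reverse_indices(void) {
|         uint32_t last = (uint32_t)svcntw() - 1;       // lanes - 1
|         return svsub_u32_x(svptrue_b32(),
|                            svdup_n_u32(last),         // {last, last, ...}
|                            svindex_u32(0, 1));        // {0, 1, 2, ...}
|     }
|
|     // Inside the loop, applying it is a single table lookup:
|     //     svfloat32_t rev = svtbl_f32(v, idx);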
| pkhuong wrote:
| > Do you have examples for problems that are easier to solve
| in fixed-width SIMD?
|
| Regular expression matching and encryption come to mind.
| camel-cdr wrote:
| > Regular expression matching
|
| That's probably true. Last time I looked at it, it seemed
| like parts of vectorscan could be vectorized VLA, but from
| my very limited understanding of the main matching
| algorithm, it does seem to require specialization on vector
| length.
|
| It should be possible to do VLA in some capacity, but it
| would probably be slower and it's too much work to test.
|
| > encryption
|
| From the things I've looked at, it's mixed.
|
| E.g. chacha20 and poly1305 vectorize well in a VLA scheme:
| https://camel-cdr.github.io/rvv-bench-
| results/bpi_f3/chacha2..., https://camel-cdr.github.io/rvv-
| bench-results/bpi_f3/poly130...
|
| Keccak on the other hand was optimized for fast execution
| on scalar ISAs with 32 GPRs. This is hard to vectorize in
| general, because GPR "moves" are free and liberally
| applied.
|
| Another example where it's probably worth specializing is
| quicksort, specifically the leaf part.
|
| I've written a VLA version, which uses bitonic sort to sort
| within vector registers. I wasn't able to meaningfully
| compare it against a fixed size implementation, because
| vqsort was super slow when I tried to compile it for RVV.
| aengelke wrote:
| I agree; and the article seems to have also quite a few
| technical flaws:
|
| - Register width: we somewhat maxed out at 512 bits, with Intel
| going back to 256 bits for non-server CPUs. I don't see larger
| widths on the horizon (even if SVE theoretically supports up to
| 2048 bits, I don't know any implementation with ~~>256~~ >512
| bits). Larger bit widths are not beneficial for most
| applications and the few applications that are (e.g., some HPC
| codes) are nowadays served by GPUs.
|
| - The post mentions available opcode space: while opcode space
| is limited, a reasonably well-designed ISA (e.g., AArch64) has
| enough holes for extensions. Adding new instructions doesn't
| require ABI changes, and while adding new registers requires
| some kernel changes, this is well understood at this point.
|
| - "What is worse, software developers often have to target
| several SIMD generations" -- no way around this, though, unless
| auto-vectorization becomes substantially better. Adjusting the
| register width is not the big problem when porting code, making
| better use of available instructions is.
|
| - "The packed SIMD paradigm is that there is a 1:1 mapping
| between the register width and the execution unit width" -- no.
| E.g., AMD's Zen 4 does double pumping, and AVX was IIRC
| originally designed to support this as well (although Intel
| went directly for 256-bit units).
|
| - "At the same time many SIMD operations are pipelined and
| require several clock cycles to complete" -- well, they _are_
| pipelined, but many SIMD instructions have the same latency as
| their scalar counterpart.
|
| - "Consequently, loops have to be unrolled in order to avoid
| stalls and keep the pipeline busy." -- loop unrolling has several
| benefits, mostly to reduce the overhead of the loop and to
| avoid data dependencies between loop iterations. Larger basic
| blocks are better for hardware as every branch, even if
| predicted correctly, has a small penalty. "Loop unrolling also
| increases register pressure" -- it does, but code that really
| requires >32 registers is extremely rare, so a good instruction
| scheduler in the compiler can avoid spilling.
|
| In my experience, dynamic vector sizes make code slower,
| because they inhibit optimizations. E.g., spilling a
| dynamically sized vector is like a dynamic stack allocation
| with a dynamic offset. I don't think SVE delivered any large
| benefits, both in terms of performance (there's not much
| hardware with SVE to begin with...) and compiler support.
| RISC-V pushes further in this direction; we'll see how this
| turns out.
| cherryteastain wrote:
| The Fujitsu A64FX used in the Fugaku supercomputer uses SVE
| with a 512-bit width.
| aengelke wrote:
| Thanks, I misremembered. However, the microarchitecture is
| a bit "weird" (really HPC-targeted), with very long
| latencies (e.g., ADD (vector) 4 cycles, FADD (vector) 9
| cycles). I remember that it was much slower than older x86
| CPUs for non-SIMD code, and even for SIMD code, it took
| quite a bit of effort to get reasonable performance through
| instruction-level parallelism due to the long latencies and
| the very limited out-of-order capacities (in particular the
| just 2x20 reservation station entries for FP).
| camel-cdr wrote:
| > we somewhat maxed out at 512 bits
|
| Which still means you have to write your code at least
| thrice, which is two times more than with a variable length
| SIMD ISA.
|
| Also, there are processors with larger vector lengths, e.g.
| 1024-bit (Andes AX45MPV, SiFive X380), 2048-bit (Akeana 1200),
| and 16384-bit (NEC SX-Aurora, Ara, EPI).
|
| > no way around this
|
| You rarely need to rewrite SIMD code to take advantage of new
| extensions, unless somebody decides to create a new one with
| a larger SIMD width. This mostly happens when very
| specialized instructions are added.
|
| > In my experience, dynamic vector sizes make code slower,
| because they inhibit optimizations.
|
| Do you have more examples of this?
|
| I don't see spilling as much of a problem, because you want
| to avoid it regardless, and codegen for dynamic vector sizes
| is pretty good in my experience.
|
| > I don't think SVE delivered any large benefits
|
| Well, all Arm CPUs except for the A64FX were built to execute
| NEON as fast as possible. X86 CPUs aren't built to execute
| MMX or SSE or the latest, even AVX, as fast as possible.
|
| Anyway, I know of one comparison between NEON and SVE:
| https://solidpixel.github.io/astcenc_meets_sve
|
| > Performance was a lot better than I expected, giving
| between 14 and 63% uplift. Larger block sizes benefitted the
| most, as we get higher utilization of the wider vectors and
| fewer idle lanes.
|
| > I found the scale of the uplift somewhat surprising as
| Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so
| in terms of data-width the two should work out very similar.
| vardump wrote:
| > Which still means you have to write your code at least
| thrice, which is two times more than with a variable length
| SIMD ISA.
|
| 256 and 512 bits are the only reasonable widths. 256 bit
| AVX2 is what, 13 or 14 years old now.
| adgjlsfhk1 wrote:
| no. Because Intel is full of absolute idiots, Intel Atom
| didn't support AVX1 until Gracemont. Tremont is missing
| AVX1, AVX2, FMA, and basically the rest of x86-64-v3, and
| shipped in CPUs as recently as 2021 (Jasper Lake).
| vardump wrote:
| Oh damn. I've dropped SSE ages ago and no one complained.
| I guess the customer base didn't use those chips...
| aengelke wrote:
| > Also there are processors with larger vector length
|
| How do these fare in terms of absolute performance? The NEC
| TSUBASA is not a CPU.
|
| > Do you have more examples of this?
|
| I ported some numeric simulation kernel to the A64FX some
| time ago; fixing the vector width gave a 2x improvement.
| Compilers probably/hopefully have gotten better in the
| meantime and I haven't redone the experiments since then, but
| I'd be surprised if this changed drastically. Spilling is
| sometimes unavoidable, e.g. due to function calls.
|
| > Anyway, I know of one comparison between NEON and SVE:
| https://solidpixel.github.io/astcenc_meets_sve
|
| I was specifically referring to dynamic vector sizes. This
| experiment uses sizes fixed at compile-time, from the
| article:
|
| > For the astcenc implementation of SVE I decided to
| implement a fixed-width 256-bit implementation, where the
| vector length is known at compile time.
| camel-cdr wrote:
| > How do these fare in terms of absolute performance? The
| NEC TSUBASA is not a CPU.
|
| The NEC is an attached accelerator, but IIRC it can run
| an OS in host mode. It's hard to tell how the others
| perform, because most don't have hardware available yet
| or only they and partner companies have access. It's also
| hard to compare, because they don't target the desktop
| market.
|
| > I ported some numeric simulation kernel to the A64Fx
| some time ago, fixing the vector width gave a 2x
| improvement.
|
| Oh, wow. Was this autovectorized or handwritten
| intrinsics/assembly?
|
| Any chance it's of a small enough scope that I could try
| to recreate it?
|
| > I was specifically referring to dynamic vector sizes.
|
| Ah, sorry, yes you are correct. It still shows that
| supporting VLA mechanisms in an ISA doesn't mean it's
| slower for fixed-size usage.
|
| I'm not aware of any proper VLA vs VLS comparisons. I
| benchmarked a VLA vs VLS mandelbrot implementation once
| where there was no performance difference, but that's too
| simple an example.
| deaddodo wrote:
| > - Register width: we somewhat maxed out at 512 bits, with
| Intel going back to 256 bits for non-server CPUs. I don't see
| larger widths on the horizon (even if SVE theoretically
| supports up to 2048 bits, I don't know any implementation
| with >256 bits). Larger bit widths are not beneficial for
| most applications and the few applications that are (e.g.,
| some HPC codes) are nowadays served by GPUs.
|
| Just to address this, it's pretty evident why scalar values
| have stabilized at 64-bit and vectors at ~512 (though there
| are larger implementations). Tell someone they only have 256
| values to work with and they immediately see the limit, it's
| why old 8-bit code wasted so much time shuffling carries to
| compute larger values. Tell them you have 65536 values and it
| alleviates a _large_ set of that problem, but you're still
| going to hit limits frequently. Now you have up to 4294967296
| values and the limits are realistically only going to be hit
| in computational realms, so bump it up to
| 18446744073709551615. Now even most commodity computational
| limits are alleviated and the compiler will handle the data
| shuffling for larger ones.
|
| There was naturally going to be a point where there was
| enough static computational power on integers that it didn't
| make sense to continue widening them (at least, not at the
| previous rate). The same goes for vectorization, but in even
| more niche and specific fields.
| tonetegeatinst wrote:
| AFAIK just about every modern CPU uses an out-of-order von
| Neumann architecture. The only people who don't are the
| handful of researchers and people working on government
| research into non-von-Neumann designs.
| luyu_wu wrote:
| Low power RISC cores (both ARM and RISC-V) are typically in-
| order actually!
|
| But any core I can think of as 'high-performance' is OOO.
| whaleofatw2022 wrote:
| MIPS as well as Alpha, AFAIR. And technically Itanium; OTOH
| it seems to me a bit like a niche for any performance
| advantages...
| PaulHoule wrote:
| In AVX-512 we have a platform that rewards the assembly
| language programmer like few platforms have since the 6502. I
| see people doing really clever things that are specific to the
| system, and on one level it is _really cool_, but on another level
| it means SIMD is the domain of the specialist, Intel puts out
| press releases about the really great features they have for
| the national labs and for Facebook whereas the rest of us are
| 5-10 years behind the curve for SIMD adoption because the juice
| isn't worth the squeeze.
|
| Just before libraries for training neural nets on GPUs became
| available I worked on a product that had a SIMD based neural
| network trainer that was written in hand-coded assembly. We
| were a generation behind in our AVX instructions so we gave up
| half of the performance we could have got, but that was the
| least of the challenges we had to overcome to get the product
| in front of customers. [1]
|
| My software-centric view of Intel's problems is that they've
| been spending their customers' and shareholders' money to put
| features in chips that are fused off or might as well be fused
| off because they aren't widely supported in the industry. And
| that they didn't see this as a problem and neither did their
| enablers in the computing media and software industry. Just for
| example, Apple used to ship the MKL libraries, which are like a
| turbocharger for matrix math back when they were using Intel
| chips. For whatever reason, Microsoft did not do this with
| Windows and neither did most Linux distributions so "the rest
| of us" are stuck with a fraction of the performance that we
| paid for.
|
| AMD did the right thing in introducing double pumped AVX-512
| because at least assembly language wizards have some place
| where their code runs and the industry gets closer to the place
| where we can count on using an instruction set defined _12
| years ago._
|
| [1] If I'd been tasked with updating it to the next generation I
| would have written a compiler (if I take that many derivatives
| by hand I'll get one wrong.) My boss would have ordered me not
| to, I would have done it anyway and not checked it in.
| bee_rider wrote:
| It is kind of a bummer that MKL isn't open sourced, as that
| would make inclusion in Linux easier. It is already free-as-
| in-beer, but of course that doesn't solve everything.
|
| Baffling that MS didn't use it. They have a pretty close
| relationship...
|
| Agree that they are sort of going after hard-to-use niche
| features nowadays. But I think it is just that the real thing
| we want--single threaded performance for branchy code--is,
| like, incredibly difficult to improve nowadays.
| PaulHoule wrote:
| At the very least you can decode UTF-8 really quickly with
| AVX-512
|
| https://lemire.me/blog/2023/08/12/transcoding-
| utf-8-strings-...
|
| and web browsers at the very least spend a lot of cycles on
| decoding HTML and JavaScript, which is UTF-8 encoded. It
| turns out AVX-512 is good at a lot of things you wouldn't
| think SIMD would be good at. Intel's got the problem that
| people don't want to buy new computers because they don't
| see much benefit from buying a new computer, but a new
| computer doesn't have the benefit it could have because of
| lagging software support, and the software support lags
| because there aren't enough new computers to justify the
| work to do the software support. Intel deserves blame for a
| few things, one of which is that they have dragged their
| feet at getting really innovative features into their
| products while turning people off with various empty
| slogans.
|
| They really do have a new instruction set that targets
| plain ordinary single threaded branchy code
|
| https://www.intel.com/content/www/us/en/developer/articles/
| t...
|
| they'll probably be out of business before you can use it.
| derf_ wrote:
| _> I code with SIMD as the target, and have special containers
| that pad memory to SIMD width..._
|
| I think this may be domain-specific. I help maintain several
| open-source audio libraries, and wind up being the one to
| review the patches when people contribute SIMD for some
| specific ISA, and I think without exception they always get the
| tail handling wrong. Due to other interactions it cannot always
| be avoided by padding. It can roughly double the complexity of
| the code [0], and requires a disproportionate amount of
| thinking time vs. the time the code spends running, but if you
| don't spend that thinking time you can get OOB reads or writes,
| and thus CVEs. Masked loads/stores are an improvement, but not
| universally available. I don't have a lot of concrete
| suggestions.
|
| I also work with a lot of image/video SIMD, and this is just
| not a problem, because most operations happen on fixed block
| sizes, and padding buffers is easy and routine.
|
| I agree I would have picked other things for the other two in
| my own top-3 list.
|
| [0] Here is a fun one, which actually performs worst when len
| is a multiple of 8 (which it almost always is), and has 59
| lines of code for tail handling vs. 33 lines for the main loop:
| https://gitlab.xiph.org/xiph/opus/-/blob/main/celt/arm/celt_...
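|
| Where masked loads/stores are available, the tail can at least
| collapse into one masked iteration; here is a minimal AVX-512
| sketch (assuming AVX-512F; unrelated to the opus code linked
| above):
|
|     #include <immintrin.h>
|     #include <cstddef>
|
|     // Scale n floats by s; the <16-element remainder is handled
|     // by a single masked iteration instead of a scalar loop.
|     void scale(float* x, std::size_t n, float s) {
|         const __m512 vs = _mm512_set1_ps(s);
|         std::size_t i = 0;
|         for (; i + 16 <= n; i += 16)
|             _mm512_storeu_ps(x + i,
|                              _mm512_mul_ps(_mm512_loadu_ps(x + i), vs));
|         __mmask16 m = (__mmask16)((1u << (n - i)) - 1);  // n - i < 16
|         __m512 v = _mm512_maskz_loadu_ps(m, x + i);
|         _mm512_mask_storeu_ps(x + i, m, _mm512_mul_ps(v, vs));
|     }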
| freeone3000 wrote:
| x86 SIMD suffers from register aliasing. xmm0 is actually the
| low-half of ymm0, so you need to explicitly tell the processor
| what your input type is to properly handle overflow and signing.
| Actual vectorized instructions don't have this problem but you
| also can't change it now.
| pornel wrote:
| There are alternative universes where these wouldn't be a
| problem.
|
| For example, if we didn't settle on executing compiled machine
| code exactly as-is, and had an instruction-updating pass (less
| involved than a full VM byte code compilation), then we could
| adjust SIMD width for existing binaries instead of waiting
| decades for a new baseline or multiversioning faff.
|
| Another interesting alternative is SIMT. Instead of having a
| handful of special-case instructions combined with heavyweight
| software-switched threads, we could have had every instruction
| SIMDified. It requires structuring programs differently, but
| getting max performance out of current CPUs already requires SIMD
| + multicore + predictable branching, so we're doing it anyway,
| just in a roundabout way.
| LegionMammal978 wrote:
| > Another interesting alternative is SIMT. Instead of having a
| handful of special-case instructions combined with heavyweight
| software-switched threads, we could have had every instruction
| SIMDified. It requires structuring programs differently, but
| getting max performance out of current CPUs already requires
| SIMD + multicore + predictable branching, so we're doing it
| anyway, just in a roundabout way.
|
| Is that not where we're already going with the GPGPU trend? The
| big catch with GPU programming is that many useful routines are
| irreducibly very branchy (or at least, to an extent that
| removing branches slows them down unacceptably), and every
| divergent branch throws out a huge chunk of the GPU's
| performance. So you retain a traditional CPU to run all your
| branchy code, but you run into memory-bandwidth woes between
| the CPU and GPU.
|
| It's generally the exception instead of the rule when you have
| a big block of data elements upfront that can all be handled
| uniformly with no branching. These usually have to do with
| graphics, physical simulation, etc., which is why the SIMT
| model was popularized by GPUs.
| winwang wrote:
| Fun fact which I'm 50%(?) sure of: a single branch divergence
| for integer instructions on current nvidia GPUs won't hurt
| perf, because there are only 16 int32 lanes anyway.
| aengelke wrote:
| > if we didn't settle on executing compiled machine code
| exactly as-is, and had a instruction-updating pass (less
| involved than a full VM byte code compilation)
|
| Apple tried something like this: they collected the LLVM
| bitcode of apps so that they could recompile and even port to a
| different architecture. To my knowledge, this was done exactly
| once (watchOS armv7->AArch64) and deprecated afterwards.
| Retargeting at this level is inherently difficult (different
| ABIs, target-specific instructions, intrinsics, etc.). For the
| same target with a larger feature set, the problems are
| smaller, but so are the gains -- better SIMD usage would only
| come from the auto-vectorizer and a better instruction selector
| that uses different instructions. The expectable gains,
| however, are low for typical applications and for math-heavy
| programs, using optimized libraries or simply recompiling is
| easier.
|
| WebAssembly is a higher-level, more portable bytecode, but
| performance levels are quite a bit behind natively compiled
| code.
| gitroom wrote:
| Oh man, totally get the pain with compilers and SIMD tricks - the
| struggle's so real. Ever feel like keeping low level control is
| the only way stuff actually runs as smooth as you want, or am I
| just too stubborn to give abstractions a real shot?
| sweetjuly wrote:
| Loop unrolling isn't really done because of pipelining but rather
| to amortize the cost of looping. Any modern out-of-order core
| will (on the happy path) schedule the operations identically
| whether you did one copy per loop or four. The only difference is
| the number of branches.
| Remnant44 wrote:
| These days, I strongly believe that loop unrolling is a
| pessimization, especially with SIMD code.
|
| Scalar code should be unrolled by the compiler to the SIMD word
| width to expose potential parallelism. But other than that,
| correctly predicted branches are free, and so is loop
| instruction overhead on modern wide-dispatch processors. For
| example, even running a maximally efficient AVX512 kernel on a
| zen5 machine that dispatches 4 EUs and some load/stores and
| calculates 2048 bits in the vector units every cycle, you still
| have a ton of dispatch capacity to handle the loop overhead in
| the scalar units.
|
| The cost of unrolling is decreased code density and reduced
| effectiveness of the instruction / uOp cache. I wish Clang in
| particular would stop unrolling the dang vector loops.
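|
| For what it's worth, Clang can at least be told per-loop not to
| unroll; a minimal sketch (AVX2 body, tail handling omitted):
|
|     #include <immintrin.h>
|     #include <cstddef>
|
|     void add1(float* x, std::size_t n) {
|         // Ask Clang not to unroll this already-vectorized loop.
|         #pragma clang loop unroll(disable)
|         for (std::size_t i = 0; i + 8 <= n; i += 8) {
|             __m256 v = _mm256_loadu_ps(x + i);
|             _mm256_storeu_ps(x + i,
|                              _mm256_add_ps(v, _mm256_set1_ps(1.0f)));
|         }
|     }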
| adgjlsfhk1 wrote:
| The part that's really weird is that on modern CPUs predicted
| branches are free iff they're sufficiently rare (<1 out of 8
| instructions or so). But if you have too many, you will be
| bottlenecked on the branch since you aren't allowed to
| speculate past a 2nd (3rd on zen5 without hyperthreading?)
| branch.
| dzaima wrote:
| The limiting thing isn't necessarily speculating, but more
| just the number of branches per cycle, i.e. number of non-
| contiguous locations the processor has to query from L1 /
| uop cache (and which the branch predictor has to determine
| the location of). You get that limit with unconditional
| branches too.
| dzaima wrote:
| Intel still shares ports between vector and scalar on
| P-cores; a scalar multiply in the loop will definitely fight
| with a vector port, and the bits of pointer bumps and branch
| and whatnot can fill up the 1 or 2 scalar-only ports. And
| maybe there are some minor power savings from wasting
| resources on the scalar overhead. Still, clang does unroll
| way too much.
| Remnant44 wrote:
| My understanding is that they've changed this for Lion Cove
| and all future P cores, moving to much more of a Zen-like
| setup with separate schedulers and ports for vector and
| scalar ops.
| bob1029 wrote:
| > Since the register size is fixed there is no way to scale the
| ISA to new levels of hardware parallelism without adding new
| instructions and registers.
|
| I look at SIMD as the same idea as any other aspect of the x86
| instruction set. If you are directly interacting with it, you
| should probably have a good reason to be.
|
| I primarily interact with these primitives via types like
| Vector<T> in .NET's System.Numerics namespace. With the
| appropriate level of abstraction, you no longer have to worry
| about how wide the underlying architecture is, or if it even
| supports SIMD at all.
|
| I'd prefer to let someone who is paid a very fat salary by a F100
| spend their full time job worrying about how to emit SIMD
| instructions for my program source.
| timewizard wrote:
| > Another problem is that each new SIMD generation requires new
| instruction opcodes and encodings.
|
| It requires new opcodes. It does not strictly require new
| encodings. Several new encodings are legacy compatible and can
| encode previous generations' vector instructions.
|
| > so the architecture must provide enough SIMD registers to avoid
| register spilling.
|
| Or the architecture allows memory operands. The great joy of
| basic x86 encoding is that you don't actually need to put things
| in registers to operate on them.
|
| > Usually you also need extra control logic before the loop. For
| instance if the array length is less than the SIMD register
| width, the main SIMD loop should be skipped.
|
| What do you want? No control overhead or the speed enabled by
| SIMD? This isn't a flaw. This is a necessary price to achieve the
| efficiency you do in the main loop.
| camel-cdr wrote:
| > The great joy of basic x86 encoding is that you don't
| actually need to put things in registers to operate on them.
|
| That's just spilling with fewer steps. The executed uops should
| be the same.
| timewizard wrote:
| > That's just spilling with fewer steps.
|
| Another way to say this is it's "more efficient."
|
| > The executed uops should be the same.
|
| And "more densely coded."
| camel-cdr wrote:
| hm, I was wondering how the density compares with x86
| having more complex encodings in general.
|
| vaddps zmm1,zmm0,ZMMWORD PTR [r14]
|
| takes six bytes to encode:
|
| 62 d1 7c 48 58 0e
|
| In SVE and RVV a load+add takes 8 bytes to encode.
| lauriewired wrote:
| The three "flaws" that this post lists are exactly what the
| industry has been moving away from for the last decade.
|
| Arm's SVE and RISC-V's vector extension are both vector-length-
| agnostic. RISC-V's implementation is particularly nice: you only
| have to compile for one code path (unlike AVX with the need for
| fat-binary else/if trees).
| dragontamer wrote:
| 1. Not a problem for GPUs. Nvidia and AMD are both hard coded to
| 32-wide, i.e. 1024 bits. AMD can swap to 64-wide mode for
| backwards compatibility with GCN. 1024-bit or 2048-bit seem to
| be the right values. Too wide and you get branch divergence
| issues, so it doesn't seem to make sense to go bigger.
|
| In contrast, the systems that have flexible widths have never
| taken off. It's seemingly much harder to design a programming
| language for a flexible width SIMD.
|
| 2. Not a problem for GPUs. It should be noted that kernels
| allocate custom amounts of registers: one kernel may use 56
| registers, while another kernel might use 200 registers. All GPUs
| will run these two kernels simultaneously (256+ registers per CU
| or SM is commonly supported, so both 200+56 registers kernels can
| run together).
|
| 3. Not a problem for GPUs or really any SIMD in most cases. Tail
| handling is an O(1) problem in general and not a significant
| contributor to code length, size, or benchmarks.
|
| Overall utilization issues are certainly a concern. But in my
| experience this is caused by branching most often. (Branching in
| GPUs is very inefficient and forces very low utilization).
| dzaima wrote:
| Tail handling is not significant for loops with tons of
| iterations, but there are a ton of real-world situations where
| a loop might take only like 5 iterations or something. Even at
| like 100 iterations, with a loop processing 8 elements at a
| time (i.e. 256-bit vectors, 32-bit elements), that's 12
| vectorized iterations plus up to 7 scalar ones, which is still
| quite significant. At 1000 iterations you could still have the
| scalar tail be a couple percent, and it still doubles the
| L1/uop-cache space the loop takes.
|
| It's absolutely a significant contributor to code size (in
| scenarios where vectorized code in general is a significant
| contributor to code size, which admittedly is only very
| specialized software).
___________________________________________________________________
(page generated 2025-04-24 23:00 UTC)