[HN Gopher] Fundamental flaws of SIMD ISAs (2021)
___________________________________________________________________
Fundamental flaws of SIMD ISAs (2021)
Author : fanf2
Score : 142 points
Date : 2025-04-24 14:42 UTC (1 day ago)
(HTM) web link (www.bitsnbites.eu)
(TXT) w3m dump (www.bitsnbites.eu)
| pkhuong wrote:
| There's more to SIMD than BLAS.
| https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u... .
| camel-cdr wrote:
| BLAS, specifically gemm, is one of the rare things where you
| naturally need to specialize on vector register width.
|
| Most problems don't require this: E.g. your basic parallelizable
| math stuff, unicode conversion, base64 de/encode, json parsing,
| set intersection, quicksort*, bigint, run length encoding,
| chacha20, ...
|
| And if you run into a problem that benefits from knowing the
| SIMD width, then just specialize on it. You can totally use
| variable-length SIMD ISAs in a fixed-length way when required.
| But most of the time it isn't required, and you have code that
| easily scales between vector lengths.
|
| *quicksort: most time is spent partitioning, which is vector
| length agnostic; you can handle the leaves in a vector length
| agnostic way, but you'll get more efficient code if you
| specialize (idk how big the impact is, in-vreg bitonic sort is
| quite efficient).
| convolvatron wrote:
| i would certainly add lack of reductions ('horizontal'
| operations) and a more generalized model of communication to the
| list.
| adgjlsfhk1 wrote:
| The tricky part with reductions is that they are somewhat
| inherently slow since they often need to be done pairwise and a
| pairwise reduction over 16 elements will naturally have pretty
| limited parallelism.
| convolvatron wrote:
| kinda? this is sort of a direct result of the 'vectors are
| just sliced registers' model. if i do a pairwise operation
| and divide my domain by 2 at each step, is the resulting
| vector sparse or dense? if its dense then I only really top
| out when i'm in the last log2slice steps.
| sweetjuly wrote:
| Yes, but this is not cheap for hardware. CPU designers love
| SIMD because it lets them just slap down ALUs in parallel
| and get 32x performance boosts. Reductions, however, are
| not entirely parallel and instead have a relatively huge
| gate depth.
|
| For example, suppose an add operation has a latency of one
| unit in some silicon process. To add reduce a 32 element
| vector, you'll have a five deep tree, which means your
| operation has a latency of five units. You can pipeline
| this, but you can't solve the fact that this operation has
| a 5x higher latency than the non-reduce operations.
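|
| As a rough illustration (an AVX intrinsics sketch, not from the
| article): a horizontal sum of 8 floats is a log2(8) = 3-level
| tree, so its latency is roughly three dependent adds, and the
| 32-element case just adds more levels.
|
|     #include <immintrin.h>
|
|     // Horizontal sum of 8 floats: three dependent add steps,
|     // versus one step for a plain element-wise (vertical) add.
|     static float hsum256_ps(__m256 v) {
|         __m128 lo = _mm256_castps256_ps128(v);      // lanes 0..3
|         __m128 hi = _mm256_extractf128_ps(v, 1);    // lanes 4..7
|         __m128 s  = _mm_add_ps(lo, hi);             // 8 -> 4
|         s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // 4 -> 2
|         s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1)); // 2 -> 1
|         return _mm_cvtss_f32(s);
|     }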
| adgjlsfhk1 wrote:
| for some, you can cheat a little (e.g. partial sum
| reduction), but then you don't get to share IP with the
| rest of your ALUs. I do really want to see what an
| optimal 32-wide reduction circuit looks like. For
| integers, you pretty clearly can do much better than
| pairwise reduction. float sounds tricky
| dzaima wrote:
| ARM NEON does have sum, min, and max reductions (and/or
| reductions can just be min/max if all bits in elements are the
| same), along with pairwise ops. RVV has sum,min,max,and,or,xor
| reductions. x86 has psadbw which sums windows of eight 8-bit
| ints, and various instructions for some pairwise horizontal
| stuff.
|
| But in general building code around reductions isn't really a
| thing you'd ideally do; they're necessarily gonna have higher
| latency / lower throughput / take more silicon compared to
| avoiding them where possible, best to leave reducing to a
| single element to the loop tail.
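|
| A minimal sketch of that pattern (AVX2 intrinsics, assuming the
| length is a multiple of 8 to keep it short): accumulate
| vertically in the loop, and do the single horizontal reduction
| only in the tail.
|
|     #include <immintrin.h>
|     #include <cstddef>
|
|     float sum(const float* a, std::size_t n) {   // assumes n % 8 == 0
|         __m256 acc = _mm256_setzero_ps();
|         for (std::size_t i = 0; i < n; i += 8)
|             acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
|         alignas(32) float lane[8];
|         _mm256_store_ps(lane, acc);       // reduce once, at the end
|         float s = 0.0f;                   // (note: reassociates the FP sum)
|         for (float x : lane) s += x;
|         return s;
|     }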
| TinkersW wrote:
| I write a lot of SIMD and I don't really agree with this..
|
| _Flaw1:fixed width_
|
| I prefer fixed width as it makes the code simpler to write: the
| size is known at compile time, so we know the size of our
| structures. Swizzle algorithms are also customized based on the
| size.
|
| _Flaw2:pipelining_
|
| no CPU I care about is in-order, so this is mostly irrelevant,
| and even scalar instructions are pipelined
|
| _Flaw3: tail handling_
|
| I code with SIMD as the target, and have special containers that
| pad memory to SIMD width, no need to mask or run a scalar loop. I
| copy the last valid value into the remaining slots so it doesn't
| cause any branch divergence.
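|
| A rough sketch of that kind of container (not the actual code,
| just the idea): round the storage up to the SIMD width and
| replicate the last valid element into the padding, so full-width
| loads are always in bounds and the padded lanes behave like real
| data.
|
|     #include <cstddef>
|     #include <vector>
|
|     constexpr std::size_t kLanes = 8;   // e.g. 8 floats per 256-bit register
|
|     struct PaddedFloats {
|         std::vector<float> data;   // size is a multiple of kLanes
|         std::size_t valid = 0;     // number of real elements
|
|         explicit PaddedFloats(const std::vector<float>& src) : valid(src.size()) {
|             data = src;
|             std::size_t padded = (valid + kLanes - 1) / kLanes * kLanes;
|             data.resize(padded, valid ? src.back() : 0.0f);  // copy last value into the tail
|         }
|     };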
| cjbgkagh wrote:
| I have similar thoughts,
|
| I don't understand the push for variable width SIMD. Possibly
| due to ignorance, but I think it's an abstraction that can be
| specialized for different hardware, so tradeoffs similar to
| those between low level languages and high level languages
| apply. Since I already have to be aware of hardware level
| concepts such as 256bit shuffles not working across 128bit
| lanes and different instructions having very different
| performance characteristics on different CPUs, I'm already knee
| deep in hardware specifics. While in general I like
| abstractions, I've largely given up waiting for a 'sufficiently
| advanced compiler' that would properly auto-vectorize my code;
| I think AGI is more likely to happen sooner. At a guess, SIMD
| code could work on GPUs, but GPU code has different memory
| access costs, so the code there would also be completely
| different.
|
| So my view is either create a much better higher level SIMD
| abstraction model with a sufficiently advanced compiler that
| knows all the tricks or let me work closely at the hardware
| level.
|
| As an outsider who doesn't really know what is going on, it
| does worry me a bit that WASM appears to be pushing for
| variable width SIMD instead of supporting the ISAs generally
| supported by CPUs. I guess it's a portability vs performance
| tradeoff - I worry that it may be difficult to make variable
| width as performant as fixed width, and would prefer to deal
| with portability by having alternative branches at the code
| level.
| >> Finally, any software that wants to use the new instruction
| set needs to be rewritten (or at least recompiled). What is
| worse, software developers often have to target several SIMD
| generations, and add mechanisms to their programs that
| dynamically select the optimal code paths depending on which
| SIMD generation is supported.
|
| Why not marry the two and have variable width SIMD as one of
| the ISA options? If in the future variable width SIMD becomes
| more performant, then it would just be another branch to
| dynamically select.
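|
| For what it's worth, the dynamic selection itself is the easy
| part; a sketch of the usual dance with GCC/Clang's
| __builtin_cpu_supports (the feature strings are real, the paths
| are placeholders):
|
|     #include <cstdio>
|
|     const char* pick_simd_path() {
|         if (__builtin_cpu_supports("avx512f")) return "AVX-512 path";
|         if (__builtin_cpu_supports("avx2"))    return "AVX2 path";
|         if (__builtin_cpu_supports("sse4.2"))  return "SSE4.2 path";
|         return "scalar path";
|     }
|
|     int main() { std::puts(pick_simd_path()); }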
| kevingadd wrote:
| Part of the motive behind variable width SIMD in WASM is that
| there's intentionally-ish no mechanism to do feature
| detection at runtime in WASM. The whole module has to be
| valid on your target, you can't include a handful of invalid
| functions and conditionally execute them if the target
| supports 256-wide or 512-wide SIMD. If you want to adapt you
| have to ship entire modules for each set of supported feature
| flags and select the correct module at startup after probing
| what the target supports.
|
| So variable width SIMD solves this by making any module using
| it valid regardless of whether the target supports 512-bit
| vectors, and the VM 'just' has to solve the problem of
| generating good code.
|
| Personally I think this is a terrible way to do things and
| there should have just been a feature detection system, but
| the horse fled the barn on that one like a decade ago.
| __abadams__ wrote:
| It would be very easy to support 512-bit vectors
| everywhere, and just emulate them on most systems with a
| small number of smaller vectors. It's easy for a compiler
| to generate good code for this. Clang does it well if you
| use its built-in vector types (which can be any length).
| Variable-length vectors, on the other hand, are a very
| challenging problem for compiler devs. You tend to get
| worse code out than if you just statically picked a size,
| even if it's not the native size.
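|
| A minimal sketch of that with the Clang/GCC built-in vector
| types (nothing project-specific assumed): declare a 512-bit type
| and let the backend split it into whatever the target has.
|
|     typedef float f32x16 __attribute__((vector_size(64)));  // 16 floats
|
|     f32x16 madd(f32x16 a, f32x16 x, f32x16 y) {
|         return a * x + y;   // lowered to 1x 512-bit, 2x 256-bit, or 4x 128-bit ops
|     }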
| jandrewrogers wrote:
| The risk of 512-bit vectors everywhere is that many
| algorithms will spill the registers pretty badly if
| implemented in e.g. 128-bit vectors under the hood. In
| such cases you may be better off with a completely
| different algorithm implementation.
| codedokode wrote:
| Variable length vectors seem to be made for closed-source
| manually-written assembly (nobody wants to unroll the
| loop manually and nobody will rewrite it for new register
| width).
| Someone wrote:
| > It would be very easy to support 512-bit vectors
| everywhere, and just emulate them on most systems with a
| small number of smaller vectors. It's easy for a compiler
| to generate good code for this
|
| Wouldn't that be suboptimal if/when CPUs that support
| 1024-bit vectors come along?
|
| > Variable-length vectors, on the other hand, are a very
| challenging problem for compiler devs. You tend to get
| worse code out than if you just statically picked a size,
| even if it's not the native size.
|
| Why would it be challenging? You could statically pick a
| size on a system with variable-length vectors, too. How
| would that be worse code?
| kevingadd wrote:
| Optimal performance in a vector algorithm typically
| requires optimizing around things like the number of
| available registers, whether the registers in use are
| volatile (mandating stack spills when calling other
| functions like a comparer), and sizes of sequences.
|
| If you know you're engineering for 16-byte vectors you
| can 'just' align all your data to 16 bytes. And if you
| know you have 8 vector registers where 4 of them are non-
| volatile you can design around that too. But without
| information like that you have to be defensive, like
| aligning all your data to 128 bytes instead Just In Case
| (heaven forbid native vectors get bigger than that),
| minimizing the number of registers you use to try and
| avoid stack spills, etc. (I mention this because WASM
| also doesn't expose any of this information.)
|
| It's true that you could just design for a static size on
| a system with variable-length vectors. I suspect you'd
| see a lot of people do that, and potentially under-
| utilize the hardware's capabilities. Better than nothing,
| at least!
| dataflow wrote:
| > Wouldn't that be suboptimal if/when CPUs that support
| 1024-bit vectors come along?
|
| Is that likely or on anyone's roadmap? It makes a little
| less sense than 512 bits, at least for Intel, since their
| cache lines are 64 bytes i.e. 512 bits. Any more than
| that and they'd have to mess with multiple cache lines
| _all_ the time, not just on unaligned accesses. And they'd
| have to support crossing more than 2 cache lines on
| unaligned accesses. They could increase the cache line size
| too, but that seems terrible for compatibility since a
| lot of programs assume it's a compile time constant (and
| it'd have performance overhead to make it a run-time
| value). Somehow it feels like this isn't the way to go,
| but hey, I'm not a CPU architect.
| camel-cdr wrote:
| > I prefer fixed width
|
| Do you have examples for problems that are easier to solve in
| fixed-width SIMD?
|
| I maintain that most problems can be solved in a vector-length-
| agnostic manner. Even if it's slightly more tricky, it's
| certainly easier than restructuring all of your memory
| allocations to add padding and implementing three versions for
| all the differently sized SIMD extensions your target may
| support. And you can always fall back to using a variable-width
| SIMD ISA in a fixed-width way, when necessary.
| jcranmer wrote:
| There's a category of autovectorization known as Superword-
| Level Parallelism (SLP) which effectively scavenges an entire
| basic block for individual instruction sequences that might
| be squeezed together into a SIMD instruction. This kind of
| vectorization doesn't work well with vector-length-agnostic
| ISAs, because you generally can't scavenge more than a few
| elements anyways, and inducing any sort of dynamic vector
| length is more likely to slow your code down as a result
| (since you can't do constant folding).
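|
| To make the SLP case concrete, this is the shape of code such a
| vectorizer scavenges (a generic example, not from the post):
|
|     struct Vec4 { float x, y, z, w; };
|
|     Vec4 add(Vec4 a, Vec4 b) {
|         Vec4 r;
|         r.x = a.x + b.x;   // four independent scalar adds over
|         r.y = a.y + b.y;   // adjacent fields can be packed into a
|         r.z = a.z + b.z;   // single SIMD add of a small, fixed,
|         r.w = a.w + b.w;   // compile-time-known width
|         return r;
|     }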
|
| There's other kinds of interesting things you can do with
| vectors that aren't improved by dynamic-length vectors.
| Something like abseil's hash table, which uses vector code to
| efficiently manage the occupancy bitmap. Dynamic vector
| length doesn't help that much in that case, particularly
| because the vector length you can parallelize over is itself
| intrinsically low (if you're scanning dozens of elements to
| find an empty slot, something is wrong). Vector swizzling is
| harder to do dynamically, and in general, at high vector
| factors, difficult to do generically in hardware, which means
| going to larger vectors (even before considering dynamic
| sizes), vectorization is trickier if you have to do a lot of
| swizzling.
|
| In general, vector-length-agnostic is really only good for
| SIMT-like codes, which you can express the vector body as
| more or less independent f(index) for some knowable-before-
| you-execute-the-loop range of indices. Stuff like DAXPY or
| BLAS in general. Move away from this model, and that
| agnosticism becomes overhead that doesn't pay for itself.
| (Now granted, this kind of model is a _large fraction_ of
| parallelizable code, but it's far from all of it).
| camel-cdr wrote:
| The SLP vectorizer is a good point, but I think it's, in
| comparison with x86, more a problem of the float and vector
| register files not being shared (in SVE and RVV). You don't
| need to reconfigure the vector length; just use it at the
| full width.
|
| > Something like abseil's hash table
|
| If I remember this correctly, the abseil lookup does scale
| with vector length, as long as you use the native data path
| width. (albeit with small gains) There is a problem with
| vector length agnostic handling of abseil, which is the
| iterator API. With a different API, or compilers that could
| eliminate redundant predicated load/stores, this would be
| easier.
|
| > good for SIMT-like codes
|
| Certainly, but I've also seen/written a lot of vector
| length agnostic code using shuffles, which don't fit into
| the SIMT paradigm, which means that the scope is larger
| than just SIMT.
|
| ---
|
| As a general comparison, take AVX10/128, AVX10/256 and
| AVX10/512, overlap their instruction encodings, remove the
| few instructions that don't make sense anymore, and add a
| cheap instruction to query the vector length. (probably
| also instructions like vid and viota, for easier shuffle
| synthesization) Now you have a variable-length SIMD ISA
| that feels familiar.
|
| The above is basically what SVE is.
| jonstewart wrote:
| A number of the cool string processing SIMD techniques
| depend a _lot_ on register widths and instruction
| performance characteristics. There's a fair argument to be
| made that x64 could be made more consistent/legible for
| these use cases, but this isn't matmul--whether you have
| 128, 256, or 512 bits matters hugely and you may want
| entirely different algorithms that are contingent on this.
| jandrewrogers wrote:
| I also prefer fixed width. At least in C++, all of the
| padding, alignment, etc is automagically codegen-ed for the
| register type in my use cases, so the overhead is
| approximately zero. All the complexity and cost is in
| specializing for the capabilities of the underlying SIMD ISA,
| not the width.
|
| The benefit of fixed width is that optimal data structure and
| algorithm design on various microarchitectures is dependent
| on explicitly knowing the register width. SIMD widths aren't
| perfectly substitutable in practice; there is more at
| play than stride size. You can also do things like explicitly
| combine separate logic streams in a single SIMD instruction
| based on knowing the word layout. Compilers don't do this
| work in 2025.
|
| The argument for vector width agnostic code seems predicated
| on the proverbial "sufficiently advanced compiler". I will
| likely retire from the industry before such a compiler
| actually exists. Like fusion power, it has been ten years
| away my entire life.
| camel-cdr wrote:
| > The argument for vector width agnostic code seems
| predicated on the proverbial "sufficiently advanced
| compiler".
|
| A SIMD ISA having a fixed size or not is orthogonal to
| autovectorization. E.g. I've seen a bunch of cases where
| things get autovectorized for RVV but not for AVX512. The
| reason isn't fixed vs variable, but rather the supported
| instructions themselves.
|
| There are two things I'd like from a "sufficiently advanced
| compiler", which are sizeless struct support and redundant
| predicated load/store elimination. Those don't
| fundamentally add new capabilities, but makes working
| with/integrating into existing APIs easier.
|
| > All the complexity and cost is in specializing for the
| capabilities of the underlying SIMD ISA, not the width.
|
| Wow, it almost sounds like you could take basically the
| same code and run it with different vector lengths.
|
| > The benefit of fixed width is that optimal data structure
| and algorithm design on various microarchitectures is
| dependent on explicitly knowing the register width
|
| Optimal to what degree? Like sure, fixed-width SIMD can
| always turn your pointer increments from a register add to
| an immediate add, so it's always more "optimal", but that
| sort of thing doesn't matter.
|
| The only difference you usually encounter when writing
| variable instead of fixed size code is that you have to
| synthesize your shuffles outside the loop. This usually
| just takes a few instructions, but loading a constant is
| certainly easier.
| jandrewrogers wrote:
| The interplay of SIMD width and microarchitecture is more
| important for performance engineering than you seem to be
| assuming. Those codegen decisions are made at layer above
| anything being talked about here and they operate on
| explicit awareness of things like register size.
|
| It isn't "same instruction but wider or narrower" or
| anything that can be trivially autovectorized, it is
| "different algorithm design". Compilers are not yet
| rewriting data structures and algorithms based on
| microarchitecture.
|
| I write a lot of SIMD code, mostly for database engines,
| little of which is trivial "processing a vector of data
| types" style code. AVX512 in particular is strong enough
| of an ISA that it is used in all kinds of contexts that
| we traditionally wouldn't think of as a good fit for SIMD.
| You can build all kinds of neat quasi-scalar idioms with
| it and people do.
| synthos wrote:
| > Compilers are not yet rewriting data structures and
| algorithms based on microarchitecture.
|
| Afaik mojo allows for this with the autotuning capability
| and metaprogramming
| pkhuong wrote:
| > Do you have examples for problems that are easier to solve
| in fixed-width SIMD?
|
| Regular expression matching and encryption come to mind.
| camel-cdr wrote:
| > Regular expression matching
|
| That's probably true. Last time I looked at it, it seemed
| like parts of vectorscan could be vectorized VLA, but from
| my, very limited, understanding of the main matching
| algorithm, it does seem to require specialization on vector
| length.
|
| It should be possible to do VLA in some capacity, but it
| would probably be slower and it's too much work to test.
|
| > encryption
|
| From the things I've looked at, it's mixed.
|
| E.g. chacha20 and poly1305 vectorize well in a VLA scheme:
| https://camel-cdr.github.io/rvv-bench-
| results/bpi_f3/chacha2..., https://camel-cdr.github.io/rvv-
| bench-results/bpi_f3/poly130...
|
| Keccak on the other hand was optimized for fast execution
| on scalar ISAs with 32 GPRs. This is hard to vectorize in
| general, because GPR "moves" are free and liberally
| applied.
|
| Another example where it's probably worth specializing is
| quicksort, specifically the leaf part.
|
| I've written a VLA version, which uses bitonic sort to sort
| within vector registers. I wasn't able to meaningfully
| compare it against a fixed size implementation, because
| vqsort was super slow when I tried to compile it for RVV.
| aengelke wrote:
| I agree; and the article also seems to have quite a few
| technical flaws:
|
| - Register width: we somewhat maxed out at 512 bits, with Intel
| going back to 256 bits for non-server CPUs. I don't see larger
| widths on the horizon (even if SVE theoretically supports up to
| 2048 bits, I don't know any implementation with ~~>256~~ >512
| bits). Larger bit widths are not beneficial for most
| applications and the few applications that are (e.g., some HPC
| codes) are nowadays served by GPUs.
|
| - The post mentions available opcode space: while opcode space
| is limited, a reasonably well-designed ISA (e.g., AArch64) has
| enough holes for extensions. Adding new instructions doesn't
| require ABI changes, and while adding new registers requires
| some kernel changes, this is well understood at this point.
|
| - "What is worse, software developers often have to target
| several SIMD generations" -- no way around this, though, unless
| auto-vectorization becomes substantially better. Adjusting the
| register width is not the big problem when porting code, making
| better use of available instructions is.
|
| - "The packed SIMD paradigm is that there is a 1:1 mapping
| between the register width and the execution unit width" -- no.
| E.g., AMD's Zen 4 does double pumping, and AVX was IIRC
| originally designed to support this as well (although Intel
| went directly for 256-bit units).
|
| - "At the same time many SIMD operations are pipelined and
| require several clock cycles to complete" -- well, they _are_
| pipelined, but many SIMD instructions have the same latency as
| their scalar counterpart.
|
| - "Consequently, loops have to be unrolled in order to avoid
| stalls and keep the pipeline busy." -- loop unrolling has several
| benefits, mostly to reduce the overhead of the loop and to
| avoid data dependencies between loop iterations. Larger basic
| blocks are better for hardware as every branch, even if
| predicted correctly, has a small penalty. "Loop unrolling also
| increases register pressure" -- it does, but code that really
| requires >32 registers is extremely rare, so a good instruction
| scheduler in the compiler can avoid spilling.
|
| In my experience, dynamic vector sizes make code slower,
| because they inhibit optimizations. E.g., spilling a
| dynamically sized vector is like a dynamic stack allocation
| with a dynamic offset. I don't think SVE delivered any large
| benefits, both in terms of performance (there's not much
| hardware with SVE to begin with...) and compiler support.
| RISC-V pushes further in this direction; we'll see how this
| turns out.
| cherryteastain wrote:
| Fujitsu A64FX used in the Fugaku supercomputer uses SVE with
| 512 bit width
| aengelke wrote:
| Thanks, I misremembered. However, the microarchitecture is
| a bit "weird" (really HPC-targeted), with very long
| latencies (e.g., ADD (vector) 4 cycles, FADD (vector) 9
| cycles). I remember that it was much slower than older x86
| CPUs for non-SIMD code, and even for SIMD code, it took
| quite a bit of effort to get reasonable performance through
| instruction-level parallelism due to the long latencies and
| the very limited out-of-order capacities (in particular the
| just 2x20 reservation station entries for FP).
| camel-cdr wrote:
| > we somewhat maxed out at 512 bits
|
| Which still means you have to write your code at least
| thrice, which is two times more than with a variable length
| SIMD ISA.
|
| Also there are processors with larger vector length, e.g.
| 1024-bit: Andes AX45MPV, SiFive X380, 2048-bit: Akeana 1200,
| 16384-bit: NEC SX-Aurora, Ara, EPI
|
| > no way around this
|
| You rarely need to rewrite SIMD code to take advantage of new
| extensions, unless somebody decides to create a new one with
| a larger SIMD width. This mostly happens when very
| specialized instructions are added.
|
| > In my experience, dynamic vector sizes make code slower,
| because they inhibit optimizations.
|
| Do you have more examples of this?
|
| I don't see spilling as much of a problem, because you want
| to avoid it regardless, and codegen for dynamic vector sizes
| is pretty good in my experience.
|
| > I don't think SVE delivered any large benefits
|
| Well, all Arm CPUs except for the A64FX were built to execute
| NEON as fast as possible. X86 CPUs aren't built to execute
| MMX or SSE or the latest, even AVX, as fast as possible.
|
| Anyway, I know of one comparison between NEON and SVE:
| https://solidpixel.github.io/astcenc_meets_sve
|
| > Performance was a lot better than I expected, giving
| between 14 and 63% uplift. Larger block sizes benefitted the
| most, as we get higher utilization of the wider vectors and
| fewer idle lanes.
|
| > I found the scale of the uplift somewhat surprising as
| Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so
| in terms of data-width the two should work out very similar.
| vardump wrote:
| > Which still means you have to write your code at least
| thrice, which is two times more than with a variable length
| SIMD ISA.
|
| 256 and 512 bits are the only reasonable widths. 256 bit
| AVX2 is what, 13 or 14 years old now.
| adgjlsfhk1 wrote:
| no. Because Intel is full of absolute idiots, Intel atom
| didn't support AVX 1 until Gracemont. Tremont is missing
| AVX1, AVX2, FMA, and basically the rest of X86v3, and
| shipped in CPUs as recently as 2021 (Jasper Lake).
| vardump wrote:
| Oh damn. I dropped SSE ages ago and no one complained.
| I guess the customer base didn't use those chips...
| ack_complete wrote:
| Intel also shipped a bunch of Pentium-branded CPUs that
| have AVX disabled, leading to oddities like a Kaby Lake
| based CPU that doesn't have AVX, and even worse, also
| shipped a few CPUs that have AVX2 but not BMI2:
|
| https://sourceware.org/bugzilla/show_bug.cgi?id=29611
|
| https://developercommunity.visualstudio.com/t/Crash-in-
| Windo...
| aengelke wrote:
| > Also there are processors with larger vector length
|
| How do these fare in terms of absolute performance? The NEC
| TSUBASA is not a CPU.
|
| > Do you have more examples of this?
|
| I ported some numeric simulation kernel to the A64Fx some
| time ago, fixing the vector width gave a 2x improvement.
| Compilers probably/hopefully have gotten better in the mean
| time and I haven't redone the experiments since then, but
| I'd be surprised if this changed drastically. Spilling is
| sometimes unavoidable, e.g. due to function calls.
|
| > Anyway, I know of one comparison between NEON and SVE:
| https://solidpixel.github.io/astcenc_meets_sve
|
| I was specifically referring to dynamic vector sizes. This
| experiment uses sizes fixed at compile-time, from the
| article:
|
| > For the astcenc implementation of SVE I decided to
| implement a fixed-width 256-bit implementation, where the
| vector length is known at compile time.
| camel-cdr wrote:
| > How do these fare in terms of absolute performance? The
| NEC TSUBASA is not a CPU.
|
| The NEC is an attached accelerator, but IIRC it can run
| an OS in host mode. It's hard to tell how the others
| perform, because most don't have hardware available yet
| or only they and partner companies have access. It's also
| hard to compare, because they don't target the desktop
| market.
|
| > I ported some numeric simulation kernel to the A64Fx
| some time ago, fixing the vector width gave a 2x
| improvement.
|
| Oh, wow. Was this autovectorized or handwritten
| intrinsics/assembly?
|
| Any chance it's of a small enough scope that I could try
| to recreate it?
|
| > I was specifically referring to dynamic vector sizes.
|
| Ah, sorry, yes you are correct. It still shows that
| supporting VLA mechanisms in an ISA doesn't mean it's
| slower for fixed-size usage.
|
| I'm not aware of any proper VLA vs VLS comparisons. I
| benchmarked a VLA vs VLS mandelbrot implementation once
| where there was no performance difference, but that's a
| too simple example.
| codedokode wrote:
| > Which still means you have to write your code at least
| thrice, which is two times more than with a variable length
| SIMD ISA.
|
| This is the wrong approach. You should be writing your code in
| a high-level language like this:
|
|     x = sum
|         i for 1..n: a[i] * b[i]
|
| And let the compiler write the assembly for every existing
| architecture (including multi-threaded version of a loop).
|
| I don't understand what the advantage of writing
| SIMD code manually is. At least have an LLM write it if you
| don't like my imaginary high-level vector language.
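|
| In C++ terms, the idea is that a plain loop like this is the
| interface and the vectorization is entirely the compiler's
| problem (a generic example; current GCC/Clang at -O3 vectorize
| it, with -ffast-math or similar needed to reassociate the FP
| reduction into vector lanes):
|
|     #include <cstddef>
|
|     float dot(const float* a, const float* b, std::size_t n) {
|         float x = 0.0f;
|         for (std::size_t i = 0; i < n; ++i)
|             x += a[i] * b[i];
|         return x;
|     }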
| otherjason wrote:
| This is the common argument from proponents of compiler
| autovectorization. An example like what you have is very
| simple, so modern compilers would turn it into SIMD code
| without a problem.
|
| In practice, though, the cases that compilers can
| successfully autovectorize are very limited relative to
| the total problem space that SIMD is solving. Plus, if I
| rely on that, it leaves me vulnerable to regressions in
| the compiler vectorizer.
|
| Ultimately for me, I would rather write the
| implementation myself and know what is being generated
| versus trying to write high-level code in just the right
| way to make the compiler generate what I want.
| deaddodo wrote:
| > - Register width: we somewhat maxed out at 512 bits, with
| Intel going back to 256 bits for non-server CPUs. I don't see
| larger widths on the horizon (even if SVE theoretically
| supports up to 2048 bits, I don't know any implementation
| with >256 bits). Larger bit widths are not beneficial for
| most applications and the few applications that are (e.g.,
| some HPC codes) are nowadays served by GPUs.
|
| Just to address this, it's pretty evident why scalar values
| have stabilized at 64-bit and vectors at ~512 (though there
| are larger implementations). Tell someone they only have 256
| values to work with and they immediately see the limit, it's
| why old 8-bit code wasted so much time shuffling carries to
| compute larger values. Tell them you have 65536 values and it
| alleviates a _large_ set of that problem, but you 're still
| going to hit limits frequently. Now you have up to 4294967296
| values and the limits are realistically only going to be hit
| in computational realms, so bump it up to
| 18446744073709551615. Now even most commodity computational
| limits are alleviated and the compiler will handle the data
| shuffling for larger ones.
|
| There was naturally going to be a point where there was
| enough static computational power on integers that it didn't
| make sense to continue widening them (at least, not at the
| previous rate). The same goes for vectorization, but in even
| more niche and specific fields.
| bjourne wrote:
| > "Loop unrolling also increases register pressure" -- it
| does, but code that really requires >32 registers is
| extremely rare, so a good instruction scheduler in the
| compiler can avoid spilling.
|
| No, it actually is super common in hpc code. If you unroll a
| loop N times you need N times as many registers. For normal
| memory-bound code I agree with you, but most hpc kernels will
| exploit as much of the register file as they can for
| blocking/tiling.
| xphos wrote:
| I think the variable length stuff does solve encoding issues,
| and RISC-V takes big strides with the ideas around chaining
| and vl/lmul/vtype registers.
|
| I think they would benefit from having 4 vtype registers,
| though. It's wasted scalar space, but how often do you
| actually rotate between 4 different vector types in main loop
| bodies? The answer is pretty rarely, and you'd greatly reduce
| the swapping between vtypes. I think they needed to find
| 1 more bit, but it's tough; the encoding space isn't that large
| for RVV, which is a perk for sure.
|
| Can't wait to see more implementations of RVV to actually test
| some of its ideas.
| dzaima wrote:
| If you had two extra bits in the instruction encoding, I
| think it'd make much more sense to encode element width
| directly in instructions, leaving LMUL multiplier &
| agnosticness settings in vsetvl; only things that'd suffer
| then would be if you need tail-undisturbed for one instr
| (don't think that's particularly common) and fancy things
| that reinterpret the vector between different element
| widths (very uncommon).
|
| Will be interesting to see if longer encodings for RVV with
| encoded vtype or whatever ever materialize.
| tonetegeatinst wrote:
| AFAIK just about every modern CPU uses an out-of-order von
| Neumann architecture. The only people who don't are the handful
| of researchers and people who work on government research
| into non-von Neumann designs.
| luyu_wu wrote:
| Low power RISC cores (both ARM and RISC-V) are typically in-
| order actually!
|
| But any core I can think of as 'high-performance' is OOO.
| whaleofatw2022 wrote:
| MIPS as well as Alpha, AFAIR. And technically Itanium. OTOH
| it seems to me a bit like a niche for any performance
| advantages...
| mattst88 wrote:
| Alpha 21264 is out-of-order.
| pezezin wrote:
| I would not call MIPS, Alpha, or Itanium "high-
| performance" in 2025...
| p_l wrote:
| Alpha was out of order starting with the EV6 (21264), but most
| importantly the entire architecture was designed with an eye
| for both pipeline hazards and out of order execution,
| unlike the VAX that it replaced, which made that pretty much
| impossible.
| IshKebab wrote:
| Microcontrollers are often in-order.
| PaulHoule wrote:
| In AVX-512 we have a platform that rewards the assembly
| language programmer like few platforms have since the 6502. I
| see people doing really clever things that are specific to the
| system, and on one level it is _really cool_, but on another
| level it means SIMD is the domain of the specialist. Intel puts
| out press releases about the really great features they have
| for the national labs and for Facebook, whereas the rest of us
| are 5-10 years behind the curve for SIMD adoption because the
| juice isn't worth the squeeze.
|
| Just before libraries for training neural nets on GPUs became
| available I worked on a product that had a SIMD based neural
| network trainer that was written in hand-coded assembly. We
| were a generation behind in our AVX instructions so we gave up
| half of the performance we could have got, but that was the
| least of the challenges we had to overcome to get the product
| in front of customers. [1]
|
| My software-centric view of Intel's problems is that they've
| been spending their customers' and shareholders' money to put
| features in chips that are fused off or might as well be fused
| off because they aren't widely supported in the industry. And
| that they didn't see this as a problem, and neither did their
| enablers in the computing media and software industry. Just for
| example, Apple used to ship the MKL libraries, which were like
| a turbocharger for matrix math, back when they were using Intel
| chips. For whatever reason, Microsoft did not do this with
| Windows and neither did most Linux distributions, so "the rest
| of us" are stuck with a fraction of the performance that we
| paid for.
|
| AMD did the right thing in introducing double pumped AVX-512
| because at least assembly language wizards have some place
| where their code runs and the industry gets closer to the place
| where we can count on using an instruction set defined _12
| years ago._
|
| [1] If I'd been tasked with updating it to the next generation I
| would have written a compiler (if I take that many derivatives
| by hand I'll get one wrong). My boss would have ordered me not
| to; I would have done it anyway and not checked it in.
| bee_rider wrote:
| It is kind of a bummer that MKL isn't open sourced, as that
| would make inclusion in Linux easier. It is already free-as-
| in-beer, but of course that doesn't solve everything.
|
| Baffling that MS didn't use it. They have a pretty close
| relationship...
|
| Agree that they are sort of going after hard-to-use niche
| features nowadays. But I think it is just that the real thing
| we want--single threaded performance for branchy code--is,
| like, incredibly difficult to improve nowadays.
| PaulHoule wrote:
| At the very least you can decode UTF-8 really quickly with
| AVX-512
|
| https://lemire.me/blog/2023/08/12/transcoding-
| utf-8-strings-...
|
| and web browsers at the very least spend a lot of cycles on
| decoding HTML and JavaScript, which is UTF-8 encoded. It
| turns out AVX-512 is good at a lot of things you wouldn't
| think SIMD would be good at. Intel's got the problem that
| people don't want to buy new computers because they don't
| see much benefit from buying a new computer, but a new
| computer doesn't have the benefit it could have because of
| lagging software support, and the software support lags
| because there aren't enough new computers to justify the
| work to do the software support. Intel deserves blame for a
| few things, one of which is that they have dragged their
| feet at getting really innovative features into their
| products while turning people off with various empty
| slogans.
|
| They really do have a new instruction set that targets
| plain ordinary single threaded branchy code
|
| https://www.intel.com/content/www/us/en/developer/articles/
| t...
|
| they'll probably be out of business before you can use it.
| gatane wrote:
| In the end, it doesn't even matter; JavaScript frameworks
| are already big enough to slow down your PC.
|
| Unless said optimization on parsing runs at the very
| core of JS.
| saagarjha wrote:
| It'll speed up first load times.
| immibis wrote:
| If you pay attention this isn't a UTF-8 decoder. It might
| be some other encoding, or a complete misunderstanding of
| how UTF-8 works, or an AI hallucination. It also doesn't
| talk about how to handle the variable number of output
| bytes or the possibility of a continuation sequence split
| between input chunks.
| kjs3 wrote:
| I paid attention and I don't see where Daniel claimed
| that this a complete UTF-8 decoder. He's illustrating a
| programming technique using a simplified use case, not
| solving the worlds problems. And I don't think Daniel
| Lemire lacks an understanding of the concept or needs an
| AI to code it.
| ack_complete wrote:
| AVX-512 also has a lot of wonderful facilities for
| autovectorization, but I suspect its initial downclocking
| effects plus getting yanked out of Alder Lake killed a lot of
| the momentum in improving compiler and library usage of it.
|
| Even the Steam Hardware Survey, which is skewed toward upper
| end hardware, only shows 16% availability of baseline
| AVX-512, compared to 94% for AVX2.
| adgjlsfhk1 wrote:
| It will be interesting seeing what happens now that AMD is
| shipping good AVX-512. It really just makes Intel seem
| incompetent (especially since they're theoretically
| bringing AVX-512 back in next year anyway)
| ack_complete wrote:
| No proof, but I suspect that AMD's AVX-512 support played
| a part in Intel dumping AVX10/256 and changing plans back
| to shipping a full 512-bit consumer implementation again
| (we'll see when they actually ship it).
|
| The downside is that AMD also increased the latency of
| all formerly cheap integer vector ops. This removes one
| of the main advantages against NEON, which historically
| has had richer operations but worse latencies. That's one
| thing I hope Intel doesn't follow.
|
| Also interesting is that Intel's E-core architecture is
| improving dramatically compared to the P-core, even
| surpassing it in some cases. For instance, Skymont
| finally has no penalty for denormals, a long standing
| Intel weakness. Would not be surprising to see the E-core
| architecture take over at some point.
| adgjlsfhk1 wrote:
| > For instance, Skymont finally has no penalty for
| denormals, a long standing Intel weakness.
|
| yeah, that's crazy to me. Intel has been so completely
| dysfunctional for the last 15 years. I feel like you
| couldn't have a clearer sign of "we have 2 completely
| separate teams that are competing with each other and
| aren't allowed to/don't want to talk to each other". it's
| just such a clear sign that the chicken is running around
| headless
| whizzter wrote:
| Not really, to me it more seems like Pentium-4 vs
| Pentium-M/Core again.
|
| The downfall of the Pentium 4 was that they had been stuffing
| things into longer and longer pipes to keep up in the
| frequency race (with horrible branch latencies as a
| result). They walked it all back by "resetting" to the
| P3/P-M/Core architecture and scaling up from that again.
|
| Pipes today are even _longer_, and if E-cores have shorter
| pipes at a similar frequency then "regular" JS, Java, etc.
| code will be far more performant, even if you lose a bit
| of perf for "performance" cases where people vectorize
| (did the HPC computing crowd influence Intel into a ditch
| AGAIN? wouldn't be surprising!).
| ack_complete wrote:
| Thankfully, the P-cores are nowhere near as bad as the
| Pentium 4 was. The Pentium 4 had such a skewed
| architecture that it was annoyingly frustrating to
| optimize for. Not only was the branch misprediction
| penalty long, but all common methods of doing branchless
| logic like conditional moves were also slow. It also had
| a slow shifter such that small left shifts were actually
| faster as sequences of adds, which I hadn't needed to do
| since the 68000 and 8086. And an annoying L1 cache that
| had 64K aliasing penalties (guess which popular OS
| allocates all virtual memory, particularly thread stacks,
| at 64K alignment.....)
|
| The P-cores have their warts, but are still much more
| well-rounded than the P4 was.
| ezekiel68 wrote:
| You mentioned "initial downclocking effects", yet (for
| posterity) I want to emphasize that in 2020 Ice Lake (Sunny
| Cove core) and later Intel processors, the downclocking is
| really a nothingburger. The fusing off debacle in desktop
| CPU families like Alder Lake you mentioned definitely
| killed the momentum though.
|
| I'm not sure why OS kernels couldn't have become partners
| in CPU capability queries (where a program starting
| execution could request a CPU core with 'X' such as
| AVX-512F, for example) -- but without that the whole
| P-core/E-core hybrid concept was DOA for capabilities which
| were not least-common denominators. If I had to guess,
| marketing got ahead of engineering and testing on that one.
| ack_complete wrote:
| Sure, but any core-wide downclocking effect at all is
| annoying for autovectorization, since a small local win
| easily turns into a global loss. Which is why compilers
| have "prefer vector width" tuning parameters so autovec
| can be tuned down to avoid 512-bit or even 256-bit ops.
|
| This is also the same reason that having AVX-512 only on
| the P-cores wouldn't have worked, even with thread
| director support. It would only take one small routine in
| a common location to push most threads off the P-cores.
|
| I'm of the opinion that Intel's hybrid P/E-arch has been
| mostly useless anyway and only good for winning
| benchmarks. My current CPU has a 6P4E configuration and
| the scheduler hardly uses the E-cores at all unless
| forced, plus performance was better and more stable with
| the E-cores disabled.
| the__alchemist wrote:
| Noob question! What about AVX-512 makes it unique to assembly
| programmers? I'm just dipping my toes in, and have been doing
| some chemistry computations using f32x8, Vec3x8 etc
| (AVX-256). I have good workflows set up, but have only been
| getting 2x speedup over non-SIMD. (Was hoping for closer to
| 8). I figured AVX-512 would allow f32x16 etc, which would be
| mostly a drop-in. (I have macros to set up the types, and you
| input num lanes).
| dzaima wrote:
| SIMD only helps you where you're arithmetic-limited; you
| may be limited by memory bandwidth, or perhaps float
| division if applicable; and if your scalar comparison got
| autovectorized you'd have roughly no benefit.
|
| AVX-512 should be just fine via intrinsics/high-level
| vector types, not different from AVX2 in this regard.
| ack_complete wrote:
| AVX-512 has a lot of instructions that just extend
| vectorization to 512-bit and make it nicer with features
| like masking. Thus, a very valid use of it is just to
| double vectorization width.
|
| But it also has a bunch of specialized instructions that
| can boost performance beyond just the 2x width. One of them
| is VPCOMPRESSB, which accelerates compact encoding of
| sparse data. Another is GF2P8AFFINEQB, which is targeted at
| specific encryption algorithms but can also be abused for
| general bit shuffling. Algorithms like computing a
| histogram can benefit significantly, but it requires
| reshaping the algorithm around very particular and peculiar
| intermediate data layouts that are beyond the
| transformations a compiler can do. This doesn't _literally_
| require assembly language, though, it can often be done
| with intrinsics.
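|
| For instance, a compaction kernel built on VPCOMPRESSB might look
| roughly like this (a sketch assuming AVX-512 BW + VBMI2; the
| function name is mine and tail handling is omitted):
|
|     #include <immintrin.h>
|     #include <cstddef>
|
|     // Keep only bytes greater than a threshold, packed contiguously:
|     // one compress-store per 64-byte block.
|     std::size_t filter_gt(const unsigned char* src, std::size_t n,
|                           unsigned char threshold, unsigned char* dst) {
|         std::size_t out = 0;
|         __m512i t = _mm512_set1_epi8((char)threshold);
|         for (std::size_t i = 0; i + 64 <= n; i += 64) {
|             __m512i v = _mm512_loadu_si512(src + i);
|             __mmask64 keep = _mm512_cmpgt_epu8_mask(v, t);
|             _mm512_mask_compressstoreu_epi8(dst + out, keep, v);
|             out += (std::size_t)_mm_popcnt_u64(keep);
|         }
|         return out;
|     }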
| derf_ wrote:
| _> I code with SIMD as the target, and have special containers
| that pad memory to SIMD width..._
|
| I think this may be domain-specific. I help maintain several
| open-source audio libraries, and wind up being the one to
| review the patches when people contribute SIMD for some
| specific ISA, and I think without exception they always get the
| tail handling wrong. Due to other interactions it cannot always
| be avoided by padding. It can roughly double the complexity of
| the code [0], and requires a disproportionate amount of
| thinking time vs. the time the code spends running, but if you
| don't spend that thinking time you can get OOB reads or writes,
| and thus CVEs. Masked loads/stores are an improvement, but not
| universally available. I don't have a lot of concrete
| suggestions.
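|
| For what it's worth, where masked loads are available the tail can
| look roughly like this (AVX-512 here; a sketch, not code from any
| of the libraries mentioned):
|
|     #include <immintrin.h>
|     #include <cstddef>
|
|     float sum_f32(const float* a, std::size_t n) {
|         __m512 acc = _mm512_setzero_ps();
|         std::size_t i = 0;
|         for (; i + 16 <= n; i += 16)
|             acc = _mm512_add_ps(acc, _mm512_loadu_ps(a + i));
|         if (i < n) {                                   // 1..15 leftover elements
|             __mmask16 m = (__mmask16)((1u << (n - i)) - 1);
|             acc = _mm512_add_ps(acc, _mm512_maskz_loadu_ps(m, a + i));
|         }
|         return _mm512_reduce_add_ps(acc);
|     }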
|
| I also work with a lot of image/video SIMD, and this is just
| not a problem, because most operations happen on fixed block
| sizes, and padding buffers is easy and routine.
|
| I agree I would have picked other things for the other two in
| my own top-3 list.
|
| [0] Here is a fun one, which actually performs worst when len
| is a multiple of 8 (which it almost always is), and has 59
| lines of code for tail handling vs. 33 lines for the main loop:
| https://gitlab.xiph.org/xiph/opus/-/blob/main/celt/arm/celt_...
| jandrewrogers wrote:
| > Masked loads/stores are an improvement, but not universally
| available.
|
| Traditionally we've worked around this with pretty idiomatic
| hacks that efficiently implement "masked load" functionality
| in SIMD ISAs that don't have them. We could probably be
| better about not making people write this themselves every
| time.
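|
| One such idiom, sketched with AVX here (assumes at least one full
| vector of input and an operation like max where reprocessing
| elements is harmless): overlap the final load with the previous
| iteration instead of masking it.
|
|     #include <immintrin.h>
|     #include <cstddef>
|
|     float max_f32(const float* a, std::size_t n) {   // requires n >= 8
|         __m256 acc = _mm256_loadu_ps(a);
|         std::size_t i = 8;
|         for (; i + 8 <= n; i += 8)
|             acc = _mm256_max_ps(acc, _mm256_loadu_ps(a + i));
|         if (i < n)   // tail: re-read the last 8 elements, overlapping the loop above
|             acc = _mm256_max_ps(acc, _mm256_loadu_ps(a + n - 8));
|         alignas(32) float lane[8];
|         _mm256_store_ps(lane, acc);
|         float m = lane[0];
|         for (int k = 1; k < 8; ++k) m = lane[k] > m ? lane[k] : m;
|         return m;
|     }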
| ack_complete wrote:
| It depends on how integrated your SIMD strategy is into the
| overall technical design. Tail handling is much easier if you
| can afford SIMD-friendly padding so a full vector load/store
| is possible even if you have to manually mask. That avoids a
| lot of the hassle of breaking down memory accesses just to
| avoid a page fault or setting off the memory checker.
|
| Beyond that -- unit testing. I don't see enough of it for
| vectorized routines. SIMD widths are small enough that you
| can usually just test all possible offsets right up against a
| guard page and brute force verify that no overruns occur.
| codedokode wrote:
| I think that SIMD code should not be written by hand but
| rather in a high-level language, so that dealing with the tail
| becomes a compiler's and not a programmer's problem. Or do
| people still prefer to write assembly by hand? It seems to be
| so, judging by the code you linked.
|
| What I wanted is to write code in a more high-level language
| like this. For example, to compute a scalar product of a and
| b you write:
|
|     1..n | a[$1] * b[$1] | sum
|
| Or maybe this:
|
|     x = sum for i in 1 .. n:
|         a[i] * b[i]
|
| And the code gets automatically compiled into SIMD
| instructions for every existing architecture (and for large
| arrays, into a multi-thread computation).
| Zambyte wrote:
| Zig exposes a Vector type to use for SIMD instructions,
| which has been my first introduction to SIMD directly.
| Reading through this thread I was immediately mapping what
| people were saying to Vector operations in Zig. It seems to
| me like SIMD can reasonably be exposed in high level
| languages for programmers to reach to in contexts where it
| matters.
|
| Of course, the compiler vectorizing code when it can as a
| general optimization is still useful, but when it's
| critical that some operations _must_ be vectorized,
| explicit SIMD structures seem nice to have.
| Const-me wrote:
| > I prefer fixed width
|
| Another reason to prefer fixed width: compilers may pass
| vectors to functions in SIMD registers. When the register size
| is unknown at compile time, they have to pass data in memory.
| For complicated SIMD algorithms the performance overhead is
| gonna be huge.
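|
| A small illustration (assuming the x86-64 SysV calling
| convention; MSVC needs __vectorcall for the same effect):
| fixed-width vector parameters and return values stay in ymm
| registers, so small helpers like this add no memory traffic at
| the call boundary.
|
|     #include <immintrin.h>
|
|     __m256 select_ps(__m256 mask, __m256 a, __m256 b) {
|         // classic bitwise select: a where mask bits are set, else b
|         return _mm256_or_ps(_mm256_and_ps(mask, a),
|                             _mm256_andnot_ps(mask, b));
|     }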
| kjs3 wrote:
| Back in the day, you had Cray style vector registers, and you
| had CDC style[1] 'vector pipes' (I think I remember that's
| what they called them) that you fed from main memory. So you
| would ( _vastly_ oversimplifying) build your vectors in
| consecutive memory locations (up to 64k as I recall), point
| to a result destination in memory and execute a vector
| instruction. This works fine if there 's a close match
| between cpu speed and memory access speed. The compilers were
| quite good, and took care of handling variable sized vectors,
| but I have no idea what was going on under the hood except
| for some hi-level undergrad compiler lectures. As memory
| speed vs cpu speed divergence became more and more pronouced,
| it quickly became obvious that vector registers were the
| right performance answer, basically everyone jumped that way,
| and I don't think anyone has adopted a memory-memory vector
| architecture since the '80s.
|
| [1] from CDC STAR-100 and followons like the CDC Cyber
| 180/990, Cyber 200 series & ETA-10.
| freeone3000 wrote:
| x86 SIMD suffers from register aliasing. xmm0 is actually the
| low-half of ymm0, so you need to explicitly tell the processor
| what your input type is to properly handle overflow and signing.
| Actual vectorized instructions don't have this problem but you
| also can't change it now.
| pornel wrote:
| There are alternative universes where these wouldn't be a
| problem.
|
| For example, if we didn't settle on executing compiled machine
| code exactly as-is, and had an instruction-updating pass (less
| involved than a full VM byte code compilation), then we could
| adjust SIMD width for existing binaries instead of waiting
| decades for a new baseline or multiversioning faff.
|
| Another interesting alternative is SIMT. Instead of having a
| handful of special-case instructions combined with heavyweight
| software-switched threads, we could have had every instruction
| SIMDified. It requires structuring programs differently, but
| getting max performance out of current CPUs already requires SIMD
| + multicore + predictable branching, so we're doing it anyway,
| just in a roundabout way.
| LegionMammal978 wrote:
| > Another interesting alternative is SIMT. Instead of having a
| handful of special-case instructions combined with heavyweight
| software-switched threads, we could have had every instruction
| SIMDified. It requires structuring programs differently, but
| getting max performance out of current CPUs already requires
| SIMD + multicore + predictable branching, so we're doing it
| anyway, just in a roundabout way.
|
| Is that not where we're already going with the GPGPU trend? The
| big catch with GPU programming is that many useful routines are
| irreducibly very branchy (or at least, to an extent that
| removing branches slows them down unacceptably), and every
| divergent branch throws out a huge chunk of the GPU's
| performance. So you retain a traditional CPU to run all your
| branchy code, but you run into memory-bandwidth woes between
| the CPU and GPU.
|
| It's generally the exception instead of the rule when you have
| a big block of data elements upfront that can all be handled
| uniformly with no branching. These usually have to do with
| graphics, physical simulation, etc., which is why the SIMT
| model was popularized by GPUs.
| winwang wrote:
| Fun fact which I'm 50%(?) sure of: a single branch divergence
| for integer instructions on current nvidia GPUs won't hurt
| perf, because there are only 16 int32 lanes anyway.
| pornel wrote:
| CPUs are not good at branchy code either. Branch
| mispredictions cause costly pipeline stalls, so you have to
| make branches either predictable or use conditional moves.
| Trivially predictable branches are fast -- but so are non-
| diverging warps on GPUs. Conditional moves and masked SIMD
| work pretty much exactly like on a GPU.
|
| Even if you have a branchy divide-and-conquer problem ideal
| for diverging threads, you'll get hit by a relatively high
| overhead of distributing work across threads, false sharing,
| and stalls from cache misses.
|
| My hot take is that GPUs will get more features to work
| better on traditionally-CPU-problems (e.g. AMD Shader Call
| proposal that helps processing unbalanced tree-structured
| data), and CPUs will be downgraded to being just a
| coprocessor for bootstrapping the GPU drivers.
| aengelke wrote:
| > if we didn't settle on executing compiled machine code
| exactly as-is, and had an instruction-updating pass (less
| involved than a full VM byte code compilation)
|
| Apple tried something like this: they collected the LLVM
| bitcode of apps so that they could recompile and even port to a
| different architecture. To my knowledge, this was done exactly
| once (watchOS armv7->AArch64) and deprecated afterwards.
| Retargeting at this level is inherently difficult (different
| ABIs, target-specific instructions, intrinsics, etc.). For the
| same target with a larger feature set, the problems are
| smaller, but so are the gains -- better SIMD usage would only
| come from the auto-vectorizer and a better instruction selector
| that uses different instructions. The expectable gains,
| however, are low for typical applications and for math-heavy
| programs, using optimized libraries or simply recompiling is
| easier.
|
| WebAssembly is a higher-level, more portable bytecode, but
| performance levels are quite a bit behind natively compiled
| code.
| almostgotcaught wrote:
| > There are alternative universes where these wouldn't be a
| problem
|
| Do people that say these things have literally any experience
| of merit?
|
| > For example, if we didn't settle on executing compiled
| machine code exactly as-is, and had an instruction-updating pass
|
| You do understand that at the end of the day, hardware is hard
| (fixed) and software is soft (malleable), right? There will
| always be friction at some boundary - it doesn't matter where
| you hide the rigidity of a literal rock, you eventually reach a
| point where you cannot reconfigure something that you would
| like to. And also the parts of that rock that are useful are
| extremely expensive (so no one is adding instruction-updating
| pass silicon just because it would be convenient). That's just
| physics - the rock is very small but fully baked.
|
| > we could have had every instruction SIMDified
|
| Tell me you don't program GPUs without telling me. Not only is
| SIMT a literal lie today (cf warp level primitives), there is
| absolutely no reason to SIMDify all instructions (and you
| better be a wise user of your scalar registers and scalar
| instructions if you want fast GPU code).
|
| I wish people would just realize there's no grand paradigm
| shift that's coming that will save them from the difficult work
| of actually learning how the device works in order to be able
| to use it efficiently.
| pornel wrote:
| The point of updating the instructions isn't to have optimal
| behavior in all cases, or to reconfigure programs for wildly
| different hardware, but to be able to easily target
| contemporary hardware, without having to wait for the oldest
| hardware to die out first to be able to target a less
| outdated baseline without conditional dispatch.
|
| Users are much more forgiving about software that runs a bit
| slower than software that doesn't run at all. ~95% of x86_64
| CPUs have AVX2 support, but compiling binaries to
| unconditionally rely on it makes the remaining users
| complain. If it was merely slower on potato hardware, it'd be
| an easier tradeoff to make.
|
| This is the norm on GPUs thanks to shader recompilation
| (they're far from optimal for all hardware, but at least get
| to use the instruction set of the HW they're running on,
| instead of being limited to the lowest common denominator).
| On CPUs it's happening in limited cases: Zen 4 added AVX-512
| by executing two 256-bit operations serially, and plenty of
| less critical instructions are emulated in microcode, but
| it's done by the hardware, because our software isn't set up
| for that.
|
| Compilers already need to make assumptions about pipeline
| widths and instruction latencies, so the code is tuned for
| specific CPU vendors/generations anyway, and that doesn't get
| updated. Less explicitly, optimized code also makes
| assumptions about cache sizes and compute vs memory trade-
| offs. Code may need an L1 cache of a certain size to work best,
| but it still runs on CPUs with a too-small L1 cache, just
| slower. Imagine how annoying it would be if your code
| couldn't take advantage of a larger L1 cache without crashing
| on older CPUs. That's where CPUs are with SIMD.
| almostgotcaught wrote:
| i have no idea what you're saying - i'm well aware that
| compilers do lots of things but this sentence in your
| original comment
|
| > compiled machine code exactly as-is, and had a
| instruction-updating pass
|
| implies there should be _silicon_ that implements the
| instruction-updating - what else would be "executing"
| compiled machine code other than the machine
| itself...........
| pornel wrote:
| I was talking about a software pass. Currently, the
| machine code stored in executables (such as ELF or PE) is
| only slightly patched by the dynamic linker, and then
| expected to be directly executable by the CPU. The code
| in the file has to be already compatible with the target
| CPU, otherwise you hit illegal instructions. This is a
| simplistic approach, dating back to when running
| executables was just a matter of loading them into RAM
| and jumping to their start (old a.out or DOS COM).
|
| What I'm suggesting is adding a translation/fixup step
| after loading a binary, before the code is executed, to
| make it more tolerant to hardware changes. It doesn't
| have to be full abstract portable bytecode compilation,
| and not even as involved as PTX to SASS, but more like a
| peephole optimizer for the same OS on the same general
| CPU architecture. For example, on a pre-AVX2 x86_64 CPU,
| the OS could scan for AVX2 instructions and patch them to
| do equivalent work using SSE or scalar instructions.
| There are implementation and compatibility issues that
| make it tricky, but fundamentally it should be possible.
| Wilder things like x86_64 to aarch64 translation have
| been done, so let's do it for x86_64-v4 to x86_64-v1 too.
| almostgotcaught wrote:
| that's certainly more reasonable so i'm sorry for being
| so flippant. but even this idea i wager the juice is not
| worth the squeeze outside of stuff like Rosetta as you
| alluded, where the value was extremely high (retaining
| x86 customers).
| gitroom wrote:
| Oh man, totally get the pain with compilers and SIMD tricks - the
| struggle's so real. Ever feel like keeping low level control is
| the only way stuff actually runs as smooth as you want, or am I
| just too stubborn to give abstractions a real shot?
| sweetjuly wrote:
| Loop unrolling isn't really done because of pipelining but rather
| to amortize the cost of looping. Any modern out-of-order core
| will (on the happy path) schedule the operations identically
| whether you did one copy per loop or four. The only difference is
| the number of branches.
| Remnant44 wrote:
| These days, I strongly believe that loop unrolling is a
| pessimization, especially with SIMD code.
|
| Scalar code should be unrolled by the compiler to the SIMD word
| width to expose potential parallelism. But other than that,
| correctly predicted branches are free, and so is loop
| instruction overhead on modern wide-dispatch processors. For
| example, even running a maximally efficient AVX512 kernel on a
| zen5 machine that dispatches 4 EUs and some load/stores and
| calculates 2048 bits in the vector units every cycle, you still
| have a ton of dispatch capacity to handle the loop overhead in
| the scalar units.
|
| The cost of unrolling is decreased code density and reduced
| effectiveness of the instruction / uOp cache. I wish Clang in
| particular would stop unrolling the dang vector loops.
| adgjlsfhk1 wrote:
| The part that's really weird is that on modern CPUs predicted
| branches are free iff they're sufficiently rare (<1 out of 8
| instructions or so). but if you have too many, you will be
| bottlenecked on the branch since you aren't allowed to
| speculate past a 2nd (3rd on zen5 without hyperthreading?)
| branch.
| dzaima wrote:
| The limiting thing isn't necessarily speculating, but more
| just the number of branches per cycle, i.e. number of non-
| contiguous locations the processor has to query from L1 /
| uop cache (and which the branch predictor has to determine
| the location of). You get that limit with unconditional
| branches too.
| gpderetta wrote:
| Indeed, the limit is on _taken_ branches, hence why
| making the most likely case fall through is often an
| optimization.
| dzaima wrote:
| Intel still shares ports between vector and scalar on
| P-cores; a scalar multiply in the loop will definitely fight
| with a vector port, and the bits of pointer bumps and branch
| and whatnot can fill up the 1 or 2 scalar-only ports. And
| maybe there are some minor power savings from wasting
| resources on the scalar overhead. Still, clang does unroll
| way too much.
| Remnant44 wrote:
| My understanding is that they've changed this for Lion Cove
| and all future P cores, moving to much more of a Zen-like
| setup with seperate schedulers and ports for vector and
| scalar ops.
| dzaima wrote:
| Oh, true, mistook it for an E-core while clicking through
| diagrams due to the port spam.. Still, that's a 2024
| microarchitecture, it'll be like a decade before it's
| reasonable to ignore older ones.
| bobmcnamara wrote:
| > The cost of unrolling is decreased code density and reduced
| effectiveness of the instruction / uOp cache.
|
| There are some cases where useful code density goes up.
|
| Ex: unroll the Goertzel algorithm by a even number, and
| suddenly the entire delay line overhead evaporates.
| Const-me wrote:
| > schedule the operations identically whether you did one
| copy per loop or four
|
| They don't always do that well when you need a reduction in
| that loop, e.g. you are searching for something in memory, or
| computing dot product of long vectors.
|
| Reductions in the loop form a continuous data dependency
| chain between loop iterations, which prevents processors from
| being able to submit instructions for multiple iterations of
| the loop. Fixable with careful manual unrolling.
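|
| For illustration, a rough sketch of such manual unrolling
| (illustrative code, assuming AVX2+FMA and, to keep it short,
| n being a multiple of 32):
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   float dot(const float *a, const float *b, size_t n) {
|       /* Four independent accumulators hide the FMA latency that
|          a single accumulator would serialize on. */
|       __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
|       __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps();
|       for (size_t i = 0; i < n; i += 32) {
|           s0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
|                                _mm256_loadu_ps(b + i), s0);
|           s1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
|                                _mm256_loadu_ps(b + i + 8), s1);
|           s2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
|                                _mm256_loadu_ps(b + i + 16), s2);
|           s3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
|                                _mm256_loadu_ps(b + i + 24), s3);
|       }
|       /* Single reduction at the end, outside the hot loop. */
|       __m256 s = _mm256_add_ps(_mm256_add_ps(s0, s1),
|                                _mm256_add_ps(s2, s3));
|       __m128 q = _mm_add_ps(_mm256_castps256_ps128(s),
|                             _mm256_extractf128_ps(s, 1));
|       q = _mm_add_ps(q, _mm_movehl_ps(q, q));
|       q = _mm_add_ss(q, _mm_shuffle_ps(q, q, 1));
|       return _mm_cvtss_f32(q);
|   }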
| codedokode wrote:
| > Any modern out-of-order core will (on the happy path)
| schedule the operations identically whether you did one copy
| per loop or four.
|
| I cannot agree, because in an unrolled loop you have fewer
| counter-increment instructions.
| gpderetta wrote:
| The looping overhead is trivial, especially on simd code where
| the loop overhead will use the scalar hardware.
|
| Unrolling is definitely needed for properly scheduling and
| pipelining SIMD code even on OoO cores. Remember that an OoO
| core cannot reorder dependent instructions, so the dependencies
| need to be manually broken, for example by adding additional
| accumulators, which in turn requires additional unrolling, this
| is especially important on SIMD code which typically is non-
| branchy with long dependency chains.
| Remnant44 wrote:
| That's a good point about increased dependency chain length
| in simd due to the branchless programming style. Unrolling to
| break a loop-carried dependency is one of the strongest
| reasons to unroll especially simd code.
|
| Unrolling trivial loops to remove loop counter overhead
| hasn't been productive for quite a while now, but
| unfortunately it's still the default for many compilers.
| imtringued wrote:
| Ok, but the compiler can't do that without unrolling.
| bob1029 wrote:
| > Since the register size is fixed there is no way to scale the
| ISA to new levels of hardware parallelism without adding new
| instructions and registers.
|
| I look at SIMD as the same idea as any other aspect of the x86
| instruction set. If you are directly interacting with it, you
| should probably have a good reason to be.
|
| I primarily interact with these primitives via types like
| Vector<T> in .NET's System.Numerics namespace. With the
| appropriate level of abstraction, you no longer have to worry
| about how wide the underlying architecture is, or if it even
| supports SIMD at all.
|
| I'd prefer to let someone who is paid a very fat salary by a F100
| spend their full time job worrying about how to emit SIMD
| instructions for my program source.
| timewizard wrote:
| > Another problem is that each new SIMD generation requires new
| instruction opcodes and encodings.
|
| It requires new opcodes. It does not strictly require new
| encodings. Several new encodings are legacy compatible and can
| encode previous generations' vector instructions.
|
| > so the architecture must provide enough SIMD registers to avoid
| register spilling.
|
| Or the architecture allows memory operands. The great joy of
| basic x86 encoding is that you don't actually need to put things
| in registers to operate on them.
|
| > Usually you also need extra control logic before the loop. For
| instance if the array length is less than the SIMD register
| width, the main SIMD loop should be skipped.
|
| What do you want? No control overhead or the speed enabled by
| SIMD? This isn't a flaw. This is a necessary price to achieve the
| efficiency you do in the main loop.
| camel-cdr wrote:
| > The great joy of basic x86 encoding is that you don't
| actually need to put things in registers to operate on them.
|
| That's just spilling with fewer steps. The executed uops should
| be the same.
| timewizard wrote:
| > That's just spilling with fewer steps.
|
| Another way to say this is it's "more efficient."
|
| > The executed uops should be the same.
|
| And "more densely coded."
| camel-cdr wrote:
| hm, I was wondering how the density compares with x86
| having more complex encodings in general.
|
| vaddps zmm1,zmm0,ZMMWORD PTR [r14]
|
| takes six bytes to encode:
|
| 62 d1 7c 48 58 0e
|
| In SVE and RVV a load+add takes 8 bytes to encode.
| dzaima wrote:
| > The great joy of basic x86 encoding is that you don't
| actually need to put things in registers to operate on them.
|
| That's... 1 register saved, out of 16 (or 32 on AVX-512).
| Perhaps useful sometimes, but far from a particularly
| significant aspect spill-wise.
|
| And doing that means you lose the ability to move the load
| earlier (perhaps not too significant on OoO hardware, but still
| a consideration; while reorder windows are multiple hundreds of
| instructions, the actual OoO limit is scheduling queues, which
| are frequently under a hundred entries, i.e. a couple dozen
| cycles worth of instructions, at which point the >=4 cycle
| latency of a load is not actually insignificant. And putting
| the load directly in the arith op is the worst-case scenario
| for this)
| lauriewired wrote:
| The three "flaws" that this post lists are exactly what the
| industry has been moving away from for the last decade.
|
| Arm's SVE and RISC-V's vector extension are both vector-length-
| agnostic. RISC-V's implementation is particularly nice: you only
| have to compile for one code path (unlike avx with the need for
| fat-binary else/if trees).
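|
| A sketch of what that single code path can look like with the
| RVV C intrinsics (illustrative; saxpy chosen only because it is
| short). The same binary runs on any VLEN, the tail is handled by
| vsetvl instead of a second loop, and LMUL=8 makes each
| instruction operate on a group of 8 registers:
|
|   #include <riscv_vector.h>
|   #include <stddef.h>
|
|   void saxpy(float a, const float *x, float *y, size_t n) {
|       while (n > 0) {
|           size_t vl = __riscv_vsetvl_e32m8(n);  /* elements this pass */
|           vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
|           vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
|           vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl);  /* y += a*x */
|           __riscv_vse32_v_f32m8(y, vy, vl);
|           x += vl; y += vl; n -= vl;
|       }
|   }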
| dragontamer wrote:
| 1. Not a problem for GPUs. NVidia and AMD are both 32-wide (i.e.
| 1024-bit wide), hard coded. AMD can swap to 64-wide mode for
| backwards compatibility to GCN. 1024-bit or 2048-bit seems to be
| the right values. Too wide and you get branch divergence issues,
| so it doesn't seem to make sense to go bigger.
|
| In contrast, the systems that have flexible widths have never
| taken off. It's seemingly much harder to design a programming
| language for a flexible width SIMD.
|
| 2. Not a problem for GPUs. It should be noted that kernels
| allocate custom amounts of registers: one kernel may use 56
| registers, while another kernel might use 200 registers. All GPUs
| will run these two kernels simultaneously (256+ registers per CU
| or SM is commonly supported, so both 200+56 registers kernels can
| run together).
|
| 3. Not a problem for GPUs or really any SIMD in most cases. Tail
| handling is an O(1) problem in general and not a significant
| contributor to code length, size, or benchmarks.
|
| Overall utilization issues are certainly a concern. But in my
| experience this is caused by branching most often. (Branching in
| GPUs is very inefficient and forces very low utilization).
| dzaima wrote:
| Tail handling is not significant for loops with tons of
| iterations, but there are a ton of real-world situations where
| you might have a loop take only like 5 iterations or something
| (even at like 100 iterations, with a loop processing 8 elements
| at a time (i.e. 256-bit vectors, 32-bit elements), that's 12
| vectorized iterations plus up to 7 scalar ones, which is still
| quite significant. At 1000 iterations you could still have the
| scalar tail be a couple percent; and still doubling the L1/uop-
| cache space the loop takes).
|
| It's absolutely a significant contributor to code size (..in
| scenarios where vectorized code in general is a significant
| contributor to code size, which admittedly is only very-
| specialized software).
| dragontamer wrote:
| Note that AVX512 has per-lane execution masks, so I'm not
| fully convinced that tail handling should even be a thing
| anymore.
|
| If(my lane is beyond the buffer) then (exec flag off, do not
| store my lane).
|
| Which in practice should be a simple vcompress instruction
| (AVX512 register) and maybe a move afterwards??? I admit that
| I'm not an AVX512 expert but it doesn't seem all that
| difficult with vcompress instructions + execmask.
| dzaima wrote:
| It takes like 4 instrs to compute the mask from an
| arbitrary length (AVX-512 doesn't have any instruction for
| this so you need to do `bzhi(-1, min(left,vl))` and move
| that to a mask register) so you still would likely want to
| avoid it in the hot loop.
|
| Doing the tail separately but with masking SIMD is an
| improvement over a scalar loop perf-wise (..perhaps outside
| of the case of 1 or 2 elements, which is a realistic
| situation for a bunch of loops too), but it'll still add a
| double-digit percentage to code size over just a plain SIMD
| loop without tail handling.
|
| And this doesn't help pre-AVX-512, and AVX-512 isn't
| particularly widespread (AVX2 does have masked load/store
| with 32-/64-bit granularity, but not 8-/16-bit, and the
| instrs that do exist are rather slow on AMD (e.g.
| unconditional 12 cycles/instr throughput for masked-storing
| 8 32-bit elements); SSE has none, and ARM NEON doesn't have
| any either (and ARM SVE isn't widespread either, incl. not
| supported on apple silicon))
|
| (don't need vcompress, plain masked load/store instrs do
| exist in AVX-512 and are sufficient)
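|
| Concretely, a rough sketch of such a masked tail (illustrative
| code, assuming AVX-512F plus BMI2 for bzhi; the hot loop stays
| unmasked):
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   void scale(float *p, float s, size_t n) {
|       __m512 vs = _mm512_set1_ps(s);
|       size_t i = 0;
|       for (; i + 16 <= n; i += 16)          /* unmasked hot loop */
|           _mm512_storeu_ps(p + i,
|               _mm512_mul_ps(_mm512_loadu_ps(p + i), vs));
|       /* Tail: low (n - i) bits set; a no-op when n - i == 0. */
|       __mmask16 m = (__mmask16)_bzhi_u32(0xFFFF, (unsigned)(n - i));
|       __m512 v = _mm512_maskz_loadu_ps(m, p + i);
|       _mm512_mask_storeu_ps(p + i, m, _mm512_mul_ps(v, vs));
|   }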
| dragontamer wrote:
| > It takes like 2 instrs to compute the mask from a
| length (AVX-512 doesn't have any instruction for this so
| you need to do a bzhi in GPR and move that to a mask
| register) so you still would likely want to avoid it in
| the hot loop.
|
| Keep a register with the values IdxAdjustment = [0, 1, 2,
| 3, 4, 5, 6, 7].
|
| ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) <
| Length
|
| Keep looping while(any(vector) < Length), which is as
| simple as "while(exec_mask != 0)".
|
| I'm not seeing this take up any "extra" instructions at
| all. You needed the while() loop after all. It costs +1
| Vector Register (IdxAdjustment) and a kMask by my count.
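|
| Spelled out with intrinsics, the scheme above looks roughly like
| this (illustrative sketch, assuming AVX-512F, 32-bit elements,
| and n small enough to fit in an int32):
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   void scale_all_masked(float *p, float s, size_t n) {
|       const __m512i adjust = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
|                                                8, 9, 10, 11, 12, 13, 14, 15);
|       const __m512i len = _mm512_set1_epi32((int)n);
|       for (size_t i = 0; i < n; i += 16) {
|           __m512i idx = _mm512_add_epi32(_mm512_set1_epi32((int)i), adjust);
|           __mmask16 m = _mm512_cmplt_epi32_mask(idx, len);  /* exec mask */
|           __m512 v = _mm512_maskz_loadu_ps(m, p + i);
|           _mm512_mask_storeu_ps(p + i, m,
|                                 _mm512_mul_ps(v, _mm512_set1_ps(s)));
|       }
|   }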
|
| > And this doesn't help pre-AVX-512, and AVX-512 isn't
| particularly widespread
|
| AVX512 is over 10 years old now. And the premier SIMD
| execution instruction set is CUDA / NVidia, not AVX512.
|
| AVX512 is now available on all AMD CPUs and has been for
| the last two generations. It is also available on a
| select number of Intel CPUs. There is also AMD RDNA,
| Intel Xe ISAs that could be targeted.
|
| > instrs that do exist are rather slow on AMD (e.g.
| unconditional 12 cycles/instr throughput for masked-
| storing 8 32-bit elements);
|
| Okay, I can see that possibly being an issue then.
|
| EDIT: AMD Zen5 Optimization Manual states Latency1 and
| throughput 2-per-clocktick, while Intel's Skylake
| documentation of
| https://www.intel.com/content/www/us/en/docs/intrinsics-
| guid... states Latency5 Throughput 1-per-clock-tick.
|
| AMD Zen5 seems to include support for vmovdqu8 (it's in the
| optimization guide .xlsx sheet with
| latencies/throughputs, also as 1-latency / 4-throughput).
|
| I'm not sure if the "mask" register changes the
| instruction. I'll do some research to see if I can verify
| your claim (I don't have my Zen5 computer built yet...
| but its soon).
| dzaima wrote:
| That's two instrs - bumping the indices, and doing the
| comparison. You still want scalar pointer/index bumping
| for contiguous loads/stores (using gathers/scatters for
| those would be stupid and slow), and that gets you the
| end check for free via fused cmp+jcc.
|
| And those two instrs are vector instrs, i.e. competing for
| execution units with the actual thing you want to compute,
| whereas scalar instrs have at least some independent units,
| which avoids pushing you towards ever more unrolling.
|
| And if your loop is processing 32-bit (or, worse,
| smaller) elements, those indices, if done as 64-bit, as
| most code will do, will cost even more.
|
| AVX-512 might be 10 years old, but Intel's latest (!)
| cores still don't support it on hardware with E-cores, so
| still a decade away from being able to just assume it
| exists. Another thread on this post mentioned that Intel
| has shipped hardware without AVX/AVX2/FMA as late as 2021
| even.
|
| > Okay, I can see that possibly being an issue then.
|
| To be clear, that's only the AVX2 instrs; AVX-512 masked
| loads/stores are fast (..yes, even on Zen 4 where the
| AVX-512 masked loads/stores are fast, the AVX2 ones that
| do an equivalent amount of work (albeit taking the mask
| in a different register class) are slow). uops.info: http
| s://uops.info/table.html?search=maskmovd%20m256&cb_lat=o.
| ..
|
| Intel also has AVX-512 masked 512-bit 8-bit-elt stores at
| half the throughput of unmasked for some reason (not
| 256-bit or >=16-bit-elt though; presumably culprit being
| the mask having 64 elts): https://uops.info/table.html?se
| arch=movdqu8%20m512&cb_lat=on...
|
| And masked loads use some execution ports on both Intel
| and AMD, eating away from throughput of the main
| operation. All in all just not implemented for being able
| to needlessly use masked loads/stores in hot loops.
| dragontamer wrote:
| Gotcha. Makes sense. Thanks for the discussion!
|
| Overall, I agree that AVX and Neon have their warts and
| performance issues. But they're like 15+ years old now
| and designed well before GPU Compute was possible.
|
| > using gathers/scatters for those would be stupid and
| slow
|
| This is where CPUs are really bad. GPUs will coalesce
| gather/scatters thanks to __shared__ memory (with human
| assistance of course).
|
| But also the simplest of load/store patterns are auto-
| detected and coalesced. So a GPU programmer doesn't have
| to worry about SIMD lane load/store (called vgather in
| AVX512) being slower. It's all optimized to hell and
| back.
|
| Having a full lane-to-lane crossbar and supporting high
| performance memory access patterns needs to be a priority
| moving forward.
| dzaima wrote:
| Thanks for the info on how things look on the GPU side!
|
| A messy thing with memory performance on CPUs is that
| either you share the same cache hardware between scalar
| and vector, thereby significantly limiting how much
| latency you can trade for throughput, or you have to add
| special vector L1 cache, which is a ton of mess and
| silicon area; never mind uses of SIMD that are latency-
| sensitive, e.g. SIMD hashmap probing, or small loops.
|
| I guess you don't necessarily need that for just
| detecting patterns in gather indices, but nothing's gonna
| get a gather of consecutive 8-bit elts via 64-bit indices
| to not perform much slower than a single contiguous load,
| and 8-bit elts are quite important on CPUs for strings &
| co.
| dang wrote:
| Related:
|
| _Three Fundamental Flaws of SIMD_ -
| https://news.ycombinator.com/item?id=28114934 - Aug 2021 (20
| comments)
| xphos wrote:
| Personally, I think load and increment address register in a
| single instruction is extremely valuable here. It's not quite the
| RISC model, but I think that it is actually pretty significant in
| avoiding a von Neumann bottleneck with SIMD (the irony in this
| statement)
|
| I found that for a lot of the custom SIMD cores I've written,
| RISC-V simply cannot issue instructions fast enough. Or when it
| can, it's in quick bursts, and then the increments and loop
| controls leave the engine idling for more than you'd like.
|
| Better dual issue helps, but when you have a separate vector
| queue you are sending things to, it's not that much extra to
| fold the increments into vloads and vstores.
| codedokode wrote:
| > load and increment address register in a single instruction
| is extremely valuable here
|
| I think this is not important anymore because modern
| architectures allow adding an offset to a register value, so you
| can write something like this (in RISC-V's weird offset(base)
| syntax for the addition):
|             ld r2, 0(r1)
|             ld r3, 4(r1)
|             ld r4, 8(r1)
|
| These operations can be executed in parallel, while with auto-
| incrementing you cannot do that.
| codedokode wrote:
| I think that packed SIMD is better in almost every aspect and
| Vector SIMD is worse.
|
| With vector SIMD you don't know the register size beforehand and
| therefore have to maintain and increment counters, adding extra
| unnecessary instructions, reducing total performance. With packed
| SIMD you can issue several loads immediately without
| dependencies, and if you look at code examples, you can see that
| the x86 code is more dense and uses a sequence of unrolled SIMD
| instructions without any extra instructions which is more
| efficient. While RISC-V has 4 SIMD instructions and 4
| instructions dealing with counters per loop iteration, i.e. it
| wastes 50% of command issue bandwidth and you cannot load next
| block until you increment the counter.
|
| The article mentions that you have to recompile packed SIMD code
| when a new architecture comes out. Is that really a problem? Open
| source software is recompiled every week anyway. You should just
| describe your operations in a high level language that gets
| compiled to assembly for all supported architectures.
|
| So as a conclusion, it seems that Vector SIMD is optimized for
| manually-written assembly and closed-source software while Packed
| SIMD is made for open-source software and compilers and is more
| efficient. Why RISC-V community prefers Vector architecture, I
| don't understand.
| LoganDark wrote:
| This comment sort of reminds me of how Transmeta CPUs relied on
| the compiler to precompute everything like pipelining. It
| wasn't done by the hardware.
| codedokode wrote:
| Makes sense - writing or updating software is easier than
| designing or updating hardware. To illustrate: anyone can
| write software but not everyone has access to chip
| manufacturing fabs.
| LoganDark wrote:
| Atomic Semi may be looking to change that (...eventually)
| IshKebab wrote:
| Those 4 counter instructions have no dependencies though so
| they'll likely all be issued and executed in parallel in 1
| cycle, surely? Probably the branch as well in fact.
| codedokode wrote:
| The load instruction has a dependency on counter increment.
| While with packed SIMD one can issue several loads without
| waiting. Also, extra counter instructions still waste
| resources of a CPU (unless there is some dedicated hardware
| for this purpose).
| dzaima wrote:
| > Open source software is recompiled every week anyway.
|
| Despite being potentially compiled recently, anything from most
| Linux package managers, and whatever precompiled downloadable
| executables, even if from open-source code, still targets the
| 20-year-old SSE2 baseline, wasting the majority of SIMD
| resources available on modern (..or just not-extremely-ancient)
| CPUs (unless you're looking at the 0.001% of software that
| bothers with dynamic dispatch; but for that approach just
| recompiling isn't enough, you also need to extend the
| dispatched target set).
|
| RISC-V RVV's LMUL means that you get unrolling for free, as
| each instruction can operate over up to 8 registers per
| operand, i.e. essentially "hardware 8x unrolling", thereby
| making scalar overhead insignificant. (probably a minor
| nightmare from the silicon POV, but perhaps not in a
| particularly limiting way - double-pumping has been done by x86
| many times so LMUL=2 is simple enough, and at LMUL=4 and LMUL=8
| you can afford to decode/split into uops at 1 instr/cycle)
|
| ARM SVE can encode adding a multiple of VL in load/store
| instructions, allowing manual unrolling without having to
| actually compute the intermediate sizes. (hardware-wise that's
| an extremely tiny amount of overhead, as it's trivially
| mappable to an immediate offset at decode time). And there's an
| instruction to bump a variable by a multiple of VL.
|
| And you need to bump pointers in any SIMD regardless; the only
| difference is whether the bump size is an immediate, or a
| dynamic value, and if you control the ISA you can balance
| between the two as necessary. The packed SIMD approach isn't
| "free" either - you need hardware support for immediate offsets
| in SIMD load/store instrs.
|
| Even in a hypothetical non-existent bad vector SIMD ISA without
| any applicable free offsetting in loads/stores and a need for
| unrolling, you can avoid having a dependency between unrolled
| iterations by precomputing "vlen*2", "vlen*3", "vlen*4", ...
| outside of the loop and adding those as necessary, instead of
| having a strict dependency chain.
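|
| A scalar stand-in for that last trick (illustrative; each inner
| loop stands in for one vector operation over vl elements, and
| the four chunk addresses depend only on i and the precomputed
| constants, not on each other):
|
|   #include <stddef.h>
|
|   void add_one(float *p, size_t n, size_t vl /* elems per vector */) {
|       size_t vl2 = 2 * vl, vl3 = 3 * vl, vl4 = 4 * vl;  /* once */
|       size_t i = 0;
|       for (; i + vl4 <= n; i += vl4) {
|           for (size_t j = 0; j < vl; j++) p[i + j]       += 1.0f;
|           for (size_t j = 0; j < vl; j++) p[i + vl + j]  += 1.0f;
|           for (size_t j = 0; j < vl; j++) p[i + vl2 + j] += 1.0f;
|           for (size_t j = 0; j < vl; j++) p[i + vl3 + j] += 1.0f;
|       }
|       for (; i < n; i++) p[i] += 1.0f;  /* leftover */
|   }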
| stephencanon wrote:
| The basic problem with almost every "SIMD is flawed, we should
| have vector ISAs" article or post (including the granddaddy,
| "SIMD Instructions Considered Harmful"), is that they invariably
| use SAXPY or something else trivial where everything stays neatly
| in lane as their demonstration case. Of course vector ISAs look
| good when you show them off using a pure vector task. This is
| fundamentally unserious.
|
| There is an enormous quantity of SIMD code in the world that
| isn't SAXPY, and doesn't stay neatly in lane. Instead it's things
| like "base64 encode this data" or "unpack and deinterleave this
| 4:2:2 pixel data, apply a colorspace conversion as a 3x3 sparse
| matrix and gamma adjustment in 16Q12 fixed-point format, resize
| and rotate by 15° with three shear operations represented as a
| linear convolution with a sinc kernel per row," or "extract these
| fields from this JSON data". All of which _can totally be done_
| with a well-designed vector ISA, but the comparison doesn't paint
| nearly as rosy of a picture. The reality is that you really want
| a mixture of ideas that come from fixed-width SIMD and ideas that
| come from the vector world (which is roughly what people actually
| shipping hardware have been steadily building over the last two
| decades, implementing more support for unaligned access,
| predication, etc, while the vector ISA crowd writes purist think
| pieces)
| Someone wrote:
| > Since the register size is fixed there is no way to scale the
| ISA to new levels of hardware parallelism without adding new
| instructions and registers.
|
| I think there is a way: vary register size per CPU, but also add
| an instruction to retrieve register size. Then, code using the
| vector unit will sometimes have to dynamically allocate a buffer
| for intermediate values, but it would allow for software to run
| across CPUs with different vector lengths. Does anybody know
| whether any architecture does this?
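|
| For what it's worth, RISC-V's V extension and Arm's SVE work
| roughly this way: the register width is an implementation
| choice, and software queries it at run time. A minimal sketch
| with the RVV C intrinsics (the buffer use is a made-up example):
|
|   #include <riscv_vector.h>
|   #include <stdlib.h>
|
|   float *alloc_vector_scratch(void) {
|       /* Max number of 32-bit elements one vector register holds
|          on this particular CPU. */
|       size_t max_elems = __riscv_vsetvlmax_e32m1();
|       return malloc(max_elems * sizeof(float));
|   }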
___________________________________________________________________
(page generated 2025-04-25 23:02 UTC)