[HN Gopher] Faster CRC32 on the Apple M1
___________________________________________________________________
Faster CRC32 on the Apple M1
Author : panic
Score : 251 points
Date : 2022-05-22 15:36 UTC (7 hours ago)
(HTM) web link (dougallj.wordpress.com)
(TXT) w3m dump (dougallj.wordpress.com)
| terrelln wrote:
| Could you combine both techniques to run both the SIMD version on
| some chunks and the crc32 instruction on other chunks, in
| parallel? Of course this would only work if they execute on
| different ports.
| dougall wrote:
| Hmm, yeah, this might work out... Two SIMD uops process 16
| bytes, so each SIMD uop is doing eight bytes of work - the same
| as CRC32X, but with more frontend pressure (though preferable,
| because they can run on any of the four SIMD ports, not just
| the one distinct CRC32X port).
|
| It gets a bit messy, and we can't expect a ton from this
| approach - the same loop with only the loads runs at
| ~86GB/s - but it'd be worth a shot.
| AlotOfReading wrote:
| Yes, and ZLib provides an excellent example of how to do this
| (it's called braiding in the source code). There's an
| additional initialization and combination cost though. It
| doesn't really make sense for short messages.
| terrelln wrote:
| Thanks for the pointer, will have to take a look!
| sgtnoodle wrote:
| It seems like CRC inherently depends on results from earlier
| calculations, so it would be hard to parallelize like that. You
| could potentially do multiple independent CRC calculations in
| parallel, but then you're getting into more niche use cases.
| aaaaaaaaaaab wrote:
| Wrong. CRC is just polynomial division, which is simple to do
| in a divide-and-conquer fashion. It's pretty easy to derive
| CRC(A concat B) from CRC(A) and CRC(B): it needs a carry-less
| multiplication and an XOR.
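As a concrete illustration of the point above, here is a minimal Python sketch of the combine step, following the same GF(2) matrix approach zlib uses in its `crc32_combine` (the function names and structure mirror zlib's C code; this transcription is mine and is for illustration, not production use):

```python
import zlib  # only used to sanity-check against the real CRC-32

def gf2_matrix_times(mat, vec):
    # Multiply a 32x32 GF(2) matrix (a list of 32 column ints) by a vector.
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total

def gf2_matrix_square(square, mat):
    # square = mat * mat over GF(2)
    for n in range(32):
        square[n] = gf2_matrix_times(mat, mat[n])

def crc32_combine(crc1, crc2, len2):
    """Derive CRC(A || B) from CRC(A), CRC(B), and len(B), zlib-style."""
    if len2 <= 0:
        return crc1
    odd = [0] * 32
    even = [0] * 32
    odd[0] = 0xEDB88320           # reflected CRC-32 polynomial
    row = 1
    for n in range(1, 32):
        odd[n] = row              # shifted identity: odd = "one zero bit"
        row <<= 1
    gf2_matrix_square(even, odd)  # even = operator for two zero bits
    gf2_matrix_square(odd, even)  # odd  = operator for four zero bits
    while True:                   # append len2 zero *bytes* to crc1
        gf2_matrix_square(even, odd)
        if len2 & 1:
            crc1 = gf2_matrix_times(even, crc1)
        len2 >>= 1
        if len2 == 0:
            break
        gf2_matrix_square(odd, even)
        if len2 & 1:
            crc1 = gf2_matrix_times(odd, crc1)
        len2 >>= 1
        if len2 == 0:
            break
    return crc1 ^ crc2

a, b = b"some bytes ", b"more bytes"
assert crc32_combine(zlib.crc32(a), zlib.crc32(b), len(b)) == zlib.crc32(a + b)
```

This is what lets independently computed chunk CRCs be merged at the end, at a cost logarithmic in the length of the second chunk.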
| 13of40 wrote:
| Weird, I talked to the guy who invented the crypto scheme for ZIP
| and he said he invented the CRC algorithm for it as well. I
| wonder if there's more backstory there.
|
| Edit: He invented the crypto not the CRC, which Phil Katz was
| already using.
| AlotOfReading wrote:
| That's a very strange claim. As far as I know, the same CRC has
| been used for ZIP since it was invented by the PKWARE guys in
| the 80s. Moreover, no one deeply understood the properties and
| tradeoffs of various CRCs the way we do now, so everyone used
| largely identical algorithms with only trivial variations. Phil
| Katz did the same and reused the same polynomial that was in
| ethernet and dozens of other standards, which in turn had
| originated from this 1975 report:
| https://apps.dtic.mil/sti/pdfs/ADA013939.pdf
|
| He wasn't even the first to put a CRC in an archive format, as
| the predecessor format ARC had a CRC-16 doing the same thing.
| carbonbee wrote:
| I think what OP meant to write: the zip encryption algorithm
| is a custom stream cypher that uses crc32 as the main
| building block.
|
| (It's a very bad cypher, vulnerable to known plaintext and
| other attacks, don't use it for anything except light
| scrambling).
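For the curious, a rough Python sketch of that cipher as described in PKWARE's APPNOTE.TXT (the class name and structure here are my own; this is purely to show how CRC-32 updates drive the key state, and it must never be used for real protection):

```python
def _make_crc_table():
    # Standard reflected CRC-32 table, polynomial 0xEDB88320.
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

_CRC = _make_crc_table()

def _crc32(crc, b):
    # One-byte CRC update, no pre/post inversion (as ZipCrypto uses it).
    return _CRC[(crc ^ b) & 0xFF] ^ (crc >> 8)

class ZipCrypto:
    def __init__(self, password):
        self.k0, self.k1, self.k2 = 0x12345678, 0x23456789, 0x34567890
        for c in password:
            self._update(c)

    def _update(self, b):
        # CRC-32 is the main building block of the key schedule.
        self.k0 = _crc32(self.k0, b)
        self.k1 = (self.k1 + (self.k0 & 0xFF)) & 0xFFFFFFFF
        self.k1 = (self.k1 * 134775813 + 1) & 0xFFFFFFFF
        self.k2 = _crc32(self.k2, self.k1 >> 24)

    def _stream_byte(self):
        t = (self.k2 | 2) & 0xFFFF
        return ((t * (t ^ 1)) >> 8) & 0xFF

    def encrypt(self, data):
        out = bytearray()
        for p in data:
            out.append(p ^ self._stream_byte())
            self._update(p)          # keys advance on the plaintext byte
        return bytes(out)

    def decrypt(self, data):
        out = bytearray()
        for c in data:
            p = c ^ self._stream_byte()
            out.append(p)
            self._update(p)
        return bytes(out)
```

Because the keystream depends only on bytes already seen, known plaintext lets an attacker recover the internal keys, which is exactly the classic attack on this scheme.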
| mappu wrote:
| Nowadays most zip programs will default to using AES (in
| the way WinZip invented) instead of ZipCrypto.
| 13of40 wrote:
| Ah, OK, I looked in my email and it was the guy who invented
| the encryption scheme, but he said "Yes, I invented it
| [referring to the encryption]. It wasn't based on anything
| else, except that it used the same CRC he [Phil] was already
| using in zip." (As a historical note, he also said the crypto
| scheme was intended to be exportable, which at that time
| meant "intentionally weak".)
| IncRnd wrote:
| It's always important to know which crc you are using.
|
| Looking at
| https://developer.arm.com/documentation/ddi0596/2020-12/Base...,
| based upon the polynomial constant, the CRC32 class of
| instructions appear to calculate a CCITT 32 reversed polynomial.
| Are there any ARM developers who can help me out here? Does this
| apply in the same way to the M1?
| dougall wrote:
| The post glosses over it a bit - the CRC32X instruction always
| uses the common polynomial 0x04C11DB7 (matching zlib, commonly
| just called CRC-32), and there's a second instruction, CRC32CX,
| which is the same but uses the polynomial 0x1EDC6F41, known as
| CRC-32C (Castagnoli). The constants in the post are also for
| 0x04C11DB7, but the linked Intel article explains how they can
| be calculated for arbitrary polynomials, so the faster method
| is also generic, which is nice.
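A quick way to see that the two polynomials give unrelated checksums is a bit-at-a-time Python sketch of CRC-32C next to zlib's built-in CRC-32 (the reflected form of 0x1EDC6F41 is 0x82F63B78; the expected values below are the standard check values for the nine-byte test string):

```python
import zlib

def crc32c(data, crc=0):
    """Bit-at-a-time CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    Slow, but shows the only difference from CRC-32 is the polynomial."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF

print(hex(crc32c(b"123456789")))      # 0xe3069283 (CRC-32C check value)
print(hex(zlib.crc32(b"123456789")))  # 0xcbf43926 (CRC-32  check value)
```

This is the distinction between Intel's SSE4.2 CRC32 instruction (which computes CRC-32C) and ARM's CRC32X/CRC32CX pair, which covers both.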
| DeathArrow wrote:
| I wonder how fast someone can get it to run on Intel's 12th
| generation Core CPUs.
|
| It seems like a good idea to start a Code Golf competition.
| fwsgonzo wrote:
| https://github.com/komrad36/CRC
|
| I measured it on my computer (i7-4770K) to run at 32 GB/s.
| jeffbee wrote:
| I don't think they will hit 75GB/s because Intel doesn't make
| that kind of memory throughput available to a single core.
| dragontamer wrote:
| https://www.intel.com/content/dam/www/public/us/en/documents...
| Twirrim wrote:
| Likely as much noise as signal, but anyway:
|
| I'm using an 8th(?) generation Intel, i7-8665U.
|
| https://github.com/htot/crc32c has some interesting
| implementations of CRC32 algorithms of different speeds, the
| highest I see is (function, aligned, bytes, MiB/s):
|
|     crc32cIntelC  true    16   3907.613
|     crc32cIntelC  true    64  15096.758
|     crc32cIntelC  true   128  24692.803
|     crc32cIntelC  true   192  22732.392
|     crc32cIntelC  true   256  16233.397
|     crc32cIntelC  true   288  16748.952
|     crc32cIntelC  true   512  19862.039
|     crc32cIntelC  true  1024  22373.350
|     crc32cIntelC  true  1032  22482.031
|     crc32cIntelC  true  4096  24690.531
|     crc32cIntelC  true  8192  24992.827
|
| So pushing 25GiB/s on a 3ish year old CPU.
| wtallis wrote:
| FYI, CRC32C is not the same checksum as CRC32. CRC32C is used
| in iSCSI, btrfs, ext4. CRC32 is used in Ethernet, SATA, gzip.
| Intel's SSE4.2 provides an instruction implementing CRC32C
| but not CRC32. ARM defines instructions for both.
|
| This article seems to miss that distinction but appears to be
| testing CRC32, so it's not quite correct to compare against
| something using Intel's CRC32C instruction.
| neurostimulant wrote:
| I'm using an i7-4790 (a 7-year-old CPU) and the numbers are
| slightly better, maybe because it's a desktop.
|
|     crc32cIntelC  true    16   4025.334
|     crc32cIntelC  true    64  15749.095
|     crc32cIntelC  true   128  26608.064
|     crc32cIntelC  true   192  25828.486
|     crc32cIntelC  true   256  17448.436
|     crc32cIntelC  true   288  18336.381
|     crc32cIntelC  true   512  22635.590
|     crc32cIntelC  true  1024  24654.248
|     crc32cIntelC  true  1032  24180.107
|     crc32cIntelC  true  4096  28251.903
|     crc32cIntelC  true  8192  28768.134
| jeffbee wrote:
| Core i7-12700K:
|
|     crc32cIntelC  true    64  26210.561
|     crc32cIntelC  true   128  35870.309
|     crc32cIntelC  true   192  36850.224
|     crc32cIntelC  true   256  30343.690
|     crc32cIntelC  true   288  30671.327
|     crc32cIntelC  true   512  32443.251
|     crc32cIntelC  true  1024  34654.719
|     crc32cIntelC  true  1032  34265.440
|     crc32cIntelC  true  4096  38111.089
|     crc32cIntelC  true  8192  38634.925
| DantesKite wrote:
| I really like that the author gave some context at the top. So
| many times I struggle to read or realize the importance of a
| concept because there just isn't enough context for me to follow
| along.
|
| And certainly not all blogs have to, but it's nice when it is.
| parentheses wrote:
| It is very interesting that since the release of the M1 chip, CPU
| performance on Apple silicon has really come under a microscope.
| It leads me to ask:
|
| Was Apple silicon always this best-in-class and we weren't
| looking this closely as a community?
| mhh__ wrote:
| You couldn't buy one without a screen, or run arbitrary code on
| it.
| minhazm wrote:
| Apple's chips in their iPhones & iPads have been outperforming
| the competition (Qualcomm & Samsung) for a long time now in
| both power efficiency and performance. Apple has usually been
| around 2 yrs ahead of the competition. The Qualcomm Snapdragon
| 888 chip with 8 cores has a Geekbench multi-core score of
| 3592[1]. The six-core Apple A15 Bionic scored 4673. The Apple
| chip has ~30% better multi-core performance with 25% fewer
| cores than the Qualcomm chip. In single-core performance the
| difference is even larger, with the Apple chips performing ~47%
| faster. You can get the same A15 Bionic in both the $429 iPhone
| SE and the $1100+ iPhone 13 Pro Max.
|
| It hasn't really been a huge deal, though, because people don't
| develop directly on an iPhone, so it doesn't affect their
| everyday productivity all that much. Also, phones reached the
| point of "fast enough" a few years ago; it's hard to tell the
| difference between an iPhone 13 Pro and an 11 Pro unless you
| use them side by side. But with the release of the M1 chip,
| people are getting the performance and energy-efficiency gains
| in their everyday workflows.
|
| [1]. https://browser.geekbench.com/android-benchmarks
|
| [2]. https://browser.geekbench.com/ios-benchmarks
| skavi wrote:
| Yes, and anyone who had been reading AnandTech's excellent
| mobile reviews (by Andrei Frumusanu) knew what M1 would bring.
| Aissen wrote:
| Everyone was looking, and it was well known that it was best
| in-class; two random examples:
| https://www.anandtech.com/show/7335/the-iphone-5s-review/4
|
| https://twitter.com/codinghorror/status/912047023871860737
| ntoskrnl wrote:
| So ARM64 has dedicated instructions for CRC32, but implementing
| it by hand using SIMD is still faster. Score another point for
| RISC.
| pclmulqdq wrote:
| It is not faster to use SIMD by hand - it is faster to use the
| vector unit alongside the integer unit, using both paths at the
| same time.
| dragontamer wrote:
| SIMD is a very powerful parallelization technique, with
| marvelous gains whenever I see it used. It seems like a
| fundamentally more efficient form of compute, but is very
| difficult to design algorithms for.
|
| I'd argue against "SIMD" as being "RISC", since you need all
| sorts of complicated instructions (ex: gather/scatter) to
| really support the methodology well in practice.
| tremon wrote:
| But scatter/gather is a primitive operation for SIMD, so if
| you want a RISC-based version of it, that's exactly what you
| would provide. Having dedicated instructions for specific
| operations (whether for crc/aes/nnp or whatever) feels like a
| CISC-based approach, so I think I agree with the GP.
|
| RISC vs CISC is about the simplicity of the instruction set,
| not about whether it's easy to use.
| mhh__ wrote:
| These days I'd argue RISC vs CISC is more about regularity
| and directness than the size of the ISA per se.
|
| I'd argue AArch64 isn't particularly RISC by the standards
| of the past but it sets the bar and tone for RISC today.
| dragontamer wrote:
| And which SIMD instruction set should we be talking
| about? NEON-instructions or with the SVE instruction set?
|
| And if we're talking about multiple instruction-sets
| designed for the same purpose, is this thing really RISC
| anymore? Or do you really mean "just not x86" when you
| say RISC ??
| mhh__ wrote:
| That depends on how precisely you define the purpose.
| NEON and SVE seem to be aimed at different intensities of
| work.
| dragontamer wrote:
| > But scatter/gather is a primitive operation for SIMD
|
| Not in NEON, and therefore not in M1. AVX512 and SVE add
| scatter/gather instructions.
|
| Intel/AMD's AVX2 has vgather instructions, but scatter is
| missing until AVX-512.
|
| > Having dedicated instructions for specific operations
| (whether for crc/aes/nnp or whatever) feels like a CISC-
| based approach, so I think I agree with the GP.
|
| Not only are there AES instructions on ARM, but there's
| also SHA-instructions. The mix-columns step of AES more or
| less demands dedicated hardware if you want high-speed
| today, so everybody implements that as a hardware specific
| instruction.
| tlb wrote:
| Similarly, I wish that on x86, REP MOVSB were the fastest way
| to copy memory, because it only takes a few bytes in the
| icache. But fast memcpys end up being hundreds of bytes, to
| work with larger words while handling start and end alignment.
| userbinator wrote:
| It still is in general situations (i.e. not the
| microbenchmarks where the ridiculously bloated unrolled
| "optimised" implementations may have a very slight edge). I
| believe the Linux kernel uses it for this reason.
| jabl wrote:
| The kernel is a bit of a special case since very likely a
| syscall starts off with a cold I$, and also there's a lot
| of extra overhead if you insist on using SIMD registers.
|
| In general I agree with you though, optimizing memcpy
| implementations only against microbenchmarks is dumb.
| saagarjha wrote:
| With ERMS it's definitely not going to be slow, so it's a
| good choice when you're in a constrained environment (high
| instruction cache pressure, can't use vector instructions).
| userbinator wrote:
| That's very shortsighted thinking. The dedicated instruction
| could be optimised by the hardware in a future revision to
| become much faster.
| MichaelZuo wrote:
| It's very impressive that someone messing around for a few
| hours could get the M1 chip to more than 2x the performance.
| Easy gains like that really shouldn't be possible, assuming
| Apple's silicon team is competent. Maybe there's some hidden
| gotcha here?
| dzaima wrote:
| CRC32X works on 8 bytes at a time and has a throughput of one
| invocation per cycle, whereas the SIMD operates on blocks in
| parallel (the chromium code does 64 bytes an iteration, with
| a lot of instruction-level parallelism too). Theoretically M1
| could have thrown more silicon at it to allow more than one
| CRC32X invocation per cycle, but that's not very useful if
| you can achieve the same with SIMD anyway.
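A rough back-of-envelope check of those ceilings (the ~3.2 GHz clock is an assumed figure for the M1's performance cores, not a number from this thread):

```python
# Assumed M1 P-core clock in GHz; the real value varies by model and load.
freq_ghz = 3.2

# One 8-byte CRC32X per cycle caps the scalar path:
crc32x_gbps = 8 * freq_ghz
print(crc32x_gbps)  # 25.6 (GB/s)

# SIMD processing 64 bytes per loop iteration would only need to
# sustain roughly one iteration every ~2.7 cycles to reach ~75 GB/s:
simd_gbps = 64 * freq_ghz / 2.7
print(round(simd_gbps, 1))
```

So the SIMD path wins not per-instruction but by operating on much wider blocks per iteration.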
| Sirened wrote:
| This is way more common than you'd think, and it's not by
| accident. Engineering teams optimize the paths that are
| heavily used to get the biggest improvement across the
| platform as a whole. CRC32X is certainly not as heavily used
| as NEON and so if you're forced to decide between spending
| area on being able to fuse extra instructions for NEON and
| slightly improving throughput for CRC32X, the obvious choice
| is NEON. You see this way more obviously on Intel's x86-64
| cores where many of the highly used instructions are fast
| path decoded but some of the weirder CISC instructions that
| nobody really uses are offloaded to very slow microcode.
| bee_rider wrote:
| I wonder -- could CRC32X also be something that's specifically
| less interesting for Apple? They are mostly optimizing for
| desktop workloads. I wonder if worrying about checksumming,
| especially maximizing the throughput of checksum operations, is
| more of a server thing. (We have to checksum when we download
| things on desktop, but that's a one-off, and I guess things get
| checksummed in the filesystem, but even the nice NVMe drives
| are pretty slow from the CPU's point of view.)
| astrange wrote:
| Mach-O uses codesigning with ad-hoc signatures as a form of
| checksumming, and there's also TCP and whatever drives
| do. So it's the converse, it's so common the dedicated
| hardware does it instead of the CPU. And it's not all the
| same algorithm.
| AlotOfReading wrote:
| Intel's algorithm is very clever and would take a lot of
| space to implement in hardware. The implementation underlying
| the CRC32** instructions is probably some set of shift
| registers. That's a pretty good space/speed tradeoff to make.
|
| My largely uninformed guess is that they added the
| instructions to get fast CRCs for the filesystem 'for free'.
| There aren't many other cases where software CRC can be a
| bottleneck that also use these polynomials.
| interestica wrote:
| Saving it for M2 to have something to show off?
| nicoburns wrote:
| I feel like CRC32 may be simple enough (and close enough to
| the kind of operation like adding and bit-shifting that
| general-purpose CPUs are good at anyway, that perhaps it
| doesn't benefit as much from dedicated silicon as other
| algorithms would.
| [deleted]
| naniwaduni wrote:
| > Easy gains like that really shouldn't be possible,
|
| Easy gains are _everywhere_. The "gotcha", if you can call
| it that, is that optimizing particular operations comes with
| space tradeoffs that are more expensive when you do them in
| hardware.
| d_tr wrote:
| I am not taking any hard stance on the usefulness of the
| specialized instruction, but M1 is a very wide and powerful
| core, so this won't be true everywhere.
|
| The single instruction might also be more power-efficient and
| keep other resources free for other stuff.
| brigade wrote:
| Specifically, the M1's NEON ALUs can handle 64 bytes per
| cycle on the big cores. Most Cortex can only do 32 bytes per
| cycle (Cortex X1/X2 is the exception), and the lower end
| designs are only 16B/cycle. Plus the pmull+eor fusion on M1
| increases effective throughput by 33%, and I don't know if
| that's implemented on any Cortex.
|
| Without those, the NEON version would fall to the same
| throughput as one CRC32X per cycle, or half the throughput on
| cores with 2x64bit ALUs.
| athrowaway3z wrote:
| Also not taking hard stances, but both cases are suspect.
| Power efficiency because being 3 times faster means you're
| done 3 times earlier. Keeping other resources free because I
| suspect a CRC calculation is generally followed by an `if eq`
| statement. ( Even with out-of-order or speculative execution
| this creates a bottle neck that is nice to remove 3x faster )
| stingraycharles wrote:
| If you're writing optimized code, hardly ever would you
| evaluate one CRC check at a time. You would process them in
| chunks, as the OP stated, but would just let a compiler do
| the auto-vectorization.
|
| This is even more true in the case of CRC, where there's
| clearly almost always one branch that wins: this is perfect
| for branch prediction, which would mean the whole "if eq"
| condition is preemptively skipped.
| saagarjha wrote:
| The compiler probably isn't going to be able to
| autovectorize a CRC unless you help it out.
| dottrap wrote:
| I think power efficiency has a lot more variables now so it
| is not easy to know if consumption is linear with time.
| CPUs now dynamically throttle themselves, plus Apple has
| advertised that its M1 cores are divided between
| high-performance and high-efficiency cores, let alone how the
| underlying chip itself may consume power differently for
| implementing different instructions.
|
| So for a hypothetical example, it could be that using
| general purpose SIMD triggers the system to throttle up the
| CPUs and/or move to the high performance CPUs, whereas the
| dedicated CRC instructions might exist on the high-
| efficiency cores and not trigger any throttling.
|
| I've forgotten all my computer architecture theory, but if
| I look back at Ohm's law and look at power, the equation is
| P = I^2 * R. Handwaving from my forgotten theory a bit
| here, ramping up the CPUs increases current, and we see
| that it is a squared factor. So while cutting the time by,
| say, a factor of 3 does mean you are done 3 times faster
| (which is a linear component), you still have to contend with
| a squared component in current which may have been increased.
|
| I have no clue if the M1 actually does any of this, but
| merely stating that it is not obvious what is happening in
| terms of power efficiency. We've seen other examples of
| this. For example, I've read that Intel's AVX family of
| instructions generally increases the power consumption and
| frequency when utilized, but, non-obviously, it often
| runs at a lower frequency in 256- or 512-bit wide forms
| compared to the lesser widths (which then requires more
| work from the developer to figure out the optimal
| performance path, as wider isn't necessarily faster). And as
| another example, when Apple shipped 2 video cards in their
| Macbooks, some general purpose Mac desktop application
| developers who cared about battery life were tip-toeing
| around different high level Apple APIs (e.g. Cocoa, Core
| Animation, etc.) because some APIs under the hood
| automatically triggered the high performance GPU to switch
| on (and eat power), while these general purpose desktop
| applications didn't want or need the extra performance (at
| the cost of eating the user's battery).
| saagarjha wrote:
| > whereas the dedicated CRC instructions might exist on
| the high-efficiency cores
|
| M1 has a heterogeneous ISA, FWIW.
| astrange wrote:
| Homogenous? The P and E cores have all the same
| instructions. You won't get suddenly moved from one to
| the other or hit emulations.
| saagarjha wrote:
| Oops, yes, that's what I meant. Thanks for catching that!
| rowanG077 wrote:
| It's not really a fair comparison. There is only one CRC32
| unit, which means it can't take advantage of superscalar
| execution (at least if I understand the article correctly). If
| it had more CRC32 units, that would be the most efficient.
| adrian_b wrote:
| No, the faster implementation uses another dedicated
| instruction, which happens to be more general than CRC32, i.e.
| the multiplication of polynomials having binary coefficients.
|
| So this has little to do with RISC, except the general
| principle that the instructions that are used more frequently
| should be implemented to be faster, a principle that has been
| used by the M1 designers and by any other competent CPU
| designers.
|
| In this case, ARM has added the polynomial multiplication
| instruction a few years after Intel, with the same main purpose
| of accelerating the authenticated encryption with AES. There is
| little doubt that ARM was inspired by the Intel Westmere new
| instructions (announced by Intel a few years before the
| Westmere launch in 2010).
|
| The dedicated CRC32 instruction could have been made much
| faster, but the designers of the M1 core did not believe that
| this is worthwhile, because that instruction is not used often.
|
| The polynomial multiplication is used by many more
| applications, because it can implement CRC computations based
| on any polynomial, not only the one specified for CRC32, and it
| can also be used in a great number of other algorithms that are
| based on the properties of the fields whose elements are
| polynomials with binary coefficients.
|
| So it made sense to have a better implementation for the
| polynomial multiplication, which allows greater speeds in many
| algorithms, including the CRC computation.
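To make the idea concrete, carry-less multiplication is just long multiplication where the shifted partial products are XORed instead of added; it is what PMULL (ARM) and PCLMULQDQ (x86) compute on 64-bit operands in hardware. A bit-by-bit Python sketch:

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply: treat a and b as GF(2) polynomials and
    XOR the shifted partial products instead of adding them."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

# (x^2 + 1) * (x + 1) = x^3 + x^2 + x + 1
print(bin(clmul(0b101, 0b11)))  # 0b1111
```

Note there are no carries: 0b101 * 0b11 as ordinary integers would be 15 here too, but e.g. clmul(0b11, 0b11) is 0b101, not 9.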
| saagarjha wrote:
| Amusingly the Rosetta runtime uses crc32x
| IshKebab wrote:
| That sounds like a point for RISC to me?
|
| Ok maybe it is just a point against really complex
| instructions. There's clearly an optimum middle ground.
| DonHopkins wrote:
| Mary Payne designed the VAX floating point POLY instruction
| at DEC.
|
| But the microcode and hardware floating point
| implementations did it slightly differently. Then the
| MicroVAX dropped it, then picked it up again but wrong,
| then fixed it, then lost it again.
|
| http://simh.trailing-edge.com/docs/vax_poly.pdf
|
| https://documentation.help/VAX11/op_POLY.htm
|
| https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_o
| p...
|
| >Multiply-accumulate operation
|
| >The Digital Equipment Corporation (DEC) VAX's POLY
| instruction is used for evaluating polynomials with
| Horner's rule using a succession of multiply and add steps.
| Instruction descriptions do not specify whether the
| multiply and add are performed using a single FMA step.
| This instruction has been a part of the VAX instruction set
| since its original 11/780 implementation in 1977.
|
| https://news.ycombinator.com/item?id=20558618
|
| >VAX was a crazy town ISA too, with stuff like single
| isntruction polynomial evaluation.
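For reference, Horner's rule as POLY implements it folds one multiply and one add per coefficient; a small Python sketch (the function name is mine):

```python
def poly_horner(coeffs, x):
    """Evaluate a polynomial via Horner's rule, one multiply-add per
    coefficient, in the spirit of the VAX POLY instruction.
    coeffs runs from the highest-degree term down to the constant."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

# 2x^2 + 3x + 1 at x = 2  ->  15
print(poly_horner([2.0, 3.0, 1.0], 2.0))  # 15.0
```

Whether each step rounds once (fused multiply-add) or twice is exactly the ambiguity the VAX documentation left open, which is why the microcode and hardware implementations could disagree.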
| astrange wrote:
| Complex instructions are often good ideas. They're best at
| combining lots of bitshifting (hardware is good at that and
| it can factor things out) but even for memory ops it can be
| good (the HW can optimize them by knowing things like cache
| line sizes).
|
| They get a bad rap because the only really complex ISA left
| is x86 and it just had especially bad ideas about which
| operations to use its shortest codes on. Nobody uses BOUND
| to the point some CPUs don't even include it.
|
| One point against them in SIMD is there definitely is an
| instruction explosion there, but I haven't seen a
| convincing better idea, and I think the RISC-V people's
| vector proposal is bad and shows they have serious knowing
| what they're talking about issues.
| adgjlsfhk1 wrote:
| what's wrong with the riscv approach?
| astrange wrote:
| They want to go back to the 70s (not meant as an insult)
| and use Cray-style vector instructions instead of SIMD.
| Vector instructions are kind of like loops in one
| instruction; you set a vector length register and then
| all the vector instructions change to run on that much
| data.
|
| That's ok for big vectors you'd find in like scientific
| computations, but I believe it's bad for anything I've
| ever written in SIMD. Games and multimedia usually use
| short vectors (say 4 ints) or mixed lengths (2 and 4 at
| once) and are pretty comfortable with how x86/ARM/PPC
| work.
|
| Not saying it couldn't be better, but the RISCV designers
| wrote an article about how their approach would be better
| basically entirely because they thought SIMD adds too
| many new instructions and isn't aesthetically pretty
| enough. Which doesn't matter.
|
| Also I remember them calling SIMD something weird and
| politically incorrect in the article but can't remember
| what it was...
| adgjlsfhk1 wrote:
| To me, the much bigger problem with SIMD is that it is
| really hard to program for since vector sizes keep
| getting bigger, and it really doesn't look good at all
| for big-little designs which seem to be the future. Most
| compilers still aren't good at generating AVX-512 code,
| and very few applications use it because it requires
| optimization for a very small portion of the market. With
| a variable length vector, everyone gets good code.
|
| Also, I'm not clear why you think riscv style vector
| instructions will perform worse on 2-4 length vectors.
| devit wrote:
| The RISC-V design is the sensible one since it does not
| hardcode the vector register size; the other designs are
| idiotic since a new instruction set needs to be created
| every time the vector register size is increased
| (MMX->SSE2->AVX2->AVX-512) and you can't use shorter
| vector registers on lower-power CPUs without reducing
| application compatibility.
|
| And it doesn't "loop" in one instruction, the vsetvl
| instruction will set the vector length to the _minimum_
| of the vector register size and the amount of data left
| to process.
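That vsetvl behavior can be sketched as a strip-mined loop; this Python stand-in (the names are assumptions for illustration, not real RISC-V intrinsics) mimics how each iteration asks the hardware how many elements it may process:

```python
def vsetvl(requested, vlen_max=4):
    # Stand-in for the RISC-V vsetvl instruction: the hardware grants
    # min(vector register capacity, elements remaining to process).
    return min(requested, vlen_max)

def vector_add(a, b):
    """Strip-mined elementwise add: the same binary handles any length
    and any hardware vector width, with no scalar cleanup loop."""
    out = []
    i, n = 0, len(a)
    while i < n:
        vl = vsetvl(n - i)  # elements handled this pass
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out

print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```

With 5 elements and a 4-wide register, the loop runs once with vl=4 and once with vl=1, which is the application-compatibility point: the same code runs unchanged on wider or narrower hardware.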
| atq2119 wrote:
| Isn't there a sort of obvious "best of both worlds" by
| having a vector instruction ISA with a vector length
| register, plus the promise that a length >= 4 is always
| supported and good support for intra-group-of-4 shuffles?
|
| Then you can do whatever you did for games and multimedia
| in the past, except that you can process N
| samples/pixels/vectors/whatever at once, where N = vector
| length / 4, and your code can automatically make use of
| chips that allow longer vectors without requiring a
| recompile.
|
| Mind you, I don't know if that's the direction that the
| RISC-V people are taking. But it seems like a pretty
| obvious thing to do.
___________________________________________________________________
(page generated 2022-05-22 23:00 UTC)