[HN Gopher] RISC-V Instructions
___________________________________________________________________
RISC-V Instructions
Author : robalni
Score : 8 points
Date : 2023-07-09 17:48 UTC (5 hours ago)
(HTM) web link (www.robalni.org)
| sylware wrote:
| What seems to be missing are hardware-optimized, accelerated
| memcpy/memset for both short and large sizes.
|
| On x86_64, modern micro-archs accelerate "rep stos[bwdq]" and "rep
| movs[bwdq]". I bet that, in modern binaries, memcpy/memset call
| sites are actually placeholders for such instructions (patched in
| before the memory segment goes back to Read/Executable). The
| arguments arrive in rdi, rsi, rdx, so rcx would be pushed on the
| stack, or the code generated to account for rcx being the only
| register that needs to be available at the call site.
|
| Also, expect x86_64 -> RISC-V port bugs because of the naming:
| byte->byte, word->halfword, doubleword->word, quadword->doubleword
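|
| To make the "rep movs" idea above concrete, here is a minimal sketch
| in C. The helper name copy_rep_movsb is hypothetical, and the inline
| assembly assumes GCC or Clang on x86-64 with the direction flag clear
| (as the System V ABI requires at function entry). On any other
| target it simply falls back to memcpy. Note that the count has to be
| in rcx, while a memcpy call passes it in rdx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the thread: copy n bytes with
 * `rep movsb` on x86-64 GCC/Clang builds, memcpy elsewhere. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    /* rep movsb copies RCX bytes from [RSI] to [RDI];
     * all three registers are updated by the instruction. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
    return dst;
}
```

| Whether this beats a library memcpy depends on the micro-arch and
| the copy size, so it is best treated as something to benchmark.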
| camel-cdr wrote:
| You'll likely see memcpy implemented using the vector
| extension, e.g.:
|
|   memcpy:
|       mv      a3, a0                    # Copy destination
|   loop:
|       vsetvli t0, a2, e8, m8, ta, ma    # Vectors of 8b
|       vle8.v  v0, (a1)                  # Load bytes
|       add     a1, a1, t0                # Bump pointer
|       sub     a2, a2, t0                # Decrement count
|       vse8.v  v0, (a3)                  # Store bytes
|       add     a3, a3, t0                # Bump pointer
|       bnez    a2, loop                  # Any more?
|       ret                               # Return
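|
| For readers unfamiliar with the vector extension, the loop above can
| be modelled in scalar C. This is only a sketch: VLMAX_BYTES,
| model_vsetvli and model_vector_memcpy are hypothetical names, the
| value 128 assumes VLEN=128 with LMUL=8 and 8-bit elements, and real
| hardware is allowed to pick a smaller vl than min(n, VLMAX) in some
| cases.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed hardware limit: bytes per vector register group
 * (VLEN=128 bits, LMUL=8, SEW=8 gives 128 elements). */
enum { VLMAX_BYTES = 128 };

/* Simplified model of vsetvli: grant min(requested, VLMAX). */
static size_t model_vsetvli(size_t requested)
{
    return requested < VLMAX_BYTES ? requested : VLMAX_BYTES;
}

/* Scalar model of the strip-mined vector memcpy loop. */
static void *model_vector_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;          /* a3: working copy of destination */
    const unsigned char *s = src;    /* a1: source pointer */
    unsigned char v0[VLMAX_BYTES];   /* v0..v7 register group */

    do {
        size_t vl = model_vsetvli(n);                  /* t0 */
        assert(vl <= VLMAX_BYTES);
        for (size_t i = 0; i < vl; i++) v0[i] = s[i];  /* vle8.v */
        s += vl;                                       /* bump pointer */
        n -= vl;                                       /* decrement count */
        for (size_t i = 0; i < vl; i++) d[i] = v0[i];  /* vse8.v */
        d += vl;                                       /* bump pointer */
    } while (n != 0);                                  /* bnez a2, loop */

    return dst;
}
```

| The same binary loop works for any vector length, because the
| hardware, not the code, decides how many bytes each pass handles.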
|
| Are you sure `rep stos/movs` are actually optimal on x86_64
| systems?
|
| Edit: I just ran tinymembench on my CPU (Ryzen 5 1600X):
|
|   C copy backwards                                  :   7300.7 MB/s (1.2%)
|   C copy backwards (32 byte blocks)                 :   7330.5 MB/s (1.5%)
|   C copy backwards (64 byte blocks)                 :   7313.6 MB/s (0.7%)
|   C copy                                            :   7385.3 MB/s (1.0%)
|   C copy prefetched (32 bytes step)                 :   7737.9 MB/s (1.0%)
|   C copy prefetched (64 bytes step)                 :   7701.1 MB/s (1.6%)
|   C 2-pass copy                                     :   6414.2 MB/s (2.1%)
|   C 2-pass copy prefetched (32 bytes step)          :   6947.9 MB/s (1.4%)
|   C 2-pass copy prefetched (64 bytes step)          :   6985.8 MB/s (1.5%)
|   C fill                                            :   9197.2 MB/s (1.2%)
|   C fill (shuffle within 16 byte blocks)            :   9193.0 MB/s (1.4%)
|   C fill (shuffle within 32 byte blocks)            :   9175.0 MB/s (2.2%)
|   C fill (shuffle within 64 byte blocks)            :   9229.0 MB/s (1.1%)
|   ---
|   standard memcpy                                   :  11302.6 MB/s (1.2%)
|   standard memset                                   :  11046.1 MB/s (1.4%)
|   ---
|   MOVSB copy                                        :   7668.6 MB/s (1.5%)
|   MOVSD copy                                        :   7607.0 MB/s (0.8%)
|   SSE2 copy                                         :   7987.0 MB/s (5.0%)
|   SSE2 nontemporal copy                             :  11989.2 MB/s (2.7%)
|   SSE2 copy prefetched (32 bytes step)              :   7739.9 MB/s (1.3%)
|   SSE2 copy prefetched (64 bytes step)              :   7807.6 MB/s (2.9%)
|   SSE2 nontemporal copy prefetched (32 bytes step)  :  12503.7 MB/s (1.5%)
|   SSE2 nontemporal copy prefetched (64 bytes step)  :  12605.2 MB/s (2.5%)
|   SSE2 2-pass copy                                  :   6977.1 MB/s (1.7%)
|   SSE2 2-pass copy prefetched (32 bytes step)       :   7311.1 MB/s (1.8%)
|   SSE2 2-pass copy prefetched (64 bytes step)       :   7334.7 MB/s (1.5%)
|   SSE2 2-pass nontemporal copy                      :   3223.3 MB/s
|   SSE2 fill                                         :  10919.1 MB/s (1.8%)
|   SSE2 nontemporal fill                             :  30713.9 MB/s (1.8%)
| snvzz wrote:
| There's no such thing, as RISC-V memory operations are very
| consciously explicit load/store instructions.
|
| RISC-V does not have memory to memory ops.
| robalni wrote:
| > Also, expect x86_64 -> RISC-V port bugs because of the naming:
| > byte->byte, word->halfword, doubleword->word, quadword->doubleword
|
| Yeah, I don't know why everyone doesn't just call it int8,
| int16 and so on. That would be much better. This "word" naming
| is just confusing.
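|
| As an illustration of the bit-width naming suggested here, a small C
| sketch using the <stdint.h> fixed-width types. The table and the
| lookup_width helper are hypothetical. The x86 and RISC-V names are
| the ones listed in the quoted comment.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* One row per operand size: the unambiguous C type next to the
 * architecture-specific names quoted in the thread. */
struct width_name {
    unsigned bits;
    const char *c_type;
    const char *x86_name;
    const char *riscv_name;
};

static const struct width_name WIDTH_NAMES[] = {
    { sizeof(int8_t)  * CHAR_BIT, "int8_t",  "byte",       "byte"       },
    { sizeof(int16_t) * CHAR_BIT, "int16_t", "word",       "halfword"   },
    { sizeof(int32_t) * CHAR_BIT, "int32_t", "doubleword", "word"       },
    { sizeof(int64_t) * CHAR_BIT, "int64_t", "quadword",   "doubleword" },
};

/* The same 32 bits: "doubleword" on x86, "word" on RISC-V. */
static_assert(sizeof(int32_t) * CHAR_BIT == 32, "int32_t is 32 bits");

/* Return the row for a given bit width, or NULL if there is none. */
static const struct width_name *lookup_width(unsigned bits)
{
    for (size_t i = 0; i < sizeof WIDTH_NAMES / sizeof WIDTH_NAMES[0]; i++) {
        if (WIDTH_NAMES[i].bits == bits)
            return &WIDTH_NAMES[i];
    }
    return NULL;
}
```

| Looking up 32, for example, yields "doubleword" for x86 and "word"
| for RISC-V, which is exactly the trap when porting by name.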
___________________________________________________________________
(page generated 2023-07-09 23:02 UTC)