[HN Gopher] RISC-V Instructions
       ___________________________________________________________________
        
       RISC-V Instructions
        
       Author : robalni
       Score  : 8 points
       Date   : 2023-07-09 17:48 UTC (5 hours ago)
        
 (HTM) web link (www.robalni.org)
 (TXT) w3m dump (www.robalni.org)
        
       | sylware wrote:
       | What seems to be missing is hardware-optimized, accelerated
       | memcpy/memset for both short and large sizes.
       | 
       | On x86_64, modern micro-archs have "rep stos[bwdq]" and "rep
       | movs[bwdq]" for this. I bet that, in modern binaries,
       | memcpy/memset call sites are actually placeholders that get
       | patched with such instructions (before the memory segment goes
       | back to read/execute). The registers are rdi, rsi, rdx (rcx,
       | which rep uses as its counter, would be pushed on the stack, or
       | the code would be generated to account for rcx availability at
       | the call site).
       | 
       | Also, expect x86_64 -> RISC-V porting bugs because the size
       | names differ: byte->byte, word->halfword, doubleword->word,
       | quadword->doubleword
        
         | camel-cdr wrote:
         | You'll likely see memcpy implemented using the vector
         | extension, e.g.:
         |
         |     memcpy:
         |         mv a3, a0                       # Copy destination
         |     loop:
         |         vsetvli t0, a2, e8, m8, ta, ma  # Vectors of 8b
         |         vle8.v v0, (a1)                 # Load bytes
         |         add a1, a1, t0                  # Bump pointer
         |         sub a2, a2, t0                  # Decrement count
         |         vse8.v v0, (a3)                 # Store bytes
         |         add a3, a3, t0                  # Bump pointer
         |         bnez a2, loop                   # Any more?
         |         ret                             # Return
         | 
         | Are you sure `rep stos/movs` are actually optimal on x86_64
         | systems?
         | 
         | Edit: I just ran tinymembench on my CPU (Ryzen 5 1600X):
         |
         |  C copy backwards                                 :  7300.7 MB/s (1.2%)
         |  C copy backwards (32 byte blocks)                :  7330.5 MB/s (1.5%)
         |  C copy backwards (64 byte blocks)                :  7313.6 MB/s (0.7%)
         |  C copy                                           :  7385.3 MB/s (1.0%)
         |  C copy prefetched (32 bytes step)                :  7737.9 MB/s (1.0%)
         |  C copy prefetched (64 bytes step)                :  7701.1 MB/s (1.6%)
         |  C 2-pass copy                                    :  6414.2 MB/s (2.1%)
         |  C 2-pass copy prefetched (32 bytes step)         :  6947.9 MB/s (1.4%)
         |  C 2-pass copy prefetched (64 bytes step)         :  6985.8 MB/s (1.5%)
         |  C fill                                           :  9197.2 MB/s (1.2%)
         |  C fill (shuffle within 16 byte blocks)           :  9193.0 MB/s (1.4%)
         |  C fill (shuffle within 32 byte blocks)           :  9175.0 MB/s (2.2%)
         |  C fill (shuffle within 64 byte blocks)           :  9229.0 MB/s (1.1%)
         |  ---
         |  standard memcpy                                  : 11302.6 MB/s (1.2%)
         |  standard memset                                  : 11046.1 MB/s (1.4%)
         |  ---
         |  MOVSB copy                                       :  7668.6 MB/s (1.5%)
         |  MOVSD copy                                       :  7607.0 MB/s (0.8%)
         |  SSE2 copy                                        :  7987.0 MB/s (5.0%)
         |  SSE2 nontemporal copy                            : 11989.2 MB/s (2.7%)
         |  SSE2 copy prefetched (32 bytes step)             :  7739.9 MB/s (1.3%)
         |  SSE2 copy prefetched (64 bytes step)             :  7807.6 MB/s (2.9%)
         |  SSE2 nontemporal copy prefetched (32 bytes step) : 12503.7 MB/s (1.5%)
         |  SSE2 nontemporal copy prefetched (64 bytes step) : 12605.2 MB/s (2.5%)
         |  SSE2 2-pass copy                                 :  6977.1 MB/s (1.7%)
         |  SSE2 2-pass copy prefetched (32 bytes step)      :  7311.1 MB/s (1.8%)
         |  SSE2 2-pass copy prefetched (64 bytes step)      :  7334.7 MB/s (1.5%)
         |  SSE2 2-pass nontemporal copy                     :  3223.3 MB/s
         |  SSE2 fill                                        : 10919.1 MB/s (1.8%)
         |  SSE2 nontemporal fill                            : 30713.9 MB/s (1.8%)
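
For readers who don't read RISC-V assembly, here is a rough C model of the strip-mined loop in camel-cdr's vector memcpy. It is only a sketch: `MODEL_VLMAX`, `model_vsetvli` and `model_vector_memcpy` are made-up names, the 64-byte chunk size is an assumption (on real hardware the e8/m8 maximum depends on the implementation's vector register width), and `model_vsetvli` takes the simplest choice a `vsetvli` could make, min(remaining, max).

```c
#include <stddef.h>

/* Assumed maximum bytes per iteration; implementation-defined on real
   hardware, 64 is chosen here only for illustration. */
#define MODEL_VLMAX 64

/* Hypothetical model of `vsetvli t0, a2, e8, m8, ta, ma`: returns how
   many bytes this iteration handles, never more than what remains. */
static size_t model_vsetvli(size_t remaining)
{
    return remaining < MODEL_VLMAX ? remaining : MODEL_VLMAX;
}

/* Hypothetical C model of the assembly loop; register names from the
   original listing are noted in the comments. */
static void *model_vector_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;              /* a3: working copy of dst  */
    const unsigned char *s = src;        /* a1: source pointer       */
    unsigned char v0[MODEL_VLMAX];       /* stands in for v0..v7     */

    do {
        size_t vl = model_vsetvli(n);    /* t0 = bytes this pass     */
        for (size_t i = 0; i < vl; i++)  /* vle8.v: load bytes       */
            v0[i] = s[i];
        s += vl;                         /* bump source pointer      */
        n -= vl;                         /* decrement count          */
        for (size_t i = 0; i < vl; i++)  /* vse8.v: store bytes      */
            d[i] = v0[i];
        d += vl;                         /* bump destination pointer */
    } while (n != 0);                    /* bnez a2, loop            */

    return dst;                          /* a0 is returned unchanged */
}
```

Note that the loop needs no separate tail handling: the last pass simply gets a shorter length, and a zero-length copy runs one pass that moves nothing.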
        
         | snvzz wrote:
         | There's no such thing, as RISC-V memory operations are very
         | consciously explicit load/store instructions.
         | 
         | RISC-V does not have memory to memory ops.
        
         | robalni wrote:
         | > Also, expect x86_64 -> RISC-V porting bugs because the size
         | names differ: byte->byte, word->halfword, doubleword->word,
         | quadword->doubleword
         | 
         | Yeah, I don't know why everyone doesn't just call it int8,
         | int16 and so on. That would be much better. This "word" naming
         | is just confusing.
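
To make the naming mismatch concrete, here is a small C sketch that ties each width to its x86 name, its RISC-V name, and the fixed-width type from `<stdint.h>`. The table and the `lookup_x86_name` helper are hypothetical, made up for this illustration; only the name pairs come from the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row per operand width: size in bytes plus the name each
   architecture's documentation uses for it. */
struct width_name {
    size_t bytes;
    const char *x86;
    const char *riscv;
};

static const struct width_name width_names[] = {
    { sizeof(uint8_t),  "byte",       "byte"       },
    { sizeof(uint16_t), "word",       "halfword"   },
    { sizeof(uint32_t), "doubleword", "word"       },
    { sizeof(uint64_t), "quadword",   "doubleword" },
};

/* Hypothetical helper: find the row for an x86 size name, or NULL
   if the name is not in the table. */
static const struct width_name *lookup_x86_name(const char *x86_name)
{
    for (size_t i = 0; i < sizeof width_names / sizeof width_names[0]; i++) {
        if (strcmp(width_names[i].x86, x86_name) == 0)
            return &width_names[i];
    }
    return NULL;
}
```

With this table, `lookup_x86_name("word")` gives a 2-byte entry whose RISC-V name is "halfword", while the RISC-V "word" is the 4-byte entry. Writing `uint16_t` or `uint32_t` in ported code sidesteps the ambiguity entirely.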
        
       ___________________________________________________________________
       (page generated 2023-07-09 23:02 UTC)