[HN Gopher] Why we need SIMD
       ___________________________________________________________________
        
       Why we need SIMD
        
       Author : atan2
       Score  : 64 points
       Date   : 2025-10-06 03:10 UTC (2 days ago)
        
 (HTM) web link (parallelprogrammer.substack.com)
 (TXT) w3m dump (parallelprogrammer.substack.com)
        
       | lordnacho wrote:
       | When I optimize stuff, I just think of the SIMD instructions as a
       | long sandwich toaster. You can have a normal toaster that makes
       | one sandwich, or you can have a 4x toaster that makes 4
        | sandwiches at once. If you have a bunch of sandwiches to make,
       | obviously you want to align your work so that you can do 4 at a
       | time.
       | 
       | If you want to make 4 at a time though, you have to keep the
       | thing fed. You need your ingredients in the cache, or you are
       | just going to waste time finding them.
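        | 
        | Concretely, the 4x toaster in SSE terms, as a sketch (plain C
        | intrinsics; assumes float data and n a multiple of 4):
        | 
        |     #include <xmmintrin.h>  // SSE
        | 
        |     // One sandwich at a time.
        |     void add_scalar(float* dst, const float* a,
        |                     const float* b, int n) {
        |       for (int i = 0; i < n; i++)
        |         dst[i] = a[i] + b[i];
        |     }
        | 
        |     // Four sandwiches at a time.
        |     void add_sse(float* dst, const float* a,
        |                  const float* b, int n) {
        |       for (int i = 0; i < n; i += 4) {
        |         __m128 va = _mm_loadu_ps(a + i);
        |         __m128 vb = _mm_loadu_ps(b + i);
        |         _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        |       }
        |     }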
        
       | vardump wrote:
       | Wider SIMD would be useful, especially with AVX-512 style
       | improvements. 1024 or even 2048 bits wide operations.
       | 
        | Of course, memory bandwidth should increase proportionally;
        | otherwise the cores might have no data to process.
        
         | TimorousBestie wrote:
         | I would love to be able to fit small matrices (4x4 or 16x16
         | depending on precision) in SIMD registers together with
         | intrinsics for matrix arithmetic.
        
         | TinkersW wrote:
          | I wouldn't mind, but we might need to increase the cache line
          | size on x86, as AVX-512 registers have already reached the
          | current 64-byte line size.
        
         | owlbite wrote:
          | Much better to burn the area on multiple smaller units. It's a
          | bit more area for frontend handling, but worth it for the
          | flexibility (see Apple's M-series chips vs Intel AVX*).
        
           | Remnant44 wrote:
            | Yes and no. I think NEON is undersized for today at 128-bit
            | registers -- if you're working with doubles, for example,
            | that's only two values per register, which is pretty anemic.
           | Things like shuffles and other tricky bitops benefit from
           | wider widths as well (see my other reply)
        
             | adgjlsfhk1 wrote:
              | Agreed that 128 bits is undersized, but 512 feels pretty
              | good for the time being. We're unlikely to see further size
              | increases, since going to 1024 would require doubling the
              | cache line, register file, and RAM bandwidth, while just
              | adding an extra FMA port is far less hardware.
        
         | api wrote:
         | This would start looking a lot like a GPU.
        
           | Veliladon wrote:
           | GPUs are literal SIMD devices. Usually 32 or 64 ALU lanes.
        
             | tubs wrote:
              | They are SIMT (single instruction, multiple threads), which
              | is slightly different: each thread keeps its own register
              | state, and divergence is handled with execution masks
              | rather than one thread driving wide registers. I made them
              | for a living for quite some time.
        
       | dang wrote:
       | Recent and related:
       | 
       |  _Why do we even need SIMD instructions?_ -
       | https://news.ycombinator.com/item?id=44850991 - Aug 2025 (8
       | comments)
        
       | Remnant44 wrote:
        | I'm just happy that finally, with the popularity of Zen4 and
        | Zen5 chips, AVX512 is around 20% of the running hardware in the
        | Steam hardware survey. It's going to be a long while before it
        | gets to a majority - Intel still isn't shipping its own
        | instruction set in consumer CPUs - but it's going in the right
        | direction.
       | 
       | Compared to the weird, lumpy lego set of avx1/2, avx512 is quite
       | enjoyable to write with, and still has some fun instructions that
       | deliver more than just twice the width.
       | 
        | Personal example: the double-width byte shuffle
        | (_mm512_permutex2var_epi8), which takes 128 bytes of input in
        | two registers. I had a critical inner loop that uses a 256-byte
        | lookup table; running an upper/lower double-shuffle and blending
        | them essentially pops out 64 answers per cycle from the lookup
        | table on Zen5 (which has two shuffle units), which is pretty
        | incredible, and on its own produced a global 4x speedup for the
        | kernel as a whole.
        
         | shihab wrote:
         | Could you please elaborate on your example? Thanks.
        
           | kbolino wrote:
            | Here's a non-parallel and unoptimized implementation of that
            | operation in Go:
            | 
            |     func _mm512_permutex2var_epi8(a, idx, b [64]uint8) [64]uint8 {
            |         var dst [64]uint8
            |         for j := 0; j < 64; j++ {
            |             i := idx[j]
            |             src := a
            |             if i&0b0100_0000 != 0 {
            |                 src = b
            |             }
            |             dst[j] = src[i&0b0011_1111]
            |         }
            |         return dst
            |     }
           | 
           | Basically, for a lookup table of 8-bit values, you need only
           | 1 instruction to perform up to 64 lookups simultaneously, for
           | each 128 bytes of table.
        
           | Remnant44 wrote:
            | Sure. In detail and abstracted slightly, the byte-table
            | problem:
           | 
           | Maybe you're remapping RGB values [0..255] with a tone curve
           | in graphics, or doing a mapping lookup of IDs to indexes in a
           | set, or a permutation table, or .. well, there's a lot of use
           | cases, right? This is essentially an arbitrary function
           | lookup where the domain and range is on bytes.
           | 
            | It looks like this in scalar code:
            | 
            |     void transform_lut(byte* dest, const byte* src,
            |                        int size, const byte* lut) {
            |       for (int i = 0; i < size; i++) {
            |         dest[i] = lut[src[i]];
            |       }
            |     }
           | 
           | The function above is basically load/store limited - it's
           | doing negligible arithmetic, just loading a byte from the
           | source, using that to index a load into the table, and then
           | storing the result to the destination. So two loads and a
           | store per element. Zen5 has 4 load pipes and 2 store pipes,
           | so our CPU can do two elements per cycle in scalar code.
           | (Zen4 has only 1 store pipe, so 1 per cycle there)
           | 
            | Here's a snippet of the AVX512 version.
            | 
            | You load the lookup table into 4 registers outside the loop:
            | 
            |     __m512i p0, p1, p2, p3;
            |     p0 = _mm512_load_epi8(lut);
            |     p1 = _mm512_load_epi8(lut + 64);
            |     p2 = _mm512_load_epi8(lut + 128);
            |     p3 = _mm512_load_epi8(lut + 192);
           | 
            | Then, for each SIMD vector of 64 elements, use each lane's
            | value as an index into the lookup table, just like the scalar
            | version. Since one shuffle can only reach 128 bytes of table,
            | we DO have to do it twice, once for the lower half and again
            | for the upper half, and use a mask to choose between them
            | appropriately on a per-element basis.
            | 
            |     auto tLow  = _mm512_permutex2var_epi8(p0, x, p1);
            |     auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);
           | 
            | You can use _mm512_movepi8_mask to load the mask register.
            | That instruction marks a lane as active if the high bit of
            | its byte is set, which perfectly matches our table layout.
            | You could use the mask register directly on the second
            | shuffle instruction or in a later blend instruction; it
            | doesn't really matter.
           | 
            | For every 64 bytes, the AVX512 version does one load, one
            | store, and two permutes, which Zen5 can execute at 2 per
            | cycle. So 64 elements per cycle.
           | 
           | So our theoretical speedup here is ~32x over the scalar code!
           | You could pull tricks like this with SSE and pshufb, but the
           | size of the lookup table is too small to really be useful.
           | Being able to do an arbitrary super-fast byte-byte transform
           | is incredibly useful.
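            | 
            | Putting the pieces together, the whole loop might look
            | roughly like this (a sketch, not the actual kernel; assumes
            | AVX512VBMI/BW and that size is a multiple of 64):
            | 
            |     #include <immintrin.h>
            |     #include <stdint.h>
            | 
            |     void transform_lut_avx512(uint8_t* dest,
            |         const uint8_t* src, int size, const uint8_t* lut) {
            |       // Load the 256-byte table into four registers, once.
            |       __m512i p0 = _mm512_loadu_si512(lut);
            |       __m512i p1 = _mm512_loadu_si512(lut + 64);
            |       __m512i p2 = _mm512_loadu_si512(lut + 128);
            |       __m512i p3 = _mm512_loadu_si512(lut + 192);
            |       for (int i = 0; i < size; i += 64) {
            |         __m512i x = _mm512_loadu_si512(src + i);
            |         // Each permute covers 128 table bytes, indexed by
            |         // the low 7 bits of each lane.
            |         __m512i lo = _mm512_permutex2var_epi8(p0, x, p1);
            |         __m512i hi = _mm512_permutex2var_epi8(p2, x, p3);
            |         // The high bit of each source byte selects the
            |         // upper half of the table.
            |         __mmask64 m = _mm512_movepi8_mask(x);
            |         _mm512_storeu_si512(dest + i,
            |             _mm512_mask_blend_epi8(m, lo, hi));
            |       }
            |     }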
        
       | jasonthorsness wrote:
        | Compared to GPU programming, the gains from SIMD are limited,
        | but it's a small-multiple boost that's available pretty much
        | everywhere. C# makes it easy to use through its Vector classes.
        | WASM SIMD still has a way to go, but even with the current
        | 128-bit width you can see dramatic improvements in some
        | buffer-processing cases (I did a little comparison demo here
        | showing a 20x improvement in bitwise complement of a large
        | buffer: https://www.jasonthorsness.com/2)
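        | 
        | For reference, the core of that kind of loop with WASM SIMD in
        | C, via wasm_simd128.h (a sketch; assumes len is a multiple of
        | 16):
        | 
        |     #include <wasm_simd128.h>
        |     #include <stddef.h>
        |     #include <stdint.h>
        | 
        |     // Bitwise complement, 16 bytes per iteration.
        |     void complement(uint8_t* buf, size_t len) {
        |       for (size_t i = 0; i < len; i += 16) {
        |         v128_t v = wasm_v128_load(buf + i);
        |         wasm_v128_store(buf + i, wasm_v128_not(v));
        |       }
        |     }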
        
         | zozbot234 wrote:
          | The WASM folks should just include an arbitrary-length vector
          | compute extension. We should also explore automatically
          | compiling WASM to GPU compute as appropriate; its hardware
          | independence makes it a rather natural fit for that.
        
         | ncruces wrote:
          | I merged a few PRs to SIMD-optimize the Wasm WASI libc, but it
          | all got stalled in str(c)spn (which is slightly more
          | sophisticated than the rest).
         | 
         | There wasn't much appetite for any of it on Emscripten.
         | 
         | https://github.com/WebAssembly/wasi-libc/pulls?q=is%3Apr+opt...
        
           | jasonthorsness wrote:
            | Subscribed to the str(c)spn thread for the eventual
            | explanation of why the non-SIMD version seemed to give the
            | wrong answer.
        
       | chasil wrote:
       | The author has neglected the 3DNow! SIMD instructions from AMD.
       | 
       | They were notable for several reasons, although they are no
       | longer included in modern silicon.
       | 
       | https://en.wikipedia.org/wiki/3DNow!
        
       | p0nce wrote:
        | 4 lanes of SIMD (as in, say, SSE) is not necessarily 4x faster
        | because of memory access; sometimes it's better than that (and
        | often it's less).
        | 
        | PSHUFB wins in case of unpredictable access patterns, though I
        | don't remember how much it typically wins by.
       | 
       | PMOVMSKB can replace several conditionals (up to 16 in SSE2 for
       | byte operands) with only one, winning in terms of branch
       | prediction.
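        | 
        | For instance, a minimal PMOVMSKB sketch: finding the first zero
        | byte in a 16-byte chunk with one branch instead of sixteen
        | (__builtin_ctz assumes GCC/Clang):
        | 
        |     #include <emmintrin.h>  // SSE2
        | 
        |     int first_zero_byte(const unsigned char* p) {
        |       __m128i chunk = _mm_loadu_si128((const __m128i*)p);
        |       // Lanes equal to zero become 0xFF, others 0x00.
        |       __m128i eq = _mm_cmpeq_epi8(chunk, _mm_setzero_si128());
        |       int mask = _mm_movemask_epi8(eq);  // one bit per lane
        |       return mask ? __builtin_ctz(mask) : -1;
        |     }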
       | 
        | PMADDWD is in SSE2, and does 8 word (16-bit) multiplies with
        | pairwise adds, not 4. SSE4.1 has FP rounding that doesn't
        | require changing the rounding mode, etc. The weird string
        | functions are in SSE4.2. Non-temporal moves and prefetching help
        | in some cases.
       | 
        | The cool thing with SIMD is that it puts a lot less stress on
        | the CPU's access prediction and branch prediction, not only the
        | ALUs. So when you optimize, it will help unrelated parts of your
        | code go faster.
        
       | kristianp wrote:
        | No mention of branches, which is a complementary concept. If you
        | unroll your loop, you can get part of the way to SIMD
        | performance by keeping the CPU pipeline filled, as in the sketch
        | below.
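        | 
        | A rough C sketch of that idea: four independent accumulators, so
        | the adds can overlap in the pipeline instead of forming one
        | serial dependency chain (n assumed a multiple of 4):
        | 
        |     float sum_unrolled(const float* a, int n) {
        |       float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        |       for (int i = 0; i < n; i += 4) {
        |         s0 += a[i];
        |         s1 += a[i + 1];
        |         s2 += a[i + 2];
        |         s3 += a[i + 3];
        |       }
        |       return (s0 + s1) + (s2 + s3);
        |     }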
        
       | dpifke wrote:
       | Related: Go is looking to add SIMD intrinsics, which should
       | provide a more elegant way to use SIMD instructions from Go code:
       | https://go.dev/issue/73787
        
       ___________________________________________________________________
       (page generated 2025-10-08 23:00 UTC)