[HN Gopher] Why we need SIMD
___________________________________________________________________
Why we need SIMD
Author : atan2
Score : 64 points
Date : 2025-10-06 03:10 UTC (2 days ago)
(HTM) web link (parallelprogrammer.substack.com)
(TXT) w3m dump (parallelprogrammer.substack.com)
| lordnacho wrote:
| When I optimize stuff, I just think of the SIMD instructions as a
| long sandwich toaster. You can have a normal toaster that makes
| one sandwich, or you can have a 4x toaster that makes 4
| sandwiches at once. If you have a bunch of sandwiches to make,
| obviously you want to align your work so that you can do 4 at a
| time.
|
| If you want to make 4 at a time though, you have to keep the
| thing fed. You need your ingredients in the cache, or you are
| just going to waste time finding them.
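|
| In code, the 4x toaster looks something like this (a minimal
| SSE sketch; assumes float data and a count divisible by 4):
|
|     #include <xmmintrin.h>  // SSE
|
|     // One sandwich at a time.
|     void add_scalar(float* dst, const float* a,
|                     const float* b, int n) {
|         for (int i = 0; i < n; i++)
|             dst[i] = a[i] + b[i];
|     }
|
|     // Four at a time, assuming n is a multiple of 4.
|     void add_simd(float* dst, const float* a,
|                   const float* b, int n) {
|         for (int i = 0; i < n; i += 4) {
|             __m128 va = _mm_loadu_ps(a + i);  // load 4 floats
|             __m128 vb = _mm_loadu_ps(b + i);
|             _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
|         }
|     }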
| vardump wrote:
| Wider SIMD would be useful, especially with AVX-512-style
| improvements: 1024-bit or even 2048-bit operations.
|
| Of course, memory bandwidth would have to increase
| proportionally; otherwise the cores might have no data to
| process.
| TimorousBestie wrote:
| I would love to be able to fit small matrices (4x4 or 16x16
| depending on precision) in SIMD registers together with
| intrinsics for matrix arithmetic.
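|
| Until then, some of it can be hand-rolled. A minimal sketch
| of a 4x4 float matrix-vector multiply, with the matrix held
| column-major in four SSE registers (mat4_mul_vec4 is an
| illustrative name, not a real intrinsic):
|
|     #include <xmmintrin.h>  // SSE
|
|     // r = M * v, with M column-major in c0..c3.
|     __m128 mat4_mul_vec4(__m128 c0, __m128 c1, __m128 c2,
|                          __m128 c3, __m128 v) {
|         // Broadcast each component of v across a register.
|         __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
|         __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
|         __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
|         __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
|         // Scale each column and accumulate.
|         __m128 r = _mm_mul_ps(c0, x);
|         r = _mm_add_ps(r, _mm_mul_ps(c1, y));
|         r = _mm_add_ps(r, _mm_mul_ps(c2, z));
|         r = _mm_add_ps(r, _mm_mul_ps(c3, w));
|         return r;
|     }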
| TinkersW wrote:
| I wouldn't mind, but we might need to increase the cache line
| size on x86, as avx512 registers have already reached the
| current 64-byte line.
| owlbite wrote:
| Much better to burn the area on multiple smaller units; it's
| a bit more area for frontend handling, but worth it for the
| flexibility (see Apple's M-series chips vs Intel AVX*).
| Remnant44 wrote:
| Yes and no. I think NEON, at 128-bit registers, is undersized
| for today -- if you're working with doubles, for example,
| that's only two values per register, which is pretty anemic.
| Things like shuffles and other tricky bit ops benefit from
| wider widths as well (see my other reply)
| adgjlsfhk1 wrote:
| Agreed that 128 bit is undersized, but 512 feels pretty
| good for the time being. We're unlikely to see further size
| increases, since going to 1024 would require doubling the
| cache line, register file, and RAM bandwidth, while just
| adding an extra FMA port is far less hardware.
| api wrote:
| This would start looking a lot like a GPU.
| Veliladon wrote:
| GPUs are literal SIMD devices. Usually 32 or 64 ALU lanes.
| tubs wrote:
| They're SIMT, which is slightly different. I made them for a
| living for quite some time.
| dang wrote:
| Recent and related:
|
| _Why do we even need SIMD instructions?_ -
| https://news.ycombinator.com/item?id=44850991 - Aug 2025 (8
| comments)
| Remnant44 wrote:
| I'm just happy that finally, with the popularity of Zen 4 and
| Zen 5 chips, AVX512 is at around 20% of the running hardware
| in the Steam hardware survey. It's going to be a long while
| before it gets to a majority - Intel still isn't shipping its
| own instruction set in consumer CPUs - but it's going in the
| right direction.
|
| Compared to the weird, lumpy lego set of avx1/2, avx512 is quite
| enjoyable to write with, and still has some fun instructions that
| deliver more than just twice the width.
|
| Personal example: the double-width byte shuffle
| (_mm512_permutex2var_epi8) that takes 128 bytes as input in two
| registers. I had a critical inner loop that uses a 256 byte
| lookup table; running an upper/lower double-shuffle and blending
| them essentially pops out 64 answers a cycle from the lookup
| table on zen5 (which has two shuffle units), which is pretty
| incredible, and on its own produced a global 4x speedup for the
| kernel as a whole.
| shihab wrote:
| Could you please elaborate on your example? Thanks.
| kbolino wrote:
| Here's a non-parallel and unoptimized implementation of that
| operation in Go: func
| _mm512_permutex2var_epi8(a, idx, b [64]uint8) [64]uint8 {
| var dst [64]uint8 for j := 0; j < 64; j++ {
| i := idx[j] src := a if i&0b0100_0000 !=
| 0 { src = b } dst[j] =
| src[i&0b0011_1111] } return dst }
|
| Basically, for a lookup table of 8-bit values, you need only
| 1 instruction to perform up to 64 lookups simultaneously, for
| each 128 bytes of table.
| Remnant44 wrote:
| Sure... In detail, and abstracted slightly, the byte-table
| problem:
|
| Maybe you're remapping RGB values [0..255] with a tone curve
| in graphics, or doing a mapping lookup of IDs to indexes in a
| set, or a permutation table, or... well, there's a lot of use
| cases, right? This is essentially an arbitrary function
| lookup where the domain and range are bytes.
|
| It looks like this in scalar code:
|
|     void transform_lut(byte* dest, const byte* src, int size,
|                        const byte* lut) {
|         for (int i = 0; i < size; i++) {
|             dest[i] = lut[src[i]];
|         }
|     }
|
| The function above is basically load/store limited - it's
| doing negligible arithmetic, just loading a byte from the
| source, using that to index a load into the table, and then
| storing the result to the destination. So two loads and a
| store per element. Zen5 has 4 load pipes and 2 store pipes,
| so our CPU can do two elements per cycle in scalar code.
| (Zen4 has only 1 store pipe, so 1 per cycle there)
|
| Here's a snippet of the AVX512 version.
|
| You load the lookup table into 4 registers outside the loop:
|
|     __m512i p0, p1, p2, p3;
|     p0 = _mm512_loadu_si512(lut);
|     p1 = _mm512_loadu_si512(lut + 64);
|     p2 = _mm512_loadu_si512(lut + 128);
|     p3 = _mm512_loadu_si512(lut + 192);
|
| Then, for each SIMD vector of 64 elements, use each lane's
| value as an index into the lookup table, just like the scalar
| version. Since each shuffle can only cover 128 bytes, we DO
| have to do it twice, once for the lower half and again for
| the upper half of the table, and use a mask to choose between
| them appropriately on a per-element basis.
|
|     auto tLow = _mm512_permutex2var_epi8(p0, x, p1);
|     auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);
|
| You can use _mm512_movepi8_mask to load the mask register.
| That instruction sets a mask lane active if the high bit of
| the corresponding byte is set, which perfectly matches our
| table split. You could use the mask register directly on the
| second shuffle instruction or in a later blend instruction;
| it doesn't really matter.
|
| For every 64 bytes, the avx512 version does one load and one
| store, plus two permutes, which Zen5 can do at 2 per cycle.
| So 64 elements per cycle.
|
| So our theoretical speedup here is ~32x over the scalar code!
| You could pull tricks like this with SSE and pshufb, but the
| lookup table it supports is too small to really be useful.
| Being able to do an arbitrary, super-fast byte-to-byte
| transform is incredibly useful.
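|
| Putting the pieces together, the whole loop looks roughly
| like this (a sketch rather than the exact kernel; assumes
| AVX512 VBMI+BW and a size that's a multiple of 64):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     void transform_lut_avx512(uint8_t* dest,
|                               const uint8_t* src,
|                               size_t size,
|                               const uint8_t* lut) {
|         // 256-byte table in four registers, loaded once.
|         __m512i p0 = _mm512_loadu_si512(lut);
|         __m512i p1 = _mm512_loadu_si512(lut + 64);
|         __m512i p2 = _mm512_loadu_si512(lut + 128);
|         __m512i p3 = _mm512_loadu_si512(lut + 192);
|         for (size_t i = 0; i < size; i += 64) {
|             __m512i x = _mm512_loadu_si512(src + i);
|             // Shuffle against each half of the table.
|             __m512i tLow  = _mm512_permutex2var_epi8(p0, x, p1);
|             __m512i tHigh = _mm512_permutex2var_epi8(p2, x, p3);
|             // High bit of each byte picks the half to keep.
|             __mmask64 hi = _mm512_movepi8_mask(x);
|             _mm512_storeu_si512(dest + i,
|                 _mm512_mask_blend_epi8(hi, tLow, tHigh));
|         }
|     }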
| jasonthorsness wrote:
| Compared to GPU programming, the gains from SIMD are limited,
| but it's a small-multiple boost that's available pretty much
| everywhere. C# makes it easy to use through its Vector
| classes. WASM SIMD still
| has a way to go but even with the current 128-bit you can see
| dramatic improvements in some buffer-processing cases (I did a
| little comparison demo here showing a 20x improvement in bitwise
| complement of a large buffer: https://www.jasonthorsness.com/2)
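|
| The complement kernel itself is about as simple as SIMD gets;
| a minimal sketch using clang's wasm_simd128.h (not the exact
| demo code, and assuming the length is a multiple of 16):
|
|     #include <wasm_simd128.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     void complement(uint8_t* buf, size_t n) {
|         for (size_t i = 0; i < n; i += 16) {
|             // Load 16 bytes, flip every bit, store them back.
|             v128_t v = wasm_v128_load(buf + i);
|             wasm_v128_store(buf + i, wasm_v128_not(v));
|         }
|     }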
| zozbot234 wrote:
| The WASM folks should just include an arbitrary-length vector
| compute extension. We should also explore automatically
| compiling WASM to GPU compute where appropriate; the hardware
| independence makes it a rather natural fit for that.
| ncruces wrote:
| I merged a few PRs to SIMD-optimize the Wasm WASI libc, but
| it all stalled on str(c)spn (which is slightly more
| sophisticated than the rest).
|
| There wasn't much appetite for any of it on Emscripten.
|
| https://github.com/WebAssembly/wasi-libc/pulls?q=is%3Apr+opt...
| jasonthorsness wrote:
| Subscribed to the str(c)spn thread for the eventual
| explanation of why the non-SIMD version seemed to give the
| wrong answer.
| chasil wrote:
| The author has neglected the 3DNow! SIMD instructions from AMD.
|
| They were notable for several reasons, although they are no
| longer included in modern silicon.
|
| https://en.wikipedia.org/wiki/3DNow!
| p0nce wrote:
| Four lanes of SIMD (as in, say, SSE) is not necessarily 4x
| faster because of memory access; sometimes it's better than
| that (and often it's less).
|
| PSHUFB wins in the case of unpredictable access patterns,
| though I don't remember by how much it typically wins.
|
| PMOVMSKB can replace several conditionals (up to 16 in SSE2 for
| byte operands) with only one, winning in terms of branch
| prediction.
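|
| For example, a memchr-style search that turns 16 byte
| compares into a single test (a minimal SSE2 sketch;
| find_byte16 is a made-up helper name):
|
|     #include <emmintrin.h>  // SSE2
|     #include <stdint.h>
|
|     // Index of first occurrence of c in the 16 bytes at p,
|     // or -1 if absent.
|     int find_byte16(const uint8_t* p, uint8_t c) {
|         __m128i v  = _mm_loadu_si128((const __m128i*)p);
|         // 16 byte compares at once; matching lanes -> 0xFF.
|         __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)c));
|         // PMOVMSKB: collapse 16 lanes into one 16-bit mask.
|         int mask = _mm_movemask_epi8(eq);
|         return mask ? __builtin_ctz(mask) : -1;  // one branch
|     }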
|
| PMADDWD is in SSE2, and does 8 word multiplies, not 4. SSE4.1
| has FP rounding that doesn't require changing the rounding
| mode, etc. The weird string functions in SSE4.2. Non-temporal
| moves and prefetching in some cases.
|
| The cool thing with SIMD is that it puts a lot less stress on
| the CPU's access prediction and branch prediction, not only
| the ALU. So when you optimize, it will help unrelated parts
| of your code go faster.
| kristianp wrote:
| No mention of branches, which is a complementary concept. If
| you unroll your loop, you can get part of the way to SIMD
| performance by keeping the CPU pipeline filled, as sketched
| below.
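|
| For instance, a sum with four independent accumulators lets
| the adds overlap in flight (a minimal C sketch; assumes n is
| a multiple of 4):
|
|     float sum_unrolled(const float* a, int n) {
|         // Four independent dependency chains keep the FPU
|         // busy instead of serializing on one accumulator.
|         float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
|         for (int i = 0; i < n; i += 4) {
|             s0 += a[i];
|             s1 += a[i + 1];
|             s2 += a[i + 2];
|             s3 += a[i + 3];
|         }
|         return (s0 + s1) + (s2 + s3);
|     }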
| dpifke wrote:
| Related: Go is looking to add SIMD intrinsics, which should
| provide a more elegant way to use SIMD instructions from Go code:
| https://go.dev/issue/73787
___________________________________________________________________
(page generated 2025-10-08 23:00 UTC)