[HN Gopher] 86 GB/s bitpacking with ARM SIMD (single thread)
___________________________________________________________________
86 GB/s bitpacking with ARM SIMD (single thread)
Author : ashtonsix
Score : 93 points
Date : 2025-10-05 12:27 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Retr0id wrote:
| I tried to run the benchmark on my M1 Pro macbook, but the
| "baseline" is written with x86 intrinsics and won't compile.
|
| Are the benchmark results in the README real? (The README itself
| feels very AI-generated)
|
| Looking at the makefile, it tries to link the x86 SSE "baseline"
| implementation and the NEON version into the same binary. A real
| headscratcher!
|
| Edit: The SSE impl gets shimmed via simd-everywhere, and the
| benchmark results do seem legit (aside from being slightly
| apples-to-oranges, but that's unavoidable)
| Asmod4n wrote:
| Maybe this could help you:
| https://github.com/simd-everywhere/simde/issues/1099
| Retr0id wrote:
| But this project isn't using simd-everywhere. I'd like to
| reproduce the results as documented in the README
| guipsp wrote:
| Look at the parent dir. I agree it is a bit confusing
| Retr0id wrote:
| Ah! Yup, that works, I can compile the binary. I get an
| "Illegal instruction" error when I run it but that's
| probably just because M1 doesn't support some of the NEON
| instructions. I retract my implicit AI-slop accusations.
| Retr0id wrote:
| Results from M1 Pro (after setting CPU=native in the
| makefile):
| https://gist.github.com/DavidBuchanan314/e3cde76e4dab2758ec4...
| ashtonsix wrote:
| Thank you so much for attempting a reproduction! (I posted this
| on Reddit and most commenters didn't even click the link)
|
| For the baseline you need SIMDe headers:
| https://github.com/simd-everywhere/simde/tree/master/simde.
| These alias x86 intrinsics to ARM intrinsics. The baseline is
| based on the previous state of the art
| (https://arxiv.org/abs/1209.2137), which happens to be
| x86-based; using SIMDe to compile it was the highest-integrity
| way I could think of to compare against the previous SOTA.
|
| Note: M1 chips specifically have notoriously bad small-shift
| performance, so the benchmark results will be very bad on your
| machine. M3 partially fixed this, and M4 fixed it completely.
| My primary target is server-class rather than consumer-class
| hardware, so I'm not too worried about this.
|
| The benchmark results were copy-pasted from the terminal. The
| README prose was AI-generated from my rough notes (I'm
| confident when communicating with other experts/researchers,
| but less so when writing for a general audience).
| ozgrakkurt wrote:
| Super cool!
|
| Pretty sure anyone reading this kind of post about SIMD would
| prefer your writing to an LLM's.
| deadmutex wrote:
| Here is a repro using GCE's C4A Axion instances
| (c4a-highcpu-72). Seems to beat Graviton? Maybe the title of
| the thread can be updated to a larger number :) ? I used the
| largest instance to avoid noisy-neighbor issues.
| $ ./out/bytepack_eval
| Bytepack Bench -- 16 KiB, reps=20000 (pinned if available)
| Throughput GB/s
| K   NEON pack   NEON unpack   Baseline pack   Baseline unpack
| 1       94.77         84.05           45.01             63.12
| 2      123.63         94.74           52.70             66.63
| 3       94.62         83.89           45.32             68.43
| 4      112.68         77.91           58.10             78.20
| 5       86.96         80.02           44.32             60.77
| 6       93.50         92.08           51.22             67.20
| 7       87.10         79.53           43.94             57.95
| 8       90.49         92.36           68.99             83.88
| danlark1 wrote:
| Great work!
|
| There's a popular narrative that NEON does not have a movemask
| alternative. Some time ago I published an article on handling
| popular bit-packing use cases with NEON in 1-2 instructions.
| It does not cover unpacking, but it can be great for real-
| world applications like compare+find, compare+iterate, and
| compare+test.
|
| https://community.arm.com/arm-community-blogs/b/servers-and-...
| ashtonsix wrote:
| Nice article! I personally find the ARM ISA far more cohesive
| than x86's: far fewer historical quirks. I also really
| appreciate the ubiquity of support for 8-bit elements in ARM,
| and the absence of SMT (it makes performance much more
| predictable).
| Sesse__ wrote:
| I never understood why they couldn't just include a movmskb
| instruction to begin with. It's massively useful for integer
| tasks, not expensive to implement as far as I know, and the
| vshrn+mov trick often requires an extra instruction either in
| front or after (in addition to just being, well, pretty
| obscure).
|
| NEON in general is a bit sad, really; it's built around the
| idea of being implementable with a 64-bit ALU, and it shows.
| And SVE support is pretty much non-existent on the client.
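The vshrn+mov trick mentioned above can be modeled in portable scalar C. This sketch only models the data movement of NEON's `vshrn_n_u16(v, 4)` (narrow each 16-bit lane by a 4-bit right shift), which turns 16 comparison-result bytes into a 64-bit value with one nibble per lane, a 4-bit-per-lane "movemask"; the function name is illustrative:

```c
#include <stdint.h>

/* Scalar model of the NEON vshrn movemask emulation: each 16-bit lane
 * holds two comparison-result bytes (0x00 or 0xFF); shifting the lane
 * right by 4 and keeping the low byte packs both lanes' mask bits into
 * one byte as two nibbles. */
uint64_t nibble_mask(const uint8_t cmp[16]) {
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t lane = (uint16_t)(cmp[2 * i] | (cmp[2 * i + 1] << 8));
        out |= (uint64_t)((lane >> 4) & 0xFF) << (8 * i);
    }
    return out; /* byte i lane i matched iff nibble i is 0xF */
}
```

The extra instruction Sesse__ mentions is then whatever you need to turn a nibble-per-lane mask back into a bit-per-lane one (or to adjust indices derived from it).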
| ashtonsix wrote:
| Not having (1 << k) - 1 as a single instruction sucks when it
| HAS to be in a hot loop, but you can usually hoist it to the
| loop prologue: my stuff uses dummy inline-assembly hints to
| force compilers to do this: `asm volatile("" : "+w"(m));`.
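A minimal scalar sketch of that hoisting trick (the original uses the NEON register constraint "+w"; "+r" is the scalar stand-in so this compiles on any GCC/Clang target, and `mask_low_bits` is a hypothetical name):

```c
#include <stdint.h>
#include <stddef.h>

/* Materialise the (1 << k) - 1 mask once in the loop prologue, then use
 * an empty inline-asm "barrier" so the compiler treats the value as
 * opaque and won't rematerialise it inside the hot loop. */
void mask_low_bits(uint8_t *dst, const uint8_t *src, size_t n,
                   unsigned k) {
    uint8_t mask = (uint8_t)((1u << k) - 1); /* hoisted: two ops, once */
    __asm__ volatile("" : "+r"(mask));       /* pin it in a register */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] & mask;              /* hot loop: one AND each */
}
```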
|
| I personally think calibrating ARM's ISA around a smaller VL
| was a good choice: you get much better IPC. x86 ISAs also have
| an almost-complete absence of support for 8-bit elements, so
| elements-per-instruction is roughly tied. And NEON kind-of-ish
| makes up for its small VL with multi-register TBL/TBX and
| LDP/STP.
|
| Also: AVX512 is just as non-existent on clients as SVE2,
| though that's not really relevant for the server-side targets
| I'm optimising for (mostly OLAP).
| TinkersW wrote:
| 8-bit is not absent in x86 SIMD; it is slightly less covered
| than 32- and 16-bit, but you can fully implement all the
| common 8-bit ops, and most are a single instruction (with
| AVX2). There are even various horizontal ops on 8-bit values
| (avg, dot, etc).
|
| Also, AVX512 is way more common than SVE2: all Zen4 & Zen5
| chips support it.
| dzaima wrote:
| More specifically, basically the only absent 8-bit ops
| that have 32-bit equivalents in AVX2 are shifts and
| multiplies. Shifts are quite annoying (though, with a
| uniform shift they can be emulated on AVX-512 via GFNI
| abuse in 1 instr), multiplies are rather rare (though
| note that there is vpmaddubsw for an 8-bit-16-bit
| multiply-add). There's even a case of the opposite -
| saturating add/sub exist for 8-bit and 16-bit ints, but
| not wider.
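The uniform 8-bit shift emulation alluded to above is usually done on AVX2 with the 16-bit shift (`vpsllw`) followed by an AND that clears the bits leaked in from the neighbouring byte. A scalar model on one 16-bit lane (the function name is illustrative):

```c
#include <stdint.h>

/* Emulate a per-byte left shift using only a 16-bit shift plus a mask,
 * the way AVX2 code does in the absence of a vpsllb instruction. */
uint16_t shl8x2(uint16_t lane, unsigned k) {
    uint16_t shifted = (uint16_t)(lane << k);       /* vpsllw */
    uint16_t keep = (uint16_t)((0xFF << k) & 0xFF); /* bits a byte keeps */
    return shifted & (uint16_t)(keep | (keep << 8)); /* vpand */
}
```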
| namibj wrote:
| GFNI is distinct from AVX-512; it was merely introduced
| in cores that also had AVX-512.
| Sesse__ wrote:
| Does _any_ SIMD instruction set have (1 << k) - 1 as a
| single instruction?
| camel-cdr wrote:
| Not sure in which context this is used, but you can do -1
| << k in most ISAs but that still requires a bit-not. But
| if you want to use the value in a bitwise instruction,
| then there are often variants that can invert the input
| operand.
|
| E.g. in RVV, instead of
| vand.vv(a, vadd.vi(vsll.vv(1,k), -1)) you could do
| vandn.vv(a, vsll.vv(-1,k)).
|
| AVX-512 can do this with any binary or ternary bitwise
| logic function via vpternlog
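The identity behind that rewrite, in scalar form: `a & ((1 << k) - 1)` equals `a & ~(-1 << k)`, so an ISA with an and-not instruction (RVV's vandn, or any function via AVX-512's vpternlog) needs only the single `-1 << k` shift to build the mask. A small demonstration with illustrative function names:

```c
#include <stdint.h>

/* The and-not formulation: one shift builds the complement mask. */
uint32_t keep_low_via_andn(uint32_t a, unsigned k) {
    uint32_t hi = (uint32_t)-1 << k; /* one shift: bits >= k set */
    return a & ~hi;                  /* the "andn" half */
}

/* The conventional two-op mask, for comparison. */
uint32_t keep_low_via_mask(uint32_t a, unsigned k) {
    return a & ((1u << k) - 1);
}
```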
| Sesse__ wrote:
| I don't know either; I was talking about the lack of
| PMOVMSKB in NEON and then the comment I replied to
| started talking about (1 << k) - 1 not being a single
| instruction sucks. Which I don't think it is in any non-
| NEON set either.
| robert3005 wrote:
| Highly recommend
| https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf for a
| comparable algorithm. It generalizes to arbitrary input and
| output bit widths.
| ashtonsix wrote:
| Good work, wish I'd seen it earlier as it overlaps with a lot
| of my recent work. I'm actually planning to release new SOTAs
| on zigzag/delta/delta-of-delta/xor-with-previous coding next
| week. Some areas the work doesn't give enough attention to
| (IMO): register pressure, kernel fusion, and cache locality
| (wrt multi-pass). They also fudge a lot of the numbers and
| comparisons, e.g., pitching themselves against Parquet
| (storage-optimised) when Arrow (compute-optimised) is the
| most-comparable tech and the obvious target to beat. They
| definitely improve on the current best work, but only by a
| modest margin.
|
| I'm also skeptical of the "unified" paradigm: performance
| improvements are often realised by stepping away from
| generalisation and exploiting the specifics of a given problem;
| under a unified paradigm there's definite room for improvement
| vs Arrow, but that's very unlikely to bring you all the way to
| theoretically optimal performance.
| fzeroff wrote:
| Very interesting! Hope you don't mind my basic question, but
| what would you use this for?
| ashtonsix wrote:
| If you have an array of numbers with a known upper bound, such
| as enums with 8 possible values (representable with 3 bits),
| and a memory-bound operation on those numbers, e.g., for (int
| i = 0; i < n; i++) if (user_category[i] == 0)
| filtered.push_back(i), which is common in data warehouses, my
| code can more than 2x performance by allowing more efficient
| usage of the DRAM<->CPU bus.
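A scalar sketch of the use case described above (the real library does this with NEON; function names here are illustrative): 8 possible enum values fit in 3 bits, so 8 codes pack into 3 bytes instead of 8, cutting DRAM traffic by roughly 2.6x for a memory-bound scan.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack 3-bit codes: every group of 8 codes becomes 24 bits (3 bytes).
 * n must be a multiple of 8. */
void pack3(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t g = 0; g < n / 8; g++) {
        uint32_t bits = 0;
        for (int i = 0; i < 8; i++)
            bits |= (uint32_t)(src[8 * g + i] & 7) << (3 * i);
        dst[3 * g + 0] = (uint8_t)(bits);
        dst[3 * g + 1] = (uint8_t)(bits >> 8);
        dst[3 * g + 2] = (uint8_t)(bits >> 16);
    }
}

/* Read back the 3-bit code at index i. */
uint8_t unpack3(const uint8_t *src, size_t i) {
    size_t g = i / 8, j = i % 8;
    uint32_t bits = src[3 * g] | ((uint32_t)src[3 * g + 1] << 8) |
                    ((uint32_t)src[3 * g + 2] << 16);
    return (uint8_t)((bits >> (3 * j)) & 7);
}
```

A filter like the user_category scan then reads 3/8 as many bytes per element, which is where the throughput win comes from when the loop is bandwidth-bound.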
___________________________________________________________________
(page generated 2025-10-05 23:00 UTC)