[HN Gopher] 86 GB/s bitpacking with ARM SIMD (single thread)
       ___________________________________________________________________
        
       86 GB/s bitpacking with ARM SIMD (single thread)
        
       Author : ashtonsix
       Score  : 93 points
       Date   : 2025-10-05 12:27 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Retr0id wrote:
       | I tried to run the benchmark on my M1 Pro macbook, but the
       | "baseline" is written with x86 intrinsics and won't compile.
       | 
       | Are the benchmark results in the README real? (The README itself
       | feels very AI-generated)
       | 
       | Looking at the makefile, it tries to link the x86 SSE "baseline"
       | implementation and the NEON version into the same binary. A real
       | headscratcher!
       | 
       | Edit: The SSE impl gets shimmed via simd-everywhere, and the
       | benchmark results do seem legit (aside from being slightly
       | apples-to-oranges, but that's unavoidable)
        
         | Asmod4n wrote:
          | Maybe this could help you:
          | https://github.com/simd-everywhere/simde/issues/1099
        
           | Retr0id wrote:
           | But this project isn't using simd-everywhere. I'd like to
           | reproduce the results as documented in the README
        
             | guipsp wrote:
             | Look at the parent dir. I agree it is a bit confusing
        
               | Retr0id wrote:
               | Ah! Yup, that works, I can compile the binary. I get an
               | "Illegal instruction" error when I run it but that's
               | probably just because M1 doesn't support some of the NEON
               | instructions. I retract my implicit AI-slop accusations.
        
               | Retr0id wrote:
               | Results from M1 Pro (after setting CPU=native in the
               | makefile): https://gist.github.com/DavidBuchanan314/e3cde
               | 76e4dab2758ec4...
        
         | ashtonsix wrote:
         | Thank you so much for attempting a reproduction! (I posted this
         | on Reddit and most commenters didn't even click the link)
         | 
          | For the baseline you need SIMDe headers:
          | https://github.com/simd-everywhere/simde/tree/master/simde.
          | These alias x86 intrinsics to ARM intrinsics. The baseline is
          | based on the previous state of the art
          | (https://arxiv.org/abs/1209.2137), which happens to be
          | x86-based; using SIMDe to compile it was the highest-integrity
          | way I could think of to compare against the previous SOTA.
         | 
          | Note: M1 chips specifically have notoriously bad small-shift
          | performance, so the benchmark results will be very bad on your
          | machine. M3 partially fixed this, and M4 fixed it completely.
          | My primary target is server-class rather than consumer-class
          | hardware, so I'm not too worried about this.
         | 
          | The benchmark results were copy-pasted from the terminal. The
          | README prose was AI-generated from my rough notes (I'm
          | confident when communicating with other experts/researchers,
          | but less so when writing for a general audience).
        
           | ozgrakkurt wrote:
           | Super cool!
           | 
            | Pretty sure anyone going into this kind of post about SIMD
            | would prefer your writing to an LLM's
        
           | deadmutex wrote:
            | Here is a repro using GCE's C4A Axion instances
            | (c4a-highcpu-72). Seems to beat Graviton? Maybe the title of
            | the thread can be updated to a larger number :) I used the
            | largest instance to avoid noisy-neighbor issues.
            | $ ./out/bytepack_eval
            | Bytepack Bench -- 16 KiB, reps=20000 (pinned if available)
            | Throughput GB/s
            | K  NEON pack   NEON unpack  Baseline pack   Baseline unpack
           | 1  94.77       84.05        45.01           63.12
           | 2  123.63      94.74        52.70           66.63
           | 3  94.62       83.89        45.32           68.43
           | 4  112.68      77.91        58.10           78.20
           | 5  86.96       80.02        44.32           60.77
           | 6  93.50       92.08        51.22           67.20
           | 7  87.10       79.53        43.94           57.95
           | 8  90.49       92.36        68.99           83.88
        
       | danlark1 wrote:
       | Great work!
       | 
        | There's a popular narrative that NEON does not have a movemask
        | alternative. Some time ago I published an article on simulating
        | popular bit packing use cases with NEON in 1-2 instructions. It
        | does not cover unpacking, but can be great for real-world
        | applications like compare+find, compare+iterate, compare+test.
       | 
       | https://community.arm.com/arm-community-blogs/b/servers-and-...
        
         | ashtonsix wrote:
          | Nice article! I personally find the ARM ISA far more cohesive
          | than x86's: far fewer historical quirks. I also really
          | appreciate the ubiquity of support for 8-bit elements in ARM
          | and the absence of SMT (it makes performance much more
          | predictable).
        
         | Sesse__ wrote:
         | I never understood why they couldn't just include a movmskb
         | instruction to begin with. It's massively useful for integer
         | tasks, not expensive to implement as far as I know, and the
         | vshrn+mov trick often requires an extra instruction either in
         | front or after (in addition to just being, well, pretty
         | obscure).
         | 
         | NEON in general is a bit sad, really; it's built around the
         | idea of being implementable with a 64-bit ALU, and it shows.
         | And SVE support is pretty much non-existent on the client.
        
           | ashtonsix wrote:
            | Not having (1 << k) - 1 as a single instruction sucks when it
            | HAS to be in a hot loop, but you can usually hoist it to the
            | loop prologue: my stuff uses dummy inline assembly hints to
            | force compilers to do this: `asm volatile("" : "+w"(m));`.
           | 
            | I personally think calibrating ARM's ISA on a smaller VL was
            | a good choice: you get much better IPC. x86 ISAs also have
            | an almost-complete absence of support for 8-bit elements, so
            | on elements per instruction they're tied. And NEON kind-of-
            | ish makes up for its small VL with multi-register TBL/TBX
            | and LDP/STP.
           | 
            | Also: AVX512 is just as non-existent on clients as SVE2,
            | though that's not really relevant for the server-side
            | targets I'm optimising for (mostly OLAP).
        
             | TinkersW wrote:
              | 8-bit is not absent in x86 SIMD; it is slightly less
              | covered than 32- and 16-bit, but you can fully implement
              | all the common 8-bit ops and most are 1 instruction (with
              | AVX2). There are even various horizontal ops on 8-bit
              | values (avg, dot, etc).
              | 
              | Also, AVX512 is way more common than SVE2; all Zen4 & Zen5
              | chips support it.
        
               | dzaima wrote:
                | More specifically, basically the only absent 8-bit ops
                | that have 32-bit equivalents in AVX2 are shifts and
                | multiplies. Shifts are quite annoying (though, with a
                | uniform shift, they can be emulated on AVX-512 via GFNI
                | abuse in 1 instruction); multiplies are rather rare
                | (though note that there is vpmaddubsw for an
                | 8-bit-to-16-bit multiply-add). There's even a case of
                | the opposite: saturating add/sub exist for 8-bit and
                | 16-bit ints, but not wider.
        
               | namibj wrote:
               | GFNI is distinct from AVX-512; it was merely introduced
               | in cores that also had AVX-512.
        
             | Sesse__ wrote:
             | Does _any_ SIMD instruction set have (1 << k) - 1 as a
             | single instruction?
        
               | camel-cdr wrote:
               | Not sure in which context this is used, but you can do -1
               | << k in most ISAs but that still requires a bit-not. But
               | if you want to use the value in a bitwise instruction,
               | then there are often variants that can invert the input
               | operand.
               | 
               | E.g. in RVV instead of vand.vv(a,
               | vadd.vi(vsll.vv(1,k),-1)) you could do
               | vandn.vv(vsll.vv(-1,k))
               | 
               | AVX-512 can do this with any binary or ternary bitwise
               | logic function via vpternlog
        
               | Sesse__ wrote:
               | I don't know either; I was talking about the lack of
               | PMOVMSKB in NEON and then the comment I replied to
               | started talking about (1 << k) - 1 not being a single
               | instruction sucks. Which I don't think it is in any non-
               | NEON set either.
        
       | robert3005 wrote:
       | Highly recommend
       | https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf for a
       | comparable algorithm. It generalizes to arbitrary input and
       | output bit widths.
        
         | ashtonsix wrote:
         | Good work, wish I saw it earlier as it overlaps with a lot of
         | my recent work. I'm actually planning to release new SOTAs on
         | zigzag/delta/delta-of-delta/xor-with-previous coding next week.
         | Some areas the work doesn't give enough attention to (IMO):
         | register pressure, kernel fusion, cache locality (wrt multi-
         | pass). They also fudge a lot of the numbers and comparisons eg,
         | pitching themselves against Parquet (storage-optimised) when
         | Arrow (compute-optimised) is the most-comparable tech and
         | obvious target to beat. They definitely improve on current best
         | work, but only by a modest margin.
         | 
         | I'm also skeptical of the "unified" paradigm: performance
         | improvements are often realised by stepping away from
         | generalisation and exploiting the specifics of a given problem;
         | under a unified paradigm there's definite room for improvement
         | vs Arrow, but that's very unlikely to bring you all the way to
         | theoretically optimal performance.
        
       | fzeroff wrote:
        | Very interesting! Hope you don't mind my basic question, but
        | what would you use this for?
        
         | ashtonsix wrote:
          | If you have an array of numbers with a known upper bound, such
          | as enums with 8 possible values (representable with 3 bits),
          | and a memory-bound operation on those numbers, e.g.
          | 
          |   for (int i = 0; i < n; i++)
          |     if (user_category[i] == 0) filtered.push_back(i);
          | 
          | which is common in data warehouses, then using my code can
          | more than 2x performance by allowing more efficient usage of
          | the DRAM<->CPU bus.
        
       ___________________________________________________________________
       (page generated 2025-10-05 23:00 UTC)