[HN Gopher] Bit Vectors and my first steps into assembly
       ___________________________________________________________________
        
       Bit Vectors and my first steps into assembly
        
       Author : thunderbong
       Score  : 112 points
       Date   : 2024-12-22 03:00 UTC (4 days ago)
        
 (HTM) web link (blog.smidt.dev)
 (TXT) w3m dump (blog.smidt.dev)
        
       | ajross wrote:
       | > Since go doesn't inline assembly i decided to include the
       | search loop into the assembly itself and to return the index of
       | the non matching character. Sadly i didn't get it to work fully
       | because i'm missing tools to actually debug the assembly itself.
       | 
       | Um... "gdb <my_binary>", "break <my_function>", "run", just like
       | any other tool.
       | 
       | Then "disassemble" to see the instructions around the PC, "stepi"
       | to execute one instruction. "info register" to get current state
       | (or use registers in an expression as e.g. "print $rax" or "print
       | $r11").
       | 
       | And that's it. Debuggers are actually _natively_ tools operating
       | at the level of machine instructions. The language syntax support
       | is a level on top.
        
         | xyzzy_plugh wrote:
          | Assembly is one of those things that is feared by so many
          | engineers, but is actually incredibly accessible. Whenever I
          | find an excuse to drop down to assembly, my colleagues look at
          | me like I'm a wizard with three heads.
         | 
         | The tooling to debug, dump or inspect, or to link or otherwise
         | glue bits of assembly together has existed for decades. It's
         | all right there!
        
           | RicoElectrico wrote:
            | It's one of those areas where LLMs really help you get
            | started. Not much invention is needed, but the setup is
            | quite involved if you don't know precisely what you're
            | looking for. Having them spit out boilerplate and the
            | needed invocations lowers the barrier substantially.
        
             | dragontamer wrote:
              | Assuming you have a C compiler, you should use C to get
              | the basic boilerplate and use asm or asm volatile blocks
              | to drop down to the assembly level.
             | 
             | This is true for GPUs, CPUs like x86 and ARM, and more.
             | Only on the smallest of embedded systems like AVR with 4kB
             | RAM have I ever found dropping to raw assembly to be
             | useful.
             | 
              | Bonus points: input/output/clobber constraints work well
              | with the GCC and Clang optimizers. It's not intuitive how
              | compilers can still move your assembly around, but it's a
              | surprisingly full-featured way to get data into and out of
              | registers.
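              | 
              | A minimal sketch of what those constraints look like (x86,
              | GCC/Clang extended asm; the helper is mine, not from the
              | article):
              | 
              |     static inline int add_asm(int a, int b) {
              |         int out;
              |         /* out starts as a ("0" ties it to operand 0),
              |            then b is added in; the compiler picks the
              |            registers and schedules around it. */
              |         __asm__("addl %2, %0"
              |                 : "=r"(out)
              |                 : "0"(a), "r"(b));
              |         return out;
              |     }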
             | 
             | There are rare times where the C boilerplate is
             | counterproductive. I guess OS level setup is another spot
             | where raw non-C assembly is useful.
        
               | david-gpu wrote:
               | Intrinsics, when available, are a convenient tool. But
               | even then, I think assembly these days has very limited
               | applications and most programmers don't need to know any
               | assembly to be effective.
               | 
               | Preemptive disclaimer: all my jobs were the exception to
               | the rule and I sadly had to work with various assembly
               | languages on a daily basis. Still not a fan.
        
               | AlotOfReading wrote:
                | You should be able to _read_ assembly, not least because
                | understanding what's generated by the compiler or JIT is
                | pretty important to performance optimization.
        
               | david-gpu wrote:
               | Aren't there many other steps with better bang for your
               | buck to be done before that? Libraries, better algorithms
               | and data structures, parallelism, etc.
        
               | dragontamer wrote:
               | Things like the condition flags, or GPU execution flags,
               | or other 'implicit' flags are when assembly is needed.
               | 
               | Otherwise, intrinsics are sufficient. And with wavefront
               | voting / ballot instructions (and intrinsics) being so
               | readily available on modern GPUs, there's very little
               | reason to go to those assembly level / machine code flags
               | or registers.
               | 
                | Similarly, back when cmov instructions weren't reliably
                | emitted by compilers, you'd probably want raw assembly.
                | But today I think it's safe to rely upon the optimizers.
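                | 
                | For example, a branchless min like this (my sketch, not
                | from the article) typically comes out as cmp + cmov at
                | -O2 with GCC or Clang, no asm needed:
                | 
                |     /* Usually compiles to a conditional move today. */
                |     int min_int(int a, int b) {
                |         return a < b ? a : b;
                |     }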
        
           | dist1ll wrote:
            | I agree it's unnecessarily feared. At its core, there isn't
            | a lot of complexity.
            | 
            | But actually _writing_ asm is an experience straight from the
            | 80s. Compared to modern languages, the tooling is simply
            | antiquated. I noticed this once I started writing larger
            | sections of asm at a time. It's "fine" in the same sense
            | that writing makefiles is "fine" - it becomes a PITA after a
            | certain size.
           | 
           | I think there's lots to improve on the current state of asm,
           | but it's just too niche. The only power users I know are
           | hobbyists working on operating systems and games for retro
           | hardware.
        
         | pm215 wrote:
         | I also like "disp /3i $pc", which tells gdb to disassemble the
         | next three instructions whenever it stops (including after
         | stepi). Though I'm sure you can configure it to give you more
         | sophisticated disassembly display and context, especially with
         | the tui interface, this is usually "good enough" for me when I
         | run into something I want to step through at the assembly level
         | in the middle of debugging a program that's primarily C.
        
           | purplesyringa wrote:
           | On a related note, https://hugsy.github.io/gef/ provides a
           | TUI interface like that.
        
       | MobiusHorizons wrote:
       | I'm surprised he didn't use a switch statement, as that is the
       | canonical way to ask the compiler to make the jump table for you.
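        | 
        | Something along these lines (a sketch using the GNU case-range
        | extension for brevity); the compiler is then free to lower it
        | to a jump table, a bit test against a mask, or plain compares,
        | whichever it judges fastest:
        | 
        |     static int is_ident_char(unsigned char c) {
        |         switch (c) {
        |         case '0' ... '9':
        |         case 'A' ... 'Z':
        |         case 'a' ... 'z':
        |             return 1;
        |         default:
        |             return 0;
        |         }
        |     }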
        
         | rep_lodsb wrote:
          | Without SIMD, that would probably be the optimal way to
          | implement it. And this bit vector idea really isn't doing SIMD
          | at all, it's just (ab)using the vector registers to store 128
          | bits.
          | 
          | I think the Zig code would also compile to better output, and
          | be a lot simpler, if it used an array of u1 instead of doing
          | the bit shifting manually.
        
           | MobiusHorizons wrote:
           | The bit vector does give a nice compression compared to the
           | jump table. But I believe the main thing to do for speeding
           | up string parsing is to consume more than one character at a
            | time. I personally found the techniques in
            | https://tia.mat.br/posts/2018/02/01/more_on_string_switch_in...
            | very helpful in explaining these issues.
           | 
           | EDIT: I especially like this statement from the conclusion:
           | 
           | > The good thing about the switch statement in C is that it
           | is maybe the highest level statement in the language: the
           | compiler can get really creative in how its code is
           | generated. It's not uncommon for it to generate jump tables
           | or even binary searches.
        
           | dragontamer wrote:
           | Bit processing is always SIMD.
           | 
            | A 64-bit register being processed 64 bits at a time is, from
            | a mental and programmer's perspective, a 64-wide SIMD
            | computer over 1-bit AND, OR, and NOT instructions.
           | 
           | If you aren't using SIMD / parallel programming concepts
           | while thinking about 64-parallel bits (or even in this case,
           | parallel processing of 8-bit bytes or 7-bits of ASCII), then
           | you are making life much harder for yourself.
           | 
           | -------
           | 
            | I believe in the 90s, some people called 32-bit parallelism
            | SIMD Within A Register (SWAR). Some guy wrote 3DES
            | encryption with 64-bit AND/OR/NOTs and progressed bit-by-bit
            | (but in parallel), achieving absurdly efficient encryption
            | back in the day.
           | 
           | And of course, Chess Programmers with all their 64-bit boards
           | for their 8x8 game constantly think in terms of SIMD inside
           | of 64-bits.
           | 
           | -------
           | 
           | Once you learn to think in SIMD, you can extend these
           | concepts out to 128-bits (SSE), 256-bits (AVX) or 512-bit
           | (AVX512) or beyond (GPU 1024-bits and greater).
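            | 
            | The classic SWAR building block, as a C sketch: test all
            | eight bytes of a word for zero with a handful of scalar
            | ops.
            | 
            |     #include <stdint.h>
            | 
            |     /* Nonzero iff any byte of v is 0x00; all eight bytes
            |        are checked in parallel inside one 64-bit register. */
            |     static int has_zero_byte(uint64_t v) {
            |         return ((v - 0x0101010101010101ULL) & ~v &
            |                 0x8080808080808080ULL) != 0;
            |     }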
        
             | selimthegrim wrote:
             | http://lamport.azurewebsites.net/pubs/multiple-byte.pdf
        
       | madflame991 wrote:
        | Under certain conditions gcc will actually generate instructions
        | that use the same bitvector technique described in the article:
       | https://godbolt.org/z/6Gfjv6PGd (notice the funny 130159 value in
       | there)
       | 
        | The only thing I did was to adjust the range of characters so
        | that the bitvector fits in 64 bits. Sadly, I do not think it
        | would work for the author's bigger range.
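        | 
        | Hand-rolled, the pattern looks something like this (a sketch
        | with my own values, not the ones gcc picks in the godbolt
        | link):
        | 
        |     #include <stdint.h>
        | 
        |     /* Accepts [0-9A-Z]; the full [0-9A-Za-z] span (0x30..0x7A)
        |        is 75 characters wide, which is why the range has to be
        |        shrunk to fit a 64-bit mask. */
        |     static int in_set(unsigned char c) {
        |         unsigned idx = c - '0';
        |         if (idx > 63) return 0;
        |         uint64_t mask = 0x3FFULL                       /* 0-9 */
        |                       | (0x3FFFFFFULL << ('A' - '0')); /* A-Z */
        |         return (mask >> idx) & 1;
        |     }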
        
       | dan-robertson wrote:
        | Another trick for classifying bytes is to use pshufb. That
        | instruction works like:
        | 
        |     fn pshufb(a: [i8; 16], b: [i8; 16]) -> [i8; 16] {
        |         result = [0; 16]
        |         for j in 0, 1, ..., 15 {
        |             if b[j] >= 0 {
        |                 result[j] = a[b[j] & 15];
        |             }
        |         }
        |         return result;
        |     }
        | 
        | So you can often construct values m1, m2, and compute:
        | 
        |     a = pshufb(m1, input & 0xf)
        |     b = pshufb(m2, (unsigned_input) >> 4)
        |     x = f(a, b)
       | 
        | Where f is something like bitwise AND. Roughly, m1 and m2 assign
        | sets of 8 potential properties based on the high or low order
        | bits of each byte, and then we AND together the sets of
        | properties to make sure the upper and lower bits 'agree'.
       | 
       | This technique can sometimes be used to mark a few special
       | characters.
        
         | dragontamer wrote:
          | Pshufb is an old trick. An oldie but a goodie from SSSE3 days.
         | 
         | All hail modern improvements!! pdep, pext, vpcompressb,
         | vpexpandb, and especially vpermi2v.
         | 
         | Forget about 16 byte lookup table with pshufb. Try a 128-byte
         | lookup with vpermi2v AVX512.
         | 
         | --------
         | 
         | I find that pext/pdep for 64-bits and vpcompressb/vpexpandb for
         | 512-bits is also more flexible than pshufb (!!!!) in practice,
         | if a bit slower as it is over multiple instructions.
         | 
          | Supercomputer programs are written with a gather-compute-
          | scatter pattern. Gather your data, compute on the data, and
          | finally place the data where needed.
         | 
         | Pext and vpcompressb are gather operations.
         | 
         | Pdep and vpexpandb are scatter.
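          | 
          | As a tiny sketch (BMI2 intrinsics from immintrin.h), the
          | gather/scatter pairing at the bit level:
          | 
          |     #include <immintrin.h>
          |     #include <stdint.h>
          | 
          |     /* Gather: pull the bits selected by mask into the low
          |        end. Scatter: spread the low bits back out under mask. */
          |     uint64_t gather_bits(uint64_t v, uint64_t mask)
          |     { return _pext_u64(v, mask); }
          | 
          |     uint64_t scatter_bits(uint64_t v, uint64_t mask)
          |     { return _pdep_u64(v, mask); }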
         | 
         | Pshufb is only a gather and thus fails to match the
         | supercomputer pattern. GPUs provide a bpermute (backwards-
         | permute) that scatters results back to a table and is really
         | the missing link IMO.
         | 
         | Hopefully Intel will provide bpermute of their own
         | eventually....
        
         | teo_zero wrote:
         | > So you can often construct values m1, m2, and compute:
         | 
          | And what would m1 and m2 be for the specific problem [1]
          | described in TFA?
          | 
          | [1] Namely, to find out if the input belongs to [0-9A-Za-z]
        
           | dragontamer wrote:
           | The original poster got lucky and m1 and m2 can in fact be
           | constructed.
           | 
            | But your overall post and its skepticism are warranted: in
            | the general case, the claim the original post made is
            | definitely wrong.
           | 
           | To solve this easily, you truly need the 128-byte table from
           | vpermi2v / AVX512 (note that ASCII is a 7-bit code and thus
           | all functions of ASCII fit inside of a 128-table).
           | 
           | The reason the original post got lucky is that the top 2 bits
           | of ASCII determine special vs numbers vs lowercase vs
           | uppercase.
           | 
           | The bottom 5 bits determine the specific character. ASCII is
           | a table of tables.
           | 
           | 0x30-0x39 are '0' - '9' respectively. So 0x3 in the 'top'
           | table proves you have a number.
           | 
            | 0x40 is the uppercase block and 0x60 is the lowercase block.
           | 
            | Bottom table is harder to construct, but we don't actually
            | need to know the top bits. We just need 0x0-0x9 (valid for
            | numbers) vs 0x1-0xF (valid for 0x4 and 0x6, as 0x40 and 0x60
            | are invalid letters), and finally 0x0-0xA (aka P-Z / p-z,
            | valid for the top half of the alphabet in ASCII).
           | 
           | This harder bottom table is solvable with 3 bits of the 8,
           | meaning we have 5 spare bits for future extensions.
           | 
           | So valid for numbers is bit 0x1. Valid for lower-half
           | alphabet is 0x2 and valid for upper-half alphabet is 0x4.
           | 
           | Locations like 5 are valid for all three (0x7) while
           | locations like 0 are only valid for numbers and upper half
           | alphabet (aka 0x5).
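            | 
            | As a concrete sketch (SSSE3 intrinsics; the exact table
            | values are mine, following the bit assignment above):
            | 
            |     #include <immintrin.h>
            |     #include <stdint.h>
            | 
            |     /* 0x1 = digit, 0x2 = lower half of a letter column
            |        (low nibble 1..F), 0x4 = upper half (low nibble
            |        0..A). A nonzero AND of the two lookups means the
            |        byte is in [0-9A-Za-z]. */
            |     static uint16_t ident_mask16(__m128i x) {
            |         const __m128i m_lo = _mm_setr_epi8(
            |             0x05, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07,
            |             0x07, 0x07, 0x06, 0x02, 0x02, 0x02, 0x02, 0x02);
            |         const __m128i m_hi = _mm_setr_epi8(
            |             0x00, 0x00, 0x00, 0x01, 0x02, 0x04, 0x02, 0x04,
            |             0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00);
            |         __m128i lo = _mm_and_si128(x, _mm_set1_epi8(0x0f));
            |         __m128i hi = _mm_and_si128(_mm_srli_epi16(x, 4),
            |                                    _mm_set1_epi8(0x0f));
            |         __m128i cls = _mm_and_si128(_mm_shuffle_epi8(m_lo, lo),
            |                                     _mm_shuffle_epi8(m_hi, hi));
            |         __m128i bad = _mm_cmpeq_epi8(cls, _mm_setzero_si128());
            |         return (uint16_t)(~_mm_movemask_epi8(bad) & 0xFFFF);
            |     }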
           | 
           | ----------
           | 
            | In the general case, the [0-9a-zA-Z] language is a regular
            | language (even in an arbitrary non-ASCII code in binary),
            | which means it can always be solved with a state transition
            | table and a finite automaton. And yes, pshufb can be a very
           | useful tool for making such a table (and a bigger table like
           | vpermi2v is even better).
           | 
           | But it's not clear that the pshufb sequence in the original
           | post would always work for an arbitrary jump table. Indeed,
           | for best results you likely need to use pext and other
           | complex bit instructions to efficiently extract the most info
           | out of the bits available.
           | 
            | Or, you know... use the 128-entry table from vpermi2v.
           | 
           | ------
           | 
           | Note: FPGAs are composed of 4-LUTs which share many
           | similarities with these 4-bit / 16-byte tables. One can argue
           | that an FPGA is nothing more than a whole bunch of parallel
           | pshufb units.
           | 
            | But that doesn't mean that two such tables solve all
            | problems lol. It's often surprising how some problems blow
            | up in your face and suddenly eat up thousands of LUTs.
        
       | purplesyringa wrote:
        | This article is kind of fine overall but gets so many specifics
        | wrong. I guess this is to be expected from someone just getting
        | started with assembly, but still.
        | 
        |     if '0' <= x && x <= '9' || 'a' <= x && x <= 'z' ||
        |        'A' <= x && x <= 'Z' {
        |         ...
        |     }
       | 
       | Go is garbage, but GCC and LLVM compile this to branchless code.
       | If the goal was to redo the book in Zig and Rust, perhaps it
       | would've been a good idea to profile the implementations under
       | Zig and Rust.
       | 
        | If you really want to optimize code like `a <= x && x <= b` by
        | hand, you can rewrite it to `x - a <= b - a`, where the
        | comparison is unsigned; GCC and LLVM do that for you. Next, you
        | can optimize the case-insensitive check by masking out the case
        | bit:
        | 
        |     void f();
        |     void test(char x) {
        |         if ((unsigned char)(x - '0') < 10 ||
        |             (unsigned char)((x & 0x5f) - 'A') < 26) {
        |             f();
        |         }
        |     }
        | 
        | compiles to just
        | 
        |     lea     eax, [rdi - 48]
        |     cmp     al, 10
        |     jb      f
        |     and     dil, 95
        |     add     dil, -65
        |     cmp     dil, 25
        |     jbe     f
        |     ret
       | 
        | This is still not as fast as a memory read, but IMO this
        | should've been the first thing to try. However, a) GCC and LLVM
        | can autovectorize this, whereas memory reads can't be (although I
        | admit the resulting codegen makes me uneasy), b) you can go on
        | and vectorize this by hand, parsing an identifier in just one
        | iteration of a loop, which you can't do with a LUT. I might have
        | made a bug, but something along the lines of
        | 
        |     unsigned short test(__m128i x) {
        |         return _mm_movemask_epi8(_mm_or_si128(
        |             _mm_cmplt_epi8(
        |                 _mm_sub_epi8(x, _mm_set1_epi8(0xb0)),
        |                 _mm_set1_epi8(0x8a)),
        |             _mm_cmplt_epi8(
        |                 _mm_sub_epi8(_mm_and_si128(x, _mm_set1_epi8(0x5f)),
        |                              _mm_set1_epi8(0xc1)),
        |                 _mm_set1_epi8(0x9a))));
        |     }
       | 
       | should work. This has 5x more throughput than a LUT and should
       | have better branch prediction characteristics, although it
       | consumes 16 bytes at a time, slightly limiting the applications.
       | 
       | The section about bitvectors is fine.
       | 
       | x86 processors don't have 128-bit MMX registers and 256-bit MMY
       | registers. They have legacy 64-bit MMX registers, modern 128-bit
       | XMM registers, and modern 256-bit YMM registers. This MMX/XMM
       | duality sure does confuse people, so I don't blame the author.
        | But come on, at least read the article before publishing; broken
        | GitHub URLs without `blob/master` and broken ()[] Markdown syntax
        | are not funny.
       | 
       | To those trying to make sense of Zig output, don't forget `-O
       | ReleaseFast`. The codegen here is very simple, just a 128-bit
       | integer simulated with two 64-bit ones.
        
         | jesse__ wrote:
         | Excellent nitpicks, hopefully the author reads this and does a
         | follow up :)
        
           | selimthegrim wrote:
           | How good is Hacker's Delight on this stuff?
        
             | purplesyringa wrote:
             | The main problem with Hacker's Delight is its age. It has
             | lots of bit tricks and optimization guidelines you'll find
             | useful, but you have to understand that a big part of it
             | doesn't apply on modern devices. But as long as you treat
             | it as a collection of snippets and benchmark the code on
             | modern devices (or, if you have built some intuition, that
             | works too), it's an invaluable resource.
        
               | selimthegrim wrote:
               | Thanks, I stumbled across a copy in a local coffee shop
               | by happenstance that had been remaindered out of a
               | University of Alaska library.
        
       | ww520 wrote:
       | I know the post only uses SIMD as an attempt to do bitmask lookup
       | for each byte, but a good alternative use of SIMD is for parallel
       | data processing, i.e. operating on more than one character. Being
       | able to identify a range of characters making up the identifier
       | is much faster than checking one character at a time, especially
       | for avoiding branching.
       | 
       | Did a bit of xmas hacking to come up with the function to find
       | identifiers in parallel in 32-byte chunks. I used Zig since it
       | makes vector/SIMD programming easy.
       | 
       | The code is in:
       | https://github.com/williamw520/misc_zig/blob/main/identifier...
       | 
       | The generated code of the nextBitmask32() function has about 38
       | instructions. https://godbolt.org/z/7ME3sW55e
       | 
       | That's 38 instructions for processing every 32 bytes. With no
       | branching!
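        | 
        | For comparison, a rough C/AVX2 sketch of the same idea (the
        | helper names are mine): classify 32 bytes at once and return a
        | bitmask of identifier characters.
        | 
        |     #include <immintrin.h>
        |     #include <stdint.h>
        | 
        |     /* Bytes in [lo, hi] become 0xFF. Signed compares are fine
        |        since every accepted character is below 0x80. */
        |     static __m256i in_range(__m256i x, char lo, char hi) {
        |         return _mm256_and_si256(
        |             _mm256_cmpgt_epi8(x, _mm256_set1_epi8(lo - 1)),
        |             _mm256_cmpgt_epi8(_mm256_set1_epi8(hi + 1), x));
        |     }
        | 
        |     /* One mask bit per byte of p[0..31] in [0-9A-Za-z]. */
        |     static uint32_t ident_mask32(const char *p) {
        |         __m256i x = _mm256_loadu_si256((const __m256i *)p);
        |         __m256i m = _mm256_or_si256(in_range(x, '0', '9'),
        |                     _mm256_or_si256(in_range(x, 'A', 'Z'),
        |                                     in_range(x, 'a', 'z')));
        |         return (uint32_t)_mm256_movemask_epi8(m);
        |     }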
        
         | ww520 wrote:
         | Slightly optimized code to get down to 32 asm instructions.
         | 
         | https://godbolt.org/z/ccE3r1bjc
        
       ___________________________________________________________________
       (page generated 2024-12-26 23:01 UTC)