[HN Gopher] Bit Vectors and my first steps into assembly
___________________________________________________________________
Bit Vectors and my first steps into assembly
Author : thunderbong
Score : 112 points
Date : 2024-12-22 03:00 UTC (4 days ago)
(HTM) web link (blog.smidt.dev)
(TXT) w3m dump (blog.smidt.dev)
| ajross wrote:
| > Since go doesn't inline assembly i decided to include the
| search loop into the assembly itself and to return the index of
| the non matching character. Sadly i didn't get it to work fully
| because i'm missing tools to actually debug the assembly itself.
|
| Um... "gdb <my_binary>", "break <my_function>", "run", just like
| any other tool.
|
| Then "disassemble" to see the instructions around the PC, "stepi"
| to execute one instruction. "info register" to get current state
| (or use registers in an expression as e.g. "print $rax" or "print
| $r11").
|
| And that's it. Debuggers are actually _natively_ tools operating
| at the level of machine instructions. The language syntax support
| is a level on top.
| xyzzy_plugh wrote:
| Assembly is one of those things that is feared by so many
| engineers, but is actually incredibly accessible. Whenever I
| find an excuse to drop down to assembly my colleagues have
| looked at me like I'm a wizard with 3 heads.
|
| The tooling to debug, dump or inspect, or to link or otherwise
| glue bits of assembly together has existed for decades. It's
| all right there!
| RicoElectrico wrote:
| It's one of these areas LLMs really help to get started. Not
| much invention needed, but quite involved setup if you don't
| know what you're precisely looking for. Having them spit out
| boilerplate and needed invocations lowers the barrier
| substantially.
| dragontamer wrote:
| Assuming you have a C compiler, you should use C to get
| basic boilerplate and use asm or volatile asm to drop down
| to assembly level.
|
| This is true for GPUs, CPUs like x86 and ARM, and more.
| Only on the smallest of embedded systems like AVR with 4kB
| RAM have I ever found dropping to raw assembly to be
| useful.
|
| Bonus points: input/output/clobbers works well with the GCC
| and CLang optimizers. Though it's not intuitive how
| compilers can still move your assembly around, it's
| surprisingly full featured for how to get info into and out
| of registers.
|
| There are rare times where the C boilerplate is
| counterproductive. I guess OS level setup is another spot
| where raw non-C assembly is useful.
| david-gpu wrote:
| Intrinsics, when available, are a convenient tool. But
| even then, I think assembly these days has very limited
| applications and most programmers don't need to know any
| assembly to be effective.
|
| Preemptive disclaimer: all my jobs were the exception to
| the rule and I sadly had to work with various assembly
| languages on a daily basis. Still not a fan.
| AlotOfReading wrote:
| You should be able to _read_ assembly, not least because
| understanding what 's generated by the compiler or jit is
| pretty important to performance optimization.
| david-gpu wrote:
| Aren't there many other steps with better bang for your
| buck to be done before that? Libraries, better algorithms
| and data structures, parallelism, etc.
| dragontamer wrote:
| Things like the condition flags, or GPU execution flags,
| or other 'implicit' flags are when assembly is needed.
|
| Otherwise, intrinsics are sufficient. And with wavefront
| voting / ballot instructions (and intrinsics) being so
| readily available on modern GPUs, there's very little
| reason to go to those assembly level / machine code flags
| or registers.
|
| Similarly, back when ccmov instructions weren't reliably
| output by compilers, you'd probably want raw assembly.
| But today I think it's safe to rely upon the optimizers.
| dist1ll wrote:
| I agree it's unnecessarily feared. At its core, it lacks a
| lot of complexity.
|
| But actually _writing_ asm is an experience straight from the
| 80s. Compared to modern languages, the tooling is simply
| antiquated. I noticed this once I started writing larger
| sections of asm at a time. It 's "fine" in the same sense
| that writing makefiles is "fine" - it becomes a PITA after a
| certain size.
|
| I think there's lots to improve on the current state of asm,
| but it's just too niche. The only power users I know are
| hobbyists working on operating systems and games for retro
| hardware.
| pm215 wrote:
| I also like "disp /3i $pc", which tells gdb to disassemble the
| next three instructions whenever it stops (including after
| stepi). Though I'm sure you can configure it to give you more
| sophisticated disassembly display and context, especially with
| the tui interface, this is usually "good enough" for me when I
| run into something I want to step through at the assembly level
| in the middle of debugging a program that's primarily C.
| purplesyringa wrote:
| On a related note, https://hugsy.github.io/gef/ provides a
| TUI interface like that.
| MobiusHorizons wrote:
| I'm surprised he didn't use a switch statement, as that is the
| canonical way to ask the compiler to make the jump table for you.
| rep_lodsb wrote:
| Without SIMD, that would probably the optimal way to implement
| it. And this bit vector idea really isn't doing SIMD at all,
| it's just (ab)using the vector registers to store 128 bits.
|
| I think the Zig code would also compile to better output, and
| be a lot simpler, if it was using an array of u1, instead of
| doing the bit shifting manually.
| MobiusHorizons wrote:
| The bit vector does give a nice compression compared to the
| jump table. But I believe the main thing to do for speeding
| up string parsing is to consume more than one character at a
| time. I personally found the techniques in https://tia.mat.br
| /posts/2018/02/01/more_on_string_switch_in... very helpful to
| explain these issues.
|
| EDIT: I especially like this statement from the conclusion:
|
| > The good thing about the switch statement in C is that it
| is maybe the highest level statement in the language: the
| compiler can get really creative in how its code is
| generated. It's not uncommon for it to generate jump tables
| or even binary searches.
| dragontamer wrote:
| Bit processing is always SIMD.
|
| A 64-bit register being processed 64-bits at a time is, from
| a mental and programmers perspective, a 64-wide SIMD computer
| over 1-bit AND, OR, and NOT instructions.
|
| If you aren't using SIMD / parallel programming concepts
| while thinking about 64-parallel bits (or even in this case,
| parallel processing of 8-bit bytes or 7-bits of ASCII), then
| you are making life much harder for yourself.
|
| -------
|
| I believe in the 90s, some people called 32-bit parallelism
| as SIMD Within A Register (SWAR). Some guy wrote 3DES
| encryption with 64-bit AND/OR/NOTs and progressed bit-by-bit
| (but in parallel), achieving absurdly efficient encryption
| back in the day.
|
| And of course, Chess Programmers with all their 64-bit boards
| for their 8x8 game constantly think in terms of SIMD inside
| of 64-bits.
|
| -------
|
| Once you learn to think in SIMD, you can extend these
| concepts out to 128-bits (SSE), 256-bits (AVX) or 512-bit
| (AVX512) or beyond (GPU 1024-bits and greater).
| selimthegrim wrote:
| http://lamport.azurewebsites.net/pubs/multiple-byte.pdf
| madflame991 wrote:
| In certain conditions gcc will actually generate instructions
| that use this same bitvector described:
| https://godbolt.org/z/6Gfjv6PGd (notice the funny 130159 value in
| there)
|
| The only thing I did was to adjust the range of characters so
| that the bitvector fits 64 bits. Sadly, I do not this it would
| work for the author's bigger range.
| dan-robertson wrote:
| Another trick for classifying bytes is to use pshufb. That
| instruction works like: fn pshufb(a: [i8; 16], b:
| [i8; 16]) -> [i8; 16] { result = [0; 16] for j in
| 0, 1, ..., 15 { if b[j] >= 0 { result[j] =
| a[b[j] & 15]; } } return result;
| }
|
| So you can often construct values m1, m2, and compute:
| a = pshufb(m1, input & 0xf) b = pshufb(m2, (unsigned_input)
| >> 4) x = f(a,b)
|
| Where f is something like bit wise AND. Roughly, m1 and m2 assign
| sets of 8 potential properties based on the high or low order
| bits of each byte, and then we and together the sets of
| properties to make sure the upper and lower bits 'agree'.
|
| This technique can sometimes be used to mark a few special
| characters.
| dragontamer wrote:
| Pshufb is an old trick. An oldie but a goodie from SSE2 days.
|
| All hail modern improvements!! pdep, pext, vpcompressb,
| vpexpandb, and especially vpermi2v.
|
| Forget about 16 byte lookup table with pshufb. Try a 128-byte
| lookup with vpermi2v AVX512.
|
| --------
|
| I find that pext/pdep for 64-bits and vpcompressb/vpexpandb for
| 512-bits is also more flexible than pshufb (!!!!) in practice,
| if a bit slower as it is over multiple instructions.
|
| Supercomputer programs are written with a gather-conpute-
| scatter pattern. Gather your data, compute on the data, and
| finally place the data where needed.
|
| Pext and vpcompressb are gather operations.
|
| Pdep and vpexpandb are scatter.
|
| Pshufb is only a gather and thus fails to match the
| supercomputer pattern. GPUs provide a bpermute (backwards-
| permute) that scatters results back to a table and is really
| the missing link IMO.
|
| Hopefully Intel will provide bpermute of their own
| eventually....
| teo_zero wrote:
| > So you can often construct values m1, m2, and compute:
|
| And what would m1 and m2 be for the specific problem1 described
| in TFA?
|
| 1 Namely, to find out if the input belongs to [0-9A-Za-z]
| dragontamer wrote:
| The original poster got lucky and m1 and m2 can in fact be
| constructed.
|
| But your overall post and skepticism is warranted. It's
| definitely wrong in the general case the original post made.
|
| To solve this easily, you truly need the 128-byte table from
| vpermi2v / AVX512 (note that ASCII is a 7-bit code and thus
| all functions of ASCII fit inside of a 128-table).
|
| The reason the original post got lucky is that the top 2 bits
| of ASCII determine special vs numbers vs lowercase vs
| uppercase.
|
| The bottom 5 bits determine the specific character. ASCII is
| a table of tables.
|
| 0x30-0x39 are '0' - '9' respectively. So 0x3 in the 'top'
| table proves you have a number.
|
| 0x40 is uppercase and 0x60 is lower case letters.
|
| Bottom table is harder to construct but we don't actually
| need to know the top bits. We just need 0x0-0x9 (valid for
| numbers) vs 0x1-0xF (valid for 0x4 and 0x6, as 0x40 and 0x60
| are invalid letters), and finally 0x0-0x1A (aka p-z, valid
| for top half of alphabets in ASCII).
|
| This harder bottom table is solvable with 3 bits of the 8,
| meaning we have 5 spare bits for future extensions.
|
| So valid for numbers is bit 0x1. Valid for lower-half
| alphabet is 0x2 and valid for upper-half alphabet is 0x4.
|
| Locations like 5 are valid for all three (0x7) while
| locations like 0 are only valid for numbers and upper half
| alphabet (aka 0x5).
|
| ----------
|
| In the general case, the [0-9a-zA-Z] language is a regular
| language (even in an arbitrary non-ASCII code in binary),
| which means it can always be solved with a state transition
| table and a finite automata. And yes, pshufb can be a very
| useful tool for making such a table (and a bigger table like
| vpermi2v is even better).
|
| But it's not clear that the pshufb sequence in the original
| post would always work for an arbitrary jump table. Indeed,
| for best results you likely need to use pext and other
| complex bit instructions to efficiently extract the most info
| out of the bits available.
|
| Or you know.... Use the 128-tanle from vpermi2v.
|
| ------
|
| Note: FPGAs are composed of 4-LUTs which share many
| similarities with these 4-bit / 16-byte tables. One can argue
| that an FPGA is nothing more than a whole bunch of parallel
| pshufb units.
|
| But that doesn't mean that two such tables solves all
| problems lol. It's often surprising how some problems blow up
| in your face and suddenly eat up thousands of LUTs.
| purplesyringa wrote:
| This article is kind of fine overall but gets so many specifics
| wrong. I guess this is to be expected from someone just beginning
| to verse in assembly, but still. if '0' <= x &&
| x <= '9' || 'a' <= x && x <= 'z' || 'A' <= x && x <= 'Z' {
| ... }
|
| Go is garbage, but GCC and LLVM compile this to branchless code.
| If the goal was to redo the book in Zig and Rust, perhaps it
| would've been a good idea to profile the implementations under
| Zig and Rust.
|
| If you really want to optimize code like `a <= x && x <= b` by
| hand, you can rewrite it to `x - a <= b - a`, where the
| comparison is unsigned; GCC and LLVM do that for you. Next, you
| can optimize the case-insensitive check by masking out the case
| bit: void f(); void test(char x) {
| if ((unsigned char)(x - '0') < 10 || (unsigned char)((x & 0x5f) -
| 'A') < 26) { f(); } }
|
| compiles to just lea eax, [rdi - 48]
| cmp al, 10 jb f and dil, 95
| add dil, -65 cmp dil, 25 jbe f
| ret
|
| This is still not as fast as a memory read, but IMO this
| should've been the first thing to try. However, a) GCC and LLVM
| can autovectorize this, whereas memory reads can't be (although I
| admit the resulting codegen makes me uneasy), b) you can go on
| and vectorize this by hand, parsing an identifier in just one
| iteration of a loop, which you can't do with a LUT. I might have
| made a bug, but something along the lines of
| unsigned short test(__m128i x) { return
| _mm_movemask_epi8(_mm_or_si128(
| _mm_cmplt_epi8(_mm_sub_epi8(x, _mm_set1_epi8(0xb0)),
| _mm_set1_epi8(0x8a)),
| _mm_cmplt_epi8(_mm_sub_epi8(_mm_and_si128(x,
| _mm_set1_epi8(0x5f)), _mm_set1_epi8(0xc1)), _mm_set1_epi8(0x9a))
| )); }
|
| should work. This has 5x more throughput than a LUT and should
| have better branch prediction characteristics, although it
| consumes 16 bytes at a time, slightly limiting the applications.
|
| The section about bitvectors is fine.
|
| x86 processors don't have 128-bit MMX registers and 256-bit MMY
| registers. They have legacy 64-bit MMX registers, modern 128-bit
| XMM registers, and modern 256-bit YMM registers. This MMX/XMM
| duality sure does confuse people, so I don't blame the author.
| But come on, at least read the article before publishing; broken
| GitHub URLs without `blob/master` and broken ()[] Markdown syntax
| is not funny.
|
| To those trying to make sense of Zig output, don't forget `-O
| ReleaseFast`. The codegen here is very simple, just a 128-bit
| integer simulated with two 64-bit ones.
| jesse__ wrote:
| Excellent nitpicks, hopefully the author reads this and does a
| follow up :)
| selimthegrim wrote:
| How good is Hacker's Delight on this stuff?
| purplesyringa wrote:
| The main problem with Hacker's Delight is its age. It has
| lots of bit tricks and optimization guidelines you'll find
| useful, but you have to understand that a big part of it
| doesn't apply on modern devices. But as long as you treat
| it as a collection of snippets and benchmark the code on
| modern devices (or, if you have built some intuition, that
| works too), it's an invaluable resource.
| selimthegrim wrote:
| Thanks, I stumbled across a copy in a local coffee shop
| by happenstance that had been remaindered out of a
| University of Alaska library.
| ww520 wrote:
| I know the post only uses SIMD as an attempt to do bitmask lookup
| for each byte, but a good alternative use of SIMD is for parallel
| data processing, i.e. operating on more than one character. Being
| able to identify a range of characters making up the identifier
| is much faster than checking one character at a time, especially
| for avoiding branching.
|
| Did a bit of xmas hacking to come up with the function to find
| identifiers in parallel in 32-byte chunks. I used Zig since it
| makes vector/SIMD programming easy.
|
| The code is in:
| https://github.com/williamw520/misc_zig/blob/main/identifier...
|
| The generated code of the nextBitmask32() function has about 38
| instructions. https://godbolt.org/z/7ME3sW55e
|
| That's 38 instructions for processing every 32 bytes. With no
| branching!
| ww520 wrote:
| Slightly optimized code to get down to 32 asm instructions.
|
| https://godbolt.org/z/ccE3r1bjc
___________________________________________________________________
(page generated 2024-12-26 23:01 UTC)