[HN Gopher] Bit Vectors and my first steps into assembly
       ___________________________________________________________________
        
       Bit Vectors and my first steps into assembly
        
       Author : thunderbong
       Score  : 87 points
       Date   : 2024-12-22 03:00 UTC (3 days ago)
        
 (HTM) web link (blog.smidt.dev)
 (TXT) w3m dump (blog.smidt.dev)
        
       | ajross wrote:
       | > Since go doesn't inline assembly i decided to include the
       | search loop into the assembly itself and to return the index of
       | the non matching character. Sadly i didn't get it to work fully
       | because i'm missing tools to actually debug the assembly itself.
       | 
       | Um... "gdb <my_binary>", "break <my_function>", "run", just like
       | any other tool.
       | 
       | Then "disassemble" to see the instructions around the PC, "stepi"
       | to execute one instruction. "info register" to get current state
       | (or use registers in an expression as e.g. "print $rax" or "print
       | $r11").
       | 
       | And that's it. Debuggers are actually _natively_ tools operating
       | at the level of machine instructions. The language syntax support
       | is a level on top.
        
         | xyzzy_plugh wrote:
         | Assembly is one of those things that is feared by so many
         | engineers, but is actually incredibly accessible. Whenever I
         | find an excuse to drop down to assembly my colleagues have
         | looked at me like I'm a wizard with 3 heads.
         | 
         | The tooling to debug, dump or inspect, or to link or otherwise
         | glue bits of assembly together has existed for decades. It's
         | all right there!
        
           | RicoElectrico wrote:
           | It's one of these areas LLMs really help to get started. Not
           | much invention needed, but quite involved setup if you don't
           | know what you're precisely looking for. Having them spit out
           | boilerplate and needed invocations lowers the barrier
           | substantially.
        
             | dragontamer wrote:
             | Assuming you have a C compiler, you should use C to get
             | basic boilerplate and use asm or volatile asm to drop down
             | to assembly level.
             | 
             | This is true for GPUs, CPUs like x86 and ARM, and more.
             | Only on the smallest of embedded systems like AVR with 4kB
             | RAM have I ever found dropping to raw assembly to be
             | useful.
             | 
             | Bonus points: input/output/clobbers works well with the GCC
             | and CLang optimizers. Though it's not intuitive how
             | compilers can still move your assembly around, it's
             | surprisingly full featured for how to get info into and out
             | of registers.
             | 
             | There are rare times where the C boilerplate is
             | counterproductive. I guess OS level setup is another spot
             | where raw non-C assembly is useful.
        
               | david-gpu wrote:
               | Intrinsics, when available, are a convenient tool. But
               | even then, I think assembly these days has very limited
               | applications and most programmers don't need to know any
               | assembly to be effective.
               | 
               | Preemptive disclaimer: all my jobs were the exception to
               | the rule and I sadly had to work with various assembly
               | languages on a daily basis. Still not a fan.
        
               | AlotOfReading wrote:
               | You should be able to _read_ assembly, not least because
               | understanding what 's generated by the compiler or jit is
               | pretty important to performance optimization.
        
               | dragontamer wrote:
               | Things like the condition flags, or GPU execution flags,
               | or other 'implicit' flags are when assembly is needed.
               | 
               | Otherwise, intrinsics are sufficient. And with wavefront
               | voting / ballot instructions (and intrinsics) being so
               | readily available on modern GPUs, there's very little
               | reason to go to those assembly level / machine code flags
               | or registers.
               | 
               | Similarly, back when ccmov instructions weren't reliably
               | output by compilers, you'd probably want raw assembly.
               | But today I think it's safe to rely upon the optimizers.
        
           | dist1ll wrote:
           | I agree it's unnecessarily feared. At its core, it lacks a
           | lot of complexity.
           | 
           | But actually _writing_ asm is an experience straight from the
           | 80s. Compared to modern languages, the tooling is simply
           | antiquated. I noticed this once I started writing larger
           | sections of asm at a time. It 's "fine" in the same sense
           | that writing makefiles is "fine" - it becomes a PITA after a
           | certain size.
           | 
           | I think there's lots to improve on the current state of asm,
           | but it's just too niche. The only power users I know are
           | hobbyists working on operating systems and games for retro
           | hardware.
        
         | pm215 wrote:
         | I also like "disp /3i $pc", which tells gdb to disassemble the
         | next three instructions whenever it stops (including after
         | stepi). Though I'm sure you can configure it to give you more
         | sophisticated disassembly display and context, especially with
         | the tui interface, this is usually "good enough" for me when I
         | run into something I want to step through at the assembly level
         | in the middle of debugging a program that's primarily C.
        
       | MobiusHorizons wrote:
       | I'm surprised he didn't use a switch statement, as that is the
       | canonical way to ask the compiler to make the jump table for you.
        
         | rep_lodsb wrote:
         | Without SIMD, that would probably the optimal way to implement
         | it. And this bit vector idea really isn't doing SIMD at all,
         | it's just (ab)using the vector registers to store 128 bits.
         | 
         | I think the Zig code would also compile to better output, and
         | be a lot simpler, if it was using an array of u1, instead of
         | doing the bit shifting manually.
        
           | MobiusHorizons wrote:
           | The bit vector does give a nice compression compared to the
           | jump table. But I believe the main thing to do for speeding
           | up string parsing is to consume more than one character at a
           | time. I personally found the techniques in https://tia.mat.br
           | /posts/2018/02/01/more_on_string_switch_in... very helpful to
           | explain these issues.
           | 
           | EDIT: I especially like this statement from the conclusion:
           | 
           | > The good thing about the switch statement in C is that it
           | is maybe the highest level statement in the language: the
           | compiler can get really creative in how its code is
           | generated. It's not uncommon for it to generate jump tables
           | or even binary searches.
        
       | madflame991 wrote:
       | In certain conditions gcc will actually generate instructions
       | that use this same bitvector described:
       | https://godbolt.org/z/6Gfjv6PGd (notice the funny 130159 value in
       | there)
       | 
       | The only thing I did was to adjust the range of characters so
       | that the bitvector fits 64 bits. Sadly, I do not this it would
       | work for the author's bigger range.
        
       | dan-robertson wrote:
       | Another trick for classifying bytes is to use pshufb. That
       | instruction works like:                 fn pshufb(a: [i8; 16], b:
       | [i8; 16]) -> [i8; 16] {         result = [0; 16]         for j in
       | 0, 1, ..., 15 {           if b[j] >= 0 {             result[j] =
       | a[b[j] & 15];           }         }         return result;
       | }
       | 
       | So you can often construct values m1, m2, and compute:
       | a = pshufb(m1, input & 0xf)       b = pshufb(m2, (unsigned_input)
       | >> 4)       x = f(a,b)
       | 
       | Where f is something like bit wise AND. Roughly, m1 and m2 assign
       | sets of 8 potential properties based on the high or low order
       | bits of each byte, and then we and together the sets of
       | properties to make sure the upper and lower bits 'agree'.
       | 
       | This technique can sometimes be used to mark a few special
       | characters.
        
         | dragontamer wrote:
         | Pshufb is an old trick. An oldie but a goodie from SSE2 days.
         | 
         | All hail modern improvements!! pdep, pext, vpcompressb,
         | vpexpandb, and especially vpermi2v.
         | 
         | Forget about 16 byte lookup table with pshufb. Try a 128-byte
         | lookup with vpermi2v AVX512.
         | 
         | --------
         | 
         | I find that pext/pdep for 64-bits and vpcompressb/vpexpandb for
         | 512-bits is also more flexible than pshufb (!!!!) in practice,
         | if a bit slower as it is over multiple instructions.
         | 
         | Supercomputer programs are written with a gather-conpute-
         | scatter pattern. Gather your data, compute on the data, and
         | finally place the data where needed.
         | 
         | Pext and vpcompressb are gather operations.
         | 
         | Pdep and vpexpandb are scatter.
         | 
         | Pshufb is only a gather and thus fails to match the
         | supercomputer pattern. GPUs provide a bpermute (backwards-
         | permute) that scatters results back to a table and is really
         | the missing link IMO.
         | 
         | Hopefully Intel will provide bpermute of their own
         | eventually....
        
         | teo_zero wrote:
         | > So you can often construct values m1, m2, and compute:
         | 
         | And what would m1 and m2 be for the specific problem1 described
         | in TFA?
         | 
         | 1 Namely, to find out if the input belongs to [0-9A-Za-z]
        
       ___________________________________________________________________
       (page generated 2024-12-25 23:00 UTC)