[HN Gopher] Bit Vectors and my first steps into assembly
___________________________________________________________________
Bit Vectors and my first steps into assembly
Author : thunderbong
Score : 87 points
Date : 2024-12-22 03:00 UTC (3 days ago)
(HTM) web link (blog.smidt.dev)
(TXT) w3m dump (blog.smidt.dev)
| ajross wrote:
| > Since go doesn't inline assembly i decided to include the
| search loop into the assembly itself and to return the index of
| the non matching character. Sadly i didn't get it to work fully
| because i'm missing tools to actually debug the assembly itself.
|
| Um... "gdb <my_binary>", "break <my_function>", "run", just like
| any other tool.
|
| Then "disassemble" to see the instructions around the PC, "stepi"
| to execute one instruction. "info register" to get current state
| (or use registers in an expression as e.g. "print $rax" or "print
| $r11").
|
| And that's it. Debuggers are actually _natively_ tools operating
| at the level of machine instructions. The language syntax support
| is a level on top.
| xyzzy_plugh wrote:
| Assembly is one of those things that is feared by so many
| engineers, but is actually incredibly accessible. Whenever I
| find an excuse to drop down to assembly my colleagues have
| looked at me like I'm a wizard with 3 heads.
|
| The tooling to debug, dump or inspect, or to link or otherwise
| glue bits of assembly together has existed for decades. It's
| all right there!
| RicoElectrico wrote:
| It's one of these areas LLMs really help to get started. Not
| much invention needed, but quite involved setup if you don't
| know what you're precisely looking for. Having them spit out
| boilerplate and needed invocations lowers the barrier
| substantially.
| dragontamer wrote:
| Assuming you have a C compiler, you should use C to get
| basic boilerplate and use asm or volatile asm to drop down
| to assembly level.
|
| This is true for GPUs, CPUs like x86 and ARM, and more.
| Only on the smallest of embedded systems like AVR with 4kB
| RAM have I ever found dropping to raw assembly to be
| useful.
|
| Bonus points: input/output/clobbers works well with the GCC
| and CLang optimizers. Though it's not intuitive how
| compilers can still move your assembly around, it's
| surprisingly full featured for how to get info into and out
| of registers.
|
| There are rare times where the C boilerplate is
| counterproductive. I guess OS level setup is another spot
| where raw non-C assembly is useful.
| david-gpu wrote:
| Intrinsics, when available, are a convenient tool. But
| even then, I think assembly these days has very limited
| applications and most programmers don't need to know any
| assembly to be effective.
|
| Preemptive disclaimer: all my jobs were the exception to
| the rule and I sadly had to work with various assembly
| languages on a daily basis. Still not a fan.
| AlotOfReading wrote:
| You should be able to _read_ assembly, not least because
| understanding what 's generated by the compiler or jit is
| pretty important to performance optimization.
| dragontamer wrote:
| Things like the condition flags, or GPU execution flags,
| or other 'implicit' flags are when assembly is needed.
|
| Otherwise, intrinsics are sufficient. And with wavefront
| voting / ballot instructions (and intrinsics) being so
| readily available on modern GPUs, there's very little
| reason to go to those assembly level / machine code flags
| or registers.
|
| Similarly, back when ccmov instructions weren't reliably
| output by compilers, you'd probably want raw assembly.
| But today I think it's safe to rely upon the optimizers.
| dist1ll wrote:
| I agree it's unnecessarily feared. At its core, it lacks a
| lot of complexity.
|
| But actually _writing_ asm is an experience straight from the
| 80s. Compared to modern languages, the tooling is simply
| antiquated. I noticed this once I started writing larger
| sections of asm at a time. It 's "fine" in the same sense
| that writing makefiles is "fine" - it becomes a PITA after a
| certain size.
|
| I think there's lots to improve on the current state of asm,
| but it's just too niche. The only power users I know are
| hobbyists working on operating systems and games for retro
| hardware.
| pm215 wrote:
| I also like "disp /3i $pc", which tells gdb to disassemble the
| next three instructions whenever it stops (including after
| stepi). Though I'm sure you can configure it to give you more
| sophisticated disassembly display and context, especially with
| the tui interface, this is usually "good enough" for me when I
| run into something I want to step through at the assembly level
| in the middle of debugging a program that's primarily C.
| MobiusHorizons wrote:
| I'm surprised he didn't use a switch statement, as that is the
| canonical way to ask the compiler to make the jump table for you.
| rep_lodsb wrote:
| Without SIMD, that would probably the optimal way to implement
| it. And this bit vector idea really isn't doing SIMD at all,
| it's just (ab)using the vector registers to store 128 bits.
|
| I think the Zig code would also compile to better output, and
| be a lot simpler, if it was using an array of u1, instead of
| doing the bit shifting manually.
| MobiusHorizons wrote:
| The bit vector does give a nice compression compared to the
| jump table. But I believe the main thing to do for speeding
| up string parsing is to consume more than one character at a
| time. I personally found the techniques in https://tia.mat.br
| /posts/2018/02/01/more_on_string_switch_in... very helpful to
| explain these issues.
|
| EDIT: I especially like this statement from the conclusion:
|
| > The good thing about the switch statement in C is that it
| is maybe the highest level statement in the language: the
| compiler can get really creative in how its code is
| generated. It's not uncommon for it to generate jump tables
| or even binary searches.
| madflame991 wrote:
| In certain conditions gcc will actually generate instructions
| that use this same bitvector described:
| https://godbolt.org/z/6Gfjv6PGd (notice the funny 130159 value in
| there)
|
| The only thing I did was to adjust the range of characters so
| that the bitvector fits 64 bits. Sadly, I do not this it would
| work for the author's bigger range.
| dan-robertson wrote:
| Another trick for classifying bytes is to use pshufb. That
| instruction works like: fn pshufb(a: [i8; 16], b:
| [i8; 16]) -> [i8; 16] { result = [0; 16] for j in
| 0, 1, ..., 15 { if b[j] >= 0 { result[j] =
| a[b[j] & 15]; } } return result;
| }
|
| So you can often construct values m1, m2, and compute:
| a = pshufb(m1, input & 0xf) b = pshufb(m2, (unsigned_input)
| >> 4) x = f(a,b)
|
| Where f is something like bit wise AND. Roughly, m1 and m2 assign
| sets of 8 potential properties based on the high or low order
| bits of each byte, and then we and together the sets of
| properties to make sure the upper and lower bits 'agree'.
|
| This technique can sometimes be used to mark a few special
| characters.
| dragontamer wrote:
| Pshufb is an old trick. An oldie but a goodie from SSE2 days.
|
| All hail modern improvements!! pdep, pext, vpcompressb,
| vpexpandb, and especially vpermi2v.
|
| Forget about 16 byte lookup table with pshufb. Try a 128-byte
| lookup with vpermi2v AVX512.
|
| --------
|
| I find that pext/pdep for 64-bits and vpcompressb/vpexpandb for
| 512-bits is also more flexible than pshufb (!!!!) in practice,
| if a bit slower as it is over multiple instructions.
|
| Supercomputer programs are written with a gather-conpute-
| scatter pattern. Gather your data, compute on the data, and
| finally place the data where needed.
|
| Pext and vpcompressb are gather operations.
|
| Pdep and vpexpandb are scatter.
|
| Pshufb is only a gather and thus fails to match the
| supercomputer pattern. GPUs provide a bpermute (backwards-
| permute) that scatters results back to a table and is really
| the missing link IMO.
|
| Hopefully Intel will provide bpermute of their own
| eventually....
| teo_zero wrote:
| > So you can often construct values m1, m2, and compute:
|
| And what would m1 and m2 be for the specific problem1 described
| in TFA?
|
| 1 Namely, to find out if the input belongs to [0-9A-Za-z]
___________________________________________________________________
(page generated 2024-12-25 23:00 UTC)