The following instructions are slower than their common counterparts:
  loopnz, jcxz
  all of the transcendental x87 instructions
    (this does not seem to cover MMX, whose registers alias the x87 floating-point stack)
  fbstp, fbld
  lods[bwdq], stos[bwdq], scas[bwdq], movs[bwdq]
  all of these with REP prefixes, except rep movsb
For specifics, consult Agner Fog's instruction tables.

The fsin/fcos instructions can be wildly inaccurate for some arguments (for example near multiples of pi), so it is better to use glibc's implementation when calculating sines and cosines.

== Performance notes (Merom) ==
(cycles are the average reciprocal-throughput values from Agner Fog's tables)

`mov r,m`       1 cycle
`lea r,m`       1 cycle
`test r,r/i`    0.33 cycles
`test m,r/i`    1 cycle
`bt r,r/i`      1 cycle
`bt m,r`        5 cycles
  is slower than (?) `mov r,m` followed by `bt r,r`
`inc m`         1 cycle
  is faster than `inc r` followed by `mov m,r`
`cmp m,imm`     1 cycle
  is faster than `mov r,m` followed by `cmp r,imm`,
  unless more than 2 compares are done with the same register later on

== Floating point numbers ==
Checking whether an xmmword has a 0x00 byte can be done as follows:

  xorps    xmm0, xmm0    ; xmm0 = all zero bytes
  movq     xmm1, rax     ; low 8 bytes = rax, high 8 bytes are zeroed
  pcmpeqb  xmm1, xmm0    ; stores 0xff for every matched (zero) byte
  pmovmskb ecx, xmm1     ; copies the top bit of each byte into bits 0-15 of ecx

Because `movq` zeroes the upper half of xmm1, bits 8-15 of ecx are always set here. Only the low 8 bits (cl) tell whether rax contained a zero byte.

== Moving and converting values ==
The cvt- family of instructions converts between integers and floating-point formats.
To move a float into xmm1:

  mov      eax, 1000
  cvtsi2ss xmm1, eax     ; xmm1 = 1000.0

cvtsi2-- converts to a float ("si" stands for Scalar Integer) and cvtss2-- converts back to a dword/qword.
So `cvtsi2ss` converts a dword or qword to a single-precision float value, and `cvtss2si` converts one scalar single-precision float to a dword/qword.
Whether it is a qword or a dword depends on the size of the source/target register.

parameter passing order (System V AMD64): rdi, rsi, rdx, rcx, r8, r9
result register: rax
rdx:rax - used for idiv and imul and div and mul

other:
rsp     - stack pointer
rbp     - base/frame pointer, saved by callee
rbx     - saved by callee (us)
r8-r11  - misc
r12-r15 - misc, saved by callee

r8 to r11 are also called scratch registers.
We do not need to preserve their values as a callee.
unpreserved (caller-saved) registers: rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11

Conventions: https://en.wikipedia.org/wiki/X86_calling_conventions

== Windows ==
In Windows, the register order is as follows: rcx, rdx, r8, r9
More info at: https://www.nasm.us/xdoc/2.16.02rc5/html/nasmdo12.html#section-12.1
It is quite different and needs reading. Even the prologue and epilogue code is different:

{{{masm
;prologue
mov [RSP + 8], RCX
push R15
push R14
push R13
sub RSP, fixed-allocation-size
lea R13, 128[RSP]
}}}

{{{masm
;epilogue
add RSP, fixed-allocation-size
pop R13
pop R14
pop R15
ret
}}}

More info here: https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog?view=msvc-170

Apparently, we cannot really use `push` and `pop` for the extra parameters on the stack, because they inherently modify the `RSP` register, which might be what causes all these weird stack alignment issues.
Windows requires at least 0x20 bytes of home space for the four register parameters (rcx, rdx, r8, r9).

The stack parameters in Windows:

  [ other saved regs ]   rsp+0x40 .. etc.
  [ param 1 ]            rsp+0x38 / rbp+0x10 .. etc.
  [ param 2 ]            rsp+0x30 / rbp+0x8
  [ rbp pushed ]         rsp+0x28 (we need to skip this one over)
  [ local variables ]    <- rbp / rsp+0x20
  [ r9 home ]            rsp+0x18
  [ r8 home ]            rsp+0x10
  [ rdx home ]           rsp+0x8
  [ rcx home ]           <- rsp / rbp-0x20
  -- call happens
  [ return address ]     <- rsp-0x8

So we can't just push parameters on the stack before the call: that would place them below [ rcx home ] and shift the home location.
This is why functions can receive seemingly random values when this is not taken into account.

== Assembly tricks (NASM) ==

  mov [rbp-SDL_rect.x], word 1
  mov [rbp-SDL_rect.y], word 2
  mov [rbp-SDL_rect.w], word 3
  mov [rbp-SDL_rect.h], word 4

  ; easy encoding of 4 words of values into 1 64bit register
  ; the above is equivalent to this when it comes to structure
  ; and array initialization
  ; I think this is also known as a 'vectorized' instruction
  ; but just using the regular 64bit registers
  mov rdx, 1 | (2 << 16) | (3 << 32) | (4 << 48)
  mov [rbp-SDL_rect_address], rdx

=== PLT ===
Unix:
To refer to a function in the PLT, we have to use the `wrt ..plt` syntax:
`call SDL_Init wrt ..plt`

Windows:
In Windows, we use `wrt ..imagebase` instead.

== JMP Tables ==
All jumps in a jump table should have the 'near' keyword after `jmp` to make them all the same size.
NASM might pick a shorter encoding for a jump whose target is a lot closer than the others, which makes the size of a jump table entry hard to calculate because it can vary from entry to entry.
Doing `jmp near ` avoids this problem.

== GCC ==
Macro names should be ALL_CAPS when it is important to understand that it is a macro, and all_lowercase when it is meant to be treated as a function and is a macro purely for efficiency reasons.
In theory we can thus also use inline functions if we just want to inline things by default.
To always inline a function, we must use the following format:
`inline __attribute__ ((always_inline)) () { ... }`
"Function"-like macros thus aren't really necessary or beneficial unless the parameter type really doesn't matter.

== ISSUES ==
When interfacing with C, *make sure* to set up proper call stacks with function prologues.
Otherwise we get some sort of stack-based segfault as it tries to access memory.
This doesn't show up when the main function acts like _start, where there is no need to return and we just exit instead.
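The packed-immediate trick from the Assembly tricks section can be cross-checked from C. This is only a minimal sketch under assumptions: `struct rect16`, `pack4` and `rect_from_packed` are hypothetical names (not SDL types or functions), the target is little-endian, and the `always_inline` attribute is the GCC/Clang form from the GCC section, used here in place of a function-like macro.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a struct made of four 16-bit fields. */
struct rect16 {
    uint16_t x, y, w, h;
};

/* Pack four 16-bit values into one 64-bit word, first field in the lowest bits.
 * Mirrors: mov rdx, 1 | (2 << 16) | (3 << 32) | (4 << 48) */
static inline __attribute__((always_inline))
uint64_t pack4(uint16_t x, uint16_t y, uint16_t w, uint16_t h)
{
    return (uint64_t)x
         | ((uint64_t)y << 16)
         | ((uint64_t)w << 32)
         | ((uint64_t)h << 48);
}

/* Initialise the whole struct from one 64-bit value.
 * Assumes a little-endian target, as on x86-64. */
static struct rect16 rect_from_packed(uint64_t packed)
{
    struct rect16 r;
    memcpy(&r, &packed, sizeof r);
    return r;
}
```

Calling `rect_from_packed(pack4(1, 2, 3, 4))` should give the same field values as the four separate word stores, since on a little-endian machine the lowest 16 bits land in the first field.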
== CPU BUGS ==
FSRM (fast short repeat move)
https://www.techradar.com/pro/security/a-cpu-mystery-intel-just-fixed-a-huge-security-flaw-affecting-nearly-every-cpu-out-there-today

A CPU with this bug basically breaks completely: JMP instructions are ignored, and XSAVE and CALL instructions no longer record the instruction pointer (RIP) correctly.
A debugger would report impossible states.

Affected CPUs (fairly new):
https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
CVE-2023-23583

== Misc info (IRC) ==
12:42 what does it mean to 'move data using non-temporal hint' ?
12:44 like with the MOVNTI instruction
12:44 Non-Temporal SSE instructions (MOVNTI, MOVNTQ, etc.), don't follow the normal cache-coherency rules. Therefore non-temporal stores must be followed by an SFENCE instruction in order for their results to be seen by other processors in a timely fashion.
12:45 The "non temporal" phrase means lacking temporal locality. Caches exploit two kinds of locality - spatial and temporal, and by using a non-temporal instruction you're signaling to the processor that you don't expect the data item be used in the near future.
13:57 I see, thanks. An 'near future' means how long into the future? miliseconds, seconds, cycles?
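The MOVNTI plus SFENCE pattern described in the log can be sketched in C with compiler intrinsics. This is a minimal illustration, not a tuned routine: `fill_nt` is a hypothetical helper name, and the code assumes an x86 target with SSE2, where `_mm_stream_si32` maps to MOVNTI and `_mm_sfence` to SFENCE.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si32 (MOVNTI) */
#include <xmmintrin.h>  /* SSE:  _mm_sfence (SFENCE) */

/* Fill dst[0..n-1] with value using non-temporal stores.
 * The stores hint that the data will not be read again soon,
 * so the CPU may bypass the cache when writing them. */
static void fill_nt(int *dst, size_t n, int value)
{
    for (size_t i = 0; i < n; i++) {
        _mm_stream_si32(&dst[i], value);  /* non-temporal 32-bit store */
    }
    /* Order the streamed stores before any later stores, so that
     * other processors see them in a timely fashion. */
    _mm_sfence();
}
```

Putting the fence once after the loop, rather than after every store, keeps the ordering guarantee mentioned in the log while leaving the CPU free to combine the streamed writes.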