[HN Gopher] i386 Assembly Language trick for storing data in .text
___________________________________________________________________
i386 Assembly Language trick for storing data in .text
Author : ingve
Score : 127 points
Date : 2023-11-09 06:47 UTC (16 hours ago)
(HTM) web link (ratfactor.com)
(TXT) w3m dump (ratfactor.com)
| majke wrote:
| Yeah, in i386 syntax there is no way to address EIP directly.
| Popping EIP after a call is a common trick.
|
| In newer processors there is a cache for return addresses, the
| Return Stack Buffer (RSB).
|
| But there is a penalty for doing call and never doing ret.
|
| From Intel's Optimization Reference Manual:[2]
|
| "The return address stack mechanism augments the static and
| dynamic predictors to optimize specifically for calls and
| returns. It holds 16 entries, which is large enough to cover the
| call depth of most programs. If there is a chain of more than 16
| nested calls and more than 16 returns in rapid succession,
| performance may degrade.
|
| [...] To enable the use of the return stack mechanism, calls and
| returns must be matched in pairs. If this is done, the likelihood
| of exceeding the stack depth in a manner that will impact
| performance is very low."
|
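| A toy model may make the "matched in pairs" point concrete. This
| is a sketch, not how any real predictor is built: the class name,
| the addresses and the two scenarios are made up for illustration,
| and it ignores any special-casing real hardware might apply. It
| only shows that a call whose return address is popped in software
| leaves a stale entry behind, so the returns that follow mispredict.
|
```python
from collections import deque


class ReturnStackBuffer:
    """Toy return-address predictor with a fixed number of entries."""

    def __init__(self, depth=16):
        # deque(maxlen=...) silently drops the oldest entry on overflow.
        self.entries = deque(maxlen=depth)
        self.mispredicts = 0

    def call(self, return_address):
        # Every CALL records where the matching RET is expected to go.
        self.entries.append(return_address)

    def ret(self, actual_target):
        # Every RET is predicted from the most recent entry.
        predicted = self.entries.pop() if self.entries else None
        if predicted != actual_target:
            self.mispredicts += 1


def run_call_pop():
    """main calls f, f calls g, g does `call next / next: pop`."""
    rsb = ReturnStackBuffer()
    rsb.call(0x1000)  # main -> f
    rsb.call(0x2000)  # f -> g
    rsb.call(0x3000)  # call next; software pops the address, no RET follows
    rsb.ret(0x2000)   # g returns to f, but the predictor offers 0x3000
    rsb.ret(0x1000)   # f returns to main, but the predictor offers 0x2000
    return rsb.mispredicts


def run_thunk():
    """Same call chain, but g calls a helper that reads [esp] and RETs."""
    rsb = ReturnStackBuffer()
    rsb.call(0x1000)  # main -> f
    rsb.call(0x2000)  # f -> g
    rsb.call(0x3000)  # g -> helper
    rsb.ret(0x3000)   # helper returns normally
    rsb.ret(0x2000)   # g returns to f
    rsb.ret(0x1000)   # f returns to main
    return rsb.mispredicts


print(run_call_pop(), run_thunk())
```
|
| In this model the call/pop variant mispredicts both later returns,
| while the variant that returns normally mispredicts none.
|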
| This trick is also what I understand retpolines are about:
|
| Citing kernel doc
|
| The kernel can protect itself against consuming poisoned branch
| target buffer entries by using return trampolines (also known as
| "retpoline") for all indirect branches. Return trampolines trap
| speculative execution paths to prevent jumping to gadget code
| during speculative execution. x86 CPUs with Enhanced Indirect
| Branch Restricted Speculation (Enhanced IBRS) available in
| hardware should use the feature to mitigate Spectre variant 2
| instead of retpoline. Enhanced IBRS is more efficient than
| retpoline.
|
| retbleed https://lwn.net/Articles/901834/
|
| [2] https://discourse.llvm.org/t/is-pic-code-defeating-the-
| branc...
| pjc50 wrote:
| > there is no way to address EIP directly
|
| In general this is a "thing" on pipelined processors,
| especially once you have out-of-order, because "current
| instruction" starts to smear out across a bunch of
| instructions. But for CALL the processor has to dump the
| pipeline and pick a specific return address to put on the
| stack.
|
| (unless you're MIPS with the branch delay slot nonsense)
| planede wrote:
| My impression was that pipelined processors pipeline across
| CALL instructions. They even speculate across indirect calls.
| How you describe this is as if there was a significant
| penalty to reading EIP or calling a function on a pipelined
| processor, but I don't think that's true.
| phire wrote:
| Don't forget, the CPU needs to know the current PC for
| relative branch instructions too.
|
| Relative branch instructions and calls are pretty common, so
| flushing the entire pipeline just to get the current PC would
| be way too expensive.
|
| So pipelined CPUs actually implement extra resources to
| track the current PC of every single instruction (at least as
| far as the execute stage) just so they can get the PC
| rapidly.
|
| And you will find almost every single modern CPU has an
| instruction to copy the current PC to a register.
| titzer wrote:
| > has to dump the pipeline
|
| Modern Intel processors have a "stack engine" (similar to the
| return stack buffer) that speeds up access through RSP. But
| regardless, there's no need to dump any pipeline; a call just
| has an implicit store to memory and the RSP gets updated
| (register-renamed, really). Calls are very, very fast these
| days.
| adrian_b wrote:
| There was no way in 80386, except by pushing EIP on the stack
| within a CALL.
|
| In 64-bit Intel/AMD CPUs, the instruction LEA (load effective
| address), with an address relative to RIP, can be used to
| save the instruction pointer in a register, then a jump to
| any address will arrive there providing the old instruction
| pointer in the register, ready to be used without popping it
| from the stack.
|
| In all modern pipelined CPUs there are multiple copies of the
| instruction pointer, one that runs ahead providing the
| instruction fetch addresses, and an older value that provides
| base addresses for relative jumps and for relative loads or
| stores.
| ajross wrote:
| > In general this is a "thing" on pipelined processors
|
| It's actually not? Or, indeed, it's a hard problem. And
| modern designs certainly don't expose the instruction counter
| as a general register anymore (ARM32 PC is the last of its
| kind).
|
| But having an IP-relative data addressing mode is a critical
| feature for any reasonable modern device, for exactly the
| reasons detailed here: you want constants stored with your
| compiled code without having to incur overhead (c.f. the
| linked article, or the GOT/PLT indirection in shared
| libraries, etc...) to get it.
| sweetjuly wrote:
| This is generally not true. While OOO processors generally
| don't love passing PC around (and they usually don't since
| most instructions won't need it), RISC-V, ARMv7, and
| ARMv8 all provide mechanisms to access PC:
|
| ARMv7: you can just use PC directly as a register
|
| ARMv8: adr Rd, #0 will move PC into Rd
|
| RISC-V: auipc Rd, 0 will move PC into Rd
|
| Usually what implementers tend to do to avoid piping around
| massive 64-bit addresses for no good reason is that they form
| fetch groups (a linear series of instructions starting at a
| base PC) and then number instructions both by their group ID
| and their position in the group. The base PC for each group
| is then stored in an array indexed over the fetch group. If
| an instruction later needs its PC (such as to perform a
| branch or because it caused an exception), a request is sent
| to the fetch group array to fetch the base PC for the
| operation which can then be used in conjunction with the
| offset to reconstruct the original PC.
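|
| The fetch-group bookkeeping described above can be sketched as
| follows. This is only a software analogy: the class and method
| names are hypothetical, instructions are assumed to be a fixed 4
| bytes wide, and real hardware would use a small fixed-size array
| whose entries get recycled.
|
```python
INSTR_BYTES = 4  # assumption: fixed-width 4-byte instructions


class FetchGroupTable:
    """Stores one base PC per fetch group instead of one PC per instruction."""

    def __init__(self):
        self.base_pcs = []

    def allocate(self, base_pc):
        # A new fetch group starts at base_pc; return its group ID.
        self.base_pcs.append(base_pc)
        return len(self.base_pcs) - 1

    def reconstruct_pc(self, group_id, position):
        # Base PC of the group plus the instruction's offset within it.
        return self.base_pcs[group_id] + position * INSTR_BYTES


table = FetchGroupTable()
first = table.allocate(0x8000)
second = table.allocate(0x9000)  # e.g. the target of a taken branch

# An instruction carries only (group ID, position), not a full 64-bit PC.
print(hex(table.reconstruct_pc(second, 3)))
```
|
| Only instructions that actually need their PC (a branch, or one
| that faults) pay for the lookup.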
| im3w1l wrote:
| > In newer processors there exists a cache for return addresses
| Return Stack Buffer (RSB). But there is a penalty for doing
| call and never doing ret.
|
| I think you could play nice by actually doing the ret instead
| of popping EIP. Something like
|
|     GET_EIP:
|         mov eax, [esp]
|         ret
|
| And then call GET_EIP
| nynyny7 wrote:
| I don't get the purpose, at least of his minimal example. The
| author says he wants to make his code position-independent, i.e.,
| so that it can be executed from everywhere in memory (without
| relocation). But that is defeated by the...
|
| mov edx, print
|
| ... in the example.
| yenz0r wrote:
| Yeah the example won't work, but since it's only used for
| getting the length of the string it's an easy fix to instead
| use pascal/counted strings with a length prefix byte.
| 0x0 wrote:
| They should have put labels in front of and after the string
| bytes, then most assemblers would evaluate "(labelafter -
| labelbefore)" to a constant integer giving the length as
| needed. No need for a runtime sub instruction either, then.
| messe wrote:
| I love using a variant on that trick in real mode code:
|
|     print:
|         pop si
|     .loop:
|         lodsb
|         jz .end
|         mov ah, 0x0E
|         xor bh, bh
|         int 10h
|         jmp .loop
|     .end:
|         push si
|         ret
|
| I'm writing this from memory, so there may be an off by one error
| in the above code.
|
| It's used like this, with a null terminated string, rather than a
| hardcoded length:
|
|     call print
|     db "hello, world", 0
|
| This can even be transformed into something like
|
|     puts "hello, world"
|
| with the aid of NASM macros. I can't recall where I saw this
| trick originally. Maybe some FreeDOS or GRUB code.
| stevekemp wrote:
| I first saw this in virus-code from the 80s, where you'd have
| code to get the current location:
|
|         call next
|     next:
|         pop ax
|
| I've used the same approach for printing "inline" strings
| myself, though in my case I tend to be working with CP/M and
| there the strings are terminated with "$".
| EvanAnderson wrote:
| This is exactly what I thought of. Learning x86 assembler in
| the context of reverse engineering MS-DOS made this trick
| seem perfectly normal (as did the idea of writing position
| independent code).
| stevekemp wrote:
| A later comment in this discussion reminds me that this was
| called "calculating the delta-offset".
| EvanAnderson wrote:
| Yep. That's the terminology I'd expect to see in a 40Hex!
| >smile<
| akoboldfrying wrote:
| Cute! And the time overhead vs. the usual stack-based parameter
| passing convention is roughly zero, since even though the
| callee has to "push si" at the end, the caller needs zero
| instructions to pass the argument, instead of the usual one.
| messe wrote:
| Too late for me to edit, but my code is missing a "test al, al"
| after "lodsb".
| amluto wrote:
| The fact that code like this would get acceptable performance
| is amazing! By modern standards, there's maybe a few cycles
| (assuming no cache misses) in the loop body plus that INT
| instruction. That's maybe 20k cycles for the round trip (read
| IDT and GDT, go through the whole awful ucode flow, and jump to
| kernel, then do the work, and then do IRET, which is, again,
| amazingly slow).
|
| Fortunately, I'm pretty sure the CPUs that were intended to run
| this code were rather more efficient at interrupts (in terms of
| cycles) than modern x86 monsters.
|
| Intel is at least trying to fix this with FRED.
| OhNoNotAgain_99 wrote:
| interesting but can you still get i386's?
| blueflow wrote:
| This article is referring to Intel's 32-bit instruction set,
| which seemingly all x86 machines still support.
| phire wrote:
| All 32bit/64bit x86 machines.
|
| I believe you can still buy new 8086 class and 286 class
| 16bit x86 cores in random SoCs.
|
| And I know you can buy SoCs with 386 class x86 cores. It
| might be more accurate to describe them as 486 class cores
| that don't implement the full 486 instruction set.
| Dwedit wrote:
| It's what the architecture is named. 32-bit mode code for intel
| processors is still called i386, even if that processor is
| decades old.
|
| The most significant (non-SIMD) change was adding in CMOV.
| irdc wrote:
| Tricks like these are going to become less common with execute-
| only mapping of .text slowly proliferating through the industry
| (iOS, OpenBSD).
|
| Though i386 is unlikely to ever become execute-only.
| H8crilA wrote:
| Is that a security measure? What would execute-only prevent?
| irdc wrote:
| Yeah, it makes constructing ROP chains slightly more
| difficult when combined with ASLR and the like as you cannot
| defeat the randomisation by inspecting the running binary.
| H8crilA wrote:
| As in you already roughly know where code is mapped, but
| need the lower bits of the offset? Or also to learn the
| specific version of the running code?
| irdc wrote:
| A successful ROP attack requires the exact addresses of
| the various gadgets used (refer to a definition of ROP if
| this is unclear, as I'm currently on mobile). ASLR
| thwarts this, as does the libc layout randomisation that
| OpenBSD does on every boot. However, it's not perfect,
| and if you can read program memory you could scan for
| gadgets at run-time. This last point is prevented by
| execute-only.
| H8crilA wrote:
| Ah, but you first need to have even an approximate idea
| of where some code is mapped, otherwise you'll fault on
| nearly all requests into a 64 bit space.
| irdc wrote:
| Yes, that's true. That's where infoleaks come in. Plus a
| lot of crashes are likely not even noticed, or blamed on
| the software just being buggy. Repeatedly crashing a
| fork()'ing server might just give you enough information
| to reconstruct its memory layout (which doesn't vary
| between parent and child processes after a fork(), which
| is why OpenSSH does an execve() of itself after
| fork()'ing).
| H8crilA wrote:
| I see, but for a 1GiB mapped code space we're talking
| here about 2^64/(1 Gi) = 17'179'869'184 attempts, or
| perhaps about half of that with average luck.
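|
| The estimate above can be reproduced in a few lines. The sketch
| keeps the comment's own assumptions (a full 64-bit address space
| and guesses at 1 GiB granularity); aligned_slots is a made-up
| helper name, and smaller address spaces shrink the number a lot.
|
```python
def aligned_slots(address_bits, region_bytes):
    """Number of region-sized, region-aligned positions in the address space."""
    return 2**address_bits // region_bytes


GIB = 2**30
slots = aligned_slots(64, GIB)

# With no information, a blind search needs about half the slots on average.
print(f"{slots:,} slots, about {slots // 2:,} guesses on average")
```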
| Findecanor wrote:
| There are also attacks such as "JIT spraying" where JIT-
| compiled code contains large constants that the runtime
| gets tricked into jumping into. Execute-only would make
| that attack a little less likely.
| PrimeMcFly wrote:
| > Tricks like these are going to become less common with
| execute-only mapping of .text slowly proliferating through the
| industry (iOS, OpenBSD).
|
| Give the PaX project some credit, since they had it before
| OpenBSD did. Windows has had it for a while also, since XP.
| taway1237 wrote:
| Is this some obscure feature of Windows? In my experience,
| while code sections are almost never writable, they're always
| readable.
| PrimeMcFly wrote:
| I was wrong, I was thinking of DEP which is quite
| different.
| irdc wrote:
| > Give the PaX project some credit, since they had it before
| OpenBSD did.
|
| I didn't know that and cannot find anything that confirms
| this. You have a source?
| adastra22 wrote:
| 32-bit intel ISA supports execute-only memory pages.
| irdc wrote:
| Only through segmentation hacks right? In page table entries,
| execute doesn't have a separate flag but shares it with read.
| blibble wrote:
| execute disable was added with the pentium 4 (but needs pae
| page tables)
| irdc wrote:
| Yes, but execute _disable_ is not the same as execute
| _only_. AFAIK there's no way to prevent executable pages
| from being readable using only the i386/amd64 page table.
| amluto wrote:
| You can fudge it with protection keys (poorly), and you
| can do it for real with EPT tricks.
| Findecanor wrote:
| Only through segmentation, I think. However x86-S is supposed
| to force a flat memory model for 32-bit programs as well.
|
| On recent Intel processors, it is possible to execute-only
| protect pages using Intel MPK (Memory-Protection Keys) by
| having pages with a key be read-only in the page table but
| "access disable" in the PKRU register. PKRU is accessible
| from user mode though.
|
| AFAIK, the only (still) mainstream CPU arch with reliable
| execute-only protection is RISC-V. (I would like to be wrong,
| and see it on e.g. ARM as well)
| ehaliewicz2 wrote:
| execute-only as in no reading?
| qweqwe14 wrote:
| This trick isn't i386-specific. In general, you can merge .data,
| .rodata etc into one section with a linker script and it will
| just work, pretty useful for saving a few bytes.
|
| Also see sstrip for ELF files and this legendary writeup
| https://www.muppetlabs.com/~breadbox/software/tiny/teensy.ht...
| flohofwoe wrote:
| A similar "trick" was used on some 8-bit home computers for
| passing (optionally variable-length) data to operating system
| calls.
|
| For instance on the KC85/2..4 operating system (CAOS) the
| equivalent of "puts()" expects the "syscall index" and zero-
| terminated text to print after the call instruction, e.g.:
|
|     CALL 0F003H          ; call into generic "syscall" entry
|     DEFB 23H             ; "syscall" identifier
|     DEFM 'HELLO WORLD!'
|     DEFW 0D0AH           ; newline
|     DEFB 00              ; end of text
|     NOP                  ; execution continues here
|
| The syscall dispatcher would pop the return address from the
| stack and that way discover the data. Before the syscall returns,
| a modified return address which points to the first byte after
| the data is pushed back on the stack.
|
| Only downside of this approach was that disassemblers would get
| terribly confused, unless they had specific knowledge about this
| CAOS peculiarity.
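|
| The pop / scan / push-back sequence described above can be
| modelled with a toy machine. Everything here is hypothetical (the
| two opcodes, the byte layout, the function names); it is not CAOS
| or Z80 code, just the control flow: the callee takes the return
| address as a pointer to its data and replaces it with the address
| just past the terminator.
|
```python
CALL_PRINT = 0xCD  # made-up opcode: call the inline-print routine
HALT = 0x76        # made-up opcode: stop the toy machine


def print_inline(memory, stack, out):
    addr = stack.pop()        # return address = first byte of the inline data
    while memory[addr] != 0:  # copy characters up to the zero terminator
        out.append(chr(memory[addr]))
        addr += 1
    stack.append(addr + 1)    # modified return address: first byte after the data


def run(memory):
    pc, stack, out = 0, [], []
    while memory[pc] != HALT:
        if memory[pc] != CALL_PRINT:
            raise ValueError(f"unknown opcode at {pc}")
        stack.append(pc + 1)  # CALL pushes the address of the next byte
        print_inline(memory, stack, out)
        pc = stack.pop()      # RET resumes after the inline data
    return "".join(out)


program = (
    bytes([CALL_PRINT]) + b"HELLO\0"
    + bytes([CALL_PRINT]) + b" WORLD\0"
    + bytes([HALT])
)
print(run(program))
```
|
| A linear disassembler that does not know about the convention
| would try to decode the text bytes as instructions, which is the
| confusion mentioned above.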
| warpspin wrote:
| Yes. C64 GEOS also used this a lot. They used to call it
| "inline calls":
|
| https://archive.org/details/The_Official_GEOS_Programmers_Re...
| n_plus_1_acc wrote:
| The TI 83 family of calculators with a Z80 also use this.
| PinguTS wrote:
| Oh, yeah, KC85. We had them in school, when I was in my early
| teens.
| jsymolon wrote:
| Apple II, DOS and PRODOS calls do that too.
|
| https://prodos8.com/docs/techref/calls-to-the-mli/
| tenebrisalietum wrote:
| The C128 had a KERNAL routine called PRIMM that did that.
| cancerhacker wrote:
| Classic Mac used this for some toolbox traps as well. Most apps
| used a jump table at some offset from the A5 register, which
| looked like:
|
|     addr:   _LoadSeg
|             dc.w segmentNumber
|             dc.w segmentOffset
|
| The _LoadSeg trap would ensure that 'CODE'(segmentNumber) was
| loaded from disk and then modify the jump code @addr to become
| an absolute JMP (0x4ef9 + 32 bits) and then set the PC back to
| @addr and return from the trap. There was also an _UnloadSeg
| mechanism that would reverse this!
| BruceEel wrote:
| Well yes, as said here, it's more of a linker thing and not so
| much a language or assembly thing. On Windows you could do the
| below to have a single, executable, readable and writable
| section. Not sure whether it still works anno 2023.
|
| It's generally considered a bad idea from a security standpoint.
|
|     #pragma comment(linker,"/MERGE:.data=.text /MERGE:.rdata=.text /MERGE:.flat=.text /SECTION:.text,EWR ")
| taway1237 wrote:
| The article is about call+pop "trick" in assembly, linker is
| not relevant here.
| _nalply wrote:
| Right, but that trick is not so useful if you have a
| different section than .text only, and that's what GP is
| referring to.
| taway1237 wrote:
| I disagree. For me it's useful mostly for position
| independent shellcode prologue, which has no sections to
| speak of, and may get embedded in a "normal" executable or
| something that is not an executable at all (useful in a
| bootloader, or for injecting code to another process, or
| self-relocating code, etc). I use this "trick" all the time
| and I never felt the need to mess with a linker for this.
|
| But it's a good hint, I hope I didn't sound overly
| negative.
| _nalply wrote:
| Your point is interesting. I didn't think about this use
| case. Inject code with ptrace. Like the LD_PRELOAD trick
| but you don't even need LD_PRELOAD, just attach and
| bamboozle the running process into running some code you
| provided. In such cases sections don't exist, but pages.
| Right.
| Dwedit wrote:
| It happens to be a lot easier to reverse-engineer a program
| where the sections are not combined, and you can predict that
| strings will reliably be in the .rdata section. While it does
| save a few KB, it just makes things so much nicer for the next
| people who need to patch features into the binary manually.
| PinguTS wrote:
| It seems I'm getting old. What's the trick here? Is this a trick?
| Yeah, that is how we did things in the past. That's when I
| learned programming and then "hacked in" the hex codes on a hex
| keypad on a Z80. That's when I learned programming on my first
| 8086. You tried to figure out what caused the least overhead.
| That meant saving space on instructions and in processing
| power/speed. But then I learned that this is called Spaghetti
| code.
| self_awareness wrote:
| You knew it. New generation didn't. This is how the world
| works.
| polynomial wrote:
| This may be one of the most surprising things I have learned
| in my life.
| self_awareness wrote:
| Reading EIP through CALL is called "delta addressing", and it was
| a common technique in malware back in the days when viruses were
| infecting executable files (nowadays this doesn't exist because
| of digitally signed code on all major platforms except Linux)
| hun3 wrote:
| Reading the call return address is basically how you write
| position-independent code (relocatable without modifying the on-
| memory executable image).
|
| On Linux there's a stub subroutine that does exactly that:
| __i686.get_pc_thunk.<reg>.
|
| Here's the entire subroutine:
|
|     __i686.get_pc_thunk.bx:
|         MOV EBX, DWORD PTR [ESP]
|         RET
|
| Yup, that's all. If you compile with gcc -m32 -fPIC, you'll see a
| call to that thunk whenever a function accesses GOT or other
| relocatable symbols.
| russdill wrote:
| I've also seen:
|
|     call 1f
|     1: <next instruction>
|
| I've seen it so commonly that I hadn't considered that people
| thought getting the EIP on x86 was an obstacle.
| krylon wrote:
| I discovered this trick in 2008, during my one excursion into
| assembly programming. But it was for a different purpose. I even
| did it in inline binary. I felt so clever. X-D
| iefbr14 wrote:
| In '75 we re-used the memory of code that was executed only once
| to store stuff with IBM's system/370 assembler.
| layer8 wrote:
| I never get used to the fact that the segment for executable
| machine code is called "text". Anyone know the history of that?
| projektfu wrote:
| It goes back to at least OS/360 (TXT Record), probably earlier.
| It follows from referring to the text of the program vs the
| data.
| rbanffy wrote:
| One interesting advantage of very small programs in the age of
| slow storage was that, if they fit in one disk block, they'd skip
| one drive seek and read the whole file from the block indicated
| in the directory entry.
| snickerbockers wrote:
| I see this a lot reverse engineering programs made for an older
| ISA from the 90s called SH4. It's a 32-bit RISC that uses 16-bit
| instructions[1] and is therefore unable to load more than 8 bits
| of arbitrary immediate data (sometimes 12 but usually 8) into a
| register without spreading the operation over several
| instructions so most functions will have large blocks of data at
| the end (and sometimes even in the middle, because it needs to
| get a pointer to the data by offset from the PC and the
| instruction format is limited to 8 bit offsets) where they load
| in constant values and pointers. I'm pretty sure gcc even does
| this. I see it so often it never occurred to me this would be
| unusual on other CPUs.
|
| [1] doubles the effective size of the instruction cache and also
| makes dual pipelines easier to implement because they can both
| fetch from the same bus at the same time. Legend has it that this
| was successful enough to be the inspiration behind thumb mode on
| ARM, which is also a 32 bit ISA with 16 bit instructions.
| projektfu wrote:
| It's 8086's history as a descendant of a limited 8-bit ISA that
| made it lack PC-relative addressing. I'm not sure why it was
| never added in all of its iterations until x64.
|
| Other ISAs from the minicomputer age (PDP-11) and their
| descendants and inspirations (H8, 68k) had it. Zilog added PC
| relative loads and address calculations to the Z8000, and it's
| a generally popular form now in x64.
|
| The Unix V6 assembly source code is very readable because of
| this and also I think it was unfortunately responsible for the
| 0-terminated string use because of the ease of writing it that
| way.
| NobodyNada wrote:
| This is also very common on ARM; they call it a "literal pool":
| https://developer.arm.com/documentation/dui0473/m/writing-ar...
| duskwuff wrote:
| The Thumb encoding for ARM also has some _very_ clever
| encodings for inline constants:
|
| https://developer.arm.com/documentation/ddi0308/d/Thumb-
| Inst...
|
| Specifically, it can encode any 8-bit value rotated by any
| number of bits, as well as any value of the form 0x00XY00XY,
| 0xXY00XY00, or 0xXYXYXYXY. Combined with the use of inverted
| instructions (e.g. MVN instead of MOV, SUB instead of ADD,
| BIC instead of AND, etc), this covers a surprising number of
| the 32-bit constants which are likely to appear in a program.
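|
| To get a feel for which constants are covered, the sketch below
| enumerates all 4096 encodings. It follows the Thumb-2 "modified
| immediate" expansion rule as I understand it from the ARMv7
| manuals, so treat the details as an assumption; pick_mov is a
| made-up helper, not anything an assembler exposes.
|
```python
def thumb_expand_imm(imm12):
    """Expand a 12-bit Thumb-2 modified immediate; None if unpredictable."""
    byte = imm12 & 0xFF
    if (imm12 >> 10) == 0:
        mode = (imm12 >> 8) & 0x3
        if mode == 0:
            return byte                    # 0x000000XY
        if byte == 0:
            return None                    # replicated forms need a nonzero byte
        if mode == 1:
            return byte << 16 | byte       # 0x00XY00XY
        if mode == 2:
            return byte << 24 | byte << 8  # 0xXY00XY00
        return byte * 0x01010101           # 0xXYXYXYXY
    # Otherwise: an 8-bit value with its top bit forced to 1, rotated right.
    value = 0x80 | (imm12 & 0x7F)
    rotation = imm12 >> 7                  # always 8..31 in this branch
    return ((value >> rotation) | (value << (32 - rotation))) & 0xFFFFFFFF


ENCODABLE = {thumb_expand_imm(i) for i in range(4096)} - {None}


def pick_mov(constant):
    """Return 'MOV', 'MVN' (load the inverted constant) or None."""
    if constant in ENCODABLE:
        return "MOV"
    if (~constant & 0xFFFFFFFF) in ENCODABLE:
        return "MVN"
    return None


for c in (0x00FF00FF, 0xFF000000, 0xFFFFFF00, 0x12345678):
    print(f"{c:#010x}", pick_mov(c))
```
|
| Constants that come back as None are the ones that end up in a
| literal pool or get built with more than one instruction.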
| ipython wrote:
| This has been used for decades by malware and shellcode that
| needs to be compact and position-independent (loaded at any
| virtual address). It is clever, and a side effect is that the address
| of the string is already on the stack, so if your next step is to
| call a function with that string as an argument, you can just
| call that function directly.
|
| It used to confuse a lot of disassemblers, where you'd have to
| re-synchronize the disassembly after the string and disambiguate
| between 'code' and 'data' by hand.
| ithkuil wrote:
| pdp-11 had a very elegant unification of "immediate operand" and
| "pc-relative addressing".
|
| Basically one of the addressing modes is "access word pointed to
| by register+offset and post-increment the register by word size
| (2 bytes)".
|
| That can be used to pop off a word from the stack if the register
| is the stack pointer, but if the register is the program counter,
| that basically reads the word following the instruction and
| causes the CPU to continue execution after the immediate data.
|
| A truly orthogonal instruction set :-)
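|
| A minimal model of that unification, assuming the plain
| autoincrement form (no offset) and 16-bit little-endian words. The
| helper name and the addresses are invented for the example: the
| same operand fetch acts as a pop when the register is SP and as an
| immediate load when it is PC.
|
```python
PC, SP = 7, 6  # on the PDP-11, R7 is the program counter and R6 the stack pointer


def read_autoincrement(memory, regs, reg):
    """Autoincrement operand fetch: read the word at (Rn), then Rn += 2."""
    addr = regs[reg]
    word = memory[addr] | (memory[addr + 1] << 8)  # 16-bit little-endian word
    regs[reg] = addr + 2
    return word


memory = bytearray(0x400)
memory[0o1002:0o1004] = (0x1234).to_bytes(2, "little")  # word following the opcode
memory[0o0776:0o1000] = (0xBEEF).to_bytes(2, "little")  # word on top of the stack

regs = [0] * 8
regs[PC] = 0o1002  # the opcode at 0o1000 was fetched, so PC already points past it
regs[SP] = 0o0776

immediate = read_autoincrement(memory, regs, PC)  # behaves like an immediate operand
popped = read_autoincrement(memory, regs, SP)     # behaves like a pop
print(hex(immediate), oct(regs[PC]), hex(popped), oct(regs[SP]))
```
|
| After the fetch through PC, execution continues at the word after
| the inline data, with no dedicated "immediate" mode needed.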
| higherhalf wrote:
| This can be done even simpler:
|
|     global _start
|     _start:
|         jmp next
|     string: db `Hello World!\n`
|     len:    equ $ - string
|     next:
|         mov ecx, string
|         mov edx, len
|         mov ebx, 1
|         mov eax, 4
|         int 80h
|         mov ebx, 0
|         mov eax, 1
|         int 80h
|
| For NASM, it can also be put into a macro, for example printing
| to video memory at 0xb8000:
|
|     %macro print 1
|         mov ecx, %%loop_start - %%strdata
|         mov eax, 0x0700
|         jmp %%loop_start
|     %%strdata: db %1
|     %%loop_start:
|         mov al, [%%strdata + ecx - 1]
|         mov [0xb8000 + ecx * 2 - 2], ax
|         loop %%loop_start
|     %endmacro
| projektfu wrote:
| That will require a fixup or a fixed load address. The example
| in the article is position independent.
| fargle wrote:
| the author wanted it to be position-independent (PIC), so it
| works no matter what address the .text segment is loaded to and
| run.
|
| This example uses a fixed symbolic reference ("string:") and is
| the normal way to do it. The trick is to do it in a PC relative
| way.
| hota_mazi wrote:
| The first time I saw this trick was with ProDOS on the Apple ][,
| circa 1983.
|
| ProDOS came up with this new call syntax where the parameters to
| the API follow the call to it.
|
| For example:
|
|            ldx #$00
|            ldy #$10
|            sty params+4
|            stx params+5   ; setup number of bytes to read (16)
|            jsr $BF00      ; call ProDOS
|            .BYTE $CA      ; ProDOS command number = CA (read)
|            .WORD params   ; address of parameter table, lo/hi
|            bcs error      ; carry set, error
|            .
|            .
|     params .BYTE $04      ; number of parameters for a read
|            .BYTE $00      ; file reference number, 0, 1, 2 in MacQForth
|            .WORD BUFFER   ; pointer to data buffer
|            .WORD $0000    ; requested number of bytes to read, fill in
|            .WORD $0000    ; number actually read, returned by ProDOS
___________________________________________________________________
(page generated 2023-11-09 23:01 UTC)