https://gpfault.net/posts/asm-tut-0.txt.html
Let's Learn x86-64 Assembly! Part 0 - Setup and First Steps
published on Apr 18 2020
The way I was taught x86 assembly at the university had been
completely outdated for many years by the time I had my first class.
It was around 2008 or 2009, and 64-bit processors had already started
becoming a thing even in my neck of the woods. Meanwhile, we were
doing DOS, real-mode, memory segmentation and all the other stuff
from the bad old days.
Nevertheless, I picked up enough of it during the classes (and over
the subsequent years) to be able to understand the stuff coming out
of the other end of a compiler, and that has helped me a few times.
However, I've never manually written any substantial amount of x86
assembly for something non-trivial. Due to being locked up inside (on
account of a global pandemic), I decided to change that situation, to
pass the time.
I wanted to focus on x86-64 specifically, and completely forget/skip
any and all legacy crap that is no longer relevant for this
architecture. After getting a bit deeper into it, I also decided to
publish my notes in the form of tutorials on this blog since there
seems to be a desire for this type of content.
Everything I write in these posts will be a normal, 64-bit, Windows
program. We'll be using Windows because that is the OS I'm running on
all of my non-work machines, and when you drop down to the level of
writing assembly it becomes increasingly hard to ignore
the operating system you're running on. I will also try to go as
"from scratch" as possible - no libraries, we're only allowed to call
out to the operating system and that's it.
In this first, introductory part (yeah, I'm planning a series and I
know I will regret this later), I will talk about the tools we will
need, show how to use them, explain how I generally think about
programming in assembly and show how to write what is perhaps the
smallest viable Windows program.
Getting the Tools
There are two main tools that we will use throughout this series.
Assembler
CPUs execute machine code - an efficient representation of
instructions for the processor that is almost completely impenetrable
to humans. The assembly language is a human-readable representation
of it. A program that converts this symbolic representation into
machine code ready to be executed by a CPU is called an assembler.
There is no single, agreed-upon standard for x86-64 assembly
language. There are many assemblers out there, and even though some
of them share a great deal of similarities, each has its own set of
features and quirks. It is therefore important which assembler you
choose. In this series, we will be using Flat Assembler (or FASM for
short). I like it because it's small, easy to obtain and use, has a
nice macro system and comes with a handy little editor.
Debugger
Another important tool is the debugger. We'll use it to examine the
state of our programs. While I'm pretty sure it's possible to use
Visual Studio's integrated debugger for this, I think a standalone
debugger is better when all you want to do is look at the
disassembly, memory and registers. I've always used OllyDbg for stuff
like that, but unfortunately it does not have a 64-bit version.
Therefore we will be using WinDbg. The Windows Store version is a
revamp of this venerable tool with a slightly nicer interface.
Alternatively, you can get the non-Windows-store version as part of
the Windows 10 SDK. Just make sure you deselect everything except
WinDbg during installation. For our purposes, the two versions are
mostly interchangeable.
Thinking in Assembly
Now that we have our tools, I want to spend a bit of time discussing
some basics. For the purpose of these tutorials I'm assuming some
knowledge of languages like C or C++, but little or no previous
exposure to assembly. Readers who have seen some assembly before will
find much of this familiar.
A 10000-foot view
CPUs only "know" how to do a fixed number of certain things. When you
hear someone talk about an "instruction set", they're referring to
the set of things a particular CPU has been designed to do, and the
term "instruction" just means "one of the things a CPU can do". Most
instructions are parameterized in one way or another, and they're
generally really simple. Usually an instruction is something along the
lines of "write a given 8-bit value to a given location in memory",
or "interpreting the values from registers A and B as 16-bit signed
integers, multiply them and record the result into register A".
Below is a simple mental model of the architecture that we'll start
with.
[diag0]
This skips a ton of things (there can be more than one core executing
instructions and reading/writing memory, there's different levels of
cache, etc. etc.), but should serve as a good starting point.
To be effective at low-level programming or debugging you need to
understand that every high-level concept eventually maps to this
low-level model, and learning how the mapping works will help you.
Registers
You can think of registers as a special kind of memory built right
into the CPU that is very small, but extremely fast to access. There
are many different kinds of registers in x86-64, and for now we'll
concern ourselves only with the so-called general-purpose registers,
of which there are sixteen. Each of them is 64 bits wide, and for
each of them the lower byte, word and double-word can be addressed
individually (incidentally, 1 "word" = 2 bytes, 1 "double-word" = 4
bytes, in case you haven't heard this terminology before).
Register   Lower byte   Lower word   Lower dword
rax        al           ax           eax
rbx        bl           bx           ebx
rcx        cl           cx           ecx
rdx        dl           dx           edx
rsp        spl          sp           esp
rsi        sil          si           esi
rdi        dil          di           edi
rbp        bpl          bp           ebp
r8         r8b          r8w          r8d
r9         r9b          r9w          r9d
r10        r10b         r10w         r10d
r11        r11b         r11w         r11d
r12        r12b         r12w         r12d
r13        r13b         r13w         r13d
r14        r14b         r14w         r14d
r15        r15b         r15w         r15d
Additionally, bits 8 through 15 of rax, rbx, rcx and rdx (the upper
half of ax, bx, cx and dx) can be referred to as ah, bh, ch and dh.
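To make the aliasing concrete, here's a small Python sketch that uses plain integers as a stand-in for a register. The sample value is arbitrary, and Python is only acting as a calculator here: it shows which bits of the full 64-bit value each of the smaller names refers to.

```python
# Model a 64-bit register as a plain integer (the value is an arbitrary example).
rax = 0x1122334455667788

al = rax & 0xFF            # lowest byte
ax = rax & 0xFFFF          # lowest word (2 bytes)
eax = rax & 0xFFFFFFFF     # lowest double-word (4 bytes)
ah = (rax >> 8) & 0xFF     # bits 8..15: the upper half of ax

print(hex(al), hex(ax), hex(eax), hex(ah))
```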
Note that even though I said those were "general-purpose" registers,
some instructions can only be used with certain registers, and some
registers have special meaning for certain instructions. In
particular, rsp holds the stack pointer (which is used by
instructions like push, pop, call and ret), and rsi and rdi serve as
source and destination index for "string manipulation" instructions.
Another example where certain registers get "special treatment" is
the one-operand forms of the multiplication instructions, which
require one of the values being multiplied to be in the register rax,
and write the result into the pair of registers rax and rdx.
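Why a pair of registers? Multiplying two 64-bit values can produce a result up to 128 bits wide, so the low half goes to rax and the high half to rdx. The sketch below is a rough Python model of the unsigned case only; the helper name mul64 is made up for illustration.

```python
MASK64 = (1 << 64) - 1

def mul64(rax, operand):
    """Model of an unsigned one-operand multiply: returns (rdx, rax)."""
    product = (rax & MASK64) * (operand & MASK64)   # up to 128 bits wide
    return product >> 64, product & MASK64          # high half, low half

rdx, rax = mul64(0xFFFFFFFFFFFFFFFF, 2)
print(hex(rdx), hex(rax))
```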
In addition to these registers, we will also consider the special
registers rip and rflags. rip holds the address of the next
instruction to execute. It is modified by control flow instructions
like call or jmp. rflags holds a bunch of binary flags indicating
various aspects of the program's state, such as whether the result of
the last arithmetic operation was less, equal or greater than zero.
The behavior of many instructions depends on those flags, and many
instructions update certain flags as part of their execution. The
flags register can also be read and written "wholesale" using special
instructions.
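As an illustration of what "updating flags" means, here's a simplified Python model of three of the flags (carry, zero and sign) as a 64-bit subtraction would set them. The function name is hypothetical and the real register contains more flags than shown here.

```python
MASK64 = (1 << 64) - 1

def flags_after_sub(a, b):
    """Return the carry, zero and sign flags a 64-bit a - b would produce."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    return {
        "CF": a < b,                 # unsigned borrow occurred
        "ZF": result == 0,           # result is zero
        "SF": bool(result >> 63),    # top bit set, i.e. negative if read as signed
    }

print(flags_after_sub(5, 5))
print(flags_after_sub(3, 5))
```

A conditional jump placed after such a subtraction simply looks at one or more of these bits to decide where to go next.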
There are a lot more registers on x86-64. Most of them are used for
SIMD or floating-point instructions, and we'll not be considering
them in this series.
Memory and Addresses
You can think of memory as a large array of byte-sized "cells",
numbered starting at 0. We'll call these numbers "memory addresses".
Simple, right?
Well... addressing memory used to be rather annoying back in the old
days. You see, registers in old x86 processors used to be only 16 bits
wide. Sixteen bits is enough to address 64 kilobytes worth of memory,
but not more. The hardware was actually capable of using addresses as
wide as 20 bits, but you had to put a "base" address into a special
segment register, and instructions that read or wrote memory would
use a 16-bit offset into that segment to obtain the final 20-bit
"linear" address. There were separate segment registers for code,
data and stack portions (and a few more "extra" ones), and segments
could overlap.
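For the curious, the arithmetic behind this scheme fits in one line. In real mode the segment register holds the base address divided by 16, so the hardware shifts it left by 4 bits before adding the offset. The Python sketch below assumes the 20-bit wrap-around of the original 8086; the function name is made up.

```python
def linear_address(segment, offset):
    """Real-mode address: the 16-bit segment value is shifted left by 4 bits
    (multiplied by 16) and added to the 16-bit offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

a = linear_address(0x1234, 0x0010)
b = linear_address(0x1235, 0x0000)
print(hex(a), hex(b))
```

Both segment:offset pairs land on the same byte, which is what "segments could overlap" means in practice.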
In x86-64 these concerns are nonexistent. The segment registers for
code, data and stack are still present, and they're loaded with some
special values, but as a user-space programmer you needn't concern
yourself with them. For all intents and purposes you can assume that
all segments start at 0 and extend for the entire addressable length
of memory. So, as far as we're concerned, on x86-64 our programs see
memory as a "flat" contiguous array of bytes, with sequential
addresses, starting at 0, just like we said in the beginning of this
section...
Okay, I may have distorted the truth a little bit. Things aren't
quite as simple. While it is true that on 64-bit Windows your
programs see memory as a flat contiguous array of bytes with
addresses starting at 0, it is actually an elaborate illusion
maintained by the OS and CPU working together.
The truth is, if you were really able to read and write any byte in
memory willy-nilly, you'd stomp all over other programs' code and
data (something that indeed could happen in the Bad Old Days). To
prevent that, special protection mechanisms exist. I won't get too
deep into their inner workings here because this stuff matters mostly
for OS developers. Nevertheless, here's a very short overview:
Each process gets a "flat" address space as described above (we'll
call it the "virtual address space"). For each process, the OS sets
up a mapping between its virtual addresses and actual physical
addresses in memory. This mapping is respected by the hardware: the
"virtual" addresses get translated to physical addresses dynamically
at runtime. Thus, the same address (e.g. 0x410F119C) can map to two
different locations in physical memory for two different processes.
This, in a nutshell, is how the separation between processes is
enforced.
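A toy model may help here. The Python sketch below pretends each process has a single-level table mapping 4 KiB virtual pages to physical pages. That is a big simplification (the real hardware walks a multi-level structure), and all page numbers are made up, but it shows how one virtual address can resolve to different physical locations.

```python
PAGE_SIZE = 4096  # 4 KiB pages

def translate(page_table, virtual_address):
    """Translate using a toy single-level table {virtual page: physical page}."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise MemoryError(f"access violation at {virtual_address:#x}")
    return page_table[page] * PAGE_SIZE + offset

# Hypothetical mappings for two processes; the physical page numbers are made up.
process_a = {0x410F1: 0x00123}
process_b = {0x410F1: 0x00456}

print(hex(translate(process_a, 0x410F119C)))
print(hex(translate(process_b, 0x410F119C)))
```

An address whose page is missing from the table is simply not accessible to that process, which is the toy equivalent of an access violation.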
The final thing I want to draw your attention to here is how the
instructions and data which they operate on are held in the same
memory. While it may seem an obvious choice, it's not how computers
necessarily have to work. This is a property characteristic of the
von Neumann model - as opposed to the Harvard model, where
instructions and data are held in separate memories. A real-world
example of a Harvard computer is the AVR microcontroller on your
Arduino.
Our First Program
Hopefully by this point you have downloaded FASM and are ready to
write some code. Our first program will be really simple: it will
load and then immediately exit. We mostly want it just to get
acquainted with the tools.
Here's the code for our first program in x86-64 assembly:
format PE64 NX GUI 6.0
entry start
section '.text' code readable executable
start:
int3
ret
Analyzing the Code
We'll go through this line-by-line.
* format PE64 NX GUI 6.0 - this is a directive telling FASM the
format of the binary we expect it to produce - in our case,
Portable Executable Format (which is what most Windows programs
use). We'll talk about it in a bit more detail later.
* entry start - this defines the entry point into our program. The
entry directive requires a label, which in this case is "start".
A label can be thought of as a name for an address within our
program, so in this case we're saying "the entry point to the
program is at whatever address the 'start' label is". Note that
you're allowed to refer to labels even if they're defined later
in the program code (as is the case here).
* section '.text' code readable executable - this directive
indicates the beginning of a new section in a Portable Executable
file, in this case a section containing executable code. More on
this later.
* start: - this is the label that denotes the entry point to our
program. We referred to it earlier in the "entry" directive. Note
that labels themselves don't produce any executable machine code:
they're just a way for the programmer to mark locations within
the executable's address space.
* int3 - this is a special instruction that causes the program to
call the debug exception handler - when running under a debugger,
this will pause the program and allow us to examine its state or
proceed with the execution step-by-step. This is how breakpoints
are actually implemented - the debugger replaces a single byte in
the executable with the opcode corresponding to int3, and when
the program hits it, the debugger takes over (obviously, the
original content of the memory at breakpoint address has to be
remembered and restored before proceeding with execution or
single-stepping). In our case, we are hard-coding a breakpoint
immediately at the entry point for convenience, so that we don't
have to set it manually via the debugger every time.
* ret - this instruction pops off an address from the top of the
stack, and transfers execution to that address. In our case,
we'll return into the OS code that initially invoked our entry
point.
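The breakpoint trick described above can be sketched in a few lines of Python. This is a toy model, not how any particular debugger is written: the class and its methods are invented for illustration, and "memory" is just a bytearray. The opcodes themselves are real: int3 is the single byte 0xCC, and ret is 0xC3.

```python
INT3 = 0xCC  # one-byte opcode of int3
RET = 0xC3   # one-byte opcode of a near ret

class ToyDebugger:
    def __init__(self, memory):
        self.memory = memory
        self.saved = {}              # address -> original byte

    def set_breakpoint(self, address):
        # Remember the original byte, then overwrite it with int3.
        self.saved[address] = self.memory[address]
        self.memory[address] = INT3

    def clear_breakpoint(self, address):
        # Put the original byte back so execution can continue normally.
        self.memory[address] = self.saved.pop(address)

code = bytearray([0x90, 0x90, RET])   # nop; nop; ret
dbg = ToyDebugger(code)
dbg.set_breakpoint(1)
patched = bytes(code)
dbg.clear_breakpoint(1)
print(patched.hex(), bytes(code).hex())
```

By the same token, the two instructions of our program assemble to just two bytes: CC C3.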
Fire up FASMW.EXE, paste the code above into the editor, save the
file and press Ctrl+F9. Your first assembly program is now complete!
Let's now load it up in a debugger and single-step through it to see
it actually working.
Using the Debugger
Open up WinDbg. Go to the View tab and make sure the following
windows are visible: Disassembly, Registers, Stack, Memory and
Command. Go to File > Launch Executable and select the executable you
just built with FASM. At this point your workspace should resemble
something like this:
[windbg0]
In the disassembly window you can see the code that is currently
being executed. Right now it's not our program's code, but some OS
loader code - this stuff will load our program into memory and
eventually transfer execution to our entry point. WinDbg ensures a
breakpoint is triggered before any of that happens.
In the registers window, you can see the contents of x86-64 registers
that we discussed earlier.
The memory window shows the raw content of the program's memory
around a given virtual address. We'll use it later.
The stack window shows the current call stack (as you can see, it's
all inside ntdll.dll right now).
Finally, the command window allows entering text commands and shows
log messages.
If you press F5 at this time, it will cause the program to continue
running until it hits another breakpoint. The next breakpoint it will
hit is the one we hardcoded. Try pressing F5, and you'll see
something like this:
[windbg1]
You should be able to recognize the two instructions we wrote - int3
and ret. To advance to the next instruction, press F8. When you do
that, pay attention to the registers window - you should see the rip
register being updated as you advance (WinDbg highlights the
registers that change in red).
Right after the ret instruction is executed, you will return to the
code that invoked our program's entry point.
[windbg2]
As you can see from the image above, the next thing that will happen
is a call to RtlExitUserThread (a pretty self-explanatory name). If
you press F5 now, your program's main thread will clean up and end,
and so will the program. Or will it?...
The truth is, by using ret, I took a bit of a shortcut. On Windows a
process will terminate if any of the following conditions are met:
* Any thread calls the WinAPI function ExitProcess explicitly
* All threads have exited
But, we're exiting the main thread here so we should be good, right?
Well, sort of. There's no guarantee that Windows hasn't started any
other background threads (for example, to load DLLs or something like
that) within our process. It seems that at least in this example, the
main thread is the only one (I've checked and the process doesn't
stick around), but this may change. A well-behaved Windows program
should always call ExitProcess at the appropriate time.
In order to be able to call WinAPI functions, we need to learn a few
things about the Portable Executable file format, how DLLs are loaded
and calling conventions.
The PE Format and DLL Imports
The ExitProcess function lives in KERNEL32.DLL (yes, that's not a
typo, KERNEL32 is the name of the 64-bit library. The 32-bit versions
of those libs, provided for back-compat purposes, live in a folder
named SysWOW64. I'm not joking.). In order to be able to call it, we
first need to import it.
We won't cover the Portable Executable format in its entirety here.
It is documented extensively on the Microsoft docs website. Here are
a couple of basic facts we'll need to know:
* PE files are comprised of sections. We have already seen a
section containing executable code in our program, but sections
may contain other types of data.
* Information about what symbols are imported from what DLLs is
stored in a special section called '.idata'.
Let's have a look at the .idata section.
As per the docs, the .idata section begins with an import directory
table (IDT). Each entry in the IDT corresponds to one DLL, is 20
bytes in length and consists of the following fields:
* A 4-byte relative virtual address (RVA) of the Import Lookup
Table (ILT), which identifies the functions to import by
referring to their names. More on that later.
* A 4-byte timestamp field (usually 0)
* A 4-byte forwarder chain index (usually 0)
* A 4-byte RVA of a null-terminated string containing the name of
the DLL
* A 4-byte RVA of the Import Address Table (IAT). The structure of
the IAT is the same as ILT, the only difference is that the
content of IAT is modified at runtime by the loader - it
overwrites each entry with the address of the corresponding
imported function. So theoretically, you can have both ILT and
IAT fields point to the same exact piece of memory. Moreover,
I've found that setting the ILT pointer to zero also works,
although I am not sure if this behavior is officially supported.
The Import Directory Table is terminated by an entry where all fields
are zero.
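If it helps to see the layout as data, here's a Python sketch that packs one directory entry plus the terminator using the standard struct module. The helper name and all RVA values are placeholders I made up; only the field order and sizes come from the description above.

```python
import struct

IDT_ENTRY = struct.Struct("<5I")   # five little-endian 32-bit fields

def idt_entry(ilt_rva, name_rva, iat_rva, timestamp=0, forwarder_chain=0):
    """Pack one import directory table entry in the field order listed above."""
    return IDT_ENTRY.pack(ilt_rva, timestamp, forwarder_chain, name_rva, iat_rva)

# The RVAs below are made-up placeholders, not values from a real executable.
table = idt_entry(ilt_rva=0x2040, name_rva=0x2037, iat_rva=0x2040)
table += bytes(IDT_ENTRY.size)     # all-zero terminator entry
print(IDT_ENTRY.size, len(table))
```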
The ILT/IAT is an array of 64-bit values terminated by a null value.
The bottom 31 bits of each entry contain the RVA of an entry in a
hint/name table (containing the name of the imported function).
During runtime, the entries of the IAT are replaced with the actual
addresses of the imported functions.
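Why only the bottom 31 bits? According to the PE format documentation, the top bit of each 64-bit entry is a flag that selects import by ordinal instead of by name. Ordinals aren't used in this post; they appear in the sketch below only to show how an entry is interpreted. The function names and the RVA are hypothetical.

```python
ORDINAL_FLAG = 1 << 63          # set: import by ordinal; clear: import by name
NAME_RVA_MASK = (1 << 31) - 1   # bottom 31 bits

def by_name_entry(hint_name_rva):
    """Build a 64-bit ILT/IAT entry that imports a function by name."""
    assert hint_name_rva <= NAME_RVA_MASK
    return hint_name_rva

def decode_entry(entry):
    """Interpret an ILT entry as it looks before the loader has touched it."""
    if entry & ORDINAL_FLAG:
        return ("ordinal", entry & 0xFFFF)
    return ("name_rva", entry & NAME_RVA_MASK)

entry = by_name_entry(0x2028)   # 0x2028 is a made-up RVA
print(decode_entry(entry))
```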
The hint/name table mentioned above consists of entries, each of
which needs to be aligned on an even boundary. Each entry begins with
a 2-byte hint (which we'll ignore for now), followed by a
null-terminated string containing the imported function name and, if
necessary, one more null byte to align the next entry on an even
boundary.
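The padding rule is easy to get wrong, so here's a small Python sketch of building a single hint/name entry. It assumes the entry itself starts on an even offset, and the helper name is made up.

```python
import struct

def hint_name_entry(name, hint=0):
    """2-byte hint, ASCII name, terminating null, plus a pad byte if needed."""
    entry = struct.pack("<H", hint) + name.encode("ascii") + b"\x00"
    if len(entry) % 2:
        entry += b"\x00"        # pad so the next entry starts on an even boundary
    return entry

print(len(hint_name_entry("ExitProcess")), len(hint_name_entry("ExitThread")))
```

"ExitProcess" has an odd number of characters, so hint + name + null already comes out even; "ExitThread" needs the extra pad byte.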
With that out of the way, let's see how we would define our
executable's .idata section in FASM:
section '.idata' import readable writeable
idt: ; import directory table starts here
; entry for KERNEL32.DLL
dd rva kernel32_iat
dd 0
dd 0
dd rva kernel32_name
dd rva kernel32_iat
; NULL entry - end of IDT
dd 5 dup(0)
name_table: ; hint/name table
_ExitProcess_Name dw 0
db "ExitProcess", 0, 0
kernel32_name: db "KERNEL32.DLL", 0
kernel32_iat: ; import address table for KERNEL32.DLL
ExitProcess dq rva _ExitProcess_Name
dq 0 ; end of KERNEL32's IAT
The directive for a new PE section is already familiar to us. In this
case, we're communicating that the section we're about to introduce
contains the imports data and needs to be made writeable when loaded
into memory (since addresses of the imported functions will be
written in there).
The directives db, dw, dd and dq cause FASM to emit a raw byte, word,
double-word or quad-word value respectively. The rva operator,
unsurprisingly, yields the relative virtual address of its argument.
So, dd rva kernel32_iat will cause FASM to emit a 4-byte binary value
equal to the RVA of kernel32_iat label.
Here we've just made use of FASM's db/dw/etc. directives to precisely
describe the contents of our .idata section.
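To spell out what "dd rva some_label" boils down to, here's a rough Python equivalent. An RVA is just an address with the image base subtracted, and dd writes it out as four little-endian bytes. The image base and label address below are assumptions picked for illustration, not values taken from our executable.

```python
import struct

IMAGE_BASE = 0x400000   # assumed image base, chosen only for illustration

def rva(virtual_address, image_base=IMAGE_BASE):
    """A relative virtual address is an address minus the image base."""
    return virtual_address - image_base

def dd(value):
    """Emit a 4-byte little-endian value, like FASM's dd."""
    return struct.pack("<I", value)

label_address = 0x402044   # hypothetical address of a label inside .idata
print(dd(rva(label_address)).hex())
```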
The 64-bit Windows Calling Convention
We're now almost ready to finally call ExitProcess. One thing we have
to answer though, is - how does a function call work? Think about it.
There is a call instruction, which pushes the current value of rip
onto the stack, and transfers execution to the address specified by
its parameter. There is also the ret instruction, which pops off an
address from the stack and transfers execution there. Nowhere is it
specified how arguments should be passed to a function, or how to
handle the return values. The hardware simply doesn't care about
that. It is the job of the caller and the callee to establish a
contract between themselves. These rules might look along the lines
of:
* The caller shall push the arguments onto the stack (starting from
the last one)
* The callee shall remove the parameters from the stack before
returning.
* The callee shall place return values in the register eax
* ...
A set of rules like that is referred to as the calling convention,
and there are many different calling conventions in use. When you try
to call a function from assembly, you must know what type of calling
convention it expects.
The good news is that on 64-bit Windows there's pretty much only one
calling convention that you need to be aware of - the Microsoft x64
calling convention. The bad news is that it's a tricky one - unlike
many of the older conventions, it requires the first few parameters
to be passed via registers (as opposed to being passed on the stack),
which can be good for performance.
You may read the full docs if you're interested in details, I will
cover only the parts of the calling convention relevant to us here:
* The stack pointer has to be aligned to a 16-byte boundary at the
point where the call instruction is executed
* The first four integer or pointer arguments are passed in the
registers rcx, rdx, r8 and r9; the first four floating point
arguments are passed in registers xmm0 to xmm3. Any additional
args are passed on the stack.
* Even though the first 4 arguments aren't passed on the stack, the
caller is still required to allocate 32 bytes of space for them
on the stack. This has to be done even if the function has fewer
than 4 arguments.
* The caller is responsible for cleaning up the stack.
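Here's a sketch of how those rules place integer or pointer arguments, written in Python for brevity. It ignores floating point arguments entirely, and the function name is invented. Stack offsets are relative to rsp at the moment the call instruction executes, so the first stack argument sits right above the 32 bytes of shadow space.

```python
INT_ARG_REGISTERS = ["rcx", "rdx", "r8", "r9"]
SHADOW_SPACE = 32   # bytes the caller reserves for the four register arguments

def place_integer_args(args):
    """Map integer/pointer arguments to registers or stack slots.
    Stack offsets are relative to rsp at the moment of the call instruction."""
    placement = {}
    for index, value in enumerate(args):
        if index < len(INT_ARG_REGISTERS):
            placement[INT_ARG_REGISTERS[index]] = value
        else:
            offset = SHADOW_SPACE + 8 * (index - len(INT_ARG_REGISTERS))
            placement[f"[rsp+{offset}]"] = value
    return placement

print(place_integer_args([10, 20, 30, 40, 50, 60]))
```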
Armed with this knowledge, we can finally call ExitProcess:
format PE64 NX GUI 6.0
entry start
section '.text' code readable executable
start:
int3
sub rsp, 8 * 5 ; adjust stack ptr and allocate shadow space.
xor rcx, rcx ; The first and only argument is the return code - passed in rcx.
call [ExitProcess]
section '.idata' import readable writeable
idt: ; import directory table starts here
; entry for KERNEL32.DLL
dd rva kernel32_iat
dd 0
dd 0
dd rva kernel32_name
dd rva kernel32_iat
; NULL entry - end of IDT
dd 5 dup(0)
name_table: ; hint/name table
_ExitProcess_Name dw 0
db "ExitProcess", 0, 0
kernel32_name db "KERNEL32.DLL", 0
kernel32_iat: ; import address table for KERNEL32.DLL
ExitProcess dq rva _ExitProcess_Name
dq 0 ; end of KERNEL32's IAT
Let's go through the new lines one-by-one.
* sub rsp, 8 * 5 - the sub instruction subtracts its second operand
from its first operand and stores the result in the first
operand. In this case, we're subtracting 40 from the current
value of the stack pointer (note that somewhat
counterintuitively, the stack "grows" downward, i.e. pushing onto
the stack or allocating space on it diminishes the value of the
stack pointer). Thus, we're aligning the stack to a 16-byte
boundary, and allocating a "shadow space" for the first 4
arguments in one fell swoop. How does this work? Well, before our
entry point was invoked, the stack pointer was aligned to a
16-byte boundary. As a result of the call, a return address was
pushed onto the stack, diminishing the stack pointer value by 8
and throwing it out of alignment. We need to subtract another 8
bytes to bring it into alignment again, and another 32 bytes to
account for the shadow space, hence the value 40.
* xor rcx, rcx - recall that the first integer argument should be
passed in the rcx register. Here, we're setting the value of that
register to zero by performing a bitwise exclusive-or of the
register with itself (any value XORed with itself is zero).
* call [ExitProcess] - this is what finally calls ExitProcess. The
square brackets around the label name denote indirection - rather
than calling the address referred to by the label, the value
recorded in memory at that address is used as the target address
for the call. Of course, the label we're using is pointing to the
location within the import table where the loader has written the
address of the required function!
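The alignment arithmetic behind sub rsp, 8 * 5 can be double-checked with a few lines of Python. The starting value of rsp is a made-up example; all that matters is that it's a multiple of 16 before the call into our entry point.

```python
def is_aligned(address, boundary=16):
    return address % boundary == 0

rsp = 0x14FF60               # hypothetical 16-byte aligned value before the call
assert is_aligned(rsp)

rsp -= 8                     # the call into our entry point pushed a return address
misaligned_at_entry = not is_aligned(rsp)

rsp -= 8 * 5                 # sub rsp, 8 * 5: 8 bytes of padding + 32 of shadow space
print(misaligned_at_entry, is_aligned(rsp), hex(rsp))
```

Subtracting only 32 would leave the shadow space in place but rsp still 8 bytes off, which is why the extra 8 is needed.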
Fire it up in WinDbg again, run until our hardcoded breakpoint, then
single-step to see how we eventually call ExitProcess, making note of
how the rsp and rcx registers change.
[windbg3]
That's it for this first part. Next time, we'll try to do something
more interesting than just exiting the process :)