https://secret.club/2021/01/12/callout.html

Hiding execution of unsigned code in system threads

drew
Jan 12, 2021
---------------------------------------------------------------------

Anti-cheat development is, by nature, reactive; anti-cheats exist to
respond to and thwart a videogame's population of cheaters. For
instance, a videogame with an exceedingly low number of cheaters
would have little need for an anti-cheat, while a videogame rife with
cheaters would have a clear need for an anti-cheat. In order to catch
cheaters, anti-cheats will employ as many methods as possible.
Unfortunately, anti-cheats are not omniscient; they cannot know of
every single method or detection vector to catch cheaters. Likewise,
the game hacks themselves must continue to discover new or unique
methods in order to evade anti-cheats.

The Reactive Development Cycle of Game Hacking

This brings forth a reactive and continuous development cycle for
both cheats and anti-cheats: one party (cheat or anti-cheat) will
employ a new method to circumvent the other (anti-cheat or cheat),
which, in response, will then do the same.

One such method employed by an increasing number of anti-cheats is to
execute core anti-cheat functions from within the operating system's
kernel. A clear advantage over the alternative (i.e. usermode
execution) is in the fact that, on Windows NT systems, the anti-cheat
can selectively filter which processes are able to interact with the
memory of the game process they are protecting, thus
nullifying a plethora of methods used by game hacks.

In response to this, many (but not all) hack developers made (or are
making) the decision to do the same; they too would, or will, execute
their hack, either wholly or in part, from within the operating
system's kernel, thus nullifying what the anti-cheats had done.

Unlike with anti-cheats, however, this decision carries with it
numerous concessions: namely, the fact that, for various reasons, it
is most convenient (or it is only practical) to execute the hack as
an unsigned kernel driver running without the kernel's knowledge; the
"driver" is typically a region of executable memory in the kernel's
address space and is never loaded or allocated by the kernel. In
other words, it is a "manually-mapped" driver, loaded by a tool used
by a game hack.

This ultimately provides anti-cheats with many opportunities to
detect so-called "kernel-mode" or "ring 0" game hacks (noting that
those terms are typically said with a marketable significance; they
are literally used to market such game hacks, as if to imply
robustness or security); if the anti-cheat can prove that the system
is executing, or had executed, unsigned code, it can then potentially
flag a user as being a cheater.

Analyzing a Thread's Kernel Stack

One such method - the focus of this article, in fact - of detecting
unsigned code execution in the kernel is to iterate each thread that
is running in the system (optionally deciding to only iterate threads
associated with the system process, i.e. system threads) and to
initiate some kind of stack trace.

Bluntly, this allows the anti-cheat to quite effectively determine if
a cheat were executing unsigned code. For example, some anti-cheats
(e.g. BattlEye) will queue to each system thread an APC which will
then initiate a stack trace. If the stack trace returns an
instruction pointer that is not within the confines of any loaded
kernel driver, the anti-cheat can then know that it may have
encountered a system thread that is executing unsigned code.
Furthermore, because it is a stack trace and not a direct sampling of
the return instruction pointer, it would work quite reliably, even if
a game hack were, for example, executing a spin-loop or continuous
wait; the stack trace would always lead back to the unsigned code.
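
As a rough illustration of the check described above, the following
Python sketch classifies each address from a captured stack trace
against a table of loaded modules. The module table, its addresses,
and the function names are hypothetical and exist only for this
example; it is a user-mode model of the idea, not a description of how
any particular anti-cheat implements it.

```python
from bisect import bisect_right

# Hypothetical loaded-module table: (base, size, name).
# The addresses and sizes are made up for illustration.
MODULES = [
    (0xFFFFF80000000000, 0x00A00000, "ntoskrnl.exe"),
    (0xFFFFF80001000000, 0x00100000, "hal.dll"),
    (0xFFFFF80002000000, 0x00080000, "example.sys"),
]


def find_module(address, modules=MODULES):
    """Return the name of the module containing address, or None."""
    ordered = sorted(modules)
    bases = [base for base, _, _ in ordered]
    # Find the last module whose base is <= address.
    index = bisect_right(bases, address) - 1
    if index < 0:
        return None
    base, size, name = ordered[index]
    return name if address < base + size else None


def unbacked_frames(trace, modules=MODULES):
    """Return the trace entries that fall outside every loaded module."""
    return [address for address in trace if find_module(address, modules) is None]
```

Any address returned by unbacked_frames would only be a candidate for
further inspection; as the article notes, it indicates that the thread
may be executing unsigned code, not that it certainly is.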

It is quite clear to any cheat developer that they can respond to
this behavior by simply running their thread(s) with kernel APCs
disabled, preventing delivery of such APCs and avoiding the detection
vector. As will be seen, however, this method does not entirely
prevent detection of unsigned code execution.

(Copying Out, Then) Analyzing a Thread's Kernel Stack

Certain anti-cheats - EasyAntiCheat, in particular - had a much more
apt method of generating a pseudo-stacktrace: instead of generating a
stack trace with a blockable APC, why not copy the contents of the
thread's kernel stack asynchronously? Continuing the reactive
cheat-anti-cheat development cycle, EasyAntiCheat had opted to
manually search for instances of nonpaged code pointers that may have
been left behind as a result of system thread execution.

While the downsides of this method are debatable, the upside is quite
clear: as long as the thread is making procedure calls (e.g. x86 call
instruction) from within its own code, either to kernel routines or
to its own, and regardless of its IRQL or if the thread is even
running, its execution will leave behind detectable traces on its
stack in the form of pointers to its own code which can be extracted
and analyzed.
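
To make the idea of scanning a copied stack concrete, here is a
minimal Python sketch that walks a byte copy of a stack in 8-byte
slots and reports values that point into executable memory not covered
by signed code. The address ranges and helper names are assumptions
made for this example; how an anti-cheat actually decides what counts
as executable or signed is not specified by the article.

```python
import struct

# Hypothetical address ranges (start inclusive, end exclusive).
SIGNED_RANGES = [(0xFFFFF80000000000, 0xFFFFF80000A00000)]
EXECUTABLE_RANGES = SIGNED_RANGES + [(0xFFFFA00000000000, 0xFFFFA00000010000)]


def in_ranges(value, ranges):
    """Return True if value lies inside any (start, end) range."""
    return any(start <= value < end for start, end in ranges)


def scan_stack_copy(stack_bytes,
                    signed_ranges=SIGNED_RANGES,
                    executable_ranges=EXECUTABLE_RANGES):
    """Return (offset, value) for each 8-byte slot that points into
    executable memory which is not part of any signed range."""
    hits = []
    for offset in range(0, len(stack_bytes) - 7, 8):
        # Interpret the slot as a little-endian 64-bit value.
        (value,) = struct.unpack_from("<Q", stack_bytes, offset)
        if in_ranges(value, executable_ranges) and not in_ranges(value, signed_ranges):
            hits.append((offset, value))
    return hits
```

Because the scan reads a copy of the stack rather than running code on
the target thread, it does not depend on APC delivery, which is the
property the article highlights.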

Callouts: Continuing The Reactive Development Cycle

Proposed is the "callout" method of system thread execution, born
from the recognition that:

 1. A thread's kernel stack, as identified by the kernel stack
    pointer in a thread's ETHREAD object, can be analyzed
    asynchronously by a potential anti-cheat to detect traces of
    unsigned code execution; and that
 2. To be useful in most cases, a system thread must be able to make
    calls to most external NT kernel or executive procedures with
    little compromise.

The Life-cycle of the Callout Thread

The life-cycle of a callout thread is quite simple and can be used to
demonstrate its implementation:

  * Before thread creation:
      + Allocate a non-paged stack to be loaded by the thread; the
        callout thread's "real stack"
      + Allocate shellcode (ideally in executable memory not
        associated with the main driver module) which disables
        interrupts, preserves the old/kernel stack pointer (as it was
        on function entry), loads the real stack, and jumps to an
        initialization routine (the callout thread's "bootstrap
        routine")
      + Create a system thread (i.e. PsCreateSystemThread) whose
        start address points to the initialization shellcode
  * At thread entry (i.e. the bootstrap routine):
      + Preserve the stack pointer that had been given to the thread
        at thread entry (this must be given by the shellcode)
      + (Optionally) Walk the thread's old/kernel stack, from the
        preserved stack pointer to the stack base, eliminating any
        references/pointers to the initialization shellcode
      + (Optionally) Eliminate references to the initialization
        shellcode within the thread's ETHREAD; for example, it may be
        worth changing the thread's start address
      + (Optionally, but recommended) Free the memory containing the
        initialization shellcode, if it was allocated separately from
        the driver module
      + Proceed to thread execution

In clearer terms, the callout thread spends most of its time
executing the driver's unsigned code with interrupts disabled and
with its own kernel stack - the real stack. It can also attempt to
wipe any other traces of its execution which may have been present
upon its creation.

The Usefulness of the Callout Thread

The callout thread must also be capable of executing most, if not
all, NT kernel and executive procedures. As proposed, this is
effectively impossible; the thread must run with interrupts disabled
and with its own stack, thus creating an obvious problem as most
procedures of interest would run at an IRQL <= DISPATCH_LEVEL.
Furthermore, the NT IRQL model may be liable to ignore our setting of
the interrupt flag, causing most routines to unpredictably enter a
deadlock or enable interrupts without our consent.

A mechanism to allow for a callout thread to invoke these routines of
interest, the callout mechanism, is therefore used to:

 1. Provide a routine which can be used to conveniently invoke ("call
    out") an external function; and in this routine,
 2. Load the thread's original/kernel stack pointer;
 3. Copy function arguments on to the kernel thread's stack from the
    real stack;
 4. Enable interrupts;
 5. Invoke the requested routine (within the same instruction
    boundary as when interrupts are enabled);
 6. Cleanly return from the routine without generating obvious stack
    traces (e.g. function pointers);
 7. Load the real stack pointer and disable the interrupt flag, and
    do so before returning to unsigned code; and
 8. Continue execution, preserving the function's return value

While somewhat complicated, the callout mechanism can be achieved
easily and, to a reasonable degree, portably, using two
widely-available ROP gadgets from within the NT kernel.

The Usefulness of IRET(Q)

The constraint of needing to load a new stack pointer, interrupt
flag, and instruction pointer within an instruction boundary was
immediately satisfied by the IRET instruction.

For those unfamiliar, the IRET (lit. "interrupt return") instruction
is intended to be used by an operating system or executive (here, the
NT kernel) to return from an interrupt routine. To support the
recognition of an interrupt from any mode of execution, and to
generically resume to any mode of execution, the processor will need
to (effectively) preserve the instruction pointer, stack pointer, CPL
or privilege level (through the CS and SS selectors; and while they
have a more general use-case, this is effectively what is preserved
on most operating systems with a flat memory model), and RFLAGS
register (as interrupts may be liable to modify certain flags).

To report this information to the OS interrupt handler, the CPU will,
in a specific order:

 1. Push the SS (stack segment selector) register;
 2. Push the RSP (stack pointer) register;
 3. Push the RFLAGS (arithmetic/system flags) register;
 4. Push the CS (code segment selector) register;
 5. Push the RIP (instruction pointer) register; and, for some
    exception-class interrupts,
 6. Push an error code which may describe certain interrupt
    conditions (e.g. a page fault will know if the fault was caused
    by a non-present page, or if it were caused by a protection
    violation)

Note that the error code is not important to the CPU and must be
accounted for by the interrupt handler. Each operation is an 8-byte
push, meaning that, when the interrupt handler is invoked, the stack
pointer will point to the preserved RIP (or error code) value.

It is hopefully obvious as to how, approximately, the IRET
instruction would be implemented:

 1. Pop a value from the stack to retrieve the new instruction
    pointer (RIP)
 2. Pop a value from the stack to retrieve the new code segment
    selector (CS)
 3. Pop a value from the stack to retrieve the new arithmetic/system
    flags register (RFLAGS)
 4. Pop a value from the stack to retrieve the new stack pointer
    (RSP)
 5. Pop a value from the stack to retrieve the new stack segment
    selector (SS)

Or, modeled as a series of pseudo-assembly instructions:

GENERIC_INTERRUPT:

;note that all push and pop operations are 8 bytes (64 bits) wide!
push ss
push rsp
push rflags
push cs
push rip ;return instruction pointer
;optionally, push a zero-extended 4-byte error code. any interrupt which pushes an error code must have its handler add 8 bytes to its stack pointer (discarding the error code) before executing its IRET.

IRET:

pop rip ;pop return instruction pointer into RIP. do not treat this as a series of regular assembly instructions; treat it instead as CPU microcode!
pop cs
pop rflags
pop rsp
pop ss

The callout mechanism uses the IRET instruction to satisfy its
constraints, as the desired RFLAGS (which holds the interrupt flag),
instruction pointer, and stack pointer can be loaded by the
instruction at the same time (within an instruction boundary).
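
The push and pop order described in this section can be checked with a
small Python model in which a list stands in for the stack (the end of
the list is the top of the stack). The function names and register
values are illustrative assumptions; only the ordering and the
position of the interrupt flag (bit 9 of RFLAGS) reflect the
architecture.

```python
INTERRUPT_FLAG = 1 << 9  # IF is bit 9 of RFLAGS


def push_interrupt_frame(stack, state, error_code=None):
    """Model interrupt delivery: push SS, RSP, RFLAGS, CS, RIP,
    then an error code if the interrupt supplies one."""
    for register in ("ss", "rsp", "rflags", "cs", "rip"):
        stack.append(state[register])
    if error_code is not None:
        stack.append(error_code)


def iret(stack):
    """Model IRET: pop RIP, CS, RFLAGS, RSP, SS in that order and
    return them as the new register state."""
    return {register: stack.pop()
            for register in ("rip", "cs", "rflags", "rsp", "ss")}


# Example values are made up; RFLAGS 0x202 has the interrupt flag set.
state = {"ss": 0x18, "rsp": 0xFFFFB00000001000, "rflags": 0x202,
         "cs": 0x10, "rip": 0xFFFFF80000123456}
stack = []
push_interrupt_frame(stack, state, error_code=0)
stack.pop()              # the handler discards the error code first
restored = iret(stack)   # all five values are loaded together
```

In the model, restored equals state, and the interrupt flag carried in
RFLAGS is loaded together with the new instruction and stack pointers,
which is the property the callout mechanism relies on.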

ROP; Chaining It All Together

To reiterate, the callout routine uses IRET to change the instruction
pointer, stack pointer, and interrupt flag within the same
instruction boundary in order to jump to external procedures with the
interrupt flag enabled. This must be done within an instruction
boundary to prevent unfortunately-timed external interrupts from
being received just before the external procedure call.

It, however, must also be able to return from the external procedure
call without leaving unsigned code pointers on the kernel stack;
furthermore, it must also not rely on unlikely/unaligned ROP gadgets
(e.g. a cli;ret sequence) which may not exist on future NT kernel
builds. Thus also required is an IRET instruction to be executed upon
the routine's completion.

It must be recognized that the nature of the IRET instruction is such
that the return instruction pointer is located on the stack. However,
it is also recognized that a new stack pointer is loaded. We can
therefore use IRET to load the callout thread's real stack, with the
stack pointer pointing to the actual return address.

This eliminates the problem of code pointers being present in the
kernel stack; the return address back to our thread's execution is
located on another stack loaded by IRET and which isn't obviously
visible on a stack trace. To facilitate this, the stack frame loaded
by the IRET gadget must be such that the return instruction pointer
simply points to a RET instruction.

So, the ideal stack frame when calling an external procedure is as
such:

 1. IRET return data, where the return address points to a RET
    instruction within ntoskrnl.exe (or any region of signed code),
    and where the stack pointer to load is the thread's real stack,
    which would have a return address pushed on to it; and
 2. The address of an IRET instruction within a region of signed code

Within most, if not all, versions of ntoskrnl.exe, this can be
achieved with a simple RET instruction (0xC3 byte), along with the
following gadget:

mov rsp, rbp
mov rbp, [rbp + some_offset] ;where some_offset could be liable to change
add rsp, some_other_offset
iretq

This also slightly modifies the mechanism of the ROP chain in that it
must also load a pointer to the desired IRET frame in RBP when
calling the function. Thankfully, the x64 calling convention
specifies the RBP register as non-volatile, or unchanging across
function calls, meaning that we can initialize it with our desired
pointer when invoking the external procedure. It also means that the
callout mechanism is permitted to allocate a non-paged region of
memory to be given in RBP, preventing it from having to keep an IRET
frame on the kernel stack. This does, of course, introduce the
potential for an awful race condition: if an interrupt is received
between the mov rsp, rbp and iretq instructions, the stack pointer
may point to memory that is insufficient for stack operations.

In having the external procedure return to the above IRET gadget, we
can easily return to our unsigned code without ever leaking unsigned
code pointers on the kernel stack.
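
The return path can be traced with a toy Python model that treats
memory as a dictionary of 8-byte slots. Every address, both offsets,
and all names below are hypothetical; the model only follows the
sequence described in this section (procedure RET, the gadget, IRETQ,
then the final RET) and is not the author's implementation.

```python
INTERRUPT_FLAG = 1 << 9  # IF is bit 9 of RFLAGS

# Every address and offset below is hypothetical and purely illustrative.
IRET_GADGET = 0xFFFFF80000200000      # signed code: mov rsp, rbp ... iretq
RET_GADGET = 0xFFFFF80000300000       # signed code: a lone RET (0xC3)
UNSIGNED_RETURN = 0xFFFFA00000001000  # where the callout thread resumes
KERNEL_STACK = 0xFFFFB00000001000     # RSP when the external procedure returns
REAL_STACK = 0xFFFFC00000002000       # top of the thread's real stack
FRAME_REGION = 0xFFFFD00000000000     # non-paged region whose address is in RBP
SOME_OFFSET = 0x10
SOME_OTHER_OFFSET = 0x20


def build_memory():
    """Lay out the kernel stack, the real stack, and the IRET frame."""
    memory = {
        KERNEL_STACK: IRET_GADGET,      # return address seen by the procedure
        REAL_STACK: UNSIGNED_RETURN,    # only the real stack holds this pointer
        FRAME_REGION + SOME_OFFSET: 0,  # value the gadget reloads into RBP
    }
    frame = (RET_GADGET, 0x10, 0x2, REAL_STACK, 0x18)  # RIP, CS, RFLAGS, RSP, SS
    for index, value in enumerate(frame):
        memory[FRAME_REGION + SOME_OTHER_OFFSET + 8 * index] = value
    return memory


def simulate_return(memory, rsp=KERNEL_STACK, rbp=FRAME_REGION):
    """Follow execution from the external procedure's RET back to the
    callout thread's code and return the resulting register state."""
    # 1. The external procedure executes RET, landing on the IRET gadget.
    rip, rsp = memory[rsp], rsp + 8
    assert rip == IRET_GADGET
    # 2. Gadget body: mov rsp, rbp / mov rbp, [rbp + off] / add rsp, off2.
    rsp = rbp
    rbp = memory[rbp + SOME_OFFSET]
    rsp += SOME_OTHER_OFFSET
    # 3. iretq loads RIP, CS, RFLAGS, RSP, SS from the frame at once.
    rip, _cs, rflags, rsp, _ss = [memory[rsp + 8 * i] for i in range(5)]
    assert rip == RET_GADGET
    # 4. The lone RET pops the real return address from the real stack.
    rip, rsp = memory[rsp], rsp + 8
    return {"rip": rip, "rsp": rsp, "rflags": rflags, "rbp": rbp}
```

In this model the only code pointer stored on the kernel stack is the
address of the signed IRET gadget, while the pointer into unsigned
code lives solely on the real stack, and the RFLAGS value loaded by
iretq has the interrupt flag cleared.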

Implementation

An example implementation of the callout mechanism can be found here.
