https://blog.mggross.com/intercepting-syscalls/

Personal blog by Magnus Gross
Reliable system call interception.
Posted on 2025-01-05

Historically, intercepting Linux system calls was done with ptrace.
While ptrace is more commonly known for debugging purposes, one could
easily monitor system calls by using PTRACE_SYSCALL (or even
PTRACE_SYSEMU) to wait for the traced process to make a system call,
then send off PTRACE_GETREGS and PTRACE_SETREGS to read and write the
registers associated with the system call.

So while the Linux kernel always had the facilities to monitor, fake,
modify and restrict system calls, the glaring problem with ptrace is
that it is very slow, as it stops twice for every system call (unless
PTRACE_SYSEMU is used) and there is no way to natively filter for a
specific set of system calls. It gets worse, because reading and
writing to the registers is incredibly cumbersome and one quickly
encounters architecture-specific quirks.

This is where seccomp user notify comes in, where recent advancements
by Christian Brauner have made it possible to intercept system calls
in a much more elegant way. Due to the addition of BPF it can be
programmed to yield back only for the desired system calls, which
significantly reduces the performance penalty and unaffected sections
of the traced program run almost as if no tracer was attached at all.
This is also similar to what strace is doing with the --seccomp-bpf
option as a means for lessening the performance overhead, although it
still uses ptrace for the main functionality.

Usecase

A few years ago, I wrote a tool called copycat that uses this
mechanism to dynamically intercept all open()-style system calls made
by a supervised process and returns, depending on some rules, either
the requested file or a completely different faked file. This can be
very useful in some situations, for example when a program is
hardcoded to use one specific location for a configuration file, but
you rather want to use a different location.

The replacement of the opened file is completely transparent to the
application and can easily be configured with simple environment
variables. For example, the following snippet will trick cat into
outputting /tmp/b instead of /tmp/a:

COPYCAT="/tmp/a /tmp/b" copycat -- cat /tmp/a

What happens behind the curtains, actually involves a lot more detail
than just intercepting system calls. For one, it is also necessary to
inject file descriptors directly into the file descriptor table of
the traced process, as otherwise the faked file would only be valid
in the tracer process.

seccomp unotify

Originally seccomp user notify was intended for container usecases,
but we can use it just as easily for normal processes by adopting the
age-old fork+exec pattern. The child process simply registers a
seccomp filter with SECCOMP_SET_MODE_FILTER and then executes the
target application, while the parent process acts as the supervisor
and repeats an ioctl loop with the special SECCOMP_IOCTL_NOTIF_RECV
flag, which will yield any time the supervised process attempts a
matching system call.

A few special prerequisites have to be met to make this work. First
the child process needs to drop all privileges.

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

This is necessary, as otherwise an unprivileged process could execve
a setuid program with a malicious filter attached.


    Such a malicious filter might, for example, cause an attempt to
    use setuid to set the caller's user IDs to nonzero values to
    instead return 0 without actually making the system call.

On the supervisor side we have to make some extra checks due to a
kernel bug that does not notify the supervisor when the supervised
process exits. Since Linux 6.11 the bug is resolved, so in that case
the SECCOMP_IOCTL_NOTIF_RECV loop is sufficient and will return with
ENOENT, when the child terminates. However on older kernel versions
that ioctl would hang forever, so an easy workaround is to install a
signal handler for SIGCHLD with sigaction. Just keep in mind to do
the old Unix dance of just using async-signal-safe functions inside
of it, in particular no allocations or locks. Alternatively it is
possible to epoll the file descriptor returned when registering the
BPF filter.

Finally when the supervisor handles an intercepted system call
received by SECCOMP_IOCTL_NOTIF_RECV, the struct seccomp_notif *req
contains all the system call's arguments as part of its data.args
array. Except that is not the whole truth, because while arguments
fitting into one register are usually directly visible there, larger
arguments (such as the file name to open) are passed as a pointer.
Thus, all information that we then get at this stage is a useless
pointer pointing into the memory of another process.

long syscall(SYS_open, const char *pathname, int flags, mode_t mode)

So we end up having to open /proc/$PID/mem just to read the pathname.
Luckily we do not get any problems with yama security policies, as
the seccomp operations already require us to have a predefined
relationship between the supervisor and supervised process anyway,
which in this case means one is the parent of the other. If you feel
this is hacky, wait until you think about all the TOCTOU
opportunities when we read the memory referred to by just a PID. Here
seccomp can help us out: As long as we read it before we continue the
syscall and confirm the notification ID is still valid with
SECCOMP_IOCTL_NOTIF_ID_VALID, we are safe.

Now that we have all the system call arguments, we can decide if we
want to allow it or modify it and return it with a different file. In
both cases we need to use the struct seccomp_notif_resp *resp
parameter . If we want to allow the system call normally, we can just
set its flags field to SECCOMP_USER_NOTIF_FLAG_CONTINUE and send back
the response with SECCOMP_IOCTL_NOTIF_SEND.

resp->flags |= SECCOMP_USER_NOTIF_FLAG_CONTINUE;
resp->error = 0;
resp->val = 0;
ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, resp)

However, it becomes slightly more involved if we want to modify the
system call arguments. In that case we need to open the faked file on
the supervisor side, pretend like the originally intended system call
worked and return the file descriptor number of the faked file to the
traced process. Except there is the huge problem that the file
descriptor will obviously not be valid in the target process.


    The curious reader might wonder why we do not just rewrite the
    system call arguments to the faked file, and then let the process
    continue the system call normally. Again, the argument points to
    memory inside the traced process. Rewriting memory is not
    transparent to the process, but injecting file descriptors is.

This is where SECCOMP_IOCTL_NOTIF_ADDFD comes in. It will atomically
both install a file descriptor directly into the file descriptor
table of the target process and return it as part of the system call.

struct seccomp_notif_addfd addfd = {};
addfd.id = req->id;
addfd.flags = SECCOMP_ADDFD_FLAG_SEND;
addfd.srcfd = ret;
resp->error = 0;
resp->val = ret
ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
close(addfd.srcfd);

After we have injected the file descriptor, we can simply close it on
our side.

BPF filter

For the most part now we have overglossed the most important detail,
which is the BPF filter that decides if we want to intercept a system
call. We can always continue a system call normally from our handler
with SECCOMP_USER_NOTIF_FLAG_CONTINUE, but a BPF filter is crucial
for skipping this expensive roundtrip in the first place.

While eBPF has made some fame in the tracing and profiling scene
lately, seccomp uses the original unextended Berkeley packet filter.
Both instruction sets are quite similar and the Linux kernel actually
internally translates BPF to an eBPF representation. Since BPF
filters run in kernel space, static checks are done to make sure that
they do not crash and that they terminate. There is no difficulty in
solving the halting problem for circle-free programs, so the eBPF
verifier uses a simple DFS to check that, i.e. loops are not allowed.
For most architectures, the kernel can also JIT compile eBPF to
native machine code.

Essentially the BPF instruction set has two registers, A and X, but
the kernel C definitions refer to them as BPF_K and BPF_X. We can
load into these registers with the BPF_LD instruction (e.g. 32-bit
wide with BPF_W) and BPF_JMP instructions allow us to jump based on
comparing a register value with a given value. For example BPF_JUMP
(BPF_JMP+BPF_JGE+BPF_X, 42, jt, jf) will increase the instruction
pointer by jt, if the value in the X register is greater or equal to
42. Otherwise it will increase it by jf. The instruction pointer will
also further be increased by one after each instruction.

With the BPF_RET instruction we can finally return a value, which the
kernel then will use to decide what to do with the system call. So a
BPF filter to intercept a certain set of system calls nrs of length
len would look a little something like this:

int trap_syscalls(const int *nrs, size_t len, unsigned int flags) {
        struct sock_filter filter[MAX_FILTER_SIZE];
        int i = 0;

        filter[i++] = BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, arch));
        filter[i++] = BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 0, 2);
        filter[i++] = BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr));
        filter[i++] = BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, X32_SYSCALL_BIT, 0, 1);
        filter[i++] = BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL_PROCESS);

        for (int j = 0; j < len; ++j) {
                filter[i++] = BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nrs[j], len - j, 0);
        }

        filter[i++] = BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW);
        filter[i++] = BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF);

        struct sock_fprog prog = {
                .len = (unsigned short) i,
                .filter = filter,
        };
        return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
}

There is a lot to unpack here, so let's go through the individual BPF
instructions one by one.

BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, arch))
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 0, 2)

The first two instructions load the architecture and check that it
matches our expectations. To understand why this is important, we can
refer directly to the common pitfalls section in the official
documentation:


    On any architecture that supports multiple system call invocation
    conventions, the system call numbers may vary based on the
    specific invocation. If the numbers in the different calling
    conventions overlap, then checks in the filters may be abused.
    Always check the arch value!

The next two instructions check for something similar:

BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr))
BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, X32_SYSCALL_BIT, 0, 1)

The arch field is actually not unique for all calling conventions.
For example, both the x86-64 ABI and the x32 ABI use
AUDIT_ARCH_X86_64, so the only way to tell them apart is by checking
if the __X32_SYSCALL_BIT is set. Furthermore, if system calls are
denied only based on its exact nr, then a malicious program could
simply set __X32_SYSCALL_BIT to bypass this filter.

If any of these checks fail, the jump location is the following BPF
instruction, which results in immediate termination of the process
that made the system call.

BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL_PROCESS)

Now with all the fun boilerplate done, we can finally insert BPF
instructions for all passed system call numbers, that check if we
want to intercept or just pass-through that specific system call.

BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nrs[j], len - j, 0)

This jump to the intercept instruction is a rats nest of off-by-one
errors: The jump-true branch is actually more like (len - 1) - j + 2
- 1. The intercept instruction is the second instruction after the
for loop, so we have to jump to the end of the for loop (which has
index len - 1) relatively from the current index j, then jump to the
second instruction, but subtract one again, because BPF automatically
increments the instruction pointer by one after each instruction.

Then the final instructions are the jump targets from all the
previous checks.

BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF)

The first one is reached if none of the checks in the for loop match
and simply allows the system call normally. The second one kicks off
interception of the system call and will return back to our userspace
handler that waits in an ioctl() loop with the
SECCOMP_IOCTL_NOTIF_RECV flag. For installing the BPF filter we
simply use SECCOMP_SET_MODE_FILTER at the end. For a more complete
picture have a look at the source code yourself, specifically
seccomp_exec.c for handling the system calls and seccomp_trap.c for
registering the BPF filter. There is also a smaller sample in the
Linux kernel source tree to get started.

Finally it should be emphasized that seccomp unotify should never be
used to implement security policy decisions. The TOCTOU attacks alone
hidden here make this impossible, for example if the supervisor
signals SECCOMP_USER_NOTIF_FLAG_CONTINUE, the system call will in
fact continue, but the process still has a small opportunity window
to rewrite the system call arguments before it actually runs.
However, it is still a great tool to intercept system calls with
minimal performance impact.
Tagged with:

  * linux