https://justine.lol/pledge/

July 13th, 2022 @ justine's web page

Porting OpenBSD pledge() to Linux

[OpenBSD Blowfish Logo]

OpenBSD is an operating system that's famous for its focus on
security. Unfortunately, OpenBSD leader Theo states that there are
only 7000 users of OpenBSD. So it's a very small but elite group
that wields a disproportionate influence, since we hear all the time
about the awesome security features these guys get to use, even
though we usually can't use them ourselves.

Pledge is like the forbidden fruit we all covet when the boss says we
must use things like Linux. Why does it matter? It's because pledge()
actually makes security comprehensible. Linux has never really had a
security layer that mere mortals can understand. For example, let's
say you want to do something on Linux like control whether or not
some program you downloaded from the web is allowed to have
telemetry. You'd need to write stuff like this:

static const struct sock_filter kFilter[] = {
    /* L0*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall, 0, 14 - 1),
    /* L1*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[0])),
    /* L2*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 4 - 3, 0),
    /* L3*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 10, 0, 13 - 4),
    /* L4*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[1])),
    /* L5*/ BPF_STMT(BPF_ALU | BPF_AND | BPF_K, ~0x80800),
    /* L6*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 8 - 7, 0),
    /* L7*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 0, 13 - 8),
    /* L8*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[2])),
    /* L9*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 12 - 10, 0),
    /*L10*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 6, 12 - 11, 0),
    /*L11*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 17, 0, 13 - 12),
    /*L12*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /*L13*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(nr)),
    /*L14*/ /* next filter */
};

Oh my gosh. It's like we traded one form of security privilege for
another. OpenBSD limits security to a small pond, but makes it easy.
Linux is a big tent, but makes it impossibly hard. SECCOMP BPF might
as well be the Traditional Chinese of programming languages, since
only a small number of people who've devoted the oodles of time it
takes to understand code like what you see above have actually been
able to benefit from it. But if you've got OpenBSD privilege, then
doing the same thing becomes easy:

pledge("stdio rpath", 0);

That's really all OpenBSD users have to do to prevent things like
leaks of confidential information. So how do we get it that simple on
Linux? I believe the answer is to find someone with enough free time
to figure out how to use SECCOMP BPF to implement pledge. The latest
volunteer is me, so look upon my code ye mighty and despair.

  * cosmopolitan/libc/mem/pledge.c
    system call polyfill
  * cosmopolitan/tool/build/pledge.c
    pledge command
  * cosmopolitan/test/libc/mem/pledge_test.c
    unit tests

There have been a few devs in the past who've tried this. I'm not going
to name names, because most of these projects were never completed.
When it comes to SECCOMP, the online tutorials only explain how to
whitelist the system calls themselves, so most people lose interest
before figuring out how to filter arguments. The projects that got
further along also had oversights like allowing the changing of
setuid/setgid/sticky bits. So none of the current alternatives should
be used. I believe this effort gets us much closer to having pledge()
than ever before.
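
To make "filtering arguments" a little more concrete, here's the socket() fragment shown earlier restated as an ordinary C predicate. This is one reading of that BPF program rather than code from the polyfill: the function names are made up for illustration, and it assumes the fragment guards socket() and that 0x80800 is SOCK_CLOEXEC | SOCK_NONBLOCK, which holds on x86-64 Linux.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

// Hypothetical restatement of the BPF fragment: returns true when a
// socket(family, type, protocol) call would reach SECCOMP_RET_ALLOW.
bool IsAllowedInetSocket(int family, int type, int protocol) {
  // args[0]: only the IPv4 (2) and IPv6 (10) address families
  if (family != AF_INET && family != AF_INET6) return false;
  // args[1]: mask off the SOCK_CLOEXEC and SOCK_NONBLOCK flag bits, then
  // require a stream (1) or datagram (2) socket
  type &= ~(SOCK_CLOEXEC | SOCK_NONBLOCK);
  if (type != SOCK_STREAM && type != SOCK_DGRAM) return false;
  // args[2]: default protocol (0), TCP (6), or UDP (17)
  return protocol == 0 || protocol == IPPROTO_TCP || protocol == IPPROTO_UDP;
}

// A few spot checks, in the spirit of the asserts used elsewhere on this page.
void CheckSocketExamples(void) {
  assert(IsAllowedInetSocket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP));
  assert(!IsAllowedInetSocket(AF_UNIX, SOCK_STREAM, 0));
  assert(!IsAllowedInetSocket(AF_INET, SOCK_RAW, IPPROTO_ICMP));
}
```

What takes three comparisons in C takes a dozen hand-counted jump offsets in BPF, which is a big part of why so few people get this far.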

Command Line Utility   [Linux]

I originally wrote my pledge() polyfill for the redbean web server as
a sandboxing solution. However it turns out pledge() is robust enough
as an abstraction that I thought it'd be useful to create a small
command line utility which launches processes under pledge(), so that
anyone can use it, without having to configure it in C code.

pledge.com
44kb - x86-64 elf executable (debug data, source code)
Written by Justine Alexandra Roberts Tunney (Twitter, GitHub,
LinkedIn)
ab61efbc68afc94a5812bacd4c93d91f1da3b8fb267a2622724821cd9cace169

That binary will work on all Linux distros since RHEL6. Root
privileges are not required. You just use it to wrap your command
invocations. It's so tiny and lightweight that it only adds a few
microseconds of startup latency to your program. It's great for shell
scripts and automated tools. For example, if you want to run the list
directory command, and only permit that command to do basic stdio and
filesystem path reading, you'd say:

$ wget https://justine.lol/pledge/pledge.com
$ chmod +x pledge.com
$ ./pledge.com -p 'stdio rpath' ls
file listing output...

You can now be certain your ls command isn't doing things like spying
on you, or uploading your bitcoin wallet to the cloud. However let's
say authorizing network access is what you want. One command that has
a real legitimate need for that is curl, which can be configured as
follows:

$ ./pledge.com -p 'stdio rpath inet thread' curl http://justine.lol/hello.txt
hello world

Here's another example. Let's say you have a public ssh server and
you want to let people read and take notes of your book collection,
but you don't want anyone rewriting your books. In that case, you can
repurpose something like the nano command as a strictly read-only
editor. Since nano has a TUI, you'd need to grant it TTY
privileges.

$ ./pledge.com -np 'stdio rpath tty' nano ~/books/bofh.txt

Troubleshooting

If your program crashes, then you can figure out why by tracing the
binary and seeing which system call is EPERM'ing. The ls invocation
earlier used the default set of promises (which made -p 'stdio rpath'
redundant), so let's see what happens if we reduce the privileges to
just stdio.

$ strace -ff ./pledge.com -p stdio ls
open("/etc/ld-musl-x86_64.path", O_RDONLY|O_CLOEXEC) = -1 EPERM (Operation not permitted)

Well that didn't take long. Now that you know what's wrong, you would
then consult the Promises section to see which promise you need. For
example, you'd know open(O_RDONLY) is provided by rpath and that in
order to fork() you need -p proc.
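
If you'd rather not scan the Promises section by eye, a lookup like this could do it for you. It's a hypothetical helper, not part of pledge.com, and its table is an abbreviated, hand-copied subset of the Promises section that only covers calls listed there without argument restrictions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Abbreviated subset of the Promises section (illustrative, not exhaustive).
static const struct {
  const char *promise;
  const char *calls;  // space-separated system call names
} kPromises[] = {
    {"stdio", "close dup read write lseek pipe wait4"},
    {"rpath", "chdir getcwd stat lstat access readlink"},
    {"cpath", "rename link symlink unlink rmdir mkdir"},
    {"inet", "listen bind connect accept getpeername getsockname"},
    {"proc", "fork vfork kill setpgid setsid"},
    {"thread", "clone futex"},
    {"exec", "execve execveat"},
};

// Returns the first promise whose list contains `call` as a whole word,
// or NULL if the abbreviated table doesn't mention it.
const char *WhichPromise(const char *call) {
  size_t i, n = strlen(call);
  assert(n > 0);
  for (i = 0; i < sizeof(kPromises) / sizeof(kPromises[0]); ++i) {
    const char *p = kPromises[i].calls;
    while ((p = strstr(p, call))) {
      int starts = p == kPromises[i].calls || p[-1] == ' ';
      int ends = p[n] == ' ' || p[n] == '\0';
      if (starts && ends) return kPromises[i].promise;
      p += n;  // matched inside a longer name, e.g. "stat" within "lstat"
    }
  }
  return NULL;
}
```

Some calls appear under more than one promise, so this only reports the first match in table order.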

Resource Limits

In addition to polyfilling pledge, your pledge command is also able
to apply some other very important safety hacks that aren't obvious
to the uninitiated. For example, we've all run a program before that
hammers the system. Linux is very generous in how much memory
programs can allocate. An accidental loop in just one program, by
default on Linux, will absolutely take the whole machine out of
commission for a few minutes before the "OOM Killer" kicks in. In
other cases, like a fork() bomb, the default Linux environment
provides no such protection, so it's essentially equivalent to a blue
screen of death.

Your pledge command imposes some perfectly reasonable resource quotas
on programs by default, to prevent that from happening. By default,
unless you tune the flags, a program is allowed to use only 4gb of
memory and, if you've permitted it to fork off new processes, then it
won't be able to spawn more of them at the same time than twice your
number of CPUs. That way your sandbox won't compromise the stability
of your machine.
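
On Linux, quotas like these are typically expressed with setrlimit(). The following is a minimal sketch of the defaults described above, assuming RLIMIT_AS for the memory cap, RLIMIT_FSIZE for the file size cap, and RLIMIT_NPROC for the process cap. The helper names are hypothetical and the actual implementation in pledge.com may differ.

```c
#include <assert.h>
#include <sys/resource.h>
#include <unistd.h>

// Sets the soft limit for one resource, never asking for more than the
// hard limit the process already has.
int SetSoftLimit(int resource, unsigned long long want) {
  struct rlimit rl;
  assert(want > 0);  // a zero quota would make the program unusable
  if (getrlimit(resource, &rl)) return -1;
  if (want > rl.rlim_max) want = rl.rlim_max;
  rl.rlim_cur = want;
  return setrlimit(resource, &rl);
}

// 4gb, matching the documented default for memory and file size.
unsigned long long DefaultByteLimit(void) {
  return 4ull << 30;
}

// Twice the number of online CPUs, matching the documented default.
long DefaultProcessLimit(void) {
  long cpus = sysconf(_SC_NPROCESSORS_ONLN);
  if (cpus < 1) cpus = 1;
  return cpus * 2;
}

// Applies all three defaults; meant to run just before exec'ing the program.
int ApplyDefaultQuotas(void) {
  if (SetSoftLimit(RLIMIT_AS, DefaultByteLimit())) return -1;
  if (SetSoftLimit(RLIMIT_FSIZE, DefaultByteLimit())) return -1;
  return SetSoftLimit(RLIMIT_NPROC, DefaultProcessLimit());
}
```

One thing to keep in mind is that RLIMIT_NPROC counts every process owned by the user, not just the sandbox's children, so a real tool may need to account for what's already running.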

We also have a niceness feature. Have you ever had a program use so
much disk i/o that everything crawls to a halt? You run some program,
and then suddenly every small file takes seconds to load in Emacs?
Your pledge command can fix that. If you've got a compute-heavy,
long-running program, then pass the -n flag for a nice that's actually
nice. The naive nice command doesn't really do much, since it doesn't
change the scheduler and it doesn't change the i/o priority. This
command actually does. Using the -n flag will guarantee the sandbox
program will stay out of the way, since the kernel will only let it
use spare capacity.
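
Here's roughly what those three knobs look like in C. This is a sketch under assumptions: the function names are invented, the ioprio constants are copied from the Linux ABI because glibc has no ioprio_set() wrapper, and the exact calls pledge.com makes may differ.

```c
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_IDLE
#define SCHED_IDLE 5  // Linux value; glibc only exposes it with _GNU_SOURCE
#endif

// Linux i/o priority encoding (class in the top bits, data below it).
enum { kIoprioWhoProcess = 1, kIoprioClassIdle = 3, kIoprioClassShift = 13 };

int IoprioValue(int klass, int data) {
  assert(0 <= klass && klass <= 3);
  return (klass << kIoprioClassShift) | data;
}

// Applies all three settings to the calling process. Returns a bitmask of
// the steps that failed, so 0 means everything was applied.
int BeVeryNice(void) {
  int failed = 0;
  struct sched_param param = {0};
  // (1) classic niceness
  if (setpriority(PRIO_PROCESS, 0, 19)) failed |= 1;
  // (2) idle i/o priority, via the raw system call
  if (syscall(SYS_ioprio_set, kIoprioWhoProcess, 0,
              IoprioValue(kIoprioClassIdle, 0))) {
    failed |= 2;
  }
  // (3) idle cpu scheduling policy
  if (sched_setscheduler(0, SCHED_IDLE, &param)) failed |= 4;
  return failed;
}
```

All three settings are inherited across fork() and execve(), which is what lets a wrapper apply them before launching the real program.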

Pledge Command Flags

-n
    Apply maximum niceness to program. This means (1) nice is set to
    19, (2) i/o priority is set to idle, and (3) scheduler is set to
    idle.
-N
    Don't normalize file descriptors. By default, pledge.com
    guarantees (1) the stdio file descriptors exist, and (2) file
    descriptors that the parent process or shell forgot to close will
    be closed. In the latter case, we only poll up to fd=256 which is
    fast, but the number may be lower depending on system limits.
-g GID
    Call setgid() before executing program (not allowed if setuid
    binary)
-u UID
    Call setuid() before executing program (not allowed if setuid
    binary)
-c PATH
    Call chroot() before executing program (needs root privileges)
-C SECS
    set cpu limit in seconds [default: inherited]
-M BYTES
    set virtual memory limit in bytes [default: 4gb]
-P PROCS
    set process limit [default: GetCpuCount()*2]
-F BYTES
    set individual file size limit [default: 4gb]
-p PLEDGE
    Defaults to -p 'stdio rpath'. It's repeatable. See also the
    Promises section below, which goes into much greater depth on
    what each category does. May contain any of the following,
    separated by spaces:
      + stdio: allow stdio and benign system calls
      + rpath: read-only path ops
      + wpath: write path ops
      + cpath: create path ops
      + dpath: create special files
      + flock: file locks
      + tty: terminal ioctls
      + recvfd: allow SCM_RIGHTS
      + fattr: allow changing some struct stat bits
      + inet: allow IPv4 and IPv6
      + unix: allow local sockets
      + dns: allow dns
      + proc: allow fork, clone and friends
      + thread: allow clone
      + id: allow setuid and friends
      + exec: allow executing ape binaries
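
The -N flag turns off descriptor normalization, and the poll() trick behind it deserves a sketch. poll() reports POLLNVAL for any descriptor that isn't open, so one call can check the whole range up to fd=256. The function below is a hypothetical reconstruction from that description, not the code in pledge.com.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/resource.h>
#include <unistd.h>

// Returns how many stray descriptors were closed, or -1 on error.
int NormalizeFds(int limit) {
  struct pollfd pfds[256];
  struct rlimit rl;
  int i, n, closed = 0;
  assert(limit >= 3);
  // (1) make sure fds 0, 1, and 2 exist. open() returns the lowest free
  // descriptor, which is `i` here because everything below it is open.
  for (i = 0; i < 3; ++i) {
    if (fcntl(i, F_GETFD) == -1) {
      if (open("/dev/null", O_RDWR) == -1) return -1;
    }
  }
  // poll() rejects more entries than RLIMIT_NOFILE allows, so the range
  // we can check may be smaller than 256
  if (limit > 256) limit = 256;
  if (!getrlimit(RLIMIT_NOFILE, &rl) && rl.rlim_cur < (rlim_t)limit) {
    limit = (int)rl.rlim_cur;
  }
  // (2) ask about every descriptor from 3 upward in a single call
  for (n = 0, i = 3; i < limit; ++i, ++n) {
    pfds[n].fd = i;
    pfds[n].events = 0;
    pfds[n].revents = 0;
  }
  if (poll(pfds, n, 0) == -1) return -1;
  for (i = 0; i < n; ++i) {
    if (!(pfds[i].revents & POLLNVAL)) {  // it's open, so it's a leftover
      close(pfds[i].fd);
      ++closed;
    }
  }
  return closed;
}
```

Closing leftovers this way also matters for the chroot() discussion in the Caveats section, since an inherited directory descriptor is exactly the kind of backdoor described there.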

Securing APE Binaries

Actually Portable Executables should be written to call pledge()
internally. But if you want to secure an APE binary that doesn't,
using the pledge.com command, then you need to convert (or
"assimilate") it into the ELF format beforehand. You can usually do
this by saying:

$ file redbean.com
redbean.com: DOS/MBR boot sector
$ ./redbean.com --assimilate
$ file redbean.com
redbean.com: ELF 64-bit LSB executable

Please note that this won't work if you're using binfmt_misc with the
new APE Loader, because then you can't run the APE shell script to
assimilate your binary. We instead provide a new assimilate.com
program which can be used to convert APE programs to ELF or Mach-O.

assimilate.com
Works on x86-64 Linux+Mac+Windows+FreeBSD+NetBSD+OpenBSD
92kb - PE+ELF+MachO+ZIP+SH executable (debug data, source code)
Written by Justine Alexandra Roberts Tunney (Twitter, GitHub,
LinkedIn)
593a8119049e9e8a88d29f80af83bfdbb5fcdd8a4cbad934af05dd6a5145ce77

C API

Pledge works best when developing software using Cosmopolitan Libc.
You can get started relatively easily writing pledge() programs using
the cosmopolitan monorepo. The zero config solution is to just plop
this program file into the examples folder. Start by cloning the
repo:

$ git clone https://github.com/jart/cosmopolitan
$ cd cosmopolitan
$ nano examples/mypledge.c

You can then copy and paste this code:

#include "libc/calls/calls.h"
#include "libc/stdio/stdio.h"

int main() {
  pledge("stdio", 0);
  printf("hello world\n");
}

You can then build and run your program as follows:

$ make -j8 o//examples/mypledge.com
$ o//examples/mypledge.com
hello world

One of the things you may have noticed about the pledge.com command,
is its most restrictive mode (pledge.com -p "" cmd...) can't actually
be used. Your program will just crash. That's because it's intended
for the C API. What it means is that your process or thread won't be
able to call any system call except exit. Such a program might sound
impossible, but you can actually communicate between processes using
shared memory. For example, here's how you'd do it with threads.

int enclave(void *arg, int tid) {
  if (pledge("", 0)) return 1;
  int *job = arg;            // get job
  job[0] = job[0] + job[1];  // do work
  return 0;                  // exit
}
int main() {
  struct spawn worker;
  int job[2] = {2, 2};            // create workload
  _spawn(enclave, job, &worker);  // create worker
  _join(&worker);                 // wait for exit
  assert(job[0] == 4);            // check result
}

The above example shows an enclaved worker doing some kind of
computational task, possibly executing untrusted code, and then
storing the result to some memory location that the parent thread can
see when the worker has finished executing. It works great and is
fast.

One of the disadvantages of the above example, is that the enclaved
worker has unfettered access to your stack memory and might make a
mess of things. That's potentially creepy and not very enclaved. One
way to fix that is to use fork() instead of threads. In that case,
you can explicitly whitelist which memory is shared.

int ws;
// create small shared memory region
int *job = mmap(0, FRAMESIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
job[0] = 2;  // create workload
job[1] = 2;
if (!fork()) {  // create enclaved worker
  if (pledge("", 0)) _Exit(1);
  job[0] = job[0] + job[1];  // do work
  _Exit(0);
}
wait(&ws);  // wait for worker
assert(WIFEXITED(ws));
assert(WEXITSTATUS(ws) == 0);
assert(job[0] == 4);  // check result
munmap(job, FRAMESIZE);

Most of the Cosmopolitan Libc unit tests have been set up to use
pledge() these days. Not necessarily because we're concerned about
them being compromised, but because the pledge function has
outstanding documentation value in helping people understand our
tests, since it readily communicates what system functionality they
need. For example, our tests for the access() filesystem function
say:

__attribute__((__constructor__)) static void init(void) {
  pledge("stdio rpath wpath cpath fattr", 0);
  errno = 0;
}

System Call Origin Verification

When you write your own Actually Portable Executables, you also get
some added security benefits compared to pledge.com. For example,
another famous OpenBSD system call is msyscall() which causes the
kernel to validate the RIP register of anything that issues a system
call. In Cosmopolitan, calling pledge() automatically polyfills that
feature too, so that only functions annotated with the privileged
keyword may use SYSCALL. What that means is if someone
manages to compromise your server to inject executable code into your
program's memory, then that code effectively will have pledge("", 0)
privileges, even if when your app called pledge(), it specified
something much broader. The redbean web server's unix.pledge()
function is also able to take advantage of this.
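
One way a check like that could be expressed with SECCOMP BPF is to compare the instruction_pointer field of struct seccomp_data against the address range that's allowed to issue SYSCALL. The builder below is an illustrative sketch, not the filter Cosmopolitan installs: its name and parameters are made up, and it assumes a little-endian machine and a range that doesn't cross a 4gb boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

enum { kOriginFilterLen = 7 };

// Fills `f` with a program that allows a system call only when it was
// issued from an address in [start, end), and kills the thread otherwise.
void BuildOriginFilter(struct sock_filter f[kOriginFilterLen],
                       uint64_t start, uint64_t end) {
  uint32_t ip = offsetof(struct seccomp_data, instruction_pointer);
  // classic BPF compares 32 bits at a time, so keep the range inside one
  // 4gb window where only the low word varies
  assert(start < end && (start >> 32) == ((end - 1) >> 32));
  // L0: load the high word of the instruction pointer
  f[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip + 4);
  // L1: wrong 4gb window? jump to L6
  f[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                      (uint32_t)(start >> 32), 0, 4);
  // L2: load the low word
  f[2] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip);
  // L3: below the start of the range? jump to L6
  f[3] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K,
                                      (uint32_t)start, 0, 2);
  // L4: past the last byte of the range? jump to L6
  f[4] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K,
                                      (uint32_t)(end - 1), 1, 0);
  // L5: the call came from the permitted range
  f[5] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
  // L6: the call came from somewhere else
  f[6] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL);
}
```

In a real program the start and end addresses would presumably come from linker symbols that bracket the privileged code, and the comparisons work because BPF_JGE and BPF_JGT are unsigned.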

Caveats

File system access is a blind spot. OpenBSD solves this with another
famous system call called unveil(), which lets users control file
system paths too. Right now there's no clear way to implement that
for Linux. However our pledge() polyfill does do a reasonable job in
restricting which file system operations are possible. But once you
permit the file system ops, the ops are allowed to happen on pretty
much any file the user has access to.

I personally don't view this as a problem. What I love about
pledge.com is it tells me if the programs I run that I downloaded
from random strangers on the Internet, are actually the good little
command line citizens that they claim to be. For example, if I
download a tool for computing some math, or compressing a file, then
it really shouldn't need any access except -p "stdio rpath",
especially if I'm able to use pipes. So I can use pledge.com to make
sure the command keeps its promise, and it lets me know if there are
any surprising behaviors. So this is great security if you're dealing
with command line programs that are written in a conscientious
manner. If it's only able to read files and can't talk to the
Internet, then seriously, what could it possibly do? It's such a
simple pareto-optimized niche that I can't believe no one's made it
easily addressable until now.

However, there's always going to be that one program you want that's
power hungry, possibly due to bloated frameworks and dependencies. In
that case, we may want access to some (but not all) of the file
system. pledge.com is able to address the need somewhat using
chroot(). It's worth noting though that chroot() has weaknesses that
kernel devs have refused to fix for decades. Most of the docs on this
subject are unprofessional and crazy. For example, the chroot(2) man
page is probably the only section 2 man page I've ever seen that
uses shell script code to describe its functionality. As far as I can
tell, the only convincing weakness with chroot() is that the jail is
only locked from the inside. If you take away the freedom of a
process by putting it in a chroot jail, then another process that's
free can use its freedom to bust its friend out of jail. For example,
here's how root can leave a backdoor that lets the process escape:

mkdir("/tmp/mydir", 0755);
// privileged user opens a backdoor
int dirfd = open("/tmp", O_RDONLY | O_DIRECTORY);
// process enters chroot jail
chdir("/tmp/mydir");
chroot("/tmp/mydir");
// process escapes jail
fchdir(dirfd);
chdir("..");
// list root directory
struct dirent *e;
DIR *d = opendir(".");
while ((e = readdir(d))) {
  printf("%s\n", e->d_name);
}
closedir(d);

The Linux devs could fix that if they wanted to. However I personally
don't see why it's a total dealbreaker, since pledge.com helps avoid
it by closing rogue file descriptors at startup using poll(). What's
even more surprising is that this weakness is also exploitable on
OpenBSD,
since they too seem to have given up on securing the traditional
chroot() call. But at least OpenBSD provides an alternative that's
easy to use, called unveil(). It'd be great to see that leadership
from the Linux kernel, but instead we just see blog posts from
companies like RedHat saying that having chroot() will make us more
insecure than having no security at all. It's like banning locks
because lockpick kits exist. RedHat must be experts at mental
gymnastics to publish such communiques. It's also comical that Linux
addresses the problem by restricting chroot() to the root user
account, since clearly something which is so "insecure" will become
more secure if you only do it from the most privileged user. What an
unfortunate state of affairs, since many of us have needed to look
elsewhere for answers, and the only folks offering those right now are
bloatware like Docker that locks in your filesystem with a bunch of
cryptically named tar files. And they say that Docker isn't a
security layer too! Even though it's based on things like cgroups which
are even more elite and difficult to understand than SECCOMP BPF. We
can only guess why the kernel devs do it. Maybe they're afraid of
issue workload burnout and figure people won't complain about
security if no one understands it! That's something we're working to
change.

It should also be noted that there are some features OpenBSD bakes
into pledge() that we're not able to polyfill with Linux SECCOMP BPF.
One of the things OpenBSD does is it can check file system paths, in
order to loosen up restrictions around things like accessing the time
zone database. This isn't a problem if you're a Cosmopolitan Libc
user, because APE binaries don't read tzdata from the filesystem and
instead embed time zone data inside the ZIP structure of the binary.
However it could potentially be problematic if you're using
pledge.com to launch binaries that are provided by your distro. Ask
your friendly distro maintainers to improve their security solutions.
If they can't, then you can always switch to Cosmopolitan Libc.

Another caveat is that, so far, I've only implemented the things
described in the OpenBSD pledge(2) manual page. We still need to
reconcile this properly with the primary materials which would be the
OpenBSD pledge() kernel source code. We also need more community
feedback to make sure there aren't things we haven't considered. For
example, Linux has a lot of sneaky capabilities in a shifting
landscape that aren't always widely understood, which can potentially
bite the authors of security tools, even when they've done due
diligence.

I've also only really tested this on console applications. If you
want a pledge() that's likely to work with GUIs, then, knowing the
way the Linux desktop goes, you really should consider SerenityOS
since Andreas added pledge() support a couple years ago.

Pledge Documentation

Pledging causes most system calls to become unavailable. Your system
call policy is enforced by the kernel, which means it can propagate
across execve() if permitted. This system call is supported on
OpenBSD and Linux where it's polyfilled using SECCOMP BPF. The way it
works on Linux is verboten system calls will raise EPERM, whereas
OpenBSD just kills the process while logging a helpful message to
/var/log/messages explaining which promise category you needed.
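
To show how a seccomp filter makes a forbidden call fail with EPERM rather than killing the process, here's a minimal sketch. The policy is made up for illustration (socket() is refused unless the address family is AF_UNIX) and isn't what the polyfill installs. It also assumes a little-endian machine, and a real filter should check seccomp_data.arch before trusting the syscall number.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical policy: every system call is allowed, except socket() with
// an address family other than AF_UNIX, which fails with EPERM.
static struct sock_filter kUnixSocketsOnly[] = {
    /*L0*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /*L1*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 5 - 2),
    /*L2*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /*L3*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 5 - 4, 0),
    /*L4*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /*L5*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

int InstallFilter(void) {
  struct sock_fprog prog = {
      .len = sizeof(kUnixSocketsOnly) / sizeof(kUnixSocketsOnly[0]),
      .filter = kUnixSocketsOnly,
  };
  assert(prog.len <= BPF_MAXINSNS);
  // needed so that a process without root may install a filter
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) return -1;
  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

// Tries the policy out in a child process. Returns 0 if it behaved as
// described, 1 if it didn't, 2 if the filter couldn't be installed, and
// -1 if the child couldn't be created or reaped normally.
int ProbeFilter(void) {
  int ws;
  pid_t pid = fork();
  if (pid == -1) return -1;
  if (!pid) {
    if (InstallFilter()) _exit(2);
    if (socket(AF_INET, SOCK_STREAM, 0) != -1 || errno != EPERM) _exit(1);
    if (socket(AF_UNIX, SOCK_STREAM, 0) == -1) _exit(1);
    _exit(0);
  }
  if (waitpid(pid, &ws, 0) != pid) return -1;
  if (!WIFEXITED(ws)) return -1;
  return WEXITSTATUS(ws);
}
```

ProbeFilter() runs the experiment in a child because, as with pledge(), a filter can't be removed once it's installed.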

By default exit and exit_group are always allowed. This is useful for
processes that perform pure computation and interface with the parent
via shared memory.

Once pledge is in effect, the chmod functions (if allowed) will not
permit the sticky/setuid/setgid bits to change. Linux will EPERM here
and OpenBSD should ignore those three bits rather than crashing.

User and group IDs can't be changed once pledge is in effect. OpenBSD
should ignore chown without crashing; whereas Linux will just EPERM.

Memory functions won't permit creating executable code after pledge.
Restrictions on origin of SYSCALL instructions will become enforced
on Linux (cf. msyscall) after pledge too, which means the process
gets killed if SYSCALL is used outside the .privileged section. One
exception is if the "exec" group is specified, in which case these
restrictions need to be loosened.

Using pledge is irreversible. On Linux it causes PR_SET_NO_NEW_PRIVS
to be set on your process; however, if "id" or "recvfd" are allowed
then they theoretically could permit the gaining of some new
privileges. You may call pledge() multiple times if "stdio" is
allowed. In that case, the process can only move towards a more
restrictive state.

pledge() can't filter file system paths or internet addresses. For
example, if you enable a category like "inet" then your process will
be able to talk to any internet address. The same applies to
categories like "wpath" and "cpath"; if enabled, any path the
effective user id is permitted to change will be changeable.

The Linux pledge() polyfill isn't able to support the OpenBSD
`execpromises` parameter.

Promises

Your promises string may include any of the following groups,
delimited by spaces.

stdio
    allows close, dup, dup2, dup3, fchdir, fstat, fsync, fdatasync,
    ftruncate, getdents, getegid, getrandom, geteuid, getgid,
    getgroups, getitimer, getpgid, getpgrp, getpid, getppid,
    getresgid, getresuid, getrlimit, getsid, wait4, gettimeofday,
    getuid, lseek, madvise, brk, arch_prctl, uname, set_tid_address,
    clock_getres, clock_gettime, clock_nanosleep, mmap (PROT_EXEC and
    weird flags aren't allowed), mprotect (PROT_EXEC isn't allowed),
    msync, munmap, nanosleep, pipe, pipe2, read, readv, pread, recv,
    poll, recvfrom, preadv, write, writev, pwrite, pwritev, select,
    send, sendto (only if addr is null), setitimer, shutdown,
    sigaction (but SIGSYS is forbidden), sigaltstack, sigprocmask,
    sigreturn, sigsuspend, umask, socketpair, ioctl(FIONREAD),
    ioctl(FIONBIO), ioctl(FIOCLEX), ioctl(FIONCLEX), fcntl(F_GETFD),
    fcntl(F_SETFD), fcntl(F_GETFL), fcntl(F_SETFL).
rpath
    (read-only path ops) allows chdir, getcwd, open(O_RDONLY),
    openat(O_RDONLY), stat, fstat, lstat, fstatat, access, faccessat,
    readlink, readlinkat, statfs, fstatfs.
wpath
    (write path ops) allows getcwd, open(O_WRONLY), openat(O_WRONLY),
    stat, fstat, lstat, fstatat, access, faccessat, readlink,
    readlinkat, chmod, fchmod, fchmodat.
cpath
    (create path ops) allows open(O_CREAT), openat(O_CREAT), rename,
    renameat, renameat2, link, linkat, symlink, symlinkat, unlink,
    rmdir, unlinkat, mkdir, mkdirat.
dpath
    (create special path ops) allows mknod, mknodat, mkfifo.
flock
    allows flock, fcntl(F_GETLK), fcntl(F_SETLK), fcntl(F_SETLKW).
tty
    allows ioctl(TIOCGWINSZ), ioctl(TCGETS), ioctl(TCSETS),
    ioctl(TCSETSW), ioctl(TCSETSF).
recvfd
    allows recvmsg(SCM_RIGHTS).
fattr
    allows chmod, fchmod, fchmodat, utime, utimes, futimens,
    utimensat.
inet
    allows socket(AF_INET), listen, bind, connect, accept, accept4,
    getpeername, getsockname, setsockopt, getsockopt, sendto.
unix
    allows socket(AF_UNIX), listen, bind, connect, accept, accept4,
    getpeername, getsockname, setsockopt, getsockopt.
dns
    allows socket(AF_INET), sendto, recvfrom, connect.
proc
    allows fork, vfork, kill, getpriority, setpriority, prlimit,
    setrlimit, setpgid, setsid.
thread
    allows clone, futex, and permits PROT_EXEC in mprotect.
id
    allows setuid, setreuid, setresuid, setgid, setregid, setresgid,
    setgroups, prlimit, setrlimit, getpriority, setpriority,
    setfsuid, setfsgid.
exec
    allows execve, execveat, access, faccessat. On Linux this also
    weakens some security to permit running APE binaries. However on
    OpenBSD they must be assimilated beforehand. On Linux, mmap() will
    be loosened up to allow creating PROT_EXEC memory (for APE
    loader) and system call origin verification won't be activated.
execnative
    allows execve, execveat. Can only be used to run native
    executables; you won't be able to run APE binaries. mmap() and
    mprotect() are still prevented from creating executable memory.
    System call origin verification can't be enabled. If you always
    assimilate your APE binaries, then this should be preferred.

Funding

[United States of Lemuria - two dollar bill - all debts public and
primate]

Funding for the development of pledge() on Linux was crowdsourced
from Justine Tunney's GitHub sponsors and Patreon subscribers. Your
support is what makes projects like Cosmopolitan Libc possible. Thank
you.
twitter.com/justinetunney

github.com/jart

Written by Justine Tunney

jtunney@gmail.com