https://justine.lol/pledge/

July 13th, 2022 @ justine's web page

Porting OpenBSD pledge() to Linux

[OpenBSD Blowfish Logo]

OpenBSD is an operating system that's famous for its focus on security. Unfortunately, OpenBSD leader Theo states that there are only 7000 users of OpenBSD. So it's a very small but elite group that wields a disproportionate influence, since we hear all the time about the awesome security features these guys get to use, even though we usually can't use them ourselves. Pledge is like the forbidden fruit we all covet when the boss says we must use things like Linux.

Why does it matter? It's because pledge() actually makes security comprehensible. Linux has never really had a security layer that mere mortals can understand. For example, let's say you want to do something on Linux like control whether or not some program you downloaded from the web is allowed to have telemetry. You'd need to write stuff like this:

    static const struct sock_filter kFilter[] = {
        /* L0*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall, 0, 14 - 1),
        /* L1*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[0])),
        /* L2*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 4 - 3, 0),
        /* L3*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 10, 0, 13 - 4),
        /* L4*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[1])),
        /* L5*/ BPF_STMT(BPF_ALU | BPF_AND | BPF_K, ~0x80800),
        /* L6*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 8 - 7, 0),
        /* L7*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 0, 13 - 8),
        /* L8*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(args[2])),
        /* L9*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 12 - 10, 0),
        /*L10*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 6, 12 - 11, 0),
        /*L11*/ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 17, 0, 13 - 12),
        /*L12*/ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /*L13*/ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, OFF(nr)),
        /*L14*/ /* next filter */
    };

Oh my gosh. It's like we traded one form of security privilege for another. OpenBSD limits security to a small pond, but makes it easy.
Linux is a big tent, but makes it impossibly hard. SECCOMP BPF might as well be the Traditional Chinese of programming languages, since only the small number of people who've devoted the oodles of time it takes to understand code like what you see above have actually been able to benefit from it. But if you've got OpenBSD privilege, then doing the same thing becomes easy:

    pledge("stdio rpath", 0);

That's really all OpenBSD users have to do to prevent things like leaks of confidential information. So how do we get it that simple on Linux? I believe the answer is to find someone with enough free time to figure out how to use SECCOMP BPF to implement pledge. The latest volunteer is me, so look upon my code ye mighty and despair.

  * cosmopolitan/libc/mem/pledge.c system call polyfill
  * cosmopolitan/tool/build/pledge.c pledge command
  * cosmopolitan/test/libc/mem/pledge_test.c unit tests

There have been a few devs in the past who've tried this. I'm not going to name names, because most of these projects were never completed. When it comes to SECCOMP, the online tutorials only explain how to whitelist the system calls themselves, so most people lose interest before figuring out how to filter arguments. The projects that got further along also had oversights, like allowing setuid/setgid/sticky bits to be changed. So none of the current alternatives should be used. I believe this effort gets us much closer to having pledge() than ever before.

Command Line Utility [Linux]

I originally wrote my pledge() polyfill for the redbean web server as a sandboxing solution. However, pledge() turned out to be robust enough as an abstraction that I thought it'd be useful to create a small command line utility which launches processes under pledge(), so that anyone can use it without having to configure it in C code.
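To make the "filtering arguments" step mentioned above a little more concrete, here's a minimal sketch of a SECCOMP BPF filter that looks at a system call argument rather than just the system call number. This is not the Cosmopolitan polyfill. The names deny_non_unix_sockets, run_filter_demo_in_child, and the SKETCH_ macros are made up for illustration, and it assumes little-endian x86-64 or aarch64 Linux with the kernel headers installed. It lets socket() through only for AF_UNIX and answers every other socket family with EPERM.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86-64 and aarch64"
#endif

#define SKETCH_OFF(f) offsetof(struct seccomp_data, f)

// Hypothetical helper, not part of pledge.com or Cosmopolitan.
// Installs the filter on the calling thread; returns 0 or -1 with errno set.
int deny_non_unix_sockets(void) {
  struct sock_filter filter[] = {
      /* 0 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKETCH_OFF(arch)),
      /* 1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
      /* 2 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),  // foreign ABI
      /* 3 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKETCH_OFF(nr)),
      /* 4 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),
      /* 5 */ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKETCH_OFF(args[0])),
      /* 6 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 1, 0),
      /* 7 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
      /* 8 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
  };
  struct sock_fprog prog = {
      .len = sizeof(filter) / sizeof(filter[0]),
      .filter = filter,
  };
  // Unprivileged processes must set no_new_privs before adding a filter.
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) return -1;
  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

// Applies the filter in a forked child so the caller stays unfiltered.
// Returns 0 if AF_INET got EPERM while AF_UNIX still worked.
int run_filter_demo_in_child(void) {
  int ws;
  pid_t pid = fork();
  if (pid == -1) return -1;
  if (!pid) {
    if (deny_non_unix_sockets()) _exit(2);
    if (socket(AF_INET, SOCK_STREAM, 0) != -1 || errno != EPERM) _exit(3);
    if (socket(AF_UNIX, SOCK_STREAM, 0) == -1) _exit(4);
    _exit(0);
  }
  if (waitpid(pid, &ws, 0) == -1) return -1;
  return WIFEXITED(ws) ? WEXITSTATUS(ws) : -1;
}
```

Calling run_filter_demo_in_child() from a main() of your own should return 0 on a kernel with SECCOMP enabled. A real filter needs a lot more than this (for instance, this sketch ignores the x32 ABI and only ever looks at one system call), which is the kind of detail a pledge() polyfill has to take care of for you.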
pledge.com
44kb - x86-64 elf executable (debug data, source code)
Written by Justine Alexandra Roberts Tunney (Twitter, GitHub, LinkedIn)
ab61efbc68afc94a5812bacd4c93d91f1da3b8fb267a2622724821cd9cace169

That binary will work on all Linux distros since RHEL6. Root privileges are not required. You just use it to wrap your command invocations. It's so tiny and lightweight that it only adds a few microseconds of startup latency to your program. It's great for shell scripts and automated tools. For example, if you want to run the list directory command, and only permit that command to do basic stdio and filesystem path reading, you'd say:

    $ wget https://justine.lol/pledge/pledge.com
    $ chmod +x pledge.com
    $ ./pledge.com -p 'stdio rpath' ls
    file listing output...

You can now be certain your ls command isn't doing things like spying on you, or uploading your bitcoin wallet to the cloud. However, let's say authorizing network access is what you want. One command that has a real legitimate need for that is curl, which can be configured as follows:

    $ ./pledge.com -p 'stdio rpath inet thread' curl http://justine.lol/hello.txt
    hello world

Here's another example. Let's say you have a public ssh server and you want to let people read and take notes of your book collection, but you don't want anyone rewriting your books. In that case, you can repurpose something like the nano command as a strictly read-only editor. Since nano has a TUI, you'd need to grant it TTY privileges.

    $ ./pledge.com -np 'stdio rpath tty' nano ~/books/bofh.txt

Troubleshooting

If your program crashes, then you can figure out why by tracing the binary and seeing which system call is EPERM'ing. The ls invocation earlier used the default set of promises (which makes -p 'stdio rpath' redundant), so let's see what happens if we reduce the privileges to just stdio.

    $ strace -ff ./pledge.com -p stdio ls
    open("/etc/ld-musl-x86_64.path", O_RDONLY|O_CLOEXEC) = -1 EPERM (Operation not permitted)

Well that didn't take long. Now that you know what's wrong, you would then consult the Promises section to see which promise you need. For example, you'd know open(O_RDONLY) is provided by rpath and that in order to fork() you need -p proc.

Resource Limits

In addition to polyfilling pledge, your pledge command is also able to apply some other very important safety hacks that aren't obvious to the uninitiated. For example, we've all run a program before that hammers the system. Linux is very generous in how much memory programs can allocate. An accidental loop in just one program, by default on Linux, will absolutely take the whole machine out of commission for a few minutes before the "OOM Killer" kicks in. In other cases, like a fork() bomb, the default Linux environment provides no such protection, so it's essentially equivalent to a blue screen of death.

Your pledge command imposes some perfectly reasonable resource quotas on programs by default, to prevent that from happening. By default, unless you tune the flags, a program is allowed to use only 4gb of memory and, if you've permitted it to fork off new processes, it won't be able to run more of them at the same time than twice your number of CPUs. That way your sandbox won't compromise the stability of your machine.

We also have a niceness feature. Have you ever had a program use so much disk i/o that everything crawls to a halt? You run some program, and then suddenly every small file takes seconds to load in Emacs? Your pledge command can fix that. If you've got a compute-heavy, long-running program, then pass the -n flag for a nice that's actually nice. The naive nice command doesn't really do much, since it doesn't change the scheduler and it doesn't change the i/o priority. This command actually does.
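As a rough illustration of the knobs involved, the sketch below lowers the CPU niceness, the i/o priority, and the scheduler class of the calling process on Linux. It isn't taken from the pledge.com source. be_actually_nice is a hypothetical name, and the ioprio constants are written out by hand (they're the values from the kernel's ioprio ABI) because glibc doesn't provide a wrapper for ioprio_set().

```c
#include <sched.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_IDLE
#define SCHED_IDLE 5  // Linux value; glibc only exposes it under _GNU_SOURCE
#endif

// Hand-written copies of the kernel's ioprio ABI constants.
#define SKETCH_IOPRIO_WHO_PROCESS 1
#define SKETCH_IOPRIO_CLASS_IDLE 3
#define SKETCH_IOPRIO_CLASS_SHIFT 13

// Hypothetical helper: returns how many of the three adjustments failed,
// so 0 means the process is now as polite as Linux lets it be.
int be_actually_nice(void) {
  int failures = 0;
  struct sched_param param = {0};  // priority must be 0 for SCHED_IDLE

  // (1) classic niceness: 19 is the lowest CPU priority
  if (setpriority(PRIO_PROCESS, 0, 19)) ++failures;

  // (2) i/o priority: the idle class only gets disk time nobody else wants
  if (syscall(SYS_ioprio_set, SKETCH_IOPRIO_WHO_PROCESS, 0,
              SKETCH_IOPRIO_CLASS_IDLE << SKETCH_IOPRIO_CLASS_SHIFT) == -1) {
    ++failures;
  }

  // (3) scheduler policy: SCHED_IDLE ranks below even nice 19 tasks
  if (sched_setscheduler(0, SCHED_IDLE, &param)) ++failures;

  return failures;
}
```

A launcher would call something like this right before execve(), since all three settings are inherited by the program it starts. None of the three steps needs root, because each one only ever lowers priority.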
Using the -n flag will guarantee the sandboxed program will stay out of the way, since the kernel will only let it use spare capacity.

Pledge Command Flags

-n
    Apply maximum niceness to program. This means (1) nice is set to 19, (2) i/o priority is set to idle, and (3) scheduler is set to idle.

-N
    Don't normalize file descriptors. By default, pledge.com guarantees (1) the stdio file descriptors exist, and (2) file descriptors that the parent process or shell forgot to close will be closed. In the latter case, we only poll up to fd=256 which is fast, but the number may be lower depending on system limits.

-g GID
    Call setgid() before executing program (not allowed if setuid binary)

-u UID
    Call setuid() before executing program (not allowed if setuid binary)

-c PATH
    Call chroot() before executing program (needs root privileges)

-C SECS
    Set cpu limit in seconds [default: inherited]

-M BYTES
    Set virtual memory limit in bytes [default: 4gb]

-P PROCS
    Set process limit [default: GetCpuCount()*2]

-F BYTES
    Set individual file size limit [default: 4gb]

-p PLEDGE
    Defaults to -p 'stdio rpath'. It's repeatable. May contain any of the following, separated by spaces. See also the Promises section below, which goes into much greater depth on what each category does.

    + stdio: allow stdio and benign system calls
    + rpath: read-only path ops
    + wpath: write path ops
    + cpath: create path ops
    + dpath: create special files
    + flock: file locks
    + tty: terminal ioctls
    + recvfd: allow SCM_RIGHTS
    + fattr: allow changing some struct stat bits
    + inet: allow IPv4 and IPv6
    + unix: allow local sockets
    + dns: allow dns
    + proc: allow fork, clone and friends
    + thread: allow clone
    + id: allow setuid and friends
    + exec: allow executing ape binaries

Securing APE Binaries

Actually Portable Executables should be written to call pledge() internally. But if you want to use the pledge.com command to secure an APE binary that doesn't, then you need to convert (or "assimilate") it into the ELF format beforehand.
You can usually do this by saying:

    $ file redbean.com
    redbean.com: DOS/MBR boot sector
    $ ./redbean.com --assimilate
    $ file redbean.com
    redbean.com: ELF 64-bit LSB executable

Please note that this won't work if you're using binfmt_misc with the new APE Loader, because then you can't run the APE shell script to assimilate your binary. We instead provide a new assimilate.com program which can be used to convert APE programs to ELF or Mach-O.

assimilate.com
Works on x86-64 Linux+Mac+Windows+FreeBSD+NetBSD+OpenBSD
92kb - PE+ELF+MachO+ZIP+SH executable (debug data, source code)
Written by Justine Alexandra Roberts Tunney (Twitter, GitHub, LinkedIn)
593a8119049e9e8a88d29f80af83bfdbb5fcdd8a4cbad934af05dd6a5145ce77

C API

Pledge works best when developing software using Cosmopolitan Libc. You can get started relatively easily writing pledge() programs using the cosmopolitan monorepo. The zero config solution is to just plop your program file into the examples folder. Start by cloning the repo:

    $ git clone https://github.com/jart/cosmopolitan
    $ cd cosmopolitan
    $ nano examples/mypledge.c

You can then copy and paste this code:

    #include "libc/calls/calls.h"
    #include "libc/stdio/stdio.h"

    int main() {
      pledge("stdio", 0);
      printf("hello world\n");
    }

You can then build and run your program as follows:

    $ make -j8 o//examples/mypledge.com
    $ o//examples/mypledge.com
    hello world

One of the things you may have noticed about the pledge.com command is that its most restrictive mode (pledge.com -p "" cmd...) can't actually be used. Your program will just crash. That's because it's intended for the C API. What it means is that your process or thread won't be able to call any system call except exit. Such a program might sound impossible, but you can actually communicate between processes using shared memory. For example, here's how you'd do it with threads.

    int enclave(void *arg, int tid) {
      if (pledge("", 0)) return 1;
      int *job = arg;               // get job
      job[0] = job[0] + job[1];     // do work
      return 0;                     // exit
    }

    int main() {
      struct spawn worker;
      int job[2] = {2, 2};              // create workload
      _spawn(enclave, job, &worker);    // create worker
      _join(&worker);                   // wait for exit
      assert(job[0] == 4);              // check result
    }

The above example shows an enclaved worker doing some kind of computational task, possibly executing untrusted code, and then storing the result to some memory location that the parent thread can see when the worker has finished executing. It works great and is fast.

One of the disadvantages of the above example is that the enclaved worker has unfettered access to your stack memory and might make a mess of things. That's potentially creepy and not very enclaved. One way to fix that is to use fork() instead of threads. In that case, you can explicitly whitelist which memory is shared.

    int ws;
    // create small shared memory region
    int *job = mmap(0, FRAMESIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    job[0] = 2;  // create workload
    job[1] = 2;
    if (!fork()) {  // create enclaved worker
      if (pledge("", 0)) _Exit(1);
      job[0] = job[0] + job[1];  // do work
      _Exit(0);
    }
    wait(&ws);  // wait for worker
    assert(WIFEXITED(ws));
    assert(WEXITSTATUS(ws) == 0);
    assert(job[0] == 4);  // check result
    munmap(job, FRAMESIZE);

Most of the Cosmopolitan Libc unit tests have been set up to use pledge() these days. That's not necessarily because we're concerned about them being compromised, but because the pledge function has outstanding documentation value in helping people understand our tests, since it readily communicates what system functionality they need.
For example, our tests for the access() filesystem function say:

    __attribute__((__constructor__)) static void init(void) {
      pledge("stdio rpath wpath cpath fattr", 0);
      errno = 0;
    }

System Call Origin Verification

When you write your own Actually Portable Executables, you also get some added security benefits compared to pledge.com. For example, another famous OpenBSD system call is msyscall(), which causes the kernel to validate the RIP register of anything that issues a system call. In Cosmopolitan, calling pledge() will automatically polyfill that feature too, so that only functions annotated with the privileged keyword are allowed to use SYSCALL. What that means is if someone manages to compromise your server and inject executable code into your program's memory, then that code will effectively have pledge("", 0) privileges, even if your app specified something much broader when it called pledge(). The redbean web server's unix.pledge() function is also able to take advantage of this.

Caveats

File system access is a blind spot. OpenBSD solves this with another famous system call called unveil(), which lets users control file system paths too. Right now there's no clear way to implement that for Linux. However, our pledge() polyfill does do a reasonable job of restricting which file system operations are possible. But once you permit the file system ops, they're allowed to happen on pretty much any file the user has access to.

I personally don't view this as a problem. What I love about pledge.com is that it tells me whether the programs I downloaded from random strangers on the Internet are actually the good little command line citizens they claim to be. For example, if I download a tool for computing some math, or compressing a file, then it really shouldn't need any access except -p "stdio rpath", especially if I'm able to use pipes. So I can use pledge.com to make sure the command keeps its promise, and to let me know if there are any surprising behaviors.
So this is great security if you're dealing with command line programs that are written in a conscientious manner. If a program is only able to read files and can't talk to the Internet, then seriously, what could it possibly do? It's such a simple pareto-optimized niche that I can't believe no one's made it easily addressable until now.

However, there's always going to be that one program you want that's power hungry, possibly due to bloated frameworks and dependencies. In that case, we may want to permit access to some (but not all) of the file system. pledge.com is able to address the need somewhat using chroot(). It's worth noting though that chroot() has weaknesses that kernel devs have refused to fix for decades. Most of the docs on this subject are unprofessional and crazy. For example, the chroot(2) man page is probably the only section 2 man page I've ever seen that uses shell script code to describe its functionality.

As far as I can tell, the only convincing weakness with chroot() is that the jail is only locked from the inside. If you take away the freedom of a process by putting it in a chroot jail, then another process that's free can use its freedom to bust its friend out of jail. For example, here's how root can leave a backdoor that lets the process escape:

    mkdir("/tmp/mydir", 0755);

    // privileged user opens a backdoor
    int dirfd = open("/tmp", O_RDONLY | O_DIRECTORY);

    // process enters chroot jail
    chdir("/tmp/mydir");
    chroot("/tmp/mydir");

    // process escapes jail
    fchdir(dirfd);
    chdir("..");

    // list root directory
    struct dirent *e;
    DIR *d = opendir(".");
    while ((e = readdir(d))) {
      printf("%s\n", e->d_name);
    }
    closedir(d);

The Linux devs could fix that if they wanted to. However, I personally don't see why it's a total dealbreaker, since pledge.com helps avoid it by closing rogue file descriptors at startup using poll().
What's even more surprising is that this chroot() weakness is also exploitable on OpenBSD, since they too seem to have given up on securing the traditional chroot() call. But at least OpenBSD provides an alternative that's easy to use, called unveil(). It'd be great to see that leadership from the Linux kernel, but instead we just see blog posts from companies like Red Hat saying that having chroot() will make us more insecure than having no security at all. It's like banning locks because lockpick kits exist. Red Hat must be experts at mental gymnastics to publish such communiques. It's also comical that Linux addresses the problem by restricting chroot() to the root user account, since clearly something which is so "insecure" will become more secure if you only do it from the most privileged user.

What an unfortunate state of affairs, since many of us have needed to look elsewhere for answers, and the only thing on offer right now is bloatware like Docker that locks in your filesystem with a bunch of cryptically named tar files. And they say that Docker isn't a security layer too! Even though it's based on things like cgroups, which are even more elite and difficult to understand than SECCOMP BPF. We can only guess why the kernel devs do it. Maybe they're afraid of issue workload burnout and figure people won't complain about security if no one understands it! That's something we're working to change.

It should also be noted that there are some features OpenBSD bakes into pledge() that we're not able to polyfill with Linux SECCOMP BPF. One of the things OpenBSD does is check file system paths, in order to loosen up restrictions around things like accessing the time zone database. This isn't a problem if you're a Cosmopolitan Libc user, because APE binaries don't read tzdata from the filesystem and instead embed time zone data inside the ZIP structure of the binary.
However, it could potentially be problematic if you're using pledge.com to launch binaries that are provided by your distro. Ask your friendly distro maintainers to improve their security solutions. If they can't, then you can always switch to Cosmopolitan Libc.

Another caveat is that, so far, I've only implemented the things described in the OpenBSD pledge(2) manual page. We still need to reconcile this properly with the primary material, which would be the OpenBSD pledge() kernel source code. We also need more community feedback to make sure there aren't things we haven't considered. For example, Linux has a lot of sneaky capabilities in a shifting landscape that aren't always widely understood, which can potentially bite the authors of security tools, even when they've done due diligence.

I've also only really tested this on console applications. If you want a pledge() that's likely to work with GUIs, then, knowing the way the Linux desktop goes, you really should consider SerenityOS, since Andreas added pledge() support a couple years ago.

Pledge Documentation

Pledging causes most system calls to become unavailable. Your system call policy is enforced by the kernel, which means it can propagate across execve() if permitted. This system call is supported on OpenBSD and Linux, where it's polyfilled using SECCOMP BPF. The way it works on Linux is that verboten system calls will raise EPERM, whereas OpenBSD just kills the process while logging a helpful message to /var/log/messages explaining which promise category you needed.

By default exit and exit_group are always allowed. This is useful for processes that perform pure computation and interface with the parent via shared memory.

Once pledge is in effect, the chmod functions (if allowed) will not permit the sticky/setuid/setgid bits to change. Linux will EPERM here and OpenBSD should ignore those three bits rather than crashing.

User and group IDs can't be changed once pledge is in effect.
OpenBSD should ignore chown without crashing, whereas Linux will just EPERM.

Memory functions won't permit creating executable code after pledge. Restrictions on the origin of SYSCALL instructions will become enforced on Linux (cf. msyscall) after pledge too, which means the process gets killed if SYSCALL is used outside the .privileged section. One exception is if the "exec" group is specified, in which case these restrictions need to be loosened.

Using pledge is irreversible. On Linux it causes PR_SET_NO_NEW_PRIVS to be set on your process; however, if "id" or "recvfd" are allowed then they theoretically could permit the gaining of some new privileges. You may call pledge() multiple times if "stdio" is allowed. In that case, the process can only move towards a more restrictive state.

pledge() can't filter file system paths or internet addresses. For example, if you enable a category like "inet" then your process will be able to talk to any internet address. The same applies to categories like "wpath" and "cpath"; if enabled, any path the effective user id is permitted to change will be changeable.

The Linux pledge() polyfill isn't able to support the OpenBSD `execpromises` parameter.

Promises

Your promises are given as a string that may include any of the following groups, delimited by spaces.

stdio allows close, dup, dup2, dup3, fchdir, fstat, fsync, fdatasync, ftruncate, getdents, getegid, getrandom, geteuid, getgid, getgroups, getitimer, getpgid, getpgrp, getpid, getppid, getresgid, getresuid, getrlimit, getsid, wait4, gettimeofday, getuid, lseek, madvise, brk, arch_prctl, uname, set_tid_address, clock_getres, clock_gettime, clock_nanosleep, mmap (PROT_EXEC and weird flags aren't allowed), mprotect (PROT_EXEC isn't allowed), msync, munmap, nanosleep, pipe, pipe2, read, readv, pread, recv, poll, recvfrom, preadv, write, writev, pwrite, pwritev, select, send, sendto (only if addr is null), setitimer, shutdown, sigaction (but SIGSYS is forbidden), sigaltstack, sigprocmask, sigreturn, sigsuspend, umask, socketpair, ioctl(FIONREAD), ioctl(FIONBIO), ioctl(FIOCLEX), ioctl(FIONCLEX), fcntl(F_GETFD), fcntl(F_SETFD), fcntl(F_GETFL), fcntl(F_SETFL).

rpath (read-only path ops) allows chdir, getcwd, open(O_RDONLY), openat(O_RDONLY), stat, fstat, lstat, fstatat, access, faccessat, readlink, readlinkat, statfs, fstatfs.

wpath (write path ops) allows getcwd, open(O_WRONLY), openat(O_WRONLY), stat, fstat, lstat, fstatat, access, faccessat, readlink, readlinkat, chmod, fchmod, fchmodat.

cpath (create path ops) allows open(O_CREAT), openat(O_CREAT), rename, renameat, renameat2, link, linkat, symlink, symlinkat, unlink, rmdir, unlinkat, mkdir, mkdirat.

dpath (create special path ops) allows mknod, mknodat, mkfifo.

flock allows flock, fcntl(F_GETLK), fcntl(F_SETLK), fcntl(F_SETLKW).

tty allows ioctl(TIOCGWINSZ), ioctl(TCGETS), ioctl(TCSETS), ioctl(TCSETSW), ioctl(TCSETSF).

recvfd allows recvmsg(SCM_RIGHTS).

fattr allows chmod, fchmod, fchmodat, utime, utimes, futimens, utimensat.

inet allows socket(AF_INET), listen, bind, connect, accept, accept4, getpeername, getsockname, setsockopt, getsockopt, sendto.

unix allows socket(AF_UNIX), listen, bind, connect, accept, accept4, getpeername, getsockname, setsockopt, getsockopt.

dns allows socket(AF_INET), sendto, recvfrom, connect.

proc allows fork, vfork, kill, getpriority, setpriority, prlimit, setrlimit, setpgid, setsid.

thread allows clone, futex, and permits PROT_EXEC in mprotect.

id allows setuid, setreuid, setresuid, setgid, setregid, setresgid, setgroups, prlimit, setrlimit, getpriority, setpriority, setfsuid, setfsgid.

exec allows execve, execveat, access, faccessat. On Linux this also weakens some security to permit running APE binaries. However, on OpenBSD they must be assimilated beforehand. On Linux, mmap() will be loosened up to allow creating PROT_EXEC memory (for APE loader) and system call origin verification won't be activated.

execnative allows execve, execveat. Can only be used to run native executables; you won't be able to run APE binaries. mmap() and mprotect() are still prevented from creating executable memory. System call origin verification can't be enabled. If you always assimilate your APE binaries, then this should be preferred.

Funding

[United States of Lemuria - two dollar bill - all debts public and primate]

Funding for the development of pledge() on Linux was crowdsourced from Justine Tunney's GitHub sponsors and Patreon subscribers. Your support is what makes projects like Cosmopolitan Libc possible. Thank you.

twitter.com/justinetunney
github.com/jart

Written by Justine Tunney
jtunney@gmail.com