https://www.muppetlabs.com/~breadbox/txt/mopb.html

My Own Private Binary

An Idiosyncratic Introduction to Linux Kernel Modules

---------------------------------------------------------------------

How This Began

Several years ago, I spent a serious chunk of time figuring out how
to make really teensy ELF executable files. I started down this path
because I was annoyed that all of my programs, no matter how short
they were, never got smaller than 4k or so. I felt that was
excessive, for C, and so I started looking at what ELF files
contained, and how much of that actually needed to be there. (And
then, after a while, how much of it was supposed to be there but
could be ripped out anyway.) Anyway, I eventually managed to shrink
an executable down to 45 bytes, and I was able to demonstrate that
that was the smallest possible size an ELF executable could be and
still run, under x86 Linux at least.

I wrote up my findings, and some people found it interesting, and I
got some positive feedback. A couple of people naturally pointed out
that a shell script that did the same thing was much shorter than 45
bytes, to which my response was always that a shell script is not an
executable, and if you want to consider scripts then you need to
include the size of the interpreter binary along with the script
size.

But then one Internet Random Person(tm) pointed out that I could have
made a smaller executable if I had created an aout binary instead. If
you don't know what aout files are, don't worry -- that just means
you're not old. (They are also called "a.out files", but that can be
easily confused with just a file named "a.out", so I prefer to spell
the name of the format "aout".) At one time the aout format was
widely used on Linux, because it was Linux's only binary format. ELF
was not introduced to Linux until version 2.0 (or more precisely in
one of the 1.x experimental kernels). aout was, and still is, a very
simple format. It sports a 32-byte header, along with a handful of
other metadata. Unfortunately the aout format has some annoying
limitations around dynamic linking, so the fact that Linux switched
away from it early on is not too surprising. ELF is a much nicer
format for a mature system. But even though aout binaries were no
longer fashionable, they still worked.
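
To give a sense of how little there is to an aout header, here is a minimal sketch of its layout as a C struct. The field names follow the traditional "struct exec" naming and are not taken from this essay; treat the struct name and the comments as illustrative assumptions rather than a copy of any particular header file.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the classic 32-bit a.out header: eight 32-bit words.
 * The struct name is hypothetical; the field names follow the
 * traditional "struct exec" layout. */
struct aout_header {
    uint32_t a_info;    /* magic number plus machine and flag bits */
    uint32_t a_text;    /* size of the text segment, in bytes */
    uint32_t a_data;    /* size of the initialized data, in bytes */
    uint32_t a_bss;     /* size of the uninitialized data, in bytes */
    uint32_t a_syms;    /* size of the symbol table, in bytes */
    uint32_t a_entry;   /* address of the entry point */
    uint32_t a_trsize;  /* size of the text relocation info */
    uint32_t a_drsize;  /* size of the data relocation info */
};

/* Compile-time check that the layout really is 32 bytes. */
static_assert(sizeof(struct aout_header) == 32,
              "a.out header should be 32 bytes");
```

Eight fixed-width words and nothing else: that is the whole header.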

However, when I first tried to run the aout executable the person had
sent me, I got an "Exec format error" message -- i.e. this file format
is not supported. It turned out that a security issue had been
uncovered at one point that involved aout core dumps. (Did you know?
Executable file formats come with their own core dump file formats to
match.) I'm vague on the details, but as a result most distros
started compiling their kernels without aout support. The format was
considered to be pretty thoroughly deprecated by that time, so it
wasn't seen as a difficult call to make.

(Support for aout is still present in the kernel source tree,
however, and you're free to include it if you compile your own
kernel. At the time of this writing there was talk about removing it
entirely, but apparently some architectures still have a use for it.
Some people have suggested removing support just for aout core dumps,
but in the absence of a pressing issue it remains as it is. The whole
conversation is a good reminder that adding a feature to software is
often far easier than removing it.)

But so I did compile a kernel with aout support, and verify that the
35-byte binary did in fact work. And the whole thing got me to
wondering, how many executable file formats does Linux actually
support? I looked into it, and I found out that the way in which
Linux handles binary formats is a dynamic feature of the kernel. That
is to say, it's relatively straightforward to add support for a new
format, without having to recompile your kernel, or reboot your
machine even.

An aside: I want to clarify that I'm not talking about the
"miscellaneous binary format" feature of the kernel. That feature
allows you to dynamically designate an interpreter to be run when the
user attempts to execute certain files. Thus, for example, running a
file ending in .jar can automatically invoke the JavaVM for you. That
feature is controlled via the /proc/sys/fs/binfmt_misc system, so
check out the binfmt_misc documentation if you're curious.
Interpreters are not what I'm interested in here, though; I'm
focusing exclusively on actual binary files.

What I was secretly hoping to discover was if a "flat" format existed
-- that is, a binary file format with no metadata at all. Obviously,
such a format would allow for even smaller executables. No such
format was supported, however. But that isn't too surprising, as such
a format isn't very useful. Where there is no metadata, there are no
features, no options. A flat format is a one-size-fits-all approach,
and that's not what most people need from their binary format
standards.

In order to avoid confusion, I should mention here that the Linux
kernel does have support for a format that is called "flat", but this
is just the name of the uClinux native binary format. It is actually
larger and more featureful than the aout format from which it was
derived. Presumably this format is flat along some other dimension.

Despite such shortcomings, it so happens that there is a flat,
metadata-less executable file format that is well supported on
another popular OS. If you haven't already guessed, I'm referring to
the .com file format that MS Windows supports, having inherited it
from MS-DOS, which in turn inherited it from CP/M. It is truly flat.
When you run a .com file, the OS loads the whole thing in memory at a
standard address and runs it. And this approach works okay on a
single-tasking system like MS-DOS. (Or rather, on the MS-DOS-like
subsystem that MS Windows presents to a running .com file.) In that
environment, the OS can say, "Here you go, program. You have 640
kilobytes of RAM. Have fun!"

And so naturally I asked myself, what would it take to get a .com
file format working under Linux? I mean yes it would not be terribly
useful ... but it would let me make the smallest executable file ever.
Sure, I could never hope to get support for such a format added to
the actual Linux kernel. But I could add it to my own kernel, where
at least I would be able to use it myself. I could be living in my
own private binary.

Kernel Modules

Linux makes it easy to do this sort of thing via loadable kernel
modules. What exactly is a kernel module? Basically, a kernel module
is an object file built for the kernel that just hasn't been linked
yet. The trick is that you can link one into a running kernel without
having to stop the kernel or even pull over. (This isn't quite the
same thing as what is generally meant by "dynamic linking", though it
is very similar in spirit.) Kernel modules mainly allow users to
manage support for various kinds of hardware dynamically, but they
allow you to add support for all kinds of things without having to
recompile the kernel. So let's take a moment and look at how to
create a kernel module.

Before we get started, we need to make sure that the kernel's header
files are present on the machine. For users on Debian-based systems,
this is typically done with the following shell command. (uname(1)
is used to ensure that you're getting the files that match the
specific version of the kernel you're currently running.)

$ sudo apt install linux-headers-$(uname -r)
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
linux-headers-4.15.0-156-generic
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
...
/etc/kernel/header_postinst.d/dkms:
* dkms: running auto installation service for kernel 4.15.0-156-generic
...done.

This will install a directory under /usr/src corresponding to your
current kernel version. (In fact, you may find that you already
have several directories located there, one for every version of the
kernel that you can currently boot into.) This directory contains,
among other things, all of the header files that we'll need to build
kernel modules.

We'll start with a very simple one, predictably named "hello kernel".

 hello.c

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("me");
MODULE_DESCRIPTION("hello kernel");
MODULE_VERSION("0.1");

static int __init hello_init(void)
{
    printk(KERN_INFO "hello, kernel\n");
    return 0;
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "goodbye, kernel\n");
}

module_init(hello_init);
module_exit(hello_exit);

A quick rundown of what is being done here:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

The header file linux/module.h defines the MODULE_* macros,
linux/init.h defines the __init and __exit macros, and
linux/kernel.h gives us the printk() function.

MODULE_LICENSE("GPL");
MODULE_AUTHOR("me");
MODULE_DESCRIPTION("hello kernel");
MODULE_VERSION("0.1");

These macros just insert some metadata into our module. You can view
this information via the modinfo(8) utility. Among other things, the
kernel keeps track of the presence of non-free software.

...

module_init(hello_init);
module_exit(hello_exit);

And then there are two special functions, the init function and the
exit function. The module_init() and module_exit() macros mark which
function is which, so the kernel can find them. The first one gets
called at the time the module is inserted into a running kernel, and
the second one gets called when the module is being removed from the
kernel.

The kernel folks have made it extremely easy to build kernel modules.
Here's the makefile:

 Makefile

obj-m = hello.o
kver = $(shell uname -r)
all:
        make -C /lib/modules/$(kver)/build/ M=$(PWD) modules
clean:
        make -C /lib/modules/$(kver)/build M=$(PWD) clean

All you do is put the name of your object file on the first line, and
everything else is done for you.

So, if we run make(1), a bunch of stuff will get created:

$ ls
hello.c Makefile
$ make
make -C /lib/modules/4.15.0-156-generic/build/ M=/home/breadbox/km/hello modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-156-generic'
CC [M] /home/breadbox/km/hello/hello.o
Building modules, stage 2.
MODPOST 1 modules
CC /home/breadbox/km/hello/hello.mod.o
LD [M] /home/breadbox/km/hello/hello.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-156-generic'
$ ls
hello.c hello.mod.c hello.o modules.order
hello.ko hello.mod.o Makefile Module.symvers

There's hello.o, a regular object file, but we also have an object
file named hello.ko. This is the kernel module. We have become kernel
developers.

The insmod(8) tool can be used to load this module into the running
kernel:

$ sudo insmod ./hello.ko
$ lsmod | head
Module Size Used by
hello 16384 0
nls_iso8859_1 16384 0
uas 24576 0
usb_storage 69632 1 uas
btrfs 1155072 0
zstd_compress 163840 1 btrfs
xor 24576 1 btrfs
raid6_pq 114688 1 btrfs
ufs 77824 0

When we type lsmod(8) to list the active modules, it's right there at
the top, as the most recently added module.

The actual effect of this module is simply to output some log
messages. We can use dmesg(8) to verify that our module really did
load:

$ dmesg | tail -n4
[2419362.787463] e1000e: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[2448541.604799] EXT4-fs (sda1): re-mounted. Opts: (null)
[2474408.687761] usb 5-1: USB disconnect, device number 14
[2478627.755269] hello, kernel

We can then use rmmod(8) to remove the module from the running kernel
whenever we want:

$ sudo rmmod hello
$ dmesg | tail -n4
[2448541.604799] EXT4-fs (sda1): re-mounted. Opts: (null)
[2474408.687761] usb 5-1: USB disconnect, device number 14
[2478627.755269] hello, kernel
[2478639.607702] goodbye, kernel

So that's all there is to writing kernel modules.

No, of course that's not true. Writing a useful kernel module does
require some specialized knowledge. For example, kernel modules don't
have access to libc, since libc itself is mainly an abstraction layer
that sits atop the kernel. That said, much of the functionality
you're used to having easy access to as a C programmer is also
present in the kernel, though sometimes in slightly different dress.
(Filesystems, for example, are one bit of detail that we as Unix
programmers often ignore, but are obviously a major concern inside
the kernel.)

But don't let all that deter you. The rewards of writing kernel
modules are worth the inconveniences. There is a great deal that you
can do inside a kernel module that is simply impossible outside of
it. Remember, Linux is a monolithic kernel -- which means that once
you are loaded, you have the keys to the kingdom. Your lowly,
nonstandard kernel module can do anything. Of course, this is a
double-edged sword, because that also means that you can accidentally
do anything. As it happens, a lot of bad things are actually pretty
hard to do accidentally, such as mucking up some other process's
code. You'd need to jump through a few hoops just to get a pointer to
someone else's memory. But some unfortunate things are remarkably
easy to do. For example, one time while working on my kernel module,
I accidentally put --i instead of ++i in the iterator of my for loop.
I inserted that module into my kernel to test it, and my mouse cursor
disappeared, and my music stopped playing ... and then it was time to
reboot my computer.

But that sort of risk shouldn't scare you away. With modern
journaling file systems and the like, you're never going to be at any
real risk of losing data. (I mean, unless you're actually working on
implementing a filesystem, in which case please back up your files
regularly.) I encourage you to experiment and try out your own ideas
for kernel modules.

Binary File Formats Under Linux

All right, but what exactly does this have to do with making smaller
executables? Well, as I mentioned earlier, the kernel's list of
accepted binary file formats is dynamic. Specifically, this means
that there are functions inside the kernel that allow code to add and
remove binary formats from this list.

This is done by registering a set of callback functions, and these
callbacks get invoked when the kernel is asked to execute a binary
file. The kernel invokes the callbacks on this list, and the first
one that claims to recognize the file takes responsibility for
getting it properly loaded into memory. If nobody on the list accepts
it, then as a last resort the kernel will attempt to treat it as a
shell script without a shebang line. And if that doesn't fly, then
you'll get that "Exec format error" message described above.

Interesting side note: The kernel decides whether or not to try to
parse a file as a shell script by whether or not it contains a line
break in the first few hundred bytes -- specifically if it contains a
line break before the first zero byte. Thus a data file that just
happens to have a "\n" near the top can produce some odd-looking
error messages if you try to execute it.
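
The "walk the list until someone claims the file" logic described above can be sketched in ordinary userspace C. This is only a toy model of the idea, not the kernel's code: the struct fmt_handler type, the function names, and the two byte-sniffing handlers are all hypothetical, and appending new handlers to the end of the list is a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for one registered binary format. */
struct fmt_handler {
    const char *name;
    /* Returns 0 on success, -ENOEXEC for "not mine", or another
     * negated errno value on failure. */
    int (*load)(const unsigned char *data, size_t len);
    struct fmt_handler *next;
};

static struct fmt_handler *handlers;    /* head of the list */

/* Append a handler to the end of the list. */
void register_handler(struct fmt_handler *h)
{
    struct fmt_handler **p = &handlers;

    assert(h != NULL && h->load != NULL);
    while (*p)
        p = &(*p)->next;
    h->next = NULL;
    *p = h;
}

/* Remove a handler from the list, if it is present. */
void unregister_handler(struct fmt_handler *h)
{
    struct fmt_handler **p;

    for (p = &handlers; *p; p = &(*p)->next) {
        if (*p == h) {
            *p = h->next;
            h->next = NULL;
            return;
        }
    }
}

/* Offer the file to each handler in turn. The first handler that
 * returns anything other than -ENOEXEC decides the outcome. */
int try_handlers(const unsigned char *data, size_t len,
                 const char **claimed_by)
{
    struct fmt_handler *h;

    for (h = handlers; h; h = h->next) {
        int r = h->load(data, len);
        if (r == -ENOEXEC)
            continue;
        if (r == 0 && claimed_by)
            *claimed_by = h->name;
        return r;
    }
    return -ENOEXEC;
}

/* Two toy handlers that only look at the first few bytes. */
int load_elf_like(const unsigned char *data, size_t len)
{
    return (len >= 4 && !memcmp(data, "\177ELF", 4)) ? 0 : -ENOEXEC;
}

int load_script_like(const unsigned char *data, size_t len)
{
    return (len >= 2 && data[0] == '#' && data[1] == '!') ? 0 : -ENOEXEC;
}
```

Registering both toy handlers and then calling try_handlers() on a buffer that starts with neither magic sequence yields -ENOEXEC, which is the userspace analogue of the "Exec format error" outcome.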

So we find ourselves in possession of the following facts:

 1. The Linux kernel can dynamically introduce new binary file
    formats.
 2. Kernel modules can be added to a running kernel.
 3. Being able to run flat binaries would be really neat.

Obviously, there is only one possible response to this situation.

Version 0.1: Look Ma, No Metadata

We wish to write a kernel module that implements a flat,
metadata-less binary file format for Linux. So, that's what I did.

 comfile.c

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/binfmts.h>
#include <linux/personality.h>
#include <linux/processor.h>
#include <linux/ptrace.h>
#include <linux/sched/task_stack.h>

MODULE_DESCRIPTION("Linux command executable files");
MODULE_AUTHOR("Brian Raiter <breadbox@muppetlabs.com>");
MODULE_VERSION("0.1");
MODULE_LICENSE("GPL");

/* Given an address or size, round up to the next page boundary.
 */
#define pagealign(n)  (((n) + PAGE_SIZE - 1) & PAGE_MASK)

static struct linux_binfmt comfile_fmt;

static int load_comfile_binary(struct linux_binprm *lbp)
{
    long const loadaddr = 0x00010000;

    char const *ext;
    loff_t filesize;
    int r;

    ext = strrchr(lbp->filename, '.');
    if (!ext || strcmp(ext, ".com"))
        return -ENOEXEC;

    r = flush_old_exec(lbp);
    if (r)
        return r;
    set_personality(PER_LINUX);
    set_binfmt(&comfile_fmt);
    setup_new_exec(lbp);

    filesize = generic_file_llseek(lbp->file, 0, SEEK_END);
    generic_file_llseek(lbp->file, 0, SEEK_SET);

    current->mm->start_code = loadaddr;
    current->mm->end_code = current->mm->start_code + filesize;

    r = setup_arg_pages(lbp, STACK_TOP, EXSTACK_DEFAULT);
    if (r)
        return r;

    r = vm_mmap(lbp->file, loadaddr, filesize,
                PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_FIXED | MAP_PRIVATE, 0);
    if (r < 0)
        return r;

    install_exec_creds(lbp);
    /*finalize_exec(lbp);*/
    start_thread(current_pt_regs(), loadaddr,
                 current->mm->start_stack);
    return 0;
}

static struct linux_binfmt comfile_fmt = {
    .module = THIS_MODULE,
    .load_binary = load_comfile_binary,
    .load_shlib = NULL,
    .core_dump = NULL,
    .min_coredump = 0
};

static int __init comfile_start(void)
{
    register_binfmt(&comfile_fmt);
    return 0;
}

static void __exit comfile_end(void)
{
    unregister_binfmt(&comfile_fmt);
}

module_init(comfile_start);
module_exit(comfile_end);

Unlike our first kernel module, this one is actually doing some
interesting work. So let's take the time to walk through this code
and understand what's going on.

static int __init comfile_start(void)
{
    register_binfmt(&comfile_fmt);
    return 0;
}

static void __exit comfile_end(void)
{
    unregister_binfmt(&comfile_fmt);
}

Our very-short init function just calls register_binfmt(), and
likewise our exit function calls unregister_binfmt(). As you have
probably already guessed, these are the functions that add and remove
support for a new binary format. The argument to both functions is a
pointer to a static struct of type linux_binfmt.

static struct linux_binfmt comfile_fmt = {
    .module = THIS_MODULE,
    .load_binary = load_comfile_binary,
    .load_shlib = NULL,
    .core_dump = NULL,
    .min_coredump = 0
};

The important fields of the linux_binfmt struct are three function
pointers. They provide callbacks for loading an executable, loading a
shared-object library, and dumping a core file. Thankfully, those
latter two features are optional, so we can leave them unimplemented,
and just provide the first callback.

static int load_comfile_binary(struct linux_binprm *lbp)

And this function is where all the work gets done. It will be invoked
by the kernel every time someone is attempting to execute one of our
files, and its purpose is to get the file's contents into memory and
running. The function is passed a single argument, lbp, which is a
pointer to a struct called linux_binprm that contains our actual
arguments. It has a dozen or so fields that summarize everything the
kernel knows about our file. The callback returns an int value, as is
typical for internal kernel functions. If all goes well, the return
value is zero. When an error occurs, the function should return a
negative number that corresponds to a negated errno value.

Recall how a program is launched under Unix: first the fork system
call is used to duplicate the process, and then the execve system
call replaces the process's current program with a new one.

Note that the fork system call is not quite the same thing as the
fork() function supplied by libc, although the latter is just a thin
wrapper around the former. Similarly, libc provides a family of seven
different "exec" functions, but they all ultimately invoke the execve
system call.
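
For reference, the fork-then-exec sequence looks like this from the userspace side. This is a minimal sketch using the libc wrappers; the helper name run_program() is made up for illustration, and using exit status 127 to signal a failed execve is a convention borrowed from the shell rather than anything the kernel requires.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Run the program at path in a child process and return its exit
 * status, or -1 if the child could not be created or did not exit
 * normally. (Hypothetical helper, for illustration only.) */
int run_program(const char *path, char *const argv[])
{
    pid_t pid;
    int status;

    assert(path != NULL && argv != NULL);

    pid = fork();                   /* duplicate the current process */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: replace our program with the new one. */
        execve(path, argv, environ);
        _exit(127);                 /* only reached if execve failed */
    }

    /* Parent: wait for the child to finish. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_program("/bin/sh", argv) with an argv of { "sh", "-c", "exit 7", NULL } should return 7 on a system where /bin/sh exists. The interesting part for our purposes is the execve() call, since that is the moment at which the kernel goes looking for a binary format that will accept the file.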

The nice thing about this system is that we never have to worry about
actually creating a process from scratch. That's done for us. Every
program's process is a copy of pid 1, duplicated through a succession
of forks. Our callback will instead be invoked during the execve
system call. In effect, when the kernel calls us, it is asking, "Hey,
I've got a file here. The user claims it's an executable binary, but
it's not an ELF file. Do you want to deal with it?" Every callback
function that has been registered with register_binfmt() gets called,
in order, going down the list, until someone takes responsibility for
the file.

So that's the first thing our callback function needs to do: it needs
to decide whether or not this is actually a .com file. Which raises
the obvious question: how do we even do that? Most binary formats
look for a magic-number identifier in the first few bytes of
metadata -- but we have no metadata. So then what?

Well, how does MS Windows identify .com files? Answer: it looks at
the filename. When you try to execute a file with a name ending in
".com", that's all MS Windows really cares about. "Oh, you're a .com
file, are you? Okay: here's 640k and an interrupt table. Call me when
you're done."

    ext = strrchr(lbp->filename, '.');
    if (!ext || strcmp(ext, ".com"))
        return -ENOEXEC;

So that's what we do, too. One of the fields of the linux_binprm
struct is the filename, so we examine it, and if there's no ".com"
extension, then we return negative ENOEXEC, the errno equivalent to
our "Exec format error" message. This error normally means "this is
not an executable", but in this particular context, it really means
"this is not one of my executables." When the kernel gets this return
value, it will just continue trying other formats. If all the
callbacks return this value, then ENOEXEC will actually get returned
from execve itself, which libc will then package up and store in
errno. But, if it does end in ".com", then our callback continues.
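
The same test, and the way a negated errno value turns into the familiar "-1 plus errno" result, can be tried out in userspace. This is a sketch only: both function names are hypothetical, and the second function merely imitates the kind of translation that libc performs on a system call's return value.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Kernel-style check: returns 0 if the name ends in ".com", or the
 * negated errno value -ENOEXEC if it does not. */
int check_com_name(const char *filename)
{
    const char *ext;

    assert(filename != NULL);
    ext = strrchr(filename, '.');
    if (!ext || strcmp(ext, ".com"))
        return -ENOEXEC;
    return 0;
}

/* libc-style wrapper: convert a negated errno value into the usual
 * userspace convention of returning -1 and setting errno. */
int check_com_name_libc(const char *filename)
{
    int r = check_com_name(filename);

    if (r < 0) {
        errno = -r;
        return -1;
    }
    return 0;
}
```

Note that because strrchr() finds the last dot anywhere in the string, a name such as "/home/user.com/tiny" is rejected: its final dot begins ".com/tiny", which does not compare equal to ".com".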

All we have to do now is load and run the file. No pressure, right?
Luckily for us, the kernel provides lots of functions that will do
almost all of the heavy lifting for us. We just need to oversee the
whole process. So let's quickly run down the sequence of events.

    r = flush_old_exec(lbp);
    if (r)
        return r;

The very first thing we do is call flush_old_exec(). Boom. Nearly
everything that was specific to the old process is now gone. The
process is now an empty salt flat, extending featurelessly to the
horizon. Wait, that's a little bleak. Instead, let's imagine it as a
fallow field, ready for planting. Note also that if a non-zero value
is returned, then a failure occurred, in which case we dutifully pass
the negated errno value back up the call chain.

    set_personality(PER_LINUX);

Personality is an obscure feature that allows certain behaviors of
the kernel to vary on a per-process basis. For whatever reason, it's
not reset by the flush.

    set_binfmt(&comfile_fmt);

The set_binfmt() function explicitly claims this binary as one of our
own. As far as I can tell, this is only used for debugging purposes.

    setup_new_exec(lbp);

setup_new_exec() initializes the process to some basic defaults, and
allows for any architecture-specific initializations to occur.

                                                                At this point we are
                                                                now cleared to start
                                                                defining our memory
                                                                image, which is
                                                                currently very empty.
                                                                So the first thing we
                                                                want to do is
                                                                determine how big the
                                                                file is, since that's
                                                                also the size of the
                                                                program. Inside the
                                                                kernel, we don't have
    filesize = generic_file_llseek(lbp->file, 0, SEEK_END);     the familiar file
    generic_file_llseek(lbp->file, 0, SEEK_SET);                descriptors. Instead,
                                                                we have file objects.
                                                                As you might expect,
                                                                the linux_binprm
                                                                struct includes an
                                                                already-opened file
                                                                object, and the
                                                                kernel function
                                                                generic_file_llseek()
                                                                works pretty much the
                                                                same as libc's more
                                                                familiar lseek()
                                                                function for
                                                                retrieving the file
                                                                size.
                                                                current is a global
                                                                variable that points
                                                                to the current task.
                                                                A task is like a
                                                                process or a thread,
                                                                except that instead
                                                                of being a numerical
                                                                identifier, it's the
                                                                actual thing itself --
                                                                the noumenon, the
                                                                ding-an-sich. It's a
                                                                struct with literally
                                                                hundreds of fields.
                                                                It's, like, really
                                                                big. Pretty much
                                                                anything you might
                                                                want to know about a
                                                                process is in this
                                                                thing, somewhere. One
                                                                of those things is
                                                                the task's memory
    current->mm->start_code = loadaddr;                         manager. And right
    current->mm->end_code = current->mm->start_code + filesize; now, the memory
                                                                manager is eager to
                                                                know where the
                                                                process's component
                                                                parts are going to be
                                                                located. Since our
                                                                format is so simple --
                                                                all we have is a blob
                                                                of code -- we mainly
                                                                need to provide a
                                                                valid load address.
                                                                There aren't too many
                                                                requirements for this
                                                                address. It just
                                                                needs to be
                                                                page-aligned, well
                                                                away from the stack,
                                                                and not zero. I
                                                                selected 0x10000 as
                                                                our load address
                                                                because there wasn't
                                                                a particular reason
                                                                not to.
                                                                We aren't setting up
                                                                anything else that
                                                                processes typically
                                                                contain, because
                                                                we're just so down to
    r = setup_arg_pages(lbp, STACK_TOP, EXSTACK_DEFAULT);       earth like that, so
    if (r)                                                      we can go straight to
        return r;                                               calling
                                                                setup_arg_pages().
                                                                This function
                                                                finalizes the
                                                                location and access
                                                                permissions of the
                                                                stack.
                                                                And now that that's
                                                                official, let's
                                                                actually load
                                                                something into
                                                                memory. Yes folks,
                                                                it's finally time to
                                                                call vm_mmap(). This
                                                                function is basically
                                                                identical to libc's
                                                                mmap(), and is the
                                                                natural way to load a
                                                                file into
                                                                (page-aligned)
                                                                memory. Of course,
                                                                normally when you
                                                                call mmap() with a
                                                                fixed load address
                                                                you need to handle
                                                                the case where that
                                                                address is already in
                                                                use. We don't need to
                                                                worry about that
                                                                here, as nothing is
                                                                currently in use.
                                                                We're requesting that
                                                                the memory be marked
                                                                as readable,
    r = vm_mmap(lbp->file, loadaddr, filesize,                  writeable, and
                PROT_READ | PROT_WRITE | PROT_EXEC,             executable.
                MAP_FIXED | MAP_PRIVATE, 0);                    Traditionally,
    if (r < 0)                                                  programs will place
        return r;                                               their code into
                                                                non-writeable memory,
                                                                and store variable
                                                                data in memory that
                                                                is writeable but not
                                                                executable. And
                                                                that's definitely the
                                                                safer way to do
                                                                things, but we can't
                                                                be bothered with all
                                                                that. After all, the
                                                                .com format hearkens
                                                                back to a simpler
                                                                time, when RAM was
                                                                RAM, and didn't come
                                                                with geegaws like
                                                                protection. Staying
                                                                true to this approach
                                                                is a way of honoring
                                                                our roots. Also,
                                                                without metadata it's
                                                                basically impossible
                                                                to know which parts
                                                                of the file are code
                                                                and which are data,
                                                                so we don't really
                                                                have a choice, but it
                                                                sounds better if we
                                                                claim it's because of
                                                                our heritage.
                                                                We now call
                                                                install_exec_creds(),
    install_exec_creds(lbp);                                    which will set up the
                                                                correct user ID vs
                                                                effective user ID, in
                                                                case it needs to be
                                                                changed.
                                                                The function
                                                                finalize_exec() does
                                                                something with the
                                                                stack's rlimit value.
                                                                I'm a little vague on
                                                                its purpose because
    /*finalize_exec(lbp);*/                                     it's somewhat new. In
                                                                fact, it doesn't even
                                                                exist on my kernel
                                                                version, which is why
                                                                it's commented out in
                                                                my code. If you're
                                                                running a 5.x kernel
                                                                or later, feel free
                                                                to restore it.
                                                                And then, at last, we
                                                                call start_thread().
                                                                This is the big one.
                                                                We pass it a pointer
                                                                to a struct that
                                                                contains the
                                                                process's current
                                                                register values, a
                                                                pointer to the top of
    start_thread(current_pt_regs(), loadaddr,                   the stack, and most
                 current->mm->start_stack);                     importantly, the
    return 0;                                                   address for the
                                                                instruction pointer
                                                                (which for us is the
                                                                same thing as the
                                                                load address). The
                                                                process is now ready
                                                                to be scheduled. And,
                                                                since we have indeed
                                                                made it this far, we
                                                                return zero to
                                                                indicate success.

Phew. It's definitely not trivial, the process of setting up a
process. But as I said, all of the real work is done by other code.

$ make
make -C /lib/modules/4.15.0-156-generic/build/ M=/home/breadbox/km/com modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-156-generic'
CC [M] /home/breadbox/km/com/comfile.o
Building modules, stage 2.
MODPOST 1 modules
CC /home/breadbox/km/com/comfile.mod.o
LD [M] /home/breadbox/km/com/comfile.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-156-generic'
$ sudo insmod comfile.ko
$ lsmod | head -n3
Module Size Used by
comfile 16384 0
nls_iso8859_1 16384 0

Now in order to actually put this kernel module to the test, we'll
need a program to execute. Specifically, we need to create a binary
file in our flat, metadata-less format. One that actually does
something.

The downside of creating our own binary file format is that none of
our usual tools know anything about it. If we want to build a program
in this format, we're on our own. Fortunately, our format is so
utterly simple that this shouldn't be hard. It does mean, however,
that we'll need to use assembly code.

As is traditional, our minimal test program will be one that exits
with a status code of 42. In order to make a system call under 64-bit
Linux, we need to set rax to the system call number, and rdi to the
(first) argument, and then use the syscall instruction. The exit
system call is assigned the ID number 60, so this should be all we
need:

$ cat >tiny.asm
BITS 64
mov rax, 60
mov rdi, 42
syscall
$ nasm -f bin -o tiny.com tiny.asm
$ chmod +x tiny.com
$ wc -c tiny.com
12 tiny.com

The bin format is nasm's name for its flat binary output format, so
what we get in our output file is nothing more than the assembly code
that we specified.

$ ./tiny.com
$ echo $?
42

Our project has borne fruit. Behold: it works, and it's twelve bytes
in size. And we can verify that it is, in fact, our kernel module
that is actually loading and running it:

$ sudo rmmod comfile
$ ./tiny.com
bash: ./tiny.com: cannot execute binary file: Exec format error
$ sudo insmod ./comfile.ko
$ ./tiny.com
$ echo $?
42

This delightfully unadulterated binary file is almost a quarter the
size of the smallest possible ELF executable, and less than a third
the size of the aout executable that inspired this (admittedly
ridiculous) exploration. And with zero bytes of overhead in our file
format, we can be confident that no binary using a format that
includes metadata can touch this one.

Although, of course, if we're going to start crowing about the size,
then we should probably go ahead and use the smallest possible
instructions.

 tiny.asm

BITS 64
        push    42
        pop     rdi
        mov     al, 60
        syscall

rdi can be initialized in only three bytes of machine code, and rax
can be initialized in even less, thanks to the fact that it is
pre-initialized to zero.

$ nasm -f bin -o tiny.com tiny.asm
$ chmod +x tiny.com
$ ./tiny.com
$ echo $?
42
$ wc -c tiny.com
7 tiny.com

Seven bytes. Seven!

To be clear, this is very much a nonstandard binary, and therefore it
in no way invalidates or supplants my 45-byte ELF executable (or the
aout executable). But it does make me very happy.

Further Testing of the Waters

We should try writing a few more programs, just to verify that our
kernel module really does work in general. Let's try a proper
hello-world program.

 hello.asm

BITS 64

        org     0x10000

        mov     eax, 1                  ; rax = 1: write system call
        mov     edi, eax                ; rdi = 1: stdout file desc
        lea     rsi, [rel str]          ; rsi = pointer to string
        mov     edx, strlen             ; rdx = string length
        syscall                         ; call write(rdi, rsi, rdx)
        mov     eax, 60                 ; rax = 60: exit system call
        xor     edi, edi                ; rdi = 0: exit code
        syscall                         ; call exit(rdi)

str:    db      'hello, world', 10
strlen equ $ - str

More assembly, yes, but it's very straightforward for assembly. It
compiles down to a 43-byte binary, and it does work:

$ nasm -f bin -o hello.com hello.asm
$ chmod +x hello.com
$ ./hello.com
hello, world
$ wc -c hello.com
43 hello.com

By using shorter instructions, we could reduce this program to 35
bytes, perhaps less. I will leave that as an exercise to the
interested reader.

As long as our programs only use a fixed amount of data, we can
allocate space by just adding it to our binary file. If we need to
allocate space dynamically, however, then that's going to be a
problem. Why? Because our processes don't have a heap. Why not?
Because we didn't set one up in our loader. Oops.

Version 0.2: On Having a Heap

Let's address this oversight by going back to our kernel module and
adding a few more lines of code. It's actually pretty easy. We just
need to let the memory manager know what we want.

Most programs use a memory layout that looks something like this:

| Code | Data | Heap -->          <-- Stack |

The sections are frequently broken up for the purpose of providing
different access rights. Code sections are marked executable but not
writeable, and the other sections are marked writeable and not
executable. The code and data sections have a constant size, while
the heap and the stack change in size as the program runs, with the
heap growing upwards and the stack growing downwards. (And if they
meet in the middle, then you've run out of memory -- although on a
64-bit machine you'll run out of physical RAM long before that
point.) For historical reasons, the end of the heap is called the
program break, and the brk system call can be used to move it around.

(If we're being pedantic, we should note that access permissions are
not attached to the memory itself, but rather to the addresses.
Different address ranges can be mapped to the same physical memory
but with different permissions. This is not a distinction we need to
worry about here.)

So let's tell the memory manager that we want our processes to enjoy
the benefits of a heap:

 comfile.c

    filesize = generic_file_llseek(lbp->file, 0, SEEK_END);
    generic_file_llseek(lbp->file, 0, SEEK_SET);
    allocsize = PAGE_ALIGN(filesize);

    current->mm->start_code = loadaddr;
    current->mm->end_code = current->mm->start_code + filesize;
    current->mm->start_data = current->mm->end_code;
    current->mm->end_data = loadaddr + allocsize;
    current->mm->start_brk = current->mm->end_data;
    current->mm->brk = current->mm->start_brk;

    r = setup_arg_pages(lbp, STACK_TOP, EXSTACK_DEFAULT);
    if (r)
        return r;

The only measurement we have available to us is the file size, which
we're using to determine the size of the code section. The PAGE_ALIGN
macro rounds a value up to the next page boundary. Since we can't
allocate a fractional number of memory pages, we can take whatever
padding we'll get at the page's end, and let that be our data
section. Our heap will then be located directly following this. It
starts off with a size of zero, which the program can then expand as
desired.

(There's no reason why we couldn't define a larger data section for
our binaries, by the way. We would just need to hard-code a size for
it. Perhaps, to stay true to the MS-DOS roots of our format, we
should allocate a data section of 640k. But this requires making a
second vm_mmap() call to allocate non-file-backed memory, so I've
chosen to punt on that modification for the moment.)

 comfile.c

    r = vm_mmap(lbp->file, loadaddr, filesize,
                PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_FIXED | MAP_PRIVATE, 0);
    if (r < 0)
        return r;
    r = vm_brk(current->mm->start_brk, 0);
    if (r < 0)
        return r;

After memory has actually been allocated, we will set the process's
program break to its starting value. Our process should now have a
functional heap that the program can dynamically modify.

In order to test this new feature, I've written an implementation of
cat, one that reads all of standard input into memory before doing
any output. It's not an interesting program beyond being a basic
demonstration of low-level heap allocation, so I won't go into it. If
you're curious, you can see the source here: cat.asm. For now, we'll
just note that it succeeds at allocating memory at runtime:

(On my machine, /etc/mailcap is a text file well over 64k in size.)

$ nasm -f bin -o cat.com cat.asm
$ chmod +x cat.com
$ ./cat.com </etc/mailcap | cmp - /etc/mailcap
$ wc -c cat.com
100 cat.com

This version of cat is not really a cat utility, however, as it can
only read from standard input. A proper cat program -- and, indeed, a
majority of proper programs -- will need to be able to open files
named on the command line. So how do we access the command-line
arguments?

Frankly, we can't, at least not as things stand. You see, when an ELF
binary runs, it has values for argc, argv, and envp placed at the top
of its stack. Surprise! Those values are put there by the loader.
Yes, this issue is also the responsibility of our kernel module.

So let's add support for this, too.

Version 0.3: Leaving Room for Arguments

To be sure, the strings that make up the command-line arguments are
present in our process's image -- specifically, they're sitting in
memory just above the stack. (Which means, given that the stack is
currently empty, that the rsp register currently points to them.) But
this is no neat array of string pointers. It's just a lot of strings,
one after the other. A string of strings, if you will. Worse, the
environment variables come immediately after the command-line
arguments, without any indication of where one set ends and the other
begins. So they aren't really usable, as they stand.

At the absolute least, a program needs to know how many of the
strings are in each set. Well, it turns out that our kernel module
has exactly that information. In the linux_binprm struct (provided
via the argument to our callback function, remember) there are two
fields named argc and envc. These are the number of command-line
arguments and environment variables, respectively. In theory, if we
transmit these two values to the running program, that would be
enough for the code to safely access the data. Of course, if that's
all we did, then every .com program would need to trawl through its
string of strings, to determine where each item begins and ends. We
could just accept that as a fact of life for programmers using our
format, but since we're doing this work anyway, why not take the time
to do it right? We should provide our processes with argv and envp
arguments -- neat arrays of pointers to the strings in question -- like
all the cool binary formats do.

Since these two arrays don't currently exist, we'll need to reserve
some memory for them. It may feel intuitive to tap our newly-minted
heap, but for this it actually makes more sense to just take it off
the top of the stack. (In fact, these arrays can be seen as forming a
sort of zeroth stack frame.) The top of the stack is at the top of
memory, which is stored in the linux_binprm struct in the
intuitively-named field p. So we want to build arrays in the memory
immediately preceding this address, and then move the top of the
stack down to precede our arrays.

Note, however, that we cannot use familiar functions like strlen() to
walk through these strings. Why? Because the memory holding these
strings isn't owned by the kernel; it belongs to the process itself.
So far, we've only been dealing with addresses in the process's
memory. We haven't tried to access that memory, except through other
functions, such as vm_mmap(). It can thus be easy to forget that
kernel memory and user memory exist in two different address spaces
(not to mention different permission rings). The kernel is allowed to
access user memory, of course, but it needs to be intentional about
it, and this requires some extra work.

Within our kernel module, we use the __user annotation to declare
pointers to user memory. And instead of using the * operator to
dereference such pointers, we have two special macros: get_user() to
read through a user pointer, and put_user() to write through one. And
the kernel provides a handful of convenient functions, like
strnlen_user(), that will operate on strings stored in user space.

Once we have gone through the strings and populated our two arrays,
we'll still need to communicate the values for argc, argv, and envp
to the program. The usual way to do this is to place them on the
stack, allowing the program to access them at its convenience.

So let's add all this to our kernel module. We'll start by defining a
separate function to handle the work of walking through the strings
and building the two arrays.

 comfile.c

/* Given argc + envc strings above the top of the stack, construct the
 * argv and envp arrays in the memory preceding, and then push argc,
 * argv, and envp onto the stack. Return the new stack top address.
 */
static unsigned long make_arrays(struct linux_binprm const *lbp)
{
    void* __user *sp;
    char* __user *argv;
    char* __user *envp;
    char __user *p;
    int i;

    p = (char __user *)lbp->p;
    envp = (char* __user *)ALIGN(lbp->p, sizeof *envp);
    envp = envp - (lbp->envc + 1);
    argv = envp - (lbp->argc + 1);
    sp = (void* __user *)argv - 3;

    current->mm->arg_start = (unsigned long)p;
    for (i = 0 ; i < lbp->argc ; ++i) {
        put_user(p, argv + i);
        p += strnlen_user(p, MAX_ARG_STRLEN);
    }
    put_user(NULL, argv + i);
    current->mm->arg_end = (unsigned long)p;

    current->mm->env_start = (unsigned long)p;
    for (i = 0 ; i < lbp->envc ; ++i) {
        put_user(p, envp + i);
        p += strnlen_user(p, MAX_ARG_STRLEN);
    }
    put_user(NULL, envp + i);
    current->mm->env_end = (unsigned long)p;

    put_user((void*)(unsigned long)lbp->argc, sp);
    put_user(argv, sp + 1);
    put_user(envp, sp + 2);

    return (unsigned long)sp;
}

There's a fair bit of casting in this function because the kernel
tends to store user addresses as unsigned long values. This may sound
counterproductive, but it's somewhat natural given that kernel code
rarely dereferences such addresses. But our function wants to work
with them as pointer types, in order to take advantage of pointer
arithmetic. (We also have to cast argc into a pointer while we
briefly pretend that the stack is an array.)
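
To make the layout easier to picture, here is a rough user-space model
of the same bookkeeping. It is only a sketch: the names layout,
word_at(), set_word(), and model_arrays() are made up for illustration,
plain offsets into a byte buffer stand in for user addresses, and
strlen() + 1 stands in for strnlen_user(). The sketch rounds the
starting offset down to a word boundary so that the arrays always end
at or below the first string.

```c
#include <stddef.h>
#include <string.h>

/* Offsets (standing in for user addresses) of the three regions. */
struct layout {
    size_t sp;      /* slot holding argc: the new stack top */
    size_t argv;    /* first slot of the argv array */
    size_t envp;    /* first slot of the envp array */
};

/* Read one word from the model memory. */
static size_t word_at(const unsigned char *mem, size_t off)
{
    size_t v;
    memcpy(&v, mem + off, sizeof v);
    return v;
}

/* Write one word to the model memory (our stand-in for put_user). */
static void set_word(unsigned char *mem, size_t off, size_t v)
{
    memcpy(mem + off, &v, sizeof v);
}

/* mem holds argc + envc NUL-terminated strings packed from offset p.
 * Build the NULL-terminated argv and envp arrays just below p, then
 * store argc, argv, and envp below those. */
static struct layout model_arrays(unsigned char *mem, size_t p,
                                  int argc, int envc)
{
    const size_t w = sizeof(size_t);
    struct layout l;
    int i;

    l.envp = (p & ~(w - 1)) - (size_t)(envc + 1) * w;
    l.argv = l.envp - (size_t)(argc + 1) * w;
    l.sp = l.argv - 3 * w;

    for (i = 0; i < argc; ++i) {
        set_word(mem, l.argv + (size_t)i * w, p);
        p += strlen((const char *)mem + p) + 1;
    }
    set_word(mem, l.argv + (size_t)i * w, 0);

    for (i = 0; i < envc; ++i) {
        set_word(mem, l.envp + (size_t)i * w, p);
        p += strlen((const char *)mem + p) + 1;
    }
    set_word(mem, l.envp + (size_t)i * w, 0);

    set_word(mem, l.sp, (size_t)argc);
    set_word(mem, l.sp + w, l.argv);
    set_word(mem, l.sp + 2 * w, l.envp);
    return l;
}
```

Reading the result back from low to high offsets gives argc, the
address of argv, the address of envp, then the two arrays, and
finally the strings themselves.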

With this function in our code, we just need to use it at the
appropriate time, after the stack has been created and its address
has been finalized.

 comfile.c

    r = setup_arg_pages(lbp, STACK_TOP, EXSTACK_DEFAULT);
    if (r)
        return r;
    current->mm->start_stack = make_arrays(lbp);

In order to verify that all of these changes do in fact work, we can
write a couple more programs, echo.asm and env.asm. Again, they aren't
particularly interesting in the details, so I won't dissect them
here. But if you're at all familiar with reading x86 assembly, they
should be relatively straightforward to understand.

$ nasm -f bin -o echo.com echo.asm
$ chmod +x echo.com
$ ./echo.com foo bar baz
foo bar baz
$ ./echo.com

$ wc -c echo.com
84 echo.com

Take This Discussion of Practicality Outside

At this point, we have created a binary format for the Linux kernel
that functions without metadata. Writing code for it is a bit of a
pain, though -- we have to write everything in assembly, and none of
the standard tools work with our format.

Well, it so happens that there is something that we can do about
that. With some investment of effort, we can coax our familiar tools
into generating .com binaries, allowing us to use things like C
compilers once more. There are a number of steps to the whole journey,
however, so I've decided to put the gory details in a separate
appendix, and I encourage you to peruse it if you are at all curious.

Click here to check out the appendix.

But for now, I don't want to be sidelined by meandering distractions
like "usability". The focus of this essay, after all, is using kernel
modules to let us produce working binaries that are really teensy. We
have already produced a valid seven-byte executable file, and it is
undeniably a thing of beauty. But a question immediately presents
itself.

Could it be even smaller?

Well, we can't shrink the program itself down any further. It's as
small as it can get. But maybe we could get by with a simpler
program, if we changed the binary loader a little bit. Nothing too
ridiculous, mind you. But I'm thinking ... what if our binaries could
just automatically exit when they came to the end, instead of forcing
the programmer to use the exit system call? I mean, a majority of
programming languages work that way, right? If a Python program makes
it to the end of the file, it just quietly exits. Could we make our
binaries do that as well?

Version 0.?: Hello, I Must Be Going

We absolutely can. What we would need to do is append a few extra
bytes of machine code to the end of our file image. This code will
only be executed if the program would have crashed otherwise, so one
could argue that this is a purely beneficial modification to our
binary format.

However, it does mean that the code size will now be larger than the
file size. This shifts the addresses in our memory layout, which
could mess with programmers' assumptions, given that programs are
generally written in low-level assembly.

Rather than calling this a minor feature upgrade, I therefore feel
compelled to fork the code base at this point, and give this file
format a new name, so as not to muddy the until-now transparent ABI
of the .com format.

Originally, I was planning on calling this new, extended
format "dot-e-com", or "dot-x-com" or something equally trite.
However, after some reflection, I decided to be less boring and
instead I called it the "keep-calm" (and carry on) format. This name
refers naturally to the fact that you no longer have to fret about
setting up mandatory instructions for exiting. Instead you can just
carry on, right up to the end of the program.

However, I decided I didn't like using .calm as a filename extension.
It was too easy to confuse with .com when said aloud. So I instead
chose to adopt the Unicode character for a crown as the extension:
.♚. (This is U+265A, one of the Unicode chess piece glyphs.)

Now I realize that some people may find it a bit controversial to use
a non-ASCII symbol in a filename extension. But hey, the Unicode
Consortium has gone to a great deal of trouble to standardize
thousands upon thousands of glyphs, and we programmers continue to
draw upon the same 100 or so characters for our symbols, keywords,
and filenames. All these others are just sitting around, neglected.
We should start using them more. I mean, why not? Heck, put emojis in
your filenames. I'm not your dad.

Alan, if you're reading this: I forbid you to put emojis in your
filenames.

 calmfile.c

    ext = strrchr(lbp->filename, '.');
    [S:if (!ext || strcmp(ext, ".com")):S]
    if (!ext || strcmp(ext, ".♚"))
        return -ENOEXEC;

(If for whatever reason you don't trust your compiler to properly
handle Unicode characters, you can instead assume a UTF-8 environment
and write the second argument to strcmp() as ".\342\231\232".)
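
If you'd like to double-check those octal escapes, the following
user-space sketch derives them from the code point. The helper names
utf8_encode3() and has_calm_ext() are hypothetical, and the encoder
only covers the three-byte range (U+0800 through U+FFFF) that our
crown happens to fall in.

```c
#include <string.h>

/* Encode a code point in the range U+0800..U+FFFF as three UTF-8
 * bytes plus a terminating NUL. Other ranges are out of scope. */
static void utf8_encode3(unsigned int cp, char out[4])
{
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    out[3] = '\0';
}

/* Return 1 if the name's final '.' is followed by exactly the bytes
 * of U+265A, mirroring the strrchr()/strcmp() test in the loader. */
static int has_calm_ext(const char *name)
{
    const char *ext = strrchr(name, '.');

    return ext && strcmp(ext, ".\342\231\232") == 0;
}
```

U+265A encodes as the bytes E2 99 9A, which are 342, 231, and 232 in
octal.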

Anyway. The machine code that we want to append to the program can be
squeezed into eight bytes:

 epilog.lst

00000000 31FF           xor     edi, edi
00000002 678D473C       lea     eax, [edi + 60]
00000006 0F05           syscall

That's convenient, as it means that we can stuff it all into a long
value, which can be stored in user memory with a simple put_user().
The next change we need to make is to reserve an extra eight bytes in
the layout that we report to the memory manager.

 calmfile.c

    filesize = generic_file_llseek(lbp->file, 0, SEEK_END);
    generic_file_llseek(lbp->file, 0, SEEK_SET);
    [S:allocsize = PAGE_ALIGN(filesize);:S]
    codesize = filesize + 8;
    allocsize = PAGE_ALIGN(codesize);

    current->mm->start_code = loadaddr;
    [S:current->mm->end_code = current->mm->start_code + filesize;:S]
    current->mm->end_code = current->mm->start_code + codesize;
    current->mm->start_data = current->mm->end_code;
    current->mm->end_data = loadaddr + allocsize;
    current->mm->start_brk = current->mm->end_data;
    current->mm->brk = current->mm->start_brk;

We will still use filesize when we call vm_mmap(), since we can only
map what's in the file. And here we hit a subtle point. Due to the
fact that memory is always mapped in page-sized chunks, we typically
get more memory than we ask for from vm_mmap() -- but not always. If
the binary file happens to be exactly (or nearly exactly) page-sized,
then there won't be enough memory mapped for our eight-byte epilog.
So, we need to check for this edge case, and when it happens we need
to call vm_mmap() a second time to reserve a page of anonymous memory
immediately following:

 calmfile.c

    r = vm_mmap(lbp->file, loadaddr, filesize,
                PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_FIXED | MAP_PRIVATE, 0);
    if (r < 0)
        return r;
    if (allocsize != PAGE_ALIGN(filesize)) {
        r = vm_mmap(NULL, loadaddr + PAGE_ALIGN(filesize), PAGE_SIZE,
                    PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, 0);
        if (r < 0)
            return r;
    }
    r = vm_brk(current->mm->start_brk, 0);
    if (r < 0)
        return r;

    put_user(0x050F3C478D67FF31, (long __user *)(loadaddr + filesize));

And of course that ridiculous-looking 64-bit magic number at the
bottom is the little-endian encoding of our epilog.
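
Both of the details in this step can be sanity-checked from user
space. The sketch below packs the eight epilog bytes from the listing
into a 64-bit value, and separately tests whether a given file size
leaves room for the epilog in the pages mapped for the file. The
helper names and the SKETCH_PAGE_SIZE constant are assumptions made
for this illustration; the kernel code uses PAGE_SIZE and PAGE_ALIGN()
instead.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL   /* assumed page size, as on x86-64 */

/* Pack eight bytes into a 64-bit value such that storing the value on
 * a little-endian machine writes the bytes back out in array order. */
static uint64_t pack_le64(const unsigned char b[8])
{
    uint64_t v = 0;
    int i;

    for (i = 7; i >= 0; --i)
        v = (v << 8) | b[i];
    return v;
}

/* Round n up to the next multiple of the page size. */
static unsigned long page_align(unsigned long n)
{
    return (n + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Nonzero if the eight-byte epilog would spill past the pages that
 * were mapped for a file of the given size. */
static int needs_extra_page(unsigned long filesize)
{
    return page_align(filesize + 8) != page_align(filesize);
}
```

Feeding pack_le64() the bytes 31 FF 67 8D 47 3C 0F 05 gives back the
constant passed to put_user() above. For an 88-byte file
needs_extra_page() reports no spill, while a 4096-byte file does need
the extra anonymous page.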

We don't need to rmmod the previous module this time, since we are
defining a new binary format. The two can be active simultaneously
without interfering with each other.

$ make
make -C /lib/modules/4.15.0-156-generic/build/ M=/home/breadbox/km/calm modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-156-generic'
CC [M] /home/breadbox/km/calm/calmfile.o
Building modules, stage 2.
MODPOST 1 modules
CC /home/breadbox/km/calm/calmfile.mod.o
LD [M] /home/breadbox/km/calm/calmfile.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-156-generic'
$ sudo insmod ./calmfile.ko
$ lsmod | head -n3
Module                  Size  Used by
calmfile               16384  0
comfile                16384  0

Of the programs we've currently written, cat.asm is the best
candidate to benefit from this change, having twelve bytes' worth of
machine code that can be omitted with the new format:

To enter a Unicode glyph at the terminal, type Ctrl-Shift-U, followed
by the codepoint number, and then Enter. The sequence for the crown
symbol is thus Ctrl-Shift-U 2 6 5 A Enter.

$ cp -i cat.com cat.♚
$ ./cat.♚ <hello.txt
hello, world
$ truncate -s -12 cat.♚
$ wc -c cat.♚
88 cat.♚
$ ./cat.♚ <hello.txt
hello, world
$ echo $?
0

The only down side of this change is that the exit status no longer
reports error values from read failures.

No, that's not true. The other, more significant, down side of this
change is that this new feature can't be used to improve the size of
our seven-byte executable! That program still has to explicitly exit,
in order to exit with a non-zero status. (And the non-zero exit is
how we can be certain that the program actually ran, and wasn't, say,
mistakenly handled as a do-nothing shell script.)

Well, it so happens that I have the perfect answer to both of these
concerns.

Version 0.??: This Is Not At All a Ridiculous Idea

The way to address these issues is to modify our epilog so that
instead of always returning zero, it uses the value at the top of the
stack as the exit code. Additionally (and here's the brilliant part),
the loader will set up our stack so that it starts out with a zero
entry at the top. That way, if the program is well-behaved and pops
everything off the stack that it pushed, it will automatically exit
with a successful zero status -- but it can also quit prematurely at
any time, leaving an error code on the stack that will be
automatically transmitted to the user. This is clearly an improvement
to our original idea, right? I think it makes perfect sense, and
isn't at all contrived.
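
As a rough illustration of the intended semantics, here is a toy
user-space model. Everything in it is hypothetical: the program is
reduced to a list of push and pop operations, the loader's seed value
is a parameter, and run_to_end() plays the part of the epilog by
taking whatever is on top of the stack as the exit code.

```c
#include <stddef.h>

enum op_kind { OP_PUSH, OP_POP };

struct op {
    enum op_kind kind;
    long value;             /* used by OP_PUSH only */
};

/* Model a program that runs off its own end. The loader seeds the
 * stack with `initial`, the operations run in order, and the epilog
 * pops the top entry to use as the exit code. Returns the status a
 * shell would report, i.e. the low eight bits of that entry. */
static int run_to_end(long initial, const struct op *prog, size_t n)
{
    long stack[64];
    size_t depth = 0, i;

    stack[depth++] = initial;
    for (i = 0; i < n; ++i) {
        if (prog[i].kind == OP_PUSH && depth < 64)
            stack[depth++] = prog[i].value;
        else if (prog[i].kind == OP_POP && depth > 0)
            --depth;
    }
    /* The model reports 0 for an empty stack. A real program that
     * popped the seeded entry would instead exit with whatever sits
     * above it (presumably argc, with this loader). */
    return depth > 0 ? (int)(stack[depth - 1] & 0xFF) : 0;
}
```

A balanced program returns the seed value, while a program that
leaves 42 on the stack returns 42.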

Fortunately, this new epilog will still fit snugly into eight bytes:

 epilog.lst

00000000 5F             pop     rdi
00000001 B83C000000     mov     eax, 60
00000006 0F05           syscall

This improvement requires only a three-line change to our kernel
module:

 calmfile.c

    r = vm_brk(current->mm->start_brk, 0);
    if (r < 0)
        return r;

    [S:put_user(0x050F3C478D67FF31, (long __user *)(loadaddr + filesize));:S]
    put_user(0x050F0000003CB85F, (long __user *)(loadaddr + filesize));
    current->mm->start_stack -= sizeof(void*);
    put_user(0, (long __user *)current->mm->start_stack);

    install_exec_creds(lbp);

$ make
make -C /lib/modules/4.15.0-156-generic/build/ M=/home/breadbox/km/calm modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-156-generic'
CC [M] /home/breadbox/km/calm/calmfile.o
Building modules, stage 2.
MODPOST 1 modules
CC /home/breadbox/km/calm/calmfile.mod.o
LD [M] /home/breadbox/km/calm/calmfile.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-156-generic'
$ sudo rmmod ./calmfile.ko
$ sudo insmod ./calmfile.ko

Once we've verified that it builds, let's revisit our current-best
seven-byte binary. That same program should work just as well in our
new format:

$ ./cat.♚ <tiny.asm
BITS 64
push 42
pop rdi
mov al, 60
syscall
$ nasm -f bin -o tiny.♚ tiny.asm
$ chmod +x tiny.♚
$ ./tiny.♚
$ echo $?
42
$ wc -c tiny.♚
7 tiny.♚

But it should also work if we push the number 42 onto the stack and
just leave it there. In other words, if the program just consists of
the push instruction.

$ truncate -s 2 tiny.♚
$ ./tiny.♚
$ echo $?
42
$ wc -c tiny.♚
2 tiny.♚

Two bytes. Two bytes! Our program is the size of a single
machine-language instruction! This is the limit! There's no way to
make a working program any smaller than that!

Well, I mean. I don't think there is. That is, obviously, there does
exist a number that is less than two, so in theory it could be
smaller, but how would that even be possible? Like, if there was a
one-byte instruction for storing an arbitrary value on the stack,
then maybe. But there isn't. The only other realistic possibility I
can think of would be if the binary file could make use of some
metadata to request a particular value to place on the stack
initially, instead of having it be a hard-coded zero value. But there
is no metadata in our file! That's the whole point of the format,
right? I mean, of course, all files have some metadata; that's just
the nature of filesystems. You could argue that the filename itself
counts as metadata. I mean, we are basically using the filename as
metadata already, I suppose. We're looking at the extension to
determine the file type. So you could in theory potentially maybe
argue that, just for example, using the character immediately
preceding the extension would also be valid metadata, and that could
be defined to specify optional behavior, like the default stack value
for example ...

 calmfile.c

    put_user(0x050F0000003CB85F, (long __user *)(loadaddr + filesize));
    current->mm->start_stack -= sizeof(void*);
    [S:put_user(0, (long __user *)current->mm->start_stack);:S]
    put_user(ext[-1], (long __user *)current->mm->start_stack);

ext is the pointer to the filename's extension, which we initialized
way back at the top of the function, so ext[-1] is the character
directly to the left of the dot. Don't get me wrong; I fully realize
that this is indefensibly contrived -- especially since, as it turns
out, the ASCII character for 42 is the asterisk. That's quite an
inconvenient character to have in a filename.
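
For the curious, the lookup is easy to try out in user space. This is
only a sketch: initial_stack_value() is a made-up name, the -1 result
for names with nothing in front of the dot is a guard added here that
the snippet above does not have, and the cast to unsigned char is
likewise a choice of this sketch.

```c
#include <string.h>

/* Return the initial stack value implied by a file name: the byte
 * just before the final '.', or -1 if the name has no '.' or has
 * nothing in front of it. */
static int initial_stack_value(const char *filename)
{
    const char *ext = strrchr(filename, '.');

    if (!ext || ext == filename)
        return -1;
    return (unsigned char)ext[-1];
}
```

Since the bytes of the crown never include 0x2E, strrchr() always
lands on the dot that introduces the extension, and a name such as
"./*.\342\231\232" produces 42.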

But having come this far, how can I not continue?

$ make
make -C /lib/modules/4.15.0-156-generic/build/ M=/home/breadbox/km/calm modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-156-generic'
CC [M] /home/breadbox/km/calm/calmfile.o
Building modules, stage 2.
MODPOST 1 modules
CC /home/breadbox/km/calm/calmfile.mod.o
LD [M] /home/breadbox/km/calm/calmfile.ko
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-156-generic'
$ sudo rmmod calmfile
$ sudo insmod ./calmfile.ko
$ touch '*.♚'
$ chmod +x '*.♚'
$ './*.♚'
$ echo $?
42

Okay.

$ wc -c '*.♚'
0 *.♚

Beat that, Internet Random Person(tm)!

[cartoon2]
Illustration of author surveying the fruits of his labor by
Bomberanian

(appendix)

---------------------------------------------------------------------

Brian Raiter