https://thecoder08.github.io/hello-world.html

Lennon McLean

April 6, 2024

Hello World

A deep dive into the world of abstraction behind a modern Hello World
program.

Before we start

This article is about a Hello World program written in C. C is about
as high-level as a language can get before you have to worry about
what your specific language is doing in the interpreter/compiler/JIT
before your Hello World proper actually runs.

I originally started writing this with the intent of making it
understandable to anyone with some coding background, but I now think
it would be helpful to have at least some knowledge of C or assembly.

The start

Everyone should be familiar with a Hello World program. In Python,
the very first program you wrote may have been:

print('Hello World!')

It simply prints the text "Hello World!" onto the screen.

In this article, we're going to look at a Hello World in the C
programming language. If you're unfamiliar, it's:

#include <stdio.h>

int main() {
    printf("Hello World!\n");
    return 0;
}

This program does the exact same thing as the Python one. Unlike
Python, however, you can't just call an interpreter to run this
program. You have to run the compiler first to convert this code into
machine code that the computer's processor can run directly. Much of
the software that makes a computer work, such as operating system
kernels and system libraries, is written this way.

So to do this, we run the following command:

gcc hello.c -o hello

This takes our C code from the file hello.c and generates a machine
code program in the file called hello. We can then run it by running
the following command:

./hello

which gives:

Hello World!

Neat.

Our program

OK, so how did it do that? Well, the first place to look is our
program. What exactly is it?

$ file hello
hello: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=b74da2c9c77d221eeaa98f87f4a7a529782db280, for GNU/Linux 3.2.0, not stripped

This is mostly stuff we won't worry about, or don't need to worry
about until later. The important part is just the

    ELF executable, x86-64

This tells us that the program is an ELF executable file for the
x86_64 instruction set architecture. What does that mean?

An ELF executable is the Linux equivalent of a Windows .exe file.
It's just a program that your computer can run. But we already knew
that. The other part tells us that it's a machine code program meant
to run on a 64-bit x86 processor, which is the CPU architecture that
has been used in PCs since the introduction of the IBM PC in 1981.
That wasn't 64-bit, mind you, but our modern processors can still run
code written for the IBM PC (kind of). I digress.

So this file contains machine code, a sort of language, and the only
language that the CPU can understand. So where does the CPU start
running its code?

$ readelf -h hello
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1060
  Start of program headers:          64 (bytes into file)
  Start of section headers:          13976 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30

The important part here is Entry point address:, which is set to
0x1060. This is a hexadecimal number which represents a location in
our program, or, once it gets loaded, in our computer's memory. So
what exactly is there?

The code

$ objdump -D hello

I'm not going to put the entire output of this command in here,
because it would be too long. But if we scroll through it, we
eventually find some lines of text, where the first line starts with
1060:

Disassembly of section .text:

0000000000001060 <_start>:
    1060:       f3 0f 1e fa             endbr64
    1064:       31 ed                   xor    %ebp,%ebp
    1066:       49 89 d1                mov    %rdx,%r9
    1069:       5e                      pop    %rsi
    106a:       48 89 e2                mov    %rsp,%rdx
    106d:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
    1071:       50                      push   %rax
    1072:       54                      push   %rsp
    1073:       45 31 c0                xor    %r8d,%r8d
    1076:       31 c9                   xor    %ecx,%ecx
    1078:       48 8d 3d ca 00 00 00    lea    0xca(%rip),%rdi        # 1149 <main>
    107f:       ff 15 53 2f 00 00       call   *0x2f53(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1085:       f4                      hlt
    1086:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    108d:       00 00 00

What does this mean? The first numbers before the colons are the
addresses of the following bytes, essentially their position in the
file. The next numbers are bytes of data in our program file, which
in this case represent machine code. The following text is the
disassembly of that code. Assembly language is a human-readable
representation of machine code. Note that even if the bytes on the
left don't represent code, the disassembler will still try to
disassemble them. This leads to garbage and nonsensical assembly
code.

So we've found some code! But not code that we wrote. It was added to
our program automatically by the compiler (technically the linker).
Basically, this code does some initialization, and then runs an
important instruction:

call *0x2f53(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>

This instruction tells the computer to go execute some code somewhere
else. The operand 0x2f53 is an offset from the address of the next
instruction (0x1085), which gives 0x3fd8, the address the
disassembler shows in its comment. The * means the call is indirect:
the CPU reads a pointer stored at 0x3fd8 and jumps to wherever it
points. That pointer is filled in by the dynamic linker when our
program is loaded. I won't get into that.

But no matter how hard you look, you won't find the code for
__libc_start_main in our file. The slot at 0x3fd8 is there, in the
global offset table (again, beyond the scope of this article), but
it's empty right now. That's because this code isn't defined in our
program; it's somewhere else.

The C library

So where is it?

$ readelf -d hello

Dynamic section at offset 0x2dc8 contains 27 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x1000
 0x000000000000000d (FINI)               0x1168
 0x0000000000000019 (INIT_ARRAY)         0x3db8
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x3dc0
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x3b0
 0x0000000000000005 (STRTAB)             0x480
 0x0000000000000006 (SYMTAB)             0x3d8
 0x000000000000000a (STRSZ)              141 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x3fb8
 0x0000000000000002 (PLTRELSZ)           24 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x610
 0x0000000000000007 (RELA)               0x550
 0x0000000000000008 (RELASZ)             192 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000000000001e (FLAGS)              BIND_NOW
 0x000000006ffffffb (FLAGS_1)            Flags: NOW PIE
 0x000000006ffffffe (VERNEED)            0x520
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x50e
 0x000000006ffffff9 (RELACOUNT)          3
 0x0000000000000000 (NULL)               0x0

This is a list of, among other things, the libraries that our code
depends on. In this case, we see the line

0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

This is our system's standard C library, a collection of routines and
functions used by nearly all of the programs on our computer. In
Windows-land, this is equivalent to the C runtime, either msvcrt.dll
or ucrt<something>.dll. One thing to note is that files in Linux with
the extension .so, called Shared Objects, are equivalent to files in
Windows with the extension .dll, called Dynamic-Link Libraries.
They both contain code that can be shared between multiple programs.

So we could repeat the process of using objdump to find where this
code is in our C library, and what it does, but the C library is huge
and complex, and we haven't even got to the code we wrote yet. So
I'll save you the trouble: it does some initialisation, like getting
our program's command-line parameters and environment variables, and
calls our main() function. Then when we return from main(), it exits
our program with the status code we provide.

So where is our main function?

main()

It's in our program, of course. Returning to our disassembly, we see:

0000000000001149 <main>:
    1149:       f3 0f 1e fa             endbr64
    114d:       55                      push   %rbp
    114e:       48 89 e5                mov    %rsp,%rbp
    1151:       48 8d 05 ac 0e 00 00    lea    0xeac(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    1158:       48 89 c7                mov    %rax,%rdi
    115b:       e8 f0 fe ff ff          call   1050 <puts@plt>
    1160:       b8 00 00 00 00          mov    $0x0,%eax
    1165:       5d                      pop    %rbp
    1166:       c3                      ret

Finally, our code! So what does it do? It:

 1. Sets up a stack frame
 2. Sets up the arguments to our function call
 3. Calls our Hello World
 4. Cleans up the stack frame
 5. Returns from the function with exit code 0

This is what we see in our source code. But what is a stack frame?
It's a part of the computer's memory that our program uses to store
local variables, i.e. variables declared inside our main function.
Luckily, we don't declare any variables, so we don't really have to
worry about that. The important parts here are:

lea    0xeac(%rip),%rax

call   1050 <puts@plt>

These instructions:

 1. Set the memory address of our Hello World string as the first
    argument to our function call (indirectly)
 2. Call the puts() function

Wait, puts()? Didn't we call printf()?

Yes. However, the compiler performed an optimisation. The printf
function is complex, because it's able to print "formatted output",
which means that we can embed variables in our output. The function
will handle converting them to strings and printing them for us, but
we aren't using any of that. So the compiler replaces printf() with
the much simpler puts(), which just prints a string of unformatted
text. So where is our text?

The string

According to the disassembler, the lea instruction uses the offset
0xeac from the address of the next instruction (0x1158), which works
out to address 0x2004. So what does that look like?

Disassembly of section .rodata:

0000000000002000 <_IO_stdin_used>:
    2000:       01 00                   add    %eax,(%rax)
    2002:       02 00                   add    (%rax),%al
    2004:       48                      rex.W
    2005:       65 6c                   gs insb (%dx),%es:(%rdi)
    2007:       6c                      insb   (%dx),%es:(%rdi)
    2008:       6f                      outsl  %ds:(%rsi),(%dx)
    2009:       20 57 6f                and    %dl,0x6f(%rdi)
    200c:       72 6c                   jb     207a <__GNU_EH_FRAME_HDR+0x66>
    200e:       64 21 00                and    %eax,%fs:(%rax)

Remember earlier how I said that the disassembler tries to
disassemble code even if it isn't code? This is a good example.
Ignore the assembly language, it's complete gibberish. But if we look
at address 0x2004, we see the hex bytes 48 65 6c 6c 6f 20 57 6f 72 6c
64 21 00, which translates to the string "Hello World!", followed by
a NULL terminator.

But didn't our string also include a newline, \n, which should be
translated to ASCII 0x0a? Yes, but this is another artifact of the
compiler's optimisation. The puts() function prints out the string
with a trailing newline, while printf() doesn't. So the compiler
removes our newline from the string, and we only end up with one in
the output.

Then we see a 0x00 NULL byte. This is called a NULL terminator, and
it appears at the end of all C strings. In C, our string isn't
associated with any length information. So a function that takes a
string of any length as an argument will act on it one byte at a
time, until it sees a NULL terminator. If we had multiple strings in
memory and no NULL terminators between them, then C functions would
operate on all of the strings together. Eventually the functions
would get to the end and start reading memory that they aren't
allowed to read, and your program would crash with the dreaded
"Segmentation Fault".

Following the puts()

So puts() is located at 0x1050.

Disassembly of section .plt.sec:

0000000000001050 <puts@plt>:
    1050:       f3 0f 1e fa             endbr64
    1054:       f2 ff 25 75 2f 00 00    bnd jmp *0x2f75(%rip)        # 3fd0 <puts@GLIBC_2.2.5>
    105b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Okay, so it's now calling back into the standard library
(technically it jumps through the global offset table, but it
eventually ends up in the standard library).

Again, we don't want to read the disassembly of the standard library,
but luckily Glibc (our C standard library) is open source. So where
does that take us?

Well, puts() is aliased to the function _IO_puts in the standard
library.

int
_IO_puts (const char *str)
{
  int result = EOF;
  size_t len = strlen (str);
  _IO_acquire_lock (stdout);

  if ((_IO_vtable_offset (stdout) != 0
       || _IO_fwide (stdout, -1) == -1)
      && _IO_sputn (stdout, str, len) == len
      && _IO_putc_unlocked ('\n', stdout) != EOF)
    result = MIN (INT_MAX, len + 1);

  _IO_release_lock (stdout);
  return result;
}

So it gets the length of our string, obtains a lock on the output
stream, does some checks, and calls _IO_sputn. It then releases the
lock and returns the number of printed characters.

I searched for this function, but couldn't find it. Apparently
_IO_sputn is a macro that looks up a function pointer in the stream's
jump table, _IO_file_jumps, and for stdout that ends up calling
_IO_new_file_xsputn.

size_t
_IO_new_file_xsputn (FILE *f, const void *data, size_t n)
{
  const char *s = (const char *) data;
  size_t to_do = n;
  int must_flush = 0;
  size_t count = 0;

  if (n <= 0)
    return 0;
  /* This is an optimized implementation.
     If the amount to be written straddles a block boundary
     (or the filebuf is unbuffered), use sys_write directly. */

  /* First figure out how much space is available in the buffer. */
  if ((f->_flags & _IO_LINE_BUF) && (f->_flags & _IO_CURRENTLY_PUTTING))
    {
      count = f->_IO_buf_end - f->_IO_write_ptr;
      if (count >= n)
        {
          const char *p;
          for (p = s + n; p > s; )
            {
              if (*--p == '\n')
                {
                  count = p - s + 1;
                  must_flush = 1;
                  break;
                }
            }
        }
    }
  else if (f->_IO_write_end > f->_IO_write_ptr)
    count = f->_IO_write_end - f->_IO_write_ptr; /* Space available. */

  /* Then fill the buffer. */
  if (count > 0)
    {
      if (count > to_do)
        count = to_do;
      f->_IO_write_ptr = __mempcpy (f->_IO_write_ptr, s, count);
      s += count;
      to_do -= count;
    }
  if (to_do + must_flush > 0)
    {
      size_t block_size, do_write;
      /* Next flush the (full) buffer. */
      if (_IO_OVERFLOW (f, EOF) == EOF)
        /* If nothing else has to be written we must not signal the
           caller that everything has been written.  */
        return to_do == 0 ? EOF : n - to_do;

      /* Try to maintain alignment: write a whole number of blocks.  */
      block_size = f->_IO_buf_end - f->_IO_buf_base;
      do_write = to_do - (block_size >= 128 ? to_do % block_size : 0);

      if (do_write)
        {
          count = new_do_write (f, s, do_write);
          to_do -= count;
          if (count < do_write)
            return n - to_do;
        }

      /* Now write out the remainder.  Normally, this will fit in the
         buffer, but it's somewhat messier for line-buffered files,
         so we let _IO_default_xsputn handle the general case. */
      if (to_do)
        to_do -= _IO_default_xsputn (f, s+do_write, to_do);
    }
  return n - to_do;
}

Wow. All of this for a Hello World. I am not going to try and
understand how this code works, even with comments. So at this point
I realised that using Glibc to explain this is going to be a pain. So
here, I decided to look at musl libc, which I know is supposed to be
smaller.

musl

So in musl, puts() is defined as follows:

int puts(const char *s)
{
        int r;
        FLOCK(stdout);
        r = -(fputs(s, stdout) < 0 || putc_unlocked('\n', stdout) < 0);
        FUNLOCK(stdout);
        return r;
}

Okay, so it obtains a lock on the output stream, calls fputs() to
print our string and putc_unlocked() to print the newline, and
unlocks the output stream.

How is fputs() defined?

#include "stdio_impl.h"
#include <string.h>

int fputs(const char *restrict s, FILE *restrict f)
{
        size_t l = strlen(s);
        return (fwrite(s, 1, l, f)==l) - 1;
}

It gets the length of our string, and calls fwrite() with the output
stream, our string, and its length.

How is fwrite() defined?

size_t fwrite(const void *restrict src, size_t size, size_t nmemb, FILE *restrict f)
{
        size_t k, l = size*nmemb;
        if (!size) nmemb = 0;
        FLOCK(f);
        k = __fwritex(src, l, f);
        FUNLOCK(f);
        return k==l ? nmemb : k/size;
}

It gets another lock on the output stream, calls __fwritex(), and
unlocks the output stream.

How is __fwritex() defined?

size_t __fwritex(const unsigned char *restrict s, size_t l, FILE *restrict f)
{
        size_t i=0;

        if (!f->wend && __towrite(f)) return 0;

        if (l > f->wend - f->wpos) return f->write(f, s, l);

        if (f->lbf >= 0) {
                /* Match /^(.*\n|)/ */
                for (i=l; i && s[i-1] != '\n'; i--);
                if (i) {
                        size_t n = f->write(f, s, i);
                        if (n < i) return n;
                        s += i;
                        l -= i;
                }
        }

        memcpy(f->wpos, s, l);
        f->wpos += l;
        return l+i;
}

This is a fair bit of code, but the main thing it's doing is calling
write() on the output stream's FILE object. Our stream is defined as
stdout, so where is that defined?

hidden FILE __stdout_FILE = {
        .buf = buf+UNGET,
        .buf_size = sizeof buf-UNGET,
        .fd = 1,
        .flags = F_PERM | F_NORD,
        .lbf = '\n',
        .write = __stdout_write,
        .seek = __stdio_seek,
        .close = __stdio_close,
        .lock = -1,
};

So the write function is defined as __stdout_write(). How is that
defined?

size_t __stdout_write(FILE *f, const unsigned char *buf, size_t len)
{
        struct winsize wsz;
        f->write = __stdio_write;
        if (!(f->flags & F_SVB) && __syscall(SYS_ioctl, f->fd, TIOCGWINSZ, &wsz))
                f->lbf = -1;
        return __stdio_write(f, buf, len);
}

It makes a TIOCGWINSZ ioctl on the output stream to check whether it
is a terminal (if it isn't, line buffering is turned off), and calls
__stdio_write(). How is that defined?

size_t __stdio_write(FILE *f, const unsigned char *buf, size_t len)
{
        struct iovec iovs[2] = {
                { .iov_base = f->wbase, .iov_len = f->wpos-f->wbase },
                { .iov_base = (void *)buf, .iov_len = len }
        };
        struct iovec *iov = iovs;
        size_t rem = iov[0].iov_len + iov[1].iov_len;
        int iovcnt = 2;
        ssize_t cnt;
        for (;;) {
                cnt = syscall(SYS_writev, f->fd, iov, iovcnt);
                if (cnt == rem) {
                        f->wend = f->buf + f->buf_size;
                        f->wpos = f->wbase = f->buf;
                        return len;
                }
                if (cnt < 0) {
                        f->wpos = f->wbase = f->wend = 0;
                        f->flags |= F_ERR;
                        return iovcnt == 2 ? 0 : len-iov[0].iov_len;
                }
                rem -= cnt;
                if (cnt > iov[0].iov_len) {
                        cnt -= iov[0].iov_len;
                        iov++; iovcnt--;
                }
                iov[0].iov_base = (char *)iov[0].iov_base + cnt;
                iov[0].iov_len -= cnt;
        }
}

We're on the home stretch now. This is doing a lot, but it's calling
syscall() with SYS_writev as its first parameter. So how is syscall()
defined?

long syscall(long n, ...)
{
        va_list ap;
        syscall_arg_t a,b,c,d,e,f;
        va_start(ap, n);
        a=va_arg(ap, syscall_arg_t);
        b=va_arg(ap, syscall_arg_t);
        c=va_arg(ap, syscall_arg_t);
        d=va_arg(ap, syscall_arg_t);
        e=va_arg(ap, syscall_arg_t);
        f=va_arg(ap, syscall_arg_t);
        va_end(ap);
        return __syscall_ret(__syscall(n,a,b,c,d,e,f));
}

syscall() takes a system call number as its first argument, and a
variable number of additional arguments. The va_arg() calls are
reading those arguments into the variables a, b, c, d, e, and f. We
then call __syscall() with those arguments, and the result goes into
__syscall_ret().

Unfortunately, I couldn't find the definition for __syscall(), but I
feel that this is because we're getting into platform-specific
territory. Musl is a multi-architecture C library, so from this point
the code that runs depends on what architecture we're using. Before I
dove into that, I looked at __syscall_ret():

long __syscall_ret(unsigned long r)
{
        if (r > -4096UL) {
                errno = -r;
                return -1;
        }
        return r;
}

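It checks whether the return value from __syscall() is an error
code. The kernel reports failure by returning a small negative number
(between -4095 and -1). In that case the function stores the positive
error number in errno and returns -1. Otherwise it passes the value
through unchanged.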

System Calls

So, the last few stages of our Hello World call have involved
system calls. What is a system call? Well, no matter how big our C
library is, there are some things it will never be able to do for us.
One of those things is talking to the hardware. The ability to do
that is reserved for the kernel, the part of the operating system
that controls and shares access to IO devices, memory, and CPU time.
In our case, this is the Linux kernel. In Windows world, this is
ntoskrnl.exe, which shows up as System in Task Manager.

This means that our puts() call has to end with us telling the OS to
do something for us. In this case, we're asking the OS to write some
text to the output stream. Writing to a stream is done with the write
system call. Musl is using a similar system call called writev, which
can write multiple buffers in an array. So let's look at how musl
makes system calls.

static __inline long __syscall0(long n)
{
        unsigned long ret;
        __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n) : "rcx", "r11", "memory");
        return ret;
}

static __inline long __syscall1(long n, long a1)
{
        unsigned long ret;
        __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1) : "rcx", "r11", "memory");
        return ret;
}

static __inline long __syscall2(long n, long a1, long a2)
{
        unsigned long ret;
        __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2)
                                                  : "rcx", "r11", "memory");
        return ret;
}

static __inline long __syscall3(long n, long a1, long a2, long a3)
{
        unsigned long ret;
        __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
                                                  "d"(a3) : "rcx", "r11", "memory");
        return ret;
}

static __inline long __syscall4(long n, long a1, long a2, long a3, long a4)
{
        unsigned long ret;
        register long r10 __asm__("r10") = a4;
        __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
                                                  "d"(a3), "r"(r10): "rcx", "r11", "memory");
        return ret;
}

static __inline long __syscall5(long n, long a1, long a2, long a3, long a4, long a5)
{
        unsigned long ret;
        register long r10 __asm__("r10") = a4;
        register long r8 __asm__("r8") = a5;
        __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
                                                  "d"(a3), "r"(r10), "r"(r8) : "rcx", "r11", "memory");
        return ret;
}

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
        unsigned long ret;
        register long r10 __asm__("r10") = a4;
        register long r8 __asm__("r8") = a5;
        register long r9 __asm__("r9") = a6;
        __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
                                                  "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
        return ret;
}

We have reached the bottom. These are 7 different functions that musl
uses for making system calls on the x86_64 platform. Each of them
takes a different number of arguments for the system call.

Each function has an __asm__ directive. This embeds inline assembly
code into the machine language output of the compiler. We make system
calls to the operating system by setting some CPU registers with our
parameters, and executing the syscall instruction. Control then gets
passed to the kernel, which reads our parameters and executes our
system call.

The kernel

The Linux kernel now has to perform the action requested by the
system call. The write system call tells the kernel to write to an
opened file on the filesystem, or write to a stream, which is what we
are doing in this case.

The write system call takes 3 parameters: the file descriptor to
write to, the buffer to write, and the number of bytes to write. The
writev system call used by musl is different, but let's focus on
write for now.

So where exactly are we writing to?

$ ps
    PID TTY          TIME CMD
  15705 pts/0    00:00:00 bash
  23332 pts/0    00:00:00 ps
$ cd /proc/15705/fd
$ readlink 1
/dev/pts/0

That depends.

In my case, I'm running the hello program in the GNOME terminal
emulator, a graphical application. It appears to the kernel as a
pseudo-terminal (pty). So the kernel saves our Hello World message in
a buffer, and when the terminal emulator program runs, it reads it
and displays it. Voila.

Of course, we aren't done. The terminal emulator then has to render
the text into a frame (potentially using the GPU to do it), send this
frame to the X server/compositor, which combines it with the other
apps I have running (also using the GPU), like the text editor I'm
using to write this, and sends it back to the kernel, which then
displays it.

Sheesh. I glossed over a lot there, because it doesn't matter and it
may be completely different for you. Maybe you're logged in remotely,
in which case, the kernel sends your text to sshd, which then sends
it (encrypted) back to the kernel in a packet to be sent over the
internet. Maybe you're using a physical terminal, connected to a
serial-to-USB adapter. The kernel then has to put your text in a USB
packet and send it down the line. Maybe you're using the framebuffer
console, which is the default way to interact with the OS if you
don't have a GUI installed. In that case, the kernel has to render
the text into a frame and output that to the display.

The point is that it could be anything that happens next, and it
really doesn't matter what it is. Because your Hello World message
being sent is only one system call, from one program, out of millions
of system calls and thousands of programs running on your computer
right now.

Conclusion

So, modern software systems on today's hardware are so complex and
intricate that it really makes no sense to try and fully understand
one little thing that your computer did. It's clear that I glossed
over a lot in order to explain everything that I did. I didn't go
over all the edge cases, additional information, and other things
that the computer does. I didn't explain how the kernel works. This
is all stuff for other people to explain, or for you to learn about
on your own time.

If you actually read this all the way through, congratulations. I'm
sorry the ending maybe wasn't as satisfying as you hoped. I'm happy
someone found this interesting. I'm not quite sure why I wrote this,
but it's now after midnight so I should get some sleep.

Thank you for reading.

    Hey, So how does a Hello World program actually work?

    Don't ask.