https://popovicu.com/posts/risc-v-sbi-and-full-boot-process/

Skip to content
Uros Popovic

  * Posts
  * Tags
  * About
  *  
  * 

---------------------------------------------------------------------
Go back

RISC-V SBI and the full boot process

Posted on:September 9, 2023 |  at 09:00 PM

In the last article, we covered bare metal programming on RISC-V.
Please familiarize yourself with that material before proceeding with
the rest of this article, as this article is a direct continuation of
the aforementioned one.

This time we are talking about RISC-V SBI (Supervisor Binary
Interface), with OpenSBI as the example. We'll look at how SBI can
assist us with implementing operating system kernel primitives and
we'll end the article with a practical example using riscv64 virt
machine.

Table of contents

Open Table of contents

  * RISC-V and "BIOS"

      + Machine modes
      + SBI
      + Fancy abstractions
      + Binary interface
  * Practical example with OpenSBI

  * Booting the OS kernel after SBI and calling into OpenSBI

      + What really happens in the ZSBL?

      + 3 flavors of OpenSBI

          o FW_PAYLOAD
          o FW_JUMP
          o FW_DYNAMIC
          o Exploring the fw_dynamic_info struct
          o Building an "infinite-loop fake kernel"
          o Intentionally skipped details
  * Hello world fake kernel

  * Conclusion

  * Code pointers

RISC-V and "BIOS"

In the article mentioned above, we talked extensively about the very
first stages of the RISC-V bootup process. We mentioned that first
the ZSBL (Zero Stage Bootloader) runs, initializes a few registers
and jumps directly to some address hardcoded by ZSBL. In the case of
QEMU's riscv64 virt, the hardcoded address is 0x80000000. This is
where the first user-provided code runs, and if left to default, QEMU
will load OpenSBI there.

Machine modes

So far we have avoided talking about different machine modes, and now
is the perfect time to introduce them. The concept with machines
modes is that not every piece of software should be able to access
just about any memory address on the machine, or even execute just
about any instructions available with the CPU. Traditionally, in a
textbook example, the two main divisions are made here:

 1. Privileged mode
 2. Unprivileged mode

The privileged mode is where the machine starts at the boot time. Any
instruction is permitted and no address access is considered an
access violation. Once the operating system takes over the control of
the system and starts launching the user code (aka userspace code),
the modes start switching. When the user code is running on the CPU
core, it is running within the unprivileged mode where not everything
is accessible. Going back to the kernel mode means switching back to
the privilged mode.

This is a very textbook and simplistic view at the permissions of
operations and the question arises: why only 2 modes?

In systems, more than 2 modes typically exist, forming a protection
ring with multiple access modes. RISC-V specification does not
necessarily prescribe exactly which modes must be implemented for a
core, except the M (Machine) mode. This is the most privileged mode.

Typically, the processors with M mode only are simple embedded
systems, moving over more secure systems (M and S modes), all the way
to full systems that can run Unix-like operating systems (M, S and U
modes).

SBI

The official docs provide a formal definition, and I will try to
water it down here with the goals of making it more intuitive.

RISC-V's SBI spec defines the layer of software that sits at the
bottom of the RISC-V software stack. This is very similar to BIOS,
which is traditionally the first bit of software that runs on a
machine. You might have seen some of the guides for developing a
simple kernel from scratch, and they typically involve something
similar to what we did in the initial guide for bare metal
programming on RISC-V, with a small twist -- they are very often
actually depending on the pre-existing software to do some I/O. The
similarity to our previous guide is that they also carefully align
the first instructions to the correct address to ensure that the
processor's execution flow goes as intended and the simple kernel
takes over at the right time, however, what I have typically observed
in those short guides is that the goal is typically to print
something like 'Hello world' to the VGA screen. This last bit sounds
like a fairly complex operation, and it really is.

How is printing to the VGA then done easily then? The answer is that
BIOS is here to assist with the most basic I/O operations such as
printing some characters to the screen, hence its name -- Basic Input
Output System! Please pay attention to the opening section of the
bare metal programming guide: we were achieving interaction with the
user without depending on any existing software on the machine (well,
almost true, we still went through the Zero Stage Bootloader, but we
didn't depend on any outcome from it, nor we really had any control
over it; it's simply hardcoded into the system). If we were to print
something on the VGA screen, instead of sending characters out
through UART, we would have to do a lot more than send an ASCII code
to a single address. VGA involves setting up the display device into
the right mode, by sending multiple values over, setting up different
parameters, etc. It's a fairly ellaborate operation.

So how does BIOS traditionally help with tasks like these? The main
concept is that whatever operating system ends up installed on the
machine, it would anyway need some basic functionality, such as
printing some information to the VGA screen. Thus, the machine can
have these standard operations simply baked into it and ready to
consume by whatever operating system ends up on the machine.
Conceptually, we can think of these procedures as an everyday library
we write our applications against.

Additionally, if an operating system is written against such a
"library", it automatically becomes more portable. The "library"
should have all the low level details, such as "outputting to UART
means writing to 0x10000000" (as is the case with QEMU's riscv64 virt
VM), vs. "outputting to UART means writing to 0x12345678", and the
operating system simply needs to invoke "outputting to UART"
procedure, while this "library" will know exactly how to interact
with the hardware.

Fancy abstractions

This is all just a lot of talk for a very simple concept we have been
using in programming since day 1: we apply layers of abstractions in
our coding. Think of something like a Python function that does
something like "sending a local file to an email address". From a
high level perspective, we simply call a function send_file_to_email
(file, email) and the underlying library opens up the network
connection and starts pumping the bytes. This could be just another
Python library. At some point, that will likely move down the
software stack, and the Python library will depend on the Python
runtime written in something like C to make a system call to the
operating system (for example, to perform a core operation such as
opening a network socket). The operating system has a network driver
somewhere deep down, which knows to which address in the address
space does it need to send the individual bytes in order to send the
bytes over the wire to the network and so on. The main concept here
is that we have an established way of hiding the complexity of
operations by delegating them to the lower layers of the software
stack. We built the larger system not from the atomic parts, but out
of "molecules".

If we're delegating the complexity to the underlying library, it
probably just means a function call. However, once it's time to
delegate the complexity to the operating system and lower, this
happens through a binary interface.

Binary interface

Since basically forever, the x86 has been the dominant architecture
for the computers we use, be it desktops or laptops. Things have been
changing a lot lately, and other architectures are entering the
picture, but let's focus on just x86. What then, makes an application
built for Linux incompatible with the application for Windows? If
it's written for x86, and both Linux and Windows run on x86, what
could possibly be the differentiator here? The CPU instructions are
not different from one platform and the other, so what could it be?
The answer is the interface between the application and the operating
system. This particular link between the user software and the
operating system is called the application binary interface (ABI).
ABI is just a definition that says how the services from the
operating system are invoked from the user application.

Therefore, when we say something like "this software is written for
platform X", it's not enough to just say that X is x86 or RISC-V, we
must say x86/Linux or x86/Windows or RISC-V Linux etc. The platform
definition may be even more complex than that if things like dynamic
linking are involved, but let us not go there for now.

Let's take a quick example at a program written in assembly for x86/
Linux that just prints a 'Hello' string to the standard output.

.global _start

.section .text

_start: mov $4, %eax ; 4 is the code for the 'write' system call
        mov $1, %ebx ; We are writing to file 1, i.e. the 'standard output'
        mov $message, %ecx ; The data we want to print is at the address defined by the symbol message
        mov $5, %edx ; The length of the data we want to print is 5
        int $0x80 ; Invoke the system call, i.e. ask kernel to print the data to the standard output

        mov $1, %eax ; 1 is the code for the 'exit' system call
        mov $0, %ebx ; 0 is the process return code
        int $0x80  ; Invoke the system call, i.e. ask the the kernel to close this process

.section .data
message: .ascii "Hello"

Assemble this program with:

as -o syscall.o syscall.s

Link it with:

ld -o syscall syscall.o

Run with:

./syscall

You should see the output "Hello". If you're on Bash and you also
want to double check the process return code, simply run:

echo $?

And you should see 0.

Tip: If you want to try out this example from above, but you do not
have access to an x86/Linux machine, you can do this through a
JavaScript VM that emulates an x86 system in-browser here; that's a
really cool website!

And there we have it: a program which prints a message to the
standard output when run on an x86 machine with a Linux kernel. C
standard library was not used. The final ELF binary should run on
Linux with no dependencies other than it is run on the correct
platform.

Now back to the question, what makes this binary incompatible with
Windows (potentially)? Another operating system encodes the system
calls differently (e.g. writing isn't code 4, but code 123, or the
parameters are passed through different CPU registers). And now you
have a good idea of how to directly interface with the kernel,
without the assistance of the standard library (although you probably
almost never want to do it). This means you have uncovered the layer
at which software does things like opening files, allocates memory,
sends signals, etc. The C standard library can be thought of as a
wrapper which hides this complexity of invoking software interrupts
through the int instruction to communicate with the kernel, and
instead makes it look like a normal call to a C function, and then
under the hood, this is what it is. To be fair, the library does a
lot more than that, but for the purposes of this article, it can be
thought of simply as a wrapper.

And now in the RISC-V world, we have the same thing: the user
application interfaces with the kernel through software interrupt CPU
instructions, and passing the parameters through the CPU registers.
And the kernel basically does the same thing with the SBI in order to
invoke its services! It's just that this final layer of logic
invocation is called the SBI, not the ABI. A way to think about it is
that it is not the application that works in the lower layer, but
rather the supervisor of the applications. The difference, however,
is in the name only, and the concept remains absolutely the same.

Practical example with OpenSBI

At this point we have established that SBI, much like ABI, is just a
way of invoking a functionality in the lower layers of the software
stack. Furthermore, we also established the SBI sits at the bottom of
the software stack on a RISC-V machine, and runs in the most
privileged M mode. Let's add some more details to this picture.

It should also make sense at this point why the QEMU developers chose
the -bios flag in order to accept the SBI software image (because the
functionality is basically the same as BIOS). As a reminder, the
-bios flag should point to an ELF file that will lay out the SBI
software out in memory starting from address 0x80000000.

Let's start the QEMU's VM with just OpenSBI loaded, and see what
happens. We shouldn't really have to pass anything to QEMU since it
defaults to loading OpenSBI at 0x80000000.

qemu-system-riscv64 -machine virt

This is the output (on the serial port, not VGA):

OpenSBI v0.8
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

Platform Name       : riscv-virtio,qemu
Platform Features   : timer,mfdeleg
Platform HART Count : 1
Boot HART ID        : 0
Boot HART ISA       : rv64imafdcsu
BOOT HART Features  : pmp,scounteren,mcounteren,time
BOOT HART PMP Count : 16
Firmware Base       : 0x80000000
Firmware Size       : 96 KB
Runtime SBI Version : 0.2

MIDELEG : 0x0000000000000222
MEDELEG : 0x000000000000b109
PMP0    : 0x0000000080000000-0x000000008001ffff (A)
PMP1    : 0x0000000000000000-0xffffffffffffffff (A,R,W,X)

The machine keeps spinning in place, presumably because it is set up
to do so by default since there is no other piece of software passed
to QEMU to take over the control after OpenSBI. At this point, things
look good, it seems like OpenSBI has been set up properly (and its
output confirms that it sits right at 0x80000000).

How do we keep going up the software stack, how do we add a new
layer? The new layer could be something like an operating system
kernel, so similarly to how we have previously built an ELF file
containing instructions to be placed at 0x80000000, we will build
another ELF file for QEMU to load into its memory, but this time the
instructions will come to another address, since the portion starting
at 0x80000000 has already been taken over by OpenSBI.

Which address should we load our fake "kernel" at, then?

Booting the OS kernel after SBI and calling into OpenSBI

When we loaded the BIOS/SBI/whatever you want to call it, the address
was basically burnt into the machine's logic. The first few
instructions were Zero Stage Bootloader (ZSBL) and the final
instruction from there was jumping to the hardcoded address
0x80000000. As we previously mentioned, this is an immutable fact of
the platform we're working with, it's just simply what it does.
However, that's all it really hardcodes at this point: it just
hardcodes that you will have to start from 0x80000000, and now we
have OpenSBI placed there, so where does OpenSBI take us next?

Now enters the importance of the ZSBL again and now it really matters
how it initializes those registers before performing that hardcoded
jump to 0x80000000. What ZSBL really does is two things:

 1. Ensures that the software running after OpenSBI's initialization
    can run, and this is basically the OS kernel bootloader, or it
    could be the kernel itself directly (which is what you typically
    see in QEMU guides where you launch Linux, bootloader is skipped
    and the memory is immediately loaded with the kernel).
 2. Jumps to the OpenSBI.

We have covered the second point in great detail so far, so let's now
dig deeper into how does it accomplish point #1.

What really happens in the ZSBL?

We have mentioned before that ZSBL execution starts at the address
0x1000. Let's trace the execution through QEMU and see what's going
on. To do that, we'll add 2 flags to the QEMU CLI command: -s and -S.
These flags ensure that QEMU exposes a gdb debug port, and
additionally, the VM pauses immediately upon creation, waiting for us
to drive it manually (which we will do through gdb).

Let's begin this reverse engineering process. We're starting QEMU as
so:

qemu-system-riscv64 -machine virt -s -S

In another terminal, we connect to the gdb server nested in QEMU, so
we can drive the VM forward. I am doing this on an x86 machine, so I
will use gdb-multiarch so I can do a cross-platform debug for riscv.
So in this new terminal, I just run:

gdb-multiarch

I want to set up a few things before I connect into the VM to drive
it forward:

set architecture riscv:rv64

It should be obvious what the line above does. Next, I want to get
the actual running instruction printed to my terminal each time I
move one instruction:

set disassemble-next-line on

It's time to connect to the QEMU gdb server (port 1234 is I believe
hardcoded by QEMU, though it may be configurable by the -s flag
somehow; I never tried it and I don't think you'll need to change
this behavior)

target remote :1234

And right there, gdb is waiting for us at 0x1000, exactly where the
very first instruction after power on happens. We will use si a few
times to step through instructions one by one, until we get to the
jump to SBI at 0x80000000.

(gdb) target remote:1234
Remote debugging using :1234
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0x0000000000001000 in ?? ()
=> 0x0000000000001000:  97 02 00 00     auipc   t0,0x0
(gdb) si
0x0000000000001004 in ?? ()
=> 0x0000000000001004:  13 86 82 02     addi    a2,t0,40
(gdb) si
0x0000000000001008 in ?? ()
=> 0x0000000000001008:  73 25 40 f1     csrr    a0,mhartid
(gdb) si
0x000000000000100c in ?? ()
=> 0x000000000000100c:  83 b5 02 02     ld      a1,32(t0)
(gdb) si
0x0000000000001010 in ?? ()
=> 0x0000000000001010:  83 b2 82 01     ld      t0,24(t0)
(gdb) si
0x0000000000001014 in ?? ()
=> 0x0000000000001014:  67 80 02 00     jr      t0
(gdb) si
0x0000000080000000 in ?? ()
=> 0x0000000080000000:  33 04 05 00     add     s0,a0,zero

There were only 6 instructions in ZSBL before handing the control
over to the OpenSBI, including the jump itself. However, what are
these few instructions that happened, what is their significance?

It turns out that all this is part of the SBI specification too, it's
a part of the boot sequence. However, with OpenSBI, there are 3
different flavors of this dance, and let's look at those flavors
first before getting into a lot of details on what happens after the
ZSBL.

3 flavors of OpenSBI

You can build OpenSBI in 3 different ways:

 1. FW_PAYLOAD (official docs)
 2. FW_JUMP (official docs)
 3. FW_DYNAMIC (official docs)

FW_PAYLOAD

This one is probably the easiest to understand conceptually. When
building this flavor of OpenSBI, you will literally point the make
tool to your kernel/"whatever you want to run after OpenSBI" image
and you will get a single binary payload that you can directly load
wherever you first CPU instructions start from (in QEMU's VM case,
0x80000000). As I understand, it is possible to tweak the exact
location of your software in relation to the OpenSBI blob in the
memory, but for simplicity, the mental model we can apply here is
that OpenSBI and your software blob are spliced together into a
single blob and once the OpenSBI initialization finishes, the very
next instruction is your software (you basically slide right into
your software after OpenSBI).

The way to achieve this is:

 1. Make sure FW_PAYLOAD=y is set in the make process, this will
    ensure a file called fw_payload is generated.
 2. Point FW_PAYLOAD_PATH in your make process to the software you
    want to run after OpenSBI.

Per the docs linked aboved, if you skip the second flag, a very
simple piece of software will be spliced with OpenSBI: a blank
infinite loop. That explains why when we just launched QEMU with no
flags, basically with OpenSBI only, the machine kept spinning in
place -- OpenSBI was likely built this way (since you can't just keep
executing random contents of the memory) and it was just busy waiting
in place.

The upside of this approach is that now you have a single, spliced,
monolithic software image to load into your machine. You don't have
to deal with multiple floating pieces, just one monolith. If your
build process for the software is straightforward, you may even end
up with a really easy way to manage all the software on the target
machine, while getting all the upside of having OpenSBI do some work
for you.

The downside is that you are now responsible for building everything
together, including OpenSBI. What's worse, if the machine already had
OpenSBI, let's imagine, burnt into some ROM, it already has OpenSBI
to boot up, having it twice on a machine likely won't cut it.

FW_JUMP

This one is fairly simple too: you basically hardcode the address of
your software that comes after OpenSBI. Similarly to above, 2 steps
are needed.

 1. Make sure FW_JUMP=y is set in the make process, this will ensure
    a file called fw_jump is generated.
 2. Set FW_JUMP_ADDR in the make process to the address where OpenSBI
    should jump once its done.

This is quite similar to what we had in the previous scenario, only
the jump address is hardcoded. It seems like in this case you are
still necessarily responsible for building the OpenSBI image, but
it's easy to rebuild it and point to different addresses for
different machines (let's say different machines with varying memory
layouts).

FW_DYNAMIC

This one is the most generalized flavor and that's why we leave it
for last. This is where the importance of the register set up in ZSLB
shines.

In this flavor, the boot stage that happens before OpenSBI is in
charge of passing a few pointers to OpenSBI. In this case, we're of
course talking about the ZSBL. If we play close attention, we see
that it touches the register a2.

At this point, I would like to encourage the reader to also read the
section on ZSBL from this article. The whole article is great, I just
initially found it a little tough to go through, so consider this
article a warmup for understanding that article, it's really worth
going through.

Anyway, keeping this article watered down still -- what is the
significance of setting up the register a2 in ZSBL? It points to a
struct struct fw_dynamic_info which gives the dynamic OpenSBI flavor
a way to continue going through the boot process! In fact, one of the
piece of data in this struct is the address of the next piece of
software running after OpenSBI! A good question to ask is: on a real
machine, who populates this struct? Based on what we'll see below,
it's obvious that QEMU hardcodes this content into the memory, and
that logic is not part a of the ZSBL, but I can definitely imagine a
device where ZSBL actually populates this struct and passes it on to
OpenSBI.

Slide 17 of this presentation by an engineer from Western Digital
(presumably a core contributor to OpenSBI) outlines the contents of
this struct:

 1. Magic number
 2. Version
 3. Next address
 4. Next mode
 5. Options

All of these are unsigned longs (I guess that means 64 bit, 8
bytes?).

Exploring the fw_dynamic_info struct

At this point, let's take a quick detour to make sure we're on the
same page. Let's quickly make sure we're all looking at the same
version of the OpenSBI because different systems have different
version of QEMU which may come with a different version of OpenSBI.
Building OpenSBI from source is really straightforward, so let's
quickly do it. First, we need to clone the Git repo (time of writing
of this article is 10th Sept 2023; if you want to achieve full
reproducibility, build at a commit at this date):

git clone https://github.com/riscv-software-src/opensbi.git
cd opensbi
make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- PLATFORM=generic

The build should be fairly fast and lightweight. The output file
we're interested in is build/platform/generic/firmware/
fw_dynamic.bin. We'll pass this through the -bios flag to QEMU.
Starting QEMU with (from the opensbi folder we just cloned with Git):

qemu-system-riscv64 -machine virt -s -S -bios build/platform/generic/firmware/fw_dynamic.bin

After a few sis in gdb, we get back to where we were before. Let's
poke QEMU's memory to see what's going on there at the end of ZSBL.
At the last instruction of ZSBL, we look at the register dump (we use
i r for this).

=> 0x0000000080000000:  33 04 05 00     add     s0,a0,zero
(gdb) i r
ra             0x0      0x0
sp             0x0      0x0
gp             0x0      0x0
tp             0x0      0x0
t0             0x80000000       2147483648
t1             0x0      0
t2             0x0      0
fp             0x0      0x0
s1             0x0      0
a0             0x0      0
a1             0x87e00000       2279604224
a2             0x1028   4136
a3             0x0      0
a4             0x0      0
a5             0x0      0
a6             0x0      0
a7             0x0      0
s2             0x0      0
s3             0x0      0
s4             0x0      0
s5             0x0      0
s6             0x0      0
s7             0x0      0
s8             0x0      0
s9             0x0      0
s10            0x0      0
s11            0x0      0
t3             0x0      0
t4             0x0      0
t5             0x0      0
t6             0x0      0
pc             0x80000000       0x80000000

a2 is therefore pointing to 0x1028. As we said, let's poke that
memory with gdb. We'll ask it to read 10 successive 8-byte values
starting from 0x1028, and display them in hex format.

(gdb) x/10xg 0x1028

The g flag prints out the memory contents in 8-byte (giant) chunks.

(gdb) x/10xg 0x1028
0x1028: 0x000000004942534f      0x0000000000000002
0x1038: 0x0000000000000000      0x0000000000000001
0x1048: 0x0000000000000000      0x0000000000000000
0x1058: 0x0000000000000000      0x0000000000000000
0x1068: 0x0000000000000000      0x0000000000000000

This roughly seems to match Vysakh's article. We definitely see the
magic described in that article, followed by the 0x02 info version.
Next should be the address for the next jump, but there are all
zeroes... This is strange, but let's keep looking. Next value is 0x01
which again, according to the article, should correspond to the next
mode of execution which is S. This is correct, we're going from M
mode running SBI to the S mode running the OS kernel bootloader, or
the kernel itself, whatever we want. Why is the address of the next
jump all zeroes though? At this point, I'll just let QEMU run without
interference from gdb. I run the following in gdb:

continue

Everything is sort of hanging, but I got a newer OpenSBI output on
UART since I am now running a newer version of OpenSBI:

OpenSBI v1.3-54-g901d3d7
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

Platform Name             : riscv-virtio,qemu
Platform Features         : medeleg
Platform HART Count       : 1
Platform IPI Device       : aclint-mswi
Platform Timer Device     : aclint-mtimer @ 10000000Hz
Platform Console Device   : uart8250
Platform HSM Device       : ---
Platform PMU Device       : ---
Platform Reboot Device    : syscon-reboot
Platform Shutdown Device  : syscon-poweroff
Platform Suspend Device   : ---
Platform CPPC Device      : ---
Firmware Base             : 0x80000000
Firmware Size             : 322 KB
Firmware RW Offset        : 0x40000
Firmware RW Size          : 66 KB
Firmware Heap Offset      : 0x48000
Firmware Heap Size        : 34 KB (total), 2 KB (reserved), 9 KB (used), 22 KB (free)
Firmware Scratch Size     : 4096 B (total), 768 B (used), 3328 B (free)
Runtime SBI Version       : 1.0

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x0000000002000000-0x000000000200ffff M: (I,R,W) S/U: ()
Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000000000000
Domain0 Next Arg1         : 0x0000000087e00000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes

Boot HART ID              : 0
Boot HART Domain          : root
Boot HART Priv Version    : v1.10
Boot HART Base ISA        : rv64imafdc
Boot HART ISA Extensions  : zicntr
Boot HART PMP Count       : 16
Boot HART PMP Granularity : 4
Boot HART PMP Address Bits: 54
Boot HART MHPM Info       : 0 (0x00000000)
Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109

This matches what we saw above, the next address is all zeroes... This
is strange, there's no way that could be true. I now ran QEMU without
the initial pause, just letting it run and connecting with gdb
asynchronously. I'll spare you the details, but inspecting the
registers on that "live run" definitely showed to me that nothing is
executing in the 0x0000000000000000 area. The CPU seems to be
spinning around some other address.

This likely has something to do with the fact that I actually didn't
pass any software to QEMU to load other than OpenSBI, so that's
probably what's throwing it off. QEMU likely populated the struct in
memory with all zeroes, and OpenSBI identifies it as an illegal edge
case, so it just keeps spinning in OpenSBI forever -- this is my
educated guess.

How do we pass some software to run other than OpenSBI? The same way
we passed OpenSBI, just a diferent flag name! This time, we're using
the -kernel QEMU flag. And how are we going to build this software?
The same way we built the "fake BIOS" in our previous article, we'll
just map it to a different memory location. Let's give it a shot at
0x80200000.

Building an "infinite-loop fake kernel"

Our OS kernel will just spin in place. It will be a single jump
instruction at 0x80200000 that just stays there infinitely. Here's
the assembly source code:

        .global _start
        .section .text.kernel

_start: j _start

The linker script is the following:

MEMORY {
  kernel_space (rwx) : ORIGIN = 0x80200000, LENGTH = 128
}

SECTIONS {
  .text : {
    infinite_loop.o(.text.kernel)
  } > kernel_space
}

For details on how to use these files to build an ELF image that can
be loaded into QEMU, please see the original bare metal programming
article.

Once we build it, we end up with the infinte_loop ELF file that can
serve as our fake kernel. We now run QEMU

qemu-system-riscv64 -machine virt -s -S -bios build/platform/generic/firmware/fw_dynamic.bin -kernel ~/work/github_demo/risc-v-bare-metal-fake-kernel/infinite_loop

Again, I connect gdb and si my way to the end of ZSBL. Now when I
read the infamous struct at 0x1028, things look a lot better, which
confirms the theory that QEMU was populating that struct weirdly.

=> 0x0000000080000000:  33 04 05 00     add     s0,a0,zero
(gdb) x/10xg 0x1028
0x1028: 0x000000004942534f      0x0000000000000002
0x1038: 0x0000000080200000      0x0000000000000001
0x1048: 0x0000000000000000      0x0000000000000000
0x1058: 0x0000000000000000      0x0000000000000000
0x1068: 0x0000000000000000      0x0000000000000000

We now see that the new address is populated in this struct, as is
expected. This is also reflected in the OpenSBI output on UART. Let's
continue to our fake kernel with gdb and see if everything is OK
there.

(gdb) break *0x080200000
Breakpoint 1 at 0x80200000
(gdb) continue
Continuing.

Breakpoint 1, 0x0000000080200000 in ?? ()
=> 0x0000000080200000:  6f 00 00 00     j       0x80200000

Everything looks good here. Let's recap:

 1. ZSBL is the first thing that runs after the power-on. It
    initializes a few registers. The key register is a2, which points
    to a fw_dynamic_info struct containing the crucial info for the
    FW_DYNAMIC flavor of OpenSBI to operate. In QEMU case, this
    struct is somehow populated during the power-on, magically by the
    virutalization engine, but in reality, this is likely the job of
    the ZSBL. Either way, OpenSBI now knows what to do after it's
    done.
 2. OpenSBI provides an interrupt-based interface for the software up
    on the stack (presumably OS kernel bootloader and kernel itself)
    to invoke it. This interface is called SBI and it's conceptually
    the same as ABI for the application software on top of an
    operating system.
 3. We pass the kernel image to QEMU as yet another ELF which just
    populated another section of the memory. QEMU populates the
    struct in such way that OpenSBI can pass the control to there,
    and before it switches there, it enters the S mode of execution.

Intentionally skipped details

ZSBL also touched the a0 and a1 registers.

a0 has something to do with RISC-V harts, but let's not get into
those details, they are not relevant for the rest of this article.
Besides, this particular step in the boot process doesn't seem to be
particularly relevant, per comments from Github.

a1 is an interesting pointer because it points to the device tree
data structure in memory. For the rest of this article, this data
structure is not relevant, so we can disregard this piece of
information. However, the device tree is really useful for real
kernels like Linux. Linux is able to scan the device tree from memory
and understand the structure of the machine it's running on, rather
than having to run a lot of if/else branches in its programming for
every hardware combination. The Wikipedia article should give a
decent idea of how this is used in Linux. As mentioned, however, we
won't be concerned with the details of device tree in the rest of
this article.

Hello world fake kernel

Now we have all the knowledge we need to code a fake OS kernel that
just prints "Hello world" to the UART device. The functionality is
not at all different from the bare metal program we looked at in the
previous guide, but the way we'll get there is significantly
different. We'll be using an SBI call to print to UART, instead of
directly interacting with the UART device (we're using a more
privileged lower layer of software to do this work for us). This
could have serious consequences, even on a trivial example such as a
"hello world" one: we delegate the responsibility of interacting with
the UART hardware to the SBI layer, thus achieving portability across
different machines that conform to this SBI interface.

How do we call into RISC-V SBI layer? Conceptually, it's exactly the
same as invoking a print to standard output in x86 Linux -- we'll
populate some registers and invoke a software interrupt/trap to pass
the control down the software stack to OpenSBI. OpenSBI offers a lot
of services in the SBI layer, and many of them can be extremely
useful for developing a portable operating system kernel, such as
interaction with the timers (relevant for time slicing and enabling
multiple threads to share the same CPU core). For the full list of
functionality exposed through the SBI layer, please take a look here.

In this guide, we'll be focusing on the debug console functionality,
i.e. we'll be writing out to UART through SBI. Let's code!

First, we need to know how do we encode the functionality we want
OpenSBI to execute through registers. This is well documented here.
tl;dr is that SBI functionality is grouped into "extensions".
Register a7 contains the extension ID (EID), while a6 encodes the
individual function ID (FID) within that extension. The parameters
are then passed through a0, a1, a2, ...

For printing to the console, the EID we are looking for is 0x4442434E
(a rather interesting value) and the FID is simply 0x00.

This time, instead of printing one by one character as we did in the
initial bare metal programming guide, we'll invoke the printing as a
single operation. After all, we should be benefiting from the high
level functionality that the SBI layer offers. Therefore, our binary
should store the output string somewhere in the memory, and ideally
we want to do something like invoking the SBI to print from that
address. We'll do just that:

        .global _start
        .section .text.kernel

_start: li a7, 0x4442434E
        li a6, 0x00
1:      auipc a3, %pcrel_hi(debug_string)
        addi a3, a3, %pcrel_lo(1b)
        li a4, 0x00000000FFFFFFFF
        li a5, 0xFFFFFFFF00000000
        li a0, 12
        and a1, a3, a4
        and a2, a3, a5
        ecall

        li a7, 0x01
        mv a6, a0
        ecall

loop:   j loop

        .section .rodata
debug_string:
        .string "Hello world\n"

A couple of things to note here:

 1. We use PC-relative addressing here for the output string. As a
    reminder, the kernel is stored at an address represented by a
    very large unsigned integer. This value is too high to be encoded
    within any RISC-V 32-bit instruction word. That's not a problem,
    we simply use a short sequence of AUIPC and ADDI instructions to
    get there (check out this article for more information on this).
    If you do not understand what this point is all about, please
    make sure to revise different memory addressing modes and the
    differences between them: this is crucial for any sort of bare
    metal programming.
 2. There is some bit-masking happening as well for registers a1 and
    a2. SBI for some reason asks for the pointer to the string to be
    printed to be broken down into two 32-bit pieces.

So our SBI call is defined by several registers:

 1. a7 identifies the SBI extension
 2. a6 identifies the function within the extension (in this case,
    debug console extension)
 3. a0 contains the length of the string that needs to go to the
    debug console output
 4. a1 and a2, when joined together, contain the 64-bit pointer to
    the address of the stirng that needs to be printed

The SBI call is now invoked through an ecall instruction, which
activates a CPU trap. At this point, OpenSBI takes over and writes to
UART, in exactly the same way as we did in the initial bare metal
programming guide. If you are wondering how a simple ecall invocation
takes us to OpenSBI, that is because OpenSBI set up the trap handling
mechanism in such way that when our kernel gets into a trap, the
program counter will jump into the OpenSBI software section. The
details of this are way outside the scope of this article, but we may
cover this in some other article.

For now, just check out the QEMU serial port and confirm that "Hello
world" is printed properly:

qemu-system-riscv64 -machine virt -s -S -bios build/platform/generic/firmware/fw_dynamic.bin -kernel ~/work/github_demo/risc-v-bare-metal-fake-kernel/hello_world_kernel

OpenSBI v1.3-54-g901d3d7
   ____                    _____ ____ _____
  / __ \                  / ____|  _ \_   _|
 | |  | |_ __   ___ _ __ | (___ | |_) || |
 | |  | | '_ \ / _ \ '_ \ \___ \|  _ < | |
 | |__| | |_) |  __/ | | |____) | |_) || |_
  \____/| .__/ \___|_| |_|_____/|____/_____|
        | |
        |_|

Platform Name             : riscv-virtio,qemu
Platform Features         : medeleg
Platform HART Count       : 1
Platform IPI Device       : aclint-mswi
Platform Timer Device     : aclint-mtimer @ 10000000Hz
Platform Console Device   : uart8250
Platform HSM Device       : ---
Platform PMU Device       : ---
Platform Reboot Device    : syscon-reboot
Platform Shutdown Device  : syscon-poweroff
Platform Suspend Device   : ---
Platform CPPC Device      : ---
Firmware Base             : 0x80000000
Firmware Size             : 322 KB
Firmware RW Offset        : 0x40000
Firmware RW Size          : 66 KB
Firmware Heap Offset      : 0x48000
Firmware Heap Size        : 34 KB (total), 2 KB (reserved), 9 KB (used), 22 KB (free)
Firmware Scratch Size     : 4096 B (total), 768 B (used), 3328 B (free)
Runtime SBI Version       : 1.0

Domain0 Name              : root
Domain0 Boot HART         : 0
Domain0 HARTs             : 0*
Domain0 Region00          : 0x0000000002000000-0x000000000200ffff M: (I,R,W) S/U: ()
Domain0 Region01          : 0x0000000080040000-0x000000008005ffff M: (R,W) S/U: ()
Domain0 Region02          : 0x0000000080000000-0x000000008003ffff M: (R,X) S/U: ()
Domain0 Region03          : 0x0000000000000000-0xffffffffffffffff M: () S/U: (R,W,X)
Domain0 Next Address      : 0x0000000080200000
Domain0 Next Arg1         : 0x0000000087e00000
Domain0 Next Mode         : S-mode
Domain0 SysReset          : yes
Domain0 SysSuspend        : yes

Boot HART ID              : 0
Boot HART Domain          : root
Boot HART Priv Version    : v1.10
Boot HART Base ISA        : rv64imafdc
Boot HART ISA Extensions  : zicntr
Boot HART PMP Count       : 16
Boot HART PMP Granularity : 4
Boot HART PMP Address Bits: 54
Boot HART MHPM Info       : 0 (0x00000000)
Boot HART MIDELEG         : 0x0000000000000222
Boot HART MEDELEG         : 0x000000000000b109
Hello world

As an exercise, I suggest probing the base extension (0x10) with gdb
to investigate what the QEMU machine + OpenSBI you build are capable
of offering.

Conclusion

We ended up with an entirely portable fake kernel that prints "Hello
world" to UART! This may seem like nothing special, but the concept
here is very powerful. Without rebuilding, you can drop the same
kernel image on a different RISC-V 64-bit machine with OpenSBI that
supports the debug console extension.

In fact, I played a little trick here. :) One of the main reasons I
suggested building OpenSBI from source is that some QEMU versions
provided by the Linux distro package managers do not support the
debug console extension (they're simply old). This was the case with
my default OpenSBI which came with Debian's version of QEMU.

Finally, I would like to remind the reader that we have extensively
focused on the QEMU virt machine with a RISC-V core and all the fine
details of this article are related to it. That said, my hope is that
the reader has learned enough about the boot sequence concepts and
bare metal programming that adapting this knowledge to a particular
real-world scenario becomes easy.

In the next posts, we'll talk about taking this further and booting
up a full blown Linux kernel. We'll expand that step by step until we
reach a Linux deployment that can handle I/O with keyboard, mouse,
screen and Ethernet network.

I hope you enjoyed this lengthy writeup!

Code pointers

If you prefer not to copy/paste, the code is available on this GitHub
repo.

  *  risc-v
  *  sbi
  *  opensbi
  *  bare-metal

---------------------------------------------------------------------
    
Copyright (c) 2023  |  All rights reserved.