https://danielmangum.com/posts/risc-v-bytes-passing-on-the-stack/

Daniel Mangum

  * Blog
  * About

RISC-V Bytes: Passing on the Stack

June 30, 2021

    This is part of a series on the blog where we explore RISC-V by
    breaking down real programs and explaining how they work. You can
    view all posts in this series on the RISC-V Bytes page.

I once took a class on compilers where my professor told us that a
CPU is like a human brain: it can store important data and access it
quickly, but there is a limit to the amount of data that can be
stored. When that limit is reached, it must store data elsewhere. For
instance, when doing math, most humans find it useful to write
different steps of the operations down on a piece of paper because
the larger the computation, the harder it is to keep track of all of
its components. Likewise, a CPU can store the most critical data in
easy to access locations, but must eventually move information
farther down the memory hierarchy when the computation becomes
sufficiently complex.

Revisiting Our Last Post

In our most recent post, we primarily looked at the easiest to access
memory locations: registers. We specifically looked at how registers
are used to communicate between procedures via calling conventions.
However, we also saw that callee-saved registers, such as the stack
pointer (sp) that needed to be re-used within a procedure had to have
their contents stored on the stack, then loaded back into the
appropriate register before returning. Storing these registers on the
stack is an example of moving the data down the memory hierarchy.

Let's look back at the source for that program:

sum.c

#include <stdio.h>

int main()
{
    int num1 = 1;
    int num2 = 2;
    int sum = num1 + num2;

    printf("The sum is: %d", sum);
    return 0;
}


This program is needlessly complex: the result of our addition will
always be constant. However, because we compiled without any
optimization, these wasteful operations were preserved in the
generated assembly:

(gdb) disass main
Dump of assembler code for function main:
   0x0000000000010158 <+0>:     addi       sp,sp,-32
   0x000000000001015a <+2>:     sd         ra,24(sp)
   0x000000000001015c <+4>:     sd         s0,16(sp)
   0x000000000001015e <+6>:     addi       s0,sp,32
   0x0000000000010160 <+8>:     li         a5,1
   0x0000000000010162 <+10>:    sw         a5,-20(s0)
   0x0000000000010166 <+14>:    li         a5,2
   0x0000000000010168 <+16>:    sw         a5,-24(s0)
   0x000000000001016c <+20>:    lw         a4,-20(s0)
   0x0000000000010170 <+24>:    lw         a5,-24(s0)
   0x0000000000010174 <+28>:    addw       a5,a5,a4
   0x0000000000010176 <+30>:    sw         a5,-28(s0)
   0x000000000001017a <+34>:    lw         a5,-28(s0)
   0x000000000001017e <+38>:    mv         a1,a5
   0x0000000000010180 <+40>:    lui        a5,0x1c
   0x0000000000010182 <+42>:    addi       a0,a5,176 # 0x1c0b0
   0x0000000000010186 <+46>:    jal        ra,0x10332 <printf>
   0x000000000001018a <+50>:    li         a5,0
   0x000000000001018c <+52>:    mv         a0,a5
   0x000000000001018e <+54>:    ld         ra,24(sp)
   0x0000000000010190 <+56>:    ld         s0,16(sp)
   0x0000000000010192 <+58>:    addi       sp,sp,32
   0x0000000000010194 <+60>:    ret
End of assembler dump.


View on Compiler Explorer

    See the first post in this series for how to set up
    cross-platform compilation and debugging for RISC-V.

In fact, the generated assembly is even more wasteful. Ignoring the
function prologue and epilogue, the procedure body not only performs
all of our computations (<+28>), but also does not make use of all
available registers, forcing us to store all data on the stack. A
particularly egregious example is when we initialize num1 (<+8>) and
num2 (<+14>), using a5 in both cases, forcing each value to be stored
on the stack (<+10>, <+16>).

If we employed full optimization by passing -O3 to our compiler, we
would get a much more sensible output where we skip addition
altogether, instead loading 3 as an immediate value, which will
always be the result of the operation (<+4>).

riscv64-unriscv64-unknown-elf-gcc -O3 sum.c

(gdb) disass main
Dump of assembler code for function main:
   0x00000000000100b0 <+0>:     lui     a0,0x1c
   0x00000000000100b2 <+2>:     addi    sp,sp,-16
   0x00000000000100b4 <+4>:     li      a1,3
   0x00000000000100b6 <+6>:     addi    a0,a0,144 # 0x1c090
   0x00000000000100ba <+10>:    sd      ra,8(sp)
   0x00000000000100bc <+12>:    jal     ra,0x1030c <printf>
   0x00000000000100c0 <+16>:    ld      ra,8(sp)
   0x00000000000100c2 <+18>:    li      a0,0
   0x00000000000100c4 <+20>:    addi    sp,sp,16
   0x00000000000100c6 <+22>:    ret
End of assembler dump.


View on Compiler Explorer

What we are illustrating here is efficient use of registers, avoiding
moving down the memory hierarchy unless we absolutely have to, such
as when storing the return address of our caller (<+10>).

Sharing "Large" Data

Today we want to look at what happens when we are passing data
between procedures and we have too much data to store in our argument
registers. Let's take another look at our general purpose registers
in RISC-V:

 Name   ABI Mnemonic Calling Convention  Preserved across calls?
x0      zero         Zero                n/a
x1      ra           Return address      No
x2      sp           Stack pointer       Yes
x3      gp           Global pointer      n/a
x4      tp           Thread pointer      n/a
x5-x7   t0-t2        Temporary registers No
x8-x9   s0-s1        Saved registers     Yes
x10-x17 a0-a7        Argument registers  No
x18-x27 s2-s11       Saved registers     Yes
x28-x31 t3-t6        Temporary registers No

The "argument registers" are where we store data that we want to
share with a procedure we are calling. When passing minimal data
between procedures, this isn't a problem:

minimal.c

#include <stdio.h>

int sum(int one, int two) {
    return one + two;
}

int main() {
    printf("The sum is: %d\n", sum(1, 2));
    return 0;
}


riscv64-unknown-elf-gcc -O3 -fno-inline minimal.c

(gdb) disass main
Dump of assembler code for function main:
   0x00000000000100b0 <+0>:     addi    sp,sp,-16
   0x00000000000100b2 <+2>:     li      a1,2
   0x00000000000100b4 <+4>:     li      a0,1
   0x00000000000100b6 <+6>:     sd      ra,8(sp)
   0x00000000000100b8 <+8>:     jal     ra,0x10178 <sum>
   0x00000000000100bc <+12>:    mv      a1,a0
   0x00000000000100be <+14>:    lui     a0,0x1c
   0x00000000000100c0 <+16>:    addi    a0,a0,160 # 0x1c0a0
   0x00000000000100c4 <+20>:    jal     ra,0x10318 <printf>
   0x00000000000100c8 <+24>:    ld      ra,8(sp)
   0x00000000000100ca <+26>:    li      a0,0
   0x00000000000100cc <+28>:    addi    sp,sp,16
   0x00000000000100ce <+30>:    ret
End of assembler dump.
(gdb) disass sum
Dump of assembler code for function sum:
   0x0000000000010178 <+0>:     addw    a0,a0,a1
   0x000000000001017a <+2>:     ret
End of assembler dump.


View on Compiler Explorer

    We pass -fno-inline during compilation because we want to
    preserve the call to sum and the passing of data between the
    procedures. Without it, at any optimization level >= 1, GCC will
    inline the sum function.

We load our arguments into our argument registers (main:<+2>,main:
<+4>), then perform our addition in sum using those registers. We
even re-use a0 to pass our return value back to main (sum:<+0>),
which we are permitted to do because argument registers are not
callee-saved (RISC-V calling conventions also specify that that a0
and a1 are to be used for return values).

So what happens when we can't fit all of our arguments into the
argument registers? Similar to how we preserved register contents
within a procedure by storing them on the stack, we can also pass
data between procedures on the stack. Let's expand our minimal
example with more data:

passonstack.c

#include <stdio.h>

int sum(int one, int two, int three, int four, int five, int six, int seven, int eight, int nine) {
    return one + two + three + four + five + six + seven + eight + nine;
}

int main() {
    printf("The sum is: %d\n", sum(1, 2, 3, 4, 5, 6, 7, 8, 9));
    return 0;
}


riscv64-unknown-elf-gcc -O3 -fno-inline passonstack.c

(gdb) disass main
Dump of assembler code for function main:
   0x00000000000100b0 <+0>:     addi    sp,sp,-32
   0x00000000000100b2 <+2>:     li      a1,9
   0x00000000000100b4 <+4>:     sd      a1,0(sp)
   0x00000000000100b6 <+6>:     li      a7,8
   0x00000000000100b8 <+8>:     li      a6,7
   0x00000000000100ba <+10>:    li      a5,6
   0x00000000000100bc <+12>:    li      a4,5
   0x00000000000100be <+14>:    li      a3,4
   0x00000000000100c0 <+16>:    li      a2,3
   0x00000000000100c2 <+18>:    li      a1,2
   0x00000000000100c4 <+20>:    li      a0,1
   0x00000000000100c6 <+22>:    sd      ra,24(sp)
   0x00000000000100c8 <+24>:    jal     ra,0x10188 <sum>
   0x00000000000100cc <+28>:    mv      a1,a0
   0x00000000000100ce <+30>:    lui     a0,0x1c
   0x00000000000100d0 <+32>:    addi    a0,a0,192 # 0x1c0c0
   0x00000000000100d4 <+36>:    jal     ra,0x1033c <printf>
   0x00000000000100d8 <+40>:    ld      ra,24(sp)
   0x00000000000100da <+42>:    li      a0,0
   0x00000000000100dc <+44>:    addi    sp,sp,32
   0x00000000000100de <+46>:    ret
End of assembler dump.
(gdb) disass sum
Dump of assembler code for function sum:
   0x0000000000010188 <+0>:     addw    a1,a1,a0
   0x000000000001018a <+2>:     addw    a1,a1,a2
   0x000000000001018c <+4>:     addw    a1,a1,a3
   0x000000000001018e <+6>:     addw    a1,a1,a4
   0x0000000000010190 <+8>:     addw    a1,a1,a5
   0x0000000000010192 <+10>:    lw      a0,0(sp)
   0x0000000000010194 <+12>:    addw    a1,a1,a6
   0x0000000000010198 <+16>:    addw    a1,a1,a7
   0x000000000001019c <+20>:    addw    a0,a0,a1
   0x000000000001019e <+22>:    ret
End of assembler dump.


View on Compiler Explorer

    The concept of storing data on the stack when we run out of
    registers is commonly referred to as "register spilling".
    Compilers typically want to reduce spilling registers as much as
    possible.

We once again are utilizing our argument registers to pass our
arguments to sum, but because we are passing nine integers and only
have eight argument registers, we must store one of our arguments on
the stack. How do we know where to place our "spilled" argument on
the stack? The RISC-V calling conventions specify:

    The stack grows downwards (towards lower addresses) and the stack
    pointer shall be aligned to a 128-bit boundary upon procedure
    entry. The first argument passed on the stack is located at
    offset zero of the stack pointer on function entry; following
    arguments are stored at correspondingly higher addresses.

We could test this out by passing a tenth argument and seeing that it
is stored at an offset of 8 bytes from the stack pointer:

(gdb) disass sum
Dump of assembler code for function sum:
   0x000000000001018c <+0>:     addw    a1,a1,a0
   0x000000000001018e <+2>:     addw    a1,a1,a2
   0x0000000000010190 <+4>:     addw    a1,a1,a3
   0x0000000000010192 <+6>:     addw    a1,a1,a4
   0x0000000000010194 <+8>:     addw    a1,a1,a5
   0x0000000000010196 <+10>:    addw    a1,a1,a6
   0x000000000001019a <+14>:    addw    a1,a1,a7
   0x000000000001019e <+18>:    lw      a7,0(sp)
   0x00000000000101a0 <+20>:    lw      a0,8(sp)
   0x00000000000101a2 <+22>:    addw    a1,a1,a7
   0x00000000000101a6 <+26>:    addw    a0,a0,a1
   0x00000000000101a8 <+28>:    ret
End of assembler dump.


View on Compiler Explorer

These are clearly contrived examples (exemplified by the fact that we
have to force the compiler not to eliminate our call to the sum
function entirely), but serve to get us thinking about how the data
we share between procedures affects our memory access patterns.

Concluding Thoughts

Understanding the memory hierarchy of a computer and what operations
cause it to move to a lower (and slower) level in the hierarchy allow
us to be more effective programmers. While disassembling and
examining every function in a program is not a feasible option,
building up an intuition for how a certain operation may impact the
performance of an application can lead to better designed systems.

As always, these posts are meant to serve as a useful resource for
folks who are interested in learning more about RISC-V and low-level
software in general. If I can do a better job of reaching that goal,
or you have any questions or comments, please feel free to send me a
message @hasheddan on Twitter!


(c) 2021