https://danielmangum.com/posts/risc-v-bytes-passing-on-the-stack/ Daniel Mangum * Blog * About RISC-V Bytes: Passing on the Stack June 30, 2021 This is part of a series on the blog where we explore RISC-V by breaking down real programs and explaining how they work. You can view all posts in this series on the RISC-V Bytes page. I once took a class on compilers where my professor told us that a CPU is like a human brain: it can store important data and access it quickly, but there is a limit to the amount of data that can be stored. When that limit is reached, it must store data elsewhere. For instance, when doing math, most humans find it useful to write different steps of the operations down on a piece of paper because the larger the computation, the harder it is to keep track of all of its components. Likewise, a CPU can store the most critical data in easy to access locations, but must eventually move information farther down the memory hierarchy when the computation becomes sufficiently complex. Revisiting Our Last Post In our most recent post, we primarily looked at the easiest to access memory locations: registers. We specifically looked at how registers are used to communicate between procedures via calling conventions. However, we also saw that callee-saved registers, such as the stack pointer (sp) that needed to be re-used within a procedure had to have their contents stored on the stack, then loaded back into the appropriate register before returning. Storing these registers on the stack is an example of moving the data down the memory hierarchy. Let's look back at the source for that program: sum.c #include int main() { int num1 = 1; int num2 = 2; int sum = num1 + num2; printf("The sum is: %d", sum); return 0; } This program is needlessly complex: the result of our addition will always be constant. However, because we compiled without any optimization, these wasteful operations were preserved in the generated assembly: (gdb) disass main Dump of assembler code for function main: 0x0000000000010158 <+0>: addi sp,sp,-32 0x000000000001015a <+2>: sd ra,24(sp) 0x000000000001015c <+4>: sd s0,16(sp) 0x000000000001015e <+6>: addi s0,sp,32 0x0000000000010160 <+8>: li a5,1 0x0000000000010162 <+10>: sw a5,-20(s0) 0x0000000000010166 <+14>: li a5,2 0x0000000000010168 <+16>: sw a5,-24(s0) 0x000000000001016c <+20>: lw a4,-20(s0) 0x0000000000010170 <+24>: lw a5,-24(s0) 0x0000000000010174 <+28>: addw a5,a5,a4 0x0000000000010176 <+30>: sw a5,-28(s0) 0x000000000001017a <+34>: lw a5,-28(s0) 0x000000000001017e <+38>: mv a1,a5 0x0000000000010180 <+40>: lui a5,0x1c 0x0000000000010182 <+42>: addi a0,a5,176 # 0x1c0b0 0x0000000000010186 <+46>: jal ra,0x10332 0x000000000001018a <+50>: li a5,0 0x000000000001018c <+52>: mv a0,a5 0x000000000001018e <+54>: ld ra,24(sp) 0x0000000000010190 <+56>: ld s0,16(sp) 0x0000000000010192 <+58>: addi sp,sp,32 0x0000000000010194 <+60>: ret End of assembler dump. View on Compiler Explorer See the first post in this series for how to set up cross-platform compilation and debugging for RISC-V. In fact, the generated assembly is even more wasteful. Ignoring the function prologue and epilogue, the procedure body not only performs all of our computations (<+28>), but also does not make use of all available registers, forcing us to store all data on the stack. A particularly egregious example is when we initialize num1 (<+8>) and num2 (<+14>), using a5 in both cases, forcing each value to be stored on the stack (<+10>, <+16>). If we employed full optimization by passing -O3 to our compiler, we would get a much more sensible output where we skip addition altogether, instead loading 3 as an immediate value, which will always be the result of the operation (<+4>). riscv64-unriscv64-unknown-elf-gcc -O3 sum.c (gdb) disass main Dump of assembler code for function main: 0x00000000000100b0 <+0>: lui a0,0x1c 0x00000000000100b2 <+2>: addi sp,sp,-16 0x00000000000100b4 <+4>: li a1,3 0x00000000000100b6 <+6>: addi a0,a0,144 # 0x1c090 0x00000000000100ba <+10>: sd ra,8(sp) 0x00000000000100bc <+12>: jal ra,0x1030c 0x00000000000100c0 <+16>: ld ra,8(sp) 0x00000000000100c2 <+18>: li a0,0 0x00000000000100c4 <+20>: addi sp,sp,16 0x00000000000100c6 <+22>: ret End of assembler dump. View on Compiler Explorer What we are illustrating here is efficient use of registers, avoiding moving down the memory hierarchy unless we absolutely have to, such as when storing the return address of our caller (<+10>). Sharing "Large" Data Today we want to look at what happens when we are passing data between procedures and we have too much data to store in our argument registers. Let's take another look at our general purpose registers in RISC-V: Name ABI Mnemonic Calling Convention Preserved across calls? x0 zero Zero n/a x1 ra Return address No x2 sp Stack pointer Yes x3 gp Global pointer n/a x4 tp Thread pointer n/a x5-x7 t0-t2 Temporary registers No x8-x9 s0-s1 Saved registers Yes x10-x17 a0-a7 Argument registers No x18-x27 s2-s11 Saved registers Yes x28-x31 t3-t6 Temporary registers No The "argument registers" are where we store data that we want to share with a procedure we are calling. When passing minimal data between procedures, this isn't a problem: minimal.c #include int sum(int one, int two) { return one + two; } int main() { printf("The sum is: %d\n", sum(1, 2)); return 0; } riscv64-unknown-elf-gcc -O3 -fno-inline minimal.c (gdb) disass main Dump of assembler code for function main: 0x00000000000100b0 <+0>: addi sp,sp,-16 0x00000000000100b2 <+2>: li a1,2 0x00000000000100b4 <+4>: li a0,1 0x00000000000100b6 <+6>: sd ra,8(sp) 0x00000000000100b8 <+8>: jal ra,0x10178 0x00000000000100bc <+12>: mv a1,a0 0x00000000000100be <+14>: lui a0,0x1c 0x00000000000100c0 <+16>: addi a0,a0,160 # 0x1c0a0 0x00000000000100c4 <+20>: jal ra,0x10318 0x00000000000100c8 <+24>: ld ra,8(sp) 0x00000000000100ca <+26>: li a0,0 0x00000000000100cc <+28>: addi sp,sp,16 0x00000000000100ce <+30>: ret End of assembler dump. (gdb) disass sum Dump of assembler code for function sum: 0x0000000000010178 <+0>: addw a0,a0,a1 0x000000000001017a <+2>: ret End of assembler dump. View on Compiler Explorer We pass -fno-inline during compilation because we want to preserve the call to sum and the passing of data between the procedures. Without it, at any optimization level >= 1, GCC will inline the sum function. We load our arguments into our argument registers (main:<+2>,main: <+4>), then perform our addition in sum using those registers. We even re-use a0 to pass our return value back to main (sum:<+0>), which we are permitted to do because argument registers are not callee-saved (RISC-V calling conventions also specify that that a0 and a1 are to be used for return values). So what happens when we can't fit all of our arguments into the argument registers? Similar to how we preserved register contents within a procedure by storing them on the stack, we can also pass data between procedures on the stack. Let's expand our minimal example with more data: passonstack.c #include int sum(int one, int two, int three, int four, int five, int six, int seven, int eight, int nine) { return one + two + three + four + five + six + seven + eight + nine; } int main() { printf("The sum is: %d\n", sum(1, 2, 3, 4, 5, 6, 7, 8, 9)); return 0; } riscv64-unknown-elf-gcc -O3 -fno-inline passonstack.c (gdb) disass main Dump of assembler code for function main: 0x00000000000100b0 <+0>: addi sp,sp,-32 0x00000000000100b2 <+2>: li a1,9 0x00000000000100b4 <+4>: sd a1,0(sp) 0x00000000000100b6 <+6>: li a7,8 0x00000000000100b8 <+8>: li a6,7 0x00000000000100ba <+10>: li a5,6 0x00000000000100bc <+12>: li a4,5 0x00000000000100be <+14>: li a3,4 0x00000000000100c0 <+16>: li a2,3 0x00000000000100c2 <+18>: li a1,2 0x00000000000100c4 <+20>: li a0,1 0x00000000000100c6 <+22>: sd ra,24(sp) 0x00000000000100c8 <+24>: jal ra,0x10188 0x00000000000100cc <+28>: mv a1,a0 0x00000000000100ce <+30>: lui a0,0x1c 0x00000000000100d0 <+32>: addi a0,a0,192 # 0x1c0c0 0x00000000000100d4 <+36>: jal ra,0x1033c 0x00000000000100d8 <+40>: ld ra,24(sp) 0x00000000000100da <+42>: li a0,0 0x00000000000100dc <+44>: addi sp,sp,32 0x00000000000100de <+46>: ret End of assembler dump. (gdb) disass sum Dump of assembler code for function sum: 0x0000000000010188 <+0>: addw a1,a1,a0 0x000000000001018a <+2>: addw a1,a1,a2 0x000000000001018c <+4>: addw a1,a1,a3 0x000000000001018e <+6>: addw a1,a1,a4 0x0000000000010190 <+8>: addw a1,a1,a5 0x0000000000010192 <+10>: lw a0,0(sp) 0x0000000000010194 <+12>: addw a1,a1,a6 0x0000000000010198 <+16>: addw a1,a1,a7 0x000000000001019c <+20>: addw a0,a0,a1 0x000000000001019e <+22>: ret End of assembler dump. View on Compiler Explorer The concept of storing data on the stack when we run out of registers is commonly referred to as "register spilling". Compilers typically want to reduce spilling registers as much as possible. We once again are utilizing our argument registers to pass our arguments to sum, but because we are passing nine integers and only have eight argument registers, we must store one of our arguments on the stack. How do we know where to place our "spilled" argument on the stack? The RISC-V calling conventions specify: The stack grows downwards (towards lower addresses) and the stack pointer shall be aligned to a 128-bit boundary upon procedure entry. The first argument passed on the stack is located at offset zero of the stack pointer on function entry; following arguments are stored at correspondingly higher addresses. We could test this out by passing a tenth argument and seeing that it is stored at an offset of 8 bytes from the stack pointer: (gdb) disass sum Dump of assembler code for function sum: 0x000000000001018c <+0>: addw a1,a1,a0 0x000000000001018e <+2>: addw a1,a1,a2 0x0000000000010190 <+4>: addw a1,a1,a3 0x0000000000010192 <+6>: addw a1,a1,a4 0x0000000000010194 <+8>: addw a1,a1,a5 0x0000000000010196 <+10>: addw a1,a1,a6 0x000000000001019a <+14>: addw a1,a1,a7 0x000000000001019e <+18>: lw a7,0(sp) 0x00000000000101a0 <+20>: lw a0,8(sp) 0x00000000000101a2 <+22>: addw a1,a1,a7 0x00000000000101a6 <+26>: addw a0,a0,a1 0x00000000000101a8 <+28>: ret End of assembler dump. View on Compiler Explorer These are clearly contrived examples (exemplified by the fact that we have to force the compiler not to eliminate our call to the sum function entirely), but serve to get us thinking about how the data we share between procedures affects our memory access patterns. Concluding Thoughts Understanding the memory hierarchy of a computer and what operations cause it to move to a lower (and slower) level in the hierarchy allow us to be more effective programmers. While disassembling and examining every function in a program is not a feasible option, building up an intuition for how a certain operation may impact the performance of an application can lead to better designed systems. As always, these posts are meant to serve as a useful resource for folks who are interested in learning more about RISC-V and low-level software in general. If I can do a better job of reaching that goal, or you have any questions or comments, please feel free to send me a message @hasheddan on Twitter! (c) 2021