# Why does an extraneous build step make my Zig app 10x faster?

By Michael Lynch · March 19, 2024 · 11-minute read · Tagged: zig, ethereum, bash

For the past few months, I've been curious about two technologies: the Zig programming language and the Ethereum cryptocurrency.

To learn more about both, I've been using Zig to write a bytecode interpreter for the Ethereum Virtual Machine. Zig is a great language for performance optimization, as it gives you fine-grained control over memory and control flow. To motivate myself, I've been benchmarking my Ethereum implementation against the official Go implementation. At the beginning of this process, my hobby Zig implementation underperformed the official Go implementation by about 40%.

Recently, I made what I thought was a simple refactoring to my benchmarking script, but my app's performance tanked. I identified the relevant change as the difference between these two commands:

```bash
$ echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast
execution time: 58.808us

$ echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/eth-zvm
execution time: 438.059us
```

`zig build run` is just a shortcut command for compiling a binary and then executing it. It should be equivalent to the following two commands:

```bash
zig build
./zig-out/bin/eth-zvm
```

How could an additional build step cause my program to run almost 10x faster?

## Creating a minimal reproduction of the phenomenon

To debug the performance mystery, I tried simplifying my app until it was no longer a bytecode interpreter and was just a program that counted the number of bytes it read from stdin:

```zig
// src/main.zig
const std = @import("std");

pub fn countBytes(reader: anytype) !u32 {
    var count: u32 = 0;
    while (true) {
        _ = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => {
                return count;
            },
            else => {
                return err;
            },
        };
        count += 1;
    }
}

pub fn main() !void {
    var reader = std.io.getStdIn().reader();

    var timer = try std.time.Timer.start();
    const start = timer.lap();
    const count = try countBytes(&reader);
    const end = timer.read();
    const elapsed_micros = @as(f64, @floatFromInt(end - start)) / std.time.ns_per_us;

    const output = std.io.getStdOut().writer();
    try output.print("bytes: {}\n", .{count});
    try output.print("execution time: {d:.3}us\n", .{elapsed_micros});
}
```

With the simplified app, I could still see the performance difference. When I ran the byte counter with `zig build run`, it ran in 13 microseconds:

```bash
$ echo '00010203040506070809' | xxd -r -p | zig build run -Doptimize=ReleaseFast
bytes: 10
execution time: 13.549us
```

When I ran the compiled binary directly, it took 12x as long to run, completing in 162 microseconds:

```bash
$ echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes
bytes: 10
execution time: 162.195us
```

My test consisted of three commands in a bash pipeline:

1. `echo` prints a sequence of ten hex-encoded bytes (0x00, 0x01, …).
2. `xxd` converts `echo`'s hex-encoded bytes to binary-encoded bytes.
3. `zig build run` compiles and executes my byte counter program, counting the number of binary-encoded bytes that `xxd` emitted.

The only difference between `zig build run` and `./zig-out/bin/count-bytes` was that the second command runs the already-compiled app, whereas the first one recompiles the app before running it.

Again, I was dumbfounded. How could an extra compilation step make the program faster? Does a Zig app somehow run quicker when it's fresh out of the oven?
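As an aside, the counting logic itself is easy to verify in isolation by feeding `countBytes` an in-memory reader instead of stdin. Here's a minimal sketch (not from the original post) that assumes it sits in `src/main.zig` next to `countBytes`, where `std` is already imported:

```zig
// A sketch: exercise countBytes against an in-memory stream so the result
// doesn't depend on stdin, pipes, or timing at all.
test "countBytes counts every byte in the stream" {
    const bytes = [_]u8{ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09 };
    var stream = std.io.fixedBufferStream(&bytes);
    const count = try countBytes(stream.reader());
    try std.testing.expectEqual(@as(u32, 10), count);
}
```

Running `zig test src/main.zig` confirms that the byte count is correct no matter how the input arrives, which suggests the mystery lies in how the program is launched rather than in what it computes.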
## Asking the Zig community for help

At this point, I was stumped. I had read my source code over and over, and I couldn't understand how compiling and running an application could be faster than running the already-compiled binary.

Zig is still a new language, so there had to be something about Zig I'd misunderstood. Surely, if experienced Zig programmers looked at my code, they'd spot my error instantly.

I posted my question on Ziggit, a discussion forum for Zig. The first few responses said I had a problem with "input buffering," but they didn't have concrete suggestions for fixing it or investigating further.

Andrew Kelley, Zig's founder and lead developer, made a surprise appearance in the thread. He couldn't explain the phenomenon I was seeing, but he pointed out that I was making a different performance mistake:

> Looks like you're doing 1 syscall per byte read? That's going to perform extremely poorly. My guess is that the extra steps of using the build system incidentally introduced some buffering. Not sure why though. The build system is making the child process inherit the file descriptors directly.

Finally, my friend Andrew Ayer saw my post about this on Mastodon and solved the mystery:

> Do you still see the 10x disparity with significantly larger inputs (i.e. > 1MB)? Do you still see the disparity if you redirect stdin from a file instead of a pipe?
>
> My guess is that when you execute the program directly, xxd and count-bytes start at the same time, so the pipe buffer is empty when count-bytes first tries to read from stdin, requiring it to wait until xxd fills it. But when you use zig build run, xxd gets a head start while the program is compiling, so by the time count-bytes reads from stdin, the pipe buffer has been filled.

Andrew Ayer got it exactly right, and I'll break it down below.

Sidenote: Andrew Ayer also had the key insight that solved my last performance mystery.

## My mental model of bash pipelines is wrong

I had never thought too carefully about bash pipelines, but Andrew's comment made me realize my mental model was wrong.

Imagine a simple bash pipeline like the following:

```bash
./jobA | ./jobB
```

My mental model was that jobA would start and run to completion, and then jobB would start with jobA's output as its input.

*Figure: Gantt chart of jobB starting after jobA finishes (my incorrect mental model of how jobs in a bash pipeline work).*

It turns out that all commands in a bash pipeline start at the same time.

*Figure: Gantt chart of jobA and jobB starting simultaneously; jobB's bar is longer because it has to wait for jobA's results (the actual way that jobs in a bash pipeline work).*

To demonstrate parallel execution in a bash pipeline, I wrote a proof of concept with two simple bash scripts.

jobA starts, sleeps for three seconds, prints to stdout, sleeps for two more seconds, then exits:

```bash
#!/usr/bin/env bash

function print_status() {
  local message="$1"
  local timestamp=$(date +"%T.%3N")
  echo "$timestamp $message" >&2
}

print_status 'jobA is starting'

sleep 3

echo 'result of jobA is...'

sleep 2

echo '42'

print_status 'jobA is terminating'
```
jobB starts, waits for input on stdin, then prints everything it can read from stdin until stdin closes:

```bash
#!/usr/bin/env bash

function print_status() {
  local message="$1"
  local timestamp=$(date +"%T.%3N")
  echo "$timestamp $message" >&2
}

print_status 'jobB is starting'

print_status 'jobB is waiting on input'

while read line; do
  print_status "jobB read '${line}' from input"
done < /dev/stdin

print_status 'jobB is done reading input'

print_status 'jobB is terminating'
```

If I run jobA and jobB in a bash pipeline, exactly 5.009 seconds elapse between the `jobB is starting` and `jobB is terminating` messages:

```bash
$ ./jobA | ./jobB
09:11:53.326 jobA is starting
09:11:53.326 jobB is starting
09:11:53.328 jobB is waiting on input
09:11:56.330 jobB read 'result of jobA is...' from input
09:11:58.331 jobA is terminating
09:11:58.331 jobB read '42' from input
09:11:58.333 jobB is done reading input
09:11:58.335 jobB is terminating
```

If I adjust the execution so that jobA and jobB run in sequence instead of in a pipeline, only 0.008 seconds elapse between jobB's starting and terminating messages:

```bash
$ ./jobA > /tmp/output && ./jobB < /tmp/output
16:52:10.406 jobA is starting
16:52:15.410 jobA is terminating
16:52:15.415 jobB is starting
16:52:15.417 jobB is waiting on input
16:52:15.418 jobB read 'result of jobA is...' from input
16:52:15.420 jobB read '42' from input
16:52:15.421 jobB is done reading input
16:52:15.423 jobB is terminating
```

## Revisiting my byte counter

Once I understood that all commands in a bash pipeline run in parallel, the behavior I was seeing in my byte counter made sense:

```bash
$ echo '00010203040506070809' | xxd -r -p | zig build run -Doptimize=ReleaseFast
bytes: 10
execution time: 13.549us

$ echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes
bytes: 10
execution time: 162.195us
```

It looks like the `echo '00010203040506070809' | xxd -r -p` part of the pipeline takes about 150 microseconds, and the compile step of `zig build run` takes at least that long. By the time the count-bytes application actually begins in the `zig build run` version, it doesn't have to wait for the previous jobs to complete: the input is already waiting on stdin.

*Figure: Gantt chart where echo, xxd, and zig build run start at the same time, but the execute phase of zig build run starts after echo and xxd are complete. With zig build run, there's a delay before my application executes, so previous jobs in the pipeline have already completed by the time count-bytes starts.*

When I skip the zig build step and run the compiled binary directly, count-bytes starts immediately, and the timer begins. The problem is that count-bytes has to sit around waiting ~150 microseconds for the `echo` and `xxd` commands to deliver input to stdin.

*Figure: Gantt chart where echo, xxd, and count-bytes all start at the same time, but count-bytes can't begin processing input until 150 microseconds after starting, as it's waiting on results from xxd. When I run count-bytes directly, it has to wait ~150 microseconds until echo and xxd feed input to stdin.*
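One way to see this waiting directly is to time the first read separately from the rest. The sketch below is my own experiment, not code from the original project: it splits the measurement into "time until the first byte arrives" and "time to read everything else," so the pipe-fill wait shows up as its own number:

```zig
// count-bytes-split.zig (hypothetical file name): measure the wait for the
// first byte separately from the time spent reading the remaining bytes.
const std = @import("std");

pub fn main() !void {
    var reader = std.io.getStdIn().reader();
    var timer = try std.time.Timer.start();

    // The first read blocks until an earlier pipeline stage (echo, xxd)
    // has actually written something into the pipe.
    _ = reader.readByte() catch |err| switch (err) {
        error.EndOfStream => return, // Empty input: nothing to measure.
        else => return err,
    };
    const wait_ns = timer.lap();

    // The remaining reads measure only this program's own work.
    var count: u32 = 1;
    while (true) {
        _ = reader.readByte() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        count += 1;
    }
    const read_ns = timer.read();

    const output = std.io.getStdOut().writer();
    try output.print("bytes: {}\n", .{count});
    try output.print("waited for first byte: {d:.3}us\n", .{@as(f64, @floatFromInt(wait_ns)) / std.time.ns_per_us});
    try output.print("read remaining bytes:  {d:.3}us\n", .{@as(f64, @floatFromInt(read_ns)) / std.time.ns_per_us});
}
```

Run directly at the end of the pipeline, the first number should absorb most of the ~150 microseconds; run via `zig build run`, it should be close to zero, because the pipe is already full by the time the program starts.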
## Fixing my benchmark

Fixing my benchmark was simple. Instead of running my application as part of a bash pipeline, I split the preparation stage and the execution stage into separate commands:

```bash
# Convert the hex-encoded input to binary encoding.
$ INPUT_FILE_BINARY="$(mktemp)"
$ echo '60016000526001601ff3' | xxd -r -p > "${INPUT_FILE_BINARY}"

# Read the binary-encoded input into the virtual machine.
$ ./zig-out/bin/eth-zvm < "${INPUT_FILE_BINARY}"
execution time: 67.378us
```

My benchmark dropped from the 438 microseconds I was seeing before down to just 67 microseconds.

*Figure: Difference in measured performance of my Zig app after I fixed my benchmarking script.*

## Applying Andrew Kelley's performance fix

Recall that Andrew Kelley pointed out that I was doing one syscall for every byte I read:

```zig
var reader = std.io.getStdIn().reader();
...
while (true) {
    _ = reader.readByte() catch |err| switch (err) { // Slow! One syscall per byte.
        ...
    };
    ...
}
```

Every time my application called readByte in the loop, it had to halt execution, request an input read from the OS, and then resume when the OS delivered the single byte.

The fix was simple: use a buffered reader. Instead of reading a single byte at a time from the OS, I'd use Zig's built-in `std.io.bufferedReader`, which causes my application to read large chunks of data from the OS at a time. That way, I only have to make a fraction of the syscalls.

Here's the entire change:

```diff
diff --git a/src/main.zig b/src/main.zig
index d6e50b2..a46f8fa 100644
--- a/src/main.zig
+++ b/src/main.zig
@@ -7,7 +7,9 @@ pub fn main() !void {
     const allocator = gpa.allocator();
     defer _ = gpa.deinit();
 
-    var reader = std.io.getStdIn().reader();
+    const in = std.io.getStdIn();
+    var buf = std.io.bufferedReader(in.reader());
+    var reader = buf.reader();
 
     var evm = vm.VM{};
     evm.init(allocator);
```

I re-ran my example, and the change shaved off another 11 microseconds, a modest 16% speedup:

```bash
$ zig build -Doptimize=ReleaseFast && ./zig-out/bin/eth-zvm < "${INPUT_FILE_BINARY}"
execution time: 56.602us
```

*Figure: Buffering input reads increased performance by another 16%.*

## Benchmarking a larger input

My Ethereum interpreter currently supports only a small subset of Ethereum's opcodes. The most complex computation my interpreter can do at this point is add numbers together.

For example, here's an Ethereum application that counts to three by pushing 1 onto the stack three times and then adding the values together:

```
PUSH1 1  # Stack now contains [1]
PUSH1 1  # Stack now contains [1, 1]
PUSH1 1  # Stack now contains [1, 1, 1]
ADD      # Stack now contains [2, 1]
ADD      # Stack now contains [3]
```

The largest application I tested in my benchmarks was Ethereum bytecode that counted to 1,000 by adding 1 values together.

After Andrew Kelley's tip helped me reduce syscalls, my "count to 1,000" application's runtime dropped from 2,024 microseconds to just 58 microseconds, a 35x speedup. I was now beating the official Ethereum implementation by almost a factor of two.

*Figure: Buffering my input reads allowed my Zig implementation to run about 2x faster than the official Ethereum implementation on the largest Ethereum application in my test set.*

## Cheating my way to maximum performance

I was excited to see my Zig implementation finally outperforming the official Go version, but I wanted to see just how far I could push Zig to improve performance.

One common bottleneck in software is memory allocation. The program has to stop and wait for the OS to allocate RAM, which may involve shuffling around data to find enough contiguous space.

Zig has a memory allocator called the fixed buffer allocator. Instead of the allocator requesting memory from the OS, you give it a fixed buffer of bytes, and it uses only those bytes to allocate memory.
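To make that concrete, here's a minimal, standalone sketch (not code from this project) of Zig's `std.heap.FixedBufferAllocator` in use: every allocation is carved out of a stack array, nothing is requested from the OS, and an allocation that doesn't fit fails with `error.OutOfMemory`.

```zig
const std = @import("std");

pub fn main() !void {
    // All allocations come out of this 2,000-byte stack buffer.
    var buffer: [2000]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buffer);
    const allocator = fba.allocator();

    // Any allocator-using API works as usual, as long as it stays within
    // the buffer; exceeding it fails with error.OutOfMemory.
    var list = std.ArrayList(u8).init(allocator);
    defer list.deinit();
    try list.appendSlice("allocated without a single OS memory request");

    std.debug.print("{s}\n", .{list.items});
}
```

The trade-off is that the buffer's size becomes a hard ceiling you choose at compile time, which is exactly what makes the next step a "cheat."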
I can cheat my benchmarks by compiling a version of my Ethereum interpreter that's limited to 2 KB of memory allocated from the stack:

```diff
diff --git a/src/main.zig b/src/main.zig
index a46f8fa..9e462fe 100644
--- a/src/main.zig
+++ b/src/main.zig
@@ -3,9 +3,9 @@ const stack = @import("stack.zig");
 const vm = @import("vm.zig");
 
 pub fn main() !void {
-    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
-    const allocator = gpa.allocator();
-    defer _ = gpa.deinit();
+    var buffer: [2000]u8 = undefined;
+    var fba = std.heap.FixedBufferAllocator.init(&buffer);
+    const allocator = fba.allocator();
 
     const in = std.io.getStdIn();
     var buf = std.io.bufferedReader(in.reader());
```

I call this a "cheat" because I'm optimizing for my specific benchmarks. There are certainly valid Ethereum programs that require more than 2 KB of memory, but I'm curious how fast I can go with this optimization.

Let's see what performance looks like if I know my maximum memory requirement at compile time:

```bash
$ ./zig-out/bin/eth-zvm < "${COUNT_TO_1000_INPUT_BYTECODE_FILE}"
execution time: 34.4578us
```

Cool! With a fixed memory buffer, my Ethereum implementation runs my "count to 1,000" bytecode in 34 microseconds, nearly 3x faster than the official Go implementation.

*Figure: If I know the maximum memory requirements of my Ethereum interpreter at compile time, I can outperform the official implementation by 3x.*

## Conclusion

My takeaway from this experience is to benchmark performance early and often. Because I added a benchmarking script to my continuous integration and archived the results, it was easy for me to identify when my measurements changed. Had I relegated benchmarking to a manual, periodic task, it would have been difficult to pinpoint exactly what caused the difference in my measurements.

This experience also underscores the importance of understanding your metrics. Before hitting this bug, I hadn't considered that my benchmark included the time my application spent waiting for other processes to fill stdin.