https://qsantos.fr/2024/08/25/linux-pipes-are-slow/

Linux Pipes are Slow

August 25, 2024 | Pipes, Programming | Quentin Santos

vmsplice is too fast

Some programs use a particular system call, "vmsplice", to move data faster through a pipe. Francesco already did a deep dive on using vmsplice to make things fast. However, while experimenting with it, I noticed that, when not using vmsplice, Linux pipes are slower than I would have expected. Since you cannot always use it, I wanted to understand exactly why that was, and whether it could be improved.

The reason I want to move data through pipes is that I am writing a program to encode/decode Morse code blazingly fast.

To get a point of reference, the obvious candidate is the Fizz Buzz throughput competition at the Code Golf StackExchange. There are two kinds of solutions:

1. the ones that manage to reach up to a few gigabytes per second, with neil's reaching 8.4 GiB/s;
2. the ones which largely surpass that, from tkluck's at 15.5 GiB/s to ais523's at 60.8 GiB/s, to david's at 208.3 GiB/s using multiple cores.

The difference between the first and the second group is that the second is using vmsplice, while the first is not^1.

But how can using vmsplice enable such a large gain in performance? My intuition about vmsplice is that it allows you to avoid copying data to and from kernel space. Surely, copying data cannot be slower than generating it? Even assuming it is not faster, and that you have to copy the data twice to get it through the pipe, you would expect a throughput gain of 3x, at best. But here, we see a factor of 7 (60.8 GiB/s against 8.4 GiB/s), even just looking at single-core solutions. Something is missing in my mental model, and I want to know what.

First, I'll need to perform my own measurements to easily compare with what I'll do afterward.
Compiling and running ais523's solution on my computer^2, I get:

    $ ./fizzbuzz | pv >/dev/null
    96.4GiB 0:00:01 [96.4GiB/s]

With david's solution, I reach 277 GB/s when using 7 cores (40 GB/s per core).

Now, to understand what's going on, we need to find the answer to these questions:

1. How fast can we write data ideally?
2. How fast can we actually write data to a pipe?
3. How does vmsplice help?

Writing Data in the Ideal Wonderland

First, let's consider the program below, which just copies data without doing any system call. I use std::hint::black_box to stop the compiler from noticing that we are not using the result. Without this, the compiler would optimize the program to nothing.

    fn main() {
        let dst = [0u8; 1 << 15];
        let src = [0u8; 1 << 15];
        let mut copied = 0;
        // Copy 32 KiB at a time until 1000 GiB have been copied in total.
        while copied < (1000 << 30) {
            // black_box stops the compiler from discarding the unused copy.
            std::hint::black_box(dst).copy_from_slice(&src);
            copied += src.len();
        }
    }

On my system, this runs at 167 GB/s. This is consistent with the speed of writing to L1 cache for my CPU^3.

When profiling this with ftrace, we see that 99.9% of the time is spent in __memset_avx512_unaligned_erms, directly called by main, and calling no other functions. The flamegraph is pretty much flat.

If you do not feel like running a full-fledged profiler, you can just use gdb and hit Ctrl+C at a random time:

    $ cargo build --release
    $ gdb target/release/copy
    ...
    (gdb) run
    ...
    ^C (hitting Ctrl+C)
    Program received signal SIGINT, Interrupt.
    __memset_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:236
    ...
    => 0x00007ffff7f15dba f3 aa rep stos %al,%es:(%rdi)

In any case, note that we are using AVX-512. The reference to memset in the names can be surprising, but this is just because part of the logic is common with memcpy. The implementation is in a generic file dedicated to SIMD vectorization that supports SSE, AVX2 and AVX-512. In our case, the AVX-512 specialization is used.
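To see which of these specializations could apply on a given machine, one option is to ask the CPU which SIMD extensions it reports. The sketch below is only an illustration: the helper name detected_simd_levels is made up for this example, and it shows what the CPU advertises, not which variant glibc actually ends up selecting.

```rust
/// Returns the SIMD extensions reported by the CPU, narrowest first.
/// Hypothetical helper for illustration; it does not inspect glibc itself.
fn detected_simd_levels() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut levels = Vec::new();
    // The detection macro only exists on x86 targets, so gate it.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            levels.push("sse2");
        }
        if is_x86_feature_detected!("avx2") {
            levels.push("avx2");
        }
        if is_x86_feature_detected!("avx512f") {
            levels.push("avx512f");
        }
    }
    levels
}

fn main() {
    let levels = detected_simd_levels();
    println!("SIMD extensions reported by the CPU: {:?}", levels);
    match levels.last() {
        Some(widest) => println!("widest extension found: {widest}"),
        None => println!("no x86-64 SIMD extension detected"),
    }
}
```

On a CPU without AVX-512, or when it is masked at boot, "avx512f" should simply be missing from the list.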
As an aside, note that, on Mach-based systems (mostly Apple products), the implementation of memcpy in glibc uses the kernel feature vm_copy to copy pages directly.

However, AVX-512 is quite niche. According to Steam's hardware survey (section "Other Settings"), only about 12% of Steam users have it. In fact, Intel only included AVX-512 for consumer-grade processors in the 11th generation, and now reserves it for servers. AMD CPUs have supported AVX-512 since the Ryzen 7000 series (Zen 4).

So I tested this same program while disabling AVX-512. For this, I used the Linux kernel option clearcpuid=304. I was able to check that it used __memset_avx2_unaligned_erms using the gdb and Ctrl+C trick. I then did the same to disable AVX2 with clearcpuid=304,avx2,avx, making it use __memset_sse2_unaligned_erms. Although SSE2 is always available on x86-64, I also disabled the cpuid bit for SSE2 and SSE to see if it could nudge glibc into using scalar registers to copy data. I immediately got a kernel panic. Ah, well.

When using AVX2, the throughput was... 167 GB/s. When using only SSE2, the throughput was... still 167 GB/s. To an extent, it makes sense: even SSE2 is quite enough to fully use the bus and saturate L1 bandwidth. Using wider registers only helps when performing ALU operations.

The conclusion from this experiment is that, as long as vectorization is used, I should reach 167 GB/s.

Actually Writing Data to a Pipe

So, let's look at what happens when we write to a pipe instead of to user space memory:

    use std::io::Write;
    use std::os::fd::FromRawFd;

    fn main() {
        let vec = vec![b'\0'; 1 << 15];
        let mut total_written = 0;
        // Wrap file descriptor 1 (stdout) in a File to bypass Rust's buffered stdout.
        let mut stdout = unsafe { std::fs::File::from_raw_fd(1) };
        // Write 32 KiB of zeroes per call until 100 GiB have been written.
        while let Ok(n) = stdout.write(&vec) {
            total_written += n;
            if total_written >= (100 << 30) {
                break;
            }
        }
    }

We then measure the throughput using:

    cargo run --release | pv >/dev/null

On my computer, this reaches 17 GB/s. This is 10 times as slow as just writing to a buffer!
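To make the gap more concrete, the two throughputs can be converted into a time budget per 32 KiB buffer. This is a back-of-the-envelope sketch, assuming 1 GB = 10^9 bytes (so that 1 GB/s is 1 byte per nanosecond); the function name nanos_per_buffer is made up for this example.

```rust
/// Nanoseconds needed to move `bytes` bytes at `gb_per_s` GB/s.
/// Assumes 1 GB = 10^9 bytes, so 1 GB/s is exactly 1 byte per nanosecond.
fn nanos_per_buffer(bytes: u64, gb_per_s: f64) -> f64 {
    bytes as f64 / gb_per_s
}

fn main() {
    let buffer = 1u64 << 15; // same 32 KiB buffer as in the programs above

    let memory_ns = nanos_per_buffer(buffer, 167.0); // copy in user space
    let pipe_ns = nanos_per_buffer(buffer, 17.0); // write to a pipe

    println!("user space copy: {memory_ns:.0} ns per buffer");
    println!("pipe write:      {pipe_ns:.0} ns per buffer");
    println!("extra time:      {:.0} ns per write call", pipe_ns - memory_ns);
}
```

Under these assumptions, each write call takes a bit under 2 microseconds, of which only about 0.2 microseconds would be needed for a copy at full L1 speed.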
How can a system call which basically writes to a kernel buffer be so much slower? And no, context switches don't take that much time. So let's do some profiling of this program.

    ./zeroes | pv >/dev/null

[Flamegraph: profiling of ./zeroes]

Note that __GI___libc_write is the glibc wrapper that performs the system call. It and everything below is in user land. Everything above is in the kernel.

As expected, we are spending virtually all our time calling write. In particular, we are spending 95% of our time inside pipe_write. Inside this function, we are spending 36% of our total time in __alloc_pages, which provisions new memory pages for the pipe. We cannot just reuse a handful of pages in a loop because pv moves these pages to /dev/null using splice, which consumes them.

Next to it are __mutex_lock.constprop.0, which takes 25% of the time, and _raw_spin_lock_irq, which takes 5%. They lock the pipe for writing.

This leaves just 20% of the time for the copying of data itself in copy_user_enhanced_fast_string. But, even with only 20% of the CPU time, we would expect to be able to move 167 GB/s * 20% = 33 GB/s, and we only get 17 GB/s. It means that, even taken separately, this function is still twice as slow as __memset_avx512_unaligned_erms, which was used in the program that just wrote to user space memory.

What is copy_user_enhanced_fast_string doing to be so slow? We need to dig deeper. For this, I disassembled my Linux kernel^4, and looked at that function.
    $ grep -w copy_user_enhanced_fast_string /usr/lib/debug/boot/System.map-6.1.0-18-amd64
    ffffffff819d3d90 T copy_user_enhanced_fast_string
    $ objdump -d --start-address=0xffffffff819d3d90 vmlinuz | less

    vmlinuz: file format elf64-x86-64

    Disassembly of section .text:

    ffffffff819d3d90 <.text+0x9d3d90>:
    ffffffff819d3d90: 90              nop
    ffffffff819d3d91: 90              nop
    ffffffff819d3d92: 90              nop
    ffffffff819d3d93: 83 fa 40        cmp $0x40,%edx
    ffffffff819d3d96: 72 48           jb 0xffffffff819d3de0
    ffffffff819d3d98: 89 d1           mov %edx,%ecx
    ffffffff819d3d9a: f3 a4           rep movsb %ds:(%rsi),%es:(%rdi)
    ffffffff819d3d9c: 31 c0           xor %eax,%eax
    ffffffff819d3d9e: 90              nop
    ffffffff819d3d9f: 90              nop
    ffffffff819d3da0: 90              nop
    ffffffff819d3da1: e9 9a dd 42 00  jmp 0xffffffff81e01b40
    ...
    ffffffff81e01b40: c3              ret

The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed. This lets it collect data about specific kernel function calls without inducing any slowdown for kernel functions that are not being profiled. The CPU instruction decoding pipeline takes care of NOP early, so they have basically no impact on performance (other than taking room in the L1i cache).

I do not know why the JMP is not just a RET, however^5.

In any case, the CMP test and JB jump handle the case of buffers that are smaller than 64 bytes by jumping to another function that copies 8 bytes at a time with 64-bit registers, then 1 byte at a time with 8-bit registers, in two loops. For large buffers, the copying is handled by a REP MOVS instruction.

That's definitely not vectorized code. In fact, this function is not implemented in C but directly in assembly! This means that there is no need to look at the result of compilation; we can just look at the source code. And it's not just a missed optimization when compiling: it was written like that.

But is the lack of vector instructions the only reason why copy_user_enhanced_fast_string is twice as slow as __memset_avx512_unaligned_erms?
To check this, I adapted the initial Rust program to explicitly use REP MOVS:

    use std::arch::asm;

    fn main() {
        let src = [0u8; 1 << 15];
        let mut dst = [0u8; 1 << 15];
        let mut copied = 0;
        while copied < (1000u64 << 30) {
            unsafe {
                // REP MOVSB copies RCX bytes from [RSI] to [RDI].
                asm!(
                    "rep movsb",
                    inout("rsi") src.as_ptr() => _,
                    inout("rdi") dst.as_mut_ptr() => _,
                    // Pass the count as a 64-bit value so that all of RCX is defined.
                    inout("rcx") 1usize << 15 => _,
                );
            }
            copied += 1 << 15;
        }
    }

The throughput is 80 GB/s. This is a slowdown by a factor of 2, which matches what we observe with the kernel function!

Now, we know that the Linux kernel is not using SIMD to copy memory and that this makes copy_user_enhanced_fast_string twice as slow as it could be. But why is that? Over at Stack Overflow, Peter Cordes explains that using SSE/AVX instructions is not worth it in most cases, because of the cost of saving and restoring the SIMD context.

In summary: the kernel is spending quite a bit of time on managing memory, and it is not even using SIMD when actually copying the bytes. This is the source of the 10x slowdown we see when comparing with the ideal case.

vmsplice to the Rescue

We now have an upper bound (167 GB/s to write the data in memory once) and a lower bound (17 GB/s when using write on a pipe). Let's look in detail at the effect of using vmsplice. It mitigates the cost of using a pipe by moving entire buffers from user space to the kernel without copying them.

To understand how it works, again, read the excellent article by Francesco. We'll be using the ./write program from that article to get a minimal example of using vmsplice. This program just writes an infinite number of 'X's. This will simplify the profiling by not having any time dedicated to computing Fizz Buzz data or something else.

./write actually achieves 210 GB/s, well above our upper bound, but that's because the program is kind of cheating by reusing the same buffers to pass to vmsplice.
For anything other than a stream of constant bytes, we will actually have to fill the buffers with new data, which is where the upper bound actually applies. In any case, we only care about what vmsplice does:

    ./write -write_with_vmsplice -huge_page -busy_loop | ./read -read_with_splice -busy_loop

[Flamegraph: profiling of ./write]

Like with write, we are spending a significant amount of time (37%) in __mutex_lock.constprop.0. However, there is no __alloc_pages and no _raw_spin_lock_irq. And, instead of copy_user_enhanced_fast_string, we find add_to_pipe, import_iovec and iov_iter_get_pages2. From this, we can see how vmsplice bypasses the expensive parts of the write system call.

As an aside, I was a bit surprised about the effect of the buffer size, especially when not using vmsplice. It looks like minimizing the number of system calls is not always the most important thing to do.

    Program   Buffer size   Throughput (GB/s)   System calls   Instructions   ins/syscall
    ./write         32768                  99        3276822     7373684904          2250
    ./write         65536                 150        1638466     5438514152          3319
    ./write        131072                 207         819270     4288897413          5235
    zeroes          32768                  17        3276800    31859864089          9723
    zeroes          65536                  13        1638400    31750857264         19379
    zeroes         131072                  12         819200    35002733773         42728

Wrapping Up

There you have it. Writing to a pipe is ten times slower than writing to raw memory. And this is because, when writing to a pipe, we need to spend a lot of time taking a lock, and we cannot use vector instructions efficiently.

In principle, we could move data at 167 GB/s, but we need to avoid the cost of locking the buffer, and the cost of saving and restoring the SIMD context. This is exactly what splice and vmsplice do.
They are often described as avoiding copying data between buffers, and this is true, but, most importantly, they completely bypass the conservative kernel code with extensive procedures and scalar code.

1. Of course, they need to write code fast enough to exploit what vmsplice enables, but the point is that the first group's performance is limited by not using vmsplice.
2. All benchmarks were performed on my personal desktop computer, which features a 7950X3D and DDR5 RAM overclocked to 6000 MT/s. I am running Debian 12 with a 6.1.0-18-amd64 Linux kernel. CPU mitigations were disabled using the Linux kernel option mitigations=off. As mentioned by ais523, it is important to pin the processes to specific cores. I have used logical cores 27 and 29, but I omit taskset -c 27 and taskset -c 29 from the commands in this article for the sake of readability. Look into /sys/devices/system/cpu/cpu*/acpi_cppc/highest_perf to know the relative performance of your cores.
3. See "L1 Cache write" in the last row of the second table from the bottom of the LanOC review. This gives 2,518.4 GB/s for all 16 physical cores, or 157.4 GB/s per physical core.
4. I had to install linux-image-6.1.0-18-amd64-dbg to get the file /usr/lib/debug/boot/System.map-6.1.0-18-amd64 with the symbols.
5. Someone at Hacker News has the answer!