[HN Gopher] How fast are Linux pipes anyway? (2022)
       ___________________________________________________________________
        
       How fast are Linux pipes anyway? (2022)
        
       Author : SEJeff
       Score  : 203 points
       Date   : 2023-10-05 18:49 UTC (4 hours ago)
        
 (HTM) web link (mazzo.li)
 (TXT) w3m dump (mazzo.li)
        
       | sbjs wrote:
       | I remember using linux pipes for a shell-based irc client like 12
       | years ago. For most application uses, they're plenty fast enough.
       | Kinda wish I had the source code for that still.
        
       | bloopernova wrote:
       | Pipes are fast enough for iterating and composing cat sed awk cut
       | grep uniq jq etc etc.
        
       | whalesalad wrote:
       | Love the Edward Tuftian aesthetic of this site. Although above a
       | certain viewport width I would imagine you want a `margin: 0
       | auto` to center the content block. On a 27" display it is tough
       | to read without resizing the window.
        
       | nh2 wrote:
       | (2022) Previous discussion:
       | https://news.ycombinator.com/item?id=31592934
        
         | dang wrote:
         | Thanks! Macroexpanded:
         | 
         |  _How fast are Linux pipes anyway?_ -
         | https://news.ycombinator.com/item?id=31592934 - June 2022 (200
         | comments)
        
       | mannyv wrote:
       | How fast are they compared to raw memory throughput?
       | 
        | It's interesting that memory mapping is so expensive. I've often
        | wondered about the price that everyone pays for multiple address
        | spaces. Is isolation really worth it?
        
         | formerly_proven wrote:
         | The relative performance cost of virtual memory was way higher
         | in days past, but people considered it worth it for the
         | increased system reliability.
        
       | jcrites wrote:
       | Are there good data handling libraries that provide abstractions
       | over pipes, sockets, files, and memory and implement
       | optimizations like these? I'd be interested in knowing if there
       | are such libraries in C, C++, Rust, or other systems languages.
       | 
       | I wasn't familiar with some of the APIs mentioned in the article
       | like splice() and vmsplice(), so I wondered if there are
       | libraries that I might use when building ~low-level applications
       | that take advantage of these and related optimizations where
       | possible automagically. (As another commenter mentioned: these
       | APIs are hard to use and most programs don't take advantage of
       | them)
       | 
       | Do libraries like libuv, tokio, Netty handle this automatically
       | on Linux? (From some brief research, it seems like probably they
       | do)
        
       | DiabloD3 wrote:
        | TL;DR: Maximum pipe speed, assuming both programs are written as
        | optimally as possible, is approximately the speed at which one
        | core in your system can read/write memory; this is because,
        | essentially, the kernel maps the same physical memory page from
        | one program's stdout to the other's stdin, making the operation
        | zero-copy (or a fast one-copy in slightly less optimal
        | situations).
        | 
        | I've known this one for a while, and it makes writing shell
        | scripts that glue two (or more) things together with pipes to do
        | extremely high-performance operations both rewarding and
        | hilarious. Certainly one of the most useful tools in the toolbox.
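        | 
        | If you want to see the number on your own machine, a naive
        | (copying, non-splice) micro-benchmark looks roughly like the
        | sketch below. Illustrative only, not the article's code; it
        | ignores partial writes and error handling for brevity.
        | 
        |     /* pipe_bw.c -- naive pipe throughput test (illustrative) */
        |     #include <stdio.h>
        |     #include <time.h>
        |     #include <unistd.h>
        | 
        |     #define BUF_SIZE (256 * 1024)
        |     #define TOTAL    (10LL * 1024 * 1024 * 1024)  /* 10 GiB */
        | 
        |     int main(void) {
        |         static char buf[BUF_SIZE];
        |         int fds[2];
        |         if (pipe(fds) < 0) { perror("pipe"); return 1; }
        | 
        |         if (fork() == 0) {              /* child: writer */
        |             close(fds[0]);
        |             for (long long left = TOTAL; left > 0; left -= BUF_SIZE)
        |                 write(fds[1], buf, BUF_SIZE);
        |             return 0;
        |         }
        | 
        |         close(fds[1]);                  /* parent: reader */
        |         struct timespec t0, t1;
        |         clock_gettime(CLOCK_MONOTONIC, &t0);
        |         long long total = 0;
        |         ssize_t n;
        |         while ((n = read(fds[0], buf, BUF_SIZE)) > 0)
        |             total += n;
        |         clock_gettime(CLOCK_MONOTONIC, &t1);
        |         double secs = (t1.tv_sec - t0.tv_sec)
        |                     + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        |         printf("%.2f GiB/s\n", total / secs / (1 << 30));
        |         return 0;
        |     }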
        
         | packetlost wrote:
         | This is why threads aren't nearly as important as many
         | programmers seem to think. Chances are, whatever application
         | you're building can be done in a cleaner way using pipes +
         | processes or green/user-space threads depending on the workload
          | in question. It can be less _convenient_, but message passing
         | is usually preferable to deadlock hell.
        
           | rewmie wrote:
           | > This is why threads aren't nearly as important as many
           | programmers seem to think. Chances are, whatever application
           | you're building can be done in a cleaner way using pipes +
           | processes or green/user-space threads depending on the
           | workload in question.
           | 
           | I think you're making wild claims based on putting up your
           | overgeneralized strawman (i.e., "threads aren't nearly as
           | important as many programmers seem to think") that afterwards
           | you try to water down with weasel words ("depending on the
           | workload in question").
           | 
           | Threads are widely used because they bring most of the
           | benefits of processes (concurrent control flow, and in
           | multicore processors also performance) without the
            | constraints and limitations that processes bring (exclusive
            | memory space, slow creation, performance penalty caused by
            | serialization in IPC, awkward API, etc.).
           | 
            | In multithreaded apps, to get threads to communicate with
            | each other, all you need to do is pass a pointer to the
            | object you instantiated. No serialization needed, no nothing.
            | You simply cannot beat this in terms of a "clean way" of
            | doing things.
           | 
           | > It can be less convenient, but (...)
           | 
           | That's quite the euphemism, and overlooks why threads are
           | largely preferred.
        
           | gpderetta wrote:
           | Message pass enough and you'll easily deadlock as well.
        
           | jstimpfle wrote:
           | Pipes are FIFO data buffers implemented in the kernel. For
           | communication between threads of the same process, you can
           | replace any pipe object by a userspace queue implementation
           | protected by e.g. mutex + condition variable. It is
            | functionally equivalent and has the potential to be faster. And
           | if you wrap all accesses in lock/unlock pairs (without
           | locking any other objects in between) there is no danger of
           | introducing any more deadlocks compared to using kernel
           | pipes.
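            | 
            | A minimal sketch of such a queue (pthreads, bounded, error
            | handling omitted); note that, unlike a pipe, it hands over
            | pointers instead of copying bytes:
            | 
            |     /* Bounded FIFO protected by a mutex + two condvars. */
            |     #include <pthread.h>
            |     #include <stddef.h>
            | 
            |     #define CAP 1024
            | 
            |     struct queue {
            |         /* init lock/conds with PTHREAD_MUTEX_INITIALIZER
            |          * and PTHREAD_COND_INITIALIZER (or *_init calls) */
            |         void           *items[CAP];
            |         size_t          head, tail, count;
            |         pthread_mutex_t lock;
            |         pthread_cond_t  not_empty, not_full;
            |     };
            | 
            |     void queue_push(struct queue *q, void *item) {
            |         pthread_mutex_lock(&q->lock);
            |         while (q->count == CAP)               /* full: wait */
            |             pthread_cond_wait(&q->not_full, &q->lock);
            |         q->items[q->tail] = item;
            |         q->tail = (q->tail + 1) % CAP;
            |         q->count++;
            |         pthread_cond_signal(&q->not_empty);
            |         pthread_mutex_unlock(&q->lock);
            |     }
            | 
            |     void *queue_pop(struct queue *q) {
            |         pthread_mutex_lock(&q->lock);
            |         while (q->count == 0)                 /* empty: wait */
            |             pthread_cond_wait(&q->not_empty, &q->lock);
            |         void *item = q->items[q->head];
            |         q->head = (q->head + 1) % CAP;
            |         q->count--;
            |         pthread_cond_signal(&q->not_full);
            |         pthread_mutex_unlock(&q->lock);
            |         return item;
            |     }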
           | 
           | Threads are an important structuring mechanism: You can
           | assume that all your threads continue to run, or in the event
           | of a crash, all your threads die.
           | 
           | Also, unidirectional pipes aren't exactly sufficient for
           | inter-process / inter-thread synchronisation. They are ok for
           | simple batch processing, but that's about it.
        
             | gpderetta wrote:
              | Incidentally you can use the exact same setup (plus mmap)
             | for interprocess queues.
             | 
             | The advantage of threads is that you can pass pointers to
             | your data through the queue, while that's harder to do
              | between processes and you have to resort to copying data
              | into the queue instead.
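              | 
              | A sketch of that setup, reusing the struct queue from the
              | sketch in the parent comment: place it in a shared
              | anonymous mapping created before fork() and mark the
              | mutex/condvars process-shared (illustrative, error handling
              | omitted):
              | 
              |     #define _GNU_SOURCE
              |     #include <pthread.h>
              |     #include <sys/mman.h>
              | 
              |     struct queue *queue_create_shared(void) {
              |         struct queue *q = mmap(NULL, sizeof *q,
              |                                PROT_READ | PROT_WRITE,
              |                                MAP_SHARED | MAP_ANONYMOUS,
              |                                -1, 0);
              | 
              |         pthread_mutexattr_t ma;
              |         pthread_mutexattr_init(&ma);
              |         pthread_mutexattr_setpshared(&ma,
              |                                      PTHREAD_PROCESS_SHARED);
              |         pthread_mutex_init(&q->lock, &ma);
              | 
              |         pthread_condattr_t ca;
              |         pthread_condattr_init(&ca);
              |         pthread_condattr_setpshared(&ca,
              |                                     PTHREAD_PROCESS_SHARED);
              |         pthread_cond_init(&q->not_empty, &ca);
              |         pthread_cond_init(&q->not_full, &ca);
              | 
              |         /* fork() after this; both processes see the same
              |          * queue. The pointers pushed through it must
              |          * themselves point into shared memory, hence the
              |          * copying mentioned above. */
              |         return q;
              |     }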
        
         | NortySpock wrote:
          | I assume for heterogeneous cores (power vs efficiency cores) it
         | bottlenecks on the throughput of the slowest core?
        
           | DiabloD3 wrote:
           | Surprisingly no. I'd expect similar performance.
           | 
           | In these designs, the actual memory controller that talks to
           | the RAM is part of an internal fabric, and the fabric link
           | between the core and the memory controller is (technically)
           | your upper limit.
           | 
            | For both Intel and AMD, the size of the fabric link is kept
            | in proportion to the expected performance of the different
            | cores, as the theoretical usage/performance of the load/store
            | units remains in the same proportion, no matter if it is a
            | big core or a little core.
           | 
           | Also, notice: the maximum performance of load-store units is
           | your actual upper limit, period. Some CPUs historically never
           | achieved their maximum theoretical performance because the
           | units were never engaged optimally; sometimes this is because
           | some ports on the load/store units are only accessible from
           | certain instructions (often due to being reserved only for
           | SIMD; this is why memcpy impls often use SSE/AVX, just to
           | exploit this fact).
           | 
           | That said, load-store performance usually approaches that
           | core's L2 theoretical maximum, which is greater than what any
           | core generally can get out of its fabric link. Ergo, fabric
           | link is often governing what you're seeing in situations like
           | this.
           | 
           | On Intel and AMD's clusters, the memory controller serving
           | their respective core cluster designs requires anywhere from
           | 2 to 4 cores saturating their links to reach peak
           | performance. Also, sibling threads on the same core will
           | compete for access to that link, so it isn't merely threads
           | that get you there, but actual core saturation.
           | 
            | On a dummy benchmark like the one proposed in the linked
            | article (a single process piped to another, with both
            | processes either on the same big core via hyper-threading, or
            | on two sibling little cores in the same core cluster serviced
            | by the same memory controller), the upper limit of
            | performance should approximate optimal usage of memory
            | bandwidth, but in some cases on some architectures it will
            | actually approximate L3 bandwidth (a higher value).
           | 
           | Also, as a side note: little cores aren't little. For a
           | little bit more silicon usage, and a little bit less power
            | usage, two little cores approximate one big core w/ two
           | threads optimally executing, even in Intel's surprisingly
           | optimal small core design, but _very_ much true in Zen4c. As
           | in, I could buy a  "whoops, all little cores" CPU of
           | sufficient size for my desktop, and still be happy (or,
           | possibly, even happier).
        
         | jstimpfle wrote:
         | AFAIK a severe limitation of pipes is that they can buffer only
         | 64 KB / 16 pages (on x86 Linux). Pretty sure it's generally
         | slower than core-to-memory bandwidth.
        
           | packetlost wrote:
            | 64KB is the default; you can increase the buffer size using
            | `fcntl`. You're probably more limited by syscall overhead
            | than anything.
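            | 
            | Something like this (Linux-specific; for unprivileged
            | processes the maximum is capped by
            | /proc/sys/fs/pipe-max-size):
            | 
            |     /* Sketch: grow a pipe's buffer with F_SETPIPE_SZ. */
            |     #define _GNU_SOURCE
            |     #include <fcntl.h>
            |     #include <stdio.h>
            |     #include <unistd.h>
            | 
            |     int main(void) {
            |         int fds[2];
            |         pipe(fds);
            |         printf("default: %d bytes\n",
            |                fcntl(fds[0], F_GETPIPE_SZ));
            |         if (fcntl(fds[0], F_SETPIPE_SZ, 1 << 20) < 0)
            |             perror("F_SETPIPE_SZ");   /* ask for 1 MiB */
            |         printf("now:     %d bytes\n",
            |                fcntl(fds[0], F_GETPIPE_SZ));
            |         return 0;
            |     }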
        
         | gpderetta wrote:
          | Pipes are zero copy only if you use splice or vmsplice. These
          | Linux-specific syscalls are hard to use (particularly vmsplice)
          | and the vast majority of programs and shell filters (with the
          | notable exception of pv) don't use them and pay the cost of
          | copying in and out of kernel memory.
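          | 
          | Roughly what the vmsplice path looks like on the write side (a
          | sketch; error handling is omitted, and you must not reuse the
          | buffer until the reader has consumed it, which is part of what
          | makes vmsplice hard to use):
          | 
          |     /* Hand user pages to a pipe instead of copying them in. */
          |     #define _GNU_SOURCE
          |     #include <fcntl.h>
          |     #include <stdlib.h>
          |     #include <string.h>
          |     #include <sys/uio.h>
          |     #include <unistd.h>
          | 
          |     int main(void) {
          |         size_t len = 1 << 20;
          |         char *buf = aligned_alloc(4096, len);  /* page-aligned */
          |         memset(buf, 'x', len);
          | 
          |         /* stdout must be a pipe: ./a.out | pv > /dev/null */
          |         struct iovec iov = { .iov_base = buf, .iov_len = len };
          |         while (iov.iov_len > 0) {
          |             ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
          |             if (n < 0) return 1;
          |             iov.iov_base = (char *)iov.iov_base + n;
          |             iov.iov_len -= n;
          |         }
          |         /* Caveat: the pipe now references these pages, so the
          |          * buffer must not be touched until the reader has
          |          * consumed them. */
          |         return 0;
          |     }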
        
         | bee_rider wrote:
          | This is magic system stuff I don't understand: does it have to
          | go all the way out to memory, or will the caches save us from
          | that trip?
        
           | DiabloD3 wrote:
           | Depends entirely on the CPU architecture.
           | 
            | The simplest answer I can give is: yes, when it's safe; when
            | it's not safe, that's part of an entire category of
            | Meltdown/Spectre-family exploits.
        
       | mg wrote:
       | One surprising fact about Linux pipes I stumbled across 4 years
       | ago is that using a pipe can create indeterministic behavior:
       | 
       | https://www.gibney.org/the_output_of_linux_pipes_can_be_inde...
        
         | xorcist wrote:
          | Is that surprising? What would you have guessed the output
          | would look like, and why? Perhaps that information would help
          | straighten out any confusion.
         | 
          | The command, perhaps intentionally, looks unusual (any code
          | reviewer would certainly be scratching their head):
          | 
          |     (echo red; echo green 1>&2) | echo blue
          | 
          | There's an "echo red" in there whose output goes into the pipe
          | but is never read (perhaps a joke with "red herring"?).
          | 
          | There's an "echo green" sent to stderr, which will only be
          | visible if it runs before "echo blue" terminates.
         | 
          | The exact order depends on output buffering and on which
          | process gets scheduled first, which will vary with the number
          | of CPUs and their respective load. So yes, it will be
          | indeterministic, but in the same way "top" is.
        
         | arp242 wrote:
          | Are there cases where this causes real-world problems? Because
          | to be honest this example seems rather artificial.
        
         | Racing0461 wrote:
         | Chatgpt was able to figure this out with a simple "what does
         | the following do". But it could also be a case of chatgpt being
         | trained on your article.
         | 
         | >>> Note: The ordering of "green" and "blue" in the output
         | might vary because these streams (stdout and stderr) might be
         | buffered differently by the shell or operating system. Most
         | commonly, you will see the output as illustrated above.
        
         | jstimpfle wrote:
          | Not surprising, the pipe you've created doesn't transport any
          | of the data you've echoed.
          | 
          |     (echo red; echo green 1>&2) | echo blue
         | 
         | This creates two subshells separated by the pipe | symbol. A
         | subshell is a child process of the current shell, and as such
         | it inherits important properties of the current shell, notably
         | including the open file descriptor table.
         | 
         | Since they are child processes, both subshells run
         | concurrently, while their parent shell will simply wait() for
          | all child processes to terminate. The order in which the
          | children get to run is to a large extent unpredictable; on a
          | multi-core system they may run literally at the same time.
         | 
         | Now, before the subshells get to process their actual tasks,
         | file redirections have to be performed. The left subshell gets
         | its stdout redirected to the write end of the kernel pipe
         | object that is "created" by the pipe symbol. Likewise, the
         | right subshell gets stdin redirected to the read end of the
         | pipe object.
         | 
          | The first subshell contains two commands (red and green) that
         | run in sequence (";"). "Red" is indeed printed to stdout and
         | thus (because of the redirection) sent to the pipe. However,
         | nothing is ever read out of the pipe: The only process that is
         | connected to the read end of the pipe ("echo blue") never reads
          | anything; it is output-only.
         | 
         | Unlike "echo red", "echo green >&2" doesn't have stdout
         | connected to the pipe. Its stdout is redirected to whatever
          | stderr is connected to. Here is what ">&2" (or equivalently,
          | "1>&2") means: for the execution of "echo green",
         | make stdout (1) point to the same object that stderr (2) points
         | to. You can imagine it as being a simple assignment: _fd[1] =
         | fd[2]_.
         | 
         | For "echo blue", stdout isn't explicitly redirected, so it gets
         | run with stdout set to whatever it inherited from its parent
         | shell, which is (probably) your terminal.
         | 
         | Seeing that both "echo green" and "echo blue" write directly to
         | the same file (again, probably your terminal) we have a race --
         | who wins is basically a question of who gets scheduled to run
         | first. For one reason or other, it seems that blue is more
         | likely to win on your system. It might be due to the fact that
         | the left subshell needs to finish the "echo red" first, which
         | does print to the pipe, and that might introduce a delay / a
         | yield, or such.
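          | 
          | For the curious, a simplified sketch (not what any particular
          | shell literally does) of the plumbing behind "a | b", which is
          | also why both sides run concurrently; the "1>&2" case is just a
          | dup2(2, 1) before the exec:
          | 
          |     /* How a shell wires up "echo red | echo blue", roughly. */
          |     #include <sys/wait.h>
          |     #include <unistd.h>
          | 
          |     int main(void) {
          |         int fds[2];
          |         pipe(fds);   /* fds[0] = read end, fds[1] = write end */
          | 
          |         if (fork() == 0) {                /* left side */
          |             dup2(fds[1], STDOUT_FILENO);  /* stdout -> pipe */
          |             close(fds[0]); close(fds[1]);
          |             execlp("echo", "echo", "red", (char *)NULL);
          |             _exit(127);
          |         }
          |         if (fork() == 0) {                /* right side */
          |             dup2(fds[0], STDIN_FILENO);   /* stdin <- pipe */
          |             close(fds[0]); close(fds[1]);
          |             execlp("echo", "echo", "blue", (char *)NULL);
          |             _exit(127);
          |         }
          | 
          |         close(fds[0]); close(fds[1]);
          |         while (wait(NULL) > 0) {}  /* both run concurrently */
          |         return 0;
          |     }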
        
         | 4death4 wrote:
         | That may have been surprising, but, if you think about it a
         | little deeper, it makes perfect sense. Programs in a pipeline
         | execute concurrently. If they didn't, pipelines wouldn't be
          | useful. Consider, for instance, a pipeline that downloads a
          | tar file with curl and then untars it. If you wait for curl to
          | finish before running tar, you run into all sorts of problems.
          | For instance,
         | where do you store the intermediate tar file if it's really
         | large? Tar _needs_ to run while curl is running to keep buffers
         | small and make execution fast. The only control flow between
         | pipeline programs is done via stdin and stdout. In your example
         | program, you write to stderr so naturally that's not part of
         | the deterministic control flow.
        
       | Borg3 wrote:
        | Hah, nice article :) I remember fighting with the Cygwin pipe
        | implementation to get decent performance out of it. It's hella
        | slower compared to Linux, but still usable, just tricky to pass
        | data in/out.
        
       | epistasis wrote:
       | Fantastic article, I learned a lot despite pipes being a bread
       | and butter user tool for me for a quarter century.
        
       | chris_armstrong wrote:
        | Absolutely amazing. I know about page tables and the like, but
        | tying them to performance analysis with `perf` makes it clear how
        | central they are to throughput.
        
       ___________________________________________________________________
       (page generated 2023-10-05 23:00 UTC)