[HN Gopher] How fast are Linux pipes anyway? (2022)
___________________________________________________________________
How fast are Linux pipes anyway? (2022)
Author : SEJeff
Score : 203 points
Date : 2023-10-05 18:49 UTC (4 hours ago)
(HTM) web link (mazzo.li)
(TXT) w3m dump (mazzo.li)
| sbjs wrote:
| I remember using Linux pipes for a shell-based IRC client like 12
| years ago. For most application uses, they're plenty fast enough.
| Kinda wish I had the source code for that still.
| bloopernova wrote:
| Pipes are fast enough for iterating and composing cat sed awk cut
| grep uniq jq etc etc.
| whalesalad wrote:
| Love the Edward Tuftian aesthetic of this site. Although above a
| certain viewport width I would imagine you want a `margin: 0
| auto` to center the content block. On a 27" display it is tough
| to read without resizing the window.
| nh2 wrote:
| (2022) Previous discussion:
| https://news.ycombinator.com/item?id=31592934
| dang wrote:
| Thanks! Macroexpanded:
|
| _How fast are Linux pipes anyway?_ -
| https://news.ycombinator.com/item?id=31592934 - June 2022 (200
| comments)
| mannyv wrote:
| How fast are they compared to raw memory throughput?
|
| It's interesting that memory mapping is so expensive. I've often
| wondered about the price that everyone pays for multiple address
| spaces. Is isolation really worth it?
| formerly_proven wrote:
| The relative performance cost of virtual memory was way higher
| in days past, but people considered it worth it for the
| increased system reliability.
| jcrites wrote:
| Are there good data handling libraries that provide abstractions
| over pipes, sockets, files, and memory and implement
| optimizations like these? I'd be interested in knowing if there
| are such libraries in C, C++, Rust, or other systems languages.
|
| I wasn't familiar with some of the APIs mentioned in the article
| like splice() and vmsplice(), so I wondered if there are
| libraries that I might use when building ~low-level applications
| that take advantage of these and related optimizations where
| possible automagically. (As another commenter mentioned: these
| APIs are hard to use and most programs don't take advantage of
| them)
|
| Do libraries like libuv, tokio, Netty handle this automatically
| on Linux? (From some brief research, it seems like probably they
| do)
| DiabloD3 wrote:
| TL;DR: Maximum pipe speed, assuming both programs are written as
| optimally as possible, is approximately the rate at which a
| single core in your system can read/write memory; this is
| because, essentially, the kernel maps the same physical memory
| page from one program's stdout to the other's stdin, making the
| operation zero-copy (or a fast one-copy in slightly less optimal
| situations).
|
| I've known this one for a while, and it makes writing shell
| scripts that glue two (or more) things together with pipes to do
| extremely high performance operations both rewarding and
| hilarious. Certainly one of the most useful tools in the toolbox.
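|
| A rough way to see that ceiling on a given machine is to time a
| plain write()/read() loop across a pipe. A minimal sketch in C
| (the chunk size and total volume here are arbitrary choices, not
| anything from the article):
|
|     /* pipe_bench.c: rough pipe throughput measurement (sketch) */
|     #define _DEFAULT_SOURCE
|     #include <stdio.h>
|     #include <string.h>
|     #include <time.h>
|     #include <unistd.h>
|     #include <sys/wait.h>
|
|     int main(void) {
|         enum { CHUNK = 1 << 17 };        /* 128 KiB per write */
|         long long total = 8LL << 30;     /* move ~8 GiB */
|         static char buf[CHUNK];
|         int fds[2];
|         if (pipe(fds) < 0) { perror("pipe"); return 1; }
|
|         if (fork() == 0) {               /* child: writer */
|             close(fds[0]);
|             memset(buf, 'x', sizeof buf);
|             for (long long sent = 0; sent < total; ) {
|                 ssize_t w = write(fds[1], buf, CHUNK);
|                 if (w <= 0) break;
|                 sent += w;
|             }
|             close(fds[1]);
|             _exit(0);
|         }
|
|         close(fds[1]);                   /* parent: reader, timed */
|         struct timespec t0, t1;
|         clock_gettime(CLOCK_MONOTONIC, &t0);
|         long long got = 0;
|         ssize_t r;
|         while ((r = read(fds[0], buf, CHUNK)) > 0)
|             got += r;
|         clock_gettime(CLOCK_MONOTONIC, &t1);
|         wait(NULL);
|
|         double s = (t1.tv_sec - t0.tv_sec)
|                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
|         printf("%.2f GiB/s\n", got / s / (1 << 30));
|         return 0;
|     }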
| packetlost wrote:
| This is why threads aren't nearly as important as many
| programmers seem to think. Chances are, whatever application
| you're building can be done in a cleaner way using pipes +
| processes or green/user-space threads depending on the workload
| in question. It can be less _convenient_, but message passing
| is usually preferable to deadlock hell.
| rewmie wrote:
| > This is why threads aren't nearly as important as many
| programmers seem to think. Chances are, whatever application
| you're building can be done in a cleaner way using pipes +
| processes or green/user-space threads depending on the
| workload in question.
|
| I think you're making wild claims based on putting up your
| overgeneralized strawman (i.e., "threads aren't nearly as
| important as many programmers seem to think") that afterwards
| you try to water down with weasel words ("depending on the
| workload in question").
|
| Threads are widely used because they bring most of the
| benefits of processes (concurrent control flow, and in
| multicore processors also performance) without the
| constraints and limitations they bring (exclusive memory
| space, slow creation, performance penalty caused by
| serialization in IPC, awkward API, etc).
|
| In multithreaded apps, to get threads to communicate between
| each other, all you need to do is point to the memory
| address of the object you instantiated. No serialization
| needed, no nothing. You simply cannot beat this in terms of
| "clean way" of doing things.
|
| > It can be less convenient, but (...)
|
| That's quite the euphemism, and overlooks why threads are
| largely preferred.
| gpderetta wrote:
| Pass enough messages and you'll easily deadlock as well.
| jstimpfle wrote:
| Pipes are FIFO data buffers implemented in the kernel. For
| communication between threads of the same process, you can
| replace any pipe object with a userspace queue implementation
| protected by e.g. a mutex + condition variable (sketched at the
| end of this comment). It is
| functionally equivalent and has potential to be faster. And
| if you wrap all accesses in lock/unlock pairs (without
| locking any other objects in between) there is no danger of
| introducing any more deadlocks compared to using kernel
| pipes.
|
| Threads are an important structuring mechanism: You can
| assume that all your threads continue to run, or in the event
| of a crash, all your threads die.
|
| Also, unidirectional pipes aren't exactly sufficient for
| inter-process / inter-thread synchronisation. They are ok for
| simple batch processing, but that's about it.
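|
| A minimal sketch of such an in-process queue, assuming POSIX
| threads (the fixed capacity and the queue_t name are just for
| illustration; a real one would also need the static initializers
| indicated in the comments, or an init function):
|
|     #include <pthread.h>
|
|     #define CAP 1024
|
|     typedef struct {
|         void           *items[CAP];
|         int             head, tail, count;
|         pthread_mutex_t lock;      /* PTHREAD_MUTEX_INITIALIZER */
|         pthread_cond_t  not_empty; /* PTHREAD_COND_INITIALIZER  */
|         pthread_cond_t  not_full;  /* PTHREAD_COND_INITIALIZER  */
|     } queue_t;
|
|     void queue_push(queue_t *q, void *item) {
|         pthread_mutex_lock(&q->lock);
|         while (q->count == CAP)      /* wait until there is room */
|             pthread_cond_wait(&q->not_full, &q->lock);
|         q->items[q->tail] = item;
|         q->tail = (q->tail + 1) % CAP;
|         q->count++;
|         pthread_cond_signal(&q->not_empty);
|         pthread_mutex_unlock(&q->lock);
|     }
|
|     void *queue_pop(queue_t *q) {
|         pthread_mutex_lock(&q->lock);
|         while (q->count == 0)        /* wait until there is data */
|             pthread_cond_wait(&q->not_empty, &q->lock);
|         void *item = q->items[q->head];
|         q->head = (q->head + 1) % CAP;
|         q->count--;
|         pthread_cond_signal(&q->not_full);
|         pthread_mutex_unlock(&q->lock);
|         return item;
|     }
|
| Unlike a pipe, the only thing that ever moves through it is a
| pointer; the payload itself is never copied or serialized.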
| gpderetta wrote:
| Incidentally, you can use the exact same setup (plus mmap)
| for interprocess queues.
|
| The advantage of threads is that you can pass pointers to
| your data through the queue, while that's harder to do
| between processes and you have to resort to copying data in
| the queue instead.
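|
| Roughly what that might look like, reusing the queue_t sketch
| from the comment above: place the queue in a MAP_SHARED
| anonymous mapping before fork() and mark the mutex and condition
| variables process-shared (illustrative only, not a complete
| implementation):
|
|     #include <pthread.h>
|     #include <sys/mman.h>
|
|     /* Assumes the queue_t type from the sketch above. */
|     queue_t *shared_queue_create(void) {
|         queue_t *q = mmap(NULL, sizeof *q,
|                           PROT_READ | PROT_WRITE,
|                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
|
|         pthread_mutexattr_t ma;
|         pthread_mutexattr_init(&ma);
|         pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
|         pthread_mutex_init(&q->lock, &ma);
|
|         pthread_condattr_t ca;
|         pthread_condattr_init(&ca);
|         pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
|         pthread_cond_init(&q->not_empty, &ca);
|         pthread_cond_init(&q->not_full, &ca);
|
|         q->head = q->tail = q->count = 0;
|         return q;  /* fork() after this; both sides share it */
|     }
|
| The catch is the one noted above: a raw pointer pushed into
| items[] means nothing in the other process, so the payload has
| to live in (or be copied into) shared memory as well.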
| NortySpock wrote:
| I assume for heterogenous cores (power vs efficiency cores) it
| bottlenecks on the throughput of the slowest core?
| DiabloD3 wrote:
| Surprisingly no. I'd expect similar performance.
|
| In these designs, the actual memory controller that talks to
| the RAM is part of an internal fabric, and the fabric link
| between the core and the memory controller is (technically)
| your upper limit.
|
| For both Intel and AMD, the fabric link is sized to match the
| expected performance of the different cores, as the theoretical
| usage/performance of the load/store units remains in proportion
| as well, no matter whether it is a big core or a little core.
|
| Also, notice: the maximum performance of load-store units is
| your actual upper limit, period. Some CPUs historically never
| achieved their maximum theoretical performance because the
| units were never engaged optimally; sometimes this is because
| some ports on the load/store units are only accessible from
| certain instructions (often due to being reserved only for
| SIMD; this is why memcpy impls often use SSE/AVX, just to
| exploit this fact).
|
| That said, load-store performance usually approaches that
| core's L2 theoretical maximum, which is greater than what any
| core generally can get out of its fabric link. Ergo, fabric
| link is often governing what you're seeing in situations like
| this.
|
| On Intel and AMD's clusters, the memory controller serving
| their respective core cluster designs requires anywhere from
| 2 to 4 cores saturating their links to reach peak
| performance. Also, sibling threads on the same core will
| compete for access to that link, so it isn't merely threads
| that get you there, but actual core saturation.
|
| On a dummy benchmark like the one proposed in the linked
| article, where a single process is piped into another (either
| with both processes on the same big core, simultaneously hyper-
| threading, or on two sibling little cores in the same core
| cluster, serviced by the same memory controller), the upper
| limit of performance should approximate optimal usage of memory
| bandwidth, though in some cases on some architectures it will
| actually approximate L3 bandwidth (a higher value).
|
| Also, as a side note: little cores aren't little. For a
| little bit more silicon usage, and a little bit less power
| usage, two little cores approximate one big core with two
| threads optimally executing, even in Intel's surprisingly
| optimal small core design, but _very_ much true in Zen4c. As
| in, I could buy a "whoops, all little cores" CPU of
| sufficient size for my desktop, and still be happy (or,
| possibly, even happier).
| jstimpfle wrote:
| AFAIK a severe limitation of pipes is that they can buffer only
| 64 KB / 16 pages (on x86 Linux). Pretty sure it's generally
| slower than core-to-memory bandwidth.
| packetlost wrote:
| 64 KB is the default; you can increase the buffer size using
| `fcntl`. You're probably more limited by syscall overhead than
| anything.
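|
| For reference, a minimal sketch of growing the buffer with the
| Linux-specific F_SETPIPE_SZ fcntl (the request is capped at
| /proc/sys/fs/pipe-max-size, 1 MiB by default, for unprivileged
| processes):
|
|     #define _GNU_SOURCE       /* F_GETPIPE_SZ / F_SETPIPE_SZ */
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void) {
|         int fds[2];
|         if (pipe(fds) < 0) { perror("pipe"); return 1; }
|
|         int before = fcntl(fds[1], F_GETPIPE_SZ);  /* usually 65536 */
|         int after  = fcntl(fds[1], F_SETPIPE_SZ, 1 << 20);
|         if (after < 0) { perror("F_SETPIPE_SZ"); return 1; }
|
|         /* The kernel rounds the request up to a power-of-two
|            number of pages and returns the size actually used. */
|         printf("pipe buffer: %d -> %d bytes\n", before, after);
|         return 0;
|     }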
| gpderetta wrote:
| Pipes are zero copy only if you use splice or vmsplice. These
| Linux-specific syscalls are hard to use (particularly vmsplice)
| and the vast majority of programs and shell filters (with the
| notable exception of pv) don't use them and pay for the cost of
| copying in and out of kernel memory.
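|
| As a sketch of what the splice() side looks like: a pv-style
| filter that shovels stdin to stdout without copying through
| userspace. It only works when at least one of the two fds is a
| pipe, which is the case in the middle of a shell pipeline:
|
|     #define _GNU_SOURCE       /* splice() is Linux-specific */
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void) {
|         for (;;) {
|             /* Move up to 64 KiB per call between the two pipe
|                buffers inside the kernel. */
|             ssize_t n = splice(STDIN_FILENO, NULL,
|                                STDOUT_FILENO, NULL,
|                                1 << 16, SPLICE_F_MOVE);
|             if (n == 0) break;            /* writer closed: EOF */
|             if (n < 0) { perror("splice"); return 1; }
|         }
|         return 0;
|     }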
| bee_rider wrote:
| This is magic system stuff I don't understand: does it have to
| go all the way up to the memory or will the caches save us from
| that trip?
| DiabloD3 wrote:
| Depends entirely on the CPU architecture.
|
| The simplest answer I can give is: yes, when it's safe; when
| it's not safe, that's part of an entire category of
| Meltdown/Spectre-family exploits.
| mg wrote:
| One surprising fact about Linux pipes I stumbled across 4 years
| ago is that using a pipe can create indeterministic behavior:
|
| https://www.gibney.org/the_output_of_linux_pipes_can_be_inde...
| xorcist wrote:
| Is that surprising? What would you have guessed the output would
| look like, and why? Perhaps that information would help
| straighten out any confusion.
|
| The command, perhaps intentionally, looks unusual (any code
| reviewer would certainly be scratching their head):
|
| There's an "echo red" in there but it's never sent anywhere
| (perhaps a joke with "red herring"?).
|
| There's an "echo green" sent to stderr, that will only be
| visible if it terminates before "echo blue".
|
| The exact order would depend on output buffering, which will
| depend on which process gets scheduled first, which will vary
| with the number of CPUs and their respective load. So yes, it
| will be indeterministic, but in the same way "top" is.
| arp242 wrote:
| Are there cases where this causes real-world problems? Because
| to be honest this example seems rather artificial.
| Racing0461 wrote:
| ChatGPT was able to figure this out with a simple "what does
| the following do". But it could also be a case of ChatGPT being
| trained on your article.
|
| >>> Note: The ordering of "green" and "blue" in the output
| might vary because these streams (stdout and stderr) might be
| buffered differently by the shell or operating system. Most
| commonly, you will see the output as illustrated above.
| jstimpfle wrote:
| Not surprising: the pipe you've created doesn't transport any
| of the data you've echoed.
|
|     (echo red; echo green 1>&2) | echo blue
|
| This creates two subshells separated by the pipe | symbol. A
| subshell is a child process of the current shell, and as such
| it inherits important properties of the current shell, notably
| including the open file descriptor table.
|
| Since they are child processes, both subshells run
| concurrently, while their parent shell will simply wait() for
| all child processes to terminate. The order in which the
| children get to run is to a large extent unpredictable; on a
| multi-core system they may run literally at the same time.
|
| Now, before the subshells get to process their actual tasks,
| file redirections have to be performed. The left subshell gets
| its stdout redirected to the write end of the kernel pipe
| object that is "created" by the pipe symbol. Likewise, the
| right subshell gets stdin redirected to the read end of the
| pipe object.
|
| The first subshell contains two processes (red and green) that
| run in sequence (";"). "Red" is indeed printed to stdout and
| thus (because of the redirection) sent to the pipe. However,
| nothing is ever read out of the pipe: The only process that is
| connected to the read end of the pipe ("echo blue") never reads
| anything, it is output only.
|
| Unlike "echo red", "echo green >&2" doesn't have stdout
| connected to the pipe. Its stdout is redirected to whatever
| stderr is connected to. Here is the explanation of what ">&2" (or
| equivalently, "1>&2") means: For the execution of "echo green",
| make stdout (1) point to the same object that stderr (2) points
| to. You can imagine it as being a simple assignment: _fd[1] =
| fd[2]_.
|
| For "echo blue", stdout isn't explicitly redirected, so it gets
| run with stdout set to whatever it inherited from its parent
| shell, which is (probably) your terminal.
|
| Seeing that both "echo green" and "echo blue" write directly to
| the same file (again, probably your terminal) we have a race --
| who wins is basically a question of who gets scheduled to run
| first. For one reason or another, it seems that blue is more
| likely to win on your system. It might be due to the fact that
| the left subshell needs to finish the "echo red" first, which
| does print to the pipe, and that might introduce a delay / a
| yield, or such.
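|
| In C terms, the plumbing the shell sets up for "left | right"
| looks roughly like this (error handling omitted; a sketch, not
| how any particular shell actually structures it):
|
|     #include <unistd.h>
|     #include <sys/wait.h>
|
|     void run_pipeline(char *const left[], char *const right[]) {
|         int fds[2];
|         pipe(fds);          /* [0] = read end, [1] = write end */
|
|         if (fork() == 0) {                /* left side */
|             dup2(fds[1], STDOUT_FILENO);  /* stdout -> pipe */
|             close(fds[0]); close(fds[1]);
|             execvp(left[0], left);
|             _exit(127);
|         }
|         if (fork() == 0) {                /* right side */
|             dup2(fds[0], STDIN_FILENO);   /* stdin <- pipe */
|             close(fds[0]); close(fds[1]);
|             execvp(right[0], right);
|             _exit(127);
|         }
|
|         close(fds[0]); close(fds[1]);  /* parent keeps no ends */
|         while (wait(NULL) > 0)         /* wait for both children */
|             ;
|     }
|
| The "1>&2" in the example is then just a dup2(2, 1) performed in
| the left child before it runs the echo.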
| 4death4 wrote:
| That may have been surprising, but, if you think about it a
| little deeper, it makes perfect sense. Programs in a pipeline
| execute concurrently. If they didn't, pipelines wouldn't be
| useful. For instance, a pipeline that downloads a tar file with
| curl and then untars it. If you wait for curl to finish before
| running tar, you run into all sorts of problems. For instance,
| where do you store the intermediate tar file if it's really
| large? Tar _needs_ to run while curl is running to keep buffers
| small and make execution fast. The only control flow between
| pipeline programs is done via stdin and stdout. In your example
| program, you write to stderr so naturally that's not part of
| the deterministic control flow.
| Borg3 wrote:
| Hah, nice article :) I remember fighting with Cygwin pipe
| implementations to get decent performance out of them. They are
| hella slower compared to Linux, but still usable, just tricky to
| pass data in/out.
| epistasis wrote:
| Fantastic article, I learned a lot despite pipes being a bread
| and butter user tool for me for a quarter century.
| chris_armstrong wrote:
| Absolutely amazing. I know about page tables and the like, but
| tying it to performance analysis with `perf` makes it clear how
| central it is to throughput.
___________________________________________________________________
(page generated 2023-10-05 23:00 UTC)