[HN Gopher] Linux Pipes Are Slow
       ___________________________________________________________________
        
       Linux Pipes Are Slow
        
       Author : qsantos
       Score  : 308 points
       Date   : 2024-08-25 16:52 UTC (1 day ago)
        
 (HTM) web link (qsantos.fr)
 (TXT) w3m dump (qsantos.fr)
        
       | jheriko wrote:
        | just never use pipes. they are some weird archaism that needs to
        | die :P
        | 
        | the only time i've used them is due to external constraints. they
        | are just not useful.
        
         | henearkr wrote:
         | Pipes are extremely useful. But I guess it just depends on your
         | use case. I do a lot of scripting.
         | 
         | If you dislike their (relative) slowness, it's open source, you
         | can participate in making them faster.
         | 
         | And I'm sure that after this HN post we'll see some patches and
         | merge requests.
        
           | noloblo wrote:
            | +1 yes, pipes are what make shell scripting quite useful and
            | allow for easy composition of the different unix shell utilities
        
           | hnlmorg wrote:
           | The very thing that makes pipes useful is what also makes
           | them slow. I don't think there is much we can do to fix that
           | without breaking POSIX compatibility entirely.
           | 
           | Personally I think there's much worse ugliness in POSIX than
           | pipes. For example, I've just spent the last couple of days
           | debugging a number of bugs in a shell's job control code
           | (`fg`, `bg`, `jobs`, etc).
           | 
           | But despite its warts, I'm still grateful we have something
           | like POSIX to build against.
        
             | effie wrote:
             | What possible bugs can there be in those? They are quite
             | simple to use and work as expected.
        
               | khafra wrote:
               | They work as expected on Redhat and Debian. "POSIX"
               | leaves open a lot of possibility for less-well-tested
               | systems. They could be writing shellscripts on Minix or
               | HelenOS.
        
               | hnlmorg wrote:
               | I'm talking about shell implementation not shell usage.
               | 
               | To implement job control, there are several signals you
               | need to be aware of:
               | 
                | - SIGTSTP (what the TTY sends if it receives ^Z)
               | 
               | - SIGSTOP (what a shell sends to a process to suspend it)
               | 
               | - SIGCONT (what a shell sends to a process to resume it)
               | 
               | - SIGCHLD (what the shell needs to listen for to see
               | there is a change in state for a child process -- this is
               | also sometimes referred to as SIGCLD)
               | 
                | - SIGTTIN (received if a background process reads from
                | the TTY)
               | 
                | - SIGTTOU (received if a background process writes to the
                | TTY or tries to change its modes)
               | 
               | Some of these signals are received by the shell, some are
               | by the process. Some are sent from the shell and others
               | from the kernel.
               | 
               | SIGCHLD isn't just raised for when a child process goes
               | into suspend, it can be raised for a few different
               | changes of state. So if you receive SIGCHLD you then need
               | to inspect your children (of course you don't know what
               | child has triggered SIGCHLD because signals don't contain
               | metadata) to see if any of them have changed their state
               | in any way. Which is "fun"....
               | 
               | And all of this only works if you manage to fork your
               | children with special flags to set their PGID (not PID,
               | another meta ID which represents what process _group_
               | they belong to), and send magic syscalls to keep passing
                | ownership of the TTY (if you don't tell the kernel which
               | process owns the TTY, ie is in the foreground, then
               | either your child process and/or your shell will crash
               | due to permission issues).
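                | 
                | For anyone curious, the core of that dance looks roughly
                | like this (a simplified sketch, not my shell's actual
                | code; error handling, SIGCHLD bookkeeping and the race
                | the child side normally closes are all omitted):
                | 
                |   #include <signal.h>
                |   #include <sys/wait.h>
                |   #include <unistd.h>
                | 
                |   /* Run argv as the foreground job of an
                |      interactive shell. */
                |   void run_foreground_job(char *const argv[])
                |   {
                |       pid_t shell_pgid = getpgrp();
                |       /* so the shell can steal the TTY back later */
                |       signal(SIGTTOU, SIG_IGN);
                | 
                |       pid_t child = fork();
                |       if (child == 0) {
                |           /* child: own process group, default
                |              job-control signal dispositions */
                |           setpgid(0, 0);
                |           signal(SIGTSTP, SIG_DFL);
                |           signal(SIGTTIN, SIG_DFL);
                |           signal(SIGTTOU, SIG_DFL);
                |           execvp(argv[0], argv);
                |           _exit(127);
                |       }
                |       /* parent: same setpgid, to avoid a race */
                |       setpgid(child, child);
                |       /* hand ownership of the TTY to the job */
                |       tcsetpgrp(STDIN_FILENO, child);
                | 
                |       int status;
                |       /* returns on exit *or* on a ^Z stop */
                |       waitpid(child, &status, WUNTRACED);
                | 
                |       /* take the TTY back */
                |       tcsetpgrp(STDIN_FILENO, shell_pgid);
                |   }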
               | 
               | None of this is 100% portable (see footnote [1]) and all
               | of this also depends on well behaving applications not
               | catching signals themselves and doing something funky
               | with them.
               | 
               | The bug I've got is that Helix editor is one of those
               | applications doing something non-standard with SIGTSTP
               | and assuming anything that breaks as a result is a parent
               | process which doesn't support job control. Except my
               | shell _does_ support job control and still crashes as a
                | result of Helix's non-standard implementation.
               | 
               | In fairness to Helix, my shell does also implement job
               | control in a non-standard way because I wanted to add
               | some wrappers around signals and TTYs to make the
               | terminal experience a little more comfortable than it is
               | with POSIX-compliant shells like Bash. But because job
               | control (and signals and TTYs in general) are so archaic,
               | the result is that there are always going to be edge case
               | bugs with applications (like Helix) that have implemented
               | things a little differently themselves too.
               | 
               | So they're definitely not easy to use and can break in
               | unexpected ways if even just one application doesn't
               | implement things in expected ways.
               | 
               | [1] By the way, this is all ignoring subtle problems that
               | different implementations of PTYs (eg terminal emulators,
               | terminal multiplexors, etc) and different POSIX kernels
               | can introduce too. And those can be a _nightmare_ to
               | track down and debug!
        
         | akira2501 wrote:
         | > just never use pipes.
         | 
          | vmsplice doesn't work with every type of file descriptor.
          | Eschewing some technology entirely because it seems archaic or
          | because it makes writing "the fastest X software" seem harder
          | is just sloppy engineering.
         | 
         | > they are just not useful.
         | 
         | Then you have not written enough software yet to discover how
         | they are useful.
        
           | gpderetta wrote:
            | Most importantly, the fast fizzbuzz toy vmsplices into
            | /dev/null.
           | 
            | Nothing ever touches those pages on the consumer side and
            | they can be reused immediately.
           | 
           | If you actually want a functional program using vmsplice,
           | with a real consumer, things get hairy very quickly.
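            | 
            | For context, the toy case is essentially this (a sketch;
            | in a real program you must not touch buf again until the
            | consumer is actually done with those pages, which is where
            | it gets hairy):
            | 
            |   #define _GNU_SOURCE
            |   #include <fcntl.h>
            |   #include <stdio.h>
            |   #include <string.h>
            |   #include <sys/uio.h>
            |   #include <unistd.h>
            | 
            |   int main(void)
            |   {
            |       static char buf[65536];
            |       memset(buf, 'x', sizeof(buf));
            | 
            |       int p[2];
            |       pipe(p);
            |       int devnull = open("/dev/null", O_WRONLY);
            | 
            |       struct iovec iov = {
            |           .iov_base = buf, .iov_len = sizeof(buf),
            |       };
            |       /* hand the pages to the pipe without copying */
            |       ssize_t in = vmsplice(p[1], &iov, 1, 0);
            |       /* move them on to /dev/null, again no copy */
            |       ssize_t out = splice(p[0], NULL, devnull, NULL,
            |                            (size_t)in, SPLICE_F_MOVE);
            |       printf("vmspliced %zd, spliced %zd\n", in, out);
            |       return 0;
            |   }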
        
         | w0m wrote:
         | You can replace a 10k line python or shell script with a single
         | creative line of pipes/xargs/etc on the cli.
         | 
         | It's incredibly valuable on the day to day.
        
         | hagbard_c wrote:
         | That's like telling a builder never to use nails but turn to
         | adhesives instead. He will look at his hammer and his nails as
         | well as a stack of 2x4s, grin and in no time slap together a
         | box into which he will stuff you with a bottle of glue with the
         | advice to now go and play while the grown-ups take care of
         | business.
         | 
          | Sure, you could build that box with glue and clamps and ample
          | time; sure, it would look neater and weigh less than the version
          | that's currently holding you imprisoned, and if done right it
          | would even be stronger. But it takes more time and effort, as
          | well as those glue clamps and other specialised tools to create
          | perfectly matching surfaces, while the builder just wielded that
          | hammer and those nails and is now building yet another
          | utilitarian piece of work with the same hammer and nails.
         | 
         | Sometimes all you need is a hammer and some nails. Or pipes.
        
         | duped wrote:
         | I agree with this but with a much more nuanced take: avoid
         | pipes if either reader or writer expects to do async i/o and
         | you don't own both the reader and writer.
         | 
          | In fact, if you ever set O_NONBLOCK on a pipe you need to be
          | damn sure both the reader and writer expect non-blocking i/o,
          | because you'll get heisenbugs under heavy i/o when one end
          | outpaces the other and the other end expects blocking i/o.
         | When's the last time you checked the error code of `printf` and
         | put it in a retry loop?
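          | 
          | To be concrete, this is roughly what a well-behaved writer
          | has to do once its end of a pipe really is non-blocking (a
          | sketch, with a made-up helper name; almost nothing that just
          | printf()s to stdout bothers):
          | 
          |   #include <errno.h>
          |   #include <poll.h>
          |   #include <unistd.h>
          | 
          |   /* write all of buf, retrying on EAGAIN */
          |   ssize_t write_all_nonblocking(int fd, const char *buf,
          |                                 size_t len)
          |   {
          |       size_t done = 0;
          |       while (done < len) {
          |           ssize_t n = write(fd, buf + done, len - done);
          |           if (n > 0) {
          |               done += (size_t)n;
          |           } else if (n < 0 && errno == EAGAIN) {
          |               /* pipe full: wait for the reader to drain it */
          |               struct pollfd pfd = { .fd = fd, .events = POLLOUT };
          |               poll(&pfd, 1, -1);
          |           } else {
          |               return -1;   /* real error, e.g. EPIPE */
          |           }
          |       }
          |       return (ssize_t)done;
          |   }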
        
           | teo_zero wrote:
           | Genuine question: why does printf need a retry loop when
           | using pipes?
        
             | duped wrote:
              | It doesn't; that's why no one does it.
             | 
              | But for pipes what it means is that if whoever is reading
              | or writing the pipe expects non-blocking semantics, the
              | other end needs to agree. And if they don't, you'll
              | eventually get an error because the reader or writer
              | outpaced the other, and almost no program handles errors
              | for stdin or stdout.
        
               | teo_zero wrote:
               | But even writing to a file doesn't guarantee non-blocking
               | semantics. I still don't get what is special about pipes.
        
               | caf wrote:
               | Making the read side non-blocking doesn't affect the
               | write side, and vice-versa.
        
               | duped wrote:
               | That is not true for pipes.
        
               | caf wrote:
               | It is, at least on Linux for ordinary pipe(2) pipes.
               | 
               | I just wrote up a test to be sure: in the process with
               | the read side, set it to non-blocking with fcntl(p,
               | F_SETFL, O_NONBLOCK) then go to sleep for a long period.
               | Dump a bunch of data into the writing side with the other
               | process: the write() call blocks once the pipe is full as
               | you would expect.
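                | 
                | For reference, a minimal version of that test (assuming
                | Linux and ordinary pipe(2) pipes; error handling mostly
                | omitted):
                | 
                |   #include <fcntl.h>
                |   #include <stdio.h>
                |   #include <string.h>
                |   #include <unistd.h>
                | 
                |   int main(void)
                |   {
                |       int p[2];
                |       pipe(p);
                |       if (fork() == 0) {
                |           /* reader: non-blocking, but never
                |              actually reads anything */
                |           close(p[1]);
                |           fcntl(p[0], F_SETFL, O_NONBLOCK);
                |           sleep(60);
                |           _exit(0);
                |       }
                |       /* writer: plain blocking write() */
                |       close(p[0]);
                |       char buf[4096];
                |       memset(buf, 'x', sizeof(buf));
                |       long total = 0;
                |       for (;;) {
                |           /* stalls once the pipe is full, despite
                |              the reader's O_NONBLOCK */
                |           ssize_t n = write(p[1], buf, sizeof(buf));
                |           if (n < 0)
                |               break;
                |           total += n;
                |           fprintf(stderr, "%ld bytes\n", total);
                |       }
                |       return 0;
                |   }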
        
           | epcoa wrote:
            | This isn't right. O_NONBLOCK doesn't mean the pipe doesn't
            | stall, it just means you get an immediate errno and don't
            | block on the syscall waiting in the kernel. And this is
            | specific to the file description, of which a pipe has two
            | independent ones. Setting O_NONBLOCK on the writer does not
           | affect the reader. If it did, this would break a ton of
           | common use cases where pipelined programs are designed to not
           | even know what is on the other side.
           | 
            | Not sure what printf has to do with it; it isn't designed to
            | be used with a non-blocking writer (but that only concerns one
            | side). How would the reader being non-blocking change the
            | semantics of the writer? It doesn't.
           | 
           | You can't set O_NONBLOCK on a pipe fd you expect to use with
           | stdio, but that isn't unique to pipes. Whether the reader is
           | O_NONBLOCK will not affect you if you're pushing the writer
           | with printf/stdio.
           | 
           | (This is also a reason why I balk a bit when people refer to
           | O_NONBLOCK as "async IO", it isn't the same and leads to this
           | confusion)
        
             | tedunangst wrote:
              | It's an unfortunately common bug for a process to set
              | non-blocking on a file shared with another process.
        
       | nitwit005 wrote:
       | Just about every form of IPC is "slow". You have decided to pay a
       | performance cost for safety.
        
         | marcosdumay wrote:
         | You shouldn't have to pay that much. Pipes give you almost
         | nothing, so they should cost almost nothing.
         | 
         | Specifically, there aren't many reasons for your fastest IPC to
         | be slower than a long function call.
        
           | nitwit005 wrote:
           | If you don't think pipes offer much, don't use them.
           | 
           | Saying "long function call" doesn't mean much since a
           | function can take infinitely long.
        
             | marcosdumay wrote:
              | A long-distance function call that invalidates everything
              | in your cache.
        
               | saagarjha wrote:
               | ...which is quite expensive.
        
               | marcosdumay wrote:
               | Yes, it is. But it's much cheaper than interacting by
               | pipe.
               | 
                | Linux is optimizing sockets with a similar goal. And it's
                | quite far in that direction. But there's still some
                | margin to gain.
        
         | brigade wrote:
         | Pipes don't exist for safety, they exist as an optimization to
         | pass data between existing programs.
        
           | PaulDavisThe1st wrote:
           | _NOT_ writing and reading to and from a file stored on a
           | drive is not, in this context, an optimization, but a
           | significantly freeing conceptual shift that completely
           | transforms how a class of users conducts themselves when
           | using the computer.
        
       | djaouen wrote:
       | So is Python, but I'm still gonna use it lol
        
       | RevEng wrote:
       | I didn't quite grasp why the original splice has to be so slow.
       | They pointed out what made it slower than vmsplice - in
       | particular allocating buffers and using scalar instructions - but
       | why is this necessary? Why couldn't splice just be reimplemented
       | as vmsplice? I'm sure there is a good reason, but I've missed it.
        
         | Izkata wrote:
         | > Why couldn't splice just be reimplemented as vmsplice?
         | 
         | A possible answer that's currently just below your comment:
         | https://news.ycombinator.com/item?id=41351870
         | 
          | > vmsplice doesn't work with every type of file descriptor.
        
       | koverstreet wrote:
       | One of my sideprojects is intended to address this:
       | https://lwn.net/Articles/976836/
       | 
       | The idea is a syscall for getting a ringbuffer for any supported
       | file descriptor, including pipes - and for pipes, if both ends
       | support using the ringbuffer they'll map the same ringbuffer:
       | zero copy IO, potentially without calling into the kernel at all.
       | 
       | Would love to find collaborators for this one :)
        
         | wakawaka28 wrote:
         | Buffering is there for a reason and this approach will lead to
         | weird failure modes and fragility in scripts. The core issue is
         | that any stream producer might go slower than any given
         | consumer. Even a momentary hiccup will totally mess up the pipe
         | unless there is adequate buffering, and the amount needed is
         | system-dependent.
        
           | Spivak wrote:
           | What makes this any different than other buffer
           | implementations that have a max size? Buffer fills, writes
           | block. What failure mode are you worried about that can't
           | occur with pipes which are also bounded?
        
           | foota wrote:
           | Maybe I misunderstand, but if the ring buffer is full isn't
           | it ok for the sender to just block?
        
             | mort96 wrote:
             | Yeah, and if the ring buffer is empty it's okay for the
             | receiver to just block... exactly as happens today with
             | pipes
        
           | hackernudes wrote:
           | I think the OP's proposal has buffering.
           | 
           | It is different from a pipe - instead of using read/write to
           | copy data from/to a kernel buffer, it gives user space a
           | mapped buffer object and they need to take care to use it
           | properly (using atomic operations on the head/tail and such).
           | 
           | If you own the code for the reader and writer, it's like
           | using shared memory for a buffer. The proposal is about
           | standardizing an interface.
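            | 
            | For illustration, the user-space half of such a scheme is
            | usually some variant of a single-producer/single-consumer
            | ring buffer like this (a sketch with made-up names, not the
            | API from the proposal; the kernel mapping and the
            | futex/poll wakeups are left out):
            | 
            |   #include <stdatomic.h>
            |   #include <stddef.h>
            |   #include <stdint.h>
            | 
            |   #define RING_SIZE 4096   /* power of two */
            | 
            |   typedef struct {
            |       _Atomic uint32_t head;   /* producer writes */
            |       _Atomic uint32_t tail;   /* consumer writes */
            |       uint8_t data[RING_SIZE];
            |   } ring_t;
            | 
            |   /* producer: returns bytes written, 0 == full */
            |   size_t ring_write(ring_t *r, const uint8_t *buf,
            |                     size_t len)
            |   {
            |       uint32_t head = atomic_load_explicit(&r->head,
            |                           memory_order_relaxed);
            |       uint32_t tail = atomic_load_explicit(&r->tail,
            |                           memory_order_acquire);
            |       size_t space = RING_SIZE - (head - tail);
            |       size_t n = len < space ? len : space;
            |       for (size_t i = 0; i < n; i++)
            |           r->data[(head + i) & (RING_SIZE - 1)] = buf[i];
            |       atomic_store_explicit(&r->head, head + n,
            |                             memory_order_release);
            |       return n;
            |   }
            | 
            |   /* consumer: mirror image, returns bytes read */
            |   size_t ring_read(ring_t *r, uint8_t *buf, size_t len)
            |   {
            |       uint32_t tail = atomic_load_explicit(&r->tail,
            |                           memory_order_relaxed);
            |       uint32_t head = atomic_load_explicit(&r->head,
            |                           memory_order_acquire);
            |       size_t avail = head - tail;
            |       size_t n = len < avail ? len : avail;
            |       for (size_t i = 0; i < n; i++)
            |           buf[i] = r->data[(tail + i) & (RING_SIZE - 1)];
            |       atomic_store_explicit(&r->tail, tail + n,
            |                             memory_order_release);
            |       return n;
            |   }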
        
         | messe wrote:
         | > and for pipes, if both ends support using the ringbuffer
         | they'll map the same ringbuffer
         | 
         | Is there planned to be a standardized way to signal to the
         | other end of the pipe that ring buffers are supported, so this
         | could be handled transparently in libc? If not, I don't really
         | see what advantage it gets you compared to shared memory + a
         | futex for synchronization--for pipes that is.
        
           | immibis wrote:
           | Presumably the same interface still works if the other side
           | is using read/write.
        
             | koverstreet wrote:
             | correct
        
         | caf wrote:
         | Presumably ringbuffer_wait() can also be signalled through
         | making it 'readable' in poll()?
        
           | koverstreet wrote:
           | yes, I believe that's already implemented; the more
           | interesting thing I still need to do is make futex() work
           | with the head and tail pointers.
        
         | phafu wrote:
         | At least for user space usage, I'm not sure a new kernel thing
          | is needed. Quite a while ago I implemented a user-space
          | (single-producer / single-consumer) ring buffer, which uses an
         | eventfd to mimic pipe behavior and functionality quite closely
         | (i.e. being able to sleep & poll for ring buffer full/empty
         | situations), but otherwise operates lockless and without
         | syscall overhead.
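          | 
          | For anyone wanting to do the same, the eventfd part is
          | roughly this (a sketch assuming one producer and one
          | consumer; the ring buffer itself and error handling are
          | left out):
          | 
          |   #include <poll.h>
          |   #include <stdint.h>
          |   #include <sys/eventfd.h>
          |   #include <unistd.h>
          | 
          |   int data_ready;   /* shared by producer and consumer */
          | 
          |   void setup(void)
          |   {
          |       data_ready = eventfd(0, EFD_NONBLOCK);
          |   }
          | 
          |   /* producer: call after publishing data to the ring */
          |   void kick_consumer(void)
          |   {
          |       uint64_t one = 1;
          |       write(data_ready, &one, sizeof(one));
          |   }
          | 
          |   /* consumer: call when the ring looks empty */
          |   void wait_for_data(void)
          |   {
          |       struct pollfd pfd = { .fd = data_ready,
          |                             .events = POLLIN };
          |       poll(&pfd, 1, -1);      /* sleep until kicked */
          |       uint64_t count;
          |       read(data_ready, &count, sizeof(count));
          |   }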
        
       | 0xbadcafebee wrote:
       | Calling Linux pipes "slow" is like calling a Toyota Corolla
       | "slow". It's fast enough for all but the most extreme use cases.
       | Are you racing cars? In a sport where speed is more important
       | than technique? Then get a faster car. Otherwise stick to the
       | Corolla.
        
         | AkBKukU wrote:
         | I have a project that uses a proprietary SDK for decoding raw
          | video. I output the decoded data as pure RGBA in a way FFmpeg
          | can read through a pipe to re-encode the video to a standard
          | codec. FFmpeg can't include the non-free SDK in their source,
          | and it would be wildly impractical to store the pure RGBA in
          | a file. So pipes are the only way to do it; there are valid
          | reasons to want high-throughput pipes.
        
           | whartung wrote:
           | What about domain sockets?
           | 
           | It's clumsier, to be sure, but if performance is your goal,
           | the socket should be faster.
        
             | AkBKukU wrote:
             | It looks like FFmpeg does support reading from sockets
             | natively[1], I didn't know that. That might be a better
             | solution in this case, I'll have to look into some C code
             | for writing my output to a socket to try that some time.
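              | 
              | Something like this, presumably (just a sketch; the
              | socket path is made up, and I'm assuming FFmpeg connects
              | to the socket itself, which seems to be the default for
              | its unix protocol):
              | 
              |   #include <string.h>
              |   #include <sys/socket.h>
              |   #include <sys/un.h>
              |   #include <unistd.h>
              | 
              |   /* listen on a unix socket, wait for FFmpeg */
              |   int serve_frame_sink(const char *path)
              |   {
              |       int srv = socket(AF_UNIX, SOCK_STREAM, 0);
              |       struct sockaddr_un a = { .sun_family = AF_UNIX };
              |       strncpy(a.sun_path, path, sizeof(a.sun_path) - 1);
              |       unlink(path);   /* drop any stale socket file */
              |       bind(srv, (struct sockaddr *)&a, sizeof(a));
              |       listen(srv, 1);
              |       return accept(srv, NULL, NULL);
              |   }
              | 
              |   /* push one decoded RGBA frame; write() can be short */
              |   int send_frame(int fd, const unsigned char *rgba,
              |                  size_t len)
              |   {
              |       while (len > 0) {
              |           ssize_t n = write(fd, rgba, len);
              |           if (n < 0)
              |               return -1;
              |           rgba += n;
              |           len -= (size_t)n;
              |       }
              |       return 0;
              |   }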
             | 
             | [1] https://ffmpeg.org/ffmpeg-protocols.html#unix
        
             | ptx wrote:
             | Why should sockets be faster?
        
               | uncanneyvalley wrote:
               | Sockets remap pages without moving any data while pipes
               | have to copy the data between fds.
        
           | CyberDildonics wrote:
           | _So pipes are the only way to do it_
           | 
            | Let's not get carried away. You can use ffmpeg as a library
           | and encode buffers in a few dozen lines of C++.
        
             | Almondsetat wrote:
             | ffmpeg's library is notorious for being a complete and
             | utter mess
        
               | CyberDildonics wrote:
               | It worked extremely well when I did something almost
               | exactly like this. I gave it buffers of pixels in memory
               | and it spit out compressed video.
        
             | quietbritishjim wrote:
             | The parent comment mentioned license incompatibility, which
             | I guess would still apply if they used ffmpeg as a library.
        
             | whiterknight wrote:
             | And you go from having a well defined modular interface
             | that's flexible at runtime to a binary dependency.
        
               | CyberDildonics wrote:
               | You have the dependency either way, but if you use the
               | library you can have one big executable with no external
               | dependencies and it can actually be fast.
               | 
               | If there wasn't a problem to solve they wouldn't have
               | said anything. If you want something different you have
               | to do something different.
        
           | jcelerier wrote:
           | Why not just store the output of the proprietary codec in an
           | AVFrame that you'd pass to libavcodec in your own code?
        
           | ploxiln wrote:
           | What percentage of CPU time is used by the pipe in this
           | scenario? If pipes were 10x faster, would you really notice
           | any difference in wall-clock-time or overall-cpu-usage, while
           | this decoding SDK is generating the raw data and ffmpeg is
           | processing it? Are these video processing steps anywhere near
           | memory copy speeds?
        
           | Sesse__ wrote:
           | At some point, I had a similar issue (though not related to
           | licensing), and it turned out it was faster to do a high-
           | bitrate H.264-encode of the stream before sending it over the
           | FFmpeg socket than sending the raw RGBA data, even over
           | localhost... (There was some minimal quality loss, of course,
           | but it was completely irrelevant in the big picture.)
        
             | jraph wrote:
             | > There was some minimal quality loss, of course, but it
             | was completely irrelevant in the big picture
             | 
             | But then the solutions are not comparable anymore, are
             | they? Would a lossless codec instead have improved speed?
        
               | chupasaurus wrote:
               | H.264 has lossless mode.
        
               | Sesse__ wrote:
               | No, because I had hardware H.264 encoder support. :-)
               | (The decoding in FFmpeg on the other side was still
               | software. But it was seemingly much cheaper to do a H.264
               | software decode.)
        
         | Someone wrote:
         | This isn't code in some project that will run only a few
         | billion times in its lifetime; it is used frequently on
         | millions, if not billions, of computers.
         | 
         | Because of that, it is economical to spend lots of time
         | optimizing it, even if it only makes the code marginally more
         | efficient.
        
           | samastur wrote:
           | That's not how economics works.
           | 
           | If 100 million people each save 1 cent because of your work,
           | you saved 1 million in total, but in practice nobody is
           | observably better off.
        
             | sebstefan wrote:
             | So? It doesn't need to be visible to be worth optimizing?
        
               | samastur wrote:
                | If you are making an economic (financial) argument for
                | change like the original comment did, then yes, it should
                | have a visible positive effect.
                | 
                | Obviously not if you are doing it for your own fun or just
                | improving the state of the art.
        
             | h0p3 wrote:
             | There are people whose lives are improved by having an
             | extra cent to spend. Seriously. It is measurable,
             | observable, and real. It might not have a serious impact on
             | the vast majority of people, but there are people who have
             | very, very little money or have found themselves on a
             | tipping point that small; pinching pennies alters their
             | utility outcomes.
        
               | InDubioProRubio wrote:
               | https://xkcd.com/951/
               | 
               | Also, if you micro-optimize and that becomes your whole
               | focus and ability to focus, your business is unable to
               | innovate aka traverse the economic landscape and find new
               | rich gradients and sources of "economic food", making you
               | a dinosaur in a pit, doomed to eternally cannibalize on
               | what other creatures descend into the pit and highly
               | dependent on the pit not closing up for good.
        
               | samastur wrote:
               | No, they really aren't. Absolutely nobody's life is
               | measurably improved because of 1 cent one time.
               | 
               | I admit my opinion is not based on first hand knowledge,
               | but I have for years worked on projects trying to address
               | poverty at different parts of this planet and can't think
               | of a single one where this would be even remotely true.
        
               | h0p3 wrote:
               | > Absolutely nobody's life is measurably improved because
               | of 1 cent one time...I admit my opinion is not based on
               | first hand knowledge...
               | 
               | My opinion, however, is based on first-hand knowledge.
               | I've been the kid saving those pennies, and I've worked
               | with those kids. I understand that in the vast majority
               | of cases, an extra penny does nothing more. That isn't
               | what your original comment above claimed, nor is it what
               | you've claimed here. My counterexample is enough to
               | demonstrate the falsehood. Arguing that there are better
               | ways to distribute these pennies is another matter, and I
               | take that seriously as well.
        
             | whiterknight wrote:
             | You're describing the outcome of one individual person.
             | Money is just a tool for allocating resources. Saving 1
             | million of resources is a good thing.
        
               | wang_li wrote:
               | It's a meaningless thing if it's 1 million resources
               | divided into 1 million actors who have no ability to
               | leverage a short term gain of 1 resource. It's short term
               | because the number of computers that are 100% busy 100%
               | of the time is zero. A pipe throughput improvement means
               | nothing if the computer isn't waiting on pipes a lot.
        
               | carlhjerpe wrote:
                | Eventually it all ends up at a power plant; there's an
                | insane number of people living on the European grid. If
               | an optimization ends up saving a couple tonnes of CO2 per
               | year it is hard to not call it a good thing.
               | 
                | https://en.m.wikipedia.org/wiki/Synchronous_grid_of_Continen...
        
               | wang_li wrote:
               | A couple tons spread across 400 million people with a per
               | capita emission of 5 tons per year is in the noise. If
               | we're at the point of trying to hyper optimize there are
               | far more meaningful targets than pipe throughput.
        
               | sqeaky wrote:
               | You are arguing against the concept of "division of
               | labor".
               | 
               | You are a few logical layers removed, but fundamentally
               | that is at the heart of this. It isn't just about what
                | _you_ think can or can't be leveraged. Reducing waste in
                | a centralized fashion is excellent because it will enable
                | other waste to be reduced in a self-reinforcing cycle, as
                | long as experts in their domain keep getting the benefits
                | of other experts. The chip experts make better
                | instructions, so the library experts make better software
                | libs; they add their 2% and now it is more than 4%, so the
                | application experts can have 4% more throughput and buy
                | 4% fewer servers, or spend way more than 4% less on
                | optimizing, or whatever, and add their 2% optimization and
                | now we are at more than 6%, and the end users can do
                | their business slightly better, and so on in a chain that
                | is all of society. Sometimes those gains are muted.
                | Sometimes that speed turns into error checking, power
                | saving, more throughput, and everyone trying to do their
                | best to do more with less.
        
               | carlhjerpe wrote:
                | Absolutely, if your focus is saving emissions, don't
                | optimize pipes. But if you optimize an interface people
                | use, it's a good thing either way, right?
        
             | azulster wrote:
             | not if it costs 200 million in man-hours to optimize
        
           | hi-v-rocknroll wrote:
           | Citation needed.
           | 
           | Pipes aren't used everywhere in production in hot paths. That
           | just doesn't happen.
        
             | ibern wrote:
             | A lot of bioinformatics code relies very heavily on pipes.
        
             | throwway120385 wrote:
             | What production? You need to check your assumptions about
             | what people do with general purpose computers and why. Just
             | because it doesn't happen in your specific field of
             | computing doesn't mean it never happens anywhere or that it
             | just isn't important.
        
         | Ultimatt wrote:
          | A better analogy is that it's like a society that uses steam
          | trains attempting to industrially compete with a society that
          | uses bullet trains (literally a similar factor of improvement). The
         | UK built its last steam train for national use in 1960, four
         | years later the Shinkansen was in use in Japan. Which of those
         | two nations has a strong international industrial base in 2024?
        
           | billfruit wrote:
            | Well, the Mallard's top speed was very close to that of the
            | first-generation Shinkansen 0 Series trains.
        
         | ploxiln wrote:
         | Indeed. In the author's case, the slow pipe is moving data at
          | 17 GB/s, which is over 130 Gbps.
         | 
         | I've used pipes for a lot of stuff over 10+ years, and never
         | noticed being limited by the speed of the pipe, I'm almost
         | certain to be limited by tar, gzip, find, grep, nc ... (even
         | though these also tend to be pretty fast for what they do).
        
           | crabbone wrote:
           | I had two cases in my practice where pipes were slow. Both
           | related to developing a filesystem.
           | 
           | 1. Logging. At first our tools for reading the logs from a
           | filesystem management program were using pipes, but they
           | would be overwhelmed quickly (even before it would overwhelm
           | pagers and further down the line). We had to write our own
           | pager and give up on using pipes.
           | 
           | 2. Storage again, but a different problem: we had a setup
           | where we deployed SPDK to manage the iSCSI frontend duties,
           | and our component to manage the actual storage process. It
           | was very important that the communication between these two
           | components be as fast and as memory-efficient as possible.
            | The slowness of pipes also comes from the fact that they have
           | to copy memory. We had to extend SPDK to make it communicate
           | with our component through shared memory instead.
           | 
           | So, yeah, pipes are unlikely to be the bottleneck of many
           | applications, but definitely not all.
        
         | jiehong wrote:
         | Replace "Linux pipes" by "Electron apps", and people would not
         | agree.
         | 
         | Also, why leave performance on the table by default? Just
         | because "it should be enough for most people I can think of"?
         | 
         | Add Tesla motors to a Toyota Corolla and now you've got a
         | sportier car by default.
        
           | azulster wrote:
            | electron apps are an optimization all by themselves.
            | 
            | it's not optimizing the footprint or speed of the application;
            | it's optimizing the resources and speed of development and
            | deployment.
        
         | paulannesley wrote:
         | Sometimes the best answer really is a faster Corolla!
         | 
         | https://www.toyota.com/grcorolla/
         | 
         | (These machines have amazing engineering and performance, and
         | their entire existence is a hack to work around rules making it
          | unviable to bring the intended GR Yaris to the US market.
          | Maybe _just_ enough eng/perf/hack/market relevance to HN folk
          | to warrant my lighthearted reply. Also, the company president
          | is still on the tools.)
        
           | 2OEH8eoCRo0 wrote:
           | There's no replacement for displacement.
        
             | Sohcahtoa82 wrote:
             | Apparently there is, because that car only has a 1.6L
             | 3-cylinder engine and yet produces a whopping 300
             | horsepower.
        
               | 2OEH8eoCRo0 wrote:
               | When? In the RPM sweet spot after waiting an eternity for
               | the turbos to spool? There's always a catch.
        
               | zackmorris wrote:
               | I didn't expect to be writing this comment on this
                | article hah, but apparently there is such a thing as
                | a surge tank for storing boost pressure to mostly
               | eliminate turbo lag:
               | 
               | https://www.highpowermedia.com/Archive/the-surge-tank
               | 
               | https://forums.tdiclub.com/index.php?threads/air-tank-or-
               | com...
               | 
               | It's such an obvious idea that I'm kind of shocked it
               | took them until 2003 to do it. Surely someone thought of
               | this in like the 60s.
               | 
               | I would probably do it differently with a separate
               | supercharger to intermittently maintain another 1-2+ bar
               | of boost to make the tank less than half as large, but
               | that would add complexity, and what do I know.
        
         | mort96 wrote:
         | I mean why waste CPU time moving data between buffers when you
         | could get the same semantics and programming model without
         | wasting that CPU time?
        
         | qsantos wrote:
         | To be frank, this is more of a pretext to understand what pipes
         | and vmsplice do exactly.
        
         | bastawhiz wrote:
         | I'm not sure that logic makes sense. Making a thing that's used
        | ubiquitously a few percent faster is absolutely a worthwhile
        | investment of effort. Individual operations might not be very
        | much faster, but it's (in aggregate) a ton of electricity and
        | time globally.
        
           | 0xbadcafebee wrote:
           | That's what's called premature optimization. Everywhere in
            | our lives we do inefficient things. Despite the inefficiency,
            | we gain something else: ease of use or access, simplicity,
           | lower cost, more time, etc. The world and life as we know it
           | is just a series of tradeoffs. Often optimization before it's
           | necessary actually creates more drawbacks than benefits. When
           | it's easy and has a huge benefit, or is necessary, then
           | definitely optimize. It may be hard to accept this as a
           | general principle, but in practice (mostly in hindsight) it
           | becomes very apparent.
           | 
            | Donald Knuth thinks the same:
            | https://en.wikipedia.org/wiki/Program_optimization#When_to_o...
        
             | bastawhiz wrote:
             | It's definitionally not premature optimization. Pipes exist
             | (and have existed for decades). This is just
             | "optimization". "Premature" means it's too soon to
             | optimize. When is it no longer too soon? In another few
             | decades? When Linux takes another half of Windows usage? It
             | would be premature if they were working on optimizations
             | before there were any users or a working implementation.
             | But it's not: they're a fundamental primitive of the OS
             | used by tens of millions of applications.
             | 
             | The tradeoffs you're discussing are considerations. Is it
             | worth making a ubiquitous thing faster at the expense of
             | some complexity? At some point that answer is "yes", but
             | that is absolutely not "When it's easy and has a huge
             | benefit". The most important optimizations you personally
              | benefit from were neither easy nor individually huge. They
              | were hard-won and generally small, but they compound on
              | other optimizations.
             | 
             | I'll also note that the Knuth quote you reference says
             | exactly this:
             | 
             | > Yet we should not pass up our opportunities in that
             | critical 3%
        
               | 0xbadcafebee wrote:
               | [delayed]
        
         | tacone wrote:
        | Wait, it depends on what you're doing. Pipes also create a
        | subshell, so they are a big no-no when used inside a loop.
        | 
        | Suppose you're iterating over the lines of stdout and need to use
        | sed, cut and so on; using pipes will slow things down
        | considerably (and the sed/cut startup time will make things
        | worse).
         | 
         | Using bash/zsh string interpolation would be much faster.
        
       | JoshTriplett wrote:
       | This is a side note to the main point being made, but on modern
       | CPUs, "rep movsb" is just as fast as the fastest vectorized
       | version, because the CPU knows to accelerate it. The name of the
       | kernel function "copy_user_enhanced_fast_string" hints at this:
        | the CPU features are ERMS ("Enhanced REP MOVSB/STOSB", which
        | makes "rep movsb" faster for anything above a certain length
        | threshold) and FSRM ("Fast Short REP MOVSB", which makes
        | "rep movsb" faster for shorter moves too).
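        | 
        | For reference, the whole of such a copy routine on x86-64 can
        | be as small as this (a sketch using GCC/Clang inline asm;
        | whether it beats the vectorized paths depends on the CPU and
        | on the length):
        | 
        |   #include <stddef.h>
        | 
        |   /* memcpy via "rep movsb"; relies on ERMS/FSRM for speed */
        |   static void *memcpy_erms(void *dst, const void *src, size_t n)
        |   {
        |       void *ret = dst;
        |       asm volatile("rep movsb"
        |                    : "+D"(dst), "+S"(src), "+c"(n)
        |                    :
        |                    : "memory");
        |       return ret;
        |   }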
        
         | jeffbee wrote:
         | Also worth noting that Linux has changed the way it uses ERMS
         | and FSRM in x86 copy multiple times since kernel 6.1 used in
         | the article. As a data-dote, my machine that has FSRM and ERMS
         | -- surprisingly, the latter is not implied by the former --
         | hits 17GB/s using plain old pipes and a 32KiB buffer on Linux
         | 6.8
        
         | Lockal wrote:
          | This is not the full truth: "rep movsb" is fast only up to
          | another threshold, after which either normal or non-temporal
          | stores are faster.
         | 
         | All thresholds are described in
         | https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...
         | 
          | And they are not final, i.e. Noah Goldstein still updates them
         | every year.
        
           | jeffbee wrote:
            | Which of these is "faster" depends greatly on whether you
           | have the very rare memcpy-only workload, or if your program
           | actually does something useful. Many people believe, often
           | with good evidence, that the most important thing is for
           | memcpy to occupy as few instruction cache lines as is
           | practical, instead of being something that branches all over
           | kilobytes of machine code. For comparison, see the x86
           | implementations in LLVM libc.
           | 
           | https://github.com/llvm/llvm-
           | project/blob/main/libc/src/stri...
        
           | adrian_b wrote:
           | It depends on the CPU. There is no good reason for "rep
           | movsb" to be slower at any big enough data size.
           | 
           | On a Zen 3 CPU, "rep movsb" becomes faster than or the same
           | as anything else above a length slightly greater than 2 kB.
           | 
           | However there is a range of multi-megabyte lengths, which
           | correspond roughly with sizes below the L3 cache but
           | exceeding the L2 cache, where for some weird reason "rep
           | movsb" becomes slower than SIMD non-temporal stores.
           | 
           | At lengths exceeding the L3 size, "rep movsb" becomes again
           | the fastest copy method.
           | 
           | The Intel CPUs have different behaviors.
        
         | koverstreet wrote:
         | I'm still waiting for rep movsb and rep stosb to be fast enough
         | to delete my simple C loop versions, for short memcpys.
        
           | adrian_b wrote:
           | It is likely that on recent CPUs they are always faster than
           | C loop versions.
           | 
           | On my Zen 3 CPU, for lengths of 2 kB or smaller it is
           | possible to copy faster than with "rep movsb", but by using
           | SIMD instructions (or equivalently the builtin "memcpy"
           | provided by most C compilers), not with a C loop (unless the
           | compiler recognizes the C loop and replaces it with the
           | builtin memcpy, which is what some compilers will do at high
           | optimization levels).
        
         | haberman wrote:
         | If that's the case, when can I expect C compilers to inline
         | variable-length memcpy() the way they will inline fixed-length
         | memcpy today?
        
       | nyanpasu64 wrote:
       | How do you gather profiling information for kernel function calls
       | from a user program?
        
         | qsantos wrote:
         | I'll write an article on the flamegraphs specifically, but to
         | get the data, just follow Julia's article!
         | 
         | https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/
        
           | ismaildonmez wrote:
            | Could you clarify how you are testing the speed of the first
           | example where you are not writing anything to stdout? Thanks.
        
             | qsantos wrote:
             | For the first Rust program, where I just write to memory, I
             | just use the time utility when running the program from
             | zsh. Then, I divide the number of bytes written by the
             | number of seconds elapsed. That's why it's not an infinite
             | loop ;)
        
               | ismaildonmez wrote:
               | Thanks!
        
       | cowsaymoo wrote:
       | What is the library used to profile the program?
        
         | tzury wrote:
         | pv
         | 
         | https://linux.die.net/man/1/pv
         | 
         | it is in the pipe command `... | pv > /dev/null`
        
           | throw12390 wrote:
            | `pv --discard` is faster by 8% (on my system):
            | 
            |   % pv </dev/zero >/dev/null       54.0GiB/s
            |   % pv </dev/zero --discard        58.7GiB/s
        
             | IWeldMelons wrote:
             | Which is suspiciously close to the speed of DDR4.
        
       | qsantos wrote:
        | I am again getting the Hacker News hug of death. The situation
        | is better than last time thanks to caching the WordPress pages,
       | but loading the page can still take a few seconds, so bear with
       | me!
        
       | Borg3 wrote:
       | Haha. When I read the title I smiled. Linux pipes slow? Moook..
        | Now try Cygwin pipes. That's what I call slow!
        | 
        | Anyway, nice article, it's good to know what's going on under the
        | hood.
        
         | MaxBarraclough wrote:
         | I'd assumed Cygwin pipes are just Windows pipes, is that not
         | the case?
        
           | tyingq wrote:
            | Not a comprehensive list of problems, and not current, but a
            | good illustration of the kind of issues that people have
            | run into:
           | 
           | https://cygwin.com/pipermail/cygwin-
           | patches/2016q1/008301.ht...
        
           | Borg3 wrote:
            | It's not that easy. Yeah, they are, but there is a lot of
            | POSIX-like glue inside so they work correctly with select()
            | and other alarms. The code is very complicated.
            | 
            | But still, kudos to the Cygwin developers for creating Cygwin :)
            | Great work, even though it has some issues.
        
       | stabbles wrote:
       | A bold claim for a blog that takes about 20 seconds to load.
        
         | yas_hmaheshwari wrote:
          | This post has gone to the top of Hacker News, so I think we
          | should cut him some slack.
          | 
          | Looks like an amazing article, with so much to learn about what
          | happens under the hood.
        
           | ben-schaaf wrote:
           | HN generates ~20k page views over the course of a day with a
           | peak of 2k/h: https://harrisonbroadbent.com/blog/hacker-news-
           | traffic-spike.... At ~1MB per page load - not sure how
           | accurate this is, I don't think it fully loaded - this static
            | blogpost requires 0.55MB/s to meet demand. An original
            | Raspberry Pi B (10 Mbps Ethernet) on the average French mobile
            | internet connection (8 Mbps) provides double that.
           | 
           | I don't mean this as a slight to anyone, I just want to point
           | out the HN "hug of death" can be trivially handled by a
           | single cheap VPS without even breaking a sweat.
        
             | qsantos wrote:
             | Totally agree, my server should definitely be able to
             | handle the load. But this is a WordPress install, which is
             | definitely doing too much work for what it is when just
             | serving the pages. I plan to improve on this!
        
         | wvh wrote:
         | I believe that when it's a .fr, they call it nonchalance...
        
       | goodpoint wrote:
       | Excellent article even if, to be honest, the title is clickbait.
        
         | chmaynard wrote:
         | Agreed. Titles that don't use quantifiers are almost always
         | misleading at best.
        
       | sixthDot wrote:
       | > I do not know why the JMP is not just a RET, however.
       | 
       | The jump seems generated by the expansion of the `ASM_CLAC`
       | macro, which is supposed to change the EFLAGS register ([1],
       | [2]). However in this case the expansion looks like it does
        | nothing (maybe because of the target?). I'd be interested to
       | know more about that. Call to the wild.
       | 
       | [1]:
       | https://github.com/torvalds/linux/blob/master/arch/x86/inclu...
       | 
       | [2]: https://stackoverflow.com/a/60579385
        
       | fatcunt wrote:
       | > I do not know why the JMP is not just a RET, however.
       | 
       | This is caused by the CONFIG_RETHUNK option. In the disassembly
       | from objdump you are seeing the result of RET being replaced with
       | JMP __x86_return_thunk.
       | 
       | https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
       | 
       | https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/ret...
       | 
       | > The NOP instructions at the beginning and at the end of the
       | function allow ftrace to insert tracing instructions when needed.
       | 
       | These are from the ASM_CLAC and ASM_STAC macros, which make space
       | for the CLAC and STAC instructions (both of them three bytes in
       | length, same as the number of NOPs) to be filled in at runtime if
       | X86_FEATURE_SMAP is detected.
       | 
       | https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
       | 
       | https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
       | 
       | https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/...
        
         | qsantos wrote:
         | Thanks a lot for the information! I was not quite sure what to
          | look for in this case. I have added a note in the article.
        
         | ndesaulniers wrote:
         | There are perhaps only a handful of kernel developers that:
         | 
         | 1. would know the above
         | 
         | 2. would choose such an obnoxious throwaway handle
        
       | arendtio wrote:
       | I know pipes primarily from shell scripts. Are they being used in
       | other contexts as extensively, too? Like C or Rust programs?
        
       | rwmj wrote:
       | Be interesting to see a version using io_uring, which I think
       | would let you pre-share buffers with the kernel avoiding some
       | copies, and avoid syscall overhead (though the latter seems
       | negligible here).
        
         | qsantos wrote:
         | That sounds like a good idea!
        
           | rwmj wrote:
           | I'm not claiming it'll be faster! Additionally io_uring has
           | its own set of challenges, such as whether it's better to
           | allocate one ring per core or one ring per application
           | (shared by some or all cores). Pre-sharing buffers has trade-
           | offs too, particularly in application complexity [alignment,
           | you have to be careful not to reuse a buffer before it is
           | consumed] versus the efficiency of zero copy.
        
       | donaldihunter wrote:
       | Something I didn't see mentioned in the article about AVX512,
       | aside from the xsave/xrstor overhead, is that AVX512 is power
       | hungry and causes CPU frequency scaling. See [1], [2] for details
       | and as an example of how nuanced it can get.
       | 
       | [1] https://www.intel.com/content/dam/www/central-
       | libraries/us/e...
       | 
       | [2]
       | https://www.intel.com/content/www/us/en/developer/articles/t...
        
         | Narishma wrote:
         | That is only the case in specific Intel CPU models.
        
       | up2isomorphism wrote:
        | Someone tasted some bread and thought it was not sweet enough,
        | which is fine. But calling the bread bland is funny, because it
        | was never meant to taste sweet.
        
       | mparnisari wrote:
       | I get PR_CONNECT_RESET_ERROR when trying to open the page
        
         | qsantos wrote:
         | My server struggles a bit with the load on the WordPress site.
         | You should be fine just reloading. I will make sure to improve
         | things for the next time!
        
       | jvanderbot wrote:
       | > Although SSE2 is always available on x86-64, I also disabled
       | the cpuid bit for SSE2 and SSE to see if it could nudge glibc
       | into using scalar registers to copy data. I immediately got a
       | kernel panic. Ah, well.
       | 
       | I think you need to recompile your compiler, or disable those
        | explicitly via link / cc flags. Compilers are fairly hard to
        | coax into (or dissuade from) emitting SIMD instructions, IMHO.
        
       | faizshah wrote:
       | This is a really cool post and that is a massive amount of
       | throughput.
       | 
       | In my experience in data engineering, it's very unlikely you can
        | exceed 500 MB/s of throughput in your business logic, as most
        | libraries you're using are not optimized to that degree (SIMD
        | etc.). That being said, I think it's a good technique to try out.
       | 
       | I'm trying to think of other applications this could be useful
       | for. Maybe video workflows?
        
       | jeremyscanvic wrote:
       | Great post! I didn't know about vmsplice(2). I'm glad to see a
       | former ENSL student here as well!
        
         | qsantos wrote:
         | Hey!
        
       | yencabulator wrote:
       | FUSE can be a bit trickier than a single queue of data chunks.
       | Reads from /dev/fuse actually pick the right message to read
        | based on priorities, and there are cases where the message queue is
       | meddled with to e.g. cancel requests before they're even sent to
       | userspace. If you naively switch it to eagerly putting messages
       | into a userspace-visible ringbuffer, you might significantly
       | change behavior in cases like interrupting slow operations.
       | Imagine having to fulfill a ringbuf worth of requests to a
       | misbehaving backend taking 5sec/op, just to see the cancellations
       | at the very end.
        
       ___________________________________________________________________
       (page generated 2024-08-26 23:01 UTC)