[HN Gopher] Linux Pipes Are Slow
___________________________________________________________________
Linux Pipes Are Slow
Author : qsantos
Score : 308 points
Date : 2024-08-25 16:52 UTC (1 day ago)
(HTM) web link (qsantos.fr)
(TXT) w3m dump (qsantos.fr)
| jheriko wrote:
| just never use pipes. they are some weird archaism that need to
| die :P
|
| the only time ive used them is external constraints. they are
| just not useful.
| henearkr wrote:
| Pipes are extremely useful. But I guess it just depends on your
| use case. I do a lot of scripting.
|
| If you dislike their (relative) slowness, it's open source, you
| can participate in making them faster.
|
| And I'm sure that after this HN post we'll see some patches and
| merge requests.
| noloblo wrote:
| +1 yes, pipes are what make shell scripting quite useful and
| allow for easy composition of the different unix shell utilities
| hnlmorg wrote:
| The very thing that makes pipes useful is what also makes
| them slow. I don't think there is much we can do to fix that
| without breaking POSIX compatibility entirely.
|
| Personally I think there's much worse ugliness in POSIX than
| pipes. For example, I've just spent the last couple of days
| debugging a number of bugs in a shell's job control code
| (`fg`, `bg`, `jobs`, etc).
|
| But despite its warts, I'm still grateful we have something
| like POSIX to build against.
| effie wrote:
| What possible bugs can there be in those? They are quite
| simple to use and work as expected.
| khafra wrote:
| They work as expected on Redhat and Debian. "POSIX"
| leaves open a lot of possibility for less-well-tested
| systems. They could be writing shellscripts on Minix or
| HelenOS.
| hnlmorg wrote:
| I'm talking about shell implementation not shell usage.
|
| To implement job control, there are several signals you
| need to be aware of:
|
| - SIGTSTP (what the TTY sends if it receives ^Z)
|
| - SIGSTOP (what a shell sends to a process to suspend it)
|
| - SIGCONT (what a shell sends to a process to resume it)
|
| - SIGCHLD (what the shell needs to listen for to see
| there is a change in state for a child process -- this is
| also sometimes referred to as SIGCLD)
|
| - SIGTTIN (received if a background process reads from
| stdin)
|
| - SIGTTOU (received if a background process writes to the
| TTY or changes its modes)
|
| Some of these signals are received by the shell, some are
| by the process. Some are sent from the shell and others
| from the kernel.
|
| SIGCHLD isn't just raised for when a child process goes
| into suspend, it can be raised for a few different
| changes of state. So if you receive SIGCHLD you then need
| to inspect your children (of course you don't know what
| child has triggered SIGCHLD because signals don't contain
| metadata) to see if any of them have changed their state
| in any way. Which is "fun"....
|
| And all of this only works if you manage to fork your
| children with special flags to set their PGID (not PID,
| another meta ID which represents what process _group_
| they belong to), and send magic syscalls to keep passing
| ownership of the TTY (if you don't tell the kernel which
| process owns the TTY, ie is in the foreground, then
| either your child process and/or your shell will crash
| due to permission issues).
|
| None of this is 100% portable (see footnote [1]) and all
| of this also depends on well-behaved applications not
| catching signals themselves and doing something funky
| with them.
|
| The bug I've got is that Helix editor is one of those
| applications doing something non-standard with SIGTSTP
| and assuming anything that breaks as a result is a parent
| process which doesn't support job control. Except my
| shell _does_ support job control and still crashes as a
| result of Helix's non-standard implementation.
|
| In fairness to Helix, my shell does also implement job
| control in a non-standard way because I wanted to add
| some wrappers around signals and TTYs to make the
| terminal experience a little more comfortable than it is
| with POSIX-compliant shells like Bash. But because job
| control (and signals and TTYs in general) are so archaic,
| the result is that there are always going to be edge case
| bugs with applications (like Helix) that have implemented
| things a little differently themselves too.
|
| So they're definitely not easy to use and can break in
| unexpected ways if even just one application doesn't
| implement things in expected ways.
|
| [1] By the way, this is all ignoring subtle problems that
| different implementations of PTYs (eg terminal emulators,
| terminal multiplexors, etc) and different POSIX kernels
| can introduce too. And those can be a _nightmare_ to
| track down and debug!
| akira2501 wrote:
| > just never use pipes.
|
| vmsplice doesn't work with every type of file descriptor.
| eschewing some technology entirely because it seems archaic or
| because it makes writing "the fastest X software" seem harder
| is just sloppy engineering.
|
| > they are just not useful.
|
| Then you have not written enough software yet to discover how
| they are useful.
| gpderetta wrote:
| Most importantly, the fast FizzBuzz toy vmsplices into
| /dev/null.
|
| Nothing ever touches those pages on the consumer side and
| they can be refused immediately.
|
| If you actually want a functional program using vmsplice,
| with a real consumer, things get hairy very quickly.
| w0m wrote:
| You can replace a 10k line python or shell script with a single
| creative line of pipes/xargs/etc on the cli.
|
| It's incredibly valuable on the day to day.
| hagbard_c wrote:
| That's like telling a builder never to use nails but turn to
| adhesives instead. He will look at his hammer and his nails as
| well as a stack of 2x4s, grin and in no time slap together a
| box into which he will stuff you with a bottle of glue with the
| advice to now go and play while the grown-ups take care of
| business.
|
| Sure, you could build that box with glue and clamps and ample
| time, sure it would look neater and weigh less than the version
| that's currently holding you imprisoned and if done right it
| will even be stronger but it takes more time and effort as well
| as those glue clamps and other specialised tools to create
| perfect matching surfaces while the builder just wielded that
| hammer and those nails and now is building yet another
| utilitarian piece of work with the same hammer and nails.
|
| Sometimes all you need is a hammer and some nails. Or pipes.
| duped wrote:
| I agree with this but with a much more nuanced take: avoid
| pipes if either reader or writer expects to do async i/o and
| you don't own both the reader and writer.
|
| In fact if you ever set O_NONBLOCK on a pipe you need to be
| damn sure both the reader and writer expect non-blocking i/o
| because you'll get heisenbugs under heavy i/o when either the
| reader/writer outpace each other and one expects blocking i/o.
| When's the last time you checked the error code of `printf` and
| put it in a retry loop?
| teo_zero wrote:
| Genuine question: why does printf need a retry loop when
| using pipes?
| duped wrote:
| It doesn't that's why no one does it.
|
| But for pipes what it means is that if whoever is reading
| or writing the pipe expects non blocking semantics, the
| other end needs to agree. And if they don't you'll
| eventually get an error because the reader or writer
| outpaced the other, and almost no program handles errors
| for stdin or stdout.
| teo_zero wrote:
| But even writing to a file doesn't guarantee non-blocking
| semantics. I still don't get what is special about pipes.
| caf wrote:
| Making the read side non-blocking doesn't affect the
| write side, and vice-versa.
| duped wrote:
| That is not true for pipes.
| caf wrote:
| It is, at least on Linux for ordinary pipe(2) pipes.
|
| I just wrote up a test to be sure: in the process with
| the read side, set it to non-blocking with fcntl(p,
| F_SETFL, O_NONBLOCK) then go to sleep for a long period.
| Dump a bunch of data into the writing side with the other
| process: the write() call blocks once the pipe is full as
| you would expect.
| epcoa wrote:
| This isn't right. O_NONBLOCK doesn't mean the pipe doesn't
| stall, it just means you get an immediate errno and don't
| block on the syscall in the kernel waiting and this is
| specific to the file description, for which a pipe has 2
| independent ones. Setting O_NONBLOCK on the writer does not
| affect the reader. If it did, this would break a ton of
| common use cases where pipelined programs are designed to not
| even know what is on the other side.
|
| Not sure what printf has to do with it; it isn't designed to be
| used with a non-block writer (but that only concerns one
| side). How will the reader being non-block change the
| semantics of the writer? It doesn't.
|
| You can't set O_NONBLOCK on a pipe fd you expect to use with
| stdio, but that isn't unique to pipes. Whether the reader is
| O_NONBLOCK will not affect you if you're pushing the writer
| with printf/stdio.
|
| (This is also a reason why I balk a bit when people refer to
| O_NONBLOCK as "async IO", it isn't the same and leads to this
| confusion)
| tedunangst wrote:
| It's an unfortunately common bug for a process to set non
| blocking on a file shared with another process.
| nitwit005 wrote:
| Just about every form of IPC is "slow". You have decided to pay a
| performance cost for safety.
| marcosdumay wrote:
| You shouldn't have to pay that much. Pipes give you almost
| nothing, so they should cost almost nothing.
|
| Specifically, there aren't many reasons for your fastest IPC to
| be slower than a long function call.
| nitwit005 wrote:
| If you don't think pipes offer much, don't use them.
|
| Saying "long function call" doesn't mean much since a
| function can take infinitely long.
| marcosdumay wrote:
| A long distance function call, that invalidates everything
| on your cache.
| saagarjha wrote:
| ...which is quite expensive.
| marcosdumay wrote:
| Yes, it is. But it's much cheaper than interacting by
| pipe.
|
| Linux is optimizing sockets with a similar goal. And it's
| quite far on that direction. But there's still some
| margin to gain.
| brigade wrote:
| Pipes don't exist for safety, they exist as an optimization to
| pass data between existing programs.
| PaulDavisThe1st wrote:
| _NOT_ writing and reading to and from a file stored on a
| drive is not, in this context, an optimization, but a
| significantly freeing conceptual shift that completely
| transforms how a class of users conducts themselves when
| using the computer.
| djaouen wrote:
| So is Python, but I'm still gonna use it lol
| RevEng wrote:
| I didn't quite grasp why the original splice has to be so slow.
| They pointed out what made it slower than vmsplice - in
| particular allocating buffers and using scalar instructions - but
| why is this necessary? Why couldn't splice just be reimplemented
| as vmsplice? I'm sure there is a good reason, but I've missed it.
| Izkata wrote:
| > Why couldn't splice just be reimplemented as vmsplice?
|
| A possible answer that's currently just below your comment:
| https://news.ycombinator.com/item?id=41351870
|
| > vmsplice doesn't work with every type of file descriptor.
| koverstreet wrote:
| One of my sideprojects is intended to address this:
| https://lwn.net/Articles/976836/
|
| The idea is a syscall for getting a ringbuffer for any supported
| file descriptor, including pipes - and for pipes, if both ends
| support using the ringbuffer they'll map the same ringbuffer:
| zero copy IO, potentially without calling into the kernel at all.
|
| Would love to find collaborators for this one :)
| wakawaka28 wrote:
| Buffering is there for a reason and this approach will lead to
| weird failure modes and fragility in scripts. The core issue is
| that any stream producer might go slower than any given
| consumer. Even a momentary hiccup will totally mess up the pipe
| unless there is adequate buffering, and the amount needed is
| system-dependent.
| Spivak wrote:
| What makes this any different than other buffer
| implementations that have a max size? Buffer fills, writes
| block. What failure mode are you worried about that can't
| occur with pipes which are also bounded?
| foota wrote:
| Maybe I misunderstand, but if the ring buffer is full isn't
| it ok for the sender to just block?
| mort96 wrote:
| Yeah, and if the ring buffer is empty it's okay for the
| receiver to just block... exactly as happens today with
| pipes
| hackernudes wrote:
| I think the OP's proposal has buffering.
|
| It is different from a pipe - instead of using read/write to
| copy data from/to a kernel buffer, it gives user space a
| mapped buffer object and they need to take care to use it
| properly (using atomic operations on the head/tail and such).
|
| If you own the code for the reader and writer, it's like
| using shared memory for a buffer. The proposal is about
| standardizing an interface.
| messe wrote:
| > and for pipes, if both ends support using the ringbuffer
| they'll map the same ringbuffer
|
| Is there planned to be a standardized way to signal to the
| other end of the pipe that ring buffers are supported, so this
| could be handled transparently in libc? If not, I don't really
| see what advantage it gets you compared to shared memory + a
| futex for synchronization--for pipes that is.
| immibis wrote:
| Presumably the same interface still works if the other side
| is using read/write.
| koverstreet wrote:
| correct
| caf wrote:
| Presumably ringbuffer_wait() can also be signalled through
| making it 'readable' in poll()?
| koverstreet wrote:
| yes, I believe that's already implemented; the more
| interesting thing I still need to do is make futex() work
| with the head and tail pointers.
| phafu wrote:
| At least for user space usage, I'm not sure a new kernel thing
| is needed. Quite a while ago I implemented a user space
| (single producer / single consumer) ring buffer, which uses an
| eventfd to mimic pipe behavior and functionality quite closely
| (i.e. being able to sleep & poll for ring buffer full/empty
| situations), but otherwise operates lockless and without
| syscall overhead.
| 0xbadcafebee wrote:
| Calling Linux pipes "slow" is like calling a Toyota Corolla
| "slow". It's fast enough for all but the most extreme use cases.
| Are you racing cars? In a sport where speed is more important
| than technique? Then get a faster car. Otherwise stick to the
| Corolla.
| AkBKukU wrote:
| I have a project that uses a proprietary SDK for decoding raw
| video. I output the decoded data as pure RGBA in a way FFmpeg
| can read through a pipe to re-encode the video to a standard
| codec. FFmpeg can't include the Non-Free SDK in their source,
| and it would be wildly impractical to store the pure RGBA in
| a file. So pipes are the only way to do it; there are valid
| reasons to use high throughput pipes.
| whartung wrote:
| What about domain sockets?
|
| It's clumsier, to be sure, but if performance is your goal,
| the socket should be faster.
| AkBKukU wrote:
| It looks like FFmpeg does support reading from sockets
| natively[1], I didn't know that. That might be a better
| solution in this case, I'll have to look into some C code
| for writing my output to a socket to try that some time.
|
| [1] https://ffmpeg.org/ffmpeg-protocols.html#unix
| ptx wrote:
| Why should sockets be faster?
| uncanneyvalley wrote:
| Sockets remap pages without moving any data while pipes
| have to copy the data between fds.
| CyberDildonics wrote:
| _So pipes are the only way to do it_
|
| Let's not get carried away. You can use ffmpeg as a library
| and encode buffers in a few dozen lines of C++.
| Almondsetat wrote:
| ffmpeg's library is notorious for being a complete and
| utter mess
| CyberDildonics wrote:
| It worked extremely well when I did something almost
| exactly like this. I gave it buffers of pixels in memory
| and it spit out compressed video.
| quietbritishjim wrote:
| The parent comment mentioned license incompatibility, which
| I guess would still apply if they used ffmpeg as a library.
| whiterknight wrote:
| And you go from having a well defined modular interface
| that's flexible at runtime to a binary dependency.
| CyberDildonics wrote:
| You have the dependency either way, but if you use the
| library you can have one big executable with no external
| dependencies and it can actually be fast.
|
| If there wasn't a problem to solve they wouldn't have
| said anything. If you want something different you have
| to do something different.
| jcelerier wrote:
| Why not just store the output of the proprietary codec in an
| AVFrame that you'd pass to libavcodec in your own code?
| ploxiln wrote:
| What percentage of CPU time is used by the pipe in this
| scenario? If pipes were 10x faster, would you really notice
| any difference in wall-clock-time or overall-cpu-usage, while
| this decoding SDK is generating the raw data and ffmpeg is
| processing it? Are these video processing steps anywhere near
| memory copy speeds?
| Sesse__ wrote:
| At some point, I had a similar issue (though not related to
| licensing), and it turned out it was faster to do a high-
| bitrate H.264-encode of the stream before sending it over the
| FFmpeg socket than sending the raw RGBA data, even over
| localhost... (There was some minimal quality loss, of course,
| but it was completely irrelevant in the big picture.)
| jraph wrote:
| > There was some minimal quality loss, of course, but it
| was completely irrelevant in the big picture
|
| But then the solutions are not comparable anymore, are
| they? Would a lossless codec instead have improved speed?
| chupasaurus wrote:
| H.264 has lossless mode.
| Sesse__ wrote:
| No, because I had hardware H.264 encoder support. :-)
| (The decoding in FFmpeg on the other side was still
| software. But it was seemingly much cheaper to do a H.264
| software decode.)
| Someone wrote:
| This isn't code in some project that will run only a few
| billion times in its lifetime; it is used frequently on
| millions, if not billions, of computers.
|
| Because of that, it is economical to spend lots of time
| optimizing it, even if it only makes the code marginally more
| efficient.
| samastur wrote:
| That's not how economics works.
|
| If 100 million people each save 1 cent because of your work,
| you saved 1 million in total, but in practice nobody is
| observably better off.
| sebstefan wrote:
| So? It doesn't need to be visible to be worth optimizing?
| samastur wrote:
| If you are making an economic (financial) argument for
| change like the original comment did, then yes, it should
| be visible positive effect.
|
| Obviously not if you are doing for your own fun or just
| improving the state of art.
| h0p3 wrote:
| There are people whose lives are improved by having an
| extra cent to spend. Seriously. It is measurable,
| observable, and real. It might not have a serious impact on
| the vast majority of people, but there are people who have
| very, very little money or have found themselves on a
| tipping point that small; pinching pennies alters their
| utility outcomes.
| InDubioProRubio wrote:
| https://xkcd.com/951/
|
| Also, if you micro-optimize and that becomes your whole
| focus and ability to focus, your business is unable to
| innovate aka traverse the economic landscape and find new
| rich gradients and sources of "economic food", making you
| a dinosaur in a pit, doomed to eternally cannibalize on
| what other creatures descend into the pit and highly
| dependent on the pit not closing up for good.
| samastur wrote:
| No, they really aren't. Absolutely nobody's life is
| measurably improved because of 1 cent one time.
|
| I admit my opinion is not based on first hand knowledge,
| but I have for years worked on projects trying to address
| poverty at different parts of this planet and can't think
| of a single one where this would be even remotely true.
| h0p3 wrote:
| > Absolutely nobody's life is measurably improved because
| of 1 cent one time...I admit my opinion is not based on
| first hand knowledge...
|
| My opinion, however, is based on first-hand knowledge.
| I've been the kid saving those pennies, and I've worked
| with those kids. I understand that in the vast majority
| of cases, an extra penny does nothing more. That isn't
| what your original comment above claimed, nor is it what
| you've claimed here. My counterexample is enough to
| demonstrate the falsehood. Arguing that there are better
| ways to distribute these pennies is another matter, and I
| take that seriously as well.
| whiterknight wrote:
| You're describing the outcome of one individual person.
| Money is just a tool for allocating resources. Saving 1
| million of resources is a good thing.
| wang_li wrote:
| It's a meaningless thing if it's 1 million resources
| divided into 1 million actors who have no ability to
| leverage a short term gain of 1 resource. It's short term
| because the number of computers that are 100% busy 100%
| of the time is zero. A pipe throughput improvement means
| nothing if the computer isn't waiting on pipes a lot.
| carlhjerpe wrote:
| Eventually everyone ends up at a power plant, there's an
| insane amount of people living in the European grid. If
| an optimization ends up saving a couple tonnes of CO2 per
| year it is hard to not call it a good thing.
|
| https://en.m.wikipedia.org/wiki/Synchronous_grid_of_Conti
| nen...
| wang_li wrote:
| A couple tons spread across 400 million people with a per
| capita emission of 5 tons per year is in the noise. If
| we're at the point of trying to hyper optimize there are
| far more meaningful targets than pipe throughput.
| sqeaky wrote:
| You are arguing against the concept of "division of
| labor".
|
| You are a few logical layers removed, but fundamentally
| that is at the heart of this. It isn't just about what
| _you_ think can or can't be leveraged. Reducing waste in
| a centralized fashion is excellent because it will enable
| other waste to be reduced in a self reinforcing cycle as
| long as experts in their domain keep getting the benefits
| of other experts. The chip experts make better
| instructions, so the library experts make better software
| libs they add their 2% and now it is more than 4%, so the
| application experts can have 4% more theoughput and buy
| 4% fewer servers or spend way more than 4% less
| optimizing or whatever and add their 2% optimization and
| now we are at more than 6%, and the end users can do
| their business slightly better and so on in a chain that
| is all of society. Sometimes those gains are muted.
| Sometimes that speed turns into error checking, power
| saving, more throughput, and everyone trying to do their
| best to do more with less.
| carlhjerpe wrote:
| Absolutely, if your focus is saving emissions don't
| optimize pipes. But if you optimize an interface people
| use it's a good thing either way right
| azulster wrote:
| not if it costs 200 million in man-hours to optimize
| hi-v-rocknroll wrote:
| Citation needed.
|
| Pipes aren't used everywhere in production in hot paths. That
| just doesn't happen.
| ibern wrote:
| A lot of bioinformatics code relies very heavily on pipes.
| throwway120385 wrote:
| What production? You need to check your assumptions about
| what people do with general purpose computers and why. Just
| because it doesn't happen in your specific field of
| computing doesn't mean it never happens anywhere or that it
| just isn't important.
| Ultimatt wrote:
| A better analogy is that it's like a society that uses steam
| attempting to industrially compete with a society that uses
| bullet trains (literally similar by factor of improvement). The
| UK built its last steam train for national use in 1960, four
| years later the Shinkansen was in use in Japan. Which of those
| two nations has a strong international industrial base in 2024?
| billfruit wrote:
| Well the Mallard's top speed was very close to the first
| generation Shinkansen 0 series trains.
| ploxiln wrote:
| Indeed. In the author's case, the slow pipe is moving data at
| 17 GB/s, which is over 130 Gbps.
|
| I've used pipes for a lot of stuff over 10+ years, and never
| noticed being limited by the speed of the pipe, I'm almost
| certain to be limited by tar, gzip, find, grep, nc ... (even
| though these also tend to be pretty fast for what they do).
| crabbone wrote:
| I had two cases in my practice where pipes were slow. Both
| related to developing a filesystem.
|
| 1. Logging. At first our tools for reading the logs from a
| filesystem management program were using pipes, but they
| would be overwhelmed quickly (even before it would overwhelm
| pagers and further down the line). We had to write our own
| pager and give up on using pipes.
|
| 2. Storage again, but a different problem: we had a setup
| where we deployed SPDK to manage the iSCSI frontend duties,
| and our component to manage the actual storage process. It
| was very important that the communication between these two
| components be as fast and as memory-efficient as possible.
| The slowness of pipes comes also from the fact that they have
| to copy memory. We had to extend SPDK to make it communicate
| with our component through shared memory instead.
|
| So, yeah, pipes are unlikely to be the bottleneck of many
| applications, but definitely not all.
| jiehong wrote:
| Replace "Linux pipes" with "Electron apps", and people would not
| agree.
|
| Also, why leave performance on the table by default? Just
| because "it should be enough for most people I can think of"?
|
| Add Tesla motors to a Toyota Corolla and now you've got a
| sportier car by default.
| azulster wrote:
| electron apps are an optimization all by themselves.
|
| it's not optimizing footprint or speed of application. it's
| optimizing the resources and speed of development and
| deployment
| paulannesley wrote:
| Sometimes the best answer really is a faster Corolla!
|
| https://www.toyota.com/grcorolla/
|
| (These machines have amazing engineering and performance, and
| their entire existence is a hack to work around rules making it
| unviable to bring the intended GR Yaris to the US market..
| Maybe _just_ enough eng/perf/hack/market relevance to HN folk
| to warrant my lighthearted reply. Also, the company president
| is still on the tools.)
| 2OEH8eoCRo0 wrote:
| There's no replacement for displacement.
| Sohcahtoa82 wrote:
| Apparently there is, because that car only has a 1.6L
| 3-cylinder engine and yet produces a whopping 300
| horsepower.
| 2OEH8eoCRo0 wrote:
| When? In the RPM sweet spot after waiting an eternity for
| the turbos to spool? There's always a catch.
| zackmorris wrote:
| I didn't expect to be writing this comment on this
| article hah, but apparently there is such a thing called
| a surge tank for storing boost pressure to mostly
| eliminate turbo lag:
|
| https://www.highpowermedia.com/Archive/the-surge-tank
|
| https://forums.tdiclub.com/index.php?threads/air-tank-or-
| com...
|
| It's such an obvious idea that I'm kind of shocked it
| took them until 2003 to do it. Surely someone thought of
| this in like the 60s.
|
| I would probably do it differently with a separate
| supercharger to intermittently maintain another 1-2+ bar
| of boost to make the tank less than half as large, but
| that would add complexity, and what do I know.
| mort96 wrote:
| I mean why waste CPU time moving data between buffers when you
| could get the same semantics and programming model without
| wasting that CPU time?
| qsantos wrote:
| To be frank, this is more of a pretext to understand what pipes
| and vmsplice do exactly.
| bastawhiz wrote:
| I'm not sure that logic makes sense. Making a thing that's used
| ubiquitously a few percent faster is absolutely a worthwhile
| investment of effort. Individual operations might not be very
| much faster, but it's (in aggregate) a ton of electricity and
| time globally.
| 0xbadcafebee wrote:
| That's what's called premature optimization. Everywhere in
| our lives we do inefficient things. Despite the inefficiency
| we gain us something else: ease of use or access, simplicity,
| lower cost, more time, etc. The world and life as we know it
| is just a series of tradeoffs. Often optimization before it's
| necessary actually creates more drawbacks than benefits. When
| it's easy and has a huge benefit, or is necessary, then
| definitely optimize. It may be hard to accept this as a
| general principle, but in practice (mostly in hindsight) it
| becomes very apparent.
|
| Donald Knuth thinks the same: https://en.wikipedia.org/wiki/P
| rogram_optimization#When_to_o...
| bastawhiz wrote:
| It's definitionally not premature optimization. Pipes exist
| (and have existed for decades). This is just
| "optimization". "Premature" means it's too soon to
| optimize. When is it no longer too soon? In another few
| decades? When Linux takes another half of Windows usage? It
| would be premature if they were working on optimizations
| before there were any users or a working implementation.
| But it's not: they're a fundamental primitive of the OS
| used by tens of millions of applications.
|
| The tradeoffs you're discussing are considerations. Is it
| worth making a ubiquitous thing faster at the expense of
| some complexity? At some point that answer is "yes", but
| that is absolutely not "When it's easy and has a huge
| benefit". The most important optimizations you personally
| benefit from were not easy OR had a huge benefit. They were
| hard won and generally small, but they compound on other
| optimizations.
|
| I'll also note that the Knuth quote you reference says
| exactly this:
|
| > Yet we should not pass up our opportunities in that
| critical 3%
| 0xbadcafebee wrote:
| [delayed]
| tacone wrote:
| Wait, it depends on what you're doing. Pipes also create a
| subshell so they are a big nono when used inside a loop.
|
| Suppose you're cycling on the lines of stdout and need to use
| sed, cut and so on, using pipes will slow down things
| considerably (and sed, cut startup time will make things
| worse).
|
| Using bash/zsh string interpolation would be much faster.
| JoshTriplett wrote:
| This is a side note to the main point being made, but on modern
| CPUs, "rep movsb" is just as fast as the fastest vectorized
| version, because the CPU knows to accelerate it. The name of the
| kernel function "copy_user_enhanced_fast_string" hints at this:
| the CPU features are ERMS ("Enhanced REP MOVSB/STOSB", which
| makes "rep movsb" faster for anything above a certain length
| threshold) and FSRM ("Fast Short REP MOVSB", which makes
| "rep movsb" faster for shorter moves too).
| jeffbee wrote:
| Also worth noting that Linux has changed the way it uses ERMS
| and FSRM in x86 copy multiple times since kernel 6.1 used in
| the article. As a data-dote, my machine that has FSRM and ERMS
| -- surprisingly, the latter is not implied by the former --
| hits 17GB/s using plain old pipes and a 32KiB buffer on Linux
| 6.8
| Lockal wrote:
| This is not the full truth, "rep movsb" is fast until another
| threshold, after which either normal or non-temporal store is
| faster.
|
| All thresholds are described in
| https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...
|
| And they are not final, i. e. Noah Goldstein still updates them
| every year.
| jeffbee wrote:
| Which is these is "faster" depends greatly on whether you
| have the very rare memcpy-only workload, or if your program
| actually does something useful. Many people believe, often
| with good evidence, that the most important thing is for
| memcpy to occupy as few instruction cache lines as is
| practical, instead of being something that branches all over
| kilobytes of machine code. For comparison, see the x86
| implementations in LLVM libc.
|
| https://github.com/llvm/llvm-
| project/blob/main/libc/src/stri...
| adrian_b wrote:
| It depends on the CPU. There is no good reason for "rep
| movsb" to be slower at any big enough data size.
|
| On a Zen 3 CPU, "rep movsb" becomes faster than or the same
| as anything else above a length slightly greater than 2 kB.
|
| However, there is a range of multi-megabyte lengths, roughly
| corresponding to sizes below the L3 cache but exceeding the
| L2 cache, where for some weird reason "rep movsb" becomes
| slower than SIMD non-temporal stores.
|
| At lengths exceeding the L3 size, "rep movsb" becomes again
| the fastest copy method.
|
| The Intel CPUs have different behaviors.
| koverstreet wrote:
| I'm still waiting for rep movsb and rep stosb to be fast
| enough to let me delete my simple C loop versions for short
| memcpys.
| adrian_b wrote:
| It is likely that on recent CPUs they are always faster than
| C loop versions.
|
| On my Zen 3 CPU, for lengths of 2 kB or smaller it is
| possible to copy faster than with "rep movsb", but by using
| SIMD instructions (or equivalently the builtin "memcpy"
| provided by most C compilers), not with a C loop (unless the
| compiler recognizes the C loop and replaces it with the
| builtin memcpy, which is what some compilers will do at high
| optimization levels).
| haberman wrote:
| If that's the case, when can I expect C compilers to inline
| variable-length memcpy() the way they will inline fixed-length
| memcpy today?
| nyanpasu64 wrote:
| How do you gather profiling information for kernel function calls
| from a user program?
| qsantos wrote:
| I'll write an article on the flamegraphs specifically, but to
| get the data, just follow Julia's article!
|
| https://jvns.ca/blog/2017/03/19/getting-started-with-ftrace/
| ismaildonmez wrote:
| Could you clarify how you are testing the speed of the first
| example, where you are not writing anything to stdout? Thanks.
| qsantos wrote:
| For the first Rust program, where I just write to memory, I
| use the time utility when running the program from zsh.
| Then I divide the number of bytes written by the number of
| seconds elapsed. That's why it's not an infinite loop ;)
| ismaildonmez wrote:
| Thanks!
| cowsaymoo wrote:
| What is the library used to profile the program?
| tzury wrote:
| pv
|
| https://linux.die.net/man/1/pv
|
| it is in the pipe command `... | pv > /dev/null`
| throw12390 wrote:
| `pv --discard` is faster by 8% (on my system).
|             % pv </dev/zero >/dev/null
|             54.0GiB/s
|             % pv </dev/zero --discard
|             58.7GiB/s
| IWeldMelons wrote:
| Which is suspiciously close to the speed of DDR4.
| qsantos wrote:
| I am again getting the hug of death of Hacker News. The situation
| is better than the last time thanks to caching WordPress pages,
| but loading the page can still take a few seconds, so bear with
| me!
| Borg3 wrote:
| Haha. When I read the title I smiled. Linux pipes slow? Moook..
| Now try Cygwin pipes. That's what I call slow!
|
| Anyway, nice article, it's good to know what's going on under
| the hood.
| MaxBarraclough wrote:
| I'd assumed Cygwin pipes were just Windows pipes; is that not
| the case?
| tyingq wrote:
| Not a comprehensive list of problems, and not current, but
| this post is a good illustration of the kind of issues people
| have run into:
|
| https://cygwin.com/pipermail/cygwin-patches/2016q1/008301.ht...
| Borg3 wrote:
| It's not that easy. Yeah, they are, but there is a lot of
| POSIX-like glue inside so they work correctly with select()
| and other alarms. The code is very complicated.
|
| But still, kudos to the Cygwin developers for creating Cygwin
| :) Great work, even though it has some issues.
| stabbles wrote:
| A bold claim for a blog that takes about 20 seconds to load.
| yas_hmaheshwari wrote:
| This post has gone to the top of Hacker News, so I think we
| should cut him some slack
|
| Looks like an amazing article, and so much to learn on what
| happens under the hood
| ben-schaaf wrote:
| HN generates ~20k page views over the course of a day with a
| peak of 2k/h: https://harrisonbroadbent.com/blog/hacker-news-traffic-spike....
| At ~1MB per page load - not sure how accurate this is, I
| don't think it fully loaded - this static blogpost requires
| 0.55MB/s to meet demand. An original Raspberry Pi B (10Mbps
| ethernet) on the average French mobile internet connection
| (8Mbps) provides double that.
|
| I don't mean this as a slight to anyone, I just want to point
| out the HN "hug of death" can be trivially handled by a
| single cheap VPS without even breaking a sweat.
| qsantos wrote:
| Totally agree, my server should definitely be able to
| handle the load. But this is a WordPress install, which is
| doing far too much work for what amounts to just serving
| pages. I plan to improve on this!
| wvh wrote:
| I believe that when it's a .fr, they call it nonchalance...
| goodpoint wrote:
| Excellent article even if, to be honest, the title is clickbait.
| chmaynard wrote:
| Agreed. Titles that don't use quantifiers are almost always
| misleading at best.
| sixthDot wrote:
| > I do not know why the JMP is not just a RET, however.
|
| The jump seems generated by the expansion of the `ASM_CLAC`
| macro, which is supposed to change the EFLAGS register ([1],
| [2]). However, in this case the expansion looks like it does
| nothing (maybe because of the target?). I'd be interested to
| know more about that. A call to the wild.
|
| [1]:
| https://github.com/torvalds/linux/blob/master/arch/x86/inclu...
|
| [2]: https://stackoverflow.com/a/60579385
| fatcunt wrote:
| > I do not know why the JMP is not just a RET, however.
|
| This is caused by the CONFIG_RETHUNK option. In the disassembly
| from objdump you are seeing the result of RET being replaced with
| JMP __x86_return_thunk.
|
| https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
|
| https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/ret...
|
| > The NOP instructions at the beginning and at the end of the
| function allow ftrace to insert tracing instructions when needed.
|
| These are from the ASM_CLAC and ASM_STAC macros, which make space
| for the CLAC and STAC instructions (both of them three bytes in
| length, same as the number of NOPs) to be filled in at runtime if
| X86_FEATURE_SMAP is detected.
|
| https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
|
| https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...
|
| https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/...
| qsantos wrote:
| Thanks a lot for the information! I was not quite sure what to
| look for in this case. I have added a note in the article.
| ndesaulniers wrote:
| There are perhaps only a handful of kernel developers that:
|
| 1. would know the above
|
| 2. would choose such an obnoxious throwaway handle
| arendtio wrote:
| I know pipes primarily from shell scripts. Are they being used in
| other contexts as extensively, too? Like C or Rust programs?
| rwmj wrote:
| Be interesting to see a version using io_uring, which I think
| would let you pre-share buffers with the kernel avoiding some
| copies, and avoid syscall overhead (though the latter seems
| negligible here).
| qsantos wrote:
| That sounds like a good idea!
| rwmj wrote:
| I'm not claiming it'll be faster! Additionally io_uring has
| its own set of challenges, such as whether it's better to
| allocate one ring per core or one ring per application
| (shared by some or all cores). Pre-sharing buffers has trade-
| offs too, particularly in application complexity [alignment,
| you have to be careful not to reuse a buffer before it is
| consumed] versus the efficiency of zero copy.
| donaldihunter wrote:
| Something I didn't see mentioned in the article about AVX512,
| aside from the xsave/xrstor overhead, is that AVX512 is power
| hungry and causes CPU frequency scaling. See [1], [2] for details
| and as an example of how nuanced it can get.
|
| [1] https://www.intel.com/content/dam/www/central-libraries/us/e...
|
| [2]
| https://www.intel.com/content/www/us/en/developer/articles/t...
| Narishma wrote:
| That is only the case in specific Intel CPU models.
| up2isomorphism wrote:
| Someone tasted bread and found it not sweet enough, which is
| fine. But calling the bread bland is funny, because it was
| never meant to taste sweet.
| mparnisari wrote:
| I get PR_CONNECT_RESET_ERROR when trying to open the page
| qsantos wrote:
| My server struggles a bit with the load on the WordPress site.
| You should be fine just reloading. I will make sure to improve
| things for the next time!
| jvanderbot wrote:
| > Although SSE2 is always available on x86-64, I also disabled
| the cpuid bit for SSE2 and SSE to see if it could nudge glibc
| into using scalar registers to copy data. I immediately got a
| kernel panic. Ah, well.
|
| I think you need to recompile your compiler, or disable those
| explicitly via link / cc flags. It's fairly hard to coax
| compilers into, or dissuade them from, using SIMD
| instructions, IMHO.
| faizshah wrote:
| This is a really cool post and that is a massive amount of
| throughput.
|
| In my experience in data engineering, it's very unlikely you
| can exceed 500 MB/s of throughput in your business logic, as
| most libraries you're using are not optimized to that degree
| (SIMD etc.). That being said, I think it's a good technique to
| try out.
|
| I'm trying to think of other applications this could be useful
| for. Maybe video workflows?
| jeremyscanvic wrote:
| Great post! I didn't know about vmsplice(2). I'm glad to see a
| former ENSL student here as well!
| qsantos wrote:
| Hey!
| yencabulator wrote:
| FUSE can be a bit trickier than a single queue of data chunks.
| Reads from /dev/fuse actually pick the right message to read
| based on priorities, and there are cases where the message
| queue is meddled with to e.g. cancel requests before they're
| even sent to userspace. If you naively switch it to eagerly
| putting messages
| into a userspace-visible ringbuffer, you might significantly
| change behavior in cases like interrupting slow operations.
| Imagine having to fulfill a ringbuf worth of requests to a
| misbehaving backend taking 5sec/op, just to see the cancellations
| at the very end.
___________________________________________________________________
(page generated 2024-08-26 23:01 UTC)