[HN Gopher] Why pipes sometimes get "stuck": buffering
___________________________________________________________________
Why pipes sometimes get "stuck": buffering
Author : tanelpoder
Score : 230 points
Date : 2024-11-29 16:43 UTC (6 hours ago)
(HTM) web link (jvns.ca)
(TXT) w3m dump (jvns.ca)
| Twirrim wrote:
| This is one of those things where, despite some 20+ years of
| dealing with *NIX systems, I *know* it happens, but always forget
| about it until I've sat puzzled why I've got no output for
| several moments.
| ctoth wrote:
| Feels like a missed opportunity for a frozen pipes joke.
|
| Then again...
|
| Frozen pipes are no joke.
| hiatus wrote:
| > Some things I didn't talk about in this post since these posts
| have been getting pretty long recently and seriously does anyone
| REALLY want to read 3000 words about buffering?
|
| I personally would.
| CoastalCoder wrote:
| It depends on the writing.
|
| I've read that sometimes wordy articles are mostly fluff for
| SEO.
| TeMPOraL wrote:
| In case of this particular author, those 3000 words would be
| dense, unbuffered wisdom.
| penguin_booze wrote:
| Summarizing is one area where I'd consider using AI. I
| haven't explored what solutions exist yet.
| Veserv wrote:
| The solution is that buffered accesses should almost always flush
| after a threshold number of bytes or after a period of time if
| there is at least one byte, "threshold or timeout". This is
| pretty common in hardware interfaces to solve similar problems.
|
| In this case, the library that buffers in userspace should set
| appropriate timers when it first buffers the data. Good choices
| of timeout parameter are: passed in as argument, slightly below
| human-scale (e.g. 1-100 ms), proportional to {bandwidth /
| threshold} (i.e. some multiple of the time it would take to reach
| the threshold at a certain access rate), proportional to target
| flushing overhead (e.g. spend no more than 0.1% time in
| syscalls).
|
| Also note this applies for both writes and reads. If you do
| batched/coalesced reads then you likely want to do something
| similar. Though this is usually more dependent on your data
| channel as you need some way to query or be notified of "pending
| data" efficiently which your channel may not have if it was not
| designed for this use case. Again, pretty common in hardware to
| do interrupt coalescing and the like.
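|
| A minimal single-threaded sketch of the "threshold or timeout"
| idea in C (names and constants are mine; this is the relaxed
| variant where the timeout is only checked on the next write
| rather than by an asynchronous timer):
|
|     #include <string.h>
|     #include <time.h>
|     #include <unistd.h>
|
|     #define THRESHOLD 4096  /* flush once this many bytes are buffered */
|     #define TIMEOUT_MS 50   /* ...or once the oldest byte is this old */
|
|     static char buf[THRESHOLD];
|     static size_t used;
|     static struct timespec first;  /* when the buffer went non-empty */
|
|     static void flush_buf(void) {
|         if (used) { write(STDOUT_FILENO, buf, used); used = 0; }
|     }
|
|     static long age_ms(void) {
|         struct timespec now;
|         clock_gettime(CLOCK_MONOTONIC, &now);
|         return (now.tv_sec - first.tv_sec) * 1000
|              + (now.tv_nsec - first.tv_nsec) / 1000000;
|     }
|
|     void buf_write(const char *p, size_t n) {
|         while (n) {
|             if (used == 0)  /* empty -> non-empty: remember when */
|                 clock_gettime(CLOCK_MONOTONIC, &first);
|             size_t k = n < THRESHOLD - used ? n : THRESHOLD - used;
|             memcpy(buf + used, p, k);
|             used += k; p += k; n -= k;
|             if (used == THRESHOLD) flush_buf();  /* threshold flush */
|         }
|         if (used && age_ms() >= TIMEOUT_MS) flush_buf();  /* timeout flush */
|     }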
| asveikau wrote:
| I think doing those timeouts transparently would be tricky
| under the constraints of POSIX and ISO C. It would need to have
| some cooperation from the application layer.
| jart wrote:
| The only way you'd be able to do it is by having functions
| like fputc() call clock_gettime(CLOCK_MONOTONIC_COARSE) which
| will impose ~3ns overhead on platforms like x86-64 Linux
| which have a vDSO implementation. So it can be practical, sort
| of, although it'd probably be smarter to just use line-buffered
| or unbuffered stdio. In practice even unbuffered I/O isn't that
| bad: it's the default for stderr, and it's still effectively
| buffered, since even in unbuffered mode functions like printf()
| buffer internally. You just get assurances that whatever it
| prints will be flushed by the end of the call.
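|
| A sketch of what that wrapper could look like
| (CLOCK_MONOTONIC_COARSE is Linux-specific, and fputc_timed is a
| made-up name):
|
|     #include <stdio.h>
|     #include <time.h>
|
|     static struct timespec last_flush;
|
|     /* like fputc(), but flushes if ~10ms have passed since the
|        last flush; the clock read is a vDSO call on Linux, not a
|        full syscall */
|     int fputc_timed(int c, FILE *f) {
|         int r = fputc(c, f);
|         struct timespec now;
|         clock_gettime(CLOCK_MONOTONIC_COARSE, &now);
|         long ms = (now.tv_sec - last_flush.tv_sec) * 1000
|                 + (now.tv_nsec - last_flush.tv_nsec) / 1000000;
|         if (ms >= 10) { fflush(f); last_flush = now; }
|         return r;
|     }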
| asveikau wrote:
| That's just for checking the clock. You'd also need to have
| a way of getting called back when the timeout expires,
| after fputc et al are long gone from the stack and your
| program is busy somewhere else, or maybe blocked.
|
| Timeouts are usually done with signals (a safety nightmare,
| so no thanks) or an event loop. Hence my thought that you
| can't do it really transparently while keeping current
| interfaces.
| jart wrote:
| Signals aren't a nightmare; it's just that fflush() isn't
| defined by POSIX as being async-signal-safe. You
| could change all your stdio functions to block signals
| while running, but then you'd be adding like two system
| calls to every fputc() call. Smart thing to do would
| probably be creating a thread with a for (;;) {
| usleep(10000); fflush(stdout); } loop.
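|
| Spelled out, that flusher thread is just this (compile with
| -pthread; calling fflush() on a FILE* from another thread is
| fine since stdio locks internally):
|
|     #include <pthread.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     static void *flusher(void *arg) {
|         (void)arg;
|         for (;;) { usleep(10000); fflush(stdout); }
|         return NULL;
|     }
|
|     int main(void) {
|         pthread_t t;
|         pthread_create(&t, NULL, flusher, NULL);
|         for (int i = 0; ; i++) {   /* slow producer */
|             printf("tick %d", i);  /* no newline: stays buffered... */
|             usleep(500000);        /* ...until the flusher wakes up */
|         }
|     }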
| asveikau wrote:
| Signals are indeed a nightmare. Your example of adding
| tons of syscalls to make up for lack of safety shows that
| you understand that to be true.
|
| And no, creating threads to solve this fringe problem in
| a spin loop with a sleep is not what I'd call "smart".
| It's unnecessary complexity and in most cases, totally
| wasted work.
| toast0 wrote:
| I think this is the right approach, but any libc setting
| automatic timers would lead to a lot of tricky problems because
| it would change expectations.
|
| I/O errors could occur at any point, instead of only when you
| write. Syscalls everywhere could be interrupted by a timer,
| instead of only where the program set timers, or when a signal
| arrives. There's also a reasonable chance of confusion when the
| application and libc both set timers, depending on how the timer
| is set (although maybe this isn't relevant anymore... kernel
| timer apis look better than I remember). If the application
| specifically pauses signals for critical sections, that impacts
| the i/o timers, etc.
|
| There's a need to be more careful in accessing i/o structures
| because of when and how signals get handled.
| nine_k wrote:
| I don't follow. Using a pipe sets an expectation of some
| amount of asynchronicity, because we only control one end of
| the pipe. I don't see a dramatic difference between an error
| occurring because the process on the other end is having
| trouble, or because a timeout handler is trying to push the
| bytes.
|
| On the reading end, the error may occur at the attempt to
| read the pipe.
|
| On the writing end, the error may be signaled at the _next_
| attempt to write to or close the pipe.
|
| In either case, a SIGPIPE can be sent asynchronously.
|
| What scenario am I missing?
| toast0 wrote:
| > In either case, a SIGPIPE can be sent asynchronously.
|
| My expectation (and I think this is an accurate expectation)
| is that a) read does not cause a SIGPIPE, read on a widowed
| pipe returns a zero count read as indication of EOF. b)
| write on a widowed pipe raises SIGPIPE before the write
| returns. c) write to a pipe that is valid will not raise
| SIGPIPE if the pipe is widowed without being read from.
|
| Yes, you _could_ get a SIGPIPE from anywhere at any time,
| but unless someone is having fun on your system with random
| kills, you won't actually get one except immediately after
| a write to a pipe. With a timer based asynchronous write,
| this changes to potentially happening any time.
|
| This could be fine if it was well documented and expected,
| but it would be a mess to add it into the libcs at this
| point. Probably a mess to add it to basic output buffering
| in most languages.
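|
| Expectations (a) and (b) are easy to check with a toy program
| (mine, not from the article):
|
|     #include <errno.h>
|     #include <signal.h>
|     #include <stdio.h>
|     #include <string.h>
|     #include <unistd.h>
|
|     static void on_sigpipe(int sig) {
|         (void)sig;  /* write(2) is async-signal-safe; printf is not */
|         const char msg[] = "got SIGPIPE\n";
|         write(STDERR_FILENO, msg, sizeof msg - 1);
|     }
|
|     int main(void) {
|         int fds[2];
|         char c;
|
|         /* (a) read on a widowed pipe: returns 0 (EOF), no signal */
|         pipe(fds);
|         close(fds[1]);  /* no writers left */
|         printf("read returned %zd\n", read(fds[0], &c, 1));
|         close(fds[0]);
|
|         /* (b) write on a widowed pipe: SIGPIPE raised before the
|            write returns, then -1/EPIPE */
|         pipe(fds);
|         signal(SIGPIPE, on_sigpipe);
|         close(fds[0]);  /* no readers left */
|         ssize_t n = write(fds[1], "x", 1);
|         printf("write returned %zd (%s)\n", n, strerror(errno));
|     }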
| Veserv wrote:
| You will generally only stall indefinitely if you are waiting
| for new data. So, you will actually handle almost every use
| case if your blocking read/wait also respects the timeout and
| does the flush on your behalf. Basically, do it synchronously
| at the top of your event loop and you will handle almost
| every case.
|
| You could also relax the guarantee and set a timeout that is
| only checked during your next write. This still allows
| unbounded latency, but as long as you do one more write it
| will flush.
|
| If neither of these works, then your program issues a write
| and then gets into an unbounded or unreasonably long
| loop/computation. At that point you can manually flush what
| is likely the last write your program is ever going to make,
| which would be a trivial overhead since that is a single
| write compared to a ridiculously long computation. That, or
| you probably have bigger problems.
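|
| The "flush before you block" version is a one-liner at the top
| of a typical poll loop (a sketch; it glosses over stdio's own
| input buffering, which can hold data poll() doesn't see):
|
|     #include <poll.h>
|     #include <stdio.h>
|
|     int main(void) {
|         struct pollfd pfd = { .fd = 0, .events = POLLIN };  /* stdin */
|         char line[4096];
|
|         for (;;) {
|             fflush(stdout);  /* nothing stays stuck while we sleep */
|             if (poll(&pfd, 1, -1) <= 0) break;  /* block for input */
|             if (!fgets(line, sizeof line, stdin)) break;
|             printf("got: %s", line);  /* buffered until next loop top */
|         }
|         fflush(stdout);
|     }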
| toast0 wrote:
| Yeah, these are all fine to do, but a libc can really only
| do the middle one. And then, at some cost.
|
| If you're already using an event loop library, I think it's
| reasonable for that to manage flushing outputs while
| waiting for reads, but I don't think any of the utilities
| in this example do; maybe tcpdump does, but I don't know
| why grep would.
| vlovich123 wrote:
| Typical Linux alarms are based on signals and are very
| difficult to manage, and rescheduling them may have a
| performance impact since it requires thunking into the kernel.
| If you use io_uring with userspace timers things can scale much
| better, but it still requires tricks if you want to support a
| lot of fast small writes (e.g. above ~1 million writes per
| second, timer management starts to show up more and more, and
| you have to do some crazy tricks I figured out to get up to
| 100M writes per second).
| Veserv wrote:
| You do not schedule a timeout on each buffered write. You
| only schedule one timeout on the transition from empty to
| non-empty that is retired either when the timeout occurs or
| when you threshold flush (you may choose to not clear on
| threshold flush if timeout management is expensive). So, you
| program at most one timeout per timeout duration/threshold
| flush.
|
| The point is to guarantee data gets flushed promptly which
| only fails when not enough data gets buffered. The timeout is
| a fallback to bound the flush latency.
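|
| In code, the arming policy is roughly this (arm_timer,
| disarm_timer and flush_buf are hypothetical, e.g. a timerfd or
| a timer thread underneath):
|
|     #include <string.h>
|
|     #define THRESHOLD 4096
|     #define TIMEOUT_MS 50
|
|     extern void arm_timer(int ms);   /* hypothetical one-shot timer */
|     extern void disarm_timer(void);
|     extern void flush_buf(void);     /* empties buf, resets used */
|
|     static char buf[THRESHOLD];
|     static size_t used;
|
|     void buf_append(const char *p, size_t n) {
|         if (used == 0)
|             arm_timer(TIMEOUT_MS);   /* only on empty -> non-empty */
|         memcpy(buf + used, p, n);    /* assume n <= THRESHOLD - used */
|         used += n;
|         if (used >= THRESHOLD) {
|             flush_buf();             /* threshold flush... */
|             disarm_timer();          /* ...optionally retires the timer */
|         }
|     }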
| vlovich123 wrote:
| Yes that can work but as I said that has trade offs.
|
| If you flush before the buffer is full, you're sacrificing
| throughput. Additionally the timer firing has additional
| performance degradation, especially if you're in libc land
| and only have SIGALRM available.
|
| So when an additional write is added, you want to push out
| the timer. But arming the timer requires reading the
| current time among other things, and at rates of 10-20 MHz
| and up reading the current wall clock gets expensive. Even
| rdtsc approaches start to struggle at 20-40 MHz. You
| obviously don't want to do it on every write, but you want
| to make sure that you never actually trigger the timer if
| you're producing data at a fast enough clip to otherwise
| fill the buffer within a reasonable time.
|
| Source: I implemented write coalescing in my NoSQL database,
| which can take 8-byte writes into an in-memory buffer at a
| rate of a few gigahertz. Once the buffer is full or a
| timeout occurs, a flush to disk is triggered and I net out
| at around 100M writes/s (sorting the data for the LSM is
| one of the main bottlenecks). By comparison DBs like
| RocksDB can do ~2M writes/s and SQLite can do ~800k.
| Veserv wrote:
| You are not meaningfully sacrificing throughput because
| the timeout only occurs when you are not writing enough
| data; you have no throughput to sacrifice. The threshold
| and timeout should be chosen such that high throughput
| cases hit the threshold, not the timeout. The timeout
| exists to bound the worst-case latency of low access
| throughput.
|
| You only lose throughput in proportion to the handling
| cost of a single potentially spurious timeout/timeout
| clear per timeout duration. You should then tune your
| buffering and threshold to cap that at an acceptable
| overhead.
|
| You should only really have a problem if you want both
| high throughput and low latency, at which point general
| solutions are probably not fit for your use case, but
| you should remain aware of the general principle.
| vlovich123 wrote:
| > You should only really have a problem if you want both
| high throughput and low latency at which point general
| solutions are probably not not fit for your use case, but
| you should remain aware of the general principle.
|
| Yes you've accurately summarized the end goal. Generally
| people want high throughput AND low latency, not to just
| cap the maximum latency.
|
| The one shot timer approach only solves a livelock risk.
| I'll also note that your throughput does actually drop at
| the same time as the latency spike because your buffer
| stays the same size but you took longer to flush to disk.
|
| Tuning correctly turns out to be really difficult to
| accomplish in practice which is why you really want self
| healing/self adapting systems that behave consistently
| across all hardware and environments.
| BoingBoomTschak wrote:
| Also made a post some time ago about the issue: https://world-
| playground-deceit.net/blog/2024/09/bourne_shel...
|
| About the commands that don't buffer, this is either
| implementation dependent or even wrong in the case of cat (cf
| https://pubs.opengroup.org/onlinepubs/9799919799/utilities/c...
| and `-u`). Massive pain that POSIX never included an official way
| to manage this.
|
| Not mentioned is input buffering, which gives you this strange
| result:
|
|     $ seq 5 | { v1=$(head -1); v2=$(head -1); printf '%s=%s\n' v1 "$v1" v2 "$v2"; }
|     v1=1
|     v2=
|
| The fix is to use `stdbuf -i0 head -1`, in this case.
| jagrsw wrote:
| I don't believe a process reading from a
| pipe/socketpair/whatever can enforce such constraints on a
| writing process (except using heavy hackery like ptrace()).
| While it might be possible to adjust the pipe buffer size, I'm
| not aware of any convention requiring standard C I/O to respect
| this.
|
| In any case, stdbuf doesn't seem to help with this:
|
|     $ ./a | stdbuf -i0 -- cat
|
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void) {
|         for (;;) {
|             printf("n");
|             usleep(100000);
|         }
|     }
| BoingBoomTschak wrote:
| I'm sorry, but I don't understand what you mean. The
| issue in your example is the output buffering of a, not the
| input buffering of cat. You'd need `stdbuf -o0 ./a | cat`
| there.
| Rygian wrote:
| Learned two things: `unbuffer` exists, and "unnecessary" cats are
| just fine :-)
| wrsh07 wrote:
| I like unnecessary cat because it makes the rest of the pipe
| reusable across other commands
|
| Eg if I want to test out my greps on a static file and then
| switch to grepping based on a tail -f command
| chatmasta wrote:
| Yep. I use unnecessary cats when I'm using the shell
| interactively, and especially when I'm building up some
| complex pipeline of commands by figuring out how to do each
| step before moving onto the next.
|
| Once I have the final command, if I'm moving it into a shell
| script, then _maybe_ I'll switch to file redirection.
| Joker_vD wrote:
| > I think this problem is probably unavoidable - I spent a little
| time with strace to see how this works and grep receives the
| SIGINT before tcpdump anyway so even if tcpdump tried to flush
| its buffer grep would already be dead.
|
| I believe quite a few utilities actually _do_ try to flush their
| stdout on receiving SIGINT... but as you've said, the other side
| of the pipe may also very well have received a SIGINT, and nobody
| does a short-timed wait on stdin on SIGINT: after all, the whole
| reason you've been sent SIGINT is because the user wants your
| program to stop working _now_.
| mpbart wrote:
| Wow I had no idea this behavior existed. Now I'm wondering how
| much time I've wasted trying to figure out why my pipelined greps
| don't show correct output
| why-el wrote:
| Love it.
|
| > this post is only about buffering that happens inside the
| program, your operating system's TTY driver also does a little
| bit of buffering sometimes
|
| and if the TTY is remote, so do the network switches! It's
| buffering all the way down.
| BeefWellington wrote:
| AFAIK, signal order generally propagates backwards, so the last
| command run will always receive the signal first, provided it is
| a foreground command.
|
| But also, the example is not a great one; grepping tcpdump output
| doesn't make sense given its extensive and well-documented
| expression syntax. It's obviously just used as an example here to
| demonstrate buffering.
| toast0 wrote:
| > grepping tcpdump output doesn't make sense given its
| extensive and well-documented expression syntax.
|
| I dunno. It doesn't make sense in the world where everyone
| makes the most efficient pipelines for what they want; but in
| that world, they also always remember to use --line-buffered on
| grep when needed, and the line buffered output option for
| tcpdump.
|
| In reality, for a short term thing, grepping on the grepable
| parts of the output can be easier than reviewing the docs to
| get the right filter to do what you really want. Ex, if you're
| dumping http requests and you want to see only lines that match
| some url, you can use grep. Might not catch everything, but
| usually I don't need to see everything.
| Joker_vD wrote:
| > grepping tcpdump output doesn't make sense given its
| extensive and well-documented expression syntax
|
| Well. Personally, every time I've tried to learn its expression
| syntax from its extensive documentation my eyes would start to
| glaze over after about 60 seconds; so I just stick with grep --
| at worst, I have to put the forgotten "-E" in front of the
| pattern and re-run the command.
|
| By the way, and slightly off-tangent: if anyone ever wanted
| grep to output only some part of the captured pattern, like -o
| but only for the part inside the parentheses, then one way to
| do it is to use a wrapper like this:
|     #!/bin/sh -e
|     GREP_PATTERN="$1"
|     SED_PATTERN="$(printf '%s\n' "$GREP_PATTERN" | sed 's;/;\\/;g')"
|     shift
|     grep -E "$GREP_PATTERN" --line-buffered "$@" | sed -r 's/^.*'"$SED_PATTERN"'.*$/\1/g'
|
| Not the most efficient way, I imagine, but it works fine for my
| use cases (in which I never need more than one capturing group
| anyway). Example invocation:
|
|     $ xgrep '(^[^:]+):.*:/nonexistent:' /etc/passwd
|     nobody
|     messagebus
|     _apt
|     tcpdump
|     whoopsie
| chatmasta wrote:
| ChatGPT has eliminated this class of problem for me. In fact
| it's pretty much all I use it for. Whether it's ffmpeg,
| tcpdump, imagemagick, SSH tunnels, Pandas, numpy, or some
| other esoteric program with its own DSL... ChatGPT can
| construct the arguments I need. And if it gets it wrong, it's
| usually one prompt away from fixing it.
| Thaxll wrote:
| TTY, console, shell, stdin/out, buffer, pipe, I wish there was a
| clear explanation somewhere of how all of those glue/work
| together.
| MathMonkeyMan wrote:
| Here's a resource for at least the first one:
| https://www.linusakesson.net/programming/tty/
| toast0 wrote:
| > when you press Ctrl-C on a pipe, the contents of the buffer are
| lost
|
| I _think_ most programs will flush their buffers on SIGINT... But
| for that to work from a shell, you'd need to deliver SIGINT to
| only the first program in the pipeline, and I guess that's not
| how that works.
| akdev1l wrote:
| The last process gets SIGINT and everything else gets SIGPIPE,
| iirc
| toast0 wrote:
| That makes sense to me, but the article implied everything
| got a SIGINT, but the last program got it first. Either way,
| you'd need a different way to ask the shell to do it the
| other way...
|
| Otoh, do programs routinely flush if they get SIGINFO? dd(1)
| on FreeBSD will output progress if you hit it with SIGINFO
| and continue its work, which you can trigger with ctrl+T if
| you haven't set it differently. But that probably goes to the
| foreground process, so probably doesn't help. And, there's
| the whole thing where SIGINFO isn't POSIX and isn't really in
| Linux, so it's hard to use there...
|
| This article [1] says tcpdump will output the packet counts,
| so it _might_ also flush buffers. I'll try to check and
| report a little later today.
|
| [1] https://freebsdfoundation.org/wp-
| content/uploads/2017/10/SIG...
| toast0 wrote:
| > This article [1] says tcpdump will output the packet
| counts, so it might also flush buffers, I'll try to check
| and report a little later today.
|
| I checked, tcpdump doesn't seem to flush stdout on siginfo,
| and hitting ctrl+T doesn't deliver it a siginfo in the
| tcpdump | grep case anyway. Killing tcpdump with sigint
| does work: tcpdump's output is flushed and it closes, and
| then the grep finishes too, but there's not a button to hit
| for that.
| tolciho wrote:
| No, INTR "generates a SIGINT signal which is sent to all
| processes in the foreground process group for which the
| terminal is the controlling terminal" (termios(4) on OpenBSD;
| what passes for unix elsewhere these days is similar), as
| complicated by what exactly is in the foreground process
| group (use tcgetpgrp(3) to determine that) and what signal
| masking or handlers those processes have (which can vary over
| the lifetime of a process, especially for a shell that does
| job control), or whether some process has disabled ISIG--the
| terminal being shared "global" state between one or more
| processes--in which case none of the prior may apply.
|     $ make pa re ci
|     cc -O2 -pipe -o pa pa.c
|     cc -O2 -pipe -o re re.c
|     cc -O2 -pipe -o ci ci.c
|     $ ./pa | ./re | ./ci > /dev/null
|     ^Cci (2) 66241 55611 55611
|     pa (2) 55611 55611 55611
|     re (2) 63366 55611 55611
|
| So with a "pa" program that prints "y" to stdout, and "re" and
| "ci" that are basically cat(1) except that these programs all
| print some diagnostic information and then exit when a
| SIGPIPE or SIGINT is received, here showing that (on OpenBSD,
| with ksh, at least) a SIGINT is sent to each process in the
| foreground process group (55611, also being logged is the
| getpgrp which is also 55611).
|
|     $ kill -l | grep INT
|      2    INT Interrupt              18    TSTP Suspended
| two_handfuls wrote:
| Maybe that's why my mbp sometimes appears not to see my keyboard
| input for a whole second even though nothing much is running.
| mg wrote:
| Related: Why pipes can be indeterministic.
|
| https://www.gibney.org/the_output_of_linux_pipes_can_be_inde...
| calibas wrote:
| Buffers are there for a good reason: it's extremely slow
| (relatively speaking) to print output on a screen compared to
| just writing it to a buffer. Printing something character-by-
| character is incredibly inefficient.
|
| This is an old problem, I encounter it often when working with
| UART, and there's a variety of possible solutions:
|
| Use a special character, like a new line, to signal the end of
| output (line-based).
|
| Use a length-based approach, such as waiting for 8KB of data.
|
| Use a time-based approach, and print the output every X
| milliseconds.
|
| Each approach has its own strengths and weaknesses; which one
| works best depends upon the application. I believe the article
| is incorrect when it says certain programs don't use buffering;
| they just don't use an obvious length-based approach.
| PhilipRoman wrote:
| Also, it's not just the work needed to actually handle the
| write on the backend - even just making that many syscalls to
| /dev/null can kill your performance.
| qazxcvbnmlp wrote:
| Having a layer or two above the interface aware of the
| constraint works best (when possible). A line-based approach
| does this but requires agreement on the character (newline).
| akira2501 wrote:
| Which is exactly why setbuf(3) and setvbuf(3) exist.
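|
| For reference, a minimal sketch of the three stdio modes (this
| is standard C; setvbuf must be called before any other
| operation on the stream):
|
|     #include <stdio.h>
|
|     int main(void) {
|         /* line buffered: flush on every '\n' */
|         setvbuf(stdout, NULL, _IOLBF, 0);
|
|         /* fully buffered with an 8KB buffer:
|            setvbuf(stdout, NULL, _IOFBF, 8192); */
|
|         /* unbuffered:
|            setvbuf(stdout, NULL, _IONBF, 0); */
|
|         printf("hello\n");  /* flushed immediately under _IOLBF */
|     }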
| pixelbeat wrote:
| Nice article. See also:
| https://www.pixelbeat.org/programming/stdio_buffering/
|
| It's also worth mentioning a recent improvement we made (in
| coreutils 8.28) to the operation of the `tail | grep` example in
| the article. tail now notices if the pipe goes away, so one could
| wait for something to appear in a log, like:
|
|     tail -f /log/file | grep -q match
|     then_do_something
|
| There are lots of gotchas to pipe handling really. See also:
| https://www.pixelbeat.org/programming/sigpipe_handling.html
| emcell wrote:
| one of the reasons why i hate computers :D
| jakub_g wrote:
| In our CI we used to have some ruby commands whose output was
| piped through a filter to prepend "HH:MM:SS" to each line to
| track progress (because GitLab
| still doesn't support this out of the box, though it's supposed
| to land in 17.0), but it would sometimes lead to some logs being
| flushed with a large delay.
|
| I knew it had something to do with buffers and it drove me nuts,
| but couldn't find a fix, all solutions tried didn't really work.
|
| (Problem got solved when we got rid of ruby in CI - it was
| legacy).
| josephcsible wrote:
| I've run into this before, and I've always wondered why programs
| don't just do this: when data gets added to a previously-empty
| output buffer, make the input non-blocking, and whenever a read
| comes back with EWOULDBLOCK, flush the output buffer and make the
| input blocking again. (Or in other words, always make sure the
| output buffer is flushed before waiting/going to sleep.) Wouldn't
| this fix the problem? Would it have any negative side effects?
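|
| A sketch of that policy (mine, not from any libc; error
| handling mostly elided):
|
|     #include <errno.h>
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void) {
|         char c;
|         /* input non-blocking while output may be pending */
|         fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK);
|
|         for (;;) {
|             ssize_t n = read(0, &c, 1);
|             if (n == 1) { putchar(c); continue; }  /* buffers output */
|             if (n == 0) break;                     /* EOF */
|             if (errno == EAGAIN || errno == EWOULDBLOCK) {
|                 fflush(stdout);  /* about to wait: flush first */
|                 /* switch back to blocking and wait for input */
|                 fcntl(0, F_SETFL, fcntl(0, F_GETFL) & ~O_NONBLOCK);
|                 n = read(0, &c, 1);
|                 if (n <= 0) break;
|                 putchar(c);
|                 fcntl(0, F_SETFL, fcntl(0, F_GETFL) | O_NONBLOCK);
|             } else break;  /* real error */
|         }
|         fflush(stdout);
|     }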
| kreetx wrote:
| From experience, `unbuffer` is the tool to use to turn buffering
| off reliably.
| radarsat1 wrote:
| Side note maybe, but is there an alternative to chaining two
| greps?
| ykonstant wrote:
| The most portable way to do it with minimal overhead is with
| sed:
|
|     sed -e '/pattern1/!d' -e '/pattern2/!d'
|
| which generalizes to more terms. Easier to remember and just as
| portable is
|
|     awk '/pattern1/ && /pattern2/'
|
| but now you need to launch a full awk.
|
| For more ways see
| https://unix.stackexchange.com/questions/55359/how-to-run-gr...
| SG- wrote:
| how important is buffering in 2024 on modern ultra fast single
| user systems I wonder? I'd be interested in seeing it disabled
| for testing purposes.
| chatmasta wrote:
| It depends what the consumer is doing with the data as it exits
| the buffer. If it's a terminal program printing every
| character, then it's going to be slow. Or more generally if
| it's any program that doesn't have its own buffering, then it
| will become the bottleneck so the slowdown will depend on how
| it processes input.
|
| Ultimately even "no buffer" still has a buffer, which is the
| number of bits it reads at a time. Maybe that's 1, or 64, but
| it still needs some boundary between iterations.
| akira2501 wrote:
| > on modern ultra fast single user systems I wonder?
|
| The latency of a 'syscall' is on the order of a few hundred
| instructions. You're switching to a different privilege mode,
| with a different memory map, and where your data ultimately has
| to leave the chip to reach hardware.
|
| It's absurdly important and it will never not be.
| londons_explore wrote:
| I'd like all buffers to be flushed whenever the systemwide CPU
| becomes idle.
|
| Buffering generally is a CPU-saving technique. If we had infinite
| CPU, all buffers would be 1 byte. Buffers are a way of collecting
| together data to process in a batch for efficiency.
|
| However, when the CPU becomes idle, we shouldn't have any work
| "waiting to be done". As soon as the kernel scheduler becomes
| idle, all processes should be sent a "flush your buffers" signal.
| frogulis wrote:
| Hopefully not a silly question: in the original example, even if
| we had enough log data coming from `tail` to fill up the first
| `grep` buffer, if the logfile ever stopped being updated, then
| there would likely be "stragglers" left in the `grep` buffer that
| were never outputted, right?
___________________________________________________________________
(page generated 2024-11-29 23:00 UTC)