[HN Gopher] How fast are Linux pipes anyway?
___________________________________________________________________
How fast are Linux pipes anyway?
Author : rostayob
Score : 576 points
Date : 2022-06-02 09:19 UTC (13 hours ago)
(HTM) web link (mazzo.li)
(TXT) w3m dump (mazzo.li)
| sandGorgon wrote:
| Android's flavor of Linux uses "binder" instead of pipes because
| of its security model. IMHO filesystem-based IPC mechanisms
| (notably pipes) can't be used because of the lack of a world-
| writable directory - I may be wrong here.
|
| Binder comes from Palm actually (OpenBinder)
| Matthias247 wrote:
| Pipes don't necessarily mean one has to use FS permissions. Eg
| a server could hand out anonymous pipes to authorized clients
| via fd passing on Unix domain sockets. The server can then
| implement an arbitrary permission check before doing this.
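|
| For reference, the fd-passing step boils down to an SCM_RIGHTS
| control message. A minimal C sketch (error handling omitted;
| "sock" is assumed to be a connected AF_UNIX socket, and send_fd
| is just an illustrative name):
|
|     #include <string.h>
|     #include <sys/socket.h>
|     #include <sys/uio.h>
|
|     /* Pass fd_to_pass (e.g. one end of a pipe) to the peer of
|        the connected Unix domain socket "sock". */
|     void send_fd(int sock, int fd_to_pass) {
|         char dummy = 'x';
|         struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
|         union {
|             struct cmsghdr hdr;
|             char buf[CMSG_SPACE(sizeof(int))];
|         } u;
|         struct msghdr msg = {
|             .msg_iov = &iov, .msg_iovlen = 1,
|             .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
|         };
|         struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
|         cmsg->cmsg_level = SOL_SOCKET;
|         cmsg->cmsg_type = SCM_RIGHTS;   /* payload carries fds */
|         cmsg->cmsg_len = CMSG_LEN(sizeof(int));
|         memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));
|         sendmsg(sock, &msg, 0);
|     }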
| megous wrote:
| "lack of a world-writable directory"
|
| What's that?
|
| A lot of programs store sockets in /run which is typically
| implemented by `tmpfs`.
| marcodiego wrote:
| The history of Binder is more involved and has its seeds in
| BeOS, IIRC.
| stackbutterflow wrote:
| This site is pleasing to the eye.
| apostate wrote:
| It looks like it is using the "Tufte" style, named after Edward
| Tufte, who is very famous for his writing on data
| visualization. More examples: https://rstudio.github.io/tufte/
| ianai wrote:
| I usually just use cat /dev/urandom > /dev/null to generate load.
| Not sure how this compares to their code.
|
| Edit: it's actually "yes" that I've used before for generating
| load. I remember reading somewhere "yes" was optimized
| differently than the original Unix command as part of the unix
| certification lawsuit(s).
|
| Long night.
| yakubin wrote:
| On 5.10.0-14-amd64 "pv < /dev/urandom >/dev/null" reports
| 72.2MiB/s. "pv < /dev/zero >/dev/null" reports 16.5GiB/s. AMD
| Ryzen 7 2700X with 16GB of DDR4 3000MHz memory.
|
| "tr '\0' 1 </dev/zero | pv >/dev/null" reports 1.38GiB/s.
|
| "yes | pv >/dev/null" reports 7.26GiB/s.
|
| So "/dev/urandom" may not be the best source when testing
| performance.
| sumtechguy wrote:
| I think they were generating load? Going through the urandom
| device isn't bad for that, as it has to do a bit of work to get
| each random number. Just for throughput, though, zero is
| probably better.
| gtirloni wrote:
| "Generating load" for measuring pipe performance means
| generating bytes. Any bytes. urandom is terrible for that.
| yakubin wrote:
| I don't understand. If you're testing how fast pipes are,
| then I'd expect you to measure throughput or latency. Why
| would you measure how fast something unrelated to pipes is?
| If you want to measure this other thing on the other hand,
| why would you bother with pipes, which add noise to the
| measurement?
|
| UPDATE: If you mean that you want to test how fast pipes
| are when there is other load in the system, then I'd
| suggest just running a lot of stuff in the background. But
| I wouldn't put the process dedicated for doing something
| else into the pipeline you're measuring. As a matter of
| fact, the numbers I gave were taken with plenty of heavy
| processes running in the background, such as Firefox,
| Thunderbird, a VM with another instance of Firefox,
| OpenVPN, etc. etc. :)
| khorne wrote:
| Because they mentioned generating load, not testing pipe
| performance.
| yakubin wrote:
| Oh, wait. You mean that this "cat </dev/urandom
| >/dev/null" was meant to be running in the background and
| not be the pipeline which is tested? Ok, my bad for not
| getting the point.
| ianai wrote:
| You're right and I mistyped; it's yes that I usually use. I
| think it's optimized for throughput.
| spacedcowboy wrote:
| Ran the basic initial implementation on my Mac Studio and was
| pleasantly surprised to see:
|
|     @elysium pipetest % pipetest | pv > /dev/null
|     102GiB 0:00:13 [8.00GiB/s]
|     @elysium ~ % pv < /dev/zero > /dev/null
|     143GiB 0:00:04 [36.4GiB/s]
|
| Not a valid comparison between the two machines because I don't
| know what the original machine is, but MacOS rarely comes out
| shining in this sort of comparison, and the simplistic approach
| here giving 8 GB/s rather than the author's 3.5 GB/s was better
| than I'd expected, even given the machine I'm using.
| mhh__ wrote:
| Given the machine as in a brand new Mac?
| spacedcowboy wrote:
| Given that the machine is the most performant Mac that Apple
| makes.
| [deleted]
| sylware wrote:
| Yep, you want perf? Don't mutex then yield; spin, and check
| your CPU heat sink.
|
| :)
| jagrsw wrote:
| Something maybe a bit related.
|
| I just had 25Gb/s internet installed
| (https://www.init7.net/en/internet/fiber7/), and at those speeds
| Chrome and Firefox (which is Chrome-based) pretty much die when
| using speedtest.net at around 10-12Gbps.
|
| The symptoms are that the whole tab freezes, and the shown speed
| drops from those 10-12Gbps to <1Gbps and the page starts updating
| itself only every second or so.
|
| IIRC Chrome-based browsers use some form of IPC with a separate
| networking process, which actually handles networking, I wonder
| if this might be the case that the local speed limit for
| socketpair/pipe under Linux was reached and that's why I'm seeing
| this.
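|
| Tangentially: pipe capacity itself is adjustable on Linux. A
| small sketch, assuming a kernel with F_GETPIPE_SZ/F_SETPIPE_SZ
| (the maximum is capped by /proc/sys/fs/pipe-max-size):
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void) {
|         int fds[2];
|         pipe(fds);
|         /* Default capacity is 64 KiB on typical systems. */
|         printf("default: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));
|         fcntl(fds[0], F_SETPIPE_SZ, 1 << 20);  /* request 1 MiB */
|         printf("grown:   %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));
|         return 0;
|     }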
| [deleted]
| Spooky23 wrote:
| I ran into this with a VDI environment in a data center. We had
| initially delivered 10Gb Ethernet to the VMs, because why not.
|
| Turned out Windows 7 or the NICs needed a lot of tuning to
| work well. There was a lot of freezing and other failures.
| implying wrote:
| Firefox is not based on the chromium codebase, it is older.
| formerly_proven wrote:
| Well if we're talking ancestors that's technically true, but
| not by that much - Firefox comes from Netscape,
| Chrome/Safari/... come from KHTML.
| elpescado wrote:
| AFAIR, KHTML was/is not related to Netscape/Gecko in any
| way.
| wodenokoto wrote:
| > ... on August 16, 1999 that [Lars Knoll] had checked in
| what amounted to a complete rewrite of the KHTML library--
| changing KHTML to use the standard W3C DOM as its internal
| document representation.
| https://en.wikipedia.org/wiki/KHTML#Re-write_and_improvement
|
| > In March 1998, Netscape released most of the code base
| for its popular Netscape Communicator suite under an open
| source license. The name of the application developed from
| this would be Mozilla, coordinated by the newly created
| Mozilla Organization. https://en.wikipedia.org/wiki/Mozilla_Application_Suite#Hist...
|
| Netscape Communicator (or Netscape 4) was released in 1997,
| so if we are tracing lineage, I'd say Firefox has a 2-year
| head start.
| def- wrote:
| Firefox is only Chrome-based on iOS.
| rwaksmunski wrote:
| You mean WebKit.
| karamanolev wrote:
| It's Safari-based, which is Webkit-based. Chrome is also
| Safari-based on iOS, because all the browsers must be.
| There's no actual Chrome (as in Blink, the browser engine) on
| iOS, at least in Play Store.
| Izkata wrote:
| > It's Safari-based, which is Webkit-based.
|
| Firefox only uses Webkit on iOS, due to Apple requirements.
| It uses Gecko everywhere else. And I don't think it's ever
| been Safari-based anywhere.
| jve wrote:
| Do you actually mean Gbit/s? 25Gb/s would translate to
| 200Gbit/s ...
| Denvercoder9 wrote:
| The small "b" is customarily used to refer to bits, with the
| large "B" used to refer to bytes. So 25 Gb/s would be 25
| Gbit/s, while 25 GB/s would be 200 Gbit/s.
| karamanolev wrote:
| Gb != GB. Per Wikipedia, which aligns with my understanding,
|
| "The gigabit has the unit symbol Gbit or Gb."
|
| 25GB/s would translate to 200Gbit/s and also 200Gb/s.
| reitanqild wrote:
| > and at those speeds Chrome and Firefox (which is Chrome-
| based)
|
| AFAIK, Firefox is not Chrome-based anywhere.
|
| On iOS it uses whatever iOS provides for webview - as does
| Chrome on iOS.
|
| Firefox and Safari are now the only supported mainstream
| browsers that have their own rendering engines. Firefox is the
| only one that has its own rendering engine and is
| cross-platform. It is also open source.
| yosamino wrote:
| > AFAIK, Firefox is not Chrome-based anywhere.
|
| Not technically "Chrome-based", but Firefox draws graphics
| using Chrome's Skia graphics engine.
|
| Firefox is not completely independent from Chrome.
| bawolff wrote:
| I feel like counting every library is silly.
|
| In any case, I thought Chrome used libnss, which is a Mozilla
| library, so you could say the reverse as well.
| SahAssar wrote:
| Skia started in 2004 independently of Google and was then
| acquired by Google. Calling it "Chrome's Skia graphics
| engine" makes it sound like it was built _for_ Chrome.
| [deleted]
| [deleted]
| SahAssar wrote:
| > Firefox is the only one that has its own rendering engine
| and is cross-platform.
|
| Interestingly, Safari's rendering engine is open source and
| cross-platform, but the browser is not. Lots of Linux-focused
| browsers (Konqueror, GNOME Web, surf) and most embedded
| browsers (Nintendo DS & Switch, PlayStation) use WebKit. Also
| some user interfaces (like webOS, which is running on all of
| LG's TVs and smart refrigerators) use WebKit as their
| renderer.
| qwerty456127 wrote:
| WebKit itself is a fork of Konqueror's original KHTML engine,
| by the way.
| tmccrary55 wrote:
| Browser Genealogy
| cturtle wrote:
| Now I want to see the family tree!
| capableweb wrote:
| Ask, and you shall receive :)
|
| https://en.wikipedia.org/wiki/File:Timeline_of_web_browsers....
|
| https://en.wikipedia.org/wiki/Timeline_of_web_browsers
| has tables as well.
| tinus_hn wrote:
| iOS uses WebKit, which is also what Chrome is based on.
| Comevius wrote:
| Chrome uses Blink, which was forked from WebKit's WebCore
| in 2013. They replaced JavaScriptCore with V8.
| merightnow wrote:
| Unrelated question: what hardware do you use to set up your
| network for 25Gb/s? I've been looking at init7 for a while, but
| gave up and stayed with Salt after trying to find the right
| hardware for the job.
| jagrsw wrote:
| NIC: Intel E810-XXVDA2
|
| Optics: To ISP: Flexoptics
| (https://www.flexoptix.net/de/p-b1625g-10-ad.html?co10426=972...),
| Router-PC: https://mikrotik.com/product/S-3553LC20D
|
| Router: Mikrotik CCR-2004 -
| https://mikrotik.com/product/ccr2004_1g_12s_2xs - warning:
| it's good to up to ~20Gb/s one way. It can handle ~25Gb/s
| down, but only ~18Gb/s up, and with IPv6 the max seems to be
| ~10Gb/s any direction.
|
| If Mikrotik is something you're comfortable using you can
| also take a look at
| https://mikrotik.com/product/ccr2216_1g_12xs_2xq - it's more
| expensive (~2500EUR), but should handle 25Gb/s easily.
| zrail wrote:
| IIRC most Mikrotik products lack hardware IPv6 offload
| which is probably why you're seeing lower speeds.
| BenjiWiebe wrote:
| In that case 10Gb/s sounds actually pretty good, if
| that's without hardware offload.
| sph wrote:
| This makes me wonder... does anyone offer an iperf-based
| speedtest service on the Internet?
| scoopr wrote:
| Well there are some public iperf servers listed here:
| https://iperf.fr/iperf-servers.php
| jagrsw wrote:
| Ha.. my ISP does :) I can hit those 25Gb/s when connecting
| directly (bypassing the router as it barely handles those
| 25Gb/s).
|
| With it in the way I get ~15-20Gb/s:
|
|     $ iperf3 -l 1M --window 64M -P10 -c speedtest.init7.net
|     ..
|     [SUM] 0.00-1.00 sec 1.87 GBytes 16.0 Gbits/sec 181406
|     $ iperf3 -R -l 1M --window 64M -P10 -c speedtest.init7.net
|     ..
|     [SUM] 0.00-1.00 sec 2.29 GBytes 19.6 Gbits/sec
| [deleted]
| jcims wrote:
| Speedtest does have a CLI as well, might be interesting to
| compare them.
| jagrsw wrote:
| Yup, the CLI version works well -
| https://www.speedtest.net/result/c/e9104814-294f-4927-af9f-d...
| zrail wrote:
| Thing to note: the open source version on GitHub, installable
| by homebrew and native package managers, is not the same
| version as Ookla distributes from their website and is not
| accurate at all.
| [deleted]
| pca006132 wrote:
| Is it only affecting the browser or the entire system? It might
| be possible that the CPU is busy handling interrupts from the
| ethernet controller, although in general these controllers
| should use DMA and should not send interrupts frequently.
| jagrsw wrote:
| Only browser(s), the OS is capable of 25Gb/s - checked with
| iperf and also speedtest-cli -
| https://www.speedtest.net/result/c/e9104814-294f-4927-af9f-d...
| jcranberry wrote:
| Sounds like a hard drive cache filling up.
| megous wrote:
| One would assume a speed-testing website would use
| `Cache-Control: no-store`...
|
| But alas, they do not, lol. They just use no-cache on the
| query, which will not prevent the browser from storing the
| data.
|
| https://megous.com/dl/tmp/8112dd9346dd66e8.png
| bayindirh wrote:
| Chrome fires up many processes and creates an IPC-based comm
| network between them to isolate stuff. It's somewhat abusing
| your OS to get what it wants in terms of isolation and whatnot.
|
| (Which is similar to how K8S abuses iptables and makes it
| useless for other ends, forcing you to install a dedicated
| firewall in front of your ingress path, but let's not digress.)
|
| On the other hand, Firefox is neither Chromium-based, nor a
| cousin of it. It's a completely different codebase, inherited
| from the Netscape days and evolved up to this point.
|
| As another test point, Firefox doesn't even blink at a
| symmetric gigabit connection going at full speed (my network is
| capped by my NIC, the pipe is _way_ fatter).
| jagrsw wrote:
| > As another test point, Firefox doesn't even blink at a
| symmetric gigabit connection going at full speed (my network
| is capped by my NIC, the pipe is way fatter).
|
| FWIW Firefox under Linux (Firefox Browser 100.0.2 (64-bit))
| behaves pretty much the same as Chrome. The speed rises
| quickly to 5-8Gb/s, then the UI starts choking, and the shown
| speed drops to 500Mb/s. It could be that there's some
| scheduling limit or other bottleneck hit in the OS itself,
| assuming these are different codebases (are they?).
| bayindirh wrote:
| I'd love to test and debug the path where it dies, but none
| of the systems we have Firefox on have pipes that fat (again,
| NIC-limited).
|
| However, you can test the limits of Linux by installing the
| CLI version of Speedtest and hitting a nearby server.
|
| The bottleneck may be in the browser itself, or in your
| graphics stack, too.
|
| Linux can do pretty amazing things in the network
| department, otherwise 100Gbps Infiniband cards wouldn't be
| possible on Linux servers, yet we have them on our systems.
|
| And yes, Chrome and Firefox are very different browsers. I
| can confidently say this because I've been using Firefox
| since it was called Netscape 6.0 (and Mozilla in Knoppix).
| jeffreygoesto wrote:
| From my experience long ago, all high-performance
| networking under Linux was traditionally user space and
| pre-allocated pools (netmap, dpdk, pf-ring...). I did not
| follow how much io_uring has been catching up for network
| stack usage... Maybe somebody else knows?
| sophacles wrote:
| I have a service that beats epoll with io_uring (it reads
| gre packets from one socket, and does some
| lookups/munging on the inner packet and re-encaps them to
| a different mechanism and writes them back to a different
| socket). General usage for io_uring vs epoll is pretty
| comparable IIUC. It wouldn't surprise me if streams (e.g.
| tcp) end up being faster via io_uring and buffer
| registration though.
|
| Totally tangential - it looks like io_uring is evolving
| beyond just io and into an alternate syscall interface,
| which is pretty neat imho.
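|
| For anyone curious, the basic shape of a single op through
| liburing looks something like this (a minimal sketch, assuming
| liburing is installed; link with -luring):
|
|     #include <liburing.h>
|
|     /* Submit one read and synchronously wait for completion. */
|     int read_with_uring(int fd, char *buf, unsigned len) {
|         struct io_uring ring;
|         io_uring_queue_init(8, &ring, 0);
|
|         struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|         io_uring_prep_read(sqe, fd, buf, len, 0);
|         io_uring_submit(&ring);
|
|         struct io_uring_cqe *cqe;
|         io_uring_wait_cqe(&ring, &cqe);
|         int res = cqe->res;   /* bytes read, or -errno */
|         io_uring_cqe_seen(&ring, cqe);
|         io_uring_queue_exit(&ring);
|         return res;
|     }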
| bayindirh wrote:
| While I'm not very knowledgeable in specifics, there are
| many paths for networking in Linux now. The usual kernel
| based one is there, also there's kernel-bypass [0] paths
| used by very high performance cards.
|
| Also, Infiniband can directly RDMA to and from MPI
| processes for making "remote memory local", allowing very
| low latencies and high performance in HPC environments.
|
| I also like this post from Cloudflare [1]. I've read it
| completely, but the specifics are lost on me since I'm
| not directly concerned with the network part of our
| system.
|
| [0]: https://medium.com/@penberg/on-kernel-bypass-networking-and-...
|
| [1]: https://blog.cloudflare.com/how-to-receive-a-million-packets...
| bawolff wrote:
| > I can confidently say this, because I'm using Firefox
| since it's called Netscape 6.0 (and Mozilla in Knoppix).
|
| Mozilla suite/seamonkey isn't usually considered the same
| as firefox, although obviously related.
| bayindirh wrote:
| I'm not talking about the version which evolved to
| Seamonkey. I'm talking about Mozilla/Firefox 0.8 which
| had a Mozilla logo as a "Spinner" instead of Netscape
| logo on the top right.
| bawolff wrote:
| Netscape 6 was not firefox based
| https://en.m.wikipedia.org/wiki/Netscape_6
|
| Firefox 0.8 did not have netscape branding
| http://theseblog.free.fr/firefox-0.8.jpg
| pjmlp wrote:
| It is using OS processes for what they were created for in
| the first place.
|
| Unfortunately the security industry has proven why threads
| are a bad idea for applications when security is a top
| concern.
|
| The same applies to dynamically loaded code as plugins, where
| the host application takes the blame for all the instability
| and exploits they introduce.
| bayindirh wrote:
| Yes, Firefox is also doing the same, however due to the
| nature of Firefox's processes, the OS doesn't lose much
| responsiveness or feel bogged down when I have 50+ tabs open
| due to some research.
|
| If you need security, you need isolation. If you want
| hardware-level isolation, you need processes. That's
| normal.
|
| My disagreement with Google's applications is how they
| behave like they're the only running processes on the
| system itself. I'm pretty aware that some of the most
| performant or secure things don't have the prettiest
| implementation on paper.
| ReactiveJelly wrote:
| There used to be a setting to tweak Chrome's process
| behavior.
|
| I believe the default behavior is "Coalesce tabs into the
| same content process if they're from the same trust
| domain".
|
| Then you can make it more aggressive like "Don't coalesce
| tabs ever" or less aggressive like "Just have one content
| process". I think.
|
| I'm not sure how Firefox decides when to spawn new
| processes. I know they have one GPU process and then
| multiple untrusted "content processes" that can touch
| untrusted data but can't touch the GPU.
|
| I don't mind it. It's a trade-off between security and
| overhead. The IPC is pretty efficient and the page cache
| in both Windows and Linux _should_ mean that all the code
| pages are shared between all content processes.
|
| Static pages actually feel light to me. I think crappy
| webapps make the web slow, not browser security.
|
| (inb4 I'm replying to someone who works on the Firefox
| IPC team or something lol)
| girvo wrote:
| > inb4 I'm replying to someone who works on the Firefox
| IPC team or something lol
|
| The danger and joy of commenting on HN!
| bayindirh wrote:
| I'm harmless, don't worry. :) Also you can find more
| information about me in my profile.
|
| Even if I was working on Firefox/Chrome/whatever, I'd not
| be mad at someone who doesn't know something very well.
| Why should I? We're just conversing here.
|
| Also, I've been very wrong here at times, and this
| improved my conversation / discussion skills a great
| deal.
|
| So, don't worry, and comment away.
| mastax wrote:
| I'm glad huge pages make a big difference because I just spent
| several hours setting them up. Also everyone says to disable
| transparent_hugepage, so I set it to `madvise`, but I'm skeptical
| that any programs outside databases will actually use them.
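|
| With transparent_hugepage set to `madvise`, a program has to
| opt in per mapping; a minimal sketch (the 2 MiB size is an
| assumption, matching one x86-64 huge page):
|
|     #include <sys/mman.h>
|
|     int main(void) {
|         size_t len = 2UL * 1024 * 1024;
|         void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
|                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|         if (buf == MAP_FAILED)
|             return 1;
|         /* Ask the kernel to back this range with huge pages. */
|         madvise(buf, len, MADV_HUGEPAGE);
|         return 0;
|     }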
| deagle50 wrote:
| JVM can. I have JetBrains set up to use them.
| gigatexal wrote:
| Now this is the kind of content I come to HN for. Absolutely
| fascinating read.
| lazide wrote:
| The majority of this overhead (and the slow transfers) naively
| seems to be in the scripts/systems using the pipes.
|
| I was worried when I saw zfs send/receive used pipes, for
| instance, because of performance worries - but using it in
| reality I had no problems pushing 800MB/s+. It seemed limited
| by IOPS on my local disk arrays, not any limits in pipe
| performance.
| Matthias247 wrote:
| Right. I'm actually surprised the test with 256kB transfers
| gives reasonable results, and would rather have tested with >
| 1GB instead. For such a small transfer it seemed likely that
| the overhead of spawning the process and loading libraries by
| far dominates the amount of actual work. I'm also surprised
| this didn't show up in profiles. But it obviously depends on
| where the measurement start and end points are.
| azornathogron wrote:
| Perhaps I've misunderstood what you're referring to, but the
| test in the article is measuring speed transferring 10 GiB.
| 256 KiB is just the buffer size.
| Matthias247 wrote:
| The first C program in the blog post allocates a 256kB
| buffer and writes that one exactly once to stdout. I don't
| see another loop which writes it multiple times.
| azornathogron wrote:
| There's an outer while(true){} loop - the write side just
| writes continuously.
|
| More generally though, sidenote 5 says that the code in
| the article itself is incomplete and the real test code
| is available in the github repo:
| https://github.com/bitonic/pipes-speed-test
| BeeOnRope wrote:
| This is a well-written article with excellent explanations and I
| thoroughly enjoyed it.
|
| However, none of the variants using vmsplice (i.e., all but the
| slowest) are safe. When you gift [1] pages to the kernel there is
| no reliable general purpose way to know when the pages are safe
| to reuse again.
|
| This post (and the earlier FizzBuzz variant) try to get around
| this by assuming the pages are available again after "pipe size"
| bytes have been written after the gift, _but this is not true in
| general_. For example, the read side may also use splice-like
| calls to move the pages to another pipe or IO queue in zero-copy
| way so the lifetime of the page can extend beyond the original
| pipe.
|
| This will show up as race conditions and spontaneously changing
| data, where a downstream consumer sees the page suddenly change
| as it is overwritten by the original process.
|
| The author of these splice methods, Jens Axboe, had proposed a
| mechanism which enabled you to determine when it was safe to
| reuse the page, but as far as I know nothing was ever merged. So
| the scenarios where you can use this are limited to those where
| you control both ends of the pipe and can be sure of the exact
| page lifetime.
|
| ---
|
| [1] Specifically, using SPLICE_F_GIFT.
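|
| For context, the pattern being criticized looks roughly like
| this double-buffering sketch (not the post's exact code; it
| assumes BUF_SIZE matches the pipe capacity, which is exactly
| the assumption that is unsafe in general):
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <string.h>
|     #include <sys/uio.h>
|
|     #define BUF_SIZE (1 << 16)  /* assumed pipe capacity */
|
|     static char bufs[2][BUF_SIZE];
|
|     void writer(int pipe_fd) {
|         for (int i = 0; ; i ^= 1) {
|             /* Refill a buffer, *assuming* its pages have left
|                the pipe by now - the unsafe part. */
|             memset(bufs[i], 'x', BUF_SIZE);
|             struct iovec iov = { .iov_base = bufs[i],
|                                  .iov_len = BUF_SIZE };
|             /* No SPLICE_F_GIFT; short writes ignored here. */
|             vmsplice(pipe_fd, &iov, 1, 0);
|         }
|     }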
| haberman wrote:
| What if the writer frees the memory entirely? Can you segv the
| reader? That would be quite a dangerous pattern.
| rostayob wrote:
| (I am the author of the post)
|
| I haven't digested this comment fully yet, but just to be
| clear, I am _not_ using SPLICE_F_GIFT (and I don't think the
| fizzbuzz program is either). However I think what you're saying
| makes sense in general, SPLICE_F_GIFT or not.
|
| Are you sure this unsafety depends on SPLICE_F_GIFT?
|
| Also, do you have a reference to the discussions regarding this
| (presumably on LKML)?
| rostayob wrote:
| Actually, from re-reading the man page for vmsplice, it seems
| like it _should_ depend on SPLICE_F_GIFT (or in other words,
| it should be safe without it).
|
| But from what I know about how vmsplice is implemented,
| gifting or not, it sounds like it should be unsafe anyhow.
| DerSaidin wrote:
| Hello
|
| https://mazzo.li/posts/fast-pipes.html#what-are-pipes-made-o...
|
| I think the diagram near the start of this section has "head"
| and "tail" swapped.
|
| Edit: Nevermind, I didn't read far enough.
| BeeOnRope wrote:
| Yeah my mention of gift was a red herring: I had assumed gift
| was being used but the same general problem (the "page
| garbage collection issue") crops up regardless.
|
| If you don't use gift, you never know when the pages are free
| to use again, so in principle you need to keep writing to new
| buffers indefinitely. One "solution" to this problem is to
| gift the pages, in which case the kernel does the GC for you,
| but you need to churn through new pages constantly because
| you've gifted the old ones. Gift is especially useful when
| the page gifted can be used directly in the page cache (i.e.,
| writing a file, not a pipe).
|
| Without gift some consumption patterns may be safe but I
| think they are exactly those which involve a copy (not using
| gift means that a copy will occur for additional read-side
| scenarios). Ultimately the problem is that if some downstream
| process is able to get a zero-copy view of a page from an
| upstream writer, how can this be safe against concurrent
| modification? The pipe size trick is one way it could work,
| but it doesn't pan out because the pages may live beyond the
| immediate pipe (this is actually alluded to in the FizzBuzz
| article, where they mentioned things blew up if more than one
| pipe was involved).
| rostayob wrote:
| Yes, this all makes sense, although like everything
| splicing-related, it is very subtle. Maybe I should have
| mentioned the subtleness and dangerousness of splicing at
| the beginning, rather than at the end.
|
| I still think the man page of vmsplice is quite misleading!
| Specifically:
|
|     SPLICE_F_GIFT
|         The user pages are a gift to the kernel. The
|         application may not modify this memory ever, otherwise
|         the page cache and on-disk data may differ. Gifting
|         pages to the kernel means that a subsequent splice(2)
|         SPLICE_F_MOVE can successfully move the pages; if this
|         flag is not specified, then a subsequent splice(2)
|         SPLICE_F_MOVE must copy the pages. Data must also be
|         properly page aligned, both in memory and length.
|
| To me, this indicates that if we're _not_ using
| SPLICE_F_GIFT downstream splices will be automatically
| taken care of, safety-wise.
| scottlamb wrote:
| Hmm, reading this side-by-side with a paragraph from
| BeeOnRope's comment:
|
| > This post (and the earlier FizzBuzz variant) try to get
| around this by assuming the pages are available again
| after "pipe size" bytes have been written after the gift,
| _but this is not true in general_. For example, the read
| side may also use splice-like calls to move the pages to
| another pipe or IO queue in zero-copy way so the lifetime
| of the page can extend beyond the original pipe.
|
| The paragraph you quoted says that the "splice-like calls
| to move the pages" actually copy when SPLICE_F_GIFT is
| not specified. So perhaps the combination of not using
| SPLICE_F_GIFT and waiting until "pipe size" bytes have
| been written is safe.
| BeeOnRope wrote:
| Yes it is not clear to me when the copy actually happens
| but I had assumed the > 30 GB/s result after read was
| changed to use splice must imply zero copy.
| rostayob wrote:
| It could be that when splicing to /dev/null (which I'm
| doing), the kernel knows that their content is never
| witnessed, and therefore no copy is required. But I
| haven't verified that.
| scottlamb wrote:
| Makes sense. If so, some of the nice benchmark numbers
| for vmsplice would go away in a real scenario, so that'd
| be nice to know.
| BeeOnRope wrote:
| Splicing seems to work well for the "middle" part of a
| chain of piped processes, e.g., how pv works: it can
| splice pages from one pipe to another w/o needing to
| worry about reusing the page since someone upstream
| already wrote the page.
|
| Similarly for splicing from a pipe to a file or something
| like that. It's really the end(s) of the chain that want
| to (a) generate the data in memory or (b) read the data
| in memory that seem to create the problem.
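|
| A sketch of that pipe-to-pipe middle stage (the 64 KiB chunk
| size is an arbitrary choice):
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <sys/types.h>
|
|     /* Move data between two pipes without copying it through
|        userspace. */
|     void pump(int pipe_in, int pipe_out) {
|         for (;;) {
|             ssize_t n = splice(pipe_in, NULL, pipe_out, NULL,
|                                1 << 16, SPLICE_F_MOVE);
|             if (n <= 0)
|                 break;  /* EOF or error */
|         }
|     }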
| scottlamb wrote:
| I think you're right that the same problem applies without
| SPLICE_F_GIFT. One of the other fizzbuzz code golfers
| discusses that here:
| https://codegolf.stackexchange.com/a/239848
|
| I wonder if io_uring handles this (yet). io_uring is a newer
| async IO mechanism by the same author which tells you when
| your IOs have completed. So you might think it would:
|
| * But from a quick look, I think its vmsplice equivalent
| operation just tells you when the syscall would have
| returned, so maybe not. [edit: actually, looks like there's
| not even an IORING_OP_VMSPLICE operation in the latest
| mainline tree yet, just drafts on lkml. Maybe if/when the
| vmsplice op is added, it will wait to return for the right
| time.]
|
| * And in this case (no other syscalls or work to perform
| while waiting) I don't see any advantage in io_uring's
| read/write operations over just plain synchronous read/write.
| Matthias247 wrote:
| uring only really applies for async IO - and would tell you
| when an otherwise blocking syscall would have finished.
| Since the benchmark here uses blocking calls, there
| shouldn't be any change in behavior. The lifetime of the
| buffer is an orthogonal concern to the lifetime of the
| operation. Even if the kernel knows when the operation is
| done inside the kernel it wouldn't have a way to know
| whether the consuming application is done with it.
| scottlamb wrote:
| > uring only really applies for async IO - and would tell
| you when an otherwise blocking syscall would have
| finished. Since the benchmark here uses blocking calls,
| there shouldn't be any change in behavior. The lifetime
| of the buffer is an orthogonal concern to the lifetime of
| the operation. Even if the kernel knows when the
| operation is done inside the kernel it wouldn't have a
| way to know whether the consuming application is done
| with it.
|
| That doesn't match what I've read. E.g.
| https://lwn.net/Articles/810414/ opens with "At its core,
| io_uring is a mechanism for performing asynchronous I/O,
| but it has been steadily growing beyond that use case and
| adding new capabilities."
|
| More precisely:
|
| * While most/all ops are async IO now, is there any
| reason to believe folks won't want to extend it to batch
| basically any hot-path non-vDSO syscall? As I said,
| batching doesn't help here, but it does in a lot of other
| scenarios.
|
| * Several IORING_OP_s seem to be growing capabilities
| that aren't matched by like-named syscalls. E.g. IO
| without file descriptors, registered buffers, automatic
| buffer selection, multishot, and (as of a month ago)
| "ring mapped supplied buffers". Beyond the individual
| operation level, support for chains. Why not a mechanism
| that signals completion when the buffer passed to
| vmsplice is available for reuse? (Maybe by essentially
| delaying the vmsplice syscall's return [1], maybe by a
| second command, maybe by some extra completion event from
| the same command, details TBD.)
|
| [1] edit: although I guess that's not ideal. The reader
| side could move the page and want to examine following
| bytes, but those won't get written until the writer sees
| the vmsplice return and issues further writes.
| BeeOnRope wrote:
| Yeah this.
|
| The vanilla io_uring fits "naturally" in an async model,
| but batching and some of the other capabilities it
| provide are definitely useful for stuff written to a
| synchronous model too.
|
| Additionally, io_uring can avoid syscalls sometimes even
| without any explicit batching by the application, because
| it can poll the submission queue (root only, last time I
| checked unfortunately): so with the right setup a series
| of "synchronous" ops via io_uring (i.e., submit &
| immediately wait for the response) could happen with < 1
| user-kernel transition per op, because the kernel is busy
| servicing ops directly from the incoming queue and the
| application gets the response during its polling phase
| before it waits.
| yxhuvud wrote:
| Perhaps it could be sort of simulated in uring, using the
| splice op against a memfd that has been mmapped in advance?
| I wonder how fast that could be and how it would compare
| safety-wise.
| BeeOnRope wrote:
| I don't know if io_uring provides a mechanism to solve this
| page ownership thing but I bet Jens does: I've asked [1].
|
| ---
|
| [1]
| https://twitter.com/trav_downs/status/1532491167077572608
| robocat wrote:
| > However, none of the variants using vmsplice (i.e., all but
| the slowest) are safe. When you gift [1] pages to the kernel
| there is no reliable general purpose way to know when the pages
| are safe to reuse again. [snip] This will show up as race
| conditions and spontaneously changing data where a downstream
| consumer sees the page suddenly change as it it overwritten by
| the original process.
|
| That sounds like a security issue - the ability of an upstream
| generator process to write into the memory of a downstream
| reader process (or, more perversely, vice versa, which is even
| worse). I presume that the Linux kernel only lets this happen
| (zero copy) when the two processes are running as the same
| user?
| hamandcheese wrote:
| It's not clear to me that the kernel allows the receiving
| process to write instead of just read.
|
| But also, if you are sending data, why would you later
| read/process that send buffer?
|
| The only attack vector I could imagine would be if one sender
| was splicing the same memory to two or more receivers. A
| malicious receiver with write access to the spliced memory
| could compromise other readers.
| nice2meetu wrote:
| I once had to change my mental model for how fast some of these
| things were. I was using `seq` as an input for something else,
| and my thinking was along the lines that it is a small generator
| program running hot in the cpu and would be super quick.
| Specifically because it would only be writing things out to
| memory for the next program to consume, not reading anything in.
|
| But that was way off, and `seq` turned out to be ridiculously
| slow. I dug down a little and made a faster version of `seq`,
| which kind of got me what I wanted. But then I noticed at the
| end that the point was moot, because just piping it to the
| next program over the command line was going to be the slow
| point anyway.
|
| https://github.com/tverniquet/hseq
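|
| The usual trick for a fast generator is the same buffering
| discussed elsewhere in the thread; a minimal C sketch of a
| buffered counter (an illustration, not the linked hseq):
|
|     #include <stdio.h>
|     #include <unistd.h>
|
|     int main(void) {
|         static char buf[1 << 16];
|         size_t used = 0;
|         for (long i = 1; i <= 100000000; i++) {
|             /* Format into a buffer; one write() per ~64 KiB
|                instead of one per line. */
|             used += sprintf(buf + used, "%ld\n", i);
|             if (used > sizeof(buf) - 32) {
|                 write(1, buf, used);
|                 used = 0;
|             }
|         }
|         write(1, buf, used);
|         return 0;
|     }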
| freedomben wrote:
| I had a somewhat similar discovery once using GNU parallel. I
| was trying to generate as much web traffic as possible from a
| single machine to load test a service I was building, and I
| assumed that the network I/O would be the bottleneck by a long
| shot, not the overhead of spawning many processes. I was
| disappointed by the amount of traffic generated, so I rewrote
| it in Ruby using the parallel gem with threads (instead of
| processes), and got orders of magnitude more performance.
| strictfp wrote:
| Node is great for this use case.
| Klasiaster wrote:
| Netmap offers zero-copy pipes (included in FreeBSD, on Linux it's
| a third party module):
| https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4
| v3gas wrote:
| Love the subtle stonks background in the first image.
| [deleted]
| alex_hirner wrote:
| Does an API similar to vmsplice exist for Windows?
| herodoturtle wrote:
| This was a long but highly insightful read!
|
| (And as an aside, the combination of that font with the hand-
| drawn diagrams is really cool)
| arkitaip wrote:
| The visual design is amazing.
| anotherhue wrote:
| pv is written in perl so isn't the snappiest, I'm surprised to
| see it score so highly. I wonder what the initial speed would
| have been if it just wrote to /dev/null
| merpkz wrote:
| Confused with parallel, maybe?
| rostayob wrote:
| It's not written in perl, it's written in C, and it uses
| splice() (one of the syscalls discussed in the post).
| anotherhue wrote:
| I was totally wrong. Thank you for showing me the facts.
| karamanolev wrote:
| Definitely C, per what appears to be the official repo
| (linking the splice syscall) - https://github.com/icetee/pv/b
| lob/master/src/pv/transfer.c#L...
| effnorwood wrote:
| [deleted]
| mg wrote:
| For some reason, this raised my curiosity how fast different
| languages write individual characters to a pipe:
|
| PHP comes in at about 900KiB/s:
|
|     php -r 'while (1) echo 1;' | pv > /dev/null
|
| Python is about 50% faster at about 1.5MiB/s:
|
|     python3 -c 'while (1): print (1, end="")' | pv > /dev/null
|
| Javascript is slowest at around 200KiB/s:
|
|     node -e 'while (1) process.stdout.write("1");' | pv > /dev/null
|
| What's also interesting is that node crashes after about a
| minute:
|
|     FATAL ERROR: Ineffective mark-compacts near heap limit
|     Allocation failed - JavaScript heap out of memory
|
| All results from within a Debian 10 docker container with the
| default repo versions of PHP, Python and Node.
|
| Update:
|
| Checking with strace shows that Python buffers the output:
|
|     strace python3 -c 'while (1): print (1, end="")' | pv > /dev/null
|
| Outputs a series of:
|
|     write(1, "11111111111111111111111111111111"..., 8193) = 8193
|
| PHP and JS do not.
|
| So the Python equivalent would be:
|
|     python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
|
| Which makes it comparable to the speed of JS.
|
| Interesting that PHP is over 4x faster than Python and JS.
| cestith wrote:
| I'm on a 2015 MB Air with two browsers running, probably a
| dozen tabs between them, three tabs in iTerm2, Outlook, Word,
| and Teams running.
|
| Perl 5.18.0 gives me 3.5 MiB per second. Perl 5.28.3, 5.30.3,
| and 5.34.0 give 4 MiB per second.
|
|     perl5.34.0 -e 'while (){ print 1 }' | pv > /dev/null
|
| For Python 3.10.4, I get about 2.8 MiB/s as you have it
| written, but around 5 MiB/s (same for 3.9, but only 4 MiB/s
| for 3.8) with this. I also get 4.8 MiB/s with 2.7:
|
|     python3 -c 'while (1): print (1)' | pv > /dev/null
|
| If I make Perl behave like yes and print a character and a
| newline, it has a jump of its own. The following gives me 37.3
| MiB per second.
|
|     perl5.34.0 -e 'while (){ print "1\n" }' | pv > /dev/null
|
| Interestingly, using Perl's say function (which is like a
| Println) slows it down significantly. This version is only 7.3
| MiB/s.
|
|     perl5.34.0 -E 'while (1) {say 1}' | pv > /dev/null
|
| Go 1.18 has 940 KiB/s with fmt.Print and 1.5 MiB/s with
| fmt.Println for some comparison.
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         for {
|             fmt.Println("1")
|         }
|     }
|
| These are all macports builds.
| mscdex wrote:
| Potential buffering issues aside, as others have pointed out
| the node.js example is performing asynchronous writes, unlike
| the other languages' examples (as far as I know).
|
| To do a proper synchronous write, you'd do something like:
|
|     node -e 'const { writeSync } = require("fs"); while (1) writeSync(1, "1");' | pv > /dev/null
|
| That gets me ~1.1MB/s with node v18.1.0 and kernel 5.4.0.
| themulticaster wrote:
| If you ever need to write a random character to a pipe very
| fast, GNU coreutils has you covered with yes(1). It runs at
| about 6 GiB/s on my system:
|
|     yes | pv > /dev/null
|
| There's an article floating around [1] about how yes(1) is
| extremely optimized considering its original purpose. In case
| you're wondering, yes(1) is meant for commands that
| (repeatedly) ask whether to proceed, expecting a y/n input or
| something like that. Instead of repeatedly typing "y", you
| just run "yes | the_command".
|
| Not sure about how yes(1) compares to the techniques presented
| in the linked post. Perhaps there's still room for improvement.
|
| [1] Previous HN discussion:
| https://news.ycombinator.com/item?id=14542938
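|
| The gist of the optimization, per that discussion, is writing
| many copies of "y\n" per syscall instead of two bytes at a
| time; a rough C sketch (the buffer size is arbitrary):
|
|     #include <unistd.h>
|
|     int main(void) {
|         /* Fill one big buffer with "y\n" pairs... */
|         static char buf[1 << 16];
|         for (size_t i = 0; i < sizeof(buf); i += 2) {
|             buf[i] = 'y';
|             buf[i + 1] = '\n';
|         }
|         /* ...then emit 64 KiB per write() instead of 2 bytes. */
|         for (;;)
|             write(1, buf, sizeof(buf));
|     }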
| gitgud wrote:
| > _It runs at about 6 GiB /s on my system..._
|
| Honest question: what are the practical use cases of this?
|
| Repeatedly typing the 'y' character into a Linux pipe is
| surely not that common, especially at that bit rate. Also
| seems like the bottleneck would always be the consuming
| program...
| travisgriggs wrote:
| > Honest question: what are the practical use cases of
| this?
|
| It also allows you to script otherwise-interactive command
| line operations with the correct answer. Many command line
| tools nowadays provide specific options to override queries.
| But there are still a couple of holdouts which might not.
| jolmg wrote:
| > especially at that bit rate. Also seems like the
| bottleneck would always be the consuming program...
|
| It's not _made_ to be fast; it's just fast _by nature_,
| because there's no other computation it needs to do than to
| just output the string.
| singron wrote:
| Yes can repeat any string, not just "y". It can be useful
| for basic load generation.
| jolmg wrote:
| I've used it to test some db behavior with `yes 'insert
| ...;' | mysql ...`. Fastest insertions I could think of.
| TacticalCoder wrote:
| > Repeatedly typing the 'y' character into a Linux pipe is
| surely not that common, especially at that bit rate.
|
| At that rate no, but I definitely use it once in a while.
| For example if I copy quite a few files and then get
| repeatedly asked if I want to overwrite the destination
| (when it's already present). Sure, I could get my command
| back and use the proper flag to "cp" or whatever to
| overwrite, but it's usually much quicker to just get back
| the previous line, go to the beginning (C-a), then type
| "yes | " and be done with it.
|
| Note that you can pass a parameter to "yes" and then it
| repeats what you passed instead of 'y'.
| linsomniac wrote:
| Historically, you could have dirty filesystems after a
| reboot that "fsck" would ask an absurd number of questions
| about ("blah blah blah inode 1234567890 fix? (y/n)").
| Unless you were in a very specific circumstance, you'd
| probably just answer "y" to them. It could easily ask
| thousands of questions though. So: "yes | fsck" was not
| uncommon.
| jolmg wrote:
| > Historically
|
| It's probably still common in installation scripts, like
| in Dockerfiles. `apt-get install` has the `-y` option,
| but it would be useful for all other programs that don't.
| dpflug wrote:
| Faster still is pv < /dev/zero > /dev/null
| BenjiWiebe wrote:
| Yes but you don't have control of which character is
| written (only NULLs).
|
| yes lets you specify which character to output. 'yes n' for
| example to output n.
| rocqua wrote:
| Yes doesn't just let you choose a character. It lets you
| choose a string that will be repeated. So
| yes 123abc
|
| will print
| 123abc123abc123abc123abc123abc
|
| and so on.
| jolmg wrote:
| each time terminated by a newline, so:
|
|     123abc
|     123abc
|     123abc
|     ...
| megous wrote:
| "Javascript" is slowest probably because node pushes the writes
| to a thread instead of printing directly from the main process
| like PHP.
|
| Python cheats, and it's still slow as heck even while cheating
| (buffers the output in 8192-byte chunks instead of issuing
| 1-byte writes).
|
| write(1, "1", 1) loop in C pushes 6.38MiB/s on my PC. :)
| cout wrote:
| Why is it cheating to use a buffer? This is the behavior you
| would get in C if you used the C standard library
| (putc/fputc) instead of a system call (write).
| soheil wrote:
| You're testing a very specific operation, a loop, in each
| language to determine its speed, not sure if I'd generalize
| that. I wonder what it'd look like if you replaced the loop
| with static print statements that were 1000s of characters long
| with line breaks, the sort of things that compiler
| optimizations do.
| dpflug wrote:
| I was getting different results depending on when I run it.
| Took me a second to realize it was my processor frequency
| scaling.
| klohto wrote:
| Python pushes 15MiB/s on my M1 Pro if you go down a level and
| use sys directly.
|
|     python3 -c 'import sys
|     while (1): sys.stdout.write("1")' | pv > /dev/null
| mg wrote:
| That caches though. You can see it when you strace it.
| klohto wrote:
| Good point, but so does a standard print call. Calling
| flush() after each write does bring the perf to 1.5MiB/s.
| rovr138 wrote:
|     python3 -u -c 'import sys
|     while (1): sys.stdout.write("1")' | pv > /dev/null
|
| 427KiB/s
|
|     python3 -c 'import sys
|     while (1): sys.stdout.write("1")' | pv > /dev/null
|
| 6.08MiB/s
|
| Using python 3.9.7 on macOS Monterey.
| capableweb wrote:
| > Javascript is slowest at around 200KiB/s:
|
| I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s.
| Python gets 4.35MiB/s.
|
| > What's also interesting is that node crashes after about a
| minute
|
| I believe this is because `while(1)` runs so fast that there is
| no "idle" time for V8 to actually run GC. V8 is a strange
| beast, and this is just a guess of mine.
|
| The following code shouldn't crash, give it a try:
|
|     node -e 'function write() {process.stdout.write("1"); process.nextTick(write)} write()' | pv > /dev/null
|
| It's slower for me though, giving me 1.18MiB/s.
|
| More examples with Babashka and Clojure:
|
|     bb -e "(while true (print \"1\"))" | pv > /dev/null
|
| 513KiB/s
|
|     clj -e "(while true (print \"1\"))" | pv > /dev/null
|
| 3.02MiB/s
|
|     clj -e "(require '[clojure.java.io :refer [copy]]) (while true (copy \"1\" *out*))" | pv > /dev/null
|
| 3.53MiB/s
|
|     clj -e "(while true (.println System/out \"1\"))" | pv > /dev/null
|
| 5.06MiB/s
|
| Versions: PHP 8.1.6, Python 3.10.4, NodeJS v18.3.0, Babashka
| v0.8.1, Clojure 1.11.1.1105
| marginalia_nu wrote:
| > I believe this is because `while(1)` runs so fast that
| there is no "idle" time for V8 to actually run GC. V8 is a
| strange beast, and this is just a guess of mine.
|
| Java has (had) weird idiosyncrasies like this as well. Well,
| it doesn't crash, but depending on the construct you can get
| performance degradations depending on how the language
| inserts safepoints (where the VM is at a knowable state and a
| thread can be safely paused for GC or whatever).
|
| I don't know if this holds today, but I know there was a time
| when you basically wanted to avoid looping over long-typed
| variables, as they had different semantics. The details are a
| bit fuzzy to me right now.
| wolfgang42 wrote:
| _> > What's also interesting is that node crashes after about
| a minute_
|
| _> I believe this is because `while(1)` runs so fast that
| there is no "idle" time for V8 to actually run GC. V8 is a
| strange beast, and this is just a guess of mine._
|
| Not exactly: the GC is still running; it's _live_ memory
| that's growing unbounded.
|
| What's going on here is that WritableStream is non-blocking;
| it has _advisory_ backpressure, but if you ignore that it
| will do its best to accept writes anyway and keep them in a
| buffer until it can actually write them out. Since you're not
| giving it any breathing room, that buffer just keeps growing
| until there's no more memory left. `process.nextTick()` is
| presumably slowing things down enough on your system to give
| it a chance to drain the buffer. (I see there's some
| discussion below about this changing by version; I'd guess
| that's an artifact of other optimizations and such.)
|
| To do this properly, you need to listen to the return value
| from `.write()` and, if it returns false, back off until the
| stream drains and there's room in the buffer again.
|
| Here's the (not particularly optimized) function I use to do
| that:
|
|     async function writestream(chunks, stream) {
|         for await (const chunk of chunks) {
|             if (!stream.write(chunk)) {
|                 // When write returns false, the stream is starting
|                 // to buffer and we need to wait for it to drain
|                 // (otherwise we'll run out of memory!)
|                 await new Promise(resolve => stream.once('drain', () => resolve()))
|             }
|         }
|     }
|
| I _do_ wish Node made it more obvious what was going on in
| this situation; this is a very common mistake with streams
| and it's easy to not notice until things suddenly go very
| wrong.
|
| ETA: I should probably note that transform streams,
| `readable.pipe()`, `stream.pipeline()`, and the like all
| handle this stuff automatically. Here's a one-liner, though
| it's not especially fast:
|
|     node -e 'const {Readable} = require("stream"); Readable.from(function*(){while(1) yield "1"}()).pipe(process.stdout)' | pv > /dev/null
| Matthias247 wrote:
| Are there still no async write functions which handle this
| more easily than the old event-based mechanism? Waiting for
| drain also sounds like it might reduce throughput, since then
| there is 0 buffered data and the peer would be forced to
| pause reading. A "writable" event sounds more appropriate -
| but the node docs don't mention one.
| mg wrote:
| Your node version indeed did not crash. Tried for 2 minutes.
|
| But using a longer string crashed after 23s here:
|
|     node -e 'function write() {process.stdout.write("1111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000"); process.nextTick(write)} write()' | pv > /dev/null
| capableweb wrote:
| Hm, strange. With the same out of memory error as before or
| a different one? Tried running that one for 2 minutes, no
| errors here, and memory stays constant.
|
| Also, what NodeJS version are you on?
| mg wrote:
| Yes, same error as before. Memory usage stays the same
| for a while, then starts to skyrocket shortly before it
| crashes.
|
| node is v10.24.0. (Default from the Debian 10 repo)
| capableweb wrote:
| Huh yeah, seems to be a old memory leak. Running it on
| v10.24.0 crashes for me too.
|
| After some quick testing in a couple of versions, it
| seems like it got fixed in v11 at least (didn't test any
| minor/patch versions).
|
| By the way, all versions up to NodeJS 12 (LTS) are "end
| of life", and should probably not be used if you're
| downloading 3rd party dependencies, as there are bunch of
| security fixes since then, that are not being backported.
| captn3m0 wrote:
| I used this exact issue today while pointing out how
| Debian support dates can be misleading as packages
| themselves aren't always getting fixes:
| https://github.com/endoflife-date/endoflife.date/issues/763#...
| MaxBarraclough wrote:
| Perhaps different approaches to caching?
|
| I'm reminded of this StackOverflow question, _Why is reading
| lines from stdin much slower in C++ than Python?_
|
| https://stackoverflow.com/q/9371238/
| xthrowawayxx wrote:
| I find that NodeJS eventually runs out of memory and crashes
| with applications that do a large amount of data processing
| over a long time with little breaks, even if there are no
| memory leaks.
|
| Edit: I've found this consistently while building multiple
| data processing applications over multiple years at multiple
| companies.
| rascul wrote:
| I did the same test, but added a rust and bash version. My
| results:
|
| Rust: 21.9MiB/s
|
| Bash: 282KiB/s
|
| PHP: 2.35MiB/s
|
| Python: 2.30MiB/s
|
| Node: 943KiB/s
|
| In my case, node did not crash after about two minutes. I find
| it interesting that PHP and Python are comparable for me but
| not you, but I'm sure there's a plethora of reasons to explain
| that. I'm not surprised rust is vastly faster and bash vastly
| slower, I just thought it interesting to compare since I use
| those languages a lot.
|
| Rust:
|
|     fn main() {
|         loop {
|             print!("1");
|         }
|     }
|
| Bash (no discernible difference between echo and printf):
|
|     while :; do printf "1"; done | pv > /dev/null
| anon946 wrote:
| For languages like C, C++, and Rust, the bottleneck is going
| to mainly be system calls. With a big buffer, on an old
| machine, I get about 1.5 GiB/s with C++. Writing 1 char at a
| time, I get less than 1 MiB/s.
|
|     $ ./a.out 1000000 2000 | cat >/dev/null
|     buffer size: 1000000, num syscalls: 2000, perf:1578.779593 MiB/s
|     $ ./a.out 1 2000000 | cat >/dev/null
|     buffer size: 1, num syscalls: 2000000, perf:0.832587 MiB/s
|
| Code is:
|
|     #include <cstddef>
|     #include <random>
|     #include <chrono>
|     #include <cassert>
|     #include <array>
|     #include <cstdio>
|     #include <unistd.h>
|     #include <cstring>
|     #include <cstdlib>
|
|     int main(int argc, char **argv) {
|         int rv;
|         assert(argc == 3);
|         const unsigned int n = std::atoi(argv[1]);
|         char *buf = new char[n];
|         std::memset(buf, '1', n);
|         const unsigned int k = std::atoi(argv[2]);
|         auto start = std::chrono::high_resolution_clock::now();
|         for (size_t i = 0; i < k; i++) {
|             rv = write(1, buf, n);
|             assert(rv == int(n));
|         }
|         auto stop = std::chrono::high_resolution_clock::now();
|         auto duration = stop - start;
|         std::chrono::duration<double> secs = duration;
|         std::fprintf(stderr,
|             "buffer size: %d, num syscalls: %d, perf:%f MiB/s\n",
|             n, k, (double(n)*k)/(1024*1024)/secs.count());
|     }
|
| EDIT: Also note that a big write to a pipe (bigger than
| PIPE_BUF) may require multiple syscalls on the read side.
|
| EDIT 2: Also, it appears that the kernel is smart enough to
| not copy anything when it's clear that there is no need. When
| I don't go through cat, I get rates that are well above memory
| bandwidth, implying that it's not doing any actual work:
|
|     $ ./a.out 1000000 1000 >/dev/null
|     buffer size: 1000000, num syscalls: 1000, perf: 1827368.373827 MiB/s
| mortehu wrote:
| There's no special "no work" detection needed. a.out is
| calling the write function for the null device, which just
| returns without doing anything. No pipes are involved.
| hderms wrote:
| with Rust you could also avoid using a lock on STDOUT and get
| it even faster!
| skitter wrote:
| Tested it, seems to about double the speed (from 22.3mb/s
| to 47.6mb/s).
| ur-whale wrote:
| for the bash case, the cost of forking to write two chars is
| overwhelming compared to anything related to I/O.
| mauvehaus wrote:
| Echo and printf are shell built-ins in bash[0]. Does it
| have to fork to execute them?
|
| You could probably answer this by replacing printf with
| /bin/echo and comparing the results. I'm not in front of a
| Linux box, or I'd try.
|
| [0]
| https://www.gnu.org/software/bash/manual/html_node/Bash-
| Buil...
| ur-whale wrote:
| > Echo and printf are shell built-ins in bash
|
| Ah, yeah, good point, I am wrong.
| megous wrote:
| There's no forking, and it's writing one character.
| megous wrote:
| Rust also cheats.
|
| https://megous.com/dl/tmp/1046458b5b450018.png
| cle wrote:
| Seems like it's buffering output, which Python also does.
| Python is much slower if you flush every write (I get 2.6
| MiB/s default, 600 KiB/s with flush=True).
|
| Interestingly, Go is very fast with a 8 KiB buffer (same as
| Python's), I get 218 MiB/s.
| [deleted]
| cout wrote:
| What version of node are you using? It seems to run
| indefinitely on 14.19.3 that comes with Ubuntu 20.04.
| GlitchMr wrote:
| `process.stdout.write` is different to PHP's `echo` and
| Python's `print` in that it pushes a write to an event queue
| without waiting for the result, which could result in filling
| the event queue with writes. Instead, you can consider
| `await`-ing `write` so that it would write before pushing
| another `write` to the event queue.
|
|     node -e '
|     const stdoutWrite = util.promisify(process.stdout.write).bind(process.stdout);
|     (async () => {
|         while (true) {
|             await stdoutWrite("1");
|         }
|     })();
|     ' | pv > /dev/null
| fasteo wrote:
| Luajit using print and io.write (LuaJIT 2.1.0-beta3).
|
| Using print is about 17 MiB/s:
|
|     luajit -e "while true do print('x') end" | pv > /dev/null
|
| Using io.write is about 111 MiB/s:
|
|     luajit -e "while true do io.write('x') end" | pv > /dev/null
| [deleted]
| rhyn00 wrote:
| Adding a few results:
|
| Using OP's code for the following:
|
|     php    1.8 Mb/sec
|     python 3.8 Mb/sec
|     node   1.0 Mb/sec
|
| Java print, 1.3 Mb/sec:
|
|     echo 'class Code {public static void main(String[] args) {while (true){System.out.print("1");}}}' > Code.java; javac Code.java; java Code | pv > /dev/null
|
| Java with buffering, 57.4 Mb/sec:
|
|     echo 'import java.io.*;class Code2 {public static void main(String[] args) throws IOException {BufferedWriter log = new BufferedWriter(new OutputStreamWriter(System.out));while(true){log.write("1");}}}' > Code2.java; javac Code2.java; java Code2 | pv > /dev/null
| kuschku wrote:
| Java can get even much much faster:
| https://gist.github.com/justjanne/12306b797f4faa977436070ec0...
|
| That manages about 7 GiB/s reusing the same buffer, or about
| 300 MiB/s with clearing and refilling the buffer every time
|
| (the magic is in using java's APIs for writing to
| files/sockets, which are designed for high performance,
| instead of using the APIs which are designed for writing to
| stdout)
| rhyn00 wrote:
| Nice, that's pretty cool!
| petercooper wrote:
| I'll tell you what's fun. I get 5MB/sec with Python, 1.3MB/sec
| with Node and.... 12.6MB/sec with Ruby! :-) (Added: Same speed
| as Node if I use $stdout.sync = true though..)
| nequo wrote:
| For me:
|
| Python3: 3 MiB/s
|
| Node: 350 KiB/s
|
| Lua: 12 MiB/s
|
|     lua -e 'while true do io.write("1") end' | pv > /dev/null
|
| Haskell: 5 MiB/s
|
|     loop = do
|       putStr "1"
|       loop
|     main = loop
|
| Awk: 4.2 MiB/s
|
|     yes | awk '{printf("1")}' | pv > /dev/null
| VWWHFSfQ wrote:
| Lua is an interesting one.
|
|     while true do io.write "1" end
|
| PUC-Rio 5.1: 25 MiB/s
|
| PUC-Rio 5.4: 25 MiB/s
|
| LuaJIT 2.1.0-beta3: 550 MiB/s <--- WOW
|
| They all go slightly faster if you localize the reference to
| `io.write`:
|
|     local write = io.write
|     while true do write "1" end
| yakubin wrote:
| _> They all go slightly faster if you localize the
| reference to `io.write`_
|
| No noticeable difference for LuaJIT, which makes sense,
| since JIT should figure it out without help.
| bjoli wrote:
| And this, folks, is why you have immutable modules. If
| you know before runtime what something is, lookup is a
| lot faster.
| VWWHFSfQ wrote:
| Ah yes you're right. Basically no difference with LuaJIT.
|
| 5.1 and 5.4 show about ~8% improvement.
| dllthomas wrote:
| Haskell can be even simpler:
|
|     main = putStr (repeat '1')
|
| [Edit: as pointed out below, this is no longer the case!]
|
| Strings are printed one character at a time in Haskell. This
| choice is justified by unpredictability of the interaction
| between laziness and buffering; I am uncertain it's the
| correct choice, but the proper response is to use Text where
| performance is relevant.
| nequo wrote:
| Wow, this does 160 MiB/s. That's a huge improvement! The
| output of strace looks completely different:
|
|     poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, revents=POLLOUT}])
|     write(1, "11111111111111111111111111111111"..., 8192) = 8192
|     poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, revents=POLLOUT}])
|     write(1, "11111111111111111111111111111111"..., 8192) = 8192
|
| With the recursive code, it buffered the output in the same
| way but bugged the kernel a whole lot more in-between
| writes. Not exactly sure what is going on:
|
|     poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, revents=POLLOUT}])
|     write(1, "11111111111111111111111111111111"..., 8192) = 8192
|     rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
|     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920390843}) = 0
|     rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
|     rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
|     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920666397}) = 0
|     ...
|     rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
|     poll([{fd=1, events=POLLOUT}], 1, 0) = 1 ([{fd=1, revents=POLLOUT}])
|     write(1, "11111111111111111111111111111111"..., 8192) = 8192
| dllthomas wrote:
| I'm honestly surprised either of them wind up buffered!
| That must be a change since I stopped paying as much
| attention to GHC.
|
| I'm also not sure what's going on in the second case.
| IIRC, at some point historically, a sufficiently tight
| loop could cause trouble with handling SIGINT, so it
| might be related to some overagressive workaround for
| that?
| wazoox wrote:
| On my extremely old desktop PC (Phenom II 550) running an out-
| of-date OS (Slackware 14.2):
|
| Bash [ 156KiB/s]:
|
|     while :; do printf "1"; done | ./pv > /dev/null
|
| Python3 3.7.2 [1,02MiB/s]:
|
|     python3 -c 'while (1): print (1, end="")' | ./pv > /dev/null
|
| Perl 5.22.2 [3,03MiB/s]:
|
|     perl -e 'while (true) {print 1}' | ./pv > /dev/null
|
| Node.js v12.22.1 [ 482KiB/s]:
|
|     node -e 'while (1) process.stdout.write("1");' | ./pv > /dev/null
| cle wrote:
| A major contributing factor is whether or not the language
| buffers output by default, and how big the buffer is. I don't
| think NodeJS buffers, whereas Python does. Here's some
| comparisons with Go (does not buffer by default):
|
| - Node (no buffering): 1.2 MiB/s
|
| - Go (no buffering): 2.4 MiB/s
|
| - Python (8 KiB buffer): 2.7 MiB/s
|
| - Go (8 KiB buffer): 218 MiB/s
|
| Go program:
|
|     f := bufio.NewWriterSize(os.Stdout, 8192)
|     for {
|         f.WriteRune('1')
|     }
| preseinger wrote:
| In addition to buffering within the process, Linux (usually)
| buffers process stdout with ~16KB, and does not buffer
| stderr.
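|
| For comparison with the language defaults above, C's stdio
| buffering can be controlled explicitly; a minimal sketch using
| setvbuf (the 64 KiB size is an arbitrary choice):
|
|     #include <stdio.h>
|
|     int main(void) {
|         static char buf[1 << 16];
|         /* Make stdout fully buffered with a 64 KiB buffer. */
|         setvbuf(stdout, buf, _IOFBF, sizeof(buf));
|         for (;;)
|             fputc('1', stdout);
|     }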
| reincarnate0x14 wrote:
| Not specifically addressed at you, but it's a bit amusing
| watching a younger generation of programmers rediscovering
| things like this, which seemed hugely important in like 1990
| but largely don't matter that much to modern workflows with
| dedicated APIs or various shared memory or network protocols,
| as not much that is really performance-critical is typically
| piped back and forth anymore.
|
| More than a few old backup or transfer scripts had extra dd
| or similar tools in the pipeline to create larger and semi-
| asynchronous buffers, or to re-size blocks on output to
| something handled better by the receiver, which was a big
| deal on high speed tape drives back in the day. I suspect
| most modern hardware devices have large enough static RAM and
| fast processors to make that mostly irrelevant.
| abuckenheimer wrote:
| > python3 -c 'while (1): print (1, end="")' | pv > /dev/null
|
| python actually buffers its writes, with print only flushing
| to stdout occasionally. You may want to try:
|
|     python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
|
| which I find goes much slower (550KiB/s).
| orf wrote:
| Using `sys.stdout.write()` instead of `print()` gets ~8MiB/s on
| my machine.
| bfors wrote:
| Love the subtle "stonks" overlay on the first chart
___________________________________________________________________
(page generated 2022-06-02 23:00 UTC)