https://old.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/

How is GNU `yes` so fast? (self.unix)
submitted 4 years ago by kjensenxz (1,339 points, 96% upvoted)

How is GNU's yes so fast?

    $ yes | pv > /dev/null
    ... [10.2GiB/s] ...

Compared to other Unices, GNU is outrageously fast. NetBSD's is 139MiB/s; FreeBSD, OpenBSD, and DragonFlyBSD have very similar code to NetBSD's and are probably identical; illumos's is 141MiB/s without an argument, 100MiB/s with one. OS X just uses an old NetBSD version similar to OpenBSD's, MINIX uses NetBSD's, BusyBox's is 107MiB/s, Ultrix's (3.1) is 139MiB/s, and COHERENT's is 141MiB/s.
Let's try to recreate its speed (I won't be including headers here):

    /* yes.c - iteration 1 */
    void main() {
        while(puts("y"));
    }

    $ gcc yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [141MiB/s] ...

That's nowhere near 10.2GiB/s, so let's just call write without the puts overhead.

    /* yes.c - iteration 2 */
    void main() {
        while(write(1, "y\n", 2)); // 1 is stdout
    }

    $ gcc yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [6.21MiB/s] ...

Wait a second, that's slower than puts; how can that be? Clearly, there's some buffering going on before writing. We could dig through the source code of glibc and figure it out, but let's see how yes does it first. Line 80 gives a hint:

    /* Buffer data locally once, rather than having the
       large overhead of stdio buffering each item.  */

The code below that simply copies argv[1:] or "y\n" to a buffer and, assuming that two or more copies could fit, copies it several times into a buffer of BUFSIZ bytes. So, let's use a buffer:

    /* yes.c - iteration 3 */
    #define LEN 2
    #define TOTAL (LEN * 1000)
    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int used = 0;

        while (used < TOTAL) {
            memcpy(buf+used, yes, LEN);
            used += LEN;
        }

        while(write(1, buf, TOTAL));
        return 1;
    }

    $ gcc yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [4.81GiB/s] ...

That's a ton better, but why aren't we reaching the same speed as GNU's yes? We're doing the exact same thing; maybe it's something to do with this full_write function. Digging leads to this being a wrapper for a wrapper for a wrapper (approximately) just to write(). This is the only part of the while loop, so maybe there's something special about their BUFSIZ?

I dug around in yes.c's headers forever, thinking that maybe it was part of config.h, which autotools generates. It turns out BUFSIZ is a macro defined in stdio.h:

    #define BUFSIZ _IO_BUFSIZ

What's _IO_BUFSIZ? libio.h:

    #define _IO_BUFSIZ _G_BUFSIZ

At least the name gives a hint. _G_config.h:

    #define _G_BUFSIZ 8192

Now it all makes sense: BUFSIZ is page-aligned (memory pages are usually 4096 bytes), so let's change the buffer to match:

    /* yes.c - iteration 4 */
    #define LEN 2
    #define TOTAL 8192
    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int bufused = 0;

        while (bufused < TOTAL) {
            memcpy(buf+bufused, yes, LEN);
            bufused += LEN;
        }

        while(write(1, buf, TOTAL));
        return 1;
    }

And, since building without the same flags as the yes on my system makes it run slower (my yes was built with CFLAGS="-O2 -pipe -march=native -mtune=native"), let's build it with those flags and refresh our benchmark:

    $ gcc -O2 -pipe -march=native -mtune=native yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [10.2GiB/s] ...
    $ yes | pv > /dev/null
    ... [10.2GiB/s] ...

We didn't beat GNU's yes, and there probably is no way. Even with the function overheads and additional bounds checks of GNU's yes, the limit isn't the processor; it's how fast memory is. With DDR3-1600 (1600 MT/s at 8 bytes per transfer, i.e. 12.8 GB/s per channel), it should be 11.92 GiB/s. Where is the missing ~1.5? Can we get it back with assembly?

    ; yes.s - iteration 5, hacked together for demo
    BITS 64
    CPU X64

    global _start

    section .text

    _start:
        inc rdi           ; stdout, will not change after syscall
        mov rsi, y        ; will not change after syscall
        mov rdx, 8192     ; will not change after syscall

    _loop:
        mov rax, 1        ; sys_write
        syscall
        jmp _loop

    y: times 4096 db "y", 0xA

    $ nasm -f elf64 yes.s
    $ ld yes.o -o yes
    $ ./yes | pv > /dev/null
    ... [10.2GiB/s] ...

It looks like we can't outdo C nor GNU in this case.
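(An aside on that full_write function: semantically, it just keeps calling write(2) until the whole buffer has gone out, since a single write on a pipe may be short. A minimal sketch of such a wrapper, as an illustration of the idea rather than gnulib's actual code:)

    /* full_write-style helper: retry write(2) until the whole buffer is
       written or an error occurs. A sketch, not the gnulib implementation. */
    #include <unistd.h>

    static size_t full_write_sketch(int fd, const char *buf, size_t count)
    {
        size_t total = 0;
        while (total < count) {
            ssize_t n = write(fd, buf + total, count - total);
            if (n <= 0)
                return total;   /* error; caller can inspect errno */
            total += (size_t)n;
        }
        return total;
    }

(For the 8192-byte buffer above, partial writes to a pipe are possible, since PIPE_BUF only guarantees atomicity up to 4096 bytes on Linux, which is presumably why GNU bothers with the wrapper at all.)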
Buffering is the secret, and all the overhead incurred by the kernel (throttled memory access, pipes, pv, and redirection) is enough to negate 1.5GiB/s. What have we learned?

* Buffer your I/O for faster throughput
* Traverse source files for information
* You can't out-optimize your hardware

Edit: _mrb managed to edit pv to reach over 123GiB/s on his system!

Edit: Special mention to agonnaz's contribution in various languages! Extra special mention to Nekit1234007's implementation, which completely doubled the speed using vmsplice!

* celebdor optimized the Python script for faster speeds
* jaysonsantos created a version in Rust
* Astrokiwi created a Fortran version
* TheSizik created a Brainfuck version
* TqDlRTd3DrgQlW4PFQH6 created a Haskell version
* prussian from Rizon created a shell version and a Node.js version
* kozzi11 created a D version

240 comments, sorted by best

[-] jmtd, 824 points, 4 years ago

It's a shame they didn't finish their kernel, but at least they got yes working at 10GiB/s.

[-] [deleted], 4 years ago

[deleted]

[-] mailto_devnull, 76 points, 4 years ago

Except flash is now on its way out, so in hindsight waiting for Flash to die was a viable strategy!

[-] [deleted], 4 years ago

[deleted]

[-] [deleted], 4 years ago

[deleted]

[-] [deleted], 14 points, 4 years ago

I'm quite sure.

[-] DurdenVsDarkoVsDevon, 15 points, 4 years ago

Non-mobile link.

[-] sadmac, 17 points, 4 years ago

There's actually a changelog somewhere in some X component that says "Fixes XKCD 619"

[-] feitingen, 19 points, 4 years ago

And here it is: https://github.com/jjneely/elrepo/blob/master/xorg-x11-drv-intel/el6/xorg-x11-drv-intel.spec#L259

[-] NeverCast, 2 points, 4 years ago

I want to favourite this comment.

[-] GNU_Troll, 7 points, 4 years ago

These comics suck. I don't know how anyone likes them.

[-] kbob, 38 points, 4 years ago

Relevant username.

[-] GNU_Troll, 7 points, 4 years ago

Not even trolling. Just stating a fact.

[-] didnt_readit, 9 points, 4 years ago

*opinion

[-] [deleted], 11 points, 4 years ago

A false fact

[-] bityard, 32 points, 4 years ago

Username checks out, but downvoted anyway.
[-] [deleted], 12 points, 4 years ago

I'm feeling you are in a minority.

[-] GNU_Troll, 5 points, 4 years ago

Good taste isn't common, keep that in mind too.

[-] Sag0Sag0, 9 points, 4 years ago

According to you, the minority.

[-] arachnidGrip, 3 points, 4 years ago

Why do you dislike them?

[-] GNU_Troll, 2 points, 4 years ago

The illustration sucks, the writing sucks, and half of them are not even comics. Just one panel, which is not a comic.

[-] Hyperkubus, 3 points, 4 years ago

Who, other than yourself, said they were comics?

[-] yannik121, 2 points, 4 years ago

From xkcd.com: "A webcomic of romance, sarcasm, math, and language."

[-] [deleted], 3 points, 4 years ago

I like them because I find them entertaining often enough. Is that too hard to realize by yourself?

[-] GNU_Troll, 2 points, 4 years ago

Not everyone is entertained by stick figure drawings; most of us are a little old for that.

[-] [deleted], 7 points, 4 years ago

Exactly, not everyone. Btw, it's not about the stick figure drawings, it's about the message; it doesn't matter much how it looks. And don't mistake missing context and/or knowledge about a topic for being 'a little too old for that'. If you don't get it, you don't get it. It's okay.

[-] GNU_Troll, 2 points, 4 years ago

I get it, I just have higher standards. It's an art form, so excusing poor illustration and writing because it's about getting a message across is kind of a cop out.

[-] [deleted], 10 points, 4 years ago

No, you don't have higher standards; you have different taste. And no, you don't get 'it'. If you got 'it', you would find it funny. Otherwise you are just a bystander, analysing and evaluating without getting 'it'. Getting 'it' doesn't mean that you can comprehend why someone else might find it funny.

[-] bulkygorilla, 4 points, 4 years ago

A "one-off", if you will

[-] NotRichardDawkins, 4 points, 4 years ago

> most of us are a little old for that

Have fun in your boring grown-up land with your boring grown-up pants.
[-] kjensenxz (OP), 164 points, 4 years ago

This should be a fortune

[-] never_amore, 43 points, 4 years ago

fortune doesn't work at 10GiB/s

[-] kjensenxz (OP), 68 points, 4 years ago

    $ fortune | pv >/dev/null
    ... [2.81KiB/s] ...

This is worse than all the yesses that have been benchmarked!

[-] [deleted], 4 years ago

[deleted]

[-] Moonpenny, 18 points, 4 years ago

It also would have far fewer fortunes, since most of the fortunes are taken without attribution and can't be GPL'd. The Twain stuff should be fine, at least.

[-] Tyler_Zoro, 20 points, 4 years ago

... and it would read mail. Plus, half of the fortunes would be some variation of "it's called GNU/Linux".

[-] veroxii, 63 points, 4 years ago

That hurds. :(

[-] Yawzheek, 16 points, 4 years ago

This right here? This is how you be a proper smart ass. Take notes.

[-] myhf, 12 points, 4 years ago

My man.

[-] ProgramTheWorld, 4 points, 4 years ago

Looking good.

[-] PM_ME_DANK_MEMES, 6 points, 4 years ago

Slow down!

[-] Sag0Sag0, 4 points, 4 years ago

Yes.

[-] incraved, 10 points, 4 years ago

> working at 10GiB/s.

Your comment would have been perfect if you had typed that as: working at 10GiB. /s

[-] enkiv2, 8 points, 4 years ago

If you've ever had to convince shell tools to process large quantities (30-300 gigs) of text data, you'll see the merit of getting yes (and cut, and paste, and tr) to operate very quickly. Optimizing the hell out of these is why your laptop can perform some operations 80x faster than a hadoop cluster (and why you should therefore always first consider writing a small shell script when someone asks you to use hadoop map reduce on a couple hundred gigs of text). Even if HURD were finished, the number of people actually using it would still probably be less than the number of people who try to monkey-parse 30GB of XML in gawk. (Source: for work I frequently monkey-parse 30+GB of XML in gawk.)

[-] RedditUserHundred, 5 points, 4 years ago

... and Stallman can bitch about "GNU slash Linux"

[-] xorgol, 5 points, 4 years ago

I'm currently running a project on GNU/NT. Stallman was right all along!
[-] alexbuzzbee, 145 points, 4 years ago

The missing 1.5GiB/s is probably kernel overhead and other processes. Try it in emergency mode for slightly more speed!

[-] kjensenxz (OP), 111 points, 4 years ago

I considered running it in single-user mode, writing a simple ring 0 program to boot off of, or using a custom tiny kernel with it as init, to squeeze as much speed as possible out of the program, but I think I've spent enough time on this; I started writing it somewhere around 4 or 5 hours ago. If anyone would like to take a crack at doing that, I'd love to see how it compares to running on a regular system.

[-] [deleted], 30 points, 4 years ago

I learned something today! For the yes command, I still prefer the first implementation. Maybe dd also has such optimizations.

[-] kjensenxz (OP), 33 points, 4 years ago

I really like the readability of the first iteration and NetBSD's, which are very similar, but they just aren't as quick, which makes me wonder if there would be a way to optimize several subsequent calls to a stdio function for the same speed in the library itself. Maybe another time I'll look into that, dd, and cat!

[-] [deleted], 4 years ago

[deleted]

[-] iluvatar, 49 points, 4 years ago

> If anything, you'd want to generalize to emit any character

yes already does this. Indeed, it goes further and repeatedly emits any arbitrary string. It's had this behaviour for at least the 30 years that I've been using it.

[-] [deleted], 4 years ago

[deleted]

[-] supercheese200, 24 points, 4 years ago

> no maybe i don't know
> no maybe i don't know

Can you repeat the question?

[-] preludeoflight, 20 points, 4 years ago

Well, I mean, obviously,

    $ yes "no maybe i don't know"

[-] supercheese200, 19 points, 4 years ago

YOU'RE NOT THE BOSS OF ME NOW

[-] [deleted], 12 points, 4 years ago

As one of the few people crazy enough to use mono, this is well known as part of the incantation

    yes yes | mozroots --import

that gets SSL working. (This is fixed in newer versions of mono, though.)

[-] net_goblin, 3 points, 4 years ago

But isn't emitting arbitrary characters the job of echo(1)? My favourite implementation would be echo y.

[-] iluvatar, 14 points, 4 years ago

No, echo only emits it once, whereas yes repeatedly emits the string.

[-] net_goblin, 2 points, 4 years ago

Ah thanks, my bad.
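To make iluvatar's point concrete against the iterations from the OP: a sketch of iteration 4 generalized to repeat argv[1] (or "y"). This is my illustration of the idea, not GNU's actual code, and it assumes the argument plus a newline fits in the buffer:

    /* yes-argv.c - iteration 4, generalized to repeat an arbitrary string;
       a sketch, not GNU's implementation */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define TOTAL 8192

    int main(int argc, char **argv) {
        const char *word = (argc > 1) ? argv[1] : "y";
        size_t len = strlen(word) + 1;   /* +1 for the trailing newline */
        char *buf = malloc(TOTAL);
        size_t used = 0;

        /* fill the buffer with as many whole "word\n" copies as fit */
        while (used + len <= TOTAL) {
            memcpy(buf + used, word, len - 1);
            buf[used + len - 1] = '\n';
            used += len;
        }

        while (write(1, buf, used));
        return 1;
    }

    $ gcc -O2 yes-argv.c -o yes-argv
    $ ./yes-argv "no maybe i don't know" | head -n 2
    no maybe i don't know
    no maybe i don't know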
[-] bit_of_hope, 3 points, 4 years ago

yes repeats the string until the pipe is closed or yes itself is killed.

    [bitofhope@suika ~]% yes | head
    y
    y
    y
    y
    y
    y
    y
    y
    y
    y

[-] Truncator, 7 points, 4 years ago

    ~ $ yes | tail
    ^C

[-] bit_of_hope, 10 points, 4 years ago

    $ yes > /dev/null &

Why is my machine so noisy all of a sudden?

[-] StoneCypher, 3 points, 4 years ago

sorry about the stupid question. what have you been using this for?

[-] Neebat, 1 point, 4 years ago

Does that affect the throughput? (Bet it doesn't.)

[-] kjensenxz (OP), 19 points, 4 years ago

You make an excellent point, and yes is meant to do this (send argv instead of "y"); the programs could easily be modified to send any value based on argv, just by changing the buffer subroutine. I would have added that in the program demos, but I felt it would be in excess.

[-] FUCKING_HATE_REDDIT, 3 points, 4 years ago

printf buffers every line, but I think you can force it to buffer more.

[-] davesidious, 3 points, 4 years ago

wat

[-] FUCKING_HATE_REDDIT, 2 points, 4 years ago

printf buffers calls until it reaches a \n

[-] morty_a, 10 points, 4 years ago

printf/stdio behavior depends on whether or not stdout is a tty. If it's a tty, by default, stdio flushes after every newline ("line buffered"). If it's not a tty, by default, stdio waits until its buffer fills before flushing ("fully buffered").

[-] markfeathers, 6 points, 4 years ago

dd has a command line argument for the block size it writes, so you should be able to do the same thing with "dd if=/dev/zero bs=8192 | pv > /dev/null". On my PC /dev/null is ~5.12GiB/s; dd from /dev/zero is around ~4GiB/s, though. In the case of dd, it has to read from another file to write out, instead of plopping 'y\n' into a buffer, so this is likely why it is a bit slower.

[-] [deleted], 3 points, 4 years ago

Oh sure, the bs=. With devices like /dev/zero, though, there's nothing to read from a hard drive (or a cache), but it is still reading.

[-] Coding_Cat, 5 points, 4 years ago

On mobile, so I can't search properly, but there is a command for starting a program with the rt scheduler. Might make it a little faster, as it will never be preempted that way.
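A quick aside on the stdio behavior morty_a describes above: setvbuf(3) lets you see, and override, the buffering mode, which is presumably what "force it to buffer more" would look like. A sketch, with the buffer size chosen arbitrarily to match the BUFSIZ from the OP:

    /* force stdout to be fully buffered even on a tty, per the
       line-buffered vs. fully-buffered distinction above */
    #include <stdio.h>

    int main(void) {
        /* _IOFBF = fully buffered; glibc allocates its own buffer when
           the second argument is NULL */
        setvbuf(stdout, NULL, _IOFBF, 8192);
        for (;;)
            fputs("y\n", stdout);   /* flushed only when the buffer fills */
    }

Note that in the benchmarks above, stdout is a pipe rather than a tty, so stdio was already fully buffered; the gap between iteration 1 (puts) and iteration 3 is therefore likely per-call stdio overhead rather than syscall count.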
[-] stw, 53 points, 4 years ago

Just a small nitpick: puts appends a newline, so puts("y\n") writes 2 newlines.

[-] kjensenxz (OP), 31 points, 4 years ago

Thanks! I completely overlooked that and was off by about 50%. I edited the OP to reflect the real values.

[-] [deleted], 4 years ago

[deleted]

[-] kjensenxz (OP), 27 points, 4 years ago

I used aligned_alloc and actually got worse performance, generally 0.2GiB/s slower than elagergren's Go implementation and the C/assembly implementations (modified 4th iteration if you'd like to check):

    //char *buf = malloc(TOTAL);
    char *buf = aligned_alloc(4096, TOTAL);

[-] patrickbarnes, 8 points, 4 years ago

What happens if you stack allocate your buf?

[-] kjensenxz (OP), 13 points, 4 years ago

That's actually what happens in the assembly code, since it compiles the values into the binary. Here's a sample (the .y's repeat for another 500 lines or so):

    00000080: 48ff c748 be9b 0040 0000 0000 00ba 0020  H..H...@.......
    00000090: 0000 b801 0000 000f 05eb f779 0a79 0a79  ...........y.y.y
    000000a0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y
    000000b0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y
    000000c0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y

I don't know if the stack has any greater performance than the heap for something like this (we don't really need to do any memory "bookkeeping", and after all, memory is just memory), and it might mean slower initialization of the program, since it would have to read a larger file for the buffer rather than build one in memory.

[-] Vogtinator, 20 points, 4 years ago

> That's actually what happens in the assembly code, since it actually compiles the values into the binary.

That's not the stack, that's .data (or in this case .text, as not specified otherwise). To get it onto the stack, you would need to:

    sub rsp, 8192
    mov rdi, rsp
    mov rsi, y
    mov rdx, 8192
    call memcpy

Or something like that.

[-] kjensenxz (OP), 9 points, 4 years ago

Thanks, I thought certain data in .text was put onto the stack (e.g. consts).

[-] calrogman, 12 points, 4 years ago

In C it's typical that automatic variables in function scope are placed on the stack.

[-] mccoyn, 3 points, 4 years ago

Local variables (not just on the stack, but on the current stack frame) have the advantage that the stack pointer is always in a register, so their memory location is a simple calculation. With a dynamically allocated buffer, the address would have to be placed in a register. Before the call to write() is made, this address would have to be saved onto the stack and then loaded back for the next call to write().
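patrickbarnes's stack question is also easy to try in the C versions: a sketch of iteration 4 with the buffer in automatic (stack) storage instead of malloc. Illustrative only; this variant was not benchmarked in the thread:

    /* yes-stack.c - iteration 4 with a stack buffer instead of malloc */
    #include <string.h>
    #include <unistd.h>

    #define LEN 2
    #define TOTAL 8192

    int main(void) {
        char yes[LEN] = {'y', '\n'};
        char buf[TOTAL];         /* automatic storage: lives on the stack */
        int used = 0;

        while (used < TOTAL) {
            memcpy(buf + used, yes, LEN);
            used += LEN;
        }

        while (write(1, buf, TOTAL));
        return 1;
    }

Once the buffer is filled, the hot loop is identical either way, since write() only ever sees a pointer; per mccoyn's comment, any difference should amount to little more than how that pointer is materialized around the syscall.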
[-] [deleted], 4 years ago

[deleted]

[-] kjensenxz (OP), 16 points, 4 years ago

I'm not sure which architecture your MacBook is (x86_64? ARM? Ancient PPC?), but I noticed that the speed really has to do with the size of your buffer compared to your pages (4096 bytes on x86), and making sure that you can fill up at least one (two is better, IIRC). I'm not sure how much of it is stored in L1, but if it were, it should be in the hundreds of gigabytes per second, in which case pv would definitely be the bottleneck.

[-] wrosecrans, 19 points, 4 years ago

It'll be x86_64 (or technically it could be x86 if it's the first-gen Core Duo). The PPC laptops were all branded "PowerBook" or "iBook," and Apple hasn't shipped an ARM laptop.

[-] kjensenxz (OP), 5 points, 4 years ago

Thanks! I didn't know about the PPC branding or the lack of an ARM; I was thinking the A10 was in the MacBook Air.

[-] wrosecrans, 18 points, 4 years ago

The phones and tablets are all ARM. At this point, the iPad Pro with an optional keyboard attached to it is suspiciously similar to a laptop, but not quite. The Mac is currently all x86_64. The MacBook Pro does have a little ARM in it hidden away to control the touchbar panel, but you generally wouldn't program it directly. (Most systems have a couple of little processors like that in them these days. There's probably at least one more in the wifi controller or something.) Running a normal process in the OS is always on the Intel CPU.

Historical trivia: The PowerPC laptops were called "PowerBook." The PowerPC Macs were called "PowerMac." But the original PowerBooks predated the PPC CPUs and were all 68k. It was just coincidence when the CPU and laptop branding lined up with Power in the name.

[-] kjensenxz (OP), 19 points, 4 years ago

> The MacBook Pro does have a little ARM in it hidden away to control the touchbar panel, but you generally wouldn't program it directly

Someone put the original Doom on the touch bar, which makes me wonder about the interface with the operating system and hardware, and the specs of it. How fast can it run yes?

[-] jmtd, 11 points, 4 years ago

That is a cute hack, but I think they're still running Doom on the CPU and only rendering on the bar, not running it on the ARM.

[-] fragmede, 8 points, 4 years ago

I couldn't find any more useful specs for the CPU on the touchbar (wikipedia doesn't have much), but considering Doom has been ported to the Apple Watch, I can readily believe that the touchbar is powerful enough to run Doom. The original Pentium, launched in 1993, the year Doom was also released, had a blazing fast clock speed (and bus speed) of 60 MHz. The Apple S1 used in the Apple Watch has a CPU with a max speed of 520 MHz, and while you can't blindly compare MHz to MHz between architectures, 24 years of progress in computer technology takes us pretty far.
[-] vba7, 6 points, 4 years ago

I'd risk saying that in 1993, when Doom launched, most people had 386 processors (probably some cheap 386SX). Most would read about Pentiums in the magazines and stare at the price tags. Pentiums became popular around Windows 95 times :-) (and still were expensive)

[-] dsmithatx, 2 points, 4 years ago

I was running a 286 I got in 1986 and had to go buy a 486 66MHz to play Doom. I worked in a computer store in 1993 when the first Pentiums came out. They were expensive, and not many customers bought them the first few years.

[-] WikiTextBot, 3 points, 4 years ago

Apple mobile application processors: Apple T1

The Apple T1 chip is an ARMv7 SoC from Apple driving the Touch ID sensor of the 2016 MacBook Pro. The chip operates as a secure enclave for the processing and encryption of fingerprints, as well as acting as a gatekeeper to the microphone and iSight camera, protecting these possible targets from potential hacking attempts. The T1 runs its own version of watchOS, separate from the Intel CPU running macOS.

Apple S1

The Apple S1 is the integrated computer in the Apple Watch, described as a "System in Package" (SiP) by Apple Inc. Samsung is said to be the main supplier of key components, such as the RAM and NAND flash storage, and the assembly itself, but early teardowns reveal RAM and flash memory from Toshiba and Micron Technology.

[-] jmtd, 1 point, 4 years ago

Oh yeah, there's no doubt the ARM chip is fast enough to run Doom; I'm just fairly confident that they didn't do that. It would be much easier to hack an existing port to render on the touchbar via the proper API than to port the whole thing, and since this was thrown together for a YouTube video laugh and the source is not readily apparent, my best estimate is that's what they did. Although the Pentium debuted in '93, Doom was targeting its predecessor, one of the 486 variants. If you want to see an impressive, available embedded port of Doom, check out rockbox on an iPod or other supported PMP. I contribute to the chocolate doom source port in my spare time, and one of the things I've worked on is the raspberry pi (ARM) port.

[-] video_descriptionbot, 3 points, 4 years ago

Title: Doom on the MacBook Pro Touch Bar
Description: Doom runs on everything... but can it run on the new MacBook Pro Touch Bar? Let's find out!
Length: 0:00:58

[-] jmickeyd, 25 points, 4 years ago

I'm curious about vmsplice performance on Linux.
You could potentially have a single page of "y\n"s passed multiple times in the iov. That way you have fewer syscalls without using more RAM. Although at some point (possibly already), pv is going to be the bottleneck.

[-] Nekit1234007, 43 points, 4 years ago

Stole the code from /u/phedny and modified it a bit. Got some curious results. /u/kjensenxz can you test it on your machine?

    #define _GNU_SOURCE
    #define __need_IOV_MAX
    #include <limits.h>      /* IOV_MAX */
    #include <fcntl.h>       /* vmsplice */
    #include <string.h>
    #include <stdlib.h>
    #include <sys/uio.h>     /* struct iovec */

    #define LEN 2
    #define TOTAL (1*1024*1024)
    #define IOVECS IOV_MAX

    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int bufused = 0;
        int i;
        struct iovec iov[IOVECS];

        while (bufused < TOTAL) {
            memcpy(buf+bufused, yes, LEN);
            bufused += LEN;
        }

        for (i = 0; i < IOVECS; i++) {
            iov[i].iov_base = buf;
            iov[i].iov_len = TOTAL;
        }

        while(vmsplice(1, iov, IOVECS, 0));
        return 1;
    }

    $ gcc vmsplice-yes.c -o vmsplice-yes

    $ yes | pv >/dev/null
    ... 0:00:20 [5.26GiB/s] ...
    $ ./kjensenxz-yes4 | pv >/dev/null
    ... 0:00:20 [4.11GiB/s] ...

    #define TOTAL 4096
    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [4.36GiB/s] ...

    #define TOTAL 8192
    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [6.83GiB/s] ...

    #define TOTAL (1*1024*1024)
    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [9.33GiB/s] ...

[-] kjensenxz (OP), 38 points, 4 years ago

Amazing! Putting this in the OP.

    $ ./vmsplice-yes | pv >/dev/null    # 1024 * 1024
    ... [20.5GiB/s] ...

[-] _mrb, 75 points, 4 years ago

~~You can further optimize /u/Nekit1234007's code by having only 1 large element in the iovec "y\ny\ny\n..." (vs. many 2-byte "y\n" elements).~~

Edit: I misread the code; it already has large elements in the iovec. However, setting the pipe size to 1MB bumps the speed from 28 to 74 GB/s on my Skylake CPU (i5-6500). If I count things correctly (4 context switches for yes to write, pv to read, pv to write, and back to yes), assuming 100ns per switch, 100ns of instructions executing per context (300 instructions at IPC=1 and 3GHz), and 64kB per I/O op (the default pipe buffer size), then the theoretical max speed is ~80 GB/s (64kB moved per ~800ns of switching). Then tweak the pipe buffer size to 1MB (the maximum allowed) and the theoretical max should be ~1280 GB/s.

Edit 2: I reached 123 GB/s. It turns out that past ~50-70 GB/s, pv(1) itself is the bottleneck. It fetches only 128kB of data at a time via splice(), because it is too simplistic and uses a fixed buffer size that is 32 times the "block size" reported by stat() on the input; and on Linux, stat() on a pipe fd reports a block size of 4kB. So recompile pv, changing (in src/pv/loop.c):

    sz = sb.st_blksize * 32;

to this:

    sz = sb.st_blksize * 1024;

But pv also restricts the block size to 512kB no matter what. So edit src/include/pv-internal.h and replace:

    #define BUFFER_SIZE_MAX 524288

with:

    #define BUFFER_SIZE_MAX (4*1024*1024)

Then another bottleneck in pv is the fact that it calls select() once between each splice() call, which is unnecessary: if splice() indicates data was read/written successfully, then a process should just call splice() again and again.
So edit src/pv/transfer.c and fake a successful select() by replacing:

    n = select(max_fd + 1, &readfds, &writefds, NULL, &tv);

with simply:

    n = 1;

Then you will reach speeds of about 95 GB/s. Beyond that, the pipe buffer size needs to be increased further. I bumped it from the default 1MB to 16MB:

    $ sysctl fs.pipe-max-size=$((1024*1024*16))

And use this custom version of yes with a 16MB pipe buffer: https://pastebin.com/raw/qNBt8EJv

Finally, both "yes" and "pv" need to run on the same CPU core, because cache affinity starts playing a big role:

    $ taskset -c 0 ./yes | taskset -c 0 ~/pv-1.6.0/pv >/dev/null
    469GiB 0:00:02 [ 123GiB/s] [ <=>

But even at 123 GB/s the bottleneck is still pv, not yes. pv has a lot of code doing basic bookkeeping that just slows things down.

[-] Nekit1234007, 9 points, 4 years ago

I'll be damned. Added

    fcntl(1, F_SETPIPE_SZ, 1024*1024);

before the while. /cc /u/kjensenxz

    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [21.1GiB/s] ...

[-] _mrb, 5 points, 4 years ago

So you got a 2x boost, nice :) I wonder what /u/kjensenxz's system would show.

Edit: now try the version with my custom pv(1) modifications as per Edit #2

[-] kjensenxz (OP), 10 points, 4 years ago

    fcntl(1, F_SETPIPE_SZ, 1024*1024);

    $ ./vmsplice-yes | pv > /dev/null
    ... [36.8GiB/s] ...

[-] tcolgate, 5 points, 4 years ago

Interestingly, the peak I get is when I match the IOVEC size and the PIPE_SZ to my L2 cache size (256KB per core). I get 73GiB/s then!

[-] tcolgate, 3 points, 4 years ago

Just for posterity: that was a coincidence due to the hard-coded buffer sizes in pv, I think. As _mrb points out, pv uses splice, so it only ever sees the count of bytes spliced; it doesn't need to read the data to determine the size.

[-] monocirulo, 2 points, 4 years ago

I got 60GiB/s with the line added. Can this be used for network sockets?

[-] EgoIncarnate, 3 points, 4 years ago

Read the code again: the iovecs are already 1MB each (iov_len = TOTAL) of 'y\n'.

[-] [deleted], 4 years ago

[deleted]

[-] Nekit1234007, 1 point, 4 years ago

I'm not sure I follow; that array is one long element, allocated through malloc and filled with memcpys.
[-] _dancor_, 2 points, 4 years ago

https://youtu.be/FeGq48uNrLc?t=209

[-] video_descriptionbot, 2 points, 4 years ago

Title: Sense8 by Netflix - Season 02 : What's Up (Remix by Riley)
Length: 0:03:58

[-] MCPtz, 1 point, 4 years ago

Out of curiosity, what are your cache sizes? E.g. from lscpu:

    L1d cache: 32K
    L1i cache: 32K
    L2 cache:  256K
    L3 cache:  6144K

[-] _mrb, 5 points, 4 years ago

Same as yours. But cache sizes don't matter much: the "y\n" data is initialized once by yes(1) and subsequently never accessed by either yes(1) or pv(1). That's the point of zero-copy I/O via splice(). Cache locality matters only to minimize time wasted during context switches between yes(1), pv(1) and the kernel (e.g. to update the process data structures).

    Model name: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    L1d cache: 32K
    L1i cache: 32K
    L2 cache:  256K
    L3 cache:  6144K

[-] tcolgate, 9 points, 4 years ago

I thought maybe this was cheating because you weren't checking whether vmsplice was erroring. Turns out it's not erroring. pv > /dev/null < /dev/zero actually takes half a core on my machine just clearing RAM (according to perf top); your vmsplice yes takes very little CPU at all. I think you're basically measuring L1 cache bandwidth and context switches at that point. Pretty cool.

[-] Nekit1234007, 6 points, 4 years ago

That's true. When I tried to run just ./vmsplice-yes, nothing showed up on the screen, but the process used all of its CPU core; I was confused there for a sec. Replacing the existing while with

    while (vmsplice(1, iov, IOVECS, 0) > 0);

should fix the problem. But here lies the limitation of this approach: since a pty/regular file is not a pipe, nothing useful will happen and vmsplice will fail with EBADF.

[-] kjensenxz (OP), 8 points, 4 years ago

When I was writing the conclusion, I wondered how much pv was limiting.
I took a stab at it with dd, but it was an even worse bottleneck:

    $ ./yes | dd of=/dev/null bs=8192
    29703569408 bytes (30 GB, 28 GiB) copied, 5.34847 s, 5.6 GB/s

I've seen pv measure as high as 11.2 GiB/s, which really makes me wonder what percentage of the bottleneck each piece actually is, and if it weren't so late, I would definitely go poking around to check. I'll try to remember to do it tomorrow; of course, anyone and everyone else is invited to as well if they're interested!

[-] LukeShu, 3 points, 4 years ago

pv uses the splice() system call to do zero-copy reads/writes if possible. A yes|dd of=/dev/null pipeline goes like this (forgive my slight misapplication of big-O notation, and my pseudo-code intended to make explicit the kernel's internal vtables):

    yes: pipe.write(buf, len)      // O(len) ; copy data from userspace to kernelspace
    dd : pipe.read(buf, len)       // O(len) ; copy data from kernelspace to userspace
    dd : devnull.write(buf, len)   // O(0)   ; discard data

So the cost is O(2*len). But with pv's use of splice(), we can skip a step:

    yes: pipe.write(buf, len)         // O(len) ; copy data from userspace to kernelspace
    pv : splice(pipe, devnull, len)   // O(0)   ; discard data

So the cost is O(len); it makes sense that the throughput with dd would be about half of what pv gets.

[-] tiltowaitt, 20 points, 4 years ago

This is pretty interesting. Is there a real-world advantage on modern systems to such speed in GNU yes?

[-] kjensenxz (OP), 29 points, 4 years ago

I really can't think of any real advantage of yes being faster other than being able to say "look, mine's faster!", since the likelihood of needing 5 billion "y"s per second is almost 0. It might have one or two use cases in which its efficiency is actually useful, perhaps in embedded systems running several operations concurrently? A couple of people have mentioned dd and cat, which makes me wonder if the same thing could be done to either (or both) of them to speed them up as greatly, and I plan on taking a stab at them fairly soon if someone doesn't beat me to it.

[-] [deleted], 17 points, 4 years ago

dd is somewhat bound by POSIX saying the default block size needs to be 512 bytes. You can use another, but many people don't know about it.

[-] kjensenxz (OP), 10 points, 4 years ago

Good to know; I would have gone hacking at the source and might have accidentally PR'd something non-compliant. It'd make a good exercise for a custom (read: nonstandard) system, though.

[-] FUZxxl, 2 points, 4 years ago

Most people don't need dd and should use cat instead.
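To make LukeShu's pseudo-code concrete: a minimal pv-like pass-through built on splice(2), which moves pipe data to another fd without copying it through userspace. A sketch assuming Linux and glibc's splice() wrapper; pv's real loop does considerably more bookkeeping:

    /* splice-cat.c - zero-copy pass-through from stdin (a pipe) to stdout */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        for (;;) {
            /* move up to 64kB per call entirely inside the kernel */
            ssize_t n = splice(0, NULL, 1, NULL, 65536, SPLICE_F_MOVE);
            if (n <= 0)
                return n == 0 ? 0 : 1;   /* 0 = writer closed the pipe */
        }
    }

    $ gcc -O2 splice-cat.c -o splice-cat
    $ yes | ./splice-cat > /dev/null

splice(2) requires that at least one side of the transfer be a pipe, which is exactly the yes | ... situation here.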
[-] UnchainedMundane, 2 points, 4 years ago

I would like to think that having every utility on the system be really fast would add up to a generally faster system overall (especially when there are lots of shell scripts).

[-] -fno-stack-protector, 2 points, 4 years ago

plus it's just cool to have really fast things

[-] phedny, 13 points, 4 years ago

I've been able to increase speed using scatter/gather I/O with this implementation. Would love to see how it measures up on the machine you used for the other measurements:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    #define LEN 2
    #define TOTAL 8192
    #define IOVECS 256

    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int bufused = 0;
        int i;
        struct iovec iov[IOVECS];

        while (bufused < TOTAL) {
            memcpy(buf+bufused, yes, LEN);
            bufused += LEN;
        }

        for (i = 0; i < IOVECS; i++) {
            iov[i].iov_base = buf;
            iov[i].iov_len = TOTAL;
        }

        while(writev(1, iov, IOVECS));
        return 1;
    }

[-] kjensenxz (OP), 8 points, 4 years ago

What's your speed on both GNU yes and your revision? On the OP build machine:

    $ gcc yes.c
    $ ./yes | pv > /dev/null
    ... [9.05GiB/s] ...

[-] phedny, 2 points, 4 years ago

I did this on a VPS, so the numbers are not very stable, but around 1GB/s on iteration 4 and around 1.7GB/s on the iovec version. There might be another bottleneck at play here.

[-] kjensenxz (OP), 3 points, 4 years ago

> I did this on a VPS

Interesting, I just tried this on my VPS:

    $ ./yes | pv > /dev/null    # iteration 4
    ... [ 488MiB/s] ...
    $ ./iovecyes | pv > /dev/null
    ... [ 964MiB/s] ...

Very strange, so I decided to test it in a virtual machine (NetBSD):

    $ ./yes | pv > /dev/null
    ... [ 801MiB/s] ...
    $ ./iovecyes | pv >/dev/null
    ... [ 990MiB/s] ...

Both of these fluctuated from about 450 to 993. With that much fluctuation between and during runs, I don't know that any results under a hypervisor can be considered conclusive.

[-] emn13, 12 points, 4 years ago

You state the memory bandwidth is 12.8GB/s, but that's per channel, and my guess is that you're running a dual-channel setup (most people are), so 10.2GiB/s is a little less than half the theoretical throughput. Also, note that because you're writing to /dev/null, it's conceivable no reads ever occur, even at a low level, so full-throughput sequential writes really are achievable. Oh, and additionally, it's not trivially obvious (to the non-OS geek me, anyhow) why this benchmark even needs to hit RAM. Is there some cross-process TLB flush going on? After all, you may be writing a lot of memory, but you're doing so in small, very cacheable chunks, and you're discarding those immediately, so why can't this all stay within some level of cache?
[-] kjensenxz (OP), 8 points, 4 years ago

> You state the memory bandwidth is 12.8GB/s, but that's per channel, and my guess is that you're running a dual-channel setup (most people are), so 10.2GiB/s is a little less than half the theoretical throughput.

You're right, I am on a dual-channel setup, but as far as I know (not much about RAM), it would only be hitting a single channel.

> Also, note that because you're writing to /dev/null, it's conceivable no reads ever occur [...] so why can't this all stay within some level of cache?

As far as I know, the series of "y\n" is in the cache; there's plenty of room in L1 and L2. But since the output of yes is being redirected through a pipe, it does need to be read by the program on the other end (pv), which normally would throw it up on standard out but here discards it to /dev/null. To communicate through a pipe, the standard output of one program has to be buffered into memory that the end program can read, which is achieved through the kernel (pipe is a syscall). Might the halving of the memory speed be from the simultaneous reads/writes? If I implemented a timer and counter in the same program, it would probably never need to leave cache, and would instead show how quickly write() could be called on /dev/null opened as a file descriptor (might make an interesting memory/cache speed benchmark program).

[-] emn13, 3 points, 4 years ago

You'd want to test this without pv. That should be easy enough to do, since you have a working program with the same performance: simply write some fixed amount to the pipe rather than while(true), then time how long that takes. Alternatively, integrate the timing into the program itself, and have it compute and print (to stderr) the timings every (say) 5s (tiresome) or 50GB (a little easier).

[-] mccoyn, 1 point, 4 years ago

> To communicate through a pipe, the standard output of one program has to be buffered into memory that the end program can read

I wonder if you have thread-switching issues. Each program is trying to run at the same time and access the same buffer, so there will be lots of synchronization preventing this from staying in L1 and L2.

[-] TotesMessenger, 17 points, 4 years ago

I'm a bot, bleep, bloop. Someone has linked to this thread from another place on reddit:

* [/r/hackernews] How is GNU `yes` so fast?
* [/r/perfeng] How is GNU `yes` so fast? [x-post /r/unix]
* [/r/programming] How is GNU's `yes` so fast? [X-Post r/Unix]
* [/r/spacexmasterrace] Advanced yes optimizations
* [/r/yrc] How is GNU `yes` so fast? : unix

If you follow any of the above links, please respect the rules of reddit and don't vote in the other threads.
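Following up emn13's suggestion to integrate the timing into the program itself: a self-timing variant of iteration 4 that writes straight to /dev/null, taking both pv and the pipe out of the picture. A sketch; the 50GiB stopping point and CLOCK_MONOTONIC are my choices, not anything benchmarked in the thread:

    /* yes-selftimed.c - iteration 4 writing to /dev/null and timing itself */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define TOTAL 8192

    int main(void) {
        char buf[TOTAL];
        int fd = open("/dev/null", O_WRONLY);
        unsigned long long bytes = 0;
        struct timespec t0, t1;

        for (int i = 0; i < TOTAL; i += 2)
            memcpy(buf + i, "y\n", 2);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (bytes < (50ULL << 30)) {          /* stop after 50 GiB */
            if (write(fd, buf, TOTAL) != TOTAL)
                return 1;
            bytes += TOTAL;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        fprintf(stderr, "%.2f GiB/s\n", bytes / secs / (1 << 30));
        return 0;
    }

If kjensenxz's hypothesis above is right, this should run far faster than the piped benchmarks, since the buffer never has to leave cache.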
[-] deathanatos, 9 points, 4 years ago

> yes on my system was built with CFLAGS="-O2 -pipe -march=native -mtune=native"

I sense a (fellow) Gentoo user.

[-] CowboyBoats, 8 points, 4 years ago

    $ yes
    y
    y
    y
    ...

What is this program for?

[-] [deleted], 11 points, 4 years ago

For when you don't want to type yes on terminal prompts and just want to wave the script through

[-] ggtsu_00, 11 points, 4 years ago

Responding to your wife.

[-] slammacows, 6 points, 4 years ago

yes

[-] davesidious, 5 points, 4 years ago

yes

[-] ulisses_guimaraes, 2 points, 4 years ago

yes

[-] kozzi11, 7 points, 4 years ago

This is my D version:

    void main() {
        import std.range : array, cycle, take, only;
        import std.stdio : stdout;
        import std.algorithm : copy;

        "y\n".cycle.take(8192).array.only.cycle.copy(stdout.lockingBinaryWriter);
    }

GNU yes: 2.52GiB/s
D yes: 3.14GiB/s

[-] kozzi11, 1 point, 4 years ago

And here is a version with a while loop:

    void main() {
        import std.range : array, cycle, take;
        import std.stdio : stdout;

        auto buf = "y\n".cycle.take(8192).array;
        while(true)
            stdout.rawWrite(buf);
    }

[-] kjensenxz (OP), 1 point, 4 years ago

I couldn't get a D compiler working on Gentoo, so here it is on Arch on my laptop:

    $ yes | pv > /dev/null
    ... [5.57GiB/s] ...
    $ ldc2 yes1.d
    $ ./yes1 | pv > /dev/null
    ... [5.52GiB/s] ...
    $ ldc2 yes2.d
    $ ./yes2 | pv >/dev/null
    ... [5.42GiB/s] ...

[-] [deleted], 2 points, 4 years ago

I managed to get gdc working on Gentoo using the dlang overlay, but it looks like the standard library is old enough that it doesn't have stdout.lockingBinaryWriter.

[-] kozzi11, 2 points, 4 years ago

So try the other version, without stdout.lockingBinaryWriter. It should compile.

[-] [deleted], 2 points, 4 years ago

It does. Here's how your while-loop version compares on my machine:

    # yes | pv > /dev/null
    ... 7.07GiB/s
    # ./yes | pv > /dev/null
    ... 8.56GiB/s

[-] SixLegsGood, 7 points, 4 years ago

1. What happened to the caches? Shouldn't this tiny program and the tiny amount of the OS being exercised fit within the L2 cache? Why then should it be limited to main memory speed?

2. Is 'pv' a bottleneck? I see a comment below that you tried sending the output through dd to /dev/null.
Perhaps try running something like pv < /dev/zero (although I wouldn't be surprised to find that /dev/zero is slower than yes...)

[-]kjensenxz[S] 10 points 4 years ago (8 children)

> What happened to the caches? Shouldn't this tiny program and the tiny amount of the OS being exercised fit within the L2 cache? Why then should it be limited to main-memory speed?

This is a great question; in fact, it should fit in L1 on my processor (32K data, 32K instructions). I would assume it's stuck at memory speed because there is a pipe involved, and now that you mention it, the best way to measure this would probably be an internal timer and counter.

> Is pv a bottleneck? I see a comment below that you tried sending the output through dd to /dev/null. Perhaps try running something like: pv < /dev/zero

    $ pv < /dev/zero
    ... [4.79MiB/s] ...
    $ pv > /dev/null < /dev/zero
    ... [20.6GiB/s] ...

Honestly, at this point it's very difficult to say whether pv is the bottleneck. Several people have mentioned it, and I've thought about it, and I think the real bottleneck has to be the pipe, because it has to go through memory to move data from one end to the other.

[-]SixLegsGood 7 points 4 years ago (3 children)

Wow, thanks for the quick reply and benchmark!

IIRC, back in the day, IRIX supported a crude form of zero-copy I/O where, if you were reading or writing page-sized chunks of memory that were properly aligned, it would use page-table trickery to share the data between processes (or between OS drivers and processes), so that the reads and writes really did do nothing. In practice, the optimisation never seemed to be too useful; there were always too many constraints that made the "zero-copy" cost more than a simple data transfer (the sending process or driver had to not touch the memory again, the receiver couldn't alter the data in the pages, the trick added extra locking, and on many systems, updating the page tables was slower than just copying the 4 KB chunks of memory). But for this particular benchmark, I suspect it could hit a crazy theoretical "transfer" speed...

[-]kjensenxz[S] 5 points 4 years ago (2 children)

> IRIX used to support a crude form of zero-copy I/O where, if you were reading / writing page-sized chunks of memory that were properly aligned, it would use page table trickery to share the data between processes (or between OS drivers and processes), so that the reads and writes really did do nothing.

You know, I have a spare computer, and IRIX is available on a torrent site. This makes me wonder if I could (or should) try to install it and benchmark this application on bare metal (hypervisors seem to completely ruin benchmarking).

[-]SixLegsGood 5 points 4 years ago (0 children)

You'd definitely need to run it on bare metal to test this optimisation; virtualisation would be emulating all of the page-table machinery. I think it also only worked on specific SGI hardware (or maybe it was specific to the MIPS architecture?), and there were other restrictions: the read()s and write()s had to be 4 KB (I think) chunks, 4 KB aligned, possibly with a spare 4 KB page on either side of the buffers too. It may also have been restricted to driver-to-application transfers; the use case I encountered was a web server writing static files out to the network as fast as possible.

[-][deleted] 1 point 4 years ago (0 children)

If you post code, I'd be happy to compile and test it on my Octane.
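On Linux, the closest present-day analogue to that page-table trickery is vmsplice(2), which maps user pages into a pipe instead of copying them; it is the mechanism behind the vmsplice-based numbers that appear further down the thread. A minimal sketch, assuming Linux with _GNU_SOURCE; the buffer size is an arbitrary choice, and this is not the tuned implementation linked later:

    /* vmsplice-sketch.c - splice the same "y\n" pages into the pipe
       repeatedly instead of copying them on every write() */
    #define _GNU_SOURCE
    #include <fcntl.h>    /* vmsplice */
    #include <stdlib.h>   /* aligned_alloc */
    #include <sys/uio.h>  /* struct iovec */

    int main(void) {
        size_t len = 65536;
        char *buf = aligned_alloc(4096, len);  /* page-aligned, C11 */
        if (!buf)
            return 1;
        for (size_t i = 0; i < len; i += 2) {
            buf[i] = 'y';
            buf[i + 1] = '\n';
        }
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        /* Fails with EINVAL unless stdout actually is a pipe. */
        while (vmsplice(1, &iov, 1, 0) > 0)
            ;
        return 0;
    }

The same caveat SixLegsGood mentions applies here: the pages are shared with the pipe reader after the call, so the writer must not modify them while they are still in flight. A buffer that never changes, like a run of "y\n", sidesteps the problem entirely.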
[-]fragmede 2 points 4 years ago (1 child)

I'm on a much different system - an ARM Chromebook - but I get slightly better performance using

    pv <(yes) > /dev/null

compared to

    yes | pv > /dev/null

Do you?

[-]kjensenxz[S] 2 points 4 years ago (0 children)

It's identical here. Run the tests several times; they may be within each other's margin of error.

[-]Pastrami 2 points 4 years ago (1 child)

What kind of hardware and distro are you running this on? I get wildly different numbers on my crappy work PC with Mint 17:

    $ yes | pv > /dev/null
    [ 162MB/s]
    $ pv < /dev/zero
    [76.5MB/s]
    $ pv > /dev/null < /dev/zero
    [18.5GB/s]

yes | pv > /dev/null only gives me 162 MB/s, which is 63 times slower than yours, while my pv < /dev/zero is roughly 16 times faster. I've got an i7-6700 CPU @ 3.40GHz with 8 GB of DDR4.

[-]kjensenxz[S] 1 point 4 years ago (0 children)

i7-4790, 8 GB DDR3 @ 1600 MHz

[-][deleted] 4 years ago* (5 children)

[deleted]

[-]kjensenxz[S] 4 points 4 years ago (4 children)

This feels like a troll post, but I'll do it anyway.

    $ ./yes > out &
    $ tail -f out | pv > /dev/null
    ... [ 188MiB/s] ...

Calculating it by hand with watch -n 0.5 ls -lh out gives about the same result.

[-][deleted] 4 years ago* (2 children)

[deleted]

[-]kjensenxz[S] 3 points 4 years ago (1 child)

Here are dd and pv for a baseline and comparison:

    $ dd bs=1024K count=1024 if=/dev/zero of=/tmp/zerotest
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.312857 s, 3.4 GB/s
    $ pv < /dev/zero > /tmp/zerotest
    ... [ 189MiB/s] ...
    $ time ./yes > /tmp/yesout   # 5th revision
    real 0m8.554s
    $ du -b /tmp/yesout          # bytes
    2973360128 /tmp/yesout

2973360128 bytes / 8.554 s ≈ 347,598,799 bytes/s ≈ 0.324 GiB/s. Redoing this experiment over a longer run actually results in a lower speed.

load more comments (1 reply)

[-]yomimashita 3 points 4 years ago (0 children)

How about just yes > /dev/null, with the counter built into yes?

[-]timvisee 7 points 4 years ago (0 children)

The fact that they took the time to optimize such a little program as this, with some great tricks, amazes me!

[-]K3wp 6 points 4 years ago (0 children)

re: this point:

> Buffer your I/O for faster throughput

I do HPC Linux deployments, and my #1 trick that nobody seems to know about is this command: https://linux.die.net/man/1/buffer

> buffer - very fast reblocking program

Using it in a pipeline can produce some pretty significant speedups, particularly when sending something over the network.
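For illustration, a hypothetical pipeline of the kind K3wp describes; per the linked man page, -s sets the block size and -m the total buffer memory, but the sizes and host name here are invented:

    $ tar cf - /var/data | buffer -s 64k -m 16m | ssh backuphost 'cat > /backup/data.tar'

The reblocking decouples the two ends of the pipe, so a bursty producer and a high-latency consumer no longer stall each other on every small read and write.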
[-]crowdedconfirm 4 points 4 years ago (6 children)

Interestingly, yes on my MacBook Air seems to be much slower than the statistics you posted, although for most practical purposes I don't see it making much of a difference.

    1.66GiB 0:01:01 [28.9MiB/s] [ <=> ]

[-]kjensenxz[S] 8 points 4 years ago (5 children)

From everything I've read in the comments here and on Hacker News, it comes down to a few things:

* OS X's small buffer size (reported to be 1024 bytes, smaller than a page)
* the MacBook's slower processor and possibly different RAM timing (my proposal, refuted several times; the 2017 Air has 1600 MHz RAM just like the build machine)
* OS X's traditional-Unix yes implementation

I'd love to see how it benches against GNU's yes; this comment claims 7.2 GiB/s on Linux on an Air.

[-]crowdedconfirm 5 points 4 years ago (4 children)

The MacBook Air hasn't had a release since 2015 - perhaps they meant a MacBook Pro? (My bench was on a brand-new MacBook Air, Mid-2015, bought in April 2017.) The 28.9 MiB/s was using macOS's built-in yes, by the way. I had Chrome open with some tabs at the time, but I would sure hope that wouldn't affect my RAM bandwidth that much.

[-]kjensenxz[S] 5 points 4 years ago (3 children)

> The MacBook Air hasn't had a release since 2015, perhaps they mean a MacBook Pro?

My bad, I was just blindly reading off https://www.apple.com/macbook-air/specs/ (it is 15 til 6, after all). I couldn't vouch for how much Chrome would take up, though anecdotally it's known as a memory hog.

[-]crowdedconfirm 6 points 4 years ago (2 children)

Apple doesn't really make the release dates of their products clear on their pages, which sucks. It's 3:53 (AM) here... :P

Running cat /dev/zero | pv > /dev/zero with Chrome (6 tabs), Discord, and iTunes open - worse conditions than the original test - gives me quite a bit more bandwidth:

    68.1GiB 0:00:36 [1.93GiB/s] [ <=> ]

The score for yes | pv > /dev/zero is still pretty abysmal, though:

    277MiB 0:00:10 [27.5MiB/s] [ <=> ]

[-]kjensenxz[S] 4 points 4 years ago (1 child)

I doubt it would help much, but do you get a performance boost from redirecting to /dev/null? Also, what's your speed with my fourth-iteration yes?

[-]crowdedconfirm 4 points 4 years ago (0 children)

Wait, why am I redirecting to /dev/zero? That makes no sense now that I put some thought into it...
:P

    Mabel:$ cat /dev/zero | pv > /dev/null
    13.2GiB 0:00:07 [1.87GiB/s] [ <=> ]

The other test:

    Mabel:$ yes | pv > /dev/null
    285MiB 0:00:10 [28.5MiB/s] [ <=> ]

[-]minimim 3 points 4 years ago* (3 children)

Here's my Perl 6 version:

    $ perl6 -e 'my \buf = Blob.new: |"y\n".NFC xx 8192; loop {$*OUT.write: buf}' | pv > /dev/null
    [5.95GiB/s]

[-]aaronsherman 2 points 4 years ago (2 children)

A slightly simpler version that caches the value of $*OUT, which is a dynamic variable (per comments in /r/perl6):

    $ perl6 -e 'my $out = $*OUT; my $m = ("y\n" x (1024*8)).encode("ascii"); loop { $out.write: $m }' | pv > /dev/null

On my box, this is substantially faster than GNU coreutils yes, which is a little shocking.

Edit: Note that the Perl 6 version seems to be roughly on par with:

    $ dd if=/dev/zero bs=8k | pv > /dev/null

load more comments (2 replies)

[-]11chase 4 points 4 years ago (0 children)

Small nitpick - I used to benchmark memory controllers for a certain CPU manufacturer, and this explanation is not the whole truth:

> With DDR3-1600, it should be 11.97 GiB/s (12.8 GB/s), where is the missing 1.5? [...] all the overhead incurred by the kernel throttles our memory access, pipes, pv, and redirection is enough to negate 1.5 GiB/s

While it's certainly true that task switching and system calls contribute overhead, the real problem is the DDR protocol itself. It's not possible to transfer data to or from DDR on every single cycle, because some cycles must be used to transmit address information. This is minimized when transferring contiguous blocks of memory, but that's not necessarily what yes is doing. Even within a single page of memory, which is presumably contiguous in the physical address space, the actual module/rank/row/column indices may be hashed or scrambled by the DDR controller in order to stripe the memory space across different memory modules, different ranks, or different NUMA nodes in a multiprocessor system, or as a security measure to mitigate cold-boot or rowhammer attacks.

Additionally, DDR modules need to periodically refresh the charge stored in the capacitors that implement the data storage. The DDR controller on the CPU does this transparently, but it consumes bus cycles and creates periods during which the module cannot service read/write requests.

Lastly, DDR is a half-duplex protocol, i.e. you can read or write but not both at the same time. Switching the bus between read and write mode, which is necessary when copying memory, consumes bus cycles as well.

tl;dr: even with perfectly written ring-0 software, it is impossible to reach the theoretical bandwidth of DDR systems, and it's not uncommon for DDR controllers to cap out 10-15% below the theoretical limit.
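To put rough numbers on that derating (a back-of-the-envelope calculation, assuming a single 64-bit DDR3-1600 channel):

    1600 MT/s x 8 bytes        = 12.8 GB/s theoretical
    12.8e9 B/s / 2^30          = ~11.92 GiB/s
    11.92 GiB/s x 0.85 to 0.90 = ~10.1 to 10.7 GiB/s

So a 10-15% protocol overhead alone brackets the observed 10.2 GiB/s, without having to charge the whole gap to the kernel, the pipe, or pv.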
[-]kjensenxz[S] 3 points 4 years ago* (0 children)

Here's a submission from prussian:

    #!/bin/sh
    BUFSIZ=32768
    SIZE=2
    ARGS=$(( BUFSIZ / SIZE ))
    IFS='
    '
    if [ -z "$1" ]; then
        set y ''
    else
        x="$1"
        SIZE=$(( ${#x} + 1 ))
        ARGS=$(( BUFSIZ / SIZE ))
        shift
        set "$x" ''
    fi
    while [ $# -le "$ARGS" ]; do
        set "$@$@"
    done
    while printf %s "$*"; do
        :
    done

    $ ./yes.sh | pv > /dev/null
    ... [9.68MiB/s] ...

Another:

    #!/usr/bin/env node
    var BUFSIZ = process.stdout._writableState.highWaterMark * 4
    var str = process.argv[2] || 'y',
        len = str.length + 1,
        buffer = Buffer.from(Array.from({ length: BUFSIZ / len }, () => str).join('\n'))
    function yes() {
        process.stdout.write(buffer, yes)
    }
    yes()

    $ node yes.js | pv > /dev/null
    ... [9.17GiB/s] ...

[-]Auxx 3 points 4 years ago (4 children)

GNU yes is a good example of a wrong optimisation. Say you have a utility X that awaits a "yes" input (y). It only needs two characters ("y\n"). Yet GNU yes (in yes | X) will flood it with 8 KB of data. The OS will discard everything except the first two bytes, but it is still a performance issue. And yes runs until killed, consuming 100% of a CPU. It is the worst utility ever created, and GNU made it even worse. If the GNU folks really worried about performance, they should have removed this abomination long ago.

load more comments (4 replies)

[-][deleted] 3 points 4 years ago (0 children)

Here's my version in Go. It even seems to be a bit faster than the GNU version:

    package main

    import "os"

    func main() {
        var txt []byte
        if len(os.Args) > 1 {
            txt = []byte(os.Args[1] + "\n")
        } else {
            txt = []byte("y\n")
        }
        bufLen := 8 * 1024
        buf := make([]byte, bufLen)
        used := 0
        for used < bufLen && len(txt) <= bufLen-used {
            copy(buf[used:], txt)
            used += len(txt)
        }
        for {
            os.Stdout.Write(buf)
        }
    }

The tests consistently show the same results:

    $ yesgo | pv > /dev/null
    ... [5.66GiB/s] ...
    $ yes | pv > /dev/null
    ... [5.27GiB/s] ...

[-][deleted] 2 points 4 years ago (0 children)

Uh oh - look at Kiki!

[-][deleted] 4 years ago (10 children)

[deleted]

[-][deleted] 3 points 4 years ago (9 children)

https://unix.stackexchange.com/questions/102484/what-is-the-point-of-the-yes-command

When updating ports on a FreeBSD workstation, portmaster plus yes becomes very handy:

    yes | portmaster -da

That way you can let the machine update while you have lunch, and all the questions default to "y"/"yes". When rebuilding the world [1], it's a big time saver for make delete-old and make delete-old-libs:

    yes | make delete-old
    yes | make delete-old-libs

Basically, it helps you avoid typing to confirm operations that ask for a "y" or "yes".

[1]: http://www.freebsd.org/doc/handbook/makeworld.html

[-]crackanape 4 points 4 years ago (8 children)

That doesn't explain why it needs to be so fast. A few microseconds' delay in moving on to the next step of updating ports is hardly going to be the thing that ruins your lunch.

[-]apotheon 4 points 4 years ago (7 children)

The following is my response to the top-level, now-deleted comment (I wish it hadn't been deleted, especially while I was writing this response):

---------------------------------------------------------------------

Gawwad gives a good answer to the first question (what is it), but the tl;dr version is: "It automates answering 'yes' to confirmation requests from other software."

The answer to your second question ("Why does it need to be this fast?") is "It doesn't."
Seriously, this was an interesting exercise in understanding why something is fast, but the yes command is absolutely not an important place to do this kind of optimization. It makes the code harder to read, and harder to understand, for a very simple tool.

> Premature optimization is the root of all evil. - Donald Knuth

[-]greyfade 4 points 4 years ago (6 children)

I don't like it when people just pull out the premature-optimization quote and leave off the context:

> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

It says what you did, and does so in a clear and concise way, and includes the bits that overzealous people forget.

[-]apotheon 3 points 4 years ago (5 children)

The context is unnecessary in this case, because GNU yes is very damned far from that 3%.

edit:

> these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.

compare with:

> the "yes" command is absolutely not an important place to do this kind of optimization. It makes the code harder to read, and harder to understand, for a very simple tool.

I basically paraphrased him by independently formulating an essential principle of good design.

load more comments (5 replies)

[-]hegbork 2 points 4 years ago (1 child)

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
    #if 0   /* testing version */
        int bufsz = atoi(argv[1]);
        int iovcnt = atoi(argv[2]);
        assert((bufsz & 1) == 0);
    #else
        int bufsz = 8192;
        int iovcnt = 64;
    #endif
        struct iovec iov[iovcnt];
    #if 1
        char buf[bufsz];
    #else
        char *buf;
        if (posix_memalign((void **)&buf, getpagesize(), bufsz))
            exit(1);
    #endif
        int i;

        for (i = 0; i < bufsz; i += 2) {
            buf[i + 0] = 'y';
            buf[i + 1] = '\n';
        }
        for (i = 0; i < iovcnt; i++) {
            iov[i].iov_base = buf;
            iov[i].iov_len = bufsz;
        }
        while (writev(1, iov, iovcnt) == bufsz * iovcnt)
            ;
        return 0;
    }

Performs almost twice as fast as iteration 4 on one OS X and one Linux machine. The 8192/64 numbers were empirically tested to behave best on both. This is weird, because on the systems I know (the BSDs) there is magical code that kicks in on pipe writes bigger than 8192 which enlarges the pipe buffer, and last time I looked OS X used the same pipe code. The posix_memalign allocation was there to see if some zero-copy mechanism kicks in, but it doesn't on the systems where I tried this, so it's disabled.

Writing this in other languages, in assembler, optimizing the initialization, etc. is pretty pointless, because the cost should all be in the overhead between the system call and the point where the kernel copies the data from userland to a pipe buffer - something you can only control by reducing the number of system calls. So theoretically the best we can do is to increase the number of iovecs passed to writev, but that doesn't seem to make much (if any) difference, so 8192/64 stays as good enough.
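hegbork's listing already contains the knobs for the sweep he mentions: with the "#if 0" testing block flipped to "#if 1", bufsz and iovcnt come from the command line. A hypothetical run (the file name is invented; pv's -a flag prints the average rate, and coreutils timeout bounds each run):

    $ cc -O2 yes-writev.c -o yes-writev
    $ for n in 8 16 32 64 128; do
          timeout 5 ./yes-writev 8192 $n | pv -a > /dev/null
      done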
[-]kjensenxz[S] 2 points 4 years ago (0 children)

    $ ./yes | pv > /dev/null
    ... [9.31GiB/s] ...

[-]_mrb 2 points 4 years ago* (0 children)

This tuning of yes, pv, and the pipe buffer size does 123 GB/s: https://www.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/diua761/

[-]hikilaka 2 points 4 years ago (0 children)

mutha fuckin joe dirt......... smh

[-]johnklos 2 points 4 years ago (0 children)

Interesting how much of a difference there is between unoptimized and optimized on a standard Ubuntu system:

    $ yes | pv > /dev/null
    101GiB 0:00:18 [6.04GiB/s] [ <=> ]
    $ ./vmsplice-yes | pv > /dev/null
    41.2TiB 0:01:35 [ 444GiB/s] [ <=> ]

    $ lscpu
    Architecture:          ppc64le
    Byte Order:            Little Endian
    CPU(s):                160
    On-line CPU(s) list:   0-159
    Thread(s) per core:    8
    Core(s) per socket:    10
    Socket(s):             2
    NUMA node(s):          2
    Model:                 2.0 (pvr 004d 0200)
    Model name:            POWER8 (raw), altivec supported
    CPU max MHz:           3491.0000
    CPU min MHz:           2061.0000
    L1d cache:             64K
    L1i cache:             32K
    L2 cache:              512K
    L3 cache:              8192K
    NUMA node0 CPU(s):     0-79
    NUMA node8 CPU(s):     80-159

[-]pinbender 2 points 4 years ago (0 children)

FYI, this is an example of how buffer overflows happen. If LEN changes to a value that doesn't evenly divide the buffer size, the copy will overflow. That's not the case in the code as listed, but code changes, people copy working code, etc. The while loop should account for the size of what it's writing, to make sure it will fit:

    while ((bufused + LEN) <= TOTAL) {
        memcpy(buf+bufused, yes, LEN);
        bufused += LEN;
    }

[-]-maandree- 2 points 4 years ago (0 children)

https://github.com/maandree/yes-silly

[-][deleted] 3 points 4 years ago (1 child)

You give GNU credit for being thorough enough to produce such a good implementation, but to me this whole thing is kind of sad: if you just use the stdlib naively and don't invest a bunch of time optimizing even something as simple as this, you end up with a very sub-optimal implementation.

[-]mcjiggerlog 1 point 4 years ago (0 children)

Hackernews discussion

[-]DorffMeister 1 point 4 years ago (0 children)

Fun read.

[-]iheartrms 1 point 4 years ago (0 children)

Asking the real questions...

load more comments (16 replies)