https://old.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/

How is GNU `yes` so fast? (self.unix)
submitted 4 years ago by kjensenxz (1,339 points, 96% upvoted)

How is GNU's yes so fast?

    $ yes | pv > /dev/null
    ... [10.2GiB/s] ...

Compared to other Unices, GNU is outrageously fast. NetBSD's is 139MiB/s; FreeBSD, OpenBSD, and DragonFlyBSD have very similar code to NetBSD's and are probably identical; illumos's is 141MiB/s without an argument, 100MiB/s with one. OS X just uses an old NetBSD version similar to OpenBSD's, MINIX uses NetBSD's, BusyBox's is 107MiB/s, Ultrix's (3.1) is 139MiB/s, and COHERENT's is 141MiB/s.
Let's try to recreate its speed (I won't be including headers here):

    /* yes.c - iteration 1 */
    void main() {
        while(puts("y"));
    }

    $ gcc yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [141MiB/s] ...

That's nowhere near 10.2GiB/s, so let's just call write without the puts overhead.

    /* yes.c - iteration 2 */
    void main() {
        while(write(1, "y\n", 2)); // 1 is stdout
    }

    $ gcc yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [6.21MiB/s] ...

Wait a second, that's slower than puts; how can that be? Clearly, there's some buffering going on before writing. We could dig through the source code of glibc and figure it out, but let's see how yes does it first. Line 80 gives a hint:

    /* Buffer data locally once, rather than having the
       large overhead of stdio buffering each item.  */

The code below that simply copies argv[1:] or "y\n" to a buffer and, assuming that two or more copies could fit, copies it several times into a buffer of BUFSIZ bytes. So, let's use a buffer:

    /* yes.c - iteration 3 */
    #define LEN 2
    #define TOTAL (LEN * 1000)
    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int used = 0;

        while (used < TOTAL) {
            memcpy(buf+used, yes, LEN);
            used += LEN;
        }

        while(write(1, buf, TOTAL));
        return 1;
    }

    $ gcc yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [4.81GiB/s] ...

That's a ton better, but why aren't we reaching the same speed as GNU's yes? We're doing the exact same thing; maybe it's something to do with this full_write function. Digging leads to this being a wrapper for a wrapper for a wrapper (approximately) just to write(). This is the only part of the while loop, so maybe there's something special about their BUFSIZ?

I dug around in yes.c's headers forever, thinking that maybe it was part of config.h, which autotools generates. It turns out BUFSIZ is a macro defined in stdio.h:

    #define BUFSIZ _IO_BUFSIZ

What's _IO_BUFSIZ? libio.h:

    #define _IO_BUFSIZ _G_BUFSIZ

At least the name gives a hint. _G_config.h:

    #define _G_BUFSIZ 8192

Now it all makes sense: BUFSIZ is page-aligned (memory pages are usually 4096 bytes), so let's change the buffer to match:

    /* yes.c - iteration 4 */
    #define LEN 2
    #define TOTAL 8192
    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int bufused = 0;

        while (bufused < TOTAL) {
            memcpy(buf+bufused, yes, LEN);
            bufused += LEN;
        }

        while(write(1, buf, TOTAL));
        return 1;
    }

And, since building without the same flags as the yes on my system makes it run slower (my yes was built with CFLAGS="-O2 -pipe -march=native -mtune=native"), let's build it with those flags and refresh our benchmark:

    $ gcc -O2 -pipe -march=native -mtune=native yes.c -o yes
    $ ./yes | pv > /dev/null
    ... [10.2GiB/s] ...
    $ yes | pv > /dev/null
    ... [10.2GiB/s] ...

We didn't beat GNU's yes, and there probably is no way. Even with the function overheads and additional bounds checks of GNU's yes, the limit isn't the processor; it's how fast memory is. With DDR3-1600 (1600 MT/s at 8 bytes per transfer, i.e. 12.8 GB/s per channel), it should be 11.92 GiB/s. Where is the missing ~1.5? Can we get it back with assembly?

    ; yes.s - iteration 5, hacked together for demo
    BITS 64
    CPU X64

    global _start

    section .text

    _start:
        inc rdi           ; stdout, will not change after syscall
        mov rsi, y        ; will not change after syscall
        mov rdx, 8192     ; will not change after syscall

    _loop:
        mov rax, 1        ; sys_write
        syscall
        jmp _loop

    y: times 4096 db "y", 0xA

    $ nasm -f elf64 yes.s
    $ ld yes.o -o yes
    $ ./yes | pv > /dev/null
    ... [10.2GiB/s] ...

It looks like we can't outdo C nor GNU in this case.
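(An aside on that full_write function: semantically, it just keeps calling write(2) until the whole buffer has gone out, since a single write on a pipe may be short. A minimal sketch of such a wrapper, as an illustration of the idea rather than gnulib's actual code:)

    /* full_write-style helper: retry write(2) until the whole buffer is
       written or an error occurs. A sketch, not the gnulib implementation. */
    #include <unistd.h>

    static size_t full_write_sketch(int fd, const char *buf, size_t count)
    {
        size_t total = 0;
        while (total < count) {
            ssize_t n = write(fd, buf + total, count - total);
            if (n <= 0)
                return total;   /* error; caller can inspect errno */
            total += (size_t)n;
        }
        return total;
    }

(For the 8192-byte buffer above, partial writes to a pipe are possible, since PIPE_BUF only guarantees atomicity up to 4096 bytes on Linux, which is presumably why GNU bothers with the wrapper at all.)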
Buffering is the secret, and all the overhead incurred by the kernel (throttled memory access, pipes, pv, and redirection) is enough to negate 1.5GiB/s. What have we learned?

* Buffer your I/O for faster throughput
* Traverse source files for information
* You can't out-optimize your hardware

Edit: _mrb managed to edit pv to reach over 123GiB/s on his system!

Edit: Special mention to agonnaz's contribution in various languages! Extra special mention to Nekit1234007's implementation, which completely doubled the speed using vmsplice!

* celebdor optimized the Python script for faster speeds
* jaysonsantos created a version in Rust
* Astrokiwi created a Fortran version
* TheSizik created a Brainfuck version
* TqDlRTd3DrgQlW4PFQH6 created a Haskell version
* prussian from Rizon created a shell version and a Node.js version
* kozzi11 created a D version

240 comments, sorted by best

[-] jmtd, 824 points, 4 years ago

It's a shame they didn't finish their kernel, but at least they got yes working at 10GiB/s.

[-] [deleted], 4 years ago

[deleted]

[-] mailto_devnull, 76 points, 4 years ago

Except flash is now on its way out, so in hindsight waiting for Flash to die was a viable strategy!

[-] [deleted], 4 years ago

[deleted]

[-] [deleted], 4 years ago

[deleted]

[-] [deleted], 14 points, 4 years ago

I'm quite sure.

[-] DurdenVsDarkoVsDevon, 15 points, 4 years ago

Non-mobile link.

[-] sadmac, 17 points, 4 years ago

There's actually a changelog somewhere in some X component that says "Fixes XKCD 619"

[-] feitingen, 19 points, 4 years ago

And here it is: https://github.com/jjneely/elrepo/blob/master/xorg-x11-drv-intel/el6/xorg-x11-drv-intel.spec#L259

[-] NeverCast, 2 points, 4 years ago

I want to favourite this comment.

[-] GNU_Troll, 7 points, 4 years ago

These comics suck. I don't know how anyone likes them.

[-] kbob, 38 points, 4 years ago

Relevant username.

[-] GNU_Troll, 7 points, 4 years ago

Not even trolling. Just stating a fact.

[-] didnt_readit, 9 points, 4 years ago

*opinion

[-] [deleted], 11 points, 4 years ago

A false fact

[-] bityard, 32 points, 4 years ago

Username checks out, but downvoted anyway.
[-] [deleted], 12 points, 4 years ago

I'm feeling you are in a minority.

[-] GNU_Troll, 5 points, 4 years ago

Good taste isn't common, keep that in mind too.

[-] Sag0Sag0, 9 points, 4 years ago

According to you, the minority.

[-] arachnidGrip, 3 points, 4 years ago

Why do you dislike them?

[-] GNU_Troll, 2 points, 4 years ago

The illustration sucks, the writing sucks, and half of them are not even comics. Just one panel, which is not a comic.

[-] Hyperkubus, 3 points, 4 years ago

Who, other than yourself, said they were comics?

[-] yannik121, 2 points, 4 years ago

From xkcd.com: "A webcomic of romance, sarcasm, math, and language."

[-] [deleted], 3 points, 4 years ago

I like them because I find them entertaining often enough. Is that too hard to realize by yourself?

[-] GNU_Troll, 2 points, 4 years ago

Not everyone is entertained by stick figure drawings; most of us are a little old for that.

[-] [deleted], 7 points, 4 years ago

Exactly, not everyone. Btw, it's not about the stick figure drawings, it's about the message; it doesn't matter much how it looks. And don't mistake missing context and/or knowledge about a topic for being 'a little too old for that'. If you don't get it, you don't get it. It's okay.

[-] GNU_Troll, 2 points, 4 years ago

I get it, I just have higher standards. It's an art form, so excusing poor illustration and writing because it's about getting a message across is kind of a cop out.

[-] [deleted], 10 points, 4 years ago

No, you don't have higher standards; you have different taste. And no, you don't get 'it'. If you got 'it', you would find it funny. Otherwise you are just a bystander, analysing and evaluating without getting 'it'. Getting 'it' doesn't mean that you can comprehend why someone else might find it funny.

[-] bulkygorilla, 4 points, 4 years ago

A "one-off", if you will

[-] NotRichardDawkins, 4 points, 4 years ago

> most of us are a little old for that

Have fun in your boring grown-up land with your boring grown-up pants.
[-] kjensenxz (OP), 164 points, 4 years ago

This should be a fortune

[-] never_amore, 43 points, 4 years ago

fortune doesn't work at 10GiB/s

[-] kjensenxz (OP), 68 points, 4 years ago

    $ fortune | pv >/dev/null
    ... [2.81KiB/s] ...

This is worse than all the yesses that have been benchmarked!

[-] [deleted], 4 years ago

[deleted]

[-] Moonpenny, 18 points, 4 years ago

It also would have far fewer fortunes, since most of the fortunes are taken without attribution and can't be GPL'd. The Twain stuff should be fine, at least.

[-] Tyler_Zoro, 20 points, 4 years ago

... and it would read mail. Plus, half of the fortunes would be some variation of "it's called GNU/Linux".

[-] veroxii, 63 points, 4 years ago

That hurds. :(

[-] Yawzheek, 16 points, 4 years ago

This right here? This is how you be a proper smart ass. Take notes.

[-] myhf, 12 points, 4 years ago

My man.

[-] ProgramTheWorld, 4 points, 4 years ago

Looking good.

[-] PM_ME_DANK_MEMES, 6 points, 4 years ago

Slow down!

[-] Sag0Sag0, 4 points, 4 years ago

Yes.

[-] incraved, 10 points, 4 years ago

> working at 10GiB/s.

Your comment would have been perfect if you had typed that as: working at 10GiB. /s

[-] enkiv2, 8 points, 4 years ago

If you've ever had to convince shell tools to process large quantities (30-300 gigs) of text data, you'll see the merit of getting yes (and cut, and paste, and tr) to operate very quickly. Optimizing the hell out of these is why your laptop can perform some operations 80x faster than a hadoop cluster (and why you should therefore always first consider writing a small shell script when someone asks you to use hadoop map reduce on a couple hundred gigs of text). Even if HURD were finished, the number of people actually using it would still probably be less than the number of people who try to monkey-parse 30GB of XML in gawk. (Source: for work I frequently monkey-parse 30+GB of XML in gawk.)

[-] RedditUserHundred, 5 points, 4 years ago

... and Stallman can bitch about "GNU slash Linux"

[-] xorgol, 5 points, 4 years ago

I'm currently running a project on GNU/NT. Stallman was right all along!
[-] alexbuzzbee, 145 points, 4 years ago

The missing 1.5GiB/s is probably kernel overhead and other processes. Try it in emergency mode for slightly more speed!

[-] kjensenxz (OP), 111 points, 4 years ago

I considered running it in single-user mode, writing a simple ring 0 program to boot off of, or using a custom tiny kernel with it as init, to squeeze as much speed as possible out of the program, but I think I've spent enough time on this; I started writing it somewhere around 4 or 5 hours ago. If anyone would like to take a crack at doing that, I'd love to see how it compares to running on a regular system.

[-] [deleted], 30 points, 4 years ago

I learned something today! For the yes command, I still prefer the first implementation. Maybe dd also has such optimizations.

[-] kjensenxz (OP), 33 points, 4 years ago

I really like the readability of the first iteration and NetBSD's, which are very similar, but they just aren't as quick, which makes me wonder if there would be a way to optimize several subsequent calls to a stdio function for the same speed in the library itself. Maybe another time I'll look into that, dd, and cat!

[-] [deleted], 4 years ago

[deleted]

[-] iluvatar, 49 points, 4 years ago

> If anything, you'd want to generalize to emit any character

yes already does this. Indeed, it goes further and repeatedly emits any arbitrary string. It's had this behaviour for at least the 30 years that I've been using it.

[-] [deleted], 4 years ago

[deleted]

[-] supercheese200, 24 points, 4 years ago

> no maybe i don't know
> no maybe i don't know

Can you repeat the question?

[-] preludeoflight, 20 points, 4 years ago

Well, I mean, obviously,

    $ yes "no maybe i don't know"

[-] supercheese200, 19 points, 4 years ago

YOU'RE NOT THE BOSS OF ME NOW

[-] [deleted], 12 points, 4 years ago

As one of the few people crazy enough to use mono, this is well known as part of the incantation

    yes yes | mozroots --import

that gets SSL working. (This is fixed in newer versions of mono, though.)

[-] net_goblin, 3 points, 4 years ago

But isn't emitting arbitrary characters the job of echo(1)? My favourite implementation would be echo y.

[-] iluvatar, 14 points, 4 years ago

No, echo only emits it once, whereas yes repeatedly emits the string.

[-] net_goblin, 2 points, 4 years ago

Ah thanks, my bad.
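To make iluvatar's point concrete against the iterations from the OP: a sketch of iteration 4 generalized to repeat argv[1] (or "y"). This is my illustration of the idea, not GNU's actual code, and it assumes the argument plus a newline fits in the buffer:

    /* yes-argv.c - iteration 4, generalized to repeat an arbitrary string;
       a sketch, not GNU's implementation */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define TOTAL 8192

    int main(int argc, char **argv) {
        const char *word = (argc > 1) ? argv[1] : "y";
        size_t len = strlen(word) + 1;   /* +1 for the trailing newline */
        char *buf = malloc(TOTAL);
        size_t used = 0;

        /* fill the buffer with as many whole "word\n" copies as fit */
        while (used + len <= TOTAL) {
            memcpy(buf + used, word, len - 1);
            buf[used + len - 1] = '\n';
            used += len;
        }

        while (write(1, buf, used));
        return 1;
    }

    $ gcc -O2 yes-argv.c -o yes-argv
    $ ./yes-argv "no maybe i don't know" | head -n 2
    no maybe i don't know
    no maybe i don't know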
[-] bit_of_hope, 3 points, 4 years ago

yes repeats the string until the pipe is closed or yes itself is killed.

    [bitofhope@suika ~]% yes | head
    y
    y
    y
    y
    y
    y
    y
    y
    y
    y

[-] Truncator, 7 points, 4 years ago

    ~ $ yes | tail
    ^C

[-] bit_of_hope, 10 points, 4 years ago

    $ yes > /dev/null &

Why is my machine so noisy all of a sudden?

[-] StoneCypher, 3 points, 4 years ago

sorry about the stupid question. what have you been using this for?

[-] Neebat, 1 point, 4 years ago

Does that affect the throughput? (Bet it doesn't.)

[-] kjensenxz (OP), 19 points, 4 years ago

You make an excellent point, and yes is meant to do this (send argv instead of "y"); the programs could easily be modified to send any value based on argv, just by changing the buffer subroutine. I would have added that in the program demos, but I felt it would be in excess.

[-] FUCKING_HATE_REDDIT, 3 points, 4 years ago

printf buffers every line, but I think you can force it to buffer more.

[-] davesidious, 3 points, 4 years ago

wat

[-] FUCKING_HATE_REDDIT, 2 points, 4 years ago

printf buffers calls until it reaches a \n

[-] morty_a, 10 points, 4 years ago

printf/stdio behavior depends on whether or not stdout is a tty. If it's a tty, by default, stdio flushes after every newline ("line buffered"). If it's not a tty, by default, stdio waits until its buffer fills before flushing ("fully buffered").

[-] markfeathers, 6 points, 4 years ago

dd has a command line argument for the block size it writes, so you should be able to do the same thing with "dd if=/dev/zero bs=8192 | pv > /dev/null". On my PC /dev/null is ~5.12GiB/s; dd from /dev/zero is around ~4GiB/s, though. In the case of dd, it has to read from another file to write out, instead of plopping 'y\n' into a buffer, so this is likely why it is a bit slower.

[-] [deleted], 3 points, 4 years ago

Oh sure, the bs=. With devices like /dev/zero, though, there's nothing to read from a hard drive (or a cache), but it is still reading.

[-] Coding_Cat, 5 points, 4 years ago

On mobile, so I can't search properly, but there is a command for starting a program with the rt scheduler. Might make it a little faster, as it will never be preempted that way.
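A quick aside on the stdio behavior morty_a describes above: setvbuf(3) lets you see, and override, the buffering mode, which is presumably what "force it to buffer more" would look like. A sketch, with the buffer size chosen arbitrarily to match the BUFSIZ from the OP:

    /* force stdout to be fully buffered even on a tty, per the
       line-buffered vs. fully-buffered distinction above */
    #include <stdio.h>

    int main(void) {
        /* _IOFBF = fully buffered; glibc allocates its own buffer when
           the second argument is NULL */
        setvbuf(stdout, NULL, _IOFBF, 8192);
        for (;;)
            fputs("y\n", stdout);   /* flushed only when the buffer fills */
    }

Note that in the benchmarks above, stdout is a pipe rather than a tty, so stdio was already fully buffered; the gap between iteration 1 (puts) and iteration 3 is therefore likely per-call stdio overhead rather than syscall count.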
[-] stw, 53 points, 4 years ago

Just a small nitpick: puts appends a newline, so puts("y\n") writes 2 newlines.

[-] kjensenxz (OP), 31 points, 4 years ago

Thanks! I completely overlooked that and was off by about 50%. I edited the OP to reflect the real values.

[-] [deleted], 4 years ago

[deleted]

[-] kjensenxz (OP), 27 points, 4 years ago

I used aligned_alloc and actually got worse performance, generally 0.2GiB/s slower than elagergren's Go implementation and the C/assembly implementations (modified 4th iteration if you'd like to check):

    //char *buf = malloc(TOTAL);
    char *buf = aligned_alloc(4096, TOTAL);

[-] patrickbarnes, 8 points, 4 years ago

What happens if you stack allocate your buf?

[-] kjensenxz (OP), 13 points, 4 years ago

That's actually what happens in the assembly code, since it compiles the values into the binary. Here's a sample (the .y's repeat for another 500 lines or so):

    00000080: 48ff c748 be9b 0040 0000 0000 00ba 0020  H..H...@.......
    00000090: 0000 b801 0000 000f 05eb f779 0a79 0a79  ...........y.y.y
    000000a0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y
    000000b0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y
    000000c0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y

I don't know if the stack has any greater performance than the heap for something like this (we don't really need to do any memory "bookkeeping", and after all, memory is just memory), and it might mean slower initialization of the program, since it would have to read a larger file for the buffer rather than build one in memory.

[-] Vogtinator, 20 points, 4 years ago

> That's actually what happens in the assembly code, since it actually compiles the values into the binary.

That's not the stack, that's .data (or in this case .text, as not specified otherwise). To get it onto the stack, you would need to:

    sub rsp, 8192
    mov rdi, rsp
    mov rsi, y
    mov rdx, 8192
    call memcpy

Or something like that.

[-] kjensenxz (OP), 9 points, 4 years ago

Thanks, I thought certain data in .text was put onto the stack (e.g. consts).

[-] calrogman, 12 points, 4 years ago

In C it's typical that automatic variables in function scope are placed on the stack.

[-] mccoyn, 3 points, 4 years ago

Local variables (not just on the stack, but on the current stack frame) have the advantage that the stack pointer is always in a register, so their memory location is a simple calculation. With a dynamically allocated buffer, the address would have to be placed in a register. Before the call to write() is made, this address would have to be saved onto the stack and then loaded back for the next call to write().
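patrickbarnes's stack question is also easy to try in the C versions: a sketch of iteration 4 with the buffer in automatic (stack) storage instead of malloc. Illustrative only; this variant was not benchmarked in the thread:

    /* yes-stack.c - iteration 4 with a stack buffer instead of malloc */
    #include <string.h>
    #include <unistd.h>

    #define LEN 2
    #define TOTAL 8192

    int main(void) {
        char yes[LEN] = {'y', '\n'};
        char buf[TOTAL];         /* automatic storage: lives on the stack */
        int used = 0;

        while (used < TOTAL) {
            memcpy(buf + used, yes, LEN);
            used += LEN;
        }

        while (write(1, buf, TOTAL));
        return 1;
    }

Once the buffer is filled, the hot loop is identical either way, since write() only ever sees a pointer; per mccoyn's comment, any difference should amount to little more than how that pointer is materialized around the syscall.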
[-] [deleted], 4 years ago

[deleted]

[-] kjensenxz (OP), 16 points, 4 years ago

I'm not sure which architecture your MacBook is (x86_64? ARM? Ancient PPC?), but I noticed that the speed really has to do with the size of your buffer compared to your pages (4096 bytes on x86), and making sure that you can fill up at least one (two is better, IIRC). I'm not sure how much of it is stored in L1, but if it were, it should be in the hundreds of gigabytes per second, in which case pv would definitely be the bottleneck.

[-] wrosecrans, 19 points, 4 years ago

It'll be x86_64 (or technically it could be x86 if it's the first-gen Core Duo). The PPC laptops were all branded "PowerBook" or "iBook," and Apple hasn't shipped an ARM laptop.

[-] kjensenxz (OP), 5 points, 4 years ago

Thanks! I didn't know about the PPC branding or the lack of an ARM; I was thinking the A10 was in the MacBook Air.

[-] wrosecrans, 18 points, 4 years ago

The phones and tablets are all ARM. At this point, the iPad Pro with an optional keyboard attached to it is suspiciously similar to a laptop, but not quite. The Mac is currently all x86_64. The MacBook Pro does have a little ARM in it hidden away to control the touchbar panel, but you generally wouldn't program it directly. (Most systems have a couple of little processors like that in them these days. There's probably at least one more in the wifi controller or something.) Running a normal process in the OS is always on the Intel CPU.

Historical trivia: The PowerPC laptops were called "PowerBook." The PowerPC Macs were called "PowerMac." But the original PowerBooks predated the PPC CPUs and were all 68k. It was just coincidence when the CPU and laptop branding lined up with Power in the name.

[-] kjensenxz (OP), 19 points, 4 years ago

> The MacBook Pro does have a little ARM in it hidden away to control the touchbar panel, but you generally wouldn't program it directly

Someone put the original Doom on the touch bar, which makes me wonder about the interface with the operating system and hardware, and the specs of it. How fast can it run yes?

[-] jmtd, 11 points, 4 years ago

That is a cute hack, but I think they're still running Doom on the CPU and only rendering on the bar, not running it on the ARM.

[-] fragmede, 8 points, 4 years ago

I couldn't find any more useful specs for the CPU on the touchbar (wikipedia doesn't have much), but considering Doom has been ported to the Apple Watch, I can readily believe that the touchbar is powerful enough to run Doom. The original Pentium, launched in 1993, the year Doom was also released, had a blazing fast clock speed (and bus speed) of 60 MHz. The Apple S1 used in the Apple Watch has a CPU with a max speed of 520 MHz, and while you can't blindly compare MHz to MHz between architectures, 24 years of progress in computer technology takes us pretty far.
[-] vba7, 6 points, 4 years ago

I'd risk saying that in 1993, when Doom launched, most people had 386 processors (probably some cheap 386SX). Most would read about Pentiums in the magazines and stare at the price tags. Pentiums became popular around Windows 95 times :-) (and still were expensive)

[-] dsmithatx, 2 points, 4 years ago

I was running a 286 I got in 1986 and had to go buy a 486 66MHz to play Doom. I worked in a computer store in 1993 when the first Pentiums came out. They were expensive, and not many customers bought them the first few years.

[-] WikiTextBot, 3 points, 4 years ago

Apple mobile application processors: Apple T1

The Apple T1 chip is an ARMv7 SoC from Apple driving the Touch ID sensor of the 2016 MacBook Pro. The chip operates as a secure enclave for the processing and encryption of fingerprints, as well as acting as a gatekeeper to the microphone and iSight camera, protecting these possible targets from potential hacking attempts. The T1 runs its own version of watchOS, separate from the Intel CPU running macOS.

Apple S1

The Apple S1 is the integrated computer in the Apple Watch, described as a "System in Package" (SiP) by Apple Inc. Samsung is said to be the main supplier of key components, such as the RAM and NAND flash storage, and the assembly itself, but early teardowns reveal RAM and flash memory from Toshiba and Micron Technology.

[-] jmtd, 1 point, 4 years ago

Oh yeah, there's no doubt the ARM chip is fast enough to run Doom; I'm just fairly confident that they didn't do that. It would be much easier to hack an existing port to render on the touchbar via the proper API than to port the whole thing, and since this was thrown together for a YouTube video laugh and the source is not readily apparent, my best estimate is that's what they did. Although the Pentium debuted in '93, Doom was targeting its predecessor, one of the 486 variants. If you want to see an impressive, available embedded port of Doom, check out rockbox on an iPod or other supported PMP. I contribute to the chocolate doom source port in my spare time, and one of the things I've worked on is the raspberry pi (ARM) port.

[-] video_descriptionbot, 3 points, 4 years ago

Title: Doom on the MacBook Pro Touch Bar
Description: Doom runs on everything... but can it run on the new MacBook Pro Touch Bar? Let's find out!
Length: 0:00:58

[-] jmickeyd, 25 points, 4 years ago

I'm curious about vmsplice performance on Linux.
You could potentially have a single page of "y\n"s passed multiple times in the iov. That way you have fewer syscalls without using more RAM. Although at some point (possibly already), pv is going to be the bottleneck.

[-] Nekit1234007, 43 points, 4 years ago

Stole the code from /u/phedny and modified it a bit. Got some curious results. /u/kjensenxz can you test it on your machine?

    #define _GNU_SOURCE
    #define __need_IOV_MAX
    #include <limits.h>      /* IOV_MAX */
    #include <fcntl.h>       /* vmsplice */
    #include <string.h>
    #include <stdlib.h>
    #include <sys/uio.h>     /* struct iovec */

    #define LEN 2
    #define TOTAL (1*1024*1024)
    #define IOVECS IOV_MAX

    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int bufused = 0;
        int i;
        struct iovec iov[IOVECS];

        while (bufused < TOTAL) {
            memcpy(buf+bufused, yes, LEN);
            bufused += LEN;
        }

        for (i = 0; i < IOVECS; i++) {
            iov[i].iov_base = buf;
            iov[i].iov_len = TOTAL;
        }

        while(vmsplice(1, iov, IOVECS, 0));
        return 1;
    }

    $ gcc vmsplice-yes.c -o vmsplice-yes

    $ yes | pv >/dev/null
    ... 0:00:20 [5.26GiB/s] ...
    $ ./kjensenxz-yes4 | pv >/dev/null
    ... 0:00:20 [4.11GiB/s] ...

    #define TOTAL 4096
    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [4.36GiB/s] ...

    #define TOTAL 8192
    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [6.83GiB/s] ...

    #define TOTAL (1*1024*1024)
    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [9.33GiB/s] ...

[-] kjensenxz (OP), 38 points, 4 years ago

Amazing! Putting this in the OP.

    $ ./vmsplice-yes | pv >/dev/null    # 1024 * 1024
    ... [20.5GiB/s] ...

[-] _mrb, 75 points, 4 years ago

~~You can further optimize /u/Nekit1234007's code by having only 1 large element in the iovec "y\ny\ny\n..." (vs. many 2-byte "y\n" elements).~~

Edit: I misread the code; it already has large elements in the iovec. However, setting the pipe size to 1MB bumps the speed from 28 to 74 GB/s on my Skylake CPU (i5-6500). If I count things correctly (4 context switches for yes to write, pv to read, pv to write, and back to yes), assuming 100ns per switch, 100ns of instructions executing per context (300 instructions at IPC=1 and 3GHz), and 64kB per I/O op (the default pipe buffer size), then the theoretical max speed is ~80 GB/s (64kB moved per ~800ns of switching). Then tweak the pipe buffer size to 1MB (the maximum allowed) and the theoretical max should be ~1280 GB/s.

Edit 2: I reached 123 GB/s. It turns out that past ~50-70 GB/s, pv(1) itself is the bottleneck. It fetches only 128kB of data at a time via splice(), because it is too simplistic and uses a fixed buffer size that is 32 times the "block size" reported by stat() on the input; and on Linux, stat() on a pipe fd reports a block size of 4kB. So recompile pv, changing (in src/pv/loop.c):

    sz = sb.st_blksize * 32;

to this:

    sz = sb.st_blksize * 1024;

But pv also restricts the block size to 512kB no matter what. So edit src/include/pv-internal.h and replace:

    #define BUFFER_SIZE_MAX 524288

with:

    #define BUFFER_SIZE_MAX (4*1024*1024)

Then another bottleneck in pv is the fact that it calls select() once between each splice() call, which is unnecessary: if splice() indicates data was read/written successfully, then a process should just call splice() again and again.
So edit src/pv/transfer.c and fake a successful select() by replacing:

    n = select(max_fd + 1, &readfds, &writefds, NULL, &tv);

with simply:

    n = 1;

Then you will reach speeds of about 95 GB/s. Beyond that, the pipe buffer size needs to be increased further. I bumped it from the default 1MB to 16MB:

    $ sysctl fs.pipe-max-size=$((1024*1024*16))

And use this custom version of yes with a 16MB pipe buffer: https://pastebin.com/raw/qNBt8EJv

Finally, both "yes" and "pv" need to run on the same CPU core, because cache affinity starts playing a big role:

    $ taskset -c 0 ./yes | taskset -c 0 ~/pv-1.6.0/pv >/dev/null
    469GiB 0:00:02 [ 123GiB/s] [ <=>

But even at 123 GB/s the bottleneck is still pv, not yes. pv has a lot of code doing basic bookkeeping that just slows things down.

[-] Nekit1234007, 9 points, 4 years ago

I'll be damned. Added

    fcntl(1, F_SETPIPE_SZ, 1024*1024);

before the while. /cc /u/kjensenxz

    $ ./vmsplice-yes | pv >/dev/null
    ... 0:00:20 [21.1GiB/s] ...

[-] _mrb, 5 points, 4 years ago

So you got a 2x boost, nice :) I wonder what /u/kjensenxz's system would show.

Edit: now try the version with my custom pv(1) modifications as per Edit #2

[-] kjensenxz (OP), 10 points, 4 years ago

    fcntl(1, F_SETPIPE_SZ, 1024*1024);

    $ ./vmsplice-yes | pv > /dev/null
    ... [36.8GiB/s] ...

[-] tcolgate, 5 points, 4 years ago

Interestingly, the peak I get is when I match the IOVEC size and the PIPE_SZ to my L2 cache size (256KB per core). I get 73GiB/s then!

[-] tcolgate, 3 points, 4 years ago

Just for posterity: that was a coincidence due to the hard-coded buffer sizes in pv, I think. As _mrb points out, pv uses splice, so it only ever sees the count of bytes spliced; it doesn't need to read the data to determine the size.

[-] monocirulo, 2 points, 4 years ago

I got 60GiB/s with the line added. Can this be used for network sockets?

[-] EgoIncarnate, 3 points, 4 years ago

Read the code again: the iovecs are already 1MB each (iov_len = TOTAL) of 'y\n'.

[-] [deleted], 4 years ago

[deleted]

[-] Nekit1234007, 1 point, 4 years ago

I'm not sure I follow; that array is one long element, allocated through malloc and filled with memcpys.
[-] _dancor_, 2 points, 4 years ago

https://youtu.be/FeGq48uNrLc?t=209

[-] video_descriptionbot, 2 points, 4 years ago

Title: Sense8 by Netflix - Season 02 : What's Up (Remix by Riley)
Length: 0:03:58

[-] MCPtz, 1 point, 4 years ago

Out of curiosity, what are your cache sizes? E.g. from lscpu:

    L1d cache: 32K
    L1i cache: 32K
    L2 cache:  256K
    L3 cache:  6144K

[-] _mrb, 5 points, 4 years ago

Same as yours. But cache sizes don't matter much: the "y\n" data is initialized once by yes(1) and subsequently never accessed by either yes(1) or pv(1). That's the point of zero-copy I/O via splice(). Cache locality matters only to minimize time wasted during context switches between yes(1), pv(1) and the kernel (e.g. to update the process data structures).

    Model name: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
    L1d cache: 32K
    L1i cache: 32K
    L2 cache:  256K
    L3 cache:  6144K

[-] tcolgate, 9 points, 4 years ago

I thought maybe this was cheating because you weren't checking whether vmsplice was erroring. Turns out it's not erroring. pv > /dev/null < /dev/zero actually takes half a core on my machine just clearing RAM (according to perf top); your vmsplice yes takes very little CPU at all. I think you're basically measuring L1 cache bandwidth and context switches at that point. Pretty cool.

[-] Nekit1234007, 6 points, 4 years ago

That's true. When I tried to run just ./vmsplice-yes, nothing showed up on the screen, but the process used all of its CPU core; I was confused there for a sec. Replacing the existing while with

    while (vmsplice(1, iov, IOVECS, 0) > 0);

should fix the problem. But here lies the limitation of this approach: since a pty/regular file is not a pipe, nothing useful will happen and vmsplice will fail with EBADF.

[-] kjensenxz (OP), 8 points, 4 years ago

When I was writing the conclusion, I wondered how much pv was limiting.
I took a stab at it with dd, but it was an even worse bottleneck:

    $ ./yes | dd of=/dev/null bs=8192
    29703569408 bytes (30 GB, 28 GiB) copied, 5.34847 s, 5.6 GB/s

I've seen pv measure as high as 11.2 GiB/s, which really makes me wonder what percentage of the bottleneck each piece actually is, and if it weren't so late, I would definitely go poking around to check. I'll try to remember to do it tomorrow; of course, anyone and everyone else is invited to as well if they're interested!

[-] LukeShu, 3 points, 4 years ago

pv uses the splice() system call to do zero-copy reads/writes if possible. A yes|dd of=/dev/null pipeline goes like this (forgive my slight misapplication of big-O notation, and my pseudo-code intended to make explicit the kernel's internal vtables):

    yes: pipe.write(buf, len)      // O(len) ; copy data from userspace to kernelspace
    dd : pipe.read(buf, len)       // O(len) ; copy data from kernelspace to userspace
    dd : devnull.write(buf, len)   // O(0)   ; discard data

So the cost is O(2*len). But with pv's use of splice(), we can skip a step:

    yes: pipe.write(buf, len)         // O(len) ; copy data from userspace to kernelspace
    pv : splice(pipe, devnull, len)   // O(0)   ; discard data

So the cost is O(len); it makes sense that the throughput with dd would be about half of what pv gets.

[-] tiltowaitt, 20 points, 4 years ago

This is pretty interesting. Is there a real-world advantage on modern systems to such speed in GNU yes?

[-] kjensenxz (OP), 29 points, 4 years ago

I really can't think of any real advantage of yes being faster other than being able to say "look, mine's faster!", since the likelihood of needing 5 billion "y"s per second is almost 0. It might have one or two use cases in which its efficiency is actually useful, perhaps in embedded systems running several operations concurrently? A couple of people have mentioned dd and cat, which makes me wonder if the same thing could be done to either (or both) of them to speed them up as greatly, and I plan on taking a stab at them fairly soon if someone doesn't beat me to it.

[-] [deleted], 17 points, 4 years ago

dd is somewhat bound by POSIX saying the default block size needs to be 512 bytes. You can use another, but many people don't know about it.

[-] kjensenxz (OP), 10 points, 4 years ago

Good to know; I would have gone hacking at the source and might have accidentally PR'd something non-compliant. It'd make a good exercise for a custom (read: nonstandard) system, though.

[-] FUZxxl, 2 points, 4 years ago

Most people don't need dd and should use cat instead.
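To make LukeShu's pseudo-code concrete: a minimal pv-like pass-through built on splice(2), which moves pipe data to another fd without copying it through userspace. A sketch assuming Linux and glibc's splice() wrapper; pv's real loop does considerably more bookkeeping:

    /* splice-cat.c - zero-copy pass-through from stdin (a pipe) to stdout */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        for (;;) {
            /* move up to 64kB per call entirely inside the kernel */
            ssize_t n = splice(0, NULL, 1, NULL, 65536, SPLICE_F_MOVE);
            if (n <= 0)
                return n == 0 ? 0 : 1;   /* 0 = writer closed the pipe */
        }
    }

    $ gcc -O2 splice-cat.c -o splice-cat
    $ yes | ./splice-cat > /dev/null

splice(2) requires that at least one side of the transfer be a pipe, which is exactly the yes | ... situation here.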
[-] UnchainedMundane, 2 points, 4 years ago

I would like to think that having every utility on the system be really fast would add up to a generally faster system overall (especially when there are lots of shell scripts).

[-] -fno-stack-protector, 2 points, 4 years ago

plus it's just cool to have really fast things

[-] phedny, 13 points, 4 years ago

I've been able to increase speed using scatter/gather I/O with this implementation. Would love to see how it measures up on the machine you used for the other measurements:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    #define LEN 2
    #define TOTAL 8192
    #define IOVECS 256

    int main() {
        char yes[LEN] = {'y', '\n'};
        char *buf = malloc(TOTAL);
        int bufused = 0;
        int i;
        struct iovec iov[IOVECS];

        while (bufused < TOTAL) {
            memcpy(buf+bufused, yes, LEN);
            bufused += LEN;
        }

        for (i = 0; i < IOVECS; i++) {
            iov[i].iov_base = buf;
            iov[i].iov_len = TOTAL;
        }

        while(writev(1, iov, IOVECS));
        return 1;
    }

[-] kjensenxz (OP), 8 points, 4 years ago

What's your speed on both GNU yes and your revision? On the OP build machine:

    $ gcc yes.c
    $ ./yes | pv > /dev/null
    ... [9.05GiB/s] ...

[-] phedny, 2 points, 4 years ago

I did this on a VPS, so the numbers are not very stable, but around 1GB/s on iteration 4 and around 1.7GB/s on the iovec version. There might be another bottleneck at play here.

[-] kjensenxz (OP), 3 points, 4 years ago

> I did this on a VPS

Interesting, I just tried this on my VPS:

    $ ./yes | pv > /dev/null    # iteration 4
    ... [ 488MiB/s] ...
    $ ./iovecyes | pv > /dev/null
    ... [ 964MiB/s] ...

Very strange, so I decided to test it in a virtual machine (NetBSD):

    $ ./yes | pv > /dev/null
    ... [ 801MiB/s] ...
    $ ./iovecyes | pv >/dev/null
    ... [ 990MiB/s] ...

Both of these fluctuated from about 450 to 993. With that much fluctuation between and during runs, I don't know that any results under a hypervisor can be considered conclusive.

[-] emn13, 12 points, 4 years ago

You state the memory bandwidth is 12.8GB/s, but that's per channel, and my guess is that you're running a dual-channel setup (most people are), so 10.2GiB/s is a little less than half the theoretical throughput. Also, note that because you're writing to /dev/null, it's conceivable no reads ever occur, even at a low level, so full-throughput sequential writes really are achievable. Oh, and additionally, it's not trivially obvious (to the non-OS geek me, anyhow) why this benchmark even needs to hit RAM. Is there some cross-process TLB flush going on? After all, you may be writing a lot of memory, but you're doing so in small, very cacheable chunks, and you're discarding those immediately, so why can't this all stay within some level of cache?
[-] kjensenxz (OP), 8 points, 4 years ago

> You state the memory bandwidth is 12.8GB/s, but that's per channel, and my guess is that you're running a dual-channel setup (most people are), so 10.2GiB/s is a little less than half the theoretical throughput.

You're right, I am on a dual-channel setup, but as far as I know (not much about RAM), it would only be hitting a single channel.

> Also, note that because you're writing to /dev/null, it's conceivable no reads ever occur [...] so why can't this all stay within some level of cache?

As far as I know, the series of "y\n" is in the cache; there's plenty of room in L1 and L2. But since the output of yes is being redirected through a pipe, it does need to be read by the program on the other end (pv), which normally would throw it up on standard out but here discards it to /dev/null. To communicate through a pipe, the standard output of one program has to be buffered into memory that the end program can read, which is achieved through the kernel (pipe is a syscall). Might the halving of the memory speed be from the simultaneous reads/writes? If I implemented a timer and counter in the same program, it would probably never need to leave cache, and would instead show how quickly write() could be called on /dev/null opened as a file descriptor (might make an interesting memory/cache speed benchmark program).

[-] emn13, 3 points, 4 years ago

You'd want to test this without pv. That should be easy enough to do, since you have a working program with the same performance: simply write some fixed amount to the pipe rather than while(true), then time how long that takes. Alternatively, integrate the timing into the program itself, and have it compute and print (to stderr) the timings every (say) 5s (tiresome) or 50GB (a little easier).

[-] mccoyn, 1 point, 4 years ago

> To communicate through a pipe, the standard output of one program has to be buffered into memory that the end program can read

I wonder if you have thread-switching issues. Each program is trying to run at the same time and access the same buffer, so there will be lots of synchronization preventing this from staying in L1 and L2.

[-] TotesMessenger, 17 points, 4 years ago

I'm a bot, bleep, bloop. Someone has linked to this thread from another place on reddit:

* [/r/hackernews] How is GNU `yes` so fast?
* [/r/perfeng] How is GNU `yes` so fast? [x-post /r/unix]
* [/r/programming] How is GNU's `yes` so fast? [X-Post r/Unix]
* [/r/spacexmasterrace] Advanced yes optimizations
* [/r/yrc] How is GNU `yes` so fast? : unix

If you follow any of the above links, please respect the rules of reddit and don't vote in the other threads.
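Following up emn13's suggestion to integrate the timing into the program itself: a self-timing variant of iteration 4 that writes straight to /dev/null, taking both pv and the pipe out of the picture. A sketch; the 50GiB stopping point and CLOCK_MONOTONIC are my choices, not anything benchmarked in the thread:

    /* yes-selftimed.c - iteration 4 writing to /dev/null and timing itself */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define TOTAL 8192

    int main(void) {
        char buf[TOTAL];
        int fd = open("/dev/null", O_WRONLY);
        unsigned long long bytes = 0;
        struct timespec t0, t1;

        for (int i = 0; i < TOTAL; i += 2)
            memcpy(buf + i, "y\n", 2);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (bytes < (50ULL << 30)) {          /* stop after 50 GiB */
            if (write(fd, buf, TOTAL) != TOTAL)
                return 1;
            bytes += TOTAL;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        fprintf(stderr, "%.2f GiB/s\n", bytes / secs / (1 << 30));
        return 0;
    }

If kjensenxz's hypothesis above is right, this should run far faster than the piped benchmarks, since the buffer never has to leave cache.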
[-] deathanatos, 9 points, 4 years ago

> yes on my system was built with CFLAGS="-O2 -pipe -march=native -mtune=native"

I sense a (fellow) Gentoo user.

[-] CowboyBoats, 8 points, 4 years ago

    $ yes
    y
    y
    y
    ...

What is this program for?

[-] [deleted], 11 points, 4 years ago

For when you don't want to type yes on terminal prompts and just want to wave the script through

[-] ggtsu_00, 11 points, 4 years ago

Responding to your wife.

[-] slammacows, 6 points, 4 years ago

yes

[-] davesidious, 5 points, 4 years ago

yes

[-] ulisses_guimaraes, 2 points, 4 years ago

yes

[-] kozzi11, 7 points, 4 years ago

This is my D version:

    void main() {
        import std.range : array, cycle, take, only;
        import std.stdio : stdout;
        import std.algorithm : copy;

        "y\n".cycle.take(8192).array.only.cycle.copy(stdout.lockingBinaryWriter);
    }

GNU yes: 2.52GiB/s
D yes: 3.14GiB/s

[-] kozzi11, 1 point, 4 years ago

And here is a version with a while loop:

    void main() {
        import std.range : array, cycle, take;
        import std.stdio : stdout;

        auto buf = "y\n".cycle.take(8192).array;
        while(true)
            stdout.rawWrite(buf);
    }

[-] kjensenxz (OP), 1 point, 4 years ago

I couldn't get a D compiler working on Gentoo, so here it is on Arch on my laptop:

    $ yes | pv > /dev/null
    ... [5.57GiB/s] ...
    $ ldc2 yes1.d
    $ ./yes1 | pv > /dev/null
    ... [5.52GiB/s] ...
    $ ldc2 yes2.d
    $ ./yes2 | pv >/dev/null
    ... [5.42GiB/s] ...

[-] [deleted], 2 points, 4 years ago

I managed to get gdc working on Gentoo using the dlang overlay, but it looks like the standard library is old enough that it doesn't have stdout.lockingBinaryWriter.

[-] kozzi11, 2 points, 4 years ago

So try the other version, without stdout.lockingBinaryWriter. It should compile.

[-] [deleted], 2 points, 4 years ago

It does. Here's how your while-loop version compares on my machine:

    # yes | pv > /dev/null
    ... 7.07GiB/s
    # ./yes | pv > /dev/null
    ... 8.56GiB/s

[-] SixLegsGood, 7 points, 4 years ago

1. What happened to the caches? Shouldn't this tiny program and the tiny amount of the OS being exercised fit within the L2 cache? Why then should it be limited to main memory speed?

2. Is 'pv' a bottleneck? I see a comment below that you tried sending the output through dd to /dev/null.
Perhaps try running something like pv < /dev/zero (although I wouldn't be surprised to find that /dev/zero is slower than yes...)

[-]kjensenxz[S] 10 points 4 years ago (8 children)

> What happened to the caches? Shouldn't this tiny program and the tiny amount of the OS being exercised fit within the L2 cache? Why then should it be limited to main-memory speed?

This is a great question; in fact, it should fit in L1 on my processor (32K data, 32K instructions). I would assume it's stuck at memory speed because there is a pipe involved, and now that you mention it, the best way to measure this would probably be an internal timer and counter.

> Is pv a bottleneck? I see a comment below that you tried sending the output through dd to /dev/null. Perhaps try running something like: pv < /dev/zero

    $ pv < /dev/zero
    ... [4.79MiB/s] ...
    $ pv > /dev/null < /dev/zero
    ... [20.6GiB/s] ...

Honestly, at this point it's very difficult to say whether pv is the bottleneck. Several people have mentioned it, and I've thought about it, and I think the real bottleneck has to be the pipe, because it has to go through memory to move data from one end to the other.

[-]SixLegsGood 7 points 4 years ago (3 children)

Wow, thanks for the quick reply and benchmark!

IIRC, back in the day, IRIX supported a crude form of zero-copy I/O where, if you were reading or writing page-sized chunks of memory that were properly aligned, it would use page-table trickery to share the data between processes (or between OS drivers and processes), so that the reads and writes really did do nothing. In practice, the optimisation never seemed to be too useful; there were always too many constraints that made the "zero-copy" cost more than a simple data transfer (the sending process or driver had to not touch the memory again, the receiver couldn't alter the data in the pages, the trick added extra locking, and on many systems, updating the page tables was slower than just copying the 4 KB chunks of memory). But for this particular benchmark, I suspect it could hit a crazy theoretical "transfer" speed...

[-]kjensenxz[S] 5 points 4 years ago (2 children)

> IRIX used to support a crude form of zero-copy I/O where, if you were reading / writing page-sized chunks of memory that were properly aligned, it would use page table trickery to share the data between processes (or between OS drivers and processes), so that the reads and writes really did do nothing.

You know, I have a spare computer, and IRIX is available on a torrent site. This makes me wonder if I could (or should) try to install it and benchmark this application on bare metal (hypervisors seem to completely ruin benchmarking).

[-]SixLegsGood 5 points 4 years ago (0 children)

You'd definitely need to run it on bare metal to test this optimisation; virtualisation would be emulating all of the page-table machinery. I think it also only worked on specific SGI hardware (or maybe it was specific to the MIPS architecture?), and there were other restrictions: the read()s and write()s had to be 4 KB (I think) chunks, 4 KB aligned, possibly with a spare 4 KB page on either side of the buffers too. It may also have been restricted to driver-to-application transfers; the use case I encountered was a web server writing static files out to the network as fast as possible.

[-][deleted] 1 point 4 years ago (0 children)

If you post code, I'd be happy to compile and test it on my Octane.
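On Linux, the closest present-day analogue to that page-table trickery is vmsplice(2), which maps user pages into a pipe instead of copying them; it is the mechanism behind the vmsplice-based numbers that appear further down the thread. A minimal sketch, assuming Linux with _GNU_SOURCE; the buffer size is an arbitrary choice, and this is not the tuned implementation linked later:

    /* vmsplice-sketch.c - splice the same "y\n" pages into the pipe
       repeatedly instead of copying them on every write() */
    #define _GNU_SOURCE
    #include <fcntl.h>    /* vmsplice */
    #include <stdlib.h>   /* aligned_alloc */
    #include <sys/uio.h>  /* struct iovec */

    int main(void) {
        size_t len = 65536;
        char *buf = aligned_alloc(4096, len);  /* page-aligned, C11 */
        if (!buf)
            return 1;
        for (size_t i = 0; i < len; i += 2) {
            buf[i] = 'y';
            buf[i + 1] = '\n';
        }
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        /* Fails with EINVAL unless stdout actually is a pipe. */
        while (vmsplice(1, &iov, 1, 0) > 0)
            ;
        return 0;
    }

The same caveat SixLegsGood mentions applies here: the pages are shared with the pipe reader after the call, so the writer must not modify them while they are still in flight. A buffer that never changes, like a run of "y\n", sidesteps the problem entirely.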
[-]fragmede 2 points 4 years ago (1 child)

I'm on a much different system - an ARM Chromebook - but I get slightly better performance using

    pv <(yes) > /dev/null

compared to

    yes | pv > /dev/null

Do you?

[-]kjensenxz[S] 2 points 4 years ago (0 children)

It's identical here. Run the tests several times; they may be within each other's margin of error.

[-]Pastrami 2 points 4 years ago (1 child)

What kind of hardware and distro are you running this on? I get wildly different numbers on my crappy work PC with Mint 17:

    $ yes | pv > /dev/null
    [ 162MB/s]
    $ pv < /dev/zero
    [76.5MB/s]
    $ pv > /dev/null < /dev/zero
    [18.5GB/s]

yes | pv > /dev/null only gives me 162 MB/s, which is 63 times slower than yours, while my pv < /dev/zero is roughly 16 times faster. I've got an i7-6700 CPU @ 3.40GHz with 8 GB of DDR4.

[-]kjensenxz[S] 1 point 4 years ago (0 children)

i7-4790, 8 GB DDR3 @ 1600 MHz

[-][deleted] 4 years ago* (5 children)

[deleted]

[-]kjensenxz[S] 4 points 4 years ago (4 children)

This feels like a troll post, but I'll do it anyway.

    $ ./yes > out &
    $ tail -f out | pv > /dev/null
    ... [ 188MiB/s] ...

Calculating it by hand with watch -n 0.5 ls -lh out gives about the same result.

[-][deleted] 4 years ago* (2 children)

[deleted]

[-]kjensenxz[S] 3 points 4 years ago (1 child)

Here are dd and pv for a baseline and comparison:

    $ dd bs=1024K count=1024 if=/dev/zero of=/tmp/zerotest
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.312857 s, 3.4 GB/s
    $ pv < /dev/zero > /tmp/zerotest
    ... [ 189MiB/s] ...
    $ time ./yes > /tmp/yesout   # 5th revision
    real 0m8.554s
    $ du -b /tmp/yesout          # bytes
    2973360128 /tmp/yesout

2973360128 bytes / 8.554 s ≈ 347,598,799 bytes/s ≈ 0.324 GiB/s. Redoing this experiment over a longer run actually results in a lower speed.

load more comments (1 reply)

[-]yomimashita 3 points 4 years ago (0 children)

How about just yes > /dev/null, with the counter built into yes?

[-]timvisee 7 points 4 years ago (0 children)

The fact that they took the time to optimize such a little program as this, with some great tricks, amazes me!

[-]K3wp 6 points 4 years ago (0 children)

re: this point:

> Buffer your I/O for faster throughput

I do HPC Linux deployments, and my #1 trick that nobody seems to know about is this command: https://linux.die.net/man/1/buffer

> buffer - very fast reblocking program

Using it in a pipeline can produce some pretty significant speedups, particularly when sending something over the network.
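For illustration, a hypothetical pipeline of the kind K3wp describes; per the linked man page, -s sets the block size and -m the total buffer memory, but the sizes and host name here are invented:

    $ tar cf - /var/data | buffer -s 64k -m 16m | ssh backuphost 'cat > /backup/data.tar'

The reblocking decouples the two ends of the pipe, so a bursty producer and a high-latency consumer no longer stall each other on every small read and write.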
[-]crowdedconfirm 4 points 4 years ago (6 children)

Interestingly, yes on my MacBook Air seems to be much slower than the statistics you posted, although for most practical purposes I don't see it making much of a difference.

    1.66GiB 0:01:01 [28.9MiB/s] [ <=> ]

[-]kjensenxz[S] 8 points 4 years ago (5 children)

From everything I've read in the comments here and on Hacker News, it comes down to a few things:

* OS X's small buffer size (reported to be 1024 bytes, smaller than a page)
* the MacBook's slower processor and possibly different RAM timing (my proposal, refuted several times; the 2017 Air has 1600 MHz RAM just like the build machine)
* OS X's traditional-Unix yes implementation

I'd love to see how it benches against GNU's yes; this comment claims 7.2 GiB/s on Linux on an Air.

[-]crowdedconfirm 5 points 4 years ago (4 children)

The MacBook Air hasn't had a release since 2015 - perhaps they meant a MacBook Pro? (My bench was on a brand-new MacBook Air, Mid-2015, bought in April 2017.) The 28.9 MiB/s was using macOS's built-in yes, by the way. I had Chrome open with some tabs at the time, but I would sure hope that wouldn't affect my RAM bandwidth that much.

[-]kjensenxz[S] 5 points 4 years ago (3 children)

> The MacBook Air hasn't had a release since 2015, perhaps they mean a MacBook Pro?

My bad, I was just blindly reading off https://www.apple.com/macbook-air/specs/ (it is 15 til 6, after all). I couldn't vouch for how much Chrome would take up, though anecdotally it's known as a memory hog.

[-]crowdedconfirm 6 points 4 years ago (2 children)

Apple doesn't really make the release dates of their products clear on their pages, which sucks. It's 3:53 (AM) here... :P

Running cat /dev/zero | pv > /dev/zero with Chrome (6 tabs), Discord, and iTunes open - worse conditions than the original test - gives me quite a bit more bandwidth:

    68.1GiB 0:00:36 [1.93GiB/s] [ <=> ]

The score for yes | pv > /dev/zero is still pretty abysmal, though:

    277MiB 0:00:10 [27.5MiB/s] [ <=> ]

[-]kjensenxz[S] 4 points 4 years ago (1 child)

I doubt it would help much, but do you get a performance boost from redirecting to /dev/null? Also, what's your speed with my fourth-iteration yes?

[-]crowdedconfirm 4 points 4 years ago (0 children)

Wait, why am I redirecting to /dev/zero? That makes no sense now that I put some thought into it...
:P

    Mabel:$ cat /dev/zero | pv > /dev/null
    13.2GiB 0:00:07 [1.87GiB/s] [ <=> ]

The other test:

    Mabel:$ yes | pv > /dev/null
    285MiB 0:00:10 [28.5MiB/s] [ <=> ]

[-]minimim 3 points 4 years ago* (3 children)

Here's my Perl 6 version:

    $ perl6 -e 'my \buf = Blob.new: |"y\n".NFC xx 8192; loop {$*OUT.write: buf}' | pv > /dev/null
    [5.95GiB/s]

[-]aaronsherman 2 points 4 years ago (2 children)

A slightly simpler version that caches the value of $*OUT, which is a dynamic variable (per comments in /r/perl6):

    $ perl6 -e 'my $out = $*OUT; my $m = ("y\n" x (1024*8)).encode("ascii"); loop { $out.write: $m }' | pv > /dev/null

On my box, this is substantially faster than GNU coreutils yes, which is a little shocking.

Edit: Note that the Perl 6 version seems to be roughly on par with:

    $ dd if=/dev/zero bs=8k | pv > /dev/null

load more comments (2 replies)

[-]11chase 4 points 4 years ago (0 children)

Small nitpick - I used to benchmark memory controllers for a certain CPU manufacturer, and this explanation is not the whole truth:

> With DDR3-1600, it should be 11.97 GiB/s (12.8 GB/s), where is the missing 1.5? [...] all the overhead incurred by the kernel throttles our memory access, pipes, pv, and redirection is enough to negate 1.5 GiB/s

While it's certainly true that task switching and system calls contribute overhead, the real problem is the DDR protocol itself. It's not possible to transfer data to or from DDR on every single cycle, because some cycles must be used to transmit address information. This is minimized when transferring contiguous blocks of memory, but that's not necessarily what yes is doing. Even within a single page of memory, which is presumably contiguous in the physical address space, the actual module/rank/row/column indices may be hashed or scrambled by the DDR controller in order to stripe the memory space across different memory modules, different ranks, or different NUMA nodes in a multiprocessor system, or as a security measure to mitigate cold-boot or rowhammer attacks.

Additionally, DDR modules need to periodically refresh the charge stored in the capacitors that implement the data storage. The DDR controller on the CPU does this transparently, but it consumes bus cycles and creates periods during which the module cannot service read/write requests.

Lastly, DDR is a half-duplex protocol, i.e. you can read or write but not both at the same time. Switching the bus between read and write mode, which is necessary when copying memory, consumes bus cycles as well.

tl;dr: even with perfectly written ring-0 software, it is impossible to reach the theoretical bandwidth of DDR systems, and it's not uncommon for DDR controllers to cap out 10-15% below the theoretical limit.
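To put rough numbers on that derating (a back-of-the-envelope calculation, assuming a single 64-bit DDR3-1600 channel):

    1600 MT/s x 8 bytes        = 12.8 GB/s theoretical
    12.8e9 B/s / 2^30          = ~11.92 GiB/s
    11.92 GiB/s x 0.85 to 0.90 = ~10.1 to 10.7 GiB/s

So a 10-15% protocol overhead alone brackets the observed 10.2 GiB/s, without having to charge the whole gap to the kernel, the pipe, or pv.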
[-]kjensenxz[S] 3 points 4 years ago* (0 children)

Here's a submission from prussian:

    #!/bin/sh
    BUFSIZ=32768
    SIZE=2
    ARGS=$(( BUFSIZ / SIZE ))
    IFS='
    '
    if [ -z "$1" ]; then
        set y ''
    else
        x="$1"
        SIZE=$(( ${#x} + 1 ))
        ARGS=$(( BUFSIZ / SIZE ))
        shift
        set "$x" ''
    fi
    while [ $# -le "$ARGS" ]; do
        set "$@$@"
    done
    while printf %s "$*"; do
        :
    done

    $ ./yes.sh | pv > /dev/null
    ... [9.68MiB/s] ...

Another:

    #!/usr/bin/env node
    var BUFSIZ = process.stdout._writableState.highWaterMark * 4
    var str = process.argv[2] || 'y',
        len = str.length + 1,
        buffer = Buffer.from(Array.from({ length: BUFSIZ / len }, () => str).join('\n'))
    function yes() {
        process.stdout.write(buffer, yes)
    }
    yes()

    $ node yes.js | pv > /dev/null
    ... [9.17GiB/s] ...

[-]Auxx 3 points 4 years ago (4 children)

GNU yes is a good example of a wrong optimisation. Say you have a utility X that awaits a "yes" input (y). It only needs two characters ("y\n"). Yet GNU yes (in yes | X) will flood it with 8 KB of data. The OS will discard everything except the first two bytes, but it is still a performance issue. And yes runs until killed, consuming 100% of a CPU. It is the worst utility ever created, and GNU made it even worse. If the GNU folks really worried about performance, they should have removed this abomination long ago.

load more comments (4 replies)

[-][deleted] 3 points 4 years ago (0 children)

Here's my version in Go. It even seems to be a bit faster than the GNU version:

    package main

    import "os"

    func main() {
        var txt []byte
        if len(os.Args) > 1 {
            txt = []byte(os.Args[1] + "\n")
        } else {
            txt = []byte("y\n")
        }
        bufLen := 8 * 1024
        buf := make([]byte, bufLen)
        used := 0
        for used < bufLen && len(txt) <= bufLen-used {
            copy(buf[used:], txt)
            used += len(txt)
        }
        for {
            os.Stdout.Write(buf)
        }
    }

The tests consistently show the same results:

    $ yesgo | pv > /dev/null
    ... [5.66GiB/s] ...
    $ yes | pv > /dev/null
    ... [5.27GiB/s] ...

[-][deleted] 2 points 4 years ago (0 children)

Uh oh - look at Kiki!

[-][deleted] 4 years ago (10 children)

[deleted]

[-][deleted] 3 points 4 years ago (9 children)

https://unix.stackexchange.com/questions/102484/what-is-the-point-of-the-yes-command

When updating ports on a FreeBSD workstation, portmaster plus yes becomes very handy:

    yes | portmaster -da

That way you can let the machine update while you have lunch, and all the questions default to "y"/"yes". When rebuilding the world [1], it's a big time saver for make delete-old and make delete-old-libs:

    yes | make delete-old
    yes | make delete-old-libs

Basically, it helps you avoid typing to confirm operations that ask for a "y" or "yes".

[1]: http://www.freebsd.org/doc/handbook/makeworld.html

[-]crackanape 4 points 4 years ago (8 children)

That doesn't explain why it needs to be so fast. A few microseconds' delay in moving on to the next step of updating ports is hardly going to be the thing that ruins your lunch.

[-]apotheon 4 points 4 years ago (7 children)

The following is my response to the top-level, now-deleted comment (I wish it hadn't been deleted, especially while I was writing this response):

---------------------------------------------------------------------

Gawwad gives a good answer to the first question (what is it), but the tl;dr version is: "It automates answering 'yes' to confirmation requests from other software."

The answer to your second question ("Why does it need to be this fast?") is "It doesn't."
Seriously, this was an interesting exercise in understanding why something is fast, but the yes command is absolutely not an important place to do this kind of optimization. It makes the code harder to read, and harder to understand, for a very simple tool.

> Premature optimization is the root of all evil. - Donald Knuth

[-]greyfade 4 points 4 years ago (6 children)

I don't like it when people just pull out the premature-optimization quote and leave off the context:

> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

It says what you did, and does so in a clear and concise way, and includes the bits that overzealous people forget.

[-]apotheon 3 points 4 years ago (5 children)

The context is unnecessary in this case, because GNU yes is very damned far from that 3%.

edit:

> these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.

compare with:

> the "yes" command is absolutely not an important place to do this kind of optimization. It makes the code harder to read, and harder to understand, for a very simple tool.

I basically paraphrased him by independently formulating an essential principle of good design.

load more comments (5 replies)

[-]hegbork 2 points 4 years ago (1 child)

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
    #if 0   /* testing version */
        int bufsz = atoi(argv[1]);
        int iovcnt = atoi(argv[2]);
        assert((bufsz & 1) == 0);
    #else
        int bufsz = 8192;
        int iovcnt = 64;
    #endif
        struct iovec iov[iovcnt];
    #if 1
        char buf[bufsz];
    #else
        char *buf;
        if (posix_memalign((void **)&buf, getpagesize(), bufsz))
            exit(1);
    #endif
        int i;

        for (i = 0; i < bufsz; i += 2) {
            buf[i + 0] = 'y';
            buf[i + 1] = '\n';
        }
        for (i = 0; i < iovcnt; i++) {
            iov[i].iov_base = buf;
            iov[i].iov_len = bufsz;
        }
        while (writev(1, iov, iovcnt) == bufsz * iovcnt)
            ;
        return 0;
    }

Performs almost twice as fast as iteration 4 on one OS X and one Linux machine. The 8192/64 numbers were empirically tested to behave best on both. This is weird, because on the systems I know (the BSDs) there is magical code that kicks in on pipe writes bigger than 8192 which enlarges the pipe buffer, and last time I looked OS X used the same pipe code. The posix_memalign allocation was there to see if some zero-copy mechanism kicks in, but it doesn't on the systems where I tried this, so it's disabled.

Writing this in other languages, in assembler, optimizing the initialization, etc. is pretty pointless, because the cost should all be in the overhead between the system call and the point where the kernel copies the data from userland to a pipe buffer - something you can only control by reducing the number of system calls. So theoretically the best we can do is to increase the number of iovecs passed to writev, but that doesn't seem to make much (if any) difference, so 8192/64 stays as good enough.
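hegbork's listing already contains the knobs for the sweep he mentions: with the "#if 0" testing block flipped to "#if 1", bufsz and iovcnt come from the command line. A hypothetical run (the file name is invented; pv's -a flag prints the average rate, and coreutils timeout bounds each run):

    $ cc -O2 yes-writev.c -o yes-writev
    $ for n in 8 16 32 64 128; do
          timeout 5 ./yes-writev 8192 $n | pv -a > /dev/null
      done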
[-]kjensenxz[S] 2 points 4 years ago (0 children)

    $ ./yes | pv > /dev/null
    ... [9.31GiB/s] ...

[-]_mrb 2 points 4 years ago* (0 children)

This tuning of yes, pv, and the pipe buffer size does 123 GB/s: https://www.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes_so_fast/diua761/

[-]hikilaka 2 points 4 years ago (0 children)

mutha fuckin joe dirt......... smh

[-]johnklos 2 points 4 years ago (0 children)

Interesting how much of a difference there is between unoptimized and optimized on a standard Ubuntu system:

    $ yes | pv > /dev/null
    101GiB 0:00:18 [6.04GiB/s] [ <=> ]
    $ ./vmsplice-yes | pv > /dev/null
    41.2TiB 0:01:35 [ 444GiB/s] [ <=> ]

    $ lscpu
    Architecture:          ppc64le
    Byte Order:            Little Endian
    CPU(s):                160
    On-line CPU(s) list:   0-159
    Thread(s) per core:    8
    Core(s) per socket:    10
    Socket(s):             2
    NUMA node(s):          2
    Model:                 2.0 (pvr 004d 0200)
    Model name:            POWER8 (raw), altivec supported
    CPU max MHz:           3491.0000
    CPU min MHz:           2061.0000
    L1d cache:             64K
    L1i cache:             32K
    L2 cache:              512K
    L3 cache:              8192K
    NUMA node0 CPU(s):     0-79
    NUMA node8 CPU(s):     80-159

[-]pinbender 2 points 4 years ago (0 children)

FYI, this is an example of how buffer overflows happen. If LEN changes to a value that doesn't evenly divide the buffer size, the copy will overflow. That's not the case in the code as listed, but code changes, people copy working code, etc. The while loop should account for the size of what it's writing, to make sure it will fit:

    while ((bufused + LEN) <= TOTAL) {
        memcpy(buf+bufused, yes, LEN);
        bufused += LEN;
    }

[-]-maandree- 2 points 4 years ago (0 children)

https://github.com/maandree/yes-silly

[-][deleted] 3 points 4 years ago (1 child)

You give GNU credit for being thorough enough to produce such a good implementation, but to me this whole thing is kind of sad: if you just use the stdlib naively and don't invest a bunch of time optimizing even something as simple as this, you end up with a very sub-optimal implementation.

[-]mcjiggerlog 1 point 4 years ago (0 children)

Hackernews discussion

[-]DorffMeister 1 point 4 years ago (0 children)

Fun read.

[-]iheartrms 1 point 4 years ago (0 children)

Asking the real questions...

load more comments (16 replies)