[HN Gopher] The Fastest FizzBuzz Implementation
___________________________________________________________________
The Fastest FizzBuzz Implementation
Author : Croftengea
Score : 248 points
Date : 2021-12-02 08:04 UTC (14 hours ago)
(HTM) web link (tech.marksblogg.com)
(TXT) w3m dump (tech.marksblogg.com)
| DeathArrow wrote:
| It seems there's a huge gap between what machine code compilers
| can generate and hand written optimized assembler. There's a lot
| of room left for compilers to be improved.
| bawolff wrote:
| Most programs aren't fizzbuzz. This is basically hand
| optimizing a microbenchmark. While I'm sure compilers can
| always be better, I'm not sure this particular example
| generalizes.
| londons_explore wrote:
| But hand optimizing a few microbenchmarks like this tells us
| how far we might be able to improve compilers.
|
| And from the looks of it, at least 100x is possible. It's a
| shame therefore that even a 1% performance increase in a
| compiler release is considered pretty noteworthy.
| anthony_r wrote:
| This is a very specific problem that lends itself to some
| fantastically cpu-friendly optimizations. Doing a single
| hash-map lookup here on the hot path would kill
| performance, and let's not even get started on any serious
| I/O (disk or network) as that would be multiple orders of
| magnitude slower. Not many problems are just "cat something
| extremely predictable into /dev/null".
| londons_explore wrote:
| But nearly all problems boil down to nearly zero actual
| work...
|
| Take booting up a computer for example. All that
| _actually_ needs to be done is to display the login
| screen on the screen. And we know from 60 fps video
| playback, that the hardware is capable of displaying a
| frame inside 16 milliseconds.
| anthony_r wrote:
| You can create an operating system that just displays a
| static image within milliseconds yourself (shouldn't take
| you more than a few weeks), it would work even when
| compiled at -O0.
| samhw wrote:
| And so what? That's what this code does, and so a
| compiler should be able to generate the optimal assembly.
| Why would a compiler need to consider something that the
| code in question is _not doing_?
| bawolff wrote:
| How would the compiler know that? This code for example
| is only optimal if you are piping it somewhere instead of
| displaying on the terminal. There is no way, even in
| principle, for the compiler to know that this program is
| always run with stdout connected to a pipe.
|
| Moreover, if you mean optimal in the absolute sense,
| pretty sure that is equivalent to solving the halting
| problem.
|
| If what you're asking is: why can't computers take all
| surrounding fuzzy context into account and do precisely
| the right thing, the answer is: because we haven't
| invented strong AI yet.
| samhw wrote:
| Ah, sorry, I misunderstood you. I thought you were
| placing the emphasis on what the code was doing, rather
| than where it was being piped. But why do you think that
| most of the possible optimisations would only be relevant
| if the output were being piped to somewhere other than a
| terminal?
|
| Also, why would making it absolutely optimal be
| "equivalent to solving the halting problem"? (Though
| either way, no, I'm not talking about producing 100.0000%
| optimal code - clearly there's plenty of daylight between
| the current degree of optimality and that one.)
| bawolff wrote:
| > Ah, sorry, I misunderstood you. I thought you were
| placing the emphasis on what the code was doing, rather
| than where it was being piped. But why do you think that
| most of the possible optimisations would only be relevant
| if the output were being piped to somewhere other than a
| terminal?
|
| One of the primary optimizations is using vmsplice to
| avoid memory copies. vmsplice only works if the file
| descriptor is a pipe.
|
| > Also, why would making it absolutely optimal be
| "equivalent to solving the halting problem"?
|
| https://en.m.wikipedia.org/wiki/Rice%27s_theorem
|
| Or more specifically, if a program always halts (without
| outputting things) then the best optimization would be to
| have it halt immediately, but to know that you have to
| have solved the halting problem.
|
| You're right of course, absolutes are a bit of a
| distraction we just need something good for 99% of
| programs. However that's still really hard, and an
| optimization applied in the wrong context will slow down
| a program. There's still lots of room for compilers to
| grow, but the types of optimizations in use in this
| example are at best probably NP-complete to determine if
| they apply (I don't have any evidence for that though).
| anthony_r wrote:
| Yeah you could optimize GCC to work really well for those
| problems, there's a great ROI for the "write something
| trivial really quickly into /dev/null" class of problems.
|
| Let us know how this goes if you embark on this
| adventure.
| samhw wrote:
| If you genuinely understood me as saying that, then I
| should clarify, to be absolutely foolproof:
|
| That is not what I'm saying. I'm talking about having a
| compiler which could produce more optimal code in any
| given case, or most given cases. I obviously am not
| suggesting that compilers optimise this and only this
| case.
| anthony_r wrote:
| > I'm talking about having a compiler which could produce
| more optimal code in any given case, or most given cases
|
| Funny coincidence, that's exactly what the vast majority
| of compiler code is for, with tens (or hundreds) of
| thousands of man-years of work spent on that.
| bonzini wrote:
| The top 3 entries use completely different algorithms. The use
| of assembly might help a bit, but not as much as the raw
| numbers suggest.
| andrewaylett wrote:
| Back when I was employed to work on a compiler, our approach to
| such things was to implement our run-time libraries in C and
| fix the compiler so it was able to produce the desired output.
|
| In this case, though, the biggest differences aren't coming
| from C vs assembler but from walking away from platform
| conventions -- ABI, memory layout, CPU compatibility. It might
| be easier to write assembly than to teach the compiler how to
| generate code that does this, but that's largely because you
| don't need these optimisations in most cases. If the methods
| used provided a general speedup, it would be worth implementing
| them in compilers.
| btilly wrote:
| I suspect that a compiler which produced code that will crash
| when you attempt to write to a file would be declared buggy and
| unusable.
|
| And yet, that is one of the optimizations that helps get this
| one the speed crown.
| pjc50 wrote:
| Producing this solution requires not merely a clever compiler
| but an extraordinary amount of creative intelligence.
| recentdarkness wrote:
| Interesting for me was the fact that the author talks about
| themselves in the third person. Very unusual for content
| like this :D
| lifthrasiir wrote:
| The author of this post (Mark Litwintschik) is not the same
| as the author of the assembly solution (Alex Smith, also
| known as ais523) :-)
| nickdothutton wrote:
| This chap is the developer. Kudos to him.
| https://www.wolframscience.com/prizes/tm23/alex_smith_bio.ht...
| Croftengea wrote:
| > The developer behind the Assembler version goes by the
| handle "ais523".
|
| > I've been unable to uncover this person's real-life
| identity
| lifthrasiir wrote:
| > The developer behind the Assembler version goes by the handle
| "ais523".
|
| Alex Smith [1] is also known for winning the Wolfram 2,3 Turing
| Machine Research Prize [2].
|
| [1] https://www.cs.bham.ac.uk/~ais523/
|
| [2] https://writings.stephenwolfram.com/2007/10/the-prize-is-won...
| jprupp wrote:
| Hold my beer. Now where is that OpenCL programming book when
| I need it?
| bXVsbGVy wrote:
| How did they test it? Is the data actually being delivered,
| and read, in order?
| w0mbat wrote:
| I played with FizzBuzz when this was mentioned here a month ago,
| focusing on the algorithm not optimizations.
|
| I stopped once I could do it in simple code with no math
| except addition, 15 numbers at a time.
|
|     for (int x = 0; x < 1000000; x += 15) {
|         printf("%d\n%d\nfizz\n%d\nbuzz\nfizz\n%d\n%d\n"
|                "fizz\nbuzz\n%d\nfizz\n%d\n%d\nfizzbuzz\n",
|                1 + x, 2 + x, 4 + x, 7 + x, 8 + x,
|                11 + x, 13 + x, 14 + x);
|     }
|
| That is simple enough and fast enough for me.
| WithinReason wrote:
| Did you check what the throughput of this is?
| richardwhiuk wrote:
| > Before running the code, make sure your CPU does support AVX2.
| ... You should see "sse2" at a minimum.
|
| AVX2 is not SSE2 - SSE2 is much, much older.
|
| You want to check for the avx2 flag itself.
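A minimal sketch of that check (this assumes Linux's /proc/cpuinfo format; the helper function name is my own, not from the article):

```python
# Sketch: look for the "avx2" flag itself in /proc/cpuinfo,
# rather than treating "sse2" as a proxy for AVX2 support.

def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears in any 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            if flag in flags.split():
                return True
    return False

# Usage (Linux only): has_cpu_flag(open("/proc/cpuinfo").read(), "avx2")
```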
| GnarfGnarf wrote:
| _581 lines of Assembler_
|
| Interesting, but not very practical.
| chii wrote:
| is practicality really relevant to a competition to write the
| fastest FizzBuzz implementation?
| hackater wrote:
| How do you learn about this? Does anyone have any good
| recommendations for books where you learn how to do this
| kind of very low-level optimization?
| Maursault wrote:
| No machine code implementation? I wonder how much faster than
| assembler it could be.
| pxeger1 wrote:
| Assembler has basically no abstractions on top of machine code;
| just a human-readable syntax. So there would be no difference.
| pxeger1 wrote:
| This looks like a pretty rubbish blogspam regurgitation of the
| original SE answer.
| atorodius wrote:
| Relevant: https://news.ycombinator.com/item?id=29031488
| dang wrote:
| Thanks! Macroexpanded:
|
| _55 GiB /s FizzBuzz_ -
| https://news.ycombinator.com/item?id=29031488 - Oct 2021 (255
| comments)
| throw10920 wrote:
| If I may ask, what's the default keybinding for
| _dang-macroexpand-1_?
| dang wrote:
| ctrl+alt+b
| gumby wrote:
| Pfft. No machine learning was used?
| trevyn wrote:
| I wonder how an M1 Max assembly version would fare.
| Croftengea wrote:
| ARM64 SIMD instructions perform in the same ballpark, see this:
| https://news.ycombinator.com/item?id=25408853
| SekstiNi wrote:
| The Firestorm cores are capable of issuing two 128bit stores
| per cycle [1], giving a bandwidth of 102.4 GBps. This matches
| the experiments by Andrei with mixed reads/writes [2], and I
| have personally verified that it is attainable for pure write
| workloads to L1 as well as L2.
|
| This does not mean 102.4 GBps is achievable on a single core,
| but if L2 bandwidth is the limiting factor it very well could
| be.
|
| [1] https://dougallj.github.io/applecpu/firestorm.html
|
| [2] https://www.anandtech.com/show/17024/apple-m1-max-performanc...
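A quick back-of-the-envelope check of the 102.4 GB/s figure (the ~3.2 GHz Firestorm performance-core clock here is my assumption, not stated in the comment):

```python
# Two 128-bit stores per cycle at ~3.2 GHz:
stores_per_cycle = 2
bytes_per_store = 128 // 8   # a 128-bit store moves 16 bytes
clock_hz = 3.2e9             # assumed performance-core clock
bandwidth_gb_s = stores_per_cycle * bytes_per_store * clock_hz / 1e9
print(bandwidth_gb_s)  # prints 102.4
```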
| faeyanpiraat wrote:
| Before running the code, make sure your CPU does support AVX2.
| Most 64-bit Intel and AMD CPUs should. ARM CPUs, like those
| found in newer Apple computers, Smartphones or Raspberry Pis,
| won't support AVX2.
| comonoid wrote:
| ARM CPUs have NEON. I cannot say whether NEON has all the
| equivalent instructions; however, the threadstarter is
| talking about a complete rewrite, not just running it as is.
| innocenat wrote:
| NEON only has 128-bit registers. AVX has 256-bit registers.
| comonoid wrote:
| Yep, it will most likely require reorganizing the
| program, but by itself that doesn't necessarily mean the
| result will be slower.
| Tepix wrote:
| That sentence seems silly. ARM assembly and x86-64 assembly
| are two different beasts. Running the x86-64 code using
| Rosetta 2 emulation seems counter-intuitive as well if you
| want to get the fastest possible result on that machine.
| buro9 wrote:
| The simplest change for performance even in Python, JavaScript,
| etc is to avoid the modulo function.
|
| All of the top contenders on
| https://codegolf.stackexchange.com/questions/215216/high-thr...
| do exactly this.
|
| Yes, there is a lot of optimization specific to languages,
| OS, CPU... but the takeaway for me isn't that you have to go
| to that extent to build an order-of-magnitude improvement.
| You can still achieve many orders of magnitude of
| improvement by understanding what is happening when you use
| something like the modulo operation, and if you have
| requirements where you can avoid using it, then that's the
| win.
|
| For FizzBuzz you are told it "iterates from 1 to <some
| number>"... and modulo would be perfect for "take a single
| number and produce Fizz, Buzz or FizzBuzz if it's divisible
| by 3, 5, or both 3 and 5"... but as we're iterating,
| counters are better suited to the task.
|
| I love seeing epic code golf (and the solution outlined in the
| article is it), but the takeaway for the majority of engineers is
| that significant improvements can be made with a readable and
| maintainable solution just by understanding the requirements and
| comp sci fundamentals.
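A minimal sketch of the counter idea described above (my own illustration, not code from the linked answers): the per-number modulo tests are replaced with two countdown counters.

```python
def fizzbuzz_counters(n):
    """FizzBuzz for 1..n using decrementing counters instead of modulo."""
    out = []
    fizz, buzz = 3, 5            # countdown until the next multiple
    for i in range(1, n + 1):
        fizz -= 1
        buzz -= 1
        if fizz == 0 and buzz == 0:
            out.append("FizzBuzz")
            fizz, buzz = 3, 5
        elif fizz == 0:
            out.append("Fizz")
            fizz = 3
        elif buzz == 0:
            out.append("Buzz")
            buzz = 5
        else:
            out.append(str(i))
    return out
```

The branch structure is the same as the textbook version; only the divisibility test changes from a division to a decrement and compare.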
| jiggawatts wrote:
| The _really_ heavyweight use of div/mod is to produce the
| base-10 printed decimal numbers, not in the main control loop.
|
| The record-beating versions all do something clever to optimise
| the printing of the numbers between the "Fizz" and "Buzz"
| strings.
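One way to see why that matters: since consecutive numbers are printed, the number can be kept as decimal text and updated in place, so no division is needed per line. A toy sketch of that principle (my own illustration, not the actual submissions' code):

```python
def increment_decimal(digits: bytearray) -> bytearray:
    """Add 1 to a number stored as ASCII digits, with no division."""
    i = len(digits) - 1
    while i >= 0:
        if digits[i] != ord('9'):
            digits[i] += 1           # common case: bump the last digit
            return digits
        digits[i] = ord('0')         # carry: 9 -> 0, move one place left
        i -= 1
    return bytearray(b'1') + digits  # all nines: number grows a digit
```

In the fast solutions the equivalent update is done with SIMD over a whole output batch at once; this only shows the no-division idea.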
| solmag wrote:
| I think the original author said that it was harder to make
| this than his master's thesis.
| parhamn wrote:
| When I saw this a few weeks ago it set off a lot of thinking
| about compilers and ML.
|
| Even the unoptimized C version is 10x slower than the one that is
| painstakingly optimized.
|
| Some of this performance gain will certainly be achievable
| by ML/AI-assisted compilers and translators. In that case,
| even a 2x speedup would be game-changing. Is anyone using ML
| in compilers yet? The part where you verify the code still
| does what you wanted it to do seems trickiest.
| m3at wrote:
| > Is anyone using ML in compilers yet?
|
| Yes :)
|
| Check out TVM for something mainstream: http://tvm.apache.org/
|
| And LibNC for something esoteric (but elegant):
| https://bellard.org/libnc/
| andi999 wrote:
| I don't understand the FizzBuzz thingy. I mean, isn't it
| just about who has the fastest print method? Correct me
| please, but this problem has period 15, so if you just
| write:
|
|     for (i = 1; i < n; i += 15) {
|         print i;
|         print i + 1;
|         print 'fizz';
|         print i + 3;
|         ...
|     }
|
| you have the fastest, don't you? (If the ending n is
| important then one can just write 15 such loops/endgames
| depending on the modulus.) Since there is no 'if', the
| branch predictor should be on your side.
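The line pattern (which slots hold fizz, buzz, or a number) does repeat with period 15, even though the printed numbers change. A quick sketch checking one unrolled block against the modulo definition (my own illustration):

```python
def classic(i):
    # the textbook modulo definition
    if i % 15 == 0: return "fizzbuzz"
    if i % 3 == 0:  return "fizz"
    if i % 5 == 0:  return "buzz"
    return str(i)

def unrolled_block(x):
    # one 15-line period starting just after a multiple of 15
    return [str(x + 1), str(x + 2), "fizz", str(x + 4), "buzz",
            "fizz", str(x + 7), str(x + 8), "fizz", "buzz",
            str(x + 11), "fizz", str(x + 13), str(x + 14), "fizzbuzz"]
```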
| zkldi wrote:
| Did you even read the OP?
| andi999 wrote:
| The OP is great. And basically it is about how to print
| faster. I was more commenting on the fast fizzbuzz challenge
| itself.
| Tepix wrote:
| You're printing the number too, so no, it does not have a
| period of 15.
| isodev wrote:
| Come on now, are we really using fizzbuzz as a performance
| benchmark? Seriously, don't base real-world decisions on this.
|
| PS. The Rust implementation is also not very optimal, for
| example, the stdout is not locked so... what are we even trying
| to show :-).
| tourist2d wrote:
| Can you really not tell the difference between a fun challenge
| and a real world scenario?
| sshb wrote:
| I wonder how FPGA and io_uring implementations would perform.
| Terretta wrote:
| YAWFB: Yet another wrong fizzbuzz.
|
| "fizz" % 3, "buzz" % 5, what is the % 15 for?
|
| You can blow an interviewee's mind by saying: great, now
| "bang" % 7... and (hopefully!) watch the light dawn.
|
| // why so serious? :-)
| LeifCarrotson wrote:
| The % 15 is because they're using 'elif' or 'else if'; they
| only process each entry once and they exit the function when a
| successful comparison is made.
|
| It's true that you could do this with something like:
|
|     result = ''
|     if x % 3 == 0:
|         result += 'fizz'
|     if x % 5 == 0:
|         result += 'buzz'
|     if result == '':
|         result += str(x)
|     return result
|
| to append fizz for 15 and also continue processing and
| append buzz, but this requires an extra comparison to check
| that you did neither of those before printing the number.
| Terretta wrote:
| We all know what the % 15 is for.
|
| But note that if you do 3 comparisons, or 2 comparisons and
| an "extra" comparison, you've done the same number of
| comparisons; it's not really 'extra'.
|
| then go ahead and keep adding factors and see which results
| in more comparisons as we rack up the actually extra
| comparisons for various combinations of factors
|
| btw and ftw, your modulo and string test example, more
| correct than the usual implementation, is among the most
| efficient:
|
| http://www.zoharbabin.com/which-fizzbuzz-solution-is-the-mos...
|
| // still not so serious...
| javier10e6 wrote:
| Asm 56 GB/s (you are reading Sanskrit)
|
| C 41 GB/s (you are reading US Tax Code)
|
| Rust 30 GB/s (you are reading English)
|
| Python (why bother) (you are speed reading)
|
| So many tools, so many languages, life is great.
| opentokix wrote:
| Imagine writing this asm implementation for the 22-year-old
| Google recruiter and failing the interview. :D
| DeathArrow wrote:
| The most interesting thing is the C version is almost as
| fast as the assembler version.
| bonzini wrote:
| The most interesting thing is that it took one day to write the
| C version and several months for the assembly version, because
| the C version uses a simpler algorithm. Talk about diminishing
| returns. :)
| pokepim wrote:
| And you could copy paste Python version from SO and have it
| running in less than a minute, so here's the winner :)
| bonzini wrote:
| If the Python version is 2000 times slower, the breakeven
| would be after a few terabytes of FizzBuzz.
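Rough arithmetic behind that estimate (the development-time figures are my assumptions: roughly a day to write the C version, per the comment above, versus a minute to paste the Python one):

```python
c_dev_s = 86400.0        # assumed: ~1 day to write the C version
py_dev_s = 60.0          # assumed: ~1 minute to paste the Python version
c_rate = 41.0            # GB/s, figure from the article
py_rate = c_rate / 2000  # hypothetical 2000x slowdown

# Total time to produce N GB is dev_time + N / rate; set the two equal:
breakeven_gb = (c_dev_s - py_dev_s) / (1 / py_rate - 1 / c_rate)
# comes out around 1.8 TB, i.e. "a few terabytes"
```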
| Qem wrote:
| The Python version posted is about 85 MiB/sec, so only
| 600 times slower than the fastest one.
| bonzini wrote:
| Ah, forgot about pypy!
| sillysaurusx wrote:
| > As of this writing, the 3rd fastest submission is written
| in Rust and produces output at a rate of 3 GB/s, the 2nd is
| written in C and produces at a rate of 41 GB/s, and the
| fastest is written in Assembler and produces at a rate of
| 56 GB/s.
|
| This makes me happy. The order is exactly as it should be, much
| to the disappointment of the Rustaceans.
|
| It's a nice reminder that for the ultimate performance, there is
| no substitute for knowing exactly what the assembler is doing.
|
| It's also a nice reminder that most of us will never want to do
| that.
|
| EDIT: Whoa. It's late, and I thought Rust was at 30 GB/s instead
| of 3, i.e. roughly neck and neck with C. I meant to make a
| good-natured joke. There must be something wrong with the Rust
| implementation to be 10x slower; I wonder what it is.
|
| As I mentioned further down in the comments, I'll donate $500 to
| Watsi if someone manages to displace the C implementation in Rust
| or any other language besides C/C++/ASM. Be sure to DM me on
| Twitter (https://twitter.com/theshawwn) if you do, both so I'll
| see it and so that I can congratulate you. I'm sure it's
| possible, but it's sort of interesting that the activation energy
| is so high.
| lawn wrote:
| So you think that it's expected that C is 10 times faster than
| Rust?
| sillysaurusx wrote:
| Wow. Perhaps I should be embarrassed, but I mentally filled
| in "30 GB/s" instead of 3 GB/s. I thought Rust was mostly
| neck and neck with C. I see now why a lot of fellows
| responded in a defensive way.
|
| What's it doing to be so much slower? That's shocking. I may
| not _enjoy_ Rust, but I do respect its ability to approach
| C++ performance.
| IshKebab wrote:
| > That's shocking.
|
| Not really. This is a hyper optimised microbenchmark. The
| actual work being done is so trivial that any "overhead" is
| really significant and you can get order of magnitude
| speedups by doing arcane things with caches, eliminating
| tiny inefficiencies and so on.
| xxpor wrote:
| Complete speculation, but I think you'd find that's true
| for most problems. Very little CPU time is spent on the
| business logic, and the rest is overhead.
| misja111 wrote:
| As far as I understand, that #2 C program contains hardly
| any C code but mostly inlined assembly.
| bonzini wrote:
| The program itself is 100% C, but it spends 99.9999% of
| the time in assembly code that is generated at runtime.
| friedman23 wrote:
| I'm curious, how did rust affect you so badly that you hate
| it so much? If it's just silly people suggesting it
| everywhere I think you have some personal issues you need
| to resolve, language choice isn't something to get
| emotional about.
| [deleted]
| ask_b123 wrote:
| I've not seen any indication that sillysaurusx _hates_
| Rust.
| [deleted]
| Arnavion wrote:
| At the very least, the itoap crate being used there does
| divisions by and comparisons to powers of 10 as part of
| stringifying integers, which the C and ASM versions don't
| seem to be doing. A crate with a function for stringifying
| arbitrary integers cannot take advantage of the fact that
| it's going to be used for stringifying consecutive
| integers, as an inlined solution can.
| GrayShade wrote:
| You do seem to have an axe to grind with the people using Rust.
|
| I don't see any reason why the same thing [1] could not be
| implemented in Rust.
|
| EDIT: I got caught by an off-by-one there, the second fastest
| solution (doing JIT compilation) is actually [2].
|
| [1]: https://codegolf.stackexchange.com/a/215231
|
| [2]: https://codegolf.stackexchange.com/a/236737
| sillysaurusx wrote:
| Much like speedruns, it's up to you to prove it :)
|
| I just enjoy dosing the Rusties with reality every so often.
|
| If you write a Rust version that can beat the C version, I'll
| donate $500 to Watsi. Good luck!
| gerbilly wrote:
| Ultimately, your argument is a troll because you are trying
| to tell people what they should value in a programming
| language.
| bawolff wrote:
| Pretty sure the fact that it's literally a speed contest is
| telling people what to value in a programming language.
| KingOfCoders wrote:
| "I just enjoy dosing the Rusties with reality every so
| often."
|
| Aka troll.
| AnIdiotOnTheNet wrote:
| Not necessarily. Language evangelists can be incredibly
| annoying... actually, evangelists in general can be
| really annoying. It is cathartic to prove annoying people
| wrong every once in a while.
| Ygg2 wrote:
| Otoh, being a dick to the majority of people because of the
| acts of a minority is worse.
| GrayShade wrote:
| I don't know. I use Rust myself, but I've seen more
| people complaining about Rust evangelism than people
| actually doing that.
|
| The RESF is a bad meme.
| colejohnson66 wrote:
| When there's a memory-related bug (or CVE) in a program
| written in C or C++, there's almost always some comment
| about how the Rust compiler would've caught it.
| KingOfCoders wrote:
| From Rust developers or from app users? As an app user I
| would prefer if apps were written in Rust, and I'm reminded
| of that with every CVE for a C program.
| tialaramex wrote:
| On HN there is often: a comment that Rust would catch
| this (sometimes sarcastic), an explanation about how the
| bug isn't "really" memory safety and so Rust wouldn't
| catch it, followed by a comment explaining that actually
| Rust does catch it.
|
| The recent Linux mutex security bug is an example. That's
| not a memory safety bug. Surely Rust doesn't help? But
| Rust's mutual exclusion design relies on the Rust type
| system to give us safer locks, in Rust you protect things
| by having "lock the mutex" be the way you get the
| reference to the protected object. Thus when the faulty
| code locks the wrong mutex in Rust it would get the wrong
| object, and either _not compile_ (you lock variable A and
| then you're trying to change B but you don't even have
| B) or _not work_ (you always lock variable A, then change
| variable A, but your code should change B, it does not
| work, trivially reproducible bug in your code)
| sillysaurusx wrote:
| Oh? Why is it trolling to say that Rust isn't as fast as
| C in this case?
|
| If it were, you'd claim the prize. If you really want to
| hurt a troll, wouldn't you try to separate them from
| their $500?
| KingOfCoders wrote:
| You still assume I care. As stated above I don't care.
| It's in your head only that every Rust developer writes
| Rust because she thinks it's faster than C.
| detaro wrote:
| > _to say that Rust isn't as fast as C in this case_
|
| That's not what the quoted statement is doing.
| sillysaurusx wrote:
| I think we'll have to agree to disagree. It's
| illustrative that a seemingly simple C program can't be
| easily ported to Rust. If it could be, someone would have
| already done it -- it's been a few hours.
|
| It's also illustrative to examine exactly what the
| hangups are. You might even get more Rust converts if you
| acknowledge that there are serious engineering burdens to
| overcome when deciding to use Rust.
| [deleted]
| samhw wrote:
| For the record, it's not the anti-Rust sentiment which
| people find annoying, it's the treatment of programming
| languages as though they were football teams.
|
| If you want to make a case against Rust, then let's get
| concrete: _why_ is one Turing-complete compiled language
| that can make arbitrary syscalls not capable of compiling
| down to the same assembly as another?
|
| _That_ would be an interesting discussion; this is not.
| Retric wrote:
| You can dig into it, but the details are mostly only
| relevant to people designing computer languages and
| tooling.
|
| Aka Compilers, as long as the machine code is equivalent
| to the source code their output can be wildly different.
| However, someone needs to actually write them which means
| both linguistic differences and the sheer effort that was
| actually put into the compiler are both significant. You
| shouldn't assume that every language is operating on an
| even playing field, because they're simply not. The obvious
| solution of compiling to a second language is never quite
| as capable as using the language directly.
| samhw wrote:
| > Compilers, as long as the machine code is equivalent to
| the source code their output can be wildly different
|
| I don't understand what this means: the output of a
| compiler _is_ the machine code from the source code, no?
|
| Also, Rust uses LLVM as a backend, which has benefited
| from lots of work given its use for C, C++, ObjC, Swift,
| &c. Compiling the C with Clang/LLVM might give a more
| interesting insight into the differences between Rust and
| C.
|
| > The obvious solution of compiling to a second language
| is never quite as capable as using the language directly.
|
| I'm not sure what you mean by this, or rather how it's
| related to what we're talking about?
| Retric wrote:
| > The output is the machine code.
|
| You get different machine code using different compilers
| on identical pieces of source code. In theory, running
| that source code should produce equivalent results, even
| if one compiler is more efficient than another. If, for
| example, one compiler uses an ASM instruction that
| another never uses, then nothing the programmer does can
| get the second compiler to be as efficient as the first.
| The same is true of various optimizations etc, sometimes
| a compiler is just better.
|
| Bring a different language into the picture and things
| get even more divergent.
|
| > how it's related
|
| It's specifically in reference to rust use of LLVM not
| being sufficient to guarantee equivalence. Many languages
| target the JVM, that does not make them equivalent.
| samhw wrote:
| > You get different machine code using different
| compilers on identical pieces of source code.
|
| Ah, right - well I'm certainly not denying that. (I'm
| familiar with the topic, to be clear: I've written
| compilers. I just wasn't clear what exactly you were
| saying.)
|
| > It's specifically in reference to rust use of LLVM not
| being sufficient to guarantee equivalence.
|
| Formally, no. In practice, given that fact, given the
| experience of the maintainers of both languages, and
| given the 'compilability' of both languages, it would be
| surprising if there were a 10x gulf between them.
| taneq wrote:
| That kind of behaviour (edit: theirs, not yours) is
| usually ultimately motivated by fear. ;)
| trevyn wrote:
| No no, C is great for writing fast FizzBuzz
| implementations; any larger program is hard to get correct.
| :-)
| KingOfCoders wrote:
| "disappointment of the Rustaceans."
|
| I'm using Rust and I'm not disappointed. Why should I be?
|
| But I do appreciate how you create imaginary people in your
| head, troll them, and mentally pat yourself on the back for
| doing so.
| isoprophlex wrote:
| All the rust users rising to the bait in this thread sure
| aren't imaginary
| [deleted]
| KingOfCoders wrote:
| Which comment comes to mind from a Rust developer who
| thinks Rust is faster than C and needs a dose of reality?
| colejohnson66 wrote:
| Rust is a memory safe _systems_ programming language. As
| a systems programming language, it can be used in the
| same areas that C can (see Redox OS). I've seen quite a
| few Rust users online that have claimed that, because of
| that, it can be just as fast as C.
| tialaramex wrote:
| But it can be _just as fast_ as C. The C program that's
| so much faster here isn't "really" a C program; it's just
| a fancy way to shovel machine code, which you could of
| course also do (unsafely) in Rust if you chose to do so
| but why?
|
| However, what you were asked about is claims it's _faster_,
| and there micro-benchmarks like this are not revealing.
| At scale you're obliged to limit yourself to what you
| can write _that works_, and in Rust the safety lets you
| reach higher - which is how it achieves better
| performance. In fact that's what Mozilla used Rust for.
| They'd failed more than once to ship a concurrent CSS
| speed-up, finding it too hard to successfully get the
| bugs out and so returning to a single threaded CSS
| implementation - but the attempt in Rust (Quantum) worked
| just fine.
|
| This is the same reason C++ is faster, a C++ iterator
| isn't _faster_ than the naive C for loop, in fact they
| emit identical code, but the iterator is easier to think
| about, which means you can confidently write more
| complicated software using this abstraction. Rust just
| provides the safety to go higher still.
| KingOfCoders wrote:
| What is the opinion of Golang users on this?
| politician wrote:
| This is really cool. It should be obvious that a few
| thousand hand-rolled assembly opcodes targeting a fixed
| architecture would win the race. If the size of the
| solution grew a couple of orders of magnitude larger
| though, I'd expect that the micro optimization challenge
| would overwhelm the author and compilers would be more
| competitive.
|
| Is the Table of Contents of your upcoming book public?
| KingOfCoders wrote:
| No not yet, incorporating first draft feedback.
|
| You made my evening for asking <3
| [deleted]
| tester34 wrote:
| >This makes me happy. The order is exactly as it should be,
| much to the disappointment of the Rustaceans.
|
| Why does it make you happy? It's a reminder of how much
| performance and energy we're losing when we write in
| safe/expressive/easy-to-use languages like Java, C#, Python,
| yada yada.
| Arnavion wrote:
| >It's a nice reminder that for the ultimate performance, there
| is no substitute for knowing exactly what the assembler is
| doing.
|
| I didn't realize viewing the assembly of a compiled binary was
| something reserved for C. Someone should tell rust.godbolt.org
| that it's not supposed to exist.
| dahfizz wrote:
| Go read the second-place C entry [1]. It's a JIT for
| assembly code.
|
| [1] https://codegolf.stackexchange.com/questions/215216/high-thr...
| samhw wrote:
| I think their point is that C is sufficiently close to
| assembler that you know exactly what assembly any given line
| of C will produce. This is an old canard which is not really
| true (https://zenhack.net/2015/08/02/c-is-not-portable-assembly.ht...),
| but it's common among people who derive a
| sense of identity from writing C.
| [deleted]
| bartvk wrote:
| It's not just knowing exactly what the assembler is doing, but
| also what the kernel is doing (but maybe you also meant this).
| "This implementation only supports Linux", and then continues
| about the use of a bunch of system calls:
|
| * "asks the Kernel to defragment the physical memory it's using
| via the madvise system call"
|
| * "use 2 MB of RAM each even though neither need 2 MB. This
| keeps the page table orderly"
|
| * "Most competing FizzBuzz implementations call the write system
| call which can lead to data needing to be copied between user
| and Kernel space. So vmsplice is used instead."
|
| And then the most fascinating bit: "The CPU's L2 cache is split
| into two halves, one for calculating the output of FizzBuzz,
| the other storing the result of the last calculation batch.
| This means there is no need to use any slower L3 memory
| operations."
|
| "Slower L3 memory operations", I've not read that sentence
| before :D
| MayeulC wrote:
| > the most fascinating bit
|
| It is really fascinating, but I disagree a bit with your pick
| :)
|
| To me, the most fascinating part is using a bytecode
| interpreter. It makes perfect sense: it's basically
| optimizing the code size so it fits in the cache. It's also a
| bit like defining a whole new instruction set dedicated to
| and optimized for FizzBuzz!
|
| I think the approach was a bit similar to a VLIW
| (https://en.wikipedia.org/wiki/Very_long_instruction_word),
| where a single 32-bit word is loaded and interpreted at a
| time.
|
| Most of the state of the Fizzbuzz program is encoded in this
| bytecode.
|
| > "Slower L3 memory operations", I've not read that sentence
| before
|
| You hear it all the time when trying to optimize for speed in
| a hot loop. Keep your data structures in a way that fits in
| the cache. Align your data with cache boundaries. Code needs
| to be small too, hence the bytecode approach. This is also
| the reason why you can't unroll loops too much.
|
| If your implementation does not need to fetch (or write) data
| from(/to) higher levels of the memory hierarchy, you'll see
| tremendous gains. There's even this bit in the original
| source: " _while waiting for writes to go through to the L2
| cache_ " :)
|
| There are a few more very interesting hacks, such as the
| "high-decimal" format that skips the " _need to do binary
| /decimal conversions._"
| jedimastert wrote:
| > Rustaceans
|
| Completely off topic, but suddenly all of the crab branding
| makes a _ton_ more sense
| bonzini wrote:
| The second one (written by me) is actually JIT-compiled
| assembly. So little of the runtime is spent in the C code that
| -O0 vs -O3 makes no difference at all; and even if you wrote
| the glue code in Python it would not be much slower than the C
| code.
|
| Also the third entry is not the one in Rust. It's just that the
| Rust entry came later and thus was not included in the plot.
| The third entry is
| https://codegolf.stackexchange.com/a/215236/56694; it is also
| C.
|
| > displace the C implementation in Rust or any other language
| besides C/C++/ASM
|
| This is not a language contest, it's an algorithmic contest:
| that is, it's the best algorithm that wins, not the best
| compiler. The top entry, converted into C or Rust by mostly
| menial work, would probably be only a few % slower and would
| displace the C implementation.
| sillysaurusx wrote:
| I would _love_ to see a Python implementation that approaches
| the C code. It would be educational on so many levels. Would
| anyone be interested in trying?
|
| Also, why is the ASM version so much faster if the glue code
| doesn't make much difference? I suppose it matters at the
| margins...
| bonzini wrote:
| > why is the ASM version so much faster if the glue code
| doesn't make much difference
|
| The remark on the glue code only applies to my entry (the
| second). The assembly entry uses a completely different
| (and much more complicated) algorithm than the C entry.
| hectormalot wrote:
| If you read the description you'll see that it's optimizing
| around certain syscalls like memcopy (which is otherwise
| limiting and reduces speed by 5x). I think it's less
| language choice, and more low level optimization that is
| making a difference here.
|
| (To check, I wrote a Go program that does nothing else than
| fmt.Print($largebody) in a for loop, and it tops out at
| around 10GB/s.)
| tux3 wrote:
| Note that memcpy is not a syscall. Common syscalls do
| copies. The fast implementation uses the vmsplice syscall
| as a clever way to pump a pipe with data, because it
| doesn't copy.
| tialaramex wrote:
| Indeed. Today for most systems memcpy is in fact a
| compiler intrinsic. C's standard library offers a memcpy
| but the reality is that your C standard library was
| likely supplied by your C compiler vendor, and when you
| ask for memcmp() in C you get their preferred intrinsic
| inlined into your program unless you've been very firm in
| declining that optimisation.
|
| In Rust this memcpy intrinsic is required (for the core
| language, not just the standard library) because Rust
| clearly has arbitrarily large data structures you can
| literally copy, so, somebody has to take responsibility
| for doing that work, Rust says it's the compiler back-
| end's job. On a toy CPU the memcpy might really just be a
| trivial read->register->write loop because that's
| adequate.
| tialaramex wrote:
| Oops, I notice hours too late to edit it, that this says
| memcmp() in one place where it mostly says memcpy(). In
| fact your C compiler likely has intrinsics for both, but
| I did mean memcpy() everywhere in the comment.
| Qem wrote:
| The Python implementation posted among the answers is about
| 85 MiB/sec, under PyPy.
| animal531 wrote:
| So what we're seeing is that a specialized handcoded
| assembler version is 10x faster than a generic C version?
|
| If so that makes me a bit sad, because in the modern day we'd
| want to assume that compilers have gotten a little better at
| optimization.
| jiggawatts wrote:
| Read the assembly.
|
| Seriously. It's eye-popping. It's eldritch madness. It's
| black wizardry. The blackest of the black arts.
|
| The "fastest FizzBuzz" implementation isn't merely "fast".
| It's _faster than memcpy()!!!_
|
| Expecting compilers to outperform this tour-de-force is
| asking a tad too much from compiler writers...
| noduerme wrote:
| It's actually really nice the way the author commented
| it. It's a cool insight into their mental process. I've
| never coded in Assembly and I can't normally follow it.
| But reading these comments I can understand how the
| author is surgically managing registers and operating off
| a complete mental map of what's there, the same way a
| coder would with, say, a very large database they have a
| complete mental model of.
|
| All I mean is that I appreciate the author making their
| thought process accessible. It certainly looks like
| virtuosity, but I'm not competent enough to judge.
| btilly wrote:
| From the article, _" The application produced 64 bytes of
| FizzBuzz for every 4 CPU clock cycles. The author states
| the ultimate bottleneck of performance is based on the
| throughput of the CPU's L2 cache."_
|
| For comparison, that time is on par with integer
| multiplication.
| bonzini wrote:
| > assume that compilers have gotten a little better at
| optimization
|
| Again: you're comparing the performance of different
| algorithms. A compiler won't turn bubblesort into
| quicksort.
| tialaramex wrote:
| I mean, this is importantly true as a generalisation,
| but, it's worth knowing modern compilers know about
| _idioms_.
|
| If you show a modern compiler the typical idiomatic blob
| of code for counting the set bits in a word in a language
| like C, the compiler goes "Oh, _popcount_ I know this "
| and when targeting modern CPUs it emits a single
| instruction, the popcount instruction for that CPU.
|
| In APL idiom recognition was essential to making even a
| halfway decent compiler. It's easy, and idiomatic, to
| write APL which would be tremendously wasteful if
| implemented naively, while recognising common idioms
| could remove lots of that waste without needing the level
| of sophistication modern compilers have during
| optimisation.
|
| Because the compiler knows about idioms you should _not_
| try to optimise your programs by writing non-idiomatic
| code _until_ you 've determined you have a performance
| problem and the non-idiomatic code is identified as a way
| to fix it. Otherwise it may be that the compiler improves
| and now your code is not only harder to maintain it also
| has worse performance than the idiomatic solution.
| rowanG077 wrote:
| Not yet. But I have no doubt in my mind that once we have
| smarter-than-human AI, there will be compilers that
| absolutely can turn bubblesort into quicksort.
| marginalia_nu wrote:
| Seems fairly unlikely. The best sorting algorithm is a
| function of the data. The compiler can't know this. There
| are pathological data sets where quicksort is O(n^2). For
| small n, quick sort is also a bad choice.
|
| The programmer can understand these domain
| characteristics but the compiler certainly can't. If code
| isn't running qsort(), this could be a mistake, but may
| also be an optimization.
| gpderetta wrote:
| PGO is a thing for this reason.
|
| (But no, I'm not expecting AI driven compilers anytime
| soon)
| rowanG077 wrote:
| You almost always know something about the data that a
| sorting function will operate on. Unless of course you
| are implementing a generic sorting algorithm. In practice
| you could train your compiler on real world data to allow
| it to optimize better for your environment. This is
| already happening with profile guided optimizations.
| nojs wrote:
| The assembly is ridiculous. Random section:
|
|     // The bytecode generation loop is unrolled to depth 30, allowing the
|     // units digits to be hardcoded. The tens digit is stored in %al, and
|     // incremented every ten lines of output. The parity of the hundreds
|     // digit is stored in %ymm2: one half predicts the hundreds digit to
|     // be even, the other to be odd, and the halves are swapped every time
|     // the tens digit carries (ensuring the predictions are correct).
| entropicdrifter wrote:
| Controlled CPU predictions, wow. That's a level of
| performance I've never even contemplated
| bonzini wrote:
| It's not CPU predictions, it means that it computes in
| parallel one result that is correct and one that is
| wrong, and picks the right one by swapping the two halves
| every now and then.
| stavros wrote:
| Now I'm kind of hoping that someone will rewrite the glue in
| Python so that Python is the second fastest implementation of
| them all, just for the laughs.
| [deleted]
| gopiandcode wrote:
| You're totally right. I think a lot of rust developers are
| going to be absolutely distraught that rust isn't the fastest
| language for writing fizzbuzz programs.
|
| I guess they're just going to have to settle with rust being
| the best language to write safety-critical high performance
| concurrent low-level code.
| yodelshady wrote:
| FYI, on my machine a naive Rust implementation (modulo plus
| println!) runs at the same speed as a naive _Python_
| implementation (16-17 MB/s), even compiled with --release.
| Naive C manages 160-170 MB/s. (And I fully admit the naive
| implementations are pretty much as far as I'm going with this.)
|
| I would guess it's the println! macro eating the time.
| pcwalton wrote:
| It's probably stdout locking. Call Stdout::lock.
| yodelshady wrote:
| edit: weirder, naive Julia (1.7 on Linux) is _terrible_:
| 45 KB/s.
|
| Ruby is slightly slower than Python or Rust, 11 MB/s. I was
| expecting Python to be slowest among the interpreted
| languages, apparently not. Pypy yields a speedup to 70 MB/s.
|
| Summing up the naive implementations (for each i: start with
| an empty string, append fizz/buzz based on the modulus, then
| conditionally append i and print):
|
|     Julia      45 KB/s
|     Ruby       11 MB/s
|     CPython    15 MB/s
|     Rust       16 MB/s
|     PyPy       70 MB/s
|     C         160 MB/s
| Qem wrote:
| I think recent versions of Ruby include an optional JIT
| compiler. Did you try it?
| sc11 wrote:
| Do you mind sharing the Julia code? Seems odd that it's
| that far away from even Python.
| adgjlsfhk1 wrote:
| by default Julia uses locking for writes to guarantee
| thread safety when multiple writes are happening. That
| said, it's totally possible to go way faster using
| buffered IO.
| beltsazar wrote:
| There are two reasons: 1) the println! macro locks and
| unlocks stdout every time it's called; 2) Rust's stdout is
| line-buffered by default.
|
| On my machine the naive C implementation achieves 81.7 MiB/s,
| whereas a naive Rust implementation with println! achieves
| 7.58 MiB/s.
|
| When I acquired stdout's lock once and used writeln!(&mut
| stdout, ...) instead of println!, I got 8.15 MiB/s.
|
| When I wrapped stdout with BufWriter, I got 299 MiB/s.
|     use std::io::{BufWriter, Write};
|
|     fn main() {
|         let stdout = std::io::stdout();
|         let stdout = stdout.lock();
|         let mut stdout = BufWriter::with_capacity(32 * 1024, stdout);
|         for i in 1..1_000_000_000 {
|             let _ = match (i % 3 == 0, i % 5 == 0) {
|                 (true, true) => writeln!(&mut stdout, "FizzBuzz"),
|                 (true, false) => writeln!(&mut stdout, "Fizz"),
|                 (false, true) => writeln!(&mut stdout, "Buzz"),
|                 (false, false) => writeln!(&mut stdout, "{}", i),
|             };
|         }
|     }
| pcwalton wrote:
| You can basically take any LLVM IR that Clang would produce and
| produce Rust that compiles down to it (modulo irreducible
| control flow, I think, but the C++ doesn't have that). Because
| of that, C++ vs. Rust _language_ (as opposed to library)
| performance comparisons are uninteresting.
|
| > It's a nice reminder that for the ultimate performance, there
| is no substitute for knowing exactly what the assembler is
| doing.
|
| C and C++ do not let you know exactly what the assembler is
| doing. I work on compilers for a living and despite that--
| perhaps _because_ of that--I have no confidence that I can
| predict what instructions the compiler will generate ahead of
| time. If I need to know the actual instructions, I run my code
| through Godbolt and check. C, C++, and Rust are all millions of
| lines of LLVM optimization passes away from the metal.
|
| Based on some other comments, I suspect that all the time is
| being spent in locking stdout, which you can amortize by using
| Stdout::lock [1]. As is so often the case when microbenchmarks
| are discussed online, this has nothing to do with the language.
|
| Also, it shouldn't need to be said, but FizzBuzz is so totally
| unrepresentative of real workloads that you shouldn't draw any
| conclusions about language performance from it.
|
| [1]: https://doc.rust-
| lang.org/stable/std/io/struct.Stdout.html#m...
| parhamn wrote:
| > LLVM IR that Clang would produce
|
| You most certainly can't unless your code is unsafe and in
| that case the comparison isn't so interesting anymore.
| Aissen wrote:
| It is very interesting if you have properly verified
| invariants and you can guarantee (with manual checking)
| that your code is correct. It's an order of magnitude
| harder than just letting the borrow checker do its job, but
| it's entirely possible (just like in any other language).
___________________________________________________________________
(page generated 2021-12-02 23:04 UTC)